* dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
       [not found] ` <54EAD453.6040907@suse.de>
@ 2015-02-23 13:50   ` Mike Snitzer
  2015-02-23 17:18     ` [Lsf] " Mike Christie
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2015-02-23 13:50 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: lsf, Jeff Moyer, dm-devel

On Mon, Feb 23 2015 at  2:18am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 02/20/2015 02:29 AM, James Bottomley wrote:
> > In the absence of any strong requests, the Programme Committee has taken
> > a first stab at an agenda here:
> > 
> > https://docs.google.com/spreadsheet/pub?key=0ArurRVMVCSnkdEl4a0NrNTgtU2JrWDNtWGRDOWRhZnc
> > 
> > If there's anything you think should be discussed (or shouldn't be
> > discussed) speak now ...
> > 
> Recently we've found a rather worrisome queueing degradation with
> multipathing, which pointed to a deficiency in the scheduler itself:
> SAP found that a device with 4 paths had less I/O throughput than a
> system with 2 paths. When they reduced the queue depth on the 4-path
> system they managed to increase the throughput somewhat, but it was
> still less than they had with two paths.

The block layer doesn't have any understanding of how many paths are
behind the top-level dm-mpath request_queue that is supposed to be doing
the merging.

So from a pure design level it is surprising that 2 vs 4 is impacting
the merging at all.  I think Jeff Moyer (cc'd) has dealt with SAP
performance recently too.

> As it turns out, with 4 paths the system rarely did any I/O merging,
> but rather fired off the 4k requests as fast as possible.
> With two paths it was able to do some merging, leading to improved
> performance.
> 
> I was under the impression that the merging algorithm in the block
> layer would only unplug the queue once the request had been fully
> formed, ie after merging has happened. But apparently that is not
> the case here.

Just because you aren't seeing merging, are you sure it has anything to
do with unplugging?  It would be nice to know more about the workload.

Mike


* Re: [Lsf] dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-23 13:50   ` dm-mpath request merging concerns [was: Re: It's time to put together the schedule] Mike Snitzer
@ 2015-02-23 17:18     ` Mike Christie
  2015-02-23 18:34       ` Benjamin Marzinski
                         ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Mike Christie @ 2015-02-23 17:18 UTC (permalink / raw)
  To: Mike Snitzer, Hannes Reinecke; +Cc: lsf, Jeff Moyer, dm-devel

On 2/23/15, 7:50 AM, Mike Snitzer wrote:
> On Mon, Feb 23 2015 at  2:18am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
>
>> On 02/20/2015 02:29 AM, James Bottomley wrote:
>>> In the absence of any strong requests, the Programme Committee has taken
>>> a first stab at an agenda here:
>>>
>>> https://docs.google.com/spreadsheet/pub?key=0ArurRVMVCSnkdEl4a0NrNTgtU2JrWDNtWGRDOWRhZnc
>>>
>>> If there's anything you think should be discussed (or shouldn't be
>>> discussed) speak now ...
>>>
>> Recently we've found a rather worrysome queueing degradation with
>> multipathing, which pointed to a deficiency in the scheduler itself:
>> SAP found that a device with 4 paths had less I/O throughput than a
>> system with 2 paths. When they've reduced the queue depth on the 4
>> path system they managed to increase the throughput somewhat, but
>> still less than they've had with two paths.
>
> The block layer doesn't have any understanding of how many paths are
> behind the top-level dm-mpath request_queue that is supposed to be doing
> the merging.
>
> So from a pure design level it is surprising that 2 vs 4 is impacting
> the merging at all.  I think Jeff Moyer (cc'd) has dealt with SAP
> performance recently too.
>
>> As it turns out, with 4 paths the system rarely did any I/O merging,
>> but rather fired off the 4k requests as fast as possible.
>> With two paths it was able to do some merging, leading to improved
>> performance.
>>
>> I was under the impression that the merging algorithm in the block
>> layer would only unplug the queue once the request had been fully
>> formed, ie after merging has happened. But apparently that is not
>> the case here.
>
> Just because you aren't seeing merging are you sure it has anything to
> do with unpluging?  Would be nice to know more about the workload.
>

I think I remember this problem. In the original request-based design we 
hit this issue and Kiyoshi or Jun'ichi did some changes for it.

I think it was related to the busy/dm_lld_busy code in dm.c and 
dm-mpath.c. The problem was that we do the merging in the dm-level 
queue. The underlying paths do not merge bios; they just take the 
request sent to them. The change that was done to promote merging (I do 
not think we ever completely fixed the issue) relied on the workload 
normally being heavy enough, or the paths slow enough, that the busy 
check would return true often enough to keep the dm layer's queue from 
dispatching requests so quickly. Requests then had time to sit in the dm 
queue and merge with other bios.

If the device/transport is fast or the workload is low, multipath_busy 
never returns busy and we can hit Hannes's issue. With 4 paths, we just 
might not be able to fill up the paths and hit the busy check. With only 
2 paths, we might be slow enough, or the workload heavy enough, to keep 
the paths busy, so we hit the busy check and do more merging.
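
Roughly, the gate looked like this (a simplified sketch in the shape of
dm_request_fn() of that era, with the target passed in for brevity;
illustrative only, not the actual source):

static void dm_request_fn_sketch(struct request_queue *q, struct dm_target *ti)
{
    struct request *rq;

    while (!blk_queue_stopped(q)) {
        rq = blk_peek_request(q);    /* look at the head, do not dequeue yet */
        if (!rq)
            return;

        /*
         * The busy check: if the target reports the underlying paths as
         * congested, leave rq on the dm queue (where later bios can
         * still merge into it) and re-run the queue a bit later.
         */
        if (ti->type->busy && ti->type->busy(ti)) {
            blk_delay_queue(q, 10);    /* re-run in ~10ms */
            return;
        }

        blk_start_request(rq);    /* dequeue; it will now be mapped to a path */
    }
}

When the paths never look busy, that loop just drains the dm queue as fast
as bios arrive, which is what Hannes and NetApp are seeing.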


* Re: [Lsf] dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-23 17:18     ` [Lsf] " Mike Christie
@ 2015-02-23 18:34       ` Benjamin Marzinski
  2015-02-23 19:56         ` Mike Snitzer
  2015-02-23 19:35       ` [Lsf] " Merla, ShivaKrishna
  2015-02-23 19:50       ` Mike Snitzer
  2 siblings, 1 reply; 18+ messages in thread
From: Benjamin Marzinski @ 2015-02-23 18:34 UTC (permalink / raw)
  To: device-mapper development; +Cc: lsf, Jeff Moyer, Mike Snitzer

On Mon, Feb 23, 2015 at 11:18:36AM -0600, Mike Christie wrote:
> On 2/23/15, 7:50 AM, Mike Snitzer wrote:
> >On Mon, Feb 23 2015 at  2:18am -0500,
> >Hannes Reinecke <hare@suse.de> wrote:
> >
> >>On 02/20/2015 02:29 AM, James Bottomley wrote:
> >>>In the absence of any strong requests, the Programme Committee has taken
> >>>a first stab at an agenda here:
> >>>
> >>>https://docs.google.com/spreadsheet/pub?key=0ArurRVMVCSnkdEl4a0NrNTgtU2JrWDNtWGRDOWRhZnc
> >>>
> >>>If there's anything you think should be discussed (or shouldn't be
> >>>discussed) speak now ...
> >>>
> >>Recently we've found a rather worrysome queueing degradation with
> >>multipathing, which pointed to a deficiency in the scheduler itself:
> >>SAP found that a device with 4 paths had less I/O throughput than a
> >>system with 2 paths. When they've reduced the queue depth on the 4
> >>path system they managed to increase the throughput somewhat, but
> >>still less than they've had with two paths.
> >
> >The block layer doesn't have any understanding of how many paths are
> >behind the top-level dm-mpath request_queue that is supposed to be doing
> >the merging.
> >
> >So from a pure design level it is surprising that 2 vs 4 is impacting
> >the merging at all.  I think Jeff Moyer (cc'd) has dealt with SAP
> >performance recently too.
> >
> >>As it turns out, with 4 paths the system rarely did any I/O merging,
> >>but rather fired off the 4k requests as fast as possible.
> >>With two paths it was able to do some merging, leading to improved
> >>performance.
> >>
> >>I was under the impression that the merging algorithm in the block
> >>layer would only unplug the queue once the request had been fully
> >>formed, ie after merging has happened. But apparently that is not
> >>the case here.
> >
> >Just because you aren't seeing merging are you sure it has anything to
> >do with unpluging?  Would be nice to know more about the workload.
> >
> 
> I think I remember this problem. In the original request based design we hit
> this issue and Kiyoshi or Jun'ichi did some changes for it.
> 
> I think it was related to the busy/dm_lld_busy code in dm.c and dm-mpath.c.
> The problem was that we do the merging in the dm level queue. The underlying
> paths do not merge bios. They just take the request sent to them. The change
> that was done to promote (I do not think we ever completely fixed the issue)
> was that normally the workload was heavy enough or the paths slow enough so
> the busy check would return true enough to cause the dm layers queue not
> dispatch requests so quickly. They then had time to sit in the dm queue and
> merge with other bios.
> 
> If the device/transport is fast or the workload is low, the multipath_busy
> never returns busy, then we can hit Hannes's issue. For 4 paths, we just
> might not be able to fill up the paths and hit the busy check. With only 2
> paths, we might be slow enough or the workload is heavy enough to keep the
> paths busy and so we hit the busy check and do more merging.

NetApp is seeing this same issue.  It seems like we might want to make
multipath_busy more aggressive about returning busy, which would
probably require multipath tracking the size and frequency of the
requests.  If it determines that it's getting a lot of requests that
could have been merged, it could start throttling how fast requests are
getting pulled off the queue, even when the underlying paths aren't busy.
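
Something like this, purely hypothetical (struct merge_tracker, its fields,
and the thresholds are all made up to illustrate the idea; nothing like it
exists in dm-mpath today):

struct merge_tracker {
    sector_t last_end;          /* end sector of the previous request seen */
    unsigned int seq_count;     /* consecutive contiguous small requests */
};

static bool looks_mergeable(struct merge_tracker *mt, struct request *rq)
{
    bool contiguous = (blk_rq_pos(rq) == mt->last_end);

    mt->last_end = blk_rq_pos(rq) + blk_rq_sectors(rq);
    mt->seq_count = contiguous ? mt->seq_count + 1 : 0;

    /* report "busy" once we keep seeing contiguous 4K requests,
     * so they linger on the dm queue long enough to merge */
    return mt->seq_count > 8 && blk_rq_bytes(rq) <= 4096;
}

multipath_busy() could then OR something like this in with the existing
path-busy checks.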

-Ben



* Re: [Lsf] dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-23 17:18     ` [Lsf] " Mike Christie
  2015-02-23 18:34       ` Benjamin Marzinski
@ 2015-02-23 19:35       ` Merla, ShivaKrishna
  2015-02-23 19:50       ` Mike Snitzer
  2 siblings, 0 replies; 18+ messages in thread
From: Merla, ShivaKrishna @ 2015-02-23 19:35 UTC (permalink / raw)
  To: device-mapper development, Mike Snitzer, Hannes Reinecke; +Cc: lsf, Jeff Moyer



> -----Original Message-----
> From: dm-devel-bounces@redhat.com [mailto:dm-devel-
> bounces@redhat.com] On Behalf Of Mike Christie
> Sent: Monday, February 23, 2015 11:19 AM
> To: Mike Snitzer; Hannes Reinecke
> Cc: lsf@lists.linux-foundation.org; Jeff Moyer; dm-devel@redhat.com
> Subject: Re: [dm-devel] [Lsf] dm-mpath request merging concerns [was: Re:
> It's time to put together the schedule]
> 
> On 2/23/15, 7:50 AM, Mike Snitzer wrote:
> > On Mon, Feb 23 2015 at  2:18am -0500,
> > Hannes Reinecke <hare@suse.de> wrote:
> >
> >> On 02/20/2015 02:29 AM, James Bottomley wrote:
> >>> In the absence of any strong requests, the Programme Committee has
> taken
> >>> a first stab at an agenda here:
> >>>
> >>>
> https://docs.google.com/spreadsheet/pub?key=0ArurRVMVCSnkdEl4a0NrN
> TgtU2JrWDNtWGRDOWRhZnc
> >>>
> >>> If there's anything you think should be discussed (or shouldn't be
> >>> discussed) speak now ...
> >>>
> >> Recently we've found a rather worrysome queueing degradation with
> >> multipathing, which pointed to a deficiency in the scheduler itself:
> >> SAP found that a device with 4 paths had less I/O throughput than a
> >> system with 2 paths. When they've reduced the queue depth on the 4
> >> path system they managed to increase the throughput somewhat, but
> >> still less than they've had with two paths.
> >
> > The block layer doesn't have any understanding of how many paths are
> > behind the top-level dm-mpath request_queue that is supposed to be
> doing
> > the merging.
> >
> > So from a pure design level it is surprising that 2 vs 4 is impacting
> > the merging at all.  I think Jeff Moyer (cc'd) has dealt with SAP
> > performance recently too.
> >
> >> As it turns out, with 4 paths the system rarely did any I/O merging,
> >> but rather fired off the 4k requests as fast as possible.
> >> With two paths it was able to do some merging, leading to improved
> >> performance.
> >>
> >> I was under the impression that the merging algorithm in the block
> >> layer would only unplug the queue once the request had been fully
> >> formed, ie after merging has happened. But apparently that is not
> >> the case here.
> >
> > Just because you aren't seeing merging are you sure it has anything to
> > do with unpluging?  Would be nice to know more about the workload.
> >
> 
> I think I remember this problem. In the original request based design we
> hit this issue and Kiyoshi or Jun'ichi did some changes for it.
> 
> I think it was related to the busy/dm_lld_busy code in dm.c and
> dm-mpath.c. The problem was that we do the merging in the dm level
> queue. The underlying paths do not merge bios. They just take the
> request sent to them. The change that was done to promote (I do not
> think we ever completely fixed the issue) was that normally the workload
> was heavy enough or the paths slow enough so the busy check would return
> true enough to cause the dm layers queue not dispatch requests so
> quickly. They then had time to sit in the dm queue and merge with other
> bios.
> 
> If the device/transport is fast or the workload is low, the
> multipath_busy never returns busy, then we can hit Hannes's issue. For 4
> paths, we just might not be able to fill up the paths and hit the busy
> check. With only 2 paths, we might be slow enough or the workload is
> heavy enough to keep the paths busy and so we hit the busy check and do
> more merging.
> 
Yes, indeed this is the exact issue we saw at NetApp. While running sequential
4K write I/O with a large thread count, 2 paths yield better performance than
4 paths, and performance drops drastically with 4 paths. The device queue_depth
was 32; with 2 paths, blktrace showed better I/O merging happening and the average
request size was > 8K per iostat. With 4 paths none of the I/O gets merged and
the average request size is always 4K. The scheduler used was noop as we are using
SSD-based storage. We could get I/O merging to happen even with 4 paths, but only
with a lower device queue_depth of 16. Even then the performance was lacking
compared to 2 paths.



* Re: dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-23 17:18     ` [Lsf] " Mike Christie
  2015-02-23 18:34       ` Benjamin Marzinski
  2015-02-23 19:35       ` [Lsf] " Merla, ShivaKrishna
@ 2015-02-23 19:50       ` Mike Snitzer
  2015-02-23 20:08         ` [Lsf] " Christoph Hellwig
  2 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2015-02-23 19:50 UTC (permalink / raw)
  To: Mike Christie; +Cc: lsf, axboe, Jeff Moyer, dm-devel

On Mon, Feb 23 2015 at 12:18pm -0500,
Mike Christie <michaelc@cs.wisc.edu> wrote:

> On 2/23/15, 7:50 AM, Mike Snitzer wrote:
> >On Mon, Feb 23 2015 at  2:18am -0500,
> >Hannes Reinecke <hare@suse.de> wrote:
> >
> >>On 02/20/2015 02:29 AM, James Bottomley wrote:
> >>>In the absence of any strong requests, the Programme Committee has taken
> >>>a first stab at an agenda here:
> >>>
> >>>https://docs.google.com/spreadsheet/pub?key=0ArurRVMVCSnkdEl4a0NrNTgtU2JrWDNtWGRDOWRhZnc
> >>>
> >>>If there's anything you think should be discussed (or shouldn't be
> >>>discussed) speak now ...
> >>>
> >>Recently we've found a rather worrysome queueing degradation with
> >>multipathing, which pointed to a deficiency in the scheduler itself:
> >>SAP found that a device with 4 paths had less I/O throughput than a
> >>system with 2 paths. When they've reduced the queue depth on the 4
> >>path system they managed to increase the throughput somewhat, but
> >>still less than they've had with two paths.
> >
> >The block layer doesn't have any understanding of how many paths are
> >behind the top-level dm-mpath request_queue that is supposed to be doing
> >the merging.
> >
> >So from a pure design level it is surprising that 2 vs 4 is impacting
> >the merging at all.  I think Jeff Moyer (cc'd) has dealt with SAP
> >performance recently too.
> >
> >>As it turns out, with 4 paths the system rarely did any I/O merging,
> >>but rather fired off the 4k requests as fast as possible.
> >>With two paths it was able to do some merging, leading to improved
> >>performance.
> >>
> >>I was under the impression that the merging algorithm in the block
> >>layer would only unplug the queue once the request had been fully
> >>formed, ie after merging has happened. But apparently that is not
> >>the case here.
> >
> >Just because you aren't seeing merging are you sure it has anything to
> >do with unpluging?  Would be nice to know more about the workload.
> >
> 
> I think I remember this problem. In the original request based
> design we hit this issue and Kiyoshi or Jun'ichi did some changes
> for it.
> 
> I think it was related to the busy/dm_lld_busy code in dm.c and
> dm-mpath.c. The problem was that we do the merging in the dm level
> queue. The underlying paths do not merge bios. They just take the
> request sent to them.

Digging in to this a little, seems pretty clear that DM-mpath doesn't
have enough integration with the block layer related to queue plugging.

DM is looking for back-pressure in terms of "busy" in 2 ways:

1) from blk_queue_lld_busy() callback:
dm_lld_busy -> dm_table_any_busy_target -> multipath_busy -> 
__pgpath_busy -> dm_underlying_device_busy -> blk_lld_busy

2) from q->request_fn:
dm_request_fn -> multipath_busy ->
__pgpath_busy -> dm_underlying_device_busy -> blk_lld_busy

(btw, I'm tempted to remove dm_underlying_device_busy since it just
calls blk_lld_busy... not seeing the point of the DM wrapper.  BUT to my
amazement, dm_underlying_device_busy is the only caller of
blk_lld_busy.  And the only other caller of blk_queue_lld_busy() is
scsi_alloc_queue().  Meaning this is a useless hook for non-SCSI devices
that might be used by multipath in the future, e.g. NVMe.  Also, we
don't support stacked request-based DM targets so I'm missing _why_ DM
is even bothering to call blk_queue_lld_busy -- I think it shouldn't) 
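
For reference, this is roughly where both chains bottom out (a simplified
sketch along the lines of drivers/md/dm-mpath.c of this era, not the
verbatim source):

static int __pgpath_busy(struct pgpath *pgpath)
{
    struct request_queue *q = bdev_get_queue(pgpath->path.dev->bdev);

    /* dm_underlying_device_busy() is a one-line wrapper around this */
    return blk_lld_busy(q);
}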

Anyway, as was noted earlier, mpath's underlying devices could be too
fast to show any signs of "busy" pressure.  And on top of that, the check
for "busy" is racy given that there is no guarantee that the queue that
is checked will be the actual underlying queue the request will get
dispatched to.

Switching gears slightly, DM is blind to plugging and only relies on
"busy" -- this looks like a recipe for blindly dispatching requests to
the underlying queues.

questions:

- Should request-based DM wire up blk_check_plugged() to allow the block
  layer's plugging to more directly influence when blk_start_request()
  is called from dm_request_fn?

- Put differently: in addition to checking ->busy should dm_request_fn
  also maintain and check plugging state that is influenced by
  blk_check_plugged()?

(or is this moot given that the block layer will only call q->request_fn
when the queue isn't plugged anyway!?)
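
For context, blk_check_plugged() is what md/raid uses to piggyback on the
submitter's plug.  A purely hypothetical dm-flavored sketch of what the
wiring could look like (dm_note_plugged(), dm_unplug_cb() and
dm_kick_queue() are made up; only blk_check_plugged() and struct
blk_plug_cb are real):

static void dm_unplug_cb(struct blk_plug_cb *cb, bool from_schedule)
{
    struct mapped_device *md = cb->data;

    /* the submitter's plug was flushed: now run the dm queue so it
     * dispatches whatever accumulated (and hopefully merged) meanwhile */
    dm_kick_queue(md);            /* hypothetical helper */
    kfree(cb);
}

static void dm_note_plugged(struct mapped_device *md)
{
    /* registers dm_unplug_cb on the current task's plug; a no-op
     * (returns NULL) when the submitter holds no blk_plug */
    blk_check_plugged(dm_unplug_cb, md, sizeof(struct blk_plug_cb));
}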

Basically the long and short of this is: the block layer isn't helping
us like we thought (elevator is effectively useless and/or being
circumvented).  And this apparently isn't new.

I'll take a more measured look at all of this while also trying to
make sense of switching request-based DM over to using blk-mq.


* Re: dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-23 18:34       ` Benjamin Marzinski
@ 2015-02-23 19:56         ` Mike Snitzer
  2015-02-23 21:19           ` Benjamin Marzinski
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2015-02-23 19:56 UTC (permalink / raw)
  To: Benjamin Marzinski; +Cc: lsf, axboe, device-mapper development, Jeff Moyer

On Mon, Feb 23 2015 at  1:34pm -0500,
Benjamin Marzinski <bmarzins@redhat.com> wrote:

> On Mon, Feb 23, 2015 at 11:18:36AM -0600, Mike Christie wrote:
> > 
> > If the device/transport is fast or the workload is low, the multipath_busy
> > never returns busy, then we can hit Hannes's issue. For 4 paths, we just
> > might not be able to fill up the paths and hit the busy check. With only 2
> > paths, we might be slow enough or the workload is heavy enough to keep the
> > paths busy and so we hit the busy check and do more merging.
> 
> Netapp is seeing this same issue.  It seems like we might want to make
> multipath_busy more aggressive about returning busy, which would
> probably require multipath tracking the size and frequency of the
> requests.  If it determines that it's getting a lot of requests that
> could have been merged, it could start throttling how fast requests are
> getting pulled off the queue, even there underlying paths aren't busy.

The ->busy() checks are just an extra check that shouldn't be the primary
method for governing the effectiveness of the DM-mpath queue's elevator.

I need to get back to basics to appreciate how the existing block layer
is able to have an effective elevator regardless of the device's speed.
And why isn't request-based DM able to just take advantage of it?

(my money is on request-based DM being overly clever but we'll see)


* Re: [Lsf] dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-23 19:50       ` Mike Snitzer
@ 2015-02-23 20:08         ` Christoph Hellwig
  2015-02-23 20:39           ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2015-02-23 20:08 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: axboe, Mike Christie, dm-devel, lsf, Jeff Moyer

On Mon, Feb 23, 2015 at 02:50:57PM -0500, Mike Snitzer wrote:
> - Should request-based DM wire up blk_check_plugged() to allow the block
>   layer's plugging to more directly influence when blk_start_request()
>   is called from dm_request_fn?
> 
> - Put differently: in addition to checking ->busy should dm_request_fn
>   also maintain and check plugging state that is influenced by
>   blk_check_plugged()?
> 
> (or is this moot given that the block layer will only call q->request_fn
> when the queue isn't plugged anyway!?)
> 
> Basically the long and short of this is: the block layer isn't helping
> us like we thought (elevator is effectively useless and/or being
> circumvented).  And this apparently isn't new.
> 
> I'll take a more measured look at all of this while also trying to
> make sense of switching request-based DM over to using blk-mq.

I'd like to rephrase the question a little bit:  Is the request layer +
device mapper really the right combination for driving multipath I/O?

If we were moving it to a set of helpers where blk-mq drivers could
just ask for resubmitting I/O on another device of its choice, we
would be able to cut out the middle man with all the problems that having
a middle man entails: all the busy checks that need to be
propagated, the plugging question, the various merge parameters that need
to be communicated from the driver, the sense code interpretation for which
dm needs to call back into SCSI (or NVMe for that matter).  To me
it seems like a much better idea to let the driver (*) drive the
decision, with a few library helpers provided to deal with queue selection
algorithms and other fairly generic pieces of code.

(*) that is block layer driver, in case of SCSI it would still be the
midlayer pulling the strings.



* Re: dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-23 20:08         ` [Lsf] " Christoph Hellwig
@ 2015-02-23 20:39           ` Mike Snitzer
  2015-02-23 21:14             ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2015-02-23 20:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, Mike Christie, dm-devel, lsf, Jeff Moyer

On Mon, Feb 23 2015 at  3:08pm -0500,
Christoph Hellwig <hch@lst.de> wrote:

> On Mon, Feb 23, 2015 at 02:50:57PM -0500, Mike Snitzer wrote:
> > - Should request-based DM wire up blk_check_plugged() to allow the block
> >   layer's plugging to more directly influence when blk_start_request()
> >   is called from dm_request_fn?
> > 
> > - Put differently: in addition to checking ->busy should dm_request_fn
> >   also maintain and check plugging state that is influenced by
> >   blk_check_plugged()?
> > 
> > (or is this moot given that the block layer will only call q->request_fn
> > when the queue isn't plugged anyway!?)
> > 
> > Basically the long and short of this is: the block layer isn't helping
> > us like we thought (elevator is effectively useless and/or being
> > circumvented).  And this apparently isn't new.
> > 
> > I'll take a more measured look at all of this while also trying to
> > make sense of switching request-based DM over to using blk-mq.
> 
> I'd like to rephrase the question a little bit:  Is the request layer +
> device mapper really the right combination for driving multipath I/O?
> 
> If we were moving it to a set of helpers where blk-mq drivers could
> just ask for resubmitting I/O on another device of it's choice we
> would be able to cut the middle man with all the problems that having
> a middle man entails. That is all the busy checks that need to be
> propagated, the plugging question, the various merge parameters that need
> to be communicated from the driver, the sense code intepretation for which
> dm needs to call back into SCSI (or NVMe for that matter).  To me
> it seems like a much better idea to let the driver (*) driver the
> decision, with a few library helpers provided to deal with queue selection
> algorithms and other fairly generic pieces of code.
> 
> (*) that is block layer driver, in case of SCSI it would still be the
> midlayer pulling the strings.

OK, fair question.  But making existing DM-multipath work as expected
(effective consumer of old block's elevators) would be a nice thing to
do before blowing it up with a redesign.

There is a lot of block code that is only used by request-based DM, the
big one being blk_queue_bio().  It could be that a regression crept in and
went unnoticed until now.  Not sure.

Happy to entertain your question once this immediate issue is under
control (or at least understood).  But it is obviously a worthy
discussion for LSF.



* Re: dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-23 20:39           ` Mike Snitzer
@ 2015-02-23 21:14             ` Mike Snitzer
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2015-02-23 21:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, Mike Christie, Jeff Moyer, lsf, dm-devel

On Mon, Feb 23 2015 at  3:39pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Feb 23 2015 at  3:08pm -0500,
> Christoph Hellwig <hch@lst.de> wrote:
> > 
> > I'd like to rephrase the question a little bit:  Is the request layer +
> > device mapper really the right combination for driving multipath I/O?
> > 
> > If we were moving it to a set of helpers where blk-mq drivers could
> > just ask for resubmitting I/O on another device of it's choice we
> > would be able to cut the middle man with all the problems that having
> > a middle man entails. That is all the busy checks that need to be
> > propagated, the plugging question, the various merge parameters that need
> > to be communicated from the driver, the sense code intepretation for which
> > dm needs to call back into SCSI (or NVMe for that matter).  To me
> > it seems like a much better idea to let the driver (*) driver the
> > decision, with a few library helpers provided to deal with queue selection
> > algorithms and other fairly generic pieces of code.
> > 
> > (*) that is block layer driver, in case of SCSI it would still be the
> > midlayer pulling the strings.
> 
> OK, fair question.  But making existing DM-multipath work as expected
> (effective consumer of old block's elevators) would be a nice thing to
> do before blowing it up with a redesign.
> 
> There is a lot of block code that is only used by request-based DM, the
> big one being blk_queue_bio().

And yet I'm wrong: blk_queue_bio() is clearly used as the primary
make_request function.  Which is good, and arrests my fears that DM was the
only consumer.  DM is just the only consumer of the symbol export.



* Re: dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-23 19:56         ` Mike Snitzer
@ 2015-02-23 21:19           ` Benjamin Marzinski
  2015-02-23 22:46             ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Benjamin Marzinski @ 2015-02-23 21:19 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: lsf, axboe, device-mapper development, Jeff Moyer

On Mon, Feb 23, 2015 at 02:56:03PM -0500, Mike Snitzer wrote:
> On Mon, Feb 23 2015 at  1:34pm -0500,
> Benjamin Marzinski <bmarzins@redhat.com> wrote:
> 
> > On Mon, Feb 23, 2015 at 11:18:36AM -0600, Mike Christie wrote:
> > > 
> > > If the device/transport is fast or the workload is low, the multipath_busy
> > > never returns busy, then we can hit Hannes's issue. For 4 paths, we just
> > > might not be able to fill up the paths and hit the busy check. With only 2
> > > paths, we might be slow enough or the workload is heavy enough to keep the
> > > paths busy and so we hit the busy check and do more merging.
> > 
> > Netapp is seeing this same issue.  It seems like we might want to make
> > multipath_busy more aggressive about returning busy, which would
> > probably require multipath tracking the size and frequency of the
> > requests.  If it determines that it's getting a lot of requests that
> > could have been merged, it could start throttling how fast requests are
> > getting pulled off the queue, even there underlying paths aren't busy.
> 
> the ->busy() checks are just an extra check the shouldn't be the primary
> method for governing the effectiveness of the DM-mpath queue's elevator.
> 
> I need to get back to basics to appreciate how the existing block layer
> is able to have an effective elevator regardless of the device's speed.
> And why isn't request-based DM able to just take advantage of it?

I always thought that at least one of the schedulers always kept
incoming requests on an internal queue for at least a little bit to see
if any merging could happen, even if they could otherwise just be added
to the request queue.  But I admit to being a little vague on how exactly
they all work.

Another place where we could break out of constantly pulling requests off
the queue before they're merged is in dm_prep_fn().  If we thought that
we should break and let merging happen, we could return BLKPREP_DEFER.
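
For concreteness, a sketch of that -- merge_likely() is a hypothetical
heuristic, but the prep_rq_fn hook and the BLKPREP_* return codes are the
existing block layer interface:

static int dm_prep_fn_sketch(struct request_queue *q, struct request *rq)
{
    if (merge_likely(q, rq))
        return BLKPREP_DEFER;    /* leave rq on the dm queue for now */

    return BLKPREP_OK;           /* clone the request and dispatch as usual */
}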

-Ben

> 
> (my money is on request-based DM being overly clever but we'll see)


* Re: dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-23 22:46             ` Mike Snitzer
@ 2015-02-23 22:14               ` Benjamin Marzinski
  2015-02-24  0:39                 ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Benjamin Marzinski @ 2015-02-23 22:14 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: lsf, axboe, device-mapper development, Jeff Moyer

On Mon, Feb 23, 2015 at 05:46:37PM -0500, Mike Snitzer wrote:
> On Mon, Feb 23 2015 at  4:19pm -0500,
> Benjamin Marzinski <bmarzins@redhat.com> wrote:
> 
> > On Mon, Feb 23, 2015 at 02:56:03PM -0500, Mike Snitzer wrote:
> > > On Mon, Feb 23 2015 at  1:34pm -0500,
> > > Benjamin Marzinski <bmarzins@redhat.com> wrote:
> > > 
> > > > On Mon, Feb 23, 2015 at 11:18:36AM -0600, Mike Christie wrote:
> > > > > 
> > > > > If the device/transport is fast or the workload is low, the multipath_busy
> > > > > never returns busy, then we can hit Hannes's issue. For 4 paths, we just
> > > > > might not be able to fill up the paths and hit the busy check. With only 2
> > > > > paths, we might be slow enough or the workload is heavy enough to keep the
> > > > > paths busy and so we hit the busy check and do more merging.
> > > > 
> > > > Netapp is seeing this same issue.  It seems like we might want to make
> > > > multipath_busy more aggressive about returning busy, which would
> > > > probably require multipath tracking the size and frequency of the
> > > > requests.  If it determines that it's getting a lot of requests that
> > > > could have been merged, it could start throttling how fast requests are
> > > > getting pulled off the queue, even there underlying paths aren't busy.
> > > 
> > > the ->busy() checks are just an extra check the shouldn't be the primary
> > > method for governing the effectiveness of the DM-mpath queue's elevator.
> > > 
> > > I need to get back to basics to appreciate how the existing block layer
> > > is able to have an effective elevator regardless of the device's speed.
> > > And why isn't request-based DM able to just take advantage of it?
> > 
> > I always thought that at least one of the schedulers always kept
> > incoming requests on an interal queue for at least a little bit to see
> > if any merging could happen, even if they could otherwise just be added
> > to the request queue. but I admit to being a little vague on how exactly
> > they all work.
> 
> CFQ has idling, etc.  Which promotes merging.
> 
> > Another place where we could break out of constantly pulling requests of
> > the queue before they're merged is in dm_prep_fn().  If we thought that
> > we should break and let merging happen, we could return BLKPREP_DEFER.
> 
> It is blk_queue_bio(), via q->make_request_fn, that is intended to
> actually do the merging.  What I'm hearing is that we're only getting
> some small amount of merging if:
> 1) the 2 path case is used and therefore ->busy hook within
>    q->request_fn is not taking the request off the queue, so there is
>    more potential for later merging
> 2) the 4 path case IFF nr_requests is reduced to induce ->busy, which
>    only promoted merging as a side-effect like 1) above
> 
> The reality is we aren't getting merging where it _should_ be happening
> (in blk_queue_bio).  We need to understand why that is.

Huh? I'm confused.  If the merges that are happening (which are more
likely if either of those two points you mentioned are true) aren't
happening in blk_queue_bio, then where are they happening?

I thought that the issue is that requests are getting pulled off the
multipath device's request queue and placed on the underlying device's
request queue too quickly, so that there are no requests on multipath's
queue to merge with when blk_queue_bio() is called.  In this case, one
solution would involve keeping multipath from removing these requests
too quickly when we think that it is likely that another request which
can get merged will be added soon. That's what all my ideas have been
about.

Do you think something different is happening here? 

-Ben


* Re: dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-23 21:19           ` Benjamin Marzinski
@ 2015-02-23 22:46             ` Mike Snitzer
  2015-02-23 22:14               ` Benjamin Marzinski
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2015-02-23 22:46 UTC (permalink / raw)
  To: Benjamin Marzinski; +Cc: lsf, axboe, device-mapper development, Jeff Moyer

On Mon, Feb 23 2015 at  4:19pm -0500,
Benjamin Marzinski <bmarzins@redhat.com> wrote:

> On Mon, Feb 23, 2015 at 02:56:03PM -0500, Mike Snitzer wrote:
> > On Mon, Feb 23 2015 at  1:34pm -0500,
> > Benjamin Marzinski <bmarzins@redhat.com> wrote:
> > 
> > > On Mon, Feb 23, 2015 at 11:18:36AM -0600, Mike Christie wrote:
> > > > 
> > > > If the device/transport is fast or the workload is low, the multipath_busy
> > > > never returns busy, then we can hit Hannes's issue. For 4 paths, we just
> > > > might not be able to fill up the paths and hit the busy check. With only 2
> > > > paths, we might be slow enough or the workload is heavy enough to keep the
> > > > paths busy and so we hit the busy check and do more merging.
> > > 
> > > Netapp is seeing this same issue.  It seems like we might want to make
> > > multipath_busy more aggressive about returning busy, which would
> > > probably require multipath tracking the size and frequency of the
> > > requests.  If it determines that it's getting a lot of requests that
> > > could have been merged, it could start throttling how fast requests are
> > > getting pulled off the queue, even there underlying paths aren't busy.
> > 
> > the ->busy() checks are just an extra check the shouldn't be the primary
> > method for governing the effectiveness of the DM-mpath queue's elevator.
> > 
> > I need to get back to basics to appreciate how the existing block layer
> > is able to have an effective elevator regardless of the device's speed.
> > And why isn't request-based DM able to just take advantage of it?
> 
> I always thought that at least one of the schedulers always kept
> incoming requests on an interal queue for at least a little bit to see
> if any merging could happen, even if they could otherwise just be added
> to the request queue. but I admit to being a little vague on how exactly
> they all work.

CFQ has idling, etc.  Which promotes merging.

> Another place where we could break out of constantly pulling requests of
> the queue before they're merged is in dm_prep_fn().  If we thought that
> we should break and let merging happen, we could return BLKPREP_DEFER.

It is blk_queue_bio(), via q->make_request_fn, that is intended to
actually do the merging.  What I'm hearing is that we're only getting
some small amount of merging if:
1) the 2 path case is used and therefore the ->busy hook within
   q->request_fn is not taking the request off the queue, so there is
   more potential for later merging
2) the 4 path case, IFF nr_requests is reduced to induce ->busy, which
   only promotes merging as a side-effect like 1) above

The reality is we aren't getting merging where it _should_ be happening
(in blk_queue_bio).  We need to understand why that is.
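
For the record, a heavily simplified outline of the merge path I'm
referring to (the shape of blk_queue_bio() in block/blk-core.c of this
era; a sketch, not the verbatim source):

static void blk_queue_bio_outline(struct request_queue *q, struct bio *bio)
{
    unsigned int request_count = 0;
    struct request *req;
    int el_ret;

    /* 1) try to merge the bio into a request on the submitter's plug list */
    if (blk_attempt_plug_merge(q, bio, &request_count))
        return;

    spin_lock_irq(q->queue_lock);

    /* 2) ask the elevator for a back/front merge with a request that is
     *    still sitting on this queue -- this only works if requests
     *    actually linger here */
    el_ret = elv_merge(q, &req, bio);
    if (el_ret == ELEVATOR_BACK_MERGE && bio_attempt_back_merge(q, req, bio)) {
        spin_unlock_irq(q->queue_lock);
        return;
    }

    /* 3) otherwise allocate a fresh request for the bio and add it to the
     *    queue (or plug list), where it may still merge later */
    spin_unlock_irq(q->queue_lock);
}

If the dm queue is drained as fast as bios arrive, step 2 almost never
finds a candidate, which matches what is being reported.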


* Re: dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-24  0:39                 ` Mike Snitzer
@ 2015-02-24  0:38                   ` Benjamin Marzinski
  2015-02-24  2:02                     ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Benjamin Marzinski @ 2015-02-24  0:38 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: lsf, axboe, device-mapper development, Jeff Moyer

On Mon, Feb 23, 2015 at 07:39:00PM -0500, Mike Snitzer wrote:
> On Mon, Feb 23 2015 at  5:14pm -0500,
> Benjamin Marzinski <bmarzins@redhat.com> wrote:
> 
> > On Mon, Feb 23, 2015 at 05:46:37PM -0500, Mike Snitzer wrote:
> > > 
> > > It is blk_queue_bio(), via q->make_request_fn, that is intended to
> > > actually do the merging.  What I'm hearing is that we're only getting
> > > some small amount of merging if:
> > > 1) the 2 path case is used and therefore ->busy hook within
> > >    q->request_fn is not taking the request off the queue, so there is
> > >    more potential for later merging
> > > 2) the 4 path case IFF nr_requests is reduced to induce ->busy, which
> > >    only promoted merging as a side-effect like 1) above
> > > 
> > > The reality is we aren't getting merging where it _should_ be happening
> > > (in blk_queue_bio).  We need to understand why that is.
> > 
> > Huh? I'm confused.  If the merges that are happening (which are more
> > likely if either of those two points you mentioned are true) aren't
> > happening in blk_queue_bio, then where are they happening?
> 
> AFAICT, purely from this discussion and NetApp's BZ, the little merging
> that is seen is happening by the ->lld_busy_fn hook.  See the comment
> block above blk_lld_busy().

Well, that function is what's causing dm_request_fn to stop pulling
requests off the queue, through

                if (ti->type->busy && ti->type->busy(ti))
                        goto delay_and_out;

But all scsi_lld_busy() (which is the function that eventually gets called
and that signals that the queue is busy) does is check some flags and
other values. The actual merging code is in blk_queue_bio().

>  
> > I thought that the issue is that requests are getting pulled off the
> > multipath device's request queue and placed on the underlying device's
> > request queue too quickly, so that there are no requests on multipth's
> > queue to merge with when blk_queue_bio() is called.  In this case, one
> > solution would involve keeping multipath from removing these requests
> > too quickly when we think that it is likely that another request which
> > can get merged will be added soon. That's what all my ideas have been
> > about.
> > 
> > Do you think something different is happening here? 
> 
> Requests are being pulled from the DM-multipath's queue if
> ->lld_busy_fn() is false.  Too quickly is all relative.  The case NetApp
> reported is with SSD devices in the backend.  Any increased idling in
> the interest of merging could hurt latency; but the merging may improve
> IOPS.  So it is trade-off.

I'm not at all sure that there's going to be a one-size-fits-all
solution, and it is possible that for really fast devices, load balancing
may end up being not all that useful.

> So what I said before and am still saying is: we need to understand why
> the designed hook for merging, via q->make_request_fn's blk_queue_bio(),
> isn't actually meaningful for DM multipath.
> 
> Merging should happen _before_ q->request_fn() is called.  Not as a
> side-effect of q->request_fn() happening to have intelligence to not
> start the request because the underlying device queues are busy.

The merging is happening before dm_request_fn, if there are any requests
to actually merge with. If blk_queue_bio runs, and there are no requests
left in the queue for the multipath device, then there is no chance
of any merging happening, since there are no requests to merge with. The
issue is that when there are multiple really fast paths under multipath,
their queue never fills up and they always report that they aren't busy,
which means the only thing that device-mapper has to do to the requests
on its queue, is put them on the appropriate queue of the underlying
device.  This doesn't take much time, and once it does this, no merging
is done on the underlying device queues. So if the requests spend more
of their time on the scsi device queues (where no merging happens) and
very little of their time on the multipath queue, then there simply
isn't time for merging to happen.  Merging in the underlying device
queues won't really help matters, since multipath will be spreading out
the requests among the various queues, so that contiguous requests won't
often be sent to the same underlying device (that's the whole point of
request-based multipath: doing the merging first, and then sending down
fully merged requests).

What Netapp was seeing was single requests getting added to the
multipath device queue, and then getting pulled off and added to the
underlying device queue before another request could get added to the
multipath request queue.

While I'm pretty sure that this is what's happening, I agree that making
dm_request_fn quit early may not be the best solution.  I'm not sure why
the queue is getting unplugged so quickly in the first place.  Perhaps
we should understand that first. If we're not calling dm_request_fn so
quickly, then we don't need to worry so much about stopping early.

-Ben


* Re: dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-23 22:14               ` Benjamin Marzinski
@ 2015-02-24  0:39                 ` Mike Snitzer
  2015-02-24  0:38                   ` Benjamin Marzinski
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2015-02-24  0:39 UTC (permalink / raw)
  To: Benjamin Marzinski; +Cc: lsf, axboe, device-mapper development, Jeff Moyer

On Mon, Feb 23 2015 at  5:14pm -0500,
Benjamin Marzinski <bmarzins@redhat.com> wrote:

> On Mon, Feb 23, 2015 at 05:46:37PM -0500, Mike Snitzer wrote:
> > 
> > It is blk_queue_bio(), via q->make_request_fn, that is intended to
> > actually do the merging.  What I'm hearing is that we're only getting
> > some small amount of merging if:
> > 1) the 2 path case is used and therefore ->busy hook within
> >    q->request_fn is not taking the request off the queue, so there is
> >    more potential for later merging
> > 2) the 4 path case IFF nr_requests is reduced to induce ->busy, which
> >    only promoted merging as a side-effect like 1) above
> > 
> > The reality is we aren't getting merging where it _should_ be happening
> > (in blk_queue_bio).  We need to understand why that is.
> 
> Huh? I'm confused.  If the merges that are happening (which are more
> likely if either of those two points you mentioned are true) aren't
> happening in blk_queue_bio, then where are they happening?

AFAICT, purely from this discussion and NetApp's BZ, the little merging
that is seen is happening by the ->lld_busy_fn hook.  See the comment
block above blk_lld_busy().
 
> I thought that the issue is that requests are getting pulled off the
> multipath device's request queue and placed on the underlying device's
> request queue too quickly, so that there are no requests on multipth's
> queue to merge with when blk_queue_bio() is called.  In this case, one
> solution would involve keeping multipath from removing these requests
> too quickly when we think that it is likely that another request which
> can get merged will be added soon. That's what all my ideas have been
> about.
> 
> Do you think something different is happening here? 

Requests are being pulled from the DM-multipath's queue if
->lld_busy_fn() is false.  Too quickly is all relative.  The case NetApp
reported is with SSD devices in the backend.  Any increased idling in
the interest of merging could hurt latency; but the merging may improve
IOPS.  So it is a trade-off.

So what I said before and am still saying is: we need to understand why
the designed hook for merging, via q->make_request_fn's blk_queue_bio(),
isn't actually meaningful for DM multipath.

Merging should happen _before_ q->request_fn() is called.  Not as a
side-effect of q->request_fn() happening to have intelligence to not
start the request because the underlying device queues are busy.


* Re: dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-24  0:38                   ` Benjamin Marzinski
@ 2015-02-24  2:02                     ` Mike Snitzer
  2015-02-24 14:35                       ` Jeff Moyer
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2015-02-24  2:02 UTC (permalink / raw)
  To: Benjamin Marzinski; +Cc: lsf, axboe, device-mapper development, Jeff Moyer

On Mon, Feb 23 2015 at  7:38pm -0500,
Benjamin Marzinski <bmarzins@redhat.com> wrote:

> On Mon, Feb 23, 2015 at 07:39:00PM -0500, Mike Snitzer wrote:
> > On Mon, Feb 23 2015 at  5:14pm -0500,
> > Benjamin Marzinski <bmarzins@redhat.com> wrote:
> > 
> > > On Mon, Feb 23, 2015 at 05:46:37PM -0500, Mike Snitzer wrote:
> > > > 
> > > > It is blk_queue_bio(), via q->make_request_fn, that is intended to
> > > > actually do the merging.  What I'm hearing is that we're only getting
> > > > some small amount of merging if:
> > > > 1) the 2 path case is used and therefore ->busy hook within
> > > >    q->request_fn is not taking the request off the queue, so there is
> > > >    more potential for later merging
> > > > 2) the 4 path case IFF nr_requests is reduced to induce ->busy, which
> > > >    only promoted merging as a side-effect like 1) above
> > > > 
> > > > The reality is we aren't getting merging where it _should_ be happening
> > > > (in blk_queue_bio).  We need to understand why that is.
> > > 
> > > Huh? I'm confused.  If the merges that are happening (which are more
> > > likely if either of those two points you mentioned are true) aren't
> > > happening in blk_queue_bio, then where are they happening?
> > 
> > AFAICT, purely from this discussion and NetApp's BZ, the little merging
> > that is seen is happening by the ->lld_busy_fn hook.  See the comment
> > block above blk_lld_busy().
> 
> Well, that function is what's causing dm_request_fn to stop pulling
> requests of the queue, through
> 
>                 if (ti->type->busy && ti->type->busy(ti))
>                         goto delay_and_out;
> 
> But all scsi_lld_busy (which is the request that eventually gets called
> to that signals that the queue is busy) does is check some flags and
> other values. The actual merging code is in blk_queue_bio(). 
> 
> >  
> > > I thought that the issue is that requests are getting pulled off the
> > > multipath device's request queue and placed on the underlying device's
> > > request queue too quickly, so that there are no requests on multipth's
> > > queue to merge with when blk_queue_bio() is called.  In this case, one
> > > solution would involve keeping multipath from removing these requests
> > > too quickly when we think that it is likely that another request which
> > > can get merged will be added soon. That's what all my ideas have been
> > > about.
> > > 
> > > Do you think something different is happening here? 
> > 
> > Requests are being pulled from the DM-multipath's queue if
> > ->lld_busy_fn() is false.  Too quickly is all relative.  The case NetApp
> > reported is with SSD devices in the backend.  Any increased idling in
> > the interest of merging could hurt latency; but the merging may improve
> > IOPS.  So it is trade-off.
> 
> I'm not at all sure that there's going to be a one-size-fits-all
> solution, and it is possible that for really fast devices, load balancing
> may end up being not all that useful.
> 
> > So what I said before and am still saying is: we need to understand why
> > the designed hook for merging, via q->make_request_fn's blk_queue_bio(),
> > isn't actually meaningful for DM multipath.
> > 
> > Merging should happen _before_ q->request_fn() is called.  Not as a
> > side-effect of q->request_fn() happening to have intelligence to not
> > start the request because the underlying device queues are busy.
> 
> The merging is happening before dm_request_fn, if there are any requests
> to actually merge with. If blk_queue_bio runs, and there are no requests
> left in the queue for the multipath deivce, then there is no chance
> of any merging happening, since there are no requests to merge with. The
> issue is that when there are multiple really fast paths under multipath,
> their queue never fills up and they always report that they aren't busy,
> which means the only thing that device-mapper has to do to the requests
> on its queue, is put them on the appropriate queue of the underlying
> device.  This doesn't take much time, and once it does this, no merging
> is done on the underlying device queues. So if the requests spend more
> of their time on the scsi device queues (where no merging happens) and
> very little of their time on the multipath queue, then there simply
> isn't time for merging to happen.  Merging in the underlying device
> queues won't really help matters, since multipath will be spreading out
> the requests among the various queues, so that contiguous requests won't
> often be sent to the same underlying device (that's the whole point of
> request-based multipath: doing the merging first, and then sending down
> fully merged requests).
> 
> What Netapp was seeing was single requests getting added to the
> multipath device queue, and then getting pulled off and added to the
> underlying device queue before another request could get added to the
> multipath request queue.
> 
> While I'm pretty sure that this is what's happening, I agree that making
> dm_request_fn quit early may not be the best solution.  I'm not sure why
> the queue is getting unplugged so quickly in the first place.  Perhaps
> we should understand that first. If we're not calling dm_request_fn so
> quickly, then we don't need to worry so much about stopping early.

Yeah we are in complete agreement on all this.  (And yes, I don't think
adding AI to multipath or request-based DM to conditionally hold back
dispatch is the answer.)

My only point about dm_request_fn was that the only reason merging is
happening in blk_queue_bio at all is because ->lld_busy_fn is true.
Ideally the IO would be submitted in batches with a plug in place (that'd
allow for blk_queue_bio to be more effective).
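
By "batches with a plug in place" I mean the usual pattern below; bios
queued between blk_start_plug() and blk_finish_plug() get a chance to merge
before they ever reach the dm-mpath queue (illustrative sketch; the bios
array is assumed to be prepared by the caller):

static void submit_batch(struct bio **bios, int nr)
{
    struct blk_plug plug;
    int i;

    blk_start_plug(&plug);
    for (i = 0; i < nr; i++)
        submit_bio(WRITE, bios[i]);    /* queued/merged on the plug list */
    blk_finish_plug(&plug);            /* flush to the request_queue in one go */
}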

NetApp's test is using vdbench to submit 4K sequential IO using 64
threads directly (O_DIRECT) to the multipath device (all those threads
make it look random, or at least seeky).  There really isn't a layer
(that I'm aware of) that'd know to start and stop a plug for that test.
A filesystem on top might do better but I'm not sure.

But plugging aside, requests being dispatched too fast to allow for
merging is something that sounds odd to want.  Better to just submit
larger IOs to begin with.  BUT in the case of databases small (4K) IO is
the norm.  And it was Hannes' report about SAP that got this thread
started so...

Jens and/or Jeff Moyer, are there any knobs that you'd suggest to try to
promote request merging on a really fast block device?  Any scheduler
and knobs you'd suggest would be appreciated.

Short of that, I'm left scratching my head as to the best way to solve
this particular workload.


* Re: dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-24  2:02                     ` Mike Snitzer
@ 2015-02-24 14:35                       ` Jeff Moyer
  2015-02-24 14:59                         ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Jeff Moyer @ 2015-02-24 14:35 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: lsf, axboe, device-mapper development

Mike Snitzer <snitzer@redhat.com> writes:

> Jens and/or Jeff Moyer, are there any knobs that you'd suggest to try to
> promote request merging on a really fast block device?  Any scheduler
> and knobs you'd suggest would be appreciated.

There's a small chance that CFQ does what you want.  It has logic to
detect when multiple processes are submitting interleaved I/O, and it
will try to wait until requests are merged before dispatching (see
the comment in cfq_rq_enqueued).

Aside from that, make sure /sys/block/<dev>/queue/rotational is set to 1.

Cheers,
Jeff


* Re: dm-mpath request merging concerns [was: Re: It's time to put together the schedule]
  2015-02-24 14:35                       ` Jeff Moyer
@ 2015-02-24 14:59                         ` Mike Snitzer
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2015-02-24 14:59 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: lsf, axboe, device-mapper development

On Tue, Feb 24 2015 at  9:35am -0500,
Jeff Moyer <jmoyer@redhat.com> wrote:

> Mike Snitzer <snitzer@redhat.com> writes:
> 
> > Jens and/or Jeff Moyer, are there any knobs that you'd suggest to try to
> > promote request merging on a really fast block device?  Any scheduler
> > and knobs you'd suggest would be appreciated.
> 
> There's a small chance that CFQ does what you want.  It has logic to
> detect when multiple processes are submitting interleaved I/O, and it
> will try to wait until requests are merged before dispatching (see
> the comment in cfq_rq_enqueued).

That comment wasn't overly insightful on CFQ's ability to wait for
merging when IO is being submitted larger than a page size to begin
with.  Why will it "immediately let it rip" if the request is only just
larger than a single page?  Seems we'd like that threshold to be higher
depending on the workload (would it be wrong to make it tunable?).
Maybe it is a perfectly fine heuristic as is and I'm just missing it?

FYI, CFQ didn't work any better for NetApp when they tested it last
night; but they likely didn't set rotational.

> Aside from that, make sure /sys/block/<dev>/queue/rotational is set to 1.

Fair point (albeit unintuitive to the user who has an SSD backend).

I'll see if NetApp can re-test CFQ with rotational set.


* RFC for small allocation failure mode transition plan (was: Re: [Lsf] common session about page allocator vs. FS/IO) It's time to put together the schedule)
       [not found]           ` <1425311094.2187.11.camel@HansenPartnership.com>
@ 2015-03-08 18:11             ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2015-03-08 18:11 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jeff Layton, lsf, Theodore Ts'o, Dave Chinner,
	Johannes Weiner, Andrew Morton, Vlastimil Babka, linux-mm

On Mon 02-03-15 07:44:54, James Bottomley wrote:
> On Mon, 2015-03-02 at 10:41 -0500, Jeff Layton wrote:
> > On Mon, 2 Mar 2015 16:28:58 +0100
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > [Let's add people from the discussion on the CC]
> > > 
> > > On Mon 02-03-15 07:26:33, James Bottomley wrote:
> > > > On Mon, 2015-03-02 at 16:19 +0100, Michal Hocko wrote:
> > > > > On Mon 23-02-15 18:08:42, Michal Hocko wrote:
[...]
> > > > > > I would like to propose a common session (FS and MM, maybe IO as well)
> > > > > > about memory allocator guarantees and the current behavior of the
> > > > > > page allocator with different gfp flags - GFP_KERNEL being basically
> > > > > > __GFP_NOFAIL for small allocations. __GFP_NOFAIL allocations in general
> > > > > > - how they are used in fs/io code paths and what can the allocator do
> > > > > > to prevent from memory exhaustion. GFP_NOFS behavior when combined with
> > > > > > __GFP_NOFAIL and so on. It seems there was a disconnection between mm
> > > > > > and fs people and one camp is not fully aware of what others are doing
> > > > > > and why as it turned out during recent discussions.
> > > > > 
> > > > > James do you have any plans to put this on the schedule?
> > > > 
> > > > I was waiting to see if there was any other feedback, but if you feel
> > > > strongly it should happen, I can do it.
> > > 
> > > I think it would be helpful, but let's see what other involved in the
> > > discussion think.
> > 
> > It makes sense to me as a plenary discussion.
> > 
> > I was personally quite surprised to hear that small allocations
> > couldn't fail, and dismayed at how much time I've spent writing dead
> > error handling code. ;)
> > 
> > If we're keen to get rid of that behavior (and I think it really ought
> > to go, IMNSHO), then what might make sense is to add a Kconfig switch
> > that allows small allocations to fail as an interim step and see what
> > breaks when it's enabled.
> > 
> > Once we fix all of those places up, then we can see about getting
> > distros to turn it on, and eventually eliminate the Kconfig switch
> > altogether. It'll take a few years, but that's probably the least
> > disruptive approach.
> 
> OK, your wish is my command: it's filled up the last empty plenary slot
> on Monday morning.

I guess the following RFC patch should be good for the first part of the
topic - small allocations implying __GFP_NOFAIL currently. I am CCing the
linux-mm mailing list as well so that people not attending LSF/MM can
comment on the approach.

I hope people will find time to look at it before the session because I
am afraid two topics per one slot will be too dense otherwise. I also
hope this part will be less controversial and the primary point for
discussion will be on HOW TO GET RID OF the current behavior in a sane
way rather than WHY TO KEEP IT.
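
To make the "basically __GFP_NOFAIL" point concrete, this is the pattern
in question (an illustrative snippet, not taken from the patch):

#include <linux/slab.h>

static int alloc_context(void **out)
{
    /*
     * With the page allocator's current behavior, a small-order
     * GFP_KERNEL allocation like this retries indefinitely instead of
     * returning NULL, so the error path below is effectively dead code
     * today -- which is what the proposed change would alter.
     */
    *out = kmalloc(64, GFP_KERNEL);
    if (!*out)
        return -ENOMEM;

    return 0;
}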
---


end of thread, other threads:[~2015-03-08 18:11 UTC | newest]

Thread overview: 18+ messages
     [not found] <1424395745.2603.27.camel@HansenPartnership.com>
     [not found] ` <54EAD453.6040907@suse.de>
2015-02-23 13:50   ` dm-mpath request merging concerns [was: Re: It's time to put together the schedule] Mike Snitzer
2015-02-23 17:18     ` [Lsf] " Mike Christie
2015-02-23 18:34       ` Benjamin Marzinski
2015-02-23 19:56         ` Mike Snitzer
2015-02-23 21:19           ` Benjamin Marzinski
2015-02-23 22:46             ` Mike Snitzer
2015-02-23 22:14               ` Benjamin Marzinski
2015-02-24  0:39                 ` Mike Snitzer
2015-02-24  0:38                   ` Benjamin Marzinski
2015-02-24  2:02                     ` Mike Snitzer
2015-02-24 14:35                       ` Jeff Moyer
2015-02-24 14:59                         ` Mike Snitzer
2015-02-23 19:35       ` [Lsf] " Merla, ShivaKrishna
2015-02-23 19:50       ` Mike Snitzer
2015-02-23 20:08         ` [Lsf] " Christoph Hellwig
2015-02-23 20:39           ` Mike Snitzer
2015-02-23 21:14             ` Mike Snitzer
     [not found] ` <20150223170842.GK24272@dhcp22.suse.cz>
     [not found]   ` <20150302151941.GB26343@dhcp22.suse.cz>
     [not found]     ` <1425309993.2187.3.camel@HansenPartnership.com>
     [not found]       ` <20150302152858.GF26334@dhcp22.suse.cz>
     [not found]         ` <20150302104154.3ae46eb7@tlielax.poochiereds.net>
     [not found]           ` <1425311094.2187.11.camel@HansenPartnership.com>
2015-03-08 18:11             ` RFC for small allocation failure mode transition plan (was: Re: [Lsf] common session about page allocator vs. FS/IO) It's time to put together the schedule) Michal Hocko
