All of lore.kernel.org
 help / color / mirror / Atom feed
* [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
@ 2016-01-13  9:10 Hannes Reinecke
  2016-01-13 10:50   ` Sagi Grimberg
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Hannes Reinecke @ 2016-01-13  9:10 UTC (permalink / raw)
  To: lsf-pc; +Cc: device-mapper development, linux-scsi@vger.kernel.org

Hi all,

I'd like to attend LSF/MM and would like to present my ideas for a 
multipath redesign.

The overall idea is to break up the centralized multipath handling 
in device-mapper (and multipath-tools) and delegate to the 
appropriate sub-systems.

Individually the plan is:
a) use the 'wwid' sysfs attribute to detect multipath devices;
    this removes the need of the current 'path_id' functionality
    in multipath-tools
b) leverage topology information from scsi_dh_alua (which we will
    have once my ALUA handler update is in) to detect the multipath
    topology. This removes the need of a 'prio' infrastructure
    in multipath-tools
c) implement block or scsi events whenever a remote port becomes
    unavailable. This removes the need of the 'path_checker'
    functionality in multipath-tools.
d) leverage these events to handle path-up/path-down events
    in-kernel
e) move the I/O redirection logic out of device-mapper proper
    and use blk-mq to redirect I/O. This is still a bit of
    hand-waving, and definitely would need discussion to figure
    out if and how it can be achieved.
    This is basically the same topic Mike Snitzer proposed, but
    coming from a different angle.

But in the end we should be able to do strip down the current 
(rather complex) multipath-tools to just handle topology changes; 
everything else will be done internally.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-13  9:10 [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign Hannes Reinecke
@ 2016-01-13 10:50   ` Sagi Grimberg
  2016-01-13 11:08 ` [dm-devel] " Alasdair G Kergon
  2016-01-13 17:52 ` Benjamin Marzinski
  2 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2016-01-13 10:50 UTC (permalink / raw)
  To: Hannes Reinecke, lsf-pc
  Cc: device-mapper development, linux-scsi@vger.kernel.org, linux-nvme


> Hi all,
>
> I'd like to attend LSF/MM and would like to present my ideas for a
> multipath redesign.
>
> The overall idea is to break up the centralized multipath handling in
> device-mapper (and multipath-tools) and delegate to the appropriate
> sub-systems.

I agree that would be very useful. Great topic. I'd like to attend
this talk as well.

>
> Individually the plan is:
> a) use the 'wwid' sysfs attribute to detect multipath devices;
>     this removes the need of the current 'path_id' functionality
>     in multipath-tools

CC'ing Linux-nvme,

I've recently looked at multipathing support for nvme (and nvme over
fabrics) as well. For nvme the wwid equivalent is the nsid (namespace
identifier). I'm wandering if we can have better abstraction for
user-space so it won't need to change its behavior for scsi/nvme.
The same applies for the the timeout attribute for example which
assumes scsi device sysfs structure.

> b) leverage topology information from scsi_dh_alua (which we will
>     have once my ALUA handler update is in) to detect the multipath
>     topology. This removes the need of a 'prio' infrastructure
>     in multipath-tools

This would require further attention for nvme.

> c) implement block or scsi events whenever a remote port becomes
>     unavailable. This removes the need of the 'path_checker'
>     functionality in multipath-tools.

I'd prefer if we'd have it in the block layer so we can have it for all
block drivers. Also, this assumes that port events are independent of
I/O. This assumption is incorrect in SRP for example which detects port
failures only by I/O errors (which makes path sensing a must).

> d) leverage these events to handle path-up/path-down events
>     in-kernel
> e) move the I/O redirection logic out of device-mapper proper
>     and use blk-mq to redirect I/O. This is still a bit of
>     hand-waving, and definitely would need discussion to figure
>     out if and how it can be achieved.
>     This is basically the same topic Mike Snitzer proposed, but
>     coming from a different angle.

Another (adjacent) topic is multipath performance with blk-mq.

As I said, I've been looking at nvme multipathing support and
initial measurements show huge contention on the multipath lock
which really defeats the entire point of blk-mq...

I have yet to report this as my work is still in progress. I'm not sure
if it's a topic on it's own but I'd love to talk about that as well...

> But in the end we should be able to do strip down the current (rather
> complex) multipath-tools to just handle topology changes; everything
> else will be done internally.

I'd love to see that happening.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
@ 2016-01-13 10:50   ` Sagi Grimberg
  0 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2016-01-13 10:50 UTC (permalink / raw)



> Hi all,
>
> I'd like to attend LSF/MM and would like to present my ideas for a
> multipath redesign.
>
> The overall idea is to break up the centralized multipath handling in
> device-mapper (and multipath-tools) and delegate to the appropriate
> sub-systems.

I agree that would be very useful. Great topic. I'd like to attend
this talk as well.

>
> Individually the plan is:
> a) use the 'wwid' sysfs attribute to detect multipath devices;
>     this removes the need of the current 'path_id' functionality
>     in multipath-tools

CC'ing Linux-nvme,

I've recently looked at multipathing support for nvme (and nvme over
fabrics) as well. For nvme the wwid equivalent is the nsid (namespace
identifier). I'm wandering if we can have better abstraction for
user-space so it won't need to change its behavior for scsi/nvme.
The same applies for the the timeout attribute for example which
assumes scsi device sysfs structure.

> b) leverage topology information from scsi_dh_alua (which we will
>     have once my ALUA handler update is in) to detect the multipath
>     topology. This removes the need of a 'prio' infrastructure
>     in multipath-tools

This would require further attention for nvme.

> c) implement block or scsi events whenever a remote port becomes
>     unavailable. This removes the need of the 'path_checker'
>     functionality in multipath-tools.

I'd prefer if we'd have it in the block layer so we can have it for all
block drivers. Also, this assumes that port events are independent of
I/O. This assumption is incorrect in SRP for example which detects port
failures only by I/O errors (which makes path sensing a must).

> d) leverage these events to handle path-up/path-down events
>     in-kernel
> e) move the I/O redirection logic out of device-mapper proper
>     and use blk-mq to redirect I/O. This is still a bit of
>     hand-waving, and definitely would need discussion to figure
>     out if and how it can be achieved.
>     This is basically the same topic Mike Snitzer proposed, but
>     coming from a different angle.

Another (adjacent) topic is multipath performance with blk-mq.

As I said, I've been looking at nvme multipathing support and
initial measurements show huge contention on the multipath lock
which really defeats the entire point of blk-mq...

I have yet to report this as my work is still in progress. I'm not sure
if it's a topic on it's own but I'd love to talk about that as well...

> But in the end we should be able to do strip down the current (rather
> complex) multipath-tools to just handle topology changes; everything
> else will be done internally.

I'd love to see that happening.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-13  9:10 [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign Hannes Reinecke
  2016-01-13 10:50   ` Sagi Grimberg
@ 2016-01-13 11:08 ` Alasdair G Kergon
  2016-01-13 11:17   ` Hannes Reinecke
  2016-01-13 17:52 ` Benjamin Marzinski
  2 siblings, 1 reply; 25+ messages in thread
From: Alasdair G Kergon @ 2016-01-13 11:08 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: lsf-pc, device-mapper development, linux-scsi@vger.kernel.org,
	Junichi Nomura

On Wed, Jan 13, 2016 at 10:10:43AM +0100, Hannes Reinecke wrote:
> The overall idea is to break up the centralized multipath handling in 
> device-mapper (and multipath-tools) and delegate to the appropriate 
> sub-systems.
>
> Individually the plan is:

Could we start to drill down into each of these and categorise them in
terms of which parts of the stack are involved in the proposed change
and prioritise them in terms of likely amount of work and
cost/benefit/risk?  Some of them probably need some prototyping and
experimentation.

Alasdair


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-13 11:08 ` [dm-devel] " Alasdair G Kergon
@ 2016-01-13 11:17   ` Hannes Reinecke
  2016-01-13 11:25     ` Alasdair G Kergon
  0 siblings, 1 reply; 25+ messages in thread
From: Hannes Reinecke @ 2016-01-13 11:17 UTC (permalink / raw)
  To: lsf-pc, device-mapper development, linux-scsi@vger.kernel.org,
	Junichi Nomura

On 01/13/2016 12:08 PM, Alasdair G Kergon wrote:
> On Wed, Jan 13, 2016 at 10:10:43AM +0100, Hannes Reinecke wrote:
>> The overall idea is to break up the centralized multipath handling in
>> device-mapper (and multipath-tools) and delegate to the appropriate
>> sub-systems.
>>
>> Individually the plan is:
>
> Could we start to drill down into each of these and categorise them in
> terms of which parts of the stack are involved in the proposed change
> and prioritise them in terms of likely amount of work and
> cost/benefit/risk?  Some of them probably need some prototyping and
> experimentation.
>
Sure.
That's why I proposed it as a discussion topic :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-13 11:17   ` Hannes Reinecke
@ 2016-01-13 11:25     ` Alasdair G Kergon
  0 siblings, 0 replies; 25+ messages in thread
From: Alasdair G Kergon @ 2016-01-13 11:25 UTC (permalink / raw)
  To: device-mapper development
  Cc: lsf-pc, linux-scsi@vger.kernel.org, Junichi Nomura

On Wed, Jan 13, 2016 at 12:17:27PM +0100, Hannes Reinecke wrote:
> That's why I proposed it as a discussion topic :-)

No need to wait for LSF - we have mailing lists:)

Alasdair


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-13 10:50   ` Sagi Grimberg
@ 2016-01-13 11:46     ` Hannes Reinecke
  -1 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2016-01-13 11:46 UTC (permalink / raw)
  To: Sagi Grimberg, lsf-pc
  Cc: device-mapper development, linux-scsi@vger.kernel.org, linux-nvme

On 01/13/2016 11:50 AM, Sagi Grimberg wrote:
>
>> Hi all,
>>
>> I'd like to attend LSF/MM and would like to present my ideas for a
>> multipath redesign.
>>
>> The overall idea is to break up the centralized multipath handling in
>> device-mapper (and multipath-tools) and delegate to the appropriate
>> sub-systems.
>
> I agree that would be very useful. Great topic. I'd like to attend
> this talk as well.
>
>>
>> Individually the plan is:
>> a) use the 'wwid' sysfs attribute to detect multipath devices;
>>     this removes the need of the current 'path_id' functionality
>>     in multipath-tools
>
> CC'ing Linux-nvme,
>
> I've recently looked at multipathing support for nvme (and nvme over
> fabrics) as well. For nvme the wwid equivalent is the nsid (namespace
> identifier). I'm wandering if we can have better abstraction for
> user-space so it won't need to change its behavior for scsi/nvme.
> The same applies for the the timeout attribute for example which
> assumes scsi device sysfs structure.
>
My idea for this is to lookup the sysfs attribute directly from 
multipath-tools. As such we would need to have some transport 
information in multipath so that we know where to find it.
And with that we should easily able to accomodate NVMe, provided the 
nsid is displayed somewhere in sysfs.

>> b) leverage topology information from scsi_dh_alua (which we will
>>     have once my ALUA handler update is in) to detect the multipath
>>     topology. This removes the need of a 'prio' infrastructure
>>     in multipath-tools
>
> This would require further attention for nvme.
>
Indeed. But then I'm not sure how multipath topology would be 
represented in NVMe; we would need some way of transmitting the 
topology information.
Easiest would be to leverage VPD device information; so we only need 
the equivalent of REPORT TARGET PORT GROUPS to implement an 
ALUA-like scenario.

>> c) implement block or scsi events whenever a remote port becomes
>>     unavailable. This removes the need of the 'path_checker'
>>     functionality in multipath-tools.
>
> I'd prefer if we'd have it in the block layer so we can have it for all
> block drivers. Also, this assumes that port events are independent of
> I/O. This assumption is incorrect in SRP for example which detects port
> failures only by I/O errors (which makes path sensing a must).
>
That's what I though initially, too.
But then we're facing a layering issue:
The path events are generated at the _transport_ level.
So for SCSI we have to do a redirection
transport layer->scsi layer->scsi ULD->block device
requiring us to implement for sets of callback functions.
Which I found rather pointless (and time consuming), so I opted for 
scsi events (like we have for UNIT ATTENTION) instead.

However, even now we're having two sets of events (block events and 
scsi events) with a certain overlap, so this really could do with a 
cleanup.

>> d) leverage these events to handle path-up/path-down events
>>     in-kernel
>> e) move the I/O redirection logic out of device-mapper proper
>>     and use blk-mq to redirect I/O. This is still a bit of
>>     hand-waving, and definitely would need discussion to figure
>>     out if and how it can be achieved.
>>     This is basically the same topic Mike Snitzer proposed, but
>>     coming from a different angle.
>
> Another (adjacent) topic is multipath performance with blk-mq.
>
> As I said, I've been looking at nvme multipathing support and
> initial measurements show huge contention on the multipath lock
> which really defeats the entire point of blk-mq...
>
> I have yet to report this as my work is still in progress. I'm not sure
> if it's a topic on it's own but I'd love to talk about that as well...
>
Oh, most definitely. There are some areas in blk-mq which need to be 
covered / implemented before we can even think of that (dynamic 
queue reconfiguration and disabled queue handling being the most 
prominent).

_And_ we have the problem of queue mapping (one queue per ITL nexus?
one queue per hardware queue per ITL nexus?) which might quickly 
lead to a queue number explosion if we've not careful.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
@ 2016-01-13 11:46     ` Hannes Reinecke
  0 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2016-01-13 11:46 UTC (permalink / raw)


On 01/13/2016 11:50 AM, Sagi Grimberg wrote:
>
>> Hi all,
>>
>> I'd like to attend LSF/MM and would like to present my ideas for a
>> multipath redesign.
>>
>> The overall idea is to break up the centralized multipath handling in
>> device-mapper (and multipath-tools) and delegate to the appropriate
>> sub-systems.
>
> I agree that would be very useful. Great topic. I'd like to attend
> this talk as well.
>
>>
>> Individually the plan is:
>> a) use the 'wwid' sysfs attribute to detect multipath devices;
>>     this removes the need of the current 'path_id' functionality
>>     in multipath-tools
>
> CC'ing Linux-nvme,
>
> I've recently looked at multipathing support for nvme (and nvme over
> fabrics) as well. For nvme the wwid equivalent is the nsid (namespace
> identifier). I'm wandering if we can have better abstraction for
> user-space so it won't need to change its behavior for scsi/nvme.
> The same applies for the the timeout attribute for example which
> assumes scsi device sysfs structure.
>
My idea for this is to lookup the sysfs attribute directly from 
multipath-tools. As such we would need to have some transport 
information in multipath so that we know where to find it.
And with that we should easily able to accomodate NVMe, provided the 
nsid is displayed somewhere in sysfs.

>> b) leverage topology information from scsi_dh_alua (which we will
>>     have once my ALUA handler update is in) to detect the multipath
>>     topology. This removes the need of a 'prio' infrastructure
>>     in multipath-tools
>
> This would require further attention for nvme.
>
Indeed. But then I'm not sure how multipath topology would be 
represented in NVMe; we would need some way of transmitting the 
topology information.
Easiest would be to leverage VPD device information; so we only need 
the equivalent of REPORT TARGET PORT GROUPS to implement an 
ALUA-like scenario.

>> c) implement block or scsi events whenever a remote port becomes
>>     unavailable. This removes the need of the 'path_checker'
>>     functionality in multipath-tools.
>
> I'd prefer if we'd have it in the block layer so we can have it for all
> block drivers. Also, this assumes that port events are independent of
> I/O. This assumption is incorrect in SRP for example which detects port
> failures only by I/O errors (which makes path sensing a must).
>
That's what I though initially, too.
But then we're facing a layering issue:
The path events are generated at the _transport_ level.
So for SCSI we have to do a redirection
transport layer->scsi layer->scsi ULD->block device
requiring us to implement for sets of callback functions.
Which I found rather pointless (and time consuming), so I opted for 
scsi events (like we have for UNIT ATTENTION) instead.

However, even now we're having two sets of events (block events and 
scsi events) with a certain overlap, so this really could do with a 
cleanup.

>> d) leverage these events to handle path-up/path-down events
>>     in-kernel
>> e) move the I/O redirection logic out of device-mapper proper
>>     and use blk-mq to redirect I/O. This is still a bit of
>>     hand-waving, and definitely would need discussion to figure
>>     out if and how it can be achieved.
>>     This is basically the same topic Mike Snitzer proposed, but
>>     coming from a different angle.
>
> Another (adjacent) topic is multipath performance with blk-mq.
>
> As I said, I've been looking at nvme multipathing support and
> initial measurements show huge contention on the multipath lock
> which really defeats the entire point of blk-mq...
>
> I have yet to report this as my work is still in progress. I'm not sure
> if it's a topic on it's own but I'd love to talk about that as well...
>
Oh, most definitely. There are some areas in blk-mq which need to be 
covered / implemented before we can even think of that (dynamic 
queue reconfiguration and disabled queue handling being the most 
prominent).

_And_ we have the problem of queue mapping (one queue per ITL nexus?
one queue per hardware queue per ITL nexus?) which might quickly 
lead to a queue number explosion if we've not careful.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N?rnberg
GF: F. Imend?rffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG N?rnberg)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-13 10:50   ` Sagi Grimberg
@ 2016-01-13 15:42     ` Mike Snitzer
  -1 siblings, 0 replies; 25+ messages in thread
From: Mike Snitzer @ 2016-01-13 15:42 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, lsf-pc, device-mapper development, linux-nvme,
	linux-scsi@vger.kernel.org

On Wed, Jan 13 2016 at  5:50am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
 
> Another (adjacent) topic is multipath performance with blk-mq.
> 
> As I said, I've been looking at nvme multipathing support and
> initial measurements show huge contention on the multipath lock
> which really defeats the entire point of blk-mq...
> 
> I have yet to report this as my work is still in progress. I'm not sure
> if it's a topic on it's own but I'd love to talk about that as well...

This sounds like you aren't actually using blk-mq for the top-level DM
multipath queue.  And your findings contradicts what I heard from Keith
Busch when I developed request-based DM's blk-mq support, from commit 
bfebd1cdb497 ("dm: add full blk-mq support to request-based DM"):

     "Just providing a performance update. All my fio tests are getting
      roughly equal performance whether accessed through the raw block
      device or the multipath device mapper (~470k IOPS). I could only push
      ~20% of the raw iops through dm before this conversion, so this latest
      tree is looking really solid from a performance standpoint."

> >But in the end we should be able to do strip down the current (rather
> >complex) multipath-tools to just handle topology changes; everything
> >else will be done internally.
> 
> I'd love to see that happening.

Honestly, this needs to be a hardened plan that is hashed out _before_
LSF and then findings presented.  It is a complete waste of time to
debate nuance with Hannes in a one hour session.

Until I implemented the above DM core changes hch and Hannes were very
enthusiastic to throw away the existing DM multipath and multipath-tools
code (the old .request_fn queue lock bottleneck being the straw that
broke the camel's back).  Seems Hannes' enthusiasm hasn't tempered but
his hand-waving is still in full form.

Details matter.  I have no doubts aspects of what we have could be
improved but I really fail to see how moving multipathing to blk-mq is a
constructive way forward.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
@ 2016-01-13 15:42     ` Mike Snitzer
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Snitzer @ 2016-01-13 15:42 UTC (permalink / raw)


On Wed, Jan 13 2016 at  5:50am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
 
> Another (adjacent) topic is multipath performance with blk-mq.
> 
> As I said, I've been looking at nvme multipathing support and
> initial measurements show huge contention on the multipath lock
> which really defeats the entire point of blk-mq...
> 
> I have yet to report this as my work is still in progress. I'm not sure
> if it's a topic on it's own but I'd love to talk about that as well...

This sounds like you aren't actually using blk-mq for the top-level DM
multipath queue.  And your findings contradicts what I heard from Keith
Busch when I developed request-based DM's blk-mq support, from commit 
bfebd1cdb497 ("dm: add full blk-mq support to request-based DM"):

     "Just providing a performance update. All my fio tests are getting
      roughly equal performance whether accessed through the raw block
      device or the multipath device mapper (~470k IOPS). I could only push
      ~20% of the raw iops through dm before this conversion, so this latest
      tree is looking really solid from a performance standpoint."

> >But in the end we should be able to do strip down the current (rather
> >complex) multipath-tools to just handle topology changes; everything
> >else will be done internally.
> 
> I'd love to see that happening.

Honestly, this needs to be a hardened plan that is hashed out _before_
LSF and then findings presented.  It is a complete waste of time to
debate nuance with Hannes in a one hour session.

Until I implemented the above DM core changes hch and Hannes were very
enthusiastic to throw away the existing DM multipath and multipath-tools
code (the old .request_fn queue lock bottleneck being the straw that
broke the camel's back).  Seems Hannes' enthusiasm hasn't tempered but
his hand-waving is still in full form.

Details matter.  I have no doubts aspects of what we have could be
improved but I really fail to see how moving multipathing to blk-mq is a
constructive way forward.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-13 15:42     ` Mike Snitzer
@ 2016-01-13 16:06       ` Sagi Grimberg
  -1 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2016-01-13 16:06 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Hannes Reinecke, lsf-pc, device-mapper development, linux-nvme,
	linux-scsi@vger.kernel.org


> This sounds like you aren't actually using blk-mq for the top-level DM
> multipath queue.

Hmm. I turned on /sys/module/dm_mod/parameters/use_blk_mq and indeed
saw a significant performance improvement. Anything else I was missing?

> And your findings contradicts what I heard from Keith
> Busch when I developed request-based DM's blk-mq support, from commit
> bfebd1cdb497 ("dm: add full blk-mq support to request-based DM"):
>
>       "Just providing a performance update. All my fio tests are getting
>        roughly equal performance whether accessed through the raw block
>        device or the multipath device mapper (~470k IOPS). I could only push
>        ~20% of the raw iops through dm before this conversion, so this latest
>        tree is looking really solid from a performance standpoint."

I too see ~500K IOPs, but my nvme can push ~1500K IOPs...
Its a simple nvme loopback [1] backed by null_blk.

[1]:
http://lists.infradead.org/pipermail/linux-nvme/2015-November/003001.html
http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-loop.2

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
@ 2016-01-13 16:06       ` Sagi Grimberg
  0 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2016-01-13 16:06 UTC (permalink / raw)



> This sounds like you aren't actually using blk-mq for the top-level DM
> multipath queue.

Hmm. I turned on /sys/module/dm_mod/parameters/use_blk_mq and indeed
saw a significant performance improvement. Anything else I was missing?

> And your findings contradicts what I heard from Keith
> Busch when I developed request-based DM's blk-mq support, from commit
> bfebd1cdb497 ("dm: add full blk-mq support to request-based DM"):
>
>       "Just providing a performance update. All my fio tests are getting
>        roughly equal performance whether accessed through the raw block
>        device or the multipath device mapper (~470k IOPS). I could only push
>        ~20% of the raw iops through dm before this conversion, so this latest
>        tree is looking really solid from a performance standpoint."

I too see ~500K IOPs, but my nvme can push ~1500K IOPs...
Its a simple nvme loopback [1] backed by null_blk.

[1]:
http://lists.infradead.org/pipermail/linux-nvme/2015-November/003001.html
http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-loop.2

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-13 15:42     ` Mike Snitzer
@ 2016-01-13 16:18       ` Hannes Reinecke
  -1 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2016-01-13 16:18 UTC (permalink / raw)
  To: Mike Snitzer, Sagi Grimberg
  Cc: lsf-pc, device-mapper development, linux-nvme,
	linux-scsi@vger.kernel.org

On 01/13/2016 04:42 PM, Mike Snitzer wrote:
> On Wed, Jan 13 2016 at  5:50am -0500,
> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>
>> Another (adjacent) topic is multipath performance with blk-mq.
>>
>> As I said, I've been looking at nvme multipathing support and
>> initial measurements show huge contention on the multipath lock
>> which really defeats the entire point of blk-mq...
>>
>> I have yet to report this as my work is still in progress. I'm not sure
>> if it's a topic on it's own but I'd love to talk about that as well...
>
> This sounds like you aren't actually using blk-mq for the top-level DM
> multipath queue.  And your findings contradicts what I heard from Keith
> Busch when I developed request-based DM's blk-mq support, from commit
> bfebd1cdb497 ("dm: add full blk-mq support to request-based DM"):
>
>       "Just providing a performance update. All my fio tests are getting
>        roughly equal performance whether accessed through the raw block
>        device or the multipath device mapper (~470k IOPS). I could only push
>        ~20% of the raw iops through dm before this conversion, so this latest
>        tree is looking really solid from a performance standpoint."
>
>>> But in the end we should be able to do strip down the current (rather
>>> complex) multipath-tools to just handle topology changes; everything
>>> else will be done internally.
>>
>> I'd love to see that happening.
>
> Honestly, this needs to be a hardened plan that is hashed out _before_
> LSF and then findings presented.  It is a complete waste of time to
> debate nuance with Hannes in a one hour session.
>
> Until I implemented the above DM core changes hch and Hannes were very
> enthusiastic to throw away the existing DM multipath and multipath-tools
> code (the old .request_fn queue lock bottleneck being the straw that
> broke the camel's back).  Seems Hannes' enthusiasm hasn't tempered but
> his hand-waving is still in full form.
>
> Details matter.  I have no doubts aspects of what we have could be
> improved but I really fail to see how moving multipathing to blk-mq is a
> constructive way forward.
>
So what is your plan?
Move the full blk-mq infrastructure into device-mapper?

 From my perspective, blk-mq and multipath I/O handling have a lot 
in common (the ->map_queue callback is in effect the same ->map_rq 
does), so I still think it should be possible to leverage that directly.
But for that to happen we would need to address some of the 
mentioned issues like individual queue failures and dynamic queue 
remapping; my hope is that they'll be implemented in the course of 
NVMe over fabrics.

Also note that my proposal is more with the infrastructure 
surrounding multipathing (ie topology detection and setup), so it's 
somewhat orthogonal to your proposal.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
@ 2016-01-13 16:18       ` Hannes Reinecke
  0 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2016-01-13 16:18 UTC (permalink / raw)


On 01/13/2016 04:42 PM, Mike Snitzer wrote:
> On Wed, Jan 13 2016 at  5:50am -0500,
> Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>
>> Another (adjacent) topic is multipath performance with blk-mq.
>>
>> As I said, I've been looking at nvme multipathing support and
>> initial measurements show huge contention on the multipath lock
>> which really defeats the entire point of blk-mq...
>>
>> I have yet to report this as my work is still in progress. I'm not sure
>> if it's a topic on it's own but I'd love to talk about that as well...
>
> This sounds like you aren't actually using blk-mq for the top-level DM
> multipath queue.  And your findings contradicts what I heard from Keith
> Busch when I developed request-based DM's blk-mq support, from commit
> bfebd1cdb497 ("dm: add full blk-mq support to request-based DM"):
>
>       "Just providing a performance update. All my fio tests are getting
>        roughly equal performance whether accessed through the raw block
>        device or the multipath device mapper (~470k IOPS). I could only push
>        ~20% of the raw iops through dm before this conversion, so this latest
>        tree is looking really solid from a performance standpoint."
>
>>> But in the end we should be able to do strip down the current (rather
>>> complex) multipath-tools to just handle topology changes; everything
>>> else will be done internally.
>>
>> I'd love to see that happening.
>
> Honestly, this needs to be a hardened plan that is hashed out _before_
> LSF and then findings presented.  It is a complete waste of time to
> debate nuance with Hannes in a one hour session.
>
> Until I implemented the above DM core changes hch and Hannes were very
> enthusiastic to throw away the existing DM multipath and multipath-tools
> code (the old .request_fn queue lock bottleneck being the straw that
> broke the camel's back).  Seems Hannes' enthusiasm hasn't tempered but
> his hand-waving is still in full form.
>
> Details matter.  I have no doubts aspects of what we have could be
> improved but I really fail to see how moving multipathing to blk-mq is a
> constructive way forward.
>
So what is your plan?
Move the full blk-mq infrastructure into device-mapper?

 From my perspective, blk-mq and multipath I/O handling have a lot 
in common (the ->map_queue callback is in effect the same ->map_rq 
does), so I still think it should be possible to leverage that directly.
But for that to happen we would need to address some of the 
mentioned issues like individual queue failures and dynamic queue 
remapping; my hope is that they'll be implemented in the course of 
NVMe over fabrics.

Also note that my proposal is more with the infrastructure 
surrounding multipathing (ie topology detection and setup), so it's 
somewhat orthogonal to your proposal.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare at suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N?rnberg
GF: F. Imend?rffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG N?rnberg)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-13 16:06       ` Sagi Grimberg
@ 2016-01-13 16:21         ` Mike Snitzer
  -1 siblings, 0 replies; 25+ messages in thread
From: Mike Snitzer @ 2016-01-13 16:21 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, lsf-pc, device-mapper development, linux-nvme,
	linux-scsi@vger.kernel.org

On Wed, Jan 13 2016 at 11:06am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> >This sounds like you aren't actually using blk-mq for the top-level DM
> >multipath queue.
> 
> Hmm. I turned on /sys/module/dm_mod/parameters/use_blk_mq and indeed
> saw a significant performance improvement. Anything else I was missing?

You can enable CONFIG_DM_MQ_DEFAULT so you don't need to manually set
use_blk_mq.

> >And your findings contradicts what I heard from Keith
> >Busch when I developed request-based DM's blk-mq support, from commit
> >bfebd1cdb497 ("dm: add full blk-mq support to request-based DM"):
> >
> >      "Just providing a performance update. All my fio tests are getting
> >       roughly equal performance whether accessed through the raw block
> >       device or the multipath device mapper (~470k IOPS). I could only push
> >       ~20% of the raw iops through dm before this conversion, so this latest
> >       tree is looking really solid from a performance standpoint."
> 
> I too see ~500K IOPs, but my nvme can push ~1500K IOPs...
> Its a simple nvme loopback [1] backed by null_blk.
> 
> [1]:
> http://lists.infradead.org/pipermail/linux-nvme/2015-November/003001.html
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-loop.2

OK, so you're only getting 1/3 of the throughput.  Time for us to hunt
down the bottleneck (before real devices hit it).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
@ 2016-01-13 16:21         ` Mike Snitzer
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Snitzer @ 2016-01-13 16:21 UTC (permalink / raw)


On Wed, Jan 13 2016 at 11:06am -0500,
Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:

> 
> >This sounds like you aren't actually using blk-mq for the top-level DM
> >multipath queue.
> 
> Hmm. I turned on /sys/module/dm_mod/parameters/use_blk_mq and indeed
> saw a significant performance improvement. Anything else I was missing?

You can enable CONFIG_DM_MQ_DEFAULT so you don't need to manually set
use_blk_mq.

> >And your findings contradicts what I heard from Keith
> >Busch when I developed request-based DM's blk-mq support, from commit
> >bfebd1cdb497 ("dm: add full blk-mq support to request-based DM"):
> >
> >      "Just providing a performance update. All my fio tests are getting
> >       roughly equal performance whether accessed through the raw block
> >       device or the multipath device mapper (~470k IOPS). I could only push
> >       ~20% of the raw iops through dm before this conversion, so this latest
> >       tree is looking really solid from a performance standpoint."
> 
> I too see ~500K IOPs, but my nvme can push ~1500K IOPs...
> Its a simple nvme loopback [1] backed by null_blk.
> 
> [1]:
> http://lists.infradead.org/pipermail/linux-nvme/2015-November/003001.html
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-loop.2

OK, so you're only getting 1/3 of the throughput.  Time for us to hunt
down the bottleneck (before real devices hit it).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-13 16:21         ` Mike Snitzer
@ 2016-01-13 16:30           ` Sagi Grimberg
  -1 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2016-01-13 16:30 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Hannes Reinecke, lsf-pc, device-mapper development, linux-nvme,
	linux-scsi@vger.kernel.org


>> Hmm. I turned on /sys/module/dm_mod/parameters/use_blk_mq and indeed
>> saw a significant performance improvement. Anything else I was missing?
>
> You can enable CONFIG_DM_MQ_DEFAULT so you don't need to manually set
> use_blk_mq.

I do, I started out with the manual option to test the improvements but
got tired of it very quickly :)

> OK, so you're only getting 1/3 of the throughput.  Time for us to hunt
> down the bottleneck (before real devices hit it).

I have some initial instrumentation analysis so I'll be happy if we can
work on that (probably can free this thread and move it to dm-devel).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
@ 2016-01-13 16:30           ` Sagi Grimberg
  0 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2016-01-13 16:30 UTC (permalink / raw)



>> Hmm. I turned on /sys/module/dm_mod/parameters/use_blk_mq and indeed
>> saw a significant performance improvement. Anything else I was missing?
>
> You can enable CONFIG_DM_MQ_DEFAULT so you don't need to manually set
> use_blk_mq.

I do, I started out with the manual option to test the improvements but
got tired of it very quickly :)

> OK, so you're only getting 1/3 of the throughput.  Time for us to hunt
> down the bottleneck (before real devices hit it).

I have some initial instrumentation analysis so I'll be happy if we can
work on that (probably can free this thread and move it to dm-devel).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-13 16:18       ` Hannes Reinecke
@ 2016-01-13 16:54         ` Mike Snitzer
  -1 siblings, 0 replies; 25+ messages in thread
From: Mike Snitzer @ 2016-01-13 16:54 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sagi Grimberg, lsf-pc, device-mapper development, linux-nvme,
	linux-scsi@vger.kernel.org

On Wed, Jan 13 2016 at 11:18am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 01/13/2016 04:42 PM, Mike Snitzer wrote:
> >On Wed, Jan 13 2016 at  5:50am -0500,
> >Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> >
> >>Another (adjacent) topic is multipath performance with blk-mq.
> >>
> >>As I said, I've been looking at nvme multipathing support and
> >>initial measurements show huge contention on the multipath lock
> >>which really defeats the entire point of blk-mq...
> >>
> >>I have yet to report this as my work is still in progress. I'm not sure
> >>if it's a topic on it's own but I'd love to talk about that as well...
> >
> >This sounds like you aren't actually using blk-mq for the top-level DM
> >multipath queue.  And your findings contradicts what I heard from Keith
> >Busch when I developed request-based DM's blk-mq support, from commit
> >bfebd1cdb497 ("dm: add full blk-mq support to request-based DM"):
> >
> >      "Just providing a performance update. All my fio tests are getting
> >       roughly equal performance whether accessed through the raw block
> >       device or the multipath device mapper (~470k IOPS). I could only push
> >       ~20% of the raw iops through dm before this conversion, so this latest
> >       tree is looking really solid from a performance standpoint."
> >
> >>>But in the end we should be able to do strip down the current (rather
> >>>complex) multipath-tools to just handle topology changes; everything
> >>>else will be done internally.
> >>
> >>I'd love to see that happening.
> >
> >Honestly, this needs to be a hardened plan that is hashed out _before_
> >LSF and then findings presented.  It is a complete waste of time to
> >debate nuance with Hannes in a one hour session.
> >
> >Until I implemented the above DM core changes hch and Hannes were very
> >enthusiastic to throw away the existing DM multipath and multipath-tools
> >code (the old .request_fn queue lock bottleneck being the straw that
> >broke the camel's back).  Seems Hannes' enthusiasm hasn't tempered but
> >his hand-waving is still in full form.
> >
> >Details matter.  I have no doubts aspects of what we have could be
> >improved but I really fail to see how moving multipathing to blk-mq is a
> >constructive way forward.
> >
> So what is your plan?
> Move the full blk-mq infrastructure into device-mapper?

1.
Identify what the bottleneck(s) are in current request-based DM blk-mq
support (could be training top-level blk_mq request_queue capabilities
based on underlying devices). 

2.
Get blk-mq to be the primary mode of operation (scsi-mq has a role here)
and then eliminate/deprecate the old .request_fn IO path in blk-core.
- this is a secondary concern, DM can happily continue to carry all
permutations of request_fn on blk-mq path(s), blk-mq on request_fn
path(s) and, blk-mq on blk-mq path(s)... but maybe a start is to make
the top-level request-based DM queue _only_ blk-mq -- so effectively 
set CONFIG_DM_MQ_DEFAULT=Y (and eliminate code that supports
CONFIG_DM_MQ_DEFAULT=N).

IMHO, we don't yet have justification to warrant the relatively drastic
change you're floating (pushing multipathing down into blk-mq).

If/when justification is made we'll go from there.

> From my perspective, blk-mq and multipath I/O handling have a lot in
> common (the ->map_queue callback is in effect the same ->map_rq
> does), so I still think it should be possible to leverage that
> directly.
> But for that to happen we would need to address some of the
> mentioned issues like individual queue failures and dynamic queue
> remapping; my hope is that they'll be implemented in the course of
> NVMe over fabrics.
> 
> Also note that my proposal is more with the infrastructure
> surrounding multipathing (ie topology detection and setup), so it's
> somewhat orthogonal to your proposal.

Sure, it is probabky best if focus is placed on where our current
offering can be incrementally improved.  If that means pushing some
historically userspace (multipath-tools) responsibilities down to the
kernel then we can look at it.

What I want to avoid is a shotgun blast of drastic changes.  That
doesn't serve a _very_ enterprise-oriented layer well at all.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
@ 2016-01-13 16:54         ` Mike Snitzer
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Snitzer @ 2016-01-13 16:54 UTC (permalink / raw)


On Wed, Jan 13 2016 at 11:18am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 01/13/2016 04:42 PM, Mike Snitzer wrote:
> >On Wed, Jan 13 2016 at  5:50am -0500,
> >Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> >
> >>Another (adjacent) topic is multipath performance with blk-mq.
> >>
> >>As I said, I've been looking at nvme multipathing support and
> >>initial measurements show huge contention on the multipath lock
> >>which really defeats the entire point of blk-mq...
> >>
> >>I have yet to report this as my work is still in progress. I'm not sure
> >>if it's a topic on it's own but I'd love to talk about that as well...
> >
> >This sounds like you aren't actually using blk-mq for the top-level DM
> >multipath queue.  And your findings contradicts what I heard from Keith
> >Busch when I developed request-based DM's blk-mq support, from commit
> >bfebd1cdb497 ("dm: add full blk-mq support to request-based DM"):
> >
> >      "Just providing a performance update. All my fio tests are getting
> >       roughly equal performance whether accessed through the raw block
> >       device or the multipath device mapper (~470k IOPS). I could only push
> >       ~20% of the raw iops through dm before this conversion, so this latest
> >       tree is looking really solid from a performance standpoint."
> >
> >>>But in the end we should be able to do strip down the current (rather
> >>>complex) multipath-tools to just handle topology changes; everything
> >>>else will be done internally.
> >>
> >>I'd love to see that happening.
> >
> >Honestly, this needs to be a hardened plan that is hashed out _before_
> >LSF and then findings presented.  It is a complete waste of time to
> >debate nuance with Hannes in a one hour session.
> >
> >Until I implemented the above DM core changes hch and Hannes were very
> >enthusiastic to throw away the existing DM multipath and multipath-tools
> >code (the old .request_fn queue lock bottleneck being the straw that
> >broke the camel's back).  Seems Hannes' enthusiasm hasn't tempered but
> >his hand-waving is still in full form.
> >
> >Details matter.  I have no doubts aspects of what we have could be
> >improved but I really fail to see how moving multipathing to blk-mq is a
> >constructive way forward.
> >
> So what is your plan?
> Move the full blk-mq infrastructure into device-mapper?

1.
Identify what the bottleneck(s) are in current request-based DM blk-mq
support (could be training top-level blk_mq request_queue capabilities
based on underlying devices). 

2.
Get blk-mq to be the primary mode of operation (scsi-mq has a role here)
and then eliminate/deprecate the old .request_fn IO path in blk-core.
- this is a secondary concern, DM can happily continue to carry all
permutations of request_fn on blk-mq path(s), blk-mq on request_fn
path(s) and, blk-mq on blk-mq path(s)... but maybe a start is to make
the top-level request-based DM queue _only_ blk-mq -- so effectively 
set CONFIG_DM_MQ_DEFAULT=Y (and eliminate code that supports
CONFIG_DM_MQ_DEFAULT=N).

IMHO, we don't yet have justification to warrant the relatively drastic
change you're floating (pushing multipathing down into blk-mq).

If/when justification is made we'll go from there.

> From my perspective, blk-mq and multipath I/O handling have a lot in
> common (the ->map_queue callback is in effect the same ->map_rq
> does), so I still think it should be possible to leverage that
> directly.
> But for that to happen we would need to address some of the
> mentioned issues like individual queue failures and dynamic queue
> remapping; my hope is that they'll be implemented in the course of
> NVMe over fabrics.
> 
> Also note that my proposal is more with the infrastructure
> surrounding multipathing (ie topology detection and setup), so it's
> somewhat orthogonal to your proposal.

Sure, it is probabky best if focus is placed on where our current
offering can be incrementally improved.  If that means pushing some
historically userspace (multipath-tools) responsibilities down to the
kernel then we can look at it.

What I want to avoid is a shotgun blast of drastic changes.  That
doesn't serve a _very_ enterprise-oriented layer well at all.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-13  9:10 [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign Hannes Reinecke
  2016-01-13 10:50   ` Sagi Grimberg
  2016-01-13 11:08 ` [dm-devel] " Alasdair G Kergon
@ 2016-01-13 17:52 ` Benjamin Marzinski
  2016-01-14  7:25   ` Hannes Reinecke
  2 siblings, 1 reply; 25+ messages in thread
From: Benjamin Marzinski @ 2016-01-13 17:52 UTC (permalink / raw)
  To: device-mapper development; +Cc: lsf-pc, linux-scsi@vger.kernel.org

On Wed, Jan 13, 2016 at 10:10:43AM +0100, Hannes Reinecke wrote:
> Hi all,
> 
> I'd like to attend LSF/MM and would like to present my ideas for a multipath
> redesign.
> 
> The overall idea is to break up the centralized multipath handling in
> device-mapper (and multipath-tools) and delegate to the appropriate
> sub-systems.
> 
> Individually the plan is:
> a) use the 'wwid' sysfs attribute to detect multipath devices;
>    this removes the need of the current 'path_id' functionality
>    in multipath-tools

If all the devices that we support advertise their WWID through sysfs,
I'm all for this. Not needing to worry about callouts or udev sounds
great.

> b) leverage topology information from scsi_dh_alua (which we will
>    have once my ALUA handler update is in) to detect the multipath
>    topology. This removes the need of a 'prio' infrastructure
>    in multipath-tools

What about devices that don't use alua? Or users who want to be able to
pick a specific path to prefer? While I definitely prefer simple, we
can't drop real funtionality to get there. Have you posted your
scsi_dh_alua update somewhere?

I've recently had requests from users to
1. make a path with the TPGS pref bit set be in its own path group with
the highest priority
2. make the weighted prioritizer use persistent information to make its
choice, so its actually useful. This is to deal with the need to prefer a
specific path in a non-alua setup.

Some of the complexity with priorities is there out of necessity.

> c) implement block or scsi events whenever a remote port becomes
>    unavailable. This removes the need of the 'path_checker'
>    functionality in multipath-tools.

I'm not convinced that we will be able to find out when paths come back
online in all cases without some sort of actual polling. Again, I'd love
this to be simpler, but asking all the types of storage we plan to
support to notify us when they are up and down may not be realistic.

> d) leverage these events to handle path-up/path-down events
>    in-kernel

If polling is necessary, I'd rather it be done in userspace. Personally,
I think the checker code is probably the least obectionable part of the
multipath-tools (It's getting all the device information to set up the
devices in the first place and coordinating with uevents that's really
ugly, IMHO).

-Ben

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-13 17:52 ` Benjamin Marzinski
@ 2016-01-14  7:25   ` Hannes Reinecke
  2016-01-14 19:09     ` Bart Van Assche
  2016-01-21  0:38     ` Benjamin Marzinski
  0 siblings, 2 replies; 25+ messages in thread
From: Hannes Reinecke @ 2016-01-14  7:25 UTC (permalink / raw)
  To: dm-devel

On 01/13/2016 06:52 PM, Benjamin Marzinski wrote:
> On Wed, Jan 13, 2016 at 10:10:43AM +0100, Hannes Reinecke wrote:
>> Hi all,
>>
>> I'd like to attend LSF/MM and would like to present my ideas for a multipath
>> redesign.
>>
>> The overall idea is to break up the centralized multipath handling in
>> device-mapper (and multipath-tools) and delegate to the appropriate
>> sub-systems.
>>
>> Individually the plan is:
>> a) use the 'wwid' sysfs attribute to detect multipath devices;
>>     this removes the need of the current 'path_id' functionality
>>     in multipath-tools
>
> If all the devices that we support advertise their WWID through sysfs,
> I'm all for this. Not needing to worry about callouts or udev sounds
> great.
>
As of now, multipath-tools pretty much requires VPD page 0x83 to be 
implemented. So that's not a big issue. Plus I would leave the old 
infrastructure in place, as there are vendors which do provide their 
own path_id mechanism.

>> b) leverage topology information from scsi_dh_alua (which we will
>>     have once my ALUA handler update is in) to detect the multipath
>>     topology. This removes the need of a 'prio' infrastructure
>>     in multipath-tools
>
> What about devices that don't use alua? Or users who want to be able to
> pick a specific path to prefer? While I definitely prefer simple, we
> can't drop real funtionality to get there. Have you posted your
> scsi_dh_alua update somewhere?
>
Yep. Check the linux-scsi mailing list.

> I've recently had requests from users to
> 1. make a path with the TPGS pref bit set be in its own path group with
> the highest priority
Isn't that always the case?
Paths with TPGS pref bit set will have a different priority than 
those without the pref bit, and they should always have the highest 
priority.
I would rather consider this an error in the prioritizer ...

> 2. make the weighted prioritizer use persistent information to make its
> choice, so its actually useful. This is to deal with the need to prefer a
> specific path in a non-alua setup.
>
yeah, I had a similar request. And we should distinguish between the 
individual transports, as paths might be coming in via different 
protocols/transports.

> Some of the complexity with priorities is there out of necessity.
>
Agree.

>> c) implement block or scsi events whenever a remote port becomes
>>     unavailable. This removes the need of the 'path_checker'
>>     functionality in multipath-tools.
>
> I'm not convinced that we will be able to find out when paths come back
> online in all cases without some sort of actual polling. Again, I'd love
> this to be simpler, but asking all the types of storage we plan to
> support to notify us when they are up and down may not be realistic.
>
Currently we have three main transports: FC, iSCSI, and SAS.
FC has reliable path events via RSCN, as this is also what the 
drivers rely on internally (hello, zfcp :-)
If _that_ doesn't work we're in a deep hole anyway, cf the 
eh_deadline mechanism we had to implement.
iSCSI has the NOP mechanism, which in effect is polling on the iSCSI 
level. That would provide equivalent information; unfortunately not 
every target supports that.
But even without iSCSI has it's own error recovery logic, which will 
kick in whenever an error is detected. So we can as well hook into 
that and use it to send events.
And for SAS we have a far better control over the attached fabric, 
so it should be possible to get reliable events there, too.

That only leaves the non-transport drivers like virtio or the 
various RAID-like cards, which indeed might not be able to provide 
us with events.

So I would propose to make that optional; if events are supported 
(which could be figured out via sysfs) we should be using them and 
don't insist on polling, but fall back to the original methods if we 
don't have them.

>> d) leverage these events to handle path-up/path-down events
>>     in-kernel
>
> If polling is necessary, I'd rather it be done in userspace. Personally,
> I think the checker code is probably the least obectionable part of the
> multipath-tools (It's getting all the device information to set up the
> devices in the first place and coordinating with uevents that's really
> ugly, IMHO).
>
And this is where I do disagree.
The checker code is causing massive lock congestion on large-scale 
systems as there is precisely _one_ checker thread, having to check 
all devices serially. If paths go down on a large system we're 
having a flood of udev events, which we cannot handle in-time as the 
checkerloop holds the lock trying to check all those paths.

So being able to do away with the checkerloop is a major improvement 
there.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-14  7:25   ` Hannes Reinecke
@ 2016-01-14 19:09     ` Bart Van Assche
  2016-01-15  7:12       ` Hannes Reinecke
  2016-01-21  0:38     ` Benjamin Marzinski
  1 sibling, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2016-01-14 19:09 UTC (permalink / raw)
  To: device-mapper development, Hannes Reinecke, Benjamin Marzinski

On 01/13/2016 11:25 PM, Hannes Reinecke wrote:
> On 01/13/2016 06:52 PM, Benjamin Marzinski wrote:
>> On Wed, Jan 13, 2016 at 10:10:43AM +0100, Hannes Reinecke wrote:
>>> c) implement block or scsi events whenever a remote port becomes
>>>     unavailable. This removes the need of the 'path_checker'
>>>     functionality in multipath-tools.
>>
>> I'm not convinced that we will be able to find out when paths come back
>> online in all cases without some sort of actual polling. Again, I'd love
>> this to be simpler, but asking all the types of storage we plan to
>> support to notify us when they are up and down may not be realistic.
>
> Currently we have three main transports: FC, iSCSI, and SAS.

Hello Hannes,

Since several years the Linux SRP initiator driver also has reliable and 
efficient H.A. support. The IB spec supports port state change 
notifications. But whether or not port state information affects the 
path state should be configurable. Several IB users wouldn't like it if 
port state information would affect the path state because the time 
during which a port is down can be shorter than the time during which an 
IB HCA keeps retrying to send a packet.

Bart.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-14 19:09     ` Bart Van Assche
@ 2016-01-15  7:12       ` Hannes Reinecke
  0 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2016-01-15  7:12 UTC (permalink / raw)
  To: Bart Van Assche, device-mapper development, Benjamin Marzinski

On 01/14/2016 08:09 PM, Bart Van Assche wrote:
> On 01/13/2016 11:25 PM, Hannes Reinecke wrote:
>> On 01/13/2016 06:52 PM, Benjamin Marzinski wrote:
>>> On Wed, Jan 13, 2016 at 10:10:43AM +0100, Hannes Reinecke wrote:
>>>> c) implement block or scsi events whenever a remote port becomes
>>>>     unavailable. This removes the need of the 'path_checker'
>>>>     functionality in multipath-tools.
>>>
>>> I'm not convinced that we will be able to find out when paths
>>> come back
>>> online in all cases without some sort of actual polling. Again,
>>> I'd love
>>> this to be simpler, but asking all the types of storage we plan to
>>> support to notify us when they are up and down may not be realistic.
>>
>> Currently we have three main transports: FC, iSCSI, and SAS.
>
> Hello Hannes,
>
> Since several years the Linux SRP initiator driver also has reliable
> and efficient H.A. support. The IB spec supports port state change
> notifications. But whether or not port state information affects the
> path state should be configurable. Several IB users wouldn't like it
> if port state information would affect the path state because the
> time during which a port is down can be shorter than the time during
> which an IB HCA keeps retrying to send a packet.
>
Oooh, but of course I've forgotten SRP. Sorry, Bart; it's just not 
on my radar (what with me having no Infiniband equipment to speak of 
...)

But the above really sounds similar to the dev_loss_tmo mechanism we 
have on FC. Maybe it's worth looking into if we could have a similar 
mechanism on SRP.

The point here is that (on FC) we have the following flow of events:

Path loss
-> start dev_loss_tmo
-> rport set to 'blocked'
-> RSCN received
-> move to final rport state (online or gone)
-> unblock rport
-> stop dev_loss_tmo (if rport is online) or
-> dev_loss_tmo fires and removes rport

atm we're being notified once the port is moved to the final state, 
as that's when I/O continues or is being aborted and we're getting 
the I/O completion back.
With path events we could react to the actual path loss, and 
redirect I/O to another path directly when the path loss occurs.
But this really is a matter of policy; it might be that the path 
switch is taking long then the path interruption.
So this needs to be evaluated properly.
But at least we'll be notified allowing us to _do_ these kind of test.
ATM we don't really have a chance to do that.

I'm very willing to look at SRP to see if we can improve things there.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign
  2016-01-14  7:25   ` Hannes Reinecke
  2016-01-14 19:09     ` Bart Van Assche
@ 2016-01-21  0:38     ` Benjamin Marzinski
  1 sibling, 0 replies; 25+ messages in thread
From: Benjamin Marzinski @ 2016-01-21  0:38 UTC (permalink / raw)
  To: device-mapper development

On Thu, Jan 14, 2016 at 08:25:52AM +0100, Hannes Reinecke wrote:
> On 01/13/2016 06:52 PM, Benjamin Marzinski wrote:
> >On Wed, Jan 13, 2016 at 10:10:43AM +0100, Hannes Reinecke wrote:
> >>b) leverage topology information from scsi_dh_alua (which we will
> >>    have once my ALUA handler update is in) to detect the multipath
> >>    topology. This removes the need of a 'prio' infrastructure
> >>    in multipath-tools
> >
> >What about devices that don't use alua? Or users who want to be able to
> >pick a specific path to prefer? While I definitely prefer simple, we
> >can't drop real funtionality to get there. Have you posted your
> >scsi_dh_alua update somewhere?
> >
> Yep. Check the linux-scsi mailing list.

But we still need to be able to handle non-alua devices.

> 
> >I've recently had requests from users to
> >1. make a path with the TPGS pref bit set be in its own path group with
> >the highest priority
> Isn't that always the case?
> Paths with TPGS pref bit set will have a different priority than those
> without the pref bit, and they should always have the highest priority.
> I would rather consider this an error in the prioritizer ...

For a while that was the case.

commit b330bf8a5e6a29b51af0d8b4088e0d8554e5cfb4

changed that, and you sent it. Now, if a the preferred path is
active/optimized, it will get placed in a priority group with other
active/optimized paths.  The SCSI spec is kind of unclear about how to
handle the preferred bit, so I can see either way making sense. When the
path with the preferred bit was all by itself, I had requests to group
it like this.  Now that it gets grouped, I have requests to make it in
its own priority group.  I'm pretty sure that the real answer is to
allow users to choose how to use the pref bit when grouping paths.

> >>c) implement block or scsi events whenever a remote port becomes
> >>    unavailable. This removes the need of the 'path_checker'
> >>    functionality in multipath-tools.
> >
> >I'm not convinced that we will be able to find out when paths come back
> >online in all cases without some sort of actual polling. Again, I'd love
> >this to be simpler, but asking all the types of storage we plan to
> >support to notify us when they are up and down may not be realistic.
> >
> Currently we have three main transports: FC, iSCSI, and SAS.
> FC has reliable path events via RSCN, as this is also what the drivers rely
> on internally (hello, zfcp :-)
> If _that_ doesn't work we're in a deep hole anyway, cf the eh_deadline
> mechanism we had to implement.

I do remember issues over the years where paths have failed without a
RSCNs being generated (brocade switches come to mind, IIRC). And simply
because people are quicker to notice when a failed path isn't getting
dealt with, than they are about a path coming back isn't getting dealt
with, I do worry that we'll find instances where paths are coming back
without RSCNs. And while multipathd's preemptive path checking is nice
to have.  Finding out when the failed paths are usable again is the
really important thing it does.

> 

> iSCSI has the NOP mechanism, which in effect is polling on the iSCSI level.
> That would provide equivalent information; unfortunately not every target
> supports that.
> But even without iSCSI has it's own error recovery logic, which will kick in
> whenever an error is detected. So we can as well hook into that and use it
> to send events.
> And for SAS we have a far better control over the attached fabric, so it
> should be possible to get reliable events there, too.
> 
> That only leaves the non-transport drivers like virtio or the various
> RAID-like cards, which indeed might not be able to provide us with events.
> 
> So I would propose to make that optional; if events are supported (which
> could be figured out via sysfs) we should be using them and don't insist on
> polling, but fall back to the original methods if we don't have them.

As long as there are failbacks that can be used for cases where we arent
getting the events that we need. I'm not against multipath leveraging
the layers beneath it for this information.
 
> >>d) leverage these events to handle path-up/path-down events
> >>    in-kernel
> >
> >If polling is necessary, I'd rather it be done in userspace. Personally,
> >I think the checker code is probably the least obectionable part of the
> >multipath-tools (It's getting all the device information to set up the
> >devices in the first place and coordinating with uevents that's really
> >ugly, IMHO).
> >
> And this is where I do disagree.
> The checker code is causing massive lock congestion on large-scale systems
> as there is precisely _one_ checker thread, having to check all devices
> serially. If paths go down on a large system we're having a flood of udev
> events, which we cannot handle in-time as the checkerloop holds the lock
> trying to check all those paths.
> 
> So being able to do away with the checkerloop is a major improvement there.

But what replaces in for the cases where we do need to poll the device?

I'm not going to argue that multipathd's locking and threading are well
designed. Certainly, uevents *should* be able to continue to be
processed while the checker thread is running. We only need to lock the
vectors while we are changing or traversing them, and with the addition
of some in_use counters we would could get by with very minimal locking
on the paths/maps themselves.

My personal daydream has always been to get rid of the event waiter
threads, since we already get uevents for pretty much all the things
they care about.

-Ben

> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke		   Teamlead Storage & Networking
> hare@suse.de			               +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
> HRB 21284 (AG Nürnberg)
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2016-01-21  0:38 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-13  9:10 [LSF/MM ATTEND][LSF/MM TOPIC] Multipath redesign Hannes Reinecke
2016-01-13 10:50 ` Sagi Grimberg
2016-01-13 10:50   ` Sagi Grimberg
2016-01-13 11:46   ` Hannes Reinecke
2016-01-13 11:46     ` Hannes Reinecke
2016-01-13 15:42   ` Mike Snitzer
2016-01-13 15:42     ` Mike Snitzer
2016-01-13 16:06     ` Sagi Grimberg
2016-01-13 16:06       ` Sagi Grimberg
2016-01-13 16:21       ` Mike Snitzer
2016-01-13 16:21         ` Mike Snitzer
2016-01-13 16:30         ` Sagi Grimberg
2016-01-13 16:30           ` Sagi Grimberg
2016-01-13 16:18     ` Hannes Reinecke
2016-01-13 16:18       ` Hannes Reinecke
2016-01-13 16:54       ` Mike Snitzer
2016-01-13 16:54         ` Mike Snitzer
2016-01-13 11:08 ` [dm-devel] " Alasdair G Kergon
2016-01-13 11:17   ` Hannes Reinecke
2016-01-13 11:25     ` Alasdair G Kergon
2016-01-13 17:52 ` Benjamin Marzinski
2016-01-14  7:25   ` Hannes Reinecke
2016-01-14 19:09     ` Bart Van Assche
2016-01-15  7:12       ` Hannes Reinecke
2016-01-21  0:38     ` Benjamin Marzinski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.