* Notes from the four separate IO track sessions at LSF/MM
@ 2016-04-27 23:39 James Bottomley
  2016-04-28 12:11 ` Mike Snitzer
  2016-04-29 16:45 ` [dm-devel] Notes from the four separate IO track sessions at LSF/MM Benjamin Marzinski
  0 siblings, 2 replies; 26+ messages in thread
From: James Bottomley @ 2016-04-27 23:39 UTC (permalink / raw)
  To: linux-scsi, linux-block, device-mapper development; +Cc: lsf

This year, we only had two scribes from LWN.net, not three, so there
won't be any coverage of the IO track when we split into three tracks.
To cover for that, here are my notes from the four separate sessions.

===

Multiqueue Interrupt and Queue Assignment; Hannes Reinecke
----------------------------------------------------------

All multiqueue devices need an interrupt allocation policy and an
affinity, but who should set it?  Christoph Hellwig suggested making
what NVMe currently does the default (and has patches).

There then followed a discussion about interrupt allocation which
concluded that realistically, we do want the block layer doing it
because having a single policy for the whole system is by far the
simplest mechanism.  We should wait for evidence that this can't be
made to work at all (which we don't have) before we try to tamper with
it.
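
For concreteness, here is a minimal userspace sketch of the kind of
default spreading policy being discussed (illustrative only, and not
Christoph's actual patch set): each hardware queue's interrupt is
serviced by a disjoint slice of the online CPUs.

#include <stdio.h>

/* Toy policy: queue q is handled by every CPU whose index is congruent
 * to q modulo the number of queues, so the CPUs are spread evenly. */
static void print_queue_affinity(unsigned int nr_queues, unsigned int nr_cpus)
{
        for (unsigned int q = 0; q < nr_queues; q++) {
                printf("hw queue %u -> cpus", q);
                for (unsigned int cpu = q; cpu < nr_cpus; cpu += nr_queues)
                        printf(" %u", cpu);
                printf("\n");
        }
}

int main(void)
{
        print_queue_affinity(4, 16);    /* e.g. 4 hardware queues, 16 CPUs */
        return 0;
}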


Blk-mq Implementor Feedback; Hannes Reinecke, Matthew Wilcox, Keith Busch
-------------------------------------------------------------------------

This began with a discussion of tag allocation policy: blk-mq only
allows for a host-wide tag space which is partitioned amongst the
number of hardware queues.  Potentially this leads to a tag starvation
issue where the number of host tags is small and the number of hardware
queues is large.  Consensus was that the problem is currently
theoretical but that driver writers should take care to make sure they
don't allocate too many hardware queues if they have a limited number
of tags.
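
As a rough illustration of the partitioning being described (this is
not the actual blk-mq code; the numbers are made up), the per-queue
depth is simply the host-wide tag count divided across the hardware
queues, which is where the starvation worry comes from:

#include <stdio.h>

static unsigned int per_queue_depth(unsigned int host_tags,
                                    unsigned int nr_hw_queues)
{
        return host_tags / nr_hw_queues;  /* host-wide space, split evenly */
}

int main(void)
{
        /* e.g. a host with only 64 tags but 32 hardware queues */
        unsigned int depth = per_queue_depth(64, 32);

        printf("per-queue depth = %u\n", depth);
        if (depth < 4)
                printf("driver should probably allocate fewer hw queues\n");
        return 0;
}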

The next problem was command abort, because of a potential tag re-use
issue.  After discussion it was agreed there should be no problem
because the tag is held until the abort completes (and the command is
killed) or error handling is escalated (in which case the whole host is
quiesced).  There was a lot of complaining about the host quiesce part
because it takes a while to do on a fully loaded host and path
switchover cannot occur until it has completed, so multipath recovery
takes much longer than it should.  The general agreement was that this
could be alleviated somewhat if we could quiesce a single LUN first and
issue a LUN reset rather than doing the whole host after the abort.
Mike Christie will send patches for LUN quiescing.
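
To make the tag re-use argument concrete, a toy sketch (illustrative
only, not the real tag allocator): the tag is only returned to the free
pool once the abort has completed and the command has been killed, so
it cannot be handed out again in the meantime.

#include <stdbool.h>
#include <stdio.h>

#define NR_TAGS 8
static bool tag_in_use[NR_TAGS];

static int get_tag(void)
{
        for (int i = 0; i < NR_TAGS; i++)
                if (!tag_in_use[i]) {
                        tag_in_use[i] = true;
                        return i;
                }
        return -1;                      /* no free tag: caller must wait */
}

/* called only once the abort has completed and the command is killed */
static void put_tag_after_abort(int tag)
{
        tag_in_use[tag] = false;        /* only now can the tag be re-used */
}

int main(void)
{
        int t = get_tag();

        printf("command issued with tag %d\n", t);
        /* ... command times out, an abort is issued and completes ... */
        put_tag_after_abort(t);
        printf("tag %d is free for re-use\n", t);
        return 0;
}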


IO Cost Estimation; Tejun Heo
-----------------------------

This session began with a description of how the block cgroup currently
works: it has two modes: bandwidth limiting, which works regardless of
the I/O scheduler, and proportional allocation, which only works with
the CFQ I/O scheduler.  Obviously, because blk-mq currently has no
scheduler, it's not possible to do proportional allocation with it.
The general discussion then opened with how we do correct I/O cost
estimation even with blk-mq so we can do some sort of proportional
allocation.  This is actually a very hard problem to solve,
particularly now that we have to consider SSDs, because a large set of
sequential writes is much less likely to excite the write amplification
caused by garbage collection than a set of scattered writes.  In an
ideal world, we'd like to penalise the process doing the scattered
writes for all of the write amplification as well.  However, after much
discussion, it was agreed that the heuristics to try to do this would
end up being very complex and would likely fail in corner cases anyway,
so the best we could do was assess proportions based on request
latency, even though that would not be completely fair to some
workloads.
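
A minimal sketch of the latency-based accounting the room converged on
(illustrative only, not the eventual implementation): each cgroup is
charged the completion latency of its requests, and its proportion is
its charge over the total.

#include <stdio.h>

#define NR_CGROUPS 2

static double charged_us[NR_CGROUPS];

/* charge a cgroup the completion latency of one of its requests */
static void account_completion(int cg, double latency_us)
{
        charged_us[cg] += latency_us;
}

int main(void)
{
        account_completion(0, 80.0);    /* sequential writer: cheap I/O       */
        account_completion(1, 900.0);   /* scattered writer: GC-amplified I/O */

        double total = charged_us[0] + charged_us[1];

        for (int cg = 0; cg < NR_CGROUPS; cg++)
                printf("cgroup %d consumed %.0f%% of the device time\n",
                       cg, 100.0 * charged_us[cg] / total);
        return 0;
}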

Multipath - Mike Snitzer
------------------------

Mike began with a request for feedback, which quickly led to the
complaint that recovery time (and how you recover) was one of the
biggest issues in device mapper multipath (dmmp) for those in the room.
This is primarily caused by having to wait for the pending I/O to be
released by the failing path.  Christoph Hellwig said that NVMe would
soon do path failover internally (without any need for dmmp) and asked
if people would be interested in a more general implementation of this.
Martin Petersen said he would look at implementing this in SCSI as
well.  The discussion noted that internal path failover only works in
the case where the transport is the same across all the paths and
supports some type of path-down notification.  In any case where this
isn't true (such as failover from fibre channel to iSCSI) you still
have to use dmmp.  Another benefit of internal path failover is that
the transport-level code is much better qualified to recognise when the
same device appears over multiple paths, so it should make a lot of the
configuration seamless.  The consequence for end users would be that
SCSI devices would become handles for end devices rather than handles
for paths to end devices.

James



* Re: Notes from the four separate IO track sessions at LSF/MM
  2016-04-27 23:39 Notes from the four separate IO track sessions at LSF/MM James Bottomley
@ 2016-04-28 12:11 ` Mike Snitzer
  2016-04-28 15:40   ` James Bottomley
  2016-04-29 16:45 ` [dm-devel] Notes from the four separate IO track sessions at LSF/MM Benjamin Marzinski
  1 sibling, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2016-04-28 12:11 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-scsi, linux-block, device-mapper development, lsf

On Wed, Apr 27 2016 at  7:39pm -0400,
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
 
> Multipath - Mike Snitzer
> ------------------------
> 
> Mike began with a request for feedback, which quickly lead to the
> complaint that recovery time (and how you recover) was one of the
> biggest issues in device mapper multipath (dmmp) for those in the room.
>   This is primarily caused by having to wait for the pending I/O to be
> released by the failing path. Christoph Hellwig said that NVMe would
> soon do path failover internally (without any need for dmmp) and asked
> if people would be interested in a more general implementation of this.
>  Martin Petersen said he would look at implementing this in SCSI as
> well.  The discussion noted that internal path failover only works in
> the case where the transport is the same across all the paths and
> supports some type of path down notification.  In any cases where this
> isn't true (such as failover from fibre channel to iSCSI) you still
> have to use dmmp.  Other benefits of internal path failover are that
> the transport level code is much better qualified to recognise when the
> same device appears over multiple paths, so it should make a lot of the
> configuration seamless.  The consequence for end users would be that
> now SCSI devices would become handles for end devices rather than
> handles for paths to end devices.

I must've been so distracted by the relatively baseless nature of
Christoph's desire to absorb multipath functionality into NVMe (at least
as Christoph presented/defended) that I completely missed the existing
SCSI error recovery woes being framed as DM multipath's fault.
There was a session earlier in LSF that dealt with the inefficiencies of
SCSI error recovery and the associated issues have _nothing_ to do with
DM multipath.  So please clarify how pushing multipath (failover) down
into the drivers will fix the much more problematic SCSI error recovery.

Also, there was a lot of cross-talk during this session so I never heard
that Martin is talking about following Christoph's approach to push
multipath (failover) down to SCSI.  In fact Christoph advocated that DM
multipath carry on being used for SCSI and that only NVMe adopt his
approach.  So this comes as a surprise.

What wasn't captured in your summary is the complete lack of substance
to justify these changes.  The verdict is still very much out on the
need for NVMe to grow multipath functionality (let alone SCSI drivers).
Any work that is done in this area really needs to be justified with
_real_ data.

The other _major_ gripe expressed during the session was how the
userspace multipath-tools are too difficult and complex for users.
IIRC these complaints really weren't expressed in ways that could be
used to actually _fix_ the perceived shortcomings but nevertheless...

Full disclosure: I'll be looking at reinstating bio-based DM multipath to
regain efficiencies that now really matter when issuing IO to extremely
fast devices (e.g. NVMe).  bio cloning is now very cheap (due to
immutable biovecs), coupled with the emerging multipage biovec work that
will help construct larger bios, so I think it is worth pursuing to at
least keep our options open.

Mike


* Re: Notes from the four separate IO track sessions at LSF/MM
  2016-04-28 12:11 ` Mike Snitzer
@ 2016-04-28 15:40   ` James Bottomley
  2016-04-28 15:53     ` [Lsf] " Bart Van Assche
  2016-05-26  2:38     ` bio-based DM multipath is back from the dead [was: Re: Notes from the four separate IO track sessions at LSF/MM] Mike Snitzer
  0 siblings, 2 replies; 26+ messages in thread
From: James Bottomley @ 2016-04-28 15:40 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-scsi, linux-block, device-mapper development, lsf

On Thu, 2016-04-28 at 08:11 -0400, Mike Snitzer wrote:
> On Wed, Apr 27 2016 at  7:39pm -0400,
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>  
> > Multipath - Mike Snitzer
> > ------------------------
> > 
> > Mike began with a request for feedback, which quickly lead to the
> > complaint that recovery time (and how you recover) was one of the
> > biggest issues in device mapper multipath (dmmp) for those in the room.
> >   This is primarily caused by having to wait for the pending I/O to be
> > released by the failing path. Christoph Hellwig said that NVMe would
> > soon do path failover internally (without any need for dmmp) and asked
> > if people would be interested in a more general implementation of this.
> >  Martin Petersen said he would look at implementing this in SCSI as
> > well.  The discussion noted that internal path failover only works in
> > the case where the transport is the same across all the paths and
> > supports some type of path down notification.  In any cases where this
> > isn't true (such as failover from fibre channel to iSCSI) you still
> > have to use dmmp.  Other benefits of internal path failover are that
> > the transport level code is much better qualified to recognise when the
> > same device appears over multiple paths, so it should make a lot of the
> > configuration seamless.  The consequence for end users would be that
> > now SCSI devices would become handles for end devices rather than
> > handles for paths to end devices.
> 
> I must've been so distracted by the relatively baseless nature of
> Christoph's desire to absorb multipath functionality into NVMe (at least
> as Christoph presented/defended) that I completely missed the existing
> SCSI error recovery woes as something that is DM multipath's fault.
> There was a session earlier in LSF that dealt with the inefficiencies of
> SCSI error recovery and the associated issues have _nothing_ to do with
> DM multipath.  So please clarify how pushing multipath (failover) down
> into the drivers will fix the much more problematic SCSI error recovery.

The specific problem in SCSI is that we can't signal path failure until
the mid layer eh has completed, which can take ages.  I don't believe
anyone said this was the fault of dm.  However, it does have a visible
consequence in dm in that path failover takes forever (in machine
time).

One way of fixing this is to move failover to the transport layer where
path failure is signalled and take the commands away from the failed
path and on to an alternative before the mid-layer is even aware we
have a problem.
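
To illustrate the idea (a toy sketch only, not actual SCSI or transport
code; the structures and names are made up): as soon as the transport
signals path failure, the pending commands are moved straight onto an
alternative path rather than waiting for mid-layer eh to release them.

#include <stdio.h>

#define PENDING 3

struct path {
        const char *name;
        int queue[PENDING];     /* tags of commands queued on this path */
        int nr_queued;
};

/* move everything still pending on the failed path to the alternative */
static void fail_over(struct path *failed, struct path *alt)
{
        for (int i = 0; i < failed->nr_queued; i++)
                alt->queue[alt->nr_queued++] = failed->queue[i];
        failed->nr_queued = 0;
}

int main(void)
{
        struct path a = { "fc-path-0", { 11, 12, 13 }, PENDING };
        struct path b = { "fc-path-1", { 0 }, 0 };

        fail_over(&a, &b);      /* the transport noticed the link drop */
        printf("%s now owns %d requeued commands\n", b.name, b.nr_queued);
        return 0;
}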

> Also, there was a lot of cross-talk during this session so I never heard
> that Martin is talking about following Christoph's approach to push
> multipath (failover) down to SCSI.  In fact Christoph advocated that DM
> multipath carry on being used for SCSI and that only NVMe adopt his
> approach.  So this comes as a surprise.

Well, one other possibility is to take the requests away much sooner in
the eh cycle.  The thing that keeps us from signalling path failure is
the fact that eh is using the existing commands to do the recovery so
they're not released by the mid-layer until eh is completed.  In theory
we can release the commands earlier once we know we hit the device hard
enough.  However, I've got to say doing the failover before eh begins
does look to be much faster.

> What wasn't captured in your summary is the complete lack of substance
> to justify these changes.  The verdict is still very much out on the
> need for NVMe to grow multipath functionality (let alone SCSI drivers).
> Any work that i done in this area really needs to be justified with
> _real_ data.

Well, the entire room, that's vendors, users and implementors
complained that path failover takes far too long.  I think in their
minds this is enough substance to go on.

> The other _major_ gripe expressed during the session was how the
> userspace multipath-tools are too difficult and complex for users.
> IIRC these complaints really weren't expressed in ways that could be
> used to actually _fix_ the perceived shortcomings but nevertheless...

Tooling could be better, but it isn't going to fix the time to failover
problem.

> Full disclosure: I'll be looking at reinstating bio-based DM multipath to
> regain efficiencies that now really matter when issuing IO to extremely
> fast devices (e.g. NVMe).  bio cloning is now very cheap (due to
> immutable biovecs), coupled with the emerging multipage biovec work that
> will help construct larger bios, so I think it is worth pursuing to at
> least keep our options open.

OK, but remember the reason we moved from bio to request was partly to
be nearer to the device but also because at that time requests were
accumulations of bios which had to be broken out, go back up the stack
individually and be re-elevated, which adds to the inefficiency.  In
theory the bio splitting work will mean that we only have one or two
split bios per request (because they were constructed from a split up
huge bio), but when we send them back to the top to be reconstructed as
requests there's no guarantee that the split will be correct a second
time around and we might end up resplitting the already split bios.  If
you do reassembly into the huge bio again before resending it down the next
queue, that's starting to look like quite a lot of work as well.
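
To make the resplit concern concrete, a toy sketch (illustrative only;
the sizes and limits are made up, and this is not the kernel's bio
splitting code): a huge bio split against the top queue's limit can be
split again against a lower queue's different limit.

#include <stdio.h>

/* how many pieces an I/O of 'sectors' becomes under a per-queue limit */
static unsigned int nr_splits(unsigned int sectors, unsigned int queue_max)
{
        return (sectors + queue_max - 1) / queue_max;
}

int main(void)
{
        unsigned int huge_bio = 2048;   /* sectors (made-up size)   */
        unsigned int top_max  = 1024;   /* top (dm) queue limit     */
        unsigned int low_max  = 768;    /* lower (path) queue limit */

        unsigned int top = nr_splits(huge_bio, top_max);
        unsigned int low = top * nr_splits(huge_bio / top, low_max);

        printf("top queue: %u split bios\n", top);
        printf("lower queue: %u bios after resplitting\n", low);
        return 0;
}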


James



* Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM
  2016-04-28 15:40   ` James Bottomley
@ 2016-04-28 15:53     ` Bart Van Assche
  2016-04-28 16:19       ` Knight, Frederick
  2016-04-28 16:23       ` Laurence Oberman
  2016-05-26  2:38     ` bio-based DM multipath is back from the dead [was: Re: Notes from the four separate IO track sessions at LSF/MM] Mike Snitzer
  1 sibling, 2 replies; 26+ messages in thread
From: Bart Van Assche @ 2016-04-28 15:53 UTC (permalink / raw)
  To: James Bottomley, Mike Snitzer
  Cc: linux-block, lsf, device-mapper development, linux-scsi

On 04/28/2016 08:40 AM, James Bottomley wrote:
> Well, the entire room, that's vendors, users and implementors
> complained that path failover takes far too long.  I think in their
> minds this is enough substance to go on.

The only complaints I heard about path failover taking too long came 
from people working on FC drivers. Aren't SCSI transport layer 
implementations expected to fail I/O after fast_io_fail_tmo expired 
instead of waiting until the SCSI error handler has finished? If so, why 
is it considered an issue that error handling for the FC protocol can 
take very long (hours)?

Thanks,

Bart.


* RE: [Lsf] Notes from the four separate IO track sessions at LSF/MM
  2016-04-28 15:53     ` [Lsf] " Bart Van Assche
@ 2016-04-28 16:19       ` Knight, Frederick
  2016-04-28 16:37         ` Bart Van Assche
  2016-04-28 17:33         ` James Bottomley
  2016-04-28 16:23       ` Laurence Oberman
  1 sibling, 2 replies; 26+ messages in thread
From: Knight, Frederick @ 2016-04-28 16:19 UTC (permalink / raw)
  To: Bart Van Assche, James Bottomley, Mike Snitzer
  Cc: linux-block, lsf, device-mapper development, linux-scsi

There are multiple possible situations being intermixed in this discussion.  First, I assume you're talking only about random access devices (if you try transport-level error recovery on a sequential access device - tape or SMR disk - there are lots of additional complexities).

Failures can occur at multiple places:
a) Transport layer failures that the transport layer is able to detect quickly;
b) SCSI device layer failures that the transport layer never even knows about.

For (a) there are two competing goals.  If a port drops off the fabric and comes back again, should you be able to just recover and continue?  But how long do you wait during that drop?  Some devices use this technique to "move" a WWPN from one place to another.  The port drops from the fabric, and a short time later, shows up again (the WWPN moves from one physical port to a different physical port). There are FC driver layer timers that define the length of time allowed for this operation.  The goal is fast failover, but not too fast - because too fast will break this kind of "transparent failover".  This timer also allows for the "OH crap, I pulled the wrong cable - put it back in; quick" kind of stupid user bug.

For (b) the transport never has a failure.  A LUN (or a group of LUNs) has an ALUA transition from one set of ports to a different set of ports.  Some of the LUNs on the port continue to work just fine, but others enter ALUA TRANSITION state so they can "move" to a different part of the hardware.  After the move completes, you now have different sets of optimized and non-optimized paths (or possibly standby, or unavailable).  The transport will never even know this happened.  This kind of "failure" is handled by the SCSI layer drivers.

There are other cases too, but these are the most common.

	Fred

-----Original Message-----
From: lsf-bounces@lists.linux-foundation.org [mailto:lsf-bounces@lists.linux-foundation.org] On Behalf Of Bart Van Assche
Sent: Thursday, April 28, 2016 11:54 AM
To: James Bottomley; Mike Snitzer
Cc: linux-block@vger.kernel.org; lsf@lists.linux-foundation.org; device-mapper development; linux-scsi
Subject: Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM

On 04/28/2016 08:40 AM, James Bottomley wrote:
> Well, the entire room, that's vendors, users and implementors
> complained that path failover takes far too long.  I think in their
> minds this is enough substance to go on.

The only complaints I heard about path failover taking too long came 
from people working on FC drivers. Aren't SCSI transport layer 
implementations expected to fail I/O after fast_io_fail_tmo expired 
instead of waiting until the SCSI error handler has finished? If so, why 
is it considered an issue that error handling for the FC protocol can 
take very long (hours)?

Thanks,

Bart.


* Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM
  2016-04-28 15:53     ` [Lsf] " Bart Van Assche
  2016-04-28 16:19       ` Knight, Frederick
@ 2016-04-28 16:23       ` Laurence Oberman
  2016-04-28 16:41         ` [dm-devel] " Bart Van Assche
  1 sibling, 1 reply; 26+ messages in thread
From: Laurence Oberman @ 2016-04-28 16:23 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: James Bottomley, Mike Snitzer, linux-block, lsf,
	device-mapper development, linux-scsi

Hello Folks,

We still suffer from periodic complaints in our large customer base regarding the long recovery times for dm-multipath.
Most of the time this is when we have something like a switch back-plane issue or an issue where RSCNs are blocked coming back up the fabric.
Corner cases still bite us often.

Most of the complaints originate from customers seeing, for example, Oracle cluster evictions because, while waiting on the mid-layer, all mpath I/O is blocked until recovery.

We have to tune eh_deadline, eh_timeout and fast_io_fail_tmo, but even with those tuned we have to wait on serial recovery even if we set the timeouts low.

Lately we have been living with the following (a sketch of where these knobs live in sysfs follows below):
eh_deadline=10
eh_timeout=5
fast_io_fail_tmo=10
leaving the default sd timeout at 30s
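
For reference, a minimal sketch of where these knobs typically live in
sysfs (the host, rport and device names below are examples only and
vary per system; adjust them before use):

#include <stdio.h>

/* write one value to a sysfs attribute and report failure */
static void write_knob(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f || fputs(val, f) == EOF)
                perror(path);
        if (f)
                fclose(f);
}

int main(void)
{
        /* example paths only */
        write_knob("/sys/class/scsi_host/host0/eh_deadline", "10");
        write_knob("/sys/block/sdp/device/eh_timeout", "5");
        write_knob("/sys/class/fc_remote_ports/rport-0:0-0/fast_io_fail_tmo", "10");
        return 0;
}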

So this continues to be an issue, and I have specific examples using the jammer that I can provide showing the serial recovery times.

Thanks

Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services

----- Original Message -----
From: "Bart Van Assche" <bart.vanassche@sandisk.com>
To: "James Bottomley" <James.Bottomley@HansenPartnership.com>, "Mike Snitzer" <snitzer@redhat.com>
Cc: linux-block@vger.kernel.org, lsf@lists.linux-foundation.org, "device-mapper development" <dm-devel@redhat.com>, "linux-scsi" <linux-scsi@vger.kernel.org>
Sent: Thursday, April 28, 2016 11:53:50 AM
Subject: Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM

On 04/28/2016 08:40 AM, James Bottomley wrote:
> Well, the entire room, that's vendors, users and implementors
> complained that path failover takes far too long.  I think in their
> minds this is enough substance to go on.

The only complaints I heard about path failover taking too long came 
from people working on FC drivers. Aren't SCSI transport layer 
implementations expected to fail I/O after fast_io_fail_tmo expired 
instead of waiting until the SCSI error handler has finished? If so, why 
is it considered an issue that error handling for the FC protocol can 
take very long (hours)?

Thanks,

Bart.


* Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM
  2016-04-28 16:19       ` Knight, Frederick
@ 2016-04-28 16:37         ` Bart Van Assche
  2016-04-28 17:33         ` James Bottomley
  1 sibling, 0 replies; 26+ messages in thread
From: Bart Van Assche @ 2016-04-28 16:37 UTC (permalink / raw)
  To: Knight, Frederick, James Bottomley, Mike Snitzer
  Cc: linux-block, lsf, device-mapper development, linux-scsi

Hello Fred,

Your feedback is very useful, but please note that in my e-mail I used
the phrase "transport layer" to refer to the code in the Linux kernel in
which the fast_io_fail_tmo functionality has been implemented. The
following commit message from 10 years ago explains why the
fast_io_fail_tmo and dev_loss_tmo mechanisms have been implemented:

---------------------------------------------------------------------------
commit 0f29b966d60e9a4f5ecff9f3832257b38aea4f13
Author: James Smart <James.Smart@Emulex.Com>
Date:   Fri Aug 18 17:33:29 2006 -0400

    [SCSI] FC transport: Add dev_loss_tmo callbacks, and new fast_io_fail_tmo w/ callback
    
    This patch adds the following functionality to the FC transport:
    
    - dev_loss_tmo LLDD callback :
      Called to essentially confirm the deletion of an rport. Thus, it is
      called whenever the dev_loss_tmo fires, or when the rport is deleted
      due to other circumstances (module unload, etc).  It is expected that
      the callback will initiate the termination of any outstanding i/o on
      the rport.
    
    - fast_io_fail_tmo and LLD callback:
      There are some cases where it may take a long while to truly determine
      device loss, but the system is in a multipathing configuration that if
      the i/o was failed quickly (faster than dev_loss_tmo), it could be
      redirected to a different path and completed sooner.
    
    Many thanks to Mike Reed who cleaned up the initial RFC in support
    of this post.
---------------------------------------------------------------------------

Bart.

On 04/28/2016 09:19 AM, Knight, Frederick wrote:
> There are multiple possible situations being intermixed in this discussion.
> First, I assume you're talking only about random access devices (if you try
> transport level error recover on a sequential access device - tape or SMR
> disk - there are lots of additional complexities).
> 
> Failures can occur at multiple places:
> a) Transport layer failures that the transport layer is able to detect quickly;
> b) SCSI device layer failures that the transport layer never even knows about.
> 
> For (a) there are two competing goals.  If a port drops off the fabric and
> comes back again, should you be able to just recover and continue.  But how
> long do you wait during that drop?  Some devices use this technique to "move"
> a WWPN from one place to another.  The port drops from the fabric, and a
> short time later, shows up again (the WWPN moves from one physical port to a
> different physical port). There are FC driver layer timers that define the
> length of time allowed for this operation.  The goal is fast failover, but
> not too fast - because too fast will break this kind of "transparent failover".
> This timer also allows for the "OH crap, I pulled the wrong cable - put it
> back in; quick" kind of stupid user bug.
> 
> For (b) the transport never has a failure.  A LUN (or a group of LUNs)
> have an ALUA transition from one set of ports to a different set of ports.
> Some of the LUNs on the port continue to work just fine, but others enter
> ALUA TRANSITION state so they can "move" to a different part of the hardware.
> After the move completes, you now have different sets of optimized and
> non-optimized paths (or possible standby, or unavailable).  The transport
> will never even know this happened.  This kind of "failure" is handled by
> the SCSI layer drivers.
> 
> There are other cases too, but these are the most common.
> 
> 	Fred
> 
> -----Original Message-----
> From: lsf-bounces@lists.linux-foundation.org [mailto:lsf-bounces@lists.linux-foundation.org] On Behalf Of Bart Van Assche
> Sent: Thursday, April 28, 2016 11:54 AM
> To: James Bottomley; Mike Snitzer
> Cc: linux-block@vger.kernel.org; lsf@lists.linux-foundation.org; device-mapper development; linux-scsi
> Subject: Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM
> 
> On 04/28/2016 08:40 AM, James Bottomley wrote:
>> Well, the entire room, that's vendors, users and implementors
>> complained that path failover takes far too long.  I think in their
>> minds this is enough substance to go on.
> 
> The only complaints I heard about path failover taking too long came
> from people working on FC drivers. Aren't SCSI transport layer
> implementations expected to fail I/O after fast_io_fail_tmo expired
> instead of waiting until the SCSI error handler has finished? If so, why
> is it considered an issue that error handling for the FC protocol can
> take very long (hours)?
> 
> Thanks,
> 
> Bart.



* Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
  2016-04-28 16:23       ` Laurence Oberman
@ 2016-04-28 16:41         ` Bart Van Assche
  2016-04-28 16:47           ` Laurence Oberman
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2016-04-28 16:41 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: linux-block, linux-scsi, Mike Snitzer, James Bottomley,
	device-mapper development, lsf

On 04/28/2016 09:23 AM, Laurence Oberman wrote:
> We still suffer from periodic complaints in our large customer base
> regarding the long recovery times for dm-multipath.
> Most of the time this is when we have something like a switch
> back-plane issue or an issue where RSCN'S are blocked coming back up
> the fabric. Corner cases still bite us often.
>
> Most of the complaints originate from customers for example seeing
> Oracle cluster evictions where during the waiting on the mid-layer
> all mpath I/O is blocked until recovery.
>
> We have to tune eh_deadline, eh_timeout and fast_io_fail_tmo but
> even tuning those we have to wait on serial recovery even if we
> set the timeouts low.
>
> Lately we have been living with
> eh_deadline=10
> eh_timeout=5
> fast_fail_io_tmo=10
> leaving default sd timeout at 30s
>
> So this continues to be an issue and I have specific examples using
> the jammer I can provide showing the serial recovery times here.

Hello Laurence,

The long recovery times you refer to, is that for a scenario where all 
paths failed or for a scenario where some paths failed and other paths 
are still working? In the latter case, how long does it take before 
dm-multipath fails over to another path?

Thanks,

Bart.


* Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
  2016-04-28 16:41         ` [dm-devel] " Bart Van Assche
@ 2016-04-28 16:47           ` Laurence Oberman
  2016-04-29 21:47             ` Laurence Oberman
  0 siblings, 1 reply; 26+ messages in thread
From: Laurence Oberman @ 2016-04-28 16:47 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, linux-scsi, Mike Snitzer, James Bottomley,
	device-mapper development, lsf

Hello Bart, this is when a subset of the paths fails.
As you know, the remaining paths won't be used until the eh_handler is either done or is short-circuited.

What I will do is set this up via my jammer and capture a test using the latest upstream.

Of course my customer pain points are all in the RHEL kernels, so I need to capture a recovery trace
on the latest upstream kernel.

When the SCSI commands for a path are black-holed and remain that way, even with eh_deadline and the short-circuited adapter resets
we simply try again and get back in the wait loop until we finally declare the device offline.

This can take a while and differs depending on Qlogic, Emulex, fnic etc.

First thing tomorrow I will set this up and show you what I mean.

Thanks!!

Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services

----- Original Message -----
From: "Bart Van Assche" <bart.vanassche@sandisk.com>
To: "Laurence Oberman" <loberman@redhat.com>
Cc: linux-block@vger.kernel.org, "linux-scsi" <linux-scsi@vger.kernel.org>, "Mike Snitzer" <snitzer@redhat.com>, "James Bottomley" <James.Bottomley@HansenPartnership.com>, "device-mapper development" <dm-devel@redhat.com>, lsf@lists.linux-foundation.org
Sent: Thursday, April 28, 2016 12:41:26 PM
Subject: Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM

On 04/28/2016 09:23 AM, Laurence Oberman wrote:
> We still suffer from periodic complaints in our large customer base
> regarding the long recovery times for dm-multipath.
> Most of the time this is when we have something like a switch
> back-plane issue or an issue where RSCN'S are blocked coming back up
> the fabric. Corner cases still bite us often.
>
> Most of the complaints originate from customers for example seeing
> Oracle cluster evictions where during the waiting on the mid-layer
> all mpath I/O is blocked until recovery.
>
> We have to tune eh_deadline, eh_timeout and fast_io_fail_tmo but
> even tuning those we have to wait on serial recovery even if we
> set the timeouts low.
>
> Lately we have been living with
> eh_deadline=10
> eh_timeout=5
> fast_fail_io_tmo=10
> leaving default sd timeout at 30s
>
> So this continues to be an issue and I have specific examples using
> the jammer I can provide showing the serial recovery times here.

Hello Laurence,

The long recovery times you refer to, is that for a scenario where all 
paths failed or for a scenario where some paths failed and other paths 
are still working? In the latter case, how long does it take before 
dm-multipath fails over to another path?

Thanks,

Bart.


* Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM
  2016-04-28 16:19       ` Knight, Frederick
  2016-04-28 16:37         ` Bart Van Assche
@ 2016-04-28 17:33         ` James Bottomley
  1 sibling, 0 replies; 26+ messages in thread
From: James Bottomley @ 2016-04-28 17:33 UTC (permalink / raw)
  To: Knight, Frederick, Bart Van Assche, Mike Snitzer
  Cc: linux-block, lsf, device-mapper development, linux-scsi

On Thu, 2016-04-28 at 16:19 +0000, Knight, Frederick wrote:
> There are multiple possible situations being intermixed in this
> discussion.  First, I assume you're talking only about random access
> devices (if you try transport level error recover on a sequential
> access device - tape or SMR disk - there are lots of additional
> complexities).

Tape figured prominently in the reset discussion.  Resetting beyond the
LUN can have a grave impact on long-running jobs (mostly on tapes).

> Failures can occur at multiple places:
> a) Transport layer failures that the transport layer is able to
> detect quickly;
> b) SCSI device layer failures that the transport layer never even
> knows about.
> 
> For (a) there are two competing goals.  If a port drops off the
> fabric and comes back again, should you be able to just recover and
> continue.  But how long do you wait during that drop?  Some devices
> use this technique to "move" a WWPN from one place to another.  The
> port drops from the fabric, and a short time later, shows up again
> (the WWPN moves from one physical port to a different physical port).
> There are FC driver layer timers that define the length of time
> allowed for this operation.  The goal is fast failover, but not too
> fast - because too fast will break this kind of "transparent
> failover".  This timer also allows for the "OH crap, I pulled the
> wrong cable - put it back in; quick" kind of stupid user bug.

I think we already have this sorted out with the dev loss timeout which
is implemented in the transport.  It's the grace period you have before
we act on a path loss.

> For (b) the transport never has a failure.  A LUN (or a group of
> LUNs) have an ALUA transition from one set of ports to a different
> set of ports.  Some of the LUNs on the port continue to work just
> fine, but others enter ALUA TRANSITION state so they can "move" to a
> different part of the hardware.  After the move completes, you now
> have different sets of optimized and non-optimized paths (or possible
> standby, or unavailable).  The transport will never even know this
> happened.  This kind of "failure" is handled by the SCSI layer
> drivers.

OK, so ALUA did come up as well, I just forgot.  Perhaps I should back
off a bit and give the historical reasons why dm became our primary
path failover system.  It's because for the first ~15 years of Linux we
had no separate transport infrastructure in SCSI (and, to be fair, T10
didn't either).  In fact, all scsi drivers implemented their own
variants of transport stuff.  This meant there was initial pressure to
make the transport failover stuff driver-specific, and the answer to
that was a resounding "hell no!", so dm (and md) became the de-facto
path failover standard because there was nowhere else to put it.  The
transport infrastructure didn't really become mature until 2006-2007,
well after this decision was made.  However, now we have transport
infrastructure the question of whether we can use it for path failover
isn't unreasonable.  If we abstract it correctly, it could become a
library usable by all our current transports, so we might only need a
single implementation.

For ALUA specifically (and other weird ALUA-like implementations), the
handling code actually sits in drivers/scsi/device_handler, so it could
also be used by the transport code to make path decisions.  The point
here is that even if we implement path failover at the transport level,
we have more information available to make the failover decision than
the transport alone would strictly know.

James




* Re: [dm-devel] Notes from the four separate IO track sessions at LSF/MM
  2016-04-27 23:39 Notes from the four separate IO track sessions at LSF/MM James Bottomley
  2016-04-28 12:11 ` Mike Snitzer
@ 2016-04-29 16:45 ` Benjamin Marzinski
  1 sibling, 0 replies; 26+ messages in thread
From: Benjamin Marzinski @ 2016-04-29 16:45 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-scsi, linux-block, device-mapper development, lsf

On Wed, Apr 27, 2016 at 04:39:49PM -0700, James Bottomley wrote:
> Multipath - Mike Snitzer
> ------------------------
> 
> Mike began with a request for feedback, which quickly lead to the
> complaint that recovery time (and how you recover) was one of the
> biggest issues in device mapper multipath (dmmp) for those in the room.
>   This is primarily caused by having to wait for the pending I/O to be
> released by the failing path. Christoph Hellwig said that NVMe would
> soon do path failover internally (without any need for dmmp) and asked
> if people would be interested in a more general implementation of this.
>  Martin Petersen said he would look at implementing this in SCSI as
> well.  The discussion noted that internal path failover only works in
> the case where the transport is the same across all the paths and
> supports some type of path down notification.  In any cases where this
> isn't true (such as failover from fibre channel to iSCSI) you still
> have to use dmmp.  Other benefits of internal path failover are that
> the transport level code is much better qualified to recognise when the
> same device appears over multiple paths, so it should make a lot of the
> configuration seamless.

Given the variety of sensible configurations that I've seen for people's
multipath setups, there will definitely be a chunk of configuration that
will never be seamless. Just in the past few weeks, we've added code to
make it easier to allow people to manually configure devices for
situations where none of our automated heuristics do what the user
needs. Even for the easy cases, like ALUA, we've been adding options to
allow users to do things like specify what they want to happen when they
set the TPGS Pref bit.

Recognizing which paths go together is simple. That part has always been
seamless from the user's point of view. Configuring how IO is balanced and
failed over between the paths is where the complexity is.

> The consequence for end users would be that
> now SCSI devices would become handles for end devices rather than
> handles for paths to end devices.

This will have a lot of repercussions for applications that use scsi
devices.  A significant number of tools expect that a scsi device maps
to a connection between an initiator port and a target port. Listing the
topology of these new scsi devices, and getting the IO stats down the
various paths to them, will involve writing new tools or rewriting
existing ones. Things like persistent reservations will work differently
(albeit, probably more intuitively).

I'm not saying that this can't be made to work nicely for a significant
subset of cases (as has been pointed out with the multiple transport
case, this won't work for all cases). I just think that it's not a small
amount of work, and not necessarily the only way to speed up failover.

-Ben

> James
> 


* Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
  2016-04-28 16:47           ` Laurence Oberman
@ 2016-04-29 21:47             ` Laurence Oberman
  2016-04-29 21:51               ` Laurence Oberman
  2016-04-30  0:36               ` Bart Van Assche
  0 siblings, 2 replies; 26+ messages in thread
From: Laurence Oberman @ 2016-04-29 21:47 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, linux-scsi, Mike Snitzer, James Bottomley,
	device-mapper development, lsf, Benjamin Marzinski

Hello Bart

I will email the entire log just to you; below is a summary of only the pertinent log messages.
I don't think the whole list will have an interest in all the log messages.
When I send the full log to you I will include SCSI debug for the error handler stuff.


I ran the tests. This is a worst-case test with 21 LUNs and jammed commands.
Typical failures like a port switch failure or link down won't be like this.
Also, where we get RSCNs and we can react quicker, we will.

In this case I simulated a hung switch issue, like a backplane/mesh problem, and believe me I see a lot of these
black-holed SCSI command situations in real life.
Recovery is around 300s with 21 LUNs that have in-flights to abort.

This configuration is a multibus configuration for multipath.
Two qla2xxx ports are connected to a switch and the 2 array ports are connected to the same switch.
This gives me 4 active/active paths for each of 21 mpath devices.

I start I/O to all 21, reading 64k blocks using dd with iflag=direct

Example mpath device
mpathf (360014056a5be89021364a4c90556bfbb) dm-7 LIO-ORG ,block-14        
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 0:0:0:13 sdp  8:240  active ready running
  |- 0:0:1:13 sdbf 67:144 active ready running
  |- 1:0:0:13 sdo  8:224  active ready running
  `- 1:0:1:13 sdbg 67:160 active ready running

eh_deadline is set to 10 on the 2 qlogic ports, eh_timeout is set to 10 for all devices
In multipath fast_io_fail_tmo=5

I jam one of the target array ports and discard the commands, effectively black-holing them, and leave it that way until we recover, and I watch the I/O.
The recovery takes around 300s even with all the tuning, and this effectively ends up in Oracle cluster evictions.

Watching multipath -ll mpathe, it will block as expected while in recovery.

Blocked here
Fri Apr 29 17:16:14 EDT 2016
mpathe (360014052a6f5f9f256d4c1097eedfd95) dm-2 LIO-ORG ,block-13
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 0:0:0:12 sds  65:32  active ready running
  |- 0:0:1:12 sdbh 67:176 active ready running
  |- 1:0:0:12 sdr  65:16  active ready running
  `- 1:0:1:12 sdbi 67:192 active ready running

Started again here
Fri Apr 29 17:16:26 EDT 2016
mpathe (360014052a6f5f9f256d4c1097eedfd95) dm-2 LIO-ORG ,block-13
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 0:0:0:12 sds  65:32  active ready  running
  |- 0:0:1:12 sdbh 67:176 failed faulty offline
  |- 1:0:0:12 sdr  65:16  active ready  running
  `- 1:0:1:12 sdbi 67:192 failed faulty offline

Tracking I/O
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st                 EDT
 0 21      0 15409656  25580 452056    0    0 13740     0  367 2523  0  1 41 59  0 2016-04-29 17:16:17
 0 21      0 15408904  25580 452336    0    0 15872     0  378 2779  0  1 42 57  0 2016-04-29 17:16:18
 2 20      0 15408096  25580 452624    0    0 17612     0  399 3310  0  0 26 73  0 2016-04-29 17:16:19
 0 21      0 15407188  25580 453096    0    0 17860     0  412 3137  0  0 30 70  0 2016-04-29 17:16:20
 0 21      0 15410420  25580 451552    0    0 23116     0  900 6747  0  1 31 69  0 2016-04-29 17:16:21
 0 21      0 15410552  25580 451420    0    0 22664     0  430 3752  0  0 24 76  0 2016-04-29 17:16:22
 0 21      0 15410552  25580 451420    0    0 15700     0  325 2619  0  0 25 75  0 2016-04-29 17:16:23
 0 21      0 15410552  25580 451420    0    0 13648     0  303 2387  0  0 28 71  0 2016-04-29 17:16:24
..
.. Blocked
..
Starts recovering ~300 seconds later
..
 0 38      0 15406428  25860 452652    0    0  3208     0  859 2437  0  1 13 86  0 2016-04-29 17:21:20
 0 38      0 15405668  26244 452268    0    0  6640     0  499 3575  0  1  0 99  0 2016-04-29 17:21:21
 0 38      0 15406840  26496 452300    0    0  5372     0  273 1878  0  0  1 98  0 2016-04-29 17:21:22
 0 38      0 15402684  29156 452048    0    0  9700     0  318 2326  0  0 11 88  0 2016-04-29 17:21:23
 0 38      0 15400800  30152 452168    0    0 11876     0  433 3265  0  1 16 83  0 2016-04-29 17:21:24
 0 38      0 15399792  31140 452344    0    0 11804     0  394 2902  0  1 15 85  0 2016-04-29 17:21:25
 0 38      0 15398552  31952 452196    0    0 12908     0  417 3347  0  1  4 96  0 2016-04-29 17:21:26
 0 35      0 15394564  32660 452800    0    0 10904     0  575 4191  1  1  9 89  0 2016-04-29 17:21:27
 0 29      0 15394292  32968 452900    0    0 13356     0  602 3993  1  1  1 96  0 2016-04-29 17:21:28
 0 26      0 15394464  33692 452196    0    0 16124     0  764 5451  1  1  2 96  0 2016-04-29 17:21:29
 0 24      0 15394168  33880 452392    0    0 20156     0  479 3957  0  1  3 96  0 2016-04-29 17:21:30
 0 24      0 15394216  34008 452460    0    0 21760     0  456 3836  0  1  6 94  0 2016-04-29 17:21:31
 0 22      0 15393920  34016 452604    0    0 20104    28  437 3418  0  1 12 87  0 2016-04-29 17:21:32
 0 22      0 15393952  34016 452600    0    0 20352     0  483 3259  0  1 67 32  0 2016-04-29 17:21:33
 0 22      0 15394148  34016 452600    0    0 20560     0  451 3228  0  1 74 25  0 2016-04-29 17:21:34

I see the error handler start in the qlogic driver.
Keep in mind we are black-holed, so after the RESET we start the process again.

Apr 29 17:15:26 localhost root: Starting test with eh_deadline=10
Apr 29 17:16:54 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:9 --  1 2002.
Apr 29 17:16:54 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:7 --  1 2002.
Apr 29 17:16:54 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:8 --  1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:6 --  1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:4 --  1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:3 --  1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:22 --  1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:23 --  1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:21 --  1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:2 --  1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:19 --  1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:18 --  1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:20 --  1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:1 --  1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 --  1 2002.
Apr 29 17:16:57 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:17 --  1 2002.
Apr 29 17:16:57 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:16 --  1 2002.
Apr 29 17:16:57 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:10 --  1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:14 --  1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:13 --  1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:12 --  1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 --  1 2002.
Apr 29 17:17:00 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:10 --  1 2002.
Apr 29 17:17:00 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:12 --  1 2002.
Apr 29 17:17:01 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 --  1 2002.
Apr 29 17:17:02 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:1 --  1 2002.
Apr 29 17:17:03 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:13 --  1 2002.
Apr 29 17:17:03 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:14 --  1 2002.
Apr 29 17:17:03 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:16 --  1 2002.
Apr 29 17:17:04 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:17 --  1 2002.
Apr 29 17:17:09 localhost kernel: qla2xxx [0000:07:00.1]-8018:1: ADAPTER RESET ISSUED nexus=1:1:12.
Apr 29 17:17:09 localhost kernel: qla2xxx [0000:07:00.1]-00af:1: Performing ISP error recovery - ha=ffff88042a4b0000.
Apr 29 17:17:10 localhost kernel: qla2xxx [0000:07:00.1]-500a:1: LOOP UP detected (4 Gbps).
Apr 29 17:17:10 localhost kernel: qla2xxx [0000:07:00.1]-8017:1: ADAPTER RESET SUCCEEDED nexus=1:1:12.
Apr 29 17:17:30 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:12 --  1 2002.
Apr 29 17:17:34 localhost kernel: qla2xxx [0000:07:00.0]-8018:0: ADAPTER RESET ISSUED nexus=0:1:17.
Apr 29 17:17:34 localhost kernel: qla2xxx [0000:07:00.0]-00af:0: Performing ISP error recovery - ha=ffff88042b030000.
Apr 29 17:17:35 localhost kernel: qla2xxx [0000:07:00.0]-500a:0: LOOP UP detected (4 Gbps).
Apr 29 17:17:35 localhost kernel: qla2xxx [0000:07:00.0]-8017:0: ADAPTER RESET SUCCEEDED nexus=0:1:17.
Apr 29 17:17:40 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:10 --  1 2002.
Apr 29 17:17:50 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:1 --  1 2002.
Apr 29 17:17:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:17 --  1 2002.
Apr 29 17:18:00 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:0 --  1 2002.
Apr 29 17:18:06 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:18 --  1 2002.
Apr 29 17:18:10 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:13 --  1 2002.
Apr 29 17:18:16 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:19 --  1 2002.
Apr 29 17:18:20 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:8 --  1 2002.
Apr 29 17:18:26 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:8 --  1 2002.
Apr 29 17:18:30 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:7 --  1 2002.
Apr 29 17:18:36 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:22 --  1 2002.
Apr 29 17:18:40 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:22 --  1 2002.
Apr 29 17:18:46 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:21 --  1 2002.
Apr 29 17:18:50 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:21 --  1 2002.
..
..
We start seeing the hung tasks

Apr 29 17:19:16 localhost kernel: INFO: task dd:10193 blocked for more than 120 seconds.
Apr 29 17:19:16 localhost kernel:      Not tainted 4.6.0-rc5+ #1
Apr 29 17:19:16 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 29 17:19:16 localhost kernel: dd              D ffff8804303079d8     0 10193  10177 0x00000080
Apr 29 17:19:16 localhost kernel: ffff8804303079d8 ffff880434814140 ffff8800a86515c0 ffff880430308000
Apr 29 17:19:16 localhost kernel: 0000000000000000 7fffffffffffffff 0000000000000000 ffff8804307bfd00
Apr 29 17:19:16 localhost kernel: ffff8804303079f0 ffffffff816ba8e5 ffff880436696e00 ffff880430307aa0
Apr 29 17:19:16 localhost kernel: Call Trace:
Apr 29 17:19:16 localhost kernel: [<ffffffff816ba8e5>] schedule+0x35/0x80
Apr 29 17:19:16 localhost kernel: [<ffffffff816bd661>] schedule_timeout+0x231/0x2d0
Apr 29 17:19:16 localhost kernel: [<ffffffff81315843>] ? __blk_run_queue+0x33/0x40
Apr 29 17:19:16 localhost kernel: [<ffffffff813158ba>] ? queue_unplugged+0x2a/0xb0
Apr 29 17:19:16 localhost kernel: [<ffffffff816b9f66>] io_schedule_timeout+0xa6/0x110
Apr 29 17:19:16 localhost kernel: [<ffffffff81259332>] do_blockdev_direct_IO+0x1b52/0x2180
Apr 29 17:19:16 localhost kernel: [<ffffffff81254320>] ? I_BDEV+0x20/0x20
Apr 29 17:19:16 localhost kernel: [<ffffffff812599a3>] __blockdev_direct_IO+0x43/0x50
Apr 29 17:19:16 localhost kernel: [<ffffffff81254a7c>] blkdev_direct_IO+0x4c/0x50
Apr 29 17:19:16 localhost kernel: [<ffffffff81193ab1>] generic_file_read_iter+0x641/0x7b0
Apr 29 17:19:16 localhost kernel: [<ffffffff8120bcf5>] ? mem_cgroup_commit_charge+0x85/0x100
Apr 29 17:19:16 localhost kernel: [<ffffffff81254e57>] blkdev_read_iter+0x37/0x40
Apr 29 17:19:16 localhost kernel: [<ffffffff81219379>] __vfs_read+0xc9/0x100
Apr 29 17:19:16 localhost kernel: [<ffffffff8121a1ef>] vfs_read+0x7f/0x130
Apr 29 17:19:16 localhost kernel: [<ffffffff8121b6d5>] SyS_read+0x55/0xc0
Apr 29 17:19:16 localhost kernel: [<ffffffff81003c12>] do_syscall_64+0x62/0x110
Apr 29 17:19:16 localhost kernel: [<ffffffff816be4a1>] entry_SYSCALL64_slow_path+0x25/0x25
..
..

Finally, after the serialized timeouts, we get the offline states

Apr 29 17:19:26 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:3 --  1 2002.
Apr 29 17:19:30 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:17 --  1 2002.
Apr 29 17:19:36 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:7 --  1 2002.
Apr 29 17:19:40 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:18 --  1 2002.
Apr 29 17:19:46 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:4 --  1 2002.
Apr 29 17:19:51 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:19 --  1 2002.
Apr 29 17:19:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:6 --  1 2002.
Apr 29 17:20:01 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:2 --  1 2002.
Apr 29 17:20:06 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:9 --  1 2002.
Apr 29 17:20:11 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:23 --  1 2002.
Apr 29 17:20:16 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:12 --  1 2002.
Apr 29 17:20:21 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:3 --  1 2002.
Apr 29 17:20:26 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:10 --  1 2002.
Apr 29 17:20:31 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:4 --  1 2002.
Apr 29 17:20:37 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 --  1 2002.
Apr 29 17:20:41 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:6 --  1 2002.
Apr 29 17:20:47 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:1 --  1 2002.
Apr 29 17:20:51 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:9 --  1 2002.
Apr 29 17:20:51 localhost kernel: sd 1:0:1:12: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:10: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:1: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:0: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:13: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 66:208.
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 68:192.
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 67:224.
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 67:192.
Apr 29 17:20:51 localhost multipathd: mpatha: sdat - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 66:208 in map mpatha
Apr 29 17:20:51 localhost multipathd: mpatha: remaining active paths: 3
Apr 29 17:20:51 localhost multipathd: mpathb: sdby - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 68:192 in map mpathb
Apr 29 17:20:51 localhost multipathd: mpathb: remaining active paths: 3
Apr 29 17:20:51 localhost multipathd: mpathc: sdbk - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 67:224 in map mpathc
Apr 29 17:20:51 localhost multipathd: mpathc: remaining active paths: 3
Apr 29 17:20:51 localhost multipathd: mpathe: sdbi - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 67:192 in map mpathe
Apr 29 17:20:51 localhost multipathd: mpathe: remaining active paths: 3
Apr 29 17:20:51 localhost kernel: sd 1:0:1:8: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:7: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:22: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:21: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:20: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:14: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:16: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:17: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:18: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:19: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:2: Device offlined - not ready after error recovery
..
..
Apr 29 17:20:51 localhost kernel: sd 1:0:1:12: [sdbi] tag#14 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] tag#14 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdbi, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: rejecting I/O to offline device
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] killing request
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] CDB: Read(10) 28 00 00 02 5b 80 00 00 80 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdbi, sector 154496
Apr 29 17:20:52 localhost kernel: sd 1:0:1:10: [sdbk] tag#13 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:10: [sdbk] tag#13 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdbk, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:1: [sdby] tag#16 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:1: [sdby] tag#16 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost multipathd: mpathf: sdbg - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:160 in map mpathf
Apr 29 17:20:52 localhost multipathd: mpathf: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathg: sdbe - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:128 in map mpathg
Apr 29 17:20:52 localhost multipathd: mpathg: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathi: sdbc - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:96 in map mpathi
Apr 29 17:20:52 localhost multipathd: mpathi: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathj: sdba - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:64 in map mpathj
Apr 29 17:20:52 localhost multipathd: mpathj: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathk: sday - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:32 in map mpathk
Apr 29 17:20:52 localhost multipathd: mpathk: remaining active paths: 3
..
..
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:160.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:128.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:96.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:64.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:32.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 66:240.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:160.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:0.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:32.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:64.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:96.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:128.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:96.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:64.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:224.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:32.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:0.
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdby, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:0: [sdat] tag#15 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:0: [sdat] tag#15 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdat, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:13: [sdbg] tag#17 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:13: [sdbg] tag#17 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdbg, sector 0
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: rejecting I/O to offline device
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: [sdbg] killing request
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: [sdbg] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: [sdbg] CDB: Read(10) 28 00 00 02 5d 80 00 00 80 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdbg, sector 155008
Apr 29 17:20:53 localhost kernel: sd 1:0:1:8: [sdbo] tag#31 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:53 localhost kernel: sd 1:0:1:8: [sdbo] tag#31 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdbo, sector 0
Apr 29 17:20:53 localhost kernel: sd 1:0:1:7: [sdca] tag#30 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:53 localhost kernel: sd 1:0:1:7: [sdca] tag#30 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdca, sector 0
Apr 29 17:20:53 localhost kernel: sd 1:0:1:22: [sdcg] tag#26 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
..
Apr 29 17:21:18 localhost multipathd: checker failed path 66:224 in map mpatha
Apr 29 17:21:18 localhost multipathd: mpatha: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathb: sdbx - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:176 in map mpathb
Apr 29 17:21:18 localhost multipathd: mpathb: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathc: sdbj - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:208 in map mpathc
Apr 29 17:21:18 localhost multipathd: mpathc: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathe: sdbh - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:176 in map mpathe
Apr 29 17:21:18 localhost multipathd: mpathe: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathf: sdbf - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:144 in map mpathf
Apr 29 17:21:18 localhost multipathd: mpathf: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathg: sdbd - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:112 in map mpathg
Apr 29 17:21:18 localhost multipathd: mpathg: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathi: sdbb - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:80 in map mpathi
Apr 29 17:21:18 localhost multipathd: mpathi: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpatho: sdbr - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:80 in map mpatho
Apr 29 17:21:18 localhost multipathd: mpatho: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathq: sdbp - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:48 in map mpathq
Apr 29 17:21:18 localhost multipathd: mpathq: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathv: sdbz - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:208 in map mpathv
Apr 29 17:21:18 localhost multipathd: mpathv: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpatht: sdbl - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:240 in map mpatht
Apr 29 17:21:18 localhost multipathd: mpatht: remaining active paths: 2
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 66:224.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:176.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:208.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:176.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:144.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:112.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:80.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:80.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:48.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:208.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:240.
Apr 29 17:21:18 localhost kernel: blk_update_request: I/O error, dev sdaw, sector 0
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: [sdbn] tag#25 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: [sdbn] tag#25 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:21:18 localhost kernel: blk_update_request: I/O error, dev sdbn, sector 0
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: rejecting I/O to offline device
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: [sdbn] killing request


Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services

----- Original Message -----
From: "Laurence Oberman" <loberman@redhat.com>
To: "Bart Van Assche" <bart.vanassche@sandisk.com>
Cc: linux-block@vger.kernel.org, "linux-scsi" <linux-scsi@vger.kernel.org>, "Mike Snitzer" <snitzer@redhat.com>, "James Bottomley" <James.Bottomley@HansenPartnership.com>, "device-mapper development" <dm-devel@redhat.com>, lsf@lists.linux-foundation.org
Sent: Thursday, April 28, 2016 12:47:24 PM
Subject: Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM

Hello Bart, this is when a subset of the paths fails.
As you know, the remaining paths won't be used until the eh_handler is either done or is short-circuited.

What I will do is set this up via my jammer and capture a test using the latest upstream.

Of course my customer pain points are all in the RHEL kernels, so I need to capture a recovery trace
on the latest upstream kernel.

When the SCSI commands for a path are black-holed and remain that way, even with eh_deadline and the short-circuited adapter resets
we simply try again and get back into the wait loop until we finally declare the device offline.

This can take a while and differs depending on QLogic, Emulex, fnic, etc.
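
For reference, these are roughly the knobs involved; treat the sysfs paths and values below as an
illustrative sketch only (exact paths and defaults vary by kernel, HBA driver and distro, and the
rport name is just an example):

    # per-host SCSI error-handler time budget, in seconds
    echo 10 > /sys/class/scsi_host/host0/eh_deadline
    echo 10 > /sys/class/scsi_host/host1/eh_deadline

    # per-device error-handling timeout, in seconds
    for t in /sys/block/sd*/device/eh_timeout; do echo 5 > "$t"; done

    # fail outstanding I/O quickly once the FC transport marks the rport dead
    # (rport name below is only an example)
    echo 10 > /sys/class/fc_remote_ports/rport-0:0-1/fast_io_fail_tmo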

First thing tomorrow I will set this up and show you what I mean.

Thanks!!

Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services

----- Original Message -----
From: "Bart Van Assche" <bart.vanassche@sandisk.com>
To: "Laurence Oberman" <loberman@redhat.com>
Cc: linux-block@vger.kernel.org, "linux-scsi" <linux-scsi@vger.kernel.org>, "Mike Snitzer" <snitzer@redhat.com>, "James Bottomley" <James.Bottomley@HansenPartnership.com>, "device-mapper development" <dm-devel@redhat.com>, lsf@lists.linux-foundation.org
Sent: Thursday, April 28, 2016 12:41:26 PM
Subject: Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM

On 04/28/2016 09:23 AM, Laurence Oberman wrote:
> We still suffer from periodic complaints in our large customer base
 > regarding the long recovery times for dm-multipath.
> Most of the time this is when we have something like a switch
 > back-plane issue or an issue where RSCN'S are blocked coming back up
 > the fabric. Corner cases still bite us often.
>
> Most of the complaints originate from customers for example seeing
 > Oracle cluster evictions where during the waiting on the mid-layer
 > all mpath I/O is blocked until recovery.
>
> We have to tune eh_deadline, eh_timeout and fast_io_fail_tmo but
 > even tuning those we have to wait on serial recovery even if we
 > set the timeouts low.
>
> Lately we have been living with
> eh_deadline=10
> eh_timeout=5
> fast_fail_io_tmo=10
> leaving default sd timeout at 30s
>
> So this continues to be an issue and I have specific examples using
 > the jammer I can provide showing the serial recovery times here.

Hello Laurence,

The long recovery times you refer to, is that for a scenario where all 
paths failed or for a scenario where some paths failed and other paths 
are still working? In the latter case, how long does it take before 
dm-multipath fails over to another path?

Thanks,

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
  2016-04-29 21:47             ` Laurence Oberman
@ 2016-04-29 21:51               ` Laurence Oberman
  2016-04-30  0:36               ` Bart Van Assche
  1 sibling, 0 replies; 26+ messages in thread
From: Laurence Oberman @ 2016-04-29 21:51 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, linux-scsi, Mike Snitzer, James Bottomley,
	device-mapper development, lsf, Benjamin Marzinski

One small correction

In the cut and paste, the mpath timing was as follows; I had a cut-and-paste error in my prior message.

Fri Apr 29 17:16:14 EDT 2016
mpathe (360014052a6f5f9f256d4c1097eedfd95) dm-2 LIO-ORG ,block-13
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 0:0:0:12 sds  65:32  active ready running
  |- 0:0:1:12 sdbh 67:176 active ready running
  |- 1:0:0:12 sdr  65:16  active ready running
  `- 1:0:1:12 sdbi 67:192 active ready running

I/O starts again here, so it's the same ~5 minutes while we are in the error_handler

Fri Apr 29 17:21:26 EDT 2016
mpathe (360014052a6f5f9f256d4c1097eedfd95) dm-2 LIO-ORG ,block-13
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 0:0:0:12 sds  65:32  active ready  running
  |- 0:0:1:12 sdbh 67:176 failed faulty offline
  |- 1:0:0:12 sdr  65:16  active ready  running
  `- 1:0:1:12 sdbi 67:192 failed faulty offline
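
Snapshots like the two above can be captured with a trivial loop along these lines (just an example
invocation; the interval and output file name are arbitrary), so the before/after timestamps line up
with the kernel log:

    while true; do date; multipath -ll mpathe; sleep 10; done | tee mpathe-watch.log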

Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services

----- Original Message -----
From: "Laurence Oberman" <loberman@redhat.com>
To: "Bart Van Assche" <bart.vanassche@sandisk.com>
Cc: linux-block@vger.kernel.org, "linux-scsi" <linux-scsi@vger.kernel.org>, "Mike Snitzer" <snitzer@redhat.com>, "James Bottomley" <James.Bottomley@HansenPartnership.com>, "device-mapper development" <dm-devel@redhat.com>, lsf@lists.linux-foundation.org, "Benjamin Marzinski" <bmarzins@redhat.com>
Sent: Friday, April 29, 2016 5:47:07 PM
Subject: Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM

Hello Bart

I will email the entire log just to you. Below is only a summary of the pertinent log messages.
I don't think the whole list will have an interest in all the log messages.
When I send the full log to you I will include the SCSI debug for the error handler stuff.


I ran the tests. This is a worst-case test with 21 LUNs and jammed commands.
Typical failures, like a port or switch failure or a link down, won't be like this.
Also, where we get RSCNs we can react more quickly, and we will.

In this case I simulated a hung-switch issue, like a backplane/mesh problem, and believe me, I see a lot of these
black-holed SCSI command situations in real life.
Recovery with the 21 LUNs that have in-flights to abort is around 300s.

This configuration is a multibus configuration for multipath.
Two qla2xxx ports are connected to a switch, and the 2 array ports are connected to the same switch.
This gives me 4 active/active paths for each of the 21 mpath devices.

I start I/O to all 21, reading 64k blocks using dd with iflag=direct.
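
Roughly like this (an illustrative sketch; the mapper names are just whatever this host assigned):

    for m in /dev/mapper/mpath[a-z]; do
        dd if=$m of=/dev/null bs=64k iflag=direct &
    done
    wait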

Example mpath device
mpathf (360014056a5be89021364a4c90556bfbb) dm-7 LIO-ORG ,block-14        
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 0:0:0:13 sdp  8:240  active ready running
  |- 0:0:1:13 sdbf 67:144 active ready running
  |- 1:0:0:13 sdo  8:224  active ready running
  `- 1:0:1:13 sdbg 67:160 active ready running

eh_deadline is set to 10 on the 2 QLogic ports and eh_timeout is set to 10 for all devices.
In multipath, fast_io_fail_tmo=5.
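
On the multipath side the relevant bits of multipath.conf look roughly like this (a sketch, not the
exact config from this box):

    defaults {
            path_grouping_policy    multibus
            path_selector           "service-time 0"
            features                "1 queue_if_no_path"
            fast_io_fail_tmo        5
    }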

I jam one of the target array ports and discard the commands, effectively black-holing them, and leave it that way until we recover, while I watch the I/O.
The recovery takes around 300s even with all the tuning, and this effectively ends up in Oracle cluster evictions.

Watching multipath -ll mpathe, I see it block as expected while in recovery

Blocked here
Fri Apr 29 17:16:14 EDT 2016
mpathe (360014052a6f5f9f256d4c1097eedfd95) dm-2 LIO-ORG ,block-13
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 0:0:0:12 sds  65:32  active ready running
  |- 0:0:1:12 sdbh 67:176 active ready running
  |- 1:0:0:12 sdr  65:16  active ready running
  `- 1:0:1:12 sdbi 67:192 active ready running

Starts again here
Fri Apr 29 17:16:26 EDT 2016
mpathe (360014052a6f5f9f256d4c1097eedfd95) dm-2 LIO-ORG ,block-13
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 0:0:0:12 sds  65:32  active ready  running
  |- 0:0:1:12 sdbh 67:176 failed faulty offline
  |- 1:0:0:12 sdr  65:16  active ready  running
  `- 1:0:1:12 sdbi 67:192 failed faulty offline

Tracking I/O
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st                 EDT
 0 21      0 15409656  25580 452056    0    0 13740     0  367 2523  0  1 41 59  0 2016-04-29 17:16:17
 0 21      0 15408904  25580 452336    0    0 15872     0  378 2779  0  1 42 57  0 2016-04-29 17:16:18
 2 20      0 15408096  25580 452624    0    0 17612     0  399 3310  0  0 26 73  0 2016-04-29 17:16:19
 0 21      0 15407188  25580 453096    0    0 17860     0  412 3137  0  0 30 70  0 2016-04-29 17:16:20
 0 21      0 15410420  25580 451552    0    0 23116     0  900 6747  0  1 31 69  0 2016-04-29 17:16:21
 0 21      0 15410552  25580 451420    0    0 22664     0  430 3752  0  0 24 76  0 2016-04-29 17:16:22
 0 21      0 15410552  25580 451420    0    0 15700     0  325 2619  0  0 25 75  0 2016-04-29 17:16:23
 0 21      0 15410552  25580 451420    0    0 13648     0  303 2387  0  0 28 71  0 2016-04-29 17:16:24
..
.. Blocked
..
Starts recovering ~300 seconds later
..
 0 38      0 15406428  25860 452652    0    0  3208     0  859 2437  0  1 13 86  0 2016-04-29 17:21:20
 0 38      0 15405668  26244 452268    0    0  6640     0  499 3575  0  1  0 99  0 2016-04-29 17:21:21
 0 38      0 15406840  26496 452300    0    0  5372     0  273 1878  0  0  1 98  0 2016-04-29 17:21:22
 0 38      0 15402684  29156 452048    0    0  9700     0  318 2326  0  0 11 88  0 2016-04-29 17:21:23
 0 38      0 15400800  30152 452168    0    0 11876     0  433 3265  0  1 16 83  0 2016-04-29 17:21:24
 0 38      0 15399792  31140 452344    0    0 11804     0  394 2902  0  1 15 85  0 2016-04-29 17:21:25
 0 38      0 15398552  31952 452196    0    0 12908     0  417 3347  0  1  4 96  0 2016-04-29 17:21:26
 0 35      0 15394564  32660 452800    0    0 10904     0  575 4191  1  1  9 89  0 2016-04-29 17:21:27
 0 29      0 15394292  32968 452900    0    0 13356     0  602 3993  1  1  1 96  0 2016-04-29 17:21:28
 0 26      0 15394464  33692 452196    0    0 16124     0  764 5451  1  1  2 96  0 2016-04-29 17:21:29
 0 24      0 15394168  33880 452392    0    0 20156     0  479 3957  0  1  3 96  0 2016-04-29 17:21:30
 0 24      0 15394216  34008 452460    0    0 21760     0  456 3836  0  1  6 94  0 2016-04-29 17:21:31
 0 22      0 15393920  34016 452604    0    0 20104    28  437 3418  0  1 12 87  0 2016-04-29 17:21:32
 0 22      0 15393952  34016 452600    0    0 20352     0  483 3259  0  1 67 32  0 2016-04-29 17:21:33
 0 22      0 15394148  34016 452600    0    0 20560     0  451 3228  0  1 74 25  0 2016-04-29 17:21:34

I see the error handler start in the QLogic driver.
Keep in mind we are black-holed, so after the RESET we start the process again.

Apr 29 17:15:26 localhost root: Starting test with eh_deadline=10
Apr 29 17:16:54 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:9 --  1 2002.
Apr 29 17:16:54 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:7 --  1 2002.
Apr 29 17:16:54 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:8 --  1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:6 --  1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:4 --  1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:3 --  1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:22 --  1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:23 --  1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:21 --  1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:2 --  1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:19 --  1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:18 --  1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:20 --  1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:1 --  1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 --  1 2002.
Apr 29 17:16:57 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:17 --  1 2002.
Apr 29 17:16:57 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:16 --  1 2002.
Apr 29 17:16:57 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:10 --  1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:14 --  1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:13 --  1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:12 --  1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 --  1 2002.
Apr 29 17:17:00 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:10 --  1 2002.
Apr 29 17:17:00 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:12 --  1 2002.
Apr 29 17:17:01 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 --  1 2002.
Apr 29 17:17:02 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:1 --  1 2002.
Apr 29 17:17:03 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:13 --  1 2002.
Apr 29 17:17:03 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:14 --  1 2002.
Apr 29 17:17:03 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:16 --  1 2002.
Apr 29 17:17:04 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:17 --  1 2002.
Apr 29 17:17:09 localhost kernel: qla2xxx [0000:07:00.1]-8018:1: ADAPTER RESET ISSUED nexus=1:1:12.
Apr 29 17:17:09 localhost kernel: qla2xxx [0000:07:00.1]-00af:1: Performing ISP error recovery - ha=ffff88042a4b0000.
Apr 29 17:17:10 localhost kernel: qla2xxx [0000:07:00.1]-500a:1: LOOP UP detected (4 Gbps).
Apr 29 17:17:10 localhost kernel: qla2xxx [0000:07:00.1]-8017:1: ADAPTER RESET SUCCEEDED nexus=1:1:12.
Apr 29 17:17:30 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:12 --  1 2002.
Apr 29 17:17:34 localhost kernel: qla2xxx [0000:07:00.0]-8018:0: ADAPTER RESET ISSUED nexus=0:1:17.
Apr 29 17:17:34 localhost kernel: qla2xxx [0000:07:00.0]-00af:0: Performing ISP error recovery - ha=ffff88042b030000.
Apr 29 17:17:35 localhost kernel: qla2xxx [0000:07:00.0]-500a:0: LOOP UP detected (4 Gbps).
Apr 29 17:17:35 localhost kernel: qla2xxx [0000:07:00.0]-8017:0: ADAPTER RESET SUCCEEDED nexus=0:1:17.
Apr 29 17:17:40 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:10 --  1 2002.
Apr 29 17:17:50 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:1 --  1 2002.
Apr 29 17:17:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:17 --  1 2002.
Apr 29 17:18:00 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:0 --  1 2002.
Apr 29 17:18:06 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:18 --  1 2002.
Apr 29 17:18:10 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:13 --  1 2002.
Apr 29 17:18:16 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:19 --  1 2002.
Apr 29 17:18:20 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:8 --  1 2002.
Apr 29 17:18:26 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:8 --  1 2002.
Apr 29 17:18:30 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:7 --  1 2002.
Apr 29 17:18:36 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:22 --  1 2002.
Apr 29 17:18:40 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:22 --  1 2002.
Apr 29 17:18:46 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:21 --  1 2002.
Apr 29 17:18:50 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:21 --  1 2002.
..
..
We start seeing the hung tasks

Apr 29 17:19:16 localhost kernel: INFO: task dd:10193 blocked for more than 120 seconds.
Apr 29 17:19:16 localhost kernel:      Not tainted 4.6.0-rc5+ #1
Apr 29 17:19:16 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 29 17:19:16 localhost kernel: dd              D ffff8804303079d8     0 10193  10177 0x00000080
Apr 29 17:19:16 localhost kernel: ffff8804303079d8 ffff880434814140 ffff8800a86515c0 ffff880430308000
Apr 29 17:19:16 localhost kernel: 0000000000000000 7fffffffffffffff 0000000000000000 ffff8804307bfd00
Apr 29 17:19:16 localhost kernel: ffff8804303079f0 ffffffff816ba8e5 ffff880436696e00 ffff880430307aa0
Apr 29 17:19:16 localhost kernel: Call Trace:
Apr 29 17:19:16 localhost kernel: [<ffffffff816ba8e5>] schedule+0x35/0x80
Apr 29 17:19:16 localhost kernel: [<ffffffff816bd661>] schedule_timeout+0x231/0x2d0
Apr 29 17:19:16 localhost kernel: [<ffffffff81315843>] ? __blk_run_queue+0x33/0x40
Apr 29 17:19:16 localhost kernel: [<ffffffff813158ba>] ? queue_unplugged+0x2a/0xb0
Apr 29 17:19:16 localhost kernel: [<ffffffff816b9f66>] io_schedule_timeout+0xa6/0x110
Apr 29 17:19:16 localhost kernel: [<ffffffff81259332>] do_blockdev_direct_IO+0x1b52/0x2180
Apr 29 17:19:16 localhost kernel: [<ffffffff81254320>] ? I_BDEV+0x20/0x20
Apr 29 17:19:16 localhost kernel: [<ffffffff812599a3>] __blockdev_direct_IO+0x43/0x50
Apr 29 17:19:16 localhost kernel: [<ffffffff81254a7c>] blkdev_direct_IO+0x4c/0x50
Apr 29 17:19:16 localhost kernel: [<ffffffff81193ab1>] generic_file_read_iter+0x641/0x7b0
Apr 29 17:19:16 localhost kernel: [<ffffffff8120bcf5>] ? mem_cgroup_commit_charge+0x85/0x100
Apr 29 17:19:16 localhost kernel: [<ffffffff81254e57>] blkdev_read_iter+0x37/0x40
Apr 29 17:19:16 localhost kernel: [<ffffffff81219379>] __vfs_read+0xc9/0x100
Apr 29 17:19:16 localhost kernel: [<ffffffff8121a1ef>] vfs_read+0x7f/0x130
Apr 29 17:19:16 localhost kernel: [<ffffffff8121b6d5>] SyS_read+0x55/0xc0
Apr 29 17:19:16 localhost kernel: [<ffffffff81003c12>] do_syscall_64+0x62/0x110
Apr 29 17:19:16 localhost kernel: [<ffffffff816be4a1>] entry_SYSCALL64_slow_path+0x25/0x25
..
..

Finally, after the serialized timeouts, we get the offline states

Apr 29 17:19:26 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:3 --  1 2002.
Apr 29 17:19:30 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:17 --  1 2002.
Apr 29 17:19:36 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:7 --  1 2002.
Apr 29 17:19:40 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:18 --  1 2002.
Apr 29 17:19:46 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:4 --  1 2002.
Apr 29 17:19:51 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:19 --  1 2002.
Apr 29 17:19:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:6 --  1 2002.
Apr 29 17:20:01 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:2 --  1 2002.
Apr 29 17:20:06 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:9 --  1 2002.
Apr 29 17:20:11 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:23 --  1 2002.
Apr 29 17:20:16 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:12 --  1 2002.
Apr 29 17:20:21 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:3 --  1 2002.
Apr 29 17:20:26 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:10 --  1 2002.
Apr 29 17:20:31 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:4 --  1 2002.
Apr 29 17:20:37 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 --  1 2002.
Apr 29 17:20:41 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:6 --  1 2002.
Apr 29 17:20:47 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:1 --  1 2002.
Apr 29 17:20:51 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:9 --  1 2002.
Apr 29 17:20:51 localhost kernel: sd 1:0:1:12: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:10: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:1: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:0: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:13: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 66:208.
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 68:192.
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 67:224.
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 67:192.
Apr 29 17:20:51 localhost multipathd: mpatha: sdat - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 66:208 in map mpatha
Apr 29 17:20:51 localhost multipathd: mpatha: remaining active paths: 3
Apr 29 17:20:51 localhost multipathd: mpathb: sdby - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 68:192 in map mpathb
Apr 29 17:20:51 localhost multipathd: mpathb: remaining active paths: 3
Apr 29 17:20:51 localhost multipathd: mpathc: sdbk - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 67:224 in map mpathc
Apr 29 17:20:51 localhost multipathd: mpathc: remaining active paths: 3
Apr 29 17:20:51 localhost multipathd: mpathe: sdbi - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 67:192 in map mpathe
Apr 29 17:20:51 localhost multipathd: mpathe: remaining active paths: 3
Apr 29 17:20:51 localhost kernel: sd 1:0:1:8: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:7: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:22: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:21: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:20: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:14: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:16: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:17: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:18: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:19: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:2: Device offlined - not ready after error recovery
..
..
Apr 29 17:20:51 localhost kernel: sd 1:0:1:12: [sdbi] tag#14 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] tag#14 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdbi, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: rejecting I/O to offline device
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] killing request
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] CDB: Read(10) 28 00 00 02 5b 80 00 00 80 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdbi, sector 154496
Apr 29 17:20:52 localhost kernel: sd 1:0:1:10: [sdbk] tag#13 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:10: [sdbk] tag#13 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdbk, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:1: [sdby] tag#16 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:1: [sdby] tag#16 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost multipathd: mpathf: sdbg - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:160 in map mpathf
Apr 29 17:20:52 localhost multipathd: mpathf: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathg: sdbe - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:128 in map mpathg
Apr 29 17:20:52 localhost multipathd: mpathg: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathi: sdbc - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:96 in map mpathi
Apr 29 17:20:52 localhost multipathd: mpathi: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathj: sdba - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:64 in map mpathj
Apr 29 17:20:52 localhost multipathd: mpathj: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathk: sday - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:32 in map mpathk
Apr 29 17:20:52 localhost multipathd: mpathk: remaining active paths: 3
..
..
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:160.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:128.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:96.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:64.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:32.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 66:240.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:160.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:0.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:32.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:64.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:96.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:128.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:96.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:64.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:224.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:32.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:0.
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdby, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:0: [sdat] tag#15 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:0: [sdat] tag#15 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdat, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:13: [sdbg] tag#17 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:13: [sdbg] tag#17 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdbg, sector 0
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: rejecting I/O to offline device
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: [sdbg] killing request
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: [sdbg] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: [sdbg] CDB: Read(10) 28 00 00 02 5d 80 00 00 80 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdbg, sector 155008
Apr 29 17:20:53 localhost kernel: sd 1:0:1:8: [sdbo] tag#31 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:53 localhost kernel: sd 1:0:1:8: [sdbo] tag#31 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdbo, sector 0
Apr 29 17:20:53 localhost kernel: sd 1:0:1:7: [sdca] tag#30 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:53 localhost kernel: sd 1:0:1:7: [sdca] tag#30 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdca, sector 0
Apr 29 17:20:53 localhost kernel: sd 1:0:1:22: [sdcg] tag#26 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
..
Apr 29 17:21:18 localhost multipathd: checker failed path 66:224 in map mpatha
Apr 29 17:21:18 localhost multipathd: mpatha: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathb: sdbx - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:176 in map mpathb
Apr 29 17:21:18 localhost multipathd: mpathb: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathc: sdbj - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:208 in map mpathc
Apr 29 17:21:18 localhost multipathd: mpathc: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathe: sdbh - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:176 in map mpathe
Apr 29 17:21:18 localhost multipathd: mpathe: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathf: sdbf - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:144 in map mpathf
Apr 29 17:21:18 localhost multipathd: mpathf: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathg: sdbd - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:112 in map mpathg
Apr 29 17:21:18 localhost multipathd: mpathg: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathi: sdbb - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:80 in map mpathi
Apr 29 17:21:18 localhost multipathd: mpathi: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpatho: sdbr - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:80 in map mpatho
Apr 29 17:21:18 localhost multipathd: mpatho: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathq: sdbp - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:48 in map mpathq
Apr 29 17:21:18 localhost multipathd: mpathq: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathv: sdbz - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:208 in map mpathv
Apr 29 17:21:18 localhost multipathd: mpathv: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpatht: sdbl - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:240 in map mpatht
Apr 29 17:21:18 localhost multipathd: mpatht: remaining active paths: 2
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 66:224.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:176.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:208.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:176.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:144.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:112.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:80.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:80.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:48.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:208.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:240.
Apr 29 17:21:18 localhost kernel: blk_update_request: I/O error, dev sdaw, sector 0
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: [sdbn] tag#25 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: [sdbn] tag#25 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:21:18 localhost kernel: blk_update_request: I/O error, dev sdbn, sector 0
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: rejecting I/O to offline device
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: [sdbn] killing request


Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services

----- Original Message -----
From: "Laurence Oberman" <loberman@redhat.com>
To: "Bart Van Assche" <bart.vanassche@sandisk.com>
Cc: linux-block@vger.kernel.org, "linux-scsi" <linux-scsi@vger.kernel.org>, "Mike Snitzer" <snitzer@redhat.com>, "James Bottomley" <James.Bottomley@HansenPartnership.com>, "device-mapper development" <dm-devel@redhat.com>, lsf@lists.linux-foundation.org
Sent: Thursday, April 28, 2016 12:47:24 PM
Subject: Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM

Hello Bart, this is when a subset of the paths fails.
As you know, the remaining paths won't be used until the eh_handler is either done or is short-circuited.

What I will do is set this up via my jammer and capture a test using the latest upstream.

Of course my customer pain points are all in the RHEL kernels, so I need to capture a recovery trace
on the latest upstream kernel.

When the SCSI commands for a path are black-holed and remain that way, even with eh_deadline and the short-circuited adapter resets
we simply try again and get back into the wait loop until we finally declare the device offline.

This can take a while and differs depending on QLogic, Emulex, fnic, etc.

First thing tomorrow I will set this up and show you what I mean.

Thanks!!

Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services

----- Original Message -----
From: "Bart Van Assche" <bart.vanassche@sandisk.com>
To: "Laurence Oberman" <loberman@redhat.com>
Cc: linux-block@vger.kernel.org, "linux-scsi" <linux-scsi@vger.kernel.org>, "Mike Snitzer" <snitzer@redhat.com>, "James Bottomley" <James.Bottomley@HansenPartnership.com>, "device-mapper development" <dm-devel@redhat.com>, lsf@lists.linux-foundation.org
Sent: Thursday, April 28, 2016 12:41:26 PM
Subject: Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM

On 04/28/2016 09:23 AM, Laurence Oberman wrote:
> We still suffer from periodic complaints in our large customer base
 > regarding the long recovery times for dm-multipath.
> Most of the time this is when we have something like a switch
 > back-plane issue or an issue where RSCN'S are blocked coming back up
 > the fabric. Corner cases still bite us often.
>
> Most of the complaints originate from customers for example seeing
 > Oracle cluster evictions where during the waiting on the mid-layer
 > all mpath I/O is blocked until recovery.
>
> We have to tune eh_deadline, eh_timeout and fast_io_fail_tmo but
 > even tuning those we have to wait on serial recovery even if we
 > set the timeouts low.
>
> Lately we have been living with
> eh_deadline=10
> eh_timeout=5
> fast_fail_io_tmo=10
> leaving default sd timeout at 30s
>
> So this continues to be an issue and I have specific examples using
 > the jammer I can provide showing the serial recovery times here.

Hello Laurence,

The long recovery times you refer to, is that for a scenario where all 
paths failed or for a scenario where some paths failed and other paths 
are still working? In the latter case, how long does it take before 
dm-multipath fails over to another path?

Thanks,

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
  2016-04-29 21:47             ` Laurence Oberman
  2016-04-29 21:51               ` Laurence Oberman
@ 2016-04-30  0:36               ` Bart Van Assche
  2016-04-30  0:47                 ` Laurence Oberman
  1 sibling, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2016-04-30  0:36 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: James Bottomley, linux-scsi, Mike Snitzer, linux-block,
	device-mapper development, lsf

On 04/29/2016 02:47 PM, Laurence Oberman wrote:
> Recovery with 21 LUNS is 300s that have in-flights to abort.
> [ ... ]
> eh_deadline is set to 10 on the 2 qlogic ports, eh_timeout is set
 > to 10 for all devices. In multipath fast_io_fail_tmo=5
>
> I jam one of the target array ports and discard the commands
 > effectively black-holing the commands and leave it that way until
 > we recover and I watch the I/O. The recovery takes around 300s even
 > with all the tuning and this effectively lands up in Oracle cluster
 > evictions.

Hello Laurence,

This discussion started as a discussion about the time needed to fail 
over from one path to another. How long did it take in your test before 
I/O failed over from the jammed port to another port?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
  2016-04-30  0:36               ` Bart Van Assche
@ 2016-04-30  0:47                 ` Laurence Oberman
  2016-05-02 18:49                   ` Bart Van Assche
  0 siblings, 1 reply; 26+ messages in thread
From: Laurence Oberman @ 2016-04-30  0:47 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: James Bottomley, linux-scsi, Mike Snitzer, linux-block,
	device-mapper development, lsf

Hello Bart

Around 300s before the paths were declared hard failed and the devices offlined.
This is when I/O restarts.
The remaining paths on the second Qlogic port (that are not jammed) will not be used until the error handler activity completes.

Until we get messages like these, for example, and device-mapper starts declaring paths down, we are blocked:
Apr 29 17:20:51 localhost kernel: sd 1:0:1:0: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:13: Device offlined - not ready after error recovery
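
While the error handler is grinding through this you can watch it from the outside with something
like the following (illustrative only; host1 is just an example, pick whichever host owns the stuck
paths):

    cat /sys/class/scsi_host/host1/host_busy      # commands still outstanding on that host
    grep -H . /sys/block/sd*/device/state         # devices flip to "offline" as EH gives up on them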

Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services

----- Original Message -----
From: "Bart Van Assche" <bart.vanassche@sandisk.com>
To: "Laurence Oberman" <loberman@redhat.com>
Cc: "James Bottomley" <James.Bottomley@HansenPartnership.com>, "linux-scsi" <linux-scsi@vger.kernel.org>, "Mike Snitzer" <snitzer@redhat.com>, linux-block@vger.kernel.org, "device-mapper development" <dm-devel@redhat.com>, lsf@lists.linux-foundation.org
Sent: Friday, April 29, 2016 8:36:22 PM
Subject: Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM

On 04/29/2016 02:47 PM, Laurence Oberman wrote:
> Recovery with 21 LUNS is 300s that have in-flights to abort.
> [ ... ]
> eh_deadline is set to 10 on the 2 qlogic ports, eh_timeout is set
 > to 10 for all devices. In multipath fast_io_fail_tmo=5
>
> I jam one of the target array ports and discard the commands
 > effectively black-holing the commands and leave it that way until
 > we recover and I watch the I/O. The recovery takes around 300s even
 > with all the tuning and this effectively lands up in Oracle cluster
 > evictions.

Hello Laurence,

This discussion started as a discussion about the time needed to fail 
over from one path to another. How long did it take in your test before 
I/O failed over from the jammed port to another port?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
  2016-04-30  0:47                 ` Laurence Oberman
@ 2016-05-02 18:49                   ` Bart Van Assche
  2016-05-02 19:28                     ` Laurence Oberman
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2016-05-02 18:49 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: linux-block, linux-scsi, Mike Snitzer, James Bottomley,
	device-mapper development, lsf

On 04/29/2016 05:47 PM, Laurence Oberman wrote:
> From: "Bart Van Assche" <bart.vanassche@sandisk.com>
> To: "Laurence Oberman" <loberman@redhat.com>
> Cc: "James Bottomley" <James.Bottomley@HansenPartnership.com>, "linux-scsi" <linux-scsi@vger.kernel.org>, "Mike Snitzer" <snitzer@redhat.com>, linux-block@vger.kernel.org, "device-mapper development" <dm-devel@redhat.com>, lsf@lists.linux-foundation.org
> Sent: Friday, April 29, 2016 8:36:22 PM
> Subject: Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
>
>> On 04/29/2016 02:47 PM, Laurence Oberman wrote:
>>> Recovery with 21 LUNS is 300s that have in-flights to abort.
>>> [ ... ]
>>> eh_deadline is set to 10 on the 2 qlogic ports, eh_timeout is set
>>> to 10 for all devices. In multipath fast_io_fail_tmo=5
>>>
>>> I jam one of the target array ports and discard the commands
>>> effectively black-holing the commands and leave it that way until
>>> we recover and I watch the I/O. The recovery takes around 300s even
>>> with all the tuning and this effectively lands up in Oracle cluster
>>> evictions.
>>
>> This discussion started as a discussion about the time needed to fail
>> over from one path to another. How long did it take in your test before
>> I/O failed over from the jammed port to another port?
 >
 > Around 300s before the paths were declared hard failed and the
 > devices offlined. This is when I/O restarts.
 > The remaining paths on the second Qlogic port (that are not jammed)
 > will not be used until the error handler activity completes.
 >
 > Until we get these for example, and device-mapper starts declaring
 > paths down we are blocked.
 > Apr 29 17:20:51 localhost kernel: sd 1:0:1:0: Device offlined - not
 > ready after error recovery
 > Apr 29 17:20:51 localhost kernel: sd 1:0:1:13: Device offlined - not
 > ready after error recovery

Hello Laurence,

Everyone else on all mailing lists to which this message has been posted 
replies below the message. Please follow this convention.

Regarding the fail-over time: the ib_srp driver guarantees that 
scsi_done() is invoked from inside its terminate_rport_io() function. 
Apparently the lpfc and the qla2xxx drivers behave differently. Please 
work with the maintainers of these drivers to reduce fail-over time.

Bart.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
  2016-05-02 18:49                   ` Bart Van Assche
@ 2016-05-02 19:28                     ` Laurence Oberman
  2016-05-02 22:28                       ` Bart Van Assche
  0 siblings, 1 reply; 26+ messages in thread
From: Laurence Oberman @ 2016-05-02 19:28 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, linux-scsi, Mike Snitzer, James Bottomley,
	device-mapper development, lsf



Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services

----- Original Message -----
From: "Bart Van Assche" <bart.vanassche@sandisk.com>
To: "Laurence Oberman" <loberman@redhat.com>
Cc: linux-block@vger.kernel.org, "linux-scsi" <linux-scsi@vger.kernel.org>, "Mike Snitzer" <snitzer@redhat.com>, "James Bottomley" <James.Bottomley@HansenPartnership.com>, "device-mapper development" <dm-devel@redhat.com>, lsf@lists.linux-foundation.org
Sent: Monday, May 2, 2016 2:49:54 PM
Subject: Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM

On 04/29/2016 05:47 PM, Laurence Oberman wrote:
> From: "Bart Van Assche" <bart.vanassche@sandisk.com>
> To: "Laurence Oberman" <loberman@redhat.com>
> Cc: "James Bottomley" <James.Bottomley@HansenPartnership.com>, "linux-scsi" <linux-scsi@vger.kernel.org>, "Mike Snitzer" <snitzer@redhat.com>, linux-block@vger.kernel.org, "device-mapper development" <dm-devel@redhat.com>, lsf@lists.linux-foundation.org
> Sent: Friday, April 29, 2016 8:36:22 PM
> Subject: Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
>
>> On 04/29/2016 02:47 PM, Laurence Oberman wrote:
>>> Recovery with 21 LUNs that have in-flight commands to abort takes 300s.
>>> [ ... ]
>>> eh_deadline is set to 10 on the 2 qlogic ports, eh_timeout is set
>>> to 10 for all devices. In multipath fast_io_fail_tmo=5
>>>
>>> I jam one of the target array ports and discard the commands
>>> effectively black-holing the commands and leave it that way until
>>> we recover and I watch the I/O. The recovery takes around 300s even
>>> with all the tuning and this effectively lands up in Oracle cluster
>>> evictions.
>>
>> This discussion started as a discussion about the time needed to fail
>> over from one path to another. How long did it take in your test before
>> I/O failed over from the jammed port to another port?
 >
 > Around 300s before the paths were declared hard failed and the
 > devices offlined. This is when I/O restarts.
 > The remaining paths on the second Qlogic port (that are not jammed)
 > will not be used until the error handler activity completes.
 >
 > Until we get these messages, for example, and device-mapper starts
 > declaring paths down, we are blocked.
 > Apr 29 17:20:51 localhost kernel: sd 1:0:1:0: Device offlined - not
 > ready after error recovery
 > Apr 29 17:20:51 localhost kernel: sd 1:0:1:13: Device offlined - not
 > ready after error recovery

Hello Laurence,

Everyone else on all mailing lists to which this message has been posted 
replies below the message. Please follow this convention.

Regarding the fail-over time: the ib_srp driver guarantees that 
scsi_done() is invoked from inside its terminate_rport_io() function. 
Apparently the lpfc and the qla2xxx drivers behave differently. Please 
work with the maintainers of these drivers to reduce fail-over time.

Bart.

Hello Bart

Even in the case of ib_srp, don't we still have to run the eh_timeout serially for each device that has in-flight commands requiring error handling?
This means we will still have to wait for a path failover until all of them are through the timeout.

Thanks
Laurence

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
  2016-05-02 19:28                     ` Laurence Oberman
@ 2016-05-02 22:28                       ` Bart Van Assche
  2016-05-03 17:44                         ` Laurence Oberman
  0 siblings, 1 reply; 26+ messages in thread
From: Bart Van Assche @ 2016-05-02 22:28 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: James Bottomley, linux-scsi, Mike Snitzer, linux-block,
	device-mapper development, lsf

On 05/02/2016 12:28 PM, Laurence Oberman wrote:
> Even in the case of ib_srp, don't we still have to run the eh_timeout
> serially for each device that has in-flight commands requiring error
> handling? This means we will still have to wait for a path failover
> until all of them are through the timeout.

Hello Laurence,

It depends. If a transport layer error (e.g. a cable pull) has been 
observed by the ib_srp driver then fast_io_fail_tmo seconds later the 
ib_srp driver will terminate all outstanding SCSI commands without 
waiting for the error handler to finish. If no transport layer error has 
been observed then at most (SCSI timeout) + (number of pending commands 
+ 1) * 5 seconds later srp_reset_device() will have finished terminating 
all pending SCSI commands.
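
As a rough illustration of that worst case, assuming the default 30
second SCSI command timeout and, say, 20 pending commands: 30 +
(20 + 1) * 5 = 135 seconds before srp_reset_device() has finished.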

Bart.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM
  2016-05-02 22:28                       ` Bart Van Assche
@ 2016-05-03 17:44                         ` Laurence Oberman
  0 siblings, 0 replies; 26+ messages in thread
From: Laurence Oberman @ 2016-05-03 17:44 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: James Bottomley, linux-scsi, Mike Snitzer, linux-block,
	device-mapper development, lsf



----- Original Message -----
From: "Bart Van Assche" <bart.vanassche@sandisk.com>
To: "Laurence Oberman" <loberman@redhat.com>
Cc: "James Bottomley" <James.Bottomley@HansenPartnership.com>, "linux-scsi" <linux-scsi@vger.kernel.org>, "Mike Snitzer" <snitzer@redhat.com>, linux-block@vger.kernel.org, "device-mapper development" <dm-devel@redhat.com>, lsf@lists.linux-foundation.org
Sent: Monday, May 2, 2016 6:28:16 PM
Subject: Re: [dm-devel] [Lsf] Notes from the four separate IO track sessions at LSF/MM

On 05/02/2016 12:28 PM, Laurence Oberman wrote:
> Even in the case of ib_srp, don't we still have to run the eh_timeout
> serially for each device that has in-flight commands requiring error
> handling? This means we will still have to wait for a path failover
> until all of them are through the timeout.

Hello Laurence,

It depends. If a transport layer error (e.g. a cable pull) has been 
observed by the ib_srp driver then fast_io_fail_tmo seconds later the 
ib_srp driver will terminate all outstanding SCSI commands without 
waiting for the error handler to finish. If no transport layer error has 
been observed then at most (SCSI timeout) + (number of pending commands 
+ 1) * 5 seconds later srp_reset_device() will have finished terminating 
all pending SCSI commands.

Bart.

Hello Bart

OK, yes, that lines up with my testing here with Qlogic and Emulex.
I am about to test srp but I need to add some jammer code first.
The link down and other interruptions will always be fast. 
It's always going to be the black-hole events that are troublesome.

Thanks
Laurence

^ permalink raw reply	[flat|nested] 26+ messages in thread

* bio-based DM multipath is back from the dead [was: Re: Notes from the four separate IO track sessions at LSF/MM]
  2016-04-28 15:40   ` James Bottomley
  2016-04-28 15:53     ` [Lsf] " Bart Van Assche
@ 2016-05-26  2:38     ` Mike Snitzer
  2016-05-27  8:39         ` Hannes Reinecke
  1 sibling, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2016-05-26  2:38 UTC (permalink / raw)
  To: James Bottomley
  Cc: linux-block, lsf, device-mapper development, linux-scsi, hch

On Thu, Apr 28 2016 at 11:40am -0400,
James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

> On Thu, 2016-04-28 at 08:11 -0400, Mike Snitzer wrote:
> > Full disclosure: I'll be looking at reinstating bio-based DM multipath to
> > regain efficiencies that now really matter when issuing IO to extremely
> > fast devices (e.g. NVMe).  bio cloning is now very cheap (due to
> > immutable biovecs), coupled with the emerging multipage biovec work that
> > will help construct larger bios, so I think it is worth pursuing to at
> > least keep our options open.

Please see the 4 topmost commits I've published here:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.8

All request-based DM multipath support/advances have been completely
preserved.  I've just made it so that we can now have bio-based DM
multipath too.

All of the various modes have been tested using mptest:
https://github.com/snitm/mptest

> OK, but remember the reason we moved from bio to request was partly to
> be nearer to the device but also because at that time requests were
> accumulations of bios which had to be broken out, go back up the stack
> individually and be re-elevated, which adds to the inefficiency.  In
> theory the bio splitting work will mean that we only have one or two
> split bios per request (because they were constructed from a split up
> huge bio), but when we send them back to the top to be reconstructed as
> requests there's no guarantee that the split will be correct a second
> time around and we might end up resplitting the already split bios.  If
> you do reassembly into the huge bio again before resend down the next
> queue, that's starting to look like quite a lot of work as well.

I've not even delved into the level you're laser-focused on here.
But I'm struggling to grasp why multipath is any different than any
other bio-based device...

FYI, the paper I reference in my "dm mpath: reinstate bio-based support"
commit gets into what I've always thought the real justification was for
the transition from bio-based to request-based.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bio-based DM multipath is back from the dead [was: Re: Notes from the four separate IO track sessions at LSF/MM]
  2016-05-26  2:38     ` bio-based DM multipath is back from the dead [was: Re: Notes from the four separate IO track sessions at LSF/MM] Mike Snitzer
@ 2016-05-27  8:39         ` Hannes Reinecke
  0 siblings, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2016-05-27  8:39 UTC (permalink / raw)
  To: Mike Snitzer, James Bottomley
  Cc: linux-block, lsf, device-mapper development, linux-scsi, hch

On 05/26/2016 04:38 AM, Mike Snitzer wrote:
> On Thu, Apr 28 2016 at 11:40am -0400,
> James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>
>> On Thu, 2016-04-28 at 08:11 -0400, Mike Snitzer wrote:
>>> Full disclosure: I'll be looking at reinstating bio-based DM multipath to
>>> regain efficiencies that now really matter when issuing IO to extremely
>>> fast devices (e.g. NVMe).  bio cloning is now very cheap (due to
>>> immutable biovecs), coupled with the emerging multipage biovec work that
>>> will help construct larger bios, so I think it is worth pursuing to at
>>> least keep our options open.
>
> Please see the 4 topmost commits I've published here:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.8
>
> All request-based DM multipath support/advances have been completely
> preserved.  I've just made it so that we can now have bio-based DM
> multipath too.
>
> All of the various modes have been tested using mptest:
> https://github.com/snitm/mptest
>
>> OK, but remember the reason we moved from bio to request was partly to
>> be nearer to the device but also because at that time requests were
>> accumulations of bios which had to be broken out, go back up the stack
>> individually and be re-elevated, which adds to the inefficiency.  In
>> theory the bio splitting work will mean that we only have one or two
>> split bios per request (because they were constructed from a split up
>> huge bio), but when we send them back to the top to be reconstructed as
>> requests there's no guarantee that the split will be correct a second
>> time around and we might end up resplitting the already split bios.  If
>> you do reassembly into the huge bio again before resend down the next
>> queue, that's starting to look like quite a lot of work as well.
>
> I've not even delved into the level you're laser-focused on here.
> But I'm struggling to grasp why multipath is any different than any
> other bio-based device...
>
Actually, _failover_ is not the primary concern. This is on a (relatively) 
slow path so any performance degradation during failover is acceptable.

No, the real issue is load-balancing.
If you have several paths you have to schedule I/O across all paths, 
_and_ you should be feeding these paths efficiently.

With the original (bio-based) layout you had to schedule on the bio 
level, causing the requests to be inefficiently assembled.
Hence the 'rr_min_io' parameter, which switches paths after rr_min_io 
_bios_. I did some experimenting a while back (I even had a presentation 
on this at LSF at one point ...) and found that you would get a 
performance degradation once the rr_min_io parameter went below 100.
But this means that paths will only be switched after every 100 bios, 
irrespective of how many requests they will be assembled into.
It also means that we have a rather 'choppy' load-balancing behaviour, 
and cannot achieve 'true' load balancing as the I/O scheduler on the bio 
level doesn't have any idea when a new request will be assembled.
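
For illustration only, a minimal user-level sketch (not the actual
dm-round-robin selector code) of what rr_min_io-style switching at the
bio level amounts to -- the selector counts bios, not requests:

struct rr_selector {
        unsigned int nr_paths;      /* number of usable paths */
        unsigned int current_path;  /* index of the active path */
        unsigned int repeat_count;  /* rr_min_io, e.g. 100 */
        unsigned int count;         /* bios sent down the active path */
};

/* Returns the index of the path the next bio should be sent down. */
static unsigned int rr_select_path(struct rr_selector *s)
{
        if (++s->count >= s->repeat_count) {
                s->count = 0;
                s->current_path = (s->current_path + 1) % s->nr_paths;
        }
        return s->current_path;
}

Because the counter is in bios, the switch happens irrespective of how
those bios are later merged into requests, which is exactly the 'choppy'
behaviour described above.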

I was sort-of hoping that with the large bio work from Shaohua we could 
build bio which would not require any merging, ie building bios which 
would be assembled into a single request per bio.
Then the above problem wouldn't exist anymore and we _could_ do 
scheduling on bio level.
But from what I've gathered this is not always possible (eg for btrfs 
with delayed allocation).

Have you found another way of addressing this problem?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bio-based DM multipath is back from the dead [was: Re: Notes from the four separate IO track sessions at LSF/MM]
  2016-05-27  8:39         ` Hannes Reinecke
  (?)
@ 2016-05-27 14:44         ` Mike Snitzer
  2016-05-27 15:42             ` Hannes Reinecke
  -1 siblings, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2016-05-27 14:44 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: James Bottomley, linux-block, lsf, device-mapper development,
	hch, linux-scsi, axboe, Ming Lei

On Fri, May 27 2016 at  4:39am -0400,
Hannes Reinecke <hare@suse.de> wrote:

> On 05/26/2016 04:38 AM, Mike Snitzer wrote:
> >On Thu, Apr 28 2016 at 11:40am -0400,
> >James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> >
> >>On Thu, 2016-04-28 at 08:11 -0400, Mike Snitzer wrote:
> >>>Full disclosure: I'll be looking at reinstating bio-based DM multipath to
> >>>regain efficiencies that now really matter when issuing IO to extremely
> >>>fast devices (e.g. NVMe).  bio cloning is now very cheap (due to
> >>>immutable biovecs), coupled with the emerging multipage biovec work that
> >>>will help construct larger bios, so I think it is worth pursuing to at
> >>>least keep our options open.
> >
> >Please see the 4 topmost commits I've published here:
> >https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.8
> >
> >All request-based DM multipath support/advances have been completely
> >preserved.  I've just made it so that we can now have bio-based DM
> >multipath too.
> >
> >All of the various modes have been tested using mptest:
> >https://github.com/snitm/mptest
> >
> >>OK, but remember the reason we moved from bio to request was partly to
> >>be nearer to the device but also because at that time requests were
> >>accumulations of bios which had to be broken out, go back up the stack
> >>individually and be re-elevated, which adds to the inefficiency.  In
> >>theory the bio splitting work will mean that we only have one or two
> >>split bios per request (because they were constructed from a split up
> >>huge bio), but when we send them back to the top to be reconstructed as
> >>requests there's no guarantee that the split will be correct a second
> >>time around and we might end up resplitting the already split bios.  If
> >>you do reassembly into the huge bio again before resend down the next
> >>queue, that's starting to look like quite a lot of work as well.
> >
> >I've not even delved into the level you're laser-focused on here.
> >But I'm struggling to grasp why multipath is any different than any
> >other bio-based device...
> >
> Actually, _failover_ is not the primary concern. This is on a
> (relatively) slow path so any performance degradation during failover
> is acceptable.
> 
> No, the real issue is load-balancing.
> If you have several paths you have to schedule I/O across all paths,
> _and_ you should be feeding these paths efficiently.

<snip well known limitation of bio-based mpath load balancing, also
detailed in the multipath paper I referenced>

Right, as my patch header details, this is the only limitation that
remains with the reinstated bio-based DM multipath.

> I was sort-of hoping that with the large bio work from Shaohua we

I think you mean Ming Lei and his multipage biovec work?

> could build bio which would not require any merging, ie building
> bios which would be assembled into a single request per bio.
> Then the above problem wouldn't exist anymore and we _could_ do
> scheduling on bio level.
> But from what I've gathered this is not always possible (eg for
> btrfs with delayed allocation).

I doubt many people are running btrfs over multipath in production
but...

Taking a step back: reinstating bio-based DM multipath is _not_ at the
expense of request-based DM multipath.  As you can see I've made it so
that all modes (bio-based, request_fn rq-based, and blk-mq rq-based) are
supported by a single DM multipath target.  When the transition to
request-based happened it would've been wise to preserve bio-based but I
digress...

So, the point is: there isn't any one-size-fits-all DM multipath queue
mode here.  If a storage config benefits from the request_fn IO
schedulers (but isn't hurt by .request_fn's queue lock, so slower
rotational storage?) then use queue_mode=2.  If the storage is connected
to a large NUMA system and there is some reason to want to use blk-mq
request_queue at the DM level: use queue_mode=3.  If the storage is
_really_ fast and doesn't care about extra IO grooming (e.g. sorting and
merging) then select bio-based using queue_mode=1.
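
For illustration, a hypothetical dmsetup table line for a two-path
bio-based map, assuming the keyword form of the feature ('queue_mode
bio', with 'rq' and 'mq' selecting the two request-based modes); the
device numbers, size and exact argument layout here are made up for the
example rather than taken from these patches:

  0 16777216 multipath 2 queue_mode bio 0 1 1 round-robin 0 2 1 8:16 1 8:32 1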

I collected some quick performance numbers against a null_blk device, on
a single NUMA node system, with various DM layers ontop -- the multipath
runs are only with a single path... fio workload is just 10 sec randread:

FIO_QUEUE_DEPTH=32
FIO_RUNTIME=10
FIO_NUMJOBS=12
${FIO} --numa_cpu_nodes=${NID} --numa_mem_policy=bind:${NID} --cpus_allowed_policy=split --group_reporting --rw=randread --bs=4k --numjobs=${FIO_NUMJOBS} \
              --iodepth=${FIO_QUEUE_DEPTH} --runtime=${FIO_RUNTIME} --time_based --loops=1 --ioengine=libaio \
              --direct=1 --invalidate=1 --randrepeat=1 --norandommap --exitall --name task_${TASK_NAME} --filename=${DEVICE}

I need real hardware (NVMe over Fabrics please!) to really test this
stuff properly; but I think the following results at least approximate
the relative performance of each multipath mode.

On null_blk blk-mq
------------------

baseline:
null_blk blk-mq       iops=1936.3K
dm-linear             iops=1616.1K

multipath using round-robin path-selector:
bio-based             iops=1579.8K
blk-mq rq-based       iops=1411.6K
request_fn rq-based   iops=326491

multipath using queue-length path-selector:
bio-based             iops=1526.2K
blk-mq rq-based       iops=1351.9K
request_fn rq-based   iops=326399

On null_blk bio-based
---------------------

baseline:
null_blk blk-mq       iops=2776.8K
dm-linear             iops=2183.5K

multipath using round-robin path-selector:
bio-based             iops=2101.5K

multipath using queue-length path-selector:
bio-based             iops=2019.4K

I haven't even looked at optimizing bio-based DM yet.. not liking that
dm-linear is taking a ~15% - ~20% hit from baseline null_blk.  But nice
to see bio-based multipath is very comparable to dm-linear.  So any
future bio-based DM performance advances should translate to better
multipath perf.
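
(Working that out from the blk-mq numbers above: dm-linear at 1616.1K
iops is roughly 83% of the 1936.3K baseline, i.e. a ~17% hit, and
bio-based multipath at 1579.8K iops is within about 2% of dm-linear.)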

> Have you found another way of addressing this problem?

No, bio sorting/merging really isn't a problem for DM multipath to
solve.

Though Jens did say (in the context of one of these dm-crypt bulk mode
threads) that the block core _could_ grow some additional _minimalist_
capability for bio merging:
https://www.redhat.com/archives/dm-devel/2015-November/msg00130.html

I'd like to understand a bit more about what Jens is thinking in that
area because it could benefit DM thinp as well (though that is using bio
sorting rather than merging, introduced via commit 67324ea188).

I'm not opposed to any line of future development -- but development
needs to be driven by observed limitations while testing on _real_
hardware.

Mike

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bio-based DM multipath is back from the dead [was: Re: Notes from the four separate IO track sessions at LSF/MM]
  2016-05-27 14:44         ` Mike Snitzer
@ 2016-05-27 15:42             ` Hannes Reinecke
  0 siblings, 0 replies; 26+ messages in thread
From: Hannes Reinecke @ 2016-05-27 15:42 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: James Bottomley, linux-block, lsf, device-mapper development,
	hch, linux-scsi, axboe, Ming Lei

On 05/27/2016 04:44 PM, Mike Snitzer wrote:
> On Fri, May 27 2016 at  4:39am -0400,
> Hannes Reinecke <hare@suse.de> wrote:
>
[ .. ]
>> No, the real issue is load-balancing.
>> If you have several paths you have to schedule I/O across all paths,
>> _and_ you should be feeding these paths efficiently.
>
> <snip well known limitation of bio-based mpath load balancing, also
> detailed in the multipath paper I referenced>
>
> Right, as my patch header details, this is the only limitation that
> remains with the reinstated bio-based DM multipath.
>

:-)
And the very reason why we went into request-based multipathing in the 
first place...

>> I was sort-of hoping that with the large bio work from Shaohua we
>
> I think you mean Ming Lei and his multipage biovec work?
>
Errm. Yeah, of course. Apologies.

>> could build bio which would not require any merging, ie building
>> bios which would be assembled into a single request per bio.
>> Then the above problem wouldn't exist anymore and we _could_ do
>> scheduling on bio level.
>> But from what I've gathered this is not always possible (eg for
>> btrfs with delayed allocation).
>
> I doubt many people are running btrfs over multipath in production
> but...
>
Hey. There is a company who does ...

> Taking a step back: reinstating bio-based DM multipath is _not_ at the
> expense of request-based DM multipath.  As you can see I've made it so
> that all modes (bio-based, request_fn rq-based, and blk-mq rq-based) are
> supported by a single DM multipath target.  When the transition to
> request-based happened it would've been wise to preserve bio-based but I
> digress...
>
> So, the point is: there isn't any one-size-fits-all DM multipath queue
> mode here.  If a storage config benefits from the request_fn IO
> schedulers (but isn't hurt by .request_fn's queue lock, so slower
> rotational storage?) then use queue_mode=2.  If the storage is connected
> to a large NUMA system and there is some reason to want to use blk-mq
> request_queue at the DM level: use queue_mode=3.  If the storage is
> _really_ fast and doesn't care about extra IO grooming (e.g. sorting and
> merging) then select bio-based using queue_mode=1.
>
> I collected some quick performance numbers against a null_blk device, on
> a single NUMA node system, with various DM layers ontop -- the multipath
> runs are only with a single path... fio workload is just 10 sec randread:
>
Which is precisely the point.
Everything's nice and shiny with a single path, as then the above issue 
simply doesn't apply.
Things only start getting interesting if you have _several_ paths.
So the benchmarks only prove that device-mapper doesn't add too much of 
an overhead; they don't prove that the above point has been addressed...

[ .. ]
>> Have you found another way of addressing this problem?
>
> No, bio sorting/merging really isn't a problem for DM multipath to
> solve.
>
> Though Jens did say (in the context of one of these dm-crypt bulk mode
> threads) that the block core _could_ grow some additional _minimalist_
> capability for bio merging:
> https://www.redhat.com/archives/dm-devel/2015-November/msg00130.html
>
> I'd like to understand a bit more about what Jens is thinking in that
> area because it could benefit DM thinp as well (though that is using bio
> sorting rather than merging, introduced via commit 67324ea188).
>
> I'm not opposed to any line of future development -- but development
> needs to be driven by observed limitations while testing on _real_
> hardware.
>
In the end, with Ming Lei's multipage bvec work we essentially already 
moved some merging ability into the bios; during bio_add_page() the 
block layer will already merge bios together.
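
For context, a minimal sketch of that "merge while building the bio"
idea against the 2016-era in-kernel bio API (names are illustrative and
error handling is omitted): keep adding consecutive pages while
bio_add_page() accepts them, so the data ends up in one large bio rather
than in many small bios that have to be merged into a request later.

static struct bio *build_large_bio(struct block_device *bdev,
                                   sector_t sector, struct page **pages,
                                   unsigned int nr_pages)
{
        struct bio *bio = bio_alloc(GFP_NOIO, nr_pages);
        unsigned int i;

        bio->bi_bdev = bdev;
        bio->bi_iter.bi_sector = sector;
        for (i = 0; i < nr_pages; i++)
                if (!bio_add_page(bio, pages[i], PAGE_SIZE, 0))
                        break;  /* bio full: submit it and start a new one */
        return bio;
}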

(I'll probably be yelled at by hch for ignorance for the following, but 
nevertheless)
 From my POV there are several areas of 'merging' which currently happen:
a) bio merging: combine several consecutive bios into a larger one; 
should be largely addressed by Ming Lei's multipage bvec work
b) bio sorting: reshuffle bios so that any requests on the request queue 
are ordered 'best' for the underlying hardware (ie the actual I/O 
scheduler). Not implemented for mq, and actually of questionable value 
for fast storage. One of the points I'll be testing in the very near 
future; ideally we find that it's not _that_ important (compared to the 
previous point), then we could drop it altogether for mq.
c) clustering: coalescing several consecutive pages/bvecs into a single 
SG element. Obviously only can happen if you have large enough requests.
But the only gain is reducing the number of SG elements for a request.
Again of questionable value as the request itself and the amount of data 
to transfer isn't changed. And another point of performance testing on 
my side.

So ideally we will find that b) and c) contribute only a small amount 
to the overall performance; then we could easily drop them for MQ 
and concentrate on making bio merging work well.
Then it wouldn't really matter if we were doing bio-based or 
request-based multipathing as we had a 1:1 relationship, and this entire 
discussion could go away.

Well. Or that's the hope, at least.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bio-based DM multipath is back from the dead [was: Re: Notes from the four separate IO track sessions at LSF/MM]
  2016-05-27 15:42             ` Hannes Reinecke
  (?)
@ 2016-05-27 16:10             ` Mike Snitzer
  -1 siblings, 0 replies; 26+ messages in thread
From: Mike Snitzer @ 2016-05-27 16:10 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: James Bottomley, linux-block, lsf, device-mapper development,
	hch, linux-scsi, axboe, Ming Lei

On Fri, May 27 2016 at 11:42am -0400,
Hannes Reinecke <hare@suse.de> wrote:

> On 05/27/2016 04:44 PM, Mike Snitzer wrote:
> >On Fri, May 27 2016 at  4:39am -0400,
> >Hannes Reinecke <hare@suse.de> wrote:
> >
> [ .. ]
> >>No, the real issue is load-balancing.
> >>If you have several paths you have to schedule I/O across all paths,
> >>_and_ you should be feeding these paths efficiently.
> >
> ><snip well known limitation of bio-based mpath load balancing, also
> >detailed in the multipath paper I referenced>
> >
> >Right, as my patch header details, this is the only limitation that
> >remains with the reinstated bio-based DM multipath.
> >
> 
> :-)
> And the very reason why we went into request-based multipathing in
> the first place...
> 
> >>I was sort-of hoping that with the large bio work from Shaohua we
> >
> >I think you mean Ming Lei and his multipage biovec work?
> >
> Errm. Yeah, of course. Apologies.
> 
> >>could build bio which would not require any merging, ie building
> >>bios which would be assembled into a single request per bio.
> >>Then the above problem wouldn't exist anymore and we _could_ do
> >>scheduling on bio level.
> >>But from what I've gathered this is not always possible (eg for
> >>btrfs with delayed allocation).
> >
> >I doubt many people are running btrfs over multipath in production
> >but...
> >
> Hey. There is a company who does ...
> 
> >Taking a step back: reinstating bio-based DM multipath is _not_ at the
> >expense of request-based DM multipath.  As you can see I've made it so
> >that all modes (bio-based, request_fn rq-based, and blk-mq rq-based) are
> >supported by a single DM multipath target.  When the transition to
> >request-based happened it would've been wise to preserve bio-based but I
> >digress...
> >
> >So, the point is: there isn't any one-size-fits-all DM multipath queue
> >mode here.  If a storage config benefits from the request_fn IO
> >schedulers (but isn't hurt by .request_fn's queue lock, so slower
> >rotational storage?) then use queue_mode=2.  If the storage is connected
> >to a large NUMA system and there is some reason to want to use blk-mq
> >request_queue at the DM level: use queue_mode=3.  If the storage is
> >_really_ fast and doesn't care about extra IO grooming (e.g. sorting and
> >merging) then select bio-based using queue_mode=1.
> >
> >I collected some quick performance numbers against a null_blk device, on
> >a single NUMA node system, with various DM layers ontop -- the multipath
> >runs are only with a single path... fio workload is just 10 sec randread:
> >
> Which is precisely the point.
> Everything's nice and shiny with a single path, as then the above
> issue simply doesn't apply.

Heh, as you can see from the request_fn results, that wasn't the case
until very recently with all the DM multipath blk-mq advances..

But my broader thesis is that for really fast storage it is looking
increasingly likely that we don't _need_ or want to have the
multipathing layer dealing with requests.  Not unless there is some
inherent big win.  request cloning is definitely heavier than bio
cloning.

And as you can probably infer, my work to reinstate bio-based multipath
is focused precisely at the fast storage case in the hopes of avoiding
hch's threat to pull multipathing down into the NVMe over fabrics
driver.

> Things only start getting interesting if you have _several_ paths.
> So the benchmarks only prove that device-mapper doesn't add too much
> of an overhead; they don't prove that the above point has been
> addressed...

Right, but I don't really care if it is addressed by bio-based because
we have the request_fn mode that offers the legacy IO schedulers.  The
fact that request_fn multipath has been adequate for the enterprise
rotational storage arrays is somewhat surprising... the queue_lock is a
massive bottleneck.

But if bio merging (via multipage biovecs) does prove itself to be a win
for bio-based multipath for all storage (slow and fast) then yes that'll
be really good news.  Nice to have options... we can dial in the option
that is best for a specific usecase/deployment and fix what isn't doing
well.

> [ .. ]
> >>Have you found another way of addressing this problem?
> >
> >No, bio sorting/merging really isn't a problem for DM multipath to
> >solve.
> >
> >Though Jens did say (in the context of one of these dm-crypt bulk mode
> >threads) that the block core _could_ grow some additional _minimalist_
> >capability for bio merging:
> >https://www.redhat.com/archives/dm-devel/2015-November/msg00130.html
> >
> >I'd like to understand a bit more about what Jens is thinking in that
> >area because it could benefit DM thinp as well (though that is using bio
> >sorting rather than merging, introduced via commit 67324ea188).
> >
> >I'm not opposed to any line of future development -- but development
> >needs to be driven by observed limitations while testing on _real_
> >hardware.
> >
> In the end, with Ming Lei's multipage bvec work we essentially
> already moved some merging ability into the bios; during
> bio_add_page() the block layer will already merge bios together.
> 
> (I'll probably be yelled at by hch for ignorance for the following,
> but nevertheless)
> From my POV there are several areas of 'merging' which currently happen:
> a) bio merging: combine several consecutive bios into a larger one;
> should be largely addressed by Ming Lei's multipage bvec work
> b) bio sorting: reshuffle bios so that any requests on the request
> queue are ordered 'best' for the underlying hardware (ie the actual
> I/O scheduler). Not implemented for mq, and actually of questionable
> value for fast storage. One of the points I'll be testing in the
> very near future; ideally we find that it's not _that_ important
> (compared to the previous point), then we could drop it altogether
> for mq.
> c) clustering: coalescing several consecutive pages/bvecs into a
> single SG element. Obviously only can happen if you have large
> enough requests.
> But the only gain is reducing the number of SG elements for a request.
> Again of questionable value as the request itself and the amount of
> data to transfer isn't changed. And another point of performance
> testing on my side.
> 
> So ideally we will find that b) and c) contribute only a small amount
> to the overall performance; then we could easily drop them for MQ and
> concentrate on making bio merging work well.
> Then it wouldn't really matter if we were doing bio-based or
> request-based multipathing as we had a 1:1 relationship, and this
> entire discussion could go away.
> 
> Well. Or that's the hope, at least.

Yeap, let the testing begin! ;)

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-05-27 16:10 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-27 23:39 Notes from the four separate IO track sessions at LSF/MM James Bottomley
2016-04-28 12:11 ` Mike Snitzer
2016-04-28 15:40   ` James Bottomley
2016-04-28 15:53     ` [Lsf] " Bart Van Assche
2016-04-28 16:19       ` Knight, Frederick
2016-04-28 16:37         ` Bart Van Assche
2016-04-28 17:33         ` James Bottomley
2016-04-28 16:23       ` Laurence Oberman
2016-04-28 16:41         ` [dm-devel] " Bart Van Assche
2016-04-28 16:47           ` Laurence Oberman
2016-04-29 21:47             ` Laurence Oberman
2016-04-29 21:51               ` Laurence Oberman
2016-04-30  0:36               ` Bart Van Assche
2016-04-30  0:47                 ` Laurence Oberman
2016-05-02 18:49                   ` Bart Van Assche
2016-05-02 19:28                     ` Laurence Oberman
2016-05-02 22:28                       ` Bart Van Assche
2016-05-03 17:44                         ` Laurence Oberman
2016-05-26  2:38     ` bio-based DM multipath is back from the dead [was: Re: Notes from the four separate IO track sessions at LSF/MM] Mike Snitzer
2016-05-27  8:39       ` Hannes Reinecke
2016-05-27 14:44         ` Mike Snitzer
2016-05-27 15:42           ` Hannes Reinecke
2016-05-27 16:10             ` Mike Snitzer
2016-04-29 16:45 ` [dm-devel] Notes from the four separate IO track sessions at LSF/MM Benjamin Marzinski
