Re: [PATCH 1/1]: scsi dm-mpath do not fail paths which are in ALUA state transitioning

From: Brian Bunker <brian@purestorage.com>
To: Martin Wilck <mwilck@suse.com>
Cc: linux-scsi@vger.kernel.org, Hannes Reinecke <hare@suse.com>
Subject: Re: [PATCH 1/1]: scsi dm-mpath do not fail paths which are in ALUA state transitioning
Date: Tue, 13 Jul 2021 17:37:53 -0700
Message-ID: <CAHZQxy+crC90wWuHMKA=9CE-gHSiDTEC_jQDnH0Otx=R7PM-SQ@mail.gmail.com>
In-Reply-To: <CAHZQxyLEsQWjTV_P8YPhConyQiOOtzc+oNmuT=Oi1=WMyysmCg@mail.gmail.com>

On Tue, Jul 13, 2021 at 2:13 AM Martin Wilck <mwilck@suse.com> wrote:
>
> Hello Brian,
>
> On Mo, 2021-07-12 at 14:38 -0700, Brian Bunker wrote:
> > Martin,
> >
> > > Please confirm that your kernel included ee8868c5c78f ("scsi:
> > > scsi_dh_alua: Retry RTPG on a different path after failure").
> > > That commit should cause the RTPG to be retried on other map
> > > members which are not in failed state, thus avoiding this phenomenon.
> >
> > In my case, there are no other map members that are not in the failed
> > state. One set of paths goes to the ALUA unavailable state when the
> > primary fails, and the second set of paths moves to ALUA state
> > transitioning as the previous secondary becomes the primary.
>
> IMO this is the problem. How does your array respond to SCSI commands
> while ports are transitioning?
>
> SPC5 (§5.16.2.6) says that the server should either fail all commands
> with BUSY or CHECK CONDITION/NOT READY/LOGICAL UNIT NOT
> ACCESSIBLE/ASYMMETRIC ACCESS STATE TRANSITION (a), or should support
> all TMFs and a subset of SCSI commands, while responding with
> CC/NR/AAST to all other commands (b). SPC6 (§5.18.2.6) is no different.
>
> No matter how you read that paragraph, it's pretty clear that
> "transitioning" is generally not a healthy state to attempt I/O.
>
> Are you saying that on your server, the transitioning ports are able to
> process regular I/O commands like READ and WRITE? If that's the case,
> why do you pretend to be "transitioning" at all, rather than in an
> active state? If it's not the case, why would it make sense for the
> host to retry I/O on the transitioning path?

In the ALUA transitioning state, we cannot process READ or WRITE
commands and return the sense data you mentioned above. We expect the
host to retry down that transitioning path until it moves to another
ALUA state (at least for some reasonable period of time allowed for
the transition). The spec defines the state as the transition between
target asymmetric states. The current implementation requires the
target to coordinate so that it never returns a state transition down
all paths at the same time, or it risks all paths being failed. Using
the ALUA transitioning state allows us to respond to initiator READ
and WRITE requests even though we cannot serve them while our internal
target state is transitioning (secondary to primary). The alternative
is to queue them, which presents a different set of problems.
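
To make the expectation concrete, here is a rough userspace sketch of
the host behavior I am describing. It is not kernel code. The sense
values are the SPC ones for NOT READY / LOGICAL UNIT NOT ACCESSIBLE,
ASYMMETRIC ACCESS STATE TRANSITION (02h, 04h/0Ah); the struct, the
verdict enum, the function names and the retry window are all
hypothetical and only illustrate "retry on the same path for a bounded
time, then give up".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* SPC sense values: NOT READY, ASC/ASCQ 04h/0Ah. */
#define SENSE_KEY_NOT_READY    0x02
#define ASC_LU_NOT_ACCESSIBLE  0x04
#define ASCQ_ALUA_TRANSITION   0x0a

/* Hypothetical container for the three fields a host would inspect. */
struct sense_triplet {
	uint8_t key;
	uint8_t asc;
	uint8_t ascq;
};

/* Hypothetical verdicts; these are not kernel disposition codes. */
enum io_verdict {
	IO_RETRY_SAME_PATH,
	IO_FAIL_PATH,
};

/* True only for the "asymmetric access state transition" sense data. */
bool is_alua_transitioning(struct sense_triplet s)
{
	return s.key == SENSE_KEY_NOT_READY &&
	       s.asc == ASC_LU_NOT_ACCESSIBLE &&
	       s.ascq == ASCQ_ALUA_TRANSITION;
}

/*
 * Retry on the same path while the port reports "transitioning" and the
 * allowed window has not elapsed; otherwise give up on the path.
 */
enum io_verdict classify_failed_io(struct sense_triplet s,
				   unsigned int elapsed_ms,
				   unsigned int window_ms)
{
	assert(window_ms > 0);
	if (is_alua_transitioning(s) && elapsed_ms < window_ms)
		return IO_RETRY_SAME_PATH;
	return IO_FAIL_PATH;
}
```

The bounded window is the point of the sketch: it avoids retrying
forever on a path that never leaves the transitioning state, while
still giving the target time to finish the transition.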

>
> >  If the
> > paths are failed which are transitioning, an all paths down state
> > happens which is not expected.
>
> IMO it _is_ expected if in fact no path is able to process SCSI
> commands at the given point in time.

In that case, it would seem that having all paths move to
transitioning leads to all paths being lost. It is possible to imagine
implementations where, for a brief period of time, all paths are in a
transitioning state. What would be the point of returning a transient
state if the result is a permanent failure?

>
> >  There should be a time for which
> > transitioning is a transient state until the next state is entered.
> > Failing a path assuming there would be non-failed paths seems wrong.
>
> This is a misunderstanding. The path isn't failed because of
> assumptions about other paths. It is failed because we know that it's
> non-functional, and thus we must try other paths, if there are any.
>
> Before 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA
> transitioning state"), I/O was indeed retried on transitioning paths,
> possibly forever. This posed a serious problem when a transitioning
> path was about to be removed (e.g. dev_loss_tmo expired). And I'm still
> quite convinced that it was wrong in general, because by all reasonable
> means a "transitioning" path isn't usable for the host.
>
> If we find a transitioning path, it might make sense to retry on other
> paths first and eventually switch back to the transitioning path, when
> all others have failed hard (e.g. "unavailable" state). However, this
> logic doesn't exist in the kernel. In theory, it could be mapped to a
> "transitioning" priority group in device-mapper multipath. But prio
> groups are managed in user space (multipathd), which treats
> transitioning paths as "almost failed" (priority 0) anyway. We can
> discuss enhancing multipathd such that it re-checks transitioning paths
> more frequently, in order to be able to reinstate them ASAP.
>
> According to what you said above, the "transitioning" ports in the
> problem situation ("second set") are those that were in "unavailable"
> state before, which means "failed" as far as device mapper is concerned
> - IOW, the paths in question would be unused anyway, until they got
> reinstated, which wouldn't happen before they are fully up. With this
> in mind, I have to say I don't understand why your proposed patch would
> help at all. Please explain.
>

My proposed patch would not fail the paths in the case of
BLK_STS_AGAIN. This seems to result in the requests being retried
until the path transitions to either a failed state (standby or
unavailable) or an online state.
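
As a toy model of the scenario (again not dm-mpath code, and all names
here are made up for illustration), consider a map with one set of
paths in unavailable and the other set in transitioning. The only
thing the sketch varies is whether a transitioning path is failed or
kept:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified set of states, just enough for this scenario. */
enum alua_state {
	ALUA_ACTIVE_OPTIMIZED,
	ALUA_UNAVAILABLE,
	ALUA_TRANSITIONING,
};

/* Hypothetical policy switch: fail transitioning paths or keep them. */
bool path_stays_in_map(enum alua_state st, bool fail_transitioning)
{
	switch (st) {
	case ALUA_ACTIVE_OPTIMIZED:
		return true;
	case ALUA_TRANSITIONING:
		return !fail_transitioning;
	case ALUA_UNAVAILABLE:
	default:
		return false;
	}
}

/* Count the paths that are still candidates for I/O under the policy. */
size_t count_remaining_paths(const enum alua_state *paths, size_t n,
			     bool fail_transitioning)
{
	size_t remaining = 0;

	for (size_t i = 0; i < n; i++)
		if (path_stays_in_map(paths[i], fail_transitioning))
			remaining++;
	return remaining;
}
```

With two unavailable and two transitioning paths, failing the
transitioning ones leaves zero candidates (all paths down), while
keeping them leaves two paths on which requests can be retried.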

> > > The purpose of that patch was to set the state of the transitioning
> > > path to failed in order to make sure IO is retried on a different
> > > path.
> > > Your patch would undermine this purpose.
>
> (Additional indentation added by me) Can you please use proper quoting?
> You were mixing my statements and your own.
>
> > I agree this is what happens but those transitioning paths might be
> > the only non-failed paths available. I don't think it is reasonable
> > to fail them. This is the same as treating transitioning as standby
> > or unavailable.
>
> Right, and according to the SPC spec (see above), that makes more sense
> than treating it as "active".
>
> Storage vendors seem to interpret "transitioning" very differently,
> both in terms of commands supported and in terms of time required to
> reach the target state. That makes it hard to deal with it correctly on
> the host side.
>
> >  As you point out this happened with the commit you
> > mention. Before this commit what I am doing does not result in an all
> > paths down error, and similarly, it does not in earlier Linux
> > versions or other OS's under the same condition. I see this as a
> > regression.
>
> If you use a suitable "no_path_retry" setting in multipathd, you should
> be able to handle the situation you describe just fine by queueing the
> I/O until the transitioning paths are fully up. IIUC, on your server
> "transitioning" is a transient state that ends quickly, so queueing
> shouldn't be an issue. E.g. if you are certain that "transitioning"
> won't last longer than 10s, you could set "no_path_retry 2".
>
> Regards,
> Martin
>
>
>

I have tested no_path_retry, and you are correct that it works around
the issue I am seeing. The problem is that there are times when we want
to convey an all-paths-down condition to the initiator as quickly as
possible, and queueing will delay that.
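
To put numbers on that trade-off, here is a small sketch of the
arithmetic behind your "no_path_retry 2" example. It assumes the
queueing window is roughly no_path_retry times the checker interval,
and that the interval is 5 seconds (which I believe is the usual
polling_interval, but please treat it as an assumption). The function
names are made up, and real timing will also depend on how the checker
is scheduled.

```c
#include <assert.h>

/* Assumed checker interval in seconds; check polling_interval locally. */
#define ASSUMED_POLLING_INTERVAL_S 5u

/* Approximate time I/O stays queued before all paths down is reported. */
unsigned int queue_window_s(unsigned int no_path_retry,
			    unsigned int polling_interval_s)
{
	return no_path_retry * polling_interval_s;
}

/* Smallest retry count whose window covers the expected transition time. */
unsigned int retries_needed(unsigned int transition_s,
			    unsigned int polling_interval_s)
{
	assert(polling_interval_s > 0);
	return (transition_s + polling_interval_s - 1) / polling_interval_s;
}
```

Under these assumptions a 10 second transition needs 2 retries, and
those same 2 retries mean a genuine all-paths-down event is reported
about 10 seconds later than it would be without queueing.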

Thanks,
Brian

Brian Bunker
PURE Storage, Inc.
brian@purestorage.com

