Re: [PATCH 1/1] scsi_dh_alua: properly handling the ALUA transitioning state

From: Martin Wilck <mwilck@suse.com>
To: Brian Bunker <brian@purestorage.com>
Cc: Mike Christie <michael.christie@oracle.com>,
	Hannes Reinecke <hare@suse.de>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	linux-scsi@vger.kernel.org,
	Benjamin Marzinski <bmarzins@redhat.com>
Subject: Re: [PATCH 1/1] scsi_dh_alua: properly handling the ALUA transitioning state
Date: Tue, 24 May 2022 10:29:07 +0200	[thread overview]
Message-ID: <731d08291f201a4c056737c107c3c78485b609f3.camel@suse.com> (raw)
In-Reply-To: <4CBD87A8-0A92-4DFD-B50B-124C11BB9C86@purestorage.com>

On Mon, 2022-05-23 at 09:52 -0700, Brian Bunker wrote:
> 
> > On May 23, 2022, at 9:03 AM, Martin Wilck <mwilck@suse.com> wrote:
> > 
> > On Fri, 2022-05-20 at 19:52 -0700, Brian Bunker wrote:
> > > From my perspective, the ALUA transitioning state is a temporary
> > > state
> > > where the target is saying that it does not have a permanent
> > > state
> > > yet. Having the initiator try another pg to me doesn't seem like
> > > the
> > > right thing for it to do.
> > 
> > I agree. Unfortunately, there's no logic in dm-multipath saying "I
> > may
> > switch paths inside a PG, but I may not do PG failover".
> > 
> > > If the target wanted the initiator to use a
> > > different pg, it should use an ALUA state which would make that
> > > clear,
> > > standby, unavailable, etc. The target would only return an error
> > > state
> > > if it was aware that some other path is in an active state.When
> > > transitioning is returned, I don't think the initiator should
> > > assume
> > > that any other pg would be a better choice. I think it should
> > > assume
> > > that the target will make its intention clear for that path with
> > > a
> > > permanent state within a transition timeout.
> > 
> > For me the question is still whether trying to send I/O to the path
> > that is known not to be able to process it makes sense. As noted
> > elsewhere, you patch just delays the BLK_STS_AGAIN by a few
> > milliseconds. You want to avoid a PG switch, and I second that, but
> > IMO
> > that needs a different approach.
> > 
> > > From my perspective the right thing to do is to let the ALUA
> > > handler
> > > do what it is trying to do. If the pg state is transitioning and
> > > within the transition timeout it should continue to retry that
> > > request
> > > checking each time the transition timeout.
> > 
> > But this means that we should modify the logic not only in
> > alua_prep_fn() but also for handling of NOT READY conditions,
> > either in
> > alua_check_sense() or in scsi_io_completion_action().
> > I agree that this would make a lot of sense, perhaps more than
> > trying
> > to implement a cleverer logic in dm-multipath as discussed between
> > Hannes and myself.
> > 
> > This is what we need to figure out first: Do we want to change the
> > logic in the multipath layer, making it somehow aware of the
> > special
> > nature of "transitioning" state, or should we tune the retry logic
> > in
> > the SCSI layer such that dm-multipath will "do the right thing"
> > automatically?
> > 
> > Regards
> > Martin
> > 
> I think that a couple of things are needed to get to where my
> expectation of how it should work would be. First this code should
> come out of the not ready sense check. The reason is that it will
> continually override alua_check’s attempt to change the pg state as
> it exceeds the allowed time and the path state is changed to standby
> to handle a misbehaving target which stays in transitioning past the
> timer.
> 
> --- a/drivers/scsi/device_handler/scsi_dh_alua.c
> +++ b/drivers/scsi/device_handler/scsi_dh_alua.c
> @@ -409,20 +409,12 @@ static char print_alua_state(unsigned char
> state)
>  static enum scsi_disposition alua_check_sense(struct scsi_device
> *sdev,
>                                               struct scsi_sense_hdr
> *sense_hdr)
>  {
> -       struct alua_dh_data *h = sdev->handler_data;
> -       struct alua_port_group *pg;
> -
>         switch (sense_hdr->sense_key) {
>         case NOT_READY:
>                 if (sense_hdr->asc == 0x04 && sense_hdr->ascq ==
> 0x0a) {
>                         /*
>                          * LUN Not Accessible - ALUA state transition
>                          */
> -                       rcu_read_lock();
> -                       pg = rcu_dereference(h->pg);
> -                       if (pg)
> -                               pg->state =
> SCSI_ACCESS_STATE_TRANSITIONING;
> -                       rcu_read_unlock();
> 

Good point. But I'd say that rather than removing it, this code should
be made aware of the timer.

> Second for this to work how it used to work in Linux and how it works
> for other OS’s where ALUA state transition is not a fail path
> response:
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 8d18cc7e510e..33828aa44347 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -776,11 +776,11 @@ static void scsi_io_completion_action(struct
> scsi_cmnd *cmd, int result)
>                                 case 0x1b: /* sanitize in progress */
>                                 case 0x1d: /* configuration in
> progress */
>                                 case 0x24: /* depopulation in
> progress */
>                                        action = ACTION_DELAYED_RETRY;
>                                         break;
>                                 case 0x0a: /* ALUA state transition
> */
> -                                       blk_stat = BLK_STS_AGAIN;
> -                                       fallthrough;
> +                                       action = ACTION_REPREP;
> +                                       break;
> 
> Because, as you said the expiation check for whether to continually
> allow new I/O on the pg is in the prep function of the ALUA handler.

Not any more with your patch, or what am I missing?

> I think that this does introduce a lot error processing for a target
> that will respond quickly. Maybe there is some way to use the
> equivalent of ACTION_DELAYED_RETRY so that the target is not as
> aggressively retried in the transitioning state.

IMHO ACTION_DELAYED_RETRY would be a good option. Note that it blocks
the device; thus it prevents dispatching IO to the device according to
the scsi_device_block() logic. This way it automatically introduces
some delay. 

> 
> It is possible, maybe likely, in other OS’s that this retry is done
> at the multipath level. The DSM from Microsoft in GitHub would seem
> to show that Windows does it that way. The multipath-tools in Linux
> though don’t seem to use sense data to make decisions so getting this
> logic in would seem like a decent set of changes and getting the buy-
> in to go that route.

IMO we're closer to a good solution on the SCSI side. I've no idea how
Microsoft's DSM works, but it probably has a more detailed concept of
path states than Linux' dm-multipath, which operates on abstract block
devices.

Neither dm-multipath nor multipath-tools have a concept of
transitioning state. While it'd be possible to add this logic in
multipath-tools (user space), I see a couple of hard-to-solve problems:

 - multipathd has no concept of ALUA. The ALUA functionality is split
into the device handler part (handled in the kernel, responsible for
explicit PG switching and error handling), and the "prioritizer"
handled in user space. ALUA is just one out of several path
prioritizers. The generic prioritizer concept doesn't define a
"transitioning" property. We'd need to try to generalize the concept,
and possibly figure out how to implement it with other prioritizers.

 - the dm-multipath path group concept is abstract and, in general,
independent of ALUA's port groups. The core property of ALUA (all ports
in a group always have the same primary port state) doesn't generlly
apply to multipath PGs. Nothing prevents users from configuring
different path grouping policies, like multibus or grouping by latency,
even if ALUA is supported. The idea to prevent dm-multipath from
switching PG away from a transitioning path group is only valid if dm-
multipath PGs map 1:1 to ALUA port groups.

 - ALUA itself is more generic than the scenario you describe. You say
that that if a path is transitioning, trying to switch PGs is wrong.
This is true if there's just one other PG, and that other PG is
unavailable. But there may be setups with more than 2 PGs, where one PG
is perfectly healthy while another is transitioning. In such a case,
it'd be reasonable to switch PGs towards the healthy one.

 - multipathd doesn't get notified of ALUA state changes. It polls in
rather long time intervals (5s). If this logic is implemented in
multipathd, we'll see considerable latency in PG switching.

- In general, multipathd doesn't have full control over dm-multipath PG
switching. Some logic is in the kernel and some in user space. It's not
easy for multipathd to prevent the kernel from switching to a certain
PG. Basically the only way is to set all paths in that PG to failed
status.

Here's one dumb question: ALUA defines "transitioning" in an abstract
fashion, as a "movement from one target port asymmetric access state to
another". This doesn't say anything about the state the port is
transitioning towards. IOW, it could be transitioning towards "STANDBY"
or "UNAVAILABLE", perhaps even "OFFLINE". I am unsure if this makes
sense in practice; I _guess_ that most of the time "transitioning"
means something like "booting". Your suggestions seem to confirm that.
at least for PURE hardware, "transitioning" always means "will be
usable soon". Is this correct? And if yes, to which extent could this
be generalized to other arrays, past, present, and future?

Regards
Martin