Re: [PATCH 1/1] scsi_dh_alua: properly handling the ALUA transitioning state

From: Brian Bunker <brian@purestorage.com>
To: Martin Wilck <mwilck@suse.com>
Cc: Mike Christie <michael.christie@oracle.com>,
	Hannes Reinecke <hare@suse.de>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	linux-scsi@vger.kernel.org,
	Benjamin Marzinski <bmarzins@redhat.com>
Subject: Re: [PATCH 1/1] scsi_dh_alua: properly handling the ALUA transitioning state
Date: Mon, 23 May 2022 09:52:51 -0700	[thread overview]
Message-ID: <4CBD87A8-0A92-4DFD-B50B-124C11BB9C86@purestorage.com> (raw)
In-Reply-To: <07f3474feb4ea7c2e80664c9977ae0e24b82cc09.camel@suse.com>


> On May 23, 2022, at 9:03 AM, Martin Wilck <mwilck@suse.com> wrote:
> 
> On Fri, 2022-05-20 at 19:52 -0700, Brian Bunker wrote:
>> From my perspective, the ALUA transitioning state is a temporary
>> state
>> where the target is saying that it does not have a permanent state
>> yet. Having the initiator try another pg to me doesn't seem like the
>> right thing for it to do.
> 
> I agree. Unfortunately, there's no logic in dm-multipath saying "I may
> switch paths inside a PG, but I may not do PG failover".
> 
>> If the target wanted the initiator to use a
>> different pg, it should use an ALUA state which would make that
>> clear,
>> standby, unavailable, etc. The target would only return an error
>> state
>> if it was aware that some other path is in an active state.When
>> transitioning is returned, I don't think the initiator should assume
>> that any other pg would be a better choice. I think it should assume
>> that the target will make its intention clear for that path with a
>> permanent state within a transition timeout.
> 
> For me the question is still whether trying to send I/O to the path
> that is known not to be able to process it makes sense. As noted
> elsewhere, you patch just delays the BLK_STS_AGAIN by a few
> milliseconds. You want to avoid a PG switch, and I second that, but IMO
> that needs a different approach.
> 
>> From my perspective the right thing to do is to let the ALUA handler
>> do what it is trying to do. If the pg state is transitioning and
>> within the transition timeout it should continue to retry that
>> request
>> checking each time the transition timeout.
> 
> But this means that we should modify the logic not only in
> alua_prep_fn() but also for handling of NOT READY conditions, either in
> alua_check_sense() or in scsi_io_completion_action().
> I agree that this would make a lot of sense, perhaps more than trying
> to implement a cleverer logic in dm-multipath as discussed between
> Hannes and myself.
> 
> This is what we need to figure out first: Do we want to change the
> logic in the multipath layer, making it somehow aware of the special
> nature of "transitioning" state, or should we tune the retry logic in
> the SCSI layer such that dm-multipath will "do the right thing"
> automatically?
> 
> Regards
> Martin
> 
I think that a couple of things are needed to get to where my expectation of how it should work would be. First this code should come out of the not ready sense check. The reason is that it will continually override alua_check’s attempt to change the pg state as it exceeds the allowed time and the path state is changed to standby to handle a misbehaving target which stays in transitioning past the timer.

--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -409,20 +409,12 @@ static char print_alua_state(unsigned char state)
 static enum scsi_disposition alua_check_sense(struct scsi_device *sdev,
                                              struct scsi_sense_hdr *sense_hdr)
 {
-       struct alua_dh_data *h = sdev->handler_data;
-       struct alua_port_group *pg;
-
        switch (sense_hdr->sense_key) {
        case NOT_READY:
                if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0a) {
                        /*
                         * LUN Not Accessible - ALUA state transition
                         */
-                       rcu_read_lock();
-                       pg = rcu_dereference(h->pg);
-                       if (pg)
-                               pg->state = SCSI_ACCESS_STATE_TRANSITIONING;
-                       rcu_read_unlock();

Second for this to work how it used to work in Linux and how it works for other OS’s where ALUA state transition is not a fail path response:

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 8d18cc7e510e..33828aa44347 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -776,11 +776,11 @@ static void scsi_io_completion_action(struct scsi_cmnd *cmd, int result)
                                case 0x1b: /* sanitize in progress */
                                case 0x1d: /* configuration in progress */
                                case 0x24: /* depopulation in progress */
                                       action = ACTION_DELAYED_RETRY;
                                        break;
                                case 0x0a: /* ALUA state transition */
-                                       blk_stat = BLK_STS_AGAIN;
-                                       fallthrough;
+                                       action = ACTION_REPREP;
+                                       break;

Because, as you said the expiation check for whether to continually allow new I/O on the pg is in the prep function of the ALUA handler. I think that this does introduce a lot error processing for a target that will respond quickly. Maybe there is some way to use the equivalent of ACTION_DELAYED_RETRY so that the target is not as aggressively retried in the transitioning state.

It is possible, maybe likely, in other OS’s that this retry is done at the multipath level. The DSM from Microsoft in GitHub would seem to show that Windows does it that way. The multipath-tools in Linux though don’t seem to use sense data to make decisions so getting this logic in would seem like a decent set of changes and getting the buy-in to go that route.

Thanks,
Brian