* [PATCH 1/1]: scsi scsi_dh_alua: don't fail I/O until transition time expires
@ 2021-05-28 18:34 Brian Bunker
  2021-06-07 12:42 ` Hannes Reinecke
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Bunker @ 2021-05-28 18:34 UTC (permalink / raw)
  To: linux-scsi

Do not return an error to multipath, which would result in a failed path, until the transition time expires.

The current patch, which returns BLK_STS_AGAIN for ALUA transitioning, breaks the assumptions our target makes about ALUA states. With that change an error is returned to multipath very quickly, and multipath in turn immediately fails the path. The assumption in that patch seems to be that another path will be available for multipath to use. I don't believe that assumption is fair to make: while one path is in a transitioning state, it is reasonable to assume that other paths may also be in non-active states.

The SPC spec has a note on this:
The IMPLICIT TRANSITION TIME field indicates the minimum amount of time in seconds the application client should wait prior to timing out an implicit state transition (see 5.15.2.2). A value of zero indicates that the implicit transition time is not specified.

In the SCSI ALUA device handler a value of 0 translates to a transition time of 60 seconds. The current approach of failing I/O on the transitioning path much sooner than that seems to violate this aspect of the specification.

#define ALUA_FAILOVER_TIMEOUT		60
unsigned long transition_tmo = ALUA_FAILOVER_TIMEOUT * HZ;

This patch uses the expiry the same way it is used elsewhere in the device handler. Once the transition state is entered, keep retrying until the expiry is reached. If that happens, return the error to multipath the same way it is currently done with BLK_STS_AGAIN.

Signed-off-by: Brian Bunker <brian@purestorage.com>
Acked-by: Krishna Kant <krishna.kant@purestorage.com>
Acked-by: Seamus Connor <sconnor@purestorage.com>

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c86c01bfecdb..8fd4f677f9c9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1395,7 +1395,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
                        break;
                default:
                        errors++;
-                       blk_mq_end_request(rq, ret);
+                       blk_mq_end_request(rq, BLK_STS_IOERR);
                }
        } while (!list_empty(list));
 out:
diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
index efa8c0381476..dd51200665b9 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -417,13 +417,36 @@ static enum scsi_disposition alua_check_sense(struct scsi_device *sdev,
                        /*
                         * LUN Not Accessible - ALUA state transition
                         */
+                       unsigned long expiry = 0;
                        rcu_read_lock();
                        pg = rcu_dereference(h->pg);
-                       if (pg)
-                               pg->state = SCSI_ACCESS_STATE_TRANSITIONING;
+                       if (pg) {
+                               if (pg->state != SCSI_ACCESS_STATE_TRANSITIONING) {
+                                       unsigned long transition_tmo = ALUA_FAILOVER_TIMEOUT * HZ;
+                                       pg->state = SCSI_ACCESS_STATE_TRANSITIONING;
+
+                                       if (pg->transition_tmo)
+                                               transition_tmo = pg->transition_tmo * HZ;
+                                       pg->expiry = round_jiffies_up(jiffies + transition_tmo);
+                                       sdev_printk(KERN_INFO, sdev,
+                                               "%s: port group %02x expiry set to %lu\n",
+                                               ALUA_DH_NAME, pg->group_id, pg->expiry);
+                               }
+                               expiry = pg->expiry;
+                       }
+                       rcu_read_unlock();
+                       if (expiry != 0 && time_before_eq(jiffies, expiry)) {
+                               alua_check(sdev, false);
+                               return NEEDS_RETRY;
+                       }
+                       rcu_read_lock();
+                       pg = rcu_dereference(h->pg);
+                       if (pg && pg->state == SCSI_ACCESS_STATE_TRANSITIONING)
+                               pg->state = SCSI_ACCESS_STATE_UNAVAILABLE;
                        rcu_read_unlock();
-                       alua_check(sdev, false);
-                       return NEEDS_RETRY;
+                       sdev_printk(KERN_ERR, sdev,
+                               "%s: transition timeout (expiry %lu) exceeded\n",
+                               ALUA_DH_NAME, expiry);
                }
                break;
        case UNIT_ATTENTION:
@@ -1109,7 +1132,7 @@ static blk_status_t alua_prep_fn(struct scsi_device *sdev, struct request *req)
        case SCSI_ACCESS_STATE_LBA:
                return BLK_STS_OK;
        case SCSI_ACCESS_STATE_TRANSITIONING:
-               return BLK_STS_AGAIN;
+               return BLK_STS_RESOURCE;
        default:
                req->rq_flags |= RQF_QUIET;
                return BLK_STS_IOERR;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 532304d42f00..bdc5d015de77 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -735,9 +735,6 @@ static void scsi_io_completion_action(struct scsi_cmnd *cmd, int result)
                                case 0x24: /* depopulation in progress */
                                        action = ACTION_DELAYED_RETRY;
                                        break;
-                               case 0x0a: /* ALUA state transition */
-                                       blk_stat = BLK_STS_AGAIN;
-                                       fallthrough;
                                default:
                                        action = ACTION_FAIL;
                                        break;


Brian Bunker
SW Eng
brian@purestorage.com




^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/1]: scsi scsi_dh_alua: don't fail I/O until transition time expires
  2021-05-28 18:34 [PATCH 1/1]: scsi scsi_dh_alua: don't fail I/O until transition time expires Brian Bunker
@ 2021-06-07 12:42 ` Hannes Reinecke
  2021-06-08  0:03   ` Brian Bunker
  0 siblings, 1 reply; 7+ messages in thread
From: Hannes Reinecke @ 2021-06-07 12:42 UTC (permalink / raw)
  To: Brian Bunker, linux-scsi

On 5/28/21 8:34 PM, Brian Bunker wrote:
> Do not return an error to multipath which will result in a failed path until the transition time expires.
> 
> The current patch which returns BLK_STS_AGAIN for ALUA transitioning breaks the assumptions in our target
> regarding ALUA states. With that change an error is very quickly returned to multipath which in turn
> immediately fails the path. The assumption in that patch seems to be that another path will be available
> for multipath to use. That assumption I don’t believe is fair to make since while one path is in a
> transitioning state it is reasonable to assume that other paths may also be in non active states. 
> 
I beg to disagree. Path groups are nominally independent, and might
change states independently of the other path groups.
While for some arrays a 'transitioning' state is indeed system-wide,
other arrays might be able to serve I/O on other paths whilst one is in
transitioning.
So I'd rather not presume anything here.


> The SPC spec has a note on this:
> The IMPLICIT TRANSITION TIME field indicates the minimum amount of time in seconds the application client
> should wait prior to timing out an implicit state transition (see 5.15.2.2). A value of zero indicates that
> the implicit transition time is not specified.
> 
Oh, I know _that_ one. What with me being one of the implementors asking
for it :-)

But again, this is _per path_. One cannot assume anything about _other_
paths here.

> In the SCSI ALUA device handler a value of 0 translates to the transition time being set to 60 seconds.
> The current approach of failing I/O on the transitioning path in a much quicker time than what is stated
> seems to violate this aspect of the specification.
> 
> #define ALUA_FAILOVER_TIMEOUT		60
> unsigned long transition_tmo = ALUA_FAILOVER_TIMEOUT * HZ;
> 

No. The implicit transitioning timeout is used to determine for how long
we will be sending a 'TEST UNIT READY'/'REPORT TARGET PORT GROUPS' combo
to figure out if this particular path is still in transitioning. Once
this timeout is exceeded we're setting the path to 'standby'.
And this 'setting port to standby' is our action for 'timing out an
implicit state transition' as defined by the spec.

> This patch uses the expiry the same way it is used elsewhere in the device handler. Once the transition
> state is entered keep retrying until the expiry value is reached. If that happens, return the error to
> multipath the same way that is currently done with BLK_STS_AGAIN.
> 
And that is precisely what I want to avoid.

As outlined above, we cannot assume that all paths will be set to
'transitioning' once we hit the 'transitioning' condition on one path.
As such, we need to retry the I/O on other paths, to ensure failover
does work in these cases. Hence it's perfectly okay to set this path to
'failed' as we cannot currently send I/O to that path.

If, however, we are hitting a 'transitioning' status on _all_ paths (ie
all paths are set to 'failed') we need to ensure that we do _not_ fail
the I/O (as technically the paths are still alive), but retry with
TUR/RTPG until one path reaches a final state.
Then we should reinstate that path and continue with I/O.

I thought that this is what we do already; but might be that there are
some issues lurking here.

So what is the error you are seeing?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)


* Re: [PATCH 1/1]: scsi scsi_dh_alua: don't fail I/O until transition time expires
  2021-06-07 12:42 ` Hannes Reinecke
@ 2021-06-08  0:03   ` Brian Bunker
  2021-06-09  7:03     ` Hannes Reinecke
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Bunker @ 2021-06-08  0:03 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-scsi

> Do not return an error to multipath which will result in a failed path until the \
> transition time expires. 
> The current patch which returns BLK_STS_AGAIN for ALUA transitioning breaks the \
> assumptions in our target regarding ALUA states. With that change an error is very \
> quickly returned to multipath which in turn immediately fails the path. The \
> assumption in that patch seems to be that another path will be available for \
> multipath to use. That assumption I don't believe is fair to make since while one \
> path is in a transitioning state it is reasonable to assume that other paths may \
> also be in non active states.  

> I beg to disagree. Path groups are nominally independent, and might
> change states independent on the other path groups.
> While for some arrays a 'transitioning' state is indeed system-wide,
> other arrays might be able to serve I/O on other paths whilst one is in
> transitioning.
> So I'd rather not presume anything here.

I agree. No problem there. Our array could and does return transitioning on
some portal groups while others might be active/online or unavailable.

> As outlined above, we cannot assume that all paths will be set to
> 'transitioning' once we hit the 'transitioning' condition on one path.
> As such, we need to retry the I/O on other paths, to ensure failover
> does work in these cases. Hence it's perfectly okay to set this path to
> 'failed' as we cannot currently send I/O to that path.

> If, however, we are hitting a 'transitioning' status on _all_ paths (ie
> all paths are set to 'failed') we need to ensure that we do _not_ fail
> the I/O (as technically the paths are still alive), but retry with
> TUR/RTPG until one path reaches a final state.
> Then we should reinstate that path and continue with I/O.

I am not saying that all paths should be changed to transitioning, but
I/O sent to the path that is transitioning should not immediately
fail when there is no online path, which is what happens without
this patch or one like it.

The other paths, which are in other states, should succeed or fail
I/O as they would based on their state. I am only concerned about
the portal group in the transitioning state, and about making sure it
doesn't immediately bubble errors back to the multipath layer, which
fails the path; that is what we see and don't want to see.

> So what is the error you are seeing?

Right now, here is what fails but used to work before the patch.
This worked in previous Linux versions and continues to work
in Windows, ESXi, Solaris, AIX, and HP-UX; I have tested those.
It might work on others as well, but that list is good enough for me.

We have an array with two controllers, and when all is good
each controller reports active/optimized for all of its ports.
There is a TPG per controller.

CT0 - Primary - AO - TPG 0
CT1 - Secondary - AO - TPG 1

In any upgrade there is a point where we have to have the
secondary promote to primary. In our world we call this a
giveback. This is done by returning unavailable for I/O
that is sent to the previous primary CT0 and transitioning
for CT1, the promoting secondary:

CT0 - was primary - unavailable - TPG 0
CT1 - promoting not yet primary - transitioning - TPG 1

This is where we hit the issue. The paths to CT0 fail
since its ALUA state is unavailable, as expected. The paths
to CT1 also fail quickly, within the same second, after some
retries. For a short time, while the secondary promotes to
primary, there are no paths which can serve I/O. We expect
the ALUA transitioning state to protect this path against
an I/O error returning to multipath, which it no longer does.

If it worked we would expect:
CT0 - becoming secondary - still unavailable - TPG 0
CT1 - Primary - AO - TPG 1

And a short time later:
CT0 - secondary - AO - TPG 0
CT1 - primary - AO - TPG 1

Hopefully that helps with the context and why we
are proposing what we are.

Thanks,
Brian

Brian Bunker
SW Eng
brian@purestorage.com



> On Jun 7, 2021, at 5:42 AM, Hannes Reinecke <hare@suse.de> wrote:
> 
> On 5/28/21 8:34 PM, Brian Bunker wrote:
>> Do not return an error to multipath which will result in a failed path until the transition time expires.
>> 
>> The current patch which returns BLK_STS_AGAIN for ALUA transitioning breaks the assumptions in our target
>> regarding ALUA states. With that change an error is very quickly returned to multipath which in turn
>> immediately fails the path. The assumption in that patch seems to be that another path will be available
>> for multipath to use. That assumption I don’t believe is fair to make since while one path is in a
>> transitioning state it is reasonable to assume that other paths may also be in non active states. 
>> 
> I beg to disagree. Path groups are nominally independent, and might
> change states independent on the other path groups.
> While for some arrays a 'transitioning' state is indeed system-wide,
> other arrays might be able to serve I/O on other paths whilst one is in
> transitioning.
> So I'd rather not presume anything here.
> 
> 
>> The SPC spec has a note on this:
>> The IMPLICIT TRANSITION TIME field indicates the minimum amount of time in seconds the application client
>> should wait prior to timing out an implicit state transition (see 5.15.2.2). A value of zero indicates that
>> the implicit transition time is not specified.
>> 
> Oh, I know _that_ one. What with me being one of the implementors asking
> for it :-)
> 
> But again, this is _per path_. One cannot assume anything about _other_
> paths here.
> 
>> In the SCSI ALUA device handler a value of 0 translates to the transition time being set to 60 seconds.
>> The current approach of failing I/O on the transitioning path in a much quicker time than what is stated
>> seems to violate this aspect of the specification.
>> 
>> #define ALUA_FAILOVER_TIMEOUT		60
>> unsigned long transition_tmo = ALUA_FAILOVER_TIMEOUT * HZ;
>> 
> 
> No. The implicit transitioning timeout is used to determine for how long
> we will be sending a 'TEST UNIT READY'/'REPORT TARGET PORT GROUPS' combo
> to figure out if this particular path is still in transitioning. Once
> this timeout is exceeded we're setting the path to 'standby'.
> And this 'setting port to standby' is our action for 'timing out an
> implicit state transition' as defined by the spec.
> 
>> This patch uses the expiry the same way it is used elsewhere in the device handler. Once the transition
>> state is entered keep retrying until the expiry value is reached. If that happens, return the error to
>> multipath the same way that is currently done with BLK_STS_AGAIN.
>> 
> And that is precisely what I want to avoid.
> 
> As outlined above, we cannot assume that all paths will be set to
> 'transitioning' once we hit the 'transitioning' condition on one path.
> As such, we need to retry the I/O on other paths, to ensure failover
> does work in these cases. Hence it's perfectly okay to set this path to
> 'failed' as we cannot currently send I/O to that path.
> 
> If, however, we are hitting a 'transitioning' status on _all_ paths (ie
> all paths are set to 'failed') we need to ensure that we do _not_ fail
> the I/O (as technically the paths are still alive), but retry with
> TUR/RTPG until one path reaches a final state.
> Then we should reinstate that path and continue with I/O.
> 
> I thought that this is what we do already; but might be that there are
> some issues lurking here.
> 
> So what is the error you are seeing?
> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke		        Kernel Storage Architect
> hare@suse.de			               +49 911 74053 688
> SUSE Software Solutions Germany GmbH, 90409 Nürnberg
> GF: F. Imendörffer, HRB 36809 (AG Nürnberg)



* Re: [PATCH 1/1]: scsi scsi_dh_alua: don't fail I/O until transition time expires
  2021-06-08  0:03   ` Brian Bunker
@ 2021-06-09  7:03     ` Hannes Reinecke
  2021-06-10 21:08       ` Brian Bunker
  0 siblings, 1 reply; 7+ messages in thread
From: Hannes Reinecke @ 2021-06-09  7:03 UTC (permalink / raw)
  To: Brian Bunker; +Cc: linux-scsi

On 6/8/21 2:03 AM, Brian Bunker wrote:
>> Do not return an error to multipath which will result in a failed path until the \
>> transition time expires.
>> The current patch which returns BLK_STS_AGAIN for ALUA transitioning breaks the \
>> assumptions in our target regarding ALUA states. With that change an error is very \
>> quickly returned to multipath which in turn immediately fails the path. The \
>> assumption in that patch seems to be that another path will be available for \
>> multipath to use. That assumption I don't believe is fair to make since while one \
>> path is in a transitioning state it is reasonable to assume that other paths may \
>> also be in non active states.
> 
>> I beg to disagree. Path groups are nominally independent, and might
>> change states independent on the other path groups.
>> While for some arrays a 'transitioning' state is indeed system-wide,
>> other arrays might be able to serve I/O on other paths whilst one is in
>> transitioning.
>> So I'd rather not presume anything here.
> 
> I agree. No problem there. Our array could and does return transitioning on
> some portal groups while others might be active/online or unavailable.
> 
>> As outlined above, we cannot assume that all paths will be set to
>> 'transitioning' once we hit the 'transitioning' condition on one path.
>> As such, we need to retry the I/O on other paths, to ensure failover
>> does work in these cases. Hence it's perfectly okay to set this path to
>> 'failed' as we cannot currently send I/O to that path.
> 
>> If, however, we are hitting a 'transitioning' status on _all_ paths (ie
>> all paths are set to 'failed') we need to ensure that we do _not_ fail
>> the I/O (as technically the paths are still alive), but retry with
>> TUR/RTPG until one path reaches a final state.
>> Then we should reinstate that path and continue with I/O.
> 
> I am not saying that all paths should be changed to transitioning, but
> I/Os sent to the path that is in transitioning should not immediately
> fail if there is not an online path like what does happen without
> this patch or one like it.
> 
> The other paths which are in other states should succeed or fail
> I/O as they would based on their state. I am only concerned about
> the portal group in the transitioning state and making sure it doesn’t
> immediately bubble errors back to the multipath layer which fails the
> path which is what we see and don’t want to see.
> 
>> So what is the error you are seeing?
> 
> Right now this is what fails and used to work before the patch
> This worked in previous Linux versions and continues to work
> in Windows, ESXi, Solaris, AIX, and HP-UX. I have tested those.
> It might work on others as well, but that list is good enough for me.
> 
> We have an array with two controllers and when all is good
> each controller reports active/optimized for all of it ports. There
> Is a TPG per controller.
> 
> CT0 - Primary - AO - TPG 0
> CT1 - Secondary - AO - TPG 1
> 
> In any upgrade there is a point where we have to have the
> secondary promote to primary. In our world we call this a
> giveback. This is done by returning unavailable for I/O
> that is sent to the previous primary CT0 and transitioning
> for CT1, the promoting secondary:
> 
> CT0 - was primary - unavailable - TPG 0
> CT1 - promoting not yet primary - transitioning - TPG 1
> 
> This is where we hit the issue. The paths to CT0 fail
> since its ALUA state is unavailable as expected. The paths
> to CT1 also quickly fail in the same second after some
> retries. There are no paths which can serve I/O for a
> short time as the secondary promotes to primary. We
> expect ALUA state transitioning to protect this path
> against an I/O error returning to multipath which it
> no longer does.
> 
> If it worked we would expect:
> CT0 - becoming secondary - still unavailable - TPG 0
> CT1 - Primary - AO - TPG 1
> 
> And a short time later:
> CT0 - secondary - AO - TPG 0
> CT1 - primary - AO - TPG 1
> 
> Hopefully that helps with the context and why we
> are proposing what we are.
Ah-ha.
'Unavailable' state. Right.

Hmm. Seems that we need to distinguish (at the device-mapper multipath 
layer) between temporarily failed paths (like transitioning), which 
could become available at a later time, and permanently failed paths 
(like unavailable or standby), for which a retry would not yield 
different results. I thought we did that, but apparently there's an 
issue somewhere.

Lemme see ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 1/1]: scsi scsi_dh_alua: don't fail I/O until transition time expires
  2021-06-09  7:03     ` Hannes Reinecke
@ 2021-06-10 21:08       ` Brian Bunker
  2021-06-17 19:19         ` Brian Bunker
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Bunker @ 2021-06-10 21:08 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-scsi


> On Jun 9, 2021, at 12:03 AM, Hannes Reinecke <hare@suse.de> wrote:
> 
> On 6/8/21 2:03 AM, Brian Bunker wrote:
>>> Do not return an error to multipath which will result in a failed path until the \
>>> transition time expires.
>>> The current patch which returns BLK_STS_AGAIN for ALUA transitioning breaks the \
>>> assumptions in our target regarding ALUA states. With that change an error is very \
>>> quickly returned to multipath which in turn immediately fails the path. The \
>>> assumption in that patch seems to be that another path will be available for \
>>> multipath to use. That assumption I don't believe is fair to make since while one \
>>> path is in a transitioning state it is reasonable to assume that other paths may \
>>> also be in non active states.
>>> I beg to disagree. Path groups are nominally independent, and might
>>> change states independent on the other path groups.
>>> While for some arrays a 'transitioning' state is indeed system-wide,
>>> other arrays might be able to serve I/O on other paths whilst one is in
>>> transitioning.
>>> So I'd rather not presume anything here.
>> I agree. No problem there. Our array could and does return transitioning on
>> some portal groups while others might be active/online or unavailable.
>>> As outlined above, we cannot assume that all paths will be set to
>>> 'transitioning' once we hit the 'transitioning' condition on one path.
>>> As such, we need to retry the I/O on other paths, to ensure failover
>>> does work in these cases. Hence it's perfectly okay to set this path to
>>> 'failed' as we cannot currently send I/O to that path.
>>> If, however, we are hitting a 'transitioning' status on _all_ paths (ie
>>> all paths are set to 'failed') we need to ensure that we do _not_ fail
>>> the I/O (as technically the paths are still alive), but retry with
>>> TUR/RTPG until one path reaches a final state.
>>> Then we should reinstate that path and continue with I/O.
>> I am not saying that all paths should be changed to transitioning, but
>> I/Os sent to the path that is in transitioning should not immediately
>> fail if there is not an online path like what does happen without
>> this patch or one like it.
>> The other paths which are in other states should succeed or fail
>> I/O as they would based on their state. I am only concerned about
>> the portal group in the transitioning state and making sure it doesn’t
>> immediately bubble errors back to the multipath layer which fails the
>> path which is what we see and don’t want to see.
>>> So what is the error you are seeing?
>> Right now this is what fails and used to work before the patch
>> This worked in previous Linux versions and continues to work
>> in Windows, ESXi, Solaris, AIX, and HP-UX. I have tested those.
>> It might work on others as well, but that list is good enough for me.
>> We have an array with two controllers and when all is good
>> each controller reports active/optimized for all of it ports. There
>> Is a TPG per controller.
>> CT0 - Primary - AO - TPG 0
>> CT1 - Secondary - AO - TPG 1
>> In any upgrade there is a point where we have to have the
>> secondary promote to primary. In our world we call this a
>> giveback. This is done by returning unavailable for I/O
>> that is sent to the previous primary CT0 and transitioning
>> for CT1, the promoting secondary:
>> CT0 - was primary - unavailable - TPG 0
>> CT1 - promoting not yet primary - transitioning - TPG 1
>> This is where we hit the issue. The paths to CT0 fail
>> since its ALUA state is unavailable as expected. The paths
>> to CT1 also quickly fail in the same second after some
>> retries. There are no paths which can serve I/O for a
>> short time as the secondary promotes to primary. We
>> expect ALUA state transitioning to protect this path
>> against an I/O error returning to multipath which it
>> no longer does.
>> If it worked we would expect:
>> CT0 - becoming secondary - still unavailable - TPG 0
>> CT1 - Primary - AO - TPG 1
>> And a short time later:
>> CT0 - secondary - AO - TPG 0
>> CT1 - primary - AO - TPG 1
>> Hopefully that helps with the context and why we
>> are proposing what we are.
> Ah-ha.
> 'Unavailable' state. Right.
> 
> Hmm. Seems that we need to distinguish (at the device-mapper multipath layer) between temporarily failed paths (like transitioning), which could become available at a later time, and permanently failed paths (like unavailable or standby), for which a retry would not yield different results. I thought we did that, but apparently there's an issue somewhere.
> 
> Lemme see ...
> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

Thanks. For us, the current behavior after the fix I mentioned leaves us in a bad state. The only thing we can do is use no_path_retry in multipath.conf, but that brings a new set of problems with it. There are times when we do want an I/O error returned to the caller as quickly as possible (when we actually don't have any paths that can serve I/O). The state where one controller is unavailable and the other is transitioning is not a condition where we would want an I/O error returned. We expect the transitioning path to protect us until it transitions to an online state. These transition times are very short in our case, just a few seconds, but we would prefer not to hold on to I/O on the target if we can't serve it. The transitioning state had been working great for this until CentOS 8 Stream picked up the change.
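For reference, a no_path_retry override of the kind mentioned above might look like this in multipath.conf. The device section and values shown are illustrative assumptions, not a recommendation:

```
devices {
	device {
		vendor  "PURE"
		product "FlashArray"
		# queue I/O instead of failing it up when no paths are
		# usable; 12 retries at the default 5s polling_interval
		# is roughly a 60s window
		no_path_retry 12
	}
}
```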

Thanks,
Brian



* Re: [PATCH 1/1]: scsi scsi_dh_alua: don't fail I/O until transition time expires
  2021-06-10 21:08       ` Brian Bunker
@ 2021-06-17 19:19         ` Brian Bunker
  2021-07-07 20:16           ` Brian Bunker
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Bunker @ 2021-06-17 19:19 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-scsi

Hannes,

I ran with what you mentioned about the multipath layer and chased down where the issue is happening. What I see is new I/O coming in and hitting the device handler's prep_fn when the state is transitioning:

	case SCSI_ACCESS_STATE_TRANSITIONING:
		return BLK_STS_AGAIN;

The issue is that when this hits multipath_end_io, this block status results in the path getting failed, even though BLK_STS_AGAIN implies the request should be retried. Adding a check like this makes our problem go away:

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index bced42f082b0..d5d6be96068d 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1657,7 +1657,7 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
                else
                        r = DM_ENDIO_REQUEUE;
 
-               if (pgpath)
+               if (pgpath && (error != BLK_STS_AGAIN))
                        fail_path(pgpath);
 
                if (!atomic_read(&m->nr_valid_paths) &&

I am not sure if this is where you were going, but it does work without any other changes.

Thanks,
Brian

Brian Bunker
SW Eng
brian@purestorage.com



> On Jun 10, 2021, at 2:08 PM, Brian Bunker <brian@purestorage.com> wrote:
> 
> 
>> On Jun 9, 2021, at 12:03 AM, Hannes Reinecke <hare@suse.de> wrote:
>> 
>> On 6/8/21 2:03 AM, Brian Bunker wrote:
>>>>> Do not return an error to multipath which will result in a failed
>>>>> path until the transition time expires.
>>>>> The current patch which returns BLK_STS_AGAIN for ALUA transitioning
>>>>> breaks the assumptions in our target regarding ALUA states. With that
>>>>> change an error is very quickly returned to multipath which in turn
>>>>> immediately fails the path. The assumption in that patch seems to be
>>>>> that another path will be available for multipath to use. That
>>>>> assumption I don't believe is fair to make since while one path is in
>>>>> a transitioning state it is reasonable to assume that other paths may
>>>>> also be in non active states.
>>>> I beg to disagree. Path groups are nominally independent, and might
>>>> change states independent on the other path groups.
>>>> While for some arrays a 'transitioning' state is indeed system-wide,
>>>> other arrays might be able to serve I/O on other paths whilst one is in
>>>> transitioning.
>>>> So I'd rather not presume anything here.
>>> I agree. No problem there. Our array could and does return transitioning on
>>> some portal groups while others might be active/online or unavailable.
>>>> As outlined above, we cannot assume that all paths will be set to
>>>> 'transitioning' once we hit the 'transitioning' condition on one path.
>>>> As such, we need to retry the I/O on other paths, to ensure failover
>>>> does work in these cases. Hence it's perfectly okay to set this path to
>>>> 'failed' as we cannot currently send I/O to that path.
>>>> If, however, we are hitting a 'transitioning' status on _all_ paths (ie
>>>> all paths are set to 'failed') we need to ensure that we do _not_ fail
>>>> the I/O (as technically the paths are still alive), but retry with
>>>> TUR/RTPG until one path reaches a final state.
>>>> Then we should reinstate that path and continue with I/O.
>>> I am not saying that all paths should be changed to transitioning, but
>>> I/Os sent to the path that is in transitioning should not immediately
>>> fail if there is not an online path like what does happen without
>>> this patch or one like it.
>>> The other paths which are in other states should succeed or fail
>>> I/O as they would based on their state. I am only concerned about
>>> the portal group in the transitioning state and making sure it doesn’t
>>> immediately bubble errors back to the multipath layer which fails the
>>> path which is what we see and don’t want to see.
>>>> So what is the error you are seeing?
>>> Right now this is what fails but used to work before the patch.
>>> This worked in previous Linux versions and continues to work
>>> in Windows, ESXi, Solaris, AIX, and HP-UX. I have tested those.
>>> It might work on others as well, but that list is good enough for me.
>>> We have an array with two controllers, and when all is good
>>> each controller reports active/optimized for all of its ports.
>>> There is a TPG per controller.
>>> CT0 - Primary - AO - TPG 0
>>> CT1 - Secondary - AO - TPG 1
>>> In any upgrade there is a point where we have to have the
>>> secondary promote to primary. In our world we call this a
>>> giveback. This is done by returning unavailable for I/O
>>> that is sent to the previous primary CT0 and transitioning
>>> for CT1, the promoting secondary:
>>> CT0 - was primary - unavailable - TPG 0
>>> CT1 - promoting not yet primary - transitioning - TPG 1
>>> This is where we hit the issue. The paths to CT0 fail
>>> since its ALUA state is unavailable as expected. The paths
>>> to CT1 also quickly fail in the same second after some
>>> retries. There are no paths which can serve I/O for a
>>> short time as the secondary promotes to primary. We
>>> expect ALUA state transitioning to protect this path
>>> against an I/O error returning to multipath which it
>>> no longer does.
>>> If it worked we would expect:
>>> CT0 - becoming secondary - still unavailable - TPG 0
>>> CT1 - Primary - AO - TPG 1
>>> And a short time later:
>>> CT0 - secondary - AO - TPG 0
>>> CT1 - primary - AO - TPG 1
>>> Hopefully that helps with the context and why we
>>> are proposing what we are.
>> Ah-ha.
>> 'Unavailable' state. Right.
>> 
>> Hmm. Seems that we need to distinguish (at the device-mapper multipath layer) between temporarily failed paths (like transitioning), which could become available at a later time, and permanently failed paths (like unavailable or standby), for which a retry would not yield different results. I thought we did that, but apparently there's an issue somewhere.
>> 
>> Lemme see ...
>> 
>> Cheers,
>> 
>> Hannes
>> -- 
>> Dr. Hannes Reinecke                Kernel Storage Architect
>> hare@suse.de                              +49 911 74053 688
>> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
>> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
> 
> Thanks. For us, the current behavior after the change I mentioned leaves us in a bad state. The only thing we can do is use no_path_retry in multipath.conf, but that brings with it a new set of problems. There are times we do want an I/O error returned to the caller as quickly as possible (we actually don’t have any paths that can serve I/O). The state where one controller is unavailable and the other is transitioning is not a condition where we would want an I/O error returned. We expect the transitioning path to protect us until it transitions to an online state. These transition times are very short in our case, just a few seconds, but we would prefer not to hold on to I/O on the target if we can’t serve it. The transitioning state had been working great for this until CentOS 8 Stream picked up the change.
> 
> Thanks,
> Brian
> 



* Re: [PATCH 1/1]: scsi scsi_dh_alua: don't fail I/O until transition time expires
  2021-06-17 19:19         ` Brian Bunker
@ 2021-07-07 20:16           ` Brian Bunker
  0 siblings, 0 replies; 7+ messages in thread
From: Brian Bunker @ 2021-07-07 20:16 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-scsi

Just checking back in. There appear to be at least two ways to fix this problem: in the ALUA device handler or in drivers/md/dm-mpath.c. At this point, however, failing the path when the ALUA transitioning state is hit breaks what used to work and what we believe should work. Where should we go from here?

Thanks,
Brian

Brian Bunker
SW Eng
brian@purestorage.com



> On Jun 17, 2021, at 12:19 PM, Brian Bunker <brian@purestorage.com> wrote:
> 
> drivers/md/dm-mpath.c




Thread overview: 7+ messages
-- links below jump to the message on this page --
2021-05-28 18:34 [PATCH 1/1]: scsi scsi_dh_alua: don't fail I/O until transition time expires Brian Bunker
2021-06-07 12:42 ` Hannes Reinecke
2021-06-08  0:03   ` Brian Bunker
2021-06-09  7:03     ` Hannes Reinecke
2021-06-10 21:08       ` Brian Bunker
2021-06-17 19:19         ` Brian Bunker
2021-07-07 20:16           ` Brian Bunker
