* Re: [PATCH 1/1] scsi_dh_alua: properly handling the ALUA transitioning state
2022-05-20 20:03 ` Martin Wilck
@ 2022-05-21 2:52 ` Brian Bunker
2022-05-23 16:03 ` Martin Wilck
2022-05-21 10:17 ` Hannes Reinecke
2022-05-21 16:58 ` Mike Christie
2 siblings, 1 reply; 17+ messages in thread
From: Brian Bunker @ 2022-05-21 2:52 UTC (permalink / raw)
To: Martin Wilck
Cc: Mike Christie, Hannes Reinecke, Martin K. Petersen, linux-scsi,
Benjamin Marzinski
From my perspective, the ALUA transitioning state is a temporary state
where the target is saying that it does not have a permanent state
yet. Having the initiator try another pg to me doesn't seem like the
right thing for it to do. If the target wanted the initiator to use a
different pg, it should use an ALUA state which would make that clear,
standby, unavailable, etc. The target would only return an error state
if it was aware that some other path is in an active state.When
transitioning is returned, I don't think the initiator should assume
that any other pg would be a better choice. I think it should assume
that the target will make its intention clear for that path with a
permanent state within a transition timeout.
From my perspective the right thing to do is to let the ALUA handler
do what it is trying to do. If the pg state is transitioning and
within the transition timeout it should continue to retry that request
checking each time the transition timeout. In most cases the target
will easily change its state within the transition timeout and the
subsequent requests for that pg will either fail if the state has
transitioned to an error state, standby, etc., or the request will be
served as it has transitioned to an active state. If the target
misbehaves for the pg and stays in transition too long then the
handler will change the state from transition to standby and
subsequent I/O will fail on that path.
On Fri, May 20, 2022 at 1:03 PM Martin Wilck <mwilck@suse.com> wrote:
>
> On Fri, 2022-05-20 at 14:08 -0500, Mike Christie wrote:
> > On 5/20/22 9:03 AM, Martin Wilck wrote:
> > > On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote:
> > > > On 5/20/22 12:57, Martin Wilck wrote:
> > > > > Brian, Martin,
> > > > >
> > > > > sorry, I've overlooked this patch previously. I have to say I
> > > > > think
> > > > > it's wrong and shouldn't have been applied. At least I need
> > > > > more
> > > > > in-
> > > > > depth explanation.
> > > > >
> > > > > On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote:
> > > > > > On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote:
> > > > > >
> > > > > > > The handling of the ALUA transitioning state is currently
> > > > > > > broken.
> > > > > > > When
> > > > > > > a target goes into this state, it is expected that the
> > > > > > > target
> > > > > > > is
> > > > > > > allowed to stay in this state for the implicit transition
> > > > > > > timeout
> > > > > > > without a path failure.
> > > > >
> > > > > Can you please show me a quote from the specs on which this
> > > > > expectation
> > > > > ("without a path failure") is based? AFAIK the SCSI specs don't
> > > > > say
> > > > > anything about device-mapper multipath semantics.
> > > > >
> > > > > > > The handler has this logic, but it gets
> > > > > > > skipped currently.
> > > > > > >
> > > > > > > When the target transitions, there is in-flight I/O from
> > > > > > > the
> > > > > > > initiator. The first of these responses from the target
> > > > > > > will be
> > > > > > > a
> > > > > > > unit
> > > > > > > attention letting the initiator know that the ALUA state
> > > > > > > has
> > > > > > > changed.
> > > > > > > The remaining in-flight I/Os, before the initiator finds
> > > > > > > out
> > > > > > > that
> > > > > > > the
> > > > > > > portal state has changed, will return not ready, ALUA state
> > > > > > > is
> > > > > > > transitioning. The portal state will change to
> > > > > > > SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new
> > > > > > > I/O
> > > > > > > immediately failing the path unexpectedly. The path failure
> > > > > > > happens
> > > > > > > in
> > > > > > > less than a second instead of the expected successes until
> > > > > > > the
> > > > > > > transition timer is exceeded.
> > > > >
> > > > > dm multipath has no concept of "transitioning" state. Path
> > > > > state
> > > > > can be
> > > > > either active or inactive. As Brian wrote, commands sent to the
> > > > > transitioning device will return NOT READY, TRANSITIONING, and
> > > > > require
> > > > > retries on the SCSI layer. If we know this in advance, why
> > > > > should
> > > > > we
> > > > > continue sending I/O down this semi-broken path? If other,
> > > > > healthy
> > > > > paths are available, why it would it not be the right thing to
> > > > > switch
> > > > > I/O to them ASAP?
> > > > >
> > > > But we do, don't we?
> > > > Commands are being returned with the appropriate status, and
> > > > dm-multipath should make the corresponding decisions here.
> > > > This patch just modifies the check when _sending_ commands; ie
> > > > multipath
> > > > had decided that the path is still usable.
> > > > Question rather would be why multipath did that;
> > >
> > > If alua_prep_fn() got called, the path was considered usable at the
> > > given point in time by dm-multipath. Most probably the reason was
> > > simply that no error condition had occured on this path before ALUA
> > > state switched to transitioning. I suppose this can happen
> > > if storage
> > > switches a PG consisting of multiple paths to TRANSITIONING. We get
> > > an
> > > error on one path (sda, say), issue an RTPG, and receive the new
> > > ALUA
> > > state for all paths of the PG. For all paths except sda, we'd just
> > > see
> > > a switch to TRANSITIONING without a previous SCSI error.
> > >
> > > With this patch, we'll dispatch I/O (usually an entire bunch) to
> > > these
> > > paths despite seeing them in TRANSITIONING state. Eventually, when
> > > the
> > > SCSI responses are received, this leads to path failures. If I/O
> > > latencies are small, this happens after a few ms. In that case, the
> > > goal of Brian's patch is not reached, because the time until path
> > > failure would still be on the order of milliseconds. OTOH, if
> > > latencies
> > > are high, it takes substantially longer for the kernel to realize
> > > that
> > > the path is non-functional, while other, good paths may be idle. I
> > > fail
> > > to see the benefit.
> > >
> >
> > I'm not sure everyone agrees with you on the meaning of
> > transitioning.
> >
> > If we go from non-optimized to optimized or standby to non-
> > opt/optimized
> > we don't want to try other paths because it can cause thrashing.
>
> But only with explicit ALUA, or am I missing something? I agree that
> the host shouldn't initiate a PG switch if it encounters transitioning
> state. I also agree that for transitioning towards a "better" state,
> e.g. standby to (non)-optimized, failing the path would be
> questionable. Unfortunately we don't know in which "direction" the path
> is transitioning - it could be for 'better' or 'worse'. I suppose that
> in the case of a PG switch, it can happen that we dispatch I/O to a
> device that used to be in Standby and is now transitioning. Would it
> make sense to remember the previous state and "guess" what we're going
> to transition to? I.e. if the previous state was "Standby", it's
> probably going to be (non)optimized after the transition, and vice-
> versa?
>
> > We just
> > need to transition resources before we can fully use the path. It
> > could
> > be a local cache operation or for distributed targets it could be a
> > really
> > expensive operation.
> >
> > For both though, it can take longer than the retries we get from
> > scsi-ml.
>
> So if we want to do "the right thing", we'd continue dispatching to the
> device until either the state changes or the device-reported transition
> timeout has expired?
>
> Martin
>
>
> > For example this patch:
> >
> > commit 2b35865e7a290d313c3d156c0c2074b4c4ffaf52
> > Author: Hannes Reinecke <hare@suse.de>
> > Date: Fri Feb 19 09:17:13 2016 +0100
> >
> > scsi_dh_alua: Recheck state on unit attention
> >
> >
> > caused us issues because the retries were used up quickly. We just
> > changed
> > the target to return BUSY status and we don't use the transitioning
> > state.
> > The spec does mention using either return value in "5.15.2.5
> > Transitions
> > between target port asymmetric access states":
> >
> > ------
> > if during the transition the logical unit is inaccessible, then the
> > transition
> > is performed as a single indivisible event and the device server
> > shall respond
> > by either returning BUSY status, or returning CHECK CONDITION status,
> > with the
> > sense key set to NOT READY, and the sense code set to LOGICAL UNIT
> > NOT ACCESSIBLE,
> > ASYMMETRIC ACCESS STATE TRANSITION;
> >
> > ------
> >
> > So Brian's patch works if you return BUSY instead of 02/04/0a and are
> > setting
> > the state to transitioning during the time it's transitioning.
> >
> > I do partially agree with you and it's kind of a messy mix and match.
> > However,
> > I think we should change alua_check_sense to handle 02/04/0a the same
> > way we
> > handle it in alua_prep_fn. And then we should add a new flag for
> > devices that
> > have a bug and return transitioning forever.
> >
>
--
Brian Bunker
PURE Storage, Inc.
brian@purestorage.com
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] scsi_dh_alua: properly handling the ALUA transitioning state
2022-05-21 2:52 ` Brian Bunker
@ 2022-05-23 16:03 ` Martin Wilck
2022-05-23 16:52 ` Brian Bunker
0 siblings, 1 reply; 17+ messages in thread
From: Martin Wilck @ 2022-05-23 16:03 UTC (permalink / raw)
To: Brian Bunker
Cc: Mike Christie, Hannes Reinecke, Martin K. Petersen, linux-scsi,
Benjamin Marzinski
On Fri, 2022-05-20 at 19:52 -0700, Brian Bunker wrote:
> From my perspective, the ALUA transitioning state is a temporary
> state
> where the target is saying that it does not have a permanent state
> yet. Having the initiator try another pg to me doesn't seem like the
> right thing for it to do.
I agree. Unfortunately, there's no logic in dm-multipath saying "I may
switch paths inside a PG, but I may not do PG failover".
> If the target wanted the initiator to use a
> different pg, it should use an ALUA state which would make that
> clear,
> standby, unavailable, etc. The target would only return an error
> state
> if it was aware that some other path is in an active state.When
> transitioning is returned, I don't think the initiator should assume
> that any other pg would be a better choice. I think it should assume
> that the target will make its intention clear for that path with a
> permanent state within a transition timeout.
For me the question is still whether trying to send I/O to the path
that is known not to be able to process it makes sense. As noted
elsewhere, you patch just delays the BLK_STS_AGAIN by a few
milliseconds. You want to avoid a PG switch, and I second that, but IMO
that needs a different approach.
> From my perspective the right thing to do is to let the ALUA handler
> do what it is trying to do. If the pg state is transitioning and
> within the transition timeout it should continue to retry that
> request
> checking each time the transition timeout.
But this means that we should modify the logic not only in
alua_prep_fn() but also for handling of NOT READY conditions, either in
alua_check_sense() or in scsi_io_completion_action().
I agree that this would make a lot of sense, perhaps more than trying
to implement a cleverer logic in dm-multipath as discussed between
Hannes and myself.
This is what we need to figure out first: Do we want to change the
logic in the multipath layer, making it somehow aware of the special
nature of "transitioning" state, or should we tune the retry logic in
the SCSI layer such that dm-multipath will "do the right thing"
automatically?
Regards
Martin
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] scsi_dh_alua: properly handling the ALUA transitioning state
2022-05-23 16:03 ` Martin Wilck
@ 2022-05-23 16:52 ` Brian Bunker
2022-05-24 8:29 ` Martin Wilck
0 siblings, 1 reply; 17+ messages in thread
From: Brian Bunker @ 2022-05-23 16:52 UTC (permalink / raw)
To: Martin Wilck
Cc: Mike Christie, Hannes Reinecke, Martin K. Petersen, linux-scsi,
Benjamin Marzinski
> On May 23, 2022, at 9:03 AM, Martin Wilck <mwilck@suse.com> wrote:
>
> On Fri, 2022-05-20 at 19:52 -0700, Brian Bunker wrote:
>> From my perspective, the ALUA transitioning state is a temporary
>> state
>> where the target is saying that it does not have a permanent state
>> yet. Having the initiator try another pg to me doesn't seem like the
>> right thing for it to do.
>
> I agree. Unfortunately, there's no logic in dm-multipath saying "I may
> switch paths inside a PG, but I may not do PG failover".
>
>> If the target wanted the initiator to use a
>> different pg, it should use an ALUA state which would make that
>> clear,
>> standby, unavailable, etc. The target would only return an error
>> state
>> if it was aware that some other path is in an active state.When
>> transitioning is returned, I don't think the initiator should assume
>> that any other pg would be a better choice. I think it should assume
>> that the target will make its intention clear for that path with a
>> permanent state within a transition timeout.
>
> For me the question is still whether trying to send I/O to the path
> that is known not to be able to process it makes sense. As noted
> elsewhere, you patch just delays the BLK_STS_AGAIN by a few
> milliseconds. You want to avoid a PG switch, and I second that, but IMO
> that needs a different approach.
>
>> From my perspective the right thing to do is to let the ALUA handler
>> do what it is trying to do. If the pg state is transitioning and
>> within the transition timeout it should continue to retry that
>> request
>> checking each time the transition timeout.
>
> But this means that we should modify the logic not only in
> alua_prep_fn() but also for handling of NOT READY conditions, either in
> alua_check_sense() or in scsi_io_completion_action().
> I agree that this would make a lot of sense, perhaps more than trying
> to implement a cleverer logic in dm-multipath as discussed between
> Hannes and myself.
>
> This is what we need to figure out first: Do we want to change the
> logic in the multipath layer, making it somehow aware of the special
> nature of "transitioning" state, or should we tune the retry logic in
> the SCSI layer such that dm-multipath will "do the right thing"
> automatically?
>
> Regards
> Martin
>
I think that a couple of things are needed to get to where my expectation of how it should work would be. First this code should come out of the not ready sense check. The reason is that it will continually override alua_check’s attempt to change the pg state as it exceeds the allowed time and the path state is changed to standby to handle a misbehaving target which stays in transitioning past the timer.
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -409,20 +409,12 @@ static char print_alua_state(unsigned char state)
static enum scsi_disposition alua_check_sense(struct scsi_device *sdev,
struct scsi_sense_hdr *sense_hdr)
{
- struct alua_dh_data *h = sdev->handler_data;
- struct alua_port_group *pg;
-
switch (sense_hdr->sense_key) {
case NOT_READY:
if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0a) {
/*
* LUN Not Accessible - ALUA state transition
*/
- rcu_read_lock();
- pg = rcu_dereference(h->pg);
- if (pg)
- pg->state = SCSI_ACCESS_STATE_TRANSITIONING;
- rcu_read_unlock();
Second for this to work how it used to work in Linux and how it works for other OS’s where ALUA state transition is not a fail path response:
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 8d18cc7e510e..33828aa44347 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -776,11 +776,11 @@ static void scsi_io_completion_action(struct scsi_cmnd *cmd, int result)
case 0x1b: /* sanitize in progress */
case 0x1d: /* configuration in progress */
case 0x24: /* depopulation in progress */
action = ACTION_DELAYED_RETRY;
break;
case 0x0a: /* ALUA state transition */
- blk_stat = BLK_STS_AGAIN;
- fallthrough;
+ action = ACTION_REPREP;
+ break;
Because, as you said the expiation check for whether to continually allow new I/O on the pg is in the prep function of the ALUA handler. I think that this does introduce a lot error processing for a target that will respond quickly. Maybe there is some way to use the equivalent of ACTION_DELAYED_RETRY so that the target is not as aggressively retried in the transitioning state.
It is possible, maybe likely, in other OS’s that this retry is done at the multipath level. The DSM from Microsoft in GitHub would seem to show that Windows does it that way. The multipath-tools in Linux though don’t seem to use sense data to make decisions so getting this logic in would seem like a decent set of changes and getting the buy-in to go that route.
Thanks,
Brian
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] scsi_dh_alua: properly handling the ALUA transitioning state
2022-05-23 16:52 ` Brian Bunker
@ 2022-05-24 8:29 ` Martin Wilck
0 siblings, 0 replies; 17+ messages in thread
From: Martin Wilck @ 2022-05-24 8:29 UTC (permalink / raw)
To: Brian Bunker
Cc: Mike Christie, Hannes Reinecke, Martin K. Petersen, linux-scsi,
Benjamin Marzinski
On Mon, 2022-05-23 at 09:52 -0700, Brian Bunker wrote:
>
> > On May 23, 2022, at 9:03 AM, Martin Wilck <mwilck@suse.com> wrote:
> >
> > On Fri, 2022-05-20 at 19:52 -0700, Brian Bunker wrote:
> > > From my perspective, the ALUA transitioning state is a temporary
> > > state
> > > where the target is saying that it does not have a permanent
> > > state
> > > yet. Having the initiator try another pg to me doesn't seem like
> > > the
> > > right thing for it to do.
> >
> > I agree. Unfortunately, there's no logic in dm-multipath saying "I
> > may
> > switch paths inside a PG, but I may not do PG failover".
> >
> > > If the target wanted the initiator to use a
> > > different pg, it should use an ALUA state which would make that
> > > clear,
> > > standby, unavailable, etc. The target would only return an error
> > > state
> > > if it was aware that some other path is in an active state.When
> > > transitioning is returned, I don't think the initiator should
> > > assume
> > > that any other pg would be a better choice. I think it should
> > > assume
> > > that the target will make its intention clear for that path with
> > > a
> > > permanent state within a transition timeout.
> >
> > For me the question is still whether trying to send I/O to the path
> > that is known not to be able to process it makes sense. As noted
> > elsewhere, you patch just delays the BLK_STS_AGAIN by a few
> > milliseconds. You want to avoid a PG switch, and I second that, but
> > IMO
> > that needs a different approach.
> >
> > > From my perspective the right thing to do is to let the ALUA
> > > handler
> > > do what it is trying to do. If the pg state is transitioning and
> > > within the transition timeout it should continue to retry that
> > > request
> > > checking each time the transition timeout.
> >
> > But this means that we should modify the logic not only in
> > alua_prep_fn() but also for handling of NOT READY conditions,
> > either in
> > alua_check_sense() or in scsi_io_completion_action().
> > I agree that this would make a lot of sense, perhaps more than
> > trying
> > to implement a cleverer logic in dm-multipath as discussed between
> > Hannes and myself.
> >
> > This is what we need to figure out first: Do we want to change the
> > logic in the multipath layer, making it somehow aware of the
> > special
> > nature of "transitioning" state, or should we tune the retry logic
> > in
> > the SCSI layer such that dm-multipath will "do the right thing"
> > automatically?
> >
> > Regards
> > Martin
> >
> I think that a couple of things are needed to get to where my
> expectation of how it should work would be. First this code should
> come out of the not ready sense check. The reason is that it will
> continually override alua_check’s attempt to change the pg state as
> it exceeds the allowed time and the path state is changed to standby
> to handle a misbehaving target which stays in transitioning past the
> timer.
>
> --- a/drivers/scsi/device_handler/scsi_dh_alua.c
> +++ b/drivers/scsi/device_handler/scsi_dh_alua.c
> @@ -409,20 +409,12 @@ static char print_alua_state(unsigned char
> state)
> static enum scsi_disposition alua_check_sense(struct scsi_device
> *sdev,
> struct scsi_sense_hdr
> *sense_hdr)
> {
> - struct alua_dh_data *h = sdev->handler_data;
> - struct alua_port_group *pg;
> -
> switch (sense_hdr->sense_key) {
> case NOT_READY:
> if (sense_hdr->asc == 0x04 && sense_hdr->ascq ==
> 0x0a) {
> /*
> * LUN Not Accessible - ALUA state transition
> */
> - rcu_read_lock();
> - pg = rcu_dereference(h->pg);
> - if (pg)
> - pg->state =
> SCSI_ACCESS_STATE_TRANSITIONING;
> - rcu_read_unlock();
>
Good point. But I'd say that rather than removing it, this code should
be made aware of the timer.
> Second for this to work how it used to work in Linux and how it works
> for other OS’s where ALUA state transition is not a fail path
> response:
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 8d18cc7e510e..33828aa44347 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -776,11 +776,11 @@ static void scsi_io_completion_action(struct
> scsi_cmnd *cmd, int result)
> case 0x1b: /* sanitize in progress */
> case 0x1d: /* configuration in
> progress */
> case 0x24: /* depopulation in
> progress */
> action = ACTION_DELAYED_RETRY;
> break;
> case 0x0a: /* ALUA state transition
> */
> - blk_stat = BLK_STS_AGAIN;
> - fallthrough;
> + action = ACTION_REPREP;
> + break;
>
> Because, as you said the expiation check for whether to continually
> allow new I/O on the pg is in the prep function of the ALUA handler.
Not any more with your patch, or what am I missing?
> I think that this does introduce a lot error processing for a target
> that will respond quickly. Maybe there is some way to use the
> equivalent of ACTION_DELAYED_RETRY so that the target is not as
> aggressively retried in the transitioning state.
IMHO ACTION_DELAYED_RETRY would be a good option. Note that it blocks
the device; thus it prevents dispatching IO to the device according to
the scsi_device_block() logic. This way it automatically introduces
some delay.
>
> It is possible, maybe likely, in other OS’s that this retry is done
> at the multipath level. The DSM from Microsoft in GitHub would seem
> to show that Windows does it that way. The multipath-tools in Linux
> though don’t seem to use sense data to make decisions so getting this
> logic in would seem like a decent set of changes and getting the buy-
> in to go that route.
IMO we're closer to a good solution on the SCSI side. I've no idea how
Microsoft's DSM works, but it probably has a more detailed concept of
path states than Linux' dm-multipath, which operates on abstract block
devices.
Neither dm-multipath nor multipath-tools have a concept of
transitioning state. While it'd be possible to add this logic in
multipath-tools (user space), I see a couple of hard-to-solve problems:
- multipathd has no concept of ALUA. The ALUA functionality is split
into the device handler part (handled in the kernel, responsible for
explicit PG switching and error handling), and the "prioritizer"
handled in user space. ALUA is just one out of several path
prioritizers. The generic prioritizer concept doesn't define a
"transitioning" property. We'd need to try to generalize the concept,
and possibly figure out how to implement it with other prioritizers.
- the dm-multipath path group concept is abstract and, in general,
independent of ALUA's port groups. The core property of ALUA (all ports
in a group always have the same primary port state) doesn't generlly
apply to multipath PGs. Nothing prevents users from configuring
different path grouping policies, like multibus or grouping by latency,
even if ALUA is supported. The idea to prevent dm-multipath from
switching PG away from a transitioning path group is only valid if dm-
multipath PGs map 1:1 to ALUA port groups.
- ALUA itself is more generic than the scenario you describe. You say
that that if a path is transitioning, trying to switch PGs is wrong.
This is true if there's just one other PG, and that other PG is
unavailable. But there may be setups with more than 2 PGs, where one PG
is perfectly healthy while another is transitioning. In such a case,
it'd be reasonable to switch PGs towards the healthy one.
- multipathd doesn't get notified of ALUA state changes. It polls in
rather long time intervals (5s). If this logic is implemented in
multipathd, we'll see considerable latency in PG switching.
- In general, multipathd doesn't have full control over dm-multipath PG
switching. Some logic is in the kernel and some in user space. It's not
easy for multipathd to prevent the kernel from switching to a certain
PG. Basically the only way is to set all paths in that PG to failed
status.
Here's one dumb question: ALUA defines "transitioning" in an abstract
fashion, as a "movement from one target port asymmetric access state to
another". This doesn't say anything about the state the port is
transitioning towards. IOW, it could be transitioning towards "STANDBY"
or "UNAVAILABLE", perhaps even "OFFLINE". I am unsure if this makes
sense in practice; I _guess_ that most of the time "transitioning"
means something like "booting". Your suggestions seem to confirm that.
at least for PURE hardware, "transitioning" always means "will be
usable soon". Is this correct? And if yes, to which extent could this
be generalized to other arrays, past, present, and future?
Regards
Martin
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] scsi_dh_alua: properly handling the ALUA transitioning state
2022-05-20 20:03 ` Martin Wilck
2022-05-21 2:52 ` Brian Bunker
@ 2022-05-21 10:17 ` Hannes Reinecke
2022-05-23 15:33 ` Martin Wilck
2022-05-21 16:58 ` Mike Christie
2 siblings, 1 reply; 17+ messages in thread
From: Hannes Reinecke @ 2022-05-21 10:17 UTC (permalink / raw)
To: Martin Wilck, Mike Christie, Martin K. Petersen, Brian Bunker,
linux-scsi
Cc: Benjamin Marzinski
On 5/20/22 13:03, Martin Wilck wrote:
> On Fri, 2022-05-20 at 14:08 -0500, Mike Christie wrote:
>> On 5/20/22 9:03 AM, Martin Wilck wrote:
>>> On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote:
>>>> On 5/20/22 12:57, Martin Wilck wrote:
>>>>> Brian, Martin,
>>>>>
>>>>> sorry, I've overlooked this patch previously. I have to say I
>>>>> think
>>>>> it's wrong and shouldn't have been applied. At least I need
>>>>> more
>>>>> in-
>>>>> depth explanation.
>>>>>
>>>>> On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote:
>>>>>> On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote:
>>>>>>
>>>>>>> The handling of the ALUA transitioning state is currently
>>>>>>> broken.
>>>>>>> When
>>>>>>> a target goes into this state, it is expected that the
>>>>>>> target
>>>>>>> is
>>>>>>> allowed to stay in this state for the implicit transition
>>>>>>> timeout
>>>>>>> without a path failure.
>>>>>
>>>>> Can you please show me a quote from the specs on which this
>>>>> expectation
>>>>> ("without a path failure") is based? AFAIK the SCSI specs don't
>>>>> say
>>>>> anything about device-mapper multipath semantics.
>>>>>
>>>>>>> The handler has this logic, but it gets
>>>>>>> skipped currently.
>>>>>>>
>>>>>>> When the target transitions, there is in-flight I/O from
>>>>>>> the
>>>>>>> initiator. The first of these responses from the target
>>>>>>> will be
>>>>>>> a
>>>>>>> unit
>>>>>>> attention letting the initiator know that the ALUA state
>>>>>>> has
>>>>>>> changed.
>>>>>>> The remaining in-flight I/Os, before the initiator finds
>>>>>>> out
>>>>>>> that
>>>>>>> the
>>>>>>> portal state has changed, will return not ready, ALUA state
>>>>>>> is
>>>>>>> transitioning. The portal state will change to
>>>>>>> SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new
>>>>>>> I/O
>>>>>>> immediately failing the path unexpectedly. The path failure
>>>>>>> happens
>>>>>>> in
>>>>>>> less than a second instead of the expected successes until
>>>>>>> the
>>>>>>> transition timer is exceeded.
>>>>>
>>>>> dm multipath has no concept of "transitioning" state. Path
>>>>> state
>>>>> can be
>>>>> either active or inactive. As Brian wrote, commands sent to the
>>>>> transitioning device will return NOT READY, TRANSITIONING, and
>>>>> require
>>>>> retries on the SCSI layer. If we know this in advance, why
>>>>> should
>>>>> we
>>>>> continue sending I/O down this semi-broken path? If other,
>>>>> healthy
>>>>> paths are available, why it would it not be the right thing to
>>>>> switch
>>>>> I/O to them ASAP?
>>>>>
>>>> But we do, don't we?
>>>> Commands are being returned with the appropriate status, and
>>>> dm-multipath should make the corresponding decisions here.
>>>> This patch just modifies the check when _sending_ commands; ie
>>>> multipath
>>>> had decided that the path is still usable.
>>>> Question rather would be why multipath did that;
>>>
>>> If alua_prep_fn() got called, the path was considered usable at the
>>> given point in time by dm-multipath. Most probably the reason was
>>> simply that no error condition had occured on this path before ALUA
>>> state switched to transitioning. I suppose this can happen
>>> if storage
>>> switches a PG consisting of multiple paths to TRANSITIONING. We get
>>> an
>>> error on one path (sda, say), issue an RTPG, and receive the new
>>> ALUA
>>> state for all paths of the PG. For all paths except sda, we'd just
>>> see
>>> a switch to TRANSITIONING without a previous SCSI error.
>>>
>>> With this patch, we'll dispatch I/O (usually an entire bunch) to
>>> these
>>> paths despite seeing them in TRANSITIONING state. Eventually, when
>>> the
>>> SCSI responses are received, this leads to path failures. If I/O
>>> latencies are small, this happens after a few ms. In that case, the
>>> goal of Brian's patch is not reached, because the time until path
>>> failure would still be on the order of milliseconds. OTOH, if
>>> latencies
>>> are high, it takes substantially longer for the kernel to realize
>>> that
>>> the path is non-functional, while other, good paths may be idle. I
>>> fail
>>> to see the benefit.
>>>
>>
>> I'm not sure everyone agrees with you on the meaning of
>> transitioning.
>>
>> If we go from non-optimized to optimized or standby to non-
>> opt/optimized
>> we don't want to try other paths because it can cause thrashing.
>
> But only with explicit ALUA, or am I missing something? I agree that
> the host shouldn't initiate a PG switch if it encounters transitioning
> state. I also agree that for transitioning towards a "better" state,
> e.g. standby to (non)-optimized, failing the path would be
> questionable. Unfortunately we don't know in which "direction" the path
> is transitioning - it could be for 'better' or 'worse'. I suppose that
> in the case of a PG switch, it can happen that we dispatch I/O to a
> device that used to be in Standby and is now transitioning. Would it
> make sense to remember the previous state and "guess" what we're going
> to transition to? I.e. if the previous state was "Standby", it's
> probably going to be (non)optimized after the transition, and vice-
> versa?
>
>> We just
>> need to transition resources before we can fully use the path. It
>> could
>> be a local cache operation or for distributed targets it could be a
>> really
>> expensive operation.
>>
>> For both though, it can take longer than the retries we get from
>> scsi-ml.
>
> So if we want to do "the right thing", we'd continue dispatching to the
> device until either the state changes or the device-reported transition
> timeout has expired?
>
Not quite.
I think multipath should make the decision, and we shouldn't try to
out-guess multipathing from the device driver.
We should only make the information available (to wit: the path state)
for multipath to make the decision.
But if multipath decides to send I/O via a path which we _think_ it's in
transitioning we shouldn't be arguing with that (ie we shouldn't preempt
is from prep_fn) but rather let it rip; who knows, maybe there was a
reason for that.
The only thing this patch changes (for any current setup) is that we'll
incur additional round-trip times for I/O being sent in between the
first I/O returning with 'transitioning' and multipath picking up the
new path state. This I/O will be returned more-or-less immediately by
the target, so really we're only having an additional round-trip latency
for each I/O during that time. This is at most in the milliseconds
range, so I guess we can live with that.
If, however, multipath fails to pick up the 'transitioning' state then
this won't work, and we indeed will incur a regression. But that
arguably is an issue with multipath ...
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] scsi_dh_alua: properly handling the ALUA transitioning state
2022-05-21 10:17 ` Hannes Reinecke
@ 2022-05-23 15:33 ` Martin Wilck
0 siblings, 0 replies; 17+ messages in thread
From: Martin Wilck @ 2022-05-23 15:33 UTC (permalink / raw)
To: Hannes Reinecke, Mike Christie, Martin K. Petersen, Brian Bunker,
linux-scsi
Cc: Benjamin Marzinski
On Sat, 2022-05-21 at 12:17 +0200, Hannes Reinecke wrote:
> On 5/20/22 13:03, Martin Wilck wrote:
> > On Fri, 2022-05-20 at 14:08 -0500, Mike Christie wrote:
> > > On 5/20/22 9:03 AM, Martin Wilck wrote:
> > > > On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote:
> > > > > On 5/20/22 12:57, Martin Wilck wrote:
> > > > > > Brian, Martin,
> > > > > >
> > > > > > sorry, I've overlooked this patch previously. I have to say
> > > > > > I
> > > > > > think
> > > > > > it's wrong and shouldn't have been applied. At least I need
> > > > > > more
> > > > > > in-
> > > > > > depth explanation.
> > > > > >
> > > > > > On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen
> > > > > > wrote:
> > > > > > > On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote:
> > > > > > >
> > > > > > > > The handling of the ALUA transitioning state is
> > > > > > > > currently
> > > > > > > > broken.
> > > > > > > > When
> > > > > > > > a target goes into this state, it is expected that the
> > > > > > > > target
> > > > > > > > is
> > > > > > > > allowed to stay in this state for the implicit
> > > > > > > > transition
> > > > > > > > timeout
> > > > > > > > without a path failure.
> > > > > >
> > > > > > Can you please show me a quote from the specs on which this
> > > > > > expectation
> > > > > > ("without a path failure") is based? AFAIK the SCSI specs
> > > > > > don't
> > > > > > say
> > > > > > anything about device-mapper multipath semantics.
> > > > > >
> > > > > > > > The handler has this logic, but it gets
> > > > > > > > skipped currently.
> > > > > > > >
> > > > > > > > When the target transitions, there is in-flight I/O
> > > > > > > > from
> > > > > > > > the
> > > > > > > > initiator. The first of these responses from the target
> > > > > > > > will be
> > > > > > > > a
> > > > > > > > unit
> > > > > > > > attention letting the initiator know that the ALUA
> > > > > > > > state
> > > > > > > > has
> > > > > > > > changed.
> > > > > > > > The remaining in-flight I/Os, before the initiator
> > > > > > > > finds
> > > > > > > > out
> > > > > > > > that
> > > > > > > > the
> > > > > > > > portal state has changed, will return not ready, ALUA
> > > > > > > > state
> > > > > > > > is
> > > > > > > > transitioning. The portal state will change to
> > > > > > > > SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all
> > > > > > > > new
> > > > > > > > I/O
> > > > > > > > immediately failing the path unexpectedly. The path
> > > > > > > > failure
> > > > > > > > happens
> > > > > > > > in
> > > > > > > > less than a second instead of the expected successes
> > > > > > > > until
> > > > > > > > the
> > > > > > > > transition timer is exceeded.
> > > > > >
> > > > > > dm multipath has no concept of "transitioning" state. Path
> > > > > > state
> > > > > > can be
> > > > > > either active or inactive. As Brian wrote, commands sent to
> > > > > > the
> > > > > > transitioning device will return NOT READY, TRANSITIONING,
> > > > > > and
> > > > > > require
> > > > > > retries on the SCSI layer. If we know this in advance, why
> > > > > > should
> > > > > > we
> > > > > > continue sending I/O down this semi-broken path? If other,
> > > > > > healthy
> > > > > > paths are available, why it would it not be the right thing
> > > > > > to
> > > > > > switch
> > > > > > I/O to them ASAP?
> > > > > >
> > > > > But we do, don't we?
> > > > > Commands are being returned with the appropriate status, and
> > > > > dm-multipath should make the corresponding decisions here.
> > > > > This patch just modifies the check when _sending_ commands;
> > > > > ie
> > > > > multipath Depending on timing, multipathd may or may not
> > > > > already have done a PG switch because of 3.)
> > > > > had decided that the path is still usable.
> > > > > Question rather would be why multipath did that;
> > > >
> > > > If alua_prep_fn() got called, the path was considered usable at
> > > > the
> > > > given point in time by dm-multipath. Most probably the reason
> > > > was
> > > > simply that no error condition had occured on this path before
> > > > ALUA
> > > > state switched to transitioning. I suppose this can happen
> > > > if storage
> > > > switches a PG consisting of multiple paths to TRANSITIONING. We
> > > > get
> > > > an
> > > > error on one path (sda, say), issue an RTPG, and receive the
> > > > new
> > > > ALUA
> > > > state for all paths of the PG. For all paths except sda, we'd
> > > > just
> > > > see
> > > > a switch to TRANSITIONING without a previous SCSI error.
> > > >
> > > > With this patch, we'll dispatch I/O (usually an entire bunch)
> > > > to
> > > > these
> > > > paths despite seeing them in TRANSITIONING state. Eventually,
> > > > when
> > > > the
> > > > SCSI responses are received, this leads to path failures. If
> > > > I/O
> > > > latencies are small, this happens after a few ms. In that case,
> > > > the
> > > > goal of Brian's patch is not reached, because the time until
> > > > path
> > > > failure would still be on the order of milliseconds. OTOH, if
> > > > latencies
> > > > are high, it takes substantially longer for the kernel to
> > > > realize
> > > > that
> > > > the path is non-functional, while other, good paths may be
> > > > idle. I
> > > > fail
> > > > to see the benefit.
> > > >
> > >
> > > I'm not sure everyone agrees with you on the meaning of
> > > transitioning.
> > >
> > > If we go from non-optimized to optimized or standby to non-
> > > opt/optimized
> > > we don't want to try other paths because it can cause thrashing.
> >
> > But only with explicit ALUA, or am I missing something? I agree
> > that
> > the host shouldn't initiate a PG switch if it encounters
> > transitioning
> > state. I also agree that for transitioning towards a "better"
> > state,
> > e.g. standby to (non)-optimized, failing the path would be
> > questionable. Unfortunately we don't know in which "direction" the
> > path
> > is transitioning - it could be for 'better' or 'worse'. I suppose
> > that
> > in the case of a PG switch, it can happen that we dispatch I/O to a
> > device that used to be in Standby and is now transitioning. Would
> > it
> > make sense to remember the previous state and "guess" what we're
> > going
> > to transition to? I.e. if the previous state was "Standby", it's
> > probably going to be (non)optimized after the transition, and vice-
> > versa?
> >
> > > We just
> > > need to transition resources before we can fully use the path. It
> > > could
> > > be a local cache operation or for distributed targets it could be
> > > a
> > > really
> > > expensive operation.
> > >
> > > For both though, it can take longer than the retries we get from
> > > scsi-ml.
> >
> > So if we want to do "the right thing", we'd continue dispatching to
> > the
> > device until either the state changes or the device-reported
> > transition
> > timeout has expired?
> >
> Not quite.
> I think multipath should make the decision, and we shouldn't try to
> out-guess multipathing from the device driver.
> We should only make the information available (to wit: the path
> state)
> for multipath to make the decision.
The only information that dm-multipath currently has is the block-level
error code. The SCSI layer uses BLK_STS_AGAIN for "transitioning". Can
dm-multipath assume that BLK_STS_AGAIN *always* has these semantics,
regadless of the low-level driver? It isn't clearly documented, AFAICS.
Doing this properly requires additional logic in dm-mpath.c to treat
BLK_STS_AGAIN differently from other path errors. IIUC, what Brian and
Mike say is that dm-multipath may do path failover, but not PG failover
for BLK_STS_AGAIN. We currently don't have this logic, but we could
work on it.
IMO this means that Brian's patch would be a step in the wrong
direction: by returning BLK_STS_OK instead of BLK_STS_AGAIN, dm-
multipath gets *less* information about path state than before.
dm-multipath has 3 ways to get notified of transitioning state
(correct me if I'm missing anything):
1. I/O queuing fails with BLK_STS_AGAIN in scsi_prep_fn(). This won't
happen any more with Brian's patch.
2. I/O returns from target with CHECK CONDITION / NOT READY / 0x4:0xa
(LU not accessible - asymmetric state change transition).
In this case the SCSI layer applies the "maybe_retry" logic,
i.e. it retries the command. When retries are exhausted, it fails
the command with BLK_STS_AGAIN.
3. multipathd in user space sees a TRANSITIONING status during routine
path checking. Currently multipathd uses prio=0 for TRANSITIONING
state, which is even less than the prio used for STANDBY (1). Thus
multipathd would initiate a PG switch sooner or later. I believe
these prio assignments by multipathd are questionable, but that's
a different discussion.
dm-multipath interprets BLK_STS_AGAIN in 2.) as path error, and will
thus do a path failover in the PG. In a case like Brian's, the other
paths in the PG will likely also be in TRANSITIONING state, so
procedure 2.) will repeat until all paths in the PG have failed. In
that situation, dm-multipath will initiate a PG switch, which is what
Brian wanted to avoid in the first place.
> But if multipath decides to send I/O via a path which we _think_ it's
> in
> transitioning we shouldn't be arguing with that (ie we shouldn't
> preempt
> is from prep_fn) but rather let it rip; who knows, maybe there was a
> reason for that.
You seem to assume that multipath knows more about the device then the
SCSI layer. Isn't the opposite true? dm-multipath would never know that
the path is "transitioning", and it has zero knowledge about the
transition timeout. Only the ALUA code knows that certain path devices
will change their status simultaneously, always. The only additional
"knowledge" that multipath has is knowledge about path siblings and
PGs. By returning BLK_STS_OK from scsi_prep_fn(), we hold back
information that the multipath layer could use for an informed
decision.
> The only thing this patch changes (for any current setup) is that
> we'll
> incur additional round-trip times for I/O being sent in between the
> first I/O returning with 'transitioning' and multipath picking up the
> new path state. This I/O will be returned more-or-less immediately by
> the target, so really we're only having an additional round-trip
> latency
> for each I/O during that time. This is at most in the milliseconds
> range, so I guess we can live with that.
Side note: AFAICS it'll be several round-trips (until SCSI retries are
exhausted, as this is not a "fast io fail" case).
Anyway, you're right, a PG switch will happen after a few milliseconds.
That's what Brian wanted to avoid. IOW, the patch won't achieve what
it's supposed to achieve. *But* it will hold back information that dm-
multipath could use for handling transitions smarter than it currently
does.
> If, however, multipath fails to pick up the 'transitioning' state
> then
> this won't work, and we indeed will incur a regression. But that
> arguably is an issue with multipath ...
dm-multipath currently doesn't. Requesting that it should is a valid
feature request, no less and no more. I'm not against working on this
feature. But the generic device model in dm-multipath isn't quite
powerful enough for this. We could try to handle it more cleverly in
user space, but yet I don't see how it could be done.
Cheers,
Martin
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] scsi_dh_alua: properly handling the ALUA transitioning state
2022-05-20 20:03 ` Martin Wilck
2022-05-21 2:52 ` Brian Bunker
2022-05-21 10:17 ` Hannes Reinecke
@ 2022-05-21 16:58 ` Mike Christie
2 siblings, 0 replies; 17+ messages in thread
From: Mike Christie @ 2022-05-21 16:58 UTC (permalink / raw)
To: Martin Wilck, Hannes Reinecke, Martin K. Petersen, Brian Bunker,
linux-scsi
Cc: Benjamin Marzinski
On 5/20/22 3:03 PM, Martin Wilck wrote:
> On Fri, 2022-05-20 at 14:08 -0500, Mike Christie wrote:
>> On 5/20/22 9:03 AM, Martin Wilck wrote:
>>> On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote:
>>>> On 5/20/22 12:57, Martin Wilck wrote:
>>>>> Brian, Martin,
>>>>>
>>>>> sorry, I've overlooked this patch previously. I have to say I
>>>>> think
>>>>> it's wrong and shouldn't have been applied. At least I need
>>>>> more
>>>>> in-
>>>>> depth explanation.
>>>>>
>>>>> On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote:
>>>>>> On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote:
>>>>>>
>>>>>>> The handling of the ALUA transitioning state is currently
>>>>>>> broken.
>>>>>>> When
>>>>>>> a target goes into this state, it is expected that the
>>>>>>> target
>>>>>>> is
>>>>>>> allowed to stay in this state for the implicit transition
>>>>>>> timeout
>>>>>>> without a path failure.
>>>>>
>>>>> Can you please show me a quote from the specs on which this
>>>>> expectation
>>>>> ("without a path failure") is based? AFAIK the SCSI specs don't
>>>>> say
>>>>> anything about device-mapper multipath semantics.
>>>>>
>>>>>>> The handler has this logic, but it gets
>>>>>>> skipped currently.
>>>>>>>
>>>>>>> When the target transitions, there is in-flight I/O from
>>>>>>> the
>>>>>>> initiator. The first of these responses from the target
>>>>>>> will be
>>>>>>> a
>>>>>>> unit
>>>>>>> attention letting the initiator know that the ALUA state
>>>>>>> has
>>>>>>> changed.
>>>>>>> The remaining in-flight I/Os, before the initiator finds
>>>>>>> out
>>>>>>> that
>>>>>>> the
>>>>>>> portal state has changed, will return not ready, ALUA state
>>>>>>> is
>>>>>>> transitioning. The portal state will change to
>>>>>>> SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new
>>>>>>> I/O
>>>>>>> immediately failing the path unexpectedly. The path failure
>>>>>>> happens
>>>>>>> in
>>>>>>> less than a second instead of the expected successes until
>>>>>>> the
>>>>>>> transition timer is exceeded.
>>>>>
>>>>> dm multipath has no concept of "transitioning" state. Path
>>>>> state
>>>>> can be
>>>>> either active or inactive. As Brian wrote, commands sent to the
>>>>> transitioning device will return NOT READY, TRANSITIONING, and
>>>>> require
>>>>> retries on the SCSI layer. If we know this in advance, why
>>>>> should
>>>>> we
>>>>> continue sending I/O down this semi-broken path? If other,
>>>>> healthy
>>>>> paths are available, why it would it not be the right thing to
>>>>> switch
>>>>> I/O to them ASAP?
>>>>>
>>>> But we do, don't we?
>>>> Commands are being returned with the appropriate status, and
>>>> dm-multipath should make the corresponding decisions here.
>>>> This patch just modifies the check when _sending_ commands; ie
>>>> multipath
>>>> had decided that the path is still usable.
>>>> Question rather would be why multipath did that;
>>>
>>> If alua_prep_fn() got called, the path was considered usable at the
>>> given point in time by dm-multipath. Most probably the reason was
>>> simply that no error condition had occured on this path before ALUA
>>> state switched to transitioning. I suppose this can happen
>>> if storage
>>> switches a PG consisting of multiple paths to TRANSITIONING. We get
>>> an
>>> error on one path (sda, say), issue an RTPG, and receive the new
>>> ALUA
>>> state for all paths of the PG. For all paths except sda, we'd just
>>> see
>>> a switch to TRANSITIONING without a previous SCSI error.
>>>
>>> With this patch, we'll dispatch I/O (usually an entire bunch) to
>>> these
>>> paths despite seeing them in TRANSITIONING state. Eventually, when
>>> the
>>> SCSI responses are received, this leads to path failures. If I/O
>>> latencies are small, this happens after a few ms. In that case, the
>>> goal of Brian's patch is not reached, because the time until path
>>> failure would still be on the order of milliseconds. OTOH, if
>>> latencies
>>> are high, it takes substantially longer for the kernel to realize
>>> that
>>> the path is non-functional, while other, good paths may be idle. I
>>> fail
>>> to see the benefit.
>>>
>>
>> I'm not sure everyone agrees with you on the meaning of
>> transitioning.
>>
>> If we go from non-optimized to optimized or standby to non-
>> opt/optimized
>> we don't want to try other paths because it can cause thrashing.
>
> But only with explicit ALUA, or am I missing something? I agree that
That section of the spec mentions both implicit and explicit. For
implicit, the target can want to rebalance resources for things like
a resource is down permanently, or you add more ports, or we bring
up/down resources dynamically based on usage or maintenance.
> the host shouldn't initiate a PG switch if it encounters transitioning
> state. I also agree that for transitioning towards a "better" state,
> e.g. standby to (non)-optimized, failing the path would be
> questionable. Unfortunately we don't know in which "direction" the path
> is transitioning - it could be for 'better' or 'worse'. I suppose that
For implicit, the target knows. It's initiating the transition based
on whatever metrics or resources it has. We want the initiator to let us
complete what we are doing.
For explicit, then again the target knows what it wants to do when
it gets the STPG and we only it to set the paths to optimized. So
if it goes that route where it completes the STPG before the transition
completes, then goes into transitioning, then we can just let the
device do it's transitions.
> in the case of a PG switch, it can happen that we dispatch I/O to a
> device that used to be in Standby and is now transitioning. Would it
> make sense to remember the previous state and "guess" what we're going
> to transition to? I.e. if the previous state was "Standby", it's
> probably going to be (non)optimized after the transition, and vice-
> versa?
You are referring to the issue Hannes mentions where multipath can pick
up the transitioning state and it might get confused, right? I'm not
sure what to do.
>
>> We just
>> need to transition resources before we can fully use the path. It
>> could
>> be a local cache operation or for distributed targets it could be a
>> really
>> expensive operation.
>>
>> For both though, it can take longer than the retries we get from
>> scsi-ml.
>
> So if we want to do "the right thing", we'd continue dispatching to the
> device until either the state changes or the device-reported transition
> timeout has expired?
Sort of.
Ideally I think it would be nice if we blocked the device/queue for
normal IO, then just sent a RTPG every N secs or msecs until we changed
state or until the timer expired. We then unblock and either fail upwards
or dispatch. I think this is a lot of work though.
The problem with constant dispatching is that on low latency systems we
retry too quickly. I had to add a little sleep on the target side for this
or we hammer the target/initiator too hard and we got warnings (I can't
remember the exact warn/err).
^ permalink raw reply [flat|nested] 17+ messages in thread