* [PATCH 1/1]: scsi dm-mpath do not fail paths which are in ALUA state transitioning
From: Brian Bunker @ 2021-07-08 20:42 UTC
To: linux-scsi

In a controller failover, do not fail paths that are transitioning, or an
unexpected I/O error will be returned when accessing a multipath device.

Consider this case: a two-controller array with paths coming from a
primary and a secondary controller. During any upgrade there will be a
transition from a secondary to a primary state.

1. In the beginning all paths are active/optimized:

3624a9370602220bd9986439b00012c1e dm-12 PURE,FlashArray
size=3.0T features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 13:0:0:11 sdch 69:80  active ready running
  |- 14:0:0:11 sdce 69:32  active ready running
  |- 15:0:0:11 sdcf 69:48  active ready running
  |- 1:0:0:11  sdci 69:96  active ready running
  |- 9:0:0:11  sdck 69:128 active ready running
  |- 10:0:0:11 sdcg 69:64  active ready running
  |- 11:0:0:11 sdcd 69:16  active ready running
  `- 12:0:0:11 sdcj 69:112 active ready running

CT0 paths - sdce, sdcf, sdcg, sdcd
CT1 paths - sdch, sdci, sdck, sdcj

2. Run I/O to the multipath device:

[root@init115-15 ~]# /opt/Purity/bin/bb/pureload -m initthreads=32 /dev/dm-12
Thu Jul 8 13:33:47 2021: /opt/Purity/bin/bb/pureload num_cpus = 64
Thu Jul 8 13:33:47 2021: /opt/Purity/bin/bb/pureload num numa nodes 2
Thu Jul 8 13:33:47 2021: /opt/Purity/bin/bb/pureload Starting test with 32 threads

3. In an upgrade the primary controller is failed, and the secondary
controller transitions to primary. From an ALUA path perspective, the
paths to the previous primary go to ALUA state unavailable, while the
paths to the promoting primary move to ALUA state transitioning. It is
expected that 4 paths will fail:

Jul 8 13:33:58 init115-15 kernel: sd 14:0:0:11: [sdce] tag#1178 Add. Sense: Logical unit not accessible, target port in unavailable state
Jul 8 13:33:58 init115-15 kernel: sd 15:0:0:11: [sdcf] tag#1374 Add. Sense: Logical unit not accessible, target port in unavailable state
Jul 8 13:33:58 init115-15 kernel: sd 10:0:0:11: [sdcg] tag#600 Add. Sense: Logical unit not accessible, target port in unavailable state
Jul 8 13:33:58 init115-15 kernel: sd 11:0:0:11: [sdcd] tag#1460 Add. Sense: Logical unit not accessible, target port in unavailable state
Jul 8 13:33:58 init115-15 kernel: device-mapper: multipath: 253:12: Failing path 69:64.
Jul 8 13:33:58 init115-15 kernel: device-mapper: multipath: 253:12: Failing path 69:48.
Jul 8 13:33:58 init115-15 kernel: device-mapper: multipath: 253:12: Failing path 69:16.
Jul 8 13:33:58 init115-15 kernel: device-mapper: multipath: 253:12: Failing path 69:32.
Jul 8 13:33:58 init115-15 multipathd[46030]: 3624a9370602220bd9986439b00012c1e: remaining active paths: 7
Jul 8 13:33:58 init115-15 multipathd[46030]: 3624a9370602220bd9986439b00012c1e: remaining active paths: 6
Jul 8 13:33:59 init115-15 multipathd[46030]: 3624a9370602220bd9986439b00012c1e: remaining active paths: 5
Jul 8 13:33:59 init115-15 multipathd[46030]: 3624a9370602220bd9986439b00012c1e: remaining active paths: 4

4. It is not expected that the remaining 4 paths will also fail. This was
not the case until the change which introduced BLK_STS_AGAIN into the
SCSI ALUA device handler. With that change, new I/O which reaches that
handler on paths that are in ALUA state transitioning will result in
those paths failing. Previous Linux versions, before that change, did not
return an I/O error back to the client application. Similarly, this
problem does not happen in other operating systems, e.g. ESXi, Windows,
AIX, etc.

5. It is not expected that the paths to the promoting primary fail, yet
they do:

Jul 8 13:33:59 init115-15 kernel: device-mapper: multipath: 253:12: Failing path 69:96.
Jul 8 13:33:59 init115-15 kernel: device-mapper: multipath: 253:12: Failing path 69:112.
Jul 8 13:33:59 init115-15 kernel: device-mapper: multipath: 253:12: Failing path 69:80.
Jul 8 13:33:59 init115-15 kernel: device-mapper: multipath: 253:12: Failing path 69:128.
Jul 8 13:33:59 init115-15 multipath[53813]: dm-12: no usable paths found
Jul 8 13:33:59 init115-15 multipath[53833]: dm-12: no usable paths found
Jul 8 13:33:59 init115-15 multipath[53853]: dm-12: no usable paths found
Jul 8 13:33:59 init115-15 multipathd[46030]: 3624a9370602220bd9986439b00012c1e: remaining active paths: 3
Jul 8 13:33:59 init115-15 multipathd[46030]: 3624a9370602220bd9986439b00012c1e: remaining active paths: 2
Jul 8 13:33:59 init115-15 multipathd[46030]: 3624a9370602220bd9986439b00012c1e: remaining active paths: 1
Jul 8 13:33:59 init115-15 multipathd[46030]: 3624a9370602220bd9986439b00012c1e: remaining active paths: 0

6. The error gets back to the user of the multipath device unexpectedly:

Thu Jul 8 13:33:59 2021: /opt/Purity/bin/bb/pureload I/O Error: io 43047 fd 36 op read offset 00000028ef7a7000 size 4096 errno 11 rsize -1

The earlier patch I made for this was not desirable, so I am proposing
this much smaller patch, which similarly does not allow the transitioning
paths to result in immediate failure.

Signed-off-by: Brian Bunker <brian@purestorage.com>
Acked-by: Krishna Kant <krishna.kant@purestorage.com>
Acked-by: Seamus Connor <sconnor@purestorage.com>

____
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index bced42f082b0..d5d6be96068d 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1657,7 +1657,7 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
 	else
 		r = DM_ENDIO_REQUEUE;
 
-	if (pgpath)
+	if (pgpath && (error != BLK_STS_AGAIN))
 		fail_path(pgpath);
 
 	if (!atomic_read(&m->nr_valid_paths) &&

Brian Bunker
SW Eng
brian@purestorage.com
* Re: [PATCH 1/1]: scsi dm-mpath do not fail paths which are in ALUA state transitioning
From: Martin Wilck @ 2021-07-12 9:19 UTC
To: Brian Bunker, linux-scsi, Hannes Reinecke

Hello Brian,

On Do, 2021-07-08 at 13:42 -0700, Brian Bunker wrote:
> In a controller failover do not fail paths that are transitioning or
> an unexpected I/O error will return when accessing a multipath device.
>
> Consider this case, a two controller array with paths coming from a
> primary and a secondary controller. During any upgrade there will be a
> transition from a secondary to a primary state.
>
> [...]
> 4. It is not expected that the remaining 4 paths will also fail. This
> was not the case until the change which introduced BLK_STS_AGAIN into
> the SCSI ALUA device handler. With that change new I/O which reaches
> that handler on paths that are in ALUA state transitioning will result
> in those paths failing. Previous Linux versions, before that change,
> will not return an I/O error back to the client application.
> Similarly, this problem does not happen in other operating systems,
> e.g. ESXi, Windows, AIX, etc.

Please confirm that your kernel included ee8868c5c78f ("scsi:
scsi_dh_alua: Retry RTPG on a different path after failure").
That commit should cause the RTPG to be retried on other map members
which are not in failed state, thus avoiding this phenomenon.

> [...]
>
> 6. The error gets back to the user of the multipath device
> unexpectedly:
> Thu Jul 8 13:33:59 2021: /opt/Purity/bin/bb/pureload I/O Error: io
> 43047 fd 36 op read offset 00000028ef7a7000 size 4096 errno 11
> rsize -1
>
> The earlier patch I made for this was not desirable, so I am proposing
> this much smaller patch which will similarly not allow the
> transitioning paths to result in immediate failure.
>
> Signed-off-by: Brian Bunker <brian@purestorage.com>
> Acked-by: Krishna Kant <krishna.kant@purestorage.com>
> Acked-by: Seamus Connor <sconnor@purestorage.com>
>
> ____
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index bced42f082b0..d5d6be96068d 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -1657,7 +1657,7 @@ static int multipath_end_io(struct dm_target
> *ti, struct request *clone,
>         else
>                 r = DM_ENDIO_REQUEUE;
>
> -       if (pgpath)
> +       if (pgpath && (error != BLK_STS_AGAIN))
>                 fail_path(pgpath);
>
>         if (!atomic_read(&m->nr_valid_paths) &&

I doubt that this is correct. If you look at the commit msg of
268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA
transitioning state"):

"When the ALUA state indicates transitioning we should not retry the command
immediately, but rather complete the command with BLK_STS_AGAIN to signal
the completion handler that it might be retried. This allows multipathing
to redirect the command to another path if possible, and avoid stalls
during lengthy transitioning times."

The purpose of that patch was to set the state of the transitioning
path to failed in order to make sure IO is retried on a different path.
Your patch would undermine this purpose.

Regards
Martin
* Re: [PATCH 1/1]: scsi dm-mpath do not fail paths which are in ALUA state transitioning
From: Brian Bunker @ 2021-07-12 21:38 UTC
To: Martin Wilck; Cc: linux-scsi, Hannes Reinecke

Martin,

Please confirm that your kernel included ee8868c5c78f ("scsi:
scsi_dh_alua: Retry RTPG on a different path after failure").
That commit should cause the RTPG to be retried on other map members
which are not in failed state, thus avoiding this phenomenon.

In my case, there are no other map members that are not in the failed
state. One set of paths goes to the ALUA unavailable state when the
primary fails, and the second set of paths moves to ALUA state
transitioning as the previous secondary becomes the primary. If the
paths which are transitioning are failed, an all-paths-down state
happens, which is not expected. There should be a time for which
transitioning is a transient state until the next state is entered.
Failing a path assuming there would be non-failed paths seems wrong.

The purpose of that patch was to set the state of the transitioning
path to failed in order to make sure IO is retried on a different path.
Your patch would undermine this purpose.

I agree this is what happens, but those transitioning paths might be
the only non-failed paths available. I don't think it is reasonable to
fail them. This is the same as treating transitioning as standby or
unavailable. As you point out, this happened with the commit you
mention. Before this commit, what I am doing does not result in an
all-paths-down error, and similarly, it does not in earlier Linux
versions or other OSes under the same condition. I see this as a
regression.

Thanks,
Brian

On Mon, Jul 12, 2021 at 2:19 AM Martin Wilck <mwilck@suse.com> wrote:
>
> Hello Brian,
>
> On Do, 2021-07-08 at 13:42 -0700, Brian Bunker wrote:
> > In a controller failover do not fail paths that are transitioning or
> > an unexpected I/O error will return when accessing a multipath device.
> >
> > Consider this case, a two controller array with paths coming from a
> > primary and a secondary controller. During any upgrade there will be a
> > transition from a secondary to a primary state.
> >
> > [...]
> > 4. It is not expected that the remaining 4 paths will also fail. This
> > was not the case until the change which introduced BLK_STS_AGAIN into
> > the SCSI ALUA device handler. With that change new I/O which reaches
> > that handler on paths that are in ALUA state transitioning will result
> > in those paths failing. Previous Linux versions, before that change,
> > will not return an I/O error back to the client application.
> > Similarly, this problem does not happen in other operating systems,
> > e.g. ESXi, Windows, AIX, etc.
>
> Please confirm that your kernel included ee8868c5c78f ("scsi:
> scsi_dh_alua: Retry RTPG on a different path after failure").
> That commit should cause the RTPG to be retried on other map members
> which are not in failed state, thus avoiding this phenomenon.
>
> > [...]
> >
> > 6. The error gets back to the user of the multipath device
> > unexpectedly:
> > Thu Jul 8 13:33:59 2021: /opt/Purity/bin/bb/pureload I/O Error: io
> > 43047 fd 36 op read offset 00000028ef7a7000 size 4096 errno 11
> > rsize -1
> >
> > The earlier patch I made for this was not desirable, so I am proposing
> > this much smaller patch which will similarly not allow the
> > transitioning paths to result in immediate failure.
> >
> > Signed-off-by: Brian Bunker <brian@purestorage.com>
> > Acked-by: Krishna Kant <krishna.kant@purestorage.com>
> > Acked-by: Seamus Connor <sconnor@purestorage.com>
> >
> > ____
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index bced42f082b0..d5d6be96068d 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -1657,7 +1657,7 @@ static int multipath_end_io(struct dm_target
> > *ti, struct request *clone,
> >         else
> >                 r = DM_ENDIO_REQUEUE;
> >
> > -       if (pgpath)
> > +       if (pgpath && (error != BLK_STS_AGAIN))
> >                 fail_path(pgpath);
> >
> >         if (!atomic_read(&m->nr_valid_paths) &&
>
> I doubt that this is correct. If you look at the commit msg of
> 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA
> transitioning state"):
>
> "When the ALUA state indicates transitioning we should not retry the command
> immediately, but rather complete the command with BLK_STS_AGAIN to signal
> the completion handler that it might be retried. This allows multipathing
> to redirect the command to another path if possible, and avoid stalls
> during lengthy transitioning times."
>
> The purpose of that patch was to set the state of the transitioning
> path to failed in order to make sure IO is retried on a different path.
> Your patch would undermine this purpose.
>
> Regards
> Martin

--
Brian Bunker
PURE Storage, Inc.
brian@purestorage.com
* Re: [PATCH 1/1]: scsi dm-mpath do not fail paths which are in ALUA state transitioning
From: Martin Wilck @ 2021-07-13 9:13 UTC
To: Brian Bunker; Cc: linux-scsi, Hannes Reinecke

Hello Brian,

On Mo, 2021-07-12 at 14:38 -0700, Brian Bunker wrote:
> Martin,
>
> > Please confirm that your kernel included ee8868c5c78f ("scsi:
> > scsi_dh_alua: Retry RTPG on a different path after failure").
> > That commit should cause the RTPG to be retried on other map members
> > which are not in failed state, thus avoiding this phenomenon.
>
> In my case, there are no other map members that are not in the failed
> state. One set of paths goes to the ALUA unavailable state when the
> primary fails, and the second set of paths moves to ALUA state
> transitioning as the previous secondary becomes the primary.

IMO this is the problem. How does your array respond to SCSI commands
while ports are transitioning?

SPC5 (§5.16.2.6) says that the server should either fail all commands
with BUSY or CHECK CONDITION/NOT READY/LOGICAL UNIT NOT
ACCESSIBLE/ASYMMETRIC ACCESS STATE TRANSITION (a), or should support
all TMFs and a subset of SCSI commands, while responding with
CC/NR/AAST to all other commands (b). SPC6 (§5.18.2.6) is no different.

No matter how you read that paragraph, it's pretty clear that
"transitioning" is generally not a healthy state to attempt I/O.

Are you saying that on your server, the transitioning ports are able to
process regular I/O commands like READ and WRITE? If that's the case,
why do you pretend to be "transitioning" at all, rather than in an
active state? If it's not the case, why would it make sense for the
host to retry I/O on the transitioning path?

> If the
> paths are failed which are transitioning, an all paths down state
> happens which is not expected.

IMO it _is_ expected if in fact no path is able to process SCSI
commands at the given point in time.

> There should be a time for which
> transitioning is a transient state until the next state is entered.
> Failing a path assuming there would be non-failed paths seems wrong.

This is a misunderstanding. The path isn't failed because of
assumptions about other paths. It is failed because we know that it's
non-functional, and thus we must try other paths, if there are any.

Before 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA
transitioning state"), I/O was indeed retried on transitioning paths,
possibly forever. This posed a serious problem when a transitioning
path was about to be removed (e.g. dev_loss_tmo expired). And I'm still
quite convinced that it was wrong in general, because by all reasonable
means a "transitioning" path isn't usable for the host.

If we find a transitioning path, it might make sense to retry on other
paths first and eventually switch back to the transitioning path, when
all others have failed hard (e.g. "unavailable" state). However, this
logic doesn't exist in the kernel. In theory, it could be mapped to a
"transitioning" priority group in device-mapper multipath. But prio
groups are managed in user space (multipathd), which treats
transitioning paths as "almost failed" (priority 0) anyway. We can
discuss enhancing multipathd such that it re-checks transitioning paths
more frequently, in order to be able to reinstate them ASAP.

According to what you said above, the "transitioning" ports in the
problem situation ("second set") are those that were in "unavailable"
state before, which means "failed" as far as device mapper is concerned
- IOW, the paths in question would be unused anyway, until they got
reinstated, which wouldn't happen before they are fully up. With this
in mind, I have to say I don't understand why your proposed patch would
help at all. Please explain.

> > The purpose of that patch was to set the state of the transitioning
> > path to failed in order to make sure IO is retried on a different
> > path.
> > Your patch would undermine this purpose.

(Additional indentation added by me) Can you please use proper quoting?
You were mixing my statements and your own.

> I agree this is what happens but those transitioning paths might be
> the only non-failed paths available. I don't think it is reasonable to
> fail them. This is the same as treating transitioning as standby or
> unavailable.

Right, and according to the SPC spec (see above), that makes more sense
than treating it as "active".

Storage vendors seem to interpret "transitioning" very differently,
both in terms of commands supported and in terms of time required to
reach the target state. That makes it hard to deal with it correctly on
the host side.

> As you point out this happened with the commit you
> mention. Before this commit what I am doing does not result in an all
> paths down error, and similarly, it does not in earlier Linux versions
> or other OS's under the same condition. I see this as a regression.

If you use a suitable "no_path_retry" setting in multipathd, you should
be able to handle the situation you describe just fine by queueing the
I/O until the transitioning paths are fully up. IIUC, on your server
"transitioning" is a transient state that ends quickly, so queueing
shouldn't be an issue. E.g. if you are certain that "transitioning"
won't last longer than 10s, you could set "no_path_retry 2".

Regards,
Martin
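[Editor's note] Martin's "no_path_retry" suggestion would look roughly like this in /etc/multipath.conf. This is a hedged sketch: the vendor/product values follow the array identity shown in the `multipath -ll` output above, and `no_path_retry` counts checker intervals, so with the default `polling_interval` of 5 seconds, a value of 2 queues I/O for about 10 seconds before failing when no usable path remains.

```
devices {
    device {
        vendor  "PURE"
        product "FlashArray"
        # Queue I/O for ~2 checker intervals (2 x polling_interval)
        # when all paths are down, then fail outstanding I/O.
        no_path_retry 2
    }
}
```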
* Re: [PATCH 1/1]: scsi dm-mpath do not fail paths which are in ALUA state transitioning
From: Brian Bunker @ 2021-07-14 0:32 UTC
To: Martin Wilck; Cc: linux-scsi, Hannes Reinecke

On Tue, Jul 13, 2021 at 2:13 AM Martin Wilck <mwilck@suse.com> wrote:
>
> Hello Brian,
>
> On Mo, 2021-07-12 at 14:38 -0700, Brian Bunker wrote:
> > Martin,
> >
> > > Please confirm that your kernel included ee8868c5c78f ("scsi:
> > > scsi_dh_alua: Retry RTPG on a different path after failure").
> > > That commit should cause the RTPG to be retried on other map
> > > members
> > > which are not in failed state, thus avoiding this phenomenon.
> >
> > In my case, there are no other map members that are not in the failed
> > state. One set of paths goes to the ALUA unavailable state when the
> > primary fails, and the second set of paths moves to ALUA state
> > transitioning as the previous secondary becomes the primary.
>
> IMO this is the problem. How does your array respond to SCSI commands
> while ports are transitioning?
>
> SPC5 (§5.16.2.6) says that the server should either fail all commands
> with BUSY or CHECK CONDITION/NOT READY/LOGICAL UNIT NOT
> ACCESSIBLE/ASYMMETRIC ACCESS STATE TRANSITION (a), or should support
> all TMFs and a subset of SCSI commands, while responding with
> CC/NR/AAST to all other commands (b). SPC6 (§5.18.2.6) is no different.
>
> No matter how you read that paragraph, it's pretty clear that
> "transitioning" is generally not a healthy state to attempt I/O.
>
> Are you saying that on your server, the transitioning ports are able to
> process regular I/O commands like READ and WRITE? If that's the case,
> why do you pretend to be "transitioning" at all, rather than in an
> active state? If it's not the case, why would it make sense for the
> host to retry I/O on the transitioning path?

In the ALUA transitioning state, we cannot process READ or WRITE and
will return with the sense data as you mentioned above. We expect
retries down that transitioning path until it transitions to another
ALUA state (at least for some reasonable period of time for the
transition). The spec defines the state as the transition between
target asymmetric states. The current implementation requires
coordination on the target not to return a state transition down all
paths at the same time or risk all paths being failed. Using the ALUA
transitioning state allows us to respond to initiator READ and WRITE
requests even if we can't serve them when our internal target state is
transitioning (secondary to primary). The alternative is to queue them
which presents a different set of problems.

> > If the
> > paths are failed which are transitioning, an all paths down state
> > happens which is not expected.
>
> IMO it _is_ expected if in fact no path is able to process SCSI
> commands at the given point in time.

In this case it would seem having all paths move to transitioning would
lead to all paths lost. It is possible to imagine implementations where
for a brief period of time all paths are in a transitioning state. What
would be the point of returning a transient state if the result is a
permanent failure?

> > There should be a time for which
> > transitioning is a transient state until the next state is entered.
> > Failing a path assuming there would be non-failed paths seems wrong.
>
> This is a misunderstanding. The path isn't failed because of
> assumptions about other paths. It is failed because we know that it's
> non-functional, and thus we must try other paths, if there are any.
>
> Before 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA
> transitioning state"), I/O was indeed retried on transitioning paths,
> possibly forever. This posed a serious problem when a transitioning
> path was about to be removed (e.g. dev_loss_tmo expired). And I'm still
> quite convinced that it was wrong in general, because by all reasonable
> means a "transitioning" path isn't usable for the host.
>
> If we find a transitioning path, it might make sense to retry on other
> paths first and eventually switch back to the transitioning path, when
> all others have failed hard (e.g. "unavailable" state). However, this
> logic doesn't exist in the kernel. In theory, it could be mapped to a
> "transitioning" priority group in device-mapper multipath. But prio
> groups are managed in user space (multipathd), which treats
> transitioning paths as "almost failed" (priority 0) anyway. We can
> discuss enhancing multipathd such that it re-checks transitioning paths
> more frequently, in order to be able to reinstate them ASAP.
>
> According to what you said above, the "transitioning" ports in the
> problem situation ("second set") are those that were in "unavailable"
> state before, which means "failed" as far as device mapper is concerned
> - IOW, the paths in question would be unused anyway, until they got
> reinstated, which wouldn't happen before they are fully up. With this
> in mind, I have to say I don't understand why your proposed patch would
> help at all. Please explain.

My proposed patch would not fail the paths in the case of
BLK_STS_AGAIN. This seems to result in the requests being retried until
the path transitions to either a failed state, standby or unavailable,
or an online state.

> > > The purpose of that patch was to set the state of the transitioning
> > > path to failed in order to make sure IO is retried on a different
> > > path.
> > > Your patch would undermine this purpose.
>
> (Additional indentation added by me) Can you please use proper quoting?
> You were mixing my statements and your own.
>
> > I agree this is what happens but those transitioning paths might be
> > the only non-failed paths available. I don't think it is reasonable
> > to
> > fail them. This is the same as treating transitioning as standby or
> > unavailable.
>
> Right, and according to the SPC spec (see above), that makes more sense
> than treating it as "active".
>
> Storage vendors seem to interpret "transitioning" very differently,
> both in terms of commands supported and in terms of time required to
> reach the target state. That makes it hard to deal with it correctly on
> the host side.
>
> > As you point out this happened with the commit you
> > mention. Before this commit what I am doing does not result in an all
> > paths down error, and similarly, it does not in earlier Linux
> > versions
> > or other OS's under the same condition. I see this as a regression.
>
> If you use a suitable "no_path_retry" setting in multipathd, you should
> be able to handle the situation you describe just fine by queueing the
> I/O until the transitioning paths are fully up. IIUC, on your server
> "transitioning" is a transient state that ends quickly, so queueing
> shouldn't be an issue. E.g. if you are certain that "transitioning"
> won't last longer than 10s, you could set "no_path_retry 2".
>
> Regards,
> Martin

I have tested using no_path_retry and you are correct that it does work
around the issue that I am seeing. The problem with that is there are
times we want to convey all paths down to the initiator as quickly as
possible and doing this will delay that.

Thanks,
Brian

--
Brian Bunker
PURE Storage, Inc.
brian@purestorage.com
* Re: [PATCH 1/1]: scsi dm-mpath do not fail paths which are in ALUA state transitioning 2021-07-14 0:32 ` Brian Bunker @ 2021-07-14 0:37 ` Brian Bunker 2021-07-14 8:38 ` Martin Wilck 0 siblings, 1 reply; 9+ messages in thread From: Brian Bunker @ 2021-07-14 0:37 UTC (permalink / raw) To: Martin Wilck; +Cc: linux-scsi, Hannes Reinecke On Tue, Jul 13, 2021 at 2:13 AM Martin Wilck <mwilck@suse.com> wrote: > > Hello Brian, > > On Mo, 2021-07-12 at 14:38 -0700, Brian Bunker wrote: > > Martin, > > > > > Please confirm that your kernel included ee8868c5c78f ("scsi: > > > scsi_dh_alua: Retry RTPG on a different path after failure"). > > > That commit should cause the RTPG to be retried on other map > > > members > > > which are not in failed state, thus avoiding this phenomenon. > > > > In my case, there are no other map members that are not in the failed > > state. One set of paths goes to the ALUA unavailable state when the > > primary fails, and the second set of paths moves to ALUA state > > transitioning as the previous secondary becomes the primary. > > IMO this is the problem. How does your array respond to SCSI commands > while ports are transitioning? > > SPC5 (§5.16.2.6) says that the server should either fail all commands > with BUSY or CHECK CONDITION/NOT READY/LOGICAL UNIT NOT > ACCESSIBLE/ASYMMETRIC ACCESS STATE TRANSITION (a), or should support > all TMFs and a subset of SCSI commands, while responding with > CC/NR/AAST to all other commands (b). SPC6 (§5.18.2.6) is no different. > > No matter how you read that paragraph, it's pretty clear that > "transitioning" is generally not a healthy state to attempt I/O. > > Are you saying that on your server, the transitioning ports are able to > process regular I/O commands like READ and WRITE? If that's the case, > why do you pretend to be "transitioning" at all, rather than in an > active state? If it's not the case, why would it make sense for the > host to retry I/O on the transitioning path? 
In the ALUA transitioning state, we cannot process READ or WRITE and will return with the sense data as you mentioned above. We expect retries down that transitioning path until it transitions to another ALUA state (at least for some reasonable period of time for the transition). The spec defines the state as the transition between target asymmetric states. The current implementation requires coordination on the target not to return a state transition down all paths at the same time or risk all paths being failed. Using the ALUA transition state allows us to respond to initiator READ and WRITE requests even if we can't serve them when our internal target state is transitioning (secondary to primary). The alternative is to queue them which presents a different set of problems. > > > If the > > paths are failed which are transitioning, an all paths down state > > happens which is not expected. > > IMO it _is_ expected if in fact no path is able to process SCSI > commands at the given point in time. In this case it would seem having all paths move to transitioning would lead to all paths lost. It is possible to imagine implementations where for a brief period of time all paths are in a transitioning state. What would be the point of returning a transient state if the result is a permanent failure? > > > There should be a time for which > > transitioning is a transient state until the next state is entered. > > Failing a path assuming there would be non-failed paths seems wrong. > > This is a misunderstanding. The path isn't failed because of > assumptions about other paths. It is failed because we know that it's > non-functional, and thus we must try other paths, if there are any. > > Before 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA > transitioning state"), I/O was indeed retried on transitioning paths, > possibly forever. This posed a serious problem when a transitioning > path was about to be removed (e.g. dev_loss_tmo expired). 
And I'm still > quite convinced that it was wrong in general, because by all reasonable > means a "transitioning" path isn't usable for the host. > > If we find a transitioning path, it might make sense to retry on other > paths first and eventually switch back to the transitioning path, when > all others have failed hard (e.g. "unavailable" state). However, this > logic doesn't exist in the kernel. In theory, it could be mapped to a > "transitioning" priority group in device-mapper multipath. But prio > groups are managed in user space (multipathd), which treats > transitioning paths as "almost failed" (priority 0) anyway. We can > discuss enhancing multipathd such that it re-checks transitioning paths > more frequently, in order to be able to reinstate them ASAP. > > According to what you said above, the "transitioning" ports in the > problem situation ("second set") are those that were in "unavailable" > state before, which means "failed" as far as device mapper is concerned > - IOW, the paths in question would be unused anyway, until they got > reinstated, which wouldn't happen before they are fully up. With this > in mind, I have to say I don't understand why your proposed patch would > help at all. Please explain. > My proposed patch would not fail the paths in the case of BLK_STS_AGAIN. This seems to result in the requests being retried until the path transitions to either a failed state, standby or unavailable, or an online state. > > > The purpose of that patch was to set the state of the transitioning > > > path to failed in order to make sure IO is retried on a different > > > path. > > > Your patch would undermine this purpose. > > (Additional indentation added by me) Can you please use proper quoting? > You were mixing my statements and your own. > > > I agree this is what happens but those transitioning paths might be > > the only non-failed paths available. I don't think it is reasonable > > to > > fail them. 
> > This is the same as treating transitioning as standby or unavailable.
>
> Right, and according to the SPC spec (see above), that makes more sense than treating it as "active".
>
> Storage vendors seem to interpret "transitioning" very differently, both in terms of commands supported and in terms of time required to reach the target state. That makes it hard to deal with it correctly on the host side.
>
> > As you point out this happened with the commit you mention. Before this commit what I am doing does not result in an all paths down error, and similarly, it does not in earlier Linux versions or other OS's under the same condition. I see this as a regression.
>
> If you use a suitable "no_path_retry" setting in multipathd, you should be able to handle the situation you describe just fine by queueing the I/O until the transitioning paths are fully up. IIUC, on your server "transitioning" is a transient state that ends quickly, so queueing shouldn't be an issue. E.g. if you are certain that "transitioning" won't last longer than 10s, you could set "no_path_retry 2".
>
> Regards,
> Martin

I have tested using the no_path_retry and you are correct that it does work around the issue that I am seeing. The problem with that is there are times we want to convey all paths down to the initiator as quickly as possible, and doing this will delay that.

Thanks,
Brian

Brian Bunker
PURE Storage, Inc.
brian@purestorage.com

On Tue, Jul 13, 2021 at 5:32 PM Brian Bunker <brian@purestorage.com> wrote:
>
> On Tue, Jul 13, 2021 at 2:13 AM Martin Wilck <mwilck@suse.com> wrote:
> >
> > Hello Brian,
> >
> > On Mo, 2021-07-12 at 14:38 -0700, Brian Bunker wrote:
> > > Martin,
> > >
> > > > Please confirm that your kernel included ee8868c5c78f ("scsi: scsi_dh_alua: Retry RTPG on a different path after failure").
> > > > That commit should cause the RTPG to be retried on other map > > > > members > > > > which are not in failed state, thus avoiding this phenomenon. > > > > > > In my case, there are no other map members that are not in the failed > > > state. One set of paths goes to the ALUA unavailable state when the > > > primary fails, and the second set of paths moves to ALUA state > > > transitioning as the previous secondary becomes the primary. > > > > IMO this is the problem. How does your array respond to SCSI commands > > while ports are transitioning? > > > > SPC5 (§5.16.2.6) says that the server should either fail all commands > > with BUSY or CHECK CONDITION/NOT READY/LOGICAL UNIT NOT > > ACCESSIBLE/ASYMMETRIC ACCESS STATE TRANSITION (a), or should support > > all TMFs and a subset of SCSI commands, while responding with > > CC/NR/AAST to all other commands (b). SPC6 (§5.18.2.6) is no different. > > > > No matter how you read that paragraph, it's pretty clear that > > "transitioning" is generally not a healthy state to attempt I/O. > > > > Are you saying that on your server, the transitioning ports are able to > > process regular I/O commands like READ and WRITE? If that's the case, > > why do you pretend to be "transitioning" at all, rather than in an > > active state? If it's not the case, why would it make sense for the > > host to retry I/O on the transitioning path? > > In the ALUA transitioning state, we cannot process READ or WRITE and > will return with the sense data as you mentioned above. We expect > retries down that transitioning path until it transitions to another > ALUA state (at least for some reasonable period of time for the > transition). The spec defines the state as the transition between > target asymmetric states. The current implementation requires > coordination on the target not to return a state transition down all > paths at the same time or risk all paths being failed. 
Using the ALUA > transitioning state allows us to respond to initiator READ and WRITE > requests even if we can't serve them when our internal target state is > transitioning (secondary to primary). The alternative is to queue them > which presents a different set of problems. > > > > > > If the > > > paths are failed which are transitioning, an all paths down state > > > happens which is not expected. > > > > IMO it _is_ expected if in fact no path is able to process SCSI > > commands at the given point in time. > > In this case it would seem having all paths move to transitioning > would lead to all paths lost. It is possible to imagine > implementations where for a brief period of time all paths are in a > transitioning state. What would be the point of returning a transient > state if the result is a permanent failure? > > > > > > There should be a time for which > > > transitioning is a transient state until the next state is entered. > > > Failing a path assuming there would be non-failed paths seems wrong. > > > > This is a misunderstanding. The path isn't failed because of > > assumptions about other paths. It is failed because we know that it's > > non-functional, and thus we must try other paths, if there are any. > > > > Before 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA > > transitioning state"), I/O was indeed retried on transitioning paths, > > possibly forever. This posed a serious problem when a transitioning > > path was about to be removed (e.g. dev_loss_tmo expired). And I'm still > > quite convinced that it was wrong in general, because by all reasonable > > means a "transitioning" path isn't usable for the host. > > > > If we find a transitioning path, it might make sense to retry on other > > paths first and eventually switch back to the transitioning path, when > > all others have failed hard (e.g. "unavailable" state). However, this > > logic doesn't exist in the kernel. 
In theory, it could be mapped to a > > "transitioning" priority group in device-mapper multipath. But prio > > groups are managed in user space (multipathd), which treats > > transitioning paths as "almost failed" (priority 0) anyway. We can > > discuss enhancing multipathd such that it re-checks transitioning paths > > more frequently, in order to be able to reinstate them ASAP. > > > > According to what you said above, the "transitioning" ports in the > > problem situation ("second set") are those that were in "unavailable" > > state before, which means "failed" as far as device mapper is concerned > > - IOW, the paths in question would be unused anyway, until they got > > reinstated, which wouldn't happen before they are fully up. With this > > in mind, I have to say I don't understand why your proposed patch would > > help at all. Please explain. > > > > My proposed patch would not fail the paths in the case of > BLK_STS_AGAIN. This seems to result in the requests being retried > until the path transitions to either a failed state, standby or > unavailable, or an online state. > > > > > The purpose of that patch was to set the state of the transitioning > > > > path to failed in order to make sure IO is retried on a different > > > > path. > > > > Your patch would undermine this purpose. > > > > (Additional indentation added by me) Can you please use proper quoting? > > You were mixing my statements and your own. > > > > > I agree this is what happens but those transitioning paths might be > > > the only non-failed paths available. I don't think it is reasonable > > > to > > > fail them. This is the same as treating transitioning as standby or > > > unavailable. > > > > Right, and according to the SPC spec (see above), that makes more sense > > than treating it as "active". > > > > Storage vendors seem to interpret "transitioning" very differently, > > both in terms of commands supported and in terms of time required to > > reach the target state. 
That makes it hard to deal with it correctly on > > the host side. > > > > > As you point out this happened with the commit you > > > mention. Before this commit what I am doing does not result in an all > > > paths down error, and similarly, it does not in earlier Linux > > > versions > > > or other OS's under the same condition. I see this as a regression. > > > > If you use a suitable "no_path_retry" setting in multipathd, you should > > be able to handle the situation you describe just fine by queueing the > > I/O until the transitioning paths are fully up. IIUC, on your server > > "transitioning" is a transient state that ends quickly, so queueing > > shouldn't be an issue. E.g. if you are certain that "transitioning" > > won't last longer than 10s, you could set "no_path_retry 2". > > > > Regards, > > Martin > > > > > > > > I have tested using the no_path_retry and you are correct that it does > work around the issue that I am seeing. The problem with that is there > are times we want to convey all paths down to the initiator as quickly > as possible and doing this will delay that. > > Thanks, > Brian > > Brian Bunker > PURE Storage, Inc. > brian@purestorage.com -- Brian Bunker PURE Storage, Inc. brian@purestorage.com ^ permalink raw reply [flat|nested] 9+ messages in thread
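[Editorial note: the no_path_retry workaround discussed in this exchange can be written down as a multipath.conf fragment. This is an illustrative sketch only; it assumes Martin's numbers (a transition that never lasts much longer than 10 seconds and a 5-second checker polling_interval, so "no_path_retry 2" queues for roughly 2 x 5s = 10s before the map fails), and it borrows the vendor/product strings from the FlashArray map shown at the top of the thread.]

```
# Illustrative only: with polling_interval 5, "no_path_retry 2" keeps
# queueing I/O for about 2 * 5s = 10s after the last path fails,
# bridging a short all-paths-transitioning window.
defaults {
    polling_interval 5
}
devices {
    device {
        vendor        "PURE"
        product       "FlashArray"
        no_path_retry 2
    }
}
```

With this in place, an all-paths-down interval shorter than the queueing window is bridged by queueing rather than reported as an I/O error. The trade-off is exactly the one Brian points out: a genuine all-paths-down condition is reported to the initiator only after the same window expires.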
* Re: [PATCH 1/1]: scsi dm-mpath do not fail paths which are in ALUA state transitioning 2021-07-14 0:37 ` Brian Bunker @ 2021-07-14 8:38 ` Martin Wilck 2021-07-14 18:13 ` Brian Bunker 0 siblings, 1 reply; 9+ messages in thread From: Martin Wilck @ 2021-07-14 8:38 UTC (permalink / raw) To: Brian Bunker; +Cc: linux-scsi, Hannes Reinecke On Di, 2021-07-13 at 17:37 -0700, Brian Bunker wrote: > On Tue, Jul 13, 2021 at 2:13 AM Martin Wilck <mwilck@suse.com> wrote: > > Are you saying that on your server, the transitioning ports are able > > to > > process regular I/O commands like READ and WRITE? If that's the case, > > why do you pretend to be "transitioning" at all, rather than in an > > active state? If it's not the case, why would it make sense for the > > host to retry I/O on the transitioning path? > > In the ALUA transitioning state, we cannot process READ or WRITE and > will return with the sense data as you mentioned above. We expect > retries down that transitioning path until it transitions to another > ALUA state (at least for some reasonable period of time for the > transition). The spec defines the state as the transition between > target asymmetric states. The current implementation requires > coordination on the target not to return a state transition down all > paths at the same time or risk all paths being failed. Using the ALUA > transition state allows us to respond to initiator READ and WRITE > requests even if we can't serve them when our internal target state is > transitioning (secondary to primary). The alternative is to queue them > which presents a different set of problems. Indeed, it would be less prone to host-side errors if the "new" pathgroup went to a usable state before the "old" pathgroup got unavailable. Granted, this may be difficult to guarantee on the storage side. > > > If the > > > paths are failed which are transitioning, an all paths down state > > > happens which is not expected. 
> > IMO it _is_ expected if in fact no path is able to process SCSI commands at the given point in time.
>
> In this case it would seem having all paths move to transitioning would lead to all paths lost. It is possible to imagine implementations where for a brief period of time all paths are in a transitioning state. What would be the point of returning a transient state if the result is a permanent failure?

When a command fails with ALUA TRANSITIONING status, we must make sure that:

1) The command itself is not retried on the path at hand, neither on the SCSI layer nor on the blk-mq layer. The former was the case anyway, the latter is guaranteed by 0d88232010d5 ("scsi: core: Return BLK_STS_AGAIN for ALUA transitioning").

2) No new commands are sent down this path until it reaches a usable final state. This is achieved on the SCSI layer by alua_prep_fn(), with 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA transitioning state").

These two items would still be true with your patch applied. However, the problem is that if the path isn't failed, dm-multipath would continue sending I/O down this path. If the path isn't set to failed state, the path selector algorithm may or may not choose a different path next time. In the worst case, dm-multipath would busy-loop retrying the I/O on the same path. Please consider the following:

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 86518aec32b4..3f3a89fc2b3b 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1654,12 +1654,12 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
 	if (error && blk_path_error(error)) {
 		struct multipath *m = ti->private;
 
-		if (error == BLK_STS_RESOURCE)
+		if (error == BLK_STS_RESOURCE || error == BLK_STS_AGAIN)
 			r = DM_ENDIO_DELAY_REQUEUE;
 		else
 			r = DM_ENDIO_REQUEUE;
 
-		if (pgpath)
+		if (pgpath && (error != BLK_STS_AGAIN))
 			fail_path(pgpath);

This way we'd avoid busy-looping by delaying the retry.
This would cause I/O delay in the case where some healthy paths are still in the same dm-multipath priority group as the transitioning path. I suppose this is a minor problem, because in the default case for ALUA (group_by_prio in multipathd), all paths in the PG would have switched to "transitioning" simultaneously.

Note that multipathd assigns priority 0 (the same prio as "unavailable") if it happens to notice a "transitioning" path. This is something we may want to change eventually. In your specific case, it would cause the paths to be temporarily re-grouped, all paths being moved to the same non-functional PG. The way you describe it, for your storage at least, "transitioning" should be assigned a higher priority.

> > If you use a suitable "no_path_retry" setting in multipathd, you should be able to handle the situation you describe just fine by queueing the I/O until the transitioning paths are fully up. IIUC, on your server "transitioning" is a transient state that ends quickly, so queueing shouldn't be an issue. E.g. if you are certain that "transitioning" won't last longer than 10s, you could set "no_path_retry 2".
>
> I have tested using the no_path_retry and you are correct that it does work around the issue that I am seeing. The problem with that is there are times we want to convey all paths down to the initiator as quickly as possible and doing this will delay that.

Ok, that makes sense e.g. for cluster configurations. So, the purpose is to distinguish between two cases where no path can process SCSI commands: a) all paths are really down on the storage, and b) some paths are down and some are "coming up".

Thanks,
Martin
* Re: [PATCH 1/1]: scsi dm-mpath do not fail paths which are in ALUA state transitioning 2021-07-14 8:38 ` Martin Wilck @ 2021-07-14 18:13 ` Brian Bunker 2021-07-14 21:06 ` Martin Wilck 0 siblings, 1 reply; 9+ messages in thread From: Brian Bunker @ 2021-07-14 18:13 UTC (permalink / raw) To: Martin Wilck; +Cc: linux-scsi, Hannes Reinecke On Wed, Jul 14, 2021 at 1:39 AM Martin Wilck <mwilck@suse.com> wrote: > > On Di, 2021-07-13 at 17:37 -0700, Brian Bunker wrote: > > On Tue, Jul 13, 2021 at 2:13 AM Martin Wilck <mwilck@suse.com> wrote: > > > Are you saying that on your server, the transitioning ports are able > > > to > > > process regular I/O commands like READ and WRITE? If that's the case, > > > why do you pretend to be "transitioning" at all, rather than in an > > > active state? If it's not the case, why would it make sense for the > > > host to retry I/O on the transitioning path? > > > > In the ALUA transitioning state, we cannot process READ or WRITE and > > will return with the sense data as you mentioned above. We expect > > retries down that transitioning path until it transitions to another > > ALUA state (at least for some reasonable period of time for the > > transition). The spec defines the state as the transition between > > target asymmetric states. The current implementation requires > > coordination on the target not to return a state transition down all > > paths at the same time or risk all paths being failed. Using the ALUA > > transition state allows us to respond to initiator READ and WRITE > > requests even if we can't serve them when our internal target state is > > transitioning (secondary to primary). The alternative is to queue them > > which presents a different set of problems. > > Indeed, it would be less prone to host-side errors if the "new" > pathgroup went to a usable state before the "old" pathgroup got > unavailable. Granted, this may be difficult to guarantee on the storage > side. > For us, this is not easily doable. 
We are transitioning from a secondary to a primary, so for a brief time we have no paths which can serve the I/O. The transition looks like this (a bit oversimplified, but the general idea):

1. primary and secondary are active optimized
2. primary fails and secondary starts promoting
3. both previous primary and promoting secondary return transitioning
4. once the promoting primary gets far enough along, the previous primary returns unavailable
5. the promoting primary continues to return transitioning
6. the promoting primary is done and returns active optimized
7. the previous primary becomes secondary and returns active optimized

So there are windows when we can return AO for all paths, transitioning for all paths, or transitioning for half the paths and unavailable for the other half; even though these windows can be very small, they are possible.

> > > > If the paths are failed which are transitioning, an all paths down state happens which is not expected.
> > >
> > > IMO it _is_ expected if in fact no path is able to process SCSI commands at the given point in time.
> >
> > In this case it would seem having all paths move to transitioning would lead to all paths lost. It is possible to imagine implementations where for a brief period of time all paths are in a transitioning state. What would be the point of returning a transient state if the result is a permanent failure?
>
> When a command fails with ALUA TRANSITIONING status, we must make sure that:
>
> 1) The command itself is not retried on the path at hand, neither on the SCSI layer nor on the blk-mq layer. The former was the case anyway, the latter is guaranteed by 0d88232010d5 ("scsi: core: Return BLK_STS_AGAIN for ALUA transitioning").
>
> 2) No new commands are sent down this path until it reaches a usable final state.
This is achieved on the SCSI layer by alua_prep_fn(), with > 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA > transitioning state"). > > These two items would still be true with your patch applied. However, > the problem is that if the path isn't failed, dm-multipath would > continue sending I/O down this path. If the path isn't set to failed > state, the path selector algorithm may or may not choose a different > path next time. In the worst case, dm-multipath would busy-loop > retrying the I/O on the same path. Please consider the following: > > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c > index 86518aec32b4..3f3a89fc2b3b 100644 > --- a/drivers/md/dm-mpath.c > +++ b/drivers/md/dm-mpath.c > @@ -1654,12 +1654,12 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone, > if (error && blk_path_error(error)) { > struct multipath *m = ti->private; > > - if (error == BLK_STS_RESOURCE) > + if (error == BLK_STS_RESOURCE || error == BLK_STS_AGAIN) > r = DM_ENDIO_DELAY_REQUEUE; > else > r = DM_ENDIO_REQUEUE; > > - if (pgpath) > + if (pgpath && (error != BLK_STS_AGAIN)) > fail_path(pgpath); > > This way we'd avoid busy-looping by delaying the retry. This would > cause I/O delay in the case where some healthy paths are still in the > same dm-multipath priority group as the transitioning path. I suppose > this is a minor problem, because in the default case for ALUA > (group_by_prio in multipathd), all paths in the PG would have switched > to "transitioning" simultaneously. > > Note that multipathd assigns priority 0 (the same prio as > "unavailable") if it happens to notice a "transitioning" path. This is > something we may want to change eventually. In your specific case, it > would cause the paths to be temporarily re-grouped, all paths being > moved to the same non-functional PG. The way you describe it, for your > storage at least, "transitioning" should be assigned a higher priority. > > > Yes. 
I tried it with this change and it works. Should I re-submit this modified version, or do you want to do it? Either way works for me. I was flying a bit in the dark with my initial patch, since I just found the fail_path and made that not run, without much thought to what happens after that.

> > > If you use a suitable "no_path_retry" setting in multipathd, you should be able to handle the situation you describe just fine by queueing the I/O until the transitioning paths are fully up. IIUC, on your server "transitioning" is a transient state that ends quickly, so queueing shouldn't be an issue. E.g. if you are certain that "transitioning" won't last longer than 10s, you could set "no_path_retry 2".
> >
> > I have tested using the no_path_retry and you are correct that it does work around the issue that I am seeing. The problem with that is there are times we want to convey all paths down to the initiator as quickly as possible and doing this will delay that.
>
> Ok, that makes sense e.g. for cluster configurations. So, the purpose is to distinguish between two cases where no path can process SCSI commands: a) all paths are really down on the storage, and b) some paths are down and some are "coming up".
>
> Thanks,
> Martin

Thanks,
Brian
--
Brian Bunker
PURE Storage, Inc.
brian@purestorage.com
* Re: [PATCH 1/1]: scsi dm-mpath do not fail paths which are in ALUA state transitioning 2021-07-14 18:13 ` Brian Bunker @ 2021-07-14 21:06 ` Martin Wilck 0 siblings, 0 replies; 9+ messages in thread From: Martin Wilck @ 2021-07-14 21:06 UTC (permalink / raw) To: Brian Bunker; +Cc: linux-scsi, Hannes Reinecke On Mi, 2021-07-14 at 11:13 -0700, Brian Bunker wrote: > On Wed, Jul 14, 2021 at 1:39 AM Martin Wilck <mwilck@suse.com> wrote: > > > > > When a command fails with ALUA TRANSITIONING status, we must make > > sure > > that: > > > > 1) The command itself is not retried on the path at hand, neither > > on > > the SCSI layer nor on the blk-mq layer. The former was the case > > anyway, > > the latter is guaranteed by 0d88232010d5 ("scsi: core: Return > > BLK_STS_AGAIN for ALUA transitioning"). > > > > 2) No new commands are sent down this path until it reaches a > > usable > > final state. This is achieved on the SCSI layer by alua_prep_fn(), > > with > > 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA > > transitioning state"). > > > > These two items would still be true with your patch applied. > > However, > > the problem is that if the path isn't failed, dm-multipath would > > continue sending I/O down this path. If the path isn't set to > > failed > > state, the path selector algorithm may or may not choose a > > different > > path next time. In the worst case, dm-multipath would busy-loop > > retrying the I/O on the same path. 
> > Please consider the following:
> >
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index 86518aec32b4..3f3a89fc2b3b 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -1654,12 +1654,12 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
> >  	if (error && blk_path_error(error)) {
> >  		struct multipath *m = ti->private;
> >
> > -		if (error == BLK_STS_RESOURCE)
> > +		if (error == BLK_STS_RESOURCE || error == BLK_STS_AGAIN)
> >  			r = DM_ENDIO_DELAY_REQUEUE;
> >  		else
> >  			r = DM_ENDIO_REQUEUE;
> >
> > -		if (pgpath)
> > +		if (pgpath && (error != BLK_STS_AGAIN))
> >  			fail_path(pgpath);
> >
> > This way we'd avoid busy-looping by delaying the retry. This would cause I/O delay in the case where some healthy paths are still in the same dm-multipath priority group as the transitioning path. I suppose this is a minor problem, because in the default case for ALUA (group_by_prio in multipathd), all paths in the PG would have switched to "transitioning" simultaneously.
> >
> > Note that multipathd assigns priority 0 (the same prio as "unavailable") if it happens to notice a "transitioning" path. This is something we may want to change eventually. In your specific case, it would cause the paths to be temporarily re-grouped, all paths being moved to the same non-functional PG. The way you describe it, for your storage at least, "transitioning" should be assigned a higher priority.
>
> Yes. I tried it with this change and it works. Should I re-submit this modified version or do you want to do it?

Yes, please.

Regards,
Martin
Thread overview: 9+ messages (end of thread, newest: 2021-07-14 21:06 UTC)

2021-07-08 20:42 [PATCH 1/1]: scsi dm-mpath do not fail paths which are in ALUA state transitioning Brian Bunker
2021-07-12  9:19 ` Martin Wilck
2021-07-12 21:38 ` Brian Bunker
2021-07-13  9:13 ` Martin Wilck
2021-07-14  0:32 ` Brian Bunker
2021-07-14  0:37 ` Brian Bunker
2021-07-14  8:38 ` Martin Wilck
2021-07-14 18:13 ` Brian Bunker
2021-07-14 21:06 ` Martin Wilck