* [PATCH] zfcp: fix reaction on bit error theshold notification with adapter close
@ 2019-09-24 16:06 Steffen Maier
       [not found] ` <20190925224305.00183208C3@mail.kernel.org>
  2019-10-01  3:49 ` Martin K. Petersen
  0 siblings, 2 replies; 10+ messages in thread
From: Steffen Maier @ 2019-09-24 16:06 UTC (permalink / raw)
  To: James E . J . Bottomley, Martin K . Petersen
  Cc: linux-scsi, linux-s390, Benjamin Block, Heiko Carstens,
	Vasily Gorbik, Steffen Maier, stable

Kernel message explanation:

 * Description:
 * The FCP channel reported that its bit error threshold has been exceeded.
 * These errors might result from a problem with the physical components
 * of the local fibre link into the FCP channel.
 * The problem might be damage or malfunction of the cable or
 * cable connection between the FCP channel and
 * the adjacent fabric switch port or the point-to-point peer.
 * Find details about the errors in the HBA trace for the FCP device.
 * The zfcp device driver closed down the FCP device
 * to limit the performance impact from possible I/O command timeouts.
 * User action:
 * Check for problems on the local fibre link, ensure that fibre optics are
 * clean and functional, and all cables are properly plugged.
 * After the repair action, you can manually recover the FCP device by
 * writing "0" into its "failed" sysfs attribute.
 * If recovery through sysfs is not possible, set the CHPID of the device
 * offline and back online on the service element.

Signed-off-by: Steffen Maier <maier@linux.ibm.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: <stable@vger.kernel.org> #2.6.30+
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
---

Martin, James,

an important zfcp fix for v5.4-rc.
It applies to Martin's 5.4/scsi-fixes or to James' fixes branch.


 drivers/s390/scsi/zfcp_fsf.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/s390/scsi/zfcp_fsf.c b/drivers/s390/scsi/zfcp_fsf.c
index 296bbc3c4606..cf63916814cc 100644
--- a/drivers/s390/scsi/zfcp_fsf.c
+++ b/drivers/s390/scsi/zfcp_fsf.c
@@ -27,6 +27,11 @@
 
 struct kmem_cache *zfcp_fsf_qtcb_cache;
 
+static bool ber_stop = true;
+module_param(ber_stop, bool, 0600);
+MODULE_PARM_DESC(ber_stop,
+		 "Shuts down FCP devices for FCP channels that report a bit-error count in excess of its threshold (default on)");
+
 static void zfcp_fsf_request_timeout_handler(struct timer_list *t)
 {
 	struct zfcp_fsf_req *fsf_req = from_timer(fsf_req, t, timer);
@@ -236,10 +241,15 @@ static void zfcp_fsf_status_read_handler(struct zfcp_fsf_req *req)
 	case FSF_STATUS_READ_SENSE_DATA_AVAIL:
 		break;
 	case FSF_STATUS_READ_BIT_ERROR_THRESHOLD:
-		dev_warn(&adapter->ccw_device->dev,
-			 "The error threshold for checksum statistics "
-			 "has been exceeded\n");
 		zfcp_dbf_hba_bit_err("fssrh_3", req);
+		if (ber_stop) {
+			dev_warn(&adapter->ccw_device->dev,
+				 "All paths over this FCP device are disused because of excessive bit errors\n");
+			zfcp_erp_adapter_shutdown(adapter, 0, "fssrh_b");
+		} else {
+			dev_warn(&adapter->ccw_device->dev,
+				 "The error threshold for checksum statistics has been exceeded\n");
+		}
 		break;
 	case FSF_STATUS_READ_LINK_DOWN:
 		zfcp_fsf_status_read_link_down(req);
-- 
2.17.1
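
The branch added to zfcp_fsf_status_read_handler() above can be modelled in userspace as a small decision function. This is only an illustrative sketch, not driver code: the enum, its values, and the function name model_bit_error_threshold() are hypothetical. Only the two message strings and the ber_stop semantics are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two possible reactions; not kernel API. */
enum ber_action {
	BER_WARN_ONLY,		/* old behavior: kernel message and trace only */
	BER_SHUTDOWN_ADAPTER,	/* new default: also close the FCP device */
};

/*
 * Userspace model of the FSF_STATUS_READ_BIT_ERROR_THRESHOLD case.
 * In the patch the HBA trace record is written in both branches; only
 * the message text and the adapter shutdown depend on ber_stop.
 */
enum ber_action model_bit_error_threshold(bool ber_stop, const char **msg)
{
	assert(msg != NULL);

	if (ber_stop) {
		*msg = "All paths over this FCP device are disused because of excessive bit errors";
		return BER_SHUTDOWN_ADAPTER;
	}

	*msg = "The error threshold for checksum statistics has been exceeded";
	return BER_WARN_ONLY;
}
```

With ber_stop left at its default (true), the model returns BER_SHUTDOWN_ADAPTER, which corresponds to the zfcp_erp_adapter_shutdown() call in the patch. With ber_stop cleared, it returns BER_WARN_ONLY and only the previous warning text is produced.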



* Re: [PATCH] zfcp: fix reaction on bit error theshold notification with adapter close
       [not found] ` <20190925224305.00183208C3@mail.kernel.org>
@ 2019-09-26 11:00   ` Steffen Maier
  0 siblings, 0 replies; 10+ messages in thread
From: Steffen Maier @ 2019-09-26 11:00 UTC (permalink / raw)
  To: Sasha Levin, James E . J . Bottomley
  Cc: linux-scsi, linux-s390, stable, Benjamin Block

On 9/26/19 12:43 AM, Sasha Levin wrote:
> [This is an automated email]
> 
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: 1da177e4c3f4 Linux-2.6.12-rc2.
> 
> The bot has tested the following trees: v5.3.1, v5.2.17, v4.19.75, v4.14.146, v4.9.194, v4.4.194.
> 
> v5.3.1: Build OK!
> v5.2.17: Build OK!
> v4.19.75: Build OK!
> v4.14.146: Failed to apply! Possible dependencies:
>      75492a51568b ("s390/scsi: Convert timers to use timer_setup()")
> 
> v4.9.194: Failed to apply! Possible dependencies:
>      75492a51568b ("s390/scsi: Convert timers to use timer_setup()")
>      bc46427e807e ("scsi: zfcp: use setup_timer instead of init_timer")
> 
> v4.4.194: Failed to apply! Possible dependencies:
>      75492a51568b ("s390/scsi: Convert timers to use timer_setup()")
>      bc46427e807e ("scsi: zfcp: use setup_timer instead of init_timer")
> 
> 
> NOTE: The patch will not be queued to stable trees until it is upstream.
> 
> How should we proceed with this patch?

It's sufficient to have the fix in those more recent stable trees where it 
applies (and builds). My fixes tag formally indicates since when it was at 
least broken but I don't expect all stable or longterm kernels to get the fix. 
If I happen to find out we need the fix in a kernel where it does not apply, 
I'll send a backport to stable when the time is right.


Showing the possible dependencies is awesome!


-- 
Mit freundlichen Gruessen / Kind regards
Steffen Maier

Linux on IBM Z Development

https://www.ibm.com/privacy/us/en/
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Matthias Hartmann
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294



* Re: [PATCH] zfcp: fix reaction on bit error theshold notification with adapter close
  2019-09-24 16:06 [PATCH] zfcp: fix reaction on bit error theshold notification with adapter close Steffen Maier
       [not found] ` <20190925224305.00183208C3@mail.kernel.org>
@ 2019-10-01  3:49 ` Martin K. Petersen
  2019-10-01 10:49   ` [PATCH v2] " Steffen Maier
  1 sibling, 1 reply; 10+ messages in thread
From: Martin K. Petersen @ 2019-10-01  3:49 UTC (permalink / raw)
  To: Steffen Maier
  Cc: James E . J . Bottomley, Martin K . Petersen, linux-scsi,
	linux-s390, Benjamin Block, Heiko Carstens, Vasily Gorbik,
	stable


Steffen,

> Kernel message explanation:
>
>  * Description:
>  * The FCP channel reported that its bit error threshold has been exceeded.
>  * These errors might result from a problem with the physical components
>  * of the local fibre link into the FCP channel.
>  * The problem might be damage or malfunction of the cable or
>  * cable connection between the FCP channel and
>  * the adjacent fabric switch port or the point-to-point peer.
>  * Find details about the errors in the HBA trace for the FCP device.
>  * The zfcp device driver closed down the FCP device
>  * to limit the performance impact from possible I/O command timeouts.
>  * User action:
>  * Check for problems on the local fibre link, ensure that fibre optics are
>  * clean and functional, and all cables are properly plugged.
>  * After the repair action, you can manually recover the FCP device by
>  * writing "0" into its "failed" sysfs attribute.
>  * If recovery through sysfs is not possible, set the CHPID of the device
>  * offline and back online on the service element.

This commentary does not read like a patch description. It makes no
mention of the actual kernel changes and the introduced module
parameter.

> +static bool ber_stop = true;
> +module_param(ber_stop, bool, 0600);
> +MODULE_PARM_DESC(ber_stop,
> +		 "Shuts down FCP devices for FCP channels that report a bit-error count in excess of its threshold (default on)");
> +

-- 
Martin K. Petersen	Oracle Linux Engineering


* [PATCH v2] zfcp: fix reaction on bit error theshold notification with adapter close
  2019-10-01  3:49 ` Martin K. Petersen
@ 2019-10-01 10:49   ` Steffen Maier
  2019-10-01 14:14     ` Greg KH
  0 siblings, 1 reply; 10+ messages in thread
From: Steffen Maier @ 2019-10-01 10:49 UTC (permalink / raw)
  To: James E . J . Bottomley, Martin K . Petersen
  Cc: linux-scsi, linux-s390, Benjamin Block, Heiko Carstens,
	Vasily Gorbik, Steffen Maier, stable

On excessive bit errors for the FCP channel ingress fibre path, the channel
notifies us. Previously, we only emitted a kernel message and a trace record.
Since performance can become suboptimal with I/O timeouts due to
bit errors, we now stop using an FCP device by default on channel
notification so multipath on top can timely failover to other paths.
A new module parameter zfcp.ber_stop can be used to get the old zfcp behavior.

User explanation of new kernel message:
 * Description:
 * The FCP channel reported that its bit error threshold has been exceeded.
 * These errors might result from a problem with the physical components
 * of the local fibre link into the FCP channel.
 * The problem might be damage or malfunction of the cable or
 * cable connection between the FCP channel and
 * the adjacent fabric switch port or the point-to-point peer.
 * Find details about the errors in the HBA trace for the FCP device.
 * The zfcp device driver closed down the FCP device
 * to limit the performance impact from possible I/O command timeouts.
 * User action:
 * Check for problems on the local fibre link, ensure that fibre optics are
 * clean and functional, and all cables are properly plugged.
 * After the repair action, you can manually recover the FCP device by
 * writing "0" into its "failed" sysfs attribute.
 * If recovery through sysfs is not possible, set the CHPID of the device
 * offline and back online on the service element.

Signed-off-by: Steffen Maier <maier@linux.ibm.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: <stable@vger.kernel.org> #2.6.30+
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
---

Martin, James,

an important zfcp fix for v5.4-rc.
It applies to Martin's 5.4/scsi-fixes or to James' fixes branch.

Changes since v1:
* Martin's review comments: describe code change and new module parameter


 drivers/s390/scsi/zfcp_fsf.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/s390/scsi/zfcp_fsf.c b/drivers/s390/scsi/zfcp_fsf.c
index e31c6b47af97..1e279220f073 100644
--- a/drivers/s390/scsi/zfcp_fsf.c
+++ b/drivers/s390/scsi/zfcp_fsf.c
@@ -29,6 +29,11 @@
 
 struct kmem_cache *zfcp_fsf_qtcb_cache;
 
+static bool ber_stop = true;
+module_param(ber_stop, bool, 0600);
+MODULE_PARM_DESC(ber_stop,
+		 "Shuts down FCP devices for FCP channels that report a bit-error count in excess of its threshold (default on)");
+
 static void zfcp_fsf_request_timeout_handler(struct timer_list *t)
 {
 	struct zfcp_fsf_req *fsf_req = from_timer(fsf_req, t, timer);
@@ -238,10 +243,15 @@ static void zfcp_fsf_status_read_handler(struct zfcp_fsf_req *req)
 	case FSF_STATUS_READ_SENSE_DATA_AVAIL:
 		break;
 	case FSF_STATUS_READ_BIT_ERROR_THRESHOLD:
-		dev_warn(&adapter->ccw_device->dev,
-			 "The error threshold for checksum statistics "
-			 "has been exceeded\n");
 		zfcp_dbf_hba_bit_err("fssrh_3", req);
+		if (ber_stop) {
+			dev_warn(&adapter->ccw_device->dev,
+				 "All paths over this FCP device are disused because of excessive bit errors\n");
+			zfcp_erp_adapter_shutdown(adapter, 0, "fssrh_b");
+		} else {
+			dev_warn(&adapter->ccw_device->dev,
+				 "The error threshold for checksum statistics has been exceeded\n");
+		}
 		break;
 	case FSF_STATUS_READ_LINK_DOWN:
 		zfcp_fsf_status_read_link_down(req);
-- 
2.17.1
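
Both user-facing knobs mentioned in this patch are plain sysfs writes: the runtime switch for ber_stop and the manual recovery through the "failed" attribute. The following is a minimal userspace sketch under stated assumptions. The helper names write_attr(), zfcp_disable_ber_stop() and zfcp_recover_fcp_device() are hypothetical. The module parameter path follows the generic /sys/module/<module>/parameters/<name> layout; the exact location of the "failed" attribute is not given in this thread, so the caller has to supply it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a short string to a sysfs-style attribute file.
 * Returns 0 on success or a negative errno value on failure.
 */
int write_attr(const char *path, const char *value)
{
	size_t len;
	ssize_t n;
	int fd, err;

	assert(path != NULL && value != NULL);

	len = strlen(value);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, value, len);
	err = errno;
	close(fd);

	if (n < 0)
		return -err;
	return (size_t)n == len ? 0 : -EIO;
}

/* The parameter is declared with mode 0600, so only root can write it. */
#define ZFCP_BER_STOP_PARAM "/sys/module/zfcp/parameters/ber_stop"

/* Switch back to the old warn-only behavior at runtime. */
int zfcp_disable_ber_stop(void)
{
	return write_attr(ZFCP_BER_STOP_PARAM, "0");
}

/*
 * Recover a closed FCP device after the link has been repaired.
 * failed_attr_path is an assumption supplied by the caller, i.e. the
 * path of the "failed" attribute of the affected FCP device.
 */
int zfcp_recover_fcp_device(const char *failed_attr_path)
{
	return write_attr(failed_attr_path, "0");
}
```

Because ber_stop is a module-wide parameter, zfcp_disable_ber_stop() affects all FCP devices handled by the module, whereas the "failed" write acts on a single FCP device.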



* Re: [PATCH v2] zfcp: fix reaction on bit error theshold notification with adapter close
  2019-10-01 10:49   ` [PATCH v2] " Steffen Maier
@ 2019-10-01 14:14     ` Greg KH
  2019-10-01 15:07       ` Steffen Maier
  0 siblings, 1 reply; 10+ messages in thread
From: Greg KH @ 2019-10-01 14:14 UTC (permalink / raw)
  To: Steffen Maier
  Cc: James E . J . Bottomley, Martin K . Petersen, linux-scsi,
	linux-s390, Benjamin Block, Heiko Carstens, Vasily Gorbik,
	stable

On Tue, Oct 01, 2019 at 12:49:49PM +0200, Steffen Maier wrote:
> On excessive bit errors for the FCP channel ingress fibre path, the channel
> notifies us. Previously, we only emitted a kernel message and a trace record.
> Since performance can become suboptimal with I/O timeouts due to
> bit errors, we now stop using an FCP device by default on channel
> notification so multipath on top can timely failover to other paths.
> A new module parameter zfcp.ber_stop can be used to get zfcp old behavior.

Ugh, module parameters?  This isn't the 1990's anymore :(

Why not just make this a dynamic sysfs variable, that way you properly
can set this on whatever device you want, not just "all or nothing"?

thanks,

greg k-h


* Re: [PATCH v2] zfcp: fix reaction on bit error theshold notification with adapter close
  2019-10-01 14:14     ` Greg KH
@ 2019-10-01 15:07       ` Steffen Maier
  2019-10-01 15:42         ` Greg KH
  0 siblings, 1 reply; 10+ messages in thread
From: Steffen Maier @ 2019-10-01 15:07 UTC (permalink / raw)
  To: Greg KH
  Cc: James E . J . Bottomley, Martin K . Petersen, linux-scsi,
	linux-s390, Benjamin Block, Heiko Carstens, Vasily Gorbik,
	stable

On 10/1/19 4:14 PM, Greg KH wrote:
> On Tue, Oct 01, 2019 at 12:49:49PM +0200, Steffen Maier wrote:
>> On excessive bit errors for the FCP channel ingress fibre path, the channel
>> notifies us. Previously, we only emitted a kernel message and a trace record.
>> Since performance can become suboptimal with I/O timeouts due to
>> bit errors, we now stop using an FCP device by default on channel
>> notification so multipath on top can timely failover to other paths.
>> A new module parameter zfcp.ber_stop can be used to get zfcp old behavior.
> 
> Ugh, module parameters?  This isn't the 1990's anymore :(
> 
> Why not just make this a dynamic sysfs variable, that way you properly
> can set this on whatever device you want, not just "all or nothing"?

Since we can see many more (virtual) FCP devices than we want to actually use, 
we defer probing. It means, we only start allocating structures and sysfs 
entries on setting an FCP "online" for the first time. Setting online works 
through another sysfs attribute owned by our ccw bus code component called 
"cio". IIRC, setting online does not emit a uevent. On setting online, the 
(add) uevent of hot-/coldplug of an FCP device had already happened, so we 
could not easily have end users craft udev rules to automatically/persistently 
configure a new sysfs attribute (which is FCP-device-specific and appears late) 
to disable the new code behavior.

Not sure if that could ever become a problem for end users: Even if we were to 
write into a new sysfs attribute, the attribute only appears during setting 
online so this might race with starting to actually use the FCP device with the 
new default behavior and could potentially disable I/O paths before the sysfs 
attribute write could become effective to disable the new behavior.


-- 
Mit freundlichen Gruessen / Kind regards
Steffen Maier


* Re: [PATCH v2] zfcp: fix reaction on bit error theshold notification with adapter close
  2019-10-01 15:07       ` Steffen Maier
@ 2019-10-01 15:42         ` Greg KH
  2019-10-01 18:26           ` Martin K. Petersen
  0 siblings, 1 reply; 10+ messages in thread
From: Greg KH @ 2019-10-01 15:42 UTC (permalink / raw)
  To: Steffen Maier
  Cc: James E . J . Bottomley, Martin K . Petersen, linux-scsi,
	linux-s390, Benjamin Block, Heiko Carstens, Vasily Gorbik,
	stable

On Tue, Oct 01, 2019 at 05:07:50PM +0200, Steffen Maier wrote:
> On 10/1/19 4:14 PM, Greg KH wrote:
> > On Tue, Oct 01, 2019 at 12:49:49PM +0200, Steffen Maier wrote:
> > > On excessive bit errors for the FCP channel ingress fibre path, the channel
> > > notifies us. Previously, we only emitted a kernel message and a trace record.
> > > Since performance can become suboptimal with I/O timeouts due to
> > > bit errors, we now stop using an FCP device by default on channel
> > > notification so multipath on top can timely failover to other paths.
> > > A new module parameter zfcp.ber_stop can be used to get zfcp old behavior.
> > 
> > Ugh, module parameters?  This isn't the 1990's anymore :(
> > 
> > Why not just make this a dynamic sysfs variable, that way you properly
> > can set this on whatever device you want, not just "all or nothing"?
> 
> Since we can see many more (virtual) FCP devices than we want to actually
> use, we defer probing. It means, we only start allocating structures and
> sysfs entries on setting an FCP "online" for the first time. Setting online
> works through another sysfs attribute owned by our ccw bus code component
> called "cio". IIRC, setting online does not emit a uevent. On setting
> online, the (add) uevent of hot-/coldplug of an FCP device had already
> happened, so we could not easily have end users craft udev rules to
> automatically/persistently configure a new sysfs attribute (which is
> FCP-device-specific and appears late) to disable the new code behavior.
> 
> Not sure if that could ever become a problem for end users: Even if we were
> to write into a new sysfs attribute, the attribute only appears during
> setting online so this might race with starting to actually use the FCP
> device with the new default behavior and could potentially disable I/O paths
> before the sysfs attribute write could become effective to disable the new
> behavior.

Ok, then why make this a module option that you will have to support for
the next 20+ years anyway if you feel this fix is the correct way that
it should be done instead?

module options are tough to manage and support, only add them as a very
last thing, when all other options have been ruled out.

thanks,

greg k-h


* Re: [PATCH v2] zfcp: fix reaction on bit error theshold notification with adapter close
  2019-10-01 15:42         ` Greg KH
@ 2019-10-01 18:26           ` Martin K. Petersen
  2019-10-02  8:31             ` Steffen Maier
  0 siblings, 1 reply; 10+ messages in thread
From: Martin K. Petersen @ 2019-10-01 18:26 UTC (permalink / raw)
  To: Greg KH
  Cc: Steffen Maier, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, linux-s390, Benjamin Block, Heiko Carstens,
	Vasily Gorbik, stable


Greg,

> Ok, then why make this a module option that you will have to support
> for the next 20+ years anyway if you feel this fix is the correct way
> that it should be done instead?

I agree.

Why not just shut FCP down unconditionally on excessive bit errors?
What's the benefit of allowing things to continue? Are you hoping things
will eventually recover in a single-path scenario?

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH v2] zfcp: fix reaction on bit error theshold notification with adapter close
  2019-10-01 18:26           ` Martin K. Petersen
@ 2019-10-02  8:31             ` Steffen Maier
  2019-10-04  1:47               ` Martin K. Petersen
  0 siblings, 1 reply; 10+ messages in thread
From: Steffen Maier @ 2019-10-02  8:31 UTC (permalink / raw)
  To: Martin K. Petersen, Greg KH
  Cc: James E . J . Bottomley, linux-scsi, linux-s390, Benjamin Block,
	Heiko Carstens, Vasily Gorbik, stable

On 10/1/19 8:26 PM, Martin K. Petersen wrote:
>> Ok, then why make this a module option that you will have to support
>> for the next 20+ years anyway if you feel this fix is the correct way
>> that it should be done instead?
> 
> I agree.
> 
> Why not just shut FCP down unconditionally on excessive bit errors?
> What's the benefit of allowing things to continue? Are you hoping things
> will eventually recover in a single-path scenario?

Experience told me that there will be an unforeseen end user scenario where I 
need a quick switch to let even shaky paths survive.


-- 
Mit freundlichen Gruessen / Kind regards
Steffen Maier


* Re: [PATCH v2] zfcp: fix reaction on bit error theshold notification with adapter close
  2019-10-02  8:31             ` Steffen Maier
@ 2019-10-04  1:47               ` Martin K. Petersen
  0 siblings, 0 replies; 10+ messages in thread
From: Martin K. Petersen @ 2019-10-04  1:47 UTC (permalink / raw)
  To: Steffen Maier
  Cc: Martin K. Petersen, Greg KH, James E . J . Bottomley, linux-scsi,
	linux-s390, Benjamin Block, Heiko Carstens, Vasily Gorbik,
	stable


Steffen,

>> Why not just shut FCP down unconditionally on excessive bit errors?
>> What's the benefit of allowing things to continue? Are you hoping things
>> will eventually recover in a single-path scenario?
>
> Experience told me that there will be an unforeseen end user scenario
> where I need a quick switch to let even shaky paths survive.

Can't say I like it. But it's your driver.

Applied to 5.4/scsi-fixes. Thanks!

-- 
Martin K. Petersen	Oracle Linux Engineering

