[PATCH] scsi: smartpqi_init: Reporting 'logical unit failure'

linux-kernel.vger.kernel.org archive mirror

* [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure'
@ 2019-02-27 16:31 Erwan Velu
  2019-02-28 13:09 ` Erwan Velu
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Erwan Velu @ 2019-02-27 16:31 UTC (permalink / raw)
  Cc: Erwan Velu, Don Brace, James E.J. Bottomley, Martin K. Petersen,
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list

When the HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume is taken offline.
The kernel log does not report why the device was offlined.
This makes it difficult for admins to determine _why_ the volume went offline.
Reading this part of the code makes it clear that the driver received a HARDWARE_ERROR/0x3e/0x1, which is a 'logical unit failure'.

This patch simply reports that fact, to help admins relate this event to the offlining.

Signed-off-by: Erwan Velu <e.velu@criteo.com>
---
 drivers/scsi/smartpqi/smartpqi_init.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index f564af8949e8..89f37d76735c 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -2764,6 +2764,12 @@ static void pqi_process_raid_io_error(struct pqi_io_request *io_request)
 				sshdr.sense_key == HARDWARE_ERROR &&
 				sshdr.asc == 0x3e &&
 				sshdr.ascq == 0x1) {
+			struct pqi_ctrl_info *ctrl_info = shost_to_hba(scmd->device->host);
+			struct pqi_scsi_dev *device = scmd->device->hostdata;
+
+			dev_err(&ctrl_info->pci_dev->dev, "received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n",
+							ctrl_info->scsi_host->host_no, device->bus,
+							device->target, device->lun);
 			pqi_take_device_offline(scmd->device, "RAID");
 			host_byte = DID_NO_CONNECT;
 		}
-- 
2.20.1



* Re: [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure'
  2019-02-27 16:31 [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure' Erwan Velu
@ 2019-02-28 13:09 ` Erwan Velu
  2019-02-28 20:03 ` Elliott, Robert (Persistent Memory)
  2019-03-01 14:58 ` [PATCH v2] " Erwan Velu
  2 siblings, 0 replies; 14+ messages in thread
From: Erwan Velu @ 2019-02-28 13:09 UTC (permalink / raw)
  To: Erwan Velu
  Cc: Don Brace, James E.J. Bottomley, Martin K. Petersen,
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list

Hey,

That makes me wonder why 0x3e/0x2 isn't handled here, i.e.:

3E/02 DZTPROMAEBKVF TIMEOUT ON LOGICAL UNIT

Is it possible for the controller to send this kind of message to the
kernel? If so, shouldn't we handle it here?

Erwan,

On 27/02/2019 at 17:31, Erwan Velu wrote:

> When the HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume is taken offline.
> The kernel log does not report why the device was offlined.
> This makes it difficult for admins to determine _why_ the volume went offline.
> Reading this part of the code makes it clear that the driver received a HARDWARE_ERROR/0x3e/0x1, which is a 'logical unit failure'.
>
> This patch simply reports that fact, to help admins relate this event to the offlining.
>
> Signed-off-by: Erwan Velu <e.velu@criteo.com>
> ---
>   drivers/scsi/smartpqi/smartpqi_init.c | 5 +++++
>   1 file changed, 5 insertions(+)
>
> diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
> index f564af8949e8..89f37d76735c 100644
> --- a/drivers/scsi/smartpqi/smartpqi_init.c
> +++ b/drivers/scsi/smartpqi/smartpqi_init.c
> @@ -2764,6 +2764,12 @@ static void pqi_process_raid_io_error(struct pqi_io_request *io_request)
>   				sshdr.sense_key == HARDWARE_ERROR &&
>   				sshdr.asc == 0x3e &&
>   				sshdr.ascq == 0x1) {
> +			struct pqi_ctrl_info *ctrl_info = shost_to_hba(scmd->device->host);
> +			struct pqi_scsi_dev *device = scmd->device->hostdata;
> +
> +			dev_err(&ctrl_info->pci_dev->dev, "received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n",
> +							ctrl_info->scsi_host->host_no, device->bus,
> +							device->target, device->lun);
>   			pqi_take_device_offline(scmd->device, "RAID");
>   			host_byte = DID_NO_CONNECT;
>   		}


* RE: [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure'
  2019-02-27 16:31 [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure' Erwan Velu
  2019-02-28 13:09 ` Erwan Velu
@ 2019-02-28 20:03 ` Elliott, Robert (Persistent Memory)
  2019-03-01 14:59   ` Erwan Velu
  2019-03-01 14:58 ` [PATCH v2] " Erwan Velu
  2 siblings, 1 reply; 14+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2019-02-28 20:03 UTC (permalink / raw)
  To: Erwan Velu
  Cc: Erwan Velu, Don Brace, James E.J. Bottomley, Martin K. Petersen,
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list



> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of
> Erwan Velu
> Sent: Wednesday, February 27, 2019 10:32 AM
> Subject: [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure'
> 
> When the HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume is taken offline.
> The kernel log does not report why the device was offlined.
> This makes it difficult for admins to determine _why_ the volume went offline.
> Reading this part of the code makes it clear that the driver received a HARDWARE_ERROR/0x3e/0x1,
> which is a 'logical unit failure'.
> 
> This patch simply reports that fact, to help admins relate this event to the offlining.
> 
> Signed-off-by: Erwan Velu <e.velu@criteo.com>
> ---
>  drivers/scsi/smartpqi/smartpqi_init.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
> index f564af8949e8..89f37d76735c 100644
> --- a/drivers/scsi/smartpqi/smartpqi_init.c
> +++ b/drivers/scsi/smartpqi/smartpqi_init.c
> @@ -2764,6 +2764,12 @@ static void pqi_process_raid_io_error(struct pqi_io_request *io_request)
>  				sshdr.sense_key == HARDWARE_ERROR &&
>  				sshdr.asc == 0x3e &&
>  				sshdr.ascq == 0x1) {
> +			struct pqi_ctrl_info *ctrl_info = shost_to_hba(scmd->device->host);
> +			struct pqi_scsi_dev *device = scmd->device->hostdata;
> +
> +			dev_err(&ctrl_info->pci_dev->dev, "received 'logical unit failure' from controller
> for scsi %d:%d:%d:%d\n",
> +							ctrl_info->scsi_host->host_no, device->bus,
> +							device->target, device->lun);
>  			pqi_take_device_offline(scmd->device, "RAID");
>  			host_byte = DID_NO_CONNECT;
>  		}

Be careful printing errors per-IO; you could get thousands of them if things go bad.
The block layer print_req_error() uses printk_ratelimited(KERN_ERR) for that reason,
and the SCSI layer scsi_io_completion_action() maintains a ratelimit on its own.

The dev_err_ratelimited() macro might be a good fit here.


---
Robert Elliott, HPE Persistent Memory
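
Illustration (not part of the original message): the per-I/O flood concern
raised above can be sketched in user-space C. This is a simplified model of
interval/burst rate limiting, assuming a 5-second window and a burst of 10
messages. The names ratelimit_model, ratelimit_model_allow and
simulate_error_storm are hypothetical, and the kernel's own
dev_err_ratelimited() / printk_ratelimit() machinery differs in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Simplified user-space model of interval/burst rate limiting.
 * All names are hypothetical; this is NOT the kernel implementation.
 */
struct ratelimit_model {
	long interval_s;     /* window length, in seconds (assumed: 5) */
	int burst;           /* messages allowed per window (assumed: 10) */
	long window_start_s; /* start of the current window */
	int printed;         /* messages let through in this window */
	int missed;          /* messages suppressed so far */
};

/* Return true if a message may be printed at time now_s. */
static bool ratelimit_model_allow(struct ratelimit_model *rl, long now_s)
{
	assert(rl != NULL);

	/* Open a new window once the current one has expired. */
	if (now_s - rl->window_start_s >= rl->interval_s) {
		rl->window_start_s = now_s;
		rl->printed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->missed++;
	return false;
}

/*
 * Simulate `count` failing I/Os that all complete at time now_s and
 * return how many log lines were actually emitted.
 */
static int simulate_error_storm(struct ratelimit_model *rl, long now_s,
				int count)
{
	int logged = 0;

	for (int i = 0; i < count; i++) {
		if (ratelimit_model_allow(rl, now_s)) {
			fprintf(stderr,
				"received 'logical unit failure' (I/O %d)\n", i);
			logged++;
		}
	}
	return logged;
}
```

Under these assumed settings, 1000 failing I/Os completing within one window
yield 10 log lines instead of 1000; the remainder are only counted as missed.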




* [PATCH v2] scsi: smartpqi_init: Reporting 'logical unit failure'
  2019-02-27 16:31 [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure' Erwan Velu
  2019-02-28 13:09 ` Erwan Velu
  2019-02-28 20:03 ` Elliott, Robert (Persistent Memory)
@ 2019-03-01 14:58 ` Erwan Velu
  2019-03-01 15:26   ` James Bottomley
  2 siblings, 1 reply; 14+ messages in thread
From: Erwan Velu @ 2019-03-01 14:58 UTC (permalink / raw)
  To: elliott
  Cc: Erwan Velu, Don Brace, James E.J. Bottomley, Martin K. Petersen,
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list

When the HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume is taken offline.
The kernel log does not report why the device was offlined.
This makes it difficult for admins to determine the root cause of the issue they are analyzing.

Reading this part of the code makes it clear that the driver received a HARDWARE_ERROR/0x3e/0x1, which is a 'logical unit failure'.
This patch reports the reason behind the offlining to ease the analysis.

Signed-off-by: Erwan Velu <e.velu@criteo.com>
---
 drivers/scsi/smartpqi/smartpqi_init.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index f564af8949e8..dfc4a6813440 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -2764,6 +2764,12 @@ static void pqi_process_raid_io_error(struct pqi_io_request *io_request)
 				sshdr.sense_key == HARDWARE_ERROR &&
 				sshdr.asc == 0x3e &&
 				sshdr.ascq == 0x1) {
+			struct pqi_ctrl_info *ctrl_info = shost_to_hba(scmd->device->host);
+			struct pqi_scsi_dev *device = scmd->device->hostdata;
+
+			dev_err_ratelimited(&ctrl_info->pci_dev->dev, "received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n",
+							ctrl_info->scsi_host->host_no, device->bus,
+							device->target, device->lun);
 			pqi_take_device_offline(scmd->device, "RAID");
 			host_byte = DID_NO_CONNECT;
 		}
-- 
2.20.1



* Re: [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure'
  2019-02-28 20:03 ` Elliott, Robert (Persistent Memory)
@ 2019-03-01 14:59   ` Erwan Velu
  0 siblings, 0 replies; 14+ messages in thread
From: Erwan Velu @ 2019-03-01 14:59 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory), Erwan Velu
  Cc: Don Brace, James E.J. Bottomley, Martin K. Petersen,
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list

[...]
> Be careful printing errors per-IO; you could get thousands of them if things go bad.
> The block layer print_req_error() uses printk_ratelimited(KERN_ERR) for that reason,
> and the SCSI layer scsi_io_completion_action() maintains a ratelimit on its own.
>
> The dev_err_ratelimited() macro might be a good fit here.

Thanks for the tip. I updated the patch and sent a v2 with that change.

I also adjusted the commit message to make it more explicit.

Thanks for the review.

Erwan,



* Re: [PATCH v2] scsi: smartpqi_init: Reporting 'logical unit failure'
  2019-03-01 14:58 ` [PATCH v2] " Erwan Velu
@ 2019-03-01 15:26   ` James Bottomley
  2019-03-01 15:43     ` Erwan Velu
  0 siblings, 1 reply; 14+ messages in thread
From: James Bottomley @ 2019-03-01 15:26 UTC (permalink / raw)
  To: Erwan Velu, elliott
  Cc: Erwan Velu, Don Brace, Martin K. Petersen,
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list

On Fri, 2019-03-01 at 15:58 +0100, Erwan Velu wrote:
> +			dev_err_ratelimited(&ctrl_info->pci_dev->dev, "received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n",
> +							ctrl_info->scsi_host->host_no, device->bus,
> +							device->target, device->lun);

Shouldn't this be a variant of sdev/scmd_printk?  Otherwise it tells
you which disk, in array terms, is the problem but not which device in
your actual system is affected.

James



* Re: [PATCH v2] scsi: smartpqi_init: Reporting 'logical unit failure'
  2019-03-01 15:26   ` James Bottomley
@ 2019-03-01 15:43     ` Erwan Velu
  2019-03-01 15:56       ` James Bottomley
  0 siblings, 1 reply; 14+ messages in thread
From: Erwan Velu @ 2019-03-01 15:43 UTC (permalink / raw)
  To: James Bottomley, Erwan Velu, elliott
  Cc: Don Brace, Martin K. Petersen,
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list

On 01/03/2019 at 16:26, James Bottomley wrote:
> [...]
> Shouldn't this be a variant of sdev/scmd_printk?  Otherwise it tells
> you which disk, in array terms, is the problem but not which device in
> your actual system is affected.

Hey James,

My initial take was that pqi_take_device_offline(), which is called
just after, will print the "re-scanning " message with the same
format.

As both will be printed in the same error context, one after the
other, I thought it would make sense to present the same information,
to ease reading as cause -> consequence.

As the message is about the LUN itself, which is reported faulty, I
thought it would be worth reporting the info that way.

Should I also print the disk name?

Erwan,


* Re: [PATCH v2] scsi: smartpqi_init: Reporting 'logical unit failure'
  2019-03-01 15:43     ` Erwan Velu
@ 2019-03-01 15:56       ` James Bottomley
  2019-03-01 16:00         ` Erwan Velu
  2019-03-01 16:08         ` [PATCH v3] " Erwan Velu
  0 siblings, 2 replies; 14+ messages in thread
From: James Bottomley @ 2019-03-01 15:56 UTC (permalink / raw)
  To: Erwan Velu, Erwan Velu, elliott
  Cc: Don Brace, Martin K. Petersen,
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list

On Fri, 2019-03-01 at 15:43 +0000, Erwan Velu wrote:
> On 01/03/2019 at 16:26, James Bottomley wrote:
> > [...]
> > Shouldn't this be a variant of sdev/scmd_printk?  Otherwise it tells
> > you which disk, in array terms, is the problem but not which device
> > in your actual system is affected.
> 
> Hey James,
> 
> My initial take was that pqi_take_device_offline(), which is called
> just after, will print the "re-scanning " message with the same
> format.
> 
> As both will be printed in the same error context, one after the
> other, I thought it would make sense to present the same information,
> to ease reading as cause -> consequence.
> 
> As the message is about the LUN itself, which is reported faulty, I
> thought it would be worth reporting the info that way.
> 
> Should I also print the disk name?

I was thinking just

if (printk_ratelimit())
	scmd_printk(KERN_ERR, scmd, "received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n", ...

That will give all the necessary information

James



* Re: [PATCH v2] scsi: smartpqi_init: Reporting 'logical unit failure'
  2019-03-01 15:56       ` James Bottomley
@ 2019-03-01 16:00         ` Erwan Velu
  2019-03-01 16:08         ` [PATCH v3] " Erwan Velu
  1 sibling, 0 replies; 14+ messages in thread
From: Erwan Velu @ 2019-03-01 16:00 UTC (permalink / raw)
  To: James Bottomley, Erwan Velu, elliott
  Cc: Don Brace, Martin K. Petersen,
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list


On 01/03/2019 at 16:56, James Bottomley wrote:
> [...]
> I was thinking just
>
> if (printk_ratelimit())
> 	scmd_printk(KERN_ERR, scmd, "received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n", ...
>
> That will give all the necessary information

I'm pretty new to this area, learning from you  ;o)

I'll update the v3 this way.

Thanks for the review.

Erwan,



* [PATCH v3] scsi: smartpqi_init: Reporting 'logical unit failure'
  2019-03-01 15:56       ` James Bottomley
  2019-03-01 16:00         ` Erwan Velu
@ 2019-03-01 16:08         ` Erwan Velu
  2019-03-05 22:30           ` Don.Brace
  2019-03-06 17:34           ` Martin K. Petersen
  1 sibling, 2 replies; 14+ messages in thread
From: Erwan Velu @ 2019-03-01 16:08 UTC (permalink / raw)
  Cc: Erwan Velu, Don Brace, James E.J. Bottomley, Martin K. Petersen,
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi),
	open list

When the HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume is taken offline.
The kernel log does not report why the device was offlined.
This makes it difficult for admins to determine the root cause of the issue they are analyzing.

Reading this part of the code makes it clear that the driver received a HARDWARE_ERROR/0x3e/0x1, which is a 'logical unit failure'.
This patch reports the reason behind the offlining to ease the analysis.

Signed-off-by: Erwan Velu <e.velu@criteo.com>
---
 drivers/scsi/smartpqi/smartpqi_init.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index f564af8949e8..adebafe56b5b 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -2764,6 +2764,12 @@ static void pqi_process_raid_io_error(struct pqi_io_request *io_request)
 				sshdr.sense_key == HARDWARE_ERROR &&
 				sshdr.asc == 0x3e &&
 				sshdr.ascq == 0x1) {
+			struct pqi_ctrl_info *ctrl_info = shost_to_hba(scmd->device->host);
+			struct pqi_scsi_dev *device = scmd->device->hostdata;
+
+			if (printk_ratelimit())
+				scmd_printk(KERN_ERR, scmd, "received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n",
+					ctrl_info->scsi_host->host_no, device->bus, device->target, device->lun);
 			pqi_take_device_offline(scmd->device, "RAID");
 			host_byte = DID_NO_CONNECT;
 		}
-- 
2.20.1



* RE: [PATCH v3] scsi: smartpqi_init: Reporting 'logical unit failure'
  2019-03-01 16:08         ` [PATCH v3] " Erwan Velu
@ 2019-03-05 22:30           ` Don.Brace
  2019-03-06 17:34           ` Martin K. Petersen
  1 sibling, 0 replies; 14+ messages in thread
From: Don.Brace @ 2019-03-05 22:30 UTC (permalink / raw)
  To: erwanaliasr1
  Cc: e.velu, don.brace, jejb, martin.petersen, esc.storagedev,
	linux-scsi, linux-kernel

-----Original Message-----
From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-owner@vger.kernel.org] On Behalf Of Erwan Velu
Sent: Friday, March 1, 2019 10:08 AM
Cc: Erwan Velu <e.velu@criteo.com>; Don Brace <don.brace@microsemi.com>; James E.J. Bottomley <jejb@linux.ibm.com>; Martin K. Petersen <martin.petersen@oracle.com>; open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi) <esc.storagedev@microsemi.com>; open list:MICROSEMI SMART ARRAY SMARTPQI DRIVER (smartpqi) <linux-scsi@vger.kernel.org>; open list <linux-kernel@vger.kernel.org>
Subject: [PATCH v3] scsi: smartpqi_init: Reporting 'logical unit failure'

When the HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume is taken offline.
The kernel log does not report why the device was offlined.
This makes it difficult for admins to determine the root cause of the issue they are analyzing.

Reading this part of the code makes it clear that the driver received a HARDWARE_ERROR/0x3e/0x1, which is a 'logical unit failure'.
This patch reports the reason behind the offlining to ease the analysis.

Signed-off-by: Erwan Velu <e.velu@criteo.com>

Acked-by: Don Brace <don.brace@microsemi.com>

---
 drivers/scsi/smartpqi/smartpqi_init.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index f564af8949e8..adebafe56b5b 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -2764,6 +2764,12 @@ static void pqi_process_raid_io_error(struct pqi_io_request *io_request)
 				sshdr.sense_key == HARDWARE_ERROR &&
 				sshdr.asc == 0x3e &&
 				sshdr.ascq == 0x1) {
+			struct pqi_ctrl_info *ctrl_info = shost_to_hba(scmd->device->host);
+			struct pqi_scsi_dev *device = scmd->device->hostdata;
+
+			if (printk_ratelimit())
+				scmd_printk(KERN_ERR, scmd, "received 'logical unit failure' from controller for scsi %d:%d:%d:%d\n",
+					ctrl_info->scsi_host->host_no, device->bus, device->target, device->lun);
 			pqi_take_device_offline(scmd->device, "RAID");
 			host_byte = DID_NO_CONNECT;
 		}
-- 
2.20.1



* Re: [PATCH v3] scsi: smartpqi_init: Reporting 'logical unit failure'
  2019-03-01 16:08         ` [PATCH v3] " Erwan Velu
  2019-03-05 22:30           ` Don.Brace
@ 2019-03-06 17:34           ` Martin K. Petersen
  2019-03-11 16:36             ` Erwan Velu
  2019-03-11 16:43             ` Erwan Velu
  1 sibling, 2 replies; 14+ messages in thread
From: Martin K. Petersen @ 2019-03-06 17:34 UTC (permalink / raw)
  To: Erwan Velu
  Cc: Erwan Velu, Don Brace, James E.J. Bottomley, Martin K. Petersen,
	esc.storagedev, linux-scsi, linux-kernel


Erwan,

> When the HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume
> is taken offline.  The kernel log does not report why the device was
> offlined, which makes it difficult for admins to determine the root
> cause of the issue they are analyzing.

Applied to 5.1/scsi-queue, thanks.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH v3] scsi: smartpqi_init: Reporting 'logical unit failure'
  2019-03-06 17:34           ` Martin K. Petersen
@ 2019-03-11 16:36             ` Erwan Velu
  2019-03-11 16:43             ` Erwan Velu
  1 sibling, 0 replies; 14+ messages in thread
From: Erwan Velu @ 2019-03-11 16:36 UTC (permalink / raw)
  To: Martin K. Petersen, Erwan Velu
  Cc: Don Brace, James E.J. Bottomley, esc.storagedev, linux-scsi,
	linux-kernel


On 06/03/2019 at 18:34, Martin K. Petersen wrote:
> Erwan,
>
>> When the HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume
>> is taken offline.  The kernel log does not report why the device was
>> offlined, which makes it difficult for admins to determine the root
>> cause of the issue they are analyzing.
> Applied to 5.1/scsi-queue, thanks.
>
Thanks Martin !


* Re: [PATCH v3] scsi: smartpqi_init: Reporting 'logical unit failure'
  2019-03-06 17:34           ` Martin K. Petersen
  2019-03-11 16:36             ` Erwan Velu
@ 2019-03-11 16:43             ` Erwan Velu
  1 sibling, 0 replies; 14+ messages in thread
From: Erwan Velu @ 2019-03-11 16:43 UTC (permalink / raw)
  To: Martin K. Petersen, Erwan Velu
  Cc: Don Brace, James E.J. Bottomley, esc.storagedev, linux-scsi,
	linux-kernel

On 06/03/2019 at 18:34, Martin K. Petersen wrote:
> Erwan,
>
>> When the HARDWARE_ERROR/0x3e/0x1 case is triggered, the logical volume
>> is taken offline.  The kernel log does not report why the device was
>> offlined, which makes it difficult for admins to determine the root
>> cause of the issue they are analyzing.

While debugging this scenario, I wondered whether other cases were
possible.

The current code considers (sshdr.asc == 0x3e && sshdr.ascq == 0x1),
but what if ascq has a different value here?

The specification (http://www.t10.org/lists/asc-num.htm#ASC_3E) lists
other sub-values, such as ASCQ == 02, which means a timeout on the LUN.

So, can the RAID controllers supported by smartpqi generate these
other values? If so, how and where are they handled?

I was considering, at least, a switch statement on sshdr.ascq, with a
0x1 case for the current code and a default case that prints a message
saying that an event was received but not handled.

Thanks !

Erwan,
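
Illustration (not part of the original message): the switch-on-ASCQ idea
described above could look roughly like the following user-space C sketch.
It is only a model under stated assumptions: struct sense_model stands in
for the sense_key/asc/ascq fields of the kernel's sense header, and enum
lun_action and classify_asc_3e() are hypothetical names. Whether smartpqi
controllers ever report ASCQ values other than 0x1 is the open question
from the thread and is not answered here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define HARDWARE_ERROR 0x04 /* SCSI sense key value */

/* Hypothetical stand-in for the sense header fields used by the check. */
struct sense_model {
	uint8_t sense_key;
	uint8_t asc;
	uint8_t ascq;
};

/* Hypothetical result type: what the caller should do with the event. */
enum lun_action {
	LUN_NO_ACTION,     /* not a HARDWARE_ERROR / ASC 0x3e event */
	LUN_TAKE_OFFLINE,  /* ASCQ 0x1: 'logical unit failure' */
	LUN_LOG_UNHANDLED, /* any other ASCQ: report it, take no action */
};

static enum lun_action classify_asc_3e(const struct sense_model *s)
{
	assert(s != NULL);

	if (s->sense_key != HARDWARE_ERROR || s->asc != 0x3e)
		return LUN_NO_ACTION;

	switch (s->ascq) {
	case 0x1:
		/* Existing behaviour: the volume is taken offline. */
		return LUN_TAKE_OFFLINE;
	default:
		/* e.g. 0x2, listed by T10 as TIMEOUT ON LOGICAL UNIT */
		fprintf(stderr,
			"received unhandled HARDWARE_ERROR 0x3e/0x%x\n",
			(unsigned int)s->ascq);
		return LUN_LOG_UNHANDLED;
	}
}
```

In this sketch the default case only logs; in a real driver such a message
would presumably need the same rate limiting as the 0x1 message.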


Thread overview: 14+ messages
2019-02-27 16:31 [PATCH] scsi: smartpqi_init: Reporting 'logical unit failure' Erwan Velu
2019-02-28 13:09 ` Erwan Velu
2019-02-28 20:03 ` Elliott, Robert (Persistent Memory)
2019-03-01 14:59   ` Erwan Velu
2019-03-01 14:58 ` [PATCH v2] " Erwan Velu
2019-03-01 15:26   ` James Bottomley
2019-03-01 15:43     ` Erwan Velu
2019-03-01 15:56       ` James Bottomley
2019-03-01 16:00         ` Erwan Velu
2019-03-01 16:08         ` [PATCH v3] " Erwan Velu
2019-03-05 22:30           ` Don.Brace
2019-03-06 17:34           ` Martin K. Petersen
2019-03-11 16:36             ` Erwan Velu
2019-03-11 16:43             ` Erwan Velu
