linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] qla2xxx: fix errors in PCI device remove with ongoing I/O
@ 2016-11-07 19:53 Mauricio Faria de Oliveira
  2016-11-07 19:53 ` [PATCH 1/2] qla2xxx: do not queue commands when unloading Mauricio Faria de Oliveira
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Mauricio Faria de Oliveira @ 2016-11-07 19:53 UTC (permalink / raw)
  To: qla2xxx-upstream; +Cc: jejb, martin.petersen, linux-scsi, linux-kernel

This patchset addresses a couple of errors that might happen during
PCI device remove (e.g., PCI hotplug, PowerVM DLPAR), which prevent
the successful removal and re-addition of the adapter to the system,
and cause an oops and/or invalid DMA access (triggers an EEH event).

It allowed several cycles of PCI device add/remove with ongoing I/O,
to complete successfully without triggering oopses or EEH events.

Verified on v4.9-rc3.

Test-case:
---
    # lspci
    <...>
    001d:70:00.0 Fibre Channel: QLogic Corp. ISP2532-based ...
    001d:70:00.1 Fibre Channel: QLogic Corp. ISP2532-based ...
    <...>

    # for sd in $(find /sys/bus/pci/devices/001d:70:00.*/ \
                       -name 'sd*' -printf "%f\n"); do \
        dd if=/dev/$sd of=/dev/null iflag=nocache & done

    # echo 1 | tee /sys/bus/pci/devices/001d:70:00.*/remove
    (this either works or not)

    # echo 1 > /sys/bus/pci/rescan

Before:
---
    <...>
    EEH: Frozen PHB#1d-PE#700000 detected
    qla2xxx [001d:70:00.1]-8042:2: PCI/Register disconnect, exiting.
    <...>
    EEH: Detected PCI bus error on PHB#29-PE#700000
    <...>
    (and/or)
    Unable to handle kernel paging request for data at address 0x00000138
    <...>
    NIP [d000000004700a40] qla2xxx_queuecommand+0x80/0x3f0 [qla2xxx]
    LR [d000000004700a10] qla2xxx_queuecommand+0x50/0x3f0 [qla2xxx]

    (command does not return; adapter cannot be re-added)

After:
---
    <...>
    qla2xxx [001d:70:00.0]-801c:1: Abort command issued nexus=1:0:0 --  1 2003.
    <...>
    qla2xxx [001d:70:00.1]-801c:2: Abort command issued nexus=2:3:0 --  1 2003.
    <...>

    (command does return; adapter can be re-added correctly)


Mauricio Faria de Oliveira (2):
  qla2xxx: do not queue commands when unloading
  qla2xxx: fix invalid DMA access after command aborts in PCI device
    remove

 drivers/scsi/qla2xxx/qla_os.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/2] qla2xxx: do not queue commands when unloading
  2016-11-07 19:53 [PATCH 0/2] qla2xxx: fix errors in PCI device remove with ongoing I/O Mauricio Faria de Oliveira
@ 2016-11-07 19:53 ` Mauricio Faria de Oliveira
  2016-11-07 19:53 ` [PATCH 2/2] qla2xxx: fix invalid DMA access after command aborts in PCI device remove Mauricio Faria de Oliveira
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Mauricio Faria de Oliveira @ 2016-11-07 19:53 UTC (permalink / raw)
  To: qla2xxx-upstream; +Cc: jejb, martin.petersen, linux-scsi, linux-kernel

When the driver is unloading, in qla2x00_remove_one(), there is a single
call/point in time to abort ongoing commands, qla2x00_abort_all_cmds(),
which is still several steps away from the call to scsi_remove_host().

If more commands continue to arrive and be processed during that interval,
when the driver is tearing down and releasing its structures, it might
potentially hit an oops due to invalid memory access:

    Unable to handle kernel paging request for data at address 0x00000138
    <...>
    NIP [d000000004700a40] qla2xxx_queuecommand+0x80/0x3f0 [qla2xxx]
    LR [d000000004700a10] qla2xxx_queuecommand+0x50/0x3f0 [qla2xxx]

So, fail commands in qla2xxx_queuecommand() if the UNLOADING bit is set.

Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
---
 drivers/scsi/qla2xxx/qla_os.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index ace65db..fdb135b 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -707,6 +707,11 @@ static int qla25xx_setup_mode(struct scsi_qla_host *vha)
 	srb_t *sp;
 	int rval;
 
+	if (unlikely(test_bit(UNLOADING, &base_vha->dpc_flags))) {
+		cmd->result = DID_NO_CONNECT << 16;
+		goto qc24_fail_command;
+	}
+
 	if (ha->flags.eeh_busy) {
 		if (ha->flags.pci_channel_io_perm_failure) {
 			ql_dbg(ql_dbg_aer, vha, 0x9010,
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 2/2] qla2xxx: fix invalid DMA access after command aborts in PCI device remove
  2016-11-07 19:53 [PATCH 0/2] qla2xxx: fix errors in PCI device remove with ongoing I/O Mauricio Faria de Oliveira
  2016-11-07 19:53 ` [PATCH 1/2] qla2xxx: do not queue commands when unloading Mauricio Faria de Oliveira
@ 2016-11-07 19:53 ` Mauricio Faria de Oliveira
  2016-11-08 17:47 ` [PATCH 0/2] qla2xxx: fix errors in PCI device remove with ongoing I/O Madhani, Himanshu
  2016-11-09  0:14 ` Martin K. Petersen
  3 siblings, 0 replies; 5+ messages in thread
From: Mauricio Faria de Oliveira @ 2016-11-07 19:53 UTC (permalink / raw)
  To: qla2xxx-upstream; +Cc: jejb, martin.petersen, linux-scsi, linux-kernel

If a command is aborted in the kernel but not in the adapter, it might be
considered complete and its DMA memory released, but it is still alive in
the adapter, which will trigger an invalid DMA access upon its completion
(in the DMA operations to deliver the command response to the driver).

On powerpc platforms with IOMMU/EEH capabilities, the problem is observed
during PCI device removal with ongoing IO requests -- which might trigger
an EEH event very often, pointing to a 'TCE Request Page Access Error'.

In that path, which is qla2x00_remove_one(), the commands are aborted in
qla2x00_abort_all_cmds(), which does not perform an abort in the adapter
as is done in qla2xxx_eh_abort() for example.

So, this patch changes qla2x00_abort_all_cmds() to abort commands in the
adapter too, with a call to qla2xxx_eh_abort(), which already implements
all the logic to submit abort requests and handle responses.

Reported-by: Naresh Bannoth <nbannoth@in.ibm.com>
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
---
 drivers/scsi/qla2xxx/qla_os.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index fdb135b..c50dd22 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -1456,6 +1456,15 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
 		for (cnt = 1; cnt < req->num_outstanding_cmds; cnt++) {
 			sp = req->outstanding_cmds[cnt];
 			if (sp) {
+				/* Get a reference to the sp and drop the lock.
+				 * The reference ensures this sp->done() call
+				 * - and not the call in qla2xxx_eh_abort() -
+				 * ends the SCSI command (with result 'res').
+				 */
+				sp_get(sp);
+				spin_unlock_irqrestore(&ha->hardware_lock, flags);
+				qla2xxx_eh_abort(GET_CMD_SP(sp));
+				spin_lock_irqsave(&ha->hardware_lock, flags);
 				req->outstanding_cmds[cnt] = NULL;
 				sp->done(vha, sp, res);
 			}
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/2] qla2xxx: fix errors in PCI device remove with ongoing I/O
  2016-11-07 19:53 [PATCH 0/2] qla2xxx: fix errors in PCI device remove with ongoing I/O Mauricio Faria de Oliveira
  2016-11-07 19:53 ` [PATCH 1/2] qla2xxx: do not queue commands when unloading Mauricio Faria de Oliveira
  2016-11-07 19:53 ` [PATCH 2/2] qla2xxx: fix invalid DMA access after command aborts in PCI device remove Mauricio Faria de Oliveira
@ 2016-11-08 17:47 ` Madhani, Himanshu
  2016-11-09  0:14 ` Martin K. Petersen
  3 siblings, 0 replies; 5+ messages in thread
From: Madhani, Himanshu @ 2016-11-08 17:47 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira, qla2xxx-upstream
  Cc: jejb, martin.petersen, linux-scsi, linux-kernel



On 11/7/16, 11:53 AM, "Mauricio Faria de Oliveira" <mauricfo@linux.vnet.ibm.com> wrote:

>This patchset addresses a couple of errors that might happen during
>PCI device remove (e.g., PCI hotplug, PowerVM DLPAR), which prevent
>the successful removal and re-addition of the adapter to the system,
>and cause an oops and/or invalid DMA access (triggers an EEH event).
>
>It allowed several cycles of PCI device add/remove with ongoing I/O,
>to complete successfully without triggering oopses or EEH events.
>
>Verified on v4.9-rc3.
>
>Test-case:
>---
>    # lspci
>    <...>
>    001d:70:00.0 Fibre Channel: QLogic Corp. ISP2532-based ...
>    001d:70:00.1 Fibre Channel: QLogic Corp. ISP2532-based ...
>    <...>
>
>    # for sd in $(find /sys/bus/pci/devices/001d:70:00.*/ \
>                       -name 'sd*' -printf "%f\n"); do \
>        dd if=/dev/$sd of=/dev/null iflag=nocache & done
>
>    # echo 1 | tee /sys/bus/pci/devices/001d:70:00.*/remove
>    (this either works or not)
>
>    # echo 1 > /sys/bus/pci/rescan
>
>Before:
>---
>    <...>
>    EEH: Frozen PHB#1d-PE#700000 detected
>    qla2xxx [001d:70:00.1]-8042:2: PCI/Register disconnect, exiting.
>    <...>
>    EEH: Detected PCI bus error on PHB#29-PE#700000
>    <...>
>    (and/or)
>    Unable to handle kernel paging request for data at address 0x00000138
>    <...>
>    NIP [d000000004700a40] qla2xxx_queuecommand+0x80/0x3f0 [qla2xxx]
>    LR [d000000004700a10] qla2xxx_queuecommand+0x50/0x3f0 [qla2xxx]
>
>    (command does not return; adapter cannot be re-added)
>
>After:
>---
>    <...>
>    qla2xxx [001d:70:00.0]-801c:1: Abort command issued nexus=1:0:0 --  1 2003.
>    <...>
>    qla2xxx [001d:70:00.1]-801c:2: Abort command issued nexus=2:3:0 --  1 2003.
>    <...>
>
>    (command does return; adapter can be re-added correctly)
>
>
>Mauricio Faria de Oliveira (2):
>  qla2xxx: do not queue commands when unloading
>  qla2xxx: fix invalid DMA access after command aborts in PCI device
>    remove
>
> drivers/scsi/qla2xxx/qla_os.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
>-- 
>1.8.3.1
>
Thanks for the patches. Series Looks Good. 

Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com>

>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/2] qla2xxx: fix errors in PCI device remove with ongoing I/O
  2016-11-07 19:53 [PATCH 0/2] qla2xxx: fix errors in PCI device remove with ongoing I/O Mauricio Faria de Oliveira
                   ` (2 preceding siblings ...)
  2016-11-08 17:47 ` [PATCH 0/2] qla2xxx: fix errors in PCI device remove with ongoing I/O Madhani, Himanshu
@ 2016-11-09  0:14 ` Martin K. Petersen
  3 siblings, 0 replies; 5+ messages in thread
From: Martin K. Petersen @ 2016-11-09  0:14 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: qla2xxx-upstream, jejb, martin.petersen, linux-scsi, linux-kernel

>>>>> "Mauricio" == Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> writes:

Mauricio> This patchset addresses a couple of errors that might happen
Mauricio> during PCI device remove (e.g., PCI hotplug, PowerVM DLPAR),
Mauricio> which prevent the successful removal and re-addition of the
Mauricio> adapter to the system, and cause an oops and/or invalid DMA
Mauricio> access (triggers an EEH event).

Applied to 4.9/scsi-fixes.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-11-09  0:15 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-07 19:53 [PATCH 0/2] qla2xxx: fix errors in PCI device remove with ongoing I/O Mauricio Faria de Oliveira
2016-11-07 19:53 ` [PATCH 1/2] qla2xxx: do not queue commands when unloading Mauricio Faria de Oliveira
2016-11-07 19:53 ` [PATCH 2/2] qla2xxx: fix invalid DMA access after command aborts in PCI device remove Mauricio Faria de Oliveira
2016-11-08 17:47 ` [PATCH 0/2] qla2xxx: fix errors in PCI device remove with ongoing I/O Madhani, Himanshu
2016-11-09  0:14 ` Martin K. Petersen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).