All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] qla2xxx device removal fixups
@ 2014-06-18 14:02 Joe Lawrence
  2014-06-18 14:02 ` [PATCH 1/6] qla2xxx: Fix shost use-after-free on device removal Joe Lawrence
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: Joe Lawrence @ 2014-06-18 14:02 UTC (permalink / raw)
  To: linux-scsi
  Cc: Chad Dupuis, Giridhar Malavali, Saurav Kashyap, Don Zickus,
	Prarit Bhargava, Bill Kuzeja, Dave Bulkow, Joe Lawrence

Hi Chad, Giri, et al.

Stratus has been testing the upstream qla2xxx driver against surprise
device removal and has found a few minor issues along the way.  With
this patchset, results have been good.  Although the following changes
may be most interesting on a Stratus platform, they should be applicable
to general hotplug scenarios on other hardware.

The first two patches are independent from the others and can be
reviewed as such.  One fixes up a use-after-free and the other
consolidates some code.

The final four patches are more invasive and modify the scsi_qla_host
structure to avoid device removal race conditions.  These changes were
written to demonstrate the problem at hand and minimally fix them in
order to continue testing.  (For example, adding completely a new
pci_flags member to scsi_qla_host is probably overkill.) 

We would welcome any feedback or questions about our testing.

Regards,

-- Joe


Joe Lawrence (6):

  qla2xxx: Fix shost use-after-free on device removal

    scsi_qla_host *base_vha is allocated as part of the Scsi_Host
    private hostdata[] by qla2x00_create_host.  When the last reference
    on the host is put by qla2x00_remove_one, it's released along with
    any hostdata[], rendering base_vha unsafe to dereference.  
      
    To safely complete the adapter cleanup in qla2x00_remove_one,
    use the scsi_qla_host_t *qla_hw_data pointer that is already in
    hand.
      
    To reproduce, set kernel boot option slub_debug=FZPU and remove
    an adapter instance.


  qla2xxx: Use qla2x00_clear_drv_active on probe failure

    Clean up some duplicate code along the way.


  qla2xxx: Collect PCI register checks and board_disable scheduling

    PCI register read checks are performed throughout the driver to
    detect disconnected hardware.  There is currently a function that
    verifies 32-bit values and schedules board_removal if needed. Other
    16-bit registers are checked and board_removal scheduling scattered
    around the driver.

    This change pulls all of these together such that a new 16-bit
    register check function wrappers the existing 32-bit version and
    centralizes the scheduling of board_disable work to a single
    invocation.

    The following patches depend upon this change.


  qla2xxx: Schedule board_disable only once

    In automated hot-plug device removal testing, occasionally
    qla2x00_disable_board_on_pci_error would run twice.  I believe this
    occurred by a second qla2x00_check_reg_for_disconnect scheduling
    board_disable while the first instance was still running.

    The easiest solution seemed to be adding a per-adapter flag that
    could be test_and_set before scheduling board_disable. 


  qla2xxx: Prevent removal and board_disable race

    This race was discovered through many iterations of automated PCI
    hotplug.  The automated test inserts faults on the Stratus platform
    to simulate device removal.  After removal, it re-adds the devices,
    simulating PCI hotplug insertion.  Rinse, repeat.

    The race occurs when a Stratus proprietary hotplug driver detects
    that PCI devices have gone missing and calls into the kernel PCI
    subsystem to remove them.  At nearly the same time, the qla2xxx
    driver figures out the same issue and schedules a board_disable.  It
    takes a few hundred device removals to hit, but it seemed that the
    problem was that qla2x00_disable_board_on_pci_error started, then
    qla2xxx_remove_one was invoked, so one of the two started touching
    already freed resources.

    Although the bug manifested on a Stratus platform running modified
    drivers, the window for qla2xxx_remove_one to race
    qla2x00_disable_board_on_pci_error is still present if running stock
    kernel/drivers on commodity hardware.  It may be difficult to hit,
    but one could theoretically invoke PCI device removal via sysfs (or
    simply unload the qla2xxx driver) and for board_disable to run at
    the same time.

    The fix leaves the Scsi_Host intact during board_disable so that
    qla2x00_remove_one can safely stop the main timer and
    cancel_work_sync on any outstanding board_disable work.  A
    PFLG_DRIVER_REMOVING bit is also set to prevent any other
    board_disable instances from scheduling.  Once the asynchronous
    players are out of the way, qla2x00_remove_one can move forward
    cleaning up whatever remaining resources are still attached.

    The PCI device enable count check in qla2x00_remove_one was
    re-purposed to determine if board_disable had run.  Its original
    intention to detect qla2x00_probe_one failure was invalid.  As long
    as the driver .probe routine returns an -ERRNO, then its counterpart
    .remove routine is *not* called.


  qla2xxx: Prevent probe and board_disable race

    An automated Stratus device removal test hotplug removes devices
    shortly during/after their initialization.  The test inserts
    devices, then proceeds to remove them at various 1, 2, 3, 4...
    second intervals.  This test revealed two bugs:

    1 - The struct isp_operations initialize_adapter callback is invoked
    by qla2x00_probe_one before ha->board_disable is initialized. The
    callback will eventually read PCI registers and potentially trigger
    the disconnection check.  If the board_disable work structure hasn't
    been initialized, an ugly WARN trace will be emitted. The call stack
    looked something like:

      qla2x00_probe_one
        qla2x00_initialize_adapter
          qla2xxx_get_flash_info
            qla2xxx_get_flt_info
              qla25xx_read_optrom_data
                 qla2x00_dump_ram
                   qla2x00_mailbox_command
                     qla24xx_intr_handler
                       qla2x00_check_reg_for_disconnect

    2 - Races between the qla2x00_disable_board_on_pci_error and
    qla2x00_probe_one. The latter was written before any asynchronous
    device removals were introduced by commit f3ddac1 "qla2xxx: Disable
    adapter when we encounter a PCI disconnect," and proceeds as if it
    were the only game in town with respect to the underlying device.
    Restore that assumption by preventing any board_disable scheduling
    while qla2x00_probe_one is in flight.  Holding off this scheduling
    prevents the race, as well as avoids bug #1.

 drivers/scsi/qla2xxx/qla_def.h |    5 ++++
 drivers/scsi/qla2xxx/qla_gbl.h |    3 +-
 drivers/scsi/qla2xxx/qla_isr.c |   44 +++++++++++++++--------------
 drivers/scsi/qla2xxx/qla_mr.c  |    2 +-
 drivers/scsi/qla2xxx/qla_nx.c  |    6 ++--
 drivers/scsi/qla2xxx/qla_os.c  |   60 ++++++++++++++++++----------------------
 6 files changed, 61 insertions(+), 59 deletions(-)

-- 
1.7.10.4


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/6] qla2xxx: Fix shost use-after-free on device removal
  2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
@ 2014-06-18 14:02 ` Joe Lawrence
  2014-06-18 14:02 ` [PATCH 2/6] qla2xxx: Use qla2x00_clear_drv_active on probe failure Joe Lawrence
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Joe Lawrence @ 2014-06-18 14:02 UTC (permalink / raw)
  To: linux-scsi
  Cc: Chad Dupuis, Giridhar Malavali, Saurav Kashyap, Don Zickus,
	Prarit Bhargava, Bill Kuzeja, Dave Bulkow, Joe Lawrence

Once calling scsi_host_put, be careful to not access qla_hw_data through
the Scsi_Host private data (ie, scsi_qla_host base_vha).

Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
---
 drivers/scsi/qla2xxx/qla_os.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 5a430c7..6a512b7 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -3111,10 +3111,8 @@ qla2x00_unmap_iobases(struct qla_hw_data *ha)
 }
 
 static void
-qla2x00_clear_drv_active(scsi_qla_host_t *vha)
+qla2x00_clear_drv_active(struct qla_hw_data *ha)
 {
-	struct qla_hw_data *ha = vha->hw;
-
 	if (IS_QLA8044(ha)) {
 		qla8044_idc_lock(ha);
 		qla8044_clear_drv_active(ha);
@@ -3185,7 +3183,7 @@ qla2x00_remove_one(struct pci_dev *pdev)
 
 	scsi_host_put(base_vha->host);
 
-	qla2x00_clear_drv_active(base_vha);
+	qla2x00_clear_drv_active(ha);
 
 	qla2x00_unmap_iobases(ha);
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/6] qla2xxx: Use qla2x00_clear_drv_active on probe failure
  2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
  2014-06-18 14:02 ` [PATCH 1/6] qla2xxx: Fix shost use-after-free on device removal Joe Lawrence
@ 2014-06-18 14:02 ` Joe Lawrence
  2014-06-18 14:02 ` [PATCH 3/6] qla2xxx: Collect PCI register checks and board_disable scheduling Joe Lawrence
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Joe Lawrence @ 2014-06-18 14:02 UTC (permalink / raw)
  To: linux-scsi
  Cc: Chad Dupuis, Giridhar Malavali, Saurav Kashyap, Don Zickus,
	Prarit Bhargava, Bill Kuzeja, Dave Bulkow, Joe Lawrence

Take advantage of commit fe1b806f "qla2xxx: Refactor shutdown code so
some functionality can be reused" to remove an inlined copy of
qla2x00_clear_drv_active in the driver's probe hardware error path.

Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
---
 drivers/scsi/qla2xxx/qla_os.c |   13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 6a512b7..31913bb 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -238,6 +238,7 @@ static int qla2xxx_eh_host_reset(struct scsi_cmnd *);
 
 static int qla2x00_change_queue_depth(struct scsi_device *, int, int);
 static int qla2x00_change_queue_type(struct scsi_device *, int);
+static void qla2x00_clear_drv_active(struct qla_hw_data *);
 static void qla2x00_free_device(scsi_qla_host_t *);
 
 struct scsi_host_template qla2xxx_driver_template = {
@@ -2946,16 +2947,8 @@ probe_failed:
 	scsi_host_put(base_vha->host);
 
 probe_hw_failed:
-	if (IS_QLA82XX(ha)) {
-		qla82xx_idc_lock(ha);
-		qla82xx_clear_drv_active(ha);
-		qla82xx_idc_unlock(ha);
-	}
-	if (IS_QLA8044(ha)) {
-		qla8044_idc_lock(ha);
-		qla8044_clear_drv_active(ha);
-		qla8044_idc_unlock(ha);
-	}
+	qla2x00_clear_drv_active(ha);
+
 iospace_config_failed:
 	if (IS_P3P_TYPE(ha)) {
 		if (!ha->nx_pcibase)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/6] qla2xxx: Collect PCI register checks and board_disable scheduling
  2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
  2014-06-18 14:02 ` [PATCH 1/6] qla2xxx: Fix shost use-after-free on device removal Joe Lawrence
  2014-06-18 14:02 ` [PATCH 2/6] qla2xxx: Use qla2x00_clear_drv_active on probe failure Joe Lawrence
@ 2014-06-18 14:02 ` Joe Lawrence
  2014-06-18 14:02 ` [PATCH 4/6] qla2xxx: Schedule board_disable only once Joe Lawrence
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Joe Lawrence @ 2014-06-18 14:02 UTC (permalink / raw)
  To: linux-scsi
  Cc: Chad Dupuis, Giridhar Malavali, Saurav Kashyap, Don Zickus,
	Prarit Bhargava, Bill Kuzeja, Dave Bulkow, Joe Lawrence

Add an uint16_t variant of qla2x00_check_reg_for_disconnect and use
these routines to check and schedule a PCI-disconnected board from a
centralized place.

Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
---
 drivers/scsi/qla2xxx/qla_gbl.h |    3 ++-
 drivers/scsi/qla2xxx/qla_isr.c |   28 +++++++++++++---------------
 drivers/scsi/qla2xxx/qla_mr.c  |    2 +-
 drivers/scsi/qla2xxx/qla_nx.c  |    6 +++---
 drivers/scsi/qla2xxx/qla_os.c  |    8 +-------
 5 files changed, 20 insertions(+), 27 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_gbl.h b/drivers/scsi/qla2xxx/qla_gbl.h
index d48dea8..a60c1b8 100644
--- a/drivers/scsi/qla2xxx/qla_gbl.h
+++ b/drivers/scsi/qla2xxx/qla_gbl.h
@@ -475,7 +475,8 @@ extern uint8_t *qla25xx_read_nvram_data(scsi_qla_host_t *, uint8_t *, uint32_t,
 extern int qla25xx_write_nvram_data(scsi_qla_host_t *, uint8_t *, uint32_t,
 				    uint32_t);
 extern int qla2x00_is_a_vp_did(scsi_qla_host_t *, uint32_t);
-bool qla2x00_check_reg_for_disconnect(scsi_qla_host_t *, uint32_t);
+bool qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *, uint32_t);
+bool qla2x00_check_reg16_for_disconnect(scsi_qla_host_t *, uint16_t);
 
 extern int qla2x00_beacon_on(struct scsi_qla_host *);
 extern int qla2x00_beacon_off(struct scsi_qla_host *);
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index a56825c..e1b16ad 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -56,16 +56,8 @@ qla2100_intr_handler(int irq, void *dev_id)
 	vha = pci_get_drvdata(ha->pdev);
 	for (iter = 50; iter--; ) {
 		hccr = RD_REG_WORD(&reg->hccr);
-		/* Check for PCI disconnection */
-		if (hccr == 0xffff) {
-			/*
-			 * Schedule this on the default system workqueue so that
-			 * all the adapter workqueues and the DPC thread can be
-			 * shutdown cleanly.
-			 */
-			schedule_work(&ha->board_disable);
+		if (qla2x00_check_reg16_for_disconnect(vha, hccr))
 			break;
-		}
 		if (hccr & HCCR_RISC_PAUSE) {
 			if (pci_channel_offline(ha->pdev))
 				break;
@@ -121,7 +113,7 @@ qla2100_intr_handler(int irq, void *dev_id)
 }
 
 bool
-qla2x00_check_reg_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
+qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
 {
 	/* Check for PCI disconnection */
 	if (reg == 0xffffffff) {
@@ -136,6 +128,12 @@ qla2x00_check_reg_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
 		return false;
 }
 
+bool
+qla2x00_check_reg16_for_disconnect(scsi_qla_host_t *vha, uint16_t reg)
+{
+	return qla2x00_check_reg32_for_disconnect(vha, 0xffff0000 | reg);
+}
+
 /**
  * qla2300_intr_handler() - Process interrupts for the ISP23xx and ISP63xx.
  * @irq:
@@ -174,7 +172,7 @@ qla2300_intr_handler(int irq, void *dev_id)
 	vha = pci_get_drvdata(ha->pdev);
 	for (iter = 50; iter--; ) {
 		stat = RD_REG_DWORD(&reg->u.isp2300.host_status);
-		if (qla2x00_check_reg_for_disconnect(vha, stat))
+		if (qla2x00_check_reg32_for_disconnect(vha, stat))
 			break;
 		if (stat & HSR_RISC_PAUSED) {
 			if (unlikely(pci_channel_offline(ha->pdev)))
@@ -2633,7 +2631,7 @@ qla24xx_intr_handler(int irq, void *dev_id)
 	vha = pci_get_drvdata(ha->pdev);
 	for (iter = 50; iter--; ) {
 		stat = RD_REG_DWORD(&reg->host_status);
-		if (qla2x00_check_reg_for_disconnect(vha, stat))
+		if (qla2x00_check_reg32_for_disconnect(vha, stat))
 			break;
 		if (stat & HSRX_RISC_PAUSED) {
 			if (unlikely(pci_channel_offline(ha->pdev)))
@@ -2723,7 +2721,7 @@ qla24xx_msix_rsp_q(int irq, void *dev_id)
 	 * we process the response queue.
 	 */
 	stat = RD_REG_DWORD(&reg->host_status);
-	if (qla2x00_check_reg_for_disconnect(vha, stat))
+	if (qla2x00_check_reg32_for_disconnect(vha, stat))
 		goto out;
 	qla24xx_process_response_queue(vha, rsp);
 	if (!ha->flags.disable_msix_handshake) {
@@ -2763,7 +2761,7 @@ qla25xx_msix_rsp_q(int irq, void *dev_id)
 		hccr = RD_REG_DWORD_RELAXED(&reg->hccr);
 		spin_unlock_irqrestore(&ha->hardware_lock, flags);
 	}
-	if (qla2x00_check_reg_for_disconnect(vha, hccr))
+	if (qla2x00_check_reg32_for_disconnect(vha, hccr))
 		goto out;
 	queue_work_on((int) (rsp->id - 1), ha->wq, &rsp->q_work);
 
@@ -2798,7 +2796,7 @@ qla24xx_msix_default(int irq, void *dev_id)
 	vha = pci_get_drvdata(ha->pdev);
 	do {
 		stat = RD_REG_DWORD(&reg->host_status);
-		if (qla2x00_check_reg_for_disconnect(vha, stat))
+		if (qla2x00_check_reg32_for_disconnect(vha, stat))
 			break;
 		if (stat & HSRX_RISC_PAUSED) {
 			if (unlikely(pci_channel_offline(ha->pdev)))
diff --git a/drivers/scsi/qla2xxx/qla_mr.c b/drivers/scsi/qla2xxx/qla_mr.c
index abeb390..1624ff3 100644
--- a/drivers/scsi/qla2xxx/qla_mr.c
+++ b/drivers/scsi/qla2xxx/qla_mr.c
@@ -2924,7 +2924,7 @@ qlafx00_intr_handler(int irq, void *dev_id)
 	vha = pci_get_drvdata(ha->pdev);
 	for (iter = 50; iter--; clr_intr = 0) {
 		stat = QLAFX00_RD_INTR_REG(ha);
-		if (qla2x00_check_reg_for_disconnect(vha, stat))
+		if (qla2x00_check_reg32_for_disconnect(vha, stat))
 			break;
 		intr_stat = stat & QLAFX00_HST_INT_STS_BITS;
 		if (!intr_stat)
diff --git a/drivers/scsi/qla2xxx/qla_nx.c b/drivers/scsi/qla2xxx/qla_nx.c
index 58f3c91..2562600 100644
--- a/drivers/scsi/qla2xxx/qla_nx.c
+++ b/drivers/scsi/qla2xxx/qla_nx.c
@@ -2123,7 +2123,7 @@ qla82xx_msix_default(int irq, void *dev_id)
 	vha = pci_get_drvdata(ha->pdev);
 	do {
 		host_int = RD_REG_DWORD(&reg->host_int);
-		if (qla2x00_check_reg_for_disconnect(vha, host_int))
+		if (qla2x00_check_reg32_for_disconnect(vha, host_int))
 			break;
 		if (host_int) {
 			stat = RD_REG_DWORD(&reg->host_status);
@@ -2184,7 +2184,7 @@ qla82xx_msix_rsp_q(int irq, void *dev_id)
 	spin_lock_irqsave(&ha->hardware_lock, flags);
 	vha = pci_get_drvdata(ha->pdev);
 	host_int = RD_REG_DWORD(&reg->host_int);
-	if (qla2x00_check_reg_for_disconnect(vha, host_int))
+	if (qla2x00_check_reg32_for_disconnect(vha, host_int))
 		goto out;
 	qla24xx_process_response_queue(vha, rsp);
 	WRT_REG_DWORD(&reg->host_int, 0);
@@ -2219,7 +2219,7 @@ qla82xx_poll(int irq, void *dev_id)
 	vha = pci_get_drvdata(ha->pdev);
 
 	host_int = RD_REG_DWORD(&reg->host_int);
-	if (qla2x00_check_reg_for_disconnect(vha, host_int))
+	if (qla2x00_check_reg32_for_disconnect(vha, host_int))
 		goto out;
 	if (host_int) {
 		stat = RD_REG_DWORD(&reg->host_status);
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 31913bb..39c9953 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -5174,13 +5174,7 @@ qla2x00_timer(scsi_qla_host_t *vha)
 	 */
 	if (!pci_channel_offline(ha->pdev)) {
 		pci_read_config_word(ha->pdev, PCI_VENDOR_ID, &w);
-		if (w == 0xffff)
-			/*
-			 * Schedule this on the default system workqueue so that
-			 * all the adapter workqueues and the DPC thread can be
-			 * shutdown cleanly.
-			 */
-			schedule_work(&ha->board_disable);
+		qla2x00_check_reg16_for_disconnect(vha, w);
 	}
 
 	/* Make sure qla82xx_watchdog is run only for physical port */
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/6] qla2xxx: Schedule board_disable only once
  2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
                   ` (2 preceding siblings ...)
  2014-06-18 14:02 ` [PATCH 3/6] qla2xxx: Collect PCI register checks and board_disable scheduling Joe Lawrence
@ 2014-06-18 14:02 ` Joe Lawrence
  2014-06-18 14:04 ` [PATCH 5/6] qla2xxx: Prevent removal and board_disable race Joe Lawrence
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: Joe Lawrence @ 2014-06-18 14:02 UTC (permalink / raw)
  To: linux-scsi
  Cc: Chad Dupuis, Giridhar Malavali, Saurav Kashyap, Don Zickus,
	Prarit Bhargava, Bill Kuzeja, Dave Bulkow, Joe Lawrence

There are various callers of qla2x00_check_reg{32,16}_for_disconnect
that may schedule board removal on PCI-disconnect.  Test-and-set a
dedicated flag before scheduling board_disable so it is invoked only
once.

Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
---
 drivers/scsi/qla2xxx/qla_def.h |    3 +++
 drivers/scsi/qla2xxx/qla_isr.c |   14 ++++++++------
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h
index 1fa0104..1267b11 100644
--- a/drivers/scsi/qla2xxx/qla_def.h
+++ b/drivers/scsi/qla2xxx/qla_def.h
@@ -3402,6 +3402,9 @@ typedef struct scsi_qla_host {
 #define FX00_CRITEMP_RECOVERY	25
 #define FX00_HOST_INFO_RESEND	26
 
+	unsigned long	pci_flags;
+#define PFLG_DISCONNECTED	0	/* PCI device removed */
+
 	uint32_t	device_flags;
 #define SWITCH_FOUND		BIT_0
 #define DFLG_NO_CABLE		BIT_1
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index e1b16ad..2741ad8 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -117,12 +117,14 @@ qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
 {
 	/* Check for PCI disconnection */
 	if (reg == 0xffffffff) {
-		/*
-		 * Schedule this on the default system workqueue so that all the
-		 * adapter workqueues and the DPC thread can be shutdown
-		 * cleanly.
-		 */
-		schedule_work(&vha->hw->board_disable);
+		if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags)) {
+			/*
+			 * Schedule this (only once) on the default system
+			 * workqueue so that all the adapter workqueues and the
+			 * DPC thread can be shutdown cleanly.
+			 */
+			schedule_work(&vha->hw->board_disable);
+		}
 		return true;
 	} else
 		return false;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 5/6] qla2xxx: Prevent removal and board_disable race
  2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
                   ` (3 preceding siblings ...)
  2014-06-18 14:02 ` [PATCH 4/6] qla2xxx: Schedule board_disable only once Joe Lawrence
@ 2014-06-18 14:04 ` Joe Lawrence
  2014-06-18 14:04   ` [PATCH 6/6] qla2xxx: Prevent probe " Joe Lawrence
  2014-07-25 15:23   ` [PATCH 5/6] qla2xxx: Prevent removal " Chad Dupuis
  2014-06-18 15:35 ` [PATCH 0/6] qla2xxx device removal fixups Giridhar Malavali
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 14+ messages in thread
From: Joe Lawrence @ 2014-06-18 14:04 UTC (permalink / raw)
  To: linux-scsi
  Cc: Chad Dupuis, Giridhar Malavali, Saurav Kashyap, Don Zickus,
	Prarit Bhargava, Bill Kuzeja, Dave Bulkow, Joe Lawrence

Introduce mutual exclusion between the qla2xxx_remove_one PCI driver
callback and qla2x00_disable_board_on_pci_error, which is scheduled as
board_disable work by qla2x00_check_reg{32,16}_for_disconnect:

* Leave the driver-specific data attached to the underlying PCI device
intact in qla2x00_disable_board_on_pci_error, so that qla2x00_remove_one
has enough breadcrumbs to determine that any board_disable work has been
completed.

* In qla2xxx_remove_one, set a bit to prevent any subsequent
board_disable work from scheduling, then cancel and wait until pending
work has completed.

* Reuse the PCI device enable count check in qla2x00_remove_one to
determine if board_disable has occured.  The original purpose of this
check was unnecessary since the driver remove function wasn't called
when the probe fails.

Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
---
 drivers/scsi/qla2xxx/qla_def.h |    1 +
 drivers/scsi/qla2xxx/qla_isr.c |    3 ++-
 drivers/scsi/qla2xxx/qla_os.c  |   31 +++++++++++++++++++------------
 3 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h
index 1267b11..7c441c9 100644
--- a/drivers/scsi/qla2xxx/qla_def.h
+++ b/drivers/scsi/qla2xxx/qla_def.h
@@ -3404,6 +3404,7 @@ typedef struct scsi_qla_host {
 
 	unsigned long	pci_flags;
 #define PFLG_DISCONNECTED	0	/* PCI device removed */
+#define PFLG_DRIVER_REMOVING	1	/* PCI driver .remove */
 
 	uint32_t	device_flags;
 #define SWITCH_FOUND		BIT_0
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index 2741ad8..ee5eef4 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -117,7 +117,8 @@ qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
 {
 	/* Check for PCI disconnection */
 	if (reg == 0xffffffff) {
-		if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags)) {
+		if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags) &&
+		    !test_bit(PFLG_DRIVER_REMOVING, &vha->pci_flags)) {
 			/*
 			 * Schedule this (only once) on the default system
 			 * workqueue so that all the adapter workqueues and the
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 39c9953..51cba37 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -3123,15 +3123,25 @@ qla2x00_remove_one(struct pci_dev *pdev)
 	scsi_qla_host_t *base_vha;
 	struct qla_hw_data  *ha;
 
+	base_vha = pci_get_drvdata(pdev);
+	ha = base_vha->hw;
+
+	/* Indicate device removal to prevent future board_disable and wait
+	 * until any pending board_disable has completed. */
+	set_bit(PFLG_DRIVER_REMOVING, &base_vha->pci_flags);
+	cancel_work_sync(&ha->board_disable);
+
 	/*
-	 * If the PCI device is disabled that means that probe failed and any
-	 * resources should be have cleaned up on probe exit.
+	 * If the PCI device is disabled then there was a PCI-disconnect and
+	 * qla2x00_disable_board_on_pci_error has taken care of most of the
+	 * resources.
 	 */
-	if (!atomic_read(&pdev->enable_cnt))
+	if (!atomic_read(&pdev->enable_cnt)) {
+		scsi_host_put(base_vha->host);
+		kfree(ha);
+		pci_set_drvdata(pdev, NULL);
 		return;
-
-	base_vha = pci_get_drvdata(pdev);
-	ha = base_vha->hw;
+	}
 
 	qla2x00_wait_for_hba_ready(base_vha);
 
@@ -4791,18 +4801,15 @@ qla2x00_disable_board_on_pci_error(struct work_struct *work)
 	qla82xx_md_free(base_vha);
 	qla2x00_free_queues(ha);
 
-	scsi_host_put(base_vha->host);
-
 	qla2x00_unmap_iobases(ha);
 
 	pci_release_selected_regions(ha->pdev, ha->bars);
-	kfree(ha);
-	ha = NULL;
-
 	pci_disable_pcie_error_reporting(pdev);
 	pci_disable_device(pdev);
-	pci_set_drvdata(pdev, NULL);
 
+	/*
+	 * Let qla2x00_remove_one cleanup qla_hw_data on device removal.
+	 */
 }
 
 /**************************************************************************
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 6/6] qla2xxx: Prevent probe and board_disable race
  2014-06-18 14:04 ` [PATCH 5/6] qla2xxx: Prevent removal and board_disable race Joe Lawrence
@ 2014-06-18 14:04   ` Joe Lawrence
  2014-07-25 15:23   ` [PATCH 5/6] qla2xxx: Prevent removal " Chad Dupuis
  1 sibling, 0 replies; 14+ messages in thread
From: Joe Lawrence @ 2014-06-18 14:04 UTC (permalink / raw)
  To: linux-scsi
  Cc: Chad Dupuis, Giridhar Malavali, Saurav Kashyap, Don Zickus,
	Prarit Bhargava, Bill Kuzeja, Dave Bulkow, Joe Lawrence

The PCI register read checking introduced in commit f3ddac19 "qla2xxx:
Disable adapter when we encounter a PCI disconnect" is active during
driver probe.  Hold off scheduling any board removal until the driver
probe has completed.  This ensures that the the board_disable work
structure is initialized and more importantly, avoids racing
qla2x00_probe_one.

Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
---
 drivers/scsi/qla2xxx/qla_def.h |    1 +
 drivers/scsi/qla2xxx/qla_isr.c |    3 ++-
 drivers/scsi/qla2xxx/qla_os.c  |    2 ++
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h
index 7c441c9..253a89d 100644
--- a/drivers/scsi/qla2xxx/qla_def.h
+++ b/drivers/scsi/qla2xxx/qla_def.h
@@ -3405,6 +3405,7 @@ typedef struct scsi_qla_host {
 	unsigned long	pci_flags;
 #define PFLG_DISCONNECTED	0	/* PCI device removed */
 #define PFLG_DRIVER_REMOVING	1	/* PCI driver .remove */
+#define PFLG_DRIVER_PROBING	2	/* PCI driver .probe */
 
 	uint32_t	device_flags;
 #define SWITCH_FOUND		BIT_0
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index ee5eef4..4ea8252 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -118,7 +118,8 @@ qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
 	/* Check for PCI disconnection */
 	if (reg == 0xffffffff) {
 		if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags) &&
-		    !test_bit(PFLG_DRIVER_REMOVING, &vha->pci_flags)) {
+		    !test_bit(PFLG_DRIVER_REMOVING, &vha->pci_flags) &&
+		    !test_bit(PFLG_DRIVER_PROBING, &vha->pci_flags)) {
 			/*
 			 * Schedule this (only once) on the default system
 			 * workqueue so that all the adapter workqueues and the
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 51cba37..0b39425 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -2629,6 +2629,7 @@ qla2x00_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
 	}
 
 	pci_set_drvdata(pdev, base_vha);
+	set_bit(PFLG_DRIVER_PROBING, &base_vha->pci_flags);
 
 	host = base_vha->host;
 	base_vha->req = req;
@@ -2920,6 +2921,7 @@ skip_dpc:
 
 	qlt_add_target(ha, base_vha);
 
+	clear_bit(PFLG_DRIVER_PROBING, &base_vha->pci_flags);
 	return 0;
 
 probe_init_failed:
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/6] qla2xxx device removal fixups
  2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
                   ` (4 preceding siblings ...)
  2014-06-18 14:04 ` [PATCH 5/6] qla2xxx: Prevent removal and board_disable race Joe Lawrence
@ 2014-06-18 15:35 ` Giridhar Malavali
  2014-07-28 18:26 ` Chad Dupuis
  2014-08-26 13:20 ` Joe Lawrence
  7 siblings, 0 replies; 14+ messages in thread
From: Giridhar Malavali @ 2014-06-18 15:35 UTC (permalink / raw)
  To: Joe Lawrence, linux-scsi
  Cc: Chad Dupuis, Saurav Kashyap, Don Zickus, Prarit Bhargava,
	Bill Kuzeja (Partner - Stratus),
	Dave Bulkow

[-- Attachment #1: Type: text/plain, Size: 7189 bytes --]

Hi Joe,

Thanks for the patches. We will review and update.

-- Giri

On 6/18/14 7:02 AM, "Joe Lawrence" <joe.lawrence@stratus.com> wrote:

>Hi Chad, Giri, et al.
>
>Stratus has been testing the upstream qla2xxx driver against surprise
>device removal and has found a few minor issues along the way.  With
>this patchset, results have been good.  Although the following changes
>may be most interesting on a Stratus platform, they should be applicable
>to general hotplug scenarios on other hardware.
>
>The first two patches are independent from the others and can be
>reviewed as such.  One fixes up a use-after-free and the other
>consolidates some code.
>
>The final four patches are more invasive and modify the scsi_qla_host
>structure to avoid device removal race conditions.  These changes were
>written to demonstrate the problem at hand and minimally fix them in
>order to continue testing.  (For example, adding completely a new
>pci_flags member to scsi_qla_host is probably overkill.)
>
>We would welcome any feedback or questions about our testing.
>
>Regards,
>
>-- Joe
>
>
>Joe Lawrence (6):
>
>  qla2xxx: Fix shost use-after-free on device removal
>
>    scsi_qla_host *base_vha is allocated as part of the Scsi_Host
>    private hostdata[] by qla2x00_create_host.  When the last reference
>    on the host is put by qla2x00_remove_one, it's released along with
>    any hostdata[], rendering base_vha unsafe to dereference.
>      
>    To safely complete the adapter cleanup in qla2x00_remove_one,
>    use the scsi_qla_host_t *qla_hw_data pointer that is already in
>    hand.
>      
>    To reproduce, set kernel boot option slub_debug=FZPU and remove
>    an adapter instance.
>
>
>  qla2xxx: Use qla2x00_clear_drv_active on probe failure
>
>    Clean up some duplicate code along the way.
>
>
>  qla2xxx: Collect PCI register checks and board_disable scheduling
>
>    PCI register read checks are performed throughout the driver to
>    detect disconnected hardware.  There is currently a function that
>    verifies 32-bit values and schedules board_removal if needed. Other
>    16-bit registers are checked and board_removal scheduling scattered
>    around the driver.
>
>    This change pulls all of these together such that a new 16-bit
>    register check function wrappers the existing 32-bit version and
>    centralizes the scheduling of board_disable work to a single
>    invocation.
>
>    The following patches depend upon this change.
>
>
>  qla2xxx: Schedule board_disable only once
>
>    In automated hot-plug device removal testing, occasionally
>    qla2x00_disable_board_on_pci_error would run twice.  I believe this
>    occurred by a second qla2x00_check_reg_for_disconnect scheduling
>    board_disable while the first instance was still running.
>
>    The easiest solution seemed to be adding a per-adapter flag that
>    could be test_and_set before scheduling board_disable.
>
>
>  qla2xxx: Prevent removal and board_disable race
>
>    This race was discovered through many iterations of automated PCI
>    hotplug.  The automated test inserts faults on the Stratus platform
>    to simulate device removal.  After removal, it re-adds the devices,
>    simulating PCI hotplug insertion.  Rinse, repeat.
>
>    The race occurs when a Stratus proprietary hotplug driver detects
>    that PCI devices have gone missing and calls into the kernel PCI
>    subsystem to remove them.  At nearly the same time, the qla2xxx
>    driver figures out the same issue and schedules a board_disable.  It
>    takes a few hundred device removals to hit, but it seemed that the
>    problem was that qla2x00_disable_board_on_pci_error started, then
>    qla2xxx_remove_one was invoked, so one of the two started touching
>    already freed resources.
>
>    Although the bug manifested on a Stratus platform running modified
>    drivers, the window for qla2xxx_remove_one to race
>    qla2x00_disable_board_on_pci_error is still present if running stock
>    kernel/drivers on commodity hardware.  It may be difficult to hit,
>    but one could theoretically invoke PCI device removal via sysfs (or
>    simply unload the qla2xxx driver) and for board_disable to run at
>    the same time.
>
>    The fix leaves the Scsi_Host intact during board_disable so that
>    qla2x00_remove_one can safely stop the main timer and
>    cancel_work_sync on any outstanding board_disable work.  A
>    PFLG_DRIVER_REMOVING bit is also set to prevent any other
>    board_disable instances from scheduling.  Once the asynchronous
>    players are out of the way, qla2x00_remove_one can move forward
>    cleaning up whatever remaining resources are still attached.
>
>    The PCI device enable count check in qla2x00_remove_one was
>    re-purposed to determine if board_disable had run.  Its original
>    intention to detect qla2x00_probe_one failure was invalid.  As long
>    as the driver .probe routine returns an -ERRNO, then its counterpart
>    .remove routine is *not* called.
>
>
>  qla2xxx: Prevent probe and board_disable race
>
>    An automated Stratus device removal test hotplug removes devices
>    shortly during/after their initialization.  The test inserts
>    devices, then proceeds to remove them at various 1, 2, 3, 4...
>    second intervals.  This test revealed two bugs:
>
>    1 - The struct isp_operations initialize_adapter callback is invoked
>    by qla2x00_probe_one before ha->board_disable is initialized. The
>    callback will eventually read PCI registers and potentially trigger
>    the disconnection check.  If the board_disable work structure hasn't
>    been initialized, an ugly WARN trace will be emitted. The call stack
>    looked something like:
>
>      qla2x00_probe_one
>        qla2x00_initialize_adapter
>          qla2xxx_get_flash_info
>            qla2xxx_get_flt_info
>              qla25xx_read_optrom_data
>                 qla2x00_dump_ram
>                   qla2x00_mailbox_command
>                     qla24xx_intr_handler
>                       qla2x00_check_reg_for_disconnect
>
>    2 - Races between the qla2x00_disable_board_on_pci_error and
>    qla2x00_probe_one. The latter was written before any asynchronous
>    device removals were introduced by commit f3ddac1 "qla2xxx: Disable
>    adapter when we encounter a PCI disconnect," and proceeds as if it
>    were the only game in town with respect to the underlying device.
>    Restore that assumption by preventing any board_disable scheduling
>    while qla2x00_probe_one is in flight.  Holding off this scheduling
>    prevents the race, as well as avoids bug #1.
>
> drivers/scsi/qla2xxx/qla_def.h |    5 ++++
> drivers/scsi/qla2xxx/qla_gbl.h |    3 +-
> drivers/scsi/qla2xxx/qla_isr.c |   44 +++++++++++++++--------------
> drivers/scsi/qla2xxx/qla_mr.c  |    2 +-
> drivers/scsi/qla2xxx/qla_nx.c  |    6 ++--
> drivers/scsi/qla2xxx/qla_os.c  |   60
>++++++++++++++++++----------------------
> 6 files changed, 61 insertions(+), 59 deletions(-)
>
>-- 
>1.7.10.4
>


[-- Attachment #2: winmail.dat --]
[-- Type: application/ms-tnef, Size: 7342 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 5/6] qla2xxx: Prevent removal and board_disable race
  2014-06-18 14:04 ` [PATCH 5/6] qla2xxx: Prevent removal and board_disable race Joe Lawrence
  2014-06-18 14:04   ` [PATCH 6/6] qla2xxx: Prevent probe " Joe Lawrence
@ 2014-07-25 15:23   ` Chad Dupuis
  2014-07-25 19:00     ` Joe Lawrence
  1 sibling, 1 reply; 14+ messages in thread
From: Chad Dupuis @ 2014-07-25 15:23 UTC (permalink / raw)
  To: Joe Lawrence
  Cc: linux-scsi, Giridhar Malavali, Saurav Kashyap, Don Zickus,
	Prarit Bhargava, Bill Kuzeja, Dave Bulkow



On Wed, 18 Jun 2014, Joe Lawrence wrote:

> Introduce mutual exclusion between the qla2xxx_remove_one PCI driver
> callback and qla2x00_disable_board_on_pci_error, which is scheduled as
> board_disable work by qla2x00_check_reg{32,16}_for_disconnect:
>
> * Leave the driver-specific data attached to the underlying PCI device
> intact in qla2x00_disable_board_on_pci_error, so that qla2x00_remove_one
> has enough breadcrumbs to determine that any board_disable work has been
> completed.
>
> * In qla2xxx_remove_one, set a bit to prevent any subsequent
> board_disable work from scheduling, then cancel and wait until pending
> work has completed.
>
> * Reuse the PCI device enable count check in qla2x00_remove_one to
> determine if board_disable has occured.  The original purpose of this
> check was unnecessary since the driver remove function wasn't called
> when the probe fails.
>
> Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
> ---
> drivers/scsi/qla2xxx/qla_def.h |    1 +
> drivers/scsi/qla2xxx/qla_isr.c |    3 ++-
> drivers/scsi/qla2xxx/qla_os.c  |   31 +++++++++++++++++++------------
> 3 files changed, 22 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h
> index 1267b11..7c441c9 100644
> --- a/drivers/scsi/qla2xxx/qla_def.h
> +++ b/drivers/scsi/qla2xxx/qla_def.h
> @@ -3404,6 +3404,7 @@ typedef struct scsi_qla_host {
>
> 	unsigned long	pci_flags;
> #define PFLG_DISCONNECTED	0	/* PCI device removed */
> +#define PFLG_DRIVER_REMOVING	1	/* PCI driver .remove */
>
> 	uint32_t	device_flags;
> #define SWITCH_FOUND		BIT_0
> diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
> index 2741ad8..ee5eef4 100644
> --- a/drivers/scsi/qla2xxx/qla_isr.c
> +++ b/drivers/scsi/qla2xxx/qla_isr.c
> @@ -117,7 +117,8 @@ qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
> {
> 	/* Check for PCI disconnection */
> 	if (reg == 0xffffffff) {
> -		if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags)) {
> +		if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags) &&
> +		    !test_bit(PFLG_DRIVER_REMOVING, &vha->pci_flags)) {

In our remove function we set a bit that we are unloading:

set_bit (UNLOADING, &base_vha->dpc_flags);

could this be used instead?

> 			/*
> 			 * Schedule this (only once) on the default system
> 			 * workqueue so that all the adapter workqueues and the
> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> index 39c9953..51cba37 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -3123,15 +3123,25 @@ qla2x00_remove_one(struct pci_dev *pdev)
> 	scsi_qla_host_t *base_vha;
> 	struct qla_hw_data  *ha;
>
> +	base_vha = pci_get_drvdata(pdev);
> +	ha = base_vha->hw;
> +
> +	/* Indicate device removal to prevent future board_disable and wait
> +	 * until any pending board_disable has completed. */
> +	set_bit(PFLG_DRIVER_REMOVING, &base_vha->pci_flags);
> +	cancel_work_sync(&ha->board_disable);
> +
> 	/*
> -	 * If the PCI device is disabled that means that probe failed and any
> -	 * resources should be have cleaned up on probe exit.
> +	 * If the PCI device is disabled then there was a PCI-disconnect and
> +	 * qla2x00_disable_board_on_pci_error has taken care of most of the
> +	 * resources.
> 	 */
> -	if (!atomic_read(&pdev->enable_cnt))
> +	if (!atomic_read(&pdev->enable_cnt)) {

Should we also add a check here that this is a disconnection before 
freeing these structs?  The original intent of the check for 
pdev->enable_cnt is to make sure we don't try to dereference an already 
freed struct if probe failed.

> +		scsi_host_put(base_vha->host);
> +		kfree(ha);
> +		pci_set_drvdata(pdev, NULL);
> 		return;
> -
> -	base_vha = pci_get_drvdata(pdev);
> -	ha = base_vha->hw;
> +	}
>
> 	qla2x00_wait_for_hba_ready(base_vha);
>
> @@ -4791,18 +4801,15 @@ qla2x00_disable_board_on_pci_error(struct work_struct *work)
> 	qla82xx_md_free(base_vha);
> 	qla2x00_free_queues(ha);
>
> -	scsi_host_put(base_vha->host);
> -
> 	qla2x00_unmap_iobases(ha);
>
> 	pci_release_selected_regions(ha->pdev, ha->bars);
> -	kfree(ha);
> -	ha = NULL;
> -
> 	pci_disable_pcie_error_reporting(pdev);
> 	pci_disable_device(pdev);
> -	pci_set_drvdata(pdev, NULL);
>
> +	/*
> +	 * Let qla2x00_remove_one cleanup qla_hw_data on device removal.
> +	 */
> }
>
> /**************************************************************************
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 5/6] qla2xxx: Prevent removal and board_disable race
  2014-07-25 15:23   ` [PATCH 5/6] qla2xxx: Prevent removal " Chad Dupuis
@ 2014-07-25 19:00     ` Joe Lawrence
  2014-07-25 19:45       ` Chad Dupuis
  0 siblings, 1 reply; 14+ messages in thread
From: Joe Lawrence @ 2014-07-25 19:00 UTC (permalink / raw)
  To: Chad Dupuis
  Cc: linux-scsi, Giridhar Malavali, Saurav Kashyap, Don Zickus,
	Prarit Bhargava, Bill Kuzeja, Dave Bulkow

On Fri, 25 Jul 2014 11:23:02 -0400
Chad Dupuis <chad.dupuis@qlogic.com> wrote:

> 
> 
> On Wed, 18 Jun 2014, Joe Lawrence wrote:
> 
> > Introduce mutual exclusion between the qla2xxx_remove_one PCI driver
> > callback and qla2x00_disable_board_on_pci_error, which is scheduled as
> > board_disable work by qla2x00_check_reg{32,16}_for_disconnect:
> >
> > * Leave the driver-specific data attached to the underlying PCI device
> > intact in qla2x00_disable_board_on_pci_error, so that qla2x00_remove_one
> > has enough breadcrumbs to determine that any board_disable work has been
> > completed.
> >
> > * In qla2xxx_remove_one, set a bit to prevent any subsequent
> > board_disable work from scheduling, then cancel and wait until pending
> > work has completed.
> >
> > * Reuse the PCI device enable count check in qla2x00_remove_one to
> > determine if board_disable has occured.  The original purpose of this
> > check was unnecessary since the driver remove function wasn't called
> > when the probe fails.
> >
> > Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
> > ---
> > drivers/scsi/qla2xxx/qla_def.h |    1 +
> > drivers/scsi/qla2xxx/qla_isr.c |    3 ++-
> > drivers/scsi/qla2xxx/qla_os.c  |   31 +++++++++++++++++++------------
> > 3 files changed, 22 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h
> > index 1267b11..7c441c9 100644
> > --- a/drivers/scsi/qla2xxx/qla_def.h
> > +++ b/drivers/scsi/qla2xxx/qla_def.h
> > @@ -3404,6 +3404,7 @@ typedef struct scsi_qla_host {
> >
> > 	unsigned long	pci_flags;
> > #define PFLG_DISCONNECTED	0	/* PCI device removed */
> > +#define PFLG_DRIVER_REMOVING	1	/* PCI driver .remove */
> >
> > 	uint32_t	device_flags;
> > #define SWITCH_FOUND		BIT_0
> > diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
> > index 2741ad8..ee5eef4 100644
> > --- a/drivers/scsi/qla2xxx/qla_isr.c
> > +++ b/drivers/scsi/qla2xxx/qla_isr.c
> > @@ -117,7 +117,8 @@ qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
> > {
> > 	/* Check for PCI disconnection */
> > 	if (reg == 0xffffffff) {
> > -		if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags)) {
> > +		if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags) &&
> > +		    !test_bit(PFLG_DRIVER_REMOVING, &vha->pci_flags)) {
> 
> In our remove function we set a bit that we are unloading:
> 
> set_bit (UNLOADING, &base_vha->dpc_flags);
> 
> could this be used instead?

Hi Chad,

Thanks for the comments.

I think with a little bit of code shuffling this could be accomplished.
My goal with the patchset was to try and present the problem/fix as
plain as possible.  It was easiest to collect all the atomic bits I
needed inside a single variable.  Should I be tacking on such flags to
'dpc_flags' ?

> 
> > 			/*
> > 			 * Schedule this (only once) on the default system
> > 			 * workqueue so that all the adapter workqueues and the
> > diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> > index 39c9953..51cba37 100644
> > --- a/drivers/scsi/qla2xxx/qla_os.c
> > +++ b/drivers/scsi/qla2xxx/qla_os.c
> > @@ -3123,15 +3123,25 @@ qla2x00_remove_one(struct pci_dev *pdev)
> > 	scsi_qla_host_t *base_vha;
> > 	struct qla_hw_data  *ha;
> >
> > +	base_vha = pci_get_drvdata(pdev);
> > +	ha = base_vha->hw;
> > +
> > +	/* Indicate device removal to prevent future board_disable and wait
> > +	 * until any pending board_disable has completed. */
> > +	set_bit(PFLG_DRIVER_REMOVING, &base_vha->pci_flags);
> > +	cancel_work_sync(&ha->board_disable);
> > +
> > 	/*
> > -	 * If the PCI device is disabled that means that probe failed and any
> > -	 * resources should be have cleaned up on probe exit.
> > +	 * If the PCI device is disabled then there was a PCI-disconnect and
> > +	 * qla2x00_disable_board_on_pci_error has taken care of most of the
> > +	 * resources.
> > 	 */
> > -	if (!atomic_read(&pdev->enable_cnt))
> > +	if (!atomic_read(&pdev->enable_cnt)) {
> 
> Should we also add a check here that this is a disconnection before 
> freeing these structs?  The original intent of the check for 
> pdev->enable_cnt is to make sure we don't try to dereference an already 
> freed struct if probe failed.

I'm not exactly sure what you're asking here.  In my tests, when .probe
return -ERRNO, .remove was not called.  Is there another call path into
qla2x00_remove_one?

The reason I didn't completely cleanup qla_hw_data in
qla2x00_disable_board_on_pci_error and re-purposed the PCI enable count
check, was that I needed some way of determining that any
board_disable work was out of the way before proceeding with
qla2x00_remove_one.

The patch's set_bit / cancel_work_sync above (along with the test_bit
before board_disable schedule) should be ensuring that the
board_disable won't be running or rescheduling in the future.

If qla2x00_disable_board_on_pci_error got as far as actually disabling
the PCI device, then qla2x00_remove_one's check on its enable count
would be verifying that.  If the device is still enabled, then
qla2x00_remove_one knows that board_disable didn't clean up the device.

Would it be clearer if I used an explicit scsi_qla_host flag to
indicate that state?

> 
> > +		scsi_host_put(base_vha->host);
> > +		kfree(ha);
> > +		pci_set_drvdata(pdev, NULL);
> > 		return;
> > -
> > -	base_vha = pci_get_drvdata(pdev);
> > -	ha = base_vha->hw;
> > +	}
> >
> > 	qla2x00_wait_for_hba_ready(base_vha);
> >
> > @@ -4791,18 +4801,15 @@ qla2x00_disable_board_on_pci_error(struct work_struct *work)
> > 	qla82xx_md_free(base_vha);
> > 	qla2x00_free_queues(ha);
> >
> > -	scsi_host_put(base_vha->host);
> > -
> > 	qla2x00_unmap_iobases(ha);
> >
> > 	pci_release_selected_regions(ha->pdev, ha->bars);
> > -	kfree(ha);
> > -	ha = NULL;
> > -
> > 	pci_disable_pcie_error_reporting(pdev);
> > 	pci_disable_device(pdev);
> > -	pci_set_drvdata(pdev, NULL);
> >
> > +	/*
> > +	 * Let qla2x00_remove_one cleanup qla_hw_data on device removal.
> > +	 */
> > }
> >
> > /**************************************************************************
> >

Regards,

-- Joe

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 5/6] qla2xxx: Prevent removal and board_disable race
  2014-07-25 19:00     ` Joe Lawrence
@ 2014-07-25 19:45       ` Chad Dupuis
  0 siblings, 0 replies; 14+ messages in thread
From: Chad Dupuis @ 2014-07-25 19:45 UTC (permalink / raw)
  To: Joe Lawrence
  Cc: linux-scsi, Giridhar Malavali, Saurav Kashyap, Don Zickus,
	Prarit Bhargava, Bill Kuzeja, Dave Bulkow


On Fri, 25 Jul 2014, Joe Lawrence wrote:

> On Fri, 25 Jul 2014 11:23:02 -0400
> Chad Dupuis <chad.dupuis@qlogic.com> wrote:
>
>>
>>
>> On Wed, 18 Jun 2014, Joe Lawrence wrote:
>>
>>> Introduce mutual exclusion between the qla2xxx_remove_one PCI driver
>>> callback and qla2x00_disable_board_on_pci_error, which is scheduled as
>>> board_disable work by qla2x00_check_reg{32,16}_for_disconnect:
>>>
>>> * Leave the driver-specific data attached to the underlying PCI device
>>> intact in qla2x00_disable_board_on_pci_error, so that qla2x00_remove_one
>>> has enough breadcrumbs to determine that any board_disable work has been
>>> completed.
>>>
>>> * In qla2xxx_remove_one, set a bit to prevent any subsequent
>>> board_disable work from scheduling, then cancel and wait until pending
>>> work has completed.
>>>
>>> * Reuse the PCI device enable count check in qla2x00_remove_one to
>>> determine if board_disable has occured.  The original purpose of this
>>> check was unnecessary since the driver remove function wasn't called
>>> when the probe fails.
>>>
>>> Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
>>> ---
>>> drivers/scsi/qla2xxx/qla_def.h |    1 +
>>> drivers/scsi/qla2xxx/qla_isr.c |    3 ++-
>>> drivers/scsi/qla2xxx/qla_os.c  |   31 +++++++++++++++++++------------
>>> 3 files changed, 22 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h
>>> index 1267b11..7c441c9 100644
>>> --- a/drivers/scsi/qla2xxx/qla_def.h
>>> +++ b/drivers/scsi/qla2xxx/qla_def.h
>>> @@ -3404,6 +3404,7 @@ typedef struct scsi_qla_host {
>>>
>>> 	unsigned long	pci_flags;
>>> #define PFLG_DISCONNECTED	0	/* PCI device removed */
>>> +#define PFLG_DRIVER_REMOVING	1	/* PCI driver .remove */
>>>
>>> 	uint32_t	device_flags;
>>> #define SWITCH_FOUND		BIT_0
>>> diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
>>> index 2741ad8..ee5eef4 100644
>>> --- a/drivers/scsi/qla2xxx/qla_isr.c
>>> +++ b/drivers/scsi/qla2xxx/qla_isr.c
>>> @@ -117,7 +117,8 @@ qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
>>> {
>>> 	/* Check for PCI disconnection */
>>> 	if (reg == 0xffffffff) {
>>> -		if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags)) {
>>> +		if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags) &&
>>> +		    !test_bit(PFLG_DRIVER_REMOVING, &vha->pci_flags)) {
>>
>> In our remove function we set a bit that we are unloading:
>>
>> set_bit (UNLOADING, &base_vha->dpc_flags);
>>
>> could this be used instead?
>
> Hi Chad,
>
> Thanks for the comments.
>
> I think with a little bit of code shuffling this could be accomplished.
> My goal with the patchset was to try and present the problem/fix as
> plain as possible.  It was easiest to collect all the atomic bits I
> needed inside a single variable.  Should I be tacking on such flags to
> 'dpc_flags' ?

Now that I see your explanation, it makes sense how you did it.  My 
concern had been that we introduce more flags which could theoretcially 
introduce more points for race conditions but dpc_flags in pretty 
overloaded as it is.

>
>>
>>> 			/*
>>> 			 * Schedule this (only once) on the default system
>>> 			 * workqueue so that all the adapter workqueues and the
>>> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
>>> index 39c9953..51cba37 100644
>>> --- a/drivers/scsi/qla2xxx/qla_os.c
>>> +++ b/drivers/scsi/qla2xxx/qla_os.c
>>> @@ -3123,15 +3123,25 @@ qla2x00_remove_one(struct pci_dev *pdev)
>>> 	scsi_qla_host_t *base_vha;
>>> 	struct qla_hw_data  *ha;
>>>
>>> +	base_vha = pci_get_drvdata(pdev);
>>> +	ha = base_vha->hw;
>>> +
>>> +	/* Indicate device removal to prevent future board_disable and wait
>>> +	 * until any pending board_disable has completed. */
>>> +	set_bit(PFLG_DRIVER_REMOVING, &base_vha->pci_flags);
>>> +	cancel_work_sync(&ha->board_disable);
>>> +
>>> 	/*
>>> -	 * If the PCI device is disabled that means that probe failed and any
>>> -	 * resources should be have cleaned up on probe exit.
>>> +	 * If the PCI device is disabled then there was a PCI-disconnect and
>>> +	 * qla2x00_disable_board_on_pci_error has taken care of most of the
>>> +	 * resources.
>>> 	 */
>>> -	if (!atomic_read(&pdev->enable_cnt))
>>> +	if (!atomic_read(&pdev->enable_cnt)) {
>>
>> Should we also add a check here that this is a disconnection before
>> freeing these structs?  The original intent of the check for
>> pdev->enable_cnt is to make sure we don't try to dereference an already
>> freed struct if probe failed.
>
> I'm not exactly sure what you're asking here.  In my tests, when .probe
> return -ERRNO, .remove was not called.  Is there another call path into
> qla2x00_remove_one?
>
> The reason I didn't completely cleanup qla_hw_data in
> qla2x00_disable_board_on_pci_error and re-purposed the PCI enable count
> check, was that I needed some way of determining that any
> board_disable work was out of the way before proceeding with
> qla2x00_remove_one.
>
> The patch's set_bit / cancel_work_sync above (along with the test_bit
> before board_disable schedule) should be ensuring that the
> board_disable won't be running or rescheduling in the future.
>
> If qla2x00_disable_board_on_pci_error got as far as actually disabling
> the PCI device, then qla2x00_remove_one's check on its enable count
> would be verifying that.  If the device is still enabled, then
> qla2x00_remove_one knows that board_disable didn't clean up the device.
>
> Would it be clearer if I used an explicit scsi_qla_host flag to
> indicate that state?

Thanks for the explanation.

>
>>
>>> +		scsi_host_put(base_vha->host);
>>> +		kfree(ha);
>>> +		pci_set_drvdata(pdev, NULL);
>>> 		return;
>>> -
>>> -	base_vha = pci_get_drvdata(pdev);
>>> -	ha = base_vha->hw;
>>> +	}
>>>
>>> 	qla2x00_wait_for_hba_ready(base_vha);
>>>
>>> @@ -4791,18 +4801,15 @@ qla2x00_disable_board_on_pci_error(struct work_struct *work)
>>> 	qla82xx_md_free(base_vha);
>>> 	qla2x00_free_queues(ha);
>>>
>>> -	scsi_host_put(base_vha->host);
>>> -
>>> 	qla2x00_unmap_iobases(ha);
>>>
>>> 	pci_release_selected_regions(ha->pdev, ha->bars);
>>> -	kfree(ha);
>>> -	ha = NULL;
>>> -
>>> 	pci_disable_pcie_error_reporting(pdev);
>>> 	pci_disable_device(pdev);
>>> -	pci_set_drvdata(pdev, NULL);
>>>
>>> +	/*
>>> +	 * Let qla2x00_remove_one cleanup qla_hw_data on device removal.
>>> +	 */
>>> }
>>>
>>> /**************************************************************************
>>>
>
> Regards,
>
> -- Joe
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/6] qla2xxx device removal fixups
  2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
                   ` (5 preceding siblings ...)
  2014-06-18 15:35 ` [PATCH 0/6] qla2xxx device removal fixups Giridhar Malavali
@ 2014-07-28 18:26 ` Chad Dupuis
  2014-08-26 13:20 ` Joe Lawrence
  7 siblings, 0 replies; 14+ messages in thread
From: Chad Dupuis @ 2014-07-28 18:26 UTC (permalink / raw)
  To: Joe Lawrence
  Cc: linux-scsi, Giridhar Malavali, Saurav Kashyap, Don Zickus,
	Prarit Bhargava, Bill Kuzeja, Dave Bulkow

Ack the series.

Acked-by: Chad Dupuis <chad.dupuis@qlogic.com>

On Wed, 18 Jun 2014, Joe Lawrence wrote:

> Hi Chad, Giri, et al.
>
> Stratus has been testing the upstream qla2xxx driver against surprise
> device removal and has found a few minor issues along the way.  With
> this patchset, results have been good.  Although the following changes
> may be most interesting on a Stratus platform, they should be applicable
> to general hotplug scenarios on other hardware.
>
> The first two patches are independent from the others and can be
> reviewed as such.  One fixes up a use-after-free and the other
> consolidates some code.
>
> The final four patches are more invasive and modify the scsi_qla_host
> structure to avoid device removal race conditions.  These changes were
> written to demonstrate the problem at hand and minimally fix them in
> order to continue testing.  (For example, adding completely a new
> pci_flags member to scsi_qla_host is probably overkill.)
>
> We would welcome any feedback or questions about our testing.
>
> Regards,
>
> -- Joe
>
>
> Joe Lawrence (6):
>
>  qla2xxx: Fix shost use-after-free on device removal
>
>    scsi_qla_host *base_vha is allocated as part of the Scsi_Host
>    private hostdata[] by qla2x00_create_host.  When the last reference
>    on the host is put by qla2x00_remove_one, it's released along with
>    any hostdata[], rendering base_vha unsafe to dereference.
>
>    To safely complete the adapter cleanup in qla2x00_remove_one,
>    use the scsi_qla_host_t *qla_hw_data pointer that is already in
>    hand.
>
>    To reproduce, set kernel boot option slub_debug=FZPU and remove
>    an adapter instance.
>
>
>  qla2xxx: Use qla2x00_clear_drv_active on probe failure
>
>    Clean up some duplicate code along the way.
>
>
>  qla2xxx: Collect PCI register checks and board_disable scheduling
>
>    PCI register read checks are performed throughout the driver to
>    detect disconnected hardware.  There is currently a function that
>    verifies 32-bit values and schedules board_removal if needed. Other
>    16-bit registers are checked and board_removal scheduling scattered
>    around the driver.
>
>    This change pulls all of these together such that a new 16-bit
>    register check function wrappers the existing 32-bit version and
>    centralizes the scheduling of board_disable work to a single
>    invocation.
>
>    The following patches depend upon this change.
>
>
>  qla2xxx: Schedule board_disable only once
>
>    In automated hot-plug device removal testing, occasionally
>    qla2x00_disable_board_on_pci_error would run twice.  I believe this
>    occurred by a second qla2x00_check_reg_for_disconnect scheduling
>    board_disable while the first instance was still running.
>
>    The easiest solution seemed to be adding a per-adapter flag that
>    could be test_and_set before scheduling board_disable.
>
>
>  qla2xxx: Prevent removal and board_disable race
>
>    This race was discovered through many iterations of automated PCI
>    hotplug.  The automated test inserts faults on the Stratus platform
>    to simulate device removal.  After removal, it re-adds the devices,
>    simulating PCI hotplug insertion.  Rinse, repeat.
>
>    The race occurs when a Stratus proprietary hotplug driver detects
>    that PCI devices have gone missing and calls into the kernel PCI
>    subsystem to remove them.  At nearly the same time, the qla2xxx
>    driver figures out the same issue and schedules a board_disable.  It
>    takes a few hundred device removals to hit, but it seemed that the
>    problem was that qla2x00_disable_board_on_pci_error started, then
>    qla2xxx_remove_one was invoked, so one of the two started touching
>    already freed resources.
>
>    Although the bug manifested on a Stratus platform running modified
>    drivers, the window for qla2xxx_remove_one to race
>    qla2x00_disable_board_on_pci_error is still present if running stock
>    kernel/drivers on commodity hardware.  It may be difficult to hit,
>    but one could theoretically invoke PCI device removal via sysfs (or
>    simply unload the qla2xxx driver) and for board_disable to run at
>    the same time.
>
>    The fix leaves the Scsi_Host intact during board_disable so that
>    qla2x00_remove_one can safely stop the main timer and
>    cancel_work_sync on any outstanding board_disable work.  A
>    PFLG_DRIVER_REMOVING bit is also set to prevent any other
>    board_disable instances from scheduling.  Once the asynchronous
>    players are out of the way, qla2x00_remove_one can move forward
>    cleaning up whatever remaining resources are still attached.
>
>    The PCI device enable count check in qla2x00_remove_one was
>    re-purposed to determine if board_disable had run.  Its original
>    intention to detect qla2x00_probe_one failure was invalid.  As long
>    as the driver .probe routine returns an -ERRNO, then its counterpart
>    .remove routine is *not* called.
>
>
>  qla2xxx: Prevent probe and board_disable race
>
>    An automated Stratus device removal test hotplug removes devices
>    shortly during/after their initialization.  The test inserts
>    devices, then proceeds to remove them at various 1, 2, 3, 4...
>    second intervals.  This test revealed two bugs:
>
>    1 - The struct isp_operations initialize_adapter callback is invoked
>    by qla2x00_probe_one before ha->board_disable is initialized. The
>    callback will eventually read PCI registers and potentially trigger
>    the disconnection check.  If the board_disable work structure hasn't
>    been initialized, an ugly WARN trace will be emitted. The call stack
>    looked something like:
>
>      qla2x00_probe_one
>        qla2x00_initialize_adapter
>          qla2xxx_get_flash_info
>            qla2xxx_get_flt_info
>              qla25xx_read_optrom_data
>                 qla2x00_dump_ram
>                   qla2x00_mailbox_command
>                     qla24xx_intr_handler
>                       qla2x00_check_reg_for_disconnect
>
>    2 - Races between the qla2x00_disable_board_on_pci_error and
>    qla2x00_probe_one. The latter was written before any asynchronous
>    device removals were introduced by commit f3ddac1 "qla2xxx: Disable
>    adapter when we encounter a PCI disconnect," and proceeds as if it
>    were the only game in town with respect to the underlying device.
>    Restore that assumption by preventing any board_disable scheduling
>    while qla2x00_probe_one is in flight.  Holding off this scheduling
>    prevents the race, as well as avoids bug #1.
>
> drivers/scsi/qla2xxx/qla_def.h |    5 ++++
> drivers/scsi/qla2xxx/qla_gbl.h |    3 +-
> drivers/scsi/qla2xxx/qla_isr.c |   44 +++++++++++++++--------------
> drivers/scsi/qla2xxx/qla_mr.c  |    2 +-
> drivers/scsi/qla2xxx/qla_nx.c  |    6 ++--
> drivers/scsi/qla2xxx/qla_os.c  |   60 ++++++++++++++++++----------------------
> 6 files changed, 61 insertions(+), 59 deletions(-)
>
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/6] qla2xxx device removal fixups
  2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
                   ` (6 preceding siblings ...)
  2014-07-28 18:26 ` Chad Dupuis
@ 2014-08-26 13:20 ` Joe Lawrence
  2014-08-26 13:52   ` Christoph Hellwig
  7 siblings, 1 reply; 14+ messages in thread
From: Joe Lawrence @ 2014-08-26 13:20 UTC (permalink / raw)
  To: linux-scsi
  Cc: Joe Lawrence, Chad Dupuis, Giridhar Malavali, Saurav Kashyap,
	Don Zickus, Prarit Bhargava, Bill Kuzeja, Dave Bulkow,
	Christoph Hellwig

On Wed, 18 Jun 2014 10:02:13 -0400
Joe Lawrence <joe.lawrence@stratus.com> wrote:

> Hi Chad, Giri, et al.
> 
> Stratus has been testing the upstream qla2xxx driver against surprise
> device removal and has found a few minor issues along the way.  With
> this patchset, results have been good.  Although the following changes
> may be most interesting on a Stratus platform, they should be applicable
> to general hotplug scenarios on other hardware.
> 
> The first two patches are independent from the others and can be
> reviewed as such.  One fixes up a use-after-free and the other
> consolidates some code.

Patch 1 corrects a bug introduced by commit fe1b806, which has made its
way into three linux-stable releases:

% git tag --contains fe1b806 | grep '^v3.[0-9]*$'
v3.14
v3.15
v3.16

> The final four patches are more invasive and modify the scsi_qla_host
> structure to avoid device removal race conditions.  These changes were
> written to demonstrate the problem at hand and minimally fix them in
> order to continue testing.  (For example, adding completely a new
> pci_flags member to scsi_qla_host is probably overkill.) 
> 
> We would welcome any feedback or questions about our testing.

Chad ack'd the series on 28 Jul 2014.  Any chance this can make into
hch's queue for 3.18?

Regards,

-- Joe

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/6] qla2xxx device removal fixups
  2014-08-26 13:20 ` Joe Lawrence
@ 2014-08-26 13:52   ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2014-08-26 13:52 UTC (permalink / raw)
  To: Joe Lawrence
  Cc: linux-scsi, Chad Dupuis, Giridhar Malavali, Saurav Kashyap,
	Don Zickus, Prarit Bhargava, Bill Kuzeja, Dave Bulkow,
	Christoph Hellwig

On Tue, Aug 26, 2014 at 09:20:55AM -0400, Joe Lawrence wrote:
> > The final four patches are more invasive and modify the scsi_qla_host
> > structure to avoid device removal race conditions.  These changes were
> > written to demonstrate the problem at hand and minimally fix them in
> > order to continue testing.  (For example, adding completely a new
> > pci_flags member to scsi_qla_host is probably overkill.) 
> > 
> > We would welcome any feedback or questions about our testing.
> 
> Chad ack'd the series on 28 Jul 2014.  Any chance this can make into
> hch's queue for 3.18?

Can you please resend the patches with the collected ACKs against the
drivers-for-3.18 branch?  Thanks!


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2014-08-26 13:52 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
2014-06-18 14:02 ` [PATCH 1/6] qla2xxx: Fix shost use-after-free on device removal Joe Lawrence
2014-06-18 14:02 ` [PATCH 2/6] qla2xxx: Use qla2x00_clear_drv_active on probe failure Joe Lawrence
2014-06-18 14:02 ` [PATCH 3/6] qla2xxx: Collect PCI register checks and board_disable scheduling Joe Lawrence
2014-06-18 14:02 ` [PATCH 4/6] qla2xxx: Schedule board_disable only once Joe Lawrence
2014-06-18 14:04 ` [PATCH 5/6] qla2xxx: Prevent removal and board_disable race Joe Lawrence
2014-06-18 14:04   ` [PATCH 6/6] qla2xxx: Prevent probe " Joe Lawrence
2014-07-25 15:23   ` [PATCH 5/6] qla2xxx: Prevent removal " Chad Dupuis
2014-07-25 19:00     ` Joe Lawrence
2014-07-25 19:45       ` Chad Dupuis
2014-06-18 15:35 ` [PATCH 0/6] qla2xxx device removal fixups Giridhar Malavali
2014-07-28 18:26 ` Chad Dupuis
2014-08-26 13:20 ` Joe Lawrence
2014-08-26 13:52   ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.