* [PATCH 1/6] qla2xxx: Fix shost use-after-free on device removal
2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
@ 2014-06-18 14:02 ` Joe Lawrence
2014-06-18 14:02 ` [PATCH 2/6] qla2xxx: Use qla2x00_clear_drv_active on probe failure Joe Lawrence
` (6 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Joe Lawrence @ 2014-06-18 14:02 UTC (permalink / raw)
To: linux-scsi
Cc: Chad Dupuis, Giridhar Malavali, Saurav Kashyap, Don Zickus,
Prarit Bhargava, Bill Kuzeja, Dave Bulkow, Joe Lawrence
Once calling scsi_host_put, be careful to not access qla_hw_data through
the Scsi_Host private data (ie, scsi_qla_host base_vha).
Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
---
drivers/scsi/qla2xxx/qla_os.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 5a430c7..6a512b7 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -3111,10 +3111,8 @@ qla2x00_unmap_iobases(struct qla_hw_data *ha)
}
static void
-qla2x00_clear_drv_active(scsi_qla_host_t *vha)
+qla2x00_clear_drv_active(struct qla_hw_data *ha)
{
- struct qla_hw_data *ha = vha->hw;
-
if (IS_QLA8044(ha)) {
qla8044_idc_lock(ha);
qla8044_clear_drv_active(ha);
@@ -3185,7 +3183,7 @@ qla2x00_remove_one(struct pci_dev *pdev)
scsi_host_put(base_vha->host);
- qla2x00_clear_drv_active(base_vha);
+ qla2x00_clear_drv_active(ha);
qla2x00_unmap_iobases(ha);
--
1.7.10.4
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/6] qla2xxx: Use qla2x00_clear_drv_active on probe failure
2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
2014-06-18 14:02 ` [PATCH 1/6] qla2xxx: Fix shost use-after-free on device removal Joe Lawrence
@ 2014-06-18 14:02 ` Joe Lawrence
2014-06-18 14:02 ` [PATCH 3/6] qla2xxx: Collect PCI register checks and board_disable scheduling Joe Lawrence
` (5 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Joe Lawrence @ 2014-06-18 14:02 UTC (permalink / raw)
To: linux-scsi
Cc: Chad Dupuis, Giridhar Malavali, Saurav Kashyap, Don Zickus,
Prarit Bhargava, Bill Kuzeja, Dave Bulkow, Joe Lawrence
Take advantage of commit fe1b806f "qla2xxx: Refactor shutdown code so
some functionality can be reused" to remove an inlined copy of
qla2x00_clear_drv_active in the driver's probe hardware error path.
Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
---
drivers/scsi/qla2xxx/qla_os.c | 13 +++----------
1 file changed, 3 insertions(+), 10 deletions(-)
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 6a512b7..31913bb 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -238,6 +238,7 @@ static int qla2xxx_eh_host_reset(struct scsi_cmnd *);
static int qla2x00_change_queue_depth(struct scsi_device *, int, int);
static int qla2x00_change_queue_type(struct scsi_device *, int);
+static void qla2x00_clear_drv_active(struct qla_hw_data *);
static void qla2x00_free_device(scsi_qla_host_t *);
struct scsi_host_template qla2xxx_driver_template = {
@@ -2946,16 +2947,8 @@ probe_failed:
scsi_host_put(base_vha->host);
probe_hw_failed:
- if (IS_QLA82XX(ha)) {
- qla82xx_idc_lock(ha);
- qla82xx_clear_drv_active(ha);
- qla82xx_idc_unlock(ha);
- }
- if (IS_QLA8044(ha)) {
- qla8044_idc_lock(ha);
- qla8044_clear_drv_active(ha);
- qla8044_idc_unlock(ha);
- }
+ qla2x00_clear_drv_active(ha);
+
iospace_config_failed:
if (IS_P3P_TYPE(ha)) {
if (!ha->nx_pcibase)
--
1.7.10.4
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 3/6] qla2xxx: Collect PCI register checks and board_disable scheduling
2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
2014-06-18 14:02 ` [PATCH 1/6] qla2xxx: Fix shost use-after-free on device removal Joe Lawrence
2014-06-18 14:02 ` [PATCH 2/6] qla2xxx: Use qla2x00_clear_drv_active on probe failure Joe Lawrence
@ 2014-06-18 14:02 ` Joe Lawrence
2014-06-18 14:02 ` [PATCH 4/6] qla2xxx: Schedule board_disable only once Joe Lawrence
` (4 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Joe Lawrence @ 2014-06-18 14:02 UTC (permalink / raw)
To: linux-scsi
Cc: Chad Dupuis, Giridhar Malavali, Saurav Kashyap, Don Zickus,
Prarit Bhargava, Bill Kuzeja, Dave Bulkow, Joe Lawrence
Add an uint16_t variant of qla2x00_check_reg_for_disconnect and use
these routines to check and schedule a PCI-disconnected board from a
centralized place.
Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
---
drivers/scsi/qla2xxx/qla_gbl.h | 3 ++-
drivers/scsi/qla2xxx/qla_isr.c | 28 +++++++++++++---------------
drivers/scsi/qla2xxx/qla_mr.c | 2 +-
drivers/scsi/qla2xxx/qla_nx.c | 6 +++---
drivers/scsi/qla2xxx/qla_os.c | 8 +-------
5 files changed, 20 insertions(+), 27 deletions(-)
diff --git a/drivers/scsi/qla2xxx/qla_gbl.h b/drivers/scsi/qla2xxx/qla_gbl.h
index d48dea8..a60c1b8 100644
--- a/drivers/scsi/qla2xxx/qla_gbl.h
+++ b/drivers/scsi/qla2xxx/qla_gbl.h
@@ -475,7 +475,8 @@ extern uint8_t *qla25xx_read_nvram_data(scsi_qla_host_t *, uint8_t *, uint32_t,
extern int qla25xx_write_nvram_data(scsi_qla_host_t *, uint8_t *, uint32_t,
uint32_t);
extern int qla2x00_is_a_vp_did(scsi_qla_host_t *, uint32_t);
-bool qla2x00_check_reg_for_disconnect(scsi_qla_host_t *, uint32_t);
+bool qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *, uint32_t);
+bool qla2x00_check_reg16_for_disconnect(scsi_qla_host_t *, uint16_t);
extern int qla2x00_beacon_on(struct scsi_qla_host *);
extern int qla2x00_beacon_off(struct scsi_qla_host *);
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index a56825c..e1b16ad 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -56,16 +56,8 @@ qla2100_intr_handler(int irq, void *dev_id)
vha = pci_get_drvdata(ha->pdev);
for (iter = 50; iter--; ) {
hccr = RD_REG_WORD(®->hccr);
- /* Check for PCI disconnection */
- if (hccr == 0xffff) {
- /*
- * Schedule this on the default system workqueue so that
- * all the adapter workqueues and the DPC thread can be
- * shutdown cleanly.
- */
- schedule_work(&ha->board_disable);
+ if (qla2x00_check_reg16_for_disconnect(vha, hccr))
break;
- }
if (hccr & HCCR_RISC_PAUSE) {
if (pci_channel_offline(ha->pdev))
break;
@@ -121,7 +113,7 @@ qla2100_intr_handler(int irq, void *dev_id)
}
bool
-qla2x00_check_reg_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
+qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
{
/* Check for PCI disconnection */
if (reg == 0xffffffff) {
@@ -136,6 +128,12 @@ qla2x00_check_reg_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
return false;
}
+bool
+qla2x00_check_reg16_for_disconnect(scsi_qla_host_t *vha, uint16_t reg)
+{
+ return qla2x00_check_reg32_for_disconnect(vha, 0xffff0000 | reg);
+}
+
/**
* qla2300_intr_handler() - Process interrupts for the ISP23xx and ISP63xx.
* @irq:
@@ -174,7 +172,7 @@ qla2300_intr_handler(int irq, void *dev_id)
vha = pci_get_drvdata(ha->pdev);
for (iter = 50; iter--; ) {
stat = RD_REG_DWORD(®->u.isp2300.host_status);
- if (qla2x00_check_reg_for_disconnect(vha, stat))
+ if (qla2x00_check_reg32_for_disconnect(vha, stat))
break;
if (stat & HSR_RISC_PAUSED) {
if (unlikely(pci_channel_offline(ha->pdev)))
@@ -2633,7 +2631,7 @@ qla24xx_intr_handler(int irq, void *dev_id)
vha = pci_get_drvdata(ha->pdev);
for (iter = 50; iter--; ) {
stat = RD_REG_DWORD(®->host_status);
- if (qla2x00_check_reg_for_disconnect(vha, stat))
+ if (qla2x00_check_reg32_for_disconnect(vha, stat))
break;
if (stat & HSRX_RISC_PAUSED) {
if (unlikely(pci_channel_offline(ha->pdev)))
@@ -2723,7 +2721,7 @@ qla24xx_msix_rsp_q(int irq, void *dev_id)
* we process the response queue.
*/
stat = RD_REG_DWORD(®->host_status);
- if (qla2x00_check_reg_for_disconnect(vha, stat))
+ if (qla2x00_check_reg32_for_disconnect(vha, stat))
goto out;
qla24xx_process_response_queue(vha, rsp);
if (!ha->flags.disable_msix_handshake) {
@@ -2763,7 +2761,7 @@ qla25xx_msix_rsp_q(int irq, void *dev_id)
hccr = RD_REG_DWORD_RELAXED(®->hccr);
spin_unlock_irqrestore(&ha->hardware_lock, flags);
}
- if (qla2x00_check_reg_for_disconnect(vha, hccr))
+ if (qla2x00_check_reg32_for_disconnect(vha, hccr))
goto out;
queue_work_on((int) (rsp->id - 1), ha->wq, &rsp->q_work);
@@ -2798,7 +2796,7 @@ qla24xx_msix_default(int irq, void *dev_id)
vha = pci_get_drvdata(ha->pdev);
do {
stat = RD_REG_DWORD(®->host_status);
- if (qla2x00_check_reg_for_disconnect(vha, stat))
+ if (qla2x00_check_reg32_for_disconnect(vha, stat))
break;
if (stat & HSRX_RISC_PAUSED) {
if (unlikely(pci_channel_offline(ha->pdev)))
diff --git a/drivers/scsi/qla2xxx/qla_mr.c b/drivers/scsi/qla2xxx/qla_mr.c
index abeb390..1624ff3 100644
--- a/drivers/scsi/qla2xxx/qla_mr.c
+++ b/drivers/scsi/qla2xxx/qla_mr.c
@@ -2924,7 +2924,7 @@ qlafx00_intr_handler(int irq, void *dev_id)
vha = pci_get_drvdata(ha->pdev);
for (iter = 50; iter--; clr_intr = 0) {
stat = QLAFX00_RD_INTR_REG(ha);
- if (qla2x00_check_reg_for_disconnect(vha, stat))
+ if (qla2x00_check_reg32_for_disconnect(vha, stat))
break;
intr_stat = stat & QLAFX00_HST_INT_STS_BITS;
if (!intr_stat)
diff --git a/drivers/scsi/qla2xxx/qla_nx.c b/drivers/scsi/qla2xxx/qla_nx.c
index 58f3c91..2562600 100644
--- a/drivers/scsi/qla2xxx/qla_nx.c
+++ b/drivers/scsi/qla2xxx/qla_nx.c
@@ -2123,7 +2123,7 @@ qla82xx_msix_default(int irq, void *dev_id)
vha = pci_get_drvdata(ha->pdev);
do {
host_int = RD_REG_DWORD(®->host_int);
- if (qla2x00_check_reg_for_disconnect(vha, host_int))
+ if (qla2x00_check_reg32_for_disconnect(vha, host_int))
break;
if (host_int) {
stat = RD_REG_DWORD(®->host_status);
@@ -2184,7 +2184,7 @@ qla82xx_msix_rsp_q(int irq, void *dev_id)
spin_lock_irqsave(&ha->hardware_lock, flags);
vha = pci_get_drvdata(ha->pdev);
host_int = RD_REG_DWORD(®->host_int);
- if (qla2x00_check_reg_for_disconnect(vha, host_int))
+ if (qla2x00_check_reg32_for_disconnect(vha, host_int))
goto out;
qla24xx_process_response_queue(vha, rsp);
WRT_REG_DWORD(®->host_int, 0);
@@ -2219,7 +2219,7 @@ qla82xx_poll(int irq, void *dev_id)
vha = pci_get_drvdata(ha->pdev);
host_int = RD_REG_DWORD(®->host_int);
- if (qla2x00_check_reg_for_disconnect(vha, host_int))
+ if (qla2x00_check_reg32_for_disconnect(vha, host_int))
goto out;
if (host_int) {
stat = RD_REG_DWORD(®->host_status);
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 31913bb..39c9953 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -5174,13 +5174,7 @@ qla2x00_timer(scsi_qla_host_t *vha)
*/
if (!pci_channel_offline(ha->pdev)) {
pci_read_config_word(ha->pdev, PCI_VENDOR_ID, &w);
- if (w == 0xffff)
- /*
- * Schedule this on the default system workqueue so that
- * all the adapter workqueues and the DPC thread can be
- * shutdown cleanly.
- */
- schedule_work(&ha->board_disable);
+ qla2x00_check_reg16_for_disconnect(vha, w);
}
/* Make sure qla82xx_watchdog is run only for physical port */
--
1.7.10.4
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 4/6] qla2xxx: Schedule board_disable only once
2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
` (2 preceding siblings ...)
2014-06-18 14:02 ` [PATCH 3/6] qla2xxx: Collect PCI register checks and board_disable scheduling Joe Lawrence
@ 2014-06-18 14:02 ` Joe Lawrence
2014-06-18 14:04 ` [PATCH 5/6] qla2xxx: Prevent removal and board_disable race Joe Lawrence
` (3 subsequent siblings)
7 siblings, 0 replies; 14+ messages in thread
From: Joe Lawrence @ 2014-06-18 14:02 UTC (permalink / raw)
To: linux-scsi
Cc: Chad Dupuis, Giridhar Malavali, Saurav Kashyap, Don Zickus,
Prarit Bhargava, Bill Kuzeja, Dave Bulkow, Joe Lawrence
There are various callers of qla2x00_check_reg{32,16}_for_disconnect
that may schedule board removal on PCI-disconnect. Test-and-set a
dedicated flag before scheduling board_disable so it is invoked only
once.
Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
---
drivers/scsi/qla2xxx/qla_def.h | 3 +++
drivers/scsi/qla2xxx/qla_isr.c | 14 ++++++++------
2 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h
index 1fa0104..1267b11 100644
--- a/drivers/scsi/qla2xxx/qla_def.h
+++ b/drivers/scsi/qla2xxx/qla_def.h
@@ -3402,6 +3402,9 @@ typedef struct scsi_qla_host {
#define FX00_CRITEMP_RECOVERY 25
#define FX00_HOST_INFO_RESEND 26
+ unsigned long pci_flags;
+#define PFLG_DISCONNECTED 0 /* PCI device removed */
+
uint32_t device_flags;
#define SWITCH_FOUND BIT_0
#define DFLG_NO_CABLE BIT_1
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index e1b16ad..2741ad8 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -117,12 +117,14 @@ qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
{
/* Check for PCI disconnection */
if (reg == 0xffffffff) {
- /*
- * Schedule this on the default system workqueue so that all the
- * adapter workqueues and the DPC thread can be shutdown
- * cleanly.
- */
- schedule_work(&vha->hw->board_disable);
+ if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags)) {
+ /*
+ * Schedule this (only once) on the default system
+ * workqueue so that all the adapter workqueues and the
+ * DPC thread can be shutdown cleanly.
+ */
+ schedule_work(&vha->hw->board_disable);
+ }
return true;
} else
return false;
--
1.7.10.4
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 5/6] qla2xxx: Prevent removal and board_disable race
2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
` (3 preceding siblings ...)
2014-06-18 14:02 ` [PATCH 4/6] qla2xxx: Schedule board_disable only once Joe Lawrence
@ 2014-06-18 14:04 ` Joe Lawrence
2014-06-18 14:04 ` [PATCH 6/6] qla2xxx: Prevent probe " Joe Lawrence
2014-07-25 15:23 ` [PATCH 5/6] qla2xxx: Prevent removal " Chad Dupuis
2014-06-18 15:35 ` [PATCH 0/6] qla2xxx device removal fixups Giridhar Malavali
` (2 subsequent siblings)
7 siblings, 2 replies; 14+ messages in thread
From: Joe Lawrence @ 2014-06-18 14:04 UTC (permalink / raw)
To: linux-scsi
Cc: Chad Dupuis, Giridhar Malavali, Saurav Kashyap, Don Zickus,
Prarit Bhargava, Bill Kuzeja, Dave Bulkow, Joe Lawrence
Introduce mutual exclusion between the qla2xxx_remove_one PCI driver
callback and qla2x00_disable_board_on_pci_error, which is scheduled as
board_disable work by qla2x00_check_reg{32,16}_for_disconnect:
* Leave the driver-specific data attached to the underlying PCI device
intact in qla2x00_disable_board_on_pci_error, so that qla2x00_remove_one
has enough breadcrumbs to determine that any board_disable work has been
completed.
* In qla2xxx_remove_one, set a bit to prevent any subsequent
board_disable work from scheduling, then cancel and wait until pending
work has completed.
* Reuse the PCI device enable count check in qla2x00_remove_one to
determine if board_disable has occured. The original purpose of this
check was unnecessary since the driver remove function wasn't called
when the probe fails.
Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
---
drivers/scsi/qla2xxx/qla_def.h | 1 +
drivers/scsi/qla2xxx/qla_isr.c | 3 ++-
drivers/scsi/qla2xxx/qla_os.c | 31 +++++++++++++++++++------------
3 files changed, 22 insertions(+), 13 deletions(-)
diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h
index 1267b11..7c441c9 100644
--- a/drivers/scsi/qla2xxx/qla_def.h
+++ b/drivers/scsi/qla2xxx/qla_def.h
@@ -3404,6 +3404,7 @@ typedef struct scsi_qla_host {
unsigned long pci_flags;
#define PFLG_DISCONNECTED 0 /* PCI device removed */
+#define PFLG_DRIVER_REMOVING 1 /* PCI driver .remove */
uint32_t device_flags;
#define SWITCH_FOUND BIT_0
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index 2741ad8..ee5eef4 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -117,7 +117,8 @@ qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
{
/* Check for PCI disconnection */
if (reg == 0xffffffff) {
- if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags)) {
+ if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags) &&
+ !test_bit(PFLG_DRIVER_REMOVING, &vha->pci_flags)) {
/*
* Schedule this (only once) on the default system
* workqueue so that all the adapter workqueues and the
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 39c9953..51cba37 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -3123,15 +3123,25 @@ qla2x00_remove_one(struct pci_dev *pdev)
scsi_qla_host_t *base_vha;
struct qla_hw_data *ha;
+ base_vha = pci_get_drvdata(pdev);
+ ha = base_vha->hw;
+
+ /* Indicate device removal to prevent future board_disable and wait
+ * until any pending board_disable has completed. */
+ set_bit(PFLG_DRIVER_REMOVING, &base_vha->pci_flags);
+ cancel_work_sync(&ha->board_disable);
+
/*
- * If the PCI device is disabled that means that probe failed and any
- * resources should be have cleaned up on probe exit.
+ * If the PCI device is disabled then there was a PCI-disconnect and
+ * qla2x00_disable_board_on_pci_error has taken care of most of the
+ * resources.
*/
- if (!atomic_read(&pdev->enable_cnt))
+ if (!atomic_read(&pdev->enable_cnt)) {
+ scsi_host_put(base_vha->host);
+ kfree(ha);
+ pci_set_drvdata(pdev, NULL);
return;
-
- base_vha = pci_get_drvdata(pdev);
- ha = base_vha->hw;
+ }
qla2x00_wait_for_hba_ready(base_vha);
@@ -4791,18 +4801,15 @@ qla2x00_disable_board_on_pci_error(struct work_struct *work)
qla82xx_md_free(base_vha);
qla2x00_free_queues(ha);
- scsi_host_put(base_vha->host);
-
qla2x00_unmap_iobases(ha);
pci_release_selected_regions(ha->pdev, ha->bars);
- kfree(ha);
- ha = NULL;
-
pci_disable_pcie_error_reporting(pdev);
pci_disable_device(pdev);
- pci_set_drvdata(pdev, NULL);
+ /*
+ * Let qla2x00_remove_one cleanup qla_hw_data on device removal.
+ */
}
/**************************************************************************
--
1.7.10.4
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 6/6] qla2xxx: Prevent probe and board_disable race
2014-06-18 14:04 ` [PATCH 5/6] qla2xxx: Prevent removal and board_disable race Joe Lawrence
@ 2014-06-18 14:04 ` Joe Lawrence
2014-07-25 15:23 ` [PATCH 5/6] qla2xxx: Prevent removal " Chad Dupuis
1 sibling, 0 replies; 14+ messages in thread
From: Joe Lawrence @ 2014-06-18 14:04 UTC (permalink / raw)
To: linux-scsi
Cc: Chad Dupuis, Giridhar Malavali, Saurav Kashyap, Don Zickus,
Prarit Bhargava, Bill Kuzeja, Dave Bulkow, Joe Lawrence
The PCI register read checking introduced in commit f3ddac19 "qla2xxx:
Disable adapter when we encounter a PCI disconnect" is active during
driver probe. Hold off scheduling any board removal until the driver
probe has completed. This ensures that the the board_disable work
structure is initialized and more importantly, avoids racing
qla2x00_probe_one.
Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
---
drivers/scsi/qla2xxx/qla_def.h | 1 +
drivers/scsi/qla2xxx/qla_isr.c | 3 ++-
drivers/scsi/qla2xxx/qla_os.c | 2 ++
3 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h
index 7c441c9..253a89d 100644
--- a/drivers/scsi/qla2xxx/qla_def.h
+++ b/drivers/scsi/qla2xxx/qla_def.h
@@ -3405,6 +3405,7 @@ typedef struct scsi_qla_host {
unsigned long pci_flags;
#define PFLG_DISCONNECTED 0 /* PCI device removed */
#define PFLG_DRIVER_REMOVING 1 /* PCI driver .remove */
+#define PFLG_DRIVER_PROBING 2 /* PCI driver .probe */
uint32_t device_flags;
#define SWITCH_FOUND BIT_0
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index ee5eef4..4ea8252 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -118,7 +118,8 @@ qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
/* Check for PCI disconnection */
if (reg == 0xffffffff) {
if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags) &&
- !test_bit(PFLG_DRIVER_REMOVING, &vha->pci_flags)) {
+ !test_bit(PFLG_DRIVER_REMOVING, &vha->pci_flags) &&
+ !test_bit(PFLG_DRIVER_PROBING, &vha->pci_flags)) {
/*
* Schedule this (only once) on the default system
* workqueue so that all the adapter workqueues and the
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 51cba37..0b39425 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -2629,6 +2629,7 @@ qla2x00_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
}
pci_set_drvdata(pdev, base_vha);
+ set_bit(PFLG_DRIVER_PROBING, &base_vha->pci_flags);
host = base_vha->host;
base_vha->req = req;
@@ -2920,6 +2921,7 @@ skip_dpc:
qlt_add_target(ha, base_vha);
+ clear_bit(PFLG_DRIVER_PROBING, &base_vha->pci_flags);
return 0;
probe_init_failed:
--
1.7.10.4
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 5/6] qla2xxx: Prevent removal and board_disable race
2014-06-18 14:04 ` [PATCH 5/6] qla2xxx: Prevent removal and board_disable race Joe Lawrence
2014-06-18 14:04 ` [PATCH 6/6] qla2xxx: Prevent probe " Joe Lawrence
@ 2014-07-25 15:23 ` Chad Dupuis
2014-07-25 19:00 ` Joe Lawrence
1 sibling, 1 reply; 14+ messages in thread
From: Chad Dupuis @ 2014-07-25 15:23 UTC (permalink / raw)
To: Joe Lawrence
Cc: linux-scsi, Giridhar Malavali, Saurav Kashyap, Don Zickus,
Prarit Bhargava, Bill Kuzeja, Dave Bulkow
On Wed, 18 Jun 2014, Joe Lawrence wrote:
> Introduce mutual exclusion between the qla2xxx_remove_one PCI driver
> callback and qla2x00_disable_board_on_pci_error, which is scheduled as
> board_disable work by qla2x00_check_reg{32,16}_for_disconnect:
>
> * Leave the driver-specific data attached to the underlying PCI device
> intact in qla2x00_disable_board_on_pci_error, so that qla2x00_remove_one
> has enough breadcrumbs to determine that any board_disable work has been
> completed.
>
> * In qla2xxx_remove_one, set a bit to prevent any subsequent
> board_disable work from scheduling, then cancel and wait until pending
> work has completed.
>
> * Reuse the PCI device enable count check in qla2x00_remove_one to
> determine if board_disable has occured. The original purpose of this
> check was unnecessary since the driver remove function wasn't called
> when the probe fails.
>
> Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
> ---
> drivers/scsi/qla2xxx/qla_def.h | 1 +
> drivers/scsi/qla2xxx/qla_isr.c | 3 ++-
> drivers/scsi/qla2xxx/qla_os.c | 31 +++++++++++++++++++------------
> 3 files changed, 22 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h
> index 1267b11..7c441c9 100644
> --- a/drivers/scsi/qla2xxx/qla_def.h
> +++ b/drivers/scsi/qla2xxx/qla_def.h
> @@ -3404,6 +3404,7 @@ typedef struct scsi_qla_host {
>
> unsigned long pci_flags;
> #define PFLG_DISCONNECTED 0 /* PCI device removed */
> +#define PFLG_DRIVER_REMOVING 1 /* PCI driver .remove */
>
> uint32_t device_flags;
> #define SWITCH_FOUND BIT_0
> diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
> index 2741ad8..ee5eef4 100644
> --- a/drivers/scsi/qla2xxx/qla_isr.c
> +++ b/drivers/scsi/qla2xxx/qla_isr.c
> @@ -117,7 +117,8 @@ qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
> {
> /* Check for PCI disconnection */
> if (reg == 0xffffffff) {
> - if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags)) {
> + if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags) &&
> + !test_bit(PFLG_DRIVER_REMOVING, &vha->pci_flags)) {
In our remove function we set a bit that we are unloading:
set_bit (UNLOADING, &base_vha->dpc_flags);
could this be used instead?
> /*
> * Schedule this (only once) on the default system
> * workqueue so that all the adapter workqueues and the
> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> index 39c9953..51cba37 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -3123,15 +3123,25 @@ qla2x00_remove_one(struct pci_dev *pdev)
> scsi_qla_host_t *base_vha;
> struct qla_hw_data *ha;
>
> + base_vha = pci_get_drvdata(pdev);
> + ha = base_vha->hw;
> +
> + /* Indicate device removal to prevent future board_disable and wait
> + * until any pending board_disable has completed. */
> + set_bit(PFLG_DRIVER_REMOVING, &base_vha->pci_flags);
> + cancel_work_sync(&ha->board_disable);
> +
> /*
> - * If the PCI device is disabled that means that probe failed and any
> - * resources should be have cleaned up on probe exit.
> + * If the PCI device is disabled then there was a PCI-disconnect and
> + * qla2x00_disable_board_on_pci_error has taken care of most of the
> + * resources.
> */
> - if (!atomic_read(&pdev->enable_cnt))
> + if (!atomic_read(&pdev->enable_cnt)) {
Should we also add a check here that this is a disconnection before
freeing these structs? The original intent of the check for
pdev->enable_cnt is to make sure we don't try to dereference an already
freed struct if probe failed.
> + scsi_host_put(base_vha->host);
> + kfree(ha);
> + pci_set_drvdata(pdev, NULL);
> return;
> -
> - base_vha = pci_get_drvdata(pdev);
> - ha = base_vha->hw;
> + }
>
> qla2x00_wait_for_hba_ready(base_vha);
>
> @@ -4791,18 +4801,15 @@ qla2x00_disable_board_on_pci_error(struct work_struct *work)
> qla82xx_md_free(base_vha);
> qla2x00_free_queues(ha);
>
> - scsi_host_put(base_vha->host);
> -
> qla2x00_unmap_iobases(ha);
>
> pci_release_selected_regions(ha->pdev, ha->bars);
> - kfree(ha);
> - ha = NULL;
> -
> pci_disable_pcie_error_reporting(pdev);
> pci_disable_device(pdev);
> - pci_set_drvdata(pdev, NULL);
>
> + /*
> + * Let qla2x00_remove_one cleanup qla_hw_data on device removal.
> + */
> }
>
> /**************************************************************************
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 5/6] qla2xxx: Prevent removal and board_disable race
2014-07-25 15:23 ` [PATCH 5/6] qla2xxx: Prevent removal " Chad Dupuis
@ 2014-07-25 19:00 ` Joe Lawrence
2014-07-25 19:45 ` Chad Dupuis
0 siblings, 1 reply; 14+ messages in thread
From: Joe Lawrence @ 2014-07-25 19:00 UTC (permalink / raw)
To: Chad Dupuis
Cc: linux-scsi, Giridhar Malavali, Saurav Kashyap, Don Zickus,
Prarit Bhargava, Bill Kuzeja, Dave Bulkow
On Fri, 25 Jul 2014 11:23:02 -0400
Chad Dupuis <chad.dupuis@qlogic.com> wrote:
>
>
> On Wed, 18 Jun 2014, Joe Lawrence wrote:
>
> > Introduce mutual exclusion between the qla2xxx_remove_one PCI driver
> > callback and qla2x00_disable_board_on_pci_error, which is scheduled as
> > board_disable work by qla2x00_check_reg{32,16}_for_disconnect:
> >
> > * Leave the driver-specific data attached to the underlying PCI device
> > intact in qla2x00_disable_board_on_pci_error, so that qla2x00_remove_one
> > has enough breadcrumbs to determine that any board_disable work has been
> > completed.
> >
> > * In qla2xxx_remove_one, set a bit to prevent any subsequent
> > board_disable work from scheduling, then cancel and wait until pending
> > work has completed.
> >
> > * Reuse the PCI device enable count check in qla2x00_remove_one to
> > determine if board_disable has occured. The original purpose of this
> > check was unnecessary since the driver remove function wasn't called
> > when the probe fails.
> >
> > Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
> > ---
> > drivers/scsi/qla2xxx/qla_def.h | 1 +
> > drivers/scsi/qla2xxx/qla_isr.c | 3 ++-
> > drivers/scsi/qla2xxx/qla_os.c | 31 +++++++++++++++++++------------
> > 3 files changed, 22 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h
> > index 1267b11..7c441c9 100644
> > --- a/drivers/scsi/qla2xxx/qla_def.h
> > +++ b/drivers/scsi/qla2xxx/qla_def.h
> > @@ -3404,6 +3404,7 @@ typedef struct scsi_qla_host {
> >
> > unsigned long pci_flags;
> > #define PFLG_DISCONNECTED 0 /* PCI device removed */
> > +#define PFLG_DRIVER_REMOVING 1 /* PCI driver .remove */
> >
> > uint32_t device_flags;
> > #define SWITCH_FOUND BIT_0
> > diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
> > index 2741ad8..ee5eef4 100644
> > --- a/drivers/scsi/qla2xxx/qla_isr.c
> > +++ b/drivers/scsi/qla2xxx/qla_isr.c
> > @@ -117,7 +117,8 @@ qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
> > {
> > /* Check for PCI disconnection */
> > if (reg == 0xffffffff) {
> > - if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags)) {
> > + if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags) &&
> > + !test_bit(PFLG_DRIVER_REMOVING, &vha->pci_flags)) {
>
> In our remove function we set a bit that we are unloading:
>
> set_bit (UNLOADING, &base_vha->dpc_flags);
>
> could this be used instead?
Hi Chad,
Thanks for the comments.
I think with a little bit of code shuffling this could be accomplished.
My goal with the patchset was to try and present the problem/fix as
plain as possible. It was easiest to collect all the atomic bits I
needed inside a single variable. Should I be tacking on such flags to
'dpc_flags' ?
>
> > /*
> > * Schedule this (only once) on the default system
> > * workqueue so that all the adapter workqueues and the
> > diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> > index 39c9953..51cba37 100644
> > --- a/drivers/scsi/qla2xxx/qla_os.c
> > +++ b/drivers/scsi/qla2xxx/qla_os.c
> > @@ -3123,15 +3123,25 @@ qla2x00_remove_one(struct pci_dev *pdev)
> > scsi_qla_host_t *base_vha;
> > struct qla_hw_data *ha;
> >
> > + base_vha = pci_get_drvdata(pdev);
> > + ha = base_vha->hw;
> > +
> > + /* Indicate device removal to prevent future board_disable and wait
> > + * until any pending board_disable has completed. */
> > + set_bit(PFLG_DRIVER_REMOVING, &base_vha->pci_flags);
> > + cancel_work_sync(&ha->board_disable);
> > +
> > /*
> > - * If the PCI device is disabled that means that probe failed and any
> > - * resources should be have cleaned up on probe exit.
> > + * If the PCI device is disabled then there was a PCI-disconnect and
> > + * qla2x00_disable_board_on_pci_error has taken care of most of the
> > + * resources.
> > */
> > - if (!atomic_read(&pdev->enable_cnt))
> > + if (!atomic_read(&pdev->enable_cnt)) {
>
> Should we also add a check here that this is a disconnection before
> freeing these structs? The original intent of the check for
> pdev->enable_cnt is to make sure we don't try to dereference an already
> freed struct if probe failed.
I'm not exactly sure what you're asking here. In my tests, when .probe
return -ERRNO, .remove was not called. Is there another call path into
qla2x00_remove_one?
The reason I didn't completely cleanup qla_hw_data in
qla2x00_disable_board_on_pci_error and re-purposed the PCI enable count
check, was that I needed some way of determining that any
board_disable work was out of the way before proceeding with
qla2x00_remove_one.
The patch's set_bit / cancel_work_sync above (along with the test_bit
before board_disable schedule) should be ensuring that the
board_disable won't be running or rescheduling in the future.
If qla2x00_disable_board_on_pci_error got as far as actually disabling
the PCI device, then qla2x00_remove_one's check on its enable count
would be verifying that. If the device is still enabled, then
qla2x00_remove_one knows that board_disable didn't clean up the device.
Would it be clearer if I used an explicit scsi_qla_host flag to
indicate that state?
>
> > + scsi_host_put(base_vha->host);
> > + kfree(ha);
> > + pci_set_drvdata(pdev, NULL);
> > return;
> > -
> > - base_vha = pci_get_drvdata(pdev);
> > - ha = base_vha->hw;
> > + }
> >
> > qla2x00_wait_for_hba_ready(base_vha);
> >
> > @@ -4791,18 +4801,15 @@ qla2x00_disable_board_on_pci_error(struct work_struct *work)
> > qla82xx_md_free(base_vha);
> > qla2x00_free_queues(ha);
> >
> > - scsi_host_put(base_vha->host);
> > -
> > qla2x00_unmap_iobases(ha);
> >
> > pci_release_selected_regions(ha->pdev, ha->bars);
> > - kfree(ha);
> > - ha = NULL;
> > -
> > pci_disable_pcie_error_reporting(pdev);
> > pci_disable_device(pdev);
> > - pci_set_drvdata(pdev, NULL);
> >
> > + /*
> > + * Let qla2x00_remove_one cleanup qla_hw_data on device removal.
> > + */
> > }
> >
> > /**************************************************************************
> >
Regards,
-- Joe
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 5/6] qla2xxx: Prevent removal and board_disable race
2014-07-25 19:00 ` Joe Lawrence
@ 2014-07-25 19:45 ` Chad Dupuis
0 siblings, 0 replies; 14+ messages in thread
From: Chad Dupuis @ 2014-07-25 19:45 UTC (permalink / raw)
To: Joe Lawrence
Cc: linux-scsi, Giridhar Malavali, Saurav Kashyap, Don Zickus,
Prarit Bhargava, Bill Kuzeja, Dave Bulkow
On Fri, 25 Jul 2014, Joe Lawrence wrote:
> On Fri, 25 Jul 2014 11:23:02 -0400
> Chad Dupuis <chad.dupuis@qlogic.com> wrote:
>
>>
>>
>> On Wed, 18 Jun 2014, Joe Lawrence wrote:
>>
>>> Introduce mutual exclusion between the qla2xxx_remove_one PCI driver
>>> callback and qla2x00_disable_board_on_pci_error, which is scheduled as
>>> board_disable work by qla2x00_check_reg{32,16}_for_disconnect:
>>>
>>> * Leave the driver-specific data attached to the underlying PCI device
>>> intact in qla2x00_disable_board_on_pci_error, so that qla2x00_remove_one
>>> has enough breadcrumbs to determine that any board_disable work has been
>>> completed.
>>>
>>> * In qla2xxx_remove_one, set a bit to prevent any subsequent
>>> board_disable work from scheduling, then cancel and wait until pending
>>> work has completed.
>>>
>>> * Reuse the PCI device enable count check in qla2x00_remove_one to
>>> determine if board_disable has occured. The original purpose of this
>>> check was unnecessary since the driver remove function wasn't called
>>> when the probe fails.
>>>
>>> Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
>>> ---
>>> drivers/scsi/qla2xxx/qla_def.h | 1 +
>>> drivers/scsi/qla2xxx/qla_isr.c | 3 ++-
>>> drivers/scsi/qla2xxx/qla_os.c | 31 +++++++++++++++++++------------
>>> 3 files changed, 22 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h
>>> index 1267b11..7c441c9 100644
>>> --- a/drivers/scsi/qla2xxx/qla_def.h
>>> +++ b/drivers/scsi/qla2xxx/qla_def.h
>>> @@ -3404,6 +3404,7 @@ typedef struct scsi_qla_host {
>>>
>>> unsigned long pci_flags;
>>> #define PFLG_DISCONNECTED 0 /* PCI device removed */
>>> +#define PFLG_DRIVER_REMOVING 1 /* PCI driver .remove */
>>>
>>> uint32_t device_flags;
>>> #define SWITCH_FOUND BIT_0
>>> diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
>>> index 2741ad8..ee5eef4 100644
>>> --- a/drivers/scsi/qla2xxx/qla_isr.c
>>> +++ b/drivers/scsi/qla2xxx/qla_isr.c
>>> @@ -117,7 +117,8 @@ qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg)
>>> {
>>> /* Check for PCI disconnection */
>>> if (reg == 0xffffffff) {
>>> - if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags)) {
>>> + if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags) &&
>>> + !test_bit(PFLG_DRIVER_REMOVING, &vha->pci_flags)) {
>>
>> In our remove function we set a bit that we are unloading:
>>
>> set_bit (UNLOADING, &base_vha->dpc_flags);
>>
>> could this be used instead?
>
> Hi Chad,
>
> Thanks for the comments.
>
> I think with a little bit of code shuffling this could be accomplished.
> My goal with the patchset was to try and present the problem/fix as
> plain as possible. It was easiest to collect all the atomic bits I
> needed inside a single variable. Should I be tacking on such flags to
> 'dpc_flags' ?
Now that I see your explanation, it makes sense how you did it. My
concern had been that we introduce more flags which could theoretcially
introduce more points for race conditions but dpc_flags in pretty
overloaded as it is.
>
>>
>>> /*
>>> * Schedule this (only once) on the default system
>>> * workqueue so that all the adapter workqueues and the
>>> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
>>> index 39c9953..51cba37 100644
>>> --- a/drivers/scsi/qla2xxx/qla_os.c
>>> +++ b/drivers/scsi/qla2xxx/qla_os.c
>>> @@ -3123,15 +3123,25 @@ qla2x00_remove_one(struct pci_dev *pdev)
>>> scsi_qla_host_t *base_vha;
>>> struct qla_hw_data *ha;
>>>
>>> + base_vha = pci_get_drvdata(pdev);
>>> + ha = base_vha->hw;
>>> +
>>> + /* Indicate device removal to prevent future board_disable and wait
>>> + * until any pending board_disable has completed. */
>>> + set_bit(PFLG_DRIVER_REMOVING, &base_vha->pci_flags);
>>> + cancel_work_sync(&ha->board_disable);
>>> +
>>> /*
>>> - * If the PCI device is disabled that means that probe failed and any
>>> - * resources should be have cleaned up on probe exit.
>>> + * If the PCI device is disabled then there was a PCI-disconnect and
>>> + * qla2x00_disable_board_on_pci_error has taken care of most of the
>>> + * resources.
>>> */
>>> - if (!atomic_read(&pdev->enable_cnt))
>>> + if (!atomic_read(&pdev->enable_cnt)) {
>>
>> Should we also add a check here that this is a disconnection before
>> freeing these structs? The original intent of the check for
>> pdev->enable_cnt is to make sure we don't try to dereference an already
>> freed struct if probe failed.
>
> I'm not exactly sure what you're asking here. In my tests, when .probe
> return -ERRNO, .remove was not called. Is there another call path into
> qla2x00_remove_one?
>
> The reason I didn't completely cleanup qla_hw_data in
> qla2x00_disable_board_on_pci_error and re-purposed the PCI enable count
> check, was that I needed some way of determining that any
> board_disable work was out of the way before proceeding with
> qla2x00_remove_one.
>
> The patch's set_bit / cancel_work_sync above (along with the test_bit
> before board_disable schedule) should be ensuring that the
> board_disable won't be running or rescheduling in the future.
>
> If qla2x00_disable_board_on_pci_error got as far as actually disabling
> the PCI device, then qla2x00_remove_one's check on its enable count
> would be verifying that. If the device is still enabled, then
> qla2x00_remove_one knows that board_disable didn't clean up the device.
>
> Would it be clearer if I used an explicit scsi_qla_host flag to
> indicate that state?
Thanks for the explanation.
>
>>
>>> + scsi_host_put(base_vha->host);
>>> + kfree(ha);
>>> + pci_set_drvdata(pdev, NULL);
>>> return;
>>> -
>>> - base_vha = pci_get_drvdata(pdev);
>>> - ha = base_vha->hw;
>>> + }
>>>
>>> qla2x00_wait_for_hba_ready(base_vha);
>>>
>>> @@ -4791,18 +4801,15 @@ qla2x00_disable_board_on_pci_error(struct work_struct *work)
>>> qla82xx_md_free(base_vha);
>>> qla2x00_free_queues(ha);
>>>
>>> - scsi_host_put(base_vha->host);
>>> -
>>> qla2x00_unmap_iobases(ha);
>>>
>>> pci_release_selected_regions(ha->pdev, ha->bars);
>>> - kfree(ha);
>>> - ha = NULL;
>>> -
>>> pci_disable_pcie_error_reporting(pdev);
>>> pci_disable_device(pdev);
>>> - pci_set_drvdata(pdev, NULL);
>>>
>>> + /*
>>> + * Let qla2x00_remove_one cleanup qla_hw_data on device removal.
>>> + */
>>> }
>>>
>>> /**************************************************************************
>>>
>
> Regards,
>
> -- Joe
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/6] qla2xxx device removal fixups
2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
` (4 preceding siblings ...)
2014-06-18 14:04 ` [PATCH 5/6] qla2xxx: Prevent removal and board_disable race Joe Lawrence
@ 2014-06-18 15:35 ` Giridhar Malavali
2014-07-28 18:26 ` Chad Dupuis
2014-08-26 13:20 ` Joe Lawrence
7 siblings, 0 replies; 14+ messages in thread
From: Giridhar Malavali @ 2014-06-18 15:35 UTC (permalink / raw)
To: Joe Lawrence, linux-scsi
Cc: Chad Dupuis, Saurav Kashyap, Don Zickus, Prarit Bhargava,
Bill Kuzeja (Partner - Stratus),
Dave Bulkow
[-- Attachment #1: Type: text/plain, Size: 7189 bytes --]
Hi Joe,
Thanks for the patches. We will review and update.
-- Giri
On 6/18/14 7:02 AM, "Joe Lawrence" <joe.lawrence@stratus.com> wrote:
>Hi Chad, Giri, et al.
>
>Stratus has been testing the upstream qla2xxx driver against surprise
>device removal and has found a few minor issues along the way. With
>this patchset, results have been good. Although the following changes
>may be most interesting on a Stratus platform, they should be applicable
>to general hotplug scenarios on other hardware.
>
>The first two patches are independent from the others and can be
>reviewed as such. One fixes up a use-after-free and the other
>consolidates some code.
>
>The final four patches are more invasive and modify the scsi_qla_host
>structure to avoid device removal race conditions. These changes were
>written to demonstrate the problem at hand and minimally fix them in
>order to continue testing. (For example, adding completely a new
>pci_flags member to scsi_qla_host is probably overkill.)
>
>We would welcome any feedback or questions about our testing.
>
>Regards,
>
>-- Joe
>
>
>Joe Lawrence (6):
>
> qla2xxx: Fix shost use-after-free on device removal
>
> scsi_qla_host *base_vha is allocated as part of the Scsi_Host
> private hostdata[] by qla2x00_create_host. When the last reference
> on the host is put by qla2x00_remove_one, it's released along with
> any hostdata[], rendering base_vha unsafe to dereference.
>
> To safely complete the adapter cleanup in qla2x00_remove_one,
> use the scsi_qla_host_t *qla_hw_data pointer that is already in
> hand.
>
> To reproduce, set kernel boot option slub_debug=FZPU and remove
> an adapter instance.
>
>
> qla2xxx: Use qla2x00_clear_drv_active on probe failure
>
> Clean up some duplicate code along the way.
>
>
> qla2xxx: Collect PCI register checks and board_disable scheduling
>
> PCI register read checks are performed throughout the driver to
> detect disconnected hardware. There is currently a function that
> verifies 32-bit values and schedules board_removal if needed. Other
> 16-bit registers are checked and board_removal scheduling scattered
> around the driver.
>
> This change pulls all of these together such that a new 16-bit
> register check function wrappers the existing 32-bit version and
> centralizes the scheduling of board_disable work to a single
> invocation.
>
> The following patches depend upon this change.
>
>
> qla2xxx: Schedule board_disable only once
>
> In automated hot-plug device removal testing, occasionally
> qla2x00_disable_board_on_pci_error would run twice. I believe this
> occurred by a second qla2x00_check_reg_for_disconnect scheduling
> board_disable while the first instance was still running.
>
> The easiest solution seemed to be adding a per-adapter flag that
> could be test_and_set before scheduling board_disable.
>
>
> qla2xxx: Prevent removal and board_disable race
>
> This race was discovered through many iterations of automated PCI
> hotplug. The automated test inserts faults on the Stratus platform
> to simulate device removal. After removal, it re-adds the devices,
> simulating PCI hotplug insertion. Rinse, repeat.
>
> The race occurs when a Stratus proprietary hotplug driver detects
> that PCI devices have gone missing and calls into the kernel PCI
> subsystem to remove them. At nearly the same time, the qla2xxx
> driver figures out the same issue and schedules a board_disable. It
> takes a few hundred device removals to hit, but it seemed that the
> problem was that qla2x00_disable_board_on_pci_error started, then
> qla2xxx_remove_one was invoked, so one of the two started touching
> already freed resources.
>
> Although the bug manifested on a Stratus platform running modified
> drivers, the window for qla2xxx_remove_one to race
> qla2x00_disable_board_on_pci_error is still present if running stock
> kernel/drivers on commodity hardware. It may be difficult to hit,
> but one could theoretically invoke PCI device removal via sysfs (or
> simply unload the qla2xxx driver) and for board_disable to run at
> the same time.
>
> The fix leaves the Scsi_Host intact during board_disable so that
> qla2x00_remove_one can safely stop the main timer and
> cancel_work_sync on any outstanding board_disable work. A
> PFLG_DRIVER_REMOVING bit is also set to prevent any other
> board_disable instances from scheduling. Once the asynchronous
> players are out of the way, qla2x00_remove_one can move forward
> cleaning up whatever remaining resources are still attached.
>
> The PCI device enable count check in qla2x00_remove_one was
> re-purposed to determine if board_disable had run. Its original
> intention to detect qla2x00_probe_one failure was invalid. As long
> as the driver .probe routine returns an -ERRNO, then its counterpart
> .remove routine is *not* called.
>
>
> qla2xxx: Prevent probe and board_disable race
>
> An automated Stratus device removal test hotplug removes devices
> shortly during/after their initialization. The test inserts
> devices, then proceeds to remove them at various 1, 2, 3, 4...
> second intervals. This test revealed two bugs:
>
> 1 - The struct isp_operations initialize_adapter callback is invoked
> by qla2x00_probe_one before ha->board_disable is initialized. The
> callback will eventually read PCI registers and potentially trigger
> the disconnection check. If the board_disable work structure hasn't
> been initialized, an ugly WARN trace will be emitted. The call stack
> looked something like:
>
> qla2x00_probe_one
> qla2x00_initialize_adapter
> qla2xxx_get_flash_info
> qla2xxx_get_flt_info
> qla25xx_read_optrom_data
> qla2x00_dump_ram
> qla2x00_mailbox_command
> qla24xx_intr_handler
> qla2x00_check_reg_for_disconnect
>
> 2 - Races between the qla2x00_disable_board_on_pci_error and
> qla2x00_probe_one. The latter was written before any asynchronous
> device removals were introduced by commit f3ddac1 "qla2xxx: Disable
> adapter when we encounter a PCI disconnect," and proceeds as if it
> were the only game in town with respect to the underlying device.
> Restore that assumption by preventing any board_disable scheduling
> while qla2x00_probe_one is in flight. Holding off this scheduling
> prevents the race, as well as avoids bug #1.
>
> drivers/scsi/qla2xxx/qla_def.h | 5 ++++
> drivers/scsi/qla2xxx/qla_gbl.h | 3 +-
> drivers/scsi/qla2xxx/qla_isr.c | 44 +++++++++++++++--------------
> drivers/scsi/qla2xxx/qla_mr.c | 2 +-
> drivers/scsi/qla2xxx/qla_nx.c | 6 ++--
> drivers/scsi/qla2xxx/qla_os.c | 60
>++++++++++++++++++----------------------
> 6 files changed, 61 insertions(+), 59 deletions(-)
>
>--
>1.7.10.4
>
[-- Attachment #2: winmail.dat --]
[-- Type: application/ms-tnef, Size: 7342 bytes --]
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/6] qla2xxx device removal fixups
2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
` (5 preceding siblings ...)
2014-06-18 15:35 ` [PATCH 0/6] qla2xxx device removal fixups Giridhar Malavali
@ 2014-07-28 18:26 ` Chad Dupuis
2014-08-26 13:20 ` Joe Lawrence
7 siblings, 0 replies; 14+ messages in thread
From: Chad Dupuis @ 2014-07-28 18:26 UTC (permalink / raw)
To: Joe Lawrence
Cc: linux-scsi, Giridhar Malavali, Saurav Kashyap, Don Zickus,
Prarit Bhargava, Bill Kuzeja, Dave Bulkow
Ack the series.
Acked-by: Chad Dupuis <chad.dupuis@qlogic.com>
On Wed, 18 Jun 2014, Joe Lawrence wrote:
> Hi Chad, Giri, et al.
>
> Stratus has been testing the upstream qla2xxx driver against surprise
> device removal and has found a few minor issues along the way. With
> this patchset, results have been good. Although the following changes
> may be most interesting on a Stratus platform, they should be applicable
> to general hotplug scenarios on other hardware.
>
> The first two patches are independent from the others and can be
> reviewed as such. One fixes up a use-after-free and the other
> consolidates some code.
>
> The final four patches are more invasive and modify the scsi_qla_host
> structure to avoid device removal race conditions. These changes were
> written to demonstrate the problem at hand and minimally fix them in
> order to continue testing. (For example, adding completely a new
> pci_flags member to scsi_qla_host is probably overkill.)
>
> We would welcome any feedback or questions about our testing.
>
> Regards,
>
> -- Joe
>
>
> Joe Lawrence (6):
>
> qla2xxx: Fix shost use-after-free on device removal
>
> scsi_qla_host *base_vha is allocated as part of the Scsi_Host
> private hostdata[] by qla2x00_create_host. When the last reference
> on the host is put by qla2x00_remove_one, it's released along with
> any hostdata[], rendering base_vha unsafe to dereference.
>
> To safely complete the adapter cleanup in qla2x00_remove_one,
> use the scsi_qla_host_t *qla_hw_data pointer that is already in
> hand.
>
> To reproduce, set kernel boot option slub_debug=FZPU and remove
> an adapter instance.
>
>
> qla2xxx: Use qla2x00_clear_drv_active on probe failure
>
> Clean up some duplicate code along the way.
>
>
> qla2xxx: Collect PCI register checks and board_disable scheduling
>
> PCI register read checks are performed throughout the driver to
> detect disconnected hardware. There is currently a function that
> verifies 32-bit values and schedules board_removal if needed. Other
> 16-bit registers are checked and board_removal scheduling scattered
> around the driver.
>
> This change pulls all of these together such that a new 16-bit
> register check function wrappers the existing 32-bit version and
> centralizes the scheduling of board_disable work to a single
> invocation.
>
> The following patches depend upon this change.
>
>
> qla2xxx: Schedule board_disable only once
>
> In automated hot-plug device removal testing, occasionally
> qla2x00_disable_board_on_pci_error would run twice. I believe this
> occurred by a second qla2x00_check_reg_for_disconnect scheduling
> board_disable while the first instance was still running.
>
> The easiest solution seemed to be adding a per-adapter flag that
> could be test_and_set before scheduling board_disable.
>
>
> qla2xxx: Prevent removal and board_disable race
>
> This race was discovered through many iterations of automated PCI
> hotplug. The automated test inserts faults on the Stratus platform
> to simulate device removal. After removal, it re-adds the devices,
> simulating PCI hotplug insertion. Rinse, repeat.
>
> The race occurs when a Stratus proprietary hotplug driver detects
> that PCI devices have gone missing and calls into the kernel PCI
> subsystem to remove them. At nearly the same time, the qla2xxx
> driver figures out the same issue and schedules a board_disable. It
> takes a few hundred device removals to hit, but it seemed that the
> problem was that qla2x00_disable_board_on_pci_error started, then
> qla2xxx_remove_one was invoked, so one of the two started touching
> already freed resources.
>
> Although the bug manifested on a Stratus platform running modified
> drivers, the window for qla2xxx_remove_one to race
> qla2x00_disable_board_on_pci_error is still present if running stock
> kernel/drivers on commodity hardware. It may be difficult to hit,
> but one could theoretically invoke PCI device removal via sysfs (or
> simply unload the qla2xxx driver) and for board_disable to run at
> the same time.
>
> The fix leaves the Scsi_Host intact during board_disable so that
> qla2x00_remove_one can safely stop the main timer and
> cancel_work_sync on any outstanding board_disable work. A
> PFLG_DRIVER_REMOVING bit is also set to prevent any other
> board_disable instances from scheduling. Once the asynchronous
> players are out of the way, qla2x00_remove_one can move forward
> cleaning up whatever remaining resources are still attached.
>
> The PCI device enable count check in qla2x00_remove_one was
> re-purposed to determine if board_disable had run. Its original
> intention to detect qla2x00_probe_one failure was invalid. As long
> as the driver .probe routine returns an -ERRNO, then its counterpart
> .remove routine is *not* called.
>
>
> qla2xxx: Prevent probe and board_disable race
>
> An automated Stratus device removal test hotplug removes devices
> shortly during/after their initialization. The test inserts
> devices, then proceeds to remove them at various 1, 2, 3, 4...
> second intervals. This test revealed two bugs:
>
> 1 - The struct isp_operations initialize_adapter callback is invoked
> by qla2x00_probe_one before ha->board_disable is initialized. The
> callback will eventually read PCI registers and potentially trigger
> the disconnection check. If the board_disable work structure hasn't
> been initialized, an ugly WARN trace will be emitted. The call stack
> looked something like:
>
> qla2x00_probe_one
> qla2x00_initialize_adapter
> qla2xxx_get_flash_info
> qla2xxx_get_flt_info
> qla25xx_read_optrom_data
> qla2x00_dump_ram
> qla2x00_mailbox_command
> qla24xx_intr_handler
> qla2x00_check_reg_for_disconnect
>
> 2 - Races between the qla2x00_disable_board_on_pci_error and
> qla2x00_probe_one. The latter was written before any asynchronous
> device removals were introduced by commit f3ddac1 "qla2xxx: Disable
> adapter when we encounter a PCI disconnect," and proceeds as if it
> were the only game in town with respect to the underlying device.
> Restore that assumption by preventing any board_disable scheduling
> while qla2x00_probe_one is in flight. Holding off this scheduling
> prevents the race, as well as avoids bug #1.
>
> drivers/scsi/qla2xxx/qla_def.h | 5 ++++
> drivers/scsi/qla2xxx/qla_gbl.h | 3 +-
> drivers/scsi/qla2xxx/qla_isr.c | 44 +++++++++++++++--------------
> drivers/scsi/qla2xxx/qla_mr.c | 2 +-
> drivers/scsi/qla2xxx/qla_nx.c | 6 ++--
> drivers/scsi/qla2xxx/qla_os.c | 60 ++++++++++++++++++----------------------
> 6 files changed, 61 insertions(+), 59 deletions(-)
>
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/6] qla2xxx device removal fixups
2014-06-18 14:02 [PATCH 0/6] qla2xxx device removal fixups Joe Lawrence
` (6 preceding siblings ...)
2014-07-28 18:26 ` Chad Dupuis
@ 2014-08-26 13:20 ` Joe Lawrence
2014-08-26 13:52 ` Christoph Hellwig
7 siblings, 1 reply; 14+ messages in thread
From: Joe Lawrence @ 2014-08-26 13:20 UTC (permalink / raw)
To: linux-scsi
Cc: Joe Lawrence, Chad Dupuis, Giridhar Malavali, Saurav Kashyap,
Don Zickus, Prarit Bhargava, Bill Kuzeja, Dave Bulkow,
Christoph Hellwig
On Wed, 18 Jun 2014 10:02:13 -0400
Joe Lawrence <joe.lawrence@stratus.com> wrote:
> Hi Chad, Giri, et al.
>
> Stratus has been testing the upstream qla2xxx driver against surprise
> device removal and has found a few minor issues along the way. With
> this patchset, results have been good. Although the following changes
> may be most interesting on a Stratus platform, they should be applicable
> to general hotplug scenarios on other hardware.
>
> The first two patches are independent from the others and can be
> reviewed as such. One fixes up a use-after-free and the other
> consolidates some code.
Patch 1 corrects a bug introduced by commit fe1b806, which has made its
way into three linux-stable releases:
% git tag --contains fe1b806 | grep '^v3.[0-9]*$'
v3.14
v3.15
v3.16
> The final four patches are more invasive and modify the scsi_qla_host
> structure to avoid device removal race conditions. These changes were
> written to demonstrate the problem at hand and minimally fix them in
> order to continue testing. (For example, adding completely a new
> pci_flags member to scsi_qla_host is probably overkill.)
>
> We would welcome any feedback or questions about our testing.
Chad ack'd the series on 28 Jul 2014. Any chance this can make into
hch's queue for 3.18?
Regards,
-- Joe
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/6] qla2xxx device removal fixups
2014-08-26 13:20 ` Joe Lawrence
@ 2014-08-26 13:52 ` Christoph Hellwig
0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2014-08-26 13:52 UTC (permalink / raw)
To: Joe Lawrence
Cc: linux-scsi, Chad Dupuis, Giridhar Malavali, Saurav Kashyap,
Don Zickus, Prarit Bhargava, Bill Kuzeja, Dave Bulkow,
Christoph Hellwig
On Tue, Aug 26, 2014 at 09:20:55AM -0400, Joe Lawrence wrote:
> > The final four patches are more invasive and modify the scsi_qla_host
> > structure to avoid device removal race conditions. These changes were
> > written to demonstrate the problem at hand and minimally fix them in
> > order to continue testing. (For example, adding completely a new
> > pci_flags member to scsi_qla_host is probably overkill.)
> >
> > We would welcome any feedback or questions about our testing.
>
> Chad ack'd the series on 28 Jul 2014. Any chance this can make into
> hch's queue for 3.18?
Can you please resend the patches with the collected ACKs against the
drivers-for-3.18 branch? Thanks!
^ permalink raw reply [flat|nested] 14+ messages in thread