From mboxrd@z Thu Jan 1 00:00:00 1970 From: Joe Lawrence Subject: Re: [PATCH 5/6] qla2xxx: Prevent removal and board_disable race Date: Fri, 25 Jul 2014 15:00:17 -0400 Message-ID: <20140725150017.61a0cdc1@jlaw-desktop.mno.stratus.com> References: <1403100139-1798-1-git-send-email-joe.lawrence@stratus.com> <1403100284-1872-1-git-send-email-joe.lawrence@stratus.com> Mime-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Return-path: Received: from p02c12o147.mxlogic.net ([208.65.145.80]:59404 "EHLO p02c12o147.mxlogic.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754843AbaGYTAc (ORCPT ); Fri, 25 Jul 2014 15:00:32 -0400 In-Reply-To: Sender: linux-scsi-owner@vger.kernel.org List-Id: linux-scsi@vger.kernel.org To: Chad Dupuis Cc: linux-scsi@vger.kernel.org, Giridhar Malavali , Saurav Kashyap , Don Zickus , Prarit Bhargava , Bill Kuzeja , Dave Bulkow On Fri, 25 Jul 2014 11:23:02 -0400 Chad Dupuis wrote: > > > On Wed, 18 Jun 2014, Joe Lawrence wrote: > > > Introduce mutual exclusion between the qla2xxx_remove_one PCI driver > > callback and qla2x00_disable_board_on_pci_error, which is scheduled as > > board_disable work by qla2x00_check_reg{32,16}_for_disconnect: > > > > * Leave the driver-specific data attached to the underlying PCI device > > intact in qla2x00_disable_board_on_pci_error, so that qla2x00_remove_one > > has enough breadcrumbs to determine that any board_disable work has been > > completed. > > > > * In qla2xxx_remove_one, set a bit to prevent any subsequent > > board_disable work from scheduling, then cancel and wait until pending > > work has completed. > > > > * Reuse the PCI device enable count check in qla2x00_remove_one to > > determine if board_disable has occured. The original purpose of this > > check was unnecessary since the driver remove function wasn't called > > when the probe fails. > > > > Signed-off-by: Joe Lawrence > > --- > > drivers/scsi/qla2xxx/qla_def.h | 1 + > > drivers/scsi/qla2xxx/qla_isr.c | 3 ++- > > drivers/scsi/qla2xxx/qla_os.c | 31 +++++++++++++++++++------------ > > 3 files changed, 22 insertions(+), 13 deletions(-) > > > > diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h > > index 1267b11..7c441c9 100644 > > --- a/drivers/scsi/qla2xxx/qla_def.h > > +++ b/drivers/scsi/qla2xxx/qla_def.h > > @@ -3404,6 +3404,7 @@ typedef struct scsi_qla_host { > > > > unsigned long pci_flags; > > #define PFLG_DISCONNECTED 0 /* PCI device removed */ > > +#define PFLG_DRIVER_REMOVING 1 /* PCI driver .remove */ > > > > uint32_t device_flags; > > #define SWITCH_FOUND BIT_0 > > diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c > > index 2741ad8..ee5eef4 100644 > > --- a/drivers/scsi/qla2xxx/qla_isr.c > > +++ b/drivers/scsi/qla2xxx/qla_isr.c > > @@ -117,7 +117,8 @@ qla2x00_check_reg32_for_disconnect(scsi_qla_host_t *vha, uint32_t reg) > > { > > /* Check for PCI disconnection */ > > if (reg == 0xffffffff) { > > - if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags)) { > > + if (!test_and_set_bit(PFLG_DISCONNECTED, &vha->pci_flags) && > > + !test_bit(PFLG_DRIVER_REMOVING, &vha->pci_flags)) { > > In our remove function we set a bit that we are unloading: > > set_bit (UNLOADING, &base_vha->dpc_flags); > > could this be used instead? Hi Chad, Thanks for the comments. I think with a little bit of code shuffling this could be accomplished. My goal with the patchset was to try and present the problem/fix as plain as possible. It was easiest to collect all the atomic bits I needed inside a single variable. Should I be tacking on such flags to 'dpc_flags' ? > > > /* > > * Schedule this (only once) on the default system > > * workqueue so that all the adapter workqueues and the > > diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c > > index 39c9953..51cba37 100644 > > --- a/drivers/scsi/qla2xxx/qla_os.c > > +++ b/drivers/scsi/qla2xxx/qla_os.c > > @@ -3123,15 +3123,25 @@ qla2x00_remove_one(struct pci_dev *pdev) > > scsi_qla_host_t *base_vha; > > struct qla_hw_data *ha; > > > > + base_vha = pci_get_drvdata(pdev); > > + ha = base_vha->hw; > > + > > + /* Indicate device removal to prevent future board_disable and wait > > + * until any pending board_disable has completed. */ > > + set_bit(PFLG_DRIVER_REMOVING, &base_vha->pci_flags); > > + cancel_work_sync(&ha->board_disable); > > + > > /* > > - * If the PCI device is disabled that means that probe failed and any > > - * resources should be have cleaned up on probe exit. > > + * If the PCI device is disabled then there was a PCI-disconnect and > > + * qla2x00_disable_board_on_pci_error has taken care of most of the > > + * resources. > > */ > > - if (!atomic_read(&pdev->enable_cnt)) > > + if (!atomic_read(&pdev->enable_cnt)) { > > Should we also add a check here that this is a disconnection before > freeing these structs? The original intent of the check for > pdev->enable_cnt is to make sure we don't try to dereference an already > freed struct if probe failed. I'm not exactly sure what you're asking here. In my tests, when .probe return -ERRNO, .remove was not called. Is there another call path into qla2x00_remove_one? The reason I didn't completely cleanup qla_hw_data in qla2x00_disable_board_on_pci_error and re-purposed the PCI enable count check, was that I needed some way of determining that any board_disable work was out of the way before proceeding with qla2x00_remove_one. The patch's set_bit / cancel_work_sync above (along with the test_bit before board_disable schedule) should be ensuring that the board_disable won't be running or rescheduling in the future. If qla2x00_disable_board_on_pci_error got as far as actually disabling the PCI device, then qla2x00_remove_one's check on its enable count would be verifying that. If the device is still enabled, then qla2x00_remove_one knows that board_disable didn't clean up the device. Would it be clearer if I used an explicit scsi_qla_host flag to indicate that state? > > > + scsi_host_put(base_vha->host); > > + kfree(ha); > > + pci_set_drvdata(pdev, NULL); > > return; > > - > > - base_vha = pci_get_drvdata(pdev); > > - ha = base_vha->hw; > > + } > > > > qla2x00_wait_for_hba_ready(base_vha); > > > > @@ -4791,18 +4801,15 @@ qla2x00_disable_board_on_pci_error(struct work_struct *work) > > qla82xx_md_free(base_vha); > > qla2x00_free_queues(ha); > > > > - scsi_host_put(base_vha->host); > > - > > qla2x00_unmap_iobases(ha); > > > > pci_release_selected_regions(ha->pdev, ha->bars); > > - kfree(ha); > > - ha = NULL; > > - > > pci_disable_pcie_error_reporting(pdev); > > pci_disable_device(pdev); > > - pci_set_drvdata(pdev, NULL); > > > > + /* > > + * Let qla2x00_remove_one cleanup qla_hw_data on device removal. > > + */ > > } > > > > /************************************************************************** > > Regards, -- Joe