Linux-PCI Archive on lore.kernel.org
 help / color / Atom feed
From: Yicong Yang <yangyicong@hisilicon.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: <linux-pci@vger.kernel.org>, <f.fangjian@huawei.com>,
	<huangdaode@huawei.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Sinan Kaya <Okaya@kernel.org>,
	Jakub Kicinski <jakub.kicinski@netronome.com>,
	Christoph Hellwig <hch@lst.de>,
	Kuppuswamy Sathyanarayanan 
	<sathyanarayanan.kuppuswamy@linux.intel.com>
Subject: Re: [PATCH 2/2] PCI/AER: Fix deadlock triggered by AER and sriov enable routine
Date: Sat, 4 Jul 2020 18:25:46 +0800
Message-ID: <fade5ccd-c80f-9bee-5710-87c855f4a649@hisilicon.com> (raw)
In-Reply-To: <20200630235853.GA3500801@bjorn-Precision-5520>

Hi Bjorn,

Thanks for reviewing this.


On 2020/7/1 7:58, Bjorn Helgaas wrote:
> [+cc Ben, Sinan, Jakub, Christoph, Sathy]
>
> On Wed, Feb 05, 2020 at 05:18:09PM +0800, Yicong Yang wrote:
>> When enabling VF of pci device by sysfs, we hold device_lock
>> first in sriov_enable() and then pci_bus_sem in pci_device_add().
>> But when AER invoked, we hold pci_bus_sem first in
>> pcie_do_recovery() and then device_lock subsequently.
>> The inconsistent order will lead to deadlock as reported [1].
>>
>> when adding VF by sysfs:
>> sriov_numvfs_store()
>> 	device_lock()
>> 		...
>> 		sriov_enable()
>> 			...
>> 			pci_device_add()
>> 				down_write(&pci_bus_sem)
>> when invoking aer routine:
>> pcie_do_recovery()
>> 	...
>> 	pci_walk_bus()
>> 		down_read(&pci_bus_sem)
>> 		report_*()
>> 			...
>> 			device_lock()
>>
>> Add pci_lock_and_walk_bus(), which locks the devices on the bus first
>> using pci_bus_lock() and then walk the bus with specific callback.
>> Use pci_lock_and_walk_bus() in pcie_do_recovery() and remove
>> device_lock() in report_*() callbacks. Then the lock order will be
>> consistent with that in the sriov enable routine.
>> User space access to the configuration space of the devices on the bus
>> will also be blocked in the error recovery routine. The device should
>> be unreachable during error recovery.
> Did you consider Ben's response [1]?  He suggested that the SR-IOV
> side maybe shouldn't take the device_lock before doing something that
> can take the pci_bus_sem.

We take pci_bus_sem in pdev->driver->sriov_configure() to add device to
the pci_bus, seems we cannot avoid taking it or we'll met what mentioned
in commit 17530e71e016 [2]. Or am I missed something?

sriov_numvfs_store()
	device_lock()
	pdev->driver->sriov_configure()
		...
		pci_device_add()
			down_write(&pci_bus_sem)

> The device_lock() in sriov_numvfs_store() was added by 17530e71e016
> ("PCI: Protect pci_driver->sriov_configure() usage with
> device_lock()") [2].  The commit log says we must provide exclusion vs
> the driver's ->remove() method, usually by using device_lock().
>
> Maybe this patch is doing the right thing by acquiring all the
> device_locks in the subtree on the AER side.
>
> But I do feel a little queasy about pci_device_add() and
> pci_bus_add_device() and pci_bus_sem being held while holding
> device_lock.  Maybe the device addition should be done by a separate
> thread or something.

I haven't found other routines(please tell me if there is) besides SR-IOV
enable holding device_lock while calling pci_device_add(), pci_bus_add_device()
or holding pci_bus_sem. So I think we only need to put VF's device addition
to another thread, and then release the device_lock. And I have something
uncertain: 

1. If we put the device addition into another thread and avoid holding
device_lock and pci_bus_sem at the same time, I think it's still necessary
to make sure PF won't be removed(through sysfs) before we call
pci_device_add() for VF.
2. Furthermore, if we start a new thread for VF device addition, users may be
not able to get synchronous results of whether sriov enable succeed or not,
as they may check the return value of sriov_numvfs written.

Please let me know if I understand it correctly. Seems all prefer to solve
this issue at SR-IOV side, and no one suggests to adjust the lock order.

Thanks,
Yicong

>
> [1] https://lore.kernel.org/linux-pci/a1c90cfb9ce4062b4823c6647d7709baf1c5534f.camel@kernel.crashing.org/
> [2] https://git.kernel.org/linus/17530e71e016
>
>> [1] https://bugzilla.kernel.org/show_bug.cgi?id=203981
>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
>> ---
>>  drivers/pci/bus.c      |  8 ++++++++
>>  drivers/pci/pci.c      |  4 ++--
>>  drivers/pci/pci.h      |  4 ++++
>>  drivers/pci/pcie/err.c | 18 +++++-------------
>>  4 files changed, 19 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
>> index 8e40b3e..eb29273 100644
>> --- a/drivers/pci/bus.c
>> +++ b/drivers/pci/bus.c
>> @@ -411,6 +411,14 @@ void pci_walk_bus(struct pci_bus *top, int (*cb)(struct pci_dev *, void *),
>>  }
>>  EXPORT_SYMBOL_GPL(pci_walk_bus);
>>  
>> +void pci_lock_and_walk_bus(struct pci_bus *top,
>> +		int (*cb)(struct pci_dev *, void *), void *userdata)
>> +{
>> +	pci_bus_lock(top);
>> +	pci_walk_bus(top, cb, userdata);
>> +	pci_bus_unlock(top);
>> +}
>> +
>>  struct pci_bus *pci_bus_get(struct pci_bus *bus)
>>  {
>>  	if (bus)
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 78c99ef..94a7f91 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -5150,7 +5150,7 @@ static bool pci_bus_resetable(struct pci_bus *bus)
>>  }
>>  
>>  /* Lock devices from the top of the tree down */
>> -static void pci_bus_lock(struct pci_bus *bus)
>> +void pci_bus_lock(struct pci_bus *bus)
>>  {
>>  	struct pci_dev *dev;
>>  
>> @@ -5162,7 +5162,7 @@ static void pci_bus_lock(struct pci_bus *bus)
>>  }
>>  
>>  /* Unlock devices from the bottom of the tree up */
>> -static void pci_bus_unlock(struct pci_bus *bus)
>> +void pci_bus_unlock(struct pci_bus *bus)
>>  {
>>  	struct pci_dev *dev;
>>  
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index a0a53bd..8f8cd53 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -39,6 +39,8 @@ int pci_mmap_fits(struct pci_dev *pdev, int resno, struct vm_area_struct *vmai,
>>  int pci_probe_reset_function(struct pci_dev *dev);
>>  int pci_bridge_secondary_bus_reset(struct pci_dev *dev);
>>  int pci_bus_error_reset(struct pci_dev *dev);
>> +void pci_bus_lock(struct pci_bus *bus);
>> +void pci_bus_unlock(struct pci_bus *bus);
>>  
>>  #define PCI_PM_D2_DELAY         200
>>  #define PCI_PM_D3_WAIT          10
>> @@ -286,6 +288,8 @@ bool pci_bus_clip_resource(struct pci_dev *dev, int idx);
>>  
>>  void pci_reassigndev_resource_alignment(struct pci_dev *dev);
>>  void pci_disable_bridge_window(struct pci_dev *dev);
>> +void pci_lock_and_walk_bus(struct pci_bus *top,
>> +		int (*cb)(struct pci_dev *, void *), void *userdata);
>>  struct pci_bus *pci_bus_get(struct pci_bus *bus);
>>  void pci_bus_put(struct pci_bus *bus);
>>  
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index b0e6048..ce0ef5c 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -50,7 +50,6 @@ static int report_error_detected(struct pci_dev *dev,
>>  	pci_ers_result_t vote;
>>  	const struct pci_error_handlers *err_handler;
>>  
>> -	device_lock(&dev->dev);
>>  	if (!pci_dev_set_io_state(dev, state) ||
>>  		!dev->driver ||
>>  		!dev->driver->err_handler ||
>> @@ -71,7 +70,6 @@ static int report_error_detected(struct pci_dev *dev,
>>  	}
>>  	pci_uevent_ers(dev, vote);
>>  	*result = merge_result(*result, vote);
>> -	device_unlock(&dev->dev);
>>  	return 0;
>>  }
>>  
>> @@ -90,7 +88,6 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data)
>>  	pci_ers_result_t vote, *result = data;
>>  	const struct pci_error_handlers *err_handler;
>>  
>> -	device_lock(&dev->dev);
>>  	if (!dev->driver ||
>>  		!dev->driver->err_handler ||
>>  		!dev->driver->err_handler->mmio_enabled)
>> @@ -100,7 +97,6 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data)
>>  	vote = err_handler->mmio_enabled(dev);
>>  	*result = merge_result(*result, vote);
>>  out:
>> -	device_unlock(&dev->dev);
>>  	return 0;
>>  }
>>  
>> @@ -109,7 +105,6 @@ static int report_slot_reset(struct pci_dev *dev, void *data)
>>  	pci_ers_result_t vote, *result = data;
>>  	const struct pci_error_handlers *err_handler;
>>  
>> -	device_lock(&dev->dev);
>>  	if (!dev->driver ||
>>  		!dev->driver->err_handler ||
>>  		!dev->driver->err_handler->slot_reset)
>> @@ -119,7 +114,6 @@ static int report_slot_reset(struct pci_dev *dev, void *data)
>>  	vote = err_handler->slot_reset(dev);
>>  	*result = merge_result(*result, vote);
>>  out:
>> -	device_unlock(&dev->dev);
>>  	return 0;
>>  }
>>  
>> @@ -127,7 +121,6 @@ static int report_resume(struct pci_dev *dev, void *data)
>>  {
>>  	const struct pci_error_handlers *err_handler;
>>  
>> -	device_lock(&dev->dev);
>>  	if (!pci_dev_set_io_state(dev, pci_channel_io_normal) ||
>>  		!dev->driver ||
>>  		!dev->driver->err_handler ||
>> @@ -138,7 +131,6 @@ static int report_resume(struct pci_dev *dev, void *data)
>>  	err_handler->resume(dev);
>>  out:
>>  	pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED);
>> -	device_unlock(&dev->dev);
>>  	return 0;
>>  }
>>  
>> @@ -200,9 +192,9 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
>>  
>>  	pci_dbg(dev, "broadcast error_detected message\n");
>>  	if (state == pci_channel_io_frozen)
>> -		pci_walk_bus(bus, report_frozen_detected, &status);
>> +		pci_lock_and_walk_bus(bus, report_frozen_detected, &status);
>>  	else
>> -		pci_walk_bus(bus, report_normal_detected, &status);
>> +		pci_lock_and_walk_bus(bus, report_normal_detected, &status);
>>  
>>  	if (state == pci_channel_io_frozen &&
>>  	    reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
>> @@ -211,7 +203,7 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
>>  	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
>>  		status = PCI_ERS_RESULT_RECOVERED;
>>  		pci_dbg(dev, "broadcast mmio_enabled message\n");
>> -		pci_walk_bus(bus, report_mmio_enabled, &status);
>> +		pci_lock_and_walk_bus(bus, report_mmio_enabled, &status);
>>  	}
>>  
>>  	if (status == PCI_ERS_RESULT_NEED_RESET) {
>> @@ -222,14 +214,14 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
>>  		 */
>>  		status = PCI_ERS_RESULT_RECOVERED;
>>  		pci_dbg(dev, "broadcast slot_reset message\n");
>> -		pci_walk_bus(bus, report_slot_reset, &status);
>> +		pci_lock_and_walk_bus(bus, report_slot_reset, &status);
>>  	}
>>  
>>  	if (status != PCI_ERS_RESULT_RECOVERED)
>>  		goto failed;
>>  
>>  	pci_dbg(dev, "broadcast resume message\n");
>> -	pci_walk_bus(bus, report_resume, &status);
>> +	pci_lock_and_walk_bus(bus, report_resume, &status);
>>  
>>  	pci_aer_clear_device_status(dev);
>>  	pci_cleanup_aer_uncorrect_error_status(dev);
>> -- 
>> 2.8.1
>>
> .
>


      parent reply index

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-05  9:18 Yicong Yang
2020-06-30 23:58 ` Bjorn Helgaas
2020-07-01 15:24   ` Sinan Kaya
2020-07-04 10:25   ` Yicong Yang [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fade5ccd-c80f-9bee-5710-87c855f4a649@hisilicon.com \
    --to=yangyicong@hisilicon.com \
    --cc=Okaya@kernel.org \
    --cc=benh@kernel.crashing.org \
    --cc=f.fangjian@huawei.com \
    --cc=hch@lst.de \
    --cc=helgaas@kernel.org \
    --cc=huangdaode@huawei.com \
    --cc=jakub.kicinski@netronome.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-PCI Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-pci/0 linux-pci/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-pci linux-pci/ https://lore.kernel.org/linux-pci \
		linux-pci@vger.kernel.org
	public-inbox-index linux-pci

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-pci


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git