* Locking between vfio hot-remove and pci sysfs sriov_numvfs
       [not found] <CGME20231207223824uscas1p27dd91f0af56cda282cd28046cc981fe9@uscas1p2.samsung.com>
@ 2023-12-07 22:38 ` Jim Harris
  2023-12-07 23:21   ` Alex Williamson
  0 siblings, 1 reply; 16+ messages in thread
From: Jim Harris @ 2023-12-07 22:38 UTC (permalink / raw)
  To: bhelgaas, alex.williamson, linux-pci, kvm, ben, jgg

I am seeing a deadlock when using SPDK with vfio-pci-based hotplug detection
and an SR-IOV-enabled NVMe SSD. It is not clear whether this deadlock is
intended behavior or a kernel bug.

Note: SPDK uses DPDK's PCI device enumeration framework, so I'll reference
both SPDK and DPDK in this description.

DPDK registers an eventfd with vfio for hotplug notifications. If the associated
device is removed (i.e. by writing 1 to its PCI sysfs remove entry), vfio
signals the eventfd, requesting that DPDK release the device. vfio does this
while holding the device_lock(), and then waits for completion.
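
For the curious, the registration looks roughly like this - an illustrative
sketch using the vfio request IRQ, not the actual DPDK code (device_fd is
assumed to be an already-open vfio device fd):

#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Register an eventfd on the vfio "request" IRQ so the kernel can ask us
 * to release the device on hot-remove.
 */
static int register_req_eventfd(int device_fd)
{
	char buf[sizeof(struct vfio_irq_set) + sizeof(int)];
	struct vfio_irq_set *irq_set = (struct vfio_irq_set *)buf;
	int req_fd = eventfd(0, EFD_NONBLOCK);

	irq_set->argsz = sizeof(buf);
	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	irq_set->index = VFIO_PCI_REQ_IRQ_INDEX;
	irq_set->start = 0;
	irq_set->count = 1;
	memcpy(irq_set->data, &req_fd, sizeof(int));

	ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set);

	/* The caller polls req_fd; once it fires, the application must
	 * eventually close device_fd so the kernel remove path can finish.
	 */
	return req_fd;
}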

DPDK gets the notification, and passes it up to SPDK. SPDK does not release
the device immediately. It has some asynchronous operations that need to be
performed first, so it will release the device a bit later.

But before the device is released, SPDK also triggers DPDK to do a sysfs scan
looking for newly inserted devices. Note that the removed device is not yet
completely removed from the kernel PCI perspective - all of its sysfs entries
are still available, including sriov_numvfs.

DPDK explicitly reads sriov_numvfs to see if the device is SR-IOV capable.
SPDK itself doesn't actually use this value, but the read is part of the scan
triggered by SPDK and directly leads to the deadlock. sriov_numvfs_show()
deadlocks because it tries to acquire the device_lock() while reading the PCI
device's pdev->sriov->num_VFs.
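
Roughly, the two contending paths look like this (simplified; thread B is
the same thread that is ultimately responsible for releasing the vfio
device):

  Thread A: echo 1 > /sys/bus/pci/devices/.../remove
    device_release_driver()
      device_lock(dev)
      vfio-pci ->remove()
        signals the request eventfd
        waits for userspace to close the vfio device fd   <-- holds device_lock

  Thread B: SPDK/DPDK handles the request, then rescans sysfs
    read /sys/bus/pci/devices/.../sriov_numvfs
      sriov_numvfs_show()
        device_lock(dev)                                   <-- blocks

  Thread B never gets around to releasing the device, so thread A never
  drops the device lock.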

We're able to work around this in SPDK by deferring the sysfs scan while
a device removal is in progress. And maybe that is what we are supposed to
be doing to avoid this deadlock?

Reference to the SPDK issue, for some more details (plus simple repro steps for
anyone already familiar with SPDK): https://github.com/spdk/spdk/issues/3205


* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-07 22:38 ` Locking between vfio hot-remove and pci sysfs sriov_numvfs Jim Harris
@ 2023-12-07 23:21   ` Alex Williamson
  2023-12-07 23:48     ` Jason Gunthorpe
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2023-12-07 23:21 UTC (permalink / raw)
  To: Jim Harris; +Cc: bhelgaas, linux-pci, kvm, ben, jgg

On Thu, 7 Dec 2023 22:38:23 +0000
Jim Harris <jim.harris@samsung.com> wrote:

> I am seeing a deadlock when using SPDK with vfio-pci-based hotplug detection
> and an SR-IOV-enabled NVMe SSD. It is not clear whether this deadlock is
> intended behavior or a kernel bug.
> 
> Note: SPDK uses DPDK's PCI device enumeration framework, so I'll reference
> both SPDK and DPDK in this description.
> 
> DPDK registers an eventfd with vfio for hotplug notifications. If the associated
> device is removed (i.e. by writing 1 to its PCI sysfs remove entry), vfio
> signals the eventfd, requesting that DPDK release the device. vfio does this
> while holding the device_lock(), and then waits for completion.
> 
> DPDK gets the notification, and passes it up to SPDK. SPDK does not release
> the device immediately. It has some asynchronous operations that need to be
> performed first, so it will release the device a bit later.
> 
> But before the device is released, SPDK also triggers DPDK to do a sysfs scan
> looking for newly inserted devices. Note that the removed device is not yet
> completely removed from the kernel PCI perspective - all of its sysfs entries
> are still available, including sriov_numvfs.
> 
> DPDK explicitly reads sriov_numvfs to see if the device is SR-IOV capable.
> SPDK itself doesn't actually use this value, but the read is part of the scan
> triggered by SPDK and directly leads to the deadlock. sriov_numvfs_show()
> deadlocks because it tries to acquire the device_lock() while reading the PCI
> device's pdev->sriov->num_VFs.
> 
> We're able to work around this in SPDK by deferring the sysfs scan while
> a device removal is in progress. And maybe that is what we are supposed to
> be doing to avoid this deadlock?
> 
> Reference to the SPDK issue, for some more details (plus simple repro steps for
> anyone already familiar with SPDK): https://github.com/spdk/spdk/issues/3205

device_lock() has been a recurring problem.  We don't have a lot of
leeway in how we support the driver remove callback, the device needs
to be released.  We can't return -EBUSY and I don't think we can drop
the mutex while we're waiting on userspace.

I've done some fix-ups in the past to use device_trylock() to avoid
deadlocks, which might be an option here, e.g. reading sriov_numvfs
could return -EBUSY in this scenario.  We keep running into these
scenarios though, and we might just need to pick a point at which we
kill the user process holding the device.
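
An untested sketch of that idea, just to illustrate:

static ssize_t sriov_numvfs_show(struct device *dev,
				 struct device_attribute *attr, char *buf)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	u16 num_vfs;

	/* Don't block behind a remove path that holds the device lock while
	 * waiting on userspace; tell the reader to retry instead.
	 */
	if (!device_trylock(&pdev->dev))
		return -EBUSY;
	num_vfs = pdev->sriov->num_VFs;
	device_unlock(&pdev->dev);

	return sysfs_emit(buf, "%u\n", num_vfs);
}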

I'm open to suggestions.  Thanks,

Alex



* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-07 23:21   ` Alex Williamson
@ 2023-12-07 23:48     ` Jason Gunthorpe
  2023-12-08 17:07       ` Jim Harris
  2023-12-08 17:38       ` Jim Harris
  0 siblings, 2 replies; 16+ messages in thread
From: Jason Gunthorpe @ 2023-12-07 23:48 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jim Harris, bhelgaas, linux-pci, kvm, ben

On Thu, Dec 07, 2023 at 04:21:48PM -0700, Alex Williamson wrote:
> On Thu, 7 Dec 2023 22:38:23 +0000
> Jim Harris <jim.harris@samsung.com> wrote:
> 
> > I am seeing a deadlock when using SPDK with vfio-pci-based hotplug detection
> > and an SR-IOV-enabled NVMe SSD. It is not clear whether this deadlock is
> > intended behavior or a kernel bug.
> > 
> > Note: SPDK uses DPDK's PCI device enumeration framework, so I'll reference
> > both SPDK and DPDK in this description.
> > 
> > DPDK registers an eventfd with vfio for hotplug notifications. If the associated
> > device is removed (i.e. by writing 1 to its PCI sysfs remove entry), vfio
> > signals the eventfd, requesting that DPDK release the device. vfio does this
> > while holding the device_lock(), and then waits for completion.
> > 
> > DPDK gets the notification, and passes it up to SPDK. SPDK does not release
> > the device immediately. It has some asynchronous operations that need to be
> > performed first, so it will release the device a bit later.
> > 
> > But before the device is released, SPDK also triggers DPDK to do a sysfs scan
> > looking for newly inserted devices. Note that the removed device is not yet
> > completely removed from the kernel PCI perspective - all of its sysfs entries
> > are still available, including sriov_numvfs.
> > 
> > DPDK explicitly reads sriov_numvfs to see if the device is SR-IOV capable.
> > SPDK itself doesn't actually use this value, but the read is part of the scan
> > triggered by SPDK and directly leads to the deadlock. sriov_numvfs_show()
> > deadlocks because it tries to acquire the device_lock() while reading the PCI
> > device's pdev->sriov->num_VFs.
> > 
> > We're able to work around this in SPDK by deferring the sysfs scan while
> > a device removal is in progress. And maybe that is what we are supposed to
> > be doing to avoid this deadlock?
> > 
> > Reference to the SPDK issue, for some more details (plus simple repro steps for
> > anyone already familiar with SPDK): https://github.com/spdk/spdk/issues/3205
> 
> device_lock() has been a recurring problem.  We don't have a lot of
> leeway in how we support the driver remove callback, the device needs
> to be released.  We can't return -EBUSY and I don't think we can drop
> the mutex while we're waiting on userspace.

The mechanism of waiting in remove for userspace is inherently flawed,
it can never work fully correctly. :( I've hit this many times.

Upon remove VFIO should immediately remove itself and leave behind a
non-functional file descriptor. Userspace should catch up eventually
and see it is toast.

The kernel locking model just cannot support userspace delaying this
process.

Jason


* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-07 23:48     ` Jason Gunthorpe
@ 2023-12-08 17:07       ` Jim Harris
  2023-12-08 19:41         ` Jason Gunthorpe
  2023-12-08 17:38       ` Jim Harris
  1 sibling, 1 reply; 16+ messages in thread
From: Jim Harris @ 2023-12-08 17:07 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Alex Williamson, bhelgaas, linux-pci, kvm, ben

On Thu, Dec 07, 2023 at 07:48:10PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 07, 2023 at 04:21:48PM -0700, Alex Williamson wrote:
> > On Thu, 7 Dec 2023 22:38:23 +0000
> > Jim Harris <jim.harris@samsung.com> wrote:
> > 
> > device_lock() has been a recurring problem.  We don't have a lot of
> > leeway in how we support the driver remove callback, the device needs
> > to be released.  We can't return -EBUSY and I don't think we can drop
> > the mutex while we're waiting on userspace.
> 
> The mechanism of waiting in remove for userspace is inherently flawed,
> it can never work fully correctly. :( I've hit this many times.
> 
> Upon remove VFIO should immediately remove itself and leave behind a
> non-functional file descriptor. Userspace should catch up eventually
> and see it is toast.
> 
> The kernel locking model just cannot support userspace delaying this
> process.
> 
> Jason

Maybe for now we just whack this specific mole with a separate mutex
for synchronizing access to sriov->num_VFs in the sysfs paths?
Something like this (tested on my system):

---

Author: Jim Harris <jim.harris@samsung.com>

PCI/IOV: Sync sriov->num_VFs sysfs access with its own mutex

If an SR-IOV-enabled device is held by vfio and the device is removed, vfio
holds the device lock while it notifies userspace of the removal. If userspace
then reads the sriov_numvfs sysfs entry, that thread blocks, since
sriov_numvfs_show() also tries to acquire the device lock. If that same thread
is responsible for releasing the device back to vfio, the result is a deadlock.

So add a separate mutex, specifically for struct pci_sriov, and use it to
synchronize accesses to num_VFs in the sysfs paths. sriov_numvfs_store()
also still holds the device lock while configuring SR-IOV.

Fixes: 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes")

Signed-off-by: Jim Harris <jim.harris@samsung.com>

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 25dbe85c4217..8910cf6c97be 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -398,9 +398,9 @@ static ssize_t sriov_numvfs_show(struct device *dev,
 	u16 num_vfs;
 
 	/* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */
-	device_lock(&pdev->dev);
+	mutex_lock(&pdev->sriov->lock);
 	num_vfs = pdev->sriov->num_VFs;
-	device_unlock(&pdev->dev);
+	mutex_unlock(&pdev->sriov->lock);
 
 	return sysfs_emit(buf, "%u\n", num_vfs);
 }
@@ -427,6 +427,7 @@ static ssize_t sriov_numvfs_store(struct device *dev,
 		return -ERANGE;
 
 	device_lock(&pdev->dev);
+	mutex_lock(&pdev->sriov->lock);
 
 	if (num_vfs == pdev->sriov->num_VFs)
 		goto exit;
@@ -468,6 +469,7 @@ static ssize_t sriov_numvfs_store(struct device *dev,
 			 num_vfs, ret);
 
 exit:
+	mutex_unlock(&pdev->sriov->lock);
 	device_unlock(&pdev->dev);
 
 	if (ret < 0)
@@ -808,6 +810,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
 		nres++;
 	}
 
+	mutex_init(&iov->lock);
 	iov->pos = pos;
 	iov->nres = nres;
 	iov->ctrl = ctrl;
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5ecbcf041179..04e636ab50e5 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -313,6 +313,7 @@ struct pci_sriov {
 	u16		subsystem_device; /* VF subsystem device */
 	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
 	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
+	struct mutex	lock; /* for synchronizing num_VFs sysfs accesses */
 };
 
 #ifdef CONFIG_PCI_DOE


* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-07 23:48     ` Jason Gunthorpe
  2023-12-08 17:07       ` Jim Harris
@ 2023-12-08 17:38       ` Jim Harris
  2023-12-08 17:41         ` Jason Gunthorpe
  1 sibling, 1 reply; 16+ messages in thread
From: Jim Harris @ 2023-12-08 17:38 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Alex Williamson, bhelgaas, linux-pci, kvm, ben

On Thu, Dec 07, 2023 at 07:48:10PM -0400, Jason Gunthorpe wrote:
> 
> The mechanism of waiting in remove for userspace is inherently flawed,
> it can never work fully correctly. :( I've hit this many times.
> 
> Upon remove VFIO should immediately remove itself and leave behind a
> non-functional file descriptor. Userspace should catch up eventually
> and see it is toast.

One nice aspect of the current design is that vfio will leave the BARs
mapped until userspace releases the vfio handle. It avoids some rather
nasty hacks for handling SIGBUS errors in the fast path (i.e. writing
NVMe doorbells) where we cannot try to check for device removal on
every MMIO write. Would your proposal immediately yank the BARs, without
waiting for userspace to respond? This is mostly for my curiosity - SPDK
already has these hacks implemented, so I don't think it would be
affected by this kind of change in behavior.
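
(For anyone unfamiliar, the hack is roughly the following pattern - a
simplified illustration, not the actual SPDK/DPDK code; page_size is assumed
to come from sysconf(_SC_PAGESIZE), and the handler is installed with
sigaction() and SA_SIGINFO:)

#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>

static long page_size;

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	/* The BAR mapping went away underneath us (hot-remove). Remap the
	 * faulting page to anonymous memory so in-flight doorbell writes are
	 * absorbed instead of killing the process.
	 */
	void *page = (void *)((uintptr_t)info->si_addr &
			      ~((uintptr_t)page_size - 1));

	mmap(page, page_size, PROT_READ | PROT_WRITE,
	     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}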


* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-08 17:38       ` Jim Harris
@ 2023-12-08 17:41         ` Jason Gunthorpe
  2023-12-08 17:59           ` Jim Harris
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2023-12-08 17:41 UTC (permalink / raw)
  To: Jim Harris; +Cc: Alex Williamson, bhelgaas, linux-pci, kvm, ben

On Fri, Dec 08, 2023 at 05:38:51PM +0000, Jim Harris wrote:
> On Thu, Dec 07, 2023 at 07:48:10PM -0400, Jason Gunthorpe wrote:
> > 
> > The mechanism of waiting in remove for userspace is inherently flawed,
> > it can never work fully correctly. :( I've hit this many times.
> > 
> > Upon remove VFIO should immediately remove itself and leave behind a
> > non-functional file descriptor. Userspace should catch up eventually
> > and see it is toast.
> 
> One nice aspect of the current design is that vfio will leave the BARs
> mapped until userspace releases the vfio handle. It avoids some rather
> nasty hacks for handling SIGBUS errors in the fast path (i.e. writing
> NVMe doorbells) where we cannot try to check for device removal on
> every MMIO write. Would your proposal immediately yank the BARs, without
> waiting for userspace to respond? This is mostly for my curiosity - SPDK
> already has these hacks implemented, so I don't think it would be
> affected by this kind of change in behavior.

What we did in RDMA was map a dummy page to the BARs so the sigbus was
avoided. But in that case RDMA knows the BAR memory is used only for
doorbell write so this is a reasonable thing to do.

Jason



* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-08 17:41         ` Jason Gunthorpe
@ 2023-12-08 17:59           ` Jim Harris
  2023-12-08 18:01             ` Jason Gunthorpe
  0 siblings, 1 reply; 16+ messages in thread
From: Jim Harris @ 2023-12-08 17:59 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Alex Williamson, bhelgaas, linux-pci, kvm, ben

On Fri, Dec 08, 2023 at 01:41:09PM -0400, Jason Gunthorpe wrote:
> On Fri, Dec 08, 2023 at 05:38:51PM +0000, Jim Harris wrote:
> > On Thu, Dec 07, 2023 at 07:48:10PM -0400, Jason Gunthorpe wrote:
> > > 
> > > The mechanism of waiting in remove for userspace is inherently flawed,
> > > it can never work fully correctly. :( I've hit this many times.
> > > 
> > > Upon remove VFIO should immediately remove itself and leave behind a
> > > non-functional file descriptor. Userspace should catch up eventually
> > > and see it is toast.
> > 
> > One nice aspect of the current design is that vfio will leave the BARs
> > mapped until userspace releases the vfio handle. It avoids some rather
> > nasty hacks for handling SIGBUS errors in the fast path (i.e. writing
> > NVMe doorbells) where we cannot try to check for device removal on
> > every MMIO write. Would your proposal immediately yank the BARs, without
> > waiting for userspace to respond? This is mostly for my curiosity - SPDK
> > already has these hacks implemented, so I don't think it would be
> > affected by this kind of change in behavior.
> 
> What we did in RDMA was map a dummy page to the BARs so the sigbus was
> avoided. But in that case RDMA knows the BAR memory is used only for
> doorbell write so this is a reasonable thing to do.

Yeah, this is exactly what SPDK (and DPDK) does today.


* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-08 17:59           ` Jim Harris
@ 2023-12-08 18:01             ` Jason Gunthorpe
  2023-12-08 18:12               ` Alex Williamson
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2023-12-08 18:01 UTC (permalink / raw)
  To: Jim Harris; +Cc: Alex Williamson, bhelgaas, linux-pci, kvm, ben

On Fri, Dec 08, 2023 at 05:59:17PM +0000, Jim Harris wrote:
> On Fri, Dec 08, 2023 at 01:41:09PM -0400, Jason Gunthorpe wrote:
> > On Fri, Dec 08, 2023 at 05:38:51PM +0000, Jim Harris wrote:
> > > On Thu, Dec 07, 2023 at 07:48:10PM -0400, Jason Gunthorpe wrote:
> > > > 
> > > > The mechanism of waiting in remove for userspace is inherently flawed,
> > > > it can never work fully correctly. :( I've hit this many times.
> > > > 
> > > > Upon remove VFIO should immediately remove itself and leave behind a
> > > > non-functional file descriptor. Userspace should catch up eventually
> > > > and see it is toast.
> > > 
> > > One nice aspect of the current design is that vfio will leave the BARs
> > > mapped until userspace releases the vfio handle. It avoids some rather
> > > nasty hacks for handling SIGBUS errors in the fast path (i.e. writing
> > > NVMe doorbells) where we cannot try to check for device removal on
> > > every MMIO write. Would your proposal immediately yank the BARs, without
> > > waiting for userspace to respond? This is mostly for my curiosity - SPDK
> > > already has these hacks implemented, so I don't think it would be
> > > affected by this kind of change in behavior.
> > 
> > What we did in RDMA was map a dummy page to the BARs so the sigbus was
> > avoided. But in that case RDMA knows the BAR memory is used only for
> > doorbell write so this is a reasonable thing to do.
> 
> Yeah, this is exactly what SPDK (and DPDK) does today.

To be clear, I mean we did it in the kernel.

When the device driver is removed we zap all the VMAs and install a
fault handler that installs the dummy page instead of SIGBUS

The application doesn't do anything, and this is how SPDK already will
be supporting device hot unplug of the RDMA drivers.
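
(Roughly: on remove, zap the user VMAs with unmap_mapping_range() and point
vm_ops->fault at something like the sketch below. The struct and field names
here are made up for illustration - this is not the actual RDMA code.)

/* ->fault handler used once the driver is removed and the original BAR
 * VMAs have been zapped.
 */
static vm_fault_t dummy_page_fault(struct vm_fault *vmf)
{
	struct my_mmap_ctx *ctx = vmf->vma->vm_private_data;

	/* Back the zapped VMA with a driver-owned dummy page so stray
	 * doorbell writes are absorbed instead of raising SIGBUS.
	 */
	get_page(ctx->dummy_page);
	vmf->page = ctx->dummy_page;
	return 0;
}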

Jason


* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-08 18:01             ` Jason Gunthorpe
@ 2023-12-08 18:12               ` Alex Williamson
  2023-12-08 19:43                 ` Jason Gunthorpe
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2023-12-08 18:12 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Jim Harris, bhelgaas, linux-pci, kvm, ben

On Fri, 8 Dec 2023 14:01:57 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Dec 08, 2023 at 05:59:17PM +0000, Jim Harris wrote:
> > On Fri, Dec 08, 2023 at 01:41:09PM -0400, Jason Gunthorpe wrote:  
> > > On Fri, Dec 08, 2023 at 05:38:51PM +0000, Jim Harris wrote:  
> > > > On Thu, Dec 07, 2023 at 07:48:10PM -0400, Jason Gunthorpe wrote:  
> > > > > 
> > > > > The mechanism of waiting in remove for userspace is inherently flawed,
> > > > > it can never work fully correctly. :( I've hit this many times.
> > > > > 
> > > > > Upon remove VFIO should immediately remove itself and leave behind a
> > > > > non-functional file descriptor. Userspace should catch up eventually
> > > > > and see it is toast.  
> > > > 
> > > > One nice aspect of the current design is that vfio will leave the BARs
> > > > mapped until userspace releases the vfio handle. It avoids some rather
> > > > nasty hacks for handling SIGBUS errors in the fast path (i.e. writing
> > > > NVMe doorbells) where we cannot try to check for device removal on
> > > > every MMIO write. Would your proposal immediately yank the BARs, without
> > > > waiting for userspace to respond? This is mostly for my curiosity - SPDK
> > > > already has these hacks implemented, so I don't think it would be
> > > > affected by this kind of change in behavior.  
> > > 
> > > What we did in RDMA was map a dummy page to the BARs so the sigbus was
> > > avoided. But in that case RDMA knows the BAR memory is used only for
> > > doorbell write so this is a reasonable thing to do.  
> > 
> > Yeah, this is exactly what SPDK (and DPDK) does today.  
> 
> To be clear, I mean we did it in the kernel.
> 
> When the device driver is removed we zap all the VMAs and install a
> fault handler that installs the dummy page instead of SIGBUS
> 
> The application doesn't do anything, and this is how SPDK already will
> be supporting device hot unplug of the RDMA drivers.

But I think you can only do that in the kernel because you understand
the device uses those pages for doorbells and it's not a general
purpose solution, right?

Perhaps a variant driver could do something similar for NVMe devices'
doorbell pages, but a device-agnostic driver like vfio-pci would need
to SIGBUS on access or else we risk significant data integrity issues.
Thanks,

Alex



* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-08 17:07       ` Jim Harris
@ 2023-12-08 19:41         ` Jason Gunthorpe
  2023-12-08 20:09           ` Jim Harris
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2023-12-08 19:41 UTC (permalink / raw)
  To: Jim Harris, Leon Romanovsky
  Cc: Alex Williamson, bhelgaas, linux-pci, kvm, ben

On Fri, Dec 08, 2023 at 05:07:22PM +0000, Jim Harris wrote:
> On Thu, Dec 07, 2023 at 07:48:10PM -0400, Jason Gunthorpe wrote:
> > On Thu, Dec 07, 2023 at 04:21:48PM -0700, Alex Williamson wrote:
> > > On Thu, 7 Dec 2023 22:38:23 +0000
> > > Jim Harris <jim.harris@samsung.com> wrote:
> > > 
> > > device_lock() has been a recurring problem.  We don't have a lot of
> > > leeway in how we support the driver remove callback, the device needs
> > > to be released.  We can't return -EBUSY and I don't think we can drop
> > > the mutex while we're waiting on userspace.
> > 
> > The mechanism of waiting in remove for userspace is inherently flawed,
> > it can never work fully correctly. :( I've hit this many times.
> > 
> > Upon remove VFIO should immediately remove itself and leave behind a
> > non-functional file descriptor. Userspace should catch up eventually
> > and see it is toast.
> > 
> > The kernel locking model just cannot support userspace delaying this
> > process.
> > 
> > Jason
> 
> Maybe for now we just whack this specific mole with a separate mutex
> for synchronizing access to sriov->num_VFs in the sysfs paths?
> Something like this (tested on my system):

TBH, I don't have the time right now to unpack this locking
mystery. Maybe Leon remembers?

device_lock() gets everywhere and does a lot of different stuff, so I
would be surprised if it was so easy..

Jason


* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-08 18:12               ` Alex Williamson
@ 2023-12-08 19:43                 ` Jason Gunthorpe
  0 siblings, 0 replies; 16+ messages in thread
From: Jason Gunthorpe @ 2023-12-08 19:43 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jim Harris, bhelgaas, linux-pci, kvm, ben

On Fri, Dec 08, 2023 at 11:12:15AM -0700, Alex Williamson wrote:

> > > > avoided. But in that case RDMA knows the BAR memory is used only for
> > > > doorbell write so this is a reasonable thing to do.  
> > > 
> > > Yeah, this is exactly what SPDK (and DPDK) does today.  
> > 
> > To be clear, I mean we did it in the kernel.
> > 
> > When the device driver is removed we zap all the VMAs and install a
> > fault handler that installs the dummy page instead of SIGBUS
> > 
> > The application doesn't do anything, and this is how SPDK already will
> > be supporting device hot unplug of the RDMA drivers.
> 
> But I think you can only do that in the kernel because you understand
> the device uses those pages for doorbells and it's not a general
> purpose solution, right?
> 
> Perhaps a variant driver could do something similar for NVMe devices'
> doorbell pages, but a device-agnostic driver like vfio-pci would need
> to SIGBUS on access or else we risk significant data integrity issues.
> Thanks,

Yes, basically.

Might be interesting to consider having a VFIO FEATURE flag to opt
into SIGBUS or dummy page, perhaps even on a VMA by VMA basis.
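
(Purely illustrative - VFIO_DEVICE_FEATURE_DUMMY_PAGE does not exist; this
just shows how a userspace opt-in could look through the existing
VFIO_DEVICE_FEATURE ioctl, with device_fd being the open vfio device fd:)

struct vfio_device_feature feature = {
	.argsz = sizeof(feature),
	/* hypothetical feature number */
	.flags = VFIO_DEVICE_FEATURE_SET | VFIO_DEVICE_FEATURE_DUMMY_PAGE,
};

if (ioctl(device_fd, VFIO_DEVICE_FEATURE, &feature))
	/* not supported; keep the SIGBUS handling fallback */;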

Jason


* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-08 19:41         ` Jason Gunthorpe
@ 2023-12-08 20:09           ` Jim Harris
  2023-12-10 19:05             ` Jason Gunthorpe
  0 siblings, 1 reply; 16+ messages in thread
From: Jim Harris @ 2023-12-08 20:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Alex Williamson, bhelgaas, linux-pci, kvm, ben,
	pierre.cregut

On Fri, Dec 08, 2023 at 03:41:59PM -0400, Jason Gunthorpe wrote:
> On Fri, Dec 08, 2023 at 05:07:22PM +0000, Jim Harris wrote:
> > 
> > Maybe for now we just whack this specific mole with a separate mutex
> > for synchronizing access to sriov->num_VFs in the sysfs paths?
> > Something like this (tested on my system):
> 
> TBH, I don't have the time right now to unpack this locking
> mystery. Maybe Leon remembers?
> 
> device_lock() gets everywhere and does a lot of different stuff, so I
> would be surprised if it was so easy..

The store() side still keeps the device_lock(), it just also acquires this
new sriov lock. So store() side should observe zero differences. The only
difference is now the show() side can acquire just the more-granular lock,
since it is only trying to synchronize on sriov->num_VFs with the store()
side. But maybe I'm missing something subtle here...

Adding Pierre who authored the 35ff867b7 commit.


* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-08 20:09           ` Jim Harris
@ 2023-12-10 19:05             ` Jason Gunthorpe
  2023-12-11  7:20               ` Leon Romanovsky
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2023-12-10 19:05 UTC (permalink / raw)
  To: Jim Harris
  Cc: Leon Romanovsky, Alex Williamson, bhelgaas, linux-pci, kvm, ben,
	pierre.cregut

On Fri, Dec 08, 2023 at 08:09:34PM +0000, Jim Harris wrote:
> On Fri, Dec 08, 2023 at 03:41:59PM -0400, Jason Gunthorpe wrote:
> > On Fri, Dec 08, 2023 at 05:07:22PM +0000, Jim Harris wrote:
> > > 
> > > Maybe for now we just whack this specific mole with a separate mutex
> > > for synchronizing access to sriov->num_VFs in the sysfs paths?
> > > Something like this (tested on my system):
> > 
> > TBH, I don't have the time right now to unpack this locking
> > mystery. Maybe Leon remembers?
> > 
> > device_lock() gets everywhere and does a lot of different stuff, so I
> > would be surprised if it was so easy..
> 
> The store() side still keeps the device_lock(), it just also acquires this
> new sriov lock. So store() side should observe zero differences. The only
> difference is now the show() side can acquire just the more-granular lock,
> since it is only trying to synchronize on sriov->num_VFs with the store()
> side. But maybe I'm missing something subtle here...

Oh if that is the only goal then probably a READ_ONCE is fine
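
i.e. something like (sketch):

	/* reader side: sample the current value without any lock */
	num_vfs = READ_ONCE(pdev->sriov->num_VFs);

	/* paired with the writer side in sriov_enable()/sriov_disable() */
	WRITE_ONCE(iov->num_VFs, nr_virtfn);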

Jason


* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-10 19:05             ` Jason Gunthorpe
@ 2023-12-11  7:20               ` Leon Romanovsky
  2023-12-12 21:34                 ` Jim Harris
  0 siblings, 1 reply; 16+ messages in thread
From: Leon Romanovsky @ 2023-12-11  7:20 UTC (permalink / raw)
  To: Jason Gunthorpe, Jim Harris
  Cc: Alex Williamson, bhelgaas, linux-pci, kvm, ben, pierre.cregut

On Sun, Dec 10, 2023 at 03:05:49PM -0400, Jason Gunthorpe wrote:
> On Fri, Dec 08, 2023 at 08:09:34PM +0000, Jim Harris wrote:
> > On Fri, Dec 08, 2023 at 03:41:59PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Dec 08, 2023 at 05:07:22PM +0000, Jim Harris wrote:
> > > > 
> > > > Maybe for now we just whack this specific mole with a separate mutex
> > > > for synchronizing access to sriov->num_VFs in the sysfs paths?
> > > > Something like this (tested on my system):
> > > 
> > > TBH, I don't have the time right now to unpack this locking
> > > mystery. Maybe Leon remembers?
> > > 
> > > device_lock() gets everywhere and does a lot of different stuff, so I
> > > would be surprised if it was so easy..
> > 
> > The store() side still keeps the device_lock(), it just also acquires this
> > new sriov lock. So store() side should observe zero differences. The only
> > difference is now the show() side can acquire just the more-granular lock,
> > since it is only trying to synchronize on sriov->num_VFs with the store()
> > side. But maybe I'm missing something subtle here...
> 
> Oh if that is the only goal then probably a READ_ONCE is fine

I would say it is worth reverting the patch
35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes"),
as there is no promise that netdev devices (as presented in the script in
https://bugzilla.kernel.org/show_bug.cgi?id=202991), which have a different
lifetime model, will only appear after the sysfs changes on the PF.

A netlink event means netdev FOO is ready, and if someone needs to follow
sriov_numvfs, he/she should listen to sysfs events.

In addition, I would do this change:

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 25dbe85c4217..3b768e20c7ab 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -683,8 +683,8 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
        if (rc)
                goto err_pcibios;

-       kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
        iov->num_VFs = nr_virtfn;
+       kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);

        return 0;



Thanks

> 
> Jason
> 


* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-11  7:20               ` Leon Romanovsky
@ 2023-12-12 21:34                 ` Jim Harris
  2023-12-13  6:55                   ` Leon Romanovsky
  0 siblings, 1 reply; 16+ messages in thread
From: Jim Harris @ 2023-12-12 21:34 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Alex Williamson, bhelgaas, linux-pci, kvm, ben,
	pierre.cregut

On Mon, Dec 11, 2023 at 09:20:06AM +0200, Leon Romanovsky wrote:
> On Sun, Dec 10, 2023 at 03:05:49PM -0400, Jason Gunthorpe wrote:
> > On Fri, Dec 08, 2023 at 08:09:34PM +0000, Jim Harris wrote:
> > > 
> > > The store() side still keeps the device_lock(), it just also acquires this
> > > new sriov lock. So store() side should observe zero differences. The only
> > > difference is now the show() side can acquire just the more-granular lock,
> > > since it is only trying to synchronize on sriov->num_VFs with the store()
> > > side. But maybe I'm missing something subtle here...
> > 
> > Oh if that is the only goal then probably a READ_ONCE is fine

IIUC, the synchronization was to block readers of sriov_numvfs if a writer was
in the middle of driver->sriov_configure(). Presumably sriov_configure()
can take a long time, and it was better to block the sysfs read rather than
return a stale value.

> I would say it is worth reverting the patch
> 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes"),
> as there is no promise that netdev devices (as presented in the script in
> https://bugzilla.kernel.org/show_bug.cgi?id=202991), which have a different
> lifetime model, will only appear after the sysfs changes on the PF.

But I guess you're saying using the sysfs change as any kind of indicator
is wrong to begin with.

> A netlink event means netdev FOO is ready, and if someone needs to follow
> sriov_numvfs, he/she should listen to sysfs events.
> 
> In addition, I would do this change:
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 25dbe85c4217..3b768e20c7ab 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -683,8 +683,8 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>         if (rc)
>                 goto err_pcibios;
> 
> -       kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
>         iov->num_VFs = nr_virtfn;
> +       kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);

Ack. I'll post patches for both of these suggestions.


* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-12 21:34                 ` Jim Harris
@ 2023-12-13  6:55                   ` Leon Romanovsky
  0 siblings, 0 replies; 16+ messages in thread
From: Leon Romanovsky @ 2023-12-13  6:55 UTC (permalink / raw)
  To: Jim Harris
  Cc: Jason Gunthorpe, Alex Williamson, bhelgaas, linux-pci, kvm, ben,
	pierre.cregut

On Tue, Dec 12, 2023 at 09:34:43PM +0000, Jim Harris wrote:
> On Mon, Dec 11, 2023 at 09:20:06AM +0200, Leon Romanovsky wrote:
> > On Sun, Dec 10, 2023 at 03:05:49PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Dec 08, 2023 at 08:09:34PM +0000, Jim Harris wrote:
> > > > 
> > > > The store() side still keeps the device_lock(), it just also acquires this
> > > > new sriov lock. So store() side should observe zero differences. The only
> > > > difference is now the show() side can acquire just the more-granular lock,
> > > > since it is only trying to synchronize on sriov->num_VFs with the store()
> > > > side. But maybe I'm missing something subtle here...
> > > 
> > > Oh if that is the only goal then probably a READ_ONCE is fine
> 
> IIUC, the synchronization was to block readers of sriov_numvfs if a writer was
> in the middle of driver->sriov_configure(). Presumably sriov_configure()
> can take a long time, and it was better to block the sysfs read rather than
> return a stale value.

Nothing prevents the user from writing to sriov_numvfs in one thread and
reading it from another thread. The user will get a stale value anyway.

> 
> > I would say it is worth reverting the patch
> > 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes"),
> > as there is no promise that netdev devices (as presented in the script in
> > https://bugzilla.kernel.org/show_bug.cgi?id=202991), which have a different
> > lifetime model, will only appear after the sysfs changes on the PF.
> 
> But I guess you're saying using the sysfs change as any kind of indicator
> is wrong to begin with.

Yes

> 
> > A netlink event means netdev FOO is ready, and if someone needs to follow
> > sriov_numvfs, he/she should listen to sysfs events.
> > 
> > In addition, I would do this change:
> > 
> > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > index 25dbe85c4217..3b768e20c7ab 100644
> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> > @@ -683,8 +683,8 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> >         if (rc)
> >                 goto err_pcibios;
> > 
> > -       kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
> >         iov->num_VFs = nr_virtfn;
> > +       kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
> 
> Ack. I'll post patches for both of these suggestions.

Thanks

