* [PATCH v3 0/6] vpci: first series in preparation for vpci on ARM
@ 2023-03-14 20:56 Volodymyr Babchuk
From: Volodymyr Babchuk @ 2023-03-14 20:56 UTC
To: xen-devel
Cc: Volodymyr Babchuk, Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu, Roger Pau Monné, Paul Durrant, Kevin Tian

This patch set is a spiritual successor of "[PATCH v2 0/4] vpci: first
series in preparation for vpci on ARM", but most of its contents were
reworked.

The main aim of these patches is to allow vPCI MMIO handlers to work
with DomUs, not only with Dom0. To do this, we need to protect a pdev
from being removed while it is still in use. Jan suggested using
reference counting for this, so this series includes patches from
another series ("[RFC] Rework PCI locking") that implement reference
counting for pdevs. With reference counting implemented, it becomes
possible to rework PCI locking further.

Oleksandr Andrushchenko (1):
  vpci: restrict unhandled read/write operations for guests

Volodymyr Babchuk (5):
  xen: add reference counter support
  xen: pci: introduce reference counting for pdev
  vpci: crash domain if we wasn't able to (un) map vPCI regions
  vpci: use reference counter to protect vpci state
  xen: pci: print reference counter when dumping pci_devs

 xen/arch/x86/hvm/vmsi.c                  |   2 +-
 xen/arch/x86/irq.c                       |   4 +
 xen/arch/x86/msi.c                       |  44 ++++++-
 xen/arch/x86/pci.c                       |   3 +
 xen/arch/x86/physdev.c                   |  17 ++-
 xen/common/sysctl.c                      |   7 +-
 xen/drivers/passthrough/amd/iommu_init.c |  12 +-
 xen/drivers/passthrough/amd/iommu_map.c  |   6 +-
 xen/drivers/passthrough/pci.c            | 141 +++++++++++++++--------
 xen/drivers/passthrough/vtd/quirks.c     |   2 +
 xen/drivers/video/vga.c                  |   7 +-
 xen/drivers/vpci/header.c                |  11 +-
 xen/drivers/vpci/vpci.c                  |  31 ++++-
 xen/include/xen/pci.h                    |  18 +++
 xen/include/xen/refcnt.h                 |  59 ++++++++++
 15 files changed, 293 insertions(+), 71 deletions(-)
 create mode 100644 xen/include/xen/refcnt.h

-- 
2.39.2
* [PATCH v3 1/6] xen: add reference counter support
@ 2023-03-14 20:56 Volodymyr Babchuk
From: Volodymyr Babchuk @ 2023-03-14 20:56 UTC
To: xen-devel
Cc: Volodymyr Babchuk, Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu

We can use reference counter to ease up object lifetime management.
This patch adds very basic support for reference counters. refcnt
should be used in the following way:

1. Protected structure should have refcnt_t field

2. This field should be initialized with refcnt_init() during object
construction.

3. If code holds a valid pointer to a structure/object it can increase
refcount with refcnt_get(). No additional locking is required.

4. Code should call refcnt_put() before dropping pointer to a
protected structure. `destructor` is a call back function that should
destruct object and free all resources, including structure protected
itself. Destructor will be called if reference counter reaches zero.

5. If code does not hold a valid pointer to a protected structure it
should use other locking mechanism to obtain a pointer. For example,
it should lock a list that hold protected objects.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

---
v3:
  - moved in from another patch series
  - used structure to encapsulate refcnt_t
  - removed #ifndef NDEBUG in favor of just calling ASSERT
  - added assertion to refcnt_put
  - added saturation support: code catches overflow and underflow
  - added EMACS magic at end of the file
  - fixed formatting
---
 xen/include/xen/refcnt.h | 59 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)
 create mode 100644 xen/include/xen/refcnt.h

diff --git a/xen/include/xen/refcnt.h b/xen/include/xen/refcnt.h
new file mode 100644
index 0000000000..1ac05d927c
--- /dev/null
+++ b/xen/include/xen/refcnt.h
@@ -0,0 +1,59 @@
+#ifndef __XEN_REFCNT_H__
+#define __XEN_REFCNT_H__
+
+#include <asm/atomic.h>
+#include <asm/bug.h>
+
+#define REFCNT_SATURATED (INT_MIN / 2)
+
+typedef struct {
+    atomic_t refcnt;
+} refcnt_t;
+
+static inline void refcnt_init(refcnt_t *refcnt)
+{
+    atomic_set(&refcnt->refcnt, 1);
+}
+
+static inline int refcnt_read(refcnt_t *refcnt)
+{
+    return atomic_read(&refcnt->refcnt);
+}
+
+static inline void refcnt_get(refcnt_t *refcnt)
+{
+    int old = atomic_add_unless(&refcnt->refcnt, 1, 0);
+
+    if ( unlikely(old < 0) || unlikely (old + 1 < 0) )
+    {
+        atomic_set(&refcnt->refcnt, REFCNT_SATURATED);
+        printk(XENLOG_ERR"refcnt saturation: old = %d\n", old);
+        WARN();
+    }
+}
+
+static inline void refcnt_put(refcnt_t *refcnt, void (*destructor)(refcnt_t *refcnt))
+{
+    int ret = atomic_dec_return(&refcnt->refcnt);
+
+    if ( ret == 0 )
+        destructor(refcnt);
+
+    if ( unlikely(ret < 0))
+    {
+        atomic_set(&refcnt->refcnt, REFCNT_SATURATED);
+        printk(XENLOG_ERR"refcnt already hit 0: val = %d\n", ret);
+        WARN();
+    }
+}
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
2.39.2
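The five usage rules in the commit message can be sketched as a small standalone program fragment. This is only an illustration under stated assumptions: it uses C11 <stdatomic.h> as a stand-in for Xen's atomic_t, it leaves out the saturation handling from the patch, and struct widget, widget_create(), widget_destroy() and widgets_destroyed are hypothetical names that do not appear in the series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the patch's refcnt_t; the real one wraps Xen's atomic_t. */
typedef struct {
    atomic_int refcnt;
} refcnt_t;

static inline void refcnt_init(refcnt_t *r)
{
    atomic_init(&r->refcnt, 1);
}

static inline void refcnt_get(refcnt_t *r)
{
    /* Rule 3: the caller must already hold a valid reference. */
    int old = atomic_fetch_add(&r->refcnt, 1);

    assert(old > 0);
    (void)old;
}

static inline void refcnt_put(refcnt_t *r, void (*destructor)(refcnt_t *))
{
    /* fetch_sub returns the old value: 1 means the last reference is gone. */
    if ( atomic_fetch_sub(&r->refcnt, 1) == 1 )
        destructor(r);
}

/* Hypothetical protected object (rule 1: it embeds a refcnt_t). */
struct widget {
    int id;
    refcnt_t ref;
};

static int widgets_destroyed; /* counts destructor calls, for illustration */

/* Rule 4: the destructor frees everything, including the object itself. */
static void widget_destroy(refcnt_t *r)
{
    struct widget *w =
        (struct widget *)((char *)r - offsetof(struct widget, ref));

    widgets_destroyed++;
    free(w);
}

static struct widget *widget_create(int id)
{
    struct widget *w = malloc(sizeof(*w));

    if ( !w )
        return NULL;

    w->id = id;
    refcnt_init(&w->ref); /* rule 2: start with one reference */

    return w;
}
```

A caller would pair every refcnt_get() with a refcnt_put(&w->ref, widget_destroy); the object is freed only when the creator's initial reference is dropped as well.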
* Re: [PATCH v3 1/6] xen: add reference counter support 2023-03-14 20:56 ` [PATCH v3 1/6] xen: add reference counter support Volodymyr Babchuk @ 2023-03-16 13:54 ` Roger Pau Monné 2023-03-16 14:03 ` Jan Beulich 2023-04-11 22:27 ` Volodymyr Babchuk 2023-03-16 16:19 ` Roger Pau Monné 2023-03-16 17:01 ` Jan Beulich 2 siblings, 2 replies; 50+ messages in thread From: Roger Pau Monné @ 2023-03-16 13:54 UTC (permalink / raw) To: Volodymyr Babchuk Cc: xen-devel, Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: > We can use reference counter to ease up object lifetime management. > This patch adds very basic support for reference counters. refcnt > should be used in the following way: > > 1. Protected structure should have refcnt_t field > > 2. This field should be initialized with refcnt_init() during object > construction. > > 3. If code holds a valid pointer to a structure/object it can increase > refcount with refcnt_get(). No additional locking is required. > > 4. Code should call refcnt_put() before dropping pointer to a > protected structure. `destructor` is a call back function that should > destruct object and free all resources, including structure protected > itself. Destructor will be called if reference counter reaches zero. > > 5. If code does not hold a valid pointer to a protected structure it > should use other locking mechanism to obtain a pointer. For example, > it should lock a list that hold protected objects. Sorry, I didn't look at the previous versions, but did we consider importing refcount_t and related logic from Linux? 
> > Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> > > --- > v3: > - moved in from another patch series > - used structure to encapsulate refcnt_t > - removed #ifndef NDEBUG in favor of just calling ASSERT > - added assertion to refcnt_put > - added saturation support: code catches overflow and underflow > - added EMACS magic at end of the file > - fixed formatting > --- > xen/include/xen/refcnt.h | 59 ++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 59 insertions(+) > create mode 100644 xen/include/xen/refcnt.h > > diff --git a/xen/include/xen/refcnt.h b/xen/include/xen/refcnt.h > new file mode 100644 > index 0000000000..1ac05d927c > --- /dev/null > +++ b/xen/include/xen/refcnt.h > @@ -0,0 +1,59 @@ This seems to be missing some kind of license, can we have an SPDX tag at least? > +#ifndef __XEN_REFCNT_H__ > +#define __XEN_REFCNT_H__ > + > +#include <asm/atomic.h> > +#include <asm/bug.h> > + > +#define REFCNT_SATURATED (INT_MIN / 2) > + > +typedef struct { > + atomic_t refcnt; > +} refcnt_t; > + > +static inline void refcnt_init(refcnt_t *refcnt) > +{ > + atomic_set(&refcnt->refcnt, 1); > +} > + > +static inline int refcnt_read(refcnt_t *refcnt) const. > +{ > + return atomic_read(&refcnt->refcnt); > +} > + > +static inline void refcnt_get(refcnt_t *refcnt) > +{ > + int old = atomic_add_unless(&refcnt->refcnt, 1, 0); > + > + if ( unlikely(old < 0) || unlikely (old + 1 < 0) ) ^ extra space You want a single unlikely for both conditions. > + { > + atomic_set(&refcnt->refcnt, REFCNT_SATURATED); > + printk(XENLOG_ERR"refcnt saturation: old = %d\n", old); Should this be printed only once for refcount? I fear it might spam the console once a refcnt hits it. 
> + WARN(); > + } > +} > + > +static inline void refcnt_put(refcnt_t *refcnt, void (*destructor)(refcnt_t *refcnt)) > +{ > + int ret = atomic_dec_return(&refcnt->refcnt); > + > + if ( ret == 0 ) > + destructor(refcnt); > + > + if ( unlikely(ret < 0)) ^ missing space > + { > + atomic_set(&refcnt->refcnt, REFCNT_SATURATED); > + printk(XENLOG_ERR"refcnt already hit 0: val = %d\n", ret); Same here regarding the spamming. > + WARN(); > + } > +} > + Extra newline. I will look at further patches to see how this gets used. Thanks, Roger. ^ permalink raw reply [flat|nested] 50+ messages in thread
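One possible way to address the console-spam concern raised in this review is to latch a per-counter flag so that the message is printed only the first time. The sketch below is an assumption, not what the series does: the warned field, refcnt_saturate(), warnings_printed, destroyed and count_destroy() are hypothetical, C11 atomics stand in for Xen's atomic_t, and fprintf() stands in for printk().

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define REFCNT_SATURATED (INT_MIN / 2)

typedef struct {
    atomic_int refcnt;
    atomic_bool warned; /* hypothetical field, not part of the patch */
} refcnt_t;

static int warnings_printed; /* illustration only, not thread safe */
static int destroyed;        /* illustration only, not thread safe */

/* Hypothetical destructor that only counts how often it runs. */
static void count_destroy(refcnt_t *r)
{
    (void)r;
    destroyed++;
}

static void refcnt_saturate(refcnt_t *r, const char *what, int val)
{
    assert(val < 0); /* only reached on the underflow path in this sketch */

    atomic_store(&r->refcnt, REFCNT_SATURATED);

    /* atomic_exchange returns the previous flag: only the first caller prints. */
    if ( !atomic_exchange(&r->warned, true) )
    {
        fprintf(stderr, "refcnt %s: val = %d\n", what, val);
        warnings_printed++;
    }
}

static void refcnt_put(refcnt_t *r, void (*destructor)(refcnt_t *))
{
    /* fetch_sub returns the old value, so subtract one to get the new one. */
    int ret = atomic_fetch_sub(&r->refcnt, 1) - 1;

    if ( ret == 0 )
        destructor(r);
    else if ( ret < 0 )
        refcnt_saturate(r, "already hit 0", ret);
}
```

The flag costs one extra word per counter; a global rate limit would avoid that cost but would also hide which counter misbehaved.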
* Re: [PATCH v3 1/6] xen: add reference counter support 2023-03-16 13:54 ` Roger Pau Monné @ 2023-03-16 14:03 ` Jan Beulich 2023-03-16 16:21 ` Roger Pau Monné 2023-04-11 22:27 ` Volodymyr Babchuk 1 sibling, 1 reply; 50+ messages in thread From: Jan Beulich @ 2023-03-16 14:03 UTC (permalink / raw) To: Roger Pau Monné Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu, Volodymyr Babchuk On 16.03.2023 14:54, Roger Pau Monné wrote: > On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: >> --- /dev/null >> +++ b/xen/include/xen/refcnt.h >> @@ -0,0 +1,59 @@ > > This seems to be missing some kind of license, can we have an SPDX tag > at least? Not "at least", but strictly that way for any new code. Patches to convert various existing code to SPDX are already pending iirc. >> +#ifndef __XEN_REFCNT_H__ >> +#define __XEN_REFCNT_H__ >> + >> +#include <asm/atomic.h> >> +#include <asm/bug.h> >> + >> +#define REFCNT_SATURATED (INT_MIN / 2) >> + >> +typedef struct { >> + atomic_t refcnt; >> +} refcnt_t; >> + >> +static inline void refcnt_init(refcnt_t *refcnt) >> +{ >> + atomic_set(&refcnt->refcnt, 1); >> +} >> + >> +static inline int refcnt_read(refcnt_t *refcnt) > > const. > >> +{ >> + return atomic_read(&refcnt->refcnt); >> +} >> + >> +static inline void refcnt_get(refcnt_t *refcnt) >> +{ >> + int old = atomic_add_unless(&refcnt->refcnt, 1, 0); >> + >> + if ( unlikely(old < 0) || unlikely (old + 1 < 0) ) > ^ extra space > > You want a single unlikely for both conditions. Are you sure? My experience was generally the other way around: likely() and unlikely() become ineffectual as soon as the compiler needs more than one branch for the inner construct (ie generally for and && or ||). Jan ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v3 1/6] xen: add reference counter support 2023-03-16 14:03 ` Jan Beulich @ 2023-03-16 16:21 ` Roger Pau Monné 0 siblings, 0 replies; 50+ messages in thread From: Roger Pau Monné @ 2023-03-16 16:21 UTC (permalink / raw) To: Jan Beulich Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu, Volodymyr Babchuk On Thu, Mar 16, 2023 at 03:03:42PM +0100, Jan Beulich wrote: > On 16.03.2023 14:54, Roger Pau Monné wrote: > > On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: > >> +{ > >> + return atomic_read(&refcnt->refcnt); > >> +} > >> + > >> +static inline void refcnt_get(refcnt_t *refcnt) > >> +{ > >> + int old = atomic_add_unless(&refcnt->refcnt, 1, 0); > >> + > >> + if ( unlikely(old < 0) || unlikely (old + 1 < 0) ) > > ^ extra space > > > > You want a single unlikely for both conditions. > > Are you sure? My experience was generally the other way around: likely() > and unlikely() become ineffectual as soon as the compiler needs more > than one branch for the inner construct (ie generally for and && or ||). Oh, OK, never mind then. We have examples of both in the code base. Roger. ^ permalink raw reply [flat|nested] 50+ messages in thread
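For readers unfamiliar with the two spellings discussed above, here is a minimal sketch. It assumes the usual GCC/Clang definition of unlikely() via __builtin_expect; bad_count_split(), bad_count_joined() and check_forms_agree() are hypothetical helpers. The overflow test is written as old == INT_MAX rather than old + 1 < 0, because signed overflow is undefined in plain ISO C. The hint only influences code layout, so both forms return the same result; which one the compiler honours better is the open question in the thread.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Common definition; relies on the GCC/Clang builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* One hint per condition, as written in the patch. */
static bool bad_count_split(int old)
{
    if ( unlikely(old < 0) || unlikely(old == INT_MAX) )
        return true;

    return false;
}

/* One hint around the whole expression, as first suggested in review. */
static bool bad_count_joined(int old)
{
    if ( unlikely(old < 0 || old == INT_MAX) )
        return true;

    return false;
}

/* The hint never changes the result: both forms agree for any input. */
static void check_forms_agree(void)
{
    const int samples[] = { INT_MIN, -1, 0, 1, INT_MAX - 1, INT_MAX };

    for ( unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++ )
        assert(bad_count_split(samples[i]) == bad_count_joined(samples[i]));
}
```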
* Re: [PATCH v3 1/6] xen: add reference counter support 2023-03-16 13:54 ` Roger Pau Monné 2023-03-16 14:03 ` Jan Beulich @ 2023-04-11 22:27 ` Volodymyr Babchuk 2023-04-12 10:12 ` Roger Pau Monné 1 sibling, 1 reply; 50+ messages in thread From: Volodymyr Babchuk @ 2023-04-11 22:27 UTC (permalink / raw) To: Roger Pau Monné Cc: xen-devel, Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu Hello Roger, Sorry for the late answer. Roger Pau Monné <roger.pau@citrix.com> writes: > On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: >> We can use reference counter to ease up object lifetime management. >> This patch adds very basic support for reference counters. refcnt >> should be used in the following way: >> >> 1. Protected structure should have refcnt_t field >> >> 2. This field should be initialized with refcnt_init() during object >> construction. >> >> 3. If code holds a valid pointer to a structure/object it can increase >> refcount with refcnt_get(). No additional locking is required. >> >> 4. Code should call refcnt_put() before dropping pointer to a >> protected structure. `destructor` is a call back function that should >> destruct object and free all resources, including structure protected >> itself. Destructor will be called if reference counter reaches zero. >> >> 5. If code does not hold a valid pointer to a protected structure it >> should use other locking mechanism to obtain a pointer. For example, >> it should lock a list that hold protected objects. > > Sorry, I didn't look at the previous versions, but did we consider > importing refcount_t and related logic from Linux? Well, I considered this. But it is more complex. It has separate refcount module, which just counts references + kref code, that is capable of calling destructors. I am not sure if Xen need this division. In any case, I tried to replicate Linux behavior as close as possible. 
On the other hand, Jan suggests reworking the API, so it will differ
from the Linux one...

-- 
WBR, Volodymyr
* Re: [PATCH v3 1/6] xen: add reference counter support 2023-04-11 22:27 ` Volodymyr Babchuk @ 2023-04-12 10:12 ` Roger Pau Monné 0 siblings, 0 replies; 50+ messages in thread From: Roger Pau Monné @ 2023-04-12 10:12 UTC (permalink / raw) To: Volodymyr Babchuk Cc: xen-devel, Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu On Tue, Apr 11, 2023 at 10:27:45PM +0000, Volodymyr Babchuk wrote: > > Hello Roger, > > Sorry for the late answer. > > Roger Pau Monné <roger.pau@citrix.com> writes: > > > On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: > >> We can use reference counter to ease up object lifetime management. > >> This patch adds very basic support for reference counters. refcnt > >> should be used in the following way: > >> > >> 1. Protected structure should have refcnt_t field > >> > >> 2. This field should be initialized with refcnt_init() during object > >> construction. > >> > >> 3. If code holds a valid pointer to a structure/object it can increase > >> refcount with refcnt_get(). No additional locking is required. > >> > >> 4. Code should call refcnt_put() before dropping pointer to a > >> protected structure. `destructor` is a call back function that should > >> destruct object and free all resources, including structure protected > >> itself. Destructor will be called if reference counter reaches zero. > >> > >> 5. If code does not hold a valid pointer to a protected structure it > >> should use other locking mechanism to obtain a pointer. For example, > >> it should lock a list that hold protected objects. > > > > Sorry, I didn't look at the previous versions, but did we consider > > importing refcount_t and related logic from Linux? > > Well, I considered this. But it is more complex. It has separate > refcount module, which just counts references + kref code, that is > capable of calling destructors. I am not sure if Xen need this > division. 
> In any case, I tried to replicate Linux behavior as close as
> possible. On other hand, Jan suggests to rework API, so it will be
> differ from Linux one...

OK, just asking because it's likely for the interface to grow if there
are more users of refcounting, and at some point we might need a set
of features similar to Linux.

Thanks, Roger.
* Re: [PATCH v3 1/6] xen: add reference counter support 2023-03-14 20:56 ` [PATCH v3 1/6] xen: add reference counter support Volodymyr Babchuk 2023-03-16 13:54 ` Roger Pau Monné @ 2023-03-16 16:19 ` Roger Pau Monné 2023-03-16 16:32 ` Jan Beulich 2023-03-16 17:01 ` Jan Beulich 2 siblings, 1 reply; 50+ messages in thread From: Roger Pau Monné @ 2023-03-16 16:19 UTC (permalink / raw) To: Volodymyr Babchuk Cc: xen-devel, Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: > We can use reference counter to ease up object lifetime management. > This patch adds very basic support for reference counters. refcnt > should be used in the following way: > > 1. Protected structure should have refcnt_t field > > 2. This field should be initialized with refcnt_init() during object > construction. > > 3. If code holds a valid pointer to a structure/object it can increase > refcount with refcnt_get(). No additional locking is required. > > 4. Code should call refcnt_put() before dropping pointer to a > protected structure. `destructor` is a call back function that should > destruct object and free all resources, including structure protected > itself. Destructor will be called if reference counter reaches zero. > > 5. If code does not hold a valid pointer to a protected structure it > should use other locking mechanism to obtain a pointer. For example, > it should lock a list that hold protected objects. 
> > Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> > > --- > v3: > - moved in from another patch series > - used structure to encapsulate refcnt_t > - removed #ifndef NDEBUG in favor of just calling ASSERT > - added assertion to refcnt_put > - added saturation support: code catches overflow and underflow > - added EMACS magic at end of the file > - fixed formatting > --- > xen/include/xen/refcnt.h | 59 ++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 59 insertions(+) > create mode 100644 xen/include/xen/refcnt.h > > diff --git a/xen/include/xen/refcnt.h b/xen/include/xen/refcnt.h > new file mode 100644 > index 0000000000..1ac05d927c > --- /dev/null > +++ b/xen/include/xen/refcnt.h > @@ -0,0 +1,59 @@ > +#ifndef __XEN_REFCNT_H__ > +#define __XEN_REFCNT_H__ > + > +#include <asm/atomic.h> > +#include <asm/bug.h> > + > +#define REFCNT_SATURATED (INT_MIN / 2) > + > +typedef struct { > + atomic_t refcnt; > +} refcnt_t; > + > +static inline void refcnt_init(refcnt_t *refcnt) > +{ > + atomic_set(&refcnt->refcnt, 1); > +} > + > +static inline int refcnt_read(refcnt_t *refcnt) > +{ > + return atomic_read(&refcnt->refcnt); > +} > + > +static inline void refcnt_get(refcnt_t *refcnt) > +{ > + int old = atomic_add_unless(&refcnt->refcnt, 1, 0); Occurred to me while looking at the next patch: Don't you also need to print a warning (and saturate the counter maybe?) if old == 0, as that would imply the caller is attempting to take a reference of an object that should be destroyed? IOW: it would point to some kind of memory leak. Thanks, Roger. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v3 1/6] xen: add reference counter support 2023-03-16 16:19 ` Roger Pau Monné @ 2023-03-16 16:32 ` Jan Beulich 2023-03-16 16:39 ` Roger Pau Monné 0 siblings, 1 reply; 50+ messages in thread From: Jan Beulich @ 2023-03-16 16:32 UTC (permalink / raw) To: Roger Pau Monné Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu, Volodymyr Babchuk On 16.03.2023 17:19, Roger Pau Monné wrote: > On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: >> +static inline void refcnt_get(refcnt_t *refcnt) >> +{ >> + int old = atomic_add_unless(&refcnt->refcnt, 1, 0); > > Occurred to me while looking at the next patch: > > Don't you also need to print a warning (and saturate the counter > maybe?) if old == 0, as that would imply the caller is attempting > to take a reference of an object that should be destroyed? IOW: it > would point to some kind of memory leak. Hmm, I notice the function presently returns void. I think what to do when the counter is zero needs leaving to the caller. See e.g. get_page() which will simply indicate failure to the caller in case the refcnt is zero. (There overflow handling also is left to the caller ... All that matters is whether a ref can be acquired.) Jan ^ permalink raw reply [flat|nested] 50+ messages in thread
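The idea of leaving the zero case to the caller, in the spirit of get_page(), could look roughly like the sketch below. It is an assumption about one possible shape of such an API, not code from the series: refcnt_get_unless_zero() is a hypothetical name, and C11 atomics stand in for Xen's atomic_t. The function reports whether a reference was actually acquired instead of returning void.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_int refcnt;
} refcnt_t;

/*
 * Try to take a reference. Returns false, leaving the counter untouched,
 * if the object is already at zero (or in an invalid negative state) or
 * if the counter cannot be incremented without overflowing.
 */
static inline bool refcnt_get_unless_zero(refcnt_t *r)
{
    int old;

    assert(r != NULL);

    old = atomic_load(&r->refcnt);
    do {
        if ( old <= 0 || old == INT_MAX )
            return false;
        /* On failure the CAS reloads 'old', and the checks are repeated. */
    } while ( !atomic_compare_exchange_weak(&r->refcnt, &old, old + 1) );

    return true;
}
```

With this shape the caller decides what a failed acquisition means, for example returning an error to the guest or logging a warning for a pdev that should no longer be reachable.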
* Re: [PATCH v3 1/6] xen: add reference counter support 2023-03-16 16:32 ` Jan Beulich @ 2023-03-16 16:39 ` Roger Pau Monné 2023-03-16 16:43 ` Jan Beulich 0 siblings, 1 reply; 50+ messages in thread From: Roger Pau Monné @ 2023-03-16 16:39 UTC (permalink / raw) To: Jan Beulich Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu, Volodymyr Babchuk On Thu, Mar 16, 2023 at 05:32:38PM +0100, Jan Beulich wrote: > On 16.03.2023 17:19, Roger Pau Monné wrote: > > On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: > >> +static inline void refcnt_get(refcnt_t *refcnt) > >> +{ > >> + int old = atomic_add_unless(&refcnt->refcnt, 1, 0); > > > > Occurred to me while looking at the next patch: > > > > Don't you also need to print a warning (and saturate the counter > > maybe?) if old == 0, as that would imply the caller is attempting > > to take a reference of an object that should be destroyed? IOW: it > > would point to some kind of memory leak. > > Hmm, I notice the function presently returns void. I think what to do > when the counter is zero needs leaving to the caller. See e.g. > get_page() which will simply indicate failure to the caller in case > the refcnt is zero. (There overflow handling also is left to the > caller ... All that matters is whether a ref can be acquired.) Hm, likely. I guess pages never go away even when it's refcount reaches 0. For the pdev case attempting to take a refcount on an object that has 0 refcounts implies that the caller is using leaked memory, as the point an object reaches 0 it supposed to be destroyed. Returning success would be fine, as then for the pdev use-case we could print a warning and likely take some action to prevent further damage if possible. Thanks, Roger. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v3 1/6] xen: add reference counter support 2023-03-16 16:39 ` Roger Pau Monné @ 2023-03-16 16:43 ` Jan Beulich 2023-03-16 16:48 ` Roger Pau Monné 0 siblings, 1 reply; 50+ messages in thread From: Jan Beulich @ 2023-03-16 16:43 UTC (permalink / raw) To: Roger Pau Monné Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu, Volodymyr Babchuk On 16.03.2023 17:39, Roger Pau Monné wrote: > On Thu, Mar 16, 2023 at 05:32:38PM +0100, Jan Beulich wrote: >> On 16.03.2023 17:19, Roger Pau Monné wrote: >>> On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: >>>> +static inline void refcnt_get(refcnt_t *refcnt) >>>> +{ >>>> + int old = atomic_add_unless(&refcnt->refcnt, 1, 0); >>> >>> Occurred to me while looking at the next patch: >>> >>> Don't you also need to print a warning (and saturate the counter >>> maybe?) if old == 0, as that would imply the caller is attempting >>> to take a reference of an object that should be destroyed? IOW: it >>> would point to some kind of memory leak. >> >> Hmm, I notice the function presently returns void. I think what to do >> when the counter is zero needs leaving to the caller. See e.g. >> get_page() which will simply indicate failure to the caller in case >> the refcnt is zero. (There overflow handling also is left to the >> caller ... All that matters is whether a ref can be acquired.) > > Hm, likely. I guess pages never go away even when it's refcount > reaches 0. > > For the pdev case attempting to take a refcount on an object that has > 0 refcounts implies that the caller is using leaked memory, as the > point an object reaches 0 it supposed to be destroyed. Hmm, my thinking was that a device would remain at refcnt 0 until it is actually removed, i.e. refcnt == 0 being a prereq for pci_remove_device() to be willing to do anything at all. But maybe that's not a viable model. Jan ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v3 1/6] xen: add reference counter support 2023-03-16 16:43 ` Jan Beulich @ 2023-03-16 16:48 ` Roger Pau Monné 2023-03-16 16:56 ` Jan Beulich 0 siblings, 1 reply; 50+ messages in thread From: Roger Pau Monné @ 2023-03-16 16:48 UTC (permalink / raw) To: Jan Beulich Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu, Volodymyr Babchuk On Thu, Mar 16, 2023 at 05:43:18PM +0100, Jan Beulich wrote: > On 16.03.2023 17:39, Roger Pau Monné wrote: > > On Thu, Mar 16, 2023 at 05:32:38PM +0100, Jan Beulich wrote: > >> On 16.03.2023 17:19, Roger Pau Monné wrote: > >>> On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: > >>>> +static inline void refcnt_get(refcnt_t *refcnt) > >>>> +{ > >>>> + int old = atomic_add_unless(&refcnt->refcnt, 1, 0); > >>> > >>> Occurred to me while looking at the next patch: > >>> > >>> Don't you also need to print a warning (and saturate the counter > >>> maybe?) if old == 0, as that would imply the caller is attempting > >>> to take a reference of an object that should be destroyed? IOW: it > >>> would point to some kind of memory leak. > >> > >> Hmm, I notice the function presently returns void. I think what to do > >> when the counter is zero needs leaving to the caller. See e.g. > >> get_page() which will simply indicate failure to the caller in case > >> the refcnt is zero. (There overflow handling also is left to the > >> caller ... All that matters is whether a ref can be acquired.) > > > > Hm, likely. I guess pages never go away even when it's refcount > > reaches 0. > > > > For the pdev case attempting to take a refcount on an object that has > > 0 refcounts implies that the caller is using leaked memory, as the > > point an object reaches 0 it supposed to be destroyed. > > Hmm, my thinking was that a device would remain at refcnt 0 until it is > actually removed, i.e. refcnt == 0 being a prereq for pci_remove_device() > to be willing to do anything at all. 
> But maybe that's not a viable model.

Right, I think the intention was for pci_remove_device() to drop the
refcount to 0 and do the removal, so the refcount should be 1 when
calling pci_remove_device(). But none of this is written down, so
these are mostly my assumptions from looking at the code.

I have some comments about the model in patch 2. I think we need to
clarify the intended usage of pdev refcounts in the commit log.

Thanks, Roger.
* Re: [PATCH v3 1/6] xen: add reference counter support 2023-03-16 16:48 ` Roger Pau Monné @ 2023-03-16 16:56 ` Jan Beulich 2023-03-17 10:05 ` Roger Pau Monné 0 siblings, 1 reply; 50+ messages in thread From: Jan Beulich @ 2023-03-16 16:56 UTC (permalink / raw) To: Roger Pau Monné Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu, Volodymyr Babchuk On 16.03.2023 17:48, Roger Pau Monné wrote: > On Thu, Mar 16, 2023 at 05:43:18PM +0100, Jan Beulich wrote: >> On 16.03.2023 17:39, Roger Pau Monné wrote: >>> On Thu, Mar 16, 2023 at 05:32:38PM +0100, Jan Beulich wrote: >>>> On 16.03.2023 17:19, Roger Pau Monné wrote: >>>>> On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: >>>>>> +static inline void refcnt_get(refcnt_t *refcnt) >>>>>> +{ >>>>>> + int old = atomic_add_unless(&refcnt->refcnt, 1, 0); >>>>> >>>>> Occurred to me while looking at the next patch: >>>>> >>>>> Don't you also need to print a warning (and saturate the counter >>>>> maybe?) if old == 0, as that would imply the caller is attempting >>>>> to take a reference of an object that should be destroyed? IOW: it >>>>> would point to some kind of memory leak. >>>> >>>> Hmm, I notice the function presently returns void. I think what to do >>>> when the counter is zero needs leaving to the caller. See e.g. >>>> get_page() which will simply indicate failure to the caller in case >>>> the refcnt is zero. (There overflow handling also is left to the >>>> caller ... All that matters is whether a ref can be acquired.) >>> >>> Hm, likely. I guess pages never go away even when it's refcount >>> reaches 0. >>> >>> For the pdev case attempting to take a refcount on an object that has >>> 0 refcounts implies that the caller is using leaked memory, as the >>> point an object reaches 0 it supposed to be destroyed. >> >> Hmm, my thinking was that a device would remain at refcnt 0 until it is >> actually removed, i.e. 
refcnt == 0 being a prereq for pci_remove_device() >> to be willing to do anything at all. But maybe that's not a viable model. > > Right, I think the intention was for pci_remove_device() to drop the > refcount to 0 and do the removal, so the refcount should be 1 when > calling pci_remove_device(). But none of this is written down, so > it's mostly my assumptions from looking at the code. Could such work at all? The function can't safely drop a reference and _then_ check whether it was the last one. The function either has to take refcnt == 0 as prereq, or it needs to be the destructor function that refcnt_put() calls. Jan ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v3 1/6] xen: add reference counter support
  2023-03-16 16:56 ` Jan Beulich
@ 2023-03-17 10:05 ` Roger Pau Monné
  2023-03-17 14:46 ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monné @ 2023-03-17 10:05 UTC
To: Jan Beulich
Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
    Stefano Stabellini, Wei Liu, Volodymyr Babchuk

On Thu, Mar 16, 2023 at 05:56:00PM +0100, Jan Beulich wrote:
> On 16.03.2023 17:48, Roger Pau Monné wrote:
> > On Thu, Mar 16, 2023 at 05:43:18PM +0100, Jan Beulich wrote:
> >> On 16.03.2023 17:39, Roger Pau Monné wrote:
> >>> On Thu, Mar 16, 2023 at 05:32:38PM +0100, Jan Beulich wrote:
> >>>> On 16.03.2023 17:19, Roger Pau Monné wrote:
> >>>>> On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote:
> >>>>>> +static inline void refcnt_get(refcnt_t *refcnt)
> >>>>>> +{
> >>>>>> +    int old = atomic_add_unless(&refcnt->refcnt, 1, 0);
> >>>>>
> >>>>> Occurred to me while looking at the next patch:
> >>>>>
> >>>>> Don't you also need to print a warning (and saturate the counter
> >>>>> maybe?) if old == 0, as that would imply the caller is attempting
> >>>>> to take a reference of an object that should be destroyed? IOW: it
> >>>>> would point to some kind of memory leak.
> >>>>
> >>>> Hmm, I notice the function presently returns void. I think what to do
> >>>> when the counter is zero needs leaving to the caller. See e.g.
> >>>> get_page() which will simply indicate failure to the caller in case
> >>>> the refcnt is zero. (There overflow handling also is left to the
> >>>> caller ... All that matters is whether a ref can be acquired.)
> >>>
> >>> Hm, likely. I guess pages never go away even when it's refcount
> >>> reaches 0.
> >>>
> >>> For the pdev case attempting to take a refcount on an object that has
> >>> 0 refcounts implies that the caller is using leaked memory, as the
> >>> point an object reaches 0 it supposed to be destroyed.
> >>
> >> Hmm, my thinking was that a device would remain at refcnt 0 until it is
> >> actually removed, i.e. refcnt == 0 being a prereq for pci_remove_device()
> >> to be willing to do anything at all. But maybe that's not a viable model.
> >
> > Right, I think the intention was for pci_remove_device() to drop the
> > refcount to 0 and do the removal, so the refcount should be 1 when
> > calling pci_remove_device(). But none of this is written down, so
> > it's mostly my assumptions from looking at the code.
>
> Could such work at all? The function can't safely drop a reference
> and _then_ check whether it was the last one. The function either has
> to take refcnt == 0 as prereq, or it needs to be the destructor
> function that refcnt_put() calls.

But then you also get in the trouble of asserting that refcnt == 0
doesn't change between evaluation and actual removal of the structure.

Should all refcounts to pdev be taken and dropped while holding the
pcidevs lock?

Is there an email (outside of this series) that contains a description
of how the refcounting is to be used with pdevs?

Thanks, Roger.

* Re: [PATCH v3 1/6] xen: add reference counter support
  2023-03-17 10:05 ` Roger Pau Monné
@ 2023-03-17 14:46 ` Jan Beulich
  0 siblings, 0 replies; 50+ messages in thread
From: Jan Beulich @ 2023-03-17 14:46 UTC
To: Roger Pau Monné
Cc: xen-devel, Andrew Cooper, George Dunlap, Julien Grall,
    Stefano Stabellini, Wei Liu, Volodymyr Babchuk

On 17.03.2023 11:05, Roger Pau Monné wrote:
> On Thu, Mar 16, 2023 at 05:56:00PM +0100, Jan Beulich wrote:
>> On 16.03.2023 17:48, Roger Pau Monné wrote:
>>> On Thu, Mar 16, 2023 at 05:43:18PM +0100, Jan Beulich wrote:
>>>> On 16.03.2023 17:39, Roger Pau Monné wrote:
>>>>> On Thu, Mar 16, 2023 at 05:32:38PM +0100, Jan Beulich wrote:
>>>>>> On 16.03.2023 17:19, Roger Pau Monné wrote:
>>>>>>> On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote:
>>>>>>>> +static inline void refcnt_get(refcnt_t *refcnt)
>>>>>>>> +{
>>>>>>>> +    int old = atomic_add_unless(&refcnt->refcnt, 1, 0);
>>>>>>>
>>>>>>> Occurred to me while looking at the next patch:
>>>>>>>
>>>>>>> Don't you also need to print a warning (and saturate the counter
>>>>>>> maybe?) if old == 0, as that would imply the caller is attempting
>>>>>>> to take a reference of an object that should be destroyed? IOW: it
>>>>>>> would point to some kind of memory leak.
>>>>>>
>>>>>> Hmm, I notice the function presently returns void. I think what to do
>>>>>> when the counter is zero needs leaving to the caller. See e.g.
>>>>>> get_page() which will simply indicate failure to the caller in case
>>>>>> the refcnt is zero. (There overflow handling also is left to the
>>>>>> caller ... All that matters is whether a ref can be acquired.)
>>>>>
>>>>> Hm, likely. I guess pages never go away even when it's refcount
>>>>> reaches 0.
>>>>>
>>>>> For the pdev case attempting to take a refcount on an object that has
>>>>> 0 refcounts implies that the caller is using leaked memory, as the
>>>>> point an object reaches 0 it supposed to be destroyed.
>>>>
>>>> Hmm, my thinking was that a device would remain at refcnt 0 until it is
>>>> actually removed, i.e. refcnt == 0 being a prereq for pci_remove_device()
>>>> to be willing to do anything at all. But maybe that's not a viable model.
>>>
>>> Right, I think the intention was for pci_remove_device() to drop the
>>> refcount to 0 and do the removal, so the refcount should be 1 when
>>> calling pci_remove_device(). But none of this is written down, so
>>> it's mostly my assumptions from looking at the code.
>>
>> Could such work at all? The function can't safely drop a reference
>> and _then_ check whether it was the last one. The function either has
>> to take refcnt == 0 as prereq, or it needs to be the destructor
>> function that refcnt_put() calls.
>
> But then you also get in the trouble of asserting that refcnt == 0
> doesn't change between evaluation and actual removal of the structure.
>
> Should all refcounts to pdev be taken and dropped while holding the
> pcidevs lock?
>
> I there an email (outside of this series) that contains a description
> of how the refcounting is to be used with pdevs?

I'm not aware of one. The intentions indeed need outlining somewhere.

Jan

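Editorial sketch, not part of the posted series: the get_page() comparison above suggests a refcnt_get() that reports failure instead of returning void. The following is a minimal illustration in portable C11 atomics rather than Xen's atomic_t helpers. The refcnt_t layout, the initial count of 1, and the choice to refuse at INT_MAX are all assumptions made for the example, not taken from the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the refcnt_t discussed in the thread. */
typedef struct {
    atomic_int refcnt;
} refcnt_t;

/* Assumption: a new object starts with one reference held by its creator. */
static inline void refcnt_init(refcnt_t *r)
{
    assert(r != NULL);
    atomic_init(&r->refcnt, 1);
}

/*
 * Try to take a reference. Returns false, leaving the counter untouched,
 * when the count is already zero (object on its way out) or would
 * overflow. What to do about a failure is left to the caller.
 */
static inline bool refcnt_get(refcnt_t *r)
{
    int old;

    assert(r != NULL);
    old = atomic_load(&r->refcnt);

    do {
        if ( old <= 0 || old == INT_MAX )
            return false;
        /* On failure the CAS reloads 'old', so the checks are repeated. */
    } while ( !atomic_compare_exchange_weak(&r->refcnt, &old, old + 1) );

    return true;
}
```

With this shape a lookup helper can simply return NULL when refcnt_get() fails, so no caller ever ends up holding a pointer to an object that is being torn down.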
* Re: [PATCH v3 1/6] xen: add reference counter support
  2023-03-14 20:56 ` [PATCH v3 1/6] xen: add reference counter support Volodymyr Babchuk
  2023-03-16 13:54   ` Roger Pau Monné
  2023-03-16 16:19   ` Roger Pau Monné
@ 2023-03-16 17:01   ` Jan Beulich
  2023-04-11 22:38     ` Volodymyr Babchuk
  2 siblings, 1 reply; 50+ messages in thread
From: Jan Beulich @ 2023-03-16 17:01 UTC
To: Volodymyr Babchuk
Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
    Wei Liu, xen-devel

On 14.03.2023 21:56, Volodymyr Babchuk wrote:
> +static inline void refcnt_put(refcnt_t *refcnt, void (*destructor)(refcnt_t *refcnt))

Hmm, this means all callers need to pass (and agree on) the supposedly
single destructor function that needs calling. Wouldn't the destructor
function better be stored elsewhere (and supplied to e.g. refcnt_init())?

> +{
> +    int ret = atomic_dec_return(&refcnt->refcnt);
> +
> +    if ( ret == 0 )
> +        destructor(refcnt);
> +
> +    if ( unlikely(ret < 0))
> +    {
> +        atomic_set(&refcnt->refcnt, REFCNT_SATURATED);

It's undefined whether *refcnt still exists once the destructor was
called (which would have happened before we can make it here). While
even the atomic_dec_return() above would already have acted in an
unknown way in this case I don't think it's a good idea to access the
object yet another time. (Same for the "negative" case in
refcnt_get() then.)

Jan

* Re: [PATCH v3 1/6] xen: add reference counter support
  2023-03-16 17:01 ` Jan Beulich
@ 2023-04-11 22:38 ` Volodymyr Babchuk
  2023-04-17  6:47 ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Volodymyr Babchuk @ 2023-04-11 22:38 UTC
To: Jan Beulich
Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
    Wei Liu, xen-devel

Jan Beulich <jbeulich@suse.com> writes:

> On 14.03.2023 21:56, Volodymyr Babchuk wrote:
>> +static inline void refcnt_put(refcnt_t *refcnt, void (*destructor)(refcnt_t *refcnt))
>
> Hmm, this means all callers need to pass (and agree on) the supposedly
> single destructor function that needs calling. Wouldn't the destructor
> function better be stored elsewhere (and supplied to e.g. refcnt_init())?
>

I tried to replicate the Linux approach. They provide the destructor
function every time. On the other hand, kref_put() is often called from
a wrapper function (like pdev_put() in our case), so the destructor is,
in fact, provided only once.

>> +{
>> +    int ret = atomic_dec_return(&refcnt->refcnt);
>> +
>> +    if ( ret == 0 )
>> +        destructor(refcnt);
>> +
>> +    if ( unlikely(ret < 0))
>> +    {
>> +        atomic_set(&refcnt->refcnt, REFCNT_SATURATED);
>
> It's undefined whether *refcnt still exists once the destructor was
> called (which would have happened before we can make it here). While
> even the atomic_dec_return() above would already have acted in an
> unknown way in this case I don't think it's a good idea to access the
> object yet another time. (Same for the "negative" case in
> refcnt_get() then.)

Okay, then I'll remove the saturation logic.

-- 
WBR, Volodymyr

* Re: [PATCH v3 1/6] xen: add reference counter support
  2023-04-11 22:38 ` Volodymyr Babchuk
@ 2023-04-17  6:47 ` Jan Beulich
  0 siblings, 0 replies; 50+ messages in thread
From: Jan Beulich @ 2023-04-17 6:47 UTC
To: Volodymyr Babchuk
Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
    Wei Liu, xen-devel

On 12.04.2023 00:38, Volodymyr Babchuk wrote:
> Jan Beulich <jbeulich@suse.com> writes:
>> On 14.03.2023 21:56, Volodymyr Babchuk wrote:
>>> +static inline void refcnt_put(refcnt_t *refcnt, void (*destructor)(refcnt_t *refcnt))
>>
>> Hmm, this means all callers need to pass (and agree on) the supposedly
>> single destructor function that needs calling. Wouldn't the destructor
>> function better be stored elsewhere (and supplied to e.g. refcnt_init())?
>>
>
> I tried to replicate Linux approach. They provide destructor function
> every time. On other hand, kref_put() is often called from a wrapper
> function (like pdev_put() in our case), so destructor in fact, is
> provided only once.

If provided via wrappers, that'll be fine of course.

>>> +{
>>> +    int ret = atomic_dec_return(&refcnt->refcnt);
>>> +
>>> +    if ( ret == 0 )
>>> +        destructor(refcnt);
>>> +
>>> +    if ( unlikely(ret < 0))
>>> +    {
>>> +        atomic_set(&refcnt->refcnt, REFCNT_SATURATED);
>>
>> It's undefined whether *refcnt still exists once the destructor was
>> called (which would have happened before we can make it here). While
>> even the atomic_dec_return() above would already have acted in an
>> unknown way in this case I don't think it's a good idea to access the
>> object yet another time. (Same for the "negative" case in
>> refcnt_get() then.)
>
> Okay, then I'll remove saturation logic.

Wait. Saturating on overflow might still be a reasonable concept. But
here you convert an underflow to the "saturated" value.

Jan

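Editorial sketch, not part of the posted series: the two points raised in this sub-thread (the destructor being supplied once through a wrapper, and the object never being touched again after the destructor has run) can be illustrated as follows. This uses portable C11 atomics; struct widget, widget_release() and widget_put() are hypothetical names standing in for struct pci_dev and its helpers, and the underflow handling shown is only one possible choice.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the refcnt_t discussed in the thread. */
typedef struct {
    atomic_int refcnt;
} refcnt_t;

static inline void refcnt_init(refcnt_t *r)
{
    atomic_init(&r->refcnt, 1);
}

/*
 * Drop one reference. Returns true only when this call dropped the last
 * reference and therefore ran the destructor.
 */
static inline bool refcnt_put(refcnt_t *r, void (*destructor)(refcnt_t *))
{
    int old = atomic_load(&r->refcnt);

    do {
        /*
         * Underflow: a put without a matching get. The counter is left
         * alone. This check is best effort only: if the object has
         * already been freed, even the load above read stale memory.
         */
        if ( old <= 0 )
            return false;
    } while ( !atomic_compare_exchange_weak(&r->refcnt, &old, old - 1) );

    if ( old != 1 )
        return false;

    destructor(r);
    /* *r may no longer exist: nothing below may dereference it. */
    return true;
}

/* Hypothetical refcounted object, standing in for struct pci_dev. */
struct widget {
    int id;
    refcnt_t refcnt;
};

static int widgets_released; /* demonstration counter only */

static void widget_release(refcnt_t *r)
{
    /* Open-coded container_of(): recover the enclosing object. */
    struct widget *w =
        (struct widget *)((char *)r - offsetof(struct widget, refcnt));

    widgets_released++;
    free(w);
}

/* The wrapper is the only place that names the destructor. */
static void widget_put(struct widget *w)
{
    assert(w != NULL);
    refcnt_put(&w->refcnt, widget_release);
}
```

Because callers only ever use widget_put(), they cannot disagree about which destructor applies, which is the property discussed above for the kref_put()-style interface.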
* [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-03-14 20:56 [PATCH v3 0/6] vpci: first series in preparation for vpci on ARM Volodymyr Babchuk 2023-03-14 20:56 ` [PATCH v3 1/6] xen: add reference counter support Volodymyr Babchuk @ 2023-03-14 20:56 ` Volodymyr Babchuk 2023-03-16 16:16 ` Roger Pau Monné 2023-03-29 10:04 ` Jan Beulich 2023-03-14 20:56 ` [PATCH v3 3/6] vpci: crash domain if we wasn't able to (un) map vPCI regions Volodymyr Babchuk ` (3 subsequent siblings) 5 siblings, 2 replies; 50+ messages in thread From: Volodymyr Babchuk @ 2023-03-14 20:56 UTC (permalink / raw) To: xen-devel Cc: Volodymyr Babchuk, Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian Prior to this change, lifetime of pci_dev objects was protected by global pcidevs_lock(). Long-term plan is to remove this log, so we need some other mechanism to ensure that those objects will not disappear under feet of code that access them. Reference counting is a good choice as it provides easy to comprehend way to control object lifetime. This patch adds two new helper functions: pcidev_get() and pcidev_put(). pcidev_get() will increase reference counter, while pcidev_put() will decrease it, destroying object when counter reaches zero. pcidev_get() should be used only when you already have a valid pointer to the object or you are holding lock that protects one of the lists (domain, pseg or ats) that store pci_dev structs. pcidev_get() is rarely used directly, because there already are functions that will provide valid pointer to pci_dev struct: pci_get_pdev(), pci_get_real_pdev(). They will lock appropriate list, find needed object and increase its reference counter before returning to the caller. Naturally, pci_put() should be called after finishing working with a received object. 
This is the reason why this patch have so many pcidev_put()s and so little pcidev_get()s: existing calls to pci_get_*() functions now will increase reference counter automatically, we just need to decrease it back when we finished. This patch removes "const" qualifier from some pdev pointers because pcidev_put() technically alters the contents of pci_dev structure. Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> Suggested-by: Jan Beulich <jbeulich@suse.com> --- v3: - Moved in from another patch series - Fixed code formatting (tabs -> spaces) - Removed erroneous pcidev_put in vga.c - Added missing pcidev_put in couple of places - removed mention of pci_get_pdev_by_domain() --- xen/arch/x86/hvm/vmsi.c | 2 +- xen/arch/x86/irq.c | 4 + xen/arch/x86/msi.c | 44 +++++++- xen/arch/x86/pci.c | 3 + xen/arch/x86/physdev.c | 17 ++- xen/common/sysctl.c | 7 +- xen/drivers/passthrough/amd/iommu_init.c | 12 +- xen/drivers/passthrough/amd/iommu_map.c | 6 +- xen/drivers/passthrough/pci.c | 138 +++++++++++++++-------- xen/drivers/passthrough/vtd/quirks.c | 2 + xen/drivers/video/vga.c | 7 +- xen/drivers/vpci/vpci.c | 16 ++- xen/include/xen/pci.h | 18 +++ 13 files changed, 215 insertions(+), 61 deletions(-) diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c index 3cd4923060..8c3d673872 100644 --- a/xen/arch/x86/hvm/vmsi.c +++ b/xen/arch/x86/hvm/vmsi.c @@ -914,7 +914,7 @@ int vpci_msix_arch_print(const struct vpci_msix *msix) spin_unlock(&msix->pdev->vpci->lock); process_pending_softirqs(); - /* NB: we assume that pdev cannot go away for an alive domain. 
*/ + if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) ) return -EBUSY; if ( pdev->vpci->msix != msix ) diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c index 20150b1c7f..87464d82c8 100644 --- a/xen/arch/x86/irq.c +++ b/xen/arch/x86/irq.c @@ -2175,6 +2175,7 @@ int map_domain_pirq( msi->entry_nr = ret; ret = -ENFILE; } + pcidev_put(pdev); goto done; } @@ -2189,6 +2190,7 @@ int map_domain_pirq( msi_desc->irq = -1; msi_free_irq(msi_desc); ret = -EBUSY; + pcidev_put(pdev); goto done; } @@ -2273,10 +2275,12 @@ int map_domain_pirq( } msi_desc->irq = -1; msi_free_irq(msi_desc); + pcidev_put(pdev); goto done; } set_domain_irq_pirq(d, irq, info); + pcidev_put(pdev); spin_unlock_irqrestore(&desc->lock, flags); } else diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c index d0bf63df1d..91926fce50 100644 --- a/xen/arch/x86/msi.c +++ b/xen/arch/x86/msi.c @@ -572,6 +572,10 @@ int msi_free_irq(struct msi_desc *entry) virt_to_fix((unsigned long)entry->mask_base)); list_del(&entry->list); + + /* Corresponds to pcidev_get() in msi[x]_capability_init() */ + pcidev_put(entry->dev); + xfree(entry); return 0; } @@ -644,6 +648,7 @@ static int msi_capability_init(struct pci_dev *dev, entry[i].msi.mpos = mpos; entry[i].msi.nvec = 0; entry[i].dev = dev; + pcidev_get(dev); } entry->msi.nvec = nvec; entry->irq = irq; @@ -703,22 +708,36 @@ static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf) !num_vf || !offset || (num_vf > 1 && !stride) || bir >= PCI_SRIOV_NUM_BARS || !pdev->vf_rlen[bir] ) + { + if ( pdev ) + pcidev_put(pdev); return 0; + } base = pos + PCI_SRIOV_BAR; vf -= PCI_BDF(bus, slot, func) + offset; if ( vf < 0 ) + { + pcidev_put(pdev); return 0; + } if ( stride ) { if ( vf % stride ) + { + pcidev_put(pdev); return 0; + } vf /= stride; } if ( vf >= num_vf ) + { + pcidev_put(pdev); return 0; + } BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS); disp = vf * pdev->vf_rlen[bir]; limit = PCI_SRIOV_NUM_BARS; + pcidev_put(pdev); } else switch ( 
pci_conf_read8(PCI_SBDF(seg, bus, slot, func), PCI_HEADER_TYPE) & 0x7f ) @@ -925,6 +944,8 @@ static int msix_capability_init(struct pci_dev *dev, entry->dev = dev; entry->mask_base = base; + pcidev_get(dev); + list_add_tail(&entry->list, &dev->msi_list); *desc = entry; } @@ -999,6 +1020,7 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc) { struct pci_dev *pdev; struct msi_desc *old_desc; + int ret; ASSERT(pcidevs_locked()); pdev = pci_get_pdev(NULL, msi->sbdf); @@ -1010,6 +1032,7 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc) { printk(XENLOG_ERR "irq %d already mapped to MSI on %pp\n", msi->irq, &pdev->sbdf); + pcidev_put(pdev); return -EEXIST; } @@ -1020,7 +1043,10 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc) __pci_disable_msix(old_desc); } - return msi_capability_init(pdev, msi->irq, desc, msi->entry_nr); + ret = msi_capability_init(pdev, msi->irq, desc, msi->entry_nr); + pcidev_put(pdev); + + return ret; } static void __pci_disable_msi(struct msi_desc *entry) @@ -1054,20 +1080,29 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc) { struct pci_dev *pdev; struct msi_desc *old_desc; + int ret; ASSERT(pcidevs_locked()); pdev = pci_get_pdev(NULL, msi->sbdf); if ( !pdev || !pdev->msix ) + { + if ( pdev ) + pcidev_put(pdev); return -ENODEV; + } if ( msi->entry_nr >= pdev->msix->nr_entries ) + { + pcidev_put(pdev); return -EINVAL; + } old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSIX); if ( old_desc ) { printk(XENLOG_ERR "irq %d already mapped to MSI-X on %pp\n", msi->irq, &pdev->sbdf); + pcidev_put(pdev); return -EEXIST; } @@ -1078,7 +1113,11 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc) __pci_disable_msi(old_desc); } - return msix_capability_init(pdev, msi, desc); + ret = msix_capability_init(pdev, msi, desc); + + pcidev_put(pdev); + + return ret; } static void _pci_cleanup_msix(struct arch_msix *msix) @@ -1159,6 
+1198,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off) } else rc = msix_capability_init(pdev, NULL, NULL); + pcidev_put(pdev); pcidevs_unlock(); return rc; diff --git a/xen/arch/x86/pci.c b/xen/arch/x86/pci.c index 97b792e578..c1fcdf08d6 100644 --- a/xen/arch/x86/pci.c +++ b/xen/arch/x86/pci.c @@ -92,7 +92,10 @@ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf, pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bdf)); if ( pdev ) + { rc = pci_msi_conf_write_intercept(pdev, reg, size, data); + pcidev_put(pdev); + } pcidevs_unlock(); diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c index 2f1d955a96..96214a3d40 100644 --- a/xen/arch/x86/physdev.c +++ b/xen/arch/x86/physdev.c @@ -533,7 +533,14 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) pcidevs_lock(); pdev = pci_get_pdev(NULL, PCI_SBDF(0, restore_msi.bus, restore_msi.devfn)); - ret = pdev ? pci_restore_msi_state(pdev) : -ENODEV; + if ( pdev ) + { + ret = pci_restore_msi_state(pdev); + pcidev_put(pdev); + } + else + ret = -ENODEV; + pcidevs_unlock(); break; } @@ -548,7 +555,13 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) pcidevs_lock(); pdev = pci_get_pdev(NULL, PCI_SBDF(dev.seg, dev.bus, dev.devfn)); - ret = pdev ? 
pci_restore_msi_state(pdev) : -ENODEV; + if ( pdev ) + { + ret = pci_restore_msi_state(pdev); + pcidev_put(pdev); + } + else + ret = -ENODEV; pcidevs_unlock(); break; } diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c index 02505ab044..9af07fa92a 100644 --- a/xen/common/sysctl.c +++ b/xen/common/sysctl.c @@ -438,7 +438,7 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl) { physdev_pci_device_t dev; uint32_t node; - const struct pci_dev *pdev; + struct pci_dev *pdev; if ( copy_from_guest_offset(&dev, ti->devs, i, 1) ) { @@ -454,8 +454,11 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl) node = XEN_INVALID_NODE_ID; else node = pdev->node; - pcidevs_unlock(); + if ( pdev ) + pcidev_put(pdev); + + pcidevs_unlock(); if ( copy_to_guest_offset(ti->nodes, i, &node, 1) ) { ret = -EFAULT; diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c index 9773ccfcb4..f90b1c1e58 100644 --- a/xen/drivers/passthrough/amd/iommu_init.c +++ b/xen/drivers/passthrough/amd/iommu_init.c @@ -646,6 +646,7 @@ static void cf_check parse_ppr_log_entry(struct amd_iommu *iommu, u32 entry[]) if ( pdev ) guest_iommu_add_ppr_log(pdev->domain, entry); + pcidev_put(pdev); } static void iommu_check_ppr_log(struct amd_iommu *iommu) @@ -749,6 +750,11 @@ static bool_t __init set_iommu_interrupt_handler(struct amd_iommu *iommu) } pcidevs_lock(); + /* + * XXX: it is unclear if this device can be removed. Right now + * there is no code that clears msi.dev, so no one will decrease + * refcount on it. 
+ */ iommu->msi.dev = pci_get_pdev(NULL, PCI_SBDF(iommu->seg, iommu->bdf)); pcidevs_unlock(); if ( !iommu->msi.dev ) @@ -1274,7 +1280,7 @@ static int __init cf_check amd_iommu_setup_device_table( { if ( ivrs_mappings[bdf].valid ) { - const struct pci_dev *pdev = NULL; + struct pci_dev *pdev = NULL; /* add device table entry */ iommu_dte_add_device_entry(&dt[bdf], &ivrs_mappings[bdf]); @@ -1299,7 +1305,10 @@ static int __init cf_check amd_iommu_setup_device_table( pdev->msix ? pdev->msix->nr_entries : pdev->msi_maxvec); if ( !ivrs_mappings[bdf].intremap_table ) + { + pcidev_put(pdev); return -ENOMEM; + } if ( pdev->phantom_stride ) { @@ -1317,6 +1326,7 @@ static int __init cf_check amd_iommu_setup_device_table( ivrs_mappings[bdf].intremap_inuse; } } + pcidev_put(pdev); } amd_iommu_set_intremap_table( diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c index 993bac6f88..9d621e3d36 100644 --- a/xen/drivers/passthrough/amd/iommu_map.c +++ b/xen/drivers/passthrough/amd/iommu_map.c @@ -724,14 +724,18 @@ int cf_check amd_iommu_get_reserved_device_memory( if ( !iommu ) { /* May need to trigger the workaround in find_iommu_for_device(). */ - const struct pci_dev *pdev; + struct pci_dev *pdev; pcidevs_lock(); pdev = pci_get_pdev(NULL, sbdf); pcidevs_unlock(); if ( pdev ) + { iommu = find_iommu_for_device(seg, bdf); + /* XXX: Should we hold pdev reference till end of the loop? 
*/ + pcidev_put(pdev); + } if ( !iommu ) continue; } diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c index b42acb8d7c..b32382aca0 100644 --- a/xen/drivers/passthrough/pci.c +++ b/xen/drivers/passthrough/pci.c @@ -328,6 +328,7 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn) *((u8*) &pdev->bus) = bus; *((u8*) &pdev->devfn) = devfn; pdev->domain = NULL; + refcnt_init(&pdev->refcnt); arch_pci_init_pdev(pdev); @@ -422,33 +423,6 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn) return pdev; } -static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev) -{ - /* update bus2bridge */ - switch ( pdev->type ) - { - unsigned int sec_bus, sub_bus; - - case DEV_TYPE_PCIe2PCI_BRIDGE: - case DEV_TYPE_LEGACY_PCI_BRIDGE: - sec_bus = pci_conf_read8(pdev->sbdf, PCI_SECONDARY_BUS); - sub_bus = pci_conf_read8(pdev->sbdf, PCI_SUBORDINATE_BUS); - - spin_lock(&pseg->bus2bridge_lock); - for ( ; sec_bus <= sub_bus; sec_bus++ ) - pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus]; - spin_unlock(&pseg->bus2bridge_lock); - break; - - default: - break; - } - - list_del(&pdev->alldevs_list); - pdev_msi_deinit(pdev); - xfree(pdev); -} - static void __init _pci_hide_device(struct pci_dev *pdev) { if ( pdev->domain ) @@ -517,10 +491,14 @@ struct pci_dev *pci_get_real_pdev(pci_sbdf_t sbdf) { if ( !(sbdf.devfn & stride) ) continue; + sbdf.devfn &= ~stride; pdev = pci_get_pdev(NULL, sbdf); if ( pdev && stride != pdev->phantom_stride ) + { + pcidev_put(pdev); pdev = NULL; + } } return pdev; @@ -548,13 +526,18 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf) list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list ) if ( pdev->sbdf.bdf == sbdf.bdf && (!d || pdev->domain == d) ) + { + pcidev_get(pdev); return pdev; + } } else list_for_each_entry ( pdev, &d->pdev_list, domain_list ) if ( pdev->sbdf.bdf == sbdf.bdf ) + { + pcidev_get(pdev); return pdev; - + } return NULL; } @@ -663,7 
+646,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn, PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn)); if ( pdev ) + { pf_is_extfn = pdev->info.is_extfn; + pcidev_put(pdev); + } pcidevs_unlock(); if ( !pdev ) pci_add_device(seg, info->physfn.bus, info->physfn.devfn, @@ -818,7 +804,9 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn) if ( pdev->domain ) list_del(&pdev->domain_list); printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf); - free_pdev(pseg, pdev); + list_del(&pdev->alldevs_list); + pdev_msi_deinit(pdev); + pcidev_put(pdev); break; } @@ -848,7 +836,7 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus, { ret = iommu_quarantine_dev_init(pci_to_dev(pdev)); if ( ret ) - return ret; + goto out; target = dom_io; } @@ -878,6 +866,7 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus, pdev->fault.count = 0; out: + pcidev_put(pdev); if ( ret ) printk(XENLOG_G_ERR "%pd: deassign (%pp) failed (%d)\n", d, &PCI_SBDF(seg, bus, devfn), ret); @@ -1011,7 +1000,10 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn) pdev->fault.count >>= 1; pdev->fault.time = now; if ( ++pdev->fault.count < PT_FAULT_THRESHOLD ) + { + pcidev_put(pdev); pdev = NULL; + } } pcidevs_unlock(); @@ -1022,6 +1014,8 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn) * control it for us. 
*/ cword = pci_conf_read16(pdev->sbdf, PCI_COMMAND); pci_conf_write16(pdev->sbdf, PCI_COMMAND, cword & ~PCI_COMMAND_MASTER); + + pcidev_put(pdev); } /* @@ -1138,6 +1132,7 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices( printk(XENLOG_WARNING "Dom%d owning %pp?\n", pdev->domain->domain_id, &pdev->sbdf); + pcidev_put(pdev); if ( iommu_verbose ) { pcidevs_unlock(); @@ -1385,33 +1380,28 @@ static int iommu_remove_device(struct pci_dev *pdev) return iommu_call(hd->platform_ops, remove_device, devfn, pci_to_dev(pdev)); } -static int device_assigned(u16 seg, u8 bus, u8 devfn) +static int device_assigned(struct pci_dev *pdev) { - struct pci_dev *pdev; int rc = 0; ASSERT(pcidevs_locked()); - pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); - - if ( !pdev ) - rc = -ENODEV; /* * If the device exists and it is not owned by either the hardware * domain or dom_io then it must be assigned to a guest, or be * hidden (owned by dom_xen). */ - else if ( pdev->domain != hardware_domain && - pdev->domain != dom_io ) + if ( pdev->domain != hardware_domain && + pdev->domain != dom_io ) rc = -EBUSY; return rc; } /* Caller should hold the pcidevs_lock */ -static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) +static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag) { const struct domain_iommu *hd = dom_iommu(d); - struct pci_dev *pdev; + uint8_t devfn; int rc = 0; if ( !is_iommu_enabled(d) ) @@ -1422,10 +1412,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) /* device_assigned() should already have cleared the device for assignment */ ASSERT(pcidevs_locked()); - pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); ASSERT(pdev && (pdev->domain == hardware_domain || pdev->domain == dom_io)); + devfn = pdev->devfn; + /* Do not allow broken devices to be assigned to guests. 
*/ rc = -EBADF; if ( pdev->broken && d != hardware_domain && d != dom_io ) @@ -1460,7 +1451,7 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) done: if ( rc ) printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n", - d, &PCI_SBDF(seg, bus, devfn), rc); + d, &PCI_SBDF(pdev->seg, pdev->bus, devfn), rc); /* The device is assigned to dom_io so mark it as quarantined */ else if ( d == dom_io ) pdev->quarantine = true; @@ -1595,6 +1586,9 @@ int iommu_do_pci_domctl( ASSERT(d); /* fall through */ case XEN_DOMCTL_test_assign_device: + { + struct pci_dev *pdev; + /* Don't support self-assignment of devices. */ if ( d == current->domain ) { @@ -1622,26 +1616,36 @@ int iommu_do_pci_domctl( seg = machine_sbdf >> 16; bus = PCI_BUS(machine_sbdf); devfn = PCI_DEVFN(machine_sbdf); - pcidevs_lock(); - ret = device_assigned(seg, bus, devfn); + pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); + if ( !pdev ) + { + printk(XENLOG_G_INFO "%pp non-existent\n", + &PCI_SBDF(seg, bus, devfn)); + ret = -EINVAL; + break; + } + + ret = device_assigned(pdev); if ( domctl->cmd == XEN_DOMCTL_test_assign_device ) { if ( ret ) { - printk(XENLOG_G_INFO "%pp already assigned, or non-existent\n", + printk(XENLOG_G_INFO "%pp already assigned\n", &PCI_SBDF(seg, bus, devfn)); ret = -EINVAL; } } else if ( !ret ) - ret = assign_device(d, seg, bus, devfn, flags); + ret = assign_device(d, pdev, flags); + + pcidev_put(pdev); pcidevs_unlock(); if ( ret == -ERESTART ) ret = hypercall_create_continuation(__HYPERVISOR_domctl, "h", u_domctl); break; - + } case XEN_DOMCTL_deassign_device: /* Don't support self-deassignment of devices. 
*/ if ( d == current->domain ) @@ -1681,6 +1685,46 @@ int iommu_do_pci_domctl( return ret; } +static void release_pdev(refcnt_t *refcnt) +{ + struct pci_dev *pdev = container_of(refcnt, struct pci_dev, refcnt); + struct pci_seg *pseg = get_pseg(pdev->seg); + + printk(XENLOG_DEBUG "PCI release device %pp\n", &pdev->sbdf); + + /* update bus2bridge */ + switch ( pdev->type ) + { + unsigned int sec_bus, sub_bus; + + case DEV_TYPE_PCIe2PCI_BRIDGE: + case DEV_TYPE_LEGACY_PCI_BRIDGE: + sec_bus = pci_conf_read8(pdev->sbdf, PCI_SECONDARY_BUS); + sub_bus = pci_conf_read8(pdev->sbdf, PCI_SUBORDINATE_BUS); + + spin_lock(&pseg->bus2bridge_lock); + for ( ; sec_bus <= sub_bus; sec_bus++ ) + pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus]; + spin_unlock(&pseg->bus2bridge_lock); + break; + + default: + break; + } + + xfree(pdev); +} + +void pcidev_get(struct pci_dev *pdev) +{ + refcnt_get(&pdev->refcnt); +} + +void pcidev_put(struct pci_dev *pdev) +{ + refcnt_put(&pdev->refcnt, release_pdev); +} + /* * Local variables: * mode: C diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c index fcc8f73e8b..d240da0416 100644 --- a/xen/drivers/passthrough/vtd/quirks.c +++ b/xen/drivers/passthrough/vtd/quirks.c @@ -429,6 +429,8 @@ static int __must_check map_me_phantom_function(struct domain *domain, rc = domain_context_unmap_one(domain, drhd->iommu, 0, PCI_DEVFN(dev, 7)); + pcidev_put(pdev); + return rc; } diff --git a/xen/drivers/video/vga.c b/xen/drivers/video/vga.c index 0a03508bee..1049d4da6d 100644 --- a/xen/drivers/video/vga.c +++ b/xen/drivers/video/vga.c @@ -114,7 +114,7 @@ void __init video_endboot(void) for ( bus = 0; bus < 256; ++bus ) for ( devfn = 0; devfn < 256; ++devfn ) { - const struct pci_dev *pdev; + struct pci_dev *pdev; u8 b = bus, df = devfn, sb; pcidevs_lock(); @@ -126,7 +126,11 @@ void __init video_endboot(void) PCI_CLASS_DEVICE) != 0x0300 || !(pci_conf_read16(PCI_SBDF(0, bus, devfn), PCI_COMMAND) & (PCI_COMMAND_IO | 
PCI_COMMAND_MEMORY)) ) + { + if ( pdev ) + pcidev_put(pdev); continue; + } while ( b ) { @@ -157,6 +161,7 @@ void __init video_endboot(void) bus, PCI_SLOT(devfn), PCI_FUNC(devfn)); pci_hide_device(0, bus, devfn); } + pcidev_put(pdev); } } diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c index 6d48d496bb..5232f9605b 100644 --- a/xen/drivers/vpci/vpci.c +++ b/xen/drivers/vpci/vpci.c @@ -317,8 +317,8 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size, uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) { - const struct domain *d = current->domain; - const struct pci_dev *pdev; + struct domain *d = current->domain; + struct pci_dev *pdev; const struct vpci_register *r; unsigned int data_offset = 0; uint32_t data = ~(uint32_t)0; @@ -332,7 +332,11 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) /* Find the PCI dev matching the address. */ pdev = pci_get_pdev(d, sbdf); if ( !pdev || !pdev->vpci ) + { + if ( pdev ) + pcidev_put(pdev); return vpci_read_hw(sbdf, reg, size); + } spin_lock(&pdev->vpci->lock); @@ -378,6 +382,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) ASSERT(data_offset < size); } spin_unlock(&pdev->vpci->lock); + pcidev_put(pdev); if ( data_offset < size ) { @@ -420,8 +425,8 @@ static void vpci_write_helper(const struct pci_dev *pdev, void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, uint32_t data) { - const struct domain *d = current->domain; - const struct pci_dev *pdev; + struct domain *d = current->domain; + struct pci_dev *pdev; const struct vpci_register *r; unsigned int data_offset = 0; const unsigned long *ro_map = pci_get_ro_map(sbdf.seg); @@ -443,6 +448,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, pdev = pci_get_pdev(d, sbdf); if ( !pdev || !pdev->vpci ) { + if ( pdev ) + pcidev_put(pdev); vpci_write_hw(sbdf, reg, size, data); return; } @@ -483,6 +490,7 @@ void 
vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, ASSERT(data_offset < size); } spin_unlock(&pdev->vpci->lock); + pcidev_put(pdev); if ( data_offset < size ) /* Tailing gap, write the remaining. */ diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h index 5975ca2f30..6631643fb1 100644 --- a/xen/include/xen/pci.h +++ b/xen/include/xen/pci.h @@ -13,6 +13,7 @@ #include <xen/irq.h> #include <xen/pci_regs.h> #include <xen/pfn.h> +#include <xen/refcnt.h> #include <asm/device.h> #include <asm/numa.h> @@ -116,6 +117,9 @@ struct pci_dev { /* Device misbehaving, prevent assigning it to guests. */ bool broken; + /* Reference counter */ + refcnt_t refcnt; + enum pdev_type { DEV_TYPE_PCI_UNKNOWN, DEV_TYPE_PCIe_ENDPOINT, @@ -160,6 +164,14 @@ void pcidevs_lock(void); void pcidevs_unlock(void); bool_t __must_check pcidevs_locked(void); +/* + * Acquire and release reference to the given device. Holding + * reference ensures that device will not disappear under feet, but + * does not guarantee that code has exclusive access to the device. + */ +void pcidev_get(struct pci_dev *pdev); +void pcidev_put(struct pci_dev *pdev); + bool_t pci_known_segment(u16 seg); bool_t pci_device_detect(u16 seg, u8 bus, u8 dev, u8 func); int scan_pci_devices(void); @@ -177,8 +189,14 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn, int pci_remove_device(u16 seg, u8 bus, u8 devfn); int pci_ro_device(int seg, int bus, int devfn); int pci_hide_device(unsigned int seg, unsigned int bus, unsigned int devfn); + +/* + * Next two functions will find a requested device and acquire + * reference to it. Use pcidev_put() to release the reference. + */ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf); struct pci_dev *pci_get_real_pdev(pci_sbdf_t sbdf); + void pci_check_disable_device(u16 seg, u8 bus, u8 devfn); uint8_t pci_conf_read8(pci_sbdf_t sbdf, unsigned int reg); -- 2.39.2 ^ permalink raw reply related [flat|nested] 50+ messages in thread
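Editorial sketch, not part of the posted series: most hunks in the patch above add a pcidev_put() on every exit path of a function that obtained its pointer from pci_get_pdev(), often guarded by "if ( pdev )". The single-threaded mock below shows one way such callers could be structured, with a single exit label and a NULL-tolerant put. Everything here (struct dev, dev_lookup(), dev_put(), dev_read_value(), the plain int counter) is hypothetical, and the NULL tolerance is an assumption of the sketch: pcidev_put() as posted dereferences its argument unconditionally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, single-threaded stand-in for struct pci_dev. */
struct dev {
    unsigned int sbdf;
    int refcnt;   /* plain int: this sketch ignores concurrency */
    int value;
};

static struct dev devs[] = {
    { .sbdf = 0x0100, .refcnt = 1, .value = 42 },
    { .sbdf = 0x0200, .refcnt = 1, .value = 7 },
};

/* Like pci_get_pdev() in the patch: the returned pointer carries a reference. */
static struct dev *dev_lookup(unsigned int sbdf)
{
    for ( size_t i = 0; i < sizeof(devs) / sizeof(devs[0]); i++ )
        if ( devs[i].sbdf == sbdf )
        {
            devs[i].refcnt++;
            return &devs[i];
        }

    return NULL;
}

/* NULL-tolerant put, so error paths need no "if ( d )" guard. */
static void dev_put(struct dev *d)
{
    if ( !d )
        return;

    assert(d->refcnt > 0);
    d->refcnt--;
}

/* Every path funnels through one label, so the reference is dropped once. */
static int dev_read_value(unsigned int sbdf, int *out)
{
    struct dev *d = dev_lookup(sbdf);
    int rc = -1;

    if ( !d )
        goto out;

    if ( d->value < 0 )
        goto out;

    *out = d->value;
    rc = 0;

 out:
    dev_put(d);
    return rc;
}
```

The trade-off is the usual one for goto-based cleanup: slightly less local control flow in exchange for one place where the reference is released.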
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-03-14 20:56 ` [PATCH v3 2/6] xen: pci: introduce reference counting for pdev Volodymyr Babchuk @ 2023-03-16 16:16 ` Roger Pau Monné 2023-03-29 9:55 ` Jan Beulich 2023-04-11 23:41 ` Volodymyr Babchuk 2023-03-29 10:04 ` Jan Beulich 1 sibling, 2 replies; 50+ messages in thread From: Roger Pau Monné @ 2023-03-16 16:16 UTC (permalink / raw) To: Volodymyr Babchuk Cc: xen-devel, Jan Beulich, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: > Prior to this change, lifetime of pci_dev objects was protected by global > pcidevs_lock(). Long-term plan is to remove this log, so we need some ^ lock I wouldn't say remove, as one way or another we need a lock to protect concurrent accesses. > other mechanism to ensure that those objects will not disappear under > feet of code that access them. Reference counting is a good choice as > it provides easy to comprehend way to control object lifetime. > > This patch adds two new helper functions: pcidev_get() and > pcidev_put(). pcidev_get() will increase reference counter, while > pcidev_put() will decrease it, destroying object when counter reaches > zero. > > pcidev_get() should be used only when you already have a valid pointer > to the object or you are holding lock that protects one of the > lists (domain, pseg or ats) that store pci_dev structs. > > pcidev_get() is rarely used directly, because there already are > functions that will provide valid pointer to pci_dev struct: > pci_get_pdev(), pci_get_real_pdev(). They will lock appropriate list, > find needed object and increase its reference counter before returning > to the caller. > > Naturally, pci_put() should be called after finishing working with a > received object. 
This is the reason why this patch have so many > pcidev_put()s and so little pcidev_get()s: existing calls to > pci_get_*() functions now will increase reference counter > automatically, we just need to decrease it back when we finished. After looking a bit into this, I would like to ask whether it's been considered the need to increase the refcount for each use of a pdev. For example I would consider the initial alloc_pdev() to take a refcount, and then pci_remove_device() _must_ be the function that removes the last refcount, so that it can return -EBUSY otherwise (see my comment below). I would also think that having the device assigned to a guest will take another refcount, and then any usage from further callers (ie: like vpci) will need some kind of protection from preventing the device from being deassigned from a domain while vPCI handlers are running, and the current refcount won't help with that. That makes me wonder if for example callers of pci_get_pdev(d, sbdf) do need to take an extra refcount, because such access is already protected from the pdev going away by the fact that the device is assigned to a guest. But maybe it's too much work to separate users of pci_get_pdev(d, ...); vs pci_get_pdev(NULL, ...);. There's also a window when the refcount is dropped to 0, and the destruction function is called, but at the same time a concurrent thread could attempt to take a reference to the pdev still? > > This patch removes "const" qualifier from some pdev pointers because > pcidev_put() technically alters the contents of pci_dev structure. 
> > Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> > Suggested-by: Jan Beulich <jbeulich@suse.com> > > --- > > v3: > - Moved in from another patch series > - Fixed code formatting (tabs -> spaces) > - Removed erroneous pcidev_put in vga.c > - Added missing pcidev_put in couple of places > - removed mention of pci_get_pdev_by_domain() > --- > xen/arch/x86/hvm/vmsi.c | 2 +- > xen/arch/x86/irq.c | 4 + > xen/arch/x86/msi.c | 44 +++++++- > xen/arch/x86/pci.c | 3 + > xen/arch/x86/physdev.c | 17 ++- > xen/common/sysctl.c | 7 +- > xen/drivers/passthrough/amd/iommu_init.c | 12 +- > xen/drivers/passthrough/amd/iommu_map.c | 6 +- > xen/drivers/passthrough/pci.c | 138 +++++++++++++++-------- > xen/drivers/passthrough/vtd/quirks.c | 2 + > xen/drivers/video/vga.c | 7 +- > xen/drivers/vpci/vpci.c | 16 ++- > xen/include/xen/pci.h | 18 +++ > 13 files changed, 215 insertions(+), 61 deletions(-) > > diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c > index 3cd4923060..8c3d673872 100644 > --- a/xen/arch/x86/hvm/vmsi.c > +++ b/xen/arch/x86/hvm/vmsi.c > @@ -914,7 +914,7 @@ int vpci_msix_arch_print(const struct vpci_msix *msix) > > spin_unlock(&msix->pdev->vpci->lock); > process_pending_softirqs(); > - /* NB: we assume that pdev cannot go away for an alive domain. 
*/ > + > if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) ) > return -EBUSY; > if ( pdev->vpci->msix != msix ) > diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c > index 20150b1c7f..87464d82c8 100644 > --- a/xen/arch/x86/irq.c > +++ b/xen/arch/x86/irq.c > @@ -2175,6 +2175,7 @@ int map_domain_pirq( > msi->entry_nr = ret; > ret = -ENFILE; > } > + pcidev_put(pdev); > goto done; > } > > @@ -2189,6 +2190,7 @@ int map_domain_pirq( > msi_desc->irq = -1; > msi_free_irq(msi_desc); > ret = -EBUSY; > + pcidev_put(pdev); > goto done; > } > > @@ -2273,10 +2275,12 @@ int map_domain_pirq( > } > msi_desc->irq = -1; > msi_free_irq(msi_desc); > + pcidev_put(pdev); > goto done; > } > > set_domain_irq_pirq(d, irq, info); > + pcidev_put(pdev); > spin_unlock_irqrestore(&desc->lock, flags); > } > else > diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c > index d0bf63df1d..91926fce50 100644 > --- a/xen/arch/x86/msi.c > +++ b/xen/arch/x86/msi.c > @@ -572,6 +572,10 @@ int msi_free_irq(struct msi_desc *entry) > virt_to_fix((unsigned long)entry->mask_base)); > > list_del(&entry->list); > + > + /* Corresponds to pcidev_get() in msi[x]_capability_init() */ > + pcidev_put(entry->dev); > + > xfree(entry); > return 0; > } > @@ -644,6 +648,7 @@ static int msi_capability_init(struct pci_dev *dev, > entry[i].msi.mpos = mpos; > entry[i].msi.nvec = 0; > entry[i].dev = dev; > + pcidev_get(dev); > } > entry->msi.nvec = nvec; > entry->irq = irq; > @@ -703,22 +708,36 @@ static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf) > !num_vf || !offset || (num_vf > 1 && !stride) || > bir >= PCI_SRIOV_NUM_BARS || > !pdev->vf_rlen[bir] ) > + { > + if ( pdev ) > + pcidev_put(pdev); > return 0; > + } > base = pos + PCI_SRIOV_BAR; > vf -= PCI_BDF(bus, slot, func) + offset; > if ( vf < 0 ) > + { > + pcidev_put(pdev); > return 0; > + } > if ( stride ) > { > if ( vf % stride ) > + { > + pcidev_put(pdev); > return 0; > + } > vf /= stride; > } > if ( vf >= num_vf ) > + { > + 
pcidev_put(pdev); > return 0; > + } > BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS); > disp = vf * pdev->vf_rlen[bir]; > limit = PCI_SRIOV_NUM_BARS; > + pcidev_put(pdev); > } > else switch ( pci_conf_read8(PCI_SBDF(seg, bus, slot, func), > PCI_HEADER_TYPE) & 0x7f ) > @@ -925,6 +944,8 @@ static int msix_capability_init(struct pci_dev *dev, > entry->dev = dev; > entry->mask_base = base; > > + pcidev_get(dev); > + > list_add_tail(&entry->list, &dev->msi_list); > *desc = entry; > } > @@ -999,6 +1020,7 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc) > { > struct pci_dev *pdev; > struct msi_desc *old_desc; > + int ret; > > ASSERT(pcidevs_locked()); > pdev = pci_get_pdev(NULL, msi->sbdf); > @@ -1010,6 +1032,7 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc) > { > printk(XENLOG_ERR "irq %d already mapped to MSI on %pp\n", > msi->irq, &pdev->sbdf); > + pcidev_put(pdev); > return -EEXIST; > } > > @@ -1020,7 +1043,10 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc) > __pci_disable_msix(old_desc); > } > > - return msi_capability_init(pdev, msi->irq, desc, msi->entry_nr); > + ret = msi_capability_init(pdev, msi->irq, desc, msi->entry_nr); > + pcidev_put(pdev); > + > + return ret; > } > > static void __pci_disable_msi(struct msi_desc *entry) > @@ -1054,20 +1080,29 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc) > { > struct pci_dev *pdev; > struct msi_desc *old_desc; > + int ret; > > ASSERT(pcidevs_locked()); > pdev = pci_get_pdev(NULL, msi->sbdf); > if ( !pdev || !pdev->msix ) > + { > + if ( pdev ) > + pcidev_put(pdev); > return -ENODEV; > + } > > if ( msi->entry_nr >= pdev->msix->nr_entries ) > + { > + pcidev_put(pdev); > return -EINVAL; > + } > > old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSIX); > if ( old_desc ) > { > printk(XENLOG_ERR "irq %d already mapped to MSI-X on %pp\n", > msi->irq, &pdev->sbdf); > + pcidev_put(pdev); > return 
-EEXIST; > } > > @@ -1078,7 +1113,11 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc) > __pci_disable_msi(old_desc); > } > > - return msix_capability_init(pdev, msi, desc); > + ret = msix_capability_init(pdev, msi, desc); > + > + pcidev_put(pdev); > + > + return ret; > } > > static void _pci_cleanup_msix(struct arch_msix *msix) > @@ -1159,6 +1198,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off) > } > else > rc = msix_capability_init(pdev, NULL, NULL); > + pcidev_put(pdev); > pcidevs_unlock(); > > return rc; > diff --git a/xen/arch/x86/pci.c b/xen/arch/x86/pci.c > index 97b792e578..c1fcdf08d6 100644 > --- a/xen/arch/x86/pci.c > +++ b/xen/arch/x86/pci.c > @@ -92,7 +92,10 @@ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf, > > pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bdf)); > if ( pdev ) > + { > rc = pci_msi_conf_write_intercept(pdev, reg, size, data); > + pcidev_put(pdev); > + } > > pcidevs_unlock(); > > diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c > index 2f1d955a96..96214a3d40 100644 > --- a/xen/arch/x86/physdev.c > +++ b/xen/arch/x86/physdev.c > @@ -533,7 +533,14 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) > pcidevs_lock(); > pdev = pci_get_pdev(NULL, > PCI_SBDF(0, restore_msi.bus, restore_msi.devfn)); > - ret = pdev ? pci_restore_msi_state(pdev) : -ENODEV; > + if ( pdev ) > + { > + ret = pci_restore_msi_state(pdev); > + pcidev_put(pdev); > + } > + else > + ret = -ENODEV; > + > pcidevs_unlock(); > break; > } > @@ -548,7 +555,13 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) > > pcidevs_lock(); > pdev = pci_get_pdev(NULL, PCI_SBDF(dev.seg, dev.bus, dev.devfn)); > - ret = pdev ? 
pci_restore_msi_state(pdev) : -ENODEV; > + if ( pdev ) > + { > + ret = pci_restore_msi_state(pdev); > + pcidev_put(pdev); > + } > + else > + ret = -ENODEV; > pcidevs_unlock(); > break; > } > diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c > index 02505ab044..9af07fa92a 100644 > --- a/xen/common/sysctl.c > +++ b/xen/common/sysctl.c > @@ -438,7 +438,7 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl) > { > physdev_pci_device_t dev; > uint32_t node; > - const struct pci_dev *pdev; > + struct pci_dev *pdev; > > if ( copy_from_guest_offset(&dev, ti->devs, i, 1) ) > { > @@ -454,8 +454,11 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl) > node = XEN_INVALID_NODE_ID; > else > node = pdev->node; > - pcidevs_unlock(); > > + if ( pdev ) > + pcidev_put(pdev); > + > + pcidevs_unlock(); > if ( copy_to_guest_offset(ti->nodes, i, &node, 1) ) > { > ret = -EFAULT; > diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c > index 9773ccfcb4..f90b1c1e58 100644 > --- a/xen/drivers/passthrough/amd/iommu_init.c > +++ b/xen/drivers/passthrough/amd/iommu_init.c > @@ -646,6 +646,7 @@ static void cf_check parse_ppr_log_entry(struct amd_iommu *iommu, u32 entry[]) > > if ( pdev ) > guest_iommu_add_ppr_log(pdev->domain, entry); > + pcidev_put(pdev); > } > > static void iommu_check_ppr_log(struct amd_iommu *iommu) > @@ -749,6 +750,11 @@ static bool_t __init set_iommu_interrupt_handler(struct amd_iommu *iommu) > } > > pcidevs_lock(); > + /* > + * XXX: it is unclear if this device can be removed. Right now > + * there is no code that clears msi.dev, so no one will decrease > + * refcount on it. > + */ > iommu->msi.dev = pci_get_pdev(NULL, PCI_SBDF(iommu->seg, iommu->bdf)); I don't think we can remove an IOMMU from the system, so this is fine as-is AFAICT. 
> pcidevs_unlock(); > if ( !iommu->msi.dev ) > @@ -1274,7 +1280,7 @@ static int __init cf_check amd_iommu_setup_device_table( > { > if ( ivrs_mappings[bdf].valid ) > { > - const struct pci_dev *pdev = NULL; > + struct pci_dev *pdev = NULL; > > /* add device table entry */ > iommu_dte_add_device_entry(&dt[bdf], &ivrs_mappings[bdf]); > @@ -1299,7 +1305,10 @@ static int __init cf_check amd_iommu_setup_device_table( > pdev->msix ? pdev->msix->nr_entries > : pdev->msi_maxvec); > if ( !ivrs_mappings[bdf].intremap_table ) > + { > + pcidev_put(pdev); > return -ENOMEM; > + } > > if ( pdev->phantom_stride ) > { > @@ -1317,6 +1326,7 @@ static int __init cf_check amd_iommu_setup_device_table( > ivrs_mappings[bdf].intremap_inuse; > } > } > + pcidev_put(pdev); > } > > amd_iommu_set_intremap_table( > diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c > index 993bac6f88..9d621e3d36 100644 > --- a/xen/drivers/passthrough/amd/iommu_map.c > +++ b/xen/drivers/passthrough/amd/iommu_map.c > @@ -724,14 +724,18 @@ int cf_check amd_iommu_get_reserved_device_memory( > if ( !iommu ) > { > /* May need to trigger the workaround in find_iommu_for_device(). */ > - const struct pci_dev *pdev; > + struct pci_dev *pdev; > > pcidevs_lock(); > pdev = pci_get_pdev(NULL, sbdf); > pcidevs_unlock(); > > if ( pdev ) > + { > iommu = find_iommu_for_device(seg, bdf); > + /* XXX: Should we hold pdev reference till end of the loop? */ > + pcidev_put(pdev); I don't think you need to hold a reference to the device until the end of the loop, the data fetched there is from the ACPI tables. If the func() helper also needs a pdev instance is it's task to get one. 
> + } > if ( !iommu ) > continue; > } > diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c > index b42acb8d7c..b32382aca0 100644 > --- a/xen/drivers/passthrough/pci.c > +++ b/xen/drivers/passthrough/pci.c > @@ -328,6 +328,7 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn) > *((u8*) &pdev->bus) = bus; > *((u8*) &pdev->devfn) = devfn; > pdev->domain = NULL; > + refcnt_init(&pdev->refcnt); > > arch_pci_init_pdev(pdev); > > @@ -422,33 +423,6 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn) > return pdev; > } > > -static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev) > -{ > - /* update bus2bridge */ > - switch ( pdev->type ) > - { > - unsigned int sec_bus, sub_bus; > - > - case DEV_TYPE_PCIe2PCI_BRIDGE: > - case DEV_TYPE_LEGACY_PCI_BRIDGE: > - sec_bus = pci_conf_read8(pdev->sbdf, PCI_SECONDARY_BUS); > - sub_bus = pci_conf_read8(pdev->sbdf, PCI_SUBORDINATE_BUS); > - > - spin_lock(&pseg->bus2bridge_lock); > - for ( ; sec_bus <= sub_bus; sec_bus++ ) > - pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus]; > - spin_unlock(&pseg->bus2bridge_lock); > - break; > - > - default: > - break; > - } > - > - list_del(&pdev->alldevs_list); > - pdev_msi_deinit(pdev); > - xfree(pdev); > -} > - > static void __init _pci_hide_device(struct pci_dev *pdev) > { > if ( pdev->domain ) > @@ -517,10 +491,14 @@ struct pci_dev *pci_get_real_pdev(pci_sbdf_t sbdf) > { > if ( !(sbdf.devfn & stride) ) > continue; > + Unrelated change? There are some of those in the patch, should be removed. 
> sbdf.devfn &= ~stride; > pdev = pci_get_pdev(NULL, sbdf); > if ( pdev && stride != pdev->phantom_stride ) > + { > + pcidev_put(pdev); > pdev = NULL; > + } > } > > return pdev; > @@ -548,13 +526,18 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf) > list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list ) > if ( pdev->sbdf.bdf == sbdf.bdf && > (!d || pdev->domain == d) ) > + { > + pcidev_get(pdev); > return pdev; > + } > } > else > list_for_each_entry ( pdev, &d->pdev_list, domain_list ) > if ( pdev->sbdf.bdf == sbdf.bdf ) > + { > + pcidev_get(pdev); > return pdev; > - > + } > return NULL; > } > > @@ -663,7 +646,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn, > PCI_SBDF(seg, info->physfn.bus, > info->physfn.devfn)); > if ( pdev ) > + { > pf_is_extfn = pdev->info.is_extfn; > + pcidev_put(pdev); > + } > pcidevs_unlock(); > if ( !pdev ) > pci_add_device(seg, info->physfn.bus, info->physfn.devfn, > @@ -818,7 +804,9 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn) > if ( pdev->domain ) > list_del(&pdev->domain_list); > printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf); > - free_pdev(pseg, pdev); > + list_del(&pdev->alldevs_list); > + pdev_msi_deinit(pdev); > + pcidev_put(pdev); Hm, I think here we want to make sure that the device has been freed, or else you would have to return -EBUSY to the calls to notify that the device is still in use. I think we need an extra pcidev_put_final() or similar that can be used in pci_remove_device() to assert that the device has been actually removed. 
> break; > } > > @@ -848,7 +836,7 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus, > { > ret = iommu_quarantine_dev_init(pci_to_dev(pdev)); > if ( ret ) > - return ret; > + goto out; > > target = dom_io; > } > @@ -878,6 +866,7 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus, > pdev->fault.count = 0; > > out: > + pcidev_put(pdev); > if ( ret ) > printk(XENLOG_G_ERR "%pd: deassign (%pp) failed (%d)\n", > d, &PCI_SBDF(seg, bus, devfn), ret); > @@ -1011,7 +1000,10 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn) > pdev->fault.count >>= 1; > pdev->fault.time = now; > if ( ++pdev->fault.count < PT_FAULT_THRESHOLD ) > + { > + pcidev_put(pdev); > pdev = NULL; > + } > } > pcidevs_unlock(); > > @@ -1022,6 +1014,8 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn) > * control it for us. */ > cword = pci_conf_read16(pdev->sbdf, PCI_COMMAND); > pci_conf_write16(pdev->sbdf, PCI_COMMAND, cword & ~PCI_COMMAND_MASTER); > + > + pcidev_put(pdev); > } > > /* > @@ -1138,6 +1132,7 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices( > printk(XENLOG_WARNING "Dom%d owning %pp?\n", > pdev->domain->domain_id, &pdev->sbdf); > > + pcidev_put(pdev); > if ( iommu_verbose ) > { > pcidevs_unlock(); > @@ -1385,33 +1380,28 @@ static int iommu_remove_device(struct pci_dev *pdev) > return iommu_call(hd->platform_ops, remove_device, devfn, pci_to_dev(pdev)); > } > > -static int device_assigned(u16 seg, u8 bus, u8 devfn) > +static int device_assigned(struct pci_dev *pdev) > { > - struct pci_dev *pdev; > int rc = 0; > > ASSERT(pcidevs_locked()); > - pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); > - > - if ( !pdev ) > - rc = -ENODEV; > /* > * If the device exists and it is not owned by either the hardware > * domain or dom_io then it must be assigned to a guest, or be > * hidden (owned by dom_xen). 
> */ > - else if ( pdev->domain != hardware_domain && > - pdev->domain != dom_io ) > + if ( pdev->domain != hardware_domain && > + pdev->domain != dom_io ) > rc = -EBUSY; > > return rc; > } > > /* Caller should hold the pcidevs_lock */ I would assume the caller has taken an extra reference to the pdev, so holding the pcidevs_lock is no longer needed? > -static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) > +static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag) > { > const struct domain_iommu *hd = dom_iommu(d); > - struct pci_dev *pdev; > + uint8_t devfn; > int rc = 0; > > if ( !is_iommu_enabled(d) ) > @@ -1422,10 +1412,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) > > /* device_assigned() should already have cleared the device for assignment */ > ASSERT(pcidevs_locked()); > - pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); > ASSERT(pdev && (pdev->domain == hardware_domain || > pdev->domain == dom_io)); > > + devfn = pdev->devfn; > + > /* Do not allow broken devices to be assigned to guests. */ > rc = -EBADF; > if ( pdev->broken && d != hardware_domain && d != dom_io ) > @@ -1460,7 +1451,7 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) > done: > if ( rc ) > printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n", > - d, &PCI_SBDF(seg, bus, devfn), rc); > + d, &PCI_SBDF(pdev->seg, pdev->bus, devfn), rc); > /* The device is assigned to dom_io so mark it as quarantined */ > else if ( d == dom_io ) > pdev->quarantine = true; > @@ -1595,6 +1586,9 @@ int iommu_do_pci_domctl( > ASSERT(d); > /* fall through */ > case XEN_DOMCTL_test_assign_device: > + { > + struct pci_dev *pdev; > + > /* Don't support self-assignment of devices. 
*/ > if ( d == current->domain ) > { > @@ -1622,26 +1616,36 @@ int iommu_do_pci_domctl( > seg = machine_sbdf >> 16; > bus = PCI_BUS(machine_sbdf); > devfn = PCI_DEVFN(machine_sbdf); > - > pcidevs_lock(); > - ret = device_assigned(seg, bus, devfn); > + pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); > + if ( !pdev ) > + { > + printk(XENLOG_G_INFO "%pp non-existent\n", > + &PCI_SBDF(seg, bus, devfn)); > + ret = -EINVAL; > + break; > + } > + > + ret = device_assigned(pdev); > if ( domctl->cmd == XEN_DOMCTL_test_assign_device ) > { > if ( ret ) > { > - printk(XENLOG_G_INFO "%pp already assigned, or non-existent\n", > + printk(XENLOG_G_INFO "%pp already assigned\n", > &PCI_SBDF(seg, bus, devfn)); > ret = -EINVAL; > } > } > else if ( !ret ) > - ret = assign_device(d, seg, bus, devfn, flags); > + ret = assign_device(d, pdev, flags); > + > + pcidev_put(pdev); I would think you need to keep the refcount here if ret == 0, so that the device cannot be removed while assigned to a domain? > pcidevs_unlock(); > if ( ret == -ERESTART ) > ret = hypercall_create_continuation(__HYPERVISOR_domctl, > "h", u_domctl); > break; > - > + } > case XEN_DOMCTL_deassign_device: > /* Don't support self-deassignment of devices. 
*/ > if ( d == current->domain ) > @@ -1681,6 +1685,46 @@ int iommu_do_pci_domctl( > return ret; > } > > +static void release_pdev(refcnt_t *refcnt) > +{ > + struct pci_dev *pdev = container_of(refcnt, struct pci_dev, refcnt); > + struct pci_seg *pseg = get_pseg(pdev->seg); > + > + printk(XENLOG_DEBUG "PCI release device %pp\n", &pdev->sbdf); > + > + /* update bus2bridge */ > + switch ( pdev->type ) > + { > + unsigned int sec_bus, sub_bus; > + > + case DEV_TYPE_PCIe2PCI_BRIDGE: > + case DEV_TYPE_LEGACY_PCI_BRIDGE: > + sec_bus = pci_conf_read8(pdev->sbdf, PCI_SECONDARY_BUS); > + sub_bus = pci_conf_read8(pdev->sbdf, PCI_SUBORDINATE_BUS); > + > + spin_lock(&pseg->bus2bridge_lock); > + for ( ; sec_bus <= sub_bus; sec_bus++ ) > + pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus]; > + spin_unlock(&pseg->bus2bridge_lock); > + break; > + > + default: > + break; > + } > + > + xfree(pdev); > +} > + > +void pcidev_get(struct pci_dev *pdev) > +{ > + refcnt_get(&pdev->refcnt); > +} > + > +void pcidev_put(struct pci_dev *pdev) > +{ > + refcnt_put(&pdev->refcnt, release_pdev); > +} > + > /* > * Local variables: > * mode: C > diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c > index fcc8f73e8b..d240da0416 100644 > --- a/xen/drivers/passthrough/vtd/quirks.c > +++ b/xen/drivers/passthrough/vtd/quirks.c > @@ -429,6 +429,8 @@ static int __must_check map_me_phantom_function(struct domain *domain, > rc = domain_context_unmap_one(domain, drhd->iommu, 0, > PCI_DEVFN(dev, 7)); > > + pcidev_put(pdev); > + > return rc; > } > > diff --git a/xen/drivers/video/vga.c b/xen/drivers/video/vga.c > index 0a03508bee..1049d4da6d 100644 > --- a/xen/drivers/video/vga.c > +++ b/xen/drivers/video/vga.c > @@ -114,7 +114,7 @@ void __init video_endboot(void) > for ( bus = 0; bus < 256; ++bus ) > for ( devfn = 0; devfn < 256; ++devfn ) > { > - const struct pci_dev *pdev; > + struct pci_dev *pdev; > u8 b = bus, df = devfn, sb; > > pcidevs_lock(); > @@ -126,7 +126,11 
@@ void __init video_endboot(void) > PCI_CLASS_DEVICE) != 0x0300 || > !(pci_conf_read16(PCI_SBDF(0, bus, devfn), PCI_COMMAND) & > (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) ) > + { > + if ( pdev ) > + pcidev_put(pdev); > continue; > + } > > while ( b ) > { > @@ -157,6 +161,7 @@ void __init video_endboot(void) > bus, PCI_SLOT(devfn), PCI_FUNC(devfn)); > pci_hide_device(0, bus, devfn); > } > + pcidev_put(pdev); > } > } > > diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c > index 6d48d496bb..5232f9605b 100644 > --- a/xen/drivers/vpci/vpci.c > +++ b/xen/drivers/vpci/vpci.c > @@ -317,8 +317,8 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size, > > uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) > { > - const struct domain *d = current->domain; > - const struct pci_dev *pdev; > + struct domain *d = current->domain; Why do you need to drop the const on domain here? > + struct pci_dev *pdev; > const struct vpci_register *r; > unsigned int data_offset = 0; > uint32_t data = ~(uint32_t)0; > @@ -332,7 +332,11 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) > /* Find the PCI dev matching the address. */ > pdev = pci_get_pdev(d, sbdf); > if ( !pdev || !pdev->vpci ) > + { > + if ( pdev ) > + pcidev_put(pdev); > return vpci_read_hw(sbdf, reg, size); > + } > > spin_lock(&pdev->vpci->lock); > > @@ -378,6 +382,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) > ASSERT(data_offset < size); > } > spin_unlock(&pdev->vpci->lock); > + pcidev_put(pdev); > > if ( data_offset < size ) > { > @@ -420,8 +425,8 @@ static void vpci_write_helper(const struct pci_dev *pdev, > void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, > uint32_t data) > { > - const struct domain *d = current->domain; > - const struct pci_dev *pdev; > + struct domain *d = current->domain; > + struct pci_dev *pdev; Same here regarding dropping the const of d. Thanks, Roger. 
^ permalink raw reply [flat|nested] 50+ messages in thread
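The window raised in the review above (the counter reaching zero while another CPU tries to take a new reference) is commonly closed with a "get unless zero" operation. The following is a minimal sketch of that technique using C11 atomics. It is not the refcnt_t interface from this series, and `toy_refcnt_t` with its helpers are hypothetical names. It also assumes the lookup that finds the object is itself serialised against the final free (for example by a list lock), since the counter memory must still be valid when it is inspected.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter type; not the refcnt_t from xen/include/xen/refcnt.h. */
typedef struct {
    atomic_int count;
} toy_refcnt_t;

static void toy_refcnt_init(toy_refcnt_t *r)
{
    atomic_init(&r->count, 1);
}

/* Take a reference only if the object is still alive (count > 0). */
static bool toy_refcnt_get_unless_zero(toy_refcnt_t *r)
{
    int old = atomic_load(&r->count);

    while ( old > 0 )
    {
        /* On failure 'old' is refreshed with the current value; retry. */
        if ( atomic_compare_exchange_weak(&r->count, &old, old + 1) )
            return true;
    }

    return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool toy_refcnt_put(toy_refcnt_t *r)
{
    int old = atomic_fetch_sub(&r->count, 1);

    assert(old > 0);
    return old == 1;
}
```

With this shape a lookup that loses the race sees a failed get and can treat the device as already gone, instead of resurrecting an object whose release function is running.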
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-03-16 16:16 ` Roger Pau Monné @ 2023-03-29 9:55 ` Jan Beulich 2023-03-29 10:48 ` Roger Pau Monné 2023-04-11 23:41 ` Volodymyr Babchuk 1 sibling, 1 reply; 50+ messages in thread From: Jan Beulich @ 2023-03-29 9:55 UTC (permalink / raw) To: Roger Pau Monné, Volodymyr Babchuk Cc: xen-devel, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian On 16.03.2023 17:16, Roger Pau Monné wrote: > On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: >> Prior to this change, lifetime of pci_dev objects was protected by global >> pcidevs_lock(). Long-term plan is to remove this log, so we need some > ^ lock > > I wouldn't say remove, as one way or another we need a lock to protect > concurrent accesses. > >> other mechanism to ensure that those objects will not disappear under >> feet of code that access them. Reference counting is a good choice as >> it provides easy to comprehend way to control object lifetime. >> >> This patch adds two new helper functions: pcidev_get() and >> pcidev_put(). pcidev_get() will increase reference counter, while >> pcidev_put() will decrease it, destroying object when counter reaches >> zero. >> >> pcidev_get() should be used only when you already have a valid pointer >> to the object or you are holding lock that protects one of the >> lists (domain, pseg or ats) that store pci_dev structs. >> >> pcidev_get() is rarely used directly, because there already are >> functions that will provide valid pointer to pci_dev struct: >> pci_get_pdev(), pci_get_real_pdev(). They will lock appropriate list, >> find needed object and increase its reference counter before returning >> to the caller. >> >> Naturally, pci_put() should be called after finishing working with a >> received object. 
This is the reason why this patch have so many >> pcidev_put()s and so little pcidev_get()s: existing calls to >> pci_get_*() functions now will increase reference counter >> automatically, we just need to decrease it back when we finished. > > After looking a bit into this, I would like to ask whether it's been > considered the need to increase the refcount for each use of a pdev. > > For example I would consider the initial alloc_pdev() to take a > refcount, and then pci_remove_device() _must_ be the function that > removes the last refcount, so that it can return -EBUSY otherwise (see > my comment below). I thought I had replied to this, but couldn't find any record thereof; apologies for a possible duplicate. In a get-/put-ref model, much like we have it for domheap pages, the last put should trigger whatever is needed for "freeing" (here: removing) the item. Therefore I think in this new model all PHYSDEVOP_{pci_device_remove,manage_pci_remove} should cause is the dropping of the ref that alloc_pdev() has put in place (plus some marking of the device, so that another PHYSDEVOP_{pci_device_remove, manage_pci_remove} can be properly ignored rather than dropping one ref too many; this marking may then also prevent the obtaining of new references, if such can be arranged for without breaking [cleanup] functionality elsewhere). Whenever the last reference is put, that would trigger the operations that pci_remove_device() presently carries out. Of course this would mean that if PHYSDEVOP_{pci_device_remove, manage_pci_remove} didn't drop the last reference, it would need to signal this to its caller, for it to be aware that the device is not yet ready for (e.g.) hot-unplug. There'll then also need to be a way for the caller to figure out when that situation has changed (which might be via repeated invocations of the same hypercall sub-op, or some new sub-op). Jan ^ permalink raw reply [flat|nested] 50+ messages in thread
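The model described in the reply above can be sketched as a small single-threaded state machine: allocation owns one reference, the first remove request marks the device and drops that reference exactly once, and teardown happens on whichever put is last. All names (`toy_pdev`, `toy_alloc()`, `toy_get()`, `toy_put()`, `toy_remove()`) are hypothetical, and using -EBUSY as the "not yet gone" signal is an assumption borrowed from the earlier review comment, not something the series implements.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct pci_dev. */
struct toy_pdev {
    int refcnt;      /* plain int: this sketch is single-threaded */
    bool removing;   /* set by the first remove request */
    bool released;   /* set when the last reference is dropped */
};

static void toy_alloc(struct toy_pdev *dev)
{
    dev->refcnt = 1;          /* reference owned by "the device exists" */
    dev->removing = false;
    dev->released = false;
}

/* New references are refused once removal has been requested. */
static int toy_get(struct toy_pdev *dev)
{
    if ( dev->removing )
        return -ENODEV;

    dev->refcnt++;
    return 0;
}

static void toy_put(struct toy_pdev *dev)
{
    assert(dev->refcnt > 0);
    if ( --dev->refcnt == 0 )
        dev->released = true; /* real teardown would run here */
}

/*
 * Remove request: drops the allocation reference once, however often
 * it is repeated, and reports -EBUSY while other users remain.
 */
static int toy_remove(struct toy_pdev *dev)
{
    if ( !dev->removing )
    {
        dev->removing = true;
        toy_put(dev);
    }

    return dev->released ? 0 : -EBUSY;
}
```

In a real implementation the object would be freed on release, so a repeated remove request would have to fail at lookup rather than inspect the structure as this toy does.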
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-03-29 9:55 ` Jan Beulich @ 2023-03-29 10:48 ` Roger Pau Monné 2023-03-29 11:58 ` Jan Beulich 0 siblings, 1 reply; 50+ messages in thread From: Roger Pau Monné @ 2023-03-29 10:48 UTC (permalink / raw) To: Jan Beulich Cc: Volodymyr Babchuk, xen-devel, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian On Wed, Mar 29, 2023 at 11:55:26AM +0200, Jan Beulich wrote: > On 16.03.2023 17:16, Roger Pau Monné wrote: > > On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: > >> Prior to this change, lifetime of pci_dev objects was protected by global > >> pcidevs_lock(). Long-term plan is to remove this log, so we need some > > ^ lock > > > > I wouldn't say remove, as one way or another we need a lock to protect > > concurrent accesses. > > > >> other mechanism to ensure that those objects will not disappear under > >> feet of code that access them. Reference counting is a good choice as > >> it provides easy to comprehend way to control object lifetime. > >> > >> This patch adds two new helper functions: pcidev_get() and > >> pcidev_put(). pcidev_get() will increase reference counter, while > >> pcidev_put() will decrease it, destroying object when counter reaches > >> zero. > >> > >> pcidev_get() should be used only when you already have a valid pointer > >> to the object or you are holding lock that protects one of the > >> lists (domain, pseg or ats) that store pci_dev structs. > >> > >> pcidev_get() is rarely used directly, because there already are > >> functions that will provide valid pointer to pci_dev struct: > >> pci_get_pdev(), pci_get_real_pdev(). They will lock appropriate list, > >> find needed object and increase its reference counter before returning > >> to the caller. > >> > >> Naturally, pci_put() should be called after finishing working with a > >> received object. 
This is the reason why this patch have so many > >> pcidev_put()s and so little pcidev_get()s: existing calls to > >> pci_get_*() functions now will increase reference counter > >> automatically, we just need to decrease it back when we finished. > > > > After looking a bit into this, I would like to ask whether it's been > > considered the need to increase the refcount for each use of a pdev. > > > > For example I would consider the initial alloc_pdev() to take a > > refcount, and then pci_remove_device() _must_ be the function that > > removes the last refcount, so that it can return -EBUSY otherwise (see > > my comment below). > > I thought I had replied to this, but couldn't find any record thereof; > apologies for a possible duplicate. > > In a get-/put-ref model, much like we have it for domheap pages, the > last put should trigger whatever is needed for "freeing" (here: > removing) the item. Therefore I think in this new model all > PHYSDEVOP_{pci_device_remove,manage_pci_remove} should cause is the > dropping of the ref that alloc_pdev() has put in place (plus some > marking of the device, so that another PHYSDEVOP_{pci_device_remove, > manage_pci_remove} can be properly ignored rather than dropping one > ref too many; this marking may then also prevent the obtaining of new > references, if such can be arranged for without breaking [cleanup] > functionality elsewhere). Whenever the last reference is put, that > would trigger the operations that pci_remove_device() presently > carries out. Right, this all seems sensible. > > Of course this would mean that if PHYSDEVOP_{pci_device_remove, > manage_pci_remove} didn't drop the last reference, it would need to > signal this to its caller, for it to be aware that the device is not > yet ready for (e.g.) hot-unplug. There'll then also need to be a way > for the caller to figure out when that situation has changed (which > might be via repeated invocations of the same hypercall sub-op, or > some new sub-op). 
Returning -EBUSY and expecting the caller to repeat the call would likely be the easier one to implement and likely fine for our purposes. There's a risk that the toolstack/kernel enters an infinite loop if there's a dangling extra ref somewhere, but that would be a bug anyway. So device creation would take a reference, and device assignation would take another one. Devices assigned are safe against removal, so there should be no need to take an extra reference in that case. There are however a number of cases that use pci_get_pdev(NULL, ...) for example, at which point we would need to take an extra reference on those cases if the device is not assigned to a domain? Or would we just keep those under pcidevs_locked regions as-is? (as PHYSDEVOP_{pci_device_remove, manage_pci_remove} will still take the pci_lock). Thanks, Roger.
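For illustration only (this is not code from the series), the model discussed in this message can be sketched as follows. It is a minimal sketch, assuming a hypothetical toy_pdev type with a plain, non-atomic counter: creation owns one reference, the removal request drops that reference exactly once and marks the device, and a caller that sees -EBUSY repeats the request. Refusing new references once removal is pending is an assumption; the thread only says this might be arranged.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct pci_dev; not the Xen structure. */
struct toy_pdev {
    unsigned int refcnt;    /* plain counter; real code would need atomics */
    bool present;           /* cleared by the release step */
    bool removal_pending;   /* set once the creation reference was dropped */
};

/* Device creation takes the initial reference. */
static void toy_init(struct toy_pdev *p)
{
    p->refcnt = 1;
    p->present = true;
    p->removal_pending = false;
}

/* Take a reference; refused once removal was requested (an assumption). */
static bool toy_get(struct toy_pdev *p)
{
    if ( !p->present || p->removal_pending )
        return false;
    p->refcnt++;
    return true;
}

/* Drop a reference; the last put performs the actual removal work. */
static void toy_put(struct toy_pdev *p)
{
    assert(p->refcnt > 0);
    if ( --p->refcnt == 0 )
        p->present = false; /* stands in for the real teardown */
}

/*
 * Removal request: the first call drops the creation reference and marks
 * the device; repeated calls only report whether the device is gone yet,
 * so they can never drop one reference too many.
 */
static int toy_remove(struct toy_pdev *p)
{
    if ( !p->removal_pending )
    {
        if ( !p->present )
            return -ENODEV;
        p->removal_pending = true;
        toy_put(p);
    }
    return p->present ? -EBUSY : 0;
}
```

A caller would loop on toy_remove() while it returns -EBUSY. As noted above, a dangling extra reference would make that loop spin forever, which is the bug scenario mentioned in this message.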
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-03-29 10:48 ` Roger Pau Monné @ 2023-03-29 11:58 ` Jan Beulich 0 siblings, 0 replies; 50+ messages in thread From: Jan Beulich @ 2023-03-29 11:58 UTC (permalink / raw) To: Roger Pau Monné Cc: Volodymyr Babchuk, xen-devel, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian On 29.03.2023 12:48, Roger Pau Monné wrote: > On Wed, Mar 29, 2023 at 11:55:26AM +0200, Jan Beulich wrote: >> On 16.03.2023 17:16, Roger Pau Monné wrote: >>> On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: >>>> Prior to this change, lifetime of pci_dev objects was protected by global >>>> pcidevs_lock(). Long-term plan is to remove this log, so we need some >>> ^ lock >>> >>> I wouldn't say remove, as one way or another we need a lock to protect >>> concurrent accesses. >>> >>>> other mechanism to ensure that those objects will not disappear under >>>> feet of code that access them. Reference counting is a good choice as >>>> it provides easy to comprehend way to control object lifetime. >>>> >>>> This patch adds two new helper functions: pcidev_get() and >>>> pcidev_put(). pcidev_get() will increase reference counter, while >>>> pcidev_put() will decrease it, destroying object when counter reaches >>>> zero. >>>> >>>> pcidev_get() should be used only when you already have a valid pointer >>>> to the object or you are holding lock that protects one of the >>>> lists (domain, pseg or ats) that store pci_dev structs. >>>> >>>> pcidev_get() is rarely used directly, because there already are >>>> functions that will provide valid pointer to pci_dev struct: >>>> pci_get_pdev(), pci_get_real_pdev(). They will lock appropriate list, >>>> find needed object and increase its reference counter before returning >>>> to the caller. >>>> >>>> Naturally, pci_put() should be called after finishing working with a >>>> received object. 
This is the reason why this patch have so many >>>> pcidev_put()s and so little pcidev_get()s: existing calls to >>>> pci_get_*() functions now will increase reference counter >>>> automatically, we just need to decrease it back when we finished. >>> >>> After looking a bit into this, I would like to ask whether it's been >>> considered the need to increase the refcount for each use of a pdev. >>> >>> For example I would consider the initial alloc_pdev() to take a >>> refcount, and then pci_remove_device() _must_ be the function that >>> removes the last refcount, so that it can return -EBUSY otherwise (see >>> my comment below). >> >> I thought I had replied to this, but couldn't find any record thereof; >> apologies for a possible duplicate. >> >> In a get-/put-ref model, much like we have it for domheap pages, the >> last put should trigger whatever is needed for "freeing" (here: >> removing) the item. Therefore I think in this new model all >> PHYSDEVOP_{pci_device_remove,manage_pci_remove} should cause is the >> dropping of the ref that alloc_pdev() has put in place (plus some >> marking of the device, so that another PHYSDEVOP_{pci_device_remove, >> manage_pci_remove} can be properly ignored rather than dropping one >> ref too many; this marking may then also prevent the obtaining of new >> references, if such can be arranged for without breaking [cleanup] >> functionality elsewhere). Whenever the last reference is put, that >> would trigger the operations that pci_remove_device() presently >> carries out. > > Right, this all seems sensible. > >> >> Of course this would mean that if PHYSDEVOP_{pci_device_remove, >> manage_pci_remove} didn't drop the last reference, it would need to >> signal this to its caller, for it to be aware that the device is not >> yet ready for (e.g.) hot-unplug. 
There'll then also need to be a way >> for the caller to figure out when that situation has changed (which >> might be via repeated invocations of the same hypercall sub-op, or >> some new sub-op). > > Returning -EBUSY and expecting the caller to repeat the call would > likely be the easier one to implement and likely fine for our > purposes. There's a risk that the toolstack/kernel enters an infinite > loop if there's a dangling extra ref somewhere, but that would be a > bug anyway. > > So device creation would take a reference, and device assignation would > take another one. Devices assigned are safe against removal, so there > should be no need to take an extra reference in that case. > > There are however a number of cases that use pci_get_pdev(NULL, ...) > for example, at which point we would need to take an extra reference > on those cases if the device is not assigned to a domain? I think in this case a ref should be acquired, and independent of whether the device is assigned anywhere (or else I expect this would end up cumbersome for callers, when they need to figure whether to drop a ref). > Or would we just keep those under pcidevs_locked regions as-is? This may be a short-term option, but longer term I think we want to fully move over (and get rid of the global lock altogether, if at all possible). Jan > (as PHYSDEVOP_{pci_device_remove, manage_pci_remove} will still take > the pci_lock). > > Thanks, Roger.
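To illustrate why an unconditional reference keeps callers simple, here is a minimal sketch. It assumes a hypothetical fixed table of toy devices and a non-atomic counter; toy_get_pdev() mirrors only the "(!d || owner == d)" matching of the lookup in the patch and is not the Xen function. Because the lookup takes a reference whether or not the device is assigned, every caller ends with the same unconditional put.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the Xen structures. */
struct toy_domain { int id; };

struct toy_pdev {
    unsigned int bdf;
    const struct toy_domain *owner; /* NULL when not assigned */
    unsigned int refcnt;            /* starts at 1: the creation reference */
};

#define TOY_NR_DEVS 2
static struct toy_domain toy_dom1 = { 1 };
static struct toy_pdev toy_devs[TOY_NR_DEVS] = {
    { 0x08, NULL,      1 },
    { 0x10, &toy_dom1, 1 },
};

/* Lookup always returns with a reference held, assigned or not. */
static struct toy_pdev *toy_get_pdev(const struct toy_domain *d,
                                     unsigned int bdf)
{
    for ( size_t i = 0; i < TOY_NR_DEVS; i++ )
    {
        struct toy_pdev *p = &toy_devs[i];

        if ( p->bdf == bdf && (!d || p->owner == d) )
        {
            p->refcnt++;
            return p;
        }
    }
    return NULL;
}

static void toy_put_pdev(struct toy_pdev *p)
{
    assert(p->refcnt > 0);
    p->refcnt--;
}

/* Caller pattern: one get, one put, no "was it assigned?" bookkeeping. */
static int toy_owner_id(const struct toy_domain *d, unsigned int bdf)
{
    struct toy_pdev *p = toy_get_pdev(d, bdf);
    int id;

    if ( !p )
        return -1;
    id = p->owner ? p->owner->id : 0;
    toy_put_pdev(p);
    return id;
}
```

If the lookup took a reference only for unassigned devices, toy_owner_id() would have to remember which case it hit before deciding whether to call toy_put_pdev().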
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-03-16 16:16 ` Roger Pau Monné 2023-03-29 9:55 ` Jan Beulich @ 2023-04-11 23:41 ` Volodymyr Babchuk 2023-04-12 9:13 ` Roger Pau Monné 1 sibling, 1 reply; 50+ messages in thread From: Volodymyr Babchuk @ 2023-04-11 23:41 UTC (permalink / raw) To: Roger Pau Monné Cc: xen-devel, Jan Beulich, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian Hi Roger, Roger Pau Monné <roger.pau@citrix.com> writes: > On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: >> Prior to this change, lifetime of pci_dev objects was protected by global >> pcidevs_lock(). Long-term plan is to remove this log, so we need some > ^ lock > > I wouldn't say remove, as one way or another we need a lock to protect > concurrent accesses. > I'll write "replace this global lock with couple of more granular locking devices" if this is okay for you. >> other mechanism to ensure that those objects will not disappear under >> feet of code that access them. Reference counting is a good choice as >> it provides easy to comprehend way to control object lifetime. >> >> This patch adds two new helper functions: pcidev_get() and >> pcidev_put(). pcidev_get() will increase reference counter, while >> pcidev_put() will decrease it, destroying object when counter reaches >> zero. >> >> pcidev_get() should be used only when you already have a valid pointer >> to the object or you are holding lock that protects one of the >> lists (domain, pseg or ats) that store pci_dev structs. >> >> pcidev_get() is rarely used directly, because there already are >> functions that will provide valid pointer to pci_dev struct: >> pci_get_pdev(), pci_get_real_pdev(). They will lock appropriate list, >> find needed object and increase its reference counter before returning >> to the caller. >> >> Naturally, pci_put() should be called after finishing working with a >> received object. 
This is the reason why this patch have so many >> pcidev_put()s and so little pcidev_get()s: existing calls to >> pci_get_*() functions now will increase reference counter >> automatically, we just need to decrease it back when we finished. > > After looking a bit into this, I would like to ask whether it's been > considered the need to increase the refcount for each use of a pdev. > This is how Linux uses reference counting. It decreases cognitive load and the chance of error, as there is a simple set of rules to follow. > For example I would consider the initial alloc_pdev() to take a > refcount, and then pci_remove_device() _must_ be the function that > removes the last refcount, so that it can return -EBUSY otherwise (see > my comment below). I tend to disagree there, as this ruins the very idea of reference counting. We can't know who else holds a reference right now. Okay, we might know, but this requires an additional lock to serialize accesses, which, in turn, makes the refcount unnecessary. > > I would also think that having the device assigned to a guest will take > another refcount, and then any usage from further callers (ie: like > vpci) will need some kind of protection from preventing the device > from being deassigned from a domain while vPCI handlers are running, > and the current refcount won't help with that. Yes, the idea of this refcounting is to ensure that a pdev object exists as a valid object in memory if we are holding a long-term pointer to it. Indeed, vPCI handlers should use some other mechanism to ensure that the pdev is not being re-assigned while handlers are running. I believe this is the task of vpci->lock. Should we call vpci_remove_device/vpci_add_handlers each time we re-assign a PCI device? > > That makes me wonder if for example callers of pci_get_pdev(d, sbdf) > do need to take an extra refcount, because such access is already > protected from the pdev going away by the fact that the device is > assigned to a guest.
But maybe it's too much work to separate users > of pci_get_pdev(d, ...); vs pci_get_pdev(NULL, ...);. > > There's also a window when the refcount is dropped to 0, and the > destruction function is called, but at the same time a concurrent > thread could attempt to take a reference to the pdev still? Last pcidev_put() would be called by pci_remove_device(), after removing it from all lists. This should prevent other threads from obtaining a valid reference to the pdev. > >> >> This patch removes "const" qualifier from some pdev pointers because >> pcidev_put() technically alters the contents of pci_dev structure. >> >> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> >> Suggested-by: Jan Beulich <jbeulich@suse.com> >> >> --- >> >> v3: >> - Moved in from another patch series >> - Fixed code formatting (tabs -> spaces) >> - Removed erroneous pcidev_put in vga.c >> - Added missing pcidev_put in couple of places >> - removed mention of pci_get_pdev_by_domain() >> --- >> xen/arch/x86/hvm/vmsi.c | 2 +- >> xen/arch/x86/irq.c | 4 + >> xen/arch/x86/msi.c | 44 +++++++- >> xen/arch/x86/pci.c | 3 + >> xen/arch/x86/physdev.c | 17 ++- >> xen/common/sysctl.c | 7 +- >> xen/drivers/passthrough/amd/iommu_init.c | 12 +- >> xen/drivers/passthrough/amd/iommu_map.c | 6 +- >> xen/drivers/passthrough/pci.c | 138 +++++++++++++++-------- >> xen/drivers/passthrough/vtd/quirks.c | 2 + >> xen/drivers/video/vga.c | 7 +- >> xen/drivers/vpci/vpci.c | 16 ++- >> xen/include/xen/pci.h | 18 +++ >> 13 files changed, 215 insertions(+), 61 deletions(-) >> >> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c >> index 3cd4923060..8c3d673872 100644 >> --- a/xen/arch/x86/hvm/vmsi.c >> +++ b/xen/arch/x86/hvm/vmsi.c >> @@ -914,7 +914,7 @@ int vpci_msix_arch_print(const struct vpci_msix *msix) >> >> spin_unlock(&msix->pdev->vpci->lock); >> process_pending_softirqs(); >> - /* NB: we assume that pdev cannot go away for an alive domain. 
*/ >> + >> if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) ) >> return -EBUSY; >> if ( pdev->vpci->msix != msix ) >> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c >> index 20150b1c7f..87464d82c8 100644 >> --- a/xen/arch/x86/irq.c >> +++ b/xen/arch/x86/irq.c >> @@ -2175,6 +2175,7 @@ int map_domain_pirq( >> msi->entry_nr = ret; >> ret = -ENFILE; >> } >> + pcidev_put(pdev); >> goto done; >> } >> >> @@ -2189,6 +2190,7 @@ int map_domain_pirq( >> msi_desc->irq = -1; >> msi_free_irq(msi_desc); >> ret = -EBUSY; >> + pcidev_put(pdev); >> goto done; >> } >> >> @@ -2273,10 +2275,12 @@ int map_domain_pirq( >> } >> msi_desc->irq = -1; >> msi_free_irq(msi_desc); >> + pcidev_put(pdev); >> goto done; >> } >> >> set_domain_irq_pirq(d, irq, info); >> + pcidev_put(pdev); >> spin_unlock_irqrestore(&desc->lock, flags); >> } >> else >> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c >> index d0bf63df1d..91926fce50 100644 >> --- a/xen/arch/x86/msi.c >> +++ b/xen/arch/x86/msi.c >> @@ -572,6 +572,10 @@ int msi_free_irq(struct msi_desc *entry) >> virt_to_fix((unsigned long)entry->mask_base)); >> >> list_del(&entry->list); >> + >> + /* Corresponds to pcidev_get() in msi[x]_capability_init() */ >> + pcidev_put(entry->dev); >> + >> xfree(entry); >> return 0; >> } >> @@ -644,6 +648,7 @@ static int msi_capability_init(struct pci_dev *dev, >> entry[i].msi.mpos = mpos; >> entry[i].msi.nvec = 0; >> entry[i].dev = dev; >> + pcidev_get(dev); >> } >> entry->msi.nvec = nvec; >> entry->irq = irq; >> @@ -703,22 +708,36 @@ static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf) >> !num_vf || !offset || (num_vf > 1 && !stride) || >> bir >= PCI_SRIOV_NUM_BARS || >> !pdev->vf_rlen[bir] ) >> + { >> + if ( pdev ) >> + pcidev_put(pdev); >> return 0; >> + } >> base = pos + PCI_SRIOV_BAR; >> vf -= PCI_BDF(bus, slot, func) + offset; >> if ( vf < 0 ) >> + { >> + pcidev_put(pdev); >> return 0; >> + } >> if ( stride ) >> { >> if ( vf % stride ) >> + { >> + pcidev_put(pdev); >> 
return 0; >> + } >> vf /= stride; >> } >> if ( vf >= num_vf ) >> + { >> + pcidev_put(pdev); >> return 0; >> + } >> BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS); >> disp = vf * pdev->vf_rlen[bir]; >> limit = PCI_SRIOV_NUM_BARS; >> + pcidev_put(pdev); >> } >> else switch ( pci_conf_read8(PCI_SBDF(seg, bus, slot, func), >> PCI_HEADER_TYPE) & 0x7f ) >> @@ -925,6 +944,8 @@ static int msix_capability_init(struct pci_dev *dev, >> entry->dev = dev; >> entry->mask_base = base; >> >> + pcidev_get(dev); >> + >> list_add_tail(&entry->list, &dev->msi_list); >> *desc = entry; >> } >> @@ -999,6 +1020,7 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc) >> { >> struct pci_dev *pdev; >> struct msi_desc *old_desc; >> + int ret; >> >> ASSERT(pcidevs_locked()); >> pdev = pci_get_pdev(NULL, msi->sbdf); >> @@ -1010,6 +1032,7 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc) >> { >> printk(XENLOG_ERR "irq %d already mapped to MSI on %pp\n", >> msi->irq, &pdev->sbdf); >> + pcidev_put(pdev); >> return -EEXIST; >> } >> >> @@ -1020,7 +1043,10 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc) >> __pci_disable_msix(old_desc); >> } >> >> - return msi_capability_init(pdev, msi->irq, desc, msi->entry_nr); >> + ret = msi_capability_init(pdev, msi->irq, desc, msi->entry_nr); >> + pcidev_put(pdev); >> + >> + return ret; >> } >> >> static void __pci_disable_msi(struct msi_desc *entry) >> @@ -1054,20 +1080,29 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc) >> { >> struct pci_dev *pdev; >> struct msi_desc *old_desc; >> + int ret; >> >> ASSERT(pcidevs_locked()); >> pdev = pci_get_pdev(NULL, msi->sbdf); >> if ( !pdev || !pdev->msix ) >> + { >> + if ( pdev ) >> + pcidev_put(pdev); >> return -ENODEV; >> + } >> >> if ( msi->entry_nr >= pdev->msix->nr_entries ) >> + { >> + pcidev_put(pdev); >> return -EINVAL; >> + } >> >> old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSIX); >> 
if ( old_desc ) >> { >> printk(XENLOG_ERR "irq %d already mapped to MSI-X on %pp\n", >> msi->irq, &pdev->sbdf); >> + pcidev_put(pdev); >> return -EEXIST; >> } >> >> @@ -1078,7 +1113,11 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc) >> __pci_disable_msi(old_desc); >> } >> >> - return msix_capability_init(pdev, msi, desc); >> + ret = msix_capability_init(pdev, msi, desc); >> + >> + pcidev_put(pdev); >> + >> + return ret; >> } >> >> static void _pci_cleanup_msix(struct arch_msix *msix) >> @@ -1159,6 +1198,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off) >> } >> else >> rc = msix_capability_init(pdev, NULL, NULL); >> + pcidev_put(pdev); >> pcidevs_unlock(); >> >> return rc; >> diff --git a/xen/arch/x86/pci.c b/xen/arch/x86/pci.c >> index 97b792e578..c1fcdf08d6 100644 >> --- a/xen/arch/x86/pci.c >> +++ b/xen/arch/x86/pci.c >> @@ -92,7 +92,10 @@ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf, >> >> pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bdf)); >> if ( pdev ) >> + { >> rc = pci_msi_conf_write_intercept(pdev, reg, size, data); >> + pcidev_put(pdev); >> + } >> >> pcidevs_unlock(); >> >> diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c >> index 2f1d955a96..96214a3d40 100644 >> --- a/xen/arch/x86/physdev.c >> +++ b/xen/arch/x86/physdev.c >> @@ -533,7 +533,14 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) >> pcidevs_lock(); >> pdev = pci_get_pdev(NULL, >> PCI_SBDF(0, restore_msi.bus, restore_msi.devfn)); >> - ret = pdev ? pci_restore_msi_state(pdev) : -ENODEV; >> + if ( pdev ) >> + { >> + ret = pci_restore_msi_state(pdev); >> + pcidev_put(pdev); >> + } >> + else >> + ret = -ENODEV; >> + >> pcidevs_unlock(); >> break; >> } >> @@ -548,7 +555,13 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) >> >> pcidevs_lock(); >> pdev = pci_get_pdev(NULL, PCI_SBDF(dev.seg, dev.bus, dev.devfn)); >> - ret = pdev ? 
pci_restore_msi_state(pdev) : -ENODEV; >> + if ( pdev ) >> + { >> + ret = pci_restore_msi_state(pdev); >> + pcidev_put(pdev); >> + } >> + else >> + ret = -ENODEV; >> pcidevs_unlock(); >> break; >> } >> diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c >> index 02505ab044..9af07fa92a 100644 >> --- a/xen/common/sysctl.c >> +++ b/xen/common/sysctl.c >> @@ -438,7 +438,7 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl) >> { >> physdev_pci_device_t dev; >> uint32_t node; >> - const struct pci_dev *pdev; >> + struct pci_dev *pdev; >> >> if ( copy_from_guest_offset(&dev, ti->devs, i, 1) ) >> { >> @@ -454,8 +454,11 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl) >> node = XEN_INVALID_NODE_ID; >> else >> node = pdev->node; >> - pcidevs_unlock(); >> >> + if ( pdev ) >> + pcidev_put(pdev); >> + >> + pcidevs_unlock(); >> if ( copy_to_guest_offset(ti->nodes, i, &node, 1) ) >> { >> ret = -EFAULT; >> diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c >> index 9773ccfcb4..f90b1c1e58 100644 >> --- a/xen/drivers/passthrough/amd/iommu_init.c >> +++ b/xen/drivers/passthrough/amd/iommu_init.c >> @@ -646,6 +646,7 @@ static void cf_check parse_ppr_log_entry(struct amd_iommu *iommu, u32 entry[]) >> >> if ( pdev ) >> guest_iommu_add_ppr_log(pdev->domain, entry); >> + pcidev_put(pdev); >> } >> >> static void iommu_check_ppr_log(struct amd_iommu *iommu) >> @@ -749,6 +750,11 @@ static bool_t __init set_iommu_interrupt_handler(struct amd_iommu *iommu) >> } >> >> pcidevs_lock(); >> + /* >> + * XXX: it is unclear if this device can be removed. Right now >> + * there is no code that clears msi.dev, so no one will decrease >> + * refcount on it. >> + */ >> iommu->msi.dev = pci_get_pdev(NULL, PCI_SBDF(iommu->seg, iommu->bdf)); > > I don't think we can remove an IOMMU from the system, so this is > fine as-is AFAICT. > Oh, thank you for the clarification. I'll remove the comment then. 
>> pcidevs_unlock(); >> if ( !iommu->msi.dev ) >> @@ -1274,7 +1280,7 @@ static int __init cf_check amd_iommu_setup_device_table( >> { >> if ( ivrs_mappings[bdf].valid ) >> { >> - const struct pci_dev *pdev = NULL; >> + struct pci_dev *pdev = NULL; >> >> /* add device table entry */ >> iommu_dte_add_device_entry(&dt[bdf], &ivrs_mappings[bdf]); >> @@ -1299,7 +1305,10 @@ static int __init cf_check amd_iommu_setup_device_table( >> pdev->msix ? pdev->msix->nr_entries >> : pdev->msi_maxvec); >> if ( !ivrs_mappings[bdf].intremap_table ) >> + { >> + pcidev_put(pdev); >> return -ENOMEM; >> + } >> >> if ( pdev->phantom_stride ) >> { >> @@ -1317,6 +1326,7 @@ static int __init cf_check amd_iommu_setup_device_table( >> ivrs_mappings[bdf].intremap_inuse; >> } >> } >> + pcidev_put(pdev); >> } >> >> amd_iommu_set_intremap_table( >> diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c >> index 993bac6f88..9d621e3d36 100644 >> --- a/xen/drivers/passthrough/amd/iommu_map.c >> +++ b/xen/drivers/passthrough/amd/iommu_map.c >> @@ -724,14 +724,18 @@ int cf_check amd_iommu_get_reserved_device_memory( >> if ( !iommu ) >> { >> /* May need to trigger the workaround in find_iommu_for_device(). */ >> - const struct pci_dev *pdev; >> + struct pci_dev *pdev; >> >> pcidevs_lock(); >> pdev = pci_get_pdev(NULL, sbdf); >> pcidevs_unlock(); >> >> if ( pdev ) >> + { >> iommu = find_iommu_for_device(seg, bdf); >> + /* XXX: Should we hold pdev reference till end of the loop? */ >> + pcidev_put(pdev); > > I don't think you need to hold a reference to the device until the end > of the loop, the data fetched there is from the ACPI tables. If the > func() helper also needs a pdev instance is it's task to get one. > Thank you for the clarification. I'll remove the comment. 
>> + } >> if ( !iommu ) >> continue; >> } >> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c >> index b42acb8d7c..b32382aca0 100644 >> --- a/xen/drivers/passthrough/pci.c >> +++ b/xen/drivers/passthrough/pci.c >> @@ -328,6 +328,7 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn) >> *((u8*) &pdev->bus) = bus; >> *((u8*) &pdev->devfn) = devfn; >> pdev->domain = NULL; >> + refcnt_init(&pdev->refcnt); >> >> arch_pci_init_pdev(pdev); >> >> @@ -422,33 +423,6 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn) >> return pdev; >> } >> >> -static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev) >> -{ >> - /* update bus2bridge */ >> - switch ( pdev->type ) >> - { >> - unsigned int sec_bus, sub_bus; >> - >> - case DEV_TYPE_PCIe2PCI_BRIDGE: >> - case DEV_TYPE_LEGACY_PCI_BRIDGE: >> - sec_bus = pci_conf_read8(pdev->sbdf, PCI_SECONDARY_BUS); >> - sub_bus = pci_conf_read8(pdev->sbdf, PCI_SUBORDINATE_BUS); >> - >> - spin_lock(&pseg->bus2bridge_lock); >> - for ( ; sec_bus <= sub_bus; sec_bus++ ) >> - pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus]; >> - spin_unlock(&pseg->bus2bridge_lock); >> - break; >> - >> - default: >> - break; >> - } >> - >> - list_del(&pdev->alldevs_list); >> - pdev_msi_deinit(pdev); >> - xfree(pdev); >> -} >> - >> static void __init _pci_hide_device(struct pci_dev *pdev) >> { >> if ( pdev->domain ) >> @@ -517,10 +491,14 @@ struct pci_dev *pci_get_real_pdev(pci_sbdf_t sbdf) >> { >> if ( !(sbdf.devfn & stride) ) >> continue; >> + > > Unrelated change? There are some of those in the patch, should be > removed. Yes, sorry for this. 
> >> sbdf.devfn &= ~stride; >> pdev = pci_get_pdev(NULL, sbdf); >> if ( pdev && stride != pdev->phantom_stride ) >> + { >> + pcidev_put(pdev); >> pdev = NULL; >> + } >> } >> >> return pdev; >> @@ -548,13 +526,18 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf) >> list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list ) >> if ( pdev->sbdf.bdf == sbdf.bdf && >> (!d || pdev->domain == d) ) >> + { >> + pcidev_get(pdev); >> return pdev; >> + } >> } >> else >> list_for_each_entry ( pdev, &d->pdev_list, domain_list ) >> if ( pdev->sbdf.bdf == sbdf.bdf ) >> + { >> + pcidev_get(pdev); >> return pdev; >> - >> + } >> return NULL; >> } >> >> @@ -663,7 +646,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn, >> PCI_SBDF(seg, info->physfn.bus, >> info->physfn.devfn)); >> if ( pdev ) >> + { >> pf_is_extfn = pdev->info.is_extfn; >> + pcidev_put(pdev); >> + } >> pcidevs_unlock(); >> if ( !pdev ) >> pci_add_device(seg, info->physfn.bus, info->physfn.devfn, >> @@ -818,7 +804,9 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn) >> if ( pdev->domain ) >> list_del(&pdev->domain_list); >> printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf); >> - free_pdev(pseg, pdev); >> + list_del(&pdev->alldevs_list); >> + pdev_msi_deinit(pdev); >> + pcidev_put(pdev); > > Hm, I think here we want to make sure that the device has been freed, > or else you would have to return -EBUSY to the calls to notify that > the device is still in use. Why? As far as I can see, the pdev object may still be accessed by some other CPU right now, so it will be freed only after the last reference is dropped. As it is already removed from all the lists, pci_dev_get() will not find it anymore. Actually, I can't see how this can happen in reality, as VPCI, MSI and IOMMU are already deactivated for this device. So, no one would touch it.
> > I think we need an extra pcidev_put_final() or similar that can be > used in pci_remove_device() to assert that the device has been > actually removed. Will something break if we don't do this? I can't see how this can happen. > >> break; >> } >> >> @@ -848,7 +836,7 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus, >> { >> ret = iommu_quarantine_dev_init(pci_to_dev(pdev)); >> if ( ret ) >> - return ret; >> + goto out; >> >> target = dom_io; >> } >> @@ -878,6 +866,7 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus, >> pdev->fault.count = 0; >> >> out: >> + pcidev_put(pdev); >> if ( ret ) >> printk(XENLOG_G_ERR "%pd: deassign (%pp) failed (%d)\n", >> d, &PCI_SBDF(seg, bus, devfn), ret); >> @@ -1011,7 +1000,10 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn) >> pdev->fault.count >>= 1; >> pdev->fault.time = now; >> if ( ++pdev->fault.count < PT_FAULT_THRESHOLD ) >> + { >> + pcidev_put(pdev); >> pdev = NULL; >> + } >> } >> pcidevs_unlock(); >> >> @@ -1022,6 +1014,8 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn) >> * control it for us. 
*/ >> cword = pci_conf_read16(pdev->sbdf, PCI_COMMAND); >> pci_conf_write16(pdev->sbdf, PCI_COMMAND, cword & ~PCI_COMMAND_MASTER); >> + >> + pcidev_put(pdev); >> } >> >> /* >> @@ -1138,6 +1132,7 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices( >> printk(XENLOG_WARNING "Dom%d owning %pp?\n", >> pdev->domain->domain_id, &pdev->sbdf); >> >> + pcidev_put(pdev); >> if ( iommu_verbose ) >> { >> pcidevs_unlock(); >> @@ -1385,33 +1380,28 @@ static int iommu_remove_device(struct pci_dev *pdev) >> return iommu_call(hd->platform_ops, remove_device, devfn, pci_to_dev(pdev)); >> } >> >> -static int device_assigned(u16 seg, u8 bus, u8 devfn) >> +static int device_assigned(struct pci_dev *pdev) >> { >> - struct pci_dev *pdev; >> int rc = 0; >> >> ASSERT(pcidevs_locked()); >> - pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); >> - >> - if ( !pdev ) >> - rc = -ENODEV; >> /* >> * If the device exists and it is not owned by either the hardware >> * domain or dom_io then it must be assigned to a guest, or be >> * hidden (owned by dom_xen). >> */ >> - else if ( pdev->domain != hardware_domain && >> - pdev->domain != dom_io ) >> + if ( pdev->domain != hardware_domain && >> + pdev->domain != dom_io ) >> rc = -EBUSY; >> >> return rc; >> } >> >> /* Caller should hold the pcidevs_lock */ > > I would assume the caller has taken an extra reference to the pdev, so > holding the pcidevs_lock is no longer needed? I assumed that the lock may be required by the MSI-X or IOMMU functions that are called here. For example, I can see that reassign_device() in pci_amd_iommu.c manipulates some lists. I believe this should be protected with the lock.
> >> -static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) >> +static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag) >> { >> const struct domain_iommu *hd = dom_iommu(d); >> - struct pci_dev *pdev; >> + uint8_t devfn; >> int rc = 0; >> >> if ( !is_iommu_enabled(d) ) >> @@ -1422,10 +1412,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) >> >> /* device_assigned() should already have cleared the device for assignment */ >> ASSERT(pcidevs_locked()); >> - pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); >> ASSERT(pdev && (pdev->domain == hardware_domain || >> pdev->domain == dom_io)); >> >> + devfn = pdev->devfn; >> + >> /* Do not allow broken devices to be assigned to guests. */ >> rc = -EBADF; >> if ( pdev->broken && d != hardware_domain && d != dom_io ) >> @@ -1460,7 +1451,7 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) >> done: >> if ( rc ) >> printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n", >> - d, &PCI_SBDF(seg, bus, devfn), rc); >> + d, &PCI_SBDF(pdev->seg, pdev->bus, devfn), rc); >> /* The device is assigned to dom_io so mark it as quarantined */ >> else if ( d == dom_io ) >> pdev->quarantine = true; >> @@ -1595,6 +1586,9 @@ int iommu_do_pci_domctl( >> ASSERT(d); >> /* fall through */ >> case XEN_DOMCTL_test_assign_device: >> + { >> + struct pci_dev *pdev; >> + >> /* Don't support self-assignment of devices. 
*/ >> if ( d == current->domain ) >> { >> @@ -1622,26 +1616,36 @@ int iommu_do_pci_domctl( >> seg = machine_sbdf >> 16; >> bus = PCI_BUS(machine_sbdf); >> devfn = PCI_DEVFN(machine_sbdf); >> - >> pcidevs_lock(); >> - ret = device_assigned(seg, bus, devfn); >> + pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); >> + if ( !pdev ) >> + { >> + printk(XENLOG_G_INFO "%pp non-existent\n", >> + &PCI_SBDF(seg, bus, devfn)); >> + ret = -EINVAL; >> + break; >> + } >> + >> + ret = device_assigned(pdev); >> if ( domctl->cmd == XEN_DOMCTL_test_assign_device ) >> { >> if ( ret ) >> { >> - printk(XENLOG_G_INFO "%pp already assigned, or non-existent\n", >> + printk(XENLOG_G_INFO "%pp already assigned\n", >> &PCI_SBDF(seg, bus, devfn)); >> ret = -EINVAL; >> } >> } >> else if ( !ret ) >> - ret = assign_device(d, seg, bus, devfn, flags); >> + ret = assign_device(d, pdev, flags); >> + >> + pcidev_put(pdev); > > I would think you need to keep the refcount here if ret == 0, so that > the device cannot be removed while assigned to a domain? Looks like we perceive the function of refcnt in different ways. For me, this is the mechanism to guarantee that if we have a valid pointer to an object, this object will not disappear from under our feet. This is the main function of krefs in the Linux kernel: if your code holds a reference to an object, you can be sure that this object exists in memory. On the other hand, it seems that you are considering this refcnt as a usage counter for an actual PCI device, not for the "struct pdev" that represents it. Those are two related things, but not the same. So, I can see why you are suggesting taking an additional reference there. But for me, this looks unnecessary: the very first refcount is obtained in pci_add_device() and there is the corresponding function pci_remove_device() that will drop this refcount. So, for me, if an admin wants to remove a PCI device which is assigned to a domain, they can do this just as they were able to prior to these patches.
The main value of introducing refcnt is to be able to access pdev objects without holding the global pcidevs_lock(). This does not mean that you don't need locking at all. But this allows you to use pdev->lock (which does not exist in this series, but was introduced in an earlier RFC), or vpci->lock, or any other subsystem->lock. > >> pcidevs_unlock(); >> if ( ret == -ERESTART ) >> ret = hypercall_create_continuation(__HYPERVISOR_domctl, >> "h", u_domctl); >> break; >> - >> + } >> case XEN_DOMCTL_deassign_device: >> /* Don't support self-deassignment of devices. */ >> if ( d == current->domain ) >> @@ -1681,6 +1685,46 @@ int iommu_do_pci_domctl( >> return ret; >> } >> >> +static void release_pdev(refcnt_t *refcnt) >> +{ >> + struct pci_dev *pdev = container_of(refcnt, struct pci_dev, refcnt); >> + struct pci_seg *pseg = get_pseg(pdev->seg); >> + >> + printk(XENLOG_DEBUG "PCI release device %pp\n", &pdev->sbdf); >> + >> + /* update bus2bridge */ >> + switch ( pdev->type ) >> + { >> + unsigned int sec_bus, sub_bus; >> + >> + case DEV_TYPE_PCIe2PCI_BRIDGE: >> + case DEV_TYPE_LEGACY_PCI_BRIDGE: >> + sec_bus = pci_conf_read8(pdev->sbdf, PCI_SECONDARY_BUS); >> + sub_bus = pci_conf_read8(pdev->sbdf, PCI_SUBORDINATE_BUS); >> + >> + spin_lock(&pseg->bus2bridge_lock); >> + for ( ; sec_bus <= sub_bus; sec_bus++ ) >> + pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus]; >> + spin_unlock(&pseg->bus2bridge_lock); >> + break; >> + >> + default: >> + break; >> + } >> + >> + xfree(pdev); >> +} >> + >> +void pcidev_get(struct pci_dev *pdev) >> +{ >> + refcnt_get(&pdev->refcnt); >> +} >> + >> +void pcidev_put(struct pci_dev *pdev) >> +{ >> + refcnt_put(&pdev->refcnt, release_pdev); >> +} >> + >> /* >> * Local variables: >> * mode: C >> diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c >> index fcc8f73e8b..d240da0416 100644 >> --- a/xen/drivers/passthrough/vtd/quirks.c >> +++ b/xen/drivers/passthrough/vtd/quirks.c >> @@ -429,6 +429,8 @@
static int __must_check map_me_phantom_function(struct domain *domain, >> rc = domain_context_unmap_one(domain, drhd->iommu, 0, >> PCI_DEVFN(dev, 7)); >> >> + pcidev_put(pdev); >> + >> return rc; >> } >> >> diff --git a/xen/drivers/video/vga.c b/xen/drivers/video/vga.c >> index 0a03508bee..1049d4da6d 100644 >> --- a/xen/drivers/video/vga.c >> +++ b/xen/drivers/video/vga.c >> @@ -114,7 +114,7 @@ void __init video_endboot(void) >> for ( bus = 0; bus < 256; ++bus ) >> for ( devfn = 0; devfn < 256; ++devfn ) >> { >> - const struct pci_dev *pdev; >> + struct pci_dev *pdev; >> u8 b = bus, df = devfn, sb; >> >> pcidevs_lock(); >> @@ -126,7 +126,11 @@ void __init video_endboot(void) >> PCI_CLASS_DEVICE) != 0x0300 || >> !(pci_conf_read16(PCI_SBDF(0, bus, devfn), PCI_COMMAND) & >> (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) ) >> + { >> + if ( pdev ) >> + pcidev_put(pdev); >> continue; >> + } >> >> while ( b ) >> { >> @@ -157,6 +161,7 @@ void __init video_endboot(void) >> bus, PCI_SLOT(devfn), PCI_FUNC(devfn)); >> pci_hide_device(0, bus, devfn); >> } >> + pcidev_put(pdev); >> } >> } >> >> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c >> index 6d48d496bb..5232f9605b 100644 >> --- a/xen/drivers/vpci/vpci.c >> +++ b/xen/drivers/vpci/vpci.c >> @@ -317,8 +317,8 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size, >> >> uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) >> { >> - const struct domain *d = current->domain; >> - const struct pci_dev *pdev; >> + struct domain *d = current->domain; > > Why do you need to drop the const on domain here? > Looks like leftover from a previous version. Will remove. >> + struct pci_dev *pdev; >> const struct vpci_register *r; >> unsigned int data_offset = 0; >> uint32_t data = ~(uint32_t)0; >> @@ -332,7 +332,11 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) >> /* Find the PCI dev matching the address. 
*/ >> pdev = pci_get_pdev(d, sbdf); >> if ( !pdev || !pdev->vpci ) >> + { >> + if ( pdev ) >> + pcidev_put(pdev); >> return vpci_read_hw(sbdf, reg, size); >> + } >> >> spin_lock(&pdev->vpci->lock); >> >> @@ -378,6 +382,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) >> ASSERT(data_offset < size); >> } >> spin_unlock(&pdev->vpci->lock); >> + pcidev_put(pdev); >> >> if ( data_offset < size ) >> { >> @@ -420,8 +425,8 @@ static void vpci_write_helper(const struct pci_dev *pdev, >> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, >> uint32_t data) >> { >> - const struct domain *d = current->domain; >> - const struct pci_dev *pdev; >> + struct domain *d = current->domain; >> + struct pci_dev *pdev; -- WBR, Volodymyr
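For readers following the thread, the kref-style lifetime guarantee described in the reply above can be sketched in plain C. This is a minimal illustration only: it uses C11 atomics rather than Xen's primitives, and the names `obj_refcnt`, `toy_dev`, `toy_dev_alloc()`, `toy_dev_get()` and `toy_dev_put()` are hypothetical. It is not the content of the xen/include/xen/refcnt.h added by this series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical counter type; NOT the refcnt_t from the patch series. */
struct obj_refcnt {
    atomic_int count;
};

/* Hypothetical stand-in for struct pci_dev. */
struct toy_dev {
    int id;
    struct obj_refcnt ref;
};

/* Number of objects released so far; only used to observe the sketch. */
static int released;

/* Called when the last reference is dropped. */
static void toy_dev_release(struct obj_refcnt *ref)
{
    /* Recover the enclosing object from the embedded counter. */
    struct toy_dev *dev =
        (struct toy_dev *)((char *)ref - offsetof(struct toy_dev, ref));

    released++;
    free(dev);
}

/* The creator owns the first reference, as pci_add_device() does in the thread. */
static struct toy_dev *toy_dev_alloc(int id)
{
    struct toy_dev *dev = malloc(sizeof(*dev));

    if ( !dev )
        return NULL;
    dev->id = id;
    atomic_init(&dev->ref.count, 1);
    return dev;
}

/* Only legal while the caller already holds a valid pointer to the object. */
static void toy_dev_get(struct toy_dev *dev)
{
    int old = atomic_fetch_add(&dev->ref.count, 1);

    assert(old > 0);
    (void)old;
}

/* Drops one reference; the object is freed when the count reaches zero. */
static void toy_dev_put(struct toy_dev *dev)
{
    if ( atomic_fetch_sub(&dev->ref.count, 1) == 1 )
        toy_dev_release(&dev->ref);
}
```

The point of the sketch is the one made in the reply: holding a reference only guarantees that the memory stays valid. It says nothing about whether the device is still assigned or in a usable state.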
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-04-11 23:41 ` Volodymyr Babchuk @ 2023-04-12 9:13 ` Roger Pau Monné 2023-04-12 21:54 ` Volodymyr Babchuk 0 siblings, 1 reply; 50+ messages in thread From: Roger Pau Monné @ 2023-04-12 9:13 UTC (permalink / raw) To: Volodymyr Babchuk Cc: xen-devel, Jan Beulich, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian On Tue, Apr 11, 2023 at 11:41:04PM +0000, Volodymyr Babchuk wrote: > > Hi Roger, > > Roger Pau Monné <roger.pau@citrix.com> writes: > > > On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: > >> Prior to this change, lifetime of pci_dev objects was protected by global > >> pcidevs_lock(). Long-term plan is to remove this log, so we need some > > ^ lock > > > > I wouldn't say remove, as one way or another we need a lock to protect > > concurrent accesses. > > > > I'll write "replace this global lock with couple of more granular > locking devices" > if this is okay for you. > > >> other mechanism to ensure that those objects will not disappear under > >> feet of code that access them. Reference counting is a good choice as > >> it provides easy to comprehend way to control object lifetime. > >> > >> This patch adds two new helper functions: pcidev_get() and > >> pcidev_put(). pcidev_get() will increase reference counter, while > >> pcidev_put() will decrease it, destroying object when counter reaches > >> zero. > >> > >> pcidev_get() should be used only when you already have a valid pointer > >> to the object or you are holding lock that protects one of the > >> lists (domain, pseg or ats) that store pci_dev structs. > >> > >> pcidev_get() is rarely used directly, because there already are > >> functions that will provide valid pointer to pci_dev struct: > >> pci_get_pdev(), pci_get_real_pdev(). They will lock appropriate list, > >> find needed object and increase its reference counter before returning > >> to the caller. 
> >> > >> Naturally, pci_put() should be called after finishing working with a > >> received object. This is the reason why this patch have so many > >> pcidev_put()s and so little pcidev_get()s: existing calls to > >> pci_get_*() functions now will increase reference counter > >> automatically, we just need to decrease it back when we finished. > > > > After looking a bit into this, I would like to ask whether it's been > > considered the need to increase the refcount for each use of a pdev. > > > > This is how Linux uses reference locking. It decreases cognitive load > and chance for an error, as there is a simple set of rules, which you > follow. > > > For example I would consider the initial alloc_pdev() to take a > > refcount, and then pci_remove_device() _must_ be the function that > > removes the last refcount, so that it can return -EBUSY otherwise (see > > my comment below). > > I tend to disagree there, as this ruins the very idea of reference > counting. We can't know who else holds reference right now. Okay, we > might know, but this requires additional lock to serialize > accesses. Which, in turn, makes refcount un-needed. In principle pci_remove_device() must report whether the device is ready to be physically removed from the system, so it must return -EBUSY if there are still users accessing the device. A user would use PHYSDEVOP_manage_pci_remove to signal Xen it's trying to physically remove a PCI device from a system, so we must ensure that when the hypervisor returns success the device is ready to be physically removed. Or at least that's my understanding of how this should work. > > > > I would also think that having the device assigned to a guest will take > > another refcount, and then any usage from further callers (ie: like > > vpci) will need some kind of protection from preventing the device > > from being deassigned from a domain while vPCI handlers are running, > > and the current refcount won't help with that. 
> > Yes, idea of this refcounting is to ensure that a pdev object exists as an > valid object in memory if we are holding a long-term pointer to > it. Indeed, vPCI handlers should use some other mechanism to ensure that > pdev is not being re-assigned while handlers are running. I believe, > this is the task of vpci->lock. Should we call > vpci_remove_device/vpci_add_handlers each time we re-assign a PCI device? Yes, I think this was also part of a comment I've made on a different patch. The device state needs to be cleared when assigned to a different guest (as the hardware domain will also perform a device reset). I think there are some points that need to be part of the commit message so the code can be properly evaluated: - The reference counting is only used to ensure the object cannot be removed while in use. Users of the pci device object should implement whatever protections are required in order to get mutual exclusion between them and device state changes. > > > > That makes me wonder if for example callers of pci_get_pdev(d, sbdf) > > do need to take an extra refcount, because such access is already > > protected from the pdev going away by the fact that the device is > > assigned to a guest. But maybe it's too much work to separate users > > of pci_get_pdev(d, ...); vs pci_get_pdev(NULL, ...);. > > > > There's also a window when the refcount is dropped to 0, and the > > destruction function is called, but at the same time a concurrent > > thread could attempt to take a reference to the pdev still? > > Last pcidev_put() would be called by pci_remove_device(), after removing > it from all lists. This should prevent other threads from obtaining a valid > reference to the pdev. What if a concurrent user has taken a reference to the object before pci_remove_device() has removed the device from the lists, and still holds it when pci_remove_device() performs the supposedly last pcidev_put() call? 
> > > >> sbdf.devfn &= ~stride; > >> pdev = pci_get_pdev(NULL, sbdf); > >> if ( pdev && stride != pdev->phantom_stride ) > >> + { > >> + pcidev_put(pdev); > >> pdev = NULL; > >> + } > >> } > >> > >> return pdev; > >> @@ -548,13 +526,18 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf) > >> list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list ) > >> if ( pdev->sbdf.bdf == sbdf.bdf && > >> (!d || pdev->domain == d) ) > >> + { > >> + pcidev_get(pdev); > >> return pdev; > >> + } > >> } > >> else > >> list_for_each_entry ( pdev, &d->pdev_list, domain_list ) > >> if ( pdev->sbdf.bdf == sbdf.bdf ) > >> + { > >> + pcidev_get(pdev); > >> return pdev; > >> - > >> + } > >> return NULL; > >> } > >> > >> @@ -663,7 +646,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn, > >> PCI_SBDF(seg, info->physfn.bus, > >> info->physfn.devfn)); > >> if ( pdev ) > >> + { > >> pf_is_extfn = pdev->info.is_extfn; > >> + pcidev_put(pdev); > >> + } > >> pcidevs_unlock(); > >> if ( !pdev ) > >> pci_add_device(seg, info->physfn.bus, info->physfn.devfn, > >> @@ -818,7 +804,9 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn) > >> if ( pdev->domain ) > >> list_del(&pdev->domain_list); > >> printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf); > >> - free_pdev(pseg, pdev); > >> + list_del(&pdev->alldevs_list); > >> + pdev_msi_deinit(pdev); > >> + pcidev_put(pdev); > > > > Hm, I think here we want to make sure that the device has been freed, > > or else you would have to return -EBUSY to the calls to notify that > > the device is still in use. > > Why? As I can see, pdev object is still may potentially be accessed by > some other CPU right now. So pdev object will be freed after last > reference is dropped. As it is already removed from all the lists, > pci_dev_get() will not find it anymore. > > Actually, I can't see how this can happen in reality, as VPCI, MSI and > IOMMU are already deactivated for this device. So, no one would touch it. 
Wouldn't it be possible for a concurrent user to hold a reference from before the device has been 'deactivated'? > > > > I think we need an extra pcidev_put_final() or similar that can be > > used in pci_remove_device() to assert that the device has been > > actually removed. > > Will something break if we don't do this? I can't see how this can > happen. As mentioned above, once pci_remove_device() returns 0 the admin should be capable of physically removing the device from the system. > > > >> break; > >> } > >> > >> @@ -848,7 +836,7 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus, > >> { > >> ret = iommu_quarantine_dev_init(pci_to_dev(pdev)); > >> if ( ret ) > >> - return ret; > >> + goto out; > >> > >> target = dom_io; > >> } > >> @@ -878,6 +866,7 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus, > >> pdev->fault.count = 0; > >> > >> out: > >> + pcidev_put(pdev); > >> if ( ret ) > >> printk(XENLOG_G_ERR "%pd: deassign (%pp) failed (%d)\n", > >> d, &PCI_SBDF(seg, bus, devfn), ret); > >> @@ -1011,7 +1000,10 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn) > >> pdev->fault.count >>= 1; > >> pdev->fault.time = now; > >> if ( ++pdev->fault.count < PT_FAULT_THRESHOLD ) > >> + { > >> + pcidev_put(pdev); > >> pdev = NULL; > >> + } > >> } > >> pcidevs_unlock(); > >> > >> @@ -1022,6 +1014,8 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn) > >> * control it for us. 
*/ > >> cword = pci_conf_read16(pdev->sbdf, PCI_COMMAND); > >> pci_conf_write16(pdev->sbdf, PCI_COMMAND, cword & ~PCI_COMMAND_MASTER); > >> + > >> + pcidev_put(pdev); > >> } > >> > >> /* > >> @@ -1138,6 +1132,7 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices( > >> printk(XENLOG_WARNING "Dom%d owning %pp?\n", > >> pdev->domain->domain_id, &pdev->sbdf); > >> > >> + pcidev_put(pdev); > >> if ( iommu_verbose ) > >> { > >> pcidevs_unlock(); > >> @@ -1385,33 +1380,28 @@ static int iommu_remove_device(struct pci_dev *pdev) > >> return iommu_call(hd->platform_ops, remove_device, devfn, pci_to_dev(pdev)); > >> } > >> > >> -static int device_assigned(u16 seg, u8 bus, u8 devfn) > >> +static int device_assigned(struct pci_dev *pdev) > >> { > >> - struct pci_dev *pdev; > >> int rc = 0; > >> > >> ASSERT(pcidevs_locked()); > >> - pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); > >> - > >> - if ( !pdev ) > >> - rc = -ENODEV; > >> /* > >> * If the device exists and it is not owned by either the hardware > >> * domain or dom_io then it must be assigned to a guest, or be > >> * hidden (owned by dom_xen). > >> */ > >> - else if ( pdev->domain != hardware_domain && > >> - pdev->domain != dom_io ) > >> + if ( pdev->domain != hardware_domain && > >> + pdev->domain != dom_io ) > >> rc = -EBUSY; > >> > >> return rc; > >> } > >> > >> /* Caller should hold the pcidevs_lock */ > > > > I would assume the caller has taken an extra reference to the pdev, so > > holding the pcidevs_lock is no longer needed? > > I am assumed that lock may be required by MSIX or IOMMU functions, that > are being called here. For example, I can see that reassign_device() in > pci_amd_iommu.c manipulates with some lists. I believe, it should be > protected with the lock. OK, so that's pcidevs_lock being used to protect something else that's not strictly a pci device, but a related structure. 
> > > >> -static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) > >> +static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag) > >> { > >> const struct domain_iommu *hd = dom_iommu(d); > >> - struct pci_dev *pdev; > >> + uint8_t devfn; > >> int rc = 0; > >> > >> if ( !is_iommu_enabled(d) ) > >> @@ -1422,10 +1412,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) > >> > >> /* device_assigned() should already have cleared the device for assignment */ > >> ASSERT(pcidevs_locked()); > >> - pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); > >> ASSERT(pdev && (pdev->domain == hardware_domain || > >> pdev->domain == dom_io)); > >> > >> + devfn = pdev->devfn; > >> + > >> /* Do not allow broken devices to be assigned to guests. */ > >> rc = -EBADF; > >> if ( pdev->broken && d != hardware_domain && d != dom_io ) > >> @@ -1460,7 +1451,7 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) > >> done: > >> if ( rc ) > >> printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n", > >> - d, &PCI_SBDF(seg, bus, devfn), rc); > >> + d, &PCI_SBDF(pdev->seg, pdev->bus, devfn), rc); > >> /* The device is assigned to dom_io so mark it as quarantined */ > >> else if ( d == dom_io ) > >> pdev->quarantine = true; > >> @@ -1595,6 +1586,9 @@ int iommu_do_pci_domctl( > >> ASSERT(d); > >> /* fall through */ > >> case XEN_DOMCTL_test_assign_device: > >> + { > >> + struct pci_dev *pdev; > >> + > >> /* Don't support self-assignment of devices. 
*/ > >> if ( d == current->domain ) > >> { > >> @@ -1622,26 +1616,36 @@ int iommu_do_pci_domctl( > >> seg = machine_sbdf >> 16; > >> bus = PCI_BUS(machine_sbdf); > >> devfn = PCI_DEVFN(machine_sbdf); > >> - > >> pcidevs_lock(); > >> - ret = device_assigned(seg, bus, devfn); > >> + pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); > >> + if ( !pdev ) > >> + { > >> + printk(XENLOG_G_INFO "%pp non-existent\n", > >> + &PCI_SBDF(seg, bus, devfn)); > >> + ret = -EINVAL; > >> + break; > >> + } > >> + > >> + ret = device_assigned(pdev); > >> if ( domctl->cmd == XEN_DOMCTL_test_assign_device ) > >> { > >> if ( ret ) > >> { > >> - printk(XENLOG_G_INFO "%pp already assigned, or non-existent\n", > >> + printk(XENLOG_G_INFO "%pp already assigned\n", > >> &PCI_SBDF(seg, bus, devfn)); > >> ret = -EINVAL; > >> } > >> } > >> else if ( !ret ) > >> - ret = assign_device(d, seg, bus, devfn, flags); > >> + ret = assign_device(d, pdev, flags); > >> + > >> + pcidev_put(pdev); > > > > I would think you need to keep the refcount here if ret == 0, so that > > the device cannot be removed while assigned to a domain? > > Looks like we are perceiving function of refcnt in a different > ways. For me, this is the mechanism to guarantee that if we have a valid > pointer to an object, this object will not disappear under our > feet. This is the main function of krefs in the linux kernel: if your > code holds a reference to an object, you can be sure that this object is > exists in memory. > > On other hand, it seems that you are considering this refcnt as an usage > counter for an actual PCI device, not "struct pdev" that represent > it. Those are two related things, but not the same. So, I can see why > you are suggesting to get additional reference there. But for me, this > looks unnecessary: the very first refcount is obtained in > pci_add_device() and there is the corresponding function > pci_remove_device() that will drop this refcount. 
So, for me, if admin > wants to remove a PCI device which is assigned to a domain, they can do > this as they were able to do this prior this patches. This is all fine, but needs to be stated in the commit message. > The main value of introducing refcnt is to be able to access pdev objects > without holding the global pcidevs_lock(). This does not mean that you > don't need locking at all. But this allows you to use pdev->lock (which > does not exists in this series, but was introduced in a RFC earlier), or > vpci->lock, or any other subsystem->lock. I guess I was missing this other bit about introducing a per-device lock. Would it be possible to bundle all this together into a single patch series? It would be good to place this change together with any other locking-related change that you have pending. Thanks, Roger.
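The pcidev_put_final() idea raised in this reply can be sketched as follows. This is only an illustration of one possible semantics, assuming the remover drops its reference only when it is the last one and reports -EBUSY otherwise. The names `demo_dev`, `demo_dev_get()`, `demo_dev_put()` and `demo_dev_put_final()` are hypothetical, and C11 atomics stand in for Xen's primitives.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct pci_dev. */
struct demo_dev {
    atomic_int refcnt;
    bool freed;          /* stands in for the actual xfree() of the object */
};

/* The creator holds the initial reference. */
static void demo_dev_init(struct demo_dev *dev)
{
    atomic_init(&dev->refcnt, 1);
    dev->freed = false;
}

static void demo_dev_get(struct demo_dev *dev)
{
    int old = atomic_fetch_add(&dev->refcnt, 1);

    assert(old > 0);
    (void)old;
}

/* Ordinary put: whoever drops the last reference releases the object. */
static void demo_dev_put(struct demo_dev *dev)
{
    if ( atomic_fetch_sub(&dev->refcnt, 1) == 1 )
        dev->freed = true;
}

/*
 * "Final" put for the removal path: succeeds only if the caller holds the
 * last reference. The count moves from 1 to 0 in a single atomic step, so
 * a concurrent holder makes the call fail instead of deferring the free.
 */
static int demo_dev_put_final(struct demo_dev *dev)
{
    int expected = 1;

    if ( !atomic_compare_exchange_strong(&dev->refcnt, &expected, 0) )
        return -EBUSY;

    dev->freed = true;
    return 0;
}
```

With this shape the removal path could report -EBUSY while another user still holds a reference. It relies on the device already being unlinked from every list, so that no new reference can be taken once the count has reached zero.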
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-04-12 9:13 ` Roger Pau Monné @ 2023-04-12 21:54 ` Volodymyr Babchuk 2023-04-13 15:00 ` Roger Pau Monné 0 siblings, 1 reply; 50+ messages in thread From: Volodymyr Babchuk @ 2023-04-12 21:54 UTC (permalink / raw) To: Roger Pau Monné Cc: xen-devel, Jan Beulich, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian Hi Roger, First of all, I want to provide a link [1] to the RFC series where I attempted a complete rework of PCI locking. After discussing it with Jan, it became clear to me that the task is much harder than I anticipated, so it was decided to move in smaller steps. The first step is to make the vPCI code independent of the global PCI lock. Actually, this is not the first attempt: Oleksandr Andrushchenko tried to use an r/w lock for this [2]. But Jan suggested using refcounting instead of r/w locks, and I liked the idea. This is why you are seeing this patch series. Roger Pau Monné <roger.pau@citrix.com> writes: > On Tue, Apr 11, 2023 at 11:41:04PM +0000, Volodymyr Babchuk wrote: >> >> Hi Roger, >> >> Roger Pau Monné <roger.pau@citrix.com> writes: >> >> > On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: >> >> Prior to this change, lifetime of pci_dev objects was protected by global >> >> pcidevs_lock(). Long-term plan is to remove this log, so we need some >> > ^ lock >> > >> > I wouldn't say remove, as one way or another we need a lock to protect >> > concurrent accesses. >> > >> >> I'll write "replace this global lock with couple of more granular >> locking devices" >> if this is okay for you. >> >> >> other mechanism to ensure that those objects will not disappear under >> >> feet of code that access them. Reference counting is a good choice as >> >> it provides easy to comprehend way to control object lifetime. >> >> >> >> This patch adds two new helper functions: pcidev_get() and >> >> pcidev_put(). 
pcidev_get() will increase reference counter, while >> >> pcidev_put() will decrease it, destroying object when counter reaches >> >> zero. >> >> >> >> pcidev_get() should be used only when you already have a valid pointer >> >> to the object or you are holding lock that protects one of the >> >> lists (domain, pseg or ats) that store pci_dev structs. >> >> >> >> pcidev_get() is rarely used directly, because there already are >> >> functions that will provide valid pointer to pci_dev struct: >> >> pci_get_pdev(), pci_get_real_pdev(). They will lock appropriate list, >> >> find needed object and increase its reference counter before returning >> >> to the caller. >> >> >> >> Naturally, pci_put() should be called after finishing working with a >> >> received object. This is the reason why this patch have so many >> >> pcidev_put()s and so little pcidev_get()s: existing calls to >> >> pci_get_*() functions now will increase reference counter >> >> automatically, we just need to decrease it back when we finished. >> > >> > After looking a bit into this, I would like to ask whether it's been >> > considered the need to increase the refcount for each use of a pdev. >> > >> >> This is how Linux uses reference locking. It decreases cognitive load >> and chance for an error, as there is a simple set of rules, which you >> follow. >> >> > For example I would consider the initial alloc_pdev() to take a >> > refcount, and then pci_remove_device() _must_ be the function that >> > removes the last refcount, so that it can return -EBUSY otherwise (see >> > my comment below). >> >> I tend to disagree there, as this ruins the very idea of reference >> counting. We can't know who else holds reference right now. Okay, we >> might know, but this requires additional lock to serialize >> accesses. Which, in turn, makes refcount un-needed. 
> > In principle pci_remove_device() must report whether the device is > ready to be physically removed from the system, so it must return > -EBUSY if there are still users accessing the device. > > A user would use PHYSDEVOP_manage_pci_remove to signal Xen it's trying > to physically remove a PCI device from a system, so we must ensure > that when the hypervisor returns success the device is ready to be > physically removed. > > Or at least that's my understanding of how this should work. > As far as I can see, this is not how it is implemented right now. pci_remove_device() does not check whether the device is still assigned to a domain. It does not check whether there are still users accessing the device. It just relies on the global PCI lock to ensure that the device is removed in an orderly manner. My patch series has no intention of changing this behavior. All I want to achieve is to allow the vPCI code to access struct pdev objects without holding the global PCI lock. >> > >> > I would also think that having the device assigned to a guest will take >> > another refcount, and then any usage from further callers (ie: like >> > vpci) will need some kind of protection from preventing the device >> > from being deassigned from a domain while vPCI handlers are running, >> > and the current refcount won't help with that. >> >> Yes, idea of this refcounting is to ensure that a pdev object exists as an >> valid object in memory if we are holding a long-term pointer to >> it. Indeed, vPCI handlers should use some other mechanism to ensure that >> pdev is not being re-assigned while handlers are running. I believe, >> this is the task of vpci->lock. Should we call >> vpci_remove_device/vpci_add_handlers each time we re-assign a PCI device? > > Yes, I think this was also part of a comment I've made on a different > patch. The device state needs to be cleared when assigned to a > different guest (as the hardware domain will also perform a device > reset). 
> > I think there are some points that needs to be part of the commit > message so the code can be properly evaluated: > > - The reference counting is only used to ensure the object cannot be > removed while in use. Users of the pci device object should > implement whatever protections required in order to get mutual > exclusion between them and device state changes. > Sure, I will add this. >> > >> > That makes me wonder if for example callers of pci_get_pdev(d, sbdf) >> > do need to take an extra refcount, because such access is already >> > protected from the pdev going away by the fact that the device is >> > assigned to a guest. But maybe it's too much work to separate users >> > of pci_get_pdev(d, ...); vs pci_get_pdev(NULL, ...);. >> > >> > There's also a window when the refcount is dropped to 0, and the >> > destruction function is called, but at the same time a concurrent >> > thread could attempt to take a reference to the pdev still? >> >> Last pcidev_put() would be called by pci_remove_device(), after removing >> it from all lists. This should prevent other threads from obtaining a valid >> reference to the pdev. > > What if a concurrent user has taken a reference to the object before > pci_remove_device() has removed the device from the lists, and still > holds it when pci_remove_device() performs the supposedly last > pcidev_put() call? Well, let's consider the vPCI code as this concurrent user, for example. First, it will try to take vpci->lock. Depending on how far pci_remove_device() has progressed, there are three cases:
1. The lock is taken before vpci_remove_device() takes it. In this case the vPCI code works as usual.
2. It tries to take the lock while vpci_remove_device() already holds it. In this case we fall through to the next case.
3. The lock is taken after vpci_remove_device() has finished its work. In this case the vPCI code sees that it was called for a device in an invalid state and exits.
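The three cases above can be sketched in C as a handler that re-checks the device state after taking the lock. This is a simplified model, not the series' code: all names (`sketch_dev`, `sketch_read()`, `sketch_remove()`) are hypothetical, a C11 `atomic_flag` spinlock stands in for vpci->lock, and the lock is assumed to live in the device object so that it outlives the state it protects.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device emulation state (stands in for struct vpci). */
struct sketch_vpci_state {
    int reg_value;
};

/* Hypothetical device object. */
struct sketch_dev {
    atomic_flag lock;                  /* stands in for vpci->lock */
    struct sketch_vpci_state *state;   /* NULL once removal has finished */
};

static void sketch_init(struct sketch_dev *dev, struct sketch_vpci_state *st)
{
    atomic_flag_clear(&dev->lock);
    dev->state = st;
}

static void sketch_lock(struct sketch_dev *dev)
{
    while ( atomic_flag_test_and_set(&dev->lock) )
        ; /* spin until the current holder releases the lock */
}

static void sketch_unlock(struct sketch_dev *dev)
{
    atomic_flag_clear(&dev->lock);
}

/* Removal path: tears the state down while holding the lock. */
static void sketch_remove(struct sketch_dev *dev)
{
    sketch_lock(dev);
    dev->state = NULL;
    sketch_unlock(dev);
}

/*
 * Handler path. Case 1: the state is still valid, so the read proceeds.
 * Case 2: the handler spins in sketch_lock() until removal is done and then
 * behaves like case 3. Case 3: the state is gone and the handler backs out.
 */
static bool sketch_read(struct sketch_dev *dev, int *value)
{
    bool ok = false;

    assert(value != NULL);

    sketch_lock(dev);
    if ( dev->state )
    {
        *value = dev->state->reg_value;
        ok = true;
    }
    sketch_unlock(dev);

    return ok;
}
```

In this model the reference count keeps the device object itself alive across the call, while the lock plus the validity check decide whether the handler may touch the emulation state.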
As you can see, there is no case where the vPCI code runs on a device which was removed. After the vPCI code drops its reference, the pdev object will be freed once and for all. Please note that I am talking about the pdev object here, not about the PCI device, because the PCI device (as a high-level entity) was destroyed by pci_remove_device(). The refcount is needed just for the last clean-up operations. > >> > >> >> sbdf.devfn &= ~stride; >> >> pdev = pci_get_pdev(NULL, sbdf); >> >> if ( pdev && stride != pdev->phantom_stride ) >> >> + { >> >> + pcidev_put(pdev); >> >> pdev = NULL; >> >> + } >> >> } >> >> >> >> return pdev; >> >> @@ -548,13 +526,18 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf) >> >> list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list ) >> >> if ( pdev->sbdf.bdf == sbdf.bdf && >> >> (!d || pdev->domain == d) ) >> >> + { >> >> + pcidev_get(pdev); >> >> return pdev; >> >> + } >> >> } >> >> else >> >> list_for_each_entry ( pdev, &d->pdev_list, domain_list ) >> >> if ( pdev->sbdf.bdf == sbdf.bdf ) >> >> + { >> >> + pcidev_get(pdev); >> >> return pdev; >> >> - >> >> + } >> >> return NULL; >> >> } >> >> >> >> @@ -663,7 +646,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn, >> >> PCI_SBDF(seg, info->physfn.bus, >> >> info->physfn.devfn)); >> >> if ( pdev ) >> >> + { >> >> pf_is_extfn = pdev->info.is_extfn; >> >> + pcidev_put(pdev); >> >> + } >> >> pcidevs_unlock(); >> >> if ( !pdev ) >> >> pci_add_device(seg, info->physfn.bus, info->physfn.devfn, >> >> @@ -818,7 +804,9 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn) >> >> if ( pdev->domain ) >> >> list_del(&pdev->domain_list); >> >> printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf); >> >> - free_pdev(pseg, pdev); >> >> + list_del(&pdev->alldevs_list); >> >> + pdev_msi_deinit(pdev); >> >> + pcidev_put(pdev); >> > >> > Hm, I think here we want to make sure that the device has been freed, >> > or else you would have to return -EBUSY to the calls to notify that >> > the device 
is still in use. >> >> Why? As I can see, pdev object is still may potentially be accessed by >> some other CPU right now. So pdev object will be freed after last >> reference is dropped. As it is already removed from all the lists, >> pci_dev_get() will not find it anymore. >> >> Actually, I can't see how this can happen in reality, as VPCI, MSI and >> IOMMU are already deactivated for this device. So, no one would touch it. > > Wouldn't it be possible for a concurrent user to hold a reference from > befoe the device has been 'deactivated'? > Yes, it can hold a reference. This is why we need additional locking to ensure that, say, pci_cleanup_msi() does not race with the rest of the MSI code. Right now this is ensured by the global PCI lock. >> > >> > I think we need an extra pcidev_put_final() or similar that can be >> > used in pci_remove_device() to assert that the device has been >> > actually removed. >> >> Will something break if we don't do this? I can't see how this can >> happen. > > As mentioned above, once pci_remove_device() returns 0 the admin > should be capable of physically removing the device from the system. > This patch series does not alter this requirement. The admin is still capable of physically removing the device from the system after a successful call to pci_remove_device().
>> > >> >> break; >> >> } >> >> >> >> @@ -848,7 +836,7 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus, >> >> { >> >> ret = iommu_quarantine_dev_init(pci_to_dev(pdev)); >> >> if ( ret ) >> >> - return ret; >> >> + goto out; >> >> >> >> target = dom_io; >> >> } >> >> @@ -878,6 +866,7 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus, >> >> pdev->fault.count = 0; >> >> >> >> out: >> >> + pcidev_put(pdev); >> >> if ( ret ) >> >> printk(XENLOG_G_ERR "%pd: deassign (%pp) failed (%d)\n", >> >> d, &PCI_SBDF(seg, bus, devfn), ret); >> >> @@ -1011,7 +1000,10 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn) >> >> pdev->fault.count >>= 1; >> >> pdev->fault.time = now; >> >> if ( ++pdev->fault.count < PT_FAULT_THRESHOLD ) >> >> + { >> >> + pcidev_put(pdev); >> >> pdev = NULL; >> >> + } >> >> } >> >> pcidevs_unlock(); >> >> >> >> @@ -1022,6 +1014,8 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn) >> >> * control it for us. 
*/ >> >> cword = pci_conf_read16(pdev->sbdf, PCI_COMMAND); >> >> pci_conf_write16(pdev->sbdf, PCI_COMMAND, cword & ~PCI_COMMAND_MASTER); >> >> + >> >> + pcidev_put(pdev); >> >> } >> >> >> >> /* >> >> @@ -1138,6 +1132,7 @@ static int __hwdom_init cf_check _setup_hwdom_pci_devices( >> >> printk(XENLOG_WARNING "Dom%d owning %pp?\n", >> >> pdev->domain->domain_id, &pdev->sbdf); >> >> >> >> + pcidev_put(pdev); >> >> if ( iommu_verbose ) >> >> { >> >> pcidevs_unlock(); >> >> @@ -1385,33 +1380,28 @@ static int iommu_remove_device(struct pci_dev *pdev) >> >> return iommu_call(hd->platform_ops, remove_device, devfn, pci_to_dev(pdev)); >> >> } >> >> >> >> -static int device_assigned(u16 seg, u8 bus, u8 devfn) >> >> +static int device_assigned(struct pci_dev *pdev) >> >> { >> >> - struct pci_dev *pdev; >> >> int rc = 0; >> >> >> >> ASSERT(pcidevs_locked()); >> >> - pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); >> >> - >> >> - if ( !pdev ) >> >> - rc = -ENODEV; >> >> /* >> >> * If the device exists and it is not owned by either the hardware >> >> * domain or dom_io then it must be assigned to a guest, or be >> >> * hidden (owned by dom_xen). >> >> */ >> >> - else if ( pdev->domain != hardware_domain && >> >> - pdev->domain != dom_io ) >> >> + if ( pdev->domain != hardware_domain && >> >> + pdev->domain != dom_io ) >> >> rc = -EBUSY; >> >> >> >> return rc; >> >> } >> >> >> >> /* Caller should hold the pcidevs_lock */ >> > >> > I would assume the caller has taken an extra reference to the pdev, so >> > holding the pcidevs_lock is no longer needed? >> >> I am assumed that lock may be required by MSIX or IOMMU functions, that >> are being called here. For example, I can see that reassign_device() in >> pci_amd_iommu.c manipulates with some lists. I believe, it should be >> protected with the lock. > > OK, so that's pcidevs_lock being used to protect something else that's > not strictly a pci device, but a related structure. > Yes. 
I have found multiple such places, when I tried total PCI locking reworking. >> > >> >> -static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) >> >> +static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag) >> >> { >> >> const struct domain_iommu *hd = dom_iommu(d); >> >> - struct pci_dev *pdev; >> >> + uint8_t devfn; >> >> int rc = 0; >> >> >> >> if ( !is_iommu_enabled(d) ) >> >> @@ -1422,10 +1412,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) >> >> >> >> /* device_assigned() should already have cleared the device for assignment */ >> >> ASSERT(pcidevs_locked()); >> >> - pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); >> >> ASSERT(pdev && (pdev->domain == hardware_domain || >> >> pdev->domain == dom_io)); >> >> >> >> + devfn = pdev->devfn; >> >> + >> >> /* Do not allow broken devices to be assigned to guests. */ >> >> rc = -EBADF; >> >> if ( pdev->broken && d != hardware_domain && d != dom_io ) >> >> @@ -1460,7 +1451,7 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) >> >> done: >> >> if ( rc ) >> >> printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n", >> >> - d, &PCI_SBDF(seg, bus, devfn), rc); >> >> + d, &PCI_SBDF(pdev->seg, pdev->bus, devfn), rc); >> >> /* The device is assigned to dom_io so mark it as quarantined */ >> >> else if ( d == dom_io ) >> >> pdev->quarantine = true; >> >> @@ -1595,6 +1586,9 @@ int iommu_do_pci_domctl( >> >> ASSERT(d); >> >> /* fall through */ >> >> case XEN_DOMCTL_test_assign_device: >> >> + { >> >> + struct pci_dev *pdev; >> >> + >> >> /* Don't support self-assignment of devices. 
*/ >> >> if ( d == current->domain ) >> >> { >> >> @@ -1622,26 +1616,36 @@ int iommu_do_pci_domctl( >> >> seg = machine_sbdf >> 16; >> >> bus = PCI_BUS(machine_sbdf); >> >> devfn = PCI_DEVFN(machine_sbdf); >> >> - >> >> pcidevs_lock(); >> >> - ret = device_assigned(seg, bus, devfn); >> >> + pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); >> >> + if ( !pdev ) >> >> + { >> >> + printk(XENLOG_G_INFO "%pp non-existent\n", >> >> + &PCI_SBDF(seg, bus, devfn)); >> >> + ret = -EINVAL; >> >> + break; >> >> + } >> >> + >> >> + ret = device_assigned(pdev); >> >> if ( domctl->cmd == XEN_DOMCTL_test_assign_device ) >> >> { >> >> if ( ret ) >> >> { >> >> - printk(XENLOG_G_INFO "%pp already assigned, or non-existent\n", >> >> + printk(XENLOG_G_INFO "%pp already assigned\n", >> >> &PCI_SBDF(seg, bus, devfn)); >> >> ret = -EINVAL; >> >> } >> >> } >> >> else if ( !ret ) >> >> - ret = assign_device(d, seg, bus, devfn, flags); >> >> + ret = assign_device(d, pdev, flags); >> >> + >> >> + pcidev_put(pdev); >> > >> > I would think you need to keep the refcount here if ret == 0, so that >> > the device cannot be removed while assigned to a domain? >> >> Looks like we are perceiving function of refcnt in a different >> ways. For me, this is the mechanism to guarantee that if we have a valid >> pointer to an object, this object will not disappear under our >> feet. This is the main function of krefs in the linux kernel: if your >> code holds a reference to an object, you can be sure that this object is >> exists in memory. >> >> On other hand, it seems that you are considering this refcnt as an usage >> counter for an actual PCI device, not "struct pdev" that represent >> it. Those are two related things, but not the same. So, I can see why >> you are suggesting to get additional reference there. But for me, this >> looks unnecessary: the very first refcount is obtained in >> pci_add_device() and there is the corresponding function >> pci_remove_device() that will drop this refcount. 
So, for me, if admin >> wants to remove a PCI device which is assigned to a domain, they can do >> this as they were able to do this prior this patches. > > This is all fine, but needs to be stated in the commit message. > Sure, I will add this. >> The main value of introducing refcnt is to be able to access pdev objects >> without holding the global pcidevs_lock(). This does not mean that you >> don't need locking at all. But this allows you to use pdev->lock (which >> does not exists in this series, but was introduced in a RFC earlier), or >> vpci->lock, or any other subsystem->lock. > > I guess I was missing this other bit about introducing a > per-device lock, would it be possible to bundle all this together into > a single patch series? As I said at the top of this email, it was tried. You can check RFC at [1]. > > It would be good to place this change together with any other locking > related change that you have pending. Honestly, my main goal is to fix the current issues with vPCI, so ARM can move forward on adding PCI support for the platform. So, I am focusing on this right now. [1] https://patchwork.kernel.org/project/xen-devel/cover/20220831141040.13231-1-volodymyr_babchuk@epam.com/ [2] https://patchwork.kernel.org/project/xen-devel/cover/20220216151628.1610777-1-andr2000@gmail.com/ -- WBR, Volodymyr ^ permalink raw reply [flat|nested] 50+ messages in thread
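The distinction drawn in this reply (the refcount protects the lifetime of the struct, not the assignment of the physical device) can be sketched in plain C. This is a toy model under stated assumptions, not Xen code: every identifier (toy_pdev, toy_add, toy_lookup, toy_remove, toy_get, toy_put, toy_freed) is hypothetical, and a single boolean stands in for list membership. It only illustrates that the first reference is taken on add, dropped on remove, and that a user who looked the object up earlier keeps the memory alive until its own put.

```c
/* Toy model only: none of these names exist in Xen. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_pdev {
    atomic_int refcnt;   /* lifetime of this struct, not of the device */
    bool on_list;        /* stands in for alldevs_list membership */
};

static int toy_freed;    /* counts destructor runs, for illustration */

/* Models pci_add_device(): the list owns the first reference. */
static struct toy_pdev *toy_add(void)
{
    struct toy_pdev *p = malloc(sizeof(*p));

    if ( !p )
        return NULL;
    atomic_init(&p->refcnt, 1);
    p->on_list = true;
    return p;
}

static void toy_get(struct toy_pdev *p)
{
    atomic_fetch_add(&p->refcnt, 1);
}

/* Drops one reference; frees the struct when the last one goes away. */
static void toy_put(struct toy_pdev *p)
{
    int old = atomic_fetch_sub(&p->refcnt, 1);

    assert(old > 0);
    if ( old == 1 )
    {
        toy_freed++;
        free(p);
    }
}

/* Models pci_get_pdev(): a successful lookup returns a counted pointer. */
static struct toy_pdev *toy_lookup(struct toy_pdev *p)
{
    if ( !p->on_list )
        return NULL;
    toy_get(p);
    return p;
}

/* Models pci_remove_device(): unlink first, then drop the list's reference. */
static void toy_remove(struct toy_pdev *p)
{
    p->on_list = false;
    toy_put(p);
}
```

In this model toy_remove() always succeeds, even while another holder exists. The struct is simply freed later, by whoever drops the last reference. Whether that is acceptable for the real pci_remove_device() is exactly the point under discussion in the thread.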
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-04-12 21:54 ` Volodymyr Babchuk @ 2023-04-13 15:00 ` Roger Pau Monné 2023-04-14 1:30 ` Volodymyr Babchuk 0 siblings, 1 reply; 50+ messages in thread From: Roger Pau Monné @ 2023-04-13 15:00 UTC (permalink / raw) To: Volodymyr Babchuk Cc: xen-devel, Jan Beulich, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian On Wed, Apr 12, 2023 at 09:54:12PM +0000, Volodymyr Babchuk wrote: > > Hi Roger, > > First of all, I want to provide link [1] to the RFC series where I tried > total PCI locking rework. After discussing with Jan, it became clear for > me, that task is much harder, than I anticipated. So, it was decided to > move with a smaller steps. First step is to make vPCI code independed > from the global PCI lock. Actually, this is not the first try. > Oleksandr Andrushchenko tried to use r/w lock for this: [2]. But, > Jan suggested to use refcounting instead of r/w locks, and I liked the > idea. So, this is why you are seeing this patch series. Thanks, I've been on leave for long periods recently and I've missed some of the series. > > > Roger Pau Monné <roger.pau@citrix.com> writes: > > > On Tue, Apr 11, 2023 at 11:41:04PM +0000, Volodymyr Babchuk wrote: > >> > >> Hi Roger, > >> > >> Roger Pau Monné <roger.pau@citrix.com> writes: > >> > >> > On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: > >> >> Prior to this change, lifetime of pci_dev objects was protected by global > >> >> pcidevs_lock(). Long-term plan is to remove this log, so we need some > >> > ^ lock > >> > > >> > I wouldn't say remove, as one way or another we need a lock to protect > >> > concurrent accesses. > >> > > >> > >> I'll write "replace this global lock with couple of more granular > >> locking devices" > >> if this is okay for you. > >> > >> >> other mechanism to ensure that those objects will not disappear under > >> >> feet of code that access them. 
Reference counting is a good choice as > >> >> it provides easy to comprehend way to control object lifetime. > >> >> > >> >> This patch adds two new helper functions: pcidev_get() and > >> >> pcidev_put(). pcidev_get() will increase reference counter, while > >> >> pcidev_put() will decrease it, destroying object when counter reaches > >> >> zero. > >> >> > >> >> pcidev_get() should be used only when you already have a valid pointer > >> >> to the object or you are holding lock that protects one of the > >> >> lists (domain, pseg or ats) that store pci_dev structs. > >> >> > >> >> pcidev_get() is rarely used directly, because there already are > >> >> functions that will provide valid pointer to pci_dev struct: > >> >> pci_get_pdev(), pci_get_real_pdev(). They will lock appropriate list, > >> >> find needed object and increase its reference counter before returning > >> >> to the caller. > >> >> > >> >> Naturally, pci_put() should be called after finishing working with a > >> >> received object. This is the reason why this patch have so many > >> >> pcidev_put()s and so little pcidev_get()s: existing calls to > >> >> pci_get_*() functions now will increase reference counter > >> >> automatically, we just need to decrease it back when we finished. > >> > > >> > After looking a bit into this, I would like to ask whether it's been > >> > considered the need to increase the refcount for each use of a pdev. > >> > > >> > >> This is how Linux uses reference locking. It decreases cognitive load > >> and chance for an error, as there is a simple set of rules, which you > >> follow. > >> > >> > For example I would consider the initial alloc_pdev() to take a > >> > refcount, and then pci_remove_device() _must_ be the function that > >> > removes the last refcount, so that it can return -EBUSY otherwise (see > >> > my comment below). > >> > >> I tend to disagree there, as this ruins the very idea of reference > >> counting. We can't know who else holds reference right now. 
Okay, we
> >> might know, but this requires an additional lock to serialize
> >> accesses. Which, in turn, makes refcount un-needed.
> >
> > In principle pci_remove_device() must report whether the device is
> > ready to be physically removed from the system, so it must return
> > -EBUSY if there are still users accessing the device.
> >
> > A user would use PHYSDEVOP_manage_pci_remove to signal Xen it's trying
> > to physically remove a PCI device from a system, so we must ensure
> > that when the hypervisor returns success the device is ready to be
> > physically removed.
> >
> > Or at least that's my understanding of how this should work.
> >
>
> As far as I can see, this is not how it is implemented right
> now. pci_remove_device() does not check whether the device is assigned
> to a domain. It does not check if there are still users accessing the
> device. It just relies on the global PCI lock to ensure that the device
> is removed in an orderly manner.

Right, the expectation is that any path inside of the hypervisor using
the device will hold the pcidevs lock, and thus by holding it while
removing we assert that no users (inside the hypervisor) are left.

I don't think we have been very consistent about the usage of the
pcidevs lock, and hence most of this is likely broken. Hopefully
removing a PCI device from a system is a very uncommon operation.

> My patch series has no intention to change this behavior. All I
> want to achieve is to allow vpci code to access struct pdev objects
> without holding the global PCI lock.

That's all fine, but we need to make sure it doesn't make things worse
than they currently are, and ideally it should make things easier.

That's why I would like to understand exactly what's the purpose of
the refcount, and how it should be used. The usage of the refcount
should be compatible with the intended behaviour of
pci_remove_device(), regardless of whether the current implementation
is correct.
We don't want to be piling up more broken stuff on
top of an already broken implementation.

> >> >
> >> > That makes me wonder if for example callers of pci_get_pdev(d, sbdf)
> >> > do need to take an extra refcount, because such access is already
> >> > protected from the pdev going away by the fact that the device is
> >> > assigned to a guest. But maybe it's too much work to separate users
> >> > of pci_get_pdev(d, ...); vs pci_get_pdev(NULL, ...);.
> >> >
> >> > There's also a window when the refcount is dropped to 0, and the
> >> > destruction function is called, but at the same time a concurrent
> >> > thread could attempt to take a reference to the pdev still?
> >>
> >> Last pcidev_put() would be called by pci_remove_device(), after removing
> >> it from all lists. This should prevent other threads from obtaining a valid
> >> reference to the pdev.
> >
> > What if a concurrent user has taken a reference to the object before
> > pci_remove_device() has removed the device from the lists, and still
> > holds it when pci_remove_device() performs the supposedly last
> > pcidev_put() call?
>
> Well, let's consider VPCI code as this concurrent user, for
> example. First, it will try to take vpci->lock. Depending on how far
> pci_remove_device() has progressed, there are three cases:
>
> 1. The lock is taken before vpci_remove_device() takes the lock. In this
> case vpci code works as always.
>
> 2. It tries to take the lock while vpci_remove_device() already holds
> it. In this case we fall through to the next case:
>
> 3. The lock is taken after vpci_remove_device() has finished its work. In
> this case vPCI code sees that it was called for a device in an invalid
> state and exits.

For 2) and 3) you will hit a dereference of freed memory, as the lock
(vpci->lock) would have been freed by vpci_remove_device() while a
concurrent user is waiting on pci_remove_device() to release the lock.
I'm not sure how the user sees the device is in an invalid state, because it was waiting on a lock (vpci->lock) that has been removed under it's feet. This is an existing issue not made worse by the refcounting, but it's not a great example. > > As you can see, there is no case where vPCI code is running on an device > which was removed. > > After vPCI code drops refcounter, pdev object will be freed once and for > all. Please node, that I am talking about pdev object there, not about > PCI device, because PCI device (as a high-level entity) was destroyed by > pci_remove_device(). refcount is needed just for the last clean-up > operations. Right, but pci_remove_device() will return success even when there are some users holding a refcount to the device, which is IMO undesirable. As I understand it the purpose of pci_remove_device() is that once it returns success the device can be physically removed from the system. > > > >> > > >> >> sbdf.devfn &= ~stride; > >> >> pdev = pci_get_pdev(NULL, sbdf); > >> >> if ( pdev && stride != pdev->phantom_stride ) > >> >> + { > >> >> + pcidev_put(pdev); > >> >> pdev = NULL; > >> >> + } > >> >> } > >> >> > >> >> return pdev; > >> >> @@ -548,13 +526,18 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf) > >> >> list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list ) > >> >> if ( pdev->sbdf.bdf == sbdf.bdf && > >> >> (!d || pdev->domain == d) ) > >> >> + { > >> >> + pcidev_get(pdev); > >> >> return pdev; > >> >> + } > >> >> } > >> >> else > >> >> list_for_each_entry ( pdev, &d->pdev_list, domain_list ) > >> >> if ( pdev->sbdf.bdf == sbdf.bdf ) > >> >> + { > >> >> + pcidev_get(pdev); > >> >> return pdev; > >> >> - > >> >> + } > >> >> return NULL; > >> >> } > >> >> > >> >> @@ -663,7 +646,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn, > >> >> PCI_SBDF(seg, info->physfn.bus, > >> >> info->physfn.devfn)); > >> >> if ( pdev ) > >> >> + { > >> >> pf_is_extfn = pdev->info.is_extfn; > >> >> + 
pcidev_put(pdev); > >> >> + } > >> >> pcidevs_unlock(); > >> >> if ( !pdev ) > >> >> pci_add_device(seg, info->physfn.bus, info->physfn.devfn, > >> >> @@ -818,7 +804,9 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn) > >> >> if ( pdev->domain ) > >> >> list_del(&pdev->domain_list); > >> >> printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf); > >> >> - free_pdev(pseg, pdev); > >> >> + list_del(&pdev->alldevs_list); > >> >> + pdev_msi_deinit(pdev); > >> >> + pcidev_put(pdev); > >> > > >> > Hm, I think here we want to make sure that the device has been freed, > >> > or else you would have to return -EBUSY to the calls to notify that > >> > the device is still in use. > >> > >> Why? As I can see, pdev object is still may potentially be accessed by > >> some other CPU right now. So pdev object will be freed after last > >> reference is dropped. As it is already removed from all the lists, > >> pci_dev_get() will not find it anymore. > >> > >> Actually, I can't see how this can happen in reality, as VPCI, MSI and > >> IOMMU are already deactivated for this device. So, no one would touch it. > > > > Wouldn't it be possible for a concurrent user to hold a reference from > > befoe the device has been 'deactivated'? > > > > Yes, it can hold a reference. This is why we need additional locking to > ensure that, say, pci_cleanup_msi() does not races with rest of the MSI > code. Right now this is ensured by then global PCI lock. > > >> > > >> > I think we need an extra pcidev_put_final() or similar that can be > >> > used in pci_remove_device() to assert that the device has been > >> > actually removed. > >> > >> Will something break if we don't do this? I can't see how this can > >> happen. > > > > As mentioned above, once pci_remove_device() returns 0 the admin > > should be capable of physically removing the device from the system. > > > > This patch series does not alter this requirement. Admin is still > capable of physically removing the device from the system. 
After > successful call to the pci_remove_device() Indeed, but there might be users in the hypervisor still holding a reference to the pdev. > >> >> -static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) > >> >> +static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag) > >> >> { > >> >> const struct domain_iommu *hd = dom_iommu(d); > >> >> - struct pci_dev *pdev; > >> >> + uint8_t devfn; > >> >> int rc = 0; > >> >> > >> >> if ( !is_iommu_enabled(d) ) > >> >> @@ -1422,10 +1412,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) > >> >> > >> >> /* device_assigned() should already have cleared the device for assignment */ > >> >> ASSERT(pcidevs_locked()); > >> >> - pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); > >> >> ASSERT(pdev && (pdev->domain == hardware_domain || > >> >> pdev->domain == dom_io)); > >> >> > >> >> + devfn = pdev->devfn; > >> >> + > >> >> /* Do not allow broken devices to be assigned to guests. */ > >> >> rc = -EBADF; > >> >> if ( pdev->broken && d != hardware_domain && d != dom_io ) > >> >> @@ -1460,7 +1451,7 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) > >> >> done: > >> >> if ( rc ) > >> >> printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n", > >> >> - d, &PCI_SBDF(seg, bus, devfn), rc); > >> >> + d, &PCI_SBDF(pdev->seg, pdev->bus, devfn), rc); > >> >> /* The device is assigned to dom_io so mark it as quarantined */ > >> >> else if ( d == dom_io ) > >> >> pdev->quarantine = true; > >> >> @@ -1595,6 +1586,9 @@ int iommu_do_pci_domctl( > >> >> ASSERT(d); > >> >> /* fall through */ > >> >> case XEN_DOMCTL_test_assign_device: > >> >> + { > >> >> + struct pci_dev *pdev; > >> >> + > >> >> /* Don't support self-assignment of devices. 
*/ > >> >> if ( d == current->domain ) > >> >> { > >> >> @@ -1622,26 +1616,36 @@ int iommu_do_pci_domctl( > >> >> seg = machine_sbdf >> 16; > >> >> bus = PCI_BUS(machine_sbdf); > >> >> devfn = PCI_DEVFN(machine_sbdf); > >> >> - > >> >> pcidevs_lock(); > >> >> - ret = device_assigned(seg, bus, devfn); > >> >> + pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); > >> >> + if ( !pdev ) > >> >> + { > >> >> + printk(XENLOG_G_INFO "%pp non-existent\n", > >> >> + &PCI_SBDF(seg, bus, devfn)); > >> >> + ret = -EINVAL; > >> >> + break; > >> >> + } > >> >> + > >> >> + ret = device_assigned(pdev); > >> >> if ( domctl->cmd == XEN_DOMCTL_test_assign_device ) > >> >> { > >> >> if ( ret ) > >> >> { > >> >> - printk(XENLOG_G_INFO "%pp already assigned, or non-existent\n", > >> >> + printk(XENLOG_G_INFO "%pp already assigned\n", > >> >> &PCI_SBDF(seg, bus, devfn)); > >> >> ret = -EINVAL; > >> >> } > >> >> } > >> >> else if ( !ret ) > >> >> - ret = assign_device(d, seg, bus, devfn, flags); > >> >> + ret = assign_device(d, pdev, flags); > >> >> + > >> >> + pcidev_put(pdev); > >> > > >> > I would think you need to keep the refcount here if ret == 0, so that > >> > the device cannot be removed while assigned to a domain? > >> > >> Looks like we are perceiving function of refcnt in a different > >> ways. For me, this is the mechanism to guarantee that if we have a valid > >> pointer to an object, this object will not disappear under our > >> feet. This is the main function of krefs in the linux kernel: if your > >> code holds a reference to an object, you can be sure that this object is > >> exists in memory. > >> > >> On other hand, it seems that you are considering this refcnt as an usage > >> counter for an actual PCI device, not "struct pdev" that represent > >> it. Those are two related things, but not the same. So, I can see why > >> you are suggesting to get additional reference there. 
But for me, this
> >> looks unnecessary: the very first refcount is obtained in
> >> pci_add_device() and there is the corresponding function
> >> pci_remove_device() that will drop this refcount. So, for me, if admin
> >> wants to remove a PCI device which is assigned to a domain, they can do
> >> this as they were able to do this prior to these patches.
> >
> > This is all fine, but needs to be stated in the commit message.
> >
>
> Sure, I will add this.
>
> >> The main value of introducing refcnt is to be able to access pdev objects
> >> without holding the global pcidevs_lock(). This does not mean that you
> >> don't need locking at all. But this allows you to use pdev->lock (which
> >> does not exist in this series, but was introduced in an RFC earlier), or
> >> vpci->lock, or any other subsystem->lock.
> >
> > I guess I was missing this other bit about introducing a
> > per-device lock, would it be possible to bundle all this together into
> > a single patch series?
>
> As I said at the top of this email, it was tried. You can check RFC at [1].
>
> >
> > It would be good to place this change together with any other locking
> > related change that you have pending.
>
> Honestly, my main goal is to fix the current issues with vPCI, so ARM
> can move forward on adding PCI support for the platform. So, I am
> focusing on this right now.

Thanks, we need to be careful however so as not to accumulate more
band-aids on top just to work around the fact that the locking we have
regarding the pci devices is not suitable.

I think it's important to keep all the usages of the pci_dev struct in
mind when designing a solution.

Overall it seems like this might help vPCI on Arm. I think the only
major request I have is the one related to pci_remove_device() only
returning success when there are no refcounts left.

Thanks, Roger.

^ permalink raw reply [flat|nested] 50+ messages in thread
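The closing request above (pci_remove_device() reporting success only once no other references remain) could be modelled as follows. This is a hedged sketch in plain C, not Xen code and not the behaviour of this series: the thread only names a possible pcidev_put_final() helper, so the semantics shown here are an assumption, and all identifiers (toy_pdev, toy_alloc, toy_get, toy_put, toy_put_final) are hypothetical.

```c
/* Toy model only: one possible reading of the suggested "final put". */
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

struct toy_pdev {
    atomic_int refcnt;
};

static struct toy_pdev *toy_alloc(void)
{
    struct toy_pdev *p = malloc(sizeof(*p));

    if ( p )
        atomic_init(&p->refcnt, 1);   /* reference owned by the device list */
    return p;
}

static void toy_get(struct toy_pdev *p)
{
    atomic_fetch_add(&p->refcnt, 1);
}

/* Ordinary put: frees the object when the last reference goes away. */
static void toy_put(struct toy_pdev *p)
{
    if ( atomic_fetch_sub(&p->refcnt, 1) == 1 )
        free(p);
}

/*
 * Final put: succeeds only if the caller holds the sole remaining
 * reference. Otherwise nothing is dropped and -EBUSY is returned, so a
 * removal path could report "still in use" instead of success.
 */
static int toy_put_final(struct toy_pdev *p)
{
    int expected = 1;

    if ( !atomic_compare_exchange_strong(&p->refcnt, &expected, 0) )
        return -EBUSY;

    free(p);
    return 0;
}
```

The sketch is only sound if no new reference can be taken between the check and the free, which in practice means the object must already be unlinked from every lookup list under the appropriate lock. That serialization requirement is the objection raised earlier in the thread, so the sketch shows the trade-off rather than resolving it.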
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-04-13 15:00 ` Roger Pau Monné @ 2023-04-14 1:30 ` Volodymyr Babchuk 2023-04-17 10:17 ` Roger Pau Monné 0 siblings, 1 reply; 50+ messages in thread From: Volodymyr Babchuk @ 2023-04-14 1:30 UTC (permalink / raw) To: Roger Pau Monné Cc: xen-devel, Jan Beulich, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian Hi Roger, Roger Pau Monné <roger.pau@citrix.com> writes: > On Wed, Apr 12, 2023 at 09:54:12PM +0000, Volodymyr Babchuk wrote: >> >> Hi Roger, >> >> First of all, I want to provide link [1] to the RFC series where I tried >> total PCI locking rework. After discussing with Jan, it became clear for >> me, that task is much harder, than I anticipated. So, it was decided to >> move with a smaller steps. First step is to make vPCI code independed >> from the global PCI lock. Actually, this is not the first try. >> Oleksandr Andrushchenko tried to use r/w lock for this: [2]. But, >> Jan suggested to use refcounting instead of r/w locks, and I liked the >> idea. So, this is why you are seeing this patch series. > > Thanks, I've been on leave for long periods recently and I've missed > some of the series. > Did you checked this RFC series? I am not asking you to review it, I am just curious about your opinion on the selected approach >> >> >> Roger Pau Monné <roger.pau@citrix.com> writes: >> >> > On Tue, Apr 11, 2023 at 11:41:04PM +0000, Volodymyr Babchuk wrote: >> >> >> >> Hi Roger, >> >> >> >> Roger Pau Monné <roger.pau@citrix.com> writes: >> >> >> >> > On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote: >> >> >> Prior to this change, lifetime of pci_dev objects was protected by global >> >> >> pcidevs_lock(). Long-term plan is to remove this log, so we need some >> >> > ^ lock >> >> > >> >> > I wouldn't say remove, as one way or another we need a lock to protect >> >> > concurrent accesses. 
>> >> > >> >> >> >> I'll write "replace this global lock with couple of more granular >> >> locking devices" >> >> if this is okay for you. >> >> >> >> >> other mechanism to ensure that those objects will not disappear under >> >> >> feet of code that access them. Reference counting is a good choice as >> >> >> it provides easy to comprehend way to control object lifetime. >> >> >> >> >> >> This patch adds two new helper functions: pcidev_get() and >> >> >> pcidev_put(). pcidev_get() will increase reference counter, while >> >> >> pcidev_put() will decrease it, destroying object when counter reaches >> >> >> zero. >> >> >> >> >> >> pcidev_get() should be used only when you already have a valid pointer >> >> >> to the object or you are holding lock that protects one of the >> >> >> lists (domain, pseg or ats) that store pci_dev structs. >> >> >> >> >> >> pcidev_get() is rarely used directly, because there already are >> >> >> functions that will provide valid pointer to pci_dev struct: >> >> >> pci_get_pdev(), pci_get_real_pdev(). They will lock appropriate list, >> >> >> find needed object and increase its reference counter before returning >> >> >> to the caller. >> >> >> >> >> >> Naturally, pci_put() should be called after finishing working with a >> >> >> received object. This is the reason why this patch have so many >> >> >> pcidev_put()s and so little pcidev_get()s: existing calls to >> >> >> pci_get_*() functions now will increase reference counter >> >> >> automatically, we just need to decrease it back when we finished. >> >> > >> >> > After looking a bit into this, I would like to ask whether it's been >> >> > considered the need to increase the refcount for each use of a pdev. >> >> > >> >> >> >> This is how Linux uses reference locking. It decreases cognitive load >> >> and chance for an error, as there is a simple set of rules, which you >> >> follow. 
>> >>
>> >> > For example I would consider the initial alloc_pdev() to take a
>> >> > refcount, and then pci_remove_device() _must_ be the function that
>> >> > removes the last refcount, so that it can return -EBUSY otherwise (see
>> >> > my comment below).
>> >>
>> >> I tend to disagree there, as this ruins the very idea of reference
>> >> counting. We can't know who else holds reference right now. Okay, we
>> >> might know, but this requires an additional lock to serialize
>> >> accesses. Which, in turn, makes refcount un-needed.
>> >
>> > In principle pci_remove_device() must report whether the device is
>> > ready to be physically removed from the system, so it must return
>> > -EBUSY if there are still users accessing the device.
>> >
>> > A user would use PHYSDEVOP_manage_pci_remove to signal Xen it's trying
>> > to physically remove a PCI device from a system, so we must ensure
>> > that when the hypervisor returns success the device is ready to be
>> > physically removed.
>> >
>> > Or at least that's my understanding of how this should work.
>> >
>>
>> As far as I can see, this is not how it is implemented right
>> now. pci_remove_device() does not check whether the device is assigned
>> to a domain. It does not check if there are still users accessing the
>> device. It just relies on the global PCI lock to ensure that the device
>> is removed in an orderly manner.
>
> Right, the expectation is that any path inside of the hypervisor using
> the device will hold the pcidevs lock, and thus by holding it while
> removing we assert that no users (inside the hypervisor) are left.
>

May I propose a slightly relaxed assertion? "We assert that no users
that access the device are left". What I am trying to say is that no one
will try to access, say, the device's config space, because the device
may already have been physically removed and any access to the device
itself will cause a fault. But there may still be users that access the
struct pdev that corresponds to this device.
> I don't think we have been very consistent about the usage of the
> pcidevs lock, and hence most of this is likely broken. Hopefully
> removing a PCI device from a system is a very uncommon operation.
>
>> My patch series has no intention to change this behavior. All I
>> want to achieve is to allow vpci code to access struct pdev objects
>> without holding the global PCI lock.
>
> That's all fine, but we need to make sure it doesn't make things worse
> than they currently are, and ideally it should make things easier.
>
> That's why I would like to understand exactly what's the purpose of
> the refcount, and how it should be used. The usage of the refcount
> should be compatible with the intended behaviour of
> pci_remove_device(), regardless of whether the current implementation
> is correct. We don't want to be piling up more broken stuff on
> top of an already broken implementation.
>

I agree with you. I'll fix the vPCI issue that you mention below and
prepare a more comprehensive commit description for the next version.

>> >> >
>> >> > That makes me wonder if for example callers of pci_get_pdev(d, sbdf)
>> >> > do need to take an extra refcount, because such access is already
>> >> > protected from the pdev going away by the fact that the device is
>> >> > assigned to a guest. But maybe it's too much work to separate users
>> >> > of pci_get_pdev(d, ...); vs pci_get_pdev(NULL, ...);.
>> >> >
>> >> > There's also a window when the refcount is dropped to 0, and the
>> >> > destruction function is called, but at the same time a concurrent
>> >> > thread could attempt to take a reference to the pdev still?
>> >>
>> >> Last pcidev_put() would be called by pci_remove_device(), after removing
>> >> it from all lists. This should prevent other threads from obtaining a valid
>> >> reference to the pdev.
>> >
>> > What if a concurrent user has taken a reference to the object before
>> > pci_remove_device() has removed the device from the lists, and still
>> > holds it when pci_remove_device() performs the supposedly last
>> > pcidev_put() call?
>>
>> Well, let's consider VPCI code as this concurrent user, for
>> example. First, it will try to take vpci->lock. Depending on how far
>> pci_remove_device() has progressed, there are three cases:
>>
>> 1. The lock is taken before vpci_remove_device() takes the lock. In this
>> case vpci code works as always.
>>
>> 2. It tries to take the lock while vpci_remove_device() already holds
>> it. In this case we fall through to the next case:
>>
>> 3. The lock is taken after vpci_remove_device() has finished its work. In
>> this case vPCI code sees that it was called for a device in an invalid
>> state and exits.
>
> For 2) and 3) you will hit a dereference of freed memory, as the lock
> (vpci->lock) would have been freed by vpci_remove_device() while a
> concurrent user is waiting on pci_remove_device() to release the lock.
>
> I'm not sure how the user sees the device is in an invalid state,
> because it was waiting on a lock (vpci->lock) that has been removed
> under its feet.
>
> This is an existing issue not made worse by the refcounting, but it's
> not a great example.
>

Yes, agreed. I am going to move vpci->lock to the upper level
(pdev->vpci_lock) and rework the vPCI code so it will gracefully handle
pdev->vpci == NULL.

>>
>> As you can see, there is no case where vPCI code is running on a device
>> which was removed.
>>
>> After vPCI code drops its reference, the pdev object will be freed once
>> and for all. Please note that I am talking about the pdev object here,
>> not about the PCI device, because the PCI device (as a high-level entity)
>> was destroyed by pci_remove_device(). The refcount is needed just for
>> the last clean-up operations.
> > Right, but pci_remove_device() will return success even when there are > some users holding a refcount to the device, which is IMO undesirable. > > As I understand it the purpose of pci_remove_device() is that once it > returns success the device can be physically removed from the system. > Yes, I totally agree with you. By saying "the device can physically removed from the system" we are asserting that no one will try to access this device via PCI bus. But this is not the same as "no one shall access struct pdev fields as it should be freed immediately". >> > >> >> > >> >> >> sbdf.devfn &= ~stride; >> >> >> pdev = pci_get_pdev(NULL, sbdf); >> >> >> if ( pdev && stride != pdev->phantom_stride ) >> >> >> + { >> >> >> + pcidev_put(pdev); >> >> >> pdev = NULL; >> >> >> + } >> >> >> } >> >> >> >> >> >> return pdev; >> >> >> @@ -548,13 +526,18 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf) >> >> >> list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list ) >> >> >> if ( pdev->sbdf.bdf == sbdf.bdf && >> >> >> (!d || pdev->domain == d) ) >> >> >> + { >> >> >> + pcidev_get(pdev); >> >> >> return pdev; >> >> >> + } >> >> >> } >> >> >> else >> >> >> list_for_each_entry ( pdev, &d->pdev_list, domain_list ) >> >> >> if ( pdev->sbdf.bdf == sbdf.bdf ) >> >> >> + { >> >> >> + pcidev_get(pdev); >> >> >> return pdev; >> >> >> - >> >> >> + } >> >> >> return NULL; >> >> >> } >> >> >> >> >> >> @@ -663,7 +646,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn, >> >> >> PCI_SBDF(seg, info->physfn.bus, >> >> >> info->physfn.devfn)); >> >> >> if ( pdev ) >> >> >> + { >> >> >> pf_is_extfn = pdev->info.is_extfn; >> >> >> + pcidev_put(pdev); >> >> >> + } >> >> >> pcidevs_unlock(); >> >> >> if ( !pdev ) >> >> >> pci_add_device(seg, info->physfn.bus, info->physfn.devfn, >> >> >> @@ -818,7 +804,9 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn) >> >> >> if ( pdev->domain ) >> >> >> list_del(&pdev->domain_list); >> >> >> printk(XENLOG_DEBUG "PCI remove 
device %pp\n", &pdev->sbdf); >> >> >> - free_pdev(pseg, pdev); >> >> >> + list_del(&pdev->alldevs_list); >> >> >> + pdev_msi_deinit(pdev); >> >> >> + pcidev_put(pdev); >> >> > >> >> > Hm, I think here we want to make sure that the device has been freed, >> >> > or else you would have to return -EBUSY to the calls to notify that >> >> > the device is still in use. >> >> >> >> Why? As I can see, pdev object is still may potentially be accessed by >> >> some other CPU right now. So pdev object will be freed after last >> >> reference is dropped. As it is already removed from all the lists, >> >> pci_dev_get() will not find it anymore. >> >> >> >> Actually, I can't see how this can happen in reality, as VPCI, MSI and >> >> IOMMU are already deactivated for this device. So, no one would touch it. >> > >> > Wouldn't it be possible for a concurrent user to hold a reference from >> > befoe the device has been 'deactivated'? >> > >> >> Yes, it can hold a reference. This is why we need additional locking to >> ensure that, say, pci_cleanup_msi() does not races with rest of the MSI >> code. Right now this is ensured by then global PCI lock. >> >> >> > >> >> > I think we need an extra pcidev_put_final() or similar that can be >> >> > used in pci_remove_device() to assert that the device has been >> >> > actually removed. >> >> >> >> Will something break if we don't do this? I can't see how this can >> >> happen. >> > >> > As mentioned above, once pci_remove_device() returns 0 the admin >> > should be capable of physically removing the device from the system. >> > >> >> This patch series does not alter this requirement. Admin is still >> capable of physically removing the device from the system. After >> successful call to the pci_remove_device() > > Indeed, but there might be users in the hypervisor still holding a > reference to the pdev. > reference counting alone can't protect you from this situation. Additional locking is required in this case. 
And right now we have the global PCI lock that protects us. Actually,
almost all the code takes and drops references while holding the global
PCI lock. The only exception, as far as I know, is the vPCI code, which
I am going to fix in the next version.

Also, I'll double check that only vPCI code obtains references while not
holding the global lock. My reasoning is the following:

1. Right now (i.e. on the staging branch) all accesses to pdevs are in a
consistent state. This basically means that all code that accesses pdevs
does so while holding an appropriate lock: the global PCI lock, in most
cases. This means the following: a pdev can't disappear under our feet,
no one is racing with us while we access the pdev, and no new pdev can
be created while we are holding the global PCI lock.

2. Adding reference counting alone changes nothing in this regard.
Actually, PCI code will needlessly increase/decrease an atomic while
holding the global lock.

3. As all work with PCI devices is done while holding the lock, we can
assert that the reference count at the beginning of a critical section
will be equal to the reference count at the end of that critical
section, because my patch adds a _put for every _get all across the
hypervisor, with a few notable exceptions:

3.1. pci_add_device() will initialize a device and set the reference
count to 1.

3.2. pci_remove_device() will de-initialize a device and decrease the
reference count by 1. I can assert that if p.1 is true and I didn't mess
up the balancing of _gets/_puts in other parts of the code, then
pci_remove_device() will always remove the last reference. This may (and
will) change in the future.

3.3. MSI code holds long-term pointers to pdev, so
msi[x]_capability_init() does an additional _get() and then
`msi_free_irq()` does the corresponding _put(). Luckily for us,
pci_remove_device() calls pci_cleanup_msi(), so we can be sure that this
does not break the assertion in p.3.2.

4. Now, we want vPCI code to be able to access PCI devices without
holding the global PCI lock the whole time. This is where we can
leverage reference counting. Here are the assertions:

4.1. vPCI code gets a pdev pointer only via the pci_get_pdev() function,
which reads from a list while holding the global PCI lock. That means
that pci_get_pdev() will return NULL after pci_remove_device() deletes
the device from all lists. It also means that vPCI code can't get a pdev
while pci_remove_device() is running, because pci_remove_device() is
holding the global PCI lock.

4.2. vPCI code will always acquire pdev->vpci_lock before accessing
pdev->vpci.

4.3. pci_remove_device() will de-init vpci state while holding
pdev->vpci_lock.

4.4. vPCI code will not try to access the PCI device if
pdev->vpci == NULL.

4.5. vPCI code will access only vpci-related fields in struct pdev.

4.6. vPCI does not depend on and does not alter non-vPCI-related state
of a PCI device. This is the trickiest part, because most of the
remaining state is protected by the global PCI lock, which we are not
holding. That means that we need to disable vPCI while re-assigning the
PCI device to another domain. As far as I can see, this is the only
place where vPCI depends on broader PCI device state.

This approach will not interfere with pci_remove_device() obligations,
because we can be sure that right now vPCI is the only user that can
hold a reference past the pci_remove_device() call, and that vPCI code
will not attempt to access the PCI device after end of , thus allowing
the admin to physically remove the device.
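To make points 4.1 to 4.4 concrete, here is a minimal user-space sketch of the intended access pattern. It is not Xen code: every `toy_*` name, the one-entry "device list", the `freed_pdevs` counter and the atomic_flag spinlock are assumptions made up for illustration, and pdev->vpci_lock does not exist in this series yet.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy spinlock standing in for the hypervisor's spinlocks (assumption). */
typedef struct { atomic_flag held; } toy_lock_t;
static void toy_lock(toy_lock_t *l)   { while (atomic_flag_test_and_set(&l->held)) { } }
static void toy_unlock(toy_lock_t *l) { atomic_flag_clear(&l->held); }

struct toy_vpci { unsigned int cmd; };

struct toy_pdev {
    atomic_int refcnt;      /* lifetime of the struct, not of the hardware */
    toy_lock_t vpci_lock;   /* hypothetical pdev->vpci_lock */
    struct toy_vpci *vpci;  /* NULL once the device has been removed */
};

static toy_lock_t pcidevs_lock = { ATOMIC_FLAG_INIT }; /* "global PCI lock" */
static struct toy_pdev *the_device;                    /* one-entry device list */
static int freed_pdevs;                                /* lets a test observe frees */

static void toy_pdev_put(struct toy_pdev *p)
{
    /* fetch_sub returns the old value: 1 means we dropped the last reference. */
    if (atomic_fetch_sub(&p->refcnt, 1) == 1) {
        free(p);
        freed_pdevs++;
    }
}

/* 3.1: adding a device sets the reference count to 1. */
static struct toy_pdev *toy_add_device(void)
{
    struct toy_pdev *p = calloc(1, sizeof(*p));

    if (!p)
        return NULL;
    atomic_init(&p->refcnt, 1);
    atomic_flag_clear(&p->vpci_lock.held);
    p->vpci = calloc(1, sizeof(*p->vpci));
    if (!p->vpci) {
        free(p);
        return NULL;
    }
    toy_lock(&pcidevs_lock);
    the_device = p;
    toy_unlock(&pcidevs_lock);
    return p;
}

/* 4.1: lookups walk the list under the global lock and take a reference. */
static struct toy_pdev *toy_get_pdev(void)
{
    struct toy_pdev *p;

    toy_lock(&pcidevs_lock);
    p = the_device;
    if (p)
        atomic_fetch_add(&p->refcnt, 1);
    toy_unlock(&pcidevs_lock);
    return p;
}

/* 4.2 and 4.4: take vpci_lock first, bail out if vpci is already gone. */
static int toy_vpci_write_cmd(struct toy_pdev *p, unsigned int val)
{
    int rc = -ENODEV;

    toy_lock(&p->vpci_lock);
    if (p->vpci) {
        p->vpci->cmd = val;
        rc = 0;
    }
    toy_unlock(&p->vpci_lock);
    return rc;
}

/* 4.3 and 3.2: unlink, tear down vpci under vpci_lock, drop the initial ref. */
static void toy_remove_device(void)
{
    struct toy_pdev *p;

    toy_lock(&pcidevs_lock);
    p = the_device;
    if (p) {
        the_device = NULL;
        toy_lock(&p->vpci_lock);
        free(p->vpci);
        p->vpci = NULL;
        toy_unlock(&p->vpci_lock);
        toy_pdev_put(p);
    }
    toy_unlock(&pcidevs_lock);
}
```

A caller that obtained its pointer with toy_get_pdev() before toy_remove_device() ran keeps a valid struct, sees -ENODEV from toy_vpci_write_cmd() afterwards, and frees the object with its own toy_pdev_put(). Because the lock lives in the pdev rather than in the vpci state, nobody can be left waiting on a lock that has been freed.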
In the future, we can gradually remove other parts of the PCI code from under the global PCI lock, providing we can give the same guarantees as p 4.1-4.6 >> >> >> -static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) >> >> >> +static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag) >> >> >> { >> >> >> const struct domain_iommu *hd = dom_iommu(d); >> >> >> - struct pci_dev *pdev; >> >> >> + uint8_t devfn; >> >> >> int rc = 0; >> >> >> >> >> >> if ( !is_iommu_enabled(d) ) >> >> >> @@ -1422,10 +1412,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) >> >> >> >> >> >> /* device_assigned() should already have cleared the device for assignment */ >> >> >> ASSERT(pcidevs_locked()); >> >> >> - pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); >> >> >> ASSERT(pdev && (pdev->domain == hardware_domain || >> >> >> pdev->domain == dom_io)); >> >> >> >> >> >> + devfn = pdev->devfn; >> >> >> + >> >> >> /* Do not allow broken devices to be assigned to guests. */ >> >> >> rc = -EBADF; >> >> >> if ( pdev->broken && d != hardware_domain && d != dom_io ) >> >> >> @@ -1460,7 +1451,7 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) >> >> >> done: >> >> >> if ( rc ) >> >> >> printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n", >> >> >> - d, &PCI_SBDF(seg, bus, devfn), rc); >> >> >> + d, &PCI_SBDF(pdev->seg, pdev->bus, devfn), rc); >> >> >> /* The device is assigned to dom_io so mark it as quarantined */ >> >> >> else if ( d == dom_io ) >> >> >> pdev->quarantine = true; >> >> >> @@ -1595,6 +1586,9 @@ int iommu_do_pci_domctl( >> >> >> ASSERT(d); >> >> >> /* fall through */ >> >> >> case XEN_DOMCTL_test_assign_device: >> >> >> + { >> >> >> + struct pci_dev *pdev; >> >> >> + >> >> >> /* Don't support self-assignment of devices. 
*/ >> >> >> if ( d == current->domain ) >> >> >> { >> >> >> @@ -1622,26 +1616,36 @@ int iommu_do_pci_domctl( >> >> >> seg = machine_sbdf >> 16; >> >> >> bus = PCI_BUS(machine_sbdf); >> >> >> devfn = PCI_DEVFN(machine_sbdf); >> >> >> - >> >> >> pcidevs_lock(); >> >> >> - ret = device_assigned(seg, bus, devfn); >> >> >> + pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); >> >> >> + if ( !pdev ) >> >> >> + { >> >> >> + printk(XENLOG_G_INFO "%pp non-existent\n", >> >> >> + &PCI_SBDF(seg, bus, devfn)); >> >> >> + ret = -EINVAL; >> >> >> + break; >> >> >> + } >> >> >> + >> >> >> + ret = device_assigned(pdev); >> >> >> if ( domctl->cmd == XEN_DOMCTL_test_assign_device ) >> >> >> { >> >> >> if ( ret ) >> >> >> { >> >> >> - printk(XENLOG_G_INFO "%pp already assigned, or non-existent\n", >> >> >> + printk(XENLOG_G_INFO "%pp already assigned\n", >> >> >> &PCI_SBDF(seg, bus, devfn)); >> >> >> ret = -EINVAL; >> >> >> } >> >> >> } >> >> >> else if ( !ret ) >> >> >> - ret = assign_device(d, seg, bus, devfn, flags); >> >> >> + ret = assign_device(d, pdev, flags); >> >> >> + >> >> >> + pcidev_put(pdev); >> >> > >> >> > I would think you need to keep the refcount here if ret == 0, so that >> >> > the device cannot be removed while assigned to a domain? >> >> >> >> Looks like we are perceiving function of refcnt in a different >> >> ways. For me, this is the mechanism to guarantee that if we have a valid >> >> pointer to an object, this object will not disappear under our >> >> feet. This is the main function of krefs in the linux kernel: if your >> >> code holds a reference to an object, you can be sure that this object is >> >> exists in memory. >> >> >> >> On other hand, it seems that you are considering this refcnt as an usage >> >> counter for an actual PCI device, not "struct pdev" that represent >> >> it. Those are two related things, but not the same. So, I can see why >> >> you are suggesting to get additional reference there. 
But for me, this >> >> looks unnecessary: the very first refcount is obtained in >> >> pci_add_device() and there is the corresponding function >> >> pci_remove_device() that will drop this refcount. So, for me, if admin >> >> wants to remove a PCI device which is assigned to a domain, they can do >> >> this as they were able to do this prior this patches. >> > >> > This is all fine, but needs to be stated in the commit message. >> > >> >> Sure, I will add this. >> >> >> The main value of introducing refcnt is to be able to access pdev objects >> >> without holding the global pcidevs_lock(). This does not mean that you >> >> don't need locking at all. But this allows you to use pdev->lock (which >> >> does not exists in this series, but was introduced in a RFC earlier), or >> >> vpci->lock, or any other subsystem->lock. >> > >> > I guess I was missing this other bit about introducing a >> > per-device lock, would it be possible to bundle all this together into >> > a single patch series? >> >> As I said at the top of this email, it was tried. You can check RFC at [1]. >> >> > >> > It would be good to place this change together with any other locking >> > related change that you have pending. >> >> Honestly, my main goal is to fix the current issues with vPCI, so ARM >> can move forward on adding PCI support for the platform. So, I am >> focusing on this right now. > > Thanks, we need to be careful however as to not accumulate more > bandaids on top just to workaround the fact that the locking we have > regarding the pci devices is not suitable. > > I think it's important to keep all the usages of the pci_dev struct in > mind when designing a solution. > > Overall it seems like might help vPCI on Arm, I think the only major > request I have is the one related to pci_remove_device() only > returning success when there are not refcounts left. Above I have proposed another view on this. I hope, it will work for you. 
Just to reiterate, the idea is to allow "harmless" refcounts to be left after returning from pci_remove_device(). By "harmless" I mean that the owners of those refcounts will not try to access the physical PCI device once pci_remove_device() has finished.

--
WBR, Volodymyr
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev
  2023-04-14  1:30 ` Volodymyr Babchuk
@ 2023-04-17 10:17 ` Roger Pau Monné
  2023-04-17 10:34 ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monné @ 2023-04-17 10:17 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Wei Liu, George Dunlap,
	Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian

On Fri, Apr 14, 2023 at 01:30:39AM +0000, Volodymyr Babchuk wrote:
>
> Hi Roger,
>
> Roger Pau Monné <roger.pau@citrix.com> writes:
>
> > On Wed, Apr 12, 2023 at 09:54:12PM +0000, Volodymyr Babchuk wrote:
> >>
> >> Hi Roger,
> >>
> >> First of all, I want to provide link [1] to the RFC series where I tried
> >> total PCI locking rework. After discussing with Jan, it became clear for
> >> me, that task is much harder, than I anticipated. So, it was decided to
> >> move with a smaller steps. First step is to make vPCI code independed
> >> from the global PCI lock. Actually, this is not the first try.
> >> Oleksandr Andrushchenko tried to use r/w lock for this: [2]. But,
> >> Jan suggested to use refcounting instead of r/w locks, and I liked the
> >> idea. So, this is why you are seeing this patch series.
> >
> > Thanks, I've been on leave for long periods recently and I've missed
> > some of the series.
> >
>
> Did you checked this RFC series? I am not asking you to review it, I am
> just curious about your opinion on the selected approach

I've just taken a look; it seems sensible (locking is complicated).
Splitting a big lock like the pci devs one can lead to all kinds of
unexpected races, because it was serializing a lot of operations which
would no longer be serialized on the global lock.

Overall it's time we kill the pci_devs lock; inevitably this will likely
result in some fallout. It's important however that we give some
thought to what model we switch to, so as to try to avoid finding
ourselves in a situation similar to where we are now.

> >>
> >>
> >> Roger Pau Monné <roger.pau@citrix.com> writes:
> >>
> >> > On Tue, Apr 11, 2023 at 11:41:04PM +0000, Volodymyr Babchuk wrote:
> >> >>
> >> >> Hi Roger,
> >> >>
> >> >> Roger Pau Monné <roger.pau@citrix.com> writes:
> >> >>
> >> >> > On Tue, Mar 14, 2023 at 08:56:29PM +0000, Volodymyr Babchuk wrote:
> >> >> >> Prior to this change, lifetime of pci_dev objects was protected by global
> >> >> >> pcidevs_lock(). Long-term plan is to remove this log, so we need some
> >> >> >                                                  ^ lock
> >> >> >
> >> >> > I wouldn't say remove, as one way or another we need a lock to protect
> >> >> > concurrent accesses.
> >> >> >
> >> >>
> >> >> I'll write "replace this global lock with couple of more granular
> >> >> locking devices"
> >> >> if this is okay for you.
> >> >>
> >> >> >> other mechanism to ensure that those objects will not disappear under
> >> >> >> feet of code that access them. Reference counting is a good choice as
> >> >> >> it provides easy to comprehend way to control object lifetime.
> >> >> >>
> >> >> >> This patch adds two new helper functions: pcidev_get() and
> >> >> >> pcidev_put(). pcidev_get() will increase reference counter, while
> >> >> >> pcidev_put() will decrease it, destroying object when counter reaches
> >> >> >> zero.
> >> >> >>
> >> >> >> pcidev_get() should be used only when you already have a valid pointer
> >> >> >> to the object or you are holding lock that protects one of the
> >> >> >> lists (domain, pseg or ats) that store pci_dev structs.
> >> >> >>
> >> >> >> pcidev_get() is rarely used directly, because there already are
> >> >> >> functions that will provide valid pointer to pci_dev struct:
> >> >> >> pci_get_pdev(), pci_get_real_pdev().
They will lock appropriate list, > >> >> >> find needed object and increase its reference counter before returning > >> >> >> to the caller. > >> >> >> > >> >> >> Naturally, pci_put() should be called after finishing working with a > >> >> >> received object. This is the reason why this patch have so many > >> >> >> pcidev_put()s and so little pcidev_get()s: existing calls to > >> >> >> pci_get_*() functions now will increase reference counter > >> >> >> automatically, we just need to decrease it back when we finished. > >> >> > > >> >> > After looking a bit into this, I would like to ask whether it's been > >> >> > considered the need to increase the refcount for each use of a pdev. > >> >> > > >> >> > >> >> This is how Linux uses reference locking. It decreases cognitive load > >> >> and chance for an error, as there is a simple set of rules, which you > >> >> follow. > >> >> > >> >> > For example I would consider the initial alloc_pdev() to take a > >> >> > refcount, and then pci_remove_device() _must_ be the function that > >> >> > removes the last refcount, so that it can return -EBUSY otherwise (see > >> >> > my comment below). > >> >> > >> >> I tend to disagree there, as this ruins the very idea of reference > >> >> counting. We can't know who else holds reference right now. Okay, we > >> >> might know, but this requires additional lock to serialize > >> >> accesses. Which, in turn, makes refcount un-needed. > >> > > >> > In principle pci_remove_device() must report whether the device is > >> > ready to be physically removed from the system, so it must return > >> > -EBUSY if there are still users accessing the device. > >> > > >> > A user would use PHYSDEVOP_manage_pci_remove to signal Xen it's trying > >> > to physically remove a PCI device from a system, so we must ensure > >> > that when the hypervisor returns success the device is ready to be > >> > physically removed. > >> > > >> > Or at least that's my understanding of how this should work. 
> >> > > >> > >> As I can see, this is not how it is implemented right > >> now. pci_remove_device() is not checking if device is not assigned to a > >> domain. Id does not check if there are still users accessing the > >> device. It just relies on a the global PCI lock to ensure that device is > >> removed in an orderly manner. > > > > Right, the expectation is that any path inside of the hypervisor using > > the device will hold the pcidevs lock, and thus bny holding it while > > removing we assert that no users (inside the hypervisor) are left. > > > > May I proposed a bit relaxed assertion? "We assert that no users that > access the device are left". What I am trying is say there, that no one > will try to access, say, device's config space. Because the device > already may be physically removed and any access to the device itself > will cause a fault. But there may be users that can access struct pdev > that corresponds to this device. Isn't holding a reference to the pdev a sign that it's PCI config space might be accessed? > > I don't think we have been very consistent about the usage of the > > pcidevs lock, and hence most of this is likely broken. Hopefully > > removing a PCI device from a system is a very uncommon operation. > > > >> My patch series has no intention to change this behavior. All what I > >> want to achieve - is to allow vpci code access struct pdev objects > >> without holding the global PCI lock. > > > > That's all fine, but we need to make sure it doesn't make things worse > > and what they currently are, and ideally it should make things easier. > > > > That's why I would like to understand exactly what's the purpose of > > the refcount, and how it should be used. The usage of the refcount > > should be compatible with the intended behaviour of > > pci_remove_device(), regardless of whether the current implementation > > is not correct. We don't want to be piling up more broken stuff on > > top of an already broken implementation. 
> > > > I agree with you. I'll fix the issue with vPCI, that you mentioned below > and prepare more comprehensive commit description in the next version. > > >> >> > > >> >> > That makes me wonder if for example callers of pci_get_pdev(d, sbdf) > >> >> > do need to take an extra refcount, because such access is already > >> >> > protected from the pdev going away by the fact that the device is > >> >> > assigned to a guest. But maybe it's too much work to separate users > >> >> > of pci_get_pdev(d, ...); vs pci_get_pdev(NULL, ...);. > >> >> > > >> >> > There's also a window when the refcount is dropped to 0, and the > >> >> > destruction function is called, but at the same time a concurrent > >> >> > thread could attempt to take a reference to the pdev still? > >> >> > >> >> Last pcidev_put() would be called by pci_remove_device(), after removing > >> >> it from all lists. This should prevent other threads from obtaining a valid > >> >> reference to the pdev. > >> > > >> > What if a concurrent user has taken a reference to the object before > >> > pci_remove_device() has removed the device from the lists, and still > >> > holds it when pci_remove_device() performs the supposedly last > >> > pcidev_put() call? > >> > >> Well, let's consider VPCI code as this concurrent user, for > >> example. First, it will try to take vpci->lock. Depending on where in > >> pci_remov_device() there will be three cases: > >> > >> 1. Lock is taken before vpci_remove_device() takes the lock. In this > >> case vpci code works as always > >> > >> 2. It tries to take the lock when vpci_remove_device() is already locked > >> this. In this case we are falling to the next case: > >> > >> 3. Lock is taken after vpci_remove_device() had finished it's work. In this > >> case vPCI code sees that it was called for a device in an invalid state > >> and exits. 
> > > > For 2) and 3) you will hit a dereference, as the lock (vpci->lock) > > would have been freed by vpci_remove_device() while a concurrent user > > is waiting on pci_remov_device() to release the lock. > > > > I'm not sure how the user sees the device is in an invalid state, > > because it was waiting on a lock (vpci->lock) that has been removed > > under it's feet. > > > > This is an existing issue not made worse by the refcounting, but it's > > not a great example. > > > > Yes, agree. I am going to move vpci->lock to the upper level > (pdev->vpci_lock) and rework vPCI code so it will gracefully handle > pdev->vpci == NULL. We likely need to so something along this lines. > >> > >> As you can see, there is no case where vPCI code is running on an device > >> which was removed. > >> > >> After vPCI code drops refcounter, pdev object will be freed once and for > >> all. Please node, that I am talking about pdev object there, not about > >> PCI device, because PCI device (as a high-level entity) was destroyed by > >> pci_remove_device(). refcount is needed just for the last clean-up > >> operations. > > > > Right, but pci_remove_device() will return success even when there are > > some users holding a refcount to the device, which is IMO undesirable. > > > > As I understand it the purpose of pci_remove_device() is that once it > > returns success the device can be physically removed from the system. > > > > Yes, I totally agree with you. By saying "the device can physically removed > from the system" we are asserting that no one will try to access this > device via PCI bus. But this is not the same as "no one shall access > struct pdev fields as it should be freed immediately". I kind of view those two linked together, a user holding a ref to a pdev might access it's pci config space. 
It's still possible for some callers to access the PCI condig space of a device as long as the SBDF is known, but still it feels wrong to return from pci_remove_device() while the pdev hasn't been fully purged from the hypervisor. The more complex we make this handling the more likely to introduce errors in the long term. IMO I think it's easier to reason about device state if we make pci_remove_device() authoritative wrt any uses of the related pdev inside the hypervisor. > >> > > >> >> > > >> >> >> sbdf.devfn &= ~stride; > >> >> >> pdev = pci_get_pdev(NULL, sbdf); > >> >> >> if ( pdev && stride != pdev->phantom_stride ) > >> >> >> + { > >> >> >> + pcidev_put(pdev); > >> >> >> pdev = NULL; > >> >> >> + } > >> >> >> } > >> >> >> > >> >> >> return pdev; > >> >> >> @@ -548,13 +526,18 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf) > >> >> >> list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list ) > >> >> >> if ( pdev->sbdf.bdf == sbdf.bdf && > >> >> >> (!d || pdev->domain == d) ) > >> >> >> + { > >> >> >> + pcidev_get(pdev); > >> >> >> return pdev; > >> >> >> + } > >> >> >> } > >> >> >> else > >> >> >> list_for_each_entry ( pdev, &d->pdev_list, domain_list ) > >> >> >> if ( pdev->sbdf.bdf == sbdf.bdf ) > >> >> >> + { > >> >> >> + pcidev_get(pdev); > >> >> >> return pdev; > >> >> >> - > >> >> >> + } > >> >> >> return NULL; > >> >> >> } > >> >> >> > >> >> >> @@ -663,7 +646,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn, > >> >> >> PCI_SBDF(seg, info->physfn.bus, > >> >> >> info->physfn.devfn)); > >> >> >> if ( pdev ) > >> >> >> + { > >> >> >> pf_is_extfn = pdev->info.is_extfn; > >> >> >> + pcidev_put(pdev); > >> >> >> + } > >> >> >> pcidevs_unlock(); > >> >> >> if ( !pdev ) > >> >> >> pci_add_device(seg, info->physfn.bus, info->physfn.devfn, > >> >> >> @@ -818,7 +804,9 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn) > >> >> >> if ( pdev->domain ) > >> >> >> list_del(&pdev->domain_list); > >> >> >> printk(XENLOG_DEBUG "PCI remove 
device %pp\n", &pdev->sbdf); > >> >> >> - free_pdev(pseg, pdev); > >> >> >> + list_del(&pdev->alldevs_list); > >> >> >> + pdev_msi_deinit(pdev); > >> >> >> + pcidev_put(pdev); > >> >> > > >> >> > Hm, I think here we want to make sure that the device has been freed, > >> >> > or else you would have to return -EBUSY to the calls to notify that > >> >> > the device is still in use. > >> >> > >> >> Why? As I can see, pdev object is still may potentially be accessed by > >> >> some other CPU right now. So pdev object will be freed after last > >> >> reference is dropped. As it is already removed from all the lists, > >> >> pci_dev_get() will not find it anymore. > >> >> > >> >> Actually, I can't see how this can happen in reality, as VPCI, MSI and > >> >> IOMMU are already deactivated for this device. So, no one would touch it. > >> > > >> > Wouldn't it be possible for a concurrent user to hold a reference from > >> > befoe the device has been 'deactivated'? > >> > > >> > >> Yes, it can hold a reference. This is why we need additional locking to > >> ensure that, say, pci_cleanup_msi() does not races with rest of the MSI > >> code. Right now this is ensured by then global PCI lock. > >> > >> >> > > >> >> > I think we need an extra pcidev_put_final() or similar that can be > >> >> > used in pci_remove_device() to assert that the device has been > >> >> > actually removed. > >> >> > >> >> Will something break if we don't do this? I can't see how this can > >> >> happen. > >> > > >> > As mentioned above, once pci_remove_device() returns 0 the admin > >> > should be capable of physically removing the device from the system. > >> > > >> > >> This patch series does not alter this requirement. Admin is still > >> capable of physically removing the device from the system. After > >> successful call to the pci_remove_device() > > > > Indeed, but there might be users in the hypervisor still holding a > > reference to the pdev. 
> > > > reference counting alone can't protect you from this > situation. Additional locking is required in this case. And right now we > have the global PCI lock that protects us. Actually, almost all the code > takes and drops references while holding the global PCI lock. Only one > exception, as far as I know, is the vPCI code. Which I am going to fix > in the next version. But it would IMO be fine to just return -EBUSY if pci_remove_device() doesn't drop the last reference (and thus the pdev is not yet removed). I'm not saying that pci_remove_device() must unconditionally remove the pdev, but that whe nnot doing so it should return -EBUSY and the caller will have to try. I'm not the maintainer, so maybe Jan has other opinions about this, I will let him comment, as I don't want to enforce something without having agreement. > Also, I'll double check that only vPCI code obtains references while not > holding the global lock. My reasoning is the following: > > 1. Right now (i.e. on staging branch) all accesses to pdevs are > in consistent state. This basically means that all code that access > pdevs is doing this while holding an appropriate lock. Global PCI lock, > in most cases. The only appropriate lock should be the pci_devs lock if we want to prevent device removal. > This means the following: pdev can't disappear under our > feet, no one racing with us while accessing the pdev, no new pdev can be > created while we are holding the global PCI lock. > > 2. Adding reference counting alone changes nothing in this > regard. Actually, PCI code will needlessly increase/decrease an atomic > while holding the global lock. > > 3. As all work with PCI devices is done while holding the lock, we can > assert that reference count at the beginning of a critical section will > be equal to reference count at the end of a critical section, because > my patch add _put to the every _get all across the hypervisor, with a > few notable exceptions: > > 3.1. 
pci_add_device() will initialize a device and set reference count > to 1 > > 3.2. pci_remove_device() will de-initialize a device and decrease > reference count by 1. I can assert, that if p.1 is true and I didn't > messed up with balancing _gets/_puts in other parts of the code, then > pci_remove_device() will always remove the last reference. This may (and > will) change in the future. > > 3.3. MSI code holds long-term pointers to pdev, so > msi[x]_capability_init() does additional _get() and then > `msi_free_irq()` does corresponding _put(). Luckily for us, > pci_remove_device() calls pci_cleanup_msi() so we can be sure that does > not break assertion in p.3.2 Doesn't by the same logic assign device to a domain also take an extra reference because it's adding the pdev to a domain private list? (ie: much like MSI storing the pointer to the pdev). And then for MSI-X it feels like we should be taken a reference for each msi_desc entry in use, since each one contains a pointer to the pdev. > 4. Now, we want vPCI code to be able to access PCI devices without > holding the global PCI lock the whole time. This is where we can > leverage reference counting. Here are the assertions: > > 4.1. vPCI code gets pdev pointer only via pci_get_pdev() function, which > reads from a list while holding the global PCI lock. That means that > pci_get_pdev() will return NULL after pci_remove_device() deletes the > device from all lists. Also, that means that vPCI code can't get pdev > while pci_remove_device() is running, because pci_remove_device() is > holding the global PCI lock. What if it gets the pointer just before pci_remove_device() runs? It can't get the pointer while pci_remove_device() is running, but could get it just before. > 4.2. vPCI code will always acquire pdev->vpci_lock before accessing > pdev->vpci > > 4.3. pci_remove_device() will de-init vpci state while holding > pdev->vpci_lock > > 4.4. 
vPCI code will not try to access PCI device if pdev->vpci == NULL > > 4.5. vPCI code will access only vpci-related fields in struct pdev That's not currently true: vPCI makes extensive use of pdev->sbdf, for example. It also causes changes in the MSI(-X) state, albeit not directly but through helpers. > > 4.6. vPCI does not depends and does not alter non-vPCI-related state of a PCI > device. This is the most tricky part, because most of the remaining state is > protected by the global PCI lock, which we are not holding. That means, > that we need to disable vPCI while re-assigning the PCI device to > another domain. As I can see, this is the only place where vPCI depends > on more broader PCI device state. Device re-assigning should cause all previous vPCI state to be torn down and re-created when assigned to a different domain. For one, we need to do this in order to clear any internal state, but we must also do so because the initial setup (as done by init_bars for example) can be different depending on whether the owner domain is a domU or dom0. > This approach will not interfere with pci_remove_device() obligations, > because we can be sure that right now vPCI is the only user that can > hold reference counter past pci_remove_device() call and that vPCI code > will not attempt to access to PCI device after end of , thus, allowing admin to > physically remote the device. > > In the future, we can gradually remove other parts of the PCI code from under > the global PCI lock, providing we can give the same guarantees as p 4.1-4.6 I appreciate you doing all this analysis and reasoning. IMO having to write a page-long justification should really get us worried that the locking scheme we are using is far too complex and difficult to follow. Again, not your fault, it's just how things currently are. 
> >> >> >> -static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) > >> >> >> +static int assign_device(struct domain *d, struct pci_dev *pdev, u32 flag) > >> >> >> { > >> >> >> const struct domain_iommu *hd = dom_iommu(d); > >> >> >> - struct pci_dev *pdev; > >> >> >> + uint8_t devfn; > >> >> >> int rc = 0; > >> >> >> > >> >> >> if ( !is_iommu_enabled(d) ) > >> >> >> @@ -1422,10 +1412,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) > >> >> >> > >> >> >> /* device_assigned() should already have cleared the device for assignment */ > >> >> >> ASSERT(pcidevs_locked()); > >> >> >> - pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); > >> >> >> ASSERT(pdev && (pdev->domain == hardware_domain || > >> >> >> pdev->domain == dom_io)); > >> >> >> > >> >> >> + devfn = pdev->devfn; > >> >> >> + > >> >> >> /* Do not allow broken devices to be assigned to guests. */ > >> >> >> rc = -EBADF; > >> >> >> if ( pdev->broken && d != hardware_domain && d != dom_io ) > >> >> >> @@ -1460,7 +1451,7 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) > >> >> >> done: > >> >> >> if ( rc ) > >> >> >> printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n", > >> >> >> - d, &PCI_SBDF(seg, bus, devfn), rc); > >> >> >> + d, &PCI_SBDF(pdev->seg, pdev->bus, devfn), rc); > >> >> >> /* The device is assigned to dom_io so mark it as quarantined */ > >> >> >> else if ( d == dom_io ) > >> >> >> pdev->quarantine = true; > >> >> >> @@ -1595,6 +1586,9 @@ int iommu_do_pci_domctl( > >> >> >> ASSERT(d); > >> >> >> /* fall through */ > >> >> >> case XEN_DOMCTL_test_assign_device: > >> >> >> + { > >> >> >> + struct pci_dev *pdev; > >> >> >> + > >> >> >> /* Don't support self-assignment of devices. 
*/ > >> >> >> if ( d == current->domain ) > >> >> >> { > >> >> >> @@ -1622,26 +1616,36 @@ int iommu_do_pci_domctl( > >> >> >> seg = machine_sbdf >> 16; > >> >> >> bus = PCI_BUS(machine_sbdf); > >> >> >> devfn = PCI_DEVFN(machine_sbdf); > >> >> >> - > >> >> >> pcidevs_lock(); > >> >> >> - ret = device_assigned(seg, bus, devfn); > >> >> >> + pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn)); > >> >> >> + if ( !pdev ) > >> >> >> + { > >> >> >> + printk(XENLOG_G_INFO "%pp non-existent\n", > >> >> >> + &PCI_SBDF(seg, bus, devfn)); > >> >> >> + ret = -EINVAL; > >> >> >> + break; > >> >> >> + } > >> >> >> + > >> >> >> + ret = device_assigned(pdev); > >> >> >> if ( domctl->cmd == XEN_DOMCTL_test_assign_device ) > >> >> >> { > >> >> >> if ( ret ) > >> >> >> { > >> >> >> - printk(XENLOG_G_INFO "%pp already assigned, or non-existent\n", > >> >> >> + printk(XENLOG_G_INFO "%pp already assigned\n", > >> >> >> &PCI_SBDF(seg, bus, devfn)); > >> >> >> ret = -EINVAL; > >> >> >> } > >> >> >> } > >> >> >> else if ( !ret ) > >> >> >> - ret = assign_device(d, seg, bus, devfn, flags); > >> >> >> + ret = assign_device(d, pdev, flags); > >> >> >> + > >> >> >> + pcidev_put(pdev); > >> >> > > >> >> > I would think you need to keep the refcount here if ret == 0, so that > >> >> > the device cannot be removed while assigned to a domain? > >> >> > >> >> Looks like we are perceiving function of refcnt in a different > >> >> ways. For me, this is the mechanism to guarantee that if we have a valid > >> >> pointer to an object, this object will not disappear under our > >> >> feet. This is the main function of krefs in the linux kernel: if your > >> >> code holds a reference to an object, you can be sure that this object is > >> >> exists in memory. > >> >> > >> >> On other hand, it seems that you are considering this refcnt as an usage > >> >> counter for an actual PCI device, not "struct pdev" that represent > >> >> it. Those are two related things, but not the same. 
So, I can see why > >> >> you are suggesting to get additional reference there. But for me, this > >> >> looks unnecessary: the very first refcount is obtained in > >> >> pci_add_device() and there is the corresponding function > >> >> pci_remove_device() that will drop this refcount. So, for me, if admin > >> >> wants to remove a PCI device which is assigned to a domain, they can do > >> >> this as they were able to do this prior this patches. > >> > > >> > This is all fine, but needs to be stated in the commit message. > >> > > >> > >> Sure, I will add this. > >> > >> >> The main value of introducing refcnt is to be able to access pdev objects > >> >> without holding the global pcidevs_lock(). This does not mean that you > >> >> don't need locking at all. But this allows you to use pdev->lock (which > >> >> does not exists in this series, but was introduced in a RFC earlier), or > >> >> vpci->lock, or any other subsystem->lock. > >> > > >> > I guess I was missing this other bit about introducing a > >> > per-device lock, would it be possible to bundle all this together into > >> > a single patch series? > >> > >> As I said at the top of this email, it was tried. You can check RFC at [1]. > >> > >> > > >> > It would be good to place this change together with any other locking > >> > related change that you have pending. > >> > >> Honestly, my main goal is to fix the current issues with vPCI, so ARM > >> can move forward on adding PCI support for the platform. So, I am > >> focusing on this right now. > > > > Thanks, we need to be careful however as to not accumulate more > > bandaids on top just to workaround the fact that the locking we have > > regarding the pci devices is not suitable. > > > > I think it's important to keep all the usages of the pci_dev struct in > > mind when designing a solution. 
> > > > Overall it seems like might help vPCI on Arm, I think the only major > > request I have is the one related to pci_remove_device() only > > returning success when there are not refcounts left. > > Above I have proposed another view on this. I hope, it will work for > you. Just to reiterate, idea is to allow "harmless" refcounts to be left > after returning from pci_remove_device(). By "harmless" I mean that > owners of those refcounts will not try to access the physical PCI > device if pci_remove_device() is already finished. I'm not strictly a maintainer of this piece of code, albeit I have an opinion. I would also like to hear Jan's opinion, since he is the maintainer. Thanks, Roger. ^ permalink raw reply [flat|nested] 50+ messages in thread
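[Editorial sketch, not part of the original message.] Points 4.1-4.4 and the "what if it gets the pointer just before pci_remove_device() runs?" question can be illustrated with a small single-device toy model in C. All toy_* names are hypothetical and are not Xen's API; the spinlock stands in for both the global PCI lock and the per-device vPCI lock. The sketch assumes that a lookup takes a reference under the global lock and that every vPCI access re-checks the vpci pointer under the vPCI lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy spinlock; stands in for the global PCI lock and for the vPCI lock. */
typedef struct { atomic_flag held; } toy_lock;

static void toy_lock_init(toy_lock *l)    { atomic_flag_clear(&l->held); }
static void toy_lock_acquire(toy_lock *l) { while ( atomic_flag_test_and_set(&l->held) ) ; }
static void toy_lock_release(toy_lock *l) { atomic_flag_clear(&l->held); }

struct toy_pdev {
    atomic_int refcnt;    /* the first reference belongs to the "add" path */
    toy_lock vpci_lock;
    int *vpci;            /* NULL after teardown; guarded by vpci_lock */
};

static toy_lock toy_global_lock = { ATOMIC_FLAG_INIT };
static struct toy_pdev *toy_list;   /* one-entry "device list" (assumption) */

/* Rough analogue of the add path: refcount starts at 1, device is listed. */
static void toy_add(struct toy_pdev *p, int *vpci_state)
{
    atomic_init(&p->refcnt, 1);
    toy_lock_init(&p->vpci_lock);
    p->vpci = vpci_state;
    toy_lock_acquire(&toy_global_lock);
    toy_list = p;
    toy_lock_release(&toy_global_lock);
}

/* Lookup under the global lock; success hands out a new reference (4.1). */
static struct toy_pdev *toy_get(void)
{
    struct toy_pdev *p;

    toy_lock_acquire(&toy_global_lock);
    p = toy_list;
    if ( p )
        atomic_fetch_add(&p->refcnt, 1);
    toy_lock_release(&toy_global_lock);
    return p;
}

/* Drop one reference; returns how many references remain. */
static int toy_put(struct toy_pdev *p)
{
    return atomic_fetch_sub(&p->refcnt, 1) - 1;
}

/* vPCI-style access: take the vPCI lock, then check the pointer (4.2, 4.4). */
static int toy_vpci_read(struct toy_pdev *p, int *out)
{
    int rc = -ENODEV;

    toy_lock_acquire(&p->vpci_lock);
    if ( p->vpci )
    {
        *out = *p->vpci;
        rc = 0;
    }
    toy_lock_release(&p->vpci_lock);
    return rc;
}

/* Removal: unlist, tear down vPCI state under its lock (4.3), drop own ref. */
static int toy_remove(struct toy_pdev *p)
{
    toy_lock_acquire(&toy_global_lock);
    toy_list = NULL;
    toy_lock_acquire(&p->vpci_lock);
    p->vpci = NULL;
    toy_lock_release(&p->vpci_lock);
    toy_lock_release(&toy_global_lock);
    return toy_put(p);
}
```

In this model a caller that looked the device up just before toy_remove() keeps a valid pointer (its reference is still counted), but toy_vpci_read() then fails with -ENODEV rather than touching torn-down state. The sketch says nothing about non-vPCI fields such as pdev->sbdf or MSI state, which is exactly the gap raised for 4.5.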
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-04-17 10:17 ` Roger Pau Monné @ 2023-04-17 10:34 ` Jan Beulich 2023-04-17 10:51 ` Roger Pau Monné 0 siblings, 1 reply; 50+ messages in thread From: Jan Beulich @ 2023-04-17 10:34 UTC (permalink / raw) To: Roger Pau Monné, Volodymyr Babchuk Cc: xen-devel, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian On 17.04.2023 12:17, Roger Pau Monné wrote: > On Fri, Apr 14, 2023 at 01:30:39AM +0000, Volodymyr Babchuk wrote: >> Above I have proposed another view on this. I hope, it will work for >> you. Just to reiterate, idea is to allow "harmless" refcounts to be left >> after returning from pci_remove_device(). By "harmless" I mean that >> owners of those refcounts will not try to access the physical PCI >> device if pci_remove_device() is already finished. > > I'm not strictly a maintainer of this piece code, albeit I have an > opinion. I will like to also hear Jans opinion, since he is the > maintainer. I'm afraid I can't really appreciate the term "harmless refcounts". Whoever holds a ref is entitled to access the device. As stated before, I see only two ways of getting things consistent: Either pci_remove_device() is invoked upon dropping of the last ref, or it checks that it is dropping the last one. The former looks architecturally cleaner to me, but I can accept that moving there might be more of a change, so wouldn't object to going the latter route. Jan ^ permalink raw reply [flat|nested] 50+ messages in thread
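[Editorial sketch, not part of the original message.] The two consistent options named above can be contrasted in a few lines of C. This is a minimal model under stated assumptions: struct toy_dev, toy_teardown(), toy_put() and toy_remove_checked() are hypothetical names, and toy_teardown() merely stands in for whatever work pci_remove_device() performs.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_dev {
    atomic_int refcnt;
    bool torn_down;     /* set once the removal work has run */
};

/* Stand-in for the real removal work; must run exactly once. */
static void toy_teardown(struct toy_dev *d)
{
    assert(!d->torn_down);
    d->torn_down = true;
}

/* Option 1: the removal work is invoked by whoever drops the last reference. */
static void toy_put(struct toy_dev *d)
{
    if ( atomic_fetch_sub(&d->refcnt, 1) == 1 )
        toy_teardown(d);
}

/*
 * Option 2: removal checks that it is dropping the last reference.
 * The compare-and-exchange moves the count from 1 to 0 atomically, so a
 * concurrent holder can never observe a half-removed device.
 */
static int toy_remove_checked(struct toy_dev *d)
{
    int expected = 1;

    if ( !atomic_compare_exchange_strong(&d->refcnt, &expected, 0) )
        return -EBUSY;

    toy_teardown(d);
    return 0;
}
```

With option 1 the explicit removal request reduces to dropping the caller's reference and reporting whether the device is gone yet; with option 2 the request fails with -EBUSY until every other holder has called toy_put(). A real implementation would use only one of the two, since mixing them would run the removal work twice.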
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-04-17 10:34 ` Jan Beulich @ 2023-04-17 10:51 ` Roger Pau Monné 2023-04-17 11:02 ` Jan Beulich 2023-04-21 11:00 ` Volodymyr Babchuk 0 siblings, 2 replies; 50+ messages in thread From: Roger Pau Monné @ 2023-04-17 10:51 UTC (permalink / raw) To: Jan Beulich Cc: Volodymyr Babchuk, xen-devel, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian On Mon, Apr 17, 2023 at 12:34:31PM +0200, Jan Beulich wrote: > On 17.04.2023 12:17, Roger Pau Monné wrote: > > On Fri, Apr 14, 2023 at 01:30:39AM +0000, Volodymyr Babchuk wrote: > >> Above I have proposed another view on this. I hope, it will work for > >> you. Just to reiterate, idea is to allow "harmless" refcounts to be left > >> after returning from pci_remove_device(). By "harmless" I mean that > >> owners of those refcounts will not try to access the physical PCI > >> device if pci_remove_device() is already finished. > > > > I'm not strictly a maintainer of this piece code, albeit I have an > > opinion. I will like to also hear Jans opinion, since he is the > > maintainer. > > I'm afraid I can't really appreciate the term "harmless refcounts". Whoever > holds a ref is entitled to access the device. As stated before, I see only > two ways of getting things consistent: Either pci_remove_device() is > invoked upon dropping of the last ref, With this approach, what would be the implementation of PHYSDEVOP_manage_pci_remove? Would it just check whether the pdev exists and either return 0 or -EBUSY? > or it checks that it is dropping the > last one. The former looks architecturally cleaner to me, but I can accept > that moving there might be more of a change, so wouldn't object to going > the latter route. 
One of my concerns is what is expected of PHYSDEVOP_manage_pci_remove: I don't think it is expected to return 0 while there are users inside the hypervisor still holding a reference to the pdev. Thanks, Roger. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-04-17 10:51 ` Roger Pau Monné @ 2023-04-17 11:02 ` Jan Beulich 2023-04-21 11:00 ` Volodymyr Babchuk 1 sibling, 0 replies; 50+ messages in thread From: Jan Beulich @ 2023-04-17 11:02 UTC (permalink / raw) To: Roger Pau Monné Cc: Volodymyr Babchuk, xen-devel, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian On 17.04.2023 12:51, Roger Pau Monné wrote: > On Mon, Apr 17, 2023 at 12:34:31PM +0200, Jan Beulich wrote: >> On 17.04.2023 12:17, Roger Pau Monné wrote: >>> On Fri, Apr 14, 2023 at 01:30:39AM +0000, Volodymyr Babchuk wrote: >>>> Above I have proposed another view on this. I hope, it will work for >>>> you. Just to reiterate, idea is to allow "harmless" refcounts to be left >>>> after returning from pci_remove_device(). By "harmless" I mean that >>>> owners of those refcounts will not try to access the physical PCI >>>> device if pci_remove_device() is already finished. >>> >>> I'm not strictly a maintainer of this piece code, albeit I have an >>> opinion. I will like to also hear Jans opinion, since he is the >>> maintainer. >> >> I'm afraid I can't really appreciate the term "harmless refcounts". Whoever >> holds a ref is entitled to access the device. As stated before, I see only >> two ways of getting things consistent: Either pci_remove_device() is >> invoked upon dropping of the last ref, > > With this approach, what would be the implementation of > PHYSDEVOP_manage_pci_remove? Would it just check whether the pdev > exist and either return 0 or -EBUSY? If the device doesn't (physically) exist, it would return e.g. -ENODEV. If it still exists and the pdev also does, it would return e.g. -EBUSY, yes. Jan >> or it checks that it is dropping the >> last one. The former looks architecturally cleaner to me, but I can accept >> that moving there might be more of a change, so wouldn't object to going >> the latter route. 
> > One of my concerns is what is expected of PHYSDEVOP_manage_pci_remove, > I don't think it's expected for PHYSDEVOP_manage_pci_remove to return > 0 while there are users inside the hypervisor still holding a > reference to the pdev. > > Thanks, Roger. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-04-17 10:51 ` Roger Pau Monné 2023-04-17 11:02 ` Jan Beulich @ 2023-04-21 11:00 ` Volodymyr Babchuk 2023-04-21 12:24 ` Jan Beulich 2023-04-21 13:10 ` Roger Pau Monné 1 sibling, 2 replies; 50+ messages in thread From: Volodymyr Babchuk @ 2023-04-21 11:00 UTC (permalink / raw) To: Roger Pau Monné Cc: Jan Beulich, xen-devel, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian Hello Roger, Roger Pau Monné <roger.pau@citrix.com> writes: > On Mon, Apr 17, 2023 at 12:34:31PM +0200, Jan Beulich wrote: >> On 17.04.2023 12:17, Roger Pau Monné wrote: >> > On Fri, Apr 14, 2023 at 01:30:39AM +0000, Volodymyr Babchuk wrote: >> >> Above I have proposed another view on this. I hope, it will work for >> >> you. Just to reiterate, idea is to allow "harmless" refcounts to be left >> >> after returning from pci_remove_device(). By "harmless" I mean that >> >> owners of those refcounts will not try to access the physical PCI >> >> device if pci_remove_device() is already finished. >> > >> > I'm not strictly a maintainer of this piece code, albeit I have an >> > opinion. I will like to also hear Jans opinion, since he is the >> > maintainer. >> >> I'm afraid I can't really appreciate the term "harmless refcounts". Whoever >> holds a ref is entitled to access the device. As stated before, I see only >> two ways of getting things consistent: Either pci_remove_device() is >> invoked upon dropping of the last ref, > > With this approach, what would be the implementation of > PHYSDEVOP_manage_pci_remove? Would it just check whether the pdev > exist and either return 0 or -EBUSY? > Okay, I am preparing patches with the behavior you proposed. To test it, I artificially set the refcount to 2 and indeed PHYSDEVOP_manage_pci_remove returned -EBUSY, which propagated to the Linux driver. The problem is that the Linux driver can't do anything with this. 
It just displayed an error message and removed the device anyway. This is because Linux sends PHYSDEVOP_manage_pci_remove in the device_remove() call path and there is no way to prevent the device removal. So, the admin is not able to try this again. As a workaround, I can create a hypercall continuation in case pci_remove_device() returns -EBUSY. What is your opinion? -- WBR, Volodymyr ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-04-21 11:00 ` Volodymyr Babchuk @ 2023-04-21 12:24 ` Jan Beulich 2023-04-21 13:02 ` Volodymyr Babchuk 2023-04-21 13:10 ` Roger Pau Monné 1 sibling, 1 reply; 50+ messages in thread From: Jan Beulich @ 2023-04-21 12:24 UTC (permalink / raw) To: Volodymyr Babchuk Cc: xen-devel, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian, Roger Pau Monné On 21.04.2023 13:00, Volodymyr Babchuk wrote: > > Hello Roger, > > Roger Pau Monné <roger.pau@citrix.com> writes: > >> On Mon, Apr 17, 2023 at 12:34:31PM +0200, Jan Beulich wrote: >>> On 17.04.2023 12:17, Roger Pau Monné wrote: >>>> On Fri, Apr 14, 2023 at 01:30:39AM +0000, Volodymyr Babchuk wrote: >>>>> Above I have proposed another view on this. I hope, it will work for >>>>> you. Just to reiterate, idea is to allow "harmless" refcounts to be left >>>>> after returning from pci_remove_device(). By "harmless" I mean that >>>>> owners of those refcounts will not try to access the physical PCI >>>>> device if pci_remove_device() is already finished. >>>> >>>> I'm not strictly a maintainer of this piece code, albeit I have an >>>> opinion. I will like to also hear Jans opinion, since he is the >>>> maintainer. >>> >>> I'm afraid I can't really appreciate the term "harmless refcounts". Whoever >>> holds a ref is entitled to access the device. As stated before, I see only >>> two ways of getting things consistent: Either pci_remove_device() is >>> invoked upon dropping of the last ref, >> >> With this approach, what would be the implementation of >> PHYSDEVOP_manage_pci_remove? Would it just check whether the pdev >> exist and either return 0 or -EBUSY? >> > > Okay, I am preparing patches with the behavior you proposed. To test it, > I artificially set refcount to 2 and indeed PHYSDEVOP_manage_pci_remove > returned -EBUSY, which propagated to the linux driver. 
Problem is that > Linux driver can't do anything with this. It just displayed an error > message and removed device anyways. This is because Linux sends > PHYSDEVOP_manage_pci_remove in device_remove() call path and there is no > way to prevent the device removal. So, admin is not capable to try this > again. So maybe Linux's issuing of the call needs moving elsewhere? Or we need a new sub-op, such that PHYSDEVOP_manage_pci_remove can remain purely a last-moment notification? > As I workaround, I can create hypercall continuation in case if > pci_remove_device() returns -EBUSY. What is your opinion? How would that help? You'd then spin perhaps for hours or days ... Jan ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-04-21 12:24 ` Jan Beulich @ 2023-04-21 13:02 ` Volodymyr Babchuk 0 siblings, 0 replies; 50+ messages in thread From: Volodymyr Babchuk @ 2023-04-21 13:02 UTC (permalink / raw) To: Jan Beulich Cc: Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian, Roger Pau Monné, xen-devel Hi Jan, Jan Beulich <jbeulich@suse.com> writes: > On 21.04.2023 13:00, Volodymyr Babchuk wrote: >> >> Hello Roger, >> >> Roger Pau Monné <roger.pau@citrix.com> writes: >> >>> On Mon, Apr 17, 2023 at 12:34:31PM +0200, Jan Beulich wrote: >>>> On 17.04.2023 12:17, Roger Pau Monné wrote: >>>>> On Fri, Apr 14, 2023 at 01:30:39AM +0000, Volodymyr Babchuk wrote: >>>>>> Above I have proposed another view on this. I hope, it will work for >>>>>> you. Just to reiterate, idea is to allow "harmless" refcounts to be left >>>>>> after returning from pci_remove_device(). By "harmless" I mean that >>>>>> owners of those refcounts will not try to access the physical PCI >>>>>> device if pci_remove_device() is already finished. >>>>> >>>>> I'm not strictly a maintainer of this piece code, albeit I have an >>>>> opinion. I will like to also hear Jans opinion, since he is the >>>>> maintainer. >>>> >>>> I'm afraid I can't really appreciate the term "harmless refcounts". Whoever >>>> holds a ref is entitled to access the device. As stated before, I see only >>>> two ways of getting things consistent: Either pci_remove_device() is >>>> invoked upon dropping of the last ref, >>> >>> With this approach, what would be the implementation of >>> PHYSDEVOP_manage_pci_remove? Would it just check whether the pdev >>> exist and either return 0 or -EBUSY? >>> >> >> Okay, I am preparing patches with the behavior you proposed. To test it, >> I artificially set refcount to 2 and indeed PHYSDEVOP_manage_pci_remove >> returned -EBUSY, which propagated to the linux driver. 
Problem is that >> Linux driver can't do anything with this. It just displayed an error >> message and removed device anyways. This is because Linux sends >> PHYSDEVOP_manage_pci_remove in device_remove() call path and there is no >> way to prevent the device removal. So, admin is not capable to try this >> again. > > So maybe Linux'es issuing of the call needs moving elsewhere? Or we need > a new sub-op, such that PHYSDEVOP_manage_pci_remove can remain purely a > last-moment notification? From Linux's point of view, it has already cleaned up all the device resources and is ready to hot-unplug the device. The Xen PCI driver in Linux just gets a notification that the device is being removed. BTW, the xen_pciback (AKA pci_stub) driver in Linux tracks that a device is assigned to another domain, but all it can do is complain loudly in the kernel log if the device is being removed without being deassigned from that domain. > >> As I workaround, I can create hypercall continuation in case if >> pci_remove_device() returns -EBUSY. What is your opinion? > > How would that help? You'd then spin perhaps for hours or days ... Are you referring to the case where we increase the refcount when we assign a PCI device to a domain? In this case yes, it is quite possible that we will spin there for an arbitrary amount of time... -- WBR, Volodymyr ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-04-21 11:00 ` Volodymyr Babchuk 2023-04-21 12:24 ` Jan Beulich @ 2023-04-21 13:10 ` Roger Pau Monné 2023-04-21 14:13 ` Volodymyr Babchuk 1 sibling, 1 reply; 50+ messages in thread From: Roger Pau Monné @ 2023-04-21 13:10 UTC (permalink / raw) To: Volodymyr Babchuk Cc: Jan Beulich, xen-devel, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian On Fri, Apr 21, 2023 at 11:00:23AM +0000, Volodymyr Babchuk wrote: > > Hello Roger, > > Roger Pau Monné <roger.pau@citrix.com> writes: > > > On Mon, Apr 17, 2023 at 12:34:31PM +0200, Jan Beulich wrote: > >> On 17.04.2023 12:17, Roger Pau Monné wrote: > >> > On Fri, Apr 14, 2023 at 01:30:39AM +0000, Volodymyr Babchuk wrote: > >> >> Above I have proposed another view on this. I hope, it will work for > >> >> you. Just to reiterate, idea is to allow "harmless" refcounts to be left > >> >> after returning from pci_remove_device(). By "harmless" I mean that > >> >> owners of those refcounts will not try to access the physical PCI > >> >> device if pci_remove_device() is already finished. > >> > > >> > I'm not strictly a maintainer of this piece code, albeit I have an > >> > opinion. I will like to also hear Jans opinion, since he is the > >> > maintainer. > >> > >> I'm afraid I can't really appreciate the term "harmless refcounts". Whoever > >> holds a ref is entitled to access the device. As stated before, I see only > >> two ways of getting things consistent: Either pci_remove_device() is > >> invoked upon dropping of the last ref, > > > > With this approach, what would be the implementation of > > PHYSDEVOP_manage_pci_remove? Would it just check whether the pdev > > exist and either return 0 or -EBUSY? > > > > Okay, I am preparing patches with the behavior you proposed. To test it, > I artificially set refcount to 2 and indeed PHYSDEVOP_manage_pci_remove > returned -EBUSY, which propagated to the linux driver. 
Problem is that > Linux driver can't do anything with this. It just displayed an error > message and removed device anyways. This is because Linux sends > PHYSDEVOP_manage_pci_remove in device_remove() call path and there is no > way to prevent the device removal. So, admin is not capable to try this > again. Ideally Linux wouldn't remove the device, and then the admin would get to retry. Maybe the way the Linux hook is placed is not the best one? The hypervisor should be authoritative on whether a device can be removed or not, and hence PHYSDEVOP_manage_pci_remove returning an error (EBUSY or otherwise) shouldn't allow the device unplug in Linux to continue. We could add a PHYSDEVOP_manage_pci_test or similar that could be programmatically used to check whether a device has a matching pdev in the hypervisor, but I have no idea how that could be used by Linux so it's exposed to the user, and it seems to just make the interface more complicated for no real benefit, when the same could be accomplished by PHYSDEVOP_manage_pci_remove. Maybe the only feasible solution is for pci_remove_device() to drop a reference expecting it to be the last one and, if it's not, print a warning message and return -EBUSY, expecting any remaining references to be dropped later and the backing pdev to be freed. Thanks, Roger. ^ permalink raw reply [flat|nested] 50+ messages in thread
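[Editorial sketch, not part of the original message.] The fallback described in the last paragraph above differs from a plain "refuse unless last" check: removal always drops its own reference and unlinks the device, and only the freeing is deferred to the final put. A minimal C model follows, assuming hypothetical names (struct toy_pdev, toy_put(), toy_remove()) and boolean flags in place of real list handling and memory freeing.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_pdev {
    atomic_int refcnt;
    bool unlinked;   /* no longer reachable by lookups */
    bool freed;      /* stands in for releasing the backing memory */
};

/* Any holder dropping the final reference frees the device. */
static void toy_put(struct toy_pdev *p)
{
    if ( atomic_fetch_sub(&p->refcnt, 1) == 1 )
        p->freed = true;
}

/*
 * Removal drops its reference expecting it to be the last one. If other
 * holders remain, it warns and reports -EBUSY; the device is then freed
 * by whichever toy_put() call drops the final reference.
 */
static int toy_remove(struct toy_pdev *p)
{
    int before;

    p->unlinked = true;
    before = atomic_fetch_sub(&p->refcnt, 1);
    if ( before == 1 )
    {
        p->freed = true;
        return 0;
    }

    fprintf(stderr, "toy_remove: %d reference(s) still held\n", before - 1);
    return -EBUSY;
}
```

Note that in this model -EBUSY does not mean "nothing happened": the device is already unlinked when the error is returned, which is why the caller cannot meaningfully retry and why the thread questions what the hypercall's return value should mean.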
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-04-21 13:10 ` Roger Pau Monné @ 2023-04-21 14:13 ` Volodymyr Babchuk 2023-04-24 7:46 ` Jan Beulich 0 siblings, 1 reply; 50+ messages in thread From: Volodymyr Babchuk @ 2023-04-21 14:13 UTC (permalink / raw) To: Roger Pau Monné Cc: Jan Beulich, xen-devel, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian Hi Roger, Roger Pau Monné <roger.pau@citrix.com> writes: > On Fri, Apr 21, 2023 at 11:00:23AM +0000, Volodymyr Babchuk wrote: >> >> Hello Roger, >> >> Roger Pau Monné <roger.pau@citrix.com> writes: >> >> > On Mon, Apr 17, 2023 at 12:34:31PM +0200, Jan Beulich wrote: >> >> On 17.04.2023 12:17, Roger Pau Monné wrote: >> >> > On Fri, Apr 14, 2023 at 01:30:39AM +0000, Volodymyr Babchuk wrote: >> >> >> Above I have proposed another view on this. I hope, it will work for >> >> >> you. Just to reiterate, idea is to allow "harmless" refcounts to be left >> >> >> after returning from pci_remove_device(). By "harmless" I mean that >> >> >> owners of those refcounts will not try to access the physical PCI >> >> >> device if pci_remove_device() is already finished. >> >> > >> >> > I'm not strictly a maintainer of this piece code, albeit I have an >> >> > opinion. I will like to also hear Jans opinion, since he is the >> >> > maintainer. >> >> >> >> I'm afraid I can't really appreciate the term "harmless refcounts". Whoever >> >> holds a ref is entitled to access the device. As stated before, I see only >> >> two ways of getting things consistent: Either pci_remove_device() is >> >> invoked upon dropping of the last ref, >> > >> > With this approach, what would be the implementation of >> > PHYSDEVOP_manage_pci_remove? Would it just check whether the pdev >> > exist and either return 0 or -EBUSY? >> > >> >> Okay, I am preparing patches with the behavior you proposed. 
To test it, >> I artificially set refcount to 2 and indeed PHYSDEVOP_manage_pci_remove >> returned -EBUSY, which propagated to the linux driver. Problem is that >> Linux driver can't do anything with this. It just displayed an error >> message and removed device anyways. This is because Linux sends >> PHYSDEVOP_manage_pci_remove in device_remove() call path and there is no >> way to prevent the device removal. So, admin is not capable to try this >> again. > > Ideally Linux won't remove the device, and then the admin would get to > retry. Maybe the way the Linux hook is placed is not the best one? > The hypervisor should be authoritative on whether a device can be > removed or not, and hence PHYSDEVOP_manage_pci_remove returning an > error (EBUSY or otherwise) shouldn't allow the device unplug in Linux > to continue. Yes, that would be ideal, but the Linux driver/device model is written in such a way that the PCI subsystem tracks all the PCI device usage, so it can be certain that it can remove the device. Thus, functions in the device removal path either return void or 0. Of course, the kernel does not know that the hypervisor has additional uses for the device, so there is no mechanism to prevent removal. > We could add a PHYSDEVOP_manage_pci_test or similar that could be > programmatically used to check whether a device has a matching pdev in > the hypervisor, but I have no idea how that could be used by Linux so > it's exposed to the user, and it seems to just make the interface more > complicated for noo real benefit, when the same could be accomplished > by PHYSDEVOP_manage_pci_remove. We can ignore the kernel behavior and just call PHYSDEVOP_manage_pci_remove from the toolstack. Something like "xl pci-hotunplug SBDF". But yes, this will make the interface more complicated. > Maybe the only feasible solution is for pci_remove_device() to drop a > reference expecting it would be the last one, and print a warning > message if it's not and return -EBUSY. 
Expecting any remaining > references to be dropped and the backing pdev to be freed. So, basically in the same way as I proposed initially? -- WBR, Volodymyr ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev 2023-04-21 14:13 ` Volodymyr Babchuk @ 2023-04-24 7:46 ` Jan Beulich 2023-04-24 14:15 ` Volodymyr Babchuk 0 siblings, 1 reply; 50+ messages in thread From: Jan Beulich @ 2023-04-24 7:46 UTC (permalink / raw) To: Volodymyr Babchuk Cc: xen-devel, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini, Paul Durrant, Kevin Tian, Roger Pau Monné On 21.04.2023 16:13, Volodymyr Babchuk wrote: > > Hi Roger, > > Roger Pau Monné <roger.pau@citrix.com> writes: > >> On Fri, Apr 21, 2023 at 11:00:23AM +0000, Volodymyr Babchuk wrote: >>> >>> Hello Roger, >>> >>> Roger Pau Monné <roger.pau@citrix.com> writes: >>> >>>> On Mon, Apr 17, 2023 at 12:34:31PM +0200, Jan Beulich wrote: >>>>> On 17.04.2023 12:17, Roger Pau Monné wrote: >>>>>> On Fri, Apr 14, 2023 at 01:30:39AM +0000, Volodymyr Babchuk wrote: >>>>>>> Above I have proposed another view on this. I hope, it will work for >>>>>>> you. Just to reiterate, idea is to allow "harmless" refcounts to be left >>>>>>> after returning from pci_remove_device(). By "harmless" I mean that >>>>>>> owners of those refcounts will not try to access the physical PCI >>>>>>> device if pci_remove_device() is already finished. >>>>>> >>>>>> I'm not strictly a maintainer of this piece code, albeit I have an >>>>>> opinion. I will like to also hear Jans opinion, since he is the >>>>>> maintainer. >>>>> >>>>> I'm afraid I can't really appreciate the term "harmless refcounts". Whoever >>>>> holds a ref is entitled to access the device. As stated before, I see only >>>>> two ways of getting things consistent: Either pci_remove_device() is >>>>> invoked upon dropping of the last ref, >>>> >>>> With this approach, what would be the implementation of >>>> PHYSDEVOP_manage_pci_remove? Would it just check whether the pdev >>>> exist and either return 0 or -EBUSY? >>>> >>> >>> Okay, I am preparing patches with the behavior you proposed. 
To test it, >>> I artificially set refcount to 2 and indeed PHYSDEVOP_manage_pci_remove >>> returned -EBUSY, which propagated to the linux driver. Problem is that >>> Linux driver can't do anything with this. It just displayed an error >>> message and removed device anyways. This is because Linux sends >>> PHYSDEVOP_manage_pci_remove in device_remove() call path and there is no >>> way to prevent the device removal. So, admin is not capable to try this >>> again. >> >> Ideally Linux won't remove the device, and then the admin would get to >> retry. Maybe the way the Linux hook is placed is not the best one? >> The hypervisor should be authoritative on whether a device can be >> removed or not, and hence PHYSDEVOP_manage_pci_remove returning an >> error (EBUSY or otherwise) shouldn't allow the device unplug in Linux >> to continue. > > Yes, it would be ideally, but Linux driver/device model is written in a > such way, that PCI subsystem tracks all the PCI device usage, so it can > be certain that it can remove the device. Thus, functions in the device > removal path either return void or 0. Of course, kernel does not know that > hypervisor has additional uses for the device, so there is no mechanisms > to prevent removal. Could pciback obtain a reference on behalf of the hypervisor, dropping it when device removal is requested (i.e. much closer to the start of that operation), and only if it finds that no guests use the device anymore? Jan ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev
  2023-04-24 7:46 ` Jan Beulich
@ 2023-04-24 14:15 ` Volodymyr Babchuk
  2023-04-24 14:27 ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Volodymyr Babchuk @ 2023-04-24 14:15 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall,
     Stefano Stabellini, Paul Durrant, Kevin Tian, Roger Pau Monné

Hi Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 21.04.2023 16:13, Volodymyr Babchuk wrote:
>>
>> Hi Roger,
>>
>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>
>>> On Fri, Apr 21, 2023 at 11:00:23AM +0000, Volodymyr Babchuk wrote:
>>>>
>>>> Hello Roger,
>>>>
>>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>>
>>>>> On Mon, Apr 17, 2023 at 12:34:31PM +0200, Jan Beulich wrote:
>>>>>> On 17.04.2023 12:17, Roger Pau Monné wrote:
>>>>>>> On Fri, Apr 14, 2023 at 01:30:39AM +0000, Volodymyr Babchuk wrote:
>>>>>>>> Above I have proposed another view on this. I hope, it will work for
>>>>>>>> you. Just to reiterate, idea is to allow "harmless" refcounts to be left
>>>>>>>> after returning from pci_remove_device(). By "harmless" I mean that
>>>>>>>> owners of those refcounts will not try to access the physical PCI
>>>>>>>> device if pci_remove_device() is already finished.
>>>>>>>
>>>>>>> I'm not strictly a maintainer of this piece code, albeit I have an
>>>>>>> opinion. I will like to also hear Jans opinion, since he is the
>>>>>>> maintainer.
>>>>>>
>>>>>> I'm afraid I can't really appreciate the term "harmless refcounts". Whoever
>>>>>> holds a ref is entitled to access the device. As stated before, I see only
>>>>>> two ways of getting things consistent: Either pci_remove_device() is
>>>>>> invoked upon dropping of the last ref,
>>>>>
>>>>> With this approach, what would be the implementation of
>>>>> PHYSDEVOP_manage_pci_remove? Would it just check whether the pdev
>>>>> exist and either return 0 or -EBUSY?
>>>>>
>>>>
>>>> Okay, I am preparing patches with the behavior you proposed. To test it,
>>>> I artificially set refcount to 2 and indeed PHYSDEVOP_manage_pci_remove
>>>> returned -EBUSY, which propagated to the linux driver. Problem is that
>>>> Linux driver can't do anything with this. It just displayed an error
>>>> message and removed device anyways. This is because Linux sends
>>>> PHYSDEVOP_manage_pci_remove in device_remove() call path and there is no
>>>> way to prevent the device removal. So, admin is not capable to try this
>>>> again.
>>>
>>> Ideally Linux won't remove the device, and then the admin would get to
>>> retry. Maybe the way the Linux hook is placed is not the best one?
>>> The hypervisor should be authoritative on whether a device can be
>>> removed or not, and hence PHYSDEVOP_manage_pci_remove returning an
>>> error (EBUSY or otherwise) shouldn't allow the device unplug in Linux
>>> to continue.
>>
>> Yes, it would be ideally, but Linux driver/device model is written in a
>> such way, that PCI subsystem tracks all the PCI device usage, so it can
>> be certain that it can remove the device. Thus, functions in the device
>> removal path either return void or 0. Of course, kernel does not know that
>> hypervisor has additional uses for the device, so there is no mechanisms
>> to prevent removal.
>
> Could pciback obtain a reference on behalf of the hypervisor, dropping it
> when device removal is requested (i.e. much closer to the start of that
> operation), and only if it finds that no guests use the device anymore?

Yes, it can, it this indeed will hold a reference to a pci device for a
time, but there are some consideration that made this approach not
feasible.

Basically, when an user writes to /sys/bus/pci/SBDF/remove, the
following happens:

1. /sys/bus/pci/SBFD/remove entry is removed - we can't retry the
operation anymore

[unimportant things]

N. pci_stop_dev() function is called. This function unloads a device
driver. Any good behaving driver should drop all additional references
to a device at this point.

[more unimportant things]

M. PCI subsystem drops own reference to a generic device object

So, as you can see, admin can't restart a "failed" attempt to remove a
PCI device.

On other hand, remove() function can sleep. This allows us to pause
removal process a bit and check if hypervisor had finished removing a
PCI device on its side. But, as you pointed out, this can take weeks...

-- 
WBR, Volodymyr

^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev
  2023-04-24 14:15 ` Volodymyr Babchuk
@ 2023-04-24 14:27 ` Jan Beulich
  0 siblings, 0 replies; 50+ messages in thread
From: Jan Beulich @ 2023-04-24 14:27 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Andrew Cooper, Wei Liu, George Dunlap, Julien Grall,
     Stefano Stabellini, Paul Durrant, Kevin Tian, Roger Pau Monné

On 24.04.2023 16:15, Volodymyr Babchuk wrote:
>
> Hi Jan,
>
> Jan Beulich <jbeulich@suse.com> writes:
>
>> On 21.04.2023 16:13, Volodymyr Babchuk wrote:
>>>
>>> Hi Roger,
>>>
>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>
>>>> On Fri, Apr 21, 2023 at 11:00:23AM +0000, Volodymyr Babchuk wrote:
>>>>>
>>>>> Hello Roger,
>>>>>
>>>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>>>
>>>>>> On Mon, Apr 17, 2023 at 12:34:31PM +0200, Jan Beulich wrote:
>>>>>>> On 17.04.2023 12:17, Roger Pau Monné wrote:
>>>>>>>> On Fri, Apr 14, 2023 at 01:30:39AM +0000, Volodymyr Babchuk wrote:
>>>>>>>>> Above I have proposed another view on this. I hope, it will work for
>>>>>>>>> you. Just to reiterate, idea is to allow "harmless" refcounts to be left
>>>>>>>>> after returning from pci_remove_device(). By "harmless" I mean that
>>>>>>>>> owners of those refcounts will not try to access the physical PCI
>>>>>>>>> device if pci_remove_device() is already finished.
>>>>>>>>
>>>>>>>> I'm not strictly a maintainer of this piece code, albeit I have an
>>>>>>>> opinion. I will like to also hear Jans opinion, since he is the
>>>>>>>> maintainer.
>>>>>>>
>>>>>>> I'm afraid I can't really appreciate the term "harmless refcounts". Whoever
>>>>>>> holds a ref is entitled to access the device. As stated before, I see only
>>>>>>> two ways of getting things consistent: Either pci_remove_device() is
>>>>>>> invoked upon dropping of the last ref,
>>>>>>
>>>>>> With this approach, what would be the implementation of
>>>>>> PHYSDEVOP_manage_pci_remove? Would it just check whether the pdev
>>>>>> exist and either return 0 or -EBUSY?
>>>>>>
>>>>>
>>>>> Okay, I am preparing patches with the behavior you proposed. To test it,
>>>>> I artificially set refcount to 2 and indeed PHYSDEVOP_manage_pci_remove
>>>>> returned -EBUSY, which propagated to the linux driver. Problem is that
>>>>> Linux driver can't do anything with this. It just displayed an error
>>>>> message and removed device anyways. This is because Linux sends
>>>>> PHYSDEVOP_manage_pci_remove in device_remove() call path and there is no
>>>>> way to prevent the device removal. So, admin is not capable to try this
>>>>> again.
>>>>
>>>> Ideally Linux won't remove the device, and then the admin would get to
>>>> retry. Maybe the way the Linux hook is placed is not the best one?
>>>> The hypervisor should be authoritative on whether a device can be
>>>> removed or not, and hence PHYSDEVOP_manage_pci_remove returning an
>>>> error (EBUSY or otherwise) shouldn't allow the device unplug in Linux
>>>> to continue.
>>>
>>> Yes, it would be ideally, but Linux driver/device model is written in a
>>> such way, that PCI subsystem tracks all the PCI device usage, so it can
>>> be certain that it can remove the device. Thus, functions in the device
>>> removal path either return void or 0. Of course, kernel does not know that
>>> hypervisor has additional uses for the device, so there is no mechanisms
>>> to prevent removal.
>>
>> Could pciback obtain a reference on behalf of the hypervisor, dropping it
>> when device removal is requested (i.e. much closer to the start of that
>> operation), and only if it finds that no guests use the device anymore?
>
> Yes, it can, it this indeed will hold a reference to a pci device for a
> time, but there are some consideration that made this approach not
> feasible.
>
> Basically, when an user writes to /sys/bus/pci/SBDF/remove, the
> following happens:
>
> 1. /sys/bus/pci/SBFD/remove entry is removed - we can't retry the
> operation anymore

Looking at the comment ahead of pci_stop_and_remove_bus_device(), isn't
this too late already. The text there says "has been removed", not e.g.
"is about to be removed". (Of course chances are that it is the comment
which is wrong; I know too little about Linux'es hot-unplug machinery.)

Jan

> [unimportant things]
>
> N. pci_stop_dev() function is called. This function unloads a device
> driver. Any good behaving driver should drop all additional references
> to a device at this point.
>
> [more unimportant things]
>
> M. PCI subsystem drops own reference to a generic device object
>
> So, as you can see, admin can't restart a "failed" attempt to remove a
> PCI device.
>
> On other hand, remove() function can sleep. This allows us to pause
> removal process a bit and check if hypervisor had finished removing a
> PCI device on its side. But, as you pointed out, this can take weeks...
>

^ permalink raw reply [flat|nested] 50+ messages in thread
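The -EBUSY behaviour tested in this subthread (refcount artificially raised to 2, removal refused) can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions, not the Xen implementation: `struct pdev_model`, `model_get()`, `model_put()` and `model_remove()` are hypothetical names, the counter is a plain int rather than an atomic refcnt, and the rule "a count of 1 means only the base reference is left" is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified model of a pdev; not Xen's struct pci_dev. */
struct pdev_model {
    int refcnt;    /* assumption: 1 == only the base reference remains */
    bool present;  /* still known to the (modelled) hypervisor */
};

static void model_get(struct pdev_model *p)
{
    p->refcnt++;
}

static void model_put(struct pdev_model *p)
{
    assert(p->refcnt > 0);
    p->refcnt--;
}

/*
 * Assumed semantics of the removal request: refuse with -EBUSY while
 * anybody besides the base reference holder still has a reference.
 */
static int model_remove(struct pdev_model *p)
{
    if ( !p->present )
        return -ENODEV;

    if ( p->refcnt > 1 )
        return -EBUSY;

    p->present = false;
    model_put(p); /* drop the base reference */

    return 0;
}
```

In this model a second removal attempt succeeds once the extra reference is gone. As the messages above point out, a Linux Dom0 currently has no way to make that second attempt, because the request is issued from a removal path that cannot fail.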
* Re: [PATCH v3 2/6] xen: pci: introduce reference counting for pdev
  2023-03-14 20:56 ` [PATCH v3 2/6] xen: pci: introduce reference counting for pdev Volodymyr Babchuk
  2023-03-16 16:16 ` Roger Pau Monné
@ 2023-03-29 10:04 ` Jan Beulich
  1 sibling, 0 replies; 50+ messages in thread
From: Jan Beulich @ 2023-03-29 10:04 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, George Dunlap, Julien Grall,
     Stefano Stabellini, Paul Durrant, Kevin Tian, xen-devel

On 14.03.2023 21:56, Volodymyr Babchuk wrote:
> @@ -422,33 +423,6 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>      return pdev;
>  }
>
> -static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
> -{
> -    /* update bus2bridge */
> -    switch ( pdev->type )
> -    {
> -        unsigned int sec_bus, sub_bus;
> -
> -        case DEV_TYPE_PCIe2PCI_BRIDGE:
> -        case DEV_TYPE_LEGACY_PCI_BRIDGE:
> -            sec_bus = pci_conf_read8(pdev->sbdf, PCI_SECONDARY_BUS);
> -            sub_bus = pci_conf_read8(pdev->sbdf, PCI_SUBORDINATE_BUS);
> -
> -            spin_lock(&pseg->bus2bridge_lock);
> -            for ( ; sec_bus <= sub_bus; sec_bus++ )
> -                pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus];
> -            spin_unlock(&pseg->bus2bridge_lock);
> -            break;
> -
> -        default:
> -            break;
> -    }
> -
> -    list_del(&pdev->alldevs_list);
> -    pdev_msi_deinit(pdev);
> -    xfree(pdev);
> -}

No matter what cleanup model we choose in the end, I think it would be
helpful if this function wasn't effectively moved to the end of the file,
but adjusted in place. Then it'll be much easier to see what actually is
moved out of here. pcidev_{get,put}() could be added right after this
function. They're important enough anyway to warrant them not living at
the bottom of the file.

Jan

^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v3 3/6] vpci: crash domain if we wasn't able to (un) map vPCI regions
  2023-03-14 20:56 [PATCH v3 0/6] vpci: first series in preparation for vpci on ARM Volodymyr Babchuk
  2023-03-14 20:56 ` [PATCH v3 1/6] xen: add reference counter support Volodymyr Babchuk
  2023-03-14 20:56 ` [PATCH v3 2/6] xen: pci: introduce reference counting for pdev Volodymyr Babchuk
@ 2023-03-14 20:56 ` Volodymyr Babchuk
  2023-03-16 16:32 ` Roger Pau Monné
  2023-03-14 20:56 ` [PATCH v3 6/6] xen: pci: print reference counter when dumping pci_devs Volodymyr Babchuk
  ` (2 subsequent siblings)
  5 siblings, 1 reply; 50+ messages in thread
From: Volodymyr Babchuk @ 2023-03-14 20:56 UTC (permalink / raw)
  To: xen-devel; +Cc: Volodymyr Babchuk, Roger Pau Monné

In that unlikely case, when map_range() fails to do it's job,
domain memory mapping will be left in inconsistent state. As there is
no easy way to remove stale p2m mapping we need to crash domain, as
FIXME suggests.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

---

v3:
 - new patch
---
 xen/drivers/vpci/header.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index ec2e978a4e..8319fe4c1d 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -162,14 +162,11 @@ bool vpci_process_pending(struct vcpu *v)
         rangeset_destroy(v->vpci.mem);
         v->vpci.mem = NULL;
         if ( rc )
-            /*
-             * FIXME: in case of failure remove the device from the domain.
-             * Note that there might still be leftover mappings. While this is
-             * safe for Dom0, for DomUs the domain will likely need to be
-             * killed in order to avoid leaking stale p2m mappings on
-             * failure.
-             */
+        {
             vpci_remove_device(v->vpci.pdev);
+            if ( !is_hardware_domain(v->domain) )
+                domain_crash(v->domain);
+        }
     }

     return false;
-- 
2.39.2

^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH v3 3/6] vpci: crash domain if we wasn't able to (un) map vPCI regions
  2023-03-14 20:56 ` [PATCH v3 3/6] vpci: crash domain if we wasn't able to (un) map vPCI regions Volodymyr Babchuk
@ 2023-03-16 16:32 ` Roger Pau Monné
  0 siblings, 0 replies; 50+ messages in thread
From: Roger Pau Monné @ 2023-03-16 16:32 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel

On Tue, Mar 14, 2023 at 08:56:30PM +0000, Volodymyr Babchuk wrote:
> In that unlikely case, when map_range() fails to do it's job,
> domain memory mapping will be left in inconsistent state. As there is
> no easy way to remove stale p2m mapping we need to crash domain, as
> FIXME suggests.
>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>
> ---
>
> v3:
>  - new patch
> ---
>  xen/drivers/vpci/header.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index ec2e978a4e..8319fe4c1d 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -162,14 +162,11 @@ bool vpci_process_pending(struct vcpu *v)
>          rangeset_destroy(v->vpci.mem);
>          v->vpci.mem = NULL;
>          if ( rc )
> -            /*
> -             * FIXME: in case of failure remove the device from the domain.
> -             * Note that there might still be leftover mappings. While this is
> -             * safe for Dom0, for DomUs the domain will likely need to be
> -             * killed in order to avoid leaking stale p2m mappings on
> -             * failure.
> -             */
> +        {
>              vpci_remove_device(v->vpci.pdev);
> +            if ( !is_hardware_domain(v->domain) )
> +                domain_crash(v->domain);

No need to remove the device if you are crashing the domain, so the
vpci_remove_device() call can be placed in the else branch of the
conditional.

Thanks, Roger.

^ permalink raw reply [flat|nested] 50+ messages in thread
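The control flow suggested in the review above can be sketched as follows. This is a minimal, self-contained illustration and not the actual Xen code: `struct domain` and `struct pci_dev` are reduced to hypothetical stand-ins, the three helpers only record that they were called, and `handle_map_failure()` is an invented name for the error branch of vpci_process_pending().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for Xen's struct domain and struct pci_dev. */
struct domain {
    bool is_hwdom; /* true for the hardware domain (Dom0) */
    bool crashed;
};

struct pci_dev {
    bool vpci_present;
};

/* Mock helpers: they only record the effect the real functions would have. */
static bool is_hardware_domain(const struct domain *d)
{
    return d->is_hwdom;
}

static void domain_crash(struct domain *d)
{
    d->crashed = true;
}

static void vpci_remove_device(struct pci_dev *pdev)
{
    pdev->vpci_present = false;
}

/*
 * Error branch as suggested in the review: a guest is crashed, and only
 * for the hardware domain is the device removed from vPCI instead.
 */
static void handle_map_failure(struct domain *d, struct pci_dev *pdev, int rc)
{
    if ( !rc )
        return;

    if ( !is_hardware_domain(d) )
        domain_crash(d);
    else
        vpci_remove_device(pdev);
}
```

With this shape the two outcomes are mutually exclusive: a DomU is crashed without touching the vPCI state, while Dom0 keeps running with the device removed.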
* [PATCH v3 6/6] xen: pci: print reference counter when dumping pci_devs
  2023-03-14 20:56 [PATCH v3 0/6] vpci: first series in preparation for vpci on ARM Volodymyr Babchuk
  ` (2 preceding siblings ...)
  2023-03-14 20:56 ` [PATCH v3 3/6] vpci: crash domain if we wasn't able to (un) map vPCI regions Volodymyr Babchuk
@ 2023-03-14 20:56 ` Volodymyr Babchuk
  2023-03-17 8:46 ` Roger Pau Monné
  2023-03-14 20:56 ` [PATCH v3 5/6] vpci: use reference counter to protect vpci state Volodymyr Babchuk
  2023-03-14 20:56 ` [PATCH v3 4/6] vpci: restrict unhandled read/write operations for guests Volodymyr Babchuk
  5 siblings, 1 reply; 50+ messages in thread
From: Volodymyr Babchuk @ 2023-03-14 20:56 UTC (permalink / raw)
  To: xen-devel
  Cc: Volodymyr Babchuk, Jan Beulich, Paul Durrant, Roger Pau Monné

This can be handy during new reference counter approach evaluation.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

---

v3:
 - Moved from another patch series
---
 xen/drivers/passthrough/pci.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index b32382aca0..1eb79e7d01 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1275,7 +1275,8 @@ static int cf_check _dump_pci_devices(struct pci_seg *pseg, void *arg)
         else
 #endif
             printk("%pd", pdev->domain);
-        printk(" - node %-3d", (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
+        printk(" - node %-3d refcnt %d", (pdev->node != NUMA_NO_NODE) ? pdev->node : -1,
+               refcnt_read(&pdev->refcnt));
         pdev_dump_msi(pdev);
         printk("\n");
     }
-- 
2.39.2

^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH v3 6/6] xen: pci: print reference counter when dumping pci_devs
  2023-03-14 20:56 ` [PATCH v3 6/6] xen: pci: print reference counter when dumping pci_devs Volodymyr Babchuk
@ 2023-03-17 8:46 ` Roger Pau Monné
  0 siblings, 0 replies; 50+ messages in thread
From: Roger Pau Monné @ 2023-03-17 8:46 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Jan Beulich, Paul Durrant

On Tue, Mar 14, 2023 at 08:56:30PM +0000, Volodymyr Babchuk wrote:
> This can be handy during new reference counter approach evaluation.
>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>
> ---
>
> v3:
>  - Moved from another patch series
> ---
>  xen/drivers/passthrough/pci.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index b32382aca0..1eb79e7d01 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1275,7 +1275,8 @@ static int cf_check _dump_pci_devices(struct pci_seg *pseg, void *arg)
>          else
>  #endif
>              printk("%pd", pdev->domain);
> -        printk(" - node %-3d", (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
> +        printk(" - node %-3d refcnt %d", (pdev->node != NUMA_NO_NODE) ? pdev->node : -1,

This line is now too long (> 80 chars), you need to add a newline
between the format and the argument list.

The rest LGTM.

Thanks, Roger.

^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v3 5/6] vpci: use reference counter to protect vpci state
  2023-03-14 20:56 [PATCH v3 0/6] vpci: first series in preparation for vpci on ARM Volodymyr Babchuk
  ` (3 preceding siblings ...)
  2023-03-14 20:56 ` [PATCH v3 6/6] xen: pci: print reference counter when dumping pci_devs Volodymyr Babchuk
@ 2023-03-14 20:56 ` Volodymyr Babchuk
  2023-03-17 8:43 ` Roger Pau Monné
  2023-03-14 20:56 ` [PATCH v3 4/6] vpci: restrict unhandled read/write operations for guests Volodymyr Babchuk
  5 siblings, 1 reply; 50+ messages in thread
From: Volodymyr Babchuk @ 2023-03-14 20:56 UTC (permalink / raw)
  To: xen-devel; +Cc: Volodymyr Babchuk, Roger Pau Monné, Jan Beulich

vPCI MMIO handlers are accessing pdevs without protecting this
access with pcidevs_{lock|unlock}. This is not a problem as of now
as these are only used by Dom0. But, towards vPCI is used also for
guests, we need to properly protect pdev and pdev->vpci from being
removed while still in use.

For that use pdev reference counting.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>

---

v3:
 - Moved from another patch series
---
 xen/drivers/vpci/vpci.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 199ff55672..005f38dc77 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -62,6 +62,7 @@ void vpci_remove_device(struct pci_dev *pdev)
     xfree(pdev->vpci->msi);
     xfree(pdev->vpci);
     pdev->vpci = NULL;
+    pcidev_put(pdev);
 }

 int vpci_add_handlers(struct pci_dev *pdev)
@@ -72,6 +73,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
     if ( !has_vpci(pdev->domain) )
         return 0;

+    pcidev_get(pdev);
+
     /* We should not get here twice for the same device. */
     ASSERT(!pdev->vpci);

-- 
2.39.2

^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH v3 5/6] vpci: use reference counter to protect vpci state
  2023-03-14 20:56 ` [PATCH v3 5/6] vpci: use reference counter to protect vpci state Volodymyr Babchuk
@ 2023-03-17 8:43 ` Roger Pau Monné
  2023-03-29 9:31 ` Jan Beulich
  0 siblings, 1 reply; 50+ messages in thread
From: Roger Pau Monné @ 2023-03-17 8:43 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Jan Beulich

On Tue, Mar 14, 2023 at 08:56:30PM +0000, Volodymyr Babchuk wrote:
> vPCI MMIO handlers are accessing pdevs without protecting this
> access with pcidevs_{lock|unlock}. This is not a problem as of now
> as these are only used by Dom0. But, towards vPCI is used also for
> guests, we need to properly protect pdev and pdev->vpci from being
> removed while still in use.
>
> For that use pdev reference counting.

I wonder whether vPCI does need to take another reference to the
device. This all stems from me not having it fully clear how the
reference counting is supposed to be used for pdevs.

As mentioned in a previous patch, I would expect device assignation to
take a reference, and hence vPCI won't need to take an extra refcount
since vPCI can only be used once the device has been assigned to a
domain, and hence already has at least an extra reference taken from
the fact it's assigned to a domain.

If anything I would add an ASSERT(pdev->refcount > 1) or equivalent.

>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
>
> ---
>
> v3:
>  - Moved from another patch series
> ---
>  xen/drivers/vpci/vpci.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 199ff55672..005f38dc77 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -62,6 +62,7 @@ void vpci_remove_device(struct pci_dev *pdev)
>      xfree(pdev->vpci->msi);
>      xfree(pdev->vpci);
>      pdev->vpci = NULL;
> +    pcidev_put(pdev);
>  }
>
>  int vpci_add_handlers(struct pci_dev *pdev)
> @@ -72,6 +73,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
>      if ( !has_vpci(pdev->domain) )
>          return 0;
>
> +    pcidev_get(pdev);
> +
>      /* We should not get here twice for the same device. */
>      ASSERT(!pdev->vpci);

You are missing a pcidev_put() in case allocation of pdev->vpci fails.

Thanks, Roger.

^ permalink raw reply [flat|nested] 50+ messages in thread
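The missing pcidev_put() pointed out above is a matter of keeping get and put balanced on every exit path. The stand-alone sketch below shows the pattern under simplifying assumptions: the structures are hypothetical reductions of Xen's, the counter is a plain int rather than the refcnt type added by this series, `calloc()`/`free()` stand in for Xen's allocator, and the `alloc_fails` parameter exists only to simulate an allocation failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, reduced versions of the structures involved. */
struct vpci {
    int dummy;
};

struct pci_dev {
    int refcnt;        /* plain int here; the series uses a dedicated type */
    struct vpci *vpci;
};

static void pcidev_get(struct pci_dev *pdev)
{
    pdev->refcnt++;
}

static void pcidev_put(struct pci_dev *pdev)
{
    assert(pdev->refcnt > 0);
    pdev->refcnt--;
}

/* alloc_fails is an illustration-only knob simulating allocation failure. */
static int add_handlers(struct pci_dev *pdev, int alloc_fails)
{
    pcidev_get(pdev);

    pdev->vpci = alloc_fails ? NULL : calloc(1, sizeof(*pdev->vpci));
    if ( !pdev->vpci )
    {
        /* Undo the reference taken above, otherwise it leaks. */
        pcidev_put(pdev);
        return -ENOMEM;
    }

    return 0;
}

static void remove_device(struct pci_dev *pdev)
{
    if ( !pdev->vpci )
        return;

    free(pdev->vpci);
    pdev->vpci = NULL;
    /* Pairs with the pcidev_get() in add_handlers(). */
    pcidev_put(pdev);
}
```

Whether vPCI should take its own reference at all is the separate question raised in this message; the sketch only shows how the failure path would need to look if it does.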
* Re: [PATCH v3 5/6] vpci: use reference counter to protect vpci state
  2023-03-17 8:43 ` Roger Pau Monné
@ 2023-03-29 9:31 ` Jan Beulich
  0 siblings, 0 replies; 50+ messages in thread
From: Jan Beulich @ 2023-03-29 9:31 UTC (permalink / raw)
  To: Roger Pau Monné, Volodymyr Babchuk; +Cc: xen-devel

On 17.03.2023 09:43, Roger Pau Monné wrote:
> On Tue, Mar 14, 2023 at 08:56:30PM +0000, Volodymyr Babchuk wrote:
>> vPCI MMIO handlers are accessing pdevs without protecting this
>> access with pcidevs_{lock|unlock}. This is not a problem as of now
>> as these are only used by Dom0. But, towards vPCI is used also for
>> guests, we need to properly protect pdev and pdev->vpci from being
>> removed while still in use.
>>
>> For that use pdev reference counting.
>
> I wonder whether vPCI does need to take another reference to the
> device. This all stems from me not having it fully clear how the
> reference counting is supposed to be used for pdevs.
>
> As mentioned in a previous patch, I would expect device assignation to
> take a reference, and hence vPCI won't need to take an extra refcount
> since vPCI can only be used once the device has been assigned to a
> domain, and hence already has at least an extra reference taken from
> the fact it's assigned to a domain.
>
> If anything I would add an ASSERT(pdev->refcount > 1) or equivalent.

FWIW: +1

Jan

^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v3 4/6] vpci: restrict unhandled read/write operations for guests
  2023-03-14 20:56 [PATCH v3 0/6] vpci: first series in preparation for vpci on ARM Volodymyr Babchuk
  ` (4 preceding siblings ...)
  2023-03-14 20:56 ` [PATCH v3 5/6] vpci: use reference counter to protect vpci state Volodymyr Babchuk
@ 2023-03-14 20:56 ` Volodymyr Babchuk
  2023-03-17 8:37 ` Roger Pau Monné
  5 siblings, 1 reply; 50+ messages in thread
From: Volodymyr Babchuk @ 2023-03-14 20:56 UTC (permalink / raw)
  To: xen-devel; +Cc: Oleksandr Andrushchenko, Roger Pau Monné

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

A guest would be able to read and write those registers which are not
emulated and have no respective vPCI handlers, so it will be possible
for it to access the hardware directly.
In order to prevent a guest from reads and writes from/to the unhandled
registers make sure only hardware domain can access the hardware directly
and restrict guests from doing so.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---

v3:
 - No changes

Older comments from another series:

Since v6:
- do not use is_hwdom parameter for vpci_{read|write}_hw and use
  current->domain internally
- update commit message
New in v6
Moved into another series
---
 xen/drivers/vpci/vpci.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 5232f9605b..199ff55672 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -220,6 +220,10 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
 {
     uint32_t data;

+    /* Guest domains are not allowed to read real hardware. */
+    if ( !is_hardware_domain(current->domain) )
+        return ~(uint32_t)0;
+
     switch ( size )
     {
     case 4:
@@ -260,9 +264,13 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
     return data;
 }

-static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
-                          uint32_t data)
+static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg,
+                          unsigned int size, uint32_t data)
 {
+    /* Guest domains are not allowed to write real hardware. */
+    if ( !is_hardware_domain(current->domain) )
+        return;
+
     switch ( size )
     {
     case 4:
-- 
2.39.2

^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH v3 4/6] vpci: restrict unhandled read/write operations for guests
  2023-03-14 20:56 ` [PATCH v3 4/6] vpci: restrict unhandled read/write operations for guests Volodymyr Babchuk
@ 2023-03-17 8:37 ` Roger Pau Monné
  0 siblings, 0 replies; 50+ messages in thread
From: Roger Pau Monné @ 2023-03-17 8:37 UTC (permalink / raw)
  To: Volodymyr Babchuk; +Cc: xen-devel, Oleksandr Andrushchenko

On Tue, Mar 14, 2023 at 08:56:30PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>
> A guest would be able to read and write those registers which are not
> emulated and have no respective vPCI handlers, so it will be possible
> for it to access the hardware directly.
> In order to prevent a guest from reads and writes from/to the unhandled
> registers make sure only hardware domain can access the hardware directly
> and restrict guests from doing so.
>
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>
> ---
>
> v3:
>  - No changes
>
> Older comments from another series:
>
> Since v6:
> - do not use is_hwdom parameter for vpci_{read|write}_hw and use
>   current->domain internally
> - update commit message
> New in v6
> Moved into another series
> ---
>  xen/drivers/vpci/vpci.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 5232f9605b..199ff55672 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -220,6 +220,10 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
>  {
>      uint32_t data;
>
> +    /* Guest domains are not allowed to read real hardware. */
> +    if ( !is_hardware_domain(current->domain) )
> +        return ~(uint32_t)0;
> +
>      switch ( size )
>      {
>      case 4:
> @@ -260,9 +264,13 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
>      return data;
>  }
>
> -static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> -                          uint32_t data)
> +static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg,
> +                          unsigned int size, uint32_t data)

Unrelated change? The parameter list doesn't go over 80 characters so
this rearranging is not necessary, and in any case should be done in a
separate commit or at least mentioned in the commit log.

>  {
> +    /* Guest domains are not allowed to write real hardware. */

I would maybe write this as:

"Unprivileged domain are not allowed unhandled accesses to the config
space."

But that's mostly a nit, and would also apply to the comment in
vpci_read_hw().

Thanks, Roger.

^ permalink raw reply [flat|nested] 50+ messages in thread
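The effect of the two guards in this patch can be shown with a small stand-alone model: an unhandled read from a guest returns all ones, and an unhandled write is dropped. The names below are hypothetical (`struct domain` is a one-field stand-in, `fake_cfg_reg` replaces real config space, and the domain is passed explicitly instead of being taken from `current->domain`), so this is a sketch of the behaviour rather than the Xen code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for Xen's struct domain. */
struct domain {
    bool is_hwdom; /* true for the hardware domain (Dom0) */
};

/* One fake config-space register replacing real hardware access. */
static uint32_t fake_cfg_reg = 0x12345678;

static bool is_hardware_domain(const struct domain *d)
{
    return d->is_hwdom;
}

/* Reads not covered by a handler: guests see all ones. */
static uint32_t read_hw(const struct domain *d)
{
    if ( !is_hardware_domain(d) )
        return ~(uint32_t)0;

    return fake_cfg_reg;
}

/* Writes not covered by a handler: silently dropped for guests. */
static void write_hw(const struct domain *d, uint32_t data)
{
    if ( !is_hardware_domain(d) )
        return;

    fake_cfg_reg = data;
}
```

All ones is also what a config-space read of a non-existent device conventionally returns, so a guest cannot tell an unhandled register apart from absent hardware in this model.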