* [PATCH v4 00/11] PCI devices passthrough on Arm, part 3
From: Oleksandr Andrushchenko @ 2021-11-05  6:56 UTC
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Hi, all!

This patch series focuses on vPCI and adds support for non-identity
PCI BAR mappings, which are required when passing through a PCI device
to a guest. The highlights are:

- Add the relevant vpci register handlers when assigning a PCI device
  to a domain and remove them on de-assign. This allows having
  different handlers for different domains, e.g. hwdom and other guests.

- Emulate guest BAR register values based on the physical BAR values.
  This creates a guest view of the registers and emulates the size and
  property probing performed by the guest during PCI device enumeration.

- Instead of handling a single range set that contains all the memory
  regions of all the BARs and the ROM, have one range set per BAR.

- Take into account the guest's BAR view and program its p2m
  accordingly: the gfn is the guest's view of the BAR and the mfn is
  the physical BAR value as set up by the host bridge in the hardware
  domain. This way the hardware domain sees the physical BAR values
  and the guest sees the emulated ones (see the sketch below).
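
A minimal sketch of the resulting p2m programming; the BAR addresses,
the 1MiB size and the domain pointers are invented for the example:

    /*
     * Physical BAR at 0xe0000000 as programmed by the host bridge,
     * guest view of the same BAR at 0x23000000.
     */
    mfn_t mfn = _mfn(PFN_DOWN(0xe0000000)); /* physical BAR   */
    gfn_t gfn = _gfn(PFN_DOWN(0x23000000)); /* guest BAR view */

    /* hwdom: identity mapping, gfn == mfn. */
    map_mmio_regions(hwdom, _gfn(mfn_x(mfn)), PFN_UP(0x100000), mfn);

    /* guest: non-identity mapping, gfn 0x23000 -> mfn 0xe0000. */
    map_mmio_regions(guest, gfn, PFN_UP(0x100000), mfn);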

The series also adds support for virtual PCI bus topology for guests:
 - We emulate a single host bridge for the guest, so segment is always 0.
 - The implementation is limited to 32 devices which are allowed on
   a single PCI bus.
 - The virtual bus number is set to 0, so virtual devices are seen
   as embedded endpoints behind the root complex.

The series was also tested on:
 - x86 PVH Dom0 and doesn't break it.
 - x86 HVM with PCI passthrough to DomU and doesn't break it.

Thank you,
Oleksandr

Oleksandr Andrushchenko (11):
  vpci: fix function attributes for vpci_process_pending
  vpci: cancel pending map/unmap on vpci removal
  vpci: make vpci registers removal a dedicated function
  vpci: add hooks for PCI device assign/de-assign
  vpci/header: implement guest BAR register handlers
  vpci/header: handle p2m range sets per BAR
  vpci/header: program p2m with guest BAR view
  vpci/header: emulate PCI_COMMAND register for guests
  vpci/header: reset the command register when adding devices
  vpci: add initial support for virtual PCI bus topology
  xen/arm: translate virtual PCI bus topology for guests

 xen/arch/arm/vpci.c           |  18 ++
 xen/drivers/Kconfig           |   4 +
 xen/drivers/passthrough/pci.c |   6 +
 xen/drivers/vpci/header.c     | 320 +++++++++++++++++++++++++++-------
 xen/drivers/vpci/vpci.c       | 163 ++++++++++++++++-
 xen/include/xen/sched.h       |   8 +
 xen/include/xen/vpci.h        |  37 +++-
 7 files changed, 484 insertions(+), 72 deletions(-)

-- 
2.25.1




* [PATCH v4 01/11] vpci: fix function attributes for vpci_process_pending
From: Oleksandr Andrushchenko @ 2021-11-05  6:56 UTC
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

vpci_process_pending is declared with different attributes depending
on CONFIG_HAS_VPCI: with __must_check when it is enabled and without
it otherwise. Fix this by using __must_check for both declarations.
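
For context, a simplified rendering of the mismatch (not the verbatim
header): the stub used when CONFIG_HAS_VPCI is off lacked the
attribute, so callers built without vPCI could silently drop the
result:

    #ifdef CONFIG_HAS_VPCI
    bool __must_check vpci_process_pending(struct vcpu *v);
    #else
    static inline bool vpci_process_pending(struct vcpu *v) /* no __must_check */
    {
        ASSERT_UNREACHABLE();
        return false;
    }
    #endif

    /* Compiled without CONFIG_HAS_VPCI this raised no -Wunused-result: */
    vpci_process_pending(v);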

Fixes: 7fbb096bf345 ("kconfig: don't select VPCI if building a shim-only binary")

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
Cc: Roger Pau Monné <roger.pau@citrix.com>

New in v4
---
 xen/include/xen/vpci.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 9ea66e033f11..3f32de9d7eb3 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -247,7 +247,7 @@ static inline void vpci_write(pci_sbdf_t sbdf, unsigned int reg,
     ASSERT_UNREACHABLE();
 }
 
-static inline bool vpci_process_pending(struct vcpu *v)
+static inline bool __must_check vpci_process_pending(struct vcpu *v)
 {
     ASSERT_UNREACHABLE();
     return false;
-- 
2.25.1




* [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
From: Oleksandr Andrushchenko @ 2021-11-05  6:56 UTC
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

When the vPCI of a PCI device is removed it is possible that delayed
work for map/unmap operations of that device has already been
scheduled. For example, the following scenario illustrates the problem:

pci_physdev_op
   pci_add_device
       init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
   iommu_add_device <- FAILS
   vpci_remove_device -> xfree(pdev->vpci)

leave_hypervisor_to_guest
   vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL

For the hardware domain we continue execution, as the worst that
could happen is that MMIO mappings are left in place when the
device has been deassigned.

For unprivileged domains that get a failure in the middle of a vPCI
{un}map operation we need to destroy them, as we don't know in which
state the p2m is. This can only happen in vpci_process_pending for
DomUs as they won't be allowed to call pci_add_device.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
Cc: Roger Pau Monné <roger.pau@citrix.com>

New in v4
---
 xen/drivers/vpci/header.c | 15 +++++++++++++--
 xen/drivers/vpci/vpci.c   |  2 ++
 xen/include/xen/vpci.h    |  6 ++++++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 40ff79c33f8f..ef538386e95d 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -149,8 +149,7 @@ bool vpci_process_pending(struct vcpu *v)
                         !rc && v->vpci.rom_only);
         spin_unlock(&v->vpci.pdev->vpci->lock);
 
-        rangeset_destroy(v->vpci.mem);
-        v->vpci.mem = NULL;
+        vpci_cancel_pending(v->vpci.pdev);
         if ( rc )
             /*
              * FIXME: in case of failure remove the device from the domain.
@@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
     return false;
 }
 
+void vpci_cancel_pending(const struct pci_dev *pdev)
+{
+    struct vcpu *v = current;
+
+    /* Cancel any pending work now. */
+    if ( v->vpci.mem && v->vpci.pdev == pdev )
+    {
+        rangeset_destroy(v->vpci.mem);
+        v->vpci.mem = NULL;
+    }
+}
+
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
                             struct rangeset *mem, uint16_t cmd)
 {
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 657697fe3406..4e24956419aa 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
         xfree(r);
     }
     spin_unlock(&pdev->vpci->lock);
+
+    vpci_cancel_pending(pdev);
     xfree(pdev->vpci->msix);
     xfree(pdev->vpci->msi);
     xfree(pdev->vpci);
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 3f32de9d7eb3..609d6383b252 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -56,6 +56,7 @@ uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg,
  * should not run.
  */
 bool __must_check vpci_process_pending(struct vcpu *v);
+void vpci_cancel_pending(const struct pci_dev *pdev);
 
 struct vpci {
     /* List of vPCI handlers for a device. */
@@ -252,6 +253,11 @@ static inline bool __must_check vpci_process_pending(struct vcpu *v)
     ASSERT_UNREACHABLE();
     return false;
 }
+
+static inline void vpci_cancel_pending(const struct pci_dev *pdev)
+{
+    ASSERT_UNREACHABLE();
+}
 #endif
 
 #endif
-- 
2.25.1




* [PATCH v4 03/11] vpci: make vpci registers removal a dedicated function
From: Oleksandr Andrushchenko @ 2021-11-05  6:56 UTC
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

This is in preparation for dynamic assignment of the vpci register
handlers depending on the domain: hwdom or guest.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v3:
- remove all R-b's due to changes
- s/vpci_remove_device_registers/vpci_remove_device_handlers
- minor comment cleanup
Since v1:
 - constify struct pci_dev where possible
---
 xen/drivers/vpci/vpci.c | 6 +++++-
 xen/include/xen/vpci.h  | 2 ++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 4e24956419aa..d7f033a0811f 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -35,7 +35,7 @@ extern vpci_register_init_t *const __start_vpci_array[];
 extern vpci_register_init_t *const __end_vpci_array[];
 #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
 
-void vpci_remove_device(struct pci_dev *pdev)
+void vpci_remove_device_handlers(const struct pci_dev *pdev)
 {
     if ( !has_vpci(pdev->domain) )
         return;
@@ -51,8 +51,12 @@ void vpci_remove_device(struct pci_dev *pdev)
         xfree(r);
     }
     spin_unlock(&pdev->vpci->lock);
+}
 
+void vpci_remove_device(struct pci_dev *pdev)
+{
     vpci_cancel_pending(pdev);
+    vpci_remove_device_handlers(pdev);
     xfree(pdev->vpci->msix);
     xfree(pdev->vpci->msi);
     xfree(pdev->vpci);
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 609d6383b252..1883b9d08a70 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -30,6 +30,8 @@ int __must_check vpci_add_handlers(struct pci_dev *dev);
 
 /* Remove all handlers and free vpci related structures. */
 void vpci_remove_device(struct pci_dev *pdev);
+/* Remove all handlers for the device. */
+void vpci_remove_device_handlers(const struct pci_dev *pdev);
 
 /* Add/remove a register handler. */
 int __must_check vpci_add_register(struct vpci *vpci,
-- 
2.25.1




* [PATCH v4 04/11] vpci: add hooks for PCI device assign/de-assign
From: Oleksandr Andrushchenko @ 2021-11-05  6:56 UTC
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

When a PCI device gets assigned/de-assigned, some work needs to be
done on the vPCI side for that device. Introduce a pair of hooks so
vPCI can handle that.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v3:
 - remove toolstack roll-back description from the commit message
   as errors are to be handled with proper cleanup in Xen itself
 - remove __must_check
 - remove redundant rc check while assigning devices
 - fix redundant CONFIG_HAS_VPCI check for CONFIG_HAS_VPCI_GUEST_SUPPORT
 - use REGISTER_VPCI_INIT machinery to run required steps on device
   init/assign: add run_vpci_init helper
Since v2:
- define CONFIG_HAS_VPCI_GUEST_SUPPORT so dead code is not compiled
  for x86
Since v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - extended the commit message
---
 xen/drivers/Kconfig           |  4 +++
 xen/drivers/passthrough/pci.c |  6 ++++
 xen/drivers/vpci/vpci.c       | 57 ++++++++++++++++++++++++++++++-----
 xen/include/xen/vpci.h        | 16 ++++++++++
 4 files changed, 75 insertions(+), 8 deletions(-)

diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
index db94393f47a6..780490cf8e39 100644
--- a/xen/drivers/Kconfig
+++ b/xen/drivers/Kconfig
@@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
 config HAS_VPCI
 	bool
 
+config HAS_VPCI_GUEST_SUPPORT
+	bool
+	depends on HAS_VPCI
+
 endmenu
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index a9d31293ac09..529a4f50aa80 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -873,6 +873,10 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
     if ( ret )
         goto out;
 
+    ret = vpci_deassign_device(d, pdev);
+    if ( ret )
+        goto out;
+
     if ( pdev->domain == hardware_domain  )
         pdev->quarantine = false;
 
@@ -1445,6 +1449,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
         rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag);
     }
 
+    rc = vpci_assign_device(d, pdev);
+
  done:
     if ( rc )
         printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index d7f033a0811f..5f086398a98c 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -63,11 +63,25 @@ void vpci_remove_device(struct pci_dev *pdev)
     pdev->vpci = NULL;
 }
 
-int vpci_add_handlers(struct pci_dev *pdev)
+static int run_vpci_init(struct pci_dev *pdev)
 {
     unsigned int i;
     int rc = 0;
 
+    for ( i = 0; i < NUM_VPCI_INIT; i++ )
+    {
+        rc = __start_vpci_array[i](pdev);
+        if ( rc )
+            break;
+    }
+
+    return rc;
+}
+
+int vpci_add_handlers(struct pci_dev *pdev)
+{
+    int rc;
+
     if ( !has_vpci(pdev->domain) )
         return 0;
 
@@ -81,18 +95,45 @@ int vpci_add_handlers(struct pci_dev *pdev)
     INIT_LIST_HEAD(&pdev->vpci->handlers);
     spin_lock_init(&pdev->vpci->lock);
 
-    for ( i = 0; i < NUM_VPCI_INIT; i++ )
-    {
-        rc = __start_vpci_array[i](pdev);
-        if ( rc )
-            break;
-    }
-
+    rc = run_vpci_init(pdev);
     if ( rc )
         vpci_remove_device(pdev);
 
     return rc;
 }
+
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+/* Notify vPCI that device is assigned to guest. */
+int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
+{
+    int rc;
+
+    /* It only makes sense to assign for hwdom or guest domain. */
+    if ( is_system_domain(d) || !has_vpci(d) )
+        return 0;
+
+    vpci_remove_device_handlers(pdev);
+
+    rc = run_vpci_init(pdev);
+    if ( rc )
+        vpci_deassign_device(d, pdev);
+
+    return rc;
+}
+
+/* Notify vPCI that device is de-assigned from guest. */
+int vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
+{
+    /* It only makes sense to de-assign from hwdom or guest domain. */
+    if ( is_system_domain(d) || !has_vpci(d) )
+        return 0;
+
+    vpci_remove_device_handlers(pdev);
+
+    return 0;
+}
+#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
+
 #endif /* __XEN__ */
 
 static int vpci_register_cmp(const struct vpci_register *r1,
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 1883b9d08a70..a016b4197801 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -262,6 +262,22 @@ static inline void vpci_cancel_pending(const struct pci_dev *pdev)
 }
 #endif
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+/* Notify vPCI that device is assigned/de-assigned to/from guest. */
+int vpci_assign_device(struct domain *d, struct pci_dev *pdev);
+int vpci_deassign_device(struct domain *d, struct pci_dev *pdev);
+#else
+static inline int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
+{
+    return 0;
+};
+
+static inline int vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
+{
+    return 0;
+};
+#endif
+
 #endif
 
 /*
-- 
2.25.1




* [PATCH v4 05/11] vpci/header: implement guest BAR register handlers
From: Oleksandr Andrushchenko @ 2021-11-05  6:56 UTC
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Add the relevant vpci register handlers when assigning a PCI device to
a domain and remove them on de-assign. This allows having different
handlers for different domains, e.g. hwdom and other guests.

Emulate guest BAR register values: this creates a guest view of the
registers and emulates the size and property probing performed by the
guest during PCI device enumeration.

The ROM BAR is only handled for the hardware domain; for guest domains
there is a stub: at the moment the PCI expansion ROM is x86 only, so it
is unlikely to be used by other architectures without emulating x86.
Other use cases include running the expansion ROM before Xen boots, in
which case no emulation is needed in Xen itself, or a guest wanting to
run the ROM code, which seems to be rare.
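
For reference, the size and properties probe being emulated here is
the standard BAR sizing sequence performed by guests; a sketch using
hypothetical pci_read32()/pci_write32() config space accessors:

    uint32_t orig, probe, size;

    orig = pci_read32(sbdf, PCI_BASE_ADDRESS_0);
    pci_write32(sbdf, PCI_BASE_ADDRESS_0, 0xffffffff);
    probe = pci_read32(sbdf, PCI_BASE_ADDRESS_0);
    /* Writable address bits read back as 1s; invert to get the size. */
    size = ~(probe & PCI_BASE_ADDRESS_MEM_MASK) + 1;
    pci_write32(sbdf, PCI_BASE_ADDRESS_0, orig);

The alignment applied in guest_bar_write() below (masking with
~(bar->size - 1)) is what makes the all-ones write read back as the
size mask, without the physical BAR ever being written.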

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v3:
- squashed two patches: dynamic add/remove handlers and guest BAR
  handler implementation
- fix guest BAR read of the high part of a 64bit BAR (Roger)
- add error handling to vpci_assign_device
- s/dom%pd/%pd
- blank line before return
Since v2:
- remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
  has been eliminated from being built on x86
Since v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - simplify some code
 - use gdprintk + error code instead of gprintk
 - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
   so these do not get compiled for x86
 - removed unneeded is_system_domain check
 - re-work guest read/write to be much simpler and do more work on write
   than on read, which is expected to be called more frequently
 - removed one too obvious comment
---
 xen/drivers/vpci/header.c | 72 +++++++++++++++++++++++++++++++++++----
 xen/include/xen/vpci.h    |  3 ++
 2 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index ef538386e95d..1239051ee8ff 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -408,6 +408,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
     pci_conf_write32(pdev->sbdf, reg, val);
 }
 
+static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
+                            uint32_t val, void *data)
+{
+    struct vpci_bar *bar = data;
+    bool hi = false;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+    else
+    {
+        val &= PCI_BASE_ADDRESS_MEM_MASK;
+        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
+                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
+        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+    }
+
+    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
+    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
+
+    bar->guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
+}
+
+static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
+                               void *data)
+{
+    const struct vpci_bar *bar = data;
+    bool hi = false;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+
+    return bar->guest_addr >> (hi ? 32 : 0);
+}
+
 static void rom_write(const struct pci_dev *pdev, unsigned int reg,
                       uint32_t val, void *data)
 {
@@ -456,6 +498,17 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
 }
 
+static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
+                            uint32_t val, void *data)
+{
+}
+
+static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
+                               void *data)
+{
+    return 0xffffffff;
+}
+
 static int init_bars(struct pci_dev *pdev)
 {
     uint16_t cmd;
@@ -464,6 +517,7 @@ static int init_bars(struct pci_dev *pdev)
     struct vpci_header *header = &pdev->vpci->header;
     struct vpci_bar *bars = header->bars;
     int rc;
+    bool is_hwdom = is_hardware_domain(pdev->domain);
 
     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
     {
@@ -503,8 +557,10 @@ static int init_bars(struct pci_dev *pdev)
         if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
         {
             bars[i].type = VPCI_BAR_MEM64_HI;
-            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
-                                   4, &bars[i]);
+            rc = vpci_add_register(pdev->vpci,
+                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
+                                   is_hwdom ? bar_write : guest_bar_write,
+                                   reg, 4, &bars[i]);
             if ( rc )
             {
                 pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
@@ -544,8 +600,10 @@ static int init_bars(struct pci_dev *pdev)
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
-                               &bars[i]);
+        rc = vpci_add_register(pdev->vpci,
+                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
+                               is_hwdom ? bar_write : guest_bar_write,
+                               reg, 4, &bars[i]);
         if ( rc )
         {
             pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
@@ -565,8 +623,10 @@ static int init_bars(struct pci_dev *pdev)
         header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
                               PCI_ROM_ADDRESS_ENABLE;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
-                               4, rom);
+        rc = vpci_add_register(pdev->vpci,
+                               is_hwdom ? vpci_hw_read32 : guest_rom_read,
+                               is_hwdom ? rom_write : guest_rom_write,
+                               rom_reg, 4, rom);
         if ( rc )
             rom->type = VPCI_BAR_EMPTY;
     }
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index a016b4197801..3e7428da822c 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -70,7 +70,10 @@ struct vpci {
     struct vpci_header {
         /* Information about the PCI BARs of this device. */
         struct vpci_bar {
+            /* Physical view of the BAR. */
             uint64_t addr;
+            /* Guest view of the BAR. */
+            uint64_t guest_addr;
             uint64_t size;
             enum {
                 VPCI_BAR_EMPTY,
-- 
2.25.1




* [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR
From: Oleksandr Andrushchenko @ 2021-11-05  6:56 UTC
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Instead of handling a single range set that contains all the memory
regions of all the BARs and the ROM, have one range set per BAR.

This is in preparation for making non-identity p2m mappings for the
MMIOs/ROM.
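
A condensed sketch of the per-BAR rangeset life cycle as used in the
hunks below (existing Xen rangeset API; error handling omitted):

    struct rangeset *mem = rangeset_new(NULL, NULL, 0); /* one per BAR */

    /* Fill with the BAR's frames, punch out MSI-X tables and overlaps: */
    rangeset_add_range(mem, PFN_DOWN(bar->addr),
                       PFN_DOWN(bar->addr + bar->size - 1));
    rangeset_remove_range(mem, msix_start, msix_end);

    /* Consume the remaining ranges, restarting if preempted: */
    while ( rangeset_consume_ranges(mem, map_range, &data) == -ERESTART )
        process_pending_softirqs();

    rangeset_destroy(mem); /* on device removal */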

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
Since v3:
- re-work vpci_cancel_pending accordingly to the per-BAR handling
- s/num_mem_ranges/map_pending and s/uint8_t/bool
- ASSERT(bar->mem) in modify_bars
- create and destroy the rangesets on add/remove
---
 xen/drivers/vpci/header.c | 178 ++++++++++++++++++++++++++------------
 xen/drivers/vpci/vpci.c   |  26 +++++-
 xen/include/xen/vpci.h    |   3 +-
 3 files changed, 150 insertions(+), 57 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 1239051ee8ff..5fc2dfbbc864 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -131,34 +131,50 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
 
 bool vpci_process_pending(struct vcpu *v)
 {
-    if ( v->vpci.mem )
+    if ( v->vpci.map_pending )
     {
         struct map_data data = {
             .d = v->domain,
             .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
         };
-        int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
+        struct pci_dev *pdev = v->vpci.pdev;
+        struct vpci_header *header = &pdev->vpci->header;
+        unsigned int i;
 
-        if ( rc == -ERESTART )
-            return true;
+        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+        {
+            struct vpci_bar *bar = &header->bars[i];
+            int rc;
 
-        spin_lock(&v->vpci.pdev->vpci->lock);
-        /* Disable memory decoding unconditionally on failure. */
-        modify_decoding(v->vpci.pdev,
-                        rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
-                        !rc && v->vpci.rom_only);
-        spin_unlock(&v->vpci.pdev->vpci->lock);
+            if ( rangeset_is_empty(bar->mem) )
+                continue;
 
-        vpci_cancel_pending(v->vpci.pdev);
-        if ( rc )
-            /*
-             * FIXME: in case of failure remove the device from the domain.
-             * Note that there might still be leftover mappings. While this is
-             * safe for Dom0, for DomUs the domain will likely need to be
-             * killed in order to avoid leaking stale p2m mappings on
-             * failure.
-             */
-            vpci_remove_device(v->vpci.pdev);
+            rc = rangeset_consume_ranges(bar->mem, map_range, &data);
+
+            if ( rc == -ERESTART )
+                return true;
+
+            spin_lock(&pdev->vpci->lock);
+            /* Disable memory decoding unconditionally on failure. */
+            modify_decoding(pdev,
+                            rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
+                            !rc && v->vpci.rom_only);
+            spin_unlock(&pdev->vpci->lock);
+
+            if ( rc )
+            {
+                /*
+                 * FIXME: in case of failure remove the device from the domain.
+                 * Note that there might still be leftover mappings. While this is
+                 * safe for Dom0, for DomUs the domain will likely need to be
+                 * killed in order to avoid leaking stale p2m mappings on
+                 * failure.
+                 */
+                vpci_remove_device(pdev);
+                break;
+            }
+        }
+        v->vpci.map_pending = false;
     }
 
     return false;
@@ -169,22 +185,48 @@ void vpci_cancel_pending(const struct pci_dev *pdev)
     struct vcpu *v = current;
 
     /* Cancel any pending work now. */
-    if ( v->vpci.mem && v->vpci.pdev == pdev)
+    if ( v->vpci.map_pending && v->vpci.pdev == pdev )
     {
-        rangeset_destroy(v->vpci.mem);
-        v->vpci.mem = NULL;
+        struct vpci_header *header = &pdev->vpci->header;
+        unsigned int i;
+        int rc;
+
+        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+        {
+            struct vpci_bar *bar = &header->bars[i];
+
+            if ( rangeset_is_empty(bar->mem) )
+                continue;
+
+            rc = rangeset_remove_range(bar->mem, 0, ~0ULL);
+            if ( rc )
+                printk(XENLOG_ERR
+                       "%pd %pp failed to remove range set for BAR: %d\n",
+                       v->domain, &pdev->sbdf, rc);
+        }
+        v->vpci.map_pending = false;
     }
 }
 
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
-                            struct rangeset *mem, uint16_t cmd)
+                            uint16_t cmd)
 {
     struct map_data data = { .d = d, .map = true };
-    int rc;
+    struct vpci_header *header = &pdev->vpci->header;
+    int rc = 0;
+    unsigned int i;
+
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        struct vpci_bar *bar = &header->bars[i];
+
+        if ( rangeset_is_empty(bar->mem) )
+            continue;
 
-    while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
-        process_pending_softirqs();
-    rangeset_destroy(mem);
+        while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
+                                              &data)) == -ERESTART )
+            process_pending_softirqs();
+    }
     if ( !rc )
         modify_decoding(pdev, cmd, false);
 
@@ -192,7 +234,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
 }
 
 static void defer_map(struct domain *d, struct pci_dev *pdev,
-                      struct rangeset *mem, uint16_t cmd, bool rom_only)
+                      uint16_t cmd, bool rom_only)
 {
     struct vcpu *curr = current;
 
@@ -203,9 +245,9 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
      * started for the same device if the domain is not well-behaved.
      */
     curr->vpci.pdev = pdev;
-    curr->vpci.mem = mem;
     curr->vpci.cmd = cmd;
     curr->vpci.rom_only = rom_only;
+    curr->vpci.map_pending = true;
     /*
      * Raise a scheduler softirq in order to prevent the guest from resuming
      * execution with pending mapping operations, to trigger the invocation
@@ -217,42 +259,40 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
 static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 {
     struct vpci_header *header = &pdev->vpci->header;
-    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
     struct pci_dev *tmp, *dev = NULL;
     const struct vpci_msix *msix = pdev->vpci->msix;
-    unsigned int i;
+    unsigned int i, j;
     int rc;
-
-    if ( !mem )
-        return -ENOMEM;
+    bool map_pending;
 
     /*
-     * Create a rangeset that represents the current device BARs memory region
+     * Create a rangeset per BAR that represents the current device memory region
      * and compare it against all the currently active BAR memory regions. If
      * an overlap is found, subtract it from the region to be mapped/unmapped.
      *
-     * First fill the rangeset with all the BARs of this device or with the ROM
+     * First fill the rangesets with all the BARs of this device or with the ROM
      * BAR only, depending on whether the guest is toggling the memory decode
      * bit of the command register, or the enable bit of the ROM BAR register.
      */
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
-        const struct vpci_bar *bar = &header->bars[i];
+        struct vpci_bar *bar = &header->bars[i];
         unsigned long start = PFN_DOWN(bar->addr);
         unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
 
+        ASSERT(bar->mem);
+
         if ( !MAPPABLE_BAR(bar) ||
              (rom_only ? bar->type != VPCI_BAR_ROM
                        : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) )
             continue;
 
-        rc = rangeset_add_range(mem, start, end);
+        rc = rangeset_add_range(bar->mem, start, end);
         if ( rc )
         {
             printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
                    start, end, rc);
-            rangeset_destroy(mem);
-            return rc;
+            goto fail;
         }
     }
 
@@ -263,14 +303,21 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
         unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
                                      vmsix_table_size(pdev->vpci, i) - 1);
 
-        rc = rangeset_remove_range(mem, start, end);
-        if ( rc )
+        for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
         {
-            printk(XENLOG_G_WARNING
-                   "Failed to remove MSIX table [%lx, %lx]: %d\n",
-                   start, end, rc);
-            rangeset_destroy(mem);
-            return rc;
+            const struct vpci_bar *bar = &header->bars[j];
+
+            if ( rangeset_is_empty(bar->mem) )
+                continue;
+
+            rc = rangeset_remove_range(bar->mem, start, end);
+            if ( rc )
+            {
+                printk(XENLOG_G_WARNING
+                       "Failed to remove MSIX table [%lx, %lx]: %d\n",
+                       start, end, rc);
+                goto fail;
+            }
         }
     }
 
@@ -302,7 +349,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             unsigned long start = PFN_DOWN(bar->addr);
             unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
 
-            if ( !bar->enabled || !rangeset_overlaps_range(mem, start, end) ||
+            if ( !bar->enabled ||
+                 !rangeset_overlaps_range(bar->mem, start, end) ||
                  /*
                   * If only the ROM enable bit is toggled check against other
                   * BARs in the same device for overlaps, but not against the
@@ -311,13 +359,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
                  (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
                 continue;
 
-            rc = rangeset_remove_range(mem, start, end);
+            rc = rangeset_remove_range(bar->mem, start, end);
             if ( rc )
             {
                 printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
                        start, end, rc);
-                rangeset_destroy(mem);
-                return rc;
+                goto fail;
             }
         }
     }
@@ -335,12 +382,35 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
          * will always be to establish mappings and process all the BARs.
          */
         ASSERT((cmd & PCI_COMMAND_MEMORY) && !rom_only);
-        return apply_map(pdev->domain, pdev, mem, cmd);
+        return apply_map(pdev->domain, pdev, cmd);
     }
 
-    defer_map(dev->domain, dev, mem, cmd, rom_only);
+    /* Check if any memory ranges are left after MSI-X and overlap removal. */
+    map_pending = false;
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+        if ( !rangeset_is_empty(header->bars[i].mem) )
+        {
+            map_pending = true;
+            break;
+        }
+
+    /*
+     * There are cases when a PCI device, a root port for example, has
+     * neither memory space nor IO. In such a case the PCI command register
+     * write would be skipped, leaving the device non-functional, so:
+     *   - if there are no regions write the command register now
+     *   - if there are regions then defer work and write later on
+     */
+    if ( !map_pending )
+        pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
+    else
+        defer_map(dev->domain, dev, cmd, rom_only);
 
     return 0;
+
+fail:
+    vpci_cancel_pending(pdev);
+    return rc;
 }
 
 static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 5f086398a98c..45733300f00b 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -55,7 +55,12 @@ void vpci_remove_device_handlers(const struct pci_dev *pdev)
 
 void vpci_remove_device(struct pci_dev *pdev)
 {
+    struct vpci_header *header = &pdev->vpci->header;
+    unsigned int i;
+
     vpci_cancel_pending(pdev);
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+        rangeset_destroy(header->bars[i].mem);
     vpci_remove_device_handlers(pdev);
     xfree(pdev->vpci->msix);
     xfree(pdev->vpci->msi);
@@ -80,6 +85,8 @@ static int run_vpci_init(struct pci_dev *pdev)
 
 int vpci_add_handlers(struct pci_dev *pdev)
 {
+    struct vpci_header *header;
+    unsigned int i;
     int rc;
 
     if ( !has_vpci(pdev->domain) )
@@ -95,10 +102,25 @@ int vpci_add_handlers(struct pci_dev *pdev)
     INIT_LIST_HEAD(&pdev->vpci->handlers);
     spin_lock_init(&pdev->vpci->lock);
 
+    header = &pdev->vpci->header;
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        struct vpci_bar *bar = &header->bars[i];
+
+        bar->mem = rangeset_new(NULL, NULL, 0);
+        if ( !bar->mem )
+        {
+            rc = -ENOMEM;
+            goto fail;
+        }
+    }
+
     rc = run_vpci_init(pdev);
-    if ( rc )
-        vpci_remove_device(pdev);
+    if ( !rc )
+        return 0;
 
+ fail:
+    vpci_remove_device(pdev);
     return rc;
 }
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 3e7428da822c..143f3166a730 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -75,6 +75,7 @@ struct vpci {
             /* Guest view of the BAR. */
             uint64_t guest_addr;
             uint64_t size;
+            struct rangeset *mem;
             enum {
                 VPCI_BAR_EMPTY,
                 VPCI_BAR_IO,
@@ -149,9 +150,9 @@ struct vpci {
 
 struct vpci_vcpu {
     /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
-    struct rangeset *mem;
     struct pci_dev *pdev;
     uint16_t cmd;
+    bool map_pending : 1;
     bool rom_only : 1;
 };
 
-- 
2.25.1




* [PATCH v4 07/11] vpci/header: program p2m with guest BAR view
From: Oleksandr Andrushchenko @ 2021-11-05  6:56 UTC
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Take into account the guest's BAR view and program its p2m
accordingly: the gfn is the guest's view of the BAR and the mfn is the
physical BAR value as set up by the host bridge in the hardware
domain. This way the hardware domain sees the physical BAR values and
the guest sees the emulated ones.
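
A worked example of the offset handling introduced in map_range()
below, with invented frame numbers (physical BAR at mfn 0xe0000, guest
view at gfn 0x23000):

    /*
     * An MSI-X table hole makes the range to map start at s = 0xe0010
     * instead of the BAR start 0xe0000 (all values are frame numbers).
     */
    start_gfn = gfn_add(map->start_gfn, s - mfn_x(map->start_mfn));
    /* = 0x23000 + (0xe0010 - 0xe0000) = 0x23010 */
    rc = map_mmio_regions(map->d, start_gfn, size, _mfn(s));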

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v3:
- updated comment (Roger)
- removed gfn_add(map->start_gfn, rc); which is wrong
- use v->domain instead of v->vpci.pdev->domain
- removed odd e.g. in comment
- s/d%d/%pd in altered code
- use gdprintk for map/unmap logs
Since v2:
- improve readability for data.start_gfn and restructure ?: construct
Since v1:
 - s/MSI/MSI-X in comments
---
 xen/drivers/vpci/header.c | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 5fc2dfbbc864..34158da2d5f6 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -30,6 +30,10 @@
 
 struct map_data {
     struct domain *d;
+    /* Start address of the BAR as seen by the guest. */
+    gfn_t start_gfn;
+    /* Physical start address of the BAR. */
+    mfn_t start_mfn;
     bool map;
 };
 
@@ -37,12 +41,24 @@ static int map_range(unsigned long s, unsigned long e, void *data,
                      unsigned long *c)
 {
     const struct map_data *map = data;
+    gfn_t start_gfn;
     int rc;
 
     for ( ; ; )
     {
         unsigned long size = e - s + 1;
 
+        /*
+         * Ranges to be mapped don't always start at the BAR start address, as
+         * there can be holes or partially consumed ranges. Account for the
+         * offset of the current address from the BAR start.
+         */
+        start_gfn = gfn_add(map->start_gfn, s - mfn_x(map->start_mfn));
+
+        gdprintk(XENLOG_G_DEBUG,
+                 "%smap [%lx, %lx] -> %#"PRI_gfn" for %pd\n",
+                 map->map ? "" : "un", s, e, gfn_x(start_gfn),
+                 map->d);
         /*
          * ARM TODOs:
          * - On ARM whether the memory is prefetchable or not should be passed
@@ -52,8 +68,10 @@ static int map_range(unsigned long s, unsigned long e, void *data,
          * - {un}map_mmio_regions doesn't support preemption.
          */
 
-        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
-                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
+        rc = map->map ? map_mmio_regions(map->d, start_gfn,
+                                         size, _mfn(s))
+                      : unmap_mmio_regions(map->d, start_gfn,
+                                           size, _mfn(s));
         if ( rc == 0 )
         {
             *c += size;
@@ -62,8 +80,8 @@ static int map_range(unsigned long s, unsigned long e, void *data,
         if ( rc < 0 )
         {
             printk(XENLOG_G_WARNING
-                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
-                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
+                   "Failed to identity %smap [%lx, %lx] for %pd: %d\n",
+                   map->map ? "" : "un", s, e, map->d, rc);
             break;
         }
         ASSERT(rc < size);
@@ -149,6 +167,10 @@ bool vpci_process_pending(struct vcpu *v)
             if ( rangeset_is_empty(bar->mem) )
                 continue;
 
+            data.start_gfn =
+                 _gfn(PFN_DOWN(is_hardware_domain(v->domain)
+                               ? bar->addr : bar->guest_addr));
+            data.start_mfn = _mfn(PFN_DOWN(bar->addr));
             rc = rangeset_consume_ranges(bar->mem, map_range, &data);
 
             if ( rc == -ERESTART )
@@ -223,6 +245,9 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
         if ( rangeset_is_empty(bar->mem) )
             continue;
 
+        data.start_gfn = _gfn(PFN_DOWN(is_hardware_domain(d)
+                                       ? bar->addr : bar->guest_addr));
+        data.start_mfn = _mfn(PFN_DOWN(bar->addr));
         while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
                                               &data)) == -ERESTART )
             process_pending_softirqs();
-- 
2.25.1




* [PATCH v4 08/11] vpci/header: emulate PCI_COMMAND register for guests
From: Oleksandr Andrushchenko @ 2021-11-05  6:56 UTC
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Add basic PCI_COMMAND emulation support for guests. At the moment
only the PCI_COMMAND_INTX_DISABLE bit is emulated; the rest is not
emulated yet and left as a TODO.
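
In short, the behaviour added below for a hypothetical guest write
while MSI is enabled:

    /* Guest tries to re-enable INTx (bit 10 clear) while MSI is on: */
    uint32_t cmd = PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER; /* 0x0006 */

    if ( pdev->vpci->msi->enabled )
        cmd |= PCI_COMMAND_INTX_DISABLE; /* forced back on */

    pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd); /* 0x0406 reaches hw */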

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v3:
- gate more code on CONFIG_HAS_MSI
- removed logic for the case when MSI/MSI-X not enabled
---
 xen/drivers/vpci/header.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 34158da2d5f6..64cfc268c341 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -459,6 +459,22 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
         pci_conf_write16(pdev->sbdf, reg, cmd);
 }
 
+static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
+                            uint32_t cmd, void *data)
+{
+    /* TODO: Add proper emulation for all bits of the command register. */
+
+#ifdef CONFIG_HAS_PCI_MSI
+    if ( pdev->vpci->msi->enabled )
+    {
+        /* INTx can't be enabled while MSI/MSI-X is enabled: force it off. */
+        cmd |= PCI_COMMAND_INTX_DISABLE;
+    }
+#endif
+
+    cmd_write(pdev, reg, cmd, data);
+}
+
 static void bar_write(const struct pci_dev *pdev, unsigned int reg,
                       uint32_t val, void *data)
 {
@@ -631,8 +647,9 @@ static int init_bars(struct pci_dev *pdev)
     }
 
     /* Setup a handler for the command register. */
-    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
-                           2, header);
+    rc = vpci_add_register(pdev->vpci, vpci_hw_read16,
+                           is_hwdom ? cmd_write : guest_cmd_write,
+                           PCI_COMMAND, 2, header);
     if ( rc )
         return rc;
 
-- 
2.25.1




* [PATCH v4 09/11] vpci/header: reset the command register when adding devices
From: Oleksandr Andrushchenko @ 2021-11-05  6:56 UTC
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Reset the command register when passing through a PCI device: it is
possible that the device's memory decoding bits in the command
register are already set when it is passed through. In that case a
guest OS may never write to the command register to update memory
decoding, so the guest mappings (the guest's view of the BARs) would
be left stale.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v1:
 - do not write 0 to the command register, but respect host settings.
---
 xen/drivers/vpci/header.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 64cfc268c341..680319b3a63f 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -459,8 +459,7 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
         pci_conf_write16(pdev->sbdf, reg, cmd);
 }
 
-static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
-                            uint32_t cmd, void *data)
+static uint32_t emulate_cmd_reg(const struct pci_dev *pdev, uint32_t cmd)
 {
     /* TODO: Add proper emulation for all bits of the command register. */
 
@@ -472,7 +471,13 @@ static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
     }
 #endif
 
-    cmd_write(pdev, reg, cmd, data);
+    return cmd;
+}
+
+static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
+                            uint32_t cmd, void *data)
+{
+    cmd_write(pdev, reg, emulate_cmd_reg(pdev, cmd), data);
 }
 
 static void bar_write(const struct pci_dev *pdev, unsigned int reg,
@@ -646,6 +651,10 @@ static int init_bars(struct pci_dev *pdev)
         return -EOPNOTSUPP;
     }
 
+    /* Reset the command register for the guest. */
+    if ( !is_hwdom )
+        pci_conf_write16(pdev->sbdf, PCI_COMMAND, emulate_cmd_reg(pdev, 0));
+
     /* Setup a handler for the command register. */
     rc = vpci_add_register(pdev->vpci, vpci_hw_read16,
                            is_hwdom ? cmd_write : guest_cmd_write,
-- 
2.25.1




* [PATCH v4 10/11] vpci: add initial support for virtual PCI bus topology
From: Oleksandr Andrushchenko @ 2021-11-05  6:56 UTC
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Assign SBDFs with bus 0 to the PCI devices being passed through. The
resulting topology is one where the PCIe devices reside on bus 0 of
the root complex itself (embedded endpoints). The implementation is
limited to the 32 devices allowed on a single PCI bus.

Please note that at the moment only function 0 of a multifunction
device can be passed through.
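
For illustration, with segment and bus fixed at 0 the slot allocation
below yields predictable virtual SBDFs:

    /* First free bit in vpci_dev_assigned_map picks the device number. */
    new_dev_number = find_first_zero_bit(&d->vpci_dev_assigned_map,
                                         PCI_SLOT(~0) + 1); /* 0..31 */
    sbdf.sbdf = 0;
    sbdf.devfn = PCI_DEVFN(new_dev_number, 0);

    /* 1st assigned device -> 0000:00:00.0, 2nd -> 0000:00:01.0, ... */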

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v3:
 - make use of VPCI_INIT
 - moved all new code to vpci.c which belongs to it
 - changed open-coded 31 to PCI_SLOT(~0)
 - revisited locking: add dedicated vdev list's lock
 - added comments and code to reject multifunction devices with
   functions other than 0
 - updated comment about vpci_dev_next and made it unsigned int
 - implement roll back in case of error while assigning/deassigning devices
 - s/dom%pd/%pd
Since v2:
 - remove casts that are (a) malformed and (b) unnecessary
 - add new line for better readability
 - remove CONFIG_HAS_VPCI_GUEST_SUPPORT ifdef's as the relevant vPCI
    functions are now completely gated with this config
 - gate common code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/drivers/vpci/vpci.c | 52 +++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/sched.h |  8 +++++++
 xen/include/xen/vpci.h  |  4 ++++
 3 files changed, 64 insertions(+)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 45733300f00b..6657d236dc1a 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -101,6 +101,9 @@ int vpci_add_handlers(struct pci_dev *pdev)
 
     INIT_LIST_HEAD(&pdev->vpci->handlers);
     spin_lock_init(&pdev->vpci->lock);
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    pdev->vpci->guest_sbdf.sbdf = ~0;
+#endif
 
     header = &pdev->vpci->header;
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
@@ -125,6 +128,54 @@ int vpci_add_handlers(struct pci_dev *pdev)
 }
 
 #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+int vpci_add_virtual_device(struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    pci_sbdf_t sbdf;
+    unsigned long new_dev_number;
+
+    /*
+     * Each PCI bus supports up to 32 devices/slots, or up to 256
+     * functions with multi-function devices, which are not yet supported.
+     */
+    if ( pdev->info.is_extfn )
+    {
+        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
+                 &pdev->sbdf);
+        return -EOPNOTSUPP;
+    }
+
+    new_dev_number = find_first_zero_bit(&d->vpci_dev_assigned_map,
+                                         PCI_SLOT(~0) + 1);
+    if ( new_dev_number > PCI_SLOT(~0) )
+        return -ENOSPC;
+
+    set_bit(new_dev_number, &d->vpci_dev_assigned_map);
+
+    /*
+     * Both segment and bus number are 0:
+     *  - we emulate a single host bridge for the guest, i.e. segment 0
+     *  - with bus 0 the virtual devices are seen as embedded
+     *    endpoints behind the root complex
+     *
+     * TODO: add support for multi-function devices.
+     */
+    sbdf.sbdf = 0;
+    sbdf.devfn = PCI_DEVFN(new_dev_number, 0);
+    pdev->vpci->guest_sbdf = sbdf;
+
+    return 0;
+}
+REGISTER_VPCI_INIT(vpci_add_virtual_device, VPCI_PRIORITY_MIDDLE);
+
+static void vpci_remove_virtual_device(struct domain *d,
+                                       const struct pci_dev *pdev)
+{
+    clear_bit(pdev->vpci->guest_sbdf.dev, &d->vpci_dev_assigned_map);
+    pdev->vpci->guest_sbdf.sbdf = ~0;
+}
+
 /* Notify vPCI that device is assigned to guest. */
 int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
 {
@@ -150,6 +201,7 @@ int vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
     if ( is_system_domain(d) || !has_vpci(d) )
         return 0;
 
+    vpci_remove_virtual_device(d, pdev);
     vpci_remove_device_handlers(pdev);
 
     return 0;
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 28146ee404e6..10bff103317c 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -444,6 +444,14 @@ struct domain
 
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /*
+     * Bitmap of device numbers that are already used by the virtual
+     * PCI bus topology; used to assign a unique SBDF to the next
+     * passed-through virtual PCI device.
+     */
+    unsigned long vpci_dev_assigned_map;
+#endif
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 143f3166a730..9cc7071bc0af 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -145,6 +145,10 @@ struct vpci {
             struct vpci_arch_msix_entry arch;
         } entries[];
     } *msix;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /* Virtual SBDF of the device. */
+    pci_sbdf_t guest_sbdf;
+#endif
 #endif
 };
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH v4 11/11] xen/arm: translate virtual PCI bus topology for guests
  2021-11-05  6:56 [PATCH v4 00/11] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (9 preceding siblings ...)
  2021-11-05  6:56 ` [PATCH v4 10/11] vpci: add initial support for virtual PCI bus topology Oleksandr Andrushchenko
@ 2021-11-05  6:56 ` Oleksandr Andrushchenko
  2021-11-08 11:10   ` Jan Beulich
  2021-11-19 13:56 ` [PATCH v4 00/11] PCI devices passthrough on Arm, part 3 Jan Beulich
  11 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-05  6:56 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

There are three originators of PCI configuration space accesses:
1. The domain that owns the physical host bridge: MMIO handlers are
in place so we can update the vPCI register handlers with the values
written by the hardware domain, i.e. the physical view of the registers
vs the guest's view of the configuration space.
2. Guest accesses to the passed through PCI devices: we need to properly
map the virtual bus topology to the physical one, i.e. forward the
configuration space access to the corresponding physical device.
3. Accesses to the emulated host PCI bridge. The bridge doesn't exist
in the physical topology, i.e. it can't be mapped to any physical host
bridge, so all accesses to the host bridge itself need to be trapped
and emulated.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v3:
- revisit locking
- move code to vpci.c
Since v2:
 - pass struct domain instead of struct vcpu
 - constify arguments where possible
 - gate relevant code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/arch/arm/vpci.c     | 18 ++++++++++++++++++
 xen/drivers/vpci/vpci.c | 30 ++++++++++++++++++++++++++++++
 xen/include/xen/vpci.h  |  1 +
 3 files changed, 49 insertions(+)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 5a6ebd8b9868..6a37f770f8f0 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -41,6 +41,15 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
     /* data is needed to prevent a pointer cast on 32bit */
     unsigned long data;
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /*
+     * For the passed through devices we need to map their virtual SBDF
+     * to the physical PCI device being passed through.
+     */
+    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
+            return 1;
+#endif
+
     if ( vpci_ecam_read(sbdf, ECAM_REG_OFFSET(info->gpa),
                         1U << info->dabt.size, &data) )
     {
@@ -59,6 +68,15 @@ static int vpci_mmio_write(struct vcpu *v, mmio_info_t *info,
     struct pci_host_bridge *bridge = p;
     pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /*
+     * For the passed through devices we need to map their virtual SBDF
+     * to the physical PCI device being passed through.
+     */
+    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
+            return 1;
+#endif
+
     return vpci_ecam_write(sbdf, ECAM_REG_OFFSET(info->gpa),
                            1U << info->dabt.size, r);
 }
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 6657d236dc1a..cb0bde35b6a6 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -94,6 +94,7 @@ int vpci_add_handlers(struct pci_dev *pdev)
 
     /* We should not get here twice for the same device. */
     ASSERT(!pdev->vpci);
+    ASSERT(pcidevs_locked());
 
     pdev->vpci = xzalloc(struct vpci);
     if ( !pdev->vpci )
@@ -134,6 +135,8 @@ int vpci_add_virtual_device(struct pci_dev *pdev)
     pci_sbdf_t sbdf;
     unsigned long new_dev_number;
 
+    ASSERT(pcidevs_locked());
+
     /*
      * Each PCI bus supports up to 32 devices/slots, or up to 256
      * functions when multi-function devices are used (not yet supported).
@@ -172,10 +175,37 @@ REGISTER_VPCI_INIT(vpci_add_virtual_device, VPCI_PRIORITY_MIDDLE);
 static void vpci_remove_virtual_device(struct domain *d,
                                        const struct pci_dev *pdev)
 {
+    ASSERT(pcidevs_locked());
+
     clear_bit(pdev->vpci->guest_sbdf.dev, &d->vpci_dev_assigned_map);
     pdev->vpci->guest_sbdf.sbdf = ~0;
 }
 
+/*
+ * Find the physical device which is mapped to the virtual device
+ * and translate virtual SBDF to the physical one.
+ */
+bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf)
+{
+    const struct pci_dev *pdev;
+    bool found = false;
+
+    pcidevs_lock();
+    for_each_pdev( d, pdev )
+    {
+        if ( pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf )
+        {
+            /* Replace virtual SBDF with the physical one. */
+            *sbdf = pdev->sbdf;
+            found = true;
+            break;
+        }
+    }
+    pcidevs_unlock();
+
+    return found;
+}
+
 /* Notify vPCI that device is assigned to guest. */
 int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
 {
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 9cc7071bc0af..d5765301e442 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -274,6 +274,7 @@ static inline void vpci_cancel_pending(const struct pci_dev *pdev)
 /* Notify vPCI that device is assigned/de-assigned to/from guest. */
 int vpci_assign_device(struct domain *d, struct pci_dev *pdev);
 int vpci_deassign_device(struct domain *d, struct pci_dev *pdev);
+bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf);
 #else
 static inline int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
 {
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 11/11] xen/arm: translate virtual PCI bus topology for guests
  2021-11-05  6:56 ` [PATCH v4 11/11] xen/arm: translate virtual PCI bus topology for guests Oleksandr Andrushchenko
@ 2021-11-08 11:10   ` Jan Beulich
  2021-11-08 11:16     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-08 11:10 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> --- a/xen/arch/arm/vpci.c
> +++ b/xen/arch/arm/vpci.c
> @@ -41,6 +41,15 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
>      /* data is needed to prevent a pointer cast on 32bit */
>      unsigned long data;
>  
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    /*
> +     * For the passed through devices we need to map their virtual SBDF
> +     * to the physical PCI device being passed through.
> +     */
> +    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
> +            return 1;

Nit: Indentation.

> @@ -59,6 +68,15 @@ static int vpci_mmio_write(struct vcpu *v, mmio_info_t *info,
>      struct pci_host_bridge *bridge = p;
>      pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
>  
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    /*
> +     * For the passed through devices we need to map their virtual SBDF
> +     * to the physical PCI device being passed through.
> +     */
> +    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
> +            return 1;

Again.

> @@ -172,10 +175,37 @@ REGISTER_VPCI_INIT(vpci_add_virtual_device, VPCI_PRIORITY_MIDDLE);
>  static void vpci_remove_virtual_device(struct domain *d,
>                                         const struct pci_dev *pdev)
>  {
> +    ASSERT(pcidevs_locked());
> +
>      clear_bit(pdev->vpci->guest_sbdf.dev, &d->vpci_dev_assigned_map);
>      pdev->vpci->guest_sbdf.sbdf = ~0;
>  }
>  
> +/*
> + * Find the physical device which is mapped to the virtual device
> + * and translate virtual SBDF to the physical one.
> + */
> +bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf)

const struct domain *d ?

> +{
> +    const struct pci_dev *pdev;
> +    bool found = false;
> +
> +    pcidevs_lock();
> +    for_each_pdev( d, pdev )
> +    {
> +        if ( pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf )
> +        {
> +            /* Replace virtual SBDF with the physical one. */
> +            *sbdf = pdev->sbdf;
> +            found = true;
> +            break;
> +        }
> +    }
> +    pcidevs_unlock();

I think the description wants to at least mention that in principle
this is too coarse grained a lock, providing justification for why
it is deemed good enough nevertheless. (Personally, as expressed
before, I don't think the lock should be used here, but as long as
Roger agrees with you, you're fine.)

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 11/11] xen/arm: translate virtual PCI bus topology for guests
  2021-11-08 11:10   ` Jan Beulich
@ 2021-11-08 11:16     ` Oleksandr Andrushchenko
  2021-11-08 14:23       ` Roger Pau Monné
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-08 11:16 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, Oleksandr Andrushchenko,
	xen-devel



On 08.11.21 13:10, Jan Beulich wrote:
> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>> --- a/xen/arch/arm/vpci.c
>> +++ b/xen/arch/arm/vpci.c
>> @@ -41,6 +41,15 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
>>       /* data is needed to prevent a pointer cast on 32bit */
>>       unsigned long data;
>>   
>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +    /*
>> +     * For the passed through devices we need to map their virtual SBDF
>> +     * to the physical PCI device being passed through.
>> +     */
>> +    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
>> +            return 1;
> Nit: Indentation.
Ouch, sure
>
>> @@ -59,6 +68,15 @@ static int vpci_mmio_write(struct vcpu *v, mmio_info_t *info,
>>       struct pci_host_bridge *bridge = p;
>>       pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
>>   
>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +    /*
>> +     * For the passed through devices we need to map their virtual SBDF
>> +     * to the physical PCI device being passed through.
>> +     */
>> +    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
>> +            return 1;
> Again.
Will fix
>
>> @@ -172,10 +175,37 @@ REGISTER_VPCI_INIT(vpci_add_virtual_device, VPCI_PRIORITY_MIDDLE);
>>   static void vpci_remove_virtual_device(struct domain *d,
>>                                          const struct pci_dev *pdev)
>>   {
>> +    ASSERT(pcidevs_locked());
>> +
>>       clear_bit(pdev->vpci->guest_sbdf.dev, &d->vpci_dev_assigned_map);
>>       pdev->vpci->guest_sbdf.sbdf = ~0;
>>   }
>>   
>> +/*
>> + * Find the physical device which is mapped to the virtual device
>> + * and translate virtual SBDF to the physical one.
>> + */
>> +bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf)
> const struct domain *d ?
Will change
>
>> +{
>> +    const struct pci_dev *pdev;
>> +    bool found = false;
>> +
>> +    pcidevs_lock();
>> +    for_each_pdev( d, pdev )
>> +    {
>> +        if ( pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf )
>> +        {
>> +            /* Replace virtual SBDF with the physical one. */
>> +            *sbdf = pdev->sbdf;
>> +            found = true;
>> +            break;
>> +        }
>> +    }
>> +    pcidevs_unlock();
> I think the description wants to at least mention that in principle
> this is too coarse grained a lock, providing justification for why
> it is deemed good enough nevertheless. (Personally, as expressed
> before, I don't think the lock should be used here, but as long as
> Roger agrees with you, you're fine.)
Yes, makes sense
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 11/11] xen/arm: translate virtual PCI bus topology for guests
  2021-11-08 11:16     ` Oleksandr Andrushchenko
@ 2021-11-08 14:23       ` Roger Pau Monné
  2021-11-08 15:28         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Roger Pau Monné @ 2021-11-08 14:23 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel

On Mon, Nov 08, 2021 at 11:16:42AM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 08.11.21 13:10, Jan Beulich wrote:
> > On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> >> --- a/xen/arch/arm/vpci.c
> >> +++ b/xen/arch/arm/vpci.c
> >> @@ -41,6 +41,15 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
> >>       /* data is needed to prevent a pointer cast on 32bit */
> >>       unsigned long data;
> >>   
> >> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> >> +    /*
> >> +     * For the passed through devices we need to map their virtual SBDF
> >> +     * to the physical PCI device being passed through.
> >> +     */
> >> +    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
> >> +            return 1;
> > Nit: Indentation.
> Ouch, sure
> >
> >> @@ -59,6 +68,15 @@ static int vpci_mmio_write(struct vcpu *v, mmio_info_t *info,
> >>       struct pci_host_bridge *bridge = p;
> >>       pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
> >>   
> >> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> >> +    /*
> >> +     * For the passed through devices we need to map their virtual SBDF
> >> +     * to the physical PCI device being passed through.
> >> +     */
> >> +    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
> >> +            return 1;
> > Again.
> Will fix
> >
> >> @@ -172,10 +175,37 @@ REGISTER_VPCI_INIT(vpci_add_virtual_device, VPCI_PRIORITY_MIDDLE);
> >>   static void vpci_remove_virtual_device(struct domain *d,
> >>                                          const struct pci_dev *pdev)
> >>   {
> >> +    ASSERT(pcidevs_locked());
> >> +
> >>       clear_bit(pdev->vpci->guest_sbdf.dev, &d->vpci_dev_assigned_map);
> >>       pdev->vpci->guest_sbdf.sbdf = ~0;
> >>   }
> >>   
> >> +/*
> >> + * Find the physical device which is mapped to the virtual device
> >> + * and translate virtual SBDF to the physical one.
> >> + */
> >> +bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf)
> > const struct domain *d ?
> Will change
> >
> >> +{
> >> +    const struct pci_dev *pdev;
> >> +    bool found = false;
> >> +
> >> +    pcidevs_lock();
> >> +    for_each_pdev( d, pdev )
> >> +    {
> >> +        if ( pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf )
> >> +        {
> >> +            /* Replace virtual SBDF with the physical one. */
> >> +            *sbdf = pdev->sbdf;
> >> +            found = true;
> >> +            break;
> >> +        }
> >> +    }
> >> +    pcidevs_unlock();
> > I think the description wants to at least mention that in principle
> > this is too coarse grained a lock, providing justification for why
> > it is deemed good enough nevertheless. (Personally, as expressed
> > before, I don't think the lock should be used here, but as long as
> > Roger agrees with you, you're fine.)
> Yes, makes sense

Seeing as we don't take the lock in vpci_{read,write} I'm not sure we
need it here either then.

Since on Arm you will add devices to the guest at runtime (ie: while
there could already be PCI accesses), have you seen issues with not
taking the lock here?

I think the whole pcidevs locking needs to be clarified, as it's
currently a mess. If you want to take it here that's fine, but overall
there are issues in other places that would make removing a device at
runtime not reliable.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 11/11] xen/arm: translate virtual PCI bus topology for guests
  2021-11-08 14:23       ` Roger Pau Monné
@ 2021-11-08 15:28         ` Oleksandr Andrushchenko
  2021-11-24 11:31           ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-08 15:28 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 08.11.21 16:23, Roger Pau Monné wrote:
> On Mon, Nov 08, 2021 at 11:16:42AM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 08.11.21 13:10, Jan Beulich wrote:
>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>> --- a/xen/arch/arm/vpci.c
>>>> +++ b/xen/arch/arm/vpci.c
>>>> @@ -41,6 +41,15 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
>>>>        /* data is needed to prevent a pointer cast on 32bit */
>>>>        unsigned long data;
>>>>    
>>>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>>>> +    /*
>>>> +     * For the passed through devices we need to map their virtual SBDF
>>>> +     * to the physical PCI device being passed through.
>>>> +     */
>>>> +    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
>>>> +            return 1;
>>> Nit: Indentation.
>> Ouch, sure
>>>> @@ -59,6 +68,15 @@ static int vpci_mmio_write(struct vcpu *v, mmio_info_t *info,
>>>>        struct pci_host_bridge *bridge = p;
>>>>        pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
>>>>    
>>>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>>>> +    /*
>>>> +     * For the passed through devices we need to map their virtual SBDF
>>>> +     * to the physical PCI device being passed through.
>>>> +     */
>>>> +    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
>>>> +            return 1;
>>> Again.
>> Will fix
>>>> @@ -172,10 +175,37 @@ REGISTER_VPCI_INIT(vpci_add_virtual_device, VPCI_PRIORITY_MIDDLE);
>>>>    static void vpci_remove_virtual_device(struct domain *d,
>>>>                                           const struct pci_dev *pdev)
>>>>    {
>>>> +    ASSERT(pcidevs_locked());
>>>> +
>>>>        clear_bit(pdev->vpci->guest_sbdf.dev, &d->vpci_dev_assigned_map);
>>>>        pdev->vpci->guest_sbdf.sbdf = ~0;
>>>>    }
>>>>    
>>>> +/*
>>>> + * Find the physical device which is mapped to the virtual device
>>>> + * and translate virtual SBDF to the physical one.
>>>> + */
>>>> +bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf)
>>> const struct domain *d ?
>> Will change
>>>> +{
>>>> +    const struct pci_dev *pdev;
>>>> +    bool found = false;
>>>> +
>>>> +    pcidevs_lock();
>>>> +    for_each_pdev( d, pdev )
>>>> +    {
>>>> +        if ( pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf )
>>>> +        {
>>>> +            /* Replace virtual SBDF with the physical one. */
>>>> +            *sbdf = pdev->sbdf;
>>>> +            found = true;
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +    pcidevs_unlock();
>>> I think the description wants to at least mention that in principle
>>> this is too coarse grained a lock, providing justification for why
>>> it is deemed good enough nevertheless. (Personally, as expressed
>>> before, I don't think the lock should be used here, but as long as
>>> Roger agrees with you, you're fine.)
>> Yes, makes sense
> Seeing as we don't take the lock in vpci_{read,write} I'm not sure we
> need it here either then.
Yes, I was not feeling confident while adding locking
> Since on Arm you will add devices to the guest at runtime (ie: while
> there could already be PCI accesses), have you seen issues with not
> taking the lock here?
No, I didn't. Nor am I aware of Arm having had problems.
But this could just mean that we were lucky not to step on it.
>
> I think the whole pcidevs locking needs to be clarified, as it's
> currently a mess.
Agree
>   If you want to take it here that's fine, but overall
> there are issues in other places that would make removing a device at
> runtime not reliable.
So, what's the decision? I would leave the locks where I put them,
so at least this part won't need fixes.
>
> Thanks, Roger.
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-05  6:56 ` [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal Oleksandr Andrushchenko
@ 2021-11-15 16:56   ` Jan Beulich
  2021-11-16  7:32     ` Oleksandr Andrushchenko
  2021-11-17  8:28   ` Jan Beulich
  1 sibling, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-15 16:56 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> When a vPCI is removed for a PCI device it is possible that we have
> scheduled a delayed work for map/unmap operations for that device.
> For example, the following scenario can illustrate the problem:
> 
> pci_physdev_op
>    pci_add_device
>        init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>    iommu_add_device <- FAILS
>    vpci_remove_device -> xfree(pdev->vpci)
> 
> leave_hypervisor_to_guest
>    vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
> 
> For the hardware domain we continue execution as the worst that
> could happen is that MMIO mappings are left in place when the
> device has been deassigned

Is continuing safe in this case? I.e. isn't there the risk of a NULL
deref?

> For unprivileged domains that get a failure in the middle of a vPCI
> {un}map operation we need to destroy them, as we don't know in which
> state the p2m is. This can only happen in vpci_process_pending for
> DomUs as they won't be allowed to call pci_add_device.

You saying "we need to destroy them" made me look for a new domain_crash()
that you add, but there is none. What is this about?

> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>      return false;
>  }
>  
> +void vpci_cancel_pending(const struct pci_dev *pdev)
> +{
> +    struct vcpu *v = current;
> +
> +    /* Cancel any pending work now. */

Doesn't "any" include pending work on all vCPU-s of the guest, not
just current? Is current even relevant (as in: part of the correct
domain), considering ...

> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>          xfree(r);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +
> +    vpci_cancel_pending(pdev);

... this code path, when coming here from pci_{add,remove}_device()?

I can agree that there's a problem here, but I think you need to
properly (i.e. in a race free manner) drain pending work.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 03/11] vpci: make vpci registers removal a dedicated function
  2021-11-05  6:56 ` [PATCH v4 03/11] vpci: make vpci registers removal a dedicated function Oleksandr Andrushchenko
@ 2021-11-15 16:57   ` Jan Beulich
  2021-11-16  8:02     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-15 16:57 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> This is in preparation for dynamic assignment of the vpci register
> handlers depending on the domain: hwdom or guest.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v3:
> - remove all R-b's due to changes
> - s/vpci_remove_device_registers/vpci_remove_device_handlers

Should this maybe extend to the title then?

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 04/11] vpci: add hooks for PCI device assign/de-assign
  2021-11-05  6:56 ` [PATCH v4 04/11] vpci: add hooks for PCI device assign/de-assign Oleksandr Andrushchenko
@ 2021-11-15 17:06   ` Jan Beulich
  2021-11-16  9:38     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-15 17:06 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> When a PCI device gets assigned/de-assigned some work on vPCI side needs
> to be done for that device. Introduce a pair of hooks so vPCI can handle
> that.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v3:
>  - remove toolstack roll-back description from the commit message
>    as errors are to be handled with proper cleanup in Xen itself
>  - remove __must_check
>  - remove redundant rc check while assigning devices
>  - fix redundant CONFIG_HAS_VPCI check for CONFIG_HAS_VPCI_GUEST_SUPPORT
>  - use REGISTER_VPCI_INIT machinery to run required steps on device
>    init/assign: add run_vpci_init helper
> Since v2:
> - define CONFIG_HAS_VPCI_GUEST_SUPPORT so dead code is not compiled
>   for x86
> Since v1:
>  - constify struct pci_dev where possible
>  - do not open code is_system_domain()
>  - extended the commit message
> ---
>  xen/drivers/Kconfig           |  4 +++
>  xen/drivers/passthrough/pci.c |  6 ++++
>  xen/drivers/vpci/vpci.c       | 57 ++++++++++++++++++++++++++++++-----
>  xen/include/xen/vpci.h        | 16 ++++++++++
>  4 files changed, 75 insertions(+), 8 deletions(-)
> 
> diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
> index db94393f47a6..780490cf8e39 100644
> --- a/xen/drivers/Kconfig
> +++ b/xen/drivers/Kconfig
> @@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
>  config HAS_VPCI
>  	bool
>  
> +config HAS_VPCI_GUEST_SUPPORT
> +	bool
> +	depends on HAS_VPCI
> +
>  endmenu
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index a9d31293ac09..529a4f50aa80 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -873,6 +873,10 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>      if ( ret )
>          goto out;
>  
> +    ret = vpci_deassign_device(d, pdev);
> +    if ( ret )
> +        goto out;
> +
>      if ( pdev->domain == hardware_domain  )
>          pdev->quarantine = false;
>  
> @@ -1445,6 +1449,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>          rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag);
>      }
>  
> +    rc = vpci_assign_device(d, pdev);
> +
>   done:
>      if ( rc )
>          printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",

Don't you need to call vpci_deassign_device() higher up in this
function for the prior owner of the device?

> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +/* Notify vPCI that device is assigned to guest. */
> +int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
> +{
> +    int rc;
> +
> +    /* It only makes sense to assign for hwdom or guest domain. */
> +    if ( is_system_domain(d) || !has_vpci(d) )
> +        return 0;
> +
> +    vpci_remove_device_handlers(pdev);

This removes handlers in d, not in the prior owner domain. Is this
really intended? And if it really is meant to remove the new domain's
handlers (of which there ought to be none) - why is this necessary?

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-15 16:56   ` Jan Beulich
@ 2021-11-16  7:32     ` Oleksandr Andrushchenko
  2021-11-16  8:01       ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-16  7:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 15.11.21 18:56, Jan Beulich wrote:
> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> When a vPCI is removed for a PCI device it is possible that we have
>> scheduled a delayed work for map/unmap operations for that device.
>> For example, the following scenario can illustrate the problem:
>>
>> pci_physdev_op
>>     pci_add_device
>>         init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>     iommu_add_device <- FAILS
>>     vpci_remove_device -> xfree(pdev->vpci)
>>
>> leave_hypervisor_to_guest
>>     vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>
>> For the hardware domain we continue execution as the worst that
>> could happen is that MMIO mappings are left in place when the
>> device has been deassigned
> Is continuing safe in this case? I.e. isn't there the risk of a NULL
> deref?
I think it is safe to continue
>
>> For unprivileged domains that get a failure in the middle of a vPCI
>> {un}map operation we need to destroy them, as we don't know in which
>> state the p2m is. This can only happen in vpci_process_pending for
>> DomUs as they won't be allowed to call pci_add_device.
> You saying "we need to destroy them" made me look for a new domain_crash()
> that you add, but there is none. What is this about?
Yes, I guess we need to implicitly destroy the domain,
@Roger, are you ok with that?
>
>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>>       return false;
>>   }
>>   
>> +void vpci_cancel_pending(const struct pci_dev *pdev)
>> +{
>> +    struct vcpu *v = current;
>> +
>> +    /* Cancel any pending work now. */
> Doesn't "any" include pending work on all vCPU-s of the guest, not
> just current? Is current even relevant (as in: part of the correct
> domain), considering ...
>
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>>           xfree(r);
>>       }
>>       spin_unlock(&pdev->vpci->lock);
>> +
>> +    vpci_cancel_pending(pdev);
> ... this code path, when coming here from pci_{add,remove}_device()?
>
> I can agree that there's a problem here, but I think you need to
> properly (i.e. in a race free manner) drain pending work.
Yes, the code is inconsistent in this respect. I am thinking about:

void vpci_cancel_pending(const struct pci_dev *pdev)
{
     struct domain *d = pdev->domain;
     struct vcpu *v;

     /* Cancel any pending work now. */
     domain_lock(d);
     for_each_vcpu ( d, v )
     {
         vcpu_pause(v);
         if ( v->vpci.mem && v->vpci.pdev == pdev)
         {
             rangeset_destroy(v->vpci.mem);
             v->vpci.mem = NULL;
         }
         vcpu_unpause(v);
     }
     domain_unlock(d);
}

which seems to solve all the concerns. Is this what you mean?
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-16  7:32     ` Oleksandr Andrushchenko
@ 2021-11-16  8:01       ` Jan Beulich
  2021-11-16  8:23         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-16  8:01 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
> On 15.11.21 18:56, Jan Beulich wrote:
>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>
>>> When a vPCI is removed for a PCI device it is possible that we have
>>> scheduled a delayed work for map/unmap operations for that device.
>>> For example, the following scenario can illustrate the problem:
>>>
>>> pci_physdev_op
>>>     pci_add_device
>>>         init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>     iommu_add_device <- FAILS
>>>     vpci_remove_device -> xfree(pdev->vpci)
>>>
>>> leave_hypervisor_to_guest
>>>     vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>
>>> For the hardware domain we continue execution as the worst that
>>> could happen is that MMIO mappings are left in place when the
>>> device has been deassigned
>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>> deref?
> I think it is safe to continue

And why do you think so? I.e. why is there no race for Dom0 when there
is one for DomU?

>>> For unprivileged domains that get a failure in the middle of a vPCI
>>> {un}map operation we need to destroy them, as we don't know in which
>>> state the p2m is. This can only happen in vpci_process_pending for
>>> DomUs as they won't be allowed to call pci_add_device.
>> You saying "we need to destroy them" made me look for a new domain_crash()
>> that you add, but there is none. What is this about?
> Yes, I guess we need to implicitly destroy the domain,

What do you mean by "implicitly"?

>>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>>>       return false;
>>>   }
>>>   
>>> +void vpci_cancel_pending(const struct pci_dev *pdev)
>>> +{
>>> +    struct vcpu *v = current;
>>> +
>>> +    /* Cancel any pending work now. */
>> Doesn't "any" include pending work on all vCPU-s of the guest, not
>> just current? Is current even relevant (as in: part of the correct
>> domain), considering ...
>>
>>> --- a/xen/drivers/vpci/vpci.c
>>> +++ b/xen/drivers/vpci/vpci.c
>>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>>>           xfree(r);
>>>       }
>>>       spin_unlock(&pdev->vpci->lock);
>>> +
>>> +    vpci_cancel_pending(pdev);
>> ... this code path, when coming here from pci_{add,remove}_device()?
>>
>> I can agree that there's a problem here, but I think you need to
>> properly (i.e. in a race free manner) drain pending work.
> Yes, the code is inconsistent with this respect. I am thinking about:
> 
> void vpci_cancel_pending(const struct pci_dev *pdev)
> {
>      struct domain *d = pdev->domain;
>      struct vcpu *v;
> 
>      /* Cancel any pending work now. */
>      domain_lock(d);
>      for_each_vcpu ( d, v )
>      {
>          vcpu_pause(v);
>          if ( v->vpci.mem && v->vpci.pdev == pdev)

Nit: Same style issue as in the original patch.

>          {
>              rangeset_destroy(v->vpci.mem);
>              v->vpci.mem = NULL;
>          }
>          vcpu_unpause(v);
>      }
>      domain_unlock(d);
> }
> 
> which seems to solve all the concerns. Is this what you mean?

Something along these lines. I expect you will want to make use of
domain_pause_except_self(), and I don't understand the purpose of
acquiring the domain lock.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 03/11] vpci: make vpci registers removal a dedicated function
  2021-11-15 16:57   ` Jan Beulich
@ 2021-11-16  8:02     ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-16  8:02 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel



On 15.11.21 18:57, Jan Beulich wrote:
> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> This is in preparation for dynamic assignment of the vpci register
>> handlers depending on the domain: hwdom or guest.
>>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> ---
>> Since v3:
>> - remove all R-b's due to changes
>> - s/vpci_remove_device_registers/vpci_remove_device_handlers
> Should this maybe extend to the title then?
Sure, I will re-word
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-16  8:01       ` Jan Beulich
@ 2021-11-16  8:23         ` Oleksandr Andrushchenko
  2021-11-16 11:38           ` Jan Beulich
  2021-11-16 13:41           ` Oleksandr Andrushchenko
  0 siblings, 2 replies; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-16  8:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 16.11.21 10:01, Jan Beulich wrote:
> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>> On 15.11.21 18:56, Jan Beulich wrote:
>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>> scheduled a delayed work for map/unmap operations for that device.
>>>> For example, the following scenario can illustrate the problem:
>>>>
>>>> pci_physdev_op
>>>>      pci_add_device
>>>>          init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>      iommu_add_device <- FAILS
>>>>      vpci_remove_device -> xfree(pdev->vpci)
>>>>
>>>> leave_hypervisor_to_guest
>>>>      vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>
>>>> For the hardware domain we continue execution as the worst that
>>>> could happen is that MMIO mappings are left in place when the
>>>> device has been deassigned
>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>> deref?
>> I think it is safe to continue
> And why do you think so? I.e. why is there no race for Dom0 when there
> is one for DomU?
Well, then we need to use a lock to synchronize the two.
I guess this needs to be the pcidevs lock, unfortunately.
>
>>>> For unprivileged domains that get a failure in the middle of a vPCI
>>>> {un}map operation we need to destroy them, as we don't know in which
>>>> state the p2m is. This can only happen in vpci_process_pending for
>>>> DomUs as they won't be allowed to call pci_add_device.
>>> You saying "we need to destroy them" made me look for a new domain_crash()
>>> that you add, but there is none. What is this about?
>> Yes, I guess we need to implicitly destroy the domain,
> What do you mean by "implicitly"?
@@ -151,14 +151,18 @@ bool vpci_process_pending(struct vcpu *v)

          vpci_cancel_pending(v->vpci.pdev);
          if ( rc )
+        {
              /*
               * FIXME: in case of failure remove the device from the domain.
               * Note that there might still be leftover mappings. While this is
+             * safe for Dom0, for DomUs the domain needs to be killed in order
+             * to avoid leaking stale p2m mappings on failure.
               */
              vpci_remove_device(v->vpci.pdev);
+
+            if ( !is_hardware_domain(v->domain) )
+                domain_crash(v->domain);

>
>>>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>>>>        return false;
>>>>    }
>>>>    
>>>> +void vpci_cancel_pending(const struct pci_dev *pdev)
>>>> +{
>>>> +    struct vcpu *v = current;
>>>> +
>>>> +    /* Cancel any pending work now. */
>>> Doesn't "any" include pending work on all vCPU-s of the guest, not
>>> just current? Is current even relevant (as in: part of the correct
>>> domain), considering ...
>>>
>>>> --- a/xen/drivers/vpci/vpci.c
>>>> +++ b/xen/drivers/vpci/vpci.c
>>>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>>>>            xfree(r);
>>>>        }
>>>>        spin_unlock(&pdev->vpci->lock);
>>>> +
>>>> +    vpci_cancel_pending(pdev);
>>> ... this code path, when coming here from pci_{add,remove}_device()?
>>>
>>> I can agree that there's a problem here, but I think you need to
>>> properly (i.e. in a race free manner) drain pending work.
>> Yes, the code is inconsistent with this respect. I am thinking about:
>>
>> void vpci_cancel_pending(const struct pci_dev *pdev)
>> {
>>       struct domain *d = pdev->domain;
>>       struct vcpu *v;
>>
>>       /* Cancel any pending work now. */
>>       domain_lock(d);
>>       for_each_vcpu ( d, v )
>>       {
>>           vcpu_pause(v);
>>           if ( v->vpci.mem && v->vpci.pdev == pdev)
> Nit: Same style issue as in the original patch.
Will fix
>
>>           {
>>               rangeset_destroy(v->vpci.mem);
>>               v->vpci.mem = NULL;
>>           }
>>           vcpu_unpause(v);
>>       }
>>       domain_unlock(d);
>> }
>>
>> which seems to solve all the concerns. Is this what you mean?
> Something along these lines. I expect you will want to make use of
> domain_pause_except_self(),
Yes, this is what is needed here, thanks. The only question is that

int domain_pause_except_self(struct domain *d)
{
[snip]
         /* Avoid racing with other vcpus which may want to be pausing us */
         if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
             return -ERESTART;

so it is not clear what we do in case of -ERESTART: do we want to spin?
Otherwise we will leave the job not done, effectively not canceling the
pending work. Any idea other than spinning?
>   and I don't understand the purpose of
> acquiring the domain lock.
You are right, no need
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 04/11] vpci: add hooks for PCI device assign/de-assign
  2021-11-15 17:06   ` Jan Beulich
@ 2021-11-16  9:38     ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-16  9:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 15.11.21 19:06, Jan Beulich wrote:
> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> When a PCI device gets assigned/de-assigned some work on vPCI side needs
>> to be done for that device. Introduce a pair of hooks so vPCI can handle
>> that.
>>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> ---
>> Since v3:
>>   - remove toolstack roll-back description from the commit message
>>     as errors are to be handled with proper cleanup in Xen itself
>>   - remove __must_check
>>   - remove redundant rc check while assigning devices
>>   - fix redundant CONFIG_HAS_VPCI check for CONFIG_HAS_VPCI_GUEST_SUPPORT
>>   - use REGISTER_VPCI_INIT machinery to run required steps on device
>>     init/assign: add run_vpci_init helper
>> Since v2:
>> - define CONFIG_HAS_VPCI_GUEST_SUPPORT so dead code is not compiled
>>    for x86
>> Since v1:
>>   - constify struct pci_dev where possible
>>   - do not open code is_system_domain()
>>   - extended the commit message
>> ---
>>   xen/drivers/Kconfig           |  4 +++
>>   xen/drivers/passthrough/pci.c |  6 ++++
>>   xen/drivers/vpci/vpci.c       | 57 ++++++++++++++++++++++++++++++-----
>>   xen/include/xen/vpci.h        | 16 ++++++++++
>>   4 files changed, 75 insertions(+), 8 deletions(-)
>>
>> diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
>> index db94393f47a6..780490cf8e39 100644
>> --- a/xen/drivers/Kconfig
>> +++ b/xen/drivers/Kconfig
>> @@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
>>   config HAS_VPCI
>>   	bool
>>   
>> +config HAS_VPCI_GUEST_SUPPORT
>> +	bool
>> +	depends on HAS_VPCI
>> +
>>   endmenu
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index a9d31293ac09..529a4f50aa80 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -873,6 +873,10 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>>       if ( ret )
>>           goto out;
>>   
>> +    ret = vpci_deassign_device(d, pdev);
>> +    if ( ret )
>> +        goto out;
>> +
>>       if ( pdev->domain == hardware_domain  )
>>           pdev->quarantine = false;
>>   
>> @@ -1445,6 +1449,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>           rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag);
>>       }
>>   
>> +    rc = vpci_assign_device(d, pdev);
>> +
>>    done:
>>       if ( rc )
>>           printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
> Don't you need to call vpci_deassign_device() higher up in this
> function for the prior owner of the device?
Yes, this does make more sense, e.g. calling vpci_deassign_device(pdev->domain, pdev)
before the IOMMU assignment, which updates pdev->domain, and then
assigning vPCI for the new domain.
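
Something along these lines (untested sketch; reusing the names from the
hunks above):

    /* Drop the prior owner's vPCI state before pdev->domain changes. */
    rc = vpci_deassign_device(pdev->domain, pdev);
    if ( rc )
        goto done;

    rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag);
    if ( rc )
        goto done;

    /* Set up vPCI state for the new owner. */
    rc = vpci_assign_device(d, pdev);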
>
>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +/* Notify vPCI that device is assigned to guest. */
>> +int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
>> +{
>> +    int rc;
>> +
>> +    /* It only makes sense to assign for hwdom or guest domain. */
>> +    if ( is_system_domain(d) || !has_vpci(d) )
>> +        return 0;
>> +
>> +    vpci_remove_device_handlers(pdev);
> This removes handlers in d, not in the prior owner domain. Is this
> really intended? And if it really is meant to remove the new domain's
> handlers (of which there ought to be none) - why is this necessary?
the above won't be needed
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-16  8:23         ` Oleksandr Andrushchenko
@ 2021-11-16 11:38           ` Jan Beulich
  2021-11-16 13:27             ` Oleksandr Andrushchenko
  2021-11-16 13:41           ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-16 11:38 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 16.11.2021 09:23, Oleksandr Andrushchenko wrote:
> 
> 
> On 16.11.21 10:01, Jan Beulich wrote:
>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>>>>>        return false;
>>>>>    }
>>>>>    
>>>>> +void vpci_cancel_pending(const struct pci_dev *pdev)
>>>>> +{
>>>>> +    struct vcpu *v = current;
>>>>> +
>>>>> +    /* Cancel any pending work now. */
>>>> Doesn't "any" include pending work on all vCPU-s of the guest, not
>>>> just current? Is current even relevant (as in: part of the correct
>>>> domain), considering ...
>>>>
>>>>> --- a/xen/drivers/vpci/vpci.c
>>>>> +++ b/xen/drivers/vpci/vpci.c
>>>>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>>>>>            xfree(r);
>>>>>        }
>>>>>        spin_unlock(&pdev->vpci->lock);
>>>>> +
>>>>> +    vpci_cancel_pending(pdev);
>>>> ... this code path, when coming here from pci_{add,remove}_device()?
>>>>
>>>> I can agree that there's a problem here, but I think you need to
>>>> properly (i.e. in a race free manner) drain pending work.
>>> Yes, the code is inconsistent in this respect. I am thinking about:
>>>
>>> void vpci_cancel_pending(const struct pci_dev *pdev)
>>> {
>>>       struct domain *d = pdev->domain;
>>>       struct vcpu *v;
>>>
>>>       /* Cancel any pending work now. */
>>>       domain_lock(d);
>>>       for_each_vcpu ( d, v )
>>>       {
>>>           vcpu_pause(v);
>>>           if ( v->vpci.mem && v->vpci.pdev == pdev)
>> Nit: Same style issue as in the original patch.
> Will fix
>>
>>>           {
>>>               rangeset_destroy(v->vpci.mem);
>>>               v->vpci.mem = NULL;
>>>           }
>>>           vcpu_unpause(v);
>>>       }
>>>       domain_unlock(d);
>>> }
>>>
>>> which seems to solve all the concerns. Is this what you mean?
>> Something along these lines. I expect you will want to make use of
>> domain_pause_except_self(),
> Yes, this is what is needed here, thanks. The only question is that
> 
> int domain_pause_except_self(struct domain *d)
> {
> [snip]
>          /* Avoid racing with other vcpus which may want to be pausing us */
>          if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
>              return -ERESTART;
> 
> so it is not clear what we do in case of -ERESTART: do we want to spin?
> Otherwise we will leave the job not done, effectively not canceling the
> pending work. Any idea other than spinning?

Depends on the call chain you come through. There may need to be some
rearrangements such that you may be able to preempt the enclosing
hypercall.
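
E.g. (rough and untested, assuming the enclosing physdev op can simply
be restarted, with cmd/arg being that hypercall's parameters):

    rc = domain_pause_except_self(d);
    if ( rc == -ERESTART )
        /* Have the guest retry the hypercall instead of spinning. */
        return hypercall_create_continuation(__HYPERVISOR_physdev_op,
                                             "ih", cmd, arg);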

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-16 11:38           ` Jan Beulich
@ 2021-11-16 13:27             ` Oleksandr Andrushchenko
  2021-11-16 14:11               ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-16 13:27 UTC (permalink / raw)
  To: Jan Beulich, roger.pau
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 16.11.21 13:38, Jan Beulich wrote:
> On 16.11.2021 09:23, Oleksandr Andrushchenko wrote:
>>
>> On 16.11.21 10:01, Jan Beulich wrote:
>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>>>>>>         return false;
>>>>>>     }
>>>>>>     
>>>>>> +void vpci_cancel_pending(const struct pci_dev *pdev)
>>>>>> +{
>>>>>> +    struct vcpu *v = current;
>>>>>> +
>>>>>> +    /* Cancel any pending work now. */
>>>>> Doesn't "any" include pending work on all vCPU-s of the guest, not
>>>>> just current? Is current even relevant (as in: part of the correct
>>>>> domain), considering ...
>>>>>
>>>>>> --- a/xen/drivers/vpci/vpci.c
>>>>>> +++ b/xen/drivers/vpci/vpci.c
>>>>>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>>>>>>             xfree(r);
>>>>>>         }
>>>>>>         spin_unlock(&pdev->vpci->lock);
>>>>>> +
>>>>>> +    vpci_cancel_pending(pdev);
>>>>> ... this code path, when coming here from pci_{add,remove}_device()?
>>>>>
>>>>> I can agree that there's a problem here, but I think you need to
>>>>> properly (i.e. in a race free manner) drain pending work.
>>>> Yes, the code is inconsistent in this respect. I am thinking about:
>>>>
>>>> void vpci_cancel_pending(const struct pci_dev *pdev)
>>>> {
>>>>        struct domain *d = pdev->domain;
>>>>        struct vcpu *v;
>>>>
>>>>        /* Cancel any pending work now. */
>>>>        domain_lock(d);
>>>>        for_each_vcpu ( d, v )
>>>>        {
>>>>            vcpu_pause(v);
>>>>            if ( v->vpci.mem && v->vpci.pdev == pdev)
>>> Nit: Same style issue as in the original patch.
>> Will fix
>>>>            {
>>>>                rangeset_destroy(v->vpci.mem);
>>>>                v->vpci.mem = NULL;
>>>>            }
>>>>            vcpu_unpause(v);
>>>>        }
>>>>        domain_unlock(d);
>>>> }
>>>>
>>>> which seems to solve all the concerns. Is this what you mean?
>>> Something along these lines. I expect you will want to make use of
>>> domain_pause_except_self(),
>> Yes, this is what is needed here, thanks. The only question is that
>>
>> int domain_pause_except_self(struct domain *d)
>> {
>> [snip]
>>           /* Avoid racing with other vcpus which may want to be pausing us */
>>           if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
>>               return -ERESTART;
>>
>> so it is not clear what we do in case of -ERESTART: do we want to spin?
>> Otherwise we will leave the job not done, effectively not canceling the
>> pending work. Any idea other than spinning?
> Depends on the call chain you come through. There may need to be some
> rearrangements such that you may be able to preempt the enclosing
> hypercall.
Well, there are three places which may lead to the pending work
needs to be canceled:

MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> vpci_cancel_pending (on modify_bars fail path)

PHYSDEVOP_pci_device_add -> pci_add_device (error path) -> vpci_remove_device -> vpci_cancel_pending

PHYSDEVOP_pci_device_remove -> pci_remove_device -> vpci_remove_device -> vpci_cancel_pending

So, taking into account the MMIO path, I am thinking about the below code:

     /*
      * Cancel any pending work now.
      *
      * FIXME: this can be called from an MMIO trap handler's error
      * path, so we cannot just return an error code here for upper
      * layers to handle. The best we can do is to still try
      * removing the range sets.
      */
     while ( (rc = domain_pause_except_self(d)) == -ERESTART )
         cpu_relax();

     if ( rc )
         printk(XENLOG_G_ERR
                "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
                &pdev->sbdf, pdev->domain, rc);

I am not sure how to handle this otherwise
@Roger, do you see any other good way?
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-16  8:23         ` Oleksandr Andrushchenko
  2021-11-16 11:38           ` Jan Beulich
@ 2021-11-16 13:41           ` Oleksandr Andrushchenko
  2021-11-16 14:12             ` Jan Beulich
  1 sibling, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-16 13:41 UTC (permalink / raw)
  To: Jan Beulich, roger.pau
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 16.11.21 10:23, Oleksandr Andrushchenko wrote:
>
> On 16.11.21 10:01, Jan Beulich wrote:
>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>
>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>> For example, the following scenario can illustrate the problem:
>>>>>
>>>>> pci_physdev_op
>>>>>       pci_add_device
>>>>>           init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>       iommu_add_device <- FAILS
>>>>>       vpci_remove_device -> xfree(pdev->vpci)
>>>>>
>>>>> leave_hypervisor_to_guest
>>>>>       vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>
>>>>> For the hardware domain we continue execution as the worse that
>>>>> could happen is that MMIO mappings are left in place when the
>>>>> device has been deassigned
>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>>> deref?
>>> I think it is safe to continue
>> And why do you think so? I.e. why is there no race for Dom0 when there
>> is one for DomU?
> Well, then we need to use a lock to synchronize the two.
> I guess this needs to be pci devs lock unfortunately
The parties involved in deferred work and its cancellation:

MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map

Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending

x86: two places -> hvm_do_resume -> vpci_process_pending

So, both defer_map and vpci_process_pending need to be synchronized with
pcidevs_{lock|unlock}.
As both functions existed before the code I introduce, I would prefer this to
be a dedicated patch for v5 of the series.
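
Just to illustrate the idea, a rough sketch (not a tested change, and
assuming vpci_process_pending keeps its current shape):

bool vpci_process_pending(struct vcpu *v)
{
    if ( v->vpci.mem )
    {
        /*
         * Take the same lock that pci_{add,remove}_device hold while
         * creating/destroying pdev->vpci, so the deferred state cannot
         * be freed under our feet.
         */
        pcidevs_lock();
        /* ... existing map/unmap handling of v->vpci.mem ... */
        pcidevs_unlock();
    }

    return false;
}

with defer_map doing the same on the producer side.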

Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-16 13:27             ` Oleksandr Andrushchenko
@ 2021-11-16 14:11               ` Jan Beulich
  0 siblings, 0 replies; 101+ messages in thread
From: Jan Beulich @ 2021-11-16 14:11 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau

On 16.11.2021 14:27, Oleksandr Andrushchenko wrote:
> 
> 
> On 16.11.21 13:38, Jan Beulich wrote:
>> On 16.11.2021 09:23, Oleksandr Andrushchenko wrote:
>>>
>>> On 16.11.21 10:01, Jan Beulich wrote:
>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>>>>>>>         return false;
>>>>>>>     }
>>>>>>>     
>>>>>>> +void vpci_cancel_pending(const struct pci_dev *pdev)
>>>>>>> +{
>>>>>>> +    struct vcpu *v = current;
>>>>>>> +
>>>>>>> +    /* Cancel any pending work now. */
>>>>>> Doesn't "any" include pending work on all vCPU-s of the guest, not
>>>>>> just current? Is current even relevant (as in: part of the correct
>>>>>> domain), considering ...
>>>>>>
>>>>>>> --- a/xen/drivers/vpci/vpci.c
>>>>>>> +++ b/xen/drivers/vpci/vpci.c
>>>>>>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>>>>>>>             xfree(r);
>>>>>>>         }
>>>>>>>         spin_unlock(&pdev->vpci->lock);
>>>>>>> +
>>>>>>> +    vpci_cancel_pending(pdev);
>>>>>> ... this code path, when coming here from pci_{add,remove}_device()?
>>>>>>
>>>>>> I can agree that there's a problem here, but I think you need to
>>>>>> properly (i.e. in a race free manner) drain pending work.
>>>>> Yes, the code is inconsistent with this respect. I am thinking about:
>>>>>
>>>>> void vpci_cancel_pending(const struct pci_dev *pdev)
>>>>> {
>>>>>        struct domain *d = pdev->domain;
>>>>>        struct vcpu *v;
>>>>>
>>>>>        /* Cancel any pending work now. */
>>>>>        domain_lock(d);
>>>>>        for_each_vcpu ( d, v )
>>>>>        {
>>>>>            vcpu_pause(v);
>>>>>            if ( v->vpci.mem && v->vpci.pdev == pdev)
>>>> Nit: Same style issue as in the original patch.
>>> Will fix
>>>>>            {
>>>>>                rangeset_destroy(v->vpci.mem);
>>>>>                v->vpci.mem = NULL;
>>>>>            }
>>>>>            vcpu_unpause(v);
>>>>>        }
>>>>>        domain_unlock(d);
>>>>> }
>>>>>
>>>>> which seems to solve all the concerns. Is this what you mean?
>>>> Something along these lines. I expect you will want to make use of
>>>> domain_pause_except_self(),
>>> Yes, this is what is needed here, thanks. The only question is that
>>>
>>> int domain_pause_except_self(struct domain *d)
>>> {
>>> [snip]
>>>           /* Avoid racing with other vcpus which may want to be pausing us */
>>>           if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
>>>               return -ERESTART;
>>>
>>> so it is not clear what do we do in case of -ERESTART: do we want to spin?
>>> Otherwise we will leave the job not done effectively not canceling the
>>> pending work. Any idea other then spinning?
>> Depends on the call chain you come through. There may need to be some
>> rearrangements such that you may be able to preempt the enclosing
>> hypercall.
> Well, there are three places which may lead to the pending work
> needs to be canceled:
> 
> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> vpci_cancel_pending (on modify_bars fail path)
> 
> PHYSDEVOP_pci_device_add -> pci_add_device (error path) -> vpci_remove_device -> vpci_cancel_pending
> 
> PHYSDEVOP_pci_device_remove -> pci_remove_device -> vpci_remove_device -> vpci_cancel_pending
> 
> So, taking into account the MMIO path, I think about the below code
> 
>      /*
>       * Cancel any pending work now.
>       *
>       * FIXME: this can be called from an MMIO trap handler's error
>       * path, so we cannot just return an error code here, so upper
>       * layers can handle it. The best we can do is to still try
>       * removing the range sets.
>       */

The MMIO trap path should simply exit to the guest to have the insn
retried. With the vPCI removed, a subsequent emulation attempt will
no longer be able to resolve the address to a BDF, and hence will do
whatever it would do for an attempted access to config space not
belonging to any device.

For the two physdevops it may not be possible to preempt the
hypercalls without further code adjustments.
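
(The usual way to preempt would be along these lines - just a sketch;
the missing piece is plumbing -ERESTART through pci_{add,remove}_device
up to do_physdev_op():)

    if ( rc == -ERESTART )
        rc = hypercall_create_continuation(__HYPERVISOR_physdev_op,
                                           "ih", cmd, arg);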

Jan

>      while ( (rc = domain_pause_except_self(d)) == -ERESTART )
>          cpu_relax();
> 
>      if ( rc )
>          printk(XENLOG_G_ERR
>                 "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
>                 &pdev->sbdf, pdev->domain, rc);
> 
> I am not sure how to handle this otherwise
> @Roger, do you see any other good way?
>>
>> Jan
>>
> Thank you,
> Oleksandr
> 



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-16 13:41           ` Oleksandr Andrushchenko
@ 2021-11-16 14:12             ` Jan Beulich
  2021-11-16 14:24               ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-16 14:12 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau

On 16.11.2021 14:41, Oleksandr Andrushchenko wrote:
> 
> 
> On 16.11.21 10:23, Oleksandr Andrushchenko wrote:
>>
>> On 16.11.21 10:01, Jan Beulich wrote:
>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>
>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>
>>>>>> pci_physdev_op
>>>>>>       pci_add_device
>>>>>>           init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>       iommu_add_device <- FAILS
>>>>>>       vpci_remove_device -> xfree(pdev->vpci)
>>>>>>
>>>>>> leave_hypervisor_to_guest
>>>>>>       vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>
>>>>>> For the hardware domain we continue execution as the worse that
>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>> device has been deassigned
>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>>>> deref?
>>>> I think it is safe to continue
>>> And why do you think so? I.e. why is there no race for Dom0 when there
>>> is one for DomU?
>> Well, then we need to use a lock to synchronize the two.
>> I guess this needs to be pci devs lock unfortunately
> The parties involved in deferred work and its cancellation:
> 
> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map
> 
> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending
> 
> x86: two places -> hvm_do_resume -> vpci_process_pending
> 
> So, both defer_map and vpci_process_pending need to be synchronized with
> pcidevs_{lock|unlock).

If I were an Arm maintainer, I'm afraid I would object to the pcidevs lock
getting used in leave_hypervisor_to_guest.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-16 14:12             ` Jan Beulich
@ 2021-11-16 14:24               ` Oleksandr Andrushchenko
  2021-11-16 14:37                 ` Oleksandr Andrushchenko
                                   ` (2 more replies)
  0 siblings, 3 replies; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-16 14:24 UTC (permalink / raw)
  To: Jan Beulich, sstabellini, julien
  Cc: Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel, roger.pau



On 16.11.21 16:12, Jan Beulich wrote:
> On 16.11.2021 14:41, Oleksandr Andrushchenko wrote:
>>
>> On 16.11.21 10:23, Oleksandr Andrushchenko wrote:
>>> On 16.11.21 10:01, Jan Beulich wrote:
>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>
>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>
>>>>>>> pci_physdev_op
>>>>>>>        pci_add_device
>>>>>>>            init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>        iommu_add_device <- FAILS
>>>>>>>        vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>
>>>>>>> leave_hypervisor_to_guest
>>>>>>>        vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>
>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>> device has been deassigned
>>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>>>>> deref?
>>>>> I think it is safe to continue
>>>> And why do you think so? I.e. why is there no race for Dom0 when there
>>>> is one for DomU?
>>> Well, then we need to use a lock to synchronize the two.
>>> I guess this needs to be pci devs lock unfortunately
>> The parties involved in deferred work and its cancellation:
>>
>> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map
>>
>> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending
>>
>> x86: two places -> hvm_do_resume -> vpci_process_pending
>>
>> So, both defer_map and vpci_process_pending need to be synchronized with
>> pcidevs_{lock|unlock).
> If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock
> getting used in leave_hypervisor_to_guest.
I do agree this is really not good, but it seems my choices are limited.
@Stefano, @Julien, do you see any better way of doing that?

We were thinking about introducing a dedicated lock for vpci [1],
but finally decided to use pcidevs_lock for now
> Jan
>

[1] https://lore.kernel.org/all/afe47397-a792-6b0c-0a89-b47c523e50d9@epam.com/

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-16 14:24               ` Oleksandr Andrushchenko
@ 2021-11-16 14:37                 ` Oleksandr Andrushchenko
  2021-11-16 16:09                 ` Jan Beulich
  2021-11-16 18:02                 ` Julien Grall
  2 siblings, 0 replies; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-16 14:37 UTC (permalink / raw)
  To: Jan Beulich, sstabellini, julien
  Cc: Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel, roger.pau, Oleksandr Andrushchenko



On 16.11.21 16:24, Oleksandr Andrushchenko wrote:
>
> On 16.11.21 16:12, Jan Beulich wrote:
>> On 16.11.2021 14:41, Oleksandr Andrushchenko wrote:
>>> On 16.11.21 10:23, Oleksandr Andrushchenko wrote:
>>>> On 16.11.21 10:01, Jan Beulich wrote:
>>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>
>>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>>
>>>>>>>> pci_physdev_op
>>>>>>>>         pci_add_device
>>>>>>>>             init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>>         iommu_add_device <- FAILS
>>>>>>>>         vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>>
>>>>>>>> leave_hypervisor_to_guest
>>>>>>>>         vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>>
>>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>>> device has been deassigned
>>>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>>>>>> deref?
>>>>>> I think it is safe to continue
>>>>> And why do you think so? I.e. why is there no race for Dom0 when there
>>>>> is one for DomU?
>>>> Well, then we need to use a lock to synchronize the two.
>>>> I guess this needs to be pci devs lock unfortunately
>>> The parties involved in deferred work and its cancellation:
>>>
>>> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map
>>>
>>> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending
>>>
>>> x86: two places -> hvm_do_resume -> vpci_process_pending
>>>
>>> So, both defer_map and vpci_process_pending need to be synchronized with
>>> pcidevs_{lock|unlock).
>> If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock
>> getting used in leave_hypervisor_to_guest.
> I do agree this is really not good, but it seems I am limited in choices.
> @Stefano, @Julien, do you see any better way of doing that?
>
> We were thinking about introducing a dedicated lock for vpci [1],
> but finally decided to use pcidevs_lock for now
Or even better and simpler: we could just use pdev->vpci->lock to
protect vpci_process_pending vs the MMIO trap handlers, which already
use it.
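
I.e. something along these lines (just a sketch; it obviously relies
on vpci_remove_device taking the very same lock before freeing
pdev->vpci, otherwise the lock itself can disappear under us):

bool vpci_process_pending(struct vcpu *v)
{
    if ( v->vpci.mem )
    {
        struct pci_dev *pdev = v->vpci.pdev;

        spin_lock(&pdev->vpci->lock);
        /* ... existing map/unmap handling of v->vpci.mem ... */
        spin_unlock(&pdev->vpci->lock);
    }

    return false;
}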
>> Jan
>>
> [1] https://lore.kernel.org/all/afe47397-a792-6b0c-0a89-b47c523e50d9@epam.com/

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-16 14:24               ` Oleksandr Andrushchenko
  2021-11-16 14:37                 ` Oleksandr Andrushchenko
@ 2021-11-16 16:09                 ` Jan Beulich
  2021-11-16 18:02                 ` Julien Grall
  2 siblings, 0 replies; 101+ messages in thread
From: Jan Beulich @ 2021-11-16 16:09 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel, roger.pau, sstabellini, julien

On 16.11.2021 15:24, Oleksandr Andrushchenko wrote:
> 
> 
> On 16.11.21 16:12, Jan Beulich wrote:
>> On 16.11.2021 14:41, Oleksandr Andrushchenko wrote:
>>>
>>> On 16.11.21 10:23, Oleksandr Andrushchenko wrote:
>>>> On 16.11.21 10:01, Jan Beulich wrote:
>>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>
>>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>>
>>>>>>>> pci_physdev_op
>>>>>>>>        pci_add_device
>>>>>>>>            init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>>        iommu_add_device <- FAILS
>>>>>>>>        vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>>
>>>>>>>> leave_hypervisor_to_guest
>>>>>>>>        vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>>
>>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>>> device has been deassigned
>>>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>>>>>> deref?
>>>>>> I think it is safe to continue
>>>>> And why do you think so? I.e. why is there no race for Dom0 when there
>>>>> is one for DomU?
>>>> Well, then we need to use a lock to synchronize the two.
>>>> I guess this needs to be pci devs lock unfortunately
>>> The parties involved in deferred work and its cancellation:
>>>
>>> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map
>>>
>>> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending
>>>
>>> x86: two places -> hvm_do_resume -> vpci_process_pending
>>>
>>> So, both defer_map and vpci_process_pending need to be synchronized with
>>> pcidevs_{lock|unlock).
>> If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock
>> getting used in leave_hypervisor_to_guest.
> I do agree this is really not good, but it seems I am limited in choices.
> @Stefano, @Julien, do you see any better way of doing that?
> 
> We were thinking about introducing a dedicated lock for vpci [1],
> but finally decided to use pcidevs_lock for now

Even that locking model might be too heavyweight for this purpose,
unless an r/w lock was intended. The problem would still be that
all guest exits would be serialized within a domain. (That's still
better than serializing all guest exits on the host, of course.)

Jan

> [1] https://lore.kernel.org/all/afe47397-a792-6b0c-0a89-b47c523e50d9@epam.com/
> 



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-16 14:24               ` Oleksandr Andrushchenko
  2021-11-16 14:37                 ` Oleksandr Andrushchenko
  2021-11-16 16:09                 ` Jan Beulich
@ 2021-11-16 18:02                 ` Julien Grall
  2021-11-18 12:57                   ` Oleksandr Andrushchenko
  2 siblings, 1 reply; 101+ messages in thread
From: Julien Grall @ 2021-11-16 18:02 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Jan Beulich, sstabellini
  Cc: Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel, roger.pau

Hi Oleksandr,

On 16/11/2021 14:24, Oleksandr Andrushchenko wrote:
> 
> 
> On 16.11.21 16:12, Jan Beulich wrote:
>> On 16.11.2021 14:41, Oleksandr Andrushchenko wrote:
>>>
>>> On 16.11.21 10:23, Oleksandr Andrushchenko wrote:
>>>> On 16.11.21 10:01, Jan Beulich wrote:
>>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>
>>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>>
>>>>>>>> pci_physdev_op
>>>>>>>>         pci_add_device
>>>>>>>>             init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>>         iommu_add_device <- FAILS
>>>>>>>>         vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>>
>>>>>>>> leave_hypervisor_to_guest
>>>>>>>>         vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>>
>>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>>> device has been deassigned
>>>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>>>>>> deref?
>>>>>> I think it is safe to continue
>>>>> And why do you think so? I.e. why is there no race for Dom0 when there
>>>>> is one for DomU?
>>>> Well, then we need to use a lock to synchronize the two.
>>>> I guess this needs to be pci devs lock unfortunately
>>> The parties involved in deferred work and its cancellation:
>>>
>>> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map
>>>
>>> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending
>>>
>>> x86: two places -> hvm_do_resume -> vpci_process_pending
>>>
>>> So, both defer_map and vpci_process_pending need to be synchronized with
>>> pcidevs_{lock|unlock).
>> If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock
>> getting used in leave_hypervisor_to_guest.
> I do agree this is really not good, but it seems I am limited in choices.
> @Stefano, @Julien, do you see any better way of doing that?

I agree with Jan about using pcidevs_{lock|unlock} here: the lock is not 
fine-grained enough to be taken in vpci_process_pending().

I haven't yet looked at the rest of the series to be able to suggest the 
exact lock. But we at least want a per-domain spinlock.
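
I.e. roughly the below (just a sketch; the field name is only for 
illustration):

/* xen/include/xen/sched.h */
struct domain {
    /* ... existing fields ... */

    /* Serializes scheduling vs cancellation of pending vPCI work. */
    spinlock_t vpci_lock;
};

with defer_map() and vpci_process_pending() both taking d->vpci_lock, 
so a guest exit only ever contends with other vCPUs of the same domain.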

> 
> We were thinking about introducing a dedicated lock for vpci [1],
> but finally decided to use pcidevs_lock for now

Skimming through the thread, you decided to use pcidevs_lock because it 
was simpler and sufficient for the use case discussed back then. Now, we 
have a use case where it would be a problem to use pcidevs_lock. So I 
think the extra complexity is justified.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-05  6:56 ` [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal Oleksandr Andrushchenko
  2021-11-15 16:56   ` Jan Beulich
@ 2021-11-17  8:28   ` Jan Beulich
  2021-11-18  7:49     ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-17  8:28 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, roger.pau
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> When a vPCI is removed for a PCI device it is possible that we have
> scheduled a delayed work for map/unmap operations for that device.
> For example, the following scenario can illustrate the problem:
> 
> pci_physdev_op
>    pci_add_device
>        init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>    iommu_add_device <- FAILS
>    vpci_remove_device -> xfree(pdev->vpci)
> 
> leave_hypervisor_to_guest
>    vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
> 
> For the hardware domain we continue execution as the worse that
> could happen is that MMIO mappings are left in place when the
> device has been deassigned
> 
> For unprivileged domains that get a failure in the middle of a vPCI
> {un}map operation we need to destroy them, as we don't know in which
> state the p2m is. This can only happen in vpci_process_pending for
> DomUs as they won't be allowed to call pci_add_device.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Thinking about it some more, I'm not convinced any of this is really
needed in the presented form. Removal of a vPCI device is the analogue
of hot-unplug on baremetal. That's not a "behind the backs of
everything" operation. Instead the host admin has to prepare the
device for removal, which will result in it being quiescent (which in
particular means no BAR adjustments anymore). The act of removing the
device from the system has as its virtual counterpart "xl pci-detach".
I think it ought to be in this context that pending requests get
drained, and an indicator be set that no further changes to that
device are permitted. This would mean invoking from
vpci_deassign_device() as added by patch 4, not from
vpci_remove_device(). This would yield removal of a device from the
host being independent of removal of a device from a guest.
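
Roughly (shape only - the flag is made up, and patch 4's function
would need extending accordingly):

void vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
{
    /* Drain any deferred map/unmap work for this device first. */
    vpci_cancel_pending(pdev);

    /* Permit no further changes to this device's vPCI state. */
    pdev->vpci->frozen = true;

    /* ... actual de-assignment / teardown ... */
}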

The need for vpci_remove_device() seems questionable in the first
place: Even for hot-unplug on the host it may be better to require a
pci-detach from (PVH) Dom0 before the actual device removal. This
would involve an adjustment to the de-assignment logic for the case
of no quarantining: We'd need to make sure explicit de-assignment
from Dom0 actually removes the device from there; right now
de-assignment assumes "from DomU" and "to Dom0 or DomIO" (depending
on quarantining mode).

Thoughts?

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-17  8:28   ` Jan Beulich
@ 2021-11-18  7:49     ` Oleksandr Andrushchenko
  2021-11-18  8:36       ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-18  7:49 UTC (permalink / raw)
  To: Jan Beulich, roger.pau
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 17.11.21 10:28, Jan Beulich wrote:
> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> When a vPCI is removed for a PCI device it is possible that we have
>> scheduled a delayed work for map/unmap operations for that device.
>> For example, the following scenario can illustrate the problem:
>>
>> pci_physdev_op
>>     pci_add_device
>>         init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>     iommu_add_device <- FAILS
>>     vpci_remove_device -> xfree(pdev->vpci)
>>
>> leave_hypervisor_to_guest
>>     vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>
>> For the hardware domain we continue execution as the worse that
>> could happen is that MMIO mappings are left in place when the
>> device has been deassigned
>>
>> For unprivileged domains that get a failure in the middle of a vPCI
>> {un}map operation we need to destroy them, as we don't know in which
>> state the p2m is. This can only happen in vpci_process_pending for
>> DomUs as they won't be allowed to call pci_add_device.
>>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Thinking about it some more, I'm not convinced any of this is really
> needed in the presented form.
The intention of this patch was to handle error conditions which are
abnormal, e.g. when iommu_add_device fails and we are in the middle
of initialization. So, I am trying to cancel any pending work which might
already be there, so that we do not crash.
>   Removal of a vPCI device is the analogue
> of hot-unplug on baremetal. That's not a "behind the backs of
> everything" operation. Instead the host admin has to prepare the
> device for removal, which will result in it being quiescent (which in
> particular means no BAR adjustments anymore). The act of removing the
> device from the system has as its virtual counterpart "xl pci-detach".
> I think it ought to be in this context when pending requests get
> drained, and an indicator be set that no further changes to that
> device are permitted. This would mean invoking from
> vpci_deassign_device() as added by patch 4, not from
> vpci_remove_device(). This would yield removal of a device from the
> host being independent of removal of a device from a guest.
>
> The need for vpci_remove_device() seems questionable in the first
> place: Even for hot-unplug on the host it may be better to require a
> pci-detach from (PVH) Dom0 before the actual device removal. This
> would involve an adjustment to the de-assignment logic for the case
> of no quarantining: We'd need to make sure explicit de-assignment
> from Dom0 actually removes the device from there; right now
> de-assignment assumes "from DomU" and "to Dom0 or DomIO" (depending
> on quarantining mode).
Please see above. What you wrote might be perfectly fine for
the "expected" removals, but what about the errors which are
out of the administrator's control?
>
> Thoughts?
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18  7:49     ` Oleksandr Andrushchenko
@ 2021-11-18  8:36       ` Jan Beulich
  2021-11-18  8:54         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-18  8:36 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau

On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
> 
> 
> On 17.11.21 10:28, Jan Beulich wrote:
>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>
>>> When a vPCI is removed for a PCI device it is possible that we have
>>> scheduled a delayed work for map/unmap operations for that device.
>>> For example, the following scenario can illustrate the problem:
>>>
>>> pci_physdev_op
>>>     pci_add_device
>>>         init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>     iommu_add_device <- FAILS
>>>     vpci_remove_device -> xfree(pdev->vpci)
>>>
>>> leave_hypervisor_to_guest
>>>     vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>
>>> For the hardware domain we continue execution as the worse that
>>> could happen is that MMIO mappings are left in place when the
>>> device has been deassigned
>>>
>>> For unprivileged domains that get a failure in the middle of a vPCI
>>> {un}map operation we need to destroy them, as we don't know in which
>>> state the p2m is. This can only happen in vpci_process_pending for
>>> DomUs as they won't be allowed to call pci_add_device.
>>>
>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> Thinking about it some more, I'm not convinced any of this is really
>> needed in the presented form.
> The intention of this patch was to handle error conditions which are
> abnormal, e.g. when iommu_add_device fails and we are in the middle
> of initialization. So, I am trying to cancel all pending work which might
> already be there and not to crash.

Only Dom0 may be able to prematurely access the device during "add".
Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
Hence I'm not sure I see the need for dealing with these.

>>   Removal of a vPCI device is the analogue
>> of hot-unplug on baremetal. That's not a "behind the backs of
>> everything" operation. Instead the host admin has to prepare the
>> device for removal, which will result in it being quiescent (which in
>> particular means no BAR adjustments anymore). The act of removing the
>> device from the system has as its virtual counterpart "xl pci-detach".
>> I think it ought to be in this context when pending requests get
>> drained, and an indicator be set that no further changes to that
>> device are permitted. This would mean invoking from
>> vpci_deassign_device() as added by patch 4, not from
>> vpci_remove_device(). This would yield removal of a device from the
>> host being independent of removal of a device from a guest.
>>
>> The need for vpci_remove_device() seems questionable in the first
>> place: Even for hot-unplug on the host it may be better to require a
>> pci-detach from (PVH) Dom0 before the actual device removal. This
>> would involve an adjustment to the de-assignment logic for the case
>> of no quarantining: We'd need to make sure explicit de-assignment
>> from Dom0 actually removes the device from there; right now
>> de-assignment assumes "from DomU" and "to Dom0 or DomIO" (depending
>> on quarantining mode).

As to this, I now think that add/remove can very well have Dom0-related
vPCI init/teardown. But for DomU all of that should happen
during assign/de-assign. A device still assigned to a DomU simply
should never be subject to physical hot-unplug in the first place.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18  8:36       ` Jan Beulich
@ 2021-11-18  8:54         ` Oleksandr Andrushchenko
  2021-11-18  9:15           ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-18  8:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau,
	Oleksandr Andrushchenko



On 18.11.21 10:36, Jan Beulich wrote:
> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
>>
>> On 17.11.21 10:28, Jan Beulich wrote:
>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>> scheduled a delayed work for map/unmap operations for that device.
>>>> For example, the following scenario can illustrate the problem:
>>>>
>>>> pci_physdev_op
>>>>      pci_add_device
>>>>          init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>      iommu_add_device <- FAILS
>>>>      vpci_remove_device -> xfree(pdev->vpci)
>>>>
>>>> leave_hypervisor_to_guest
>>>>      vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>
>>>> For the hardware domain we continue execution as the worse that
>>>> could happen is that MMIO mappings are left in place when the
>>>> device has been deassigned
>>>>
>>>> For unprivileged domains that get a failure in the middle of a vPCI
>>>> {un}map operation we need to destroy them, as we don't know in which
>>>> state the p2m is. This can only happen in vpci_process_pending for
>>>> DomUs as they won't be allowed to call pci_add_device.
>>>>
>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>> Thinking about it some more, I'm not convinced any of this is really
>>> needed in the presented form.
>> The intention of this patch was to handle error conditions which are
>> abnormal, e.g. when iommu_add_device fails and we are in the middle
>> of initialization. So, I am trying to cancel all pending work which might
>> already be there and not to crash.
> Only Dom0 may be able to prematurely access the device during "add".
> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
> Hence I'm not sure I see the need for dealing with these.
Probably I don't follow you here. The issue I am facing is Dom0
related, e.g. Xen was not able to initialize during "add" and thus
wanted to clean up the leftovers. As a result the already
scheduled work crashes, as it was neither canceled nor interrupted
in any safe manner. So, this sounds like something we need to take
care of, hence this patch.

Another story, of which I am becoming more convinced, is that
the proper locking should be a dedicated patch, as it should not only
add the locks required by this patch, but also probably revisit the
existing locking schemes so they are acceptable for the new use-cases.
>
>>>    Removal of a vPCI device is the analogue
>>> of hot-unplug on baremetal. That's not a "behind the backs of
>>> everything" operation. Instead the host admin has to prepare the
>>> device for removal, which will result in it being quiescent (which in
>>> particular means no BAR adjustments anymore). The act of removing the
>>> device from the system has as its virtual counterpart "xl pci-detach".
>>> I think it ought to be in this context when pending requests get
>>> drained, and an indicator be set that no further changes to that
>>> device are permitted. This would mean invoking from
>>> vpci_deassign_device() as added by patch 4, not from
>>> vpci_remove_device(). This would yield removal of a device from the
>>> host being independent of removal of a device from a guest.
>>>
>>> The need for vpci_remove_device() seems questionable in the first
>>> place: Even for hot-unplug on the host it may be better to require a
>>> pci-detach from (PVH) Dom0 before the actual device removal. This
>>> would involve an adjustment to the de-assignment logic for the case
>>> of no quarantining: We'd need to make sure explicit de-assignment
>>> from Dom0 actually removes the device from there; right now
>>> de-assignment assumes "from DomU" and "to Dom0 or DomIO" (depending
>>> on quarantining mode).
> As to this, I meanwhile think that add/remove can very well have Dom0
> related vPCI init/teardown. But for DomU all of that should happen
> during assign/de-assign.
Yes, I agree. The model I also see is:
- for Dom0 we use add/remove
- for DomUs we use assign/de-assign
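
In terms of the hooks (sketch only; the assign/de-assign names are the
ones patch 4 introduces):

pci_add_device()           -> vpci_add_handlers()     /* Dom0 */
pci_remove_device()        -> vpci_remove_device()    /* Dom0 */
XEN_DOMCTL_assign_device   -> vpci_assign_device()    /* DomU */
XEN_DOMCTL_deassign_device -> vpci_deassign_device()  /* DomU */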
>   A device still assigned to a DomU simply
> should never be subject to physical hot-unplug in the first place.
Double that
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18  8:54         ` Oleksandr Andrushchenko
@ 2021-11-18  9:15           ` Jan Beulich
  2021-11-18  9:32             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-18  9:15 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau

On 18.11.2021 09:54, Oleksandr Andrushchenko wrote:
> On 18.11.21 10:36, Jan Beulich wrote:
>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
>>> On 17.11.21 10:28, Jan Beulich wrote:
>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>
>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>> For example, the following scenario can illustrate the problem:
>>>>>
>>>>> pci_physdev_op
>>>>>      pci_add_device
>>>>>          init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>      iommu_add_device <- FAILS
>>>>>      vpci_remove_device -> xfree(pdev->vpci)
>>>>>
>>>>> leave_hypervisor_to_guest
>>>>>      vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>
>>>>> For the hardware domain we continue execution as the worse that
>>>>> could happen is that MMIO mappings are left in place when the
>>>>> device has been deassigned
>>>>>
>>>>> For unprivileged domains that get a failure in the middle of a vPCI
>>>>> {un}map operation we need to destroy them, as we don't know in which
>>>>> state the p2m is. This can only happen in vpci_process_pending for
>>>>> DomUs as they won't be allowed to call pci_add_device.
>>>>>
>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>> Thinking about it some more, I'm not convinced any of this is really
>>>> needed in the presented form.
>>> The intention of this patch was to handle error conditions which are
>>> abnormal, e.g. when iommu_add_device fails and we are in the middle
>>> of initialization. So, I am trying to cancel all pending work which might
>>> already be there and not to crash.
>> Only Dom0 may be able to prematurely access the device during "add".
>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
>> Hence I'm not sure I see the need for dealing with these.
> Probably I don't follow you here. The issue I am facing is Dom0
> related, e.g. Xen was not able to initialize during "add" and thus
> wanted to clean up the leftovers. As the result the already
> scheduled work crashes as it was not neither canceled nor interrupted
> in some safe manner. So, this sounds like something we need to take
> care of, thus this patch.

But my point was the question of why there would be any pending work
in the first place in this case, when we expect Dom0 to be well-behaved.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18  9:15           ` Jan Beulich
@ 2021-11-18  9:32             ` Oleksandr Andrushchenko
  2021-11-18 13:25               ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-18  9:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau,
	Oleksandr Andrushchenko



On 18.11.21 11:15, Jan Beulich wrote:
> On 18.11.2021 09:54, Oleksandr Andrushchenko wrote:
>> On 18.11.21 10:36, Jan Beulich wrote:
>>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
>>>> On 17.11.21 10:28, Jan Beulich wrote:
>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>
>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>
>>>>>> pci_physdev_op
>>>>>>       pci_add_device
>>>>>>           init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>       iommu_add_device <- FAILS
>>>>>>       vpci_remove_device -> xfree(pdev->vpci)
>>>>>>
>>>>>> leave_hypervisor_to_guest
>>>>>>       vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>
>>>>>> For the hardware domain we continue execution as the worse that
>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>> device has been deassigned
>>>>>>
>>>>>> For unprivileged domains that get a failure in the middle of a vPCI
>>>>>> {un}map operation we need to destroy them, as we don't know in which
>>>>>> state the p2m is. This can only happen in vpci_process_pending for
>>>>>> DomUs as they won't be allowed to call pci_add_device.
>>>>>>
>>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>> Thinking about it some more, I'm not convinced any of this is really
>>>>> needed in the presented form.
>>>> The intention of this patch was to handle error conditions which are
>>>> abnormal, e.g. when iommu_add_device fails and we are in the middle
>>>> of initialization. So, I am trying to cancel all pending work which might
>>>> already be there and not to crash.
>>> Only Dom0 may be able to prematurely access the device during "add".
>>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
>>> Hence I'm not sure I see the need for dealing with these.
>> Probably I don't follow you here. The issue I am facing is Dom0
>> related, e.g. Xen was not able to initialize during "add" and thus
>> wanted to clean up the leftovers. As the result the already
>> scheduled work crashes as it was not neither canceled nor interrupted
>> in some safe manner. So, this sounds like something we need to take
>> care of, thus this patch.
> But my point was the question of why there would be any pending work
> in the first place in this case, when we expect Dom0 to be well-behaved.
I am not saying Dom0 misbehaves here. This is my real use-case
(as in the commit message):

pci_physdev_op
      pci_add_device
          init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
      iommu_add_device <- FAILS
      vpci_remove_device -> xfree(pdev->vpci)

leave_hypervisor_to_guest
      vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL

So, this made me implement the patch. Then we decided that it is
possible that other vCPUs may also have some pending work, and I
agreed that this is a good point and that we want to remove the pending
work for all vCPUs.

So, if you doubt the patch and we still have the scenario above, what
would you suggest in order to make sure we do not crash?
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-16 18:02                 ` Julien Grall
@ 2021-11-18 12:57                   ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-18 12:57 UTC (permalink / raw)
  To: Julien Grall, Jan Beulich, sstabellini, roger.pau
  Cc: Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel, Oleksandr Andrushchenko

Hi, Julien!

On 16.11.21 20:02, Julien Grall wrote:
> Hi Oleksandr,
>
> On 16/11/2021 14:24, Oleksandr Andrushchenko wrote:
>>
>>
>> On 16.11.21 16:12, Jan Beulich wrote:
>>> On 16.11.2021 14:41, Oleksandr Andrushchenko wrote:
>>>>
>>>> On 16.11.21 10:23, Oleksandr Andrushchenko wrote:
>>>>> On 16.11.21 10:01, Jan Beulich wrote:
>>>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>>
>>>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>>>
>>>>>>>>> pci_physdev_op
>>>>>>>>>         pci_add_device
>>>>>>>>>             init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>>>         iommu_add_device <- FAILS
>>>>>>>>>         vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>>>
>>>>>>>>> leave_hypervisor_to_guest
>>>>>>>>>         vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>>>
>>>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>>>> device has been deassigned
>>>>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>>>>>>> deref?
>>>>>>> I think it is safe to continue
>>>>>> And why do you think so? I.e. why is there no race for Dom0 when there
>>>>>> is one for DomU?
>>>>> Well, then we need to use a lock to synchronize the two.
>>>>> I guess this needs to be pci devs lock unfortunately
>>>> The parties involved in deferred work and its cancellation:
>>>>
>>>> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map
>>>>
>>>> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending
>>>>
>>>> x86: two places -> hvm_do_resume -> vpci_process_pending
>>>>
>>>> So, both defer_map and vpci_process_pending need to be synchronized with
>>>> pcidevs_{lock|unlock).
>>> If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock
>>> getting used in leave_hypervisor_to_guest.
>> I do agree this is really not good, but it seems I am limited in choices.
>> @Stefano, @Julien, do you see any better way of doing that?
>
> I agree with Jan about using the pcidevs_{lock|unlock}. The lock is not fine-grained enough for been call in vpci_process_pending().
>
> I haven't yet looked at the rest of the series to be able to suggest the exact lock. But we at least want a per-domain spinlock.
>
>>
>> We were thinking about introducing a dedicated lock for vpci [1],
>> but finally decided to use pcidevs_lock for now
>
> Skimming through the thread, you decided to use pcidevs_lock because it was simpler and sufficient for the use case discussed back then. Now, we have a use case where it would be a problem to use pcidevs_lock. So I think the extra complexity is justified.
I would like to understand what this lock is so I can implement it properly.
As I see it, we have the following options:

1. pcidevs_{lock|unlock} - considered too heavy, per host
2. pdev->vpci->lock - better, but still heavy, per PCI device
3. We may convert pdev->vpci->lock into an r/w lock (see the sketch below)
4. We may introduce a specific lock

To better understand the scope of the lock:
1. MMIO trap handlers (vpci_{read|write}) - already protected with pdev->vpci->lock
2. vpci_process_pending (SOFTIRQ context)
3. Hypercalls which call pci_{add|remove|assign|deassign}_device
4. @Roger, did I miss something?
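
For option 3 I would imagine something like the below (a sketch only,
to show the intended reader/writer split):

struct vpci {
    rwlock_t lock;    /* was: spinlock_t lock; */
    /* ... */
};

/* Readers: MMIO trap handlers and vpci_process_pending, i.e. paths
 * which only need pdev->vpci to stay valid while they run. */
read_lock(&pdev->vpci->lock);
/* ... */
read_unlock(&pdev->vpci->lock);

/* Writer: the removal path; note the structure holding the lock can
 * only be freed after write_unlock. */
write_lock(&pdev->vpci->lock);
/* ... detach the state ... */
write_unlock(&pdev->vpci->lock);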

And I feel that this needs a dedicated patch: I am not sure it is a
good idea to fold this locking change into this patch, to which it is
not really relevant.
>
> Cheers,
>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18  9:32             ` Oleksandr Andrushchenko
@ 2021-11-18 13:25               ` Jan Beulich
  2021-11-18 13:48                 ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-18 13:25 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Roger Pau Monné
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 18.11.2021 10:32, Oleksandr Andrushchenko wrote:
> 
> 
> On 18.11.21 11:15, Jan Beulich wrote:
>> On 18.11.2021 09:54, Oleksandr Andrushchenko wrote:
>>> On 18.11.21 10:36, Jan Beulich wrote:
>>>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
>>>>> On 17.11.21 10:28, Jan Beulich wrote:
>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>
>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>
>>>>>>> pci_physdev_op
>>>>>>>       pci_add_device
>>>>>>>           init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>       iommu_add_device <- FAILS
>>>>>>>       vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>
>>>>>>> leave_hypervisor_to_guest
>>>>>>>       vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>
>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>> device has been deassigned
>>>>>>>
>>>>>>> For unprivileged domains that get a failure in the middle of a vPCI
>>>>>>> {un}map operation we need to destroy them, as we don't know in which
>>>>>>> state the p2m is. This can only happen in vpci_process_pending for
>>>>>>> DomUs as they won't be allowed to call pci_add_device.
>>>>>>>
>>>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>> Thinking about it some more, I'm not convinced any of this is really
>>>>>> needed in the presented form.
>>>>> The intention of this patch was to handle error conditions which are
>>>>> abnormal, e.g. when iommu_add_device fails and we are in the middle
>>>>> of initialization. So, I am trying to cancel all pending work which might
>>>>> already be there and not to crash.
>>>> Only Dom0 may be able to prematurely access the device during "add".
>>>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
>>>> Hence I'm not sure I see the need for dealing with these.
>>> Probably I don't follow you here. The issue I am facing is Dom0
>>> related, e.g. Xen was not able to initialize during "add" and thus
>>> wanted to clean up the leftovers. As the result the already
>>> scheduled work crashes as it was not neither canceled nor interrupted
>>> in some safe manner. So, this sounds like something we need to take
>>> care of, thus this patch.
>> But my point was the question of why there would be any pending work
>> in the first place in this case, when we expect Dom0 to be well-behaved.
> I am not saying Dom0 misbehaves here. This is my real use-case
> (as in the commit message):
> 
> pci_physdev_op
>       pci_add_device
>           init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>       iommu_add_device <- FAILS
>       vpci_remove_device -> xfree(pdev->vpci)
> 
> leave_hypervisor_to_guest
>       vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL

First of all I'm sorry for having lost track of that particular case in
the course of the discussion.

I wonder though whether that's something we really need to take care of.
At boot (on x86) modify_bars() wouldn't call defer_map() anyway, but
use apply_map() instead. I wonder whether this wouldn't be appropriate
generally in the context of init_bars() when used for Dom0 (not sure
whether init_bars() would find some form of use for DomU-s as well).
This is even more so as it would better be the exception that devices
discovered post-boot start out with memory decoding enabled (which is a
prereq for modify_bars() to be called in the first place).
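
(For reference, quoting the tail of modify_bars() loosely from memory:)

    /* During boot there's no vCPU to defer the work to - map directly. */
    if ( system_state < SYS_STATE_active )
        return apply_map(pdev->domain, pdev, mem, cmd);

    defer_map(pdev->domain, pdev, mem, cmd, rom_only);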

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18 13:25               ` Jan Beulich
@ 2021-11-18 13:48                 ` Oleksandr Andrushchenko
  2021-11-18 14:04                   ` Roger Pau Monné
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-18 13:48 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 18.11.21 15:25, Jan Beulich wrote:
> On 18.11.2021 10:32, Oleksandr Andrushchenko wrote:
>>
>> On 18.11.21 11:15, Jan Beulich wrote:
>>> On 18.11.2021 09:54, Oleksandr Andrushchenko wrote:
>>>> On 18.11.21 10:36, Jan Beulich wrote:
>>>>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
>>>>>> On 17.11.21 10:28, Jan Beulich wrote:
>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>
>>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>>
>>>>>>>> pci_physdev_op
>>>>>>>>        pci_add_device
>>>>>>>>            init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>>        iommu_add_device <- FAILS
>>>>>>>>        vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>>
>>>>>>>> leave_hypervisor_to_guest
>>>>>>>>        vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>>
>>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>>> device has been deassigned
>>>>>>>>
>>>>>>>> For unprivileged domains that get a failure in the middle of a vPCI
>>>>>>>> {un}map operation we need to destroy them, as we don't know in which
>>>>>>>> state the p2m is. This can only happen in vpci_process_pending for
>>>>>>>> DomUs as they won't be allowed to call pci_add_device.
>>>>>>>>
>>>>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>> Thinking about it some more, I'm not convinced any of this is really
>>>>>>> needed in the presented form.
>>>>>> The intention of this patch was to handle error conditions which are
>>>>>> abnormal, e.g. when iommu_add_device fails and we are in the middle
>>>>>> of initialization. So, I am trying to cancel all pending work which might
>>>>>> already be there and not to crash.
>>>>> Only Dom0 may be able to prematurely access the device during "add".
>>>>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
>>>>> Hence I'm not sure I see the need for dealing with these.
>>>> Probably I don't follow you here. The issue I am facing is Dom0
>>>> related, e.g. Xen was not able to initialize during "add" and thus
>>>> wanted to clean up the leftovers. As the result the already
>>>> scheduled work crashes as it was not neither canceled nor interrupted
>>>> in some safe manner. So, this sounds like something we need to take
>>>> care of, thus this patch.
>>> But my point was the question of why there would be any pending work
>>> in the first place in this case, when we expect Dom0 to be well-behaved.
>> I am not saying Dom0 misbehaves here. This is my real use-case
>> (as in the commit message):
>>
>> pci_physdev_op
>>        pci_add_device
>>            init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>        iommu_add_device <- FAILS
>>        vpci_remove_device -> xfree(pdev->vpci)
>>
>> leave_hypervisor_to_guest
>>        vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
> First of all I'm sorry for having lost track of that particular case in
> the course of the discussion.
No problem, I see on the ML how much you review every day...
>
> I wonder though whether that's something we really need to take care of.
> At boot (on x86) modify_bars() wouldn't call defer_map() anyway, but
> use apply_map() instead. I wonder whether this wouldn't be appropriate
> generally in the context of init_bars() when used for Dom0 (not sure
> whether init_bars() would find some form of use for DomU-s as well).
> This is even more so as it would better be the exception that devices
> discovered post-boot start out with memory decoding enabled (which is a
> prereq for modify_bars() to be called in the first place).
Well, first of all init_bars is going to be used for DomUs as well:
that was discussed previously and is reflected in this series.

But the real use-case for the deferred mapping would be the one
from PCI_COMMAND register write emulation:

void vpci_cmd_write(const struct pci_dev *pdev, unsigned int reg,
                     uint32_t cmd, void *data)
{
[snip]
         modify_bars(pdev, cmd, false);

which in turn calls defer_map.

I believe Roger did that for a good reason: not to stall Xen while doing
map/unmap (which may take quite some time), and so moved that work to
vpci_process_pending and SOFTIRQ context.
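
For completeness, a minimal sketch of what defer_map() does in this
series (some field names assumed from the snippets elsewhere in this
thread):

static void defer_map(struct domain *d, struct pci_dev *pdev,
                      struct rangeset *mem, uint16_t cmd, bool rom_only)
{
    struct vcpu *curr = current;

    /* Stash the work on the current vCPU... */
    curr->vpci.pdev = pdev;
    curr->vpci.mem = mem;
    curr->vpci.cmd = cmd;
    curr->vpci.rom_only = rom_only;
    curr->vpci.map_pending = true;

    /* ...and let vpci_process_pending() pick it up on guest (re)entry. */
    raise_softirq(SCHEDULE_SOFTIRQ);
}
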
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18 13:48                 ` Oleksandr Andrushchenko
@ 2021-11-18 14:04                   ` Roger Pau Monné
  2021-11-18 14:14                     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Roger Pau Monné @ 2021-11-18 14:04 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel

Sorry, I've been quite busy with other stuff.

On Thu, Nov 18, 2021 at 01:48:06PM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 18.11.21 15:25, Jan Beulich wrote:
> > On 18.11.2021 10:32, Oleksandr Andrushchenko wrote:
> >>
> >> On 18.11.21 11:15, Jan Beulich wrote:
> >>> On 18.11.2021 09:54, Oleksandr Andrushchenko wrote:
> >>>> On 18.11.21 10:36, Jan Beulich wrote:
> >>>>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
> >>>>>> On 17.11.21 10:28, Jan Beulich wrote:
> >>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> >>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >>>>>>>>
> >>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
> >>>>>>>> scheduled a delayed work for map/unmap operations for that device.
> >>>>>>>> For example, the following scenario can illustrate the problem:
> >>>>>>>>
> >>>>>>>> pci_physdev_op
> >>>>>>>>        pci_add_device
> >>>>>>>>            init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
> >>>>>>>>        iommu_add_device <- FAILS
> >>>>>>>>        vpci_remove_device -> xfree(pdev->vpci)
> >>>>>>>>
> >>>>>>>> leave_hypervisor_to_guest
> >>>>>>>>        vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
> >>>>>>>>
> >>>>>>>> For the hardware domain we continue execution as the worse that
> >>>>>>>> could happen is that MMIO mappings are left in place when the
> >>>>>>>> device has been deassigned
> >>>>>>>>
> >>>>>>>> For unprivileged domains that get a failure in the middle of a vPCI
> >>>>>>>> {un}map operation we need to destroy them, as we don't know in which
> >>>>>>>> state the p2m is. This can only happen in vpci_process_pending for
> >>>>>>>> DomUs as they won't be allowed to call pci_add_device.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >>>>>>> Thinking about it some more, I'm not convinced any of this is really
> >>>>>>> needed in the presented form.
> >>>>>> The intention of this patch was to handle error conditions which are
> >>>>>> abnormal, e.g. when iommu_add_device fails and we are in the middle
> >>>>>> of initialization. So, I am trying to cancel all pending work which might
> >>>>>> already be there and not to crash.
> >>>>> Only Dom0 may be able to prematurely access the device during "add".
> >>>>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
> >>>>> Hence I'm not sure I see the need for dealing with these.
> >>>> Probably I don't follow you here. The issue I am facing is Dom0
> >>>> related, e.g. Xen was not able to initialize during "add" and thus
> >>>> wanted to clean up the leftovers. As the result the already
> >>>> scheduled work crashes as it was not neither canceled nor interrupted
> >>>> in some safe manner. So, this sounds like something we need to take
> >>>> care of, thus this patch.
> >>> But my point was the question of why there would be any pending work
> >>> in the first place in this case, when we expect Dom0 to be well-behaved.
> >> I am not saying Dom0 misbehaves here. This is my real use-case
> >> (as in the commit message):
> >>
> >> pci_physdev_op
> >>        pci_add_device
> >>            init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
> >>        iommu_add_device <- FAILS
> >>        vpci_remove_device -> xfree(pdev->vpci)
> >>
> >> leave_hypervisor_to_guest
> >>        vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
> > First of all I'm sorry for having lost track of that particular case in
> > the course of the discussion.
> No problem, I see on the ML how much you review every day...
> >
> > I wonder though whether that's something we really need to take care of.
> > At boot (on x86) modify_bars() wouldn't call defer_map() anyway, but
> > use apply_map() instead. I wonder whether this wouldn't be appropriate
> > generally in the context of init_bars() when used for Dom0 (not sure
> > whether init_bars() would find some form of use for DomU-s as well).
> > This is even more so as it would better be the exception that devices
> > discovered post-boot start out with memory decoding enabled (which is a
> > prereq for modify_bars() to be called in the first place).
> Well, first of all init_bars is going to be used for DomUs as well:
> that was discussed previously and is reflected in this series.
> 
> But the real use-case for the deferred mapping would be the one
> from PCI_COMMAND register write emulation:
> 
> void vpci_cmd_write(const struct pci_dev *pdev, unsigned int reg,
>                      uint32_t cmd, void *data)
> {
> [snip]
>          modify_bars(pdev, cmd, false);
> 
> which in turn calls defer_map.
> 
> I believe Roger did that for a good reason: not to stall Xen while doing
> map/unmap (which may take quite some time), and so moved that work to
> vpci_process_pending and SOFTIRQ context.

Indeed. In the physdevop failure case this comes from an hypercall
context, so maybe you could do the mapping in place using hypercall
continuations if required. Not sure how complex that would be,
compared to just deferring to guest entry point and then dealing with
the possible cleanup on failure.
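
For reference, the continuation pattern already used for
XEN_DOMCTL_assign_device (also quoted later in this thread) could serve
as the model; a sketch:

    case XEN_DOMCTL_assign_device:
        ret = assign_device(d, seg, bus, devfn, flags);
        if ( ret == -ERESTART )
            ret = hypercall_create_continuation(__HYPERVISOR_domctl,
                                                "h", u_domctl);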

Thanks, Roger.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18 14:04                   ` Roger Pau Monné
@ 2021-11-18 14:14                     ` Oleksandr Andrushchenko
  2021-11-18 14:35                       ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-18 14:14 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel



On 18.11.21 16:04, Roger Pau Monné wrote:
> Sorry, I've been quite busy with other stuff.
>
> On Thu, Nov 18, 2021 at 01:48:06PM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 18.11.21 15:25, Jan Beulich wrote:
>>> On 18.11.2021 10:32, Oleksandr Andrushchenko wrote:
>>>> On 18.11.21 11:15, Jan Beulich wrote:
>>>>> On 18.11.2021 09:54, Oleksandr Andrushchenko wrote:
>>>>>> On 18.11.21 10:36, Jan Beulich wrote:
>>>>>>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
>>>>>>>> On 17.11.21 10:28, Jan Beulich wrote:
>>>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>>>
>>>>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>>>>
>>>>>>>>>> pci_physdev_op
>>>>>>>>>>         pci_add_device
>>>>>>>>>>             init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>>>>         iommu_add_device <- FAILS
>>>>>>>>>>         vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>>>>
>>>>>>>>>> leave_hypervisor_to_guest
>>>>>>>>>>         vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>>>>
>>>>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>>>>> device has been deassigned
>>>>>>>>>>
>>>>>>>>>> For unprivileged domains that get a failure in the middle of a vPCI
>>>>>>>>>> {un}map operation we need to destroy them, as we don't know in which
>>>>>>>>>> state the p2m is. This can only happen in vpci_process_pending for
>>>>>>>>>> DomUs as they won't be allowed to call pci_add_device.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>> Thinking about it some more, I'm not convinced any of this is really
>>>>>>>>> needed in the presented form.
>>>>>>>> The intention of this patch was to handle error conditions which are
>>>>>>>> abnormal, e.g. when iommu_add_device fails and we are in the middle
>>>>>>>> of initialization. So, I am trying to cancel all pending work which might
>>>>>>>> already be there and not to crash.
>>>>>>> Only Dom0 may be able to prematurely access the device during "add".
>>>>>>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
>>>>>>> Hence I'm not sure I see the need for dealing with these.
>>>>>> Probably I don't follow you here. The issue I am facing is Dom0
>>>>>> related, e.g. Xen was not able to initialize during "add" and thus
>>>>>> wanted to clean up the leftovers. As the result the already
>>>>>> scheduled work crashes as it was not neither canceled nor interrupted
>>>>>> in some safe manner. So, this sounds like something we need to take
>>>>>> care of, thus this patch.
>>>>> But my point was the question of why there would be any pending work
>>>>> in the first place in this case, when we expect Dom0 to be well-behaved.
>>>> I am not saying Dom0 misbehaves here. This is my real use-case
>>>> (as in the commit message):
>>>>
>>>> pci_physdev_op
>>>>         pci_add_device
>>>>             init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>         iommu_add_device <- FAILS
>>>>         vpci_remove_device -> xfree(pdev->vpci)
>>>>
>>>> leave_hypervisor_to_guest
>>>>         vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>> First of all I'm sorry for having lost track of that particular case in
>>> the course of the discussion.
>> No problem, I see on the ML how much you review every day...
>>> I wonder though whether that's something we really need to take care of.
>>> At boot (on x86) modify_bars() wouldn't call defer_map() anyway, but
>>> use apply_map() instead. I wonder whether this wouldn't be appropriate
>>> generally in the context of init_bars() when used for Dom0 (not sure
>>> whether init_bars() would find some form of use for DomU-s as well).
>>> This is even more so as it would better be the exception that devices
>>> discovered post-boot start out with memory decoding enabled (which is a
>>> prereq for modify_bars() to be called in the first place).
>> Well, first of all init_bars is going to be used for DomUs as well:
>> that was discussed previously and is reflected in this series.
>>
>> But the real use-case for the deferred mapping would be the one
>> from PCI_COMMAND register write emulation:
>>
>> void vpci_cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>                       uint32_t cmd, void *data)
>> {
>> [snip]
>>           modify_bars(pdev, cmd, false);
>>
>> which in turn calls defer_map.
>>
>> I believe Roger did that for a good reason: not to stall Xen while doing
>> map/unmap (which may take quite some time), and so moved that work to
>> vpci_process_pending and SOFTIRQ context.
> Indeed. In the physdevop failure case this comes from an hypercall
> context, so maybe you could do the mapping in place using hypercall
> continuations if required. Not sure how complex that would be,
> compared to just deferring to guest entry point and then dealing with
> the possible cleanup on failure.
This will solve one part of the equation:

pci_physdev_op
        pci_add_device
            init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
        iommu_add_device <- FAILS
        vpci_remove_device -> xfree(pdev->vpci)

But what about the other one, e.g. vpci_process_pending is triggered in
parallel with PCI device de-assign for example?

>
> Thanks, Roger.
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18 14:14                     ` Oleksandr Andrushchenko
@ 2021-11-18 14:35                       ` Jan Beulich
  2021-11-18 15:11                         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-18 14:35 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 18.11.2021 15:14, Oleksandr Andrushchenko wrote:
> On 18.11.21 16:04, Roger Pau Monné wrote:
>> Indeed. In the physdevop failure case this comes from an hypercall
>> context, so maybe you could do the mapping in place using hypercall
>> continuations if required. Not sure how complex that would be,
>> compared to just deferring to guest entry point and then dealing with
>> the possible cleanup on failure.
> This will solve one part of the equation:
> 
> pci_physdev_op
>         pci_add_device
>             init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>         iommu_add_device <- FAILS
>         vpci_remove_device -> xfree(pdev->vpci)
> 
> But what about the other one, e.g. vpci_process_pending is triggered in
> parallel with PCI device de-assign for example?

Well, that's again in hypercall context, so using hypercall continuations
may again be an option. Of course at the point a de-assign is initiated,
you "only" need to drain requests (for that device, but that's unlikely
to be worthwhile optimizing for), while ensuring no new requests can be
issued. Again, for the device in question, but here this is relevant -
a flag may want setting to refuse all further requests. Or maybe the
register handling hooks may want tearing down before draining pending
BAR mapping requests; without the hooks in place no new such requests
can possibly appear.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18 14:35                       ` Jan Beulich
@ 2021-11-18 15:11                         ` Oleksandr Andrushchenko
  2021-11-18 15:16                           ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-18 15:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné,
	Oleksandr Andrushchenko



On 18.11.21 16:35, Jan Beulich wrote:
> On 18.11.2021 15:14, Oleksandr Andrushchenko wrote:
>> On 18.11.21 16:04, Roger Pau Monné wrote:
>>> Indeed. In the physdevop failure case this comes from an hypercall
>>> context, so maybe you could do the mapping in place using hypercall
>>> continuations if required. Not sure how complex that would be,
>>> compared to just deferring to guest entry point and then dealing with
>>> the possible cleanup on failure.
>> This will solve one part of the equation:
>>
>> pci_physdev_op
>>          pci_add_device
>>              init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>          iommu_add_device <- FAILS
>>          vpci_remove_device -> xfree(pdev->vpci)
>>
>> But what about the other one, e.g. vpci_process_pending is triggered in
>> parallel with PCI device de-assign for example?
> Well, that's again in hypercall context, so using hypercall continuations
> may again be an option. Of course at the point a de-assign is initiated,
> you "only" need to drain requests (for that device, but that's unlikely
> to be worthwhile optimizing for), while ensuring no new requests can be
> issued. Again, for the device in question, but here this is relevant -
> a flag may want setting to refuse all further requests. Or maybe the
> register handling hooks may want tearing down before draining pending
> BAR mapping requests; without the hooks in place no new such requests
> can possibly appear.
This can probably be solved even more easily, as we were talking about
pausing all vCPUs:

void vpci_cancel_pending(const struct pci_dev *pdev)
{
     struct domain *d = pdev->domain;
     struct vcpu *v;
     int rc;

     while ( (rc = domain_pause_except_self(d)) == -ERESTART )
         cpu_relax();

     if ( rc )
         printk(XENLOG_G_ERR
                "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
                &pdev->sbdf, pdev->domain, rc);

     for_each_vcpu ( d, v )
     {
         if ( v->vpci.map_pending && (v->vpci.pdev == pdev) )

This will prevent all vCPUs but the current one from running, thus making
it impossible to run vpci_process_pending in parallel with any hypercall.
So, even without locking in vpci_process_pending the above should
be enough.
The only concern here is that domain_pause_except_self may return
the error code we somehow need to handle...
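
For illustration only, one hypothetical shape of the tail of the
function, continuing the snippet above (the cleanup details are exactly
what is under discussion; it mirrors what vpci_process_pending does once
a range set has been fully consumed):

        {
            /* Drop the deferred work for this device. */
            rangeset_destroy(v->vpci.mem);
            v->vpci.mem = NULL;
            v->vpci.map_pending = false;
        }
    }

    domain_unpause_except_self(d);
}
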
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18 15:11                         ` Oleksandr Andrushchenko
@ 2021-11-18 15:16                           ` Jan Beulich
  2021-11-18 15:21                             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-18 15:16 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 18.11.2021 16:11, Oleksandr Andrushchenko wrote:
> 
> 
> On 18.11.21 16:35, Jan Beulich wrote:
>> On 18.11.2021 15:14, Oleksandr Andrushchenko wrote:
>>> On 18.11.21 16:04, Roger Pau Monné wrote:
>>>> Indeed. In the physdevop failure case this comes from an hypercall
>>>> context, so maybe you could do the mapping in place using hypercall
>>>> continuations if required. Not sure how complex that would be,
>>>> compared to just deferring to guest entry point and then dealing with
>>>> the possible cleanup on failure.
>>> This will solve one part of the equation:
>>>
>>> pci_physdev_op
>>>          pci_add_device
>>>              init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>          iommu_add_device <- FAILS
>>>          vpci_remove_device -> xfree(pdev->vpci)
>>>
>>> But what about the other one, e.g. vpci_process_pending is triggered in
>>> parallel with PCI device de-assign for example?
>> Well, that's again in hypercall context, so using hypercall continuations
>> may again be an option. Of course at the point a de-assign is initiated,
>> you "only" need to drain requests (for that device, but that's unlikely
>> to be worthwhile optimizing for), while ensuring no new requests can be
>> issued. Again, for the device in question, but here this is relevant -
>> a flag may want setting to refuse all further requests. Or maybe the
>> register handling hooks may want tearing down before draining pending
>> BAR mapping requests; without the hooks in place no new such requests
>> can possibly appear.
> This can probably be solved even more easily, as we were talking about
> pausing all vCPUs:

I have to admit I'm not sure. It might be easier, but it may also be
less desirable.

> void vpci_cancel_pending(const struct pci_dev *pdev)
> {
>      struct domain *d = pdev->domain;
>      struct vcpu *v;
>      int rc;
> 
>      while ( (rc = domain_pause_except_self(d)) == -ERESTART )
>          cpu_relax();
> 
>      if ( rc )
>          printk(XENLOG_G_ERR
>                 "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
>                 &pdev->sbdf, pdev->domain, rc);
> 
>      for_each_vcpu ( d, v )
>      {
>          if ( v->vpci.map_pending && (v->vpci.pdev == pdev) )
> 
> This will prevent all vCPUs but the current one from running, thus making
> it impossible to run vpci_process_pending in parallel with any hypercall.
> So, even without locking in vpci_process_pending the above should
> be enough.
> The only concern here is that domain_pause_except_self may return
> the error code we somehow need to handle...

Not just this. The -ERESTART handling isn't appropriate this way
either. For the moment I can't help thinking that draining would
be preferable over canceling.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18 15:16                           ` Jan Beulich
@ 2021-11-18 15:21                             ` Oleksandr Andrushchenko
  2021-11-18 15:41                               ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-18 15:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné,
	Oleksandr Andrushchenko



On 18.11.21 17:16, Jan Beulich wrote:
> On 18.11.2021 16:11, Oleksandr Andrushchenko wrote:
>>
>> On 18.11.21 16:35, Jan Beulich wrote:
>>> On 18.11.2021 15:14, Oleksandr Andrushchenko wrote:
>>>> On 18.11.21 16:04, Roger Pau Monné wrote:
>>>>> Indeed. In the physdevop failure case this comes from an hypercall
>>>>> context, so maybe you could do the mapping in place using hypercall
>>>>> continuations if required. Not sure how complex that would be,
>>>>> compared to just deferring to guest entry point and then dealing with
>>>>> the possible cleanup on failure.
>>>> This will solve one part of the equation:
>>>>
>>>> pci_physdev_op
>>>>           pci_add_device
>>>>               init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>           iommu_add_device <- FAILS
>>>>           vpci_remove_device -> xfree(pdev->vpci)
>>>>
>>>> But what about the other one, e.g. vpci_process_pending is triggered in
>>>> parallel with PCI device de-assign for example?
>>> Well, that's again in hypercall context, so using hypercall continuations
>>> may again be an option. Of course at the point a de-assign is initiated,
>>> you "only" need to drain requests (for that device, but that's unlikely
>>> to be worthwhile optimizing for), while ensuring no new requests can be
>>> issued. Again, for the device in question, but here this is relevant -
>>> a flag may want setting to refuse all further requests. Or maybe the
>>> register handling hooks may want tearing down before draining pending
>>> BAR mapping requests; without the hooks in place no new such requests
>>> can possibly appear.
>> This can probably be solved even more easily, as we were talking about
>> pausing all vCPUs:
> I have to admit I'm not sure. It might be easier, but it may also be
> less desirable.
>
>> void vpci_cancel_pending(const struct pci_dev *pdev)
>> {
>>       struct domain *d = pdev->domain;
>>       struct vcpu *v;
>>       int rc;
>>
>>       while ( (rc = domain_pause_except_self(d)) == -ERESTART )
>>           cpu_relax();
>>
>>       if ( rc )
>>           printk(XENLOG_G_ERR
>>                  "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
>>                  &pdev->sbdf, pdev->domain, rc);
>>
>>       for_each_vcpu ( d, v )
>>       {
>>           if ( v->vpci.map_pending && (v->vpci.pdev == pdev) )
>>
>> This will prevent all vCPUs but the current one from running, thus making
>> it impossible to run vpci_process_pending in parallel with any hypercall.
>> So, even without locking in vpci_process_pending the above should
>> be enough.
>> The only concern here is that domain_pause_except_self may return
>> the error code we somehow need to handle...
> Not just this. The -ERESTART handling isn't appropriate this way
> either.
Are you talking about cpu_relax()?
>   For the moment I can't help thinking that draining would
> be preferable over canceling.
Given that cancellation is going to happen on error path or
on device de-assign/remove I think this can be acceptable.
Any reason why not?
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18 15:21                             ` Oleksandr Andrushchenko
@ 2021-11-18 15:41                               ` Jan Beulich
  2021-11-18 15:46                                 ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-18 15:41 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 18.11.2021 16:21, Oleksandr Andrushchenko wrote:
> On 18.11.21 17:16, Jan Beulich wrote:
>> On 18.11.2021 16:11, Oleksandr Andrushchenko wrote:
>>> On 18.11.21 16:35, Jan Beulich wrote:
>>>> On 18.11.2021 15:14, Oleksandr Andrushchenko wrote:
>>>>> On 18.11.21 16:04, Roger Pau Monné wrote:
>>>>>> Indeed. In the physdevop failure case this comes from an hypercall
>>>>>> context, so maybe you could do the mapping in place using hypercall
>>>>>> continuations if required. Not sure how complex that would be,
>>>>>> compared to just deferring to guest entry point and then dealing with
>>>>>> the possible cleanup on failure.
>>>>> This will solve one part of the equation:
>>>>>
>>>>> pci_physdev_op
>>>>>           pci_add_device
>>>>>               init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>           iommu_add_device <- FAILS
>>>>>           vpci_remove_device -> xfree(pdev->vpci)
>>>>>
>>>>> But what about the other one, e.g. vpci_process_pending is triggered in
>>>>> parallel with PCI device de-assign for example?
>>>> Well, that's again in hypercall context, so using hypercall continuations
>>>> may again be an option. Of course at the point a de-assign is initiated,
>>>> you "only" need to drain requests (for that device, but that's unlikely
>>>> to be worthwhile optimizing for), while ensuring no new requests can be
>>>> issued. Again, for the device in question, but here this is relevant -
>>>> a flag may want setting to refuse all further requests. Or maybe the
>>>> register handling hooks may want tearing down before draining pending
>>>> BAR mapping requests; without the hooks in place no new such requests
>>>> can possibly appear.
>>> This can probably be solved even more easily, as we were talking about
>>> pausing all vCPUs:
>> I have to admit I'm not sure. It might be easier, but it may also be
>> less desirable.
>>
>>> void vpci_cancel_pending(const struct pci_dev *pdev)
>>> {
>>>       struct domain *d = pdev->domain;
>>>       struct vcpu *v;
>>>       int rc;
>>>
>>>       while ( (rc = domain_pause_except_self(d)) == -ERESTART )
>>>           cpu_relax();
>>>
>>>       if ( rc )
>>>           printk(XENLOG_G_ERR
>>>                  "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
>>>                  &pdev->sbdf, pdev->domain, rc);
>>>
>>>       for_each_vcpu ( d, v )
>>>       {
>>>           if ( v->vpci.map_pending && (v->vpci.pdev == pdev) )
>>>
>>> This will prevent all vCPUs but the current one from running, thus making
>>> it impossible to run vpci_process_pending in parallel with any hypercall.
>>> So, even without locking in vpci_process_pending the above should
>>> be enough.
>>> The only concern here is that domain_pause_except_self may return
>>> the error code we somehow need to handle...
>> Not just this. The -ERESTART handling isn't appropriate this way
>> either.
> Are you talking about cpu_relax()?

I'm talking about that spin-waiting loop as a whole.

>>   For the moment I can't help thinking that draining would
>> be preferable over canceling.
> Given that cancellation is going to happen on error path or
> on device de-assign/remove I think this can be acceptable.
> Any reason why not?

It would seem to me that the correctness of a draining approach is
going to be easier to prove than that of a canceling one, where I
expect races to be a bigger risk. Especially something that gets
executed infrequently, if ever (error paths in particular), knowing
things are well from testing isn't typically possible.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18 15:41                               ` Jan Beulich
@ 2021-11-18 15:46                                 ` Oleksandr Andrushchenko
  2021-11-18 15:53                                   ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-18 15:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné,
	Oleksandr Andrushchenko



On 18.11.21 17:41, Jan Beulich wrote:
> On 18.11.2021 16:21, Oleksandr Andrushchenko wrote:
>> On 18.11.21 17:16, Jan Beulich wrote:
>>> On 18.11.2021 16:11, Oleksandr Andrushchenko wrote:
>>>> On 18.11.21 16:35, Jan Beulich wrote:
>>>>> On 18.11.2021 15:14, Oleksandr Andrushchenko wrote:
>>>>>> On 18.11.21 16:04, Roger Pau Monné wrote:
>>>>>>> Indeed. In the physdevop failure case this comes from an hypercall
>>>>>>> context, so maybe you could do the mapping in place using hypercall
>>>>>>> continuations if required. Not sure how complex that would be,
>>>>>>> compared to just deferring to guest entry point and then dealing with
>>>>>>> the possible cleanup on failure.
>>>>>> This will solve one part of the equation:
>>>>>>
>>>>>> pci_physdev_op
>>>>>>            pci_add_device
>>>>>>                init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>            iommu_add_device <- FAILS
>>>>>>            vpci_remove_device -> xfree(pdev->vpci)
>>>>>>
>>>>>> But what about the other one, e.g. vpci_process_pending is triggered in
>>>>>> parallel with PCI device de-assign for example?
>>>>> Well, that's again in hypercall context, so using hypercall continuations
>>>>> may again be an option. Of course at the point a de-assign is initiated,
>>>>> you "only" need to drain requests (for that device, but that's unlikely
>>>>> to be worthwhile optimizing for), while ensuring no new requests can be
>>>>> issued. Again, for the device in question, but here this is relevant -
>>>>> a flag may want setting to refuse all further requests. Or maybe the
>>>>> register handling hooks may want tearing down before draining pending
>>>>> BAR mapping requests; without the hooks in place no new such requests
>>>>> can possibly appear.
>>>> This can probably be solved even more easily, as we were talking about
>>>> pausing all vCPUs:
>>> I have to admit I'm not sure. It might be easier, but it may also be
>>> less desirable.
>>>
>>>> void vpci_cancel_pending(const struct pci_dev *pdev)
>>>> {
>>>>        struct domain *d = pdev->domain;
>>>>        struct vcpu *v;
>>>>        int rc;
>>>>
>>>>        while ( (rc = domain_pause_except_self(d)) == -ERESTART )
>>>>            cpu_relax();
>>>>
>>>>        if ( rc )
>>>>            printk(XENLOG_G_ERR
>>>>                   "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
>>>>                   &pdev->sbdf, pdev->domain, rc);
>>>>
>>>>        for_each_vcpu ( d, v )
>>>>        {
>>>>            if ( v->vpci.map_pending && (v->vpci.pdev == pdev) )
>>>>
>>>> This will prevent all vCPUs but the current one from running, thus making
>>>> it impossible to run vpci_process_pending in parallel with any hypercall.
>>>> So, even without locking in vpci_process_pending the above should
>>>> be enough.
>>>> The only concern here is that domain_pause_except_self may return
>>>> the error code we somehow need to handle...
>>> Not just this. The -ERESTART handling isn't appropriate this way
>>> either.
>> Are you talking about cpu_relax()?
> I'm talking about that spin-waiting loop as a whole.
>
>>>    For the moment I can't help thinking that draining would
>>> be preferable over canceling.
>> Given that cancellation is going to happen on error path or
>> on device de-assign/remove I think this can be acceptable.
>> Any reason why not?
> It would seem to me that the correctness of a draining approach is
> going to be easier to prove than that of a canceling one, where I
> expect races to be a bigger risk. Especially something that gets
> executed infrequently, if ever (error paths in particular), knowing
> things are well from testing isn't typically possible.
Could you please then give me a hint how to do that:
1. We have scheduled SOFTIRQ on vCPU0 and it is about to touch pdev->vpci
2. We have de-assign/remove on vCPU1

How do we drain that? Do you mean some atomic variable to be
used in vpci_process_pending to flag that it is running, so that
de-assign/remove needs to wait, spinning while checking it?
>
> Jan
>
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18 15:46                                 ` Oleksandr Andrushchenko
@ 2021-11-18 15:53                                   ` Jan Beulich
  2021-11-19 12:34                                     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-18 15:53 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 18.11.2021 16:46, Oleksandr Andrushchenko wrote:
> On 18.11.21 17:41, Jan Beulich wrote:
>> On 18.11.2021 16:21, Oleksandr Andrushchenko wrote:
>>> On 18.11.21 17:16, Jan Beulich wrote:
>>>>    For the moment I can't help thinking that draining would
>>>> be preferable over canceling.
>>> Given that cancellation is going to happen on error path or
>>> on device de-assign/remove I think this can be acceptable.
>>> Any reason why not?
>> It would seem to me that the correctness of a draining approach is
>> going to be easier to prove than that of a canceling one, where I
>> expect races to be a bigger risk. Especially something that gets
>> executed infrequently, if ever (error paths in particular), knowing
>> things are well from testing isn't typically possible.
> Could you please then give me a hint how to do that:
> 1. We have scheduled SOFTIRQ on vCPU0 and it is about to touch pdev->vpci
> 2. We have de-assign/remove on vCPU1
> 
> How do we drain that? Do you mean some atomic variable to be
> used in vpci_process_pending to flag that it is running, so that
> de-assign/remove needs to wait, spinning while checking it?

First of all let's please keep remove and de-assign separate. I think we
have largely reached agreement that remove may need handling differently,
for being a Dom0-only operation.

As to draining during de-assign: I did suggest before that removing the
register handling hooks first would guarantee no new requests to appear.
Then it should be merely a matter of using hypercall continuations until
the respective domain has no pending requests anymore for the device in
question. Some locking (or lock barrier) may of course be needed to
make sure another CPU isn't just about to pend a new request.
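
A rough sketch of that ordering (the helper name is invented purely for
illustration):

    /* 1. Tear down the register handling hooks first: no new
     * map/unmap requests can be pended from this point on. */
    vpci_remove_device_handlers(pdev);  /* hypothetical helper */

    /* 2. Drain what is already pending, continuing the hypercall
     * until nothing is left for this device. */
    for_each_vcpu ( d, v )
        if ( v->vpci.map_pending && v->vpci.pdev == pdev )
            return hypercall_create_continuation(__HYPERVISOR_domctl,
                                                 "h", u_domctl);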

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 10/11] vpci: add initial support for virtual PCI bus topology
  2021-11-05  6:56 ` [PATCH v4 10/11] vpci: add initial support for virtual PCI bus topology Oleksandr Andrushchenko
@ 2021-11-18 16:45   ` Jan Beulich
  2021-11-24 11:28     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-18 16:45 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> Since v3:
>  - make use of VPCI_INIT
>  - moved all new code to vpci.c which belongs to it
>  - changed open-coded 31 to PCI_SLOT(~0)
>  - revisited locking: add dedicated vdev list's lock

What is this about? I can't spot any locking in the patch. In particular ...

> @@ -125,6 +128,54 @@ int vpci_add_handlers(struct pci_dev *pdev)
>  }
>  
>  #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +int vpci_add_virtual_device(struct pci_dev *pdev)
> +{
> +    struct domain *d = pdev->domain;
> +    pci_sbdf_t sbdf;
> +    unsigned long new_dev_number;
> +
> +    /*
> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
> +     * there are multi-function ones which are not yet supported.
> +     */
> +    if ( pdev->info.is_extfn )
> +    {
> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
> +                 &pdev->sbdf);
> +        return -EOPNOTSUPP;
> +    }
> +
> +    new_dev_number = find_first_zero_bit(&d->vpci_dev_assigned_map,
> +                                         PCI_SLOT(~0) + 1);
> +    if ( new_dev_number > PCI_SLOT(~0) )
> +        return -ENOSPC;
> +
> +    set_bit(new_dev_number, &d->vpci_dev_assigned_map);

... I wonder whether this isn't racy without any locking around it,
and without looping over test_and_set_bit(). Whereas with locking I
think you could just use __set_bit().
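
I.e. something along these lines (the lock name is invented for
illustration):

    spin_lock(&d->vpci_dev_lock);  /* hypothetical lock */
    new_dev_number = find_first_zero_bit(&d->vpci_dev_assigned_map,
                                         PCI_SLOT(~0) + 1);
    if ( new_dev_number <= PCI_SLOT(~0) )
        __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
    spin_unlock(&d->vpci_dev_lock);

    if ( new_dev_number > PCI_SLOT(~0) )
        return -ENOSPC;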

> +    /*
> +     * Both segment and bus number are 0:
> +     *  - we emulate a single host bridge for the guest, e.g. segment 0
> +     *  - with bus 0 the virtual devices are seen as embedded
> +     *    endpoints behind the root complex
> +     *
> +     * TODO: add support for multi-function devices.
> +     */
> +    sbdf.sbdf = 0;

I think this would be better expressed as an initializer, making it
clear to the reader that the whole object gets initialized without
them needing to go check the type (and find that .sbdf covers the
entire object).
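
I.e. (sketch):

    pci_sbdf_t sbdf = { .sbdf = 0 };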

> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -145,6 +145,10 @@ struct vpci {
>              struct vpci_arch_msix_entry arch;
>          } entries[];
>      } *msix;
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    /* Virtual SBDF of the device. */
> +    pci_sbdf_t guest_sbdf;

Would vsbdf perhaps be better in line with things like vpci or vcpu
(as well as with the comment here)?

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 05/11] vpci/header: implement guest BAR register handlers
  2021-11-05  6:56 ` [PATCH v4 05/11] vpci/header: implement guest BAR register handlers Oleksandr Andrushchenko
@ 2021-11-19 11:58   ` Jan Beulich
  2021-11-19 12:10     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-19 11:58 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Add relevant vpci register handlers when assigning PCI device to a domain
> and remove those when de-assigning. This allows having different
> handlers for different domains, e.g. hwdom and other guests.
> 
> Emulate guest BAR register values: this allows creating a guest view
> of the registers and emulates size and properties probe as it is done
> during PCI device enumeration by the guest.
> 
> ROM BAR is only handled for the hardware domain and for guest domains
> there is a stub: at the moment PCI expansion ROM is x86 only, so it
> might not be used by other architectures without emulating x86. Other
> use-cases may include using that expansion ROM before Xen boots, hence
> no emulation is needed in Xen itself. Or when a guest wants to use the
> ROM code which seems to be rare.

At least in the initial days of EFI there was the concept of EFI byte
code, for ROM code to be compiled to, such that it would be arch-
independent. While I don't mean this to be an argument against leaving
out ROM BAR handling for now, this may want mentioning here to avoid
giving the impression that it's only x86 which might be affected by
this deliberate omission.

> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -408,6 +408,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>      pci_conf_write32(pdev->sbdf, reg, val);
>  }
>  
> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
> +                            uint32_t val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    bool hi = false;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +    else
> +    {
> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> +    }
> +
> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
> +
> +    bar->guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
> +}
> +
> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
> +                               void *data)
> +{
> +    const struct vpci_bar *bar = data;
> +    bool hi = false;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +
> +    return bar->guest_addr >> (hi ? 32 : 0);

I'm afraid "guest_addr" then isn't the best name; maybe "guest_val"?
This would make more obvious that there is a meaningful difference
from "addr" besides the guest vs host aspect.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR
  2021-11-05  6:56 ` [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR Oleksandr Andrushchenko
@ 2021-11-19 12:05   ` Jan Beulich
  2021-11-19 12:13     ` Oleksandr Andrushchenko
  2021-11-19 13:16   ` Jan Beulich
  1 sibling, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-19 12:05 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Instead of handling a single range set, that contains all the memory
> regions of all the BARs and ROM, have them per BAR.

Iirc Roger did indicate agreement with the splitting. May I nevertheless
ask that for posterity you say a word here about the overhead, to make
clear this was a conscious decision?

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 05/11] vpci/header: implement guest BAR register handlers
  2021-11-19 11:58   ` Jan Beulich
@ 2021-11-19 12:10     ` Oleksandr Andrushchenko
  2021-11-19 12:37       ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 12:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, Oleksandr Andrushchenko,
	xen-devel



On 19.11.21 13:58, Jan Beulich wrote:
> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Add relevant vpci register handlers when assigning PCI device to a domain
>> and remove those when de-assigning. This allows having different
>> handlers for different domains, e.g. hwdom and other guests.
>>
>> Emulate guest BAR register values: this allows creating a guest view
>> of the registers and emulates size and properties probe as it is done
>> during PCI device enumeration by the guest.
>>
>> ROM BAR is only handled for the hardware domain and for guest domains
>> there is a stub: at the moment PCI expansion ROM is x86 only, so it
>> might not be used by other architectures without emulating x86. Other
>> use-cases may include using that expansion ROM before Xen boots, hence
>> no emulation is needed in Xen itself. Or when a guest wants to use the
>> ROM code which seems to be rare.
> At least in the initial days of EFI there was the concept of EFI byte
> code, for ROM code to be compiled to, such that it would be arch-
> independent. While I don't mean this to be an argument against leaving
> out ROM BAR handling for now, this may want mentioning here to avoid
> giving the impression that it's only x86 which might be affected by
> this deliberate omission.
I can put:
at the moment PCI expansion ROM handling is supported for x86 only
and it might not be used by other architectures without emulating x86.
>
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -408,6 +408,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>       pci_conf_write32(pdev->sbdf, reg, val);
>>   }
>>   
>> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>> +                            uint32_t val, void *data)
>> +{
>> +    struct vpci_bar *bar = data;
>> +    bool hi = false;
>> +
>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>> +    {
>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>> +        bar--;
>> +        hi = true;
>> +    }
>> +    else
>> +    {
>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
>> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>> +    }
>> +
>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>> +
>> +    bar->guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
>> +}
>> +
>> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>> +                               void *data)
>> +{
>> +    const struct vpci_bar *bar = data;
>> +    bool hi = false;
>> +
>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>> +    {
>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>> +        bar--;
>> +        hi = true;
>> +    }
>> +
>> +    return bar->guest_addr >> (hi ? 32 : 0);
> I'm afraid "guest_addr" then isn't the best name; maybe "guest_val"?
> This would make more obvious that there is a meaningful difference
> from "addr" besides the guest vs host aspect.
I am not sure I can agree here:
bar->addr and bar->guest_addr make it clear what these are, while
bar->addr and bar->guest_val would make someone go look for
additional information about what that val is for.
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR
  2021-11-19 12:05   ` Jan Beulich
@ 2021-11-19 12:13     ` Oleksandr Andrushchenko
  2021-11-19 12:45       ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 12:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, Oleksandr Andrushchenko,
	xen-devel



On 19.11.21 14:05, Jan Beulich wrote:
> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Instead of handling a single range set, that contains all the memory
>> regions of all the BARs and ROM, have them per BAR.
> Iirc Roger did indicate agreement with the splitting. May I nevertheless
> ask that for posterity you say a word here about the overhead, to make
> clear this was a conscious decision?
Sure, but could you please help me with wording that sentence so it
pleases your eye? I mean that it was you who saw the overhead while I
did not: to implement functionality similar to what range sets provide,
I still think we'd end up duplicating range sets at the end of the day.
> Jan
>
Thank you in advance,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 07/11] vpci/header: program p2m with guest BAR view
  2021-11-05  6:56 ` [PATCH v4 07/11] vpci/header: program p2m with guest BAR view Oleksandr Andrushchenko
@ 2021-11-19 12:33   ` Jan Beulich
  2021-11-19 12:44     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-19 12:33 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Take into account guest's BAR view and program its p2m accordingly:
> gfn is guest's view of the BAR and mfn is the physical BAR value as set
> up by the host bridge in the hardware domain.

I'm sorry to be picky, but I don't think host bridges set up BARs. What
I think you mean is "as set up by the PCI bus driver in the hardware
domain". Of course this then still leaves out the case of firmware
doing so, and hence the dom0less case.

> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -30,6 +30,10 @@
>  
>  struct map_data {
>      struct domain *d;
> +    /* Start address of the BAR as seen by the guest. */
> +    gfn_t start_gfn;
> +    /* Physical start address of the BAR. */
> +    mfn_t start_mfn;

As of the previous patch you process this on a per-BAR basis. Why don't
you simply put const struct vpci_bar * here? This would e.g. avoid the
need to keep in sync the identical expressions in vpci_process_pending()
and apply_map().

> @@ -37,12 +41,24 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>                       unsigned long *c)
>  {
>      const struct map_data *map = data;
> +    gfn_t start_gfn;

Imo this wants to move into the more narrow scope, ...

>      int rc;
>  
>      for ( ; ; )
>      {
>          unsigned long size = e - s + 1;
>  
> +        /*
> +         * Ranges to be mapped don't always start at the BAR start address, as
> +         * there can be holes or partially consumed ranges. Account for the
> +         * offset of the current address from the BAR start.
> +         */
> +        start_gfn = gfn_add(map->start_gfn, s - mfn_x(map->start_mfn));

... allowing (in principle) for this to become its initializer.
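
I.e. (sketch of the suggested narrowing):

    for ( ; ; )
    {
        unsigned long size = e - s + 1;
        gfn_t start_gfn = gfn_add(map->start_gfn,
                                  s - mfn_x(map->start_mfn));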

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-18 15:53                                   ` Jan Beulich
@ 2021-11-19 12:34                                     ` Oleksandr Andrushchenko
  2021-11-19 13:00                                       ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 12:34 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko

Hi, Roger, Jan!

On 18.11.21 17:53, Jan Beulich wrote:
> On 18.11.2021 16:46, Oleksandr Andrushchenko wrote:
>> On 18.11.21 17:41, Jan Beulich wrote:
>>> On 18.11.2021 16:21, Oleksandr Andrushchenko wrote:
>>>> On 18.11.21 17:16, Jan Beulich wrote:
>>>>>     For the moment I can't help thinking that draining would
>>>>> be preferable over canceling.
>>>> Given that cancellation is going to happen on error path or
>>>> on device de-assign/remove I think this can be acceptable.
>>>> Any reason why not?
>>> It would seem to me that the correctness of a draining approach is
>>> going to be easier to prove than that of a canceling one, where I
>>> expect races to be a bigger risk. Especially something that gets
>>> executed infrequently, if ever (error paths in particular), knowing
>>> things are well from testing isn't typically possible.
>> Could you please then give me a hint how to do that:
>> 1. We have scheduled SOFTIRQ on vCPU0 and it is about to touch pdev->vpci
>> 2. We have de-assign/remove on vCPU1
>>
>> How do we drain that? Do you mean some atomic variable to be
>> used in vpci_process_pending to flag that it is running, so that
>> de-assign/remove needs to wait, spinning while checking it?
> First of all let's please keep remove and de-assign separate. I think we
> have largely reached agreement that remove may need handling differently,
> for being a Dom0-only operation.
>
> As to draining during de-assign: I did suggest before that removing the
> register handling hooks first would guarantee no new requests to appear.
> Then it should be merely a matter of using hypercall continuations until
> the respective domain has no pending requests anymore for the device in
> question. Some locking (or lock barrier) may of course be needed to
> make sure another CPU isn't just about to pend a new request.
>
> Jan
>
>
Too long, but please read.

The below is a simplified analysis of what is happening with
respect to deferred mapping. First we look at which hypercalls
may run in parallel with vpci_process_pending, and which locks
they hold:

1. do_physdev_op(PHYSDEVOP_pci_device_add): failure during PHYSDEVOP_pci_device_add
===============================================================================
   pci_physdev_op: <- no hypercall_create_continuation
     pci_add_device  <- pcidevs_lock()
       vpci_add_handlers
         init_bars
           cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
             modify_bars <- if cmd & PCI_COMMAND_MEMORY
               struct rangeset *mem = rangeset_new(NULL, NULL, 0);

               if ( system_state < SYS_STATE_active ) <- Dom0 is being created
                  return apply_map(pdev->domain, pdev, mem, cmd);

               defer_map(dev->domain, dev, mem, cmd, rom_only); <- Dom0 is running
                 curr->vpci.pdev = pdev;
                 curr->vpci.mem = mem;

       ret = iommu_add_device(pdev); <- FAIL
       if ( ret )
           vpci_remove_device
             remove vPCI register handlers
             xfree(pdev->vpci);
             pdev->vpci = NULL; <- this will crash vpci_process_pending if it was
                                   scheduled and yet to run

2. do_physdev_op(PHYSDEVOP_pci_device_remove)
===============================================================================
   pci_physdev_op: <- no hypercall_create_continuation
     pci_remove_device <- pcidevs_lock()
       vpci_remove_device
         pdev->vpci = NULL; <- this will crash vpci_process_pending if it was
                               scheduled and yet to run

3. iommu_do_pci_domctl(XEN_DOMCTL_assign_device)
===============================================================================
case XEN_DOMCTL_assign_device <- pcidevs_lock();
   ret = assign_device(d, seg, bus, devfn, flags);
   if ( ret == -ERESTART )
     ret = hypercall_create_continuation(__HYPERVISOR_domctl, "h", u_domctl);


4. iommu_do_pci_domctl(XEN_DOMCTL_deassign_device) <- pcidevs_lock();
===============================================================================
case XEN_DOMCTL_deassign_device: <- no hypercall_create_continuation
   ret = deassign_device(d, seg, bus, devfn);


5. vPCI MMIO trap for PCI_COMMAND
===============================================================================
vpci_mmio_{read|write}
   vpci_ecam_{read|write}
     vpci_{read|write} <- NO locking yet
       pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
       spin_lock(&pdev->vpci->lock);
         cmd_write
           modify_bars
             defer_map

6. SoftIRQ processing
===============================================================================
hvm_do_resume
   vcpu_ioreq_handle_completion
     vpci_process_pending
       if ( v->vpci.mem )
         rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
         if ( rc == -ERESTART )
             return true; <- re-scheduling

=========================================================================
         spin_lock(&v->vpci.pdev->vpci->lock); <- v->vpci.pdev->vpci can be NULL
=========================================================================
         spin_unlock(&v->vpci.pdev->vpci->lock);
         v->vpci.mem = NULL;
       if ( rc ) <- rc is from rangeset_consume_ranges
          vpci_remove_device <- this is a BUG, as it is potentially possible that
                                vpci_process_pending is running on another vCPU

So, from the above it is clearly seen that a mapping triggered by a
PCI_COMMAND write may be happening on another vCPU in parallel
with a hypercall.

Some analysis of the hypercalls with respect to the domains which are eligible targets:
1. Dom0 (hardware domain) only: PHYSDEVOP_pci_device_add, PHYSDEVOP_pci_device_remove
2. Any domain: XEN_DOMCTL_assign_device, XEN_DOMCTL_deassign_device

Possible crash paths
===============================================================================
1. Failure in PHYSDEVOP_pci_device_add after defer_map may make
vpci_process_pending crash because of pdev->vpci == NULL
2. vpci_process_pending on another vCPU may crash if it runs in parallel
with itself, because vpci_remove_device may be called
3. vpci_mmio_{read|write} may crash after PHYSDEVOP_pci_device_remove,
as vpci_remove_device makes pdev->vpci == NULL
4. Both XEN_DOMCTL_assign_device and XEN_DOMCTL_deassign_device seem to be
unaffected.

Synchronization is needed between:
  - vpci_remove_device
  - vpci_process_pending
  - vpci_mmio_{read|write}

Possible locking and other work needed:
=======================================

1. pcidevs_{lock|unlock} is too heavy and is per-host
2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
3. We may want a dedicated per-domain rw lock to be implemented:

diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 28146ee404e6..ebf071893b21 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -444,6 +444,8 @@ struct domain

  #ifdef CONFIG_HAS_PCI
      struct list_head pdev_list;
+    rwlock_t vpci_rwlock;
+    bool vpci_terminating; <- atomic?
  #endif
then vpci_remove_device is a writer (cold path) and vpci_process_pending and
vpci_mmio_{read|write} are readers (hot path).
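
For example, the reader side might look like the sketch below (the
rangeset walk and -ERESTART handling are elided; only the locking is
shown):

bool vpci_process_pending(struct vcpu *v)
{
    struct domain *d = v->domain;

    if ( !v->vpci.mem )
        return false;

    read_lock(&d->vpci_rwlock);
    if ( !v->vpci.pdev->vpci )
    {
        /* The device was removed under our feet: nothing left to process. */
        read_unlock(&d->vpci_rwlock);
        v->vpci.mem = NULL;
        return false;
    }
    /* ... consume v->vpci.mem under the read lock as today ... */
    read_unlock(&d->vpci_rwlock);
    return false;
}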

do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
to be implemented, so the removal can be re-started if need be:

vpci_remove_device()
{
   d->vpci_terminating = true;
   remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
   if ( !write_trylock(&d->vpci_rwlock) )
     return -ERESTART;
   xfree(pdev->vpci);
   pdev->vpci = NULL;
   write_unlock(&d->vpci_rwlock);
}
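
The caller side might then be handled along these lines (a sketch only;
the exact format string for hypercall_create_continuation here is an
assumption on my side):

   case PHYSDEVOP_pci_device_remove: {
       ...
       ret = pci_remove_device(dev.seg, dev.bus, dev.devfn);
       if ( ret == -ERESTART )
           ret = hypercall_create_continuation(__HYPERVISOR_physdev_op,
                                               "ih", cmd, arg);
       break;
   }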

Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
other operations which may require it, e.g. virtual bus topology can
use it when assigning vSBDF etc.

4. The vpci_remove_device call needs to be removed from vpci_process_pending,
which should then do nothing for Dom0 and crash DomU otherwise:

if ( rc )
{
   /*
    * FIXME: in case of failure remove the device from the domain.
    * Note that there might still be leftover mappings. While this is
    * safe for Dom0, for DomUs the domain needs to be killed in order
    * to avoid leaking stale p2m mappings on failure.
    */
   if ( !is_hardware_domain(v->domain) )
     domain_crash(v->domain);
}

I do hope we can finally come up with some decision which I can implement...

Thank you,
Oleksandr

^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 05/11] vpci/header: implement guest BAR register handlers
  2021-11-19 12:10     ` Oleksandr Andrushchenko
@ 2021-11-19 12:37       ` Jan Beulich
  2021-11-19 12:46         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-19 12:37 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 19.11.2021 13:10, Oleksandr Andrushchenko wrote:
> On 19.11.21 13:58, Jan Beulich wrote:
>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>
>>> Add relevant vpci register handlers when assigning PCI device to a domain
>>> and remove those when de-assigning. This allows having different
>>> handlers for different domains, e.g. hwdom and other guests.
>>>
>>> Emulate guest BAR register values: this allows creating a guest view
>>> of the registers and emulates size and properties probe as it is done
>>> during PCI device enumeration by the guest.
>>>
>>> ROM BAR is only handled for the hardware domain and for guest domains
>>> there is a stub: at the moment PCI expansion ROM is x86 only, so it
>>> might not be used by other architectures without emulating x86. Other
>>> use-cases may include using that expansion ROM before Xen boots, hence
>>> no emulation is needed in Xen itself. Or when a guest wants to use the
>>> ROM code which seems to be rare.
>> At least in the initial days of EFI there was the concept of EFI byte
>> code, for ROM code to be compiled to such that it would be arch-
>> independent. While I don't mean this to be an argument against leaving
>> out ROM BAR handling for now, this may want mentioning here to avoid
>> giving the impression that it's only x86 which might be affected by
>> this deliberate omission.
> I can put:
> at the moment PCI expansion ROM handling is supported for x86 only
> and it might not be used by other architectures without emulating x86.

Sounds at least somewhat better to me.

>>> --- a/xen/drivers/vpci/header.c
>>> +++ b/xen/drivers/vpci/header.c
>>> @@ -408,6 +408,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>       pci_conf_write32(pdev->sbdf, reg, val);
>>>   }
>>>   
>>> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>> +                            uint32_t val, void *data)
>>> +{
>>> +    struct vpci_bar *bar = data;
>>> +    bool hi = false;
>>> +
>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>> +    {
>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>> +        bar--;
>>> +        hi = true;
>>> +    }
>>> +    else
>>> +    {
>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>>> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
>>> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>>> +    }
>>> +
>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>> +
>>> +    bar->guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
>>> +}
>>> +
>>> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>>> +                               void *data)
>>> +{
>>> +    const struct vpci_bar *bar = data;
>>> +    bool hi = false;
>>> +
>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>> +    {
>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>> +        bar--;
>>> +        hi = true;
>>> +    }
>>> +
>>> +    return bar->guest_addr >> (hi ? 32 : 0);
>> I'm afraid "guest_addr" then isn't the best name; maybe "guest_val"?
>> This would make more obvious that there is a meaningful difference
>> from "addr" besides the guest vs host aspect.
> I am not sure I can agree here:
> bar->addr and bar->guest_addr make it clear what these are, while
> bar->addr and bar->guest_val would make someone go look for
> additional information about what that val is for.

Feel free to replace "val" with something more suitable. "guest_bar"
maybe? The value definitely is not an address, so "addr" seems
inappropriate / misleading to me.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 07/11] vpci/header: program p2m with guest BAR view
  2021-11-19 12:33   ` Jan Beulich
@ 2021-11-19 12:44     ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 12:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 19.11.21 14:33, Jan Beulich wrote:
> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Take into account guest's BAR view and program its p2m accordingly:
>> gfn is guest's view of the BAR and mfn is the physical BAR value as set
>> up by the host bridge in the hardware domain.
> I'm sorry to be picky, but I don't think host bridges set up BARs. What
> I think you mean is "as set up by the PCI bus driver in the hardware
> domain". Of course this then still leaves out the case of firmware
> doing so, and hence the dom0less case.
Sounds good, I will use your wording, thanks
>
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -30,6 +30,10 @@
>>   
>>   struct map_data {
>>       struct domain *d;
>> +    /* Start address of the BAR as seen by the guest. */
>> +    gfn_t start_gfn;
>> +    /* Physical start address of the BAR. */
>> +    mfn_t start_mfn;
> As of the previous patch you process this on a per-BAR basis. Why don't
> you simply put const struct vpci_bar * here? This would e.g. avoid the
> need to keep in sync the identical expressions in vpci_process_pending()
> and apply_map().
Aha, you mean to move

+            data.start_gfn =
+                 _gfn(PFN_DOWN(is_hardware_domain(v->domain)
+                               ? bar->addr : bar->guest_addr));
+            data.start_mfn = _mfn(PFN_DOWN(bar->addr));

into map_range. Makes sense, it seems I can do that

>> @@ -37,12 +41,24 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>                        unsigned long *c)
>>   {
>>       const struct map_data *map = data;
>> +    gfn_t start_gfn;
> Imo this wants to move into the more narrow scope, ...
>
>>       int rc;
>>   
>>       for ( ; ; )
>>       {
>>           unsigned long size = e - s + 1;
>>   
>> +        /*
>> +         * Ranges to be mapped don't always start at the BAR start address, as
>> +         * there can be holes or partially consumed ranges. Account for the
>> +         * offset of the current address from the BAR start.
>> +         */
>> +        start_gfn = gfn_add(map->start_gfn, s - mfn_x(map->start_mfn));
> ... allowing (in principle) for this to become its initializer.
Yes, good idea
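
E.g. something like the below (a sketch only, assuming map_data then
carries the vpci_bar pointer as suggested above):

static int map_range(unsigned long s, unsigned long e, void *data,
                     unsigned long *c)
{
    const struct map_data *map = data;
    int rc;

    for ( ; ; )
    {
        unsigned long size = e - s + 1;
        /* Account for the offset of the current address from the BAR start. */
        gfn_t start_gfn = gfn_add(_gfn(PFN_DOWN(is_hardware_domain(map->d)
                                                ? map->bar->addr
                                                : map->bar->guest_addr)),
                                  s - PFN_DOWN(map->bar->addr));
        ...
    }
}
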
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR
  2021-11-19 12:13     ` Oleksandr Andrushchenko
@ 2021-11-19 12:45       ` Jan Beulich
  2021-11-19 12:50         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-19 12:45 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 19.11.2021 13:13, Oleksandr Andrushchenko wrote:
> On 19.11.21 14:05, Jan Beulich wrote:
>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>
>>> Instead of handling a single range set, that contains all the memory
>>> regions of all the BARs and ROM, have them per BAR.
>> Iirc Roger did indicate agreement with the splitting. May I nevertheless
>> ask that for posterity you say a word here about the overhead, to make
>> clear this was a conscious decision?
> Sure, but could you please help me with that sentence to please your
> eye? I mean that it was you who saw the overhead, while I did not; and
> as to implementing functionality similar to what range sets do, I still
> think we'll duplicate range sets at the end of the day.

"Note that rangesets were chosen here despite there being only up to
<N> separate ranges in each set (typically just 1)." Albeit that's
then still lacking a justification for the choice. Ease of
implementation?

As to overhead - did you compare sizeof(struct rangeset) + N *
sizeof(struct range) with just N * sizeof(unsigned long [2])?
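
Just for a rough feel (illustrative numbers only; the real sizes are
build-dependent): with, say, sizeof(struct rangeset) == 80 and
sizeof(struct range) == 32 on a 64-bit build, a typical single-range
set costs 112 bytes per BAR, versus 16 bytes for a bare start/end pair.
Several times more, albeit still small in absolute terms.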

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 05/11] vpci/header: implement guest BAR register handlers
  2021-11-19 12:37       ` Jan Beulich
@ 2021-11-19 12:46         ` Oleksandr Andrushchenko
  2021-11-19 12:49           ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 12:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel



On 19.11.21 14:37, Jan Beulich wrote:
> On 19.11.2021 13:10, Oleksandr Andrushchenko wrote:
>> On 19.11.21 13:58, Jan Beulich wrote:
>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> Add relevant vpci register handlers when assigning PCI device to a domain
>>>> and remove those when de-assigning. This allows having different
>>>> handlers for different domains, e.g. hwdom and other guests.
>>>>
>>>> Emulate guest BAR register values: this allows creating a guest view
>>>> of the registers and emulates size and properties probe as it is done
>>>> during PCI device enumeration by the guest.
>>>>
>>>> ROM BAR is only handled for the hardware domain and for guest domains
>>>> there is a stub: at the moment PCI expansion ROM is x86 only, so it
>>>> might not be used by other architectures without emulating x86. Other
>>>> use-cases may include using that expansion ROM before Xen boots, hence
>>>> no emulation is needed in Xen itself. Or when a guest wants to use the
>>>> ROM code which seems to be rare.
>>> At least in the initial days of EFI there was the concept of EFI byte
>>> code, for ROM code to be compiled to such that it would be arch-
>>> independent. While I don't mean this to be an argument against leaving
>>> out ROM BAR handling for now, this may want mentioning here to avoid
>>> giving the impression that it's only x86 which might be affected by
>>> this deliberate omission.
>> I can put:
>> at the moment PCI expansion ROM handling is supported for x86 only
>> and it might not be used by other architectures without emulating x86.
> Sounds at least somewhat better to me.
>
>>>> --- a/xen/drivers/vpci/header.c
>>>> +++ b/xen/drivers/vpci/header.c
>>>> @@ -408,6 +408,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>        pci_conf_write32(pdev->sbdf, reg, val);
>>>>    }
>>>>    
>>>> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>> +                            uint32_t val, void *data)
>>>> +{
>>>> +    struct vpci_bar *bar = data;
>>>> +    bool hi = false;
>>>> +
>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>> +    {
>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>> +        bar--;
>>>> +        hi = true;
>>>> +    }
>>>> +    else
>>>> +    {
>>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>>>> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
>>>> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>>>> +    }
>>>> +
>>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>>> +
>>>> +    bar->guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
>>>> +}
>>>> +
>>>> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>>>> +                               void *data)
>>>> +{
>>>> +    const struct vpci_bar *bar = data;
>>>> +    bool hi = false;
>>>> +
>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>> +    {
>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>> +        bar--;
>>>> +        hi = true;
>>>> +    }
>>>> +
>>>> +    return bar->guest_addr >> (hi ? 32 : 0);
>>> I'm afraid "guest_addr" then isn't the best name; maybe "guest_val"?
>>> This would make more obvious that there is a meaningful difference
>>> from "addr" besides the guest vs host aspect.
>> I am not sure I can agree here:
>> bar->addr and bar->guest_addr make it clear what these are, while
>> bar->addr and bar->guest_val would make someone go look for
>> additional information about what that val is for.
> Feel free to replace "val" with something more suitable. "guest_bar"
> maybe? The value definitely is not an address, so "addr" seems
> inappropriate / misleading to me.
This is a guest's view on the BAR's address. So to me it is still guest_addr
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 05/11] vpci/header: implement guest BAR register handlers
  2021-11-19 12:46         ` Oleksandr Andrushchenko
@ 2021-11-19 12:49           ` Jan Beulich
  2021-11-19 12:54             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-19 12:49 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 19.11.2021 13:46, Oleksandr Andrushchenko wrote:
> On 19.11.21 14:37, Jan Beulich wrote:
>> On 19.11.2021 13:10, Oleksandr Andrushchenko wrote:
>>> On 19.11.21 13:58, Jan Beulich wrote:
>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>> --- a/xen/drivers/vpci/header.c
>>>>> +++ b/xen/drivers/vpci/header.c
>>>>> @@ -408,6 +408,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>        pci_conf_write32(pdev->sbdf, reg, val);
>>>>>    }
>>>>>    
>>>>> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>> +                            uint32_t val, void *data)
>>>>> +{
>>>>> +    struct vpci_bar *bar = data;
>>>>> +    bool hi = false;
>>>>> +
>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>> +    {
>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>> +        bar--;
>>>>> +        hi = true;
>>>>> +    }
>>>>> +    else
>>>>> +    {
>>>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>>> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>>>>> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
>>>>> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>>>>> +    }
>>>>> +
>>>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>>>> +
>>>>> +    bar->guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
>>>>> +}
>>>>> +
>>>>> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>>>>> +                               void *data)
>>>>> +{
>>>>> +    const struct vpci_bar *bar = data;
>>>>> +    bool hi = false;
>>>>> +
>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>> +    {
>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>> +        bar--;
>>>>> +        hi = true;
>>>>> +    }
>>>>> +
>>>>> +    return bar->guest_addr >> (hi ? 32 : 0);
>>>> I'm afraid "guest_addr" then isn't the best name; maybe "guest_val"?
>>>> This would make more obvious that there is a meaningful difference
>>>> from "addr" besides the guest vs host aspect.
>>> I am not sure I can agree here:
>>> bar->addr and bar->guest_addr make it clear what these are, while
>>> bar->addr and bar->guest_val would make someone go look for
>>> additional information about what that val is for.
>> Feel free to replace "val" with something more suitable. "guest_bar"
>> maybe? The value definitely is not an address, so "addr" seems
>> inappropriate / misleading to me.
> This is a guest's view on the BAR's address. So to me it is still guest_addr

It's a guest's view on the BAR, not just the address. Or else you couldn't
simply return the value here without folding in the correct low bits.
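
A worked example, assuming a 1MiB 64-bit prefetchable BAR: a size probe
writes 0xffffffff to the low register, and after guest_bar_write()
applies ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK the subsequent
guest_bar_read() returns 0xfff0000c, i.e. the size mask plus the type
(0x4) and prefetchable (0x8) low bits. That's a full BAR register
value, not an address.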

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR
  2021-11-19 12:45       ` Jan Beulich
@ 2021-11-19 12:50         ` Oleksandr Andrushchenko
  2021-11-19 13:06           ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 12:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel



On 19.11.21 14:45, Jan Beulich wrote:
> On 19.11.2021 13:13, Oleksandr Andrushchenko wrote:
>> On 19.11.21 14:05, Jan Beulich wrote:
>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> Instead of handling a single range set, that contains all the memory
>>>> regions of all the BARs and ROM, have them per BAR.
>>> Iirc Roger did indicate agreement with the splitting. May I nevertheless
>>> ask that for posterity you say a word here about the overhead, to make
>>> clear this was a conscious decision?
>> Sure, but could you please help me with that sentence to please your
>> eye? I mean that it was you who saw the overhead, while I did not; and
>> as to implementing functionality similar to what range sets do, I still
>> think we'll duplicate range sets at the end of the day.
> "Note that rangesets were chosen here despite there being only up to
> <N> separate ranges in each set (typically just 1)." Albeit that's
> then still lacking a justification for the choice. Ease of
> implementation?
I guess yes. I'll put:

"Note that rangesets were chosen here despite there being only up to
<N> separate ranges in each set (typically just 1). But rangeset per BAR
was chosen for the ease of implementation and existing code re-usability."

>
> As to overhead - did you compare sizeof(struct rangeset) + N *
> sizeof(struct range) with just N * sizeof(unsigned long [2])?
I was not thinking about data memory sizes in the first place, but about
the new code introduced to handle that, and to be maintained.
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 05/11] vpci/header: implement guest BAR register handlers
  2021-11-19 12:49           ` Jan Beulich
@ 2021-11-19 12:54             ` Oleksandr Andrushchenko
  2021-11-19 13:02               ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 12:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel



On 19.11.21 14:49, Jan Beulich wrote:
> On 19.11.2021 13:46, Oleksandr Andrushchenko wrote:
>> On 19.11.21 14:37, Jan Beulich wrote:
>>> On 19.11.2021 13:10, Oleksandr Andrushchenko wrote:
>>>> On 19.11.21 13:58, Jan Beulich wrote:
>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>> @@ -408,6 +408,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>         pci_conf_write32(pdev->sbdf, reg, val);
>>>>>>     }
>>>>>>     
>>>>>> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>> +                            uint32_t val, void *data)
>>>>>> +{
>>>>>> +    struct vpci_bar *bar = data;
>>>>>> +    bool hi = false;
>>>>>> +
>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>>> +    {
>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>>> +        bar--;
>>>>>> +        hi = true;
>>>>>> +    }
>>>>>> +    else
>>>>>> +    {
>>>>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>>>> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>>>>>> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
>>>>>> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>>>>>> +    }
>>>>>> +
>>>>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>>>>> +
>>>>>> +    bar->guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
>>>>>> +}
>>>>>> +
>>>>>> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>>>>>> +                               void *data)
>>>>>> +{
>>>>>> +    const struct vpci_bar *bar = data;
>>>>>> +    bool hi = false;
>>>>>> +
>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>>> +    {
>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>>> +        bar--;
>>>>>> +        hi = true;
>>>>>> +    }
>>>>>> +
>>>>>> +    return bar->guest_addr >> (hi ? 32 : 0);
>>>>> I'm afraid "guest_addr" then isn't the best name; maybe "guest_val"?
>>>>> This would make more obvious that there is a meaningful difference
>>>>> from "addr" besides the guest vs host aspect.
>>>> I am not sure I can agree here:
>>>> bar->addr and bar->guest_addr make it clear what these are, while
>>>> bar->addr and bar->guest_val would make someone go look for
>>>> additional information about what that val is for.
>>> Feel free to replace "val" with something more suitable. "guest_bar"
>>> maybe? The value definitely is not an address, so "addr" seems
>>> inappropriate / misleading to me.
>> This is a guest's view on the BAR's address. So to me it is still guest_addr
> It's a guest's view on the BAR, not just the address. Or else you couldn't
> simply return the value here without folding in the correct low bits.
I agree in this respect, as it is indeed address + lower bits.
How about guest_bar_val then? So it reflects its nature, e.g. the value
of the BAR as seen by the guest.
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-19 12:34                                     ` Oleksandr Andrushchenko
@ 2021-11-19 13:00                                       ` Jan Beulich
  2021-11-19 13:16                                         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-19 13:00 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
> Possible locking and other work needed:
> =======================================
> 
> 1. pcidevs_{lock|unlock} is too heavy and is per-host
> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
> 3. We may want a dedicated per-domain rw lock to be implemented:
> 
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index 28146ee404e6..ebf071893b21 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -444,6 +444,8 @@ struct domain
> 
>   #ifdef CONFIG_HAS_PCI
>       struct list_head pdev_list;
> +    rwlock_t vpci_rwlock;
> +    bool vpci_terminating; <- atomic?
>   #endif
> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
> vpci_mmio_{read|write} are readers (hot path).

Right - you need such a lock for other purposes anyway, as per the
discussion with Julien.

> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
> to be implemented, so the removal can be re-started if need be:
> 
> vpci_remove_device()
> {
>    d->vpci_terminating = true;
>    remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>    if ( !write_trylock(&d->vpci_rwlock) )
>      return -ERESTART;
>    xfree(pdev->vpci);
>    pdev->vpci = NULL;
> }
> 
> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
> other operations which may require it, e.g. virtual bus topology can
> use it when assigning vSBDF etc.
> 
> 4. The vpci_remove_device call needs to be removed from vpci_process_pending,
> which should then do nothing for Dom0 and crash DomU otherwise:

Why is this? I'm not outright opposed, but I don't immediately see why
trying to remove the problematic device wouldn't be a reasonable course
of action anymore. vpci_remove_device() may need to become more careful
as to not crashing, though.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 05/11] vpci/header: implement guest BAR register handlers
  2021-11-19 12:54             ` Oleksandr Andrushchenko
@ 2021-11-19 13:02               ` Jan Beulich
  2021-11-19 13:17                 ` Oleksandr Andrushchenko
  2021-11-23 15:14                 ` Oleksandr Andrushchenko
  0 siblings, 2 replies; 101+ messages in thread
From: Jan Beulich @ 2021-11-19 13:02 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 19.11.2021 13:54, Oleksandr Andrushchenko wrote:
> On 19.11.21 14:49, Jan Beulich wrote:
>> On 19.11.2021 13:46, Oleksandr Andrushchenko wrote:
>>> On 19.11.21 14:37, Jan Beulich wrote:
>>>> On 19.11.2021 13:10, Oleksandr Andrushchenko wrote:
>>>>> On 19.11.21 13:58, Jan Beulich wrote:
>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>>> @@ -408,6 +408,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>         pci_conf_write32(pdev->sbdf, reg, val);
>>>>>>>     }
>>>>>>>     
>>>>>>> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>> +                            uint32_t val, void *data)
>>>>>>> +{
>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>> +    bool hi = false;
>>>>>>> +
>>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>>>> +    {
>>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>>>> +        bar--;
>>>>>>> +        hi = true;
>>>>>>> +    }
>>>>>>> +    else
>>>>>>> +    {
>>>>>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>>>>> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>>>>>>> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
>>>>>>> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>>>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>>>>>> +
>>>>>>> +    bar->guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>>>>>>> +                               void *data)
>>>>>>> +{
>>>>>>> +    const struct vpci_bar *bar = data;
>>>>>>> +    bool hi = false;
>>>>>>> +
>>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>>>> +    {
>>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>>>> +        bar--;
>>>>>>> +        hi = true;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return bar->guest_addr >> (hi ? 32 : 0);
>>>>>> I'm afraid "guest_addr" then isn't the best name; maybe "guest_val"?
>>>>>> This would make more obvious that there is a meaningful difference
>>>>>> from "addr" besides the guest vs host aspect.
>>>>> I am not sure I can agree here:
>>>>> bar->addr and bar->guest_addr make it clear what these are, while
>>>>> bar->addr and bar->guest_val would make someone go look for
>>>>> additional information about what that val is for.
>>>> Feel free to replace "val" with something more suitable. "guest_bar"
>>>> maybe? The value definitely is not an address, so "addr" seems
>>>> inappropriate / misleading to me.
>>> This is a guest's view on the BAR's address. So to me it is still guest_addr
>> It's a guest's view on the BAR, not just the address. Or else you couldn't
>> simply return the value here without folding in the correct low bits.
>> I agree in this respect, as it is indeed address + lower bits.
> How about guest_bar_val then? So it reflects its nature, e.g. the value
> of the BAR as seen by the guest.

Gets a little longish for my taste. I for one wouldn't mind it being just
"guest". In the end Roger has the final say here anyway.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR
  2021-11-19 12:50         ` Oleksandr Andrushchenko
@ 2021-11-19 13:06           ` Jan Beulich
  2021-11-19 13:19             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-19 13:06 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 19.11.2021 13:50, Oleksandr Andrushchenko wrote:
> On 19.11.21 14:45, Jan Beulich wrote:
>> On 19.11.2021 13:13, Oleksandr Andrushchenko wrote:
>>> On 19.11.21 14:05, Jan Beulich wrote:
>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>
>>>>> Instead of handling a single range set, that contains all the memory
>>>>> regions of all the BARs and ROM, have them per BAR.
>>>> Iirc Roger did indicate agreement with the splitting. May I nevertheless
>>>> ask that for posterity you say a word here about the overhead, to make
>>>> clear this was a conscious decision?
>>> Sure, but could you please help me with that sentence to please your
>>> eye? I mean that it was you who saw the overhead, while I did not; and
>>> as to implementing functionality similar to what range sets do, I still
>>> think we'll duplicate range sets at the end of the day.
>> "Note that rangesets were chosen here despite there being only up to
>> <N> separate ranges in each set (typically just 1)." Albeit that's
>> then still lacking a justification for the choice. Ease of
>> implementation?
> I guess yes. I'll put:
> 
> "Note that rangesets were chosen here despite there being only up to
> <N> separate ranges in each set (typically just 1). But rangeset per BAR
> was chosen for the ease of implementation and existing code re-usability."

FTAOD please don't forget to replace the <N> - I wasn't sure if it would
be 2 or 3. Also (nit) I don't think starting the 2nd sentence with "But
..." fits with the 1st sentence.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-19 13:00                                       ` Jan Beulich
@ 2021-11-19 13:16                                         ` Oleksandr Andrushchenko
  2021-11-19 13:25                                           ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 13:16 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Roger Pau Monné,
	julien, Rahul Singh, xen-devel, Stefano Stabellini



On 19.11.21 15:00, Jan Beulich wrote:
> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>> Possible locking and other work needed:
>> =======================================
>>
>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>
>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>> index 28146ee404e6..ebf071893b21 100644
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -444,6 +444,8 @@ struct domain
>>
>>    #ifdef CONFIG_HAS_PCI
>>        struct list_head pdev_list;
>> +    rwlock_t vpci_rwlock;
>> +    bool vpci_terminating; <- atomic?
>>    #endif
>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>> vpci_mmio_{read|write} are readers (hot path).
> Right - you need such a lock for other purposes anyway, as per the
> discussion with Julien.
What about bool vpci_terminating? Do you see it as an atomic type or just bool?
>
>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>> to be implemented, so the removal can be re-started if need be:
>>
>> vpci_remove_device()
>> {
>>     d->vpci_terminating = true;
>>     remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>     if ( !write_trylock(&d->vpci_rwlock) )
>>       return -ERESTART;
>>     xfree(pdev->vpci);
>>     pdev->vpci = NULL;
>> }
>>
>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>> other operations which may require it, e.g. virtual bus topology can
>> use it when assigning vSBDF etc.
>>
>> 4. The vpci_remove_device call needs to be removed from vpci_process_pending,
>> which should then do nothing for Dom0 and crash DomU otherwise:
> Why is this? I'm not outright opposed, but I don't immediately see why
> trying to remove the problematic device wouldn't be a reasonable course
> of action anymore. vpci_remove_device() may need to become more careful
> as to not crashing,
vpci_remove_device does not crash, vpci_process_pending does
>   though.
Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
and we call vpci_remove_device. vpci_remove_device tries to acquire the
lock and it can't, just because some other vpci code is running on another vCPU.
Then what do we do here? We are in SoftIRQ context now and we can't spin
trying to acquire d->vpci_rwlock forever. Nor can we blindly free the vpci
structure, because it is seen by all vCPUs and may crash them.

If vpci_remove_device is in hypercall context it just returns -ERESTART and
hypercall continuation helps here. But not in SoftIRQ context.

Thus, I think we need to remove the vpci_remove_device call from vpci_process_pending
and crash the domain if it is a guest domain, leaving the partially done map/unmap
as is if it is the hardware domain, as per Roger's comment in the code.
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR
  2021-11-05  6:56 ` [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR Oleksandr Andrushchenko
  2021-11-19 12:05   ` Jan Beulich
@ 2021-11-19 13:16   ` Jan Beulich
  2021-11-19 13:41     ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-19 13:16 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> @@ -95,10 +102,25 @@ int vpci_add_handlers(struct pci_dev *pdev)
>      INIT_LIST_HEAD(&pdev->vpci->handlers);
>      spin_lock_init(&pdev->vpci->lock);
>  
> +    header = &pdev->vpci->header;
> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> +    {
> +        struct vpci_bar *bar = &header->bars[i];
> +
> +        bar->mem = rangeset_new(NULL, NULL, 0);

I don't recall why an anonymous range set was chosen back at the time
when vPCI was first implemented, but I think this needs to be changed
now that DomU-s get supported. Whether you do so right here or in a
prereq patch is secondary to me. It may be desirable to exclude them
from rangeset_domain_printk() (which would likely require a new
RANGESETF_* flag), but I think such resources should be associated
with their domains.
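
Concretely that might then become something like (a sketch;
RANGESETF_no_print is a so far hypothetical flag to suppress
rangeset_domain_printk() output):

        bar->mem = rangeset_new(pdev->domain, "vpci BAR",
                                RANGESETF_no_print);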

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 05/11] vpci/header: implement guest BAR register handlers
  2021-11-19 13:02               ` Jan Beulich
@ 2021-11-19 13:17                 ` Oleksandr Andrushchenko
  2021-11-23 15:14                 ` Oleksandr Andrushchenko
  1 sibling, 0 replies; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 13:17 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel



On 19.11.21 15:02, Jan Beulich wrote:
> On 19.11.2021 13:54, Oleksandr Andrushchenko wrote:
>> On 19.11.21 14:49, Jan Beulich wrote:
>>> On 19.11.2021 13:46, Oleksandr Andrushchenko wrote:
>>>> On 19.11.21 14:37, Jan Beulich wrote:
>>>>> On 19.11.2021 13:10, Oleksandr Andrushchenko wrote:
>>>>>> On 19.11.21 13:58, Jan Beulich wrote:
>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>>>> @@ -408,6 +408,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>          pci_conf_write32(pdev->sbdf, reg, val);
>>>>>>>>      }
>>>>>>>>      
>>>>>>>> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>> +                            uint32_t val, void *data)
>>>>>>>> +{
>>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>>> +    bool hi = false;
>>>>>>>> +
>>>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>>>>> +    {
>>>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>>>>> +        bar--;
>>>>>>>> +        hi = true;
>>>>>>>> +    }
>>>>>>>> +    else
>>>>>>>> +    {
>>>>>>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>>>>>> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>>>>>>>> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
>>>>>>>> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>>>>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>>>>>>> +
>>>>>>>> +    bar->guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>> +                               void *data)
>>>>>>>> +{
>>>>>>>> +    const struct vpci_bar *bar = data;
>>>>>>>> +    bool hi = false;
>>>>>>>> +
>>>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>>>>> +    {
>>>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>>>>> +        bar--;
>>>>>>>> +        hi = true;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return bar->guest_addr >> (hi ? 32 : 0);
>>>>>>> I'm afraid "guest_addr" then isn't the best name; maybe "guest_val"?
>>>>>>> This would make more obvious that there is a meaningful difference
>>>>>>> from "addr" besides the guest vs host aspect.
>>>>>> I am not sure I can agree here:
>>>>>> bar->addr and bar->guest_addr make it clear what these are, while
>>>>>> bar->addr and bar->guest_val would make someone go look for
>>>>>> additional information about what that val is for.
>>>>> Feel free to replace "val" with something more suitable. "guest_bar"
>>>>> maybe? The value definitely is not an address, so "addr" seems
>>>>> inappropriate / misleading to me.
>>>> This is a guest's view on the BAR's address. So to me it is still guest_addr
>>> It's a guest's view on the BAR, not just the address. Or else you couldn't
>>> simply return the value here without folding in the correct low bits.
>> I agree in this respect, as it is indeed address + lower bits.
>> How about guest_bar_val then? So it reflects its nature, e.g. the value
>> of the BAR as seen by the guest.
> Gets a little longish for my taste. I for one wouldn't mind it being just
> "guest". In the end Roger has the final say here anyway.
Ok, so let Roger choose the name ;)
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR
  2021-11-19 13:06           ` Jan Beulich
@ 2021-11-19 13:19             ` Oleksandr Andrushchenko
  2021-11-19 13:29               ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 13:19 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel



On 19.11.21 15:06, Jan Beulich wrote:
> On 19.11.2021 13:50, Oleksandr Andrushchenko wrote:
>> On 19.11.21 14:45, Jan Beulich wrote:
>>> On 19.11.2021 13:13, Oleksandr Andrushchenko wrote:
>>>> On 19.11.21 14:05, Jan Beulich wrote:
>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>
>>>>>> Instead of handling a single range set, that contains all the memory
>>>>>> regions of all the BARs and ROM, have them per BAR.
>>>>> Iirc Roger did indicate agreement with the splitting. May I nevertheless
>>>>> ask that for posterity you say a word here about the overhead, to make
>>>>> clear this was a conscious decision?
>>>> Sure, but could you please help me with that sentence to please your
>>>> eye? I mean that it was you who saw the overhead, while I did not; and
>>>> as to implementing functionality similar to what range sets do, I still
>>>> think we'll duplicate range sets at the end of the day.
>>> "Note that rangesets were chosen here despite there being only up to
>>> <N> separate ranges in each set (typically just 1)." Albeit that's
>>> then still lacking a justification for the choice. Ease of
>>> implementation?
>> I guess yes. I'll put:
>>
>> "Note that rangesets were chosen here despite there being only up to
>> <N> separate ranges in each set (typically just 1). But rangeset per BAR
>> was chosen for the ease of implementation and existing code re-usability."
> FTAOD please don't forget to replace the <N> - I wasn't sure if it would
> be 2 or 3.
It seems we can't put the exact number, as it depends on how many MSI/MSI-X
holes there are, and that depends on arbitrary device properties.
>   Also (nit) I don't think starting the 2nd sentence with "But
> ..." fits with the 1st sentence.
Sure, I will clean it up
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-19 13:16                                         ` Oleksandr Andrushchenko
@ 2021-11-19 13:25                                           ` Jan Beulich
  2021-11-19 13:34                                             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-19 13:25 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Roger Pau Monné,
	julien, Rahul Singh, xen-devel, Stefano Stabellini

On 19.11.2021 14:16, Oleksandr Andrushchenko wrote:
> On 19.11.21 15:00, Jan Beulich wrote:
>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>>> Possible locking and other work needed:
>>> =======================================
>>>
>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>>
>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>>> index 28146ee404e6..ebf071893b21 100644
>>> --- a/xen/include/xen/sched.h
>>> +++ b/xen/include/xen/sched.h
>>> @@ -444,6 +444,8 @@ struct domain
>>>
>>>    #ifdef CONFIG_HAS_PCI
>>>        struct list_head pdev_list;
>>> +    rwlock_t vpci_rwlock;
>>> +    bool vpci_terminating; <- atomic?
>>>    #endif
>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>>> vpci_mmio_{read|write} are readers (hot path).
>> Right - you need such a lock for other purposes anyway, as per the
>> discussion with Julien.
> What about bool vpci_terminating? Do you see it as an atomic type or just bool?

Having seen only ...

>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>>> to be implemented, so the removal can be re-started if need be:
>>>
>>> vpci_remove_device()
>>> {
>>>     d->vpci_terminating = true;

... this use so far, I can't tell yet. But at a first glance a boolean
looks to be what you need.

>>>     remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>>     if ( !write_trylock(&d->vpci_rwlock) )
>>>       return -ERESTART;
>>>     xfree(pdev->vpci);
>>>     pdev->vpci = NULL;
>>> }
>>>
>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>>> other operations which may require it, e.g. virtual bus topology can
>>> use it when assigning vSBDF etc.
>>>
>>> 4. The vpci_remove_device call needs to be removed from vpci_process_pending,
>>> which should then do nothing for Dom0 and crash DomU otherwise:
>> Why is this? I'm not outright opposed, but I don't immediately see why
>> trying to remove the problematic device wouldn't be a reasonable course
>> of action anymore. vpci_remove_device() may need to become more careful
>> as to not crashing,
> vpci_remove_device does not crash, vpci_process_pending does
>>   though.
> Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
> and we call vpci_remove_device. vpci_remove_device tries to acquire the
> lock and it can't, just because some other vpci code is running on another vCPU.
> Then what do we do here? We are in SoftIRQ context now and we can't spin
> trying to acquire d->vpci_rwlock forever. Nor can we blindly free the vpci
> structure, because it is seen by all vCPUs and may crash them.
> 
> If vpci_remove_device is in hypercall context it just returns -ERESTART and
> hypercall continuation helps here. But not in SoftIRQ context.

Maybe then you want to invoke this cleanup from RCU context (whether
vpci_remove_device() itself or a suitable clone thereof is TBD)? (I
will admit though that I didn't check whether that would satisfy all
constraints.)
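
A rough sketch of that route, assuming struct vpci would grow an
rcu_head member (named rcu here):

static void vpci_free(struct rcu_head *rcu)
{
    xfree(container_of(rcu, struct vpci, rcu));
}

    /* In the remove path, after unhooking the register handlers: */
    struct vpci *vpci = pdev->vpci;

    pdev->vpci = NULL;
    call_rcu(&vpci->rcu, vpci_free);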

Then again it also hasn't become clear to me why you use write_trylock()
there. The lock contention you describe doesn't, on the surface, look
any different from situations elsewhere.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR
  2021-11-19 13:19             ` Oleksandr Andrushchenko
@ 2021-11-19 13:29               ` Jan Beulich
  2021-11-19 13:38                 ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-19 13:29 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 19.11.2021 14:19, Oleksandr Andrushchenko wrote:
> 
> 
> On 19.11.21 15:06, Jan Beulich wrote:
>> On 19.11.2021 13:50, Oleksandr Andrushchenko wrote:
>>> On 19.11.21 14:45, Jan Beulich wrote:
>>>> On 19.11.2021 13:13, Oleksandr Andrushchenko wrote:
>>>>> On 19.11.21 14:05, Jan Beulich wrote:
>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>
>>>>>>> Instead of handling a single range set, that contains all the memory
>>>>>>> regions of all the BARs and ROM, have them per BAR.
>>>>>> Iirc Roger did indicate agreement with the splitting. May I nevertheless
>>>>>> ask that for posterity you say a word here about the overhead, to make
>>>>>> clear this was a conscious decision?
>>>>> Sure, but could you please help me with that sentence to please your
>>>>> eye? I mean that it was you who saw the overhead, while I did not; and
>>>>> as to implementing functionality similar to what range sets do, I still
>>>>> think we'll duplicate range sets at the end of the day.
>>>> "Note that rangesets were chosen here despite there being only up to
>>>> <N> separate ranges in each set (typically just 1)." Albeit that's
>>>> then still lacking a justification for the choice. Ease of
>>>> implementation?
>>> I guess yes. I'll put:
>>>
>>> "Note that rangesets were chosen here despite there being only up to
>>> <N> separate ranges in each set (typically just 1). But rangeset per BAR
>>> was chosen for the ease of implementation and existing code re-usability."
>> FTAOD please don't forget to replace the <N> - I wasn't sure if it would
>> be 2 or 3.
> It seems we can't put the exact number, as it depends on how many MSI/MSI-X
> holes there are, and that depends on arbitrary device properties.

There aren't any MSI holes, and there can be at most 2 MSI-X holes iirc
(MSI-X table and PBA). What I don't recall is whether there are
constraints on these two, but istr them being fully independent. This
would make the upper bound 3 (both in one BAR, other BARs then all using
just a single range).

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-19 13:25                                           ` Jan Beulich
@ 2021-11-19 13:34                                             ` Oleksandr Andrushchenko
  2021-11-22 14:21                                               ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 13:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Roger Pau Monné,
	julien, Rahul Singh, xen-devel, Stefano Stabellini



On 19.11.21 15:25, Jan Beulich wrote:
> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote:
>> On 19.11.21 15:00, Jan Beulich wrote:
>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>>>> Possible locking and other work needed:
>>>> =======================================
>>>>
>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>>>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>>>
>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>>>> index 28146ee404e6..ebf071893b21 100644
>>>> --- a/xen/include/xen/sched.h
>>>> +++ b/xen/include/xen/sched.h
>>>> @@ -444,6 +444,7 @@ struct domain
>>>>
>>>>     #ifdef CONFIG_HAS_PCI
>>>>         struct list_head pdev_list;
>>>> +    rwlock_t vpci_rwlock;
>>>> +    bool vpci_terminating; <- atomic?
>>>>     #endif
>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>>>> vpci_mmio_{read|write} are readers (hot path).
>>> Right - you need such a lock for other purposes anyway, as per the
>>> discussion with Julien.
>> What about bool vpci_terminating? Do you see it as an atomic type or just bool?
> Having seen only ...
>
>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>>>> to be implemented, so when re-start removal if need be:
>>>>
>>>> vpci_remove_device()
>>>> {
>>>>      d->vpci_terminating = true;
> ... this use so far, I can't tell yet. But at a first glance a boolean
> looks to be what you need.
>
>>>>      remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>>>      if ( !write_trylock(d->vpci_rwlock) )
>>>>        return -ERESTART;
>>>>      xfree(pdev->vpci);
>>>>      pdev->vpci = NULL;
>>>> }
>>>>
>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>>>> other operations which may require it, e.g. virtual bus topology can
>>>> use it when assigning vSBDF etc.
>>>>
>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending
>>>> and do nothing for Dom0 and crash DomU otherwise:
>>> Why is this? I'm not outright opposed, but I don't immediately see why
>>> trying to remove the problematic device wouldn't be a reasonable course
>>> of action anymore. vpci_remove_device() may need to become more careful
>>> as to not crashing,
>> vpci_remove_device does not crash, vpci_process_pending does
>>>    though.
>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
>> and we call vpci_remove_device. vpci_remove_device tries to acquire the
>> lock and can't, simply because some other vpci code is running on another vCPU.
>> Then what do we do here? We are in SoftIRQ context now and we can't spin
>> trying to acquire d->vpci_rwlock forever. Nor can we blindly free the vpci
>> structure, because it is seen by all vCPUs and may crash them.
>>
>> If vpci_remove_device is in hypercall context it just returns -ERESTART and
>> hypercall continuation helps here. But not in SoftIRQ context.
> Maybe then you want to invoke this cleanup from RCU context (whether
> vpci_remove_device() itself or a suitable clone there of is TBD)? (I
> will admit though that I didn't check whether that would satisfy all
> constraints.)
>
> Then again it also hasn't become clear to me why you use write_trylock()
> there. The lock contention you describe doesn't, on the surface, look
> any different from situations elsewhere.
I use write_trylock in vpci_remove_device because if we can't
acquire the lock then we defer device removal. This works well
if called from a hypercall, which can employ hypercall continuation.
But a SoftIRQ getting -ERESTART is something we probably can't
handle by restarting the way a hypercall can, thus I only see that
vpci_process_pending will need to spin and wait until vpci_remove_device
succeeds.
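
For the hypercall path I mean something along these lines (sketch only,
assuming vpci_remove_device() learns to return -ERESTART as per the
pseudo-code above, and that the error can be propagated up to
do_physdev_op()):

/* In the PHYSDEVOP_pci_device_remove handling: */
ret = pci_remove_device(dev.seg, dev.bus, dev.devfn);
if ( ret == -ERESTART )
    ret = hypercall_create_continuation(__HYPERVISOR_physdev_op,
                                        "ih", cmd, arg);

Nothing equivalent exists for the SoftIRQ path.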
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR
  2021-11-19 13:29               ` Jan Beulich
@ 2021-11-19 13:38                 ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 13:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel



On 19.11.21 15:29, Jan Beulich wrote:
> On 19.11.2021 14:19, Oleksandr Andrushchenko wrote:
>>
>> On 19.11.21 15:06, Jan Beulich wrote:
>>> On 19.11.2021 13:50, Oleksandr Andrushchenko wrote:
>>>> On 19.11.21 14:45, Jan Beulich wrote:
>>>>> On 19.11.2021 13:13, Oleksandr Andrushchenko wrote:
>>>>>> On 19.11.21 14:05, Jan Beulich wrote:
>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>
>>>>>>>> Instead of handling a single range set, that contains all the memory
>>>>>>>> regions of all the BARs and ROM, have them per BAR.
>>>>>>> Iirc Roger did indicate agreement with the splitting. May I nevertheless
>>>>>>> ask that for posterity you say a word here about the overhead, to make
>>>>>>> clear this was a conscious decision?
>>>>>> Sure, but could you please help me with wording that sentence to your
>>>>>> satisfaction? I mean that it was you who saw the overhead while I did not;
>>>>>> to implement similar functionality without range sets I still think we'd
>>>>>> end up duplicating range sets at the end of the day.
>>>>> "Note that rangesets were chosen here despite there being only up to
>>>>> <N> separate ranges in each set (typically just 1)." Albeit that's
>>>>> then still lacking a justification for the choice. Ease of
>>>>> implementation?
>>>> I guess yes. I'll put:
>>>>
>>>> "Note that rangesets were chosen here despite there being only up to
>>>> <N> separate ranges in each set (typically just 1). But rangeset per BAR
>>>> was chosen for the ease of implementation and existing code re-usability."
>>> FTAOD please don't forget to replace the <N> - I wasn't sure if it would
>>> be 2 or 3.
>> It seems we can't put the exact number, as it depends on how many MSI/MSI-X
>> holes there are, and that depends on arbitrary device properties.
> There aren't any MSI holes, and there can be at most 2 MSI-X holes iirc
> (MSI-X table and PBA). What I don't recall is whether there are
> constraints on these two, but istr them being fully independent. This
> would make the upper bound 3 (both in one BAR, other BARs then all using
> just a single range).
So if they are both in a single BAR (this is what I probably saw while
running QEMU for PVH Dom0 tests), then we may have up to 3 ranges
per BAR at most, so I will use 3 instead of N in the description and will
probably add a note on how we came up with N == 3.
>
> Jan
>
Thank you!!
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR
  2021-11-19 13:16   ` Jan Beulich
@ 2021-11-19 13:41     ` Oleksandr Andrushchenko
  2021-11-19 13:57       ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 13:41 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel



On 19.11.21 15:16, Jan Beulich wrote:
> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>> @@ -95,10 +102,25 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>       INIT_LIST_HEAD(&pdev->vpci->handlers);
>>       spin_lock_init(&pdev->vpci->lock);
>>   
>> +    header = &pdev->vpci->header;
>> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>> +    {
>> +        struct vpci_bar *bar = &header->bars[i];
>> +
>> +        bar->mem = rangeset_new(NULL, NULL, 0);
> I don't recall why an anonymous range set was chosen back at the time
> when vPCI was first implemented, but I think this needs to be changed
> now that DomU-s get supported. Whether you do so right here or in a
> prereq patch is secondary to me. It may be desirable to exclude them
> from rangeset_domain_printk() (which would likely require a new
> RANGESETF_* flag), but I think such resources should be associated
> with their domains.
What would be the proper name for such a range set then?
"vpci_bar"?
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 00/11] PCI devices passthrough on Arm, part 3
  2021-11-05  6:56 [PATCH v4 00/11] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (10 preceding siblings ...)
  2021-11-05  6:56 ` [PATCH v4 11/11] xen/arm: translate virtual PCI bus topology for guests Oleksandr Andrushchenko
@ 2021-11-19 13:56 ` Jan Beulich
  2021-11-19 14:06   ` Oleksandr Andrushchenko
  2021-11-19 14:23   ` Roger Pau Monné
  11 siblings, 2 replies; 101+ messages in thread
From: Jan Beulich @ 2021-11-19 13:56 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Hi, all!
> 
> This patch series is focusing on vPCI and adds support for non-identity
> PCI BAR mappings which is required while passing through a PCI device to
> a guest. The highlights are:
> 
> - Add relevant vpci register handlers when assigning PCI device to a domain
>   and remove those when de-assigning. This allows having different
>   handlers for different domains, e.g. hwdom and other guests.
> 
> - Emulate guest BAR register values based on physical BAR values.
>   This allows creating a guest view of the registers and emulates
>   size and properties probe as it is done during PCI device enumeration by
>   the guest.
> 
> - Instead of handling a single range set, that contains all the memory
>   regions of all the BARs and ROM, have them per BAR.
> 
> - Take into account guest's BAR view and program its p2m accordingly:
>   gfn is guest's view of the BAR and mfn is the physical BAR value as set
>   up by the host bridge in the hardware domain.
>   This way hardware doamin sees physical BAR values and guest sees
>   emulated ones.
> 
> The series also adds support for virtual PCI bus topology for guests:
>  - We emulate a single host bridge for the guest, so segment is always 0.
>  - The implementation is limited to 32 devices which are allowed on
>    a single PCI bus.
>  - The virtual bus number is set to 0, so virtual devices are seen
>    as embedded endpoints behind the root complex.
> 
> The series was also tested on:
>  - x86 PVH Dom0 and doesn't break it.
>  - x86 HVM with PCI passthrough to DomU and doesn't break it.
> 
> Thank you,
> Oleksandr
> 
> Oleksandr Andrushchenko (11):
>   vpci: fix function attributes for vpci_process_pending
>   vpci: cancel pending map/unmap on vpci removal
>   vpci: make vpci registers removal a dedicated function
>   vpci: add hooks for PCI device assign/de-assign
>   vpci/header: implement guest BAR register handlers
>   vpci/header: handle p2m range sets per BAR
>   vpci/header: program p2m with guest BAR view
>   vpci/header: emulate PCI_COMMAND register for guests
>   vpci/header: reset the command register when adding devices
>   vpci: add initial support for virtual PCI bus topology
>   xen/arm: translate virtual PCI bus topology for guests

If I'm not mistaken by the end of this series a guest can access a
device handed to it. I couldn't find anything dealing with the
uses of vpci_{read,write}_hw() and vpci_hw_{read,write}*() to cover
config registers not covered by registered handlers. IMO this should
happen before patch 5: Before any handlers get registered the view a
guest would have would be all ones no matter which register it
accesses. Handler registration would then "punch holes" into this
"curtain", as opposed to Dom0, where handler registration hides
previously visible raw hardware registers.
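
In pseudo-code terms for the DomU read path (only a sketch; the real
logic would also need to deal with accesses partially covered by
registered handlers, and the write side would mirror this):

/* In vpci_read(), instead of falling through to raw hardware: */
if ( !is_hardware_domain(d) )
    return 0xffffffffU >> (32 - 8 * size); /* all ones for this width */
return vpci_read_hw(sbdf, reg, size);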

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR
  2021-11-19 13:41     ` Oleksandr Andrushchenko
@ 2021-11-19 13:57       ` Jan Beulich
  2021-11-19 14:09         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-19 13:57 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 19.11.2021 14:41, Oleksandr Andrushchenko wrote:
> 
> 
> On 19.11.21 15:16, Jan Beulich wrote:
>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>> @@ -95,10 +102,25 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>>       INIT_LIST_HEAD(&pdev->vpci->handlers);
>>>       spin_lock_init(&pdev->vpci->lock);
>>>   
>>> +    header = &pdev->vpci->header;
>>> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>> +    {
>>> +        struct vpci_bar *bar = &header->bars[i];
>>> +
>>> +        bar->mem = rangeset_new(NULL, NULL, 0);
>> I don't recall why an anonymous range set was chosen back at the time
>> when vPCI was first implemented, but I think this needs to be changed
>> now that DomU-s get supported. Whether you do so right here or in a
>> prereq patch is secondary to me. It may be desirable to exclude them
>> from rangeset_domain_printk() (which would likely require a new
>> RANGESETF_* flag), but I think such resources should be associated
>> with their domains.
> What would be the proper name for such a range set then?
> "vpci_bar"?

E.g. bb:dd.f:BARn
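
I.e. something like (sketch only; note rangeset_new() copies the name,
so a local buffer is fine, and using the %pp format would additionally
include the segment):

char str[32];

snprintf(str, sizeof(str), "%pp:BAR%u", &pdev->sbdf, i);
bar->mem = rangeset_new(pdev->domain, str, 0);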

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 00/11] PCI devices passthrough on Arm, part 3
  2021-11-19 13:56 ` [PATCH v4 00/11] PCI devices passthrough on Arm, part 3 Jan Beulich
@ 2021-11-19 14:06   ` Oleksandr Andrushchenko
  2021-11-19 14:23   ` Roger Pau Monné
  1 sibling, 0 replies; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 14:06 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 19.11.21 15:56, Jan Beulich wrote:
> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Hi, all!
>>
>> This patch series is focusing on vPCI and adds support for non-identity
>> PCI BAR mappings which is required while passing through a PCI device to
>> a guest. The highlights are:
>>
>> - Add relevant vpci register handlers when assigning PCI device to a domain
>>    and remove those when de-assigning. This allows having different
>>    handlers for different domains, e.g. hwdom and other guests.
>>
>> - Emulate guest BAR register values based on physical BAR values.
>>    This allows creating a guest view of the registers and emulates
>>    size and properties probe as it is done during PCI device enumeration by
>>    the guest.
>>
>> - Instead of handling a single range set, that contains all the memory
>>    regions of all the BARs and ROM, have them per BAR.
>>
>> - Take into account guest's BAR view and program its p2m accordingly:
>>    gfn is guest's view of the BAR and mfn is the physical BAR value as set
>>    up by the host bridge in the hardware domain.
>>    This way hardware doamin sees physical BAR values and guest sees
>>    emulated ones.
>>
>> The series also adds support for virtual PCI bus topology for guests:
>>   - We emulate a single host bridge for the guest, so segment is always 0.
>>   - The implementation is limited to 32 devices which are allowed on
>>     a single PCI bus.
>>   - The virtual bus number is set to 0, so virtual devices are seen
>>     as embedded endpoints behind the root complex.
>>
>> The series was also tested on:
>>   - x86 PVH Dom0 and doesn't break it.
>>   - x86 HVM with PCI passthrough to DomU and doesn't break it.
>>
>> Thank you,
>> Oleksandr
>>
>> Oleksandr Andrushchenko (11):
>>    vpci: fix function attributes for vpci_process_pending
>>    vpci: cancel pending map/unmap on vpci removal
>>    vpci: make vpci registers removal a dedicated function
>>    vpci: add hooks for PCI device assign/de-assign
>>    vpci/header: implement guest BAR register handlers
>>    vpci/header: handle p2m range sets per BAR
>>    vpci/header: program p2m with guest BAR view
>>    vpci/header: emulate PCI_COMMAND register for guests
>>    vpci/header: reset the command register when adding devices
>>    vpci: add initial support for virtual PCI bus topology
>>    xen/arm: translate virtual PCI bus topology for guests
> If I'm not mistaken by the end of this series a guest can access a
> device handed to it. I couldn't find anything dealing with the
> uses of vpci_{read,write}_hw() and vpci_hw_{read,write}*() to cover
> config registers not covered by registered handlers. IMO this should
> happen before patch 5: Before any handlers get registered the view a
> guest would have would be all ones no matter which register it
> accesses. Handler registration would then "punch holes" into this
> "curtain", as opposed to Dom0, where handler registration hides
> previously visible raw hardware registers.
This is "by design" now which is not good, I know. We only have some
register handlers set, but the rest of the configuration space is
still visible raw to the guest without restrictions. Not letting the
guest access those and returning all ones will render the device
unusable for the guest as it does need access to all its configuration
space. This means that we would need to emulate every possible
register for the guest which seems to be out of the scope of this series.

But definitely you are right that this needs to be solved somehow.
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR
  2021-11-19 13:57       ` Jan Beulich
@ 2021-11-19 14:09         ` Oleksandr Andrushchenko
  2021-11-22  8:24           ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 14:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 19.11.21 15:57, Jan Beulich wrote:
> On 19.11.2021 14:41, Oleksandr Andrushchenko wrote:
>>
>> On 19.11.21 15:16, Jan Beulich wrote:
>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>> @@ -95,10 +102,25 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>>>        INIT_LIST_HEAD(&pdev->vpci->handlers);
>>>>        spin_lock_init(&pdev->vpci->lock);
>>>>    
>>>> +    header = &pdev->vpci->header;
>>>> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>> +    {
>>>> +        struct vpci_bar *bar = &header->bars[i];
>>>> +
>>>> +        bar->mem = rangeset_new(NULL, NULL, 0);
>>> I don't recall why an anonymous range set was chosen back at the time
>>> when vPCI was first implemented, but I think this needs to be changed
>>> now that DomU-s get supported. Whether you do so right here or in a
>>> prereq patch is secondary to me. It may be desirable to exclude them
>>> from rangeset_domain_printk() (which would likely require a new
>>> RANGESETF_* flag), but I think such resources should be associated
>>> with their domains.
>> What would be the proper name for such a range set then?
>> "vpci_bar"?
> E.g. bb:dd.f:BARn
Hm, indeed.
I can only see a single flag, RANGESETF_prettyprint_hex, which tells
*how* to print, but I can't see any way to limit *what* gets printed.
So, do you mean you want some logic to be implemented in
rangeset_domain_printk so it knows that an entry needs to be skipped
while printing? RANGESETF_skip_print?
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 00/11] PCI devices passthrough on Arm, part 3
  2021-11-19 13:56 ` [PATCH v4 00/11] PCI devices passthrough on Arm, part 3 Jan Beulich
  2021-11-19 14:06   ` Oleksandr Andrushchenko
@ 2021-11-19 14:23   ` Roger Pau Monné
  2021-11-19 14:26     ` Oleksandr Andrushchenko
  2021-11-22  8:22     ` Jan Beulich
  1 sibling, 2 replies; 101+ messages in thread
From: Roger Pau Monné @ 2021-11-19 14:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, julien, sstabellini,
	oleksandr_tyshchenko, volodymyr_babchuk, Artem_Mygaiev,
	andrew.cooper3, george.dunlap, paul, bertrand.marquis,
	rahul.singh, Oleksandr Andrushchenko, xen-devel

On Fri, Nov 19, 2021 at 02:56:12PM +0100, Jan Beulich wrote:
> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> > From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> > 
> > Hi, all!
> > 
> > This patch series is focusing on vPCI and adds support for non-identity
> > PCI BAR mappings which is required while passing through a PCI device to
> > a guest. The highlights are:
> > 
> > - Add relevant vpci register handlers when assigning PCI device to a domain
> >   and remove those when de-assigning. This allows having different
> >   handlers for different domains, e.g. hwdom and other guests.
> > 
> > - Emulate guest BAR register values based on physical BAR values.
> >   This allows creating a guest view of the registers and emulates
> >   size and properties probe as it is done during PCI device enumeration by
> >   the guest.
> > 
> > - Instead of handling a single range set, that contains all the memory
> >   regions of all the BARs and ROM, have them per BAR.
> > 
> > - Take into account guest's BAR view and program its p2m accordingly:
> >   gfn is guest's view of the BAR and mfn is the physical BAR value as set
> >   up by the host bridge in the hardware domain.
> >   This way hardware doamin sees physical BAR values and guest sees
> >   emulated ones.
> > 
> > The series also adds support for virtual PCI bus topology for guests:
> >  - We emulate a single host bridge for the guest, so segment is always 0.
> >  - The implementation is limited to 32 devices which are allowed on
> >    a single PCI bus.
> >  - The virtual bus number is set to 0, so virtual devices are seen
> >    as embedded endpoints behind the root complex.
> > 
> > The series was also tested on:
> >  - x86 PVH Dom0 and doesn't break it.
> >  - x86 HVM with PCI passthrough to DomU and doesn't break it.
> > 
> > Thank you,
> > Oleksandr
> > 
> > Oleksandr Andrushchenko (11):
> >   vpci: fix function attributes for vpci_process_pending
> >   vpci: cancel pending map/unmap on vpci removal
> >   vpci: make vpci registers removal a dedicated function
> >   vpci: add hooks for PCI device assign/de-assign
> >   vpci/header: implement guest BAR register handlers
> >   vpci/header: handle p2m range sets per BAR
> >   vpci/header: program p2m with guest BAR view
> >   vpci/header: emulate PCI_COMMAND register for guests
> >   vpci/header: reset the command register when adding devices
> >   vpci: add initial support for virtual PCI bus topology
> >   xen/arm: translate virtual PCI bus topology for guests
> 
> If I'm not mistaken by the end of this series a guest can access a
> device handed to it. I couldn't find anything dealing with the
> uses of vpci_{read,write}_hw() and vpci_hw_{read,write}*() to cover
> config registers not covered by registered handlers. IMO this should
> happen before patch 5: Before any handlers get registered the view a
> guest would have would be all ones no matter which register it
> accesses. Handler registration would then "punch holes" into this
> "curtain", as opposed to Dom0, where handler registration hides
> previously visible raw hardware registers.

FWIW, I've also raised the same concern in a different thread:

https://lore.kernel.org/xen-devel/YYD7VmDGKJRkid4a@Air-de-Roger/

It seems like this is future work, but unless such a model is
implemented vPCI cannot be used for guest passthrough.

I'm fine with doing it in a separate series, but needs to be kept in
mind.

Regards, Roger.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 00/11] PCI devices passthrough on Arm, part 3
  2021-11-19 14:23   ` Roger Pau Monné
@ 2021-11-19 14:26     ` Oleksandr Andrushchenko
  2021-11-20  9:47       ` Roger Pau Monné
  2021-11-22  8:22     ` Jan Beulich
  1 sibling, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-19 14:26 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: Oleksandr Andrushchenko, julien, sstabellini,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel



On 19.11.21 16:23, Roger Pau Monné wrote:
> On Fri, Nov 19, 2021 at 02:56:12PM +0100, Jan Beulich wrote:
>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>
>>> Hi, all!
>>>
>>> This patch series is focusing on vPCI and adds support for non-identity
>>> PCI BAR mappings which is required while passing through a PCI device to
>>> a guest. The highlights are:
>>>
>>> - Add relevant vpci register handlers when assigning PCI device to a domain
>>>    and remove those when de-assigning. This allows having different
>>>    handlers for different domains, e.g. hwdom and other guests.
>>>
>>> - Emulate guest BAR register values based on physical BAR values.
>>>    This allows creating a guest view of the registers and emulates
>>>    size and properties probe as it is done during PCI device enumeration by
>>>    the guest.
>>>
>>> - Instead of handling a single range set, that contains all the memory
>>>    regions of all the BARs and ROM, have them per BAR.
>>>
>>> - Take into account guest's BAR view and program its p2m accordingly:
>>>    gfn is guest's view of the BAR and mfn is the physical BAR value as set
>>>    up by the host bridge in the hardware domain.
>>>    This way hardware doamin sees physical BAR values and guest sees
>>>    emulated ones.
>>>
>>> The series also adds support for virtual PCI bus topology for guests:
>>>   - We emulate a single host bridge for the guest, so segment is always 0.
>>>   - The implementation is limited to 32 devices which are allowed on
>>>     a single PCI bus.
>>>   - The virtual bus number is set to 0, so virtual devices are seen
>>>     as embedded endpoints behind the root complex.
>>>
>>> The series was also tested on:
>>>   - x86 PVH Dom0 and doesn't break it.
>>>   - x86 HVM with PCI passthrough to DomU and doesn't break it.
>>>
>>> Thank you,
>>> Oleksandr
>>>
>>> Oleksandr Andrushchenko (11):
>>>    vpci: fix function attributes for vpci_process_pending
>>>    vpci: cancel pending map/unmap on vpci removal
>>>    vpci: make vpci registers removal a dedicated function
>>>    vpci: add hooks for PCI device assign/de-assign
>>>    vpci/header: implement guest BAR register handlers
>>>    vpci/header: handle p2m range sets per BAR
>>>    vpci/header: program p2m with guest BAR view
>>>    vpci/header: emulate PCI_COMMAND register for guests
>>>    vpci/header: reset the command register when adding devices
>>>    vpci: add initial support for virtual PCI bus topology
>>>    xen/arm: translate virtual PCI bus topology for guests
>> If I'm not mistaken by the end of this series a guest can access a
>> device handed to it. I couldn't find anything dealing with the
>> uses of vpci_{read,write}_hw() and vpci_hw_{read,write}*() to cover
>> config registers not covered by registered handlers. IMO this should
>> happen before patch 5: Before any handlers get registered the view a
>> guest would have would be all ones no matter which register it
>> accesses. Handler registration would then "punch holes" into this
>> "curtain", as opposed to Dom0, where handler registration hides
>> previously visible raw hardware registers.
> FWIW, I've also raised the same concern in a different thread:
>
> https://lore.kernel.org/xen-devel/YYD7VmDGKJRkid4a@Air-de-Roger/
>
> It seems like this is future work,
Yes, it takes quite some time to get even what we have now...
>   but unless such a model is
> implemented vPCI cannot be used for guest passthrough.
But it can be a tech-preview
>
> I'm fine with doing it in a separate series, but needs to be kept in
> mind.
Sure
>
> Regards, Roger.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 00/11] PCI devices passthrough on Arm, part 3
  2021-11-19 14:26     ` Oleksandr Andrushchenko
@ 2021-11-20  9:47       ` Roger Pau Monné
  0 siblings, 0 replies; 101+ messages in thread
From: Roger Pau Monné @ 2021-11-20  9:47 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, Oleksandr Andrushchenko, julien, sstabellini,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel

On Fri, Nov 19, 2021 at 02:26:21PM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 19.11.21 16:23, Roger Pau Monné wrote:
> > On Fri, Nov 19, 2021 at 02:56:12PM +0100, Jan Beulich wrote:
> >> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> >>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >>>
> >>> Hi, all!
> >>>
> >>> This patch series is focusing on vPCI and adds support for non-identity
> >>> PCI BAR mappings which is required while passing through a PCI device to
> >>> a guest. The highlights are:
> >>>
> >>> - Add relevant vpci register handlers when assigning PCI device to a domain
> >>>    and remove those when de-assigning. This allows having different
> >>>    handlers for different domains, e.g. hwdom and other guests.
> >>>
> >>> - Emulate guest BAR register values based on physical BAR values.
> >>>    This allows creating a guest view of the registers and emulates
> >>>    size and properties probe as it is done during PCI device enumeration by
> >>>    the guest.
> >>>
> >>> - Instead of handling a single range set, that contains all the memory
> >>>    regions of all the BARs and ROM, have them per BAR.
> >>>
> >>> - Take into account guest's BAR view and program its p2m accordingly:
> >>>    gfn is guest's view of the BAR and mfn is the physical BAR value as set
> >>>    up by the host bridge in the hardware domain.
> >>>    This way hardware doamin sees physical BAR values and guest sees
> >>>    emulated ones.
> >>>
> >>> The series also adds support for virtual PCI bus topology for guests:
> >>>   - We emulate a single host bridge for the guest, so segment is always 0.
> >>>   - The implementation is limited to 32 devices which are allowed on
> >>>     a single PCI bus.
> >>>   - The virtual bus number is set to 0, so virtual devices are seen
> >>>     as embedded endpoints behind the root complex.
> >>>
> >>> The series was also tested on:
> >>>   - x86 PVH Dom0 and doesn't break it.
> >>>   - x86 HVM with PCI passthrough to DomU and doesn't break it.
> >>>
> >>> Thank you,
> >>> Oleksandr
> >>>
> >>> Oleksandr Andrushchenko (11):
> >>>    vpci: fix function attributes for vpci_process_pending
> >>>    vpci: cancel pending map/unmap on vpci removal
> >>>    vpci: make vpci registers removal a dedicated function
> >>>    vpci: add hooks for PCI device assign/de-assign
> >>>    vpci/header: implement guest BAR register handlers
> >>>    vpci/header: handle p2m range sets per BAR
> >>>    vpci/header: program p2m with guest BAR view
> >>>    vpci/header: emulate PCI_COMMAND register for guests
> >>>    vpci/header: reset the command register when adding devices
> >>>    vpci: add initial support for virtual PCI bus topology
> >>>    xen/arm: translate virtual PCI bus topology for guests
> >> If I'm not mistaken by the end of this series a guest can access a
> >> device handed to it. I couldn't find anything dealing with the
> >> uses of vpci_{read,write}_hw() and vpci_hw_{read,write}*() to cover
> >> config registers not covered by registered handlers. IMO this should
> >> happen before patch 5: Before any handlers get registered the view a
> >> guest would have would be all ones no matter which register it
> >> accesses. Handler registration would then "punch holes" into this
> >> "curtain", as opposed to Dom0, where handler registration hides
> >> previously visible raw hardware registers.
> > FWIW, I've also raised the same concern in a different thread:
> >
> > https://lore.kernel.org/xen-devel/YYD7VmDGKJRkid4a@Air-de-Roger/
> >
> > It seems like this is future work,
> Yes, it takes quite some time to get even what we have now...
> >   but unless such a model is
> > implemented vPCI cannot be used for guest passthrough.
> But it can be a tech-preview

I'm afraid 'Tech Preview' requires the feature to be functionally
complete, which I don't consider to be the case for vPCI unless the
above is solved. I think we could only label this as 'Experimental' until the
remaining work is done, but the limitations would need to be clearly
noted, as it would be completely insecure.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 00/11] PCI devices passthrough on Arm, part 3
  2021-11-19 14:23   ` Roger Pau Monné
  2021-11-19 14:26     ` Oleksandr Andrushchenko
@ 2021-11-22  8:22     ` Jan Beulich
  2021-11-22  8:34       ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-22  8:22 UTC (permalink / raw)
  To: Roger Pau Monné, Oleksandr Andrushchenko
  Cc: Oleksandr Andrushchenko, julien, sstabellini,
	oleksandr_tyshchenko, volodymyr_babchuk, Artem_Mygaiev,
	andrew.cooper3, george.dunlap, paul, bertrand.marquis,
	rahul.singh, xen-devel

On 19.11.2021 15:23, Roger Pau Monné wrote:
> On Fri, Nov 19, 2021 at 02:56:12PM +0100, Jan Beulich wrote:
>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>
>>> Hi, all!
>>>
>>> This patch series is focusing on vPCI and adds support for non-identity
>>> PCI BAR mappings which is required while passing through a PCI device to
>>> a guest. The highlights are:
>>>
>>> - Add relevant vpci register handlers when assigning PCI device to a domain
>>>   and remove those when de-assigning. This allows having different
>>>   handlers for different domains, e.g. hwdom and other guests.
>>>
>>> - Emulate guest BAR register values based on physical BAR values.
>>>   This allows creating a guest view of the registers and emulates
>>>   size and properties probe as it is done during PCI device enumeration by
>>>   the guest.
>>>
>>> - Instead of handling a single range set, that contains all the memory
>>>   regions of all the BARs and ROM, have them per BAR.
>>>
>>> - Take into account guest's BAR view and program its p2m accordingly:
>>>   gfn is guest's view of the BAR and mfn is the physical BAR value as set
>>>   up by the host bridge in the hardware domain.
>>>   This way hardware doamin sees physical BAR values and guest sees
>>>   emulated ones.
>>>
>>> The series also adds support for virtual PCI bus topology for guests:
>>>  - We emulate a single host bridge for the guest, so segment is always 0.
>>>  - The implementation is limited to 32 devices which are allowed on
>>>    a single PCI bus.
>>>  - The virtual bus number is set to 0, so virtual devices are seen
>>>    as embedded endpoints behind the root complex.
>>>
>>> The series was also tested on:
>>>  - x86 PVH Dom0 and doesn't break it.
>>>  - x86 HVM with PCI passthrough to DomU and doesn't break it.
>>>
>>> Thank you,
>>> Oleksandr
>>>
>>> Oleksandr Andrushchenko (11):
>>>   vpci: fix function attributes for vpci_process_pending
>>>   vpci: cancel pending map/unmap on vpci removal
>>>   vpci: make vpci registers removal a dedicated function
>>>   vpci: add hooks for PCI device assign/de-assign
>>>   vpci/header: implement guest BAR register handlers
>>>   vpci/header: handle p2m range sets per BAR
>>>   vpci/header: program p2m with guest BAR view
>>>   vpci/header: emulate PCI_COMMAND register for guests
>>>   vpci/header: reset the command register when adding devices
>>>   vpci: add initial support for virtual PCI bus topology
>>>   xen/arm: translate virtual PCI bus topology for guests
>>
>> If I'm not mistaken by the end of this series a guest can access a
>> device handed to it. I couldn't find anything dealing with the
>> uses of vpci_{read,write}_hw() and vpci_hw_{read,write}*() to cover
>> config registers not covered by registered handlers. IMO this should
>> happen before patch 5: Before any handlers get registered the view a
>> guest would have would be all ones no matter which register it
>> accesses. Handler registration would then "punch holes" into this
>> "curtain", as opposed to Dom0, where handler registration hides
>> previously visible raw hardware registers.
> 
> FWIW, I've also raised the same concern in a different thread:
> 
> https://lore.kernel.org/xen-devel/YYD7VmDGKJRkid4a@Air-de-Roger/
> 
> It seems like this is future work, but unless such a model is
> implemented vPCI cannot be used for guest passthrough.
> 
> I'm fine with doing it in a separate series, but needs to be kept in
> mind.

Not just this - it also needs to be recorded in this cover letter and
imo also in a comment in the sources somewhere. Or else the question
will (validly) be raised again and again.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR
  2021-11-19 14:09         ` Oleksandr Andrushchenko
@ 2021-11-22  8:24           ` Jan Beulich
  2021-11-22  8:31             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-22  8:24 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 19.11.2021 15:09, Oleksandr Andrushchenko wrote:
> On 19.11.21 15:57, Jan Beulich wrote:
>> On 19.11.2021 14:41, Oleksandr Andrushchenko wrote:
>>> On 19.11.21 15:16, Jan Beulich wrote:
>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>> @@ -95,10 +102,25 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>>>>        INIT_LIST_HEAD(&pdev->vpci->handlers);
>>>>>        spin_lock_init(&pdev->vpci->lock);
>>>>>    
>>>>> +    header = &pdev->vpci->header;
>>>>> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>> +    {
>>>>> +        struct vpci_bar *bar = &header->bars[i];
>>>>> +
>>>>> +        bar->mem = rangeset_new(NULL, NULL, 0);
>>>> I don't recall why an anonymous range set was chosen back at the time
>>>> when vPCI was first implemented, but I think this needs to be changed
>>>> now that DomU-s get supported. Whether you do so right here or in a
>>>> prereq patch is secondary to me. It may be desirable to exclude them
>>>> from rangeset_domain_printk() (which would likely require a new
>>>> RANGESETF_* flag), but I think such resources should be associated
>>>> with their domains.
>>> What would be the proper name for such a range set then?
>>> "vpci_bar"?
>> E.g. bb:dd.f:BARn
> Hm, indeed.
> I can only see a single flag, RANGESETF_prettyprint_hex, which tells
> *how* to print, but I can't see any way to limit *what* gets printed.
> So, do you mean you want some logic to be implemented in
> rangeset_domain_printk so it knows that an entry needs to be skipped
> while printing? RANGESETF_skip_print?

Yes, albeit I'd call the flag e.g. RANGESETF_no_print.
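
I.e. along the lines of (sketch):

/* In xen/include/xen/rangeset.h, next to RANGESETF_prettyprint_hex: */
#define _RANGESETF_no_print 1
#define RANGESETF_no_print  (1U << _RANGESETF_no_print)

/* And in rangeset_domain_printk()'s loop over the domain's rangesets: */
if ( r->flags & RANGESETF_no_print )
    continue;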

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR
  2021-11-22  8:24           ` Jan Beulich
@ 2021-11-22  8:31             ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-22  8:31 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 22.11.21 10:24, Jan Beulich wrote:
> On 19.11.2021 15:09, Oleksandr Andrushchenko wrote:
>> On 19.11.21 15:57, Jan Beulich wrote:
>>> On 19.11.2021 14:41, Oleksandr Andrushchenko wrote:
>>>> On 19.11.21 15:16, Jan Beulich wrote:
>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>> @@ -95,10 +102,25 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>>>>>         INIT_LIST_HEAD(&pdev->vpci->handlers);
>>>>>>         spin_lock_init(&pdev->vpci->lock);
>>>>>>     
>>>>>> +    header = &pdev->vpci->header;
>>>>>> +    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>> +    {
>>>>>> +        struct vpci_bar *bar = &header->bars[i];
>>>>>> +
>>>>>> +        bar->mem = rangeset_new(NULL, NULL, 0);
>>>>> I don't recall why an anonymous range set was chosen back at the time
>>>>> when vPCI was first implemented, but I think this needs to be changed
>>>>> now that DomU-s get supported. Whether you do so right here or in a
>>>>> prereq patch is secondary to me. It may be desirable to exclude them
>>>>> from rangeset_domain_printk() (which would likely require a new
>>>>> RANGESETF_* flag), but I think such resources should be associated
>>>>> with their domains.
>>>> What would be the proper name for such a range set then?
>>>> "vpci_bar"?
>>> E.g. bb:dd.f:BARn
>> Hm, indeed.
>> I can only see a single flag, RANGESETF_prettyprint_hex, which tells
>> *how* to print, but I can't see any way to limit *what* gets printed.
>> So, do you mean you want some logic to be implemented in
>> rangeset_domain_printk so it knows that an entry needs to be skipped
>> while printing? RANGESETF_skip_print?
> Yes, albeit I'd call the flag e.g. RANGESETF_no_print.
Then I see two patches here: one which introduces a generic RANGESETF_no_print
flag and a second one converting the anonymous range sets used by vPCI.
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 00/11] PCI devices passthrough on Arm, part 3
  2021-11-22  8:22     ` Jan Beulich
@ 2021-11-22  8:34       ` Oleksandr Andrushchenko
  2021-11-22  8:44         ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-22  8:34 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 22.11.21 10:22, Jan Beulich wrote:
> On 19.11.2021 15:23, Roger Pau Monné wrote:
>> On Fri, Nov 19, 2021 at 02:56:12PM +0100, Jan Beulich wrote:
>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> Hi, all!
>>>>
>>>> This patch series is focusing on vPCI and adds support for non-identity
>>>> PCI BAR mappings which is required while passing through a PCI device to
>>>> a guest. The highlights are:
>>>>
>>>> - Add relevant vpci register handlers when assigning PCI device to a domain
>>>>    and remove those when de-assigning. This allows having different
>>>>    handlers for different domains, e.g. hwdom and other guests.
>>>>
>>>> - Emulate guest BAR register values based on physical BAR values.
>>>>    This allows creating a guest view of the registers and emulates
>>>>    size and properties probe as it is done during PCI device enumeration by
>>>>    the guest.
>>>>
>>>> - Instead of handling a single range set, that contains all the memory
>>>>    regions of all the BARs and ROM, have them per BAR.
>>>>
>>>> - Take into account guest's BAR view and program its p2m accordingly:
>>>>    gfn is guest's view of the BAR and mfn is the physical BAR value as set
>>>>    up by the host bridge in the hardware domain.
>>>>    This way hardware doamin sees physical BAR values and guest sees
>>>>    emulated ones.
>>>>
>>>> The series also adds support for virtual PCI bus topology for guests:
>>>>   - We emulate a single host bridge for the guest, so segment is always 0.
>>>>   - The implementation is limited to 32 devices which are allowed on
>>>>     a single PCI bus.
>>>>   - The virtual bus number is set to 0, so virtual devices are seen
>>>>     as embedded endpoints behind the root complex.
>>>>
>>>> The series was also tested on:
>>>>   - x86 PVH Dom0 and doesn't break it.
>>>>   - x86 HVM with PCI passthrough to DomU and doesn't break it.
>>>>
>>>> Thank you,
>>>> Oleksandr
>>>>
>>>> Oleksandr Andrushchenko (11):
>>>>    vpci: fix function attributes for vpci_process_pending
>>>>    vpci: cancel pending map/unmap on vpci removal
>>>>    vpci: make vpci registers removal a dedicated function
>>>>    vpci: add hooks for PCI device assign/de-assign
>>>>    vpci/header: implement guest BAR register handlers
>>>>    vpci/header: handle p2m range sets per BAR
>>>>    vpci/header: program p2m with guest BAR view
>>>>    vpci/header: emulate PCI_COMMAND register for guests
>>>>    vpci/header: reset the command register when adding devices
>>>>    vpci: add initial support for virtual PCI bus topology
>>>>    xen/arm: translate virtual PCI bus topology for guests
>>> If I'm not mistaken by the end of this series a guest can access a
>>> device handed to it. I couldn't find anything dealing with the
>>> uses of vpci_{read,write}_hw() and vpci_hw_{read,write}*() to cover
>>> config registers not covered by registered handlers. IMO this should
>>> happen before patch 5: Before any handlers get registered the view a
>>> guest would have would be all ones no matter which register it
>>> accesses. Handler registration would then "punch holes" into this
>>> "curtain", as opposed to Dom0, where handler registration hides
>>> previously visible raw hardware registers.
>> FWIW, I've also raised the same concern in a different thread:
>>
>> https://lore.kernel.org/xen-devel/YYD7VmDGKJRkid4a@Air-de-Roger/
>>
>> It seems like this is future work, but unless such a model is
>> implemented vPCI cannot be used for guest passthrough.
>>
>> I'm fine with doing it in a separate series, but needs to be kept in
>> mind.
> Not just this - it also needs to be recorded in this cover letter and
> imo also in a comment in the sources somewhere. Or else the question
> will (validly) be raised again and again.
I am fine adding such a comment, but am not sure where to put it.
What would be your best bet if you were to look for this information?
I think we can put that in xen/drivers/vpci/vpci.c at the top, right
after the license in the same comment block.
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 00/11] PCI devices passthrough on Arm, part 3
  2021-11-22  8:34       ` Oleksandr Andrushchenko
@ 2021-11-22  8:44         ` Jan Beulich
  0 siblings, 0 replies; 101+ messages in thread
From: Jan Beulich @ 2021-11-22  8:44 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 22.11.2021 09:34, Oleksandr Andrushchenko wrote:
> On 22.11.21 10:22, Jan Beulich wrote:
>> On 19.11.2021 15:23, Roger Pau Monné wrote:
>>> On Fri, Nov 19, 2021 at 02:56:12PM +0100, Jan Beulich wrote:
>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>
>>>>> Hi, all!
>>>>>
>>>>> This patch series is focusing on vPCI and adds support for non-identity
>>>>> PCI BAR mappings which is required while passing through a PCI device to
>>>>> a guest. The highlights are:
>>>>>
>>>>> - Add relevant vpci register handlers when assigning PCI device to a domain
>>>>>    and remove those when de-assigning. This allows having different
>>>>>    handlers for different domains, e.g. hwdom and other guests.
>>>>>
>>>>> - Emulate guest BAR register values based on physical BAR values.
>>>>>    This allows creating a guest view of the registers and emulates
>>>>>    size and properties probe as it is done during PCI device enumeration by
>>>>>    the guest.
>>>>>
>>>>> - Instead of handling a single range set, that contains all the memory
>>>>>    regions of all the BARs and ROM, have them per BAR.
>>>>>
>>>>> - Take into account guest's BAR view and program its p2m accordingly:
>>>>>    gfn is guest's view of the BAR and mfn is the physical BAR value as set
>>>>>    up by the host bridge in the hardware domain.
>>>>>    This way hardware doamin sees physical BAR values and guest sees
>>>>>    emulated ones.
>>>>>
>>>>> The series also adds support for virtual PCI bus topology for guests:
>>>>>   - We emulate a single host bridge for the guest, so segment is always 0.
>>>>>   - The implementation is limited to 32 devices which are allowed on
>>>>>     a single PCI bus.
>>>>>   - The virtual bus number is set to 0, so virtual devices are seen
>>>>>     as embedded endpoints behind the root complex.
>>>>>
>>>>> The series was also tested on:
>>>>>   - x86 PVH Dom0 and doesn't break it.
>>>>>   - x86 HVM with PCI passthrough to DomU and doesn't break it.
>>>>>
>>>>> Thank you,
>>>>> Oleksandr
>>>>>
>>>>> Oleksandr Andrushchenko (11):
>>>>>    vpci: fix function attributes for vpci_process_pending
>>>>>    vpci: cancel pending map/unmap on vpci removal
>>>>>    vpci: make vpci registers removal a dedicated function
>>>>>    vpci: add hooks for PCI device assign/de-assign
>>>>>    vpci/header: implement guest BAR register handlers
>>>>>    vpci/header: handle p2m range sets per BAR
>>>>>    vpci/header: program p2m with guest BAR view
>>>>>    vpci/header: emulate PCI_COMMAND register for guests
>>>>>    vpci/header: reset the command register when adding devices
>>>>>    vpci: add initial support for virtual PCI bus topology
>>>>>    xen/arm: translate virtual PCI bus topology for guests
>>>> If I'm not mistaken by the end of this series a guest can access a
>>>> device handed to it. I couldn't find anything dealing with the
>>>> uses of vpci_{read,write}_hw() and vpci_hw_{read,write}*() to cover
>>>> config registers not covered by registered handlers. IMO this should
>>>> happen before patch 5: Before any handlers get registered the view a
>>>> guest would have would be all ones no matter which register it
>>>> accesses. Handler registration would then "punch holes" into this
>>>> "curtain", as opposed to Dom0, where handler registration hides
>>>> previously visible raw hardware registers.
>>> FWIW, I've also raised the same concern in a different thread:
>>>
>>> https://lore.kernel.org/xen-devel/YYD7VmDGKJRkid4a@Air-de-Roger/
>>>
>>> It seems like this is future work, but unless such a model is
>>> implemented vPCI cannot be used for guest passthrough.
>>>
>>> I'm fine with doing it in a separate series, but needs to be kept in
>>> mind.
>> Not just this - it also needs to be recorded in this cover letter and
>> imo also in a comment in the sources somewhere. Or else the question
>> will (validly) be raised again and again.
> I am fine adding such a comment, but am not sure where to put it.
> What would be your best bet if you were to look for this information?
> I think we can put that in xen/drivers/vpci/vpci.c at the top, right
> after the license in the same comment block.

I would put this e.g. next to the first call to vpci_read_hw() from
vpci_read(), making the wording general enough to express that this
applies to all such calls, including the write counterpart ones.
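
I.e. something like (sketch; exact wording is up to you):

/*
 * TODO: accesses to config space not covered by registered handlers
 * currently hit raw hardware, here and at the other vpci_read_hw() /
 * vpci_write_hw() call sites. The intended model for DomU-s, i.e.
 * "all ones" by default with handlers punching holes into that view,
 * is still to be implemented.
 */
return vpci_read_hw(sbdf, reg, size);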

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-19 13:34                                             ` Oleksandr Andrushchenko
@ 2021-11-22 14:21                                               ` Oleksandr Andrushchenko
  2021-11-22 14:37                                                 ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-22 14:21 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné, julien
  Cc: Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel, Stefano Stabellini,
	Oleksandr Andrushchenko



On 19.11.21 15:34, Oleksandr Andrushchenko wrote:
>
> On 19.11.21 15:25, Jan Beulich wrote:
>> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote:
>>> On 19.11.21 15:00, Jan Beulich wrote:
>>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>>>>> Possible locking and other work needed:
>>>>> =======================================
>>>>>
>>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>>>>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>>>>
>>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>>>>> index 28146ee404e6..ebf071893b21 100644
>>>>> --- a/xen/include/xen/sched.h
>>>>> +++ b/xen/include/xen/sched.h
>>>>> @@ -444,6 +444,7 @@ struct domain
>>>>>
>>>>>      #ifdef CONFIG_HAS_PCI
>>>>>          struct list_head pdev_list;
>>>>> +    rwlock_t vpci_rwlock;
>>>>> +    bool vpci_terminating; <- atomic?
>>>>>      #endif
>>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>>>>> vpci_mmio_{read|write} are readers (hot path).
>>>> Right - you need such a lock for other purposes anyway, as per the
>>>> discussion with Julien.
>>> What about bool vpci_terminating? Do you see it as an atomic type or just bool?
>> Having seen only ...
>>
>>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>>>>> to be implemented, so when re-start removal if need be:
>>>>>
>>>>> vpci_remove_device()
>>>>> {
>>>>>       d->vpci_terminating = true;
>> ... this use so far, I can't tell yet. But at a first glance a boolean
>> looks to be what you need.
>>
>>>>>       remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>>>>       if ( !write_trylock(d->vpci_rwlock) )
>>>>>         return -ERESTART;
>>>>>       xfree(pdev->vpci);
>>>>>       pdev->vpci = NULL;
>>>>> }
>>>>>
>>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>>>>> other operations which may require it, e.g. virtual bus topology can
>>>>> use it when assigning vSBDF etc.
>>>>>
>>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending
>>>>> and do nothing for Dom0 and crash DomU otherwise:
>>>> Why is this? I'm not outright opposed, but I don't immediately see why
>>>> trying to remove the problematic device wouldn't be a reasonable course
>>>> of action anymore. vpci_remove_device() may need to become more careful
>>>> as to not crashing,
>>> vpci_remove_device does not crash, vpci_process_pending does
>>>>     though.
>>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
>>> and we call vpci_remove_device. vpci_remove_device tries to acquire the
>>> lock and can't, simply because some other vpci code is running on another vCPU.
>>> Then what do we do here? We are in SoftIRQ context now and we can't spin
>>> trying to acquire d->vpci_rwlock forever. Nor can we blindly free the vpci
>>> structure, because it is seen by all vCPUs and may crash them.
>>>
>>> If vpci_remove_device is in hypercall context it just returns -ERESTART and
>>> hypercall continuation helps here. But not in SoftIRQ context.
>> Maybe then you want to invoke this cleanup from RCU context (whether
>> vpci_remove_device() itself or a suitable clone thereof is TBD)? (I
>> will admit though that I didn't check whether that would satisfy all
>> constraints.)
>>
>> Then again it also hasn't become clear to me why you use write_trylock()
>> there. The lock contention you describe doesn't, on the surface, look
>> any different from situations elsewhere.
> I use write_trylock in vpci_remove_device because if we can't
> acquire the lock then we defer device removal. This would work
> well if called from a hypercall which will employ hypercall continuation.
> But SoftIRQ getting -ERESTART is something that we probably can't
> handle by restarting as a hypercall can, thus I only see that vpci_process_pending
> will need to spin and wait until vpci_remove_device succeeds.
Does anybody have any better solution for preventing SoftIRQ from
spinning on vpci_remove_device and -ERESTART?
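
For reference, the hypercall path can handle the deferral via a
continuation, roughly like the sketch below (the exact re-encoding of
the arguments is my assumption, not existing code):

case PHYSDEVOP_pci_device_remove:
    ...
    ret = pci_remove_device(dev.seg, dev.bus, dev.devfn);
    if ( ret == -ERESTART )
        ret = hypercall_create_continuation(__HYPERVISOR_physdev_op,
                                            "ih", cmd, arg);
    break;

SoftIRQ context has no such re-entry point, hence the question.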
>> Jan
>>
> Thank you,
> Oleksandr
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-22 14:21                                               ` Oleksandr Andrushchenko
@ 2021-11-22 14:37                                                 ` Jan Beulich
  2021-11-22 14:45                                                   ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-22 14:37 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel, Stefano Stabellini, Roger Pau Monné,
	julien

On 22.11.2021 15:21, Oleksandr Andrushchenko wrote:
> On 19.11.21 15:34, Oleksandr Andrushchenko wrote:
>> On 19.11.21 15:25, Jan Beulich wrote:
>>> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote:
>>>> On 19.11.21 15:00, Jan Beulich wrote:
>>>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>>>>>> Possible locking and other work needed:
>>>>>> =======================================
>>>>>>
>>>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>>>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>>>>>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>>>>>
>>>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>>>>>> index 28146ee404e6..ebf071893b21 100644
>>>>>> --- a/xen/include/xen/sched.h
>>>>>> +++ b/xen/include/xen/sched.h
>>>>>> @@ -444,6 +444,7 @@ struct domain
>>>>>>
>>>>>>      #ifdef CONFIG_HAS_PCI
>>>>>>          struct list_head pdev_list;
>>>>>> +    rwlock_t vpci_rwlock;
>>>>>> +    bool vpci_terminating; <- atomic?
>>>>>>      #endif
>>>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>>>>>> vpci_mmio_{read|write} are readers (hot path).
>>>>> Right - you need such a lock for other purposes anyway, as per the
>>>>> discussion with Julien.
>>>> What about bool vpci_terminating? Do you see it as an atomic type or just bool?
>>> Having seen only ...
>>>
>>>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>>>>>> to be implemented, so we can re-start removal if need be:
>>>>>>
>>>>>> vpci_remove_device()
>>>>>> {
>>>>>>       d->vpci_terminating = true;
>>> ... this use so far, I can't tell yet. But at a first glance a boolean
>>> looks to be what you need.
>>>
>>>>>>       remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>>>>>       if ( !write_trylock(d->vpci_rwlock) )
>>>>>>         return -ERESTART;
>>>>>>       xfree(pdev->vpci);
>>>>>>       pdev->vpci = NULL;
>>>>>> }
>>>>>>
>>>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>>>>>> other operations which may require it, e.g. virtual bus topology can
>>>>>> use it when assigning vSBDF etc.
>>>>>>
>>>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending
>>>>>> and do nothing for Dom0 and crash DomU otherwise:
>>>>> Why is this? I'm not outright opposed, but I don't immediately see why
>>>>> trying to remove the problematic device wouldn't be a reasonable course
>>>>> of action anymore. vpci_remove_device() may need to become more careful
>>>>> so as not to crash,
>>>> vpci_remove_device does not crash, vpci_process_pending does
>>>>>     though.
>>>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
>>>> and we call vpci_remove_device. vpci_remove_device tries to acquire the
>>>> lock and it can't, simply because some other vpci code is running on another vCPU.
>>>> Then what do we do here? We are in SoftIRQ context now and we can't spin
>>>> trying to acquire d->vpci_rwlock forever. Nor can we blindly free the vpci
>>>> structure because it is seen by all vCPUs and may crash them.
>>>>
>>>> If vpci_remove_device is in hypercall context it just returns -ERESTART and
>>>> hypercall continuation helps here. But not in SoftIRQ context.
>>> Maybe then you want to invoke this cleanup from RCU context (whether
>>> vpci_remove_device() itself or a suitable clone thereof is TBD)? (I
>>> will admit though that I didn't check whether that would satisfy all
>>> constraints.)
>>>
>>> Then again it also hasn't become clear to me why you use write_trylock()
>>> there. The lock contention you describe doesn't, on the surface, look
>>> any different from situations elsewhere.
>> I use write_trylock in vpci_remove_device because if we can't
>> acquire the lock then we defer device removal. This would work
>> well if called from a hypercall which will employ hypercall continuation.
>> But SoftIRQ getting -ERESTART is something that we probably can't
>> handle by restarting as a hypercall can, thus I only see that vpci_process_pending
>> will need to spin and wait until vpci_remove_device succeeds.
> Does anybody have any better solution for preventing SoftIRQ from
> spinning on vpci_remove_device and -ERESTART?

Well, at this point I can suggest only a marginal improvement: Instead of
spinning inside the softirq handler, you want to re-raise the softirq and
exit the handler. That way at least higher "priority" softirqs won't be
starved.
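
Roughly such a shape (a sketch only - VPCI_SOFTIRQ and the handler name
are made up, and the lock is the per-domain one proposed earlier):

static void vpci_pending_softirq(void)
{
    struct domain *d = current->domain;

    if ( !read_trylock(&d->vpci_rwlock) )
    {
        /* Don't spin: let other softirqs run and try again later. */
        raise_softirq(VPCI_SOFTIRQ);
        return;
    }
    /* ... finish the deferred map/unmap work ... */
    read_unlock(&d->vpci_rwlock);
}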

Beyond that - maybe the guest (or just a vcpu of it) needs pausing in such
an event, with the work deferred to a tasklet?

Yet I don't think my earlier question regarding the use of write_trylock()
was really answered. What you said in reply doesn't explain (to me at
least) why write_lock() is not an option.

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-22 14:37                                                 ` Jan Beulich
@ 2021-11-22 14:45                                                   ` Oleksandr Andrushchenko
  2021-11-22 14:57                                                     ` Jan Beulich
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-22 14:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel, Stefano Stabellini, Roger Pau Monné,
	julien, Oleksandr Andrushchenko



On 22.11.21 16:37, Jan Beulich wrote:
> On 22.11.2021 15:21, Oleksandr Andrushchenko wrote:
>> On 19.11.21 15:34, Oleksandr Andrushchenko wrote:
>>> On 19.11.21 15:25, Jan Beulich wrote:
>>>> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote:
>>>>> On 19.11.21 15:00, Jan Beulich wrote:
>>>>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>>>>>>> Possible locking and other work needed:
>>>>>>> =======================================
>>>>>>>
>>>>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>>>>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>>>>>>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>>>>>>
>>>>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>>>>>>> index 28146ee404e6..ebf071893b21 100644
>>>>>>> --- a/xen/include/xen/sched.h
>>>>>>> +++ b/xen/include/xen/sched.h
>>>>>>> @@ -444,6 +444,7 @@ struct domain
>>>>>>>
>>>>>>>       #ifdef CONFIG_HAS_PCI
>>>>>>>           struct list_head pdev_list;
>>>>>>> +    rwlock_t vpci_rwlock;
>>>>>>> +    bool vpci_terminating; <- atomic?
>>>>>>>       #endif
>>>>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>>>>>>> vpci_mmio_{read|write} are readers (hot path).
>>>>>> Right - you need such a lock for other purposes anyway, as per the
>>>>>> discussion with Julien.
>>>>> What about bool vpci_terminating? Do you see it as an atomic type or just bool?
>>>> Having seen only ...
>>>>
>>>>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>>>>>>> to be implemented, so we can re-start removal if need be:
>>>>>>>
>>>>>>> vpci_remove_device()
>>>>>>> {
>>>>>>>        d->vpci_terminating = true;
>>>> ... this use so far, I can't tell yet. But at a first glance a boolean
>>>> looks to be what you need.
>>>>
>>>>>>>        remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>>>>>>        if ( !write_trylock(d->vpci_rwlock) )
>>>>>>>          return -ERESTART;
>>>>>>>        xfree(pdev->vpci);
>>>>>>>        pdev->vpci = NULL;
>>>>>>> }
>>>>>>>
>>>>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>>>>>>> other operations which may require it, e.g. virtual bus topology can
>>>>>>> use it when assigning vSBDF etc.
>>>>>>>
>>>>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending
>>>>>>> and do nothing for Dom0 and crash DomU otherwise:
>>>>>> Why is this? I'm not outright opposed, but I don't immediately see why
>>>>>> trying to remove the problematic device wouldn't be a reasonable course
>>>>>> of action anymore. vpci_remove_device() may need to become more careful
>>>>>> so as not to crash,
>>>>> vpci_remove_device does not crash, vpci_process_pending does
>>>>>>      though.
>>>>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
>>>>> and we call vpci_remove_device. vpci_remove_device tries to acquire the
>>>>> lock and it can't, simply because some other vpci code is running on another vCPU.
>>>>> Then what do we do here? We are in SoftIRQ context now and we can't spin
>>>>> trying to acquire d->vpci_rwlock forever. Nor can we blindly free the vpci
>>>>> structure because it is seen by all vCPUs and may crash them.
>>>>>
>>>>> If vpci_remove_device is in hypercall context it just returns -ERESTART and
>>>>> hypercall continuation helps here. But not in SoftIRQ context.
>>>> Maybe then you want to invoke this cleanup from RCU context (whether
>>>> vpci_remove_device() itself or a suitable clone thereof is TBD)? (I
>>>> will admit though that I didn't check whether that would satisfy all
>>>> constraints.)
>>>>
>>>> Then again it also hasn't become clear to me why you use write_trylock()
>>>> there. The lock contention you describe doesn't, on the surface, look
>>>> any different from situations elsewhere.
>>> I use write_trylock in vpci_remove_device because if we can't
>>> acquire the lock then we defer device removal. This would work
>>> well if called from a hypercall which will employ hypercall continuation.
>>> But SoftIRQ getting -ERESTART is something that we probably can't
>>> handle by restarting as a hypercall can, thus I only see that vpci_process_pending
>>> will need to spin and wait until vpci_remove_device succeeds.
>> Does anybody have any better solution for preventing SoftIRQ from
>> spinning on vpci_remove_device and -ERESTART?
> Well, at this point I can suggest only a marginal improvement: Instead of
> spinning inside the softirq handler, you want to re-raise the softirq and
> exit the handler. That way at least higher "priority" softirqs won't be
> starved.
>
> Beyond that - maybe the guest (or just a vcpu of it) needs pausing in such
> an event, with the work deferred to a tasklet?
>
> Yet I don't think my earlier question regarding the use of write_trylock()
> was really answered. What you said in reply doesn't explain (to me at
> least) why write_lock() is not an option.
I was thinking that we do not want to freeze in case we are calling vpci_remove_device
from SoftIRQ context, thus we try to lock and if we can't we return -ERESTART
indicating that the removal needs to be deferred. If we use write_lock, then
SoftIRQ -> write_lock will spin there waiting for readers to release the lock.

write_lock actually makes things a lot easier, but I just don't know if it
is ok to use it. If so, then vpci_remove_device becomes synchronous and
there is no need for hypercall continuation and other heavy machinery for
re-scheduling SoftIRQ...
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-22 14:45                                                   ` Oleksandr Andrushchenko
@ 2021-11-22 14:57                                                     ` Jan Beulich
  2021-11-22 15:02                                                       ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Beulich @ 2021-11-22 14:57 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel, Stefano Stabellini, Roger Pau Monné,
	julien

On 22.11.2021 15:45, Oleksandr Andrushchenko wrote:
> 
> 
> On 22.11.21 16:37, Jan Beulich wrote:
>> On 22.11.2021 15:21, Oleksandr Andrushchenko wrote:
>>> On 19.11.21 15:34, Oleksandr Andrushchenko wrote:
>>>> On 19.11.21 15:25, Jan Beulich wrote:
>>>>> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote:
>>>>>> On 19.11.21 15:00, Jan Beulich wrote:
>>>>>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>>>>>>>> Possible locking and other work needed:
>>>>>>>> =======================================
>>>>>>>>
>>>>>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>>>>>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>>>>>>>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>>>>>>>
>>>>>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>>>>>>>> index 28146ee404e6..ebf071893b21 100644
>>>>>>>> --- a/xen/include/xen/sched.h
>>>>>>>> +++ b/xen/include/xen/sched.h
>>>>>>>> @@ -444,6 +444,7 @@ struct domain
>>>>>>>>
>>>>>>>>       #ifdef CONFIG_HAS_PCI
>>>>>>>>           struct list_head pdev_list;
>>>>>>>> +    rwlock_t vpci_rwlock;
>>>>>>>> +    bool vpci_terminating; <- atomic?
>>>>>>>>       #endif
>>>>>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>>>>>>>> vpci_mmio_{read|write} are readers (hot path).
>>>>>>> Right - you need such a lock for other purposes anyway, as per the
>>>>>>> discussion with Julien.
>>>>>> What about bool vpci_terminating? Do you see it as an atomic type or just bool?
>>>>> Having seen only ...
>>>>>
>>>>>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>>>>>>>> to be implemented, so we can re-start removal if need be:
>>>>>>>>
>>>>>>>> vpci_remove_device()
>>>>>>>> {
>>>>>>>>        d->vpci_terminating = true;
>>>>> ... this use so far, I can't tell yet. But at a first glance a boolean
>>>>> looks to be what you need.
>>>>>
>>>>>>>>        remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>>>>>>>        if ( !write_trylock(d->vpci_rwlock) )
>>>>>>>>          return -ERESTART;
>>>>>>>>        xfree(pdev->vpci);
>>>>>>>>        pdev->vpci = NULL;
>>>>>>>> }
>>>>>>>>
>>>>>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>>>>>>>> other operations which may require it, e.g. virtual bus topology can
>>>>>>>> use it when assigning vSBDF etc.
>>>>>>>>
>>>>>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending
>>>>>>>> and do nothing for Dom0 and crash DomU otherwise:
>>>>>>> Why is this? I'm not outright opposed, but I don't immediately see why
>>>>>>> trying to remove the problematic device wouldn't be a reasonable course
>>>>>>> of action anymore. vpci_remove_device() may need to become more careful
>>>>>>> so as not to crash,
>>>>>> vpci_remove_device does not crash, vpci_process_pending does
>>>>>>>      though.
>>>>>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
>>>>>> and we call vpci_remove_device. vpci_remove_device tries to acquire the
>>>>>> lock and it can't, simply because some other vpci code is running on another vCPU.
>>>>>> Then what do we do here? We are in SoftIRQ context now and we can't spin
>>>>>> trying to acquire d->vpci_rwlock forever. Nor can we blindly free the vpci
>>>>>> structure because it is seen by all vCPUs and may crash them.
>>>>>>
>>>>>> If vpci_remove_device is in hypercall context it just returns -ERESTART and
>>>>>> hypercall continuation helps here. But not in SoftIRQ context.
>>>>> Maybe then you want to invoke this cleanup from RCU context (whether
>>>>> vpci_remove_device() itself or a suitable clone thereof is TBD)? (I
>>>>> will admit though that I didn't check whether that would satisfy all
>>>>> constraints.)
>>>>>
>>>>> Then again it also hasn't become clear to me why you use write_trylock()
>>>>> there. The lock contention you describe doesn't, on the surface, look
>>>>> any different from situations elsewhere.
>>>> I use write_trylock in vpci_remove_device because if we can't
>>>> acquire the lock then we defer device removal. This would work
>>>> well if called from a hypercall which will employ hypercall continuation.
>>>> But SoftIRQ getting -ERESTART is something that we probably can't
>>>> handle by restarting as a hypercall can, thus I only see that vpci_process_pending
>>>> will need to spin and wait until vpci_remove_device succeeds.
>>> Does anybody have any better solution for preventing SoftIRQ from
>>> spinning on vpci_remove_device and -ERESTART?
>> Well, at this point I can suggest only a marginal improvement: Instead of
>> spinning inside the softirq handler, you want to re-raise the softirq and
>> exit the handler. That way at least higher "priority" softirqs won't be
>> starved.
>>
>> Beyond that - maybe the guest (or just a vcpu of it) needs pausing in such
>> an event, with the work deferred to a tasklet?
>>
>> Yet I don't think my earlier question regarding the use of write_trylock()
>> was really answered. What you said in reply doesn't explain (to me at
>> least) why write_lock() is not an option.
> I was thinking that we do not want to freeze in case we are calling vpci_remove_device
> from SoftIRQ context, thus we try to lock and if we can't we return -ERESTART
> indicating that the removal needs to be deferred. If we use write_lock, then
> SoftIRQ -> write_lock will spin there waiting for readers to release the lock.
> 
> write_lock actually makes things a lot easier, but I just don't know if it
> is ok to use it. If so, then vpci_remove_device becomes synchronous and
> there is no need for hypercall continuation and other heavy machinery for
> re-scheduling SoftIRQ...

I'm inclined to ask: If it wasn't okay to use here, then where would it be
okay to use? Of course I realize there are cases when long spinning times
can be a problem. But I expect there aren't going to be excessively long
lock holding regions for this lock, and I also would expect average
contention to not be overly bad. But in the end you know the code that
you're writing (and which may lead to issues with the lock usage) better
than I do ...

Jan



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal
  2021-11-22 14:57                                                     ` Jan Beulich
@ 2021-11-22 15:02                                                       ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-22 15:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel, Stefano Stabellini, Roger Pau Monné,
	julien, Oleksandr Andrushchenko



On 22.11.21 16:57, Jan Beulich wrote:
> On 22.11.2021 15:45, Oleksandr Andrushchenko wrote:
>>
>> On 22.11.21 16:37, Jan Beulich wrote:
>>> On 22.11.2021 15:21, Oleksandr Andrushchenko wrote:
>>>> On 19.11.21 15:34, Oleksandr Andrushchenko wrote:
>>>>> On 19.11.21 15:25, Jan Beulich wrote:
>>>>>> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote:
>>>>>>> On 19.11.21 15:00, Jan Beulich wrote:
>>>>>>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>>>>>>>>> Possible locking and other work needed:
>>>>>>>>> =======================================
>>>>>>>>>
>>>>>>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>>>>>>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>>>>>>>>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>>>>>>>>
>>>>>>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>>>>>>>>> index 28146ee404e6..ebf071893b21 100644
>>>>>>>>> --- a/xen/include/xen/sched.h
>>>>>>>>> +++ b/xen/include/xen/sched.h
>>>>>>>>> @@ -444,6 +444,7 @@ struct domain
>>>>>>>>>
>>>>>>>>>        #ifdef CONFIG_HAS_PCI
>>>>>>>>>            struct list_head pdev_list;
>>>>>>>>> +    rwlock_t vpci_rwlock;
>>>>>>>>> +    bool vpci_terminating; <- atomic?
>>>>>>>>>        #endif
>>>>>>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>>>>>>>>> vpci_mmio_{read|write} are readers (hot path).
>>>>>>>> Right - you need such a lock for other purposes anyway, as per the
>>>>>>>> discussion with Julien.
>>>>>>> What about bool vpci_terminating? Do you see it as an atomic type or just bool?
>>>>>> Having seen only ...
>>>>>>
>>>>>>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>>>>>>>>> to be implemented, so we can re-start removal if need be:
>>>>>>>>>
>>>>>>>>> vpci_remove_device()
>>>>>>>>> {
>>>>>>>>>         d->vpci_terminating = true;
>>>>>> ... this use so far, I can't tell yet. But at a first glance a boolean
>>>>>> looks to be what you need.
>>>>>>
>>>>>>>>>         remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>>>>>>>>         if ( !write_trylock(d->vpci_rwlock) )
>>>>>>>>>           return -ERESTART;
>>>>>>>>>         xfree(pdev->vpci);
>>>>>>>>>         pdev->vpci = NULL;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>>>>>>>>> other operations which may require it, e.g. virtual bus topology can
>>>>>>>>> use it when assigning vSBDF etc.
>>>>>>>>>
>>>>>>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending
>>>>>>>>> and do nothing for Dom0 and crash DomU otherwise:
>>>>>>>> Why is this? I'm not outright opposed, but I don't immediately see why
>>>>>>>> trying to remove the problematic device wouldn't be a reasonable course
>>>>>>>> of action anymore. vpci_remove_device() may need to become more careful
>>>>>>>> so as not to crash,
>>>>>>> vpci_remove_device does not crash, vpci_process_pending does
>>>>>>>>       though.
>>>>>>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
>>>>>>> and we call vpci_remove_device. vpci_remove_device tries to acquire the
>>>>>>> lock and it can't, simply because some other vpci code is running on another vCPU.
>>>>>>> Then what do we do here? We are in SoftIRQ context now and we can't spin
>>>>>>> trying to acquire d->vpci_rwlock forever. Nor can we blindly free the vpci
>>>>>>> structure because it is seen by all vCPUs and may crash them.
>>>>>>>
>>>>>>> If vpci_remove_device is in hypercall context it just returns -ERESTART and
>>>>>>> hypercall continuation helps here. But not in SoftIRQ context.
>>>>>> Maybe then you want to invoke this cleanup from RCU context (whether
>>>>>> vpci_remove_device() itself or a suitable clone thereof is TBD)? (I
>>>>>> will admit though that I didn't check whether that would satisfy all
>>>>>> constraints.)
>>>>>>
>>>>>> Then again it also hasn't become clear to me why you use write_trylock()
>>>>>> there. The lock contention you describe doesn't, on the surface, look
>>>>>> any different from situations elsewhere.
>>>>> I use write_trylock in vpci_remove_device because if we can't
>>>>> acquire the lock then we defer device removal. This would work
>>>>> well if called from a hypercall which will employ hypercall continuation.
>>>>> But SoftIRQ getting -ERESTART is something that we probably can't
>>>>> handle by restarting as a hypercall can, thus I only see that vpci_process_pending
>>>>> will need to spin and wait until vpci_remove_device succeeds.
>>>> Does anybody have any better solution for preventing SoftIRQ from
>>>> spinning on vpci_remove_device and -ERESTART?
>>> Well, at this point I can suggest only a marginal improvement: Instead of
>>> spinning inside the softirq handler, you want to re-raise the softirq and
>>> exit the handler. That way at least higher "priority" softirqs won't be
>>> starved.
>>>
>>> Beyond that - maybe the guest (or just a vcpu of it) needs pausing in such
>>> an event, with the work deferred to a tasklet?
>>>
>>> Yet I don't think my earlier question regarding the use of write_trylock()
>>> was really answered. What you said in reply doesn't explain (to me at
>>> least) why write_lock() is not an option.
>> I was thinking that we do not want to freeze in case we are calling vpci_remove_device
>> from SoftIRQ context, thus we try to lock and if we can't we return -ERESTART
>> indicating that the removal needs to be deferred. If we use write_lock, then
>> SoftIRQ -> write_lock will spin there waiting for readers to release the lock.
>>
>> write_lock actually makes things a lot easier, but I just don't know if it
>> is ok to use it. If so, then vpci_remove_device becomes synchronous and
>> there is no need for hypercall continuation and other heavy machinery for
>> re-scheduling SoftIRQ...
> I'm inclined to ask: If it wasn't okay to use here, then where would it be
> okay to use? Of course I realize there are cases when long spinning times
> can be a problem.
I can't prove it, but I have a feeling that write_lock could be less
"harmful" in hypercall context than in SoftIRQ context
>   But I expect there aren't going to be excessively long
> lock holding regions for this lock, and I also would expect average
> contention to not be overly bad.
Yes, this is my impression as well
>   But in the end you know the code that
> you're writing (and which may lead to issues with the lock usage) better
> than I do ...
I am pretty much ok with write_lock as it does make things way easier.
So I'll go with write_lock then, and if we spot heavy contention we
can improve that
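
i.e. the removal path simply becomes synchronous, along the lines of
this sketch (reusing the names from the pseudo-code above):

vpci_remove_device()
{
    write_lock(&d->vpci_rwlock);
    /* remove vPCI register handlers */
    xfree(pdev->vpci);
    pdev->vpci = NULL;
    write_unlock(&d->vpci_rwlock);
}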
>
> Jan
>
Thank you!!
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 05/11] vpci/header: implement guest BAR register handlers
  2021-11-19 13:02               ` Jan Beulich
  2021-11-19 13:17                 ` Oleksandr Andrushchenko
@ 2021-11-23 15:14                 ` Oleksandr Andrushchenko
  2021-11-24 12:32                   ` Roger Pau Monné
  1 sibling, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-23 15:14 UTC (permalink / raw)
  To: roger.pau
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Jan Beulich

Hi, Roger!

On 19.11.21 15:02, Jan Beulich wrote:
> On 19.11.2021 13:54, Oleksandr Andrushchenko wrote:
>> On 19.11.21 14:49, Jan Beulich wrote:
>>> On 19.11.2021 13:46, Oleksandr Andrushchenko wrote:
>>>> On 19.11.21 14:37, Jan Beulich wrote:
>>>>> On 19.11.2021 13:10, Oleksandr Andrushchenko wrote:
>>>>>> On 19.11.21 13:58, Jan Beulich wrote:
>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>>>> @@ -408,6 +408,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>          pci_conf_write32(pdev->sbdf, reg, val);
>>>>>>>>      }
>>>>>>>>      
>>>>>>>> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>> +                            uint32_t val, void *data)
>>>>>>>> +{
>>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>>> +    bool hi = false;
>>>>>>>> +
>>>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>>>>> +    {
>>>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>>>>> +        bar--;
>>>>>>>> +        hi = true;
>>>>>>>> +    }
>>>>>>>> +    else
>>>>>>>> +    {
>>>>>>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>>>>>> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>>>>>>>> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
>>>>>>>> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>>>>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>>>>>>> +
>>>>>>>> +    bar->guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>> +                               void *data)
>>>>>>>> +{
>>>>>>>> +    const struct vpci_bar *bar = data;
>>>>>>>> +    bool hi = false;
>>>>>>>> +
>>>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>>>>> +    {
>>>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>>>>> +        bar--;
>>>>>>>> +        hi = true;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return bar->guest_addr >> (hi ? 32 : 0);
>>>>>>> I'm afraid "guest_addr" then isn't the best name; maybe "guest_val"?
>>>>>>> This would make it more obvious that there is a meaningful difference
>>>>>>> from "addr" besides the guest vs host aspect.
>>>>>> I am not sure I can agree here:
>>>>>> bar->addr and bar->guest_addr make it clear what these are, while
>>>>>> bar->addr and bar->guest_val would make someone go look for
>>>>>> additional information about what that val is for.
>>>>> Feel free to replace "val" with something more suitable. "guest_bar"
>>>>> maybe? The value definitely is not an address, so "addr" seems
>>>>> inappropriate / misleading to me.
>>>> This is a guest's view on the BAR's address. So to me it is still guest_addr
>>> It's a guest's view on the BAR, not just the address. Or else you couldn't
>>> simply return the value here without folding in the correct low bits.
>> I agree in this respect, as it is indeed address + lower bits.
>> How about guest_bar_val then? So it reflects its nature, e.g. the value
>> of the BAR as seen by the guest.
> Gets a little longish for my taste. I for one wouldn't mind it be just
> "guest". In the end Roger has the final say here anyway.
What is your preference on naming here?
1. guest_addr
2. guest_val
3. guest_bar_val
4. guest
>
> Jan
>
Thank you in advance,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 10/11] vpci: add initial support for virtual PCI bus topology
  2021-11-18 16:45   ` Jan Beulich
@ 2021-11-24 11:28     ` Oleksandr Andrushchenko
  2021-11-24 12:36       ` Roger Pau Monné
  0 siblings, 1 reply; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-24 11:28 UTC (permalink / raw)
  To: Jan Beulich, roger.pau
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko

Hi, Jan!

On 18.11.21 18:45, Jan Beulich wrote:
> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>> Since v3:
>>   - make use of VPCI_INIT
>>   - moved all new code to vpci.c which belongs to it
>>   - changed open-coded 31 to PCI_SLOT(~0)
>>   - revisited locking: add dedicated vdev list's lock
> What is this about? I can't spot any locking in the patch. In particular ...
I will update
>
>> @@ -125,6 +128,54 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>   }
>>   
>>   #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +int vpci_add_virtual_device(struct pci_dev *pdev)
>> +{
>> +    struct domain *d = pdev->domain;
>> +    pci_sbdf_t sbdf;
>> +    unsigned long new_dev_number;
>> +
>> +    /*
>> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
>> +     * there are multi-function ones which are not yet supported.
>> +     */
>> +    if ( pdev->info.is_extfn )
>> +    {
>> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
>> +                 &pdev->sbdf);
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    new_dev_number = find_first_zero_bit(&d->vpci_dev_assigned_map,
>> +                                         PCI_SLOT(~0) + 1);
>> +    if ( new_dev_number > PCI_SLOT(~0) )
>> +        return -ENOSPC;
>> +
>> +    set_bit(new_dev_number, &d->vpci_dev_assigned_map);
> ... I wonder whether this isn't racy without any locking around it,
Locking is going to be implemented by moving vpci->lock to the
outside, so this code will be protected
> and without looping over test_and_set_bit(). Whereas with locking I
> think you could just use __set_bit().
Although __set_bit == set_bit on Arm, I see there is a difference on x86.
I will use __set_bit
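
FWIW, without a lock the allocation would have to loop until the bit is
actually won, e.g. (sketch):

do {
    new_dev_number = find_first_zero_bit(&d->vpci_dev_assigned_map,
                                         PCI_SLOT(~0) + 1);
    if ( new_dev_number > PCI_SLOT(~0) )
        return -ENOSPC;
} while ( test_and_set_bit(new_dev_number, &d->vpci_dev_assigned_map) );

but with the per-domain lock held around the allocation, plain
__set_bit() is enough.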
>
>> +    /*
>> +     * Both segment and bus number are 0:
>> +     *  - we emulate a single host bridge for the guest, e.g. segment 0
>> +     *  - with bus 0 the virtual devices are seen as embedded
>> +     *    endpoints behind the root complex
>> +     *
>> +     * TODO: add support for multi-function devices.
>> +     */
>> +    sbdf.sbdf = 0;
> I think this would be better expressed as an initializer,
Ok,
-    pci_sbdf_t sbdf;
+    pci_sbdf_t sbdf = { 0 };

>   making it
> clear to the reader that the whole object gets initialized without
> them needing to go check the type (and find that .sbdf covers the
> entire object).
>
>> --- a/xen/include/xen/vpci.h
>> +++ b/xen/include/xen/vpci.h
>> @@ -145,6 +145,10 @@ struct vpci {
>>               struct vpci_arch_msix_entry arch;
>>           } entries[];
>>       } *msix;
>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +    /* Virtual SBDF of the device. */
>> +    pci_sbdf_t guest_sbdf;
> Would vsbdf perhaps be better in line with things like vpci or vcpu
> (as well as with the comment here)?
This is the same as guest_addr...
@Roger what is your preference here?
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 11/11] xen/arm: translate virtual PCI bus topology for guests
  2021-11-08 15:28         ` Oleksandr Andrushchenko
@ 2021-11-24 11:31           ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-24 11:31 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel



On 08.11.21 17:28, Oleksandr Andrushchenko wrote:
>
> On 08.11.21 16:23, Roger Pau Monné wrote:
>> On Mon, Nov 08, 2021 at 11:16:42AM +0000, Oleksandr Andrushchenko wrote:
>>> On 08.11.21 13:10, Jan Beulich wrote:
>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>> --- a/xen/arch/arm/vpci.c
>>>>> +++ b/xen/arch/arm/vpci.c
>>>>> @@ -41,6 +41,15 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
>>>>>         /* data is needed to prevent a pointer cast on 32bit */
>>>>>         unsigned long data;
>>>>>     
>>>>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>>>>> +    /*
>>>>> +     * For the passed through devices we need to map their virtual SBDF
>>>>> +     * to the physical PCI device being passed through.
>>>>> +     */
>>>>> +    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
>>>>> +            return 1;
>>>> Nit: Indentation.
>>> Ouch, sure
>>>>> @@ -59,6 +68,15 @@ static int vpci_mmio_write(struct vcpu *v, mmio_info_t *info,
>>>>>         struct pci_host_bridge *bridge = p;
>>>>>         pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
>>>>>     
>>>>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>>>>> +    /*
>>>>> +     * For the passed through devices we need to map their virtual SBDF
>>>>> +     * to the physical PCI device being passed through.
>>>>> +     */
>>>>> +    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
>>>>> +            return 1;
>>>> Again.
>>> Will fix
>>>>> @@ -172,10 +175,37 @@ REGISTER_VPCI_INIT(vpci_add_virtual_device, VPCI_PRIORITY_MIDDLE);
>>>>>     static void vpci_remove_virtual_device(struct domain *d,
>>>>>                                            const struct pci_dev *pdev)
>>>>>     {
>>>>> +    ASSERT(pcidevs_locked());
>>>>> +
>>>>>         clear_bit(pdev->vpci->guest_sbdf.dev, &d->vpci_dev_assigned_map);
>>>>>         pdev->vpci->guest_sbdf.sbdf = ~0;
>>>>>     }
>>>>>     
>>>>> +/*
>>>>> + * Find the physical device which is mapped to the virtual device
>>>>> + * and translate virtual SBDF to the physical one.
>>>>> + */
>>>>> +bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf)
>>>> const struct domain *d ?
>>> Will change
>>>>> +{
>>>>> +    const struct pci_dev *pdev;
>>>>> +    bool found = false;
>>>>> +
>>>>> +    pcidevs_lock();
>>>>> +    for_each_pdev( d, pdev )
>>>>> +    {
>>>>> +        if ( pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf )
>>>>> +        {
>>>>> +            /* Replace virtual SBDF with the physical one. */
>>>>> +            *sbdf = pdev->sbdf;
>>>>> +            found = true;
>>>>> +            break;
>>>>> +        }
>>>>> +    }
>>>>> +    pcidevs_unlock();
>>>> I think the description wants to at least mention that in principle
>>>> this is too coarse grained a lock, providing justification for why
>>>> it is deemed good enough nevertheless. (Personally, as expressed
>>>> before, I don't think the lock should be used here, but as long as
>>>> Roger agrees with you, you're fine.)
>>> Yes, makes sense
>> Seeing as we don't take the lock in vpci_{read,write} I'm not sure we
>> need it here either then.
> Yes, I was not feeling confident while adding locking
>> Since on Arm you will add devices to the guest at runtime (ie: while
>> there could already be PCI accesses), have you seen issues with not
>> taking the lock here?
> No, I didn't. Nor am I aware that Arm had problems
> But this could just mean that we were lucky not to step on it
>> I think the whole pcidevs locking needs to be clarified, as it's
>> currently a mess.
> Agree
>>    If you want to take it here that's fine, but overall
>> there are issues in other places that would make removing a device at
>> runtime not reliable.
> So, what's the decision? I would leave the locks where I put them,
> so at least this part won't need fixes.
As I am about to use the lock outside the vpci struct in v5, all these go away
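
So in v5 the lookup would be guarded by that per-domain lock instead,
along the lines of (sketch; vpci_rwlock is the name proposed in the
locking discussion):

bool vpci_translate_virtual_device(struct domain *d, pci_sbdf_t *sbdf)
{
    const struct pci_dev *pdev;
    bool found = false;

    read_lock(&d->vpci_rwlock);
    for_each_pdev( d, pdev )
    {
        if ( pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf )
        {
            /* Replace virtual SBDF with the physical one. */
            *sbdf = pdev->sbdf;
            found = true;
            break;
        }
    }
    read_unlock(&d->vpci_rwlock);

    return found;
}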
>> Thanks, Roger.
>>
> Thank you,
> Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 05/11] vpci/header: implement guest BAR register handlers
  2021-11-23 15:14                 ` Oleksandr Andrushchenko
@ 2021-11-24 12:32                   ` Roger Pau Monné
  2021-11-24 12:36                     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Roger Pau Monné @ 2021-11-24 12:32 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Jan Beulich

On Tue, Nov 23, 2021 at 03:14:27PM +0000, Oleksandr Andrushchenko wrote:
> Hi, Roger!
> 
> On 19.11.21 15:02, Jan Beulich wrote:
> > On 19.11.2021 13:54, Oleksandr Andrushchenko wrote:
> >> On 19.11.21 14:49, Jan Beulich wrote:
> >>> On 19.11.2021 13:46, Oleksandr Andrushchenko wrote:
> >>>> On 19.11.21 14:37, Jan Beulich wrote:
> >>>>> On 19.11.2021 13:10, Oleksandr Andrushchenko wrote:
> >>>>>> On 19.11.21 13:58, Jan Beulich wrote:
> >>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> >>>>>>>> --- a/xen/drivers/vpci/header.c
> >>>>>>>> +++ b/xen/drivers/vpci/header.c
> >>>>>>>> @@ -408,6 +408,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
> >>>>>>>>          pci_conf_write32(pdev->sbdf, reg, val);
> >>>>>>>>      }
> >>>>>>>>      
> >>>>>>>> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
> >>>>>>>> +                            uint32_t val, void *data)
> >>>>>>>> +{
> >>>>>>>> +    struct vpci_bar *bar = data;
> >>>>>>>> +    bool hi = false;
> >>>>>>>> +
> >>>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> >>>>>>>> +    {
> >>>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> >>>>>>>> +        bar--;
> >>>>>>>> +        hi = true;
> >>>>>>>> +    }
> >>>>>>>> +    else
> >>>>>>>> +    {
> >>>>>>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
> >>>>>>>> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
> >>>>>>>> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
> >>>>>>>> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
> >>>>>>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
> >>>>>>>> +
> >>>>>>>> +    bar->guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
> >>>>>>>> +                               void *data)
> >>>>>>>> +{
> >>>>>>>> +    const struct vpci_bar *bar = data;
> >>>>>>>> +    bool hi = false;
> >>>>>>>> +
> >>>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> >>>>>>>> +    {
> >>>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> >>>>>>>> +        bar--;
> >>>>>>>> +        hi = true;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    return bar->guest_addr >> (hi ? 32 : 0);
> >>>>>>> I'm afraid "guest_addr" then isn't the best name; maybe "guest_val"?
> >>>>>>> This would make it more obvious that there is a meaningful difference
> >>>>>>> from "addr" besides the guest vs host aspect.
> >>>>>> I am not sure I can agree here:
> >>>>>> bar->addr and bar->guest_addr make it clear what these are, while
> >>>>>> bar->addr and bar->guest_val would make someone go look for
> >>>>>> additional information about what that val is for.
> >>>>> Feel free to replace "val" with something more suitable. "guest_bar"
> >>>>> maybe? The value definitely is not an address, so "addr" seems
> >>>>> inappropriate / misleading to me.
> >>>> This is a guest's view on the BAR's address. So to me it is still guest_addr
> >>> It's a guest's view on the BAR, not just the address. Or else you couldn't
> >>> simply return the value here without folding in the correct low bits.
> >> I agree in this respect, as it is indeed address + lower bits.
> >> How about guest_bar_val then? So it reflects its nature, e.g. the value
> >> of the BAR as seen by the guest.
> > Gets a little longish for my taste. I for one wouldn't mind it be just
> > "guest". In the end Roger has the final say here anyway.
> What is your preference on naming here?
> 1. guest_addr
> 2. guest_val
> 3. guest_bar_val
> 4. guest

I think guest_reg would be fine?

Or alternatively you could make it a guest address by dropping the low
bits and adding them in the read handler instead of doing it in the
write handler.
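
Something like the following (a sketch, mirroring the masks the write
handler above already uses):

static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
                               void *data)
{
    const struct vpci_bar *bar = data;
    bool hi = false;
    uint32_t val;

    if ( bar->type == VPCI_BAR_MEM64_HI )
    {
        bar--;
        hi = true;
    }

    val = bar->guest_addr >> (hi ? 32 : 0);
    if ( !hi )
    {
        /* guest_addr would then hold the address only: fold the low bits in. */
        val &= PCI_BASE_ADDRESS_MEM_MASK;
        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
    }

    return val;
}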

Thanks, Roger.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 10/11] vpci: add initial support for virtual PCI bus topology
  2021-11-24 11:28     ` Oleksandr Andrushchenko
@ 2021-11-24 12:36       ` Roger Pau Monné
  2021-11-24 12:43         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 101+ messages in thread
From: Roger Pau Monné @ 2021-11-24 12:36 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel

On Wed, Nov 24, 2021 at 11:28:18AM +0000, Oleksandr Andrushchenko wrote:
> Hi, Jan!
> 
> On 18.11.21 18:45, Jan Beulich wrote:
> > On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> >> --- a/xen/include/xen/vpci.h
> >> +++ b/xen/include/xen/vpci.h
> >> @@ -145,6 +145,10 @@ struct vpci {
> >>               struct vpci_arch_msix_entry arch;
> >>           } entries[];
> >>       } *msix;
> >> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> >> +    /* Virtual SBDF of the device. */
> >> +    pci_sbdf_t guest_sbdf;
> > Would vsbdf perhaps be better in line with things like vpci or vcpu
> > (as well as with the comment here)?
> This is the same as guest_addr...
> @Roger what is your preference here?

I'm fine with using guest_ here, but the comment should be slightly
adjusted to s/Virtual/Guest/ IMO. It's already in line with other
guest_ fields added in the series anyway.

Just to confirm, such guest_sbdf is strictly to be used by
unprivileged domains, dom0 will never get such a virtual PCI bus?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 05/11] vpci/header: implement guest BAR register handlers
  2021-11-24 12:32                   ` Roger Pau Monné
@ 2021-11-24 12:36                     ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-24 12:36 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Jan Beulich,
	Oleksandr Andrushchenko



On 24.11.21 14:32, Roger Pau Monné wrote:
> On Tue, Nov 23, 2021 at 03:14:27PM +0000, Oleksandr Andrushchenko wrote:
>> Hi, Roger!
>>
>> On 19.11.21 15:02, Jan Beulich wrote:
>>> On 19.11.2021 13:54, Oleksandr Andrushchenko wrote:
>>>> On 19.11.21 14:49, Jan Beulich wrote:
>>>>> On 19.11.2021 13:46, Oleksandr Andrushchenko wrote:
>>>>>> On 19.11.21 14:37, Jan Beulich wrote:
>>>>>>> On 19.11.2021 13:10, Oleksandr Andrushchenko wrote:
>>>>>>>> On 19.11.21 13:58, Jan Beulich wrote:
>>>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>>>>>> @@ -408,6 +408,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>>>           pci_conf_write32(pdev->sbdf, reg, val);
>>>>>>>>>>       }
>>>>>>>>>>       
>>>>>>>>>> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>>> +                            uint32_t val, void *data)
>>>>>>>>>> +{
>>>>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>>>>> +    bool hi = false;
>>>>>>>>>> +
>>>>>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>>>>>>> +    {
>>>>>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>>>>>>> +        bar--;
>>>>>>>>>> +        hi = true;
>>>>>>>>>> +    }
>>>>>>>>>> +    else
>>>>>>>>>> +    {
>>>>>>>>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>>>>>>>> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>>>>>>>>>> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
>>>>>>>>>> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>>>>>>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>>>>>>>>> +
>>>>>>>>>> +    bar->guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>>> +                               void *data)
>>>>>>>>>> +{
>>>>>>>>>> +    const struct vpci_bar *bar = data;
>>>>>>>>>> +    bool hi = false;
>>>>>>>>>> +
>>>>>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>>>>>>> +    {
>>>>>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>>>>>>> +        bar--;
>>>>>>>>>> +        hi = true;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    return bar->guest_addr >> (hi ? 32 : 0);
>>>>>>>>> I'm afraid "guest_addr" then isn't the best name; maybe "guest_val"?
> >>>>>>>>> This would make it more obvious that there is a meaningful difference
>>>>>>>>> from "addr" besides the guest vs host aspect.
>>>>>>>> I am not sure I can agree here:
> >>>>>>>> bar->addr and bar->guest_addr make it clear what these are, while
>>>>>>>> bar->addr and bar->guest_val would make someone go look for
>>>>>>>> additional information about what that val is for.
>>>>>>> Feel free to replace "val" with something more suitable. "guest_bar"
>>>>>>> maybe? The value definitely is not an address, so "addr" seems
>>>>>>> inappropriate / misleading to me.
>>>>>> This is a guest's view on the BAR's address. So to me it is still guest_addr
>>>>> It's a guest's view on the BAR, not just the address. Or else you couldn't
>>>>> simply return the value here without folding in the correct low bits.
> >>>> I agree in this respect, as it is indeed address + lower bits.
>>>> How about guest_bar_val then? So it reflects its nature, e.g. the value
>>>> of the BAR as seen by the guest.
>>> Gets a little longish for my taste. I for one wouldn't mind it be just
>>> "guest". In the end Roger has the final say here anyway.
>> What is your preference on naming here?
>> 1. guest_addr
>> 2. guest_val
>> 3. guest_bar_val
>> 4. guest
> I think guest_reg would be fine?
>
> Or alternatively you could make it a guest address by dropping the low
> bits and adding them in the read handler instead of doing it in the
> write handler.
So, let it be "guest_reg" then
>
> Thanks, Roger.
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 10/11] vpci: add initial support for virtual PCI bus topology
  2021-11-24 12:36       ` Roger Pau Monné
@ 2021-11-24 12:43         ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 101+ messages in thread
From: Oleksandr Andrushchenko @ 2021-11-24 12:43 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel



On 24.11.21 14:36, Roger Pau Monné wrote:
> On Wed, Nov 24, 2021 at 11:28:18AM +0000, Oleksandr Andrushchenko wrote:
>> Hi, Jan!
>>
>> On 18.11.21 18:45, Jan Beulich wrote:
>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>> --- a/xen/include/xen/vpci.h
>>>> +++ b/xen/include/xen/vpci.h
>>>> @@ -145,6 +145,10 @@ struct vpci {
>>>>                struct vpci_arch_msix_entry arch;
>>>>            } entries[];
>>>>        } *msix;
>>>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>>>> +    /* Virtual SBDF of the device. */
>>>> +    pci_sbdf_t guest_sbdf;
>>> Would vsbdf perhaps be better in line with things like vpci or vcpu
>>> (as well as with the comment here)?
>> This is the same as guest_addr...
>> @Roger what is your preference here?
> I'm fine with using guest_ here, but the comment should be slightly
> adjusted to s/Virtual/Guest/ IMO. It's already in line with other
> guest_ fields added in the series anyway.
Ok, I will update the comment
>
> Just to confirm, such guest_sbdf is strictly to be used by
> unprivileged domains, dom0 will never get such a virtual PCI bus?
Right, for unprivileged domains only
>
> Thanks, Roger.
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 101+ messages in thread

end of thread, other threads:[~2021-11-24 12:43 UTC | newest]

Thread overview: 101+ messages
-- links below jump to the message on this page --
2021-11-05  6:56 [PATCH v4 00/11] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
2021-11-05  6:56 ` [PATCH v4 01/11] vpci: fix function attributes for vpci_process_pending Oleksandr Andrushchenko
2021-11-05  6:56 ` [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal Oleksandr Andrushchenko
2021-11-15 16:56   ` Jan Beulich
2021-11-16  7:32     ` Oleksandr Andrushchenko
2021-11-16  8:01       ` Jan Beulich
2021-11-16  8:23         ` Oleksandr Andrushchenko
2021-11-16 11:38           ` Jan Beulich
2021-11-16 13:27             ` Oleksandr Andrushchenko
2021-11-16 14:11               ` Jan Beulich
2021-11-16 13:41           ` Oleksandr Andrushchenko
2021-11-16 14:12             ` Jan Beulich
2021-11-16 14:24               ` Oleksandr Andrushchenko
2021-11-16 14:37                 ` Oleksandr Andrushchenko
2021-11-16 16:09                 ` Jan Beulich
2021-11-16 18:02                 ` Julien Grall
2021-11-18 12:57                   ` Oleksandr Andrushchenko
2021-11-17  8:28   ` Jan Beulich
2021-11-18  7:49     ` Oleksandr Andrushchenko
2021-11-18  8:36       ` Jan Beulich
2021-11-18  8:54         ` Oleksandr Andrushchenko
2021-11-18  9:15           ` Jan Beulich
2021-11-18  9:32             ` Oleksandr Andrushchenko
2021-11-18 13:25               ` Jan Beulich
2021-11-18 13:48                 ` Oleksandr Andrushchenko
2021-11-18 14:04                   ` Roger Pau Monné
2021-11-18 14:14                     ` Oleksandr Andrushchenko
2021-11-18 14:35                       ` Jan Beulich
2021-11-18 15:11                         ` Oleksandr Andrushchenko
2021-11-18 15:16                           ` Jan Beulich
2021-11-18 15:21                             ` Oleksandr Andrushchenko
2021-11-18 15:41                               ` Jan Beulich
2021-11-18 15:46                                 ` Oleksandr Andrushchenko
2021-11-18 15:53                                   ` Jan Beulich
2021-11-19 12:34                                     ` Oleksandr Andrushchenko
2021-11-19 13:00                                       ` Jan Beulich
2021-11-19 13:16                                         ` Oleksandr Andrushchenko
2021-11-19 13:25                                           ` Jan Beulich
2021-11-19 13:34                                             ` Oleksandr Andrushchenko
2021-11-22 14:21                                               ` Oleksandr Andrushchenko
2021-11-22 14:37                                                 ` Jan Beulich
2021-11-22 14:45                                                   ` Oleksandr Andrushchenko
2021-11-22 14:57                                                     ` Jan Beulich
2021-11-22 15:02                                                       ` Oleksandr Andrushchenko
2021-11-05  6:56 ` [PATCH v4 03/11] vpci: make vpci registers removal a dedicated function Oleksandr Andrushchenko
2021-11-15 16:57   ` Jan Beulich
2021-11-16  8:02     ` Oleksandr Andrushchenko
2021-11-05  6:56 ` [PATCH v4 04/11] vpci: add hooks for PCI device assign/de-assign Oleksandr Andrushchenko
2021-11-15 17:06   ` Jan Beulich
2021-11-16  9:38     ` Oleksandr Andrushchenko
2021-11-05  6:56 ` [PATCH v4 05/11] vpci/header: implement guest BAR register handlers Oleksandr Andrushchenko
2021-11-19 11:58   ` Jan Beulich
2021-11-19 12:10     ` Oleksandr Andrushchenko
2021-11-19 12:37       ` Jan Beulich
2021-11-19 12:46         ` Oleksandr Andrushchenko
2021-11-19 12:49           ` Jan Beulich
2021-11-19 12:54             ` Oleksandr Andrushchenko
2021-11-19 13:02               ` Jan Beulich
2021-11-19 13:17                 ` Oleksandr Andrushchenko
2021-11-23 15:14                 ` Oleksandr Andrushchenko
2021-11-24 12:32                   ` Roger Pau Monné
2021-11-24 12:36                     ` Oleksandr Andrushchenko
2021-11-05  6:56 ` [PATCH v4 06/11] vpci/header: handle p2m range sets per BAR Oleksandr Andrushchenko
2021-11-19 12:05   ` Jan Beulich
2021-11-19 12:13     ` Oleksandr Andrushchenko
2021-11-19 12:45       ` Jan Beulich
2021-11-19 12:50         ` Oleksandr Andrushchenko
2021-11-19 13:06           ` Jan Beulich
2021-11-19 13:19             ` Oleksandr Andrushchenko
2021-11-19 13:29               ` Jan Beulich
2021-11-19 13:38                 ` Oleksandr Andrushchenko
2021-11-19 13:16   ` Jan Beulich
2021-11-19 13:41     ` Oleksandr Andrushchenko
2021-11-19 13:57       ` Jan Beulich
2021-11-19 14:09         ` Oleksandr Andrushchenko
2021-11-22  8:24           ` Jan Beulich
2021-11-22  8:31             ` Oleksandr Andrushchenko
2021-11-05  6:56 ` [PATCH v4 07/11] vpci/header: program p2m with guest BAR view Oleksandr Andrushchenko
2021-11-19 12:33   ` Jan Beulich
2021-11-19 12:44     ` Oleksandr Andrushchenko
2021-11-05  6:56 ` [PATCH v4 08/11] vpci/header: emulate PCI_COMMAND register for guests Oleksandr Andrushchenko
2021-11-05  6:56 ` [PATCH v4 09/11] vpci/header: reset the command register when adding devices Oleksandr Andrushchenko
2021-11-05  6:56 ` [PATCH v4 10/11] vpci: add initial support for virtual PCI bus topology Oleksandr Andrushchenko
2021-11-18 16:45   ` Jan Beulich
2021-11-24 11:28     ` Oleksandr Andrushchenko
2021-11-24 12:36       ` Roger Pau Monné
2021-11-24 12:43         ` Oleksandr Andrushchenko
2021-11-05  6:56 ` [PATCH v4 11/11] xen/arm: translate virtual PCI bus topology for guests Oleksandr Andrushchenko
2021-11-08 11:10   ` Jan Beulich
2021-11-08 11:16     ` Oleksandr Andrushchenko
2021-11-08 14:23       ` Roger Pau Monné
2021-11-08 15:28         ` Oleksandr Andrushchenko
2021-11-24 11:31           ` Oleksandr Andrushchenko
2021-11-19 13:56 ` [PATCH v4 00/11] PCI devices passthrough on Arm, part 3 Jan Beulich
2021-11-19 14:06   ` Oleksandr Andrushchenko
2021-11-19 14:23   ` Roger Pau Monné
2021-11-19 14:26     ` Oleksandr Andrushchenko
2021-11-20  9:47       ` Roger Pau Monné
2021-11-22  8:22     ` Jan Beulich
2021-11-22  8:34       ` Oleksandr Andrushchenko
2021-11-22  8:44         ` Jan Beulich
