* [PATCH v6 00/13] PCI devices passthrough on Arm, part 3
@ 2022-02-04  6:34 Oleksandr Andrushchenko
  2022-02-04  6:34 ` [PATCH v6 01/13] xen/pci: arm: add stub for is_memory_hole Oleksandr Andrushchenko
                   ` (12 more replies)
  0 siblings, 13 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  6:34 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Hi, all!

1. This patch series focuses on vPCI and adds support for non-identity
PCI BAR mappings, which are required when passing through a PCI device to
a guest. The highlights are:

- Add relevant vpci register handlers when assigning a PCI device to a
  domain and remove them when de-assigning. This allows having different
  handlers for different domains, e.g. hwdom and other guests.

- Emulate guest BAR register values based on physical BAR values.
  This creates a guest view of the registers and emulates the size and
  properties probes performed by the guest during PCI device enumeration.

- Instead of handling a single range set that contains all the memory
  regions of all the BARs and ROM, have one range set per BAR.

- Take into account the guest's BAR view and program its p2m accordingly:
  gfn is the guest's view of the BAR and mfn is the physical BAR value as
  set up by the host bridge in the hardware domain.
  This way the hardware domain sees physical BAR values and the guest sees
  emulated ones.
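
The guest BAR view and the size probe mentioned in the highlights can be
illustrated with a minimal stand-alone sketch. It is not the code from this
series: `struct guest_bar`, `guest_bar_write` and `guest_bar_read` are
hypothetical names, and only a 32-bit non-prefetchable memory BAR is
modelled, under the assumption that the BAR size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 32-bit memory BAR as seen by a guest. */
struct guest_bar {
    uint32_t size;       /* BAR size in bytes, a power of two >= 16 */
    uint32_t guest_addr; /* address programmed by the guest (its own view) */
};

/*
 * Guest write: keep only the address bits above the BAR size. Writing
 * all ones therefore leaves ~(size - 1), which is what a size probe
 * expects to read back.
 */
static void guest_bar_write(struct guest_bar *bar, uint32_t val)
{
    assert(bar->size >= 16 && (bar->size & (bar->size - 1)) == 0);
    bar->guest_addr = val & ~(bar->size - 1);
}

/*
 * Guest read: return the guest view, never the physical value. The low
 * flag bits stay zero, i.e. 32-bit, non-prefetchable memory.
 */
static uint32_t guest_bar_read(const struct guest_bar *bar)
{
    return bar->guest_addr;
}
```

A guest probing the size writes 0xffffffff, reads the register back and
computes the size as the two's complement of the result; the physical BAR
value is never exposed in this model.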

2. The series also adds support for virtual PCI bus topology for guests:
 - We emulate a single host bridge for the guest, so segment is always 0.
 - The implementation is limited to 32 devices which are allowed on
   a single PCI bus.
 - The virtual bus number is set to 0, so virtual devices are seen
   as embedded endpoints behind the root complex.
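
The topology constraints above (segment 0, bus 0, at most 32 devices) can
be sketched as a simple slot allocator. This is only an illustration under
stated assumptions: `struct virt_bus`, `virt_bus_alloc_slot`, `virt_sbdf`
and `VIRT_BUS_MAX_DEV` are hypothetical names, and every virtual device is
assumed to be exposed as function 0.

```c
#include <assert.h>
#include <stdint.h>

#define VIRT_BUS_MAX_DEV 32u /* a PCI bus has a 5-bit device number */

/* Hypothetical per-domain state: one bit per occupied virtual slot. */
struct virt_bus {
    uint32_t used_slots;
};

/*
 * Pick the first free device number on virtual bus 0.
 * Returns the slot (0..31), or -1 if the bus is full.
 */
static int virt_bus_alloc_slot(struct virt_bus *bus)
{
    for ( unsigned int slot = 0; slot < VIRT_BUS_MAX_DEV; slot++ )
    {
        if ( !(bus->used_slots & (UINT32_C(1) << slot)) )
        {
            bus->used_slots |= UINT32_C(1) << slot;
            return (int)slot;
        }
    }

    return -1;
}

/* Compose the guest-visible SBDF: segment 0, bus 0, function 0. */
static uint32_t virt_sbdf(unsigned int slot)
{
    assert(slot < VIRT_BUS_MAX_DEV);
    return (0u << 16) | (0u << 8) | (slot << 3) | 0u;
}
```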

3. The series includes a complete re-work of the previous locking scheme
(or lack thereof), building on the work started by Roger [1]:
[PATCH v6 03/13] vpci: move lock outside of struct vpci

This way the lock can be used to check whether vpci is present, and
removal can be performed while holding the lock, in order to make
sure there are no accesses to the contents of the vpci struct.
Previously removal could race with vpci_read for example, since the
lock was dropped prior to freeing pdev->vpci.
This also solves synchronization issues between all vPCI code entities
which could run in parallel.
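
The idea of keeping the lock in the parent structure can be shown with a
small single-threaded sketch. All names here (`struct device`,
`device_read`, `device_remove`) are hypothetical, and the boolean lock
stand-ins only mimic the style of tools/tests/vpci/emul.h; they are not
real spinlocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Single-threaded stand-ins for a spinlock. */
typedef bool lock_t;
static void lock(lock_t *l)   { assert(!*l); *l = true; }
static void unlock(lock_t *l) { assert(*l);  *l = false; }

struct state { int value; };

struct device {
    lock_t state_lock;   /* lives in the parent, outlives the state */
    struct state *state; /* may be NULL; only touched with the lock held */
};

/* Reader: take the lock first, then check whether the state is present. */
static int device_read(struct device *dev, int fallback)
{
    int ret = fallback;

    lock(&dev->state_lock);
    if ( dev->state )
        ret = dev->state->value;
    unlock(&dev->state_lock);

    return ret;
}

/* Removal: free and clear the pointer while still holding the lock. */
static void device_remove(struct device *dev)
{
    lock(&dev->state_lock);
    free(dev->state);
    dev->state = NULL;
    unlock(&dev->state_lock);
}
```

Because the lock is not part of the freed object, a reader can always
acquire it safely and then discover that the state is gone.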

4. For unprivileged guests vpci_{read|write} has been re-worked
to not pass through accesses to registers that are not explicitly handled
by the corresponding vPCI handlers: without that, passthrough
to guests is completely unsafe, as Xen allows them full access to
the registers.
During development this can be reverted for debugging purposes.
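
A minimal sketch of the restriction described in item 4, assuming a
hypothetical `hw_config_read()` stand-in for the real configuration space
accessors and a hypothetical `read_unhandled()` wrapper: accesses that no
handler claims reach the hardware only for the hardware domain, while
other guests read all ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a real 32-bit config space read. */
static uint32_t hw_config_read(unsigned int reg)
{
    return 0x12340000u | reg;
}

/* Read a register that has no emulation handler. */
static uint32_t read_unhandled(bool is_hwdom, unsigned int reg,
                               unsigned int size)
{
    uint32_t data;

    assert(size == 1 || size == 2 || size == 4);

    /* Guests never see real hardware for unhandled registers. */
    if ( !is_hwdom )
        return ~(uint32_t)0;

    data = hw_config_read(reg);

    /* Truncate to the access size for 1- and 2-byte reads. */
    return size == 4 ? data : data & ((1u << (size * 8)) - 1);
}
```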

5. The series was also tested on:
 - x86 PVH Dom0 and doesn't break it.
 - x86 HVM with PCI passthrough to DomU and doesn't break it.
 - Arm

Thank you,
Oleksandr

[1] https://lore.kernel.org/xen-devel/20180717094830.54806-2-roger.pau@citrix.com/

Oleksandr Andrushchenko (12):
  xen/pci: arm: add stub for is_memory_hole
  rangeset: add RANGESETF_no_print flag
  vpci: restrict unhandled read/write operations for guests
  vpci: add hooks for PCI device assign/de-assign
  vpci/header: implement guest BAR register handlers
  vpci/header: handle p2m range sets per BAR
  vpci/header: program p2m with guest BAR view
  vpci/header: emulate PCI_COMMAND register for guests
  vpci/header: reset the command register when adding devices
  vpci: add initial support for virtual PCI bus topology
  xen/arm: translate virtual PCI bus topology for guests
  xen/arm: account IO handlers for emulated PCI MSI-X

Roger Pau Monné (1):
  vpci: move lock outside of struct vpci

 tools/tests/vpci/emul.h       |   5 +-
 tools/tests/vpci/main.c       |   3 +-
 xen/arch/arm/mm.c             |   6 +
 xen/arch/arm/vpci.c           |  31 ++-
 xen/arch/x86/hvm/vmsi.c       |   8 +-
 xen/common/rangeset.c         |   5 +-
 xen/drivers/Kconfig           |   4 +
 xen/drivers/passthrough/pci.c |   7 +
 xen/drivers/vpci/header.c     | 407 +++++++++++++++++++++++++++-------
 xen/drivers/vpci/msi.c        |  15 +-
 xen/drivers/vpci/msix.c       |  43 +++-
 xen/drivers/vpci/vpci.c       | 232 ++++++++++++++++---
 xen/include/xen/pci.h         |   1 +
 xen/include/xen/rangeset.h    |   5 +-
 xen/include/xen/sched.h       |   8 +
 xen/include/xen/vpci.h        |  43 +++-
 16 files changed, 688 insertions(+), 135 deletions(-)

-- 
2.25.1




* [PATCH v6 01/13] xen/pci: arm: add stub for is_memory_hole
  2022-02-04  6:34 [PATCH v6 00/13] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
@ 2022-02-04  6:34 ` Oleksandr Andrushchenko
  2022-02-04  8:51   ` Julien Grall
  2022-02-04  6:34 ` [PATCH v6 02/13] rangeset: add RANGESETF_no_print flag Oleksandr Andrushchenko
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  6:34 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Add a stub for is_memory_hole which is required for PCI passthrough
on Arm.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
Cc: Julien Grall <julien@xen.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
---
New in v6
---
 xen/arch/arm/mm.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
index b1eae767c27c..c32e34a182a2 100644
--- a/xen/arch/arm/mm.c
+++ b/xen/arch/arm/mm.c
@@ -1640,6 +1640,12 @@ unsigned long get_upper_mfn_bound(void)
     return max_page - 1;
 }
 
+bool is_memory_hole(mfn_t start, mfn_t end)
+{
+    /* TODO: this needs to be properly implemented. */
+    return true;
+}
+
 /*
  * Local variables:
  * mode: C
-- 
2.25.1




* [PATCH v6 02/13] rangeset: add RANGESETF_no_print flag
  2022-02-04  6:34 [PATCH v6 00/13] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
  2022-02-04  6:34 ` [PATCH v6 01/13] xen/pci: arm: add stub for is_memory_hole Oleksandr Andrushchenko
@ 2022-02-04  6:34 ` Oleksandr Andrushchenko
  2022-02-04  6:34 ` [PATCH v6 03/13] vpci: move lock outside of struct vpci Oleksandr Andrushchenko
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  6:34 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

There are range sets which should not be printed, so introduce a flag
which allows marking those as such. Implement relevant logic to skip
such entries while printing.

While at it also simplify the definition of the flags by directly
defining those without helpers.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Since v5:
- comment indentation (Jan)
Since v1:
- update BUG_ON with new flag
- simplify the definition of the flags
---
 xen/common/rangeset.c      | 5 ++++-
 xen/include/xen/rangeset.h | 5 +++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/xen/common/rangeset.c b/xen/common/rangeset.c
index 885b6b15c229..ea27d651723b 100644
--- a/xen/common/rangeset.c
+++ b/xen/common/rangeset.c
@@ -433,7 +433,7 @@ struct rangeset *rangeset_new(
     INIT_LIST_HEAD(&r->range_list);
     r->nr_ranges = -1;
 
-    BUG_ON(flags & ~RANGESETF_prettyprint_hex);
+    BUG_ON(flags & ~(RANGESETF_prettyprint_hex | RANGESETF_no_print));
     r->flags = flags;
 
     safe_strcpy(r->name, name ?: "(no name)");
@@ -575,6 +575,9 @@ void rangeset_domain_printk(
 
     list_for_each_entry ( r, &d->rangesets, rangeset_list )
     {
+        if ( r->flags & RANGESETF_no_print )
+            continue;
+
         printk("    ");
         rangeset_printk(r);
         printk("\n");
diff --git a/xen/include/xen/rangeset.h b/xen/include/xen/rangeset.h
index 135f33f6066f..f7c69394d66a 100644
--- a/xen/include/xen/rangeset.h
+++ b/xen/include/xen/rangeset.h
@@ -49,8 +49,9 @@ void rangeset_limit(
 
 /* Flags for passing to rangeset_new(). */
  /* Pretty-print range limits in hexadecimal. */
-#define _RANGESETF_prettyprint_hex 0
-#define RANGESETF_prettyprint_hex  (1U << _RANGESETF_prettyprint_hex)
+#define RANGESETF_prettyprint_hex   (1U << 0)
+ /* Do not print entries marked with this flag. */
+#define RANGESETF_no_print          (1U << 1)
 
 bool_t __must_check rangeset_is_empty(
     const struct rangeset *r);
-- 
2.25.1
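
For readers skimming the rangeset patch above, the flag handling can be
sketched in isolation. The two flag values are taken from the patch;
`RANGESETF_ALL`, `flags_valid()` and `should_print()` are hypothetical
helpers introduced only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined by the patch. */
#define RANGESETF_prettyprint_hex   (1U << 0)
#define RANGESETF_no_print          (1U << 1)

/* Hypothetical helper: the set of all known flags. */
#define RANGESETF_ALL (RANGESETF_prettyprint_hex | RANGESETF_no_print)

/* Mirrors the BUG_ON() condition: no unknown bits may be set. */
static bool flags_valid(unsigned int flags)
{
    return !(flags & ~RANGESETF_ALL);
}

/* Mirrors the check in the printing loop: skip sets marked no_print. */
static bool should_print(unsigned int flags)
{
    assert(flags_valid(flags));
    return !(flags & RANGESETF_no_print);
}
```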




* [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04  6:34 [PATCH v6 00/13] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
  2022-02-04  6:34 ` [PATCH v6 01/13] xen/pci: arm: add stub for is_memory_hole Oleksandr Andrushchenko
  2022-02-04  6:34 ` [PATCH v6 02/13] rangeset: add RANGESETF_no_print flag Oleksandr Andrushchenko
@ 2022-02-04  6:34 ` Oleksandr Andrushchenko
  2022-02-04  7:52   ` Jan Beulich
  2022-02-04  6:34 ` [PATCH v6 04/13] vpci: restrict unhandled read/write operations for guests Oleksandr Andrushchenko
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  6:34 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Roger Pau Monné <roger.pau@citrix.com>

This way the lock can be used (and in a few cases is used right away)
to check whether vpci is present, and removal can be performed while
holding the lock, in order to make sure there are no accesses to the
contents of the vpci struct.
Previously removal could race with vpci_read for example, since the
lock was dropped prior to freeing pdev->vpci.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Julien Grall <julien@xen.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
---
New in v5 of this series: this is an updated version of the patch published at
https://lore.kernel.org/xen-devel/20180717094830.54806-2-roger.pau@citrix.com/

Changes since v5:
 - vpci_lock in test_pdev is already initialized to false by default
 - introduce msix_{get|put} to protect former msix_find's result
 - add comments to vpci_{add|remove}_registers about pdev->vpci_lock must
   be held.
 - do not split code into vpci_remove_device_handlers_locked yet
 - move INIT_LIST_HEAD outside the locked region (Jan)
 - stripped out locking optimizations for vpci_{read|write} into a
   dedicated patch
Changes since v2:
 - fixed pdev->vpci = xzalloc(struct vpci); under spin_lock (Jan)
Changes since v1:
 - Assert that vpci_lock is locked in vpci_remove_device_locked.
 - Remove double newline.
 - Shrink critical section in vpci_{read/write}.
---
 tools/tests/vpci/emul.h       |  5 ++-
 tools/tests/vpci/main.c       |  3 +-
 xen/arch/x86/hvm/vmsi.c       |  8 ++---
 xen/drivers/passthrough/pci.c |  1 +
 xen/drivers/vpci/header.c     | 21 +++++++----
 xen/drivers/vpci/msi.c        | 11 ++++--
 xen/drivers/vpci/msix.c       | 39 ++++++++++++++++-----
 xen/drivers/vpci/vpci.c       | 65 ++++++++++++++++++++++-------------
 xen/include/xen/pci.h         |  1 +
 xen/include/xen/vpci.h        |  3 +-
 10 files changed, 106 insertions(+), 51 deletions(-)

diff --git a/tools/tests/vpci/emul.h b/tools/tests/vpci/emul.h
index 2e1d3057c9d8..d018fb5eef21 100644
--- a/tools/tests/vpci/emul.h
+++ b/tools/tests/vpci/emul.h
@@ -44,6 +44,7 @@ struct domain {
 };
 
 struct pci_dev {
+    bool vpci_lock;
     struct vpci *vpci;
 };
 
@@ -53,10 +54,8 @@ struct vcpu
 };
 
 extern const struct vcpu *current;
-extern const struct pci_dev test_pdev;
+extern struct pci_dev test_pdev;
 
-typedef bool spinlock_t;
-#define spin_lock_init(l) (*(l) = false)
 #define spin_lock(l) (*(l) = true)
 #define spin_unlock(l) (*(l) = false)
 
diff --git a/tools/tests/vpci/main.c b/tools/tests/vpci/main.c
index b9a0a6006bb9..3b86ed232eb1 100644
--- a/tools/tests/vpci/main.c
+++ b/tools/tests/vpci/main.c
@@ -23,7 +23,7 @@ static struct vpci vpci;
 
 const static struct domain d;
 
-const struct pci_dev test_pdev = {
+struct pci_dev test_pdev = {
     .vpci = &vpci,
 };
 
@@ -158,7 +158,6 @@ main(int argc, char **argv)
     int rc;
 
     INIT_LIST_HEAD(&vpci.handlers);
-    spin_lock_init(&vpci.lock);
 
     VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0);
     VPCI_READ_CHECK(0, 4, r0);
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 13e2a190b439..1f7a37f78264 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -910,14 +910,14 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
         {
             struct pci_dev *pdev = msix->pdev;
 
-            spin_unlock(&msix->pdev->vpci->lock);
+            spin_unlock(&msix->pdev->vpci_lock);
             process_pending_softirqs();
             /* NB: we assume that pdev cannot go away for an alive domain. */
-            if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
+            if ( !spin_trylock(&pdev->vpci_lock) )
                 return -EBUSY;
-            if ( pdev->vpci->msix != msix )
+            if ( !pdev->vpci || pdev->vpci->msix != msix )
             {
-                spin_unlock(&pdev->vpci->lock);
+                spin_unlock(&pdev->vpci_lock);
                 return -EAGAIN;
             }
         }
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index e8b09d77d880..50dec3bb73d0 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -397,6 +397,7 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
     *((u8*) &pdev->bus) = bus;
     *((u8*) &pdev->devfn) = devfn;
     pdev->domain = NULL;
+    spin_lock_init(&pdev->vpci_lock);
 
     arch_pci_init_pdev(pdev);
 
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 40ff79c33f8f..bd23c0274d48 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -142,12 +142,13 @@ bool vpci_process_pending(struct vcpu *v)
         if ( rc == -ERESTART )
             return true;
 
-        spin_lock(&v->vpci.pdev->vpci->lock);
-        /* Disable memory decoding unconditionally on failure. */
-        modify_decoding(v->vpci.pdev,
-                        rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
-                        !rc && v->vpci.rom_only);
-        spin_unlock(&v->vpci.pdev->vpci->lock);
+        spin_lock(&v->vpci.pdev->vpci_lock);
+        if ( v->vpci.pdev->vpci )
+            /* Disable memory decoding unconditionally on failure. */
+            modify_decoding(v->vpci.pdev,
+                            rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
+                            !rc && v->vpci.rom_only);
+        spin_unlock(&v->vpci.pdev->vpci_lock);
 
         rangeset_destroy(v->vpci.mem);
         v->vpci.mem = NULL;
@@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
                 continue;
         }
 
+        spin_lock(&tmp->vpci_lock);
+        if ( !tmp->vpci )
+        {
+            spin_unlock(&tmp->vpci_lock);
+            continue;
+        }
         for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
         {
             const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
@@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             rc = rangeset_remove_range(mem, start, end);
             if ( rc )
             {
+                spin_unlock(&tmp->vpci_lock);
                 printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
                        start, end, rc);
                 rangeset_destroy(mem);
                 return rc;
             }
         }
+        spin_unlock(&tmp->vpci_lock);
     }
 
     ASSERT(dev);
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index 5757a7aed20f..e3ce46869dad 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -270,7 +270,7 @@ void vpci_dump_msi(void)
     rcu_read_lock(&domlist_read_lock);
     for_each_domain ( d )
     {
-        const struct pci_dev *pdev;
+        struct pci_dev *pdev;
 
         if ( !has_vpci(d) )
             continue;
@@ -282,8 +282,13 @@ void vpci_dump_msi(void)
             const struct vpci_msi *msi;
             const struct vpci_msix *msix;
 
-            if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
+            if ( !spin_trylock(&pdev->vpci_lock) )
                 continue;
+            if ( !pdev->vpci )
+            {
+                spin_unlock(&pdev->vpci_lock);
+                continue;
+            }
 
             msi = pdev->vpci->msi;
             if ( msi && msi->enabled )
@@ -323,7 +328,7 @@ void vpci_dump_msi(void)
                 }
             }
 
-            spin_unlock(&pdev->vpci->lock);
+            spin_unlock(&pdev->vpci_lock);
             process_pending_softirqs();
         }
     }
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index 846f1b8d7038..d1dbfc6e0ffd 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -138,7 +138,7 @@ static void control_write(const struct pci_dev *pdev, unsigned int reg,
         pci_conf_write16(pdev->sbdf, reg, val);
 }
 
-static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
+static struct vpci_msix *msix_get(const struct domain *d, unsigned long addr)
 {
     struct vpci_msix *msix;
 
@@ -150,15 +150,29 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
         for ( i = 0; i < ARRAY_SIZE(msix->tables); i++ )
             if ( bars[msix->tables[i] & PCI_MSIX_BIRMASK].enabled &&
                  VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, i) )
+            {
+                spin_lock(&msix->pdev->vpci_lock);
                 return msix;
+            }
     }
 
     return NULL;
 }
 
+static void msix_put(struct vpci_msix *msix)
+{
+    if ( !msix )
+        return;
+
+    spin_unlock(&msix->pdev->vpci_lock);
+}
+
 static int msix_accept(struct vcpu *v, unsigned long addr)
 {
-    return !!msix_find(v->domain, addr);
+    struct vpci_msix *msix = msix_get(v->domain, addr);
+
+    msix_put(msix);
+    return !!msix;
 }
 
 static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
@@ -186,7 +200,7 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
                      unsigned long *data)
 {
     const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct vpci_msix *msix = msix_get(d, addr);
     const struct vpci_msix_entry *entry;
     unsigned int offset;
 
@@ -196,7 +210,10 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
         return X86EMUL_RETRY;
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        msix_put(msix);
         return X86EMUL_OKAY;
+    }
 
     if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
     {
@@ -222,10 +239,10 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
             break;
         }
 
+        msix_put(msix);
         return X86EMUL_OKAY;
     }
 
-    spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
     offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
 
@@ -254,7 +271,8 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
         ASSERT_UNREACHABLE();
         break;
     }
-    spin_unlock(&msix->pdev->vpci->lock);
+
+    msix_put(msix);
 
     return X86EMUL_OKAY;
 }
@@ -263,7 +281,7 @@ static int msix_write(struct vcpu *v, unsigned long addr, unsigned int len,
                       unsigned long data)
 {
     const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct vpci_msix *msix = msix_get(d, addr);
     struct vpci_msix_entry *entry;
     unsigned int offset;
 
@@ -271,7 +289,10 @@ static int msix_write(struct vcpu *v, unsigned long addr, unsigned int len,
         return X86EMUL_RETRY;
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        msix_put(msix);
         return X86EMUL_OKAY;
+    }
 
     if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
     {
@@ -294,10 +315,11 @@ static int msix_write(struct vcpu *v, unsigned long addr, unsigned int len,
             }
         }
 
+        msix_put(msix);
+
         return X86EMUL_OKAY;
     }
 
-    spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
     offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
 
@@ -370,7 +392,8 @@ static int msix_write(struct vcpu *v, unsigned long addr, unsigned int len,
         ASSERT_UNREACHABLE();
         break;
     }
-    spin_unlock(&msix->pdev->vpci->lock);
+
+    msix_put(msix);
 
     return X86EMUL_OKAY;
 }
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index fb0947179b79..cb2ababa28e3 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -35,12 +35,10 @@ extern vpci_register_init_t *const __start_vpci_array[];
 extern vpci_register_init_t *const __end_vpci_array[];
 #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
 
-void vpci_remove_device(struct pci_dev *pdev)
+static void vpci_remove_device_locked(struct pci_dev *pdev)
 {
-    if ( !has_vpci(pdev->domain) )
-        return;
+    ASSERT(spin_is_locked(&pdev->vpci_lock));
 
-    spin_lock(&pdev->vpci->lock);
     while ( !list_empty(&pdev->vpci->handlers) )
     {
         struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
@@ -50,15 +48,26 @@ void vpci_remove_device(struct pci_dev *pdev)
         list_del(&r->node);
         xfree(r);
     }
-    spin_unlock(&pdev->vpci->lock);
     xfree(pdev->vpci->msix);
     xfree(pdev->vpci->msi);
     xfree(pdev->vpci);
     pdev->vpci = NULL;
 }
 
+void vpci_remove_device(struct pci_dev *pdev)
+{
+    if ( !has_vpci(pdev->domain) )
+        return;
+
+    spin_lock(&pdev->vpci_lock);
+    if ( pdev->vpci )
+        vpci_remove_device_locked(pdev);
+    spin_unlock(&pdev->vpci_lock);
+}
+
 int vpci_add_handlers(struct pci_dev *pdev)
 {
+    struct vpci *vpci;
     unsigned int i;
     int rc = 0;
 
@@ -68,12 +77,14 @@ int vpci_add_handlers(struct pci_dev *pdev)
     /* We should not get here twice for the same device. */
     ASSERT(!pdev->vpci);
 
-    pdev->vpci = xzalloc(struct vpci);
-    if ( !pdev->vpci )
+    vpci = xzalloc(struct vpci);
+    if ( !vpci )
         return -ENOMEM;
 
-    INIT_LIST_HEAD(&pdev->vpci->handlers);
-    spin_lock_init(&pdev->vpci->lock);
+    INIT_LIST_HEAD(&vpci->handlers);
+
+    spin_lock(&pdev->vpci_lock);
+    pdev->vpci = vpci;
 
     for ( i = 0; i < NUM_VPCI_INIT; i++ )
     {
@@ -83,7 +94,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
     }
 
     if ( rc )
-        vpci_remove_device(pdev);
+        vpci_remove_device_locked(pdev);
+    spin_unlock(&pdev->vpci_lock);
 
     return rc;
 }
@@ -129,6 +141,7 @@ uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg,
     return pci_conf_read32(pdev->sbdf, reg);
 }
 
+/* Must be called with pdev->vpci_lock held. */
 int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
                       vpci_write_t *write_handler, unsigned int offset,
                       unsigned int size, void *data)
@@ -152,8 +165,6 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
     r->offset = offset;
     r->private = data;
 
-    spin_lock(&vpci->lock);
-
     /* The list of handlers must be kept sorted at all times. */
     list_for_each ( prev, &vpci->handlers )
     {
@@ -165,25 +176,23 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
             break;
         if ( cmp == 0 )
         {
-            spin_unlock(&vpci->lock);
             xfree(r);
             return -EEXIST;
         }
     }
 
     list_add_tail(&r->node, prev);
-    spin_unlock(&vpci->lock);
 
     return 0;
 }
 
+/* Must be called with pdev->vpci_lock held. */
 int vpci_remove_register(struct vpci *vpci, unsigned int offset,
                          unsigned int size)
 {
     const struct vpci_register r = { .offset = offset, .size = size };
     struct vpci_register *rm;
 
-    spin_lock(&vpci->lock);
     list_for_each_entry ( rm, &vpci->handlers, node )
     {
         int cmp = vpci_register_cmp(&r, rm);
@@ -195,14 +204,12 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
         if ( !cmp && rm->offset == offset && rm->size == size )
         {
             list_del(&rm->node);
-            spin_unlock(&vpci->lock);
             xfree(rm);
             return 0;
         }
         if ( cmp <= 0 )
             break;
     }
-    spin_unlock(&vpci->lock);
 
     return -ENOENT;
 }
@@ -311,7 +318,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
 uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
 {
     const struct domain *d = current->domain;
-    const struct pci_dev *pdev;
+    struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
     uint32_t data = ~(uint32_t)0;
@@ -327,7 +334,12 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
     if ( !pdev )
         return vpci_read_hw(sbdf, reg, size);
 
-    spin_lock(&pdev->vpci->lock);
+    spin_lock(&pdev->vpci_lock);
+    if ( !pdev->vpci )
+    {
+        spin_unlock(&pdev->vpci_lock);
+        return vpci_read_hw(sbdf, reg, size);
+    }
 
     /* Read from the hardware or the emulated register handlers. */
     list_for_each_entry ( r, &pdev->vpci->handlers, node )
@@ -370,7 +382,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
             break;
         ASSERT(data_offset < size);
     }
-    spin_unlock(&pdev->vpci->lock);
+    spin_unlock(&pdev->vpci_lock);
 
     if ( data_offset < size )
     {
@@ -414,7 +426,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
 {
     const struct domain *d = current->domain;
-    const struct pci_dev *pdev;
+    struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
     const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
@@ -440,7 +452,14 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         return;
     }
 
-    spin_lock(&pdev->vpci->lock);
+    spin_lock(&pdev->vpci_lock);
+    if ( !pdev->vpci )
+    {
+        spin_unlock(&pdev->vpci_lock);
+        vpci_write_hw(sbdf, reg, size, data);
+        return;
+    }
+
 
     /* Write the value to the hardware or emulated registers. */
     list_for_each_entry ( r, &pdev->vpci->handlers, node )
@@ -475,7 +494,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
             break;
         ASSERT(data_offset < size);
     }
-    spin_unlock(&pdev->vpci->lock);
+    spin_unlock(&pdev->vpci_lock);
 
     if ( data_offset < size )
         /* Tailing gap, write the remaining. */
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index b6d7e454f814..3f60d6c6c6dd 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -134,6 +134,7 @@ struct pci_dev {
     u64 vf_rlen[6];
 
     /* Data for vPCI. */
+    spinlock_t vpci_lock;
     struct vpci *vpci;
 };
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index e8ac1eb39513..f2a7d82ce77b 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -31,7 +31,7 @@ int __must_check vpci_add_handlers(struct pci_dev *dev);
 /* Remove all handlers and free vpci related structures. */
 void vpci_remove_device(struct pci_dev *pdev);
 
-/* Add/remove a register handler. */
+/* Add/remove a register handler. Must be called holding the vpci_lock. */
 int __must_check vpci_add_register(struct vpci *vpci,
                                    vpci_read_t *read_handler,
                                    vpci_write_t *write_handler,
@@ -60,7 +60,6 @@ bool __must_check vpci_process_pending(struct vcpu *v);
 struct vpci {
     /* List of vPCI handlers for a device. */
     struct list_head handlers;
-    spinlock_t lock;
 
 #ifdef __XEN__
     /* Hide the rest of the vpci struct from the user-space test harness. */
-- 
2.25.1
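
The msix_{get|put} pairing introduced in the patch above follows a common
pattern: the lookup returns its result with the lock already held, and the
caller releases it through a matching put. A stand-alone sketch, with
hypothetical names (`struct entry`, `entry_get`, `entry_put`) and a boolean
stand-in for the spinlock:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool lock_t; /* single-threaded stand-in for a spinlock */

struct entry {
    lock_t lock;
    unsigned long start, end; /* inclusive address range */
};

/* Find the entry covering addr; on success it is returned locked. */
static struct entry *entry_get(struct entry *tbl, size_t n,
                               unsigned long addr)
{
    for ( size_t i = 0; i < n; i++ )
        if ( addr >= tbl[i].start && addr <= tbl[i].end )
        {
            assert(!tbl[i].lock);
            tbl[i].lock = true;
            return &tbl[i];
        }

    return NULL;
}

/* Release an entry obtained from entry_get(); NULL is accepted. */
static void entry_put(struct entry *e)
{
    if ( !e )
        return;

    assert(e->lock);
    e->lock = false;
}
```

Accepting NULL in the put helper lets a caller write get, put, return
without branching, as msix_accept() does in the patch.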




* [PATCH v6 04/13] vpci: restrict unhandled read/write operations for guests
  2022-02-04  6:34 [PATCH v6 00/13] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (2 preceding siblings ...)
  2022-02-04  6:34 ` [PATCH v6 03/13] vpci: move lock outside of struct vpci Oleksandr Andrushchenko
@ 2022-02-04  6:34 ` Oleksandr Andrushchenko
  2022-02-04 14:11   ` Jan Beulich
  2022-02-04  6:34 ` [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign Oleksandr Andrushchenko
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  6:34 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Currently a guest can read and write registers which are not emulated and
have no respective vPCI handlers, so it can access the HW directly.
To prevent this, make sure only the hardware domain can access HW
directly and restrict guests from doing so.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
New in v6
---
 xen/drivers/vpci/vpci.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index cb2ababa28e3..f8a93e61c08f 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -215,11 +215,15 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
 }
 
 /* Wrappers for performing reads/writes to the underlying hardware. */
-static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
+static uint32_t vpci_read_hw(bool is_hwdom, pci_sbdf_t sbdf, unsigned int reg,
                              unsigned int size)
 {
     uint32_t data;
 
+    /* Guest domains are not allowed to read real hardware. */
+    if ( !is_hwdom )
+        return ~(uint32_t)0;
+
     switch ( size )
     {
     case 4:
@@ -260,9 +264,13 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
     return data;
 }
 
-static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
-                          uint32_t data)
+static void vpci_write_hw(bool is_hwdom, pci_sbdf_t sbdf, unsigned int reg,
+                          unsigned int size, uint32_t data)
 {
+    /* Guest domains are not allowed to write real hardware. */
+    if ( !is_hwdom )
+        return;
+
     switch ( size )
     {
     case 4:
@@ -322,6 +330,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
     const struct vpci_register *r;
     unsigned int data_offset = 0;
     uint32_t data = ~(uint32_t)0;
+    bool is_hwdom = is_hardware_domain(d);
 
     if ( !size )
     {
@@ -332,13 +341,13 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
     /* Find the PCI dev matching the address. */
     pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
     if ( !pdev )
-        return vpci_read_hw(sbdf, reg, size);
+        return vpci_read_hw(is_hwdom, sbdf, reg, size);
 
     spin_lock(&pdev->vpci_lock);
     if ( !pdev->vpci )
     {
         spin_unlock(&pdev->vpci_lock);
-        return vpci_read_hw(sbdf, reg, size);
+        return vpci_read_hw(is_hwdom, sbdf, reg, size);
     }
 
     /* Read from the hardware or the emulated register handlers. */
@@ -361,7 +370,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
         {
             /* Heading gap, read partial content from hardware. */
             read_size = r->offset - emu.offset;
-            val = vpci_read_hw(sbdf, emu.offset, read_size);
+            val = vpci_read_hw(is_hwdom, sbdf, emu.offset, read_size);
             data = merge_result(data, val, read_size, data_offset);
             data_offset += read_size;
         }
@@ -387,7 +396,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
     if ( data_offset < size )
     {
         /* Tailing gap, read the remaining. */
-        uint32_t tmp_data = vpci_read_hw(sbdf, reg + data_offset,
+        uint32_t tmp_data = vpci_read_hw(is_hwdom, sbdf, reg + data_offset,
                                          size - data_offset);
 
         data = merge_result(data, tmp_data, size - data_offset, data_offset);
@@ -430,6 +439,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
     const struct vpci_register *r;
     unsigned int data_offset = 0;
     const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
+    bool is_hwdom = is_hardware_domain(d);
 
     if ( !size )
     {
@@ -448,7 +458,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
     pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
     if ( !pdev )
     {
-        vpci_write_hw(sbdf, reg, size, data);
+        vpci_write_hw(is_hwdom, sbdf, reg, size, data);
         return;
     }
 
@@ -456,7 +466,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
     if ( !pdev->vpci )
     {
         spin_unlock(&pdev->vpci_lock);
-        vpci_write_hw(sbdf, reg, size, data);
+        vpci_write_hw(is_hwdom, sbdf, reg, size, data);
         return;
     }
 
@@ -479,7 +489,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         if ( emu.offset < r->offset )
         {
             /* Heading gap, write partial content to hardware. */
-            vpci_write_hw(sbdf, emu.offset, r->offset - emu.offset,
+            vpci_write_hw(is_hwdom, sbdf, emu.offset, r->offset - emu.offset,
                           data >> (data_offset * 8));
             data_offset += r->offset - emu.offset;
         }
@@ -498,7 +508,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
 
     if ( data_offset < size )
         /* Tailing gap, write the remaining. */
-        vpci_write_hw(sbdf, reg + data_offset, size - data_offset,
+        vpci_write_hw(is_hwdom, sbdf, reg + data_offset, size - data_offset,
                       data >> (data_offset * 8));
 }
 
-- 
2.25.1




* [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-04  6:34 [PATCH v6 00/13] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (3 preceding siblings ...)
  2022-02-04  6:34 ` [PATCH v6 04/13] vpci: restrict unhandled read/write operations for guests Oleksandr Andrushchenko
@ 2022-02-04  6:34 ` Oleksandr Andrushchenko
  2022-02-07 16:28   ` Jan Beulich
  2022-02-04  6:34 ` [PATCH v6 06/13] vpci/header: implement guest BAR register handlers Oleksandr Andrushchenko
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  6:34 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

When a PCI device gets assigned/de-assigned, some work on the vPCI side
needs to be done for that device. Introduce a pair of hooks so vPCI can
handle that.
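
The ordering of the two hooks can be sketched with a simplified model. All
names below (demo_domain, demo_dev, demo_assign, demo_deassign, demo_move)
are hypothetical and only mimic the shape of the hooks; in particular the
ownership change is done inline here, whereas in Xen it happens in the
IOMMU assignment path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified models of a domain and a PCI device. */
struct demo_domain {
    bool has_vpci;
};

struct demo_dev {
    struct demo_domain *owner;
    bool handlers_installed;
    bool fail_init;             /* test knob: make handler setup fail */
};

/* De-assign hook: drop all per-device vPCI state for a vPCI domain. */
static void demo_deassign(struct demo_domain *d, struct demo_dev *dev)
{
    if ( !d->has_vpci )
        return;

    dev->handlers_installed = false;
}

/* Assign hook: install handlers, cleaning up again on failure. */
static int demo_assign(struct demo_domain *d, struct demo_dev *dev)
{
    if ( !d->has_vpci )
        return 0;

    if ( dev->fail_init )
    {
        demo_deassign(d, dev);
        return -1;
    }

    dev->handlers_installed = true;

    return 0;
}

/* Move a device: tear down the old owner's state, then set up the new. */
static int demo_move(struct demo_dev *dev, struct demo_domain *to)
{
    demo_deassign(dev->owner, dev);
    dev->owner = to;

    return demo_assign(to, dev);
}
```

The point of the sketch is only that the previous owner's handlers are
removed before the new owner's handlers are added, and that a failed
assignment leaves no handlers behind.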

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v5:
- do not split code into run_vpci_init
- do not check for is_system_domain in vpci_{de}assign_device
- do not use vpci_remove_device_handlers_locked and re-allocate
  pdev->vpci completely
- make vpci_deassign_device void
Since v4:
 - de-assign vPCI from the previous domain on device assignment
 - do not remove handlers in vpci_assign_device as those must not
   exist at that point
Since v3:
 - remove toolstack roll-back description from the commit message
   as errors are to be handled with proper cleanup in Xen itself
 - remove __must_check
 - remove redundant rc check while assigning devices
 - fix redundant CONFIG_HAS_VPCI check for CONFIG_HAS_VPCI_GUEST_SUPPORT
 - use REGISTER_VPCI_INIT machinery to run required steps on device
   init/assign: add run_vpci_init helper
Since v2:
- define CONFIG_HAS_VPCI_GUEST_SUPPORT so dead code is not compiled
  for x86
Since v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - extended the commit message
---
 xen/drivers/Kconfig           |  4 ++++
 xen/drivers/passthrough/pci.c |  6 ++++++
 xen/drivers/vpci/vpci.c       | 27 +++++++++++++++++++++++++++
 xen/include/xen/vpci.h        | 15 +++++++++++++++
 4 files changed, 52 insertions(+)

diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
index db94393f47a6..780490cf8e39 100644
--- a/xen/drivers/Kconfig
+++ b/xen/drivers/Kconfig
@@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
 config HAS_VPCI
 	bool
 
+config HAS_VPCI_GUEST_SUPPORT
+	bool
+	depends on HAS_VPCI
+
 endmenu
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 50dec3bb73d0..88836aab6baf 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -943,6 +943,8 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
     if ( ret )
         goto out;
 
+    vpci_deassign_device(d, pdev);
+
     if ( pdev->domain == hardware_domain  )
         pdev->quarantine = false;
 
@@ -1488,6 +1490,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
     ASSERT(pdev && (pdev->domain == hardware_domain ||
                     pdev->domain == dom_io));
 
+    vpci_deassign_device(pdev->domain, pdev);
+
     rc = pdev_msix_assign(d, pdev);
     if ( rc )
         goto done;
@@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
                         pci_to_dev(pdev), flag);
     }
 
+    rc = vpci_assign_device(d, pdev);
+
  done:
     if ( rc )
         printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index f8a93e61c08f..4e774875fa04 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -99,6 +99,33 @@ int vpci_add_handlers(struct pci_dev *pdev)
 
     return rc;
 }
+
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+/* Notify vPCI that device is assigned to guest. */
+int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
+{
+    int rc;
+
+    if ( !has_vpci(d) )
+        return 0;
+
+    rc = vpci_add_handlers(pdev);
+    if ( rc )
+        vpci_deassign_device(d, pdev);
+
+    return rc;
+}
+
+/* Notify vPCI that device is de-assigned from guest. */
+void vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
+{
+    if ( !has_vpci(d) )
+        return;
+
+    vpci_remove_device(pdev);
+}
+#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
+
 #endif /* __XEN__ */
 
 static int vpci_register_cmp(const struct vpci_register *r1,
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index f2a7d82ce77b..246307e6f5d5 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -251,6 +251,21 @@ static inline bool __must_check vpci_process_pending(struct vcpu *v)
 }
 #endif
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+/* Notify vPCI that device is assigned/de-assigned to/from guest. */
+int vpci_assign_device(struct domain *d, struct pci_dev *pdev);
+void vpci_deassign_device(struct domain *d, struct pci_dev *pdev);
+#else
+static inline int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
+{
+    return 0;
+}
+
+static inline void vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
+{
+}
+#endif
+
 #endif
 
 /*
-- 
2.25.1




* [PATCH v6 06/13] vpci/header: implement guest BAR register handlers
  2022-02-04  6:34 [PATCH v6 00/13] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (4 preceding siblings ...)
  2022-02-04  6:34 ` [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign Oleksandr Andrushchenko
@ 2022-02-04  6:34 ` Oleksandr Andrushchenko
  2022-02-07 17:06   ` Jan Beulich
  2022-02-08  9:25   ` Roger Pau Monné
  2022-02-04  6:34 ` [PATCH v6 07/13] vpci/header: handle p2m range sets per BAR Oleksandr Andrushchenko
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  6:34 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Add relevant vpci register handlers when assigning PCI device to a domain
and remove those when de-assigning. This allows having different
handlers for different domains, e.g. hwdom and other guests.

Emulate guest BAR register values: this allows creating a guest view
of the registers and emulates size and properties probe as it is done
during PCI device enumeration by the guest.
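
As a rough illustration of the size probe, the sketch below models a single
32-bit memory BAR. The names (demo_bar, demo_bar_write, demo_bar_read,
demo_probe_size and the DEMO_* masks) are hypothetical; 64-bit BARs and the
page offset check done by the real handler are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MEM_MASK     (~(uint32_t)0x0f) /* address bits of a memory BAR */
#define DEMO_MEM_PREFETCH 0x08u             /* prefetchable flag */

/* Hypothetical guest view of one 32-bit memory BAR. */
struct demo_bar {
    uint64_t size;          /* power of two, in bytes */
    uint32_t guest_reg;     /* value the guest reads back */
    bool prefetchable;
};

/* Keep only the address bits above the BAR size; flags are read-only. */
static void demo_bar_write(struct demo_bar *bar, uint32_t val)
{
    val &= DEMO_MEM_MASK;
    val &= ~(uint32_t)(bar->size - 1);
    if ( bar->prefetchable )
        val |= DEMO_MEM_PREFETCH;

    bar->guest_reg = val;
}

static uint32_t demo_bar_read(const struct demo_bar *bar)
{
    return bar->guest_reg;
}

/* What a guest OS typically does to size a BAR during enumeration. */
static uint64_t demo_probe_size(struct demo_bar *bar)
{
    uint32_t saved = demo_bar_read(bar);
    uint32_t mask;

    demo_bar_write(bar, ~(uint32_t)0);
    mask = demo_bar_read(bar) & DEMO_MEM_MASK;
    demo_bar_write(bar, saved);

    return (uint64_t)(uint32_t)~mask + 1;
}
```

Because the low address bits are forced to zero on every write, writing all
ones and reading the register back lets the guest recover the BAR size
without the physical register ever being touched.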

All empty, IO and ROM BARs for guests are emulated by returning 0 on
reads and ignoring writes: these BARs are special in this respect as
their lower bits have special meaning, so returning the default ~0 on
read may confuse the guest OS.

Memory decoding is initially disabled when used by guests in order to
prevent the BAR being placed on top of a RAM region.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v5:
- make sure that the guest set address has the same page offset
  as the physical address on the host
- remove guest_rom_{read|write} as those just implement the default
  behaviour of the registers not being handled
- adjusted comment for struct vpci.addr field
- add guest handlers for BARs which are not handled and would otherwise
  return ~0 on read and ignore writes. These BARs are special in this
  respect as their lower bits have special meaning, so returning ~0
  doesn't seem to be right
Since v4:
- updated commit message
- s/guest_addr/guest_reg
Since v3:
- squashed two patches: dynamic add/remove handlers and guest BAR
  handler implementation
- fix guest BAR read of the high part of a 64bit BAR (Roger)
- add error handling to vpci_assign_device
- s/dom%pd/%pd
- blank line before return
Since v2:
- remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
  has been eliminated from being built on x86
Since v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - simplify some code
 - use gdprintk + error code instead of gprintk
 - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
   so these do not get compiled for x86
 - removed unneeded is_system_domain check
 - re-work guest read/write to be much simpler and do more work on write
   than read which is expected to be called more frequently
 - removed one too obvious comment
---
 xen/drivers/vpci/header.c | 131 +++++++++++++++++++++++++++++++++-----
 xen/include/xen/vpci.h    |   3 +
 2 files changed, 118 insertions(+), 16 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index bd23c0274d48..2620a95ff35b 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -406,6 +406,81 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
     pci_conf_write32(pdev->sbdf, reg, val);
 }
 
+static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
+                            uint32_t val, void *data)
+{
+    struct vpci_bar *bar = data;
+    bool hi = false;
+    uint64_t guest_reg = bar->guest_reg;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+    else
+    {
+        val &= PCI_BASE_ADDRESS_MEM_MASK;
+        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
+                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
+        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+    }
+
+    guest_reg &= ~(0xffffffffull << (hi ? 32 : 0));
+    guest_reg |= (uint64_t)val << (hi ? 32 : 0);
+
+    guest_reg &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
+
+    /*
+     * Make sure that the guest set address has the same page offset
+     * as the physical address on the host or otherwise things won't work as
+     * expected.
+     */
+    if ( (guest_reg & (~PAGE_MASK & PCI_BASE_ADDRESS_MEM_MASK)) !=
+         (bar->addr & ~PAGE_MASK) )
+    {
+        gprintk(XENLOG_WARNING,
+                "%pp: ignored BAR %zu write with wrong page offset\n",
+                &pdev->sbdf, bar - pdev->vpci->header.bars + hi);
+        return;
+    }
+
+    bar->guest_reg = guest_reg;
+}
+
+static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
+                               void *data)
+{
+    const struct vpci_bar *bar = data;
+    bool hi = false;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+
+    return bar->guest_reg >> (hi ? 32 : 0);
+}
+
+static uint32_t guest_bar_ignore_read(const struct pci_dev *pdev,
+                                      unsigned int reg, void *data)
+{
+    return 0;
+}
+
+static int bar_ignore_access(const struct pci_dev *pdev, unsigned int reg,
+                             struct vpci_bar *bar)
+{
+    if ( is_hardware_domain(pdev->domain) )
+        return 0;
+
+    return vpci_add_register(pdev->vpci, guest_bar_ignore_read, NULL,
+                             reg, 4, bar);
+}
+
 static void rom_write(const struct pci_dev *pdev, unsigned int reg,
                       uint32_t val, void *data)
 {
@@ -462,6 +537,7 @@ static int init_bars(struct pci_dev *pdev)
     struct vpci_header *header = &pdev->vpci->header;
     struct vpci_bar *bars = header->bars;
     int rc;
+    bool is_hwdom = is_hardware_domain(pdev->domain);
 
     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
     {
@@ -501,8 +577,10 @@ static int init_bars(struct pci_dev *pdev)
         if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
         {
             bars[i].type = VPCI_BAR_MEM64_HI;
-            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
-                                   4, &bars[i]);
+            rc = vpci_add_register(pdev->vpci,
+                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
+                                   is_hwdom ? bar_write : guest_bar_write,
+                                   reg, 4, &bars[i]);
             if ( rc )
             {
                 pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
@@ -516,6 +594,11 @@ static int init_bars(struct pci_dev *pdev)
         if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
         {
             bars[i].type = VPCI_BAR_IO;
+
+            rc = bar_ignore_access(pdev, reg, &bars[i]);
+            if ( rc )
+                return rc;
+
             continue;
         }
         if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
@@ -535,6 +618,11 @@ static int init_bars(struct pci_dev *pdev)
         if ( size == 0 )
         {
             bars[i].type = VPCI_BAR_EMPTY;
+
+            rc = bar_ignore_access(pdev, reg, &bars[i]);
+            if ( rc )
+                return rc;
+
             continue;
         }
 
@@ -542,8 +630,10 @@ static int init_bars(struct pci_dev *pdev)
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
-                               &bars[i]);
+        rc = vpci_add_register(pdev->vpci,
+                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
+                               is_hwdom ? bar_write : guest_bar_write,
+                               reg, 4, &bars[i]);
         if ( rc )
         {
             pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
@@ -551,22 +641,31 @@ static int init_bars(struct pci_dev *pdev)
         }
     }
 
-    /* Check expansion ROM. */
-    rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
-    if ( rc > 0 && size )
+    /* Check expansion ROM: we do not handle ROM for guests. */
+    if ( is_hwdom )
     {
-        struct vpci_bar *rom = &header->bars[num_bars];
+        rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
+        if ( rc > 0 && size )
+        {
+            struct vpci_bar *rom = &header->bars[num_bars];
 
-        rom->type = VPCI_BAR_ROM;
-        rom->size = size;
-        rom->addr = addr;
-        header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
-                              PCI_ROM_ADDRESS_ENABLE;
+            rom->type = VPCI_BAR_ROM;
+            rom->size = size;
+            rom->addr = addr;
+            header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
+                                  PCI_ROM_ADDRESS_ENABLE;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
-                               4, rom);
+            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write,
+                                   rom_reg, 4, rom);
+            if ( rc )
+                rom->type = VPCI_BAR_EMPTY;
+        }
+    }
+    else
+    {
+        rc = bar_ignore_access(pdev, rom_reg, &header->bars[num_bars]);
         if ( rc )
-            rom->type = VPCI_BAR_EMPTY;
+            return rc;
     }
 
     return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 246307e6f5d5..270d22b85653 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -66,7 +66,10 @@ struct vpci {
     struct vpci_header {
         /* Information about the PCI BARs of this device. */
         struct vpci_bar {
+            /* Physical (host) address. */
             uint64_t addr;
+            /* Guest view of the BAR: address and lower bits. */
+            uint64_t guest_reg;
             uint64_t size;
             enum {
                 VPCI_BAR_EMPTY,
-- 
2.25.1




* [PATCH v6 07/13] vpci/header: handle p2m range sets per BAR
  2022-02-04  6:34 [PATCH v6 00/13] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (5 preceding siblings ...)
  2022-02-04  6:34 ` [PATCH v6 06/13] vpci/header: implement guest BAR register handlers Oleksandr Andrushchenko
@ 2022-02-04  6:34 ` Oleksandr Andrushchenko
  2022-02-04  6:34 ` [PATCH v6 08/13] vpci/header: program p2m with guest BAR view Oleksandr Andrushchenko
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  6:34 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Instead of handling a single range set, that contains all the memory
regions of all the BARs and ROM, have them per BAR.
As the range sets are now created when a PCI device is added and destroyed
when it is removed, make them named and accounted.

Note that rangesets are used here even though each set holds at most
3 separate ranges (typically just 1): a rangeset per BAR was chosen for
ease of implementation and to re-use existing code.
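
To see why a per-BAR set can hold more than one range, consider the sketch
below. It is not the Xen rangeset API: demo_range, demo_bar_range and
demo_punch_hole are hypothetical helpers, and a 4K page size is assumed.
Each hole removed from a BAR's page range (for example the pages holding
the MSI-X table) can split one range into two.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT  12                     /* assumes 4K pages */
#define DEMO_PFN_DOWN(x) ((x) >> DEMO_PAGE_SHIFT)

/* Inclusive range of page frame numbers. */
struct demo_range {
    uint64_t start, end;
};

/* Page range covered by a BAR at addr of size bytes (size > 0). */
static struct demo_range demo_bar_range(uint64_t addr, uint64_t size)
{
    struct demo_range r = {
        DEMO_PFN_DOWN(addr), DEMO_PFN_DOWN(addr + size - 1)
    };

    return r;
}

/*
 * Remove [hs, he] from r. The surviving pieces are stored in out[] and
 * their number (0, 1 or 2) is returned.
 */
static unsigned int demo_punch_hole(struct demo_range r, uint64_t hs,
                                    uint64_t he, struct demo_range out[2])
{
    unsigned int n = 0;

    if ( he < r.start || hs > r.end )
    {
        /* No overlap: the range survives unchanged. */
        out[n++] = r;
        return n;
    }

    if ( hs > r.start )
        out[n++] = (struct demo_range){ r.start, hs - 1 };
    if ( he < r.end )
        out[n++] = (struct demo_range){ he + 1, r.end };

    return n;
}
```

For a 16K BAR at 0xe0000000 with a hole at page 0xe0001, the sketch leaves
two pieces: [0xe0000, 0xe0000] and [0xe0002, 0xe0003].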

This is in preparation for making non-identity mappings in p2m for the MMIOs.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
Since v5:
- fix comments
- move rangeset allocation to init_bars and only allocate
  for MAPPABLE BARs
- check for overlap with the already setup BAR ranges
Since v4:
- use named range sets for BARs (Jan)
- changes required by the new locking scheme
- updated commit message (Jan)
Since v3:
- re-work vpci_cancel_pending accordingly to the per-BAR handling
- s/num_mem_ranges/map_pending and s/uint8_t/bool
- ASSERT(bar->mem) in modify_bars
- create and destroy the rangesets on add/remove
---
 xen/drivers/vpci/header.c | 213 ++++++++++++++++++++++++++++----------
 xen/drivers/vpci/vpci.c   |  17 ++-
 xen/include/xen/vpci.h    |   4 +-
 3 files changed, 177 insertions(+), 57 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 2620a95ff35b..0c94504b87d8 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -131,50 +131,85 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
 
 bool vpci_process_pending(struct vcpu *v)
 {
-    if ( v->vpci.mem )
+    struct pci_dev *pdev = v->vpci.pdev;
+
+    if ( !pdev )
+        return false;
+
+    spin_lock(&pdev->vpci_lock);
+    if ( v->vpci.map_pending )
     {
         struct map_data data = {
             .d = v->domain,
             .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
         };
-        int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
+        struct vpci_header *header = &pdev->vpci->header;
+        unsigned int i;
+
+        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+        {
+            struct vpci_bar *bar = &header->bars[i];
+            int rc;
+
+            if ( rangeset_is_empty(bar->mem) )
+                continue;
 
-        if ( rc == -ERESTART )
-            return true;
+            rc = rangeset_consume_ranges(bar->mem, map_range, &data);
+
+            if ( rc == -ERESTART )
+            {
+                spin_unlock(&pdev->vpci_lock);
+                return true;
+            }
 
-        spin_lock(&v->vpci.pdev->vpci_lock);
-        if ( v->vpci.pdev->vpci )
             /* Disable memory decoding unconditionally on failure. */
-            modify_decoding(v->vpci.pdev,
-                            rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
-                            !rc && v->vpci.rom_only);
-        spin_unlock(&v->vpci.pdev->vpci_lock);
+            modify_decoding(pdev, rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY :
+                                       v->vpci.cmd, !rc && v->vpci.rom_only);
+
+            if ( rc )
+            {
+                /*
+                 * FIXME: in case of failure remove the device from the domain.
+                 * Note that there might still be leftover mappings. While this
+                 * is safe for Dom0, for DomUs the domain needs to be killed in
+                 * order to avoid leaking stale p2m mappings on failure.
+                 */
+                if ( is_hardware_domain(v->domain) )
+                    vpci_remove_device_locked(pdev);
+                else
+                    domain_crash(v->domain);
+
+                break;
+            }
+        }
+
+        v->vpci.map_pending = false;
 
-        rangeset_destroy(v->vpci.mem);
-        v->vpci.mem = NULL;
-        if ( rc )
-            /*
-             * FIXME: in case of failure remove the device from the domain.
-             * Note that there might still be leftover mappings. While this is
-             * safe for Dom0, for DomUs the domain will likely need to be
-             * killed in order to avoid leaking stale p2m mappings on
-             * failure.
-             */
-            vpci_remove_device(v->vpci.pdev);
     }
+    spin_unlock(&pdev->vpci_lock);
 
     return false;
 }
 
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
-                            struct rangeset *mem, uint16_t cmd)
+                            uint16_t cmd)
 {
     struct map_data data = { .d = d, .map = true };
-    int rc;
+    struct vpci_header *header = &pdev->vpci->header;
+    int rc = 0;
+    unsigned int i;
+
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        struct vpci_bar *bar = &header->bars[i];
 
-    while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
-        process_pending_softirqs();
-    rangeset_destroy(mem);
+        if ( rangeset_is_empty(bar->mem) )
+            continue;
+
+        while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
+                                              &data)) == -ERESTART )
+            process_pending_softirqs();
+    }
     if ( !rc )
         modify_decoding(pdev, cmd, false);
 
@@ -182,7 +217,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
 }
 
 static void defer_map(struct domain *d, struct pci_dev *pdev,
-                      struct rangeset *mem, uint16_t cmd, bool rom_only)
+                      uint16_t cmd, bool rom_only)
 {
     struct vcpu *curr = current;
 
@@ -193,7 +228,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
      * started for the same device if the domain is not well-behaved.
      */
     curr->vpci.pdev = pdev;
-    curr->vpci.mem = mem;
+    curr->vpci.map_pending = true;
     curr->vpci.cmd = cmd;
     curr->vpci.rom_only = rom_only;
     /*
@@ -207,42 +242,60 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
 static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 {
     struct vpci_header *header = &pdev->vpci->header;
-    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
     struct pci_dev *tmp, *dev = NULL;
     const struct vpci_msix *msix = pdev->vpci->msix;
-    unsigned int i;
+    unsigned int i, j;
     int rc;
-
-    if ( !mem )
-        return -ENOMEM;
+    bool map_pending;
 
     /*
-     * Create a rangeset that represents the current device BARs memory region
-     * and compare it against all the currently active BAR memory regions. If
-     * an overlap is found, subtract it from the region to be mapped/unmapped.
+     * Create a rangeset per BAR that represents the current device memory
+     * region and compare it against all the currently active BAR memory
+     * regions. If an overlap is found, subtract it from the region to be
+     * mapped/unmapped.
      *
-     * First fill the rangeset with all the BARs of this device or with the ROM
+     * First fill the rangesets with the BARs of this device or with the ROM
      * BAR only, depending on whether the guest is toggling the memory decode
      * bit of the command register, or the enable bit of the ROM BAR register.
      */
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
-        const struct vpci_bar *bar = &header->bars[i];
+        struct vpci_bar *bar = &header->bars[i];
         unsigned long start = PFN_DOWN(bar->addr);
         unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
 
+        if ( !bar->mem )
+            continue;
+
         if ( !MAPPABLE_BAR(bar) ||
              (rom_only ? bar->type != VPCI_BAR_ROM
                        : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) )
             continue;
 
-        rc = rangeset_add_range(mem, start, end);
+        rc = rangeset_add_range(bar->mem, start, end);
         if ( rc )
         {
             printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
                    start, end, rc);
-            rangeset_destroy(mem);
-            return rc;
+            goto fail;
+        }
+
+        /* Check for overlap with the already setup BAR ranges. */
+        for ( j = 0; j < i; j++ )
+        {
+            struct vpci_bar *bar = &header->bars[j];
+
+            if ( rangeset_is_empty(bar->mem) )
+                continue;
+
+            rc = rangeset_remove_range(bar->mem, start, end);
+            if ( rc )
+            {
+                printk(XENLOG_G_WARNING
+                       "Failed to remove overlapping range [%lx, %lx]: %d\n",
+                       start, end, rc);
+                goto fail;
+            }
         }
     }
 
@@ -253,14 +306,21 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
         unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
                                      vmsix_table_size(pdev->vpci, i) - 1);
 
-        rc = rangeset_remove_range(mem, start, end);
-        if ( rc )
+        for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
         {
-            printk(XENLOG_G_WARNING
-                   "Failed to remove MSIX table [%lx, %lx]: %d\n",
-                   start, end, rc);
-            rangeset_destroy(mem);
-            return rc;
+            const struct vpci_bar *bar = &header->bars[j];
+
+            if ( rangeset_is_empty(bar->mem) )
+                continue;
+
+            rc = rangeset_remove_range(bar->mem, start, end);
+            if ( rc )
+            {
+                printk(XENLOG_G_WARNING
+                       "Failed to remove MSIX table [%lx, %lx]: %d\n",
+                       start, end, rc);
+                goto fail;
+            }
         }
     }
 
@@ -298,7 +358,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             unsigned long start = PFN_DOWN(bar->addr);
             unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
 
-            if ( !bar->enabled || !rangeset_overlaps_range(mem, start, end) ||
+            if ( !bar->enabled ||
+                 !rangeset_overlaps_range(bar->mem, start, end) ||
                  /*
                   * If only the ROM enable bit is toggled check against other
                   * BARs in the same device for overlaps, but not against the
@@ -307,14 +368,13 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
                  (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
                 continue;
 
-            rc = rangeset_remove_range(mem, start, end);
+            rc = rangeset_remove_range(bar->mem, start, end);
             if ( rc )
             {
                 spin_unlock(&tmp->vpci_lock);
                 printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
                        start, end, rc);
-                rangeset_destroy(mem);
-                return rc;
+                goto fail;
             }
         }
         spin_unlock(&tmp->vpci_lock);
@@ -333,12 +393,28 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
          * will always be to establish mappings and process all the BARs.
          */
         ASSERT((cmd & PCI_COMMAND_MEMORY) && !rom_only);
-        return apply_map(pdev->domain, pdev, mem, cmd);
+        return apply_map(pdev->domain, pdev, cmd);
     }
 
-    defer_map(dev->domain, dev, mem, cmd, rom_only);
+    /* Find out whether any memory ranges are left after MSI-X and overlaps. */
+    map_pending = false;
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+        if ( !rangeset_is_empty(header->bars[i].mem) )
+        {
+            map_pending = true;
+            break;
+        }
+
+    /* If there's no mapping work write the command register now. */
+    if ( !map_pending )
+        pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
+    else
+        defer_map(dev->domain, dev, cmd, rom_only);
 
     return 0;
+
+fail:
+    return rc;
 }
 
 static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
@@ -529,6 +605,19 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
 }
 
+static int bar_add_rangeset(struct pci_dev *pdev, struct vpci_bar *bar, int i)
+{
+    char str[32];
+
+    snprintf(str, sizeof(str), "%pp:BAR%d", &pdev->sbdf, i);
+
+    bar->mem = rangeset_new(pdev->domain, str, RANGESETF_no_print);
+    if ( !bar->mem )
+        return -ENOMEM;
+
+    return 0;
+}
+
 static int init_bars(struct pci_dev *pdev)
 {
     uint16_t cmd;
@@ -607,6 +696,13 @@ static int init_bars(struct pci_dev *pdev)
         else
             bars[i].type = VPCI_BAR_MEM32;
 
+        rc = bar_add_rangeset(pdev, &bars[i], i);
+        if ( rc )
+        {
+            bars[i].type = VPCI_BAR_EMPTY;
+            return rc;
+        }
+
         rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
                               (i == num_bars - 1) ? PCI_BAR_LAST : 0);
         if ( rc < 0 )
@@ -659,6 +755,15 @@ static int init_bars(struct pci_dev *pdev)
                                    rom_reg, 4, rom);
             if ( rc )
                 rom->type = VPCI_BAR_EMPTY;
+            else
+            {
+                rc = bar_add_rangeset(pdev, rom, i);
+                if ( rc )
+                {
+                    rom->type = VPCI_BAR_EMPTY;
+                    return rc;
+                }
+            }
         }
     }
     else
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 4e774875fa04..3177f13c1c22 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -35,8 +35,11 @@ extern vpci_register_init_t *const __start_vpci_array[];
 extern vpci_register_init_t *const __end_vpci_array[];
 #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
 
-static void vpci_remove_device_locked(struct pci_dev *pdev)
+void vpci_remove_device_locked(struct pci_dev *pdev)
 {
+    struct vpci_header *header = &pdev->vpci->header;
+    unsigned int i;
+
     ASSERT(spin_is_locked(&pdev->vpci_lock));
 
     while ( !list_empty(&pdev->vpci->handlers) )
@@ -48,6 +51,10 @@ static void vpci_remove_device_locked(struct pci_dev *pdev)
         list_del(&r->node);
         xfree(r);
     }
+
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+        rangeset_destroy(header->bars[i].mem);
+
     xfree(pdev->vpci->msix);
     xfree(pdev->vpci->msi);
     xfree(pdev->vpci);
@@ -94,9 +101,15 @@ int vpci_add_handlers(struct pci_dev *pdev)
     }
 
     if ( rc )
-        vpci_remove_device_locked(pdev);
+        goto fail;
+
     spin_unlock(&pdev->vpci_lock);
 
+    return 0;
+
+ fail:
+    vpci_remove_device_locked(pdev);
+    spin_unlock(&pdev->vpci_lock);
     return rc;
 }
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 270d22b85653..f1f49db959c7 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -30,6 +30,7 @@ int __must_check vpci_add_handlers(struct pci_dev *dev);
 
 /* Remove all handlers and free vpci related structures. */
 void vpci_remove_device(struct pci_dev *pdev);
+void vpci_remove_device_locked(struct pci_dev *pdev);
 
 /* Add/remove a register handler. Must be called holding the vpci_lock. */
 int __must_check vpci_add_register(struct vpci *vpci,
@@ -71,6 +72,7 @@ struct vpci {
             /* Guest view of the BAR: address and lower bits. */
             uint64_t guest_reg;
             uint64_t size;
+            struct rangeset *mem;
             enum {
                 VPCI_BAR_EMPTY,
                 VPCI_BAR_IO,
@@ -143,9 +145,9 @@ struct vpci {
 
 struct vpci_vcpu {
     /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
-    struct rangeset *mem;
     struct pci_dev *pdev;
     uint16_t cmd;
+    bool map_pending : 1;
     bool rom_only : 1;
 };
 
-- 
2.25.1




* [PATCH v6 08/13] vpci/header: program p2m with guest BAR view
  2022-02-04  6:34 [PATCH v6 00/13] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (6 preceding siblings ...)
  2022-02-04  6:34 ` [PATCH v6 07/13] vpci/header: handle p2m range sets per BAR Oleksandr Andrushchenko
@ 2022-02-04  6:34 ` Oleksandr Andrushchenko
  2022-02-04  6:34 ` [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests Oleksandr Andrushchenko
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  6:34 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Take into account guest's BAR view and program its p2m accordingly:
gfn is guest's view of the BAR and mfn is the physical BAR value as set
up by the PCI bus driver in the hardware domain.
This way hardware domain sees physical BAR values and guest sees
emulated ones.
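
For illustration only, the gfn calculation described above can be sketched
outside of Xen as below. This is a minimal sketch, not the code from this
patch: struct bar_view and bar_gfn_for_mfn() are hypothetical names, and a
4K page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12                      /* assumption: 4K pages */
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

/* Hypothetical stand-in for the two BAR fields the calculation needs. */
struct bar_view {
    uint64_t addr;      /* physical BAR address (host view) */
    uint64_t guest_reg; /* BAR address as programmed by the guest */
};

/*
 * Return the guest frame that should map machine frame 'mfn' of the BAR.
 * The hardware domain is identity mapped; other guests use their own view.
 */
static uint64_t bar_gfn_for_mfn(const struct bar_view *bar, int is_hwdom,
                                uint64_t mfn)
{
    uint64_t start_mfn = PFN_DOWN(bar->addr);
    uint64_t start_gfn = PFN_DOWN(is_hwdom ? bar->addr : bar->guest_reg);

    /* A range never starts before the first frame of the BAR. */
    assert(mfn >= start_mfn);

    /* Keep the offset from the BAR start, as ranges may have holes. */
    return start_gfn + (mfn - start_mfn);
}
```

With a physical BAR at 0xe0000000 that the guest placed at 0x23000000,
machine frame 0xe0002 is mapped at guest frame 0x23002.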

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v5:
- remove debug print in map_range callback
- remove "identity" from the debug print
Since v4:
- moved start_{gfn|mfn} calculation into map_range
- pass vpci_bar in the map_data instead of start_{gfn|mfn}
- s/guest_addr/guest_reg
Since v3:
- updated comment (Roger)
- removed gfn_add(map->start_gfn, rc); which is wrong
- use v->domain instead of v->vpci.pdev->domain
- removed odd e.g. in comment
- s/d%d/%pd in altered code
- use gdprintk for map/unmap logs
Since v2:
- improve readability for data.start_gfn and restructure ?: construct
Since v1:
 - s/MSI/MSI-X in comments
---
 xen/drivers/vpci/header.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 0c94504b87d8..88ca1ad8211d 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -30,6 +30,7 @@
 
 struct map_data {
     struct domain *d;
+    const struct vpci_bar *bar;
     bool map;
 };
 
@@ -41,8 +42,21 @@ static int map_range(unsigned long s, unsigned long e, void *data,
 
     for ( ; ; )
     {
+        /* Start address of the BAR as seen by the guest. */
+        gfn_t start_gfn = _gfn(PFN_DOWN(is_hardware_domain(map->d)
+                                        ? map->bar->addr
+                                        : map->bar->guest_reg));
+        /* Physical start address of the BAR. */
+        mfn_t start_mfn = _mfn(PFN_DOWN(map->bar->addr));
         unsigned long size = e - s + 1;
 
+        /*
+         * Ranges to be mapped don't always start at the BAR start address, as
+         * there can be holes or partially consumed ranges. Account for the
+         * offset of the current address from the BAR start.
+         */
+        start_gfn = gfn_add(start_gfn, s - mfn_x(start_mfn));
+
         /*
          * ARM TODOs:
          * - On ARM whether the memory is prefetchable or not should be passed
@@ -52,8 +66,8 @@ static int map_range(unsigned long s, unsigned long e, void *data,
          * - {un}map_mmio_regions doesn't support preemption.
          */
 
-        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
-                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
+        rc = map->map ? map_mmio_regions(map->d, start_gfn, size, _mfn(s))
+                      : unmap_mmio_regions(map->d, start_gfn, size, _mfn(s));
         if ( rc == 0 )
         {
             *c += size;
@@ -62,8 +76,8 @@ static int map_range(unsigned long s, unsigned long e, void *data,
         if ( rc < 0 )
         {
             printk(XENLOG_G_WARNING
-                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
-                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
+                   "Failed to %smap [%lx, %lx] for %pd: %d\n",
+                   map->map ? "" : "un", s, e, map->d, rc);
             break;
         }
         ASSERT(rc < size);
@@ -154,6 +168,7 @@ bool vpci_process_pending(struct vcpu *v)
             if ( rangeset_is_empty(bar->mem) )
                 continue;
 
+            data.bar = bar;
             rc = rangeset_consume_ranges(bar->mem, map_range, &data);
 
             if ( rc == -ERESTART )
@@ -206,6 +221,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
         if ( rangeset_is_empty(bar->mem) )
             continue;
 
+        data.bar = bar;
         while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
                                               &data)) == -ERESTART )
             process_pending_softirqs();
-- 
2.25.1




* [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2022-02-04  6:34 [PATCH v6 00/13] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (7 preceding siblings ...)
  2022-02-04  6:34 ` [PATCH v6 08/13] vpci/header: program p2m with guest BAR view Oleksandr Andrushchenko
@ 2022-02-04  6:34 ` Oleksandr Andrushchenko
  2022-02-04 14:25   ` Jan Beulich
  2022-02-04  6:34 ` [PATCH v6 10/13] vpci/header: reset the command register when adding devices Oleksandr Andrushchenko
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  6:34 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Add basic emulation of the PCI_COMMAND register for guests. At the moment
only the PCI_COMMAND_INTX_DISABLE bit is emulated; the remaining bits are
left as a TODO.
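
As a rough standalone sketch of the rule (not the code from this patch):
while MSI or MSI-X is enabled, any command value written by the guest gets
the INTx disable bit forced on. The function name emulate_cmd() and its
boolean parameters are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x400 /* bit 10 of the command register */

/* Return the value to write to hardware for a guest command write. */
static uint16_t emulate_cmd(uint16_t cmd, bool msi_enabled, bool msix_enabled)
{
    /* INTx must stay disabled while message signalled interrupts are on. */
    if ( msi_enabled || msix_enabled )
        cmd |= PCI_COMMAND_INTX_DISABLE;

    return cmd;
}
```

All other bits are passed through unchanged in this sketch, which matches
the TODO above.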

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v5:
- add additional check for MSI-X enabled while altering INTX bit
- make sure INTx disabled while guests enable MSI/MSI-X
Since v3:
- gate more code on CONFIG_HAS_MSI
- removed logic for the case when MSI/MSI-X not enabled
---
 xen/drivers/vpci/header.c | 21 +++++++++++++++++++--
 xen/drivers/vpci/msi.c    |  4 ++++
 xen/drivers/vpci/msix.c   |  4 ++++
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 88ca1ad8211d..33d8c15ae6e8 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -454,6 +454,22 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
         pci_conf_write16(pdev->sbdf, reg, cmd);
 }
 
+static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
+                            uint32_t cmd, void *data)
+{
+    /* TODO: Add proper emulation for all bits of the command register. */
+
+#ifdef CONFIG_HAS_PCI_MSI
+    if ( pdev->vpci->msi->enabled || pdev->vpci->msix->enabled )
+    {
+        /* Guest wants to enable INTx. It can't be enabled if MSI/MSI-X enabled. */
+        cmd |= PCI_COMMAND_INTX_DISABLE;
+    }
+#endif
+
+    cmd_write(pdev, reg, cmd, data);
+}
+
 static void bar_write(const struct pci_dev *pdev, unsigned int reg,
                       uint32_t val, void *data)
 {
@@ -661,8 +677,9 @@ static int init_bars(struct pci_dev *pdev)
     }
 
     /* Setup a handler for the command register. */
-    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
-                           2, header);
+    rc = vpci_add_register(pdev->vpci, vpci_hw_read16,
+                           is_hwdom ? cmd_write : guest_cmd_write,
+                           PCI_COMMAND, 2, header);
     if ( rc )
         return rc;
 
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index e3ce46869dad..90465dcb4831 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -70,6 +70,10 @@ static void control_write(const struct pci_dev *pdev, unsigned int reg,
 
         if ( vpci_msi_arch_enable(msi, pdev, vectors) )
             return;
+
+        /* Make sure guest doesn't enable INTx while enabling MSI. */
+        if ( !is_hardware_domain(pdev->domain) )
+            pci_intx(pdev, false);
     }
     else
         vpci_msi_arch_disable(msi, pdev);
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index d1dbfc6e0ffd..4c0e1836b589 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -92,6 +92,10 @@ static void control_write(const struct pci_dev *pdev, unsigned int reg,
         for ( i = 0; i < msix->max_entries; i++ )
             if ( !msix->entries[i].masked && msix->entries[i].updated )
                 update_entry(&msix->entries[i], pdev, i);
+
+        /* Make sure guest doesn't enable INTx while enabling MSI-X. */
+        if ( !is_hardware_domain(pdev->domain) )
+            pci_intx(pdev, false);
     }
     else if ( !new_enabled && msix->enabled )
     {
-- 
2.25.1




* [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-04  6:34 [PATCH v6 00/13] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (8 preceding siblings ...)
  2022-02-04  6:34 ` [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests Oleksandr Andrushchenko
@ 2022-02-04  6:34 ` Oleksandr Andrushchenko
  2022-02-04 14:30   ` Jan Beulich
  2022-02-04  6:34 ` [PATCH v6 11/13] vpci: add initial support for virtual PCI bus topology Oleksandr Andrushchenko
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  6:34 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Reset the command register when assigning a PCI device to a guest:
according to the PCI spec the PCI_COMMAND register is typically all 0's
after reset.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v5:
- updated commit message
Since v1:
 - do not write 0 to the command register, but respect host settings.
---
 xen/drivers/vpci/header.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 33d8c15ae6e8..407fa2fc4749 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -454,8 +454,7 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
         pci_conf_write16(pdev->sbdf, reg, cmd);
 }
 
-static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
-                            uint32_t cmd, void *data)
+static uint32_t emulate_cmd_reg(const struct pci_dev *pdev, uint32_t cmd)
 {
     /* TODO: Add proper emulation for all bits of the command register. */
 
@@ -467,7 +466,13 @@ static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
     }
 #endif
 
-    cmd_write(pdev, reg, cmd, data);
+    return cmd;
+}
+
+static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
+                            uint32_t cmd, void *data)
+{
+    cmd_write(pdev, reg, emulate_cmd_reg(pdev, cmd), data);
 }
 
 static void bar_write(const struct pci_dev *pdev, unsigned int reg,
@@ -676,6 +681,10 @@ static int init_bars(struct pci_dev *pdev)
         return -EOPNOTSUPP;
     }
 
+    /* Reset the command register for the guest. */
+    if ( !is_hwdom )
+        pci_conf_write16(pdev->sbdf, PCI_COMMAND, emulate_cmd_reg(pdev, 0));
+
     /* Setup a handler for the command register. */
     rc = vpci_add_register(pdev->vpci, vpci_hw_read16,
                            is_hwdom ? cmd_write : guest_cmd_write,
-- 
2.25.1




* [PATCH v6 11/13] vpci: add initial support for virtual PCI bus topology
  2022-02-04  6:34 [PATCH v6 00/13] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (9 preceding siblings ...)
  2022-02-04  6:34 ` [PATCH v6 10/13] vpci/header: reset the command register when adding devices Oleksandr Andrushchenko
@ 2022-02-04  6:34 ` Oleksandr Andrushchenko
  2022-02-04  6:34 ` [PATCH v6 12/13] xen/arm: translate virtual PCI bus topology for guests Oleksandr Andrushchenko
  2022-02-04  6:34 ` [PATCH v6 13/13] xen/arm: account IO handlers for emulated PCI MSI-X Oleksandr Andrushchenko
  12 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  6:34 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Assign a guest SBDF on bus 0 to each PCI device being passed through.
In the resulting topology the PCIe devices reside on bus 0 of the
root complex itself (embedded endpoints).
This implementation is limited to the 32 devices which are allowed on
a single PCI bus.

Please note that at the moment only function 0 of a multifunction
device can be passed through.
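
The slot allocation can be pictured with the standalone sketch below. It is
only an approximation under stated assumptions: a plain uint32_t stands in
for the per-domain bitmap, and alloc_virtual_devfn()/free_virtual_devfn()
are hypothetical names, not functions added by this patch.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define MAX_VIRT_DEV          32 /* slots available on a single PCI bus */

/* Pick the first free slot; return its devfn, or -1 if the bus is full. */
static int alloc_virtual_devfn(uint32_t *assigned_map)
{
    for ( unsigned int slot = 0; slot < MAX_VIRT_DEV; slot++ )
    {
        if ( !(*assigned_map & (UINT32_C(1) << slot)) )
        {
            *assigned_map |= UINT32_C(1) << slot;
            /* Only function 0 is handed out for now. */
            return PCI_DEVFN(slot, 0);
        }
    }

    return -1;
}

/* Release the slot encoded in 'devfn' so that it can be reused. */
static void free_virtual_devfn(uint32_t *assigned_map, unsigned int devfn)
{
    assert(devfn <= 0xff);
    *assigned_map &= ~(UINT32_C(1) << (devfn >> 3));
}
```

The first two devices get devfn 0x00 and 0x08 (slots 0 and 1), and a
released slot is handed out again on the next assignment.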

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v5:
- s/vpci_add_virtual_device/add_virtual_device and make it static
- call add_virtual_device from vpci_assign_device and do not use
  REGISTER_VPCI_INIT machinery
- add pcidevs_locked ASSERT
- use DECLARE_BITMAP for vpci_dev_assigned_map
Since v4:
- moved and re-worked guest sbdf initializers
- s/set_bit/__set_bit
- s/clear_bit/__clear_bit
- minor comment fix s/Virtual/Guest/
- added VPCI_MAX_VIRT_DEV constant (PCI_SLOT(~0) + 1) which will be used
  later for counting the number of MMIO handlers required for a guest
  (Julien)
Since v3:
 - make use of VPCI_INIT
 - moved all new code to vpci.c which belongs to it
 - changed open-coded 31 to PCI_SLOT(~0)
 - added comments and code to reject multifunction devices with
   functions other than 0
 - updated comment about vpci_dev_next and made it unsigned int
 - implement roll back in case of error while assigning/deassigning devices
 - s/dom%pd/%pd
Since v2:
 - remove casts that are (a) malformed and (b) unnecessary
 - add new line for better readability
 - remove CONFIG_HAS_VPCI_GUEST_SUPPORT ifdef's as the relevant vPCI
    functions are now completely gated with this config
 - gate common code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/drivers/vpci/vpci.c | 74 +++++++++++++++++++++++++++++++++++++++--
 xen/include/xen/sched.h |  8 +++++
 xen/include/xen/vpci.h  | 11 ++++++
 3 files changed, 91 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 3177f13c1c22..7d422d11f83d 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -89,6 +89,9 @@ int vpci_add_handlers(struct pci_dev *pdev)
         return -ENOMEM;
 
     INIT_LIST_HEAD(&vpci->handlers);
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    vpci->guest_sbdf.sbdf = ~0;
+#endif
 
     spin_lock(&pdev->vpci_lock);
     pdev->vpci = vpci;
@@ -114,6 +117,57 @@ int vpci_add_handlers(struct pci_dev *pdev)
 }
 
 #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+static int add_virtual_device(struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    pci_sbdf_t sbdf = { 0 };
+    unsigned long new_dev_number;
+
+    if ( is_hardware_domain(d) )
+        return 0;
+
+    ASSERT(pcidevs_locked());
+
+    /*
+     * Each PCI bus supports 32 devices/slots at max or up to 256 when
+     * there are multi-function ones which are not yet supported.
+     */
+    if ( pdev->info.is_extfn )
+    {
+        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
+                 &pdev->sbdf);
+        return -EOPNOTSUPP;
+    }
+
+    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
+                                         VPCI_MAX_VIRT_DEV);
+    if ( new_dev_number >= VPCI_MAX_VIRT_DEV )
+        return -ENOSPC;
+
+    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
+
+    /*
+     * Both segment and bus number are 0:
+     *  - we emulate a single host bridge for the guest, e.g. segment 0
+     *  - with bus 0 the virtual devices are seen as embedded
+     *    endpoints behind the root complex
+     *
+     * TODO: add support for multi-function devices.
+     */
+    sbdf.devfn = PCI_DEVFN(new_dev_number, 0);
+    pdev->vpci->guest_sbdf = sbdf;
+
+    return 0;
+
+}
+
+static void vpci_remove_virtual_device(struct domain *d,
+                                       const struct pci_dev *pdev)
+{
+    __clear_bit(pdev->vpci->guest_sbdf.dev, &d->vpci_dev_assigned_map);
+    pdev->vpci->guest_sbdf.sbdf = ~0;
+}
+
 /* Notify vPCI that device is assigned to guest. */
 int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
 {
@@ -124,8 +178,16 @@ int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
 
     rc = vpci_add_handlers(pdev);
     if ( rc )
-        vpci_deassign_device(d, pdev);
+        goto fail;
+
+    rc = add_virtual_device(pdev);
+    if ( rc )
+        goto fail;
 
+    return 0;
+
+ fail:
+    vpci_deassign_device(d, pdev);
     return rc;
 }
 
@@ -135,7 +197,15 @@ void vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
     if ( !has_vpci(d) )
         return;
 
-    vpci_remove_device(pdev);
+    spin_lock(&pdev->vpci_lock);
+    if ( !pdev->vpci )
+        goto done;
+
+    vpci_remove_virtual_device(d, pdev);
+    vpci_remove_device_locked(pdev);
+
+ done:
+    spin_unlock(&pdev->vpci_lock);
 }
 #endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 37f78cc4c4c9..3c25e265eaa8 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -444,6 +444,14 @@ struct domain
 
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /*
+     * The bitmap which shows which device numbers are already used by the
+     * virtual PCI bus topology and is used to assign a unique SBDF to the
+     * next passed through virtual PCI device.
+     */
+    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
+#endif
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index f1f49db959c7..1f04d34a2369 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -21,6 +21,13 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
 
 #define VPCI_ECAM_BDF(addr)     (((addr) & 0x0ffff000) >> 12)
 
+/*
+ * Maximum number of devices supported by the virtual bus topology:
+ * each PCI bus supports 32 devices/slots at max or up to 256 when
+ * there are multi-function ones which are not yet supported.
+ */
+#define VPCI_MAX_VIRT_DEV       (PCI_SLOT(~0) + 1)
+
 #define REGISTER_VPCI_INIT(x, p)                \
   static vpci_register_init_t *const x##_entry  \
                __used_section(".data.vpci." p) = x
@@ -140,6 +147,10 @@ struct vpci {
             struct vpci_arch_msix_entry arch;
         } entries[];
     } *msix;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /* Guest SBDF of the device. */
+    pci_sbdf_t guest_sbdf;
+#endif
 #endif
 };
 
-- 
2.25.1




* [PATCH v6 12/13] xen/arm: translate virtual PCI bus topology for guests
  2022-02-04  6:34 [PATCH v6 00/13] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (10 preceding siblings ...)
  2022-02-04  6:34 ` [PATCH v6 11/13] vpci: add initial support for virtual PCI bus topology Oleksandr Andrushchenko
@ 2022-02-04  6:34 ` Oleksandr Andrushchenko
  2022-02-04  7:56   ` Jan Beulich
  2022-02-04  6:34 ` [PATCH v6 13/13] xen/arm: account IO handlers for emulated PCI MSI-X Oleksandr Andrushchenko
  12 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  6:34 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

There are three originators for the PCI configuration space access:
1. The domain that owns the physical host bridge: MMIO handlers are
there so we can update vPCI register handlers with the values
written by the hardware domain, i.e. the physical view of the registers
vs the guest's view of the configuration space.
2. Guest access to the passed through PCI devices: we need to properly
map the virtual bus topology to the physical one, i.e. pass the
configuration space access to the corresponding physical devices.
3. Emulated host PCI bridge access. It doesn't exist in the physical
topology, i.e. it can't be mapped to some physical host bridge.
So, all access to the host bridge itself needs to be trapped and
emulated.
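
For case 2, the translation boils down to a lookup from the guest SBDF to
the physical one. The sketch below shows the idea on a plain array; struct
passthrough_dev and translate_virtual_sbdf() are hypothetical, and locking
and the per-domain device list are left out on purpose.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one passed through device. */
struct passthrough_dev {
    uint32_t phys_sbdf;  /* SBDF of the physical device */
    uint32_t guest_sbdf; /* SBDF the guest sees */
};

/*
 * Replace the guest SBDF in '*sbdf' with the physical one.
 * Returns false, leaving '*sbdf' untouched, when no device matches.
 */
static bool translate_virtual_sbdf(const struct passthrough_dev *devs,
                                   size_t n, uint32_t *sbdf)
{
    for ( size_t i = 0; i < n; i++ )
    {
        if ( devs[i].guest_sbdf == *sbdf )
        {
            *sbdf = devs[i].phys_sbdf;
            return true;
        }
    }

    return false;
}
```

On a failed lookup a read handler would typically return all ones to the
guest and a write would simply be dropped.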

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v5:
- add vpci_translate_virtual_device for #ifndef CONFIG_HAS_VPCI_GUEST_SUPPORT
  case to simplify ifdefery
- add ASSERT(!is_hardware_domain(d)); to vpci_translate_virtual_device
- reset output register on failed virtual SBDF translation
Since v4:
- indentation fixes
- constify struct domain
- updated commit message
- updates to the new locking scheme (pdev->vpci_lock)
Since v3:
- revisit locking
- move code to vpci.c
Since v2:
 - pass struct domain instead of struct vcpu
 - constify arguments where possible
 - gate relevant code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/arch/arm/vpci.c     | 17 +++++++++++++++++
 xen/drivers/vpci/vpci.c | 29 +++++++++++++++++++++++++++++
 xen/include/xen/vpci.h  |  7 +++++++
 3 files changed, 53 insertions(+)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index a9fc5817f94e..84b2b068a0fe 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -41,6 +41,16 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
     /* data is needed to prevent a pointer cast on 32bit */
     unsigned long data;
 
+    /*
+     * For the passed through devices we need to map their virtual SBDF
+     * to the physical PCI device being passed through.
+     */
+    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
+    {
+        *r = ~0ul;
+        return 1;
+    }
+
     if ( vpci_ecam_read(sbdf, ECAM_REG_OFFSET(info->gpa),
                         1U << info->dabt.size, &data) )
     {
@@ -59,6 +69,13 @@ static int vpci_mmio_write(struct vcpu *v, mmio_info_t *info,
     struct pci_host_bridge *bridge = p;
     pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
 
+    /*
+     * For the passed through devices we need to map their virtual SBDF
+     * to the physical PCI device being passed through.
+     */
+    if ( !bridge && !vpci_translate_virtual_device(v->domain, &sbdf) )
+        return 1;
+
     return vpci_ecam_write(sbdf, ECAM_REG_OFFSET(info->gpa),
                            1U << info->dabt.size, r);
 }
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 7d422d11f83d..070db7391391 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -168,6 +168,35 @@ static void vpci_remove_virtual_device(struct domain *d,
     pdev->vpci->guest_sbdf.sbdf = ~0;
 }
 
+/*
+ * Find the physical device which is mapped to the virtual device
+ * and translate virtual SBDF to the physical one.
+ */
+bool vpci_translate_virtual_device(const struct domain *d, pci_sbdf_t *sbdf)
+{
+    struct pci_dev *pdev;
+
+    ASSERT(!is_hardware_domain(d));
+
+    for_each_pdev( d, pdev )
+    {
+        bool found;
+
+        spin_lock(&pdev->vpci_lock);
+        found = pdev->vpci && (pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf);
+        spin_unlock(&pdev->vpci_lock);
+
+        if ( found )
+        {
+            /* Replace guest SBDF with the physical one. */
+            *sbdf = pdev->sbdf;
+            return true;
+        }
+    }
+
+    return false;
+}
+
 /* Notify vPCI that device is assigned to guest. */
 int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
 {
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 1f04d34a2369..f6eb9f2051af 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -271,6 +271,7 @@ static inline bool __must_check vpci_process_pending(struct vcpu *v)
 /* Notify vPCI that device is assigned/de-assigned to/from guest. */
 int vpci_assign_device(struct domain *d, struct pci_dev *pdev);
 void vpci_deassign_device(struct domain *d, struct pci_dev *pdev);
+bool vpci_translate_virtual_device(const struct domain *d, pci_sbdf_t *sbdf);
 #else
 static inline int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
 {
@@ -280,6 +281,12 @@ static inline int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
 static inline void vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
 {
 };
+
+static inline bool vpci_translate_virtual_device(const struct domain *d,
+                                                 pci_sbdf_t *sbdf)
+{
+    return false;
+}
 #endif
 
 #endif
-- 
2.25.1




* [PATCH v6 13/13] xen/arm: account IO handlers for emulated PCI MSI-X
  2022-02-04  6:34 [PATCH v6 00/13] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (11 preceding siblings ...)
  2022-02-04  6:34 ` [PATCH v6 12/13] xen/arm: translate virtual PCI bus topology for guests Oleksandr Andrushchenko
@ 2022-02-04  6:34 ` Oleksandr Andrushchenko
  2022-02-11 15:28   ` Julien Grall
  12 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  6:34 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

At the moment, we always allocate an extra 16 slots for IO handlers
(see MAX_IO_HANDLER). So when adding IO trap handlers for the emulated
MSI-X registers we need to explicitly report the additional IO
handlers, so that they are accounted for.
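
The resulting count for a guest can be sketched as follows. This is a
simplified illustration: guest_mmio_handlers() is a hypothetical name, the
has_msi flag stands in for the Kconfig check, and PCI_SLOT is given an
unsigned argument here to keep the arithmetic well defined.

```c
#include <stdbool.h>

#define PCI_SLOT(devfn)   (((devfn) >> 3) & 0x1f)
#define VPCI_MAX_VIRT_DEV (PCI_SLOT(~0u) + 1) /* evaluates to 32 */

/* Number of MMIO handlers a guest with vPCI may need. */
static unsigned int guest_mmio_handlers(bool has_msi)
{
    /* One region covers the config space of the single host bridge. */
    unsigned int count = 1;

    /* One MSI-X handler (table and PBA) per possible virtual device. */
    if ( has_msi )
        count += VPCI_MAX_VIRT_DEV;

    return count;
}
```

That gives 33 handlers with MSI support and 1 without.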

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
Cc: Julien Grall <julien@xen.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
---
This actually moved here from the part 2 of the prep work for PCI
passthrough on Arm as it seems to be the proper place for it.

Since v5:
- optimize with IS_ENABLED(CONFIG_HAS_PCI_MSI) since VPCI_MAX_VIRT_DEV is
  defined unconditionally
New in v5
---
 xen/arch/arm/vpci.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 84b2b068a0fe..c5902cb9d34d 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -131,6 +131,8 @@ static int vpci_get_num_handlers_cb(struct domain *d,
 
 unsigned int domain_vpci_get_num_mmio_handlers(struct domain *d)
 {
+    unsigned int count;
+
     if ( !has_vpci(d) )
         return 0;
 
@@ -151,7 +153,17 @@ unsigned int domain_vpci_get_num_mmio_handlers(struct domain *d)
      * For guests each host bridge requires one region to cover the
      * configuration space. At the moment, we only expose a single host bridge.
      */
-    return 1;
+    count = 1;
+
+    /*
+     * There's a single MSI-X MMIO handler that deals with both PBA
+     * and MSI-X tables per each PCI device being passed through.
+     * Maximum number of emulated virtual devices is VPCI_MAX_VIRT_DEV.
+     */
+    if ( IS_ENABLED(CONFIG_HAS_PCI_MSI) )
+        count += VPCI_MAX_VIRT_DEV;
+
+    return count;
 }
 
 /*
-- 
2.25.1




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04  6:34 ` [PATCH v6 03/13] vpci: move lock outside of struct vpci Oleksandr Andrushchenko
@ 2022-02-04  7:52   ` Jan Beulich
  2022-02-04  8:13     ` Oleksandr Andrushchenko
  2022-02-04  8:58     ` Oleksandr Andrushchenko
  0 siblings, 2 replies; 138+ messages in thread
From: Jan Beulich @ 2022-02-04  7:52 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, roger.pau
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>                  continue;
>          }
>  
> +        spin_lock(&tmp->vpci_lock);
> +        if ( !tmp->vpci )
> +        {
> +            spin_unlock(&tmp->vpci_lock);
> +            continue;
> +        }
>          for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>          {
>              const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>              rc = rangeset_remove_range(mem, start, end);
>              if ( rc )
>              {
> +                spin_unlock(&tmp->vpci_lock);
>                  printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>                         start, end, rc);
>                  rangeset_destroy(mem);
>                  return rc;
>              }
>          }
> +        spin_unlock(&tmp->vpci_lock);
>      }

At the first glance this simply looks like another unjustified (in the
description) change, as you're not converting anything here but you
actually add locking (and I realize this was there before, so I'm sorry
for not pointing this out earlier). But then I wonder whether you
actually tested this, since I can't help getting the impression that
you're introducing a live-lock: The function is called from cmd_write()
and rom_write(), which in turn are called out of vpci_write(). Yet that
function already holds the lock, and the lock is not (currently)
recursive. (For the 3rd caller of the function - init_bars() - otoh
the locking looks to be entirely unnecessary.)

Then again this was present already even in Roger's original patch, so
I guess I must be missing something ...
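
To make the call chain above concrete, here is a minimal single-threaded
sketch, assuming a toy non-recursive lock. The names toy_lock,
toy_vpci_write() and toy_modify_bars() are hypothetical stand-ins, not
Xen code; a trylock is used so the nested acquisition reports failure
where a real spin_lock() would simply never return.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded model of a non-recursive lock; not Xen's spinlock_t. */
struct toy_lock {
    bool held;
};

/* Returns true if the lock was free and is now taken. */
static bool toy_trylock(struct toy_lock *l)
{
    if ( l->held )
        return false;

    l->held = true;
    return true;
}

static void toy_unlock(struct toy_lock *l)
{
    assert(l->held);
    l->held = false;
}

struct toy_dev {
    struct toy_lock vpci_lock;
};

/* Stand-in for modify_bars(): takes the lock of the device it inspects. */
static bool toy_modify_bars(struct toy_dev *tmp)
{
    if ( !toy_trylock(&tmp->vpci_lock) )
        return false; /* a real spin_lock() would spin here forever */

    toy_unlock(&tmp->vpci_lock);
    return true;
}

/* Stand-in for vpci_write() -> cmd_write(): lock held across the call. */
static bool toy_vpci_write(struct toy_dev *pdev)
{
    bool ok;

    if ( !toy_trylock(&pdev->vpci_lock) )
        return false;

    ok = toy_modify_bars(pdev); /* tmp == pdev: second acquisition */
    toy_unlock(&pdev->vpci_lock);

    return ok;
}
```

Called directly (the init_bars()-like case, no lock held),
toy_modify_bars() succeeds; called through toy_vpci_write() on the same
device, it fails at the nested acquisition.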

> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -138,7 +138,7 @@ static void control_write(const struct pci_dev *pdev, unsigned int reg,
>          pci_conf_write16(pdev->sbdf, reg, val);
>  }
>  
> -static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
> +static struct vpci_msix *msix_get(const struct domain *d, unsigned long addr)
>  {
>      struct vpci_msix *msix;
>  
> @@ -150,15 +150,29 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>          for ( i = 0; i < ARRAY_SIZE(msix->tables); i++ )
>              if ( bars[msix->tables[i] & PCI_MSIX_BIRMASK].enabled &&
>                   VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, i) )
> +            {
> +                spin_lock(&msix->pdev->vpci_lock);
>                  return msix;
> +            }

I think deliberately returning with a lock held requires a respective
comment ahead of the function.

>      }
>  
>      return NULL;
>  }
>  
> +static void msix_put(struct vpci_msix *msix)
> +{
> +    if ( !msix )
> +        return;
> +
> +    spin_unlock(&msix->pdev->vpci_lock);
> +}

Maybe shorter

    if ( msix )
        spin_unlock(&msix->pdev->vpci_lock);

? Yet there's only one case where you may pass NULL in here, so
maybe it's better anyway to move the conditional ...

>  static int msix_accept(struct vcpu *v, unsigned long addr)
>  {
> -    return !!msix_find(v->domain, addr);
> +    struct vpci_msix *msix = msix_get(v->domain, addr);
> +
> +    msix_put(msix);
> +    return !!msix;
>  }

... here?
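
A minimal sketch of what moving the conditional could look like,
assuming toy stand-ins (toy_msix, toy_msix_get(), toy_msix_put() and
toy_msix_accept() are hypothetical and only model the lock as a flag):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in; the real struct vpci_msix lives in Xen's vPCI code. */
struct toy_msix {
    bool locked; /* models pdev->vpci_lock being held */
};

/* Models msix_get(): returns the entry with its lock held, or NULL. */
static struct toy_msix *toy_msix_get(struct toy_msix *candidate, bool match)
{
    if ( !match )
        return NULL;

    assert(!candidate->locked);
    candidate->locked = true;

    return candidate;
}

/* Models msix_put() once the NULL check has moved to the caller. */
static void toy_msix_put(struct toy_msix *msix)
{
    assert(msix && msix->locked);
    msix->locked = false;
}

/* Models msix_accept(): the only caller that may see NULL. */
static int toy_msix_accept(struct toy_msix *candidate, bool match)
{
    struct toy_msix *msix = toy_msix_get(candidate, match);

    if ( !msix )
        return 0;

    toy_msix_put(msix);
    return 1;
}
```

With this shape the put side can assume a non-NULL argument, and the
one caller that can legitimately get NULL handles it locally.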

> @@ -186,7 +200,7 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
>                       unsigned long *data)
>  {
>      const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct vpci_msix *msix = msix_get(d, addr);
>      const struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
> @@ -196,7 +210,10 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
>          return X86EMUL_RETRY;
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        msix_put(msix);
>          return X86EMUL_OKAY;
> +    }
>  
>      if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
>      {
> @@ -222,10 +239,10 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
>              break;
>          }
>  
> +        msix_put(msix);
>          return X86EMUL_OKAY;
>      }
>  
> -    spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
>      offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);

You're increasing the locked region quite a bit here. If this is really
needed, it wants explaining. And if this is deemed acceptable as a
"side effect", it wants justifying or at least stating imo. Same for
msix_write() then, obviously. (I'm not sure Roger actually implied this
when suggesting to switch to the get/put pair.)

> @@ -327,7 +334,12 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>      if ( !pdev )
>          return vpci_read_hw(sbdf, reg, size);
>  
> -    spin_lock(&pdev->vpci->lock);
> +    spin_lock(&pdev->vpci_lock);
> +    if ( !pdev->vpci )
> +    {
> +        spin_unlock(&pdev->vpci_lock);
> +        return vpci_read_hw(sbdf, reg, size);
> +    }

Didn't you say you would add justification of this part of the change
(and its vpci_write() counterpart) to the description?

Jan




* Re: [PATCH v6 12/13] xen/arm: translate virtual PCI bus topology for guests
  2022-02-04  6:34 ` [PATCH v6 12/13] xen/arm: translate virtual PCI bus topology for guests Oleksandr Andrushchenko
@ 2022-02-04  7:56   ` Jan Beulich
  2022-02-04  8:18     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-04  7:56 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -168,6 +168,35 @@ static void vpci_remove_virtual_device(struct domain *d,
>      pdev->vpci->guest_sbdf.sbdf = ~0;
>  }
>  
> +/*
> + * Find the physical device which is mapped to the virtual device
> + * and translate virtual SBDF to the physical one.
> + */
> +bool vpci_translate_virtual_device(const struct domain *d, pci_sbdf_t *sbdf)
> +{
> +    struct pci_dev *pdev;
> +
> +    ASSERT(!is_hardware_domain(d));

In addition to this, don't you also need to assert that pcidevs_lock is
held (or if it isn't, you'd need to acquire it) for ...

> +    for_each_pdev( d, pdev )

... this to be race-free?

Jan




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04  7:52   ` Jan Beulich
@ 2022-02-04  8:13     ` Oleksandr Andrushchenko
  2022-02-04  8:36       ` Jan Beulich
  2022-02-04  8:58     ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  8:13 UTC (permalink / raw)
  To: Jan Beulich, roger.pau
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko

Hi, Jan!

On 04.02.22 09:52, Jan Beulich wrote:
> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>
> At the first glance this simply looks like another unjustified (in the
> description) change, as you're not converting anything here but you
> actually add locking (and I realize this was there before, so I'm sorry
> for not pointing this out earlier). But then I wonder whether you
> actually tested this
This is already stated in the cover letter that I have tested two x86
configurations and tested that on Arm.......
Would you like to see the relevant logs?

Thank you,
Oleksandr


* Re: [PATCH v6 12/13] xen/arm: translate virtual PCI bus topology for guests
  2022-02-04  7:56   ` Jan Beulich
@ 2022-02-04  8:18     ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  8:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko

Hi, Jan!

On 04.02.22 09:56, Jan Beulich wrote:
> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -168,6 +168,35 @@ static void vpci_remove_virtual_device(struct domain *d,
>>       pdev->vpci->guest_sbdf.sbdf = ~0;
>>   }
>>   
>> +/*
>> + * Find the physical device which is mapped to the virtual device
>> + * and translate virtual SBDF to the physical one.
>> + */
>> +bool vpci_translate_virtual_device(const struct domain *d, pci_sbdf_t *sbdf)
>> +{
>> +    struct pci_dev *pdev;
>> +
>> +    ASSERT(!is_hardware_domain(d));
> In addition to this, don't you also need to assert that pcidevs_lock is
> held (or if it isn't, you'd need to acquire it) for ...
>
>> +    for_each_pdev( d, pdev )
> ... this to be race-free?
Yes, you are right and this needs pcidevs_lock();
Will add
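
A rough sketch of the "acquire it" variant, assuming a toy model in
which the global lock is a plain flag and the device list is an array
(toy_pcidevs_lock(), toy_pcidevs_unlock(), toy_pdev and toy_translate()
are hypothetical names, not the Xen API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool toy_pcidevs_held; /* models the global PCI devices lock */

static void toy_pcidevs_lock(void)
{
    assert(!toy_pcidevs_held);
    toy_pcidevs_held = true;
}

static void toy_pcidevs_unlock(void)
{
    assert(toy_pcidevs_held);
    toy_pcidevs_held = false;
}

struct toy_pdev {
    uint32_t guest_sbdf; /* what the guest sees */
    uint32_t phys_sbdf;  /* what the hardware uses */
};

/* Models the translation: the list walk happens under the lock. */
static bool toy_translate(const struct toy_pdev *devs, size_t nr,
                          uint32_t *sbdf)
{
    bool found = false;
    size_t i;

    toy_pcidevs_lock();
    for ( i = 0; i < nr; i++ )
    {
        if ( devs[i].guest_sbdf == *sbdf )
        {
            *sbdf = devs[i].phys_sbdf; /* virtual -> physical */
            found = true;
            break;
        }
    }
    toy_pcidevs_unlock();

    return found;
}
```

On a miss the caller's SBDF is left untouched and the lock is released
on every path.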
>
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04  8:13     ` Oleksandr Andrushchenko
@ 2022-02-04  8:36       ` Jan Beulich
  0 siblings, 0 replies; 138+ messages in thread
From: Jan Beulich @ 2022-02-04  8:36 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau

On 04.02.2022 09:13, Oleksandr Andrushchenko wrote:
> On 04.02.22 09:52, Jan Beulich wrote:
>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>
>> At the first glance this simply looks like another unjustified (in the
>> description) change, as you're not converting anything here but you
>> actually add locking (and I realize this was there before, so I'm sorry
>> for not pointing this out earlier). But then I wonder whether you
>> actually tested this
> This is already stated in the cover letter that I have tested two x86
> configurations and tested that on Arm.......

Okay, I'm sorry then. But could you then please point out where I'm
wrong with my analysis?

Jan




* Re: [PATCH v6 01/13] xen/pci: arm: add stub for is_memory_hole
  2022-02-04  6:34 ` [PATCH v6 01/13] xen/pci: arm: add stub for is_memory_hole Oleksandr Andrushchenko
@ 2022-02-04  8:51   ` Julien Grall
  2022-02-04  9:01     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Julien Grall @ 2022-02-04  8:51 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, xen-devel
  Cc: sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

Hi,

On 04/02/2022 06:34, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Add a stub for is_memory_hole which is required for PCI passthrough
> on Arm.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> ---
> Cc: Julien Grall <julien@xen.org>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> ---
> New in v6
> ---
>   xen/arch/arm/mm.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
> index b1eae767c27c..c32e34a182a2 100644
> --- a/xen/arch/arm/mm.c
> +++ b/xen/arch/arm/mm.c
> @@ -1640,6 +1640,12 @@ unsigned long get_upper_mfn_bound(void)
>       return max_page - 1;
>   }
>   
> +bool is_memory_hole(mfn_t start, mfn_t end)
> +{
> +    /* TODO: this needs to be properly implemented. */

I was hoping to see a summary of the discussion from IRC somewhere in 
the patch (maybe after ---). This would help to bring up to speed the 
others that were not on IRC.

> +    return true;
> +}
> +
>   /*
>    * Local variables:
>    * mode: C

-- 
Julien Grall



* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04  7:52   ` Jan Beulich
  2022-02-04  8:13     ` Oleksandr Andrushchenko
@ 2022-02-04  8:58     ` Oleksandr Andrushchenko
  2022-02-04  9:15       ` Jan Beulich
  1 sibling, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  8:58 UTC (permalink / raw)
  To: Jan Beulich, roger.pau
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko

Hi, Jan!

On 04.02.22 09:52, Jan Beulich wrote:
> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>                   continue;
>>           }
>>   
>> +        spin_lock(&tmp->vpci_lock);
>> +        if ( !tmp->vpci )
>> +        {
>> +            spin_unlock(&tmp->vpci_lock);
>> +            continue;
>> +        }
>>           for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>           {
>>               const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>               rc = rangeset_remove_range(mem, start, end);
>>               if ( rc )
>>               {
>> +                spin_unlock(&tmp->vpci_lock);
>>                   printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>                          start, end, rc);
>>                   rangeset_destroy(mem);
>>                   return rc;
>>               }
>>           }
>> +        spin_unlock(&tmp->vpci_lock);
>>       }
> At the first glance this simply looks like another unjustified (in the
> description) change, as you're not converting anything here but you
> actually add locking (and I realize this was there before, so I'm sorry
> for not pointing this out earlier).
Well, I thought that the description already has "...the lock can be
used (and in a few cases is used right away) to check whether vpci
is present" and this is enough for such uses as here.
>   But then I wonder whether you
> actually tested this, since I can't help getting the impression that
> you're introducing a live-lock: The function is called from cmd_write()
> and rom_write(), which in turn are called out of vpci_write(). Yet that
> function already holds the lock, and the lock is not (currently)
> recursive. (For the 3rd caller of the function - init_bars() - otoh
> the locking looks to be entirely unnecessary.)
Well, you are correct: if tmp != pdev then it is correct to acquire
the lock. But if tmp == pdev and rom_only == true
then we'll deadlock.

It seems we need to have the locking conditional, e.g. only lock
if tmp != pdev
>
> Then again this was present already even in Roger's original patch, so
> I guess I must be missing something ...
>
>> --- a/xen/drivers/vpci/msix.c
>> +++ b/xen/drivers/vpci/msix.c
>> @@ -138,7 +138,7 @@ static void control_write(const struct pci_dev *pdev, unsigned int reg,
>>           pci_conf_write16(pdev->sbdf, reg, val);
>>   }
>>   
>> -static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>> +static struct vpci_msix *msix_get(const struct domain *d, unsigned long addr)
>>   {
>>       struct vpci_msix *msix;
>>   
>> @@ -150,15 +150,29 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>           for ( i = 0; i < ARRAY_SIZE(msix->tables); i++ )
>>               if ( bars[msix->tables[i] & PCI_MSIX_BIRMASK].enabled &&
>>                    VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, i) )
>> +            {
>> +                spin_lock(&msix->pdev->vpci_lock);
>>                   return msix;
>> +            }
> I think deliberately returning with a lock held requires a respective
> comment ahead of the function.
Ok, will add a comment
>
>>       }
>>   
>>       return NULL;
>>   }
>>   
>> +static void msix_put(struct vpci_msix *msix)
>> +{
>> +    if ( !msix )
>> +        return;
>> +
>> +    spin_unlock(&msix->pdev->vpci_lock);
>> +}
> Maybe shorter
>
>      if ( msix )
>          spin_unlock(&msix->pdev->vpci_lock);
Looks good
>
> ? Yet there's only one case where you may pass NULL in here, so
> maybe it's better anyway to move the conditional ...
>
>>   static int msix_accept(struct vcpu *v, unsigned long addr)
>>   {
>> -    return !!msix_find(v->domain, addr);
>> +    struct vpci_msix *msix = msix_get(v->domain, addr);
>> +
>> +    msix_put(msix);
>> +    return !!msix;
>>   }
> ... here?
Yes, I can have that check here, but what if there is yet
another caller of the same? I am not sure whether it is better
to have the check in msix_get or at the caller site.
At the moment (with a single place with NULL possible) I can
move the check. @Roger?
>
>> @@ -186,7 +200,7 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
>>                        unsigned long *data)
>>   {
>>       const struct domain *d = v->domain;
>> -    struct vpci_msix *msix = msix_find(d, addr);
>> +    struct vpci_msix *msix = msix_get(d, addr);
>>       const struct vpci_msix_entry *entry;
>>       unsigned int offset;
>>   
>> @@ -196,7 +210,10 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
>>           return X86EMUL_RETRY;
>>   
>>       if ( !access_allowed(msix->pdev, addr, len) )
>> +    {
>> +        msix_put(msix);
>>           return X86EMUL_OKAY;
>> +    }
>>   
>>       if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
>>       {
>> @@ -222,10 +239,10 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
>>               break;
>>           }
>>   
>> +        msix_put(msix);
>>           return X86EMUL_OKAY;
>>       }
>>   
>> -    spin_lock(&msix->pdev->vpci->lock);
>>       entry = get_entry(msix, addr);
>>       offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
> You're increasing the locked region quite a bit here. If this is really
> needed, it wants explaining. And if this is deemed acceptable as a
> "side effect", it wants justifying or at least stating imo. Same for
> msix_write() then, obviously.
Yes, I do increase the locking region here, but the msix variable needs
to be protected all the time, so it seems to be obvious that it remains
under the lock
>   (I'm not sure Roger actually implied this
> when suggesting to switch to the get/put pair.)
>
>> @@ -327,7 +334,12 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>       if ( !pdev )
>>           return vpci_read_hw(sbdf, reg, size);
>>   
>> -    spin_lock(&pdev->vpci->lock);
>> +    spin_lock(&pdev->vpci_lock);
>> +    if ( !pdev->vpci )
>> +    {
>> +        spin_unlock(&pdev->vpci_lock);
>> +        return vpci_read_hw(sbdf, reg, size);
>> +    }
> Didn't you say you would add justification of this part of the change
> (and its vpci_write() counterpart) to the description?
Again, I am referring to the commit message as described above
>
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 01/13] xen/pci: arm: add stub for is_memory_hole
  2022-02-04  8:51   ` Julien Grall
@ 2022-02-04  9:01     ` Oleksandr Andrushchenko
  2022-02-04  9:41       ` Julien Grall
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  9:01 UTC (permalink / raw)
  To: Julien Grall, xen-devel
  Cc: sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, Bertrand Marquis, Rahul Singh,
	Oleksandr Andrushchenko

Hi, Julien!

On 04.02.22 10:51, Julien Grall wrote:
> Hi,
>
> On 04/02/2022 06:34, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Add a stub for is_memory_hole which is required for PCI passthrough
>> on Arm.
>>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> ---
>> Cc: Julien Grall <julien@xen.org>
>> Cc: Stefano Stabellini <sstabellini@kernel.org>
>> ---
>> New in v6
>> ---
>>   xen/arch/arm/mm.c | 6 ++++++
>>   1 file changed, 6 insertions(+)
>>
>> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
>> index b1eae767c27c..c32e34a182a2 100644
>> --- a/xen/arch/arm/mm.c
>> +++ b/xen/arch/arm/mm.c
>> @@ -1640,6 +1640,12 @@ unsigned long get_upper_mfn_bound(void)
>>       return max_page - 1;
>>   }
>>   +bool is_memory_hole(mfn_t start, mfn_t end)
>> +{
>> +    /* TODO: this needs to be properly implemented. */
>
> I was hoping to see a summary of the discussion from IRC somewhere in the patch (maybe after ---). This would help to bring up to speed the others that were not on IRC.
I am not quite sure what needs to be put here as the summary
Could you please help me with the exact message you would like to see?
>
>> +    return true;
>> +}
>> +
>>   /*
>>    * Local variables:
>>    * mode: C
>
Thank you,
Oleksandr


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04  8:58     ` Oleksandr Andrushchenko
@ 2022-02-04  9:15       ` Jan Beulich
  2022-02-04 10:12         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-04  9:15 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau

On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
> On 04.02.22 09:52, Jan Beulich wrote:
>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>                   continue;
>>>           }
>>>   
>>> +        spin_lock(&tmp->vpci_lock);
>>> +        if ( !tmp->vpci )
>>> +        {
>>> +            spin_unlock(&tmp->vpci_lock);
>>> +            continue;
>>> +        }
>>>           for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>           {
>>>               const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>               rc = rangeset_remove_range(mem, start, end);
>>>               if ( rc )
>>>               {
>>> +                spin_unlock(&tmp->vpci_lock);
>>>                   printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>>                          start, end, rc);
>>>                   rangeset_destroy(mem);
>>>                   return rc;
>>>               }
>>>           }
>>> +        spin_unlock(&tmp->vpci_lock);
>>>       }
>> At the first glance this simply looks like another unjustified (in the
>> description) change, as you're not converting anything here but you
>> actually add locking (and I realize this was there before, so I'm sorry
>> for not pointing this out earlier).
> Well, I thought that the description already has "...the lock can be
> used (and in a few cases is used right away) to check whether vpci
> is present" and this is enough for such uses as here.
>>   But then I wonder whether you
>> actually tested this, since I can't help getting the impression that
>> you're introducing a live-lock: The function is called from cmd_write()
>> and rom_write(), which in turn are called out of vpci_write(). Yet that
>> function already holds the lock, and the lock is not (currently)
>> recursive. (For the 3rd caller of the function - init_bars() - otoh
>> the locking looks to be entirely unnecessary.)
> Well, you are correct: if tmp != pdev then it is correct to acquire
> the lock. But if tmp == pdev and rom_only == true
> then we'll deadlock.
> 
> It seems we need to have the locking conditional, e.g. only lock
> if tmp != pdev

Which will address the live-lock, but introduce ABBA deadlock potential
between the two locks.
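
As a minimal sketch of that ABBA pattern, assuming a toy lock and a
hand-replayed interleaving of two CPUs (toy_lock, toy_take_other() and
toy_abba_stuck() are hypothetical; a trylock stands in for the point
where a real spin_lock() would spin):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded model of a non-recursive lock. */
struct toy_lock {
    bool held;
};

/* Returns true if the lock was free and is now taken. */
static bool toy_trylock(struct toy_lock *l)
{
    if ( l->held )
        return false;

    l->held = true;
    return true;
}

/*
 * One step of "this CPU holds `own` and now wants `other`", as a
 * modify_bars()-like loop would do when locking only for tmp != pdev.
 * Returns false where a real spin_lock() would spin.
 */
static bool toy_take_other(struct toy_lock *own, struct toy_lock *other)
{
    assert(own->held);
    return toy_trylock(other);
}

/* Replays the bad interleaving; true means both CPUs are stuck. */
static bool toy_abba_stuck(void)
{
    struct toy_lock a = { false }, b = { false };
    bool cpu0_ok, cpu1_ok;

    toy_trylock(&a); /* CPU0: write to device A, holds A's lock */
    toy_trylock(&b); /* CPU1: write to device B, holds B's lock */

    cpu0_ok = toy_take_other(&a, &b); /* CPU0 now wants B's lock */
    cpu1_ok = toy_take_other(&b, &a); /* CPU1 now wants A's lock */

    return !cpu0_ok && !cpu1_ok;
}
```

Neither side can make progress: each holds the lock the other one is
waiting for.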

>>> @@ -222,10 +239,10 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
>>>               break;
>>>           }
>>>   
>>> +        msix_put(msix);
>>>           return X86EMUL_OKAY;
>>>       }
>>>   
>>> -    spin_lock(&msix->pdev->vpci->lock);
>>>       entry = get_entry(msix, addr);
>>>       offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
>> You're increasing the locked region quite a bit here. If this is really
>> needed, it wants explaining. And if this is deemed acceptable as a
>> "side effect", it wants justifying or at least stating imo. Same for
>> msix_write() then, obviously.
> Yes, I do increase the locking region here, but the msix variable needs
> to be protected all the time, so it seems to be obvious that it remains
> under the lock

What does the msix variable have to do with the vPCI lock? If you see
a need to grow the locked region here, then surely this is independent
of your conversion of the lock, and hence wants to be a prereq fix
(which may in fact want/need backporting).

>>> @@ -327,7 +334,12 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>>       if ( !pdev )
>>>           return vpci_read_hw(sbdf, reg, size);
>>>   
>>> -    spin_lock(&pdev->vpci->lock);
>>> +    spin_lock(&pdev->vpci_lock);
>>> +    if ( !pdev->vpci )
>>> +    {
>>> +        spin_unlock(&pdev->vpci_lock);
>>> +        return vpci_read_hw(sbdf, reg, size);
>>> +    }
>> Didn't you say you would add justification of this part of the change
>> (and its vpci_write() counterpart) to the description?
> Again, I am referring to the commit message as described above

No, sorry - that part applies only to what is inside the parentheses of
if(). But on the intermediate version (post-v5 in a 4-patch series) I
did say:

"In this case as well as in its write counterpart it becomes even more
 important to justify (in the description) the new behavior. It is not
 obvious at all that the absence of a struct vpci should be taken as
 an indication that the underlying device needs accessing instead.
 This also cannot be inferred from the "!pdev" case visible in context.
 In that case we have no record of a device at this SBDF, and hence the
 fallback pretty clearly is a "just in case" one. Yet if we know of a
 device, the absence of a struct vpci may mean various possible things."

If it wasn't obvious: The comment was on the use of vpci_read_hw() on
this path, not redundant with the earlier one regarding the added
"is vpci non-NULL" in a few places.

Jan




* Re: [PATCH v6 01/13] xen/pci: arm: add stub for is_memory_hole
  2022-02-04  9:01     ` Oleksandr Andrushchenko
@ 2022-02-04  9:41       ` Julien Grall
  2022-02-04  9:47         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Julien Grall @ 2022-02-04  9:41 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, xen-devel
  Cc: sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, Bertrand Marquis, Rahul Singh

On 04/02/2022 09:01, Oleksandr Andrushchenko wrote:
> On 04.02.22 10:51, Julien Grall wrote:
>> Hi,
>>
>> On 04/02/2022 06:34, Oleksandr Andrushchenko wrote:
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>
>>> Add a stub for is_memory_hole which is required for PCI passthrough
>>> on Arm.
>>>
>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>
>>> ---
>>> Cc: Julien Grall <julien@xen.org>
>>> Cc: Stefano Stabellini <sstabellini@kernel.org>
>>> ---
>>> New in v6
>>> ---
>>>    xen/arch/arm/mm.c | 6 ++++++
>>>    1 file changed, 6 insertions(+)
>>>
>>> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
>>> index b1eae767c27c..c32e34a182a2 100644
>>> --- a/xen/arch/arm/mm.c
>>> +++ b/xen/arch/arm/mm.c
>>> @@ -1640,6 +1640,12 @@ unsigned long get_upper_mfn_bound(void)
>>>        return max_page - 1;
>>>    }
>>>    +bool is_memory_hole(mfn_t start, mfn_t end)
>>> +{
>>> +    /* TODO: this needs to be properly implemented. */
>>
>> I was hoping to see a summary of the discussion from IRC somewhere in the patch (maybe after ---). This would help to bring up to speed the others that were not on IRC.
> I am not quite sure what needs to be put here as the summary

At least some details on why this is a TODO. Is it because you are 
unsure of the implementation? Is it because you wanted to send early?...

IOW, what are you expecting from the reviewers?

> Could you please help me with the exact message you would like to see?

Here is a summary of the discussion (+ some of my follow-up thoughts):

is_memory_hole() was recently introduced on x86 (see commit 75cc460a1b8c 
"xen/pci: detect when BARs are not suitably positioned") to check 
whether the BARs are positioned outside of a valid memory range. This was 
introduced to work around quirky firmware.

In theory, this could also happen on Arm. In practice, this may not 
happen but it sounds better to sanity check that the BAR contains 
"valid" I/O range.

On x86, this is implemented by checking that the region is not described 
in the e820. IIUC, on Arm, the BARs have to be positioned in pre-defined 
ranges. So I think it would be possible to implement is_memory_hole() by 
going through the list of hostbridges and checking the ranges.

But first, I'd like to confirm my understanding with Rahul, and others.

If we were going to go this route, I would also rename the function to 
better match what it is doing (i.e. it checks the BAR is correctly 
placed). As a potential optimization/hardening for Arm, we could pass 
the hostbridge so we don't have to walk all of them.
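
A rough sketch of the host bridge walk described above, under the
assumption that each bridge exposes a list of inclusive MMIO windows.
The types toy_range and toy_bridge and the function
toy_bar_in_bridge_window() are hypothetical and do not correspond to
existing Xen structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one MMIO window of a host bridge, inclusive frame numbers. */
struct toy_range {
    uint64_t start, end;
};

/* Hypothetical: a host bridge described only by its MMIO windows. */
struct toy_bridge {
    const struct toy_range *ranges;
    size_t nr_ranges;
};

/* True if [start, end] lies entirely inside one window of one bridge. */
static bool toy_bar_in_bridge_window(const struct toy_bridge *bridges,
                                     size_t nr_bridges,
                                     uint64_t start, uint64_t end)
{
    size_t i, j;

    assert(start <= end);

    for ( i = 0; i < nr_bridges; i++ )
        for ( j = 0; j < bridges[i].nr_ranges; j++ )
            if ( start >= bridges[i].ranges[j].start &&
                 end <= bridges[i].ranges[j].end )
                return true;

    return false;
}
```

Passing a single bridge (nr_bridges == 1) corresponds to the
optimization of handing in the hostbridge instead of walking all of
them.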

Cheers,

-- 
Julien Grall



* Re: [PATCH v6 01/13] xen/pci: arm: add stub for is_memory_hole
  2022-02-04  9:41       ` Julien Grall
@ 2022-02-04  9:47         ` Oleksandr Andrushchenko
  2022-02-04  9:57           ` Julien Grall
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04  9:47 UTC (permalink / raw)
  To: Julien Grall, xen-devel
  Cc: sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, Bertrand Marquis, Rahul Singh,
	Oleksandr Andrushchenko



On 04.02.22 11:41, Julien Grall wrote:
> On 04/02/2022 09:01, Oleksandr Andrushchenko wrote:
>> On 04.02.22 10:51, Julien Grall wrote:
>>> Hi,
>>>
>>> On 04/02/2022 06:34, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> Add a stub for is_memory_hole which is required for PCI passthrough
>>>> on Arm.
>>>>
>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> ---
>>>> Cc: Julien Grall <julien@xen.org>
>>>> Cc: Stefano Stabellini <sstabellini@kernel.org>
>>>> ---
>>>> New in v6
>>>> ---
>>>>    xen/arch/arm/mm.c | 6 ++++++
>>>>    1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
>>>> index b1eae767c27c..c32e34a182a2 100644
>>>> --- a/xen/arch/arm/mm.c
>>>> +++ b/xen/arch/arm/mm.c
>>>> @@ -1640,6 +1640,12 @@ unsigned long get_upper_mfn_bound(void)
>>>>        return max_page - 1;
>>>>    }
>>>>    +bool is_memory_hole(mfn_t start, mfn_t end)
>>>> +{
>>>> +    /* TODO: this needs to be properly implemented. */
>>>
>>> I was hoping to see a summary of the discussion from IRC somewhere in the patch (maybe after ---). This would help to bring up to speed the others that were not on IRC.
>> I am not quite sure what needs to be put here as the summary
>
> At least some details on why this is a TODO. Is it because you are unsure of the implementation? Is it because you wanted to send early?...
>
> IOW, what are you expecting from the reviewers?
Well, I just need to allow PCI passthrough to be built on Arm at the moment.
Clearly, without this stub I can't do so. This is the only intention now.
Of course, while PCI passthrough on Arm is still not really enabled, those
who want to try it will otherwise need to revert the offending patch.
I am fine both ways.
>
>> Could you please help me with the exact message you would like to see?
>
> Here a summary of the discussion (+ some my follow-up thoughts):
>
> is_memory_hole() was recently introduced on x86 (see commit 75cc460a1b8c "xen/pci: detect when BARs are not suitably positioned") to check whether the BAR are positioned outside of a valid memory range. This was introduced to work-around quirky firmware.
>
> In theory, this could also happen on Arm. In practice, this may not happen but it sounds better to sanity check that the BAR contains "valid" I/O range.
>
> On x86, this is implemented by checking the region is not described is in the e820. IIUC, on Arm, the BARs have to be positioned in pre-defined ranges. So I think it would be possible to implement is_memory_hole() by going through the list of hostbridges and check the ranges.
>
> But first, I'd like to confirm my understanding with Rahul, and others.
>
> If we were going to go this route, I would also rename the function to be better match what it is doing (i.e. it checks the BAR is correctly placed). As a potentially optimization/hardening for Arm, we could pass the hostbridge so we don't have to walk all of them.
It seems this needs to live in the commit message then? So, it is easy to find
as everything after "---" is going to be dropped on commit
>
> Cheers,
>
Thank you,
Oleksandr


* Re: [PATCH v6 01/13] xen/pci: arm: add stub for is_memory_hole
  2022-02-04  9:47         ` Oleksandr Andrushchenko
@ 2022-02-04  9:57           ` Julien Grall
  2022-02-04 10:35             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Julien Grall @ 2022-02-04  9:57 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, xen-devel
  Cc: sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, Bertrand Marquis, Rahul Singh

Hi,

On 04/02/2022 09:47, Oleksandr Andrushchenko wrote:
>>> Could you please help me with the exact message you would like to see?
>>
>> Here a summary of the discussion (+ some my follow-up thoughts):
>>
>> is_memory_hole() was recently introduced on x86 (see commit 75cc460a1b8c "xen/pci: detect when BARs are not suitably positioned") to check whether the BAR are positioned outside of a valid memory range. This was introduced to work-around quirky firmware.
>>
>> In theory, this could also happen on Arm. In practice, this may not happen but it sounds better to sanity check that the BAR contains "valid" I/O range.
>>
>> On x86, this is implemented by checking the region is not described is in the e820. IIUC, on Arm, the BARs have to be positioned in pre-defined ranges. So I think it would be possible to implement is_memory_hole() by going through the list of hostbridges and check the ranges.
>>
>> But first, I'd like to confirm my understanding with Rahul, and others.
>>
>> If we were going to go this route, I would also rename the function to be better match what it is doing (i.e. it checks the BAR is correctly placed). As a potentially optimization/hardening for Arm, we could pass the hostbridge so we don't have to walk all of them.
> It seems this needs to live in the commit message then? So, it is easy to find
> as everything after "---" is going to be dropped on commit
I expect the function to be fully implemented before this is merged.

So if it is fully implemented, then a fair chunk of what I wrote would 
not be necessary to carry in the commit message.

Cheers,

-- 
Julien Grall



* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04  9:15       ` Jan Beulich
@ 2022-02-04 10:12         ` Oleksandr Andrushchenko
  2022-02-04 10:49           ` Jan Beulich
  2022-02-04 10:57           ` Roger Pau Monné
  0 siblings, 2 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04 10:12 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau,
	Oleksandr Andrushchenko

Hi, Jan!

On 04.02.22 11:15, Jan Beulich wrote:
> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
>> On 04.02.22 09:52, Jan Beulich wrote:
>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>                    continue;
>>>>            }
>>>>    
>>>> +        spin_lock(&tmp->vpci_lock);
>>>> +        if ( !tmp->vpci )
>>>> +        {
>>>> +            spin_unlock(&tmp->vpci_lock);
>>>> +            continue;
>>>> +        }
>>>>            for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>>            {
>>>>                const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>                rc = rangeset_remove_range(mem, start, end);
>>>>                if ( rc )
>>>>                {
>>>> +                spin_unlock(&tmp->vpci_lock);
>>>>                    printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>>>                           start, end, rc);
>>>>                    rangeset_destroy(mem);
>>>>                    return rc;
>>>>                }
>>>>            }
>>>> +        spin_unlock(&tmp->vpci_lock);
>>>>        }
>>> At the first glance this simply looks like another unjustified (in the
>>> description) change, as you're not converting anything here but you
>>> actually add locking (and I realize this was there before, so I'm sorry
>>> for not pointing this out earlier).
>> Well, I thought that the description already has "...the lock can be
>> used (and in a few cases is used right away) to check whether vpci
>> is present" and this is enough for such uses as here.
>>>    But then I wonder whether you
>>> actually tested this, since I can't help getting the impression that
>>> you're introducing a live-lock: The function is called from cmd_write()
>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
>>> function already holds the lock, and the lock is not (currently)
>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
>>> the locking looks to be entirely unnecessary.)
>> Well, you are correct: if tmp != pdev then it is correct to acquire
>> the lock. But if tmp == pdev and rom_only == true
>> then we'll deadlock.
>>
>> It seems we need to have the locking conditional, e.g. only lock
>> if tmp != pdev
> Which will address the live-lock, but introduce ABBA deadlock potential
> between the two locks.
I am not sure I can suggest a better solution here
@Roger, @Jan, could you please help here?
>
>>>> @@ -222,10 +239,10 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
>>>>                break;
>>>>            }
>>>>    
>>>> +        msix_put(msix);
>>>>            return X86EMUL_OKAY;
>>>>        }
>>>>    
>>>> -    spin_lock(&msix->pdev->vpci->lock);
>>>>        entry = get_entry(msix, addr);
>>>>        offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
>>> You're increasing the locked region quite a bit here. If this is really
>>> needed, it wants explaining. And if this is deemed acceptable as a
>>> "side effect", it wants justifying or at least stating imo. Same for
>>> msix_write() then, obviously.
>> Yes, I do increase the locking region here, but the msix variable needs
>> to be protected all the time, so it seems to be obvious that it remains
>> under the lock
> What does the msix variable have to do with the vPCI lock? If you see
> a need to grow the locked region here, then surely this is independent
> of your conversion of the lock, and hence wants to be a prereq fix
> (which may in fact want/need backporting).
First of all, the implementation of msix_get is wrong and needs to be:

/*
  * Note: if vpci_msix found, then this function returns with
  * pdev->vpci_lock held. Use msix_put to unlock.
  */
static struct vpci_msix *msix_get(const struct domain *d, unsigned long addr)
{
     struct vpci_msix *msix;

     list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
     {
         const struct vpci_bar *bars;
         unsigned int i;

         spin_lock(&msix->pdev->vpci_lock);
         if ( !msix->pdev->vpci )
         {
             spin_unlock(&msix->pdev->vpci_lock);
             continue;
         }

         bars = msix->pdev->vpci->header.bars;
         for ( i = 0; i < ARRAY_SIZE(msix->tables); i++ )
             if ( bars[msix->tables[i] & PCI_MSIX_BIRMASK].enabled &&
                  VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, i) )
                 return msix;

         spin_unlock(&msix->pdev->vpci_lock);
     }

     return NULL;
}

Then, both msix_{read|write} can dereference msix->pdev->vpci early;
this is why Roger suggested we move to msix_{get|put} here.
And yes, we grow the locked region here and yes this might want a
prereq fix. Or just be fixed while at it.

>
>>>> @@ -327,7 +334,12 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>>>        if ( !pdev )
>>>>            return vpci_read_hw(sbdf, reg, size);
>>>>    
>>>> -    spin_lock(&pdev->vpci->lock);
>>>> +    spin_lock(&pdev->vpci_lock);
>>>> +    if ( !pdev->vpci )
>>>> +    {
>>>> +        spin_unlock(&pdev->vpci_lock);
>>>> +        return vpci_read_hw(sbdf, reg, size);
>>>> +    }
>>> Didn't you say you would add justification of this part of the change
>>> (and its vpci_write() counterpart) to the description?
>> Again, I am referring to the commit message as described above
> No, sorry - that part applies only to what inside the parentheses of
> if(). But on the intermediate version (post-v5 in a 4-patch series) I
> did say:
>
> "In this case as well as in its write counterpart it becomes even more
>   important to justify (in the description) the new behavior. It is not
>   obvious at all that the absence of a struct vpci should be taken as
>   an indication that the underlying device needs accessing instead.
>   This also cannot be inferred from the "!pdev" case visible in context.
>   In that case we have no record of a device at this SBDF, and hence the
>   fallback pretty clearly is a "just in case" one. Yet if we know of a
>   device, the absence of a struct vpci may mean various possible things."
>
> If it wasn't obvious: The comment was on the use of vpci_read_hw() on
> this path, not redundant with the earlier one regarding the added
> "is vpci non-NULL" in a few places.
Ok
>
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 01/13] xen/pci: arm: add stub for is_memory_hole
  2022-02-04  9:57           ` Julien Grall
@ 2022-02-04 10:35             ` Oleksandr Andrushchenko
  2022-02-04 11:00               ` Julien Grall
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04 10:35 UTC (permalink / raw)
  To: Julien Grall, xen-devel
  Cc: sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, Bertrand Marquis, Rahul Singh,
	Oleksandr Andrushchenko



On 04.02.22 11:57, Julien Grall wrote:
> Hi,
>
> On 04/02/2022 09:47, Oleksandr Andrushchenko wrote:
>>>> Could you please help me with the exact message you would like to see?
>>>
>>> Here a summary of the discussion (+ some my follow-up thoughts):
>>>
>>> is_memory_hole() was recently introduced on x86 (see commit 75cc460a1b8c "xen/pci: detect when BARs are not suitably positioned") to check whether the BAR are positioned outside of a valid memory range. This was introduced to work-around quirky firmware.
>>>
>>> In theory, this could also happen on Arm. In practice, this may not happen but it sounds better to sanity check that the BAR contains "valid" I/O range.
>>>
>>> On x86, this is implemented by checking the region is not described is in the e820. IIUC, on Arm, the BARs have to be positioned in pre-defined ranges. So I think it would be possible to implement is_memory_hole() by going through the list of hostbridges and check the ranges.
>>>
>>> But first, I'd like to confirm my understanding with Rahul, and others.
>>>
>>> If we were going to go this route, I would also rename the function to be better match what it is doing (i.e. it checks the BAR is correctly placed). As a potentially optimization/hardening for Arm, we could pass the hostbridge so we don't have to walk all of them.
>> It seems this needs to live in the commit message then? So, it is easy to find
>> as everything after "---" is going to be dropped on commit
> I expect the function to be fully implemented before this is will be merged.
>
> So if it is fully implemented, then a fair chunk of what I wrote would not be necessary to carry in the commit message.
Well, we started from wanting *something* with a TODO, and now
you request it to be fully implemented before it is merged.
What am I missing here?
>
> Cheers,
>
Thank you,
Oleksandr


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04 10:12         ` Oleksandr Andrushchenko
@ 2022-02-04 10:49           ` Jan Beulich
  2022-02-04 11:13             ` Roger Pau Monné
  2022-02-04 10:57           ` Roger Pau Monné
  1 sibling, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-04 10:49 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau

On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
> On 04.02.22 11:15, Jan Beulich wrote:
>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
>>> On 04.02.22 09:52, Jan Beulich wrote:
>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>                    continue;
>>>>>            }
>>>>>    
>>>>> +        spin_lock(&tmp->vpci_lock);
>>>>> +        if ( !tmp->vpci )
>>>>> +        {
>>>>> +            spin_unlock(&tmp->vpci_lock);
>>>>> +            continue;
>>>>> +        }
>>>>>            for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>>>            {
>>>>>                const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>                rc = rangeset_remove_range(mem, start, end);
>>>>>                if ( rc )
>>>>>                {
>>>>> +                spin_unlock(&tmp->vpci_lock);
>>>>>                    printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>>>>                           start, end, rc);
>>>>>                    rangeset_destroy(mem);
>>>>>                    return rc;
>>>>>                }
>>>>>            }
>>>>> +        spin_unlock(&tmp->vpci_lock);
>>>>>        }
>>>> At the first glance this simply looks like another unjustified (in the
>>>> description) change, as you're not converting anything here but you
>>>> actually add locking (and I realize this was there before, so I'm sorry
>>>> for not pointing this out earlier).
>>> Well, I thought that the description already has "...the lock can be
>>> used (and in a few cases is used right away) to check whether vpci
>>> is present" and this is enough for such uses as here.
>>>>    But then I wonder whether you
>>>> actually tested this, since I can't help getting the impression that
>>>> you're introducing a live-lock: The function is called from cmd_write()
>>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
>>>> function already holds the lock, and the lock is not (currently)
>>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
>>>> the locking looks to be entirely unnecessary.)
>>> Well, you are correct: if tmp != pdev then it is correct to acquire
>>> the lock. But if tmp == pdev and rom_only == true
>>> then we'll deadlock.
>>>
>>> It seems we need to have the locking conditional, e.g. only lock
>>> if tmp != pdev
>> Which will address the live-lock, but introduce ABBA deadlock potential
>> between the two locks.
> I am not sure I can suggest a better solution here
> @Roger, @Jan, could you please help here?

Well, first of all I'd like to mention that while it may have been okay to
not hold pcidevs_lock here for Dom0, it surely needs acquiring when dealing
with DomU-s' lists of PCI devices. The requirement really applies to the
other use of for_each_pdev() as well (in vpci_dump_msi()), except that
there it probably wants to be a try-lock.

Next I'd like to point out that here we have the still pending issue of
how to deal with hidden devices, which Dom0 can access. See my RFC patch
"vPCI: account for hidden devices in modify_bars()". Whatever the solution
here, I think it wants to at least account for the extra need there.

Now it is quite clear that pcidevs_lock isn't going to help with avoiding
the deadlock, as it's imo not an option at all to acquire that lock
everywhere else you access ->vpci (or else the vpci lock itself would be
pointless). But a per-domain auxiliary r/w lock may help: Other paths
would acquire it in read mode, and here you'd acquire it in write mode (in
the former case around the vpci lock, while in the latter case there may
then not be any need to acquire the individual vpci locks at all). FTAOD:
I haven't fully thought through all implications (and hence whether this is
viable in the first place); I expect you will, documenting what you've
found in the resulting patch description. Of course the double lock
acquire/release would then likely want hiding in helper functions.
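
As a rough illustration of the helper functions mentioned above, here is a standalone sketch (not Xen code) of the double acquire/release. Everything in it is an assumption made for the example: the toy_* lock types are minimal C11 stand-ins for spinlock_t and rwlock_t, and the names vpci_dev_lock(), vpci_all_lock() and the vpci_rwlock field are hypothetical. It does not address how a path that already holds the lock in read mode would switch to write mode.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for spinlock_t: a minimal test-and-set lock. */
struct toy_spinlock {
    atomic_bool held;
};

static void toy_spin_lock(struct toy_spinlock *l)
{
    bool expected = false;

    while ( !atomic_compare_exchange_weak(&l->held, &expected, true) )
        expected = false; /* the failed CAS stored the current value */
}

static void toy_spin_unlock(struct toy_spinlock *l)
{
    assert(atomic_load(&l->held));
    atomic_store(&l->held, false);
}

/* Stand-in for rwlock_t: count > 0 is the reader count, -1 is a writer. */
struct toy_rwlock {
    atomic_int count;
};

static void toy_read_lock(struct toy_rwlock *l)
{
    for ( ; ; )
    {
        int c = atomic_load(&l->count);

        if ( c >= 0 && atomic_compare_exchange_weak(&l->count, &c, c + 1) )
            return;
    }
}

static void toy_read_unlock(struct toy_rwlock *l)
{
    int prev = atomic_fetch_sub(&l->count, 1);

    assert(prev > 0);
    (void)prev;
}

static bool toy_write_trylock(struct toy_rwlock *l)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&l->count, &expected, -1);
}

static void toy_write_lock(struct toy_rwlock *l)
{
    while ( !toy_write_trylock(l) )
        ; /* spin until there is neither a reader nor a writer */
}

static void toy_write_unlock(struct toy_rwlock *l)
{
    assert(atomic_load(&l->count) == -1);
    atomic_store(&l->count, 0);
}

/* Hypothetical layout: one r/w lock per domain, one spinlock per device. */
struct toy_domain {
    struct toy_rwlock vpci_rwlock;
};

struct toy_pdev {
    struct toy_domain *domain;
    struct toy_spinlock vpci_lock;
};

/* Ordinary paths: domain lock in read mode around the device's own lock. */
static void vpci_dev_lock(struct toy_pdev *pdev)
{
    toy_read_lock(&pdev->domain->vpci_rwlock);
    toy_spin_lock(&pdev->vpci_lock);
}

static void vpci_dev_unlock(struct toy_pdev *pdev)
{
    toy_spin_unlock(&pdev->vpci_lock);
    toy_read_unlock(&pdev->domain->vpci_rwlock);
}

/* Cross-device path: write mode excludes every vpci_dev_lock() holder. */
static void vpci_all_lock(struct toy_domain *d)
{
    toy_write_lock(&d->vpci_rwlock);
}

static void vpci_all_unlock(struct toy_domain *d)
{
    toy_write_unlock(&d->vpci_rwlock);
}
```

Because every per-device lock is only ever taken inside the read-side section, a holder of the write side can look at all devices of the domain without taking their individual locks, so no lock ordering between devices is needed on that path.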

Jan




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04 10:12         ` Oleksandr Andrushchenko
  2022-02-04 10:49           ` Jan Beulich
@ 2022-02-04 10:57           ` Roger Pau Monné
  1 sibling, 0 replies; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-04 10:57 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel

On Fri, Feb 04, 2022 at 10:12:46AM +0000, Oleksandr Andrushchenko wrote:
> Hi, Jan!
> 
> On 04.02.22 11:15, Jan Beulich wrote:
> > On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
> >> On 04.02.22 09:52, Jan Beulich wrote:
> >>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> >>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>>>                    continue;
> >>>>            }
> >>>>    
> >>>> +        spin_lock(&tmp->vpci_lock);
> >>>> +        if ( !tmp->vpci )
> >>>> +        {
> >>>> +            spin_unlock(&tmp->vpci_lock);
> >>>> +            continue;
> >>>> +        }
> >>>>            for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
> >>>>            {
> >>>>                const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
> >>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>>>                rc = rangeset_remove_range(mem, start, end);
> >>>>                if ( rc )
> >>>>                {
> >>>> +                spin_unlock(&tmp->vpci_lock);
> >>>>                    printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
> >>>>                           start, end, rc);
> >>>>                    rangeset_destroy(mem);
> >>>>                    return rc;
> >>>>                }
> >>>>            }
> >>>> +        spin_unlock(&tmp->vpci_lock);
> >>>>        }
> >>> At the first glance this simply looks like another unjustified (in the
> >>> description) change, as you're not converting anything here but you
> >>> actually add locking (and I realize this was there before, so I'm sorry
> >>> for not pointing this out earlier).
> >> Well, I thought that the description already has "...the lock can be
> >> used (and in a few cases is used right away) to check whether vpci
> >> is present" and this is enough for such uses as here.
> >>>    But then I wonder whether you
> >>> actually tested this, since I can't help getting the impression that
> >>> you're introducing a live-lock: The function is called from cmd_write()
> >>> and rom_write(), which in turn are called out of vpci_write(). Yet that
> >>> function already holds the lock, and the lock is not (currently)
> >>> recursive. (For the 3rd caller of the function - init_bars() - otoh
> >>> the locking looks to be entirely unnecessary.)
> >> Well, you are correct: if tmp != pdev then it is correct to acquire
> >> the lock. But if tmp == pdev and rom_only == true
> >> then we'll deadlock.
> >>
> >> It seems we need to have the locking conditional, e.g. only lock
> >> if tmp != pdev
> > Which will address the live-lock, but introduce ABBA deadlock potential
> > between the two locks.
> I am not sure I can suggest a better solution here
> @Roger, @Jan, could you please help here?

I think we could set the locking order based on the memory address of
the locks, ie:

if ( &tmp->vpci_lock < &pdev->vpci_lock )
{
    spin_unlock(&pdev->vpci_lock);
    spin_lock(&tmp->vpci_lock);
    spin_lock(&pdev->vpci_lock);
    if ( !pdev->vpci || &pdev->vpci->header != header )
    {
        /* ERROR: vpci removed or recreated. */
    }
}
else
    spin_lock(&tmp->vpci_lock);

That however creates a window where the address of the BARs on the
current device (pdev) could be changed, so the result of the mapping
might be skewed. I think the guest would only have itself to blame for
that, as changing the position of the BARs while toggling memory
decoding is not something sensible to do.
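
For reference, the address-based ordering can be exercised in isolation with the standalone sketch below (not Xen code). The toy_spinlock type is a minimal C11 stand-in for spinlock_t, and lock_second() is a hypothetical helper name; the re-validation of pdev->vpci after the lock was dropped is left to the caller, which is told about it through the return value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for spinlock_t: a minimal test-and-set lock. */
struct toy_spinlock {
    atomic_bool held;
};

static void toy_spin_init(struct toy_spinlock *l)
{
    atomic_init(&l->held, false);
}

static bool toy_spin_trylock(struct toy_spinlock *l)
{
    bool expected = false;

    return atomic_compare_exchange_strong(&l->held, &expected, true);
}

static void toy_spin_lock(struct toy_spinlock *l)
{
    while ( !toy_spin_trylock(l) )
        ; /* spin */
}

static void toy_spin_unlock(struct toy_spinlock *l)
{
    assert(atomic_load(&l->held));
    atomic_store(&l->held, false);
}

/*
 * Called with 'held' locked. Acquires 'other' so that, of the two, the
 * lock at the lower address is always taken first.
 *
 * Returns true if 'held' had to be dropped and re-taken. In that case
 * the caller must re-check whatever 'held' protects, since it may have
 * changed while the lock was released.
 */
static bool lock_second(struct toy_spinlock *held, struct toy_spinlock *other)
{
    assert(held != other);

    if ( (uintptr_t)other < (uintptr_t)held )
    {
        toy_spin_unlock(held);
        toy_spin_lock(other);
        toy_spin_lock(held);
        return true;
    }

    toy_spin_lock(other);
    return false;
}
```

Two CPUs that each hold one lock of the same pair and want the other one can no longer wait on each other forever: whichever holds the higher-addressed lock releases it before waiting.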

> >
> >>>> @@ -222,10 +239,10 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
> >>>>                break;
> >>>>            }
> >>>>    
> >>>> +        msix_put(msix);
> >>>>            return X86EMUL_OKAY;
> >>>>        }
> >>>>    
> >>>> -    spin_lock(&msix->pdev->vpci->lock);
> >>>>        entry = get_entry(msix, addr);
> >>>>        offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
> >>> You're increasing the locked region quite a bit here. If this is really
> >>> needed, it wants explaining. And if this is deemed acceptable as a
> >>> "side effect", it wants justifying or at least stating imo. Same for
> >>> msix_write() then, obviously.
> >> Yes, I do increase the locking region here, but the msix variable needs
> >> to be protected all the time, so it seems to be obvious that it remains
> >> under the lock
> > What does the msix variable have to do with the vPCI lock? If you see
> > a need to grow the locked region here, then surely this is independent
> > of your conversion of the lock, and hence wants to be a prereq fix
> > (which may in fact want/need backporting).
> First of all, the implementation of msix_get is wrong and needs to be:
> 
> /*
>   * Note: if vpci_msix found, then this function returns with
>   * pdev->vpci_lock held. Use msix_put to unlock.
>   */
> static struct vpci_msix *msix_get(const struct domain *d, unsigned long addr)
> {
>      struct vpci_msix *msix;
> 
>      list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )

Strictly speaking you would also need to introduce a lock here to
protect msix_tables.

This was all designed when hot-adding (or removing) PCI devices to the
domain wasn't supported.

>      {
>          const struct vpci_bar *bars;
>          unsigned int i;
> 
>          spin_lock(&msix->pdev->vpci_lock);
>          if ( !msix->pdev->vpci )
>          {
>              spin_unlock(&msix->pdev->vpci_lock);
>              continue;
>          }
> 
>          bars = msix->pdev->vpci->header.bars;
>          for ( i = 0; i < ARRAY_SIZE(msix->tables); i++ )
>              if ( bars[msix->tables[i] & PCI_MSIX_BIRMASK].enabled &&
>                   VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, i) )
>                  return msix;
> 
>          spin_unlock(&msix->pdev->vpci_lock);
>      }
> 
>      return NULL;
> }
> 
> Then, both msix_{read|write} can dereference msix->pdev->vpci early,
> this is why Roger suggested we move to msix_{get|put} here.
> And yes, we grow the locked region here and yes this might want a
> prereq fix. Or just be fixed while at it.

Ideally yes, we would need a separate fix that introduced
msix_{get,put}, because the currently unlocked regions of
msix_{read,write} do access the BAR address fields, and doing so
without holding the vpci lock would be racy. I would expect that the
writing/reading of the addr field is done in a single instruction, so
it's unlikely to be a problem in practice. That's kind of similar to
the fact that modify_bars also accesses the addr and size fields of
remote BARs without taking the respective lock.

Once the lock is moved outside of the vpci struct and it's used to
assert that pdev->vpci is present then we do need to hold it while
accessing vpci, or else the struct could be removed under our feet.
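
A minimal standalone sketch (not Xen code) of that last point: with the lock living in the device structure rather than in the vpci one, it can protect both the pointer and the lifetime of what it points to. The toy_* types and function names are hypothetical and only model the pattern.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for spinlock_t: a minimal test-and-set lock. */
struct toy_spinlock {
    atomic_bool held;
};

static void toy_spin_lock(struct toy_spinlock *l)
{
    bool expected = false;

    while ( !atomic_compare_exchange_weak(&l->held, &expected, true) )
        expected = false; /* the failed CAS stored the current value */
}

static void toy_spin_unlock(struct toy_spinlock *l)
{
    assert(atomic_load(&l->held));
    atomic_store(&l->held, false);
}

/* Hypothetical, heavily reduced model of struct vpci. */
struct toy_vpci {
    unsigned int bar_addr;
};

/* Hypothetical, heavily reduced model of struct pci_dev. */
struct toy_pdev {
    struct toy_spinlock vpci_lock; /* outlives the vpci it protects */
    struct toy_vpci *vpci;         /* NULL when absent; under vpci_lock */
};

/* Tear down vpci: free and clear the pointer under the lock. */
static void toy_vpci_remove(struct toy_pdev *pdev)
{
    toy_spin_lock(&pdev->vpci_lock);
    free(pdev->vpci);
    pdev->vpci = NULL;
    toy_spin_unlock(&pdev->vpci_lock);
}

/*
 * Read a field of vpci. The NULL check and the dereference happen under
 * the same lock hold, so a concurrent toy_vpci_remove() cannot free the
 * structure in between. Returns false if there is no vpci.
 */
static bool toy_vpci_read_bar(struct toy_pdev *pdev, unsigned int *addr)
{
    bool found = false;

    toy_spin_lock(&pdev->vpci_lock);
    if ( pdev->vpci )
    {
        *addr = pdev->vpci->bar_addr;
        found = true;
    }
    toy_spin_unlock(&pdev->vpci_lock);

    return found;
}
```

With the lock embedded in the structure being freed, the reader would have to dereference pdev->vpci just to reach the lock, which is exactly the access that needs protecting.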

Roger.



* Re: [PATCH v6 01/13] xen/pci: arm: add stub for is_memory_hole
  2022-02-04 10:35             ` Oleksandr Andrushchenko
@ 2022-02-04 11:00               ` Julien Grall
  2022-02-04 11:25                 ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Julien Grall @ 2022-02-04 11:00 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, xen-devel
  Cc: sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, Bertrand Marquis, Rahul Singh



On 04/02/2022 10:35, Oleksandr Andrushchenko wrote:
> 
> 
> On 04.02.22 11:57, Julien Grall wrote:
>> Hi,
>>
>> On 04/02/2022 09:47, Oleksandr Andrushchenko wrote:
>>>>> Could you please help me with the exact message you would like to see?
>>>>
>>>> Here a summary of the discussion (+ some my follow-up thoughts):
>>>>
>>>> is_memory_hole() was recently introduced on x86 (see commit 75cc460a1b8c "xen/pci: detect when BARs are not suitably positioned") to check whether the BAR are positioned outside of a valid memory range. This was introduced to work-around quirky firmware.
>>>>
>>>> In theory, this could also happen on Arm. In practice, this may not happen but it sounds better to sanity check that the BAR contains "valid" I/O range.
>>>>
>>>> On x86, this is implemented by checking the region is not described is in the e820. IIUC, on Arm, the BARs have to be positioned in pre-defined ranges. So I think it would be possible to implement is_memory_hole() by going through the list of hostbridges and check the ranges.
>>>>
>>>> But first, I'd like to confirm my understanding with Rahul, and others.
>>>>
>>>> If we were going to go this route, I would also rename the function to be better match what it is doing (i.e. it checks the BAR is correctly placed). As a potentially optimization/hardening for Arm, we could pass the hostbridge so we don't have to walk all of them.
>>> It seems this needs to live in the commit message then? So, it is easy to find
>>> as everything after "---" is going to be dropped on commit
>> I expect the function to be fully implemented before this is will be merged.
>>
>> So if it is fully implemented, then a fair chunk of what I wrote would not be necessary to carry in the commit message.
> Well, we started from that we want *something* with TODO and now
> you request it to be fully implemented before it is merged.

I don't think I ever suggested this patch would be merged as-is. Sorry 
if it came across like that.

Instead, my intent in asking you to send a TODO patch was to start a 
discussion on how this function could be implemented for Arm.

You sent a TODO but you didn't provide any summary of what the issue is, 
what we want to achieve... Hence my request to add a bit more detail so 
the other reviewers can provide their opinion more easily.

Cheers,

-- 
Julien Grall



* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04 10:49           ` Jan Beulich
@ 2022-02-04 11:13             ` Roger Pau Monné
  2022-02-04 11:37               ` Jan Beulich
  2022-02-04 11:37               ` Oleksandr Andrushchenko
  0 siblings, 2 replies; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-04 11:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, julien, sstabellini,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel

On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
> > On 04.02.22 11:15, Jan Beulich wrote:
> >> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
> >>> On 04.02.22 09:52, Jan Beulich wrote:
> >>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> >>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>>>>                    continue;
> >>>>>            }
> >>>>>    
> >>>>> +        spin_lock(&tmp->vpci_lock);
> >>>>> +        if ( !tmp->vpci )
> >>>>> +        {
> >>>>> +            spin_unlock(&tmp->vpci_lock);
> >>>>> +            continue;
> >>>>> +        }
> >>>>>            for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
> >>>>>            {
> >>>>>                const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
> >>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>>>>                rc = rangeset_remove_range(mem, start, end);
> >>>>>                if ( rc )
> >>>>>                {
> >>>>> +                spin_unlock(&tmp->vpci_lock);
> >>>>>                    printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
> >>>>>                           start, end, rc);
> >>>>>                    rangeset_destroy(mem);
> >>>>>                    return rc;
> >>>>>                }
> >>>>>            }
> >>>>> +        spin_unlock(&tmp->vpci_lock);
> >>>>>        }
> >>>> At the first glance this simply looks like another unjustified (in the
> >>>> description) change, as you're not converting anything here but you
> >>>> actually add locking (and I realize this was there before, so I'm sorry
> >>>> for not pointing this out earlier).
> >>> Well, I thought that the description already has "...the lock can be
> >>> used (and in a few cases is used right away) to check whether vpci
> >>> is present" and this is enough for such uses as here.
> >>>>    But then I wonder whether you
> >>>> actually tested this, since I can't help getting the impression that
> >>>> you're introducing a live-lock: The function is called from cmd_write()
> >>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
> >>>> function already holds the lock, and the lock is not (currently)
> >>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
> >>>> the locking looks to be entirely unnecessary.)
> >>> Well, you are correct: if tmp != pdev then it is correct to acquire
> >>> the lock. But if tmp == pdev and rom_only == true
> >>> then we'll deadlock.
> >>>
> >>> It seems we need to have the locking conditional, e.g. only lock
> >>> if tmp != pdev
> >> Which will address the live-lock, but introduce ABBA deadlock potential
> >> between the two locks.
> > I am not sure I can suggest a better solution here
> > @Roger, @Jan, could you please help here?
> 
> Well, first of all I'd like to mention that while it may have been okay to
> not hold pcidevs_lock here for Dom0, it surely needs acquiring when dealing
> with DomU-s' lists of PCI devices. The requirement really applies to the
> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
> there it probably wants to be a try-lock.
> 
> Next I'd like to point out that here we have the still pending issue of
> how to deal with hidden devices, which Dom0 can access. See my RFC patch
> "vPCI: account for hidden devices in modify_bars()". Whatever the solution
> here, I think it wants to at least account for the extra need there.

Yes, sorry, I should take care of that.

> Now it is quite clear that pcidevs_lock isn't going to help with avoiding
> the deadlock, as it's imo not an option at all to acquire that lock
> everywhere else you access ->vpci (or else the vpci lock itself would be
> pointless). But a per-domain auxiliary r/w lock may help: Other paths
> would acquire it in read mode, and here you'd acquire it in write mode (in
> the former case around the vpci lock, while in the latter case there may
> then not be any need to acquire the individual vpci locks at all). FTAOD:
> I haven't fully thought through all implications (and hence whether this is
> viable in the first place); I expect you will, documenting what you've
> found in the resulting patch description. Of course the double lock
> acquire/release would then likely want hiding in helper functions.

I've also been thinking about this, and whether it's really worth
having a per-device lock rather than a per-domain one that protects all
vpci regions of the devices assigned to the domain.

The OS is likely to serialize accesses to the PCI config space anyway,
and the only place I could see a benefit of having per-device locks is
in the handling of MSI-X tables, as the handling of the mask bit is
likely very performance sensitive, so adding a per-domain lock there
could be a bottleneck.

We could alternatively do a per-domain rwlock for vpci and special case
the MSI-X area to also have a per-device specific lock. At which point
it becomes fairly similar to what you propose.

Thanks, Roger.



* Re: [PATCH v6 01/13] xen/pci: arm: add stub for is_memory_hole
  2022-02-04 11:00               ` Julien Grall
@ 2022-02-04 11:25                 ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04 11:25 UTC (permalink / raw)
  To: Julien Grall, xen-devel
  Cc: sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, Bertrand Marquis, Rahul Singh,
	Oleksandr Andrushchenko



On 04.02.22 13:00, Julien Grall wrote:
>
>
> On 04/02/2022 10:35, Oleksandr Andrushchenko wrote:
>>
>>
>> On 04.02.22 11:57, Julien Grall wrote:
>>> Hi,
>>>
>>> On 04/02/2022 09:47, Oleksandr Andrushchenko wrote:
>>>>>> Could you please help me with the exact message you would like to see?
>>>>>
>>>>> Here a summary of the discussion (+ some my follow-up thoughts):
>>>>>
>>>>> is_memory_hole() was recently introduced on x86 (see commit 75cc460a1b8c "xen/pci: detect when BARs are not suitably positioned") to check whether the BAR are positioned outside of a valid memory range. This was introduced to work-around quirky firmware.
>>>>>
>>>>> In theory, this could also happen on Arm. In practice, this may not happen but it sounds better to sanity check that the BAR contains "valid" I/O range.
>>>>>
>>>>> On x86, this is implemented by checking the region is not described is in the e820. IIUC, on Arm, the BARs have to be positioned in pre-defined ranges. So I think it would be possible to implement is_memory_hole() by going through the list of hostbridges and check the ranges.
>>>>>
>>>>> But first, I'd like to confirm my understanding with Rahul, and others.
>>>>>
>>>>> If we were going to go this route, I would also rename the function to be better match what it is doing (i.e. it checks the BAR is correctly placed). As a potentially optimization/hardening for Arm, we could pass the hostbridge so we don't have to walk all of them.
>>>> It seems this needs to live in the commit message then? So, it is easy to find
>>>> as everything after "---" is going to be dropped on commit
>>> I expect the function to be fully implemented before this is will be merged.
>>>
>>> So if it is fully implemented, then a fair chunk of what I wrote would not be necessary to carry in the commit message.
>> Well, we started from that we want *something* with TODO and now
>> you request it to be fully implemented before it is merged.
>
> I don't think I ever suggested this patch would be merged as-is. Sorry if this may have crossed like this.
Np
>
> Instead, my intent by asking you to send a TODO patch is to start a discussion how this function could be implemented for Arm.
>
> You sent a TODO but you didn't provide any summary on what is the issue, what we want to achieve... Hence my request to add a bit more details so the other reviewers can provide their opinion more easily.
Ok, so we can discuss it here, but I won't have this patch in v7
>
> Cheers,
>
Thank you,
Oleksandr


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04 11:13             ` Roger Pau Monné
@ 2022-02-04 11:37               ` Jan Beulich
  2022-02-04 12:37                 ` Oleksandr Andrushchenko
  2022-02-04 11:37               ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-04 11:37 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Oleksandr Andrushchenko, julien, sstabellini,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel

On 04.02.2022 12:13, Roger Pau Monné wrote:
> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
>>> On 04.02.22 11:15, Jan Beulich wrote:
>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
>>>>> On 04.02.22 09:52, Jan Beulich wrote:
>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>                    continue;
>>>>>>>            }
>>>>>>>    
>>>>>>> +        spin_lock(&tmp->vpci_lock);
>>>>>>> +        if ( !tmp->vpci )
>>>>>>> +        {
>>>>>>> +            spin_unlock(&tmp->vpci_lock);
>>>>>>> +            continue;
>>>>>>> +        }
>>>>>>>            for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>>>>>            {
>>>>>>>                const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>                rc = rangeset_remove_range(mem, start, end);
>>>>>>>                if ( rc )
>>>>>>>                {
>>>>>>> +                spin_unlock(&tmp->vpci_lock);
>>>>>>>                    printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>>>>>>                           start, end, rc);
>>>>>>>                    rangeset_destroy(mem);
>>>>>>>                    return rc;
>>>>>>>                }
>>>>>>>            }
>>>>>>> +        spin_unlock(&tmp->vpci_lock);
>>>>>>>        }
>>>>>> At the first glance this simply looks like another unjustified (in the
>>>>>> description) change, as you're not converting anything here but you
>>>>>> actually add locking (and I realize this was there before, so I'm sorry
>>>>>> for not pointing this out earlier).
>>>>> Well, I thought that the description already has "...the lock can be
>>>>> used (and in a few cases is used right away) to check whether vpci
>>>>> is present" and this is enough for such uses as here.
>>>>>>    But then I wonder whether you
>>>>>> actually tested this, since I can't help getting the impression that
>>>>>> you're introducing a live-lock: The function is called from cmd_write()
>>>>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
>>>>>> function already holds the lock, and the lock is not (currently)
>>>>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
>>>>>> the locking looks to be entirely unnecessary.)
>>>>> Well, you are correct: if tmp != pdev then it is correct to acquire
>>>>> the lock. But if tmp == pdev and rom_only == true
>>>>> then we'll deadlock.
>>>>>
>>>>> It seems we need to have the locking conditional, e.g. only lock
>>>>> if tmp != pdev
>>>> Which will address the live-lock, but introduce ABBA deadlock potential
>>>> between the two locks.
>>> I am not sure I can suggest a better solution here
>>> @Roger, @Jan, could you please help here?
>>
>> Well, first of all I'd like to mention that while it may have been okay to
>> not hold pcidevs_lock here for Dom0, it surely needs acquiring when dealing
>> with DomU-s' lists of PCI devices. The requirement really applies to the
>> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
>> there it probably wants to be a try-lock.
>>
>> Next I'd like to point out that here we have the still pending issue of
>> how to deal with hidden devices, which Dom0 can access. See my RFC patch
>> "vPCI: account for hidden devices in modify_bars()". Whatever the solution
>> here, I think it wants to at least account for the extra need there.
> 
> Yes, sorry, I should take care of that.
> 
>> Now it is quite clear that pcidevs_lock isn't going to help with avoiding
>> the deadlock, as it's imo not an option at all to acquire that lock
>> everywhere else you access ->vpci (or else the vpci lock itself would be
>> pointless). But a per-domain auxiliary r/w lock may help: Other paths
>> would acquire it in read mode, and here you'd acquire it in write mode (in
>> the former case around the vpci lock, while in the latter case there may
>> then not be any need to acquire the individual vpci locks at all). FTAOD:
>> I haven't fully thought through all implications (and hence whether this is
>> viable in the first place); I expect you will, documenting what you've
>> found in the resulting patch description. Of course the double lock
>> acquire/release would then likely want hiding in helper functions.
> 
> I've been also thinking about this, and whether it's really worth to
> have a per-device lock rather than a per-domain one that protects all
> vpci regions of the devices assigned to the domain.
> 
> The OS is likely to serialize accesses to the PCI config space anyway,
> and the only place I could see a benefit of having per-device locks is
> in the handling of MSI-X tables, as the handling of the mask bit is
> likely very performance sensitive, so adding a per-domain lock there
> could be a bottleneck.

Hmm, with method 1 accesses, serializing globally is basically
unavoidable, but with MMCFG I see no reason why OSes may not (move
to) permit(ting) parallel accesses, with serialization perhaps done
only at device level. See our own pci_config_lock, which applies to
only method 1 accesses; we don't look to be serializing MMCFG
accesses at all.
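
For readers less familiar with the two mechanisms, the difference can be
seen in how an access is addressed. Method 1 writes an address to the
shared port 0xCF8 and then accesses data through port 0xCFC, so the two
steps must not interleave with another CPU's. MMCFG (ECAM) gives every
function its own 4 KiB memory window, so an access is a single load or
store. A minimal sketch of the two address computations follows; the
helper names are made up for illustration and are not Xen functions.

```c
#include <assert.h>
#include <stdint.h>

/* Value written to port 0xCF8 before a method 1 data access via 0xCFC. */
uint32_t cf8_address(unsigned int bus, unsigned int dev, unsigned int fn,
                     unsigned int reg)
{
    assert(bus < 256 && dev < 32 && fn < 8 && reg < 256);
    return 0x80000000u | (bus << 16) | (dev << 11) | (fn << 8) |
           (reg & 0xfcu);
}

/* Byte offset of a register inside the MMCFG (ECAM) window of a segment. */
uint64_t ecam_offset(unsigned int bus, unsigned int dev, unsigned int fn,
                     unsigned int reg)
{
    assert(bus < 256 && dev < 32 && fn < 8 && reg < 4096);
    return ((uint64_t)bus << 20) | (dev << 15) | (fn << 12) | reg;
}
```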

> We could alternatively do a per-domain rwlock for vpci and special case
> the MSI-X area to also have a per-device specific lock. At which point
> it becomes fairly similar to what you propose.

Indeed.

Jan




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04 11:13             ` Roger Pau Monné
  2022-02-04 11:37               ` Jan Beulich
@ 2022-02-04 11:37               ` Oleksandr Andrushchenko
  2022-02-04 12:15                 ` Roger Pau Monné
  1 sibling, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04 11:37 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel



On 04.02.22 13:13, Roger Pau Monné wrote:
> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
>>> On 04.02.22 11:15, Jan Beulich wrote:
>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
>>>>> On 04.02.22 09:52, Jan Beulich wrote:
>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>                     continue;
>>>>>>>             }
>>>>>>>     
>>>>>>> +        spin_lock(&tmp->vpci_lock);
>>>>>>> +        if ( !tmp->vpci )
>>>>>>> +        {
>>>>>>> +            spin_unlock(&tmp->vpci_lock);
>>>>>>> +            continue;
>>>>>>> +        }
>>>>>>>             for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>>>>>             {
>>>>>>>                 const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>                 rc = rangeset_remove_range(mem, start, end);
>>>>>>>                 if ( rc )
>>>>>>>                 {
>>>>>>> +                spin_unlock(&tmp->vpci_lock);
>>>>>>>                     printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>>>>>>                            start, end, rc);
>>>>>>>                     rangeset_destroy(mem);
>>>>>>>                     return rc;
>>>>>>>                 }
>>>>>>>             }
>>>>>>> +        spin_unlock(&tmp->vpci_lock);
>>>>>>>         }
>>>>>> At the first glance this simply looks like another unjustified (in the
>>>>>> description) change, as you're not converting anything here but you
>>>>>> actually add locking (and I realize this was there before, so I'm sorry
>>>>>> for not pointing this out earlier).
>>>>> Well, I thought that the description already has "...the lock can be
>>>>> used (and in a few cases is used right away) to check whether vpci
>>>>> is present" and this is enough for such uses as here.
>>>>>>     But then I wonder whether you
>>>>>> actually tested this, since I can't help getting the impression that
>>>>>> you're introducing a live-lock: The function is called from cmd_write()
>>>>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
>>>>>> function already holds the lock, and the lock is not (currently)
>>>>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
>>>>>> the locking looks to be entirely unnecessary.)
>>>>> Well, you are correct: if tmp != pdev then it is correct to acquire
>>>>> the lock. But if tmp == pdev and rom_only == true
>>>>> then we'll deadlock.
>>>>>
>>>>> It seems we need to have the locking conditional, e.g. only lock
>>>>> if tmp != pdev
>>>> Which will address the live-lock, but introduce ABBA deadlock potential
>>>> between the two locks.
>>> I am not sure I can suggest a better solution here
>>> @Roger, @Jan, could you please help here?
>> Well, first of all I'd like to mention that while it may have been okay to
>> not hold pcidevs_lock here for Dom0, it surely needs acquiring when dealing
>> with DomU-s' lists of PCI devices. The requirement really applies to the
>> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
>> there it probably wants to be a try-lock.
>>
>> Next I'd like to point out that here we have the still pending issue of
>> how to deal with hidden devices, which Dom0 can access. See my RFC patch
>> "vPCI: account for hidden devices in modify_bars()". Whatever the solution
>> here, I think it wants to at least account for the extra need there.
> Yes, sorry, I should take care of that.
>
>> Now it is quite clear that pcidevs_lock isn't going to help with avoiding
>> the deadlock, as it's imo not an option at all to acquire that lock
>> everywhere else you access ->vpci (or else the vpci lock itself would be
>> pointless). But a per-domain auxiliary r/w lock may help: Other paths
>> would acquire it in read mode, and here you'd acquire it in write mode (in
>> the former case around the vpci lock, while in the latter case there may
>> then not be any need to acquire the individual vpci locks at all). FTAOD:
>> I haven't fully thought through all implications (and hence whether this is
>> viable in the first place); I expect you will, documenting what you've
>> found in the resulting patch description. Of course the double lock
>> acquire/release would then likely want hiding in helper functions.
> I've been also thinking about this, and whether it's really worth to
> have a per-device lock rather than a per-domain one that protects all
> vpci regions of the devices assigned to the domain.
>
> The OS is likely to serialize accesses to the PCI config space anyway,
> and the only place I could see a benefit of having per-device locks is
> in the handling of MSI-X tables, as the handling of the mask bit is
> likely very performance sensitive, so adding a per-domain lock there
> could be a bottleneck.
>
> We could alternatively do a per-domain rwlock for vpci and special case
> the MSI-X area to also have a per-device specific lock. At which point
> it becomes fairly similar to what you propose.
I need a decision.
Please.
>
> Thanks, Roger.
Thank you,
Oleksandr


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04 11:37               ` Oleksandr Andrushchenko
@ 2022-02-04 12:15                 ` Roger Pau Monné
  0 siblings, 0 replies; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-04 12:15 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel

On Fri, Feb 04, 2022 at 11:37:50AM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 04.02.22 13:13, Roger Pau Monné wrote:
> > On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
> >> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
> >>> On 04.02.22 11:15, Jan Beulich wrote:
> >>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
> >>>>> On 04.02.22 09:52, Jan Beulich wrote:
> >>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> >>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>>>>>>                     continue;
> >>>>>>>             }
> >>>>>>>     
> >>>>>>> +        spin_lock(&tmp->vpci_lock);
> >>>>>>> +        if ( !tmp->vpci )
> >>>>>>> +        {
> >>>>>>> +            spin_unlock(&tmp->vpci_lock);
> >>>>>>> +            continue;
> >>>>>>> +        }
> >>>>>>>             for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
> >>>>>>>             {
> >>>>>>>                 const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
> >>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>>>>>>                 rc = rangeset_remove_range(mem, start, end);
> >>>>>>>                 if ( rc )
> >>>>>>>                 {
> >>>>>>> +                spin_unlock(&tmp->vpci_lock);
> >>>>>>>                     printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
> >>>>>>>                            start, end, rc);
> >>>>>>>                     rangeset_destroy(mem);
> >>>>>>>                     return rc;
> >>>>>>>                 }
> >>>>>>>             }
> >>>>>>> +        spin_unlock(&tmp->vpci_lock);
> >>>>>>>         }
> >>>>>> At the first glance this simply looks like another unjustified (in the
> >>>>>> description) change, as you're not converting anything here but you
> >>>>>> actually add locking (and I realize this was there before, so I'm sorry
> >>>>>> for not pointing this out earlier).
> >>>>> Well, I thought that the description already has "...the lock can be
> >>>>> used (and in a few cases is used right away) to check whether vpci
> >>>>> is present" and this is enough for such uses as here.
> >>>>>>     But then I wonder whether you
> >>>>>> actually tested this, since I can't help getting the impression that
> >>>>>> you're introducing a live-lock: The function is called from cmd_write()
> >>>>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
> >>>>>> function already holds the lock, and the lock is not (currently)
> >>>>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
> >>>>>> the locking looks to be entirely unnecessary.)
> >>>>> Well, you are correct: if tmp != pdev then it is correct to acquire
> >>>>> the lock. But if tmp == pdev and rom_only == true
> >>>>> then we'll deadlock.
> >>>>>
> >>>>> It seems we need to have the locking conditional, e.g. only lock
> >>>>> if tmp != pdev
> >>>> Which will address the live-lock, but introduce ABBA deadlock potential
> >>>> between the two locks.
> >>> I am not sure I can suggest a better solution here
> >>> @Roger, @Jan, could you please help here?
> >> Well, first of all I'd like to mention that while it may have been okay to
> >> not hold pcidevs_lock here for Dom0, it surely needs acquiring when dealing
> >> with DomU-s' lists of PCI devices. The requirement really applies to the
> >> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
> >> there it probably wants to be a try-lock.
> >>
> >> Next I'd like to point out that here we have the still pending issue of
> >> how to deal with hidden devices, which Dom0 can access. See my RFC patch
> >> "vPCI: account for hidden devices in modify_bars()". Whatever the solution
> >> here, I think it wants to at least account for the extra need there.
> > Yes, sorry, I should take care of that.
> >
> >> Now it is quite clear that pcidevs_lock isn't going to help with avoiding
> >> the deadlock, as it's imo not an option at all to acquire that lock
> >> everywhere else you access ->vpci (or else the vpci lock itself would be
> >> pointless). But a per-domain auxiliary r/w lock may help: Other paths
> >> would acquire it in read mode, and here you'd acquire it in write mode (in
> >> the former case around the vpci lock, while in the latter case there may
> >> then not be any need to acquire the individual vpci locks at all). FTAOD:
> >> I haven't fully thought through all implications (and hence whether this is
> >> viable in the first place); I expect you will, documenting what you've
> >> found in the resulting patch description. Of course the double lock
> >> acquire/release would then likely want hiding in helper functions.
> > I've been also thinking about this, and whether it's really worth to
> > have a per-device lock rather than a per-domain one that protects all
> > vpci regions of the devices assigned to the domain.
> >
> > The OS is likely to serialize accesses to the PCI config space anyway,
> > and the only place I could see a benefit of having per-device locks is
> > in the handling of MSI-X tables, as the handling of the mask bit is
> > likely very performance sensitive, so adding a per-domain lock there
> > could be a bottleneck.
> >
> > We could alternatively do a per-domain rwlock for vpci and special case
> > the MSI-X area to also have a per-device specific lock. At which point
> > it becomes fairly similar to what you propose.
> I need a decision.
> Please.

I'm afraid that's up to you. I cannot be sure that any of the proposed
options will actually be viable until someone attempts to implement
them. I wouldn't want to impose a solution on you because I cannot
guarantee it will work or result in better code than other options.

I think there are two options:

1. Set a lock ordering for double locking (based on the memory address
   of the lock for example).

2. Introduce a per-domain rwlock that protects all of the devices
   assigned to a domain.
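
Option 1 could look roughly like the sketch below. It is an illustration
only: POSIX mutexes stand in for Xen spinlocks, and lock_order_first(),
lock_pair() and unlock_pair() are hypothetical helpers, not existing
functions. Every path that needs two locks takes the one at the lower
address first, so two CPUs can never hold them in opposite order.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Return whichever of the two locks must be taken first. */
pthread_mutex_t *lock_order_first(pthread_mutex_t *a, pthread_mutex_t *b)
{
    return (uintptr_t)a < (uintptr_t)b ? a : b;
}

/* Take two distinct locks in ascending address order. */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_t *first = lock_order_first(a, b);
    pthread_mutex_t *second = (first == a) ? b : a;

    assert(a != b);
    pthread_mutex_lock(first);
    pthread_mutex_lock(second);
}

/* Release order does not matter for deadlock avoidance. */
void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}
```

In modify_bars() the caller already holds one of the two locks, so
applying this there would likely mean dropping and re-taking that lock
whenever the other one sorts first. The sketch leaves that part out.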

Thanks, Roger.



* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04 11:37               ` Jan Beulich
@ 2022-02-04 12:37                 ` Oleksandr Andrushchenko
  2022-02-04 12:47                   ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04 12:37 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel



On 04.02.22 13:37, Jan Beulich wrote:
> On 04.02.2022 12:13, Roger Pau Monné wrote:
>> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
>>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
>>>> On 04.02.22 11:15, Jan Beulich wrote:
>>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
>>>>>> On 04.02.22 09:52, Jan Beulich wrote:
>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>                     continue;
>>>>>>>>             }
>>>>>>>>     
>>>>>>>> +        spin_lock(&tmp->vpci_lock);
>>>>>>>> +        if ( !tmp->vpci )
>>>>>>>> +        {
>>>>>>>> +            spin_unlock(&tmp->vpci_lock);
>>>>>>>> +            continue;
>>>>>>>> +        }
>>>>>>>>             for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>>>>>>             {
>>>>>>>>                 const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>                 rc = rangeset_remove_range(mem, start, end);
>>>>>>>>                 if ( rc )
>>>>>>>>                 {
>>>>>>>> +                spin_unlock(&tmp->vpci_lock);
>>>>>>>>                     printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>>>>>>>                            start, end, rc);
>>>>>>>>                     rangeset_destroy(mem);
>>>>>>>>                     return rc;
>>>>>>>>                 }
>>>>>>>>             }
>>>>>>>> +        spin_unlock(&tmp->vpci_lock);
>>>>>>>>         }
>>>>>>> At the first glance this simply looks like another unjustified (in the
>>>>>>> description) change, as you're not converting anything here but you
>>>>>>> actually add locking (and I realize this was there before, so I'm sorry
>>>>>>> for not pointing this out earlier).
>>>>>> Well, I thought that the description already has "...the lock can be
>>>>>> used (and in a few cases is used right away) to check whether vpci
>>>>>> is present" and this is enough for such uses as here.
>>>>>>>     But then I wonder whether you
>>>>>>> actually tested this, since I can't help getting the impression that
>>>>>>> you're introducing a live-lock: The function is called from cmd_write()
>>>>>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
>>>>>>> function already holds the lock, and the lock is not (currently)
>>>>>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
>>>>>>> the locking looks to be entirely unnecessary.)
>>>>>> Well, you are correct: if tmp != pdev then it is correct to acquire
>>>>>> the lock. But if tmp == pdev and rom_only == true
>>>>>> then we'll deadlock.
>>>>>>
>>>>>> It seems we need to have the locking conditional, e.g. only lock
>>>>>> if tmp != pdev
>>>>> Which will address the live-lock, but introduce ABBA deadlock potential
>>>>> between the two locks.
>>>> I am not sure I can suggest a better solution here
>>>> @Roger, @Jan, could you please help here?
>>> Well, first of all I'd like to mention that while it may have been okay to
>>> not hold pcidevs_lock here for Dom0, it surely needs acquiring when dealing
>>> with DomU-s' lists of PCI devices. The requirement really applies to the
>>> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
>>> there it probably wants to be a try-lock.
>>>
>>> Next I'd like to point out that here we have the still pending issue of
>>> how to deal with hidden devices, which Dom0 can access. See my RFC patch
>>> "vPCI: account for hidden devices in modify_bars()". Whatever the solution
>>> here, I think it wants to at least account for the extra need there.
>> Yes, sorry, I should take care of that.
>>
>>> Now it is quite clear that pcidevs_lock isn't going to help with avoiding
>>> the deadlock, as it's imo not an option at all to acquire that lock
>>> everywhere else you access ->vpci (or else the vpci lock itself would be
>>> pointless). But a per-domain auxiliary r/w lock may help: Other paths
>>> would acquire it in read mode, and here you'd acquire it in write mode (in
>>> the former case around the vpci lock, while in the latter case there may
>>> then not be any need to acquire the individual vpci locks at all). FTAOD:
>>> I haven't fully thought through all implications (and hence whether this is
>>> viable in the first place); I expect you will, documenting what you've
>>> found in the resulting patch description. Of course the double lock
>>> acquire/release would then likely want hiding in helper functions.
>> I've been also thinking about this, and whether it's really worth to
>> have a per-device lock rather than a per-domain one that protects all
>> vpci regions of the devices assigned to the domain.
>>
>> The OS is likely to serialize accesses to the PCI config space anyway,
>> and the only place I could see a benefit of having per-device locks is
>> in the handling of MSI-X tables, as the handling of the mask bit is
>> likely very performance sensitive, so adding a per-domain lock there
>> could be a bottleneck.
> Hmm, with method 1 accesses serializing globally is basically
> unavoidable, but with MMCFG I see no reason why OSes may not (move
> to) permit(ting) parallel accesses, with serialization perhaps done
> only at device level. See our own pci_config_lock, which applies to
> only method 1 accesses; we don't look to be serializing MMCFG
> accesses at all.
>
>> We could alternatively do a per-domain rwlock for vpci and special case
>> the MSI-X area to also have a per-device specific lock. At which point
>> it becomes fairly similar to what you propose.
@Jan, @Roger

1. d->vpci_lock - rwlock <- this protects vpci
2. pdev->vpci->msix_tbl_lock - rwlock <- this protects MSI-X tables
or should it better be pdev->msix_tbl_lock as MSI-X tables don't
really depend on vPCI?

Does this sound like something that could fly?
It takes quite a while to implement and test, so I would like to settle
on the approach before putting effort into it.
> Indeed.
>
> Jan
>
Thank you in advance,
Oleksandr


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04 12:37                 ` Oleksandr Andrushchenko
@ 2022-02-04 12:47                   ` Jan Beulich
  2022-02-04 12:53                     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-04 12:47 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 04.02.2022 13:37, Oleksandr Andrushchenko wrote:
> 
> 
> On 04.02.22 13:37, Jan Beulich wrote:
>> On 04.02.2022 12:13, Roger Pau Monné wrote:
>>> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
>>>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
>>>>> On 04.02.22 11:15, Jan Beulich wrote:
>>>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
>>>>>>> On 04.02.22 09:52, Jan Beulich wrote:
>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>                     continue;
>>>>>>>>>             }
>>>>>>>>>     
>>>>>>>>> +        spin_lock(&tmp->vpci_lock);
>>>>>>>>> +        if ( !tmp->vpci )
>>>>>>>>> +        {
>>>>>>>>> +            spin_unlock(&tmp->vpci_lock);
>>>>>>>>> +            continue;
>>>>>>>>> +        }
>>>>>>>>>             for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>>>>>>>             {
>>>>>>>>>                 const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>                 rc = rangeset_remove_range(mem, start, end);
>>>>>>>>>                 if ( rc )
>>>>>>>>>                 {
>>>>>>>>> +                spin_unlock(&tmp->vpci_lock);
>>>>>>>>>                     printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>>>>>>>>                            start, end, rc);
>>>>>>>>>                     rangeset_destroy(mem);
>>>>>>>>>                     return rc;
>>>>>>>>>                 }
>>>>>>>>>             }
>>>>>>>>> +        spin_unlock(&tmp->vpci_lock);
>>>>>>>>>         }
>>>>>>>> At the first glance this simply looks like another unjustified (in the
>>>>>>>> description) change, as you're not converting anything here but you
>>>>>>>> actually add locking (and I realize this was there before, so I'm sorry
>>>>>>>> for not pointing this out earlier).
>>>>>>> Well, I thought that the description already has "...the lock can be
>>>>>>> used (and in a few cases is used right away) to check whether vpci
>>>>>>> is present" and this is enough for such uses as here.
>>>>>>>>     But then I wonder whether you
>>>>>>>> actually tested this, since I can't help getting the impression that
>>>>>>>> you're introducing a live-lock: The function is called from cmd_write()
>>>>>>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
>>>>>>>> function already holds the lock, and the lock is not (currently)
>>>>>>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
>>>>>>>> the locking looks to be entirely unnecessary.)
>>>>>>> Well, you are correct: if tmp != pdev then it is correct to acquire
>>>>>>> the lock. But if tmp == pdev and rom_only == true
>>>>>>> then we'll deadlock.
>>>>>>>
>>>>>>> It seems we need to have the locking conditional, e.g. only lock
>>>>>>> if tmp != pdev
>>>>>> Which will address the live-lock, but introduce ABBA deadlock potential
>>>>>> between the two locks.
>>>>> I am not sure I can suggest a better solution here
>>>>> @Roger, @Jan, could you please help here?
>>>> Well, first of all I'd like to mention that while it may have been okay to
>>>> not hold pcidevs_lock here for Dom0, it surely needs acquiring when dealing
>>>> with DomU-s' lists of PCI devices. The requirement really applies to the
>>>> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
>>>> there it probably wants to be a try-lock.
>>>>
>>>> Next I'd like to point out that here we have the still pending issue of
>>>> how to deal with hidden devices, which Dom0 can access. See my RFC patch
>>>> "vPCI: account for hidden devices in modify_bars()". Whatever the solution
>>>> here, I think it wants to at least account for the extra need there.
>>> Yes, sorry, I should take care of that.
>>>
>>>> Now it is quite clear that pcidevs_lock isn't going to help with avoiding
>>>> the deadlock, as it's imo not an option at all to acquire that lock
>>>> everywhere else you access ->vpci (or else the vpci lock itself would be
>>>> pointless). But a per-domain auxiliary r/w lock may help: Other paths
>>>> would acquire it in read mode, and here you'd acquire it in write mode (in
>>>> the former case around the vpci lock, while in the latter case there may
>>>> then not be any need to acquire the individual vpci locks at all). FTAOD:
>>>> I haven't fully thought through all implications (and hence whether this is
>>>> viable in the first place); I expect you will, documenting what you've
>>>> found in the resulting patch description. Of course the double lock
>>>> acquire/release would then likely want hiding in helper functions.
>>> I've been also thinking about this, and whether it's really worth to
>>> have a per-device lock rather than a per-domain one that protects all
>>> vpci regions of the devices assigned to the domain.
>>>
>>> The OS is likely to serialize accesses to the PCI config space anyway,
>>> and the only place I could see a benefit of having per-device locks is
>>> in the handling of MSI-X tables, as the handling of the mask bit is
>>> likely very performance sensitive, so adding a per-domain lock there
>>> could be a bottleneck.
>> Hmm, with method 1 accesses serializing globally is basically
>> unavoidable, but with MMCFG I see no reason why OSes may not (move
>> to) permit(ting) parallel accesses, with serialization perhaps done
>> only at device level. See our own pci_config_lock, which applies to
>> only method 1 accesses; we don't look to be serializing MMCFG
>> accesses at all.
>>
>>> We could alternatively do a per-domain rwlock for vpci and special case
>>> the MSI-X area to also have a per-device specific lock. At which point
>>> it becomes fairly similar to what you propose.
> @Jan, @Roger
> 
> 1. d->vpci_lock - rwlock <- this protects vpci
> 2. pdev->vpci->msix_tbl_lock - rwlock <- this protects MSI-X tables
> or should it better be pdev->msix_tbl_lock as MSI-X tables don't
> really depend on vPCI?

If so, perhaps indeed better the latter. But as said in reply to Roger,
I'm not convinced (yet) that doing away with the per-device lock is a
good move. As said there - we're ourselves doing fully parallel MMCFG
accesses, so OSes ought to be fine to do so, too.

Jan




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04 12:47                   ` Jan Beulich
@ 2022-02-04 12:53                     ` Oleksandr Andrushchenko
  2022-02-04 13:03                       ` Jan Beulich
  2022-02-04 13:06                       ` Roger Pau Monné
  0 siblings, 2 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04 12:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné,
	Oleksandr Andrushchenko



On 04.02.22 14:47, Jan Beulich wrote:
> On 04.02.2022 13:37, Oleksandr Andrushchenko wrote:
>>
>> On 04.02.22 13:37, Jan Beulich wrote:
>>> On 04.02.2022 12:13, Roger Pau Monné wrote:
>>>> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
>>>>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
>>>>>> On 04.02.22 11:15, Jan Beulich wrote:
>>>>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
>>>>>>>> On 04.02.22 09:52, Jan Beulich wrote:
>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>                      continue;
>>>>>>>>>>              }
>>>>>>>>>>      
>>>>>>>>>> +        spin_lock(&tmp->vpci_lock);
>>>>>>>>>> +        if ( !tmp->vpci )
>>>>>>>>>> +        {
>>>>>>>>>> +            spin_unlock(&tmp->vpci_lock);
>>>>>>>>>> +            continue;
>>>>>>>>>> +        }
>>>>>>>>>>              for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>>>>>>>>              {
>>>>>>>>>>                  const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>                  rc = rangeset_remove_range(mem, start, end);
>>>>>>>>>>                  if ( rc )
>>>>>>>>>>                  {
>>>>>>>>>> +                spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>                      printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>>>>>>>>>                             start, end, rc);
>>>>>>>>>>                      rangeset_destroy(mem);
>>>>>>>>>>                      return rc;
>>>>>>>>>>                  }
>>>>>>>>>>              }
>>>>>>>>>> +        spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>          }
>>>>>>>>> At the first glance this simply looks like another unjustified (in the
>>>>>>>>> description) change, as you're not converting anything here but you
>>>>>>>>> actually add locking (and I realize this was there before, so I'm sorry
>>>>>>>>> for not pointing this out earlier).
>>>>>>>> Well, I thought that the description already has "...the lock can be
>>>>>>>> used (and in a few cases is used right away) to check whether vpci
>>>>>>>> is present" and this is enough for such uses as here.
>>>>>>>>>      But then I wonder whether you
>>>>>>>>> actually tested this, since I can't help getting the impression that
>>>>>>>>> you're introducing a live-lock: The function is called from cmd_write()
>>>>>>>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
>>>>>>>>> function already holds the lock, and the lock is not (currently)
>>>>>>>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
>>>>>>>>> the locking looks to be entirely unnecessary.)
>>>>>>>> Well, you are correct: if tmp != pdev then it is correct to acquire
>>>>>>>> the lock. But if tmp == pdev and rom_only == true
>>>>>>>> then we'll deadlock.
>>>>>>>>
>>>>>>>> It seems we need to have the locking conditional, e.g. only lock
>>>>>>>> if tmp != pdev
>>>>>>> Which will address the live-lock, but introduce ABBA deadlock potential
>>>>>>> between the two locks.
>>>>>> I am not sure I can suggest a better solution here
>>>>>> @Roger, @Jan, could you please help here?
>>>>> Well, first of all I'd like to mention that while it may have been okay to
>>>>> not hold pcidevs_lock here for Dom0, it surely needs acquiring when dealing
>>>>> with DomU-s' lists of PCI devices. The requirement really applies to the
>>>>> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
>>>>> there it probably wants to be a try-lock.
>>>>>
>>>>> Next I'd like to point out that here we have the still pending issue of
>>>>> how to deal with hidden devices, which Dom0 can access. See my RFC patch
>>>>> "vPCI: account for hidden devices in modify_bars()". Whatever the solution
>>>>> here, I think it wants to at least account for the extra need there.
>>>> Yes, sorry, I should take care of that.
>>>>
>>>>> Now it is quite clear that pcidevs_lock isn't going to help with avoiding
>>>>> the deadlock, as it's imo not an option at all to acquire that lock
>>>>> everywhere else you access ->vpci (or else the vpci lock itself would be
>>>>> pointless). But a per-domain auxiliary r/w lock may help: Other paths
>>>>> would acquire it in read mode, and here you'd acquire it in write mode (in
>>>>> the former case around the vpci lock, while in the latter case there may
>>>>> then not be any need to acquire the individual vpci locks at all). FTAOD:
>>>>> I haven't fully thought through all implications (and hence whether this is
>>>>> viable in the first place); I expect you will, documenting what you've
>>>>> found in the resulting patch description. Of course the double lock
>>>>> acquire/release would then likely want hiding in helper functions.
>>>> I've been also thinking about this, and whether it's really worth to
>>>> have a per-device lock rather than a per-domain one that protects all
>>>> vpci regions of the devices assigned to the domain.
>>>>
>>>> The OS is likely to serialize accesses to the PCI config space anyway,
>>>> and the only place I could see a benefit of having per-device locks is
>>>> in the handling of MSI-X tables, as the handling of the mask bit is
>>>> likely very performance sensitive, so adding a per-domain lock there
>>>> could be a bottleneck.
>>> Hmm, with method 1 accesses serializing globally is basically
>>> unavoidable, but with MMCFG I see no reason why OSes may not (move
>>> to) permit(ting) parallel accesses, with serialization perhaps done
>>> only at device level. See our own pci_config_lock, which applies to
>>> only method 1 accesses; we don't look to be serializing MMCFG
>>> accesses at all.
>>>
>>>> We could alternatively do a per-domain rwlock for vpci and special case
>>>> the MSI-X area to also have a per-device specific lock. At which point
>>>> it becomes fairly similar to what you propose.
>> @Jan, @Roger
>>
>> 1. d->vpci_lock - rwlock <- this protects vpci
>> 2. pdev->vpci->msix_tbl_lock - rwlock <- this protects MSI-X tables
>> or should it better be pdev->msix_tbl_lock as MSI-X tables don't
>> really depend on vPCI?
> If so, perhaps indeed better the latter. But as said in reply to Roger,
> I'm not convinced (yet) that doing away with the per-device lock is a
> good move. As said there - we're ourselves doing fully parallel MMCFG
> accesses, so OSes ought to be fine to do so, too.
But with pdev->vpci_lock we face ABBA...
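
For readers following along, the ABBA concern with per-device locks can be
modelled in a few lines. This is only a single-threaded toy: toy_lock,
toy_trylock() and lock_other() are hypothetical names that merely record
lock ownership, and the struct pci_dev below is reduced to its lock. It
assumes the "only lock if tmp != pdev" variant discussed above.

```c
#include <stdbool.h>

/* Toy lock that only records its owner; -1 means free. Not thread safe. */
struct toy_lock { int owner; };

static bool toy_trylock(struct toy_lock *l, int cpu)
{
    if ( l->owner != -1 )
        return false;           /* a real spinlock would spin here */
    l->owner = cpu;
    return true;
}

static void toy_unlock(struct toy_lock *l)
{
    l->owner = -1;
}

/* Hypothetical, reduced device: only the lock matters for this model. */
struct pci_dev { struct toy_lock vpci_lock; };

/*
 * Second step of a modify_bars()-like walk. The caller already holds
 * pdev's lock (taken on the vpci_write() path) and now wants tmp's lock.
 */
static bool lock_other(struct pci_dev *pdev, struct pci_dev *tmp, int cpu)
{
    if ( tmp == pdev )
        return true;            /* conditional locking: skip our own lock */
    return toy_trylock(&tmp->vpci_lock, cpu);
}
```

If CPU0 enters with device A locked while CPU1 enters with device B locked,
each then asks for the other's lock and neither request can succeed. With
real spinlocks both CPUs would spin forever.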
>
> Jan
>
>


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04 12:53                     ` Oleksandr Andrushchenko
@ 2022-02-04 13:03                       ` Jan Beulich
  2022-02-04 13:06                       ` Roger Pau Monné
  1 sibling, 0 replies; 138+ messages in thread
From: Jan Beulich @ 2022-02-04 13:03 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 04.02.2022 13:53, Oleksandr Andrushchenko wrote:
> 
> 
> On 04.02.22 14:47, Jan Beulich wrote:
>> On 04.02.2022 13:37, Oleksandr Andrushchenko wrote:
>>>
>>> On 04.02.22 13:37, Jan Beulich wrote:
>>>> On 04.02.2022 12:13, Roger Pau Monné wrote:
>>>>> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
>>>>>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
>>>>>>> On 04.02.22 11:15, Jan Beulich wrote:
>>>>>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
>>>>>>>>> On 04.02.22 09:52, Jan Beulich wrote:
>>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>>                      continue;
>>>>>>>>>>>              }
>>>>>>>>>>>      
>>>>>>>>>>> +        spin_lock(&tmp->vpci_lock);
>>>>>>>>>>> +        if ( !tmp->vpci )
>>>>>>>>>>> +        {
>>>>>>>>>>> +            spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>> +            continue;
>>>>>>>>>>> +        }
>>>>>>>>>>>              for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>>>>>>>>>              {
>>>>>>>>>>>                  const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>>>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>>                  rc = rangeset_remove_range(mem, start, end);
>>>>>>>>>>>                  if ( rc )
>>>>>>>>>>>                  {
>>>>>>>>>>> +                spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>                      printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>>>>>>>>>>                             start, end, rc);
>>>>>>>>>>>                      rangeset_destroy(mem);
>>>>>>>>>>>                      return rc;
>>>>>>>>>>>                  }
>>>>>>>>>>>              }
>>>>>>>>>>> +        spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>          }
>>>>>>>>>> At the first glance this simply looks like another unjustified (in the
>>>>>>>>>> description) change, as you're not converting anything here but you
>>>>>>>>>> actually add locking (and I realize this was there before, so I'm sorry
>>>>>>>>>> for not pointing this out earlier).
>>>>>>>>> Well, I thought that the description already has "...the lock can be
>>>>>>>>> used (and in a few cases is used right away) to check whether vpci
>>>>>>>>> is present" and this is enough for such uses as here.
>>>>>>>>>>      But then I wonder whether you
>>>>>>>>>> actually tested this, since I can't help getting the impression that
>>>>>>>>>> you're introducing a live-lock: The function is called from cmd_write()
>>>>>>>>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
>>>>>>>>>> function already holds the lock, and the lock is not (currently)
>>>>>>>>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
>>>>>>>>>> the locking looks to be entirely unnecessary.)
>>>>>>>>> Well, you are correct: if tmp != pdev then it is correct to acquire
>>>>>>>>> the lock. But if tmp == pdev and rom_only == true
>>>>>>>>> then we'll deadlock.
>>>>>>>>>
>>>>>>>>> It seems we need to have the locking conditional, e.g. only lock
>>>>>>>>> if tmp != pdev
>>>>>>>> Which will address the live-lock, but introduce ABBA deadlock potential
>>>>>>>> between the two locks.
>>>>>>> I am not sure I can suggest a better solution here
>>>>>>> @Roger, @Jan, could you please help here?
>>>>>> Well, first of all I'd like to mention that while it may have been okay to
>>>>>> not hold pcidevs_lock here for Dom0, it surely needs acquiring when dealing
>>>>>> with DomU-s' lists of PCI devices. The requirement really applies to the
>>>>>> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
>>>>>> there it probably wants to be a try-lock.
>>>>>>
>>>>>> Next I'd like to point out that here we have the still pending issue of
>>>>>> how to deal with hidden devices, which Dom0 can access. See my RFC patch
>>>>>> "vPCI: account for hidden devices in modify_bars()". Whatever the solution
>>>>>> here, I think it wants to at least account for the extra need there.
>>>>> Yes, sorry, I should take care of that.
>>>>>
>>>>>> Now it is quite clear that pcidevs_lock isn't going to help with avoiding
>>>>>> the deadlock, as it's imo not an option at all to acquire that lock
>>>>>> everywhere else you access ->vpci (or else the vpci lock itself would be
>>>>>> pointless). But a per-domain auxiliary r/w lock may help: Other paths
>>>>>> would acquire it in read mode, and here you'd acquire it in write mode (in
>>>>>> the former case around the vpci lock, while in the latter case there may
>>>>>> then not be any need to acquire the individual vpci locks at all). FTAOD:
>>>>>> I haven't fully thought through all implications (and hence whether this is
>>>>>> viable in the first place); I expect you will, documenting what you've
>>>>>> found in the resulting patch description. Of course the double lock
>>>>>> acquire/release would then likely want hiding in helper functions.
>>>>> I've been also thinking about this, and whether it's really worth to
>>>>> have a per-device lock rather than a per-domain one that protects all
>>>>> vpci regions of the devices assigned to the domain.
>>>>>
>>>>> The OS is likely to serialize accesses to the PCI config space anyway,
>>>>> and the only place I could see a benefit of having per-device locks is
>>>>> in the handling of MSI-X tables, as the handling of the mask bit is
>>>>> likely very performance sensitive, so adding a per-domain lock there
>>>>> could be a bottleneck.
>>>> Hmm, with method 1 accesses serializing globally is basically
>>>> unavoidable, but with MMCFG I see no reason why OSes may not (move
>>>> to) permit(ting) parallel accesses, with serialization perhaps done
>>>> only at device level. See our own pci_config_lock, which applies to
>>>> only method 1 accesses; we don't look to be serializing MMCFG
>>>> accesses at all.
>>>>
>>>>> We could alternatively do a per-domain rwlock for vpci and special case
>>>>> the MSI-X area to also have a per-device specific lock. At which point
>>>>> it becomes fairly similar to what you propose.
>>> @Jan, @Roger
>>>
>>> 1. d->vpci_lock - rwlock <- this protects vpci
>>> 2. pdev->vpci->msix_tbl_lock - rwlock <- this protects MSI-X tables
>>> or should it better be pdev->msix_tbl_lock as MSI-X tables don't
>>> really depend on vPCI?
>> If so, perhaps indeed better the latter. But as said in reply to Roger,
>> I'm not convinced (yet) that doing away with the per-device lock is a
>> good move. As said there - we're ourselves doing fully parallel MMCFG
>> accesses, so OSes ought to be fine to do so, too.
> But with pdev->vpci_lock we face ABBA...

I didn't say without per-domain r/w lock, did I? I stand by my earlier
outline.

Jan




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04 12:53                     ` Oleksandr Andrushchenko
  2022-02-04 13:03                       ` Jan Beulich
@ 2022-02-04 13:06                       ` Roger Pau Monné
  2022-02-04 14:43                         ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-04 13:06 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel

On Fri, Feb 04, 2022 at 12:53:20PM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 04.02.22 14:47, Jan Beulich wrote:
> > On 04.02.2022 13:37, Oleksandr Andrushchenko wrote:
> >>
> >> On 04.02.22 13:37, Jan Beulich wrote:
> >>> On 04.02.2022 12:13, Roger Pau Monné wrote:
> >>>> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
> >>>>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
> >>>>>> On 04.02.22 11:15, Jan Beulich wrote:
> >>>>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
> >>>>>>>> On 04.02.22 09:52, Jan Beulich wrote:
> >>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> >>>>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>>>>>>>>>                      continue;
> >>>>>>>>>>              }
> >>>>>>>>>>      
> >>>>>>>>>> +        spin_lock(&tmp->vpci_lock);
> >>>>>>>>>> +        if ( !tmp->vpci )
> >>>>>>>>>> +        {
> >>>>>>>>>> +            spin_unlock(&tmp->vpci_lock);
> >>>>>>>>>> +            continue;
> >>>>>>>>>> +        }
> >>>>>>>>>>              for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
> >>>>>>>>>>              {
> >>>>>>>>>>                  const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
> >>>>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>>>>>>>>>                  rc = rangeset_remove_range(mem, start, end);
> >>>>>>>>>>                  if ( rc )
> >>>>>>>>>>                  {
> >>>>>>>>>> +                spin_unlock(&tmp->vpci_lock);
> >>>>>>>>>>                      printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
> >>>>>>>>>>                             start, end, rc);
> >>>>>>>>>>                      rangeset_destroy(mem);
> >>>>>>>>>>                      return rc;
> >>>>>>>>>>                  }
> >>>>>>>>>>              }
> >>>>>>>>>> +        spin_unlock(&tmp->vpci_lock);
> >>>>>>>>>>          }
> >>>>>>>>> At the first glance this simply looks like another unjustified (in the
> >>>>>>>>> description) change, as you're not converting anything here but you
> >>>>>>>>> actually add locking (and I realize this was there before, so I'm sorry
> >>>>>>>>> for not pointing this out earlier).
> >>>>>>>> Well, I thought that the description already has "...the lock can be
> >>>>>>>> used (and in a few cases is used right away) to check whether vpci
> >>>>>>>> is present" and this is enough for such uses as here.
> >>>>>>>>>      But then I wonder whether you
> >>>>>>>>> actually tested this, since I can't help getting the impression that
> >>>>>>>>> you're introducing a live-lock: The function is called from cmd_write()
> >>>>>>>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
> >>>>>>>>> function already holds the lock, and the lock is not (currently)
> >>>>>>>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
> >>>>>>>>> the locking looks to be entirely unnecessary.)
> >>>>>>>> Well, you are correct: if tmp != pdev then it is correct to acquire
> >>>>>>>> the lock. But if tmp == pdev and rom_only == true
> >>>>>>>> then we'll deadlock.
> >>>>>>>>
> >>>>>>>> It seems we need to have the locking conditional, e.g. only lock
> >>>>>>>> if tmp != pdev
> >>>>>>> Which will address the live-lock, but introduce ABBA deadlock potential
> >>>>>>> between the two locks.
> >>>>>> I am not sure I can suggest a better solution here
> >>>>>> @Roger, @Jan, could you please help here?
> >>>>> Well, first of all I'd like to mention that while it may have been okay to
> >>>>> not hold pcidevs_lock here for Dom0, it surely needs acquiring when dealing
> >>>>> with DomU-s' lists of PCI devices. The requirement really applies to the
> >>>>> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
> >>>>> there it probably wants to be a try-lock.
> >>>>>
> >>>>> Next I'd like to point out that here we have the still pending issue of
> >>>>> how to deal with hidden devices, which Dom0 can access. See my RFC patch
> >>>>> "vPCI: account for hidden devices in modify_bars()". Whatever the solution
> >>>>> here, I think it wants to at least account for the extra need there.
> >>>> Yes, sorry, I should take care of that.
> >>>>
> >>>>> Now it is quite clear that pcidevs_lock isn't going to help with avoiding
> >>>>> the deadlock, as it's imo not an option at all to acquire that lock
> >>>>> everywhere else you access ->vpci (or else the vpci lock itself would be
> >>>>> pointless). But a per-domain auxiliary r/w lock may help: Other paths
> >>>>> would acquire it in read mode, and here you'd acquire it in write mode (in
> >>>>> the former case around the vpci lock, while in the latter case there may
> >>>>> then not be any need to acquire the individual vpci locks at all). FTAOD:
> >>>>> I haven't fully thought through all implications (and hence whether this is
> >>>>> viable in the first place); I expect you will, documenting what you've
> >>>>> found in the resulting patch description. Of course the double lock
> >>>>> acquire/release would then likely want hiding in helper functions.
> >>>> I've been also thinking about this, and whether it's really worth to
> >>>> have a per-device lock rather than a per-domain one that protects all
> >>>> vpci regions of the devices assigned to the domain.
> >>>>
> >>>> The OS is likely to serialize accesses to the PCI config space anyway,
> >>>> and the only place I could see a benefit of having per-device locks is
> >>>> in the handling of MSI-X tables, as the handling of the mask bit is
> >>>> likely very performance sensitive, so adding a per-domain lock there
> >>>> could be a bottleneck.
> >>> Hmm, with method 1 accesses serializing globally is basically
> >>> unavoidable, but with MMCFG I see no reason why OSes may not (move
> >>> to) permit(ting) parallel accesses, with serialization perhaps done
> >>> only at device level. See our own pci_config_lock, which applies to
> >>> only method 1 accesses; we don't look to be serializing MMCFG
> >>> accesses at all.
> >>>
> >>>> We could alternatively do a per-domain rwlock for vpci and special case
> >>>> the MSI-X area to also have a per-device specific lock. At which point
> >>>> it becomes fairly similar to what you propose.
> >> @Jan, @Roger
> >>
> >> 1. d->vpci_lock - rwlock <- this protects vpci
> >> 2. pdev->vpci->msix_tbl_lock - rwlock <- this protects MSI-X tables
> >> or should it better be pdev->msix_tbl_lock as MSI-X tables don't
> >> really depend on vPCI?
> > If so, perhaps indeed better the latter. But as said in reply to Roger,
> > I'm not convinced (yet) that doing away with the per-device lock is a
> > good move. As said there - we're ourselves doing fully parallel MMCFG
> > accesses, so OSes ought to be fine to do so, too.
> But with pdev->vpci_lock we face ABBA...

I think it would be easier to start with a per-domain rwlock that
guarantees pdev->vpci cannot be removed under our feet. This would be
taken in read mode in vpci_{read,write} and in write mode when
removing a device from a domain.

Then there are also other issues regarding vPCI locking that need to
be fixed, but that lock would likely be a start.
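
A rough, single-threaded model of this scheme, purely for illustration.
toy_rwlock and the vpci_access_begin()/vpci_access_end()/vpci_remove()
helpers are hypothetical names, not existing Xen interfaces, and the
try-lock style is only there so the exclusion rules can be observed
without threads. Real code would use Xen's rwlock primitives and block.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy r/w lock that only tracks state. Not thread safe. */
struct toy_rwlock { unsigned int readers; bool writer; };

static bool toy_read_trylock(struct toy_rwlock *l)
{
    if ( l->writer )
        return false;
    l->readers++;
    return true;
}

static void toy_read_unlock(struct toy_rwlock *l)
{
    l->readers--;
}

static bool toy_write_trylock(struct toy_rwlock *l)
{
    if ( l->writer || l->readers )
        return false;
    l->writer = true;
    return true;
}

static void toy_write_unlock(struct toy_rwlock *l)
{
    l->writer = false;
}

/* Reduced, hypothetical versions of the real structures. */
struct vpci { unsigned int dummy; };
struct domain { struct toy_rwlock vpci_rwlock; };
struct pci_dev { struct domain *domain; struct vpci *vpci; };

/* vpci_{read,write}()-like entry: read mode keeps pdev->vpci in place. */
static bool vpci_access_begin(struct pci_dev *pdev)
{
    if ( !toy_read_trylock(&pdev->domain->vpci_rwlock) )
        return false;
    if ( pdev->vpci )
        return true;
    toy_read_unlock(&pdev->domain->vpci_rwlock);
    return false;
}

static void vpci_access_end(struct pci_dev *pdev)
{
    toy_read_unlock(&pdev->domain->vpci_rwlock);
}

/* Device removal: write mode excludes every in-flight access. */
static bool vpci_remove(struct pci_dev *pdev)
{
    if ( !toy_write_trylock(&pdev->domain->vpci_rwlock) )
        return false;
    pdev->vpci = NULL;          /* real code would also free the state */
    toy_write_unlock(&pdev->domain->vpci_rwlock);
    return true;
}
```

In this model several accesses may be in flight at once, while removal
only proceeds once all of them have finished. An access that starts after
removal finds pdev->vpci gone and backs out.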

Thanks, Roger.



* Re: [PATCH v6 04/13] vpci: restrict unhandled read/write operations for guests
  2022-02-04  6:34 ` [PATCH v6 04/13] vpci: restrict unhandled read/write operations for guests Oleksandr Andrushchenko
@ 2022-02-04 14:11   ` Jan Beulich
  2022-02-04 14:24     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-04 14:11 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> A guest can read and write those registers which are not emulated and
> have no respective vPCI handlers, so it can access the HW directly.

I don't think this describes the present situation. Or did I miss where
devices can actually be exposed to guests already, despite much of the
support logic still missing?

> In order to prevent a guest from reads and writes from/to the unhandled
> registers make sure only hardware domain can access HW directly and restrict
> guests from doing so.

Tangential question: Going over the titles of the remaining patches I
notice patch 6 is going to deal with BAR accesses. But (going just
from the titles) I can't spot anywhere that vendor and device IDs
would be exposed to guests. Yet that's the first thing guests will need
in order to actually recognize devices. As said before, allowing guests
access to such r/o fields is quite likely going to be fine.

> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -215,11 +215,15 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>  }
>  
>  /* Wrappers for performing reads/writes to the underlying hardware. */
> -static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
> +static uint32_t vpci_read_hw(bool is_hwdom, pci_sbdf_t sbdf, unsigned int reg,
>                               unsigned int size)

Was the passing around of a boolean the consensus which was reached?
Personally I'd find it more natural if the two functions checked
current->domain themselves.
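
A minimal sketch of that alternative, under loudly stated assumptions:
sketch_current stands in for Xen's `current`, the is_hwdom field stands in
for is_hardware_domain(), and returning all ones to guests is assumed
behaviour for illustration, not something taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct domain / struct vcpu. */
struct domain_sketch { bool is_hwdom; };
struct vcpu_sketch { struct domain_sketch *domain; };

/* Stand-in for Xen's `current` vCPU pointer. */
static struct vcpu_sketch *sketch_current;

/* Pretend contents of the underlying hardware register. */
static uint32_t sketch_hw_reg = 0x12345678u;

/* No boolean parameter: the wrapper looks at the calling domain itself. */
static uint32_t sketch_read_hw(unsigned int size)
{
    /* Assumption: guests read back all ones for unhandled registers. */
    uint32_t val = sketch_current->domain->is_hwdom ? sketch_hw_reg
                                                    : ~(uint32_t)0;

    assert(size == 1 || size == 2 || size == 4);

    /* Truncate to the access size. */
    return size == 4 ? val : val & ((1u << (size * 8)) - 1);
}
```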

Jan




* Re: [PATCH v6 04/13] vpci: restrict unhandled read/write operations for guests
  2022-02-04 14:11   ` Jan Beulich
@ 2022-02-04 14:24     ` Oleksandr Andrushchenko
  2022-02-08  8:00       ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04 14:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, Oleksandr Andrushchenko,
	xen-devel



On 04.02.22 16:11, Jan Beulich wrote:
> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>> A guest can read and write those registers which are not emulated and
>> have no respective vPCI handlers, so it can access the HW directly.
> I don't think this describes the present situation. Or did I miss where
> devices can actually be exposed to guests already, despite much of the
> support logic still missing?
No, they are not exposed yet and you know that.
I will update the commit message
>
>> In order to prevent a guest from reads and writes from/to the unhandled
>> registers make sure only hardware domain can access HW directly and restrict
>> guests from doing so.
> Tangential question: Going over the titles of the remaining patches I
> notice patch 6 is going to deal with BAR accesses. But (going just
> from the titles) I can't spot anywhere that vendor and device IDs
> would be exposed to guests. Yet that's the first thing guests will need
> in order to actually recognize devices. As said before, allowing guests
> access to such r/o fields is quite likely going to be fine.
Agree, I was thinking about adding such a patch to allow IDs,
but finally decided not to add more to this series.
Again, the whole thing is not working yet, and for development
this patch can/needs to be reverted. So whether or not we implement IDs
doesn't change anything in this respect.
>
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -215,11 +215,15 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>>   }
>>   
>>   /* Wrappers for performing reads/writes to the underlying hardware. */
>> -static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
>> +static uint32_t vpci_read_hw(bool is_hwdom, pci_sbdf_t sbdf, unsigned int reg,
>>                                unsigned int size)
> Was the passing around of a boolean the consensus which was reached?
Was this patch committed yet?
> Personally I'd find it more natural if the two functions checked
> current->domain themselves.
This is also possible, but I would like to hear Roger's view on this as well.
I am fine either way.
>
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2022-02-04  6:34 ` [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests Oleksandr Andrushchenko
@ 2022-02-04 14:25   ` Jan Beulich
  2022-02-08  8:13     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-04 14:25 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -454,6 +454,22 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
>          pci_conf_write16(pdev->sbdf, reg, cmd);
>  }
>  
> +static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
> +                            uint32_t cmd, void *data)
> +{
> +    /* TODO: Add proper emulation for all bits of the command register. */
> +
> +#ifdef CONFIG_HAS_PCI_MSI
> +    if ( pdev->vpci->msi->enabled || pdev->vpci->msix->enabled )
> +    {
> +        /* Guest wants to enable INTx. It can't be enabled if MSI/MSI-X enabled. */
> +        cmd |= PCI_COMMAND_INTX_DISABLE;
> +    }
> +#endif
> +
> +    cmd_write(pdev, reg, cmd, data);
> +}

It's not really clear to me whether the TODO warrants this being a
separate function. Personally I'd find it preferable if the logic
was folded into cmd_write().

With this and ...

> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -70,6 +70,10 @@ static void control_write(const struct pci_dev *pdev, unsigned int reg,
>  
>          if ( vpci_msi_arch_enable(msi, pdev, vectors) )
>              return;
> +
> +        /* Make sure guest doesn't enable INTx while enabling MSI. */
> +        if ( !is_hardware_domain(pdev->domain) )
> +            pci_intx(pdev, false);
>      }
>      else
>          vpci_msi_arch_disable(msi, pdev);
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -92,6 +92,10 @@ static void control_write(const struct pci_dev *pdev, unsigned int reg,
>          for ( i = 0; i < msix->max_entries; i++ )
>              if ( !msix->entries[i].masked && msix->entries[i].updated )
>                  update_entry(&msix->entries[i], pdev, i);
> +
> +        /* Make sure guest doesn't enable INTx while enabling MSI-X. */
> +        if ( !is_hardware_domain(pdev->domain) )
> +            pci_intx(pdev, false);
>      }
>      else if ( !new_enabled && msix->enabled )
>      {

... this done (as requested) behind the back of the guest, what's the
idea wrt the guest reading the command register? That continues to be
wired to vpci_hw_read16() (and hence accesses the underlying hardware
value irrespective of what patch 4 did).

Jan




* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-04  6:34 ` [PATCH v6 10/13] vpci/header: reset the command register when adding devices Oleksandr Andrushchenko
@ 2022-02-04 14:30   ` Jan Beulich
  2022-02-04 14:37     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-04 14:30 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> Reset the command register when assigning a PCI device to a guest:
> according to the PCI spec the PCI_COMMAND register is typically all 0's
> after reset.

It's not entirely clear to me whether setting the hardware register to
zero is okay. What wants to be zero is the value the guest observes
initially.

> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -454,8 +454,7 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
>          pci_conf_write16(pdev->sbdf, reg, cmd);
>  }
>  
> -static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
> -                            uint32_t cmd, void *data)
> +static uint32_t emulate_cmd_reg(const struct pci_dev *pdev, uint32_t cmd)

The command register is a 16-bit one, so parameter and return type should
either be plain unsigned int (preferred, see ./CODING_STYLE) or uint16_t
imo.

Jan




* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-04 14:30   ` Jan Beulich
@ 2022-02-04 14:37     ` Oleksandr Andrushchenko
  2022-02-07  7:29       ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04 14:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 04.02.22 16:30, Jan Beulich wrote:
> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>> Reset the command register when assigning a PCI device to a guest:
>> according to the PCI spec the PCI_COMMAND register is typically all 0's
>> after reset.
> It's not entirely clear to me whether setting the hardware register to
> zero is okay. What wants to be zero is the value the guest observes
> initially.
"the PCI spec says the PCI_COMMAND register is typically all 0's after reset."
Why wouldn't it be ok? What is the exact concern here?
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -454,8 +454,7 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>           pci_conf_write16(pdev->sbdf, reg, cmd);
>>   }
>>   
>> -static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
>> -                            uint32_t cmd, void *data)
>> +static uint32_t emulate_cmd_reg(const struct pci_dev *pdev, uint32_t cmd)
> The command register is a 16-bit one, so parameter and return type should
> either be plain unsigned int (preferred, see ./CODING_STYLE) or uint16_t
> imo.
Good catch, thank you
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04 13:06                       ` Roger Pau Monné
@ 2022-02-04 14:43                         ` Oleksandr Andrushchenko
  2022-02-04 14:57                           ` Roger Pau Monné
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-04 14:43 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 04.02.22 15:06, Roger Pau Monné wrote:
> On Fri, Feb 04, 2022 at 12:53:20PM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 04.02.22 14:47, Jan Beulich wrote:
>>> On 04.02.2022 13:37, Oleksandr Andrushchenko wrote:
>>>> On 04.02.22 13:37, Jan Beulich wrote:
>>>>> On 04.02.2022 12:13, Roger Pau Monné wrote:
>>>>>> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
>>>>>>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
>>>>>>>> On 04.02.22 11:15, Jan Beulich wrote:
>>>>>>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
>>>>>>>>>> On 04.02.22 09:52, Jan Beulich wrote:
>>>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>>>                       continue;
>>>>>>>>>>>>               }
>>>>>>>>>>>>       
>>>>>>>>>>>> +        spin_lock(&tmp->vpci_lock);
>>>>>>>>>>>> +        if ( !tmp->vpci )
>>>>>>>>>>>> +        {
>>>>>>>>>>>> +            spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>> +            continue;
>>>>>>>>>>>> +        }
>>>>>>>>>>>>               for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>>>>>>>>>>               {
>>>>>>>>>>>>                   const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>>>>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>>>                   rc = rangeset_remove_range(mem, start, end);
>>>>>>>>>>>>                   if ( rc )
>>>>>>>>>>>>                   {
>>>>>>>>>>>> +                spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>                       printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>>>>>>>>>>>                              start, end, rc);
>>>>>>>>>>>>                       rangeset_destroy(mem);
>>>>>>>>>>>>                       return rc;
>>>>>>>>>>>>                   }
>>>>>>>>>>>>               }
>>>>>>>>>>>> +        spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>           }
>>>>>>>>>>> At the first glance this simply looks like another unjustified (in the
>>>>>>>>>>> description) change, as you're not converting anything here but you
>>>>>>>>>>> actually add locking (and I realize this was there before, so I'm sorry
>>>>>>>>>>> for not pointing this out earlier).
>>>>>>>>>> Well, I thought that the description already has "...the lock can be
>>>>>>>>>> used (and in a few cases is used right away) to check whether vpci
>>>>>>>>>> is present" and this is enough for such uses as here.
>>>>>>>>>>>       But then I wonder whether you
>>>>>>>>>>> actually tested this, since I can't help getting the impression that
>>>>>>>>>>> you're introducing a live-lock: The function is called from cmd_write()
>>>>>>>>>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
>>>>>>>>>>> function already holds the lock, and the lock is not (currently)
>>>>>>>>>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
>>>>>>>>>>> the locking looks to be entirely unnecessary.)
>>>>>>>>>> Well, you are correct: if tmp != pdev then it is correct to acquire
>>>>>>>>>> the lock. But if tmp == pdev and rom_only == true
>>>>>>>>>> then we'll deadlock.
>>>>>>>>>>
>>>>>>>>>> It seems we need to have the locking conditional, e.g. only lock
>>>>>>>>>> if tmp != pdev
>>>>>>>>> Which will address the live-lock, but introduce ABBA deadlock potential
>>>>>>>>> between the two locks.
>>>>>>>> I am not sure I can suggest a better solution here
>>>>>>>> @Roger, @Jan, could you please help here?
>>>>>>> Well, first of all I'd like to mention that while it may have been okay to
>>>>>>> not hold pcidevs_lock here for Dom0, it surely needs acquiring when dealing
>>>>>>> with DomU-s' lists of PCI devices. The requirement really applies to the
>>>>>>> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
>>>>>>> there it probably wants to be a try-lock.
>>>>>>>
>>>>>>> Next I'd like to point out that here we have the still pending issue of
>>>>>>> how to deal with hidden devices, which Dom0 can access. See my RFC patch
>>>>>>> "vPCI: account for hidden devices in modify_bars()". Whatever the solution
>>>>>>> here, I think it wants to at least account for the extra need there.
>>>>>> Yes, sorry, I should take care of that.
>>>>>>
>>>>>>> Now it is quite clear that pcidevs_lock isn't going to help with avoiding
>>>>>>> the deadlock, as it's imo not an option at all to acquire that lock
>>>>>>> everywhere else you access ->vpci (or else the vpci lock itself would be
>>>>>>> pointless). But a per-domain auxiliary r/w lock may help: Other paths
>>>>>>> would acquire it in read mode, and here you'd acquire it in write mode (in
>>>>>>> the former case around the vpci lock, while in the latter case there may
>>>>>>> then not be any need to acquire the individual vpci locks at all). FTAOD:
>>>>>>> I haven't fully thought through all implications (and hence whether this is
>>>>>>> viable in the first place); I expect you will, documenting what you've
>>>>>>> found in the resulting patch description. Of course the double lock
>>>>>>> acquire/release would then likely want hiding in helper functions.
>>>>>> I've been also thinking about this, and whether it's really worth to
>>>>>> have a per-device lock rather than a per-domain one that protects all
>>>>>> vpci regions of the devices assigned to the domain.
>>>>>>
>>>>>> The OS is likely to serialize accesses to the PCI config space anyway,
>>>>>> and the only place I could see a benefit of having per-device locks is
>>>>>> in the handling of MSI-X tables, as the handling of the mask bit is
>>>>>> likely very performance sensitive, so adding a per-domain lock there
>>>>>> could be a bottleneck.
>>>>> Hmm, with method 1 accesses serializing globally is basically
>>>>> unavoidable, but with MMCFG I see no reason why OSes may not (move
>>>>> to) permit(ting) parallel accesses, with serialization perhaps done
>>>>> only at device level. See our own pci_config_lock, which applies to
>>>>> only method 1 accesses; we don't look to be serializing MMCFG
>>>>> accesses at all.
>>>>>
>>>>>> We could alternatively do a per-domain rwlock for vpci and special case
>>>>>> the MSI-X area to also have a per-device specific lock. At which point
>>>>>> it becomes fairly similar to what you propose.
>>>> @Jan, @Roger
>>>>
>>>> 1. d->vpci_lock - rwlock <- this protects vpci
>>>> 2. pdev->vpci->msix_tbl_lock - rwlock <- this protects MSI-X tables
>>>> or should it better be pdev->msix_tbl_lock as MSI-X tables don't
>>>> really depend on vPCI?
>>> If so, perhaps indeed better the latter. But as said in reply to Roger,
>>> I'm not convinced (yet) that doing away with the per-device lock is a
>>> good move. As said there - we're ourselves doing fully parallel MMCFG
>>> accesses, so OSes ought to be fine to do so, too.
>> But with pdev->vpci_lock we face ABBA...
> I think it would be easier to start with a per-domain rwlock that
> guarantees pdev->vpci cannot be removed under our feet. This would be
> taken in read mode in vpci_{read,write} and in write mode when
> removing a device from a domain.
>
> Then there are also other issues regarding vPCI locking that need to
> be fixed, but that lock would likely be a start.
Or let's look at the problem from a different angle: this is the only place
which breaks the use of pdev->vpci_lock, because no other place tries to
acquire the locks of two devices at the same time.
So, what if we re-work the offending piece of code instead?
That way we do not break parallel access and keep the lock per-device,
which might also be a plus.

By re-work I mean that, instead of reading the already mapped regions
from tmp, we can employ a d->pci_mapped_regions range set which
will hold all the already mapped ranges. Accesses to that range set,
which seem to be rare, would be protected by pcidevs_lock.
So, modify_bars will rely on pdev->vpci_lock + pcidevs_lock and
ABBA won't be possible at all.

>
> Thanks, Roger.


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04 14:43                         ` Oleksandr Andrushchenko
@ 2022-02-04 14:57                           ` Roger Pau Monné
  2022-02-07 11:08                             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-04 14:57 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel

On Fri, Feb 04, 2022 at 02:43:07PM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 04.02.22 15:06, Roger Pau Monné wrote:
> > On Fri, Feb 04, 2022 at 12:53:20PM +0000, Oleksandr Andrushchenko wrote:
> >>
> >> On 04.02.22 14:47, Jan Beulich wrote:
> >>> On 04.02.2022 13:37, Oleksandr Andrushchenko wrote:
> >>>> On 04.02.22 13:37, Jan Beulich wrote:
> >>>>> On 04.02.2022 12:13, Roger Pau Monné wrote:
> >>>>>> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
> >>>>>>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
> >>>>>>>> On 04.02.22 11:15, Jan Beulich wrote:
> >>>>>>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
> >>>>>>>>>> On 04.02.22 09:52, Jan Beulich wrote:
> >>>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> >>>>>>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>>>>>>>>>>>                       continue;
> >>>>>>>>>>>>               }
> >>>>>>>>>>>>       
> >>>>>>>>>>>> +        spin_lock(&tmp->vpci_lock);
> >>>>>>>>>>>> +        if ( !tmp->vpci )
> >>>>>>>>>>>> +        {
> >>>>>>>>>>>> +            spin_unlock(&tmp->vpci_lock);
> >>>>>>>>>>>> +            continue;
> >>>>>>>>>>>> +        }
> >>>>>>>>>>>>               for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
> >>>>>>>>>>>>               {
> >>>>>>>>>>>>                   const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
> >>>>>>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>>>>>>>>>>>                   rc = rangeset_remove_range(mem, start, end);
> >>>>>>>>>>>>                   if ( rc )
> >>>>>>>>>>>>                   {
> >>>>>>>>>>>> +                spin_unlock(&tmp->vpci_lock);
> >>>>>>>>>>>>                       printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
> >>>>>>>>>>>>                              start, end, rc);
> >>>>>>>>>>>>                       rangeset_destroy(mem);
> >>>>>>>>>>>>                       return rc;
> >>>>>>>>>>>>                   }
> >>>>>>>>>>>>               }
> >>>>>>>>>>>> +        spin_unlock(&tmp->vpci_lock);
> >>>>>>>>>>>>           }
> >>>>>>>>>>> At the first glance this simply looks like another unjustified (in the
> >>>>>>>>>>> description) change, as you're not converting anything here but you
> >>>>>>>>>>> actually add locking (and I realize this was there before, so I'm sorry
> >>>>>>>>>>> for not pointing this out earlier).
> >>>>>>>>>> Well, I thought that the description already has "...the lock can be
> >>>>>>>>>> used (and in a few cases is used right away) to check whether vpci
> >>>>>>>>>> is present" and this is enough for such uses as here.
> >>>>>>>>>>>       But then I wonder whether you
> >>>>>>>>>>> actually tested this, since I can't help getting the impression that
> >>>>>>>>>>> you're introducing a live-lock: The function is called from cmd_write()
> >>>>>>>>>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
> >>>>>>>>>>> function already holds the lock, and the lock is not (currently)
> >>>>>>>>>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
> >>>>>>>>>>> the locking looks to be entirely unnecessary.)
> >>>>>>>>>> Well, you are correct: if tmp != pdev then it is correct to acquire
> >>>>>>>>>> the lock. But if tmp == pdev and rom_only == true
> >>>>>>>>>> then we'll deadlock.
> >>>>>>>>>>
> >>>>>>>>>> It seems we need to have the locking conditional, e.g. only lock
> >>>>>>>>>> if tmp != pdev
> >>>>>>>>> Which will address the live-lock, but introduce ABBA deadlock potential
> >>>>>>>>> between the two locks.
> >>>>>>>> I am not sure I can suggest a better solution here
> >>>>>>>> @Roger, @Jan, could you please help here?
> >>>>>>> Well, first of all I'd like to mention that while it may have been okay to
> >>>>>>> not hold pcidevs_lock here for Dom0, it surely needs acquiring when dealing
> >>>>>>> with DomU-s' lists of PCI devices. The requirement really applies to the
> >>>>>>> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
> >>>>>>> there it probably wants to be a try-lock.
> >>>>>>>
> >>>>>>> Next I'd like to point out that here we have the still pending issue of
> >>>>>>> how to deal with hidden devices, which Dom0 can access. See my RFC patch
> >>>>>>> "vPCI: account for hidden devices in modify_bars()". Whatever the solution
> >>>>>>> here, I think it wants to at least account for the extra need there.
> >>>>>> Yes, sorry, I should take care of that.
> >>>>>>
> >>>>>>> Now it is quite clear that pcidevs_lock isn't going to help with avoiding
> >>>>>>> the deadlock, as it's imo not an option at all to acquire that lock
> >>>>>>> everywhere else you access ->vpci (or else the vpci lock itself would be
> >>>>>>> pointless). But a per-domain auxiliary r/w lock may help: Other paths
> >>>>>>> would acquire it in read mode, and here you'd acquire it in write mode (in
> >>>>>>> the former case around the vpci lock, while in the latter case there may
> >>>>>>> then not be any need to acquire the individual vpci locks at all). FTAOD:
> >>>>>>> I haven't fully thought through all implications (and hence whether this is
> >>>>>>> viable in the first place); I expect you will, documenting what you've
> >>>>>>> found in the resulting patch description. Of course the double lock
> >>>>>>> acquire/release would then likely want hiding in helper functions.
> >>>>>> I've been also thinking about this, and whether it's really worth to
> >>>>>> have a per-device lock rather than a per-domain one that protects all
> >>>>>> vpci regions of the devices assigned to the domain.
> >>>>>>
> >>>>>> The OS is likely to serialize accesses to the PCI config space anyway,
> >>>>>> and the only place I could see a benefit of having per-device locks is
> >>>>>> in the handling of MSI-X tables, as the handling of the mask bit is
> >>>>>> likely very performance sensitive, so adding a per-domain lock there
> >>>>>> could be a bottleneck.
> >>>>> Hmm, with method 1 accesses serializing globally is basically
> >>>>> unavoidable, but with MMCFG I see no reason why OSes may not (move
> >>>>> to) permit(ting) parallel accesses, with serialization perhaps done
> >>>>> only at device level. See our own pci_config_lock, which applies to
> >>>>> only method 1 accesses; we don't look to be serializing MMCFG
> >>>>> accesses at all.
> >>>>>
> >>>>>> We could alternatively do a per-domain rwlock for vpci and special case
> >>>>>> the MSI-X area to also have a per-device specific lock. At which point
> >>>>>> it becomes fairly similar to what you propose.
> >>>> @Jan, @Roger
> >>>>
> >>>> 1. d->vpci_lock - rwlock <- this protects vpci
> >>>> 2. pdev->vpci->msix_tbl_lock - rwlock <- this protects MSI-X tables
> >>>> or should it better be pdev->msix_tbl_lock as MSI-X tables don't
> >>>> really depend on vPCI?
> >>> If so, perhaps indeed better the latter. But as said in reply to Roger,
> >>> I'm not convinced (yet) that doing away with the per-device lock is a
> >>> good move. As said there - we're ourselves doing fully parallel MMCFG
> >>> accesses, so OSes ought to be fine to do so, too.
> >> But with pdev->vpci_lock we face ABBA...
> > I think it would be easier to start with a per-domain rwlock that
> > guarantees pdev->vpci cannot be removed under our feet. This would be
> > taken in read mode in vpci_{read,write} and in write mode when
> > removing a device from a domain.
> >
> > Then there are also other issues regarding vPCI locking that need to
> > be fixed, but that lock would likely be a start.
> Or let's look at the problem from a different angle: this is the only place
> which breaks the use of pdev->vpci_lock, because no other place tries to
> acquire the locks of two devices at the same time.
> So, what if we re-work the offending piece of code instead?
> That way we do not break parallel access and keep the lock per-device,
> which might also be a plus.
> 
> By re-work I mean that, instead of reading the already mapped regions
> from tmp, we can employ a d->pci_mapped_regions range set which
> will hold all the already mapped ranges. Accesses to that range set,
> which seem to be rare, would be protected by pcidevs_lock.
> So, modify_bars will rely on pdev->vpci_lock + pcidevs_lock and
> ABBA won't be possible at all.

Sadly that won't replace the usage of the loop in modify_bars. This is
not (exclusively) done in order to prevent mapping the same region
multiple times, but rather to prevent unmapping of regions as long as
there's an enabled BAR that's using it.

If you wanted to use something like d->pci_mapped_regions it would
have to keep reference counts to regions, in order to know when a
mapping is no longer required by any BAR on the system with memory
decoding enabled.
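
As a rough illustration of what such reference counting could look like,
here is a self-contained sketch. It assumes a tiny fixed window of page
frames and per-page counters; mapped_regions_sketch, sketch_region_get and
sketch_region_put are hypothetical names, and a real implementation would
more likely extend rangesets than use a flat array.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_PAGES 16u

/* Hypothetical per-domain bookkeeping: one use count per page frame. */
struct mapped_regions_sketch {
    unsigned int refcnt[SKETCH_NR_PAGES];
};

/*
 * A BAR with memory decoding enabled starts using pages [start, end].
 * Returns how many pages went from unused to used, i.e. need mapping.
 */
static unsigned int sketch_region_get(struct mapped_regions_sketch *r,
                                      unsigned int start, unsigned int end)
{
    unsigned int pfn, to_map = 0;

    assert(start <= end && end < SKETCH_NR_PAGES);
    for ( pfn = start; pfn <= end; pfn++ )
        if ( r->refcnt[pfn]++ == 0 )
            to_map++;

    return to_map;
}

/*
 * A BAR stops using pages [start, end]. Returns how many pages dropped to
 * zero users; only those may be unmapped.
 */
static unsigned int sketch_region_put(struct mapped_regions_sketch *r,
                                      unsigned int start, unsigned int end)
{
    unsigned int pfn, to_unmap = 0;

    assert(start <= end && end < SKETCH_NR_PAGES);
    for ( pfn = start; pfn <= end; pfn++ )
    {
        assert(r->refcnt[pfn] > 0);
        if ( --r->refcnt[pfn] == 0 )
            to_unmap++;
    }

    return to_unmap;
}
```

With two BARs overlapping on some pages, disabling one of them leaves the
shared pages mapped until the other is disabled too, which is the property
the loop in modify_bars currently provides.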

Thanks, Roger.



* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-04 14:37     ` Oleksandr Andrushchenko
@ 2022-02-07  7:29       ` Jan Beulich
  2022-02-07 11:27         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-07  7:29 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 04.02.2022 15:37, Oleksandr Andrushchenko wrote:
> On 04.02.22 16:30, Jan Beulich wrote:
>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>> Reset the command register when assigning a PCI device to a guest:
>>> according to the PCI spec the PCI_COMMAND register is typically all 0's
>>> after reset.
>> It's not entirely clear to me whether setting the hardware register to
>> zero is okay. What wants to be zero is the value the guest observes
>> initially.
> "the PCI spec says the PCI_COMMAND register is typically all 0's after reset."
> Why wouldn't it be ok? What is the exact concern here?

The concern is - as voiced in similar ways before, perhaps in other
contexts - that you need to consider bit-by-bit whether overwriting
with 0 what is currently there is okay. Xen and/or Dom0 may have put
values there which they expect to remain unaltered. I guess
PCI_COMMAND_SERR is a good example: While the guest's view of this
will want to be zero initially, the host having set it to 1 may not
easily be overwritten with 0, or else you'd effectively imply giving
the guest control of the bit.
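
A minimal sketch of the distinction, with assumptions spelled out: the
PCI_COMMAND_* values are the standard ones, but treating only
PCI_COMMAND_SERR as host-owned is purely illustrative, and cmd_sketch and
the sketch_* helpers are hypothetical names. The guest's view starts at
zero while host-owned bits in the hardware value are left alone.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_COMMAND_IO      0x001u
#define PCI_COMMAND_MEMORY  0x002u
#define PCI_COMMAND_MASTER  0x004u
#define PCI_COMMAND_SERR    0x100u

/* Assumption for illustration only: which bits the host keeps control of. */
#define SKETCH_HOST_OWNED   PCI_COMMAND_SERR

struct cmd_sketch {
    uint16_t hw;      /* value in the (pretend) hardware register */
    uint16_t guest;   /* value the guest observes */
};

/* Assigning the device: the guest sees 0, host-owned bits stay as they were. */
static void sketch_assign(struct cmd_sketch *c)
{
    c->guest = 0;
    c->hw &= SKETCH_HOST_OWNED;
}

/* Guest write: only bits that are not host-owned reach the hardware value. */
static void sketch_guest_write(struct cmd_sketch *c, unsigned int val)
{
    assert(val <= 0xffffu);   /* the command register is 16 bits wide */
    c->guest = (uint16_t)val;
    c->hw = (uint16_t)((c->hw & SKETCH_HOST_OWNED) |
                       (val & ~SKETCH_HOST_OWNED));
}

/* Guest read: served from the guest view, never from the hardware value. */
static unsigned int sketch_guest_read(const struct cmd_sketch *c)
{
    return c->guest;
}
```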

Jan




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-04 14:57                           ` Roger Pau Monné
@ 2022-02-07 11:08                             ` Oleksandr Andrushchenko
  2022-02-07 12:34                               ` Jan Beulich
  2022-02-07 12:46                               ` Roger Pau Monné
  0 siblings, 2 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 11:08 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko

Hello,

On 04.02.22 16:57, Roger Pau Monné wrote:
> On Fri, Feb 04, 2022 at 02:43:07PM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 04.02.22 15:06, Roger Pau Monné wrote:
>>> On Fri, Feb 04, 2022 at 12:53:20PM +0000, Oleksandr Andrushchenko wrote:
>>>> On 04.02.22 14:47, Jan Beulich wrote:
>>>>> On 04.02.2022 13:37, Oleksandr Andrushchenko wrote:
>>>>>> On 04.02.22 13:37, Jan Beulich wrote:
>>>>>>> On 04.02.2022 12:13, Roger Pau Monné wrote:
>>>>>>>> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
>>>>>>>>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
>>>>>>>>>> On 04.02.22 11:15, Jan Beulich wrote:
>>>>>>>>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
>>>>>>>>>>>> On 04.02.22 09:52, Jan Beulich wrote:
>>>>>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>>>>>                        continue;
>>>>>>>>>>>>>>                }
>>>>>>>>>>>>>>        
>>>>>>>>>>>>>> +        spin_lock(&tmp->vpci_lock);
>>>>>>>>>>>>>> +        if ( !tmp->vpci )
>>>>>>>>>>>>>> +        {
>>>>>>>>>>>>>> +            spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>>> +            continue;
>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>                for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>>>>>>>>>>>>                {
>>>>>>>>>>>>>>                    const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>>>>>>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>>>>>                    rc = rangeset_remove_range(mem, start, end);
>>>>>>>>>>>>>>                    if ( rc )
>>>>>>>>>>>>>>                    {
>>>>>>>>>>>>>> +                spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>>>                        printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>>>>>>>>>>>>>                               start, end, rc);
>>>>>>>>>>>>>>                        rangeset_destroy(mem);
>>>>>>>>>>>>>>                        return rc;
>>>>>>>>>>>>>>                    }
>>>>>>>>>>>>>>                }
>>>>>>>>>>>>>> +        spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>>>            }
>>>>>>>>>>>>> At the first glance this simply looks like another unjustified (in the
>>>>>>>>>>>>> description) change, as you're not converting anything here but you
>>>>>>>>>>>>> actually add locking (and I realize this was there before, so I'm sorry
>>>>>>>>>>>>> for not pointing this out earlier).
>>>>>>>>>>>> Well, I thought that the description already has "...the lock can be
>>>>>>>>>>>> used (and in a few cases is used right away) to check whether vpci
>>>>>>>>>>>> is present" and this is enough for such uses as here.
>>>>>>>>>>>>>        But then I wonder whether you
>>>>>>>>>>>>> actually tested this, since I can't help getting the impression that
>>>>>>>>>>>>> you're introducing a live-lock: The function is called from cmd_write()
>>>>>>>>>>>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
>>>>>>>>>>>>> function already holds the lock, and the lock is not (currently)
>>>>>>>>>>>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
>>>>>>>>>>>>> the locking looks to be entirely unnecessary.)
>>>>>>>>>>>> Well, you are correct: if tmp != pdev then it is correct to acquire
>>>>>>>>>>>> the lock. But if tmp == pdev and rom_only == true
>>>>>>>>>>>> then we'll deadlock.
>>>>>>>>>>>>
>>>>>>>>>>>> It seems we need to have the locking conditional, e.g. only lock
>>>>>>>>>>>> if tmp != pdev
>>>>>>>>>>> Which will address the live-lock, but introduce ABBA deadlock potential
>>>>>>>>>>> between the two locks.
>>>>>>>>>> I am not sure I can suggest a better solution here
>>>>>>>>>> @Roger, @Jan, could you please help here?
>>>>>>>>> Well, first of all I'd like to mention that while it may have been okay to
>>>>>>>>> not hold pcidevs_lock here for Dom0, it surely needs acquiring when dealing
>>>>>>>>> with DomU-s' lists of PCI devices. The requirement really applies to the
>>>>>>>>> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
>>>>>>>>> there it probably wants to be a try-lock.
>>>>>>>>>
>>>>>>>>> Next I'd like to point out that here we have the still pending issue of
>>>>>>>>> how to deal with hidden devices, which Dom0 can access. See my RFC patch
>>>>>>>>> "vPCI: account for hidden devices in modify_bars()". Whatever the solution
>>>>>>>>> here, I think it wants to at least account for the extra need there.
>>>>>>>> Yes, sorry, I should take care of that.
>>>>>>>>
>>>>>>>>> Now it is quite clear that pcidevs_lock isn't going to help with avoiding
>>>>>>>>> the deadlock, as it's imo not an option at all to acquire that lock
>>>>>>>>> everywhere else you access ->vpci (or else the vpci lock itself would be
>>>>>>>>> pointless). But a per-domain auxiliary r/w lock may help: Other paths
>>>>>>>>> would acquire it in read mode, and here you'd acquire it in write mode (in
>>>>>>>>> the former case around the vpci lock, while in the latter case there may
>>>>>>>>> then not be any need to acquire the individual vpci locks at all). FTAOD:
>>>>>>>>> I haven't fully thought through all implications (and hence whether this is
>>>>>>>>> viable in the first place); I expect you will, documenting what you've
>>>>>>>>> found in the resulting patch description. Of course the double lock
>>>>>>>>> acquire/release would then likely want hiding in helper functions.
>>>>>>>> I've been also thinking about this, and whether it's really worth to
>>>>>>>> have a per-device lock rather than a per-domain one that protects all
>>>>>>>> vpci regions of the devices assigned to the domain.
>>>>>>>>
>>>>>>>> The OS is likely to serialize accesses to the PCI config space anyway,
>>>>>>>> and the only place I could see a benefit of having per-device locks is
>>>>>>>> in the handling of MSI-X tables, as the handling of the mask bit is
>>>>>>>> likely very performance sensitive, so adding a per-domain lock there
>>>>>>>> could be a bottleneck.
>>>>>>> Hmm, with method 1 accesses serializing globally is basically
>>>>>>> unavoidable, but with MMCFG I see no reason why OSes may not (move
>>>>>>> to) permit(ting) parallel accesses, with serialization perhaps done
>>>>>>> only at device level. See our own pci_config_lock, which applies to
>>>>>>> only method 1 accesses; we don't look to be serializing MMCFG
>>>>>>> accesses at all.
>>>>>>>
>>>>>>>> We could alternatively do a per-domain rwlock for vpci and special case
>>>>>>>> the MSI-X area to also have a per-device specific lock. At which point
>>>>>>>> it becomes fairly similar to what you propose.
>>>>>> @Jan, @Roger
>>>>>>
>>>>>> 1. d->vpci_lock - rwlock <- this protects vpci
>>>>>> 2. pdev->vpci->msix_tbl_lock - rwlock <- this protects MSI-X tables
>>>>>> or should it better be pdev->msix_tbl_lock as MSI-X tables don't
>>>>>> really depend on vPCI?
>>>>> If so, perhaps indeed better the latter. But as said in reply to Roger,
>>>>> I'm not convinced (yet) that doing away with the per-device lock is a
>>>>> good move. As said there - we're ourselves doing fully parallel MMCFG
>>>>> accesses, so OSes ought to be fine to do so, too.
>>>> But with pdev->vpci_lock we face ABBA...
>>> I think it would be easier to start with a per-domain rwlock that
>>> guarantees pdev->vpci cannot be removed under our feet. This would be
>>> taken in read mode in vpci_{read,write} and in write mode when
>>> removing a device from a domain.
>>>
>>> Then there are also other issues regarding vPCI locking that need to
>>> be fixed, but that lock would likely be a start.
>> Or let's see the problem at a different angle: this is the only place
>> which breaks the use of pdev->vpci_lock. Because all other places
>> do not try to acquire the lock of any two devices at a time.
>> So, what if we re-work the offending piece of code instead?
>> That way we do not break parallel access and have the lock per-device
>> which might also be a plus.
>>
>> By re-work I mean, that instead of reading already mapped regions
>> from tmp we can employ a d->pci_mapped_regions range set which
>> will hold all the already mapped ranges. And when it is needed to access
>> that range set we use pcidevs_lock which seems to be rare.
>> So, modify_bars will rely on pdev->vpci_lock + pcidevs_lock and
>> ABBA won't be possible at all.
> Sadly that won't replace the usage of the loop in modify_bars. This is
> not (exclusively) done in order to prevent mapping the same region
> multiple times, but rather to prevent unmapping of regions as long as
> there's an enabled BAR that's using it.
>
> If you wanted to use something like d->pci_mapped_regions it would
> have to keep reference counts to regions, in order to know when a
> mapping is no longer required by any BAR on the system with memory
> decoding enabled.
I missed this path, thank you

I tried to analyze the locking in pci/vpci.

First of all some context to refresh the target we want:
the rationale behind moving pdev->vpci->lock outside
is to be able to dynamically create and destroy pdev->vpci.
So, for that reason the lock needs to be moved outside of pdev->vpci.

Some of the callers of the vPCI code and locking used:

======================================
vpci_mmio_read/vpci_mmcfg_read
======================================
   - vpci_ecam_read
   - vpci_read
    !!!!!!!! pdev is acquired, then pdev->vpci_lock is used !!!!!!!!
    - msix:
     - control_read
    - header:
     - guest_bar_read
    - msi:
     - control_read
     - address_read/address_hi_read
     - data_read
     - mask_read

======================================
vpci_mmio_write/vpci_mmcfg_write
======================================
   - vpci_ecam_write
   - vpci_write
    !!!!!!!! pdev is acquired, then pdev->vpci_lock is used !!!!!!!!
    - msix:
     - control_write
    - header:
     - bar_write/guest_bar_write
     - cmd_write/guest_cmd_write
     - rom_write
      - all write handlers may call modify_bars
       modify_bars
    - msi:
     - control_write
     - address_write/address_hi_write
     - data_write
     - mask_write

======================================
pci_add_device: locked with pcidevs_lock
======================================
   - vpci_add_handlers
    ++++++++ pdev->vpci_lock is used ++++++++

======================================
pci_remove_device: locked with pcidevs_lock
======================================
- vpci_remove_device
   ++++++++ pdev->vpci_lock is used ++++++++
- pci_cleanup_msi
- free_pdev

======================================
XEN_DOMCTL_assign_device: locked with pcidevs_lock
======================================
- assign_device
  - vpci_deassign_device
  - pdev_msix_assign
  - vpci_assign_device
   - vpci_add_handlers
     ++++++++ pdev->vpci_lock is used ++++++++

======================================
XEN_DOMCTL_deassign_device: locked with pcidevs_lock
======================================
- deassign_device
  - vpci_deassign_device
    ++++++++ pdev->vpci_lock is used ++++++++
   - vpci_remove_device


======================================
modify_bars is a special case: this is the only function which tries to lock
two pci_dev devices: it is done to check for overlaps with other BARs which may have been
already mapped or unmapped.

So, this is the only case which may deadlock because of pci_dev->vpci_lock.
======================================

Bottom line:
======================================

1. vpci_{read|write} are not protected with pcidevs_lock and can run in
parallel with pci_remove_device which can remove pdev after vpci_{read|write}
acquired the pdev pointer. This may lead to a failure when the stale pdev
is dereferenced.

So, to protect the pdev dereference vpci_{read|write} must also use pcidevs_lock.

2. The only offending place which is in the way of pci_dev->vpci_lock is
modify_bars. If it can be re-worked to track already mapped and unmapped
regions then we can avoid having a possible deadlock and can use
pci_dev->vpci_lock (rangesets won't help here as we would also need
refcounting to be implemented).
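
To illustrate why a plain rangeset is not enough, here is a minimal sketch of
refcounted region tracking. It is not Xen code: struct mapped_regions,
region_get(), region_put() and NR_FRAMES are hypothetical names, and a fixed
array of per-frame counters stands in for whatever structure would really be
used. A frame is only reported as unmappable once no enabled BAR uses it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: a tiny address space, one use counter per page frame. */
#define NR_FRAMES 16

struct mapped_regions {
    unsigned int users[NR_FRAMES]; /* enabled BARs currently using each frame */
};

/*
 * A BAR covering frames [start, end] gets memory decoding enabled.
 * Returns how many frames were not in use before and need mapping now.
 */
static size_t region_get(struct mapped_regions *r, size_t start, size_t end)
{
    size_t newly_mapped = 0;

    assert(start <= end && end < NR_FRAMES);
    for ( size_t i = start; i <= end; i++ )
        if ( r->users[i]++ == 0 )
            newly_mapped++;

    return newly_mapped;
}

/*
 * A BAR covering frames [start, end] gets memory decoding disabled.
 * Returns how many frames lost their last user and may be unmapped.
 */
static size_t region_put(struct mapped_regions *r, size_t start, size_t end)
{
    size_t unmapped = 0;

    assert(start <= end && end < NR_FRAMES);
    for ( size_t i = start; i <= end; i++ )
    {
        assert(r->users[i] > 0);
        if ( --r->users[i] == 0 )
            unmapped++;
    }

    return unmapped;
}
```

With two overlapping BARs, disabling the first one only releases the frames
the second one does not cover, which is the behaviour the loop in modify_bars
provides today.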

If pcidevs_lock is used for vpci_{read|write} then no deadlock is possible,
but modify_bars code must be re-worked not to lock itself (pdev->vpci_lock and
tmp->vpci_lock when pdev == tmp, this is minor).

3. We may think about a per-domain rwlock together with pdev->vpci_lock, which
solves modify_bars's access to two pdevs. But this doesn't solve the possible
stale pdev dereference in vpci_{read|write} vs pci_remove_device.

@Roger, @Jan, I would like to hear what you think about the above analysis
and how we can proceed with the locking re-work.

Thank you in advance,
Oleksandr


* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-07  7:29       ` Jan Beulich
@ 2022-02-07 11:27         ` Oleksandr Andrushchenko
  2022-02-07 12:38           ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 11:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel



On 07.02.22 09:29, Jan Beulich wrote:
> On 04.02.2022 15:37, Oleksandr Andrushchenko wrote:
>> On 04.02.22 16:30, Jan Beulich wrote:
>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>> Reset the command register when assigning a PCI device to a guest:
>>>> according to the PCI spec the PCI_COMMAND register is typically all 0's
>>>> after reset.
>>> It's not entirely clear to me whether setting the hardware register to
>>> zero is okay. What wants to be zero is the value the guest observes
>>> initially.
>> "the PCI spec says the PCI_COMMAND register is typically all 0's after reset."
>> Why wouldn't it be ok? What is the exact concern here?
> The concern is - as voiced is similar ways before, perhaps in other
> contexts - that you need to consider bit-by-bit whether overwriting
> with 0 what is currently there is okay. Xen and/or Dom0 may have put
> values there which they expect to remain unaltered. I guess
> PCI_COMMAND_SERR is a good example: While the guest's view of this
> will want to be zero initially, the host having set it to 1 may not
> easily be overwritten with 0, or else you'd effectively imply giving
> the guest control of the bit.
We have already discussed in great detail PCI_COMMAND emulation [1].
At the end you wrote [2]:
"Well, in order for the whole thing to be security supported it needs to
be explained for every bit why it is safe to allow the guest to drive it.
Until you mean vPCI to reach that state, leaving TODO notes in the code
for anything not investigated may indeed be good enough.

Jan"

So, this is why I left a TODO in the PCI_COMMAND emulation for now and only
care about INTx which is honored with the code in this patch.
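
As a rough illustration of the bit-by-bit concern, a minimal sketch follows.
It is not the code from this patch: cmd_merge() and its guest_mask parameter
are hypothetical, and which bits belong in that mask is exactly the open TODO.
The PCI_COMMAND_* values are the standard PCI ones. The idea is that
guest-drivable bits come from the guest's value, while all other bits keep
whatever the host has programmed.

```c
#include <stdint.h>

/* Command register bits as defined by the PCI specification. */
#define PCI_COMMAND_MEMORY        0x0002
#define PCI_COMMAND_MASTER        0x0004
#define PCI_COMMAND_SERR          0x0100
#define PCI_COMMAND_INTX_DISABLE  0x0400

/*
 * Hypothetical helper: compute the value to program into the hardware
 * register. Bits in guest_mask are taken from the guest's view, all
 * other bits are preserved from the current hardware value.
 */
static uint16_t cmd_merge(uint16_t hw_cmd, uint16_t guest_cmd,
                          uint16_t guest_mask)
{
    return (uint16_t)((hw_cmd & ~guest_mask) | (guest_cmd & guest_mask));
}
```

With such a helper a guest view of 0 at assignment time clears only the bits
in guest_mask, so a host-set PCI_COMMAND_SERR would survive.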
>
> Jan
>

Thank you,
Oleksandr

[1] https://patchwork.kernel.org/project/xen-devel/patch/20210903100831.177748-9-andr2000@gmail.com/
[2] https://lists.xenproject.org/archives/html/xen-devel/2021-09/msg00737.html


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 11:08                             ` Oleksandr Andrushchenko
@ 2022-02-07 12:34                               ` Jan Beulich
  2022-02-07 12:57                                 ` Oleksandr Andrushchenko
  2022-02-07 12:46                               ` Roger Pau Monné
  1 sibling, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-07 12:34 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 07.02.2022 12:08, Oleksandr Andrushchenko wrote:
> 1. vpci_{read|write} are not protected with pcidevs_lock and can run in
> parallel with pci_remove_device which can remove pdev after vpci_{read|write}
> acquired the pdev pointer. This may lead to a fail due to pdev dereference.
> 
> So, to protect pdev dereference vpci_{read|write} must also use pdevs_lock.

I think this is not the only place where there is a theoretical race
against pci_remove_device(). I would recommend to separate the
overall situation with pcidevs_lock from the issue here. I don't view
it as an option to acquire pcidevs_lock in vpci_{read,write}(). If
anything, we need proper refcounting of PCI devices (at which point
likely a number of lock uses can go away).
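
A minimal sketch of what such refcounting could look like, assuming C11
atomics in place of Xen's own primitives. struct pdev, pdev_alloc(),
pdev_get() and pdev_put() are hypothetical names used for illustration only,
and the "freed" flag exists purely so the effect can be observed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct pci_dev; only the refcount matters here. */
struct pdev {
    atomic_uint refcnt;
    bool *freed;    /* demo only: set when the object is released */
};

static struct pdev *pdev_alloc(bool *freed)
{
    struct pdev *p = malloc(sizeof(*p));

    if ( !p )
        return NULL;

    atomic_init(&p->refcnt, 1);    /* reference owned by the device list */
    p->freed = freed;
    *freed = false;

    return p;
}

/* Take a reference; a real lookup would do this under the list lock. */
static struct pdev *pdev_get(struct pdev *p)
{
    atomic_fetch_add(&p->refcnt, 1);
    return p;
}

/* Drop a reference; the last put frees the object. Returns true if freed. */
static bool pdev_put(struct pdev *p)
{
    if ( atomic_fetch_sub(&p->refcnt, 1) != 1 )
        return false;

    *p->freed = true;
    free(p);

    return true;
}
```

Removal then only drops the list's reference, and a concurrent
vpci_{read,write}-like user keeps the device alive until its own put.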

Jan




* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-07 11:27         ` Oleksandr Andrushchenko
@ 2022-02-07 12:38           ` Jan Beulich
  2022-02-07 12:51             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-07 12:38 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 07.02.2022 12:27, Oleksandr Andrushchenko wrote:
> 
> 
> On 07.02.22 09:29, Jan Beulich wrote:
>> On 04.02.2022 15:37, Oleksandr Andrushchenko wrote:
>>> On 04.02.22 16:30, Jan Beulich wrote:
>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>> Reset the command register when assigning a PCI device to a guest:
>>>>> according to the PCI spec the PCI_COMMAND register is typically all 0's
>>>>> after reset.
>>>> It's not entirely clear to me whether setting the hardware register to
>>>> zero is okay. What wants to be zero is the value the guest observes
>>>> initially.
>>> "the PCI spec says the PCI_COMMAND register is typically all 0's after reset."
>>> Why wouldn't it be ok? What is the exact concern here?
>> The concern is - as voiced is similar ways before, perhaps in other
>> contexts - that you need to consider bit-by-bit whether overwriting
>> with 0 what is currently there is okay. Xen and/or Dom0 may have put
>> values there which they expect to remain unaltered. I guess
>> PCI_COMMAND_SERR is a good example: While the guest's view of this
>> will want to be zero initially, the host having set it to 1 may not
>> easily be overwritten with 0, or else you'd effectively imply giving
>> the guest control of the bit.
> We have already discussed in great detail PCI_COMMAND emulation [1].
> At the end you wrote [1]:
> "Well, in order for the whole thing to be security supported it needs to
> be explained for every bit why it is safe to allow the guest to drive it.
> Until you mean vPCI to reach that state, leaving TODO notes in the code
> for anything not investigated may indeed be good enough.
> 
> Jan"
> 
> So, this is why I left a TODO in the PCI_COMMAND emulation for now and only
> care about INTx which is honored with the code in this patch.

Right. The issue I see is that the description does not have any
mention of this, but instead talks about simply writing zero.

Jan




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 11:08                             ` Oleksandr Andrushchenko
  2022-02-07 12:34                               ` Jan Beulich
@ 2022-02-07 12:46                               ` Roger Pau Monné
  2022-02-07 13:53                                 ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-07 12:46 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel

On Mon, Feb 07, 2022 at 11:08:39AM +0000, Oleksandr Andrushchenko wrote:
> Hello,
> 
> On 04.02.22 16:57, Roger Pau Monné wrote:
> > On Fri, Feb 04, 2022 at 02:43:07PM +0000, Oleksandr Andrushchenko wrote:
> >>
> >> On 04.02.22 15:06, Roger Pau Monné wrote:
> >>> On Fri, Feb 04, 2022 at 12:53:20PM +0000, Oleksandr Andrushchenko wrote:
> >>>> On 04.02.22 14:47, Jan Beulich wrote:
> >>>>> On 04.02.2022 13:37, Oleksandr Andrushchenko wrote:
> >>>>>> On 04.02.22 13:37, Jan Beulich wrote:
> >>>>>>> On 04.02.2022 12:13, Roger Pau Monné wrote:
> >>>>>>>> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
> >>>>>>>>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
> >>>>>>>>>> On 04.02.22 11:15, Jan Beulich wrote:
> >>>>>>>>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
> >>>>>>>>>>>> On 04.02.22 09:52, Jan Beulich wrote:
> >>>>>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> >>>>>>>>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>>>>>>>>>>>>>                        continue;
> >>>>>>>>>>>>>>                }
> >>>>>>>>>>>>>>        
> >>>>>>>>>>>>>> +        spin_lock(&tmp->vpci_lock);
> >>>>>>>>>>>>>> +        if ( !tmp->vpci )
> >>>>>>>>>>>>>> +        {
> >>>>>>>>>>>>>> +            spin_unlock(&tmp->vpci_lock);
> >>>>>>>>>>>>>> +            continue;
> >>>>>>>>>>>>>> +        }
> >>>>>>>>>>>>>>                for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
> >>>>>>>>>>>>>>                {
> >>>>>>>>>>>>>>                    const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
> >>>>>>>>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>>>>>>>>>>>>>                    rc = rangeset_remove_range(mem, start, end);
> >>>>>>>>>>>>>>                    if ( rc )
> >>>>>>>>>>>>>>                    {
> >>>>>>>>>>>>>> +                spin_unlock(&tmp->vpci_lock);
> >>>>>>>>>>>>>>                        printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
> >>>>>>>>>>>>>>                               start, end, rc);
> >>>>>>>>>>>>>>                        rangeset_destroy(mem);
> >>>>>>>>>>>>>>                        return rc;
> >>>>>>>>>>>>>>                    }
> >>>>>>>>>>>>>>                }
> >>>>>>>>>>>>>> +        spin_unlock(&tmp->vpci_lock);
> >>>>>>>>>>>>>>            }
> >>>>>>>>>>>>> At the first glance this simply looks like another unjustified (in the
> >>>>>>>>>>>>> description) change, as you're not converting anything here but you
> >>>>>>>>>>>>> actually add locking (and I realize this was there before, so I'm sorry
> >>>>>>>>>>>>> for not pointing this out earlier).
> >>>>>>>>>>>> Well, I thought that the description already has "...the lock can be
> >>>>>>>>>>>> used (and in a few cases is used right away) to check whether vpci
> >>>>>>>>>>>> is present" and this is enough for such uses as here.
> >>>>>>>>>>>>>        But then I wonder whether you
> >>>>>>>>>>>>> actually tested this, since I can't help getting the impression that
> >>>>>>>>>>>>> you're introducing a live-lock: The function is called from cmd_write()
> >>>>>>>>>>>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
> >>>>>>>>>>>>> function already holds the lock, and the lock is not (currently)
> >>>>>>>>>>>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
> >>>>>>>>>>>>> the locking looks to be entirely unnecessary.)
> >>>>>>>>>>>> Well, you are correct: if tmp != pdev then it is correct to acquire
> >>>>>>>>>>>> the lock. But if tmp == pdev and rom_only == true
> >>>>>>>>>>>> then we'll deadlock.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It seems we need to have the locking conditional, e.g. only lock
> >>>>>>>>>>>> if tmp != pdev
> >>>>>>>>>>> Which will address the live-lock, but introduce ABBA deadlock potential
> >>>>>>>>>>> between the two locks.
> >>>>>>>>>> I am not sure I can suggest a better solution here
> >>>>>>>>>> @Roger, @Jan, could you please help here?
> >>>>>>>>> Well, first of all I'd like to mention that while it may have been okay to
> >>>>>>>>> not hold pcidevs_lock here for Dom0, it surely needs acquiring when dealing
> >>>>>>>>> with DomU-s' lists of PCI devices. The requirement really applies to the
> >>>>>>>>> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
> >>>>>>>>> there it probably wants to be a try-lock.
> >>>>>>>>>
> >>>>>>>>> Next I'd like to point out that here we have the still pending issue of
> >>>>>>>>> how to deal with hidden devices, which Dom0 can access. See my RFC patch
> >>>>>>>>> "vPCI: account for hidden devices in modify_bars()". Whatever the solution
> >>>>>>>>> here, I think it wants to at least account for the extra need there.
> >>>>>>>> Yes, sorry, I should take care of that.
> >>>>>>>>
> >>>>>>>>> Now it is quite clear that pcidevs_lock isn't going to help with avoiding
> >>>>>>>>> the deadlock, as it's imo not an option at all to acquire that lock
> >>>>>>>>> everywhere else you access ->vpci (or else the vpci lock itself would be
> >>>>>>>>> pointless). But a per-domain auxiliary r/w lock may help: Other paths
> >>>>>>>>> would acquire it in read mode, and here you'd acquire it in write mode (in
> >>>>>>>>> the former case around the vpci lock, while in the latter case there may
> >>>>>>>>> then not be any need to acquire the individual vpci locks at all). FTAOD:
> >>>>>>>>> I haven't fully thought through all implications (and hence whether this is
> >>>>>>>>> viable in the first place); I expect you will, documenting what you've
> >>>>>>>>> found in the resulting patch description. Of course the double lock
> >>>>>>>>> acquire/release would then likely want hiding in helper functions.
> >>>>>>>> I've been also thinking about this, and whether it's really worth to
> >>>>>>>> have a per-device lock rather than a per-domain one that protects all
> >>>>>>>> vpci regions of the devices assigned to the domain.
> >>>>>>>>
> >>>>>>>> The OS is likely to serialize accesses to the PCI config space anyway,
> >>>>>>>> and the only place I could see a benefit of having per-device locks is
> >>>>>>>> in the handling of MSI-X tables, as the handling of the mask bit is
> >>>>>>>> likely very performance sensitive, so adding a per-domain lock there
> >>>>>>>> could be a bottleneck.
> >>>>>>> Hmm, with method 1 accesses serializing globally is basically
> >>>>>>> unavoidable, but with MMCFG I see no reason why OSes may not (move
> >>>>>>> to) permit(ting) parallel accesses, with serialization perhaps done
> >>>>>>> only at device level. See our own pci_config_lock, which applies to
> >>>>>>> only method 1 accesses; we don't look to be serializing MMCFG
> >>>>>>> accesses at all.
> >>>>>>>
> >>>>>>>> We could alternatively do a per-domain rwlock for vpci and special case
> >>>>>>>> the MSI-X area to also have a per-device specific lock. At which point
> >>>>>>>> it becomes fairly similar to what you propose.
> >>>>>> @Jan, @Roger
> >>>>>>
> >>>>>> 1. d->vpci_lock - rwlock <- this protects vpci
> >>>>>> 2. pdev->vpci->msix_tbl_lock - rwlock <- this protects MSI-X tables
> >>>>>> or should it better be pdev->msix_tbl_lock as MSI-X tables don't
> >>>>>> really depend on vPCI?
> >>>>> If so, perhaps indeed better the latter. But as said in reply to Roger,
> >>>>> I'm not convinced (yet) that doing away with the per-device lock is a
> >>>>> good move. As said there - we're ourselves doing fully parallel MMCFG
> >>>>> accesses, so OSes ought to be fine to do so, too.
> >>>> But with pdev->vpci_lock we face ABBA...
> >>> I think it would be easier to start with a per-domain rwlock that
> >>> guarantees pdev->vpci cannot be removed under our feet. This would be
> >>> taken in read mode in vpci_{read,write} and in write mode when
> >>> removing a device from a domain.
> >>>
> >>> Then there are also other issues regarding vPCI locking that need to
> >>> be fixed, but that lock would likely be a start.
> >> Or let's see the problem at a different angle: this is the only place
> >> which breaks the use of pdev->vpci_lock. Because all other places
> >> do not try to acquire the lock of any two devices at a time.
> >> So, what if we re-work the offending piece of code instead?
> >> That way we do not break parallel access and have the lock per-device
> >> which might also be a plus.
> >>
> >> By re-work I mean, that instead of reading already mapped regions
> >> from tmp we can employ a d->pci_mapped_regions range set which
> >> will hold all the already mapped ranges. And when it is needed to access
> >> that range set we use pcidevs_lock which seems to be rare.
> >> So, modify_bars will rely on pdev->vpci_lock + pcidevs_lock and
> >> ABBA won't be possible at all.
> > Sadly that won't replace the usage of the loop in modify_bars. This is
> > not (exclusively) done in order to prevent mapping the same region
> > multiple times, but rather to prevent unmapping of regions as long as
> > there's an enabled BAR that's using it.
> >
> > If you wanted to use something like d->pci_mapped_regions it would
> > have to keep reference counts to regions, in order to know when a
> > mapping is no longer required by any BAR on the system with memory
> > decoding enabled.
> I missed this path, thank you
> 
> I tried to analyze the locking in pci/vpci.
> 
> First of all some context to refresh the target we want:
> the rationale behind moving pdev->vpci->lock outside
> is to be able dynamically create and destroy pdev->vpci.
> So, for that reason lock needs to be moved outside of the pdev->vpci.
> 
> Some of the callers of the vPCI code and locking used:
> 
> ======================================
> vpci_mmio_read/vpci_mmcfg_read
> ======================================
>    - vpci_ecam_read
>    - vpci_read
>     !!!!!!!! pdev is acquired, then pdev->vpci_lock is used !!!!!!!!
>     - msix:
>      - control_read
>     - header:
>      - guest_bar_read
>     - msi:
>      - control_read
>      - address_read/address_hi_read
>      - data_read
>      - mask_read
> 
> ======================================
> vpci_mmio_write/vpci_mmcfg_write
> ======================================
>    - vpci_ecam_write
>    - vpci_write
>     !!!!!!!! pdev is acquired, then pdev->vpci_lock is used !!!!!!!!
>     - msix:
>      - control_write
>     - header:
>      - bar_write/guest_bar_write
>      - cmd_write/guest_cmd_write
>      - rom_write
>       - all write handlers may call modify_bars
>        modify_bars
>     - msi:
>      - control_write
>      - address_write/address_hi_write
>      - data_write
>      - mask_write
> 
> ======================================
> pci_add_device: locked with pcidevs_lock
> ======================================
>    - vpci_add_handlers
>     ++++++++ pdev->vpci_lock is used ++++++++
> 
> ======================================
> pci_remove_device: locked with pcidevs_lock
> ======================================
> - vpci_remove_device
>    ++++++++ pdev->vpci_lock is used ++++++++
> - pci_cleanup_msi
> - free_pdev
> 
> ======================================
> XEN_DOMCTL_assign_device: locked with pcidevs_lock
> ======================================
> - assign_device
>   - vpci_deassign_device
>   - pdev_msix_assign
>   - vpci_assign_device
>    - vpci_add_handlers
>      ++++++++ pdev->vpci_lock is used ++++++++
> 
> ======================================
> XEN_DOMCTL_deassign_device: locked with pcidevs_lock
> ======================================
> - deassign_device
>   - vpci_deassign_device
>     ++++++++ pdev->vpci_lock is used ++++++++
>    - vpci_remove_device
> 
> 
> ======================================
> modify_bars is a special case: this is the only function which tries to lock
> two pci_dev devices: it is done to check for overlaps with other BARs which may have been
> already mapped or unmapped.
> 
> So, this is the only case which may deadlock because of pci_dev->vpci_lock.
> ======================================
> 
> Bottom line:
> ======================================
> 
> 1. vpci_{read|write} are not protected with pcidevs_lock and can run in
> parallel with pci_remove_device which can remove pdev after vpci_{read|write}
> acquired the pdev pointer. This may lead to a fail due to pdev dereference.
> 
> So, to protect pdev dereference vpci_{read|write} must also use pdevs_lock.

We would like to take the pcidevs_lock only while fetching the device
(ie: pci_get_pdev_by_domain), afterwards it should be fine to lock the
device using a vpci specific lock so calls to vpci_{read,write} can be
partially concurrent across multiple domains.

In fact I think Jan had already pointed out that the pci lock would
need taking while searching for the device in vpci_{read,write}.

It seems to me that if you implement option 3 below taking the
per-domain rwlock in read mode in vpci_{read|write} will already
protect you from the device being removed if the same per-domain lock
is taken in write mode in vpci_remove_device.

> 2. The only offending place which is in the way of pci_dev->vpci_lock is
> modify_bars. If it can be re-worked to track already mapped and unmapped
> regions then we can avoid having a possible deadlock and can use
> pci_dev->vpci_lock (rangesets won't help here as we also need refcounting be
> implemented).

I think a refcounting based solution will be very complex to
implement. I'm however happy to be proven wrong.

> If pcidevs_lock is used for vpci_{read|write} then no deadlock is possible,
> but modify_bars code must be re-worked not to lock itself (pdev->vpci_lock and
> tmp->vpci_lock when pdev == tmp, this is minor).

Taking the pcidevs lock (a global lock) is out of the picture IMO, as
it's going to serialize all calls of vpci_{read|write}, and would
create too much contention on the pcidevs lock.

> 3. We may think about a per-domain rwlock and pdev->vpci_lock, so this solves
> modify_bars's two pdevs access. But this doesn't solve possible pdev
> de-reference in vpci_{read|write} vs pci_remove_device.

pci_remove_device will call vpci_remove_device, so as long as
vpci_remove_device takes the per-domain lock in write (exclusive) mode
it should be fine.

> @Roger, @Jan, I would like to hear what do you think about the above analysis
> and how can we proceed with locking re-work?

I think the per-domain rwlock seems like a good option. I would do
that as a pre-patch.

Thanks, Roger.



* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-07 12:38           ` Jan Beulich
@ 2022-02-07 12:51             ` Oleksandr Andrushchenko
  2022-02-07 12:54               ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 12:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 07.02.22 14:38, Jan Beulich wrote:
> On 07.02.2022 12:27, Oleksandr Andrushchenko wrote:
>>
>> On 07.02.22 09:29, Jan Beulich wrote:
>>> On 04.02.2022 15:37, Oleksandr Andrushchenko wrote:
>>>> On 04.02.22 16:30, Jan Beulich wrote:
>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>> Reset the command register when assigning a PCI device to a guest:
>>>>>> according to the PCI spec the PCI_COMMAND register is typically all 0's
>>>>>> after reset.
>>>>> It's not entirely clear to me whether setting the hardware register to
>>>>> zero is okay. What wants to be zero is the value the guest observes
>>>>> initially.
>>>> "the PCI spec says the PCI_COMMAND register is typically all 0's after reset."
>>>> Why wouldn't it be ok? What is the exact concern here?
>>> The concern is - as voiced in similar ways before, perhaps in other
>>> contexts - that you need to consider bit-by-bit whether overwriting
>>> with 0 what is currently there is okay. Xen and/or Dom0 may have put
>>> values there which they expect to remain unaltered. I guess
>>> PCI_COMMAND_SERR is a good example: While the guest's view of this
>>> will want to be zero initially, the host having set it to 1 may not
>>> easily be overwritten with 0, or else you'd effectively imply giving
>>> the guest control of the bit.
>> We have already discussed in great detail PCI_COMMAND emulation [1].
>> At the end you wrote [1]:
>> "Well, in order for the whole thing to be security supported it needs to
>> be explained for every bit why it is safe to allow the guest to drive it.
>> Until you mean vPCI to reach that state, leaving TODO notes in the code
>> for anything not investigated may indeed be good enough.
>>
>> Jan"
>>
>> So, this is why I left a TODO in the PCI_COMMAND emulation for now and only
>> care about INTx which is honored with the code in this patch.
> Right. The issue I see is that the description does not have any
> mention of this, but instead talks about simply writing zero.
How do you want that mentioned? Extended commit message or
just a link to the thread [1]?
With the above done, do you think that writing 0's is an acceptable
approach as of now?
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-07 12:51             ` Oleksandr Andrushchenko
@ 2022-02-07 12:54               ` Jan Beulich
  2022-02-07 14:17                 ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-07 12:54 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 07.02.2022 13:51, Oleksandr Andrushchenko wrote:
> 
> 
> On 07.02.22 14:38, Jan Beulich wrote:
>> On 07.02.2022 12:27, Oleksandr Andrushchenko wrote:
>>>
>>> On 07.02.22 09:29, Jan Beulich wrote:
>>>> On 04.02.2022 15:37, Oleksandr Andrushchenko wrote:
>>>>> On 04.02.22 16:30, Jan Beulich wrote:
>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>> Reset the command register when assigning a PCI device to a guest:
>>>>>>> according to the PCI spec the PCI_COMMAND register is typically all 0's
>>>>>>> after reset.
>>>>>> It's not entirely clear to me whether setting the hardware register to
>>>>>> zero is okay. What wants to be zero is the value the guest observes
>>>>>> initially.
>>>>> "the PCI spec says the PCI_COMMAND register is typically all 0's after reset."
>>>>> Why wouldn't it be ok? What is the exact concern here?
>>>> The concern is - as voiced in similar ways before, perhaps in other
>>>> contexts - that you need to consider bit-by-bit whether overwriting
>>>> with 0 what is currently there is okay. Xen and/or Dom0 may have put
>>>> values there which they expect to remain unaltered. I guess
>>>> PCI_COMMAND_SERR is a good example: While the guest's view of this
>>>> will want to be zero initially, the host having set it to 1 may not
>>>> easily be overwritten with 0, or else you'd effectively imply giving
>>>> the guest control of the bit.
>>> We have already discussed in great detail PCI_COMMAND emulation [1].
>>> At the end you wrote [1]:
>>> "Well, in order for the whole thing to be security supported it needs to
>>> be explained for every bit why it is safe to allow the guest to drive it.
>>> Until you mean vPCI to reach that state, leaving TODO notes in the code
>>> for anything not investigated may indeed be good enough.
>>>
>>> Jan"
>>>
>>> So, this is why I left a TODO in the PCI_COMMAND emulation for now and only
>>> care about INTx which is honored with the code in this patch.
>> Right. The issue I see is that the description does not have any
>> mention of this, but instead talks about simply writing zero.
> How do you want that mentioned? Extended commit message or
> just a link to the thread [1]?

What I'd like you to describe is what the change does without
fundamentally implying it'll end up being zero which gets written
to the register. Stating as a conclusion that for the time being
this means writing zero is certainly fine (and likely helpful if
made explicit).

> With the above done, do you think that writing 0's is an acceptable
> approach as of now?

Well, yes, provided we have a sufficiently similar understanding
of what "acceptable" here means.

Jan




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 12:34                               ` Jan Beulich
@ 2022-02-07 12:57                                 ` Oleksandr Andrushchenko
  2022-02-07 13:02                                   ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 12:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné,
	Oleksandr Andrushchenko



On 07.02.22 14:34, Jan Beulich wrote:
> On 07.02.2022 12:08, Oleksandr Andrushchenko wrote:
>> 1. vpci_{read|write} are not protected with pcidevs_lock and can run in
>> parallel with pci_remove_device which can remove pdev after vpci_{read|write}
>> acquired the pdev pointer. This may lead to a fail due to pdev dereference.
>>
>> So, to protect pdev dereference vpci_{read|write} must also use pcidevs_lock.
> I think this is not the only place where there is a theoretical race
> against pci_remove_device().
Not at all, that was just to demonstrate one of the possible sources of races.
>   I would recommend to separate the
> overall situation with pcidevs_lock from the issue here.
Do you agree that there is already an issue with that? In the currently existing code?
>   I don't view
> it as an option to acquire pcidevs_lock in vpci_{read,write}().
Yes, that would hurt too much, I agree. But this needs to be solved.
>   If
> anything, we need proper refcounting of PCI devices (at which point
> likely a number of lock uses can go away).
It seems so. Then not only do pdevs need refcounting, but pdev->vpci does as well.

What's your view on how we can achieve both goals, i.e. locking/refcounting
for both pdev and pdev->vpci?
This is really crucial for all the PCI passthrough code on Arm, because
without this groundwork done we can't accept any of the patches which rely
on it: vPCI changes, MSI/MSI-X etc.
>
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 12:57                                 ` Oleksandr Andrushchenko
@ 2022-02-07 13:02                                   ` Jan Beulich
  0 siblings, 0 replies; 138+ messages in thread
From: Jan Beulich @ 2022-02-07 13:02 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 07.02.2022 13:57, Oleksandr Andrushchenko wrote:
> 
> 
> On 07.02.22 14:34, Jan Beulich wrote:
>> On 07.02.2022 12:08, Oleksandr Andrushchenko wrote:
>>> 1. vpci_{read|write} are not protected with pcidevs_lock and can run in
>>> parallel with pci_remove_device which can remove pdev after vpci_{read|write}
>>> acquired the pdev pointer. This may lead to a fail due to pdev dereference.
>>>
>>> So, to protect pdev dereference vpci_{read|write} must also use pcidevs_lock.
>> I think this is not the only place where there is a theoretical race
>> against pci_remove_device().
> Not at all, that was just to demonstrate one of the possible sources of races.
>>   I would recommend to separate the
>> overall situation with pcidevs_lock from the issue here.
> Do you agree that there is already an issue with that? In the currently existing code?
>>   I don't view
>> it as an option to acquire pcidevs_lock in vpci_{read,write}().
> Yes, that would hurt too much, I agree. But this needs to be solved
>>   If
>> anything, we need proper refcounting of PCI devices (at which point
>> likely a number of lock uses can go away).
> It seems so. Then not only pdev's need refcounting, but pdev->vpci as well
> 
> What's your view on how can we achieve both goals?
> pdev and pdev->vpci and locking/refcounting

I don't see why pdev->vpci might need refcounting. And just to state it
in different words: I'd like to suggest to leave aside the pdev locking
as long as it's _just_ to protect against hot remove of a device. That's
orthogonal to what you need for vPCI, where you need to protect
against the device disappearing from a guest (without at the same time
disappearing from the host).

Jan




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 12:46                               ` Roger Pau Monné
@ 2022-02-07 13:53                                 ` Oleksandr Andrushchenko
  2022-02-07 14:11                                   ` Jan Beulich
  2022-02-07 14:19                                   ` Roger Pau Monné
  0 siblings, 2 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 13:53 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 07.02.22 14:46, Roger Pau Monné wrote:
> On Mon, Feb 07, 2022 at 11:08:39AM +0000, Oleksandr Andrushchenko wrote:
>> Hello,
>>
>> On 04.02.22 16:57, Roger Pau Monné wrote:
>>> On Fri, Feb 04, 2022 at 02:43:07PM +0000, Oleksandr Andrushchenko wrote:
>>>> On 04.02.22 15:06, Roger Pau Monné wrote:
>>>>> On Fri, Feb 04, 2022 at 12:53:20PM +0000, Oleksandr Andrushchenko wrote:
>>>>>> On 04.02.22 14:47, Jan Beulich wrote:
>>>>>>> On 04.02.2022 13:37, Oleksandr Andrushchenko wrote:
>>>>>>>> On 04.02.22 13:37, Jan Beulich wrote:
>>>>>>>>> On 04.02.2022 12:13, Roger Pau Monné wrote:
>>>>>>>>>> On Fri, Feb 04, 2022 at 11:49:18AM +0100, Jan Beulich wrote:
>>>>>>>>>>> On 04.02.2022 11:12, Oleksandr Andrushchenko wrote:
>>>>>>>>>>>> On 04.02.22 11:15, Jan Beulich wrote:
>>>>>>>>>>>>> On 04.02.2022 09:58, Oleksandr Andrushchenko wrote:
>>>>>>>>>>>>>> On 04.02.22 09:52, Jan Beulich wrote:
>>>>>>>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>>>>>>>>> @@ -285,6 +286,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>>>>>>>                         continue;
>>>>>>>>>>>>>>>>                 }
>>>>>>>>>>>>>>>>         
>>>>>>>>>>>>>>>> +        spin_lock(&tmp->vpci_lock);
>>>>>>>>>>>>>>>> +        if ( !tmp->vpci )
>>>>>>>>>>>>>>>> +        {
>>>>>>>>>>>>>>>> +            spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>>>>> +            continue;
>>>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>>>                 for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>>>>>>>>>>>>>>>                 {
>>>>>>>>>>>>>>>>                     const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
>>>>>>>>>>>>>>>> @@ -303,12 +310,14 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>>>>>>>>>>>>>>>                     rc = rangeset_remove_range(mem, start, end);
>>>>>>>>>>>>>>>>                     if ( rc )
>>>>>>>>>>>>>>>>                     {
>>>>>>>>>>>>>>>> +                spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>>>>>                         printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>>>>>>>>>>>>>>>                                start, end, rc);
>>>>>>>>>>>>>>>>                         rangeset_destroy(mem);
>>>>>>>>>>>>>>>>                         return rc;
>>>>>>>>>>>>>>>>                     }
>>>>>>>>>>>>>>>>                 }
>>>>>>>>>>>>>>>> +        spin_unlock(&tmp->vpci_lock);
>>>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>> At the first glance this simply looks like another unjustified (in the
>>>>>>>>>>>>>>> description) change, as you're not converting anything here but you
>>>>>>>>>>>>>>> actually add locking (and I realize this was there before, so I'm sorry
>>>>>>>>>>>>>>> for not pointing this out earlier).
>>>>>>>>>>>>>> Well, I thought that the description already has "...the lock can be
>>>>>>>>>>>>>> used (and in a few cases is used right away) to check whether vpci
>>>>>>>>>>>>>> is present" and this is enough for such uses as here.
>>>>>>>>>>>>>>>         But then I wonder whether you
>>>>>>>>>>>>>>> actually tested this, since I can't help getting the impression that
>>>>>>>>>>>>>>> you're introducing a live-lock: The function is called from cmd_write()
>>>>>>>>>>>>>>> and rom_write(), which in turn are called out of vpci_write(). Yet that
>>>>>>>>>>>>>>> function already holds the lock, and the lock is not (currently)
>>>>>>>>>>>>>>> recursive. (For the 3rd caller of the function - init_bars() - otoh
>>>>>>>>>>>>>>> the locking looks to be entirely unnecessary.)
>>>>>>>>>>>>>> Well, you are correct: if tmp != pdev then it is correct to acquire
>>>>>>>>>>>>>> the lock. But if tmp == pdev and rom_only == true
>>>>>>>>>>>>>> then we'll deadlock.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It seems we need to have the locking conditional, e.g. only lock
>>>>>>>>>>>>>> if tmp != pdev
>>>>>>>>>>>>> Which will address the live-lock, but introduce ABBA deadlock potential
>>>>>>>>>>>>> between the two locks.
>>>>>>>>>>>> I am not sure I can suggest a better solution here
>>>>>>>>>>>> @Roger, @Jan, could you please help here?
>>>>>>>>>>> Well, first of all I'd like to mention that while it may have been okay to
>>>>>>>>>>> not hold pcidevs_lock here for Dom0, it surely needs acquiring when dealing
>>>>>>>>>>> with DomU-s' lists of PCI devices. The requirement really applies to the
>>>>>>>>>>> other use of for_each_pdev() as well (in vpci_dump_msi()), except that
>>>>>>>>>>> there it probably wants to be a try-lock.
>>>>>>>>>>>
>>>>>>>>>>> Next I'd like to point out that here we have the still pending issue of
>>>>>>>>>>> how to deal with hidden devices, which Dom0 can access. See my RFC patch
>>>>>>>>>>> "vPCI: account for hidden devices in modify_bars()". Whatever the solution
>>>>>>>>>>> here, I think it wants to at least account for the extra need there.
>>>>>>>>>> Yes, sorry, I should take care of that.
>>>>>>>>>>
>>>>>>>>>>> Now it is quite clear that pcidevs_lock isn't going to help with avoiding
>>>>>>>>>>> the deadlock, as it's imo not an option at all to acquire that lock
>>>>>>>>>>> everywhere else you access ->vpci (or else the vpci lock itself would be
>>>>>>>>>>> pointless). But a per-domain auxiliary r/w lock may help: Other paths
>>>>>>>>>>> would acquire it in read mode, and here you'd acquire it in write mode (in
>>>>>>>>>>> the former case around the vpci lock, while in the latter case there may
>>>>>>>>>>> then not be any need to acquire the individual vpci locks at all). FTAOD:
>>>>>>>>>>> I haven't fully thought through all implications (and hence whether this is
>>>>>>>>>>> viable in the first place); I expect you will, documenting what you've
>>>>>>>>>>> found in the resulting patch description. Of course the double lock
>>>>>>>>>>> acquire/release would then likely want hiding in helper functions.
>>>>>>>>>> I've been also thinking about this, and whether it's really worth to
>>>>>>>>>> have a per-device lock rather than a per-domain one that protects all
>>>>>>>>>> vpci regions of the devices assigned to the domain.
>>>>>>>>>>
>>>>>>>>>> The OS is likely to serialize accesses to the PCI config space anyway,
>>>>>>>>>> and the only place I could see a benefit of having per-device locks is
>>>>>>>>>> in the handling of MSI-X tables, as the handling of the mask bit is
>>>>>>>>>> likely very performance sensitive, so adding a per-domain lock there
>>>>>>>>>> could be a bottleneck.
>>>>>>>>> Hmm, with method 1 accesses serializing globally is basically
>>>>>>>>> unavoidable, but with MMCFG I see no reason why OSes may not (move
>>>>>>>>> to) permit(ting) parallel accesses, with serialization perhaps done
>>>>>>>>> only at device level. See our own pci_config_lock, which applies to
>>>>>>>>> only method 1 accesses; we don't look to be serializing MMCFG
>>>>>>>>> accesses at all.
>>>>>>>>>
>>>>>>>>>> We could alternatively do a per-domain rwlock for vpci and special case
>>>>>>>>>> the MSI-X area to also have a per-device specific lock. At which point
>>>>>>>>>> it becomes fairly similar to what you propose.
>>>>>>>> @Jan, @Roger
>>>>>>>>
>>>>>>>> 1. d->vpci_lock - rwlock <- this protects vpci
>>>>>>>> 2. pdev->vpci->msix_tbl_lock - rwlock <- this protects MSI-X tables
>>>>>>>> or should it better be pdev->msix_tbl_lock as MSI-X tables don't
>>>>>>>> really depend on vPCI?
>>>>>>> If so, perhaps indeed better the latter. But as said in reply to Roger,
>>>>>>> I'm not convinced (yet) that doing away with the per-device lock is a
>>>>>>> good move. As said there - we're ourselves doing fully parallel MMCFG
>>>>>>> accesses, so OSes ought to be fine to do so, too.
>>>>>> But with pdev->vpci_lock we face ABBA...
>>>>> I think it would be easier to start with a per-domain rwlock that
>>>>> guarantees pdev->vpci cannot be removed under our feet. This would be
>>>>> taken in read mode in vpci_{read,write} and in write mode when
>>>>> removing a device from a domain.
>>>>>
>>>>> Then there are also other issues regarding vPCI locking that need to
>>>>> be fixed, but that lock would likely be a start.
>>>> Or let's see the problem at a different angle: this is the only place
>>>> which breaks the use of pdev->vpci_lock. Because all other places
>>>> do not try to acquire the lock of any two devices at a time.
>>>> So, what if we re-work the offending piece of code instead?
>>>> That way we do not break parallel access and have the lock per-device
>>>> which might also be a plus.
>>>>
>>>> By re-work I mean, that instead of reading already mapped regions
>>>> from tmp we can employ a d->pci_mapped_regions range set which
>>>> will hold all the already mapped ranges. And when it is needed to access
>>>> that range set we use pcidevs_lock which seems to be rare.
>>>> So, modify_bars will rely on pdev->vpci_lock + pcidevs_lock and
>>>> ABBA won't be possible at all.
>>> Sadly that won't replace the usage of the loop in modify_bars. This is
>>> not (exclusively) done in order to prevent mapping the same region
>>> multiple times, but rather to prevent unmapping of regions as long as
>>> there's an enabled BAR that's using it.
>>>
>>> If you wanted to use something like d->pci_mapped_regions it would
>>> have to keep reference counts to regions, in order to know when a
>>> mapping is no longer required by any BAR on the system with memory
>>> decoding enabled.
>> I missed this path, thank you
>>
>> I tried to analyze the locking in pci/vpci.
>>
>> First of all some context to refresh the target we want:
>> the rationale behind moving pdev->vpci->lock outside
>> is to be able dynamically create and destroy pdev->vpci.
>> So, for that reason lock needs to be moved outside of the pdev->vpci.
>>
>> Some of the callers of the vPCI code and locking used:
>>
>> ======================================
>> vpci_mmio_read/vpci_mmcfg_read
>> ======================================
>>     - vpci_ecam_read
>>     - vpci_read
>>      !!!!!!!! pdev is acquired, then pdev->vpci_lock is used !!!!!!!!
>>      - msix:
>>       - control_read
>>      - header:
>>       - guest_bar_read
>>      - msi:
>>       - control_read
>>       - address_read/address_hi_read
>>       - data_read
>>       - mask_read
>>
>> ======================================
>> vpci_mmio_write/vpci_mmcfg_write
>> ======================================
>>     - vpci_ecam_write
>>     - vpci_write
>>      !!!!!!!! pdev is acquired, then pdev->vpci_lock is used !!!!!!!!
>>      - msix:
>>       - control_write
>>      - header:
>>       - bar_write/guest_bar_write
>>       - cmd_write/guest_cmd_write
>>       - rom_write
>>        - all write handlers may call modify_bars
>>         modify_bars
>>      - msi:
>>       - control_write
>>       - address_write/address_hi_write
>>       - data_write
>>       - mask_write
>>
>> ======================================
>> pci_add_device: locked with pcidevs_lock
>> ======================================
>>     - vpci_add_handlers
>>      ++++++++ pdev->vpci_lock is used ++++++++
>>
>> ======================================
>> pci_remove_device: locked with pcidevs_lock
>> ======================================
>> - vpci_remove_device
>>     ++++++++ pdev->vpci_lock is used ++++++++
>> - pci_cleanup_msi
>> - free_pdev
>>
>> ======================================
>> XEN_DOMCTL_assign_device: locked with pcidevs_lock
>> ======================================
>> - assign_device
>>    - vpci_deassign_device
>>    - pdev_msix_assign
>>    - vpci_assign_device
>>     - vpci_add_handlers
>>       ++++++++ pdev->vpci_lock is used ++++++++
>>
>> ======================================
>> XEN_DOMCTL_deassign_device: locked with pcidevs_lock
>> ======================================
>> - deassign_device
>>    - vpci_deassign_device
>>      ++++++++ pdev->vpci_lock is used ++++++++
>>     - vpci_remove_device
>>
>>
>> ======================================
>> modify_bars is a special case: this is the only function which tries to lock
>> two pci_dev devices: it is done to check for overlaps with other BARs which may have been
>> already mapped or unmapped.
>>
>> So, this is the only case which may deadlock because of pci_dev->vpci_lock.
>> ======================================
>>
>> Bottom line:
>> ======================================
>>
>> 1. vpci_{read|write} are not protected with pcidevs_lock and can run in
>> parallel with pci_remove_device which can remove pdev after vpci_{read|write}
>> acquired the pdev pointer. This may lead to a fail due to pdev dereference.
>>
>> So, to protect pdev dereference vpci_{read|write} must also use pcidevs_lock.
> We would like to take the pcidevs_lock only while fetching the device
> (ie: pci_get_pdev_by_domain), afterwards it should be fine to lock the
> device using a vpci specific lock so calls to vpci_{read,write} can be
> partially concurrent across multiple domains.
This means this can't be done as a pre-req patch, but only as a part of the
patch which changes locking.
>
> In fact I think Jan had already pointed out that the pci lock would
> need taking while searching for the device in vpci_{read,write}.
I was referring to the time after we have found pdev: it is currently
possible for pdev to be freed while it is still in use after the search.
>
> It seems to me that if you implement option 3 below taking the
> per-domain rwlock in read mode in vpci_{read|write} will already
> protect you from the device being removed if the same per-domain lock
> is taken in write mode in vpci_remove_device.
Yes, it should. Again this can't be done as a pre-req patch because
this relies on pdev->vpci_lock
>
>> 2. The only offending place which is in the way of pci_dev->vpci_lock is
>> modify_bars. If it can be re-worked to track already mapped and unmapped
>> regions then we can avoid having a possible deadlock and can use
>> pci_dev->vpci_lock (rangesets won't help here as we also need refcounting be
>> implemented).
> I think a refcounting based solution will be very complex to
> implement. I'm however happy to be proven wrong.
I can't estimate that, but I have a feeling that all this juggling with
locking is caused by this single piece of code. No other place suffers from
having pdev->vpci_lock without a d->lock.
>
>> If pcidevs_lock is used for vpci_{read|write} then no deadlock is possible,
>> but modify_bars code must be re-worked not to lock itself (pdev->vpci_lock and
>> tmp->vpci_lock when pdev == tmp, this is minor).
> Taking the pcidevs lock (a global lock) is out of the picture IMO, as
> it's going to serialize all calls of vpci_{read|write}, and would
> create too much contention on the pcidevs lock.
I understand that. But if we would like to fix the existing code I see
no other alternative.
>
>> 3. We may think about a per-domain rwlock and pdev->vpci_lock, so this solves
>> modify_bars's two pdevs access. But this doesn't solve possible pdev
>> de-reference in vpci_{read|write} vs pci_remove_device.
> pci_remove_device will call vpci_remove_device, so as long as
> vpci_remove_device takes the per-domain lock in write (exclusive) mode
> it should be fine.
I think I need to see if there are any other places which similarly
require the write lock
>
>> @Roger, @Jan, I would like to hear what do you think about the above analysis
>> and how can we proceed with locking re-work?
> I think the per-domain rwlock seems like a good option. I would do
> that as a pre-patch.
It is. But it seems it won't solve the thing we started this adventure for:

With per-domain read lock and still ABBA in modify_bars (hope the below
is correctly seen with a monospace font):

cpu0: vpci_write -> d->RLock -> pdev1->lock -> rom_write -> modify_bars: tmp (pdev2)->lock
cpu1: vpci_write -> d->RLock -> pdev2->lock -> cmd_write -> modify_bars: tmp (pdev1)->lock

There is no API to upgrade a read lock to a write lock in modify_bars, which
could have helped, so in both cases vpci_write should take the write lock.
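
The interleaving above can be written down as a small check. This is only a
sketch with hypothetical names (cpu_state, abba_deadlock), modelling each CPU
by the per-device lock it already holds and the one it is waiting for. The
shared per-domain read lock is left out on purpose, since both CPUs can hold
it at the same time and it therefore does not order them.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_LOCK (-1)

/* One CPU's view: the per-device lock it holds and the one it waits for. */
struct cpu_state {
    int holds; /* index of the pdev whose vpci_lock is held */
    int wants; /* index of the pdev whose vpci_lock is requested, or NO_LOCK */
};

/*
 * ABBA: each CPU waits for exactly the lock the other one holds, so
 * neither can make progress.
 */
static bool abba_deadlock(const struct cpu_state *a,
                          const struct cpu_state *b)
{
    assert(a->holds != NO_LOCK && b->holds != NO_LOCK);

    return a->wants != NO_LOCK && b->wants != NO_LOCK &&
           a->wants == b->holds && b->wants == a->holds;
}
```

With cpu0 = { holds pdev1, wants pdev2 } and cpu1 = { holds pdev2, wants
pdev1 } the check reports a deadlock, which is the case shown in the diagram.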

Am I missing something here?
>
> Thanks, Roger.
Thank you,
Oleksandr


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 13:53                                 ` Oleksandr Andrushchenko
@ 2022-02-07 14:11                                   ` Jan Beulich
  2022-02-07 14:27                                     ` Roger Pau Monné
  2022-02-07 14:28                                     ` Oleksandr Andrushchenko
  2022-02-07 14:19                                   ` Roger Pau Monné
  1 sibling, 2 replies; 138+ messages in thread
From: Jan Beulich @ 2022-02-07 14:11 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 07.02.2022 14:53, Oleksandr Andrushchenko wrote:
> On 07.02.22 14:46, Roger Pau Monné wrote:
>> I think the per-domain rwlock seems like a good option. I would do
>> that as a pre-patch.
> It is. But it seems it won't solve the thing we started this adventure for:
> 
> With per-domain read lock and still ABBA in modify_bars (hope the below
> is correctly seen with a monospace font):
> 
> cpu0: vpci_write -> d->RLock -> pdev1->lock -> rom_write -> modify_bars: tmp (pdev2)->lock
> cpu1: vpci_write -> d->RLock -> pdev2->lock -> cmd_write -> modify_bars: tmp (pdev1)->lock
> 
> There is no API to upgrade read lock to write lock in modify_bars which could help,
> so in both cases vpci_write should take write lock.

Hmm, yes, I think you're right: It's not modify_bars() itself which needs
to acquire the write lock, but its (perhaps indirect) caller. Effectively
vpci_write() would need to take the write lock if the range written
overlaps the BARs or the command register.
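
A possible shape for that decision, purely as a sketch: the helper name
write_needs_exclusive_lock is hypothetical, the offsets are the standard
type 0 header ones, and including the expansion ROM BAR is my assumption,
based on rom_write also ending up in modify_bars.

```c
#include <assert.h>
#include <stdbool.h>

/* Standard type 0 configuration header offsets. */
#define PCI_COMMAND        0x04 /* 2 bytes */
#define PCI_BASE_ADDRESS_0 0x10 /* six 4-byte BARs: 0x10..0x27 */
#define PCI_ROM_ADDRESS    0x30 /* 4 bytes */

/* True if [a, a + a_len) and [b, b + b_len) share at least one byte. */
static bool ranges_overlap(unsigned int a, unsigned int a_len,
                           unsigned int b, unsigned int b_len)
{
    return a < b + b_len && b < a + a_len;
}

/*
 * Hypothetical helper: does a config space write of 'size' bytes at
 * 'reg' touch a register whose handler may call modify_bars()?
 */
static bool write_needs_exclusive_lock(unsigned int reg, unsigned int size)
{
    assert(size == 1 || size == 2 || size == 4);

    return ranges_overlap(reg, size, PCI_COMMAND, 2) ||
           ranges_overlap(reg, size, PCI_BASE_ADDRESS_0, 6 * 4) ||
           ranges_overlap(reg, size, PCI_ROM_ADDRESS, 4);
}
```

Note that a 4-byte write at offset 0x04 covers both the command and the
status register, so it has to be treated as a command register write.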

Jan




* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-07 12:54               ` Jan Beulich
@ 2022-02-07 14:17                 ` Oleksandr Andrushchenko
  2022-02-07 14:31                   ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 14:17 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 07.02.22 14:54, Jan Beulich wrote:
> On 07.02.2022 13:51, Oleksandr Andrushchenko wrote:
>>
>> On 07.02.22 14:38, Jan Beulich wrote:
>>> On 07.02.2022 12:27, Oleksandr Andrushchenko wrote:
>>>> On 07.02.22 09:29, Jan Beulich wrote:
>>>>> On 04.02.2022 15:37, Oleksandr Andrushchenko wrote:
>>>>>> On 04.02.22 16:30, Jan Beulich wrote:
>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>> Reset the command register when assigning a PCI device to a guest:
>>>>>>>> according to the PCI spec the PCI_COMMAND register is typically all 0's
>>>>>>>> after reset.
>>>>>>> It's not entirely clear to me whether setting the hardware register to
>>>>>>> zero is okay. What wants to be zero is the value the guest observes
>>>>>>> initially.
>>>>>> "the PCI spec says the PCI_COMMAND register is typically all 0's after reset."
>>>>>> Why wouldn't it be ok? What is the exact concern here?
>>>>> The concern is - as voiced in similar ways before, perhaps in other
>>>>> contexts - that you need to consider bit-by-bit whether overwriting
>>>>> with 0 what is currently there is okay. Xen and/or Dom0 may have put
>>>>> values there which they expect to remain unaltered. I guess
>>>>> PCI_COMMAND_SERR is a good example: While the guest's view of this
>>>>> will want to be zero initially, the host having set it to 1 may not
>>>>> easily be overwritten with 0, or else you'd effectively imply giving
>>>>> the guest control of the bit.
>>>> We have already discussed in great detail PCI_COMMAND emulation [1].
>>>> At the end you wrote [1]:
>>>> "Well, in order for the whole thing to be security supported it needs to
>>>> be explained for every bit why it is safe to allow the guest to drive it.
>>>> Until you mean vPCI to reach that state, leaving TODO notes in the code
>>>> for anything not investigated may indeed be good enough.
>>>>
>>>> Jan"
>>>>
>>>> So, this is why I left a TODO in the PCI_COMMAND emulation for now and only
>>>> care about INTx which is honored with the code in this patch.
>>> Right. The issue I see is that the description does not have any
>>> mention of this, but instead talks about simply writing zero.
>> How do you want that mentioned? Extended commit message or
>> just a link to the thread [1]?
> What I'd like you to describe is what the change does without
> fundamentally implying it'll end up being zero which gets written
> to the register. Stating as a conclusion that for the time being
> this means writing zero is certainly fine (and likely helpful if
> made explicit).
Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
to remain unaltered. PCI_COMMAND_SERR bit is a good example: while the
guest's view of this will want to be zero initially, the host having set
it to 1 may not easily be overwritten with 0, or else we'd effectively
imply giving the guest control of the bit. Thus, PCI_COMMAND register needs
proper emulation in order to honor host's settings.

There are examples of emulators [1], [2] which already deal with PCI_COMMAND
register emulation and it seems that at most they care only about the INTx
bit (besides IO/memory enable and bus master which are write through).
It could be because in order to properly emulate the PCI_COMMAND register
we need to know about the whole PCI topology, e.g. if any setting in device's
command register is aligned with the upstream port etc.
This makes me think that because of this complexity others just ignore that.
Nor do I think this can be easily done in the Xen case.

Section "6.2.2 Device Control" of "PCI LOCAL BUS SPECIFICATION, REV. 3.0"
says that the reset state of the command register is typically 0, so reset
the command register to all 0's when assigning a PCI device to a guest and,
for now, only make sure the INTx bit is set according to whether MSI/MSI-X
is enabled.

[1] https://github.com/qemu/qemu/blob/master/hw/xen/xen_pt_config_init.c#L310
[2] https://github.com/projectacrn/acrn-hypervisor/blob/master/hypervisor/hw/pci.c#L336

Will the above description be enough?

It also seems to be a good move to squash the following patches:
[PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
[PATCH v6 10/13] vpci/header: reset the command register when adding devices

as they implement a single piece of functionality now.
>> With the above done, do you think that writing 0's is an acceptable
>> approach as of now?
> Well, yes, provided we have a sufficiently similar understanding
> of what "acceptable" here means.
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 13:53                                 ` Oleksandr Andrushchenko
  2022-02-07 14:11                                   ` Jan Beulich
@ 2022-02-07 14:19                                   ` Roger Pau Monné
  2022-02-07 14:27                                     ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-07 14:19 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel

On Mon, Feb 07, 2022 at 01:53:34PM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 07.02.22 14:46, Roger Pau Monné wrote:
> > On Mon, Feb 07, 2022 at 11:08:39AM +0000, Oleksandr Andrushchenko wrote:
> >> ======================================
> >>
> >> Bottom line:
> >> ======================================
> >>
> >> 1. vpci_{read|write} are not protected with pcidevs_lock and can run in
> >> parallel with pci_remove_device which can remove pdev after vpci_{read|write}
> >> acquired the pdev pointer. This may lead to a failure due to pdev dereference.
> >>
> >> So, to protect pdev dereference vpci_{read|write} must also use pcidevs_lock.
> > We would like to take the pcidevs_lock only while fetching the device
> > (ie: pci_get_pdev_by_domain), afterwards it should be fine to lock the
> > device using a vpci specific lock so calls to vpci_{read,write} can be
> > partially concurrent across multiple domains.
> This means this can't be done as a pre-req patch, but only as a part of the
> patch which changes locking.
> >
> > In fact I think Jan had already pointed out that the pci lock would
> > need taking while searching for the device in vpci_{read,write}.
> I was referring to the time after we found pdev and it is currently
> possible to free pdev while using it after the search
> >
> > It seems to me that if you implement option 3 below taking the
> > per-domain rwlock in read mode in vpci_{read|write} will already
> > protect you from the device being removed if the same per-domain lock
> > is taken in write mode in vpci_remove_device.
> Yes, it should. Again this can't be done as a pre-req patch because
> this relies on pdev->vpci_lock

Hm, no, I don't think so. You could introduce this per-domain rwlock
in a prepatch, and then move the vpci lock outside of the vpci struct.
I see no problem with that.

> >
> >> 2. The only offending place which is in the way of pci_dev->vpci_lock is
> >> modify_bars. If it can be re-worked to track already mapped and unmapped
> >> regions then we can avoid having a possible deadlock and can use
> >> pci_dev->vpci_lock (rangesets won't help here as we also need refcounting be
> >> implemented).
> > I think a refcounting based solution will be very complex to
> > implement. I'm however happy to be proven wrong.
> I can't estimate, but I have a feeling that all these plays around locking
> is just because of this single piece of code. No other place suffer from
> pdev->vpci_lock and no d->lock
> >
> >> If pcidevs_lock is used for vpci_{read|write} then no deadlock is possible,
> >> but modify_bars code must be re-worked not to lock itself (pdev->vpci_lock and
> >> tmp->vpci_lock when pdev == tmp, this is minor).
> > Taking the pcidevs lock (a global lock) is out of the picture IMO, as
> > it's going to serialize all calls of vpci_{read|write}, and would
> > create too much contention on the pcidevs lock.
> I understand that. But if we would like to fix the existing code I see
> no other alternative.
> >
> >> 3. We may think about a per-domain rwlock and pdev->vpci_lock, so this solves
> >> modify_bars's two pdevs access. But this doesn't solve possible pdev
> >> de-reference in vpci_{read|write} vs pci_remove_device.
> > pci_remove_device will call vpci_remove_device, so as long as
> > vpci_remove_device takes the per-domain lock in write (exclusive) mode
> > it should be fine.
> I think I need to see if there are any other places which similarly
> require the write lock
> >
> >> @Roger, @Jan, I would like to hear what do you think about the above analysis
> >> and how can we proceed with locking re-work?
> > I think the per-domain rwlock seems like a good option. I would do
> > that as a pre-patch.
> It is. But it seems it won't solve the thing we started this adventure for:
> 
> With per-domain read lock and still ABBA in modify_bars (hope the below
> is correctly seen with a monospace font):
> 
> cpu0: vpci_write-> d->RLock -> pdev1->lock ->                                                  rom_write -> modify_bars: tmp (pdev2) ->lock
> cpu1:        vpci_write-> d->RLock pdev2->lock -> cmd_write -> modify_bars: tmp (pdev1) ->lock
> 
> There is no API to upgrade read lock to write lock in modify_bars which could help,
> so in both cases vpci_write should take write lock.

I've thought more than once that it would be nice to have a
write_{upgrade,downgrade} (read_downgrade maybe?) or similar helper.

I think you could also drop the read lock, take the write lock and
check that &pdev->vpci->header == header in order to be sure
pdev->vpci hasn't been recreated. You would have to do similar in
order to get back again from a write lock into a read one.

We should avoid taking the rwlock in write mode in vpci_write
unconditionally.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 14:11                                   ` Jan Beulich
@ 2022-02-07 14:27                                     ` Roger Pau Monné
  2022-02-07 14:33                                       ` Jan Beulich
  2022-02-07 14:35                                       ` Oleksandr Andrushchenko
  2022-02-07 14:28                                     ` Oleksandr Andrushchenko
  1 sibling, 2 replies; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-07 14:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, julien, sstabellini,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel

On Mon, Feb 07, 2022 at 03:11:03PM +0100, Jan Beulich wrote:
> On 07.02.2022 14:53, Oleksandr Andrushchenko wrote:
> > On 07.02.22 14:46, Roger Pau Monné wrote:
> >> I think the per-domain rwlock seems like a good option. I would do
> >> that as a pre-patch.
> > It is. But it seems it won't solve the thing we started this adventure for:
> > 
> > With per-domain read lock and still ABBA in modify_bars (hope the below
> > is correctly seen with a monospace font):
> > 
> > cpu0: vpci_write-> d->RLock -> pdev1->lock ->                                                  rom_write -> modify_bars: tmp (pdev2) ->lock
> > cpu1:        vpci_write-> d->RLock pdev2->lock -> cmd_write -> modify_bars: tmp (pdev1) ->lock
> > 
> > There is no API to upgrade read lock to write lock in modify_bars which could help,
> > so in both cases vpci_write should take write lock.
> 
> Hmm, yes, I think you're right: It's not modify_bars() itself which needs
> to acquire the write lock, but its (perhaps indirect) caller. Effectively
> vpci_write() would need to take the write lock if the range written
> overlaps the BARs or the command register.

I'm confused. If we use a per-domain rwlock approach there would be no
need to lock tmp again in modify_bars, because we should hold the
rwlock in write mode, so there's no ABBA?

We will have however to drop the per domain read and vpci locks and
pick the per-domain lock in write mode.
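
For comparison, the caller-side alternative quoted above (vpci_write()
taking the write lock only when the access overlaps the BARs or the
command register) could be decided with a simple range check. This is
only a sketch: write_needs_exclusive_lock() is a hypothetical helper, the
offsets are the standard type 0 header ones, and the expansion ROM BAR is
included on the assumption that rom_write also ends up in modify_bars, as
the diagram suggests.

```c
#include <assert.h>
#include <stdbool.h>

/* Standard type 0 PCI configuration header offsets. */
#define PCI_COMMAND         0x04  /* 16-bit command register */
#define PCI_BASE_ADDRESS_0  0x10  /* first of six 32-bit BARs */
#define PCI_BASE_ADDRESS_5  0x24  /* last of the six BARs */
#define PCI_ROM_ADDRESS     0x30  /* 32-bit expansion ROM BAR */

/* True if [a, a + a_len) and [b, b + b_len) share at least one byte. */
static bool ranges_overlap(unsigned int a, unsigned int a_len,
                           unsigned int b, unsigned int b_len)
{
    return a < b + b_len && b < a + a_len;
}

/*
 * Hypothetical helper: does a config space write of 'size' bytes at
 * offset 'reg' touch a register whose handler may reach modify_bars?
 */
static bool write_needs_exclusive_lock(unsigned int reg, unsigned int size)
{
    assert(size > 0);

    return ranges_overlap(reg, size, PCI_COMMAND, 2) ||
           ranges_overlap(reg, size, PCI_BASE_ADDRESS_0,
                          PCI_BASE_ADDRESS_5 + 4 - PCI_BASE_ADDRESS_0) ||
           ranges_overlap(reg, size, PCI_ROM_ADDRESS, 4);
}
```

All other writes could then stay on the read lock, so the exclusive path
is limited to accesses that can actually change the BAR mappings.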

Thanks, Roger.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 14:19                                   ` Roger Pau Monné
@ 2022-02-07 14:27                                     ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 14:27 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel



On 07.02.22 16:19, Roger Pau Monné wrote:
> On Mon, Feb 07, 2022 at 01:53:34PM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 07.02.22 14:46, Roger Pau Monné wrote:
>>> On Mon, Feb 07, 2022 at 11:08:39AM +0000, Oleksandr Andrushchenko wrote:
>>>> ======================================
>>>>
>>>> Bottom line:
>>>> ======================================
>>>>
>>>> 1. vpci_{read|write} are not protected with pcidevs_lock and can run in
>>>> parallel with pci_remove_device which can remove pdev after vpci_{read|write}
>>>> acquired the pdev pointer. This may lead to a failure due to pdev dereference.
>>>>
>>>> So, to protect pdev dereference vpci_{read|write} must also use pcidevs_lock.
>>> We would like to take the pcidevs_lock only while fetching the device
>>> (ie: pci_get_pdev_by_domain), afterwards it should be fine to lock the
>>> device using a vpci specific lock so calls to vpci_{read,write} can be
>>> partially concurrent across multiple domains.
>> This means this can't be done as a pre-req patch, but only as a part of the
>> patch which changes locking.
>>> In fact I think Jan had already pointed out that the pci lock would
>>> need taking while searching for the device in vpci_{read,write}.
>> I was referring to the time after we found pdev and it is currently
>> possible to free pdev while using it after the search
>>> It seems to me that if you implement option 3 below taking the
>>> per-domain rwlock in read mode in vpci_{read|write} will already
>>> protect you from the device being removed if the same per-domain lock
>>> is taken in write mode in vpci_remove_device.
>> Yes, it should. Again this can't be done as a pre-req patch because
>> this relies on pdev->vpci_lock
> Hm, no, I don't think so. You could introduce this per-domain rwlock
> in a prepatch, and then move the vpci lock outside of the vpci struct.
> I see no problem with that.
>
>>>> 2. The only offending place which is in the way of pci_dev->vpci_lock is
>>>> modify_bars. If it can be re-worked to track already mapped and unmapped
>>>> regions then we can avoid having a possible deadlock and can use
>>>> pci_dev->vpci_lock (rangesets won't help here as we also need refcounting be
>>>> implemented).
>>> I think a refcounting based solution will be very complex to
>>> implement. I'm however happy to be proven wrong.
>> I can't estimate, but I have a feeling that all these plays around locking
>> is just because of this single piece of code. No other place suffer from
>> pdev->vpci_lock and no d->lock
>>>> If pcidevs_lock is used for vpci_{read|write} then no deadlock is possible,
>>>> but modify_bars code must be re-worked not to lock itself (pdev->vpci_lock and
>>>> tmp->vpci_lock when pdev == tmp, this is minor).
>>> Taking the pcidevs lock (a global lock) is out of the picture IMO, as
>>> it's going to serialize all calls of vpci_{read|write}, and would
>>> create too much contention on the pcidevs lock.
>> I understand that. But if we would like to fix the existing code I see
>> no other alternative.
>>>> 3. We may think about a per-domain rwlock and pdev->vpci_lock, so this solves
>>>> modify_bars's two pdevs access. But this doesn't solve possible pdev
>>>> de-reference in vpci_{read|write} vs pci_remove_device.
>>> pci_remove_device will call vpci_remove_device, so as long as
>>> vpci_remove_device takes the per-domain lock in write (exclusive) mode
>>> it should be fine.
>> I think I need to see if there are any other places which similarly
>> require the write lock
>>>> @Roger, @Jan, I would like to hear what do you think about the above analysis
>>>> and how can we proceed with locking re-work?
>>> I think the per-domain rwlock seems like a good option. I would do
>>> that as a pre-patch.
>> It is. But it seems it won't solve the thing we started this adventure for:
>>
>> With per-domain read lock and still ABBA in modify_bars (hope the below
>> is correctly seen with a monospace font):
>>
>> cpu0: vpci_write-> d->RLock -> pdev1->lock ->                                                  rom_write -> modify_bars: tmp (pdev2) ->lock
>> cpu1:        vpci_write-> d->RLock pdev2->lock -> cmd_write -> modify_bars: tmp (pdev1) ->lock
>>
>> There is no API to upgrade read lock to write lock in modify_bars which could help,
>> so in both cases vpci_write should take write lock.
> I've thought more than once that it would be nice to have a
> write_{upgrade,downgrade} (read_downgrade maybe?) or similar helper.
Yes, this is the real use-case for that
>
> I think you could also drop the read lock, take the write lock and
> check that &pdev->vpci->header == header in order to be sure
> pdev->vpci hasn't been recreated.
And have pdev freed in between....
>   You would have to do similar in
> order to get back again from a write lock into a read one.
Not sure this is reliable.
>
> We should avoid taking the rwlock in write mode in vpci_write
> unconditionally.
Yes, but without upgrading the read lock I see no way it can be done
>
> Thanks, Roger.
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 14:11                                   ` Jan Beulich
  2022-02-07 14:27                                     ` Roger Pau Monné
@ 2022-02-07 14:28                                     ` Oleksandr Andrushchenko
  1 sibling, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 14:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné



On 07.02.22 16:11, Jan Beulich wrote:
> On 07.02.2022 14:53, Oleksandr Andrushchenko wrote:
>> On 07.02.22 14:46, Roger Pau Monné wrote:
>>> I think the per-domain rwlock seems like a good option. I would do
>>> that as a pre-patch.
>> It is. But it seems it won't solve the thing we started this adventure for:
>>
>> With per-domain read lock and still ABBA in modify_bars (hope the below
>> is correctly seen with a monospace font):
>>
>> cpu0: vpci_write-> d->RLock -> pdev1->lock ->                                                  rom_write -> modify_bars: tmp (pdev2) ->lock
>> cpu1:        vpci_write-> d->RLock pdev2->lock -> cmd_write -> modify_bars: tmp (pdev1) ->lock
>>
>> There is no API to upgrade read lock to write lock in modify_bars which could help,
>> so in both cases vpci_write should take write lock.
> Hmm, yes, I think you're right: It's not modify_bars() itself which needs
> to acquire the write lock, but its (perhaps indirect) caller. Effectively
> vpci_write() would need to take the write lock if the range written
> overlaps the BARs or the command register.
Exactly, vpci_write needs a write lock, but it is not desirable.
And again, there is a single offending piece of code which wants that...
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-07 14:17                 ` Oleksandr Andrushchenko
@ 2022-02-07 14:31                   ` Jan Beulich
  2022-02-07 14:46                     ` Oleksandr Andrushchenko
                                       ` (2 more replies)
  0 siblings, 3 replies; 138+ messages in thread
From: Jan Beulich @ 2022-02-07 14:31 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 07.02.2022 15:17, Oleksandr Andrushchenko wrote:
> 
> 
> On 07.02.22 14:54, Jan Beulich wrote:
>> On 07.02.2022 13:51, Oleksandr Andrushchenko wrote:
>>>
>>> On 07.02.22 14:38, Jan Beulich wrote:
>>>> On 07.02.2022 12:27, Oleksandr Andrushchenko wrote:
>>>>> On 07.02.22 09:29, Jan Beulich wrote:
>>>>>> On 04.02.2022 15:37, Oleksandr Andrushchenko wrote:
>>>>>>> On 04.02.22 16:30, Jan Beulich wrote:
>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>> Reset the command register when assigning a PCI device to a guest:
>>>>>>>>> according to the PCI spec the PCI_COMMAND register is typically all 0's
>>>>>>>>> after reset.
>>>>>>>> It's not entirely clear to me whether setting the hardware register to
>>>>>>>> zero is okay. What wants to be zero is the value the guest observes
>>>>>>>> initially.
>>>>>>> "the PCI spec says the PCI_COMMAND register is typically all 0's after reset."
>>>>>>> Why wouldn't it be ok? What is the exact concern here?
>>>>>> The concern is - as voiced in similar ways before, perhaps in other
>>>>>> contexts - that you need to consider bit-by-bit whether overwriting
>>>>>> with 0 what is currently there is okay. Xen and/or Dom0 may have put
>>>>>> values there which they expect to remain unaltered. I guess
>>>>>> PCI_COMMAND_SERR is a good example: While the guest's view of this
>>>>>> will want to be zero initially, the host having set it to 1 may not
>>>>>> easily be overwritten with 0, or else you'd effectively imply giving
>>>>>> the guest control of the bit.
>>>>> We have already discussed in great detail PCI_COMMAND emulation [1].
>>>>> At the end you wrote [1]:
>>>>> "Well, in order for the whole thing to be security supported it needs to
>>>>> be explained for every bit why it is safe to allow the guest to drive it.
>>>>> Until you mean vPCI to reach that state, leaving TODO notes in the code
>>>>> for anything not investigated may indeed be good enough.
>>>>>
>>>>> Jan"
>>>>>
>>>>> So, this is why I left a TODO in the PCI_COMMAND emulation for now and only
>>>>> care about INTx which is honored with the code in this patch.
>>>> Right. The issue I see is that the description does not have any
>>>> mention of this, but instead talks about simply writing zero.
>>> How do you want that mentioned? Extended commit message or
>>> just a link to the thread [1]?
>> What I'd like you to describe is what the change does without
>> fundamentally implying it'll end up being zero which gets written
>> to the register. Stating as a conclusion that for the time being
>> this means writing zero is certainly fine (and likely helpful if
>> made explicit).
> Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
> to remain unaltered. PCI_COMMAND_SERR bit is a good example: while the
> guest's view of this will want to be zero initially, the host having set
> it to 1 may not easily be overwritten with 0, or else we'd effectively
> imply giving the guest control of the bit. Thus, PCI_COMMAND register needs
> proper emulation in order to honor host's settings.
> 
> There are examples of emulators [1], [2] which already deal with PCI_COMMAND
> register emulation and it seems that at most they care only about the INTx
> bit (besides IO/memory enable and bus master which are write through).
> It could be because in order to properly emulate the PCI_COMMAND register
> we need to know about the whole PCI topology, e.g. if any setting in device's
> command register is aligned with the upstream port etc.
> This makes me think that because of this complexity others just ignore that.
> Nor do I think this can be easily done in the Xen case.
> 
> Section "6.2.2 Device Control" of "PCI LOCAL BUS SPECIFICATION, REV. 3.0"
> says that the reset state of the command register is typically 0, so reset
> the command register to all 0's when assigning a PCI device to a guest and,
> for now, only make sure the INTx bit is set according to whether MSI/MSI-X
> is enabled.

"... is typically 0, so when assigning a PCI device reset the guest view of
 the command register to all 0's. For now our emulation only makes sure INTx
 is set according to host requirements, i.e. depending on MSI/MSI-X enabled
 state."

Maybe? (Obviously a fresh device given to a guest will have MSI/MSI-X 
disabled, so I'm not sure that aspect really needs mentioning.)

But: What's still missing here then is the separation of guest and host
views. When we set INTx behind the guest's back, it shouldn't observe the
bit set. Or is this meant to be another (big) TODO?
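
To illustrate the kind of separation meant here, a rough sketch that keeps
the guest's view of PCI_COMMAND in a shadow value and derives the hardware
value from it. The names (struct cmd_state, guest_cmd_*) are made up for
the example and are not the vPCI code. Which bits a guest may drive is an
assumption taken from this discussion (IO/memory enable, bus master and
INTx disable); everything else is treated as host-owned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND_IO            0x0001
#define PCI_COMMAND_MEMORY        0x0002
#define PCI_COMMAND_MASTER        0x0004
#define PCI_COMMAND_INTX_DISABLE  0x0400

/* Bits the guest may drive in this sketch; all others stay host-owned. */
#define GUEST_CMD_MASK (PCI_COMMAND_IO | PCI_COMMAND_MEMORY | \
                        PCI_COMMAND_MASTER | PCI_COMMAND_INTX_DISABLE)

struct cmd_state {
    uint16_t guest_cmd;  /* value the guest observes */
    uint16_t hw_cmd;     /* stand-in for the physical register */
    bool msi_enabled;    /* MSI or MSI-X currently enabled */
};

/* Recompute the hardware value from the guest view and host requirements. */
static void cmd_sync_hw(struct cmd_state *s)
{
    uint16_t hw = (s->hw_cmd & ~GUEST_CMD_MASK) | s->guest_cmd;

    /* Host requirement: INTx must stay disabled while MSI/MSI-X is on. */
    if (s->msi_enabled)
        hw |= PCI_COMMAND_INTX_DISABLE;

    s->hw_cmd = hw;
}

/* Device assignment: the guest view starts out as all 0's. */
static void guest_cmd_reset(struct cmd_state *s)
{
    assert(s != NULL);
    s->guest_cmd = 0;
    cmd_sync_hw(s);
}

static void guest_cmd_write(struct cmd_state *s, uint16_t val)
{
    assert(s != NULL);
    s->guest_cmd = val & GUEST_CMD_MASK;
    cmd_sync_hw(s);
}

/* The guest only ever reads back its own view, never the forced bits. */
static uint16_t guest_cmd_read(const struct cmd_state *s)
{
    assert(s != NULL);
    return s->guest_cmd;
}
```

With this split a bit like SERR set by the host survives assignment, and
INTx disable forced because of MSI/MSI-X is not visible to the guest.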

Jan



^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 14:27                                     ` Roger Pau Monné
@ 2022-02-07 14:33                                       ` Jan Beulich
  2022-02-07 14:35                                       ` Oleksandr Andrushchenko
  1 sibling, 0 replies; 138+ messages in thread
From: Jan Beulich @ 2022-02-07 14:33 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Oleksandr Andrushchenko, julien, sstabellini,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel

On 07.02.2022 15:27, Roger Pau Monné wrote:
> On Mon, Feb 07, 2022 at 03:11:03PM +0100, Jan Beulich wrote:
>> On 07.02.2022 14:53, Oleksandr Andrushchenko wrote:
>>> On 07.02.22 14:46, Roger Pau Monné wrote:
>>>> I think the per-domain rwlock seems like a good option. I would do
>>>> that as a pre-patch.
>>> It is. But it seems it won't solve the thing we started this adventure for:
>>>
>>> With per-domain read lock and still ABBA in modify_bars (hope the below
>>> is correctly seen with a monospace font):
>>>
>>> cpu0: vpci_write-> d->RLock -> pdev1->lock ->                                                  rom_write -> modify_bars: tmp (pdev2) ->lock
>>> cpu1:        vpci_write-> d->RLock pdev2->lock -> cmd_write -> modify_bars: tmp (pdev1) ->lock
>>>
>>> There is no API to upgrade read lock to write lock in modify_bars which could help,
>>> so in both cases vpci_write should take write lock.
>>
>> Hmm, yes, I think you're right: It's not modify_bars() itself which needs
>> to acquire the write lock, but its (perhaps indirect) caller. Effectively
>> vpci_write() would need to take the write lock if the range written
>> overlaps the BARs or the command register.
> 
> I'm confused. If we use a per-domain rwlock approach there would be no
> need to lock tmp again in modify_bars, because we should hold the
> rwlock in write mode, so there's no ABBA?
> 
> We will have however to drop the per domain read and vpci locks and
> pick the per-domain lock in write mode.

Well, yes, with intermediate dropping of the lock, acquiring it in write mode
can be done in modify_bars(). I'm not convinced (yet) that such intermediate
dropping is actually going to be okay.

Jan



^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 14:27                                     ` Roger Pau Monné
  2022-02-07 14:33                                       ` Jan Beulich
@ 2022-02-07 14:35                                       ` Oleksandr Andrushchenko
  2022-02-07 15:11                                         ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 14:35 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 07.02.22 16:27, Roger Pau Monné wrote:
> On Mon, Feb 07, 2022 at 03:11:03PM +0100, Jan Beulich wrote:
>> On 07.02.2022 14:53, Oleksandr Andrushchenko wrote:
>>> On 07.02.22 14:46, Roger Pau Monné wrote:
>>>> I think the per-domain rwlock seems like a good option. I would do
>>>> that as a pre-patch.
>>> It is. But it seems it won't solve the thing we started this adventure for:
>>>
>>> With per-domain read lock and still ABBA in modify_bars (hope the below
>>> is correctly seen with a monospace font):
>>>
>>> cpu0: vpci_write-> d->RLock -> pdev1->lock ->                                                  rom_write -> modify_bars: tmp (pdev2) ->lock
>>> cpu1:        vpci_write-> d->RLock pdev2->lock -> cmd_write -> modify_bars: tmp (pdev1) ->lock
>>>
>>> There is no API to upgrade read lock to write lock in modify_bars which could help,
>>> so in both cases vpci_write should take write lock.
>> Hmm, yes, I think you're right: It's not modify_bars() itself which needs
>> to acquire the write lock, but its (perhaps indirect) caller. Effectively
>> vpci_write() would need to take the write lock if the range written
>> overlaps the BARs or the command register.
> I'm confused. If we use a per-domain rwlock approach there would be no
> need to lock tmp again in modify_bars, because we should hold the
> rwlock in write mode, so there's no ABBA?
this is only possible with what you wrote below:
>
> We will have however to drop the per domain read and vpci locks and
> pick the per-domain lock in write mode.
I think this is going to be unreliable. We need a reliable way to
upgrade read lock to write lock.
Then, we can drop pdev->vpci_lock altogether, because we are always
protected with d->rwlock and those who want to free pdev->vpci
will use write lock.

So, a per-domain rwlock with write upgrade implemented, minus
pdev->vpci_lock, should do the trick.
> Thanks, Roger.
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-07 14:31                   ` Jan Beulich
@ 2022-02-07 14:46                     ` Oleksandr Andrushchenko
  2022-02-07 15:05                       ` Jan Beulich
  2022-02-10 12:54                     ` Oleksandr Andrushchenko
  2022-02-10 12:59                     ` Oleksandr Andrushchenko
  2 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 14:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel



On 07.02.22 16:31, Jan Beulich wrote:
> On 07.02.2022 15:17, Oleksandr Andrushchenko wrote:
>>
>> On 07.02.22 14:54, Jan Beulich wrote:
>>> On 07.02.2022 13:51, Oleksandr Andrushchenko wrote:
>>>> On 07.02.22 14:38, Jan Beulich wrote:
>>>>> On 07.02.2022 12:27, Oleksandr Andrushchenko wrote:
>>>>>> On 07.02.22 09:29, Jan Beulich wrote:
>>>>>>> On 04.02.2022 15:37, Oleksandr Andrushchenko wrote:
>>>>>>>> On 04.02.22 16:30, Jan Beulich wrote:
>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>>> Reset the command register when assigning a PCI device to a guest:
>>>>>>>>>> according to the PCI spec the PCI_COMMAND register is typically all 0's
>>>>>>>>>> after reset.
>>>>>>>>> It's not entirely clear to me whether setting the hardware register to
>>>>>>>>> zero is okay. What wants to be zero is the value the guest observes
>>>>>>>>> initially.
>>>>>>>> "the PCI spec says the PCI_COMMAND register is typically all 0's after reset."
>>>>>>>> Why wouldn't it be ok? What is the exact concern here?
>>>>>>> The concern is - as voiced in similar ways before, perhaps in other
>>>>>>> contexts - that you need to consider bit-by-bit whether overwriting
>>>>>>> with 0 what is currently there is okay. Xen and/or Dom0 may have put
>>>>>>> values there which they expect to remain unaltered. I guess
>>>>>>> PCI_COMMAND_SERR is a good example: While the guest's view of this
>>>>>>> will want to be zero initially, the host having set it to 1 may not
>>>>>>> easily be overwritten with 0, or else you'd effectively imply giving
>>>>>>> the guest control of the bit.
>>>>>> We have already discussed in great detail PCI_COMMAND emulation [1].
>>>>>> At the end you wrote [1]:
>>>>>> "Well, in order for the whole thing to be security supported it needs to
>>>>>> be explained for every bit why it is safe to allow the guest to drive it.
>>>>>> Until you mean vPCI to reach that state, leaving TODO notes in the code
>>>>>> for anything not investigated may indeed be good enough.
>>>>>>
>>>>>> Jan"
>>>>>>
>>>>>> So, this is why I left a TODO in the PCI_COMMAND emulation for now and only
>>>>>> care about INTx which is honored with the code in this patch.
>>>>> Right. The issue I see is that the description does not have any
>>>>> mention of this, but instead talks about simply writing zero.
>>>> How do you want that mentioned? Extended commit message or
>>>> just a link to the thread [1]?
>>> What I'd like you to describe is what the change does without
>>> fundamentally implying it'll end up being zero which gets written
>>> to the register. Stating as a conclusion that for the time being
>>> this means writing zero is certainly fine (and likely helpful if
>>> made explicit).
>> Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
>> to remain unaltered. PCI_COMMAND_SERR bit is a good example: while the
>> guest's view of this will want to be zero initially, the host having set
>> it to 1 may not easily be overwritten with 0, or else we'd effectively
>> imply giving the guest control of the bit. Thus, PCI_COMMAND register needs
>> proper emulation in order to honor host's settings.
>>
>> There are examples of emulators [1], [2] which already deal with PCI_COMMAND
>> register emulation and it seems that at most they care about only the INTx
>> bit (besides IO/memory enable and bus master which are write through).
>> It could be because in order to properly emulate the PCI_COMMAND register
>> we need to know about the whole PCI topology, e.g. if any setting in device's
>> command register is aligned with the upstream port etc.
>> This makes me think that because of this complexity others just ignore that.
>> Nor do I think this can be easily done in the Xen case.
>>
>> According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
>> Device Control", the reset state of the command register is
>> typically 0, so reset the command register to all 0's when assigning
>> a PCI device to a guest and for now only make sure the INTx bit is set
>> according to whether MSI/MSI-X is enabled.
> "... is typically 0, so when assigning a PCI device reset the guest view of
>   the command register to all 0's. For now our emulation only makes sure INTx
>   is set according to host requirements, i.e. depending on MSI/MSI-X enabled
>   state."
This sounds good, I will use it. Thank you
>
> Maybe? (Obviously a fresh device given to a guest will have MSI/MSI-X
> disabled, so I'm not sure that aspect really needs mentioning.)
>
> But: What's still missing here then is the separation of guest and host
> views. When we set INTx behind the guest's back, it shouldn't observe the
> bit set. Or is this meant to be another (big) TODO?
But, patch [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
already takes care of it, I mean that it will set/reset INTx for the guest
according to MSI/MSI-X. So, if we squash these two patches the whole
picture will be seen at once.
>
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-07 14:46                     ` Oleksandr Andrushchenko
@ 2022-02-07 15:05                       ` Jan Beulich
  2022-02-07 15:14                         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-07 15:05 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 07.02.2022 15:46, Oleksandr Andrushchenko wrote:
> On 07.02.22 16:31, Jan Beulich wrote:
>> But: What's still missing here then is the separation of guest and host
>> views. When we set INTx behind the guest's back, it shouldn't observe the
>> bit set. Or is this meant to be another (big) TODO?
> But, patch [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
> already takes care of it, I mean that it will set/reset INTx for the guest
> according to MSI/MSI-X. So, if we squash these two patches the whole
> picture will be seen at once.

Does it? I did get the impression that the guest would be able to observe
the bit set even after writing zero to it (while a reason exists that Xen
wants the bit set).
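
The separation of guest and host views asked about here could look roughly
like the sketch below. It is only an illustration under stated assumptions:
struct cmd_state and the two helpers are hypothetical names, not the vPCI
code from this series, and all bits other than INTx are simply passed
through (which the thread treats as an open TODO).

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x400 /* bit 10 of the command register */

/* Hypothetical per-device state; not the actual struct vpci layout. */
struct cmd_state {
    uint16_t guest_cmd;   /* what the guest last wrote, returned on reads */
    bool msi_enabled;     /* MSI or MSI-X currently enabled for the device */
};

/* Returns the value to program into hardware for a guest write of 'val'. */
static uint16_t cmd_guest_write(struct cmd_state *s, uint16_t val)
{
    uint16_t hw = val;

    /* Remember the guest's own view, untouched by host requirements. */
    s->guest_cmd = val;

    /* Host requirement: keep INTx disabled while MSI/MSI-X is in use. */
    if ( s->msi_enabled )
        hw |= PCI_COMMAND_INTX_DISABLE;

    return hw;
}

/* The guest reads back its own view, not the hardware value. */
static uint16_t cmd_guest_read(const struct cmd_state *s)
{
    return s->guest_cmd;
}
```

With this split, a guest that writes zero while MSI is enabled reads zero
back, even though the hardware register keeps the INTx disable bit set.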

Jan



^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 14:35                                       ` Oleksandr Andrushchenko
@ 2022-02-07 15:11                                         ` Oleksandr Andrushchenko
  2022-02-07 15:26                                           ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 15:11 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 07.02.22 16:35, Oleksandr Andrushchenko wrote:
>
> On 07.02.22 16:27, Roger Pau Monné wrote:
>> On Mon, Feb 07, 2022 at 03:11:03PM +0100, Jan Beulich wrote:
>>> On 07.02.2022 14:53, Oleksandr Andrushchenko wrote:
>>>> On 07.02.22 14:46, Roger Pau Monné wrote:
>>>>> I think the per-domain rwlock seems like a good option. I would do
>>>>> that as a pre-patch.
>>>> It is. But it seems it won't solve the thing we started this adventure for:
>>>>
>>>> With per-domain read lock and still ABBA in modify_bars (hope the below
>>>> is correctly seen with a monospace font):
>>>>
>>>> cpu0: vpci_write-> d->RLock -> pdev1->lock ->                                                  rom_write -> modify_bars: tmp (pdev2) ->lock
>>>> cpu1:        vpci_write-> d->RLock pdev2->lock -> cmd_write -> modify_bars: tmp (pdev1) ->lock
>>>>
>>>> There is no API to upgrade read lock to write lock in modify_bars which could help,
>>>> so in both cases vpci_write should take write lock.
>>> Hmm, yes, I think you're right: It's not modify_bars() itself which needs
>>> to acquire the write lock, but its (perhaps indirect) caller. Effectively
>>> vpci_write() would need to take the write lock if the range written
>>> overlaps the BARs or the command register.
>> I'm confused. If we use a per-domain rwlock approach there would be no
>> need to lock tmp again in modify_bars, because we should hold the
>> rwlock in write mode, so there's no ABBA?
> this is only possible with what you wrote below:
>> We will have however to drop the per domain read and vpci locks and
>> pick the per-domain lock in write mode.
> I think this is going to be unreliable. We need a reliable way to
> upgrade read lock to write lock.
> Then, we can drop pdev->vpci_lock at all, because we are always
> protected with d->rwlock and those who want to free pdev->vpci
> will use write lock.
>
> So, per-domain rwlock with write upgrade implemented minus pdev->vpci
> should do the trick
Linux doesn't implement write upgrade and it seems for a reason [1]:
"Also, you cannot “upgrade” a read-lock to a write-lock, so if you at _any_ time
need to do any changes (even if you don’t do it every time), you have to get
the write-lock at the very beginning."

So, I am not sure we can have the same for Xen...

At the moment I see at least two possible ways to solve the issue:
1. Make vpci_write use write lock, thus make all write accesses synchronized
for the given domain, reads are fully parallel

2. Re-implement pdev/tmp overlapping detection with something which won't
require pdev->vpci_lock/tmp->vpci_lock

3. Drop read and acquire write lock in modify_bars... but this is not reliable
and will hide a free(pdev->vpci) bug

@Roger, @Jan: Any other suggestions?

Thank you,
Oleksandr

[1] https://www.kernel.org/doc/html/latest/locking/spinlocks.html#lesson-2-reader-writer-spinlocks

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-07 15:05                       ` Jan Beulich
@ 2022-02-07 15:14                         ` Oleksandr Andrushchenko
  2022-02-07 15:28                           ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 15:14 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 07.02.22 17:05, Jan Beulich wrote:
> On 07.02.2022 15:46, Oleksandr Andrushchenko wrote:
>> On 07.02.22 16:31, Jan Beulich wrote:
>>> But: What's still missing here then is the separation of guest and host
>>> views. When we set INTx behind the guest's back, it shouldn't observe the
>>> bit set. Or is this meant to be another (big) TODO?
>> But, patch [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
>> already takes care of it, I mean that it will set/reset INTx for the guest
>> according to MSI/MSI-X. So, if we squash these two patches the whole
>> picture will be seen at once.
> Does it? I did get the impression that the guest would be able to observe
> the bit set even after writing zero to it (while a reason exists that Xen
> wants the bit set).
Yes, you are correct: the guest might not see what it wanted to set.
I meant that Xen won't allow resetting INTx if it is not possible
due to MSI/MSI-X.

Anyways, I think squashing will be a good idea to have the relevant
functionality in a single change set. Will this work for you?
> Jan
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 15:11                                         ` Oleksandr Andrushchenko
@ 2022-02-07 15:26                                           ` Jan Beulich
  2022-02-07 16:07                                             ` Oleksandr Andrushchenko
  2022-02-07 16:08                                             ` Roger Pau Monné
  0 siblings, 2 replies; 138+ messages in thread
From: Jan Beulich @ 2022-02-07 15:26 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 07.02.2022 16:11, Oleksandr Andrushchenko wrote:
> 
> 
> On 07.02.22 16:35, Oleksandr Andrushchenko wrote:
>>
>> On 07.02.22 16:27, Roger Pau Monné wrote:
>>> On Mon, Feb 07, 2022 at 03:11:03PM +0100, Jan Beulich wrote:
>>>> On 07.02.2022 14:53, Oleksandr Andrushchenko wrote:
>>>>> On 07.02.22 14:46, Roger Pau Monné wrote:
>>>>>> I think the per-domain rwlock seems like a good option. I would do
>>>>>> that as a pre-patch.
>>>>> It is. But it seems it won't solve the thing we started this adventure for:
>>>>>
>>>>> With per-domain read lock and still ABBA in modify_bars (hope the below
>>>>> is correctly seen with a monospace font):
>>>>>
>>>>> cpu0: vpci_write-> d->RLock -> pdev1->lock ->                                                  rom_write -> modify_bars: tmp (pdev2) ->lock
>>>>> cpu1:        vpci_write-> d->RLock pdev2->lock -> cmd_write -> modify_bars: tmp (pdev1) ->lock
>>>>>
>>>>> There is no API to upgrade read lock to write lock in modify_bars which could help,
>>>>> so in both cases vpci_write should take write lock.
>>>> Hmm, yes, I think you're right: It's not modify_bars() itself which needs
>>>> to acquire the write lock, but its (perhaps indirect) caller. Effectively
>>>> vpci_write() would need to take the write lock if the range written
>>>> overlaps the BARs or the command register.
>>> I'm confused. If we use a per-domain rwlock approach there would be no
>>> need to lock tmp again in modify_bars, because we should hold the
>>> rwlock in write mode, so there's no ABBA?
>> this is only possible with what you wrote below:
>>> We will have however to drop the per domain read and vpci locks and
>>> pick the per-domain lock in write mode.
>> I think this is going to be unreliable. We need a reliable way to
>> upgrade read lock to write lock.
>> Then, we can drop pdev->vpci_lock at all, because we are always
>> protected with d->rwlock and those who want to free pdev->vpci
>> will use write lock.
>>
>> So, per-domain rwlock with write upgrade implemented minus pdev->vpci
>> should do the trick
> Linux doesn't implement write upgrade and it seems for a reason [1]:
> "Also, you cannot “upgrade” a read-lock to a write-lock, so if you at _any_ time
> need to do any changes (even if you don’t do it every time), you have to get
> the write-lock at the very beginning."
> 
> So, I am not sure we can have the same for Xen...
> 
> At the moment I see at least two possible ways to solve the issue:
> 1. Make vpci_write use write lock, thus make all write accesses synchronized
> for the given domain, reads are fully parallel

1b. Make vpci_write use write lock for writes to command register and BARs
only; keep using the read lock for all other writes.

Jan

> 2. Re-implement pdev/tmp overlapping detection with something which won't
> require pdev->vpci_lock/tmp->vpci_lock
> 
> 3. Drop read and acquire write lock in modify_bars... but this is not reliable
> and will hide a free(pdev->vpci) bug
> 
> @Roger, @Jan: Any other suggestions?
> 
> Thank you,
> Oleksandr
> 
> [1] https://www.kernel.org/doc/html/latest/locking/spinlocks.html#lesson-2-reader-writer-spinlocks



^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-07 15:14                         ` Oleksandr Andrushchenko
@ 2022-02-07 15:28                           ` Jan Beulich
  2022-02-07 15:59                             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-07 15:28 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 07.02.2022 16:14, Oleksandr Andrushchenko wrote:
> On 07.02.22 17:05, Jan Beulich wrote:
>> On 07.02.2022 15:46, Oleksandr Andrushchenko wrote:
>>> On 07.02.22 16:31, Jan Beulich wrote:
>>>> But: What's still missing here then is the separation of guest and host
>>>> views. When we set INTx behind the guest's back, it shouldn't observe the
>>>> bit set. Or is this meant to be another (big) TODO?
>>> But, patch [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
>>> already takes care of it, I mean that it will set/reset INTx for the guest
>>> according to MSI/MSI-X. So, if we squash these two patches the whole
>>> picture will be seen at once.
>> Does it? I did get the impression that the guest would be able to observe
>> the bit set even after writing zero to it (while a reason exists that Xen
>> wants the bit set).
> Yes, you are correct: guest might not see what it wanted to set.
> I meant that Xen won't allow resetting INTx if it is not possible
> due to MSI/MSI-X
> 
> Anyways, I think squashing will be a good idea to have the relevant
> functionality in a single change set. Will this work for you?

It might work, but I'd prefer things which can sensibly be separate to
remain separate.

Jan



^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-07 15:28                           ` Jan Beulich
@ 2022-02-07 15:59                             ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 15:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel



On 07.02.22 17:28, Jan Beulich wrote:
> On 07.02.2022 16:14, Oleksandr Andrushchenko wrote:
>> On 07.02.22 17:05, Jan Beulich wrote:
>>> On 07.02.2022 15:46, Oleksandr Andrushchenko wrote:
>>>> On 07.02.22 16:31, Jan Beulich wrote:
>>>>> But: What's still missing here then is the separation of guest and host
>>>>> views. When we set INTx behind the guest's back, it shouldn't observe the
>>>>> bit set. Or is this meant to be another (big) TODO?
>>>> But, patch [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
>>>> already takes care of it, I mean that it will set/reset INTx for the guest
>>>> according to MSI/MSI-X. So, if we squash these two patches the whole
>>>> picture will be seen at once.
>>> Does it? I did get the impression that the guest would be able to observe
>>> the bit set even after writing zero to it (while a reason exists that Xen
>>> wants the bit set).
>> Yes, you are correct: guest might not see what it wanted to set.
>> I meant that Xen won't allow resetting INTx if it is not possible
>> due to MSI/MSI-X
>>
>> Anyways, I think squashing will be a good idea to have the relevant
>> functionality in a single change set. Will this work for you?
> It might work, but I'd prefer things which can sensibly be separate to
> remain separate.
Ok, two patches
> Jan
>

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 15:26                                           ` Jan Beulich
@ 2022-02-07 16:07                                             ` Oleksandr Andrushchenko
  2022-02-07 16:15                                               ` Jan Beulich
  2022-02-07 16:08                                             ` Roger Pau Monné
  1 sibling, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 16:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné



On 07.02.22 17:26, Jan Beulich wrote:
> On 07.02.2022 16:11, Oleksandr Andrushchenko wrote:
>>
>> On 07.02.22 16:35, Oleksandr Andrushchenko wrote:
>>> On 07.02.22 16:27, Roger Pau Monné wrote:
>>>> On Mon, Feb 07, 2022 at 03:11:03PM +0100, Jan Beulich wrote:
>>>>> On 07.02.2022 14:53, Oleksandr Andrushchenko wrote:
>>>>>> On 07.02.22 14:46, Roger Pau Monné wrote:
>>>>>>> I think the per-domain rwlock seems like a good option. I would do
>>>>>>> that as a pre-patch.
>>>>>> It is. But it seems it won't solve the thing we started this adventure for:
>>>>>>
>>>>>> With per-domain read lock and still ABBA in modify_bars (hope the below
>>>>>> is correctly seen with a monospace font):
>>>>>>
>>>>>> cpu0: vpci_write-> d->RLock -> pdev1->lock ->                                                  rom_write -> modify_bars: tmp (pdev2) ->lock
>>>>>> cpu1:        vpci_write-> d->RLock pdev2->lock -> cmd_write -> modify_bars: tmp (pdev1) ->lock
>>>>>>
>>>>>> There is no API to upgrade read lock to write lock in modify_bars which could help,
>>>>>> so in both cases vpci_write should take write lock.
>>>>> Hmm, yes, I think you're right: It's not modify_bars() itself which needs
>>>>> to acquire the write lock, but its (perhaps indirect) caller. Effectively
>>>>> vpci_write() would need to take the write lock if the range written
>>>>> overlaps the BARs or the command register.
>>>> I'm confused. If we use a per-domain rwlock approach there would be no
>>>> need to lock tmp again in modify_bars, because we should hold the
>>>> rwlock in write mode, so there's no ABBA?
>>> this is only possible with what you wrote below:
>>>> We will have however to drop the per domain read and vpci locks and
>>>> pick the per-domain lock in write mode.
>>> I think this is going to be unreliable. We need a reliable way to
>>> upgrade read lock to write lock.
>>> Then, we can drop pdev->vpci_lock at all, because we are always
>>> protected with d->rwlock and those who want to free pdev->vpci
>>> will use write lock.
>>>
>>> So, per-domain rwlock with write upgrade implemented minus pdev->vpci
>>> should do the trick
>> Linux doesn't implement write upgrade and it seems for a reason [1]:
>> "Also, you cannot “upgrade” a read-lock to a write-lock, so if you at _any_ time
>> need to do any changes (even if you don’t do it every time), you have to get
>> the write-lock at the very beginning."
>>
>> So, I am not sure we can have the same for Xen...
>>
>> At the moment I see at least two possible ways to solve the issue:
>> 1. Make vpci_write use write lock, thus make all write accesses synchronized
>> for the given domain, reads are fully parallel
> 1b. Make vpci_write use write lock for writes to command register and BARs
> only; keep using the read lock for all other writes.
I am not quite sure how to do that. Do you mean something like:
void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
[snip]
     list_for_each_entry ( r, &pdev->vpci->handlers, node )
{
[snip]
     if ( r->needs_write_lock)
         write_lock(d->vpci_lock)
     else
         read_lock(d->vpci_lock)
....

And provide rw as an argument to:

int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
                       vpci_write_t *write_handler, unsigned int offset,
                       unsigned int size, void *data, --->>> bool write_path <<<-----)

Is this what you mean?

With the above, if we have d->vpci_lock, I think we can drop
pdev->vpci_lock altogether.

Thank you,
Oleksandr

P.S. I don't think you mean we just drop the read lock and acquire the write lock,
as that leads to the unreliability mentioned before.

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 15:26                                           ` Jan Beulich
  2022-02-07 16:07                                             ` Oleksandr Andrushchenko
@ 2022-02-07 16:08                                             ` Roger Pau Monné
  2022-02-07 16:12                                               ` Jan Beulich
  1 sibling, 1 reply; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-07 16:08 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, julien, sstabellini,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel

On Mon, Feb 07, 2022 at 04:26:56PM +0100, Jan Beulich wrote:
> On 07.02.2022 16:11, Oleksandr Andrushchenko wrote:
> > 
> > 
> > On 07.02.22 16:35, Oleksandr Andrushchenko wrote:
> >>
> >> On 07.02.22 16:27, Roger Pau Monné wrote:
> >>> On Mon, Feb 07, 2022 at 03:11:03PM +0100, Jan Beulich wrote:
> >>>> On 07.02.2022 14:53, Oleksandr Andrushchenko wrote:
> >>>>> On 07.02.22 14:46, Roger Pau Monné wrote:
> >>>>>> I think the per-domain rwlock seems like a good option. I would do
> >>>>>> that as a pre-patch.
> >>>>> It is. But it seems it won't solve the thing we started this adventure for:
> >>>>>
> >>>>> With per-domain read lock and still ABBA in modify_bars (hope the below
> >>>>> is correctly seen with a monospace font):
> >>>>>
> >>>>> cpu0: vpci_write-> d->RLock -> pdev1->lock ->                                                  rom_write -> modify_bars: tmp (pdev2) ->lock
> >>>>> cpu1:        vpci_write-> d->RLock pdev2->lock -> cmd_write -> modify_bars: tmp (pdev1) ->lock
> >>>>>
> >>>>> There is no API to upgrade read lock to write lock in modify_bars which could help,
> >>>>> so in both cases vpci_write should take write lock.
> >>>> Hmm, yes, I think you're right: It's not modify_bars() itself which needs
> >>>> to acquire the write lock, but its (perhaps indirect) caller. Effectively
> >>>> vpci_write() would need to take the write lock if the range written
> >>>> overlaps the BARs or the command register.
> >>> I'm confused. If we use a per-domain rwlock approach there would be no
> >>> need to lock tmp again in modify_bars, because we should hold the
> >>> rwlock in write mode, so there's no ABBA?
> >> this is only possible with what you wrote below:
> >>> We will have however to drop the per domain read and vpci locks and
> >>> pick the per-domain lock in write mode.
> >> I think this is going to be unreliable. We need a reliable way to
> >> upgrade read lock to write lock.
> >> Then, we can drop pdev->vpci_lock at all, because we are always
> >> protected with d->rwlock and those who want to free pdev->vpci
> >> will use write lock.
> >>
> >> So, per-domain rwlock with write upgrade implemented minus pdev->vpci
> >> should do the trick
> > Linux doesn't implement write upgrade and it seems for a reason [1]:
> > "Also, you cannot “upgrade” a read-lock to a write-lock, so if you at _any_ time
> > need to do any changes (even if you don’t do it every time), you have to get
> > the write-lock at the very beginning."
> > 
> > So, I am not sure we can have the same for Xen...
> > 
> > At the moment I see at least two possible ways to solve the issue:
> > 1. Make vpci_write use write lock, thus make all write accesses synchronized
> > for the given domain, reads are fully parallel
> 
> 1b. Make vpci_write use write lock for writes to command register and BARs
> only; keep using the read lock for all other writes.

We do not support writing to the BARs with memory decoding enabled
currently for dom0, so we would only need to pick the lock in write
mode for the command register and ROM BAR write handler AFAICT.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 16:08                                             ` Roger Pau Monné
@ 2022-02-07 16:12                                               ` Jan Beulich
  0 siblings, 0 replies; 138+ messages in thread
From: Jan Beulich @ 2022-02-07 16:12 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Oleksandr Andrushchenko, julien, sstabellini,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel

On 07.02.2022 17:08, Roger Pau Monné wrote:
> On Mon, Feb 07, 2022 at 04:26:56PM +0100, Jan Beulich wrote:
>> On 07.02.2022 16:11, Oleksandr Andrushchenko wrote:
>>>
>>>
>>> On 07.02.22 16:35, Oleksandr Andrushchenko wrote:
>>>>
>>>> On 07.02.22 16:27, Roger Pau Monné wrote:
>>>>> On Mon, Feb 07, 2022 at 03:11:03PM +0100, Jan Beulich wrote:
>>>>>> On 07.02.2022 14:53, Oleksandr Andrushchenko wrote:
>>>>>>> On 07.02.22 14:46, Roger Pau Monné wrote:
>>>>>>>> I think the per-domain rwlock seems like a good option. I would do
>>>>>>>> that as a pre-patch.
>>>>>>> It is. But it seems it won't solve the thing we started this adventure for:
>>>>>>>
>>>>>>> With per-domain read lock and still ABBA in modify_bars (hope the below
>>>>>>> is correctly seen with a monospace font):
>>>>>>>
>>>>>>> cpu0: vpci_write-> d->RLock -> pdev1->lock ->                                                  rom_write -> modify_bars: tmp (pdev2) ->lock
>>>>>>> cpu1:        vpci_write-> d->RLock pdev2->lock -> cmd_write -> modify_bars: tmp (pdev1) ->lock
>>>>>>>
>>>>>>> There is no API to upgrade read lock to write lock in modify_bars which could help,
>>>>>>> so in both cases vpci_write should take write lock.
>>>>>> Hmm, yes, I think you're right: It's not modify_bars() itself which needs
>>>>>> to acquire the write lock, but its (perhaps indirect) caller. Effectively
>>>>>> vpci_write() would need to take the write lock if the range written
>>>>>> overlaps the BARs or the command register.
>>>>> I'm confused. If we use a per-domain rwlock approach there would be no
>>>>> need to lock tmp again in modify_bars, because we should hold the
>>>>> rwlock in write mode, so there's no ABBA?
>>>> this is only possible with what you wrote below:
>>>>> We will have however to drop the per domain read and vpci locks and
>>>>> pick the per-domain lock in write mode.
>>>> I think this is going to be unreliable. We need a reliable way to
>>>> upgrade read lock to write lock.
>>>> Then, we can drop pdev->vpci_lock at all, because we are always
>>>> protected with d->rwlock and those who want to free pdev->vpci
>>>> will use write lock.
>>>>
>>>> So, per-domain rwlock with write upgrade implemented minus pdev->vpci
>>>> should do the trick
>>> Linux doesn't implement write upgrade and it seems for a reason [1]:
>>> "Also, you cannot “upgrade” a read-lock to a write-lock, so if you at _any_ time
>>> need to do any changes (even if you don’t do it every time), you have to get
>>> the write-lock at the very beginning."
>>>
>>> So, I am not sure we can have the same for Xen...
>>>
>>> At the moment I see at least two possible ways to solve the issue:
>>> 1. Make vpci_write use write lock, thus make all write accesses synchronized
>>> for the given domain, reads are fully parallel
>>
>> 1b. Make vpci_write use write lock for writes to command register and BARs
>> only; keep using the read lock for all other writes.
> 
> We do not support writing to the BARs with memory decoding enabled
> currently for dom0, so we would only need to pick the lock in write
> mode for the command register and ROM BAR write handler AFAICT.

Oh, right - this then makes for even less contention due to needing to
acquire the lock in write mode.

Jan



^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 16:07                                             ` Oleksandr Andrushchenko
@ 2022-02-07 16:15                                               ` Jan Beulich
  2022-02-07 16:21                                                 ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-07 16:15 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
> On 07.02.22 17:26, Jan Beulich wrote:
>> 1b. Make vpci_write use write lock for writes to command register and BARs
>> only; keep using the read lock for all other writes.
> I am not quite sure how to do that. Do you mean something like:
> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>                  uint32_t data)
> [snip]
>      list_for_each_entry ( r, &pdev->vpci->handlers, node )
> {
> [snip]
>      if ( r->needs_write_lock)
>          write_lock(d->vpci_lock)
>      else
>          read_lock(d->vpci_lock)
> ....
> 
> And provide rw as an argument to:
> 
> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>                        vpci_write_t *write_handler, unsigned int offset,
>                        unsigned int size, void *data, --->>> bool write_path <<<-----)
> 
> Is this what you mean?

This sounds overly complicated. You can derive locally in vpci_write(),
from just its "reg" and "size" parameters, whether the lock needs taking
in write mode.
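
A minimal sketch of such a local check follows, assuming only the command
register and the expansion ROM BAR need the write lock (as suggested
elsewhere in this thread). The helper names and the is_bridge parameter
are hypothetical; the register offsets are the standard config space ones.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND       0x04 /* 16-bit command register */
#define PCI_ROM_ADDRESS   0x30 /* expansion ROM BAR, type 0 header */
#define PCI_ROM_ADDRESS1  0x38 /* expansion ROM BAR, type 1 (bridge) header */

/* True if [reg, reg + size) intersects [start, start + len). */
static bool overlaps(unsigned int reg, unsigned int size,
                     unsigned int start, unsigned int len)
{
    return reg < start + len && start < reg + size;
}

/*
 * Hypothetical helper: decide the lock mode for a config space write
 * from its offset and size alone. 'is_bridge' stands in for the header
 * type lookup, since the ROM BAR position depends on the header type.
 */
static bool vpci_write_needs_wlock(unsigned int reg, unsigned int size,
                                   bool is_bridge)
{
    unsigned int rom = is_bridge ? PCI_ROM_ADDRESS1 : PCI_ROM_ADDRESS;

    return overlaps(reg, size, PCI_COMMAND, 2) ||
           overlaps(reg, size, rom, 4);
}
```

A 4-byte write at offset 0x04 covers both command and status, so it would
take the lock in write mode; a 2-byte write to the status register at 0x06
would not.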

Jan



^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 16:15                                               ` Jan Beulich
@ 2022-02-07 16:21                                                 ` Oleksandr Andrushchenko
  2022-02-07 16:37                                                   ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 16:21 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 07.02.22 18:15, Jan Beulich wrote:
> On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
>> On 07.02.22 17:26, Jan Beulich wrote:
>>> 1b. Make vpci_write use write lock for writes to command register and BARs
>>> only; keep using the read lock for all other writes.
>> I am not quite sure how to do that. Do you mean something like:
>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>                   uint32_t data)
>> [snip]
>>       list_for_each_entry ( r, &pdev->vpci->handlers, node )
>> {
>> [snip]
>>       if ( r->needs_write_lock)
>>           write_lock(d->vpci_lock)
>>       else
>>           read_lock(d->vpci_lock)
>> ....
>>
>> And provide rw as an argument to:
>>
>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>                         vpci_write_t *write_handler, unsigned int offset,
>>                         unsigned int size, void *data, --->>> bool write_path <<<-----)
>>
>> Is this what you mean?
> This sounds overly complicated. You can derive locally in vpci_write(),
> from just its "reg" and "size" parameters, whether the lock needs taking
> in write mode.
Yes, I started writing a reply with that. So, the summary (ROM
position depends on header type):
if ( (reg == PCI_COMMAND) || (reg == ROM) )
{
     read PCI_COMMAND and see if memory or IO decoding are enabled.
     if ( enabled )
         write_lock(d->vpci_lock)
     else
         read_lock(d->vpci_lock)
}

Do you also think we can drop pdev->vpci (or currently pdev->vpci->lock)
at all then?
> Jan
>
>
Thank you,
Oleksandr


* Re: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-04  6:34 ` [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign Oleksandr Andrushchenko
@ 2022-02-07 16:28   ` Jan Beulich
  2022-02-08  8:32     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-07 16:28 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>                          pci_to_dev(pdev), flag);
>      }
>  
> +    rc = vpci_assign_device(d, pdev);
> +
>   done:
>      if ( rc )
>          printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",

There's no attempt to undo anything in the case of getting back an
error. ISTR this being deemed okay on the basis that the tool stack
would then take whatever action, but whatever it is that is supposed
to deal with errors here wants spelling out in the description.
What's important is that no caller up the call tree may be left with
the impression that the device is still owned by the original
domain. With how you have it, the device is going to be owned by the
new domain, but not really usable.

> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -99,6 +99,33 @@ int vpci_add_handlers(struct pci_dev *pdev)
>  
>      return rc;
>  }
> +
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +/* Notify vPCI that device is assigned to guest. */
> +int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
> +{
> +    int rc;
> +
> +    if ( !has_vpci(d) )
> +        return 0;
> +
> +    rc = vpci_add_handlers(pdev);
> +    if ( rc )
> +        vpci_deassign_device(d, pdev);
> +
> +    return rc;
> +}
> +
> +/* Notify vPCI that device is de-assigned from guest. */
> +void vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
> +{
> +    if ( !has_vpci(d) )
> +        return;
> +
> +    vpci_remove_device(pdev);
> +}
> +#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */

While for the latter function you look to need two parameters, do you
really need them also in the former one?

Symmetry considerations make me wonder though whether the de-assign
hook shouldn't be called earlier, when pdev->domain still has the
original owner. At which point the 2nd parameter could disappear there
as well.

Jan




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 16:21                                                 ` Oleksandr Andrushchenko
@ 2022-02-07 16:37                                                   ` Jan Beulich
  2022-02-07 16:44                                                     ` Oleksandr Andrushchenko
  2022-02-08 10:11                                                     ` Roger Pau Monné
  0 siblings, 2 replies; 138+ messages in thread
From: Jan Beulich @ 2022-02-07 16:37 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 07.02.2022 17:21, Oleksandr Andrushchenko wrote:
> 
> 
> On 07.02.22 18:15, Jan Beulich wrote:
>> On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
>>> On 07.02.22 17:26, Jan Beulich wrote:
>>>> 1b. Make vpci_write use write lock for writes to command register and BARs
>>>> only; keep using the read lock for all other writes.
>>> I am not quite sure how to do that. Do you mean something like:
>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>                   uint32_t data)
>>> [snip]
>>>       list_for_each_entry ( r, &pdev->vpci->handlers, node )
>>> {
>>> [snip]
>>>       if ( r->needs_write_lock)
>>>           write_lock(d->vpci_lock)
>>>       else
>>>           read_lock(d->vpci_lock)
>>> ....
>>>
>>> And provide rw as an argument to:
>>>
>>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>>                         vpci_write_t *write_handler, unsigned int offset,
>>>                         unsigned int size, void *data, --->>> bool write_path <<<-----)
>>>
>>> Is this what you mean?
>> This sounds overly complicated. You can derive locally in vpci_write(),
>> from just its "reg" and "size" parameters, whether the lock needs taking
>> in write mode.
> Yes, I started writing a reply with that. So, the summary (ROM
> position depends on header type):
> if ( (reg == PCI_COMMAND) || (reg == ROM) )
> {
>      read PCI_COMMAND and see if memory or IO decoding are enabled.
>      if ( enabled )
>          write_lock(d->vpci_lock)
>      else
>          read_lock(d->vpci_lock)
> }

Hmm, yes, you can actually get away without using "size", since both
command register and ROM BAR are 32-bit aligned registers, and 64-bit
accesses get split in vpci_ecam_write().

For the command register the memory- / IO-decoding-enabled check may
end up a little more complicated, as the value to be written also
matters. Maybe read the command register only for the ROM BAR write,
using the write lock uniformly for all command register writes?
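
As a rough sketch of that rule (the helper name and the enum are made up for
illustration, and the ROM BAR offset is assumed to be passed in by the caller
since it depends on the header type; the check for "memory or IO decoding"
follows the summary quoted above):

```c
#include <assert.h>
#include <stdint.h>

/* Standard PCI config space offsets and command register bits. */
#define PCI_COMMAND         0x04
#define PCI_COMMAND_IO      0x1
#define PCI_COMMAND_MEMORY  0x2
#define PCI_ROM_ADDRESS     0x30    /* ROM BAR, type 0 header */
#define PCI_ROM_ADDRESS1    0x38    /* ROM BAR, type 1 header */

/* Hypothetical type, not part of the existing vPCI code. */
enum vpci_lock_mode { VPCI_LOCK_READ, VPCI_LOCK_WRITE };

/*
 * Hypothetical helper: pick the lock mode for a config space write.
 * "cmd" is the current command register value; it only matters for
 * ROM BAR writes.
 */
static enum vpci_lock_mode vpci_write_lock_mode(unsigned int reg,
                                                unsigned int rom_reg,
                                                uint16_t cmd)
{
    assert(rom_reg == PCI_ROM_ADDRESS || rom_reg == PCI_ROM_ADDRESS1);

    /* All command register writes take the lock in write mode. */
    if ( reg == PCI_COMMAND )
        return VPCI_LOCK_WRITE;

    /* ROM BAR writes need write mode only while decoding is enabled. */
    if ( reg == rom_reg && (cmd & (PCI_COMMAND_MEMORY | PCI_COMMAND_IO)) )
        return VPCI_LOCK_WRITE;

    /* Every other write can run under the read lock. */
    return VPCI_LOCK_READ;
}
```

With this shape the caller would read the command register only when reg
matches the ROM BAR, and pass 0 otherwise.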

> Do you also think we can drop pdev->vpci (or currently pdev->vpci->lock)
> at all then?

I haven't looked at this in any detail, sorry. It sounds possible,
yes.

Jan




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 16:37                                                   ` Jan Beulich
@ 2022-02-07 16:44                                                     ` Oleksandr Andrushchenko
  2022-02-08  7:35                                                       ` Oleksandr Andrushchenko
  2022-02-08  8:53                                                       ` Jan Beulich
  2022-02-08 10:11                                                     ` Roger Pau Monné
  1 sibling, 2 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-07 16:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné,
	Oleksandr Andrushchenko



On 07.02.22 18:37, Jan Beulich wrote:
> On 07.02.2022 17:21, Oleksandr Andrushchenko wrote:
>>
>> On 07.02.22 18:15, Jan Beulich wrote:
>>> On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
>>>> On 07.02.22 17:26, Jan Beulich wrote:
>>>>> 1b. Make vpci_write use write lock for writes to command register and BARs
>>>>> only; keep using the read lock for all other writes.
>>>> I am not quite sure how to do that. Do you mean something like:
>>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>                    uint32_t data)
>>>> [snip]
>>>>        list_for_each_entry ( r, &pdev->vpci->handlers, node )
>>>> {
>>>> [snip]
>>>>        if ( r->needs_write_lock)
>>>>            write_lock(d->vpci_lock)
>>>>        else
>>>>            read_lock(d->vpci_lock)
>>>> ....
>>>>
>>>> And provide rw as an argument to:
>>>>
>>>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>>>                          vpci_write_t *write_handler, unsigned int offset,
>>>>                          unsigned int size, void *data, --->>> bool write_path <<<-----)
>>>>
>>>> Is this what you mean?
>>> This sounds overly complicated. You can derive locally in vpci_write(),
>>> from just its "reg" and "size" parameters, whether the lock needs taking
>>> in write mode.
>> Yes, I started writing a reply with that. So, the summary (ROM
>> position depends on header type):
>> if ( (reg == PCI_COMMAND) || (reg == ROM) )
>> {
>>       read PCI_COMMAND and see if memory or IO decoding are enabled.
>>       if ( enabled )
>>           write_lock(d->vpci_lock)
>>       else
>>           read_lock(d->vpci_lock)
>> }
> Hmm, yes, you can actually get away without using "size", since both
> command register and ROM BAR are 32-bit aligned registers, and 64-bit
> accesses get split in vpci_ecam_write().
But the OS may want to read a single byte of the ROM BAR, so I think
I'll need to check if reg+size falls into the PCI_COMMAND and ROM BAR
ranges
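
Something along these lines, as a sketch only (both helper names are
hypothetical; the register widths are the standard ones, 2 bytes for the
command register and 4 bytes for the ROM BAR):

```c
#include <assert.h>
#include <stdbool.h>

#define PCI_COMMAND 0x04    /* standard config space offset */

/* Does the access [reg, reg + size) touch [start, start + len)? */
static bool access_overlaps(unsigned int reg, unsigned int size,
                            unsigned int start, unsigned int len)
{
    return reg < start + len && start < reg + size;
}

/*
 * Hypothetical check for vpci_write(): true if the access touches any
 * byte of the command register or of the ROM BAR at rom_reg.
 */
static bool touches_cmd_or_rom(unsigned int reg, unsigned int size,
                               unsigned int rom_reg)
{
    /* 64-bit accesses are split before they get here. */
    assert(size == 1 || size == 2 || size == 4);

    return access_overlaps(reg, size, PCI_COMMAND, 2) ||
           access_overlaps(reg, size, rom_reg, 4);
}
```

This way a byte access at rom_reg + 1 is still treated as a ROM BAR access,
while an access to the status register at 0x06 is not mistaken for a command
register access.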
>
> For the command register the memory- / IO-decoding-enabled check may
> end up a little more complicated, as the value to be written also
> matters. Maybe read the command register only for the ROM BAR write,
> using the write lock uniformly for all command register writes?
Sounds good for the start.
Another concern is that if we go with a read_lock and then in the
underlying code we disable memory decoding and try doing
something and calling cmd_write handler for any reason then....

I mean that the check in vpci_write is something we can tolerate,
but then it must be ensured that no code in the read path
is allowed to perform write path functions. Which brings up a pretty
valid use-case: say in read mode we detect an unrecoverable error
and need to remove the device:
vpci_process_pending -> ERROR -> vpci_remove_device or similar.

What do we do then? It is all going to be fragile...
>
>> Do you also think we can drop pdev->vpci (or currently pdev->vpci->lock)
>> at all then?
> I haven't looked at this in any detail, sorry. It sounds possible,
> yes.
>
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 06/13] vpci/header: implement guest BAR register handlers
  2022-02-04  6:34 ` [PATCH v6 06/13] vpci/header: implement guest BAR register handlers Oleksandr Andrushchenko
@ 2022-02-07 17:06   ` Jan Beulich
  2022-02-08  8:06     ` Oleksandr Andrushchenko
  2022-02-08  9:25   ` Roger Pau Monné
  1 sibling, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-07 17:06 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko,
	xen-devel

On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> +static uint32_t guest_bar_ignore_read(const struct pci_dev *pdev,
> +                                      unsigned int reg, void *data)
> +{
> +    return 0;
> +}
> +
> +static int bar_ignore_access(const struct pci_dev *pdev, unsigned int reg,
> +                             struct vpci_bar *bar)
> +{
> +    if ( is_hardware_domain(pdev->domain) )
> +        return 0;
> +
> +    return vpci_add_register(pdev->vpci, guest_bar_ignore_read, NULL,
> +                             reg, 4, bar);
> +}

For these two functions: I'm not sure "ignore" is an appropriate
term here. unused_bar_read() and unused_bar() maybe? Or,
considering we already have VPCI_BAR_EMPTY, s/unused/empty/ ? I'm
also not sure we really need the is_hardware_domain() check here:
Returning 0 for Dom0 is going to be fine as well; there's no need
to fetch the value from actual hardware. The one exception might
be for devices with buggy BAR behavior ...

> @@ -516,6 +594,11 @@ static int init_bars(struct pci_dev *pdev)
>          if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
>          {
>              bars[i].type = VPCI_BAR_IO;
> +
> +            rc = bar_ignore_access(pdev, reg, &bars[i]);
> +            if ( rc )
> +                return rc;

Elsewhere the command register is restored on error paths.

Jan




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 16:44                                                     ` Oleksandr Andrushchenko
@ 2022-02-08  7:35                                                       ` Oleksandr Andrushchenko
  2022-02-08  8:57                                                         ` Jan Beulich
  2022-02-08 10:50                                                         ` Roger Pau Monné
  2022-02-08  8:53                                                       ` Jan Beulich
  1 sibling, 2 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08  7:35 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel



On 07.02.22 18:44, Oleksandr Andrushchenko wrote:
>
> On 07.02.22 18:37, Jan Beulich wrote:
>> On 07.02.2022 17:21, Oleksandr Andrushchenko wrote:
>>> On 07.02.22 18:15, Jan Beulich wrote:
>>>> On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
>>>>> On 07.02.22 17:26, Jan Beulich wrote:
>>>>>> 1b. Make vpci_write use write lock for writes to command register and BARs
>>>>>> only; keep using the read lock for all other writes.
>>>>> I am not quite sure how to do that. Do you mean something like:
>>>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>                     uint32_t data)
>>>>> [snip]
>>>>>         list_for_each_entry ( r, &pdev->vpci->handlers, node )
>>>>> {
>>>>> [snip]
>>>>>         if ( r->needs_write_lock)
>>>>>             write_lock(d->vpci_lock)
>>>>>         else
>>>>>             read_lock(d->vpci_lock)
>>>>> ....
>>>>>
>>>>> And provide rw as an argument to:
>>>>>
>>>>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>>>>                           vpci_write_t *write_handler, unsigned int offset,
>>>>>                           unsigned int size, void *data, --->>> bool write_path <<<-----)
>>>>>
>>>>> Is this what you mean?
>>>> This sounds overly complicated. You can derive locally in vpci_write(),
>>>> from just its "reg" and "size" parameters, whether the lock needs taking
>>>> in write mode.
>>> Yes, I started writing a reply with that. So, the summary (ROM
>>> position depends on header type):
>>> if ( (reg == PCI_COMMAND) || (reg == ROM) )
>>> {
>>>        read PCI_COMMAND and see if memory or IO decoding are enabled.
>>>        if ( enabled )
>>>            write_lock(d->vpci_lock)
>>>        else
>>>            read_lock(d->vpci_lock)
>>> }
>> Hmm, yes, you can actually get away without using "size", since both
>> command register and ROM BAR are 32-bit aligned registers, and 64-bit
>> accesses get split in vpci_ecam_write().
> But the OS may want to read a single byte of the ROM BAR, so I think
> I'll need to check if reg+size falls into the PCI_COMMAND and ROM BAR
> ranges
>> For the command register the memory- / IO-decoding-enabled check may
>> end up a little more complicated, as the value to be written also
>> matters. Maybe read the command register only for the ROM BAR write,
>> using the write lock uniformly for all command register writes?
> Sounds good for the start.
> Another concern is that if we go with a read_lock and then in the
> underlying code we disable memory decoding and try doing
> something and calling cmd_write handler for any reason then....
>
> I mean that the check in vpci_write is something we can tolerate,
> but then it must be ensured that no code in the read path
> is allowed to perform write path functions. Which brings up a pretty
> valid use-case: say in read mode we detect an unrecoverable error
> and need to remove the device:
> vpci_process_pending -> ERROR -> vpci_remove_device or similar.
>
> What do we do then? It is all going to be fragile...
I have tried to summarize the options we have wrt locking
and would love to hear from @Roger and @Jan.

In every variant there is a task of dealing with the overlap
detection in modify_bars, so this is the only place as of now
which needs special treatment.

Existing limitations: there is no way to upgrade a read lock to a write
lock, so paths which may require write lock protection need to use
write lock from the very beginning. Workarounds can be applied.

1. Per-domain rw lock, aka d->vpci_lock
==============================================================
Note: with per-domain rw lock it is possible to do without introducing
per-device locks, so pdev->vpci->lock can be removed and no pdev->vpci_lock
should be required.

This is only going to work if vpci_write always takes the write lock,
vpci_read takes a read lock, and no path in vpci_read is allowed to
perform write path operations.
vpci_process_pending uses the write lock as it has vpci_remove_device in its
error path.

Pros:
- no per-device vpci lock is needed?
- solves overlap code ABBA in modify_bars

Cons:
- all writes are serialized
- need to carefully select read paths, so they are guaranteed not to lead
   to lock upgrade use-cases

1.1. Semi read lock upgrade in modify bars
--------------------------------------------------------------
In this case both vpci_read and vpci_write take a read lock and when it comes
to modify_bars:

1. read_unlock(d->vpci_lock)
2. write_lock(d->vpci_lock)
3. Check that pdev->vpci is still available and is the same object:
if (pdev->vpci && (pdev->vpci == old_vpci) )
{
     /* vpci structure is valid and can be used. */
}
else
{
     /* vpci has gone, return an error. */
}

Pros:
- no per-device vpci lock is needed?
- solves overlap code ABBA in modify_bars
- readers and writers are NOT serialized
- NO need to carefully select read paths, so they are guaranteed not to lead
   to lock upgrade use-cases

Cons:
- ???

2. per-device lock (pdev->vpci_lock) + d->overlap_chk_lock
==============================================================
In order to solve overlap ABBA, we introduce a per-domain helper
lock to protect the overlapping code in modify_bars:

     old_vpci = pdev->vpci;
     spin_unlock(pdev->vpci_lock);
     spin_lock(pdev->domain->overlap_chk_lock);
     spin_lock(pdev->vpci_lock);
     if ( pdev->vpci && (pdev->vpci == old_vpci) )
         for_each_pdev ( pdev->domain, tmp )
         {
             if ( tmp != pdev )
             {
                 spin_lock(tmp->vpci_lock);
                 if ( tmp->vpci )
                     ...
             }
         }

Pros:
- all accesses are independent, only the same device access is serialized
- no need to care about readers and writers wrt read lock upgrade issues

Cons:
- helper spin lock

3. Move overlap detection into process pending
==============================================================
There is a patch from Roger [1] which makes it possible for vpci_process_pending
to perform tasks other than just map/unmap. With this patch extended
so that it can hold a request queue, it is possible to delay execution
of the overlap code until no pdev->vpci_lock is held, but before returning to
a guest after vpci_{read|write} or similar.
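
To illustrate the idea only: a fixed-size FIFO of deferred requests, filled
from the handlers and drained before returning to the guest. All names and the
queue depth are assumptions made for this sketch, none of them come from
Roger's patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8   /* arbitrary depth for the sketch */

/* Hypothetical deferred work callback. */
typedef int (*vpci_deferred_fn)(void *data);

struct vpci_request {
    vpci_deferred_fn fn;
    void *data;
};

struct vpci_queue {
    struct vpci_request req[MAX_PENDING];
    size_t head, count;
};

/* Called from a handler, lock held: record the work instead of doing it. */
static bool vpci_defer(struct vpci_queue *q, vpci_deferred_fn fn, void *data)
{
    size_t tail;

    assert(fn != NULL);
    if ( q->count == MAX_PENDING )
        return false;

    tail = (q->head + q->count) % MAX_PENDING;
    q->req[tail] = (struct vpci_request){ .fn = fn, .data = data };
    q->count++;

    return true;
}

/*
 * Called before returning to the guest, with no per-device lock held.
 * Runs requests in FIFO order and stops at the first error.
 */
static int vpci_run_deferred(struct vpci_queue *q)
{
    while ( q->count )
    {
        struct vpci_request r = q->req[q->head];
        int rc;

        q->head = (q->head + 1) % MAX_PENDING;
        q->count--;

        rc = r.fn(r.data);
        if ( rc )
            return rc;
    }

    return 0;
}

/* Example request: stands in for the deferred overlap check. */
static int bump_counter(void *data)
{
    ++*(int *)data;
    return 0;
}
```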

Pros:
- no need to emulate read lock upgrade
- fully parallel read/write
- queue in the vpci_process_pending will later on be used by SR-IOV,
   so this is going to help the future code
Cons:
- ???

4. Re-write overlap detection code
==============================================================
It is possible to re-write overlap detection code, so the information about the
mapped/unmapped regions is not read from vpci->header->bars[i] of each device,
but instead there is a per-domain structure which holds the regions and
implements reference counting.
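
A very reduced sketch of such a structure, assuming one reference counter per
page frame in a tiny fixed-size table (a real implementation would need a
range-based structure; the names and the table size are made up):

```c
#include <assert.h>

#define NR_FRAMES 64    /* tiny address space for the sketch */

/* Hypothetical per-domain bookkeeping: references per page frame. */
struct domain_regions {
    unsigned int refcnt[NR_FRAMES];
};

/*
 * Take a reference on frames [start, end] (inclusive).
 * Returns how many frames got their first reference, i.e. need mapping.
 */
static unsigned int regions_get(struct domain_regions *r,
                                unsigned int start, unsigned int end)
{
    unsigned int i, newly = 0;

    assert(start <= end && end < NR_FRAMES);
    for ( i = start; i <= end; i++ )
        if ( r->refcnt[i]++ == 0 )
            newly++;

    return newly;
}

/*
 * Drop a reference on frames [start, end] (inclusive).
 * Returns how many frames lost their last reference, i.e. need unmapping.
 */
static unsigned int regions_put(struct domain_regions *r,
                                unsigned int start, unsigned int end)
{
    unsigned int i, freed = 0;

    assert(start <= end && end < NR_FRAMES);
    for ( i = start; i <= end; i++ )
    {
        assert(r->refcnt[i] > 0);
        if ( --r->refcnt[i] == 0 )
            freed++;
    }

    return freed;
}
```

With this, two BARs overlapping on some frames keep those frames mapped until
both have dropped their reference, without walking the other devices' BARs.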

Pros:
- solves ABBA

Cons:
- very complex code is expected

5. You name it
==============================================================

Of all the above, I would recommend we go with option 2, which seems to reliably
solve ABBA and does not bring the cons of the other approaches.

Thank you in advance,
Oleksandr

[1] https://lore.kernel.org/all/5BABA6EF02000078001EC452@prv1-mh.provo.novell.com/T/#m231fb0586007725bfd8538bb97ff1777a36842cf


* Re: [PATCH v6 04/13] vpci: restrict unhandled read/write operations for guests
  2022-02-04 14:24     ` Oleksandr Andrushchenko
@ 2022-02-08  8:00       ` Oleksandr Andrushchenko
  2022-02-08  9:04         ` Jan Beulich
  2022-02-08  9:05         ` Roger Pau Monné
  0 siblings, 2 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08  8:00 UTC (permalink / raw)
  To: roger.pau, Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko


On 04.02.22 16:24, Oleksandr Andrushchenko wrote:
>
> On 04.02.22 16:11, Jan Beulich wrote:
>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>> A guest can read and write those registers which are not emulated and
>>> have no respective vPCI handlers, so it can access the HW directly.
>> I don't think this describes the present situation. Or did I miss where
>> devices can actually be exposed to guests already, despite much of the
>> support logic still missing?
> No, they are not exposed yet and you know that.
> I will update the commit message
BTW, all this work is about adding vpci for guests and of course this
is not going to be enabled right away.
I would like to hear the commonly accepted way of documenting such
things: either we just say something like "A guest can read and write"
elsewhere or we need to invent something neutral not directly mentioning
what the change does. With the latter it all seems a bit confusing IMO
as we do know what we are doing and for what reason: enable vpci for guests
>>> In order to prevent a guest from reads and writes from/to the unhandled
>>> registers make sure only hardware domain can access HW directly and restrict
>>> guests from doing so.
>> Tangential question: Going over the titles of the remaining patches I
>> notice patch 6 is going to deal with BAR accesses. But (going just
>> from the titles) I can't spot anywhere that vendor and device IDs
>> would be exposed to guests. Yet that's the first thing guests will need
>> in order to actually recognize devices. As said before, allowing guests
>> access to such r/o fields is quite likely going to be fine.
> Agree, I was thinking about adding such a patch to allow IDs,
> but finally decided not to add more to this series.
> Again, the whole thing is not working yet and for the development
> this patch can/needs to be reverted. So, either we implement IDs
> or not this doesn't change anything with this respect
Roger, do you want an additional patch with IDs in v7?
>>> --- a/xen/drivers/vpci/vpci.c
>>> +++ b/xen/drivers/vpci/vpci.c
>>> @@ -215,11 +215,15 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>>>    }
>>>    
>>>    /* Wrappers for performing reads/writes to the underlying hardware. */
>>> -static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
>>> +static uint32_t vpci_read_hw(bool is_hwdom, pci_sbdf_t sbdf, unsigned int reg,
>>>                                 unsigned int size)
>> Was the passing around of a boolean the consensus which was reached?
> Was this patch committed yet?
>> Personally I'd find it more natural if the two functions checked
>> current->domain themselves.
> This is also possible, but I would like to hear Roger's view on this as well
> I am fine either way
Roger, what's your preference as maintainer here? An additional argument
to vpci_read_hw, or make it use current->domain internally?

Thank you,
Oleksandr


* Re: [PATCH v6 06/13] vpci/header: implement guest BAR register handlers
  2022-02-07 17:06   ` Jan Beulich
@ 2022-02-08  8:06     ` Oleksandr Andrushchenko
  2022-02-08  9:16       ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08  8:06 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 07.02.22 19:06, Jan Beulich wrote:
> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>> +static uint32_t guest_bar_ignore_read(const struct pci_dev *pdev,
>> +                                      unsigned int reg, void *data)
>> +{
>> +    return 0;
>> +}
>> +
>> +static int bar_ignore_access(const struct pci_dev *pdev, unsigned int reg,
>> +                             struct vpci_bar *bar)
>> +{
>> +    if ( is_hardware_domain(pdev->domain) )
>> +        return 0;
>> +
>> +    return vpci_add_register(pdev->vpci, guest_bar_ignore_read, NULL,
>> +                             reg, 4, bar);
>> +}
> For these two functions: I'm not sure "ignore" is an appropriate
> term here. unused_bar_read() and unused_bar() maybe? Or,
> considering we already have VPCI_BAR_EMPTY, s/unused/empty/ ? I'm
> also not sure we really need the is_hardware_domain() check here:
> Returning 0 for Dom0 is going to be fine as well; there's no need
> to fetch the value from actual hardware. The one exception might
> be for devices with buggy BAR behavior ...
Well, I think this should be ok, so then
- s/guest_bar_ignore_read/empty_bar_read
- s/bar_ignore_access/empty_bar
- no is_hardware_domain check
>
>> @@ -516,6 +594,11 @@ static int init_bars(struct pci_dev *pdev)
>>           if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
>>           {
>>               bars[i].type = VPCI_BAR_IO;
>> +
>> +            rc = bar_ignore_access(pdev, reg, &bars[i]);
>> +            if ( rc )
>> +                return rc;
> Elsewhere the command register is restored on error paths.
Ok, I will restore
>
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2022-02-04 14:25   ` Jan Beulich
@ 2022-02-08  8:13     ` Oleksandr Andrushchenko
  2022-02-08  9:33       ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08  8:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, Oleksandr Andrushchenko,
	xen-devel



On 04.02.22 16:25, Jan Beulich wrote:
> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -454,6 +454,22 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>           pci_conf_write16(pdev->sbdf, reg, cmd);
>>   }
>>   
>> +static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
>> +                            uint32_t cmd, void *data)
>> +{
>> +    /* TODO: Add proper emulation for all bits of the command register. */
>> +
>> +#ifdef CONFIG_HAS_PCI_MSI
>> +    if ( pdev->vpci->msi->enabled || pdev->vpci->msix->enabled )
>> +    {
>> +        /* Guest wants to enable INTx. It can't be enabled if MSI/MSI-X enabled. */
>> +        cmd |= PCI_COMMAND_INTX_DISABLE;
>> +    }
>> +#endif
>> +
>> +    cmd_write(pdev, reg, cmd, data);
>> +}
> It's not really clear to me whether the TODO warrants this being a
> separate function. Personally I'd find it preferable if the logic
> was folded into cmd_write().
Not sure cmd_write needs to have guest's logic. And what's the
profit? Later on, when we decide how PCI_COMMAND can be emulated
this code will live in guest_cmd_write anyways
>
> With this and ...
>
>> --- a/xen/drivers/vpci/msi.c
>> +++ b/xen/drivers/vpci/msi.c
>> @@ -70,6 +70,10 @@ static void control_write(const struct pci_dev *pdev, unsigned int reg,
>>   
>>           if ( vpci_msi_arch_enable(msi, pdev, vectors) )
>>               return;
>> +
>> +        /* Make sure guest doesn't enable INTx while enabling MSI. */
>> +        if ( !is_hardware_domain(pdev->domain) )
>> +            pci_intx(pdev, false);
>>       }
>>       else
>>           vpci_msi_arch_disable(msi, pdev);
>> --- a/xen/drivers/vpci/msix.c
>> +++ b/xen/drivers/vpci/msix.c
>> @@ -92,6 +92,10 @@ static void control_write(const struct pci_dev *pdev, unsigned int reg,
>>           for ( i = 0; i < msix->max_entries; i++ )
>>               if ( !msix->entries[i].masked && msix->entries[i].updated )
>>                   update_entry(&msix->entries[i], pdev, i);
>> +
>> +        /* Make sure guest doesn't enable INTx while enabling MSI-X. */
>> +        if ( !is_hardware_domain(pdev->domain) )
>> +            pci_intx(pdev, false);
>>       }
>>       else if ( !new_enabled && msix->enabled )
>>       {
> ... this done (as requested) behind the back of the guest, what's the
> idea wrt the guest reading the command register? That continues to be
> wired to vpci_hw_read16() (and hence accesses the underlying hardware
> value irrespective of what patch 4 did).
Yes, good point. We need to add a guest_cmd_read counterpart,
so we can also implement the same logic as in guest_cmd_write
wrt the INTx bit.
>
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-07 16:28   ` Jan Beulich
@ 2022-02-08  8:32     ` Oleksandr Andrushchenko
  2022-02-08  9:13       ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08  8:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 07.02.22 18:28, Jan Beulich wrote:
> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>                           pci_to_dev(pdev), flag);
>>       }
>>   
>> +    rc = vpci_assign_device(d, pdev);
>> +
>>    done:
>>       if ( rc )
>>           printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
> There's no attempt to undo anything in the case of getting back an
> error. ISTR this being deemed okay on the basis that the tool stack
> would then take whatever action, but whatever it is that is supposed
> to deal with errors here wants spelling out in the description.
Why? I don't change the previously expected decision and implementation
of the assign_device function: I use error paths as they were used before
for the existing code. So, I see no clear reason to stress that the existing
and new code relies on the toolstack
> What's important is that no caller up the call tree may be left with
> the impression that the device is still owned by the original
> domain. With how you have it, the device is going to be owned by the
> new domain, but not really usable.
This is not true: vpci_assign_device will call vpci_deassign_device
internally if it fails. So, the device won't be assigned in this case
>
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -99,6 +99,33 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>   
>>       return rc;
>>   }
>> +
>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +/* Notify vPCI that device is assigned to guest. */
>> +int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
>> +{
>> +    int rc;
>> +
>> +    if ( !has_vpci(d) )
>> +        return 0;
>> +
>> +    rc = vpci_add_handlers(pdev);
>> +    if ( rc )
>> +        vpci_deassign_device(d, pdev);
>> +
>> +    return rc;
>> +}
>> +
>> +/* Notify vPCI that device is de-assigned from guest. */
>> +void vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
>> +{
>> +    if ( !has_vpci(d) )
>> +        return;
>> +
>> +    vpci_remove_device(pdev);
>> +}
>> +#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
> While for the latter function you look to need two parameters, do you
> really need them also in the former one?
Do you mean instead of passing d we could just use pdev->domain?
int vpci_assign_device(struct pci_dev *pdev)
+{
+    int rc;
+
+    if ( !has_vpci(pdev->domain) )
+        return 0;
Yes, we probably can, but the rest of the functions called from assign_device
accept both d and pdev, so I am not sure why we would want these two to be
any different. Is there any good reason not to change the others as well then?
> Symmetry considerations make me wonder though whether the de-assign
> hook shouldn't be called earlier, when pdev->domain still has the
> original owner. At which point the 2nd parameter could disappear there
> as well.
static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
{
[snip]
     vpci_deassign_device(pdev->domain, pdev);
[snip]
     rc = vpci_assign_device(d, pdev);

It looks ok to me
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 16:44                                                     ` Oleksandr Andrushchenko
  2022-02-08  7:35                                                       ` Oleksandr Andrushchenko
@ 2022-02-08  8:53                                                       ` Jan Beulich
  2022-02-08  9:00                                                         ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-08  8:53 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 07.02.2022 17:44, Oleksandr Andrushchenko wrote:
> 
> 
> On 07.02.22 18:37, Jan Beulich wrote:
>> On 07.02.2022 17:21, Oleksandr Andrushchenko wrote:
>>>
>>> On 07.02.22 18:15, Jan Beulich wrote:
>>>> On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
>>>>> On 07.02.22 17:26, Jan Beulich wrote:
>>>>>> 1b. Make vpci_write use write lock for writes to command register and BARs
>>>>>> only; keep using the read lock for all other writes.
>>>>> I am not quite sure how to do that. Do you mean something like:
>>>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>                    uint32_t data)
>>>>> [snip]
>>>>>        list_for_each_entry ( r, &pdev->vpci->handlers, node )
>>>>> {
>>>>> [snip]
>>>>>        if ( r->needs_write_lock)
>>>>>            write_lock(d->vpci_lock)
>>>>>        else
>>>>>            read_lock(d->vpci_lock)
>>>>> ....
>>>>>
>>>>> And provide rw as an argument to:
>>>>>
>>>>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>>>>                          vpci_write_t *write_handler, unsigned int offset,
>>>>>                          unsigned int size, void *data, --->>> bool write_path <<<-----)
>>>>>
>>>>> Is this what you mean?
>>>> This sounds overly complicated. You can derive locally in vpci_write(),
>>>> from just its "reg" and "size" parameters, whether the lock needs taking
>>>> in write mode.
>>> Yes, I started writing a reply with that. So, the summary (ROM
>>> position depends on header type):
>>> if ( (reg == PCI_COMMAND) || (reg == ROM) )
>>> {
>>>       read PCI_COMMAND and see if memory or IO decoding are enabled.
>>>       if ( enabled )
>>>           write_lock(d->vpci_lock)
>>>       else
>>>           read_lock(d->vpci_lock)
>>> }
>> Hmm, yes, you can actually get away without using "size", since both
>> command register and ROM BAR are 32-bit aligned registers, and 64-bit
>> accesses get split in vpci_ecam_write().
> But the OS may want to read a single byte of the ROM BAR, so I think
> I'll need to check whether reg+size falls into the PCI_COMMAND and ROM BAR
> ranges.
>>
>> For the command register the memory- / IO-decoding-enabled check may
>> end up a little more complicated, as the value to be written also
>> matters. Maybe read the command register only for the ROM BAR write,
>> using the write lock uniformly for all command register writes?
> Sounds good for the start.
> Another concern is that if we go with a read_lock and then in the
> underlying code we disable memory decoding and try doing
> something and calling cmd_write handler for any reason then....
> 
> I mean that the check in vpci_write is something we can tolerate,
> but then it must be ensured that no code in the read path
> is allowed to perform write path functions. Which brings a pretty
> valid use-case: say in read mode we detect an unrecoverable error
> and need to remove the device:
> vpci_process_pending -> ERROR -> vpci_remove_device or similar.
> 
> What do we do then? It is all going to be fragile...

Real hardware won't cause a device to disappear upon a problem with
a read access. There shouldn't be any need to remove a passed-through
device either; such problems (if any) need handling differently imo.

Jan
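
The lock-mode choice discussed in the quoted exchange above (derive it in
vpci_write() from "reg" and "size", use the write lock for every command
register write, and for ROM BAR writes only while memory decoding is enabled)
could look roughly like the following. This is a minimal self-contained
sketch, not Xen code: write_needs_write_lock(), overlaps() and the
type1_header / mem_decoding parameters are hypothetical names, and a real
implementation would read the decoding state from the command register
itself. Only the register offsets are the standard PCI ones.

```c
#include <assert.h>
#include <stdbool.h>

/* Standard PCI configuration space offsets. */
#define PCI_COMMAND        0x04  /* 16-bit command register */
#define PCI_ROM_ADDRESS    0x30  /* ROM BAR, type 0 header */
#define PCI_ROM_ADDRESS1   0x38  /* ROM BAR, type 1 (bridge) header */

/* True if [reg, reg + size) intersects [start, start + len). */
static bool overlaps(unsigned int reg, unsigned int size,
                     unsigned int start, unsigned int len)
{
    return reg < start + len && start < reg + size;
}

/*
 * Hypothetical helper: does a config space write of "size" bytes at "reg"
 * need the lock in write mode? Ranges are compared rather than exact
 * offsets, so sub-register (byte or word) accesses are covered too.
 */
static bool write_needs_write_lock(unsigned int reg, unsigned int size,
                                   bool type1_header, bool mem_decoding)
{
    unsigned int rom = type1_header ? PCI_ROM_ADDRESS1 : PCI_ROM_ADDRESS;

    assert(size == 1 || size == 2 || size == 4);

    /* Any write touching the command register takes the write lock. */
    if ( overlaps(reg, size, PCI_COMMAND, 2) )
        return true;

    /* ROM BAR writes only matter while memory decoding is enabled. */
    return mem_decoding && overlaps(reg, size, rom, 4);
}
```

A caller would take write_lock() when the helper returns true and read_lock()
otherwise. Comparing ranges instead of testing reg for equality is what
addresses the single-byte access concern raised in the quoted text.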




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-08  7:35                                                       ` Oleksandr Andrushchenko
@ 2022-02-08  8:57                                                         ` Jan Beulich
  2022-02-08  9:03                                                           ` Oleksandr Andrushchenko
  2022-02-08 10:50                                                         ` Roger Pau Monné
  1 sibling, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-08  8:57 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné

On 08.02.2022 08:35, Oleksandr Andrushchenko wrote:
> 1.1. Semi read lock upgrade in modify bars
> --------------------------------------------------------------
> In this case both vpci_read and vpci_write take a read lock and when it comes
> to modify_bars:
> 
> 1. read_unlock(d->vpci_lock)
> 2. write_lock(d->vpci_lock)
> 3. Check that pdev->vpci is still available and is the same object:
> if (pdev->vpci && (pdev->vpci == old_vpci) )
> {
>      /* vpci structure is valid and can be used. */
> }
> else
> {
>      /* vpci has gone, return an error. */
> }
> 
> Pros:
> - no per-device vpci lock is needed?
> - solves overlap code ABBA in modify_bars
> - readers and writers are NOT serialized
> - NO need to carefully select read paths, so they are guaranteed not to lead
>    to lock upgrade use-cases
> 
> Cons:
> - ???

The "pdev->vpci == old_vpci" is fragile: The struct may have got re-
allocated, and it just so happened that the two pointers are identical.

Same then for the subsequent variant 2.

Jan




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-08  8:53                                                       ` Jan Beulich
@ 2022-02-08  9:00                                                         ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08  9:00 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné,
	Oleksandr Andrushchenko



On 08.02.22 10:53, Jan Beulich wrote:
> On 07.02.2022 17:44, Oleksandr Andrushchenko wrote:
>>
>> On 07.02.22 18:37, Jan Beulich wrote:
>>> On 07.02.2022 17:21, Oleksandr Andrushchenko wrote:
>>>> On 07.02.22 18:15, Jan Beulich wrote:
>>>>> On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
>>>>>> On 07.02.22 17:26, Jan Beulich wrote:
>>>>>>> 1b. Make vpci_write use write lock for writes to command register and BARs
>>>>>>> only; keep using the read lock for all other writes.
>>>>>> I am not quite sure how to do that. Do you mean something like:
>>>>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>>                     uint32_t data)
>>>>>> [snip]
>>>>>>         list_for_each_entry ( r, &pdev->vpci->handlers, node )
>>>>>> {
>>>>>> [snip]
>>>>>>         if ( r->needs_write_lock)
>>>>>>             write_lock(d->vpci_lock)
>>>>>>         else
>>>>>>             read_lock(d->vpci_lock)
>>>>>> ....
>>>>>>
>>>>>> And provide rw as an argument to:
>>>>>>
>>>>>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>>>>>                           vpci_write_t *write_handler, unsigned int offset,
>>>>>>                           unsigned int size, void *data, --->>> bool write_path <<<-----)
>>>>>>
>>>>>> Is this what you mean?
>>>>> This sounds overly complicated. You can derive locally in vpci_write(),
>>>>> from just its "reg" and "size" parameters, whether the lock needs taking
>>>>> in write mode.
>>>> Yes, I started writing a reply with that. So, the summary (ROM
>>>> position depends on header type):
>>>> if ( (reg == PCI_COMMAND) || (reg == ROM) )
>>>> {
>>>>        read PCI_COMMAND and see if memory or IO decoding are enabled.
>>>>        if ( enabled )
>>>>            write_lock(d->vpci_lock)
>>>>        else
>>>>            read_lock(d->vpci_lock)
>>>> }
>>> Hmm, yes, you can actually get away without using "size", since both
>>> command register and ROM BAR are 32-bit aligned registers, and 64-bit
>>> accesses get split in vpci_ecam_write().
>> But the OS may want to read a single byte of the ROM BAR, so I think
>> I'll need to check whether reg+size falls into the PCI_COMMAND and ROM BAR
>> ranges.
>>> For the command register the memory- / IO-decoding-enabled check may
>>> end up a little more complicated, as the value to be written also
>>> matters. Maybe read the command register only for the ROM BAR write,
>>> using the write lock uniformly for all command register writes?
>> Sounds good for the start.
>> Another concern is that if we go with a read_lock and then in the
>> underlying code we disable memory decoding and try doing
>> something and calling cmd_write handler for any reason then....
>>
>> I mean that the check in vpci_write is something we can tolerate,
>> but then it must be ensured that no code in the read path
>> is allowed to perform write path functions. Which brings a pretty
>> valid use-case: say in read mode we detect an unrecoverable error
>> and need to remove the device:
>> vpci_process_pending -> ERROR -> vpci_remove_device or similar.
>>
>> What do we do then? It is all going to be fragile...
> Real hardware won't cause a device to disappear upon a problem with
> a read access. There shouldn't be any need to remove a passed-through
> device either; such problems (if any) need handling differently imo.
Yes, at the moment there is a single place in the code which
removes the device (besides normal use-cases such as
pci_add_device on fail path and PHYSDEVOP_manage_pci_remove):

bool vpci_process_pending(struct vcpu *v)
{
[snip]
         if ( rc )
             /*
              * FIXME: in case of failure remove the device from the domain.
              * Note that there might still be leftover mappings. While this is
              * safe for Dom0, for DomUs the domain will likely need to be
              * killed in order to avoid leaking stale p2m mappings on
              * failure.
              */
             vpci_remove_device(v->vpci.pdev);

>
> Jan
>
>


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-08  8:57                                                         ` Jan Beulich
@ 2022-02-08  9:03                                                           ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08  9:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, Roger Pau Monné,
	Oleksandr Andrushchenko



On 08.02.22 10:57, Jan Beulich wrote:
> On 08.02.2022 08:35, Oleksandr Andrushchenko wrote:
>> 1.1. Semi read lock upgrade in modify bars
>> --------------------------------------------------------------
>> In this case both vpci_read and vpci_write take a read lock and when it comes
>> to modify_bars:
>>
>> 1. read_unlock(d->vpci_lock)
>> 2. write_lock(d->vpci_lock)
>> 3. Check that pdev->vpci is still available and is the same object:
>> if (pdev->vpci && (pdev->vpci == old_vpci) )
>> {
>>       /* vpci structure is valid and can be used. */
>> }
>> else
>> {
>>       /* vpci has gone, return an error. */
>> }
>>
>> Pros:
>> - no per-device vpci lock is needed?
>> - solves overlap code ABBA in modify_bars
>> - readers and writers are NOT serialized
>> - NO need to carefully select read paths, so they are guaranteed not to lead
>>     to lock upgrade use-cases
>>
>> Cons:
>> - ???
> The "pdev->vpci == old_vpci" is fragile: The struct may have got re-
> allocated, and it just so happened that the two pointers are identical.
>
> Same then for the subsequent variant 2.
Yes, it is possible. We can add an ID number to pdev->vpci,
so each newly allocated vpci structure has a unique ID which can be used
to compare vpci structures. It can be something like
pdev->vpci->id = d->vpci_id++;
with id being uint32_t, for example.
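
Both the pointer-reuse concern and the proposed ID can be illustrated with a
small self-contained model. This is a sketch under stated assumptions, not
Xen code: the struct layouts are reduced to the fields needed here, the
model_ functions are hypothetical, and a one-slot pool stands in for the
allocator so that address reuse happens deterministically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced, hypothetical layouts: only the fields this sketch needs. */
struct vpci { uint32_t id; };
struct domain { uint32_t vpci_id; };
struct pci_dev { struct domain *domain; struct vpci *vpci; };

/* One-slot pool, so a free followed by an alloc reuses the same address. */
static struct vpci vpci_slot;

static void model_vpci_alloc(struct pci_dev *pdev)
{
    assert(pdev->vpci == NULL);
    pdev->vpci = &vpci_slot;
    /* Each newly allocated structure gets the next per-domain ID. */
    pdev->vpci->id = pdev->domain->vpci_id++;
}

static void model_vpci_free(struct pci_dev *pdev)
{
    pdev->vpci = NULL;
}

/* The fragile check: true even if the structure was freed and re-allocated. */
static bool same_by_pointer(const struct pci_dev *pdev, const struct vpci *old)
{
    return pdev->vpci && pdev->vpci == old;
}

/* The ID based check: detects a re-allocation at the same address. */
static bool same_by_id(const struct pci_dev *pdev, uint32_t old_id)
{
    return pdev->vpci && pdev->vpci->id == old_id;
}
```

After an alloc, free and alloc sequence the pointer comparison still
succeeds while the ID comparison fails. Note that a 32-bit counter can wrap,
so in this simple form the ID is unique only within 2^32 allocations per
domain.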
>
> Jan
>
>


* Re: [PATCH v6 04/13] vpci: restrict unhandled read/write operations for guests
  2022-02-08  8:00       ` Oleksandr Andrushchenko
@ 2022-02-08  9:04         ` Jan Beulich
  2022-02-08  9:09           ` Oleksandr Andrushchenko
  2022-02-08  9:05         ` Roger Pau Monné
  1 sibling, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-08  9:04 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau

On 08.02.2022 09:00, Oleksandr Andrushchenko wrote:
> On 04.02.22 16:24, Oleksandr Andrushchenko wrote:
>> On 04.02.22 16:11, Jan Beulich wrote:
>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>> A guest can read and write those registers which are not emulated and
>>>> have no respective vPCI handlers, so it can access the HW directly.
>>> I don't think this describes the present situation. Or did I miss where
>>> devices can actually be exposed to guests already, despite much of the
>>> support logic still missing?
>> No, they are not exposed yet and you know that.
>> I will update the commit message
> BTW, all this work is about adding vpci for guests and of course this
> is not going to be enabled right away.
> I would like to hear the common acceptable way of documenting such
> things: either we just say something like "A guest can read and write"
> elsewhere or we need to invent something neutral not directly mentioning
> what the change does. With the latter it all seems a bit confusing IMO
> as we do know what we are doing and for what reason: enable vpci for guests

What's the problem with describing things as they are? Code is hwdom-
only right now, and you're trying to enable DomU support. Hence it's
all about "would be able to", not "can".

Jan




* Re: [PATCH v6 04/13] vpci: restrict unhandled read/write operations for guests
  2022-02-08  8:00       ` Oleksandr Andrushchenko
  2022-02-08  9:04         ` Jan Beulich
@ 2022-02-08  9:05         ` Roger Pau Monné
  2022-02-08  9:10           ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-08  9:05 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel

On Tue, Feb 08, 2022 at 08:00:28AM +0000, Oleksandr Andrushchenko wrote:
> 
> On 04.02.22 16:24, Oleksandr Andrushchenko wrote:
> >
> > On 04.02.22 16:11, Jan Beulich wrote:
> >> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> >>> A guest can read and write those registers which are not emulated and
> >>> have no respective vPCI handlers, so it can access the HW directly.
> >> I don't think this describes the present situation. Or did I miss where
> >> devices can actually be exposed to guests already, despite much of the
> >> support logic still missing?
> > No, they are not exposed yet and you know that.
> > I will update the commit message
> BTW, all this work is about adding vpci for guests and of course this
> is not going to be enabled right away.
> I would like to hear the common acceptable way of documenting such
> things: either we just say something like "A guest can read and write"
> elsewhere or we need to invent something neutral not directly mentioning
> what the change does. With the latter it all seems a bit confusing IMO
> as we do know what we are doing and for what reason: enable vpci for guests
> >>> In order to prevent a guest from reads and writes from/to the unhandled
> >>> registers make sure only hardware domain can access HW directly and restrict
> >>> guests from doing so.
> >> Tangential question: Going over the titles of the remaining patches I
> >> notice patch 6 is going to deal with BAR accesses. But (going just
> >> from the titles) I can't spot anywhere that vendor and device IDs
> >> would be exposed to guests. Yet that's the first thing guests will need
> >> in order to actually recognize devices. As said before, allowing guests
> >> access to such r/o fields is quite likely going to be fine.
> > Agree, I was thinking about adding such a patch to allow IDs,
> > but finally decided not to add more to this series.
> > Again, the whole thing is not working yet and for the development
> > this patch can/needs to be reverted. So, whether or not we implement
> > IDs, this doesn't change anything in this respect.
> Roger, do you want an additional patch with IDs in v7?

I would expect a lot more work to be required, you need IDs and the
Header type as a minimum I would say. And then in order to have
something functional you will also need to handle the capabilities
pointer.

I'm fine for this to be added in a followup series. I think it's clear
the status after this series is not going to be functional.

> >>> --- a/xen/drivers/vpci/vpci.c
> >>> +++ b/xen/drivers/vpci/vpci.c
> >>> @@ -215,11 +215,15 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
> >>>    }
> >>>    
> >>>    /* Wrappers for performing reads/writes to the underlying hardware. */
> >>> -static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
> >>> +static uint32_t vpci_read_hw(bool is_hwdom, pci_sbdf_t sbdf, unsigned int reg,
> >>>                                 unsigned int size)
> >> Was the passing around of a boolean the consensus which was reached?
> > Was this patch committed yet?
> >> Personally I'd find it more natural if the two functions checked
> >> current->domain themselves.
> > This is also possible, but I would like to hear Roger's view on this as well
> > I am fine either way
> Roger, what's your maintainer's preference here? Additional argument
> to vpci_read_hw or make it use current->domain internally?

My recommendation would be to use current->domain. Handlers will
always be executed in guest context, so there's no need to pass a
parameter around.
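
As a rough illustration of that recommendation, the following self-contained
model lets the hardware read wrapper consult current->domain itself instead
of receiving an is_hwdom argument. It is a sketch, not Xen code: "current"
is modelled as a plain global, the struct layouts and the model_ functions
are hypothetical, and returning all-ones to guests is an assumption about
the intended behaviour for registers without a handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced, hypothetical layouts: only the fields this sketch needs. */
struct domain { bool is_hwdom; };
struct vcpu { struct domain *domain; };

/* Stand-in for Xen's "current" (the vCPU running on this CPU). */
static struct vcpu *current;

/* Stand-in for a real config space read; returns a recognisable pattern. */
static uint32_t model_pci_conf_read(unsigned int reg)
{
    return 0xabcd0000u | reg;
}

/* No is_hwdom parameter: the calling domain is taken from "current". */
static uint32_t model_vpci_read_hw(unsigned int reg)
{
    assert(current != NULL && current->domain != NULL);

    /* Assumption: guests get all-ones for registers without a handler. */
    if ( !current->domain->is_hwdom )
        return ~(uint32_t)0;

    return model_pci_conf_read(reg);
}
```

Since the handlers always run in the context of the accessing guest, the
signature of the wrapper stays as it was before the patch and no caller has
to be changed to thread a boolean through.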

Thanks, Roger.



* Re: [PATCH v6 04/13] vpci: restrict unhandled read/write operations for guests
  2022-02-08  9:04         ` Jan Beulich
@ 2022-02-08  9:09           ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08  9:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko, roger.pau



On 08.02.22 11:04, Jan Beulich wrote:
> On 08.02.2022 09:00, Oleksandr Andrushchenko wrote:
>> On 04.02.22 16:24, Oleksandr Andrushchenko wrote:
>>> On 04.02.22 16:11, Jan Beulich wrote:
>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>> A guest can read and write those registers which are not emulated and
>>>>> have no respective vPCI handlers, so it can access the HW directly.
>>>> I don't think this describes the present situation. Or did I miss where
>>>> devices can actually be exposed to guests already, despite much of the
>>>> support logic still missing?
>>> No, they are not exposed yet and you know that.
>>> I will update the commit message
>> BTW, all this work is about adding vpci for guests and of course this
>> is not going to be enabled right away.
>> I would like to hear the common acceptable way of documenting such
>> things: either we just say something like "A guest can read and write"
>> elsewhere or we need to invent something neutral not directly mentioning
>> what the change does. With the latter it all seems a bit confusing IMO
>> as we do know what we are doing and for what reason: enable vpci for guests
> What's the problem with describing things as they are? Code is hwdom-
> only right now, and you're trying to enable DomU support. Hence it's
> all about "would be able to", not "can".
Sounds good, will use that wording then
>
> Jan
>
>
Thank you,
Oleksandr


* Re: [PATCH v6 04/13] vpci: restrict unhandled read/write operations for guests
  2022-02-08  9:05         ` Roger Pau Monné
@ 2022-02-08  9:10           ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08  9:10 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 08.02.22 11:05, Roger Pau Monné wrote:
> On Tue, Feb 08, 2022 at 08:00:28AM +0000, Oleksandr Andrushchenko wrote:
>> On 04.02.22 16:24, Oleksandr Andrushchenko wrote:
>>> On 04.02.22 16:11, Jan Beulich wrote:
>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>> A guest can read and write those registers which are not emulated and
>>>>> have no respective vPCI handlers, so it can access the HW directly.
>>>> I don't think this describes the present situation. Or did I miss where
>>>> devices can actually be exposed to guests already, despite much of the
>>>> support logic still missing?
>>> No, they are not exposed yet and you know that.
>>> I will update the commit message
>> BTW, all this work is about adding vpci for guests and of course this
>> is not going to be enabled right away.
>> I would like to hear the common acceptable way of documenting such
>> things: either we just say something like "A guest can read and write"
>> elsewhere or we need to invent something neutral not directly mentioning
>> what the change does. With the latter it all seems a bit confusing IMO
>> as we do know what we are doing and for what reason: enable vpci for guests
>>>>> In order to prevent a guest from reads and writes from/to the unhandled
>>>>> registers make sure only hardware domain can access HW directly and restrict
>>>>> guests from doing so.
>>>> Tangential question: Going over the titles of the remaining patches I
>>>> notice patch 6 is going to deal with BAR accesses. But (going just
>>>> from the titles) I can't spot anywhere that vendor and device IDs
>>>> would be exposed to guests. Yet that's the first thing guests will need
>>>> in order to actually recognize devices. As said before, allowing guests
>>>> access to such r/o fields is quite likely going to be fine.
>>> Agree, I was thinking about adding such a patch to allow IDs,
>>> but finally decided not to add more to this series.
>>> Again, the whole thing is not working yet and for the development
>>> this patch can/needs to be reverted. So, whether or not we implement
>>> IDs, this doesn't change anything in this respect.
>> Roger, do you want an additional patch with IDs in v7?
> I would expect a lot more work to be required, you need IDs and the
> Header type as a minimum I would say. And then in order to have
> something functional you will also need to handle the capabilities
> pointer.
>
> I'm fine for this to be added in a followup series. I think it's clear
> the status after this series is not going to be functional.
OK, so let's first get something in and then we can extend guest support.
This can go in parallel with the other work on Arm, which is still waiting
for this series to be accepted.
>
>>>>> --- a/xen/drivers/vpci/vpci.c
>>>>> +++ b/xen/drivers/vpci/vpci.c
>>>>> @@ -215,11 +215,15 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>>>>>     }
>>>>>     
>>>>>     /* Wrappers for performing reads/writes to the underlying hardware. */
>>>>> -static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
>>>>> +static uint32_t vpci_read_hw(bool is_hwdom, pci_sbdf_t sbdf, unsigned int reg,
>>>>>                                  unsigned int size)
>>>> Was the passing around of a boolean the consensus which was reached?
>>> Was this patch committed yet?
>>>> Personally I'd find it more natural if the two functions checked
>>>> current->domain themselves.
>>> This is also possible, but I would like to hear Roger's view on this as well
>>> I am fine either way
>> Roger, what's your maintainer's preference here? Additional argument
>> to vpci_read_hw or make it use current->domain internally?
> My recommendation would be to use current->domain. Handlers will
> always be executed in guest context, so there's no need to pass a
> parameter around.
ok, I'll use current->domain
>
> Thanks, Roger.
>
Thank you,
Oleksandr


* Re: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-08  8:32     ` Oleksandr Andrushchenko
@ 2022-02-08  9:13       ` Jan Beulich
  2022-02-08  9:27         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-08  9:13 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
> On 07.02.22 18:28, Jan Beulich wrote:
>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>                           pci_to_dev(pdev), flag);
>>>       }
>>>   
>>> +    rc = vpci_assign_device(d, pdev);
>>> +
>>>    done:
>>>       if ( rc )
>>>           printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>> There's no attempt to undo anything in the case of getting back an
>> error. ISTR this being deemed okay on the basis that the tool stack
>> would then take whatever action, but whatever it is that is supposed
>> to deal with errors here wants spelling out in the description.
> Why? I don't change the previously expected decision and implementation
> of the assign_device function: I use error paths as they were used before
> for the existing code. So, I see no clear reason to stress that the existing
> and new code relies on the toolstack

Saying half a sentence on this is helping review.

>> What's important is that no caller up the call tree may be left with
>> the impression that the device is still owned by the original
>> domain. With how you have it, the device is going to be owned by the
>> new domain, but not really usable.
> This is not true: vpci_assign_device will call vpci_deassign_device
> internally if it fails. So, the device won't be assigned in this case

No. The device is assigned to whatever pdev->domain holds. Calling
vpci_deassign_device() there merely makes sure that the device will
have _no_ vPCI data and hooks in place, rather than something
partial.

>>> --- a/xen/drivers/vpci/vpci.c
>>> +++ b/xen/drivers/vpci/vpci.c
>>> @@ -99,6 +99,33 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>>   
>>>       return rc;
>>>   }
>>> +
>>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>>> +/* Notify vPCI that device is assigned to guest. */
>>> +int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
>>> +{
>>> +    int rc;
>>> +
>>> +    if ( !has_vpci(d) )
>>> +        return 0;
>>> +
>>> +    rc = vpci_add_handlers(pdev);
>>> +    if ( rc )
>>> +        vpci_deassign_device(d, pdev);
>>> +
>>> +    return rc;
>>> +}
>>> +
>>> +/* Notify vPCI that device is de-assigned from guest. */
>>> +void vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
>>> +{
>>> +    if ( !has_vpci(d) )
>>> +        return;
>>> +
>>> +    vpci_remove_device(pdev);
>>> +}
>>> +#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
>> While for the latter function you look to need two parameters, do you
>> really need them also in the former one?
> Do you mean instead of passing d we could just use pdev->domain?
> int vpci_assign_device(struct pci_dev *pdev)
> +{
> +    int rc;
> +
> +    if ( !has_vpci(pdev->domain) )
> +        return 0;

Yes.

> Yes, we probably can, but the rest of functions called from assign_device
> are accepting both d and pdev, so not sure why would we want these
> two be any different. Any good reason not to change others as well then?

Yes: Prior to the call of the ->assign_device() hook, d != pdev->domain.
It is the _purpose_ of this function to change ownership of the device.
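
The ordering under discussion can be sketched as a self-contained C model
(not Xen code). The struct layouts, the model_ prefixed functions and the
add_handlers_rc parameter, which stands in for the result of
vpci_add_handlers(), are assumptions made for illustration. The de-assign
hook runs while pdev->domain still names the old owner, ownership then
changes, and the assign hook works from pdev->domain alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced, hypothetical layouts: only the fields this sketch needs. */
struct domain { bool has_vpci; };
struct pci_dev { struct domain *domain; bool vpci_present; };

/* De-assign hook: only needs pdev, as pdev->domain is the current owner. */
static void model_vpci_deassign_device(struct pci_dev *pdev)
{
    assert(pdev->domain != NULL);
    if ( !pdev->domain->has_vpci )
        return;

    pdev->vpci_present = false;
}

/* Assign hook: called after the ownership change, so pdev->domain is new. */
static int model_vpci_assign_device(struct pci_dev *pdev, int add_handlers_rc)
{
    if ( !pdev->domain->has_vpci )
        return 0;

    if ( add_handlers_rc )
    {
        /* Drops partial vPCI state only; ownership is left unchanged. */
        model_vpci_deassign_device(pdev);
        return add_handlers_rc;
    }

    pdev->vpci_present = true;
    return 0;
}

static int model_assign_device(struct domain *d, struct pci_dev *pdev,
                               int add_handlers_rc)
{
    /* pdev->domain still holds the original owner at this point. */
    model_vpci_deassign_device(pdev);

    /* Ownership change; in Xen this is the job of ->assign_device(). */
    pdev->domain = d;

    return model_vpci_assign_device(pdev, add_handlers_rc);
}
```

With a non-zero add_handlers_rc the model ends with pdev->domain pointing at
the new domain and no vPCI state present, which matches the situation
described earlier in this thread: the device is owned by the new domain but
has no vPCI data or hooks in place.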

Jan




* Re: [PATCH v6 06/13] vpci/header: implement guest BAR register handlers
  2022-02-08  8:06     ` Oleksandr Andrushchenko
@ 2022-02-08  9:16       ` Jan Beulich
  2022-02-08  9:29         ` Roger Pau Monné
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-08  9:16 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 08.02.2022 09:06, Oleksandr Andrushchenko wrote:
> 
> 
> On 07.02.22 19:06, Jan Beulich wrote:
>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>> +static uint32_t guest_bar_ignore_read(const struct pci_dev *pdev,
>>> +                                      unsigned int reg, void *data)
>>> +{
>>> +    return 0;
>>> +}
>>> +
>>> +static int bar_ignore_access(const struct pci_dev *pdev, unsigned int reg,
>>> +                             struct vpci_bar *bar)
>>> +{
>>> +    if ( is_hardware_domain(pdev->domain) )
>>> +        return 0;
>>> +
>>> +    return vpci_add_register(pdev->vpci, guest_bar_ignore_read, NULL,
>>> +                             reg, 4, bar);
>>> +}
>> For these two functions: I'm not sure "ignore" is an appropriate
>> term here. unused_bar_read() and unused_bar() maybe? Or,
>> considering we already have VPCI_BAR_EMPTY, s/unused/empty/ ? I'm
>> also not sure we really need the is_hardware_domain() check here:
>> Returning 0 for Dom0 is going to be fine as well; there's no need
>> to fetch the value from actual hardware. The one exception might
>> be for devices with buggy BAR behavior ...
> Well, I think this should be ok, so then
> - s/guest_bar_ignore_read/empty_bar_read
> - s/bar_ignore_access/empty_bar

Hmm, seeing it, I don't think empty_bar() is a good function name.
setup_empty_bar() or empty_bar_setup() would make more clear what
the function's purpose is.

> - no is_hardware_domain check

Please wait a little to see whether Roger has any input on this aspect.

Jan



^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 06/13] vpci/header: implement guest BAR register handlers
  2022-02-04  6:34 ` [PATCH v6 06/13] vpci/header: implement guest BAR register handlers Oleksandr Andrushchenko
  2022-02-07 17:06   ` Jan Beulich
@ 2022-02-08  9:25   ` Roger Pau Monné
  2022-02-08  9:31     ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-08  9:25 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: xen-devel, julien, sstabellini, oleksandr_tyshchenko,
	volodymyr_babchuk, artem_mygaiev, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

On Fri, Feb 04, 2022 at 08:34:52AM +0200, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Add relevant vpci register handlers when assigning PCI device to a domain
> and remove those when de-assigning. This allows having different
> handlers for different domains, e.g. hwdom and other guests.
> 
> Emulate guest BAR register values: this allows creating a guest view
> of the registers and emulates size and properties probe as it is done
> during PCI device enumeration by the guest.
> 
> All empty, IO and ROM BARs for guests are emulated by returning 0 on
> reads and ignoring writes: these BARs are special in this respect as
> their lower bits have special meaning, so returning default ~0 on read
> may confuse guest OS.
> 
> Memory decoding is initially disabled when used by guests in order to
> prevent the BAR being placed on top of a RAM region.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v5:
> - make sure that the guest set address has the same page offset
>   as the physical address on the host
> - remove guest_rom_{read|write} as those just implement the default
>   behaviour of the registers not being handled
> - adjusted comment for struct vpci.addr field
> - add guest handlers for BARs which are not handled and will otherwise
>   return ~0 on read and ignore writes. The BARs are special with this
>   respect as their lower bits have special meaning, so returning ~0
>   doesn't seem to be right
> Since v4:
> - updated commit message
> - s/guest_addr/guest_reg
> Since v3:
> - squashed two patches: dynamic add/remove handlers and guest BAR
>   handler implementation
> - fix guest BAR read of the high part of a 64bit BAR (Roger)
> - add error handling to vpci_assign_device
> - s/dom%pd/%pd
> - blank line before return
> Since v2:
> - remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
>   has been eliminated from being built on x86
> Since v1:
>  - constify struct pci_dev where possible
>  - do not open code is_system_domain()
>  - simplify some code
>  - use gdprintk + error code instead of gprintk
>  - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
>    so these do not get compiled for x86
>  - removed unneeded is_system_domain check
>  - re-work guest read/write to be much simpler and do more work on write
>    than read which is expected to be called more frequently
>  - removed one too obvious comment
> ---
>  xen/drivers/vpci/header.c | 131 +++++++++++++++++++++++++++++++++-----
>  xen/include/xen/vpci.h    |   3 +
>  2 files changed, 118 insertions(+), 16 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index bd23c0274d48..2620a95ff35b 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -406,6 +406,81 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>      pci_conf_write32(pdev->sbdf, reg, val);
>  }
>  
> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
> +                            uint32_t val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    bool hi = false;
> +    uint64_t guest_reg = bar->guest_reg;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +    else
> +    {
> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> +    }
> +
> +    guest_reg &= ~(0xffffffffull << (hi ? 32 : 0));
> +    guest_reg |= (uint64_t)val << (hi ? 32 : 0);
> +
> +    guest_reg &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
> +
> +    /*
> +     * Make sure that the guest set address has the same page offset
> +     * as the physical address on the host or otherwise things won't work as
> +     * expected.
> +     */
> +    if ( (guest_reg & (~PAGE_MASK & PCI_BASE_ADDRESS_MEM_MASK)) !=
> +         (bar->addr & ~PAGE_MASK) )

This is only required when !hi, but I'm fine with doing it
unconditionally as it's clearer.

> +    {
> +        gprintk(XENLOG_WARNING,
> +                "%pp: ignored BAR %zu write with wrong page offset\n",

"%pp: ignored BAR %zu write attempting to change page offset\n"

> +                &pdev->sbdf, bar - pdev->vpci->header.bars + hi);
> +        return;
> +    }
> +
> +    bar->guest_reg = guest_reg;
> +}
> +
> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
> +                               void *data)
> +{
> +    const struct vpci_bar *bar = data;
> +    bool hi = false;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +
> +    return bar->guest_reg >> (hi ? 32 : 0);
> +}
> +
> +static uint32_t guest_bar_ignore_read(const struct pci_dev *pdev,
> +                                      unsigned int reg, void *data)
> +{
> +    return 0;
> +}
> +
> +static int bar_ignore_access(const struct pci_dev *pdev, unsigned int reg,
> +                             struct vpci_bar *bar)
> +{
> +    if ( is_hardware_domain(pdev->domain) )
> +        return 0;
> +
> +    return vpci_add_register(pdev->vpci, guest_bar_ignore_read, NULL,
> +                             reg, 4, bar);
> +}
> +
>  static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>                        uint32_t val, void *data)
>  {
> @@ -462,6 +537,7 @@ static int init_bars(struct pci_dev *pdev)
>      struct vpci_header *header = &pdev->vpci->header;
>      struct vpci_bar *bars = header->bars;
>      int rc;
> +    bool is_hwdom = is_hardware_domain(pdev->domain);
>  
>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>      {
> @@ -501,8 +577,10 @@ static int init_bars(struct pci_dev *pdev)
>          if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
>          {
>              bars[i].type = VPCI_BAR_MEM64_HI;
> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
> -                                   4, &bars[i]);
> +            rc = vpci_add_register(pdev->vpci,
> +                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
> +                                   is_hwdom ? bar_write : guest_bar_write,
> +                                   reg, 4, &bars[i]);
>              if ( rc )
>              {
>                  pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> @@ -516,6 +594,11 @@ static int init_bars(struct pci_dev *pdev)
>          if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
>          {
>              bars[i].type = VPCI_BAR_IO;
> +
> +            rc = bar_ignore_access(pdev, reg, &bars[i]);

This is wrong: you only want to ignore accesses to IO BARs on Arm; for
x86 we should keep the previous behavior. Even more so if you go with
Jan's suggestion to make bar_ignore_access also applicable to dom0.

> +            if ( rc )
> +                return rc;
> +
>              continue;
>          }
>          if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> @@ -535,6 +618,11 @@ static int init_bars(struct pci_dev *pdev)
>          if ( size == 0 )
>          {
>              bars[i].type = VPCI_BAR_EMPTY;
> +
> +            rc = bar_ignore_access(pdev, reg, &bars[i]);
> +            if ( rc )
> +                return rc;

I would be fine with just calling vpci_add_register here, i.e.:

if ( !is_hwdom )
{
    rc = vpci_add_register(pdev->vpci, guest_bar_ignore_read, NULL,
                           reg, 4, &bars[i]);
    if ( rc )
    {
        ...
    }
}

Feel free to unify the writing of the PCI_COMMAND register on the
error path into a label, as then the error case would simply be a
`goto error;`

Thanks, Roger.
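
For readers following the guest_bar_write() review above, the register
update can be sketched outside of Xen as below. This is only a minimal
sketch: every SKETCH_* constant, struct sketch_bar and both functions are
hypothetical stand-ins (a 4 KiB page is assumed), not the Xen definitions.
It mirrors the quoted logic: flag bits are forced on the low half, the
selected 32-bit half is replaced, the result is aligned to the BAR size,
and a write that would change the page offset is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for PAGE_MASK and the PCI_BASE_ADDRESS_* values. */
#define SKETCH_PAGE_SIZE     4096ULL
#define SKETCH_PAGE_MASK     (~(SKETCH_PAGE_SIZE - 1))
#define SKETCH_MEM_MASK      (~0x0fULL)  /* address bits of a memory BAR */
#define SKETCH_MEM_TYPE_64   0x04U
#define SKETCH_MEM_PREFETCH  0x08U

/* Hypothetical, reduced view of a BAR; one object covers both halves. */
struct sketch_bar {
    uint64_t addr;       /* physical BAR address on the host */
    uint64_t size;       /* BAR size, must be a power of two */
    uint64_t guest_reg;  /* guest view of the (up to) 64-bit register */
    bool is_64bit;
    bool prefetchable;
};

/* Returns true if the write was stored, false if it was ignored. */
static bool sketch_guest_bar_write(struct sketch_bar *bar, bool hi,
                                   uint32_t val)
{
    uint64_t guest_reg = bar->guest_reg;

    assert(bar->size && !(bar->size & (bar->size - 1)));

    if ( !hi )
    {
        /* The guest cannot change the read-only flag bits. */
        val &= (uint32_t)SKETCH_MEM_MASK;
        val |= bar->is_64bit ? SKETCH_MEM_TYPE_64 : 0;
        val |= bar->prefetchable ? SKETCH_MEM_PREFETCH : 0;
    }

    /* Replace only the 32-bit half that is being written. */
    guest_reg &= ~(0xffffffffULL << (hi ? 32 : 0));
    guest_reg |= (uint64_t)val << (hi ? 32 : 0);

    /* Align the address to the BAR size, keeping the flag bits. */
    guest_reg &= ~(bar->size - 1) | ~SKETCH_MEM_MASK;

    /* Drop the write if it would change the page offset. */
    if ( (guest_reg & (~SKETCH_PAGE_MASK & SKETCH_MEM_MASK)) !=
         (bar->addr & ~SKETCH_PAGE_MASK) )
        return false;

    bar->guest_reg = guest_reg;
    return true;
}

static uint32_t sketch_guest_bar_read(const struct sketch_bar *bar, bool hi)
{
    return (uint32_t)(bar->guest_reg >> (hi ? 32 : 0));
}
```

With a 16 KiB 64-bit BAR, an all-ones write to the low half reads back as
0xffffc004 in this sketch, which is the value a guest would use to derive
the size during enumeration.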


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-08  9:13       ` Jan Beulich
@ 2022-02-08  9:27         ` Oleksandr Andrushchenko
  2022-02-08  9:44           ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08  9:27 UTC (permalink / raw)
  To: Jan Beulich, roger.pau
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 08.02.22 11:13, Jan Beulich wrote:
> On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
>> On 07.02.22 18:28, Jan Beulich wrote:
>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>                            pci_to_dev(pdev), flag);
>>>>        }
>>>>    
>>>> +    rc = vpci_assign_device(d, pdev);
>>>> +
>>>>     done:
>>>>        if ( rc )
>>>>            printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>> There's no attempt to undo anything in the case of getting back an
>>> error. ISTR this being deemed okay on the basis that the tool stack
>>> would then take whatever action, but whatever it is that is supposed
>>> to deal with errors here wants spelling out in the description.
>> Why? I don't change the previously expected decision and implementation
>> of the assign_device function: I use error paths as they were used before
>> for the existing code. So, I see no clear reason to stress that the existing
>> and new code relies on the toolstack
> Saying half a sentence on this is helping review.
Ok
>
>>> What's important is that no caller up the call tree may be left with
>>> the impression that the device is still owned by the original
>>> domain. With how you have it, the device is going to be owned by the
>>> new domain, but not really usable.
>> This is not true: vpci_assign_device will call vpci_deassign_device
>> internally if it fails. So, the device won't be assigned in this case
> No. The device is assigned to whatever pdev->domain holds. Calling
> vpci_deassign_device() there merely makes sure that the device will
> have _no_ vPCI data and hooks in place, rather than something
> partial.
So, this patch only deals with vpci assign/de-assign, and it rolls back
what it did in case of a failure. It also returns rc from assign_device
to signal that it has failed.
What else is expected from this patch?
>
>>>> --- a/xen/drivers/vpci/vpci.c
>>>> +++ b/xen/drivers/vpci/vpci.c
>>>> @@ -99,6 +99,33 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>>>    
>>>>        return rc;
>>>>    }
>>>> +
>>>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>>>> +/* Notify vPCI that device is assigned to guest. */
>>>> +int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
>>>> +{
>>>> +    int rc;
>>>> +
>>>> +    if ( !has_vpci(d) )
>>>> +        return 0;
>>>> +
>>>> +    rc = vpci_add_handlers(pdev);
>>>> +    if ( rc )
>>>> +        vpci_deassign_device(d, pdev);
>>>> +
>>>> +    return rc;
>>>> +}
>>>> +
>>>> +/* Notify vPCI that device is de-assigned from guest. */
>>>> +void vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
>>>> +{
>>>> +    if ( !has_vpci(d) )
>>>> +        return;
>>>> +
>>>> +    vpci_remove_device(pdev);
>>>> +}
>>>> +#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
>>> While for the latter function you look to need two parameters, do you
>>> really need them also in the former one?
>> Do you mean instead of passing d we could just use pdev->domain?
>> int vpci_assign_device(struct pci_dev *pdev)
>> +{
>> +    int rc;
>> +
>> +    if ( !has_vpci(pdev->domain) )
>> +        return 0;
> Yes.
>
>> Yes, we probably can, but the rest of functions called from assign_device
>> are accepting both d and pdev, so not sure why would we want these
>> two be any different. Any good reason not to change others as well then?
> Yes: Prior to the call of the ->assign_device() hook, d != pdev->domain.
> It is the _purpose_ of this function to change ownership of the device.
This can be done and makes sense.
@Roger which way do you want this?
>
> Jan
>
Thank you,
Oleksandr
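
The disagreement above is about what state a device is left in when the
vPCI step fails. A minimal sketch of that state, assuming the ownership
change has already happened by the time the vPCI hook runs (as Jan
describes), could look as follows. All sketch_* names, the fail_add_handlers
knob and the -1 error value are hypothetical; this is not the Xen
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for struct domain and struct pci_dev. */
struct sketch_domain {
    int id;
    bool has_vpci;
};

struct sketch_pdev {
    struct sketch_domain *domain;  /* current owner of the device */
    bool vpci_present;             /* vPCI data and handlers installed */
    bool fail_add_handlers;        /* test knob: force a setup failure */
};

static int sketch_add_handlers(struct sketch_pdev *pdev)
{
    /* Setup may leave partial state behind before reporting an error. */
    pdev->vpci_present = true;
    return pdev->fail_add_handlers ? -1 : 0;
}

static void sketch_remove_device(struct sketch_pdev *pdev)
{
    pdev->vpci_present = false;
}

/* Single-parameter variant: the owner is taken from pdev->domain. */
static int sketch_vpci_assign_device(struct sketch_pdev *pdev)
{
    int rc;

    assert(pdev != NULL && pdev->domain != NULL);

    if ( !pdev->domain->has_vpci )
        return 0;

    rc = sketch_add_handlers(pdev);
    if ( rc )
        sketch_remove_device(pdev);  /* undoes vPCI state only */

    return rc;
}

static int sketch_assign_device(struct sketch_domain *d,
                                struct sketch_pdev *pdev)
{
    /* Stands in for the ->assign_device() hook changing ownership. */
    pdev->domain = d;

    return sketch_vpci_assign_device(pdev);
}
```

In this sketch a failing sketch_assign_device() returns an error while
pdev->domain already points at the new domain and no vPCI state is left,
which is the "owned by the new domain, but not really usable" situation
raised in the review.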

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 06/13] vpci/header: implement guest BAR register handlers
  2022-02-08  9:16       ` Jan Beulich
@ 2022-02-08  9:29         ` Roger Pau Monné
  0 siblings, 0 replies; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-08  9:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, julien, sstabellini,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel

On Tue, Feb 08, 2022 at 10:16:59AM +0100, Jan Beulich wrote:
> On 08.02.2022 09:06, Oleksandr Andrushchenko wrote:
> > 
> > 
> > On 07.02.22 19:06, Jan Beulich wrote:
> >> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> >>> +static uint32_t guest_bar_ignore_read(const struct pci_dev *pdev,
> >>> +                                      unsigned int reg, void *data)
> >>> +{
> >>> +    return 0;
> >>> +}
> >>> +
> >>> +static int bar_ignore_access(const struct pci_dev *pdev, unsigned int reg,
> >>> +                             struct vpci_bar *bar)
> >>> +{
> >>> +    if ( is_hardware_domain(pdev->domain) )
> >>> +        return 0;
> >>> +
> >>> +    return vpci_add_register(pdev->vpci, guest_bar_ignore_read, NULL,
> >>> +                             reg, 4, bar);
> >>> +}
> >> For these two functions: I'm not sure "ignore" is an appropriate
> >> term here. unused_bar_read() and unused_bar() maybe? Or,
> >> considering we already have VPCI_BAR_EMPTY, s/unused/empty/ ? I'm
> >> also not sure we really need the is_hardware_domain() check here:
> >> Returning 0 for Dom0 is going to be fine as well; there's no need
> >> to fetch the value from actual hardware. The one exception might
> >> be for devices with buggy BAR behavior ...
> > Well, I think this should be ok, so then
> > - s/guest_bar_ignore_read/empty_bar_read
> > - s/bar_ignore_access/empty_bar
> 
> Hmm, seeing it, I don't think empty_bar() is a good function name.
> setup_empty_bar() or empty_bar_setup() would make more clear what
> the function's purpose is.

I don't think you require an empty_bar_setup helper; the code there is
trivial and can be open coded in init_bars directly IMO.

> 
> > - no is_hardware_domain check
> 
> Please wait a little to see whether Roger has any input on this aspect.

I think for the hw domain we should allow access to the BAR even if Xen
has found it empty. Adding the ignore handlers for dom0 shouldn't make
any difference, but we never know whether some quirky hardware could
make use of that.

Thanks, Roger.
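
To illustrate why the patch registers a read handler for empty BARs at
all, here is a minimal sketch of a guest config-space read with and without
such a handler. The dispatch table and all sketch_* names are hypothetical
and much simpler than the real vPCI register list; the only facts taken
from the thread are that unhandled guest reads return ~0 and that the
proposed handler returns 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*sketch_read_fn)(unsigned int reg);

/* Hypothetical per-register handler entry. */
struct sketch_reg_handler {
    unsigned int reg;
    sketch_read_fn read;
};

/* Handler for an empty BAR: always reads as 0. */
static uint32_t sketch_empty_bar_read(unsigned int reg)
{
    (void)reg;
    return 0;
}

/* Guest read: use a registered handler, else fall back to all ones. */
static uint32_t sketch_guest_cfg_read(const struct sketch_reg_handler *h,
                                      size_t n, unsigned int reg)
{
    size_t i;

    assert(h != NULL || n == 0);

    for ( i = 0; i < n; i++ )
        if ( h[i].reg == reg )
            return h[i].read(reg);

    return ~(uint32_t)0;
}

/* Bit 0 of a BAR set means "I/O space" to the guest. */
static bool sketch_reads_as_io_bar(uint32_t val)
{
    return (val & 0x1U) != 0;
}
```

Without the handler, the fallback value has bit 0 set, so a guest probing
the register would see something that looks like an I/O BAR rather than an
unimplemented one.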


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 06/13] vpci/header: implement guest BAR register handlers
  2022-02-08  9:25   ` Roger Pau Monné
@ 2022-02-08  9:31     ` Oleksandr Andrushchenko
  2022-02-08  9:48       ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08  9:31 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, jbeulich, andrew.cooper3,
	george.dunlap, paul, Bertrand Marquis, Rahul Singh,
	Oleksandr Andrushchenko



On 08.02.22 11:25, Roger Pau Monné wrote:
> On Fri, Feb 04, 2022 at 08:34:52AM +0200, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Add relevant vpci register handlers when assigning PCI device to a domain
>> and remove those when de-assigning. This allows having different
>> handlers for different domains, e.g. hwdom and other guests.
>>
>> Emulate guest BAR register values: this allows creating a guest view
>> of the registers and emulates size and properties probe as it is done
>> during PCI device enumeration by the guest.
>>
>> All empty, IO and ROM BARs for guests are emulated by returning 0 on
>> reads and ignoring writes: these BARs are special in this respect as
>> their lower bits have special meaning, so returning default ~0 on read
>> may confuse guest OS.
>>
>> Memory decoding is initially disabled when used by guests in order to
>> prevent the BAR being placed on top of a RAM region.
>>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> ---
>> Since v5:
>> - make sure that the guest set address has the same page offset
>>    as the physical address on the host
>> - remove guest_rom_{read|write} as those just implement the default
>>    behaviour of the registers not being handled
>> - adjusted comment for struct vpci.addr field
>> - add guest handlers for BARs which are not handled and will otherwise
>>    return ~0 on read and ignore writes. The BARs are special with this
>>    respect as their lower bits have special meaning, so returning ~0
>>    doesn't seem to be right
>> Since v4:
>> - updated commit message
>> - s/guest_addr/guest_reg
>> Since v3:
>> - squashed two patches: dynamic add/remove handlers and guest BAR
>>    handler implementation
>> - fix guest BAR read of the high part of a 64bit BAR (Roger)
>> - add error handling to vpci_assign_device
>> - s/dom%pd/%pd
>> - blank line before return
>> Since v2:
>> - remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
>>    has been eliminated from being built on x86
>> Since v1:
>>   - constify struct pci_dev where possible
>>   - do not open code is_system_domain()
>>   - simplify some code
>>   - use gdprintk + error code instead of gprintk
>>   - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
>>     so these do not get compiled for x86
>>   - removed unneeded is_system_domain check
>>   - re-work guest read/write to be much simpler and do more work on write
>>     than read which is expected to be called more frequently
>>   - removed one too obvious comment
>> ---
>>   xen/drivers/vpci/header.c | 131 +++++++++++++++++++++++++++++++++-----
>>   xen/include/xen/vpci.h    |   3 +
>>   2 files changed, 118 insertions(+), 16 deletions(-)
>>
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index bd23c0274d48..2620a95ff35b 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -406,6 +406,81 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>       pci_conf_write32(pdev->sbdf, reg, val);
>>   }
>>   
>> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>> +                            uint32_t val, void *data)
>> +{
>> +    struct vpci_bar *bar = data;
>> +    bool hi = false;
>> +    uint64_t guest_reg = bar->guest_reg;
>> +
>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>> +    {
>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>> +        bar--;
>> +        hi = true;
>> +    }
>> +    else
>> +    {
>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
>> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>> +    }
>> +
>> +    guest_reg &= ~(0xffffffffull << (hi ? 32 : 0));
>> +    guest_reg |= (uint64_t)val << (hi ? 32 : 0);
>> +
>> +    guest_reg &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
>> +
>> +    /*
>> +     * Make sure that the guest set address has the same page offset
>> +     * as the physical address on the host or otherwise things won't work as
>> +     * expected.
>> +     */
>> +    if ( (guest_reg & (~PAGE_MASK & PCI_BASE_ADDRESS_MEM_MASK)) !=
>> +         (bar->addr & ~PAGE_MASK) )
> This is only required when !hi, but I'm fine with doing it
> unconditionally as it's clearer.
This is correct wrt hi
>
>> +    {
>> +        gprintk(XENLOG_WARNING,
>> +                "%pp: ignored BAR %zu write with wrong page offset\n",
> "%pp: ignored BAR %zu write attempting to change page offset\n"
Ok
>
>> +                &pdev->sbdf, bar - pdev->vpci->header.bars + hi);
>> +        return;
>> +    }
>> +
>> +    bar->guest_reg = guest_reg;
>> +}
>> +
>> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>> +                               void *data)
>> +{
>> +    const struct vpci_bar *bar = data;
>> +    bool hi = false;
>> +
>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>> +    {
>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>> +        bar--;
>> +        hi = true;
>> +    }
>> +
>> +    return bar->guest_reg >> (hi ? 32 : 0);
>> +}
>> +
>> +static uint32_t guest_bar_ignore_read(const struct pci_dev *pdev,
>> +                                      unsigned int reg, void *data)
>> +{
>> +    return 0;
>> +}
>> +
>> +static int bar_ignore_access(const struct pci_dev *pdev, unsigned int reg,
>> +                             struct vpci_bar *bar)
>> +{
>> +    if ( is_hardware_domain(pdev->domain) )
>> +        return 0;
>> +
>> +    return vpci_add_register(pdev->vpci, guest_bar_ignore_read, NULL,
>> +                             reg, 4, bar);
>> +}
>> +
>>   static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>>                         uint32_t val, void *data)
>>   {
>> @@ -462,6 +537,7 @@ static int init_bars(struct pci_dev *pdev)
>>       struct vpci_header *header = &pdev->vpci->header;
>>       struct vpci_bar *bars = header->bars;
>>       int rc;
>> +    bool is_hwdom = is_hardware_domain(pdev->domain);
>>   
>>       switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>>       {
>> @@ -501,8 +577,10 @@ static int init_bars(struct pci_dev *pdev)
>>           if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
>>           {
>>               bars[i].type = VPCI_BAR_MEM64_HI;
>> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
>> -                                   4, &bars[i]);
>> +            rc = vpci_add_register(pdev->vpci,
>> +                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
>> +                                   is_hwdom ? bar_write : guest_bar_write,
>> +                                   reg, 4, &bars[i]);
>>               if ( rc )
>>               {
>>                   pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
>> @@ -516,6 +594,11 @@ static int init_bars(struct pci_dev *pdev)
>>           if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
>>           {
>>               bars[i].type = VPCI_BAR_IO;
>> +
>> +            rc = bar_ignore_access(pdev, reg, &bars[i]);
> This is wrong: you only want to ignore access to IO BARs for Arm, for
> x86 we should keep the previous behavior. Even more if you go with
> Jan's suggestions to make bar_ignore_access also applicable to dom0.
How do we want this?
#ifdef CONFIG_ARM?
>
>> +            if ( rc )
>> +                return rc;
>> +
>>               continue;
>>           }
>>           if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
>> @@ -535,6 +618,11 @@ static int init_bars(struct pci_dev *pdev)
>>           if ( size == 0 )
>>           {
>>               bars[i].type = VPCI_BAR_EMPTY;
>> +
>> +            rc = bar_ignore_access(pdev, reg, &bars[i]);
>> +            if ( rc )
>> +                return rc;
> I would be fine to just call vpci_add_register here, ie;
>
> if ( !is_hwdom )
> {
>      rc = vpci_add_register(pdev->vpci, guest_bar_ignore_read, NULL,
>                             reg, 4, &bars[i]);
>       if ( rc )
>       {
>           ...
>       }
> }
But we have 3 places where we do the same and also handle errors
the same way. I was thinking that having a helper would make the code
clearer. Do you want to open code all the uses?
> Feel free to unify the writing of the PCI_COMMAND register on the
> error path into a label, as then the error case would simply be a
> `goto error;`
I was thinking about it. Will it be ok to make this change in this patch
or do you want a dedicated one for that?
> Thanks, Roger.
Thank you,
Oleksandr
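
The "unify the error path into a label" suggestion can be sketched as
follows. This is a hypothetical, self-contained model: struct sketch_dev,
sketch_add_register() and the fail_at knob do not exist in Xen, and the
PCI_COMMAND register is modelled as a plain field. Restoring the register
on the success path as well is a simplification made for this sketch only.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CMD_MEMORY 0x2U  /* memory decoding enable bit */

/* Hypothetical device model. */
struct sketch_dev {
    uint16_t command;  /* emulated PCI_COMMAND register */
    int fail_at;       /* BAR index whose registration fails, -1 for none */
};

/* Stand-in for vpci_add_register(): fails for one chosen index. */
static int sketch_add_register(const struct sketch_dev *dev, int idx)
{
    return idx == dev->fail_at ? -1 : 0;
}

static int sketch_init_bars(struct sketch_dev *dev, int nr_bars)
{
    uint16_t cmd = dev->command;
    int rc = 0, i;

    assert(nr_bars >= 0);

    /* Disable memory decoding while the BARs are being set up. */
    dev->command = (uint16_t)(cmd & ~SKETCH_CMD_MEMORY);

    for ( i = 0; i < nr_bars; i++ )
    {
        rc = sketch_add_register(dev, i);
        if ( rc )
            goto fail;  /* every error site shares one restore path */
    }

    dev->command = cmd;

    return 0;

 fail:
    /* Single place that restores the saved command register. */
    dev->command = cmd;

    return rc;
}
```

With the label in place, each of the three call sites mentioned above
would shrink to the registration call plus `goto fail;`, whether or not a
helper is kept.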

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2022-02-08  8:13     ` Oleksandr Andrushchenko
@ 2022-02-08  9:33       ` Jan Beulich
  2022-02-08  9:38         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-08  9:33 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 08.02.2022 09:13, Oleksandr Andrushchenko wrote:
> On 04.02.22 16:25, Jan Beulich wrote:
>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>> --- a/xen/drivers/vpci/header.c
>>> +++ b/xen/drivers/vpci/header.c
>>> @@ -454,6 +454,22 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>>           pci_conf_write16(pdev->sbdf, reg, cmd);
>>>   }
>>>   
>>> +static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>> +                            uint32_t cmd, void *data)
>>> +{
>>> +    /* TODO: Add proper emulation for all bits of the command register. */
>>> +
>>> +#ifdef CONFIG_HAS_PCI_MSI
>>> +    if ( pdev->vpci->msi->enabled || pdev->vpci->msix->enabled )
>>> +    {
>>> +        /* Guest wants to enable INTx. It can't be enabled if MSI/MSI-X enabled. */
>>> +        cmd |= PCI_COMMAND_INTX_DISABLE;
>>> +    }
>>> +#endif
>>> +
>>> +    cmd_write(pdev, reg, cmd, data);
>>> +}
>> It's not really clear to me whether the TODO warrants this being a
>> separate function. Personally I'd find it preferable if the logic
>> was folded into cmd_write().
> Not sure cmd_write needs to have guest's logic. And what's the
> profit? Later on, when we decide how PCI_COMMAND can be emulated
> this code will live in guest_cmd_write anyways

Why "will"? There's nothing conceptually wrong with putting all the
emulation logic into cmd_write(), inside an if(!hwdom) conditional.
If and when we gain CET-IBT support on the x86 side (and I'm told
there's an Arm equivalent of this), then to make this as useful as
possible it is going to be desirable to limit the number of functions
called through function pointers. You may have seen Andrew's huge
"x86: Support for CET Indirect Branch Tracking" series. We want to
keep down the number of such annotations; the vast majority of the
series is about adding them.

Jan
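
The alternative described above, one handler with a conditional instead of
a second function reached through a function pointer, could be sketched
like this. struct sketch_dev and sketch_cmd_write() are hypothetical; the
hardware write is modelled as a store to a field, and only the INTx/MSI
rule from the quoted patch is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CMD_INTX_DISABLE 0x400U

/* Hypothetical device model. */
struct sketch_dev {
    bool is_hwdom;        /* device belongs to the hardware domain */
    bool msi_enabled;
    bool msix_enabled;
    uint16_t hw_command;  /* value that reached the "hardware" */
};

/* Single handler used for both the hardware domain and guests. */
static void sketch_cmd_write(struct sketch_dev *dev, uint32_t cmd)
{
    assert(dev != NULL);

    /* Guest-only emulation lives inside the common handler. */
    if ( !dev->is_hwdom && (dev->msi_enabled || dev->msix_enabled) )
        cmd |= SKETCH_CMD_INTX_DISABLE;  /* INTx stays off with MSI(-X) */

    dev->hw_command = (uint16_t)cmd;
}
```

The trade-off discussed in the thread is visible here: one fewer
indirectly-called function, at the cost of the common handler knowing
about the guest case.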



^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2022-02-08  9:33       ` Jan Beulich
@ 2022-02-08  9:38         ` Oleksandr Andrushchenko
  2022-02-08  9:52           ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08  9:38 UTC (permalink / raw)
  To: Jan Beulich, roger.pau
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 08.02.22 11:33, Jan Beulich wrote:
> On 08.02.2022 09:13, Oleksandr Andrushchenko wrote:
>> On 04.02.22 16:25, Jan Beulich wrote:
>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>> --- a/xen/drivers/vpci/header.c
>>>> +++ b/xen/drivers/vpci/header.c
>>>> @@ -454,6 +454,22 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>>>            pci_conf_write16(pdev->sbdf, reg, cmd);
>>>>    }
>>>>    
>>>> +static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>>> +                            uint32_t cmd, void *data)
>>>> +{
>>>> +    /* TODO: Add proper emulation for all bits of the command register. */
>>>> +
>>>> +#ifdef CONFIG_HAS_PCI_MSI
>>>> +    if ( pdev->vpci->msi->enabled || pdev->vpci->msix->enabled )
>>>> +    {
>>>> +        /* Guest wants to enable INTx. It can't be enabled if MSI/MSI-X enabled. */
>>>> +        cmd |= PCI_COMMAND_INTX_DISABLE;
>>>> +    }
>>>> +#endif
>>>> +
>>>> +    cmd_write(pdev, reg, cmd, data);
>>>> +}
>>> It's not really clear to me whether the TODO warrants this being a
>>> separate function. Personally I'd find it preferable if the logic
>>> was folded into cmd_write().
>> Not sure cmd_write needs to have guest's logic. And what's the
>> profit? Later on, when we decide how PCI_COMMAND can be emulated
>> this code will live in guest_cmd_write anyways
> Why "will"? There's nothing conceptually wrong with putting all the
> emulation logic into cmd_write(), inside an if(!hwdom) conditional.
> If and when we gain CET-IBT support on the x86 side (and I'm told
> there's an Arm equivalent of this), then to make this as useful as
> possible it is going to be desirable to limit the number of functions
> called through function pointers. You may have seen Andrew's huge
> "x86: Support for CET Indirect Branch Tracking" series. We want to
> keep down the number of such annotations; the vast part of the series
> is about adding of such.
Well, while I see nothing bad with that, from a code organization point
of view it would look a bit strange: we don't differentiate hwdom in vpci
handlers, but instead provide one for hwdom and one for guests.
While I understand your concern, I still think that at the moment
it will be more in line with the existing code if we provide a dedicated
handler.

Once we are all set with the handlers, we may want to refactor in order
to limit the number of register handlers.

@Roger, what's your view on this?
> Jan
>
>
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-08  9:27         ` Oleksandr Andrushchenko
@ 2022-02-08  9:44           ` Jan Beulich
  2022-02-08  9:55             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-08  9:44 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau

On 08.02.2022 10:27, Oleksandr Andrushchenko wrote:
> On 08.02.22 11:13, Jan Beulich wrote:
>> On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
>>> On 07.02.22 18:28, Jan Beulich wrote:
>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>                            pci_to_dev(pdev), flag);
>>>>>        }
>>>>>    
>>>>> +    rc = vpci_assign_device(d, pdev);
>>>>> +
>>>>>     done:
>>>>>        if ( rc )
>>>>>            printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>> There's no attempt to undo anything in the case of getting back an
>>>> error. ISTR this being deemed okay on the basis that the tool stack
>>>> would then take whatever action, but whatever it is that is supposed
>>>> to deal with errors here wants spelling out in the description.
>>> Why? I don't change the previously expected decision and implementation
>>> of the assign_device function: I use error paths as they were used before
>>> for the existing code. So, I see no clear reason to stress that the existing
>>> and new code relies on the toolstack
>> Saying half a sentence on this is helping review.
> Ok
>>
>>>> What's important is that no caller up the call tree may be left with
>>>> the impression that the device is still owned by the original
>>>> domain. With how you have it, the device is going to be owned by the
>>>> new domain, but not really usable.
>>> This is not true: vpci_assign_device will call vpci_deassign_device
>>> internally if it fails. So, the device won't be assigned in this case
>> No. The device is assigned to whatever pdev->domain holds. Calling
>> vpci_deassign_device() there merely makes sure that the device will
>> have _no_ vPCI data and hooks in place, rather than something
>> partial.
> So, this patch is only dealing with vpci assign/de-assign
> And it rolls back what it did in case of a failure
> It also returns rc in assign_device to signal it has failed
> What else is expected from this patch??

Until now if assign_device() returns an error, this tells the caller
that the device did not change ownership; in the worst case it either
only moved to the quarantine domain, or the new owner may have been
crashed. In no case is the device owned by an alive DomU. You're
changing this property, and hence you need to make clear/sure that
this isn't colliding with assumptions made elsewhere.

Jan
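
The property described above (an error from assign_device() means the device is not left with a live new owner) can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: struct pdev_model, struct domain_model, vpci_assign_model() and assign_device_model() are hypothetical stand-ins, not Xen code, and restoring the previous owner is just one way of keeping the property. The real code may instead move the device to the quarantine domain or crash the new owner.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_ENOMEM 12 /* hypothetical error value used by this model only */

/* Hypothetical, heavily reduced stand-ins for struct domain / struct pci_dev. */
struct domain_model { int id; };

struct pdev_model {
    struct domain_model *owner; /* models pdev->domain */
    int has_vpci;               /* non-zero once vPCI state is set up */
};

/* Models vpci_assign_device(): on failure it leaves no partial vPCI state. */
static int vpci_assign_model(struct pdev_model *pdev, int inject_failure)
{
    if ( inject_failure )
    {
        pdev->has_vpci = 0;     /* models the internal vpci_deassign_device() */
        return -MODEL_ENOMEM;
    }

    pdev->has_vpci = 1;
    return 0;
}

/* Models assign_device(): ownership has already moved when vPCI setup runs. */
static int assign_device_model(struct pdev_model *pdev,
                               struct domain_model *target,
                               int inject_failure)
{
    struct domain_model *prev = pdev->owner;
    int rc;

    assert(pdev != NULL && target != NULL);

    pdev->owner = target;       /* models the IOMMU hook's final step */

    rc = vpci_assign_model(pdev, inject_failure);
    if ( rc )
        pdev->owner = prev;     /* keep "error means no new owner" true */

    return rc;
}
```

Without the restore step the model would return an error while pdev->owner already points at the new domain, which is the situation being questioned in this sub-thread.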




* Re: [PATCH v6 06/13] vpci/header: implement guest BAR register handlers
  2022-02-08  9:31     ` Oleksandr Andrushchenko
@ 2022-02-08  9:48       ` Jan Beulich
  2022-02-08  9:57         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-08  9:48 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: xen-devel, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, Roger Pau Monné

On 08.02.2022 10:31, Oleksandr Andrushchenko wrote:
> On 08.02.22 11:25, Roger Pau Monné wrote:
>> On Fri, Feb 04, 2022 at 08:34:52AM +0200, Oleksandr Andrushchenko wrote:
>>> @@ -516,6 +594,11 @@ static int init_bars(struct pci_dev *pdev)
>>>           if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
>>>           {
>>>               bars[i].type = VPCI_BAR_IO;
>>> +
>>> +            rc = bar_ignore_access(pdev, reg, &bars[i]);
>> This is wrong: you only want to ignore access to IO BARs for Arm, for
>> x86 we should keep the previous behavior. Even more if you go with
>> Jan's suggestions to make bar_ignore_access also applicable to dom0.
> How do we want this?
> #ifdef CONFIG_ARM?

Afaic better via a new, dedicated CONFIG_HAVE_* setting, which x86 selects
but Arm doesn't. Unless we have one already, of course ...

Jan




* Re: [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2022-02-08  9:38         ` Oleksandr Andrushchenko
@ 2022-02-08  9:52           ` Jan Beulich
  2022-02-08  9:58             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-08  9:52 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau

On 08.02.2022 10:38, Oleksandr Andrushchenko wrote:
> 
> 
> On 08.02.22 11:33, Jan Beulich wrote:
>> On 08.02.2022 09:13, Oleksandr Andrushchenko wrote:
>>> On 04.02.22 16:25, Jan Beulich wrote:
>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>> --- a/xen/drivers/vpci/header.c
>>>>> +++ b/xen/drivers/vpci/header.c
>>>>> @@ -454,6 +454,22 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>            pci_conf_write16(pdev->sbdf, reg, cmd);
>>>>>    }
>>>>>    
>>>>> +static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>>>> +                            uint32_t cmd, void *data)
>>>>> +{
>>>>> +    /* TODO: Add proper emulation for all bits of the command register. */
>>>>> +
>>>>> +#ifdef CONFIG_HAS_PCI_MSI
>>>>> +    if ( pdev->vpci->msi->enabled || pdev->vpci->msix->enabled )
>>>>> +    {
>>>>> +        /* Guest wants to enable INTx. It can't be enabled if MSI/MSI-X enabled. */
>>>>> +        cmd |= PCI_COMMAND_INTX_DISABLE;
>>>>> +    }
>>>>> +#endif
>>>>> +
>>>>> +    cmd_write(pdev, reg, cmd, data);
>>>>> +}
>>>> It's not really clear to me whether the TODO warrants this being a
>>>> separate function. Personally I'd find it preferable if the logic
>>>> was folded into cmd_write().
>>> Not sure cmd_write needs to have guest's logic. And what's the
>>> profit? Later on, when we decide how PCI_COMMAND can be emulated
>>> this code will live in guest_cmd_write anyways
>> Why "will"? There's nothing conceptually wrong with putting all the
>> emulation logic into cmd_write(), inside an if(!hwdom) conditional.
>> If and when we gain CET-IBT support on the x86 side (and I'm told
>> there's an Arm equivalent of this), then to make this as useful as
>> possible it is going to be desirable to limit the number of functions
>> called through function pointers. You may have seen Andrew's huge
>> "x86: Support for CET Indirect Branch Tracking" series. We want to
>> keep down the number of such annotations; the vast part of the series
>> is about adding of such.
> Well, while I see nothing bad with that, from the code organization
> it would look a bit strange: we don't differentiate hwdom in vpci
> handlers, but instead provide one for hwdom and one for guests.
> While I understand your concern I still think that at the moment
> it will be more in line with the existing code if we provide a dedicated
> handler.

The existing code only deals with Dom0, and hence doesn't have any
pairs of handlers. FTAOD what I said above applies equally to other
separate guest read/write handlers you may be introducing. The
exception being when e.g. a hardware access handler is put in place
for Dom0 (for obvious reasons, I think).

Jan
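
The folding suggested in this sub-thread (guest emulation inside cmd_write(), guarded by a hardware-domain check, so that only one function is reached through a function pointer) might look roughly like the sketch below. It is a stand-alone model, not the actual Xen handler: struct pdev_model and its fields are hypothetical, and the store to hw_cmd merely stands in for pci_conf_write16().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x400 /* bit 10 of the PCI command register */

/* Hypothetical, heavily reduced stand-in for struct pci_dev and its vPCI state. */
struct pdev_model {
    bool owner_is_hwdom; /* models is_hardware_domain(pdev->domain) */
    bool msi_enabled;    /* models pdev->vpci->msi->enabled */
    bool msix_enabled;   /* models pdev->vpci->msix->enabled */
    uint16_t hw_cmd;     /* models the physical command register */
};

/* One handler for hwdom and guests, hence one indirect-call target. */
static void cmd_write(struct pdev_model *pdev, uint32_t cmd)
{
    assert(pdev != NULL);

    if ( !pdev->owner_is_hwdom )
    {
        /* Guest-only emulation: INTx stays disabled while MSI/MSI-X is on. */
        if ( pdev->msi_enabled || pdev->msix_enabled )
            cmd |= PCI_COMMAND_INTX_DISABLE;
    }

    pdev->hw_cmd = (uint16_t)cmd; /* stands in for pci_conf_write16() */
}
```

The trade-off debated here is visible in the sketch: the guest path costs one extra conditional inside the shared handler, in exchange for not registering a second function that would need its own indirect-branch annotation.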




* Re: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-08  9:44           ` Jan Beulich
@ 2022-02-08  9:55             ` Oleksandr Andrushchenko
  2022-02-08 10:09               ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08  9:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau,
	Oleksandr Andrushchenko



On 08.02.22 11:44, Jan Beulich wrote:
> On 08.02.2022 10:27, Oleksandr Andrushchenko wrote:
>> On 08.02.22 11:13, Jan Beulich wrote:
>>> On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
>>>> On 07.02.22 18:28, Jan Beulich wrote:
>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>>                             pci_to_dev(pdev), flag);
>>>>>>         }
>>>>>>     
>>>>>> +    rc = vpci_assign_device(d, pdev);
>>>>>> +
>>>>>>      done:
>>>>>>         if ( rc )
>>>>>>             printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>>> There's no attempt to undo anything in the case of getting back an
>>>>> error. ISTR this being deemed okay on the basis that the tool stack
>>>>> would then take whatever action, but whatever it is that is supposed
>>>>> to deal with errors here wants spelling out in the description.
>>>> Why? I don't change the previously expected decision and implementation
>>>> of the assign_device function: I use error paths as they were used before
>>>> for the existing code. So, I see no clear reason to stress that the existing
>>>> and new code relies on the toolstack
>>> Saying half a sentence on this is helping review.
>> Ok
>>>>> What's important is that no caller up the call tree may be left with
>>>>> the impression that the device is still owned by the original
>>>>> domain. With how you have it, the device is going to be owned by the
>>>>> new domain, but not really usable.
>>>> This is not true: vpci_assign_device will call vpci_deassign_device
>>>> internally if it fails. So, the device won't be assigned in this case
>>> No. The device is assigned to whatever pdev->domain holds. Calling
>>> vpci_deassign_device() there merely makes sure that the device will
>>> have _no_ vPCI data and hooks in place, rather than something
>>> partial.
>> So, this patch is only dealing with vpci assign/de-assign
>> And it rolls back what it did in case of a failure
>> It also returns rc in assign_device to signal it has failed
>> What else is expected from this patch??
> Until now if assign_device() returns an error, this tells the caller
> that the device did not change ownership;
Not sure this is the case:
     if ( (rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
                           pci_to_dev(pdev), flag)) )
iommu_call can leave the new ownership even now without
vpci_assign_device. My understanding is that the roll-back is
expected to be performed by the toolstack and vpci_assign_device
doesn't prevent that by returning rc. Even more, before we discussed
that it would be good for vpci_assign_device to try recovering from
a possible error early which is done by calling vpci_deassign_device
internally.

So, if you want things to be clearly handled without relying on the
toolstack, then this is not an issue introduced by vpci_assign_device, but
an existing one, which needs (if there is a good reason) to be fixed
separately.
At least, I think that the new code doesn't make things worse.

>   in the worst case it either
> only moved to the quarantine domain, or the new owner may have been
> crashed. In no case is the device owned by an alive DomU. You're
> changing this property, and hence you need to make clear/sure that
> this isn't colliding with assumptions made elsewhere.
>
> Jan
>
>


* Re: [PATCH v6 06/13] vpci/header: implement guest BAR register handlers
  2022-02-08  9:48       ` Jan Beulich
@ 2022-02-08  9:57         ` Oleksandr Andrushchenko
  2022-02-08 10:15           ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08  9:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, Roger Pau Monné,
	Oleksandr Andrushchenko



On 08.02.22 11:48, Jan Beulich wrote:
> On 08.02.2022 10:31, Oleksandr Andrushchenko wrote:
>> On 08.02.22 11:25, Roger Pau Monné wrote:
>>> On Fri, Feb 04, 2022 at 08:34:52AM +0200, Oleksandr Andrushchenko wrote:
>>>> @@ -516,6 +594,11 @@ static int init_bars(struct pci_dev *pdev)
>>>>            if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
>>>>            {
>>>>                bars[i].type = VPCI_BAR_IO;
>>>> +
>>>> +            rc = bar_ignore_access(pdev, reg, &bars[i]);
>>> This is wrong: you only want to ignore access to IO BARs for Arm, for
>>> x86 we should keep the previous behavior. Even more if you go with
>>> Jan's suggestions to make bar_ignore_access also applicable to dom0.
>> How do we want this?
>> #ifdef CONFIG_ARM?
> Afaic better via a new, dedicated CONFIG_HAVE_* setting, which x86 selects
> but Arm doesn't. Unless we have one already, of course ...
Could you please be more specific on the name you see appropriate?
And do you realize that this is going to be a single user of such a
setting?
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2022-02-08  9:52           ` Jan Beulich
@ 2022-02-08  9:58             ` Oleksandr Andrushchenko
  2022-02-08 11:11               ` Roger Pau Monné
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08  9:58 UTC (permalink / raw)
  To: Jan Beulich, roger.pau
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 08.02.22 11:52, Jan Beulich wrote:
> On 08.02.2022 10:38, Oleksandr Andrushchenko wrote:
>>
>> On 08.02.22 11:33, Jan Beulich wrote:
>>> On 08.02.2022 09:13, Oleksandr Andrushchenko wrote:
>>>> On 04.02.22 16:25, Jan Beulich wrote:
>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>> @@ -454,6 +454,22 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>             pci_conf_write16(pdev->sbdf, reg, cmd);
>>>>>>     }
>>>>>>     
>>>>>> +static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>> +                            uint32_t cmd, void *data)
>>>>>> +{
>>>>>> +    /* TODO: Add proper emulation for all bits of the command register. */
>>>>>> +
>>>>>> +#ifdef CONFIG_HAS_PCI_MSI
>>>>>> +    if ( pdev->vpci->msi->enabled || pdev->vpci->msix->enabled )
>>>>>> +    {
>>>>>> +        /* Guest wants to enable INTx. It can't be enabled if MSI/MSI-X enabled. */
>>>>>> +        cmd |= PCI_COMMAND_INTX_DISABLE;
>>>>>> +    }
>>>>>> +#endif
>>>>>> +
>>>>>> +    cmd_write(pdev, reg, cmd, data);
>>>>>> +}
>>>>> It's not really clear to me whether the TODO warrants this being a
>>>>> separate function. Personally I'd find it preferable if the logic
>>>>> was folded into cmd_write().
>>>> Not sure cmd_write needs to have guest's logic. And what's the
>>>> profit? Later on, when we decide how PCI_COMMAND can be emulated
>>>> this code will live in guest_cmd_write anyways
>>> Why "will"? There's nothing conceptually wrong with putting all the
>>> emulation logic into cmd_write(), inside an if(!hwdom) conditional.
>>> If and when we gain CET-IBT support on the x86 side (and I'm told
>>> there's an Arm equivalent of this), then to make this as useful as
>>> possible it is going to be desirable to limit the number of functions
>>> called through function pointers. You may have seen Andrew's huge
>>> "x86: Support for CET Indirect Branch Tracking" series. We want to
>>> keep down the number of such annotations; the vast part of the series
>>> is about adding of such.
>> Well, while I see nothing bad with that, from the code organization
>> it would look a bit strange: we don't differentiate hwdom in vpci
>> handlers, but instead provide one for hwdom and one for guests.
>> While I understand your concern I still think that at the moment
>> it will be more in line with the existing code if we provide a dedicated
>> handler.
> The existing code only deals with Dom0, and hence doesn't have any
> pairs of handlers.
This is fair
>   FTAOD what I said above applies equally to other
> separate guest read/write handlers you may be introducing. The
> exception being when e.g. a hardware access handler is put in place
> for Dom0 (for obvious reasons, I think).
@Roger, what's your preference here?
>
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-08  9:55             ` Oleksandr Andrushchenko
@ 2022-02-08 10:09               ` Jan Beulich
  2022-02-08 10:22                 ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-08 10:09 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau

On 08.02.2022 10:55, Oleksandr Andrushchenko wrote:
> 
> 
> On 08.02.22 11:44, Jan Beulich wrote:
>> On 08.02.2022 10:27, Oleksandr Andrushchenko wrote:
>>> On 08.02.22 11:13, Jan Beulich wrote:
>>>> On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
>>>>> On 07.02.22 18:28, Jan Beulich wrote:
>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>>>                             pci_to_dev(pdev), flag);
>>>>>>>         }
>>>>>>>     
>>>>>>> +    rc = vpci_assign_device(d, pdev);
>>>>>>> +
>>>>>>>      done:
>>>>>>>         if ( rc )
>>>>>>>             printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>>>> There's no attempt to undo anything in the case of getting back an
>>>>>> error. ISTR this being deemed okay on the basis that the tool stack
>>>>>> would then take whatever action, but whatever it is that is supposed
>>>>>> to deal with errors here wants spelling out in the description.
>>>>> Why? I don't change the previously expected decision and implementation
>>>>> of the assign_device function: I use error paths as they were used before
>>>>> for the existing code. So, I see no clear reason to stress that the existing
>>>>> and new code relies on the toolstack
>>>> Saying half a sentence on this is helping review.
>>> Ok
>>>>>> What's important is that no caller up the call tree may be left with
>>>>>> the impression that the device is still owned by the original
>>>>>> domain. With how you have it, the device is going to be owned by the
>>>>>> new domain, but not really usable.
>>>>> This is not true: vpci_assign_device will call vpci_deassign_device
>>>>> internally if it fails. So, the device won't be assigned in this case
>>>> No. The device is assigned to whatever pdev->domain holds. Calling
>>>> vpci_deassign_device() there merely makes sure that the device will
>>>> have _no_ vPCI data and hooks in place, rather than something
>>>> partial.
>>> So, this patch is only dealing with vpci assign/de-assign
>>> And it rolls back what it did in case of a failure
>>> It also returns rc in assign_device to signal it has failed
>>> What else is expected from this patch??
>> Until now if assign_device() returns an error, this tells the caller
>> that the device did not change ownership;
> Not sure this is the case:
>      if ( (rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
>                            pci_to_dev(pdev), flag)) )
> iommu_call can leave the new ownership even now without
> vpci_assign_device.

Did you check the actual hook functions for when exactly the ownership
change happens? For both VT-d and AMD it is the last thing they do,
when no error can occur anymore.

> My understanding is that the roll-back is
> expected to be performed by the toolstack and vpci_assign_device
> doesn't prevent that by returning rc. Even more, before we discussed
> that it would be good for vpci_assign_device to try recovering from
> a possible error early which is done by calling vpci_deassign_device
> internally.

Yes, but that's only part of it. It at least needs considering what
effects have resulted from operations prior to vpci_assign_device().

Jan




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-07 16:37                                                   ` Jan Beulich
  2022-02-07 16:44                                                     ` Oleksandr Andrushchenko
@ 2022-02-08 10:11                                                     ` Roger Pau Monné
  2022-02-08 10:32                                                       ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-08 10:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, julien, sstabellini,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh, xen-devel

On Mon, Feb 07, 2022 at 05:37:49PM +0100, Jan Beulich wrote:
> On 07.02.2022 17:21, Oleksandr Andrushchenko wrote:
> > 
> > 
> > On 07.02.22 18:15, Jan Beulich wrote:
> >> On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
> >>> On 07.02.22 17:26, Jan Beulich wrote:
> >>>> 1b. Make vpci_write use write lock for writes to command register and BARs
> >>>> only; keep using the read lock for all other writes.
> >>> I am not quite sure how to do that. Do you mean something like:
> >>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >>>                   uint32_t data)
> >>> [snip]
> >>>       list_for_each_entry ( r, &pdev->vpci->handlers, node )
> >>> {
> >>> [snip]
> >>>       if ( r->needs_write_lock)
> >>>           write_lock(d->vpci_lock)
> >>>       else
> >>>           read_lock(d->vpci_lock)
> >>> ....
> >>>
> >>> And provide rw as an argument to:
> >>>
> >>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
> >>>                         vpci_write_t *write_handler, unsigned int offset,
> >>>                         unsigned int size, void *data, --->>> bool write_path <<<-----)
> >>>
> >>> Is this what you mean?
> >> This sounds overly complicated. You can derive locally in vpci_write(),
> >> from just its "reg" and "size" parameters, whether the lock needs taking
> >> in write mode.
> > Yes, I started writing a reply with that. So, the summary (ROM
> > position depends on header type):
> > if ( (reg == PCI_COMMAND) || (reg == ROM) )
> > {
> >      read PCI_COMMAND and see if memory or IO decoding are enabled.
> >      if ( enabled )
> >          write_lock(d->vpci_lock)
> >      else
> >          read_lock(d->vpci_lock)
> > }
> 
> Hmm, yes, you can actually get away without using "size", since both
> command register and ROM BAR are 32-bit aligned registers, and 64-bit
> accesses get split in vpci_ecam_write().
> 
> For the command register the memory- / IO-decoding-enabled check may
> end up a little more complicated, as the value to be written also
> matters. Maybe read the command register only for the ROM BAR write,
> using the write lock uniformly for all command register writes?
> 
> > Do you also think we can drop pdev->vpci (or currently pdev->vpci->lock)
> > at all then?
> 
> I haven't looked at this in any detail, sorry. It sounds possible,
> yes.

AFAICT you should avoid taking the per-device vpci lock when you take
the per-domain lock in write mode. Otherwise you still need the
per-device vpci lock in order to keep consistency between concurrent
accesses to the device registers.

Thanks, Roger.
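
The quoted idea of deriving the lock mode locally in vpci_write() from "reg" alone could be sketched as follows. This is a stand-alone illustration under assumptions: needs_write_lock() and its parameters are hypothetical, the register offsets are the standard PCI ones, and the policy shown (write lock for every command register write; for the ROM BAR only when decoding is currently enabled) is one reading of the suggestions above, not an agreed design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND         0x04
#define PCI_COMMAND_IO      0x1  /* I/O space decoding enabled */
#define PCI_COMMAND_MEMORY  0x2  /* memory space decoding enabled */
#define PCI_ROM_ADDRESS     0x30 /* ROM BAR offset, type 0 header */
#define PCI_ROM_ADDRESS1    0x38 /* ROM BAR offset, type 1 (bridge) header */

/*
 * Hypothetical helper: decide whether a config space write has to take
 * the per-domain lock in write mode.
 *
 * reg     - offset being written
 * rom_reg - ROM BAR offset for this device's header type
 * cur_cmd - current value of the command register
 */
static bool needs_write_lock(unsigned int reg, unsigned int rom_reg,
                             uint16_t cur_cmd)
{
    assert(rom_reg == PCI_ROM_ADDRESS || rom_reg == PCI_ROM_ADDRESS1);

    /* Uniformly use the write lock for all command register writes. */
    if ( reg == PCI_COMMAND )
        return true;

    /* ROM BAR writes only matter while decoding is enabled. */
    if ( reg == rom_reg )
        return (cur_cmd & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) != 0;

    /* Everything else can proceed under the read lock. */
    return false;
}
```

No "size" parameter is needed in this model because, as noted in the quoted text, both registers are 32-bit aligned and 64-bit accesses are split before they reach vpci_write().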



* Re: [PATCH v6 06/13] vpci/header: implement guest BAR register handlers
  2022-02-08  9:57         ` Oleksandr Andrushchenko
@ 2022-02-08 10:15           ` Jan Beulich
  2022-02-08 10:29             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-08 10:15 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: xen-devel, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, Roger Pau Monné

On 08.02.2022 10:57, Oleksandr Andrushchenko wrote:
> On 08.02.22 11:48, Jan Beulich wrote:
>> On 08.02.2022 10:31, Oleksandr Andrushchenko wrote:
>>> On 08.02.22 11:25, Roger Pau Monné wrote:
>>>> On Fri, Feb 04, 2022 at 08:34:52AM +0200, Oleksandr Andrushchenko wrote:
>>>>> @@ -516,6 +594,11 @@ static int init_bars(struct pci_dev *pdev)
>>>>>            if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
>>>>>            {
>>>>>                bars[i].type = VPCI_BAR_IO;
>>>>> +
>>>>> +            rc = bar_ignore_access(pdev, reg, &bars[i]);
>>>> This is wrong: you only want to ignore access to IO BARs for Arm, for
>>>> x86 we should keep the previous behavior. Even more if you go with
>>>> Jan's suggestions to make bar_ignore_access also applicable to dom0.
>>> How do we want this?
>>> #ifdef CONFIG_ARM?
>> Afaic better via a new, dedicated CONFIG_HAVE_* setting, which x86 selects
>> but Arm doesn't. Unless we have one already, of course ...
> Could you please be more specific on the name you see appropriate?

I'm pretty sure Linux has something similar, so I'd like to ask that
you go look there. I'm sorry to say this a little bluntly, but I'm
really in need of doing something beyond answering your mails (and
in part re-stating the same thing again and again).

> And do you realize that this is going to be a single user of such a
> setting?

Yes, but I'm not sure this is going to remain just a single use.
Furthermore every CONFIG_<arch> is problematic as soon as a new port
is being worked on. If we wanted to go with a CONFIG_<arch> here, imo
it ought to be CONFIG_X86, not CONFIG_ARM, as I/O ports are really an
x86-specific thing (which has propagated into other architectures in
more or less strange ways, but never as truly I/O ports).

Jan
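
The difference between keying the behaviour on a capability-style setting and on a CONFIG_<arch> symbol can be sketched as below. The option name CONFIG_HAVE_IOPORTS is purely hypothetical (no name was agreed on in this thread), and bar_access_ignored() with its enum is a stand-alone model rather than the vPCI code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical capability option: an architecture with real I/O ports
 * (x86) would select it, others (Arm, or any new port) would not.
 */
#ifdef CONFIG_HAVE_IOPORTS
#define HAVE_IOPORTS 1
#else
#define HAVE_IOPORTS 0
#endif

enum bar_type_model { BAR_MODEL_EMPTY, BAR_MODEL_IO, BAR_MODEL_MEM32 };

/* Returns true when accesses to this BAR should be ignored. */
static bool bar_access_ignored(enum bar_type_model type)
{
    assert(type >= BAR_MODEL_EMPTY && type <= BAR_MODEL_MEM32);

    /* I/O BARs are only meaningful where I/O ports exist at all. */
    return type == BAR_MODEL_IO && !HAVE_IOPORTS;
}
```

With this shape a new port gets the "ignore I/O BARs" behaviour by default, without anyone having to extend an #ifdef CONFIG_ARM list.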




* Re: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-08 10:09               ` Jan Beulich
@ 2022-02-08 10:22                 ` Oleksandr Andrushchenko
  2022-02-08 10:29                   ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08 10:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau,
	Oleksandr Andrushchenko



On 08.02.22 12:09, Jan Beulich wrote:
> On 08.02.2022 10:55, Oleksandr Andrushchenko wrote:
>>
>> On 08.02.22 11:44, Jan Beulich wrote:
>>> On 08.02.2022 10:27, Oleksandr Andrushchenko wrote:
>>>> On 08.02.22 11:13, Jan Beulich wrote:
>>>>> On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
>>>>>> On 07.02.22 18:28, Jan Beulich wrote:
>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>>>>                              pci_to_dev(pdev), flag);
>>>>>>>>          }
>>>>>>>>      
>>>>>>>> +    rc = vpci_assign_device(d, pdev);
>>>>>>>> +
>>>>>>>>       done:
>>>>>>>>          if ( rc )
>>>>>>>>              printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>>>>> There's no attempt to undo anything in the case of getting back an
>>>>>>> error. ISTR this being deemed okay on the basis that the tool stack
>>>>>>> would then take whatever action, but whatever it is that is supposed
>>>>>>> to deal with errors here wants spelling out in the description.
>>>>>> Why? I don't change the previously expected decision and implementation
>>>>>> of the assign_device function: I use error paths as they were used before
>>>>>> for the existing code. So, I see no clear reason to stress that the existing
>>>>>> and new code relies on the toolstack
>>>>> Saying half a sentence on this is helping review.
>>>> Ok
>>>>>>> What's important is that no caller up the call tree may be left with
>>>>>>> the impression that the device is still owned by the original
>>>>>>> domain. With how you have it, the device is going to be owned by the
>>>>>>> new domain, but not really usable.
>>>>>> This is not true: vpci_assign_device will call vpci_deassign_device
>>>>>> internally if it fails. So, the device won't be assigned in this case
>>>>> No. The device is assigned to whatever pdev->domain holds. Calling
>>>>> vpci_deassign_device() there merely makes sure that the device will
>>>>> have _no_ vPCI data and hooks in place, rather than something
>>>>> partial.
>>>> So, this patch is only dealing with vpci assign/de-assign
>>>> And it rolls back what it did in case of a failure
>>>> It also returns rc in assign_device to signal it has failed
>>>> What else is expected from this patch??
>>> Until now if assign_device() returns an error, this tells the caller
>>> that the device did not change ownership;
>> Not sure this is the case:
>>       if ( (rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
>>                             pci_to_dev(pdev), flag)) )
>> iommu_call can leave the new ownership even now without
>> vpci_assign_device.
> Did you check the actual hook functions for when exactly the ownership
> change happens? For both VT-d and AMD it is the last thing they do,
> when no error can occur anymore.
This functionality does not exist for Arm yet, so it is up to a
future series to add it.

WRT the existing code:

static int amd_iommu_assign_device(struct domain *d, u8 devfn,
                                    struct pci_dev *pdev,
                                    u32 flag)
{
     if ( !rc )
         rc = reassign_device(pdev->domain, d, devfn, pdev); <<<<< this will set pdev->domain

     if ( rc && !is_hardware_domain(d) )
     {
         int ret = amd_iommu_reserve_domain_unity_unmap(
                       d, ivrs_mappings[req_id].unity_map);

         if ( ret )
         {
             printk(XENLOG_ERR "AMD-Vi: "
                    "unity-unmap for %pd/%04x:%02x:%02x.%u failed (%d)\n",
                    d, pdev->seg, pdev->bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn), ret);
             domain_crash(d);
         }
So....

It is IMO wrong in the first place to let the IOMMU code assign pdev->domain.
This is something that needs to be done by the PCI code itself,
rather than relying on each IOMMU callback implementation.
>
>> My understanding is that the roll-back is
>> expected to be performed by the toolstack and vpci_assign_device
>> doesn't prevent that by returning rc. Even more, before we discussed
>> that it would be good for vpci_assign_device to try recovering from
>> a possible error early which is done by calling vpci_deassign_device
>> internally.
> Yes, but that's only part of it. It at least needs considering what
> effects have resulted from operations prior to vpci_assign_device().
Taking into account the code snippet above: what is your expectation
from this patch in this respect?

>
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 06/13] vpci/header: implement guest BAR register handlers
  2022-02-08 10:15           ` Jan Beulich
@ 2022-02-08 10:29             ` Oleksandr Andrushchenko
  2022-02-08 13:58               ` Roger Pau Monné
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08 10:29 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné
  Cc: xen-devel, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, Oleksandr Andrushchenko



On 08.02.22 12:15, Jan Beulich wrote:
> On 08.02.2022 10:57, Oleksandr Andrushchenko wrote:
>> On 08.02.22 11:48, Jan Beulich wrote:
>>> On 08.02.2022 10:31, Oleksandr Andrushchenko wrote:
>>>> On 08.02.22 11:25, Roger Pau Monné wrote:
>>>>> On Fri, Feb 04, 2022 at 08:34:52AM +0200, Oleksandr Andrushchenko wrote:
>>>>>> @@ -516,6 +594,11 @@ static int init_bars(struct pci_dev *pdev)
>>>>>>             if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
>>>>>>             {
>>>>>>                 bars[i].type = VPCI_BAR_IO;
>>>>>> +
>>>>>> +            rc = bar_ignore_access(pdev, reg, &bars[i]);
>>>>> This is wrong: you only want to ignore access to IO BARs for Arm, for
>>>>> x86 we should keep the previous behavior. Even more if you go with
>>>>> Jan's suggestions to make bar_ignore_access also applicable to dom0.
>>>> How do we want this?
>>>> #ifdef CONFIG_ARM?
>>> Afaic better via a new, dedicated CONFIG_HAVE_* setting, which x86 selects
>>> but Arm doesn't. Unless we have one already, of course ...
>> Could you please be more specific on the name you see appropriate?
> I'm pretty sure Linux has something similar, so I'd like to ask that
> you go look there.
Not sure, but I can have a look
>   I'm sorry to say this a little bluntly, but I'm
> really in need of doing something beyond answering your mails
Well, if answers were sometimes a bit more specific and less general,
this could definitely help and save a lot of time spent trying
to guess what the other party has in mind.
>   (and
> in part re-stating the same thing again and again).
I have no comments on this.
>
>> And do you realize that this is going to be a single user of such a
>> setting?
> Yes, but I'm not sure this is going to remain just a single use.
> Furthermore every CONFIG_<arch> is problematic as soon as a new port
> is being worked on. If we wanted to go with a CONFIG_<arch> here, imo
> it ought to be CONFIG_X86, not CONFIG_ARM, as I/O ports are really an
> x86-specific thing (which has propagated into other architectures in
> more or less strange ways, but never as truly I/O ports).
I am fine using CONFIG_X86
@Roger, are you ok with that?
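
A minimal sketch of what such a guard might look like, assuming CONFIG_X86
ends up being the symbol used. The helper name io_bar_is_ignored() is made
up for illustration and is not part of the patch; the real change would sit
in init_bars() around the bar_ignore_access() call:

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not from the series): decide at build time whether
 * accesses to I/O BARs are ignored. On x86 the previous behaviour is kept,
 * on every other architecture I/O BAR accesses are ignored.
 */
static bool io_bar_is_ignored(void)
{
#ifdef CONFIG_X86
    return false;   /* x86: keep handling I/O BARs as before */
#else
    return true;    /* e.g. Arm: no real I/O port space, ignore accesses */
#endif
}
```

Keying on CONFIG_X86 rather than CONFIG_ARM means a new port gets the
"ignore" behaviour by default without touching this code.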
>
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-08 10:22                 ` Oleksandr Andrushchenko
@ 2022-02-08 10:29                   ` Jan Beulich
  2022-02-08 10:52                     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-08 10:29 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau

On 08.02.2022 11:22, Oleksandr Andrushchenko wrote:
> 
> 
> On 08.02.22 12:09, Jan Beulich wrote:
>> On 08.02.2022 10:55, Oleksandr Andrushchenko wrote:
>>>
>>> On 08.02.22 11:44, Jan Beulich wrote:
>>>> On 08.02.2022 10:27, Oleksandr Andrushchenko wrote:
>>>>> On 08.02.22 11:13, Jan Beulich wrote:
>>>>>> On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
>>>>>>> On 07.02.22 18:28, Jan Beulich wrote:
>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>>>>>                              pci_to_dev(pdev), flag);
>>>>>>>>>          }
>>>>>>>>>      
>>>>>>>>> +    rc = vpci_assign_device(d, pdev);
>>>>>>>>> +
>>>>>>>>>       done:
>>>>>>>>>          if ( rc )
>>>>>>>>>              printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>>>>>> There's no attempt to undo anything in the case of getting back an
>>>>>>>> error. ISTR this being deemed okay on the basis that the tool stack
>>>>>>>> would then take whatever action, but whatever it is that is supposed
>>>>>>>> to deal with errors here wants spelling out in the description.
>>>>>>> Why? I don't change the previously expected decision and implementation
>>>>>>> of the assign_device function: I use error paths as they were used before
>>>>>>> for the existing code. So, I see no clear reason to stress that the existing
>>>>>>> and new code relies on the toolstack
>>>>>> Saying half a sentence on this is helping review.
>>>>> Ok
>>>>>>>> What's important is that no caller up the call tree may be left with
>>>>>>>> the impression that the device is still owned by the original
>>>>>>>> domain. With how you have it, the device is going to be owned by the
>>>>>>>> new domain, but not really usable.
>>>>>>> This is not true: vpci_assign_device will call vpci_deassign_device
>>>>>>> internally if it fails. So, the device won't be assigned in this case
>>>>>> No. The device is assigned to whatever pdev->domain holds. Calling
>>>>>> vpci_deassign_device() there merely makes sure that the device will
>>>>>> have _no_ vPCI data and hooks in place, rather than something
>>>>>> partial.
>>>>> So, this patch is only dealing with vpci assign/de-assign
>>>>> And it rolls back what it did in case of a failure
>>>>> It also returns rc in assign_device to signal it has failed
>>>>> What else is expected from this patch??
>>>> Until now if assign_device() returns an error, this tells the caller
>>>> that the device did not change ownership;
>>> Not sure this is the case:
>>>       if ( (rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
>>>                             pci_to_dev(pdev), flag)) )
>>> iommu_call can leave the new ownership even now without
>>> vpci_assign_device.
>> Did you check the actual hook functions for when exactly the ownership
>> change happens? For both VT-d and AMD it is the last thing they do,
>> when no error can occur anymore.
> This functionality does not exist for Arm yet, so this is up to the
> future series to add that.
> 
> WRT the existing code:
> 
> static int amd_iommu_assign_device(struct domain *d, u8 devfn,
>                                     struct pci_dev *pdev,
>                                     u32 flag)
> {
>      if ( !rc )
>          rc = reassign_device(pdev->domain, d, devfn, pdev); <<<<< this will set pdev->domain
> 
>      if ( rc && !is_hardware_domain(d) )
>      {
>          int ret = amd_iommu_reserve_domain_unity_unmap(
>                        d, ivrs_mappings[req_id].unity_map);
> 
>          if ( ret )
>          {
>              printk(XENLOG_ERR "AMD-Vi: "
>                     "unity-unmap for %pd/%04x:%02x:%02x.%u failed (%d)\n",
>                     d, pdev->seg, pdev->bus,
>                     PCI_SLOT(devfn), PCI_FUNC(devfn), ret);
>              domain_crash(d);
>          }
> So....
> 
> IMO it is wrong in the first place to let IOMMU code assign pdev->domain.
> This is something that needs to be done by the PCI code itself rather
> than relying on each IOMMU callback implementation.
>>
>>   My understanding is that the roll-back is
>>> expected to be performed by the toolstack and vpci_assign_device
>>> doesn't prevent that by returning rc. Even more, before we discussed
>>> that it would be good for vpci_assign_device to try recovering from
>>> a possible error early which is done by calling vpci_deassign_device
>>> internally.
>> Yes, but that's only part of it. It at least needs considering what
>> effects have resulted from operations prior to vpci_assign_device().
> Taking into account the code snippet above: what is your expectation
> from this patch in this respect?

You did note the domain_crash() in there, didn't you? The snippet above
still matches the "device not assigned to an alive DomU" criterion (which
can be translated to "no exposure of a device to an untrusted entity in
case of error"). Such domain_crash() uses aren't nice, and I'd prefer to
see them go away, but said property needs to be retained with any
alternative solutions.

Jan




* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-08 10:11                                                     ` Roger Pau Monné
@ 2022-02-08 10:32                                                       ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08 10:32 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 08.02.22 12:11, Roger Pau Monné wrote:
> On Mon, Feb 07, 2022 at 05:37:49PM +0100, Jan Beulich wrote:
>> On 07.02.2022 17:21, Oleksandr Andrushchenko wrote:
>>>
>>> On 07.02.22 18:15, Jan Beulich wrote:
>>>> On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
>>>>> On 07.02.22 17:26, Jan Beulich wrote:
>>>>>> 1b. Make vpci_write use write lock for writes to command register and BARs
>>>>>> only; keep using the read lock for all other writes.
>>>>> I am not quite sure how to do that. Do you mean something like:
>>>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>                    uint32_t data)
>>>>> [snip]
>>>>>        list_for_each_entry ( r, &pdev->vpci->handlers, node )
>>>>> {
>>>>> [snip]
>>>>>        if ( r->needs_write_lock)
>>>>>            write_lock(d->vpci_lock)
>>>>>        else
>>>>>            read_lock(d->vpci_lock)
>>>>> ....
>>>>>
>>>>> And provide rw as an argument to:
>>>>>
>>>>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>>>>                          vpci_write_t *write_handler, unsigned int offset,
>>>>>                          unsigned int size, void *data, --->>> bool write_path <<<-----)
>>>>>
>>>>> Is this what you mean?
>>>> This sounds overly complicated. You can derive locally in vpci_write(),
>>>> from just its "reg" and "size" parameters, whether the lock needs taking
>>>> in write mode.
>>> Yes, I started writing a reply with that. So, the summary (ROM
>>> position depends on header type):
>>> if ( (reg == PCI_COMMAND) || (reg == ROM) )
>>> {
>>>       read PCI_COMMAND and see if memory or IO decoding are enabled.
>>>       if ( enabled )
>>>           write_lock(d->vpci_lock)
>>>       else
>>>           read_lock(d->vpci_lock)
>>> }
>> Hmm, yes, you can actually get away without using "size", since both
>> command register and ROM BAR are 32-bit aligned registers, and 64-bit
>> accesses get split in vpci_ecam_write().
>>
>> For the command register the memory- / IO-decoding-enabled check may
>> end up a little more complicated, as the value to be written also
>> matters. Maybe read the command register only for the ROM BAR write,
>> using the write lock uniformly for all command register writes?
>>
>>> Do you also think we can drop pdev->vpci (or currently pdev->vpci->lock)
>>> at all then?
>> I haven't looked at this in any detail, sorry. It sounds possible,
>> yes.
> AFAICT you should avoid taking the per-device vpci lock when you take
> the per-domain lock in write mode. Otherwise you still need the
> per-device vpci lock in order to keep consistency between concurrent
> accesses to the device registers.
I have sent an e-mail this morning describing possible locking schemes.
Could we please move there and continue if you don't mind?
>
> Thanks, Roger.
Thank you in advance,
Oleksandr
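
For reference, the lock-mode selection discussed in the quoted exchange
above could be sketched roughly as follows. This is only an illustration
under stated assumptions: the enum, the function name and the rom_reg
parameter are invented here, and the rule encoded is the one from the
thread (write lock for every command register write; for a ROM BAR write,
write lock only if memory or I/O decoding is currently enabled):

```c
#include <stdint.h>

/* Standard PCI config space constants. */
#define PCI_COMMAND         0x04
#define PCI_COMMAND_IO      0x1
#define PCI_COMMAND_MEMORY  0x2

/* Hypothetical names, used only for this sketch. */
enum vpci_lock_mode { VPCI_LOCK_READ, VPCI_LOCK_WRITE };

/*
 * reg:     register being written (already split to <= 32-bit accesses)
 * rom_reg: offset of the ROM BAR, which depends on the header type
 * cur_cmd: current value of the command register
 */
static enum vpci_lock_mode vpci_write_lock_mode(unsigned int reg,
                                                unsigned int rom_reg,
                                                uint16_t cur_cmd)
{
    /* Command register writes always take the lock in write mode. */
    if ( reg == PCI_COMMAND )
        return VPCI_LOCK_WRITE;

    /* ROM BAR writes only matter for mappings while decoding is on. */
    if ( reg == rom_reg )
        return (cur_cmd & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY))
               ? VPCI_LOCK_WRITE : VPCI_LOCK_READ;

    /* Everything else can go with the read lock. */
    return VPCI_LOCK_READ;
}
```

This deliberately ignores sub-register (byte) accesses, which the thread
notes would need a range check instead of an equality check.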


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-08  7:35                                                       ` Oleksandr Andrushchenko
  2022-02-08  8:57                                                         ` Jan Beulich
@ 2022-02-08 10:50                                                         ` Roger Pau Monné
  2022-02-08 11:13                                                           ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-08 10:50 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel

On Tue, Feb 08, 2022 at 07:35:34AM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 07.02.22 18:44, Oleksandr Andrushchenko wrote:
> >
> > On 07.02.22 18:37, Jan Beulich wrote:
> >> On 07.02.2022 17:21, Oleksandr Andrushchenko wrote:
> >>> On 07.02.22 18:15, Jan Beulich wrote:
> >>>> On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
> >>>>> On 07.02.22 17:26, Jan Beulich wrote:
> >>>>>> 1b. Make vpci_write use write lock for writes to command register and BARs
> >>>>>> only; keep using the read lock for all other writes.
> >>>>> I am not quite sure how to do that. Do you mean something like:
> >>>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >>>>>                     uint32_t data)
> >>>>> [snip]
> >>>>>         list_for_each_entry ( r, &pdev->vpci->handlers, node )
> >>>>> {
> >>>>> [snip]
> >>>>>         if ( r->needs_write_lock)
> >>>>>             write_lock(d->vpci_lock)
> >>>>>         else
> >>>>>             read_lock(d->vpci_lock)
> >>>>> ....
> >>>>>
> >>>>> And provide rw as an argument to:
> >>>>>
> >>>>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
> >>>>>                           vpci_write_t *write_handler, unsigned int offset,
> >>>>>                           unsigned int size, void *data, --->>> bool write_path <<<-----)
> >>>>>
> >>>>> Is this what you mean?
> >>>> This sounds overly complicated. You can derive locally in vpci_write(),
> >>>> from just its "reg" and "size" parameters, whether the lock needs taking
> >>>> in write mode.
> >>> Yes, I started writing a reply with that. So, the summary (ROM
> >>> position depends on header type):
> >>> if ( (reg == PCI_COMMAND) || (reg == ROM) )
> >>> {
> >>>        read PCI_COMMAND and see if memory or IO decoding are enabled.
> >>>        if ( enabled )
> >>>            write_lock(d->vpci_lock)
> >>>        else
> >>>            read_lock(d->vpci_lock)
> >>> }
> >> Hmm, yes, you can actually get away without using "size", since both
> >> command register and ROM BAR are 32-bit aligned registers, and 64-bit
> >> accesses get split in vpci_ecam_write().
> > But the OS may want to read a single byte of the ROM BAR, so I think
> > I'll need to check whether reg+size falls into the PCI_COMMAND or
> > ROM BAR ranges
> >> For the command register the memory- / IO-decoding-enabled check may
> >> end up a little more complicated, as the value to be written also
> >> matters. Maybe read the command register only for the ROM BAR write,
> >> using the write lock uniformly for all command register writes?
> > Sounds good for the start.
> > Another concern is that if we go with a read_lock and then in the
> > underlying code we disable memory decoding and try doing
> > something and calling cmd_write handler for any reason then....
> >
> > I mean that the check in vpci_write is something we can tolerate,
> > but then it must be ensured that no code in the read path
> > is allowed to perform write path functions. Which brings a pretty
> > valid use-case: say in read mode we detect an unrecoverable error
> > and need to remove the device:
> > vpci_process_pending -> ERROR -> vpci_remove_device or similar.
> >
> > What do we do then? It is all going to be fragile...
> I have tried to summarize the options we have wrt locking
> and would love to hear from @Roger and @Jan.
> 
> In every variant there is a task of dealing with the overlap
> detection in modify_bars, so this is the only place as of now
> which needs special treatment.
> 
> Existing limitations: there is no way to upgrade a read lock to a write
> lock, so paths which may require write lock protection need to use
> write lock from the very beginning. Workarounds can be applied.
> 
> 1. Per-domain rw lock, aka d->vpci_lock
> ==============================================================
> Note: with per-domain rw lock it is possible to do without introducing
> per-device locks, so pdev->vpci->lock can be removed and no pdev->vpci_lock
> should be required.

Er, no, I think you still need a per-device lock unless you intend to
take the per-domain rwlock in write mode every time you modify data
in vpci. I still think you need pdev->vpci->lock. It's possible this
approach doesn't require moving the lock outside of the vpci struct.

> This is only going to work in case if vpci_write always takes the write lock
> and vpci_read takes a read lock and no path in vpci_read is allowed to
> perform write path operations.

I think that's likely too strong?

You could get away with both vpci_{read,write} only taking the read
lock and use a per-device vpci lock?

Otherwise you are likely to introduce contention in msix_write if a
guest makes heavy use of the MSI-X entry mask bit.

> vpci_process_pending uses the write lock as it has vpci_remove_device in its
> error path.
> 
> Pros:
> - no per-device vpci lock is needed?
> - solves overlap code ABBA in modify_bars
> 
> Cons:
> - all writes are serialized
> - need to carefully select read paths, so they are guaranteed not to lead
>    to lock upgrade use-cases
> 
> 1.1. Semi read lock upgrade in modify bars
> --------------------------------------------------------------
> In this case both vpci_read and vpci_write take a read lock and when it comes
> to modify_bars:
> 
> 1. read_unlock(d->vpci_lock)
> 2. write_lock(d->vpci_lock)
> 3. Check that pdev->vpci is still available and is the same object:
> if (pdev->vpci && (pdev->vpci == old_vpci) )
> {
>      /* vpci structure is valid and can be used. */
> }
> else
> {
>      /* vpci has gone, return an error. */
> }
> 
> Pros:
> - no per-device vpci lock is needed?
> - solves overlap code ABBA in modify_bars
> - readers and writers are NOT serialized
> - NO need to carefully select read paths, so they are guaranteed not to lead
>    to lock upgrade use-cases
> 
> Cons:
> - ???
> 
> 2. per-device lock (pdev->vpci_lock) + d->overlap_chk_lock
> ==============================================================
> In order to solve overlap ABBA, we introduce a per-domain helper
> lock to protect the overlapping code in modify_bars:
> 
>      old_vpci = pdev->vpci;
>      spin_unlock(pdev->vpci_lock);
>      spin_lock(pdev->domain->overlap_chk_lock);

Since you drop the pdev lock you get a window here where either vpci
or even pdev itself could be removed under your feet, so using
pdev->vpci_lock like you do below could dereference a stale pdev.

>      spin_lock(pdev->vpci_lock);
>      if ( pdev->vpci && (pdev->vpci == old_vpci) )
>          for_each_pdev ( pdev->domain, tmp )
>          {
>              if ( tmp != pdev )
>              {
>                  spin_lock(tmp->vpci_lock);
>                  if ( tmp->vpci )
>                      ...
>              }
>          }
> 
> Pros:
> - all accesses are independent, only the same device access is serialized
> - no need to care about readers and writers wrt read lock upgrade issues
> 
> Cons:
> - helper spin lock
> 
> 3. Move overlap detection into process pending
> ==============================================================
> There is a Roger's patch [1] which adds a possibility for vpci_process_pending
> to perform different tasks rather than just map/unmap. With this patch extended
> in a way that it can hold a request queue it is possible to delay execution
> of the overlap code until no pdev->vpci_lock is held, but before returning to
> a guest after vpci_{read|write} or similar.
> 
> Pros:
> - no need to emulate read lock upgrade
> - fully parallel read/write
> - queue in the vpci_process_pending will later on be used by SR-IOV,
>    so this is going to help the future code
> Cons:
> - ???

Maybe? It's hard to envision what that would end up looking like, and
whether it wouldn't still require this kind of double locking. We would
still need to prevent doing a rangeset_remove_range for the device we
are trying to set up the mapping for, at which point we still need to
lock the current device plus the device we are iterating against?

Since the code in vpci_process_pending is always executed in guest
vCPU context, requiring all guest vCPUs to be paused when doing a
device addition or removal would prevent devices from going away, but
we could still have issues with concurrent accesses from other vCPUs.

> 
> 4. Re-write overlap detection code
> ==============================================================
> It is possible to re-write overlap detection code, so the information about the
> mapped/unmapped regions is not read from vpci->header->bars[i] of each device,
> but instead there is a per-domain structure which holds the regions and
> implements reference counting.
> 
> Pros:
> - solves ABBA
> 
> Cons:
> - very complex code is expected
> 
> 5. You name it
> ==============================================================
> 
>  From all the above I would recommend we go with option 2 which seems to reliably
> solve ABBA and does not bring cons of the other approaches.

6. per-domain rwlock + per-device vpci lock

Introduce a vpci_header_write_lock(start, {end, size}) helper that
returns whether a range requires the per-domain lock in write mode. This
will only return true if the range overlaps with the ROM BAR or the
command register.

In vpci_{read,write}:

if ( vpci_header_write_lock(...) )
    /* Gain exclusive access to all of the domain pdevs vpci. */
    write_lock(d->vpci_lock);
else
{
    read_lock(d->vpci_lock);
    spin_lock(vpci->lock);
}
...

The vpci assign/deassign functions would need to be modified to write
lock the per-domain rwlock. The MSI-X table MMIO handler will also
need to read lock the per-domain vpci lock.
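
A rough sketch of the range check such a helper would perform is below.
The name vpci_header_write_lock comes from the proposal above, but the
exact signature, the is_bridge parameter and the ranges_overlap() helper
are assumptions made for this sketch only. The register offsets are the
standard PCI config space ones:

```c
#include <stdbool.h>

/* Standard PCI config space offsets. */
#define PCI_COMMAND       0x04  /* 16-bit command register */
#define PCI_ROM_ADDRESS   0x30  /* ROM BAR, type 0 (normal) header */
#define PCI_ROM_ADDRESS1  0x38  /* ROM BAR, type 1 (bridge) header */

/* True if [s1, s1 + n1) and [s2, s2 + n2) share at least one byte. */
static bool ranges_overlap(unsigned int s1, unsigned int n1,
                           unsigned int s2, unsigned int n2)
{
    return s1 < s2 + n2 && s2 < s1 + n1;
}

/*
 * Hypothetical signature: returns true if an access of 'size' bytes at
 * 'start' touches the command register or the ROM BAR, i.e. if the
 * per-domain lock would have to be taken in write mode.
 */
static bool vpci_header_write_lock(unsigned int start, unsigned int size,
                                   bool is_bridge)
{
    unsigned int rom = is_bridge ? PCI_ROM_ADDRESS1 : PCI_ROM_ADDRESS;

    return ranges_overlap(start, size, PCI_COMMAND, 2) ||
           ranges_overlap(start, size, rom, 4);
}
```

Using an overlap test rather than comparing register offsets for equality
also covers byte-sized accesses that hit the middle of either register.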

I think it's either something along the lines of my suggestion above,
or maybe option 3, albeit you would have to investigate how to
implement option 3.

Thanks, Roger.



* Re: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-08 10:29                   ` Jan Beulich
@ 2022-02-08 10:52                     ` Oleksandr Andrushchenko
  2022-02-08 11:00                       ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08 10:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau,
	Oleksandr Andrushchenko



On 08.02.22 12:29, Jan Beulich wrote:
> On 08.02.2022 11:22, Oleksandr Andrushchenko wrote:
>>
>> On 08.02.22 12:09, Jan Beulich wrote:
>>> On 08.02.2022 10:55, Oleksandr Andrushchenko wrote:
>>>> On 08.02.22 11:44, Jan Beulich wrote:
>>>>> On 08.02.2022 10:27, Oleksandr Andrushchenko wrote:
>>>>>> On 08.02.22 11:13, Jan Beulich wrote:
>>>>>>> On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
>>>>>>>> On 07.02.22 18:28, Jan Beulich wrote:
>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>>>>>>                               pci_to_dev(pdev), flag);
>>>>>>>>>>           }
>>>>>>>>>>       
>>>>>>>>>> +    rc = vpci_assign_device(d, pdev);
>>>>>>>>>> +
>>>>>>>>>>        done:
>>>>>>>>>>           if ( rc )
>>>>>>>>>>               printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>>>>>>> There's no attempt to undo anything in the case of getting back an
>>>>>>>>> error. ISTR this being deemed okay on the basis that the tool stack
>>>>>>>>> would then take whatever action, but whatever it is that is supposed
>>>>>>>>> to deal with errors here wants spelling out in the description.
>>>>>>>> Why? I don't change the previously expected decision and implementation
>>>>>>>> of the assign_device function: I use error paths as they were used before
>>>>>>>> for the existing code. So, I see no clear reason to stress that the existing
>>>>>>>> and new code relies on the toolstack
>>>>>>> Saying half a sentence on this is helping review.
>>>>>> Ok
>>>>>>>>> What's important is that no caller up the call tree may be left with
>>>>>>>>> the impression that the device is still owned by the original
>>>>>>>>> domain. With how you have it, the device is going to be owned by the
>>>>>>>>> new domain, but not really usable.
>>>>>>>> This is not true: vpci_assign_device will call vpci_deassign_device
>>>>>>>> internally if it fails. So, the device won't be assigned in this case
>>>>>>> No. The device is assigned to whatever pdev->domain holds. Calling
>>>>>>> vpci_deassign_device() there merely makes sure that the device will
>>>>>>> have _no_ vPCI data and hooks in place, rather than something
>>>>>>> partial.
>>>>>> So, this patch is only dealing with vpci assign/de-assign
>>>>>> And it rolls back what it did in case of a failure
>>>>>> It also returns rc in assign_device to signal it has failed
>>>>>> What else is expected from this patch??
>>>>> Until now if assign_device() returns an error, this tells the caller
>>>>> that the device did not change ownership;
>>>> Not sure this is the case:
>>>>        if ( (rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
>>>>                              pci_to_dev(pdev), flag)) )
>>>> iommu_call can leave the new ownership even now without
>>>> vpci_assign_device.
>>> Did you check the actual hook functions for when exactly the ownership
>>> change happens? For both VT-d and AMD it is the last thing they do,
>>> when no error can occur anymore.
>> This functionality does not exist for Arm yet, so this is up to the
>> future series to add that.
>>
>> WRT the existing code:
>>
>> static int amd_iommu_assign_device(struct domain *d, u8 devfn,
>>                                      struct pci_dev *pdev,
>>                                      u32 flag)
>> {
>>       if ( !rc )
>>           rc = reassign_device(pdev->domain, d, devfn, pdev); <<<<< this will set pdev->domain
>>
>>       if ( rc && !is_hardware_domain(d) )
>>       {
>>           int ret = amd_iommu_reserve_domain_unity_unmap(
>>                         d, ivrs_mappings[req_id].unity_map);
>>
>>           if ( ret )
>>           {
>>               printk(XENLOG_ERR "AMD-Vi: "
>>                      "unity-unmap for %pd/%04x:%02x:%02x.%u failed (%d)\n",
>>                      d, pdev->seg, pdev->bus,
>>                      PCI_SLOT(devfn), PCI_FUNC(devfn), ret);
>>               domain_crash(d);
>>           }
>> So....
>>
>> IMO it is wrong in the first place to let IOMMU code assign pdev->domain.
>> This is something that needs to be done by the PCI code itself rather
>> than relying on each IOMMU callback implementation.
>>>    My understanding is that the roll-back is
>>>> expected to be performed by the toolstack and vpci_assign_device
>>>> doesn't prevent that by returning rc. Even more, before we discussed
>>>> that it would be good for vpci_assign_device to try recovering from
>>>> a possible error early which is done by calling vpci_deassign_device
>>>> internally.
>>> Yes, but that's only part of it. It at least needs considering what
>>> effects have resulted from operations prior to vpci_assign_device().
>> Taking into account the code snippet above: what is your expectation
>> from this patch in this respect?
> You did note the domain_crash() in there, didn't you?
Which is an AMD-specific implementation, and it can be different for
other IOMMUs. Yes, I did.
> The snippet above
> still matches the "device not assigned to an alive DomU" criterion (which
> can be translated to "no exposure of a device to an untrusted entity in
> case of error"). Such domain_crash() uses aren't nice, and I'd prefer to
> see them go away, but said property needs to be retained with any
> alternative solutions.
This smells like we first need to fix the existing code, so
pdev->domain is not assigned by specific IOMMU implementations,
but instead controlled by the code which relies on that, assign_device.

I can have something like:

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 88836aab6baf..cc7790709a50 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1475,6 +1475,7 @@ static int device_assigned(u16 seg, u8 bus, u8 devfn)
  static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
  {
      const struct domain_iommu *hd = dom_iommu(d);
+    struct domain *old_owner;
      struct pci_dev *pdev;
      int rc = 0;

@@ -1490,6 +1491,9 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
      ASSERT(pdev && (pdev->domain == hardware_domain ||
                      pdev->domain == dom_io));

+    /* We need to restore the old owner in case of an error. */
+    old_owner = pdev->domain;
+
      vpci_deassign_device(pdev->domain, pdev);

      rc = pdev_msix_assign(d, pdev);
@@ -1515,8 +1519,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)

   done:
      if ( rc )
+    {
          printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
                 d, &PCI_SBDF3(seg, bus, devfn), rc);
+        /* We failed to assign, so restore the previous owner. */
+        pdev->domain = old_owner;
+    }
      /* The device is assigned to dom_io so mark it as quarantined */
      else if ( d == dom_io )
          pdev->quarantine = true;

But I do not think this belongs in this patch
>
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-08 10:52                     ` Oleksandr Andrushchenko
@ 2022-02-08 11:00                       ` Jan Beulich
  2022-02-08 11:25                         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-08 11:00 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau

On 08.02.2022 11:52, Oleksandr Andrushchenko wrote:
> This smells like we first need to fix the existing code, so
> pdev->domain is not assigned by specific IOMMU implementations,
> but instead controlled by the code which relies on that, assign_device.

Feel free to come up with proposals for how to do so cleanly. Moving the
assignment to pdev->domain may even be possible now, but if you go
back you may find that the code was quite different earlier on.

> I can have something like:
> 
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 88836aab6baf..cc7790709a50 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1475,6 +1475,7 @@ static int device_assigned(u16 seg, u8 bus, u8 devfn)
>   static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>   {
>       const struct domain_iommu *hd = dom_iommu(d);
> +    struct domain *old_owner;
>       struct pci_dev *pdev;
>       int rc = 0;
> 
> @@ -1490,6 +1491,9 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>       ASSERT(pdev && (pdev->domain == hardware_domain ||
>                       pdev->domain == dom_io));
> 
> +    /* We need to restore the old owner in case of an error. */
> +    old_owner = pdev->domain;
> +
>       vpci_deassign_device(pdev->domain, pdev);
> 
>       rc = pdev_msix_assign(d, pdev);
> @@ -1515,8 +1519,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
> 
>    done:
>       if ( rc )
> +    {
>           printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>                  d, &PCI_SBDF3(seg, bus, devfn), rc);
> +        /* We failed to assign, so restore the previous owner. */
> +        pdev->domain = old_owner;
> +    }
>       /* The device is assigned to dom_io so mark it as quarantined */
>       else if ( d == dom_io )
>           pdev->quarantine = true;
> 
> But I do not think this belongs in this patch

Indeed. Plus I'm sure you understand that it's not that simple. Assigning
to pdev->domain is only the last step of assignment. Restoring the original
owner would entail putting in place the original IOMMU table entries as
well, which in turn can fail. Hence why you'll find a number of uses of
domain_crash() in places where rolling back is far from easy.

Jan




* Re: [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2022-02-08  9:58             ` Oleksandr Andrushchenko
@ 2022-02-08 11:11               ` Roger Pau Monné
  2022-02-08 11:29                 ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-08 11:11 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel

On Tue, Feb 08, 2022 at 09:58:40AM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 08.02.22 11:52, Jan Beulich wrote:
> > On 08.02.2022 10:38, Oleksandr Andrushchenko wrote:
> >>
> >> On 08.02.22 11:33, Jan Beulich wrote:
> >>> On 08.02.2022 09:13, Oleksandr Andrushchenko wrote:
> >>>> On 04.02.22 16:25, Jan Beulich wrote:
> >>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> >>>>>> --- a/xen/drivers/vpci/header.c
> >>>>>> +++ b/xen/drivers/vpci/header.c
> >>>>>> @@ -454,6 +454,22 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
> >>>>>>             pci_conf_write16(pdev->sbdf, reg, cmd);
> >>>>>>     }
> >>>>>>     
> >>>>>> +static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
> >>>>>> +                            uint32_t cmd, void *data)
> >>>>>> +{
> >>>>>> +    /* TODO: Add proper emulation for all bits of the command register. */
> >>>>>> +
> >>>>>> +#ifdef CONFIG_HAS_PCI_MSI
> >>>>>> +    if ( pdev->vpci->msi->enabled || pdev->vpci->msix->enabled )
> >>>>>> +    {
> >>>>>> +        /* Guest wants to enable INTx. It can't be enabled if MSI/MSI-X enabled. */
> >>>>>> +        cmd |= PCI_COMMAND_INTX_DISABLE;
> >>>>>> +    }
> >>>>>> +#endif
> >>>>>> +
> >>>>>> +    cmd_write(pdev, reg, cmd, data);
> >>>>>> +}
> >>>>> It's not really clear to me whether the TODO warrants this being a
> >>>>> separate function. Personally I'd find it preferable if the logic
> >>>>> was folded into cmd_write().
> >>>> Not sure cmd_write needs to have guest's logic. And what's the
> >>>> profit? Later on, when we decide how PCI_COMMAND can be emulated
> >>>> this code will live in guest_cmd_write anyways
> >>> Why "will"? There's nothing conceptually wrong with putting all the
> >>> emulation logic into cmd_write(), inside an if(!hwdom) conditional.
> >>> If and when we gain CET-IBT support on the x86 side (and I'm told
> >>> there's an Arm equivalent of this), then to make this as useful as
> >>> possible it is going to be desirable to limit the number of functions
> >>> called through function pointers. You may have seen Andrew's huge
> >>> "x86: Support for CET Indirect Branch Tracking" series. We want to
> >>> keep down the number of such annotations; the vast part of the series
> >>> is about adding of such.
> >> Well, while I see nothing bad with that, from the code organization
> >> it would look a bit strange: we don't differentiate hwdom in vpci
> >> handlers, but instead provide one for hwdom and one for guests.
> >> While I understand your concern I still think that at the moment
> >> it will be more in line with the existing code if we provide a dedicated
> >> handler.
> > The existing code only deals with Dom0, and hence doesn't have any
> > pairs of handlers.
> This is fair
> >   FTAOD what I said above applies equally to other
> > separate guest read/write handlers you may be introducing. The
> > exception being when e.g. a hardware access handler is put in place
> > for Dom0 (for obvious reasons, I think).
> @Roger, what's your preference here?
> >

The newly introduced handler ends up calling the existing one, so in
this case it might make sense to expand cmd_write to also cater for
the domU case?

I think we need to be sensible here in that we don't want to end up
with handlers like:

register_read(...)
{
   if ( is_hardware_domain() )
       ....
   else
       ...
}

If there's shared code it's IMO better to not create a guest-specific
handler.

It's also more risky to use the same handlers for dom0 and domU, as a
change intended for dom0 only might end up leaking into the domU path and
that could easily become a security issue.
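
To illustrate the shared-code case: the guest-only part of the proposed
guest_cmd_write() boils down to a filter on the value being written, which
could be applied before the common write path. A minimal standalone sketch,
assuming the MSI/MSI-X enabled state is already known to the caller;
guest_cmd_filter() and guest_cmd_filter_demo() are hypothetical names, not
existing vPCI functions, and only the PCI_COMMAND_INTX_DISABLE bit value is
taken from the PCI specification:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 10 of the PCI command register: INTx emulation disable. */
#define PCI_COMMAND_INTX_DISABLE 0x400

/*
 * Hypothetical helper: adjust a guest-written command value.
 * INTx must stay disabled while MSI or MSI-X is enabled, so the
 * disable bit is forced on in that case; all other bits pass through.
 */
static uint16_t guest_cmd_filter(uint16_t cmd, bool msi_enabled,
                                 bool msix_enabled)
{
    if ( msi_enabled || msix_enabled )
        cmd |= PCI_COMMAND_INTX_DISABLE;

    return cmd;
}

/* Usage example: the filter only ever sets the INTx disable bit. */
void guest_cmd_filter_demo(void)
{
    /* MSI enabled: guest tries to clear INTX_DISABLE, it is forced back. */
    assert(guest_cmd_filter(0x0006, true, false) == 0x0406);
    /* Neither MSI nor MSI-X enabled: value is left untouched. */
    assert(guest_cmd_filter(0x0006, false, false) == 0x0006);
}
```

Whether such a filter is called from a dedicated guest handler or from
cmd_write() itself is exactly the open question here; the sketch takes no
side on that.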

Thanks, Roger.



* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-08 10:50                                                         ` Roger Pau Monné
@ 2022-02-08 11:13                                                           ` Oleksandr Andrushchenko
  2022-02-08 13:38                                                             ` Roger Pau Monné
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08 11:13 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 08.02.22 12:50, Roger Pau Monné wrote:
> On Tue, Feb 08, 2022 at 07:35:34AM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 07.02.22 18:44, Oleksandr Andrushchenko wrote:
>>> On 07.02.22 18:37, Jan Beulich wrote:
>>>> On 07.02.2022 17:21, Oleksandr Andrushchenko wrote:
>>>>> On 07.02.22 18:15, Jan Beulich wrote:
>>>>>> On 07.02.2022 17:07, Oleksandr Andrushchenko wrote:
>>>>>>> On 07.02.22 17:26, Jan Beulich wrote:
>>>>>>>> 1b. Make vpci_write use write lock for writes to command register and BARs
>>>>>>>> only; keep using the read lock for all other writes.
>>>>>>> I am not quite sure how to do that. Do you mean something like:
>>>>>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>>>                      uint32_t data)
>>>>>>> [snip]
>>>>>>>          list_for_each_entry ( r, &pdev->vpci->handlers, node )
>>>>>>> {
>>>>>>> [snip]
>>>>>>>          if ( r->needs_write_lock)
>>>>>>>              write_lock(d->vpci_lock)
>>>>>>>          else
>>>>>>>              read_lock(d->vpci_lock)
>>>>>>> ....
>>>>>>>
>>>>>>> And provide rw as an argument to:
>>>>>>>
>>>>>>> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>>>>>>                            vpci_write_t *write_handler, unsigned int offset,
>>>>>>>                            unsigned int size, void *data, --->>> bool write_path <<<-----)
>>>>>>>
>>>>>>> Is this what you mean?
>>>>>> This sounds overly complicated. You can derive locally in vpci_write(),
>>>>>> from just its "reg" and "size" parameters, whether the lock needs taking
>>>>>> in write mode.
>>>>> Yes, I started writing a reply with that. So, the summary (ROM
>>>>> position depends on header type):
>>>>> if ( (reg == PCI_COMMAND) || (reg == ROM) )
>>>>> {
>>>>>         read PCI_COMMAND and see if memory or IO decoding are enabled.
>>>>>         if ( enabled )
>>>>>             write_lock(d->vpci_lock)
>>>>>         else
>>>>>             read_lock(d->vpci_lock)
>>>>> }
>>>> Hmm, yes, you can actually get away without using "size", since both
>>>> command register and ROM BAR are 32-bit aligned registers, and 64-bit
>>>> accesses get split in vpci_ecam_write().
>>> But the OS may want to read a single byte of the ROM BAR, so I think
>>> I'll need to check if reg+size falls into the PCI_COMMAND and ROM BAR
>>> ranges
>>>> For the command register the memory- / IO-decoding-enabled check may
>>>> end up a little more complicated, as the value to be written also
>>>> matters. Maybe read the command register only for the ROM BAR write,
>>>> using the write lock uniformly for all command register writes?
>>> Sounds good for the start.
>>> Another concern is that if we go with a read_lock and then in the
>>> underlying code we disable memory decoding and try doing
>>> something and calling cmd_write handler for any reason then....
>>>
>>> I mean that the check in vpci_write is something we can tolerate,
>>> but then it must be considered that no code in the read path
>>> is allowed to perform write path functions. Which brings a pretty
>>> valid use-case: say in read mode we detect an unrecoverable error
>>> and need to remove the device:
>>> vpci_process_pending -> ERROR -> vpci_remove_device or similar.
>>>
>>> What do we do then? It is all going to be fragile...
>> I have tried to summarize the options we have wrt locking
>> and would love to hear from @Roger and @Jan.
>>
>> In every variant there is a task of dealing with the overlap
>> detection in modify_bars, so this is the only place as of now
>> which needs special treatment.
>>
>> Existing limitations: there is no way to upgrade a read lock to a write
>> lock, so paths which may require write lock protection need to use
>> write lock from the very beginning. Workarounds can be applied.
>>
>> 1. Per-domain rw lock, aka d->vpci_lock
>> ==============================================================
>> Note: with per-domain rw lock it is possible to do without introducing
>> per-device locks, so pdev->vpci->lock can be removed and no pdev->vpci_lock
>> should be required.
> Er, no, I think you still need a per-device lock unless you intend to
> take the per-domain rwlock in write mode every time you modify data
> in vpci.
This is exactly the assumption stated below. I am trying to discuss
all the possible options, so this one is also listed
>   I still think you need pdev->vpci->lock. It's possible this
> approach doesn't require moving the lock outside of the vpci struct.
>
>> This is only going to work in case if vpci_write always takes the write lock
>> and vpci_read takes a read lock and no path in vpci_read is allowed to
>> perform write path operations.
> I think that's likely too strong?
>
> You could get away with both vpci_{read,write} only taking the read
> lock and use a per-device vpci lock?
But as discussed before:
- if pdev->vpci_lock is used this still leads to ABBA
- we need to know beforehand whether to take the write lock
>
> Otherwise you are likely to introduce contention in msix_write if a
> guest makes heavy use of the MSI-X entry mask bit.
>
>> vpci_process_pending uses write lock as it has vpci_remove_device in its
>> error path.
>>
>> Pros:
>> - no per-device vpci lock is needed?
>> - solves overlap code ABBA in modify_bars
>>
>> Cons:
>> - all writes are serialized
>> - need to carefully select read paths, so they are guaranteed not to lead
>>     to lock upgrade use-cases
>>
>> 1.1. Semi read lock upgrade in modify bars
>> --------------------------------------------------------------
>> In this case both vpci_read and vpci_write take a read lock and when it comes
>> to modify_bars:
>>
>> 1. read_unlock(d->vpci_lock)
>> 2. write_lock(d->vpci_lock)
>> 3. Check that pdev->vpci is still available and is the same object:
>> if (pdev->vpci && (pdev->vpci == old_vpci) )
>> {
>>       /* vpci structure is valid and can be used. */
>> }
>> else
>> {
>>       /* vpci has gone, return an error. */
>> }
>>
>> Pros:
>> - no per-device vpci lock is needed?
>> - solves overlap code ABBA in modify_bars
>> - readers and writers are NOT serialized
>> - NO need to carefully select read paths, so they are guaranteed not to lead
>>     to lock upgrade use-cases
>>
>> Cons:
>> - ???
>>
>> 2. per-device lock (pdev->vpci_lock) + d->overlap_chk_lock
>> ==============================================================
>> In order to solve overlap ABBA, we introduce a per-domain helper
>> lock to protect the overlapping code in modify_bars:
>>
>>       old_vpci = pdev->vpci;
>>       spin_unlock(pdev->vpci_lock);
>>       spin_lock(pdev->domain->overlap_chk_lock);
> Since you drop the pdev lock you get a window here where either vpci
> or even pdev itself could be removed under your feet, so using
> pdev->vpci_lock like you do below could dereference a stale pdev.
pdev is not protected by the pcidevs lock here anyway, so even
now it is possible for pdev to disappear in between.
We do not use pcidevs_lock in MMIO handlers...
>
>>       spin_lock(pdev->vpci_lock);
>>       if ( pdev->vpci && (pdev->vpci == old_vpci) )
>>           for_each_pdev ( pdev->domain, tmp )
>>           {
>>               if ( tmp != pdev )
>>               {
>>                   spin_lock(tmp->vpci_lock);
>>                   if ( tmp->vpci )
>>                       ...
>>               }
>>           }
>>
>> Pros:
>> - all accesses are independent, only the same device access is serialized
>> - no need to care about readers and writers wrt read lock upgrade issues
>>
>> Cons:
>> - helper spin lock
>>
>> 3. Move overlap detection into process pending
>> ==============================================================
>> There is a Roger's patch [1] which adds a possibility for vpci_process_pending
>> to perform different tasks rather than just map/unmap. With this patch extended
>> in a way that it can hold a request queue it is possible to delay execution
>> of the overlap code until no pdev->vpci_lock is held, but before returning to
>> a guest after vpci_{read|write} or similar.
>>
>> Pros:
>> - no need to emulate read lock upgrade
>> - fully parallel read/write
>> - queue in the vpci_process_pending will later on be used by SR-IOV,
>>     so this is going to help the future code
>> Cons:
>> - ???
> Maybe? It's hard to devise how that would end up looking like, and
> whether it won't still require such kind of double locking. We would
> still need to prevent doing a rangeset_remove_range for the device we
> are trying to setup the mapping for, at which point we still need to
> lock the current device plus the device we are iterating against?
>
> Since the code in vpci_process_pending is always executed in guest
> vCPU context requiring all guest vCPUs to be paused when doing a
> device addition or removal would prevent devices from going away, but
> we could still have issues with concurrent accesses from other vCPUs.
Yes, I understand that this may not be easily done, but this is still
an option.
>
>> 4. Re-write overlap detection code
>> ==============================================================
>> It is possible to re-write overlap detection code, so the information about the
>> mapped/unmapped regions is not read from vpci->header->bars[i] of each device,
>> but instead there is a per-domain structure which holds the regions and
>> implements reference counting.
>>
>> Pros:
>> - solves ABBA
>>
>> Cons:
>> - very complex code is expected
>>
>> 5. You name it
>> ==============================================================
>>
>>   From all the above I would recommend we go with option 2 which seems to reliably
>> solve ABBA and does not bring cons of the other approaches.
> 6. per-domain rwlock + per-device vpci lock
>
> Introduce vpci_header_write_lock(start, {end, size}) helper: return
> whether a range requires the per-domain lock in write mode. This will
> only return true if the range overlaps with the BAR ROM or the command
> register.
>
> In vpci_{read,write}:
>
> if ( vpci_header_write_lock(...) )
>      /* Gain exclusive access to all of the domain pdevs vpci. */
>      write_lock(d->vpci);
> else
> {
>      read_lock(d->vpci);
>      spin_lock(vpci->lock);
> }
> ...
>
> The vpci assign/deassign functions would need to be modified to write
> lock the per-domain rwlock. The MSI-X table MMIO handler will also
> need to read lock the per domain vpci lock.
Ok, so it seems you are in favor of this implementation and I have
no objection as well. The only limitation we should be aware of is
that once a path has acquired the read lock it is not possible to do
any write path operations in there.
vpci_process_pending will acquire write lock though as it can
lead to vpci_remove_device on its error path.

So, I am going to implement pdev->vpci->lock + d->vpci_lock
>
> I think it's either something along the lines of my suggestion above,
> or maybe option 3, albeit you would have to investigate how to
> implement option 3.
>
> Thanks, Roger.

@Roger, @Jan!
Thank you!!


* Re: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-08 11:00                       ` Jan Beulich
@ 2022-02-08 11:25                         ` Oleksandr Andrushchenko
  2022-02-10  8:21                           ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08 11:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau,
	Oleksandr Andrushchenko



On 08.02.22 13:00, Jan Beulich wrote:
> On 08.02.2022 11:52, Oleksandr Andrushchenko wrote:
>> This smells like we first need to fix the existing code, so
>> pdev->domain is not assigned by specific IOMMU implementations,
>> but instead controlled by the code which relies on that, assign_device.
> Feel free to come up with proposals how to cleanly do so. Moving the
> assignment to pdev->domain may even be possible now, but if you go
> back you may find that the code was quite different earlier on.
I do understand that as the code evolves new use cases bring
new issues.
>
>> I can have something like:
>>
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 88836aab6baf..cc7790709a50 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -1475,6 +1475,7 @@ static int device_assigned(u16 seg, u8 bus, u8 devfn)
>>    static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>    {
>>        const struct domain_iommu *hd = dom_iommu(d);
>> +    struct domain *old_owner;
>>        struct pci_dev *pdev;
>>        int rc = 0;
>>
>> @@ -1490,6 +1491,9 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>        ASSERT(pdev && (pdev->domain == hardware_domain ||
>>                        pdev->domain == dom_io));
>>
>> +    /* We need to restore the old owner in case of an error. */
>> +    old_owner = pdev->domain;
>> +
>>        vpci_deassign_device(pdev->domain, pdev);
>>
>>        rc = pdev_msix_assign(d, pdev);
>> @@ -1515,8 +1519,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>
>>     done:
>>        if ( rc )
>> +    {
>>            printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>                   d, &PCI_SBDF3(seg, bus, devfn), rc);
>> +        /* We failed to assign, so restore the previous owner. */
>> +        pdev->domain = old_owner;
>> +    }
>>        /* The device is assigned to dom_io so mark it as quarantined */
>>        else if ( d == dom_io )
>>            pdev->quarantine = true;
>>
>> But I do not think this belongs to this patch
> Indeed. Plus I'm sure you understand that it's not that simple. Assigning
> to pdev->domain is only the last step of assignment. Restoring the original
> owner would entail putting in place the original IOMMU table entries as
> well, which in turn can fail. Hence why you'll find a number of uses of
> domain_crash() in places where rolling back is far from easy.
So, why don't we just rely on the toolstack to do the rollback then?
This way we won't add new domain_crash() calls.
I do understand that we may leave Xen in a wrong state, though.
So, do you think it would work if we just call deassign_device from
assign_device on the error path? This is just what I do in vpci_assign_device:
I call vpci_deassign_device if the former fails.
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2022-02-08 11:11               ` Roger Pau Monné
@ 2022-02-08 11:29                 ` Oleksandr Andrushchenko
  2022-02-08 14:09                   ` Roger Pau Monné
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08 11:29 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 08.02.22 13:11, Roger Pau Monné wrote:
> On Tue, Feb 08, 2022 at 09:58:40AM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 08.02.22 11:52, Jan Beulich wrote:
>>> On 08.02.2022 10:38, Oleksandr Andrushchenko wrote:
>>>> On 08.02.22 11:33, Jan Beulich wrote:
>>>>> On 08.02.2022 09:13, Oleksandr Andrushchenko wrote:
>>>>>> On 04.02.22 16:25, Jan Beulich wrote:
>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>>>> @@ -454,6 +454,22 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>              pci_conf_write16(pdev->sbdf, reg, cmd);
>>>>>>>>      }
>>>>>>>>      
>>>>>>>> +static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>> +                            uint32_t cmd, void *data)
>>>>>>>> +{
>>>>>>>> +    /* TODO: Add proper emulation for all bits of the command register. */
>>>>>>>> +
>>>>>>>> +#ifdef CONFIG_HAS_PCI_MSI
>>>>>>>> +    if ( pdev->vpci->msi->enabled || pdev->vpci->msix->enabled )
>>>>>>>> +    {
>>>>>>>> +        /* Guest wants to enable INTx. It can't be enabled if MSI/MSI-X enabled. */
>>>>>>>> +        cmd |= PCI_COMMAND_INTX_DISABLE;
>>>>>>>> +    }
>>>>>>>> +#endif
>>>>>>>> +
>>>>>>>> +    cmd_write(pdev, reg, cmd, data);
>>>>>>>> +}
>>>>>>> It's not really clear to me whether the TODO warrants this being a
>>>>>>> separate function. Personally I'd find it preferable if the logic
>>>>>>> was folded into cmd_write().
>>>>>> Not sure cmd_write needs to have guest's logic. And what's the
>>>>>> profit? Later on, when we decide how PCI_COMMAND can be emulated
>>>>>> this code will live in guest_cmd_write anyways
>>>>> Why "will"? There's nothing conceptually wrong with putting all the
>>>>> emulation logic into cmd_write(), inside an if(!hwdom) conditional.
>>>>> If and when we gain CET-IBT support on the x86 side (and I'm told
>>>>> there's an Arm equivalent of this), then to make this as useful as
>>>>> possible it is going to be desirable to limit the number of functions
>>>>> called through function pointers. You may have seen Andrew's huge
>>>>> "x86: Support for CET Indirect Branch Tracking" series. We want to
>>>>> keep down the number of such annotations; the vast part of the series
>>>>> is about adding of such.
>>>> Well, while I see nothing bad with that, from the code organization
>>>> it would look a bit strange: we don't differentiate hwdom in vpci
>>>> handlers, but instead provide one for hwdom and one for guests.
>>>> While I understand your concern I still think that at the moment
>>>> it will be more in line with the existing code if we provide a dedicated
>>>> handler.
>>> The existing code only deals with Dom0, and hence doesn't have any
>>> pairs of handlers.
>> This is fair
>>>    FTAOD what I said above applies equally to other
>>> separate guest read/write handlers you may be introducing. The
>>> exception being when e.g. a hardware access handler is put in place
>>> for Dom0 (for obvious reasons, I think).
>> @Roger, what's your preference here?
> The newly introduced handler ends up calling the existing one,
But before doing so it implements guest specific logic which will be
extended as we add more bits of emulation
>   so in
> this case it might make sense to expand cmd_write to also cater for
> the domU case?
So, from the above I thought it was ok to have a dedicated handler.
>
> I think we need to be sensible here in that we don't want to end up
> with handlers like:
>
> register_read(...)
> {
>     if ( is_hardware_domain() )
>         ....
>     else
>         ...
> }
>
> If there's shared code it's IMO better to not create as guest specific
> handler.
>
> It's also more risky to use the same handlers for dom0 and domU, as a
> change intended to dom0 only might end up leaking in the domU path and
> that could easily become a security issue.
So, just to clarify, take BARs: is this something we also want
kept separate, or do we want if ( is_hwdom )?
I guess the former.
>
> Thanks, Roger.
Thank you,
Oleksandr


* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-08 11:13                                                           ` Oleksandr Andrushchenko
@ 2022-02-08 13:38                                                             ` Roger Pau Monné
  2022-02-08 13:52                                                               ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-08 13:38 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel

On Tue, Feb 08, 2022 at 11:13:41AM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 08.02.22 12:50, Roger Pau Monné wrote:
> > On Tue, Feb 08, 2022 at 07:35:34AM +0000, Oleksandr Andrushchenko wrote:
> >> 5. You name it
> >> ==============================================================
> >>
> >>   From all the above I would recommend we go with option 2 which seems to reliably
> >> solve ABBA and does not bring cons of the other approaches.
> > 6. per-domain rwlock + per-device vpci lock
> >
> > Introduce vpci_header_write_lock(start, {end, size}) helper: return
> > whether a range requires the per-domain lock in write mode. This will
> > only return true if the range overlaps with the BAR ROM or the command
> > register.
> >
> > In vpci_{read,write}:
> >
> > if ( vpci_header_write_lock(...) )
> >      /* Gain exclusive access to all of the domain pdevs vpci. */
> >      write_lock(d->vpci);
> > else
> > {
> >      read_lock(d->vpci);
> >      spin_lock(vpci->lock);
> > }
> > ...
> >
> > The vpci assign/deassign functions would need to be modified to write
> > lock the per-domain rwlock. The MSI-X table MMIO handler will also
> > need to read lock the per domain vpci lock.
> Ok, so it seems you are in favor of this implementation and I have
> no objection as well. The only limitation we should be aware of is
> that once a path has acquired the read lock it is not possible to do
> any write path operations in there.
> vpci_process_pending will acquire write lock though as it can
> lead to vpci_remove_device on its error path.
> 
> So, I am going to implement pdev->vpci->lock + d->vpci_lock

I think it's the less uncertain option.

As said, if you want to investigate whether you can successfully move
the checking into vpci_process_pending that would also be fine with
me, but I cannot assert it's going to be successful. OTOH I think the
per-domain rwlock + per-device spinlock seems quite likely to solve
our issues.
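
For reference, the range check behind the vpci_header_write_lock() helper
mentioned above could look roughly like the sketch below. This is a
standalone illustration under assumptions, not the eventual implementation:
the signature, the is_bridge parameter and ranges_overlap() are
hypothetical, and only the register offsets are the standard PCI
configuration space ones.

```c
#include <assert.h>
#include <stdbool.h>

/* Standard PCI configuration space offsets. */
#define PCI_COMMAND      0x04 /* 16-bit command register */
#define PCI_ROM_ADDRESS  0x30 /* expansion ROM BAR, type 0 header */
#define PCI_ROM_ADDRESS1 0x38 /* expansion ROM BAR, type 1 (bridge) header */

/* True if [a, a + a_size) and [b, b + b_size) share at least one byte. */
static bool ranges_overlap(unsigned int a, unsigned int a_size,
                           unsigned int b, unsigned int b_size)
{
    return a < b + b_size && b < a + a_size;
}

/*
 * Hypothetical helper: does an access of 'size' bytes at 'reg' touch the
 * command register or the ROM BAR, and hence need the per-domain lock in
 * write mode? The ROM BAR position depends on the header type.
 */
static bool vpci_header_write_lock(unsigned int reg, unsigned int size,
                                   bool is_bridge)
{
    unsigned int rom = is_bridge ? PCI_ROM_ADDRESS1 : PCI_ROM_ADDRESS;

    /* Config space accesses are 1, 2 or 4 bytes wide. */
    assert(size == 1 || size == 2 || size == 4);

    return ranges_overlap(reg, size, PCI_COMMAND, 2) ||
           ranges_overlap(reg, size, rom, 4);
}
```

Checking the full [reg, reg + size) range rather than just reg covers the
single-byte access case raised earlier in the thread.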

Thanks, Roger.



* Re: [PATCH v6 03/13] vpci: move lock outside of struct vpci
  2022-02-08 13:38                                                             ` Roger Pau Monné
@ 2022-02-08 13:52                                                               ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08 13:52 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 08.02.22 15:38, Roger Pau Monné wrote:
> On Tue, Feb 08, 2022 at 11:13:41AM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 08.02.22 12:50, Roger Pau Monné wrote:
>>> On Tue, Feb 08, 2022 at 07:35:34AM +0000, Oleksandr Andrushchenko wrote:
>>>> 5. You name it
>>>> ==============================================================
>>>>
>>>>    From all the above I would recommend we go with option 2 which seems to reliably
>>>> solve ABBA and does not bring cons of the other approaches.
>>> 6. per-domain rwlock + per-device vpci lock
>>>
>>> Introduce vpci_header_write_lock(start, {end, size}) helper: return
>>> whether a range requires the per-domain lock in write mode. This will
>>> only return true if the range overlaps with the BAR ROM or the command
>>> register.
>>>
>>> In vpci_{read,write}:
>>>
>>> if ( vpci_header_write_lock(...) )
>>>       /* Gain exclusive access to all of the domain pdevs vpci. */
>>>       write_lock(d->vpci);
>>> else
>>> {
>>>       read_lock(d->vpci);
>>>       spin_lock(vpci->lock);
>>> }
>>> ...
>>>
>>> The vpci assign/deassign functions would need to be modified to write
>>> lock the per-domain rwlock. The MSI-X table MMIO handler will also
>>> need to read lock the per domain vpci lock.
>> Ok, so it seems you are in favor of this implementation and I have
>> no objection as well. The only limitation we should be aware of is
>> that once a path has acquired the read lock it is not possible to do
>> any write path operations in there.
>> vpci_process_pending will acquire write lock though as it can
>> lead to vpci_remove_device on its error path.
>>
>> So, I am going to implement pdev->vpci->lock + d->vpci_lock
> I think it's the less uncertain option.
>
> As said, if you want to investigate whether you can successfully move
> the checking into vpci_process_pending that would also be fine with
> me, but I cannot assert it's going to be successful. OTOH I think the
> per-domain rwlock + per-device spinlock seems quite likely to solve
> our issues.
Ok, then I'll go with a per-domain rwlock + per-device spinlock,
taking the write lock in vpci_write for cmd + ROM. Of course, other
places such as vpci_remove_device and vpci_process_pending
will use the write lock.
>
> Thanks, Roger.
>
Thank you,
Oleksandr


* Re: [PATCH v6 06/13] vpci/header: implement guest BAR register handlers
  2022-02-08 10:29             ` Oleksandr Andrushchenko
@ 2022-02-08 13:58               ` Roger Pau Monné
  0 siblings, 0 replies; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-08 13:58 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, xen-devel, julien, sstabellini,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Artem Mygaiev,
	andrew.cooper3, george.dunlap, paul, Bertrand Marquis,
	Rahul Singh

On Tue, Feb 08, 2022 at 10:29:22AM +0000, Oleksandr Andrushchenko wrote:
> On 08.02.22 12:15, Jan Beulich wrote:
> > Yes, but I'm not sure this is going to remain just a single use.
> > Furthermore every CONFIG_<arch> is problematic as soon as a new port
> > is being worked on. If we wanted to go with a CONFIG_<arch> here, imo
> > it ought to be CONFIG_X86, not CONFIG_ARM, as I/O ports are really an
> > x86-specific thing (which has propagated into other architectures in
> > more or less strange ways, but never as truly I/O ports).
> I am fine using CONFIG_X86
> @Roger, are you ok with that?

I guess if that's the only instance of having diverging behavior
because of the lack of IO ports I'm fine with using CONFIG_X86.

Thanks, Roger.



* Re: [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2022-02-08 11:29                 ` Oleksandr Andrushchenko
@ 2022-02-08 14:09                   ` Roger Pau Monné
  2022-02-08 14:13                     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Roger Pau Monné @ 2022-02-08 14:09 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel

On Tue, Feb 08, 2022 at 11:29:07AM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 08.02.22 13:11, Roger Pau Monné wrote:
> > On Tue, Feb 08, 2022 at 09:58:40AM +0000, Oleksandr Andrushchenko wrote:
> >>
> >> On 08.02.22 11:52, Jan Beulich wrote:
> >>> On 08.02.2022 10:38, Oleksandr Andrushchenko wrote:
> >>>> On 08.02.22 11:33, Jan Beulich wrote:
> >>>>> On 08.02.2022 09:13, Oleksandr Andrushchenko wrote:
> >>>>>> On 04.02.22 16:25, Jan Beulich wrote:
> >>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> >>>>>>>> --- a/xen/drivers/vpci/header.c
> >>>>>>>> +++ b/xen/drivers/vpci/header.c
> >>>>>>>> @@ -454,6 +454,22 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
> >>>>>>>>              pci_conf_write16(pdev->sbdf, reg, cmd);
> >>>>>>>>      }
> >>>>>>>>      
> >>>>>>>> +static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
> >>>>>>>> +                            uint32_t cmd, void *data)
> >>>>>>>> +{
> >>>>>>>> +    /* TODO: Add proper emulation for all bits of the command register. */
> >>>>>>>> +
> >>>>>>>> +#ifdef CONFIG_HAS_PCI_MSI
> >>>>>>>> +    if ( pdev->vpci->msi->enabled || pdev->vpci->msix->enabled )
> >>>>>>>> +    {
> >>>>>>>> +        /* Guest wants to enable INTx. It can't be enabled if MSI/MSI-X enabled. */
> >>>>>>>> +        cmd |= PCI_COMMAND_INTX_DISABLE;
> >>>>>>>> +    }
> >>>>>>>> +#endif
> >>>>>>>> +
> >>>>>>>> +    cmd_write(pdev, reg, cmd, data);
> >>>>>>>> +}
> >>>>>>> It's not really clear to me whether the TODO warrants this being a
> >>>>>>> separate function. Personally I'd find it preferable if the logic
> >>>>>>> was folded into cmd_write().
> >>>>>> Not sure cmd_write needs to have guest's logic. And what's the
> >>>>>> profit? Later on, when we decide how PCI_COMMAND can be emulated
> >>>>>> this code will live in guest_cmd_write anyways
> >>>>> Why "will"? There's nothing conceptually wrong with putting all the
> >>>>> emulation logic into cmd_write(), inside an if(!hwdom) conditional.
> >>>>> If and when we gain CET-IBT support on the x86 side (and I'm told
> >>>>> there's an Arm equivalent of this), then to make this as useful as
> >>>>> possible it is going to be desirable to limit the number of functions
> >>>>> called through function pointers. You may have seen Andrew's huge
> >>>>> "x86: Support for CET Indirect Branch Tracking" series. We want to
> >>>>> keep down the number of such annotations; the vast part of the series
> >>>>> is about adding of such.
> >>>> Well, while I see nothing bad with that, from the code organization
> >>>> it would look a bit strange: we don't differentiate hwdom in vpci
> >>>> handlers, but instead provide one for hwdom and one for guests.
> >>>> While I understand your concern I still think that at the moment
> >>>> it will be more in line with the existing code if we provide a dedicated
> >>>> handler.
> >>> The existing code only deals with Dom0, and hence doesn't have any
> >>> pairs of handlers.
> >> This is fair
> >>>    FTAOD what I said above applies equally to other
> >>> separate guest read/write handlers you may be introducing. The
> >>> exception being when e.g. a hardware access handler is put in place
> >>> for Dom0 (for obvious reasons, I think).
> >> @Roger, what's your preference here?
> > The newly introduced handler ends up calling the existing one,
> But before doing so it implements guest specific logic which will be
> extended as we add more bits of emulation
> >   so in
> > this case it might make sense to expand cmd_write to also cater for
> > the domU case?
> So, from the above I thought it was ok to have a dedicated handler

Given the current proposal where you are only dealing with INTx I don't
think it makes much sense to have a separate handler because you end
up calling cmd_write anyway, so what's added there could very well be
added at the top of cmd_write.

> >
> > I think we need to be sensible here in that we don't want to end up
> > with handlers like:
> >
> > register_read(...)
> > {
> >     if ( is_hardware_domain() )
> >         ....
> >     else
> >         ...
> > }
> >
> > If there's shared code it's IMO better to not create a guest specific
> > handler.
> >
> > It's also more risky to use the same handlers for dom0 and domU, as a
> > change intended to dom0 only might end up leaking in the domU path and
> > that could easily become a security issue.
> So, just for your justification: BARs. Is this something we also want
> to be kept separate or we want if (is_hwdom)?
> I guess the former.

I think BAR access handling is sufficiently different between dom0 and
domU that we want separate handlers.

Thanks, Roger.
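
The suggestion above, that the INTx handling could sit at the top of
cmd_write instead of in a separate guest handler, might look roughly like
the sketch below. This is a standalone illustration and not Xen code:
struct mock_dev, mock_cmd_adjust() and mock_cmd_write() are hypothetical
stand-ins for the vPCI state and handlers, and the "write to hardware" is
just a store into a struct field. Only the PCI_COMMAND_INTX_DISABLE bit
position (bit 10) is taken from the real register layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x400 /* bit 10 of the PCI command register */

/* Hypothetical stand-in for the per-device vPCI state (not struct pci_dev). */
struct mock_dev {
    bool is_hwdom;     /* device is owned by the hardware domain */
    bool msi_enabled;
    bool msix_enabled;
    uint16_t hw_cmd;   /* models the value that reaches the hardware */
};

/* Guest-only adjustment applied before the shared write path. */
static uint16_t mock_cmd_adjust(const struct mock_dev *dev, uint16_t cmd)
{
    assert(dev);

    /* A guest may not have INTx enabled while MSI or MSI-X is enabled. */
    if ( !dev->is_hwdom && (dev->msi_enabled || dev->msix_enabled) )
        cmd |= PCI_COMMAND_INTX_DISABLE;

    return cmd;
}

/* Single handler shared by the hardware domain and guests. */
static void mock_cmd_write(struct mock_dev *dev, uint16_t cmd)
{
    dev->hw_cmd = mock_cmd_adjust(dev, cmd);
}
```

With a single handler only one function is reachable through the handler
pointer, which is the property the CET-IBT argument cares about. The cost
is that the guest-only adjustment now lives on a path that the hardware
domain also takes.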


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests
  2022-02-08 14:09                   ` Roger Pau Monné
@ 2022-02-08 14:13                     ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-08 14:13 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, julien, sstabellini, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Artem Mygaiev, andrew.cooper3, george.dunlap,
	paul, Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 08.02.22 16:09, Roger Pau Monné wrote:
> On Tue, Feb 08, 2022 at 11:29:07AM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 08.02.22 13:11, Roger Pau Monné wrote:
>>> On Tue, Feb 08, 2022 at 09:58:40AM +0000, Oleksandr Andrushchenko wrote:
>>>> On 08.02.22 11:52, Jan Beulich wrote:
>>>>> On 08.02.2022 10:38, Oleksandr Andrushchenko wrote:
>>>>>> On 08.02.22 11:33, Jan Beulich wrote:
>>>>>>> On 08.02.2022 09:13, Oleksandr Andrushchenko wrote:
>>>>>>>> On 04.02.22 16:25, Jan Beulich wrote:
>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>>>>>> @@ -454,6 +454,22 @@ static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>>>               pci_conf_write16(pdev->sbdf, reg, cmd);
>>>>>>>>>>       }
>>>>>>>>>>       
>>>>>>>>>> +static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>>> +                            uint32_t cmd, void *data)
>>>>>>>>>> +{
>>>>>>>>>> +    /* TODO: Add proper emulation for all bits of the command register. */
>>>>>>>>>> +
>>>>>>>>>> +#ifdef CONFIG_HAS_PCI_MSI
>>>>>>>>>> +    if ( pdev->vpci->msi->enabled || pdev->vpci->msix->enabled )
>>>>>>>>>> +    {
>>>>>>>>>> +        /* Guest wants to enable INTx. It can't be enabled if MSI/MSI-X enabled. */
>>>>>>>>>> +        cmd |= PCI_COMMAND_INTX_DISABLE;
>>>>>>>>>> +    }
>>>>>>>>>> +#endif
>>>>>>>>>> +
>>>>>>>>>> +    cmd_write(pdev, reg, cmd, data);
>>>>>>>>>> +}
>>>>>>>>> It's not really clear to me whether the TODO warrants this being a
>>>>>>>>> separate function. Personally I'd find it preferable if the logic
>>>>>>>>> was folded into cmd_write().
>>>>>>>> Not sure cmd_write needs to have guest's logic. And what's the
>>>>>>>> profit? Later on, when we decide how PCI_COMMAND can be emulated
>>>>>>>> this code will live in guest_cmd_write anyways
>>>>>>> Why "will"? There's nothing conceptually wrong with putting all the
>>>>>>> emulation logic into cmd_write(), inside an if(!hwdom) conditional.
>>>>>>> If and when we gain CET-IBT support on the x86 side (and I'm told
>>>>>>> there's an Arm equivalent of this), then to make this as useful as
>>>>>>> possible it is going to be desirable to limit the number of functions
>>>>>>> called through function pointers. You may have seen Andrew's huge
>>>>>>> "x86: Support for CET Indirect Branch Tracking" series. We want to
>>>>>>> keep down the number of such annotations; the vast part of the series
>>>>>>> is about adding of such.
>>>>>> Well, while I see nothing bad with that, from the code organization
>>>>>> it would look a bit strange: we don't differentiate hwdom in vpci
>>>>>> handlers, but instead provide one for hwdom and one for guests.
>>>>>> While I understand your concern I still think that at the moment
>>>>>> it will be more in line with the existing code if we provide a dedicated
>>>>>> handler.
>>>>> The existing code only deals with Dom0, and hence doesn't have any
>>>>> pairs of handlers.
>>>> This is fair
>>>>>     FTAOD what I said above applies equally to other
>>>>> separate guest read/write handlers you may be introducing. The
>>>>> exception being when e.g. a hardware access handler is put in place
>>>>> for Dom0 (for obvious reasons, I think).
>>>> @Roger, what's your preference here?
>>> The newly introduced handler ends up calling the existing one,
>> But before doing so it implements guest specific logic which will be
>> extended as we add more bits of emulation
>>>    so in
>>> this case it might make sense to expand cmd_write to also cater for
>>> the domU case?
>> So, from the above I thought it was ok to have a dedicated handler
> Given the current proposal where you are only dealing with INTx I don't
> think it makes much sense to have a separate handler because you end
> up calling cmd_write anyway, so what's added there could very well be
> added at the top of cmd_write.
Good
>
>>> I think we need to be sensible here in that we don't want to end up
>>> with handlers like:
>>>
>>> register_read(...)
>>> {
>>>      if ( is_hardware_domain() )
>>>          ....
>>>      else
>>>          ...
>>> }
>>>
>>> If there's shared code it's IMO better to not create a guest specific
>>> handler.
>>>
>>> It's also more risky to use the same handlers for dom0 and domU, as a
>>> change intended to dom0 only might end up leaking in the domU path and
>>> that could easily become a security issue.
>> So, just for your justification: BARs. Is this something we also want
>> to be kept separate or we want if (is_hwdom)?
>> I guess the former.
> I think BAR access handling is sufficiently different between dom0 and
> domU that we want separate handlers.
Makes sense
> Thanks, Roger.
Thank you,
Oleksandr

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-08 11:25                         ` Oleksandr Andrushchenko
@ 2022-02-10  8:21                           ` Oleksandr Andrushchenko
  2022-02-10  9:22                             ` Jan Beulich
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-10  8:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau,
	Oleksandr Andrushchenko



On 08.02.22 13:25, Oleksandr Andrushchenko wrote:
>
> On 08.02.22 13:00, Jan Beulich wrote:
>> On 08.02.2022 11:52, Oleksandr Andrushchenko wrote:
>>> This smells like we first need to fix the existing code, so
>>> pdev->domain is not assigned by specific IOMMU implementations,
>>> but instead controlled by the code which relies on that, assign_device.
>> Feel free to come up with proposals how to cleanly do so. Moving the
>> assignment to pdev->domain may even be possible now, but if you go
>> back you may find that the code was quite different earlier on.
> I do understand that as the code evolves new use cases bring
> new issues.
>>> I can have something like:
>>>
>>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>>> index 88836aab6baf..cc7790709a50 100644
>>> --- a/xen/drivers/passthrough/pci.c
>>> +++ b/xen/drivers/passthrough/pci.c
>>> @@ -1475,6 +1475,7 @@ static int device_assigned(u16 seg, u8 bus, u8 devfn)
>>>     static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>     {
>>>         const struct domain_iommu *hd = dom_iommu(d);
>>> +    struct domain *old_owner;
>>>         struct pci_dev *pdev;
>>>         int rc = 0;
>>>
>>> @@ -1490,6 +1491,9 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>         ASSERT(pdev && (pdev->domain == hardware_domain ||
>>>                         pdev->domain == dom_io));
>>>
>>> +    /* We need to restore the old owner in case of an error. */
>>> +    old_owner = pdev->domain;
>>> +
>>>         vpci_deassign_device(pdev->domain, pdev);
>>>
>>>         rc = pdev_msix_assign(d, pdev);
>>> @@ -1515,8 +1519,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>
>>>      done:
>>>         if ( rc )
>>> +    {
>>>             printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>                    d, &PCI_SBDF3(seg, bus, devfn), rc);
>>> +        /* We failed to assign, so restore the previous owner. */
>>> +        pdev->domain = old_owner;
>>> +    }
>>>         /* The device is assigned to dom_io so mark it as quarantined */
>>>         else if ( d == dom_io )
>>>             pdev->quarantine = true;
>>>
>>> But I do not think this belongs to this patch
>> Indeed. Plus I'm sure you understand that it's not that simple. Assigning
>> to pdev->domain is only the last step of assignment. Restoring the original
>> owner would entail putting in place the original IOMMU table entries as
>> well, which in turn can fail. Hence why you'll find a number of uses of
>> domain_crash() in places where rolling back is far from easy.
> So, why don't we just rely on the toolstack to do the roll back then?
> This way we won't add new domain_crash() calls.
> I do understand, though, that we may leave Xen in a wrong state.
> So, do you think it is possible if we just call deassign_device from
> assign_device on the error path? This is just like I do in vpci_assign_device:
> I call vpci_deassign_device if the former fails.
With the following addition:

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index c4ae22aeefcd..d6c00449193c 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1511,6 +1511,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
      }

      rc = vpci_assign_device(pdev);
+    if ( rc )
+        /*
+         * Ignore the return code as we want to preserve the one from the
+         * failed assign operation.
+         */
+        deassign_device(d, seg, bus, devfn);

   done:
      if ( rc )

I see the following logs (PV Dom0):

(XEN) assign_device seg 0 bus 3 devfn 0
(XEN) [VT-D]d[IO]:PCIe: unmap 0000:03:00.0
(XEN) [VT-D]d4:PCIe: map 0000:03:00.0
(XEN) assign_device vpci_assign rc -22 from d[IO] to d4
(XEN) deassign_device current d4 to d[IO]
(XEN) [VT-D]d4:PCIe: unmap 0000:03:00.0
(XEN) [VT-D]d[IO]:PCIe: map 0000:03:00.0
(XEN) deassign_device ret 0
(XEN) d4: assign (0000:03:00.0) failed (-22)
libxl: error: libxl_pci.c:1498:pci_add_dm_done: Domain 4:xc_assign_device failed: Invalid argument
libxl: error: libxl_pci.c:1781:device_pci_add_done: Domain 4:libxl__device_pci_add failed for PCI device 0:3:0.0 (rc -3)
libxl: error: libxl_create.c:1895:domcreate_attach_devices: Domain 4:unable to add pci devices
libxl: error: libxl_domain.c:1183:libxl__destroy_domid: Domain 4:Non-existant domain
libxl: error: libxl_domain.c:1137:domain_destroy_callback: Domain 4:Unable to destroy guest
libxl: error: libxl_domain.c:1064:domain_destroy_cb: Domain 4:Destruction of domain failed

So, it seems to properly solve the issue with pdev->domain left
set to the domain we couldn't create.

@Jan, will this address your concern?

Thank you,
Oleksandr

^ permalink raw reply related	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-10  8:21                           ` Oleksandr Andrushchenko
@ 2022-02-10  9:22                             ` Jan Beulich
  2022-02-10  9:33                               ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-10  9:22 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau

On 10.02.2022 09:21, Oleksandr Andrushchenko wrote:
> 
> 
> On 08.02.22 13:25, Oleksandr Andrushchenko wrote:
>>
>> On 08.02.22 13:00, Jan Beulich wrote:
>>> On 08.02.2022 11:52, Oleksandr Andrushchenko wrote:
>>>> This smells like we first need to fix the existing code, so
>>>> pdev->domain is not assigned by specific IOMMU implementations,
>>>> but instead controlled by the code which relies on that, assign_device.
>>> Feel free to come up with proposals how to cleanly do so. Moving the
>>> assignment to pdev->domain may even be possible now, but if you go
>>> back you may find that the code was quite different earlier on.
>> I do understand that as the code evolves new use cases bring
>> new issues.
>>>> I can have something like:
>>>>
>>>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>>>> index 88836aab6baf..cc7790709a50 100644
>>>> --- a/xen/drivers/passthrough/pci.c
>>>> +++ b/xen/drivers/passthrough/pci.c
>>>> @@ -1475,6 +1475,7 @@ static int device_assigned(u16 seg, u8 bus, u8 devfn)
>>>>     static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>     {
>>>>         const struct domain_iommu *hd = dom_iommu(d);
>>>> +    struct domain *old_owner;
>>>>         struct pci_dev *pdev;
>>>>         int rc = 0;
>>>>
>>>> @@ -1490,6 +1491,9 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>         ASSERT(pdev && (pdev->domain == hardware_domain ||
>>>>                         pdev->domain == dom_io));
>>>>
>>>> +    /* We need to restore the old owner in case of an error. */
>>>> +    old_owner = pdev->domain;
>>>> +
>>>>         vpci_deassign_device(pdev->domain, pdev);
>>>>
>>>>         rc = pdev_msix_assign(d, pdev);
>>>> @@ -1515,8 +1519,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>
>>>>      done:
>>>>         if ( rc )
>>>> +    {
>>>>             printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>>                    d, &PCI_SBDF3(seg, bus, devfn), rc);
>>>> +        /* We failed to assign, so restore the previous owner. */
>>>> +        pdev->domain = old_owner;
>>>> +    }
>>>>         /* The device is assigned to dom_io so mark it as quarantined */
>>>>         else if ( d == dom_io )
>>>>             pdev->quarantine = true;
>>>>
>>>> But I do not think this belongs to this patch
>>> Indeed. Plus I'm sure you understand that it's not that simple. Assigning
>>> to pdev->domain is only the last step of assignment. Restoring the original
>>> owner would entail putting in place the original IOMMU table entries as
>>> well, which in turn can fail. Hence why you'll find a number of uses of
>>> domain_crash() in places where rolling back is far from easy.
>> So, why don't we just rely on the toolstack to do the roll back then?
>> This way we won't add new domain_crash() calls.
>> I do understand, though, that we may leave Xen in a wrong state.
>> So, do you think it is possible if we just call deassign_device from
>> assign_device on the error path? This is just like I do in vpci_assign_device:
>> I call vpci_deassign_device if the former fails.
> With the following addition:
> 
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index c4ae22aeefcd..d6c00449193c 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1511,6 +1511,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>       }
> 
>       rc = vpci_assign_device(pdev);
> +    if ( rc )
> +        /*
> +         * Ignore the return code as we want to preserve the one from the
> +         * failed assign operation.
> +         */
> +        deassign_device(d, seg, bus, devfn);
> 
>    done:
>       if ( rc )
> 
> I see the following logs (PV Dom0):
> 
> (XEN) assign_device seg 0 bus 3 devfn 0
> (XEN) [VT-D]d[IO]:PCIe: unmap 0000:03:00.0
> (XEN) [VT-D]d4:PCIe: map 0000:03:00.0
> (XEN) assign_device vpci_assign rc -22 from d[IO] to d4
> (XEN) deassign_device current d4 to d[IO]
> (XEN) [VT-D]d4:PCIe: unmap 0000:03:00.0
> (XEN) [VT-D]d[IO]:PCIe: map 0000:03:00.0
> (XEN) deassign_device ret 0
> (XEN) d4: assign (0000:03:00.0) failed (-22)
> libxl: error: libxl_pci.c:1498:pci_add_dm_done: Domain 4:xc_assign_device failed: Invalid argument
> libxl: error: libxl_pci.c:1781:device_pci_add_done: Domain 4:libxl__device_pci_add failed for PCI device 0:3:0.0 (rc -3)
> libxl: error: libxl_create.c:1895:domcreate_attach_devices: Domain 4:unable to add pci devices
> libxl: error: libxl_domain.c:1183:libxl__destroy_domid: Domain 4:Non-existant domain
> libxl: error: libxl_domain.c:1137:domain_destroy_callback: Domain 4:Unable to destroy guest
> libxl: error: libxl_domain.c:1064:domain_destroy_cb: Domain 4:Destruction of domain failed
> 
> So, it seems to properly solve the issue with pdev->domain left
> set to the domain we couldn't create.
> 
> @Jan, will this address your concern?

Partly: For one I'd have to think through what further implications there
are from going this route. And then completely ignoring the return value
is unlikely to be correct: You certainly want to retain the original
error code for returning to the caller, but you can't leave the error
unhandled. That's likely another case where the "best" choice is to crash
the guest.

Jan



^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign
  2022-02-10  9:22                             ` Jan Beulich
@ 2022-02-10  9:33                               ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-10  9:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel, roger.pau,
	Oleksandr Andrushchenko



On 10.02.22 11:22, Jan Beulich wrote:
> On 10.02.2022 09:21, Oleksandr Andrushchenko wrote:
>>
>> On 08.02.22 13:25, Oleksandr Andrushchenko wrote:
>>> On 08.02.22 13:00, Jan Beulich wrote:
>>>> On 08.02.2022 11:52, Oleksandr Andrushchenko wrote:
>>>>> This smells like we first need to fix the existing code, so
>>>>> pdev->domain is not assigned by specific IOMMU implementations,
>>>>> but instead controlled by the code which relies on that, assign_device.
>>>> Feel free to come up with proposals how to cleanly do so. Moving the
>>>> assignment to pdev->domain may even be possible now, but if you go
>>>> back you may find that the code was quite different earlier on.
>>> I do understand that as the code evolves new use cases bring
>>> new issues.
>>>>> I can have something like:
>>>>>
>>>>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>>>>> index 88836aab6baf..cc7790709a50 100644
>>>>> --- a/xen/drivers/passthrough/pci.c
>>>>> +++ b/xen/drivers/passthrough/pci.c
>>>>> @@ -1475,6 +1475,7 @@ static int device_assigned(u16 seg, u8 bus, u8 devfn)
>>>>>      static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>      {
>>>>>          const struct domain_iommu *hd = dom_iommu(d);
>>>>> +    struct domain *old_owner;
>>>>>          struct pci_dev *pdev;
>>>>>          int rc = 0;
>>>>>
>>>>> @@ -1490,6 +1491,9 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>          ASSERT(pdev && (pdev->domain == hardware_domain ||
>>>>>                          pdev->domain == dom_io));
>>>>>
>>>>> +    /* We need to restore the old owner in case of an error. */
>>>>> +    old_owner = pdev->domain;
>>>>> +
>>>>>          vpci_deassign_device(pdev->domain, pdev);
>>>>>
>>>>>          rc = pdev_msix_assign(d, pdev);
>>>>> @@ -1515,8 +1519,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>
>>>>>       done:
>>>>>          if ( rc )
>>>>> +    {
>>>>>              printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>>>                     d, &PCI_SBDF3(seg, bus, devfn), rc);
>>>>> +        /* We failed to assign, so restore the previous owner. */
>>>>> +        pdev->domain = old_owner;
>>>>> +    }
>>>>>          /* The device is assigned to dom_io so mark it as quarantined */
>>>>>          else if ( d == dom_io )
>>>>>              pdev->quarantine = true;
>>>>>
>>>>> But I do not think this belongs to this patch
>>>> Indeed. Plus I'm sure you understand that it's not that simple. Assigning
>>>> to pdev->domain is only the last step of assignment. Restoring the original
>>>> owner would entail putting in place the original IOMMU table entries as
>>>> well, which in turn can fail. Hence why you'll find a number of uses of
>>>> domain_crash() in places where rolling back is far from easy.
>>> So, why don't we just rely on the toolstack to do the roll back then?
>>> This way we won't add new domain_crash() calls.
>>> I do understand, though, that we may leave Xen in a wrong state.
>>> So, do you think it is possible if we just call deassign_device from
>>> assign_device on the error path? This is just like I do in vpci_assign_device:
>>> I call vpci_deassign_device if the former fails.
>> With the following addition:
>>
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index c4ae22aeefcd..d6c00449193c 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -1511,6 +1511,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>        }
>>
>>        rc = vpci_assign_device(pdev);
>> +    if ( rc )
>> +        /*
>> +         * Ignore the return code as we want to preserve the one from the
>> +         * failed assign operation.
>> +         */
>> +        deassign_device(d, seg, bus, devfn);
This needs devfn to be preserved as it can be modified by the loop above:
     for ( ; pdev->phantom_stride; rc = 0 )
     {
         devfn += pdev->phantom_stride;

>>
>>     done:
>>        if ( rc )
>>
>> I see the following logs (PV Dom0):
>>
>> (XEN) assign_device seg 0 bus 3 devfn 0
>> (XEN) [VT-D]d[IO]:PCIe: unmap 0000:03:00.0
>> (XEN) [VT-D]d4:PCIe: map 0000:03:00.0
>> (XEN) assign_device vpci_assign rc -22 from d[IO] to d4
>> (XEN) deassign_device current d4 to d[IO]
>> (XEN) [VT-D]d4:PCIe: unmap 0000:03:00.0
>> (XEN) [VT-D]d[IO]:PCIe: map 0000:03:00.0
>> (XEN) deassign_device ret 0
>> (XEN) d4: assign (0000:03:00.0) failed (-22)
>> libxl: error: libxl_pci.c:1498:pci_add_dm_done: Domain 4:xc_assign_device failed: Invalid argument
>> libxl: error: libxl_pci.c:1781:device_pci_add_done: Domain 4:libxl__device_pci_add failed for PCI device 0:3:0.0 (rc -3)
>> libxl: error: libxl_create.c:1895:domcreate_attach_devices: Domain 4:unable to add pci devices
>> libxl: error: libxl_domain.c:1183:libxl__destroy_domid: Domain 4:Non-existant domain
>> libxl: error: libxl_domain.c:1137:domain_destroy_callback: Domain 4:Unable to destroy guest
>> libxl: error: libxl_domain.c:1064:domain_destroy_cb: Domain 4:Destruction of domain failed
>>
>> So, it seems to properly solve the issue with pdev->domain left
>> set to the domain we couldn't create.
>>
>> @Jan, will this address your concern?
> Partly: For one I'd have to think through what further implications there
> are from going this route. And then completely ignoring the return value
> is unlikely to be correct: You certainly want to retain the original
> error code for returning to the caller, but you can't leave the error
> unhandled. That's likely another case where the "best" choice is to crash
> the guest.
Ok, then I'll crash the domain...
>
> Jan
>
>
Thank you,
Oleksandr
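
The three points raised in this exchange (use the original devfn for the
rollback, return the error code of the failed assign, and do not leave a
failed rollback unhandled) can be sketched as follows. This is a
standalone model under stated assumptions, not the actual patch:
mock_assign_device(), mock_deassign(), struct mock_outcome and the
"crashed" flag are hypothetical, the phantom function loop is reduced to
the part that modifies devfn, and setting the flag stands in for crashing
the domain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MOCK_PCI_SLOT(devfn) (((devfn) >> 3) & 0x1f)

struct mock_domain { bool crashed; };
struct mock_pdev { uint8_t devfn; uint8_t phantom_stride; };

/* Hypothetical knobs that simulate the outcome of each step. */
struct mock_outcome { int vpci_rc; int deassign_rc; };

static uint8_t last_deassign_devfn; /* records which devfn was rolled back */

static int mock_deassign(uint8_t devfn, int simulated_rc)
{
    last_deassign_devfn = devfn;
    return simulated_rc;
}

static int mock_assign_device(struct mock_domain *d,
                              const struct mock_pdev *pdev, uint8_t devfn,
                              const struct mock_outcome *o)
{
    const uint8_t orig_devfn = devfn; /* saved before the loop changes it */
    int rc;

    assert(d && pdev && o);

    /* Simplified phantom function walk: it advances devfn. */
    while ( pdev->phantom_stride )
    {
        devfn += pdev->phantom_stride;
        if ( MOCK_PCI_SLOT(devfn) != MOCK_PCI_SLOT(pdev->devfn) )
            break;
    }

    rc = o->vpci_rc; /* models the result of the vPCI assign step */
    if ( rc )
    {
        /*
         * Roll back with the original devfn. rc keeps the error of the
         * failed assign; a failed rollback is handled by crashing.
         */
        if ( mock_deassign(orig_devfn, o->deassign_rc) )
            d->crashed = true;
    }

    return rc;
}
```

The caller always sees the error from the failed assign step, while a
rollback failure is recorded separately instead of being dropped.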

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-07 14:31                   ` Jan Beulich
  2022-02-07 14:46                     ` Oleksandr Andrushchenko
@ 2022-02-10 12:54                     ` Oleksandr Andrushchenko
  2022-02-10 13:36                       ` Jan Beulich
  2022-02-10 12:59                     ` Oleksandr Andrushchenko
  2 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-10 12:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 07.02.22 16:31, Jan Beulich wrote:
> On 07.02.2022 15:17, Oleksandr Andrushchenko wrote:
> But: What's still missing here then is the separation of guest and host
> views. When we set INTx behind the guest's back, it shouldn't observe the
> bit set. Or is this meant to be another (big) TODO?
Why not? This seems to be the case when a guest tries to enable both
MSI/MSI-X and INTx, which is an invalid combination. Let's pretend to be a
really smart PCI device which partially rejects such a PCI_COMMAND write,
so the guest still sees the register in a consistent state wrt the INTx
bit. Namely, the bit remains set.
>
> Jan
>
>
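
For comparison, the separation of guest and host views that Jan asks about
could be modelled as below. This is only a sketch of that alternative
under assumptions, not what the series implements: struct mock_cmd_state
and both functions are hypothetical, and which bits count as host-owned is
a per-bit policy decision that is still a TODO in the thread. The
PCI_COMMAND_SERR and PCI_COMMAND_INTX_DISABLE values follow the standard
command register layout (bits 8 and 10).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_SERR         0x100 /* bit 8 */
#define PCI_COMMAND_INTX_DISABLE 0x400 /* bit 10 */

/* Hypothetical split between what the guest sees and what hardware holds. */
struct mock_cmd_state {
    uint16_t guest_view; /* value the guest reads back, initially 0 */
    uint16_t hw_value;   /* value programmed into the device */
    uint16_t host_owned; /* bits the guest must not change in hardware */
};

static void mock_guest_cmd_write(struct mock_cmd_state *s, uint16_t val,
                                 bool msi_active)
{
    uint16_t hw;

    assert(s);

    /* The guest always reads back exactly what it wrote. */
    s->guest_view = val;

    /* Host-owned bits keep their hardware value, the rest follow the guest. */
    hw = (uint16_t)((s->hw_value & s->host_owned) |
                    (val & (uint16_t)~s->host_owned));

    /* Forced behind the guest's back, so it is not reflected in guest_view. */
    if ( msi_active )
        hw |= PCI_COMMAND_INTX_DISABLE;

    s->hw_value = hw;
}

static uint16_t mock_guest_cmd_read(const struct mock_cmd_state *s)
{
    return s->guest_view;
}
```

Here a guest that writes 0x0006 while MSI is active reads back 0x0006,
while the hardware keeps the host's SERR setting and has INTx disabled.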

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-07 14:31                   ` Jan Beulich
  2022-02-07 14:46                     ` Oleksandr Andrushchenko
  2022-02-10 12:54                     ` Oleksandr Andrushchenko
@ 2022-02-10 12:59                     ` Oleksandr Andrushchenko
  2 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-10 12:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 07.02.22 16:31, Jan Beulich wrote:
> On 07.02.2022 15:17, Oleksandr Andrushchenko wrote:
>>
>> On 07.02.22 14:54, Jan Beulich wrote:
>>> On 07.02.2022 13:51, Oleksandr Andrushchenko wrote:
>>>> On 07.02.22 14:38, Jan Beulich wrote:
>>>>> On 07.02.2022 12:27, Oleksandr Andrushchenko wrote:
>>>>>> On 07.02.22 09:29, Jan Beulich wrote:
>>>>>>> On 04.02.2022 15:37, Oleksandr Andrushchenko wrote:
>>>>>>>> On 04.02.22 16:30, Jan Beulich wrote:
>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>>> Reset the command register when assigning a PCI device to a guest:
>>>>>>>>>> according to the PCI spec the PCI_COMMAND register is typically all 0's
>>>>>>>>>> after reset.
>>>>>>>>> It's not entirely clear to me whether setting the hardware register to
>>>>>>>>> zero is okay. What wants to be zero is the value the guest observes
>>>>>>>>> initially.
>>>>>>>> "the PCI spec says the PCI_COMMAND register is typically all 0's after reset."
>>>>>>>> Why wouldn't it be ok? What is the exact concern here?
>>>>>>> The concern is - as voiced in similar ways before, perhaps in other
>>>>>>> contexts - that you need to consider bit-by-bit whether overwriting
>>>>>>> with 0 what is currently there is okay. Xen and/or Dom0 may have put
>>>>>>> values there which they expect to remain unaltered. I guess
>>>>>>> PCI_COMMAND_SERR is a good example: While the guest's view of this
>>>>>>> will want to be zero initially, the host having set it to 1 may not
>>>>>>> easily be overwritten with 0, or else you'd effectively imply giving
>>>>>>> the guest control of the bit.
>>>>>> We have already discussed in great detail PCI_COMMAND emulation [1].
>>>>>> At the end you wrote [1]:
>>>>>> "Well, in order for the whole thing to be security supported it needs to
>>>>>> be explained for every bit why it is safe to allow the guest to drive it.
>>>>>> Until you mean vPCI to reach that state, leaving TODO notes in the code
>>>>>> for anything not investigated may indeed be good enough.
>>>>>>
>>>>>> Jan"
>>>>>>
>>>>>> So, this is why I left a TODO in the PCI_COMMAND emulation for now and only
>>>>>> care about INTx which is honored with the code in this patch.
>>>>> Right. The issue I see is that the description does not have any
>>>>> mention of this, but instead talks about simply writing zero.
>>>> How do you want that mentioned? Extended commit message or
>>>> just a link to the thread [1]?
>>> What I'd like you to describe is what the change does without
>>> fundamentally implying it'll end up being zero which gets written
>>> to the register. Stating as a conclusion that for the time being
>>> this means writing zero is certainly fine (and likely helpful if
>>> made explicit).
>> Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
>> to remain unaltered. PCI_COMMAND_SERR bit is a good example: while the
>> guest's view of this will want to be zero initially, the host having set
>> it to 1 may not easily be overwritten with 0, or else we'd effectively
>> imply giving the guest control of the bit. Thus, PCI_COMMAND register needs
>> proper emulation in order to honor host's settings.
>>
>> There are examples of emulators [1], [2] which already deal with PCI_COMMAND
>> register emulation and it seems that at most they care about only the INTx
>> bit (besides IO/memory enable and bus master which are write through).
>> It could be because in order to properly emulate the PCI_COMMAND register
>> we need to know about the whole PCI topology, e.g. whether any setting in the
>> device's command register is aligned with the upstream port etc.
>> This makes me think that because of this complexity others just ignore that.
>> Nor do I think this can be easily done in the Xen case.
>>
>> "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2 Device Control"
>> says that the reset state of the command register is typically 0, so
>> reset the command register to all 0's when assigning a PCI device to a
>> guest and for now only make sure the INTx bit is set according to whether
>> MSI/MSI-X is enabled.
> "... is typically 0, so when assigning a PCI device reset the guest view of
>   the command register to all 0's. For now our emulation only makes sure INTx
>   is set according to host requirements, i.e. depending on MSI/MSI-X enabled
>   state."
I'll put this description into the PCI_COMMAND emulation patch.
>
> Maybe? (Obviously a fresh device given to a guest will have MSI/MSI-X
> disabled, so I'm not sure that aspect really needs mentioning.)
>
> But: What's still missing here then is the separation of guest and host
> views. When we set INTx behind the guest's back, it shouldn't observe the
> bit set. Or is this meant to be another (big) TODO?
>
> Jan
>
Thank you,
Oleksandr


* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-10 12:54                     ` Oleksandr Andrushchenko
@ 2022-02-10 13:36                       ` Jan Beulich
  2022-02-10 13:56                         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 138+ messages in thread
From: Jan Beulich @ 2022-02-10 13:36 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel

On 10.02.2022 13:54, Oleksandr Andrushchenko wrote:
> On 07.02.22 16:31, Jan Beulich wrote:
>> On 07.02.2022 15:17, Oleksandr Andrushchenko wrote:
>> But: What's still missing here then is the separation of guest and host
>> views. When we set INTx behind the guest's back, it shouldn't observe the
>> bit set. Or is this meant to be another (big) TODO?
> Why not? This seems to be when a guest tries to both enable MSI/MSI-X
> and INTx which is a wrong combination. Let's pretend to be a really
> smart PCI device which partially rejects such PCI_COMMAND write,
> so guest still sees the register consistent wrt INTx bit. Namely it remains
> set.

I'm afraid this wouldn't be "smart", but "buggy". I'm not aware of
the spec leaving room for such behavior. And our emulation should
give the guest a spec-compliant view of the device.

Jan




* Re: [PATCH v6 10/13] vpci/header: reset the command register when adding devices
  2022-02-10 13:36                       ` Jan Beulich
@ 2022-02-10 13:56                         ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Andrushchenko @ 2022-02-10 13:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, andrew.cooper3, george.dunlap, paul,
	Bertrand Marquis, Rahul Singh, xen-devel,
	Oleksandr Andrushchenko



On 10.02.22 15:36, Jan Beulich wrote:
> On 10.02.2022 13:54, Oleksandr Andrushchenko wrote:
>> On 07.02.22 16:31, Jan Beulich wrote:
>>> On 07.02.2022 15:17, Oleksandr Andrushchenko wrote:
>>> But: What's still missing here then is the separation of guest and host
>>> views. When we set INTx behind the guest's back, it shouldn't observe the
>>> bit set. Or is this meant to be another (big) TODO?
>> Why not? This seems to be when a guest tries to both enable MSI/MSI-X
>> and INTx which is a wrong combination. Let's pretend to be a really
>> smart PCI device which partially rejects such PCI_COMMAND write,
>> so guest still sees the register consistent wrt INTx bit. Namely it remains
>> set.
> I'm afraid this wouldn't be "smart", but "buggy". I'm not aware of
> the spec leaving room for such behavior. And our emulation should
> give the guest a spec-compliant view of the device.
This means we need to emulate PCI_COMMAND for guests, in the sense that
we need to maintain their state just like we do for BARs (header->guest_reg).
So, we will need header->guest_cmd to hold the state.
>
> Jan
>
>
Thank you,
Oleksandr
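
To make the guest/host view separation discussed above concrete, here is a
minimal sketch in C. It is not the vPCI code from this series: struct
cmd_state, cmd_assign(), cmd_guest_read() and cmd_guest_write() are
hypothetical names, and guest_cmd only mirrors the header->guest_cmd idea
floated in the reply above. The sketch assumes that IO/memory enable and bus
master are written through, that host-owned bits such as PCI_COMMAND_SERR are
preserved, and that INTx is forced disabled in hardware while MSI/MSI-X is
enabled without the guest observing that bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Command register bits as laid out by the PCI Local Bus Specification. */
#define PCI_COMMAND_IO            0x0001
#define PCI_COMMAND_MEMORY        0x0002
#define PCI_COMMAND_MASTER        0x0004
#define PCI_COMMAND_SERR          0x0100
#define PCI_COMMAND_INTX_DISABLE  0x0400

/* Assumption: only these bits are written through to the hardware. */
#define CMD_WRITE_THROUGH \
    (PCI_COMMAND_IO | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER)

/* Assumption: bits the guest may read back; everything else reads as 0. */
#define CMD_GUEST_VISIBLE (CMD_WRITE_THROUGH | PCI_COMMAND_INTX_DISABLE)

/* Hypothetical per-device state, not a Xen structure. */
struct cmd_state {
    uint16_t host_cmd;   /* value held by the physical register */
    uint16_t guest_cmd;  /* value the guest observes */
    bool msi_enabled;    /* MSI or MSI-X currently enabled */
};

/* On assignment only the guest view is reset; host bits stay untouched. */
static void cmd_assign(struct cmd_state *s)
{
    s->guest_cmd = 0;
}

/* Guest reads never expose bits that were set behind its back. */
static uint16_t cmd_guest_read(const struct cmd_state *s)
{
    return s->guest_cmd;
}

static void cmd_guest_write(struct cmd_state *s, uint16_t val)
{
    uint16_t hw;

    /* Remember what the guest wrote so that reads stay consistent. */
    s->guest_cmd = val & CMD_GUEST_VISIBLE;

    /* Keep host-owned bits, take write-through bits from the guest. */
    hw = (s->host_cmd & ~CMD_WRITE_THROUGH) | (val & CMD_WRITE_THROUGH);

    /* INTx follows the guest unless MSI/MSI-X requires it disabled. */
    hw &= ~PCI_COMMAND_INTX_DISABLE;
    if ( s->msi_enabled )
        hw |= PCI_COMMAND_INTX_DISABLE;
    else
        hw |= val & PCI_COMMAND_INTX_DISABLE;

    s->host_cmd = hw;
}
```

With such a split, a guest that writes memory enable and bus master while MSI
is enabled reads back exactly those two bits, whereas the hardware value also
keeps the host's SERR setting and has INTx disabled.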


* Re: [PATCH v6 13/13] xen/arm: account IO handlers for emulated PCI MSI-X
  2022-02-04  6:34 ` [PATCH v6 13/13] xen/arm: account IO handlers for emulated PCI MSI-X Oleksandr Andrushchenko
@ 2022-02-11 15:28   ` Julien Grall
  0 siblings, 0 replies; 138+ messages in thread
From: Julien Grall @ 2022-02-11 15:28 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, xen-devel
  Cc: sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	artem_mygaiev, roger.pau, jbeulich, andrew.cooper3,
	george.dunlap, paul, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko

Hi Oleksandr,

On 04/02/2022 06:34, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> At the moment, we always allocate an extra 16 slots for IO handlers
> (see MAX_IO_HANDLER). So while adding IO trap handlers for the emulated
> MSI-X registers we need to explicitly tell that we have additional IO
> handlers, so those are accounted.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Acked-by: Julien Grall <jgrall@amazon.com>

Cheers,

-- 
Julien Grall



end of thread, other threads:[~2022-02-11 15:28 UTC | newest]

Thread overview: 138+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-04  6:34 [PATCH v6 00/13] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
2022-02-04  6:34 ` [PATCH v6 01/13] xen/pci: arm: add stub for is_memory_hole Oleksandr Andrushchenko
2022-02-04  8:51   ` Julien Grall
2022-02-04  9:01     ` Oleksandr Andrushchenko
2022-02-04  9:41       ` Julien Grall
2022-02-04  9:47         ` Oleksandr Andrushchenko
2022-02-04  9:57           ` Julien Grall
2022-02-04 10:35             ` Oleksandr Andrushchenko
2022-02-04 11:00               ` Julien Grall
2022-02-04 11:25                 ` Oleksandr Andrushchenko
2022-02-04  6:34 ` [PATCH v6 02/13] rangeset: add RANGESETF_no_print flag Oleksandr Andrushchenko
2022-02-04  6:34 ` [PATCH v6 03/13] vpci: move lock outside of struct vpci Oleksandr Andrushchenko
2022-02-04  7:52   ` Jan Beulich
2022-02-04  8:13     ` Oleksandr Andrushchenko
2022-02-04  8:36       ` Jan Beulich
2022-02-04  8:58     ` Oleksandr Andrushchenko
2022-02-04  9:15       ` Jan Beulich
2022-02-04 10:12         ` Oleksandr Andrushchenko
2022-02-04 10:49           ` Jan Beulich
2022-02-04 11:13             ` Roger Pau Monné
2022-02-04 11:37               ` Jan Beulich
2022-02-04 12:37                 ` Oleksandr Andrushchenko
2022-02-04 12:47                   ` Jan Beulich
2022-02-04 12:53                     ` Oleksandr Andrushchenko
2022-02-04 13:03                       ` Jan Beulich
2022-02-04 13:06                       ` Roger Pau Monné
2022-02-04 14:43                         ` Oleksandr Andrushchenko
2022-02-04 14:57                           ` Roger Pau Monné
2022-02-07 11:08                             ` Oleksandr Andrushchenko
2022-02-07 12:34                               ` Jan Beulich
2022-02-07 12:57                                 ` Oleksandr Andrushchenko
2022-02-07 13:02                                   ` Jan Beulich
2022-02-07 12:46                               ` Roger Pau Monné
2022-02-07 13:53                                 ` Oleksandr Andrushchenko
2022-02-07 14:11                                   ` Jan Beulich
2022-02-07 14:27                                     ` Roger Pau Monné
2022-02-07 14:33                                       ` Jan Beulich
2022-02-07 14:35                                       ` Oleksandr Andrushchenko
2022-02-07 15:11                                         ` Oleksandr Andrushchenko
2022-02-07 15:26                                           ` Jan Beulich
2022-02-07 16:07                                             ` Oleksandr Andrushchenko
2022-02-07 16:15                                               ` Jan Beulich
2022-02-07 16:21                                                 ` Oleksandr Andrushchenko
2022-02-07 16:37                                                   ` Jan Beulich
2022-02-07 16:44                                                     ` Oleksandr Andrushchenko
2022-02-08  7:35                                                       ` Oleksandr Andrushchenko
2022-02-08  8:57                                                         ` Jan Beulich
2022-02-08  9:03                                                           ` Oleksandr Andrushchenko
2022-02-08 10:50                                                         ` Roger Pau Monné
2022-02-08 11:13                                                           ` Oleksandr Andrushchenko
2022-02-08 13:38                                                             ` Roger Pau Monné
2022-02-08 13:52                                                               ` Oleksandr Andrushchenko
2022-02-08  8:53                                                       ` Jan Beulich
2022-02-08  9:00                                                         ` Oleksandr Andrushchenko
2022-02-08 10:11                                                     ` Roger Pau Monné
2022-02-08 10:32                                                       ` Oleksandr Andrushchenko
2022-02-07 16:08                                             ` Roger Pau Monné
2022-02-07 16:12                                               ` Jan Beulich
2022-02-07 14:28                                     ` Oleksandr Andrushchenko
2022-02-07 14:19                                   ` Roger Pau Monné
2022-02-07 14:27                                     ` Oleksandr Andrushchenko
2022-02-04 11:37               ` Oleksandr Andrushchenko
2022-02-04 12:15                 ` Roger Pau Monné
2022-02-04 10:57           ` Roger Pau Monné
2022-02-04  6:34 ` [PATCH v6 04/13] vpci: restrict unhandled read/write operations for guests Oleksandr Andrushchenko
2022-02-04 14:11   ` Jan Beulich
2022-02-04 14:24     ` Oleksandr Andrushchenko
2022-02-08  8:00       ` Oleksandr Andrushchenko
2022-02-08  9:04         ` Jan Beulich
2022-02-08  9:09           ` Oleksandr Andrushchenko
2022-02-08  9:05         ` Roger Pau Monné
2022-02-08  9:10           ` Oleksandr Andrushchenko
2022-02-04  6:34 ` [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign Oleksandr Andrushchenko
2022-02-07 16:28   ` Jan Beulich
2022-02-08  8:32     ` Oleksandr Andrushchenko
2022-02-08  9:13       ` Jan Beulich
2022-02-08  9:27         ` Oleksandr Andrushchenko
2022-02-08  9:44           ` Jan Beulich
2022-02-08  9:55             ` Oleksandr Andrushchenko
2022-02-08 10:09               ` Jan Beulich
2022-02-08 10:22                 ` Oleksandr Andrushchenko
2022-02-08 10:29                   ` Jan Beulich
2022-02-08 10:52                     ` Oleksandr Andrushchenko
2022-02-08 11:00                       ` Jan Beulich
2022-02-08 11:25                         ` Oleksandr Andrushchenko
2022-02-10  8:21                           ` Oleksandr Andrushchenko
2022-02-10  9:22                             ` Jan Beulich
2022-02-10  9:33                               ` Oleksandr Andrushchenko
2022-02-04  6:34 ` [PATCH v6 06/13] vpci/header: implement guest BAR register handlers Oleksandr Andrushchenko
2022-02-07 17:06   ` Jan Beulich
2022-02-08  8:06     ` Oleksandr Andrushchenko
2022-02-08  9:16       ` Jan Beulich
2022-02-08  9:29         ` Roger Pau Monné
2022-02-08  9:25   ` Roger Pau Monné
2022-02-08  9:31     ` Oleksandr Andrushchenko
2022-02-08  9:48       ` Jan Beulich
2022-02-08  9:57         ` Oleksandr Andrushchenko
2022-02-08 10:15           ` Jan Beulich
2022-02-08 10:29             ` Oleksandr Andrushchenko
2022-02-08 13:58               ` Roger Pau Monné
2022-02-04  6:34 ` [PATCH v6 07/13] vpci/header: handle p2m range sets per BAR Oleksandr Andrushchenko
2022-02-04  6:34 ` [PATCH v6 08/13] vpci/header: program p2m with guest BAR view Oleksandr Andrushchenko
2022-02-04  6:34 ` [PATCH v6 09/13] vpci/header: emulate PCI_COMMAND register for guests Oleksandr Andrushchenko
2022-02-04 14:25   ` Jan Beulich
2022-02-08  8:13     ` Oleksandr Andrushchenko
2022-02-08  9:33       ` Jan Beulich
2022-02-08  9:38         ` Oleksandr Andrushchenko
2022-02-08  9:52           ` Jan Beulich
2022-02-08  9:58             ` Oleksandr Andrushchenko
2022-02-08 11:11               ` Roger Pau Monné
2022-02-08 11:29                 ` Oleksandr Andrushchenko
2022-02-08 14:09                   ` Roger Pau Monné
2022-02-08 14:13                     ` Oleksandr Andrushchenko
2022-02-04  6:34 ` [PATCH v6 10/13] vpci/header: reset the command register when adding devices Oleksandr Andrushchenko
2022-02-04 14:30   ` Jan Beulich
2022-02-04 14:37     ` Oleksandr Andrushchenko
2022-02-07  7:29       ` Jan Beulich
2022-02-07 11:27         ` Oleksandr Andrushchenko
2022-02-07 12:38           ` Jan Beulich
2022-02-07 12:51             ` Oleksandr Andrushchenko
2022-02-07 12:54               ` Jan Beulich
2022-02-07 14:17                 ` Oleksandr Andrushchenko
2022-02-07 14:31                   ` Jan Beulich
2022-02-07 14:46                     ` Oleksandr Andrushchenko
2022-02-07 15:05                       ` Jan Beulich
2022-02-07 15:14                         ` Oleksandr Andrushchenko
2022-02-07 15:28                           ` Jan Beulich
2022-02-07 15:59                             ` Oleksandr Andrushchenko
2022-02-10 12:54                     ` Oleksandr Andrushchenko
2022-02-10 13:36                       ` Jan Beulich
2022-02-10 13:56                         ` Oleksandr Andrushchenko
2022-02-10 12:59                     ` Oleksandr Andrushchenko
2022-02-04  6:34 ` [PATCH v6 11/13] vpci: add initial support for virtual PCI bus topology Oleksandr Andrushchenko
2022-02-04  6:34 ` [PATCH v6 12/13] xen/arm: translate virtual PCI bus topology for guests Oleksandr Andrushchenko
2022-02-04  7:56   ` Jan Beulich
2022-02-04  8:18     ` Oleksandr Andrushchenko
2022-02-04  6:34 ` [PATCH v6 13/13] xen/arm: account IO handlers for emulated PCI MSI-X Oleksandr Andrushchenko
2022-02-11 15:28   ` Julien Grall
