* [PATCH 0/9] PCI devices passthrough on Arm, part 3
@ 2021-09-03 10:08 Oleksandr Andrushchenko
  2021-09-03 10:08 ` [PATCH 1/9] vpci: Make vpci registers removal a dedicated function Oleksandr Andrushchenko
                   ` (8 more replies)
  0 siblings, 9 replies; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-03 10:08 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, bertrand.marquis,
	rahul.singh, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Hi, all!

This patch series focuses on vPCI and adds support for non-identity
PCI BAR mappings, which are required when passing a PCI device through
to a guest. The highlights are:

- Add relevant vpci register handlers when assigning a PCI device to a
  domain and remove them when de-assigning. This allows having different
  handlers for different domains, e.g. hwdom and other guests.

- Emulate guest BAR register values based on physical BAR values.
  This creates a guest view of the registers and emulates the size and
  property probing that the guest performs during PCI device enumeration.

- Instead of handling a single range set, that contains all the memory
  regions of all the BARs and ROM, have them per BAR.

- Take into account guest's BAR view and program its p2m accordingly:
  gfn is guest's view of the BAR and mfn is the physical BAR value as set
  up by the host bridge in the hardware domain.
  This way the hardware domain sees physical BAR values and the guest
  sees emulated ones.
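
As a rough illustration of the last point (not code from this series),
the gfn/mfn choice for one BAR could be sketched as below. The struct
and function names, and the 4 KiB page size, are assumptions made for
the example only; the BAR is assumed to be page aligned.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Hypothetical stand-in for one BAR; this is not Xen's struct vpci_bar. */
struct sketch_bar {
    uint64_t addr;        /* physical BAR address, as seen by hwdom */
    uint64_t guest_addr;  /* address the guest programmed into its view */
    uint64_t size;        /* BAR size in bytes, must be non-zero */
};

/* One gfn -> mfn mapping request covering a whole BAR. */
struct sketch_mapping {
    uint64_t gfn;       /* first guest frame number */
    uint64_t mfn;       /* first machine frame number */
    uint64_t nr_pages;  /* number of frames to map */
};

static struct sketch_mapping sketch_bar_mapping(const struct sketch_bar *bar,
                                                int is_hwdom)
{
    struct sketch_mapping m;
    uint64_t page_mask = (UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1;

    assert(bar->size != 0);

    /* The machine side is always the physical BAR value. */
    m.mfn = bar->addr >> SKETCH_PAGE_SHIFT;
    /* hwdom keeps an identity mapping; a guest uses its own BAR view. */
    m.gfn = (is_hwdom ? bar->addr : bar->guest_addr) >> SKETCH_PAGE_SHIFT;
    /* Round the size up to whole pages. */
    m.nr_pages = (bar->size + page_mask) >> SKETCH_PAGE_SHIFT;

    return m;
}
```

For a guest the resulting gfn and mfn differ whenever the guest places
the BAR somewhere other than the physical address, which is the
non-identity case this series is about.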

The series was also tested on x86 PVH Dom0 and doesn't break it.

Thank you,
Oleksandr

Oleksandr Andrushchenko (8):
  vpci: Make vpci registers removal a dedicated function
  vpci: Add hooks for PCI device assign/de-assign
  vpci/header: Move register assignments from init_bars
  vpci/header: Add and remove register handlers dynamically
  vpci/header: Implement guest BAR register handlers
  vpci/header: Handle p2m range sets per BAR
  vpci/header: program p2m with guest BAR view
  vpci/header: Reset the command register when adding devices

Rahul Singh (1):
  vpci/header: Use pdev's domain instead of vCPU

 xen/drivers/passthrough/pci.c |   9 +
 xen/drivers/vpci/header.c     | 431 +++++++++++++++++++++++++++-------
 xen/drivers/vpci/vpci.c       |  28 ++-
 xen/include/xen/pci_regs.h    |   1 +
 xen/include/xen/vpci.h        |  28 ++-
 5 files changed, 413 insertions(+), 84 deletions(-)

-- 
2.25.1




* [PATCH 1/9] vpci: Make vpci registers removal a dedicated function
  2021-09-03 10:08 [PATCH 0/9] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
@ 2021-09-03 10:08 ` Oleksandr Andrushchenko
  2021-09-03 10:08 ` [PATCH 2/9] vpci: Add hooks for PCI device assign/de-assign Oleksandr Andrushchenko
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-03 10:08 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, bertrand.marquis,
	rahul.singh, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

This is in preparation for dynamic assignment of the vpci register
handlers depending on the domain: hwdom or guest.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/drivers/vpci/vpci.c | 7 ++++++-
 xen/include/xen/vpci.h  | 2 ++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index cbd1bac7fc33..b05530f2a6b0 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -35,7 +35,7 @@ extern vpci_register_init_t *const __start_vpci_array[];
 extern vpci_register_init_t *const __end_vpci_array[];
 #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
 
-void vpci_remove_device(struct pci_dev *pdev)
+void vpci_remove_device_registers(struct pci_dev *pdev)
 {
     spin_lock(&pdev->vpci->lock);
     while ( !list_empty(&pdev->vpci->handlers) )
@@ -48,6 +48,11 @@ void vpci_remove_device(struct pci_dev *pdev)
         xfree(r);
     }
     spin_unlock(&pdev->vpci->lock);
+}
+
+void vpci_remove_device(struct pci_dev *pdev)
+{
+    vpci_remove_device_registers(pdev);
     xfree(pdev->vpci->msix);
     xfree(pdev->vpci->msi);
     xfree(pdev->vpci);
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 9f5b5d52e159..b861f438cc78 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -28,6 +28,8 @@ int __must_check vpci_add_handlers(struct pci_dev *dev);
 
 /* Remove all handlers and free vpci related structures. */
 void vpci_remove_device(struct pci_dev *pdev);
+/* Remove all handlers for the device given. */
+void vpci_remove_device_registers(struct pci_dev *pdev);
 
 /* Add/remove a register handler. */
 int __must_check vpci_add_register(struct vpci *vpci,
-- 
2.25.1




* [PATCH 2/9] vpci: Add hooks for PCI device assign/de-assign
  2021-09-03 10:08 [PATCH 0/9] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
  2021-09-03 10:08 ` [PATCH 1/9] vpci: Make vpci registers removal a dedicated function Oleksandr Andrushchenko
@ 2021-09-03 10:08 ` Oleksandr Andrushchenko
  2021-09-06 13:23   ` Jan Beulich
  2021-09-03 10:08 ` [PATCH 3/9] vpci/header: Move register assignments from init_bars Oleksandr Andrushchenko
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-03 10:08 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, bertrand.marquis,
	rahul.singh, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

When a PCI device is assigned or de-assigned, some work needs to be done
on the vPCI side for that device. Introduce a pair of hooks so vPCI can
handle that.
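
The intended call ordering can be sketched roughly as follows. This is
only an illustration: the sketch_* types and functions are hypothetical
stand-ins, and iommu_rc stands in for the result of the platform
assign_device call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not Xen's struct domain / pci_dev. */
struct sketch_domain {
    bool has_vpci;  /* stands in for has_vpci(d) */
    bool reserved;  /* stands in for domain_id >= DOMID_FIRST_RESERVED */
};

struct sketch_dev {
    bool vpci_notified;  /* set once the vPCI hook did its work */
};

/* vPCI side hook: a no-op for domains where vPCI makes no sense. */
static int sketch_vpci_assign(const struct sketch_domain *d,
                              struct sketch_dev *dev)
{
    assert(d != NULL && dev != NULL);

    if ( !d->has_vpci || d->reserved )
        return 0;

    dev->vpci_notified = true;
    return 0;
}

/* Caller side: only notify vPCI once the earlier step has succeeded. */
static int sketch_assign_device(const struct sketch_domain *d,
                                struct sketch_dev *dev, int iommu_rc)
{
    if ( iommu_rc )
        return iommu_rc;

    return sketch_vpci_assign(d, dev);
}
```

The point of the ordering is that the vPCI hook never runs for a device
whose assignment has already failed.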

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/drivers/passthrough/pci.c |  9 +++++++++
 xen/drivers/vpci/vpci.c       | 21 +++++++++++++++++++++
 xen/include/xen/vpci.h        | 16 ++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 25304dbe9956..deef986acbb4 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -864,6 +864,10 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
     if ( ret )
         goto out;
 
+    ret = vpci_deassign_device(d, pdev);
+    if ( ret )
+        goto out;
+
     if ( pdev->domain == hardware_domain  )
         pdev->quarantine = false;
 
@@ -1425,6 +1429,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
         rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag);
     }
 
+    if ( rc )
+        goto done;
+
+    rc = vpci_assign_device(d, pdev);
+
  done:
     if ( rc )
         printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index b05530f2a6b0..ee0ad63a3c12 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -86,6 +86,27 @@ int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
 
     return rc;
 }
+
+/* Notify vPCI that device is assigned to guest. */
+int vpci_assign_device(struct domain *d, struct pci_dev *dev)
+{
+    /* It only makes sense to assign for hwdom or guest domain. */
+    if ( !has_vpci(d) || (d->domain_id >= DOMID_FIRST_RESERVED) )
+        return 0;
+
+    return 0;
+}
+
+/* Notify vPCI that device is de-assigned from guest. */
+int vpci_deassign_device(struct domain *d, struct pci_dev *dev)
+{
+    /* It only makes sense to de-assign from hwdom or guest domain. */
+    if ( !has_vpci(d) || (d->domain_id >= DOMID_FIRST_RESERVED) )
+        return 0;
+
+    return 0;
+}
+
 #endif /* __XEN__ */
 
 static int vpci_register_cmp(const struct vpci_register *r1,
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index b861f438cc78..e7a1a09ab4c9 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -26,6 +26,12 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
 /* Add vPCI handlers to device. */
 int __must_check vpci_add_handlers(struct pci_dev *dev);
 
+/* Notify vPCI that device is assigned to guest. */
+int __must_check vpci_assign_device(struct domain *d, struct pci_dev *dev);
+
+/* Notify vPCI that device is de-assigned from guest. */
+int __must_check vpci_deassign_device(struct domain *d, struct pci_dev *dev);
+
 /* Remove all handlers and free vpci related structures. */
 void vpci_remove_device(struct pci_dev *pdev);
 /* Remove all handlers for the device given. */
@@ -220,6 +226,16 @@ static inline int vpci_add_handlers(struct pci_dev *pdev)
     return 0;
 }
 
+static inline int vpci_assign_device(struct domain *d, struct pci_dev *dev)
+{
+    return 0;
+}
+
+static inline int vpci_deassign_device(struct domain *d, struct pci_dev *dev)
+{
+    return 0;
+}
+
 static inline void vpci_dump_msi(void) { }
 
 static inline uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg,
-- 
2.25.1




* [PATCH 3/9] vpci/header: Move register assignments from init_bars
  2021-09-03 10:08 [PATCH 0/9] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
  2021-09-03 10:08 ` [PATCH 1/9] vpci: Make vpci registers removal a dedicated function Oleksandr Andrushchenko
  2021-09-03 10:08 ` [PATCH 2/9] vpci: Add hooks for PCI device assign/de-assign Oleksandr Andrushchenko
@ 2021-09-03 10:08 ` Oleksandr Andrushchenko
  2021-09-06 13:53   ` Jan Beulich
  2021-09-03 10:08 ` [PATCH 4/9] vpci/header: Add and remove register handlers dynamically Oleksandr Andrushchenko
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-03 10:08 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, bertrand.marquis,
	rahul.singh, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

This is in preparation for dynamic assignment of the vpci register
handlers depending on the domain: hwdom or guest.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/drivers/vpci/header.c | 83 ++++++++++++++++++++++++++-------------
 1 file changed, 56 insertions(+), 27 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index f8cd55e7c024..31bca7a12942 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -445,6 +445,55 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
 }
 
+static int add_bar_handlers(struct pci_dev *pdev)
+{
+    unsigned int i;
+    struct vpci_header *header = &pdev->vpci->header;
+    struct vpci_bar *bars = header->bars;
+    int rc;
+
+    /* Setup a handler for the command register. */
+    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
+                           2, header);
+    if ( rc )
+        return rc;
+
+    if ( pdev->ignore_bars )
+        return 0;
+
+    for ( i = 0; i < PCI_HEADER_NORMAL_NR_BARS + 1; i++ )
+    {
+        if ( (bars[i].type == VPCI_BAR_IO) || (bars[i].type == VPCI_BAR_EMPTY) )
+            continue;
+
+        if ( bars[i].type == VPCI_BAR_ROM )
+        {
+            unsigned int rom_reg;
+            uint8_t header_type = pci_conf_read8(pdev->sbdf,
+                                                 PCI_HEADER_TYPE) & 0x7f;
+            if ( header_type == PCI_HEADER_TYPE_NORMAL )
+                rom_reg = PCI_ROM_ADDRESS;
+            else
+                rom_reg = PCI_ROM_ADDRESS1;
+            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write,
+                                   rom_reg, 4, &bars[i]);
+            if ( rc )
+                return rc;
+        }
+        else
+        {
+            uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
+
+            /* This is either VPCI_BAR_MEM32 or VPCI_BAR_MEM64_{LO|HI}. */
+            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
+                                   4, &bars[i]);
+            if ( rc )
+                return rc;
+        }
+    }
+    return 0;
+}
+
 static int init_bars(struct pci_dev *pdev)
 {
     uint16_t cmd;
@@ -470,14 +519,8 @@ static int init_bars(struct pci_dev *pdev)
         return -EOPNOTSUPP;
     }
 
-    /* Setup a handler for the command register. */
-    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
-                           2, header);
-    if ( rc )
-        return rc;
-
     if ( pdev->ignore_bars )
-        return 0;
+        return add_bar_handlers(pdev);
 
     /* Disable memory decoding before sizing. */
     cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
@@ -492,14 +535,6 @@ static int init_bars(struct pci_dev *pdev)
         if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
         {
             bars[i].type = VPCI_BAR_MEM64_HI;
-            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
-                                   4, &bars[i]);
-            if ( rc )
-            {
-                pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-                return rc;
-            }
-
             continue;
         }
 
@@ -532,14 +567,6 @@ static int init_bars(struct pci_dev *pdev)
         bars[i].addr = addr;
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
-
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
-                               &bars[i]);
-        if ( rc )
-        {
-            pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-            return rc;
-        }
     }
 
     /* Check expansion ROM. */
@@ -553,11 +580,13 @@ static int init_bars(struct pci_dev *pdev)
         rom->addr = addr;
         header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
                               PCI_ROM_ADDRESS_ENABLE;
+    }
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
-                               4, rom);
-        if ( rc )
-            rom->type = VPCI_BAR_EMPTY;
+    rc = add_bar_handlers(pdev);
+    if ( rc )
+    {
+        pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
+        return rc;
     }
 
     return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
-- 
2.25.1




* [PATCH 4/9] vpci/header: Add and remove register handlers dynamically
  2021-09-03 10:08 [PATCH 0/9] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (2 preceding siblings ...)
  2021-09-03 10:08 ` [PATCH 3/9] vpci/header: Move register assignments from init_bars Oleksandr Andrushchenko
@ 2021-09-03 10:08 ` Oleksandr Andrushchenko
  2021-09-06 14:11   ` Jan Beulich
  2021-09-10 21:14   ` Stefano Stabellini
  2021-09-03 10:08 ` [PATCH 5/9] vpci/header: Implement guest BAR register handlers Oleksandr Andrushchenko
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-03 10:08 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, bertrand.marquis,
	rahul.singh, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Add relevant vpci register handlers when assigning a PCI device to a
domain and remove them when de-assigning. This allows having different
handlers for different domains, e.g. hwdom and other guests.

Use stubs for guest domains for now.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/drivers/vpci/header.c | 78 +++++++++++++++++++++++++++++++++++----
 xen/drivers/vpci/vpci.c   |  4 +-
 xen/include/xen/vpci.h    |  4 ++
 3 files changed, 76 insertions(+), 10 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 31bca7a12942..5218b1af247e 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -397,6 +397,17 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
     pci_conf_write32(pdev->sbdf, reg, val);
 }
 
+static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
+                            uint32_t val, void *data)
+{
+}
+
+static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
+                               void *data)
+{
+    return 0xffffffff;
+}
+
 static void rom_write(const struct pci_dev *pdev, unsigned int reg,
                       uint32_t val, void *data)
 {
@@ -445,14 +456,25 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
 }
 
-static int add_bar_handlers(struct pci_dev *pdev)
+static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
+                            uint32_t val, void *data)
+{
+}
+
+static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
+                               void *data)
+{
+    return 0xffffffff;
+}
+
+static int add_bar_handlers(struct pci_dev *pdev, bool is_hwdom)
 {
     unsigned int i;
     struct vpci_header *header = &pdev->vpci->header;
     struct vpci_bar *bars = header->bars;
     int rc;
 
-    /* Setup a handler for the command register. */
+    /* Setup a handler for the command register: same for hwdom and guests. */
     rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
                            2, header);
     if ( rc )
@@ -475,8 +497,13 @@ static int add_bar_handlers(struct pci_dev *pdev)
                 rom_reg = PCI_ROM_ADDRESS;
             else
                 rom_reg = PCI_ROM_ADDRESS1;
-            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write,
-                                   rom_reg, 4, &bars[i]);
+            if ( is_hwdom )
+                rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write,
+                                       rom_reg, 4, &bars[i]);
+            else
+                rc = vpci_add_register(pdev->vpci,
+                                       guest_rom_read, guest_rom_write,
+                                       rom_reg, 4, &bars[i]);
             if ( rc )
                 return rc;
         }
@@ -485,8 +512,13 @@ static int add_bar_handlers(struct pci_dev *pdev)
             uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
 
             /* This is either VPCI_BAR_MEM32 or VPCI_BAR_MEM64_{LO|HI}. */
-            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
-                                   4, &bars[i]);
+            if ( is_hwdom )
+                rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write,
+                                       reg, 4, &bars[i]);
+            else
+                rc = vpci_add_register(pdev->vpci,
+                                       guest_bar_read, guest_bar_write,
+                                       reg, 4, &bars[i]);
             if ( rc )
                 return rc;
         }
@@ -520,7 +552,7 @@ static int init_bars(struct pci_dev *pdev)
     }
 
     if ( pdev->ignore_bars )
-        return add_bar_handlers(pdev);
+        return add_bar_handlers(pdev, true);
 
     /* Disable memory decoding before sizing. */
     cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
@@ -582,7 +614,7 @@ static int init_bars(struct pci_dev *pdev)
                               PCI_ROM_ADDRESS_ENABLE;
     }
 
-    rc = add_bar_handlers(pdev);
+    rc = add_bar_handlers(pdev, true);
     if ( rc )
     {
         pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
@@ -593,6 +625,36 @@ static int init_bars(struct pci_dev *pdev)
 }
 REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
 
+int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
+{
+    int rc;
+
+    /* Remove previously added registers. */
+    vpci_remove_device_registers(pdev);
+
+    /* It only makes sense to add registers for hwdom or guest domain. */
+    if ( d->domain_id >= DOMID_FIRST_RESERVED )
+        return 0;
+
+    if ( is_hardware_domain(d) )
+        rc = add_bar_handlers(pdev, true);
+    else
+        rc = add_bar_handlers(pdev, false);
+
+    if ( rc )
+        gprintk(XENLOG_ERR,
+            "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
+            d->domain_id);
+    return rc;
+}
+
+int vpci_bar_remove_handlers(const struct domain *d, struct pci_dev *pdev)
+{
+    /* Remove previously added registers. */
+    vpci_remove_device_registers(pdev);
+    return 0;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index ee0ad63a3c12..4530313f01e7 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -94,7 +94,7 @@ int vpci_assign_device(struct domain *d, struct pci_dev *dev)
     if ( !has_vpci(d) || (d->domain_id >= DOMID_FIRST_RESERVED) )
         return 0;
 
-    return 0;
+    return vpci_bar_add_handlers(d, dev);
 }
 
 /* Notify vPCI that device is de-assigned from guest. */
@@ -104,7 +104,7 @@ int vpci_deassign_device(struct domain *d, struct pci_dev *dev)
     if ( !has_vpci(d) || (d->domain_id >= DOMID_FIRST_RESERVED) )
         return 0;
 
-    return 0;
+    return vpci_bar_remove_handlers(d, dev);
 }
 
 #endif /* __XEN__ */
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index e7a1a09ab4c9..4aa2941a1081 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -63,6 +63,10 @@ uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg,
  */
 bool __must_check vpci_process_pending(struct vcpu *v);
 
+/* Add/remove BAR handlers for a domain. */
+int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev);
+int vpci_bar_remove_handlers(const struct domain *d, struct pci_dev *pdev);
+
 struct vpci {
     /* List of vPCI handlers for a device. */
     struct list_head handlers;
-- 
2.25.1




* [PATCH 5/9] vpci/header: Implement guest BAR register handlers
  2021-09-03 10:08 [PATCH 0/9] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (3 preceding siblings ...)
  2021-09-03 10:08 ` [PATCH 4/9] vpci/header: Add and remove register handlers dynamically Oleksandr Andrushchenko
@ 2021-09-03 10:08 ` Oleksandr Andrushchenko
  2021-09-06 14:31   ` Jan Beulich
  2021-09-03 10:08 ` [PATCH 6/9] vpci/header: Handle p2m range sets per BAR Oleksandr Andrushchenko
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-03 10:08 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, bertrand.marquis,
	rahul.singh, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Emulate guest BAR register values: this creates a guest view of the
registers and emulates the size and property probing that the guest
performs during PCI device enumeration.

The ROM BAR is only handled for the hardware domain; for guest domains
there is a stub. At the moment PCI expansion ROM is x86 only, so it
might not be used by other architectures without emulating x86. Other
use-cases include using the expansion ROM before Xen boots, in which
case no emulation is needed in Xen itself, or a guest wanting to use
the ROM code, which seems to be rare.
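
For reference, the probe being emulated works as sketched below for a
32-bit memory BAR. Note this is a simplified stand-alone model, not the
code in this patch: it hardwires the low address bits on write instead
of special-casing the all-ones value on read, and all sketch_* names
are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MEM_MASK      0xfffffff0u  /* address bits of a memory BAR */
#define SKETCH_MEM_PREFETCH  0x08u        /* prefetchable flag */

/* Hypothetical model of a 32-bit memory BAR as the guest sees it. */
struct sketch_bar32 {
    uint32_t guest_addr;  /* address programmed by the guest */
    uint32_t size;        /* power of two, at least 16 bytes */
    bool prefetchable;
};

static void sketch_bar32_write(struct sketch_bar32 *bar, uint32_t val)
{
    assert(bar->size >= 16 && (bar->size & (bar->size - 1)) == 0);

    /* Address bits below the BAR size read back as zero. */
    bar->guest_addr = val & SKETCH_MEM_MASK & ~(bar->size - 1);
}

static uint32_t sketch_bar32_read(const struct sketch_bar32 *bar)
{
    uint32_t val = bar->guest_addr;

    /* Low bits carry the BAR properties, not address bits. */
    if ( bar->prefetchable )
        val |= SKETCH_MEM_PREFETCH;

    return val;
}

/* What a guest does during enumeration to find the BAR size. */
static uint32_t sketch_probe_size(struct sketch_bar32 *bar)
{
    uint32_t saved = bar->guest_addr;
    uint32_t probe;

    sketch_bar32_write(bar, 0xffffffffu);
    probe = sketch_bar32_read(bar) & SKETCH_MEM_MASK;
    sketch_bar32_write(bar, saved);

    /* The size is the two's complement of the writable address bits. */
    return ~probe + 1;
}
```

The guest never needs to see the physical BAR address for this to work;
only the size and the property bits come from the real device.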

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/drivers/vpci/header.c  | 69 +++++++++++++++++++++++++++++++++++++-
 xen/include/xen/pci_regs.h |  1 +
 xen/include/xen/vpci.h     |  3 ++
 3 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 5218b1af247e..793f79ece831 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -400,12 +400,72 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
 static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
                             uint32_t val, void *data)
 {
+    struct vpci_bar *bar = data;
+    bool hi = false;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+    else
+        val &= PCI_BASE_ADDRESS_MEM_MASK;
+    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
+    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
 }
 
 static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
                                void *data)
 {
-    return 0xffffffff;
+    struct vpci_bar *bar = data;
+    uint32_t val;
+    bool hi = false;
+
+    switch ( bar->type )
+    {
+    case VPCI_BAR_MEM64_HI:
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+        /* fallthrough */
+    case VPCI_BAR_MEM64_LO:
+    {
+        if ( hi )
+            val = bar->guest_addr >> 32;
+        else
+            val = bar->guest_addr & 0xffffffff;
+        if ( (val & PCI_BASE_ADDRESS_MEM_MASK_32) ==  PCI_BASE_ADDRESS_MEM_MASK_32 )
+        {
+            /* Guest detects the BAR's properties and size. */
+            if ( hi )
+                val = bar->size >> 32;
+            else
+                val = 0xffffffff & ~(bar->size - 1);
+        }
+        if ( !hi )
+        {
+            val |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+            val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+        }
+        bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
+        bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
+        break;
+    }
+    case VPCI_BAR_MEM32:
+    {
+        val = bar->guest_addr;
+        if ( (val & PCI_BASE_ADDRESS_MEM_MASK_32) ==  PCI_BASE_ADDRESS_MEM_MASK_32 )
+            val = 0xffffffff & ~(bar->size - 1);
+        val |= PCI_BASE_ADDRESS_MEM_TYPE_32;
+        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+        break;
+    }
+    default:
+        val = bar->guest_addr;
+        break;
+    }
+    return val;
 }
 
 static void rom_write(const struct pci_dev *pdev, unsigned int reg,
@@ -522,6 +582,13 @@ static int add_bar_handlers(struct pci_dev *pdev, bool is_hwdom)
             if ( rc )
                 return rc;
         }
+        /*
+         * It is neither safe nor secure to initialize guest's view of the BARs
+         * with real values which are used by the hardware domain, so assign
+         * all zeros to guest's view of the BARs, so the guest can perform
+         * proper PCI device enumeration and assign BARs on its own.
+         */
+        bars[i].guest_addr = 0;
     }
     return 0;
 }
diff --git a/xen/include/xen/pci_regs.h b/xen/include/xen/pci_regs.h
index cc4ee3b83e5c..038eb18c5357 100644
--- a/xen/include/xen/pci_regs.h
+++ b/xen/include/xen/pci_regs.h
@@ -103,6 +103,7 @@
 #define  PCI_BASE_ADDRESS_MEM_TYPE_64	0x04	/* 64 bit address */
 #define  PCI_BASE_ADDRESS_MEM_PREFETCH	0x08	/* prefetchable? */
 #define  PCI_BASE_ADDRESS_MEM_MASK	(~0x0fUL)
+#define  PCI_BASE_ADDRESS_MEM_MASK_32	(~0x0fU)
 #define  PCI_BASE_ADDRESS_IO_MASK	(~0x03UL)
 /* bit 1 is reserved if address_space = 1 */
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 4aa2941a1081..db86b8e7fa3c 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -77,7 +77,10 @@ struct vpci {
     struct vpci_header {
         /* Information about the PCI BARs of this device. */
         struct vpci_bar {
+            /* Physical view of the BAR. */
             uint64_t addr;
+            /* Guest view of the BAR. */
+            uint64_t guest_addr;
             uint64_t size;
             enum {
                 VPCI_BAR_EMPTY,
-- 
2.25.1




* [PATCH 6/9] vpci/header: Handle p2m range sets per BAR
  2021-09-03 10:08 [PATCH 0/9] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (4 preceding siblings ...)
  2021-09-03 10:08 ` [PATCH 5/9] vpci/header: Implement guest BAR register handlers Oleksandr Andrushchenko
@ 2021-09-03 10:08 ` Oleksandr Andrushchenko
  2021-09-06 14:47   ` Jan Beulich
  2021-09-03 10:08 ` [PATCH 7/9] vpci/header: program p2m with guest BAR view Oleksandr Andrushchenko
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-03 10:08 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, bertrand.marquis,
	rahul.singh, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Instead of handling a single range set that contains all the memory
regions of all the BARs and the ROM, keep one range set per BAR.

This is in preparation for making non-identity mappings in p2m for the
MMIOs/ROM.
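
The per-BAR processing can be pictured with the simplified sketch
below. It is not the rangeset-based code from this patch: each BAR has
at most one pending range here, a work budget stands in for the
-ERESTART preemption, and all sketch_* names are invented for the
example. The count of 7 assumes six BARs plus the expansion ROM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_BARS 7  /* assumption: 6 BARs plus the expansion ROM */

/* Hypothetical per-BAR pending range, in place of a struct rangeset. */
struct sketch_range {
    unsigned long start, end;
    bool pending;
};

struct sketch_header {
    struct sketch_range bars[SKETCH_NR_BARS];
    unsigned int num_pending;  /* number of BARs with pending work */
};

/*
 * Process at most 'budget' BARs. Returns true if work remains and the
 * caller should come back later, false once everything is done.
 */
static bool sketch_process_pending(struct sketch_header *h,
                                   unsigned int budget)
{
    size_t i;

    assert(h != NULL);

    for ( i = 0; i < SKETCH_NR_BARS && h->num_pending; i++ )
    {
        struct sketch_range *r = &h->bars[i];

        if ( !r->pending )
            continue;

        if ( budget == 0 )
            return true;

        /* The real code would map or unmap [start, end] here. */
        r->pending = false;
        h->num_pending--;
        budget--;
    }

    return h->num_pending != 0;
}
```

Keeping the ranges per BAR is what later allows each BAR to be mapped
at its own guest address.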

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/drivers/vpci/header.c | 172 ++++++++++++++++++++++++++------------
 xen/include/xen/vpci.h    |   3 +-
 2 files changed, 122 insertions(+), 53 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 793f79ece831..7f54199a3894 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -131,49 +131,75 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
 
 bool vpci_process_pending(struct vcpu *v)
 {
-    if ( v->vpci.mem )
+    if ( v->vpci.num_mem_ranges )
     {
         struct map_data data = {
             .d = v->domain,
             .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
         };
-        int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
+        struct pci_dev *pdev = v->vpci.pdev;
+        struct vpci_header *header = &pdev->vpci->header;
+        unsigned int i;
 
-        if ( rc == -ERESTART )
-            return true;
+        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+        {
+            struct vpci_bar *bar = &header->bars[i];
+            int rc;
 
-        spin_lock(&v->vpci.pdev->vpci->lock);
-        /* Disable memory decoding unconditionally on failure. */
-        modify_decoding(v->vpci.pdev,
-                        rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
-                        !rc && v->vpci.rom_only);
-        spin_unlock(&v->vpci.pdev->vpci->lock);
+            if ( !bar->mem )
+                continue;
 
-        rangeset_destroy(v->vpci.mem);
-        v->vpci.mem = NULL;
-        if ( rc )
-            /*
-             * FIXME: in case of failure remove the device from the domain.
-             * Note that there might still be leftover mappings. While this is
-             * safe for Dom0, for DomUs the domain will likely need to be
-             * killed in order to avoid leaking stale p2m mappings on
-             * failure.
-             */
-            vpci_remove_device(v->vpci.pdev);
+            rc = rangeset_consume_ranges(bar->mem, map_range, &data);
+
+            if ( rc == -ERESTART )
+                return true;
+
+            spin_lock(&pdev->vpci->lock);
+            /* Disable memory decoding unconditionally on failure. */
+            modify_decoding(pdev,
+                            rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
+                            !rc && v->vpci.rom_only);
+            spin_unlock(&pdev->vpci->lock);
+
+            rangeset_destroy(bar->mem);
+            bar->mem = NULL;
+            v->vpci.num_mem_ranges--;
+            if ( rc )
+                /*
+                 * FIXME: in case of failure remove the device from the domain.
+                 * Note that there might still be leftover mappings. While this is
+                 * safe for Dom0, for DomUs the domain will likely need to be
+                 * killed in order to avoid leaking stale p2m mappings on
+                 * failure.
+                 */
+                vpci_remove_device(pdev);
+        }
     }
 
     return false;
 }
 
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
-                            struct rangeset *mem, uint16_t cmd)
+                            uint16_t cmd)
 {
     struct map_data data = { .d = d, .map = true };
-    int rc;
+    struct vpci_header *header = &pdev->vpci->header;
+    int rc = 0;
+    unsigned int i;
+
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        struct vpci_bar *bar = &header->bars[i];
 
-    while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
-        process_pending_softirqs();
-    rangeset_destroy(mem);
+        if ( !bar->mem )
+            continue;
+
+        while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
+                                              &data)) == -ERESTART )
+            process_pending_softirqs();
+        rangeset_destroy(bar->mem);
+        bar->mem = NULL;
+    }
     if ( !rc )
         modify_decoding(pdev, cmd, false);
 
@@ -181,7 +207,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
 }
 
 static void defer_map(struct domain *d, struct pci_dev *pdev,
-                      struct rangeset *mem, uint16_t cmd, bool rom_only)
+                      uint16_t cmd, bool rom_only, uint8_t num_mem_ranges)
 {
     struct vcpu *curr = current;
 
@@ -192,9 +218,9 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
      * started for the same device if the domain is not well-behaved.
      */
     curr->vpci.pdev = pdev;
-    curr->vpci.mem = mem;
     curr->vpci.cmd = cmd;
     curr->vpci.rom_only = rom_only;
+    curr->vpci.num_mem_ranges = num_mem_ranges;
     /*
      * Raise a scheduler softirq in order to prevent the guest from resuming
      * execution with pending mapping operations, to trigger the invocation
@@ -206,42 +232,47 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
 static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 {
     struct vpci_header *header = &pdev->vpci->header;
-    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
     struct pci_dev *tmp, *dev = NULL;
     const struct vpci_msix *msix = pdev->vpci->msix;
-    unsigned int i;
+    unsigned int i, j;
     int rc;
-
-    if ( !mem )
-        return -ENOMEM;
+    uint8_t num_mem_ranges;
 
     /*
-     * Create a rangeset that represents the current device BARs memory region
+     * Create a rangeset per BAR that represents the current device memory region
      * and compare it against all the currently active BAR memory regions. If
      * an overlap is found, subtract it from the region to be mapped/unmapped.
      *
-     * First fill the rangeset with all the BARs of this device or with the ROM
+     * First fill the rangesets with all the BARs of this device or with the ROM
      * BAR only, depending on whether the guest is toggling the memory decode
      * bit of the command register, or the enable bit of the ROM BAR register.
      */
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
-        const struct vpci_bar *bar = &header->bars[i];
+        struct vpci_bar *bar = &header->bars[i];
         unsigned long start = PFN_DOWN(bar->addr);
         unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
 
+        bar->mem = NULL;
+
         if ( !MAPPABLE_BAR(bar) ||
              (rom_only ? bar->type != VPCI_BAR_ROM
                        : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) )
             continue;
 
-        rc = rangeset_add_range(mem, start, end);
+        bar->mem = rangeset_new(NULL, NULL, 0);
+        if ( !bar->mem )
+        {
+            rc = -ENOMEM;
+            goto fail;
+        }
+
+        rc = rangeset_add_range(bar->mem, start, end);
         if ( rc )
         {
             printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
                    start, end, rc);
-            rangeset_destroy(mem);
-            return rc;
+            goto fail;
         }
     }
 
@@ -252,14 +283,21 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
         unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
                                      vmsix_table_size(pdev->vpci, i) - 1);
 
-        rc = rangeset_remove_range(mem, start, end);
-        if ( rc )
+        for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
         {
-            printk(XENLOG_G_WARNING
-                   "Failed to remove MSIX table [%lx, %lx]: %d\n",
-                   start, end, rc);
-            rangeset_destroy(mem);
-            return rc;
+            const struct vpci_bar *bar = &header->bars[j];
+
+            if ( !bar->mem )
+                continue;
+
+            rc = rangeset_remove_range(bar->mem, start, end);
+            if ( rc )
+            {
+                printk(XENLOG_G_WARNING
+                       "Failed to remove MSIX table [%lx, %lx]: %d\n",
+                       start, end, rc);
+                goto fail;
+            }
         }
     }
 
@@ -291,7 +329,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             unsigned long start = PFN_DOWN(bar->addr);
             unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
 
-            if ( !bar->enabled || !rangeset_overlaps_range(mem, start, end) ||
+            if ( !bar->enabled ||
+                 !rangeset_overlaps_range(bar->mem, start, end) ||
                  /*
                   * If only the ROM enable bit is toggled check against other
                   * BARs in the same device for overlaps, but not against the
@@ -300,13 +339,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
                  (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
                 continue;
 
-            rc = rangeset_remove_range(mem, start, end);
+            rc = rangeset_remove_range(bar->mem, start, end);
             if ( rc )
             {
                 printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
                        start, end, rc);
-                rangeset_destroy(mem);
-                return rc;
+                goto fail;
             }
         }
     }
@@ -324,12 +362,42 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
          * will always be to establish mappings and process all the BARs.
          */
         ASSERT((cmd & PCI_COMMAND_MEMORY) && !rom_only);
-        return apply_map(pdev->domain, pdev, mem, cmd);
+        return apply_map(pdev->domain, pdev, cmd);
     }
 
-    defer_map(dev->domain, dev, mem, cmd, rom_only);
+    /* Find out how many memory ranges are left after MSI and overlaps. */
+    num_mem_ranges = 0;
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        struct vpci_bar *bar = &header->bars[i];
+
+        if ( !rangeset_is_empty(bar->mem) )
+            num_mem_ranges++;
+    }
+
+    /*
+     * There are cases when a PCI device, a root port for example, has neither
+     * memory space nor IO. In this case the PCI command register write is
+     * missed, leaving the underlying PCI device non-functional, so:
+     *   - if there are no regions write the command register now
+     *   - if there are regions then defer work and write later on
+     */
+    if ( !num_mem_ranges )
+        pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
+    else
+        defer_map(dev->domain, dev, cmd, rom_only, num_mem_ranges);
 
     return 0;
+
+fail:
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        struct vpci_bar *bar = &header->bars[i];
+
+        rangeset_destroy(bar->mem);
+        bar->mem = NULL;
+    }
+    return rc;
 }
 
 static void cmd_write(const struct pci_dev *pdev, unsigned int reg,
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index db86b8e7fa3c..a0cbdb4bf4fd 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -82,6 +82,7 @@ struct vpci {
             /* Guest view of the BAR. */
             uint64_t guest_addr;
             uint64_t size;
+            struct rangeset *mem;
             enum {
                 VPCI_BAR_EMPTY,
                 VPCI_BAR_IO,
@@ -156,9 +157,9 @@ struct vpci {
 
 struct vpci_vcpu {
     /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
-    struct rangeset *mem;
     struct pci_dev *pdev;
     uint16_t cmd;
+    uint8_t num_mem_ranges;
     bool rom_only : 1;
 };
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 7/9] vpci/header: program p2m with guest BAR view
  2021-09-03 10:08 [PATCH 0/9] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (5 preceding siblings ...)
  2021-09-03 10:08 ` [PATCH 6/9] vpci/header: Handle p2m range sets per BAR Oleksandr Andrushchenko
@ 2021-09-03 10:08 ` Oleksandr Andrushchenko
  2021-09-06 14:51   ` Jan Beulich
  2021-09-03 10:08 ` [PATCH 8/9] vpci/header: Reset the command register when adding devices Oleksandr Andrushchenko
  2021-09-03 10:08 ` [PATCH 9/9] vpci/header: Use pdev's domain instead of vCPU Oleksandr Andrushchenko
  8 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-03 10:08 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, bertrand.marquis,
	rahul.singh, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Take into account guest's BAR view and program its p2m accordingly:
gfn is guest's view of the BAR and mfn is the physical BAR value as set
up by the host bridge in the hardware domain.
This way the hardware domain sees physical BAR values and the guest sees
emulated ones.
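
As a rough illustration of the translation (a sketch only: plain integers
stand in for Xen's gfn_t/mfn_t, and the helper name is hypothetical), the
guest frame for a host frame s inside the BAR is the guest start plus the
offset of s from the physical start:

```c
/*
 * Hypothetical helper, not part of the patch: translate a host frame
 * number inside a BAR into the frame number the guest should see.
 *
 *   start_gfn - first frame of the BAR as seen by the guest
 *   start_mfn - first frame of the BAR as programmed in hardware
 *   s         - host frame inside the BAR (s >= start_mfn)
 */
static unsigned long bar_mfn_to_gfn(unsigned long start_gfn,
                                    unsigned long start_mfn,
                                    unsigned long s)
{
    /* Keep the offset within the BAR, only rebase its start. */
    return start_gfn + (s - start_mfn);
}
```

For the hardware domain start_gfn equals start_mfn, so the result is s
itself and the mapping stays identity.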

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/drivers/vpci/header.c | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 7f54199a3894..7416ef1e1e06 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -30,6 +30,10 @@
 
 struct map_data {
     struct domain *d;
+    /* Start address of the BAR as seen by the guest. */
+    gfn_t start_gfn;
+    /* Physical start address of the BAR. */
+    mfn_t start_mfn;
     bool map;
 };
 
@@ -37,12 +41,28 @@ static int map_range(unsigned long s, unsigned long e, void *data,
                      unsigned long *c)
 {
     const struct map_data *map = data;
+    gfn_t start_gfn;
     int rc;
 
     for ( ; ; )
     {
         unsigned long size = e - s + 1;
 
+        /*
+         * Any BAR may have holes in its memory we want to map, e.g.
+         * we don't want to map MSI regions which may be a part of that BAR,
+         * e.g. when a single BAR is used for both MMIO and MSI.
+         * In this case MSI regions are subtracted from the mapping, but
+         * map->start_gfn still points to the very beginning of the BAR.
+         * So if there is a hole present then we need to adjust start_gfn
+         * to reflect the fact of that subtraction.
+         */
+        start_gfn = gfn_add(map->start_gfn, s - mfn_x(map->start_mfn));
+
+        printk(XENLOG_G_DEBUG
+               "%smap [%lx, %lx] -> %#"PRI_gfn" for d%d\n",
+               map->map ? "" : "un", s, e, gfn_x(start_gfn),
+               map->d->domain_id);
         /*
          * ARM TODOs:
          * - On ARM whether the memory is prefetchable or not should be passed
@@ -52,8 +72,10 @@ static int map_range(unsigned long s, unsigned long e, void *data,
          * - {un}map_mmio_regions doesn't support preemption.
          */
 
-        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
-                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
+        rc = map->map ? map_mmio_regions(map->d, start_gfn,
+                                         size, _mfn(s))
+                      : unmap_mmio_regions(map->d, start_gfn,
+                                           size, _mfn(s));
         if ( rc == 0 )
         {
             *c += size;
@@ -69,6 +91,7 @@ static int map_range(unsigned long s, unsigned long e, void *data,
         ASSERT(rc < size);
         *c += rc;
         s += rc;
+        gfn_add(map->start_gfn, rc);
         if ( general_preempt_check() )
                 return -ERESTART;
     }
@@ -149,6 +172,10 @@ bool vpci_process_pending(struct vcpu *v)
             if ( !bar->mem )
                 continue;
 
+            data.start_gfn = is_hardware_domain(v->vpci.pdev->domain) ?
+                _gfn(PFN_DOWN(bar->addr)) :
+                _gfn(PFN_DOWN(bar->guest_addr));
+            data.start_mfn = _mfn(PFN_DOWN(bar->addr));
             rc = rangeset_consume_ranges(bar->mem, map_range, &data);
 
             if ( rc == -ERESTART )
@@ -194,6 +221,10 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
         if ( !bar->mem )
             continue;
 
+        data.start_gfn = is_hardware_domain(d) ?
+            _gfn(PFN_DOWN(bar->addr)) :
+            _gfn(PFN_DOWN(bar->guest_addr));
+        data.start_mfn = _mfn(PFN_DOWN(bar->addr));
         while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
                                               &data)) == -ERESTART )
             process_pending_softirqs();
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-03 10:08 [PATCH 0/9] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (6 preceding siblings ...)
  2021-09-03 10:08 ` [PATCH 7/9] vpci/header: program p2m with guest BAR view Oleksandr Andrushchenko
@ 2021-09-03 10:08 ` Oleksandr Andrushchenko
  2021-09-06 14:55   ` Jan Beulich
  2021-09-03 10:08 ` [PATCH 9/9] vpci/header: Use pdev's domain instead of vCPU Oleksandr Andrushchenko
  8 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-03 10:08 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, bertrand.marquis,
	rahul.singh, Oleksandr Andrushchenko

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Reset the command register when passing through a PCI device:
it is possible that when passing through a PCI device its memory
decoding bits in the command register are already set. Thus, a
guest OS may not write to the command register to update memory
decoding, so guest mappings (guest's view of the BARs) are
left not updated.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/drivers/vpci/header.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 7416ef1e1e06..dac973368b1e 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -811,6 +811,16 @@ int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
         gprintk(XENLOG_ERR,
             "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
             d->domain_id);
+
+    /*
+     * Reset the command register: it is possible that when passing
+     * through a PCI device its memory decoding bits in the command
+     * register are already set. Thus, a guest OS may not write to the
+     * command register to update memory decoding, so guest mappings
+     * (guest's view of the BARs) are left not updated.
+     */
+    pci_conf_write16(pdev->sbdf, PCI_COMMAND, 0);
+
     return rc;
 }
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH 9/9] vpci/header: Use pdev's domain instead of vCPU
  2021-09-03 10:08 [PATCH 0/9] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
                   ` (7 preceding siblings ...)
  2021-09-03 10:08 ` [PATCH 8/9] vpci/header: Reset the command register when adding devices Oleksandr Andrushchenko
@ 2021-09-03 10:08 ` Oleksandr Andrushchenko
  2021-09-06 14:57   ` Jan Beulich
  8 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-03 10:08 UTC (permalink / raw)
  To: xen-devel
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, jbeulich, bertrand.marquis,
	rahul.singh, Oleksandr Andrushchenko

From: Rahul Singh <rahul.singh@arm.com>

Fixes: 9c244fdef7e7 ("vpci: add header handlers")

Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
 xen/drivers/vpci/header.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index dac973368b1e..688c69acbc23 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -157,7 +157,7 @@ bool vpci_process_pending(struct vcpu *v)
     if ( v->vpci.num_mem_ranges )
     {
         struct map_data data = {
-            .d = v->domain,
+            .d = v->vpci.pdev->domain,
             .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
         };
         struct pci_dev *pdev = v->vpci.pdev;
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/9] vpci: Add hooks for PCI device assign/de-assign
  2021-09-03 10:08 ` [PATCH 2/9] vpci: Add hooks for PCI device assign/de-assign Oleksandr Andrushchenko
@ 2021-09-06 13:23   ` Jan Beulich
  2021-09-07  8:33     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-06 13:23 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko, xen-devel

On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -864,6 +864,10 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>      if ( ret )
>          goto out;
>  
> +    ret = vpci_deassign_device(d, pdev);
> +    if ( ret )
> +        goto out;
> +
>      if ( pdev->domain == hardware_domain  )
>          pdev->quarantine = false;
>  
> @@ -1425,6 +1429,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>          rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag);
>      }
>  
> +    if ( rc )
> +        goto done;
> +
> +    rc = vpci_assign_device(d, pdev);
> +
>   done:
>      if ( rc )
>          printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",

I have to admit that I'm worried by the further lack of unwinding in
case of error: We're not really good at this, I agree, but it would
be quite nice if the problem didn't get worse. At the very least if
the device was de-assigned from Dom0 and assignment to a DomU failed,
imo you will want to restore Dom0's settings.

Also in the latter case don't you need to additionally call
vpci_deassign_device() for the prior owner of the device?

> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -86,6 +86,27 @@ int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
>  
>      return rc;
>  }
> +
> +/* Notify vPCI that device is assigned to guest. */
> +int vpci_assign_device(struct domain *d, struct pci_dev *dev)
> +{
> +    /* It only makes sense to assign for hwdom or guest domain. */
> +    if ( !has_vpci(d) || (d->domain_id >= DOMID_FIRST_RESERVED) )

Please don't open-code is_system_domain(). I also think you want to
flip the two sides of the ||, to avoid evaluating whatever has_vpci()
expands to for system domains. (Both again below.)
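
A minimal sketch of the suggested ordering, assuming simplified stand-ins
for struct domain, is_system_domain() and has_vpci() (the real definitions
live in the Xen headers; the call counter and vpci_should_skip() exist only
for illustration). With the cheap check first, || short-circuits and
has_vpci() is never evaluated for system domains:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in value and types; assumptions made for this sketch only. */
#define DOMID_FIRST_RESERVED 0x7FF0U

struct domain {
    uint16_t domain_id;
    bool vpci_enabled;
};

/* Counts evaluations of has_vpci() so the short-circuit is observable. */
static unsigned int has_vpci_calls;

static bool is_system_domain(const struct domain *d)
{
    return d->domain_id >= DOMID_FIRST_RESERVED;
}

static bool has_vpci(const struct domain *d)
{
    has_vpci_calls++;
    return d->vpci_enabled;
}

/* Returns true when there is nothing to do for this domain. */
static bool vpci_should_skip(const struct domain *d)
{
    /* System-domain check first: has_vpci() is skipped when it is true. */
    return is_system_domain(d) || !has_vpci(d);
}
```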

> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -26,6 +26,12 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
>  /* Add vPCI handlers to device. */
>  int __must_check vpci_add_handlers(struct pci_dev *dev);
>  
> +/* Notify vPCI that device is assigned to guest. */
> +int __must_check vpci_assign_device(struct domain *d, struct pci_dev *dev);
> +
> +/* Notify vPCI that device is de-assigned from guest. */
> +int __must_check vpci_deassign_device(struct domain *d, struct pci_dev *dev);

Is the expectation that "dev" may get altered? If not, it may want to
become pointer-to-const. (For "d" there might be the need to acquire
locks, so I guess it might not be a good idea to constify that one.)

I also think that one comment ought to be enough for the two functions.

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 3/9] vpci/header: Move register assignments from init_bars
  2021-09-03 10:08 ` [PATCH 3/9] vpci/header: Move register assignments from init_bars Oleksandr Andrushchenko
@ 2021-09-06 13:53   ` Jan Beulich
  2021-09-07 10:04     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-06 13:53 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Roger Pau Monné
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko, xen-devel

On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> This is in preparation for dynamic assignment of the vpci register
> handlers depending on the domain: hwdom or guest.

I guess why exactly this is going to help is going to be seen in
subsequent patches. To aid review (i.e. to not force reviewers to
peek ahead) it would imo be helpful if you outlined how the result
is going to help. After all ...

> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -445,6 +445,55 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>          rom->addr = val & PCI_ROM_ADDRESS_MASK;
>  }
>  
> +static int add_bar_handlers(struct pci_dev *pdev)

... this function name, for example, isn't Dom0-specific, so one
might expect the function body to gain conditionals. Yet then the
question is why these conditionals can't live in the original
function.

> +{
> +    unsigned int i;
> +    struct vpci_header *header = &pdev->vpci->header;
> +    struct vpci_bar *bars = header->bars;
> +    int rc;
> +
> +    /* Setup a handler for the command register. */
> +    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
> +                           2, header);
> +    if ( rc )
> +        return rc;
> +
> +    if ( pdev->ignore_bars )
> +        return 0;
> +
> +    for ( i = 0; i < PCI_HEADER_NORMAL_NR_BARS + 1; i++ )
> +    {
> +        if ( (bars[i].type == VPCI_BAR_IO) || (bars[i].type == VPCI_BAR_EMPTY) )
> +            continue;
> +
> +        if ( bars[i].type == VPCI_BAR_ROM )
> +        {
> +            unsigned int rom_reg;
> +            uint8_t header_type = pci_conf_read8(pdev->sbdf,
> +                                                 PCI_HEADER_TYPE) & 0x7f;
> +            if ( header_type == PCI_HEADER_TYPE_NORMAL )
> +                rom_reg = PCI_ROM_ADDRESS;
> +            else
> +                rom_reg = PCI_ROM_ADDRESS1;
> +            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write,
> +                                   rom_reg, 4, &bars[i]);
> +            if ( rc )
> +                return rc;

I'm not the maintainer of this code, but if I was I'd ask for this and ...

> +        }
> +        else
> +        {
> +            uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> +
> +            /* This is either VPCI_BAR_MEM32 or VPCI_BAR_MEM64_{LO|HI}. */
> +            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
> +                                   4, &bars[i]);
> +            if ( rc )
> +                return rc;

... this to be moved ...

> +        }

... here to reduce redundancy.

> @@ -553,11 +580,13 @@ static int init_bars(struct pci_dev *pdev)
>          rom->addr = addr;
>          header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
>                                PCI_ROM_ADDRESS_ENABLE;
> +    }
>  
> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
> -                               4, rom);
> -        if ( rc )
> -            rom->type = VPCI_BAR_EMPTY;
> +    rc = add_bar_handlers(pdev);
> +    if ( rc )
> +    {
> +        pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> +        return rc;
>      }

Seeing this moved (hence perhaps more a question to Roger than to
you) restoring of the command register - why is it that the error
path(s) here care(s) about restoring this, but ...

>      return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;

... ones in modify_bars() (and downwards) don't? I was wondering
whether the restore could actually be done prior to the two calls
(or, in the original code, the one call), or perhaps even right
after the last call to pci_size_mem_bar(). At the very least the
comment further up suggests memory decode only gets disabled for
sizing BARs, which we're done with at this point.

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 4/9] vpci/header: Add and remove register handlers dynamically
  2021-09-03 10:08 ` [PATCH 4/9] vpci/header: Add and remove register handlers dynamically Oleksandr Andrushchenko
@ 2021-09-06 14:11   ` Jan Beulich
  2021-09-07 10:11     ` Oleksandr Andrushchenko
  2021-09-10 21:14   ` Stefano Stabellini
  1 sibling, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-06 14:11 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko, xen-devel

On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
> @@ -593,6 +625,36 @@ static int init_bars(struct pci_dev *pdev)
>  }
>  REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>  
> +int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
> +{
> +    int rc;
> +
> +    /* Remove previously added registers. */
> +    vpci_remove_device_registers(pdev);
> +
> +    /* It only makes sense to add registers for hwdom or guest domain. */
> +    if ( d->domain_id >= DOMID_FIRST_RESERVED )
> +        return 0;
> +
> +    if ( is_hardware_domain(d) )
> +        rc = add_bar_handlers(pdev, true);
> +    else
> +        rc = add_bar_handlers(pdev, false);

    rc = add_bar_handlers(pdev, is_hardware_domain(d));

> +    if ( rc )
> +        gprintk(XENLOG_ERR,
> +            "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
> +            d->domain_id);

Please use %pd and correct indentation. Logging the error code might
also help some in diagnosing issues. Further I'm not sure this is a
message we want in release builds - perhaps gdprintk()?

> +    return rc;
> +}
> +
> +int vpci_bar_remove_handlers(const struct domain *d, struct pci_dev *pdev)
> +{
> +    /* Remove previously added registers. */
> +    vpci_remove_device_registers(pdev);
> +    return 0;
> +}

Also - to what extent is the goal of your work to also make vPCI work for
x86 DomU-s? If that's not a goal, I'd like to ask that you limit the
introduction of code that ends up dead there.

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 5/9] vpci/header: Implement guest BAR register handlers
  2021-09-03 10:08 ` [PATCH 5/9] vpci/header: Implement guest BAR register handlers Oleksandr Andrushchenko
@ 2021-09-06 14:31   ` Jan Beulich
  2021-09-07 13:33     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-06 14:31 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko, xen-devel

On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -400,12 +400,72 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>  static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>                              uint32_t val, void *data)
>  {
> +    struct vpci_bar *bar = data;
> +    bool hi = false;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +    else
> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);

What you store here is not the address that's going to be used, as
you don't mask off the low bits (to account for the BAR's size).
When a BAR gets written with all ones, all writable bits get these
ones stored. The address of the BAR, aiui, really changes to
(typically) close below 4GiB (in the case of a 32-bit BAR), which
is why memory / I/O decoding should be off while sizing BARs.
Therefore you shouldn't look for the specific "all writable bits
are ones" pattern (or worse, as you presently do, the "all bits
outside of the type specifier are ones" one) on the read path.
Instead mask the value appropriately here, and simply return back
the stored value from the read path.
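
A minimal sketch of masking at write time, assuming a 64-bit memory BAR
whose size is a power of two. struct emu_bar, the two helpers and the
local flag macros are hypothetical stand-ins, not the vPCI code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the guest-visible part of struct vpci_bar. */
struct emu_bar {
    uint64_t guest_addr;   /* address bits as stored after masking */
    uint64_t size;         /* BAR size in bytes, a power of two */
    bool prefetchable;
};

/* Local copies of the low flag bits of a memory BAR. */
#define EMU_MEM_TYPE_64   0x04u
#define EMU_MEM_PREFETCH  0x08u

/* Write one 32-bit half of the BAR. */
static void emu_bar_write(struct emu_bar *bar, bool hi, uint32_t val)
{
    unsigned int shift = hi ? 32 : 0;
    uint64_t addr = bar->guest_addr;

    addr &= ~(0xffffffffull << shift);
    addr |= (uint64_t)val << shift;
    /* Bits below the BAR size are not writable: drop them here. */
    addr &= ~(bar->size - 1);
    bar->guest_addr = addr;
}

/* Read one 32-bit half: return what was stored, plus the r/o flag bits. */
static uint32_t emu_bar_read(const struct emu_bar *bar, bool hi)
{
    uint32_t val;

    if ( hi )
        return (uint32_t)(bar->guest_addr >> 32);

    val = (uint32_t)bar->guest_addr;
    val |= EMU_MEM_TYPE_64;
    if ( bar->prefetchable )
        val |= EMU_MEM_PREFETCH;

    return val;
}
```

With this shape the sizing probe needs no special case on the read path:
writing all ones to a 1MiB BAR simply reads back with the low 20 address
bits clear.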

>  }
>  
>  static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>                                 void *data)
>  {
> -    return 0xffffffff;
> +    struct vpci_bar *bar = data;
> +    uint32_t val;
> +    bool hi = false;
> +
> +    switch ( bar->type )
> +    {
> +    case VPCI_BAR_MEM64_HI:
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +        /* fallthrough */
> +    case VPCI_BAR_MEM64_LO:
> +    {

Please don't add braces to case blocks when they're not needed.

> +        if ( hi )
> +            val = bar->guest_addr >> 32;
> +        else
> +            val = bar->guest_addr & 0xffffffff;
> +        if ( (val & PCI_BASE_ADDRESS_MEM_MASK_32) ==  PCI_BASE_ADDRESS_MEM_MASK_32 )

This is wrong when falling through to here from VPCI_BAR_MEM64_HI:
All 32 bits need to be looked at. Yet as per the comment further
up I think it isn't right anyway to apply the mask here.

Also: Stray double blanks.

> +        {
> +            /* Guests detects BAR's properties and sizes. */
> +            if ( hi )
> +                val = bar->size >> 32;
> +            else
> +                val = 0xffffffff & ~(bar->size - 1);
> +        }
> +        if ( !hi )
> +        {
> +            val |= PCI_BASE_ADDRESS_MEM_TYPE_64;
> +            val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> +        }
> +        bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
> +        bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
> +        break;
> +    }
> +    case VPCI_BAR_MEM32:

Please separate non-fall-through case blocks by a blank line.

> @@ -522,6 +582,13 @@ static int add_bar_handlers(struct pci_dev *pdev, bool is_hwdom)
>              if ( rc )
>                  return rc;
>          }
> +        /*
> +         * It is neither safe nor secure to initialize guest's view of the BARs
> +         * with real values which are used by the hardware domain, so assign
> +         * all zeros to guest's view of the BARs, so the guest can perform
> +         * proper PCI device enumeration and assign BARs on its own.
> +         */
> +        bars[i].guest_addr = 0;

I'm afraid I don't understand the comment: Without memory decoding
enabled, the BARs are simple registers (with a few r/o bits).

> --- a/xen/include/xen/pci_regs.h
> +++ b/xen/include/xen/pci_regs.h
> @@ -103,6 +103,7 @@
>  #define  PCI_BASE_ADDRESS_MEM_TYPE_64	0x04	/* 64 bit address */
>  #define  PCI_BASE_ADDRESS_MEM_PREFETCH	0x08	/* prefetchable? */
>  #define  PCI_BASE_ADDRESS_MEM_MASK	(~0x0fUL)
> +#define  PCI_BASE_ADDRESS_MEM_MASK_32	(~0x0fU)

Please don't introduce an identical constant that's merely of
different type. (uint32_t)PCI_BASE_ADDRESS_MEM_MASK at the use
site (if actually still needed as per the comment above) would
seem more clear to me.
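
For illustration only, and assuming the check is kept at all, both points
(all 32 bits compared for the high half, cast at the use site for the low
half) could be sketched as below. is_sizing_value() is a made-up name, and
the mask is repeated locally only so that the fragment builds on its own:

```c
#include <stdbool.h>
#include <stdint.h>

/* Same definition as in xen/include/xen/pci_regs.h, repeated for the sketch. */
#define PCI_BASE_ADDRESS_MEM_MASK (~0x0fUL)

/*
 * Hypothetical helper: does "val" look like the all-ones pattern a guest
 * writes when sizing a memory BAR?  The low dword ignores the four
 * attribute bits; the high dword of a 64-bit BAR has no attribute bits,
 * so every bit has to be set.
 */
static bool is_sizing_value(uint32_t val, bool hi)
{
    uint32_t mask = hi ? UINT32_MAX : (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;

    return (val & mask) == mask;
}
```

With this shape no second, differently typed constant is needed.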

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 6/9] vpci/header: Handle p2m range sets per BAR
  2021-09-03 10:08 ` [PATCH 6/9] vpci/header: Handle p2m range sets per BAR Oleksandr Andrushchenko
@ 2021-09-06 14:47   ` Jan Beulich
  2021-09-08 14:31     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-06 14:47 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko, xen-devel

On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Instead of handling a single range set, that contains all the memory
> regions of all the BARs and ROM, have them per BAR.

Without looking at how you carry out this change - this looks wrong (as
in: wasteful) to me. Despite ...

> This is in preparation of making non-identity mappings in p2m for the
> MMIOs/ROM.

... the need for this, every individual BAR is still contiguous in both
host and guest address spaces, so can be represented as a single
(start,end) tuple (or a pair thereof, to account for both host and guest
values). No need to use a rangeset for this.
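
To make the suggestion concrete, here is a rough sketch of such a per-BAR
record and of the host-to-guest translation it allows. struct bar_range and
bar_host_to_guest() are hypothetical names used only for this example, and
plain uint64_t frame numbers stand in for mfn_t/gfn_t:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-BAR record: one contiguous range, host and guest view. */
struct bar_range {
    uint64_t host_start;  /* first host frame of the BAR */
    uint64_t guest_start; /* first guest frame of the BAR */
    uint64_t nr_frames;   /* length of the BAR in frames */
};

/*
 * Translate a host frame inside the BAR into the matching guest frame.
 * Returns false if the frame lies outside the BAR.
 */
static bool bar_host_to_guest(const struct bar_range *r, uint64_t mfn,
                              uint64_t *gfn)
{
    if ( mfn < r->host_start || mfn - r->host_start >= r->nr_frames )
        return false;

    *gfn = r->guest_start + (mfn - r->host_start);

    return true;
}
```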

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 7/9] vpci/header: program p2m with guest BAR view
  2021-09-03 10:08 ` [PATCH 7/9] vpci/header: program p2m with guest BAR view Oleksandr Andrushchenko
@ 2021-09-06 14:51   ` Jan Beulich
  2021-09-09  6:13     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-06 14:51 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko, xen-devel

On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
> @@ -37,12 +41,28 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>                       unsigned long *c)
>  {
>      const struct map_data *map = data;
> +    gfn_t start_gfn;
>      int rc;
>  
>      for ( ; ; )
>      {
>          unsigned long size = e - s + 1;
>  
> +        /*
> +         * Any BAR may have holes in its memory we want to map, e.g.
> +         * we don't want to map MSI regions which may be a part of that BAR,
> +         * e.g. when a single BAR is used for both MMIO and MSI.
> +         * In this case MSI regions are subtracted from the mapping, but
> +         * map->start_gfn still points to the very beginning of the BAR.
> +         * So if there is a hole present then we need to adjust start_gfn
> +         * to reflect the fact of that substraction.
> +         */
> +        start_gfn = gfn_add(map->start_gfn, s - mfn_x(map->start_mfn));

I may be missing something, but don't you need to adjust "size" then
as well? And don't you need to account for the "hole" not being at
the start? (As an aside - do you mean "MSI-X regions" everywhere you
say just "MSI" above?)
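
To have the arithmetic in front of us, here is a small stand-alone sketch
of translating one sub-range [s, e] of host frames into guest frames when
a hole has been cut out of the BAR. The names are invented for the
example, and it assumes the range handed to the callback never crosses
the hole:

```c
#include <stdint.h>

/* Hypothetical result type: where a chunk lands in the guest and how long it is. */
struct guest_chunk {
    uint64_t gfn; /* first guest frame of the chunk */
    uint64_t nr;  /* number of frames in the chunk */
};

/*
 * bar_mfn/bar_gfn are the first host/guest frame of the whole BAR,
 * [s, e] is an inclusive range of host frames that is still to be mapped.
 */
static struct guest_chunk translate_chunk(uint64_t bar_mfn, uint64_t bar_gfn,
                                          uint64_t s, uint64_t e)
{
    struct guest_chunk c = {
        .gfn = bar_gfn + (s - bar_mfn), /* same offset into the BAR */
        .nr  = e - s + 1,               /* length of this chunk only */
    };

    return c;
}
```

For an 8-frame BAR at host frame 0x100 / guest frame 0x500 with a hole at
frame 0x103, the chunks [0x100, 0x102] and [0x104, 0x107] each keep their
offset into the BAR.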

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-03 10:08 ` [PATCH 8/9] vpci/header: Reset the command register when adding devices Oleksandr Andrushchenko
@ 2021-09-06 14:55   ` Jan Beulich
  2021-09-07  7:43     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-06 14:55 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, bertrand.marquis, rahul.singh,
	Oleksandr Andrushchenko, xen-devel

On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -811,6 +811,16 @@ int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
>          gprintk(XENLOG_ERR,
>              "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
>              d->domain_id);
> +
> +    /*
> +     * Reset the command register: it is possible that when passing
> +     * through a PCI device its memory decoding bits in the command
> +     * register are already set. Thus, a guest OS may not write to the
> +     * command register to update memory decoding, so guest mappings
> +     * (guest's view of the BARs) are left not updated.
> +     */
> +    pci_conf_write16(pdev->sbdf, PCI_COMMAND, 0);

Can you really blindly write 0 here? What about bits that have to be
under host control, e.g. INTX_DISABLE? I can see that you may want to
hand off with I/O and memory decoding off and bus mastering disabled,
but for every other bit (including reserved ones) I'd expect separate
justification (in the commit message).

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 9/9] vpci/header: Use pdev's domain instead of vCPU
  2021-09-03 10:08 ` [PATCH 9/9] vpci/header: Use pdev's domain instead of vCPU Oleksandr Andrushchenko
@ 2021-09-06 14:57   ` Jan Beulich
  2021-09-09  4:23     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-06 14:57 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, rahul.singh
  Cc: julien, sstabellini, oleksandr_tyshchenko, volodymyr_babchuk,
	Artem_Mygaiev, roger.pau, bertrand.marquis,
	Oleksandr Andrushchenko, xen-devel

On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
> From: Rahul Singh <rahul.singh@arm.com>
> 
> Fixes: 9c244fdef7e7 ("vpci: add header handlers")

In which way is that original change broken? The title doesn't
clarify this, and the description is empty ...

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-06 14:55   ` Jan Beulich
@ 2021-09-07  7:43     ` Oleksandr Andrushchenko
  2021-09-07  8:00       ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-07  7:43 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 06.09.21 17:55, Jan Beulich wrote:
> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -811,6 +811,16 @@ int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
>>           gprintk(XENLOG_ERR,
>>               "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
>>               d->domain_id);
>> +
>> +    /*
>> +     * Reset the command register: it is possible that when passing
>> +     * through a PCI device its memory decoding bits in the command
>> +     * register are already set. Thus, a guest OS may not write to the
>> +     * command register to update memory decoding, so guest mappings
>> +     * (guest's view of the BARs) are left not updated.
>> +     */
>> +    pci_conf_write16(pdev->sbdf, PCI_COMMAND, 0);
> Can you really blindly write 0 here? What about bits that have to be
> under host control, e.g. INTX_DISABLE? I can see that you may want to
> hand off with I/O and memory decoding off and bus mastering disabled,
> but for every other bit (including reserved ones) I'd expect separate
> justification (in the commit message).
According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0" I have at hand,
section "6.2.2 Device Control" says that the reset state of the command
register is typically 0, so this is why I chose to write 0 here, i.e.
make the command register look as it does after reset.

With respect to host control: we currently do not really emulate command
register which probably was ok for x86 PVH Dom0 and this might not be the
case now as we add DomU's. That being said: in my implementation guest can
alter command register as it wants without restrictions.
If we see it does need proper emulation then we would need adding that as
well (is not part of this series though).

Meanwhile, I agree that we can only reset IO space, memory space and bus
master bits and leave the rest untouched. But again, without proper command
register emulation guests can still set what they want.
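
A minimal sketch of that narrower reset, for discussion only: the helper
name is made up, the bit values are the ones from pci_regs.h, and the
result would be what gets passed to pci_conf_write16() instead of 0.

```c
#include <stdint.h>

/* Values as in xen/include/xen/pci_regs.h, repeated for the sketch. */
#define PCI_COMMAND_IO     0x1 /* Enable response in I/O space */
#define PCI_COMMAND_MEMORY 0x2 /* Enable response in Memory space */
#define PCI_COMMAND_MASTER 0x4 /* Enable bus mastering */

/*
 * Hypothetical helper: value to write on hand-off.  Decoding and bus
 * mastering are switched off; every other bit keeps its current value.
 */
static uint16_t command_for_handoff(uint16_t cur)
{
    return cur & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER);
}
```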


Please let me know your opinion on how we can proceed.

>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-07  7:43     ` Oleksandr Andrushchenko
@ 2021-09-07  8:00       ` Jan Beulich
  2021-09-07  8:18         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-07  8:00 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 07.09.2021 09:43, Oleksandr Andrushchenko wrote:
> 
> On 06.09.21 17:55, Jan Beulich wrote:
>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>> --- a/xen/drivers/vpci/header.c
>>> +++ b/xen/drivers/vpci/header.c
>>> @@ -811,6 +811,16 @@ int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
>>>           gprintk(XENLOG_ERR,
>>>               "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
>>>               d->domain_id);
>>> +
>>> +    /*
>>> +     * Reset the command register: it is possible that when passing
>>> +     * through a PCI device its memory decoding bits in the command
>>> +     * register are already set. Thus, a guest OS may not write to the
>>> +     * command register to update memory decoding, so guest mappings
>>> +     * (guest's view of the BARs) are left not updated.
>>> +     */
>>> +    pci_conf_write16(pdev->sbdf, PCI_COMMAND, 0);
>> Can you really blindly write 0 here? What about bits that have to be
>> under host control, e.g. INTX_DISABLE? I can see that you may want to
>> hand off with I/O and memory decoding off and bus mastering disabled,
>> but for every other bit (including reserved ones) I'd expect separate
>> justification (in the commit message).
> According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0" I have at hand,
> section "6.2.2 Device Control" says that the reset state of the command
> register is typically 0, so this is why I chose to write 0 here, e.g.
> make the command register as if it is after the reset.
> 
> With respect to host control: we currently do not really emulate command
> register which probably was ok for x86 PVH Dom0 and this might not be the
> case now as we add DomU's. That being said: in my implementation guest can
> alter command register as it wants without restrictions.
> If we see it does need proper emulation then we would need adding that as
> well (is not part of this series though).
> 
> Meanwhile, I agree that we can only reset IO space, memory space and bus
> master bits and leave the rest untouched. But again, without proper command
> register emulation guests can still set what they want.

Yes, writes to the register will need emulating for DomU. Reporting the
emulated register as zero initially is probably also quite fine (to
match, as you say, mandated reset state).

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-07  8:00       ` Jan Beulich
@ 2021-09-07  8:18         ` Oleksandr Andrushchenko
  2021-09-07  8:49           ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-07  8:18 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 07.09.21 11:00, Jan Beulich wrote:
> On 07.09.2021 09:43, Oleksandr Andrushchenko wrote:
>> On 06.09.21 17:55, Jan Beulich wrote:
>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>> --- a/xen/drivers/vpci/header.c
>>>> +++ b/xen/drivers/vpci/header.c
>>>> @@ -811,6 +811,16 @@ int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
>>>>            gprintk(XENLOG_ERR,
>>>>                "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
>>>>                d->domain_id);
>>>> +
>>>> +    /*
>>>> +     * Reset the command register: it is possible that when passing
>>>> +     * through a PCI device its memory decoding bits in the command
>>>> +     * register are already set. Thus, a guest OS may not write to the
>>>> +     * command register to update memory decoding, so guest mappings
>>>> +     * (guest's view of the BARs) are left not updated.
>>>> +     */
>>>> +    pci_conf_write16(pdev->sbdf, PCI_COMMAND, 0);
>>> Can you really blindly write 0 here? What about bits that have to be
>>> under host control, e.g. INTX_DISABLE? I can see that you may want to
>>> hand off with I/O and memory decoding off and bus mastering disabled,
>>> but for every other bit (including reserved ones) I'd expect separate
>>> justification (in the commit message).
>> According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0" I have at hand,
>> section "6.2.2 Device Control" says that the reset state of the command
>> register is typically 0, so this is why I chose to write 0 here, e.g.
>> make the command register as if it is after the reset.
>>
>> With respect to host control: we currently do not really emulate command
>> register which probably was ok for x86 PVH Dom0 and this might not be the
>> case now as we add DomU's. That being said: in my implementation guest can
>> alter command register as it wants without restrictions.
>> If we see it does need proper emulation then we would need adding that as
>> well (is not part of this series though).
>>
>> Meanwhile, I agree that we can only reset IO space, memory space and bus
>> master bits and leave the rest untouched. But again, without proper command
>> register emulation guests can still set what they want.
> Yes, writes to the register will need emulating for DomU.

But then I am wondering to what extent we need to emulate the command
register? We have the following bits in the command register:

#define  PCI_COMMAND_IO        0x1    /* Enable response in I/O space */
#define  PCI_COMMAND_MEMORY    0x2    /* Enable response in Memory space */
#define  PCI_COMMAND_MASTER    0x4    /* Enable bus mastering */
#define  PCI_COMMAND_SPECIAL    0x8    /* Enable response to special cycles */
#define  PCI_COMMAND_INVALIDATE    0x10    /* Use memory write and invalidate */
#define  PCI_COMMAND_VGA_PALETTE 0x20    /* Enable palette snooping */
#define  PCI_COMMAND_PARITY    0x40    /* Enable parity checking */
#define  PCI_COMMAND_WAIT     0x80    /* Enable address/data stepping */
#define  PCI_COMMAND_SERR    0x100    /* Enable SERR */
#define  PCI_COMMAND_FAST_BACK    0x200    /* Enable back-to-back writes */
#define  PCI_COMMAND_INTX_DISABLE 0x400 /* INTx Emulation Disable */

We want the guest to access directly at least the I/O and memory decoding and bus mastering
bits, but how do we emulate the rest? Do you mean we can match the rest to what the host
uses for the device, like the PCI_COMMAND_INTX_DISABLE bit? If so, as per my understanding,
those bits get set/cleared when a device is enabled, e.g. by the Linux kernel or a device driver.

So, if we have a hidden PCI device which can be assigned to a guest and it is literally untouched
(not enabled in Dom0), then I think there will be no such reference as "host assigned values", as
most probably the command register will remain in its after-reset state.

Thus, I am not quite sure the command register can easily be emulated.

Please correct me if my understanding is wrong here.

>   Reporting the
> emulated register as zero initially is probably also quite fine (to
> match, as you say, mandated reset state).
>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/9] vpci: Add hooks for PCI device assign/de-assign
  2021-09-06 13:23   ` Jan Beulich
@ 2021-09-07  8:33     ` Oleksandr Andrushchenko
  2021-09-07  8:44       ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-07  8:33 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

Hello, Jan!

On 06.09.21 16:23, Jan Beulich wrote:
> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -864,6 +864,10 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>>       if ( ret )
>>           goto out;
>>   
>> +    ret = vpci_deassign_device(d, pdev);
>> +    if ( ret )
>> +        goto out;
>> +
>>       if ( pdev->domain == hardware_domain  )
>>           pdev->quarantine = false;
>>   
>> @@ -1425,6 +1429,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>           rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag);
>>       }
>>   
>> +    if ( rc )
>> +        goto done;
>> +
>> +    rc = vpci_assign_device(d, pdev);
>> +
>>    done:
>>       if ( rc )
>>           printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
> I have to admit that I'm worried by the further lack of unwinding in
> case of error: We're not really good at this, I agree, but it would
> be quite nice if the problem didn't get worse. At the very least if
> the device was de-assigned from Dom0 and assignment to a DomU failed,
> imo you will want to restore Dom0's settings.

In the current design the error path is handled by the toolstack
via XEN_DOMCTL_assign_device/XEN_DOMCTL_deassign_device,
so this is why it is "ok" to have the code structured in the
assign_device as it is, e.g. roll back will be handled on deassign_device.
So, it is up to the toolstack to re-assign the device to Dom0 or DomIO(?)
in case of error and we do rely on the toolstack in Xen.

>
> Also in the latter case don't you need to additionally call
> vpci_deassign_device() for the prior owner of the device?

Even if we wanted to help the toolstack with the roll-back in case of an error
this would IMO make things even worse, e.g. we will de-assign for vPCI, but will
leave the IOMMU path untouched which would result in some mess.
So, my only guess here is that we should rely on the toolstack completely as
it was before PCI passthrough on Arm.

>
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -86,6 +86,27 @@ int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
>>   
>>       return rc;
>>   }
>> +
>> +/* Notify vPCI that device is assigned to guest. */
>> +int vpci_assign_device(struct domain *d, struct pci_dev *dev)
>> +{
>> +    /* It only makes sense to assign for hwdom or guest domain. */
>> +    if ( !has_vpci(d) || (d->domain_id >= DOMID_FIRST_RESERVED) )
> Please don't open-code is_system_domain(). I also think you want to
> flip the two sides of the ||, to avoid evaluating whatever has_vpci()
> expands to for system domains. (Both again below.)
Good catch, I missed is_system_domain completely, thank you!
>
>> --- a/xen/include/xen/vpci.h
>> +++ b/xen/include/xen/vpci.h
>> @@ -26,6 +26,12 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
>>   /* Add vPCI handlers to device. */
>>   int __must_check vpci_add_handlers(struct pci_dev *dev);
>>   
>> +/* Notify vPCI that device is assigned to guest. */
>> +int __must_check vpci_assign_device(struct domain *d, struct pci_dev *dev);
>> +
>> +/* Notify vPCI that device is de-assigned from guest. */
>> +int __must_check vpci_deassign_device(struct domain *d, struct pci_dev *dev);
> Is the expectation that "dev" may get altered? If not, it may want to
> become pointer-to-const. (For "d" there might be the need to acquire
> locks, so I guess it might not be a good idea to constify that one.)
Just checked that and it is indeed possible to constify. Will do
>
> I also think that one comment ought to be enough for the two functions.
Sure
>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/9] vpci: Add hooks for PCI device assign/de-assign
  2021-09-07  8:33     ` Oleksandr Andrushchenko
@ 2021-09-07  8:44       ` Jan Beulich
  0 siblings, 0 replies; 75+ messages in thread
From: Jan Beulich @ 2021-09-07  8:44 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 07.09.2021 10:33, Oleksandr Andrushchenko wrote:
> On 06.09.21 16:23, Jan Beulich wrote:
>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>> --- a/xen/drivers/passthrough/pci.c
>>> +++ b/xen/drivers/passthrough/pci.c
>>> @@ -864,6 +864,10 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
>>>       if ( ret )
>>>           goto out;
>>>   
>>> +    ret = vpci_deassign_device(d, pdev);
>>> +    if ( ret )
>>> +        goto out;
>>> +
>>>       if ( pdev->domain == hardware_domain  )
>>>           pdev->quarantine = false;
>>>   
>>> @@ -1425,6 +1429,11 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>           rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag);
>>>       }
>>>   
>>> +    if ( rc )
>>> +        goto done;
>>> +
>>> +    rc = vpci_assign_device(d, pdev);
>>> +
>>>    done:
>>>       if ( rc )
>>>           printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>> I have to admit that I'm worried by the further lack of unwinding in
>> case of error: We're not really good at this, I agree, but it would
>> be quite nice if the problem didn't get worse. At the very least if
>> the device was de-assigned from Dom0 and assignment to a DomU failed,
>> imo you will want to restore Dom0's settings.
> 
> In the current design the error path is handled by the toolstack
> via XEN_DOMCTL_assign_device/XEN_DOMCTL_deassign_device,
> so this is why it is "ok" to have the code structured in the
> assign_device as it is, e.g. roll back will be handled on deassign_device.
> So, it is up to the toolstack to re-assign the device to Dom0 or DomIO(?)
> in case of error and we do rely on the toolstack in Xen.
> 
>>
>> Also in the latter case don't you need to additionally call
>> vpci_deassign_device() for the prior owner of the device?
> 
> Even if we wanted to help the toolstack with the roll-back in case of an error
> this would IMO make things even worse, e.g. we will de-assign for vPCI, but will
> leave IOMMU path untouched which would result in some mess.
> So, my only guess here is that we should rely on the toolstack completely as
> it was before PCI passthrough on Arm.

Well, okay, but please make this explicit in the description then.

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-07  8:18         ` Oleksandr Andrushchenko
@ 2021-09-07  8:49           ` Jan Beulich
  2021-09-07  9:07             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-07  8:49 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 07.09.2021 10:18, Oleksandr Andrushchenko wrote:
> 
> On 07.09.21 11:00, Jan Beulich wrote:
>> On 07.09.2021 09:43, Oleksandr Andrushchenko wrote:
>>> On 06.09.21 17:55, Jan Beulich wrote:
>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>> --- a/xen/drivers/vpci/header.c
>>>>> +++ b/xen/drivers/vpci/header.c
>>>>> @@ -811,6 +811,16 @@ int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
>>>>>            gprintk(XENLOG_ERR,
>>>>>                "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
>>>>>                d->domain_id);
>>>>> +
>>>>> +    /*
>>>>> +     * Reset the command register: it is possible that when passing
>>>>> +     * through a PCI device its memory decoding bits in the command
>>>>> +     * register are already set. Thus, a guest OS may not write to the
>>>>> +     * command register to update memory decoding, so guest mappings
>>>>> +     * (guest's view of the BARs) are left not updated.
>>>>> +     */
>>>>> +    pci_conf_write16(pdev->sbdf, PCI_COMMAND, 0);
>>>> Can you really blindly write 0 here? What about bits that have to be
>>>> under host control, e.g. INTX_DISABLE? I can see that you may want to
>>>> hand off with I/O and memory decoding off and bus mastering disabled,
>>>> but for every other bit (including reserved ones) I'd expect separate
>>>> justification (in the commit message).
>>> According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0" I have at hand,
>>> section "6.2.2 Device Control" says that the reset state of the command
>>> register is typically 0, so this is why I chose to write 0 here, e.g.
>>> make the command register as if it is after the reset.
>>>
>>> With respect to host control: we currently do not really emulate command
>>> register which probably was ok for x86 PVH Dom0 and this might not be the
>>> case now as we add DomU's. That being said: in my implementation guest can
>>> alter command register as it wants without restrictions.
>>> If we see it does need proper emulation then we would need adding that as
>>> well (is not part of this series though).
>>>
>>> Meanwhile, I agree that we can only reset IO space, memory space and bus
>>> master bits and leave the rest untouched. But again, without proper command
>>> register emulation guests can still set what they want.
>> Yes, writes to the register will need emulating for DomU.
> 
> But then I am wondering to what extent we need to emulate the command
> 
> register? We have the following bits in the command register:
> 
> #define  PCI_COMMAND_IO        0x1    /* Enable response in I/O space */
> #define  PCI_COMMAND_MEMORY    0x2    /* Enable response in Memory space */
> #define  PCI_COMMAND_MASTER    0x4    /* Enable bus mastering */
> #define  PCI_COMMAND_SPECIAL    0x8    /* Enable response to special cycles */
> #define  PCI_COMMAND_INVALIDATE    0x10    /* Use memory write and invalidate */
> #define  PCI_COMMAND_VGA_PALETTE 0x20    /* Enable palette snooping */
> #define  PCI_COMMAND_PARITY    0x40    /* Enable parity checking */
> #define  PCI_COMMAND_WAIT     0x80    /* Enable address/data stepping */
> #define  PCI_COMMAND_SERR    0x100    /* Enable SERR */
> #define  PCI_COMMAND_FAST_BACK    0x200    /* Enable back-to-back writes */
> #define  PCI_COMMAND_INTX_DISABLE 0x400 /* INTx Emulation Disable */
> 
> We want the guest to access directly at least I/O and memory decoding and bus mastering
> bits, but how do we emulate the rest? Do you mean we can match the rest to what host
> uses for the device, like PCI_COMMAND_INTX_DISABLE bit? If so, as per my understanding,
> those bits get set/cleared when a device is enabled, e.g. by Linux kernel/device driver for example.

I would suggest to take qemu's emulation as a starting point.

> So, if we have a hidden PCI device which can be assigned to a guest and it is literally untouched
> (not enabled in Dom0) then I think there will be no such reference as "host assigned values" as
> most probably the command register will remain in its after reset state.

What meaning of "hidden" do you imply here? Devices passed to
pci_{hide,ro}_device() may not be assigned to guests ...

For any other meaning of "hidden", even if the device is completely
ignored by Dom0, certain of the properties still cannot be allowed
to be DomU-controlled. (I'm therefore not sure in how far Dom0 can
actually legitimately "ignore" devices. It may decide to not enable
them, but that's not "ignoring".)
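
As a strawman of what write emulation could boil down to, here is a sketch
that merges a guest write with the host-owned bits. Which bits go into
GUEST_CMD_MASK is purely an assumption made for this example, not a
proposal for the final policy, and emulate_cmd_write() is a made-up name:

```c
#include <stdint.h>

/* Values as in xen/include/xen/pci_regs.h, repeated for the sketch. */
#define PCI_COMMAND_IO           0x1
#define PCI_COMMAND_MEMORY       0x2
#define PCI_COMMAND_MASTER       0x4
#define PCI_COMMAND_INTX_DISABLE 0x400

/* Assumption for this sketch only: bits a DomU may control directly. */
#define GUEST_CMD_MASK \
    (PCI_COMMAND_IO | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER)

/*
 * Hypothetical write handler core: guest-controllable bits come from the
 * guest's value, everything else (e.g. INTX_DISABLE) stays as the host
 * has it.
 */
static uint16_t emulate_cmd_write(uint16_t host_cmd, uint16_t guest_val)
{
    return (host_cmd & ~GUEST_CMD_MASK) | (guest_val & GUEST_CMD_MASK);
}
```

A guest attempt to flip INTX_DISABLE is silently dropped in this scheme,
while enabling memory decoding and bus mastering goes through.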

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-07  8:49           ` Jan Beulich
@ 2021-09-07  9:07             ` Oleksandr Andrushchenko
  2021-09-07  9:19               ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-07  9:07 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 07.09.21 11:49, Jan Beulich wrote:
> On 07.09.2021 10:18, Oleksandr Andrushchenko wrote:
>> On 07.09.21 11:00, Jan Beulich wrote:
>>> On 07.09.2021 09:43, Oleksandr Andrushchenko wrote:
>>>> On 06.09.21 17:55, Jan Beulich wrote:
>>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>> @@ -811,6 +811,16 @@ int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
>>>>>>             gprintk(XENLOG_ERR,
>>>>>>                 "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
>>>>>>                 d->domain_id);
>>>>>> +
>>>>>> +    /*
>>>>>> +     * Reset the command register: it is possible that when passing
>>>>>> +     * through a PCI device its memory decoding bits in the command
>>>>>> +     * register are already set. Thus, a guest OS may not write to the
>>>>>> +     * command register to update memory decoding, so guest mappings
>>>>>> +     * (guest's view of the BARs) are left not updated.
>>>>>> +     */
>>>>>> +    pci_conf_write16(pdev->sbdf, PCI_COMMAND, 0);
>>>>> Can you really blindly write 0 here? What about bits that have to be
>>>>> under host control, e.g. INTX_DISABLE? I can see that you may want to
>>>>> hand off with I/O and memory decoding off and bus mastering disabled,
>>>>> but for every other bit (including reserved ones) I'd expect separate
>>>>> justification (in the commit message).
>>>> According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0" I have at hand,
>>>> section "6.2.2 Device Control" says that the reset state of the command
>>>> register is typically 0, so this is why I chose to write 0 here, e.g.
>>>> make the command register as if it is after the reset.
>>>>
>>>> With respect to host control: we currently do not really emulate command
>>>> register which probably was ok for x86 PVH Dom0 and this might not be the
>>>> case now as we add DomU's. That being said: in my implementation guest can
>>>> alter command register as it wants without restrictions.
>>>> If we see it does need proper emulation then we would need adding that as
>>>> well (is not part of this series though).
>>>>
>>>> Meanwhile, I agree that we can only reset IO space, memory space and bus
>>>> master bits and leave the rest untouched. But again, without proper command
>>>> register emulation guests can still set what they want.
>>> Yes, writes to the register will need emulating for DomU.
>> But then I am wondering to what extent we need to emulate the command
>> register? We have the following bits in the command register:
>>
>> #define  PCI_COMMAND_IO        0x1    /* Enable response in I/O space */
>> #define  PCI_COMMAND_MEMORY    0x2    /* Enable response in Memory space */
>> #define  PCI_COMMAND_MASTER    0x4    /* Enable bus mastering */
>> #define  PCI_COMMAND_SPECIAL    0x8    /* Enable response to special cycles */
>> #define  PCI_COMMAND_INVALIDATE    0x10    /* Use memory write and invalidate */
>> #define  PCI_COMMAND_VGA_PALETTE 0x20    /* Enable palette snooping */
>> #define  PCI_COMMAND_PARITY    0x40    /* Enable parity checking */
>> #define  PCI_COMMAND_WAIT     0x80    /* Enable address/data stepping */
>> #define  PCI_COMMAND_SERR    0x100    /* Enable SERR */
>> #define  PCI_COMMAND_FAST_BACK    0x200    /* Enable back-to-back writes */
>> #define  PCI_COMMAND_INTX_DISABLE 0x400 /* INTx Emulation Disable */
>>
>> We want the guest to access directly at least I/O and memory decoding and bus mastering
>> bits, but how do we emulate the rest? Do you mean we can match the rest to what host
>> uses for the device, like PCI_COMMAND_INTX_DISABLE bit? If so, as per my understanding,
>> those bits get set/cleared when a device is enabled, e.g. by Linux kernel/device driver for example.
> I would suggest to take qemu's emulation as a starting point.

Sure, I'll take a look at what QEMU does. But I guess that emulation may depend
on host bridge emulation etc. which may not be applicable for our case without
serious complications.

>
>> So, if we have a hidden PCI device which can be assigned to a guest and it is literally untouched
>> (not enabled in Dom0) then I think there will be no such reference as "host assigned values" as
>> most probably the command register will remain in its after reset state.
> What meaning of "hidden" do you imply here? Devices passed to
> pci_{hide,ro}_device() may not be assigned to guests ...
You are completely right here.
>
> For any other meaning of "hidden", even if the device is completely
> ignored by Dom0,

Dom0less is such a case when a device is assigned to the guest
without Dom0 at all?

>   certain of the properties still cannot be allowed
> to be DomU-controlled.

The list is not that big, could you please name a few you think cannot
be controlled by a guest? I can think of PCI_COMMAND_SPECIAL(?),
PCI_COMMAND_INVALIDATE(?), PCI_COMMAND_PARITY, PCI_COMMAND_WAIT,
PCI_COMMAND_SERR, PCI_COMMAND_INTX_DISABLE which we may want to
be aligned with the "host reference" values, e.g. we only allow those bits
to be set as they are in Dom0.

>   (I'm therefore not sure in how far Dom0 can
> actually legitimately "ignore" devices. It may decide to not enable
> them, but that's not "ignoring".)
>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-07  9:07             ` Oleksandr Andrushchenko
@ 2021-09-07  9:19               ` Jan Beulich
  2021-09-07  9:52                 ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-07  9:19 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 07.09.2021 11:07, Oleksandr Andrushchenko wrote:
> On 07.09.21 11:49, Jan Beulich wrote:
>> On 07.09.2021 10:18, Oleksandr Andrushchenko wrote:
>>> So, if we have a hidden PCI device which can be assigned to a guest and it is literally untouched
>>> (not enabled in Dom0) then I think there will be no such reference as "host assigned values" as
>>> most probably the command register will remain in its after reset state.
>> What meaning of "hidden" do you imply here? Devices passed to
>> pci_{hide,ro}_device() may not be assigned to guests ...
> You are completely right here.
>>
>> For any other meaning of "hidden", even if the device is completely
>> ignored by Dom0,
> 
> Dom0less is such a case when a device is assigned to the guest
> without Dom0 at all?

In this case it is entirely unclear to me what entity it is to have
a global view on the PCI subsystem.

>>   certain of the properties still cannot be allowed
>> to be DomU-controlled.
> 
> The list is not that big, could you please name a few you think cannot
> be controlled by a guest? I can think of PCI_COMMAND_SPECIAL(?),
> PCI_COMMAND_INVALIDATE(?), PCI_COMMAND_PARITY, PCI_COMMAND_WAIT,
> PCI_COMMAND_SERR, PCI_COMMAND_INTX_DISABLE which we may want to
> be aligned with the "host reference" values, e.g. we only allow those bits
> to be set as they are in Dom0.

Well, you've compiled a list already, and I did say so before as well:
Everything except I/O and memory decoding as well as bus mastering
needs at least closely looking at. INTX_DISABLE, for example, is
something I don't think a guest should be able to directly control.
It may still be the case that the host permits it control, but then
only indirectly, allowing the host to appropriately adjust its
internals.

Note that even for I/O and memory decoding as well as bus mastering
it may be necessary to limit guest control: In case the host wants
to disable any of these (perhaps transiently) despite the guest
wanting them enabled.

Jan
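
[Editor's sketch] The filtering described above could be expressed roughly as
follows. This is a minimal standalone illustration, not code from the series:
the helper name cmd_guest_write(), the GUEST_CMD_MASK macro and the
host_forced_off parameter are all hypothetical. It assumes the guest may only
influence I/O decoding, memory decoding and bus mastering, that every other
bit keeps its current hardware value, and that the host can additionally force
some bits off.

```c
#include <stdint.h>

/* Command register bits, values as listed earlier in this thread. */
#define PCI_COMMAND_IO            0x1
#define PCI_COMMAND_MEMORY        0x2
#define PCI_COMMAND_MASTER        0x4
#define PCI_COMMAND_INTX_DISABLE  0x400

/* Assumption: only these bits are directly guest-controllable. */
#define GUEST_CMD_MASK \
    (PCI_COMMAND_IO | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER)

/*
 * Hypothetical helper: compute the value to write to hardware.
 *   hw              - current hardware value of the command register
 *   guest           - value the guest tried to write
 *   host_forced_off - bits the host wants disabled (perhaps transiently)
 */
static uint16_t cmd_guest_write(uint16_t hw, uint16_t guest,
                                uint16_t host_forced_off)
{
    /* Host-owned bits come from hardware, the rest from the guest. */
    uint16_t val = (uint16_t)((hw & ~GUEST_CMD_MASK) |
                              (guest & GUEST_CMD_MASK));

    /* The host has the final say, even on guest-controllable bits. */
    return (uint16_t)(val & ~host_forced_off);
}
```

With this shape a guest write can neither set nor clear INTX_DISABLE, and a
host request to keep bus mastering off wins over the guest's value.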



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-07  9:19               ` Jan Beulich
@ 2021-09-07  9:52                 ` Oleksandr Andrushchenko
  2021-09-07 10:06                   ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-07  9:52 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 07.09.21 12:19, Jan Beulich wrote:
> On 07.09.2021 11:07, Oleksandr Andrushchenko wrote:
>> On 07.09.21 11:49, Jan Beulich wrote:
>>> On 07.09.2021 10:18, Oleksandr Andrushchenko wrote:
>>>> So, if we have a hidden PCI device which can be assigned to a guest and it is literally untouched
>>>> (not enabled in Dom0) then I think there will be no such reference as "host assigned values" as
>>>> most probably the command register will remain in its after reset state.
>>> What meaning of "hidden" do you imply here? Devices passed to
>>> pci_{hide,ro}_device() may not be assigned to guests ...
>> You are completely right here.
>>> For any other meaning of "hidden", even if the device is completely
>>> ignored by Dom0,
>> Dom0less is such a case when a device is assigned to the guest
>> without Dom0 at all?
> In this case it is entirely unclear to me what entity it is to have
> a global view on the PCI subsystem.
>
>>>    certain of the properties still cannot be allowed
>>> to be DomU-controlled.
>> The list is not that big, could you please name a few you think cannot
>> be controlled by a guest? I can think of PCI_COMMAND_SPECIAL(?),
>> PCI_COMMAND_INVALIDATE(?), PCI_COMMAND_PARITY, PCI_COMMAND_WAIT,
>> PCI_COMMAND_SERR, PCI_COMMAND_INTX_DISABLE which we may want to
>> be aligned with the "host reference" values, e.g. we only allow those bits
>> to be set as they are in Dom0.
> Well, you've compiled a list already, and I did say so before as well:
> Everything except I/O and memory decoding as well as bus mastering
> needs at least closely looking at. INTX_DISABLE, for example, is
> something I don't think a guest should be able to directly control.
> It may still be the case that the host permits it control, but then
> only indirectly, allowing the host to appropriately adjust its
> internals.
>
> Note that even for I/O and memory decoding as well as bus mastering
> it may be necessary to limit guest control: In case the host wants
> to disable any of these (perhaps transiently) despite the guest
> wanting them enabled.

Ok, so it is now clear that we need yet another patch to add proper
command register emulation. What is your preference: drop the current
patch, implement command register emulation and add a "reset patch"
after that, or keep the patch as is now, but only reset the IO/mem and bus
master bits, i.e. read the real value, mask the wanted bits and write back?
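
[Editor's sketch] The "read, mask, write back" variant could look roughly like
this. It is only an illustration under assumptions: cmd_reset_decoding() is a
hypothetical helper name, and the actual 16-bit config space read and the
pci_conf_write16() call are left out so that the snippet stands on its own.

```c
#include <stdint.h>

/* Command register bits, values as listed earlier in this thread. */
#define PCI_COMMAND_IO      0x1
#define PCI_COMMAND_MEMORY  0x2
#define PCI_COMMAND_MASTER  0x4

/*
 * Hypothetical helper: given the value read from the command register,
 * return the value to write back. Only I/O decoding, memory decoding and
 * bus mastering are cleared; all other bits, including reserved ones and
 * INTX_DISABLE, are preserved.
 */
static uint16_t cmd_reset_decoding(uint16_t cmd)
{
    return (uint16_t)(cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY |
                              PCI_COMMAND_MASTER));
}
```

Compared with writing 0 unconditionally, this leaves host-owned bits such as
INTX_DISABLE exactly as the host programmed them.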

>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 3/9] vpci/header: Move register assignments from init_bars
  2021-09-06 13:53   ` Jan Beulich
@ 2021-09-07 10:04     ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-07 10:04 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko, Roger Pau Monné
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, Bertrand Marquis, Rahul Singh, xen-devel


On 06.09.21 16:53, Jan Beulich wrote:
> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> This is in preparation for dynamic assignment of the vpci register
>> handlers depending on the domain: hwdom or guest.
> I guess why exactly this is going to help is going to be seen in
> subsequent patches. To aid review (i.e. to not force reviewers to
> peek ahead) it would imo be helpful if you outlined how the result
> is going to help.

Sure, will do next time. The need for this step is that it is easier to have
all related functionality (BARs here) in one place. When the subsequent
patches add decisions on which handlers to install, e.g. hwdom or guest
handlers, this function is extended to accept one more parameter, is_hwdom,
and all the assignment logic is put here. Of course I could have put all the
"if (is_hwdom)"'s at the original location, but a dedicated function looked
cleaner to me.

>   After all ...
>
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -445,6 +445,55 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>>           rom->addr = val & PCI_ROM_ADDRESS_MASK;
>>   }
>>   
>> +static int add_bar_handlers(struct pci_dev *pdev)
> ... this function name, for example, isn't Dom0-specific, so one
> might expect the function body to gain conditionals. Yet then the
> question is why these conditionals can't live in the original
> function.

Answered above. I think it makes code cleaner and easier for modification
as handlers' assignment for BARs becomes grouped as it is done for MSI/MSI-X.

>
>> +{
>> +    unsigned int i;
>> +    struct vpci_header *header = &pdev->vpci->header;
>> +    struct vpci_bar *bars = header->bars;
>> +    int rc;
>> +
>> +    /* Setup a handler for the command register. */
>> +    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
>> +                           2, header);
>> +    if ( rc )
>> +        return rc;
>> +
>> +    if ( pdev->ignore_bars )
>> +        return 0;
>> +
>> +    for ( i = 0; i < PCI_HEADER_NORMAL_NR_BARS + 1; i++ )
>> +    {
>> +        if ( (bars[i].type == VPCI_BAR_IO) || (bars[i].type == VPCI_BAR_EMPTY) )
>> +            continue;
>> +
>> +        if ( bars[i].type == VPCI_BAR_ROM )
>> +        {
>> +            unsigned int rom_reg;
>> +            uint8_t header_type = pci_conf_read8(pdev->sbdf,
>> +                                                 PCI_HEADER_TYPE) & 0x7f;
>> +            if ( header_type == PCI_HEADER_TYPE_NORMAL )
>> +                rom_reg = PCI_ROM_ADDRESS;
>> +            else
>> +                rom_reg = PCI_ROM_ADDRESS1;
>> +            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write,
>> +                                   rom_reg, 4, &bars[i]);
>> +            if ( rc )
>> +                return rc;
> I'm not the maintainer of this code, but if I was I'd ask for this and ...
>
>> +        }
>> +        else
>> +        {
>> +            uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
>> +
>> +            /* This is either VPCI_BAR_MEM32 or VPCI_BAR_MEM64_{LO|HI}. */
>> +            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
>> +                                   4, &bars[i]);
>> +            if ( rc )
>> +                return rc;
> ... this to be moved ...
>
>> +        }
> ... here to reduce redundancy.
>
>> @@ -553,11 +580,13 @@ static int init_bars(struct pci_dev *pdev)
>>           rom->addr = addr;
>>           header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
>>                                 PCI_ROM_ADDRESS_ENABLE;
>> +    }
>>   
>> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
>> -                               4, rom);
>> -        if ( rc )
>> -            rom->type = VPCI_BAR_EMPTY;
>> +    rc = add_bar_handlers(pdev);
>> +    if ( rc )
>> +    {
>> +        pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
>> +        return rc;
>>       }
> Seeing this moved (hence perhaps more a question to Roger than to
> you) restoring of the command register - why is it that the error
> path(s) here care(s) about restoring this, but ...
>
>>       return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
> ... ones in modify_bars() (and downwards) don't? I was wondering
> whether the restore could actually be done prior to the two calls
> (or, in the original code, the one call), or perhaps even right
> after the last call to pci_size_mem_bar(). At the very least the
> comment further up suggests memory decode only gets disabled for
> sizing BARs, which we're done with at this point.

For all the above: what this patch does is a pure code move.
I had no intention to alter it in any other way than that.
If you think the code needs to be functionally modified I think this
deserves dedicated work to be submitted, but IMO this patch
shouldn't touch anything.
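
[Editor's sketch] For reference, the redundancy reduction suggested in the
review above (a single "if ( rc ) return rc;" after the if/else) would have
roughly the following shape. This is a standalone illustration, not a proposed
change to the patch: add_register_stub() stands in for vpci_add_register(),
and the enum, the fail_reg parameter and all names ending in _sketch or _stub
are hypothetical.

```c
#include <stdint.h>

/* Simplified stand-ins for the vPCI BAR types. */
enum bar_type { BAR_EMPTY, BAR_IO, BAR_MEM32, BAR_ROM };

#define BASE_ADDRESS_0 0x10u /* offset of the first BAR register */

/* Stub for vpci_add_register(): fails only for the register fail_reg. */
static int add_register_stub(unsigned int reg, unsigned int fail_reg)
{
    return reg == fail_reg ? -1 : 0;
}

static int add_bar_handlers_sketch(const enum bar_type *types,
                                   unsigned int nr, unsigned int rom_reg,
                                   unsigned int fail_reg)
{
    unsigned int i;

    for ( i = 0; i < nr; i++ )
    {
        int rc;

        if ( types[i] == BAR_IO || types[i] == BAR_EMPTY )
            continue;

        if ( types[i] == BAR_ROM )
            rc = add_register_stub(rom_reg, fail_reg);
        else
            rc = add_register_stub(BASE_ADDRESS_0 + i * 4, fail_reg);

        /* One shared error check instead of one per branch. */
        if ( rc )
            return rc;
    }

    return 0;
}
```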

>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-07  9:52                 ` Oleksandr Andrushchenko
@ 2021-09-07 10:06                   ` Jan Beulich
  2021-09-09  8:39                     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-07 10:06 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 07.09.2021 11:52, Oleksandr Andrushchenko wrote:
> 
> On 07.09.21 12:19, Jan Beulich wrote:
>> On 07.09.2021 11:07, Oleksandr Andrushchenko wrote:
>>> On 07.09.21 11:49, Jan Beulich wrote:
>>>> On 07.09.2021 10:18, Oleksandr Andrushchenko wrote:
>>>>> So, if we have a hidden PCI device which can be assigned to a guest and it is literally untouched
>>>>> (not enabled in Dom0) then I think there will be no such reference as "host assigned values" as
>>>>> most probably the command register will remain in its after reset state.
>>>> What meaning of "hidden" do you imply here? Devices passed to
>>>> pci_{hide,ro}_device() may not be assigned to guests ...
>>> You are completely right here.
>>>> For any other meaning of "hidden", even if the device is completely
>>>> ignored by Dom0,
>>> Dom0less is such a case when a device is assigned to the guest
>>> without Dom0 at all?
>> In this case it is entirely unclear to me what entity it is to have
>> a global view on the PCI subsystem.
>>
>>>>    certain of the properties still cannot be allowed
>>>> to be DomU-controlled.
>>> The list is not that big, could you please name a few you think cannot
>>> be controlled by a guest? I can think of PCI_COMMAND_SPECIAL(?),
>>> PCI_COMMAND_INVALIDATE(?), PCI_COMMAND_PARITY, PCI_COMMAND_WAIT,
>>> PCI_COMMAND_SERR, PCI_COMMAND_INTX_DISABLE which we may want to
>>> be aligned with the "host reference" values, e.g. we only allow those bits
>>> to be set as they are in Dom0.
>> Well, you've compiled a list already, and I did say so before as well:
>> Everything except I/O and memory decoding as well as bus mastering
>> needs at least closely looking at. INTX_DISABLE, for example, is
>> something I don't think a guest should be able to directly control.
>> It may still be the case that the host permits it control, but then
>> only indirectly, allowing the host to appropriately adjust its
>> internals.
>>
>> Note that even for I/O and memory decoding as well as bus mastering
>> it may be necessary to limit guest control: In case the host wants
>> to disable any of these (perhaps transiently) despite the guest
>> wanting them enabled.
> 
> Ok, so it is now clear that we need a yet another patch to add a proper
> command register emulation. What is your preference: drop the current
> patch, implement command register emulation and add a "reset patch"
> after that or we can have the patch as is now, but I'll only reset IO/mem and bus
> master bits, e.g. read the real value, mask the wanted bits and write back?

Either order is fine with me as long as the result will be claimed to
be complete until proper emulation is in place.

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 4/9] vpci/header: Add and remove register handlers dynamically
  2021-09-06 14:11   ` Jan Beulich
@ 2021-09-07 10:11     ` Oleksandr Andrushchenko
  2021-09-07 10:43       ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-07 10:11 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 06.09.21 17:11, Jan Beulich wrote:
> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>> @@ -593,6 +625,36 @@ static int init_bars(struct pci_dev *pdev)
>>   }
>>   REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>>   
>> +int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
>> +{
>> +    int rc;
>> +
>> +    /* Remove previously added registers. */
>> +    vpci_remove_device_registers(pdev);
>> +
>> +    /* It only makes sense to add registers for hwdom or guest domain. */
>> +    if ( d->domain_id >= DOMID_FIRST_RESERVED )
>> +        return 0;
>> +
>> +    if ( is_hardware_domain(d) )
>> +        rc = add_bar_handlers(pdev, true);
>> +    else
>> +        rc = add_bar_handlers(pdev, false);
>      rc = add_bar_handlers(pdev, is_hardware_domain(d));
Indeed, thank you ;)
>
>> +    if ( rc )
>> +        gprintk(XENLOG_ERR,
>> +            "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
>> +            d->domain_id);
> Please use %pd and correct indentation. Logging the error code might
> also help some in diagnosing issues.
Sure, I'll change it to %pd
>   Further I'm not sure this is a
> message we want in release builds
Why not?
>   - perhaps gdprintk()?
I'll change if we decide so
>
>> +    return rc;
>> +}
>> +
>> +int vpci_bar_remove_handlers(const struct domain *d, struct pci_dev *pdev)
>> +{
>> +    /* Remove previously added registers. */
>> +    vpci_remove_device_registers(pdev);
>> +    return 0;
>> +}
> Also - in how far is the goal of your work to also make vPCI work for
> x86 DomU-s? If that's not a goal
It is not, unfortunately. The goal is not to break x86 and to enable Arm
> , I'd like to ask that you limit the
> introduction of code that ends up dead there.

What's wrong with this function even if it is a one-liner?
This way we have a pair vpci_bar_add_handlers/vpci_bar_remove_handlers
and if I understood correctly you suggest vpci_bar_add_handlers/vpci_remove_device_registers?
What would we gain from that, but yet another secret knowledge that in order
to remove BAR handlers one needs to call vpci_remove_device_registers
while I would personally expect to call vpci_bar_add_handlers' counterpart,
vpci_remove_device_registers namely.

> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 4/9] vpci/header: Add and remove register handlers dynamically
  2021-09-07 10:11     ` Oleksandr Andrushchenko
@ 2021-09-07 10:43       ` Jan Beulich
  2021-09-07 11:10         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-07 10:43 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 07.09.2021 12:11, Oleksandr Andrushchenko wrote:
> On 06.09.21 17:11, Jan Beulich wrote:
>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>> @@ -593,6 +625,36 @@ static int init_bars(struct pci_dev *pdev)
>>>   }
>>>   REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>>>   
>>> +int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
>>> +{
>>> +    int rc;
>>> +
>>> +    /* Remove previously added registers. */
>>> +    vpci_remove_device_registers(pdev);
>>> +
>>> +    /* It only makes sense to add registers for hwdom or guest domain. */
>>> +    if ( d->domain_id >= DOMID_FIRST_RESERVED )
>>> +        return 0;
>>> +
>>> +    if ( is_hardware_domain(d) )
>>> +        rc = add_bar_handlers(pdev, true);
>>> +    else
>>> +        rc = add_bar_handlers(pdev, false);
>>      rc = add_bar_handlers(pdev, is_hardware_domain(d));
> Indeed, thank you ;)
>>
>>> +    if ( rc )
>>> +        gprintk(XENLOG_ERR,
>>> +            "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
>>> +            d->domain_id);
>> Please use %pd and correct indentation. Logging the error code might
>> also help some in diagnosing issues.
> Sure, I'll change it to %pd
>>   Further I'm not sure this is a
>> message we want in release builds
> Why not?

Excess verbosity: If we have such here, why not elsewhere on error paths?
And I hope you agree things will get too verbose if we had such (about)
everywhere?

>>   - perhaps gdprintk()?
> I'll change if we decide so
>>
>>> +    return rc;
>>> +}
>>> +
>>> +int vpci_bar_remove_handlers(const struct domain *d, struct pci_dev *pdev)
>>> +{
>>> +    /* Remove previously added registers. */
>>> +    vpci_remove_device_registers(pdev);
>>> +    return 0;
>>> +}
>> Also - in how far is the goal of your work to also make vPCI work for
>> x86 DomU-s? If that's not a goal
> It is not, unfortunately. The goal is not to break x86 and to enable Arm
>> , I'd like to ask that you limit the
>> introduction of code that ends up dead there.
> 
> What's wrong with this function even if it is a one-liner?

The comment is primarily on the earlier function, and then extends to
this one.

> This way we have a pair vpci_bar_add_handlers/vpci_bar_remove_handlers
> and if I understood correctly you suggest vpci_bar_add_handlers/vpci_remove_device_registers?
> What would we gain from that, but yet another secret knowledge that in order
> to remove BAR handlers one needs to call vpci_remove_device_registers
> while I would personally expect to call vpci_bar_add_handlers' counterpart,
> vpci_remove_device_registers namely.

This is all fine. Yet vpci_bar_{add,remove}_handlers() will, aiui, be
dead code on x86. Hence there should be an arrangement allowing the
compiler to eliminate this dead code. Whether that's enclosing these
by "#ifdef" or adding early "if(!IS_ENABLED(CONFIG_*))" is secondary.
This has a knock-on effect on other functions as you certainly realize:
The compiler seeing e.g. the 2nd argument to the add-BARs function
always being true allows it to instantiate just a clone of the original
function with the respective conditionals removed.

Jan
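
[Editor's sketch] The early-return arrangement mentioned above could look
roughly like this. It is a standalone illustration under assumptions: the
macro CONFIG_VPCI_GUEST_SUPPORT_ENABLED is a hypothetical 0/1 constant used
here in place of a real IS_ENABLED(CONFIG_*) check, and the functions ending
in _sketch are simplified stand-ins for the ones in the patch.

```c
#include <stdbool.h>

/*
 * Hypothetical build-time switch: 1 when the architecture enables vPCI
 * for unprivileged guests, 0 otherwise. A real tree would use
 * IS_ENABLED() on a Kconfig symbol instead.
 */
#ifndef CONFIG_VPCI_GUEST_SUPPORT_ENABLED
#define CONFIG_VPCI_GUEST_SUPPORT_ENABLED 0
#endif

/* Stand-in for add_bar_handlers(pdev, is_hwdom). */
static int add_bar_handlers_sketch(bool is_hwdom)
{
    return is_hwdom ? 1 : 2;
}

/* Stand-in for vpci_bar_add_handlers(). */
static int vpci_bar_add_handlers_sketch(bool is_hwdom)
{
    /*
     * The condition is a compile-time constant, so with the option off
     * everything below the return is unreachable and can be dropped by
     * the compiler, without any #ifdef around the function.
     */
    if ( !CONFIG_VPCI_GUEST_SUPPORT_ENABLED )
        return 0;

    return add_bar_handlers_sketch(is_hwdom);
}
```

Unlike an #ifdef, the constant condition keeps the disabled path visible to
the compiler, so it continues to be syntax- and type-checked on every
architecture.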



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 4/9] vpci/header: Add and remove register handlers dynamically
  2021-09-07 10:43       ` Jan Beulich
@ 2021-09-07 11:10         ` Oleksandr Andrushchenko
  2021-09-07 11:49           ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-07 11:10 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 07.09.21 13:43, Jan Beulich wrote:
> On 07.09.2021 12:11, Oleksandr Andrushchenko wrote:
>> On 06.09.21 17:11, Jan Beulich wrote:
>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>> @@ -593,6 +625,36 @@ static int init_bars(struct pci_dev *pdev)
>>>>    }
>>>>    REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>>>>    
>>>> +int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
>>>> +{
>>>> +    int rc;
>>>> +
>>>> +    /* Remove previously added registers. */
>>>> +    vpci_remove_device_registers(pdev);
>>>> +
>>>> +    /* It only makes sense to add registers for hwdom or guest domain. */
>>>> +    if ( d->domain_id >= DOMID_FIRST_RESERVED )
>>>> +        return 0;
>>>> +
>>>> +    if ( is_hardware_domain(d) )
>>>> +        rc = add_bar_handlers(pdev, true);
>>>> +    else
>>>> +        rc = add_bar_handlers(pdev, false);
>>>       rc = add_bar_handlers(pdev, is_hardware_domain(d));
>> Indeed, thank you ;)
>>>> +    if ( rc )
>>>> +        gprintk(XENLOG_ERR,
>>>> +            "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
>>>> +            d->domain_id);
>>> Please use %pd and correct indentation. Logging the error code might
>>> also help some in diagnosing issues.
>> Sure, I'll change it to %pd
>>>    Further I'm not sure this is a
>>> message we want in release builds
>> Why not?
> Excess verbosity: If we have such here, why not elsewhere on error paths?
> And I hope you agree things will get too verbose if we had such (about)
> everywhere?
Agree, will change it to gdprintk
>
>>>    - perhaps gdprintk()?
>> I'll change if we decide so
>>>> +    return rc;
>>>> +}
>>>> +
>>>> +int vpci_bar_remove_handlers(const struct domain *d, struct pci_dev *pdev)
>>>> +{
>>>> +    /* Remove previously added registers. */
>>>> +    vpci_remove_device_registers(pdev);
>>>> +    return 0;
>>>> +}
>>> Also - in how far is the goal of your work to also make vPCI work for
>>> x86 DomU-s? If that's not a goal
>> It is not, unfortunately. The goal is not to break x86 and to enable Arm
>>> , I'd like to ask that you limit the
>>> introduction of code that ends up dead there.
>> What's wrong with this function even if it is a one-liner?
> The comment is primarily on the earlier function, and then extends to
> this one.
>
>> This way we have a pair vpci_bar_add_handlers/vpci_bar_remove_handlers
>> and if I understood correctly you suggest vpci_bar_add_handlers/vpci_remove_device_registers?
>> What would we gain from that, but yet another secret knowledge that in order
>> to remove BAR handlers one needs to call vpci_remove_device_registers
>> while I would personally expect to call vpci_bar_add_handlers' counterpart,
>> vpci_remove_device_registers namely.
> This is all fine. Yet vpci_bar_{add,remove}_handlers() will, aiui, be
> dead code on x86.
vpci_bar_add_handlers will be used by x86 PVH Dom0
>   Hence there should be an arrangement allowing the
> compiler to eliminate this dead code.

So, the only dead code for x86 here will be vpci_bar_remove_handlers. Yet.
Because I hope x86 to gain guest support for PVH Dom0 sooner or later.

>   Whether that's enclosing these
> by "#ifdef" or adding early "if(!IS_ENABLED(CONFIG_*))" is secondary.
> This has a knock-on effect on other functions as you certainly realize:
> The compiler seeing e.g. the 2nd argument to the add-BARs function
> always being true allows it to instantiate just a clone of the original
> function with the respective conditionals removed.

With the above (e.g. add is going to be used, but not remove) do you
think it is worth playing with ifdef's to strip that single function and add
a piece of spaghetti code to save a bit? What would that ifdef look like,
e.g. #ifdef CONFIG_ARM or #ifndef CONFIG_X86 && any other platform, but ARM?
IMO, it is cleaner to leave it as is. Yet we waste some bits for x86.

>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 4/9] vpci/header: Add and remove register handlers dynamically
  2021-09-07 11:10         ` Oleksandr Andrushchenko
@ 2021-09-07 11:49           ` Jan Beulich
  2021-09-07 12:16             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-07 11:49 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 07.09.2021 13:10, Oleksandr Andrushchenko wrote:
> 
> On 07.09.21 13:43, Jan Beulich wrote:
>> On 07.09.2021 12:11, Oleksandr Andrushchenko wrote:
>>> On 06.09.21 17:11, Jan Beulich wrote:
>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>> @@ -593,6 +625,36 @@ static int init_bars(struct pci_dev *pdev)
>>>>>    }
>>>>>    REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>>>>>    
>>>>> +int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
>>>>> +{
>>>>> +    int rc;
>>>>> +
>>>>> +    /* Remove previously added registers. */
>>>>> +    vpci_remove_device_registers(pdev);
>>>>> +
>>>>> +    /* It only makes sense to add registers for hwdom or guest domain. */
>>>>> +    if ( d->domain_id >= DOMID_FIRST_RESERVED )
>>>>> +        return 0;
>>>>> +
>>>>> +    if ( is_hardware_domain(d) )
>>>>> +        rc = add_bar_handlers(pdev, true);
>>>>> +    else
>>>>> +        rc = add_bar_handlers(pdev, false);
>>>>       rc = add_bar_handlers(pdev, is_hardware_domain(d));
>>> Indeed, thank you ;)
>>>>> +    if ( rc )
>>>>> +        gprintk(XENLOG_ERR,
>>>>> +            "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
>>>>> +            d->domain_id);
>>>> Please use %pd and correct indentation. Logging the error code might
>>>> also help some in diagnosing issues.
>>> Sure, I'll change it to %pd
>>>>    Further I'm not sure this is a
>>>> message we want in release builds
>>> Why not?
>> Excess verbosity: If we have such here, why not elsewhere on error paths?
>> And I hope you agree things will get too verbose if we had such (about)
>> everywhere?
> Agree, will change it to gdprintk
>>
>>>>    - perhaps gdprintk()?
>>> I'll change if we decide so
>>>>> +    return rc;
>>>>> +}
>>>>> +
>>>>> +int vpci_bar_remove_handlers(const struct domain *d, struct pci_dev *pdev)
>>>>> +{
>>>>> +    /* Remove previously added registers. */
>>>>> +    vpci_remove_device_registers(pdev);
>>>>> +    return 0;
>>>>> +}
>>>> Also - in how far is the goal of your work to also make vPCI work for
>>>> x86 DomU-s? If that's not a goal
>>> It is not, unfortunately. The goal is not to break x86 and to enable Arm
>>>> , I'd like to ask that you limit the
>>>> introduction of code that ends up dead there.
>>> What's wrong with this function even if it is a one-liner?
>> The comment is primarily on the earlier function, and then extends to
>> this one.
>>
>>> This way we have a pair vpci_bar_add_handlers/vpci_bar_remove_handlers
>>> and if I understood correctly you suggest vpci_bar_add_handlers/vpci_remove_device_registers?
>>> What would we gain from that, but yet another secret knowledge that in order
>>> to remove BAR handlers one needs to call vpci_remove_device_registers
>>> while I would personally expect to call vpci_bar_add_handlers' counterpart,
>>> vpci_remove_device_registers namely.
>> This is all fine. Yet vpci_bar_{add,remove}_handlers() will, aiui, be
>> dead code on x86.
> vpci_bar_add_handlers will be used by x86 PVH Dom0

Where / when? You add a call from vpci_assign_device(), but besides that
also being dead code on x86 (for now), you can't mean that because
vpci_deassign_device() also calls vpci_bar_remove_handlers().

>>   Hence there should be an arrangement allowing the
>> compiler to eliminate this dead code.
> 
> So, the only dead code for x86 here will be vpci_bar_remove_handlers. Yet.
> Because I hope x86 to gain guest support for PVH Dom0 sooner or later.
> 
>>   Whether that's enclosing these
>> by "#ifdef" or adding early "if(!IS_ENABLED(CONFIG_*))" is secondary.
>> This has a knock-on effect on other functions as you certainly realize:
>> The compiler seeing e.g. the 2nd argument to the add-BARs function
>> always being true allows it to instantiate just a clone of the original
>> function with the respective conditionals removed.
> 
> With the above (e.g. add is going to be used, but not remove) do you
> think it is worth playing with ifdef's to strip that single function and add
> a piece of spaghetti code to save a bit?

No, that I agree wouldn't be worth it.

> What would that ifdef look like,
> e.g. #ifdef CONFIG_ARM or #ifndef CONFIG_X86 && any other platform, but ARM?

A new setting, preferably; e.g. VCPU_UNPRIVILEGED, to be "select"ed by
architectures as functionality gets enabled.

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 4/9] vpci/header: Add and remove register handlers dynamically
  2021-09-07 11:49           ` Jan Beulich
@ 2021-09-07 12:16             ` Oleksandr Andrushchenko
  2021-09-07 12:20               ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-07 12:16 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 07.09.21 14:49, Jan Beulich wrote:
> On 07.09.2021 13:10, Oleksandr Andrushchenko wrote:
>> On 07.09.21 13:43, Jan Beulich wrote:
>>> On 07.09.2021 12:11, Oleksandr Andrushchenko wrote:
>>>> On 06.09.21 17:11, Jan Beulich wrote:
>>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>>> @@ -593,6 +625,36 @@ static int init_bars(struct pci_dev *pdev)
>>>>>>     }
>>>>>>     REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>>>>>>     
>>>>>> +int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
>>>>>> +{
>>>>>> +    int rc;
>>>>>> +
>>>>>> +    /* Remove previously added registers. */
>>>>>> +    vpci_remove_device_registers(pdev);
>>>>>> +
>>>>>> +    /* It only makes sense to add registers for hwdom or guest domain. */
>>>>>> +    if ( d->domain_id >= DOMID_FIRST_RESERVED )
>>>>>> +        return 0;
>>>>>> +
>>>>>> +    if ( is_hardware_domain(d) )
>>>>>> +        rc = add_bar_handlers(pdev, true);
>>>>>> +    else
>>>>>> +        rc = add_bar_handlers(pdev, false);
>>>>>        rc = add_bar_handlers(pdev, is_hardware_domain(d));
>>>> Indeed, thank you ;)
>>>>>> +    if ( rc )
>>>>>> +        gprintk(XENLOG_ERR,
>>>>>> +            "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
>>>>>> +            d->domain_id);
>>>>> Please use %pd and correct indentation. Logging the error code might
>>>>> also help some in diagnosing issues.
>>>> Sure, I'll change it to %pd
>>>>>     Further I'm not sure this is a
>>>>> message we want in release builds
>>>> Why not?
>>> Excess verbosity: If we have such here, why not elsewhere on error paths?
>>> And I hope you agree things will get too verbose if we had such (about)
>>> everywhere?
>> Agree, will change it to gdprintk
>>>>>     - perhaps gdprintk()?
>>>> I'll change if we decide so
>>>>>> +    return rc;
>>>>>> +}
>>>>>> +
>>>>>> +int vpci_bar_remove_handlers(const struct domain *d, struct pci_dev *pdev)
>>>>>> +{
>>>>>> +    /* Remove previously added registers. */
>>>>>> +    vpci_remove_device_registers(pdev);
>>>>>> +    return 0;
>>>>>> +}
>>>>> Also - in how far is the goal of your work to also make vPCI work for
>>>>> x86 DomU-s? If that's not a goal
>>>> It is not, unfortunately. The goal is not to break x86 and to enable Arm
>>>>> , I'd like to ask that you limit the
>>>>> introduction of code that ends up dead there.
>>>> What's wrong with this function even if it is a one-liner?
>>> The comment is primarily on the earlier function, and then extends to
>>> this one.
>>>
>>>> This way we have a pair vpci_bar_add_handlers/vpci_bar_remove_handlers
>>>> and if I understood correctly you suggest vpci_bar_add_handlers/vpci_remove_device_registers?
>>>> What would we gain from that, but yet another secret knowledge that in order
>>>> to remove BAR handlers one needs to call vpci_remove_device_registers
>>>> while I would personally expect to call vpci_bar_add_handlers' counterpart,
>>>> vpci_bar_remove_handlers namely.
>>> This is all fine. Yet vpci_bar_{add,remove}_handlers() will, aiui, be
>>> dead code on x86.
>> vpci_bar_add_handlers will be used by x86 PVH Dom0
> Where / when? You add a call from vpci_assign_device(), but besides that
> also being dead code on x86 (for now), you can't mean that because
> vpci_deassign_device() also calls vpci_bar_remove_handlers().

You are right here and both add/remove are not used on x86 PVH Dom0.

I am sorry for wasting your time

>
>>>    Hence there should be an arrangement allowing the
>>> compiler to eliminate this dead code.
>> So, the only dead code for x86 here will be vpci_bar_remove_handlers. Yet.
>> Because I hope x86 to gain guest support for PVH Dom0 sooner or later.
>>
>>>    Whether that's enclosing these
>>> by "#ifdef" or adding early "if(!IS_ENABLED(CONFIG_*))" is secondary.
>>> This has a knock-on effect on other functions as you certainly realize:
>>> The compiler seeing e.g. the 2nd argument to the add-BARs function
>>> always being true allows it to instantiate just a clone of the original
>>> function with the respective conditionals removed.
>> With the above (e.g. add is going to be used, but not remove) do you
>> think it is worth playing with ifdef's to strip that single function and add
>> a piece of spaghetti code to save a bit?
> No, that I agree wouldn't be worth it.
>
>> What would that ifdef look like,
>> e.g. #ifdef CONFIG_ARM or #ifndef CONFIG_X86 && any other platform, but ARM?
> A new setting, preferably; e.g. VCPU_UNPRIVILEGED, to be "select"ed by
> architectures as functionality gets enabled.

So, as add/remove are only needed for Arm at the moment
you suggest I add VCPU_UNPRIVILEGED to Arm's Kconfig to enable
compiling vpci_bar_add_handlers/vpci_bar_remove_handlers?

To me this is more about vPCI's support for guests, so should we probably call it
VPCI_XXX instead? E.g. VPCI_HAS_GUEST_SUPPORT or something which
will reflect the nature of the code being gated? VCPU_UNPRIVILEGED sounds
like not connected to vpci to me

>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 4/9] vpci/header: Add and remove register handlers dynamically
  2021-09-07 12:16             ` Oleksandr Andrushchenko
@ 2021-09-07 12:20               ` Jan Beulich
  2021-09-07 12:23                 ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-07 12:20 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 07.09.2021 14:16, Oleksandr Andrushchenko wrote:
> 
> On 07.09.21 14:49, Jan Beulich wrote:
>> On 07.09.2021 13:10, Oleksandr Andrushchenko wrote:
>>> On 07.09.21 13:43, Jan Beulich wrote:
>>>> On 07.09.2021 12:11, Oleksandr Andrushchenko wrote:
>>>>> On 06.09.21 17:11, Jan Beulich wrote:
>>>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>>>> @@ -593,6 +625,36 @@ static int init_bars(struct pci_dev *pdev)
>>>>>>>     }
>>>>>>>     REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>>>>>>>     
>>>>>>> +int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
>>>>>>> +{
>>>>>>> +    int rc;
>>>>>>> +
>>>>>>> +    /* Remove previously added registers. */
>>>>>>> +    vpci_remove_device_registers(pdev);
>>>>>>> +
>>>>>>> +    /* It only makes sense to add registers for hwdom or guest domain. */
>>>>>>> +    if ( d->domain_id >= DOMID_FIRST_RESERVED )
>>>>>>> +        return 0;
>>>>>>> +
>>>>>>> +    if ( is_hardware_domain(d) )
>>>>>>> +        rc = add_bar_handlers(pdev, true);
>>>>>>> +    else
>>>>>>> +        rc = add_bar_handlers(pdev, false);
>>>>>>        rc = add_bar_handlers(pdev, is_hardware_domain(d));
>>>>> Indeed, thank you ;)
>>>>>>> +    if ( rc )
>>>>>>> +        gprintk(XENLOG_ERR,
>>>>>>> +            "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
>>>>>>> +            d->domain_id);
>>>>>> Please use %pd and correct indentation. Logging the error code might
>>>>>> also help some in diagnosing issues.
>>>>> Sure, I'll change it to %pd
>>>>>>     Further I'm not sure this is a
>>>>>> message we want in release builds
>>>>> Why not?
>>>> Excess verbosity: If we have such here, why not elsewhere on error paths?
>>>> And I hope you agree things will get too verbose if we had such (about)
>>>> everywhere?
>>> Agree, will change it to gdprintk
>>>>>>     - perhaps gdprintk()?
>>>>> I'll change if we decide so
>>>>>>> +    return rc;
>>>>>>> +}
>>>>>>> +
>>>>>>> +int vpci_bar_remove_handlers(const struct domain *d, struct pci_dev *pdev)
>>>>>>> +{
>>>>>>> +    /* Remove previously added registers. */
>>>>>>> +    vpci_remove_device_registers(pdev);
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>> Also - in how far is the goal of your work to also make vPCI work for
>>>>>> x86 DomU-s? If that's not a goal
>>>>> It is not, unfortunately. The goal is not to break x86 and to enable Arm
>>>>>> , I'd like to ask that you limit the
>>>>>> introduction of code that ends up dead there.
>>>>> What's wrong with this function even if it is a one-liner?
>>>> The comment is primarily on the earlier function, and then extends to
>>>> this one.
>>>>
>>>>> This way we have a pair vpci_bar_add_handlers/vpci_bar_remove_handlers
>>>>> and if I understood correctly you suggest vpci_bar_add_handlers/vpci_remove_device_registers?
>>>>> What would we gain from that, but yet another secret knowledge that in order
>>>>> to remove BAR handlers one needs to call vpci_remove_device_registers
>>>>> while I would personally expect to call vpci_bar_add_handlers' counterpart,
>>>>> vpci_bar_remove_handlers namely.
>>>> This is all fine. Yet vpci_bar_{add,remove}_handlers() will, aiui, be
>>>> dead code on x86.
>>> vpci_bar_add_handlers will be used by x86 PVH Dom0
>> Where / when? You add a call from vpci_assign_device(), but besides that
>> also being dead code on x86 (for now), you can't mean that because
>> vpci_deassign_device() also calls vpci_bar_remove_handlers().
> 
> You are right here and both add/remove are not used on x86 PVH Dom0.
> 
> I am sorry for wasting your time
> 
>>
>>>>    Hence there should be an arrangement allowing the
>>>> compiler to eliminate this dead code.
>>> So, the only dead code for x86 here will be vpci_bar_remove_handlers. Yet.
>>> Because I hope x86 to gain guest support for PVH Dom0 sooner or later.
>>>
>>>>    Whether that's enclosing these
>>>> by "#ifdef" or adding early "if(!IS_ENABLED(CONFIG_*))" is secondary.
>>>> This has a knock-on effect on other functions as you certainly realize:
>>>> The compiler seeing e.g. the 2nd argument to the add-BARs function
>>>> always being true allows it to instantiate just a clone of the original
>>>> function with the respective conditionals removed.
>>> With the above (e.g. add is going to be used, but not remove) do you
>>> think it is worth playing with ifdef's to strip that single function and add
>>> a piece of spaghetti code to save a bit?
>> No, that I agree wouldn't be worth it.
>>
>>> What would that ifdef look like,
>>> e.g. #ifdef CONFIG_ARM or #ifndef CONFIG_X86 && any other platform, but ARM?
>> A new setting, preferably; e.g. VCPU_UNPRIVILEGED, to be "select"ed by
>> architectures as functionality gets enabled.
> 
> So, as add/remove are only needed for Arm at the moment
> you suggest I add VCPU_UNPRIVILEGED to Arm's Kconfig to enable
> compiling vpci_bar_add_handlers/vpci_bar_remove_handlers?
> To me this is more about vPCI's support for guests, so should we probably call it
> VPCI_XXX instead? E.g. VPCI_HAS_GUEST_SUPPORT or something which
> will reflect the nature of the code being gated? VCPU_UNPRIVILEGED sounds
> like not connected to vpci to me

And validly so - my fingers didn't type what the brain told them. I've
meant VPCI_UNPRIVILEGED. I would also be okay with HAS_VPCI_GUEST_SUPPORT
(i.e. not exactly what you've suggested), for example.
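
For illustration only, a minimal sketch of how such a gate could let the
compiler drop the guest-only handlers. It is self-contained rather than
real Xen code: CONFIG_HAS_VPCI_GUEST_SUPPORT is defined by hand (in a real
build Kconfig would provide it once an architecture selects the option),
and struct pci_dev, the nr_handlers counter, the handler counts and the
-EOPNOTSUPP fallback are all stand-ins assumed for the example.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumption: normally generated by Kconfig when an arch "select"s it. */
#define CONFIG_HAS_VPCI_GUEST_SUPPORT 1

/* Hypothetical stand-in for Xen's struct pci_dev. */
struct pci_dev {
    unsigned int nr_handlers;
};

/* Stand-in for vpci_remove_device_registers(): drop all handlers. */
static void remove_device_registers(struct pci_dev *pdev)
{
    pdev->nr_handlers = 0;
}

#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
/* Real bodies, built only where guest support is selected. */
int vpci_bar_add_handlers(struct pci_dev *pdev, bool is_hwdom)
{
    remove_device_registers(pdev);
    /* Made-up counts: pretend guests need one more emulated register. */
    pdev->nr_handlers = is_hwdom ? 1 : 2;
    return 0;
}

int vpci_bar_remove_handlers(struct pci_dev *pdev)
{
    remove_device_registers(pdev);
    return 0;
}
#else
/* Stubs keep callers compiling; nothing guest-specific is emitted. */
static inline int vpci_bar_add_handlers(struct pci_dev *pdev, bool is_hwdom)
{
    (void)pdev;
    (void)is_hwdom;
    return -EOPNOTSUPP;
}

static inline int vpci_bar_remove_handlers(struct pci_dev *pdev)
{
    (void)pdev;
    return -EOPNOTSUPP;
}
#endif
```

With the option left undefined, callers such as the assign/de-assign hooks
would only see the inline stubs, so neither function body ends up in the
x86 image.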

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 4/9] vpci/header: Add and remove register handlers dynamically
  2021-09-07 12:20               ` Jan Beulich
@ 2021-09-07 12:23                 ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-07 12:23 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 07.09.21 15:20, Jan Beulich wrote:
> On 07.09.2021 14:16, Oleksandr Andrushchenko wrote:
>> On 07.09.21 14:49, Jan Beulich wrote:
>>> On 07.09.2021 13:10, Oleksandr Andrushchenko wrote:
>>>> On 07.09.21 13:43, Jan Beulich wrote:
>>>>> On 07.09.2021 12:11, Oleksandr Andrushchenko wrote:
>>>>>> On 06.09.21 17:11, Jan Beulich wrote:
>>>>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>>>>> @@ -593,6 +625,36 @@ static int init_bars(struct pci_dev *pdev)
>>>>>>>>      }
>>>>>>>>      REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>>>>>>>>      
>>>>>>>> +int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
>>>>>>>> +{
>>>>>>>> +    int rc;
>>>>>>>> +
>>>>>>>> +    /* Remove previously added registers. */
>>>>>>>> +    vpci_remove_device_registers(pdev);
>>>>>>>> +
>>>>>>>> +    /* It only makes sense to add registers for hwdom or guest domain. */
>>>>>>>> +    if ( d->domain_id >= DOMID_FIRST_RESERVED )
>>>>>>>> +        return 0;
>>>>>>>> +
>>>>>>>> +    if ( is_hardware_domain(d) )
>>>>>>>> +        rc = add_bar_handlers(pdev, true);
>>>>>>>> +    else
>>>>>>>> +        rc = add_bar_handlers(pdev, false);
>>>>>>>         rc = add_bar_handlers(pdev, is_hardware_domain(d));
>>>>>> Indeed, thank you ;)
>>>>>>>> +    if ( rc )
>>>>>>>> +        gprintk(XENLOG_ERR,
>>>>>>>> +            "%pp: failed to add BAR handlers for dom%d\n", &pdev->sbdf,
>>>>>>>> +            d->domain_id);
>>>>>>> Please use %pd and correct indentation. Logging the error code might
>>>>>>> also help some in diagnosing issues.
>>>>>> Sure, I'll change it to %pd
>>>>>>>      Further I'm not sure this is a
>>>>>>> message we want in release builds
>>>>>> Why not?
>>>>> Excess verbosity: If we have such here, why not elsewhere on error paths?
>>>>> And I hope you agree things will get too verbose if we had such (about)
>>>>> everywhere?
>>>> Agree, will change it to gdprintk
>>>>>>>      - perhaps gdprintk()?
>>>>>> I'll change if we decide so
>>>>>>>> +    return rc;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +int vpci_bar_remove_handlers(const struct domain *d, struct pci_dev *pdev)
>>>>>>>> +{
>>>>>>>> +    /* Remove previously added registers. */
>>>>>>>> +    vpci_remove_device_registers(pdev);
>>>>>>>> +    return 0;
>>>>>>>> +}
>>>>>>> Also - in how far is the goal of your work to also make vPCI work for
>>>>>>> x86 DomU-s? If that's not a goal
>>>>>> It is not, unfortunately. The goal is not to break x86 and to enable Arm
>>>>>>> , I'd like to ask that you limit the
>>>>>>> introduction of code that ends up dead there.
>>>>>> What's wrong with this function even if it is a one-liner?
>>>>> The comment is primarily on the earlier function, and then extends to
>>>>> this one.
>>>>>
>>>>>> This way we have a pair vpci_bar_add_handlers/vpci_bar_remove_handlers
>>>>>> and if I understood correctly you suggest vpci_bar_add_handlers/vpci_remove_device_registers?
>>>>>> What would we gain from that, but yet another secret knowledge that in order
>>>>>> to remove BAR handlers one needs to call vpci_remove_device_registers
>>>>>> while I would personally expect to call vpci_bar_add_handlers' counterpart,
>>>>>> vpci_bar_remove_handlers namely.
>>>>> This is all fine. Yet vpci_bar_{add,remove}_handlers() will, aiui, be
>>>>> dead code on x86.
>>>> vpci_bar_add_handlers will be used by x86 PVH Dom0
>>> Where / when? You add a call from vpci_assign_device(), but besides that
>>> also being dead code on x86 (for now), you can't mean that because
>>> vpci_deassign_device() also calls vpci_bar_remove_handlers().
>> You are right here and both add/remove are not used on x86 PVH Dom0.
>>
>> I am sorry for wasting your time
>>
>>>>>     Hence there should be an arrangement allowing the
>>>>> compiler to eliminate this dead code.
>>>> So, the only dead code for x86 here will be vpci_bar_remove_handlers. Yet.
>>>> Because I hope x86 to gain guest support for PVH Dom0 sooner or later.
>>>>
>>>>>     Whether that's enclosing these
>>>>> by "#ifdef" or adding early "if(!IS_ENABLED(CONFIG_*))" is secondary.
>>>>> This has a knock-on effect on other functions as you certainly realize:
>>>>> The compiler seeing e.g. the 2nd argument to the add-BARs function
>>>>> always being true allows it to instantiate just a clone of the original
>>>>> function with the respective conditionals removed.
>>>> With the above (e.g. add is going to be used, but not remove) do you
>>>> think it is worth playing with ifdef's to strip that single function and add
>>>> a piece of spaghetti code to save a bit?
>>> No, that I agree wouldn't be worth it.
>>>
>>>> What would that ifdef look like,
>>>> e.g. #ifdef CONFIG_ARM or #ifndef CONFIG_X86 && any other platform, but ARM?
>>> A new setting, preferably; e.g. VCPU_UNPRIVILEGED, to be "select"ed by
>>> architectures as functionality gets enabled.
>> So, as add/remove are only needed for Arm at the moment
>> you suggest I add VCPU_UNPRIVILEGED to Arm's Kconfig to enable
>> compiling vpci_bar_add_handlers/vpci_bar_remove_handlers?
>> To me this is more about vPCI's support for guests, so should we probably call it
>> VPCI_XXX instead? E.g. VPCI_HAS_GUEST_SUPPORT or something which
>> will reflect the nature of the code being gated? VCPU_UNPRIVILEGED sounds
>> like not connected to vpci to me
> And validly so - my fingers didn't type what the brain told them. I've
> meant VPCI_UNPRIVILEGED. I would also be okay with HAS_VPCI_GUEST_SUPPORT
> (i.e. not exactly what you've suggested), for example.
I'll stick to HAS_VPCI_GUEST_SUPPORT as it seems to be more descriptive
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 5/9] vpci/header: Implement guest BAR register handlers
  2021-09-06 14:31   ` Jan Beulich
@ 2021-09-07 13:33     ` Oleksandr Andrushchenko
  2021-09-07 16:30       ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-07 13:33 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 06.09.21 17:31, Jan Beulich wrote:
> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -400,12 +400,72 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>   static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>                               uint32_t val, void *data)
>>   {
>> +    struct vpci_bar *bar = data;
>> +    bool hi = false;
>> +
>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>> +    {
>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>> +        bar--;
>> +        hi = true;
>> +    }
>> +    else
>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
> What you store here is not the address that's going to be used,

bar->guest_addr is never used directly to be reported to a guest.
The same as bar->addr is never used to write to real BAR.
It is always used as an initial value which is then modified to reflect
lower bits, e.g. BAR type and if prefetchable, so I think this is perfectly
fine to have it this way.

>   as
> you don't mask off the low bits (to account for the BAR's size).
> When a BAR gets written with all ones, all writable bits get these
> ones stored. The address of the BAR, aiui, really changes to
> (typically) close below 4Gb (in the case of a 32-bit BAR), which
> is why memory / I/O decoding should be off while sizing BARs.
> Therefore you shouldn't look for the specific "all writable bits
> are ones" pattern (or worse, as you presently do, the "all bits
> outside of the type specifier are ones" one) on the read path.
> Instead mask the value appropriately here, and simply return back
> the stored value from the read path.
"PCI LOCAL BUS SPECIFICATION, REV. 3.0", "IMPLEMENTATION NOTE

Sizing a 32-bit Base Address Register Example" says, that
"Software saves the original value of the Base Address register, writes
0 FFFF FFFFh to the register, then reads it back."

The same applies for 64-bit BARs. So what's wrong if I try to catch such
a write when a guest tries to size the BAR? The only difference is that
I compare as
         if ( (val & PCI_BASE_ADDRESS_MEM_MASK_32) == PCI_BASE_ADDRESS_MEM_MASK_32 )
which is because val in the question has lower bits cleared.

With that respect I see no obvious reason why we can't construct our code
as it is.
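
For reference, a minimal sketch of the guest-side step the implementation
note describes, i.e. turning the value read back after the all-ones write
into a size. The helper name bar_size_from_readback is hypothetical, and
the mask is redefined locally as a plain 32-bit constant so the snippet
stands alone; it only covers a 32-bit memory BAR.

```c
#include <stdint.h>

/* Address bits of a 32-bit memory BAR; bits 3:0 carry type/prefetch. */
#define BAR_MEM_MASK_32 0xfffffff0u

/*
 * Derive the BAR size from the value read back after 0xffffffff was
 * written: the device keeps the low address bits hardwired to zero,
 * so inverting the address part and adding one yields the size.
 */
static uint32_t bar_size_from_readback(uint32_t readback)
{
    uint32_t addr_bits = readback & BAR_MEM_MASK_32;

    if ( addr_bits == 0 )
        return 0;               /* BAR not implemented */

    return ~addr_bits + 1;
}
```

A 1 MiB prefetchable BAR would read back as 0xfff00008 under this scheme,
which the helper maps to 0x100000.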

>
>>   }
>>   
>>   static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>>                                  void *data)
>>   {
>> -    return 0xffffffff;
>> +    struct vpci_bar *bar = data;
>> +    uint32_t val;
>> +    bool hi = false;
>> +
>> +    switch ( bar->type )
>> +    {
>> +    case VPCI_BAR_MEM64_HI:
>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>> +        bar--;
>> +        hi = true;
>> +        /* fallthrough */
>> +    case VPCI_BAR_MEM64_LO:
>> +    {
> Please don't add braces to case blocks when they're not needed.
Sure
>
>> +        if ( hi )
>> +            val = bar->guest_addr >> 32;
>> +        else
>> +            val = bar->guest_addr & 0xffffffff;
>> +        if ( (val & PCI_BASE_ADDRESS_MEM_MASK_32) ==  PCI_BASE_ADDRESS_MEM_MASK_32 )
> This is wrong when falling through to here from VPCI_BAR_MEM64_HI:
> All 32 bits need to be looked at.
Good catch, will fix
>   Yet as per the comment further
> up I think it isn't right anyway to apply the mask here.
>
> Also: Stray double blanks.
>
>> +        {
>> +            /* Guests detects BAR's properties and sizes. */
>> +            if ( hi )
>> +                val = bar->size >> 32;
>> +            else
>> +                val = 0xffffffff & ~(bar->size - 1);
>> +        }
>> +        if ( !hi )
>> +        {
>> +            val |= PCI_BASE_ADDRESS_MEM_TYPE_64;
>> +            val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>> +        }
>> +        bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>> +        bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>> +        break;
>> +    }
>> +    case VPCI_BAR_MEM32:
> Please separate non-fall-through case blocks by a blank line.
Will do
>
>> @@ -522,6 +582,13 @@ static int add_bar_handlers(struct pci_dev *pdev, bool is_hwdom)
>>               if ( rc )
>>                   return rc;
>>           }
>> +        /*
>> +         * It is neither safe nor secure to initialize guest's view of the BARs
>> +         * with real values which are used by the hardware domain, so assign
>> +         * all zeros to guest's view of the BARs, so the guest can perform
>> +         * proper PCI device enumeration and assign BARs on its own.
>> +         */
>> +        bars[i].guest_addr = 0;
> I'm afraid I don't understand the comment: Without memory decoding
> enabled, the BARs are simple registers (with a few r/o bits).

My first implementation was that bar->guest_addr was initialized with
the value of bar->addr (physical BAR value), but talking on IRC with
Roger he suggested that this might be a security issue to let guest
a hint about physical values, so then I changed the assignment to be 0.
Thus the comment

>
>> --- a/xen/include/xen/pci_regs.h
>> +++ b/xen/include/xen/pci_regs.h
>> @@ -103,6 +103,7 @@
>>   #define  PCI_BASE_ADDRESS_MEM_TYPE_64	0x04	/* 64 bit address */
>>   #define  PCI_BASE_ADDRESS_MEM_PREFETCH	0x08	/* prefetchable? */
>>   #define  PCI_BASE_ADDRESS_MEM_MASK	(~0x0fUL)
>> +#define  PCI_BASE_ADDRESS_MEM_MASK_32	(~0x0fU)
> Please don't introduce an identical constant that's merely of
> different type. (uint32_t)PCI_BASE_ADDRESS_MEM_MASK at the use
> site (if actually still needed as per the comment above) would
> seem more clear to me.
Ok, I thought type casting is a bigger evil here
>
> Jan

Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 5/9] vpci/header: Implement guest BAR register handlers
  2021-09-07 13:33     ` Oleksandr Andrushchenko
@ 2021-09-07 16:30       ` Jan Beulich
  2021-09-07 17:39         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-07 16:30 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 07.09.2021 15:33, Oleksandr Andrushchenko wrote:
> On 06.09.21 17:31, Jan Beulich wrote:
>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>> --- a/xen/drivers/vpci/header.c
>>> +++ b/xen/drivers/vpci/header.c
>>> @@ -400,12 +400,72 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>   static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>                               uint32_t val, void *data)
>>>   {
>>> +    struct vpci_bar *bar = data;
>>> +    bool hi = false;
>>> +
>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>> +    {
>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>> +        bar--;
>>> +        hi = true;
>>> +    }
>>> +    else
>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>> What you store here is not the address that's going to be used,
> 
> bar->guest_addr is never used directly to be reported to a guest.

And this is what I question, as an approach. Your model _may_ work,
but it's needlessly deviating from how people like me would expect
this to work. And if it's intended to be this way, how would I
have known?

> It is always used as an initial value which is then modified to reflect
> lower bits, e.g. BAR type and if prefetchable, so I think this is perfectly
> fine to have it this way.

And it is also perfectly fine to store the value to be handed
back to guests on the next read. Keeps the read path simple,
which I think can be assumed to be taken more frequently than
the write one. Plus stored values reflect reality.

Plus - if what you say was really the case, why do you mask off
PCI_BASE_ADDRESS_MEM_MASK here? You should then simply store
the written value and do _all_ the processing in the read path.
No point having partial logic in two places.
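
To make the suggestion concrete, here is a rough sketch of the "mask on
write, return the stored value on read" model for the low half of a memory
BAR. It is not the patch's code: struct guest_bar and its fields, the
function names and the local type-bit constants are assumptions made for
the example, and the size is assumed to be a power of two of at least 16.

```c
#include <stdbool.h>
#include <stdint.h>

#define MEM_TYPE_64  0x04u   /* 64-bit memory BAR */
#define MEM_PREFETCH 0x08u   /* prefetchable */

/* Hypothetical, trimmed-down stand-in for struct vpci_bar. */
struct guest_bar {
    uint64_t size;       /* power of two, >= 16 */
    uint32_t guest_val;  /* exactly what the next read hands back */
    bool is_64bit;
    bool prefetchable;
};

/* All processing happens here, on the (rarer) write path. */
static void guest_bar_write_lo(struct guest_bar *bar, uint32_t val)
{
    /* Only address bits at or above the BAR size are writable. */
    val &= (uint32_t)~(bar->size - 1);

    /* The low type bits are read-only, so fold them in once. */
    if ( bar->is_64bit )
        val |= MEM_TYPE_64;
    if ( bar->prefetchable )
        val |= MEM_PREFETCH;

    bar->guest_val = val;
}

/* The read path has nothing left to decide. */
static uint32_t guest_bar_read_lo(const struct guest_bar *bar)
{
    return bar->guest_val;
}
```

Sizing then needs no special casing: an all-ones write to a 1 MiB
prefetchable BAR stores 0xfff00008, and any other write is truncated to
the BAR's alignment in the same way.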

>>   as
>> you don't mask off the low bits (to account for the BAR's size).
>> When a BAR gets written with all ones, all writable bits get these
>> ones stored. The address of the BAR, aiui, really changes to
>> (typically) close below 4Gb (in the case of a 32-bit BAR), which
>> is why memory / I/O decoding should be off while sizing BARs.
>> Therefore you shouldn't look for the specific "all writable bits
>> are ones" pattern (or worse, as you presently do, the "all bits
>> outside of the type specifier are ones" one) on the read path.
>> Instead mask the value appropriately here, and simply return back
>> the stored value from the read path.
> "PCI LOCAL BUS SPECIFICATION, REV. 3.0", "IMPLEMENTATION NOTE
> 
> Sizing a 32-bit Base Address Register Example" says, that
> 
> "Software saves the original value of the Base Address register, writes
> 0 FFFF FFFFh to the register, then reads it back."
> 
> The same applies for 64-bit BARs. So what's wrong if I try to catch such
> a write when a guest tries to size the BAR? The only difference is that
> I compare as
>          if ( (val & PCI_BASE_ADDRESS_MEM_MASK_32) == PCI_BASE_ADDRESS_MEM_MASK_32 )
> which is because val in the question has lower bits cleared.

Because while this matches what the spec says, it's not enough to
match how hardware behaves. Yet you want to mimic hardware behavior
as closely as possible here. There is (iirc) at least one other
place in the source tree where we had to adjust a similar initial
implementation to be closer to one expected by guests, no matter
that they may not be following the spec to the letter. Don't
forget that there may be bugs in kernels which don't surface until
the kernel gets run on an overly simplistic emulation.

>>> @@ -522,6 +582,13 @@ static int add_bar_handlers(struct pci_dev *pdev, bool is_hwdom)
>>>               if ( rc )
>>>                   return rc;
>>>           }
>>> +        /*
>>> +         * It is neither safe nor secure to initialize guest's view of the BARs
>>> +         * with real values which are used by the hardware domain, so assign
>>> +         * all zeros to guest's view of the BARs, so the guest can perform
>>> +         * proper PCI device enumeration and assign BARs on its own.
>>> +         */
>>> +        bars[i].guest_addr = 0;
>> I'm afraid I don't understand the comment: Without memory decoding
>> enabled, the BARs are simple registers (with a few r/o bits).
> 
> My first implementation was that bar->guest_addr was initialized with
> the value of bar->addr (physical BAR value), but talking on IRC with
> Roger he suggested that this might be a security issue to let guest
> a hint about physical values, so then I changed the assignment to be 0.

Well, yes, that's certainly correct. It's perhaps too unobvious to me
why one may want to use the host value here in the first place. It
simply has no meaning here.

>>> --- a/xen/include/xen/pci_regs.h
>>> +++ b/xen/include/xen/pci_regs.h
>>> @@ -103,6 +103,7 @@
>>>   #define  PCI_BASE_ADDRESS_MEM_TYPE_64	0x04	/* 64 bit address */
>>>   #define  PCI_BASE_ADDRESS_MEM_PREFETCH	0x08	/* prefetchable? */
>>>   #define  PCI_BASE_ADDRESS_MEM_MASK	(~0x0fUL)
>>> +#define  PCI_BASE_ADDRESS_MEM_MASK_32	(~0x0fU)
>> Please don't introduce an identical constant that's merely of
>> different type. (uint32_t)PCI_BASE_ADDRESS_MEM_MASK at the use
>> site (if actually still needed as per the comment above) would
>> seem more clear to me.
> Ok, I thought type casting is a bigger evil here

Often it is, but imo here it is not. I hope you realize that ~0x0fU
is not necessarily 0xfffffff0? We make certain assumptions on type
widths. For unsigned int we assume it to be at least 32 bits wide.
We should avoid assumptions of it being exactly 32 bits wide. Just
like we cannot (more obviously) assume the width of unsigned long.
(Which tells us that for 32-bit arches PCI_BASE_ADDRESS_MEM_MASK is
actually of the wrong type. This constant should be the same no
matter the bitness.)
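
A small sketch of the use-site cast, for illustration. The macro is copied
from the pci_regs.h hunk quoted above; the two helper names are made up
for the example. Converting to uint32_t is defined as reduction modulo
2^32, so the result does not depend on how wide unsigned long happens to
be.

```c
#include <stdint.h>

/* As in the quoted pci_regs.h context; its width follows unsigned long. */
#define PCI_BASE_ADDRESS_MEM_MASK (~0x0fUL)

/* Hypothetical helper: the mask narrowed to exactly 32 bits. */
static uint32_t bar_mem_mask32(void)
{
    return (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;
}

/* Hypothetical helper: strip the type/prefetch bits from a BAR value. */
static uint32_t bar_addr_bits(uint32_t val)
{
    return val & (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;
}
```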

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 5/9] vpci/header: Implement guest BAR register handlers
  2021-09-07 16:30       ` Jan Beulich
@ 2021-09-07 17:39         ` Oleksandr Andrushchenko
  2021-09-08  9:27           ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-07 17:39 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 07.09.21 19:30, Jan Beulich wrote:
> On 07.09.2021 15:33, Oleksandr Andrushchenko wrote:
>> On 06.09.21 17:31, Jan Beulich wrote:
>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>> --- a/xen/drivers/vpci/header.c
>>>> +++ b/xen/drivers/vpci/header.c
>>>> @@ -400,12 +400,72 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>    static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>                                uint32_t val, void *data)
>>>>    {
>>>> +    struct vpci_bar *bar = data;
>>>> +    bool hi = false;
>>>> +
>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>> +    {
>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>> +        bar--;
>>>> +        hi = true;
>>>> +    }
>>>> +    else
>>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>> What you store here is not the address that's going to be used,
>> bar->guest_addr is never used directly to be reported to a guest.
> And this is what I question, as an approach. Your model _may_ work,
> but it's needlessly deviating from how people like me would expect
> this to work. And if it's intended to be this way, how would I
> have known?
Well, I just tried to follow the model we already have in the existing code...
>
>> It is always used as an initial value which is then modified to reflect
>> lower bits, e.g. BAR type and if prefetchable, so I think this is perfectly
>> fine to have it this way.
> And it is also perfectly fine to store the value to be handed
> back to guests on the next read. Keeps the read path simple,
> which I think can be assumed to be taken more frequently than
> the write one. Plus stored values reflect reality.
>
> Plus - if what you say was really the case, why do you mask off
> PCI_BASE_ADDRESS_MEM_MASK here? You should then simply store
> the written value and do _all_ the processing in the read path.
> No point having partial logic in two places.

I will move the logic to the write handler; indeed, we can save some CPU
cycles here.

>
>>>    as
>>> you don't mask off the low bits (to account for the BAR's size).
>>> When a BAR gets written with all ones, all writable bits get these
>>> ones stored. The address of the BAR, aiui, really changes to
>>> (typically) close below 4Gb (in the case of a 32-bit BAR), which
>>> is why memory / I/O decoding should be off while sizing BARs.
>>> Therefore you shouldn't look for the specific "all writable bits
>>> are ones" pattern (or worse, as you presently do, the "all bits
>>> outside of the type specifier are ones" one) on the read path.
>>> Instead mask the value appropriately here, and simply return back
>>> the stored value from the read path.
>> "PCI LOCAL BUS SPECIFICATION, REV. 3.0", "IMPLEMENTATION NOTE
>>
>> Sizing a 32-bit Base Address Register Example" says, that
>>
>> "Software saves the original value of the Base Address register, writes
>> 0 FFFF FFFFh to the register, then reads it back."
>>
>> The same applies for 64-bit BARs. So what's wrong if I try to catch such
>> a write when a guest tries to size the BAR? The only difference is that
>> I compare as
>>           if ( (val & PCI_BASE_ADDRESS_MEM_MASK_32) == PCI_BASE_ADDRESS_MEM_MASK_32 )
>> which is because val in the question has lower bits cleared.
> Because while this matches what the spec says, it's not enough to
> match how hardware behaves.
But we should implement what the spec says, not how buggy hardware behaves.
>   Yet you want to mimic hardware behavior
> as closely as possible here. There is (iirc) at least one other
> place in the source tree where we had to adjust a similar initial
> implementation to be closer to one expected by guests,

Could you please point me to that code so I can consult and possibly
re-use the approach?

>   no matter
> that they may not be following the spec to the letter. Don't
> forget that there may be bugs in kernels which don't surface until
> the kernel gets run on an overly simplistic emulation.
This is sad. And the kernel needs to be fixed then, not Xen
>
>>>> @@ -522,6 +582,13 @@ static int add_bar_handlers(struct pci_dev *pdev, bool is_hwdom)
>>>>                if ( rc )
>>>>                    return rc;
>>>>            }
>>>> +        /*
>>>> +         * It is neither safe nor secure to initialize guest's view of the BARs
>>>> +         * with real values which are used by the hardware domain, so assign
>>>> +         * all zeros to guest's view of the BARs, so the guest can perform
>>>> +         * proper PCI device enumeration and assign BARs on its own.
>>>> +         */
>>>> +        bars[i].guest_addr = 0;
>>> I'm afraid I don't understand the comment: Without memory decoding
>>> enabled, the BARs are simple registers (with a few r/o bits).
>> My first implementation was that bar->guest_addr was initialized with
>> the value of bar->addr (physical BAR value), but talking on IRC with
>> Roger he suggested that this might be a security issue to let guest
>> a hint about physical values, so then I changed the assignment to be 0.
> Well, yes, that's certainly correct. It's perhaps too unobvious to me
> why one may want to use the host value here in the first place. It
> simply has no meaning here.
Do you want me to remove the comment?
>
>>>> --- a/xen/include/xen/pci_regs.h
>>>> +++ b/xen/include/xen/pci_regs.h
>>>> @@ -103,6 +103,7 @@
>>>>    #define  PCI_BASE_ADDRESS_MEM_TYPE_64	0x04	/* 64 bit address */
>>>>    #define  PCI_BASE_ADDRESS_MEM_PREFETCH	0x08	/* prefetchable? */
>>>>    #define  PCI_BASE_ADDRESS_MEM_MASK	(~0x0fUL)
>>>> +#define  PCI_BASE_ADDRESS_MEM_MASK_32	(~0x0fU)
>>> Please don't introduce an identical constant that's merely of
>>> different type. (uint32_t)PCI_BASE_ADDRESS_MEM_MASK at the use
>>> site (if actually still needed as per the comment above) would
>>> seem more clear to me.
>> Ok, I thought type casting is a bigger evil here
> Often it is, but imo here it is not. I hope you realize that ~0x0fU
> is not necessarily 0xfffffff0? We make certain assumptions on type
> widths. For unsigned int we assume it to be at least 32 bits wide.
> We should avoid assumptions of it being exactly 32 bits wide. Just
> like we cannot (more obviously) assume the width of unsigned long.
> (Which tells us that for 32-bit arches PCI_BASE_ADDRESS_MEM_MASK is
> actually of the wrong type. This constant should be the same no
> matter the bitness.)
Ok, I will not introduce a new define and use (uint32_t)
>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 5/9] vpci/header: Implement guest BAR register handlers
  2021-09-07 17:39         ` Oleksandr Andrushchenko
@ 2021-09-08  9:27           ` Jan Beulich
  2021-09-08  9:43             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-08  9:27 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 07.09.2021 19:39, Oleksandr Andrushchenko wrote:
> On 07.09.21 19:30, Jan Beulich wrote:
>> On 07.09.2021 15:33, Oleksandr Andrushchenko wrote:
>>> On 06.09.21 17:31, Jan Beulich wrote:
>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>> --- a/xen/drivers/vpci/header.c
>>>>> +++ b/xen/drivers/vpci/header.c
>>>>> @@ -400,12 +400,72 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>    static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>                                uint32_t val, void *data)
>>>>>    {
>>>>> +    struct vpci_bar *bar = data;
>>>>> +    bool hi = false;
>>>>> +
>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>> +    {
>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>> +        bar--;
>>>>> +        hi = true;
>>>>> +    }
>>>>> +    else
>>>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>>> What you store here is not the address that's going to be used,
>>>>    as
>>>> you don't mask off the low bits (to account for the BAR's size).
>>>> When a BAR gets written with all ones, all writable bits get these
>>>> ones stored. The address of the BAR, aiui, really changes to
>>>> (typically) close below 4Gb (in the case of a 32-bit BAR), which
>>>> is why memory / I/O decoding should be off while sizing BARs.
>>>> Therefore you shouldn't look for the specific "all writable bits
>>>> are ones" pattern (or worse, as you presently do, the "all bits
>>>> outside of the type specifier are ones" one) on the read path.
>>>> Instead mask the value appropriately here, and simply return back
>>>> the stored value from the read path.
>>> "PCI LOCAL BUS SPECIFICATION, REV. 3.0", "IMPLEMENTATION NOTE
>>>
>>> Sizing a 32-bit Base Address Register Example" says, that
>>>
>>> "Software saves the original value of the Base Address register, writes
>>> 0 FFFF FFFFh to the register, then reads it back."
>>>
>>> The same applies for 64-bit BARs. So what's wrong if I try to catch such
>>> a write when a guest tries to size the BAR? The only difference is that
>>> I compare as
>>>           if ( (val & PCI_BASE_ADDRESS_MEM_MASK_32) == PCI_BASE_ADDRESS_MEM_MASK_32 )
>>> which is because val in the question has lower bits cleared.
>> Because while this matches what the spec says, it's not enough to
>> match how hardware behaves.
> But we should implement as the spec says, not like buggy hardware behaves

The behavior I'm describing doesn't violate the spec. There merely is
more to it than what the spec says, or one might also view it the way
that the spec doesn't use the best way of expressing things.

>>   Yet you want to mimic hardware behavior
>> as closely as possible here. There is (iirc) at least one other
>> place in the source tree where we had to adjust a similar initial
>> implementation to be closer to one expected by guests,
> 
> Could you please point me to that code so I can consult and possibly
> re-use the approach?

I only recall the fact; this might have been hvmloader, vpci, or yet
somewhere else. I'm sorry.

>>>>> @@ -522,6 +582,13 @@ static int add_bar_handlers(struct pci_dev *pdev, bool is_hwdom)
>>>>>                if ( rc )
>>>>>                    return rc;
>>>>>            }
>>>>> +        /*
>>>>> +         * It is neither safe nor secure to initialize guest's view of the BARs
>>>>> +         * with real values which are used by the hardware domain, so assign
>>>>> +         * all zeros to guest's view of the BARs, so the guest can perform
>>>>> +         * proper PCI device enumeration and assign BARs on its own.
>>>>> +         */
>>>>> +        bars[i].guest_addr = 0;
>>>> I'm afraid I don't understand the comment: Without memory decoding
>>>> enabled, the BARs are simple registers (with a few r/o bits).
>>> My first implementation was that bar->guest_addr was initialized with
>>> the value of bar->addr (physical BAR value), but talking on IRC with
>>> Roger he suggested that this might be a security issue to let guest
>>> a hint about physical values, so then I changed the assignment to be 0.
>> Well, yes, that's certainly correct. It's perhaps too unobvious to me
>> why one may want to use the host value here in the first place. It
>> simply has no meaning here.
> Do you want me to remove the comment?

Yes. I wonder whether the assignment is necessary in the first place:
I'd somehow expect the structure to come from xzalloc(). Albeit I
guess this function can be run through more than once without freeing
the structure and then allocating it anew?

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 5/9] vpci/header: Implement guest BAR register handlers
  2021-09-08  9:27           ` Jan Beulich
@ 2021-09-08  9:43             ` Oleksandr Andrushchenko
  2021-09-08 10:03               ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-08  9:43 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 08.09.21 12:27, Jan Beulich wrote:
> On 07.09.2021 19:39, Oleksandr Andrushchenko wrote:
>> On 07.09.21 19:30, Jan Beulich wrote:
>>> On 07.09.2021 15:33, Oleksandr Andrushchenko wrote:
>>>> On 06.09.21 17:31, Jan Beulich wrote:
>>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>> @@ -400,12 +400,72 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>     static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>                                 uint32_t val, void *data)
>>>>>>     {
>>>>>> +    struct vpci_bar *bar = data;
>>>>>> +    bool hi = false;
>>>>>> +
>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>>> +    {
>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>>> +        bar--;
>>>>>> +        hi = true;
>>>>>> +    }
>>>>>> +    else
>>>>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>>>> What you store here is not the address that's going to be used,
>>>>>     as
>>>>> you don't mask off the low bits (to account for the BAR's size).
>>>>> When a BAR gets written with all ones, all writable bits get these
>>>>> ones stored. The address of the BAR, aiui, really changes to
>>>>> (typically) close below 4Gb (in the case of a 32-bit BAR), which
>>>>> is why memory / I/O decoding should be off while sizing BARs.
>>>>> Therefore you shouldn't look for the specific "all writable bits
>>>>> are ones" pattern (or worse, as you presently do, the "all bits
>>>>> outside of the type specifier are ones" one) on the read path.
>>>>> Instead mask the value appropriately here, and simply return back
>>>>> the stored value from the read path.

But in case of BAR sizing I need to actually return the BAR size.
So, the comparison is the way to tell whether the guest wants to read the
actual (configured) BAR value or is trying to determine the BAR's size.
This is why I compare and use the result as the answer to what needs
to be supplied to the guest. So, if I don't compare with 0xffffffff for the
hi part and 0xfffffff0 for the low one, then how do I know when to return
the configured BAR or the size?

>>>> "PCI LOCAL BUS SPECIFICATION, REV. 3.0", "IMPLEMENTATION NOTE
>>>>
>>>> Sizing a 32-bit Base Address Register Example" says, that
>>>>
>>>> "Software saves the original value of the Base Address register, writes
>>>> 0 FFFF FFFFh to the register, then reads it back."
>>>>
>>>> The same applies for 64-bit BARs. So what's wrong if I try to catch such
>>>> a write when a guest tries to size the BAR? The only difference is that
>>>> I compare as
>>>>            if ( (val & PCI_BASE_ADDRESS_MEM_MASK_32) == PCI_BASE_ADDRESS_MEM_MASK_32 )
>>>> which is because val in the question has lower bits cleared.
>>> Because while this matches what the spec says, it's not enough to
>>> match how hardware behaves.
>> But we should implement as the spec says, not like buggy hardware behaves
> The behavior I'm describing doesn't violate the spec. There merely is
> more to it than what the spec says, or one might also view it the way
> that the spec doesn't use the best way of expressing things.

Well, the spec explicitly says "write 0xffffffff and read back".
I can't see any way it can be read differently.

>
>>>    Yet you want to mimic hardware behavior
>>> as closely as possible here. There is (iirc) at least one other
>>> place in the source tree where we had to adjust a similar initial
>>> implementation to be closer to one expected by guests,
>> Could you please point me to that code so I can consult and possibly
>> re-use the approach?
> I only recall the fact; this might have been hvmloader, vpci, or yet
> somewhere else. I'm sorry.
No problem
>
>>>>>> @@ -522,6 +582,13 @@ static int add_bar_handlers(struct pci_dev *pdev, bool is_hwdom)
>>>>>>                 if ( rc )
>>>>>>                     return rc;
>>>>>>             }
>>>>>> +        /*
>>>>>> +         * It is neither safe nor secure to initialize guest's view of the BARs
>>>>>> +         * with real values which are used by the hardware domain, so assign
>>>>>> +         * all zeros to guest's view of the BARs, so the guest can perform
>>>>>> +         * proper PCI device enumeration and assign BARs on its own.
>>>>>> +         */
>>>>>> +        bars[i].guest_addr = 0;
>>>>> I'm afraid I don't understand the comment: Without memory decoding
>>>>> enabled, the BARs are simple registers (with a few r/o bits).
>>>> My first implementation was that bar->guest_addr was initialized with
>>>> the value of bar->addr (physical BAR value), but talking on IRC with
>>>> Roger he suggested that this might be a security issue to let guest
>>>> a hint about physical values, so then I changed the assignment to be 0.
>>> Well, yes, that's certainly correct. It's perhaps too unobvious to me
>>> why one may want to use the host value here in the first place. It
>>> simply has no meaning here.
>> Do you want me to remove the comment?
> Yes. I wonder whether the assignment is necessary in the first place:
> I'd somehow expect the structure to come from xzalloc(). Albeit I
> guess this function can be run through more than once without freeing
> the structure and then allocating it anew?

I'll check that, and if the structure is already zeroed then I'll remove both
the assignment and the comment.

>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 5/9] vpci/header: Implement guest BAR register handlers
  2021-09-08  9:43             ` Oleksandr Andrushchenko
@ 2021-09-08 10:03               ` Jan Beulich
  2021-09-08 13:33                 ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-08 10:03 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 08.09.2021 11:43, Oleksandr Andrushchenko wrote:
> 
> On 08.09.21 12:27, Jan Beulich wrote:
>> On 07.09.2021 19:39, Oleksandr Andrushchenko wrote:
>>> On 07.09.21 19:30, Jan Beulich wrote:
>>>> On 07.09.2021 15:33, Oleksandr Andrushchenko wrote:
>>>>> On 06.09.21 17:31, Jan Beulich wrote:
>>>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>>> @@ -400,12 +400,72 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>     static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>                                 uint32_t val, void *data)
>>>>>>>     {
>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>> +    bool hi = false;
>>>>>>> +
>>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>>>> +    {
>>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>>>> +        bar--;
>>>>>>> +        hi = true;
>>>>>>> +    }
>>>>>>> +    else
>>>>>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>>>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>>>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>>>>> What you store here is not the address that's going to be used,
>>>>>>     as
>>>>>> you don't mask off the low bits (to account for the BAR's size).
>>>>>> When a BAR gets written with all ones, all writable bits get these
>>>>>> ones stored. The address of the BAR, aiui, really changes to
>>>>>> (typically) close below 4Gb (in the case of a 32-bit BAR), which
>>>>>> is why memory / I/O decoding should be off while sizing BARs.
>>>>>> Therefore you shouldn't look for the specific "all writable bits
>>>>>> are ones" pattern (or worse, as you presently do, the "all bits
>>>>>> outside of the type specifier are ones" one) on the read path.
>>>>>> Instead mask the value appropriately here, and simply return back
>>>>>> the stored value from the read path.
> 
> But in case of BAR sizing I need to actually return BAR size.
> So, the comparison is the way to tell if the guest wants to read
> actual (configured) BAR value or it tries to determine BAR's size.
> This is why I compare and use the result as the answer to what needs
> to be supplied to the guest. So, if I don't compare with 0xffffffff for the
> hi part and 0xfffffff0 for the low then how do I know when to return
> configured BAR or return the size?

Well, but that's the common misunderstanding that I've been trying
to point out: There's no difference between these two forms of
reads. The BARs are simply registers with some r/o bits. There's
no hidden 2nd register recording what was last written. When you
write 0xffffffff, all you do is set all writable bits to 1. When
you read back from the register, you will find all r/o bits
unchanged (which in particular means all lower address bits are
zero, thus allowing you to determine the size).

When the spec says to write 0xffffffff for sizing purposes, OSes
should follow that, yes. This doesn't preclude them from using other
forms of writes for whichever purpose. Hence you do not want to
special case sizing, but instead you want to emulate correctly
all forms of writes, including subsequent reads to uniformly
return the intended / expected values.

Just to give an example (perhaps a little contrived): To size a
64-bit BAR, in principle you'd first need to write 0xffffffff to
both halves. But there's nothing wrong with doing this in a
different order: Act on the low half alone first, and then act
on the high half. The acting on the high half could even be
skipped if the low half sizing produced at least bit 31 set. Now
if you were to special case seeing ffffffff:fffffff? as the
last written pair of values, you'd break that (imo legitimate)
alternative process of sizing.
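
The "register with some r/o bits" model can be sketched in C as follows.
This is only an illustration under stated assumptions: struct model_bar and
the model_bar_*() functions are hypothetical names and are not Xen's struct
vpci_bar or its handlers, the BAR is assumed to be a 64-bit memory BAR, and
its size is assumed to be a power of two of at least 16 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MEM_TYPE_64  0x04u    /* 64-bit memory BAR type bits */
#define MEM_FLAGS    0x0full  /* type and prefetch bits, read-only */

/* Hypothetical model of one 64-bit memory BAR. */
struct model_bar {
    uint64_t size;   /* power of two, >= 16 */
    uint64_t value;  /* exactly what a read returns, flags included */
};

static void model_bar_init(struct model_bar *bar, uint64_t size)
{
    bar->size = size;
    bar->value = MEM_TYPE_64;
}

/* Write one 32-bit half; only the writable address bits can change. */
static void model_bar_write(struct model_bar *bar, bool hi, uint32_t val)
{
    unsigned int shift = hi ? 32 : 0;
    uint64_t writable = ~(bar->size - 1);  /* address bits above the size */
    uint64_t merged = bar->value;

    merged &= ~(0xffffffffull << shift);
    merged |= (uint64_t)val << shift;

    /* Drop writes to r/o bits and keep the r/o flag bits as they were. */
    bar->value = (merged & writable) | (bar->value & MEM_FLAGS);
}

/* Reads never special-case sizing; they just return the stored value. */
static uint32_t model_bar_read(const struct model_bar *bar, bool hi)
{
    return (uint32_t)(bar->value >> (hi ? 32 : 0));
}
```

With a 1 MiB BAR, writing 0xffffffff to the low half alone and reading it
back gives 0xfff00004, which is already enough to derive the size; the high
half can be sized before, after, or not at all, because no write pattern is
treated specially.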

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 5/9] vpci/header: Implement guest BAR register handlers
  2021-09-08 10:03               ` Jan Beulich
@ 2021-09-08 13:33                 ` Oleksandr Andrushchenko
  2021-09-08 14:46                   ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-08 13:33 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 08.09.21 13:03, Jan Beulich wrote:
> On 08.09.2021 11:43, Oleksandr Andrushchenko wrote:
>> On 08.09.21 12:27, Jan Beulich wrote:
>>> On 07.09.2021 19:39, Oleksandr Andrushchenko wrote:
>>>> On 07.09.21 19:30, Jan Beulich wrote:
>>>>> On 07.09.2021 15:33, Oleksandr Andrushchenko wrote:
>>>>>> On 06.09.21 17:31, Jan Beulich wrote:
>>>>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>>>> @@ -400,12 +400,72 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>      static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>                                  uint32_t val, void *data)
>>>>>>>>      {
>>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>>> +    bool hi = false;
>>>>>>>> +
>>>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>>>>> +    {
>>>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>>>>> +        bar--;
>>>>>>>> +        hi = true;
>>>>>>>> +    }
>>>>>>>> +    else
>>>>>>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>>>>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>>>>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>>>>>> What you store here is not the address that's going to be used,
>>>>>>>      as
>>>>>>> you don't mask off the low bits (to account for the BAR's size).
>>>>>>> When a BAR gets written with all ones, all writable bits get these
>>>>>>> ones stored. The address of the BAR, aiui, really changes to
>>>>>>> (typically) close below 4Gb (in the case of a 32-bit BAR), which
>>>>>>> is why memory / I/O decoding should be off while sizing BARs.
>>>>>>> Therefore you shouldn't look for the specific "all writable bits
>>>>>>> are ones" pattern (or worse, as you presently do, the "all bits
>>>>>>> outside of the type specifier are ones" one) on the read path.
>>>>>>> Instead mask the value appropriately here, and simply return back
>>>>>>> the stored value from the read path.
>> But in case of BAR sizing I need to actually return BAR size.
>> So, the comparison is the way to tell if the guest wants to read
>> actual (configured) BAR value or it tries to determine BAR's size.
>> This is why I compare and use the result as the answer to what needs
>> to be supplied to the guest. So, if I don't compare with 0xffffffff for the
>> hi part and 0xfffffff0 for the low then how do I know when to return
>> configured BAR or return the size?
> Well, but that's the common misunderstanding that I've been trying
> to point out: There's no difference between these two forms of
> reads. The BARs are simply registers with some r/o bits. There's
> no hidden 2nd register recording what was last written. When you
> write 0xffffffff, all you do is set all writable bits to 1. When
> you read back from the register, you will find all r/o bits
> unchanged (which in particular means all lower address bits are
> zero, thus allowing you to determine the size).
>
> When the spec says to write 0xffffffff for sizing purposes, OSes
> should follow that, yes. This doesn't preclude them from using other
> forms of writes for whichever purpose. Hence you do not want to
> special case sizing, but instead you want to emulate correctly
> all forms of writes, including subsequent reads to uniformly
> return the intended / expected values.
>
> Just to give an example (perhaps a little contrived): To size a
> 64-bit BAR, in principle you'd first need to write 0xffffffff to
> both halves. But there's nothing wrong with doing this in a
> different order: Act on the low half alone first, and then act
> on the high half. The acting on the high half could even be
> skipped if the low half sizing produced at least bit 31 set. Now
> if you were to special case seeing ffffffff:fffffff? as the
> last written pair of values, you'd break that (imo legitimate)
> alternative process of sizing.

How about:

static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
                             uint32_t val, void *data)
{
     struct vpci_bar *bar = data;
     bool hi = false;

     if ( bar->type == VPCI_BAR_MEM64_HI )
     {
         ASSERT(reg > PCI_BASE_ADDRESS_0);
         bar--;
         hi = true;
     }
     else
     {
         val &= PCI_BASE_ADDRESS_MEM_MASK;
         val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
                                            : PCI_BASE_ADDRESS_MEM_TYPE_64;
         val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
     }

     bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
     bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);

     bar->guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
}

static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
                                void *data)
{
     struct vpci_bar *bar = data;

     if ( bar->type == VPCI_BAR_MEM64_HI )
         return bar->guest_addr >> 32;

     return bar->guest_addr;
}

It seems to solve all the questions we have: more work on the write path and
no comparison with 0xffffffff; the BAR's size is used to mask unwanted bits.

BTW, bars[i].guest_addr = 0; is needed as this field can be re-used.
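
For reference, the effect of the final masking statement in the proposal
above can be checked in isolation. This is a small sketch, not code from the
series: clamp_guest_addr() is a hypothetical name, and MEM_MASK is defined
here with a 64-bit type so that the example does not depend on the width of
unsigned long (the header quoted earlier uses ~0x0fUL).

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit stand-in for PCI_BASE_ADDRESS_MEM_MASK (an assumption, see above). */
#define MEM_MASK (~UINT64_C(0x0f))

/*
 * Keep the address bits at or above the BAR size plus the low four flag
 * bits; clear everything in between. size must be a power of two.
 */
static uint64_t clamp_guest_addr(uint64_t guest_addr, uint64_t size)
{
    return guest_addr & (~(size - 1) | ~MEM_MASK);
}
```

For a 16 KiB BAR, a stored value of 0xfffffffc becomes 0xffffc00c: bits 4 to
13 are cleared, and the flag bits pass through untouched.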
>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 6/9] vpci/header: Handle p2m range sets per BAR
  2021-09-06 14:47   ` Jan Beulich
@ 2021-09-08 14:31     ` Oleksandr Andrushchenko
  2021-09-08 15:00       ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-08 14:31 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 06.09.21 17:47, Jan Beulich wrote:
> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Instead of handling a single range set, that contains all the memory
>> regions of all the BARs and ROM, have them per BAR.
> Without looking at how you carry out this change - this look wrong (as
> in: wasteful) to me. Despite ...
>
>> This is in preparation of making non-identity mappings in p2m for the
>> MMIOs/ROM.
> ... the need for this, every individual BAR is still contiguous in both
> host and guest address spaces, so can be represented as a single
> (start,end) tuple (or a pair thereof, to account for both host and guest
> values). No need to use a rangeset for this.

First of all, this change is in preparation for non-identity mappings.
Currently we collect all the memory ranges which require mappings into a
single range set, then we cut off the MSI-X regions and then use the range
set functionality to call a callback for every memory range left after
MSI-X. This works perfectly fine for 1:1 mappings, i.e. what we have as the
range set's starting address is what we want to be mapped/unmapped.

Why range sets? Because they allow partial mappings, e.g. you can map part
of the range, return, and later continue from where you stopped. And if I
understand correctly, that was the initial intention of introducing range
sets here.

For non-identity mappings this becomes not that easy. Each individual BAR
may be mapped differently according to what the guest OS has programmed as
bar->guest_addr (the guest view of the BAR start). Thus we now need to
collect all those non-identity mappings per BAR (so we have a mapping
"guest view" : "physical BAR") and again cut off the MSI-X regions as
before. So, yes, it may be a bit wasteful to use many range sets, but it
makes vPCI life much easier. Thus, IMO, per-BAR range sets are good to go
as they have more pros than cons.

Even if we go with "can be represented as a single (start,end) tuple", it
doesn't answer the question of what needs to be done if a range gets
partially mapped/unmapped. We'll need to add some logic to retry the
operation later and remember where we stopped. At the end of the day we'll
invent one more range set, but now vPCI's own.
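
To make the "remember where we stopped" concern concrete, here is a rough C
sketch of what a single-tuple representation would need to carry. Everything
in it is hypothetical (struct bar_map, bar_map_step() and the frame-number
fields are illustration only, not Xen interfaces), and it deliberately
ignores the MSI-X holes, which are the part a range set handles for free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-BAR state: one contiguous range plus a resume cursor. */
struct bar_map {
    uint64_t start_gfn;  /* guest frame where the BAR begins */
    uint64_t start_mfn;  /* machine frame of the physical BAR */
    uint64_t nr;         /* total number of frames to map */
    uint64_t done;       /* frames already handled */
};

/*
 * Hand out the next chunk of at most `budget` frames and advance the
 * cursor. Returns true once the whole BAR has been covered, so a caller
 * that gets preempted can come back later and continue.
 */
static bool bar_map_step(struct bar_map *m, uint64_t budget,
                         uint64_t *gfn, uint64_t *mfn, uint64_t *count)
{
    uint64_t left = m->nr - m->done;

    *count = left < budget ? left : budget;
    *gfn = m->start_gfn + m->done;
    *mfn = m->start_mfn + m->done;
    m->done += *count;

    return m->done == m->nr;
}
```

Once the MSI-X regions have to be excluded as well, the single cursor is no
longer sufficient and the state starts to resemble a small range set.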

>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 5/9] vpci/header: Implement guest BAR register handlers
  2021-09-08 13:33                 ` Oleksandr Andrushchenko
@ 2021-09-08 14:46                   ` Jan Beulich
  2021-09-08 15:14                     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-08 14:46 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 08.09.2021 15:33, Oleksandr Andrushchenko wrote:
> 
> On 08.09.21 13:03, Jan Beulich wrote:
>> On 08.09.2021 11:43, Oleksandr Andrushchenko wrote:
>>> On 08.09.21 12:27, Jan Beulich wrote:
>>>> On 07.09.2021 19:39, Oleksandr Andrushchenko wrote:
>>>>> On 07.09.21 19:30, Jan Beulich wrote:
>>>>>> On 07.09.2021 15:33, Oleksandr Andrushchenko wrote:
>>>>>>> On 06.09.21 17:31, Jan Beulich wrote:
>>>>>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>>>>> @@ -400,12 +400,72 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>>      static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>>                                  uint32_t val, void *data)
>>>>>>>>>      {
>>>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>>>> +    bool hi = false;
>>>>>>>>> +
>>>>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>>>>>> +    {
>>>>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>>>>>> +        bar--;
>>>>>>>>> +        hi = true;
>>>>>>>>> +    }
>>>>>>>>> +    else
>>>>>>>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>>>>>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>>>>>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>>>>>>> What you store here is not the address that's going to be used, as
>>>>>>>> you don't mask off the low bits (to account for the BAR's size).
>>>>>>>> When a BAR gets written with all ones, all writable bits get these
>>>>>>>> ones stored. The address of the BAR, aiui, really changes to
>>>>>>>> (typically) close below 4Gb (in the case of a 32-bit BAR), which
>>>>>>>> is why memory / I/O decoding should be off while sizing BARs.
>>>>>>>> Therefore you shouldn't look for the specific "all writable bits
>>>>>>>> are ones" pattern (or worse, as you presently do, the "all bits
>>>>>>>> outside of the type specifier are ones" one) on the read path.
>>>>>>>> Instead mask the value appropriately here, and simply return back
>>>>>>>> the stored value from the read path.
>>> But in case of BAR sizing I need to actually return BAR size.
>>> So, the comparison is the way to tell if the guest wants to read
>>> actual (configured) BAR value or it tries to determine BAR's size.
>>> This is why I compare and use the result as the answer to what needs
>>> to be supplied to the guest. So, if I don't compare with 0xffffffff for the
>>> hi part and 0xfffffff0 for the low then how do I know when to return
>>> configured BAR or return the size?
>> Well, but that's the common misunderstanding that I've been trying
>> to point out: There's no difference between these two forms of
>> reads. The BARs are simply registers with some r/o bits. There's
>> no hidden 2nd register recording what was last written. When you
>> write 0xffffffff, all you do is set all writable bits to 1. When
>> you read back from the register, you will find all r/o bits
>> unchanged (which in particular means all lower address bits are
>> zero, thus allowing you to determine the size).
>>
>> When the spec says to write 0xffffffff for sizing purposes, OSes
>> should follow that, yes. This doesn't preclude them to use other
>> forms of writes for whichever purpose. Hence you do not want to
>> special case sizing, but instead you want to emulate correctly
>> all forms of writes, including subsequent reads to uniformly
>> return the intended / expected values.
>>
>> Just to give an example (perhaps a little contrived): To size a
>> 64-bit BAR, in principle you'd first need to write 0xffffffff to
>> both halves. But there's nothing wrong with doing this in a
>> different order: Act on the low half alone first, and then act
>> on the high half. The acting on the high half could even be
>> skipped if the low half sizing produced at least bit 31 set. Now
>> if you were to special case seeing ffffffff:fffffff? as the
>> last written pair of values, you'd break that (imo legitimate)
>> alternative process of sizing.
> 
> How about:

Yes, that's what I was after. Just one nit right away:

> static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>                              uint32_t val, void *data)
> {
>      struct vpci_bar *bar = data;
>      bool hi = false;
> 
>      if ( bar->type == VPCI_BAR_MEM64_HI )
>      {
>          ASSERT(reg > PCI_BASE_ADDRESS_0);
>          bar--;
>          hi = true;
>      }
>      else
>      {
>          val &= PCI_BASE_ADDRESS_MEM_MASK;
>          val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>                                             : PCI_BASE_ADDRESS_MEM_TYPE_64;
>          val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>      }
> 
>      bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>      bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
> 
>      bar->guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
> }
> 
> static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>                                 void *data)
> {
>      struct vpci_bar *bar = data;

const please.
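
To illustrate the model discussed above, here is a minimal standalone
sketch, not the actual Xen code. It assumes a 64-bit memory BAR only;
struct demo_bar, demo_bar_write() and demo_bar_read() are hypothetical
names, and the constants merely mirror the PCI memory BAR layout (bits 3:0
hold the read-only type and prefetch bits). The write path masks the value
to the BAR's size, and the read path just returns what was stored, so
sizing works without special-casing any written pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the Xen/PCI definitions. */
#define MEM_MASK      (~0x0fULL)   /* address bits of a memory BAR */
#define MEM_TYPE_64   0x04u        /* 64-bit memory BAR type bits */
#define MEM_PREFETCH  0x08u        /* prefetchable bit */

struct demo_bar {
    uint64_t size;        /* power of two, at least 16 bytes */
    uint64_t guest_addr;  /* guest view; low 4 bits hold the r/o bits */
    bool prefetchable;
};

/* Emulate a 32-bit write to the low (hi == false) or high half. */
static void demo_bar_write(struct demo_bar *bar, bool hi, uint32_t val)
{
    assert(bar->size >= 16 && !(bar->size & (bar->size - 1)));

    if ( !hi )
    {
        /* The low 4 bits are read-only: drop the guest's, set ours. */
        val &= (uint32_t)MEM_MASK;
        val |= MEM_TYPE_64;
        val |= bar->prefetchable ? MEM_PREFETCH : 0;
    }

    bar->guest_addr &= ~(0xffffffffULL << (hi ? 32 : 0));
    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);

    /* Address bits below the BAR size always read as zero. */
    bar->guest_addr &= ~(bar->size - 1) | ~MEM_MASK;
}

/* The read path has no sizing logic: it returns the stored half. */
static uint32_t demo_bar_read(const struct demo_bar *bar, bool hi)
{
    return (uint32_t)(bar->guest_addr >> (hi ? 32 : 0));
}
```

With a 1MiB BAR, writing 0xffffffff to the low half and reading it back
yields 0xfff00004 in this sketch, from which a guest derives the size in
the usual way, regardless of whether or when it touches the high half.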

Jan



* Re: [PATCH 6/9] vpci/header: Handle p2m range sets per BAR
  2021-09-08 14:31     ` Oleksandr Andrushchenko
@ 2021-09-08 15:00       ` Jan Beulich
  2021-09-09  5:22         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-08 15:00 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 08.09.2021 16:31, Oleksandr Andrushchenko wrote:
> 
> On 06.09.21 17:47, Jan Beulich wrote:
>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>
>>> Instead of handling a single range set, that contains all the memory
>>> regions of all the BARs and ROM, have them per BAR.
>> Without looking at how you carry out this change - this look wrong (as
>> in: wasteful) to me. Despite ...
>>
>>> This is in preparation of making non-identity mappings in p2m for the
>>> MMIOs/ROM.
>> ... the need for this, every individual BAR is still contiguous in both
>> host and guest address spaces, so can be represented as a single
>> (start,end) tuple (or a pair thereof, to account for both host and guest
>> values). No need to use a rangeset for this.
> 
> First of all this change is in preparation for non-identity mappings,

I'm afraid I continue to not see how this matters in the discussion at
hand. I'm fully aware that this is the goal.

> e.g. currently we collect all the memory ranges which require mappings
> into a single range set, then we cut off MSI-X regions and then use range set
> functionality to call a callback for every memory range left after MSI-X.
> This works perfectly fine for 1:1 mappings, e.g. what we have as the range
> set's starting address is what we want to be mapped/unmapped.
> Why range sets? Because they allow partial mappings, e.g. you can map part of
> the range and return back and continue from where you stopped. And if I
> understand that correctly that was the initial intention of introducing range sets here.
> 
> For non-identity mappings this becomes not that easy. Each individual BAR may be
> mapped differently according to what guest OS has programmed as bar->guest_addr
> (guest view of the BAR start).

I don't see how the rangeset helps here. You have a guest and a host pair
of values for every BAR. Pages with e.g. the MSI-X table may not be mapped
to their host counterpart address, yes, but you need to special-case
these anyway: Accesses to them need to be handled. Hence I'm having a hard
time seeing how a per-BAR rangeset (which will cover at most three distinct
ranges afaict, which is way too little for this kind of data organization
imo) can gain you all this much.

Overall the 6 BARs of a device will cover up to 8 non-adjacent ranges. IOW
the majority (4 or more) of the rangesets will indeed merely represent a
plain (start,end) pair (or be entirely empty).

> Thus we need to collect all those non-identity mappings
> per BAR now (so we have a mapping "guest view" : "physical BAR" and again cut off
> MSI-X regions as before.  So, yes, it may be a bit wasteful to use many range sets,
> but makes vPCI life much-much easier.

Which I'm yet to be convinced of. Then again I'm not the maintainer of
this code, so if you can convince Roger you'll be all good.

> Thus, I think that even per-BAR range sets are
> good to go as they have more pros than cons. IMO
> Even if we go with "can be represented as a single (start,end) tuple" it doesn't answer
> the question what needs to be done if a range gets partially mapped/unmapped.

This question also isn't answered when you use rangesets.

Jan



* Re: [PATCH 5/9] vpci/header: Implement guest BAR register handlers
  2021-09-08 14:46                   ` Jan Beulich
@ 2021-09-08 15:14                     ` Oleksandr Andrushchenko
  2021-09-08 15:29                       ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-08 15:14 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 08.09.21 17:46, Jan Beulich wrote:
> On 08.09.2021 15:33, Oleksandr Andrushchenko wrote:
>> On 08.09.21 13:03, Jan Beulich wrote:
>>> On 08.09.2021 11:43, Oleksandr Andrushchenko wrote:
>>>> On 08.09.21 12:27, Jan Beulich wrote:
>>>>> On 07.09.2021 19:39, Oleksandr Andrushchenko wrote:
>>>>>> On 07.09.21 19:30, Jan Beulich wrote:
>>>>>>> On 07.09.2021 15:33, Oleksandr Andrushchenko wrote:
>>>>>>>> On 06.09.21 17:31, Jan Beulich wrote:
>>>>>>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>>>>>> @@ -400,12 +400,72 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>>>       static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>>>>>                                   uint32_t val, void *data)
>>>>>>>>>>       {
>>>>>>>>>> +    struct vpci_bar *bar = data;
>>>>>>>>>> +    bool hi = false;
>>>>>>>>>> +
>>>>>>>>>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>>>>>>> +    {
>>>>>>>>>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>>>>>>> +        bar--;
>>>>>>>>>> +        hi = true;
>>>>>>>>>> +    }
>>>>>>>>>> +    else
>>>>>>>>>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>>>>>>>> +    bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>>>>>>>>>> +    bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>>>>>>>> What you store here is not the address that's going to be used, as
>>>>>>>>> you don't mask off the low bits (to account for the BAR's size).
>>>>>>>>> When a BAR gets written with all ones, all writable bits get these
>>>>>>>>> ones stored. The address of the BAR, aiui, really changes to
>>>>>>>>> (typically) close below 4Gb (in the case of a 32-bit BAR), which
>>>>>>>>> is why memory / I/O decoding should be off while sizing BARs.
>>>>>>>>> Therefore you shouldn't look for the specific "all writable bits
>>>>>>>>> are ones" pattern (or worse, as you presently do, the "all bits
>>>>>>>>> outside of the type specifier are ones" one) on the read path.
>>>>>>>>> Instead mask the value appropriately here, and simply return back
>>>>>>>>> the stored value from the read path.
>>>> But in case of BAR sizing I need to actually return BAR size.
>>>> So, the comparison is the way to tell if the guest wants to read
>>>> actual (configured) BAR value or it tries to determine BAR's size.
>>>> This is why I compare and use the result as the answer to what needs
>>>> to be supplied to the guest. So, if I don't compare with 0xffffffff for the
>>>> hi part and 0xfffffff0 for the low then how do I know when to return
>>>> configured BAR or return the size?
>>> Well, but that's the common misunderstanding that I've been trying
>>> to point out: There's no difference between these two forms of
>>> reads. The BARs are simply registers with some r/o bits. There's
>>> no hidden 2nd register recording what was last written. When you
>>> write 0xffffffff, all you do is set all writable bits to 1. When
>>> you read back from the register, you will find all r/o bits
>>> unchanged (which in particular means all lower address bits are
>>> zero, thus allowing you to determine the size).
>>>
>>> When the spec says to write 0xffffffff for sizing purposes, OSes
>>> should follow that, yes. This doesn't preclude them to use other
>>> forms of writes for whichever purpose. Hence you do not want to
>>> special case sizing, but instead you want to emulate correctly
>>> all forms of writes, including subsequent reads to uniformly
>>> return the intended / expected values.
>>>
>>> Just to give an example (perhaps a little contrived): To size a
>>> 64-bit BAR, in principle you'd first need to write 0xffffffff to
>>> both halves. But there's nothing wrong with doing this in a
>>> different order: Act on the low half alone first, and then act
>>> on the high half. The acting on the high half could even be
>>> skipped if the low half sizing produced at least bit 31 set. Now
>>> if you were to special case seeing ffffffff:fffffff? as the
>>> last written pair of values, you'd break that (imo legitimate)
>>> alternative process of sizing.
>> How about:
> Yes, that's what I was after. Just one nit right away:
>
>> static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>                               uint32_t val, void *data)
>> {
>>       struct vpci_bar *bar = data;
>>       bool hi = false;
>>
>>       if ( bar->type == VPCI_BAR_MEM64_HI )
>>       {
>>           ASSERT(reg > PCI_BASE_ADDRESS_0);
>>           bar--;
>>           hi = true;
>>       }
>>       else
>>       {
>>           val &= PCI_BASE_ADDRESS_MEM_MASK;
>>           val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>>                                              : PCI_BASE_ADDRESS_MEM_TYPE_64;
>>           val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>>       }
>>
>>       bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));

Do you think this needs to be 0xfffffffful rather than 0xffffffffull,
i.e. s/ull/ul?

>>       bar->guest_addr |= (uint64_t)val << (hi ? 32 : 0);
>>
>>       bar->guest_addr &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
>> }
>>
>> static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>>                                  void *data)
>> {
>>       struct vpci_bar *bar = data;
> const please.
Sure
>
> Jan
>
Thank you for helping with this!!

Oleksandr

* Re: [PATCH 5/9] vpci/header: Implement guest BAR register handlers
  2021-09-08 15:14                     ` Oleksandr Andrushchenko
@ 2021-09-08 15:29                       ` Jan Beulich
  2021-09-08 15:35                         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-08 15:29 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 08.09.2021 17:14, Oleksandr Andrushchenko wrote:
> On 08.09.21 17:46, Jan Beulich wrote:
>> On 08.09.2021 15:33, Oleksandr Andrushchenko wrote:
>>> static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>                               uint32_t val, void *data)
>>> {
>>>       struct vpci_bar *bar = data;
>>>       bool hi = false;
>>>
>>>       if ( bar->type == VPCI_BAR_MEM64_HI )
>>>       {
>>>           ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>           bar--;
>>>           hi = true;
>>>       }
>>>       else
>>>       {
>>>           val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>           val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>>>                                              : PCI_BASE_ADDRESS_MEM_TYPE_64;
>>>           val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>>>       }
>>>
>>>       bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
> 
> Do you think this needs to be 0xfffffffful, not 0xffffffffull?
> 
> e.g. s/ull/ul

If guest_addr is uint64_t then ull would seem more correct to me,
especially when considering (hypothetical?) 32-bit architectures
potentially wanting to use this code.
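
A small standalone sketch of the point, assuming guest_addr is a uint64_t
(set_half() is a hypothetical helper, not part of the patch):

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* unsigned long long is guaranteed to be at least 64 bits wide. */
static_assert(sizeof(unsigned long long) * CHAR_BIT >= 64,
              "unsigned long long must hold 64 bits");

/* Replace one 32-bit half of a 64-bit address. */
static uint64_t set_half(uint64_t addr, uint32_t val, bool hi)
{
    unsigned int shift = hi ? 32 : 0;

    /*
     * With the "ull" suffix the constant is at least 64 bits wide, so
     * shifting it by 32 is well defined everywhere.  With "ul" it would
     * be only 32 bits wide wherever unsigned long is, and a shift by 32
     * would then be undefined behaviour.
     */
    addr &= ~(0xffffffffull << shift);
    addr |= (uint64_t)val << shift;

    return addr;
}
```

On LP64 targets both suffixes happen to behave the same, so the difference
only shows up where unsigned long is 32 bits wide.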

Jan



* Re: [PATCH 5/9] vpci/header: Implement guest BAR register handlers
  2021-09-08 15:29                       ` Jan Beulich
@ 2021-09-08 15:35                         ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-08 15:35 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 08.09.21 18:29, Jan Beulich wrote:
> On 08.09.2021 17:14, Oleksandr Andrushchenko wrote:
>> On 08.09.21 17:46, Jan Beulich wrote:
>>> On 08.09.2021 15:33, Oleksandr Andrushchenko wrote:
>>>> static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>>>>                                uint32_t val, void *data)
>>>> {
>>>>        struct vpci_bar *bar = data;
>>>>        bool hi = false;
>>>>
>>>>        if ( bar->type == VPCI_BAR_MEM64_HI )
>>>>        {
>>>>            ASSERT(reg > PCI_BASE_ADDRESS_0);
>>>>            bar--;
>>>>            hi = true;
>>>>        }
>>>>        else
>>>>        {
>>>>            val &= PCI_BASE_ADDRESS_MEM_MASK;
>>>>            val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>>>>                                               : PCI_BASE_ADDRESS_MEM_TYPE_64;
>>>>            val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>>>>        }
>>>>
>>>>        bar->guest_addr &= ~(0xffffffffull << (hi ? 32 : 0));
>> Do you think this needs to be 0xfffffffful, not 0xffffffffull?
>>
>> e.g. s/ull/ul
> If guest_addr is uint64_t then ull would seem more correct to me,
> especially when considering (hypothetical?) 32-bit architectures
> potentially wanting to use this code.
Ok, then I'll keep ull
>
> Jan
>
Thank you,

Oleksandr

* Re: [PATCH 9/9] vpci/header: Use pdev's domain instead of vCPU
  2021-09-06 14:57   ` Jan Beulich
@ 2021-09-09  4:23     ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-09  4:23 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko, Rahul Singh
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, xen-devel


On 06.09.21 17:57, Jan Beulich wrote:
> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>> From: Rahul Singh <rahul.singh@arm.com>
>>
>> Fixes: 9c244fdef7e7 ("vpci: add header handlers")
> In which way is that original change broken?

After consulting with Arm we decided that this patch can be dropped.
If we run into an issue and it turns out to be needed, it will be
submitted separately.

> The title doesn't
> clarify this, and the description is empty ...
>
> Jan
>

* Re: [PATCH 6/9] vpci/header: Handle p2m range sets per BAR
  2021-09-08 15:00       ` Jan Beulich
@ 2021-09-09  5:22         ` Oleksandr Andrushchenko
  2021-09-09  8:24           ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-09  5:22 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 08.09.21 18:00, Jan Beulich wrote:
> On 08.09.2021 16:31, Oleksandr Andrushchenko wrote:
>> On 06.09.21 17:47, Jan Beulich wrote:
>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> Instead of handling a single range set, that contains all the memory
>>>> regions of all the BARs and ROM, have them per BAR.
>>> Without looking at how you carry out this change - this look wrong (as
>>> in: wasteful) to me. Despite ...
>>>
>>>> This is in preparation of making non-identity mappings in p2m for the
>>>> MMIOs/ROM.
>>> ... the need for this, every individual BAR is still contiguous in both
>>> host and guest address spaces, so can be represented as a single
>>> (start,end) tuple (or a pair thereof, to account for both host and guest
>>> values). No need to use a rangeset for this.
>> First of all this change is in preparation for non-identity mappings,
> I'm afraid I continue to not see how this matters in the discussion at
> hand. I'm fully aware that this is the goal.
>
>> e.g. currently we collect all the memory ranges which require mappings
>> into a single range set, then we cut off MSI-X regions and then use range set
>> functionality to call a callback for every memory range left after MSI-X.
>> This works perfectly fine for 1:1 mappings, e.g. what we have as the range
>> set's starting address is what we want to be mapped/unmapped.
>> Why range sets? Because they allow partial mappings, e.g. you can map part of
>> the range and return back and continue from where you stopped. And if I
>> understand that correctly that was the initial intention of introducing range sets here.
>>
>> For non-identity mappings this becomes not that easy. Each individual BAR may be
>> mapped differently according to what guest OS has programmed as bar->guest_addr
>> (guest view of the BAR start).
> I don't see how the rangeset helps here. You have a guest and a host pair
> of values for every BAR. Pages with e.g. the MSI-X table may not be mapped
> to their host counterpart address, yes, but you need to special cases
> these anyway: Accesses to them need to be handled. Hence I'm having a hard
> time seeing how a per-BAR rangeset (which will cover at most three distinct
> ranges afaict, which is way too little for this kind of data organization
> imo) can gain you all this much.
>
> Overall the 6 BARs of a device will cover up to 8 non-adjacent ranges. IOW
> the majority (4 or more) of the rangesets will indeed merely represent a
> plain (start,end) pair (or be entirely empty).
First of all, let me explain why I decided to move to per-BAR
range sets.
Before this change all the MMIO regions and MSI-X holes were
accounted by a single range set, e.g. we go over all BARs and
add MMIOs and then subtract MSI-X from there. When it comes to
mapping/unmapping we have an assumption that the starting address of
each element in the range set is equal to map/unmap address, e.g.
we have identity mapping. Please note, that the range set accepts
a single private data parameter which is enough to hold all
required data about the pdev in common, but there is no way to provide
any per-BAR data.

Now, that we want non-identity mappings, we can no longer assume
that starting address == mapping address and we need to provide
additional information on how to map and which is now per-BAR.
This is why I decided to use per-BAR range sets.

One of the solutions may be that we form an additional list of
structures in a form (I omit some of the fields):
struct non_identity {
     unsigned long start_mfn;
     unsigned long start_gfn;
     unsigned long size;
};
So this way when the range set gets processed we go over the list
and find out the corresponding list's element which describes the
range set entry being processed (s, e, data):

static int map_range(unsigned long s, unsigned long e, void *data,
                      unsigned long *c)
{
[snip]
     go over the list elements
         if ( list->start_mfn == s )
             found, can use list->start_gfn for mapping
[snip]
}
This has some complications as map_range may be called multiple times
for the same range: if {unmap|map}_mmio_regions was not able to complete
the operation it returns the number of pages it was able to process:
         rc = map->map ? map_mmio_regions(map->d, start_gfn,
                                          size, _mfn(s))
                       : unmap_mmio_regions(map->d, start_gfn,
                                            size, _mfn(s));
In this case we need to update the list item:
     list->start_mfn += rc;
     list->start_gfn += rc;
     list->size -= rc;
and if all the pages of the range were processed delete the list entry.

Creating the list is also not that complicated:
while processing each BAR create a list entry and fill it with mfn, gfn
and size. Then, if MSI-X region is present within this BAR, break the
list item into multiple ones with respect to the holes, for example:

MMIO 0 list item
MSI-X hole 0
MMIO 1 list item
MSI-X hole 1

Here instead of a single BAR description we now have 2 list elements
describing the BAR without MSI-X regions.

All the above still relies on a single range set per pdev as it is in the
original code. We can go this route if we agree this is more acceptable
than the range sets per BAR
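
The bookkeeping described above could be sketched roughly as follows. This
is only an illustration under assumptions: a plain array stands in for the
list, sizes are counted in pages, a size of 0 marks a fully processed
entry instead of deleting it, and ni_find() and ni_advance() are
hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

struct non_identity {
    unsigned long start_mfn;
    unsigned long start_gfn;
    unsigned long size;      /* in pages; 0 marks a consumed entry */
};

/* Find the entry describing the range that starts at machine frame s. */
static struct non_identity *ni_find(struct non_identity *list, size_t n,
                                    unsigned long s)
{
    for ( size_t i = 0; i < n; i++ )
        if ( list[i].size && list[i].start_mfn == s )
            return &list[i];

    return NULL;
}

/* Record that "done" pages of the entry have been mapped or unmapped. */
static void ni_advance(struct non_identity *ni, unsigned long done)
{
    assert(done <= ni->size);

    ni->start_mfn += done;
    ni->start_gfn += done;
    ni->size -= done;
}
```

After a partial operation the entry's start moves forward, so the next
map_range() call for the remainder (which starts at the new s) still finds
the matching gfn.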
>
>> Thus we need to collect all those non-identity mappings
>> per BAR now (so we have a mapping "guest view" : "physical BAR" and again cut off
>> MSI-X regions as before.  So, yes, it may be a bit wasteful to use many range sets,
>> but makes vPCI life much-much easier.
> Which I'm yet to be convinced of. Then again I'm not the maintainer of
> this code, so if you can convince Roger you'll be all good.

Per-BAR range sets look clearer to me and add relatively little code,
which seems to be good.

>
>> Thus, I think that even per-BAR range sets are
>> good to go as they have more pros than cons. IMO
>> Even if we go with "can be represented as a single (start,end) tuple" it doesn't answer
>> the question what needs to be done if a range gets partially mapped/unmapped.
> This question also isn't answered when you use rangesets.
bool vpci_process_pending(struct vcpu *v)
{

[snip]

             rc = rangeset_consume_ranges(bar->mem, map_range, &data);

             if ( rc == -ERESTART )
                 return true;

>
> Jan
>
Thank you,

Oleksandr

* Re: [PATCH 7/9] vpci/header: program p2m with guest BAR view
  2021-09-06 14:51   ` Jan Beulich
@ 2021-09-09  6:13     ` Oleksandr Andrushchenko
  2021-09-09  8:26       ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-09  6:13 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 06.09.21 17:51, Jan Beulich wrote:
> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>> @@ -37,12 +41,28 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>                        unsigned long *c)
>>   {
>>       const struct map_data *map = data;
>> +    gfn_t start_gfn;
>>       int rc;
>>   
>>       for ( ; ; )
>>       {
>>           unsigned long size = e - s + 1;
>>   
>> +        /*
>> +         * Any BAR may have holes in its memory we want to map, e.g.
>> +         * we don't want to map MSI regions which may be a part of that BAR,
>> +         * e.g. when a single BAR is used for both MMIO and MSI.
>> +         * In this case MSI regions are subtracted from the mapping, but
>> +         * map->start_gfn still points to the very beginning of the BAR.
>> +         * So if there is a hole present then we need to adjust start_gfn
>> +         * to reflect the fact of that substraction.
>> +         */
>> +        start_gfn = gfn_add(map->start_gfn, s - mfn_x(map->start_mfn));
> I may be missing something, but don't you need to adjust "size" then
> as well?

No, as range sets get consumed we have e and s updated accordingly,
so each time size represents the right value.
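
A small standalone sketch of that arithmetic, with plain integers standing
in for Xen's gfn_t/mfn_t (struct demo_map, range_gfn() and range_size()
are hypothetical names used only for illustration):

```c
#include <assert.h>

struct demo_map {
    unsigned long start_gfn;  /* guest frame of the BAR start */
    unsigned long start_mfn;  /* machine frame of the BAR start */
};

/* Guest frame for a range starting at machine frame s inside the BAR. */
static unsigned long range_gfn(const struct demo_map *map, unsigned long s)
{
    assert(s >= map->start_mfn);

    return map->start_gfn + (s - map->start_mfn);
}

/* Number of frames in the inclusive range [s, e]. */
static unsigned long range_size(unsigned long s, unsigned long e)
{
    assert(e >= s);

    return e - s + 1;
}
```

For example, take a BAR at mfn 0x1000..0x1007 seen by the guest at gfn
0x8000, with an MSI-X hole at 0x1004..0x1005. The range set then holds
[0x1000, 0x1003] and [0x1006, 0x1007]; the second range maps to gfn 0x8006
with a size of 2, without any extra adjustment of size.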

>   And don't you need to account for the "hole" not being at
> the start?

We only have MMIO ranges here and all the ranges have their start set
appropriately

>   (As an aside - do you mean "MSI-X regions" everywhere you
> say just "MSI" above?)
Yes, I mean MSI-X: will update
>
> Jan
>

* Re: [PATCH 6/9] vpci/header: Handle p2m range sets per BAR
  2021-09-09  5:22         ` Oleksandr Andrushchenko
@ 2021-09-09  8:24           ` Jan Beulich
  2021-09-09  9:12             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-09  8:24 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 09.09.2021 07:22, Oleksandr Andrushchenko wrote:
> 
> On 08.09.21 18:00, Jan Beulich wrote:
>> On 08.09.2021 16:31, Oleksandr Andrushchenko wrote:
>>> On 06.09.21 17:47, Jan Beulich wrote:
>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>
>>>>> Instead of handling a single range set, that contains all the memory
>>>>> regions of all the BARs and ROM, have them per BAR.
>>>> Without looking at how you carry out this change - this look wrong (as
>>>> in: wasteful) to me. Despite ...
>>>>
>>>>> This is in preparation of making non-identity mappings in p2m for the
>>>>> MMIOs/ROM.
>>>> ... the need for this, every individual BAR is still contiguous in both
>>>> host and guest address spaces, so can be represented as a single
>>>> (start,end) tuple (or a pair thereof, to account for both host and guest
>>>> values). No need to use a rangeset for this.
>>> First of all this change is in preparation for non-identity mappings,
>> I'm afraid I continue to not see how this matters in the discussion at
>> hand. I'm fully aware that this is the goal.
>>
>>> e.g. currently we collect all the memory ranges which require mappings
>>> into a single range set, then we cut off MSI-X regions and then use range set
>>> functionality to call a callback for every memory range left after MSI-X.
>>> This works perfectly fine for 1:1 mappings, e.g. what we have as the range
>>> set's starting address is what we want to be mapped/unmapped.
>>> Why range sets? Because they allow partial mappings, e.g. you can map part of
>>> the range and return back and continue from where you stopped. And if I
>>> understand that correctly that was the initial intention of introducing range sets here.
>>>
>>> For non-identity mappings this becomes not that easy. Each individual BAR may be
>>> mapped differently according to what guest OS has programmed as bar->guest_addr
>>> (guest view of the BAR start).
>> I don't see how the rangeset helps here. You have a guest and a host pair
>> of values for every BAR. Pages with e.g. the MSI-X table may not be mapped
>> to their host counterpart address, yes, but you need to special cases
>> these anyway: Accesses to them need to be handled. Hence I'm having a hard
>> time seeing how a per-BAR rangeset (which will cover at most three distinct
>> ranges afaict, which is way too little for this kind of data organization
>> imo) can gain you all this much.
>>
>> Overall the 6 BARs of a device will cover up to 8 non-adjacent ranges. IOW
>> the majority (4 or more) of the rangesets will indeed merely represent a
>> plain (start,end) pair (or be entirely empty).
> First of all, let me explain why I decided to move to per-BAR
> range sets.
> Before this change all the MMIO regions and MSI-X holes were
> accounted by a single range set, e.g. we go over all BARs and
> add MMIOs and then subtract MSI-X from there. When it comes to
> mapping/unmapping we have an assumption that the starting address of
> each element in the range set is equal to map/unmap address, e.g.
> we have identity mapping. Please note, that the range set accepts
> a single private data parameter which is enough to hold all
> required data about the pdev in common, but there is no way to provide
> any per-BAR data.
> 
> Now, that we want non-identity mappings, we can no longer assume
> that starting address == mapping address and we need to provide
> additional information on how to map and which is now per-BAR.
> This is why I decided to use per-BAR range sets.
> 
> One of the solutions may be that we form an additional list of
> structures in a form (I omit some of the fields):
> struct non_identity {
>      unsigned long start_mfn;
>      unsigned long start_gfn;
>      unsigned long size;
> };
> So this way when the range set gets processed we go over the list
> and find out the corresponding list's element which describes the
> range set entry being processed (s, e, data):
> 
> static int map_range(unsigned long s, unsigned long e, void *data,
>                       unsigned long *c)
> {
> [snip]
>      go over the list elements
>          if ( list->start_mfn == s )
>              found, can use list->start_gfn for mapping
> [snip]
> }
> This has some complications as map_range may be called multiple times
> for the same range: if {unmap|map}_mmio_regions was not able to complete
> the operation it returns the number of pages it was able to process:
>          rc = map->map ? map_mmio_regions(map->d, start_gfn,
>                                           size, _mfn(s))
>                        : unmap_mmio_regions(map->d, start_gfn,
>                                             size, _mfn(s));
> In this case we need to update the list item:
>      list->start_mfn += rc;
>      list->start_gfn += rc;
>      list->size -= rc;
> and if all the pages of the range were processed delete the list entry.
> 
> With respect to creating the list, everything is also not so complicated:
> while processing each BAR create a list entry and fill it with mfn, gfn
> and size. Then, if MSI-X region is present within this BAR, break the
> list item into multiple ones with respect to the holes, for example:
> 
> MMIO 0 list item
> MSI-X hole 0
> MMIO 1 list item
> MSI-X hole 1
> 
> Here instead of a single BAR description we now have 2 list elements
> describing the BAR without MSI-X regions.
> 
> All the above still relies on a single range set per pdev as it is in the
> original code. We can go this route if we agree this is more acceptable
> than the range sets per BAR

I guess I am now even more confused: I can't spot any "rangeset per pdev"
either. The rangeset I see being used doesn't get associated with anything
that's device-related; it gets accumulated as a transient data structure,
but _all_ devices owned by a domain influence its final content.

If you associate rangesets with either a device or a BAR, I'm failing to
see how you'd deal with multiple BARs living in the same page (see also
below).

Considering that a rangeset really is a compressed representation of a
bitmap, I wonder whether this data structure is suitable at all for what
you want to express. You have two pieces of information to carry / manage,
after all: Which ranges need mapping, and what their GFN <-> MFN
relationship is. Maybe the latter needs expressing differently in the
first place? And then in a way that's ensuring by its organization that
no conflicting GFN <-> MFN mappings will be possible? Isn't this
precisely what is already getting recorded in the P2M?

I'm also curious what your plan is to deal with BARs overlapping in MFN
space: In such a case, the guest cannot independently change the GFNs of
any of the involved BARs. (Same the other way around: overlaps in GFN
space are only permitted when the same overlap exists in MFN space.) Are
you excluding (forbidding) this case? If so, did I miss you saying so
somewhere? Yet if no overlaps are allowed in the first place, what
modify_bars() does would be far more complicated than necessary in the
DomU case, so it may be worthwhile considering to deviate more from how
Dom0 gets taken care of. In the end a guest writing a BAR is merely a
request to change its P2M. That's very different from Dom0 writing a BAR,
which means the physical BAR also changes, and hence the P2M changes in
quite different a way.
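
To illustrate the overlap constraint above, here is a minimal sketch, not
actual vPCI code. The names struct bar_map and bars_consistent() are
hypothetical, and addresses are plain frame numbers. Two BARs that overlap in
either MFN or GFN space are only consistent if both use the same GFN - MFN
delta, which in turn implies the same overlap exists in the other space:

```c
#include <stdbool.h>

/* Hypothetical per-BAR description: host frame, guest frame, page count. */
struct bar_map {
    unsigned long mfn;
    unsigned long gfn;
    unsigned long nr;
};

/* True if [a, a + an) and [b, b + bn) share at least one frame. */
static bool overlaps(unsigned long a, unsigned long an,
                     unsigned long b, unsigned long bn)
{
    return a < b + bn && b < a + an;
}

/*
 * Two BARs may coexist if they overlap in neither space, or if they
 * are shifted by the same amount, so that every shared host frame is
 * seen at exactly one guest frame.
 */
bool bars_consistent(const struct bar_map *a, const struct bar_map *b)
{
    bool m = overlaps(a->mfn, a->nr, b->mfn, b->nr);
    bool g = overlaps(a->gfn, a->nr, b->gfn, b->nr);

    if ( !m && !g )
        return true;

    /* Unsigned arithmetic keeps the delta comparison well defined. */
    return a->gfn - a->mfn == b->gfn - b->mfn;
}
```

A check of this kind would reject a guest BAR write that moves only one of two
BARs sharing host frames.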

Jan




* Re: [PATCH 7/9] vpci/header: program p2m with guest BAR view
  2021-09-09  6:13     ` Oleksandr Andrushchenko
@ 2021-09-09  8:26       ` Jan Beulich
  2021-09-09  9:16         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-09  8:26 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 09.09.2021 08:13, Oleksandr Andrushchenko wrote:
> 
> On 06.09.21 17:51, Jan Beulich wrote:
>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>> @@ -37,12 +41,28 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>>                        unsigned long *c)
>>>   {
>>>       const struct map_data *map = data;
>>> +    gfn_t start_gfn;
>>>       int rc;
>>>   
>>>       for ( ; ; )
>>>       {
>>>           unsigned long size = e - s + 1;
>>>   
>>> +        /*
>>> +         * Any BAR may have holes in its memory we want to map, e.g.
>>> +         * we don't want to map MSI regions which may be a part of that BAR,
>>> +         * e.g. when a single BAR is used for both MMIO and MSI.
>>> +         * In this case MSI regions are subtracted from the mapping, but
>>> +         * map->start_gfn still points to the very beginning of the BAR.
>>> +         * So if there is a hole present then we need to adjust start_gfn
>>> +         * to reflect the fact of that substraction.
>>> +         */
>>> +        start_gfn = gfn_add(map->start_gfn, s - mfn_x(map->start_mfn));
>> I may be missing something, but don't you need to adjust "size" then
>> as well?
> 
> No, as range sets get consumed we have e and s updated accordingly,
> so each time size represents the right value.

It feels like something's wrong with the rangeset construction then:
Either it represents _all_ holes (including degenerate ones at the
start or end of a range), or none of them.

Jan




* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-07 10:06                   ` Jan Beulich
@ 2021-09-09  8:39                     ` Oleksandr Andrushchenko
  2021-09-09  8:43                       ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-09  8:39 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 07.09.21 13:06, Jan Beulich wrote:
> On 07.09.2021 11:52, Oleksandr Andrushchenko wrote:
>> On 07.09.21 12:19, Jan Beulich wrote:
>>> On 07.09.2021 11:07, Oleksandr Andrushchenko wrote:
>>>> On 07.09.21 11:49, Jan Beulich wrote:
>>>>> On 07.09.2021 10:18, Oleksandr Andrushchenko wrote:
>>>>>> So, if we have a hidden PCI device which can be assigned to a guest and it is literally untouched
>>>>>> (not enabled in Dom0) then I think there will be no such reference as "host assigned values" as
>>>>>> most probably the command register will remain in its after reset state.
>>>>> What meaning of "hidden" do you imply here? Devices passed to
>>>>> pci_{hide,ro}_device() may not be assigned to guests ...
>>>> You are completely right here.
>>>>> For any other meaning of "hidden", even if the device is completely
>>>>> ignored by Dom0,
>>>> Dom0less is such a case when a device is assigned to the guest
>>>> without Dom0 at all?
>>> In this case it is entirely unclear to me what entity it is to have
>>> a global view on the PCI subsystem.
>>>
>>>>>     certain of the properties still cannot be allowed
>>>>> to be DomU-controlled.
>>>> The list is not that big, could you please name a few you think cannot
>>>> be controlled by a guest? I can think of PCI_COMMAND_SPECIAL(?),
>>>> PCI_COMMAND_INVALIDATE(?), PCI_COMMAND_PARITY, PCI_COMMAND_WAIT,
>>>> PCI_COMMAND_SERR, PCI_COMMAND_INTX_DISABLE which we may want to
>>>> be aligned with the "host reference" values, e.g. we only allow those bits
>>>> to be set as they are in Dom0.
>>> Well, you've compiled a list already, and I did say so before as well:
>>> Everything except I/O and memory decoding as well as bus mastering
>>> needs at least closely looking at. INTX_DISABLE, for example, is
>>> something I don't think a guest should be able to directly control.
>>> It may still be the case that the host permits it control, but then
>>> only indirectly, allowing the host to appropriately adjust its
>>> internals.
>>>
>>> Note that even for I/O and memory decoding as well as bus mastering
>>> it may be necessary to limit guest control: In case the host wants
>>> to disable any of these (perhaps transiently) despite the guest
>>> wanting them enabled.
>> Ok, so it is now clear that we need a yet another patch to add a proper
>> command register emulation. What is your preference: drop the current
>> patch, implement command register emulation and add a "reset patch"
>> after that or we can have the patch as is now, but I'll only reset IO/mem and bus
>> master bits, e.g. read the real value, mask the wanted bits and write back?
> Either order is fine with me as long as the result will be claimed to
> be complete until proper emulation is in place.
I tried to see what others do in order to emulate the PCI_COMMAND register
and it seems that at most they care about the INTX bit only (besides
IO/memory enable and bus master, which are write-through). Please see
[1] and [2]. I may be missing something, but it could be because in order
to properly emulate the COMMAND register we need to know about the
whole PCI topology, e.g. whether any setting in the device's command register
is aligned with the upstream port etc. This makes me think that because
of this complexity others just ignore that. Nor do I think this can be
easily done in our case. So I would suggest we just add the following
simple logic to only emulate PCI_COMMAND_INTX_DISABLE: allow the guest to
disable the interrupts, but don't allow it to enable them if the host has
disabled them. This could also be a bit tricky for devices which are not
enabled and thus not configured in Dom0, e.g. we do not know for sure
whether the value in the PCI_COMMAND register (in particular the
PCI_COMMAND_INTX_DISABLE bit) can be used as the reference host value or
not. It may be that the value there is just the one after reset or so.
The rest of the command register bits will go directly to the command
register untouched.
So, at the end of the day the question is whether PCI_COMMAND_INTX_DISABLE
is enough and how to get its reference host value.
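
To make the proposal concrete, a minimal sketch of that rule could look as
follows. The helper name guest_cmd_to_hw() and the host_ref parameter are
hypothetical and not existing vPCI code; 0x400 is bit 10 of the command
register, the usual value of PCI_COMMAND_INTX_DISABLE:

```c
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x400 /* bit 10: INTx emulation disable */

/*
 * Compute the value to write to the physical command register.
 * The guest may set INTX_DISABLE on its own, but it cannot clear the
 * bit while the host reference value has it set. All other bits are
 * passed through untouched, as described above.
 */
uint16_t guest_cmd_to_hw(uint16_t guest_val, uint16_t host_ref)
{
    return guest_val | (host_ref & PCI_COMMAND_INTX_DISABLE);
}
```

How host_ref is obtained for a device that Dom0 never enabled is exactly the
open question; the sketch simply takes it as an input.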

> Jan

Thank you,

Oleksandr

[1] https://github.com/qemu/qemu/blob/master/hw/xen/xen_pt_config_init.c#L310
[2] https://github.com/projectacrn/acrn-hypervisor/blob/master/hypervisor/hw/pci.c#L336


* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-09  8:39                     ` Oleksandr Andrushchenko
@ 2021-09-09  8:43                       ` Jan Beulich
  2021-09-09  8:50                         ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-09  8:43 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 09.09.2021 10:39, Oleksandr Andrushchenko wrote:
> 
> On 07.09.21 13:06, Jan Beulich wrote:
>> On 07.09.2021 11:52, Oleksandr Andrushchenko wrote:
>>> On 07.09.21 12:19, Jan Beulich wrote:
>>>> On 07.09.2021 11:07, Oleksandr Andrushchenko wrote:
>>>>> On 07.09.21 11:49, Jan Beulich wrote:
>>>>>> On 07.09.2021 10:18, Oleksandr Andrushchenko wrote:
>>>>>>> So, if we have a hidden PCI device which can be assigned to a guest and it is literally untouched
>>>>>>> (not enabled in Dom0) then I think there will be no such reference as "host assigned values" as
>>>>>>> most probably the command register will remain in its after reset state.
>>>>>> What meaning of "hidden" do you imply here? Devices passed to
>>>>>> pci_{hide,ro}_device() may not be assigned to guests ...
>>>>> You are completely right here.
>>>>>> For any other meaning of "hidden", even if the device is completely
>>>>>> ignored by Dom0,
>>>>> Dom0less is such a case when a device is assigned to the guest
>>>>> without Dom0 at all?
>>>> In this case it is entirely unclear to me what entity it is to have
>>>> a global view on the PCI subsystem.
>>>>
>>>>>>     certain of the properties still cannot be allowed
>>>>>> to be DomU-controlled.
>>>>> The list is not that big, could you please name a few you think cannot
>>>>> be controlled by a guest? I can think of PCI_COMMAND_SPECIAL(?),
>>>>> PCI_COMMAND_INVALIDATE(?), PCI_COMMAND_PARITY, PCI_COMMAND_WAIT,
>>>>> PCI_COMMAND_SERR, PCI_COMMAND_INTX_DISABLE which we may want to
>>>>> be aligned with the "host reference" values, e.g. we only allow those bits
>>>>> to be set as they are in Dom0.
>>>> Well, you've compiled a list already, and I did say so before as well:
>>>> Everything except I/O and memory decoding as well as bus mastering
>>>> needs at least closely looking at. INTX_DISABLE, for example, is
>>>> something I don't think a guest should be able to directly control.
>>>> It may still be the case that the host permits it control, but then
>>>> only indirectly, allowing the host to appropriately adjust its
>>>> internals.
>>>>
>>>> Note that even for I/O and memory decoding as well as bus mastering
>>>> it may be necessary to limit guest control: In case the host wants
>>>> to disable any of these (perhaps transiently) despite the guest
>>>> wanting them enabled.
>>> Ok, so it is now clear that we need a yet another patch to add a proper
>>> command register emulation. What is your preference: drop the current
>>> patch, implement command register emulation and add a "reset patch"
>>> after that or we can have the patch as is now, but I'll only reset IO/mem and bus
>>> master bits, e.g. read the real value, mask the wanted bits and write back?
>> Either order is fine with me as long as the result will be claimed to
>> be complete until proper emulation is in place.
> I tried to see what others do in order to emulate PCI_COMMAND register
> and it seems that at most they care about the only INTX bit (besides
> IO/memory enable and bus master which are write through). Please see
> [1] and [2]. Probably I miss something, but it could be because in order
> to properly emulate the COMMAND register we need to know about the
> whole PCI topology, e.g. if any setting in device's command register
> is aligned with the upstream port etc. This makes me think that because
> of this complexity others just ignore that. Neither I think this can be
> easily done in our case. So I would suggest we just add the following
> simple logic to only emulate PCI_COMMAND_INTX_DISABLE: allow guest to
> disable the interrupts, but don't allow to enable if host has disabled
> them. This is also could be tricky a bit for the devices which are not
> enabled and thus not configured in Dom0, e.g. we do not know for sure
> if the value in the PCI_COMMAND register (in particular
> PCI_COMMAND_INTX_DISABLE bit) can be used as the reference host value or
> not. It can be that the value there is just the one after reset or so.
> The rest of the command register bits will go directly to the command
> register untouched.
> So, at the end of the day the question is if PCI_COMMAND_INTX_DISABLE
> is enough and how to get its reference host value.

Well, in order for the whole thing to be security supported it needs to
be explained for every bit why it is safe to allow the guest to drive it.
Until you mean vPCI to reach that state, leaving TODO notes in the code
for anything not investigated may indeed be good enough.

Jan




* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-09  8:43                       ` Jan Beulich
@ 2021-09-09  8:50                         ` Oleksandr Andrushchenko
  2021-09-09  9:21                           ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-09  8:50 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 09.09.21 11:43, Jan Beulich wrote:
> On 09.09.2021 10:39, Oleksandr Andrushchenko wrote:
>> On 07.09.21 13:06, Jan Beulich wrote:
>>> On 07.09.2021 11:52, Oleksandr Andrushchenko wrote:
>>>> On 07.09.21 12:19, Jan Beulich wrote:
>>>>> On 07.09.2021 11:07, Oleksandr Andrushchenko wrote:
>>>>>> On 07.09.21 11:49, Jan Beulich wrote:
>>>>>>> On 07.09.2021 10:18, Oleksandr Andrushchenko wrote:
>>>>>>>> So, if we have a hidden PCI device which can be assigned to a guest and it is literally untouched
>>>>>>>> (not enabled in Dom0) then I think there will be no such reference as "host assigned values" as
>>>>>>>> most probably the command register will remain in its after reset state.
>>>>>>> What meaning of "hidden" do you imply here? Devices passed to
>>>>>>> pci_{hide,ro}_device() may not be assigned to guests ...
>>>>>> You are completely right here.
>>>>>>> For any other meaning of "hidden", even if the device is completely
>>>>>>> ignored by Dom0,
>>>>>> Dom0less is such a case when a device is assigned to the guest
>>>>>> without Dom0 at all?
>>>>> In this case it is entirely unclear to me what entity it is to have
>>>>> a global view on the PCI subsystem.
>>>>>
>>>>>>>      certain of the properties still cannot be allowed
>>>>>>> to be DomU-controlled.
>>>>>> The list is not that big, could you please name a few you think cannot
>>>>>> be controlled by a guest? I can think of PCI_COMMAND_SPECIAL(?),
>>>>>> PCI_COMMAND_INVALIDATE(?), PCI_COMMAND_PARITY, PCI_COMMAND_WAIT,
>>>>>> PCI_COMMAND_SERR, PCI_COMMAND_INTX_DISABLE which we may want to
>>>>>> be aligned with the "host reference" values, e.g. we only allow those bits
>>>>>> to be set as they are in Dom0.
>>>>> Well, you've compiled a list already, and I did say so before as well:
>>>>> Everything except I/O and memory decoding as well as bus mastering
>>>>> needs at least closely looking at. INTX_DISABLE, for example, is
>>>>> something I don't think a guest should be able to directly control.
>>>>> It may still be the case that the host permits it control, but then
>>>>> only indirectly, allowing the host to appropriately adjust its
>>>>> internals.
>>>>>
>>>>> Note that even for I/O and memory decoding as well as bus mastering
>>>>> it may be necessary to limit guest control: In case the host wants
>>>>> to disable any of these (perhaps transiently) despite the guest
>>>>> wanting them enabled.
>>>> Ok, so it is now clear that we need a yet another patch to add a proper
>>>> command register emulation. What is your preference: drop the current
>>>> patch, implement command register emulation and add a "reset patch"
>>>> after that or we can have the patch as is now, but I'll only reset IO/mem and bus
>>>> master bits, e.g. read the real value, mask the wanted bits and write back?
>>> Either order is fine with me as long as the result will be claimed to
>>> be complete until proper emulation is in place.
>> I tried to see what others do in order to emulate PCI_COMMAND register
>> and it seems that at most they care about the only INTX bit (besides
>> IO/memory enable and bus master which are write through). Please see
>> [1] and [2]. Probably I miss something, but it could be because in order
>> to properly emulate the COMMAND register we need to know about the
>> whole PCI topology, e.g. if any setting in device's command register
>> is aligned with the upstream port etc. This makes me think that because
>> of this complexity others just ignore that. Neither I think this can be
>> easily done in our case. So I would suggest we just add the following
>> simple logic to only emulate PCI_COMMAND_INTX_DISABLE: allow guest to
>> disable the interrupts, but don't allow to enable if host has disabled
>> them. This is also could be tricky a bit for the devices which are not
>> enabled and thus not configured in Dom0, e.g. we do not know for sure
>> if the value in the PCI_COMMAND register (in particular
>> PCI_COMMAND_INTX_DISABLE bit) can be used as the reference host value or
>> not. It can be that the value there is just the one after reset or so.
>> The rest of the command register bits will go directly to the command
>> register untouched.
>> So, at the end of the day the question is if PCI_COMMAND_INTX_DISABLE
>> is enough and how to get its reference host value.
> Well, in order for the whole thing to be security supported it needs to
> be explained for every bit why it is safe to allow the guest to drive it.

So, do we want at least the PCI_COMMAND_INTX_DISABLE bit aligned
between the host and guest? If so, what do you think about
the reference value for it (please see above)?

> Until you mean vPCI to reach that state, leaving TODO notes in the code
> for anything not investigated may indeed be good enough.
Ok, I'll add TODO then.
>
> Jan
>
Thank you,

Oleksandr


* Re: [PATCH 6/9] vpci/header: Handle p2m range sets per BAR
  2021-09-09  8:24           ` Jan Beulich
@ 2021-09-09  9:12             ` Oleksandr Andrushchenko
  2021-09-09  9:39               ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-09  9:12 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 09.09.21 11:24, Jan Beulich wrote:
> On 09.09.2021 07:22, Oleksandr Andrushchenko wrote:
>> On 08.09.21 18:00, Jan Beulich wrote:
>>> On 08.09.2021 16:31, Oleksandr Andrushchenko wrote:
>>>> On 06.09.21 17:47, Jan Beulich wrote:
>>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>
>>>>>> Instead of handling a single range set, that contains all the memory
>>>>>> regions of all the BARs and ROM, have them per BAR.
>>>>> Without looking at how you carry out this change - this look wrong (as
>>>>> in: wasteful) to me. Despite ...
>>>>>
>>>>>> This is in preparation of making non-identity mappings in p2m for the
>>>>>> MMIOs/ROM.
>>>>> ... the need for this, every individual BAR is still contiguous in both
>>>>> host and guest address spaces, so can be represented as a single
>>>>> (start,end) tuple (or a pair thereof, to account for both host and guest
>>>>> values). No need to use a rangeset for this.
>>>> First of all this change is in preparation for non-identity mappings,
>>> I'm afraid I continue to not see how this matters in the discussion at
>>> hand. I'm fully aware that this is the goal.
>>>
>>>> e.g. currently we collect all the memory ranges which require mappings
>>>> into a single range set, then we cut off MSI-X regions and then use range set
>>>> functionality to call a callback for every memory range left after MSI-X.
>>>> This works perfectly fine for 1:1 mappings, e.g. what we have as the range
>>>> set's starting address is what we want to be mapped/unmapped.
>>>> Why range sets? Because they allow partial mappings, e.g. you can map part of
>>>> the range and return back and continue from where you stopped. And if I
>>>> understand that correctly that was the initial intention of introducing range sets here.
>>>>
>>>> For non-identity mappings this becomes not that easy. Each individual BAR may be
>>>> mapped differently according to what guest OS has programmed as bar->guest_addr
>>>> (guest view of the BAR start).
>>> I don't see how the rangeset helps here. You have a guest and a host pair
>>> of values for every BAR. Pages with e.g. the MSI-X table may not be mapped
>>> to their host counterpart address, yes, but you need to special-case
>>> these anyway: Accesses to them need to be handled. Hence I'm having a hard
>>> time seeing how a per-BAR rangeset (which will cover at most three distinct
>>> ranges afaict, which is way too little for this kind of data organization
>>> imo) can gain you all this much.
>>>
>>> Overall the 6 BARs of a device will cover up to 8 non-adjacent ranges. IOW
>>> the majority (4 or more) of the rangesets will indeed merely represent a
>>> plain (start,end) pair (or be entirely empty).
>> First of all, let me explain why I decided to move to per-BAR
>> range sets.
>> Before this change all the MMIO regions and MSI-X holes were
>> accounted by a single range set, e.g. we go over all BARs and
>> add MMIOs and then subtract MSI-X from there. When it comes to
>> mapping/unmapping we have an assumption that the starting address of
>> each element in the range set is equal to map/unmap address, e.g.
>> we have identity mapping. Please note, that the range set accepts
>> a single private data parameter which is enough to hold all
>> required data about the pdev in common, but there is no way to provide
>> any per-BAR data.
>>
>> Now, that we want non-identity mappings, we can no longer assume
>> that starting address == mapping address and we need to provide
>> additional information on how to map and which is now per-BAR.
>> This is why I decided to use per-BAR range sets.
>>
>> One of the solutions may be that we form an additional list of
>> structures in a form (I omit some of the fields):
>> struct non_identity {
>>       unsigned long start_mfn;
>>       unsigned long start_gfn;
>>       unsigned long size;
>> };
>> So this way when the range set gets processed we go over the list
>> and find out the corresponding list's element which describes the
>> range set entry being processed (s, e, data):
>>
>> static int map_range(unsigned long s, unsigned long e, void *data,
>>                        unsigned long *c)
>> {
>> [snip]
>>       go over the list elements
>>           if ( list->start_mfn == s )
>>               found, can use list->start_gfn for mapping
>> [snip]
>> }
>> This has some complications as map_range may be called multiple times
>> for the same range: if {unmap|map}_mmio_regions was not able to complete
>> the operation it returns the number of pages it was able to process:
>>           rc = map->map ? map_mmio_regions(map->d, start_gfn,
>>                                            size, _mfn(s))
>>                         : unmap_mmio_regions(map->d, start_gfn,
>>                                              size, _mfn(s));
>> In this case we need to update the list item:
>>       list->start_mfn += rc;
>>       list->start_gfn += rc;
>>       list->size -= rc;
>> and if all the pages of the range were processed delete the list entry.
>>
>> With respect to creating the list, everything is also not so complicated:
>> while processing each BAR create a list entry and fill it with mfn, gfn
>> and size. Then, if MSI-X region is present within this BAR, break the
>> list item into multiple ones with respect to the holes, for example:
>>
>> MMIO 0 list item
>> MSI-X hole 0
>> MMIO 1 list item
>> MSI-X hole 1
>>
>> Here instead of a single BAR description we now have 2 list elements
>> describing the BAR without MSI-X regions.
>>
>> All the above still relies on a single range set per pdev as it is in the
>> original code. We can go this route if we agree this is more acceptable
>> than the range sets per BAR
> I guess I am now even more confused: I can't spot any "rangeset per pdev"
> either. The rangeset I see being used doesn't get associated with anything
> that's device-related; it gets accumulated as a transient data structure,
> but _all_ devices owned by a domain influence its final content.

You are absolutely right here, sorry for the confusion: in the current
code the range set belongs to struct vpci_vcpu, e.g.

/* Per-vcpu structure to store state while {un}mapping of PCI BARs. */

>
> If you associate rangesets with either a device or a BAR, I'm failing to
> see how you'd deal with multiple BARs living in the same page (see also
> below).

This was exactly the issue I ran into while emulating RTL8139 on QEMU:
the MMIOs are 128 bytes long and Linux put them on the same page.
So, it is a known limitation that we can't deal with [1].
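
For illustration, detecting this situation could look like the sketch below.
The helper name bars_share_page() is hypothetical, and PAGE_SHIFT is assumed
to be 12 (4K pages); sizes must be non-zero:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4K pages */

/*
 * Return true if two MMIO regions, given as byte address and non-zero
 * byte size, touch at least one common page frame.
 */
bool bars_share_page(uint64_t a_addr, uint64_t a_size,
                     uint64_t b_addr, uint64_t b_size)
{
    uint64_t a_first = a_addr >> PAGE_SHIFT;
    uint64_t a_last = (a_addr + a_size - 1) >> PAGE_SHIFT;
    uint64_t b_first = b_addr >> PAGE_SHIFT;
    uint64_t b_last = (b_addr + b_size - 1) >> PAGE_SHIFT;

    return a_first <= b_last && b_first <= a_last;
}
```

With two 128-byte regions placed back to back, e.g. at the made-up addresses
0xfe000000 and 0xfe000080, the check reports a shared page.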

>
> Considering that a rangeset really is a compressed representation of a
> bitmap, I wonder whether this data structure is suitable at all for what
> you want to express. You have two pieces of information to carry / manage,
> after all: Which ranges need mapping, and what their GFN <-> MFN
> relationship is. Maybe the latter needs expressing differently in the
> first place?

I proposed a list which can be extended to hold all the required information
there, e.g. MFN, GFN, size etc.
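
A rough sketch of how such entries could be looked up and updated after a
partial {un}map is given below. It uses a plain array instead of a real list
for brevity, and the helper names find_entry() and consume() are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>

/* Entry as described earlier in the thread (some fields omitted). */
struct non_identity {
    unsigned long start_mfn;
    unsigned long start_gfn;
    unsigned long size; /* in pages; 0 means the entry is used up */
};

/* Find the live entry whose remaining part starts at MFN s. */
struct non_identity *find_entry(struct non_identity *list, size_t nr,
                                unsigned long s)
{
    for ( size_t i = 0; i < nr; i++ )
        if ( list[i].size && list[i].start_mfn == s )
            return &list[i];

    return NULL;
}

/*
 * Account for rc pages having been processed, so that a later call
 * for the same range continues from where the previous one stopped.
 * Returns true once the entry is fully consumed and can be deleted.
 */
bool consume(struct non_identity *e, unsigned long rc)
{
    if ( rc > e->size )
        rc = e->size;

    e->start_mfn += rc;
    e->start_gfn += rc;
    e->size -= rc;

    return e->size == 0;
}
```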

>   And then in a way that's ensuring by its organization that
> no conflicting GFN <-> MFN mappings will be possible?

If you mean the use-case above with different device MMIOs living
in the same page then my understanding is that such a use-case is
not supported [1].

>   Isn't this
> precisely what is already getting recorded in the P2M?
>
> I'm also curious what your plan is to deal with BARs overlapping in MFN
> space: In such a case, the guest cannot independently change the GFNs of
> any of the involved BARs. (Same the other way around: overlaps in GFN
> space are only permitted when the same overlap exists in MFN space.) Are
> you excluding (forbidding) this case? If so, did I miss you saying so
> somewhere?
Again [1]
>   Yet if no overlaps are allowed in the first place, what
> modify_bars() does would be far more complicated than necessary in the
> DomU case, so it may be worthwhile considering to deviate more from how
> Dom0 gets taken care of. In the end a guest writing a BAR is merely a
> request to change its P2M. That's very different from Dom0 writing a BAR,
> which means the physical BAR also changes, and hence the P2M changes in
> quite different a way.

So, what is the difference then, besides that hwdom really writes to a BAR?
To me most of the logic remains the same: we need to map/unmap.
The only difference I see here is that for Dom0 we have 1:1 at the moment
and for a guest we need GFN <-> MFN.

Anyway, I am open to any decision on what would be the right approach here:

1. Use range sets per BAR as in the patch
2. Remove range sets completely and have a per-vCPU list with mapping
   data as I described above
3. Anything else?

>
> Jan

Thank you,

Oleksandr

[1] https://wiki.xenproject.org/wiki/Xen_PCI_Passthrough#I_get_.22non-page-aligned_MMIO_BAR.22_error_when_trying_to_start_the_guest


* Re: [PATCH 7/9] vpci/header: program p2m with guest BAR view
  2021-09-09  8:26       ` Jan Beulich
@ 2021-09-09  9:16         ` Oleksandr Andrushchenko
  2021-09-09  9:40           ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-09  9:16 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 09.09.21 11:26, Jan Beulich wrote:
> On 09.09.2021 08:13, Oleksandr Andrushchenko wrote:
>> On 06.09.21 17:51, Jan Beulich wrote:
>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>> @@ -37,12 +41,28 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>>>                         unsigned long *c)
>>>>    {
>>>>        const struct map_data *map = data;
>>>> +    gfn_t start_gfn;
>>>>        int rc;
>>>>    
>>>>        for ( ; ; )
>>>>        {
>>>>            unsigned long size = e - s + 1;
>>>>    
>>>> +        /*
>>>> +         * Any BAR may have holes in its memory we want to map, e.g.
>>>> +         * we don't want to map MSI regions which may be a part of that BAR,
>>>> +         * e.g. when a single BAR is used for both MMIO and MSI.
>>>> +         * In this case MSI regions are subtracted from the mapping, but
>>>> +         * map->start_gfn still points to the very beginning of the BAR.
>>>> +         * So if there is a hole present then we need to adjust start_gfn
>>>> +         * to reflect the fact of that subtraction.
>>>> +         */
>>>> +        start_gfn = gfn_add(map->start_gfn, s - mfn_x(map->start_mfn));
>>> I may be missing something, but don't you need to adjust "size" then
>>> as well?
>> No, as range sets get consumed we have e and s updated accordingly,
>> so each time size represents the right value.
> It feels like something's wrong with the rangeset construction then:
> Either it represents _all_ holes (including degenerate ones at the
> start or end of a range), or none of them.

The resulting range set only has the MMIOs in it. While constructing the range set
we cut the MSI-X regions out of it (make holes). But in the end it only has the
ranges that we need to map/unmap.

>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-09  8:50                         ` Oleksandr Andrushchenko
@ 2021-09-09  9:21                           ` Jan Beulich
  2021-09-09 11:48                             ` Oleksandr Andrushchenko
  2021-09-09 11:48                             ` Oleksandr Andrushchenko
  0 siblings, 2 replies; 75+ messages in thread
From: Jan Beulich @ 2021-09-09  9:21 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 09.09.2021 10:50, Oleksandr Andrushchenko wrote:
> 
> On 09.09.21 11:43, Jan Beulich wrote:
>> On 09.09.2021 10:39, Oleksandr Andrushchenko wrote:
>>> On 07.09.21 13:06, Jan Beulich wrote:
>>>> On 07.09.2021 11:52, Oleksandr Andrushchenko wrote:
>>>>> On 07.09.21 12:19, Jan Beulich wrote:
>>>>>> On 07.09.2021 11:07, Oleksandr Andrushchenko wrote:
>>>>>>> On 07.09.21 11:49, Jan Beulich wrote:
>>>>>>>> On 07.09.2021 10:18, Oleksandr Andrushchenko wrote:
>>>>>>>>> So, if we have a hidden PCI device which can be assigned to a guest and it is literally untouched
>>>>>>>>> (not enabled in Dom0) then I think there will be no such reference as "host assigned values" as
>>>>>>>>> most probably the command register will remain in its after reset state.
>>>>>>>> What meaning of "hidden" do you imply here? Devices passed to
>>>>>>>> pci_{hide,ro}_device() may not be assigned to guests ...
>>>>>>> You are completely right here.
>>>>>>>> For any other meaning of "hidden", even if the device is completely
>>>>>>>> ignored by Dom0,
>>>>>>> Dom0less is such a case when a device is assigned to the guest
>>>>>>> without Dom0 at all?
>>>>>> In this case it is entirely unclear to me what entity it is to have
>>>>>> a global view on the PCI subsystem.
>>>>>>
>>>>>>>>      certain of the properties still cannot be allowed
>>>>>>>> to be DomU-controlled.
>>>>>>> The list is not that big, could you please name a few you think cannot
>>>>>>> be controlled by a guest? I can think of PCI_COMMAND_SPECIAL(?),
>>>>>>> PCI_COMMAND_INVALIDATE(?), PCI_COMMAND_PARITY, PCI_COMMAND_WAIT,
>>>>>>> PCI_COMMAND_SERR, PCI_COMMAND_INTX_DISABLE which we may want to
>>>>>>> be aligned with the "host reference" values, e.g. we only allow those bits
>>>>>>> to be set as they are in Dom0.
>>>>>> Well, you've compiled a list already, and I did say so before as well:
>>>>>> Everything except I/O and memory decoding as well as bus mastering
>>>>>> needs at least closely looking at. INTX_DISABLE, for example, is
>>>>>> something I don't think a guest should be able to directly control.
>>>>>> It may still be the case that the host permits it control, but then
>>>>>> only indirectly, allowing the host to appropriately adjust its
>>>>>> internals.
>>>>>>
>>>>>> Note that even for I/O and memory decoding as well as bus mastering
>>>>>> it may be necessary to limit guest control: In case the host wants
>>>>>> to disable any of these (perhaps transiently) despite the guest
>>>>>> wanting them enabled.
>>>>> Ok, so it is now clear that we need a yet another patch to add a proper
>>>>> command register emulation. What is your preference: drop the current
>>>>> patch, implement command register emulation and add a "reset patch"
>>>>> after that or we can have the patch as is now, but I'll only reset IO/mem and bus
>>>>> master bits, e.g. read the real value, mask the wanted bits and write back?
>>>> Either order is fine with me as long as the result will be claimed to
>>>> be complete until proper emulation is in place.
>>> I tried to see what others do in order to emulate PCI_COMMAND register
>>> and it seems that at most they care about the only INTX bit (besides
>>> IO/memory enable and bus master which are write through). Please see
>>> [1] and [2]. Probably I miss something, but it could be because in order
>>> to properly emulate the COMMAND register we need to know about the
>>> whole PCI topology, e.g. if any setting in device's command register
>>> is aligned with the upstream port etc. This makes me think that because
>>> of this complexity others just ignore that. Neither I think this can be
>>> easily done in our case. So I would suggest we just add the following
>>> simple logic to only emulate PCI_COMMAND_INTX_DISABLE: allow guest to
>>> disable the interrupts, but don't allow to enable if host has disabled
>>> them. This is also could be tricky a bit for the devices which are not
>>> enabled and thus not configured in Dom0, e.g. we do not know for sure
>>> if the value in the PCI_COMMAND register (in particular
>>> PCI_COMMAND_INTX_DISABLE bit) can be used as the reference host value or
>>> not. It can be that the value there is just the one after reset or so.
>>> The rest of the command register bits will go directly to the command
>>> register untouched.
>>> So, at the end of the day the question is if PCI_COMMAND_INTX_DISABLE
>>> is enough and how to get its reference host value.
>> Well, in order for the whole thing to be security supported it needs to
>> be explained for every bit why it is safe to allow the guest to drive it.
> 
> So, do we want at least PCI_COMMAND_INTX_DISABLE bit aligned
> between the host and guest? If so, what do you think about
> the reference value for it (please see above).

Please may I ask that you come up with a proposal? I don't think I've
said you need to emulate this or any of the other bits. All I've asked
for is that for every bit you allow the guest to control directly, you
justify why that's safe and secure. If no justification can be given,
emulation is going to be necessary. How to solve that is first and
foremost part of your undertaking.

For the bit in question, where the goal appears to be to have hardware
hold the OR of guest and host values, an approach similar to that used
for some of the MSI / MSI-X bits might be chosen: Maintain guest and
host bits in software, and update hardware (at least) when the
effective resulting value changes. A complicating fact here is, though,
that unlike for the MSI / MSI-X bits here Dom0 (pciback or its PCI
subsystem) may also have a view on what the setting ought to be.
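
To illustrate the idea, here is a minimal standalone sketch of that
bookkeeping. It is not code from the series or from Xen: struct intx_state
and the intx_*() helpers are hypothetical names, and a plain field stands in
for the physical command register. Only the bit value 0x400 for
PCI_COMMAND_INTX_DISABLE is taken from the PCI spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Interrupt Disable is bit 10 of the PCI command register. */
#define PCI_COMMAND_INTX_DISABLE 0x0400u

/* Hypothetical per-device bookkeeping, for illustration only. */
struct intx_state {
    bool host_disable;      /* what the host wants */
    bool guest_disable;     /* what the guest last wrote */
    uint16_t hw_cmd;        /* stand-in for the physical command register */
    unsigned int hw_writes; /* number of simulated hardware updates */
};

/* Hardware is meant to hold the OR of the guest and host values. */
static bool intx_effective(const struct intx_state *s)
{
    return s->host_disable || s->guest_disable;
}

/* Touch "hardware" only when the effective value actually changed. */
static void intx_sync(struct intx_state *s, bool old_effective)
{
    bool new_effective = intx_effective(s);

    if ( new_effective == old_effective )
        return;

    if ( new_effective )
        s->hw_cmd = (uint16_t)(s->hw_cmd | PCI_COMMAND_INTX_DISABLE);
    else
        s->hw_cmd = (uint16_t)(s->hw_cmd & ~PCI_COMMAND_INTX_DISABLE);
    s->hw_writes++;
}

/* Guest write to the command register: record its bit, then sync. */
static void intx_guest_write(struct intx_state *s, uint16_t guest_cmd)
{
    bool old_effective;

    assert(s);
    old_effective = intx_effective(s);
    s->guest_disable = (guest_cmd & PCI_COMMAND_INTX_DISABLE) != 0;
    intx_sync(s, old_effective);
}

/* Host-side change of its own setting: record it, then sync. */
static void intx_host_set(struct intx_state *s, bool disable)
{
    bool old_effective;

    assert(s);
    old_effective = intx_effective(s);
    s->host_disable = disable;
    intx_sync(s, old_effective);
}

/* The guest reads back its own value, not the effective one. */
static uint16_t intx_guest_view(const struct intx_state *s)
{
    return s->guest_disable ? PCI_COMMAND_INTX_DISABLE : 0;
}
```

With this, a guest clearing its bit while the host still wants INTx disabled
causes no hardware write at all. A Dom0 view, as mentioned above, could
presumably be modelled as one more input to the OR; the sketch leaves that out.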

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 6/9] vpci/header: Handle p2m range sets per BAR
  2021-09-09  9:12             ` Oleksandr Andrushchenko
@ 2021-09-09  9:39               ` Jan Beulich
  2021-09-09 10:03                 ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-09  9:39 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 09.09.2021 11:12, Oleksandr Andrushchenko wrote:
> Anyways, I am open to any decision on what would be the right approach here:
> 
> 1. Use range sets per BAR as in the patch
> 
> 2. Remove range sets completely and have a per-vCPU list with mapping
> 
> data as I described above
> 
> 3. Anything else?

A decision first requires a proposal. I think 3 is the way to investigate
first: Rather than starting from the code we currently have, start from
what you need for DomU-s to work. If there's enough overlap with how we
handle Dom0, code can be shared. If things are sufficiently different,
separate code paths are likely better. As said - to me a guest altering a
BAR is merely a very special form of a request to change its P2M. The M
part remains unchanged (which is the major difference from Dom0), while
the P part changes. As long as you can assume that no two BARs share a page,
this would appear to suggest that it's simply a P2M operation plus
bookkeeping at the vPCI layer. Completely different from Dom0 handling.

All of this applies only with memory decoding enabled, I expect.
Disabling memory decoding on a device ought to be a simple "unmap all
BARs", while enabling is "map all BARs". Which again is, due to the
assumed lack of sharing of pages, much simpler than on Dom0: You only
need to subtract the MSI-X table range(s) (if any, and perhaps not
necessary when unmapping, as there's nothing wrong to unmap a P2M slot
which wasn't mapped); this may not even require any rangeset at all to
represent.
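
As a rough illustration of the "may not even require any rangeset" remark,
here is a standalone sketch that splits one BAR around one MSI-X table range
using nothing but a two-element array. It assumes a single MSI-X range per
BAR and no page shared between BARs; struct frame_range and bar_minus_msix()
are made-up names, not existing Xen interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous run of page frames, inclusive on both ends. */
struct frame_range {
    unsigned long first, last;
};

/*
 * Compute the parts of a BAR that remain to be mapped once the MSI-X
 * table frames are excluded. At most two ranges can result: one before
 * and one after the table. Pass msix == NULL if the BAR holds no table.
 * Returns the number of ranges written to out[].
 */
static size_t bar_minus_msix(struct frame_range bar,
                             const struct frame_range *msix,
                             struct frame_range out[2])
{
    size_t n = 0;

    assert(bar.first <= bar.last);

    /* No table, or the table lies entirely outside this BAR. */
    if ( !msix || msix->last < bar.first || msix->first > bar.last )
    {
        out[n++] = bar;
        return n;
    }

    /* Frames in front of the table, if any. */
    if ( msix->first > bar.first )
        out[n++] = (struct frame_range){ bar.first, msix->first - 1 };

    /* Frames behind the table, if any. */
    if ( msix->last < bar.last )
        out[n++] = (struct frame_range){ msix->last + 1, bar.last };

    return n;
}
```

"Map all BARs" would then be a loop over the BARs calling this and issuing
one P2M update per returned range; for "unmap all BARs" the whole BAR could
be passed with msix == NULL, as per the remark about unmapping unmapped slots.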

And in fact I wonder whether for DomU-s you want to support BAR changes
in the first place while memory decoding is enabled. Depends much on
how quirky the guest OSes are that ought to run on top.

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 7/9] vpci/header: program p2m with guest BAR view
  2021-09-09  9:16         ` Oleksandr Andrushchenko
@ 2021-09-09  9:40           ` Jan Beulich
  2021-09-09  9:53             ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-09  9:40 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 09.09.2021 11:16, Oleksandr Andrushchenko wrote:
> 
> On 09.09.21 11:26, Jan Beulich wrote:
>> On 09.09.2021 08:13, Oleksandr Andrushchenko wrote:
>>> On 06.09.21 17:51, Jan Beulich wrote:
>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>> @@ -37,12 +41,28 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>>>>                         unsigned long *c)
>>>>>    {
>>>>>        const struct map_data *map = data;
>>>>> +    gfn_t start_gfn;
>>>>>        int rc;
>>>>>    
>>>>>        for ( ; ; )
>>>>>        {
>>>>>            unsigned long size = e - s + 1;
>>>>>    
>>>>> +        /*
>>>>> +         * Any BAR may have holes in its memory we want to map, e.g.
>>>>> +         * we don't want to map MSI regions which may be a part of that BAR,
>>>>> +         * e.g. when a single BAR is used for both MMIO and MSI.
>>>>> +         * In this case MSI regions are subtracted from the mapping, but
>>>>> +         * map->start_gfn still points to the very beginning of the BAR.
>>>>> +         * So if there is a hole present then we need to adjust start_gfn
>>>>> +         * to reflect the fact of that subtraction.
>>>>> +         */
>>>>> +        start_gfn = gfn_add(map->start_gfn, s - mfn_x(map->start_mfn));
>>>> I may be missing something, but don't you need to adjust "size" then
>>>> as well?
>>> No, as range sets get consumed we have e and s updated accordingly,
>>> so each time size represents the right value.
>> It feels like something's wrong with the rangeset construction then:
>> Either it represents _all_ holes (including degenerate ones at the
>> start or end of a range), or none of them.
> 
> The resulting range set only has the MMIOs in it. While constructing the range set
> we cut off MSI-X out of it (make holes). But finally it only has the ranges that we
> need to map/unmap.

And then why is there a need to adjust start_gfn?

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 7/9] vpci/header: program p2m with guest BAR view
  2021-09-09  9:40           ` Jan Beulich
@ 2021-09-09  9:53             ` Oleksandr Andrushchenko
  0 siblings, 0 replies; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-09  9:53 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 09.09.21 12:40, Jan Beulich wrote:
> On 09.09.2021 11:16, Oleksandr Andrushchenko wrote:
>> On 09.09.21 11:26, Jan Beulich wrote:
>>> On 09.09.2021 08:13, Oleksandr Andrushchenko wrote:
>>>> On 06.09.21 17:51, Jan Beulich wrote:
>>>>> On 03.09.2021 12:08, Oleksandr Andrushchenko wrote:
>>>>>> @@ -37,12 +41,28 @@ static int map_range(unsigned long s, unsigned long e, void *data,
>>>>>>                          unsigned long *c)
>>>>>>     {
>>>>>>         const struct map_data *map = data;
>>>>>> +    gfn_t start_gfn;
>>>>>>         int rc;
>>>>>>     
>>>>>>         for ( ; ; )
>>>>>>         {
>>>>>>             unsigned long size = e - s + 1;
>>>>>>     
>>>>>> +        /*
>>>>>> +         * Any BAR may have holes in its memory we want to map, e.g.
>>>>>> +         * we don't want to map MSI regions which may be a part of that BAR,
>>>>>> +         * e.g. when a single BAR is used for both MMIO and MSI.
>>>>>> +         * In this case MSI regions are subtracted from the mapping, but
>>>>>> +         * map->start_gfn still points to the very beginning of the BAR.
>>>>>> +         * So if there is a hole present then we need to adjust start_gfn
>>>>>> +         * to reflect the fact of that subtraction.
>>>>>> +         */
>>>>>> +        start_gfn = gfn_add(map->start_gfn, s - mfn_x(map->start_mfn));
>>>>> I may be missing something, but don't you need to adjust "size" then
>>>>> as well?
>>>> No, as range sets get consumed we have e and s updated accordingly,
>>>> so each time size represents the right value.
>>> It feels like something's wrong with the rangeset construction then:
>>> Either it represents _all_ holes (including degenerate ones at the
>>> start or end of a range), or none of them.
>> The resulting range set only has the MMIOs in it. While constructing the range set
>> we cut off MSI-X out of it (make holes). But finally it only has the ranges that we
>> need to map/unmap.
> And then why is there a need to adjust start_gfn?

Because of the holes: the range set's private data can only hold the BAR's start MFN
and start GFN. It doesn't have a list of start_{mfn|gfn} values describing each range,
but only the start_{mfn|gfn} of the whole range set, i.e. where the BAR starts.

So, because of the holes we need to adjust the starting addresses:

0. MMIO0 <- we pass start_mfn and start_gfn pointing to the BAR start
1. MSI-X <- hole
2. MMIO1 <- need to adjust start_mfn and start_gfn with respect to the hole above
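
In other words, as a standalone sketch: the GFN of each chunk is the BAR's
start GFN plus the chunk's offset from the BAR's start MFN. This uses plain
unsigned long frame numbers instead of Xen's gfn_t/mfn_t, and
map_data_sketch and chunk_gfn() are illustrative names, not the ones from
the patch.

```c
#include <assert.h>

/* Per-BAR data: only where the whole BAR starts, in both address spaces. */
struct map_data_sketch {
    unsigned long start_gfn; /* guest frame where the BAR starts */
    unsigned long start_mfn; /* machine frame where the BAR starts */
};

/*
 * Given the first machine frame "s" of a chunk handed out by the range
 * set, return the guest frame that chunk has to be mapped at. The
 * chunk's offset into the BAR is the same in both address spaces.
 */
static unsigned long chunk_gfn(const struct map_data_sketch *map,
                               unsigned long s)
{
    assert(s >= map->start_mfn);

    return map->start_gfn + (s - map->start_mfn);
}
```

For MMIO0 above, s equals start_mfn and the result is start_gfn itself; for
MMIO1 the offset accounts for MMIO0 and the MSI-X hole. With an identity
mapping (start_gfn == start_mfn) the function simply returns s.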

>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 6/9] vpci/header: Handle p2m range sets per BAR
  2021-09-09  9:39               ` Jan Beulich
@ 2021-09-09 10:03                 ` Oleksandr Andrushchenko
  2021-09-09 10:46                   ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-09 10:03 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 09.09.21 12:39, Jan Beulich wrote:
> On 09.09.2021 11:12, Oleksandr Andrushchenko wrote:
>> Anyways, I am open to any decision on what would be the right approach here:
>>
>> 1. Use range sets per BAR as in the patch
>>
>> 2. Remove range sets completely and have a per-vCPU list with mapping
>>
>> data as I described above
>>
>> 3. Anything else?
> A decision first requires a proposal.

I already have two: one in the patch, with a range set per BAR, and one described
earlier in the thread, with a single range set and a list for GFN <-> MFN.
If you can share your opinion I am all ears. But please be specific, as
generalities don't change anything for me.

At the same time I do understand that the current code is not set in stone,
but we should have a good reason for major changes to it, IMO. I mean that before
DomUs we were fine with the range sets etc., and now we are not:
so what has changed so much?

>   I think 3 is the way to investigate
> first: Rather than starting from the code we currently have, start from
> what you need for DomU-s to work. If there's enough overlap with how we
> handle Dom0, code can be shared.

You can see that in my patch the same code is used by both hwdom and
guest. What else needs to be proven? The patch shows that all the code,
besides the guest register handlers (which is expected), is common.

>   If things are sufficiently different,
> separate code paths are likely better. As said - to me a guest altering a
> BAR is merely a very special form of a request to change its P2M. The M
> parts remains unchanged (which is the major difference from Dom0), while
> the P part changes. As long as you can assume no two BARs to share a page,
> this would appear to suggest that it's simply a P2M operation plus book
> keeping at the vPCI layer. Completely different from Dom0 handling.

Underneath, yes, possibly. But at the level where vPCI operates I cannot
see any such difference, neither in the vPCI code nor in the patch in question.
Please point me to the vPCI code I fail to see.

>
> All of this applies only with memory decoding enabled, I expect.
> Disabling memory decoding on a device ought to be a simple "unmap all
> BARs", while enabling is "map all BARs". Which again is, due to the
> assumed lack of sharing of pages, much simpler than on Dom0: You only
> need to subtract the MSI-X table range(s) (if any, and perhaps not
> necessary when unmapping, as there's nothing wrong to unmap a P2M slot
> which wasn't mapped); this may not even require any rangeset at all to
> represent.
>
> And in fact I wonder whether for DomU-s you want to support BAR changes
> in the first place while memory decoding is enabled.

No, why? I want to keep the existing logic, e.g. with memory decoding
disabled as it is now.

>   Depends much on
> how quirky the guest OSes are that ought to run on top.
>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 6/9] vpci/header: Handle p2m range sets per BAR
  2021-09-09 10:03                 ` Oleksandr Andrushchenko
@ 2021-09-09 10:46                   ` Jan Beulich
  2021-09-09 11:30                     ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-09 10:46 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 09.09.2021 12:03, Oleksandr Andrushchenko wrote:
> On 09.09.21 12:39, Jan Beulich wrote:
>> On 09.09.2021 11:12, Oleksandr Andrushchenko wrote:
>>> Anyways, I am open to any decision on what would be the right approach here:
>>>
>>> 1. Use range sets per BAR as in the patch
>>>
>>> 2. Remove range sets completely and have a per-vCPU list with mapping
>>>
>>> data as I described above
>>>
>>> 3. Anything else?
>> A decision first requires a proposal.
> 
> I already have 2: one in the patch with the range set per BAR and one described
> earlier in the thread with a single range set and a list for GFN <-> MFN.
> If you can tell your opinion I am all ears. But, please be specific as common words
> don't change anything to me.
> At the same time I do understand that the current code is not set in stone,
> but we should have a good reason for major changes to it, IMO.

And I view your change, as proposed, as a major one. You turn the logic all
over imo.

> I mean that before
> DomU's we were fine with the range sets etc, and now we are not:
> so what has changed so much?

Nothing has changed. I'm not advocating for removal of the rangeset use in
handling Dom0's needs. I'm suggesting that their use might not be a good
fit for DomU.

>>   I think 3 is the way to investigate
>> first: Rather than starting from the code we currently have, start from
>> what you need for DomU-s to work. If there's enough overlap with how we
>> handle Dom0, code can be shared.
> 
> You can see that in my patch the same code is used by both hwdom and
> guest. What else needs to be proven? The patch shows that all the code
> besides guest register handlers (which is expected) is all common.

The complexity of dealing with Dom0 has increased. I've outlined the
process that I think should be followed: First determine what DomU needs.
Then see how much of this actually fits the existing code (handling Dom0).
Then decide whether altering Dom0 handling is actually worth it,
compared to handling DomU separately. In fact handling it separately
first may have its own benefits, like easing review and reducing the risk
of breaking Dom0 handling. If then there are enough similarities, in a
2nd step both may want folding.

>> All of this applies only with memory decoding enabled, I expect.
>> Disabling memory decoding on a device ought to be a simple "unmap all
>> BARs", while enabling is "map all BARs". Which again is, due to the
>> assumed lack of sharing of pages, much simpler than on Dom0: You only
>> need to subtract the MSI-X table range(s) (if any, and perhaps not
>> necessary when unmapping, as there's nothing wrong to unmap a P2M slot
>> which wasn't mapped); this may not even require any rangeset at all to
>> represent.
>>
>> And in fact I wonder whether for DomU-s you want to support BAR changes
>> in the first place while memory decoding is enabled.
> 
> No, why? I want to keep the existing logic, e.g. with memory decoding
> disabled as it is now.

Afaict existing code deals with both cases. What I was putting under
question is whether DomU handling code also needs to.

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 6/9] vpci/header: Handle p2m range sets per BAR
  2021-09-09 10:46                   ` Jan Beulich
@ 2021-09-09 11:30                     ` Oleksandr Andrushchenko
  2021-09-09 11:51                       ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-09 11:30 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 09.09.21 13:46, Jan Beulich wrote:
> On 09.09.2021 12:03, Oleksandr Andrushchenko wrote:
>> On 09.09.21 12:39, Jan Beulich wrote:
>>> On 09.09.2021 11:12, Oleksandr Andrushchenko wrote:
>>>> Anyways, I am open to any decision on what would be the right approach here:
>>>>
>>>> 1. Use range sets per BAR as in the patch
>>>>
>>>> 2. Remove range sets completely and have a per-vCPU list with mapping
>>>>
>>>> data as I described above
>>>>
>>>> 3. Anything else?
>>> A decision first requires a proposal.
>> I already have 2: one in the patch with the range set per BAR and one described
>> earlier in the thread with a single range set and a list for GFN <-> MFN.
>> If you can tell your opinion I am all ears. But, please be specific as common words
>> don't change anything to me.
>> At the same time I do understand that the current code is not set in stone,
>> but we should have a good reason for major changes to it, IMO.
> And I view your change, as proposed, as a major one. You turn the logic all
> over imo.
>
>> I mean that before
>> DomU's we were fine with the range sets etc, and now we are not:
>> so what has changed so much?
> Nothing has changed. I'm not advocating for removal of the rangeset use in
> handling Dom0's needs. I'm suggesting that their use might not be a good
> fit for DomU.

The proposed change makes the same code work for both Dom0 and DomU.
So, instead of having the common code as proposed, do you suggest inventing
something special for DomU (doing the same job as we already do for Dom0)
and then seeing whether we can combine the two to make the code common
again? I am saying that the code is already common, even if you think that
for DomU it can be simpler. (I still can't see in which way, as the p2m and other
things are not directly touched by the vPCI code: both Dom0 and DomU
use {map|unmap}_mmio_regions, and the only difference is that for Dom0
we have MFN == GFN and for DomU we don't.)

So, even if range sets are not a good fit for DomUs (I can't see why), if they help
keep the code common I think it is worth having them.

>
>>>    I think 3 is the way to investigate
>>> first: Rather than starting from the code we currently have, start from
>>> what you need for DomU-s to work. If there's enough overlap with how we
>>> handle Dom0, code can be shared.
>> You can see that in my patch the same code is used by both hwdom and
>> guest. What else needs to be proven? The patch shows that all the code
>> besides guest register handlers (which is expected) is all common.
> The complexity of dealing with Dom0 has increased. I've outlined the
> process that I think should be followed: First determine what DomU needs.
It is already known: GFN <-> MFN non-identity mappings
> Then see how much of this actually fits the existing code (handling Dom0).
It is already in the patch: we have all code common for both Dom0 and DomU
> Then decide whether altering Dom0 handling is actually worth it,
> compared to handling DomU separately.
It leads to the same functionality being implemented twice
>   In fact handling it separately
> first may have its own benefits, like easing review and reducing the risk
> of breaking Dom0 handling. If then there are enough similarities, in a
> 2nd step both may want folding.

You can check in the patch whether we have "if ( hwdom )" spread over the
implementation. I guess you won't find that (besides the guest register
handlers, which is expected).

>
>>> All of this applies only with memory decoding enabled, I expect.
>>> Disabling memory decoding on a device ought to be a simple "unmap all
>>> BARs", while enabling is "map all BARs". Which again is, due to the
>>> assumed lack of sharing of pages, much simpler than on Dom0: You only
>>> need to subtract the MSI-X table range(s) (if any, and perhaps not
>>> necessary when unmapping, as there's nothing wrong to unmap a P2M slot
>>> which wasn't mapped); this may not even require any rangeset at all to
>>> represent.
>>>
>>> And in fact I wonder whether for DomU-s you want to support BAR changes
>>> in the first place while memory decoding is enabled.
>> No, why? I want to keep the existing logic, e.g. with memory decoding
>> disabled as it is now.
> Afaict existing code deals with both cases.

Hm, I thought that we only map/unmap with memory decoding disabled.
For my education: what happens if you unmap with decoding enabled and
the domain accesses the MMIOs?

>   What I was putting under
> question is whether DomU handling code also needs to.
>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-09  9:21                           ` Jan Beulich
@ 2021-09-09 11:48                             ` Oleksandr Andrushchenko
  2021-09-09 11:53                               ` Jan Beulich
  2021-09-09 11:48                             ` Oleksandr Andrushchenko
  1 sibling, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-09 11:48 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 09.09.21 12:21, Jan Beulich wrote:
> On 09.09.2021 10:50, Oleksandr Andrushchenko wrote:
>> On 09.09.21 11:43, Jan Beulich wrote:
>>> On 09.09.2021 10:39, Oleksandr Andrushchenko wrote:
>>>> On 07.09.21 13:06, Jan Beulich wrote:
>>>>> On 07.09.2021 11:52, Oleksandr Andrushchenko wrote:
>>>>>> On 07.09.21 12:19, Jan Beulich wrote:
>>>>>>> On 07.09.2021 11:07, Oleksandr Andrushchenko wrote:
>>>>>>>> On 07.09.21 11:49, Jan Beulich wrote:
>>>>>>>>> On 07.09.2021 10:18, Oleksandr Andrushchenko wrote:
>>>>>>>>>> So, if we have a hidden PCI device which can be assigned to a guest and it is literally untouched
>>>>>>>>>> (not enabled in Dom0) then I think there will be no such reference as "host assigned values" as
>>>>>>>>>> most probably the command register will remain in its after reset state.
>>>>>>>>> What meaning of "hidden" do you imply here? Devices passed to
>>>>>>>>> pci_{hide,ro}_device() may not be assigned to guests ...
>>>>>>>> You are completely right here.
>>>>>>>>> For any other meaning of "hidden", even if the device is completely
>>>>>>>>> ignored by Dom0,
>>>>>>>> Dom0less is such a case when a device is assigned to the guest
>>>>>>>> without Dom0 at all?
>>>>>>> In this case it is entirely unclear to me what entity it is to have
>>>>>>> a global view on the PCI subsystem.
>>>>>>>
>>>>>>>>>       certain of the properties still cannot be allowed
>>>>>>>>> to be DomU-controlled.
>>>>>>>> The list is not that big, could you please name a few you think cannot
>>>>>>>> be controlled by a guest? I can think of PCI_COMMAND_SPECIAL(?),
>>>>>>>> PCI_COMMAND_INVALIDATE(?), PCI_COMMAND_PARITY, PCI_COMMAND_WAIT,
>>>>>>>> PCI_COMMAND_SERR, PCI_COMMAND_INTX_DISABLE which we may want to
>>>>>>>> be aligned with the "host reference" values, e.g. we only allow those bits
>>>>>>>> to be set as they are in Dom0.
>>>>>>> Well, you've compiled a list already, and I did say so before as well:
>>>>>>> Everything except I/O and memory decoding as well as bus mastering
>>>>>>> needs at least closely looking at. INTX_DISABLE, for example, is
>>>>>>> something I don't think a guest should be able to directly control.
>>>>>>> It may still be the case that the host permits it control, but then
>>>>>>> only indirectly, allowing the host to appropriately adjust its
>>>>>>> internals.
>>>>>>>
>>>>>>> Note that even for I/O and memory decoding as well as bus mastering
>>>>>>> it may be necessary to limit guest control: In case the host wants
>>>>>>> to disable any of these (perhaps transiently) despite the guest
>>>>>>> wanting them enabled.
>>>>>> Ok, so it is now clear that we need a yet another patch to add a proper
>>>>>> command register emulation. What is your preference: drop the current
>>>>>> patch, implement command register emulation and add a "reset patch"
>>>>>> after that or we can have the patch as is now, but I'll only reset IO/mem and bus
>>>>>> master bits, e.g. read the real value, mask the wanted bits and write back?
>>>>> Either order is fine with me as long as the result will be claimed to
>>>>> be complete until proper emulation is in place.
>>>> I tried to see what others do in order to emulate the PCI_COMMAND register
>>>> and it seems that at most they care about only the INTX bit (besides
>>>> IO/memory enable and bus master, which are write-through). Please see
>>>> [1] and [2]. Probably I am missing something, but it could be because in order
>>>> to properly emulate the COMMAND register we need to know about the
>>>> whole PCI topology, e.g. if any setting in the device's command register
>>>> is aligned with the upstream port etc. This makes me think that because
>>>> of this complexity others just ignore that. Nor do I think this can be
>>>> easily done in our case. So I would suggest we just add the following
>>>> simple logic to only emulate PCI_COMMAND_INTX_DISABLE: allow the guest to
>>>> disable the interrupts, but don't allow it to enable them if the host has
>>>> disabled them. This could also be a bit tricky for the devices which are not
>>>> enabled and thus not configured in Dom0, e.g. we do not know for sure
>>>> if the value in the PCI_COMMAND register (in particular
>>>> PCI_COMMAND_INTX_DISABLE bit) can be used as the reference host value or
>>>> not. It can be that the value there is just the one after reset or so.
>>>> The rest of the command register bits will go directly to the command
>>>> register untouched.
>>>> So, at the end of the day the question is if PCI_COMMAND_INTX_DISABLE
>>>> is enough and how to get its reference host value.
>>> Well, in order for the whole thing to be security supported it needs to
>>> be explained for every bit why it is safe to allow the guest to drive it.
>> So, do we want at least the PCI_COMMAND_INTX_DISABLE bit aligned
>> between the host and guest? If so, what do you think about
>> the reference value for it (please see above)?
> Please may I ask that you come up with a proposal? I don't think I've
> said you need to emulate this or any of the other bits. All I've asked
> for is that for every bit you allow the guest to control directly, you
> justify why that's safe and secure. If no justification can be given,
> emulation is going to be necessary. How to solve that is first and
> foremost part of your undertaking.

The thing here is that we can't truly justify whether we can let the guest
control those bits or not, as it all may depend on the topology that some
specific setup might have. Not that we technically can't, but not in a
practical and easy way IMO. Taking that into account, we come to the
conclusion that we need to emulate those then. But, again, we understand
that full emulation, if properly implemented, is going to be a big piece of code
which needs to take into account the physical PCI topology etc.

So, this is my understanding of why others (QEMU, ACRN) do not implement that
and let the guest control all bits but INTx disable (disclaimer: if I understood
their code correctly).

So, my proposal here is to only emulate PCI_COMMAND_INTX_DISABLE and let
the other bits be controlled by the guest (please also see the note below).
I do understand this is not correct, but I can't tell how to deal with this
any other way.

>
> For the bit in question, where the goal appears to be to have hardware
> hold the OR of guest and host values, an approach similar to that used
> for some of the MSI / MSI-X bits might be chosen: Maintain guest and
> host bits in software, and update hardware (at least) when the
> effective resulting value changes. A complicating fact here is, though,
> that unlike for the MSI / MSI-X bits here Dom0 (pciback or its PCI
> subsystem) may also have a view on what the setting ought to be.

The bigger question here is what we can take as the reference for the INTx
bit, e.g. if Dom0 didn't enable/configure the device being passed through
then its COMMAND register may still be in its after-reset state and IMO there is
no guarantee it has the values we can say are "as host wants them".

>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 6/9] vpci/header: Handle p2m range sets per BAR
  2021-09-09 11:30                     ` Oleksandr Andrushchenko
@ 2021-09-09 11:51                       ` Jan Beulich
  0 siblings, 0 replies; 75+ messages in thread
From: Jan Beulich @ 2021-09-09 11:51 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 09.09.2021 13:30, Oleksandr Andrushchenko wrote:
> On 09.09.21 13:46, Jan Beulich wrote:
>> On 09.09.2021 12:03, Oleksandr Andrushchenko wrote:
>>> On 09.09.21 12:39, Jan Beulich wrote:
>>>> And in fact I wonder whether for DomU-s you want to support BAR changes
>>>> in the first place while memory decoding is enabled.
>>> No, why? I want to keep the existing logic, e.g. with memory decoding
>>> disabled as it is now.
>> Afaict existing code deals with both cases.
> 
> Hm, I thought that we only map/unmap with memory decoding disabled.
> For my education: what happens if you unmap with decoding enabled and
> domain accesses the MMIOs?

That would depend on the precise timing; it's certainly not well defined.
But supporting this may be needed for quirky OSes, as said before, as
they may get away with that on real hardware if they avoid actual accesses
at the time of the BAR change.
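
For contrast, the ordering that avoids the timing question altogether is to
switch memory decoding off around the BAR update. Below is a minimal sketch of
that ordering only. It assumes a mock 256-byte config space, and the helpers
conf_read16/conf_write16/conf_read32/conf_write32 and move_bar32 are
hypothetical stand-ins, not Xen's pci_conf_* accessors; only the register
offsets and the PCI_COMMAND_MEMORY bit come from the PCI specification.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PCI_COMMAND         0x04
#define PCI_COMMAND_MEMORY  0x0002   /* memory space decoding enable */
#define PCI_BASE_ADDRESS_0  0x10

/* Mock 256-byte config space (hypothetical; stands in for real hardware). */
struct mock_dev {
    uint8_t cfg[256];
};

static uint16_t conf_read16(const struct mock_dev *d, unsigned int reg)
{
    uint16_t v;

    memcpy(&v, &d->cfg[reg], sizeof(v));
    return v;
}

static void conf_write16(struct mock_dev *d, unsigned int reg, uint16_t v)
{
    memcpy(&d->cfg[reg], &v, sizeof(v));
}

static uint32_t conf_read32(const struct mock_dev *d, unsigned int reg)
{
    uint32_t v;

    memcpy(&v, &d->cfg[reg], sizeof(v));
    return v;
}

static void conf_write32(struct mock_dev *d, unsigned int reg, uint32_t v)
{
    memcpy(&d->cfg[reg], &v, sizeof(v));
}

/* Move a 32-bit memory BAR with decoding switched off around the update. */
static uint32_t move_bar32(struct mock_dev *d, unsigned int bar, uint32_t addr)
{
    unsigned int reg = PCI_BASE_ADDRESS_0 + bar * 4;
    uint16_t cmd;

    assert(bar < 6);   /* a type 0 header has six BAR slots */

    cmd = conf_read16(d, PCI_COMMAND);
    /* Stop the device from claiming memory cycles while the BAR changes. */
    conf_write16(d, PCI_COMMAND, (uint16_t)(cmd & ~PCI_COMMAND_MEMORY));
    conf_write32(d, reg, addr);
    /* Restore whatever decoding state the caller had before. */
    conf_write16(d, PCI_COMMAND, cmd);

    return conf_read32(d, reg);
}
```

A quirky OS that skips the first and last config writes is the case discussed
above: the BAR moves while the device may still be decoding the old range.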

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-09 11:48                             ` Oleksandr Andrushchenko
@ 2021-09-09 11:53                               ` Jan Beulich
  2021-09-09 12:42                                 ` Oleksandr Andrushchenko
  0 siblings, 1 reply; 75+ messages in thread
From: Jan Beulich @ 2021-09-09 11:53 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 09.09.2021 13:48, Oleksandr Andrushchenko wrote:
> On 09.09.21 12:21, Jan Beulich wrote:
>> For the bit in question, where the goal appears to be to have hardware
>> hold the OR of guest and host values, an approach similar to that used
>> for some of the MSI / MSI-X bits might be chosen: Maintain guest and
>> host bits in software, and update hardware (at least) when the
>> effective resulting value changes. A complicating fact here is, though,
>> that unlike for the MSI / MSI-X bits here Dom0 (pciback or its PCI
>> subsystem) may also have a view on what the setting ought to be.
>
> The bigger question here is what we can take as the reference for the INTx
> bit, e.g. if Dom0 didn't enable/configure the device being passed through
> then its COMMAND register may still be in its after-reset state and IMO there is
> no guarantee it has the values we can say are "as host wants them"

In the absence of Dom0 controlling the device, I think we ought to take
Xen's view as the "host" one. Which will want the bit set at least as
long as either MSI or MSI-X is enabled for the device.
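
Put together with the earlier suggestion of keeping guest and host bits in
software, this could look roughly like the sketch below. It is an illustration
under stated assumptions, not Xen code: struct intx_shadow and the intx_*
helpers are hypothetical names, and the hw_cmd field merely simulates the
physical command register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x0400

/* Hypothetical per-device shadow state; not a Xen structure. */
struct intx_shadow {
    bool guest_disable;      /* what the guest last wrote */
    bool host_disable;       /* what the host (here: Xen) wants */
    uint16_t hw_cmd;         /* stand-in for the physical command register */
    unsigned int hw_writes;  /* counts simulated hardware updates */
};

/* Host view without Dom0: keep INTx off while MSI or MSI-X is enabled. */
static bool host_wants_intx_disabled(bool msi_enabled, bool msix_enabled)
{
    return msi_enabled || msix_enabled;
}

/* Push the OR of both views to "hardware", but only when it changes. */
static void intx_sync(struct intx_shadow *s)
{
    bool want, have;

    assert(s);
    want = s->guest_disable || s->host_disable;
    have = (s->hw_cmd & PCI_COMMAND_INTX_DISABLE) != 0;

    if ( want == have )
        return;

    if ( want )
        s->hw_cmd |= PCI_COMMAND_INTX_DISABLE;
    else
        s->hw_cmd &= (uint16_t)~PCI_COMMAND_INTX_DISABLE;
    s->hw_writes++;
}

/* Guest write to the command register: record its wish, then resync. */
static void intx_guest_write(struct intx_shadow *s, uint16_t cmd)
{
    s->guest_disable = (cmd & PCI_COMMAND_INTX_DISABLE) != 0;
    intx_sync(s);
}

/* Host-side change, e.g. MSI or MSI-X being switched on or off. */
static void intx_host_update(struct intx_shadow *s, bool msi, bool msix)
{
    s->host_disable = host_wants_intx_disabled(msi, msix);
    intx_sync(s);
}
```

Because the guest's bit is stored separately, a guest that disables and later
re-enables INTx gets its own setting back once the host no longer needs the
bit set.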

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-09 11:53                               ` Jan Beulich
@ 2021-09-09 12:42                                 ` Oleksandr Andrushchenko
  2021-09-09 12:47                                   ` Jan Beulich
  0 siblings, 1 reply; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-09 12:42 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 09.09.21 14:53, Jan Beulich wrote:
> On 09.09.2021 13:48, Oleksandr Andrushchenko wrote:
>> On 09.09.21 12:21, Jan Beulich wrote:
>>> For the bit in question, where the goal appears to be to have hardware
>>> hold the OR of guest and host values, an approach similar to that used
>>> for some of the MSI / MSI-X bits might be chosen: Maintain guest and
>>> host bits in software, and update hardware (at least) when the
>>> effective resulting value changes. A complicating fact here is, though,
>>> that unlike for the MSI / MSI-X bits here Dom0 (pciback or its PCI
>>> subsystem) may also have a view on what the setting ought to be.
>> The bigger question here is what we can take as the reference for the INTx
>> bit, e.g. if Dom0 didn't enable/configure the device being passed through
>> then its COMMAND register may still be in its after-reset state and IMO there is
>> no guarantee it has the values we can say are "as host wants them"
> In the absence of Dom0 controlling the device, I think we ought to take
> Xen's view as the "host" one.
Agree
>   Which will want the bit set at least as
> long as either MSI or MSI-X is enabled for the device.
But what is the INTx relation to MSI/MSI-X here?
>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-09 12:42                                 ` Oleksandr Andrushchenko
@ 2021-09-09 12:47                                   ` Jan Beulich
  2021-09-09 12:48                                     ` Oleksandr Andrushchenko
  2021-09-09 13:17                                     ` Oleksandr Andrushchenko
  0 siblings, 2 replies; 75+ messages in thread
From: Jan Beulich @ 2021-09-09 12:47 UTC (permalink / raw)
  To: Oleksandr Andrushchenko, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel

On 09.09.2021 14:42, Oleksandr Andrushchenko wrote:
> On 09.09.21 14:53, Jan Beulich wrote:
>> On 09.09.2021 13:48, Oleksandr Andrushchenko wrote:
>>> On 09.09.21 12:21, Jan Beulich wrote:
>>>> For the bit in question, where the goal appears to be to have hardware
>>>> hold the OR of guest and host values, an approach similar to that used
>>>> for some of the MSI / MSI-X bits might be chosen: Maintain guest and
>>>> host bits in software, and update hardware (at least) when the
>>>> effective resulting value changes. A complicating fact here is, though,
>>>> that unlike for the MSI / MSI-X bits here Dom0 (pciback or its PCI
>>>> subsystem) may also have a view on what the setting ought to be.
>>> The bigger question here is what we can take as the reference for the INTx
>>> bit, e.g. if Dom0 didn't enable/configure the device being passed through
>>> then its COMMAND register may still be in its after-reset state and IMO there is
>>> no guarantee it has the values we can say are "as host wants them"
>> In the absence of Dom0 controlling the device, I think we ought to take
>> Xen's view as the "host" one.
> Agree
>>   Which will want the bit set at least as
>> long as either MSI or MSI-X is enabled for the device.
> But what is the INTx relation to MSI/MSI-X here?

Devices are not supposed to signal interrupts two different ways at a
time. They may enable only one - pin based, MSI, or MSI-X.
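
As a sketch of how the three enable bits relate: the bit positions below are
the standard ones (INTx Disable is bit 10 of the command register, MSI Enable
is bit 0 of the MSI Message Control word, MSI-X Enable is bit 15 of the MSI-X
Message Control word). The enum and the function active_irq_mode() are
hypothetical names used only for this illustration, and treating "MSI and
MSI-X both enabled" as invalid reflects my reading that software must not
enable both at once.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x0400  /* command register, bit 10 */
#define PCI_MSI_FLAGS_ENABLE     0x0001  /* MSI Message Control, bit 0 */
#define PCI_MSIX_FLAGS_ENABLE    0x8000  /* MSI-X Message Control, bit 15 */

enum irq_mode { IRQ_NONE, IRQ_INTX, IRQ_MSI, IRQ_MSIX, IRQ_INVALID };

/* Classify which delivery method a function is currently set up to use. */
static enum irq_mode active_irq_mode(uint16_t cmd, uint16_t msi_ctrl,
                                     uint16_t msix_ctrl)
{
    int msi  = (msi_ctrl & PCI_MSI_FLAGS_ENABLE) != 0;
    int msix = (msix_ctrl & PCI_MSIX_FLAGS_ENABLE) != 0;

    if ( msi && msix )
        return IRQ_INVALID;  /* two message-signalled methods at once */
    if ( msix )
        return IRQ_MSIX;
    if ( msi )
        return IRQ_MSI;

    /* Neither MSI nor MSI-X: pin-based unless INTx is disabled. */
    return (cmd & PCI_COMMAND_INTX_DISABLE) ? IRQ_NONE : IRQ_INTX;
}
```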

Jan



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-09 12:47                                   ` Jan Beulich
@ 2021-09-09 12:48                                     ` Oleksandr Andrushchenko
  2021-09-09 13:17                                     ` Oleksandr Andrushchenko
  1 sibling, 0 replies; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-09 12:48 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 09.09.21 15:47, Jan Beulich wrote:
> On 09.09.2021 14:42, Oleksandr Andrushchenko wrote:
>> On 09.09.21 14:53, Jan Beulich wrote:
>>> On 09.09.2021 13:48, Oleksandr Andrushchenko wrote:
>>>> On 09.09.21 12:21, Jan Beulich wrote:
>>>>> For the bit in question, where the goal appears to be to have hardware
>>>>> hold the OR of guest and host values, an approach similar to that used
>>>>> for some of the MSI / MSI-X bits might be chosen: Maintain guest and
>>>>> host bits in software, and update hardware (at least) when the
>>>>> effective resulting value changes. A complicating fact here is, though,
>>>>> that unlike for the MSI / MSI-X bits here Dom0 (pciback or its PCI
>>>>> subsystem) may also have a view on what the setting ought to be.
>>>> The bigger question here is what we can take as the reference for the INTx
>>>> bit, e.g. if Dom0 didn't enable/configure the device being passed through
>>>> then its COMMAND register may still be in its after-reset state and IMO there is
>>>> no guarantee it has the values we can say are "as host wants them"
>>> In the absence of Dom0 controlling the device, I think we ought to take
>>> Xen's view as the "host" one.
>> Agree
>>>    Which will want the bit set at least as
>>> long as either MSI or MSI-X is enabled for the device.
>> But what is the INTx relation to MSI/MSI-X here?
> Devices are not supposed to signal interrupts two different ways at a
> time. They may enable only one - pin based, MSI, or MSI-X.

Ah, that simple ;) Yes, of course

>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 8/9] vpci/header: Reset the command register when adding devices
  2021-09-09 12:47                                   ` Jan Beulich
  2021-09-09 12:48                                     ` Oleksandr Andrushchenko
@ 2021-09-09 13:17                                     ` Oleksandr Andrushchenko
  1 sibling, 0 replies; 75+ messages in thread
From: Oleksandr Andrushchenko @ 2021-09-09 13:17 UTC (permalink / raw)
  To: Jan Beulich, Oleksandr Andrushchenko
  Cc: julien, sstabellini, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Artem Mygaiev, roger.pau, Bertrand Marquis, Rahul Singh,
	xen-devel


On 09.09.21 15:47, Jan Beulich wrote:
> On 09.09.2021 14:42, Oleksandr Andrushchenko wrote:
>> On 09.09.21 14:53, Jan Beulich wrote:
>>> On 09.09.2021 13:48, Oleksandr Andrushchenko wrote:
>>>> On 09.09.21 12:21, Jan Beulich wrote:
>>>>> For the bit in question, where the goal appears to be to have hardware
>>>>> hold the OR of guest and host values, an approach similar to that used
>>>>> for some of the MSI / MSI-X bits might be chosen: Maintain guest and
>>>>> host bits in software, and update hardware (at least) when the
>>>>> effective resulting value changes. A complicating fact here is, though,
>>>>> that unlike for the MSI / MSI-X bits here Dom0 (pciback or its PCI
>>>>> subsystem) may also have a view on what the setting ought to be.
>>>> The bigger question here is what we can take as the reference for the INTx
>>>> bit, e.g. if Dom0 didn't enable/configure the device being passed through
>>>> then its COMMAND register may still be in its after-reset state and IMO there is
>>>> no guarantee it has the values we can say are "as host wants them"
>>> In the absence of Dom0 controlling the device, I think we ought to take
>>> Xen's view as the "host" one.
>> Agree
>>>    Which will want the bit set at least as
>>> long as either MSI or MSI-X is enabled for the device.
>> But what is the INTx relation to MSI/MSI-X here?
> Devices are not supposed to signal interrupts two different ways at a
> time. They may enable only one - pin based, MSI, or MSI-X.

Ok, so I see that we can partially emulate the command register as:

static void guest_cmd_write(const struct pci_dev *pdev, unsigned int reg,
                            uint32_t cmd, void *data)
{
    /* TODO: Add proper emulation for all bits of the command register. */

    if ( (cmd & PCI_COMMAND_INTX_DISABLE) == 0 )
    {
        /*
         * Guest wants to enable INTx. It can't be enabled if:
         *  - host has INTx disabled
         *  - MSI/MSI-X enabled
         */
        if ( pdev->vpci->msi->enabled )
            cmd |= PCI_COMMAND_INTX_DISABLE;
        else
        {
            uint16_t current_cmd = pci_conf_read16(pdev->sbdf, reg);

            if ( current_cmd & PCI_COMMAND_INTX_DISABLE )
                cmd |= PCI_COMMAND_INTX_DISABLE;
        }
    }

    cmd_write(pdev, reg, cmd, data);
}

and of course have a grand TODO for the rest.
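
To check the truth table outside of Xen, the same filtering can be written as
a self-contained sketch. Everything prefixed mock_ below is a hypothetical
stand-in for the Xen structures and accessors used above. Two things differ
from the handler as posted and are assumptions on my side: the msi pointer is
checked for NULL (a device may have no MSI capability), and an msix pointer
with a similar enabled flag is consulted as well, as the comment in the
handler already mentions MSI-X.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x0400

/* Mock stand-ins for the structures used by the handler above. */
struct mock_msi  { bool enabled; };
struct mock_vpci { struct mock_msi *msi, *msix; };
struct mock_pdev {
    struct mock_vpci *vpci;
    uint16_t hw_cmd;   /* simulated physical command register */
};

/* Compute the command value that is allowed to reach hardware. */
static uint16_t filter_guest_cmd(const struct mock_pdev *pdev, uint16_t cmd)
{
    const struct mock_vpci *v = pdev->vpci;
    bool msg_irq_on = (v->msi && v->msi->enabled) ||
                      (v->msix && v->msix->enabled);

    /* Guest asks for INTx, but MSI/MSI-X is on or hardware has it disabled. */
    if ( !(cmd & PCI_COMMAND_INTX_DISABLE) &&
         (msg_irq_on || (pdev->hw_cmd & PCI_COMMAND_INTX_DISABLE)) )
        cmd |= PCI_COMMAND_INTX_DISABLE;

    return cmd;
}

/* Stand-in for the write handler: filter, then "write" to hardware. */
static void mock_guest_cmd_write(struct mock_pdev *pdev, uint16_t cmd)
{
    assert(pdev && pdev->vpci);
    pdev->hw_cmd = filter_guest_cmd(pdev, cmd);
}
```

One property this makes visible: since the hardware register is the only
record of what the host wants, a disable that was forced (or written by the
guest itself) cannot be told apart from a host one, so INTx stays disabled on
later guest writes. Keeping separate guest and host bits in software, as
suggested earlier in the thread, would avoid that.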

>
> Jan
>
Thank you,

Oleksandr

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 4/9] vpci/header: Add and remove register handlers dynamically
  2021-09-03 10:08 ` [PATCH 4/9] vpci/header: Add and remove register handlers dynamically Oleksandr Andrushchenko
  2021-09-06 14:11   ` Jan Beulich
@ 2021-09-10 21:14   ` Stefano Stabellini
  1 sibling, 0 replies; 75+ messages in thread
From: Stefano Stabellini @ 2021-09-10 21:14 UTC (permalink / raw)
  To: Oleksandr Andrushchenko
  Cc: xen-devel, julien, sstabellini, oleksandr_tyshchenko,
	volodymyr_babchuk, Artem_Mygaiev, roger.pau, jbeulich,
	bertrand.marquis, rahul.singh, Oleksandr Andrushchenko

On Fri, 3 Sep 2021, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Add relevant vpci register handlers when assigning PCI device to a domain
> and remove those when de-assigning. This allows having different
> handlers for different domains, e.g. hwdom and other guests.
> 
> Use stubs for guest domains for now.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
>  xen/drivers/vpci/header.c | 78 +++++++++++++++++++++++++++++++++++----
>  xen/drivers/vpci/vpci.c   |  4 +-
>  xen/include/xen/vpci.h    |  4 ++
>  3 files changed, 76 insertions(+), 10 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 31bca7a12942..5218b1af247e 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -397,6 +397,17 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>      pci_conf_write32(pdev->sbdf, reg, val);
>  }
>  
> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
> +                            uint32_t val, void *data)
> +{
> +}
> +
> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
> +                               void *data)
> +{
> +    return 0xffffffff;
> +}
>  static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>                        uint32_t val, void *data)
>  {
> @@ -445,14 +456,25 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>          rom->addr = val & PCI_ROM_ADDRESS_MASK;
>  }
>  
> -static int add_bar_handlers(struct pci_dev *pdev)
> +static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
> +                            uint32_t val, void *data)
> +{
> +}
> +
> +static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
> +                               void *data)
> +{
> +    return 0xffffffff;
> +}
> +
> +static int add_bar_handlers(struct pci_dev *pdev, bool is_hwdom)
>  {
>      unsigned int i;
>      struct vpci_header *header = &pdev->vpci->header;
>      struct vpci_bar *bars = header->bars;
>      int rc;
>  
> -    /* Setup a handler for the command register. */
> +    /* Setup a handler for the command register: same for hwdom and guests. */
>      rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
>                             2, header);
>      if ( rc )
> @@ -475,8 +497,13 @@ static int add_bar_handlers(struct pci_dev *pdev)
>                  rom_reg = PCI_ROM_ADDRESS;
>              else
>                  rom_reg = PCI_ROM_ADDRESS1;
> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write,
> -                                   rom_reg, 4, &bars[i]);
> +            if ( is_hwdom )
> +                rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write,
> +                                       rom_reg, 4, &bars[i]);
> +            else
> +                rc = vpci_add_register(pdev->vpci,
> +                                       guest_rom_read, guest_rom_write,
> +                                       rom_reg, 4, &bars[i]);
>              if ( rc )
>                  return rc;
>          }
> @@ -485,8 +512,13 @@ static int add_bar_handlers(struct pci_dev *pdev)
>              uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
>  
>              /* This is either VPCI_BAR_MEM32 or VPCI_BAR_MEM64_{LO|HI}. */
> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
> -                                   4, &bars[i]);
> +            if ( is_hwdom )
> +                rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write,
> +                                       reg, 4, &bars[i]);
> +            else
> +                rc = vpci_add_register(pdev->vpci,
> +                                       guest_bar_read, guest_bar_write,
> +                                       reg, 4, &bars[i]);
>              if ( rc )
>                  return rc;
>          }
> @@ -520,7 +552,7 @@ static int init_bars(struct pci_dev *pdev)
>      }
>  
>      if ( pdev->ignore_bars )
> -        return add_bar_handlers(pdev);
> +        return add_bar_handlers(pdev, true);
>  
>      /* Disable memory decoding before sizing. */
>      cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
> @@ -582,7 +614,7 @@ static int init_bars(struct pci_dev *pdev)
>                                PCI_ROM_ADDRESS_ENABLE;
>      }
>  
> -    rc = add_bar_handlers(pdev);
> +    rc = add_bar_handlers(pdev, true);
>      if ( rc )
>      {
>          pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> @@ -593,6 +625,36 @@ static int init_bars(struct pci_dev *pdev)
>  }
>  REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>  
> +int vpci_bar_add_handlers(const struct domain *d, struct pci_dev *pdev)
> +{
> +    int rc;
> +
> +    /* Remove previously added registers. */
> +    vpci_remove_device_registers(pdev);
> +
> +    /* It only makes sense to add registers for hwdom or guest domain. */
> +    if ( d->domain_id >= DOMID_FIRST_RESERVED )
> +        return 0;

This check is redundant, isn't it? Because it is already checked by the
caller?


> +    if ( is_hardware_domain(d) )
> +        rc = add_bar_handlers(pdev, true);
> +    else
> +        rc = add_bar_handlers(pdev, false);

NIT:

  rc = add_bar_handlers(pdev, is_hardware_domain(d));
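
A stand-alone sketch of the collapsed form, for illustration only. The types
and functions here are mocks with the same names as in the patch but
simplified signatures (no pdev), struct mock_domain and last_added are
hypothetical, and the reserved-ID check is kept only so that the sketch
behaves sensibly on its own; whether the caller already guarantees it is
exactly the question above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DOMID_FIRST_RESERVED 0x7FF0U  /* value from Xen's public headers */

/* Mock domain; the real struct domain is far larger. */
struct mock_domain {
    uint16_t domain_id;
    bool is_hwdom;
};

enum handler_set { HANDLERS_NONE, HANDLERS_HWDOM, HANDLERS_GUEST };

/* Records which set of handlers was installed last (for the sketch only). */
static enum handler_set last_added = HANDLERS_NONE;

static bool is_hardware_domain(const struct mock_domain *d)
{
    return d->is_hwdom;
}

static int add_bar_handlers(bool is_hwdom)
{
    last_added = is_hwdom ? HANDLERS_HWDOM : HANDLERS_GUEST;
    return 0;
}

static int vpci_bar_add_handlers(const struct mock_domain *d)
{
    assert(d);

    /* Models vpci_remove_device_registers(): drop what was there before. */
    last_added = HANDLERS_NONE;

    /* It only makes sense to add registers for hwdom or a guest domain. */
    if ( d->domain_id >= DOMID_FIRST_RESERVED )
        return 0;

    /* The predicate is passed straight through instead of an if/else. */
    return add_bar_handlers(is_hardware_domain(d));
}
```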


^ permalink raw reply	[flat|nested] 75+ messages in thread

end of thread, other threads:[~2021-09-10 21:14 UTC | newest]

Thread overview: 75+ messages
2021-09-03 10:08 [PATCH 0/9] PCI devices passthrough on Arm, part 3 Oleksandr Andrushchenko
2021-09-03 10:08 ` [PATCH 1/9] vpci: Make vpci registers removal a dedicated function Oleksandr Andrushchenko
2021-09-03 10:08 ` [PATCH 2/9] vpci: Add hooks for PCI device assign/de-assign Oleksandr Andrushchenko
2021-09-06 13:23   ` Jan Beulich
2021-09-07  8:33     ` Oleksandr Andrushchenko
2021-09-07  8:44       ` Jan Beulich
2021-09-03 10:08 ` [PATCH 3/9] vpci/header: Move register assignments from init_bars Oleksandr Andrushchenko
2021-09-06 13:53   ` Jan Beulich
2021-09-07 10:04     ` Oleksandr Andrushchenko
2021-09-03 10:08 ` [PATCH 4/9] vpci/header: Add and remove register handlers dynamically Oleksandr Andrushchenko
2021-09-06 14:11   ` Jan Beulich
2021-09-07 10:11     ` Oleksandr Andrushchenko
2021-09-07 10:43       ` Jan Beulich
2021-09-07 11:10         ` Oleksandr Andrushchenko
2021-09-07 11:49           ` Jan Beulich
2021-09-07 12:16             ` Oleksandr Andrushchenko
2021-09-07 12:20               ` Jan Beulich
2021-09-07 12:23                 ` Oleksandr Andrushchenko
2021-09-10 21:14   ` Stefano Stabellini
2021-09-03 10:08 ` [PATCH 5/9] vpci/header: Implement guest BAR register handlers Oleksandr Andrushchenko
2021-09-06 14:31   ` Jan Beulich
2021-09-07 13:33     ` Oleksandr Andrushchenko
2021-09-07 16:30       ` Jan Beulich
2021-09-07 17:39         ` Oleksandr Andrushchenko
2021-09-08  9:27           ` Jan Beulich
2021-09-08  9:43             ` Oleksandr Andrushchenko
2021-09-08 10:03               ` Jan Beulich
2021-09-08 13:33                 ` Oleksandr Andrushchenko
2021-09-08 14:46                   ` Jan Beulich
2021-09-08 15:14                     ` Oleksandr Andrushchenko
2021-09-08 15:29                       ` Jan Beulich
2021-09-08 15:35                         ` Oleksandr Andrushchenko
2021-09-03 10:08 ` [PATCH 6/9] vpci/header: Handle p2m range sets per BAR Oleksandr Andrushchenko
2021-09-06 14:47   ` Jan Beulich
2021-09-08 14:31     ` Oleksandr Andrushchenko
2021-09-08 15:00       ` Jan Beulich
2021-09-09  5:22         ` Oleksandr Andrushchenko
2021-09-09  8:24           ` Jan Beulich
2021-09-09  9:12             ` Oleksandr Andrushchenko
2021-09-09  9:39               ` Jan Beulich
2021-09-09 10:03                 ` Oleksandr Andrushchenko
2021-09-09 10:46                   ` Jan Beulich
2021-09-09 11:30                     ` Oleksandr Andrushchenko
2021-09-09 11:51                       ` Jan Beulich
2021-09-03 10:08 ` [PATCH 7/9] vpci/header: program p2m with guest BAR view Oleksandr Andrushchenko
2021-09-06 14:51   ` Jan Beulich
2021-09-09  6:13     ` Oleksandr Andrushchenko
2021-09-09  8:26       ` Jan Beulich
2021-09-09  9:16         ` Oleksandr Andrushchenko
2021-09-09  9:40           ` Jan Beulich
2021-09-09  9:53             ` Oleksandr Andrushchenko
2021-09-03 10:08 ` [PATCH 8/9] vpci/header: Reset the command register when adding devices Oleksandr Andrushchenko
2021-09-06 14:55   ` Jan Beulich
2021-09-07  7:43     ` Oleksandr Andrushchenko
2021-09-07  8:00       ` Jan Beulich
2021-09-07  8:18         ` Oleksandr Andrushchenko
2021-09-07  8:49           ` Jan Beulich
2021-09-07  9:07             ` Oleksandr Andrushchenko
2021-09-07  9:19               ` Jan Beulich
2021-09-07  9:52                 ` Oleksandr Andrushchenko
2021-09-07 10:06                   ` Jan Beulich
2021-09-09  8:39                     ` Oleksandr Andrushchenko
2021-09-09  8:43                       ` Jan Beulich
2021-09-09  8:50                         ` Oleksandr Andrushchenko
2021-09-09  9:21                           ` Jan Beulich
2021-09-09 11:48                             ` Oleksandr Andrushchenko
2021-09-09 11:53                               ` Jan Beulich
2021-09-09 12:42                                 ` Oleksandr Andrushchenko
2021-09-09 12:47                                   ` Jan Beulich
2021-09-09 12:48                                     ` Oleksandr Andrushchenko
2021-09-09 13:17                                     ` Oleksandr Andrushchenko
2021-09-09 11:48                             ` Oleksandr Andrushchenko
2021-09-03 10:08 ` [PATCH 9/9] vpci/header: Use pdev's domain instead of vCPU Oleksandr Andrushchenko
2021-09-06 14:57   ` Jan Beulich
2021-09-09  4:23     ` Oleksandr Andrushchenko
