[PATCH v6 0/7] MSI-X support with qemu in stubdomain, and other related changes

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v6 0/7] MSI-X support with qemu in stubdomain, and other related changes
@ 2024-04-26 17:53 Marek Marczykowski-Górecki
  2024-04-26 17:53 ` [PATCH v6 1/7] x86/msi: passthrough all MSI-X vector ctrl writes to device model Marek Marczykowski-Górecki
                   ` (6 more replies)
  0 siblings, 7 replies; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-04-26 17:53 UTC (permalink / raw)
  To: xen-devel; +Cc: Marek Marczykowski-Górecki

This series includes changes to make MSI-X working with Linux stubdomain and
especially Intel Wifi 6 AX210 card. This takes care of remaining reasons for
QEMU to access /dev/mem, but also the Intel Wifi card violating spec by putting
some registers on the same page as the MSI-X table.
Besides the stubdomain case (of which I care more), this is also necessary for
PCI-passthrough to work with lockdown enabled in dom0 (when QEMU runs in dom0).

See individual patches for details.

This series include also tests for MSI-X using new approach (by preventing QEMU
access to /dev/mem). But for it to work, it needs QEMU change that
makes use of the changes introduced here. It can be seen at
https://github.com/marmarek/qemu/commits/msix

Here is the pipeline that used the QEMU fork above:
https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1269664578

Marek Marczykowski-Górecki (7):
  x86/msi: passthrough all MSI-X vector ctrl writes to device model
  x86/msi: Extend per-domain/device warning mechanism
  x86/hvm: Allow access to registers on the same page as MSI-X table
  automation: prevent QEMU access to /dev/mem in PCI passthrough tests
  automation: switch to a wifi card on ADL system
  [DO NOT APPLY] switch to qemu fork
  [DO NOT APPLY] switch to alternative artifact repo

 Config.mk                                           |   4 +-
 automation/gitlab-ci/build.yaml                     |   4 +-
 automation/gitlab-ci/test.yaml                      |   4 +-
 automation/scripts/qubes-x86-64.sh                  |   9 +-
 automation/tests-artifacts/alpine/3.18.dockerfile   |   7 +-
 automation/tests-artifacts/kernel/6.1.19.dockerfile |   2 +-
 xen/arch/x86/hvm/vmsi.c                             | 215 ++++++++++++-
 xen/arch/x86/include/asm/msi.h                      |  22 +-
 xen/arch/x86/msi.c                                  |  46 ++-
 xen/common/kernel.c                                 |   1 +-
 xen/include/public/features.h                       |   8 +-
 11 files changed, 300 insertions(+), 22 deletions(-)

base-commit: 7846f7699fea25502061a05ea847e942ea624f12
-- 
git-series 0.9.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v6 1/7] x86/msi: passthrough all MSI-X vector ctrl writes to device model
  2024-04-26 17:53 [PATCH v6 0/7] MSI-X support with qemu in stubdomain, and other related changes Marek Marczykowski-Górecki
@ 2024-04-26 17:53 ` Marek Marczykowski-Górecki
  2024-04-26 17:53 ` [PATCH v6 2/7] x86/msi: Extend per-domain/device warning mechanism Marek Marczykowski-Górecki
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-04-26 17:53 UTC (permalink / raw)
  To: xen-devel
  Cc: Marek Marczykowski-Górecki, Jan Beulich, Andrew Cooper,
	Roger Pau Monné,
	George Dunlap, Julien Grall, Stefano Stabellini

QEMU needs to know whether clearing maskbit of a vector is really
clearing, or was already cleared before. Currently Xen sends only
clearing that bit to the device model, but not setting it, so QEMU
cannot detect it. Because of that, QEMU is working this around by
checking via /dev/mem, but that isn't the proper approach.

Give all necessary information to QEMU by passing all ctrl writes,
including masking a vector. Advertise the new behavior via
XENVER_get_features, so QEMU can know it doesn't need to access /dev/mem
anymore.

While this commit doesn't move the whole maskbit handling to QEMU (as
discussed on xen-devel as one of the possibilities), it is a necessary
first step anyway. Including telling QEMU it will get all the required
information to do so. The actual implementation would need to include:
 - a hypercall for QEMU to control just maskbit (without (re)binding the
   interrupt again
 - a methor for QEMU to tell Xen it will actually do the work
Those are not part of this series.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
I did not added any control to enable/disable this new behavior (as
Roger have suggested for possible non-QEMU ioreqs). I don't see how the
new behavior could be problematic for some existing ioreq server (they
already received writes to those addresses, just not all of them),
but if that's really necessary, I can probably add a command line option
to restore previous behavior system-wide.

Changes in v5:
- announce the feature only on x86
- style fixes

Changes in v4:
- ignore unaligned writes with X86EMUL_OKAY
- restructure the code to forward all writes in _msixtbl_write() instead
  of manipulating return value of msixtbl_write() - this makes
  WRITE_LEN4_COMPLETION special case unnecessary
- advertise the changed behavior via XENVER_get_features instead of DMOP
v3:
 - advertise changed behavior in XEN_DMOP_get_ioreq_server_info - make
   "flags" parameter IN/OUT
 - move len check back to msixtbl_write() - will be needed there anyway
   in a later patch
v2:
 - passthrough quad writes to emulator too (Jan)
 - (ab)use len==0 for write len=4 completion (Jan), but add descriptive
   #define for this magic value
---
 xen/arch/x86/hvm/vmsi.c       | 19 ++++++++++++++-----
 xen/common/kernel.c           |  1 +
 xen/include/public/features.h |  8 ++++++++
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index adbac965f9f7..999917983789 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -283,8 +283,8 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
     unsigned long flags;
     struct irq_desc *desc;
 
-    if ( (len != 4 && len != 8) || (address & (len - 1)) )
-        return r;
+    if ( !IS_ALIGNED(address, len) )
+        return X86EMUL_OKAY;
 
     rcu_read_lock(&msixtbl_rcu_lock);
 
@@ -345,8 +345,7 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
 
 unlock:
     spin_unlock_irqrestore(&desc->lock, flags);
-    if ( len == 4 )
-        r = X86EMUL_OKAY;
+    r = X86EMUL_OKAY;
 
 out:
     rcu_read_unlock(&msixtbl_rcu_lock);
@@ -357,7 +356,17 @@ static int cf_check _msixtbl_write(
     const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
     uint64_t val)
 {
-    return msixtbl_write(current, address, len, val);
+    /* Ignore invalid length or unaligned writes. */
+    if ( (len != 4 && len != 8) || !IS_ALIGNED(address, len) )
+        return X86EMUL_OKAY;
+
+    /*
+     * This function returns X86EMUL_UNHANDLEABLE even if write is properly
+     * handled, to propagate it to the device model (so it can keep its
+     * internal state in sync).
+     */
+    msixtbl_write(current, address, len, val);
+    return X86EMUL_UNHANDLEABLE;
 }
 
 static bool cf_check msixtbl_range(
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index 08dbaa2a054c..b44b2439ca8e 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -637,6 +637,7 @@ long do_xen_version(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
                 fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) |
                              (1U << XENFEAT_hvm_callback_vector) |
                              (has_pirq(d) ? (1U << XENFEAT_hvm_pirqs) : 0);
+            fi.submap |= (1U << XENFEAT_dm_msix_all_writes);
 #endif
             if ( !paging_mode_translate(d) || is_domain_direct_mapped(d) )
                 fi.submap |= (1U << XENFEAT_direct_mapped);
diff --git a/xen/include/public/features.h b/xen/include/public/features.h
index 4437f25d2532..880193094713 100644
--- a/xen/include/public/features.h
+++ b/xen/include/public/features.h
@@ -120,6 +120,14 @@
 #define XENFEAT_runstate_phys_area        18
 #define XENFEAT_vcpu_time_phys_area       19
 
+/*
+ * If set, Xen will passthrough all MSI-X vector ctrl writes to device model,
+ * not only those unmasking an entry. This allows device model to properly keep
+ * track of the MSI-X table without having to read it from the device behind
+ * Xen's backs. This information is relevant only for device models.
+ */
+#define XENFEAT_dm_msix_all_writes        20
+
 #define XENFEAT_NR_SUBMAPS 1
 
 #endif /* __XEN_PUBLIC_FEATURES_H__ */
-- 
git-series 0.9.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v6 2/7] x86/msi: Extend per-domain/device warning mechanism
  2024-04-26 17:53 [PATCH v6 0/7] MSI-X support with qemu in stubdomain, and other related changes Marek Marczykowski-Górecki
  2024-04-26 17:53 ` [PATCH v6 1/7] x86/msi: passthrough all MSI-X vector ctrl writes to device model Marek Marczykowski-Górecki
@ 2024-04-26 17:53 ` Marek Marczykowski-Górecki
  2024-04-30 14:58   ` Jan Beulich
  2024-04-26 17:54 ` [PATCH v6 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table Marek Marczykowski-Górecki
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-04-26 17:53 UTC (permalink / raw)
  To: xen-devel
  Cc: Marek Marczykowski-Górecki, Jan Beulich, Andrew Cooper,
	Roger Pau Monné

The arch_msix struct had a single "warned" field with a domid for which
warning was issued. Upcoming patch will need similar mechanism for few
more warnings, so change it to save a bit field of issued warnings.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
---
Changes in v6:
- add MSIX_CHECK_WARN macro (Jan)
- drop struct name from warned_kind union (Jan)

New in v5
---
 xen/arch/x86/include/asm/msi.h | 17 ++++++++++++++++-
 xen/arch/x86/msi.c             |  5 +----
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
index 997ccb87be0c..bcfdfd35345d 100644
--- a/xen/arch/x86/include/asm/msi.h
+++ b/xen/arch/x86/include/asm/msi.h
@@ -208,6 +208,15 @@ struct msg_address {
                                        PCI_MSIX_ENTRY_SIZE + \
                                        (~PCI_MSIX_BIRMASK & (PAGE_SIZE - 1)))
 
+#define MSIX_CHECK_WARN(msix, domid, which) ({ \
+    if ( (msix)->warned_domid != (domid) ) \
+    { \
+        (msix)->warned_domid = (domid); \
+        (msix)->warned_kind.all = 0; \
+    } \
+    (msix)->warned_kind.which ? false : ((msix)->warned_kind.which = true); \
+})
+
 struct arch_msix {
     unsigned int nr_entries, used_entries;
     struct {
@@ -217,7 +226,13 @@ struct arch_msix {
     int table_idx[MAX_MSIX_TABLE_PAGES];
     spinlock_t table_lock;
     bool host_maskall, guest_maskall;
-    domid_t warned;
+    domid_t warned_domid;
+    union {
+        uint8_t all;
+        struct {
+            bool maskall                   : 1;
+        };
+    } warned_kind;
 };
 
 void early_msi_init(void);
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index e721aaf5c001..42c793426da3 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -364,13 +364,10 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
             domid_t domid = pdev->domain->domain_id;
 
             maskall = true;
-            if ( pdev->msix->warned != domid )
-            {
-                pdev->msix->warned = domid;
+            if ( MSIX_CHECK_WARN(pdev->msix, domid, maskall) )
                 printk(XENLOG_G_WARNING
                        "cannot mask IRQ %d: masking MSI-X on Dom%d's %pp\n",
                        desc->irq, domid, &pdev->sbdf);
-            }
         }
         pdev->msix->host_maskall = maskall;
         if ( maskall || pdev->msix->guest_maskall )
-- 
git-series 0.9.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v6 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table
  2024-04-26 17:53 [PATCH v6 0/7] MSI-X support with qemu in stubdomain, and other related changes Marek Marczykowski-Górecki
  2024-04-26 17:53 ` [PATCH v6 1/7] x86/msi: passthrough all MSI-X vector ctrl writes to device model Marek Marczykowski-Górecki
  2024-04-26 17:53 ` [PATCH v6 2/7] x86/msi: Extend per-domain/device warning mechanism Marek Marczykowski-Górecki
@ 2024-04-26 17:54 ` Marek Marczykowski-Górecki
  2024-04-30 15:03   ` Jan Beulich
  2024-05-03  8:33   ` Roger Pau Monné
  2024-04-26 17:54 ` [PATCH v6 4/7] automation: prevent QEMU access to /dev/mem in PCI passthrough tests Marek Marczykowski-Górecki
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-04-26 17:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Marek Marczykowski-Górecki, Jan Beulich, Andrew Cooper,
	Roger Pau Monné

Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers
on the same page as MSI-X table. Device model (especially one in
stubdomain) cannot really handle those, as direct writes to that page is
refused (page is on the mmio_ro_ranges list). Instead, extend
msixtbl_mmio_ops to handle such accesses too.

Doing this, requires correlating read/write location with guest
of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest,
it requires msixtbl_entry->gtable, which is HVM-only. Similar feature
for PV would need to be done separately.

This will be also used to read Pending Bit Array, if it lives on the same
page, making QEMU not needing /dev/mem access at all (especially helpful
with lockdown enabled in dom0). If PBA lives on another page, QEMU will
map it to the guest directly.
If PBA lives on the same page, discard writes and log a message.
Technically, writes outside of PBA could be allowed, but at this moment
the precise location of PBA isn't saved, and also no known device abuses
the spec in this way (at least yet).

To access those registers, msixtbl_mmio_ops need the relevant page
mapped. MSI handling already has infrastructure for that, using fixmap,
so try to map first/last page of the MSI-X table (if necessary) and save
their fixmap indexes. Note that msix_get_fixmap() does reference
counting and reuses existing mapping, so just call it directly, even if
the page was mapped before. Also, it uses a specific range of fixmap
indexes which doesn't include 0, so use 0 as default ("not mapped")
value - which simplifies code a bit.

GCC 12.2.1 gets confused about 'desc' variable:

    arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’:
    arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized]
      553 |     if ( desc )
          |        ^
    arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here
      537 |     const struct msi_desc *desc;
          |                            ^~~~

It's conditional initialization is actually correct (in the case where
it isn't initialized, function returns early), but to avoid
build failure initialize it explicitly to NULL anyway.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
---
Changes in v6:
- use MSIX_CHECK_WARN macro
- extend assert on fixmap_idx
- add break in default label, after ASSERT_UNREACHABLE(), and move
  setting default there
- style fixes
Changes in v5:
- style fixes
- include GCC version in the commit message
- warn only once (per domain, per device) about failed adjacent access
Changes in v4:
- drop same_page parameter of msixtbl_find_entry(), distinguish two
  cases in relevant callers
- rename adj_access_table_idx to adj_access_idx
- code style fixes
- drop alignment check in adjacent_{read,write}() - all callers already
  have it earlier
- delay mapping first/last MSI-X pages until preparing device for a
  passthrough
v3:
 - merge handling into msixtbl_mmio_ops
 - extend commit message
v2:
 - adjust commit message
 - pass struct domain to msixtbl_page_handler_get_hwaddr()
 - reduce local variables used only once
 - log a warning if write is forbidden if MSI-X and PBA lives on the same
   page
 - do not passthrough unaligned accesses
 - handle accesses both before and after MSI-X table
---
 xen/arch/x86/hvm/vmsi.c        | 200 ++++++++++++++++++++++++++++++++--
 xen/arch/x86/include/asm/msi.h |   5 +-
 xen/arch/x86/msi.c             |  41 +++++++-
 3 files changed, 236 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 999917983789..230e3a5dee3f 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -180,6 +180,10 @@ static bool msixtbl_initialised(const struct domain *d)
     return d->arch.hvm.msixtbl_list.next;
 }
 
+/*
+ * Lookup an msixtbl_entry on the same page as given addr. It's up to the
+ * caller to check if address is strictly part of the table - if relevant.
+ */
 static struct msixtbl_entry *msixtbl_find_entry(
     struct vcpu *v, unsigned long addr)
 {
@@ -187,8 +191,8 @@ static struct msixtbl_entry *msixtbl_find_entry(
     struct domain *d = v->domain;
 
     list_for_each_entry( entry, &d->arch.hvm.msixtbl_list, list )
-        if ( addr >= entry->gtable &&
-             addr < entry->gtable + entry->table_len )
+        if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) &&
+             PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) )
             return entry;
 
     return NULL;
@@ -213,6 +217,138 @@ static struct msi_desc *msixtbl_addr_to_desc(
     return NULL;
 }
 
+/*
+ * Returns:
+ *  - ADJACENT_DONT_HANDLE if no handling should be done
+ *  - ADJACENT_DISCARD_WRITE if write should be discarded
+ *  - a fixmap idx to use for handling
+ */
+#define ADJACENT_DONT_HANDLE UINT_MAX
+#define ADJACENT_DISCARD_WRITE (UINT_MAX - 1)
+static unsigned int adjacent_handle(
+    const struct msixtbl_entry *entry, unsigned long addr, bool write)
+{
+    unsigned int adj_type;
+    struct arch_msix *msix;
+
+    if ( !entry || !entry->pdev )
+        return ADJACENT_DONT_HANDLE;
+
+    if ( PFN_DOWN(addr) == PFN_DOWN(entry->gtable) && addr < entry->gtable )
+        adj_type = ADJ_IDX_FIRST;
+    else if ( PFN_DOWN(addr) == PFN_DOWN(entry->gtable + entry->table_len - 1) &&
+              addr >= entry->gtable + entry->table_len )
+        adj_type = ADJ_IDX_LAST;
+    else
+        return ADJACENT_DONT_HANDLE;
+
+    msix = entry->pdev->msix;
+    ASSERT(msix);
+
+    if ( !msix->adj_access_idx[adj_type] )
+    {
+        if ( MSIX_CHECK_WARN(msix, entry->pdev->domain->domain_id,
+                             adjacent_not_initialized) )
+            gprintk(XENLOG_WARNING,
+                    "Page for adjacent(%d) MSI-X table access not initialized for %pp (addr %#lx, gtable %#lx\n",
+                    adj_type, &entry->pdev->sbdf, addr, entry->gtable);
+        return ADJACENT_DONT_HANDLE;
+    }
+
+    /* If PBA lives on the same page too, discard writes. */
+    if ( write &&
+         ((adj_type == ADJ_IDX_LAST &&
+           msix->table.last == msix->pba.first) ||
+          (adj_type == ADJ_IDX_FIRST &&
+           msix->table.first == msix->pba.last)) )
+    {
+        if ( MSIX_CHECK_WARN(msix, entry->pdev->domain->domain_id,
+                             adjacent_pba) )
+            gprintk(XENLOG_WARNING,
+                    "MSI-X table and PBA of %pp live on the same page, "
+                    "writing to other registers there is not implemented\n",
+                    &entry->pdev->sbdf);
+        return ADJACENT_DISCARD_WRITE;
+    }
+
+    return msix->adj_access_idx[adj_type];
+}
+
+static int adjacent_read(
+    unsigned int fixmap_idx,
+    paddr_t address, unsigned int len, uint64_t *pval)
+{
+    const void __iomem *hwaddr;
+
+    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);
+
+    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);
+
+    switch ( len )
+    {
+    case 1:
+        *pval = readb(hwaddr);
+        break;
+
+    case 2:
+        *pval = readw(hwaddr);
+        break;
+
+    case 4:
+        *pval = readl(hwaddr);
+        break;
+
+    case 8:
+        *pval = readq(hwaddr);
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+        *pval = ~0UL;
+        break;
+    }
+
+    return X86EMUL_OKAY;
+}
+
+static int adjacent_write(
+    unsigned int fixmap_idx,
+    paddr_t address, unsigned int len, uint64_t val)
+{
+    void __iomem *hwaddr;
+
+    if ( fixmap_idx == ADJACENT_DISCARD_WRITE )
+        return X86EMUL_OKAY;
+
+    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);
+
+    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);
+
+    switch ( len )
+    {
+    case 1:
+        writeb(val, hwaddr);
+        break;
+
+    case 2:
+        writew(val, hwaddr);
+        break;
+
+    case 4:
+        writel(val, hwaddr);
+        break;
+
+    case 8:
+        writeq(val, hwaddr);
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+    }
+
+    return X86EMUL_OKAY;
+}
+
 static int cf_check msixtbl_read(
     const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
     uint64_t *pval)
@@ -220,9 +356,10 @@ static int cf_check msixtbl_read(
     unsigned long offset;
     struct msixtbl_entry *entry;
     unsigned int nr_entry, index;
+    unsigned int adjacent_fixmap;
     int r = X86EMUL_UNHANDLEABLE;
 
-    if ( (len != 4 && len != 8) || (address & (len - 1)) )
+    if ( !IS_ALIGNED(address, len) )
         return r;
 
     rcu_read_lock(&msixtbl_rcu_lock);
@@ -230,6 +367,21 @@ static int cf_check msixtbl_read(
     entry = msixtbl_find_entry(current, address);
     if ( !entry )
         goto out;
+
+    adjacent_fixmap = adjacent_handle(entry, address, false);
+    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
+    {
+        r = adjacent_read(adjacent_fixmap, address, len, pval);
+        goto out;
+    }
+
+    if ( address < entry->gtable ||
+         address >= entry->gtable + entry->table_len )
+        goto out;
+
+    if ( len != 4 && len != 8 )
+        goto out;
+
     offset = address & (PCI_MSIX_ENTRY_SIZE - 1);
 
     if ( offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET )
@@ -282,6 +434,7 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
     int r = X86EMUL_UNHANDLEABLE;
     unsigned long flags;
     struct irq_desc *desc;
+    unsigned int adjacent_fixmap;
 
     if ( !IS_ALIGNED(address, len) )
         return X86EMUL_OKAY;
@@ -291,6 +444,21 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
     entry = msixtbl_find_entry(v, address);
     if ( !entry )
         goto out;
+
+    adjacent_fixmap = adjacent_handle(entry, address, true);
+    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
+    {
+        r = adjacent_write(adjacent_fixmap, address, len, val);
+        goto out;
+    }
+
+    if ( address < entry->gtable ||
+         address >= entry->gtable + entry->table_len )
+        goto out;
+
+    if ( len != 4 && len != 8 )
+        goto out;
+
     nr_entry = array_index_nospec(((address - entry->gtable) /
                                    PCI_MSIX_ENTRY_SIZE),
                                   MAX_MSIX_TABLE_ENTRIES);
@@ -356,8 +524,8 @@ static int cf_check _msixtbl_write(
     const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
     uint64_t val)
 {
-    /* Ignore invalid length or unaligned writes. */
-    if ( (len != 4 && len != 8) || !IS_ALIGNED(address, len) )
+    /* Ignore unaligned writes. */
+    if ( !IS_ALIGNED(address, len) )
         return X86EMUL_OKAY;
 
     /*
@@ -374,14 +542,22 @@ static bool cf_check msixtbl_range(
 {
     struct vcpu *curr = current;
     unsigned long addr = r->addr;
-    const struct msi_desc *desc;
+    const struct msixtbl_entry *entry;
+    const struct msi_desc *desc = NULL;
+    unsigned int adjacent_fixmap;
 
     ASSERT(r->type == IOREQ_TYPE_COPY);
 
     rcu_read_lock(&msixtbl_rcu_lock);
-    desc = msixtbl_addr_to_desc(msixtbl_find_entry(curr, addr), addr);
+    entry = msixtbl_find_entry(curr, addr);
+    adjacent_fixmap = adjacent_handle(entry, addr, false);
+    if ( adjacent_fixmap == ADJACENT_DONT_HANDLE )
+        desc = msixtbl_addr_to_desc(entry, addr);
     rcu_read_unlock(&msixtbl_rcu_lock);
 
+    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
+        return 1;
+
     if ( desc )
         return 1;
 
@@ -622,12 +798,16 @@ void msix_write_completion(struct vcpu *v)
          v->arch.hvm.hvm_io.msix_snoop_gpa )
     {
         unsigned int token = hvmemul_cache_disable(v);
-        const struct msi_desc *desc;
+        const struct msi_desc *desc = NULL;
+        const struct msixtbl_entry *entry;
         uint32_t data;
 
         rcu_read_lock(&msixtbl_rcu_lock);
-        desc = msixtbl_addr_to_desc(msixtbl_find_entry(v, snoop_addr),
-                                    snoop_addr);
+        entry = msixtbl_find_entry(v, snoop_addr);
+        if ( entry &&
+             snoop_addr >= entry->gtable &&
+             snoop_addr < entry->gtable + entry->table_len )
+            desc = msixtbl_addr_to_desc(entry, snoop_addr);
         rcu_read_unlock(&msixtbl_rcu_lock);
 
         if ( desc &&
diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
index bcfdfd35345d..923b730d48b8 100644
--- a/xen/arch/x86/include/asm/msi.h
+++ b/xen/arch/x86/include/asm/msi.h
@@ -224,6 +224,9 @@ struct arch_msix {
     } table, pba;
     int table_refcnt[MAX_MSIX_TABLE_PAGES];
     int table_idx[MAX_MSIX_TABLE_PAGES];
+#define ADJ_IDX_FIRST 0
+#define ADJ_IDX_LAST  1
+    unsigned int adj_access_idx[2];
     spinlock_t table_lock;
     bool host_maskall, guest_maskall;
     domid_t warned_domid;
@@ -231,6 +234,8 @@ struct arch_msix {
         uint8_t all;
         struct {
             bool maskall                   : 1;
+            bool adjacent_not_initialized  : 1;
+            bool adjacent_pba              : 1;
         };
     } warned_kind;
 };
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 42c793426da3..c77b81896269 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -913,6 +913,36 @@ static int msix_capability_init(struct pci_dev *dev,
         list_add_tail(&entry->list, &dev->msi_list);
         *desc = entry;
     }
+    else
+    {
+        /*
+         * If the MSI-X table doesn't start at the page boundary, map the first page for
+         * passthrough accesses.
+         */
+        if ( PAGE_OFFSET(table_paddr) )
+        {
+            int idx = msix_get_fixmap(msix, table_paddr, table_paddr);
+
+            if ( idx > 0 )
+                msix->adj_access_idx[ADJ_IDX_FIRST] = idx;
+            else
+                gprintk(XENLOG_ERR, "Failed to map first MSI-X table page: %d\n", idx);
+        }
+        /*
+         * If the MSI-X table doesn't end on the page boundary, map the last page
+         * for passthrough accesses.
+         */
+        if ( PAGE_OFFSET(table_paddr + msix->nr_entries * PCI_MSIX_ENTRY_SIZE) )
+        {
+            uint64_t entry_paddr = table_paddr + msix->nr_entries * PCI_MSIX_ENTRY_SIZE;
+            int idx = msix_get_fixmap(msix, table_paddr, entry_paddr);
+
+            if ( idx > 0 )
+                msix->adj_access_idx[ADJ_IDX_LAST] = idx;
+            else
+                gprintk(XENLOG_ERR, "Failed to map last MSI-X table page: %d\n", idx);
+        }
+    }
 
     if ( !msix->used_entries )
     {
@@ -1079,6 +1109,17 @@ static void _pci_cleanup_msix(struct arch_msix *msix)
         msix->table.first = 0;
         msix->table.last = 0;
 
+        if ( msix->adj_access_idx[ADJ_IDX_FIRST] )
+        {
+            msix_put_fixmap(msix, msix->adj_access_idx[ADJ_IDX_FIRST]);
+            msix->adj_access_idx[ADJ_IDX_FIRST] = 0;
+        }
+        if ( msix->adj_access_idx[ADJ_IDX_LAST] )
+        {
+            msix_put_fixmap(msix, msix->adj_access_idx[ADJ_IDX_LAST]);
+            msix->adj_access_idx[ADJ_IDX_LAST] = 0;
+        }
+
         if ( rangeset_remove_range(mmio_ro_ranges, msix->pba.first,
                                    msix->pba.last) )
             WARN();
-- 
git-series 0.9.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v6 4/7] automation: prevent QEMU access to /dev/mem in PCI passthrough tests
  2024-04-26 17:53 [PATCH v6 0/7] MSI-X support with qemu in stubdomain, and other related changes Marek Marczykowski-Górecki
                   ` (2 preceding siblings ...)
  2024-04-26 17:54 ` [PATCH v6 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table Marek Marczykowski-Górecki
@ 2024-04-26 17:54 ` Marek Marczykowski-Górecki
  2024-04-26 17:54 ` [PATCH v6 5/7] automation: switch to a wifi card on ADL system Marek Marczykowski-Górecki
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-04-26 17:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Marek Marczykowski-Górecki, Stefano Stabellini, Doug Goldstein

/dev/mem access doesn't work in dom0 in lockdown and in stubdomain.
Simulate this environment with removing /dev/mem device node. Full test
for lockdown and stubdomain will come later, when all requirements will
be in place.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
---
This can be applied only after QEMU change is committed. Otherwise the
test will fail.
---
 automation/scripts/qubes-x86-64.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh
index d81ed7b931cf..7eabc1bd6ad4 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -163,6 +163,8 @@ ifconfig eth0 up
 ifconfig xenbr0 up
 ifconfig xenbr0 192.168.0.1
 
+# ensure QEMU wont have access /dev/mem
+rm -f /dev/mem
 # get domU console content into test log
 tail -F /var/log/xen/console/guest-domU.log 2>/dev/null | sed -e \"s/^/(domU) /\" &
 xl create /etc/xen/domU.cfg
-- 
git-series 0.9.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v6 5/7] automation: switch to a wifi card on ADL system
  2024-04-26 17:53 [PATCH v6 0/7] MSI-X support with qemu in stubdomain, and other related changes Marek Marczykowski-Górecki
                   ` (3 preceding siblings ...)
  2024-04-26 17:54 ` [PATCH v6 4/7] automation: prevent QEMU access to /dev/mem in PCI passthrough tests Marek Marczykowski-Górecki
@ 2024-04-26 17:54 ` Marek Marczykowski-Górecki
  2024-04-26 17:54 ` [PATCH v6 6/7] [DO NOT APPLY] switch to qemu fork Marek Marczykowski-Górecki
  2024-04-26 17:54 ` [PATCH v6 7/7] [DO NOT APPLY] switch to alternative artifact repo Marek Marczykowski-Górecki
  6 siblings, 0 replies; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-04-26 17:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Marek Marczykowski-Górecki, Stefano Stabellini, Doug Goldstein

Switch to a wifi card that has registers on a MSI-X page. This tests the
"x86/hvm: Allow writes to registers on the same page as MSI-X table"
feature. Switch it only for HVM test, because MSI-X adjacent write is
not supported on PV.

This requires also including drivers and firmware in system for tests.
Remove firmware unrelated to the test, to not increase initrd size too
much (all firmware takes over 100MB compressed).
And finally adjusts test script to handle not only eth0 as a test device,
but also wlan0 and connect it to the wifi network.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
---
This needs two new gitlab variables: WIFI_HW2_SSID and WIFI_HW2_PSK. I'll
provide them in private.

This change requires rebuilding test containers.

This can be applied only after QEMU change is committed. Otherwise the
test will fail.
---
 automation/gitlab-ci/test.yaml                      | 4 ++++
 automation/scripts/qubes-x86-64.sh                  | 7 +++++++
 automation/tests-artifacts/alpine/3.18.dockerfile   | 7 +++++++
 automation/tests-artifacts/kernel/6.1.19.dockerfile | 2 ++
 4 files changed, 20 insertions(+)

diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index 8b7b2e4da92d..23a183fad967 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -193,6 +193,10 @@ adl-pci-pv-x86-64-gcc-debug:
 
 adl-pci-hvm-x86-64-gcc-debug:
   extends: .adl-x86-64
+  variables:
+    PCIDEV: "00:14.3"
+    WIFI_SSID: "$WIFI_HW2_SSID"
+    WIFI_PSK: "$WIFI_HW2_PSK"
   script:
     - ./automation/scripts/qubes-x86-64.sh pci-hvm 2>&1 | tee ${LOGFILE}
   needs:
diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh
index 7eabc1bd6ad4..60498ef1e89a 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -94,6 +94,13 @@ on_reboot = "destroy"
     domU_check="
 set -x -e
 interface=eth0
+if [ -e /sys/class/net/wlan0 ]; then
+    interface=wlan0
+    set +x
+    wpa_passphrase "$WIFI_SSID" "$WIFI_PSK" > /etc/wpa_supplicant.conf
+    set -x
+    wpa_supplicant -B -iwlan0 -c /etc/wpa_supplicant.conf
+fi
 ip link set \"\$interface\" up
 timeout 30s udhcpc -i \"\$interface\"
 pingip=\$(ip -o -4 r show default|cut -f 3 -d ' ')
diff --git a/automation/tests-artifacts/alpine/3.18.dockerfile b/automation/tests-artifacts/alpine/3.18.dockerfile
index 9cde6c9ad4da..c323e266c7da 100644
--- a/automation/tests-artifacts/alpine/3.18.dockerfile
+++ b/automation/tests-artifacts/alpine/3.18.dockerfile
@@ -34,6 +34,13 @@ RUN \
   apk add udev && \
   apk add pciutils && \
   apk add libelf && \
+  apk add wpa_supplicant && \
+  # Select firmware for hardware tests
+  apk add linux-firmware-other && \
+  mkdir /lib/firmware-preserve && \
+  mv /lib/firmware/iwlwifi-so-a0-gf-a0* /lib/firmware-preserve/ && \
+  rm -rf /lib/firmware && \
+  mv /lib/firmware-preserve /lib/firmware && \
   \
   # Xen
   cd / && \
diff --git a/automation/tests-artifacts/kernel/6.1.19.dockerfile b/automation/tests-artifacts/kernel/6.1.19.dockerfile
index 3a4096780d20..84ed5dff23ae 100644
--- a/automation/tests-artifacts/kernel/6.1.19.dockerfile
+++ b/automation/tests-artifacts/kernel/6.1.19.dockerfile
@@ -32,6 +32,8 @@ RUN curl -fsSLO https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-"$LINUX_VERSI
     make xen.config && \
     scripts/config --enable BRIDGE && \
     scripts/config --enable IGC && \
+    scripts/config --enable IWLWIFI && \
+    scripts/config --enable IWLMVM && \
     cp .config .config.orig && \
     cat .config.orig | grep XEN | grep =m |sed 's/=m/=y/g' >> .config && \
     make -j$(nproc) bzImage && \
-- 
git-series 0.9.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v6 6/7] [DO NOT APPLY] switch to qemu fork
  2024-04-26 17:53 [PATCH v6 0/7] MSI-X support with qemu in stubdomain, and other related changes Marek Marczykowski-Górecki
                   ` (4 preceding siblings ...)
  2024-04-26 17:54 ` [PATCH v6 5/7] automation: switch to a wifi card on ADL system Marek Marczykowski-Górecki
@ 2024-04-26 17:54 ` Marek Marczykowski-Górecki
  2024-04-26 17:54 ` [PATCH v6 7/7] [DO NOT APPLY] switch to alternative artifact repo Marek Marczykowski-Górecki
  6 siblings, 0 replies; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-04-26 17:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Marek Marczykowski-Górecki, Andrew Cooper, George Dunlap,
	Jan Beulich, Julien Grall, Stefano Stabellini

This makes tests to use patched QEMU, to actually test the new behavior.
---
 Config.mk | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Config.mk b/Config.mk
index a962f095ca16..5e220a1284e4 100644
--- a/Config.mk
+++ b/Config.mk
@@ -220,8 +220,8 @@ endif
 OVMF_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/ovmf.git
 OVMF_UPSTREAM_REVISION ?= ba91d0292e593df8528b66f99c1b0b14fadc8e16
 
-QEMU_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/qemu-xen.git
-QEMU_UPSTREAM_REVISION ?= master
+QEMU_UPSTREAM_URL ?= https://github.com/marmarek/qemu
+QEMU_UPSTREAM_REVISION ?= origin/msix
 
 MINIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/mini-os.git
 MINIOS_UPSTREAM_REVISION ?= b6a5b4d72b88e5c4faed01f5a44505de022860fc
-- 
git-series 0.9.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v6 7/7] [DO NOT APPLY] switch to alternative artifact repo
  2024-04-26 17:53 [PATCH v6 0/7] MSI-X support with qemu in stubdomain, and other related changes Marek Marczykowski-Górecki
                   ` (5 preceding siblings ...)
  2024-04-26 17:54 ` [PATCH v6 6/7] [DO NOT APPLY] switch to qemu fork Marek Marczykowski-Górecki
@ 2024-04-26 17:54 ` Marek Marczykowski-Górecki
  6 siblings, 0 replies; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-04-26 17:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Marek Marczykowski-Górecki, Doug Goldstein, Stefano Stabellini

For testing, switch to my containers registry that includes containers
rebuilt with changes in this series.
---
 automation/gitlab-ci/build.yaml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
index aac29ee13ab6..6957f06016b7 100644
--- a/automation/gitlab-ci/build.yaml
+++ b/automation/gitlab-ci/build.yaml
@@ -320,7 +320,7 @@ qemu-system-ppc64-8.1.0-ppc64-export:
 
 alpine-3.18-rootfs-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/alpine:3.18
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/alpine:3.18
   script:
     - mkdir binaries && cp /initrd.tar.gz binaries/initrd.tar.gz
   artifacts:
@@ -331,7 +331,7 @@ alpine-3.18-rootfs-export:
 
 kernel-6.1.19-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/kernel:6.1.19
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/kernel:6.1.19
   script:
     - mkdir binaries && cp /bzImage binaries/bzImage
   artifacts:
-- 
git-series 0.9.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 2/7] x86/msi: Extend per-domain/device warning mechanism
  2024-04-26 17:53 ` [PATCH v6 2/7] x86/msi: Extend per-domain/device warning mechanism Marek Marczykowski-Górecki
@ 2024-04-30 14:58   ` Jan Beulich
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Beulich @ 2024-04-30 14:58 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Andrew Cooper, Roger Pau Monné, xen-devel

On 26.04.2024 19:53, Marek Marczykowski-Górecki wrote:
> The arch_msix struct had a single "warned" field with a domid for which
> warning was issued. Upcoming patch will need similar mechanism for few
> more warnings, so change it to save a bit field of issued warnings.
> 
> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
(if that makes sense at all, considering that one fundamental part of it,
the macro, was suggested by me)

However, unlike at other times I'd like this to go in only together with
the following patch (or whatever other 2nd user of the new machinery).

Jan


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table
  2024-04-26 17:54 ` [PATCH v6 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table Marek Marczykowski-Górecki
@ 2024-04-30 15:03   ` Jan Beulich
  2024-05-03  8:33   ` Roger Pau Monné
  1 sibling, 0 replies; 13+ messages in thread
From: Jan Beulich @ 2024-04-30 15:03 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Andrew Cooper, Roger Pau Monné, xen-devel

On 26.04.2024 19:54, Marek Marczykowski-Górecki wrote:
> Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers
> on the same page as MSI-X table. Device model (especially one in
> stubdomain) cannot really handle those, as direct writes to that page is
> refused (page is on the mmio_ro_ranges list). Instead, extend
> msixtbl_mmio_ops to handle such accesses too.
> 
> Doing this, requires correlating read/write location with guest
> of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest,
> it requires msixtbl_entry->gtable, which is HVM-only. Similar feature
> for PV would need to be done separately.
> 
> This will be also used to read Pending Bit Array, if it lives on the same
> page, making QEMU not needing /dev/mem access at all (especially helpful
> with lockdown enabled in dom0). If PBA lives on another page, QEMU will
> map it to the guest directly.
> If PBA lives on the same page, discard writes and log a message.
> Technically, writes outside of PBA could be allowed, but at this moment
> the precise location of PBA isn't saved, and also no known device abuses
> the spec in this way (at least yet).
> 
> To access those registers, msixtbl_mmio_ops need the relevant page
> mapped. MSI handling already has infrastructure for that, using fixmap,
> so try to map first/last page of the MSI-X table (if necessary) and save
> their fixmap indexes. Note that msix_get_fixmap() does reference
> counting and reuses existing mapping, so just call it directly, even if
> the page was mapped before. Also, it uses a specific range of fixmap
> indexes which doesn't include 0, so use 0 as default ("not mapped")
> value - which simplifies code a bit.
> 
> GCC 12.2.1 gets confused about 'desc' variable:
> 
>     arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’:
>     arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized]
>       553 |     if ( desc )
>           |        ^
>     arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here
>       537 |     const struct msi_desc *desc;
>           |                            ^~~~
> 
> It's conditional initialization is actually correct (in the case where
> it isn't initialized, function returns early), but to avoid
> build failure initialize it explicitly to NULL anyway.
> 
> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>

_Without_ the usually implied ack (as indicated before) and with two
small tweaks (which can likely be taken care of while committing):
Reviewed-by: Jan Beulich <jbeulich@suse.com>

> +static int adjacent_read(
> +    unsigned int fixmap_idx,
> +    paddr_t address, unsigned int len, uint64_t *pval)
> +{
> +    const void __iomem *hwaddr;
> +
> +    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);
> +
> +    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);
> +
> +    switch ( len )
> +    {
> +    case 1:
> +        *pval = readb(hwaddr);
> +        break;
> +
> +    case 2:
> +        *pval = readw(hwaddr);
> +        break;
> +
> +    case 4:
> +        *pval = readl(hwaddr);
> +        break;
> +
> +    case 8:
> +        *pval = readq(hwaddr);
> +        break;
> +
> +    default:
> +        ASSERT_UNREACHABLE();
> +        *pval = ~0UL;

Nit: Better ~0ULL here (short of there being UINT64_C()).

> +        break;
> +    }
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static int adjacent_write(
> +    unsigned int fixmap_idx,
> +    paddr_t address, unsigned int len, uint64_t val)
> +{
> +    void __iomem *hwaddr;
> +
> +    if ( fixmap_idx == ADJACENT_DISCARD_WRITE )
> +        return X86EMUL_OKAY;
> +
> +    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);
> +
> +    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);
> +
> +    switch ( len )
> +    {
> +    case 1:
> +        writeb(val, hwaddr);
> +        break;
> +
> +    case 2:
> +        writew(val, hwaddr);
> +        break;
> +
> +    case 4:
> +        writel(val, hwaddr);
> +        break;
> +
> +    case 8:
> +        writeq(val, hwaddr);
> +        break;
> +
> +    default:
> +        ASSERT_UNREACHABLE();
> +    }

There's still a "break;" missing here.

Jan


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table
  2024-04-26 17:54 ` [PATCH v6 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table Marek Marczykowski-Górecki
  2024-04-30 15:03   ` Jan Beulich
@ 2024-05-03  8:33   ` Roger Pau Monné
  2024-05-04  0:48     ` Marek Marczykowski-Górecki
  1 sibling, 1 reply; 13+ messages in thread
From: Roger Pau Monné @ 2024-05-03  8:33 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Fri, Apr 26, 2024 at 07:54:00PM +0200, Marek Marczykowski-Górecki wrote:
> Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers
> on the same page as MSI-X table. Device model (especially one in
> stubdomain) cannot really handle those, as direct writes to that page is
> refused (page is on the mmio_ro_ranges list). Instead, extend
> msixtbl_mmio_ops to handle such accesses too.
> 
> Doing this, requires correlating read/write location with guest
> of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest,
  ^ extra 'of'?
> it requires msixtbl_entry->gtable, which is HVM-only. Similar feature
> for PV would need to be done separately.
> 
> This will be also used to read Pending Bit Array, if it lives on the same
> page, making QEMU not needing /dev/mem access at all (especially helpful
> with lockdown enabled in dom0). If PBA lives on another page, QEMU will
> map it to the guest directly.
> If PBA lives on the same page, discard writes and log a message.
> Technically, writes outside of PBA could be allowed, but at this moment
> the precise location of PBA isn't saved, and also no known device abuses
> the spec in this way (at least yet).
> 
> To access those registers, msixtbl_mmio_ops need the relevant page
> mapped. MSI handling already has infrastructure for that, using fixmap,
> so try to map first/last page of the MSI-X table (if necessary) and save
> their fixmap indexes. Note that msix_get_fixmap() does reference
> counting and reuses existing mapping, so just call it directly, even if
> the page was mapped before. Also, it uses a specific range of fixmap
> indexes which doesn't include 0, so use 0 as default ("not mapped")
> value - which simplifies code a bit.
> 
> GCC 12.2.1 gets confused about 'desc' variable:
> 
>     arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’:
>     arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized]
>       553 |     if ( desc )
>           |        ^
>     arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here
>       537 |     const struct msi_desc *desc;
>           |                            ^~~~
> 
> It's conditional initialization is actually correct (in the case where
> it isn't initialized, function returns early), but to avoid
> build failure initialize it explicitly to NULL anyway.
> 
> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> ---
> Changes in v6:
> - use MSIX_CHECK_WARN macro
> - extend assert on fixmap_idx
> - add break in default label, after ASSERT_UNREACHABLE(), and move
>   setting default there
> - style fixes
> Changes in v5:
> - style fixes
> - include GCC version in the commit message
> - warn only once (per domain, per device) about failed adjacent access
> Changes in v4:
> - drop same_page parameter of msixtbl_find_entry(), distinguish two
>   cases in relevant callers
> - rename adj_access_table_idx to adj_access_idx
> - code style fixes
> - drop alignment check in adjacent_{read,write}() - all callers already
>   have it earlier
> - delay mapping first/last MSI-X pages until preparing device for a
>   passthrough
> v3:
>  - merge handling into msixtbl_mmio_ops
>  - extend commit message
> v2:
>  - adjust commit message
>  - pass struct domain to msixtbl_page_handler_get_hwaddr()
>  - reduce local variables used only once
>  - log a warning if write is forbidden if MSI-X and PBA lives on the same
>    page
>  - do not passthrough unaligned accesses
>  - handle accesses both before and after MSI-X table
> ---
>  xen/arch/x86/hvm/vmsi.c        | 200 ++++++++++++++++++++++++++++++++--
>  xen/arch/x86/include/asm/msi.h |   5 +-
>  xen/arch/x86/msi.c             |  41 +++++++-
>  3 files changed, 236 insertions(+), 10 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 999917983789..230e3a5dee3f 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -180,6 +180,10 @@ static bool msixtbl_initialised(const struct domain *d)
>      return d->arch.hvm.msixtbl_list.next;
>  }
>  
> +/*
> + * Lookup an msixtbl_entry on the same page as given addr. It's up to the
> + * caller to check if address is strictly part of the table - if relevant.
> + */
>  static struct msixtbl_entry *msixtbl_find_entry(
>      struct vcpu *v, unsigned long addr)
>  {
> @@ -187,8 +191,8 @@ static struct msixtbl_entry *msixtbl_find_entry(
>      struct domain *d = v->domain;
>  
>      list_for_each_entry( entry, &d->arch.hvm.msixtbl_list, list )
> -        if ( addr >= entry->gtable &&
> -             addr < entry->gtable + entry->table_len )
> +        if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) &&
> +             PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) )
>              return entry;
>  
>      return NULL;
> @@ -213,6 +217,138 @@ static struct msi_desc *msixtbl_addr_to_desc(
>      return NULL;
>  }
>  
> +/*
> + * Returns:
> + *  - ADJACENT_DONT_HANDLE if no handling should be done
> + *  - ADJACENT_DISCARD_WRITE if write should be discarded
> + *  - a fixmap idx to use for handling
> + */
> +#define ADJACENT_DONT_HANDLE UINT_MAX
> +#define ADJACENT_DISCARD_WRITE (UINT_MAX - 1)

I think this could be simpler, there's no need to signal with so fine
grained detail about the action to be performed.

Any adjacent access to the MSI-X table should be handled by the logic
you are adding, so anything that falls in those ranges should
terminate here.

adjacent_handle() should IMO just return whether the access is
replayed against the hardware, or if it's just dropped.

> +static unsigned int adjacent_handle(
> +    const struct msixtbl_entry *entry, unsigned long addr, bool write)
> +{
> +    unsigned int adj_type;
> +    struct arch_msix *msix;
> +
> +    if ( !entry || !entry->pdev )
> +        return ADJACENT_DONT_HANDLE;
> +
> +    if ( PFN_DOWN(addr) == PFN_DOWN(entry->gtable) && addr < entry->gtable )
> +        adj_type = ADJ_IDX_FIRST;
> +    else if ( PFN_DOWN(addr) == PFN_DOWN(entry->gtable + entry->table_len - 1) &&
> +              addr >= entry->gtable + entry->table_len )
> +        adj_type = ADJ_IDX_LAST;
> +    else
> +        return ADJACENT_DONT_HANDLE;
> +
> +    msix = entry->pdev->msix;
> +    ASSERT(msix);
> +
> +    if ( !msix->adj_access_idx[adj_type] )
> +    {
> +        if ( MSIX_CHECK_WARN(msix, entry->pdev->domain->domain_id,
> +                             adjacent_not_initialized) )
> +            gprintk(XENLOG_WARNING,
> +                    "Page for adjacent(%d) MSI-X table access not initialized for %pp (addr %#lx, gtable %#lx\n",
> +                    adj_type, &entry->pdev->sbdf, addr, entry->gtable);
> +        return ADJACENT_DONT_HANDLE;
> +    }
> +
> +    /* If PBA lives on the same page too, discard writes. */
> +    if ( write &&
> +         ((adj_type == ADJ_IDX_LAST &&
> +           msix->table.last == msix->pba.first) ||
> +          (adj_type == ADJ_IDX_FIRST &&
> +           msix->table.first == msix->pba.last)) )
> +    {
> +        if ( MSIX_CHECK_WARN(msix, entry->pdev->domain->domain_id,
> +                             adjacent_pba) )
> +            gprintk(XENLOG_WARNING,
> +                    "MSI-X table and PBA of %pp live on the same page, "
> +                    "writing to other registers there is not implemented\n",
> +                    &entry->pdev->sbdf);
> +        return ADJACENT_DISCARD_WRITE;
> +    }
> +
> +    return msix->adj_access_idx[adj_type];
> +}
> +
> +static int adjacent_read(
> +    unsigned int fixmap_idx,
> +    paddr_t address, unsigned int len, uint64_t *pval)
> +{
> +    const void __iomem *hwaddr;
> +
> +    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);
> +
> +    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);

IMO adjacent_handle() should be called here (and in adjacent_write()),
and adjacent_{read,write}() called unconditionally from
msixtbl_{read,write}() when an access that doesn't fall in the MSI-X
table is handled.  See comment below in msixtbl_read().

> +
> +    switch ( len )
> +    {
> +    case 1:
> +        *pval = readb(hwaddr);
> +        break;
> +
> +    case 2:
> +        *pval = readw(hwaddr);
> +        break;
> +
> +    case 4:
> +        *pval = readl(hwaddr);
> +        break;
> +
> +    case 8:
> +        *pval = readq(hwaddr);
> +        break;
> +
> +    default:
> +        ASSERT_UNREACHABLE();
> +        *pval = ~0UL;
> +        break;
> +    }
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static int adjacent_write(
> +    unsigned int fixmap_idx,
> +    paddr_t address, unsigned int len, uint64_t val)
> +{
> +    void __iomem *hwaddr;
> +
> +    if ( fixmap_idx == ADJACENT_DISCARD_WRITE )
> +        return X86EMUL_OKAY;
> +
> +    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);

Since you check the idx is sane, shouldn't you also assert idx >=
FIX_MSIX_IO_RESERV_BASE?

> +
> +    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);
> +
> +    switch ( len )
> +    {
> +    case 1:
> +        writeb(val, hwaddr);
> +        break;
> +
> +    case 2:
> +        writew(val, hwaddr);
> +        break;
> +
> +    case 4:
> +        writel(val, hwaddr);
> +        break;
> +
> +    case 8:
> +        writeq(val, hwaddr);
> +        break;
> +
> +    default:
> +        ASSERT_UNREACHABLE();
> +    }
> +
> +    return X86EMUL_OKAY;
> +}
> +
>  static int cf_check msixtbl_read(
>      const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
>      uint64_t *pval)
> @@ -220,9 +356,10 @@ static int cf_check msixtbl_read(
>      unsigned long offset;
>      struct msixtbl_entry *entry;
>      unsigned int nr_entry, index;
> +    unsigned int adjacent_fixmap;
>      int r = X86EMUL_UNHANDLEABLE;
>  
> -    if ( (len != 4 && len != 8) || (address & (len - 1)) )
> +    if ( !IS_ALIGNED(address, len) )
>          return r;
>  
>      rcu_read_lock(&msixtbl_rcu_lock);
> @@ -230,6 +367,21 @@ static int cf_check msixtbl_read(
>      entry = msixtbl_find_entry(current, address);
>      if ( !entry )
>          goto out;
> +
> +    adjacent_fixmap = adjacent_handle(entry, address, false);

This seems overly complicated, but is possible I'm missing some logic.

IMO it would seem way less convoluted to simply do:

entry = msixtbl_find_entry(current, address);
if ( !entry )
    goto out;

if ( address < entry->gtable ||
     address >= entry->gtable + entry->table_len )
{
    adjacent_read(...);
    goto out;
}

And put all the logic in adjacent_{read,write}() directly rather than
having both adjacent_{read,write}() plus adjacent_handle() calls here?

If the access doesn't fall between the boundaries of the MSI-X table
it's either going to be a handled adjacent access, or it's going to be
discarded.

> +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> +    {
> +        r = adjacent_read(adjacent_fixmap, address, len, pval);
> +        goto out;
> +    }
> +
> +    if ( address < entry->gtable ||
> +         address >= entry->gtable + entry->table_len )
> +        goto out;
> +
> +    if ( len != 4 && len != 8 )
> +        goto out;
> +
>      offset = address & (PCI_MSIX_ENTRY_SIZE - 1);
>  
>      if ( offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET )
> @@ -282,6 +434,7 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
>      int r = X86EMUL_UNHANDLEABLE;
>      unsigned long flags;
>      struct irq_desc *desc;
> +    unsigned int adjacent_fixmap;
>  
>      if ( !IS_ALIGNED(address, len) )
>          return X86EMUL_OKAY;
> @@ -291,6 +444,21 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
>      entry = msixtbl_find_entry(v, address);
>      if ( !entry )
>          goto out;
> +
> +    adjacent_fixmap = adjacent_handle(entry, address, true);
> +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> +    {
> +        r = adjacent_write(adjacent_fixmap, address, len, val);
> +        goto out;
> +    }
> +
> +    if ( address < entry->gtable ||
> +         address >= entry->gtable + entry->table_len )
> +        goto out;
> +
> +    if ( len != 4 && len != 8 )
> +        goto out;
> +
>      nr_entry = array_index_nospec(((address - entry->gtable) /
>                                     PCI_MSIX_ENTRY_SIZE),
>                                    MAX_MSIX_TABLE_ENTRIES);
> @@ -356,8 +524,8 @@ static int cf_check _msixtbl_write(
>      const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
>      uint64_t val)
>  {
> -    /* Ignore invalid length or unaligned writes. */
> -    if ( (len != 4 && len != 8) || !IS_ALIGNED(address, len) )
> +    /* Ignore unaligned writes. */
> +    if ( !IS_ALIGNED(address, len) )
>          return X86EMUL_OKAY;
>  
>      /*
> @@ -374,14 +542,22 @@ static bool cf_check msixtbl_range(
>  {
>      struct vcpu *curr = current;
>      unsigned long addr = r->addr;
> -    const struct msi_desc *desc;
> +    const struct msixtbl_entry *entry;
> +    const struct msi_desc *desc = NULL;
> +    unsigned int adjacent_fixmap;
>  
>      ASSERT(r->type == IOREQ_TYPE_COPY);
>  
>      rcu_read_lock(&msixtbl_rcu_lock);
> -    desc = msixtbl_addr_to_desc(msixtbl_find_entry(curr, addr), addr);
> +    entry = msixtbl_find_entry(curr, addr);
> +    adjacent_fixmap = adjacent_handle(entry, addr, false);
> +    if ( adjacent_fixmap == ADJACENT_DONT_HANDLE )
> +        desc = msixtbl_addr_to_desc(entry, addr);

I'm not sure you need adjacent_handle() here, I would think that any
address adjacent to the MSI-X table Xen would want to handle
unconditionally, and hence msixtbl_range() should return true:

entry = msixtbl_find_entry(curr, addr);
if ( !entry )
    return 0;

if ( addr < entry->gtable || addr >= entry->gtable + entry->table_len )
    /* Possibly handle adjacent access. */
    return 1;

desc = msixtbl_find_entry(curr, addr);
...

(Missing the _unlock calls in the chunk above)

There's no other processing that can happen for an adjacent access
unless it's are handled here.  Otherwise such accesses will be
discarded anyway?  Hence better short-circuit the MMIO handler search
earlier.

>      rcu_read_unlock(&msixtbl_rcu_lock);
>  
> +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> +        return 1;
> +
>      if ( desc )
>          return 1;
>  
> @@ -622,12 +798,16 @@ void msix_write_completion(struct vcpu *v)
>           v->arch.hvm.hvm_io.msix_snoop_gpa )
>      {
>          unsigned int token = hvmemul_cache_disable(v);
> -        const struct msi_desc *desc;
> +        const struct msi_desc *desc = NULL;
> +        const struct msixtbl_entry *entry;
>          uint32_t data;
>  
>          rcu_read_lock(&msixtbl_rcu_lock);
> -        desc = msixtbl_addr_to_desc(msixtbl_find_entry(v, snoop_addr),
> -                                    snoop_addr);
> +        entry = msixtbl_find_entry(v, snoop_addr);
> +        if ( entry &&
> +             snoop_addr >= entry->gtable &&
> +             snoop_addr < entry->gtable + entry->table_len )
> +            desc = msixtbl_addr_to_desc(entry, snoop_addr);

This would be easier if you added the MSI-X table boundary checks in
msixtbl_addr_to_desc() itself, rather than open-coding here.  That way
you don't need to modify msix_write_completion() at all.  It also
makes msixtbl_addr_to_desc() more robust in case it gets called with
bogus addresses.

>          rcu_read_unlock(&msixtbl_rcu_lock);
>  
>          if ( desc &&
> diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
> index bcfdfd35345d..923b730d48b8 100644
> --- a/xen/arch/x86/include/asm/msi.h
> +++ b/xen/arch/x86/include/asm/msi.h
> @@ -224,6 +224,9 @@ struct arch_msix {
>      } table, pba;
>      int table_refcnt[MAX_MSIX_TABLE_PAGES];
>      int table_idx[MAX_MSIX_TABLE_PAGES];
> +#define ADJ_IDX_FIRST 0
> +#define ADJ_IDX_LAST  1
> +    unsigned int adj_access_idx[2];
>      spinlock_t table_lock;
>      bool host_maskall, guest_maskall;
>      domid_t warned_domid;
> @@ -231,6 +234,8 @@ struct arch_msix {
>          uint8_t all;
>          struct {
>              bool maskall                   : 1;
> +            bool adjacent_not_initialized  : 1;
> +            bool adjacent_pba              : 1;
>          };
>      } warned_kind;
>  };
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index 42c793426da3..c77b81896269 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -913,6 +913,36 @@ static int msix_capability_init(struct pci_dev *dev,
>          list_add_tail(&entry->list, &dev->msi_list);
>          *desc = entry;
>      }
> +    else
> +    {
> +        /*
> +         * If the MSI-X table doesn't start at the page boundary, map the first page for
> +         * passthrough accesses.
> +         */
> +        if ( PAGE_OFFSET(table_paddr) )
> +        {
> +            int idx = msix_get_fixmap(msix, table_paddr, table_paddr);
> +
> +            if ( idx > 0 )
> +                msix->adj_access_idx[ADJ_IDX_FIRST] = idx;
> +            else
> +                gprintk(XENLOG_ERR, "Failed to map first MSI-X table page: %d\n", idx);
> +        }
> +        /*
> +         * If the MSI-X table doesn't end on the page boundary, map the last page
> +         * for passthrough accesses.
> +         */
> +        if ( PAGE_OFFSET(table_paddr + msix->nr_entries * PCI_MSIX_ENTRY_SIZE) )

This check is correct ....

> +        {
> +            uint64_t entry_paddr = table_paddr + msix->nr_entries * PCI_MSIX_ENTRY_SIZE;

... however for correctness you want to use msix->nr_entries - 1 here,
otherwise you are requesting an address that's past the last MSI-X
table entry, and hence msix_get_fixmap() could refuse to provide an
idx if it ever does boundary checking.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table
  2024-05-03  8:33   ` Roger Pau Monné
@ 2024-05-04  0:48     ` Marek Marczykowski-Górecki
  2024-05-06  7:38       ` Roger Pau Monné
  0 siblings, 1 reply; 13+ messages in thread
From: Marek Marczykowski-Górecki @ 2024-05-04  0:48 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Jan Beulich, Andrew Cooper

[-- Attachment #1: Type: text/plain, Size: 20131 bytes --]

On Fri, May 03, 2024 at 10:33:38AM +0200, Roger Pau Monné wrote:
> On Fri, Apr 26, 2024 at 07:54:00PM +0200, Marek Marczykowski-Górecki wrote:
> > Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers
> > on the same page as MSI-X table. Device model (especially one in
> > stubdomain) cannot really handle those, as direct writes to that page is
> > refused (page is on the mmio_ro_ranges list). Instead, extend
> > msixtbl_mmio_ops to handle such accesses too.
> > 
> > Doing this, requires correlating read/write location with guest
> > of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest,
>   ^ extra 'of'?
> > it requires msixtbl_entry->gtable, which is HVM-only. Similar feature
> > for PV would need to be done separately.
> > 
> > This will be also used to read Pending Bit Array, if it lives on the same
> > page, making QEMU not needing /dev/mem access at all (especially helpful
> > with lockdown enabled in dom0). If PBA lives on another page, QEMU will
> > map it to the guest directly.
> > If PBA lives on the same page, discard writes and log a message.
> > Technically, writes outside of PBA could be allowed, but at this moment
> > the precise location of PBA isn't saved, and also no known device abuses
> > the spec in this way (at least yet).
> > 
> > To access those registers, msixtbl_mmio_ops need the relevant page
> > mapped. MSI handling already has infrastructure for that, using fixmap,
> > so try to map first/last page of the MSI-X table (if necessary) and save
> > their fixmap indexes. Note that msix_get_fixmap() does reference
> > counting and reuses existing mapping, so just call it directly, even if
> > the page was mapped before. Also, it uses a specific range of fixmap
> > indexes which doesn't include 0, so use 0 as default ("not mapped")
> > value - which simplifies code a bit.
> > 
> > GCC 12.2.1 gets confused about 'desc' variable:
> > 
> >     arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’:
> >     arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized]
> >       553 |     if ( desc )
> >           |        ^
> >     arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here
> >       537 |     const struct msi_desc *desc;
> >           |                            ^~~~
> > 
> > It's conditional initialization is actually correct (in the case where
> > it isn't initialized, function returns early), but to avoid
> > build failure initialize it explicitly to NULL anyway.
> > 
> > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > ---
> > Changes in v6:
> > - use MSIX_CHECK_WARN macro
> > - extend assert on fixmap_idx
> > - add break in default label, after ASSERT_UNREACHABLE(), and move
> >   setting default there
> > - style fixes
> > Changes in v5:
> > - style fixes
> > - include GCC version in the commit message
> > - warn only once (per domain, per device) about failed adjacent access
> > Changes in v4:
> > - drop same_page parameter of msixtbl_find_entry(), distinguish two
> >   cases in relevant callers
> > - rename adj_access_table_idx to adj_access_idx
> > - code style fixes
> > - drop alignment check in adjacent_{read,write}() - all callers already
> >   have it earlier
> > - delay mapping first/last MSI-X pages until preparing device for a
> >   passthrough
> > v3:
> >  - merge handling into msixtbl_mmio_ops
> >  - extend commit message
> > v2:
> >  - adjust commit message
> >  - pass struct domain to msixtbl_page_handler_get_hwaddr()
> >  - reduce local variables used only once
> >  - log a warning if write is forbidden if MSI-X and PBA lives on the same
> >    page
> >  - do not passthrough unaligned accesses
> >  - handle accesses both before and after MSI-X table
> > ---
> >  xen/arch/x86/hvm/vmsi.c        | 200 ++++++++++++++++++++++++++++++++--
> >  xen/arch/x86/include/asm/msi.h |   5 +-
> >  xen/arch/x86/msi.c             |  41 +++++++-
> >  3 files changed, 236 insertions(+), 10 deletions(-)
> > 
> > diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> > index 999917983789..230e3a5dee3f 100644
> > --- a/xen/arch/x86/hvm/vmsi.c
> > +++ b/xen/arch/x86/hvm/vmsi.c
> > @@ -180,6 +180,10 @@ static bool msixtbl_initialised(const struct domain *d)
> >      return d->arch.hvm.msixtbl_list.next;
> >  }
> >  
> > +/*
> > + * Lookup an msixtbl_entry on the same page as given addr. It's up to the
> > + * caller to check if address is strictly part of the table - if relevant.
> > + */
> >  static struct msixtbl_entry *msixtbl_find_entry(
> >      struct vcpu *v, unsigned long addr)
> >  {
> > @@ -187,8 +191,8 @@ static struct msixtbl_entry *msixtbl_find_entry(
> >      struct domain *d = v->domain;
> >  
> >      list_for_each_entry( entry, &d->arch.hvm.msixtbl_list, list )
> > -        if ( addr >= entry->gtable &&
> > -             addr < entry->gtable + entry->table_len )
> > +        if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) &&
> > +             PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) )
> >              return entry;
> >  
> >      return NULL;
> > @@ -213,6 +217,138 @@ static struct msi_desc *msixtbl_addr_to_desc(
> >      return NULL;
> >  }
> >  
> > +/*
> > + * Returns:
> > + *  - ADJACENT_DONT_HANDLE if no handling should be done
> > + *  - ADJACENT_DISCARD_WRITE if write should be discarded
> > + *  - a fixmap idx to use for handling
> > + */
> > +#define ADJACENT_DONT_HANDLE UINT_MAX
> > +#define ADJACENT_DISCARD_WRITE (UINT_MAX - 1)
> 
> I think this could be simpler, there's no need to signal with so fine
> grained detail about the action to be performed.
> 
> Any adjacent access to the MSI-X table should be handled by the logic
> you are adding, so anything that falls in those ranges should
> terminate here.
> 
> adjacent_handle() should IMO just return whether the access is
> replayed against the hardware, or if it's just dropped.

The distinction here is to return X86EMUL_OKAY in case of adjacent write
that is ignored because PBA is somewhere near, but X86EMUL_UNHANDLABLE
for other/error cases (like fixmap indices not initialized).
But maybe this distinction doesn't make sense and X86EMUL_UNHANDLABLE is
okay in either case? 

> > +static unsigned int adjacent_handle(
> > +    const struct msixtbl_entry *entry, unsigned long addr, bool write)
> > +{
> > +    unsigned int adj_type;
> > +    struct arch_msix *msix;
> > +
> > +    if ( !entry || !entry->pdev )
> > +        return ADJACENT_DONT_HANDLE;
> > +
> > +    if ( PFN_DOWN(addr) == PFN_DOWN(entry->gtable) && addr < entry->gtable )
> > +        adj_type = ADJ_IDX_FIRST;
> > +    else if ( PFN_DOWN(addr) == PFN_DOWN(entry->gtable + entry->table_len - 1) &&
> > +              addr >= entry->gtable + entry->table_len )
> > +        adj_type = ADJ_IDX_LAST;
> > +    else
> > +        return ADJACENT_DONT_HANDLE;
> > +
> > +    msix = entry->pdev->msix;
> > +    ASSERT(msix);
> > +
> > +    if ( !msix->adj_access_idx[adj_type] )
> > +    {
> > +        if ( MSIX_CHECK_WARN(msix, entry->pdev->domain->domain_id,
> > +                             adjacent_not_initialized) )
> > +            gprintk(XENLOG_WARNING,
> > +                    "Page for adjacent(%d) MSI-X table access not initialized for %pp (addr %#lx, gtable %#lx\n",
> > +                    adj_type, &entry->pdev->sbdf, addr, entry->gtable);
> > +        return ADJACENT_DONT_HANDLE;
> > +    }
> > +
> > +    /* If PBA lives on the same page too, discard writes. */
> > +    if ( write &&
> > +         ((adj_type == ADJ_IDX_LAST &&
> > +           msix->table.last == msix->pba.first) ||
> > +          (adj_type == ADJ_IDX_FIRST &&
> > +           msix->table.first == msix->pba.last)) )
> > +    {
> > +        if ( MSIX_CHECK_WARN(msix, entry->pdev->domain->domain_id,
> > +                             adjacent_pba) )
> > +            gprintk(XENLOG_WARNING,
> > +                    "MSI-X table and PBA of %pp live on the same page, "
> > +                    "writing to other registers there is not implemented\n",
> > +                    &entry->pdev->sbdf);
> > +        return ADJACENT_DISCARD_WRITE;
> > +    }
> > +
> > +    return msix->adj_access_idx[adj_type];
> > +}
> > +
> > +static int adjacent_read(
> > +    unsigned int fixmap_idx,
> > +    paddr_t address, unsigned int len, uint64_t *pval)
> > +{
> > +    const void __iomem *hwaddr;
> > +
> > +    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);
> > +
> > +    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);
> 
> IMO adjacent_handle() should be called here (and in adjacent_write()),
> and adjacent_{read,write}() called unconditionally from
> msixtbl_{read,write}() when an access that doesn't fall in the MSI-X
> table is handled.  See comment below in msixtbl_read().

Makes sense.

> > +
> > +    switch ( len )
> > +    {
> > +    case 1:
> > +        *pval = readb(hwaddr);
> > +        break;
> > +
> > +    case 2:
> > +        *pval = readw(hwaddr);
> > +        break;
> > +
> > +    case 4:
> > +        *pval = readl(hwaddr);
> > +        break;
> > +
> > +    case 8:
> > +        *pval = readq(hwaddr);
> > +        break;
> > +
> > +    default:
> > +        ASSERT_UNREACHABLE();
> > +        *pval = ~0UL;
> > +        break;
> > +    }
> > +
> > +    return X86EMUL_OKAY;
> > +}
> > +
> > +static int adjacent_write(
> > +    unsigned int fixmap_idx,
> > +    paddr_t address, unsigned int len, uint64_t val)
> > +{
> > +    void __iomem *hwaddr;
> > +
> > +    if ( fixmap_idx == ADJACENT_DISCARD_WRITE )
> > +        return X86EMUL_OKAY;
> > +
> > +    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);
> 
> Since you check the idx is sane, shouldn't you also assert idx >=
> FIX_MSIX_IO_RESERV_BASE?

If moving adjacent_handle() here, I'd simply drop this assert.

> > +
> > +    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);
> > +
> > +    switch ( len )
> > +    {
> > +    case 1:
> > +        writeb(val, hwaddr);
> > +        break;
> > +
> > +    case 2:
> > +        writew(val, hwaddr);
> > +        break;
> > +
> > +    case 4:
> > +        writel(val, hwaddr);
> > +        break;
> > +
> > +    case 8:
> > +        writeq(val, hwaddr);
> > +        break;
> > +
> > +    default:
> > +        ASSERT_UNREACHABLE();
> > +    }
> > +
> > +    return X86EMUL_OKAY;
> > +}
> > +
> >  static int cf_check msixtbl_read(
> >      const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
> >      uint64_t *pval)
> > @@ -220,9 +356,10 @@ static int cf_check msixtbl_read(
> >      unsigned long offset;
> >      struct msixtbl_entry *entry;
> >      unsigned int nr_entry, index;
> > +    unsigned int adjacent_fixmap;
> >      int r = X86EMUL_UNHANDLEABLE;
> >  
> > -    if ( (len != 4 && len != 8) || (address & (len - 1)) )
> > +    if ( !IS_ALIGNED(address, len) )
> >          return r;
> >  
> >      rcu_read_lock(&msixtbl_rcu_lock);
> > @@ -230,6 +367,21 @@ static int cf_check msixtbl_read(
> >      entry = msixtbl_find_entry(current, address);
> >      if ( !entry )
> >          goto out;
> > +
> > +    adjacent_fixmap = adjacent_handle(entry, address, false);
> 
> This seems overly complicated, but is possible I'm missing some logic.
> 
> IMO it would seem way less convoluted to simply do:
> 
> entry = msixtbl_find_entry(current, address);
> if ( !entry )
>     goto out;
> 
> if ( address < entry->gtable ||
>      address >= entry->gtable + entry->table_len )
> {
>     adjacent_read(...);
>     goto out;
> }
> 
> And put all the logic in adjacent_{read,write}() directly rather than
> having both adjacent_{read,write}() plus adjacent_handle() calls here?
> 
> If the access doesn't fall between the boundaries of the MSI-X table
> it's either going to be a handled adjacent access, or it's going to be
> discarded.

Discarded - should it return X86EMUL_OKAY in that case? Currently it
returns X86EMUL_UNHANDLABLE in case adjacent access isn't handled (for
any reason) either.

> > +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> > +    {
> > +        r = adjacent_read(adjacent_fixmap, address, len, pval);
> > +        goto out;
> > +    }
> > +
> > +    if ( address < entry->gtable ||
> > +         address >= entry->gtable + entry->table_len )
> > +        goto out;
> > +
> > +    if ( len != 4 && len != 8 )
> > +        goto out;
> > +
> >      offset = address & (PCI_MSIX_ENTRY_SIZE - 1);
> >  
> >      if ( offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET )
> > @@ -282,6 +434,7 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
> >      int r = X86EMUL_UNHANDLEABLE;
> >      unsigned long flags;
> >      struct irq_desc *desc;
> > +    unsigned int adjacent_fixmap;
> >  
> >      if ( !IS_ALIGNED(address, len) )
> >          return X86EMUL_OKAY;
> > @@ -291,6 +444,21 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
> >      entry = msixtbl_find_entry(v, address);
> >      if ( !entry )
> >          goto out;
> > +
> > +    adjacent_fixmap = adjacent_handle(entry, address, true);
> > +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> > +    {
> > +        r = adjacent_write(adjacent_fixmap, address, len, val);
> > +        goto out;
> > +    }
> > +
> > +    if ( address < entry->gtable ||
> > +         address >= entry->gtable + entry->table_len )
> > +        goto out;
> > +
> > +    if ( len != 4 && len != 8 )
> > +        goto out;
> > +
> >      nr_entry = array_index_nospec(((address - entry->gtable) /
> >                                     PCI_MSIX_ENTRY_SIZE),
> >                                    MAX_MSIX_TABLE_ENTRIES);
> > @@ -356,8 +524,8 @@ static int cf_check _msixtbl_write(
> >      const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
> >      uint64_t val)
> >  {
> > -    /* Ignore invalid length or unaligned writes. */
> > -    if ( (len != 4 && len != 8) || !IS_ALIGNED(address, len) )
> > +    /* Ignore unaligned writes. */
> > +    if ( !IS_ALIGNED(address, len) )
> >          return X86EMUL_OKAY;
> >  
> >      /*
> > @@ -374,14 +542,22 @@ static bool cf_check msixtbl_range(
> >  {
> >      struct vcpu *curr = current;
> >      unsigned long addr = r->addr;
> > -    const struct msi_desc *desc;
> > +    const struct msixtbl_entry *entry;
> > +    const struct msi_desc *desc = NULL;
> > +    unsigned int adjacent_fixmap;
> >  
> >      ASSERT(r->type == IOREQ_TYPE_COPY);
> >  
> >      rcu_read_lock(&msixtbl_rcu_lock);
> > -    desc = msixtbl_addr_to_desc(msixtbl_find_entry(curr, addr), addr);
> > +    entry = msixtbl_find_entry(curr, addr);
> > +    adjacent_fixmap = adjacent_handle(entry, addr, false);
> > +    if ( adjacent_fixmap == ADJACENT_DONT_HANDLE )
> > +        desc = msixtbl_addr_to_desc(entry, addr);
> 
> I'm not sure you need adjacent_handle() here, I would think that any
> address adjacent to the MSI-X table Xen would want to handle
> unconditionally, and hence msixtbl_range() should return true:

I put it here to not duplicate a set of checks for adjacent access. It
isn't only about the range, but also few other case (like if fixmap
indices are set).

> entry = msixtbl_find_entry(curr, addr);
> if ( !entry )
>     return 0;
> 
> if ( addr < entry->gtable || addr >= entry->gtable + entry->table_len )
>     /* Possibly handle adjacent access. */
>     return 1;
> 
> desc = msixtbl_find_entry(curr, addr);
> ...
> 
> (Missing the _unlock calls in the chunk above)
> 
> There's no other processing that can happen for an adjacent access
> unless it's are handled here.  Otherwise such accesses will be
> discarded anyway?  Hence better short-circuit the MMIO handler search
> earlier.

Can't there be some ioreq server that could theoretically handle them?

> >      rcu_read_unlock(&msixtbl_rcu_lock);
> >  
> > +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> > +        return 1;
> > +
> >      if ( desc )
> >          return 1;
> >  
> > @@ -622,12 +798,16 @@ void msix_write_completion(struct vcpu *v)
> >           v->arch.hvm.hvm_io.msix_snoop_gpa )
> >      {
> >          unsigned int token = hvmemul_cache_disable(v);
> > -        const struct msi_desc *desc;
> > +        const struct msi_desc *desc = NULL;
> > +        const struct msixtbl_entry *entry;
> >          uint32_t data;
> >  
> >          rcu_read_lock(&msixtbl_rcu_lock);
> > -        desc = msixtbl_addr_to_desc(msixtbl_find_entry(v, snoop_addr),
> > -                                    snoop_addr);
> > +        entry = msixtbl_find_entry(v, snoop_addr);
> > +        if ( entry &&
> > +             snoop_addr >= entry->gtable &&
> > +             snoop_addr < entry->gtable + entry->table_len )
> > +            desc = msixtbl_addr_to_desc(entry, snoop_addr);
> 
> This would be easier if you added the MSI-X table boundary checks in
> msixtbl_addr_to_desc() itself, rather than open-coding here.  That way
> you don't need to modify msix_write_completion() at all.  It also
> makes msixtbl_addr_to_desc() more robust in case it gets called with
> bogus addresses.

Makes sense I think.

> >          rcu_read_unlock(&msixtbl_rcu_lock);
> >  
> >          if ( desc &&
> > diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
> > index bcfdfd35345d..923b730d48b8 100644
> > --- a/xen/arch/x86/include/asm/msi.h
> > +++ b/xen/arch/x86/include/asm/msi.h
> > @@ -224,6 +224,9 @@ struct arch_msix {
> >      } table, pba;
> >      int table_refcnt[MAX_MSIX_TABLE_PAGES];
> >      int table_idx[MAX_MSIX_TABLE_PAGES];
> > +#define ADJ_IDX_FIRST 0
> > +#define ADJ_IDX_LAST  1
> > +    unsigned int adj_access_idx[2];
> >      spinlock_t table_lock;
> >      bool host_maskall, guest_maskall;
> >      domid_t warned_domid;
> > @@ -231,6 +234,8 @@ struct arch_msix {
> >          uint8_t all;
> >          struct {
> >              bool maskall                   : 1;
> > +            bool adjacent_not_initialized  : 1;
> > +            bool adjacent_pba              : 1;
> >          };
> >      } warned_kind;
> >  };
> > diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> > index 42c793426da3..c77b81896269 100644
> > --- a/xen/arch/x86/msi.c
> > +++ b/xen/arch/x86/msi.c
> > @@ -913,6 +913,36 @@ static int msix_capability_init(struct pci_dev *dev,
> >          list_add_tail(&entry->list, &dev->msi_list);
> >          *desc = entry;
> >      }
> > +    else
> > +    {
> > +        /*
> > +         * If the MSI-X table doesn't start at the page boundary, map the first page for
> > +         * passthrough accesses.
> > +         */
> > +        if ( PAGE_OFFSET(table_paddr) )
> > +        {
> > +            int idx = msix_get_fixmap(msix, table_paddr, table_paddr);
> > +
> > +            if ( idx > 0 )
> > +                msix->adj_access_idx[ADJ_IDX_FIRST] = idx;
> > +            else
> > +                gprintk(XENLOG_ERR, "Failed to map first MSI-X table page: %d\n", idx);
> > +        }
> > +        /*
> > +         * If the MSI-X table doesn't end on the page boundary, map the last page
> > +         * for passthrough accesses.
> > +         */
> > +        if ( PAGE_OFFSET(table_paddr + msix->nr_entries * PCI_MSIX_ENTRY_SIZE) )
> 
> This check is correct ....
> 
> > +        {
> > +            uint64_t entry_paddr = table_paddr + msix->nr_entries * PCI_MSIX_ENTRY_SIZE;
> 
> ... however for correctness you want to use msix->nr_entries - 1 here,
> otherwise you are requesting an address that's past the last MSI-X
> table entry, and hence msix_get_fixmap() could refuse to provide an
> idx if it ever does boundary checking.

Ok.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table
  2024-05-04  0:48     ` Marek Marczykowski-Górecki
@ 2024-05-06  7:38       ` Roger Pau Monné
  0 siblings, 0 replies; 13+ messages in thread
From: Roger Pau Monné @ 2024-05-06  7:38 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Sat, May 04, 2024 at 02:48:12AM +0200, Marek Marczykowski-Górecki wrote:
> On Fri, May 03, 2024 at 10:33:38AM +0200, Roger Pau Monné wrote:
> > On Fri, Apr 26, 2024 at 07:54:00PM +0200, Marek Marczykowski-Górecki wrote:
> > > Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers
> > > on the same page as MSI-X table. Device model (especially one in
> > > stubdomain) cannot really handle those, as direct writes to that page is
> > > refused (page is on the mmio_ro_ranges list). Instead, extend
> > > msixtbl_mmio_ops to handle such accesses too.
> > > 
> > > Doing this, requires correlating read/write location with guest
> > > of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest,
> >   ^ extra 'of'?
> > > it requires msixtbl_entry->gtable, which is HVM-only. Similar feature
> > > for PV would need to be done separately.
> > > 
> > > This will be also used to read Pending Bit Array, if it lives on the same
> > > page, making QEMU not needing /dev/mem access at all (especially helpful
> > > with lockdown enabled in dom0). If PBA lives on another page, QEMU will
> > > map it to the guest directly.
> > > If PBA lives on the same page, discard writes and log a message.
> > > Technically, writes outside of PBA could be allowed, but at this moment
> > > the precise location of PBA isn't saved, and also no known device abuses
> > > the spec in this way (at least yet).
> > > 
> > > To access those registers, msixtbl_mmio_ops need the relevant page
> > > mapped. MSI handling already has infrastructure for that, using fixmap,
> > > so try to map first/last page of the MSI-X table (if necessary) and save
> > > their fixmap indexes. Note that msix_get_fixmap() does reference
> > > counting and reuses existing mapping, so just call it directly, even if
> > > the page was mapped before. Also, it uses a specific range of fixmap
> > > indexes which doesn't include 0, so use 0 as default ("not mapped")
> > > value - which simplifies code a bit.
> > > 
> > > GCC 12.2.1 gets confused about 'desc' variable:
> > > 
> > >     arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’:
> > >     arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized]
> > >       553 |     if ( desc )
> > >           |        ^
> > >     arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here
> > >       537 |     const struct msi_desc *desc;
> > >           |                            ^~~~
> > > 
> > > It's conditional initialization is actually correct (in the case where
> > > it isn't initialized, function returns early), but to avoid
> > > build failure initialize it explicitly to NULL anyway.
> > > 
> > > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > ---
> > > Changes in v6:
> > > - use MSIX_CHECK_WARN macro
> > > - extend assert on fixmap_idx
> > > - add break in default label, after ASSERT_UNREACHABLE(), and move
> > >   setting default there
> > > - style fixes
> > > Changes in v5:
> > > - style fixes
> > > - include GCC version in the commit message
> > > - warn only once (per domain, per device) about failed adjacent access
> > > Changes in v4:
> > > - drop same_page parameter of msixtbl_find_entry(), distinguish two
> > >   cases in relevant callers
> > > - rename adj_access_table_idx to adj_access_idx
> > > - code style fixes
> > > - drop alignment check in adjacent_{read,write}() - all callers already
> > >   have it earlier
> > > - delay mapping first/last MSI-X pages until preparing device for a
> > >   passthrough
> > > v3:
> > >  - merge handling into msixtbl_mmio_ops
> > >  - extend commit message
> > > v2:
> > >  - adjust commit message
> > >  - pass struct domain to msixtbl_page_handler_get_hwaddr()
> > >  - reduce local variables used only once
> > >  - log a warning if write is forbidden if MSI-X and PBA lives on the same
> > >    page
> > >  - do not passthrough unaligned accesses
> > >  - handle accesses both before and after MSI-X table
> > > ---
> > >  xen/arch/x86/hvm/vmsi.c        | 200 ++++++++++++++++++++++++++++++++--
> > >  xen/arch/x86/include/asm/msi.h |   5 +-
> > >  xen/arch/x86/msi.c             |  41 +++++++-
> > >  3 files changed, 236 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> > > index 999917983789..230e3a5dee3f 100644
> > > --- a/xen/arch/x86/hvm/vmsi.c
> > > +++ b/xen/arch/x86/hvm/vmsi.c
> > > @@ -180,6 +180,10 @@ static bool msixtbl_initialised(const struct domain *d)
> > >      return d->arch.hvm.msixtbl_list.next;
> > >  }
> > >  
> > > +/*
> > > + * Lookup an msixtbl_entry on the same page as given addr. It's up to the
> > > + * caller to check if address is strictly part of the table - if relevant.
> > > + */
> > >  static struct msixtbl_entry *msixtbl_find_entry(
> > >      struct vcpu *v, unsigned long addr)
> > >  {
> > > @@ -187,8 +191,8 @@ static struct msixtbl_entry *msixtbl_find_entry(
> > >      struct domain *d = v->domain;
> > >  
> > >      list_for_each_entry( entry, &d->arch.hvm.msixtbl_list, list )
> > > -        if ( addr >= entry->gtable &&
> > > -             addr < entry->gtable + entry->table_len )
> > > +        if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) &&
> > > +             PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) )
> > >              return entry;
> > >  
> > >      return NULL;
> > > @@ -213,6 +217,138 @@ static struct msi_desc *msixtbl_addr_to_desc(
> > >      return NULL;
> > >  }
> > >  
> > > +/*
> > > + * Returns:
> > > + *  - ADJACENT_DONT_HANDLE if no handling should be done
> > > + *  - ADJACENT_DISCARD_WRITE if write should be discarded
> > > + *  - a fixmap idx to use for handling
> > > + */
> > > +#define ADJACENT_DONT_HANDLE UINT_MAX
> > > +#define ADJACENT_DISCARD_WRITE (UINT_MAX - 1)
> > 
> > I think this could be simpler, there's no need to signal with so fine
> > grained detail about the action to be performed.
> > 
> > Any adjacent access to the MSI-X table should be handled by the logic
> > you are adding, so anything that falls in those ranges should
> > terminate here.
> > 
> > adjacent_handle() should IMO just return whether the access is
> > replayed against the hardware, or if it's just dropped.
> 
> The distinction here is to return X86EMUL_OKAY in case of adjacent write
> that is ignored because PBA is somewhere near, but X86EMUL_UNHANDLABLE
> for other/error cases (like fixmap indices not initialized).
> But maybe this distinction doesn't make sense and X86EMUL_UNHANDLABLE is
> okay in either case? 

That's my suggestion, yes.  I don't think it's expected for ioreqs to
handle those adjacent accesses, the more with the limitation that some
of them might not even have access to such region in the first place.

> > > @@ -220,9 +356,10 @@ static int cf_check msixtbl_read(
> > >      unsigned long offset;
> > >      struct msixtbl_entry *entry;
> > >      unsigned int nr_entry, index;
> > > +    unsigned int adjacent_fixmap;
> > >      int r = X86EMUL_UNHANDLEABLE;
> > >  
> > > -    if ( (len != 4 && len != 8) || (address & (len - 1)) )
> > > +    if ( !IS_ALIGNED(address, len) )
> > >          return r;
> > >  
> > >      rcu_read_lock(&msixtbl_rcu_lock);
> > > @@ -230,6 +367,21 @@ static int cf_check msixtbl_read(
> > >      entry = msixtbl_find_entry(current, address);
> > >      if ( !entry )
> > >          goto out;
> > > +
> > > +    adjacent_fixmap = adjacent_handle(entry, address, false);
> > 
> > This seems overly complicated, but is possible I'm missing some logic.
> > 
> > IMO it would seem way less convoluted to simply do:
> > 
> > entry = msixtbl_find_entry(current, address);
> > if ( !entry )
> >     goto out;
> > 
> > if ( address < entry->gtable ||
> >      address >= entry->gtable + entry->table_len )
> > {
> >     adjacent_read(...);
> >     goto out;
> > }
> > 
> > And put all the logic in adjacent_{read,write}() directly rather than
> > having both adjacent_{read,write}() plus adjacent_handle() calls here?
> > 
> > If the access doesn't fall between the boundaries of the MSI-X table
> > it's either going to be a handled adjacent access, or it's going to be
> > discarded.
> 
> Discarded - should it return X86EMUL_OKAY in that case? Currently it
> returns X86EMUL_UNHANDLABLE in case adjacent access isn't handled (for
> any reason) either.

Yes, I think we want to terminate all adjacent accesses here - there's
so far no need to forward them to any other handler, and that would
simplify the logic.  I don't see an use case for handling those
elsewhere, but if that ever arises we can always add the support them.
There's no need to cater for a non-existent use-case that just makes
the code more complicated.

> > > +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> > > +    {
> > > +        r = adjacent_read(adjacent_fixmap, address, len, pval);
> > > +        goto out;
> > > +    }
> > > +
> > > +    if ( address < entry->gtable ||
> > > +         address >= entry->gtable + entry->table_len )
> > > +        goto out;
> > > +
> > > +    if ( len != 4 && len != 8 )
> > > +        goto out;
> > > +
> > >      offset = address & (PCI_MSIX_ENTRY_SIZE - 1);
> > >  
> > >      if ( offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET )
> > > @@ -282,6 +434,7 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
> > >      int r = X86EMUL_UNHANDLEABLE;
> > >      unsigned long flags;
> > >      struct irq_desc *desc;
> > > +    unsigned int adjacent_fixmap;
> > >  
> > >      if ( !IS_ALIGNED(address, len) )
> > >          return X86EMUL_OKAY;
> > > @@ -291,6 +444,21 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
> > >      entry = msixtbl_find_entry(v, address);
> > >      if ( !entry )
> > >          goto out;
> > > +
> > > +    adjacent_fixmap = adjacent_handle(entry, address, true);
> > > +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> > > +    {
> > > +        r = adjacent_write(adjacent_fixmap, address, len, val);
> > > +        goto out;
> > > +    }
> > > +
> > > +    if ( address < entry->gtable ||
> > > +         address >= entry->gtable + entry->table_len )
> > > +        goto out;
> > > +
> > > +    if ( len != 4 && len != 8 )
> > > +        goto out;
> > > +
> > >      nr_entry = array_index_nospec(((address - entry->gtable) /
> > >                                     PCI_MSIX_ENTRY_SIZE),
> > >                                    MAX_MSIX_TABLE_ENTRIES);
> > > @@ -356,8 +524,8 @@ static int cf_check _msixtbl_write(
> > >      const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
> > >      uint64_t val)
> > >  {
> > > -    /* Ignore invalid length or unaligned writes. */
> > > -    if ( (len != 4 && len != 8) || !IS_ALIGNED(address, len) )
> > > +    /* Ignore unaligned writes. */
> > > +    if ( !IS_ALIGNED(address, len) )
> > >          return X86EMUL_OKAY;
> > >  
> > >      /*
> > > @@ -374,14 +542,22 @@ static bool cf_check msixtbl_range(
> > >  {
> > >      struct vcpu *curr = current;
> > >      unsigned long addr = r->addr;
> > > -    const struct msi_desc *desc;
> > > +    const struct msixtbl_entry *entry;
> > > +    const struct msi_desc *desc = NULL;
> > > +    unsigned int adjacent_fixmap;
> > >  
> > >      ASSERT(r->type == IOREQ_TYPE_COPY);
> > >  
> > >      rcu_read_lock(&msixtbl_rcu_lock);
> > > -    desc = msixtbl_addr_to_desc(msixtbl_find_entry(curr, addr), addr);
> > > +    entry = msixtbl_find_entry(curr, addr);
> > > +    adjacent_fixmap = adjacent_handle(entry, addr, false);
> > > +    if ( adjacent_fixmap == ADJACENT_DONT_HANDLE )
> > > +        desc = msixtbl_addr_to_desc(entry, addr);
> > 
> > I'm not sure you need adjacent_handle() here, I would think that any
> > address adjacent to the MSI-X table Xen would want to handle
> > unconditionally, and hence msixtbl_range() should return true:
> 
> I put it here to not duplicate a set of checks for adjacent access. It
> isn't only about the range, but also few other case (like if fixmap
> indices are set).

Right, but as discussed above, we likely want to terminate those
accesses here anyway, since there's no other handler that cares about
MSI-X table adjacent regions.

> > entry = msixtbl_find_entry(curr, addr);
> > if ( !entry )
> >     return 0;
> > 
> > if ( addr < entry->gtable || addr >= entry->gtable + entry->table_len )
> >     /* Possibly handle adjacent access. */
> >     return 1;
> > 
> > desc = msixtbl_find_entry(curr, addr);
> > ...
> > 
> > (Missing the _unlock calls in the chunk above)
> > 
> > There's no other processing that can happen for an adjacent access
> > unless it's are handled here.  Otherwise such accesses will be
> > discarded anyway?  Hence better short-circuit the MMIO handler search
> > earlier.
> 
> Can't there be some ioreq server that could theoretically handle them?

I don't think any of the current ioreq implementations care about
those ATM, and with your work to make QEMU not depend on /dev/mem
it's even more unlikely that we might gain such in the future.

As said above, I wouldn't mind allowing such forwarding if it made the
code easier, but seeing how it makes the logic more complicated I
think it's best to terminate all MSI-X table adjacent accesses here.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-05-06  7:38 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-26 17:53 [PATCH v6 0/7] MSI-X support with qemu in stubdomain, and other related changes Marek Marczykowski-Górecki
2024-04-26 17:53 ` [PATCH v6 1/7] x86/msi: passthrough all MSI-X vector ctrl writes to device model Marek Marczykowski-Górecki
2024-04-26 17:53 ` [PATCH v6 2/7] x86/msi: Extend per-domain/device warning mechanism Marek Marczykowski-Górecki
2024-04-30 14:58   ` Jan Beulich
2024-04-26 17:54 ` [PATCH v6 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table Marek Marczykowski-Górecki
2024-04-30 15:03   ` Jan Beulich
2024-05-03  8:33   ` Roger Pau Monné
2024-05-04  0:48     ` Marek Marczykowski-Górecki
2024-05-06  7:38       ` Roger Pau Monné
2024-04-26 17:54 ` [PATCH v6 4/7] automation: prevent QEMU access to /dev/mem in PCI passthrough tests Marek Marczykowski-Górecki
2024-04-26 17:54 ` [PATCH v6 5/7] automation: switch to a wifi card on ADL system Marek Marczykowski-Górecki
2024-04-26 17:54 ` [PATCH v6 6/7] [DO NOT APPLY] switch to qemu fork Marek Marczykowski-Górecki
2024-04-26 17:54 ` [PATCH v6 7/7] [DO NOT APPLY] switch to alternative artifact repo Marek Marczykowski-Górecki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).