* [PATCH 0/3] x86: Fix racy accesses to MSI-X Control register
@ 2022-11-10 16:59 David Vrabel
  2022-11-10 16:59 ` [PATCH 1/3] x86/msi: consistently handle BAR mapping failures in MSI-X setup David Vrabel
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: David Vrabel @ 2022-11-10 16:59 UTC (permalink / raw)
  To: xen-devel
  Cc: Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu, David Vrabel

The main patch in this series is 3/3 with some preparatory patches to
simplify the implementation. To summarize:

    Concurrent accesses to the MSI-X Control register are not serialized
    with a suitable lock. For example, accesses in msix_capability_init()
    are protected by pcidevs_lock(), but some calls to msi_set_mask_bit()
    use only the interrupt descriptor lock.
    
    This can lead to MSI-X being incorrectly disabled and subsequent
    failures due to msix_memory_decoded() calls that check for MSI-X being
    enabled.
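
A rough sketch of the racy pattern (illustrative only, simplified from
the code the series touches): two read-modify-write sequences of the
same Control register, done under different locks, can interleave:

    /* CPU A: msix_capability_init(), under pcidevs_lock() */
    control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
    pci_conf_write16(dev->sbdf, msix_control_reg(pos),
                     control | PCI_MSIX_FLAGS_ENABLE |
                               PCI_MSIX_FLAGS_MASKALL);

    /* CPU B: msi_set_mask_bit(), under the irq_desc lock only */
    control = pci_conf_read16(pdev->sbdf, msix_control_reg(pos));
    /* ... update the per-vector mask ... */
    pci_conf_write16(pdev->sbdf, msix_control_reg(pos), control);
    /* CPU B's write-back of the stale value can clear the ENABLE bit
       CPU A just set, making CPU A's later decode checks fail. */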

David





* [PATCH 1/3] x86/msi: consistently handle BAR mapping failures in MSI-X setup
  2022-11-10 16:59 [PATCH 0/3] x86: Fix racy accesses to MSI-X Control register David Vrabel
@ 2022-11-10 16:59 ` David Vrabel
  2022-11-11  9:24   ` Jan Beulich
  2022-11-10 16:59 ` [PATCH 2/3] x86/msi: remove return value from msi_set_mask_bit() David Vrabel
  2022-11-10 16:59 ` [PATCH 3/3] x86/pci: Fix racy accesses to MSI-X Control register David Vrabel
  2 siblings, 1 reply; 12+ messages in thread
From: David Vrabel @ 2022-11-10 16:59 UTC (permalink / raw)
  To: xen-devel
  Cc: Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu, David Vrabel

When setting up an MSI-X vector in msix_capability_init() the error
handling after a BAR mapping failure is different depending on whether
the first page or a subsequent page fails. There's no reason to break
working vectors, so consistently use the latter error handling
behaviour.

The zap_on_error flag was added as part of XSA-337, beb54596cfda
(x86/MSI-X: restrict reading of table/PBA bases from BARs), but
appears to be unrelated to XSA-337 and is not useful because:

1. table.first and pba.first are not used unless msix->used_vectors > 0.

2. Force disabling MSI-X in this error path is not necessary as the
   per-vector mask is still set.

Signed-off-by: David Vrabel <dvrabel@amazon.co.uk>

CR: https://code.amazon.com/reviews/CR-79020908
---
 xen/arch/x86/msi.c | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index d0bf63df1d..8bde6b9be1 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -776,7 +776,7 @@ static int msix_capability_init(struct pci_dev *dev,
     u8 bus = dev->bus;
     u8 slot = PCI_SLOT(dev->devfn);
     u8 func = PCI_FUNC(dev->devfn);
-    bool maskall = msix->host_maskall, zap_on_error = false;
+    bool maskall = msix->host_maskall;
     unsigned int pos = pci_find_cap_offset(seg, bus, slot, func,
                                            PCI_CAP_ID_MSIX);
 
@@ -875,8 +875,6 @@ static int msix_capability_init(struct pci_dev *dev,
                                   BITS_TO_LONGS(msix->nr_entries) - 1);
         WARN_ON(rangeset_overlaps_range(mmio_ro_ranges, msix->pba.first,
                                         msix->pba.last));
-
-        zap_on_error = true;
     }
     else if ( !msix->table.first )
     {
@@ -897,14 +895,6 @@ static int msix_capability_init(struct pci_dev *dev,
 
         if ( idx < 0 )
         {
-            if ( zap_on_error )
-            {
-                msix->table.first = 0;
-                msix->pba.first = 0;
-
-                control &= ~PCI_MSIX_FLAGS_ENABLE;
-            }
-
             pci_conf_write16(dev->sbdf, msix_control_reg(pos), control);
             xfree(entry);
             return idx;
-- 
2.30.2




* [PATCH 2/3] x86/msi: remove return value from msi_set_mask_bit()
  2022-11-10 16:59 [PATCH 0/3] x86: Fix racy accesses to MSI-X Control register David Vrabel
  2022-11-10 16:59 ` [PATCH 1/3] x86/msi: consistently handle BAR mapping failures in MSI-X setup David Vrabel
@ 2022-11-10 16:59 ` David Vrabel
  2022-11-11  9:44   ` Jan Beulich
  2022-11-10 16:59 ` [PATCH 3/3] x86/pci: Fix racy accesses to MSI-X Control register David Vrabel
  2 siblings, 1 reply; 12+ messages in thread
From: David Vrabel @ 2022-11-10 16:59 UTC (permalink / raw)
  To: xen-devel
  Cc: Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu, David Vrabel

The return value was only used for WARN()s or BUG()s so it has no
functional purpose. Simplify the code by removing it.

The meaning of the return value and the purpose of the various WARN()s
and BUG()s are rather unclear. The only failure path (where an MSI-X
vector needs to be masked but the MSI-X table is not accessible) has a
useful warning message.

Signed-off-by: David Vrabel <dvrabel@amazon.co.uk>

CR: https://code.amazon.com/reviews/CR-79020927
---
 xen/arch/x86/msi.c | 34 +++++++++-------------------------
 1 file changed, 9 insertions(+), 25 deletions(-)

diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 8bde6b9be1..6c675d11d1 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -314,7 +314,7 @@ int msi_maskable_irq(const struct msi_desc *entry)
            || entry->msi_attrib.maskbit;
 }
 
-static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
+static void msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
 {
     struct msi_desc *entry = desc->msi_desc;
     struct pci_dev *pdev;
@@ -361,11 +361,6 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
 
             if ( likely(control & PCI_MSIX_FLAGS_ENABLE) )
                 break;
-
-            entry->msi_attrib.host_masked = host;
-            entry->msi_attrib.guest_masked = guest;
-
-            flag = true;
         }
         else if ( flag && !(control & PCI_MSIX_FLAGS_MASKALL) )
         {
@@ -385,14 +380,10 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
             control |= PCI_MSIX_FLAGS_MASKALL;
         pci_conf_write16(pdev->sbdf,
                          msix_control_reg(entry->msi_attrib.pos), control);
-        return flag;
-    default:
-        return 0;
+        break;
     }
     entry->msi_attrib.host_masked = host;
     entry->msi_attrib.guest_masked = guest;
-
-    return 1;
 }
 
 static int msi_get_mask_bit(const struct msi_desc *entry)
@@ -418,16 +409,12 @@ static int msi_get_mask_bit(const struct msi_desc *entry)
 
 void cf_check mask_msi_irq(struct irq_desc *desc)
 {
-    if ( unlikely(!msi_set_mask_bit(desc, 1,
-                                    desc->msi_desc->msi_attrib.guest_masked)) )
-        BUG_ON(!(desc->status & IRQ_DISABLED));
+    msi_set_mask_bit(desc, 1, desc->msi_desc->msi_attrib.guest_masked);
 }
 
 void cf_check unmask_msi_irq(struct irq_desc *desc)
 {
-    if ( unlikely(!msi_set_mask_bit(desc, 0,
-                                    desc->msi_desc->msi_attrib.guest_masked)) )
-        WARN();
+    msi_set_mask_bit(desc, 0, desc->msi_desc->msi_attrib.guest_masked);
 }
 
 void guest_mask_msi_irq(struct irq_desc *desc, bool mask)
@@ -437,15 +424,13 @@ void guest_mask_msi_irq(struct irq_desc *desc, bool mask)
 
 static unsigned int cf_check startup_msi_irq(struct irq_desc *desc)
 {
-    if ( unlikely(!msi_set_mask_bit(desc, 0, !!(desc->status & IRQ_GUEST))) )
-        WARN();
+    msi_set_mask_bit(desc, 0, !!(desc->status & IRQ_GUEST));
     return 0;
 }
 
 static void cf_check shutdown_msi_irq(struct irq_desc *desc)
 {
-    if ( unlikely(!msi_set_mask_bit(desc, 1, 1)) )
-        BUG_ON(!(desc->status & IRQ_DISABLED));
+    msi_set_mask_bit(desc, 1, 1);
 }
 
 void cf_check ack_nonmaskable_msi_irq(struct irq_desc *desc)
@@ -1358,10 +1343,9 @@ int pci_restore_msi_state(struct pci_dev *pdev)
 
         for ( i = 0; ; )
         {
-            if ( unlikely(!msi_set_mask_bit(desc,
-                                            entry[i].msi_attrib.host_masked,
-                                            entry[i].msi_attrib.guest_masked)) )
-                BUG();
+            msi_set_mask_bit(desc,
+                             entry[i].msi_attrib.host_masked,
+                             entry[i].msi_attrib.guest_masked);
 
             if ( !--nr )
                 break;
-- 
2.30.2




* [PATCH 3/3] x86/pci: Fix racy accesses to MSI-X Control register
  2022-11-10 16:59 [PATCH 0/3] x86: Fix racy accesses to MSI-X Control register David Vrabel
  2022-11-10 16:59 ` [PATCH 1/3] x86/msi: consistently handle BAR mapping failures in MSI-X setup David Vrabel
  2022-11-10 16:59 ` [PATCH 2/3] x86/msi: remove return value from msi_set_mask_bit() David Vrabel
@ 2022-11-10 16:59 ` David Vrabel
  2022-12-12 17:04   ` Jan Beulich
  2 siblings, 1 reply; 12+ messages in thread
From: David Vrabel @ 2022-11-10 16:59 UTC (permalink / raw)
  To: xen-devel
  Cc: Jan Beulich, Andrew Cooper, Roger Pau Monné, Wei Liu, David Vrabel

Concurrent accesses to the MSI-X Control register are not serialized
with a suitable lock. For example, accesses in msix_capability_init()
are protected by pcidevs_lock(), but some calls to msi_set_mask_bit()
use only the interrupt descriptor lock.

This can lead to MSI-X being incorrectly disabled and subsequent
failures due to msix_memory_decoded() calls that check for MSI-X being
enabled.

This was seen with some non-compliant hardware that gated MSI-X
messages on the per-vector mask bit only (i.e., the MSI-X Enable and
Function Mask bits in the MSI-X Control register were ignored). An
interrupt (with a pending move) for vector 0 would occur while vector
1 was being initialized in msix_capability_init(). Updates to the
Control register would race and the vector 1 initialization would
intermittently fail with -ENXIO.

Typically a race between initializing a vector and another vector
being moved doesn't occur because:

1. Racy Control accesses only occur when MSI-X is (guest) disabled.

2. Hardware should only raise interrupts when MSI-X is enabled and unmasked.

3. Xen always sets Function Mask when temporarily enabling MSI-X.

But there may be other race conditions depending on hardware and guest
driver behaviour (e.g., disabling MSI-X when an IRQ has a pending move
on another PCPU).

Fix this by:

1. Tracking the host and guest enable state in a similar way to the
   host and guest maskall state. Note that since multiple CPUs can be
   updating different vectors concurrently, a counter is needed for
   the host enable state.

2. Adding a new lock to serialize the Control read/modify/write
   sequence.

3. Wrapping the above in two helper functions (msix_update_lock() and
   msix_update_unlock()), which bracket any MSI-X register updates
   that require MSI-X to be (temporarily) enabled.
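
As an illustration only (a sketch of the intended shape, not code from
the patch itself; the real call sites are in the diff below), users of
the helpers end up bracketing their updates like this:

    control = msix_update_lock(pdev, pos);

    /* MSI-X is now guaranteed enabled; MASKALL is forced on if the
       guest had MSI-X disabled, and host_enable holds it that way. */

    /* ... access the MSI-X table: set up, mask or unmask a vector ... */

    msix_update_unlock(pdev, pos, control);

    /* host_enable is decremented and the final ENABLE/MASKALL state is
       re-derived from the tracked host and guest state. */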

Signed-off-by: David Vrabel <dvrabel@amazon.co.uk>
SIM: https://t.corp.amazon.com/P63914633

CR: https://code.amazon.com/reviews/CR-79020945
---
 xen/arch/x86/include/asm/msi.h |   3 +
 xen/arch/x86/msi.c             | 215 +++++++++++++++++----------------
 xen/drivers/passthrough/msi.c  |   1 +
 3 files changed, 114 insertions(+), 105 deletions(-)

diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
index fe670895ee..aa36e44f4e 100644
--- a/xen/arch/x86/include/asm/msi.h
+++ b/xen/arch/x86/include/asm/msi.h
@@ -237,7 +237,10 @@ struct arch_msix {
     int table_refcnt[MAX_MSIX_TABLE_PAGES];
     int table_idx[MAX_MSIX_TABLE_PAGES];
     spinlock_t table_lock;
+    spinlock_t control_lock;
     bool host_maskall, guest_maskall;
+    uint16_t host_enable;
+    bool guest_enable;
     domid_t warned;
 };
 
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 6c675d11d1..8e394da07a 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -147,6 +147,57 @@ static bool msix_memory_decoded(const struct pci_dev *dev, unsigned int pos)
     return memory_decoded(dev);
 }
 
+
+/*
+ * Ensure MSI-X interrupts are masked during setup. Some devices require
+ * MSI-X to be enabled before we can touch the MSI-X registers. We need
+ * to mask all the vectors to prevent interrupts coming in before they're
+ * fully set up.
+ */
+static uint16_t msix_update_lock(struct pci_dev *dev, unsigned int pos)
+{
+    uint16_t control, new_control;
+    unsigned long flags;
+
+    spin_lock_irqsave(&dev->msix->control_lock, flags);
+
+    dev->msix->host_enable++;
+
+    control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
+    if ( !(control & PCI_MSIX_FLAGS_ENABLE) )
+    {
+        new_control = control | PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL;
+        pci_conf_write16(dev->sbdf, msix_control_reg(pos), new_control);
+    }
+    else
+        dev->msix->guest_enable = true;
+
+    spin_unlock_irqrestore(&dev->msix->control_lock, flags);
+
+    return control;
+}
+
+static void msix_update_unlock(struct pci_dev *dev, unsigned int pos, uint16_t control)
+{
+    uint16_t new_control;
+    unsigned long flags;
+
+    spin_lock_irqsave(&dev->msix->control_lock, flags);
+
+    dev->msix->host_enable--;
+
+    new_control = control & ~(PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL);
+
+    if ( dev->msix->host_enable || dev->msix->guest_enable )
+        new_control |= PCI_MSIX_FLAGS_ENABLE;
+    if ( dev->msix->host_maskall || dev->msix->guest_maskall || dev->msix->host_enable )
+        new_control |= PCI_MSIX_FLAGS_MASKALL;
+    if ( new_control != control )
+        pci_conf_write16(dev->sbdf, msix_control_reg(pos), new_control);
+
+    spin_unlock_irqrestore(&dev->msix->control_lock, flags);
+}
+
 /*
  * MSI message composition
  */
@@ -288,7 +339,7 @@ static void msi_set_enable(struct pci_dev *dev, int enable)
         __msi_set_enable(seg, bus, slot, func, pos, enable);
 }
 
-static void msix_set_enable(struct pci_dev *dev, int enable)
+static void msix_force_disable(struct pci_dev *dev)
 {
     int pos;
     u16 control, seg = dev->seg;
@@ -299,11 +350,16 @@ static void msix_set_enable(struct pci_dev *dev, int enable)
     pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSIX);
     if ( pos )
     {
+        spin_lock_irq(&dev->msix->control_lock);
+
+        dev->msix->host_enable = false;
+        dev->msix->guest_enable = false;
+
         control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
         control &= ~PCI_MSIX_FLAGS_ENABLE;
-        if ( enable )
-            control |= PCI_MSIX_FLAGS_ENABLE;
         pci_conf_write16(dev->sbdf, msix_control_reg(pos), control);
+
+        spin_unlock_irq(&dev->msix->control_lock);
     }
 }
 
@@ -318,9 +374,10 @@ static void msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
 {
     struct msi_desc *entry = desc->msi_desc;
     struct pci_dev *pdev;
-    u16 seg, control;
+    u16 seg;
     u8 bus, slot, func;
-    bool flag = host || guest, maskall;
+    bool flag = host || guest;
+    uint16_t control;
 
     ASSERT(spin_is_locked(&desc->lock));
     BUG_ON(!entry || !entry->dev);
@@ -343,30 +400,18 @@ static void msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
         }
         break;
     case PCI_CAP_ID_MSIX:
-        maskall = pdev->msix->host_maskall;
-        control = pci_conf_read16(pdev->sbdf,
-                                  msix_control_reg(entry->msi_attrib.pos));
-        if ( unlikely(!(control & PCI_MSIX_FLAGS_ENABLE)) )
-        {
-            pdev->msix->host_maskall = 1;
-            pci_conf_write16(pdev->sbdf,
-                             msix_control_reg(entry->msi_attrib.pos),
-                             control | (PCI_MSIX_FLAGS_ENABLE |
-                                        PCI_MSIX_FLAGS_MASKALL));
-        }
+        control = msix_update_lock(pdev, entry->msi_attrib.pos);
+
         if ( likely(memory_decoded(pdev)) )
         {
             writel(flag, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
             readl(entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
-
-            if ( likely(control & PCI_MSIX_FLAGS_ENABLE) )
-                break;
         }
-        else if ( flag && !(control & PCI_MSIX_FLAGS_MASKALL) )
+        else if ( !pdev->msix->host_maskall && !pdev->msix->guest_maskall )
         {
             domid_t domid = pdev->domain->domain_id;
 
-            maskall = true;
+            pdev->msix->host_maskall = true;
             if ( pdev->msix->warned != domid )
             {
                 pdev->msix->warned = domid;
@@ -375,11 +420,8 @@ static void msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
                        desc->irq, domid, &pdev->sbdf);
             }
         }
-        pdev->msix->host_maskall = maskall;
-        if ( maskall || pdev->msix->guest_maskall )
-            control |= PCI_MSIX_FLAGS_MASKALL;
-        pci_conf_write16(pdev->sbdf,
-                         msix_control_reg(entry->msi_attrib.pos), control);
+
+        msix_update_unlock(pdev, entry->msi_attrib.pos, control);
         break;
     }
     entry->msi_attrib.host_masked = host;
@@ -494,26 +536,19 @@ static struct msi_desc *alloc_msi_entry(unsigned int nr)
 
 int setup_msi_irq(struct irq_desc *desc, struct msi_desc *msidesc)
 {
-    const struct pci_dev *pdev = msidesc->dev;
-    unsigned int cpos = msix_control_reg(msidesc->msi_attrib.pos);
-    u16 control = ~0;
+    struct pci_dev *pdev = msidesc->dev;
+    uint16_t control = 0;
     int rc;
 
     if ( msidesc->msi_attrib.type == PCI_CAP_ID_MSIX )
-    {
-        control = pci_conf_read16(pdev->sbdf, cpos);
-        if ( !(control & PCI_MSIX_FLAGS_ENABLE) )
-            pci_conf_write16(pdev->sbdf, cpos,
-                             control | (PCI_MSIX_FLAGS_ENABLE |
-                                        PCI_MSIX_FLAGS_MASKALL));
-    }
+        control = msix_update_lock(pdev, msidesc->msi_attrib.pos);
 
     rc = __setup_msi_irq(desc, msidesc,
                          msi_maskable_irq(msidesc) ? &pci_msi_maskable
                                                    : &pci_msi_nonmaskable);
 
-    if ( !(control & PCI_MSIX_FLAGS_ENABLE) )
-        pci_conf_write16(pdev->sbdf, cpos, control);
+    if ( control )
+        msix_update_unlock(pdev, msidesc->msi_attrib.pos, control);
 
     return rc;
 }
@@ -754,14 +789,14 @@ static int msix_capability_init(struct pci_dev *dev,
 {
     struct arch_msix *msix = dev->msix;
     struct msi_desc *entry = NULL;
-    u16 control;
     u64 table_paddr;
     u32 table_offset;
     u16 seg = dev->seg;
     u8 bus = dev->bus;
     u8 slot = PCI_SLOT(dev->devfn);
     u8 func = PCI_FUNC(dev->devfn);
-    bool maskall = msix->host_maskall;
+    uint16_t control;
+    int ret = 0;
     unsigned int pos = pci_find_cap_offset(seg, bus, slot, func,
                                            PCI_CAP_ID_MSIX);
 
@@ -770,37 +805,22 @@ static int msix_capability_init(struct pci_dev *dev,
 
     ASSERT(pcidevs_locked());
 
-    control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
-    /*
-     * Ensure MSI-X interrupts are masked during setup. Some devices require
-     * MSI-X to be enabled before we can touch the MSI-X registers. We need
-     * to mask all the vectors to prevent interrupts coming in before they're
-     * fully set up.
-     */
-    msix->host_maskall = 1;
-    pci_conf_write16(dev->sbdf, msix_control_reg(pos),
-                     control | (PCI_MSIX_FLAGS_ENABLE |
-                                PCI_MSIX_FLAGS_MASKALL));
-
-    if ( unlikely(!memory_decoded(dev)) )
-    {
-        pci_conf_write16(dev->sbdf, msix_control_reg(pos),
-                         control & ~PCI_MSIX_FLAGS_ENABLE);
-        return -ENXIO;
-    }
-
     if ( desc )
     {
         entry = alloc_msi_entry(1);
         if ( !entry )
-        {
-            pci_conf_write16(dev->sbdf, msix_control_reg(pos),
-                             control & ~PCI_MSIX_FLAGS_ENABLE);
             return -ENOMEM;
-        }
         ASSERT(msi);
     }
 
+    control = msix_update_lock(dev, pos);
+
+    if ( unlikely(!memory_decoded(dev)) )
+    {
+        ret = -ENXIO;
+        goto out;
+    }
+
     /* Locate MSI-X table region */
     table_offset = pci_conf_read32(dev->sbdf, msix_table_offset_reg(pos));
     if ( !msix->used_entries &&
@@ -834,10 +854,8 @@ static int msix_capability_init(struct pci_dev *dev,
         {
             if ( !msi || !msi->table_base )
             {
-                pci_conf_write16(dev->sbdf, msix_control_reg(pos),
-                                 control & ~PCI_MSIX_FLAGS_ENABLE);
-                xfree(entry);
-                return -ENXIO;
+                ret = -ENXIO;
+                goto out;
             }
             table_paddr = msi->table_base;
         }
@@ -863,9 +881,8 @@ static int msix_capability_init(struct pci_dev *dev,
     }
     else if ( !msix->table.first )
     {
-        pci_conf_write16(dev->sbdf, msix_control_reg(pos), control);
-        xfree(entry);
-        return -ENODATA;
+        ret = -ENODATA;
+        goto out;
     }
     else
         table_paddr = (msix->table.first << PAGE_SHIFT) +
@@ -880,9 +897,8 @@ static int msix_capability_init(struct pci_dev *dev,
 
         if ( idx < 0 )
         {
-            pci_conf_write16(dev->sbdf, msix_control_reg(pos), control);
-            xfree(entry);
-            return idx;
+            ret = idx;
+            goto out;
         }
         base = fix_to_virt(idx) + (entry_paddr & (PAGE_SIZE - 1));
 
@@ -906,12 +922,6 @@ static int msix_capability_init(struct pci_dev *dev,
 
     if ( !msix->used_entries )
     {
-        maskall = false;
-        if ( !msix->guest_maskall )
-            control &= ~PCI_MSIX_FLAGS_MASKALL;
-        else
-            control |= PCI_MSIX_FLAGS_MASKALL;
-
         if ( rangeset_add_range(mmio_ro_ranges, msix->table.first,
                                 msix->table.last) )
             WARN();
@@ -940,23 +950,13 @@ static int msix_capability_init(struct pci_dev *dev,
     WARN_ON(msix->table.first != (table_paddr >> PAGE_SHIFT));
     ++msix->used_entries;
 
-    /* Restore MSI-X enabled bits */
-    if ( !hardware_domain )
-    {
-        /*
-         * ..., except for internal requests (before Dom0 starts), in which
-         * case we rather need to behave "normally", i.e. not follow the split
-         * brain model where Dom0 actually enables MSI (and disables INTx).
-         */
-        pci_intx(dev, false);
-        control |= PCI_MSIX_FLAGS_ENABLE;
-        control &= ~PCI_MSIX_FLAGS_MASKALL;
-        maskall = 0;
-    }
-    msix->host_maskall = maskall;
-    pci_conf_write16(dev->sbdf, msix_control_reg(pos), control);
+  out:
+    if ( ret < 0 )
+        xfree(entry);
 
-    return 0;
+    msix_update_unlock(dev, pos, control);
+
+    return ret;
 }
 
 /**
@@ -1180,7 +1180,7 @@ void pci_cleanup_msi(struct pci_dev *pdev)
 {
     /* Disable MSI and/or MSI-X */
     msi_set_enable(pdev, 0);
-    msix_set_enable(pdev, 0);
+    msix_force_disable(pdev);
     msi_free_irqs(pdev);
 }
 
@@ -1229,11 +1229,20 @@ int pci_msi_conf_write_intercept(struct pci_dev *pdev, unsigned int reg,
             if ( reg != msix_control_reg(pos) || size != 2 )
                 return -EACCES;
 
+            spin_lock_irq(&pdev->msix->control_lock);
+
+            pdev->msix->guest_enable = !!(*data & PCI_MSIX_FLAGS_ENABLE);
+            if ( pdev->msix->host_enable )
+                *data |= PCI_MSIX_FLAGS_ENABLE;
             pdev->msix->guest_maskall = !!(*data & PCI_MSIX_FLAGS_MASKALL);
             if ( pdev->msix->host_maskall )
                 *data |= PCI_MSIX_FLAGS_MASKALL;
 
-            return 1;
+            pci_conf_write16(pdev->sbdf, reg, *data);
+
+            spin_unlock_irq(&pdev->msix->control_lock);
+
+            return -EPERM; /* Already done the write. */
         }
     }
 
@@ -1324,15 +1333,12 @@ int pci_restore_msi_state(struct pci_dev *pdev)
         }
         else if ( !type && entry->msi_attrib.type == PCI_CAP_ID_MSIX )
         {
-            control = pci_conf_read16(pdev->sbdf, msix_control_reg(pos));
-            pci_conf_write16(pdev->sbdf, msix_control_reg(pos),
-                             control | (PCI_MSIX_FLAGS_ENABLE |
-                                        PCI_MSIX_FLAGS_MASKALL));
+            control = msix_update_lock(pdev, pos);
+
             if ( unlikely(!memory_decoded(pdev)) )
             {
+                msix_update_unlock(pdev, pos, control);
                 spin_unlock_irqrestore(&desc->lock, flags);
-                pci_conf_write16(pdev->sbdf, msix_control_reg(pos),
-                                 control & ~PCI_MSIX_FLAGS_ENABLE);
                 return -ENXIO;
             }
         }
@@ -1372,8 +1378,7 @@ int pci_restore_msi_state(struct pci_dev *pdev)
     }
 
     if ( type == PCI_CAP_ID_MSIX )
-        pci_conf_write16(pdev->sbdf, msix_control_reg(pos),
-                         control | PCI_MSIX_FLAGS_ENABLE);
+        msix_update_unlock(pdev, pos, control);
 
     return 0;
 }
diff --git a/xen/drivers/passthrough/msi.c b/xen/drivers/passthrough/msi.c
index ce1a450f6f..436c78b7aa 100644
--- a/xen/drivers/passthrough/msi.c
+++ b/xen/drivers/passthrough/msi.c
@@ -44,6 +44,7 @@ int pdev_msi_init(struct pci_dev *pdev)
             return -ENOMEM;
 
         spin_lock_init(&msix->table_lock);
+        spin_lock_init(&msix->control_lock);
 
         ctrl = pci_conf_read16(pdev->sbdf, msix_control_reg(pos));
         msix->nr_entries = msix_table_size(ctrl);
-- 
2.30.2




* Re: [PATCH 1/3] x86/msi: consistently handle BAR mapping failures in MSI-X setup
  2022-11-10 16:59 ` [PATCH 1/3] x86/msi: consistently handle BAR mapping failures in MSI-X setup David Vrabel
@ 2022-11-11  9:24   ` Jan Beulich
  0 siblings, 0 replies; 12+ messages in thread
From: Jan Beulich @ 2022-11-11  9:24 UTC (permalink / raw)
  To: David Vrabel
  Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, David Vrabel, xen-devel

On 10.11.2022 17:59, David Vrabel wrote:
> When setting up an MSI-X vector in msix_capability_init() the error
> handling after a BAR mapping failure is different depending on whether
> the first page or a subsequent page fails. There's no reason to break
> working vectors, so consistently use the latter error handling
> behaviour.

"zap_on_error" can only be set when there were no working vectors yet
(msix->used_entries being zero), so I don't see what case this last
sentence describes. In fact it was the intention with "zap_on_error"
to leave previously set up vectors functional.

> The zap_on_error flag was added as part of XSA-337, beb54596cfda
> (x86/MSI-X: restrict reading of table/PBA bases from BARs), but
> appears to be unrelated to XSA-337 and is not useful because:
> 
> 1. table.first and pba.first are not used unless msix->used_vectors > 0.

This isn't true afaics. The condition around their setting up involves
more than just ->used_vectors:

    if ( !msix->used_entries &&
         (!msi ||
          (is_hardware_domain(current->domain) &&
           (dev->domain == current->domain || dev->domain == dom_io))) )

Hence the associated "else if( !msix->table.first )" can also be taken
if msix->used_entries is zero. And in case of a failure we need to force
the error return there for DomU-s, which is achieved by clearing
msix->table.first on the error handling path you alter.

Furthermore I'd consider it bad practice to leave stale values on record.

> 2. Force disabling MSI-X in this error path is not necessary as the
>    per-vector mask is still set.

I agree that we might be overly strict there, but to remove that
disabling you'd need to further prove that no other inconsistencies can
(later) result (this being on the safe side is where the connection to
the rest of the XSA-337 changes comes from, along with the desire to
not leave stale values around, as per above). Plus you'd want to justify
why this error path is different from others in the function where we
also disable MSI-X altogether (beyond the path you modify there's exactly
one error path where we don't, and I now wonder why I had done it like
that).

But then I may also be misunderstanding some of your intentions here.
The "consistently" in the title and the associated first sentence of the
description escape me for the moment: You're talking about things in
terms of pages, when the handling really is in terms of entries.

Jan



* Re: [PATCH 2/3] x86/msi: remove return value from msi_set_mask_bit()
  2022-11-10 16:59 ` [PATCH 2/3] x86/msi: remove return value from msi_set_mask_bit() David Vrabel
@ 2022-11-11  9:44   ` Jan Beulich
  2022-11-11 14:41     ` David Vrabel
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Beulich @ 2022-11-11  9:44 UTC (permalink / raw)
  To: David Vrabel
  Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, David Vrabel, xen-devel

On 10.11.2022 17:59, David Vrabel wrote:
> The return value was only used for WARN()s or BUG()s so it has no
> functional purpose. Simplify the code by removing it.
> 
> The meaning of the return value and the purpose of the various WARN()s
> and BUG()s are rather unclear. The only failure path (where an MSI-X
> vector needs to be masked but the MSI-X table is not accessible) has a
> useful warning message.

No, you're removing an important 2nd such path - the default case in the
switch() statement. Getting there would mean another bug elsewhere, which
we don't want to leave undetected for yet longer (and hence yet harder to
debug once finally some misbehavior surfaces). That default case may
warrant the addition of ASSERT_UNREACHABLE() according to the respective
description in ./CODING_STYLE, but I don't see the removal of the
"return" as acceptable (also for another reason as explained below).

The idea of the WARN() / BUG_ON() is to
- not leave failed unmasking unrecorded,
- not continue after failure to mask an entry.
This combines with the functions where the constructs are not having
ways to properly propagate the errors to their callers.

> @@ -361,11 +361,6 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
>  
>              if ( likely(control & PCI_MSIX_FLAGS_ENABLE) )
>                  break;
> -
> -            entry->msi_attrib.host_masked = host;
> -            entry->msi_attrib.guest_masked = guest;
> -
> -            flag = true;
>          }
>          else if ( flag && !(control & PCI_MSIX_FLAGS_MASKALL) )
>          {
> @@ -385,14 +380,10 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
>              control |= PCI_MSIX_FLAGS_MASKALL;
>          pci_conf_write16(pdev->sbdf,
>                           msix_control_reg(entry->msi_attrib.pos), control);
> -        return flag;

Both this and ...

> -    default:
> -        return 0;

... this have previously prevented to make it ...

> +        break;
>      }
>      entry->msi_attrib.host_masked = host;
>      entry->msi_attrib.guest_masked = guest;

... here. This is a change in behavior which is neither obviously benign
nor properly justified in the description, yet clearly going beyond what
the title says the patch is about.

Jan



* Re: [PATCH 2/3] x86/msi: remove return value from msi_set_mask_bit()
  2022-11-11  9:44   ` Jan Beulich
@ 2022-11-11 14:41     ` David Vrabel
  2022-11-14 10:14       ` Jan Beulich
  0 siblings, 1 reply; 12+ messages in thread
From: David Vrabel @ 2022-11-11 14:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, David Vrabel, xen-devel



On 11/11/2022 09:44, Jan Beulich wrote:
> 
> The idea of the WARN() / BUG_ON() is to
> - not leave failed unmasking unrecorded,
> - not continue after failure to mask an entry.

Then let's make msi_set_mask_bit() unable to fail with something like
this (untested) patch. Would this be acceptable?

David

 From 837649a70d44455f4fd98e2eaa46dcf35a56d00a Mon Sep 17 00:00:00 2001
From: David Vrabel <dvrabel@amazon.co.uk>
Date: Fri, 11 Nov 2022 14:30:16 +0000
Subject: [PATCH] x86: Always enable memory space decodes when using MSI-X

Instead of the numerous (racy) checks for memory space accesses being
enabled before writing to the MSI-X table, force Memory Space Enable
to be set in the Command register if MSI-X is used.

This allows the memory_decoded() function and the associated error
paths to be removed (since it will always return true). In particular,
msi_set_mask_bit() can no longer fail and its return value is removed.

Note that if the PCI device is a virtual function, the relevant
command register is in the corresponding physical function.

Signed-off-by: David Vrabel <dvrabel@amazon.co.uk>
---
  xen/arch/x86/include/asm/pci.h |   3 +
  xen/arch/x86/msi.c             | 116 +++++++++------------------------
  xen/arch/x86/pci.c             |  39 ++++++++++-
  3 files changed, 71 insertions(+), 87 deletions(-)

diff --git a/xen/arch/x86/include/asm/pci.h b/xen/arch/x86/include/asm/pci.h
index f4a58c8acf..4f59b70959 100644
--- a/xen/arch/x86/include/asm/pci.h
+++ b/xen/arch/x86/include/asm/pci.h
@@ -32,8 +32,11 @@ struct arch_pci_dev {
      domid_t pseudo_domid;
      mfn_t leaf_mfn;
      struct page_list_head pgtables_list;
+    uint16_t host_command;
+    uint16_t guest_command;
  };

+void pci_command_override(struct pci_dev *pdev, uint16_t val);
  int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
                               unsigned int reg, unsigned int size,
                               uint32_t *data);
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index d0bf63df1d..2f8667aa7b 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -124,27 +124,11 @@ static void msix_put_fixmap(struct arch_msix *msix, int idx)
      spin_unlock(&msix->table_lock);
  }

-static bool memory_decoded(const struct pci_dev *dev)
-{
-    pci_sbdf_t sbdf = dev->sbdf;
-
-    if ( dev->info.is_virtfn )
-    {
-        sbdf.bus = dev->info.physfn.bus;
-        sbdf.devfn = dev->info.physfn.devfn;
-    }
-
-    return pci_conf_read16(sbdf, PCI_COMMAND) & PCI_COMMAND_MEMORY;
-}
-
  static bool msix_memory_decoded(const struct pci_dev *dev, unsigned int pos)
  {
      uint16_t control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));

-    if ( !(control & PCI_MSIX_FLAGS_ENABLE) )
-        return false;
-
-    return memory_decoded(dev);
+    return control & PCI_MSIX_FLAGS_ENABLE;
  }

  /*
@@ -314,7 +298,7 @@ int msi_maskable_irq(const struct msi_desc *entry)
             || entry->msi_attrib.maskbit;
  }

-static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
+static void msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
  {
      struct msi_desc *entry = desc->msi_desc;
      struct pci_dev *pdev;
@@ -354,45 +338,26 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
                               control | (PCI_MSIX_FLAGS_ENABLE |
                                          PCI_MSIX_FLAGS_MASKALL));
          }
-        if ( likely(memory_decoded(pdev)) )
-        {
-            writel(flag, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
-            readl(entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);

-            if ( likely(control & PCI_MSIX_FLAGS_ENABLE) )
-                break;
-
-            entry->msi_attrib.host_masked = host;
-            entry->msi_attrib.guest_masked = guest;
+        writel(flag, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
+        readl(entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);

-            flag = true;
-        }
-        else if ( flag && !(control & PCI_MSIX_FLAGS_MASKALL) )
+        if ( unlikely(!(control & PCI_MSIX_FLAGS_ENABLE)) )
          {
-            domid_t domid = pdev->domain->domain_id;
-
-            maskall = true;
-            if ( pdev->msix->warned != domid )
-            {
-                pdev->msix->warned = domid;
-                printk(XENLOG_G_WARNING
-                       "cannot mask IRQ %d: masking MSI-X on Dom%d's 
%pp\n",
-                       desc->irq, domid, &pdev->sbdf);
-            }
+            pdev->msix->host_maskall = maskall;
+            if ( maskall || pdev->msix->guest_maskall )
+                control |= PCI_MSIX_FLAGS_MASKALL;
+            pci_conf_write16(pdev->sbdf,
+                             msix_control_reg(entry->msi_attrib.pos), control);
          }
-        pdev->msix->host_maskall = maskall;
-        if ( maskall || pdev->msix->guest_maskall )
-            control |= PCI_MSIX_FLAGS_MASKALL;
-        pci_conf_write16(pdev->sbdf,
-                         msix_control_reg(entry->msi_attrib.pos), control);
-        return flag;
+        break;
      default:
-        return 0;
+        ASSERT_UNREACHABLE();
+        break;
      }
+
      entry->msi_attrib.host_masked = host;
      entry->msi_attrib.guest_masked = guest;
-
-    return 1;
  }

  static int msi_get_mask_bit(const struct msi_desc *entry)
@@ -418,16 +383,12 @@ static int msi_get_mask_bit(const struct msi_desc *entry)

  void cf_check mask_msi_irq(struct irq_desc *desc)
  {
-    if ( unlikely(!msi_set_mask_bit(desc, 1,
-                                    desc->msi_desc->msi_attrib.guest_masked)) )
-        BUG_ON(!(desc->status & IRQ_DISABLED));
+    msi_set_mask_bit(desc, 1, desc->msi_desc->msi_attrib.guest_masked);
  }

  void cf_check unmask_msi_irq(struct irq_desc *desc)
  {
-    if ( unlikely(!msi_set_mask_bit(desc, 0,
-                                    desc->msi_desc->msi_attrib.guest_masked)) )
-        WARN();
+    msi_set_mask_bit(desc, 0, desc->msi_desc->msi_attrib.guest_masked);
  }

  void guest_mask_msi_irq(struct irq_desc *desc, bool mask)
@@ -437,15 +398,13 @@ void guest_mask_msi_irq(struct irq_desc *desc, bool mask)

  static unsigned int cf_check startup_msi_irq(struct irq_desc *desc)
  {
-    if ( unlikely(!msi_set_mask_bit(desc, 0, !!(desc->status & IRQ_GUEST))) )
-        WARN();
+    msi_set_mask_bit(desc, 0, !!(desc->status & IRQ_GUEST));
      return 0;
  }

  static void cf_check shutdown_msi_irq(struct irq_desc *desc)
  {
-    if ( unlikely(!msi_set_mask_bit(desc, 1, 1)) )
-        BUG_ON(!(desc->status & IRQ_DISABLED));
+    msi_set_mask_bit(desc, 1, 1);
  }

  void cf_check ack_nonmaskable_msi_irq(struct irq_desc *desc)
@@ -785,6 +744,12 @@ static int msix_capability_init(struct pci_dev *dev,

      ASSERT(pcidevs_locked());

+    /*
+     * Force enable access to the MSI-X tables, so access to the
+     * per-vector mask bits always works.
+     */
+    pci_command_override(dev, PCI_COMMAND_MEMORY);
+
      control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
      /*
      * Ensure MSI-X interrupts are masked during setup. Some devices require
@@ -797,13 +762,6 @@ static int msix_capability_init(struct pci_dev *dev,
                       control | (PCI_MSIX_FLAGS_ENABLE |
                                  PCI_MSIX_FLAGS_MASKALL));

-    if ( unlikely(!memory_decoded(dev)) )
-    {
-        pci_conf_write16(dev->sbdf, msix_control_reg(pos),
-                         control & ~PCI_MSIX_FLAGS_ENABLE);
-        return -ENXIO;
-    }
-
      if ( desc )
      {
          entry = alloc_msi_entry(1);
@@ -1122,19 +1080,15 @@ static void __pci_disable_msix(struct msi_desc *entry)

      BUG_ON(list_empty(&dev->msi_list));

-    if ( likely(memory_decoded(dev)) )
-        writel(1, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
-    else if ( !(control & PCI_MSIX_FLAGS_MASKALL) )
-    {
-        printk(XENLOG_WARNING "cannot disable IRQ %d: masking MSI-X on %pp\n",
-               entry->irq, &dev->sbdf);
-        maskall = true;
-    }
+    writel(1, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
+
      dev->msix->host_maskall = maskall;
      if ( maskall || dev->msix->guest_maskall )
          control |= PCI_MSIX_FLAGS_MASKALL;
      pci_conf_write16(dev->sbdf, msix_control_reg(pos), control);

+    pci_command_override(dev, 0);
+
      _pci_cleanup_msix(dev->msix);
  }

@@ -1353,13 +1307,6 @@ int pci_restore_msi_state(struct pci_dev *pdev)
              pci_conf_write16(pdev->sbdf, msix_control_reg(pos),
                               control | (PCI_MSIX_FLAGS_ENABLE |
                                          PCI_MSIX_FLAGS_MASKALL));
-            if ( unlikely(!memory_decoded(pdev)) )
-            {
-                spin_unlock_irqrestore(&desc->lock, flags);
-                pci_conf_write16(pdev->sbdf, msix_control_reg(pos),
-                                 control & ~PCI_MSIX_FLAGS_ENABLE);
-                return -ENXIO;
-            }
          }
          type = entry->msi_attrib.type;

@@ -1368,10 +1315,9 @@ int pci_restore_msi_state(struct pci_dev *pdev)

          for ( i = 0; ; )
          {
-            if ( unlikely(!msi_set_mask_bit(desc,
-                                            entry[i].msi_attrib.host_masked,
-                                            entry[i].msi_attrib.guest_masked)) )
-                BUG();
+            msi_set_mask_bit(desc,
+                             entry[i].msi_attrib.host_masked,
+                             entry[i].msi_attrib.guest_masked);

              if ( !--nr )
                  break;
diff --git a/xen/arch/x86/pci.c b/xen/arch/x86/pci.c
index 97b792e578..0c4b49f042 100644
--- a/xen/arch/x86/pci.c
+++ b/xen/arch/x86/pci.c
@@ -69,6 +69,24 @@ void pci_conf_write(uint32_t cf8, uint8_t offset, uint8_t bytes, uint32_t data)
      spin_unlock_irqrestore(&pci_config_lock, flags);
  }

+void pci_command_override(struct pci_dev *pdev, uint16_t val)
+{
+    pci_sbdf_t sbdf = pdev->sbdf;
+
+    ASSERT(pcidevs_locked());
+
+    if ( pdev->info.is_virtfn )
+    {
+        sbdf.bus = pdev->info.physfn.bus;
+        sbdf.devfn = pdev->info.physfn.devfn;
+
+        pdev = pci_get_pdev(NULL, sbdf);
+    }
+
+    pdev->arch.host_command = val;
+    pci_conf_write16(sbdf, PCI_COMMAND, pdev->arch.host_command | pdev->arch.guest_command);
+}
+
  int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
                               unsigned int reg, unsigned int size,
                               uint32_t *data)
@@ -85,14 +103,31 @@ int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
       * Avoid expensive operations when no hook is going to do anything
       * for the access anyway.
       */
-    if ( reg < 64 || reg >= 256 )
+    if ( reg != PCI_COMMAND && (reg < 64 || reg >= 256) )
          return 0;

      pcidevs_lock();

      pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bdf));
      if ( pdev )
-        rc = pci_msi_conf_write_intercept(pdev, reg, size, data);
+    {
+        switch ( reg )
+        {
+        case PCI_COMMAND:
+            if ( size == 2 )
+            {
+                pdev->arch.guest_command = *data;
+                *data |= pdev->arch.host_command;
+            }
+            else
+                rc = -EACCES;
+            break;
+
+        default:
+            rc = pci_msi_conf_write_intercept(pdev, reg, size, data);
+            break;
+        }
+    }

      pcidevs_unlock();

-- 
2.37.1




* Re: [PATCH 2/3] x86/msi: remove return value from msi_set_mask_bit()
  2022-11-11 14:41     ` David Vrabel
@ 2022-11-14 10:14       ` Jan Beulich
  0 siblings, 0 replies; 12+ messages in thread
From: Jan Beulich @ 2022-11-14 10:14 UTC (permalink / raw)
  To: David Vrabel
  Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, David Vrabel, xen-devel

On 11.11.2022 15:41, David Vrabel wrote:
> On 11/11/2022 09:44, Jan Beulich wrote:
>>
>> The idea of the WARN() / BUG_ON() is to
>> - not leave failed unmasking unrecorded,
>> - not continue after failure to mask an entry.
> 
> Then lets make msi_set_mask_bit() unable to fail with something like 
> this (untested) patch. Would this be acceptable?
> 
> David

Hmm, that's quite a bit of code churn for something which, for now at
least, I'm not convinced needs changing. Yes, ...

>  From 837649a70d44455f4fd98e2eaa46dcf35a56d00a Mon Sep 17 00:00:00 2001
> From: David Vrabel <dvrabel@amazon.co.uk>
> Date: Fri, 11 Nov 2022 14:30:16 +0000
> Subject: [PATCH] x86: Always enable memory space decodes when using MSI-X
> 
> Instead of the numerous (racy) checks for memory space accesses being
> enabled before writing the the MSI-X table, force Memory Space Enable
> to be set in the Command register if MSI-X is used.

... there is some raciness there, but we assume a well-behaved Dom0
(and DomUs don't have direct access to the decode enable bits), so the
checks are more to be on the safe side (the original change attempted
to merely deal with the specific pciback behavior, without impacting
other [odd/unexpected] things Dom0 may be doing, as long as what it's
doing isn't plain wrong/buggy).

> This allows the memory_decoded() function and the associated error
> paths to be removed (since it will always return true). In particular,
> msi_set_mask_bit() can no longer fail and its return value is removed.
> 
> Note that if the PCI device is a virtual function, the relevant
> command register is in the corresponding physical function.
> 
> Signed-off-by: David Vrabel <dvrabel@amazon.co.uk>

What I'm missing in the description is a discussion of the safety of
the change, in particular the taking away of the control of the
memory decode bit from Dom0 (over certain periods of time). Without
that I'm afraid I can't answer your question (and wouldn't want to
review the change in detail).

Jan



* Re: [PATCH 3/3] x86/pci: Fix racy accesses to MSI-X Control register
  2022-11-10 16:59 ` [PATCH 3/3] x86/pci: Fix racy accesses to MSI-X Control register David Vrabel
@ 2022-12-12 17:04   ` Jan Beulich
  2022-12-13 11:34     ` David Vrabel
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Beulich @ 2022-12-12 17:04 UTC (permalink / raw)
  To: David Vrabel
  Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, David Vrabel, xen-devel

On 10.11.2022 17:59, David Vrabel wrote:
> Concurrent accesses to the MSI-X Control register are not serialized
> with a suitable lock. For example, accesses in msix_capability_init()
> are protected by pcidevs_lock(), but some calls to msi_set_mask_bit()
> use only the interrupt descriptor lock.
> 
> This can lead to MSI-X being incorrectly disabled and subsequent
> failures due to msix_memory_decoded() calls that check for MSI-X being
> enabled.
> 
> This was seen with some non-compliant hardware that gated MSI-X
> messages on the per-vector mask bit only (i.e., the MSI-X Enable and
> Function Mask bits in the MSI-X Control register were ignored). An
> interrupt (with a pending move) for vector 0 would occur while vector
> 1 was being initialized in msix_capability_init(). Updates to the
> Control register would race and the vector 1 initialization would
> intermittently fail with -ENXIO.
> 
> Typically a race between initializing a vector and another vector
> being moved doesn't occur because:
> 
> 1. Racy Control accesses only occur when MSI-X is (guest) disabled.
> 
> 2. Hardware should only raise interrupts when MSI-X is enabled and unmasked.
> 
> 3. Xen always sets Function Mask when temporarily enabling MSI-X.
> 
> But there may be other race conditions depending on hardware and guest
> driver behaviour (e.g., disabling MSI-X when an IRQ has a pending move
> on another PCPU).
> 
> Fix this by:
> 
> 1. Tracking the host and guest enable state in a similar way to the
>    host and guest maskall state. Note that since multiple CPUs can be
>    updating different vectors concurrently, a counter is needed for
>    the host enable state.
> 
> 2. Adding a new lock to serialize the Control read/modify/write
>    sequence.
> 
> 3. Wrapping the above in two helper functions (msix_update_lock() and
>    msix_update_unlock()), which bracket any MSI-X register updates
>    that require MSI-X to be (temporarily) enabled.
> 
> Signed-off-by: David Vrabel <dvrabel@amazon.co.uk>

Just a couple of mechanical / style comments, at least for now:

> SIM: https://t.corp.amazon.com/P63914633
> 
> CR: https://code.amazon.com/reviews/CR-79020945

These tags shouldn't be here, unless they have a meaning in the
"upstream" context.

> --- a/xen/arch/x86/include/asm/msi.h
> +++ b/xen/arch/x86/include/asm/msi.h
> @@ -237,7 +237,10 @@ struct arch_msix {
>      int table_refcnt[MAX_MSIX_TABLE_PAGES];
>      int table_idx[MAX_MSIX_TABLE_PAGES];
>      spinlock_t table_lock;
> +    spinlock_t control_lock;
>      bool host_maskall, guest_maskall;
> +    uint16_t host_enable;

If you want to keep this more narrow than "unsigned int", then please
add a BUILD_BUG_ON() against NR_CPUS, so the need to update the field
can be easily noticed (in some perhaps distant future).
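
For example, something along the lines of this (a sketch only; the
exact placement is left open):

    /* host_enable is uint16_t: ensure one in-flight increment per CPU
       cannot overflow it. */
    BUILD_BUG_ON(NR_CPUS > 0xffff);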

> +static void msix_update_unlock(struct pci_dev *dev, unsigned int pos, uint16_t control)
> +{
> +    uint16_t new_control;
> +    unsigned long flags;
> +
> +    spin_lock_irqsave(&dev->msix->control_lock, flags);
> +
> +    dev->msix->host_enable--;
> +
> +    new_control = control & ~(PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL);
> +
> +    if ( dev->msix->host_enable || dev->msix->guest_enable )
> +        new_control |= PCI_MSIX_FLAGS_ENABLE;
> +    if ( dev->msix->host_maskall || dev->msix->guest_maskall || dev->msix->host_enable )
> +        new_control |= PCI_MSIX_FLAGS_MASKALL;

In particular this use of "host_enable" suggests the field wants to be
named differently: It makes no sense to derive MASKALL from ENABLE
without it being clear (from the name) that the field is an init-time
override only.

> @@ -299,11 +350,16 @@ static void msix_set_enable(struct pci_dev *dev, int enable)
>      pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSIX);
>      if ( pos )
>      {
> +        spin_lock_irq(&dev->msix->control_lock);
> +
> +        dev->msix->host_enable = false;

You explicitly say this isn't a boolean, so the initializer would better
be 0.

Jan



* Re: [PATCH 3/3] x86/pci: Fix racy accesses to MSI-X Control register
  2022-12-12 17:04   ` Jan Beulich
@ 2022-12-13 11:34     ` David Vrabel
  2022-12-13 11:50       ` Jan Beulich
  0 siblings, 1 reply; 12+ messages in thread
From: David Vrabel @ 2022-12-13 11:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, David Vrabel, xen-devel



On 12/12/2022 17:04, Jan Beulich wrote:
> On 10.11.2022 17:59, David Vrabel wrote:
>> 
>> --- a/xen/arch/x86/include/asm/msi.h
>> +++ b/xen/arch/x86/include/asm/msi.h
>> @@ -237,7 +237,10 @@ struct arch_msix {
>>       int table_refcnt[MAX_MSIX_TABLE_PAGES];
>>       int table_idx[MAX_MSIX_TABLE_PAGES];
>>       spinlock_t table_lock;
>> +    spinlock_t control_lock;
>>       bool host_maskall, guest_maskall;
>> +    uint16_t host_enable;
> 
> If you want to keep this more narrow than "unsigned int", then please
> add a BUILD_BUG_ON() against NR_CPUS, so the need to update the field
> can be easily noticed (in some perhaps distant future).

This is only incremented:

- while holding the pci_devs lock, or
- while holding a lock for one of the associated IRQs.

Since there are at most 4096 MSI-X vectors (and thus at most 4096 IRQs), 
the highest value this can be (even with >> 4096 PCPUs) is 4097, thus a 
uint16_t is fine.

>> +static void msix_update_unlock(struct pci_dev *dev, unsigned int pos, uint16_t control)
>> +{
>> +    uint16_t new_control;
>> +    unsigned long flags;
>> +
>> +    spin_lock_irqsave(&dev->msix->control_lock, flags);
>> +
>> +    dev->msix->host_enable--;
>> +
>> +    new_control = control & ~(PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL);
>> +
>> +    if ( dev->msix->host_enable || dev->msix->guest_enable )
>> +        new_control |= PCI_MSIX_FLAGS_ENABLE;
>> +    if ( dev->msix->host_maskall || dev->msix->guest_maskall || dev->msix->host_enable )
>> +        new_control |= PCI_MSIX_FLAGS_MASKALL;
> 
> In particular this use of "host_enable" suggests the field wants to be
> named differently: It makes no sense to derive MASKALL from ENABLE
> without it being clear (from the name) that the field is an init-time
> override only.

I think the name as-is makes sense. It is analogous to the host_maskall 
and complements guest_enable. I can't think of a better name, so what 
name do you suggest?

David



* Re: [PATCH 3/3] x86/pci: Fix racy accesses to MSI-X Control register
  2022-12-13 11:34     ` David Vrabel
@ 2022-12-13 11:50       ` Jan Beulich
  2022-12-13 12:03         ` David Vrabel
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Beulich @ 2022-12-13 11:50 UTC (permalink / raw)
  To: David Vrabel
  Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, David Vrabel, xen-devel

On 13.12.2022 12:34, David Vrabel wrote:
> On 12/12/2022 17:04, Jan Beulich wrote:
>> On 10.11.2022 17:59, David Vrabel wrote:
>>>
>>> --- a/xen/arch/x86/include/asm/msi.h
>>> +++ b/xen/arch/x86/include/asm/msi.h
>>> @@ -237,7 +237,10 @@ struct arch_msix {
>>>       int table_refcnt[MAX_MSIX_TABLE_PAGES];
>>>       int table_idx[MAX_MSIX_TABLE_PAGES];
>>>       spinlock_t table_lock;
>>> +    spinlock_t control_lock;
>>>       bool host_maskall, guest_maskall;
>>> +    uint16_t host_enable;
>>
>> If you want to keep this more narrow than "unsigned int", then please
>> add a BUILD_BUG_ON() against NR_CPUS, so the need to update the field
>> can be easily noticed (in some perhaps distant future).
> 
> This is only incremented:
> 
> - while holding the pci_devs lock, or
> - while holding a lock for one of the associated IRQs.

How do the locks held matter here, especially given that - as iirc you say
in the description - neither lock is held uniformly?

> Since there are at most 4096 MSI-X vectors (and thus at most 4096 IRQs), 
> the highest value this can be (even with >> 4096 PCPUs) is 4097, thus a 
> uint16_t is fine.

Where's the 4096 coming from as a limit for MSI-X vectors? DYM 2048, which
is the per-device limit (because the qsize field is 11 bits wide)? If so,
yes, I think that's indeed restricting how large the number can get.

>>> +static void msix_update_unlock(struct pci_dev *dev, unsigned int pos, uint16_t control)
>>> +{
>>> +    uint16_t new_control;
>>> +    unsigned long flags;
>>> +
>>> +    spin_lock_irqsave(&dev->msix->control_lock, flags);
>>> +
>>> +    dev->msix->host_enable--;
>>> +
>>> +    new_control = control & ~(PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL);
>>> +
>>> +    if ( dev->msix->host_enable || dev->msix->guest_enable )
>>> +        new_control |= PCI_MSIX_FLAGS_ENABLE;
>>> +    if ( dev->msix->host_maskall || dev->msix->guest_maskall || dev->msix->host_enable )
>>> +        new_control |= PCI_MSIX_FLAGS_MASKALL;
>>
>> In particular this use of "host_enable" suggests the field wants to be
>> named differently: It makes no sense to derive MASKALL from ENABLE
>> without it being clear (from the name) that the field is an init-time
>> override only.
> 
> I think the name as-is makes sense. It is analogous to the host_maskall 
> and complements guest_enable. I can't think of a better name, so what 
> name do you suggest.

I could only think of less neat ones like host_enable_override or
forced_enable or some such. If I had any good name in mind, I would
certainly have said so.

Jan



* Re: [PATCH 3/3] x86/pci: Fix racy accesses to MSI-X Control register
  2022-12-13 11:50       ` Jan Beulich
@ 2022-12-13 12:03         ` David Vrabel
  0 siblings, 0 replies; 12+ messages in thread
From: David Vrabel @ 2022-12-13 12:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné, Wei Liu, David Vrabel, xen-devel



On 13/12/2022 11:50, Jan Beulich wrote:
> On 13.12.2022 12:34, David Vrabel wrote:
>> On 12/12/2022 17:04, Jan Beulich wrote:
>>> On 10.11.2022 17:59, David Vrabel wrote:
>>>>
>>>> --- a/xen/arch/x86/include/asm/msi.h
>>>> +++ b/xen/arch/x86/include/asm/msi.h
>>>> @@ -237,7 +237,10 @@ struct arch_msix {
>>>>        int table_refcnt[MAX_MSIX_TABLE_PAGES];
>>>>        int table_idx[MAX_MSIX_TABLE_PAGES];
>>>>        spinlock_t table_lock;
>>>> +    spinlock_t control_lock;
>>>>        bool host_maskall, guest_maskall;
>>>> +    uint16_t host_enable;
>>>
>>> If you want to keep this more narrow than "unsigned int", then please
>>> add a BUILD_BUG_ON() against NR_CPUS, so the need to update the field
>>> can be easily noticed (in some perhaps distant future).
>>
>> This is only incremented:
>>
>> - while holding the pci_devs lock, or
>> - while holding a lock for one of the associated IRQs.
> 
> How do the locks held matter here, especially given that - as iirc you say
> in the description - neither lock is held uniformly?

Because the usage here is:

    lock()
    host_enable++
    ...
    host_enable--
    unlock()

>> Since there are at most 4096 MSI-X vectors (and thus at most 4096 IRQs),
>> the highest value this can be (even with >> 4096 PCPUs) is 4097, thus a
>> uint16_t is fine.
> 
> Where's the 4096 coming from as a limit for MSI-X vectors? DYM 2048, which
> is the per-device limit (because the qsize field is 11 bits wide)? If so,
> yes, I think that's indeed restricting how large the number can get.

Yes, I did mean 2048 here.

> I could only think of less neat ones like host_enable_override or
> forced_enable or some such. If I had any good name in mind, I would
> certainly have said so.

host_enable_override works for me.

David

