* [PATCH v3 00/10] x86/MSI-X: XSA-120, 128..131 follow-up
@ 2015-06-05 11:09 Jan Beulich
  2015-06-05 11:20 ` [PATCH v3 01/10] x86/MSI-X: be more careful during teardown Jan Beulich
                   ` (9 more replies)
  0 siblings, 10 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-05 11:09 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Keir Fraser

The problem requiring the first patch here is actually what led to
XSA-120. That patch in turn introduces an issue similar to XSA-129,
which gets dealt with by patch 10 (and which of course also requires
qemu to be fixed; a separate series will be sent for that). (Note that
the ordering of the patches is solely due to their history; it seemed
risky and undue extra work to re-order them.)

01: MSI-X: be more careful during teardown
02: MSI-X: access MSI-X table only after having enabled MSI-X
03: MSI-X: reduce fiddling with control register during restore
04: MSI-X: cleanup
05: MSI: track host and guest masking separately
06: vMSI-X: cleanup
07: vMSI-X: support qword MMIO access
08: MSI-X: use qword MMIO access for address writes
09: VT-d: use qword MMIO access for MSI address writes
10: MSI-X: provide hypercall interface for MSI-X mask-all control

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New patches 5...10. Small update to patch 2 (see there).


* [PATCH v3 01/10] x86/MSI-X: be more careful during teardown
  2015-06-05 11:09 [PATCH v3 00/10] x86/MSI-X: XSA-120, 128..131 follow-up Jan Beulich
@ 2015-06-05 11:20 ` Jan Beulich
  2015-06-05 11:20 ` [PATCH v3 02/10] x86/MSI-X: access MSI-X table only after having enabled MSI-X Jan Beulich
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-05 11:20 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Keir Fraser

When a device gets detached from a guest, pciback will clear its
command register, thus disabling both memory and I/O decoding. The
disabled memory decoding, however, has an effect on the MSI-X table
accesses the hypervisor does: These won't have the intended effect
anymore. Even worse, for PCIe devices (but not SR-IOV virtual
functions) such accesses may (will?) be treated as Unsupported
Requests, causing respective errors to be surfaced, potentially in the
form of NMIs that may be fatal to the hypervisor or Dom0 in different
ways. Hence rather than carrying out these accesses, we should avoid
them where we can, and use alternative (e.g. PCI config space based)
mechanisms to achieve at least the same effect.

At this time it continues to be unclear whether this is fixing an
actual bug or is rather just working around bogus (but apparently
common) system behavior.
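
In sketch form (a standalone model, not the patch's actual code), the
guard being added amounts to checking the memory space enable bit in
the function's PCI command register before any MMIO access to the
MSI-X table; for an SR-IOV VF the patch consults the PF's command
register instead:

    #include <stdbool.h>
    #include <stdint.h>

    #define PCI_COMMAND        0x04 /* command register offset */
    #define PCI_COMMAND_MEMORY 0x02 /* memory space enable bit */

    /* Stub standing in for a real config space accessor. */
    extern uint16_t conf_read16(uint8_t bus, uint8_t devfn, unsigned int reg);

    /* True iff MMIO accesses to the function's BARs get decoded. */
    static bool memory_decoded(uint8_t bus, uint8_t devfn)
    {
        return conf_read16(bus, devfn, PCI_COMMAND) & PCI_COMMAND_MEMORY;
    }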

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
The use of the mask-all bit here collides with qemu's incorrect use
of that same bit. This would become a security issue if released that
way. A later patch in this series will provide the infrastructure for
qemu to stop direct access to that bit. A qemu series including a patch
making use of the new interface will be sent subsequently.

Backporting note (largely to myself):
   Depends on (not yet backported) commit 061eebe0e "x86/MSI: drop
   workaround for insecure Dom0 kernels" (due to re-use of struct
   arch_msix's warned field).

--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -217,9 +217,9 @@ void destroy_irq(unsigned int irq)
     }
 
     spin_lock_irqsave(&desc->lock, flags);
-    desc->status  |= IRQ_DISABLED;
     desc->status  &= ~IRQ_GUEST;
     desc->handler->shutdown(desc);
+    desc->status |= IRQ_DISABLED;
     action = desc->action;
     desc->action  = NULL;
     desc->msi_desc = NULL;
@@ -995,8 +995,8 @@ void __init release_irq(unsigned int irq
     spin_lock_irqsave(&desc->lock,flags);
     action = desc->action;
     desc->action  = NULL;
-    desc->status |= IRQ_DISABLED;
     desc->handler->shutdown(desc);
+    desc->status |= IRQ_DISABLED;
     spin_unlock_irqrestore(&desc->lock,flags);
 
     /* Wait to make sure it's not being used on another CPU */
@@ -1726,8 +1726,8 @@ static irq_guest_action_t *__pirq_guest_
     BUG_ON(action->in_flight != 0);
 
     /* Disabling IRQ before releasing the desc_lock avoids an IRQ storm. */
-    desc->status |= IRQ_DISABLED;
     desc->handler->disable(desc);
+    desc->status |= IRQ_DISABLED;
 
     /*
      * Mark any remaining pending EOIs as ready to flush.
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -121,6 +121,27 @@ static void msix_put_fixmap(struct arch_
     spin_unlock(&msix->table_lock);
 }
 
+static bool_t memory_decoded(const struct pci_dev *dev)
+{
+    u8 bus, slot, func;
+
+    if ( !dev->info.is_virtfn )
+    {
+        bus = dev->bus;
+        slot = PCI_SLOT(dev->devfn);
+        func = PCI_FUNC(dev->devfn);
+    }
+    else
+    {
+        bus = dev->info.physfn.bus;
+        slot = PCI_SLOT(dev->info.physfn.devfn);
+        func = PCI_FUNC(dev->info.physfn.devfn);
+    }
+
+    return !!(pci_conf_read16(dev->seg, bus, slot, func, PCI_COMMAND) &
+              PCI_COMMAND_MEMORY);
+}
+
 /*
  * MSI message composition
  */
@@ -162,7 +183,7 @@ void msi_compose_msg(unsigned vector, co
     }
 }
 
-static void read_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
+static bool_t read_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 {
     switch ( entry->msi_attrib.type )
     {
@@ -198,6 +219,8 @@ static void read_msi_msg(struct msi_desc
         void __iomem *base;
         base = entry->mask_base;
 
+        if ( unlikely(!memory_decoded(entry->dev)) )
+            return 0;
         msg->address_lo = readl(base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
         msg->address_hi = readl(base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
         msg->data = readl(base + PCI_MSIX_ENTRY_DATA_OFFSET);
@@ -209,6 +232,8 @@ static void read_msi_msg(struct msi_desc
 
     if ( iommu_intremap )
         iommu_read_msi_from_ire(entry, msg);
+
+    return 1;
 }
 
 static int write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
@@ -260,6 +285,8 @@ static int write_msi_msg(struct msi_desc
         void __iomem *base;
         base = entry->mask_base;
 
+        if ( unlikely(!memory_decoded(entry->dev)) )
+            return -ENXIO;
         writel(msg->address_lo,
                base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
         writel(msg->address_hi,
@@ -287,7 +314,8 @@ void set_msi_affinity(struct irq_desc *d
     ASSERT(spin_is_locked(&desc->lock));
 
     memset(&msg, 0, sizeof(msg));
-    read_msi_msg(msi_desc, &msg);
+    if ( !read_msi_msg(msi_desc, &msg) )
+        return;
 
     msg.data &= ~MSI_DATA_VECTOR_MASK;
     msg.data |= MSI_DATA_VECTOR(desc->arch.vector);
@@ -347,20 +375,24 @@ int msi_maskable_irq(const struct msi_de
            || entry->msi_attrib.maskbit;
 }
 
-static void msi_set_mask_bit(struct irq_desc *desc, int flag)
+static bool_t msi_set_mask_bit(struct irq_desc *desc, int flag)
 {
     struct msi_desc *entry = desc->msi_desc;
+    struct pci_dev *pdev;
+    u16 seg;
+    u8 bus, slot, func;
 
     ASSERT(spin_is_locked(&desc->lock));
     BUG_ON(!entry || !entry->dev);
+    pdev = entry->dev;
+    seg = pdev->seg;
+    bus = pdev->bus;
+    slot = PCI_SLOT(pdev->devfn);
+    func = PCI_FUNC(pdev->devfn);
     switch (entry->msi_attrib.type) {
     case PCI_CAP_ID_MSI:
         if (entry->msi_attrib.maskbit) {
             u32 mask_bits;
-            u16 seg = entry->dev->seg;
-            u8 bus = entry->dev->bus;
-            u8 slot = PCI_SLOT(entry->dev->devfn);
-            u8 func = PCI_FUNC(entry->dev->devfn);
 
             mask_bits = pci_conf_read32(seg, bus, slot, func, entry->msi.mpos);
             mask_bits &= ~((u32)1 << entry->msi_attrib.entry_nr);
@@ -369,24 +401,52 @@ static void msi_set_mask_bit(struct irq_
         }
         break;
     case PCI_CAP_ID_MSIX:
-    {
-        int offset = PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
-        writel(flag, entry->mask_base + offset);
-        readl(entry->mask_base + offset);
-        break;
-    }
+        if ( likely(memory_decoded(pdev)) )
+        {
+            writel(flag, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
+            readl(entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
+            break;
+        }
+        if ( flag )
+        {
+            u16 control;
+            domid_t domid = pdev->domain->domain_id;
+
+            control = pci_conf_read16(seg, bus, slot, func,
+                                      msix_control_reg(entry->msi_attrib.pos));
+            if ( control & PCI_MSIX_FLAGS_MASKALL )
+                break;
+            pci_conf_write16(seg, bus, slot, func,
+                             msix_control_reg(entry->msi_attrib.pos),
+                             control | PCI_MSIX_FLAGS_MASKALL);
+            if ( pdev->msix->warned != domid )
+            {
+                pdev->msix->warned = domid;
+                printk(XENLOG_G_WARNING
+                       "cannot mask IRQ %d: masked MSI-X on Dom%d's %04x:%02x:%02x.%u\n",
+                       desc->irq, domid, pdev->seg, pdev->bus,
+                       PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+            }
+            break;
+        }
+        /* fall through */
     default:
-        BUG();
-        break;
+        return 0;
     }
     entry->msi_attrib.masked = !!flag;
+
+    return 1;
 }
 
 static int msi_get_mask_bit(const struct msi_desc *entry)
 {
-    switch (entry->msi_attrib.type) {
+    if ( !entry->dev )
+        return -1;
+
+    switch ( entry->msi_attrib.type )
+    {
     case PCI_CAP_ID_MSI:
-        if (!entry->dev || !entry->msi_attrib.maskbit)
+        if ( !entry->msi_attrib.maskbit )
             break;
         return (pci_conf_read32(entry->dev->seg, entry->dev->bus,
                                 PCI_SLOT(entry->dev->devfn),
@@ -394,6 +454,8 @@ static int msi_get_mask_bit(const struct
                                 entry->msi.mpos) >>
                 entry->msi_attrib.entry_nr) & 1;
     case PCI_CAP_ID_MSIX:
+        if ( unlikely(!memory_decoded(entry->dev)) )
+            break;
         return readl(entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET) & 1;
     }
     return -1;
@@ -401,12 +463,14 @@ static int msi_get_mask_bit(const struct
 
 void mask_msi_irq(struct irq_desc *desc)
 {
-    msi_set_mask_bit(desc, 1);
+    if ( unlikely(!msi_set_mask_bit(desc, 1)) )
+        BUG_ON(!(desc->status & IRQ_DISABLED));
 }
 
 void unmask_msi_irq(struct irq_desc *desc)
 {
-    msi_set_mask_bit(desc, 0);
+    if ( unlikely(!msi_set_mask_bit(desc, 0)) )
+        WARN();
 }
 
 static unsigned int startup_msi_irq(struct irq_desc *desc)
@@ -723,6 +787,9 @@ static int msix_capability_init(struct p
     control = pci_conf_read16(seg, bus, slot, func, msix_control_reg(pos));
     msix_set_enable(dev, 0);/* Ensure msix is disabled as I set it up */
 
+    if ( unlikely(!memory_decoded(dev)) )
+        return -ENXIO;
+
     if ( desc )
     {
         entry = alloc_msi_entry(1);
@@ -855,7 +922,8 @@ static int msix_capability_init(struct p
     ++msix->used_entries;
 
     /* Restore MSI-X enabled bits */
-    pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos), control);
+    pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos),
+                     control & ~PCI_MSIX_FLAGS_MASKALL);
 
     return 0;
 }
@@ -1008,8 +1076,16 @@ static void __pci_disable_msix(struct ms
 
     BUG_ON(list_empty(&dev->msi_list));
 
-    writel(1, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
-
+    if ( likely(memory_decoded(dev)) )
+        writel(1, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
+    else if ( !(control & PCI_MSIX_FLAGS_MASKALL) )
+    {
+        printk(XENLOG_WARNING
+               "cannot disable IRQ %d: masking MSI-X on %04x:%02x:%02x.%u\n",
+               entry->irq, dev->seg, dev->bus,
+               PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
+        control |= PCI_MSIX_FLAGS_MASKALL;
+    }
     pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos), control);
 
     _pci_cleanup_msix(dev->msix);
@@ -1147,14 +1223,23 @@ int pci_restore_msi_state(struct pci_dev
             nr = entry->msi.nvec;
         }
         else if ( entry->msi_attrib.type == PCI_CAP_ID_MSIX )
+        {
             msix_set_enable(pdev, 0);
+            if ( unlikely(!memory_decoded(pdev)) )
+            {
+                spin_unlock_irqrestore(&desc->lock, flags);
+                return -ENXIO;
+            }
+        }
 
         msg = entry->msg;
         write_msi_msg(entry, &msg);
 
         for ( i = 0; ; )
         {
-            msi_set_mask_bit(desc, entry[i].msi_attrib.masked);
+            if ( unlikely(!msi_set_mask_bit(desc,
+                                            entry[i].msi_attrib.masked)) )
+                BUG();
 
             if ( !--nr )
                 break;



* [PATCH v3 02/10] x86/MSI-X: access MSI-X table only after having enabled MSI-X
  2015-06-05 11:09 [PATCH v3 00/10] x86/MSI-X: XSA-120, 128..131 follow-up Jan Beulich
  2015-06-05 11:20 ` [PATCH v3 01/10] x86/MSI-X: be more careful during teardown Jan Beulich
@ 2015-06-05 11:20 ` Jan Beulich
  2015-06-05 13:01   ` Andrew Cooper
  2015-06-05 11:21 ` [PATCH v3 03/10] x86/MSI-X: reduce fiddling with control register during restore Jan Beulich
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-05 11:20 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Keir Fraser

As done in Linux by f598282f51 ("PCI: Fix the NIU MSI-X problem in a
better way") and its broken predecessor, make sure we don't access the
MSI-X table without having enabled MSI-X first, using the mask-all flag
instead to prevent interrupts from occurring.
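
The resulting pattern, shown here as a standalone sketch (stub config
space accessors; the flag values follow the MSI-X message control
layout in the PCI spec): set the enable and mask-all bits together
before touching the table, and restore the original control word
afterwards:

    #include <stdint.h>

    #define PCI_MSIX_FLAGS_ENABLE  0x8000 /* MSI-X enable */
    #define PCI_MSIX_FLAGS_MASKALL 0x4000 /* function (mask-all) mask */

    /* Stubs standing in for real config space accessors. */
    extern uint16_t conf_read16(unsigned int reg);
    extern void conf_write16(unsigned int reg, uint16_t val);

    static void msix_table_access_window(unsigned int control_reg)
    {
        uint16_t control = conf_read16(control_reg);

        if ( !(control & PCI_MSIX_FLAGS_ENABLE) )
            conf_write16(control_reg,
                         control | PCI_MSIX_FLAGS_ENABLE |
                                   PCI_MSIX_FLAGS_MASKALL);

        /* ... MSI-X table MMIO is architecturally valid here ... */

        conf_write16(control_reg, control); /* restore prior state */
    }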

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: temporarily enable MSI-X in setup_msi_irq() if not already enabled

--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -142,6 +142,17 @@ static bool_t memory_decoded(const struc
               PCI_COMMAND_MEMORY);
 }
 
+static bool_t msix_memory_decoded(const struct pci_dev *dev, unsigned int pos)
+{
+    u16 control = pci_conf_read16(dev->seg, dev->bus, PCI_SLOT(dev->devfn),
+                                  PCI_FUNC(dev->devfn), msix_control_reg(pos));
+
+    if ( !(control & PCI_MSIX_FLAGS_ENABLE) )
+        return 0;
+
+    return memory_decoded(dev);
+}
+
 /*
  * MSI message composition
  */
@@ -219,7 +230,8 @@ static bool_t read_msi_msg(struct msi_de
         void __iomem *base;
         base = entry->mask_base;
 
-        if ( unlikely(!memory_decoded(entry->dev)) )
+        if ( unlikely(!msix_memory_decoded(entry->dev,
+                                           entry->msi_attrib.pos)) )
             return 0;
         msg->address_lo = readl(base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
         msg->address_hi = readl(base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
@@ -285,7 +297,8 @@ static int write_msi_msg(struct msi_desc
         void __iomem *base;
         base = entry->mask_base;
 
-        if ( unlikely(!memory_decoded(entry->dev)) )
+        if ( unlikely(!msix_memory_decoded(entry->dev,
+                                           entry->msi_attrib.pos)) )
             return -ENXIO;
         writel(msg->address_lo,
                base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
@@ -379,7 +392,7 @@ static bool_t msi_set_mask_bit(struct ir
 {
     struct msi_desc *entry = desc->msi_desc;
     struct pci_dev *pdev;
-    u16 seg;
+    u16 seg, control;
     u8 bus, slot, func;
 
     ASSERT(spin_is_locked(&desc->lock));
@@ -401,35 +414,38 @@ static bool_t msi_set_mask_bit(struct ir
         }
         break;
     case PCI_CAP_ID_MSIX:
+        control = pci_conf_read16(seg, bus, slot, func,
+                                  msix_control_reg(entry->msi_attrib.pos));
+        if ( unlikely(!(control & PCI_MSIX_FLAGS_ENABLE)) )
+            pci_conf_write16(seg, bus, slot, func,
+                             msix_control_reg(entry->msi_attrib.pos),
+                             control | (PCI_MSIX_FLAGS_ENABLE |
+                                        PCI_MSIX_FLAGS_MASKALL));
         if ( likely(memory_decoded(pdev)) )
         {
             writel(flag, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
             readl(entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
-            break;
+            if ( likely(control & PCI_MSIX_FLAGS_ENABLE) )
+                break;
+            flag = 1;
         }
-        if ( flag )
+        else if ( flag && !(control & PCI_MSIX_FLAGS_MASKALL) )
         {
-            u16 control;
             domid_t domid = pdev->domain->domain_id;
 
-            control = pci_conf_read16(seg, bus, slot, func,
-                                      msix_control_reg(entry->msi_attrib.pos));
-            if ( control & PCI_MSIX_FLAGS_MASKALL )
-                break;
-            pci_conf_write16(seg, bus, slot, func,
-                             msix_control_reg(entry->msi_attrib.pos),
-                             control | PCI_MSIX_FLAGS_MASKALL);
+            control |= PCI_MSIX_FLAGS_MASKALL;
             if ( pdev->msix->warned != domid )
             {
                 pdev->msix->warned = domid;
                 printk(XENLOG_G_WARNING
-                       "cannot mask IRQ %d: masked MSI-X on Dom%d's %04x:%02x:%02x.%u\n",
+                       "cannot mask IRQ %d: masking MSI-X on Dom%d's %04x:%02x:%02x.%u\n",
                        desc->irq, domid, pdev->seg, pdev->bus,
                        PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
             }
-            break;
         }
-        /* fall through */
+        pci_conf_write16(seg, bus, slot, func,
+                         msix_control_reg(entry->msi_attrib.pos), control);
+        return flag;
     default:
         return 0;
     }
@@ -454,7 +470,8 @@ static int msi_get_mask_bit(const struct
                                 entry->msi.mpos) >>
                 entry->msi_attrib.entry_nr) & 1;
     case PCI_CAP_ID_MSIX:
-        if ( unlikely(!memory_decoded(entry->dev)) )
+        if ( unlikely(!msix_memory_decoded(entry->dev,
+                                           entry->msi_attrib.pos)) )
             break;
         return readl(entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET) & 1;
     }
@@ -543,9 +560,31 @@ static struct msi_desc *alloc_msi_entry(
 
 int setup_msi_irq(struct irq_desc *desc, struct msi_desc *msidesc)
 {
-    return __setup_msi_irq(desc, msidesc,
-                           msi_maskable_irq(msidesc) ? &pci_msi_maskable
-                                                     : &pci_msi_nonmaskable);
+    const struct pci_dev *pdev = msidesc->dev;
+    unsigned int cpos = msix_control_reg(msidesc->msi_attrib.pos);
+    u16 control = ~0;
+    int rc;
+
+    if ( msidesc->msi_attrib.type == PCI_CAP_ID_MSIX )
+    {
+        control = pci_conf_read16(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                                  PCI_FUNC(pdev->devfn), cpos);
+        if ( !(control & PCI_MSIX_FLAGS_ENABLE) )
+            pci_conf_write16(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                             PCI_FUNC(pdev->devfn), cpos,
+                             control | (PCI_MSIX_FLAGS_ENABLE |
+                                        PCI_MSIX_FLAGS_MASKALL));
+    }
+
+    rc = __setup_msi_irq(desc, msidesc,
+                         msi_maskable_irq(msidesc) ? &pci_msi_maskable
+                                                   : &pci_msi_nonmaskable);
+
+    if ( !(control & PCI_MSIX_FLAGS_ENABLE) )
+        pci_conf_write16(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                         PCI_FUNC(pdev->devfn), cpos, control);
+
+    return rc;
 }
 
 int __setup_msi_irq(struct irq_desc *desc, struct msi_desc *msidesc,
@@ -785,16 +824,32 @@ static int msix_capability_init(struct p
 
     pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSIX);
     control = pci_conf_read16(seg, bus, slot, func, msix_control_reg(pos));
-    msix_set_enable(dev, 0);/* Ensure msix is disabled as I set it up */
+    /*
+     * Ensure MSI-X interrupts are masked during setup. Some devices require
+     * MSI-X to be enabled before we can touch the MSI-X registers. We need
+     * to mask all the vectors to prevent interrupts coming in before they're
+     * fully set up.
+     */
+    pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos),
+                     control | (PCI_MSIX_FLAGS_ENABLE |
+                                PCI_MSIX_FLAGS_MASKALL));
 
     if ( unlikely(!memory_decoded(dev)) )
+    {
+        pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos),
+                         control & ~PCI_MSIX_FLAGS_ENABLE);
         return -ENXIO;
+    }
 
     if ( desc )
     {
         entry = alloc_msi_entry(1);
         if ( !entry )
+        {
+            pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos),
+                             control & ~PCI_MSIX_FLAGS_ENABLE);
             return -ENOMEM;
+        }
         ASSERT(msi);
     }
 
@@ -825,6 +880,8 @@ static int msix_capability_init(struct p
     {
         if ( !msi || !msi->table_base )
         {
+            pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos),
+                             control & ~PCI_MSIX_FLAGS_ENABLE);
             xfree(entry);
             return -ENXIO;
         }
@@ -867,6 +924,8 @@ static int msix_capability_init(struct p
 
         if ( idx < 0 )
         {
+            pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos),
+                             control & ~PCI_MSIX_FLAGS_ENABLE);
             xfree(entry);
             return idx;
         }
@@ -922,8 +981,7 @@ static int msix_capability_init(struct p
     ++msix->used_entries;
 
     /* Restore MSI-X enabled bits */
-    pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos),
-                     control & ~PCI_MSIX_FLAGS_MASKALL);
+    pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos), control);
 
     return 0;
 }
@@ -1072,7 +1130,10 @@ static void __pci_disable_msix(struct ms
 
     pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSIX);
     control = pci_conf_read16(seg, bus, slot, func, msix_control_reg(pos));
-    msix_set_enable(dev, 0);
+    if ( unlikely(!(control & PCI_MSIX_FLAGS_ENABLE)) )
+        pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos),
+                         control | (PCI_MSIX_FLAGS_ENABLE |
+                                    PCI_MSIX_FLAGS_MASKALL));
 
     BUG_ON(list_empty(&dev->msi_list));
 
@@ -1198,6 +1259,8 @@ int pci_restore_msi_state(struct pci_dev
     list_for_each_entry_safe( entry, tmp, &pdev->msi_list, list )
     {
         unsigned int i = 0, nr = 1;
+        u16 control = 0;
+        u8 slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
 
         irq = entry->irq;
         desc = &irq_desc[irq];
@@ -1224,10 +1287,18 @@ int pci_restore_msi_state(struct pci_dev
         }
         else if ( entry->msi_attrib.type == PCI_CAP_ID_MSIX )
         {
-            msix_set_enable(pdev, 0);
+            control = pci_conf_read16(pdev->seg, pdev->bus, slot, func,
+                                      msix_control_reg(entry->msi_attrib.pos));
+            pci_conf_write16(pdev->seg, pdev->bus, slot, func,
+                             msix_control_reg(entry->msi_attrib.pos),
+                             control | (PCI_MSIX_FLAGS_ENABLE |
+                                        PCI_MSIX_FLAGS_MASKALL));
             if ( unlikely(!memory_decoded(pdev)) )
             {
                 spin_unlock_irqrestore(&desc->lock, flags);
+                pci_conf_write16(pdev->seg, pdev->bus, slot, func,
+                                 msix_control_reg(entry->msi_attrib.pos),
+                                 control & ~PCI_MSIX_FLAGS_ENABLE);
                 return -ENXIO;
             }
         }
@@ -1256,11 +1327,9 @@ int pci_restore_msi_state(struct pci_dev
         if ( entry->msi_attrib.type == PCI_CAP_ID_MSI )
         {
             unsigned int cpos = msi_control_reg(entry->msi_attrib.pos);
-            u16 control = pci_conf_read16(pdev->seg, pdev->bus,
-                                          PCI_SLOT(pdev->devfn),
-                                          PCI_FUNC(pdev->devfn), cpos);
 
-            control &= ~PCI_MSI_FLAGS_QSIZE;
+            control = pci_conf_read16(pdev->seg, pdev->bus, slot, func, cpos) &
+                      ~PCI_MSI_FLAGS_QSIZE;
             multi_msi_enable(control, entry->msi.nvec);
             pci_conf_write16(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
                              PCI_FUNC(pdev->devfn), cpos, control);
@@ -1268,7 +1337,9 @@ int pci_restore_msi_state(struct pci_dev
             msi_set_enable(pdev, 1);
         }
         else if ( entry->msi_attrib.type == PCI_CAP_ID_MSIX )
-            msix_set_enable(pdev, 1);
+            pci_conf_write16(pdev->seg, pdev->bus, slot, func,
+                             msix_control_reg(entry->msi_attrib.pos),
+                             control | PCI_MSIX_FLAGS_ENABLE);
     }
 
     return 0;



* [PATCH v3 03/10] x86/MSI-X: reduce fiddling with control register during restore
  2015-06-05 11:09 [PATCH v3 00/10] x86/MSI-X: XSA-120, 128..131 follow-up Jan Beulich
  2015-06-05 11:20 ` [PATCH v3 01/10] x86/MSI-X: be more careful during teardown Jan Beulich
  2015-06-05 11:20 ` [PATCH v3 02/10] x86/MSI-X: access MSI-X table only after having enabled MSI-X Jan Beulich
@ 2015-06-05 11:21 ` Jan Beulich
  2015-06-05 11:21 ` [PATCH v3 04/10] x86/MSI-X: cleanup Jan Beulich
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-05 11:21 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Keir Fraser

Rather than disabling and enabling MSI-X once per vector, do it just
once per device.
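
In sketch form (hypothetical helper names standing in for the control
register reads/writes visible in the diff below), the restore path
changes from one enable/mask-all cycle per vector to a single cycle
bracketing all vectors:

    struct pci_dev;

    /* Hypothetical helpers wrapping the control register accesses. */
    extern void msix_mask_all_and_enable(struct pci_dev *);
    extern void msix_restore_control(struct pci_dev *);
    extern void restore_vector(struct pci_dev *, unsigned int);

    static void restore_all_vectors(struct pci_dev *pdev, unsigned int nr)
    {
        unsigned int i;

        msix_mask_all_and_enable(pdev);  /* once per device ... */
        for ( i = 0; i < nr; ++i )
            restore_vector(pdev, i);     /* ... not once per vector */
        msix_restore_control(pdev);
    }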

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -1243,6 +1243,9 @@ int pci_restore_msi_state(struct pci_dev
     struct msi_desc *entry, *tmp;
     struct irq_desc *desc;
     struct msi_msg msg;
+    u8 slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    unsigned int type = 0, pos = 0;
+    u16 control = 0;
 
     ASSERT(spin_is_locked(&pcidevs_lock));
 
@@ -1259,8 +1262,6 @@ int pci_restore_msi_state(struct pci_dev
     list_for_each_entry_safe( entry, tmp, &pdev->msi_list, list )
     {
         unsigned int i = 0, nr = 1;
-        u16 control = 0;
-        u8 slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
 
         irq = entry->irq;
         desc = &irq_desc[irq];
@@ -1277,31 +1278,38 @@ int pci_restore_msi_state(struct pci_dev
                     pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
                     PCI_FUNC(pdev->devfn), i);
             spin_unlock_irqrestore(&desc->lock, flags);
+            if ( type == PCI_CAP_ID_MSIX )
+                pci_conf_write16(pdev->seg, pdev->bus, slot, func,
+                                 msix_control_reg(pos),
+                                 control & ~PCI_MSIX_FLAGS_ENABLE);
             return -EINVAL;
         }
 
+        ASSERT(!type || type == entry->msi_attrib.type);
+        pos = entry->msi_attrib.pos;
         if ( entry->msi_attrib.type == PCI_CAP_ID_MSI )
         {
             msi_set_enable(pdev, 0);
             nr = entry->msi.nvec;
         }
-        else if ( entry->msi_attrib.type == PCI_CAP_ID_MSIX )
+        else if ( !type && entry->msi_attrib.type == PCI_CAP_ID_MSIX )
         {
             control = pci_conf_read16(pdev->seg, pdev->bus, slot, func,
-                                      msix_control_reg(entry->msi_attrib.pos));
+                                      msix_control_reg(pos));
             pci_conf_write16(pdev->seg, pdev->bus, slot, func,
-                             msix_control_reg(entry->msi_attrib.pos),
+                             msix_control_reg(pos),
                              control | (PCI_MSIX_FLAGS_ENABLE |
                                         PCI_MSIX_FLAGS_MASKALL));
             if ( unlikely(!memory_decoded(pdev)) )
             {
                 spin_unlock_irqrestore(&desc->lock, flags);
                 pci_conf_write16(pdev->seg, pdev->bus, slot, func,
-                                 msix_control_reg(entry->msi_attrib.pos),
+                                 msix_control_reg(pos),
                                  control & ~PCI_MSIX_FLAGS_ENABLE);
                 return -ENXIO;
             }
         }
+        type = entry->msi_attrib.type;
 
         msg = entry->msg;
         write_msi_msg(entry, &msg);
@@ -1324,9 +1332,9 @@ int pci_restore_msi_state(struct pci_dev
 
         spin_unlock_irqrestore(&desc->lock, flags);
 
-        if ( entry->msi_attrib.type == PCI_CAP_ID_MSI )
+        if ( type == PCI_CAP_ID_MSI )
         {
-            unsigned int cpos = msi_control_reg(entry->msi_attrib.pos);
+            unsigned int cpos = msi_control_reg(pos);
 
             control = pci_conf_read16(pdev->seg, pdev->bus, slot, func, cpos) &
                       ~PCI_MSI_FLAGS_QSIZE;
@@ -1336,12 +1344,13 @@ int pci_restore_msi_state(struct pci_dev
 
             msi_set_enable(pdev, 1);
         }
-        else if ( entry->msi_attrib.type == PCI_CAP_ID_MSIX )
-            pci_conf_write16(pdev->seg, pdev->bus, slot, func,
-                             msix_control_reg(entry->msi_attrib.pos),
-                             control | PCI_MSIX_FLAGS_ENABLE);
     }
 
+    if ( type == PCI_CAP_ID_MSIX )
+        pci_conf_write16(pdev->seg, pdev->bus, slot, func,
+                         msix_control_reg(pos),
+                         control | PCI_MSIX_FLAGS_ENABLE);
+
     return 0;
 }


* [PATCH v3 04/10] x86/MSI-X: cleanup
  2015-06-05 11:09 [PATCH v3 00/10] x86/MSI-X: XSA-120, 128..131 follow-up Jan Beulich
                   ` (2 preceding siblings ...)
  2015-06-05 11:21 ` [PATCH v3 03/10] x86/MSI-X: reduce fiddling with control register during restore Jan Beulich
@ 2015-06-05 11:21 ` Jan Beulich
  2015-06-05 11:22 ` [PATCH v3 05/10] x86/MSI: track host and guest masking separately Jan Beulich
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-05 11:21 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Keir Fraser

- __pci_enable_msix() now checks that an MSI-X capability was actually
  found
- pass "pos" to msix_capability_init() as both callers already know it
  (and hence there's no need to re-obtain it)
- call __pci_disable_msi{,x}() directly instead of via
  pci_disable_msi() from __pci_enable_msi{x,}() state validation paths
- use msix_control_reg() instead of open coding it (see the note after
  this list)
- log message adjustments
- coding style corrections
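
Regarding the msix_control_reg() item above: the macro (paraphrasing
Xen's definition) just names the open-coded "pos + PCI_MSIX_FLAGS"
offset arithmetic being replaced in the msix_set_enable() hunk below:

    #define PCI_MSIX_FLAGS         2    /* message control word offset */
    #define msix_control_reg(base) ((base) + PCI_MSIX_FLAGS)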

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -35,6 +35,8 @@
 static s8 __read_mostly use_msi = -1;
 boolean_param("msi", use_msi);
 
+static void __pci_disable_msix(struct msi_desc *);
+
 /* bitmap indicate which fixed map is free */
 static DEFINE_SPINLOCK(msix_fixmap_lock);
 static DECLARE_BITMAP(msix_fixmap_pages, FIX_MSIX_MAX_PAGES);
@@ -161,12 +163,14 @@ void msi_compose_msg(unsigned vector, co
     unsigned dest;
 
     memset(msg, 0, sizeof(*msg));
-    if ( !cpumask_intersects(cpu_mask, &cpu_online_map) ) {
+    if ( !cpumask_intersects(cpu_mask, &cpu_online_map) )
+    {
         dprintk(XENLOG_ERR,"%s, compose msi message error!!\n", __func__);
         return;
     }
 
-    if ( vector ) {
+    if ( vector )
+    {
         cpumask_t *mask = this_cpu(scratch_mask);
 
         cpumask_and(mask, cpu_mask, &cpu_online_map);
@@ -227,8 +231,7 @@ static bool_t read_msi_msg(struct msi_de
     }
     case PCI_CAP_ID_MSIX:
     {
-        void __iomem *base;
-        base = entry->mask_base;
+        void __iomem *base = entry->mask_base;
 
         if ( unlikely(!msix_memory_decoded(entry->dev,
                                            entry->msi_attrib.pos)) )
@@ -294,8 +297,7 @@ static int write_msi_msg(struct msi_desc
     }
     case PCI_CAP_ID_MSIX:
     {
-        void __iomem *base;
-        base = entry->mask_base;
+        void __iomem *base = entry->mask_base;
 
         if ( unlikely(!msix_memory_decoded(entry->dev,
                                            entry->msi_attrib.pos)) )
@@ -321,7 +323,7 @@ void set_msi_affinity(struct irq_desc *d
     struct msi_desc *msi_desc = desc->msi_desc;
 
     dest = set_desc_affinity(desc, mask);
-    if (dest == BAD_APICID || !msi_desc)
+    if ( dest == BAD_APICID || !msi_desc )
         return;
 
     ASSERT(spin_is_locked(&desc->lock));
@@ -373,11 +375,11 @@ static void msix_set_enable(struct pci_d
     pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSIX);
     if ( pos )
     {
-        control = pci_conf_read16(seg, bus, slot, func, pos + PCI_MSIX_FLAGS);
+        control = pci_conf_read16(seg, bus, slot, func, msix_control_reg(pos));
         control &= ~PCI_MSIX_FLAGS_ENABLE;
         if ( enable )
             control |= PCI_MSIX_FLAGS_ENABLE;
-        pci_conf_write16(seg, bus, slot, func, pos + PCI_MSIX_FLAGS, control);
+        pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos), control);
     }
 }
 
@@ -402,9 +404,11 @@ static bool_t msi_set_mask_bit(struct ir
     bus = pdev->bus;
     slot = PCI_SLOT(pdev->devfn);
     func = PCI_FUNC(pdev->devfn);
-    switch (entry->msi_attrib.type) {
+    switch ( entry->msi_attrib.type )
+    {
     case PCI_CAP_ID_MSI:
-        if (entry->msi_attrib.maskbit) {
+        if ( entry->msi_attrib.maskbit )
+        {
             u32 mask_bits;
 
             mask_bits = pci_conf_read32(seg, bus, slot, func, entry->msi.mpos);
@@ -804,13 +808,14 @@ static u64 read_pci_mem_bar(u16 seg, u8 
  * requested MSI-X entries with allocated irqs or non-zero for otherwise.
  **/
 static int msix_capability_init(struct pci_dev *dev,
+                                unsigned int pos,
                                 struct msi_info *msi,
                                 struct msi_desc **desc,
                                 unsigned int nr_entries)
 {
     struct arch_msix *msix = dev->msix;
     struct msi_desc *entry = NULL;
-    int pos, vf;
+    int vf;
     u16 control;
     u64 table_paddr;
     u32 table_offset;
@@ -822,7 +827,6 @@ static int msix_capability_init(struct p
 
     ASSERT(spin_is_locked(&pcidevs_lock));
 
-    pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSIX);
     control = pci_conf_read16(seg, bus, slot, func, msix_control_reg(pos));
     /*
      * Ensure MSI-X interrupts are masked during setup. Some devices require
@@ -1010,10 +1014,9 @@ static int __pci_enable_msi(struct msi_i
     old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSI);
     if ( old_desc )
     {
-        dprintk(XENLOG_WARNING, "irq %d has already mapped to MSI on "
-                "device %04x:%02x:%02x.%01x\n",
-                msi->irq, msi->seg, msi->bus,
-                PCI_SLOT(msi->devfn), PCI_FUNC(msi->devfn));
+        printk(XENLOG_WARNING "irq %d already mapped to MSI on %04x:%02x:%02x.%u\n",
+               msi->irq, msi->seg, msi->bus,
+               PCI_SLOT(msi->devfn), PCI_FUNC(msi->devfn));
         *desc = old_desc;
         return 0;
     }
@@ -1021,10 +1024,10 @@ static int __pci_enable_msi(struct msi_i
     old_desc = find_msi_entry(pdev, -1, PCI_CAP_ID_MSIX);
     if ( old_desc )
     {
-        dprintk(XENLOG_WARNING, "MSI-X is already in use on "
-                "device %04x:%02x:%02x.%01x\n", msi->seg, msi->bus,
-                PCI_SLOT(msi->devfn), PCI_FUNC(msi->devfn));
-        pci_disable_msi(old_desc);
+        printk(XENLOG_WARNING "MSI-X already in use on %04x:%02x:%02x.%u\n",
+               msi->seg, msi->bus,
+               PCI_SLOT(msi->devfn), PCI_FUNC(msi->devfn));
+        __pci_disable_msix(old_desc);
     }
 
     return msi_capability_init(pdev, msi->irq, desc, msi->entry_nr);
@@ -1038,7 +1041,6 @@ static void __pci_disable_msi(struct msi
     msi_set_enable(dev, 0);
 
     BUG_ON(list_empty(&dev->msi_list));
-
 }
 
 /**
@@ -1058,7 +1060,7 @@ static void __pci_disable_msi(struct msi
  **/
 static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
 {
-    int status, pos, nr_entries;
+    int pos, nr_entries;
     struct pci_dev *pdev;
     u16 control;
     u8 slot = PCI_SLOT(msi->devfn);
@@ -1067,23 +1069,22 @@ static int __pci_enable_msix(struct msi_
 
     ASSERT(spin_is_locked(&pcidevs_lock));
     pdev = pci_get_pdev(msi->seg, msi->bus, msi->devfn);
-    if ( !pdev )
+    pos = pci_find_cap_offset(msi->seg, msi->bus, slot, func, PCI_CAP_ID_MSIX);
+    if ( !pdev || !pos )
         return -ENODEV;
 
-    pos = pci_find_cap_offset(msi->seg, msi->bus, slot, func, PCI_CAP_ID_MSIX);
     control = pci_conf_read16(msi->seg, msi->bus, slot, func,
                               msix_control_reg(pos));
     nr_entries = multi_msix_capable(control);
-    if (msi->entry_nr >= nr_entries)
+    if ( msi->entry_nr >= nr_entries )
         return -EINVAL;
 
     old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSIX);
     if ( old_desc )
     {
-        dprintk(XENLOG_WARNING, "irq %d has already mapped to MSIX on "
-                "device %04x:%02x:%02x.%01x\n",
-                msi->irq, msi->seg, msi->bus,
-                PCI_SLOT(msi->devfn), PCI_FUNC(msi->devfn));
+        printk(XENLOG_WARNING "irq %d already mapped to MSI-X on %04x:%02x:%02x.%u\n",
+               msi->irq, msi->seg, msi->bus,
+               PCI_SLOT(msi->devfn), PCI_FUNC(msi->devfn));
         *desc = old_desc;
         return 0;
     }
@@ -1091,15 +1092,13 @@ static int __pci_enable_msix(struct msi_
     old_desc = find_msi_entry(pdev, -1, PCI_CAP_ID_MSI);
     if ( old_desc )
     {
-        dprintk(XENLOG_WARNING, "MSI is already in use on "
-                "device %04x:%02x:%02x.%01x\n", msi->seg, msi->bus,
-                PCI_SLOT(msi->devfn), PCI_FUNC(msi->devfn));
-        pci_disable_msi(old_desc);
-
+        printk(XENLOG_WARNING "MSI already in use on %04x:%02x:%02x.%u\n",
+               msi->seg, msi->bus,
+               PCI_SLOT(msi->devfn), PCI_FUNC(msi->devfn));
+        __pci_disable_msi(old_desc);
     }
 
-    status = msix_capability_init(pdev, msi, desc, nr_entries);
-    return status;
+    return msix_capability_init(pdev, pos, msi, desc, nr_entries);
 }
 
 static void _pci_cleanup_msix(struct arch_msix *msix)
@@ -1117,19 +1116,16 @@ static void _pci_cleanup_msix(struct arc
 
 static void __pci_disable_msix(struct msi_desc *entry)
 {
-    struct pci_dev *dev;
-    int pos;
-    u16 control, seg;
-    u8 bus, slot, func;
-
-    dev = entry->dev;
-    seg = dev->seg;
-    bus = dev->bus;
-    slot = PCI_SLOT(dev->devfn);
-    func = PCI_FUNC(dev->devfn);
+    struct pci_dev *dev = entry->dev;
+    u16 seg = dev->seg;
+    u8 bus = dev->bus;
+    u8 slot = PCI_SLOT(dev->devfn);
+    u8 func = PCI_FUNC(dev->devfn);
+    unsigned int pos = pci_find_cap_offset(seg, bus, slot, func,
+                                           PCI_CAP_ID_MSIX);
+    u16 control = pci_conf_read16(seg, bus, slot, func,
+                                  msix_control_reg(entry->msi_attrib.pos));
 
-    pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSIX);
-    control = pci_conf_read16(seg, bus, slot, func, msix_control_reg(pos));
     if ( unlikely(!(control & PCI_MSIX_FLAGS_ENABLE)) )
         pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos),
                          control | (PCI_MSIX_FLAGS_ENABLE |
@@ -1182,7 +1178,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8
         u16 control = pci_conf_read16(seg, bus, slot, func,
                                       msix_control_reg(pos));
 
-        rc = msix_capability_init(pdev, NULL, NULL,
+        rc = msix_capability_init(pdev, pos, NULL, NULL,
                                   multi_msix_capable(control));
     }
     spin_unlock(&pcidevs_lock);
@@ -1201,8 +1197,8 @@ int pci_enable_msi(struct msi_info *msi,
     if ( !use_msi )
         return -EPERM;
 
-    return  msi->table_base ? __pci_enable_msix(msi, desc) :
-        __pci_enable_msi(msi, desc);
+    return msi->table_base ? __pci_enable_msix(msi, desc) :
+                             __pci_enable_msi(msi, desc);
 }
 
 /*
@@ -1255,7 +1251,9 @@ int pci_restore_msi_state(struct pci_dev
     if ( !pdev )
         return -EINVAL;
 
-    ret = xsm_resource_setup_pci(XSM_PRIV, (pdev->seg << 16) | (pdev->bus << 8) | pdev->devfn);
+    ret = xsm_resource_setup_pci(XSM_PRIV,
+                                (pdev->seg << 16) | (pdev->bus << 8) |
+                                pdev->devfn);
     if ( ret )
         return ret;
 




^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 05/10] x86/MSI: track host and guest masking separately
  2015-06-05 11:09 [PATCH v3 00/10] x86/MSI-X: XSA-120, 128..131 follow-up Jan Beulich
                   ` (3 preceding siblings ...)
  2015-06-05 11:21 ` [PATCH v3 04/10] x86/MSI-X: cleanup Jan Beulich
@ 2015-06-05 11:22 ` Jan Beulich
  2015-06-05 13:05   ` Andrew Cooper
  2016-04-01  7:40   ` Li, Liang Z
  2015-06-05 11:23 ` [PATCH v3 06/10] x86/vMSI-X: cleanup Jan Beulich
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-05 11:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 11334 bytes --]

In particular we want to avoid losing track of our own intention to
have an entry masked. Physical unmasking now happens only when both
host and guest have requested it.
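
As a minimal sketch of that rule (field names as in the patch below;
the helper name and its exact shape are invented for illustration):

static void set_mask_sketch(struct msi_desc *entry, bool_t host, bool_t guest)
{
    bool_t flag = host || guest;   /* hardware unmasks only if both clear */

    /* ... "flag" is what msi_set_mask_bit() writes to the device's
     * mask bit for the MSI/MSI-X variant at hand ... */

    entry->msi_attrib.host_masked = host;
    entry->msi_attrib.guest_masked = guest;
}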

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -241,7 +241,7 @@ static void hpet_msi_unmask(struct irq_d
     cfg = hpet_read32(HPET_Tn_CFG(ch->idx));
     cfg |= HPET_TN_ENABLE;
     hpet_write32(cfg, HPET_Tn_CFG(ch->idx));
-    ch->msi.msi_attrib.masked = 0;
+    ch->msi.msi_attrib.host_masked = 0;
 }
 
 static void hpet_msi_mask(struct irq_desc *desc)
@@ -252,7 +252,7 @@ static void hpet_msi_mask(struct irq_des
     cfg = hpet_read32(HPET_Tn_CFG(ch->idx));
     cfg &= ~HPET_TN_ENABLE;
     hpet_write32(cfg, HPET_Tn_CFG(ch->idx));
-    ch->msi.msi_attrib.masked = 1;
+    ch->msi.msi_attrib.host_masked = 1;
 }
 
 static int hpet_msi_write(struct hpet_event_channel *ch, struct msi_msg *msg)
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -219,7 +219,6 @@ static int msixtbl_read(
 {
     unsigned long offset;
     struct msixtbl_entry *entry;
-    void *virt;
     unsigned int nr_entry, index;
     int r = X86EMUL_UNHANDLEABLE;
 
@@ -244,10 +243,16 @@ static int msixtbl_read(
     }
     else 
     {
-        virt = msixtbl_addr_to_virt(entry, address);
+        const struct msi_desc *msi_desc;
+        void *virt = msixtbl_addr_to_virt(entry, address);
+
         if ( !virt )
             goto out;
-        *pval = readl(virt);
+        msi_desc = virt_to_msi_desc(entry->pdev, virt);
+        if ( !msi_desc )
+            goto out;
+        *pval = MASK_INSR(msi_desc->msi_attrib.guest_masked,
+                          PCI_MSIX_VECTOR_BITMASK);
     }
     
     r = X86EMUL_OKAY;
@@ -265,7 +270,7 @@ static int msixtbl_write(struct vcpu *v,
     void *virt;
     unsigned int nr_entry, index;
     int r = X86EMUL_UNHANDLEABLE;
-    unsigned long flags, orig;
+    unsigned long flags;
     struct irq_desc *desc;
 
     if ( len != 4 || (address & 3) )
@@ -318,37 +323,7 @@ static int msixtbl_write(struct vcpu *v,
 
     ASSERT(msi_desc == desc->msi_desc);
    
-    orig = readl(virt);
-
-    /*
-     * Do not allow guest to modify MSI-X control bit if it is masked 
-     * by Xen. We'll only handle the case where Xen thinks that
-     * bit is unmasked, but hardware has silently masked the bit
-     * (in case of SR-IOV VF reset, etc). On the other hand, if Xen 
-     * thinks that the bit is masked, but it's really not, 
-     * we log a warning.
-     */
-    if ( msi_desc->msi_attrib.masked )
-    {
-        if ( !(orig & PCI_MSIX_VECTOR_BITMASK) )
-            printk(XENLOG_WARNING "MSI-X control bit is unmasked when"
-                   " it is expected to be masked [%04x:%02x:%02x.%u]\n", 
-                   entry->pdev->seg, entry->pdev->bus,
-                   PCI_SLOT(entry->pdev->devfn), 
-                   PCI_FUNC(entry->pdev->devfn));
-
-        goto unlock;
-    }
-
-    /*
-     * The mask bit is the only defined bit in the word. But we 
-     * ought to preserve the reserved bits. Clearing the reserved 
-     * bits can result in undefined behaviour (see PCI Local Bus
-     * Specification revision 2.3).
-     */
-    val &= PCI_MSIX_VECTOR_BITMASK;
-    val |= (orig & ~PCI_MSIX_VECTOR_BITMASK);
-    writel(val, virt);
+    guest_mask_msi_irq(desc, !!(val & PCI_MSIX_VECTOR_BITMASK));
 
 unlock:
     spin_unlock_irqrestore(&desc->lock, flags);
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -390,12 +390,13 @@ int msi_maskable_irq(const struct msi_de
            || entry->msi_attrib.maskbit;
 }
 
-static bool_t msi_set_mask_bit(struct irq_desc *desc, int flag)
+static bool_t msi_set_mask_bit(struct irq_desc *desc, bool_t host, bool_t guest)
 {
     struct msi_desc *entry = desc->msi_desc;
     struct pci_dev *pdev;
     u16 seg, control;
     u8 bus, slot, func;
+    bool_t flag = host || guest;
 
     ASSERT(spin_is_locked(&desc->lock));
     BUG_ON(!entry || !entry->dev);
@@ -453,7 +454,8 @@ static bool_t msi_set_mask_bit(struct ir
     default:
         return 0;
     }
-    entry->msi_attrib.masked = !!flag;
+    entry->msi_attrib.host_masked = host;
+    entry->msi_attrib.guest_masked = guest;
 
     return 1;
 }
@@ -484,22 +486,39 @@ static int msi_get_mask_bit(const struct
 
 void mask_msi_irq(struct irq_desc *desc)
 {
-    if ( unlikely(!msi_set_mask_bit(desc, 1)) )
+    if ( unlikely(!msi_set_mask_bit(desc, 1,
+                                    desc->msi_desc->msi_attrib.guest_masked)) )
         BUG_ON(!(desc->status & IRQ_DISABLED));
 }
 
 void unmask_msi_irq(struct irq_desc *desc)
 {
-    if ( unlikely(!msi_set_mask_bit(desc, 0)) )
+    if ( unlikely(!msi_set_mask_bit(desc, 0,
+                                    desc->msi_desc->msi_attrib.guest_masked)) )
         WARN();
 }
 
+void guest_mask_msi_irq(struct irq_desc *desc, bool_t mask)
+{
+    msi_set_mask_bit(desc, desc->msi_desc->msi_attrib.host_masked, mask);
+}
+
 static unsigned int startup_msi_irq(struct irq_desc *desc)
 {
-    unmask_msi_irq(desc);
+    bool_t guest_masked = (desc->status & IRQ_GUEST) &&
+                          is_hvm_domain(desc->msi_desc->dev->domain);
+
+    if ( unlikely(!msi_set_mask_bit(desc, 0, guest_masked)) )
+        WARN();
     return 0;
 }
 
+static void shutdown_msi_irq(struct irq_desc *desc)
+{
+    if ( unlikely(!msi_set_mask_bit(desc, 1, 1)) )
+        BUG_ON(!(desc->status & IRQ_DISABLED));
+}
+
 void ack_nonmaskable_msi_irq(struct irq_desc *desc)
 {
     irq_complete_move(desc);
@@ -524,7 +543,7 @@ void end_nonmaskable_msi_irq(struct irq_
 static hw_irq_controller pci_msi_maskable = {
     .typename     = "PCI-MSI/-X",
     .startup      = startup_msi_irq,
-    .shutdown     = mask_msi_irq,
+    .shutdown     = shutdown_msi_irq,
     .enable       = unmask_msi_irq,
     .disable      = mask_msi_irq,
     .ack          = ack_maskable_msi_irq,
@@ -694,7 +713,8 @@ static int msi_capability_init(struct pc
         entry[i].msi_attrib.is_64 = is_64bit_address(control);
         entry[i].msi_attrib.entry_nr = i;
         entry[i].msi_attrib.maskbit = is_mask_bit_support(control);
-        entry[i].msi_attrib.masked = 1;
+        entry[i].msi_attrib.host_masked = 1;
+        entry[i].msi_attrib.guest_masked = 0;
         entry[i].msi_attrib.pos = pos;
         if ( entry[i].msi_attrib.maskbit )
             entry[i].msi.mpos = mpos;
@@ -943,7 +963,8 @@ static int msix_capability_init(struct p
         entry->msi_attrib.is_64 = 1;
         entry->msi_attrib.entry_nr = msi->entry_nr;
         entry->msi_attrib.maskbit = 1;
-        entry->msi_attrib.masked = 1;
+        entry->msi_attrib.host_masked = 1;
+        entry->msi_attrib.guest_masked = 1;
         entry->msi_attrib.pos = pos;
         entry->irq = msi->irq;
         entry->dev = dev;
@@ -1315,7 +1336,8 @@ int pci_restore_msi_state(struct pci_dev
         for ( i = 0; ; )
         {
             if ( unlikely(!msi_set_mask_bit(desc,
-                                            entry[i].msi_attrib.masked)) )
+                                            entry[i].msi_attrib.host_masked,
+                                            entry[i].msi_attrib.guest_masked)) )
                 BUG();
 
             if ( !--nr )
@@ -1469,7 +1491,7 @@ static void dump_msi(unsigned char key)
         else
             mask = '?';
         printk(" %-6s%4u vec=%02x%7s%6s%3sassert%5s%7s"
-               " dest=%08x mask=%d/%d/%c\n",
+               " dest=%08x mask=%d/%c%c/%c\n",
                type, irq,
                (data & MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT,
                data & MSI_DATA_DELIVERY_LOWPRI ? "lowest" : "fixed",
@@ -1477,7 +1499,10 @@ static void dump_msi(unsigned char key)
                data & MSI_DATA_LEVEL_ASSERT ? "" : "de",
                addr & MSI_ADDR_DESTMODE_LOGIC ? "log" : "phys",
                addr & MSI_ADDR_REDIRECTION_LOWPRI ? "lowest" : "cpu",
-               dest32, attr.maskbit, attr.masked, mask);
+               dest32, attr.maskbit,
+               attr.host_masked ? 'H' : ' ',
+               attr.guest_masked ? 'G' : ' ',
+               mask);
     }
 }
 
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -451,7 +451,7 @@ static void iommu_msi_unmask(struct irq_
     spin_lock_irqsave(&iommu->lock, flags);
     amd_iommu_msi_enable(iommu, IOMMU_CONTROL_ENABLED);
     spin_unlock_irqrestore(&iommu->lock, flags);
-    iommu->msi.msi_attrib.masked = 0;
+    iommu->msi.msi_attrib.host_masked = 0;
 }
 
 static void iommu_msi_mask(struct irq_desc *desc)
@@ -464,7 +464,7 @@ static void iommu_msi_mask(struct irq_de
     spin_lock_irqsave(&iommu->lock, flags);
     amd_iommu_msi_enable(iommu, IOMMU_CONTROL_DISABLED);
     spin_unlock_irqrestore(&iommu->lock, flags);
-    iommu->msi.msi_attrib.masked = 1;
+    iommu->msi.msi_attrib.host_masked = 1;
 }
 
 static unsigned int iommu_msi_startup(struct irq_desc *desc)
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -997,7 +997,7 @@ static void dma_msi_unmask(struct irq_de
     spin_lock_irqsave(&iommu->register_lock, flags);
     dmar_writel(iommu->reg, DMAR_FECTL_REG, 0);
     spin_unlock_irqrestore(&iommu->register_lock, flags);
-    iommu->msi.msi_attrib.masked = 0;
+    iommu->msi.msi_attrib.host_masked = 0;
 }
 
 static void dma_msi_mask(struct irq_desc *desc)
@@ -1009,7 +1009,7 @@ static void dma_msi_mask(struct irq_desc
     spin_lock_irqsave(&iommu->register_lock, flags);
     dmar_writel(iommu->reg, DMAR_FECTL_REG, DMA_FECTL_IM);
     spin_unlock_irqrestore(&iommu->register_lock, flags);
-    iommu->msi.msi_attrib.masked = 1;
+    iommu->msi.msi_attrib.host_masked = 1;
 }
 
 static unsigned int dma_msi_startup(struct irq_desc *desc)
--- a/xen/include/asm-x86/msi.h
+++ b/xen/include/asm-x86/msi.h
@@ -90,12 +90,13 @@ extern unsigned int pci_msix_get_table_l
 
 struct msi_desc {
 	struct msi_attrib {
-		__u8	type	: 5; 	/* {0: unused, 5h:MSI, 11h:MSI-X} */
-		__u8	maskbit	: 1; 	/* mask-pending bit supported ?   */
-		__u8	masked	: 1;
+		__u8	type;		/* {0: unused, 5h:MSI, 11h:MSI-X} */
+		__u8	pos;		/* Location of the MSI capability */
+		__u8	maskbit	: 1;	/* mask/pending bit supported ?   */
 		__u8	is_64	: 1;	/* Address size: 0=32bit 1=64bit  */
-		__u8	pos;	 	/* Location of the msi capability */
-		__u16	entry_nr;    	/* specific enabled entry 	  */
+		__u8	host_masked : 1;
+		__u8	guest_masked : 1;
+		__u16	entry_nr;	/* specific enabled entry 	  */
 	} msi_attrib;
 
 	struct list_head list;
@@ -236,6 +237,7 @@ void msi_compose_msg(unsigned vector, co
 void __msi_set_enable(u16 seg, u8 bus, u8 slot, u8 func, int pos, int enable);
 void mask_msi_irq(struct irq_desc *);
 void unmask_msi_irq(struct irq_desc *);
+void guest_mask_msi_irq(struct irq_desc *, bool_t mask);
 void ack_nonmaskable_msi_irq(struct irq_desc *);
 void end_nonmaskable_msi_irq(struct irq_desc *, u8 vector);
 void set_msi_affinity(struct irq_desc *, const cpumask_t *);




^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 06/10] x86/vMSI-X: cleanup
  2015-06-05 11:09 [PATCH v3 00/10] x86/MSI-X: XSA-120, 128..131 follow-up Jan Beulich
                   ` (4 preceding siblings ...)
  2015-06-05 11:22 ` [PATCH v3 05/10] x86/MSI: track host and guest masking separately Jan Beulich
@ 2015-06-05 11:23 ` Jan Beulich
  2015-06-05 13:07   ` Andrew Cooper
  2015-06-05 11:24 ` [PATCH v3 07/10] x86/vMSI-X: support qword MMIO access Jan Beulich
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-05 11:23 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 3369 bytes --]

Fold msixtbl_addr_to_virt() + virt_to_msi_desc() into simplified
msixtbl_addr_to_desc(), as the callers don't need the virtual address
anymore.
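
Since every MSI-X table entry is PCI_MSIX_ENTRY_SIZE (16) bytes, the
entry number falls out of plain offset arithmetic. A stripped-down
sketch of the folded lookup (addr_to_desc_sketch() is an invented
name; the NULL checks of the real function are elided):

static struct msi_desc *addr_to_desc_sketch(
    const struct msixtbl_entry *entry, unsigned long addr)
{
    unsigned int nr_entry = (addr - entry->gtable) / PCI_MSIX_ENTRY_SIZE;
    struct msi_desc *desc;

    /* Match the computed entry number against the device's descriptors. */
    list_for_each_entry( desc, &entry->pdev->msi_list, list )
        if ( desc->msi_attrib.type == PCI_CAP_ID_MSIX &&
             desc->msi_attrib.entry_nr == nr_entry )
            return desc;

    return NULL;
}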

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -181,36 +181,23 @@ static struct msixtbl_entry *msixtbl_fin
     return NULL;
 }
 
-static struct msi_desc *virt_to_msi_desc(struct pci_dev *dev, void *virt)
+static struct msi_desc *msixtbl_addr_to_desc(
+    const struct msixtbl_entry *entry, unsigned long addr)
 {
+    unsigned int nr_entry;
     struct msi_desc *desc;
 
-    list_for_each_entry( desc, &dev->msi_list, list )
-        if ( desc->msi_attrib.type == PCI_CAP_ID_MSIX  &&
-             virt >= desc->mask_base &&
-             virt < desc->mask_base + PCI_MSIX_ENTRY_SIZE ) 
-            return desc;
-
-    return NULL;
-}
-
-static void __iomem *msixtbl_addr_to_virt(
-    struct msixtbl_entry *entry, unsigned long addr)
-{
-    unsigned int idx, nr_page;
-
     if ( !entry || !entry->pdev )
         return NULL;
 
-    nr_page = (addr >> PAGE_SHIFT) -
-              (entry->gtable >> PAGE_SHIFT);
+    nr_entry = (addr - entry->gtable) / PCI_MSIX_ENTRY_SIZE;
 
-    idx = entry->pdev->msix->table_idx[nr_page];
-    if ( !idx )
-        return NULL;
+    list_for_each_entry( desc, &entry->pdev->msi_list, list )
+        if ( desc->msi_attrib.type == PCI_CAP_ID_MSIX &&
+             desc->msi_attrib.entry_nr == nr_entry )
+            return desc;
 
-    return (void *)(fix_to_virt(idx) +
-                    (addr & ((1UL << PAGE_SHIFT) - 1)));
+    return NULL;
 }
 
 static int msixtbl_read(
@@ -243,12 +230,8 @@ static int msixtbl_read(
     }
     else 
     {
-        const struct msi_desc *msi_desc;
-        void *virt = msixtbl_addr_to_virt(entry, address);
+        const struct msi_desc *msi_desc = msixtbl_addr_to_desc(entry, address);
 
-        if ( !virt )
-            goto out;
-        msi_desc = virt_to_msi_desc(entry->pdev, virt);
         if ( !msi_desc )
             goto out;
         *pval = MASK_INSR(msi_desc->msi_attrib.guest_masked,
@@ -267,7 +250,6 @@ static int msixtbl_write(struct vcpu *v,
     unsigned long offset;
     struct msixtbl_entry *entry;
     const struct msi_desc *msi_desc;
-    void *virt;
     unsigned int nr_entry, index;
     int r = X86EMUL_UNHANDLEABLE;
     unsigned long flags;
@@ -304,11 +286,7 @@ static int msixtbl_write(struct vcpu *v,
         goto out;
     }
 
-    virt = msixtbl_addr_to_virt(entry, address);
-    if ( !virt )
-        goto out;
-
-    msi_desc = virt_to_msi_desc(entry->pdev, virt);
+    msi_desc = msixtbl_addr_to_desc(entry, address);
     if ( !msi_desc || msi_desc->irq < 0 )
         goto out;
     
@@ -336,17 +314,13 @@ out:
 
 static int msixtbl_range(struct vcpu *v, unsigned long addr)
 {
-    struct msixtbl_entry *entry;
-    void *virt;
+    const struct msi_desc *desc;
 
     rcu_read_lock(&msixtbl_rcu_lock);
-
-    entry = msixtbl_find_entry(v, addr);
-    virt = msixtbl_addr_to_virt(entry, addr);
-
+    desc = msixtbl_addr_to_desc(msixtbl_find_entry(v, addr), addr);
     rcu_read_unlock(&msixtbl_rcu_lock);
 
-    return !!virt;
+    return !!desc;
 }
 
 const struct hvm_mmio_handler msixtbl_mmio_handler = {





^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 07/10] x86/vMSI-X: support qword MMIO access
  2015-06-05 11:09 [PATCH v3 00/10] x86/MSI-X: XSA-120, 128..131 follow-up Jan Beulich
                   ` (5 preceding siblings ...)
  2015-06-05 11:23 ` [PATCH v3 06/10] x86/vMSI-X: cleanup Jan Beulich
@ 2015-06-05 11:24 ` Jan Beulich
  2015-06-05 15:34   ` Andrew Cooper
  2015-06-05 11:25 ` [PATCH v3 08/10] x86/MSI-X: use qword MMIO access for address writes Jan Beulich
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-05 11:24 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 3256 bytes --]

The specification explicitly provides for this, so we should have
supported this from the beginning.
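
Concretely, a naturally aligned qword access covers either the two
address dwords (entry offsets 0-7) or the data and vector control
dwords (offsets 8-15), so it never crosses an entry boundary and can
be decomposed into the existing dword handling. A sketch of the
access check the patch applies (msix_access_ok() is an invented name):

static bool_t msix_access_ok(unsigned long address, unsigned int len)
{
    /* Accept only naturally aligned 4- or 8-byte accesses. */
    return (len == 4 || len == 8) && !(address & (len - 1));
}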

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -209,7 +209,7 @@ static int msixtbl_read(
     unsigned int nr_entry, index;
     int r = X86EMUL_UNHANDLEABLE;
 
-    if ( len != 4 || (address & 3) )
+    if ( (len != 4 && len != 8) || (address & (len - 1)) )
         return r;
 
     rcu_read_lock(&msixtbl_rcu_lock);
@@ -227,15 +227,28 @@ static int msixtbl_read(
              !acc_bit(test, entry, nr_entry, index) )
             goto out;
         *pval = entry->gentries[nr_entry].msi_ad[index];
+        if ( len == 8 )
+        {
+            if ( index )
+                offset = PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
+            else if ( acc_bit(test, entry, nr_entry, 1) )
+                *pval |= (u64)entry->gentries[nr_entry].msi_ad[1] << 32;
+            else
+                goto out;
+        }
     }
-    else 
+    if ( offset == PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET )
     {
         const struct msi_desc *msi_desc = msixtbl_addr_to_desc(entry, address);
 
         if ( !msi_desc )
             goto out;
-        *pval = MASK_INSR(msi_desc->msi_attrib.guest_masked,
-                          PCI_MSIX_VECTOR_BITMASK);
+        if ( len == 4 )
+            *pval = MASK_INSR(msi_desc->msi_attrib.guest_masked,
+                              PCI_MSIX_VECTOR_BITMASK);
+        else
+            *pval |= (u64)MASK_INSR(msi_desc->msi_attrib.guest_masked,
+                                    PCI_MSIX_VECTOR_BITMASK) << 32;
     }
     
     r = X86EMUL_OKAY;
@@ -255,7 +268,7 @@ static int msixtbl_write(struct vcpu *v,
     unsigned long flags;
     struct irq_desc *desc;
 
-    if ( len != 4 || (address & 3) )
+    if ( (len != 4 && len != 8) || (address & (len - 1)) )
         return r;
 
     rcu_read_lock(&msixtbl_rcu_lock);
@@ -266,16 +279,23 @@ static int msixtbl_write(struct vcpu *v,
     nr_entry = (address - entry->gtable) / PCI_MSIX_ENTRY_SIZE;
 
     offset = address & (PCI_MSIX_ENTRY_SIZE - 1);
-    if ( offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET)
+    if ( offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET )
     {
+        index = offset / sizeof(uint32_t);
         if ( nr_entry < MAX_MSIX_ACC_ENTRIES ) 
         {
-            index = offset / sizeof(uint32_t);
             entry->gentries[nr_entry].msi_ad[index] = val;
             acc_bit(set, entry, nr_entry, index);
+            if ( len == 8 && !index )
+            {
+                entry->gentries[nr_entry].msi_ad[1] = val >> 32;
+                acc_bit(set, entry, nr_entry, 1);
+            }
         }
         set_bit(nr_entry, &entry->table_flags);
-        goto out;
+        if ( len != 8 || !index )
+            goto out;
+        val >>= 32;
     }
 
     /* Exit to device model when unmasking and address/data got modified. */
@@ -305,7 +325,8 @@ static int msixtbl_write(struct vcpu *v,
 
 unlock:
     spin_unlock_irqrestore(&desc->lock, flags);
-    r = X86EMUL_OKAY;
+    if ( len == 4 )
+        r = X86EMUL_OKAY;
 
 out:
     rcu_read_unlock(&msixtbl_rcu_lock);





^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 08/10] x86/MSI-X: use qword MMIO access for address writes
  2015-06-05 11:09 [PATCH v3 00/10] x86/MSI-X: XSA-120, 128..131 follow-up Jan Beulich
                   ` (6 preceding siblings ...)
  2015-06-05 11:24 ` [PATCH v3 07/10] x86/vMSI-X: support qword MMIO access Jan Beulich
@ 2015-06-05 11:25 ` Jan Beulich
  2015-06-05 15:37   ` Andrew Cooper
  2015-06-05 11:26 ` [PATCH v3 09/10] VT-d: use qword MMIO access for MSI " Jan Beulich
  2015-06-05 11:28 ` [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control Jan Beulich
  9 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-05 11:25 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 2172 bytes --]

Now that we support it for our guests, let's do so ourselves too.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -236,8 +236,7 @@ static bool_t read_msi_msg(struct msi_de
         if ( unlikely(!msix_memory_decoded(entry->dev,
                                            entry->msi_attrib.pos)) )
             return 0;
-        msg->address_lo = readl(base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
-        msg->address_hi = readl(base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
+        msg->address = readq(base + PCI_MSIX_ENTRY_ADDRESS_OFFSET);
         msg->data = readl(base + PCI_MSIX_ENTRY_DATA_OFFSET);
         break;
     }
@@ -302,10 +301,7 @@ static int write_msi_msg(struct msi_desc
         if ( unlikely(!msix_memory_decoded(entry->dev,
                                            entry->msi_attrib.pos)) )
             return -ENXIO;
-        writel(msg->address_lo,
-               base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
-        writel(msg->address_hi,
-               base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
+        writeq(msg->address, base + PCI_MSIX_ENTRY_ADDRESS_OFFSET);
         writel(msg->data, base + PCI_MSIX_ENTRY_DATA_OFFSET);
         break;
     }
--- a/xen/include/asm-x86/msi.h
+++ b/xen/include/asm-x86/msi.h
@@ -65,8 +65,13 @@ struct msi_info {
 };
 
 struct msi_msg {
-	u32	address_lo;	/* low 32 bits of msi message address */
-	u32	address_hi;	/* high 32 bits of msi message address */
+	union {
+		u64	address; /* message address */
+		struct {
+			u32	address_lo; /* message address low 32 bits */
+			u32	address_hi; /* message address high 32 bits */
+		};
+	};
 	u32	data;		/* 16 bits of msi message data */
 	u32	dest32;		/* used when Interrupt Remapping with EIM is enabled */
 };
--- a/xen/include/xen/pci_regs.h
+++ b/xen/include/xen/pci_regs.h
@@ -307,6 +307,7 @@
 #define  PCI_MSIX_BIRMASK	(7 << 0)
 
 #define PCI_MSIX_ENTRY_SIZE			16
+#define  PCI_MSIX_ENTRY_ADDRESS_OFFSET		0
 #define  PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET	0
 #define  PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET	4
 #define  PCI_MSIX_ENTRY_DATA_OFFSET		8
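
A standalone illustration of the msi_msg union introduced above: on
little-endian x86 the old 32-bit views and the new 64-bit view alias the
same storage, so existing users of address_lo/address_hi keep working
(userland sketch, not Xen code):

#include <assert.h>
#include <stdint.h>

struct msi_msg {
    union {
        uint64_t address;                            /* message address */
        struct { uint32_t address_lo, address_hi; }; /* 32-bit views */
    };
    uint32_t data;
};

int main(void)
{
    struct msi_msg msg = { .address = (1ULL << 32) | 0xfee00000ULL };

    assert(msg.address_lo == 0xfee00000u);
    assert(msg.address_hi == 1u);
    return 0;
}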





^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 09/10] VT-d: use qword MMIO access for MSI address writes
  2015-06-05 11:09 [PATCH v3 00/10] x86/MSI-X: XSA-120, 128..131 follow-up Jan Beulich
                   ` (7 preceding siblings ...)
  2015-06-05 11:25 ` [PATCH v3 08/10] x86/MSI-X: use qword MMIO access for address writes Jan Beulich
@ 2015-06-05 11:26 ` Jan Beulich
  2015-06-05 15:39   ` Andrew Cooper
  2015-06-11  7:45   ` Tian, Kevin
  2015-06-05 11:28 ` [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control Jan Beulich
  9 siblings, 2 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-05 11:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Yang Z Zhang, Kevin Tian

[-- Attachment #1: Type: text/plain, Size: 1717 bytes --]

Also make dmar_{read,write}q() actually do what their names suggest (we
don't need to be concerned of 32-bit restrictions anymore).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1054,8 +1054,7 @@ static void dma_msi_set_affinity(struct 
 
     spin_lock_irqsave(&iommu->register_lock, flags);
     dmar_writel(iommu->reg, DMAR_FEDATA_REG, msg.data);
-    dmar_writel(iommu->reg, DMAR_FEADDR_REG, msg.address_lo);
-    dmar_writel(iommu->reg, DMAR_FEUADDR_REG, msg.address_hi);
+    dmar_writeq(iommu->reg, DMAR_FEADDR_REG, msg.address);
     spin_unlock_irqrestore(&iommu->register_lock, flags);
 }
 
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -51,17 +51,10 @@
 #define    DMAR_IRTA_REG   0xB8    /* intr remap */
 
 #define OFFSET_STRIDE        (9)
-#define dmar_readl(dmar, reg) readl(dmar + reg)
-#define dmar_writel(dmar, reg, val) writel(val, dmar + reg)
-#define dmar_readq(dmar, reg) ({ \
-        u32 lo, hi; \
-        lo = dmar_readl(dmar, reg); \
-        hi = dmar_readl(dmar, reg + 4); \
-        (((u64) hi) << 32) + lo; })
-#define dmar_writeq(dmar, reg, val) do {\
-        dmar_writel(dmar, reg, (u32)val); \
-        dmar_writel(dmar, reg + 4, (u32)((u64) val >> 32)); \
-    } while (0)
+#define dmar_readl(dmar, reg) readl((dmar) + (reg))
+#define dmar_readq(dmar, reg) readq((dmar) + (reg))
+#define dmar_writel(dmar, reg, val) writel(val, (dmar) + (reg))
+#define dmar_writeq(dmar, reg, val) writeq(val, (dmar) + (reg))
 
 #define VER_MAJOR(v)        (((v) & 0xf0) >> 4)
 #define VER_MINOR(v)        ((v) & 0x0f)
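
Besides switching to readq()/writeq(), the rewritten macros also
parenthesize their arguments. A hypothetical expansion shows why that
matters (illustrative macro names):

#define dmar_readl_old(dmar, reg) readl(dmar + reg)
#define dmar_readl_new(dmar, reg) readl((dmar) + (reg))

/* dmar_readl_old(base, flag ? REG_A : REG_B)
 *   -> readl(base + flag ? REG_A : REG_B)
 *   == readl((base + flag) ? REG_A : REG_B)    -- wrong: '+' binds
 *                                                 tighter than '?:'
 * dmar_readl_new(base, flag ? REG_A : REG_B)
 *   -> readl((base) + (flag ? REG_A : REG_B))  -- as intended
 */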




[-- Attachment #2: VT-d-MMIO-qword.patch --]
[-- Type: text/plain, Size: 1765 bytes --]

VT-d: use qword MMIO access for MSI address writes

Also make dmar_{read,write}q() actually do what their names suggest (we
don#t need to be concerned of 32-bit restrictions anymore).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1054,8 +1054,7 @@ static void dma_msi_set_affinity(struct 
 
     spin_lock_irqsave(&iommu->register_lock, flags);
     dmar_writel(iommu->reg, DMAR_FEDATA_REG, msg.data);
-    dmar_writel(iommu->reg, DMAR_FEADDR_REG, msg.address_lo);
-    dmar_writel(iommu->reg, DMAR_FEUADDR_REG, msg.address_hi);
+    dmar_writeq(iommu->reg, DMAR_FEADDR_REG, msg.address);
     spin_unlock_irqrestore(&iommu->register_lock, flags);
 }
 
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -51,17 +51,10 @@
 #define    DMAR_IRTA_REG   0xB8    /* intr remap */
 
 #define OFFSET_STRIDE        (9)
-#define dmar_readl(dmar, reg) readl(dmar + reg)
-#define dmar_writel(dmar, reg, val) writel(val, dmar + reg)
-#define dmar_readq(dmar, reg) ({ \
-        u32 lo, hi; \
-        lo = dmar_readl(dmar, reg); \
-        hi = dmar_readl(dmar, reg + 4); \
-        (((u64) hi) << 32) + lo; })
-#define dmar_writeq(dmar, reg, val) do {\
-        dmar_writel(dmar, reg, (u32)val); \
-        dmar_writel(dmar, reg + 4, (u32)((u64) val >> 32)); \
-    } while (0)
+#define dmar_readl(dmar, reg) readl((dmar) + (reg))
+#define dmar_readq(dmar, reg) readq((dmar) + (reg))
+#define dmar_writel(dmar, reg, val) writel(val, (dmar) + (reg))
+#define dmar_writeq(dmar, reg, val) writeq(val, (dmar) + (reg))
 
 #define VER_MAJOR(v)        (((v) & 0xf0) >> 4)
 #define VER_MINOR(v)        ((v) & 0x0f)


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-05 11:09 [PATCH v3 00/10] x86/MSI-X: XSA-120, 128..131 follow-up Jan Beulich
                   ` (8 preceding siblings ...)
  2015-06-05 11:26 ` [PATCH v3 09/10] VT-d: use qword MMIO access for MSI " Jan Beulich
@ 2015-06-05 11:28 ` Jan Beulich
  2015-06-05 15:57   ` Andrew Cooper
  2015-06-11  8:35   ` Jan Beulich
  9 siblings, 2 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-05 11:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Ian Jackson,
	Ian Campbell, Keir Fraser, dgdegra

[-- Attachment #1: Type: text/plain, Size: 14153 bytes --]

Qemu shouldn't be fiddling with this bit directly, as the hypervisor
may (and now does) use it for its own purposes. Provide it with a
replacement interface, allowing the hypervisor to track host and guest
masking intentions independently (clearing the bit only when both want
it clear).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Whether the permission check should really be an XSM_TARGET one needs
to be determined: That allowing the guest to issue the hypercalls on
itself means permitting it to bypass the device model, and thus render
device model state stale.

--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1807,6 +1807,17 @@ int xc_physdev_unmap_pirq(xc_interface *
                           int domid,
                           int pirq);
 
+int xc_physdev_msix_enable(xc_interface *xch,
+                           int segment,
+                           int bus,
+                           int devfn,
+                           int on);
+int xc_physdev_msix_mask_all(xc_interface *xch,
+                             int segment,
+                             int bus,
+                             int devfn,
+                             int mask);
+
 int xc_hvm_set_pci_intx_level(
     xc_interface *xch, domid_t dom,
     uint8_t domain, uint8_t bus, uint8_t device, uint8_t intx,
--- a/tools/libxc/xc_physdev.c
+++ b/tools/libxc/xc_physdev.c
@@ -112,3 +112,38 @@ int xc_physdev_unmap_pirq(xc_interface *
     return rc;
 }
 
+int xc_physdev_msix_enable(xc_interface *xch,
+                           int segment,
+                           int bus,
+                           int devfn,
+                           int on)
+{
+    struct physdev_pci_device dev = {
+        .seg = segment,
+        .bus = bus,
+        .devfn = devfn
+    };
+
+    return do_physdev_op(xch,
+                         on ? PHYSDEVOP_msix_enable
+                            : PHYSDEVOP_msix_disable,
+                         &dev, sizeof(dev));
+}
+
+int xc_physdev_msix_mask_all(xc_interface *xch,
+                             int segment,
+                             int bus,
+                             int devfn,
+                             int mask)
+{
+    struct physdev_pci_device dev = {
+        .seg = segment,
+        .bus = bus,
+        .devfn = devfn
+    };
+
+    return do_physdev_op(xch,
+                         mask ? PHYSDEVOP_msix_mask_all
+                              : PHYSDEVOP_msix_unmask_all,
+                         &dev, sizeof(dev));
+}
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -392,7 +392,7 @@ static bool_t msi_set_mask_bit(struct ir
     struct pci_dev *pdev;
     u16 seg, control;
     u8 bus, slot, func;
-    bool_t flag = host || guest;
+    bool_t flag = host || guest, maskall;
 
     ASSERT(spin_is_locked(&desc->lock));
     BUG_ON(!entry || !entry->dev);
@@ -415,13 +415,17 @@ static bool_t msi_set_mask_bit(struct ir
         }
         break;
     case PCI_CAP_ID_MSIX:
+        maskall = pdev->msix->host_maskall;
         control = pci_conf_read16(seg, bus, slot, func,
                                   msix_control_reg(entry->msi_attrib.pos));
         if ( unlikely(!(control & PCI_MSIX_FLAGS_ENABLE)) )
+        {
+            pdev->msix->host_maskall = 1;
             pci_conf_write16(seg, bus, slot, func,
                              msix_control_reg(entry->msi_attrib.pos),
                              control | (PCI_MSIX_FLAGS_ENABLE |
                                         PCI_MSIX_FLAGS_MASKALL));
+        }
         if ( likely(memory_decoded(pdev)) )
         {
             writel(flag, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
@@ -434,7 +438,7 @@ static bool_t msi_set_mask_bit(struct ir
         {
             domid_t domid = pdev->domain->domain_id;
 
-            control |= PCI_MSIX_FLAGS_MASKALL;
+            maskall = 1;
             if ( pdev->msix->warned != domid )
             {
                 pdev->msix->warned = domid;
@@ -444,6 +448,9 @@ static bool_t msi_set_mask_bit(struct ir
                        PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
             }
         }
+        pdev->msix->host_maskall = maskall;
+        if ( maskall || pdev->msix->guest_maskall )
+            control |= PCI_MSIX_FLAGS_MASKALL;
         pci_conf_write16(seg, bus, slot, func,
                          msix_control_reg(entry->msi_attrib.pos), control);
         return flag;
@@ -840,6 +847,7 @@ static int msix_capability_init(struct p
     u8 bus = dev->bus;
     u8 slot = PCI_SLOT(dev->devfn);
     u8 func = PCI_FUNC(dev->devfn);
+    bool_t maskall = msix->host_maskall;
 
     ASSERT(spin_is_locked(&pcidevs_lock));
 
@@ -850,6 +858,7 @@ static int msix_capability_init(struct p
      * to mask all the vectors to prevent interrupts coming in before they're
      * fully set up.
      */
+    msix->host_maskall = 1;
     pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos),
                      control | (PCI_MSIX_FLAGS_ENABLE |
                                 PCI_MSIX_FLAGS_MASKALL));
@@ -972,6 +981,10 @@ static int msix_capability_init(struct p
 
     if ( !msix->used_entries )
     {
+        maskall = 0;
+        msix->guest_maskall = 0;
+        control &= ~PCI_MSIX_FLAGS_MASKALL;
+
         if ( rangeset_add_range(mmio_ro_ranges, msix->table.first,
                                 msix->table.last) )
             WARN();
@@ -1002,6 +1015,7 @@ static int msix_capability_init(struct p
     ++msix->used_entries;
 
     /* Restore MSI-X enabled bits */
+    msix->host_maskall = maskall;
     pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos), control);
 
     return 0;
@@ -1142,11 +1156,15 @@ static void __pci_disable_msix(struct ms
                                            PCI_CAP_ID_MSIX);
     u16 control = pci_conf_read16(seg, bus, slot, func,
                                   msix_control_reg(entry->msi_attrib.pos));
+    bool_t maskall = dev->msix->host_maskall;
 
     if ( unlikely(!(control & PCI_MSIX_FLAGS_ENABLE)) )
+    {
+        dev->msix->host_maskall = 1;
         pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos),
                          control | (PCI_MSIX_FLAGS_ENABLE |
                                     PCI_MSIX_FLAGS_MASKALL));
+    }
 
     BUG_ON(list_empty(&dev->msi_list));
 
@@ -1158,8 +1176,11 @@ static void __pci_disable_msix(struct ms
                "cannot disable IRQ %d: masking MSI-X on %04x:%02x:%02x.%u\n",
                entry->irq, dev->seg, dev->bus,
                PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
-        control |= PCI_MSIX_FLAGS_MASKALL;
+        maskall = 1;
     }
+    dev->msix->host_maskall = maskall;
+    if ( maskall || dev->msix->guest_maskall )
+        control |= PCI_MSIX_FLAGS_MASKALL;
     pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos), control);
 
     _pci_cleanup_msix(dev->msix);
@@ -1203,6 +1224,62 @@ int pci_prepare_msix(u16 seg, u8 bus, u8
     return rc;
 }
 
+int pci_msix_enable(u16 seg, u8 bus, u8 devfn, bool_t on)
+{
+    int rc;
+    struct pci_dev *pdev;
+
+    if ( !use_msi )
+        return -EOPNOTSUPP;
+
+    spin_lock(&pcidevs_lock);
+    pdev = pci_get_pdev(seg, bus, devfn);
+    if ( !pdev || !pdev->msix || !pdev->domain )
+        rc = -ENODEV;
+    else if ( !is_hvm_domain(pdev->domain) )
+        rc = -ENXIO;
+    else if ( (rc = xsm_manage_domain_pirq(XSM_TARGET, pdev->domain)) == 0 )
+        msix_set_enable(pdev, on);
+    spin_unlock(&pcidevs_lock);
+
+    return rc;
+}
+
+int pci_msix_maskall(u16 seg, u8 bus, u8 devfn, bool_t mask)
+{
+    int rc;
+    struct pci_dev *pdev;
+    u8 slot = PCI_SLOT(devfn), func = PCI_FUNC(devfn);
+
+    if ( !use_msi )
+        return -EOPNOTSUPP;
+
+    spin_lock(&pcidevs_lock);
+    pdev = pci_get_pdev(seg, bus, devfn);
+    if ( !pdev || !pdev->msix || !pdev->domain )
+        rc = -ENODEV;
+    else if ( !is_hvm_domain(pdev->domain) )
+        rc = -ENXIO;
+    else if ( (rc = xsm_manage_domain_pirq(XSM_TARGET, pdev->domain)) == 0 )
+    {
+        unsigned int pos = pci_find_cap_offset(seg, bus, slot, func,
+                                               PCI_CAP_ID_MSIX);
+        u16 control = pci_conf_read16(seg, bus, slot, func,
+                                      msix_control_reg(pos));
+
+        BUG_ON(!pos);
+        pdev->msix->guest_maskall = mask;
+        if ( pdev->msix->host_maskall || mask )
+            control |= PCI_MSIX_FLAGS_MASKALL;
+        else
+            control &= ~PCI_MSIX_FLAGS_MASKALL;
+        pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos), control);
+    }
+    spin_unlock(&pcidevs_lock);
+
+    return rc;
+}
+
 /*
  * Notice: only construct the msi_desc
  * no change to irq_desc here, and the interrupt is masked
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -665,6 +665,30 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
         break;
     }
 
+    case PHYSDEVOP_msix_enable:
+    case PHYSDEVOP_msix_disable: {
+        struct physdev_pci_device dev;
+
+        if ( copy_from_guest(&dev, arg, 1) )
+            ret = -EFAULT;
+        else
+            ret = pci_msix_enable(dev.seg, dev.bus, dev.devfn,
+                                  cmd == PHYSDEVOP_msix_enable);
+        break;
+    }
+
+    case PHYSDEVOP_msix_mask_all:
+    case PHYSDEVOP_msix_unmask_all: {
+        struct physdev_pci_device dev;
+
+        if ( copy_from_guest(&dev, arg, 1) )
+            ret = -EFAULT;
+        else
+            ret = pci_msix_maskall(dev.seg, dev.bus, dev.devfn,
+                                   cmd == PHYSDEVOP_msix_mask_all);
+        break;
+    }
+
     case PHYSDEVOP_pci_mmcfg_reserved: {
         struct physdev_pci_mmcfg_reserved info;
 
--- a/xen/include/asm-x86/msi.h
+++ b/xen/include/asm-x86/msi.h
@@ -83,6 +83,8 @@ struct msi_desc;
 extern int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc);
 extern void pci_disable_msi(struct msi_desc *desc);
 extern int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool_t off);
+extern int pci_msix_enable(u16 seg, u8 bus, u8 devfn, bool_t on);
+extern int pci_msix_maskall(u16 seg, u8 bus, u8 devfn, bool_t mask);
 extern void pci_cleanup_msi(struct pci_dev *pdev);
 extern int setup_msi_irq(struct irq_desc *, struct msi_desc *);
 extern int __setup_msi_irq(struct irq_desc *, struct msi_desc *,
@@ -233,6 +235,7 @@ struct arch_msix {
     int table_refcnt[MAX_MSIX_TABLE_PAGES];
     int table_idx[MAX_MSIX_TABLE_PAGES];
     spinlock_t table_lock;
+    bool_t host_maskall, guest_maskall;
     domid_t warned;
 };
 
--- a/xen/include/public/physdev.h
+++ b/xen/include/public/physdev.h
@@ -315,6 +315,14 @@ DEFINE_XEN_GUEST_HANDLE(physdev_pci_devi
  */
 #define PHYSDEVOP_prepare_msix          30
 #define PHYSDEVOP_release_msix          31
+/*
+ * The device model domain for a guest should be using these instead of
+ * fiddling with the respective flags in the MSI-X capability structure.
+ */
+#define PHYSDEVOP_msix_enable           32
+#define PHYSDEVOP_msix_disable          33
+#define PHYSDEVOP_msix_mask_all         34
+#define PHYSDEVOP_msix_unmask_all       35
 struct physdev_pci_device {
     /* IN */
     uint16_t seg;
--- a/xen/include/xsm/dummy.h
+++ b/xen/include/xsm/dummy.h
@@ -462,6 +462,12 @@ static XSM_INLINE int xsm_map_domain_irq
     return xsm_default_action(action, current->domain, d);
 }
 
+static XSM_INLINE int xsm_manage_domain_pirq(XSM_DEFAULT_ARG struct domain *d)
+{
+    XSM_ASSERT_ACTION(XSM_TARGET);
+    return xsm_default_action(action, current->domain, d);
+}
+
 static XSM_INLINE int xsm_unmap_domain_pirq(XSM_DEFAULT_ARG struct domain *d)
 {
     XSM_ASSERT_ACTION(XSM_TARGET);
--- a/xen/include/xsm/xsm.h
+++ b/xen/include/xsm/xsm.h
@@ -105,6 +105,7 @@ struct xsm_operations {
     char *(*show_irq_sid) (int irq);
     int (*map_domain_pirq) (struct domain *d);
     int (*map_domain_irq) (struct domain *d, int irq, void *data);
+    int (*manage_domain_pirq) (struct domain *d);
     int (*unmap_domain_pirq) (struct domain *d);
     int (*unmap_domain_irq) (struct domain *d, int irq, void *data);
     int (*bind_pt_irq) (struct domain *d, struct xen_domctl_bind_pt_irq *bind);
@@ -424,6 +425,11 @@ static inline int xsm_map_domain_irq (xs
     return xsm_ops->map_domain_irq(d, irq, data);
 }
 
+static inline int xsm_manage_domain_pirq(xsm_default_t def, struct domain *d)
+{
+    return xsm_ops->manage_domain_pirq(d);
+}
+
 static inline int xsm_unmap_domain_pirq (xsm_default_t def, struct domain *d)
 {
     return xsm_ops->unmap_domain_pirq(d);
--- a/xen/xsm/dummy.c
+++ b/xen/xsm/dummy.c
@@ -79,6 +79,7 @@ void xsm_fixup_ops (struct xsm_operation
     set_to_dummy_if_null(ops, show_irq_sid);
     set_to_dummy_if_null(ops, map_domain_pirq);
     set_to_dummy_if_null(ops, map_domain_irq);
+    set_to_dummy_if_null(ops, manage_domain_pirq);
     set_to_dummy_if_null(ops, unmap_domain_pirq);
     set_to_dummy_if_null(ops, unmap_domain_irq);
     set_to_dummy_if_null(ops, bind_pt_irq);
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -881,6 +881,11 @@ static int flask_map_domain_irq (struct 
     return rc;
 }
 
+static int flask_manage_domain_pirq(struct domain *d)
+{
+    return current_has_perm(d, SECCLASS_RESOURCE, RESOURCE__USE);
+}
+
 static int flask_unmap_domain_pirq (struct domain *d)
 {
     return current_has_perm(d, SECCLASS_RESOURCE, RESOURCE__REMOVE);
@@ -1633,6 +1638,7 @@ static struct xsm_operations flask_ops =
 
     .map_domain_pirq = flask_map_domain_pirq,
     .map_domain_irq = flask_map_domain_irq,
+    .manage_domain_pirq = flask_manage_domain_pirq,
     .unmap_domain_pirq = flask_unmap_domain_pirq,
     .unmap_domain_irq = flask_unmap_domain_irq,
     .bind_pt_irq = flask_bind_pt_irq,
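
Distilled, the masking policy the hunks above implement is that the
physical mask-all bit is the OR of the two tracked intentions, cleared
only when both sides want it clear. A minimal self-contained sketch (the
helper name and the standalone MASKALL definition are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define PCI_MSIX_FLAGS_MASKALL (1u << 14) /* function mask, as in pci_regs.h */

static uint16_t merged_control(uint16_t control, bool host_maskall,
                               bool guest_maskall)
{
    if ( host_maskall || guest_maskall )
        control |= PCI_MSIX_FLAGS_MASKALL;
    else
        control &= ~PCI_MSIX_FLAGS_MASKALL;
    return control;
}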

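A hypothetical device model caller of the new libxc wrappers could look
like this (the xch handle and the 0000:03:00.0 device address are
assumptions for illustration):

#include <stdio.h>
#include <xenctrl.h>

/* Enable MSI-X and clear the guest's mask-all intent for the device;
 * Xen still ORs in its own (host) masking intent. */
static void msix_setup_example(xc_interface *xch)
{
    if ( xc_physdev_msix_enable(xch, 0, 3, 0x00, 1) )
        fprintf(stderr, "cannot enable MSI-X\n");

    if ( xc_physdev_msix_mask_all(xch, 0, 3, 0x00, 0) )
        fprintf(stderr, "cannot clear guest mask-all\n");
}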



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 02/10] x86/MSI-X: access MSI-X table only after having enabled MSI-X
  2015-06-05 11:20 ` [PATCH v3 02/10] x86/MSI-X: access MSI-X table only after having enabled MSI-X Jan Beulich
@ 2015-06-05 13:01   ` Andrew Cooper
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2015-06-05 13:01 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Keir Fraser

On 05/06/15 12:20, Jan Beulich wrote:
> As done in Linux by f598282f51 ("PCI: Fix the NIU MSI-X problem in a
> better way") and its broken predecessor, make sure we don't access the
> MSI-X table without having enabled MSI-X first, using the mask-all flag
> instead to prevent interrupts from occurring.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>, with one suggestion.

> @@ -401,35 +414,38 @@ static bool_t msi_set_mask_bit(struct ir
>          }
>          break;
>      case PCI_CAP_ID_MSIX:
> +        control = pci_conf_read16(seg, bus, slot, func,
> +                                  msix_control_reg(entry->msi_attrib.pos));
> +        if ( unlikely(!(control & PCI_MSIX_FLAGS_ENABLE)) )
> +            pci_conf_write16(seg, bus, slot, func,
> +                             msix_control_reg(entry->msi_attrib.pos),
> +                             control | (PCI_MSIX_FLAGS_ENABLE |
> +                                        PCI_MSIX_FLAGS_MASKALL));
>          if ( likely(memory_decoded(pdev)) )
>          {
>              writel(flag, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
>              readl(entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
> -            break;
> +            if ( likely(control & PCI_MSIX_FLAGS_ENABLE) )
> +                break;
> +            flag = 1;
>          }
> -        if ( flag )
> +        else if ( flag && !(control & PCI_MSIX_FLAGS_MASKALL) )
>          {
> -            u16 control;
>              domid_t domid = pdev->domain->domain_id;
>  
> -            control = pci_conf_read16(seg, bus, slot, func,
> -                                      msix_control_reg(entry->msi_attrib.pos));
> -            if ( control & PCI_MSIX_FLAGS_MASKALL )
> -                break;
> -            pci_conf_write16(seg, bus, slot, func,
> -                             msix_control_reg(entry->msi_attrib.pos),
> -                             control | PCI_MSIX_FLAGS_MASKALL);
> +            control |= PCI_MSIX_FLAGS_MASKALL;
>              if ( pdev->msix->warned != domid )
>              {
>                  pdev->msix->warned = domid;
>                  printk(XENLOG_G_WARNING
> -                       "cannot mask IRQ %d: masked MSI-X on Dom%d's %04x:%02x:%02x.%u\n",
> +                       "cannot mask IRQ %d: masking MSI-X on Dom%d's %04x:%02x:%02x.%u\n",

"masking all", which is a more clear statement.

~Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 05/10] x86/MSI: track host and guest masking separately
  2015-06-05 11:22 ` [PATCH v3 05/10] x86/MSI: track host and guest masking separately Jan Beulich
@ 2015-06-05 13:05   ` Andrew Cooper
  2016-04-01  7:40   ` Li, Liang Z
  1 sibling, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2015-06-05 13:05 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Keir Fraser

On 05/06/15 12:22, Jan Beulich wrote:
> In particular we want to avoid losing track of our own intention to
> have an entry masked. Physical unmasking now happens only when both
> host and guest requested so.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

I really need to dust off my HPET/MSI series and get it submitted.

~Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 06/10] x86/vMSI-X: cleanup
  2015-06-05 11:23 ` [PATCH v3 06/10] x86/vMSI-X: cleanup Jan Beulich
@ 2015-06-05 13:07   ` Andrew Cooper
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2015-06-05 13:07 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Keir Fraser

On 05/06/15 12:23, Jan Beulich wrote:
> Fold msixtbl_addr_to_virt() + virt_to_msi_desc() into simplified
> msixtbl_addr_to_desc(), as the callers don't need the virtual address
> anymore.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 07/10] x86/vMSI-X: support qword MMIO access
  2015-06-05 11:24 ` [PATCH v3 07/10] x86/vMSI-X: support qword MMIO access Jan Beulich
@ 2015-06-05 15:34   ` Andrew Cooper
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2015-06-05 15:34 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Keir Fraser

On 05/06/15 12:24, Jan Beulich wrote:
> The specification explicitly provides for this, so we should have
> supported this from the beginning.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 08/10] x86/MSI-X: use qword MMIO access for address writes
  2015-06-05 11:25 ` [PATCH v3 08/10] x86/MSI-X: use qword MMIO access for address writes Jan Beulich
@ 2015-06-05 15:37   ` Andrew Cooper
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2015-06-05 15:37 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Keir Fraser

On 05/06/15 12:25, Jan Beulich wrote:
> Now that we support it for our guests, let's do so ourselves too.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 09/10] VT-d: use qword MMIO access for MSI address writes
  2015-06-05 11:26 ` [PATCH v3 09/10] VT-d: use qword MMIO access for MSI " Jan Beulich
@ 2015-06-05 15:39   ` Andrew Cooper
  2015-06-05 15:46     ` Jan Beulich
  2015-06-11  7:45   ` Tian, Kevin
  1 sibling, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2015-06-05 15:39 UTC (permalink / raw)
  To: xen-devel

On 05/06/15 12:26, Jan Beulich wrote:
> Also make dmar_{read,write}q() actually do what their names suggest (we
> don't need to be concerned of 32-bit restrictions anymore).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Your patch has a typo "don#t" which isn't present in this commit message.

Otherwise, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 09/10] VT-d: use qword MMIO access for MSI address writes
  2015-06-05 15:39   ` Andrew Cooper
@ 2015-06-05 15:46     ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-05 15:46 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 05.06.15 at 17:39, <andrew.cooper3@citrix.com> wrote:
> On 05/06/15 12:26, Jan Beulich wrote:
>> Also make dmar_{read,write}q() actually do what their names suggest (we
>> don't need to be concerned of 32-bit restrictions anymore).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Your patch has a typo "don#t" which isn't present in this commit message.

Already fixed; apparently the mail client fetches attachments when
you attach them, not when the mail gets sent.

> Otherwise, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks, Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-05 11:28 ` [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control Jan Beulich
@ 2015-06-05 15:57   ` Andrew Cooper
  2015-06-05 16:17     ` Jan Beulich
  2015-06-11  8:35   ` Jan Beulich
  1 sibling, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2015-06-05 15:57 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Keir Fraser, Stefano Stabellini, Ian Jackson, Ian Campbell,
	Wei Liu, dgdegra

On 05/06/15 12:28, Jan Beulich wrote:
> Qemu shouldn't be fiddling with this bit directly, as the hypervisor
> may (and now does) use it for its own purposes. Provide it with a
> replacement interface, allowing the hypervisor to track host and guest
> masking intentions independently (clearing the bit only when both want
> it clear).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Whether the permission check should really be an XSM_TARGET one needs
> to be determined: That allowing the guest to issue the hypercalls on
> itself means permitting it to bypass the device model, and thus render
> device model state stale.

I concur.

On the other hand, there should be nothing the guest could do with
access to these hypercalls which would damage Xen itself.

>
> --- a/tools/libxc/include/xenctrl.h
> +++ b/tools/libxc/include/xenctrl.h
> @@ -1807,6 +1807,17 @@ int xc_physdev_unmap_pirq(xc_interface *
>                            int domid,
>                            int pirq);
>  
> +int xc_physdev_msix_enable(xc_interface *xch,
> +                           int segment,
> +                           int bus,
> +                           int devfn,
> +                           int on);
> +int xc_physdev_msix_mask_all(xc_interface *xch,
> +                             int segment,
> +                             int bus,
> +                             int devfn,
> +                             int mask);
> +
>  int xc_hvm_set_pci_intx_level(
>      xc_interface *xch, domid_t dom,
>      uint8_t domain, uint8_t bus, uint8_t device, uint8_t intx,
> --- a/tools/libxc/xc_physdev.c
> +++ b/tools/libxc/xc_physdev.c
> @@ -112,3 +112,38 @@ int xc_physdev_unmap_pirq(xc_interface *
>      return rc;
>  }
>  
> +int xc_physdev_msix_enable(xc_interface *xch,
> +                           int segment,
> +                           int bus,
> +                           int devfn,
> +                           int on)
> +{
> +    struct physdev_pci_device dev = {
> +        .seg = segment,
> +        .bus = bus,
> +        .devfn = devfn
> +    };
> +
> +    return do_physdev_op(xch,
> +                         on ? PHYSDEVOP_msix_enable
> +                            : PHYSDEVOP_msix_disable,
> +                         &dev, sizeof(dev));
> +}

xc_physdev_msix_enable(xch, X, Y, Z, 0) is slightly misleading to read...

> +
> +int xc_physdev_msix_mask_all(xc_interface *xch,
> +                             int segment,
> +                             int bus,
> +                             int devfn,
> +                             int mask)
> +{
> +    struct physdev_pci_device dev = {
> +        .seg = segment,
> +        .bus = bus,
> +        .devfn = devfn
> +    };
> +
> +    return do_physdev_op(xch,
> +                         mask ? PHYSDEVOP_msix_mask_all
> +                              : PHYSDEVOP_msix_unmask_all,
> +                         &dev, sizeof(dev));
> +}

And equally xc_physdev_msix_mask_all(xch, X, Y, Z, 0).

I think it would be cleaner to have something like

xc_physdev_msix_manage(xch, X, Y, Z, PHYSDEVOP_msix_XXX)

which results in far more readable client code.
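
A sketch of such a consolidated helper, following the shape of the
wrappers in the patch (hypothetical, not part of the series):

int xc_physdev_msix_manage(xc_interface *xch, int segment, int bus,
                           int devfn, unsigned int op /* PHYSDEVOP_msix_* */)
{
    struct physdev_pci_device dev = {
        .seg = segment,
        .bus = bus,
        .devfn = devfn
    };

    return do_physdev_op(xch, op, &dev, sizeof(dev));
}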

Otherwise, the rest looks fine.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-05 15:57   ` Andrew Cooper
@ 2015-06-05 16:17     ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-05 16:17 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Stefano Stabellini, Ian Jackson, Ian Campbell,
	xen-devel, dgdegra, Keir Fraser

>>> On 05.06.15 at 17:57, <andrew.cooper3@citrix.com> wrote:
> On 05/06/15 12:28, Jan Beulich wrote:
>> Qemu shouldn't be fiddling with this bit directly, as the hypervisor
>> may (and now does) use it for its own purposes. Provide it with a
>> replacement interface, allowing the hypervisor to track host and guest
>> masking intentions independently (clearing the bit only when both want
>> it clear).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> Whether the permission check should really be an XSM_TARGET one needs
>> to be determined: That allowing the guest to issue the hypercalls on
>> itself means permitting it to bypass the device model, and thus render
>> device model state stale.
> 
> I concur.
> 
> On the other hand, there should be nothing the guest could do with
> access to these hypercalls which would damage Xen itself.

That's exactly the reason I have it as it is now.

>> --- a/tools/libxc/xc_physdev.c
>> +++ b/tools/libxc/xc_physdev.c
>> @@ -112,3 +112,38 @@ int xc_physdev_unmap_pirq(xc_interface *
>>      return rc;
>>  }
>>  
>> +int xc_physdev_msix_enable(xc_interface *xch,
>> +                           int segment,
>> +                           int bus,
>> +                           int devfn,
>> +                           int on)
>> +{
>> +    struct physdev_pci_device dev = {
>> +        .seg = segment,
>> +        .bus = bus,
>> +        .devfn = devfn
>> +    };
>> +
>> +    return do_physdev_op(xch,
>> +                         on ? PHYSDEVOP_msix_enable
>> +                            : PHYSDEVOP_msix_disable,
>> +                         &dev, sizeof(dev));
>> +}
> 
> xc_physdev_misx_enable(xch, X, Y, Z, 0) is slightly misleading to read...
> 
>> +
>> +int xc_physdev_msix_mask_all(xc_interface *xch,
>> +                             int segment,
>> +                             int bus,
>> +                             int devfn,
>> +                             int mask)
>> +{
>> +    struct physdev_pci_device dev = {
>> +        .seg = segment,
>> +        .bus = bus,
>> +        .devfn = devfn
>> +    };
>> +
>> +    return do_physdev_op(xch,
>> +                         mask ? PHYSDEVOP_msix_mask_all
>> +                              : PHYSDEVOP_msix_unmask_all,
>> +                         &dev, sizeof(dev));
>> +}
> 
> And equally xc_physdev_misx_mask_all(xch, X, Y, Z, 0).
> 
> I think it would be cleaner to have something like
> 
> xc_physdev_msix_manage(xch, X, Y, Z, PHYSDEVOP_msix_XXX)
> 
> which results in far more readable client code.

But that would require callers to include xen/physdev.h, which I
think they otherwise wouldn't have to. Anyway, I'll wait for what the
tool stack and qemu maintainers say.

Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 09/10] VT-d: use qword MMIO access for MSI address writes
  2015-06-05 11:26 ` [PATCH v3 09/10] VT-d: use qword MMIO access for MSI " Jan Beulich
  2015-06-05 15:39   ` Andrew Cooper
@ 2015-06-11  7:45   ` Tian, Kevin
  1 sibling, 0 replies; 40+ messages in thread
From: Tian, Kevin @ 2015-06-11  7:45 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Zhang, Yang Z

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, June 05, 2015 7:26 PM
> 
> Also make dmar_{read,write}q() actually do what their names suggest (we
> don't need to be concerned of 32-bit restrictions anymore).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Kevin Tian <kevin.tian@intel.com>

> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -1054,8 +1054,7 @@ static void dma_msi_set_affinity(struct
> 
>      spin_lock_irqsave(&iommu->register_lock, flags);
>      dmar_writel(iommu->reg, DMAR_FEDATA_REG, msg.data);
> -    dmar_writel(iommu->reg, DMAR_FEADDR_REG, msg.address_lo);
> -    dmar_writel(iommu->reg, DMAR_FEUADDR_REG, msg.address_hi);
> +    dmar_writeq(iommu->reg, DMAR_FEADDR_REG, msg.address);
>      spin_unlock_irqrestore(&iommu->register_lock, flags);
>  }
> 
> --- a/xen/drivers/passthrough/vtd/iommu.h
> +++ b/xen/drivers/passthrough/vtd/iommu.h
> @@ -51,17 +51,10 @@
>  #define    DMAR_IRTA_REG   0xB8    /* intr remap */
> 
>  #define OFFSET_STRIDE        (9)
> -#define dmar_readl(dmar, reg) readl(dmar + reg)
> -#define dmar_writel(dmar, reg, val) writel(val, dmar + reg)
> -#define dmar_readq(dmar, reg) ({ \
> -        u32 lo, hi; \
> -        lo = dmar_readl(dmar, reg); \
> -        hi = dmar_readl(dmar, reg + 4); \
> -        (((u64) hi) << 32) + lo; })
> -#define dmar_writeq(dmar, reg, val) do {\
> -        dmar_writel(dmar, reg, (u32)val); \
> -        dmar_writel(dmar, reg + 4, (u32)((u64) val >> 32)); \
> -    } while (0)
> +#define dmar_readl(dmar, reg) readl((dmar) + (reg))
> +#define dmar_readq(dmar, reg) readq((dmar) + (reg))
> +#define dmar_writel(dmar, reg, val) writel(val, (dmar) + (reg))
> +#define dmar_writeq(dmar, reg, val) writeq(val, (dmar) + (reg))
> 
>  #define VER_MAJOR(v)        (((v) & 0xf0) >> 4)
>  #define VER_MINOR(v)        ((v) & 0x0f)
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-05 11:28 ` [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control Jan Beulich
  2015-06-05 15:57   ` Andrew Cooper
@ 2015-06-11  8:35   ` Jan Beulich
  2015-06-11  9:51     ` Andrew Cooper
  2015-06-12 13:21     ` Konrad Rzeszutek Wilk
  1 sibling, 2 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-11  8:35 UTC (permalink / raw)
  To: Andrew Cooper, Roger Pau Monne, Wei Liu, Ian Campbell,
	Ian Jackson, Stefano Stabellini, Konrad Rzeszutek Wilk
  Cc: xen-devel, dgdegra, Keir Fraser

>>> On 05.06.15 at 13:28, <JBeulich@suse.com> wrote:
> Qemu shouldn't be fiddling with this bit directly, as the hypervisor
> may (and now does) use it for its own purposes. Provide it with a
> replacement interface, allowing the hypervisor to track host and guest
> masking intentions independently (clearing the bit only when both want
> it clear).

Originally I merely meant to ping the tools side changes here
(considering that the original issue has been pending for months,
delayed by various security issues as well as slow turnaround on
understanding the nature and validity of that original issue, I'd
_really_ like to see this go in now), but thinking about it once
again overnight I realized that what we do here to allow qemu
to be fixed would then also be made use of by the kernels
running pciback: While Dom0 fiddling with the MSI-X mask-all bit
for its own purposes is at least not a security problem, it doing
so on behalf of (and directed by) a guest would be as soon as
the hypervisor side patches making use of that bit went in.

While I continue to be of the opinion that all direct writes to
interrupt masking bits (MSI-X mask-all, MSI-X per-entry mask,
MSI per entry mask) outside of the hypervisor are wrong and
should be eliminated, the scope of the problem now clearly
going beyond qemu made me reconsider whether we shouldn't,
as advocated by Stefano, follow the trap-and-emulate route
instead. This would not only mean adding code to x86's existing
port CF8/CFC intercepts, but also write-protecting the MMCFG
pages for all PCI devices being MSI or MSI-X capable, emulating
writes with inspection / modification of writes to any of the mask
bits located in PCI config space. (A subsequent optimization to
this may then be a hypercall to do config space writes,
eliminating the emulation overhead, accompanied by a bitmap
indicating which devices' CFG space can be written directly.)
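
Just to give an idea of the shape of that, here is a sketch I made up
for illustration only - find_cap_pos() stands in for whatever
capability lookup we'd end up using, and sub-dword port CFC offsets
are ignored for brevity:

static bool_t cfc_write_touches_msix_ctrl(uint32_t cf8, unsigned int bytes)
{
    unsigned int bus  = (cf8 >> 16) & 0xff;   /* type 1 config address */
    unsigned int slot = (cf8 >> 11) & 0x1f;
    unsigned int func = (cf8 >> 8) & 7;
    unsigned int reg  = cf8 & 0xfc;
    unsigned int pos  = find_cap_pos(bus, slot, func, PCI_CAP_ID_MSIX);

    /* The mask-all bit lives in the MSI-X message control word at
     * pos + 2; flag any write overlapping that word.  The MSI
     * per-vector mask bits would need an analogous check against
     * the MSI capability. */
    return pos && reg < pos + 4 && reg + bytes > pos + 2;
}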

For a (from now on) timely resolution of the original problem I'd
really appreciate opinions (or alternative suggestions).

Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-11  8:35   ` Jan Beulich
@ 2015-06-11  9:51     ` Andrew Cooper
  2015-06-19 13:00       ` Jan Beulich
  2015-06-12 13:21     ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2015-06-11  9:51 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monne, Wei Liu, Ian Campbell, Ian Jackson,
	Stefano Stabellini, Konrad Rzeszutek Wilk
  Cc: xen-devel, dgdegra, Keir Fraser

On 11/06/15 09:35, Jan Beulich wrote:
>>>> On 05.06.15 at 13:28, <JBeulich@suse.com> wrote:
>> Qemu shouldn't be fiddling with this bit directly, as the hypervisor
>> may (and now does) use it for its own purposes. Provide it with a
>> replacement interface, allowing the hypervisor to track host and guest
>> masking intentions independently (clearing the bit only when both want
>> it clear).
> Originally I merely meant to ping the tools side changes here
> (considering that the original issue has been pending for months,
> delayed by various security issues as well as slow turnaround on
> understanding the nature and validity of that original issue, I'd
> _really_ like to see this go in now), but thinking about it once
> again overnight I realized that what we do here to allow qemu
> to be fixed would then also be made use of by the kernels
> running pciback: While Dom0 fiddling with the MSI-X mask-all bit
> for its own purposes is at least not a security problem, it doing
> so on behalf of (and directed by) a guest would be as soon as
> the hypervisor side patches making use of that bit went in.
>
> While I continue to be of the opinion that all direct writes to
> interrupt masking bits (MSI-X mask-all, MSI-X per-entry mask,
> MSI per entry mask) outside of the hypervisor are wrong and
> should be eliminated, the scope of the problem now clearly
> going beyond qemu made me reconsider whether we shouldn't,
> as advocated by Stefano, follow the trap-and-emulate route
> instead. This would not only mean adding code to x86's existing
> port CF8/CFC intercepts, but also write-protecting the MMCFG
> pages for all PCI devices being MSI or MSI-X capable, emulating
> writes with inspection / modification of writes to any of the mask
> bits located in PCI config space. (A subsequent optimization to
> this may then be a hypercall to do config space writes,
> eliminating the emulation overhead, accompanied by a bitmap
> indicating which devices' CFG space can be written directly.)
>
> For a (from now on) timely resolution of the original problem I'd
> really appreciate opinions (or alternative suggestions).

A very definite +1 from me.  I have previously suggested as much.

~Andrew

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-11  8:35   ` Jan Beulich
  2015-06-11  9:51     ` Andrew Cooper
@ 2015-06-12 13:21     ` Konrad Rzeszutek Wilk
  2015-06-12 13:51       ` Jan Beulich
  1 sibling, 1 reply; 40+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-12 13:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Ian Jackson,
	Ian Campbell, xen-devel, dgdegra, Keir Fraser, Roger Pau Monne

On Thu, Jun 11, 2015 at 09:35:51AM +0100, Jan Beulich wrote:
> >>> On 05.06.15 at 13:28, <JBeulich@suse.com> wrote:
> > Qemu shouldn't be fiddling with this bit directly, as the hypervisor
> > may (and now does) use it for its own purposes. Provide it with a
> > replacement interface, allowing the hypervisor to track host and guest
> > masking intentions independently (clearing the bit only when both want
> > it clear).
> 
> Originally I merely meant to ping the tools side changes here
> (considering that the original issue has been pending for months,
> delayed by various security issues as well as slow turnaround on
> understanding the nature and validity of that original issue, I'd
> _really_ like to see this go in now), but thinking about it once
> again overnight I realized that what we do here to allow qemu
> to be fixed would then also be made use of by the kernels
> running pciback: While Dom0 fiddling with the MSI-X mask-all bit
> for its own purposes is at least not a security problem, it doing
> so on behalf of (and directed by) a guest would be as soon as
> the hypervisor side patches making use of that bit went in.

It is hard to comment on this since I don't know exactly what
those patches would do.  But the 'pci_msi_ignore_mask'
from 38737d82f9f0168955f9944c3f8bd3bb262c7e88, "PCI/MSI: Add
pci_msi_ignore_mask to prevent writes to MSI/MSI-X Mask Bits",
should have prevented that. That said, said patches could of
course change pci_msi_ignore_mask.

> 
> While I continue to be of the opinion that all direct writes to
> interrupt masking bits (MSI-X mask-all, MSI-X per-entry mask,
> MSI per entry mask) outside of the hypervisor are wrong and
> should be eliminated, the scope of the problem now clearly
> going beyond qemu made me reconsider whether we shouldn't,
> as advocated by Stefano, follow the trap-and-emulate route
> instead. This would not only mean adding code to x86's existing
> port CF8/CFC intercepts, but also write-protecting the MMCFG
> pages for all PCI devices being MSI or MSI-X capable, emulating
> writes with inspection / modification of writes to any of the mask
> bits located in PCI config space. (A subsequent optimization to
> this may then be a hypercall to do config space writes,
> eliminating the emulation overhead, accompanied by a bitmap
> indicating which devices' CFG space can be written directly.)
> 
> For a (from now on) timely resolution of the original problem I'd
> really appreciate opinions (or alternative suggestions).

Weeding out the direct writes is nice, but the process of eliminating
those is going to take time.

I like your idea as it provides a nice mechanism to track which
component is at fault writing to those areas and can help in
fixing that. But on the flip side - it also might delay patches
for the offending code, as we would already have a 'firewall'
in place.

> 
> Jan
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-12 13:21     ` Konrad Rzeszutek Wilk
@ 2015-06-12 13:51       ` Jan Beulich
  2015-06-12 14:17         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-12 13:51 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Ian Jackson,
	Ian Campbell, xen-devel, dgdegra, Keir Fraser, Roger Pau Monne

>>> On 12.06.15 at 15:21, <konrad.wilk@oracle.com> wrote:
> On Thu, Jun 11, 2015 at 09:35:51AM +0100, Jan Beulich wrote:
>> >>> On 05.06.15 at 13:28, <JBeulich@suse.com> wrote:
>> > Qemu shouldn't be fiddling with this bit directly, as the hypervisor
>> > may (and now does) use it for its own purposes. Provide it with a
>> > replacement interface, allowing the hypervisor to track host and guest
>> > masking intentions independently (clearing the bit only when both want
>> > it clear).
>> 
>> Originally I merely meant to ping the tools side changes here
>> (considering that the original issue has been pending for months,
>> delayed by various security issues as well as slow turnaround on
>> understanding the nature and validity of that original issue, I'd
>> _really_ like to see this go in now), but thinking about it once
>> again overnight I realized that what we do here to allow qemu
>> to be fixed would then also be made use of by the kernels
>> running pciback: While Dom0 fiddling with the MSI-X mask-all bit
>> for its own purposes is at least not a security problem, it doing
>> so on behalf of (and directed by) a guest would be as soon as
>> the hypervisor side patches making use of that bit went in.
> 
> It is hard to comment on this since I don't know exactly what
> those patches would do.

Did you take a look?

> >  But the 'pci_msi_ignore_mask'
> > from 38737d82f9f0168955f9944c3f8bd3bb262c7e88, "PCI/MSI: Add
> > pci_msi_ignore_mask to prevent writes to MSI/MSI-X Mask Bits",
> > should have prevented that. That said, said patches could of
> > course change pci_msi_ignore_mask.

For one, this doesn't deal with the MSI-X mask-all bit. And then it
only suppresses functionality that the guest really ought to be
allowed to use, just not by directly manipulating hardware. Plus
of course any older Linux as well as other OSes would still be a
problem.

Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-12 13:51       ` Jan Beulich
@ 2015-06-12 14:17         ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 40+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-12 14:17 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Ian Jackson,
	Ian Campbell, xen-devel, dgdegra, Keir Fraser, Roger Pau Monne

On Fri, Jun 12, 2015 at 02:51:02PM +0100, Jan Beulich wrote:
> >>> On 12.06.15 at 15:21, <konrad.wilk@oracle.com> wrote:
> > On Thu, Jun 11, 2015 at 09:35:51AM +0100, Jan Beulich wrote:
> >> >>> On 05.06.15 at 13:28, <JBeulich@suse.com> wrote:
> >> > Qemu shouldn't be fiddling with this bit directly, as the hypervisor
> >> > may (and now does) use it for its own purposes. Provide it with a
> >> > replacement interface, allowing the hypervisor to track host and guest
> >> > masking intentions independently (clearing the bit only when both want
> >> > it clear).
> >> 
> >> Originally I merely meant to ping the tools side changes here
> >> (considering that the original issue has been pending for months,
> >> delayed by various security issues as well as slow turnaround on
> >> understanding the nature and validity of that original issue, I'd
> >> _really_ like to see this go in now), but thinking about it once
> >> again overnight I realized that what we do here to allow qemu
> >> to be fixed would then also be made use of by the kernels
> >> running pciback: While Dom0 fiddling with the MSI-X mask-all bit
> >> for its own purposes is at least not a security problem, it doing
> >> so on behalf of (and directed by) a guest would be as soon as
> >> the hypervisor side patches making use of that bit went in.
> > 
> > It is hard to comment on this since I don't know exactly what
> > those patches would do.
> 
> Did you take a look?

No. Oddly enough they didn't show up in my thread and I didn't
even look at the title to Google for it. Doing it now.
> 
> > >  But the 'pci_msi_ignore_mask'
> > > from 38737d82f9f0168955f9944c3f8bd3bb262c7e88, "PCI/MSI: Add
> > > pci_msi_ignore_mask to prevent writes to MSI/MSI-X Mask Bits",
> > > should have prevented that. That said, said patches could of
> > > course change pci_msi_ignore_mask.
> 
> For one, this doesn't deal with the MSI-X mask-all bit. And then it
> only suppresses functionality that the guest really ought to be
> allowed to use, just not by directly manipulating hardware. Plus
> of course any older Linux as well as other OSes would still be a
> problem.

True.
> 
> Jan
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-11  9:51     ` Andrew Cooper
@ 2015-06-19 13:00       ` Jan Beulich
  2015-06-19 13:05         ` Konrad Rzeszutek Wilk
  2015-06-19 14:07         ` Roger Pau Monné
  0 siblings, 2 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-19 13:00 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Stefano Stabellini, Ian Jackson, Ian Campbell,
	xen-devel, dgdegra, Keir Fraser, Roger Pau Monne

>>> On 11.06.15 at 11:51, <andrew.cooper3@citrix.com> wrote:
> On 11/06/15 09:35, Jan Beulich wrote:
>> While I continue to be of the opinion that all direct writes to
>> interrupt masking bits (MSI-X mask-all, MSI-X per-entry mask,
>> MSI per entry mask) outside of the hypervisor are wrong and
>> should be eliminated, the scope of the problem now clearly
>> going beyond qemu made me reconsider whether we shouldn't,
>> as advocated by Stefano, follow the trap-and-emulate route
>> instead. This would not only mean adding code to x86's existing
>> port CF8/CFC intercepts, but also write-protecting the MMCFG
>> pages for all PCI devices being MSI or MSI-X capable, emulating
>> writes with inspection / modification of writes to any of the mask
>> bits located in PCI config space. (A subsequent optimization to
>> this may then be a hypercall to do config space writes,
>> eliminating the emulation overhead, accompanied by a bitmap
>> indicating which devices' CFG space can be written directly.)
>>
>> For a (from now on) timely resolution of the original problem I'd
>> really appreciate opinions (or alternative suggestions).
> 
> A very definite +1 from me.  I have previously suggested as much.

And now that I started looking into what it takes to make this
work, I'm having a deja vu: In order for us to reliably intercept
all CFG accesses, we need to whitelist the MMCFG pages of
devices we know we don't care about being written. I.e. we
need to start out with all of them being read-only. And the
affected MFNs have to be known before Dom0 maps these
pages (or else we would have to hunt down all the mappings in
the page tables, which is nothing I consider even remotely
reasonable). Yet, and here comes the deja vu, upstream Linux
_still_ doesn't make use of PHYSDEVOP_pci_mmcfg_reserved.
No idea whether FreeBSD or whatever else can be used as Dom0
do. So no matter how we turn it, we have a dependency on the
Dom0 kernel also being adjusted. In which case we might as well
go the original route of requiring hypercalls to be used for certain
operations to deal with the problem here.
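
(For reference, the interface in question - cited from memory of
xen/include/public/physdev.h, so please double check there - would be
used from a Dom0 kernel roughly like below; the mcfg_* values are
placeholders for what gets parsed out of the ACPI MCFG table:

    struct physdev_pci_mmcfg_reserved r = {
        .address   = mcfg_base,
        .segment   = mcfg_segment,
        .start_bus = mcfg_start_bus,
        .end_bus   = mcfg_end_bus,
        .flags     = XEN_PCI_MMCFG_RESERVED, /* range is E820-reserved */
    };
    rc = HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);
)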

Otoh the write interception has the potential of dealing with other
problems (like that of XSAs 120 and 126), but making the security
of Xen (in presence of the fix/workaround to the original problem
here) dependent on a Dom0 side change not even on its way into
the master Linux branch yet makes me really hesitant to try going
that route. (And no, I'm not up to fighting for another pv-ops hook
considering that I've never been really convinced of the pv-ops
model in the first place.)

But then again the one thing we might consider saving us on the
Linux side is that as of 2.6.25 base config space accesses don't
get done via MMCFG anymore, and we don't have an immediate
need to intercept extended ones (i.e. initially we might even get
away without snooping MMCFG writes at all). Roger - how do
things look like on the FreeBSD side?

Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-19 13:00       ` Jan Beulich
@ 2015-06-19 13:05         ` Konrad Rzeszutek Wilk
  2015-06-19 14:52           ` Jan Beulich
  2015-06-19 14:07         ` Roger Pau Monné
  1 sibling, 1 reply; 40+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-19 13:05 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: Wei Liu, Stefano Stabellini, Ian Jackson, Ian Campbell,
	xen-devel, dgdegra, Keir Fraser, Roger Pau Monne

On June 19, 2015 9:00:39 AM EDT, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 11.06.15 at 11:51, <andrew.cooper3@citrix.com> wrote:
>> On 11/06/15 09:35, Jan Beulich wrote:
>>> While I continue to be of the opinion that all direct writes to
>>> interrupt masking bits (MSI-X mask-all, MSI-X per-entry mask,
>>> MSI per entry mask) outside of the hypervisor are wrong and
>>> should be eliminated, the scope of the problem now clearly
>>> going beyond qemu made me reconsider whether we shouldn't,
>>> as advocated by Stefano, follow the trap-and-emulate route
>>> instead. This would not only mean adding code to x86's existing
>>> port CF8/CFC intercepts, but also write-protecting the MMCFG
>>> pages for all PCI devices being MSI or MSI-X capable, emulating
>>> writes with inspection / modification of writes to any of the mask
>>> bits located in PCI config space. (A subsequent optimization to
>>> this may then be a hypercall to do config space writes,
>>> eliminating the emulation overhead, accompanied by a bitmap
>>> indicating which devices' CFG space can be written directly.)
>>>
>>> For a (from now on) timely resolution of the original problem I'd
>>> really appreciate opinions (or alternative suggestions).
>> 
>> A very definite +1 from me.  I have previously suggested as much.
>
>And now that I started looking into what it takes to make this
>work, I'm having a deja vu: In order for us to reliably intercept
>all CFG accesses, we need to whitelist the MMCFG pages of
>devices we know we don't care about being written. I.e. we
>need to start out with all of them being read-only. And the
>affected MFNs have to be known before Dom0 maps these
>pages (or else we would have to hunt down all the mappings in
>the page tables, which is nothing I consider even remotely
>reasonable). Yet, and here comes the deja vu, upstream Linux
>_still_ doesn't make use of PHYSDEVOP_pci_mmcfg_reserved.

Yes it does: http://lists.xenproject.org/archives/html/xen-devel/2013-11/msg00807.html

It's in the kernel.

>No idea whether FreeBSD or whatever else can be used as Dom0
>do. So no matter how we turn it, we have a dependency on the
>Dom0 kernel also being adjusted. In which case we might as well
>go the original route of requiring hypercalls to be used for certain
>operations to deal with the problem here.
>
>Otoh the write interception has the potential of dealing with other
>problems (like that of XSAs 120 and 126), but making the security
>of Xen (in presence of the fix/workaround to the original problem
>here) dependent on a Dom0 side change not even on its way into
>the master Linux branch yet makes me really hesitant to try going
>that route. (And no, I'm not up to fighting for another pv-ops hook
>considering that I've never been really convinced of the pv-ops
>model in the first place.)
>
>But then again the one thing we might consider saving us on the
>Linux side is that as of 2.6.25 base config space accesses don't
>get done via MMCFG anymore, and we don't have an immediate
>need to intercept extended ones (i.e. initially we might even get
>away without snooping MMCFG writes at all). Roger - how do
>things look like on the FreeBSD side?
>
>Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-19 13:00       ` Jan Beulich
  2015-06-19 13:05         ` Konrad Rzeszutek Wilk
@ 2015-06-19 14:07         ` Roger Pau Monné
  2015-06-19 14:58           ` Jan Beulich
  2015-06-22 11:25           ` Jan Beulich
  1 sibling, 2 replies; 40+ messages in thread
From: Roger Pau Monné @ 2015-06-19 14:07 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: Wei Liu, Stefano Stabellini, Ian Jackson, Ian Campbell,
	xen-devel, dgdegra, Keir Fraser

El 19/06/15 a les 15.00, Jan Beulich ha escrit:
>>>> On 11.06.15 at 11:51, <andrew.cooper3@citrix.com> wrote:
>> On 11/06/15 09:35, Jan Beulich wrote:
>>> While I continue to be of the opinion that all direct writes to
>>> interrupt masking bits (MSI-X mask-all, MSI-X per-entry mask,
>>> MSI per entry mask) outside of the hypervisor are wrong and
>>> should be eliminated, the scope of the problem now clearly
>>> going beyond qemu made me reconsider whether we shouldn't,
>>> as advocated by Stefano, follow the trap-and-emulate route
>>> instead. This would not only mean adding code to x86's existing
>>> port CF8/CFC intercepts, but also write-protecting the MMCFG
>>> pages for all PCI devices being MSI or MSI-X capable, emulating
>>> writes with inspection / modification of writes to any of the mask
>>> bits located in PCI config space. (A subsequent optimization to
>>> this may then be a hypercall to do config space writes,
>>> eliminating the emulation overhead, accompanied by a bitmap
>>> indicating which devices' CFG space can be written directly.)
>>>
>>> For a (from now on) timely resolution of the original problem I'd
>>> really appreciate opinions (or alternative suggestions).
>>
>> A very definite +1 from me.  I have previously suggested as much.
> 
> And now that I started looking into what it takes to make this
> work, I'm having a deja vu: In order for us to reliably intercept
> all CFG accesses, we need to whitelist the MMCFG pages of
> devices we know we don't care about being written. I.e. we
> need to start out with all of them being read-only. And the
> affected MFNs have to be known before Dom0 maps these
> pages (or else we would have to hunt down all the mappings in
> the page tables, which is nothing I consider even remotely
> reasonable). Yet, and here comes the deja vu, upstream Linux
> _still_ doesn't make use of PHYSDEVOP_pci_mmcfg_reserved.
> No idea whether FreeBSD or whatever else can be used as Dom0
> do. So no matter how we turn it, we have a dependency on the
> Dom0 kernel also being adjusted. In which case we might as well
> go the original route of requiring hypercalls to be used for certain
> operations to deal with the problem here.

FreeBSD doesn't implement PHYSDEVOP_pci_mmcfg_reserved ATM. I had a
patch to implement it, but it's completely useless with the way we map
MMIO regions on PVH right now.

Every hole in the e820 is basically mapped as an MMIO region _before_
starting Dom0, making the white/black listing done in
PHYSDEVOP_pci_mmcfg_reserved completely moot.

> Otoh the write interception has the potential of dealing with other
> problems (like that of XSAs 120 and 126), but making the security
> of Xen (in presence of the fix/workaround to the original problem
> here) dependent on a Dom0 side change not even on its way into
> the master Linux branch yet makes me really hesitant to try going
> that route. (And no, I'm not up to fighting for another pv-ops hook
> considering that I've never been really convinced of the pv-ops
> model in the first place.)
> 
> But then again the one thing we might consider saving us on the
> Linux side is that as of 2.6.25 base config space accesses don't
> get done via MMCFG anymore, and we don't have an immediate
> need to intercept extended ones (i.e. initially we might even get
> away without snooping MMCFG writes at all). Roger - how do
> things look like on the FreeBSD side?

I don't mind adding a PHYSDEVOP_pci_mmcfg_reserved call to FreeBSD, but
for it to have any effect we need to stop unconditionally mapping
everything as MMIO regions on PVH Dom0.

Then we need to expand XENMEM_add_to_physmap_batch so it can be used to
map MMIO regions on demand from Dom0 or modify
PHYSDEVOP_pci_mmcfg_reserved so it sets up the right mappings (1:1) for
auto-translated guests.

Roger.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-19 13:05         ` Konrad Rzeszutek Wilk
@ 2015-06-19 14:52           ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-19 14:52 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Ian Jackson,
	Ian Campbell, xen-devel, dgdegra, Keir Fraser, Roger Pau Monne

>>> On 19.06.15 at 15:05, <konrad.wilk@oracle.com> wrote:
>>And now that I started looking into what it takes to make this
>>work, I'm having a deja vu: In order for us to reliably intercept
>>all CFG accesses, we need to whitelist the MMCFG pages of
>>devices we know we don't care about being written. I.e. we
>>need to start out with all of them being read-only. And the
>>affected MFNs have to be known before Dom0 maps these
>>pages (or else we would have to hunt down all the mappings in
>>the page tables, which is nothing I consider even remotely
>>reasonable). Yet, and here comes the deja vu, upstream Linux
>>_still_ doesn't make use of PHYSDEVOP_pci_mmcfg_reserved.
> 
> Yes it does: 
> http://lists.xenproject.org/archives/html/xen-devel/2013-11/msg00807.html 

Oh, sorry - I expected it to live somewhere under arch/x86/.

Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-19 14:07         ` Roger Pau Monné
@ 2015-06-19 14:58           ` Jan Beulich
  2015-06-22 17:02             ` Roger Pau Monné
  2015-06-22 11:25           ` Jan Beulich
  1 sibling, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-19 14:58 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Ian Jackson,
	Ian Campbell, xen-devel, dgdegra, Keir Fraser

>>> On 19.06.15 at 16:07, <roger.pau@citrix.com> wrote:
> I don't mind adding a PHYSDEVOP_pci_mmcfg_reserved call to FreeBSD, but
> for it to have any effect we need to stop unconditionally mapping
> everything as MMIO regions on PVH Dom0.

Right, I didn't mean to imply PVH would have any chance of working
right now.

But what you didn't respond to is the (kind of implicit, I admit) question
of whether FreeBSD is using MMCFG accesses for the low 256 bytes
of the config space. (I note that there's one special case in Linux -
NumaChip - where this is the case.)

Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-19 14:07         ` Roger Pau Monné
  2015-06-19 14:58           ` Jan Beulich
@ 2015-06-22 11:25           ` Jan Beulich
  1 sibling, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-22 11:25 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Ian Jackson,
	Ian Campbell, xen-devel, dgdegra, Keir Fraser

>>> On 19.06.15 at 16:07, <roger.pau@citrix.com> wrote:
> I don't mind adding a PHYSDEVOP_pci_mmcfg_reserved call to FreeBSD, but
> for it to have any effect we need to stop unconditionally mapping
> everything as MMIO regions on PVH Dom0.

Actually I don't think we need this as a prereq (it's rather a pretty
orthogonal change, which I nevertheless agree should be carried
out soon): Even if the kernel registered the MMCFG ranges,
whether it maps them up front could be of no interest to Xen in
the PVH case (in the PV case we'd need to search for mappings in
page tables, so we wouldn't want to lift the ordering restriction).
Xen could deal with the permission change in two ways: For one,
MMIO ranges are 1:1 mapped, so it could brute force adjust the
affected 2nd level paging entries. Alternatively we could (ab)use
p2m_mmio_write_dm for the Dom0 case, and invoke
p2m_change_entry_type_global(d, p2m_mmio_direct,
p2m_mmio_write_dm) upon registration of an MMCFG range.
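
I.e. in sketch form (only the p2m call is meant literally; the wrapper
name is made up here):

    static void mmcfg_intercept_writes(struct domain *d)
    {
        /* Flip the hardware domain's direct MMIO mappings to the
         * write-emulated type: reads stay direct, writes get trapped. */
        p2m_change_entry_type_global(d, p2m_mmio_direct, p2m_mmio_write_dm);
    }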

Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-19 14:58           ` Jan Beulich
@ 2015-06-22 17:02             ` Roger Pau Monné
  2015-06-23  7:20               ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Roger Pau Monné @ 2015-06-22 17:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Ian Jackson,
	Ian Campbell, xen-devel, dgdegra, Keir Fraser

El 19/06/15 a les 16.58, Jan Beulich ha escrit:
>>>> On 19.06.15 at 16:07, <roger.pau@citrix.com> wrote:
>> I don't mind adding a PHYSDEVOP_pci_mmcfg_reserved call to FreeBSD, but
>> for it to have any effect we need to stop unconditionally mapping
>> everything as MMIO regions on PVH Dom0.
> 
> Right, I didn't mean to imply PVH would have any chance of working
> right now.
> 
> But what you didn't respond to is the (kind of implicit, I admit) question
> of whether FreeBSD is using MMCFG accesses for the low 256 bytes
> of the config space. (I note that there's one special case in Linux -
> NumaChip - where this is the case.)

OK, I didn't get that part of the question. AFAICT yes, FreeBSD will 
access the low 256 bytes of the config space. For example, the stub to 
write to a cfg register is as follows:

void
pci_cfgregwrite(int bus, int slot, int func, int reg, u_int32_t data, int bytes)
{

	if (cfgmech == CFGMECH_PCIE &&
	    (bus >= pcie_minbus && bus <= pcie_maxbus) &&
	    (bus != 0 || !(1 << slot & pcie_badslots)))
		pciereg_cfgwrite(bus, slot, func, reg, data, bytes);
	else
		pcireg_cfgwrite(bus, slot, func, reg, data, bytes);
}

I take it that you would prefer it to be:

void
pci_cfgregwrite(int bus, int slot, int func, int reg, u_int32_t data, int bytes)
{

	if (cfgmech == CFGMECH_PCIE &&
	    (bus >= pcie_minbus && bus <= pcie_maxbus) &&
	    (bus != 0 || !(1 << slot & pcie_badslots)) &&
	    (reg > PCI_REGMAX))
		pciereg_cfgwrite(bus, slot, func, reg, data, bytes);
	else
		pcireg_cfgwrite(bus, slot, func, reg, data, bytes);
}

Where 'PCI_REGMAX' is 255.

Roger.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-22 17:02             ` Roger Pau Monné
@ 2015-06-23  7:20               ` Jan Beulich
  2015-06-23  7:29                 ` Roger Pau Monné
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-23  7:20 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Ian Jackson,
	Ian Campbell, xen-devel, dgdegra, Keir Fraser

>>> On 22.06.15 at 19:02, <roger.pau@citrix.com> wrote:
> OK, I didn't get that part of the question. AFAICT yes, FreeBSD will 
> access the low 256 bytes of the config space. For example, the stub to 
> write to a cfg register is as follows:
> 
> void
> pci_cfgregwrite(int bus, int slot, int func, int reg, u_int32_t data, int 
> bytes)
> {
> 
> 	if (cfgmech == CFGMECH_PCIE &&
> 	    (bus >= pcie_minbus && bus <= pcie_maxbus) &&
> 	    (bus != 0 || !(1 << slot & pcie_badslots)))
> 		pciereg_cfgwrite(bus, slot, func, reg, data, bytes);
> 	else
> 		pcireg_cfgwrite(bus, slot, func, reg, data, bytes);
> }
> 
> I take it that you would prefer it to be:
> 
> void
> pci_cfgregwrite(int bus, int slot, int func, int reg, u_int32_t data, int 
> bytes)
> {
> 
> 	if (cfgmech == CFGMECH_PCIE &&
> 	    (bus >= pcie_minbus && bus <= pcie_maxbus) &&
> 	    (bus != 0 || !(1 << slot & pcie_badslots)) &&
> 	    (reg > PCI_REGMAX))
> 		pciereg_cfgwrite(bus, slot, func, reg, data, bytes);
> 	else
> 		pcireg_cfgwrite(bus, slot, func, reg, data, bytes);
> }
> 
> Where 'PCI_REGMAX' is 255.

Indeed that would have been nice, but we have to live with what
OSes do. But then again - did I understand correctly that FreeBSD
doesn't support PV Dom0 operation anymore, but wants to only
run in PVH mode? Was that a recent change, i.e. are PV Dom0-s
assumed to still be in use?

Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-23  7:20               ` Jan Beulich
@ 2015-06-23  7:29                 ` Roger Pau Monné
  2015-06-23  8:13                   ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Roger Pau Monné @ 2015-06-23  7:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Ian Jackson,
	Ian Campbell, xen-devel, dgdegra, Keir Fraser

El 23/06/15 a les 9.20, Jan Beulich ha escrit:
>>>> On 22.06.15 at 19:02, <roger.pau@citrix.com> wrote:
>> OK, I didn't get that part of the question. AFAICT yes, FreeBSD will 
>> access the low 256 bytes of the config space. For example, the stub to 
>> write to a cfg register is as follows:
>>
>> void
>> pci_cfgregwrite(int bus, int slot, int func, int reg, u_int32_t data, int 
>> bytes)
>> {
>>
>> 	if (cfgmech == CFGMECH_PCIE &&
>> 	    (bus >= pcie_minbus && bus <= pcie_maxbus) &&
>> 	    (bus != 0 || !(1 << slot & pcie_badslots)))
>> 		pciereg_cfgwrite(bus, slot, func, reg, data, bytes);
>> 	else
>> 		pcireg_cfgwrite(bus, slot, func, reg, data, bytes);
>> }
>>
>> I take it that you would prefer it to be:
>>
>> void
>> pci_cfgregwrite(int bus, int slot, int func, int reg, u_int32_t data, int 
>> bytes)
>> {
>>
>> 	if (cfgmech == CFGMECH_PCIE &&
>> 	    (bus >= pcie_minbus && bus <= pcie_maxbus) &&
>> 	    (bus != 0 || !(1 << slot & pcie_badslots)) &&
>> 	    (reg > PCI_REGMAX))
>> 		pciereg_cfgwrite(bus, slot, func, reg, data, bytes);
>> 	else
>> 		pcireg_cfgwrite(bus, slot, func, reg, data, bytes);
>> }
>>
>> Where 'PCI_REGMAX' is 255.
> 
> Indeed that would have been nice, but we have to live with what
> OSes do. But then again - did I understand correctly that FreeBSD
> doesn't support PV Dom0 operation anymore, but wants to only
> run in PVH mode? Was that a recent change, i.e. are PV Dom0-s
> assumed to still be in use?

FreeBSD never supported PV Dom0 operation, it only had a very minimal
and crappy i386 PV DomU support which has now been completely removed.
Maybe you are confusing it with NetBSD, which does have PV Dom0 support
since a long time ago?

Yes, FreeBSD only aims to run as PVHVM or PVH (or whatever we end up
calling HVM guests without a device model). I frankly don't mind trying
to change this in FreeBSD if there's a good rationale behind it. I'm
currently testing FreeBSD with the change above to see how it behaves.

Why does Linux avoid MMCFG accesses for the low 256 bytes of the config
space and instead use the IO ports with PCIe devices?

Roger.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control
  2015-06-23  7:29                 ` Roger Pau Monné
@ 2015-06-23  8:13                   ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-23  8:13 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Ian Jackson,
	Ian Campbell, xen-devel, dgdegra, Keir Fraser

>>> On 23.06.15 at 09:29, <roger.pau@citrix.com> wrote:
> FreeBSD never supported PV Dom0 operation, it only had a very minimal
> and crappy i386 PV DomU support which has now been completely removed.
> Maybe you are confusing it with NetBSD, which does have PV Dom0 support
> since a long time ago?

Very likely.

> Yes, FreeBSD only aims to run as PVHVM or PVH (or whatever we end up
> calling HVM guests without a device model). I frankly don't mind trying
> to change this in FreeBSD if there's a good rationale behind it. I'm
> currently testing FreeBSD with the change above to see how it behaves.
> 
> Why does Linux avoid MMCFG accesses for the low 256 bytes of the config
> space and instead use the IO ports with PCIe devices?

See
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit?id=a0ca9909609470ad779b9b9cc68ce96e975afff7

Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 05/10] x86/MSI: track host and guest masking separately
  2015-06-05 11:22 ` [PATCH v3 05/10] x86/MSI: track host and guest masking separately Jan Beulich
  2015-06-05 13:05   ` Andrew Cooper
@ 2016-04-01  7:40   ` Li, Liang Z
  2016-04-01  8:47     ` Jan Beulich
  1 sibling, 1 reply; 40+ messages in thread
From: Li, Liang Z @ 2016-04-01  7:40 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Andrew Cooper, Keir Fraser, Chang, JianzhongX

Hi Jan,

A couple of weeks ago, Jianzhong reported an issue: the SRIOV NICs (Intel 82599, 82571) don't work correctly in a Windows guest.
By debugging, we found that this patch of yours, commit ad28e42bd1d28d746988ed71654e8aa670629753, caused the regression.

Could you help take a look at which part of your change led to the regression?
The SRIOV NICs work well in a Linux guest.

Liang

> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
> bounces@lists.xen.org] On Behalf Of Jan Beulich
> Sent: Friday, June 05, 2015 7:23 PM
> To: xen-devel
> Cc: Andrew Cooper; Keir Fraser
> Subject: [Xen-devel] [PATCH v3 05/10] x86/MSI: track host and guest
> masking separately
> 
> In particular we want to avoid losing track of our own intention to have an
> entry masked. Physical unmasking now happens only when both host and
> guest requested so.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/arch/x86/hpet.c
> +++ b/xen/arch/x86/hpet.c
> @@ -241,7 +241,7 @@ static void hpet_msi_unmask(struct irq_d
>      cfg = hpet_read32(HPET_Tn_CFG(ch->idx));
>      cfg |= HPET_TN_ENABLE;
>      hpet_write32(cfg, HPET_Tn_CFG(ch->idx));
> -    ch->msi.msi_attrib.masked = 0;
> +    ch->msi.msi_attrib.host_masked = 0;
>  }
> 
>  static void hpet_msi_mask(struct irq_desc *desc)
> @@ -252,7 +252,7 @@ static void hpet_msi_mask(struct irq_des
>      cfg = hpet_read32(HPET_Tn_CFG(ch->idx));
>      cfg &= ~HPET_TN_ENABLE;
>      hpet_write32(cfg, HPET_Tn_CFG(ch->idx));
> -    ch->msi.msi_attrib.masked = 1;
> +    ch->msi.msi_attrib.host_masked = 1;
>  }
> 
>  static int hpet_msi_write(struct hpet_event_channel *ch, struct msi_msg *msg)
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -219,7 +219,6 @@ static int msixtbl_read(
>  {
>      unsigned long offset;
>      struct msixtbl_entry *entry;
> -    void *virt;
>      unsigned int nr_entry, index;
>      int r = X86EMUL_UNHANDLEABLE;
> 
> @@ -244,10 +243,16 @@ static int msixtbl_read(
>      }
>      else
>      {
> -        virt = msixtbl_addr_to_virt(entry, address);
> +        const struct msi_desc *msi_desc;
> +        void *virt = msixtbl_addr_to_virt(entry, address);
> +
>          if ( !virt )
>              goto out;
> -        *pval = readl(virt);
> +        msi_desc = virt_to_msi_desc(entry->pdev, virt);
> +        if ( !msi_desc )
> +            goto out;
> +        *pval = MASK_INSR(msi_desc->msi_attrib.guest_masked,
> +                          PCI_MSIX_VECTOR_BITMASK);
>      }
> 
>      r = X86EMUL_OKAY;
> @@ -265,7 +270,7 @@ static int msixtbl_write(struct vcpu *v,
>      void *virt;
>      unsigned int nr_entry, index;
>      int r = X86EMUL_UNHANDLEABLE;
> -    unsigned long flags, orig;
> +    unsigned long flags;
>      struct irq_desc *desc;
> 
>      if ( len != 4 || (address & 3) )
> @@ -318,37 +323,7 @@ static int msixtbl_write(struct vcpu *v,
> 
>      ASSERT(msi_desc == desc->msi_desc);
> 
> -    orig = readl(virt);
> -
> -    /*
> -     * Do not allow guest to modify MSI-X control bit if it is masked
> -     * by Xen. We'll only handle the case where Xen thinks that
> -     * bit is unmasked, but hardware has silently masked the bit
> -     * (in case of SR-IOV VF reset, etc). On the other hand, if Xen
> -     * thinks that the bit is masked, but it's really not,
> -     * we log a warning.
> -     */
> -    if ( msi_desc->msi_attrib.masked )
> -    {
> -        if ( !(orig & PCI_MSIX_VECTOR_BITMASK) )
> -            printk(XENLOG_WARNING "MSI-X control bit is unmasked when"
> -                   " it is expected to be masked [%04x:%02x:%02x.%u]\n",
> -                   entry->pdev->seg, entry->pdev->bus,
> -                   PCI_SLOT(entry->pdev->devfn),
> -                   PCI_FUNC(entry->pdev->devfn));
> -
> -        goto unlock;
> -    }
> -
> -    /*
> -     * The mask bit is the only defined bit in the word. But we
> -     * ought to preserve the reserved bits. Clearing the reserved
> -     * bits can result in undefined behaviour (see PCI Local Bus
> -     * Specification revision 2.3).
> -     */
> -    val &= PCI_MSIX_VECTOR_BITMASK;
> -    val |= (orig & ~PCI_MSIX_VECTOR_BITMASK);
> -    writel(val, virt);
> +    guest_mask_msi_irq(desc, !!(val & PCI_MSIX_VECTOR_BITMASK));
> 
>  unlock:
>      spin_unlock_irqrestore(&desc->lock, flags);
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -390,12 +390,13 @@ int msi_maskable_irq(const struct msi_de
>             || entry->msi_attrib.maskbit;
>  }
> 
> -static bool_t msi_set_mask_bit(struct irq_desc *desc, int flag)
> +static bool_t msi_set_mask_bit(struct irq_desc *desc, bool_t host,
> +                               bool_t guest)
>  {
>      struct msi_desc *entry = desc->msi_desc;
>      struct pci_dev *pdev;
>      u16 seg, control;
>      u8 bus, slot, func;
> +    bool_t flag = host || guest;
> 
>      ASSERT(spin_is_locked(&desc->lock));
>      BUG_ON(!entry || !entry->dev);
> @@ -453,7 +454,8 @@ static bool_t msi_set_mask_bit(struct ir
>      default:
>          return 0;
>      }
> -    entry->msi_attrib.masked = !!flag;
> +    entry->msi_attrib.host_masked = host;
> +    entry->msi_attrib.guest_masked = guest;
> 
>      return 1;
>  }
> @@ -484,22 +486,39 @@ static int msi_get_mask_bit(const struct
> 
>  void mask_msi_irq(struct irq_desc *desc)
>  {
> -    if ( unlikely(!msi_set_mask_bit(desc, 1)) )
> +    if ( unlikely(!msi_set_mask_bit(desc, 1,
> +                                    desc->msi_desc->msi_attrib.guest_masked)) )
>          BUG_ON(!(desc->status & IRQ_DISABLED));
>  }
> 
>  void unmask_msi_irq(struct irq_desc *desc)
>  {
> -    if ( unlikely(!msi_set_mask_bit(desc, 0)) )
> +    if ( unlikely(!msi_set_mask_bit(desc, 0,
> +                                    desc->msi_desc->msi_attrib.guest_masked)) )
>          WARN();
>  }
> 
> +void guest_mask_msi_irq(struct irq_desc *desc, bool_t mask)
> +{
> +    msi_set_mask_bit(desc, desc->msi_desc->msi_attrib.host_masked, mask);
> +}
> +
>  static unsigned int startup_msi_irq(struct irq_desc *desc)
>  {
> -    unmask_msi_irq(desc);
> +    bool_t guest_masked = (desc->status & IRQ_GUEST) &&
> +                          is_hvm_domain(desc->msi_desc->dev->domain);
> +
> +    if ( unlikely(!msi_set_mask_bit(desc, 0, guest_masked)) )
> +        WARN();
>      return 0;
>  }
> 
> +static void shutdown_msi_irq(struct irq_desc *desc)
> +{
> +    if ( unlikely(!msi_set_mask_bit(desc, 1, 1)) )
> +        BUG_ON(!(desc->status & IRQ_DISABLED));
> +}
> +
>  void ack_nonmaskable_msi_irq(struct irq_desc *desc)
>  {
>      irq_complete_move(desc);
> @@ -524,7 +543,7 @@ void end_nonmaskable_msi_irq(struct irq_
>  static hw_irq_controller pci_msi_maskable = {
>      .typename     = "PCI-MSI/-X",
>      .startup      = startup_msi_irq,
> -    .shutdown     = mask_msi_irq,
> +    .shutdown     = shutdown_msi_irq,
>      .enable       = unmask_msi_irq,
>      .disable      = mask_msi_irq,
>      .ack          = ack_maskable_msi_irq,
> @@ -694,7 +713,8 @@ static int msi_capability_init(struct pc
>          entry[i].msi_attrib.is_64 = is_64bit_address(control);
>          entry[i].msi_attrib.entry_nr = i;
>          entry[i].msi_attrib.maskbit = is_mask_bit_support(control);
> -        entry[i].msi_attrib.masked = 1;
> +        entry[i].msi_attrib.host_masked = 1;
> +        entry[i].msi_attrib.guest_masked = 0;
>          entry[i].msi_attrib.pos = pos;
>          if ( entry[i].msi_attrib.maskbit )
>              entry[i].msi.mpos = mpos;
> @@ -943,7 +963,8 @@ static int msix_capability_init(struct p
>          entry->msi_attrib.is_64 = 1;
>          entry->msi_attrib.entry_nr = msi->entry_nr;
>          entry->msi_attrib.maskbit = 1;
> -        entry->msi_attrib.masked = 1;
> +        entry->msi_attrib.host_masked = 1;
> +        entry->msi_attrib.guest_masked = 1;
>          entry->msi_attrib.pos = pos;
>          entry->irq = msi->irq;
>          entry->dev = dev;
> @@ -1315,7 +1336,8 @@ int pci_restore_msi_state(struct pci_dev
>          for ( i = 0; ; )
>          {
>              if ( unlikely(!msi_set_mask_bit(desc,
> -                                            entry[i].msi_attrib.masked)) )
> +                                            entry[i].msi_attrib.host_masked,
> +                                            entry[i].msi_attrib.guest_masked)) )
>                  BUG();
> 
>              if ( !--nr )
> @@ -1469,7 +1491,7 @@ static void dump_msi(unsigned char key)
>          else
>              mask = '?';
>          printk(" %-6s%4u vec=%02x%7s%6s%3sassert%5s%7s"
> -               " dest=%08x mask=%d/%d/%c\n",
> +               " dest=%08x mask=%d/%c%c/%c\n",
>                 type, irq,
>                 (data & MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT,
>                 data & MSI_DATA_DELIVERY_LOWPRI ? "lowest" : "fixed",
> @@ -1477,7 +1499,10 @@ static void dump_msi(unsigned char key)
>                 data & MSI_DATA_LEVEL_ASSERT ? "" : "de",
>                 addr & MSI_ADDR_DESTMODE_LOGIC ? "log" : "phys",
>                 addr & MSI_ADDR_REDIRECTION_LOWPRI ? "lowest" : "cpu",
> -               dest32, attr.maskbit, attr.masked, mask);
> +               dest32, attr.maskbit,
> +               attr.host_masked ? 'H' : ' ',
> +               attr.guest_masked ? 'G' : ' ',
> +               mask);
>      }
>  }
> 
> --- a/xen/drivers/passthrough/amd/iommu_init.c
> +++ b/xen/drivers/passthrough/amd/iommu_init.c
> @@ -451,7 +451,7 @@ static void iommu_msi_unmask(struct irq_
>      spin_lock_irqsave(&iommu->lock, flags);
>      amd_iommu_msi_enable(iommu, IOMMU_CONTROL_ENABLED);
>      spin_unlock_irqrestore(&iommu->lock, flags);
> -    iommu->msi.msi_attrib.masked = 0;
> +    iommu->msi.msi_attrib.host_masked = 0;
>  }
> 
>  static void iommu_msi_mask(struct irq_desc *desc)
> @@ -464,7 +464,7 @@ static void iommu_msi_mask(struct irq_de
>      spin_lock_irqsave(&iommu->lock, flags);
>      amd_iommu_msi_enable(iommu, IOMMU_CONTROL_DISABLED);
>      spin_unlock_irqrestore(&iommu->lock, flags);
> -    iommu->msi.msi_attrib.masked = 1;
> +    iommu->msi.msi_attrib.host_masked = 1;
>  }
> 
>  static unsigned int iommu_msi_startup(struct irq_desc *desc)
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -997,7 +997,7 @@ static void dma_msi_unmask(struct irq_de
>      spin_lock_irqsave(&iommu->register_lock, flags);
>      dmar_writel(iommu->reg, DMAR_FECTL_REG, 0);
>      spin_unlock_irqrestore(&iommu->register_lock, flags);
> -    iommu->msi.msi_attrib.masked = 0;
> +    iommu->msi.msi_attrib.host_masked = 0;
>  }
> 
>  static void dma_msi_mask(struct irq_desc *desc)
> @@ -1009,7 +1009,7 @@ static void dma_msi_mask(struct irq_desc
>      spin_lock_irqsave(&iommu->register_lock, flags);
>      dmar_writel(iommu->reg, DMAR_FECTL_REG, DMA_FECTL_IM);
>      spin_unlock_irqrestore(&iommu->register_lock, flags);
> -    iommu->msi.msi_attrib.masked = 1;
> +    iommu->msi.msi_attrib.host_masked = 1;
>  }
> 
>  static unsigned int dma_msi_startup(struct irq_desc *desc)
> --- a/xen/include/asm-x86/msi.h
> +++ b/xen/include/asm-x86/msi.h
> @@ -90,12 +90,13 @@ extern unsigned int pci_msix_get_table_l
> 
>  struct msi_desc {
>  	struct msi_attrib {
> -		__u8	type	: 5; 	/* {0: unused, 5h:MSI, 11h:MSI-X} */
> -		__u8	maskbit	: 1; 	/* mask-pending bit supported ?   */
> -		__u8	masked	: 1;
> +		__u8	type;		/* {0: unused, 5h:MSI, 11h:MSI-X} */
> +		__u8	pos;		/* Location of the MSI capability */
> +		__u8	maskbit	: 1;	/* mask/pending bit supported ?   */
>  		__u8	is_64	: 1;	/* Address size: 0=32bit 1=64bit  */
> -		__u8	pos;	 	/* Location of the msi capability */
> -		__u16	entry_nr;    	/* specific enabled entry 	  */
> +		__u8	host_masked : 1;
> +		__u8	guest_masked : 1;
> +		__u16	entry_nr;	/* specific enabled entry 	  */
>  	} msi_attrib;
> 
>  	struct list_head list;
> @@ -236,6 +237,7 @@ void msi_compose_msg(unsigned vector, co
>  void __msi_set_enable(u16 seg, u8 bus, u8 slot, u8 func, int pos, int enable);
>  void mask_msi_irq(struct irq_desc *);
>  void unmask_msi_irq(struct irq_desc *);
> +void guest_mask_msi_irq(struct irq_desc *, bool_t mask);
>  void ack_nonmaskable_msi_irq(struct irq_desc *);
>  void end_nonmaskable_msi_irq(struct irq_desc *, u8 vector);
>  void set_msi_affinity(struct irq_desc *, const cpumask_t *);
> 



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 05/10] x86/MSI: track host and guest masking separately
  2016-04-01  7:40   ` Li, Liang Z
@ 2016-04-01  8:47     ` Jan Beulich
  2016-04-01  9:21       ` Li, Liang Z
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2016-04-01  8:47 UTC (permalink / raw)
  To: Liang Z Li; +Cc: Andrew Cooper, Keir Fraser, JianzhongX Chang, xen-devel

>>> On 01.04.16 at 09:40, <liang.z.li@intel.com> wrote:
> A couple of weeks ago, Jianzhong reported an issue: the SRIOV NICs (Intel 
> 82599, 82571) don't work correctly in a Windows guest.
> By debugging, we found that this patch of yours, commit 
> ad28e42bd1d28d746988ed71654e8aa670629753, caused the regression.

That was obvious.

> Could you help take a look at which part of your change led to the 
> regression?
> The SRIOV NICs work well in a Linux guest.

As you may have seen, I've already found and fixed one aspect.
I'm in the process of fixing another because, no, SRIOV NICs do not
work well in Linux guests, at least not in recent ones (first noticed in
4.4, but last time I had tried before was with 4.0 I think) with
CONFIG_XEN off.

Beyond that I guess I could use some help in at least diagnosing the
issue, e.g. by logging the actions done by Xen, so we can understand
why guest_mask_msi_irq() doesn't get called to unmask the MSI-X
vector(s). I've found the reason why this isn't happening on Linux,
and I am testing a possible fix. However, from the logs I've seen so
far for Windows (with older Xen, and with the above mentioned fix
backported) it is already clear that this change won't help there
(unless of course the issue you see is Windows version dependent).
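
(Something along these lines, dropped into msi_set_mask_bit(), would be
a starting point - just a sketch, not a proper patch, using only locals
already available in that function:

    printk(XENLOG_G_DEBUG
           "MSI%s %04x:%02x:%02x.%u mask: host %d->%d, guest %d->%d\n",
           entry->msi_attrib.type == PCI_CAP_ID_MSIX ? "-X" : "",
           seg, bus, slot, func,
           entry->msi_attrib.host_masked, host,
           entry->msi_attrib.guest_masked, guest);
)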

And just to be clear - the change previously proposed by Jianzhong
is still not going to be an option: We need to understand and address
the root cause of the issue, not paper over it by guessing when to
unmask interrupts.

Jan



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 05/10] x86/MSI: track host and guest masking separately
  2016-04-01  8:47     ` Jan Beulich
@ 2016-04-01  9:21       ` Li, Liang Z
  2016-04-01  9:33         ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Li, Liang Z @ 2016-04-01  9:21 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Keir Fraser, Chang, JianzhongX, xen-devel

> >>> On 01.04.16 at 09:40, <liang.z.li@intel.com> wrote:
> > A couple of weeks ago, Jianzhong reported an issue: the SRIOV NICs
> > (Intel 82599, 82571) don't work correctly in a Windows guest.
> > By debugging, we found that this patch of yours, commit
> > ad28e42bd1d28d746988ed71654e8aa670629753, caused the regression.
> 
> That was obvious.
> 
> > Could you help take a look at which part of your change led to the
> > regression?
> > The SRIOV NICs work well in a Linux guest.
> 
> As you may have seen, I've already found and fixed one aspect.
> I'm in the process of fixing another because, no, SRIOV NICs do not work well
> in Linux guests, at least not in recent ones (first noticed in 4.4, but last time I
> had tried before was with 4.0 I think) with CONFIG_XEN off.
> 
> Beyond that I guess I could use some help in at least diagnosing the issue, e.g.
> by logging the actions done by Xen, so we can understand why
> guest_mask_msi_irq() doesn't get called to unmask the MSI-X vector(s). I've
> found the reason why this isn't happening on Linux, and I am testing a
> possible fix. However, from the logs I've seen so far for Windows (with older

Glad to hear that you are working on this. Thanks!

> Xen, and with the above mentioned fix
> backported) it is already clear that this change won't help there (unless of
> course the issue you see is Windows version dependent).

> And just to be clear - the change previously proposed by Jianzhong is still not
> going to be an option: We need to understand and address the root cause of
> the issue, not paper over it by guessing when to unmask interrupts.
> 

Totally agree.

Liang
> Jan



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 05/10] x86/MSI: track host and guest masking separately
  2016-04-01  9:21       ` Li, Liang Z
@ 2016-04-01  9:33         ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2016-04-01  9:33 UTC (permalink / raw)
  To: Liang Z Li; +Cc: Andrew Cooper, Keir Fraser, JianzhongX Chang, xen-devel

>>> On 01.04.16 at 11:21, <liang.z.li@intel.com> wrote:
>> Beyond that I guess I could use some help in at least diagnosing the issue, e.g.
>> by logging the actions done by Xen, so we can understand why
>> guest_mask_msi_irq() doesn't get called to unmask the MSI-X vector(s). I've
>> found the reason why this isn't happening on Linux, and I am testing a
>> possible fix. However, from the logs I've seen so far for Windows (with older
> 
> Glad to hear that you are working on this. Thanks!

But you did understand that above I was asking for assistance
from your end (or a clear indication that I shouldn't hope for such)?

Jan



^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2016-04-01  9:33 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-05 11:09 [PATCH v3 00/10] x86/MSI-X: XSA-120, 128..131 follow-up Jan Beulich
2015-06-05 11:20 ` [PATCH v3 01/10] x86/MSI-X: be more careful during teardown Jan Beulich
2015-06-05 11:20 ` [PATCH v3 02/10] x86/MSI-X: access MSI-X table only after having enabled MSI-X Jan Beulich
2015-06-05 13:01   ` Andrew Cooper
2015-06-05 11:21 ` [PATCH v3 03/10] x86/MSI-X: reduce fiddling with control register during restore Jan Beulich
2015-06-05 11:21 ` [PATCH v3 04/10] x86/MSI-X: cleanup Jan Beulich
2015-06-05 11:22 ` [PATCH v3 05/10] x86/MSI: track host and guest masking separately Jan Beulich
2015-06-05 13:05   ` Andrew Cooper
2016-04-01  7:40   ` Li, Liang Z
2016-04-01  8:47     ` Jan Beulich
2016-04-01  9:21       ` Li, Liang Z
2016-04-01  9:33         ` Jan Beulich
2015-06-05 11:23 ` [PATCH v3 06/10] x86/vMSI-X: cleanup Jan Beulich
2015-06-05 13:07   ` Andrew Cooper
2015-06-05 11:24 ` [PATCH v3 07/10] x86/vMSI-X: support qword MMIO access Jan Beulich
2015-06-05 15:34   ` Andrew Cooper
2015-06-05 11:25 ` [PATCH v3 08/10] x86/MSI-X: use qword MMIO access for address writes Jan Beulich
2015-06-05 15:37   ` Andrew Cooper
2015-06-05 11:26 ` [PATCH v3 09/10] VT-d: use qword MMIO access for MSI " Jan Beulich
2015-06-05 15:39   ` Andrew Cooper
2015-06-05 15:46     ` Jan Beulich
2015-06-11  7:45   ` Tian, Kevin
2015-06-05 11:28 ` [PATCH v3 10/10] x86/MSI-X: provide hypercall interface for mask-all control Jan Beulich
2015-06-05 15:57   ` Andrew Cooper
2015-06-05 16:17     ` Jan Beulich
2015-06-11  8:35   ` Jan Beulich
2015-06-11  9:51     ` Andrew Cooper
2015-06-19 13:00       ` Jan Beulich
2015-06-19 13:05         ` Konrad Rzeszutek Wilk
2015-06-19 14:52           ` Jan Beulich
2015-06-19 14:07         ` Roger Pau Monné
2015-06-19 14:58           ` Jan Beulich
2015-06-22 17:02             ` Roger Pau Monné
2015-06-23  7:20               ` Jan Beulich
2015-06-23  7:29                 ` Roger Pau Monné
2015-06-23  8:13                   ` Jan Beulich
2015-06-22 11:25           ` Jan Beulich
2015-06-12 13:21     ` Konrad Rzeszutek Wilk
2015-06-12 13:51       ` Jan Beulich
2015-06-12 14:17         ` Konrad Rzeszutek Wilk

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).