* [PATCH 0/9] IOMMU: XSA-373 follow-on
@ 2021-06-09  9:25 Jan Beulich
  2021-06-09  9:26 ` [PATCH 1/9] AMD/IOMMU: redo awaiting of command completion Jan Beulich
                   ` (9 more replies)
  0 siblings, 10 replies; 28+ messages in thread
From: Jan Beulich @ 2021-06-09  9:25 UTC (permalink / raw)
  To: xen-devel; +Cc: Paul Durrant, Andrew Cooper, Kevin Tian

A number of further adjustments were left out of the XSA, as they are
not a security concern (anymore, in some of the cases, with the changes
put in place there). This is that collection, which may look a little
random in what it contains.

1: AMD/IOMMU: redo awaiting of command completion
2: AMD/IOMMU: re-work locking around sending of commands
3: VT-d: undo device mappings upon error
4: VT-d: adjust domid map updating when unmapping context
5: VT-d: clear_fault_bits() should clear all fault bits
6: VT-d: don't lose errors when flushing TLBs on multiple IOMMUs
7: VT-d: centralize mapping of QI entries
8: VT-d: drop/move a few QI related constants
9: IOMMU/PCI: don't let domain cleanup continue when device de-assignment failed

Jan




* [PATCH 1/9] AMD/IOMMU: redo awaiting of command completion
  2021-06-09  9:25 [PATCH 0/9] IOMMU: XSA-373 follow-on Jan Beulich
@ 2021-06-09  9:26 ` Jan Beulich
  2021-06-09 10:36   ` Andrew Cooper
  2021-06-09  9:27 ` [PATCH 2/9] AMD/IOMMU: re-work locking around sending of commands Jan Beulich
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2021-06-09  9:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Paul Durrant, Andrew Cooper

The present abuse of the completion interrupt not only stands in the
way of, down the road, using it for its actual purpose, but also
requires holding the IOMMU lock while waiting for command completion,
limiting parallelism and keeping interrupts off for non-negligible
periods of time. Have the IOMMU do an ordinary memory write instead of
signaling an otherwise disabled interrupt (by merely updating a status
register bit).
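
In practice the new wait logic boils down to the following pattern (a
condensed sketch of the hunk below, not a drop-in replacement; timeout
handling is omitted here):

    static DEFINE_PER_CPU(uint64_t, poll_slot);
    uint64_t *slot = &this_cpu(poll_slot);

    ACCESS_ONCE(*slot) = CMD_COMPLETION_INIT;     /* arm the per-CPU slot */
    /* Queue a COMPLETION_WAIT with S=1, store address = maddr of the
       slot, and store data = CMD_COMPLETION_DONE. */
    send_iommu_command(iommu, cmd);
    while ( ACCESS_ONCE(*slot) != CMD_COMPLETION_DONE )
        cpu_relax();                              /* IOMMU writes DONE here */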

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/drivers/passthrough/amd/iommu_cmd.c
+++ b/xen/drivers/passthrough/amd/iommu_cmd.c
@@ -20,6 +20,9 @@
 #include "iommu.h"
 #include "../ats.h"
 
+#define CMD_COMPLETION_INIT 0
+#define CMD_COMPLETION_DONE 1
+
 static void send_iommu_command(struct amd_iommu *iommu,
                                const uint32_t cmd[4])
 {
@@ -49,28 +52,31 @@ static void send_iommu_command(struct am
 static void flush_command_buffer(struct amd_iommu *iommu,
                                  unsigned int timeout_base)
 {
+    static DEFINE_PER_CPU(uint64_t, poll_slot);
+    uint64_t *this_poll_slot = &this_cpu(poll_slot);
+    paddr_t addr = virt_to_maddr(this_poll_slot);
     uint32_t cmd[4];
     s_time_t start, timeout;
     static unsigned int __read_mostly threshold = 1;
 
-    /* RW1C 'ComWaitInt' in status register */
-    writel(IOMMU_STATUS_COMP_WAIT_INT,
-           iommu->mmio_base + IOMMU_STATUS_MMIO_OFFSET);
-
-    /* send an empty COMPLETION_WAIT command to flush command buffer */
-    cmd[3] = cmd[2] = 0;
-    set_field_in_reg_u32(IOMMU_CMD_COMPLETION_WAIT, 0,
+    ACCESS_ONCE(*this_poll_slot) = CMD_COMPLETION_INIT;
+
+    /* send a COMPLETION_WAIT command to flush command buffer */
+    cmd[0] = addr;
+    set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, cmd[0],
+                         IOMMU_COMP_WAIT_S_FLAG_MASK,
+                         IOMMU_COMP_WAIT_S_FLAG_SHIFT, &cmd[0]);
+    cmd[1] = addr >> 32;
+    set_field_in_reg_u32(IOMMU_CMD_COMPLETION_WAIT, cmd[1],
                          IOMMU_CMD_OPCODE_MASK,
                          IOMMU_CMD_OPCODE_SHIFT, &cmd[1]);
-    set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, 0,
-                         IOMMU_COMP_WAIT_I_FLAG_MASK,
-                         IOMMU_COMP_WAIT_I_FLAG_SHIFT, &cmd[0]);
+    cmd[2] = CMD_COMPLETION_DONE;
+    cmd[3] = 0;
     send_iommu_command(iommu, cmd);
 
     start = NOW();
     timeout = start + (timeout_base ?: 100) * MILLISECS(threshold);
-    while ( !(readl(iommu->mmio_base + IOMMU_STATUS_MMIO_OFFSET) &
-              IOMMU_STATUS_COMP_WAIT_INT) )
+    while ( ACCESS_ONCE(*this_poll_slot) != CMD_COMPLETION_DONE )
     {
         if ( timeout && NOW() > timeout )
         {




* [PATCH 2/9] AMD/IOMMU: re-work locking around sending of commands
  2021-06-09  9:25 [PATCH 0/9] IOMMU: XSA-373 follow-on Jan Beulich
  2021-06-09  9:26 ` [PATCH 1/9] AMD/IOMMU: redo awaiting of command completion Jan Beulich
@ 2021-06-09  9:27 ` Jan Beulich
  2021-06-09 10:53   ` Andrew Cooper
  2021-06-09  9:27 ` [PATCH 3/9] VT-d: undo device mappings upon error Jan Beulich
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2021-06-09  9:27 UTC (permalink / raw)
  To: xen-devel; +Cc: Paul Durrant, Andrew Cooper

It appears unhelpful to me for flush_command_buffer() to block all
progress elsewhere for the given IOMMU by holding its lock while
waiting for command completion. Unless the lock is already held,
acquire it in send_iommu_command(). Release it in all cases in
flush_command_buffer(), before actually starting the wait loop.

Some of the involved functions did/do get called with the lock already
held: For amd_iommu_flush_intremap() we can simply move the locking
inside. For amd_iommu_flush_device() and amd_iommu_flush_all_caches()
the lock now gets dropped in the course of the function's operation.

Where function headers get touched anyway, also adjust the types used,
to bring them better in line with our coding style and, where
applicable, with the functions' callers.
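
From a caller's perspective the lock handling then looks like this (a
minimal sketch using the reworked signatures from the hunks below):

    unsigned long flags;

    /* Acquires iommu->lock internally, via send_iommu_command(). */
    flags = invalidate_iommu_pages(iommu, daddr, d->domain_id, order);
    /* Queues the COMPLETION_WAIT, drops iommu->lock, then polls. */
    flush_command_buffer(iommu, 0, flags);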

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
---
v2: Add comments. Adjust parameter types when function headers get
    touched anyway.

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -253,9 +253,10 @@ void amd_iommu_flush_pages(struct domain
                            unsigned int order);
 void amd_iommu_flush_iotlb(u8 devfn, const struct pci_dev *pdev,
                            uint64_t gaddr, unsigned int order);
-void amd_iommu_flush_device(struct amd_iommu *iommu, uint16_t bdf);
+void amd_iommu_flush_device(struct amd_iommu *iommu, uint16_t bdf,
+                            unsigned long flags);
 void amd_iommu_flush_intremap(struct amd_iommu *iommu, uint16_t bdf);
-void amd_iommu_flush_all_caches(struct amd_iommu *iommu);
+void amd_iommu_flush_all_caches(struct amd_iommu *iommu, unsigned long flags);
 
 /* find iommu for bdf */
 struct amd_iommu *find_iommu_for_device(int seg, int bdf);
--- a/xen/drivers/passthrough/amd/iommu_cmd.c
+++ b/xen/drivers/passthrough/amd/iommu_cmd.c
@@ -23,11 +23,20 @@
 #define CMD_COMPLETION_INIT 0
 #define CMD_COMPLETION_DONE 1
 
+/*
+ * When @flags is non-NULL, the function will acquire the IOMMU lock,
+ * transferring lock ownership to the caller.  When @flags is NULL,
+ * the lock is assumed to be already held.
+ */
 static void send_iommu_command(struct amd_iommu *iommu,
-                               const uint32_t cmd[4])
+                               const uint32_t cmd[4],
+                               unsigned long *flags)
 {
     uint32_t tail;
 
+    if ( flags )
+        spin_lock_irqsave(&iommu->lock, *flags);
+
     tail = iommu->cmd_buffer.tail + sizeof(cmd_entry_t);
     if ( tail == iommu->cmd_buffer.size )
         tail = 0;
@@ -49,8 +58,13 @@ static void send_iommu_command(struct am
     writel(tail, iommu->mmio_base + IOMMU_CMD_BUFFER_TAIL_OFFSET);
 }
 
+/*
+ * Callers need to hold the IOMMU lock, which will be released here before
+ * entering the loop to await command completion.
+ */
 static void flush_command_buffer(struct amd_iommu *iommu,
-                                 unsigned int timeout_base)
+                                 unsigned int timeout_base,
+                                 unsigned long flags)
 {
     static DEFINE_PER_CPU(uint64_t, poll_slot);
     uint64_t *this_poll_slot = &this_cpu(poll_slot);
@@ -72,7 +86,9 @@ static void flush_command_buffer(struct
                          IOMMU_CMD_OPCODE_SHIFT, &cmd[1]);
     cmd[2] = CMD_COMPLETION_DONE;
     cmd[3] = 0;
-    send_iommu_command(iommu, cmd);
+    send_iommu_command(iommu, cmd, NULL);
+
+    spin_unlock_irqrestore(&iommu->lock, flags);
 
     start = NOW();
     timeout = start + (timeout_base ?: 100) * MILLISECS(threshold);
@@ -99,12 +115,19 @@ static void flush_command_buffer(struct
 }
 
 /* Build low level iommu command messages */
-static void invalidate_iommu_pages(struct amd_iommu *iommu,
-                                   u64 io_addr, u16 domain_id, u16 order)
+
+/*
+ * The function will acquire the IOMMU lock, via its call to
+ * send_iommu_command(), and then transfer lock ownership to the caller.
+ */
+static unsigned long invalidate_iommu_pages(struct amd_iommu *iommu,
+                                            daddr_t io_addr, domid_t domain_id,
+                                            unsigned int order)
 {
     u64 addr_lo, addr_hi;
     u32 cmd[4], entry;
     int sflag = 0, pde = 0;
+    unsigned long flags;
 
     ASSERT ( order == 0 || order == 9 || order == 18 );
 
@@ -152,16 +175,27 @@ static void invalidate_iommu_pages(struc
     cmd[3] = entry;
 
     cmd[0] = 0;
-    send_iommu_command(iommu, cmd);
+    send_iommu_command(iommu, cmd, &flags);
+
+    return flags;
 }
 
-static void invalidate_iotlb_pages(struct amd_iommu *iommu,
-                                   u16 maxpend, u32 pasid, u16 queueid,
-                                   u64 io_addr, u16 dev_id, u16 order)
+/*
+ * The function will acquire the IOMMU lock, via its call to
+ * send_iommu_command(), and then transfer lock ownership to the caller.
+ */
+static unsigned long invalidate_iotlb_pages(struct amd_iommu *iommu,
+                                            unsigned int maxpend,
+                                            unsigned int pasid,
+                                            unsigned int queueid,
+                                            daddr_t io_addr,
+                                            unsigned int dev_id,
+                                            unsigned int order)
 {
     u64 addr_lo, addr_hi;
     u32 cmd[4], entry;
     int sflag = 0;
+    unsigned long flags;
 
     ASSERT ( order == 0 || order == 9 || order == 18 );
 
@@ -222,9 +256,12 @@ static void invalidate_iotlb_pages(struc
                          IOMMU_INV_IOTLB_PAGES_ADDR_HIGH_SHIFT, &entry);
     cmd[3] = entry;
 
-    send_iommu_command(iommu, cmd);
+    send_iommu_command(iommu, cmd, &flags);
+
+    return flags;
 }
 
+/* Callers need to hold the IOMMU lock. */
 static void invalidate_dev_table_entry(struct amd_iommu *iommu,
                                        u16 device_id)
 {
@@ -241,12 +278,18 @@ static void invalidate_dev_table_entry(s
                          &entry);
     cmd[1] = entry;
 
-    send_iommu_command(iommu, cmd);
+    send_iommu_command(iommu, cmd, NULL);
 }
 
-static void invalidate_interrupt_table(struct amd_iommu *iommu, u16 device_id)
+/*
+ * The function will acquire the IOMMU lock, via its call to
+ * send_iommu_command(), and then transfer lock ownership to the caller.
+ */
+static unsigned long invalidate_interrupt_table(struct amd_iommu *iommu,
+                                                uint16_t device_id)
 {
     u32 cmd[4], entry;
+    unsigned long flags;
 
     cmd[3] = cmd[2] = 0;
     set_field_in_reg_u32(device_id, 0,
@@ -257,9 +300,12 @@ static void invalidate_interrupt_table(s
                          IOMMU_CMD_OPCODE_MASK, IOMMU_CMD_OPCODE_SHIFT,
                          &entry);
     cmd[1] = entry;
-    send_iommu_command(iommu, cmd);
+    send_iommu_command(iommu, cmd, &flags);
+
+    return flags;
 }
 
+/* Callers need to hold the IOMMU lock. */
 static void invalidate_iommu_all(struct amd_iommu *iommu)
 {
     u32 cmd[4], entry;
@@ -271,7 +317,7 @@ static void invalidate_iommu_all(struct
                          &entry);
     cmd[1] = entry;
 
-    send_iommu_command(iommu, cmd);
+    send_iommu_command(iommu, cmd, NULL);
 }
 
 void amd_iommu_flush_iotlb(u8 devfn, const struct pci_dev *pdev,
@@ -304,10 +350,9 @@ void amd_iommu_flush_iotlb(u8 devfn, con
     maxpend = pdev->ats.queue_depth & 0xff;
 
     /* send INVALIDATE_IOTLB_PAGES command */
-    spin_lock_irqsave(&iommu->lock, flags);
-    invalidate_iotlb_pages(iommu, maxpend, 0, queueid, daddr, req_id, order);
-    flush_command_buffer(iommu, iommu_dev_iotlb_timeout);
-    spin_unlock_irqrestore(&iommu->lock, flags);
+    flags = invalidate_iotlb_pages(iommu, maxpend, 0, queueid, daddr,
+                                   req_id, order);
+    flush_command_buffer(iommu, iommu_dev_iotlb_timeout, flags);
 }
 
 static void amd_iommu_flush_all_iotlbs(struct domain *d, daddr_t daddr,
@@ -336,15 +381,12 @@ static void _amd_iommu_flush_pages(struc
 {
     unsigned long flags;
     struct amd_iommu *iommu;
-    unsigned int dom_id = d->domain_id;
 
     /* send INVALIDATE_IOMMU_PAGES command */
     for_each_amd_iommu ( iommu )
     {
-        spin_lock_irqsave(&iommu->lock, flags);
-        invalidate_iommu_pages(iommu, daddr, dom_id, order);
-        flush_command_buffer(iommu, 0);
-        spin_unlock_irqrestore(&iommu->lock, flags);
+        flags = invalidate_iommu_pages(iommu, daddr, d->domain_id, order);
+        flush_command_buffer(iommu, 0, flags);
     }
 
     if ( ats_enabled )
@@ -362,39 +404,44 @@ void amd_iommu_flush_pages(struct domain
     _amd_iommu_flush_pages(d, __dfn_to_daddr(dfn), order);
 }
 
-void amd_iommu_flush_device(struct amd_iommu *iommu, uint16_t bdf)
+/*
+ * Callers need to hold the IOMMU lock, which will be released here by
+ * calling flush_command_buffer().
+ */
+void amd_iommu_flush_device(struct amd_iommu *iommu, uint16_t bdf,
+                            unsigned long flags)
 {
     ASSERT( spin_is_locked(&iommu->lock) );
 
     invalidate_dev_table_entry(iommu, bdf);
-    flush_command_buffer(iommu, 0);
+    flush_command_buffer(iommu, 0, flags);
 }
 
 void amd_iommu_flush_intremap(struct amd_iommu *iommu, uint16_t bdf)
 {
-    ASSERT( spin_is_locked(&iommu->lock) );
+    unsigned long flags;
 
-    invalidate_interrupt_table(iommu, bdf);
-    flush_command_buffer(iommu, 0);
+    flags = invalidate_interrupt_table(iommu, bdf);
+    flush_command_buffer(iommu, 0, flags);
 }
 
-void amd_iommu_flush_all_caches(struct amd_iommu *iommu)
+/*
+ * Callers need to hold the IOMMU lock, which will be released here by
+ * calling flush_command_buffer().
+ */
+void amd_iommu_flush_all_caches(struct amd_iommu *iommu, unsigned long flags)
 {
     ASSERT( spin_is_locked(&iommu->lock) );
 
     invalidate_iommu_all(iommu);
-    flush_command_buffer(iommu, 0);
+    flush_command_buffer(iommu, 0, flags);
 }
 
 void amd_iommu_send_guest_cmd(struct amd_iommu *iommu, u32 cmd[])
 {
     unsigned long flags;
 
-    spin_lock_irqsave(&iommu->lock, flags);
-
-    send_iommu_command(iommu, cmd);
+    send_iommu_command(iommu, cmd, &flags);
     /* TBD: Timeout selection may require peeking into cmd[]. */
-    flush_command_buffer(iommu, 0);
-
-    spin_unlock_irqrestore(&iommu->lock, flags);
+    flush_command_buffer(iommu, 0, flags);
 }
--- a/xen/drivers/passthrough/amd/iommu_guest.c
+++ b/xen/drivers/passthrough/amd/iommu_guest.c
@@ -449,8 +449,7 @@ static int do_invalidate_dte(struct doma
     spin_lock_irqsave(&iommu->lock, flags);
     dte_set_gcr3_table(mdte, hdom_id, gcr3_mfn << PAGE_SHIFT, gv, glx);
 
-    amd_iommu_flush_device(iommu, req_id);
-    spin_unlock_irqrestore(&iommu->lock, flags);
+    amd_iommu_flush_device(iommu, req_id, flags);
 
     return 0;
 }
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -921,13 +921,13 @@ static void enable_iommu(struct amd_iomm
 
     set_iommu_translation_control(iommu, IOMMU_CONTROL_ENABLED);
 
-    if ( iommu->features.flds.ia_sup )
-        amd_iommu_flush_all_caches(iommu);
-
     iommu->enabled = 1;
 
+    if ( iommu->features.flds.ia_sup )
+        amd_iommu_flush_all_caches(iommu, flags);
+    else
  out:
-    spin_unlock_irqrestore(&iommu->lock, flags);
+        spin_unlock_irqrestore(&iommu->lock, flags);
 }
 
 static void disable_iommu(struct amd_iommu *iommu)
@@ -1554,9 +1554,8 @@ static int _invalidate_all_devices(
         if ( iommu )
         {
             spin_lock_irqsave(&iommu->lock, flags);
-            amd_iommu_flush_device(iommu, req_id);
+            amd_iommu_flush_device(iommu, req_id, flags);
             amd_iommu_flush_intremap(iommu, req_id);
-            spin_unlock_irqrestore(&iommu->lock, flags);
         }
     }
 
--- a/xen/drivers/passthrough/amd/iommu_intr.c
+++ b/xen/drivers/passthrough/amd/iommu_intr.c
@@ -310,9 +310,7 @@ static int update_intremap_entry_from_io
         entry.ptr32->flds.remap_en = false;
         spin_unlock(lock);
 
-        spin_lock(&iommu->lock);
         amd_iommu_flush_intremap(iommu, req_id);
-        spin_unlock(&iommu->lock);
 
         spin_lock(lock);
     }
@@ -527,11 +525,9 @@ static int update_intremap_entry_from_ms
 
         if ( iommu->enabled )
         {
-            spin_lock_irqsave(&iommu->lock, flags);
             amd_iommu_flush_intremap(iommu, req_id);
             if ( alias_id != req_id )
                 amd_iommu_flush_intremap(iommu, alias_id);
-            spin_unlock_irqrestore(&iommu->lock, flags);
         }
 
         return 0;
@@ -567,11 +563,9 @@ static int update_intremap_entry_from_ms
         entry.ptr32->flds.remap_en = false;
         spin_unlock(lock);
 
-        spin_lock(&iommu->lock);
         amd_iommu_flush_intremap(iommu, req_id);
         if ( alias_id != req_id )
             amd_iommu_flush_intremap(iommu, alias_id);
-        spin_unlock(&iommu->lock);
 
         spin_lock(lock);
     }
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -129,7 +129,7 @@ static void amd_iommu_setup_domain_devic
              iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) )
             dte->i = ats_enabled;
 
-        amd_iommu_flush_device(iommu, req_id);
+        amd_iommu_flush_device(iommu, req_id, flags);
 
         AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
                         "root table = %#"PRIx64", "
@@ -138,8 +138,8 @@ static void amd_iommu_setup_domain_devic
                         page_to_maddr(hd->arch.amd.root_table),
                         domain->domain_id, hd->arch.amd.paging_mode);
     }
-
-    spin_unlock_irqrestore(&iommu->lock, flags);
+    else
+        spin_unlock_irqrestore(&iommu->lock, flags);
 
     ASSERT(pcidevs_locked());
 
@@ -307,14 +307,15 @@ static void amd_iommu_disable_domain_dev
         smp_wmb();
         dte->v = true;
 
-        amd_iommu_flush_device(iommu, req_id);
+        amd_iommu_flush_device(iommu, req_id, flags);
 
         AMD_IOMMU_DEBUG("Disable: device id = %#x, "
                         "domain = %d, paging mode = %d\n",
                         req_id,  domain->domain_id,
                         dom_iommu(domain)->arch.amd.paging_mode);
     }
-    spin_unlock_irqrestore(&iommu->lock, flags);
+    else
+        spin_unlock_irqrestore(&iommu->lock, flags);
 
     ASSERT(pcidevs_locked());
 
@@ -455,9 +456,7 @@ static int amd_iommu_add_device(u8 devfn
             iommu->dev_table.buffer + (bdf * IOMMU_DEV_TABLE_ENTRY_SIZE),
             ivrs_mappings[bdf].intremap_table, iommu, iommu_intremap);
 
-        amd_iommu_flush_device(iommu, bdf);
-
-        spin_unlock_irqrestore(&iommu->lock, flags);
+        amd_iommu_flush_device(iommu, bdf, flags);
     }
 
     amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);




* [PATCH 3/9] VT-d: undo device mappings upon error
  2021-06-09  9:25 [PATCH 0/9] IOMMU: XSA-373 follow-on Jan Beulich
  2021-06-09  9:26 ` [PATCH 1/9] AMD/IOMMU: redo awaiting of command completion Jan Beulich
  2021-06-09  9:27 ` [PATCH 2/9] AMD/IOMMU: re-work locking around sending of commands Jan Beulich
@ 2021-06-09  9:27 ` Jan Beulich
  2021-06-24  5:13   ` Tian, Kevin
  2021-06-09  9:28 ` [PATCH 4/9] VT-d: adjust domid map updating when unmapping context Jan Beulich
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2021-06-09  9:27 UTC (permalink / raw)
  To: xen-devel; +Cc: Paul Durrant, Andrew Cooper, Kevin Tian

When
 - flushes (supposedly not possible anymore after XSA-373),
 - secondary mappings for legacy PCI devices behind bridges,
 - secondary mappings for chipset quirks, or
 - find_upstream_bridge() invocations
fail, the successfully established device mappings should not be left
around.

Further, when (parts of) unmapping fail, simply returning an error is
typically not enough. Crash the domain instead in such cases, arranging
for domain cleanup to continue in a best effort manner despite such
failures.

Finally make domain_context_unmap()'s error behavior consistent in the
legacy PCI device case: Don't bail from the function in one special
case, but always just exit the switch statement.
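
Condensed, the error handling added below amounts to the following
(sketch only; see the hunks for the precise conditions):

    if ( rc )                               /* mapping path */
        domain_context_unmap_one(domain, iommu, bus, devfn);

    if ( rc && !is_hardware_domain(domain) && domain != dom_io )
    {                                       /* unmapping path */
        if ( domain->is_dying )
            rc = 0;                         /* log it, let cleanup continue */
        else
            domain_crash(domain);
    }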

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1442,9 +1442,15 @@ int domain_context_mapping_one(
     if ( !seg && !rc )
         rc = me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
 
+    if ( rc )
+        domain_context_unmap_one(domain, iommu, bus, devfn);
+
     return rc;
 }
 
+static int domain_context_unmap(struct domain *d, uint8_t devfn,
+                                struct pci_dev *pdev);
+
 static int domain_context_mapping(struct domain *domain, u8 devfn,
                                   struct pci_dev *pdev)
 {
@@ -1505,16 +1511,21 @@ static int domain_context_mapping(struct
         if ( ret )
             break;
 
-        if ( find_upstream_bridge(seg, &bus, &devfn, &secbus) < 1 )
-            break;
+        if ( (ret = find_upstream_bridge(seg, &bus, &devfn, &secbus)) < 1 )
+        {
+            if ( !ret )
+                break;
+            ret = -ENXIO;
+        }
 
         /*
          * Mapping a bridge should, if anything, pass the struct pci_dev of
          * that bridge. Since bridges don't normally get assigned to guests,
          * their owner would be the wrong one. Pass NULL instead.
          */
-        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         NULL);
+        if ( ret >= 0 )
+            ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
+                                             NULL);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1531,6 +1542,9 @@ static int domain_context_mapping(struct
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
                                              NULL);
 
+        if ( ret )
+            domain_context_unmap(domain, devfn, pdev);
+
         break;
 
     default:
@@ -1609,6 +1623,19 @@ int domain_context_unmap_one(
     if ( !iommu->drhd->segment && !rc )
         rc = me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC);
 
+    if ( rc && !is_hardware_domain(domain) && domain != dom_io )
+    {
+        if ( domain->is_dying )
+        {
+            printk(XENLOG_ERR "%pd: error %d unmapping %04x:%02x:%02x.%u\n",
+                   domain, rc, iommu->drhd->segment, bus,
+                   PCI_SLOT(devfn), PCI_FUNC(devfn));
+            rc = 0; /* Make upper layers continue in a best effort manner. */
+        }
+        else
+            domain_crash(domain);
+    }
+
     return rc;
 }
 
@@ -1661,17 +1688,29 @@ static int domain_context_unmap(struct d
 
         tmp_bus = bus;
         tmp_devfn = devfn;
-        if ( find_upstream_bridge(seg, &tmp_bus, &tmp_devfn, &secbus) < 1 )
+        if ( (ret = find_upstream_bridge(seg, &tmp_bus, &tmp_devfn,
+                                         &secbus)) < 1 )
+        {
+            if ( ret )
+            {
+                ret = -ENXIO;
+                if ( !domain->is_dying &&
+                     !is_hardware_domain(domain) && domain != dom_io )
+                {
+                    domain_crash(domain);
+                    /* Make upper layers continue in a best effort manner. */
+                    ret = 0;
+                }
+            }
             break;
+        }
 
         /* PCIe to PCI/PCIx bridge */
         if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
         {
             ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
-            if ( ret )
-                return ret;
-
-            ret = domain_context_unmap_one(domain, iommu, secbus, 0);
+            if ( !ret )
+                ret = domain_context_unmap_one(domain, iommu, secbus, 0);
         }
         else /* Legacy PCI bridge */
             ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);




* [PATCH 4/9] VT-d: adjust domid map updating when unmapping context
  2021-06-09  9:25 [PATCH 0/9] IOMMU: XSA-373 follow-on Jan Beulich
                   ` (2 preceding siblings ...)
  2021-06-09  9:27 ` [PATCH 3/9] VT-d: undo device mappings upon error Jan Beulich
@ 2021-06-09  9:28 ` Jan Beulich
  2021-06-24  5:21   ` Tian, Kevin
  2021-06-09  9:28 ` [PATCH 5/9] VT-d: clear_fault_bits() should clear all fault bits Jan Beulich
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2021-06-09  9:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Paul Durrant, Andrew Cooper, Kevin Tian

When an earlier error occurred, cleaning up the domid mapping data is
wrong, as references likely still exist. The only exception to this is
when the actual unmapping worked, but some flush failed (supposedly
impossible after XSA-373). The guest will get crashed in such a case
though, so add fallback cleanup to domain destruction to cover this
case. This in turn makes it desirable to silence the dprintk() in
domain_iommu_domid().

Note that no error will be returned anymore when the lookup fails - in
the common case a lookup failure would already have caused
domain_context_unmap_one() to fail. Even from a more general
perspective it doesn't look right to fail domain_context_unmap() in
such a case when this was the last device, but not when an earlier
unmap was otherwise successful.
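
The fallback boils down to factoring out the clearing of the domid map
and invoking it for every DRHD at teardown time (condensed from the
hunks below):

    static void cleanup_domid_map(struct domain *d, struct vtd_iommu *iommu)
    {
        int iommu_domid = domain_iommu_domid(d, iommu);

        if ( iommu_domid >= 0 )
        {
            clear_bit(iommu_domid, iommu->domid_bitmap);
            iommu->domid_map[iommu_domid] = 0;
        }
    }

    /* ... and in iommu_domain_teardown(): */
    for_each_drhd_unit ( drhd )
        cleanup_domid_map(d, drhd->iommu);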

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -80,9 +80,11 @@ static int domain_iommu_domid(struct dom
         i = find_next_bit(iommu->domid_bitmap, nr_dom, i+1);
     }
 
-    dprintk(XENLOG_ERR VTDPREFIX,
-            "Cannot get valid iommu domid: domid=%d iommu->index=%d\n",
-            d->domain_id, iommu->index);
+    if ( !d->is_dying )
+        dprintk(XENLOG_ERR VTDPREFIX,
+                "Cannot get valid iommu %u domid: %pd\n",
+                iommu->index, d);
+
     return -1;
 }
 
@@ -147,6 +149,17 @@ static int context_get_domain_id(struct
     return domid;
 }
 
+static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
+{
+    int iommu_domid = domain_iommu_domid(domain, iommu);
+
+    if ( iommu_domid >= 0 )
+    {
+        clear_bit(iommu_domid, iommu->domid_bitmap);
+        iommu->domid_map[iommu_domid] = 0;
+    }
+}
+
 static void sync_cache(const void *addr, unsigned int size)
 {
     static unsigned long clflush_size = 0;
@@ -1724,6 +1737,9 @@ static int domain_context_unmap(struct d
         goto out;
     }
 
+    if ( ret )
+        goto out;
+
     /*
      * if no other devices under the same iommu owned by this domain,
      * clear iommu in iommu_bitmap and clear domain_id in domid_bitmp
@@ -1743,19 +1759,8 @@ static int domain_context_unmap(struct d
 
     if ( found == 0 )
     {
-        int iommu_domid;
-
         clear_bit(iommu->index, &dom_iommu(domain)->arch.vtd.iommu_bitmap);
-
-        iommu_domid = domain_iommu_domid(domain, iommu);
-        if ( iommu_domid == -1 )
-        {
-            ret = -EINVAL;
-            goto out;
-        }
-
-        clear_bit(iommu_domid, iommu->domid_bitmap);
-        iommu->domid_map[iommu_domid] = 0;
+        cleanup_domid_map(domain, iommu);
     }
 
 out:
@@ -1775,6 +1780,7 @@ static void iommu_domain_teardown(struct
 {
     struct domain_iommu *hd = dom_iommu(d);
     struct mapped_rmrr *mrmrr, *tmp;
+    const struct acpi_drhd_unit *drhd;
 
     if ( list_empty(&acpi_drhd_units) )
         return;
@@ -1786,6 +1792,9 @@ static void iommu_domain_teardown(struct
     }
 
     ASSERT(!hd->arch.vtd.pgd_maddr);
+
+    for_each_drhd_unit ( drhd )
+        cleanup_domid_map(d, drhd->iommu);
 }
 
 static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,




* [PATCH 5/9] VT-d: clear_fault_bits() should clear all fault bits
  2021-06-09  9:25 [PATCH 0/9] IOMMU: XSA-373 follow-on Jan Beulich
                   ` (3 preceding siblings ...)
  2021-06-09  9:28 ` [PATCH 4/9] VT-d: adjust domid map updating when unmapping context Jan Beulich
@ 2021-06-09  9:28 ` Jan Beulich
  2021-06-24  5:26   ` Tian, Kevin
  2021-06-09  9:29 ` [PATCH 6/9] VT-d: don't lose errors when flushing TLBs on multiple IOMMUs Jan Beulich
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2021-06-09  9:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Paul Durrant, Andrew Cooper, Kevin Tian

If there is any way for one fault to be left set in the recording
registers, there's no reason there couldn't also be multiple ones. If
PPF is set (it being the OR of all F fields), simply loop over the
entire range of fault recording registers, clearing F everywhere.

Since PPF is a r/o bit, also remove it from DMA_FSTS_FAULTS (arguably
the constant's name is ambiguous as well).
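
To illustrate the register range being walked by the loop below
(numbers are purely illustrative):

    /*
     * Assume cap_fault_reg_offset() == 0x220 and cap_num_fault_regs() == 8,
     * with each fault recording register being PRIMARY_FAULT_REG_LEN (16)
     * bytes wide and its F bit living in the dword at offset +12:
     *   FRCD[0]: 0x220..0x22f, F at 0x22c
     *   FRCD[1]: 0x230..0x23f, F at 0x23c
     *   ...
     *   FRCD[7]: 0x290..0x29f, F at 0x29c
     * The loop writes DMA_FRCD_F (RW1C) to each of these dwords.
     */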

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2094,13 +2094,23 @@ static int __hwdom_init setup_hwdom_devi
 
 void clear_fault_bits(struct vtd_iommu *iommu)
 {
-    u64 val;
     unsigned long flags;
 
     spin_lock_irqsave(&iommu->register_lock, flags);
-    val = dmar_readq(iommu->reg, cap_fault_reg_offset(iommu->cap) + 8);
-    dmar_writeq(iommu->reg, cap_fault_reg_offset(iommu->cap) + 8, val);
+
+    if ( dmar_readl(iommu->reg, DMAR_FSTS_REG) & DMA_FSTS_PPF )
+    {
+        unsigned int reg = cap_fault_reg_offset(iommu->cap);
+        unsigned int end = reg + cap_num_fault_regs(iommu->cap) * PRIMARY_FAULT_REG_LEN;
+
+        do {
+           dmar_writel(iommu->reg, reg + 12, DMA_FRCD_F);
+           reg += PRIMARY_FAULT_REG_LEN;
+        } while ( reg < end );
+    }
+
     dmar_writel(iommu->reg, DMAR_FSTS_REG, DMA_FSTS_FAULTS);
+
     spin_unlock_irqrestore(&iommu->register_lock, flags);
 }
 
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -174,9 +174,8 @@
 #define DMA_FSTS_IQE (1u << 4)
 #define DMA_FSTS_ICE (1u << 5)
 #define DMA_FSTS_ITE (1u << 6)
-#define DMA_FSTS_FAULTS (DMA_FSTS_PFO | DMA_FSTS_PPF | DMA_FSTS_AFO | \
-                         DMA_FSTS_APF | DMA_FSTS_IQE | DMA_FSTS_ICE | \
-                         DMA_FSTS_ITE)
+#define DMA_FSTS_FAULTS (DMA_FSTS_PFO | DMA_FSTS_AFO | DMA_FSTS_APF | \
+                         DMA_FSTS_IQE | DMA_FSTS_ICE | DMA_FSTS_ITE)
 #define dma_fsts_fault_record_index(s) (((s) >> 8) & 0xff)
 
 /* FRCD_REG, 32 bits access */




* [PATCH 6/9] VT-d: don't lose errors when flushing TLBs on multiple IOMMUs
  2021-06-09  9:25 [PATCH 0/9] IOMMU: XSA-373 follow-on Jan Beulich
                   ` (4 preceding siblings ...)
  2021-06-09  9:28 ` [PATCH 5/9] VT-d: clear_fault_bits() should clear all fault bits Jan Beulich
@ 2021-06-09  9:29 ` Jan Beulich
  2021-06-24  5:28   ` Tian, Kevin
  2021-06-09  9:29 ` [PATCH 7/9] VT-d: centralize mapping of QI entries Jan Beulich
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2021-06-09  9:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Paul Durrant, Andrew Cooper, Kevin Tian

While no longer an immediate problem now that flushes no longer time
out, errors (if any) get properly reported by
iommu_flush_iotlb_{dsi,psi}(). Overwriting such an error with, perhaps,
a success indicator received from another IOMMU will misguide callers.
Record the first error, but don't bail from the loop (so that further
necessary invalidation still gets carried out on a best effort basis).
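
The resulting shape of the loop (sketch; flush_one() stands in for the
iommu_flush_iotlb_{dsi,psi}() calls in the hunk below):

    int ret = 0;

    for_each_drhd_unit ( drhd )
    {
        int rc = flush_one(drhd->iommu);

        if ( rc > 0 )
            iommu_flush_write_buffer(drhd->iommu);
        else if ( !ret )
            ret = rc;            /* remember the first error only */
    }

    return ret;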

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -643,7 +643,7 @@ static int __must_check iommu_flush_iotl
     struct vtd_iommu *iommu;
     bool_t flush_dev_iotlb;
     int iommu_domid;
-    int rc = 0;
+    int ret = 0;
 
     /*
      * No need pcideves_lock here because we have flush
@@ -651,6 +651,8 @@ static int __must_check iommu_flush_iotl
      */
     for_each_drhd_unit ( drhd )
     {
+        int rc;
+
         iommu = drhd->iommu;
 
         if ( !test_bit(iommu->index, &hd->arch.vtd.iommu_bitmap) )
@@ -673,13 +675,12 @@ static int __must_check iommu_flush_iotl
                                        flush_dev_iotlb);
 
         if ( rc > 0 )
-        {
             iommu_flush_write_buffer(iommu);
-            rc = 0;
-        }
+        else if ( !ret )
+            ret = rc;
     }
 
-    return rc;
+    return ret;
 }
 
 static int __must_check iommu_flush_iotlb_pages(struct domain *d,




* [PATCH 7/9] VT-d: centralize mapping of QI entries
  2021-06-09  9:25 [PATCH 0/9] IOMMU: XSA-373 follow-on Jan Beulich
                   ` (5 preceding siblings ...)
  2021-06-09  9:29 ` [PATCH 6/9] VT-d: don't lose errors when flushing TLBs on multiple IOMMUs Jan Beulich
@ 2021-06-09  9:29 ` Jan Beulich
  2021-06-24  5:31   ` Tian, Kevin
  2021-06-09  9:29 ` [PATCH 8/9] VT-d: drop/move a few QI related constants Jan Beulich
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2021-06-09  9:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Paul Durrant, Andrew Cooper, Kevin Tian

Introduce a helper function to reduce redundancy. Take the opportunity
to express the logic without using the somewhat odd QINVAL_ENTRY_ORDER.
Also take the opportunity to uniformly unmap only after updating the
queue tail and dropping the lock (as was done so far only by
queue_invalidate_context_sync()).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I wonder though whether we wouldn't be better off permanently mapping
the queue(s).
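
For reference, the arithmetic done by the new helper (worked example,
with 16-byte entries and 4k pages, i.e. 256 entries per page):

    /*
     * index = 300
     * byte offset  = 300 * sizeof(struct qinval_entry) = 4800
     * page base    = qinval_maddr + (4800 & PAGE_MASK)  = qinval_maddr + 4096
     * slot in page = 300 % 256 = 44
     * => qi_map_entry() maps the second queue page and returns &entries[44].
     */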

--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -69,6 +69,16 @@ static void qinval_update_qtail(struct v
     dmar_writel(iommu->reg, DMAR_IQT_REG, val << QINVAL_INDEX_SHIFT);
 }
 
+static struct qinval_entry *qi_map_entry(const struct vtd_iommu *iommu,
+                                         unsigned int index)
+{
+    paddr_t base = iommu->qinval_maddr +
+                   ((index * sizeof(struct qinval_entry)) & PAGE_MASK);
+    struct qinval_entry *entries = map_vtd_domain_page(base);
+
+    return &entries[index % (PAGE_SIZE / sizeof(*entries))];
+}
+
 static int __must_check queue_invalidate_context_sync(struct vtd_iommu *iommu,
                                                       u16 did, u16 source_id,
                                                       u8 function_mask,
@@ -76,15 +86,11 @@ static int __must_check queue_invalidate
 {
     unsigned long flags;
     unsigned int index;
-    u64 entry_base;
-    struct qinval_entry *qinval_entry, *qinval_entries;
+    struct qinval_entry *qinval_entry;
 
     spin_lock_irqsave(&iommu->register_lock, flags);
     index = qinval_next_index(iommu);
-    entry_base = iommu->qinval_maddr +
-                 ((index >> QINVAL_ENTRY_ORDER) << PAGE_SHIFT);
-    qinval_entries = map_vtd_domain_page(entry_base);
-    qinval_entry = &qinval_entries[index % (1 << QINVAL_ENTRY_ORDER)];
+    qinval_entry = qi_map_entry(iommu, index);
 
     qinval_entry->q.cc_inv_dsc.lo.type = TYPE_INVAL_CONTEXT;
     qinval_entry->q.cc_inv_dsc.lo.granu = granu;
@@ -98,7 +104,7 @@ static int __must_check queue_invalidate
     qinval_update_qtail(iommu, index);
     spin_unlock_irqrestore(&iommu->register_lock, flags);
 
-    unmap_vtd_domain_page(qinval_entries);
+    unmap_vtd_domain_page(qinval_entry);
 
     return invalidate_sync(iommu);
 }
@@ -110,15 +116,11 @@ static int __must_check queue_invalidate
 {
     unsigned long flags;
     unsigned int index;
-    u64 entry_base;
-    struct qinval_entry *qinval_entry, *qinval_entries;
+    struct qinval_entry *qinval_entry;
 
     spin_lock_irqsave(&iommu->register_lock, flags);
     index = qinval_next_index(iommu);
-    entry_base = iommu->qinval_maddr +
-                 ((index >> QINVAL_ENTRY_ORDER) << PAGE_SHIFT);
-    qinval_entries = map_vtd_domain_page(entry_base);
-    qinval_entry = &qinval_entries[index % (1 << QINVAL_ENTRY_ORDER)];
+    qinval_entry = qi_map_entry(iommu, index);
 
     qinval_entry->q.iotlb_inv_dsc.lo.type = TYPE_INVAL_IOTLB;
     qinval_entry->q.iotlb_inv_dsc.lo.granu = granu;
@@ -133,10 +135,11 @@ static int __must_check queue_invalidate
     qinval_entry->q.iotlb_inv_dsc.hi.res_1 = 0;
     qinval_entry->q.iotlb_inv_dsc.hi.addr = addr >> PAGE_SHIFT_4K;
 
-    unmap_vtd_domain_page(qinval_entries);
     qinval_update_qtail(iommu, index);
     spin_unlock_irqrestore(&iommu->register_lock, flags);
 
+    unmap_vtd_domain_page(qinval_entry);
+
     return invalidate_sync(iommu);
 }
 
@@ -147,17 +150,13 @@ static int __must_check queue_invalidate
     static DEFINE_PER_CPU(uint32_t, poll_slot);
     unsigned int index;
     unsigned long flags;
-    u64 entry_base;
-    struct qinval_entry *qinval_entry, *qinval_entries;
+    struct qinval_entry *qinval_entry;
     uint32_t *this_poll_slot = &this_cpu(poll_slot);
 
     spin_lock_irqsave(&iommu->register_lock, flags);
     ACCESS_ONCE(*this_poll_slot) = QINVAL_STAT_INIT;
     index = qinval_next_index(iommu);
-    entry_base = iommu->qinval_maddr +
-                 ((index >> QINVAL_ENTRY_ORDER) << PAGE_SHIFT);
-    qinval_entries = map_vtd_domain_page(entry_base);
-    qinval_entry = &qinval_entries[index % (1 << QINVAL_ENTRY_ORDER)];
+    qinval_entry = qi_map_entry(iommu, index);
 
     qinval_entry->q.inv_wait_dsc.lo.type = TYPE_INVAL_WAIT;
     qinval_entry->q.inv_wait_dsc.lo.iflag = iflag;
@@ -167,10 +166,11 @@ static int __must_check queue_invalidate
     qinval_entry->q.inv_wait_dsc.lo.sdata = QINVAL_STAT_DONE;
     qinval_entry->q.inv_wait_dsc.hi.saddr = virt_to_maddr(this_poll_slot);
 
-    unmap_vtd_domain_page(qinval_entries);
     qinval_update_qtail(iommu, index);
     spin_unlock_irqrestore(&iommu->register_lock, flags);
 
+    unmap_vtd_domain_page(qinval_entry);
+
     /* Now we don't support interrupt method */
     if ( sw )
     {
@@ -246,16 +246,12 @@ int qinval_device_iotlb_sync(struct vtd_
 {
     unsigned long flags;
     unsigned int index;
-    u64 entry_base;
-    struct qinval_entry *qinval_entry, *qinval_entries;
+    struct qinval_entry *qinval_entry;
 
     ASSERT(pdev);
     spin_lock_irqsave(&iommu->register_lock, flags);
     index = qinval_next_index(iommu);
-    entry_base = iommu->qinval_maddr +
-                 ((index >> QINVAL_ENTRY_ORDER) << PAGE_SHIFT);
-    qinval_entries = map_vtd_domain_page(entry_base);
-    qinval_entry = &qinval_entries[index % (1 << QINVAL_ENTRY_ORDER)];
+    qinval_entry = qi_map_entry(iommu, index);
 
     qinval_entry->q.dev_iotlb_inv_dsc.lo.type = TYPE_INVAL_DEVICE_IOTLB;
     qinval_entry->q.dev_iotlb_inv_dsc.lo.res_1 = 0;
@@ -268,10 +264,11 @@ int qinval_device_iotlb_sync(struct vtd_
     qinval_entry->q.dev_iotlb_inv_dsc.hi.res_1 = 0;
     qinval_entry->q.dev_iotlb_inv_dsc.hi.addr = addr >> PAGE_SHIFT_4K;
 
-    unmap_vtd_domain_page(qinval_entries);
     qinval_update_qtail(iommu, index);
     spin_unlock_irqrestore(&iommu->register_lock, flags);
 
+    unmap_vtd_domain_page(qinval_entry);
+
     return dev_invalidate_sync(iommu, pdev, did);
 }
 
@@ -280,16 +277,12 @@ static int __must_check queue_invalidate
 {
     unsigned long flags;
     unsigned int index;
-    u64 entry_base;
-    struct qinval_entry *qinval_entry, *qinval_entries;
+    struct qinval_entry *qinval_entry;
     int ret;
 
     spin_lock_irqsave(&iommu->register_lock, flags);
     index = qinval_next_index(iommu);
-    entry_base = iommu->qinval_maddr +
-                 ((index >> QINVAL_ENTRY_ORDER) << PAGE_SHIFT);
-    qinval_entries = map_vtd_domain_page(entry_base);
-    qinval_entry = &qinval_entries[index % (1 << QINVAL_ENTRY_ORDER)];
+    qinval_entry = qi_map_entry(iommu, index);
 
     qinval_entry->q.iec_inv_dsc.lo.type = TYPE_INVAL_IEC;
     qinval_entry->q.iec_inv_dsc.lo.granu = granu;
@@ -299,10 +292,11 @@ static int __must_check queue_invalidate
     qinval_entry->q.iec_inv_dsc.lo.res_2 = 0;
     qinval_entry->q.iec_inv_dsc.hi.res = 0;
 
-    unmap_vtd_domain_page(qinval_entries);
     qinval_update_qtail(iommu, index);
     spin_unlock_irqrestore(&iommu->register_lock, flags);
 
+    unmap_vtd_domain_page(qinval_entry);
+
     ret = invalidate_sync(iommu);
 
     /*




* [PATCH 8/9] VT-d: drop/move a few QI related constants
  2021-06-09  9:25 [PATCH 0/9] IOMMU: XSA-373 follow-on Jan Beulich
                   ` (6 preceding siblings ...)
  2021-06-09  9:29 ` [PATCH 7/9] VT-d: centralize mapping of QI entries Jan Beulich
@ 2021-06-09  9:29 ` Jan Beulich
  2021-06-24  5:32   ` Tian, Kevin
  2021-06-09  9:30 ` [PATCH 9/9] IOMMU/PCI: don't let domain cleanup continue when device de-assignment failed Jan Beulich
  2021-06-23  6:51 ` Ping: [PATCH 0/9] IOMMU: XSA-373 follow-on Jan Beulich
  9 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2021-06-09  9:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Paul Durrant, Andrew Cooper, Kevin Tian

Replace uses of QINVAL_ENTRY_ORDER and QINVAL_INDEX_SHIFT, such that
the constants can be dropped. Move the remaining QINVAL_* ones to the
single source file using them.
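
The replacements are plain equivalences, since the head/tail registers
hold byte offsets of 16-byte descriptors (illustrative value only):

    /*
     * old: index = tail >> QINVAL_INDEX_SHIFT            (shift by 4)
     * new: index = tail / sizeof(struct qinval_entry)    (divide by 16)
     * e.g. a tail value of 0x1a0 yields index 26 either way.
     */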

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -451,17 +451,6 @@ struct qinval_entry {
     }q;
 };
 
-/* Each entry is 16 bytes, so 2^8 entries per page */
-#define QINVAL_ENTRY_ORDER  ( PAGE_SHIFT - 4 )
-#define QINVAL_MAX_ENTRY_NR (1u << (7 + QINVAL_ENTRY_ORDER))
-
-/* Status data flag */
-#define QINVAL_STAT_INIT  0
-#define QINVAL_STAT_DONE  1
-
-/* Queue invalidation head/tail shift */
-#define QINVAL_INDEX_SHIFT 4
-
 #define TYPE_INVAL_CONTEXT      0x1
 #define TYPE_INVAL_IOTLB        0x2
 #define TYPE_INVAL_DEVICE_IOTLB 0x3
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -29,6 +29,13 @@
 #include "extern.h"
 #include "../ats.h"
 
+/* Each entry is 16 bytes, and there can be up to 2^7 pages. */
+#define QINVAL_MAX_ENTRY_NR (1u << (7 + PAGE_SHIFT_4K - 4))
+
+/* Status data flag */
+#define QINVAL_STAT_INIT  0
+#define QINVAL_STAT_DONE  1
+
 static unsigned int __read_mostly qi_pg_order;
 static unsigned int __read_mostly qi_entry_nr;
 
@@ -45,11 +52,11 @@ static unsigned int qinval_next_index(st
 {
     unsigned int tail = dmar_readl(iommu->reg, DMAR_IQT_REG);
 
-    tail >>= QINVAL_INDEX_SHIFT;
+    tail /= sizeof(struct qinval_entry);
 
     /* (tail+1 == head) indicates a full queue, wait for HW */
     while ( ((tail + 1) & (qi_entry_nr - 1)) ==
-            (dmar_readl(iommu->reg, DMAR_IQH_REG) >> QINVAL_INDEX_SHIFT) )
+            (dmar_readl(iommu->reg, DMAR_IQH_REG) / sizeof(struct qinval_entry)) )
     {
         printk_once(XENLOG_ERR VTDPREFIX " IOMMU#%u: no QI slot available\n",
                     iommu->index);
@@ -66,7 +73,7 @@ static void qinval_update_qtail(struct v
     /* Need hold register lock when update tail */
     ASSERT( spin_is_locked(&iommu->register_lock) );
     val = (index + 1) & (qi_entry_nr - 1);
-    dmar_writel(iommu->reg, DMAR_IQT_REG, val << QINVAL_INDEX_SHIFT);
+    dmar_writel(iommu->reg, DMAR_IQT_REG, val * sizeof(struct qinval_entry));
 }
 
 static struct qinval_entry *qi_map_entry(const struct vtd_iommu *iommu,
@@ -413,17 +420,18 @@ int enable_qinval(struct vtd_iommu *iomm
              * only one entry left.
              */
             BUILD_BUG_ON(CONFIG_NR_CPUS * 2 >= QINVAL_MAX_ENTRY_NR);
-            qi_pg_order = get_order_from_bytes((num_present_cpus() * 2 + 1) <<
-                                               (PAGE_SHIFT -
-                                                QINVAL_ENTRY_ORDER));
-            qi_entry_nr = 1u << (qi_pg_order + QINVAL_ENTRY_ORDER);
+            qi_pg_order = get_order_from_bytes((num_present_cpus() * 2 + 1) *
+                                               sizeof(struct qinval_entry));
+            qi_entry_nr = (PAGE_SIZE << qi_pg_order) /
+                          sizeof(struct qinval_entry);
 
             dprintk(XENLOG_INFO VTDPREFIX,
                     "QI: using %u-entry ring(s)\n", qi_entry_nr);
         }
 
         iommu->qinval_maddr =
-            alloc_pgtable_maddr(qi_entry_nr >> QINVAL_ENTRY_ORDER,
+            alloc_pgtable_maddr(PFN_DOWN(qi_entry_nr *
+                                         sizeof(struct qinval_entry)),
                                 iommu->node);
         if ( iommu->qinval_maddr == 0 )
         {




* [PATCH 9/9] IOMMU/PCI: don't let domain cleanup continue when device de-assignment failed
  2021-06-09  9:25 [PATCH 0/9] IOMMU: XSA-373 follow-on Jan Beulich
                   ` (7 preceding siblings ...)
  2021-06-09  9:29 ` [PATCH 8/9] VT-d: drop/move a few QI related constants Jan Beulich
@ 2021-06-09  9:30 ` Jan Beulich
  2021-06-24 15:34   ` Paul Durrant
  2021-06-23  6:51 ` Ping: [PATCH 0/9] IOMMU: XSA-373 follow-on Jan Beulich
  9 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2021-06-09  9:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Paul Durrant, Andrew Cooper

Failure here could in principle mean the device may still be issuing DMA
requests, which would continue to be translated by the page tables the
device entry currently points at. With this we cannot allow the
subsequent cleanup step of freeing the page tables to occur, to prevent
use-after-free issues. We would need to accept, for the time being, that
in such a case the remaining domain resources will all be leaked, and
the domain will continue to exist as a zombie.

However, with flushes no longer timing out (and with proper timeout
detection for device I/O TLB flushing yet to be implemented), there's no
way anymore for failures to occur, except due to bugs elsewhere. Hence
the change here is merely a "just in case" one.

In order to continue the loop in spite of an error, we can't use
pci_get_pdev_by_domain() anymore. I have no idea why it was used here in
the first place, instead of the cheaper list iteration.
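
The loop then takes this shape (condensed from the hunk below; a
failure is recorded, but the remaining devices still get visited):

    int ret = 0;

    list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
        ret = deassign_device(d, pdev->seg, pdev->bus, pdev->devfn) ?: ret;

    return ret;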

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
A first step beyond this could be to have the backing functions of
deassign_device() allow the caller to tell whether the failure was from
removing the device from the domain being cleaned up, or from re-setup
in wherever the device was supposed to get moved to. In the latter case
we could allow domain cleanup to continue. I wonder whether we could
simply make those functions return "success" anyway, overriding their
returning of an error when ->is_dying is set.

A next step then might be to figure whether there's any "emergency"
adjustment that could be done instead of the full-fledged (and failed)
de-assign, to allow at least recovering all the memory from the guest.

--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -894,7 +894,7 @@ static int deassign_device(struct domain
 
 int pci_release_devices(struct domain *d)
 {
-    struct pci_dev *pdev;
+    struct pci_dev *pdev, *tmp;
     u8 bus, devfn;
     int ret;
 
@@ -905,15 +905,15 @@ int pci_release_devices(struct domain *d
         pcidevs_unlock();
         return ret;
     }
-    while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
+    list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
     {
         bus = pdev->bus;
         devfn = pdev->devfn;
-        deassign_device(d, pdev->seg, bus, devfn);
+        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
     }
     pcidevs_unlock();
 
-    return 0;
+    return ret;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600




* Re: [PATCH 1/9] AMD/IOMMU: redo awaiting of command completion
  2021-06-09  9:26 ` [PATCH 1/9] AMD/IOMMU: redo awaiting of command completion Jan Beulich
@ 2021-06-09 10:36   ` Andrew Cooper
  2021-06-09 12:08     ` Jan Beulich
  2021-06-10 12:24     ` Jan Beulich
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Cooper @ 2021-06-09 10:36 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Paul Durrant

On 09/06/2021 10:26, Jan Beulich wrote:
> The present abuse of the completion interrupt does not only stand in the
> way of, down the road, using it for its actual purpose, but also
> requires holding the IOMMU lock while waiting for command completion,
> limiting parallelism and keeping interrupts off for non-negligible
> periods of time. Have the IOMMU do an ordinary memory write instead of
> signaling an otherwise disabled interrupt (by just updating a status
> register bit).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Paul Durrant <paul@xen.org>

While I agree with the direction of the patch, some of the details could
do with improvement.

>
> --- a/xen/drivers/passthrough/amd/iommu_cmd.c
> +++ b/xen/drivers/passthrough/amd/iommu_cmd.c
> @@ -20,6 +20,9 @@
>  #include "iommu.h"
>  #include "../ats.h"
>  
> +#define CMD_COMPLETION_INIT 0
> +#define CMD_COMPLETION_DONE 1
> +
>  static void send_iommu_command(struct amd_iommu *iommu,
>                                 const uint32_t cmd[4])
>  {
> @@ -49,28 +52,31 @@ static void send_iommu_command(struct am
>  static void flush_command_buffer(struct amd_iommu *iommu,
>                                   unsigned int timeout_base)
>  {
> +    static DEFINE_PER_CPU(uint64_t, poll_slot);
> +    uint64_t *this_poll_slot = &this_cpu(poll_slot);
> +    paddr_t addr = virt_to_maddr(this_poll_slot);
>      uint32_t cmd[4];
>      s_time_t start, timeout;
>      static unsigned int __read_mostly threshold = 1;
>  
> -    /* RW1C 'ComWaitInt' in status register */
> -    writel(IOMMU_STATUS_COMP_WAIT_INT,
> -           iommu->mmio_base + IOMMU_STATUS_MMIO_OFFSET);
> -
> -    /* send an empty COMPLETION_WAIT command to flush command buffer */
> -    cmd[3] = cmd[2] = 0;
> -    set_field_in_reg_u32(IOMMU_CMD_COMPLETION_WAIT, 0,
> +    ACCESS_ONCE(*this_poll_slot) = CMD_COMPLETION_INIT;
> +
> +    /* send a COMPLETION_WAIT command to flush command buffer */
> +    cmd[0] = addr;
> +    set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, cmd[0],
> +                         IOMMU_COMP_WAIT_S_FLAG_MASK,
> +                         IOMMU_COMP_WAIT_S_FLAG_SHIFT, &cmd[0]);

set_field_in_reg_u32() is a disaster of a function - both in terms of
semantics, and code gen - and needs to be purged from the code.

It is a shame we don't have a real struct for objects in the command
buffer, but in lieu of that, this is just

    cmd[0] = addr | IOMMU_COMP_WAIT_S_FLAG_MASK;

which is the direction that previous cleanup has gone.

There are no current users of IOMMU_COMP_WAIT_S_FLAG_SHIFT, and ...

> +    cmd[1] = addr >> 32;
> +    set_field_in_reg_u32(IOMMU_CMD_COMPLETION_WAIT, cmd[1],
>                           IOMMU_CMD_OPCODE_MASK,
>                           IOMMU_CMD_OPCODE_SHIFT, &cmd[1]);
> -    set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, 0,
> -                         IOMMU_COMP_WAIT_I_FLAG_MASK,
> -                         IOMMU_COMP_WAIT_I_FLAG_SHIFT, &cmd[0]);

... this drops the final use of IOMMU_COMP_WAIT_I_FLAG_SHIFT, so both
should be dropped.

As for IOMMU_CMD_OPCODE_SHIFT, that can't be dropped yet, but it would
still be better to use

    cmd[1] = (addr >> 32) | MASK_INSR(IOMMU_CMD_COMPLETION_WAIT,
IOMMU_CMD_COMPLETION_WAIT);

in the short term.

~Andrew

P.S. An observation following from cmd[1]'s layout: AMD IOMMUs don't
actually work for a physaddr width of >52 bits, despite some support
along these lines elsewhere in the spec.




* Re: [PATCH 2/9] AMD/IOMMU: re-work locking around sending of commands
  2021-06-09  9:27 ` [PATCH 2/9] AMD/IOMMU: re-work locking around sending of commands Jan Beulich
@ 2021-06-09 10:53   ` Andrew Cooper
  2021-06-09 12:22     ` Jan Beulich
  2021-06-10 11:58     ` Jan Beulich
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Cooper @ 2021-06-09 10:53 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Paul Durrant

On 09/06/2021 10:27, Jan Beulich wrote:
> It appears unhelpful to me for flush_command_buffer() to block all
> progress elsewhere for the given IOMMU by holding its lock while
> waiting for command completion. Unless the lock is already held,
> acquire it in send_iommu_command(). Release it in all cases in
> flush_command_buffer(), before actually starting the wait loop.
>
> Some of the involved functions did/do get called with the lock already
> held: For amd_iommu_flush_intremap() we can simply move the locking
> inside. For amd_iommu_flush_device() and amd_iommu_flush_all_caches()
> the lock now gets dropped in the course of the function's operation.
>
> Where touching function headers anyway, also adjust types used to be
> better in line with our coding style and, where applicable, the
> functions' callers.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Paul Durrant <paul@xen.org>

Honestly, I'm -2 to this.  It is horrible obfuscation of the logic.

I agree with the premise of not holding the lock when we don't need to,
but moving the lock/unlocks into different functions makes it impossible
to follow.  (Also, the static analysers are going to scream at this
patch, and rightfully so IMO.)

send_iommu_command() is static, as is flush_command_buffer(), so there
is no need to split the locking like this AFAICT.

Instead, each amd_iommu_flush_* external accessor knows exactly what it
is doing, and whether a wait descriptor is wanted. 
flush_command_buffer() wants merging into send_iommu_command() as a
"bool wait" parameter, at which point the locking and unlocking moves
entirely into send_iommu_command() with no pointer games.

~Andrew




* Re: [PATCH 1/9] AMD/IOMMU: redo awaiting of command completion
  2021-06-09 10:36   ` Andrew Cooper
@ 2021-06-09 12:08     ` Jan Beulich
  2021-06-09 12:33       ` Andrew Cooper
  2021-06-10 12:24     ` Jan Beulich
  1 sibling, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2021-06-09 12:08 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Paul Durrant, xen-devel

On 09.06.2021 12:36, Andrew Cooper wrote:
> On 09/06/2021 10:26, Jan Beulich wrote:
>> @@ -49,28 +52,31 @@ static void send_iommu_command(struct am
>>  static void flush_command_buffer(struct amd_iommu *iommu,
>>                                   unsigned int timeout_base)
>>  {
>> +    static DEFINE_PER_CPU(uint64_t, poll_slot);
>> +    uint64_t *this_poll_slot = &this_cpu(poll_slot);
>> +    paddr_t addr = virt_to_maddr(this_poll_slot);
>>      uint32_t cmd[4];
>>      s_time_t start, timeout;
>>      static unsigned int __read_mostly threshold = 1;
>>  
>> -    /* RW1C 'ComWaitInt' in status register */
>> -    writel(IOMMU_STATUS_COMP_WAIT_INT,
>> -           iommu->mmio_base + IOMMU_STATUS_MMIO_OFFSET);
>> -
>> -    /* send an empty COMPLETION_WAIT command to flush command buffer */
>> -    cmd[3] = cmd[2] = 0;
>> -    set_field_in_reg_u32(IOMMU_CMD_COMPLETION_WAIT, 0,
>> +    ACCESS_ONCE(*this_poll_slot) = CMD_COMPLETION_INIT;
>> +
>> +    /* send a COMPLETION_WAIT command to flush command buffer */
>> +    cmd[0] = addr;
>> +    set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, cmd[0],
>> +                         IOMMU_COMP_WAIT_S_FLAG_MASK,
>> +                         IOMMU_COMP_WAIT_S_FLAG_SHIFT, &cmd[0]);
> 
> set_field_in_reg_u32() is a disaster of a function - both in terms of
> semantics, and code gen - and needs to be purged from the code.

Long ago I had an item on my todo list to get this cleaned up. But
it never really having made it up high enough, I dropped it at
some point, in the hope that we'd manage to get this sorted while
re-writing code step by step.

> It is a shame we don't have a real struct for objects in the command
> buffer, but in lieu of that, this is just
> 
>     cmd[0] = addr | IOMMU_COMP_WAIT_S_FLAG_MASK;
> 
> which is the direction that previous cleanup has gone.

I don't think I can spot a single instance of such. Some work was
done to introduce (mainly bitfield) structs, but this surely goes
too far for the change at hand. I can spot two instances using
MASK_INSR(), so I can see two consistent ways of doing what you
ask for:

    cmd[0] = addr | MASK_INSR(IOMMU_CONTROL_ENABLED,
                              IOMMU_COMP_WAIT_S_FLAG_MASK);

keeping the name as *_MASK (and I'd be open to replace
IOMMU_CONTROL_ENABLED by true) or

    cmd[0] = addr | IOMMU_COMP_WAIT_S_FLAG;

i.e. dropping _MASK (but requiring adjustments elsewhere in the
code). Please let me know which one you'd prefer.

> There are no current users of IOMMU_COMP_WAIT_S_FLAG_SHIFT, and ...
> 
>> +    cmd[1] = addr >> 32;
>> +    set_field_in_reg_u32(IOMMU_CMD_COMPLETION_WAIT, cmd[1],
>>                           IOMMU_CMD_OPCODE_MASK,
>>                           IOMMU_CMD_OPCODE_SHIFT, &cmd[1]);
>> -    set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, 0,
>> -                         IOMMU_COMP_WAIT_I_FLAG_MASK,
>> -                         IOMMU_COMP_WAIT_I_FLAG_SHIFT, &cmd[0]);
> 
> ... this drops the final use of IOMMU_COMP_WAIT_I_FLAG_SHIFT, so both
> should be dropped.

Well, I can surely do so, but like this entire request of yours this
feels like scope creep - there was no intention here to do any
unrelated cleanup. And if I remove _S_ and _I_, then surely _F_
wants dropping as well, while IOMMU_COMP_WAIT_ADDR_*_SHIFT have a
use each in iommu_guest.c and hence need to stay for now.

> As for IOMMU_CMD_OPCODE_SHIFT, that can't be dropped yet, but it would
> still be better to use
> 
>     cmd[1] = (addr >> 32) | MASK_INSR(IOMMU_CMD_COMPLETION_WAIT,
> IOMMU_CMD_COMPLETION_WAIT);
> 
> in the short term.

Can do (using IOMMU_CMD_OPCODE_MASK).

Jan



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/9] AMD/IOMMU: re-work locking around sending of commands
  2021-06-09 10:53   ` Andrew Cooper
@ 2021-06-09 12:22     ` Jan Beulich
  2021-06-10 11:58     ` Jan Beulich
  1 sibling, 0 replies; 28+ messages in thread
From: Jan Beulich @ 2021-06-09 12:22 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Paul Durrant, xen-devel

On 09.06.2021 12:53, Andrew Cooper wrote:
> On 09/06/2021 10:27, Jan Beulich wrote:
>> It appears unhelpful to me for flush_command_buffer() to block all
>> progress elsewhere for the given IOMMU by holding its lock while
>> waiting for command completion. Unless the lock is already held,
>> acquire it in send_iommu_command(). Release it in all cases in
>> flush_command_buffer(), before actually starting the wait loop.
>>
>> Some of the involved functions did/do get called with the lock already
>> held: For amd_iommu_flush_intremap() we can simply move the locking
>> inside. For amd_iommu_flush_device() and amd_iommu_flush_all_caches()
>> the lock now gets dropped in the course of the function's operation.
>>
>> Where touching function headers anyway, also adjust types used to be
>> better in line with our coding style and, where applicable, the
>> functions' callers.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Reviewed-by: Paul Durrant <paul@xen.org>
> 
> Honestly, I'm -2 to this.  It is horrible obfuscation of the logic.
> 
> I agree with the premise of not holding the lock when we don't need to,
> but moving the lock/unlocks into different functions makes it impossible
> to follow.  (Also, the static analysers are going to scream at this
> patch, and rightfully so IMO.)

Just to make it explicit - the immediate goal isn't so much to
shrink the locked regions as much as possible, but first of all to
avoid spin-waiting in flush_command_buffer() while holding the lock
(and having IRQs off).
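
To put it in terms of a generic caller (purely illustrative, not code
quoted from the tree):

    /* Today: the wait loop runs with the lock held and IRQs off. */
    spin_lock_irqsave(&iommu->lock, flags);
    send_iommu_command(iommu, cmd);
    flush_command_buffer(iommu, 0);
    spin_unlock_irqrestore(&iommu->lock, flags);

    /* Intended: only the ring update is under the lock; the polling isn't. */
    spin_lock_irqsave(&iommu->lock, flags);
    send_iommu_command(iommu, cmd);
    spin_unlock_irqrestore(&iommu->lock, flags);
    flush_command_buffer(iommu, 0);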

> send_iommu_command() is static, as is flush_command_buffer(), so there
> is no need to split the locking like this AFAICT.
> 
> Instead, each amd_iommu_flush_* external accessor knows exactly what it
> is doing, and whether a wait descriptor is wanted. 
> flush_command_buffer() wants merging into send_iommu_command() as a
> "bool wait" parameter, at which point the locking and unlocking moves
> entirely into send_iommu_command() with no pointer games.

Then I can only guess you didn't look closely at the pci_amd_iommu.c
part of the change? You may rest assured that I wouldn't have taken
the chosen route if there was a reasonable alternative (within the
current overall code structure). In fact I had tried first with what
you suggest, and had to make it the way it was posted because of the
requirements of these callers.

I'm also pretty certain Paul wouldn't have given his R-b if there
was this simple an alternative.

Jan



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/9] AMD/IOMMU: redo awaiting of command completion
  2021-06-09 12:08     ` Jan Beulich
@ 2021-06-09 12:33       ` Andrew Cooper
  2021-06-09 12:51         ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Cooper @ 2021-06-09 12:33 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Paul Durrant, xen-devel

On 09/06/2021 13:08, Jan Beulich wrote:
> On 09.06.2021 12:36, Andrew Cooper wrote:
>> On 09/06/2021 10:26, Jan Beulich wrote:
>>> @@ -49,28 +52,31 @@ static void send_iommu_command(struct am
>>>  static void flush_command_buffer(struct amd_iommu *iommu,
>>>                                   unsigned int timeout_base)
>>>  {
>>> +    static DEFINE_PER_CPU(uint64_t, poll_slot);
>>> +    uint64_t *this_poll_slot = &this_cpu(poll_slot);
>>> +    paddr_t addr = virt_to_maddr(this_poll_slot);
>>>      uint32_t cmd[4];
>>>      s_time_t start, timeout;
>>>      static unsigned int __read_mostly threshold = 1;
>>>  
>>> -    /* RW1C 'ComWaitInt' in status register */
>>> -    writel(IOMMU_STATUS_COMP_WAIT_INT,
>>> -           iommu->mmio_base + IOMMU_STATUS_MMIO_OFFSET);
>>> -
>>> -    /* send an empty COMPLETION_WAIT command to flush command buffer */
>>> -    cmd[3] = cmd[2] = 0;
>>> -    set_field_in_reg_u32(IOMMU_CMD_COMPLETION_WAIT, 0,
>>> +    ACCESS_ONCE(*this_poll_slot) = CMD_COMPLETION_INIT;
>>> +
>>> +    /* send a COMPLETION_WAIT command to flush command buffer */
>>> +    cmd[0] = addr;
>>> +    set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, cmd[0],
>>> +                         IOMMU_COMP_WAIT_S_FLAG_MASK,
>>> +                         IOMMU_COMP_WAIT_S_FLAG_SHIFT, &cmd[0]);
>> set_field_in_reg_u32() is a disaster of a function - both in terms of
>> semantics, and code gen - and needs to be purged from the code.
> Long ago I had an item on my todo list to get this cleaned up. But
> it never really having made it up high enough, I dropped it at
> some point, in the hope that we'd manage to get this sorted while
> re-writing code step by step.
>
>> It is a shame we don't have a real struct for objects in the command
>> buffer, but in lieu of that, this is just
>>
>>     cmd[0] = addr | IOMMU_COMP_WAIT_S_FLAG_MASK;
>>
>> which is the direction that previous cleanup has gone.
> I don't think I can spot a single instance of such.

It's actually the other way around, for the emulation logic (which isn't
used in practice).

drivers/passthrough/amd/iommu_guest.c:348
    i = cmd->data[0] & IOMMU_COMP_WAIT_I_FLAG_MASK;

>  Some work was
> done to introduce (mainly bitfield) structs, but this surely goes
> too far for the change at hand. I can spot two instances using
> MASK_INSR(), so I can see two consistent ways of doing what you
> ask for:
>
>     cmd[0] = addr | MASK_INSR(IOMMU_CONTROL_ENABLED,
>                               IOMMU_COMP_WAIT_S_FLAG_MASK);
>
> keeping the name as *_MASK (and I'd be open to replace
> IOMMU_CONTROL_ENABLED by true) or
>
>     cmd[0] = addr | IOMMU_COMP_WAIT_S_FLAG;
>
> i.e. dropping _MASK (but requiring adjustments elsewhere in the
> code). Please let me know which one you'd prefer.

TBH, I'd suggest just using

    cmd[0] = addr | IOMMU_COMP_WAIT_S_FLAG_MASK;

for now.  The constant is correct - it's just the name which is wonky.
This in particular will reduce the code churn for ...

>> There are no current users of IOMMU_COMP_WAIT_S_FLAG_SHIFT, and ...
>>
>>> +    cmd[1] = addr >> 32;
>>> +    set_field_in_reg_u32(IOMMU_CMD_COMPLETION_WAIT, cmd[1],
>>>                           IOMMU_CMD_OPCODE_MASK,
>>>                           IOMMU_CMD_OPCODE_SHIFT, &cmd[1]);
>>> -    set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, 0,
>>> -                         IOMMU_COMP_WAIT_I_FLAG_MASK,
>>> -                         IOMMU_COMP_WAIT_I_FLAG_SHIFT, &cmd[0]);
>> ... this drops the final use of IOMMU_COMP_WAIT_I_FLAG_SHIFT, so both
>> should be dropped.
> Well, I can surely do so, but like this entire request of yours this
> feels like scope creep - there was no intention here to do any
> unrelated cleanup. And if I remove _S_ and _I_, then surely _F_
> wants dropping as well, while IOMMU_COMP_WAIT_ADDR_*_SHIFT have a
> use each in iommu_guest.c and hence need to stay for now.

... this, which I'm perfectly happy leaving to a subsequent change. 
(I'll even do it, if you're too busy right now).

What I am mainly concerned with is not using this opportunity to remove
uses of set_field_in_reg_u32().

>
>> As for IOMMU_CMD_OPCODE_SHIFT, that can't be dropped yet, but it would
>> still be better to use
>>
>>     cmd[1] = (addr >> 32) | MASK_INSR(IOMMU_CMD_COMPLETION_WAIT,
>> IOMMU_CMD_COMPLETION_WAIT);
>>
>> in the short term.
> Can do (using IOMMU_CMD_OPCODE_MASK).

Oops yes.  That was a copy&paste mistake.

~Andrew



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/9] AMD/IOMMU: redo awaiting of command completion
  2021-06-09 12:33       ` Andrew Cooper
@ 2021-06-09 12:51         ` Jan Beulich
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Beulich @ 2021-06-09 12:51 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Paul Durrant, xen-devel

On 09.06.2021 14:33, Andrew Cooper wrote:
> On 09/06/2021 13:08, Jan Beulich wrote:
>> On 09.06.2021 12:36, Andrew Cooper wrote:
>>> On 09/06/2021 10:26, Jan Beulich wrote:
>>>> @@ -49,28 +52,31 @@ static void send_iommu_command(struct am
>>>>  static void flush_command_buffer(struct amd_iommu *iommu,
>>>>                                   unsigned int timeout_base)
>>>>  {
>>>> +    static DEFINE_PER_CPU(uint64_t, poll_slot);
>>>> +    uint64_t *this_poll_slot = &this_cpu(poll_slot);
>>>> +    paddr_t addr = virt_to_maddr(this_poll_slot);
>>>>      uint32_t cmd[4];
>>>>      s_time_t start, timeout;
>>>>      static unsigned int __read_mostly threshold = 1;
>>>>  
>>>> -    /* RW1C 'ComWaitInt' in status register */
>>>> -    writel(IOMMU_STATUS_COMP_WAIT_INT,
>>>> -           iommu->mmio_base + IOMMU_STATUS_MMIO_OFFSET);
>>>> -
>>>> -    /* send an empty COMPLETION_WAIT command to flush command buffer */
>>>> -    cmd[3] = cmd[2] = 0;
>>>> -    set_field_in_reg_u32(IOMMU_CMD_COMPLETION_WAIT, 0,
>>>> +    ACCESS_ONCE(*this_poll_slot) = CMD_COMPLETION_INIT;
>>>> +
>>>> +    /* send a COMPLETION_WAIT command to flush command buffer */
>>>> +    cmd[0] = addr;
>>>> +    set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, cmd[0],
>>>> +                         IOMMU_COMP_WAIT_S_FLAG_MASK,
>>>> +                         IOMMU_COMP_WAIT_S_FLAG_SHIFT, &cmd[0]);
>>> set_field_in_reg_u32() is a disaster of a function - both in terms of
>>> semantics, and code gen - and needs to be purged from the code.
>> Long ago I had an item on my todo list to get this cleaned up. But
>> it never really having made it up high enough, I dropped it at
>> some point, in the hope that we'd manage to get this sorted while
>> re-writing code step by step.
>>
>>> It is a shame we don't have a real struct for objects in the command
>>> buffer, but in lieu of that, this is just
>>>
>>>     cmd[0] = addr | IOMMU_COMP_WAIT_S_FLAG_MASK;
>>>
>>> which is the direction that previous cleanup has gone.
>> I don't think I can spot a single instance of such.
> 
> It's actually the other way around, for the emulation logic (which isn't
> used in practice).
> 
> drivers/passthrough/amd/iommu_guest.c:348
>     i = cmd->data[0] & IOMMU_COMP_WAIT_I_FLAG_MASK;
> 
>>  Some work was
>> done to introduce (mainly bitfield) structs, but this surely goes
>> too far for the change at hand. I can spot two instances using
>> MASK_INSR(), so I can see two consistent ways of doing what you
>> ask for:
>>
>>     cmd[0] = addr | MASK_INSR(IOMMU_CONTROL_ENABLED,
>>                               IOMMU_COMP_WAIT_S_FLAG_MASK);
>>
>> keeping the name as *_MASK (and I'd be open to replace
>> IOMMU_CONTROL_ENABLED by true) or
>>
>>     cmd[0] = addr | IOMMU_COMP_WAIT_S_FLAG;
>>
>> i.e. dropping _MASK (but requiring adjustments elsewhere in the
>> code). Please let me know which one you'd prefer.
> 
> TBH, I'd suggest just using
> 
>     cmd[0] = addr | IOMMU_COMP_WAIT_S_FLAG_MASK;
> 
> for now.  The constant is correct - it's just the name which is wonky.

But my previous reply was to make clear that I don't agree with
ORing in a *_MASK into a (to become) live value. *_MASK should be
used exclusively for masking, not as actual field values. Any
code violating this should imo be looked at with suspicion, as to
possibly having used the wrong value altogether.
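
To spell out the distinction, re-using only forms already quoted in this
thread (illustrative only):

    /* Masking, i.e. extracting or testing a field - fine: */
    i = cmd->data[0] & IOMMU_COMP_WAIT_I_FLAG_MASK;

    /*
     * Composing a (to become) live value - better expressed via
     * MASK_INSR() (or a constant not named *_MASK), even if for a
     * single-bit field the resulting bit pattern is the same:
     */
    cmd[0] = addr | MASK_INSR(true, IOMMU_COMP_WAIT_S_FLAG_MASK);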

> This in particular will reduce the code churn for ...
> 
>>> There are no current users of IOMMU_COMP_WAIT_S_FLAG_SHIFT, and ...
>>>
>>>> +    cmd[1] = addr >> 32;
>>>> +    set_field_in_reg_u32(IOMMU_CMD_COMPLETION_WAIT, cmd[1],
>>>>                           IOMMU_CMD_OPCODE_MASK,
>>>>                           IOMMU_CMD_OPCODE_SHIFT, &cmd[1]);
>>>> -    set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, 0,
>>>> -                         IOMMU_COMP_WAIT_I_FLAG_MASK,
>>>> -                         IOMMU_COMP_WAIT_I_FLAG_SHIFT, &cmd[0]);
>>> ... this drops the final use of IOMMU_COMP_WAIT_I_FLAG_SHIFT, so both
>>> should be dropped.
>> Well, I can surely do so, but like this entire request of yours this
>> feels like scope creep - there was no intention here to do any
>> unrelated cleanup. And if I remove _S_ and _I_, then surely _F_
>> wants dropping as well, while IOMMU_COMP_WAIT_ADDR_*_SHIFT have a
>> use each in iommu_guest.c and hence need to stay for now.
> 
> ... this, which I'm perfectly happy leaving to a subsequent change. 
> (I'll even do it, if you're too busy right now).
> 
> What I am mainly concerned with is not using this opportunity to remove
> uses of set_field_in_reg_u32().

Well, when I put the patch together I was thinking of two "proper"
options - keeping the use of set_field_in_reg_u32(), or replacing it
by (bitfield) struct uses. The latter would be a far larger change,
and should imo be one on its own (i.e. no functional change) anyway.
Hence I went the former route. Since you vehemently ask for it, I'll
go the middle route you suggest, but this only sets us up for re-
writing this piece of code another time once we introduce proper
structs.

Jan



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/9] AMD/IOMMU: re-work locking around sending of commands
  2021-06-09 10:53   ` Andrew Cooper
  2021-06-09 12:22     ` Jan Beulich
@ 2021-06-10 11:58     ` Jan Beulich
  2021-06-10 12:53       ` Jan Beulich
  1 sibling, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2021-06-10 11:58 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Paul Durrant, xen-devel

On 09.06.2021 12:53, Andrew Cooper wrote:
> On 09/06/2021 10:27, Jan Beulich wrote:
>> It appears unhelpful to me for flush_command_buffer() to block all
>> progress elsewhere for the given IOMMU by holding its lock while
>> waiting for command completion. Unless the lock is already held,
>> acquire it in send_iommu_command(). Release it in all cases in
>> flush_command_buffer(), before actually starting the wait loop.
>>
>> Some of the involved functions did/do get called with the lock already
>> held: For amd_iommu_flush_intremap() we can simply move the locking
>> inside. For amd_iommu_flush_device() and amd_iommu_flush_all_caches()
>> the lock now gets dropped in the course of the function's operation.
>>
>> Where touching function headers anyway, also adjust types used to be
>> better in line with our coding style and, where applicable, the
>> functions' callers.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Reviewed-by: Paul Durrant <paul@xen.org>
> 
> Honestly, I'm -2 to this.  It is horrible obfuscation of the logic.
> 
> I agree with the premise of not holding the lock when we don't need to,
> but moving the lock/unlocks into different functions makes it impossible
> to follow.  (Also, the static analysers are going to scream at this
> patch, and rightfully so IMO.)
> 
> send_iommu_command() is static, as is flush_command_buffer(), so there
> is no need to split the locking like this AFAICT.
> 
> Instead, each amd_iommu_flush_* external accessor knows exactly what it
> is doing, and whether a wait descriptor is wanted. 
> flush_command_buffer() wants merging into send_iommu_command() as a
> "bool wait" parameter,

A further remark on this particular suggestion: While this is likely
doable, the result will presumably look a little odd: Besides the
various code paths calling send_iommu_command() and then
flush_command_buffer(), the former is also called _by_ the latter.
I can give this a try, but I'd like to be halfway certain I won't
be asked to undo that later on.

And of course this won't help with the split locking, only with some
of the passing around of the saved / to-be-restored eflags.

As an aside, the suggested "bool wait" parameter would (right now) only
ever get passed a "true" argument, so I'm not convinced it's useful to
have at this point, as then we'd also need to deal with the "false"
case (requiring a completion interrupt to be arranged for, which we
have no handler for) despite that code path being unused (and hence
also un-testable).

Jan



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/9] AMD/IOMMU: redo awaiting of command completion
  2021-06-09 10:36   ` Andrew Cooper
  2021-06-09 12:08     ` Jan Beulich
@ 2021-06-10 12:24     ` Jan Beulich
  1 sibling, 0 replies; 28+ messages in thread
From: Jan Beulich @ 2021-06-10 12:24 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Paul Durrant, xen-devel

On 09.06.2021 12:36, Andrew Cooper wrote:
> On 09/06/2021 10:26, Jan Beulich wrote:
>> The present abuse of the completion interrupt does not only stand in the
>> way of, down the road, using it for its actual purpose, but also
>> requires holding the IOMMU lock while waiting for command completion,
>> limiting parallelism and keeping interrupts off for non-negligible
>> periods of time. Have the IOMMU do an ordinary memory write instead of
>> signaling an otherwise disabled interrupt (by just updating a status
>> register bit).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Reviewed-by: Paul Durrant <paul@xen.org>
> 
> While I agree with the direction of the patch, some of the details could
> do with improvement.
> 
>>
>> --- a/xen/drivers/passthrough/amd/iommu_cmd.c
>> +++ b/xen/drivers/passthrough/amd/iommu_cmd.c
>> @@ -20,6 +20,9 @@
>>  #include "iommu.h"
>>  #include "../ats.h"
>>  
>> +#define CMD_COMPLETION_INIT 0
>> +#define CMD_COMPLETION_DONE 1
>> +
>>  static void send_iommu_command(struct amd_iommu *iommu,
>>                                 const uint32_t cmd[4])
>>  {
>> @@ -49,28 +52,31 @@ static void send_iommu_command(struct am
>>  static void flush_command_buffer(struct amd_iommu *iommu,
>>                                   unsigned int timeout_base)
>>  {
>> +    static DEFINE_PER_CPU(uint64_t, poll_slot);
>> +    uint64_t *this_poll_slot = &this_cpu(poll_slot);
>> +    paddr_t addr = virt_to_maddr(this_poll_slot);
>>      uint32_t cmd[4];
>>      s_time_t start, timeout;
>>      static unsigned int __read_mostly threshold = 1;
>>  
>> -    /* RW1C 'ComWaitInt' in status register */
>> -    writel(IOMMU_STATUS_COMP_WAIT_INT,
>> -           iommu->mmio_base + IOMMU_STATUS_MMIO_OFFSET);
>> -
>> -    /* send an empty COMPLETION_WAIT command to flush command buffer */
>> -    cmd[3] = cmd[2] = 0;
>> -    set_field_in_reg_u32(IOMMU_CMD_COMPLETION_WAIT, 0,
>> +    ACCESS_ONCE(*this_poll_slot) = CMD_COMPLETION_INIT;
>> +
>> +    /* send a COMPLETION_WAIT command to flush command buffer */
>> +    cmd[0] = addr;
>> +    set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, cmd[0],
>> +                         IOMMU_COMP_WAIT_S_FLAG_MASK,
>> +                         IOMMU_COMP_WAIT_S_FLAG_SHIFT, &cmd[0]);
> 
> set_field_in_reg_u32() is a disaster of a function - both in terms of
> semantics, and code gen - and needs to be purged from the code.
> 
> It is a shame we don't have a real struct for objects in the command
> buffer, but in lieu of that, this is just
> 
>     cmd[0] = addr | IOMMU_COMP_WAIT_S_FLAG_MASK;
> 
> which is the direction that previous cleanup has gone.
> 
> There are no current users of IOMMU_COMP_WAIT_S_FLAG_SHIFT, and ...
> 
>> +    cmd[1] = addr >> 32;
>> +    set_field_in_reg_u32(IOMMU_CMD_COMPLETION_WAIT, cmd[1],
>>                           IOMMU_CMD_OPCODE_MASK,
>>                           IOMMU_CMD_OPCODE_SHIFT, &cmd[1]);
>> -    set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, 0,
>> -                         IOMMU_COMP_WAIT_I_FLAG_MASK,
>> -                         IOMMU_COMP_WAIT_I_FLAG_SHIFT, &cmd[0]);
> 
> ... this drops the final use of IOMMU_COMP_WAIT_I_FLAG_SHIFT, so both
> should be dropped.
> 
> As for IOMMU_CMD_OPCODE_SHIFT, that can't be dropped yet, but it would
> still be better to use
> 
>     cmd[1] = (addr >> 32) | MASK_INSR(IOMMU_CMD_COMPLETION_WAIT,
> IOMMU_CMD_COMPLETION_WAIT);
> 
> in the short term.

Okay, this conversion has indeed saved a single

	and	$0x0FFFFFFF, %eax

But we're down by two set_field_in_reg_u32() now; only some 30 left.

Jan



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/9] AMD/IOMMU: re-work locking around sending of commands
  2021-06-10 11:58     ` Jan Beulich
@ 2021-06-10 12:53       ` Jan Beulich
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Beulich @ 2021-06-10 12:53 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Paul Durrant, xen-devel

On 10.06.2021 13:58, Jan Beulich wrote:
> On 09.06.2021 12:53, Andrew Cooper wrote:
>> On 09/06/2021 10:27, Jan Beulich wrote:
>>> It appears unhelpful to me for flush_command_buffer() to block all
>>> progress elsewhere for the given IOMMU by holding its lock while
>>> waiting for command completion. Unless the lock is already held,
>>> acquire it in send_iommu_command(). Release it in all cases in
>>> flush_command_buffer(), before actually starting the wait loop.
>>>
>>> Some of the involved functions did/do get called with the lock already
>>> held: For amd_iommu_flush_intremap() we can simply move the locking
>>> inside. For amd_iommu_flush_device() and amd_iommu_flush_all_caches()
>>> the lock now gets dropped in the course of the function's operation.
>>>
>>> Where touching function headers anyway, also adjust types used to be
>>> better in line with our coding style and, where applicable, the
>>> functions' callers.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> Reviewed-by: Paul Durrant <paul@xen.org>
>>
>> Honestly, I'm -2 to this.  It is horrible obfuscation of the logic.
>>
>> I agree with the premise of not holding the lock when we don't need to,
>> but moving the lock/unlocks into different functions makes it impossible
>> to follow.  (Also, the static analysers are going to scream at this
>> patch, and rightfully so IMO.)
>>
>> send_iommu_command() is static, as is flush_command_buffer(), so there
>> is no need to split the locking like this AFAICT.
>>
>> Instead, each amd_iommu_flush_* external accessor knows exactly what it
>> is doing, and whether a wait descriptor is wanted. 
>> flush_command_buffer() wants merging into send_iommu_command() as a
>> "bool wait" parameter,
> 
> A further remark on this particular suggestion: While this is likely
> doable, the result will presumably look a little odd: Besides the
> various code paths calling send_iommu_command() and then
> flush_command_buffer(), the former is also called _by_ the latter.
> I can give this a try, but I'd like to be halfway certain I won't
> be asked to undo that later on.
> 
> And of course this won't help with the split locking, only with some
> of the passing around of the saved / to-be-restored eflags.

Actually, different observation: I don't think there really is a need
for either amd_iommu_flush_device() or amd_iommu_flush_all_caches()
to be called with the lock held. The callers can drop the lock, and
then all locking in iommu_cmd.c can likely be contained to
send_iommu_command() alone, without any need to fold in
flush_command_buffer().
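
I.e., roughly, on the pci_amd_iommu.c side (sketch only, untested, and
with variable names recalled from memory rather than checked):

    /* In the DTE-update path: */
    spin_lock_irqsave(&iommu->lock, flags);
    /* ... update the DTE ... */
    spin_unlock_irqrestore(&iommu->lock, flags);

    /* Flush with the lock dropped; locking then lives in iommu_cmd.c. */
    amd_iommu_flush_device(iommu, req_id);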

Jan



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Ping: [PATCH 0/9] IOMMU: XSA-373 follow-on
  2021-06-09  9:25 [PATCH 0/9] IOMMU: XSA-373 follow-on Jan Beulich
                   ` (8 preceding siblings ...)
  2021-06-09  9:30 ` [PATCH 9/9] IOMMU/PCI: don't let domain cleanup continue when device de-assignment failed Jan Beulich
@ 2021-06-23  6:51 ` Jan Beulich
  2021-06-23  6:58   ` Tian, Kevin
  9 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2021-06-23  6:51 UTC (permalink / raw)
  To: Paul Durrant, Kevin Tian; +Cc: Andrew Cooper, xen-devel

On 09.06.2021 11:25, Jan Beulich wrote:
> A number of further adjustments were left out of the XSA, for not
> being a security concern (anymore in some of the cases, with the
> changes put in place there). This is the collection, possibly
> looking a little random in what it contains.
> 
> 1: AMD/IOMMU: redo awaiting of command completion
> 2: AMD/IOMMU: re-work locking around sending of commands

For these two I have v2 largely ready.

> 3: VT-d: undo device mappings upon error
> 4: VT-d: adjust domid map updating when unmapping context
> 5: VT-d: clear_fault_bits() should clear all fault bits
> 6: VT-d: don't lose errors when flushing TLBs on multiple IOMMUs
> 7: VT-d: centralize mapping of QI entries
> 8: VT-d: drop/move a few QI related constants

Kevin, any word on these?

> 9: IOMMU/PCI: don't let domain cleanup continue when device de-assignment failed

Paul, any feedback on this one?

Thanks, Jan



^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Ping: [PATCH 0/9] IOMMU: XSA-373 follow-on
  2021-06-23  6:51 ` Ping: [PATCH 0/9] IOMMU: XSA-373 follow-on Jan Beulich
@ 2021-06-23  6:58   ` Tian, Kevin
  0 siblings, 0 replies; 28+ messages in thread
From: Tian, Kevin @ 2021-06-23  6:58 UTC (permalink / raw)
  To: Jan Beulich, Paul Durrant; +Cc: Cooper, Andrew, xen-devel

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Wednesday, June 23, 2021 2:52 PM
> 
> On 09.06.2021 11:25, Jan Beulich wrote:
> > A number of further adjustments were left out of the XSA, for not
> > being a security concern (anymore in some of the cases, with the
> > changes put in place there). This is the collection, possibly
> > looking a little random in what it contains.
> >
> > 1: AMD/IOMMU: redo awaiting of command completion
> > 2: AMD/IOMMU: re-work locking around sending of commands
> 
> For these two I have v2 largely ready.
> 
> > 3: VT-d: undo device mappings upon error
> > 4: VT-d: adjust domid map updating when unmapping context
> > 5: VT-d: clear_fault_bits() should clear all fault bits
> > 6: VT-d: don't lose errors when flushing TLBs on multiple IOMMUs
> > 7: VT-d: centralize mapping of QI entries
> > 8: VT-d: drop/move a few QI related constants
> 
> Kevin, any word on these?

will take a look later today or tomorrow.

> 
> > 9: IOMMU/PCI: don't let domain cleanup continue when device de-
> assignment failed
> 
> Paul, any feedback on this one?
> 
> Thanks, Jan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 3/9] VT-d: undo device mappings upon error
  2021-06-09  9:27 ` [PATCH 3/9] VT-d: undo device mappings upon error Jan Beulich
@ 2021-06-24  5:13   ` Tian, Kevin
  0 siblings, 0 replies; 28+ messages in thread
From: Tian, Kevin @ 2021-06-24  5:13 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Paul Durrant, Cooper, Andrew

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Wednesday, June 9, 2021 5:28 PM
> 
> When
>  - flushes (supposedly not possible anymore after XSA-373),
>  - secondary mappings for legacy PCI devices behind bridges,
>  - secondary mappings for chipset quirks, or
>  - find_upstream_bridge() invocations
> fail, the successfully established device mappings should not be left
> around.
> 
> Further, when (parts of) unmapping fail, simply returning an error is
> typically not enough. Crash the domain instead in such cases, arranging
> for domain cleanup to continue in a best effort manner despite such
> failures.
> 
> Finally make domain_context_unmap()'s error behavior consistent in the
> legacy PCI device case: Don't bail from the function in one special
> case, but always just exit the switch statement.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Paul Durrant <paul@xen.org>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -1442,9 +1442,15 @@ int domain_context_mapping_one(
>      if ( !seg && !rc )
>          rc = me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
> 
> +    if ( rc )
> +        domain_context_unmap_one(domain, iommu, bus, devfn);
> +
>      return rc;
>  }
> 
> +static int domain_context_unmap(struct domain *d, uint8_t devfn,
> +                                struct pci_dev *pdev);
> +
>  static int domain_context_mapping(struct domain *domain, u8 devfn,
>                                    struct pci_dev *pdev)
>  {
> @@ -1505,16 +1511,21 @@ static int domain_context_mapping(struct
>          if ( ret )
>              break;
> 
> -        if ( find_upstream_bridge(seg, &bus, &devfn, &secbus) < 1 )
> -            break;
> +        if ( (ret = find_upstream_bridge(seg, &bus, &devfn, &secbus)) < 1 )
> +        {
> +            if ( !ret )
> +                break;
> +            ret = -ENXIO;
> +        }
> 
>          /*
>           * Mapping a bridge should, if anything, pass the struct pci_dev of
>           * that bridge. Since bridges don't normally get assigned to guests,
>           * their owner would be the wrong one. Pass NULL instead.
>           */
> -        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
> -                                         NULL);
> +        if ( ret >= 0 )
> +            ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
> +                                             NULL);
> 
>          /*
>           * Devices behind PCIe-to-PCI/PCIx bridge may generate different
> @@ -1531,6 +1542,9 @@ static int domain_context_mapping(struct
>              ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
>                                               NULL);
> 
> +        if ( ret )
> +            domain_context_unmap(domain, devfn, pdev);
> +
>          break;
> 
>      default:
> @@ -1609,6 +1623,19 @@ int domain_context_unmap_one(
>      if ( !iommu->drhd->segment && !rc )
>          rc = me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC);
> 
> +    if ( rc && !is_hardware_domain(domain) && domain != dom_io )
> +    {
> +        if ( domain->is_dying )
> +        {
> +            printk(XENLOG_ERR "%pd: error %d unmapping %04x:%02x:%02x.%u\n",
> +                   domain, rc, iommu->drhd->segment, bus,
> +                   PCI_SLOT(devfn), PCI_FUNC(devfn));
> +            rc = 0; /* Make upper layers continue in a best effort manner. */
> +        }
> +        else
> +            domain_crash(domain);
> +    }
> +
>      return rc;
>  }
> 
> @@ -1661,17 +1688,29 @@ static int domain_context_unmap(struct d
> 
>          tmp_bus = bus;
>          tmp_devfn = devfn;
> -        if ( find_upstream_bridge(seg, &tmp_bus, &tmp_devfn, &secbus) < 1 )
> +        if ( (ret = find_upstream_bridge(seg, &tmp_bus, &tmp_devfn,
> +                                         &secbus)) < 1 )
> +        {
> +            if ( ret )
> +            {
> +                ret = -ENXIO;
> +                if ( !domain->is_dying &&
> +                     !is_hardware_domain(domain) && domain != dom_io )
> +                {
> +                    domain_crash(domain);
> +                    /* Make upper layers continue in a best effort manner. */
> +                    ret = 0;
> +                }
> +            }
>              break;
> +        }
> 
>          /* PCIe to PCI/PCIx bridge */
>          if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
>          {
>              ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
> -            if ( ret )
> -                return ret;
> -
> -            ret = domain_context_unmap_one(domain, iommu, secbus, 0);
> +            if ( !ret )
> +                ret = domain_context_unmap_one(domain, iommu, secbus, 0);
>          }
>          else /* Legacy PCI bridge */
>              ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 4/9] VT-d: adjust domid map updating when unmapping context
  2021-06-09  9:28 ` [PATCH 4/9] VT-d: adjust domid map updating when unmapping context Jan Beulich
@ 2021-06-24  5:21   ` Tian, Kevin
  0 siblings, 0 replies; 28+ messages in thread
From: Tian, Kevin @ 2021-06-24  5:21 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Paul Durrant, Cooper, Andrew

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Wednesday, June 9, 2021 5:28 PM
> 
> When an earlier error occurred, cleaning up the domid mapping data is
> wrong, as references likely still exist. The only exception to this is
> when the actual unmapping worked, but some flush failed (supposedly
> impossible after XSA-373). The guest will get crashed in such a case
> though, so add fallback cleanup to domain destruction to cover this
> case. This in turn makes it desirable to silence the dprintk() in
> domain_iommu_domid().
> 
> Note that no error will be returned anymore when the lookup fails - in
> the common case lookup failure would already have caused
> domain_context_unmap_one() to fail, yet even from a more general
> perspective it doesn't look right to fail domain_context_unmap() in such
> a case when this was the last device, but not when any earlier unmap was
> otherwise successful.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -80,9 +80,11 @@ static int domain_iommu_domid(struct dom
>          i = find_next_bit(iommu->domid_bitmap, nr_dom, i+1);
>      }
> 
> -    dprintk(XENLOG_ERR VTDPREFIX,
> -            "Cannot get valid iommu domid: domid=%d iommu->index=%d\n",
> -            d->domain_id, iommu->index);
> +    if ( !d->is_dying )
> +        dprintk(XENLOG_ERR VTDPREFIX,
> +                "Cannot get valid iommu %u domid: %pd\n",
> +                iommu->index, d);
> +
>      return -1;
>  }
> 
> @@ -147,6 +149,17 @@ static int context_get_domain_id(struct
>      return domid;
>  }
> 
> +static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
> +{
> +    int iommu_domid = domain_iommu_domid(domain, iommu);
> +
> +    if ( iommu_domid >= 0 )
> +    {
> +        clear_bit(iommu_domid, iommu->domid_bitmap);
> +        iommu->domid_map[iommu_domid] = 0;
> +    }
> +}
> +
>  static void sync_cache(const void *addr, unsigned int size)
>  {
>      static unsigned long clflush_size = 0;
> @@ -1724,6 +1737,9 @@ static int domain_context_unmap(struct d
>          goto out;
>      }
> 
> +    if ( ret )
> +        goto out;
> +
>      /*
>       * if no other devices under the same iommu owned by this domain,
>       * clear iommu in iommu_bitmap and clear domain_id in domid_bitmp
> @@ -1743,19 +1759,8 @@ static int domain_context_unmap(struct d
> 
>      if ( found == 0 )
>      {
> -        int iommu_domid;
> -
> -        clear_bit(iommu->index, &dom_iommu(domain)->arch.vtd.iommu_bitmap);
> -
> -        iommu_domid = domain_iommu_domid(domain, iommu);
> -        if ( iommu_domid == -1 )
> -        {
> -            ret = -EINVAL;
> -            goto out;
> -        }
> -
> -        clear_bit(iommu_domid, iommu->domid_bitmap);
> -        iommu->domid_map[iommu_domid] = 0;
> +        cleanup_domid_map(domain, iommu);
>      }
> 
>  out:
> @@ -1775,6 +1780,7 @@ static void iommu_domain_teardown(struct
>  {
>      struct domain_iommu *hd = dom_iommu(d);
>      struct mapped_rmrr *mrmrr, *tmp;
> +    const struct acpi_drhd_unit *drhd;
> 
>      if ( list_empty(&acpi_drhd_units) )
>          return;
> @@ -1786,6 +1792,9 @@ static void iommu_domain_teardown(struct
>      }
> 
>      ASSERT(!hd->arch.vtd.pgd_maddr);
> +
> +    for_each_drhd_unit ( drhd )
> +        cleanup_domid_map(d, drhd->iommu);
>  }
> 
>  static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 5/9] VT-d: clear_fault_bits() should clear all fault bits
  2021-06-09  9:28 ` [PATCH 5/9] VT-d: clear_fault_bits() should clear all fault bits Jan Beulich
@ 2021-06-24  5:26   ` Tian, Kevin
  0 siblings, 0 replies; 28+ messages in thread
From: Tian, Kevin @ 2021-06-24  5:26 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Paul Durrant, Cooper, Andrew

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Wednesday, June 9, 2021 5:29 PM
> 
> If there is any way for one fault to be left set in the recording
> registers, there's no reason there couldn't also be multiple ones. If
> PPF is set (being the OR of all F fields), simply loop over the entire
> range of fault recording registers, clearing F everywhere.
> 
> Since PPF is a r/o bit, also remove it from DMA_FSTS_FAULTS (arguably
> the constant's name is ambiguous as well).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -2094,13 +2094,23 @@ static int __hwdom_init setup_hwdom_devi
> 
>  void clear_fault_bits(struct vtd_iommu *iommu)
>  {
> -    u64 val;
>      unsigned long flags;
> 
>      spin_lock_irqsave(&iommu->register_lock, flags);
> -    val = dmar_readq(iommu->reg, cap_fault_reg_offset(iommu->cap) + 8);
> -    dmar_writeq(iommu->reg, cap_fault_reg_offset(iommu->cap) + 8, val);
> +
> +    if ( dmar_readl(iommu->reg, DMAR_FSTS_REG) & DMA_FSTS_PPF )
> +    {
> +        unsigned int reg = cap_fault_reg_offset(iommu->cap);
> +        unsigned int end = reg + cap_num_fault_regs(iommu->cap);
> +
> +        do {
> +           dmar_writel(iommu->reg, reg + 12, DMA_FRCD_F);
> +           reg += PRIMARY_FAULT_REG_LEN;
> +        } while ( reg < end );
> +    }
> +
>      dmar_writel(iommu->reg, DMAR_FSTS_REG, DMA_FSTS_FAULTS);
> +
>      spin_unlock_irqrestore(&iommu->register_lock, flags);
>  }
> 
> --- a/xen/drivers/passthrough/vtd/iommu.h
> +++ b/xen/drivers/passthrough/vtd/iommu.h
> @@ -174,9 +174,8 @@
>  #define DMA_FSTS_IQE (1u << 4)
>  #define DMA_FSTS_ICE (1u << 5)
>  #define DMA_FSTS_ITE (1u << 6)
> -#define DMA_FSTS_FAULTS (DMA_FSTS_PFO | DMA_FSTS_PPF | DMA_FSTS_AFO | \
> -                         DMA_FSTS_APF | DMA_FSTS_IQE | DMA_FSTS_ICE | \
> -                         DMA_FSTS_ITE)
> +#define DMA_FSTS_FAULTS (DMA_FSTS_PFO | DMA_FSTS_AFO | DMA_FSTS_APF | \
> +                         DMA_FSTS_IQE | DMA_FSTS_ICE | DMA_FSTS_ITE)
>  #define dma_fsts_fault_record_index(s) (((s) >> 8) & 0xff)
> 
>  /* FRCD_REG, 32 bits access */


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 6/9] VT-d: don't lose errors when flushing TLBs on multiple IOMMUs
  2021-06-09  9:29 ` [PATCH 6/9] VT-d: don't lose errors when flushing TLBs on multiple IOMMUs Jan Beulich
@ 2021-06-24  5:28   ` Tian, Kevin
  0 siblings, 0 replies; 28+ messages in thread
From: Tian, Kevin @ 2021-06-24  5:28 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Paul Durrant, Cooper, Andrew

> From: Jan Beulich
> Sent: Wednesday, June 9, 2021 5:29 PM
> 
> While no longer an immediate problem with flushes no longer timing out,
> errors (if any) get properly reported by iommu_flush_iotlb_{dsi,psi}().
> Overwriting such an error with, perhaps, a success indicator received
> from another IOMMU will misguide callers. Record the first error, but
> don't bail from the loop (such that further necessary invalidation gets
> carried out on a best effort basis).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> 
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -643,7 +643,7 @@ static int __must_check iommu_flush_iotl
>      struct vtd_iommu *iommu;
>      bool_t flush_dev_iotlb;
>      int iommu_domid;
> -    int rc = 0;
> +    int ret = 0;
> 
>      /*
>       * No need pcideves_lock here because we have flush
> @@ -651,6 +651,8 @@ static int __must_check iommu_flush_iotl
>       */
>      for_each_drhd_unit ( drhd )
>      {
> +        int rc;
> +
>          iommu = drhd->iommu;
> 
>          if ( !test_bit(iommu->index, &hd->arch.vtd.iommu_bitmap) )
> @@ -673,13 +675,12 @@ static int __must_check iommu_flush_iotl
>                                         flush_dev_iotlb);
> 
>          if ( rc > 0 )
> -        {
>              iommu_flush_write_buffer(iommu);
> -            rc = 0;
> -        }
> +        else if ( !ret )
> +            ret = rc;
>      }
> 
> -    return rc;
> +    return ret;
>  }
> 
>  static int __must_check iommu_flush_iotlb_pages(struct domain *d,
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 7/9] VT-d: centralize mapping of QI entries
  2021-06-09  9:29 ` [PATCH 7/9] VT-d: centralize mapping of QI entries Jan Beulich
@ 2021-06-24  5:31   ` Tian, Kevin
  0 siblings, 0 replies; 28+ messages in thread
From: Tian, Kevin @ 2021-06-24  5:31 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Paul Durrant, Cooper, Andrew

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Wednesday, June 9, 2021 5:30 PM
> 
> Introduce a helper function to reduce redundancy. Take the opportunity
> to express the logic without using the somewhat odd QINVAL_ENTRY_ORDER.
> Also take the opportunity to uniformly unmap after updating queue tail
> and dropping the lock (like was done so far only by
> queue_invalidate_context_sync()).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> ---
> I wonder though whether we wouldn't be better off permanently mapping
> the queue(s).
> 
> --- a/xen/drivers/passthrough/vtd/qinval.c
> +++ b/xen/drivers/passthrough/vtd/qinval.c
> @@ -69,6 +69,16 @@ static void qinval_update_qtail(struct v
>      dmar_writel(iommu->reg, DMAR_IQT_REG, val << QINVAL_INDEX_SHIFT);
>  }
> 
> +static struct qinval_entry *qi_map_entry(const struct vtd_iommu *iommu,
> +                                         unsigned int index)
> +{
> +    paddr_t base = iommu->qinval_maddr +
> +                   ((index * sizeof(struct qinval_entry)) & PAGE_MASK);
> +    struct qinval_entry *entries = map_vtd_domain_page(base);
> +
> +    return &entries[index % (PAGE_SIZE / sizeof(*entries))];
> +}
> +
>  static int __must_check queue_invalidate_context_sync(struct vtd_iommu *iommu,
>                                                        u16 did, u16 source_id,
>                                                        u8 function_mask,
> @@ -76,15 +86,11 @@ static int __must_check queue_invalidate
>  {
>      unsigned long flags;
>      unsigned int index;
> -    u64 entry_base;
> -    struct qinval_entry *qinval_entry, *qinval_entries;
> +    struct qinval_entry *qinval_entry;
> 
>      spin_lock_irqsave(&iommu->register_lock, flags);
>      index = qinval_next_index(iommu);
> -    entry_base = iommu->qinval_maddr +
> -                 ((index >> QINVAL_ENTRY_ORDER) << PAGE_SHIFT);
> -    qinval_entries = map_vtd_domain_page(entry_base);
> -    qinval_entry = &qinval_entries[index % (1 << QINVAL_ENTRY_ORDER)];
> +    qinval_entry = qi_map_entry(iommu, index);
> 
>      qinval_entry->q.cc_inv_dsc.lo.type = TYPE_INVAL_CONTEXT;
>      qinval_entry->q.cc_inv_dsc.lo.granu = granu;
> @@ -98,7 +104,7 @@ static int __must_check queue_invalidate
>      qinval_update_qtail(iommu, index);
>      spin_unlock_irqrestore(&iommu->register_lock, flags);
> 
> -    unmap_vtd_domain_page(qinval_entries);
> +    unmap_vtd_domain_page(qinval_entry);
> 
>      return invalidate_sync(iommu);
>  }
> @@ -110,15 +116,11 @@ static int __must_check queue_invalidate
>  {
>      unsigned long flags;
>      unsigned int index;
> -    u64 entry_base;
> -    struct qinval_entry *qinval_entry, *qinval_entries;
> +    struct qinval_entry *qinval_entry;
> 
>      spin_lock_irqsave(&iommu->register_lock, flags);
>      index = qinval_next_index(iommu);
> -    entry_base = iommu->qinval_maddr +
> -                 ((index >> QINVAL_ENTRY_ORDER) << PAGE_SHIFT);
> -    qinval_entries = map_vtd_domain_page(entry_base);
> -    qinval_entry = &qinval_entries[index % (1 << QINVAL_ENTRY_ORDER)];
> +    qinval_entry = qi_map_entry(iommu, index);
> 
>      qinval_entry->q.iotlb_inv_dsc.lo.type = TYPE_INVAL_IOTLB;
>      qinval_entry->q.iotlb_inv_dsc.lo.granu = granu;
> @@ -133,10 +135,11 @@ static int __must_check queue_invalidate
>      qinval_entry->q.iotlb_inv_dsc.hi.res_1 = 0;
>      qinval_entry->q.iotlb_inv_dsc.hi.addr = addr >> PAGE_SHIFT_4K;
> 
> -    unmap_vtd_domain_page(qinval_entries);
>      qinval_update_qtail(iommu, index);
>      spin_unlock_irqrestore(&iommu->register_lock, flags);
> 
> +    unmap_vtd_domain_page(qinval_entry);
> +
>      return invalidate_sync(iommu);
>  }
> 
> @@ -147,17 +150,13 @@ static int __must_check queue_invalidate
>      static DEFINE_PER_CPU(uint32_t, poll_slot);
>      unsigned int index;
>      unsigned long flags;
> -    u64 entry_base;
> -    struct qinval_entry *qinval_entry, *qinval_entries;
> +    struct qinval_entry *qinval_entry;
>      uint32_t *this_poll_slot = &this_cpu(poll_slot);
> 
>      spin_lock_irqsave(&iommu->register_lock, flags);
>      ACCESS_ONCE(*this_poll_slot) = QINVAL_STAT_INIT;
>      index = qinval_next_index(iommu);
> -    entry_base = iommu->qinval_maddr +
> -                 ((index >> QINVAL_ENTRY_ORDER) << PAGE_SHIFT);
> -    qinval_entries = map_vtd_domain_page(entry_base);
> -    qinval_entry = &qinval_entries[index % (1 << QINVAL_ENTRY_ORDER)];
> +    qinval_entry = qi_map_entry(iommu, index);
> 
>      qinval_entry->q.inv_wait_dsc.lo.type = TYPE_INVAL_WAIT;
>      qinval_entry->q.inv_wait_dsc.lo.iflag = iflag;
> @@ -167,10 +166,11 @@ static int __must_check queue_invalidate
>      qinval_entry->q.inv_wait_dsc.lo.sdata = QINVAL_STAT_DONE;
>      qinval_entry->q.inv_wait_dsc.hi.saddr = virt_to_maddr(this_poll_slot);
> 
> -    unmap_vtd_domain_page(qinval_entries);
>      qinval_update_qtail(iommu, index);
>      spin_unlock_irqrestore(&iommu->register_lock, flags);
> 
> +    unmap_vtd_domain_page(qinval_entry);
> +
>      /* Now we don't support interrupt method */
>      if ( sw )
>      {
> @@ -246,16 +246,12 @@ int qinval_device_iotlb_sync(struct vtd_
>  {
>      unsigned long flags;
>      unsigned int index;
> -    u64 entry_base;
> -    struct qinval_entry *qinval_entry, *qinval_entries;
> +    struct qinval_entry *qinval_entry;
> 
>      ASSERT(pdev);
>      spin_lock_irqsave(&iommu->register_lock, flags);
>      index = qinval_next_index(iommu);
> -    entry_base = iommu->qinval_maddr +
> -                 ((index >> QINVAL_ENTRY_ORDER) << PAGE_SHIFT);
> -    qinval_entries = map_vtd_domain_page(entry_base);
> -    qinval_entry = &qinval_entries[index % (1 << QINVAL_ENTRY_ORDER)];
> +    qinval_entry = qi_map_entry(iommu, index);
> 
>      qinval_entry->q.dev_iotlb_inv_dsc.lo.type = TYPE_INVAL_DEVICE_IOTLB;
>      qinval_entry->q.dev_iotlb_inv_dsc.lo.res_1 = 0;
> @@ -268,10 +264,11 @@ int qinval_device_iotlb_sync(struct vtd_
>      qinval_entry->q.dev_iotlb_inv_dsc.hi.res_1 = 0;
>      qinval_entry->q.dev_iotlb_inv_dsc.hi.addr = addr >> PAGE_SHIFT_4K;
> 
> -    unmap_vtd_domain_page(qinval_entries);
>      qinval_update_qtail(iommu, index);
>      spin_unlock_irqrestore(&iommu->register_lock, flags);
> 
> +    unmap_vtd_domain_page(qinval_entry);
> +
>      return dev_invalidate_sync(iommu, pdev, did);
>  }
> 
> @@ -280,16 +277,12 @@ static int __must_check queue_invalidate
>  {
>      unsigned long flags;
>      unsigned int index;
> -    u64 entry_base;
> -    struct qinval_entry *qinval_entry, *qinval_entries;
> +    struct qinval_entry *qinval_entry;
>      int ret;
> 
>      spin_lock_irqsave(&iommu->register_lock, flags);
>      index = qinval_next_index(iommu);
> -    entry_base = iommu->qinval_maddr +
> -                 ((index >> QINVAL_ENTRY_ORDER) << PAGE_SHIFT);
> -    qinval_entries = map_vtd_domain_page(entry_base);
> -    qinval_entry = &qinval_entries[index % (1 << QINVAL_ENTRY_ORDER)];
> +    qinval_entry = qi_map_entry(iommu, index);
> 
>      qinval_entry->q.iec_inv_dsc.lo.type = TYPE_INVAL_IEC;
>      qinval_entry->q.iec_inv_dsc.lo.granu = granu;
> @@ -299,10 +292,11 @@ static int __must_check queue_invalidate
>      qinval_entry->q.iec_inv_dsc.lo.res_2 = 0;
>      qinval_entry->q.iec_inv_dsc.hi.res = 0;
> 
> -    unmap_vtd_domain_page(qinval_entries);
>      qinval_update_qtail(iommu, index);
>      spin_unlock_irqrestore(&iommu->register_lock, flags);
> 
> +    unmap_vtd_domain_page(qinval_entry);
> +
>      ret = invalidate_sync(iommu);
> 
>      /*


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH 8/9] VT-d: drop/move a few QI related constants
  2021-06-09  9:29 ` [PATCH 8/9] VT-d: drop/move a few QI related constants Jan Beulich
@ 2021-06-24  5:32   ` Tian, Kevin
  0 siblings, 0 replies; 28+ messages in thread
From: Tian, Kevin @ 2021-06-24  5:32 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Paul Durrant, Cooper, Andrew

> From: Jan Beulich <jbeulich@suse.com>
> Sent: Wednesday, June 9, 2021 5:30 PM
> 
> Replace uses of QINVAL_ENTRY_ORDER and QINVAL_INDEX_SHIFT, such that
> the constants can be dropped. Move the remaining QINVAL_* ones to the
> single source file using them.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

> 
> --- a/xen/drivers/passthrough/vtd/iommu.h
> +++ b/xen/drivers/passthrough/vtd/iommu.h
> @@ -451,17 +451,6 @@ struct qinval_entry {
>      }q;
>  };
> 
> -/* Each entry is 16 bytes, so 2^8 entries per page */
> -#define QINVAL_ENTRY_ORDER  ( PAGE_SHIFT - 4 )
> -#define QINVAL_MAX_ENTRY_NR (1u << (7 + QINVAL_ENTRY_ORDER))
> -
> -/* Status data flag */
> -#define QINVAL_STAT_INIT  0
> -#define QINVAL_STAT_DONE  1
> -
> -/* Queue invalidation head/tail shift */
> -#define QINVAL_INDEX_SHIFT 4
> -
>  #define TYPE_INVAL_CONTEXT      0x1
>  #define TYPE_INVAL_IOTLB        0x2
>  #define TYPE_INVAL_DEVICE_IOTLB 0x3
> --- a/xen/drivers/passthrough/vtd/qinval.c
> +++ b/xen/drivers/passthrough/vtd/qinval.c
> @@ -29,6 +29,13 @@
>  #include "extern.h"
>  #include "../ats.h"
> 
> +/* Each entry is 16 bytes, and there can be up to 2^7 pages. */
> +#define QINVAL_MAX_ENTRY_NR (1u << (7 + PAGE_SHIFT_4K - 4))
> +
> +/* Status data flag */
> +#define QINVAL_STAT_INIT  0
> +#define QINVAL_STAT_DONE  1
> +
>  static unsigned int __read_mostly qi_pg_order;
>  static unsigned int __read_mostly qi_entry_nr;
> 
> @@ -45,11 +52,11 @@ static unsigned int qinval_next_index(st
>  {
>      unsigned int tail = dmar_readl(iommu->reg, DMAR_IQT_REG);
> 
> -    tail >>= QINVAL_INDEX_SHIFT;
> +    tail /= sizeof(struct qinval_entry);
> 
>      /* (tail+1 == head) indicates a full queue, wait for HW */
>      while ( ((tail + 1) & (qi_entry_nr - 1)) ==
> -            (dmar_readl(iommu->reg, DMAR_IQH_REG) >>
> QINVAL_INDEX_SHIFT) )
> +            (dmar_readl(iommu->reg, DMAR_IQH_REG) / sizeof(struct
> qinval_entry)) )
>      {
>          printk_once(XENLOG_ERR VTDPREFIX " IOMMU#%u: no QI slot available\n",
>                      iommu->index);
> @@ -66,7 +73,7 @@ static void qinval_update_qtail(struct v
>      /* Need hold register lock when update tail */
>      ASSERT( spin_is_locked(&iommu->register_lock) );
>      val = (index + 1) & (qi_entry_nr - 1);
> -    dmar_writel(iommu->reg, DMAR_IQT_REG, val << QINVAL_INDEX_SHIFT);
> +    dmar_writel(iommu->reg, DMAR_IQT_REG, val * sizeof(struct qinval_entry));
>  }
> 
>  static struct qinval_entry *qi_map_entry(const struct vtd_iommu *iommu,
> @@ -413,17 +420,18 @@ int enable_qinval(struct vtd_iommu *iomm
>               * only one entry left.
>               */
>              BUILD_BUG_ON(CONFIG_NR_CPUS * 2 >= QINVAL_MAX_ENTRY_NR);
> -            qi_pg_order = get_order_from_bytes((num_present_cpus() * 2 + 1) <<
> -                                               (PAGE_SHIFT -
> -                                                QINVAL_ENTRY_ORDER));
> -            qi_entry_nr = 1u << (qi_pg_order + QINVAL_ENTRY_ORDER);
> +            qi_pg_order = get_order_from_bytes((num_present_cpus() * 2 + 1) *
> +                                               sizeof(struct qinval_entry));
> +            qi_entry_nr = (PAGE_SIZE << qi_pg_order) /
> +                          sizeof(struct qinval_entry);
> 
>              dprintk(XENLOG_INFO VTDPREFIX,
>                      "QI: using %u-entry ring(s)\n", qi_entry_nr);
>          }
> 
>          iommu->qinval_maddr =
> -            alloc_pgtable_maddr(qi_entry_nr >> QINVAL_ENTRY_ORDER,
> +            alloc_pgtable_maddr(PFN_DOWN(qi_entry_nr *
> +                                         sizeof(struct qinval_entry)),
>                                  iommu->node);
>          if ( iommu->qinval_maddr == 0 )
>          {


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 9/9] IOMMU/PCI: don't let domain cleanup continue when device de-assignment failed
  2021-06-09  9:30 ` [PATCH 9/9] IOMMU/PCI: don't let domain cleanup continue when device de-assignment failed Jan Beulich
@ 2021-06-24 15:34   ` Paul Durrant
  0 siblings, 0 replies; 28+ messages in thread
From: Paul Durrant @ 2021-06-24 15:34 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Andrew Cooper

On 09/06/2021 10:30, Jan Beulich wrote:
> Failure here could in principle mean the device may still be issuing DMA
> requests, which would continue to be translated by the page tables the
> device entry currently points at. With this we cannot allow the
> subsequent cleanup step of freeing the page tables to occur, to prevent
> use-after-free issues. We would need to accept, for the time being, that
> in such a case the remaining domain resources will all be leaked, and
> the domain will continue to exist as a zombie.
> 
> However, with flushes no longer timing out (and with proper timeout
> detection for device I/O TLB flushing yet to be implemented), there's no
> way anymore for failures to occur, except due to bugs elsewhere. Hence
> the change here is merely a "just in case" one.
> 
> In order to continue the loop in spite of an error, we can't use
> pci_get_pdev_by_domain() anymore. I have no idea why it was used here in
> the first place, instead of the cheaper list iteration.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Paul Durrant <paul@xen.org>

> ---
> A first step beyond this could be to have the backing functions of
> deassign_device() allow the caller to tell whether the failure was from
> removing the device from the domain being cleaned up, or from re-setup
> in wherever the device was supposed to get moved to. In the latter case
> we could allow domain cleanup to continue. I wonder whether we could
> simply make those functions return "success" anyway, overriding their
> returning of an error when ->is_dying is set.
> 
> A next step then might be to figure whether there's any "emergency"
> adjustment that could be done instead of the full-fledged (and failed)
> de-assign, to allow at least recovering all the memory from the guest.
> 
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -894,7 +894,7 @@ static int deassign_device(struct domain
>   
>   int pci_release_devices(struct domain *d)
>   {
> -    struct pci_dev *pdev;
> +    struct pci_dev *pdev, *tmp;
>       u8 bus, devfn;
>       int ret;
>   
> @@ -905,15 +905,15 @@ int pci_release_devices(struct domain *d
>           pcidevs_unlock();
>           return ret;
>       }
> -    while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
> +    list_for_each_entry_safe ( pdev, tmp, &d->pdev_list, domain_list )
>       {
>           bus = pdev->bus;
>           devfn = pdev->devfn;
> -        deassign_device(d, pdev->seg, bus, devfn);
> +        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>       }
>       pcidevs_unlock();
>   
> -    return 0;
> +    return ret;
>   }
>   
>   #define PCI_CLASS_BRIDGE_HOST    0x0600
> 



^ permalink raw reply	[flat|nested] 28+ messages in thread
