All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list
@ 2016-11-18  1:57 Feng Wu
  2016-11-18  1:57 ` [PATCH v8 1/7] VMX: Permanently assign PI hook vmx_pi_switch_to() Feng Wu
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: Feng Wu @ 2016-11-18  1:57 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, Feng Wu, george.dunlap, andrew.cooper3,
	dario.faggioli, jbeulich

The current VT-d PI related code may operate incorrectly in the
following scenarios:
1. When the last assigned device is dettached from the domain, all
the PI related hooks are removed then, however, the vCPU can be
blocked, switched to another pCPU, etc, all without the aware of
PI. After the next time we attach another device to the domain,
which makes the PI realted hooks avaliable again, the status
of the pi descriptor is not true. Beside that, the blocking vcpu
may still remain in the per-cpu blocking in this case. Patch [1/7]
and [2/7] handle this.

2. [4/7] unify the code path of PI mode update and remapped mode update

2. When IRTE is in posted mode, we don't need to set the irq
affinity for it, since the destination of these interrupts is
vCPU and the vCPU affinity is set during vCPU scheduling. Patch
[5/7] handles this.

4. [6/7] is a cleanup patch

5. When a pCPU is unplugged, and there might be vCPUs on its
list. Since the pCPU is offline, those vCPUs might not be woken
up again. [7/7] addresses it.


Feng Wu (7):
  VMX: Permanently assign PI hook vmx_pi_switch_to()
  VMX: Properly handle pi when all the assigned devices are removed
  VMX: Make sure PI is in proper state before install the hooks
  VT-d: Use one function to update both remapped and posted IRTE
  VT-d: No need to set irq affinity for posted format IRTE
  VT-d: Some cleanups
  VMX: Fixup PI descriptor when cpu is offline

 xen/arch/x86/hvm/vmx/vmcs.c            |  14 +--
 xen/arch/x86/hvm/vmx/vmx.c             | 133 ++++++++++++++++++++++-
 xen/drivers/passthrough/pci.c          |  14 +++
 xen/drivers/passthrough/vtd/intremap.c | 193 +++++++++++++++------------------
 xen/include/asm-x86/hvm/vmx/vmx.h      |   3 +
 5 files changed, 240 insertions(+), 117 deletions(-)

-- 
2.1.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v8 1/7] VMX: Permanently assign PI hook vmx_pi_switch_to()
  2016-11-18  1:57 [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list Feng Wu
@ 2016-11-18  1:57 ` Feng Wu
  2016-11-18  3:14   ` Tian, Kevin
  2016-11-18  1:57 ` [PATCH v8 2/7] VMX: Properly handle pi when all the assigned devices are removed Feng Wu
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Feng Wu @ 2016-11-18  1:57 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, Feng Wu, george.dunlap, andrew.cooper3,
	dario.faggioli, jbeulich

PI hook vmx_pi_switch_to() is needed even when any previously
assigned device is detached from the domain. Since 'SN' bit is
also used to control the CPU side PI and we change the state of
SN bit in these two functions, then evaluate this bit in
vmx_deliver_posted_intr() when trying to deliver the interrupt
in posted way via software. The problem is if we deassign the
hooks while the vCPU is runnable in the runqueue with 'SN' set,
all the furture notificaton event will be suppressed. This patch
makes the hook permanently assigned.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v8:
- Comments changes

v7:
- comments changes.

v6:
- Adjust the comments and wording.

v5:
- Zap "pi_switch_from" hook

v4:
- Don't zap vmx_pi_switch_from() and vmx_pi_switch_to() when
any previously assigned device is detached from the domain.
- Comments changes.

 xen/arch/x86/hvm/vmx/vmx.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 3d330b6..f3911f2 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -222,8 +222,14 @@ void vmx_pi_hooks_deassign(struct domain *d)
 
     d->arch.hvm_domain.vmx.vcpu_block = NULL;
     d->arch.hvm_domain.vmx.pi_switch_from = NULL;
-    d->arch.hvm_domain.vmx.pi_switch_to = NULL;
     d->arch.hvm_domain.vmx.pi_do_resume = NULL;
+
+    /*
+     * In fact, we could set 'd->arch.hvm_domain.vmx.pi_switch_to' to NULL
+     * in vmx_pi_switch_to() if no new device is in the process of getting
+     * assigned and "from" hook is NULL. However, it is not straightforward
+     * to find a clear solution, so just leave it here.
+     */
 }
 
 static int vmx_domain_initialise(struct domain *d)
-- 
2.1.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 2/7] VMX: Properly handle pi when all the assigned devices are removed
  2016-11-18  1:57 [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list Feng Wu
  2016-11-18  1:57 ` [PATCH v8 1/7] VMX: Permanently assign PI hook vmx_pi_switch_to() Feng Wu
@ 2016-11-18  1:57 ` Feng Wu
  2016-11-18  3:19   ` Tian, Kevin
  2016-11-18  1:57 ` [PATCH v8 3/7] VMX: Make sure PI is in proper state before install the hooks Feng Wu
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Feng Wu @ 2016-11-18  1:57 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, Feng Wu, george.dunlap, andrew.cooper3,
	dario.faggioli, jbeulich

This patch handles some corner cases when the last assigned device
is removed from the domain. In this case we should carefully handle
pi descriptor and the per-cpu blocking list, to make sure:
- all the PI descriptor are in the right state when next time a
devices is assigned to the domain again.
- No remaining vcpus of the domain in the per-cpu blocking list.

Here we call vmx_pi_unblock_vcpu() to remove the vCPU from the blocking list
if it is on the list. However, this could happen when vmx_vcpu_block() is
being called, hence we might incorrectly add the vCPU to the blocking list
while the last devcie is detached from the domain. Consider that the situation
can only occur when detaching the last device from the domain and it is not
a frequent operation, so we use domain_pause before that, which is considered
as an clean and maintainable solution for the situation.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v7:
- Prevent the domain from pausing itself.

v6:
- Comments changes
- Rename vmx_pi_list_remove() to vmx_pi_unblock_vcpu()

v5:
- Remove a no-op wrapper

v4:
- Rename some functions:
	vmx_pi_remove_vcpu_from_blocking_list() -> vmx_pi_list_remove()
	vmx_pi_blocking_cleanup() -> vmx_pi_list_cleanup()
- Remove the check in vmx_pi_list_cleanup()
- Comments adjustment

 xen/arch/x86/hvm/vmx/vmx.c    | 28 ++++++++++++++++++++++++----
 xen/drivers/passthrough/pci.c | 14 ++++++++++++++
 2 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index f3911f2..a8dcabe 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -158,14 +158,12 @@ static void vmx_pi_switch_to(struct vcpu *v)
     pi_clear_sn(pi_desc);
 }
 
-static void vmx_pi_do_resume(struct vcpu *v)
+static void vmx_pi_unblock_vcpu(struct vcpu *v)
 {
     unsigned long flags;
     spinlock_t *pi_blocking_list_lock;
     struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
 
-    ASSERT(!test_bit(_VPF_blocked, &v->pause_flags));
-
     /*
      * Set 'NV' field back to posted_intr_vector, so the
      * Posted-Interrupts can be delivered to the vCPU when
@@ -173,12 +171,12 @@ static void vmx_pi_do_resume(struct vcpu *v)
      */
     write_atomic(&pi_desc->nv, posted_intr_vector);
 
-    /* The vCPU is not on any blocking list. */
     pi_blocking_list_lock = v->arch.hvm_vmx.pi_blocking.lock;
 
     /* Prevent the compiler from eliminating the local variable.*/
     smp_rmb();
 
+    /* The vCPU is not on any blocking list. */
     if ( pi_blocking_list_lock == NULL )
         return;
 
@@ -198,6 +196,13 @@ static void vmx_pi_do_resume(struct vcpu *v)
     spin_unlock_irqrestore(pi_blocking_list_lock, flags);
 }
 
+static void vmx_pi_do_resume(struct vcpu *v)
+{
+    ASSERT(!test_bit(_VPF_blocked, &v->pause_flags));
+
+    vmx_pi_unblock_vcpu(v);
+}
+
 /* This function is called when pcidevs_lock is held */
 void vmx_pi_hooks_assign(struct domain *d)
 {
@@ -215,11 +220,21 @@ void vmx_pi_hooks_assign(struct domain *d)
 /* This function is called when pcidevs_lock is held */
 void vmx_pi_hooks_deassign(struct domain *d)
 {
+    struct vcpu *v;
+
     if ( !iommu_intpost || !has_hvm_container_domain(d) )
         return;
 
     ASSERT(d->arch.hvm_domain.vmx.vcpu_block);
 
+    /*
+     * Pausing the domain can make sure the vCPU is not
+     * running and hence not calling the hooks simultaneously
+     * when deassigning the PI hooks and removing the vCPU
+     * from the blocking list.
+     */
+    domain_pause(d);
+
     d->arch.hvm_domain.vmx.vcpu_block = NULL;
     d->arch.hvm_domain.vmx.pi_switch_from = NULL;
     d->arch.hvm_domain.vmx.pi_do_resume = NULL;
@@ -230,6 +245,11 @@ void vmx_pi_hooks_deassign(struct domain *d)
      * assigned and "from" hook is NULL. However, it is not straightforward
      * to find a clear solution, so just leave it here.
      */
+
+    for_each_vcpu ( d, v )
+        vmx_pi_unblock_vcpu(v);
+
+    domain_unpause(d);
 }
 
 static int vmx_domain_initialise(struct domain *d)
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 8bce213..e71732f 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1602,6 +1602,13 @@ int iommu_do_pci_domctl(
         break;
 
     case XEN_DOMCTL_assign_device:
+        /* no domain_pause() */
+        if ( d == current->domain )
+        {
+            ret = -EINVAL;
+            break;
+        }
+
         ret = -ENODEV;
         if ( domctl->u.assign_device.dev != XEN_DOMCTL_DEV_PCI )
             break;
@@ -1642,6 +1649,13 @@ int iommu_do_pci_domctl(
         break;
 
     case XEN_DOMCTL_deassign_device:
+        /* no domain_pause() */
+        if ( d == current->domain )
+        {
+            ret = -EINVAL;
+            break;
+        }
+
         ret = -ENODEV;
         if ( domctl->u.assign_device.dev != XEN_DOMCTL_DEV_PCI )
             break;
-- 
2.1.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 3/7] VMX: Make sure PI is in proper state before install the hooks
  2016-11-18  1:57 [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list Feng Wu
  2016-11-18  1:57 ` [PATCH v8 1/7] VMX: Permanently assign PI hook vmx_pi_switch_to() Feng Wu
  2016-11-18  1:57 ` [PATCH v8 2/7] VMX: Properly handle pi when all the assigned devices are removed Feng Wu
@ 2016-11-18  1:57 ` Feng Wu
  2016-11-18  4:11   ` Tian, Kevin
  2016-11-18  1:57 ` [PATCH v8 4/7] VT-d: Use one function to update both remapped and posted IRTE Feng Wu
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Feng Wu @ 2016-11-18  1:57 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, Feng Wu, george.dunlap, andrew.cooper3,
	dario.faggioli, jbeulich

We may hit the last ASSERT() in vmx_vcpu_block in the current code,
since vmx_vcpu_block() may get called before vmx_pi_switch_to()
has been installed or executed. Here We use cmpxchg to update
the NDST field, this can make sure we only update the NDST when
vmx_pi_switch_to() has not been called. So the NDST is in a
proper state in vmx_vcpu_block().

Suggested-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v6:
- Comments changes
- Define macro 'APIC_INVALID_DEST' for '0xffffffff'

v5:
- Use 0xffffffff as the invalid value for NDST field.

v4:
- This patch is previously called "Pause/Unpause the domain before/after assigning PI hooks"
- Remove the pause/unpause method
- Use cmpxchg to update NDST

 xen/arch/x86/hvm/vmx/vmcs.c       | 13 +++++--------
 xen/arch/x86/hvm/vmx/vmx.c        | 27 ++++++++++++++++++++++++++-
 xen/include/asm-x86/hvm/vmx/vmx.h |  2 ++
 3 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 1bd875a..e8e3616 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -956,16 +956,13 @@ void virtual_vmcs_vmwrite(const struct vcpu *v, u32 vmcs_encoding, u64 val)
  */
 static void pi_desc_init(struct vcpu *v)
 {
-    uint32_t dest;
-
     v->arch.hvm_vmx.pi_desc.nv = posted_intr_vector;
 
-    dest = cpu_physical_id(v->processor);
-
-    if ( x2apic_enabled )
-        v->arch.hvm_vmx.pi_desc.ndst = dest;
-    else
-        v->arch.hvm_vmx.pi_desc.ndst = MASK_INSR(dest, PI_xAPIC_NDST_MASK);
+    /*
+     * Mark NDST as invalid, then we can use this invalid value as a
+     * marker to whether update NDST or not in vmx_pi_hooks_assign().
+     */
+    v->arch.hvm_vmx.pi_desc.ndst = APIC_INVALID_DEST;
 }
 
 static int construct_vmcs(struct vcpu *v)
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index a8dcabe..a1f7903 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -206,14 +206,39 @@ static void vmx_pi_do_resume(struct vcpu *v)
 /* This function is called when pcidevs_lock is held */
 void vmx_pi_hooks_assign(struct domain *d)
 {
+    struct vcpu *v;
+
     if ( !iommu_intpost || !has_hvm_container_domain(d) )
         return;
 
     ASSERT(!d->arch.hvm_domain.vmx.vcpu_block);
 
-    d->arch.hvm_domain.vmx.vcpu_block = vmx_vcpu_block;
+    /*
+     * We carefully handle the timing here:
+     * - Install the context switch first
+     * - Then set the NDST field
+     * - Install the block and resume hooks in the end
+     *
+     * This can make sure the PI (especially the NDST feild) is
+     * in proper state when we call vmx_vcpu_block().
+     */
     d->arch.hvm_domain.vmx.pi_switch_from = vmx_pi_switch_from;
     d->arch.hvm_domain.vmx.pi_switch_to = vmx_pi_switch_to;
+
+    for_each_vcpu ( d, v )
+    {
+        unsigned int dest = cpu_physical_id(v->processor);
+        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+
+        /*
+         * We don't need to update NDST if vmx_pi_switch_to()
+         * has already got called.
+         */
+        (void)cmpxchg(&pi_desc->ndst, APIC_INVALID_DEST,
+                x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));
+    }
+
+    d->arch.hvm_domain.vmx.vcpu_block = vmx_vcpu_block;
     d->arch.hvm_domain.vmx.pi_do_resume = vmx_pi_do_resume;
 }
 
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 4cdd9b1..2f0435c 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -573,6 +573,8 @@ void vmx_pi_per_cpu_init(unsigned int cpu);
 void vmx_pi_hooks_assign(struct domain *d);
 void vmx_pi_hooks_deassign(struct domain *d);
 
+#define APIC_INVALID_DEST           0xffffffff
+
 /* EPT violation qualifications definitions */
 #define _EPT_READ_VIOLATION         0
 #define EPT_READ_VIOLATION          (1UL<<_EPT_READ_VIOLATION)
-- 
2.1.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 4/7] VT-d: Use one function to update both remapped and posted IRTE
  2016-11-18  1:57 [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list Feng Wu
                   ` (2 preceding siblings ...)
  2016-11-18  1:57 ` [PATCH v8 3/7] VMX: Make sure PI is in proper state before install the hooks Feng Wu
@ 2016-11-18  1:57 ` Feng Wu
  2016-11-18  4:31   ` Tian, Kevin
  2016-11-22  9:58   ` Jan Beulich
  2016-11-18  1:57 ` [PATCH v8 5/7] VT-d: No need to set irq affinity for posted format IRTE Feng Wu
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 29+ messages in thread
From: Feng Wu @ 2016-11-18  1:57 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, Feng Wu, george.dunlap, andrew.cooper3,
	dario.faggioli, jbeulich

Use one function to update both remapped IRTE and posted IRET.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v8:
- Newly added

 xen/drivers/passthrough/vtd/intremap.c | 162 ++++++++++++++-------------------
 1 file changed, 66 insertions(+), 96 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index bfd468b..fd2a49a 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -420,7 +420,7 @@ void io_apic_write_remap_rte(
         __ioapic_write_entry(apic, ioapic_pin, 1, old_rte);
 }
 
-static void set_msi_source_id(struct pci_dev *pdev, struct iremap_entry *ire)
+static void set_msi_source_id(const struct pci_dev *pdev, struct iremap_entry *ire)
 {
     u16 seg;
     u8 bus, devfn, secbus;
@@ -548,11 +548,12 @@ static int remap_entry_to_msi_msg(
 }
 
 static int msi_msg_to_remap_entry(
-    struct iommu *iommu, struct pci_dev *pdev,
-    struct msi_desc *msi_desc, struct msi_msg *msg)
+    struct iommu *iommu, const struct pci_dev *pdev,
+    struct msi_desc *msi_desc, struct msi_msg *msg,
+    const struct pi_desc *pi_desc, const uint8_t gvec)
 {
     struct iremap_entry *iremap_entry = NULL, *iremap_entries;
-    struct iremap_entry new_ire;
+    struct iremap_entry new_ire, old_ire;
     struct msi_msg_remap_entry *remap_rte;
     unsigned int index, i, nr = 1;
     unsigned long flags;
@@ -597,31 +598,50 @@ static int msi_msg_to_remap_entry(
 
     memcpy(&new_ire, iremap_entry, sizeof(struct iremap_entry));
 
-    /* Set interrupt remapping table entry */
-    new_ire.remap.fpd = 0;
-    new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
-    new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
-    new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
-    /* Hardware require RH = 1 for LPR delivery mode */
-    new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
-    new_ire.remap.avail = 0;
-    new_ire.remap.res_1 = 0;
-    new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
-                            MSI_DATA_VECTOR_MASK;
-    new_ire.remap.res_2 = 0;
-    if ( x2apic_enabled )
-        new_ire.remap.dst = msg->dest32;
+    if ( !pi_desc )
+    {
+        /* Set interrupt remapping table entry */
+        new_ire.remap.fpd = 0;
+        new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
+        new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
+        new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
+        /* Hardware require RH = 1 for LPR delivery mode */
+        new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
+        new_ire.remap.avail = 0;
+        new_ire.remap.res_1 = 0;
+        new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
+                                MSI_DATA_VECTOR_MASK;
+        new_ire.remap.res_2 = 0;
+        if ( x2apic_enabled )
+            new_ire.remap.dst = msg->dest32;
+        else
+            new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
+                                 & 0xff) << 8;
+
+        new_ire.remap.res_3 = 0;
+        new_ire.remap.res_4 = 0;
+        new_ire.remap.p = 1;    /* finally, set present bit */
+    }
     else
-        new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
-                             & 0xff) << 8;
+    {
+        new_ire.post.fpd = 0;
+        new_ire.post.res_1 = 0;
+        new_ire.post.res_2 = 0;
+        new_ire.post.urg = 0;
+        new_ire.post.im = 1;
+        new_ire.post.vector = gvec;
+        new_ire.post.res_3 = 0;
+        new_ire.post.res_4 = 0;
+        new_ire.post.res_5 = 0;
+        new_ire.post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
+        new_ire.post.pda_h = virt_to_maddr(pi_desc) >> 32;
+        new_ire.post.p = 1;    /* finally, set present bit */
+    }
 
     if ( pdev )
         set_msi_source_id(pdev, &new_ire);
     else
         set_hpet_source_id(msi_desc->hpet_id, &new_ire);
-    new_ire.remap.res_3 = 0;
-    new_ire.remap.res_4 = 0;
-    new_ire.remap.p = 1;    /* finally, set present bit */
 
     /* now construct new MSI/MSI-X rte entry */
     remap_rte = (struct msi_msg_remap_entry *)msg;
@@ -637,7 +657,23 @@ static int msi_msg_to_remap_entry(
     remap_rte->address_hi = 0;
     remap_rte->data = index - i;
 
-    memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
+    if ( !pi_desc )
+        memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
+    else
+    {
+        __uint128_t ret;
+
+        old_ire = *iremap_entry;
+        ret = cmpxchg16b(iremap_entry, &old_ire, &new_ire);
+
+        /*
+         * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
+         * and the hardware cannot update the IRTE behind us, so the return value
+         * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
+         */
+        ASSERT(ret == old_ire.val);
+    }
+
     iommu_flush_cache_entry(iremap_entry, sizeof(struct iremap_entry));
     iommu_flush_iec_index(iommu, 0, index);
 
@@ -668,7 +704,8 @@ int msi_msg_write_remap_rte(
 
     drhd = pdev ? acpi_find_matched_drhd_unit(pdev)
                 : hpet_to_drhd(msi_desc->hpet_id);
-    return drhd ? msi_msg_to_remap_entry(drhd->iommu, pdev, msi_desc, msg)
+    return drhd ? msi_msg_to_remap_entry(drhd->iommu, pdev,
+                                         msi_desc, msg, NULL, 0)
                 : -EINVAL;
 }
 
@@ -902,42 +939,6 @@ void iommu_disable_x2apic_IR(void)
         disable_qinval(drhd->iommu);
 }
 
-static void setup_posted_irte(
-    struct iremap_entry *new_ire, const struct iremap_entry *old_ire,
-    const struct pi_desc *pi_desc, const uint8_t gvec)
-{
-    memset(new_ire, 0, sizeof(*new_ire));
-
-    /*
-     * 'im' filed decides whether the irte is in posted format (with value 1)
-     * or remapped format (with value 0), if the old irte is in remapped format,
-     * we copy things from remapped part in 'struct iremap_entry', otherwise,
-     * we copy from posted part.
-     */
-    if ( !old_ire->remap.im )
-    {
-        new_ire->post.p = old_ire->remap.p;
-        new_ire->post.fpd = old_ire->remap.fpd;
-        new_ire->post.sid = old_ire->remap.sid;
-        new_ire->post.sq = old_ire->remap.sq;
-        new_ire->post.svt = old_ire->remap.svt;
-    }
-    else
-    {
-        new_ire->post.p = old_ire->post.p;
-        new_ire->post.fpd = old_ire->post.fpd;
-        new_ire->post.sid = old_ire->post.sid;
-        new_ire->post.sq = old_ire->post.sq;
-        new_ire->post.svt = old_ire->post.svt;
-        new_ire->post.urg = old_ire->post.urg;
-    }
-
-    new_ire->post.im = 1;
-    new_ire->post.vector = gvec;
-    new_ire->post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
-    new_ire->post.pda_h = virt_to_maddr(pi_desc) >> 32;
-}
-
 /*
  * This function is used to update the IRTE for posted-interrupt
  * when guest changes MSI/MSI-X information.
@@ -946,17 +947,12 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
     const uint8_t gvec)
 {
     struct irq_desc *desc;
-    const struct msi_desc *msi_desc;
-    int remap_index;
+    struct msi_desc *msi_desc;
     int rc = 0;
     const struct pci_dev *pci_dev;
     const struct acpi_drhd_unit *drhd;
     struct iommu *iommu;
-    struct ir_ctrl *ir_ctrl;
-    struct iremap_entry *iremap_entries = NULL, *p = NULL;
-    struct iremap_entry new_ire, old_ire;
     const struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
-    __uint128_t ret;
 
     desc = pirq_spin_lock_irq_desc(pirq, NULL);
     if ( !desc )
@@ -976,8 +972,6 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
         goto unlock_out;
     }
 
-    remap_index = msi_desc->remap_index;
-
     spin_unlock_irq(&desc->lock);
 
     ASSERT(pcidevs_locked());
@@ -992,35 +986,11 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
         return -ENODEV;
 
     iommu = drhd->iommu;
-    ir_ctrl = iommu_ir_ctrl(iommu);
-    if ( !ir_ctrl )
+    if ( !iommu_ir_ctrl(iommu) )
         return -ENODEV;
 
-    spin_lock_irq(&ir_ctrl->iremap_lock);
-
-    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
-
-    old_ire = *p;
-
-    /* Setup/Update interrupt remapping table entry. */
-    setup_posted_irte(&new_ire, &old_ire, pi_desc, gvec);
-    ret = cmpxchg16b(p, &old_ire, &new_ire);
-
-    /*
-     * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
-     * and the hardware cannot update the IRTE behind us, so the return value
-     * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
-     */
-    ASSERT(ret == old_ire.val);
-
-    iommu_flush_cache_entry(p, sizeof(*p));
-    iommu_flush_iec_index(iommu, 0, remap_index);
-
-    unmap_vtd_domain_page(iremap_entries);
-
-    spin_unlock_irq(&ir_ctrl->iremap_lock);
-
-    return 0;
+    return msi_msg_to_remap_entry(iommu, pci_dev, msi_desc, &msi_desc->msg,
+                                  pi_desc, gvec);
 
  unlock_out:
     spin_unlock_irq(&desc->lock);
-- 
2.1.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 5/7] VT-d: No need to set irq affinity for posted format IRTE
  2016-11-18  1:57 [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list Feng Wu
                   ` (3 preceding siblings ...)
  2016-11-18  1:57 ` [PATCH v8 4/7] VT-d: Use one function to update both remapped and posted IRTE Feng Wu
@ 2016-11-18  1:57 ` Feng Wu
  2016-11-18  1:57 ` [PATCH v8 6/7] VT-d: Some cleanups Feng Wu
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Feng Wu @ 2016-11-18  1:57 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, Feng Wu, george.dunlap, andrew.cooper3,
	dario.faggioli, jbeulich

We don't set the affinity for posted format IRTE, since the
destination of these interrupts is vCPU and the vCPU affinity
is set during vCPU scheduling.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v8:
- Changes based on [6/7]

v7:
- Compare all the field in IRTE to justify whether we can suppress the update

v6:
- Make pi_can_suppress_irte_update() a check-only function
- Introduce another function pi_get_new_irte() to update the 'new_ire' if needed

v5:
- Only suppress affinity related IRTE updates for PI

v4:
- Keep the construction of new_ire and only modify the hardware
IRTE when it is not in posted mode.

 xen/drivers/passthrough/vtd/intremap.c | 87 ++++++++++++++++++++--------------
 1 file changed, 52 insertions(+), 35 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index fd2a49a..0cb8c37 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -600,27 +600,41 @@ static int msi_msg_to_remap_entry(
 
     if ( !pi_desc )
     {
-        /* Set interrupt remapping table entry */
-        new_ire.remap.fpd = 0;
-        new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
-        new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
-        new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
-        /* Hardware require RH = 1 for LPR delivery mode */
-        new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
-        new_ire.remap.avail = 0;
-        new_ire.remap.res_1 = 0;
-        new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
-                                MSI_DATA_VECTOR_MASK;
-        new_ire.remap.res_2 = 0;
-        if ( x2apic_enabled )
-            new_ire.remap.dst = msg->dest32;
-        else
-            new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
-                                 & 0xff) << 8;
+        /*
+         * We are here because we are trying to update the IRTE to remapped mode,
+         * we only need to update the remapped specific fields for the following
+         * two cases:
+         * 1. When we create a new IRTE. A new IRTE is created when we create a
+         *    new irq, so a new IRTE is always initialized with remapped format.
+         * 2. When the old IRTE is present and in remapped mode. Since if the old
+         *    IRTE is in posted mode, we cannot update it to remapped mode and
+         *    this is what we need to suppress. So we don't update the remapped
+         *    specific fields here, we only update the commom field.
+         */
+        if ( !iremap_entry->remap.p || !iremap_entry->remap.im )
+        {
+            /* Set interrupt remapping table entry */
+            new_ire.remap.fpd = 0;
+            new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
+            new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
+            new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
+            /* Hardware require RH = 1 for LPR delivery mode */
+            new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
+            new_ire.remap.avail = 0;
+            new_ire.remap.res_1 = 0;
+            new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
+                                    MSI_DATA_VECTOR_MASK;
+            new_ire.remap.res_2 = 0;
+            if ( x2apic_enabled )
+                new_ire.remap.dst = msg->dest32;
+            else
+                new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
+                                     & 0xff) << 8;
 
-        new_ire.remap.res_3 = 0;
-        new_ire.remap.res_4 = 0;
-        new_ire.remap.p = 1;    /* finally, set present bit */
+            new_ire.remap.res_3 = 0;
+            new_ire.remap.res_4 = 0;
+            new_ire.remap.p = 1;    /* finally, set present bit */
+        }
     }
     else
     {
@@ -657,25 +671,28 @@ static int msi_msg_to_remap_entry(
     remap_rte->address_hi = 0;
     remap_rte->data = index - i;
 
-    if ( !pi_desc )
-        memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
-    else
+    if ( iremap_entry->val != new_ire.val )
     {
-        __uint128_t ret;
+        if ( !pi_desc )
+            memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
+        else
+        {
+            __uint128_t ret;
 
-        old_ire = *iremap_entry;
-        ret = cmpxchg16b(iremap_entry, &old_ire, &new_ire);
+            old_ire = *iremap_entry;
+            ret = cmpxchg16b(iremap_entry, &old_ire, &new_ire);
 
-        /*
-         * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
-         * and the hardware cannot update the IRTE behind us, so the return value
-         * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
-         */
-        ASSERT(ret == old_ire.val);
-    }
+            /*
+             * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
+             * and the hardware cannot update the IRTE behind us, so the return value
+             * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
+             */
+            ASSERT(ret == old_ire.val);
+        }
 
-    iommu_flush_cache_entry(iremap_entry, sizeof(struct iremap_entry));
-    iommu_flush_iec_index(iommu, 0, index);
+        iommu_flush_cache_entry(iremap_entry, sizeof(struct iremap_entry));
+        iommu_flush_iec_index(iommu, 0, index);
+    }
 
     unmap_vtd_domain_page(iremap_entries);
     spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
-- 
2.1.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 6/7] VT-d: Some cleanups
  2016-11-18  1:57 [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list Feng Wu
                   ` (4 preceding siblings ...)
  2016-11-18  1:57 ` [PATCH v8 5/7] VT-d: No need to set irq affinity for posted format IRTE Feng Wu
@ 2016-11-18  1:57 ` Feng Wu
  2016-11-18  4:45   ` Tian, Kevin
  2016-11-18  1:57 ` [PATCH v8 7/7] VMX: Fixup PI descriptor when cpu is offline Feng Wu
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Feng Wu @ 2016-11-18  1:57 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, Feng Wu, george.dunlap, andrew.cooper3,
	dario.faggioli, jbeulich

Use type-safe structure assignment instead of memcpy()
Use sizeof(*iremap_entry).

Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v7:
- Remove a useless cleanup

v6:
- More descripion about the patch

 xen/drivers/passthrough/vtd/intremap.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index 0cb8c37..8664194 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -183,8 +183,8 @@ static void free_remap_entry(struct iommu *iommu, int index)
     GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
                      iremap_entries, iremap_entry);
 
-    memset(iremap_entry, 0, sizeof(struct iremap_entry));
-    iommu_flush_cache_entry(iremap_entry, sizeof(struct iremap_entry));
+    memset(iremap_entry, 0, sizeof(*iremap_entry));
+    iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry));
     iommu_flush_iec_index(iommu, 0, index);
 
     unmap_vtd_domain_page(iremap_entries);
@@ -310,7 +310,7 @@ static int ioapic_rte_to_remap_entry(struct iommu *iommu,
     GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
                      iremap_entries, iremap_entry);
 
-    memcpy(&new_ire, iremap_entry, sizeof(struct iremap_entry));
+    new_ire = *iremap_entry;
 
     if ( rte_upper )
     {
@@ -353,8 +353,8 @@ static int ioapic_rte_to_remap_entry(struct iommu *iommu,
         remap_rte->format = 1;    /* indicate remap format */
     }
 
-    memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
-    iommu_flush_cache_entry(iremap_entry, sizeof(struct iremap_entry));
+    *iremap_entry = new_ire;
+    iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry));
     iommu_flush_iec_index(iommu, 0, index);
 
     unmap_vtd_domain_page(iremap_entries);
@@ -674,7 +674,7 @@ static int msi_msg_to_remap_entry(
     if ( iremap_entry->val != new_ire.val )
     {
         if ( !pi_desc )
-            memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
+            *iremap_entry = new_ire;
         else
         {
             __uint128_t ret;
@@ -690,7 +690,7 @@ static int msi_msg_to_remap_entry(
             ASSERT(ret == old_ire.val);
         }
 
-        iommu_flush_cache_entry(iremap_entry, sizeof(struct iremap_entry));
+        iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry));
         iommu_flush_iec_index(iommu, 0, index);
     }
 
-- 
2.1.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v8 7/7] VMX: Fixup PI descriptor when cpu is offline
  2016-11-18  1:57 [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list Feng Wu
                   ` (5 preceding siblings ...)
  2016-11-18  1:57 ` [PATCH v8 6/7] VT-d: Some cleanups Feng Wu
@ 2016-11-18  1:57 ` Feng Wu
  2016-11-18  4:55   ` Tian, Kevin
  2016-11-18  3:02 ` [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list Tian, Kevin
  2017-02-21 21:30 ` Chao Gao
  8 siblings, 1 reply; 29+ messages in thread
From: Feng Wu @ 2016-11-18  1:57 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, Feng Wu, george.dunlap, andrew.cooper3,
	dario.faggioli, jbeulich

When cpu is offline, we need to move all the vcpus in its blocking
list to another online cpu, this patch handles it.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v7:
- Pass unsigned int to vmx_pi_desc_fixup()

v6:
- Carefully suppress 'SN' to avoid missing notification event
during moving the vcpu to the new list

v5:
- Add some comments to explain why it doesn't cause deadlock
for the ABBA deadlock scenario.

v4:
- Remove the pointless check since we are in machine stop
context and no other cpus go down in parallel.

 xen/arch/x86/hvm/vmx/vmcs.c       |  1 +
 xen/arch/x86/hvm/vmx/vmx.c        | 70 +++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/vmx/vmx.h |  1 +
 3 files changed, 72 insertions(+)

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index e8e3616..1846e25 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -578,6 +578,7 @@ void vmx_cpu_dead(unsigned int cpu)
     vmx_free_vmcs(per_cpu(vmxon_region, cpu));
     per_cpu(vmxon_region, cpu) = 0;
     nvmx_cpu_dead(cpu);
+    vmx_pi_desc_fixup(cpu);
 }
 
 int vmx_cpu_up(void)
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index a1f7903..614d538 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -203,6 +203,76 @@ static void vmx_pi_do_resume(struct vcpu *v)
     vmx_pi_unblock_vcpu(v);
 }
 
+void vmx_pi_desc_fixup(unsigned int cpu)
+{
+    unsigned int new_cpu, dest;
+    unsigned long flags;
+    struct arch_vmx_struct *vmx, *tmp;
+    spinlock_t *new_lock, *old_lock = &per_cpu(vmx_pi_blocking, cpu).lock;
+    struct list_head *blocked_vcpus = &per_cpu(vmx_pi_blocking, cpu).list;
+
+    if ( !iommu_intpost )
+        return;
+
+    /*
+     * We are in the context of CPU_DEAD or CPU_UP_CANCELED notification,
+     * and it is impossible for a second CPU go down in parallel. So we
+     * can safely acquire the old cpu's lock and then acquire the new_cpu's
+     * lock after that.
+     */
+    spin_lock_irqsave(old_lock, flags);
+
+    list_for_each_entry_safe(vmx, tmp, blocked_vcpus, pi_blocking.list)
+    {
+        /*
+         * Suppress notification or we may miss an interrupt when the
+         * target cpu is dying.
+         */
+        pi_set_sn(&vmx->pi_desc);
+
+        /*
+         * Check whether a notification is pending before doing the
+         * movement, if that is the case we need to wake up it directly
+         * other than moving it to the new cpu's list.
+         */
+        if ( pi_test_on(&vmx->pi_desc) )
+        {
+            list_del(&vmx->pi_blocking.list);
+            vmx->pi_blocking.lock = NULL;
+            vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
+        }
+        else
+        {
+            /*
+             * We need to find an online cpu as the NDST of the PI descriptor, it
+             * doesn't matter whether it is within the cpupool of the domain or
+             * not. As long as it is online, the vCPU will be woken up once the
+             * notification event arrives.
+             */
+            new_cpu = cpumask_any(&cpu_online_map);
+            new_lock = &per_cpu(vmx_pi_blocking, new_cpu).lock;
+
+            spin_lock(new_lock);
+
+            ASSERT(vmx->pi_blocking.lock == old_lock);
+
+            dest = cpu_physical_id(new_cpu);
+            write_atomic(&vmx->pi_desc.ndst,
+                         x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));
+
+            list_move(&vmx->pi_blocking.list,
+                      &per_cpu(vmx_pi_blocking, new_cpu).list);
+            vmx->pi_blocking.lock = new_lock;
+
+            spin_unlock(new_lock);
+        }
+
+        pi_clear_sn(&vmx->pi_desc);
+    }
+
+    spin_unlock_irqrestore(old_lock, flags);
+}
+
 /* This function is called when pcidevs_lock is held */
 void vmx_pi_hooks_assign(struct domain *d)
 {
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 2f0435c..5c8fe5d 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -569,6 +569,7 @@ void free_p2m_hap_data(struct p2m_domain *p2m);
 void p2m_init_hap_data(struct p2m_domain *p2m);
 
 void vmx_pi_per_cpu_init(unsigned int cpu);
+void vmx_pi_desc_fixup(unsigned int cpu);
 
 void vmx_pi_hooks_assign(struct domain *d);
 void vmx_pi_hooks_deassign(struct domain *d);
-- 
2.1.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list
  2016-11-18  1:57 [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list Feng Wu
                   ` (6 preceding siblings ...)
  2016-11-18  1:57 ` [PATCH v8 7/7] VMX: Fixup PI descriptor when cpu is offline Feng Wu
@ 2016-11-18  3:02 ` Tian, Kevin
  2016-11-18  4:21   ` Wu, Feng
  2017-02-21 21:30 ` Chao Gao
  8 siblings, 1 reply; 29+ messages in thread
From: Tian, Kevin @ 2016-11-18  3:02 UTC (permalink / raw)
  To: Wu, Feng, xen-devel
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, jbeulich

> From: Wu, Feng
> Sent: Friday, November 18, 2016 9:57 AM
> 
> The current VT-d PI related code may operate incorrectly in the
> following scenarios:
> 1. When the last assigned device is dettached from the domain, all
> the PI related hooks are removed then, however, the vCPU can be
> blocked, switched to another pCPU, etc, all without the aware of
> PI. After the next time we attach another device to the domain,
> which makes the PI realted hooks avaliable again, the status
> of the pi descriptor is not true. Beside that, the blocking vcpu
> may still remain in the per-cpu blocking in this case. Patch [1/7]
> and [2/7] handle this.
> 
> 2. [4/7] unify the code path of PI mode update and remapped mode update
> 
> 2. When IRTE is in posted mode, we don't need to set the irq
> affinity for it, since the destination of these interrupts is
> vCPU and the vCPU affinity is set during vCPU scheduling. Patch
> [5/7] handles this.
> 
> 4. [6/7] is a cleanup patch
> 
> 5. When a pCPU is unplugged, and there might be vCPUs on its
> list. Since the pCPU is offline, those vCPUs might not be woken
> up again. [7/7] addresses it.
> 
> 

just for completeness you didn't introduce 3/7 here.

> Feng Wu (7):
>   VMX: Permanently assign PI hook vmx_pi_switch_to()
>   VMX: Properly handle pi when all the assigned devices are removed
>   VMX: Make sure PI is in proper state before install the hooks
>   VT-d: Use one function to update both remapped and posted IRTE
>   VT-d: No need to set irq affinity for posted format IRTE
>   VT-d: Some cleanups
>   VMX: Fixup PI descriptor when cpu is offline
> 
>  xen/arch/x86/hvm/vmx/vmcs.c            |  14 +--
>  xen/arch/x86/hvm/vmx/vmx.c             | 133 ++++++++++++++++++++++-
>  xen/drivers/passthrough/pci.c          |  14 +++
>  xen/drivers/passthrough/vtd/intremap.c | 193
> +++++++++++++++------------------
>  xen/include/asm-x86/hvm/vmx/vmx.h      |   3 +
>  5 files changed, 240 insertions(+), 117 deletions(-)
> 
> --
> 2.1.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/7] VMX: Permanently assign PI hook vmx_pi_switch_to()
  2016-11-18  1:57 ` [PATCH v8 1/7] VMX: Permanently assign PI hook vmx_pi_switch_to() Feng Wu
@ 2016-11-18  3:14   ` Tian, Kevin
  2016-11-18  4:23     ` Wu, Feng
  0 siblings, 1 reply; 29+ messages in thread
From: Tian, Kevin @ 2016-11-18  3:14 UTC (permalink / raw)
  To: Wu, Feng, xen-devel
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, jbeulich

> From: Wu, Feng
> Sent: Friday, November 18, 2016 9:57 AM
> 
> PI hook vmx_pi_switch_to() is needed even when any previously

even when -> even after

> assigned device is detached from the domain. Since 'SN' bit is
> also used to control the CPU side PI and we change the state of
> SN bit in these two functions, then evaluate this bit in

which two functions?

> vmx_deliver_posted_intr() when trying to deliver the interrupt
> in posted way via software. The problem is if we deassign the
> hooks while the vCPU is runnable in the runqueue with 'SN' set,
> all the furture notificaton event will be suppressed. This patch
> makes the hook permanently assigned.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
> v8:
> - Comments changes
> 
> v7:
> - comments changes.
> 
> v6:
> - Adjust the comments and wording.
> 
> v5:
> - Zap "pi_switch_from" hook
> 
> v4:
> - Don't zap vmx_pi_switch_from() and vmx_pi_switch_to() when
> any previously assigned device is detached from the domain.
> - Comments changes.
> 
>  xen/arch/x86/hvm/vmx/vmx.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 3d330b6..f3911f2 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -222,8 +222,14 @@ void vmx_pi_hooks_deassign(struct domain *d)
> 
>      d->arch.hvm_domain.vmx.vcpu_block = NULL;
>      d->arch.hvm_domain.vmx.pi_switch_from = NULL;
> -    d->arch.hvm_domain.vmx.pi_switch_to = NULL;
>      d->arch.hvm_domain.vmx.pi_do_resume = NULL;
> +
> +    /*
> +     * In fact, we could set 'd->arch.hvm_domain.vmx.pi_switch_to' to NULL
> +     * in vmx_pi_switch_to() if no new device is in the process of getting
> +     * assigned and "from" hook is NULL. However, it is not straightforward
> +     * to find a clear solution, so just leave it here.
> +     */

better to move this comment ahead of earlier zaps? Also seems the
comment is different from what commit msg says. Please make
them correlated.

>  }
> 
>  static int vmx_domain_initialise(struct domain *d)
> --
> 2.1.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 2/7] VMX: Properly handle pi when all the assigned devices are removed
  2016-11-18  1:57 ` [PATCH v8 2/7] VMX: Properly handle pi when all the assigned devices are removed Feng Wu
@ 2016-11-18  3:19   ` Tian, Kevin
  2016-11-18  4:27     ` Wu, Feng
  0 siblings, 1 reply; 29+ messages in thread
From: Tian, Kevin @ 2016-11-18  3:19 UTC (permalink / raw)
  To: Wu, Feng, xen-devel
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, jbeulich

> From: Wu, Feng
> Sent: Friday, November 18, 2016 9:57 AM
> 
> This patch handles some corner cases when the last assigned device
> is removed from the domain. In this case we should carefully handle
> pi descriptor and the per-cpu blocking list, to make sure:
> - all the PI descriptor are in the right state when next time a
> devices is assigned to the domain again.
> - No remaining vcpus of the domain in the per-cpu blocking list.
> 
> Here we call vmx_pi_unblock_vcpu() to remove the vCPU from the blocking list
> if it is on the list. However, this could happen when vmx_vcpu_block() is
> being called, hence we might incorrectly add the vCPU to the blocking list
> while the last devcie is detached from the domain. Consider that the situation
> can only occur when detaching the last device from the domain and it is not
> a frequent operation, so we use domain_pause before that, which is considered
> as an clean and maintainable solution for the situation.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> ---
> v7:
> - Prevent the domain from pausing itself.
> 
> v6:
> - Comments changes
> - Rename vmx_pi_list_remove() to vmx_pi_unblock_vcpu()
> 
> v5:
> - Remove a no-op wrapper
> 
> v4:
> - Rename some functions:
> 	vmx_pi_remove_vcpu_from_blocking_list() -> vmx_pi_list_remove()
> 	vmx_pi_blocking_cleanup() -> vmx_pi_list_cleanup()
> - Remove the check in vmx_pi_list_cleanup()
> - Comments adjustment
> 
>  xen/arch/x86/hvm/vmx/vmx.c    | 28 ++++++++++++++++++++++++----
>  xen/drivers/passthrough/pci.c | 14 ++++++++++++++
>  2 files changed, 38 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index f3911f2..a8dcabe 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -158,14 +158,12 @@ static void vmx_pi_switch_to(struct vcpu *v)
>      pi_clear_sn(pi_desc);
>  }
> 
> -static void vmx_pi_do_resume(struct vcpu *v)
> +static void vmx_pi_unblock_vcpu(struct vcpu *v)
>  {
>      unsigned long flags;
>      spinlock_t *pi_blocking_list_lock;
>      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> 
> -    ASSERT(!test_bit(_VPF_blocked, &v->pause_flags));
> -
>      /*
>       * Set 'NV' field back to posted_intr_vector, so the
>       * Posted-Interrupts can be delivered to the vCPU when
> @@ -173,12 +171,12 @@ static void vmx_pi_do_resume(struct vcpu *v)
>       */
>      write_atomic(&pi_desc->nv, posted_intr_vector);
> 
> -    /* The vCPU is not on any blocking list. */
>      pi_blocking_list_lock = v->arch.hvm_vmx.pi_blocking.lock;
> 
>      /* Prevent the compiler from eliminating the local variable.*/
>      smp_rmb();
> 
> +    /* The vCPU is not on any blocking list. */
>      if ( pi_blocking_list_lock == NULL )
>          return;
> 
> @@ -198,6 +196,13 @@ static void vmx_pi_do_resume(struct vcpu *v)
>      spin_unlock_irqrestore(pi_blocking_list_lock, flags);
>  }
> 
> +static void vmx_pi_do_resume(struct vcpu *v)
> +{
> +    ASSERT(!test_bit(_VPF_blocked, &v->pause_flags));
> +
> +    vmx_pi_unblock_vcpu(v);
> +}
> +
>  /* This function is called when pcidevs_lock is held */
>  void vmx_pi_hooks_assign(struct domain *d)
>  {
> @@ -215,11 +220,21 @@ void vmx_pi_hooks_assign(struct domain *d)
>  /* This function is called when pcidevs_lock is held */
>  void vmx_pi_hooks_deassign(struct domain *d)
>  {
> +    struct vcpu *v;
> +
>      if ( !iommu_intpost || !has_hvm_container_domain(d) )
>          return;
> 
>      ASSERT(d->arch.hvm_domain.vmx.vcpu_block);
> 
> +    /*
> +     * Pausing the domain can make sure the vCPU is not

"the vCPUs are"

> +     * running and hence not calling the hooks simultaneously
> +     * when deassigning the PI hooks and removing the vCPU
> +     * from the blocking list.
> +     */
> +    domain_pause(d);
> +
>      d->arch.hvm_domain.vmx.vcpu_block = NULL;
>      d->arch.hvm_domain.vmx.pi_switch_from = NULL;
>      d->arch.hvm_domain.vmx.pi_do_resume = NULL;
> @@ -230,6 +245,11 @@ void vmx_pi_hooks_deassign(struct domain *d)
>       * assigned and "from" hook is NULL. However, it is not straightforward
>       * to find a clear solution, so just leave it here.
>       */
> +
> +    for_each_vcpu ( d, v )
> +        vmx_pi_unblock_vcpu(v);
> +
> +    domain_unpause(d);
>  }
> 
>  static int vmx_domain_initialise(struct domain *d)
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 8bce213..e71732f 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1602,6 +1602,13 @@ int iommu_do_pci_domctl(
>          break;
> 
>      case XEN_DOMCTL_assign_device:
> +        /* no domain_pause() */
> +        if ( d == current->domain )
> +        {
> +            ret = -EINVAL;
> +            break;
> +        }
> +

don't understand why adding above check, and why "no domain_pause"
matters in this change.

>          ret = -ENODEV;
>          if ( domctl->u.assign_device.dev != XEN_DOMCTL_DEV_PCI )
>              break;
> @@ -1642,6 +1649,13 @@ int iommu_do_pci_domctl(
>          break;
> 
>      case XEN_DOMCTL_deassign_device:
> +        /* no domain_pause() */
> +        if ( d == current->domain )
> +        {
> +            ret = -EINVAL;
> +            break;
> +        }
> +
>          ret = -ENODEV;
>          if ( domctl->u.assign_device.dev != XEN_DOMCTL_DEV_PCI )
>              break;
> --
> 2.1.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 3/7] VMX: Make sure PI is in proper state before install the hooks
  2016-11-18  1:57 ` [PATCH v8 3/7] VMX: Make sure PI is in proper state before install the hooks Feng Wu
@ 2016-11-18  4:11   ` Tian, Kevin
  2016-11-18  4:27     ` Wu, Feng
  0 siblings, 1 reply; 29+ messages in thread
From: Tian, Kevin @ 2016-11-18  4:11 UTC (permalink / raw)
  To: Wu, Feng, xen-devel
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, jbeulich

> From: Wu, Feng
> Sent: Friday, November 18, 2016 9:57 AM
> 
> We may hit the last ASSERT() in vmx_vcpu_block in the current code,
> since vmx_vcpu_block() may get called before vmx_pi_switch_to()
> has been installed or executed. Here We use cmpxchg to update
> the NDST field, this can make sure we only update the NDST when
> vmx_pi_switch_to() has not been called. So the NDST is in a
> proper state in vmx_vcpu_block().
> 
> Suggested-by: Jan Beulich <JBeulich@suse.com>
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Kevin Tian <kevin.tian@intel.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list
  2016-11-18  3:02 ` [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list Tian, Kevin
@ 2016-11-18  4:21   ` Wu, Feng
  0 siblings, 0 replies; 29+ messages in thread
From: Wu, Feng @ 2016-11-18  4:21 UTC (permalink / raw)
  To: Tian, Kevin, xen-devel
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, Wu, Feng, jbeulich



> -----Original Message-----
> From: Tian, Kevin
> Sent: Friday, November 18, 2016 11:03 AM
> To: Wu, Feng <feng.wu@intel.com>; xen-devel@lists.xen.org
> Cc: jbeulich@suse.com; andrew.cooper3@citrix.com;
> george.dunlap@eu.citrix.com; dario.faggioli@citrix.com
> Subject: RE: [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu
> blocking list
> 
> > From: Wu, Feng
> > Sent: Friday, November 18, 2016 9:57 AM
> >
> > The current VT-d PI related code may operate incorrectly in the
> > following scenarios:
> > 1. When the last assigned device is dettached from the domain, all
> > the PI related hooks are removed then, however, the vCPU can be
> > blocked, switched to another pCPU, etc, all without the aware of
> > PI. After the next time we attach another device to the domain,
> > which makes the PI realted hooks avaliable again, the status
> > of the pi descriptor is not true. Beside that, the blocking vcpu
> > may still remain in the per-cpu blocking in this case. Patch [1/7]
> > and [2/7] handle this.
> >
> > 2. [4/7] unify the code path of PI mode update and remapped mode update
> >
> > 2. When IRTE is in posted mode, we don't need to set the irq
> > affinity for it, since the destination of these interrupts is
> > vCPU and the vCPU affinity is set during vCPU scheduling. Patch
> > [5/7] handles this.
> >
> > 4. [6/7] is a cleanup patch
> >
> > 5. When a pCPU is unplugged, and there might be vCPUs on its
> > list. Since the pCPU is offline, those vCPUs might not be woken
> > up again. [7/7] addresses it.
> >
> >
> 
> just for completeness you didn't introduce 3/7 here.

Okay, maybe I need add it.

Thanks,
Feng

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 1/7] VMX: Permanently assign PI hook vmx_pi_switch_to()
  2016-11-18  3:14   ` Tian, Kevin
@ 2016-11-18  4:23     ` Wu, Feng
  0 siblings, 0 replies; 29+ messages in thread
From: Wu, Feng @ 2016-11-18  4:23 UTC (permalink / raw)
  To: Tian, Kevin, xen-devel
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, Wu, Feng, jbeulich



> -----Original Message-----
> From: Tian, Kevin
> Sent: Friday, November 18, 2016 11:14 AM
> To: Wu, Feng <feng.wu@intel.com>; xen-devel@lists.xen.org
> Cc: jbeulich@suse.com; andrew.cooper3@citrix.com;
> george.dunlap@eu.citrix.com; dario.faggioli@citrix.com
> Subject: RE: [PATCH v8 1/7] VMX: Permanently assign PI hook
> vmx_pi_switch_to()
> 
> > From: Wu, Feng
> > Sent: Friday, November 18, 2016 9:57 AM
> >
> > PI hook vmx_pi_switch_to() is needed even when any previously
> 
> even when -> even after
> 
> > assigned device is detached from the domain. Since 'SN' bit is
> > also used to control the CPU side PI and we change the state of
> > SN bit in these two functions, then evaluate this bit in
> 
> which two functions?

vmx_pi_switch_to() and vmx_pi_switch_from(), I will explicitly
describe it in the next version.

Thanks,
Feng

> 
> > vmx_deliver_posted_intr() when trying to deliver the interrupt
> > in posted way via software. The problem is if we deassign the
> > hooks while the vCPU is runnable in the runqueue with 'SN' set,
> > all the furture notificaton event will be suppressed. This patch
> > makes the hook permanently assigned.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > ---
> > v8:
> > - Comments changes
> >
> > v7:
> > - comments changes.
> >
> > v6:
> > - Adjust the comments and wording.
> >
> > v5:
> > - Zap "pi_switch_from" hook
> >
> > v4:
> > - Don't zap vmx_pi_switch_from() and vmx_pi_switch_to() when
> > any previously assigned device is detached from the domain.
> > - Comments changes.
> >
> >  xen/arch/x86/hvm/vmx/vmx.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index 3d330b6..f3911f2 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -222,8 +222,14 @@ void vmx_pi_hooks_deassign(struct domain *d)
> >
> >      d->arch.hvm_domain.vmx.vcpu_block = NULL;
> >      d->arch.hvm_domain.vmx.pi_switch_from = NULL;
> > -    d->arch.hvm_domain.vmx.pi_switch_to = NULL;
> >      d->arch.hvm_domain.vmx.pi_do_resume = NULL;
> > +
> > +    /*
> > +     * In fact, we could set 'd->arch.hvm_domain.vmx.pi_switch_to' to NULL
> > +     * in vmx_pi_switch_to() if no new device is in the process of getting
> > +     * assigned and "from" hook is NULL. However, it is not straightforward
> > +     * to find a clear solution, so just leave it here.
> > +     */
> 
> better to move this comment ahead of earlier zaps? Also seems the
> comment is different from what commit msg says. Please make
> them correlated.

Sure.

Thanks,
Feng

> 
> >  }
> >
> >  static int vmx_domain_initialise(struct domain *d)
> > --
> > 2.1.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 2/7] VMX: Properly handle pi when all the assigned devices are removed
  2016-11-18  3:19   ` Tian, Kevin
@ 2016-11-18  4:27     ` Wu, Feng
  2016-11-18  4:58       ` Tian, Kevin
  0 siblings, 1 reply; 29+ messages in thread
From: Wu, Feng @ 2016-11-18  4:27 UTC (permalink / raw)
  To: Tian, Kevin, xen-devel
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, Wu, Feng, jbeulich



> -----Original Message-----
> From: Tian, Kevin
> Sent: Friday, November 18, 2016 11:19 AM
> To: Wu, Feng <feng.wu@intel.com>; xen-devel@lists.xen.org
> Cc: jbeulich@suse.com; andrew.cooper3@citrix.com;
> george.dunlap@eu.citrix.com; dario.faggioli@citrix.com
> Subject: RE: [PATCH v8 2/7] VMX: Properly handle pi when all the assigned
> devices are removed
> 
> > From: Wu, Feng
> > Sent: Friday, November 18, 2016 9:57 AM
> >
> > This patch handles some corner cases when the last assigned device
> > is removed from the domain. In this case we should carefully handle
> > pi descriptor and the per-cpu blocking list, to make sure:
> > - all the PI descriptor are in the right state when next time a
> > devices is assigned to the domain again.
> > - No remaining vcpus of the domain in the per-cpu blocking list.
> >
> > Here we call vmx_pi_unblock_vcpu() to remove the vCPU from the blocking list
> > if it is on the list. However, this could happen when vmx_vcpu_block() is
> > being called, hence we might incorrectly add the vCPU to the blocking list
> > while the last devcie is detached from the domain. Consider that the situation
> > can only occur when detaching the last device from the domain and it is not
> > a frequent operation, so we use domain_pause before that, which is
> considered
> > as an clean and maintainable solution for the situation.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > Reviewed-by: Jan Beulich <jbeulich@suse.com>
> > ---
> > v7:
> > - Prevent the domain from pausing itself.
> >
> > v6:
> > - Comments changes
> > - Rename vmx_pi_list_remove() to vmx_pi_unblock_vcpu()
> >
> > v5:
> > - Remove a no-op wrapper
> >
> > v4:
> > - Rename some functions:
> > 	vmx_pi_remove_vcpu_from_blocking_list() -> vmx_pi_list_remove()
> > 	vmx_pi_blocking_cleanup() -> vmx_pi_list_cleanup()
> > - Remove the check in vmx_pi_list_cleanup()
> > - Comments adjustment
> >
> >  xen/arch/x86/hvm/vmx/vmx.c    | 28 ++++++++++++++++++++++++----
> >  xen/drivers/passthrough/pci.c | 14 ++++++++++++++
> >  2 files changed, 38 insertions(+), 4 deletions(-)
> >
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index f3911f2..a8dcabe 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -158,14 +158,12 @@ static void vmx_pi_switch_to(struct vcpu *v)
> >      pi_clear_sn(pi_desc);
> >  }
> >
> > -static void vmx_pi_do_resume(struct vcpu *v)
> > +static void vmx_pi_unblock_vcpu(struct vcpu *v)
> >  {
> >      unsigned long flags;
> >      spinlock_t *pi_blocking_list_lock;
> >      struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> >
> > -    ASSERT(!test_bit(_VPF_blocked, &v->pause_flags));
> > -
> >      /*
> >       * Set 'NV' field back to posted_intr_vector, so the
> >       * Posted-Interrupts can be delivered to the vCPU when
> > @@ -173,12 +171,12 @@ static void vmx_pi_do_resume(struct vcpu *v)
> >       */
> >      write_atomic(&pi_desc->nv, posted_intr_vector);
> >
> > -    /* The vCPU is not on any blocking list. */
> >      pi_blocking_list_lock = v->arch.hvm_vmx.pi_blocking.lock;
> >
> >      /* Prevent the compiler from eliminating the local variable.*/
> >      smp_rmb();
> >
> > +    /* The vCPU is not on any blocking list. */
> >      if ( pi_blocking_list_lock == NULL )
> >          return;
> >
> > @@ -198,6 +196,13 @@ static void vmx_pi_do_resume(struct vcpu *v)
> >      spin_unlock_irqrestore(pi_blocking_list_lock, flags);
> >  }
> >
> > +static void vmx_pi_do_resume(struct vcpu *v)
> > +{
> > +    ASSERT(!test_bit(_VPF_blocked, &v->pause_flags));
> > +
> > +    vmx_pi_unblock_vcpu(v);
> > +}
> > +
> >  /* This function is called when pcidevs_lock is held */
> >  void vmx_pi_hooks_assign(struct domain *d)
> >  {
> > @@ -215,11 +220,21 @@ void vmx_pi_hooks_assign(struct domain *d)
> >  /* This function is called when pcidevs_lock is held */
> >  void vmx_pi_hooks_deassign(struct domain *d)
> >  {
> > +    struct vcpu *v;
> > +
> >      if ( !iommu_intpost || !has_hvm_container_domain(d) )
> >          return;
> >
> >      ASSERT(d->arch.hvm_domain.vmx.vcpu_block);
> >
> > +    /*
> > +     * Pausing the domain can make sure the vCPU is not
> 
> "the vCPUs are"
> 
> > +     * running and hence not calling the hooks simultaneously
> > +     * when deassigning the PI hooks and removing the vCPU
> > +     * from the blocking list.
> > +     */
> > +    domain_pause(d);
> > +
> >      d->arch.hvm_domain.vmx.vcpu_block = NULL;
> >      d->arch.hvm_domain.vmx.pi_switch_from = NULL;
> >      d->arch.hvm_domain.vmx.pi_do_resume = NULL;
> > @@ -230,6 +245,11 @@ void vmx_pi_hooks_deassign(struct domain *d)
> >       * assigned and "from" hook is NULL. However, it is not straightforward
> >       * to find a clear solution, so just leave it here.
> >       */
> > +
> > +    for_each_vcpu ( d, v )
> > +        vmx_pi_unblock_vcpu(v);
> > +
> > +    domain_unpause(d);
> >  }
> >
> >  static int vmx_domain_initialise(struct domain *d)
> > diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> > index 8bce213..e71732f 100644
> > --- a/xen/drivers/passthrough/pci.c
> > +++ b/xen/drivers/passthrough/pci.c
> > @@ -1602,6 +1602,13 @@ int iommu_do_pci_domctl(
> >          break;
> >
> >      case XEN_DOMCTL_assign_device:
> > +        /* no domain_pause() */
> > +        if ( d == current->domain )
> > +        {
> > +            ret = -EINVAL;
> > +            break;
> > +        }
> > +
> 
> don't understand why adding above check, and why "no domain_pause"
> matters in this change.

In fact, this change is according Jan's following comments on v6:

" There's one additional caveat here which no-one of us so far thought
of: Currently there's nothing preventing the domctl-s under which
this sits from being issued by the control domain for itself. Various
other domctl-s, however, guard against this case when intending
to pause the target domain. The same needs to be done for the
ones leading here."

We need to prevent the domain from pausing itself.

Thanks,
Feng

> 
> >          ret = -ENODEV;
> >          if ( domctl->u.assign_device.dev != XEN_DOMCTL_DEV_PCI )
> >              break;
> > @@ -1642,6 +1649,13 @@ int iommu_do_pci_domctl(
> >          break;
> >
> >      case XEN_DOMCTL_deassign_device:
> > +        /* no domain_pause() */
> > +        if ( d == current->domain )
> > +        {
> > +            ret = -EINVAL;
> > +            break;
> > +        }
> > +
> >          ret = -ENODEV;
> >          if ( domctl->u.assign_device.dev != XEN_DOMCTL_DEV_PCI )
> >              break;
> > --
> > 2.1.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 3/7] VMX: Make sure PI is in proper state before install the hooks
  2016-11-18  4:11   ` Tian, Kevin
@ 2016-11-18  4:27     ` Wu, Feng
  0 siblings, 0 replies; 29+ messages in thread
From: Wu, Feng @ 2016-11-18  4:27 UTC (permalink / raw)
  To: Tian, Kevin, xen-devel
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, Wu, Feng, jbeulich



> -----Original Message-----
> From: Tian, Kevin
> Sent: Friday, November 18, 2016 12:11 PM
> To: Wu, Feng <feng.wu@intel.com>; xen-devel@lists.xen.org
> Cc: jbeulich@suse.com; andrew.cooper3@citrix.com;
> george.dunlap@eu.citrix.com; dario.faggioli@citrix.com
> Subject: RE: [PATCH v8 3/7] VMX: Make sure PI is in proper state before install
> the hooks
> 
> > From: Wu, Feng
> > Sent: Friday, November 18, 2016 9:57 AM
> >
> > We may hit the last ASSERT() in vmx_vcpu_block in the current code,
> > since vmx_vcpu_block() may get called before vmx_pi_switch_to()
> > has been installed or executed. Here We use cmpxchg to update
> > the NDST field, this can make sure we only update the NDST when
> > vmx_pi_switch_to() has not been called. So the NDST is in a
> > proper state in vmx_vcpu_block().
> >
> > Suggested-by: Jan Beulich <JBeulich@suse.com>
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > Reviewed-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Kevin Tian <kevin.tian@intel.com>

Thanks for the Ack!

Thanks,
Feng

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 4/7] VT-d: Use one function to update both remapped and posted IRTE
  2016-11-18  1:57 ` [PATCH v8 4/7] VT-d: Use one function to update both remapped and posted IRTE Feng Wu
@ 2016-11-18  4:31   ` Tian, Kevin
  2016-11-22  4:31     ` Wu, Feng
  2016-11-22  9:58   ` Jan Beulich
  1 sibling, 1 reply; 29+ messages in thread
From: Tian, Kevin @ 2016-11-18  4:31 UTC (permalink / raw)
  To: Wu, Feng, xen-devel
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, jbeulich

> From: Wu, Feng
> Sent: Friday, November 18, 2016 9:57 AM
> 
> Use one function to update both remapped IRTE and posted IRET.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v8:
> - Newly added
> 
>  xen/drivers/passthrough/vtd/intremap.c | 162 ++++++++++++++-------------------
>  1 file changed, 66 insertions(+), 96 deletions(-)
> 
> diff --git a/xen/drivers/passthrough/vtd/intremap.c
> b/xen/drivers/passthrough/vtd/intremap.c
> index bfd468b..fd2a49a 100644
> --- a/xen/drivers/passthrough/vtd/intremap.c
> +++ b/xen/drivers/passthrough/vtd/intremap.c
> @@ -420,7 +420,7 @@ void io_apic_write_remap_rte(
>          __ioapic_write_entry(apic, ioapic_pin, 1, old_rte);
>  }
> 
> -static void set_msi_source_id(struct pci_dev *pdev, struct iremap_entry *ire)
> +static void set_msi_source_id(const struct pci_dev *pdev, struct iremap_entry *ire)
>  {
>      u16 seg;
>      u8 bus, devfn, secbus;
> @@ -548,11 +548,12 @@ static int remap_entry_to_msi_msg(
>  }
> 
>  static int msi_msg_to_remap_entry(
> -    struct iommu *iommu, struct pci_dev *pdev,
> -    struct msi_desc *msi_desc, struct msi_msg *msg)
> +    struct iommu *iommu, const struct pci_dev *pdev,
> +    struct msi_desc *msi_desc, struct msi_msg *msg,
> +    const struct pi_desc *pi_desc, const uint8_t gvec)
>  {
>      struct iremap_entry *iremap_entry = NULL, *iremap_entries;
> -    struct iremap_entry new_ire;
> +    struct iremap_entry new_ire, old_ire;

old_ire is used only in later 'else' branch. move definition there.

>      struct msi_msg_remap_entry *remap_rte;
>      unsigned int index, i, nr = 1;
>      unsigned long flags;
> @@ -597,31 +598,50 @@ static int msi_msg_to_remap_entry(
> 
>      memcpy(&new_ire, iremap_entry, sizeof(struct iremap_entry));
> 
> -    /* Set interrupt remapping table entry */
> -    new_ire.remap.fpd = 0;
> -    new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
> -    new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
> -    new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
> -    /* Hardware require RH = 1 for LPR delivery mode */
> -    new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
> -    new_ire.remap.avail = 0;
> -    new_ire.remap.res_1 = 0;
> -    new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
> -                            MSI_DATA_VECTOR_MASK;
> -    new_ire.remap.res_2 = 0;
> -    if ( x2apic_enabled )
> -        new_ire.remap.dst = msg->dest32;
> +    if ( !pi_desc )
> +    {
> +        /* Set interrupt remapping table entry */
> +        new_ire.remap.fpd = 0;
> +        new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) &
> 0x1;
> +        new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
> +        new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) &
> 0x1;
> +        /* Hardware require RH = 1 for LPR delivery mode */
> +        new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
> +        new_ire.remap.avail = 0;
> +        new_ire.remap.res_1 = 0;
> +        new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
> +                                MSI_DATA_VECTOR_MASK;
> +        new_ire.remap.res_2 = 0;
> +        if ( x2apic_enabled )
> +            new_ire.remap.dst = msg->dest32;
> +        else
> +            new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
> +                                 & 0xff) << 8;
> +
> +        new_ire.remap.res_3 = 0;
> +        new_ire.remap.res_4 = 0;
> +        new_ire.remap.p = 1;    /* finally, set present bit */
> +    }
>      else
> -        new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
> -                             & 0xff) << 8;
> +    {
> +        new_ire.post.fpd = 0;
> +        new_ire.post.res_1 = 0;
> +        new_ire.post.res_2 = 0;
> +        new_ire.post.urg = 0;
> +        new_ire.post.im = 1;
> +        new_ire.post.vector = gvec;
> +        new_ire.post.res_3 = 0;
> +        new_ire.post.res_4 = 0;
> +        new_ire.post.res_5 = 0;
> +        new_ire.post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
> +        new_ire.post.pda_h = virt_to_maddr(pi_desc) >> 32;
> +        new_ire.post.p = 1;    /* finally, set present bit */
> +    }

It's a little confusing here:

- you first copy current remapping entry to new_ire and then do some initialization.
Is it necessary? If there are some fields we'd like to inherit, better to describe them
in comment so others can have a clear understanding of what exactly fields must
be updated here;

- in original code of setup_posted_irte, you actually do conditional assignment 
based on format of old_ire. Above you changed to non-conditional assignment,
not relying on old_ire. Is it desired? 

> 
>      if ( pdev )
>          set_msi_source_id(pdev, &new_ire);
>      else
>          set_hpet_source_id(msi_desc->hpet_id, &new_ire);
> -    new_ire.remap.res_3 = 0;
> -    new_ire.remap.res_4 = 0;
> -    new_ire.remap.p = 1;    /* finally, set present bit */
> 
>      /* now construct new MSI/MSI-X rte entry */
>      remap_rte = (struct msi_msg_remap_entry *)msg;
> @@ -637,7 +657,23 @@ static int msi_msg_to_remap_entry(
>      remap_rte->address_hi = 0;
>      remap_rte->data = index - i;
> 
> -    memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
> +    if ( !pi_desc )
> +        memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
> +    else
> +    {
> +        __uint128_t ret;
> +
> +        old_ire = *iremap_entry;
> +        ret = cmpxchg16b(iremap_entry, &old_ire, &new_ire);
> +
> +        /*
> +         * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
> +         * and the hardware cannot update the IRTE behind us, so the return value
> +         * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
> +         */
> +        ASSERT(ret == old_ire.val);
> +    }
> +
>      iommu_flush_cache_entry(iremap_entry, sizeof(struct iremap_entry));
>      iommu_flush_iec_index(iommu, 0, index);
> 
> @@ -668,7 +704,8 @@ int msi_msg_write_remap_rte(
> 
>      drhd = pdev ? acpi_find_matched_drhd_unit(pdev)
>                  : hpet_to_drhd(msi_desc->hpet_id);
> -    return drhd ? msi_msg_to_remap_entry(drhd->iommu, pdev, msi_desc, msg)
> +    return drhd ? msi_msg_to_remap_entry(drhd->iommu, pdev,
> +                                         msi_desc, msg, NULL, 0)
>                  : -EINVAL;
>  }
> 
> @@ -902,42 +939,6 @@ void iommu_disable_x2apic_IR(void)
>          disable_qinval(drhd->iommu);
>  }
> 
> -static void setup_posted_irte(
> -    struct iremap_entry *new_ire, const struct iremap_entry *old_ire,
> -    const struct pi_desc *pi_desc, const uint8_t gvec)
> -{
> -    memset(new_ire, 0, sizeof(*new_ire));
> -
> -    /*
> -     * 'im' filed decides whether the irte is in posted format (with value 1)
> -     * or remapped format (with value 0), if the old irte is in remapped format,
> -     * we copy things from remapped part in 'struct iremap_entry', otherwise,
> -     * we copy from posted part.
> -     */
> -    if ( !old_ire->remap.im )
> -    {
> -        new_ire->post.p = old_ire->remap.p;
> -        new_ire->post.fpd = old_ire->remap.fpd;
> -        new_ire->post.sid = old_ire->remap.sid;
> -        new_ire->post.sq = old_ire->remap.sq;
> -        new_ire->post.svt = old_ire->remap.svt;
> -    }
> -    else
> -    {
> -        new_ire->post.p = old_ire->post.p;
> -        new_ire->post.fpd = old_ire->post.fpd;
> -        new_ire->post.sid = old_ire->post.sid;
> -        new_ire->post.sq = old_ire->post.sq;
> -        new_ire->post.svt = old_ire->post.svt;
> -        new_ire->post.urg = old_ire->post.urg;
> -    }
> -
> -    new_ire->post.im = 1;
> -    new_ire->post.vector = gvec;
> -    new_ire->post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
> -    new_ire->post.pda_h = virt_to_maddr(pi_desc) >> 32;
> -}
> -
>  /*
>   * This function is used to update the IRTE for posted-interrupt
>   * when guest changes MSI/MSI-X information.
> @@ -946,17 +947,12 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
>      const uint8_t gvec)
>  {
>      struct irq_desc *desc;
> -    const struct msi_desc *msi_desc;
> -    int remap_index;
> +    struct msi_desc *msi_desc;
>      int rc = 0;
>      const struct pci_dev *pci_dev;
>      const struct acpi_drhd_unit *drhd;
>      struct iommu *iommu;
> -    struct ir_ctrl *ir_ctrl;
> -    struct iremap_entry *iremap_entries = NULL, *p = NULL;
> -    struct iremap_entry new_ire, old_ire;
>      const struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> -    __uint128_t ret;
> 
>      desc = pirq_spin_lock_irq_desc(pirq, NULL);
>      if ( !desc )
> @@ -976,8 +972,6 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
>          goto unlock_out;
>      }
> 
> -    remap_index = msi_desc->remap_index;
> -
>      spin_unlock_irq(&desc->lock);
> 
>      ASSERT(pcidevs_locked());
> @@ -992,35 +986,11 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
>          return -ENODEV;
> 
>      iommu = drhd->iommu;
> -    ir_ctrl = iommu_ir_ctrl(iommu);
> -    if ( !ir_ctrl )
> +    if ( !iommu_ir_ctrl(iommu) )
>          return -ENODEV;
> 
> -    spin_lock_irq(&ir_ctrl->iremap_lock);
> -
> -    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
> -
> -    old_ire = *p;
> -
> -    /* Setup/Update interrupt remapping table entry. */
> -    setup_posted_irte(&new_ire, &old_ire, pi_desc, gvec);
> -    ret = cmpxchg16b(p, &old_ire, &new_ire);
> -
> -    /*
> -     * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
> -     * and the hardware cannot update the IRTE behind us, so the return value
> -     * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
> -     */
> -    ASSERT(ret == old_ire.val);
> -
> -    iommu_flush_cache_entry(p, sizeof(*p));
> -    iommu_flush_iec_index(iommu, 0, remap_index);
> -
> -    unmap_vtd_domain_page(iremap_entries);
> -
> -    spin_unlock_irq(&ir_ctrl->iremap_lock);
> -
> -    return 0;
> +    return msi_msg_to_remap_entry(iommu, pci_dev, msi_desc, &msi_desc->msg,
> +                                  pi_desc, gvec);
> 
>   unlock_out:
>      spin_unlock_irq(&desc->lock);
> --
> 2.1.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 6/7] VT-d: Some cleanups
  2016-11-18  1:57 ` [PATCH v8 6/7] VT-d: Some cleanups Feng Wu
@ 2016-11-18  4:45   ` Tian, Kevin
  0 siblings, 0 replies; 29+ messages in thread
From: Tian, Kevin @ 2016-11-18  4:45 UTC (permalink / raw)
  To: Wu, Feng, xen-devel
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, jbeulich

> From: Wu, Feng
> Sent: Friday, November 18, 2016 9:57 AM
> 
> Use type-safe structure assignment instead of memcpy()
> Use sizeof(*iremap_entry).
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Acked-by: Kevin Tian <kevin.tian@intel.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 7/7] VMX: Fixup PI descriptor when cpu is offline
  2016-11-18  1:57 ` [PATCH v8 7/7] VMX: Fixup PI descriptor when cpu is offline Feng Wu
@ 2016-11-18  4:55   ` Tian, Kevin
  0 siblings, 0 replies; 29+ messages in thread
From: Tian, Kevin @ 2016-11-18  4:55 UTC (permalink / raw)
  To: Wu, Feng, xen-devel
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, jbeulich

> From: Wu, Feng
> Sent: Friday, November 18, 2016 9:57 AM
> 
> When cpu is offline, we need to move all the vcpus in its blocking
> list to another online cpu, this patch handles it.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Kevin Tian <kevin.tian@intel.com>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 2/7] VMX: Properly handle pi when all the assigned devices are removed
  2016-11-18  4:27     ` Wu, Feng
@ 2016-11-18  4:58       ` Tian, Kevin
  2016-11-18  5:22         ` Wu, Feng
  0 siblings, 1 reply; 29+ messages in thread
From: Tian, Kevin @ 2016-11-18  4:58 UTC (permalink / raw)
  To: Wu, Feng, xen-devel
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, jbeulich

> From: Wu, Feng
> Sent: Friday, November 18, 2016 12:27 PM
> > > diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> > > index 8bce213..e71732f 100644
> > > --- a/xen/drivers/passthrough/pci.c
> > > +++ b/xen/drivers/passthrough/pci.c
> > > @@ -1602,6 +1602,13 @@ int iommu_do_pci_domctl(
> > >          break;
> > >
> > >      case XEN_DOMCTL_assign_device:
> > > +        /* no domain_pause() */
> > > +        if ( d == current->domain )
> > > +        {
> > > +            ret = -EINVAL;
> > > +            break;
> > > +        }
> > > +
> >
> > don't understand why adding above check, and why "no domain_pause"
> > matters in this change.
> 
> In fact, this change is according Jan's following comments on v6:
> 
> " There's one additional caveat here which no-one of us so far thought
> of: Currently there's nothing preventing the domctl-s under which
> this sits from being issued by the control domain for itself. Various
> other domctl-s, however, guard against this case when intending
> to pause the target domain. The same needs to be done for the
> ones leading here."
> 
> We need to prevent the domain from pausing itself.
> 

XEN_DOMCTL_assign_device doesn't imply a domain_pause operation,
at least not obvious in this level. If we don't have PI enabled underneath,
is above guard still necessary? If the answer is yes, the comment should
be elaborated for easy understanding... :-)

Thanks
Kevin

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 2/7] VMX: Properly handle pi when all the assigned devices are removed
  2016-11-18  4:58       ` Tian, Kevin
@ 2016-11-18  5:22         ` Wu, Feng
  2016-11-18  9:23           ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread
From: Wu, Feng @ 2016-11-18  5:22 UTC (permalink / raw)
  To: Tian, Kevin, xen-devel
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, Wu, Feng, jbeulich



> -----Original Message-----
> From: Tian, Kevin
> Sent: Friday, November 18, 2016 12:59 PM
> To: Wu, Feng <feng.wu@intel.com>; xen-devel@lists.xen.org
> Cc: jbeulich@suse.com; andrew.cooper3@citrix.com;
> george.dunlap@eu.citrix.com; dario.faggioli@citrix.com
> Subject: RE: [PATCH v8 2/7] VMX: Properly handle pi when all the assigned
> devices are removed
> 
> > From: Wu, Feng
> > Sent: Friday, November 18, 2016 12:27 PM
> > > > diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> > > > index 8bce213..e71732f 100644
> > > > --- a/xen/drivers/passthrough/pci.c
> > > > +++ b/xen/drivers/passthrough/pci.c
> > > > @@ -1602,6 +1602,13 @@ int iommu_do_pci_domctl(
> > > >          break;
> > > >
> > > >      case XEN_DOMCTL_assign_device:
> > > > +        /* no domain_pause() */
> > > > +        if ( d == current->domain )
> > > > +        {
> > > > +            ret = -EINVAL;
> > > > +            break;
> > > > +        }
> > > > +
> > >
> > > don't understand why adding above check, and why "no domain_pause"
> > > matters in this change.
> >
> > In fact, this change is according Jan's following comments on v6:
> >
> > " There's one additional caveat here which no-one of us so far thought
> > of: Currently there's nothing preventing the domctl-s under which
> > this sits from being issued by the control domain for itself. Various
> > other domctl-s, however, guard against this case when intending
> > to pause the target domain. The same needs to be done for the
> > ones leading here."
> >
> > We need to prevent the domain from pausing itself.
> >
> 
> XEN_DOMCTL_assign_device doesn't imply a domain_pause operation,
> at least not obvious in this level. If we don't have PI enabled underneath,
> is above guard still necessary? If the answer is yes, the comment should
> be elaborated for easy understanding... :-)

I think the domain_pause() is introduced in PI related logic. So maybe I
need Jan's comments about whether we need to add this check unconditionally
here. Anyway, the comments need to be more detailed.

Thanks,
Feng

> 
> Thanks
> Kevin

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 2/7] VMX: Properly handle pi when all the assigned devices are removed
  2016-11-18  5:22         ` Wu, Feng
@ 2016-11-18  9:23           ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2016-11-18  9:23 UTC (permalink / raw)
  To: Feng Wu, Kevin Tian
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, xen-devel

>>> On 18.11.16 at 06:22, <feng.wu@intel.com> wrote:
>> From: Tian, Kevin
>> Sent: Friday, November 18, 2016 12:59 PM
>> > From: Wu, Feng
>> > Sent: Friday, November 18, 2016 12:27 PM
>> > > > diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> > > > index 8bce213..e71732f 100644
>> > > > --- a/xen/drivers/passthrough/pci.c
>> > > > +++ b/xen/drivers/passthrough/pci.c
>> > > > @@ -1602,6 +1602,13 @@ int iommu_do_pci_domctl(
>> > > >          break;
>> > > >
>> > > >      case XEN_DOMCTL_assign_device:
>> > > > +        /* no domain_pause() */
>> > > > +        if ( d == current->domain )
>> > > > +        {
>> > > > +            ret = -EINVAL;
>> > > > +            break;
>> > > > +        }
>> > > > +
>> > >
>> > > don't understand why adding above check, and why "no domain_pause"
>> > > matters in this change.
>> >
>> > In fact, this change is according Jan's following comments on v6:
>> >
>> > " There's one additional caveat here which no-one of us so far thought
>> > of: Currently there's nothing preventing the domctl-s under which
>> > this sits from being issued by the control domain for itself. Various
>> > other domctl-s, however, guard against this case when intending
>> > to pause the target domain. The same needs to be done for the
>> > ones leading here."
>> >
>> > We need to prevent the domain from pausing itself.
>> >
>> 
>> XEN_DOMCTL_assign_device doesn't imply a domain_pause operation,
>> at least not obvious in this level. If we don't have PI enabled underneath,
>> is above guard still necessary? If the answer is yes, the comment should
>> be elaborated for easy understanding... :-)
> 
> I think the domain_pause() is introduced in PI related logic. So maybe I
> need Jan's comments about whether we need to add this check unconditionally
> here. Anyway, the comments need to be more detailed.

I've never said anything about the specifics of the check (e.g.
whether it needs to be unconditional), I had only pointed out that
such a check is needed if you're adding pausing. That said, I don't
think there's anything wrong with having it unconditional: In the
end we don't mean to support self-(de)assignment of devices.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 4/7] VT-d: Use one function to update both remapped and posted IRTE
  2016-11-18  4:31   ` Tian, Kevin
@ 2016-11-22  4:31     ` Wu, Feng
  0 siblings, 0 replies; 29+ messages in thread
From: Wu, Feng @ 2016-11-22  4:31 UTC (permalink / raw)
  To: Tian, Kevin, xen-devel
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, Wu, Feng, jbeulich



> -----Original Message-----
> From: Tian, Kevin
> Sent: Friday, November 18, 2016 12:31 PM
> To: Wu, Feng <feng.wu@intel.com>; xen-devel@lists.xen.org
> Cc: jbeulich@suse.com; andrew.cooper3@citrix.com;
> george.dunlap@eu.citrix.com; dario.faggioli@citrix.com
> Subject: RE: [PATCH v8 4/7] VT-d: Use one function to update both remapped
> and posted IRTE
> 
> > From: Wu, Feng
> > Sent: Friday, November 18, 2016 9:57 AM
> >
> > Use one function to update both remapped IRTE and posted IRET.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> > v8:
> > - Newly added
> >
> >  xen/drivers/passthrough/vtd/intremap.c | 162 ++++++++++++++-----------------
> --
> >  1 file changed, 66 insertions(+), 96 deletions(-)
> >
> > diff --git a/xen/drivers/passthrough/vtd/intremap.c
> > b/xen/drivers/passthrough/vtd/intremap.c
> > index bfd468b..fd2a49a 100644
> > --- a/xen/drivers/passthrough/vtd/intremap.c
> > +++ b/xen/drivers/passthrough/vtd/intremap.c
> > @@ -420,7 +420,7 @@ void io_apic_write_remap_rte(
> >          __ioapic_write_entry(apic, ioapic_pin, 1, old_rte);
> >  }
> >
> > -static void set_msi_source_id(struct pci_dev *pdev, struct iremap_entry *ire)
> > +static void set_msi_source_id(const struct pci_dev *pdev, struct
> iremap_entry *ire)
> >  {
> >      u16 seg;
> >      u8 bus, devfn, secbus;
> > @@ -548,11 +548,12 @@ static int remap_entry_to_msi_msg(
> >  }
> >
> >  static int msi_msg_to_remap_entry(
> > -    struct iommu *iommu, struct pci_dev *pdev,
> > -    struct msi_desc *msi_desc, struct msi_msg *msg)
> > +    struct iommu *iommu, const struct pci_dev *pdev,
> > +    struct msi_desc *msi_desc, struct msi_msg *msg,
> > +    const struct pi_desc *pi_desc, const uint8_t gvec)
> >  {
> >      struct iremap_entry *iremap_entry = NULL, *iremap_entries;
> > -    struct iremap_entry new_ire;
> > +    struct iremap_entry new_ire, old_ire;
> 
> old_ire is used only in later 'else' branch. move definition there.
> 
> >      struct msi_msg_remap_entry *remap_rte;
> >      unsigned int index, i, nr = 1;
> >      unsigned long flags;
> > @@ -597,31 +598,50 @@ static int msi_msg_to_remap_entry(
> >
> >      memcpy(&new_ire, iremap_entry, sizeof(struct iremap_entry));
> >
> > -    /* Set interrupt remapping table entry */
> > -    new_ire.remap.fpd = 0;
> > -    new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT)
> & 0x1;
> > -    new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
> > -    new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) &
> 0x1;
> > -    /* Hardware require RH = 1 for LPR delivery mode */
> > -    new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
> > -    new_ire.remap.avail = 0;
> > -    new_ire.remap.res_1 = 0;
> > -    new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
> > -                            MSI_DATA_VECTOR_MASK;
> > -    new_ire.remap.res_2 = 0;
> > -    if ( x2apic_enabled )
> > -        new_ire.remap.dst = msg->dest32;
> > +    if ( !pi_desc )
> > +    {
> > +        /* Set interrupt remapping table entry */
> > +        new_ire.remap.fpd = 0;
> > +        new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT)
> &
> > 0x1;
> > +        new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
> > +        new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT)
> &
> > 0x1;
> > +        /* Hardware require RH = 1 for LPR delivery mode */
> > +        new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
> > +        new_ire.remap.avail = 0;
> > +        new_ire.remap.res_1 = 0;
> > +        new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
> > +                                MSI_DATA_VECTOR_MASK;
> > +        new_ire.remap.res_2 = 0;
> > +        if ( x2apic_enabled )
> > +            new_ire.remap.dst = msg->dest32;
> > +        else
> > +            new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
> > +                                 & 0xff) << 8;
> > +
> > +        new_ire.remap.res_3 = 0;
> > +        new_ire.remap.res_4 = 0;
> > +        new_ire.remap.p = 1;    /* finally, set present bit */
> > +    }
> >      else
> > -        new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
> > -                             & 0xff) << 8;
> > +    {
> > +        new_ire.post.fpd = 0;
> > +        new_ire.post.res_1 = 0;
> > +        new_ire.post.res_2 = 0;
> > +        new_ire.post.urg = 0;
> > +        new_ire.post.im = 1;
> > +        new_ire.post.vector = gvec;
> > +        new_ire.post.res_3 = 0;
> > +        new_ire.post.res_4 = 0;
> > +        new_ire.post.res_5 = 0;
> > +        new_ire.post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
> > +        new_ire.post.pda_h = virt_to_maddr(pi_desc) >> 32;
> > +        new_ire.post.p = 1;    /* finally, set present bit */
> > +    }
> 
> It's a little confusing here:
> 
> - you first copy current remapping entry to new_ire and then do some
> initialization.
> Is it necessary? If there are some fields we'd like to inherit, better to describe
> them
> in comment so others can have a clear understanding of what exactly fields
> must
> be updated here;

This is needed, please refer to my replay to [5/7].

> 
> - in original code of setup_posted_irte, you actually do conditional assignment
> based on format of old_ire. Above you changed to non-conditional assignment,
> not relying on old_ire. Is it desired?

In setup_posted_irte() we need to inherit the common fields (fpd, sid, sq, svt)
from the old IRTE. But now this common fields are setup by set_msi_source_id()
in this function.

Thanks,
Feng

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 4/7] VT-d: Use one function to update both remapped and posted IRTE
  2016-11-18  1:57 ` [PATCH v8 4/7] VT-d: Use one function to update both remapped and posted IRTE Feng Wu
  2016-11-18  4:31   ` Tian, Kevin
@ 2016-11-22  9:58   ` Jan Beulich
  2017-02-22  1:53     ` Chao Gao
  1 sibling, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2016-11-22  9:58 UTC (permalink / raw)
  To: Feng Wu
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, kevin.tian, xen-devel

>>> On 18.11.16 at 02:57, <feng.wu@intel.com> wrote:
> @@ -597,31 +598,50 @@ static int msi_msg_to_remap_entry(

Considering you basically re-do most of the function, I think there's
some more adjustment necessary (or at least very desirable) here.

>  
>      memcpy(&new_ire, iremap_entry, sizeof(struct iremap_entry));
>  
> -    /* Set interrupt remapping table entry */
> -    new_ire.remap.fpd = 0;
> -    new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
> -    new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
> -    new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
> -    /* Hardware require RH = 1 for LPR delivery mode */
> -    new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
> -    new_ire.remap.avail = 0;
> -    new_ire.remap.res_1 = 0;
> -    new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
> -                            MSI_DATA_VECTOR_MASK;
> -    new_ire.remap.res_2 = 0;
> -    if ( x2apic_enabled )
> -        new_ire.remap.dst = msg->dest32;
> +    if ( !pi_desc )
> +    {
> +        /* Set interrupt remapping table entry */
> +        new_ire.remap.fpd = 0;
> +        new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
> +        new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;

These two "& 0x1" seem unnecessary, as the fields are 1 bit only
anyway.

> +        new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;

This masking, however, has always been puzzling me: dlm is a 3 bit
field, and the MSI message field is a 3-bit one too. Hence the
masking should also be dropped here - if the message field is wrong
(i.e. holding an unsupported value), there's no good reason to try
to compensate for it here. If at all the function should refuse to do
the requested translation.

> +        /* Hardware require RH = 1 for LPR delivery mode */
> +        new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
> +        new_ire.remap.avail = 0;
> +        new_ire.remap.res_1 = 0;
> +        new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
> +                                MSI_DATA_VECTOR_MASK;
> +        new_ire.remap.res_2 = 0;
> +        if ( x2apic_enabled )
> +            new_ire.remap.dst = msg->dest32;
> +        else
> +            new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
> +                                 & 0xff) << 8;
> +
> +        new_ire.remap.res_3 = 0;
> +        new_ire.remap.res_4 = 0;
> +        new_ire.remap.p = 1;    /* finally, set present bit */

I don't understand this comment. The order of operations does not
matter at all, and in fact you now set p before being done with all
other fields. Please drop such misleading comments, or make them
useful/correct.

Also, the only field you don't explicitly set here is .im. Why?

> +    }
>      else
> -        new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
> -                             & 0xff) << 8;
> +    {
> +        new_ire.post.fpd = 0;
> +        new_ire.post.res_1 = 0;
> +        new_ire.post.res_2 = 0;
> +        new_ire.post.urg = 0;
> +        new_ire.post.im = 1;
> +        new_ire.post.vector = gvec;
> +        new_ire.post.res_3 = 0;
> +        new_ire.post.res_4 = 0;
> +        new_ire.post.res_5 = 0;
> +        new_ire.post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
> +        new_ire.post.pda_h = virt_to_maddr(pi_desc) >> 32;
> +        new_ire.post.p = 1;    /* finally, set present bit */
> +    }

Along the lines of the comment above, you don't fill avail here. Why?

Taking both together, I don't see why - after adding said initialization -
new_ire needs to start out as a copy of the live IRTE. Instead you
could memset() the whole structure to zero, or simply give it an empty
initializer (saving you from initializing all the reserved fields, plus some
more).

And of course, along the lines of ...

>      if ( pdev )
>          set_msi_source_id(pdev, &new_ire);
>      else
>          set_hpet_source_id(msi_desc->hpet_id, &new_ire);

... this, I see little reason for common fields to be initialized separately
on each path. According to the code above that's only p (leaving
aside fields which get zeroed), but anyway. Perhaps there should
even be a common sub-structure of the union ...

> @@ -637,7 +657,23 @@ static int msi_msg_to_remap_entry(
>      remap_rte->address_hi = 0;
>      remap_rte->data = index - i;
>  
> -    memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
> +    if ( !pi_desc )
> +        memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
> +    else
> +    {
> +        __uint128_t ret;
> +
> +        old_ire = *iremap_entry;
> +        ret = cmpxchg16b(iremap_entry, &old_ire, &new_ire);
> +
> +        /*
> +         * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
> +         * and the hardware cannot update the IRTE behind us, so the return value
> +         * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
> +         */
> +        ASSERT(ret == old_ire.val);
> +    }

Could you remind me again please why posted format updates need
to use cmpxchg16b, but remapped format ones don't? (As a aside,
with the code structure as you have it you should move the old_irte
declaration here, or omit it altogether, as you could as well pass
*iremap_entry directly afaict.)

> @@ -668,7 +704,8 @@ int msi_msg_write_remap_rte(
>  
>      drhd = pdev ? acpi_find_matched_drhd_unit(pdev)
>                  : hpet_to_drhd(msi_desc->hpet_id);
> -    return drhd ? msi_msg_to_remap_entry(drhd->iommu, pdev, msi_desc, msg)
> +    return drhd ? msi_msg_to_remap_entry(drhd->iommu, pdev,
> +                                         msi_desc, msg, NULL, 0)

Is this unconditional passing of NULL here really correct?

> @@ -946,17 +947,12 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
>      const uint8_t gvec)
>  {
>      struct irq_desc *desc;
> -    const struct msi_desc *msi_desc;
> -    int remap_index;
> +    struct msi_desc *msi_desc;
>      int rc = 0;
>      const struct pci_dev *pci_dev;
>      const struct acpi_drhd_unit *drhd;
>      struct iommu *iommu;
> -    struct ir_ctrl *ir_ctrl;
> -    struct iremap_entry *iremap_entries = NULL, *p = NULL;
> -    struct iremap_entry new_ire, old_ire;
>      const struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> -    __uint128_t ret;
>  
>      desc = pirq_spin_lock_irq_desc(pirq, NULL);
>      if ( !desc )
> @@ -976,8 +972,6 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
>          goto unlock_out;
>      }
>  
> -    remap_index = msi_desc->remap_index;
> -
>      spin_unlock_irq(&desc->lock);
>  
>      ASSERT(pcidevs_locked());
> @@ -992,35 +986,11 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
>          return -ENODEV;
>  
>      iommu = drhd->iommu;
> -    ir_ctrl = iommu_ir_ctrl(iommu);
> -    if ( !ir_ctrl )
> +    if ( !iommu_ir_ctrl(iommu) )
>          return -ENODEV;
>  
> -    spin_lock_irq(&ir_ctrl->iremap_lock);
> -
> -    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
> -
> -    old_ire = *p;
> -
> -    /* Setup/Update interrupt remapping table entry. */
> -    setup_posted_irte(&new_ire, &old_ire, pi_desc, gvec);
> -    ret = cmpxchg16b(p, &old_ire, &new_ire);
> -
> -    /*
> -     * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
> -     * and the hardware cannot update the IRTE behind us, so the return value
> -     * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
> -     */
> -    ASSERT(ret == old_ire.val);
> -
> -    iommu_flush_cache_entry(p, sizeof(*p));
> -    iommu_flush_iec_index(iommu, 0, remap_index);
> -
> -    unmap_vtd_domain_page(iremap_entries);
> -
> -    spin_unlock_irq(&ir_ctrl->iremap_lock);
> -
> -    return 0;
> +    return msi_msg_to_remap_entry(iommu, pci_dev, msi_desc, &msi_desc->msg,
> +                                  pi_desc, gvec);

There are few changes here which appear to have the potential of
affecting behavior: Previously you didn't alter msi_desc or the MSI
message contained therein (as documented by the pointer having
been const). Is this possible updating of message and remap index
really benign? In any event any such changes should be reasoned
about in the commit message.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list
  2016-11-18  1:57 [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list Feng Wu
                   ` (7 preceding siblings ...)
  2016-11-18  3:02 ` [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list Tian, Kevin
@ 2017-02-21 21:30 ` Chao Gao
  8 siblings, 0 replies; 29+ messages in thread
From: Chao Gao @ 2017-02-21 21:30 UTC (permalink / raw)
  To: xen-devel
  Cc: george.dunlap, andrew.cooper3, Tian, Kevin, dario.faggioli, jbeulich

Hi, everyone. Really sorry for digging out this thread. Because feng has left
Intel, I will take over this work, address some problems in his v8 patch set
and send out a v9 verson later such that VT-d PI can work properly on Xen.  

Thanks,
Chao

On Fri, Nov 18, 2016 at 09:57:17AM +0800, Wu, Feng wrote:
>The current VT-d PI related code may operate incorrectly in the
>following scenarios:
>1. When the last assigned device is dettached from the domain, all
>the PI related hooks are removed then, however, the vCPU can be
>blocked, switched to another pCPU, etc, all without the aware of
>PI. After the next time we attach another device to the domain,
>which makes the PI realted hooks avaliable again, the status
>of the pi descriptor is not true. Beside that, the blocking vcpu
>may still remain in the per-cpu blocking in this case. Patch [1/7]
>and [2/7] handle this.
>
>2. [4/7] unify the code path of PI mode update and remapped mode update
>
>2. When IRTE is in posted mode, we don't need to set the irq
>affinity for it, since the destination of these interrupts is
>vCPU and the vCPU affinity is set during vCPU scheduling. Patch
>[5/7] handles this.
>
>4. [6/7] is a cleanup patch
>
>5. When a pCPU is unplugged, and there might be vCPUs on its
>list. Since the pCPU is offline, those vCPUs might not be woken
>up again. [7/7] addresses it.
>
>
>Feng Wu (7):
>  VMX: Permanently assign PI hook vmx_pi_switch_to()
>  VMX: Properly handle pi when all the assigned devices are removed
>  VMX: Make sure PI is in proper state before install the hooks
>  VT-d: Use one function to update both remapped and posted IRTE
>  VT-d: No need to set irq affinity for posted format IRTE
>  VT-d: Some cleanups
>  VMX: Fixup PI descriptor when cpu is offline
>
> xen/arch/x86/hvm/vmx/vmcs.c            |  14 +--
> xen/arch/x86/hvm/vmx/vmx.c             | 133 ++++++++++++++++++++++-
> xen/drivers/passthrough/pci.c          |  14 +++
> xen/drivers/passthrough/vtd/intremap.c | 193 +++++++++++++++------------------
> xen/include/asm-x86/hvm/vmx/vmx.h      |   3 +
> 5 files changed, 240 insertions(+), 117 deletions(-)
>
>--
>2.1.0
>
>
>_______________________________________________
>Xen-devel mailing list
>Xen-devel@lists.xen.org
>https://lists.xen.org/xen-devel

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 4/7] VT-d: Use one function to update both remapped and posted IRTE
  2016-11-22  9:58   ` Jan Beulich
@ 2017-02-22  1:53     ` Chao Gao
  2017-02-22  9:10       ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread
From: Chao Gao @ 2017-02-22  1:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, Tian, Kevin, xen-devel

On Tue, Nov 22, 2016 at 05:58:56PM +0800, Jan Beulich wrote:
>>>> On 18.11.16 at 02:57, <feng.wu@intel.com> wrote:
>> @@ -597,31 +598,50 @@ static int msi_msg_to_remap_entry(
>
>Considering you basically re-do most of the function, I think there's
>some more adjustment necessary (or at least very desirable) here.
>
>>
>>      memcpy(&new_ire, iremap_entry, sizeof(struct iremap_entry));
>>
>> -    /* Set interrupt remapping table entry */
>> -    new_ire.remap.fpd = 0;
>> -    new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
>> -    new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
>> -    new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
>> -    /* Hardware require RH = 1 for LPR delivery mode */
>> -    new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
>> -    new_ire.remap.avail = 0;
>> -    new_ire.remap.res_1 = 0;
>> -    new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
>> -                            MSI_DATA_VECTOR_MASK;
>> -    new_ire.remap.res_2 = 0;
>> -    if ( x2apic_enabled )
>> -        new_ire.remap.dst = msg->dest32;
>> +    if ( !pi_desc )
>> +    {
>> +        /* Set interrupt remapping table entry */
>> +        new_ire.remap.fpd = 0;
>> +        new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
>> +        new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
>
>These two "& 0x1" seem unnecessary, as the fields are 1 bit only
>anyway.
>

Ok, will remove them.

>> +        new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
>
>This masking, however, has always been puzzling me: dlm is a 3 bit
>field, and the MSI message field is a 3-bit one too. Hence the
>masking should also be dropped here - if the message field is wrong
>(i.e. holding an unsupported value), there's no good reason to try
>to compensate for it here. If at all the function should refuse to do
>the requested translation.
>
>> +        /* Hardware require RH = 1 for LPR delivery mode */
>> +        new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
>> +        new_ire.remap.avail = 0;
>> +        new_ire.remap.res_1 = 0;
>> +        new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
>> +                                MSI_DATA_VECTOR_MASK;
>> +        new_ire.remap.res_2 = 0;
>> +        if ( x2apic_enabled )
>> +            new_ire.remap.dst = msg->dest32;
>> +        else
>> +            new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
>> +                                 & 0xff) << 8;
>> +
>> +        new_ire.remap.res_3 = 0;
>> +        new_ire.remap.res_4 = 0;
>> +        new_ire.remap.p = 1;    /* finally, set present bit */
>
>I don't understand this comment. The order of operations does not
>matter at all, and in fact you now set p before being done with all
>other fields. Please drop such misleading comments, or make them
>useful/correct.

Yes, The order does not matter. I will drop this comment.

>
>Also, the only field you don't explicitly set here is .im. Why?
>
>> +    }
>>      else
>> -        new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
>> -                             & 0xff) << 8;
>> +    {
>> +        new_ire.post.fpd = 0;
>> +        new_ire.post.res_1 = 0;
>> +        new_ire.post.res_2 = 0;
>> +        new_ire.post.urg = 0;
>> +        new_ire.post.im = 1;
>> +        new_ire.post.vector = gvec;
>> +        new_ire.post.res_3 = 0;
>> +        new_ire.post.res_4 = 0;
>> +        new_ire.post.res_5 = 0;
>> +        new_ire.post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
>> +        new_ire.post.pda_h = virt_to_maddr(pi_desc) >> 32;
>> +        new_ire.post.p = 1;    /* finally, set present bit */
>> +    }
>
>Along the lines of the comment above, you don't fill avail here. Why?
>
>Taking both together, I don't see why - after adding said initialization -
>new_ire needs to start out as a copy of the live IRTE. Instead you
>could memset() the whole structure to zero, or simply give it an empty
>initializer (saving you from initializing all the reserved fields, plus some
>more).
>

Yes, it is really confused. Maybe because this field is available to software in posting
format IRTE. We can reserve the avail field through introducing a flag to
distinguish initialization from update. We also can clear the avail field every
time if we don't use it right now, leaving the problem to others who want use
the avail field.  Do you think which is better?

>And of course, along the lines of ...
>
>>      if ( pdev )
>>          set_msi_source_id(pdev, &new_ire);
>>      else
>>          set_hpet_source_id(msi_desc->hpet_id, &new_ire);
>
>... this, I see little reason for common fields to be initialized separately
>on each path. According to the code above that's only p (leaving
>aside fields which get zeroed), but anyway. Perhaps there should
>even be a common sub-structure of the union ...

I can add a patch in this series to do this.

>
>> @@ -637,7 +657,23 @@ static int msi_msg_to_remap_entry(
>>      remap_rte->address_hi = 0;
>>      remap_rte->data = index - i;
>>
>> -    memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
>> +    if ( !pi_desc )
>> +        memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
>> +    else
>> +    {
>> +        __uint128_t ret;
>> +
>> +        old_ire = *iremap_entry;
>> +        ret = cmpxchg16b(iremap_entry, &old_ire, &new_ire);
>> +
>> +        /*
>> +         * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
>> +         * and the hardware cannot update the IRTE behind us, so the return value
>> +         * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
>> +         */
>> +        ASSERT(ret == old_ire.val);
>> +    }
>
>Could you remind me again please why posted format updates need
>to use cmpxchg16b, but remapped format ones don't? (As a aside,
>with the code structure as you have it you should move the old_irte
>declaration here, or omit it altogether, as you could as well pass
>*iremap_entry directly afaict.)

Before feng left, I have asked him about this question. He told me that
the PDA field of posted format IRTE comprises of two parts:
Posted Descritor Address High[127:96] and Low [63:38]. If we want to update
PDA field, do it atomically or disable-update-enable. He also said, it had
been confirmed that cmpxchg16b was supported on all intel platform with VT-d PI.
If we assume that updates to remapped format IRTE only is to update either
64 bit or high 64 bit (except initialition), two 64bit memory write operations
is enough. 

>
>> @@ -668,7 +704,8 @@ int msi_msg_write_remap_rte(
>>
>>      drhd = pdev ? acpi_find_matched_drhd_unit(pdev)
>>                  : hpet_to_drhd(msi_desc->hpet_id);
>> -    return drhd ? msi_msg_to_remap_entry(drhd->iommu, pdev, msi_desc, msg)
>> +    return drhd ? msi_msg_to_remap_entry(drhd->iommu, pdev,
>> +                                         msi_desc, msg, NULL, 0)
>
>Is this unconditional passing of NULL here really correct?

Since two parameters are added to this function, we should think about what
the function does again. the last 2 parameters are optional.

If they are not present, just means a physical device driver changes its msi
message. So it notifies iommu to do some changes to IRTE accordingly (the driver doesn't
know the format of the live IRTE). This is the case above.

If they are present, it means the msi should be delivered to the vcpu with the
vector num. To achieve that, the function replaces the old IRTE with a new
posted format IRTE.

>
>> @@ -946,17 +947,12 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
>>      const uint8_t gvec)
>>  {
>>      struct irq_desc *desc;
>> -    const struct msi_desc *msi_desc;
>> -    int remap_index;
>> +    struct msi_desc *msi_desc;
>>      int rc = 0;
>>      const struct pci_dev *pci_dev;
>>      const struct acpi_drhd_unit *drhd;
>>      struct iommu *iommu;
>> -    struct ir_ctrl *ir_ctrl;
>> -    struct iremap_entry *iremap_entries = NULL, *p = NULL;
>> -    struct iremap_entry new_ire, old_ire;
>>      const struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
>> -    __uint128_t ret;
>>
>>      desc = pirq_spin_lock_irq_desc(pirq, NULL);
>>      if ( !desc )
>> @@ -976,8 +972,6 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
>>          goto unlock_out;
>>      }
>>
>> -    remap_index = msi_desc->remap_index;
>> -
>>      spin_unlock_irq(&desc->lock);
>>
>>      ASSERT(pcidevs_locked());
>> @@ -992,35 +986,11 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
>>          return -ENODEV;
>>
>>      iommu = drhd->iommu;
>> -    ir_ctrl = iommu_ir_ctrl(iommu);
>> -    if ( !ir_ctrl )
>> +    if ( !iommu_ir_ctrl(iommu) )
>>          return -ENODEV;
>>
>> -    spin_lock_irq(&ir_ctrl->iremap_lock);
>> -
>> -    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
>> -
>> -    old_ire = *p;
>> -
>> -    /* Setup/Update interrupt remapping table entry. */
>> -    setup_posted_irte(&new_ire, &old_ire, pi_desc, gvec);
>> -    ret = cmpxchg16b(p, &old_ire, &new_ire);
>> -
>> -    /*
>> -     * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
>> -     * and the hardware cannot update the IRTE behind us, so the return value
>> -     * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
>> -     */
>> -    ASSERT(ret == old_ire.val);
>> -
>> -    iommu_flush_cache_entry(p, sizeof(*p));
>> -    iommu_flush_iec_index(iommu, 0, remap_index);
>> -
>> -    unmap_vtd_domain_page(iremap_entries);
>> -
>> -    spin_unlock_irq(&ir_ctrl->iremap_lock);
>> -
>> -    return 0;
>> +    return msi_msg_to_remap_entry(iommu, pci_dev, msi_desc, &msi_desc->msg,
>> +                                  pi_desc, gvec);
>
>There are few changes here which appear to have the potential of
>affecting behavior: Previously you didn't alter msi_desc or the MSI
>message contained therein (as documented by the pointer having
>been const). Is this possible updating of message and remap index
>really benign? In any event any such changes should be reasoned
>about in the commit message.

I agree that we can't update message and remap index in this pi_update_irte.
but msi_msg_to_remap_entry will change msi_desc when msi_desc->remap_index < 0.
How about splitting part of msi_msg_to_remap_entry to a new function which consumes a const
msi_desc parameter and pi_update_irte will call the new function?

>
>Jan
>
>_______________________________________________
>Xen-devel mailing list
>Xen-devel@lists.xen.org
>https://lists.xen.org/xen-devel

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 4/7] VT-d: Use one function to update both remapped and posted IRTE
  2017-02-22  9:10       ` Jan Beulich
@ 2017-02-22  7:12         ` Chao Gao
  2017-02-22 14:54           ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread
From: Chao Gao @ 2017-02-22  7:12 UTC (permalink / raw)
  To: Jan Beulich
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, Kevin Tian, xen-devel

On Wed, Feb 22, 2017 at 02:10:25AM -0700, Jan Beulich wrote:
>>>> On 22.02.17 at 02:53, <chao.gao@intel.com> wrote:
>> On Tue, Nov 22, 2016 at 05:58:56PM +0800, Jan Beulich wrote:
>>>> @@ -637,7 +657,23 @@ static int msi_msg_to_remap_entry(
>>>>      remap_rte->address_hi = 0;
>>>>      remap_rte->data = index - i;
>>>>
>>>> -    memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
>>>> +    if ( !pi_desc )
>>>> +        memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
>>>> +    else
>>>> +    {
>>>> +        __uint128_t ret;
>>>> +
>>>> +        old_ire = *iremap_entry;
>>>> +        ret = cmpxchg16b(iremap_entry, &old_ire, &new_ire);
>>>> +
>>>> +        /*
>>>> +         * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
>>>> +         * and the hardware cannot update the IRTE behind us, so the return value
>>>> +         * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
>>>> +         */
>>>> +        ASSERT(ret == old_ire.val);
>>>> +    }
>>>
>>>Could you remind me again please why posted format updates need
>>>to use cmpxchg16b, but remapped format ones don't? (As a aside,
>>>with the code structure as you have it you should move the old_irte
>>>declaration here, or omit it altogether, as you could as well pass
>>>*iremap_entry directly afaict.)
>> 
>> Before feng left, I have asked him about this question. He told me that
>> the PDA field of posted format IRTE comprises of two parts:
>> Posted Descritor Address High[127:96] and Low [63:38]. If we want to update
>> PDA field, do it atomically or disable-update-enable. He also said, it had
>> been confirmed that cmpxchg16b was supported on all intel platform with VT-d PI.
>> If we assume that updates to remapped format IRTE only is to update either
>> 64 bit or high 64 bit (except initialition), two 64bit memory write operations
>> is enough. 
>
>Two 64-bit memory write operations? Where do you see them? I
>only see memcpy(), which for the purposes of the code here is
>supposed to be a black box.

Ok. I made a mistake here. In ioapic case, before update IRTE, the according
IOAPIC RTE is masked. So, using a memcpy() is safe. In msi case, there is no
mask operation. I think only using a memcpy() is unsafe. Do you think so?

>
>>>> @@ -668,7 +704,8 @@ int msi_msg_write_remap_rte(
>>>>
>>>>      drhd = pdev ? acpi_find_matched_drhd_unit(pdev)
>>>>                  : hpet_to_drhd(msi_desc->hpet_id);
>>>> -    return drhd ? msi_msg_to_remap_entry(drhd->iommu, pdev, msi_desc, msg)
>>>> +    return drhd ? msi_msg_to_remap_entry(drhd->iommu, pdev,
>>>> +                                         msi_desc, msg, NULL, 0)
>>>
>>>Is this unconditional passing of NULL here really correct?
>> 
>> Since two parameters are added to this function, we should think about what
>> the function does again. the last 2 parameters are optional.
>> 
>> If they are not present, just means a physical device driver changes its msi
>> message. So it notifies iommu to do some changes to IRTE accordingly (the driver doesn't
>> know the format of the live IRTE). This is the case above.
>> 
>> If they are present, it means the msi should be delivered to the vcpu with the
>> vector num. To achieve that, the function replaces the old IRTE with a new
>> posted format IRTE.
>
>I don't see how this answers my question. In fact it feels like you,
>just like Feng, are making assumptions on the conditions under
>which the function here is being called _at present_. You should,
>however, make the function work correctly for all possible uses,
>or add ASSERT()s to clearly expose issues with potential new,
>future callers.

Ok. Your suggestion is very good. 
Every caller tells the function to construct a centain format IRTE. Return to
your question, I think it is defintely wrong to pass NULL unconditionally.
We should  pass NULL before the msi is binded to a guest interrupt and pass the
destination vcpu and vector num after that. At this moment, I can't come up
with a way to check whether the msi is binded to a guest interrupt and get the
destination vcpu and vector num only through the struct pci_dev and struct
msi_desc. Could you give me some suggestion on that or recommend a structure,
you think, in which we can add some fields to record these information?  

Thanks,
Chao

>
>>>> @@ -992,35 +986,11 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
>>>>          return -ENODEV;
>>>>
>>>>      iommu = drhd->iommu;
>>>> -    ir_ctrl = iommu_ir_ctrl(iommu);
>>>> -    if ( !ir_ctrl )
>>>> +    if ( !iommu_ir_ctrl(iommu) )
>>>>          return -ENODEV;
>>>>
>>>> -    spin_lock_irq(&ir_ctrl->iremap_lock);
>>>> -
>>>> -    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
>>>> -
>>>> -    old_ire = *p;
>>>> -
>>>> -    /* Setup/Update interrupt remapping table entry. */
>>>> -    setup_posted_irte(&new_ire, &old_ire, pi_desc, gvec);
>>>> -    ret = cmpxchg16b(p, &old_ire, &new_ire);
>>>> -
>>>> -    /*
>>>> -     * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
>>>> -     * and the hardware cannot update the IRTE behind us, so the return value
>>>> -     * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
>>>> -     */
>>>> -    ASSERT(ret == old_ire.val);
>>>> -
>>>> -    iommu_flush_cache_entry(p, sizeof(*p));
>>>> -    iommu_flush_iec_index(iommu, 0, remap_index);
>>>> -
>>>> -    unmap_vtd_domain_page(iremap_entries);
>>>> -
>>>> -    spin_unlock_irq(&ir_ctrl->iremap_lock);
>>>> -
>>>> -    return 0;
>>>> +    return msi_msg_to_remap_entry(iommu, pci_dev, msi_desc, &msi_desc->msg,
>>>> +                                  pi_desc, gvec);
>>>
>>>There are few changes here which appear to have the potential of
>>>affecting behavior: Previously you didn't alter msi_desc or the MSI
>>>message contained therein (as documented by the pointer having
>>>been const). Is this possible updating of message and remap index
>>>really benign? In any event any such changes should be reasoned
>>>about in the commit message.
>> 
>> I agree that we can't update message and remap index in this pi_update_irte.
>> but msi_msg_to_remap_entry will change msi_desc when msi_desc->remap_index < 0.
>> How about splitting part of msi_msg_to_remap_entry to a new function which consumes a const
>> msi_desc parameter and pi_update_irte will call the new function?
>
>Well, I can't easily say yes or no here without seeing what the
>result would be. Give it a try, and we'll look at the result in v9.
>
>Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 4/7] VT-d: Use one function to update both remapped and posted IRTE
  2017-02-22  1:53     ` Chao Gao
@ 2017-02-22  9:10       ` Jan Beulich
  2017-02-22  7:12         ` Chao Gao
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2017-02-22  9:10 UTC (permalink / raw)
  To: Chao Gao
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, Kevin Tian, xen-devel

>>> On 22.02.17 at 02:53, <chao.gao@intel.com> wrote:
> On Tue, Nov 22, 2016 at 05:58:56PM +0800, Jan Beulich wrote:
>>>>> On 18.11.16 at 02:57, <feng.wu@intel.com> wrote:
>>>      else
>>> -        new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
>>> -                             & 0xff) << 8;
>>> +    {
>>> +        new_ire.post.fpd = 0;
>>> +        new_ire.post.res_1 = 0;
>>> +        new_ire.post.res_2 = 0;
>>> +        new_ire.post.urg = 0;
>>> +        new_ire.post.im = 1;
>>> +        new_ire.post.vector = gvec;
>>> +        new_ire.post.res_3 = 0;
>>> +        new_ire.post.res_4 = 0;
>>> +        new_ire.post.res_5 = 0;
>>> +        new_ire.post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
>>> +        new_ire.post.pda_h = virt_to_maddr(pi_desc) >> 32;
>>> +        new_ire.post.p = 1;    /* finally, set present bit */
>>> +    }
>>
>>Along the lines of the comment above, you don't fill avail here. Why?
>>
>>Taking both together, I don't see why - after adding said initialization -
>>new_ire needs to start out as a copy of the live IRTE. Instead you
>>could memset() the whole structure to zero, or simply give it an empty
>>initializer (saving you from initializing all the reserved fields, plus some
>>more).
>>
> 
> Yes, it is really confused. Maybe because this field is available to software in posting
> format IRTE. We can reserve the avail field through introducing a flag to
> distinguish initialization from update. We also can clear the avail field every
> time if we don't use it right now, leaving the problem to others who want use
> the avail field.  Do you think which is better?

As its unused, clear it every time. We simply don't know how a
potential use would want it set during update.

>>> @@ -637,7 +657,23 @@ static int msi_msg_to_remap_entry(
>>>      remap_rte->address_hi = 0;
>>>      remap_rte->data = index - i;
>>>
>>> -    memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
>>> +    if ( !pi_desc )
>>> +        memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
>>> +    else
>>> +    {
>>> +        __uint128_t ret;
>>> +
>>> +        old_ire = *iremap_entry;
>>> +        ret = cmpxchg16b(iremap_entry, &old_ire, &new_ire);
>>> +
>>> +        /*
>>> +         * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
>>> +         * and the hardware cannot update the IRTE behind us, so the return value
>>> +         * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
>>> +         */
>>> +        ASSERT(ret == old_ire.val);
>>> +    }
>>
>>Could you remind me again please why posted format updates need
>>to use cmpxchg16b, but remapped format ones don't? (As a aside,
>>with the code structure as you have it you should move the old_irte
>>declaration here, or omit it altogether, as you could as well pass
>>*iremap_entry directly afaict.)
> 
> Before feng left, I have asked him about this question. He told me that
> the PDA field of posted format IRTE comprises of two parts:
> Posted Descritor Address High[127:96] and Low [63:38]. If we want to update
> PDA field, do it atomically or disable-update-enable. He also said, it had
> been confirmed that cmpxchg16b was supported on all intel platform with VT-d PI.
> If we assume that updates to remapped format IRTE only is to update either
> 64 bit or high 64 bit (except initialition), two 64bit memory write operations
> is enough. 

Two 64-bit memory write operations? Where do you see them? I
only see memcpy(), which for the purposes of the code here is
supposed to be a black box.

>>> @@ -668,7 +704,8 @@ int msi_msg_write_remap_rte(
>>>
>>>      drhd = pdev ? acpi_find_matched_drhd_unit(pdev)
>>>                  : hpet_to_drhd(msi_desc->hpet_id);
>>> -    return drhd ? msi_msg_to_remap_entry(drhd->iommu, pdev, msi_desc, msg)
>>> +    return drhd ? msi_msg_to_remap_entry(drhd->iommu, pdev,
>>> +                                         msi_desc, msg, NULL, 0)
>>
>>Is this unconditional passing of NULL here really correct?
> 
> Since two parameters are added to this function, we should think about what
> the function does again. the last 2 parameters are optional.
> 
> If they are not present, just means a physical device driver changes its msi
> message. So it notifies iommu to do some changes to IRTE accordingly (the driver doesn't
> know the format of the live IRTE). This is the case above.
> 
> If they are present, it means the msi should be delivered to the vcpu with the
> vector num. To achieve that, the function replaces the old IRTE with a new
> posted format IRTE.

I don't see how this answers my question. In fact it feels like you,
just like Feng, are making assumptions on the conditions under
which the function here is being called _at present_. You should,
however, make the function work correctly for all possible uses,
or add ASSERT()s to clearly expose issues with potential new,
future callers.

>>> @@ -992,35 +986,11 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
>>>          return -ENODEV;
>>>
>>>      iommu = drhd->iommu;
>>> -    ir_ctrl = iommu_ir_ctrl(iommu);
>>> -    if ( !ir_ctrl )
>>> +    if ( !iommu_ir_ctrl(iommu) )
>>>          return -ENODEV;
>>>
>>> -    spin_lock_irq(&ir_ctrl->iremap_lock);
>>> -
>>> -    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
>>> -
>>> -    old_ire = *p;
>>> -
>>> -    /* Setup/Update interrupt remapping table entry. */
>>> -    setup_posted_irte(&new_ire, &old_ire, pi_desc, gvec);
>>> -    ret = cmpxchg16b(p, &old_ire, &new_ire);
>>> -
>>> -    /*
>>> -     * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
>>> -     * and the hardware cannot update the IRTE behind us, so the return value
>>> -     * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
>>> -     */
>>> -    ASSERT(ret == old_ire.val);
>>> -
>>> -    iommu_flush_cache_entry(p, sizeof(*p));
>>> -    iommu_flush_iec_index(iommu, 0, remap_index);
>>> -
>>> -    unmap_vtd_domain_page(iremap_entries);
>>> -
>>> -    spin_unlock_irq(&ir_ctrl->iremap_lock);
>>> -
>>> -    return 0;
>>> +    return msi_msg_to_remap_entry(iommu, pci_dev, msi_desc, &msi_desc->msg,
>>> +                                  pi_desc, gvec);
>>
>>There are few changes here which appear to have the potential of
>>affecting behavior: Previously you didn't alter msi_desc or the MSI
>>message contained therein (as documented by the pointer having
>>been const). Is this possible updating of message and remap index
>>really benign? In any event any such changes should be reasoned
>>about in the commit message.
> 
> I agree that we can't update message and remap index in this pi_update_irte.
> but msi_msg_to_remap_entry will change msi_desc when msi_desc->remap_index < 0.
> How about splitting part of msi_msg_to_remap_entry to a new function which consumes a const
> msi_desc parameter and pi_update_irte will call the new function?

Well, I can't easily say yes or no here without seeing what the
result would be. Give it a try, and we'll look at the result in v9.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v8 4/7] VT-d: Use one function to update both remapped and posted IRTE
  2017-02-22  7:12         ` Chao Gao
@ 2017-02-22 14:54           ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2017-02-22 14:54 UTC (permalink / raw)
  To: Chao Gao
  Cc: george.dunlap, andrew.cooper3, dario.faggioli, Kevin Tian, xen-devel

>>> On 22.02.17 at 08:12, <chao.gao@intel.com> wrote:
> On Wed, Feb 22, 2017 at 02:10:25AM -0700, Jan Beulich wrote:
>>>>> On 22.02.17 at 02:53, <chao.gao@intel.com> wrote:
>>> On Tue, Nov 22, 2016 at 05:58:56PM +0800, Jan Beulich wrote:
>>>>> @@ -637,7 +657,23 @@ static int msi_msg_to_remap_entry(
>>>>>      remap_rte->address_hi = 0;
>>>>>      remap_rte->data = index - i;
>>>>>
>>>>> -    memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
>>>>> +    if ( !pi_desc )
>>>>> +        memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
>>>>> +    else
>>>>> +    {
>>>>> +        __uint128_t ret;
>>>>> +
>>>>> +        old_ire = *iremap_entry;
>>>>> +        ret = cmpxchg16b(iremap_entry, &old_ire, &new_ire);
>>>>> +
>>>>> +        /*
>>>>> +         * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
>>>>> +         * and the hardware cannot update the IRTE behind us, so the return value
>>>>> +         * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
>>>>> +         */
>>>>> +        ASSERT(ret == old_ire.val);
>>>>> +    }
>>>>
>>>>Could you remind me again please why posted format updates need
>>>>to use cmpxchg16b, but remapped format ones don't? (As a aside,
>>>>with the code structure as you have it you should move the old_irte
>>>>declaration here, or omit it altogether, as you could as well pass
>>>>*iremap_entry directly afaict.)
>>> 
>>> Before feng left, I have asked him about this question. He told me that
>>> the PDA field of posted format IRTE comprises of two parts:
>>> Posted Descritor Address High[127:96] and Low [63:38]. If we want to update
>>> PDA field, do it atomically or disable-update-enable. He also said, it had
>>> been confirmed that cmpxchg16b was supported on all intel platform with VT-d PI.
>>> If we assume that updates to remapped format IRTE only is to update either
>>> 64 bit or high 64 bit (except initialition), two 64bit memory write operations
>>> is enough. 
>>
>>Two 64-bit memory write operations? Where do you see them? I
>>only see memcpy(), which for the purposes of the code here is
>>supposed to be a black box.
> 
> Ok. I made a mistake here. In ioapic case, before update IRTE, the according
> IOAPIC RTE is masked. So, using a memcpy() is safe. In msi case, there is no
> mask operation. I think only using a memcpy() is unsafe. Do you think so?

memcpy() is unsafe in any event. I was actually trying to
understand why it's not always cmpxchg16b that gets used
here.

>>>>> @@ -668,7 +704,8 @@ int msi_msg_write_remap_rte(
>>>>>
>>>>>      drhd = pdev ? acpi_find_matched_drhd_unit(pdev)
>>>>>                  : hpet_to_drhd(msi_desc->hpet_id);
>>>>> -    return drhd ? msi_msg_to_remap_entry(drhd->iommu, pdev, msi_desc, msg)
>>>>> +    return drhd ? msi_msg_to_remap_entry(drhd->iommu, pdev,
>>>>> +                                         msi_desc, msg, NULL, 0)
>>>>
>>>>Is this unconditional passing of NULL here really correct?
>>> 
>>> Since two parameters are added to this function, we should think about what
>>> the function does again. the last 2 parameters are optional.
>>> 
>>> If they are not present, just means a physical device driver changes its msi
>>> message. So it notifies iommu to do some changes to IRTE accordingly (the driver doesn't
>>> know the format of the live IRTE). This is the case above.
>>> 
>>> If they are present, it means the msi should be delivered to the vcpu with the
>>> vector num. To achieve that, the function replaces the old IRTE with a new
>>> posted format IRTE.
>>
>>I don't see how this answers my question. In fact it feels like you,
>>just like Feng, are making assumptions on the conditions under
>>which the function here is being called _at present_. You should,
>>however, make the function work correctly for all possible uses,
>>or add ASSERT()s to clearly expose issues with potential new,
>>future callers.
> 
> Ok. Your suggestion is very good. 
> Every caller tells the function to construct a centain format IRTE. Return to
> your question, I think it is defintely wrong to pass NULL unconditionally.
> We should  pass NULL before the msi is binded to a guest interrupt and pass the
> destination vcpu and vector num after that. At this moment, I can't come up
> with a way to check whether the msi is binded to a guest interrupt and get the
> destination vcpu and vector num only through the struct pci_dev and struct
> msi_desc. Could you give me some suggestion on that or recommend a structure,
> you think, in which we can add some fields to record these information?  

I would hope all information is already available (i.e. stored
somewhere). And I have to admit that it is increasingly annoying
to have to repeatedly, after many weeks of silence, swap back
in all the context here. In particular, I don't have the original
patch in my mailbox anymore (it got purged automatically as it
seems), so recovering context is even more time consuming.

And then, pi_update_irte() is being called when the interrupt
gets bound. gvec, if it changed, would require a rebind, so the
request would again come that route. Perhaps the function
here simply needs to fail if undue changes are being requested
without all necessary data? As suggested before, even adding
ASSERT()s for places where you make assumptions on current
caller behavior would be okay. The only thing that's not okay in
my opinion is to make implicit/hidden assumptions.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2017-02-22 14:54 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-18  1:57 [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list Feng Wu
2016-11-18  1:57 ` [PATCH v8 1/7] VMX: Permanently assign PI hook vmx_pi_switch_to() Feng Wu
2016-11-18  3:14   ` Tian, Kevin
2016-11-18  4:23     ` Wu, Feng
2016-11-18  1:57 ` [PATCH v8 2/7] VMX: Properly handle pi when all the assigned devices are removed Feng Wu
2016-11-18  3:19   ` Tian, Kevin
2016-11-18  4:27     ` Wu, Feng
2016-11-18  4:58       ` Tian, Kevin
2016-11-18  5:22         ` Wu, Feng
2016-11-18  9:23           ` Jan Beulich
2016-11-18  1:57 ` [PATCH v8 3/7] VMX: Make sure PI is in proper state before install the hooks Feng Wu
2016-11-18  4:11   ` Tian, Kevin
2016-11-18  4:27     ` Wu, Feng
2016-11-18  1:57 ` [PATCH v8 4/7] VT-d: Use one function to update both remapped and posted IRTE Feng Wu
2016-11-18  4:31   ` Tian, Kevin
2016-11-22  4:31     ` Wu, Feng
2016-11-22  9:58   ` Jan Beulich
2017-02-22  1:53     ` Chao Gao
2017-02-22  9:10       ` Jan Beulich
2017-02-22  7:12         ` Chao Gao
2017-02-22 14:54           ` Jan Beulich
2016-11-18  1:57 ` [PATCH v8 5/7] VT-d: No need to set irq affinity for posted format IRTE Feng Wu
2016-11-18  1:57 ` [PATCH v8 6/7] VT-d: Some cleanups Feng Wu
2016-11-18  4:45   ` Tian, Kevin
2016-11-18  1:57 ` [PATCH v8 7/7] VMX: Fixup PI descriptor when cpu is offline Feng Wu
2016-11-18  4:55   ` Tian, Kevin
2016-11-18  3:02 ` [PATCH v8 0/7] VMX: Properly handle pi descriptor and per-cpu blocking list Tian, Kevin
2016-11-18  4:21   ` Wu, Feng
2017-02-21 21:30 ` Chao Gao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.