All of lore.kernel.org
* [PATCH v9 0/8] VMX: Properly handle pi descriptor and per-cpu
@ 2017-02-27  1:45 Chao Gao
  2017-02-27  1:45 ` [PATCH v9 1/8] VMX: Permanently assign PI hook vmx_pi_switch_to() Chao Gao
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Chao Gao @ 2017-02-27  1:45 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Dario Faggioli, Jan Beulich, Chao Gao

The current VT-d PI related code may operate incorrectly in the
following scenarios:
1. When the last assigned device is detached from the domain, all
the PI related hooks are removed; however, the vCPU can still be
blocked, switched to another pCPU, etc., all without PI being
aware of it. The next time we attach a device to the domain,
which makes the PI related hooks available again, the state of
the PI descriptor is stale. Besides that, a blocked vCPU may
still remain on the per-cpu blocking list in this case. Patches
[1/8] and [3/8] handle this. Patch [2/8] rejects self-(de)assignment
to make the domain pause in [3/8] safe.

2. The current code does not mind the assignment order of the four
hook functions (i.e. vcpu_block, switch_to, switch_from, do_resume),
which could cause the last assertion in vmx_vcpu_block() to fail.
Patch [4/8] handles the ordering carefully to make sure that the
assertion will not be triggered again.

3. When an IRTE is in posted mode, we don't need to set the irq
affinity for it, since the destination of these interrupts is a
vCPU and the vCPU affinity is set during vCPU scheduling. Patch
[5/8] handles this. To make the logic clearer, it also extracts
the common part of pi_update_irte() and msi_msg_to_remap_entry(),
the two functions that update IRTEs.

4. Patch [6/8] is a cleanup patch.

5. When a pCPU is unplugged, there might be vCPUs on its blocking
list. Since the pCPU is offline, those vCPUs might not be woken
up again. Patch [7/8] addresses this.

6. The IRTE is updated through structure assignment or memcpy(),
which is unsafe. To resolve this, Patch [8/8] uses cmpxchg16b() if
supported, or two 64-bit write operations otherwise, to update the IRTE.
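As a rough illustration of the two update strategies (a minimal, self-contained sketch; the `toy_*` names and the field layout are invented, and the cmpxchg16b stand-in here is not actually atomic):

```c
#include <assert.h>
#include <stdint.h>

/* Toy 128-bit IRTE: two 64-bit halves. Field layout is illustrative only. */
typedef struct toy_irte {
    uint64_t lo, hi;
} toy_irte_t;

/*
 * Non-atomic stand-in for the real cmpxchg16b instruction, kept only so
 * the flow below is runnable; the actual patch issues the atomic op.
 */
static int toy_cmpxchg16b(toy_irte_t *ptr, const toy_irte_t *oldv,
                          const toy_irte_t *newv)
{
    if ( ptr->lo == oldv->lo && ptr->hi == oldv->hi )
    {
        *ptr = *newv;
        return 1;
    }
    return 0;
}

/*
 * Sketch of the idea in patch [8/8]: update the live IRTE with one
 * 128-bit compare-exchange when the CPU supports it, otherwise fall
 * back to two 64-bit writes.
 */
static void toy_update_irte(toy_irte_t *live, const toy_irte_t *new_ire,
                            int have_cx16)
{
    if ( have_cx16 )
    {
        toy_irte_t old = *live;
        toy_cmpxchg16b(live, &old, new_ire);
    }
    else
    {
        /* Two 64-bit writes; hardware may observe the intermediate state. */
        live->hi = new_ire->hi;
        live->lo = new_ire->lo;
    }
}
```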

Note that I just took over Feng's work; this v9 series is based
on Feng's v8 series
(https://lists.xenproject.org/archives/html/xen-devel/2016-11/msg01472.html).
Most of these patches have been reviewed or acked by maintainers. Patches
[2/8], [5/8] and [8/8] were added by me to address the remaining issues.

Chao Gao (3):
  xen/passthrough: Reject self-(de)assignment of devices
  VT-d: Introduce a new function update_irte_for_msi_common
  VT-d: Add copy_irte_{to,from}_irt for updating irte

Feng Wu (5):
  VMX: Permanently assign PI hook vmx_pi_switch_to()
  VMX: Properly handle pi when all the assigned devices are removed
  VMX: Make sure PI is in proper state before install the hooks
  VT-d: Some cleanups
  VMX: Fixup PI descriptor when cpu is offline

 xen/arch/x86/hvm/vmx/vmcs.c            |  14 +-
 xen/arch/x86/hvm/vmx/vmx.c             | 134 +++++++++++++++++-
 xen/drivers/passthrough/pci.c          |  14 ++
 xen/drivers/passthrough/vtd/intremap.c | 252 +++++++++++++++++++--------------
 xen/include/asm-x86/hvm/vmx/vmx.h      |   3 +
 5 files changed, 297 insertions(+), 120 deletions(-)

-- 
1.8.3.1


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


* [PATCH v9 1/8] VMX: Permanently assign PI hook vmx_pi_switch_to()
  2017-02-27  1:45 [PATCH v9 0/8] VMX: Properly handle pi descriptor and per-cpu Chao Gao
@ 2017-02-27  1:45 ` Chao Gao
  2017-02-28 16:43   ` Jan Beulich
  2017-02-27  1:45 ` [PATCH v9 2/8] xen/passthrough: Reject self-(de)assignment of devices Chao Gao
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Chao Gao @ 2017-02-27  1:45 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Feng Wu, Jun Nakajima, George Dunlap, Andrew Cooper,
	Dario Faggioli, Jan Beulich, Chao Gao

From: Feng Wu <feng.wu@intel.com>

PI hook vmx_pi_switch_to() is needed even after any previously
assigned device is detached from the domain. The 'SN' bit is also
used to control the CPU side of PI: we change its state in
vmx_pi_switch_to() and vmx_pi_switch_from(), and evaluate it in
vmx_deliver_posted_intr() when trying to deliver an interrupt in
posted way via software. The problem is that if we deassign the
hooks while the vCPU is runnable in the runqueue with 'SN' set,
all future notification events will be suppressed. This patch
makes the hook permanently assigned.
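The failure mode above can be sketched with a toy model (all names are invented; this is not the Xen code, just the suppression logic in miniature):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal model of the SN-bit situation described above. */
struct toy_pi_desc { bool sn; int pending; };
struct toy_vcpu {
    struct toy_pi_desc pi;
    void (*switch_to)(struct toy_vcpu *);   /* stand-in for pi_ops.switch_to */
};

/* The hook clears SN when the vCPU is scheduled in. */
static void toy_pi_switch_to(struct toy_vcpu *v) { v->pi.sn = false; }

/* Posted delivery is suppressed while SN is set. */
static bool toy_deliver_posted_intr(struct toy_vcpu *v)
{
    if ( v->pi.sn )
        return false;          /* notification suppressed */
    v->pi.pending++;
    return true;
}

/* Scheduler runs the vCPU: the hook, if still assigned, clears SN. */
static void toy_run(struct toy_vcpu *v)
{
    if ( v->switch_to )
        v->switch_to(v);
}
```

With the hook zapped, a vCPU left with SN set never gets notifications again; keeping the hook assigned clears SN on the next context switch.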

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v9:
- Comments changes [per Kevin's comments]

v8: 
- Comments changes 

v7: 
- comments changes. 

v6: 
- Adjust the comments and wording. 

v5: 
- Zap "pi_switch_from" hook 

v4: 
- Don't zap vmx_pi_switch_from() and vmx_pi_switch_to() when 
any previously assigned device is detached from the domain. 
- Comments changes. 

 xen/arch/x86/hvm/vmx/vmx.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index d3d98da..5fa16dd 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -260,9 +260,15 @@ void vmx_pi_hooks_deassign(struct domain *d)
 
     ASSERT(d->arch.hvm_domain.pi_ops.vcpu_block);
 
+    /*
+     * Note that we don't set 'd->arch.hvm_domain.pi_ops.switch_to' to NULL
+     * here. If we deassign the hooks while the vCPU is runnable in the
+     * runqueue with 'SN' set, all future notification events will be
+     * suppressed. Preserving the 'switch_to' hook function makes sure
+     * the 'SN' field is cleared every time the vCPU is running.
+     */
     d->arch.hvm_domain.pi_ops.vcpu_block = NULL;
     d->arch.hvm_domain.pi_ops.switch_from = NULL;
-    d->arch.hvm_domain.pi_ops.switch_to = NULL;
     d->arch.hvm_domain.pi_ops.do_resume = NULL;
 }
 
-- 
1.8.3.1




* [PATCH v9 2/8] xen/passthrough: Reject self-(de)assignment of devices
  2017-02-27  1:45 [PATCH v9 0/8] VMX: Properly handle pi descriptor and per-cpu Chao Gao
  2017-02-27  1:45 ` [PATCH v9 1/8] VMX: Permanently assign PI hook vmx_pi_switch_to() Chao Gao
@ 2017-02-27  1:45 ` Chao Gao
  2017-02-28 16:46   ` Jan Beulich
  2017-02-27  1:45 ` [PATCH v9 3/8] VMX: Properly handle pi when all the assigned devices are removed Chao Gao
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Chao Gao @ 2017-02-27  1:45 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Dario Faggioli, Jan Beulich, Chao Gao

In other words, don't allow a domain to assign a device to itself or
to detach a device from itself.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
v9:
- Newly added

 xen/drivers/passthrough/pci.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 338d6b4..a80d59e 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1603,6 +1603,13 @@ int iommu_do_pci_domctl(
         break;
 
     case XEN_DOMCTL_assign_device:
+        /* Don't support self-(de)assignment of devices */
+        if ( d == current->domain )
+        {
+            ret = -EINVAL;
+            break;
+        }
+
         ret = -ENODEV;
         if ( domctl->u.assign_device.dev != XEN_DOMCTL_DEV_PCI )
             break;
@@ -1643,6 +1650,13 @@ int iommu_do_pci_domctl(
         break;
 
     case XEN_DOMCTL_deassign_device:
+        /* Don't support self-(de)assignment of devices */
+        if ( d == current->domain )
+        {
+            ret = -EINVAL;
+            break;
+        }
+
         ret = -ENODEV;
         if ( domctl->u.assign_device.dev != XEN_DOMCTL_DEV_PCI )
             break;
-- 
1.8.3.1




* [PATCH v9 3/8] VMX: Properly handle pi when all the assigned devices are removed
  2017-02-27  1:45 [PATCH v9 0/8] VMX: Properly handle pi descriptor and per-cpu Chao Gao
  2017-02-27  1:45 ` [PATCH v9 1/8] VMX: Permanently assign PI hook vmx_pi_switch_to() Chao Gao
  2017-02-27  1:45 ` [PATCH v9 2/8] xen/passthrough: Reject self-(de)assignment of devices Chao Gao
@ 2017-02-27  1:45 ` Chao Gao
  2017-03-03 11:51   ` Tian, Kevin
  2017-02-27  1:45 ` [PATCH v9 4/8] VMX: Make sure PI is in proper state before install the hooks Chao Gao
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Chao Gao @ 2017-02-27  1:45 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Feng Wu, Jun Nakajima, George Dunlap, Andrew Cooper,
	Dario Faggioli, Jan Beulich, Chao Gao

From: Feng Wu <feng.wu@intel.com>

This patch handles some corner cases when the last assigned device
is removed from the domain. In this case we should carefully handle
the PI descriptors and the per-cpu blocking lists, to make sure:
- all the PI descriptors are in the right state the next time a
device is assigned to the domain again.
- no vCPU of the domain remains on a per-cpu blocking list.

Here we call vmx_pi_unblock_vcpu() to remove the vCPU from the blocking
list if it is on the list. However, vmx_vcpu_block() could be running at
the same time, in which case we might incorrectly add the vCPU to the
blocking list while the last device is being detached from the domain.
Since this situation can only occur when detaching the last device from
the domain, which is not a frequent operation, we use domain_pause()
before that, which is considered a clean and maintainable solution.
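A minimal sketch of the ordering this patch establishes (toy types and names, not the real Xen API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the deassign path described above; all names invented. */
struct toy_vcpu { bool on_blocking_list; };
struct toy_domain {
    bool paused;
    void *vcpu_block_hook;
    struct toy_vcpu vcpu[2];
};

static void toy_unblock(struct toy_vcpu *v) { v->on_blocking_list = false; }

static void toy_pi_hooks_deassign(struct toy_domain *d)
{
    d->paused = true;              /* domain_pause(): no vCPU is running */
    d->vcpu_block_hook = NULL;     /* safe: vCPUs can't race with us now */
    for ( int i = 0; i < 2; i++ )
        toy_unblock(&d->vcpu[i]);  /* drain the per-cpu blocking lists */
    d->paused = false;             /* domain_unpause() */
}
```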

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v9:
- Based on [v8 2/7]. Add an assertion before domain pause.

v7: 
- Prevent the domain from pausing itself. 

v6: 
- Comments changes 
- Rename vmx_pi_list_remove() to vmx_pi_unblock_vcpu() 

v5: 
- Remove a no-op wrapper 

v4: 
- Rename some functions: 
vmx_pi_remove_vcpu_from_blocking_list() -> vmx_pi_list_remove() 
vmx_pi_blocking_cleanup() -> vmx_pi_list_cleanup() 
- Remove the check in vmx_pi_list_cleanup() 
- Comments adjustment 

 xen/arch/x86/hvm/vmx/vmx.c | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 5fa16dd..a7a70e7 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -155,14 +155,12 @@ static void vmx_pi_switch_to(struct vcpu *v)
     pi_clear_sn(pi_desc);
 }
 
-static void vmx_pi_do_resume(struct vcpu *v)
+static void vmx_pi_unblock_vcpu(struct vcpu *v)
 {
     unsigned long flags;
     spinlock_t *pi_blocking_list_lock;
     struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
 
-    ASSERT(!test_bit(_VPF_blocked, &v->pause_flags));
-
     /*
      * Set 'NV' field back to posted_intr_vector, so the
      * Posted-Interrupts can be delivered to the vCPU when
@@ -170,12 +168,12 @@ static void vmx_pi_do_resume(struct vcpu *v)
      */
     write_atomic(&pi_desc->nv, posted_intr_vector);
 
-    /* The vCPU is not on any blocking list. */
     pi_blocking_list_lock = v->arch.hvm_vmx.pi_blocking.lock;
 
     /* Prevent the compiler from eliminating the local variable.*/
     smp_rmb();
 
+    /* The vCPU is not on any blocking list. */
     if ( pi_blocking_list_lock == NULL )
         return;
 
@@ -195,6 +193,13 @@ static void vmx_pi_do_resume(struct vcpu *v)
     spin_unlock_irqrestore(pi_blocking_list_lock, flags);
 }
 
+static void vmx_pi_do_resume(struct vcpu *v)
+{
+    ASSERT(!test_bit(_VPF_blocked, &v->pause_flags));
+
+    vmx_pi_unblock_vcpu(v);
+}
+
 /*
  * To handle posted interrupts correctly, we need to set the following
  * state:
@@ -255,12 +260,23 @@ void vmx_pi_hooks_assign(struct domain *d)
 /* This function is called when pcidevs_lock is held */
 void vmx_pi_hooks_deassign(struct domain *d)
 {
+    struct vcpu *v;
+
     if ( !iommu_intpost || !has_hvm_container_domain(d) )
         return;
 
     ASSERT(d->arch.hvm_domain.pi_ops.vcpu_block);
 
     /*
+     * Pausing the domain makes sure the vCPUs are not
+     * running and hence not calling the hooks simultaneously
+     * while we deassign the PI hooks and remove the vCPUs
+     * from the blocking list.
+     */
+    ASSERT(current->domain != d);
+    domain_pause(d);
+
+    /*
      * Note that we don't set 'd->arch.hvm_domain.pi_ops.switch_to' to NULL
      * here. If we deassign the hooks while the vCPU is runnable in the
      * runqueue with 'SN' set, all future notification events will be
@@ -270,6 +286,11 @@ void vmx_pi_hooks_deassign(struct domain *d)
     d->arch.hvm_domain.pi_ops.vcpu_block = NULL;
     d->arch.hvm_domain.pi_ops.switch_from = NULL;
     d->arch.hvm_domain.pi_ops.do_resume = NULL;
+
+    for_each_vcpu ( d, v )
+        vmx_pi_unblock_vcpu(v);
+
+    domain_unpause(d);
 }
 
 static int vmx_domain_initialise(struct domain *d)
-- 
1.8.3.1




* [PATCH v9 4/8] VMX: Make sure PI is in proper state before install the hooks
  2017-02-27  1:45 [PATCH v9 0/8] VMX: Properly handle pi descriptor and per-cpu Chao Gao
                   ` (2 preceding siblings ...)
  2017-02-27  1:45 ` [PATCH v9 3/8] VMX: Properly handle pi when all the assigned devices are removed Chao Gao
@ 2017-02-27  1:45 ` Chao Gao
  2017-02-27  1:45 ` [PATCH v9 5/8] VT-d: Introduce a new function update_irte_for_msi_common Chao Gao
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Chao Gao @ 2017-02-27  1:45 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Feng Wu, Jun Nakajima, George Dunlap, Andrew Cooper,
	Dario Faggioli, Jan Beulich, Chao Gao

From: Feng Wu <feng.wu@intel.com>

We may hit the last ASSERT() in vmx_vcpu_block() in the current code,
since vmx_vcpu_block() may get called before vmx_pi_switch_to() has
been installed or executed. Here we use cmpxchg to update the NDST
field; this makes sure we only update NDST when vmx_pi_switch_to()
has not been called yet, so NDST is in a proper state when
vmx_vcpu_block() runs.
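The cmpxchg idea can be illustrated with a small stand-alone sketch (the `toy_*` helpers are invented, and the stand-in cmpxchg is not atomic; the real code uses an atomic operation):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_APIC_INVALID_DEST 0xffffffffu

/* Non-atomic stand-in for cmpxchg(); returns the value found at *p. */
static uint32_t toy_cmpxchg(uint32_t *p, uint32_t old, uint32_t new_val)
{
    uint32_t cur = *p;
    if ( cur == old )
        *p = new_val;
    return cur;
}

/*
 * Hook-assign path: NDST is initialised to the invalid marker, so the
 * compare-exchange only writes the destination if vmx_pi_switch_to()
 * hasn't already filled in a real one.
 */
static void toy_assign_ndst(uint32_t *ndst, uint32_t dest)
{
    (void)toy_cmpxchg(ndst, TOY_APIC_INVALID_DEST, dest);
}
```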

Suggested-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
---
v6: 
- Comments changes 
- Define macro 'APIC_INVALID_DEST' for '0xffffffff' 

v5: 
- Use 0xffffffff as the invalid value for NDST field. 

v4: 
- This patch is previously called "Pause/Unpause the domain before/after
assigning PI hooks" 
- Remove the pause/unpause method 
- Use cmpxchg to update NDST 

 xen/arch/x86/hvm/vmx/vmcs.c       | 13 +++++--------
 xen/arch/x86/hvm/vmx/vmx.c        | 27 ++++++++++++++++++++++++++-
 xen/include/asm-x86/hvm/vmx/vmx.h |  2 ++
 3 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 0b77dbc..7905d3e 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -956,16 +956,13 @@ void virtual_vmcs_vmwrite(const struct vcpu *v, u32 vmcs_encoding, u64 val)
  */
 static void pi_desc_init(struct vcpu *v)
 {
-    uint32_t dest;
-
     v->arch.hvm_vmx.pi_desc.nv = posted_intr_vector;
 
-    dest = cpu_physical_id(v->processor);
-
-    if ( x2apic_enabled )
-        v->arch.hvm_vmx.pi_desc.ndst = dest;
-    else
-        v->arch.hvm_vmx.pi_desc.ndst = MASK_INSR(dest, PI_xAPIC_NDST_MASK);
+    /*
+     * Mark NDST as invalid, then we can use this invalid value as a
+     * marker of whether NDST still needs updating in vmx_pi_hooks_assign().
+     */
+    v->arch.hvm_vmx.pi_desc.ndst = APIC_INVALID_DEST;
 }
 
 static int construct_vmcs(struct vcpu *v)
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index a7a70e7..e03786b 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -246,14 +246,39 @@ static void vmx_pi_do_resume(struct vcpu *v)
 /* This function is called when pcidevs_lock is held */
 void vmx_pi_hooks_assign(struct domain *d)
 {
+    struct vcpu *v;
+
     if ( !iommu_intpost || !has_hvm_container_domain(d) )
         return;
 
     ASSERT(!d->arch.hvm_domain.pi_ops.vcpu_block);
 
-    d->arch.hvm_domain.pi_ops.vcpu_block = vmx_vcpu_block;
+    /*
+     * We carefully handle the timing here:
+     * - Install the context switch hooks first
+     * - Then set the NDST field
+     * - Install the block and resume hooks in the end
+     *
+     * This makes sure the PI descriptor (especially the NDST field) is
+     * in a proper state when we call vmx_vcpu_block().
+     */
     d->arch.hvm_domain.pi_ops.switch_from = vmx_pi_switch_from;
     d->arch.hvm_domain.pi_ops.switch_to = vmx_pi_switch_to;
+
+    for_each_vcpu ( d, v )
+    {
+        unsigned int dest = cpu_physical_id(v->processor);
+        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+
+        /*
+         * We don't need to update NDST if vmx_pi_switch_to()
+         * has already got called.
+         */
+        (void)cmpxchg(&pi_desc->ndst, APIC_INVALID_DEST,
+                x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));
+    }
+
+    d->arch.hvm_domain.pi_ops.vcpu_block = vmx_vcpu_block;
     d->arch.hvm_domain.pi_ops.do_resume = vmx_pi_do_resume;
 }
 
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index f4183d9..00e6f0d 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -601,6 +601,8 @@ void vmx_pi_per_cpu_init(unsigned int cpu);
 void vmx_pi_hooks_assign(struct domain *d);
 void vmx_pi_hooks_deassign(struct domain *d);
 
+#define APIC_INVALID_DEST           0xffffffff
+
 /* EPT violation qualifications definitions */
 typedef union ept_qual {
     unsigned long raw;
-- 
1.8.3.1




* [PATCH v9 5/8] VT-d: Introduce a new function update_irte_for_msi_common
  2017-02-27  1:45 [PATCH v9 0/8] VMX: Properly handle pi descriptor and per-cpu Chao Gao
                   ` (3 preceding siblings ...)
  2017-02-27  1:45 ` [PATCH v9 4/8] VMX: Make sure PI is in proper state before install the hooks Chao Gao
@ 2017-02-27  1:45 ` Chao Gao
  2017-03-02  8:58   ` Jan Beulich
  2017-02-27  1:45 ` [PATCH v9 6/8] VT-d: Some cleanups Chao Gao
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Chao Gao @ 2017-02-27  1:45 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Feng Wu, Jun Nakajima, George Dunlap, Andrew Cooper,
	Dario Faggioli, Jan Beulich, Chao Gao

Both pi_update_irte() and msi_msg_to_remap_entry() update the content of
an IRTE; besides, the current msi_msg_to_remap_entry() is buggy when the
live IRTE is in posted format. This patch reworks the two functions,
moving their common part into a new function, to make them clearer.
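A sketch of the argument contract the new common function enforces (a toy check only; the error paths mirror the patch's -EINVAL returns, but the function itself is invented):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Posted format needs both @pi_desc and a non-zero @gvec; remapped
 * format needs neither, but then @msg must be supplied instead.
 */
static int toy_check_update_args(const void *pi_desc, unsigned char gvec,
                                 const void *msg)
{
    if ( (!pi_desc && gvec) || (pi_desc && !gvec) )
        return -EINVAL;            /* inconsistent posted-format args */

    if ( !pi_desc && !gvec && !msg )
        return -EINVAL;            /* remapped format requires @msg */

    return 0;
}
```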

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
---
v9:
- Newly added.

 xen/drivers/passthrough/vtd/intremap.c | 232 +++++++++++++++++++--------------
 1 file changed, 131 insertions(+), 101 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index bfd468b..4269cd4 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -420,7 +420,8 @@ void io_apic_write_remap_rte(
         __ioapic_write_entry(apic, ioapic_pin, 1, old_rte);
 }
 
-static void set_msi_source_id(struct pci_dev *pdev, struct iremap_entry *ire)
+static void set_msi_source_id(const struct pci_dev *pdev,
+                              struct iremap_entry *ire)
 {
     u16 seg;
     u8 bus, devfn, secbus;
@@ -547,16 +548,116 @@ static int remap_entry_to_msi_msg(
     return 0;
 }
 
+/*
+ * This function is a common interface for updating an IRTE in the MSI case.
+ *
+ * If @pi_desc != NULL and @gvec != 0, the IRTE will be updated to a posted
+ * format. In this case, @msg is ignored because constructing a posted format
+ * IRTE doesn't need any information about the msi address or msi data.
+ *
+ * If @pi_desc == NULL and @gvec == 0, the IRTE will be updated to a remapped
+ * format. In this case, @msg can't be NULL.
+ *
+ * Assume 'ir_ctrl->iremap_lock' has been acquired and the remap_index
+ * of msi_desc has a benign value.
+ */
+static int update_irte_for_msi_common(
+    struct iommu *iommu, const struct pci_dev *pdev,
+    const struct msi_desc *msi_desc, struct msi_msg *msg,
+    const struct pi_desc *pi_desc, const uint8_t gvec)
+{
+    struct iremap_entry *iremap_entry = NULL, *iremap_entries;
+    struct iremap_entry new_ire = {{0}};
+    unsigned int index = msi_desc->remap_index;
+    struct ir_ctrl *ir_ctrl = iommu_ir_ctrl(iommu);
+
+    ASSERT( ir_ctrl );
+    ASSERT( spin_is_locked(&ir_ctrl->iremap_lock) );
+    ASSERT( (index >= 0) && (index < IREMAP_ENTRY_NR) );
+
+    if ( (!pi_desc && gvec) || (pi_desc && !gvec) )
+        return -EINVAL;
+
+    if ( !pi_desc && !gvec && !msg )
+        return -EINVAL;
+
+    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
+                     iremap_entries, iremap_entry);
+
+    if ( !pi_desc )
+    {
+       /* Set interrupt remapping table entry */
+        new_ire.remap.dm = msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT;
+        new_ire.remap.tm = msg->data >> MSI_DATA_TRIGGER_SHIFT;
+        new_ire.remap.dlm = msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT;
+        /* Hardware requires RH = 1 for LPR delivery mode */
+        new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
+        new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
+                                MSI_DATA_VECTOR_MASK;
+        if ( x2apic_enabled )
+            new_ire.remap.dst = msg->dest32;
+        else
+            new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
+                                 & 0xff) << 8;
+        new_ire.remap.p = 1;
+    }
+    else
+    {
+        new_ire.post.im = 1;
+        new_ire.post.vector = gvec;
+        new_ire.post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
+        new_ire.post.pda_h = virt_to_maddr(pi_desc) >> 32;
+        new_ire.post.p = 1;
+    }
+
+    if ( pdev )
+        set_msi_source_id(pdev, &new_ire);
+    else
+        set_hpet_source_id(msi_desc->hpet_id, &new_ire);
+
+    if ( iremap_entry->val != new_ire.val )
+    {
+        if ( cpu_has_cx16 )
+        {
+            __uint128_t ret;
+            struct iremap_entry old_ire;
+
+            old_ire = *iremap_entry;
+            ret = cmpxchg16b(iremap_entry, &old_ire, &new_ire);
+
+            /*
+             * In the above, we use cmpxchg16b to atomically update the 128-bit
+             * IRTE, and the hardware cannot update the IRTE behind us, so
+             * the return value of cmpxchg16b should be the same as old_ire.
+             * This ASSERT validates it.
+             */
+            ASSERT(ret == old_ire.val);
+        }
+        else
+        {
+            iremap_entry->lo = new_ire.lo;
+            iremap_entry->hi = new_ire.hi;
+        }
+
+        iommu_flush_cache_entry(iremap_entry, sizeof(struct iremap_entry));
+        iommu_flush_iec_index(iommu, 0, index);
+    }
+
+    unmap_vtd_domain_page(iremap_entries);
+    return 0;
+}
+
 static int msi_msg_to_remap_entry(
-    struct iommu *iommu, struct pci_dev *pdev,
+    struct iommu *iommu, const struct pci_dev *pdev,
     struct msi_desc *msi_desc, struct msi_msg *msg)
 {
     struct iremap_entry *iremap_entry = NULL, *iremap_entries;
-    struct iremap_entry new_ire;
     struct msi_msg_remap_entry *remap_rte;
     unsigned int index, i, nr = 1;
     unsigned long flags;
     struct ir_ctrl *ir_ctrl = iommu_ir_ctrl(iommu);
+    void *pi_desc;
+    int gvec;
 
     if ( msi_desc->msi_attrib.type == PCI_CAP_ID_MSI )
         nr = msi_desc->msi.nvec;
@@ -592,38 +693,33 @@ static int msi_msg_to_remap_entry(
         return -EFAULT;
     }
 
+    /* Get the IRTE's binding with the guest from the live IRTE. */
     GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
                      iremap_entries, iremap_entry);
-
-    memcpy(&new_ire, iremap_entry, sizeof(struct iremap_entry));
-
-    /* Set interrupt remapping table entry */
-    new_ire.remap.fpd = 0;
-    new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
-    new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
-    new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
-    /* Hardware require RH = 1 for LPR delivery mode */
-    new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
-    new_ire.remap.avail = 0;
-    new_ire.remap.res_1 = 0;
-    new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
-                            MSI_DATA_VECTOR_MASK;
-    new_ire.remap.res_2 = 0;
-    if ( x2apic_enabled )
-        new_ire.remap.dst = msg->dest32;
+    if ( !iremap_entry->remap.im )
+    {
+        gvec = 0;
+        pi_desc = NULL;
+    }
     else
-        new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
-                             & 0xff) << 8;
+    {
+        gvec = iremap_entry->post.vector;
+        pi_desc = (void *)((((u64)iremap_entry->post.pda_h) << PDA_LOW_BIT )
+                           + iremap_entry->post.pda_l);
+    }
+    unmap_vtd_domain_page(iremap_entries);
 
-    if ( pdev )
-        set_msi_source_id(pdev, &new_ire);
-    else
-        set_hpet_source_id(msi_desc->hpet_id, &new_ire);
-    new_ire.remap.res_3 = 0;
-    new_ire.remap.res_4 = 0;
-    new_ire.remap.p = 1;    /* finally, set present bit */
+    /*
+     * Actually we could just skip the update when the IRTE is already in
+     * posted format: after an MSI gets bound to a guest interrupt, changes
+     * to the MSI message have no effect on the IRTE.
+     */
+    update_irte_for_msi_common(iommu, pdev, msi_desc, msg, pi_desc, gvec);
 
     /* now construct new MSI/MSI-X rte entry */
+    if ( msi_desc->msi_attrib.type == PCI_CAP_ID_MSI )
+        nr = msi_desc->msi.nvec;
+
     remap_rte = (struct msi_msg_remap_entry *)msg;
     remap_rte->address_lo.dontcare = 0;
     i = index;
@@ -637,11 +733,6 @@ static int msi_msg_to_remap_entry(
     remap_rte->address_hi = 0;
     remap_rte->data = index - i;
 
-    memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
-    iommu_flush_cache_entry(iremap_entry, sizeof(struct iremap_entry));
-    iommu_flush_iec_index(iommu, 0, index);
-
-    unmap_vtd_domain_page(iremap_entries);
     spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
     return 0;
 }
@@ -902,42 +993,6 @@ void iommu_disable_x2apic_IR(void)
         disable_qinval(drhd->iommu);
 }
 
-static void setup_posted_irte(
-    struct iremap_entry *new_ire, const struct iremap_entry *old_ire,
-    const struct pi_desc *pi_desc, const uint8_t gvec)
-{
-    memset(new_ire, 0, sizeof(*new_ire));
-
-    /*
-     * 'im' filed decides whether the irte is in posted format (with value 1)
-     * or remapped format (with value 0), if the old irte is in remapped format,
-     * we copy things from remapped part in 'struct iremap_entry', otherwise,
-     * we copy from posted part.
-     */
-    if ( !old_ire->remap.im )
-    {
-        new_ire->post.p = old_ire->remap.p;
-        new_ire->post.fpd = old_ire->remap.fpd;
-        new_ire->post.sid = old_ire->remap.sid;
-        new_ire->post.sq = old_ire->remap.sq;
-        new_ire->post.svt = old_ire->remap.svt;
-    }
-    else
-    {
-        new_ire->post.p = old_ire->post.p;
-        new_ire->post.fpd = old_ire->post.fpd;
-        new_ire->post.sid = old_ire->post.sid;
-        new_ire->post.sq = old_ire->post.sq;
-        new_ire->post.svt = old_ire->post.svt;
-        new_ire->post.urg = old_ire->post.urg;
-    }
-
-    new_ire->post.im = 1;
-    new_ire->post.vector = gvec;
-    new_ire->post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
-    new_ire->post.pda_h = virt_to_maddr(pi_desc) >> 32;
-}
-
 /*
  * This function is used to update the IRTE for posted-interrupt
  * when guest changes MSI/MSI-X information.
@@ -947,16 +1002,13 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
 {
     struct irq_desc *desc;
     const struct msi_desc *msi_desc;
-    int remap_index;
     int rc = 0;
     const struct pci_dev *pci_dev;
     const struct acpi_drhd_unit *drhd;
     struct iommu *iommu;
     struct ir_ctrl *ir_ctrl;
-    struct iremap_entry *iremap_entries = NULL, *p = NULL;
-    struct iremap_entry new_ire, old_ire;
+    unsigned long flags;
     const struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
-    __uint128_t ret;
 
     desc = pirq_spin_lock_irq_desc(pirq, NULL);
     if ( !desc )
@@ -976,8 +1028,6 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
         goto unlock_out;
     }
 
-    remap_index = msi_desc->remap_index;
-
     spin_unlock_irq(&desc->lock);
 
     ASSERT(pcidevs_locked());
@@ -996,31 +1046,11 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
     if ( !ir_ctrl )
         return -ENODEV;
 
-    spin_lock_irq(&ir_ctrl->iremap_lock);
-
-    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
-
-    old_ire = *p;
-
-    /* Setup/Update interrupt remapping table entry. */
-    setup_posted_irte(&new_ire, &old_ire, pi_desc, gvec);
-    ret = cmpxchg16b(p, &old_ire, &new_ire);
-
-    /*
-     * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
-     * and the hardware cannot update the IRTE behind us, so the return value
-     * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
-     */
-    ASSERT(ret == old_ire.val);
-
-    iommu_flush_cache_entry(p, sizeof(*p));
-    iommu_flush_iec_index(iommu, 0, remap_index);
-
-    unmap_vtd_domain_page(iremap_entries);
-
-    spin_unlock_irq(&ir_ctrl->iremap_lock);
-
-    return 0;
+    spin_lock_irqsave(&ir_ctrl->iremap_lock, flags);
+    rc = update_irte_for_msi_common(iommu, pci_dev, msi_desc, NULL, pi_desc,
+                                    gvec);
+    spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
+    return rc;
 
  unlock_out:
     spin_unlock_irq(&desc->lock);
-- 
1.8.3.1




* [PATCH v9 6/8] VT-d: Some cleanups
  2017-02-27  1:45 [PATCH v9 0/8] VMX: Properly handle pi descriptor and per-cpu Chao Gao
                   ` (4 preceding siblings ...)
  2017-02-27  1:45 ` [PATCH v9 5/8] VT-d: Introduce a new function update_irte_for_msi_common Chao Gao
@ 2017-02-27  1:45 ` Chao Gao
  2017-02-27  1:45 ` [PATCH v9 7/8] VMX: Fixup PI descriptor when cpu is offline Chao Gao
  2017-02-27  1:45 ` [PATCH v9 8/8] VT-d: Add copy_irte_{to, from}_irt for updating irte Chao Gao
  7 siblings, 0 replies; 22+ messages in thread
From: Chao Gao @ 2017-02-27  1:45 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Feng Wu, Jun Nakajima, George Dunlap, Andrew Cooper,
	Dario Faggioli, Jan Beulich, Chao Gao

From: Feng Wu <feng.wu@intel.com>

Use type-safe structure assignment instead of memcpy().
Use sizeof(*iremap_entry) instead of sizeof(struct iremap_entry).
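For illustration only (the struct below is a simplified stand-in, not Xen's
real struct iremap_entry layout), the pattern this cleanup applies is:

```c
#include <assert.h>

/* Simplified stand-in for VT-d's struct iremap_entry (not the real layout). */
struct iremap_entry {
    unsigned long lo, hi;
};

void copy_entry(struct iremap_entry *dst, const struct iremap_entry *src)
{
    /* Before: memcpy(dst, src, sizeof(struct iremap_entry));
     * After: type-safe assignment -- the compiler rejects mismatched
     * pointer types here, which memcpy() would silently accept. */
    *dst = *src;
}
```

Likewise, sizeof(*iremap_entry) is preferred over sizeof(struct iremap_entry)
because it stays correct even if the variable's type is later changed.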

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
---
v9:
- Delete several lines, as patch [5/8] has been reworked.

v7: 
- Remove a useless cleanup

v6: 
- More description about the patch

 xen/drivers/passthrough/vtd/intremap.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index 4269cd4..737b886 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -183,8 +183,8 @@ static void free_remap_entry(struct iommu *iommu, int index)
     GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
                      iremap_entries, iremap_entry);
 
-    memset(iremap_entry, 0, sizeof(struct iremap_entry));
-    iommu_flush_cache_entry(iremap_entry, sizeof(struct iremap_entry));
+    memset(iremap_entry, 0, sizeof(*iremap_entry));
+    iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry));
     iommu_flush_iec_index(iommu, 0, index);
 
     unmap_vtd_domain_page(iremap_entries);
@@ -310,7 +310,7 @@ static int ioapic_rte_to_remap_entry(struct iommu *iommu,
     GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
                      iremap_entries, iremap_entry);
 
-    memcpy(&new_ire, iremap_entry, sizeof(struct iremap_entry));
+    new_ire = *iremap_entry;
 
     if ( rte_upper )
     {
@@ -353,8 +353,8 @@ static int ioapic_rte_to_remap_entry(struct iommu *iommu,
         remap_rte->format = 1;    /* indicate remap format */
     }
 
-    memcpy(iremap_entry, &new_ire, sizeof(struct iremap_entry));
-    iommu_flush_cache_entry(iremap_entry, sizeof(struct iremap_entry));
+    *iremap_entry = new_ire;
+    iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry));
     iommu_flush_iec_index(iommu, 0, index);
 
     unmap_vtd_domain_page(iremap_entries);
@@ -639,7 +639,7 @@ static int update_irte_for_msi_common(
             iremap_entry->hi = new_ire.hi;
         }
 
-        iommu_flush_cache_entry(iremap_entry, sizeof(struct iremap_entry));
+        iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry));
         iommu_flush_iec_index(iommu, 0, index);
     }
 
-- 
1.8.3.1



* [PATCH v9 7/8] VMX: Fixup PI descriptor when cpu is offline
  2017-02-27  1:45 [PATCH v9 0/8] VMX: Properly handle pi descriptor and per-cpu Chao Gao
                   ` (5 preceding siblings ...)
  2017-02-27  1:45 ` [PATCH v9 6/8] VT-d: Some cleanups Chao Gao
@ 2017-02-27  1:45 ` Chao Gao
  2017-02-27  1:45 ` [PATCH v9 8/8] VT-d: Add copy_irte_{to, from}_irt for updating irte Chao Gao
  7 siblings, 0 replies; 22+ messages in thread
From: Chao Gao @ 2017-02-27  1:45 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Feng Wu, Jun Nakajima, George Dunlap, Andrew Cooper,
	Dario Faggioli, Jan Beulich, Chao Gao

From: Feng Wu <feng.wu@intel.com>

When a CPU goes offline, we need to move all the vCPUs on its blocking
list to another online CPU; this patch handles that.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
---
v7: 
- Pass unsigned int to vmx_pi_desc_fixup()

v6: 
- Carefully suppress 'SN' to avoid missing a notification event
while moving the vCPU to the new list

v5: 
- Add some comments to explain why the ABBA deadlock scenario does
not occur here.

v4: 
- Remove the pointless check since we are in machine stop
context and no other cpus go down in parallel.

 xen/arch/x86/hvm/vmx/vmcs.c       |  1 +
 xen/arch/x86/hvm/vmx/vmx.c        | 70 +++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/vmx/vmx.h |  1 +
 3 files changed, 72 insertions(+)

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 7905d3e..7e3e093 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -578,6 +578,7 @@ void vmx_cpu_dead(unsigned int cpu)
     vmx_free_vmcs(per_cpu(vmxon_region, cpu));
     per_cpu(vmxon_region, cpu) = 0;
     nvmx_cpu_dead(cpu);
+    vmx_pi_desc_fixup(cpu);
 }
 
 int vmx_cpu_up(void)
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index e03786b..b8a385b 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -200,6 +200,76 @@ static void vmx_pi_do_resume(struct vcpu *v)
     vmx_pi_unblock_vcpu(v);
 }
 
+void vmx_pi_desc_fixup(unsigned int cpu)
+{
+    unsigned int new_cpu, dest;
+    unsigned long flags;
+    struct arch_vmx_struct *vmx, *tmp;
+    spinlock_t *new_lock, *old_lock = &per_cpu(vmx_pi_blocking, cpu).lock;
+    struct list_head *blocked_vcpus = &per_cpu(vmx_pi_blocking, cpu).list;
+
+    if ( !iommu_intpost )
+        return;
+
+    /*
+     * We are in the context of CPU_DEAD or CPU_UP_CANCELED notification,
+     * and it is impossible for a second CPU go down in parallel. So we
+     * can safely acquire the old cpu's lock and then acquire the new_cpu's
+     * lock after that.
+     */
+    spin_lock_irqsave(old_lock, flags);
+
+    list_for_each_entry_safe(vmx, tmp, blocked_vcpus, pi_blocking.list)
+    {
+        /*
+         * Suppress notification or we may miss an interrupt when the
+         * target cpu is dying.
+         */
+        pi_set_sn(&vmx->pi_desc);
+
+        /*
+         * Check whether a notification is pending before doing the
+         * movement, if that is the case we need to wake up it directly
+         * other than moving it to the new cpu's list.
+         */
+        if ( pi_test_on(&vmx->pi_desc) )
+        {
+            list_del(&vmx->pi_blocking.list);
+            vmx->pi_blocking.lock = NULL;
+            vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
+        }
+        else
+        {
+            /*
+             * We need to find an online cpu as the NDST of the PI descriptor, it
+             * doesn't matter whether it is within the cpupool of the domain or
+             * not. As long as it is online, the vCPU will be woken up once the
+             * notification event arrives.
+             */
+            new_cpu = cpumask_any(&cpu_online_map);
+            new_lock = &per_cpu(vmx_pi_blocking, new_cpu).lock;
+
+            spin_lock(new_lock);
+
+            ASSERT(vmx->pi_blocking.lock == old_lock);
+
+            dest = cpu_physical_id(new_cpu);
+            write_atomic(&vmx->pi_desc.ndst,
+                         x2apic_enabled ? dest : MASK_INSR(dest, PI_xAPIC_NDST_MASK));
+
+            list_move(&vmx->pi_blocking.list,
+                      &per_cpu(vmx_pi_blocking, new_cpu).list);
+            vmx->pi_blocking.lock = new_lock;
+
+            spin_unlock(new_lock);
+        }
+
+        pi_clear_sn(&vmx->pi_desc);
+    }
+
+    spin_unlock_irqrestore(old_lock, flags);
+}
+
 /*
  * To handle posted interrupts correctly, we need to set the following
  * state:
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 00e6f0d..70155fb 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -597,6 +597,7 @@ void free_p2m_hap_data(struct p2m_domain *p2m);
 void p2m_init_hap_data(struct p2m_domain *p2m);
 
 void vmx_pi_per_cpu_init(unsigned int cpu);
+void vmx_pi_desc_fixup(unsigned int cpu);
 
 void vmx_pi_hooks_assign(struct domain *d);
 void vmx_pi_hooks_deassign(struct domain *d);
-- 
1.8.3.1



* [PATCH v9 8/8] VT-d: Add copy_irte_{to, from}_irt for updating irte
  2017-02-27  1:45 [PATCH v9 0/8] VMX: Properly handle pi descriptor and per-cpu Chao Gao
                   ` (6 preceding siblings ...)
  2017-02-27  1:45 ` [PATCH v9 7/8] VMX: Fixup PI descriptor when cpu is offline Chao Gao
@ 2017-02-27  1:45 ` Chao Gao
  2017-03-02  9:03   ` Jan Beulich
  2017-03-15 10:38   ` Tian, Kevin
  7 siblings, 2 replies; 22+ messages in thread
From: Chao Gao @ 2017-02-27  1:45 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Jun Nakajima, George Dunlap, Andrew Cooper,
	Dario Faggioli, Jan Beulich, Chao Gao

We used structure assignment to update the IRTE, which is not safe when
an interrupt occurs during the update. It is better to update the IRTE
atomically through cmpxchg16b(). When cmpxchg16b is not supported, two
64-bit write operations can still update the IRTE atomically, provided
only the high dword or only the low dword is intended to be changed.
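A minimal sketch of the two update strategies described above. The union
here is a hypothetical stand-in for struct iremap_entry, and the
compare-and-swap helper is a deliberately non-atomic illustration of what
the cmpxchg16b instruction does in hardware:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit IRTE stand-in (not Xen's real field layout). */
union irte {
    unsigned __int128 val;
    struct { uint64_t lo, hi; };
};

/* Non-atomic stand-in for cmpxchg16b(), for illustration only:
 * install *new_e when the entry still matches *old_e, return the
 * previous value. */
static unsigned __int128 cmpxchg128_sketch(union irte *p,
                                           const union irte *old_e,
                                           const union irte *new_e)
{
    unsigned __int128 prev = p->val;

    if ( prev == old_e->val )
        p->val = new_e->val;
    return prev;
}

/* Sketch of the copy_irte_to_irt() idea from the patch: replace the
 * whole 128-bit entry in one step when cx16 is available, otherwise
 * fall back to two 64-bit stores -- which is only safe when just one
 * half of the entry actually changes. */
static void irte_update(union irte *dst, const union irte *src, int have_cx16)
{
    if ( have_cx16 )
    {
        union irte old = *dst;

        cmpxchg128_sketch(dst, &old, src);
    }
    else
    {
        dst->lo = src->lo;   /* one 64-bit store */
        dst->hi = src->hi;   /* one 64-bit store */
    }
}
```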

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
v9:
- Newly added.

 xen/drivers/passthrough/vtd/intremap.c | 58 ++++++++++++++++++++--------------
 1 file changed, 34 insertions(+), 24 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index 737b886..649bebd 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -169,6 +169,37 @@ bool_t __init iommu_supports_eim(void)
     return 1;
 }
 
+static inline void copy_irte_from_irt(struct iremap_entry *dst,
+                                      struct iremap_entry *src)
+{
+    *dst = *src;
+}
+
+static void copy_irte_to_irt(struct iremap_entry *dst, struct iremap_entry *src)
+{
+    if ( cpu_has_cx16 )
+    {
+        __uint128_t ret;
+        struct iremap_entry old_ire;
+
+        copy_irte_from_irt(&old_ire, dst);
+        ret = cmpxchg16b(dst, &old_ire, src);
+
+        /*
+         * In the above, we use cmpxchg16 to atomically update the 128-bit
+         * IRTE, and the hardware cannot update the IRTE behind us, so
+         * the return value of cmpxchg16 should be the same as old_ire.
+         * This ASSERT validate it.
+         */
+        ASSERT(ret == old_ire.val);
+    }
+    else
+    {
+        dst->lo = src->lo;
+        dst->hi = src->hi;
+    }
+}
+
 /* Mark specified intr remap entry as free */
 static void free_remap_entry(struct iommu *iommu, int index)
 {
@@ -310,7 +341,7 @@ static int ioapic_rte_to_remap_entry(struct iommu *iommu,
     GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
                      iremap_entries, iremap_entry);
 
-    new_ire = *iremap_entry;
+    copy_irte_from_irt(&new_ire, iremap_entry);
 
     if ( rte_upper )
     {
@@ -353,7 +384,7 @@ static int ioapic_rte_to_remap_entry(struct iommu *iommu,
         remap_rte->format = 1;    /* indicate remap format */
     }
 
-    *iremap_entry = new_ire;
+    copy_irte_to_irt(iremap_entry, &new_ire);
     iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry));
     iommu_flush_iec_index(iommu, 0, index);
 
@@ -617,28 +648,7 @@ static int update_irte_for_msi_common(
 
     if ( iremap_entry->val != new_ire.val )
     {
-        if ( cpu_has_cx16 )
-        {
-            __uint128_t ret;
-            struct iremap_entry old_ire;
-
-            old_ire = *iremap_entry;
-            ret = cmpxchg16b(iremap_entry, &old_ire, &new_ire);
-
-            /*
-             * In the above, we use cmpxchg16 to atomically update the 128-bit
-             * IRTE, and the hardware cannot update the IRTE behind us, so
-             * the return value of cmpxchg16 should be the same as old_ire.
-             * This ASSERT validate it.
-             */
-            ASSERT(ret == old_ire.val);
-        }
-        else
-        {
-            iremap_entry->lo = new_ire.lo;
-            iremap_entry->hi = new_ire.hi;
-        }
-
+        copy_irte_to_irt(iremap_entry, &new_ire);
         iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry));
         iommu_flush_iec_index(iommu, 0, index);
     }
-- 
1.8.3.1



* Re: [PATCH v9 1/8] VMX: Permanently assign PI hook vmx_pi_switch_to()
  2017-02-27  1:45 ` [PATCH v9 1/8] VMX: Permanently assign PI hook vmx_pi_switch_to() Chao Gao
@ 2017-02-28 16:43   ` Jan Beulich
  2017-03-01  0:01     ` Chao Gao
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2017-02-28 16:43 UTC (permalink / raw)
  To: Chao Gao
  Cc: Kevin Tian, Feng Wu, George Dunlap, Andrew Cooper,
	Dario Faggioli, xen-devel, Jun Nakajima

>>> On 27.02.17 at 02:45, <chao.gao@intel.com> wrote:
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -260,9 +260,15 @@ void vmx_pi_hooks_deassign(struct domain *d)
>  
>      ASSERT(d->arch.hvm_domain.pi_ops.vcpu_block);
>  
> +    /*
> +     * Note that we don't set 'd->arch.hvm_domain.pi_ops.switch_to' to NULL
> +     * here. If we deassign the hooks while the vCPU is runnable in the
> +     * runqueue with 'SN' set, all the future notification event will be
> +     * suppressed. Preserving the 'switch_to' hook function can make sure
> +     * event time the vCPU is running the 'SN' field is cleared.
> +     */

Did we now lose the remark indicating that at least in theory
we could remove the hook after it had run one more time? It's
also not really becoming clear why SN continues to matter
after device removal, but perhaps that's just because of my
so far limited understanding of PI.

Also I think you mean "every time" on the last line, albeit that
(as per my remark above) is irrelevant - we need it to run just
once more.

Jan



* Re: [PATCH v9 2/8] xen/passthrough: Reject self-(de)assignment of devices
  2017-02-27  1:45 ` [PATCH v9 2/8] xen/passthrough: Reject self-(de)assignment of devices Chao Gao
@ 2017-02-28 16:46   ` Jan Beulich
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2017-02-28 16:46 UTC (permalink / raw)
  To: Chao Gao
  Cc: Kevin Tian, George Dunlap, Andrew Cooper, Dario Faggioli,
	xen-devel, Jun Nakajima

>>> On 27.02.17 at 02:45, <chao.gao@intel.com> wrote:
> That is to say, don't support a domain assigns a device to itself or detachs
> a device from itself.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>

Acked-by: Jan Beulich <jbeulich@suse.com>
albeit I would have wished for ...

> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1603,6 +1603,13 @@ int iommu_do_pci_domctl(
>          break;
>  
>      case XEN_DOMCTL_assign_device:
> +        /* Don't support self-(de)assignment of devices */

... this comment to only talk about assignment and ...

> +        if ( d == current->domain )
> +        {
> +            ret = -EINVAL;
> +            break;
> +        }
> +
>          ret = -ENODEV;
>          if ( domctl->u.assign_device.dev != XEN_DOMCTL_DEV_PCI )
>              break;
> @@ -1643,6 +1650,13 @@ int iommu_do_pci_domctl(
>          break;
>  
>      case XEN_DOMCTL_deassign_device:
> +        /* Don't support self-(de)assignment of devices */

... this one only about de-assignment. I may take the liberty to do
this edit if I end up committing this.

Jan



* Re: [PATCH v9 1/8] VMX: Permanently assign PI hook vmx_pi_switch_to()
  2017-02-28 16:43   ` Jan Beulich
@ 2017-03-01  0:01     ` Chao Gao
  2017-03-01  7:41       ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Chao Gao @ 2017-03-01  0:01 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, George Dunlap, Andrew Cooper, Dario Faggioli,
	xen-devel, Jun Nakajima

On Tue, Feb 28, 2017 at 09:43:09AM -0700, Jan Beulich wrote:
>>>> On 27.02.17 at 02:45, <chao.gao@intel.com> wrote:
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -260,9 +260,15 @@ void vmx_pi_hooks_deassign(struct domain *d)
>>  
>>      ASSERT(d->arch.hvm_domain.pi_ops.vcpu_block);
>>  
>> +    /*
>> +     * Note that we don't set 'd->arch.hvm_domain.pi_ops.switch_to' to NULL
>> +     * here. If we deassign the hooks while the vCPU is runnable in the
>> +     * runqueue with 'SN' set, all the future notification event will be
>> +     * suppressed. Preserving the 'switch_to' hook function can make sure
>> +     * event time the vCPU is running the 'SN' field is cleared.
>> +     */
>
>Did we now lose the remark indicating that at least in theory
>we could remove the hook after it had run one more time? It's

I also think it only needs to run one more time, because the hook
function that sets the 'SN' bit has already been removed.

>also not really becoming clear why SN continues to matter
>after device removal, but perhaps that's just because of my
>so far limited understanding of PI.
>
>Also I think you mean "every time" on the last line, albeit that
>(as per my remark above) is irrelevant - we need it to run just
>once more.

I would like to make the comment as clear as possible.
How about the comment below:

Note that we don't set 'd->arch.hvm_domain.pi_ops.switch_to' to NULL
here. If we deassign the hooks while the vCPU is runnable in the
runqueue with 'SN' set, all the future notification event will be
suppressed since vmx_deliver_posted_intr() also use 'SN' bit
as the suppression flag. Preserving the 'switch_to' hook function can
clear the 'SN' bit when the vCPU becomes running next time. After
that, No matter which status(runnable, running or block) the vCPU is in,
the 'SN' bit will keep clear for the 'switch_from' hook function that set
the 'SN' bit has been removed. At that time, the 'switch_to' hook function
is also useless. Considering the function doesn't do harm to the whole
system, leave it here until we find a clean solution to deassign the 
'switch_to' hook function.

Thanks,
Chao

>
>Jan
>


* Re: [PATCH v9 1/8] VMX: Permanently assign PI hook vmx_pi_switch_to()
  2017-03-01  0:01     ` Chao Gao
@ 2017-03-01  7:41       ` Jan Beulich
  2017-03-03  8:29         ` Tian, Kevin
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2017-03-01  7:41 UTC (permalink / raw)
  To: Chao Gao
  Cc: Kevin Tian, George Dunlap, Andrew Cooper, Dario Faggioli,
	xen-devel, Jun Nakajima

>>> On 01.03.17 at 01:01, <chao.gao@intel.com> wrote:
> On Tue, Feb 28, 2017 at 09:43:09AM -0700, Jan Beulich wrote:
>>>>> On 27.02.17 at 02:45, <chao.gao@intel.com> wrote:
>>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>>> @@ -260,9 +260,15 @@ void vmx_pi_hooks_deassign(struct domain *d)
>>>  
>>>      ASSERT(d->arch.hvm_domain.pi_ops.vcpu_block);
>>>  
>>> +    /*
>>> +     * Note that we don't set 'd->arch.hvm_domain.pi_ops.switch_to' to NULL
>>> +     * here. If we deassign the hooks while the vCPU is runnable in the
>>> +     * runqueue with 'SN' set, all the future notification event will be
>>> +     * suppressed. Preserving the 'switch_to' hook function can make sure
>>> +     * event time the vCPU is running the 'SN' field is cleared.
>>> +     */
>>
>>Did we now lose the remark indicating that at least in theory
>>we could remove the hook after it had run one more time? It's
> 
> I also think it only need to run one more time, because the hook
> function that set 'SN' bit is already removed.
> 
>>also not really becoming clear why SN continues to matter
>>after device removal, but perhaps that's just because of my
>>so far limited understanding of PI.
>>
>>Also I think you mean "every time" on the last line, albeit that
>>(as per my remark above) is irrelevant - we need it to run just
>>once more.
> 
> I would like to make the comment as clear as possible.
> How about the underlying comment,
> 
> Note that we don't set 'd->arch.hvm_domain.pi_ops.switch_to' to NULL
> here. If we deassign the hooks while the vCPU is runnable in the
> runqueue with 'SN' set, all the future notification event will be
> suppressed since vmx_deliver_posted_intr() also use 'SN' bit
> as the suppression flag. Preserving the 'switch_to' hook function can
> clear the 'SN' bit when the vCPU becomes running next time. After
> that, No matter which status(runnable, running or block) the vCPU is in,
> the 'SN' bit will keep clear for the 'switch_from' hook function that set
> the 'SN' bit has been removed. At that time, the 'switch_to' hook function
> is also useless. Considering the function doesn't do harm to the whole
> system, leave it here until we find a clean solution to deassign the 
> 'switch_to' hook function.

Sounds good to me (read: Reviewed-by: Jan Beulich <jbeulich@suse.com>).
If Kevin would give his ack, I could replace the comment while committing,
so you wouldn't need to re-send.

Jan



* Re: [PATCH v9 5/8] VT-d: Introduce a new function update_irte_for_msi_common
  2017-03-02  8:58   ` Jan Beulich
@ 2017-03-02  7:14     ` Chao Gao
  2017-03-02 14:32       ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Chao Gao @ 2017-03-02  7:14 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Feng Wu, George Dunlap, Andrew Cooper,
	Dario Faggioli, xen-devel, Jun Nakajima

On Thu, Mar 02, 2017 at 01:58:14AM -0700, Jan Beulich wrote:
>>>> On 27.02.17 at 02:45, <chao.gao@intel.com> wrote:
>> @@ -547,16 +548,116 @@ static int remap_entry_to_msi_msg(
>>      return 0;
>>  }
>>  
>> +/*
>> + * This function is a common interface to update irte for msi case.
>> + *
>> + * If @pi_desc != NULL and @gvec != 0, the IRTE will be updated to a posted
>> + * format. In this case, @msg is ignored because constructing a posted format
>> + * IRTE doesn't need any information about the msi address or msi data.
>> + *
>> + * If @pi_desc == NULL and @gvec == 0, the IRTE will be updated to a remapped
>> + * format. In this case, @msg can't be NULL.
>
>This kind of implies that in the other case msg can be NULL. Please
>make this explicit or remove the last sentence, to avoid confusing
>readers. Plus, if msg can be NULL in that case, why don't you pass
>NULL in that case?

I choose to make this explicit.

>
>> + * Assume 'ir_ctrl->iremap_lock' has been acquired and the remap_index
>> + * of msi_desc has a benign value.
>> + */
>> +static int update_irte_for_msi_common(
>> +    struct iommu *iommu, const struct pci_dev *pdev,
>> +    const struct msi_desc *msi_desc, struct msi_msg *msg,
>> +    const struct pi_desc *pi_desc, const uint8_t gvec)
>> +{
>> +    struct iremap_entry *iremap_entry = NULL, *iremap_entries;
>> +    struct iremap_entry new_ire = {{0}};
>
>I think just "{ }" will do.
>
>> +    unsigned int index = msi_desc->remap_index;
>> +    struct ir_ctrl *ir_ctrl = iommu_ir_ctrl(iommu);
>> +
>> +    ASSERT( ir_ctrl );
>> +    ASSERT( spin_is_locked(&ir_ctrl->iremap_lock) );
>> +    ASSERT( (index >= 0) && (index < IREMAP_ENTRY_NR) );
>
>Stray blanks inside parentheses.
>
>> +    if ( (!pi_desc && gvec) || (pi_desc && !gvec) )
>
>gvec == 0 alone is never a valid check: Either all vectors are valid,
>or a whole range (e.g. 0x00...0x0f) is invalid. Furthermore I think
>such checks are easier to read as either

How about using only whether pi_desc is NULL to decide the format of the IRTE?

>
>    if ( !pi_desc != !gvec )
>
>or
>
>    if ( pi_desc ? !gvec : gvec )
>
>> +        return -EINVAL;
>> +
>> +    if ( !pi_desc && !gvec && !msg )
>
>With the earlier check the first or second part could be omitted
>afaict.

Agree

>
>> +        return -EINVAL;
>> +
>> +    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
>> +                     iremap_entries, iremap_entry);
>> +
>> +    if ( !pi_desc )
>> +    {
>> +       /* Set interrupt remapping table entry */
>
>Again a request for consistency: Either have a respective comment
>also at the top of the else branch, or omit the one here too.
>
>> +        new_ire.remap.dm = msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT;
>> +        new_ire.remap.tm = msg->data >> MSI_DATA_TRIGGER_SHIFT;
>> +        new_ire.remap.dlm = msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT;
>> +        /* Hardware require RH = 1 for LPR delivery mode */
>> +        new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
>> +        new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
>> +                                MSI_DATA_VECTOR_MASK;
>> +        if ( x2apic_enabled )
>> +            new_ire.remap.dst = msg->dest32;
>> +        else
>> +            new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
>> +                                 & 0xff) << 8;
>
>Please strive to eliminate literal numbers here. At least the 0xff looks
>to be easy to deal with (using MASK_EXTR() together with
>MSI_ADDR_DEST_ID_MASK).
>
>> +        new_ire.remap.p = 1;
>> +    }
>> +    else
>> +    {
>> +        new_ire.post.im = 1;
>> +        new_ire.post.vector = gvec;
>> +        new_ire.post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
>> +        new_ire.post.pda_h = virt_to_maddr(pi_desc) >> 32;
>> +        new_ire.post.p = 1;
>> +    }
>> +
>> +    if ( pdev )
>> +        set_msi_source_id(pdev, &new_ire);
>> +    else
>> +        set_hpet_source_id(msi_desc->hpet_id, &new_ire);
>> +
>> +    if ( iremap_entry->val != new_ire.val )
>> +    {
>> +        if ( cpu_has_cx16 )
>> +        {
>> +            __uint128_t ret;
>> +            struct iremap_entry old_ire;
>> +
>> +            old_ire = *iremap_entry;
>> +            ret = cmpxchg16b(iremap_entry, &old_ire, &new_ire);
>> +
>> +            /*
>> +             * In the above, we use cmpxchg16 to atomically update the 128-bit
>> +             * IRTE, and the hardware cannot update the IRTE behind us, so
>> +             * the return value of cmpxchg16 should be the same as old_ire.
>> +             * This ASSERT validate it.
>> +             */
>> +            ASSERT(ret == old_ire.val);
>> +        }
>> +        else
>> +        {
>
>This wants a comment added explaining the conditions under which
>this is safe. Perhaps also one or more ASSERT()s to that effect.

Yes, I will add an explanation and ASSERT()s.

>
>> +            iremap_entry->lo = new_ire.lo;
>> +            iremap_entry->hi = new_ire.hi;
>> +        }
>> +
>> +        iommu_flush_cache_entry(iremap_entry, sizeof(struct iremap_entry));
>
>sizeof(*iremap_entry)
>
>> @@ -592,38 +693,33 @@ static int msi_msg_to_remap_entry(
>>          return -EFAULT;
>>      }
>>  
>> +    /* Get the IRTE's bind relationship with guest from the live IRTE. */
>>      GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
>>                       iremap_entries, iremap_entry);
>> -
>> -    memcpy(&new_ire, iremap_entry, sizeof(struct iremap_entry));
>> -
>> -    /* Set interrupt remapping table entry */
>> -    new_ire.remap.fpd = 0;
>> -    new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
>> -    new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
>> -    new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
>> -    /* Hardware require RH = 1 for LPR delivery mode */
>> -    new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
>> -    new_ire.remap.avail = 0;
>> -    new_ire.remap.res_1 = 0;
>> -    new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
>> -                            MSI_DATA_VECTOR_MASK;
>> -    new_ire.remap.res_2 = 0;
>> -    if ( x2apic_enabled )
>> -        new_ire.remap.dst = msg->dest32;
>> +    if ( !iremap_entry->remap.im )
>> +    {
>> +        gvec = 0;
>> +        pi_desc = NULL;
>> +    }
>>      else
>> -        new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
>> -                             & 0xff) << 8;
>> +    {
>> +        gvec = iremap_entry->post.vector;
>> +        pi_desc = (void *)((((u64)iremap_entry->post.pda_h) << PDA_LOW_BIT )
>> +                           + iremap_entry->post.pda_l);
>> +    }
>> +    unmap_vtd_domain_page(iremap_entries);
>
>I don't follow: Why does it matter what the entry currently holds?
>As I've pointed out more than once before (mainly to Feng), the
>goal ought to be to produce the new entry solely based on what
>the intended new state is, i.e. function input and global data.
>

I think the function introduced by this patch does produce the new
entry solely based on its input. Anyone who wants to produce a new
entry can call it directly.

I want to explain why we read the entry.

msi_msg_to_remap_entry() can be called either before or after an MSI is
bound to a guest interrupt. If we call the function without realizing the
MSI has been bound to a guest interrupt, the IRTE would be updated to
remapped format, breaking the binding (or at least breaking the intention
to use VT-d PI). I think this is a possible case in the current code. This
patch avoids that case, and provides a new function for callers that do
intend to replace a posted-format IRTE with a remapped-format one. Reading
the live entry is how we obtain the binding information used to update the
IRTE (as the code comment says, when the IRTE is already in posted format
we can suppress the update, since the content of the IRTE does not change
while the binding information is unchanged; if the binding information does
change, pi_update_irte() should be called instead).

At this moment, we don't see any existing caller of
msi_msg_to_remap_entry() that needs to update a posted IRTE to a remapped
IRTE. If the need emerges, we can expose the common function to those callers.

I also want to extend pi_update_irte() to replace a posted IRTE with a
remapped one when a guest misconfigures its MSI interrupt.

I hope I have made it a little clearer, and I am glad to hear further
suggestions to make the code and the description better for future readers.

>> -    if ( pdev )
>> -        set_msi_source_id(pdev, &new_ire);
>> -    else
>> -        set_hpet_source_id(msi_desc->hpet_id, &new_ire);
>> -    new_ire.remap.res_3 = 0;
>> -    new_ire.remap.res_4 = 0;
>> -    new_ire.remap.p = 1;    /* finally, set present bit */
>> +    /*
>> +     * Actually we can just suppress the update when IRTE is already in posted
>> +     * format. After a msi gets bound to a guest interrupt, changes to the msi
>> +     * message have no effect to the IRTE.
>> +     */
>> +    update_irte_for_msi_common(iommu, pdev, msi_desc, msg, pi_desc, gvec);
>>  
>>      /* now construct new MSI/MSI-X rte entry */
>> +    if ( msi_desc->msi_attrib.type == PCI_CAP_ID_MSI )
>> +        nr = msi_desc->msi.nvec;
>
>Why do you re-do here what was already done earlier in the function
>(code you didn't touch)?
>

Will remove this.

>> @@ -996,31 +1046,11 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
>>      if ( !ir_ctrl )
>>          return -ENODEV;
>>  
>> -    spin_lock_irq(&ir_ctrl->iremap_lock);
>> -
>> -    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
>> -
>> -    old_ire = *p;
>> -
>> -    /* Setup/Update interrupt remapping table entry. */
>> -    setup_posted_irte(&new_ire, &old_ire, pi_desc, gvec);
>> -    ret = cmpxchg16b(p, &old_ire, &new_ire);
>> -
>> -    /*
>> -     * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
>> -     * and the hardware cannot update the IRTE behind us, so the return value
>> -     * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
>> -     */
>> -    ASSERT(ret == old_ire.val);
>> -
>> -    iommu_flush_cache_entry(p, sizeof(*p));
>> -    iommu_flush_iec_index(iommu, 0, remap_index);
>> -
>> -    unmap_vtd_domain_page(iremap_entries);
>> -
>> -    spin_unlock_irq(&ir_ctrl->iremap_lock);
>> -
>> -    return 0;
>> +    spin_lock_irqsave(&ir_ctrl->iremap_lock, flags);
>> +    rc = update_irte_for_msi_common(iommu, pci_dev, msi_desc, NULL, pi_desc,
>> +                                    gvec);
>> +    spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
>> +    return rc;
>
>Considering the old code used spin_lock_irq() (and there's such left
>also earlier in the function), why do you use the irqsave function
>here?

Yes, it should be spin_lock_irq(). I saw the irqsave variant used in
msi_msg_to_remap_entry(), so I followed it without digging deeply into the
difference. spin_lock_irq() makes an assumption about the current context
(that interrupts are enabled on entry), so it's the better fit here.

Thanks,
Chao

>
>Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v9 5/8] VT-d: Introduce a new function update_irte_for_msi_common
  2017-02-27  1:45 ` [PATCH v9 5/8] VT-d: Introduce a new function update_irte_for_msi_common Chao Gao
@ 2017-03-02  8:58   ` Jan Beulich
  2017-03-02  7:14     ` Chao Gao
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2017-03-02  8:58 UTC (permalink / raw)
  To: Chao Gao
  Cc: Kevin Tian, Feng Wu, George Dunlap, Andrew Cooper,
	Dario Faggioli, xen-devel, Jun Nakajima

>>> On 27.02.17 at 02:45, <chao.gao@intel.com> wrote:
> @@ -547,16 +548,116 @@ static int remap_entry_to_msi_msg(
>      return 0;
>  }
>  
> +/*
> + * This function is a common interface to update irte for msi case.
> + *
> + * If @pi_desc != NULL and @gvec != 0, the IRTE will be updated to a posted
> + * format. In this case, @msg is ignored because constructing a posted format
> + * IRTE doesn't need any information about the msi address or msi data.
> + *
> + * If @pi_desc == NULL and @gvec == 0, the IRTE will be updated to a remapped
> + * format. In this case, @msg can't be NULL.

This kind of implies that in the other case msg can be NULL. Please
make this explicit or remove the last sentence, to avoid confusing
readers. Plus, if msg can be NULL in that case, why don't you pass
NULL in that case?

> + * Assume 'ir_ctrl->iremap_lock' has been acquired and the remap_index
> + * of msi_desc has a benign value.
> + */
> +static int update_irte_for_msi_common(
> +    struct iommu *iommu, const struct pci_dev *pdev,
> +    const struct msi_desc *msi_desc, struct msi_msg *msg,
> +    const struct pi_desc *pi_desc, const uint8_t gvec)
> +{
> +    struct iremap_entry *iremap_entry = NULL, *iremap_entries;
> +    struct iremap_entry new_ire = {{0}};

I think just "{ }" will do.

> +    unsigned int index = msi_desc->remap_index;
> +    struct ir_ctrl *ir_ctrl = iommu_ir_ctrl(iommu);
> +
> +    ASSERT( ir_ctrl );
> +    ASSERT( spin_is_locked(&ir_ctrl->iremap_lock) );
> +    ASSERT( (index >= 0) && (index < IREMAP_ENTRY_NR) );

Stray blanks inside parentheses.

> +    if ( (!pi_desc && gvec) || (pi_desc && !gvec) )

gvec == 0 alone is never a valid check: Either all vectors are valid,
or a whole range (e.g. 0x00...0x0f) is invalid. Furthermore I think
such checks are easier to read as either

    if ( !pi_desc != !gvec )

or

    if ( pi_desc ? !gvec : gvec )

> +        return -EINVAL;
> +
> +    if ( !pi_desc && !gvec && !msg )

With the earlier check the first or second part could be omitted
afaict.

> +        return -EINVAL;
> +
> +    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
> +                     iremap_entries, iremap_entry);
> +
> +    if ( !pi_desc )
> +    {
> +       /* Set interrupt remapping table entry */

Again a request for consistency: Either have a respective comment
also at the top of the else branch, or omit the one here too.

> +        new_ire.remap.dm = msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT;
> +        new_ire.remap.tm = msg->data >> MSI_DATA_TRIGGER_SHIFT;
> +        new_ire.remap.dlm = msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT;
> +        /* Hardware require RH = 1 for LPR delivery mode */
> +        new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
> +        new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
> +                                MSI_DATA_VECTOR_MASK;
> +        if ( x2apic_enabled )
> +            new_ire.remap.dst = msg->dest32;
> +        else
> +            new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
> +                                 & 0xff) << 8;

Please strive to eliminate literal numbers here. At least the 0xff looks
to be easy to deal with (using MASK_EXTR() together with
MSI_ADDR_DEST_ID_MASK).

> +        new_ire.remap.p = 1;
> +    }
> +    else
> +    {
> +        new_ire.post.im = 1;
> +        new_ire.post.vector = gvec;
> +        new_ire.post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
> +        new_ire.post.pda_h = virt_to_maddr(pi_desc) >> 32;
> +        new_ire.post.p = 1;
> +    }
> +
> +    if ( pdev )
> +        set_msi_source_id(pdev, &new_ire);
> +    else
> +        set_hpet_source_id(msi_desc->hpet_id, &new_ire);
> +
> +    if ( iremap_entry->val != new_ire.val )
> +    {
> +        if ( cpu_has_cx16 )
> +        {
> +            __uint128_t ret;
> +            struct iremap_entry old_ire;
> +
> +            old_ire = *iremap_entry;
> +            ret = cmpxchg16b(iremap_entry, &old_ire, &new_ire);
> +
> +            /*
> +             * In the above, we use cmpxchg16 to atomically update the 128-bit
> +             * IRTE, and the hardware cannot update the IRTE behind us, so
> +             * the return value of cmpxchg16 should be the same as old_ire.
> +             * This ASSERT validate it.
> +             */
> +            ASSERT(ret == old_ire.val);
> +        }
> +        else
> +        {

This wants a comment added explaining the conditions under which
this is safe. Perhaps also one or more ASSERT()s to that effect.

> +            iremap_entry->lo = new_ire.lo;
> +            iremap_entry->hi = new_ire.hi;
> +        }
> +
> +        iommu_flush_cache_entry(iremap_entry, sizeof(struct iremap_entry));

sizeof(*iremap_entry)

> @@ -592,38 +693,33 @@ static int msi_msg_to_remap_entry(
>          return -EFAULT;
>      }
>  
> +    /* Get the IRTE's bind relationship with guest from the live IRTE. */
>      GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
>                       iremap_entries, iremap_entry);
> -
> -    memcpy(&new_ire, iremap_entry, sizeof(struct iremap_entry));
> -
> -    /* Set interrupt remapping table entry */
> -    new_ire.remap.fpd = 0;
> -    new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
> -    new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
> -    new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
> -    /* Hardware require RH = 1 for LPR delivery mode */
> -    new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
> -    new_ire.remap.avail = 0;
> -    new_ire.remap.res_1 = 0;
> -    new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
> -                            MSI_DATA_VECTOR_MASK;
> -    new_ire.remap.res_2 = 0;
> -    if ( x2apic_enabled )
> -        new_ire.remap.dst = msg->dest32;
> +    if ( !iremap_entry->remap.im )
> +    {
> +        gvec = 0;
> +        pi_desc = NULL;
> +    }
>      else
> -        new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
> -                             & 0xff) << 8;
> +    {
> +        gvec = iremap_entry->post.vector;
> +        pi_desc = (void *)((((u64)iremap_entry->post.pda_h) << PDA_LOW_BIT )
> +                           + iremap_entry->post.pda_l);
> +    }
> +    unmap_vtd_domain_page(iremap_entries);

I don't follow: Why does it matter what the entry currently holds?
As I've pointed out more than once before (mainly to Feng), the
goal ought to be to produce the new entry solely based on what
the intended new state is, i.e. function input and global data.

> -    if ( pdev )
> -        set_msi_source_id(pdev, &new_ire);
> -    else
> -        set_hpet_source_id(msi_desc->hpet_id, &new_ire);
> -    new_ire.remap.res_3 = 0;
> -    new_ire.remap.res_4 = 0;
> -    new_ire.remap.p = 1;    /* finally, set present bit */
> +    /*
> +     * Actually we can just suppress the update when IRTE is already in posted
> +     * format. After a msi gets bound to a guest interrupt, changes to the msi
> +     * message have no effect to the IRTE.
> +     */
> +    update_irte_for_msi_common(iommu, pdev, msi_desc, msg, pi_desc, gvec);
>  
>      /* now construct new MSI/MSI-X rte entry */
> +    if ( msi_desc->msi_attrib.type == PCI_CAP_ID_MSI )
> +        nr = msi_desc->msi.nvec;

Why do you re-do here what was already done earlier in the function
(code you didn't touch)?

> @@ -996,31 +1046,11 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
>      if ( !ir_ctrl )
>          return -ENODEV;
>  
> -    spin_lock_irq(&ir_ctrl->iremap_lock);
> -
> -    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
> -
> -    old_ire = *p;
> -
> -    /* Setup/Update interrupt remapping table entry. */
> -    setup_posted_irte(&new_ire, &old_ire, pi_desc, gvec);
> -    ret = cmpxchg16b(p, &old_ire, &new_ire);
> -
> -    /*
> -     * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
> -     * and the hardware cannot update the IRTE behind us, so the return value
> -     * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
> -     */
> -    ASSERT(ret == old_ire.val);
> -
> -    iommu_flush_cache_entry(p, sizeof(*p));
> -    iommu_flush_iec_index(iommu, 0, remap_index);
> -
> -    unmap_vtd_domain_page(iremap_entries);
> -
> -    spin_unlock_irq(&ir_ctrl->iremap_lock);
> -
> -    return 0;
> +    spin_lock_irqsave(&ir_ctrl->iremap_lock, flags);
> +    rc = update_irte_for_msi_common(iommu, pci_dev, msi_desc, NULL, pi_desc,
> +                                    gvec);
> +    spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
> +    return rc;

Considering the old code used spin_lock_irq() (and there's such left
also earlier in the function), why do you use the irqsave function
here?

Jan



* Re: [PATCH v9 8/8] VT-d: Add copy_irte_{to, from}_irt for updating irte
  2017-02-27  1:45 ` [PATCH v9 8/8] VT-d: Add copy_irte_{to, from}_irt for updating irte Chao Gao
@ 2017-03-02  9:03   ` Jan Beulich
  2017-03-15 10:38   ` Tian, Kevin
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2017-03-02  9:03 UTC (permalink / raw)
  To: Chao Gao
  Cc: Kevin Tian, George Dunlap, Andrew Cooper, Dario Faggioli,
	xen-devel, Jun Nakajima

>>> On 27.02.17 at 02:45, <chao.gao@intel.com> wrote:
> We used structure assignment to update irte which was not safe when a
> interrupt happend during update. It is better to update IRTE atomically
> through cmpxchg16b(). When cmpxchg16b is not supported, two 64-bit write
> operation can atomically update IRTE when only one the high dword or
> low dword is intented to be changed.

s/dword/qword/ in the last sentence.

> --- a/xen/drivers/passthrough/vtd/intremap.c
> +++ b/xen/drivers/passthrough/vtd/intremap.c
> @@ -169,6 +169,37 @@ bool_t __init iommu_supports_eim(void)
>      return 1;
>  }
>  
> +static inline void copy_irte_from_irt(struct iremap_entry *dst,
> +                                      struct iremap_entry *src)
> +{
> +    *dst = *src;
> +}
> +
> +static void copy_irte_to_irt(struct iremap_entry *dst, struct iremap_entry *src)

For both functions the source side wants to be pointer to const.

> +{
> +    if ( cpu_has_cx16 )
> +    {
> +        __uint128_t ret;
> +        struct iremap_entry old_ire;
> +
> +        copy_irte_from_irt(&old_ire, dst);
> +        ret = cmpxchg16b(dst, &old_ire, src);
> +
> +        /*
> +         * In the above, we use cmpxchg16 to atomically update the 128-bit
> +         * IRTE, and the hardware cannot update the IRTE behind us, so
> +         * the return value of cmpxchg16 should be the same as old_ire.
> +         * This ASSERT validate it.
> +         */
> +        ASSERT(ret == old_ire.val);
> +    }
> +    else
> +    {
> +        dst->lo = src->lo;
> +        dst->hi = src->hi;

This doesn't exactly match the description: You nevertheless write
both halves, which makes the update non-atomic. But with the earlier
patch adjusted to document the conditions under which this is safe,
this may be fine here too. Just make the patch description match
reality then please.

Jan




* Re: [PATCH v9 5/8] VT-d: Introduce a new function update_irte_for_msi_common
  2017-03-02  7:14     ` Chao Gao
@ 2017-03-02 14:32       ` Jan Beulich
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2017-03-02 14:32 UTC (permalink / raw)
  To: Chao Gao
  Cc: Kevin Tian, Feng Wu, George Dunlap, Andrew Cooper,
	Dario Faggioli, xen-devel, Jun Nakajima

>>> On 02.03.17 at 08:14, <chao.gao@intel.com> wrote:
> On Thu, Mar 02, 2017 at 01:58:14AM -0700, Jan Beulich wrote:
>>>>> On 27.02.17 at 02:45, <chao.gao@intel.com> wrote:
>>> +    if ( (!pi_desc && gvec) || (pi_desc && !gvec) )
>>
>>gvec == 0 alone is never a valid check: Either all vectors are valid,
>>or a whole range (e.g. 0x00...0x0f) is invalid. Furthermore I think
>>such checks are easier to read as either
> 
> How about using only whether pi_desc is NULL to decide the format of the IRTE?

That makes sense, I think.

>>> @@ -592,38 +693,33 @@ static int msi_msg_to_remap_entry(
>>>          return -EFAULT;
>>>      }
>>>  
>>> +    /* Get the IRTE's bind relationship with guest from the live IRTE. */
>>>      GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
>>>                       iremap_entries, iremap_entry);
>>> -
>>> -    memcpy(&new_ire, iremap_entry, sizeof(struct iremap_entry));
>>> -
>>> -    /* Set interrupt remapping table entry */
>>> -    new_ire.remap.fpd = 0;
>>> -    new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
>>> -    new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
>>> -    new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
>>> -    /* Hardware require RH = 1 for LPR delivery mode */
>>> -    new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
>>> -    new_ire.remap.avail = 0;
>>> -    new_ire.remap.res_1 = 0;
>>> -    new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
>>> -                            MSI_DATA_VECTOR_MASK;
>>> -    new_ire.remap.res_2 = 0;
>>> -    if ( x2apic_enabled )
>>> -        new_ire.remap.dst = msg->dest32;
>>> +    if ( !iremap_entry->remap.im )
>>> +    {
>>> +        gvec = 0;
>>> +        pi_desc = NULL;
>>> +    }
>>>      else
>>> -        new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
>>> -                             & 0xff) << 8;
>>> +    {
>>> +        gvec = iremap_entry->post.vector;
>>> +        pi_desc = (void *)((((u64)iremap_entry->post.pda_h) << PDA_LOW_BIT )
>>> +                           + iremap_entry->post.pda_l);
>>> +    }
>>> +    unmap_vtd_domain_page(iremap_entries);
>>
>>I don't follow: Why does it matter what the entry currently holds?
>>As I've pointed out more than once before (mainly to Feng), the
>>goal ought to be to produce the new entry solely based on what
>>the intended new state is, i.e. function input and global data.
>>
> 
> I think the function introduced by this patch is to produce the new
> entry solely based on input.

Which contradicts your reading the table entry.

> If someone wants to produce a new entry,
> they can call it directly.
> 
> I want to explain why we read the entry.
> 
> msi_msg_to_remap_entry() can be called either before or after an MSI gets
> bound to a guest interrupt. If we called the function without realizing that
> the MSI had been bound to a guest interrupt, the IRTE would be updated to the
> remapped format, breaking the binding (or at least breaking the intention to
> use VT-d PI). I think this is a possible case in the current code. This patch
> avoids that case and provides a new function for callers who intend to
> replace a posted-format IRTE with a remapped-format IRTE. Reading the entry
> is how we obtain the binding information used to update the IRTE (as the
> comment in the code says, when the IRTE is in posted format we can suppress
> the update, since the content of the IRTE does not change while the binding
> information is unchanged; and if the binding information has changed, we
> should call pi_update_irte() instead).

Just to repeat what I've said a number of times before: Preferably
the new entry should be created from scratch, and all data needed
to do so should be passed in (or be derivable from passed in data,
e.g. by looking up global data structures _other than the IRT_).

> At this moment, we don't recognize any existing caller of
> msi_msg_to_remap_entry() needs to update a posted IRTE to a remapped IRTE.
> If the need emerges, we can expose the common function to the callers.

This restriction would go away with the function being fully self-contained,
as re-explained above. The second best option (if
the preferred one is overly complicated to implement) is to make
the function actively refuse anything it can't currently handle. I
don't think that's the case yet with this version of the patch.

>>> @@ -996,31 +1046,11 @@ int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
>>>      if ( !ir_ctrl )
>>>          return -ENODEV;
>>>  
>>> -    spin_lock_irq(&ir_ctrl->iremap_lock);
>>> -
>>> -    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
>>> -
>>> -    old_ire = *p;
>>> -
>>> -    /* Setup/Update interrupt remapping table entry. */
>>> -    setup_posted_irte(&new_ire, &old_ire, pi_desc, gvec);
>>> -    ret = cmpxchg16b(p, &old_ire, &new_ire);
>>> -
>>> -    /*
>>> -     * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
>>> -     * and the hardware cannot update the IRTE behind us, so the return value
>>> -     * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
>>> -     */
>>> -    ASSERT(ret == old_ire.val);
>>> -
>>> -    iommu_flush_cache_entry(p, sizeof(*p));
>>> -    iommu_flush_iec_index(iommu, 0, remap_index);
>>> -
>>> -    unmap_vtd_domain_page(iremap_entries);
>>> -
>>> -    spin_unlock_irq(&ir_ctrl->iremap_lock);
>>> -
>>> -    return 0;
>>> +    spin_lock_irqsave(&ir_ctrl->iremap_lock, flags);
>>> +    rc = update_irte_for_msi_common(iommu, pci_dev, msi_desc, NULL, pi_desc,
>>> +                                    gvec);
>>> +    spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
>>> +    return rc;
>>
>>Considering the old code used spin_lock_irq() (and there's such left
>>also earlier in the function), why do you use the irqsave function
>>here?
> 
> Yes, it should be spin_lock_irq(). I saw it used in msi_msg_to_remap_entry() so
> I followed it without digging deep into the difference. The spin_lock_irq()
> makes an assumption to the current context, so it's better.  

It's not necessarily better (or worse), but it's consistent with
neighboring code.

Jan



* Re: [PATCH v9 1/8] VMX: Permanently assign PI hook vmx_pi_switch_to()
  2017-03-01  7:41       ` Jan Beulich
@ 2017-03-03  8:29         ` Tian, Kevin
  2017-03-03 10:49           ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Tian, Kevin @ 2017-03-03  8:29 UTC (permalink / raw)
  To: Jan Beulich, Gao, Chao
  Cc: George Dunlap, Andrew Cooper, Dario Faggioli, xen-devel, Nakajima, Jun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, March 01, 2017 3:42 PM
> 
> >>> On 01.03.17 at 01:01, <chao.gao@intel.com> wrote:
> > On Tue, Feb 28, 2017 at 09:43:09AM -0700, Jan Beulich wrote:
> >>>>> On 27.02.17 at 02:45, <chao.gao@intel.com> wrote:
> >>> --- a/xen/arch/x86/hvm/vmx/vmx.c
> >>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> >>> @@ -260,9 +260,15 @@ void vmx_pi_hooks_deassign(struct domain *d)
> >>>
> >>>      ASSERT(d->arch.hvm_domain.pi_ops.vcpu_block);
> >>>
> >>> +    /*
> >>> +     * Note that we don't set 'd->arch.hvm_domain.pi_ops.switch_to' to NULL
> >>> +     * here. If we deassign the hooks while the vCPU is runnable in the
> >>> +     * runqueue with 'SN' set, all the future notification event will be
> >>> +     * suppressed. Preserving the 'switch_to' hook function can make sure
> >>> +     * event time the vCPU is running the 'SN' field is cleared.
> >>> +     */
> >>
> >>Did we now lose the remark indicating that at least in theory
> >>we could remove the hook after it had run one more time? It's
> >
> > I also think it only need to run one more time, because the hook
> > function that set 'SN' bit is already removed.
> >
> >>also not really becoming clear why SN continues to matter
> >>after device removal, but perhaps that's just because of my
> >>so far limited understanding of PI.
> >>
> >>Also I think you mean "every time" on the last line, albeit that
> >>(as per my remark above) is irrelevant - we need it to run just
> >>once more.
> >
> > I would like to make the comment as clear as possible.
> > How about the underlying comment,
> >
> > Note that we don't set 'd->arch.hvm_domain.pi_ops.switch_to' to NULL
> > here. If we deassigned the hooks while the vCPU is runnable in the
> > runqueue with 'SN' set, all future notification events would be
> > suppressed, since vmx_deliver_posted_intr() also uses the 'SN' bit
> > as the suppression flag. Preserving the 'switch_to' hook function
> > ensures the 'SN' bit is cleared the next time the vCPU becomes
> > running. After that, no matter which state (runnable, running or
> > blocked) the vCPU is in, the 'SN' bit will stay clear, because the
> > 'switch_from' hook function that sets the 'SN' bit has been removed.
> > At that point the 'switch_to' hook function is also useless.
> > Considering the function does no harm to the system, leave it in
> > place until we find a clean way to deassign the 'switch_to' hook
> > function.
> 
> Sounds good to me (read: Reviewed-by: Jan Beulich <jbeulich@suse.com>).
> If Kevin would give his ack, I could replace the comment while committing,
> so you wouldn't need to re-send.
> 
> Jan

Good to me too.

Acked-by: Kevin Tian <kevin.tian@intel.com>



* Re: [PATCH v9 1/8] VMX: Permanently assign PI hook vmx_pi_switch_to()
  2017-03-03  8:29         ` Tian, Kevin
@ 2017-03-03 10:49           ` Jan Beulich
  2017-03-03 11:54             ` Tian, Kevin
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2017-03-03 10:49 UTC (permalink / raw)
  To: Kevin Tian
  Cc: George Dunlap, Andrew Cooper, Dario Faggioli, xen-devel,
	Jun Nakajima, Chao Gao

>>> On 03.03.17 at 09:29, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, March 01, 2017 3:42 PM
>> Sounds good to me (read: Reviewed-by: Jan Beulich <jbeulich@suse.com>).
>> If Kevin would give his ack, I could replace the comment while committing,
>> so you wouldn't need to re-send.
> 
> Good to me too.
> 
> Acked-by: Kevin Tian <kevin.tian@intel.com>

How about patch 3 of this series then? Patch 4 already has your ack,
and patches 5 and onwards need another iteration afaict.

Jan




* Re: [PATCH v9 3/8] VMX: Properly handle pi when all the assigned devices are removed
  2017-02-27  1:45 ` [PATCH v9 3/8] VMX: Properly handle pi when all the assigned devices are removed Chao Gao
@ 2017-03-03 11:51   ` Tian, Kevin
  0 siblings, 0 replies; 22+ messages in thread
From: Tian, Kevin @ 2017-03-03 11:51 UTC (permalink / raw)
  To: Gao, Chao, xen-devel
  Cc: Feng Wu, Nakajima, Jun, George Dunlap, Andrew Cooper,
	Dario Faggioli, Jan Beulich

> From: Gao, Chao
> Sent: Monday, February 27, 2017 9:46 AM
> 
> From: Feng Wu <feng.wu@intel.com>
> 
> This patch handles some corner cases when the last assigned device
> is removed from the domain. In this case we should carefully handle
> the PI descriptor and the per-cpu blocking list, to make sure:
> - all the PI descriptors are in the right state the next time a
> device is assigned to the domain again.
> - no vCPUs of the domain remain on the per-cpu blocking list.
> 
> Here we call vmx_pi_unblock_vcpu() to remove the vCPU from the blocking list
> if it is on the list. However, this could race with an in-progress
> vmx_vcpu_block(), so we might incorrectly add the vCPU to the blocking list
> while the last device is detached from the domain. Considering that the
> situation can only occur when detaching the last device from the domain,
> and that this is not a frequent operation, we use domain_pause before it,
> which is considered a clean and maintainable solution for the situation.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Kevin Tian <kevin.tian@intel.com>



* Re: [PATCH v9 1/8] VMX: Permanently assign PI hook vmx_pi_switch_to()
  2017-03-03 10:49           ` Jan Beulich
@ 2017-03-03 11:54             ` Tian, Kevin
  0 siblings, 0 replies; 22+ messages in thread
From: Tian, Kevin @ 2017-03-03 11:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Andrew Cooper, Dario Faggioli, xen-devel,
	Nakajima, Jun, Gao, Chao

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, March 03, 2017 6:49 PM
> 
> >>> On 03.03.17 at 09:29, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Wednesday, March 01, 2017 3:42 PM
> >> Sounds good to me (read: Reviewed-by: Jan Beulich <jbeulich@suse.com>).
> >> If Kevin would give his ack, I could replace the comment while committing,
> >> so you wouldn't need to re-send.
> >
> > Good to me too.
> >
> > Acked-by: Kevin Tian <kevin.tian@intel.com>
> 
> How about patch 3 of this series then? Patch 4 already has your ack,
> and patches 5 and onwards need another iteration afaict.
> 

I thought I had, but it looks like I hadn't. Just gave my ack. For patch 5
onwards I'll look at the new iteration closely after my vacation.

Thanks
Kevin



* Re: [PATCH v9 8/8] VT-d: Add copy_irte_{to, from}_irt for updating irte
  2017-02-27  1:45 ` [PATCH v9 8/8] VT-d: Add copy_irte_{to, from}_irt for updating irte Chao Gao
  2017-03-02  9:03   ` Jan Beulich
@ 2017-03-15 10:38   ` Tian, Kevin
  1 sibling, 0 replies; 22+ messages in thread
From: Tian, Kevin @ 2017-03-15 10:38 UTC (permalink / raw)
  To: Gao, Chao, xen-devel
  Cc: Nakajima, Jun, George Dunlap, Andrew Cooper, Dario Faggioli, Jan Beulich

> From: Gao, Chao
> Sent: Monday, February 27, 2017 9:46 AM
> 
> We used structure assignment to update irte which was not safe when a

a -> an

> interrupt happend during update. It is better to update IRTE atomically

happend -> happened

> through cmpxchg16b(). When cmpxchg16b is not supported, two 64-bit write
> operation can atomically update IRTE when only one the high dword or low

remove 'one'

> dword is intented to be changed.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
> v9:
> - Newly added.
> 
>  xen/drivers/passthrough/vtd/intremap.c | 58 ++++++++++++++++++++------------
>  1 file changed, 34 insertions(+), 24 deletions(-)
> 
> diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
> index 737b886..649bebd 100644
> --- a/xen/drivers/passthrough/vtd/intremap.c
> +++ b/xen/drivers/passthrough/vtd/intremap.c
> @@ -169,6 +169,37 @@ bool_t __init iommu_supports_eim(void)
>      return 1;
>  }
> 
> +static inline void copy_irte_from_irt(struct iremap_entry *dst,
> +                                      struct iremap_entry *src)
> +{

It's not immediately clear between 'irte' and 'irt'...

> +    *dst = *src;
> +}
> +
> +static void copy_irte_to_irt(struct iremap_entry *dst, struct iremap_entry *src)
> +{
> +    if ( cpu_has_cx16 )
> +    {
> +        __uint128_t ret;
> +        struct iremap_entry old_ire;
> +
> +        copy_irte_from_irt(&old_ire, dst);
> +        ret = cmpxchg16b(dst, &old_ire, src);
> +
> +        /*
> +         * In the above, we use cmpxchg16 to atomically update the 128-bit
> +         * IRTE, and the hardware cannot update the IRTE behind us, so
> +         * the return value of cmpxchg16 should be the same as old_ire.
> +         * This ASSERT validate it.
> +         */
> +        ASSERT(ret == old_ire.val);
> +    }
> +    else
> +    {
> +        dst->lo = src->lo;
> +        dst->hi = src->hi;
> +    }
> +}
> +
>  /* Mark specified intr remap entry as free */
>  static void free_remap_entry(struct iommu *iommu, int index)
>  {
> @@ -310,7 +341,7 @@ static int ioapic_rte_to_remap_entry(struct iommu *iommu,
>      GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
>                       iremap_entries, iremap_entry);
> 
> -    new_ire = *iremap_entry;
> +    copy_irte_from_irt(&new_ire, iremap_entry);
> 
>      if ( rte_upper )
>      {
> @@ -353,7 +384,7 @@ static int ioapic_rte_to_remap_entry(struct iommu
> *iommu,
>          remap_rte->format = 1;    /* indicate remap format */
>      }
> 
> -    *iremap_entry = new_ire;
> +    copy_irte_to_irt(iremap_entry, &new_ire);
>      iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry));
>      iommu_flush_iec_index(iommu, 0, index);
> 
> @@ -617,28 +648,7 @@ static int update_irte_for_msi_common(
> 
>      if ( iremap_entry->val != new_ire.val )
>      {
> -        if ( cpu_has_cx16 )
> -        {
> -            __uint128_t ret;
> -            struct iremap_entry old_ire;
> -
> -            old_ire = *iremap_entry;
> -            ret = cmpxchg16b(iremap_entry, &old_ire, &new_ire);
> -
> -            /*
> -             * In the above, we use cmpxchg16 to atomically update the 128-bit
> -             * IRTE, and the hardware cannot update the IRTE behind us, so
> -             * the return value of cmpxchg16 should be the same as old_ire.
> -             * This ASSERT validate it.
> -             */
> -            ASSERT(ret == old_ire.val);
> -        }
> -        else
> -        {
> -            iremap_entry->lo = new_ire.lo;
> -            iremap_entry->hi = new_ire.hi;
> -        }
> -
> +        copy_irte_to_irt(iremap_entry, &new_ire);
>          iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry));
>          iommu_flush_iec_index(iommu, 0, index);
>      }
> --
> 1.8.3.1




end of thread, other threads:[~2017-03-15 10:38 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-27  1:45 [PATCH v9 0/8] VMX: Properly handle pi descriptor and per-cpu Chao Gao
2017-02-27  1:45 ` [PATCH v9 1/8] VMX: Permanently assign PI hook vmx_pi_switch_to() Chao Gao
2017-02-28 16:43   ` Jan Beulich
2017-03-01  0:01     ` Chao Gao
2017-03-01  7:41       ` Jan Beulich
2017-03-03  8:29         ` Tian, Kevin
2017-03-03 10:49           ` Jan Beulich
2017-03-03 11:54             ` Tian, Kevin
2017-02-27  1:45 ` [PATCH v9 2/8] xen/passthrough: Reject self-(de)assignment of devices Chao Gao
2017-02-28 16:46   ` Jan Beulich
2017-02-27  1:45 ` [PATCH v9 3/8] VMX: Properly handle pi when all the assigned devices are removed Chao Gao
2017-03-03 11:51   ` Tian, Kevin
2017-02-27  1:45 ` [PATCH v9 4/8] VMX: Make sure PI is in proper state before install the hooks Chao Gao
2017-02-27  1:45 ` [PATCH v9 5/8] VT-d: Introduce a new function update_irte_for_msi_common Chao Gao
2017-03-02  8:58   ` Jan Beulich
2017-03-02  7:14     ` Chao Gao
2017-03-02 14:32       ` Jan Beulich
2017-02-27  1:45 ` [PATCH v9 6/8] VT-d: Some cleanups Chao Gao
2017-02-27  1:45 ` [PATCH v9 7/8] VMX: Fixup PI descriptor when cpu is offline Chao Gao
2017-02-27  1:45 ` [PATCH v9 8/8] VT-d: Add copy_irte_{to, from}_irt for updating irte Chao Gao
2017-03-02  9:03   ` Jan Beulich
2017-03-15 10:38   ` Tian, Kevin
