All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC v2 00/15] Add VT-d Posted-Interrupts support
@ 2015-05-08  9:07 Feng Wu
  2015-05-08  9:07 ` [RFC v2 01/15] Vt-d Posted-intterrupt (PI) design Feng Wu
                   ` (16 more replies)
  0 siblings, 17 replies; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.

You can find the VT-d Posted-Interrtups Spec. in the following URL:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html

This patch set follow the following design:
http://article.gmane.org/gmane.comp.emulators.xen.devel/236476

v1 -> v2
1. Add the design doc.
2. Coding style fix.
3. Add some comments for struct pi_desc.
4. Extend 'struct iremap_entry' to a more common format.
5. Delete the atomic helper functions for pi descriptor manipulation.
6. Add the new command line in docs/misc/xen-command-line.markdown.
7. Use macros to replace some magic numbers.

One open in "[RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used"
How to update the IRTE for PI dynamically (in an atomic way)? I am trying
to use cmpxchg16b and it is in progress, I will add this in the next version.

Feng Wu (15):
  Vt-d Posted-intterrupt (PI) design
  iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
  vt-d: VT-d Posted-Interrupts feature detection
  vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  vmx: Initialize VT-d Posted-Interrupts Descriptor
  vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
  vt-d: Add API to update IRTE when VT-d PI is used
  Update IRTE according to guest interrupt config changes
  Add a new per-vCPU tasklet to wakeup the blocked vCPU
  vmx: Define two per-cpu variables
  vmx: Add a global wake-up vector for VT-d Posted-Interrupts
  vmx: Properly handle notification event when vCPU is running
  Update Posted-Interrupts Descriptor during vCPU scheduling
  Suppress posting interrupts when 'SN' is set
  Add a command line parameter for VT-d posted-interrupts

 docs/misc/vtd-pi.txt                   | 332 +++++++++++++++++++++++++++++++++
 docs/misc/xen-command-line.markdown    |   9 +-
 xen/arch/x86/hvm/vmx/vmcs.c            |   6 +
 xen/arch/x86/hvm/vmx/vmx.c             | 236 ++++++++++++++++++++++-
 xen/common/domain.c                    |  11 ++
 xen/common/schedule.c                  |   5 +
 xen/drivers/passthrough/io.c           |  99 +++++++++-
 xen/drivers/passthrough/iommu.c        |  17 +-
 xen/drivers/passthrough/vtd/intremap.c | 188 ++++++++++++++-----
 xen/drivers/passthrough/vtd/iommu.c    |  15 +-
 xen/drivers/passthrough/vtd/iommu.h    |  31 ++-
 xen/drivers/passthrough/vtd/utils.c    |  10 +-
 xen/include/asm-x86/hvm/hvm.h          |   1 +
 xen/include/asm-x86/hvm/vmx/vmcs.h     |  15 +-
 xen/include/asm-x86/hvm/vmx/vmx.h      |  27 ++-
 xen/include/asm-x86/iommu.h            |   2 +
 xen/include/xen/iommu.h                |   2 +-
 xen/include/xen/sched.h                |   7 +
 18 files changed, 945 insertions(+), 68 deletions(-)
 create mode 100644 docs/misc/vtd-pi.txt

-- 
2.1.0

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC v2 01/15] Vt-d Posted-intterrupt (PI) design
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
@ 2015-05-08  9:07 ` Feng Wu
  2015-05-08  9:07 ` [RFC v2 02/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature Feng Wu
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

Add the design doc for VT-d PI.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 docs/misc/vtd-pi.txt | 332 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 332 insertions(+)
 create mode 100644 docs/misc/vtd-pi.txt

diff --git a/docs/misc/vtd-pi.txt b/docs/misc/vtd-pi.txt
new file mode 100644
index 0000000..b1444c3
--- /dev/null
+++ b/docs/misc/vtd-pi.txt
@@ -0,0 +1,332 @@
+Authors: Feng Wu <feng.wu@intel.com>
+
+VT-d Posted-interrupt (PI) design for XEN
+
+Background
+==========
+With the development of virtualization, there are more and more device
+assignment requirements. However, today when a VM is running with
+assigned devices (such as, NIC), external interrupt handling for the assigned
+devices always needs VMM intervention.
+
+VT-d Posted-interrupt is a more enhanced method to handle interrupts
+in the virtualization environment. Interrupt posting is the process by
+which an interrupt request is recorded in a memory-resident
+posted-interrupt-descriptor structure by the root-complex, followed by
+an optional notification event issued to the CPU complex.
+
+With VT-d Posted-interrupt we can get the following advantages:
+- Direct delivery of external interrupts to running vCPUs without VMM
+intervention
+- Decrease the interrupt migration complexity. On vCPU migration, software
+can atomically co-migrate all interrupts targeting the migrating vCPU. For
+virtual machines with assigned devices, migrating a vCPU across pCPUs
+either incur the overhead of forwarding interrupts in software (e.g. via VMM
+generated IPIS), or complexity to independently migrate each interrupt targeting
+the vCPU to the new pCPU. However, after enabling VT-d PI, the destination vCPU
+of an external interrupt from assigned devices is stored in the IRTE (i.e.
+Posted-interrupt Descriptor Address), when vCPU is migrated to another pCPU,
+we will set this new pCPU in the 'NDST' filed of Posted-interrupt descriptor, this
+make the interrupt migration automatic.
+
+Here is what Xen currently does for external interrupts from assigned devices:
+
+When a VM is running and an external interrupt from an assigned device occurs
+for it. VM-EXIT happens, then:
+
+vmx_do_extint() --> do_IRQ() --> __do_IRQ_guest() --> hvm_do_IRQ_dpci() -->
+raise_softirq_for(pirq_dpci) --> raise_softirq(HVM_DPCI_SOFTIRQ)
+
+softirq HVM_DPCI_SOFTIRQ is bound to dpci_softirq()
+
+dpci_softirq() --> hvm_dirq_assist() --> vmsi_deliver_pirq() --> vmsi_deliver() -->
+vmsi_inj_irq() --> vlapic_set_irq()
+
+vlapic_set_irq() does the following things:
+1. If CPU-side posted-interrupt is supported, call vmx_deliver_posted_intr() to deliver
+the virtual interrupt via posted-interrupt infrastructure.
+2. Else if CPU-side posted-interrupt is not supported, set the related vIRR in vLAPIC
+page and call vcpu_kick() to kick the related vCPU. Before VM-Entry, vmx_intr_assist()
+will help to inject the interrupt to guests.
+
+However, after VT-d PI is supported, when a guest is running in non-root and an
+external interrupt from an assigned device occurs for it. No VM-Exit is needed,
+the guest can handle this totally in non-root mode, thus avoiding all the above
+code flow.
+
+Posted-interrupt Introduction
+========================
+There are two components to the Posted-interrupt architecture:
+Processor Support and Root-Complex Support
+
+- Processor Support
+Posted-interrupt processing is a feature by which a processor processes
+the virtual interrupts by recording them as pending on the virtual-APIC
+page.
+
+Posted-interrupt processing is enabled by setting the process posted
+interrupts VM-execution control. The processing is performed in response
+to the arrival of an interrupt with the posted-interrupt notification vector.
+In response to such an interrupt, the processor processes virtual interrupts
+recorded in a data structure called a posted-interrupt descriptor.
+
+More information about APICv and CPU-side Posted-interrupt, please refer
+to Chapter 29, and Section 29.6 in the Intel SDM:
+http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
+
+- Root-Complex Support
+Interrupt posting is the process by which an interrupt request (from IOAPIC
+or MSI/MSIx capable sources) is recorded in a memory-resident
+posted-interrupt-descriptor structure by the root-complex, followed by
+an optional notification event issued to the CPU complex. The interrupt
+request arriving at the root-complex carry the identity of the interrupt
+request source and a 'remapping-index'. The remapping-index is used to
+look-up an entry from the memory-resident interrupt-remap-table. Unlike
+with interrupt-remapping, the interrupt-remap-table-entry for a posted-
+interrupt, specifies a virtual-vector and a pointer to the posted-interrupt
+descriptor. The virtual-vector specifies the vector of the interrupt to be
+recorded in the posted-interrupt descriptor. The posted-interrupt descriptor
+hosts storage for the virtual-vectors and contains the attributes of the
+notification event (interrupt) to be issued to the CPU complex to inform
+CPU/software about pending interrupts recorded in the posted-interrupt
+descriptor.
+
+More information about VT-d PI, please refer to
+http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
+
+Important Definitions
+==================
+There are some changes to IRTE and posted-interrupt descriptor after
+VT-d PI is introduced:
+IRTE:
+Posted-interrupt Descriptor Address: the address of the posted-interrupt descriptor
+Virtual Vector: the guest vector of the interrupt
+URG: indicates if the interrupt is urgent
+
+Posted-interrupt descriptor:
+The Posted Interrupt Descriptor hosts the following fields:
+Posted Interrupt Request (PIR): Provide storage for posting (recording) interrupts (one bit
+per vector, for up to 256 vectors).
+
+Outstanding Notification (ON): Indicate if there is a notification event outstanding (not
+processed by processor or software) for this Posted Interrupt Descriptor. When this field is 0,
+hardware modifies it from 0 to 1 when generating a notification event, and the entity receiving
+the notification event (processor or software) resets it as part of posted interrupt processing.
+
+Suppress Notification (SN): Indicate if a notification event is to be suppressed (not
+generated) for non-urgent interrupt requests (interrupts processed through an IRTE with
+URG=0).
+
+Notification Vector (NV): Specify the vector for notification event (interrupt).
+
+Notification Destination (NDST): Specify the physical APIC-ID of the destination logical
+processor for the notification event.
+
+Design Overview
+==============
+In this design, we will cover the following items:
+1. Add a variable to control whether enable VT-d posted-interrupt or not.
+2. VT-d PI feature detection.
+3. Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
+4. Extend IRTE structure to support VT-d PI.
+5. Introduce a new global vector which is used for waking up the blocked vCPU.
+6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
+7. Update posted-interrupt descriptor during vCPU scheduling (when the state
+of the vCPU is transmitted among RUNSTATE_running / RUNSTATE_blocked/
+RUNSTATE_runnable / RUNSTATE_offline).
+8. How to wakeup blocked vCPU when an interrupt is posted for it (wakeup notification handler).
+9. New boot command line for Xen, which controls VT-d PI feature by user.
+10. Multicast/broadcast and lowest priority interrupts consideration.
+
+
+Implementation details
+===================
+- New variable to control VT-d PI
+
+Like variable 'iommu_intremap' for interrupt remapping, it is very straightforward
+to add a new one 'iommu_intpost' for posted-interrupt. 'iommu_intpost' is set
+only when interrupt remapping and VT-d posted-interrupt are both enabled.
+
+- VT-d PI feature detection.
+Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt support.
+
+- Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
+Here is the new structure for posted-interrupt descriptor:
+
+struct pi_desc {
+     DECLARE_BITMAP(pir, NR_VECTORS);
+     union {
+        struct
+        {
+        u64 on     : 1,
+            sn     : 1,
+            rsvd_1 : 13,
+            ndm    : 1,
+            nv     : 8,
+            rsvd_2 : 8,
+            ndst   : 32;
+        };
+        u64 control;
+    };
+    u32 rsvd[6];
+ } __attribute__ ((aligned (64)));
+
+- Extend IRTE structure to support VT-d PI.
+
+Here is the new structure for IRTE:
+/* interrupt remap entry */
+struct iremap_entry {
+  union {
+    struct { u64 lo, hi; };
+    struct {
+        u64 p       : 1,
+            fpd     : 1,
+            dm      : 1,
+            rh      : 1,
+            tm      : 1,
+            dlm     : 3,
+            avail   : 4,
+            res_1   : 4,
+            vector  : 8,
+            res_2   : 8,
+            dst     : 32;
+        u64 sid     : 16,
+            sq      : 2,
+            svt     : 2,
+            res_3   : 44;
+    }remap;
+    struct {
+        u64 p       : 1,
+            fpd     : 1,
+            res_1   : 6,
+            avail   : 4,
+            res_2   : 2,
+            urg     : 1,
+            im      : 1,
+            vector  : 8,
+            res_3   : 14,
+            pda_l   : 26;
+        u64 sid     : 16,
+            sq      : 2,
+            svt     : 2,
+            res_4   : 12,
+            pda_h   : 32;
+    }post;
+  };
+};
+
+- Introduce a new global vector which is used to wake up the blocked vCPU.
+
+Currently, there is a global vector 'posted_intr_vector', which is used as the
+global notification vector for all vCPUs in the system. This vector is stored in
+VMCS and CPU considers it as a _special_ vector, uses it to notify the related
+pCPU when an interrupt is recorded in the posted-interrupt descriptor.
+
+This existing global vector is a _special_ vector to CPU, CPU handle it in a
+_special_ way compared to normal vectors, please refer to 29.6 in Intel SDM
+http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
+for more information about how CPU handles it.
+
+After having VT-d PI, VT-d engine can issue notification event when the
+assigned devices issue interrupts. We need add a new global vector to
+wakeup the blocked vCPU, please refer to later section in this design for
+how to use this new global vector.
+
+- Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
+After VT-d PI is introduced, the format of IRTE is changed as follows:
+	Descriptor Address: the address of the posted-interrupt descriptor
+	Virtual Vector: the guest vector of the interrupt
+	URG: indicates if the interrupt is urgent
+	Other fields continue to have the same meaning
+
+'Descriptor Address' tells the destination vCPU of this interrupt, since
+each vCPU has a dedicated posted-interrupt descriptor.
+
+'Virtual Vector' tells the guest vector of the interrupt.
+
+When guest changes the configuration of the interrupts, such as, the
+cpu affinity, or the vector, we need to update the associated IRTE accordingly.
+
+- Update posted-interrupt descriptor during vCPU scheduling
+
+The basic idea here is:
+1. When vCPU's state is RUNSTATE_running,
+        - Set 'NV' to 'posted_intr_vector'.
+        - Clear 'SN' to accept posted-interrupts.
+        - Set 'NDST' to the pCPU on which the vCPU will be running.
+2. When vCPU's state is RUNSTATE_blocked,
+        - Set 'NV' to ' pi_wakeup_vector ', so we can wake up the
+          related vCPU when posted-interrupt happens for it.
+          Please refer to the above section about the new global vector.
+        - Clear 'SN' to accept posted-interrupts
+3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
+        - Set 'SN' to suppress non-urgent interrupts
+          (Current, we only support non-urgent interrupts)
+         When vCPU is in RUNSTATE_runnable or RUNSTATE_offline,
+         It is not needed to accept posted-interrupt notification event,
+         since we don't change the behavior of scheduler when the interrupt
+         occurs, we still need wait the next scheduling of the vCPU.
+         When external interrupts from assigned devices occur, the interrupts
+         are recorded in PIR, and will be synced to IRR before VM-Entry.
+        - Set 'NV' to 'posted_intr_vector'.
+
+- How to wakeup blocked vCPU when an interrupt is posted for it (wakeup notification handler).
+
+Here is the scenario for the usage of the new global vector:
+
+1. vCPU0 is running on pCPU0
+2. vCPU0 is blocked and vCPU1 is currently running on pCPU0
+3. An external interrupt from an assigned device occurs for vCPU0, if we
+still use 'posted_intr_vector' as the notification vector for vCPU0, the
+notification event for vCPU0 (the event will go to pCPU1) will be consumed
+by vCPU1 incorrectly (remember this is a special vector to CPU). The worst
+case is that vCPU0 will never be woken up again since the wakeup event
+for it is always consumed by other vCPUs incorrectly. So we need introduce
+another global vector, naming 'pi_wakeup_vector' to wake up the blocked vCPU.
+
+After using 'pi_wakeup_vector' for vCPU0, VT-d engine will issue notification
+event using this new vector. Since this new vector is not a SPECIAL one to CPU,
+it is just a normal vector. To cpu, it just receives an normal external interrupt,
+then we can get control in the handler of this new vector. In this case, hypervisor
+can do something in it, such as wakeup the blocked vCPU.
+
+Here are what we do for the blocked vCPU:
+1. Define a per-cpu list 'blocked_vcpu_on_cpu', which stored the blocked
+vCPU on the pCPU.
+2. When the vCPU's state is changed to RUNSTATE_blocked, insert the vCPU
+to the per-cpu list belonging to the pCPU it was running.
+3. When the vCPU is unblocked, remove the vCPU from the related pCPU list.
+
+In the handler of 'pi_wakeup_vector', we do:
+1. Get the physical CPU.
+2. Iterate the list 'blocked_vcpu_on_cpu' of the current pCPU, if 'ON' is set,
+we unblock the associated vCPU.
+
+- New boot command line for Xen, which controls VT-d PI feature by user.
+
+Like 'intremap' for interrupt remapping, we add a new boot command line
+'intpost' for posted-interrupts.
+
+- Multicast/broadcast and lowest priority interrupts consideration.
+
+With VT-d PI, the destination vCPU information of an external interrupt
+from assigned devices is stored in IRTE, this makes the following
+consideration of the design:
+1. Multicast/broadcast interrupts cannot be posted.
+2. For lowest-priority interrupts, new Intel CPU/Chipset/root-complex
+(starting from Nehalem) ignore TPR value, and instead supported two other
+ways (configurable by BIOS) on how the handle lowest priority interrupts:
+	A) Round robin: In this method, the chipset simply delivers lowest priority
+interrupts in a round-robin manner across all the available logical CPUs. While
+this provides good load balancing, this was not the best thing to do always as
+interrupts from the same device (like NIC) will start running on all the CPUs
+thrashing caches and taking locks. This led to the next scheme.
+	B) Vector hashing: In this method, hardware would apply a hash function
+on the vector value in the interrupt request, and use that hash to pick a logical
+CPU to route the lowest priority interrupt. This way, a given vector always goes
+to the same logical CPU, avoiding the thrashing problem above.
+
+So, gist of above is that, lowest priority interrupts has never been delivered as
+"lowest priority" in physical hardware.
+
+I will emulate vector hashing for posted-interrupt for XEN.
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 02/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
  2015-05-08  9:07 ` [RFC v2 01/15] Vt-d Posted-intterrupt (PI) design Feng Wu
@ 2015-05-08  9:07 ` Feng Wu
  2015-06-09 14:10   ` Jan Beulich
  2015-05-08  9:07 ` [RFC v2 03/15] vt-d: VT-d Posted-Interrupts feature detection Feng Wu
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.

This patch adds variable 'iommu_intpost' to control whether enable VT-d
posted-interrupt or not in the generic IOMMU code.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 xen/drivers/passthrough/iommu.c | 11 ++++++++++-
 xen/include/xen/iommu.h         |  2 +-
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 92ea26f..302e3e4 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -39,6 +39,7 @@ static void iommu_dump_p2m_table(unsigned char key);
  *   no-snoop                   Disable VT-d Snoop Control
  *   no-qinval                  Disable VT-d Queued Invalidation
  *   no-intremap                Disable VT-d Interrupt Remapping
+ *   no-intpost                 Disable VT-d Interrupt posting
  */
 custom_param("iommu", parse_iommu_param);
 bool_t __initdata iommu_enable = 1;
@@ -51,6 +52,7 @@ bool_t __read_mostly iommu_passthrough;
 bool_t __read_mostly iommu_snoop = 1;
 bool_t __read_mostly iommu_qinval = 1;
 bool_t __read_mostly iommu_intremap = 1;
+bool_t __read_mostly iommu_intpost = 0;
 bool_t __read_mostly iommu_hap_pt_share = 1;
 bool_t __read_mostly iommu_debug;
 bool_t __read_mostly amd_iommu_perdev_intremap = 1;
@@ -94,7 +96,11 @@ static void __init parse_iommu_param(char *s)
         else if ( !strcmp(s, "qinval") )
             iommu_qinval = val;
         else if ( !strcmp(s, "intremap") )
+        {
             iommu_intremap = val;
+            if ( iommu_intremap == 0 )
+                iommu_intpost = 0;
+        }
         else if ( !strcmp(s, "debug") )
         {
             iommu_debug = val;
@@ -272,7 +278,10 @@ int __init iommu_setup(void)
         iommu_enabled = (rc == 0);
     }
     if ( !iommu_enabled )
+    {
         iommu_intremap = 0;
+        iommu_intpost = 0;
+    }
 
     if ( (force_iommu && !iommu_enabled) ||
          (force_intremap && !iommu_intremap) )
@@ -341,7 +350,7 @@ void iommu_crash_shutdown(void)
     const struct iommu_ops *ops = iommu_get_ops();
     if ( iommu_enabled )
         ops->crash_shutdown();
-    iommu_enabled = iommu_intremap = 0;
+    iommu_enabled = iommu_intremap = iommu_intpost = 0;
 }
 
 bool_t iommu_has_feature(struct domain *d, enum iommu_feature feature)
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index bf4aff0..91063bb 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -31,7 +31,7 @@
 extern bool_t iommu_enable, iommu_enabled;
 extern bool_t force_iommu, iommu_verbose;
 extern bool_t iommu_workaround_bios_bug, iommu_passthrough;
-extern bool_t iommu_snoop, iommu_qinval, iommu_intremap;
+extern bool_t iommu_snoop, iommu_qinval, iommu_intremap, iommu_intpost;
 extern bool_t iommu_hap_pt_share;
 extern bool_t iommu_debug;
 extern bool_t amd_iommu_perdev_intremap;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 03/15] vt-d: VT-d Posted-Interrupts feature detection
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
  2015-05-08  9:07 ` [RFC v2 01/15] Vt-d Posted-intterrupt (PI) design Feng Wu
  2015-05-08  9:07 ` [RFC v2 02/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature Feng Wu
@ 2015-05-08  9:07 ` Feng Wu
  2015-06-09 14:15   ` Jan Beulich
  2015-05-08  9:07 ` [RFC v2 04/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts Feng Wu
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.

This patch adds feature detection logic for VT-d posted-interrupt.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 xen/drivers/passthrough/vtd/iommu.c | 15 +++++++++++++--
 xen/drivers/passthrough/vtd/iommu.h |  1 +
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index c7bda73..8ad1f58 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2044,6 +2044,7 @@ static int init_vtd_hw(void)
             if ( ioapic_to_iommu(IO_APIC_ID(apic)) == NULL )
             {
                 iommu_intremap = 0;
+                iommu_intpost = 0;
                 dprintk(XENLOG_ERR VTDPREFIX,
                     "ioapic_to_iommu: ioapic %#x (id: %#x) is NULL! "
                     "Will not try to enable Interrupt Remapping.\n",
@@ -2060,6 +2061,7 @@ static int init_vtd_hw(void)
             if ( enable_intremap(iommu, 0) != 0 )
             {
                 iommu_intremap = 0;
+                iommu_intpost = 0;
                 dprintk(XENLOG_WARNING VTDPREFIX,
                         "Interrupt Remapping not enabled\n");
 
@@ -2133,8 +2135,8 @@ int __init intel_vtd_setup(void)
     }
 
     /* We enable the following features only if they are supported by all VT-d
-     * engines: Snoop Control, DMA passthrough, Queued Invalidation and
-     * Interrupt Remapping.
+     * engines: Snoop Control, DMA passthrough, Queued Invalidation, Interrupt
+     * Remapping, and Posted Interrupt
      */
     for_each_drhd_unit ( drhd )
     {
@@ -2160,7 +2162,13 @@ int __init intel_vtd_setup(void)
             iommu_qinval = 0;
 
         if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
+        {
             iommu_intremap = 0;
+            iommu_intpost = 0;
+        }
+
+        if ( iommu_intpost && !cap_intr_post(iommu->cap) )
+            iommu_intpost = 0;
 
         if ( !vtd_ept_page_compatible(iommu) )
             iommu_hap_pt_share = 0;
@@ -2178,6 +2186,7 @@ int __init intel_vtd_setup(void)
     if ( !iommu_qinval && iommu_intremap )
     {
         iommu_intremap = 0;
+        iommu_intpost = 0;
         dprintk(XENLOG_WARNING VTDPREFIX, "Interrupt Remapping disabled "
             "since Queued Invalidation isn't supported or enabled.\n");
     }
@@ -2187,6 +2196,7 @@ int __init intel_vtd_setup(void)
     P(iommu_passthrough, "Dom0 DMA Passthrough");
     P(iommu_qinval, "Queued Invalidation");
     P(iommu_intremap, "Interrupt Remapping");
+    P(iommu_intpost, "Posted Interrupt");
     P(iommu_hap_pt_share, "Shared EPT tables");
 #undef P
 
@@ -2206,6 +2216,7 @@ int __init intel_vtd_setup(void)
     iommu_passthrough = 0;
     iommu_qinval = 0;
     iommu_intremap = 0;
+    iommu_intpost = 0;
     return ret;
 }
 
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index d6e6520..42047e0 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -69,6 +69,7 @@
 /*
  * Decoding Capability Register
  */
+#define cap_intr_post(c)       (((c) >> 59) & 1)
 #define cap_read_drain(c)      (((c) >> 55) & 1)
 #define cap_write_drain(c)     (((c) >> 54) & 1)
 #define cap_max_amask_val(c)   (((c) >> 48) & 0x3f)
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 04/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (2 preceding siblings ...)
  2015-05-08  9:07 ` [RFC v2 03/15] vt-d: VT-d Posted-Interrupts feature detection Feng Wu
@ 2015-05-08  9:07 ` Feng Wu
  2015-06-09 14:17   ` Jan Beulich
  2015-05-08  9:07 ` [RFC v2 05/15] vmx: Initialize VT-d Posted-Interrupts Descriptor Feng Wu
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

Extend struct pi_desc according to VT-d Posted-Interrupts Spec.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 xen/include/asm-x86/hvm/vmx/vmcs.h | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index 6fce6aa..3707d1c 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -76,8 +76,19 @@ struct vmx_domain {
 
 struct pi_desc {
     DECLARE_BITMAP(pir, NR_VECTORS);
-    u32 control;
-    u32 rsvd[7];
+    union {
+        struct
+        {
+        u64 on     : 1,  /* bit 256 - Outstanding Notification */
+            sn     : 1,  /* bit 257 - Suppress Notification */
+            rsvd_1 : 14, /* bit 271:258 - Reserved */
+            nv     : 8,  /* bit 279:272 - Notification Vector */
+            rsvd_2 : 8,  /* bit 287:280 - Reserved */
+            ndst   : 32; /* bit 319:288 - Notification Destination */
+        };
+        u64 control;
+    };
+    u32 rsvd[6];
 } __attribute__ ((aligned (64)));
 
 #define ept_get_wl(ept)   ((ept)->ept_wl)
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 05/15] vmx: Initialize VT-d Posted-Interrupts Descriptor
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (3 preceding siblings ...)
  2015-05-08  9:07 ` [RFC v2 04/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts Feng Wu
@ 2015-05-08  9:07 ` Feng Wu
  2015-06-09 14:20   ` Jan Beulich
  2015-05-08  9:07 ` [RFC v2 06/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts Feng Wu
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

This patch initializes the VT-d Posted-interrupt Descriptor.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 xen/arch/x86/hvm/vmx/vmcs.c       |  3 +++
 xen/include/asm-x86/hvm/vmx/vmx.h | 20 +++++++++++++++++++-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 63007a9..f60a454 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -1004,6 +1004,9 @@ static int construct_vmcs(struct vcpu *v)
 
     if ( cpu_has_vmx_posted_intr_processing )
     {
+        if ( iommu_intpost )
+            pi_desc_init(v);
+
         __vmwrite(PI_DESC_ADDR, virt_to_maddr(&v->arch.hvm_vmx.pi_desc));
         __vmwrite(POSTED_INTR_NOTIFICATION_VECTOR, posted_intr_vector);
     }
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 91c5e18..e4292cc 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -28,6 +28,9 @@
 #include <asm/hvm/support.h>
 #include <asm/hvm/trace.h>
 #include <asm/hvm/vmx/vmcs.h>
+#include <asm/apic.h>
+
+extern uint8_t posted_intr_vector;
 
 typedef union {
     struct {
@@ -125,6 +128,22 @@ static inline unsigned long pi_get_pir(struct pi_desc *pi_desc, int group)
     return xchg(&pi_desc->pir[group], 0);
 }
 
+#define PI_xAPIC_NDST_MASK   0xFF00
+
+static inline void pi_desc_init(struct vcpu *v)
+{
+    uint32_t dest;
+
+    v->arch.hvm_vmx.pi_desc.nv = posted_intr_vector;
+
+    dest = cpu_physical_id(v->processor);
+
+    if ( x2apic_enabled )
+        v->arch.hvm_vmx.pi_desc.ndst = dest;
+    else
+        v->arch.hvm_vmx.pi_desc.ndst = MASK_INSR(dest, PI_xAPIC_NDST_MASK);
+}
+
 /*
  * Exit Reasons
  */
@@ -244,7 +263,6 @@ static inline unsigned long pi_get_pir(struct pi_desc *pi_desc, int group)
 #define MODRM_EAX_ECX   ".byte 0xc1\n" /* EAX, ECX */
 
 extern u64 vmx_ept_vpid_cap;
-extern uint8_t posted_intr_vector;
 
 #define cpu_has_vmx_ept_exec_only_supported        \
     (vmx_ept_vpid_cap & VMX_EPT_EXEC_ONLY_SUPPORTED)
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 06/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (4 preceding siblings ...)
  2015-05-08  9:07 ` [RFC v2 05/15] vmx: Initialize VT-d Posted-Interrupts Descriptor Feng Wu
@ 2015-05-08  9:07 ` Feng Wu
  2015-06-09 14:25   ` Jan Beulich
  2015-05-08  9:07 ` [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used Feng Wu
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

Extend struct iremap_entry according to VT-d Posted-Interrupts Spec.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 xen/drivers/passthrough/vtd/intremap.c | 90 +++++++++++++++++-----------------
 xen/drivers/passthrough/vtd/iommu.h    | 26 +++++++---
 xen/drivers/passthrough/vtd/utils.c    | 10 ++--
 3 files changed, 69 insertions(+), 57 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index 0333686..5ec76b4 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -123,9 +123,9 @@ static u16 hpetid_to_bdf(unsigned int hpet_id)
 static void set_ire_sid(struct iremap_entry *ire,
                         unsigned int svt, unsigned int sq, unsigned int sid)
 {
-    ire->hi.svt = svt;
-    ire->hi.sq = sq;
-    ire->hi.sid = sid;
+    ire->remap.svt = svt;
+    ire->remap.sq = sq;
+    ire->remap.sid = sid;
 }
 
 static void set_ioapic_source_id(int apic_id, struct iremap_entry *ire)
@@ -220,7 +220,7 @@ static unsigned int alloc_remap_entry(struct iommu *iommu, unsigned int nr)
         else
             p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];
 
-        if ( p->lo_val || p->hi_val ) /* not a free entry */
+        if ( p->lo || p->hi ) /* not a free entry */
             found = 0;
         else if ( ++found == nr )
             break;
@@ -254,7 +254,7 @@ static int remap_entry_to_ioapic_rte(
     GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
                      iremap_entries, iremap_entry);
 
-    if ( iremap_entry->hi_val == 0 && iremap_entry->lo_val == 0 )
+    if ( iremap_entry->hi == 0 && iremap_entry->lo == 0 )
     {
         dprintk(XENLOG_ERR VTDPREFIX,
                 "%s: index (%d) get an empty entry!\n",
@@ -264,13 +264,13 @@ static int remap_entry_to_ioapic_rte(
         return -EFAULT;
     }
 
-    old_rte->vector = iremap_entry->lo.vector;
-    old_rte->delivery_mode = iremap_entry->lo.dlm;
-    old_rte->dest_mode = iremap_entry->lo.dm;
-    old_rte->trigger = iremap_entry->lo.tm;
+    old_rte->vector = iremap_entry->remap.vector;
+    old_rte->delivery_mode = iremap_entry->remap.dlm;
+    old_rte->dest_mode = iremap_entry->remap.dm;
+    old_rte->trigger = iremap_entry->remap.tm;
     old_rte->__reserved_2 = 0;
     old_rte->dest.logical.__reserved_1 = 0;
-    old_rte->dest.logical.logical_dest = iremap_entry->lo.dst >> 8;
+    old_rte->dest.logical.logical_dest = iremap_entry->remap.dst >> 8;
 
     unmap_vtd_domain_page(iremap_entries);
     spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
@@ -318,27 +318,27 @@ static int ioapic_rte_to_remap_entry(struct iommu *iommu,
     if ( rte_upper )
     {
         if ( x2apic_enabled )
-            new_ire.lo.dst = value;
+            new_ire.remap.dst = value;
         else
-            new_ire.lo.dst = (value >> 24) << 8;
+            new_ire.remap.dst = (value >> 24) << 8;
     }
     else
     {
         *(((u32 *)&new_rte) + 0) = value;
-        new_ire.lo.fpd = 0;
-        new_ire.lo.dm = new_rte.dest_mode;
-        new_ire.lo.tm = new_rte.trigger;
-        new_ire.lo.dlm = new_rte.delivery_mode;
+        new_ire.remap.fpd = 0;
+        new_ire.remap.dm = new_rte.dest_mode;
+        new_ire.remap.tm = new_rte.trigger;
+        new_ire.remap.dlm = new_rte.delivery_mode;
         /* Hardware require RH = 1 for LPR delivery mode */
-        new_ire.lo.rh = (new_ire.lo.dlm == dest_LowestPrio);
-        new_ire.lo.avail = 0;
-        new_ire.lo.res_1 = 0;
-        new_ire.lo.vector = new_rte.vector;
-        new_ire.lo.res_2 = 0;
+        new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
+        new_ire.remap.avail = 0;
+        new_ire.remap.res_1 = 0;
+        new_ire.remap.vector = new_rte.vector;
+        new_ire.remap.res_2 = 0;
 
         set_ioapic_source_id(IO_APIC_ID(apic), &new_ire);
-        new_ire.hi.res_1 = 0;
-        new_ire.lo.p = 1;     /* finally, set present bit */
+        new_ire.remap.res_3 = 0;
+        new_ire.remap.p = 1;     /* finally, set present bit */
 
         /* now construct new ioapic rte entry */
         remap_rte->vector = new_rte.vector;
@@ -511,7 +511,7 @@ static int remap_entry_to_msi_msg(
     GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
                      iremap_entries, iremap_entry);
 
-    if ( iremap_entry->hi_val == 0 && iremap_entry->lo_val == 0 )
+    if ( iremap_entry->hi == 0 && iremap_entry->lo == 0 )
     {
         dprintk(XENLOG_ERR VTDPREFIX,
                 "%s: index (%d) get an empty entry!\n",
@@ -524,25 +524,25 @@ static int remap_entry_to_msi_msg(
     msg->address_hi = MSI_ADDR_BASE_HI;
     msg->address_lo =
         MSI_ADDR_BASE_LO |
-        ((iremap_entry->lo.dm == 0) ?
+        ((iremap_entry->remap.dm == 0) ?
             MSI_ADDR_DESTMODE_PHYS:
             MSI_ADDR_DESTMODE_LOGIC) |
-        ((iremap_entry->lo.dlm != dest_LowestPrio) ?
+        ((iremap_entry->remap.dlm != dest_LowestPrio) ?
             MSI_ADDR_REDIRECTION_CPU:
             MSI_ADDR_REDIRECTION_LOWPRI);
     if ( x2apic_enabled )
-        msg->dest32 = iremap_entry->lo.dst;
+        msg->dest32 = iremap_entry->remap.dst;
     else
-        msg->dest32 = (iremap_entry->lo.dst >> 8) & 0xff;
+        msg->dest32 = (iremap_entry->remap.dst >> 8) & 0xff;
     msg->address_lo |= MSI_ADDR_DEST_ID(msg->dest32);
 
     msg->data =
         MSI_DATA_TRIGGER_EDGE |
         MSI_DATA_LEVEL_ASSERT |
-        ((iremap_entry->lo.dlm != dest_LowestPrio) ?
+        ((iremap_entry->remap.dlm != dest_LowestPrio) ?
             MSI_DATA_DELIVERY_FIXED:
             MSI_DATA_DELIVERY_LOWPRI) |
-        iremap_entry->lo.vector;
+        iremap_entry->remap.vector;
 
     unmap_vtd_domain_page(iremap_entries);
     spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
@@ -601,29 +601,29 @@ static int msi_msg_to_remap_entry(
     memcpy(&new_ire, iremap_entry, sizeof(struct iremap_entry));
 
     /* Set interrupt remapping table entry */
-    new_ire.lo.fpd = 0;
-    new_ire.lo.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
-    new_ire.lo.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
-    new_ire.lo.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
+    new_ire.remap.fpd = 0;
+    new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
+    new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
+    new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
     /* Hardware require RH = 1 for LPR delivery mode */
-    new_ire.lo.rh = (new_ire.lo.dlm == dest_LowestPrio);
-    new_ire.lo.avail = 0;
-    new_ire.lo.res_1 = 0;
-    new_ire.lo.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
-                        MSI_DATA_VECTOR_MASK;
-    new_ire.lo.res_2 = 0;
+    new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
+    new_ire.remap.avail = 0;
+    new_ire.remap.res_1 = 0;
+    new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
+                            MSI_DATA_VECTOR_MASK;
+    new_ire.remap.res_2 = 0;
     if ( x2apic_enabled )
-        new_ire.lo.dst = msg->dest32;
+        new_ire.remap.dst = msg->dest32;
     else
-        new_ire.lo.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
-                          & 0xff) << 8;
+        new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
+                             & 0xff) << 8;
 
     if ( pdev )
         set_msi_source_id(pdev, &new_ire);
     else
         set_hpet_source_id(msi_desc->hpet_id, &new_ire);
-    new_ire.hi.res_1 = 0;
-    new_ire.lo.p = 1;    /* finally, set present bit */
+    new_ire.remap.res_3 = 0;
+    new_ire.remap.p = 1;    /* finally, set present bit */
 
     /* now construct new MSI/MSI-X rte entry */
     remap_rte = (struct msi_msg_remap_entry *)msg;
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index 42047e0..77a9227 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -289,7 +289,7 @@ struct dma_pte {
 /* interrupt remap entry */
 struct iremap_entry {
   union {
-    u64 lo_val;
+    struct { u64 lo, hi; };
     struct {
         u64 p       : 1,
             fpd     : 1,
@@ -302,16 +302,28 @@ struct iremap_entry {
             vector  : 8,
             res_2   : 8,
             dst     : 32;
-    }lo;
-  };
-  union {
-    u64 hi_val;
+        u64 sid     : 16,
+            sq      : 2,
+            svt     : 2,
+            res_3   : 44;
+    }remap;
     struct {
+        u64 p       : 1,
+            fpd     : 1,
+            res_1   : 6,
+            avail   : 4,
+            res_2   : 2,
+            urg     : 1,
+            im      : 1,
+            vector  : 8,
+            res_3   : 14,
+            pda_l   : 26;
         u64 sid     : 16,
             sq      : 2,
             svt     : 2,
-            res_1   : 44;
-    }hi;
+            res_4   : 12,
+            pda_h   : 32;
+    }post;
   };
 };
 
diff --git a/xen/drivers/passthrough/vtd/utils.c b/xen/drivers/passthrough/vtd/utils.c
index bd14c02..a5fe237 100644
--- a/xen/drivers/passthrough/vtd/utils.c
+++ b/xen/drivers/passthrough/vtd/utils.c
@@ -238,14 +238,14 @@ static void dump_iommu_info(unsigned char key)
                 else
                     p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];
 
-                if ( !p->lo.p )
+                if ( !p->remap.p )
                     continue;
                 printk("  %04x:  %x   %x  %04x %08x %02x    %x   %x  %x  %x  %x"
                     "   %x %x\n", i,
-                    (u32)p->hi.svt, (u32)p->hi.sq, (u32)p->hi.sid,
-                    (u32)p->lo.dst, (u32)p->lo.vector, (u32)p->lo.avail,
-                    (u32)p->lo.dlm, (u32)p->lo.tm, (u32)p->lo.rh,
-                    (u32)p->lo.dm, (u32)p->lo.fpd, (u32)p->lo.p);
+                    (u32)p->remap.svt, (u32)p->remap.sq, (u32)p->remap.sid,
+                    (u32)p->remap.dst, (u32)p->remap.vector, (u32)p->remap.avail,
+                    (u32)p->remap.dlm, (u32)p->remap.tm, (u32)p->remap.rh,
+                    (u32)p->remap.dm, (u32)p->remap.fpd, (u32)p->remap.p);
                 print_cnt++;
             }
             if ( iremap_entries )
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (5 preceding siblings ...)
  2015-05-08  9:07 ` [RFC v2 06/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts Feng Wu
@ 2015-05-08  9:07 ` Feng Wu
  2015-06-09 14:32   ` Jan Beulich
  2015-06-10  6:17   ` Jan Beulich
  2015-05-08  9:07 ` [RFC v2 08/15] Update IRTE according to guest interrupt config changes Feng Wu
                   ` (9 subsequent siblings)
  16 siblings, 2 replies; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

This patch adds an API which is used to update the IRTE
for posted-interrupt when guest changes MSI/MSI-X information.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 xen/drivers/passthrough/vtd/intremap.c | 98 ++++++++++++++++++++++++++++++++++
 xen/drivers/passthrough/vtd/iommu.h    |  4 ++
 xen/include/asm-x86/iommu.h            |  2 +
 3 files changed, 104 insertions(+)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index 5ec76b4..dd1a3d8 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -898,3 +898,101 @@ void iommu_disable_x2apic_IR(void)
     for_each_drhd_unit ( drhd )
         disable_qinval(drhd->iommu);
 }
+
+static inline void setup_posted_irte(
+    struct iremap_entry *new_ire, struct pi_desc *pi_desc, uint8_t gvec)
+{
+    new_ire->post.urg = 0;
+    new_ire->post.vector = gvec;
+    new_ire->post.pda_l = (((u64)virt_to_maddr(pi_desc)) >>
+                           (32 - PDA_LOW_BIT)) & PDA_MASK(LOW);
+    new_ire->post.pda_h = (((u64)virt_to_maddr(pi_desc)) >> 32) &
+                           PDA_MASK(HIGH);
+
+    new_ire->post.res_1 = 0;
+    new_ire->post.res_2 = 0;
+    new_ire->post.res_3 = 0;
+    new_ire->post.res_4 = 0;
+
+    new_ire->post.im = 1;
+}
+
+/*
+ * This function is used to update the IRTE for posted-interrupt
+ * when guest changes MSI/MSI-X information.
+ */
+bool_t pi_update_irte(struct vcpu *v, struct pirq *pirq, uint8_t gvec)
+{
+    struct irq_desc *desc;
+    struct msi_desc *msi_desc;
+    int remap_index;
+    bool_t rc = 0;
+    struct pci_dev *pci_dev;
+    struct acpi_drhd_unit *drhd;
+    struct iommu *iommu;
+    struct ir_ctrl *ir_ctrl;
+    struct iremap_entry *iremap_entries = NULL, *p = NULL;
+    struct iremap_entry new_ire;
+    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    unsigned long flags;
+
+    desc = pirq_spin_lock_irq_desc(pirq, NULL);
+    if ( !desc )
+        return 0;
+
+    msi_desc = desc->msi_desc;
+    if ( !msi_desc )
+        goto unlock_out;
+
+    pci_dev = msi_desc->dev;
+    if ( !pci_dev )
+        goto unlock_out;
+
+    remap_index = msi_desc->remap_index;
+    drhd = acpi_find_matched_drhd_unit(pci_dev);
+    if ( !drhd )
+    {
+        dprintk(XENLOG_INFO VTDPREFIX,
+                "%pv: failed to get drhd, pci device: "
+                "%04x:%02x:%02x.%u, guest vector: %u\n",
+                v, pci_dev->seg, pci_dev->bus, PCI_SLOT(pci_dev->devfn),
+                PCI_FUNC(pci_dev->devfn), gvec);
+        goto unlock_out;
+    }
+
+    iommu = drhd->iommu;
+    ir_ctrl = iommu_ir_ctrl(iommu);
+    if ( !ir_ctrl )
+    {
+        dprintk(XENLOG_INFO VTDPREFIX,
+                "%pv: failed to get ir_ctrl, pci device: "
+                "%04x:%02x:%02x.%u, guest vector: %u\n",
+                v, pci_dev->seg, pci_dev->bus, PCI_SLOT(pci_dev->devfn),
+                PCI_FUNC(pci_dev->devfn), gvec);
+        goto unlock_out;
+    }
+
+    spin_lock_irqsave(&ir_ctrl->iremap_lock, flags);
+
+    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
+
+    memcpy(&new_ire, p, sizeof(new_ire));
+
+    /* Setup/Update interrupt remapping table entry. */
+    setup_posted_irte(&new_ire, pi_desc, gvec);
+
+    memcpy(p, &new_ire, sizeof(new_ire));
+    iommu_flush_cache_entry(p, sizeof(struct iremap_entry));
+    iommu_flush_iec_index(iommu, 0, remap_index);
+
+    if ( iremap_entries )
+        unmap_vtd_domain_page(iremap_entries);
+
+    spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
+
+    rc = 1;
+ unlock_out:
+    spin_unlock_irq(&desc->lock);
+
+    return rc;
+}
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index 77a9227..f41b4e2 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -327,6 +327,10 @@ struct iremap_entry {
   };
 };
 
+#define PDA_LOW_BIT    26
+#define PDA_HIGH_BIT   32
+#define PDA_MASK(XX)   (~(-1UL << PDA_##XX##_BIT))
+
 /* Max intr remapping table page order is 8, as max number of IRTEs is 64K */
 #define IREMAP_PAGE_ORDER  8
 
diff --git a/xen/include/asm-x86/iommu.h b/xen/include/asm-x86/iommu.h
index e7a65da..1528af8 100644
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -32,6 +32,8 @@ int iommu_supports_eim(void);
 int iommu_enable_x2apic_IR(void);
 void iommu_disable_x2apic_IR(void);
 
+bool_t pi_update_irte(struct vcpu *v, struct pirq *pirq, uint8_t gvec);
+
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*
  * Local variables:
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 08/15] Update IRTE according to guest interrupt config changes
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (6 preceding siblings ...)
  2015-05-08  9:07 ` [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used Feng Wu
@ 2015-05-08  9:07 ` Feng Wu
  2015-06-09 15:06   ` Jan Beulich
  2015-05-08  9:07 ` [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU Feng Wu
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

When guest changes its interrupt configuration (such as, vector, etc.)
for direct-assigned devices, we need to update the associated IRTE
with the new guest vector, so external interrupts from the assigned
devices can be injected to guests without VM-Exit.

For lowest-priority interrupts, we use vector-hashing mechamisn to find
the destination vCPU. This follows the hardware behavior, since modern
Intel CPUs use vector hashing to handle the lowest-priority interrupt.

For multicast/broadcast vCPU, we cannot handle it via interrupt posting,
still use interrupt remapping.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 xen/drivers/passthrough/io.c | 99 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 98 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 9b77334..7b1c094 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -26,6 +26,7 @@
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
+#include <asm/io_apic.h>
 
 static DEFINE_PER_CPU(struct list_head, dpci_list);
 
@@ -199,6 +200,73 @@ void free_hvm_irq_dpci(struct hvm_irq_dpci *dpci)
     xfree(dpci);
 }
 
+/*
+ * The purpose of this routine is to find the right destination vCPU for
+ * an interrupt which will be delivered by VT-d posted-interrupt. There
+ * are several cases as below:
+ *
+ * - For lowest-priority interrupts, we find the destination vCPU from the
+ *   guest vector using vector-hashing mechanism and return true. This follows
+ *   the hardware behavior, since modern Intel CPUs use vector hashing to
+ *   handle the lowest-priority interrupt.
+ * - Otherwise, for single destination interrupt, it is straightforward to
+ *   find the destination vCPU and return true.
+ * - For multicast/broadcast vCPU, we cannot handle it via interrupt posting,
+ *   so return false.
+ *
+ *   Here is the details about the vector-hashing mechanism:
+ *   1. For lowest-priority interrupts, store all the possible destination
+ *      vCPUs in an array.
+ *   2. Use "gvec % max number of destination vCPUs" to find the right
+ *      destination vCPU in the array for the lowest-priority interrupt.
+ */
+static bool_t pi_find_dest_vcpu(struct domain *d, uint8_t dest_id,
+                                uint8_t dest_mode, uint8_t delivery_mode,
+                                uint8_t gvec, struct vcpu **dest_vcpu)
+{
+    struct vcpu *v, **dest_vcpu_array;
+    unsigned int dest_vcpu_num = 0;
+    int ret;
+
+    dest_vcpu_array = xzalloc_array(struct vcpu *, d->max_vcpus);
+    if ( !dest_vcpu_array )
+    {
+        dprintk(XENLOG_G_INFO,
+                "dom%d: failed to allocate memeory.\n", d->domain_id);
+        return 0;
+    }
+
+    for_each_vcpu ( d, v )
+    {
+        if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, 0,
+                                dest_id, dest_mode) )
+            continue;
+
+        dest_vcpu_array[dest_vcpu_num++] = v;
+    }
+
+    if ( delivery_mode == dest_LowestPrio )
+    {
+        if (  dest_vcpu_num != 0 )
+        {
+            *dest_vcpu = dest_vcpu_array[gvec % dest_vcpu_num];
+            ret = 1;
+        }
+        else
+            ret = 0;
+    }
+    else if (  dest_vcpu_num == 1 )
+    {
+        *dest_vcpu = dest_vcpu_array[0];
+        ret = 1;
+    }
+    else
+        ret = 0;
+
+    xfree(dest_vcpu_array);
+    return ret;
+}
+
 int pt_irq_create_bind(
     struct domain *d, xen_domctl_bind_pt_irq_t *pt_irq_bind)
 {
@@ -257,7 +325,7 @@ int pt_irq_create_bind(
     {
     case PT_IRQ_TYPE_MSI:
     {
-        uint8_t dest, dest_mode;
+        uint8_t dest, dest_mode, delivery_mode;
         int dest_vcpu_id;
 
         if ( !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
@@ -330,11 +398,40 @@ int pt_irq_create_bind(
         /* Calculate dest_vcpu_id for MSI-type pirq migration. */
         dest = pirq_dpci->gmsi.gflags & VMSI_DEST_ID_MASK;
         dest_mode = !!(pirq_dpci->gmsi.gflags & VMSI_DM_MASK);
+        delivery_mode = (pirq_dpci->gmsi.gflags >> GFLAGS_SHIFT_DELIV_MODE) &
+                        VMSI_DELIV_MASK;
         dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
         pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
         spin_unlock(&d->event_lock);
         if ( dest_vcpu_id >= 0 )
             hvm_migrate_pirqs(d->vcpu[dest_vcpu_id]);
+
+        /* Use interrupt posting if it is supported */
+        if ( iommu_intpost )
+        {
+            struct vcpu *vcpu = NULL;
+
+            if ( !pi_find_dest_vcpu(d, dest, dest_mode, delivery_mode,
+                                    pirq_dpci->gmsi.gvec, &vcpu) )
+            {
+                dprintk(XENLOG_G_WARNING,
+                        "%pv: failed to find the dest vCPU for PI, guest "
+                        "vector:%u use software way to deliver the "
+                        " interrupts.\n", vcpu, pirq_dpci->gmsi.gvec);
+                break;
+            }
+
+            if ( pi_update_irte( vcpu, info, pirq_dpci->gmsi.gvec ) != 0 )
+            {
+                dprintk(XENLOG_G_WARNING,
+                        "%pv: failed to update PI IRTE, guest vector:%u "
+                        "use software way to deliver the interrupts.\n",
+                        vcpu, pirq_dpci->gmsi.gvec);
+
+                break;
+            }
+        }
+
         break;
     }
 
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (7 preceding siblings ...)
  2015-05-08  9:07 ` [RFC v2 08/15] Update IRTE according to guest interrupt config changes Feng Wu
@ 2015-05-08  9:07 ` Feng Wu
  2015-06-09 15:08   ` Jan Beulich
  2015-05-08  9:07 ` [RFC v2 10/15] vmx: Define two per-cpu variables Feng Wu
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

This patch adds a new per-vCPU tasklet to wakeup the blocked
vCPU. It can be used in the case vcpu_unblock cannot be called
directly. This tasklet will be used in later patch in this
series.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 xen/common/domain.c     | 11 +++++++++++
 xen/include/xen/sched.h |  3 +++
 2 files changed, 14 insertions(+)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 6803c4d..95e2a10 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -109,6 +109,13 @@ static void vcpu_check_shutdown(struct vcpu *v)
     spin_unlock(&d->shutdown_lock);
 }
 
+static void vcpu_wakeup_tasklet_handler(unsigned long arg)
+{
+    struct vcpu *v = (void *)arg;
+
+    vcpu_unblock(v);
+}
+
 struct vcpu *alloc_vcpu(
     struct domain *d, unsigned int vcpu_id, unsigned int cpu_id)
 {
@@ -126,6 +133,9 @@ struct vcpu *alloc_vcpu(
 
     tasklet_init(&v->continue_hypercall_tasklet, NULL, 0);
 
+    tasklet_init(&v->vcpu_wakeup_tasklet, vcpu_wakeup_tasklet_handler,
+                 (unsigned long)v);
+
     if ( !zalloc_cpumask_var(&v->cpu_hard_affinity) ||
          !zalloc_cpumask_var(&v->cpu_hard_affinity_tmp) ||
          !zalloc_cpumask_var(&v->cpu_hard_affinity_saved) ||
@@ -785,6 +795,7 @@ static void complete_domain_destroy(struct rcu_head *head)
         if ( (v = d->vcpu[i]) == NULL )
             continue;
         tasklet_kill(&v->continue_hypercall_tasklet);
+        tasklet_kill(&v->vcpu_wakeup_tasklet);
         vcpu_destroy(v);
         sched_destroy_vcpu(v);
         destroy_waitqueue_vcpu(v);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 80c6f62..fd9e01e 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -239,6 +239,9 @@ struct vcpu
     /* Tasklet for continue_hypercall_on_cpu(). */
     struct tasklet   continue_hypercall_tasklet;
 
+    /* Tasklet for wakeup_blocked_vcpu(). */
+    struct tasklet   vcpu_wakeup_tasklet;
+
     /* Multicall information. */
     struct mc_state  mc_state;
 
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 10/15] vmx: Define two per-cpu variables
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (8 preceding siblings ...)
  2015-05-08  9:07 ` [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU Feng Wu
@ 2015-05-08  9:07 ` Feng Wu
  2015-06-09 15:12   ` Jan Beulich
  2015-05-08  9:07 ` [RFC v2 11/15] vmx: Add a global wake-up vector for VT-d Posted-Interrupts Feng Wu
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

This patch defines two per-cpu variables:

blocked_vcpu:
A list storing the vCPUs which were blocked on this pCPU.

blocked_vcpu_lock:
The spinlock to protect blocked_vcpu.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 xen/arch/x86/hvm/vmx/vmcs.c       | 3 +++
 xen/arch/x86/hvm/vmx/vmx.c        | 7 +++++++
 xen/include/asm-x86/hvm/vmx/vmx.h | 3 +++
 3 files changed, 13 insertions(+)

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index f60a454..8166f08 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -585,6 +585,9 @@ int vmx_cpu_up(void)
     if ( cpu_has_vmx_vpid )
         vpid_sync_all();
 
+    INIT_LIST_HEAD(&per_cpu(blocked_vcpu, cpu));
+    spin_lock_init(&per_cpu(blocked_vcpu_lock, cpu));
+
     return 0;
 }
 
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 6c4f78c..2451ca5 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -82,6 +82,13 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content);
 static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content);
 static void vmx_invlpg_intercept(unsigned long vaddr);
 
+/*
+ * We maintian a per-CPU linked-list of vCPU, so in PI wakeup handler we
+ * can find which vCPU should be waken up.
+ */
+DEFINE_PER_CPU(struct list_head, blocked_vcpu);
+DEFINE_PER_CPU(spinlock_t, blocked_vcpu_lock);
+
 uint8_t __read_mostly posted_intr_vector;
 
 static int vmx_domain_initialise(struct domain *d)
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index e4292cc..0d11f9c 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -30,6 +30,9 @@
 #include <asm/hvm/vmx/vmcs.h>
 #include <asm/apic.h>
 
+DECLARE_PER_CPU(struct list_head, blocked_vcpu);
+DECLARE_PER_CPU(spinlock_t, blocked_vcpu_lock);
+
 extern uint8_t posted_intr_vector;
 
 typedef union {
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 11/15] vmx: Add a global wake-up vector for VT-d Posted-Interrupts
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (9 preceding siblings ...)
  2015-05-08  9:07 ` [RFC v2 10/15] vmx: Define two per-cpu variables Feng Wu
@ 2015-05-08  9:07 ` Feng Wu
  2015-06-09 15:20   ` Jan Beulich
  2015-05-08  9:07 ` [RFC v2 12/15] vmx: Properly handle notification event when vCPU is running Feng Wu
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

This patch adds a global vector which is used to wake up
the blocked vCPU when an interrupt is being posted to it.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Suggested-by: Yang Zhang <yang.z.zhang@intel.com>
---
 xen/arch/x86/hvm/vmx/vmx.c        | 31 +++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/hvm.h     |  1 +
 xen/include/asm-x86/hvm/vmx/vmx.h |  3 +++
 xen/include/xen/sched.h           |  2 ++
 4 files changed, 37 insertions(+)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 2451ca5..0e71d7e 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -90,6 +90,7 @@ DEFINE_PER_CPU(struct list_head, blocked_vcpu);
 DEFINE_PER_CPU(spinlock_t, blocked_vcpu_lock);
 
 uint8_t __read_mostly posted_intr_vector;
+uint8_t __read_mostly pi_wakeup_vector;
 
 static int vmx_domain_initialise(struct domain *d)
 {
@@ -132,6 +133,8 @@ static int vmx_vcpu_initialise(struct vcpu *v)
     if ( v->vcpu_id == 0 )
         v->arch.user_regs.eax = 1;
 
+    INIT_LIST_HEAD(&v->blocked_vcpu_list);
+
     return 0;
 }
 
@@ -1835,11 +1838,17 @@ const struct hvm_function_table * __init start_vmx(void)
     }
 
     if ( cpu_has_vmx_posted_intr_processing )
+    {
         alloc_direct_apic_vector(&posted_intr_vector, event_check_interrupt);
+
+        if ( iommu_intpost )
+            alloc_direct_apic_vector(&pi_wakeup_vector, pi_wakeup_interrupt);
+    }
     else
     {
         vmx_function_table.deliver_posted_intr = NULL;
         vmx_function_table.sync_pir_to_irr = NULL;
+        vmx_function_table.pi_desc_update = NULL;
     }
 
     if ( cpu_has_vmx_ept
@@ -3262,6 +3271,28 @@ void vmx_vmenter_helper(const struct cpu_user_regs *regs)
 }
 
 /*
+ * Handle VT-d posted-interrupt when VCPU is blocked.
+ */
+void pi_wakeup_interrupt(struct cpu_user_regs *regs)
+{
+    struct vcpu *v;
+    unsigned int cpu = smp_processor_id();
+
+    spin_lock(&per_cpu(blocked_vcpu_lock, cpu));
+    list_for_each_entry(v, &per_cpu(blocked_vcpu, cpu),
+                    blocked_vcpu_list) {
+        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+
+        if ( pi_desc->on == 1 )
+            tasklet_schedule(&v->vcpu_wakeup_tasklet);
+    }
+    spin_unlock(&per_cpu(blocked_vcpu_lock, cpu));
+
+    ack_APIC_irq();
+    this_cpu(irq_count)++;
+}
+
+/*
  * Local variables:
  * mode: C
  * c-file-style: "BSD"
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 77eeac5..e621c30 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -195,6 +195,7 @@ struct hvm_function_table {
     void (*deliver_posted_intr)(struct vcpu *v, u8 vector);
     void (*sync_pir_to_irr)(struct vcpu *v);
     void (*handle_eoi)(u8 vector);
+    void (*pi_desc_update)(struct vcpu *v, int old_state);
 
     /*Walk nested p2m  */
     int (*nhvm_hap_walk_L1_p2m)(struct vcpu *v, paddr_t L2_gpa,
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 0d11f9c..3adf776 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -34,6 +34,7 @@ DECLARE_PER_CPU(struct list_head, blocked_vcpu);
 DECLARE_PER_CPU(spinlock_t, blocked_vcpu_lock);
 
 extern uint8_t posted_intr_vector;
+extern uint8_t pi_wakeup_vector;
 
 typedef union {
     struct {
@@ -552,6 +553,8 @@ int alloc_p2m_hap_data(struct p2m_domain *p2m);
 void free_p2m_hap_data(struct p2m_domain *p2m);
 void p2m_init_hap_data(struct p2m_domain *p2m);
 
+void pi_wakeup_interrupt(struct cpu_user_regs *regs);
+
 /* EPT violation qualifications definitions */
 #define _EPT_READ_VIOLATION         0
 #define EPT_READ_VIOLATION          (1UL<<_EPT_READ_VIOLATION)
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index fd9e01e..4a7e6b3 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -148,6 +148,8 @@ struct vcpu
 
     struct vcpu     *next_in_list;
 
+    struct list_head blocked_vcpu_list;
+
     s_time_t         periodic_period;
     s_time_t         periodic_last_event;
     struct timer     periodic_timer;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 12/15] vmx: Properly handle notification event when vCPU is running
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (10 preceding siblings ...)
  2015-05-08  9:07 ` [RFC v2 11/15] vmx: Add a global wake-up vector for VT-d Posted-Interrupts Feng Wu
@ 2015-05-08  9:07 ` Feng Wu
  2015-06-09 16:12   ` Jan Beulich
  2015-05-08  9:07 ` [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling Feng Wu
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

When a vCPU is running in Root mode and a notification event
has been injected to it. we need to set VCPU_KICK_SOFTIRQ for
the current cpu, so the pending interrupt in PIRR will be
synced to vIRR before VM-Exit in time.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 xen/arch/x86/hvm/vmx/vmx.c        | 55 ++++++++++++++++++++++++++++++++++++++-
 xen/include/asm-x86/hvm/vmx/vmx.h |  1 +
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 0e71d7e..556a584 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1839,7 +1839,7 @@ const struct hvm_function_table * __init start_vmx(void)
 
     if ( cpu_has_vmx_posted_intr_processing )
     {
-        alloc_direct_apic_vector(&posted_intr_vector, event_check_interrupt);
+        alloc_direct_apic_vector(&posted_intr_vector, pi_notification_interrupt);
 
         if ( iommu_intpost )
             alloc_direct_apic_vector(&pi_wakeup_vector, pi_wakeup_interrupt);
@@ -3293,6 +3293,59 @@ void pi_wakeup_interrupt(struct cpu_user_regs *regs)
 }
 
 /*
+ * Handle VT-d posted-interrupt when VCPU is running.
+ */
+
+void pi_notification_interrupt(struct cpu_user_regs *regs)
+{
+    /*
+     * We get here when a vCPU is running in root-mode
+     * (such as via hypercall, or any other reasons which
+     * can result in VM-Exit), and before vCPU is back to
+     * non-root, external interrupts from an assigned
+     * device happen and a notification event is delivered
+     * to this logical CPU.
+     *
+     * we need to set VCPU_KICK_SOFTIRQ for the current
+     * cpu, just like __vmx_deliver_posted_interrupt().
+     *
+     * So the pending interrupt in PIRR will be synced to
+     * vIRR before VM-Exit in time.
+     *
+     * Please refer to the following code fragments from
+     * xen/arch/x86/hvm/vmx/entry.S:
+     *
+     * .Lvmx_do_vmentry
+     *
+     *  ......
+     *  point 1
+     *
+     *  cmp  %ecx,(%rdx,%rax,1)
+     *  jnz  .Lvmx_process_softirqs
+     *
+     *  ......
+     *
+     *  je   .Lvmx_launch
+     *
+     *  ......
+     *
+     * .Lvmx_process_softirqs:
+     *  sti
+     *  call do_softirq
+     *  jmp  .Lvmx_do_vmentry
+     *
+     *  If VT-d engine issues a notification event at
+     *  point 1 above, it cannot be delivered to the guest
+     *  during this VM-entry without raising the softirq
+     *  in this notification handler.
+     */
+    raise_softirq(VCPU_KICK_SOFTIRQ);
+
+    ack_APIC_irq();
+    this_cpu(irq_count)++;
+}
+
+/*
  * Local variables:
  * mode: C
  * c-file-style: "BSD"
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 3adf776..89c0f56 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -554,6 +554,7 @@ void free_p2m_hap_data(struct p2m_domain *p2m);
 void p2m_init_hap_data(struct p2m_domain *p2m);
 
 void pi_wakeup_interrupt(struct cpu_user_regs *regs);
+void pi_notification_interrupt(struct cpu_user_regs *regs);
 
 /* EPT violation qualifications definitions */
 #define _EPT_READ_VIOLATION         0
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (11 preceding siblings ...)
  2015-05-08  9:07 ` [RFC v2 12/15] vmx: Properly handle notification event when vCPU is running Feng Wu
@ 2015-05-08  9:07 ` Feng Wu
  2015-06-10  6:43   ` Jan Beulich
  2015-05-08  9:07 ` [RFC v2 14/15] Suppress posting interrupts when 'SN' is set Feng Wu
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

The basic idea here is:
1. When vCPU's state is RUNSTATE_running,
        - set 'NV' to 'Notification Vector'.
        - Clear 'SN' to accpet PI.
        - set 'NDST' to the right pCPU.
2. When vCPU's state is RUNSTATE_blocked,
        - set 'NV' to 'Wake-up Vector', so we can wake up the
          related vCPU when posted-interrupt happens for it.
        - Clear 'SN' to accpet PI.
3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
        - Set 'SN' to suppress non-urgent interrupts.
          (Current, we only support non-urgent interrupts)
        - Set 'NV' back to 'Notification Vector' if needed.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 xen/arch/x86/hvm/vmx/vmx.c | 130 +++++++++++++++++++++++++++++++++++++++++++++
 xen/common/schedule.c      |   5 ++
 xen/include/xen/sched.h    |   2 +
 3 files changed, 137 insertions(+)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 556a584..cdcc012 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1711,6 +1711,131 @@ static void vmx_handle_eoi(u8 vector)
     __vmwrite(GUEST_INTR_STATUS, status);
 }
 
+static void vmx_pi_desc_update(struct vcpu *v, int old_state)
+{
+    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    struct pi_desc old, new;
+    unsigned long flags;
+
+    if ( !iommu_intpost )
+        return;
+
+    switch ( v->runstate.state )
+    {
+    case RUNSTATE_runnable:
+    case RUNSTATE_offline:
+        /*
+         * We don't need to send notification event to a non-running
+         * vcpu, the interrupt information will be delivered to it before
+         * VM-ENTRY when the vcpu is scheduled to run next time.
+         */
+        pi_desc->sn = 1;
+
+        /*
+         * If the state is transferred from RUNSTATE_blocked,
+         * we should set 'NV' feild back to posted_intr_vector,
+         * so the Posted-Interrupts can be delivered to the vCPU
+         * by VT-d HW after it is scheduled to run.
+         */
+        if ( old_state == RUNSTATE_blocked )
+        {
+            do
+            {
+                old.control = new.control = pi_desc->control;
+                new.nv = posted_intr_vector;
+            }
+            while ( cmpxchg(&pi_desc->control, old.control, new.control)
+                    != old.control );
+
+           /*
+            * Delete the vCPU from the related block list
+            * if we are resuming from blocked state
+            */
+           spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
+                             v->pre_pcpu), flags);
+           list_del(&v->blocked_vcpu_list);
+           spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
+                                  v->pre_pcpu), flags);
+        }
+        break;
+
+    case RUNSTATE_blocked:
+        /*
+         * The vCPU is blocked on the block list.
+         * Add the blocked vCPU on the list of the
+         * vcpu->pre_pcpu, which is the destination
+         * of the wake-up notification event.
+         */
+        v->pre_pcpu = v->processor;
+        spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
+                          v->pre_pcpu), flags);
+        list_add_tail(&v->blocked_vcpu_list,
+                      &per_cpu(blocked_vcpu, v->pre_pcpu));
+        spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
+                               v->pre_pcpu), flags);
+
+        do
+        {
+            old.control = new.control = pi_desc->control;
+
+            /*
+             * We should not block the vCPU if
+             * an interrupt was posted for it.
+             */
+
+            if ( old.on == 1 )
+            {
+                /*
+                 * The vCPU will be removed from the block list
+                 * during its state transferring from RUNSTATE_blocked
+                 * to RUNSTATE_runnable after the following tasklet
+                 * is scheduled to run.
+                 */
+                tasklet_schedule(&v->vcpu_wakeup_tasklet);
+                return;
+            }
+
+            /*
+             * Change the 'NDST' field to v->pre_pcpu, so when
+             * external interrupts from assigned deivces happen,
+             * wakeup notifiction event will go to v->pre_pcpu,
+             * then in pi_wakeup_interrupt() we can find the
+             * vCPU in the right list to wake up.
+             */
+            if ( x2apic_enabled )
+                new.ndst = cpu_physical_id(v->pre_pcpu);
+            else
+                new.ndst = MASK_INSR(cpu_physical_id(v->pre_pcpu),
+                                     PI_xAPIC_NDST_MASK);
+            new.sn = 0;
+            new.nv = pi_wakeup_vector;
+        }
+        while ( cmpxchg(&pi_desc->control, old.control, new.control)
+                != old.control );
+        break;
+
+    case RUNSTATE_running:
+        ASSERT( pi_desc->sn == 1 );
+
+        do
+        {
+            old.control = new.control = pi_desc->control;
+            if ( x2apic_enabled )
+                new.ndst = cpu_physical_id(v->processor);
+            else
+                new.ndst = (cpu_physical_id(v->processor) << 8) & 0xFF00;
+
+            new.sn = 0;
+        }
+        while ( cmpxchg(&pi_desc->control, old.control, new.control)
+                != old.control );
+        break;
+
+    default:
+        break;
+    }
+}
+
 void vmx_hypervisor_cpuid_leaf(uint32_t sub_idx,
                                uint32_t *eax, uint32_t *ebx,
                                uint32_t *ecx, uint32_t *edx)
@@ -1842,7 +1967,12 @@ const struct hvm_function_table * __init start_vmx(void)
         alloc_direct_apic_vector(&posted_intr_vector, pi_notification_interrupt);
 
         if ( iommu_intpost )
+        {
             alloc_direct_apic_vector(&pi_wakeup_vector, pi_wakeup_interrupt);
+            vmx_function_table.pi_desc_update = vmx_pi_desc_update;
+        }
+        else
+            vmx_function_table.pi_desc_update = NULL;
     }
     else
     {
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 88770c6..3c6b2e1 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -142,6 +142,7 @@ static inline void vcpu_runstate_change(
     struct vcpu *v, int new_state, s_time_t new_entry_time)
 {
     s_time_t delta;
+    int old_state;
 
     ASSERT(v->runstate.state != new_state);
     ASSERT(spin_is_locked(per_cpu(schedule_data,v->processor).schedule_lock));
@@ -157,7 +158,11 @@ static inline void vcpu_runstate_change(
         v->runstate.state_entry_time = new_entry_time;
     }
 
+    old_state = v->runstate.state;
     v->runstate.state = new_state;
+
+    if ( is_hvm_vcpu(v) && hvm_funcs.pi_desc_update )
+        hvm_funcs.pi_desc_update(v, old_state);
 }
 
 void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate)
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 4a7e6b3..71d228a 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -142,6 +142,8 @@ struct vcpu
 
     int              processor;
 
+    int              pre_pcpu;
+
     vcpu_info_t     *vcpu_info;
 
     struct domain   *domain;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 14/15] Suppress posting interrupts when 'SN' is set
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (12 preceding siblings ...)
  2015-05-08  9:07 ` [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling Feng Wu
@ 2015-05-08  9:07 ` Feng Wu
  2015-06-10  6:48   ` Jan Beulich
  2015-05-08  9:07 ` [RFC v2 15/15] Add a command line parameter for VT-d posted-interrupts Feng Wu
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

Currently, we don't support urgent interrupt, all interrupts
are recognized as non-urgent interrupt, so we cannot send
posted-interrupt when 'SN' is set.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 xen/arch/x86/hvm/vmx/vmx.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index cdcc012..77a7897 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1664,9 +1664,20 @@ static void __vmx_deliver_posted_interrupt(struct vcpu *v)
 
 static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
 {
+    int r, sn;
+
     if ( pi_test_and_set_pir(vector, &v->arch.hvm_vmx.pi_desc) )
         return;
 
+    /*
+     * Currently, we don't support urgent interrupt, all interrupts
+     * are recognized as non-urgent interrupt, so we cannot send
+     * posted-interrupt when 'SN' is set.
+     */
+
+    sn = v->arch.hvm_vmx.pi_desc.sn;
+    r = pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc);
+
     if ( unlikely(v->arch.hvm_vmx.eoi_exitmap_changed) )
     {
         /*
@@ -1676,7 +1687,7 @@ static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
          */
         pi_set_on(&v->arch.hvm_vmx.pi_desc);
     }
-    else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
+    else if ( !r && !sn )
     {
         __vmx_deliver_posted_interrupt(v);
         return;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 15/15] Add a command line parameter for VT-d posted-interrupts
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (13 preceding siblings ...)
  2015-05-08  9:07 ` [RFC v2 14/15] Suppress posting interrupts when 'SN' is set Feng Wu
@ 2015-05-08  9:07 ` Feng Wu
  2015-06-10  6:52   ` Jan Beulich
  2015-05-13  5:11 ` [RFC v2 00/15] Add VT-d Posted-Interrupts support Wu, Feng
  2015-05-18  5:33 ` Tian, Kevin
  16 siblings, 1 reply; 64+ messages in thread
From: Feng Wu @ 2015-05-08  9:07 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, Feng Wu

Enable VT-d Posted-Interrupts and add a command line
parameter for it.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 docs/misc/xen-command-line.markdown | 9 ++++++++-
 xen/drivers/passthrough/iommu.c     | 8 +++++++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 1dda1f0..3faa073 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -821,7 +821,7 @@ debug hypervisor only).
 > Default: `new` unless directed-EOI is supported
 
 ### iommu
-> `= List of [ <boolean> | force | required | intremap | qinval | snoop | sharept | dom0-passthrough | dom0-strict | amd-iommu-perdev-intremap | workaround_bios_bug | verbose | debug ]`
+> `= List of [ <boolean> | force | required | intremap | intpost | qinval | snoop | sharept | dom0-passthrough | dom0-strict | amd-iommu-perdev-intremap | workaround_bios_bug | verbose | debug ]`
 
 > Sub-options:
 
@@ -848,6 +848,13 @@ debug hypervisor only).
 >> Control the use of interrupt remapping (DMA remapping will always be enabled
 >> if IOMMU functionality is enabled).
 
+> `intpost`
+
+> Default: `true`
+
+>> Control the use of interrupt posting, interrupt posting is dependant on
+>> interrupt remapping.
+
 > `qinval` (VT-d)
 
 > Default: `true`
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 302e3e4..1bda7e9 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -52,7 +52,7 @@ bool_t __read_mostly iommu_passthrough;
 bool_t __read_mostly iommu_snoop = 1;
 bool_t __read_mostly iommu_qinval = 1;
 bool_t __read_mostly iommu_intremap = 1;
-bool_t __read_mostly iommu_intpost = 0;
+bool_t __read_mostly iommu_intpost = 1;
 bool_t __read_mostly iommu_hap_pt_share = 1;
 bool_t __read_mostly iommu_debug;
 bool_t __read_mostly amd_iommu_perdev_intremap = 1;
@@ -101,6 +101,12 @@ static void __init parse_iommu_param(char *s)
             if ( iommu_intremap == 0 )
                 iommu_intpost = 0;
         }
+        else if ( !strcmp(s, "intpost") )
+        {
+            iommu_intpost = val;
+            if ( iommu_intremap == 0 )
+                iommu_intpost = 0;
+        }
         else if ( !strcmp(s, "debug") )
         {
             iommu_debug = val;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [RFC v2 00/15] Add VT-d Posted-Interrupts support
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (14 preceding siblings ...)
  2015-05-08  9:07 ` [RFC v2 15/15] Add a command line parameter for VT-d posted-interrupts Feng Wu
@ 2015-05-13  5:11 ` Wu, Feng
  2015-05-13  6:50   ` Jan Beulich
  2015-05-18  5:33 ` Tian, Kevin
  16 siblings, 1 reply; 64+ messages in thread
From: Wu, Feng @ 2015-05-13  5:11 UTC (permalink / raw)
  To: xen-devel
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, jbeulich,
	Zhang, Yang Z, Wu, Feng

Ping.. 

> -----Original Message-----
> From: Wu, Feng
> Sent: Friday, May 08, 2015 5:07 PM
> To: xen-devel@lists.xen.org
> Cc: keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com; Zhang,
> Yang Z; Tian, Kevin; george.dunlap@eu.citrix.com; Wu, Feng
> Subject: [RFC v2 00/15] Add VT-d Posted-Interrupts support
> 
> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.
> 
> You can find the VT-d Posted-Interrtups Spec. in the following URL:
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technolog
> y/vt-directed-io-spec.html
> 
> This patch set follow the following design:
> http://article.gmane.org/gmane.comp.emulators.xen.devel/236476
> 
> v1 -> v2
> 1. Add the design doc.
> 2. Coding style fix.
> 3. Add some comments for struct pi_desc.
> 4. Extend 'struct iremap_entry' to a more common format.
> 5. Delete the atomic helper functions for pi descriptor manipulation.
> 6. Add the new command line in docs/misc/xen-command-line.markdown.
> 7. Use macros to replace some magic numbers.
> 
> One open in "[RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used"
> How to update the IRTE for PI dynamically (in an atomic way)? I am trying
> to use cmpxchg16b and it is in progress, I will add this in the next version.
> 
> Feng Wu (15):
>   Vt-d Posted-intterrupt (PI) design
>   iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
>   vt-d: VT-d Posted-Interrupts feature detection
>   vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
>   vmx: Initialize VT-d Posted-Interrupts Descriptor
>   vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
>   vt-d: Add API to update IRTE when VT-d PI is used
>   Update IRTE according to guest interrupt config changes
>   Add a new per-vCPU tasklet to wakeup the blocked vCPU
>   vmx: Define two per-cpu variables
>   vmx: Add a global wake-up vector for VT-d Posted-Interrupts
>   vmx: Properly handle notification event when vCPU is running
>   Update Posted-Interrupts Descriptor during vCPU scheduling
>   Suppress posting interrupts when 'SN' is set
>   Add a command line parameter for VT-d posted-interrupts
> 
>  docs/misc/vtd-pi.txt                   | 332
> +++++++++++++++++++++++++++++++++
>  docs/misc/xen-command-line.markdown    |   9 +-
>  xen/arch/x86/hvm/vmx/vmcs.c            |   6 +
>  xen/arch/x86/hvm/vmx/vmx.c             | 236
> ++++++++++++++++++++++-
>  xen/common/domain.c                    |  11 ++
>  xen/common/schedule.c                  |   5 +
>  xen/drivers/passthrough/io.c           |  99 +++++++++-
>  xen/drivers/passthrough/iommu.c        |  17 +-
>  xen/drivers/passthrough/vtd/intremap.c | 188 ++++++++++++++-----
>  xen/drivers/passthrough/vtd/iommu.c    |  15 +-
>  xen/drivers/passthrough/vtd/iommu.h    |  31 ++-
>  xen/drivers/passthrough/vtd/utils.c    |  10 +-
>  xen/include/asm-x86/hvm/hvm.h          |   1 +
>  xen/include/asm-x86/hvm/vmx/vmcs.h     |  15 +-
>  xen/include/asm-x86/hvm/vmx/vmx.h      |  27 ++-
>  xen/include/asm-x86/iommu.h            |   2 +
>  xen/include/xen/iommu.h                |   2 +-
>  xen/include/xen/sched.h                |   7 +
>  18 files changed, 945 insertions(+), 68 deletions(-)
>  create mode 100644 docs/misc/vtd-pi.txt
> 
> --
> 2.1.0

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 00/15] Add VT-d Posted-Interrupts support
  2015-05-13  5:11 ` [RFC v2 00/15] Add VT-d Posted-Interrupts support Wu, Feng
@ 2015-05-13  6:50   ` Jan Beulich
  2015-05-13  7:04     ` Wu, Feng
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-05-13  6:50 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir, george.dunlap, andrew.cooper3, xen-devel, Yang Z Zhang

>>> On 13.05.15 at 07:11, <feng.wu@intel.com> wrote:
> Ping.. 

I have a list of almost 60 patches to review, among which the ones
improving existing code have higher priority than the ones adding
new functionality (of course bug fixes come always first), and for
the latter series with a rather high version (and without RFC tag)
would come before lower version / RFC ones. As said previously
(also on the hackathon) - relying on just two or three people doing
all the reviewing work (especially when the patch submission rate
is as high as it was recently) just can't scale. So by thoroughly
reviewing others' patches you can reduce the latency until yours
get dealt with (and of course this isn't meant for just you
personally, but is a general thing).

Apart from that it wasn't even a week ago that this series was
posted.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 00/15] Add VT-d Posted-Interrupts support
  2015-05-13  6:50   ` Jan Beulich
@ 2015-05-13  7:04     ` Wu, Feng
  0 siblings, 0 replies; 64+ messages in thread
From: Wu, Feng @ 2015-05-13  7:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, May 13, 2015 2:51 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [RFC v2 00/15] Add VT-d Posted-Interrupts support
> 
> >>> On 13.05.15 at 07:11, <feng.wu@intel.com> wrote:
> > Ping..
> 
> I have a list of almost 60 patches to review, among which the ones
> improving existing code have higher priority than the ones adding
> new functionality (of course bug fixes come always first), and for
> the latter series with a rather high version (and without RFC tag)
> would come before lower version / RFC ones. As said previously
> (also on the hackathon) - relying on just two or three people doing
> all the reviewing work (especially when the patch submission rate
> is as high as it was recently) just can't scale. So by thoroughly
> reviewing others' patches you can reduce the latency until yours
> get dealt with (and of course this isn't meant for just you
> personally, but is a general thing).

Fair enough!

Thanks,
Feng

> 
> Apart from that it wasn't even a week ago that this series was
> posted.
> 
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 00/15] Add VT-d Posted-Interrupts support
  2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (15 preceding siblings ...)
  2015-05-13  5:11 ` [RFC v2 00/15] Add VT-d Posted-Interrupts support Wu, Feng
@ 2015-05-18  5:33 ` Tian, Kevin
  16 siblings, 0 replies; 64+ messages in thread
From: Tian, Kevin @ 2015-05-18  5:33 UTC (permalink / raw)
  To: Wu, Feng, xen-devel
  Cc: Zhang, Yang Z, andrew.cooper3, keir, george.dunlap, jbeulich

> From: Wu, Feng
> Sent: Friday, May 08, 2015 5:07 PM
> 
> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.
> 
> You can find the VT-d Posted-Interrtups Spec. in the following URL:
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technolog
> y/vt-directed-io-spec.html
> 
> This patch set follow the following design:
> http://article.gmane.org/gmane.comp.emulators.xen.devel/236476
> 
> v1 -> v2
> 1. Add the design doc.
> 2. Coding style fix.
> 3. Add some comments for struct pi_desc.
> 4. Extend 'struct iremap_entry' to a more common format.
> 5. Delete the atomic helper functions for pi descriptor manipulation.
> 6. Add the new command line in docs/misc/xen-command-line.markdown.
> 7. Use macros to replace some magic numbers.

Though generally this version looks good to me, it'd be clearer if you could
give v1->v2 information per patch to help review. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 02/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
  2015-05-08  9:07 ` [RFC v2 02/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature Feng Wu
@ 2015-06-09 14:10   ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-09 14:10 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, yang.z.zhang

>>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> @@ -51,6 +52,7 @@ bool_t __read_mostly iommu_passthrough;
>  bool_t __read_mostly iommu_snoop = 1;
>  bool_t __read_mostly iommu_qinval = 1;
>  bool_t __read_mostly iommu_intremap = 1;
> +bool_t __read_mostly iommu_intpost = 0;

Pointless initializer.

> @@ -94,7 +96,11 @@ static void __init parse_iommu_param(char *s)
>          else if ( !strcmp(s, "qinval") )
>              iommu_qinval = val;
>          else if ( !strcmp(s, "intremap") )
> +        {
>              iommu_intremap = val;
> +            if ( iommu_intremap == 0 )
> +                iommu_intpost = 0;

Please help the compiler and use the local variable in the if() clause
rather than the global one. And considering it's a boolean, perhaps
!val would also be better than val == 0.

> @@ -272,7 +278,10 @@ int __init iommu_setup(void)
>          iommu_enabled = (rc == 0);
>      }
>      if ( !iommu_enabled )
> +    {
>          iommu_intremap = 0;
> +        iommu_intpost = 0;
> +    }

I think you'd better do this in a second if ( !iommu_intremap ).

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 03/15] vt-d: VT-d Posted-Interrupts feature detection
  2015-05-08  9:07 ` [RFC v2 03/15] vt-d: VT-d Posted-Interrupts feature detection Feng Wu
@ 2015-06-09 14:15   ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-09 14:15 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, yang.z.zhang

>>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -2044,6 +2044,7 @@ static int init_vtd_hw(void)
>              if ( ioapic_to_iommu(IO_APIC_ID(apic)) == NULL )
>              {
>                  iommu_intremap = 0;
> +                iommu_intpost = 0;
>                  dprintk(XENLOG_ERR VTDPREFIX,
>                      "ioapic_to_iommu: ioapic %#x (id: %#x) is NULL! "
>                      "Will not try to enable Interrupt Remapping.\n",
> @@ -2060,6 +2061,7 @@ static int init_vtd_hw(void)
>              if ( enable_intremap(iommu, 0) != 0 )
>              {
>                  iommu_intremap = 0;
> +                iommu_intpost = 0;
>                  dprintk(XENLOG_WARNING VTDPREFIX,
>                          "Interrupt Remapping not enabled\n");
>  

Instead of sprinkling these around, wouldn't it be possible to clear
the variable more centrally upon !iommu_intremap (assuming the
clearing in generic IOMMU code doesn't already suffice)?

> @@ -2160,7 +2162,13 @@ int __init intel_vtd_setup(void)
>              iommu_qinval = 0;
>  
>          if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
> +        {
>              iommu_intremap = 0;
> +            iommu_intpost = 0;
> +        }

Just like in the previous patch, I think this would better be folded
into the next if():

> +        if ( iommu_intpost && !cap_intr_post(iommu->cap) )
> +            iommu_intpost = 0;

Please try to avoid copying odd code: There's no point (afaics) for
the if to check iommu_intpost. Together with the above I think what
you want is

        if ( !iommu_intremap || !cap_intr_post(iommu->cap) )
            iommu_intpost = 0;

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 04/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-05-08  9:07 ` [RFC v2 04/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts Feng Wu
@ 2015-06-09 14:17   ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-09 14:17 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, yang.z.zhang

>>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> Extend struct pi_desc according to VT-d Posted-Interrupts Spec.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
>  xen/include/asm-x86/hvm/vmx/vmcs.h | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h 
> b/xen/include/asm-x86/hvm/vmx/vmcs.h
> index 6fce6aa..3707d1c 100644
> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
> +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
> @@ -76,8 +76,19 @@ struct vmx_domain {
>  
>  struct pi_desc {
>      DECLARE_BITMAP(pir, NR_VECTORS);
> -    u32 control;
> -    u32 rsvd[7];
> +    union {
> +        struct
> +        {
> +        u64 on     : 1,  /* bit 256 - Outstanding Notification */

Unless really needed, please avoid u64 here; u32 or even unsigned
int would seem to do just fine, ...

> +            sn     : 1,  /* bit 257 - Suppress Notification */
> +            rsvd_1 : 14, /* bit 271:258 - Reserved */
> +            nv     : 8,  /* bit 279:272 - Notification Vector */
> +            rsvd_2 : 8,  /* bit 287:280 - Reserved */
> +            ndst   : 32; /* bit 319:288 - Notification Destination */

... allowing this to not be a bitfield member at all.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 05/15] vmx: Initialize VT-d Posted-Interrupts Descriptor
  2015-05-08  9:07 ` [RFC v2 05/15] vmx: Initialize VT-d Posted-Interrupts Descriptor Feng Wu
@ 2015-06-09 14:20   ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-09 14:20 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, yang.z.zhang

>>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> --- a/xen/include/asm-x86/hvm/vmx/vmx.h
> +++ b/xen/include/asm-x86/hvm/vmx/vmx.h
> @@ -28,6 +28,9 @@
>  #include <asm/hvm/support.h>
>  #include <asm/hvm/trace.h>
>  #include <asm/hvm/vmx/vmcs.h>
> +#include <asm/apic.h>
> +
> +extern uint8_t posted_intr_vector;
>  
>  typedef union {
>      struct {
> @@ -125,6 +128,22 @@ static inline unsigned long pi_get_pir(struct pi_desc *pi_desc, int group)
>      return xchg(&pi_desc->pir[group], 0);
>  }
>  
> +#define PI_xAPIC_NDST_MASK   0xFF00
> +
> +static inline void pi_desc_init(struct vcpu *v)
> +{
> +    uint32_t dest;
> +
> +    v->arch.hvm_vmx.pi_desc.nv = posted_intr_vector;
> +
> +    dest = cpu_physical_id(v->processor);
> +
> +    if ( x2apic_enabled )
> +        v->arch.hvm_vmx.pi_desc.ndst = dest;
> +    else
> +        v->arch.hvm_vmx.pi_desc.ndst = MASK_INSR(dest, PI_xAPIC_NDST_MASK);
> +}

Unless this is going to be needed outside of vmcs.c this shouldn't
be in a header file (and probably also not be inline).

> @@ -244,7 +263,6 @@ static inline unsigned long pi_get_pir(struct pi_desc 
> *pi_desc, int group)
>  #define MODRM_EAX_ECX   ".byte 0xc1\n" /* EAX, ECX */
>  
>  extern u64 vmx_ept_vpid_cap;
> -extern uint8_t posted_intr_vector;

Which then also eliminates the need to move this declaration.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 06/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
  2015-05-08  9:07 ` [RFC v2 06/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts Feng Wu
@ 2015-06-09 14:25   ` Jan Beulich
  2015-06-16  6:18     ` Wu, Feng
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-06-09 14:25 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, yang.z.zhang

>>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> --- a/xen/drivers/passthrough/vtd/iommu.h
> +++ b/xen/drivers/passthrough/vtd/iommu.h
> @@ -289,7 +289,7 @@ struct dma_pte {
>  /* interrupt remap entry */
>  struct iremap_entry {
>    union {
> -    u64 lo_val;
> +    struct { u64 lo, hi; };
>      struct {
>          u64 p       : 1,
>              fpd     : 1,
> @@ -302,16 +302,28 @@ struct iremap_entry {
>              vector  : 8,
>              res_2   : 8,
>              dst     : 32;
> -    }lo;
> -  };
> -  union {
> -    u64 hi_val;
> +        u64 sid     : 16,
> +            sq      : 2,
> +            svt     : 2,
> +            res_3   : 44;
> +    }remap;
>      struct {
> +        u64 p       : 1,
> +            fpd     : 1,
> +            res_1   : 6,
> +            avail   : 4,
> +            res_2   : 2,
> +            urg     : 1,
> +            im      : 1,
> +            vector  : 8,
> +            res_3   : 14,
> +            pda_l   : 26;
>          u64 sid     : 16,
>              sq      : 2,
>              svt     : 2,
> -            res_1   : 44;
> -    }hi;
> +            res_4   : 12,
> +            pda_h   : 32;
> +    }post;
>    };
>  };

Same here - unless the VT-d maintainers disagree, I think using u32
wherever possible would be preferable over using u64, as well as
avoiding bitfields for members filling an entire word.

Also please fix the formatting (blanks between closing braces and field
names).

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-05-08  9:07 ` [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used Feng Wu
@ 2015-06-09 14:32   ` Jan Beulich
  2015-06-12  9:40     ` Wu, Feng
  2015-06-10  6:17   ` Jan Beulich
  1 sibling, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-06-09 14:32 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, yang.z.zhang

>>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> +bool_t pi_update_irte(struct vcpu *v, struct pirq *pirq, uint8_t gvec)

Without seeing the caller right away it's hard to judge, but generally I'd
prefer functions to return -E... values as error indicators, i.e.

> +{
> +    struct irq_desc *desc;
> +    struct msi_desc *msi_desc;
> +    int remap_index;
> +    bool_t rc = 0;
> +    struct pci_dev *pci_dev;
> +    struct acpi_drhd_unit *drhd;
> +    struct iommu *iommu;
> +    struct ir_ctrl *ir_ctrl;
> +    struct iremap_entry *iremap_entries = NULL, *p = NULL;
> +    struct iremap_entry new_ire;
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    unsigned long flags;
> +
> +    desc = pirq_spin_lock_irq_desc(pirq, NULL);
> +    if ( !desc )
> +        return 0;

-ENOMEM

> +    msi_desc = desc->msi_desc;
> +    if ( !msi_desc )
> +        goto unlock_out;

-EBADSLT

> +    pci_dev = msi_desc->dev;
> +    if ( !pci_dev )
> +        goto unlock_out;

-ENODEV

> +    remap_index = msi_desc->remap_index;
> +    drhd = acpi_find_matched_drhd_unit(pci_dev);
> +    if ( !drhd )
> +    {
> +        dprintk(XENLOG_INFO VTDPREFIX,
> +                "%pv: failed to get drhd, pci device: "
> +                "%04x:%02x:%02x.%u, guest vector: %u\n",
> +                v, pci_dev->seg, pci_dev->bus, PCI_SLOT(pci_dev->devfn),
> +                PCI_FUNC(pci_dev->devfn), gvec);
> +        goto unlock_out;
> +    }
> +
> +    iommu = drhd->iommu;
> +    ir_ctrl = iommu_ir_ctrl(iommu);
> +    if ( !ir_ctrl )
> +    {
> +        dprintk(XENLOG_INFO VTDPREFIX,
> +                "%pv: failed to get ir_ctrl, pci device: "
> +                "%04x:%02x:%02x.%u, guest vector: %u\n",
> +                v, pci_dev->seg, pci_dev->bus, PCI_SLOT(pci_dev->devfn),
> +                PCI_FUNC(pci_dev->devfn), gvec);
> +        goto unlock_out;
> +    }

Do you think these log messages are useful beyond your bringup
purposes?

> +    spin_lock_irqsave(&ir_ctrl->iremap_lock, flags);
> +
> +    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
> +
> +    memcpy(&new_ire, p, sizeof(new_ire));

Please use structure assignment (being type safe) in preference to
memcpy() (not being type safe).

> --- a/xen/drivers/passthrough/vtd/iommu.h
> +++ b/xen/drivers/passthrough/vtd/iommu.h
> @@ -327,6 +327,10 @@ struct iremap_entry {
>    };
>  };
>  
> +#define PDA_LOW_BIT    26
> +#define PDA_HIGH_BIT   32
> +#define PDA_MASK(XX)   (~(-1UL << PDA_##XX##_BIT))

To me it would look more natural if you used ~0UL.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 08/15] Update IRTE according to guest interrupt config changes
  2015-05-08  9:07 ` [RFC v2 08/15] Update IRTE according to guest interrupt config changes Feng Wu
@ 2015-06-09 15:06   ` Jan Beulich
  2015-06-16  8:08     ` Wu, Feng
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-06-09 15:06 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, yang.z.zhang

>>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> +static bool_t pi_find_dest_vcpu(struct domain *d, uint8_t dest_id,
> +                                uint8_t dest_mode, uint8_t delivery_mode,
> +                                uint8_t gvec, struct vcpu **dest_vcpu)
> +{
> +    struct vcpu *v, **dest_vcpu_array;
> +    unsigned int dest_vcpu_num = 0;
> +    int ret;

This, being given as operand to "return", should match in type with
the function's return type.

> +    dest_vcpu_array = xzalloc_array(struct vcpu *, d->max_vcpus);

You realize that this can be quite big an allocation? (You could at
least halve it by storing vCPU IDs instead of pointers, but if I'm
not mistaken this could even be a simple bitmap.)

> +    if ( !dest_vcpu_array )
> +    {
> +        dprintk(XENLOG_G_INFO,
> +                "dom%d: failed to allocate memeory.\n", d->domain_id);

Please fix the typo and remove the stop.

> +        return 0;
> +    }
> +
> +    for_each_vcpu ( d, v )
> +    {
> +        if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, 0,
> +                                dest_id, dest_mode) )
> +            continue;
> +
> +        dest_vcpu_array[dest_vcpu_num++] = v;
> +    }
> +
> +    if ( delivery_mode == dest_LowestPrio )
> +    {
> +        if (  dest_vcpu_num != 0 )
> +        {
> +            *dest_vcpu = dest_vcpu_array[gvec % dest_vcpu_num];
> +            ret = 1;
> +        }
> +        else
> +            ret = 0;
> +    }
> +    else if (  dest_vcpu_num == 1 )
> +    {
> +        *dest_vcpu = dest_vcpu_array[0];
> +        ret = 1;
> +    }
> +    else
> +        ret = 0;
> +
> +    xfree(dest_vcpu_array);
> +    return ret;
> +}

Blank line before final return statement please.

> @@ -330,11 +398,40 @@ int pt_irq_create_bind(
>          /* Calculate dest_vcpu_id for MSI-type pirq migration. */
>          dest = pirq_dpci->gmsi.gflags & VMSI_DEST_ID_MASK;
>          dest_mode = !!(pirq_dpci->gmsi.gflags & VMSI_DM_MASK);
> +        delivery_mode = (pirq_dpci->gmsi.gflags >> GFLAGS_SHIFT_DELIV_MODE) &
> +                        VMSI_DELIV_MASK;
>          dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
>          pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
>          spin_unlock(&d->event_lock);
>          if ( dest_vcpu_id >= 0 )
>              hvm_migrate_pirqs(d->vcpu[dest_vcpu_id]);
> +
> +        /* Use interrupt posting if it is supported */
> +        if ( iommu_intpost )
> +        {
> +            struct vcpu *vcpu = NULL;
> +
> +            if ( !pi_find_dest_vcpu(d, dest, dest_mode, delivery_mode,
> +                                    pirq_dpci->gmsi.gvec, &vcpu) )

Why not have the function return the vCPU instead of passing it
via indirection?

> +            {
> +                dprintk(XENLOG_G_WARNING,
> +                        "%pv: failed to find the dest vCPU for PI, guest "
> +                        "vector:%u use software way to deliver the "
> +                        " interrupts.\n", vcpu, pirq_dpci->gmsi.gvec);

You shouldn't be printing the (NULL) vCPU here. And please print
vectors as hex values.

> +                break;
> +            }
> +
> +            if ( pi_update_irte( vcpu, info, pirq_dpci->gmsi.gvec ) != 0 )
> +            {
> +                dprintk(XENLOG_G_WARNING,
> +                        "%pv: failed to update PI IRTE, guest vector:%u "
> +                        "use software way to deliver the interrupts.\n",
> +                        vcpu, pirq_dpci->gmsi.gvec);
> +
> +                break;
> +            }
> +        }
> +
>          break;

By using if() / else if() you could drop _both_ break-s you add.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU
  2015-05-08  9:07 ` [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU Feng Wu
@ 2015-06-09 15:08   ` Jan Beulich
  2015-06-12  9:40     ` Wu, Feng
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-06-09 15:08 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, yang.z.zhang

>>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> This patch adds a new per-vCPU tasklet to wakeup the blocked
> vCPU. It can be used in the case vcpu_unblock cannot be called
> directly. This tasklet will be used in later patch in this
> series.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
>  xen/common/domain.c     | 11 +++++++++++
>  xen/include/xen/sched.h |  3 +++
>  2 files changed, 14 insertions(+)

What is the rationale for doing this in common code, modifying
common structures?

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 10/15] vmx: Define two per-cpu variables
  2015-05-08  9:07 ` [RFC v2 10/15] vmx: Define two per-cpu variables Feng Wu
@ 2015-06-09 15:12   ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-09 15:12 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, yang.z.zhang

>>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> This patch defines two per-cpu variables:
> 
> blocked_vcpu:
> A list storing the vCPUs which were blocked on this pCPU.
> 
> blocked_vcpu_lock:
> The spinlock to protect blocked_vcpu.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>

It doesn't look like this is worth a separate patch (some of the earlier
ones had been on the edge too) - reviewing such changes without
seeing what the variables/functions get used for is more cumbersome
than necessary. Plus committing such patch series multiple pieces
might add (at least temporarily) dead code to the tree.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 11/15] vmx: Add a global wake-up vector for VT-d Posted-Interrupts
  2015-05-08  9:07 ` [RFC v2 11/15] vmx: Add a global wake-up vector for VT-d Posted-Interrupts Feng Wu
@ 2015-06-09 15:20   ` Jan Beulich
  2015-06-12  9:40     ` Wu, Feng
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-06-09 15:20 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, yang.z.zhang

>>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> @@ -3262,6 +3271,28 @@ void vmx_vmenter_helper(const struct cpu_user_regs *regs)
>  }
>  
>  /*
> + * Handle VT-d posted-interrupt when VCPU is blocked.
> + */
> +void pi_wakeup_interrupt(struct cpu_user_regs *regs)

static (and perhaps move it up so you don't need to forward declare it).

> +{
> +    struct vcpu *v;
> +    unsigned int cpu = smp_processor_id();
> +
> +    spin_lock(&per_cpu(blocked_vcpu_lock, cpu));
> +    list_for_each_entry(v, &per_cpu(blocked_vcpu, cpu),
> +                    blocked_vcpu_list) {

How long do you think such a list can grow? I'm afraid you might
be adding quite a bit of latency here to the system.

> +        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +
> +        if ( pi_desc->on == 1 )

Isn't this a single-bit (i.e. boolean) field? In which case - no reason
to compare with 1.

I also don't see the value of the local variable - it's being used only
once, and without it the line wouldn't get overly long or unreadable.

> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -148,6 +148,8 @@ struct vcpu
>  
>      struct vcpu     *next_in_list;
>  
> +    struct list_head blocked_vcpu_list;

Again - why here instead of in VT-d/VMX specific structures?

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 12/15] vmx: Properly handle notification event when vCPU is running
  2015-05-08  9:07 ` [RFC v2 12/15] vmx: Properly handle notification event when vCPU is running Feng Wu
@ 2015-06-09 16:12   ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-09 16:12 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, yang.z.zhang

>>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> +void pi_notification_interrupt(struct cpu_user_regs *regs)

static again

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-05-08  9:07 ` [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used Feng Wu
  2015-06-09 14:32   ` Jan Beulich
@ 2015-06-10  6:17   ` Jan Beulich
  2015-06-12  9:40     ` Wu, Feng
  1 sibling, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-06-10  6:17 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, yang.z.zhang

>>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> +static inline void setup_posted_irte(
> +    struct iremap_entry *new_ire, struct pi_desc *pi_desc, uint8_t gvec)
> +{
> +    new_ire->post.urg = 0;
> +    new_ire->post.vector = gvec;
> +    new_ire->post.pda_l = (((u64)virt_to_maddr(pi_desc)) >>
> +                           (32 - PDA_LOW_BIT)) & PDA_MASK(LOW);
> +    new_ire->post.pda_h = (((u64)virt_to_maddr(pi_desc)) >> 32) &
> +                           PDA_MASK(HIGH);

Looking at this another time I can't see what the and-ing with
PAD_MASK() is supposed to be good for. Nor can I see the need
for the u64 casts. The two literal 32-s aren't ideal either, but I
guess tolerable.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-05-08  9:07 ` [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling Feng Wu
@ 2015-06-10  6:43   ` Jan Beulich
  2015-06-16  0:17     ` Wu, Feng
                       ` (2 more replies)
  0 siblings, 3 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-10  6:43 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, yang.z.zhang

>>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -1711,6 +1711,131 @@ static void vmx_handle_eoi(u8 vector)
>      __vmwrite(GUEST_INTR_STATUS, status);
>  }
>  
> +static void vmx_pi_desc_update(struct vcpu *v, int old_state)
> +{
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    struct pi_desc old, new;
> +    unsigned long flags;
> +
> +    if ( !iommu_intpost )
> +        return;

Considering how the adjustment to start_vmx(), this at best should
be an ASSERT() (if a check is needed at all).

> +    switch ( v->runstate.state )
> +    {
> +    case RUNSTATE_runnable:
> +    case RUNSTATE_offline:
> +        /*
> +         * We don't need to send notification event to a non-running
> +         * vcpu, the interrupt information will be delivered to it before
> +         * VM-ENTRY when the vcpu is scheduled to run next time.
> +         */
> +        pi_desc->sn = 1;
> +
> +        /*
> +         * If the state is transferred from RUNSTATE_blocked,
> +         * we should set 'NV' feild back to posted_intr_vector,
> +         * so the Posted-Interrupts can be delivered to the vCPU
> +         * by VT-d HW after it is scheduled to run.
> +         */
> +        if ( old_state == RUNSTATE_blocked )
> +        {
> +            do
> +            {
> +                old.control = new.control = pi_desc->control;
> +                new.nv = posted_intr_vector;
> +            }
> +            while ( cmpxchg(&pi_desc->control, old.control, new.control)
> +                    != old.control );

So why is it okay to do the SN update non-atomically when the
vector update is done atomically?

Also the curly braces do _not_ go on separate lines for do/while
constructs.

And then do you really need to atomically update the whole 64 bits
here, rather than just the nv field? By not making it a bit field
you could perhaps just write_atomic() it?

> +
> +           /*
> +            * Delete the vCPU from the related block list
> +            * if we are resuming from blocked state
> +            */
> +           spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
> +                             v->pre_pcpu), flags);
> +           list_del(&v->blocked_vcpu_list);
> +           spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
> +                                  v->pre_pcpu), flags);
> +        }
> +        break;
> +
> +    case RUNSTATE_blocked:
> +        /*
> +         * The vCPU is blocked on the block list.
> +         * Add the blocked vCPU on the list of the
> +         * vcpu->pre_pcpu, which is the destination
> +         * of the wake-up notification event.
> +         */
> +        v->pre_pcpu = v->processor;

Is latching this upon runstate change really enough? I.e. what about
the v->processor changes that sched_move_domain() or individual
schedulers do? Or if it really just matters on which CPU's blocked list
the vCPU is (while its ->processor changing subsequently doesn't
matter) I'd like to see the field named more after its purpose (e.g.
pi_block_cpu; list and lock should btw also have a connection to PI
in their names).

In the end, if the placement on a list followed v->processor, you
would likely get away without the extra new field. Are there
synchronization constraints speaking against such a model?

> +        spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
> +                          v->pre_pcpu), flags);
> +        list_add_tail(&v->blocked_vcpu_list,
> +                      &per_cpu(blocked_vcpu, v->pre_pcpu));
> +        spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
> +                               v->pre_pcpu), flags);
> +
> +        do
> +        {
> +            old.control = new.control = pi_desc->control;
> +
> +            /*
> +             * We should not block the vCPU if
> +             * an interrupt was posted for it.
> +             */
> +
> +            if ( old.on == 1 )

I think I said so elsewhere, but just in case I didn't: Please don't
compare boolean fields with 1 - e.g. in the case here "if ( old.on )"
suffices.

> +            {
> +                /*
> +                 * The vCPU will be removed from the block list
> +                 * during its state transferring from RUNSTATE_blocked
> +                 * to RUNSTATE_runnable after the following tasklet
> +                 * is scheduled to run.
> +                 */

Precise comments please - I don't think the scheduling of the
tasklet makes any difference; I suppose it's the execution of it
that does.

> +                tasklet_schedule(&v->vcpu_wakeup_tasklet);
> +                return;
> +            }
> +
> +            /*
> +             * Change the 'NDST' field to v->pre_pcpu, so when
> +             * external interrupts from assigned deivces happen,
> +             * wakeup notifiction event will go to v->pre_pcpu,
> +             * then in pi_wakeup_interrupt() we can find the
> +             * vCPU in the right list to wake up.
> +             */
> +            if ( x2apic_enabled )
> +                new.ndst = cpu_physical_id(v->pre_pcpu);
> +            else
> +                new.ndst = MASK_INSR(cpu_physical_id(v->pre_pcpu),
> +                                     PI_xAPIC_NDST_MASK);
> +            new.sn = 0;
> +            new.nv = pi_wakeup_vector;
> +        }
> +        while ( cmpxchg(&pi_desc->control, old.control, new.control)
> +                != old.control );
> +        break;
> +
> +    case RUNSTATE_running:
> +        ASSERT( pi_desc->sn == 1 );
> +
> +        do
> +        {
> +            old.control = new.control = pi_desc->control;
> +            if ( x2apic_enabled )
> +                new.ndst = cpu_physical_id(v->processor);
> +            else
> +                new.ndst = (cpu_physical_id(v->processor) << 8) & 0xFF00;

Why are you not using MASK_INSR() here like you do above?

> +
> +            new.sn = 0;
> +        }
> +        while ( cmpxchg(&pi_desc->control, old.control, new.control)
> +                != old.control );

Same here - can't this be a write_atomic() of ndst and a clear_bit()
of sn instead of a loop over cmpxchg()?

> +        break;
> +
> +    default:
> +        break;

This seems dangerous: I think you want at least an
ASSERT_UNREACHABLE() here.

> @@ -1842,7 +1967,12 @@ const struct hvm_function_table * __init start_vmx(void)
>          alloc_direct_apic_vector(&posted_intr_vector, pi_notification_interrupt);
>  
>          if ( iommu_intpost )
> +        {
>              alloc_direct_apic_vector(&pi_wakeup_vector, pi_wakeup_interrupt);
> +            vmx_function_table.pi_desc_update = vmx_pi_desc_update;
> +        }
> +        else
> +            vmx_function_table.pi_desc_update = NULL;

Isn't that field NULL anyway?

> @@ -157,7 +158,11 @@ static inline void vcpu_runstate_change(
>          v->runstate.state_entry_time = new_entry_time;
>      }
>  
> +    old_state = v->runstate.state;
>      v->runstate.state = new_state;
> +
> +    if ( is_hvm_vcpu(v) && hvm_funcs.pi_desc_update )
> +        hvm_funcs.pi_desc_update(v, old_state);

I don't see how this would build on ARM.

> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -142,6 +142,8 @@ struct vcpu
>  
>      int              processor;
>  
> +    int              pre_pcpu;

This again doesn't belong into the common structure (and should
be accompanied by a comment, and should be unsigned int).

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 14/15] Suppress posting interrupts when 'SN' is set
  2015-05-08  9:07 ` [RFC v2 14/15] Suppress posting interrupts when 'SN' is set Feng Wu
@ 2015-06-10  6:48   ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-10  6:48 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, yang.z.zhang

>>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -1664,9 +1664,20 @@ static void __vmx_deliver_posted_interrupt(struct vcpu *v)
>  
>  static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
>  {
> +    int r, sn;
> +
>      if ( pi_test_and_set_pir(vector, &v->arch.hvm_vmx.pi_desc) )
>          return;
>  
> +    /*
> +     * Currently, we don't support urgent interrupt, all interrupts
> +     * are recognized as non-urgent interrupt, so we cannot send
> +     * posted-interrupt when 'SN' is set.
> +     */
> +
> +    sn = v->arch.hvm_vmx.pi_desc.sn;
> +    r = pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc);

I'm probably misunderstanding something here, but to me this looks
like a change that would need to be done quite a bit earlier in the
series (i.e. at this point it looks like it's fixing a bug/oversight of an
earlier patch).

Apart from that I'm also not understanding the synchronization
aspect here: What if SN gets set after having been latched above,
but before the latched value gets looked at below?

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 15/15] Add a command line parameter for VT-d posted-interrupts
  2015-05-08  9:07 ` [RFC v2 15/15] Add a command line parameter for VT-d posted-interrupts Feng Wu
@ 2015-06-10  6:52   ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-10  6:52 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, yang.z.zhang

>>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> @@ -101,6 +101,12 @@ static void __init parse_iommu_param(char *s)
>              if ( iommu_intremap == 0 )
>                  iommu_intpost = 0;
>          }
> +        else if ( !strcmp(s, "intpost") )
> +        {
> +            iommu_intpost = val;
> +            if ( iommu_intremap == 0 )
> +                iommu_intpost = 0;
> +        }

This again highlights that you shouldn't be (repeatedly) doing the
"if no intremap then no intpost" here and elsewhere. Just do this
in one or two central places (two in case intremap getting cleared
requires intpost to be cleared in turn, while also requiring the
connection to be represented correctly before starting IOMMU
setup).

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-06-09 14:32   ` Jan Beulich
@ 2015-06-12  9:40     ` Wu, Feng
  2015-06-12 10:33       ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Wu, Feng @ 2015-06-12  9:40 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, June 09, 2015 10:33 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: Re: [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
> 
> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> > +bool_t pi_update_irte(struct vcpu *v, struct pirq *pirq, uint8_t gvec)
> 
> Without seeing the caller right away it's hard to judge, but generally I'd
> prefer functions to return -E... values as error indicators, i.e.
> 
> > +{
> > +    struct irq_desc *desc;
> > +    struct msi_desc *msi_desc;
> > +    int remap_index;
> > +    bool_t rc = 0;
> > +    struct pci_dev *pci_dev;
> > +    struct acpi_drhd_unit *drhd;
> > +    struct iommu *iommu;
> > +    struct ir_ctrl *ir_ctrl;
> > +    struct iremap_entry *iremap_entries = NULL, *p = NULL;
> > +    struct iremap_entry new_ire;
> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +    unsigned long flags;
> > +
> > +    desc = pirq_spin_lock_irq_desc(pirq, NULL);
> > +    if ( !desc )
> > +        return 0;
> 
> -ENOMEM
> 
> > +    msi_desc = desc->msi_desc;
> > +    if ( !msi_desc )
> > +        goto unlock_out;
> 
> -EBADSLT
> 
> > +    pci_dev = msi_desc->dev;
> > +    if ( !pci_dev )
> > +        goto unlock_out;
> 
> -ENODEV
> 
> > +    remap_index = msi_desc->remap_index;
> > +    drhd = acpi_find_matched_drhd_unit(pci_dev);
> > +    if ( !drhd )
> > +    {
> > +        dprintk(XENLOG_INFO VTDPREFIX,
> > +                "%pv: failed to get drhd, pci device: "
> > +                "%04x:%02x:%02x.%u, guest vector: %u\n",
> > +                v, pci_dev->seg, pci_dev->bus,
> PCI_SLOT(pci_dev->devfn),
> > +                PCI_FUNC(pci_dev->devfn), gvec);
> > +        goto unlock_out;
> > +    }
> > +
> > +    iommu = drhd->iommu;
> > +    ir_ctrl = iommu_ir_ctrl(iommu);
> > +    if ( !ir_ctrl )
> > +    {
> > +        dprintk(XENLOG_INFO VTDPREFIX,
> > +                "%pv: failed to get ir_ctrl, pci device: "
> > +                "%04x:%02x:%02x.%u, guest vector: %u\n",
> > +                v, pci_dev->seg, pci_dev->bus,
> PCI_SLOT(pci_dev->devfn),
> > +                PCI_FUNC(pci_dev->devfn), gvec);
> > +        goto unlock_out;
> > +    }
> 
> Do you think these log messages are useful beyond your bringup
> purposes?

So what kind of message do you think I need to show here?

Thanks,
Feng

> 
> > +    spin_lock_irqsave(&ir_ctrl->iremap_lock, flags);
> > +
> > +    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index,
> iremap_entries, p);
> > +
> > +    memcpy(&new_ire, p, sizeof(new_ire));
> 
> Please use structure assignment (being type safe) in preference to
> memcpy() (not being type safe).
> 
> > --- a/xen/drivers/passthrough/vtd/iommu.h
> > +++ b/xen/drivers/passthrough/vtd/iommu.h
> > @@ -327,6 +327,10 @@ struct iremap_entry {
> >    };
> >  };
> >
> > +#define PDA_LOW_BIT    26
> > +#define PDA_HIGH_BIT   32
> > +#define PDA_MASK(XX)   (~(-1UL << PDA_##XX##_BIT))
> 
> To me it would look more natural if you used ~0UL.
> 
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU
  2015-06-09 15:08   ` Jan Beulich
@ 2015-06-12  9:40     ` Wu, Feng
  2015-06-12 10:41       ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Wu, Feng @ 2015-06-12  9:40 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, June 09, 2015 11:09 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: Re: [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked
> vCPU
> 
> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> > This patch adds a new per-vCPU tasklet to wakeup the blocked
> > vCPU. It can be used in the case vcpu_unblock cannot be called
> > directly. This tasklet will be used in later patch in this
> > series.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> >  xen/common/domain.c     | 11 +++++++++++
> >  xen/include/xen/sched.h |  3 +++
> >  2 files changed, 14 insertions(+)
> 
> What is the rationale for doing this in common code, modifying
> common structures?
> 

Do you mean moving this tasklet to the architecture specific structure, such as, struct arch_vcpu ?

Thanks,
Feng

> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-06-10  6:17   ` Jan Beulich
@ 2015-06-12  9:40     ` Wu, Feng
  2015-06-12 10:34       ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Wu, Feng @ 2015-06-12  9:40 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, June 10, 2015 2:18 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: Re: [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
> 
> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> > +static inline void setup_posted_irte(
> > +    struct iremap_entry *new_ire, struct pi_desc *pi_desc, uint8_t gvec)
> > +{
> > +    new_ire->post.urg = 0;
> > +    new_ire->post.vector = gvec;
> > +    new_ire->post.pda_l = (((u64)virt_to_maddr(pi_desc)) >>
> > +                           (32 - PDA_LOW_BIT)) & PDA_MASK(LOW);
> > +    new_ire->post.pda_h = (((u64)virt_to_maddr(pi_desc)) >> 32) &
> > +                           PDA_MASK(HIGH);
> 
> Looking at this another time I can't see what the and-ing with
> PAD_MASK() is supposed to be good for. 

I cannot understand this well. Do you mean we don't need and PDA_MASK() here?

Thanks,
Feng

Nor can I see the need
> for the u64 casts. The two literal 32-s aren't ideal either, but I
> guess tolerable.
> 
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 11/15] vmx: Add a global wake-up vector for VT-d Posted-Interrupts
  2015-06-09 15:20   ` Jan Beulich
@ 2015-06-12  9:40     ` Wu, Feng
  2015-06-12 10:44       ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Wu, Feng @ 2015-06-12  9:40 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org
> [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Jan Beulich
> Sent: Tuesday, June 09, 2015 11:20 PM
> To: Wu, Feng
> Cc: Tian, Kevin; keir@xen.org; george.dunlap@eu.citrix.com;
> andrew.cooper3@citrix.com; xen-devel@lists.xen.org; Zhang, Yang Z
> Subject: Re: [Xen-devel] [RFC v2 11/15] vmx: Add a global wake-up vector for
> VT-d Posted-Interrupts
> 
> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> > @@ -3262,6 +3271,28 @@ void vmx_vmenter_helper(const struct
> cpu_user_regs *regs)
> >  }
> >
> >  /*
> > + * Handle VT-d posted-interrupt when VCPU is blocked.
> > + */
> > +void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> 
> static (and perhaps move it up so you don't need to forward declare it).
> 
> > +{
> > +    struct vcpu *v;
> > +    unsigned int cpu = smp_processor_id();
> > +
> > +    spin_lock(&per_cpu(blocked_vcpu_lock, cpu));
> > +    list_for_each_entry(v, &per_cpu(blocked_vcpu, cpu),
> > +                    blocked_vcpu_list) {
> 
> How long do you think such a list can grow? 

It depends on how many vCPU is current blocked on this specific pCPU.
I am not sure how long it can grow. 

The basic idea here is that, when notification wakeup event happens,
we need to find a way to find the right vCPU, and then wakeup it, do
you have any better idea for this?

Thanks,
Feng

> I'm afraid you might
> be adding quite a bit of latency here to the system.
> 
> > +        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +
> > +        if ( pi_desc->on == 1 )
> 
> Isn't this a single-bit (i.e. boolean) field? In which case - no reason
> to compare with 1.
> 
> I also don't see the value of the local variable - it's being used only
> once, and without it the line wouldn't get overly long or unreadable.
> 
> > --- a/xen/include/xen/sched.h
> > +++ b/xen/include/xen/sched.h
> > @@ -148,6 +148,8 @@ struct vcpu
> >
> >      struct vcpu     *next_in_list;
> >
> > +    struct list_head blocked_vcpu_list;
> 
> Again - why here instead of in VT-d/VMX specific structures?

Do you think it is okay to put it in " struct arch_vmx_struct" ?

Thanks,
Feng

> 
> Jan
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-06-12  9:40     ` Wu, Feng
@ 2015-06-12 10:33       ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-12 10:33 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir, george.dunlap, andrew.cooper3, xen-devel, Yang Z Zhang

>>> On 12.06.15 at 11:40, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Tuesday, June 09, 2015 10:33 PM
>> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
>> > +    remap_index = msi_desc->remap_index;
>> > +    drhd = acpi_find_matched_drhd_unit(pci_dev);
>> > +    if ( !drhd )
>> > +    {
>> > +        dprintk(XENLOG_INFO VTDPREFIX,
>> > +                "%pv: failed to get drhd, pci device: "
>> > +                "%04x:%02x:%02x.%u, guest vector: %u\n",
>> > +                v, pci_dev->seg, pci_dev->bus, PCI_SLOT(pci_dev->devfn),
>> > +                PCI_FUNC(pci_dev->devfn), gvec);
>> > +        goto unlock_out;
>> > +    }
>> > +
>> > +    iommu = drhd->iommu;
>> > +    ir_ctrl = iommu_ir_ctrl(iommu);
>> > +    if ( !ir_ctrl )
>> > +    {
>> > +        dprintk(XENLOG_INFO VTDPREFIX,
>> > +                "%pv: failed to get ir_ctrl, pci device: "
>> > +                "%04x:%02x:%02x.%u, guest vector: %u\n",
>> > +                v, pci_dev->seg, pci_dev->bus, PCI_SLOT(pci_dev->devfn),
>> > +                PCI_FUNC(pci_dev->devfn), gvec);
>> > +        goto unlock_out;
>> > +    }
>> 
>> Do you think these log messages are useful beyond your bringup
>> purposes?
> 
> So what kind of message do you think I need to show here?

None at all?

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-06-12  9:40     ` Wu, Feng
@ 2015-06-12 10:34       ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-12 10:34 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir, george.dunlap, andrew.cooper3, xen-devel, Yang Z Zhang

>>> On 12.06.15 at 11:40, <feng.wu@intel.com> wrote:
>> -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
>> > +static inline void setup_posted_irte(
>> > +    struct iremap_entry *new_ire, struct pi_desc *pi_desc, uint8_t gvec)
>> > +{
>> > +    new_ire->post.urg = 0;
>> > +    new_ire->post.vector = gvec;
>> > +    new_ire->post.pda_l = (((u64)virt_to_maddr(pi_desc)) >>
>> > +                           (32 - PDA_LOW_BIT)) & PDA_MASK(LOW);
>> > +    new_ire->post.pda_h = (((u64)virt_to_maddr(pi_desc)) >> 32) &
>> > +                           PDA_MASK(HIGH);
>> 
>> Looking at this another time I can't see what the and-ing with
>> PAD_MASK() is supposed to be good for. 
> 
> I cannot understand this well. Do you mean we don't need and PDA_MASK() 
> here?

Correct - the bitfield width (where the data gets stored into) already
enforces the intended truncation afaict.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU
  2015-06-12  9:40     ` Wu, Feng
@ 2015-06-12 10:41       ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-12 10:41 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir, george.dunlap, andrew.cooper3, xen-devel, Yang Z Zhang

>>> On 12.06.15 at 11:40, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Tuesday, June 09, 2015 11:09 PM
>> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
>> > This patch adds a new per-vCPU tasklet to wakeup the blocked
>> > vCPU. It can be used in the case vcpu_unblock cannot be called
>> > directly. This tasklet will be used in later patch in this
>> > series.
>> >
>> > Signed-off-by: Feng Wu <feng.wu@intel.com>
>> > ---
>> >  xen/common/domain.c     | 11 +++++++++++
>> >  xen/include/xen/sched.h |  3 +++
>> >  2 files changed, 14 insertions(+)
>> 
>> What is the rationale for doing this in common code, modifying
>> common structures?
>> 
> 
> Do you mean moving this tasklet to the architecture specific structure, such 
> as, struct arch_vcpu ?

Exactly.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 11/15] vmx: Add a global wake-up vector for VT-d Posted-Interrupts
  2015-06-12  9:40     ` Wu, Feng
@ 2015-06-12 10:44       ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-12 10:44 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir, george.dunlap, andrew.cooper3, xen-devel, Yang Z Zhang

>>> On 12.06.15 at 11:40, <feng.wu@intel.com> wrote:
>> From: xen-devel-bounces@lists.xen.org 
>> [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Jan Beulich
>> Sent: Tuesday, June 09, 2015 11:20 PM
>> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
>> > +{
>> > +    struct vcpu *v;
>> > +    unsigned int cpu = smp_processor_id();
>> > +
>> > +    spin_lock(&per_cpu(blocked_vcpu_lock, cpu));
>> > +    list_for_each_entry(v, &per_cpu(blocked_vcpu, cpu),
>> > +                    blocked_vcpu_list) {
>> 
>> How long do you think such a list can grow? 
> 
> It depends on how many vCPU is current blocked on this specific pCPU.
> I am not sure how long it can grow. 
> 
> The basic idea here is that, when notification wakeup event happens,
> we need to find a way to find the right vCPU, and then wakeup it, do
> you have any better idea for this?

Not immediately, or else I would have spelled it out. But the list
perhaps growing to hundreds entries is something that needs to
be taken into consideration.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-10  6:43   ` Jan Beulich
@ 2015-06-16  0:17     ` Wu, Feng
  2015-06-16  7:53       ` Jan Beulich
  2015-06-17  6:54     ` Wu, Feng
  2015-06-17  7:00     ` Wu, Feng
  2 siblings, 1 reply; 64+ messages in thread
From: Wu, Feng @ 2015-06-16  0:17 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, June 10, 2015 2:44 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU
> scheduling
> 
> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -1711,6 +1711,131 @@ static void vmx_handle_eoi(u8 vector)
> >      __vmwrite(GUEST_INTR_STATUS, status);
> >  }
> >
> > +static void vmx_pi_desc_update(struct vcpu *v, int old_state)
> > +{
> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +    struct pi_desc old, new;
> > +    unsigned long flags;
> > +
> > +    if ( !iommu_intpost )
> > +        return;
> 
> Considering how the adjustment to start_vmx(), this at best should
> be an ASSERT() (if a check is needed at all).
> 
> > +    switch ( v->runstate.state )
> > +    {
> > +    case RUNSTATE_runnable:
> > +    case RUNSTATE_offline:
> > +        /*
> > +         * We don't need to send notification event to a non-running
> > +         * vcpu, the interrupt information will be delivered to it before
> > +         * VM-ENTRY when the vcpu is scheduled to run next time.
> > +         */
> > +        pi_desc->sn = 1;
> > +
> > +        /*
> > +         * If the state is transferred from RUNSTATE_blocked,
> > +         * we should set 'NV' feild back to posted_intr_vector,
> > +         * so the Posted-Interrupts can be delivered to the vCPU
> > +         * by VT-d HW after it is scheduled to run.
> > +         */
> > +        if ( old_state == RUNSTATE_blocked )
> > +        {
> > +            do
> > +            {
> > +                old.control = new.control = pi_desc->control;
> > +                new.nv = posted_intr_vector;
> > +            }
> > +            while ( cmpxchg(&pi_desc->control, old.control, new.control)
> > +                    != old.control );
> 
> So why is it okay to do the SN update non-atomically when the
> vector update is done atomically?

After thinking this a bit more. It should be atomic to update the 'SN' filed, since
the hardware will perform a coherent atomic read-modify-write operation of
the posted-interrupt descriptor when posting interrupts.

> 
> Also the curly braces do _not_ go on separate lines for do/while
> constructs.
> 
> And then do you really need to atomically update the whole 64 bits
> here, rather than just the nv field? By not making it a bit field
> you could perhaps just write_atomic() it?

I think only atomically updating the nv field is fine.

> 
> > +
> > +           /*
> > +            * Delete the vCPU from the related block list
> > +            * if we are resuming from blocked state
> > +            */
> > +           spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
> > +                             v->pre_pcpu), flags);
> > +           list_del(&v->blocked_vcpu_list);
> > +           spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
> > +                                  v->pre_pcpu), flags);
> > +        }
> > +        break;
> > +
> > +    case RUNSTATE_blocked:
> > +        /*
> > +         * The vCPU is blocked on the block list.
> > +         * Add the blocked vCPU on the list of the
> > +         * vcpu->pre_pcpu, which is the destination
> > +         * of the wake-up notification event.
> > +         */
> > +        v->pre_pcpu = v->processor;
> 
> Is latching this upon runstate change really enough? I.e. what about
> the v->processor changes that sched_move_domain() or individual
> schedulers do? Or if it really just matters on which CPU's blocked list
> the vCPU is (while its ->processor changing subsequently doesn't
> matter) I'd like to see the field named more after its purpose (e.g.
> pi_block_cpu; list and lock should btw also have a connection to PI
> in their names).

Yes, It doesn't matter if vCPU changes. The key point is that we put
the vCPU on a pCPU list and we change the NDST field to this pCPU,
then the wakeup notification event will get there. You are right, I
need to rename them to reflect the real purpose of it.

> 
> In the end, if the placement on a list followed v->processor, you
> would likely get away without the extra new field. Are there
> synchronization constraints speaking against such a model?

I don't understand this quit well. Do you mean using 'v->processor'
as the pCPU list for the blocked vCPUs? Then what about 'v->processor'
changes, seems we cannot handle this case.

> 
> > +        spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
> > +                          v->pre_pcpu), flags);
> > +        list_add_tail(&v->blocked_vcpu_list,
> > +                      &per_cpu(blocked_vcpu, v->pre_pcpu));
> > +        spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
> > +                               v->pre_pcpu), flags);
> > +
> > +        do
> > +        {
> > +            old.control = new.control = pi_desc->control;
> > +
> > +            /*
> > +             * We should not block the vCPU if
> > +             * an interrupt was posted for it.
> > +             */
> > +
> > +            if ( old.on == 1 )
> 
> I think I said so elsewhere, but just in case I didn't: Please don't
> compare boolean fields with 1 - e.g. in the case here "if ( old.on )"
> suffices.
> 
> > +            {
> > +                /*
> > +                 * The vCPU will be removed from the block list
> > +                 * during its state transferring from RUNSTATE_blocked
> > +                 * to RUNSTATE_runnable after the following tasklet
> > +                 * is scheduled to run.
> > +                 */
> 
> Precise comments please - I don't think the scheduling of the
> tasklet makes any difference; I suppose it's the execution of it
> that does.
> 
> > +                tasklet_schedule(&v->vcpu_wakeup_tasklet);
> > +                return;
> > +            }
> > +
> > +            /*
> > +             * Change the 'NDST' field to v->pre_pcpu, so when
> > +             * external interrupts from assigned deivces happen,
> > +             * wakeup notifiction event will go to v->pre_pcpu,
> > +             * then in pi_wakeup_interrupt() we can find the
> > +             * vCPU in the right list to wake up.
> > +             */
> > +            if ( x2apic_enabled )
> > +                new.ndst = cpu_physical_id(v->pre_pcpu);
> > +            else
> > +                new.ndst = MASK_INSR(cpu_physical_id(v->pre_pcpu),
> > +                                     PI_xAPIC_NDST_MASK);
> > +            new.sn = 0;
> > +            new.nv = pi_wakeup_vector;
> > +        }
> > +        while ( cmpxchg(&pi_desc->control, old.control, new.control)
> > +                != old.control );
> > +        break;
> > +
> > +    case RUNSTATE_running:
> > +        ASSERT( pi_desc->sn == 1 );
> > +
> > +        do
> > +        {
> > +            old.control = new.control = pi_desc->control;
> > +            if ( x2apic_enabled )
> > +                new.ndst = cpu_physical_id(v->processor);
> > +            else
> > +                new.ndst = (cpu_physical_id(v->processor) << 8) &
> 0xFF00;
> 
> Why are you not using MASK_INSR() here like you do above?
> 
> > +
> > +            new.sn = 0;
> > +        }
> > +        while ( cmpxchg(&pi_desc->control, old.control, new.control)
> > +                != old.control );
> 
> Same here - can't this be a write_atomic() of ndst and a clear_bit()
> of sn instead of a loop over cmpxchg()?

Your suggestion is good!

> 
> > +        break;
> > +
> > +    default:
> > +        break;
> 
> This seems dangerous: I think you want at least an
> ASSERT_UNREACHABLE() here.
> 
> > @@ -1842,7 +1967,12 @@ const struct hvm_function_table * __init
> start_vmx(void)
> >          alloc_direct_apic_vector(&posted_intr_vector,
> pi_notification_interrupt);
> >
> >          if ( iommu_intpost )
> > +        {
> >              alloc_direct_apic_vector(&pi_wakeup_vector,
> pi_wakeup_interrupt);
> > +            vmx_function_table.pi_desc_update = vmx_pi_desc_update;
> > +        }
> > +        else
> > +            vmx_function_table.pi_desc_update = NULL;
> 
> Isn't that field NULL anyway?
> 
> > @@ -157,7 +158,11 @@ static inline void vcpu_runstate_change(
> >          v->runstate.state_entry_time = new_entry_time;
> >      }
> >
> > +    old_state = v->runstate.state;
> >      v->runstate.state = new_state;
> > +
> > +    if ( is_hvm_vcpu(v) && hvm_funcs.pi_desc_update )
> > +        hvm_funcs.pi_desc_update(v, old_state);
> 
> I don't see how this would build on ARM.

So what about adding " #ifdef CONFIG_X86 ..." here?

Thanks,
Feng

> 
> > --- a/xen/include/xen/sched.h
> > +++ b/xen/include/xen/sched.h
> > @@ -142,6 +142,8 @@ struct vcpu
> >
> >      int              processor;
> >
> > +    int              pre_pcpu;
> 
> This again doesn't belong into the common structure (and should
> be accompanied by a comment, and should be unsigned int).
> 
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 06/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
  2015-06-09 14:25   ` Jan Beulich
@ 2015-06-16  6:18     ` Wu, Feng
  2015-06-16  7:48       ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Wu, Feng @ 2015-06-16  6:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, June 09, 2015 10:25 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: Re: [RFC v2 06/15] vt-d: Extend struct iremap_entry to support VT-d
> Posted-Interrupts
> 
> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> > --- a/xen/drivers/passthrough/vtd/iommu.h
> > +++ b/xen/drivers/passthrough/vtd/iommu.h
> > @@ -289,7 +289,7 @@ struct dma_pte {
> >  /* interrupt remap entry */
> >  struct iremap_entry {
> >    union {
> > -    u64 lo_val;
> > +    struct { u64 lo, hi; };
> >      struct {
> >          u64 p       : 1,
> >              fpd     : 1,
> > @@ -302,16 +302,28 @@ struct iremap_entry {
> >              vector  : 8,
> >              res_2   : 8,
> >              dst     : 32;
> > -    }lo;
> > -  };
> > -  union {
> > -    u64 hi_val;
> > +        u64 sid     : 16,
> > +            sq      : 2,
> > +            svt     : 2,
> > +            res_3   : 44;

For this, are we worth using u32 for this one?

	u32 sid: 16,
	    sq: 2,
	    svt: 2,
	    res_3: 12;
	u32 res_4;


> > +    }remap;
> >      struct {
> > +        u64 p       : 1,
> > +            fpd     : 1,
> > +            res_1   : 6,
> > +            avail   : 4,
> > +            res_2   : 2,
> > +            urg     : 1,
> > +            im      : 1,
> > +            vector  : 8,
> > +            res_3   : 14,
> > +            pda_l   : 26;

Also this one?

	...
		res_3: 8
	u32 res_4: 6,
	    pda_l: 26;


> >          u64 sid     : 16,
> >              sq      : 2,
> >              svt     : 2,
> > -            res_1   : 44;

This one?

Thanks,
Feng

> > -    }hi;
> > +            res_4   : 12,
> > +            pda_h   : 32;
> > +    }post;
> >    };
> >  };
> 
> Same here - unless the VT-d maintainers disagree, I think using u32
> wherever possible would be preferable over using u64, as well as
> avoiding bitfields for members filling an entire word.
> 
> Also please fix the formatting (blanks between closing braces and field
> names).
> 
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 06/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
  2015-06-16  6:18     ` Wu, Feng
@ 2015-06-16  7:48       ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-16  7:48 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir, george.dunlap, andrew.cooper3, xen-devel, Yang Z Zhang

>>> On 16.06.15 at 08:18, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Tuesday, June 09, 2015 10:25 PM
>> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
>> > @@ -302,16 +302,28 @@ struct iremap_entry {
>> >              vector  : 8,
>> >              res_2   : 8,
>> >              dst     : 32;
>> > -    }lo;
>> > -  };
>> > -  union {
>> > -    u64 hi_val;
>> > +        u64 sid     : 16,
>> > +            sq      : 2,
>> > +            svt     : 2,
>> > +            res_3   : 44;
> 
> For this, are we worth using u32 for this one?
> 
> 	u32 sid: 16,
> 	    sq: 2,
> 	    svt: 2,
> 	    res_3: 12;
> 	u32 res_4;
> 
> 
>> > +    }remap;
>> >      struct {
>> > +        u64 p       : 1,
>> > +            fpd     : 1,
>> > +            res_1   : 6,
>> > +            avail   : 4,
>> > +            res_2   : 2,
>> > +            urg     : 1,
>> > +            im      : 1,
>> > +            vector  : 8,
>> > +            res_3   : 14,
>> > +            pda_l   : 26;
> 
> Also this one?
> 
> 	...
> 		res_3: 8
> 	u32 res_4: 6,
> 	    pda_l: 26;
> 
> 
>> >          u64 sid     : 16,
>> >              sq      : 2,
>> >              svt     : 2,
>> > -            res_1   : 44;
> 
> This one?

Yes in all three case, subject to ...

>> > -    }hi;
>> > +            res_4   : 12,
>> > +            pda_h   : 32;
>> > +    }post;
>> >    };
>> >  };
>> 
>> Same here - unless the VT-d maintainers disagree, I think using u32
>> wherever possible would be preferable over using u64, as well as
>> avoiding bitfields for members filling an entire word.

... the maintainers' not disagreeing.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-16  0:17     ` Wu, Feng
@ 2015-06-16  7:53       ` Jan Beulich
  2015-06-16  8:13         ` Wu, Feng
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-06-16  7:53 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir, george.dunlap, andrew.cooper3, xen-devel, Yang Z Zhang

>>> On 16.06.15 at 02:17, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, June 10, 2015 2:44 PM
>> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
>> > +
>> > +           /*
>> > +            * Delete the vCPU from the related block list
>> > +            * if we are resuming from blocked state
>> > +            */
>> > +           spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
>> > +                             v->pre_pcpu), flags);
>> > +           list_del(&v->blocked_vcpu_list);
>> > +           spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
>> > +                                  v->pre_pcpu), flags);
>> > +        }
>> > +        break;
>> > +
>> > +    case RUNSTATE_blocked:
>> > +        /*
>> > +         * The vCPU is blocked on the block list.
>> > +         * Add the blocked vCPU on the list of the
>> > +         * vcpu->pre_pcpu, which is the destination
>> > +         * of the wake-up notification event.
>> > +         */
>> > +        v->pre_pcpu = v->processor;
>> 
>> Is latching this upon runstate change really enough? I.e. what about
>> the v->processor changes that sched_move_domain() or individual
>> schedulers do? Or if it really just matters on which CPU's blocked list
>> the vCPU is (while its ->processor changing subsequently doesn't
>> matter) I'd like to see the field named more after its purpose (e.g.
>> pi_block_cpu; list and lock should btw also have a connection to PI
>> in their names).
> 
> Yes, It doesn't matter if vCPU changes. The key point is that we put
> the vCPU on a pCPU list and we change the NDST field to this pCPU,
> then the wakeup notification event will get there. You are right, I
> need to rename them to reflect the real purpose of it.
> 
>> 
>> In the end, if the placement on a list followed v->processor, you
>> would likely get away without the extra new field. Are there
>> synchronization constraints speaking against such a model?
> 
> I don't understand this quit well. Do you mean using 'v->processor'
> as the pCPU list for the blocked vCPUs? Then what about 'v->processor'
> changes, seems we cannot handle this case.

That was the question - does anything speak against such a model?

>> > @@ -157,7 +158,11 @@ static inline void vcpu_runstate_change(
>> >          v->runstate.state_entry_time = new_entry_time;
>> >      }
>> >
>> > +    old_state = v->runstate.state;
>> >      v->runstate.state = new_state;
>> > +
>> > +    if ( is_hvm_vcpu(v) && hvm_funcs.pi_desc_update )
>> > +        hvm_funcs.pi_desc_update(v, old_state);
>> 
>> I don't see how this would build on ARM.
> 
> So what about adding " #ifdef CONFIG_X86 ..." here?

That would yield ugly code. If this needs to be here, you'll have
to introduce a suitable arch_...() hook (doing nothing on ARM).

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 08/15] Update IRTE according to guest interrupt config changes
  2015-06-09 15:06   ` Jan Beulich
@ 2015-06-16  8:08     ` Wu, Feng
  2015-06-16  8:17       ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Wu, Feng @ 2015-06-16  8:08 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, June 09, 2015 11:06 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: Re: [RFC v2 08/15] Update IRTE according to guest interrupt config
> changes
> 
> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> > +static bool_t pi_find_dest_vcpu(struct domain *d, uint8_t dest_id,
> > +                                uint8_t dest_mode, uint8_t
> delivery_mode,
> > +                                uint8_t gvec, struct vcpu
> **dest_vcpu)
> > +{
> > +    struct vcpu *v, **dest_vcpu_array;
> > +    unsigned int dest_vcpu_num = 0;
> > +    int ret;
> 
> This, being given as operand to "return", should match in type with
> the function's return type.
> 
> > +    dest_vcpu_array = xzalloc_array(struct vcpu *, d->max_vcpus);
> 
> You realize that this can be quite big an allocation? (You could at
> least halve it by storing vCPU IDs instead of pointers, but if I'm
> not mistaken this could even be a simple bitmap.)

I think maybe storing the vCPU IDs is better than using bitmap, because
If using vCPU IDs, we can easily to find the destination vCPU id by dest_vcpu_id_array [gvec % dest_vcpu_num],
then we can get the vcpu form d->vcpu[]. However if we use bitmap here, we set the bits for those vCPUs which
is present in the destination field, then we need to use find_next_bit() to find the right vCPU ID, this may need
a while loop, see the following scenario:

- Guest has 8 vCPUs, in the interrupt destination fields, it only configures vcpu 0, 1, 3, 4, 5, 7 for the lowest-priority interrupt.
 So we get the vCPU destination bitmap like this: 10111011b
- Suppose guest vector is 44, so we choose 44%6=2, we need to use find_next_bit() to find the second set bit (counting from 0, which is vCPU ID 3), this may need a while loop.

What is your opinion about this? Space vs Speed?

Thanks,
Feng

> 
> > +    if ( !dest_vcpu_array )
> > +    {
> > +        dprintk(XENLOG_G_INFO,
> > +                "dom%d: failed to allocate memeory.\n", d->domain_id);
> 
> Please fix the typo and remove the stop.
> 
> > +        return 0;
> > +    }
> > +
> > +    for_each_vcpu ( d, v )
> > +    {
> > +        if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, 0,
> > +                                dest_id, dest_mode) )
> > +            continue;
> > +
> > +        dest_vcpu_array[dest_vcpu_num++] = v;
> > +    }
> > +
> > +    if ( delivery_mode == dest_LowestPrio )
> > +    {
> > +        if (  dest_vcpu_num != 0 )
> > +        {
> > +            *dest_vcpu = dest_vcpu_array[gvec % dest_vcpu_num];
> > +            ret = 1;
> > +        }
> > +        else
> > +            ret = 0;
> > +    }
> > +    else if (  dest_vcpu_num == 1 )
> > +    {
> > +        *dest_vcpu = dest_vcpu_array[0];
> > +        ret = 1;
> > +    }
> > +    else
> > +        ret = 0;
> > +
> > +    xfree(dest_vcpu_array);
> > +    return ret;
> > +}
> 
> Blank line before final return statement please.
> 
> > @@ -330,11 +398,40 @@ int pt_irq_create_bind(
> >          /* Calculate dest_vcpu_id for MSI-type pirq migration. */
> >          dest = pirq_dpci->gmsi.gflags & VMSI_DEST_ID_MASK;
> >          dest_mode = !!(pirq_dpci->gmsi.gflags & VMSI_DM_MASK);
> > +        delivery_mode = (pirq_dpci->gmsi.gflags >>
> GFLAGS_SHIFT_DELIV_MODE) &
> > +                        VMSI_DELIV_MASK;
> >          dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
> >          pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
> >          spin_unlock(&d->event_lock);
> >          if ( dest_vcpu_id >= 0 )
> >              hvm_migrate_pirqs(d->vcpu[dest_vcpu_id]);
> > +
> > +        /* Use interrupt posting if it is supported */
> > +        if ( iommu_intpost )
> > +        {
> > +            struct vcpu *vcpu = NULL;
> > +
> > +            if ( !pi_find_dest_vcpu(d, dest, dest_mode, delivery_mode,
> > +                                    pirq_dpci->gmsi.gvec, &vcpu) )
> 
> Why not have the function return the vCPU instead of passing it
> via indirection?
> 
> > +            {
> > +                dprintk(XENLOG_G_WARNING,
> > +                        "%pv: failed to find the dest vCPU for PI, guest
> "
> > +                        "vector:%u use software way to deliver the "
> > +                        " interrupts.\n", vcpu, pirq_dpci->gmsi.gvec);
> 
> You shouldn't be printing the (NULL) vCPU here. And please print
> vectors as hex values.
> 
> > +                break;
> > +            }
> > +
> > +            if ( pi_update_irte( vcpu, info, pirq_dpci->gmsi.gvec ) != 0 )
> > +            {
> > +                dprintk(XENLOG_G_WARNING,
> > +                        "%pv: failed to update PI IRTE, guest
> vector:%u "
> > +                        "use software way to deliver the
> interrupts.\n",
> > +                        vcpu, pirq_dpci->gmsi.gvec);
> > +
> > +                break;
> > +            }
> > +        }
> > +
> >          break;
> 
> By using if() / else if() you could drop _both_ break-s you add.
> 
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-16  7:53       ` Jan Beulich
@ 2015-06-16  8:13         ` Wu, Feng
  2015-06-16  8:29           ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Wu, Feng @ 2015-06-16  8:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, June 16, 2015 3:53 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU
> scheduling
> 
> >>> On 16.06.15 at 02:17, <feng.wu@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Wednesday, June 10, 2015 2:44 PM
> >> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> >> > +
> >> > +           /*
> >> > +            * Delete the vCPU from the related block list
> >> > +            * if we are resuming from blocked state
> >> > +            */
> >> > +           spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
> >> > +                             v->pre_pcpu), flags);
> >> > +           list_del(&v->blocked_vcpu_list);
> >> > +           spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
> >> > +                                  v->pre_pcpu), flags);
> >> > +        }
> >> > +        break;
> >> > +
> >> > +    case RUNSTATE_blocked:
> >> > +        /*
> >> > +         * The vCPU is blocked on the block list.
> >> > +         * Add the blocked vCPU on the list of the
> >> > +         * vcpu->pre_pcpu, which is the destination
> >> > +         * of the wake-up notification event.
> >> > +         */
> >> > +        v->pre_pcpu = v->processor;
> >>
> >> Is latching this upon runstate change really enough? I.e. what about
> >> the v->processor changes that sched_move_domain() or individual
> >> schedulers do? Or if it really just matters on which CPU's blocked list
> >> the vCPU is (while its ->processor changing subsequently doesn't
> >> matter) I'd like to see the field named more after its purpose (e.g.
> >> pi_block_cpu; list and lock should btw also have a connection to PI
> >> in their names).
> >
> > Yes, It doesn't matter if vCPU changes. The key point is that we put
> > the vCPU on a pCPU list and we change the NDST field to this pCPU,
> > then the wakeup notification event will get there. You are right, I
> > need to rename them to reflect the real purpose of it.
> >
> >>
> >> In the end, if the placement on a list followed v->processor, you
> >> would likely get away without the extra new field. Are there
> >> synchronization constraints speaking against such a model?
> >
> > I don't understand this quit well. Do you mean using 'v->processor'
> > as the pCPU list for the blocked vCPUs? Then what about 'v->processor'
> > changes, seems we cannot handle this case.
> 
> That was the question - does anything speak against such a model?

Do you mean still using v->processor as the pCPU to store the blocked vCPU?

Thanks,
Feng

> 
> >> > @@ -157,7 +158,11 @@ static inline void vcpu_runstate_change(
> >> >          v->runstate.state_entry_time = new_entry_time;
> >> >      }
> >> >
> >> > +    old_state = v->runstate.state;
> >> >      v->runstate.state = new_state;
> >> > +
> >> > +    if ( is_hvm_vcpu(v) && hvm_funcs.pi_desc_update )
> >> > +        hvm_funcs.pi_desc_update(v, old_state);
> >>
> >> I don't see how this would build on ARM.
> >
> > So what about adding " #ifdef CONFIG_X86 ..." here?
> 
> That would yield ugly code. If this needs to be here, you'll have
> to introduce a suitable arch_...() hook (doing nothing on ARM).
> 
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 08/15] Update IRTE according to guest interrupt config changes
  2015-06-16  8:08     ` Wu, Feng
@ 2015-06-16  8:17       ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-16  8:17 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir, george.dunlap, andrew.cooper3, xen-devel, Yang Z Zhang

>>> On 16.06.15 at 10:08, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Tuesday, June 09, 2015 11:06 PM
>> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
>> > +    dest_vcpu_array = xzalloc_array(struct vcpu *, d->max_vcpus);
>> 
>> You realize that this can be quite big an allocation? (You could at
>> least halve it by storing vCPU IDs instead of pointers, but if I'm
>> not mistaken this could even be a simple bitmap.)
> 
> I think maybe storing the vCPU IDs is better than using bitmap, because
> If using vCPU IDs, we can easily to find the destination vCPU id by 
> dest_vcpu_id_array [gvec % dest_vcpu_num],
> then we can get the vcpu form d->vcpu[]. However if we use bitmap here, we set 
> the bits for those vCPUs which
> is present in the destination field, then we need to use find_next_bit() to 
> find the right vCPU ID, this may need
> a while loop, see the following scenario:
> 
> - Guest has 8 vCPUs, in the interrupt destination fields, it only configures 
> vcpu 0, 1, 3, 4, 5, 7 for the lowest-priority interrupt.
>  So we get the vCPU destination bitmap like this: 10111011b
> - Suppose guest vector is 44, so we choose 44%6=2, we need to use 
> find_next_bit() to find the second set bit (counting from 0, which is vCPU ID 
> 3), this may need a while loop.
> 
> What is your opinion about this? Space vs Speed?

Since the array can get rather large (in particular, larger than a page),
space would seem the more important factor here.

Also - please trim your replies.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-16  8:13         ` Wu, Feng
@ 2015-06-16  8:29           ` Jan Beulich
  2015-06-16  8:56             ` Wu, Feng
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-06-16  8:29 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir, george.dunlap, andrew.cooper3, xen-devel, Yang Z Zhang

>>> On 16.06.15 at 10:13, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Tuesday, June 16, 2015 3:53 PM
>> >>> On 16.06.15 at 02:17, <feng.wu@intel.com> wrote:
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Wednesday, June 10, 2015 2:44 PM
>> >> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
>> >> > +
>> >> > +           /*
>> >> > +            * Delete the vCPU from the related block list
>> >> > +            * if we are resuming from blocked state
>> >> > +            */
>> >> > +           spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
>> >> > +                             v->pre_pcpu), flags);
>> >> > +           list_del(&v->blocked_vcpu_list);
>> >> > +           spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
>> >> > +                                  v->pre_pcpu), flags);
>> >> > +        }
>> >> > +        break;
>> >> > +
>> >> > +    case RUNSTATE_blocked:
>> >> > +        /*
>> >> > +         * The vCPU is blocked on the block list.
>> >> > +         * Add the blocked vCPU on the list of the
>> >> > +         * vcpu->pre_pcpu, which is the destination
>> >> > +         * of the wake-up notification event.
>> >> > +         */
>> >> > +        v->pre_pcpu = v->processor;
>> >>
>> >> Is latching this upon runstate change really enough? I.e. what about
>> >> the v->processor changes that sched_move_domain() or individual
>> >> schedulers do? Or if it really just matters on which CPU's blocked list
>> >> the vCPU is (while its ->processor changing subsequently doesn't
>> >> matter) I'd like to see the field named more after its purpose (e.g.
>> >> pi_block_cpu; list and lock should btw also have a connection to PI
>> >> in their names).
>> >
>> > Yes, It doesn't matter if vCPU changes. The key point is that we put
>> > the vCPU on a pCPU list and we change the NDST field to this pCPU,
>> > then the wakeup notification event will get there. You are right, I
>> > need to rename them to reflect the real purpose of it.
>> >
>> >>
>> >> In the end, if the placement on a list followed v->processor, you
>> >> would likely get away without the extra new field. Are there
>> >> synchronization constraints speaking against such a model?
>> >
>> > I don't understand this quit well. Do you mean using 'v->processor'
>> > as the pCPU list for the blocked vCPUs? Then what about 'v->processor'
>> > changes, seems we cannot handle this case.
>> 
>> That was the question - does anything speak against such a model?
> 
> Do you mean still using v->processor as the pCPU to store the blocked vCPU?

Not "to store", but "to indicate", but yes, the question still is regarding
the need for the new field. I'm fine with adding the field if there is a
proper explanation for it being needed; I'm not going to agree to add
new fields when existing ones can serve the purpose.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-16  8:29           ` Jan Beulich
@ 2015-06-16  8:56             ` Wu, Feng
  2015-06-16  9:24               ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Wu, Feng @ 2015-06-16  8:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, June 16, 2015 4:29 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU
> scheduling
> 
> >>> On 16.06.15 at 10:13, <feng.wu@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Tuesday, June 16, 2015 3:53 PM
> >> >>> On 16.06.15 at 02:17, <feng.wu@intel.com> wrote:
> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Wednesday, June 10, 2015 2:44 PM
> >> >> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> >> >> > +
> >> >> > +           /*
> >> >> > +            * Delete the vCPU from the related block list
> >> >> > +            * if we are resuming from blocked state
> >> >> > +            */
> >> >> > +           spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
> >> >> > +                             v->pre_pcpu), flags);
> >> >> > +           list_del(&v->blocked_vcpu_list);
> >> >> > +           spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
> >> >> > +                                  v->pre_pcpu), flags);
> >> >> > +        }
> >> >> > +        break;
> >> >> > +
> >> >> > +    case RUNSTATE_blocked:
> >> >> > +        /*
> >> >> > +         * The vCPU is blocked on the block list.
> >> >> > +         * Add the blocked vCPU on the list of the
> >> >> > +         * vcpu->pre_pcpu, which is the destination
> >> >> > +         * of the wake-up notification event.
> >> >> > +         */
> >> >> > +        v->pre_pcpu = v->processor;
> >> >>
> >> >> Is latching this upon runstate change really enough? I.e. what about
> >> >> the v->processor changes that sched_move_domain() or individual
> >> >> schedulers do? Or if it really just matters on which CPU's blocked list
> >> >> the vCPU is (while its ->processor changing subsequently doesn't
> >> >> matter) I'd like to see the field named more after its purpose (e.g.
> >> >> pi_block_cpu; list and lock should btw also have a connection to PI
> >> >> in their names).
> >> >
> >> > Yes, It doesn't matter if vCPU changes. The key point is that we put
> >> > the vCPU on a pCPU list and we change the NDST field to this pCPU,
> >> > then the wakeup notification event will get there. You are right, I
> >> > need to rename them to reflect the real purpose of it.
> >> >
> >> >>
> >> >> In the end, if the placement on a list followed v->processor, you
> >> >> would likely get away without the extra new field. Are there
> >> >> synchronization constraints speaking against such a model?
> >> >
> >> > I don't understand this quit well. Do you mean using 'v->processor'
> >> > as the pCPU list for the blocked vCPUs? Then what about 'v->processor'
> >> > changes, seems we cannot handle this case.
> >>
> >> That was the question - does anything speak against such a model?
> >
> > Do you mean still using v->processor as the pCPU to store the blocked vCPU?
> 
> Not "to store", but "to indicate", but yes, the question still is regarding
> the need for the new field. I'm fine with adding the field if there is a
> proper explanation for it being needed; I'm not going to agree to add
> new fields when existing ones can serve the purpose.

Fair enough. The basic reason for adding this new field is 'v->processor' can
be changed behind, so we lost control of it. This is straightforward way I
can think of right now.

Thanks,
Feng

> 
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-16  8:56             ` Wu, Feng
@ 2015-06-16  9:24               ` Jan Beulich
  2015-06-17  6:52                 ` Wu, Feng
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-06-16  9:24 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir, george.dunlap, andrew.cooper3, xen-devel, Yang Z Zhang

>>> On 16.06.15 at 10:56, <feng.wu@intel.com> wrote:

> 
>> -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Tuesday, June 16, 2015 4:29 PM
>> To: Wu, Feng
>> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
>> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org 
>> Subject: RE: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU
>> scheduling
>> 
>> >>> On 16.06.15 at 10:13, <feng.wu@intel.com> wrote:
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Tuesday, June 16, 2015 3:53 PM
>> >> >>> On 16.06.15 at 02:17, <feng.wu@intel.com> wrote:
>> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> >> Sent: Wednesday, June 10, 2015 2:44 PM
>> >> >> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
>> >> >> > +
>> >> >> > +           /*
>> >> >> > +            * Delete the vCPU from the related block list
>> >> >> > +            * if we are resuming from blocked state
>> >> >> > +            */
>> >> >> > +           spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
>> >> >> > +                             v->pre_pcpu), flags);
>> >> >> > +           list_del(&v->blocked_vcpu_list);
>> >> >> > +           spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
>> >> >> > +                                  v->pre_pcpu), flags);
>> >> >> > +        }
>> >> >> > +        break;
>> >> >> > +
>> >> >> > +    case RUNSTATE_blocked:
>> >> >> > +        /*
>> >> >> > +         * The vCPU is blocked on the block list.
>> >> >> > +         * Add the blocked vCPU on the list of the
>> >> >> > +         * vcpu->pre_pcpu, which is the destination
>> >> >> > +         * of the wake-up notification event.
>> >> >> > +         */
>> >> >> > +        v->pre_pcpu = v->processor;
>> >> >>
>> >> >> Is latching this upon runstate change really enough? I.e. what about
>> >> >> the v->processor changes that sched_move_domain() or individual
>> >> >> schedulers do? Or if it really just matters on which CPU's blocked list
>> >> >> the vCPU is (while its ->processor changing subsequently doesn't
>> >> >> matter) I'd like to see the field named more after its purpose (e.g.
>> >> >> pi_block_cpu; list and lock should btw also have a connection to PI
>> >> >> in their names).
>> >> >
>> >> > Yes, It doesn't matter if vCPU changes. The key point is that we put
>> >> > the vCPU on a pCPU list and we change the NDST field to this pCPU,
>> >> > then the wakeup notification event will get there. You are right, I
>> >> > need to rename them to reflect the real purpose of it.
>> >> >
>> >> >>
>> >> >> In the end, if the placement on a list followed v->processor, you
>> >> >> would likely get away without the extra new field. Are there
>> >> >> synchronization constraints speaking against such a model?
>> >> >
>> >> > I don't understand this quit well. Do you mean using 'v->processor'
>> >> > as the pCPU list for the blocked vCPUs? Then what about 'v->processor'
>> >> > changes, seems we cannot handle this case.
>> >>
>> >> That was the question - does anything speak against such a model?
>> >
>> > Do you mean still using v->processor as the pCPU to store the blocked vCPU?
>> 
>> Not "to store", but "to indicate", but yes, the question still is regarding
>> the need for the new field. I'm fine with adding the field if there is a
>> proper explanation for it being needed; I'm not going to agree to add
>> new fields when existing ones can serve the purpose.
> 
> Fair enough. The basic reason for adding this new field is 'v->processor' can
> be changed behind, so we lost control of it. This is straightforward way I
> can think of right now.

This is the obvious part you state. What needs to be explained is
why it would be impossible (or a t least a bad idea) to move the
vCPU between blocked lists when its v->processor changes.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-16  9:24               ` Jan Beulich
@ 2015-06-17  6:52                 ` Wu, Feng
  0 siblings, 0 replies; 64+ messages in thread
From: Wu, Feng @ 2015-06-17  6:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, June 16, 2015 5:25 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU
> scheduling
> 
> >>> On 16.06.15 at 10:56, <feng.wu@intel.com> wrote:
> 
> >
> >> -----Original Message-----
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Tuesday, June 16, 2015 4:29 PM
> >> To: Wu, Feng
> >> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> >> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> >> Subject: RE: [RFC v2 13/15] Update Posted-Interrupts Descriptor during
> vCPU
> >> scheduling
> >>
> >> >>> On 16.06.15 at 10:13, <feng.wu@intel.com> wrote:
> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Tuesday, June 16, 2015 3:53 PM
> >> >> >>> On 16.06.15 at 02:17, <feng.wu@intel.com> wrote:
> >> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> >> Sent: Wednesday, June 10, 2015 2:44 PM
> >> >> >> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> >> >> >> > +
> >> >> >> > +           /*
> >> >> >> > +            * Delete the vCPU from the related block list
> >> >> >> > +            * if we are resuming from blocked state
> >> >> >> > +            */
> >> >> >> > +           spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
> >> >> >> > +                             v->pre_pcpu), flags);
> >> >> >> > +           list_del(&v->blocked_vcpu_list);
> >> >> >> > +           spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
> >> >> >> > +                                  v->pre_pcpu), flags);
> >> >> >> > +        }
> >> >> >> > +        break;
> >> >> >> > +
> >> >> >> > +    case RUNSTATE_blocked:
> >> >> >> > +        /*
> >> >> >> > +         * The vCPU is blocked on the block list.
> >> >> >> > +         * Add the blocked vCPU on the list of the
> >> >> >> > +         * vcpu->pre_pcpu, which is the destination
> >> >> >> > +         * of the wake-up notification event.
> >> >> >> > +         */
> >> >> >> > +        v->pre_pcpu = v->processor;
> >> >> >>
> >> >> >> Is latching this upon runstate change really enough? I.e. what about
> >> >> >> the v->processor changes that sched_move_domain() or individual
> >> >> >> schedulers do? Or if it really just matters on which CPU's blocked list
> >> >> >> the vCPU is (while its ->processor changing subsequently doesn't
> >> >> >> matter) I'd like to see the field named more after its purpose (e.g.
> >> >> >> pi_block_cpu; list and lock should btw also have a connection to PI
> >> >> >> in their names).
> >> >> >
> >> >> > Yes, It doesn't matter if vCPU changes. The key point is that we put
> >> >> > the vCPU on a pCPU list and we change the NDST field to this pCPU,
> >> >> > then the wakeup notification event will get there. You are right, I
> >> >> > need to rename them to reflect the real purpose of it.
> >> >> >
> >> >> >>
> >> >> >> In the end, if the placement on a list followed v->processor, you
> >> >> >> would likely get away without the extra new field. Are there
> >> >> >> synchronization constraints speaking against such a model?
> >> >> >
> >> >> > I don't understand this quit well. Do you mean using 'v->processor'
> >> >> > as the pCPU list for the blocked vCPUs? Then what about 'v->processor'
> >> >> > changes, seems we cannot handle this case.
> >> >>
> >> >> That was the question - does anything speak against such a model?
> >> >
> >> > Do you mean still using v->processor as the pCPU to store the blocked
> vCPU?
> >>
> >> Not "to store", but "to indicate", but yes, the question still is regarding
> >> the need for the new field. I'm fine with adding the field if there is a
> >> proper explanation for it being needed; I'm not going to agree to add
> >> new fields when existing ones can serve the purpose.
> >
> > Fair enough. The basic reason for adding this new field is 'v->processor' can
> > be changed behind, so we lost control of it. This is straightforward way I
> > can think of right now.
> 
> This is the obvious part you state. What needs to be explained is
> why it would be impossible (or a t least a bad idea) to move the
> vCPU between blocked lists when its v->processor changes.

The key point here is that we put the blocked vCPU in the pCPU list and
update the 'NDST' field in posted-interrupt descriptor for the wakeup
notification event. Here are some reasons why using 'v->processor' as
the pCPU to store the blocked vCPUs is not a good idea:

There may be many places where 'v->processor' gets changed, we need
to find all of them and in each place, see if the vCPU is currently blocked, if
it is, remove it from the old pCPU list and insert to the new pCPU list (need
spinlock operations), we need also update the 'NDST' filed for the wakeup
notification event. Besides that, the individual scheduler can change
'v->processor', which means we may need to add some code specific to PI
purpose to the scheduler code, I don't think this is a good way. On the other
hand, if you use the current solution, we can get the following benefit:
- The logic is clear and simple.
- We only need to update the posted-interrupt descriptor before the vCPU
is blocked, i.e. once the 'NDST' filed is updated before blocking the vCPU,
we don't need to change it even 'v->processor' is changed.
- We have a central place to operate the vCPU list, we don't need to care
missing some places.

Thanks,
Feng

> 
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-10  6:43   ` Jan Beulich
  2015-06-16  0:17     ` Wu, Feng
@ 2015-06-17  6:54     ` Wu, Feng
  2015-06-17  7:56       ` Jan Beulich
  2015-06-17  7:00     ` Wu, Feng
  2 siblings, 1 reply; 64+ messages in thread
From: Wu, Feng @ 2015-06-17  6:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, June 10, 2015 2:44 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU
> scheduling
> 
> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -1711,6 +1711,131 @@ static void vmx_handle_eoi(u8 vector)
> >      __vmwrite(GUEST_INTR_STATUS, status);
> >  }
> >
> > +static void vmx_pi_desc_update(struct vcpu *v, int old_state)
> > +{
> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +    struct pi_desc old, new;
> > +    unsigned long flags;
> > +
> > +    if ( !iommu_intpost )
> > +        return;
> 
> Considering how the adjustment to start_vmx(), this at best should
> be an ASSERT() (if a check is needed at all).
> 
> > +    switch ( v->runstate.state )
> > +    {
> > +    case RUNSTATE_runnable:
> > +    case RUNSTATE_offline:
> > +        /*
> > +         * We don't need to send notification event to a non-running
> > +         * vcpu, the interrupt information will be delivered to it before
> > +         * VM-ENTRY when the vcpu is scheduled to run next time.
> > +         */
> > +        pi_desc->sn = 1;
> > +
> > +        /*
> > +         * If the state is transferred from RUNSTATE_blocked,
> > +         * we should set 'NV' feild back to posted_intr_vector,
> > +         * so the Posted-Interrupts can be delivered to the vCPU
> > +         * by VT-d HW after it is scheduled to run.
> > +         */
> > +        if ( old_state == RUNSTATE_blocked )
> > +        {
> > +            do
> > +            {
> > +                old.control = new.control = pi_desc->control;
> > +                new.nv = posted_intr_vector;
> > +            }
> > +            while ( cmpxchg(&pi_desc->control, old.control, new.control)
> > +                    != old.control );
> 
> So why is it okay to do the SN update non-atomically when the
> vector update is done atomically?
> 
> Also the curly braces do _not_ go on separate lines for do/while
> constructs.
> 
> And then do you really need to atomically update the whole 64 bits
> here, rather than just the nv field? By not making it a bit field
> you could perhaps just write_atomic() it?
>

Thinking more about this, maybe write_atomic() is not good for this, we need use
the LOCK prefix when updating the posted-interrupt descriptor.

Thanks,
Feng


> > +
> > +           /*
> > +            * Delete the vCPU from the related block list
> > +            * if we are resuming from blocked state
> > +            */
> > +           spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
> > +                             v->pre_pcpu), flags);
> > +           list_del(&v->blocked_vcpu_list);
> > +           spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
> > +                                  v->pre_pcpu), flags);
> > +        }
> > +        break;
> > +
> > +    case RUNSTATE_blocked:
> > +        /*
> > +         * The vCPU is blocked on the block list.
> > +         * Add the blocked vCPU on the list of the
> > +         * vcpu->pre_pcpu, which is the destination
> > +         * of the wake-up notification event.
> > +         */
> > +        v->pre_pcpu = v->processor;
> 
> Is latching this upon runstate change really enough? I.e. what about
> the v->processor changes that sched_move_domain() or individual
> schedulers do? Or if it really just matters on which CPU's blocked list
> the vCPU is (while its ->processor changing subsequently doesn't
> matter) I'd like to see the field named more after its purpose (e.g.
> pi_block_cpu; list and lock should btw also have a connection to PI
> in their names).
> 
> In the end, if the placement on a list followed v->processor, you
> would likely get away without the extra new field. Are there
> synchronization constraints speaking against such a model?
> 
> > +        spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
> > +                          v->pre_pcpu), flags);
> > +        list_add_tail(&v->blocked_vcpu_list,
> > +                      &per_cpu(blocked_vcpu, v->pre_pcpu));
> > +        spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
> > +                               v->pre_pcpu), flags);
> > +
> > +        do
> > +        {
> > +            old.control = new.control = pi_desc->control;
> > +
> > +            /*
> > +             * We should not block the vCPU if
> > +             * an interrupt was posted for it.
> > +             */
> > +
> > +            if ( old.on == 1 )
> 
> I think I said so elsewhere, but just in case I didn't: Please don't
> compare boolean fields with 1 - e.g. in the case here "if ( old.on )"
> suffices.
> 
> > +            {
> > +                /*
> > +                 * The vCPU will be removed from the block list
> > +                 * during its state transferring from RUNSTATE_blocked
> > +                 * to RUNSTATE_runnable after the following tasklet
> > +                 * is scheduled to run.
> > +                 */
> 
> Precise comments please - I don't think the scheduling of the
> tasklet makes any difference; I suppose it's the execution of it
> that does.
> 
> > +                tasklet_schedule(&v->vcpu_wakeup_tasklet);
> > +                return;
> > +            }
> > +
> > +            /*
> > +             * Change the 'NDST' field to v->pre_pcpu, so when
> > +             * external interrupts from assigned deivces happen,
> > +             * wakeup notifiction event will go to v->pre_pcpu,
> > +             * then in pi_wakeup_interrupt() we can find the
> > +             * vCPU in the right list to wake up.
> > +             */
> > +            if ( x2apic_enabled )
> > +                new.ndst = cpu_physical_id(v->pre_pcpu);
> > +            else
> > +                new.ndst = MASK_INSR(cpu_physical_id(v->pre_pcpu),
> > +                                     PI_xAPIC_NDST_MASK);
> > +            new.sn = 0;
> > +            new.nv = pi_wakeup_vector;
> > +        }
> > +        while ( cmpxchg(&pi_desc->control, old.control, new.control)
> > +                != old.control );
> > +        break;
> > +
> > +    case RUNSTATE_running:
> > +        ASSERT( pi_desc->sn == 1 );
> > +
> > +        do
> > +        {
> > +            old.control = new.control = pi_desc->control;
> > +            if ( x2apic_enabled )
> > +                new.ndst = cpu_physical_id(v->processor);
> > +            else
> > +                new.ndst = (cpu_physical_id(v->processor) << 8) &
> 0xFF00;
> 
> Why are you not using MASK_INSR() here like you do above?
> 
> > +
> > +            new.sn = 0;
> > +        }
> > +        while ( cmpxchg(&pi_desc->control, old.control, new.control)
> > +                != old.control );
> 
> Same here - can't this be a write_atomic() of ndst and a clear_bit()
> of sn instead of a loop over cmpxchg()?
> 
> > +        break;
> > +
> > +    default:
> > +        break;
> 
> This seems dangerous: I think you want at least an
> ASSERT_UNREACHABLE() here.
> 
> > @@ -1842,7 +1967,12 @@ const struct hvm_function_table * __init
> start_vmx(void)
> >          alloc_direct_apic_vector(&posted_intr_vector,
> pi_notification_interrupt);
> >
> >          if ( iommu_intpost )
> > +        {
> >              alloc_direct_apic_vector(&pi_wakeup_vector,
> pi_wakeup_interrupt);
> > +            vmx_function_table.pi_desc_update = vmx_pi_desc_update;
> > +        }
> > +        else
> > +            vmx_function_table.pi_desc_update = NULL;
> 
> Isn't that field NULL anyway?
> 
> > @@ -157,7 +158,11 @@ static inline void vcpu_runstate_change(
> >          v->runstate.state_entry_time = new_entry_time;
> >      }
> >
> > +    old_state = v->runstate.state;
> >      v->runstate.state = new_state;
> > +
> > +    if ( is_hvm_vcpu(v) && hvm_funcs.pi_desc_update )
> > +        hvm_funcs.pi_desc_update(v, old_state);
> 
> I don't see how this would build on ARM.
> 
> > --- a/xen/include/xen/sched.h
> > +++ b/xen/include/xen/sched.h
> > @@ -142,6 +142,8 @@ struct vcpu
> >
> >      int              processor;
> >
> > +    int              pre_pcpu;
> 
> This again doesn't belong into the common structure (and should
> be accompanied by a comment, and should be unsigned int).
> 
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-10  6:43   ` Jan Beulich
  2015-06-16  0:17     ` Wu, Feng
  2015-06-17  6:54     ` Wu, Feng
@ 2015-06-17  7:00     ` Wu, Feng
  2015-06-17  7:59       ` Jan Beulich
  2 siblings, 1 reply; 64+ messages in thread
From: Wu, Feng @ 2015-06-17  7:00 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, June 10, 2015 2:44 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU
> scheduling
> 
> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -1711,6 +1711,131 @@ static void vmx_handle_eoi(u8 vector)
> >      __vmwrite(GUEST_INTR_STATUS, status);
> >  }
> >
> > +static void vmx_pi_desc_update(struct vcpu *v, int old_state)
> > +{
> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +    struct pi_desc old, new;
> > +    unsigned long flags;
> > +
> > +    if ( !iommu_intpost )
> > +        return;
> 
> Considering how the adjustment to start_vmx(), this at best should
> be an ASSERT() (if a check is needed at all).
> 
> > +    switch ( v->runstate.state )
> > +    {
> > +    case RUNSTATE_runnable:
> > +    case RUNSTATE_offline:
> > +        /*
> > +         * We don't need to send notification event to a non-running
> > +         * vcpu, the interrupt information will be delivered to it before
> > +         * VM-ENTRY when the vcpu is scheduled to run next time.
> > +         */
> > +        pi_desc->sn = 1;
> > +
> > +        /*
> > +         * If the state is transferred from RUNSTATE_blocked,
> > +         * we should set 'NV' feild back to posted_intr_vector,
> > +         * so the Posted-Interrupts can be delivered to the vCPU
> > +         * by VT-d HW after it is scheduled to run.
> > +         */
> > +        if ( old_state == RUNSTATE_blocked )
> > +        {
> > +            do
> > +            {
> > +                old.control = new.control = pi_desc->control;
> > +                new.nv = posted_intr_vector;
> > +            }
> > +            while ( cmpxchg(&pi_desc->control, old.control, new.control)
> > +                    != old.control );
> 
> So why is it okay to do the SN update non-atomically when the
> vector update is done atomically?
> 
> Also the curly braces do _not_ go on separate lines for do/while
> constructs.
> 
> And then do you really need to atomically update the whole 64 bits
> here, rather than just the nv field? By not making it a bit field
> you could perhaps just write_atomic() it?
> 
> > +
> > +           /*
> > +            * Delete the vCPU from the related block list
> > +            * if we are resuming from blocked state
> > +            */
> > +           spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
> > +                             v->pre_pcpu), flags);
> > +           list_del(&v->blocked_vcpu_list);
> > +           spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
> > +                                  v->pre_pcpu), flags);
> > +        }
> > +        break;
> > +
> > +    case RUNSTATE_blocked:
> > +        /*
> > +         * The vCPU is blocked on the block list.
> > +         * Add the blocked vCPU on the list of the
> > +         * vcpu->pre_pcpu, which is the destination
> > +         * of the wake-up notification event.
> > +         */
> > +        v->pre_pcpu = v->processor;
> 
> Is latching this upon runstate change really enough? I.e. what about
> the v->processor changes that sched_move_domain() or individual
> schedulers do? Or if it really just matters on which CPU's blocked list
> the vCPU is (while its ->processor changing subsequently doesn't
> matter) I'd like to see the field named more after its purpose (e.g.
> pi_block_cpu; list and lock should btw also have a connection to PI
> in their names).
> 
> In the end, if the placement on a list followed v->processor, you
> would likely get away without the extra new field. Are there
> synchronization constraints speaking against such a model?
> 
> > +        spin_lock_irqsave(&per_cpu(blocked_vcpu_lock,
> > +                          v->pre_pcpu), flags);
> > +        list_add_tail(&v->blocked_vcpu_list,
> > +                      &per_cpu(blocked_vcpu, v->pre_pcpu));
> > +        spin_unlock_irqrestore(&per_cpu(blocked_vcpu_lock,
> > +                               v->pre_pcpu), flags);
> > +
> > +        do
> > +        {
> > +            old.control = new.control = pi_desc->control;
> > +
> > +            /*
> > +             * We should not block the vCPU if
> > +             * an interrupt was posted for it.
> > +             */
> > +
> > +            if ( old.on == 1 )
> 
> I think I said so elsewhere, but just in case I didn't: Please don't
> compare boolean fields with 1 - e.g. in the case here "if ( old.on )"
> suffices.
> 
> > +            {
> > +                /*
> > +                 * The vCPU will be removed from the block list
> > +                 * during its state transferring from RUNSTATE_blocked
> > +                 * to RUNSTATE_runnable after the following tasklet
> > +                 * is scheduled to run.
> > +                 */
> 
> Precise comments please - I don't think the scheduling of the
> tasklet makes any difference; I suppose it's the execution of it
> that does.
> 
> > +                tasklet_schedule(&v->vcpu_wakeup_tasklet);
> > +                return;
> > +            }
> > +
> > +            /*
> > +             * Change the 'NDST' field to v->pre_pcpu, so when
> > +             * external interrupts from assigned deivces happen,
> > +             * wakeup notifiction event will go to v->pre_pcpu,
> > +             * then in pi_wakeup_interrupt() we can find the
> > +             * vCPU in the right list to wake up.
> > +             */
> > +            if ( x2apic_enabled )
> > +                new.ndst = cpu_physical_id(v->pre_pcpu);
> > +            else
> > +                new.ndst = MASK_INSR(cpu_physical_id(v->pre_pcpu),
> > +                                     PI_xAPIC_NDST_MASK);
> > +            new.sn = 0;
> > +            new.nv = pi_wakeup_vector;
> > +        }
> > +        while ( cmpxchg(&pi_desc->control, old.control, new.control)
> > +                != old.control );
> > +        break;
> > +
> > +    case RUNSTATE_running:
> > +        ASSERT( pi_desc->sn == 1 );
> > +
> > +        do
> > +        {
> > +            old.control = new.control = pi_desc->control;
> > +            if ( x2apic_enabled )
> > +                new.ndst = cpu_physical_id(v->processor);
> > +            else
> > +                new.ndst = (cpu_physical_id(v->processor) << 8) &
> 0xFF00;
> 
> Why are you not using MASK_INSR() here like you do above?
> 
> > +
> > +            new.sn = 0;
> > +        }
> > +        while ( cmpxchg(&pi_desc->control, old.control, new.control)
> > +                != old.control );
> 
> Same here - can't this be a write_atomic() of ndst and a clear_bit()
> of sn instead of a loop over cmpxchg()?
> 
> > +        break;
> > +
> > +    default:
> > +        break;
> 
> This seems dangerous: I think you want at least an
> ASSERT_UNREACHABLE() here.
> 
> > @@ -1842,7 +1967,12 @@ const struct hvm_function_table * __init
> start_vmx(void)
> >          alloc_direct_apic_vector(&posted_intr_vector,
> pi_notification_interrupt);
> >
> >          if ( iommu_intpost )
> > +        {
> >              alloc_direct_apic_vector(&pi_wakeup_vector,
> pi_wakeup_interrupt);
> > +            vmx_function_table.pi_desc_update = vmx_pi_desc_update;
> > +        }
> > +        else
> > +            vmx_function_table.pi_desc_update = NULL;
> 
> Isn't that field NULL anyway?
> 
> > @@ -157,7 +158,11 @@ static inline void vcpu_runstate_change(
> >          v->runstate.state_entry_time = new_entry_time;
> >      }
> >
> > +    old_state = v->runstate.state;
> >      v->runstate.state = new_state;
> > +
> > +    if ( is_hvm_vcpu(v) && hvm_funcs.pi_desc_update )
> > +        hvm_funcs.pi_desc_update(v, old_state);
> 
> I don't see how this would build on ARM.
> 
> > --- a/xen/include/xen/sched.h
> > +++ b/xen/include/xen/sched.h
> > @@ -142,6 +142,8 @@ struct vcpu
> >
> >      int              processor;
> >
> > +    int              pre_pcpu;
> 
> This again doesn't belong into the common structure (and should
> be accompanied by a comment, and should be unsigned int).

I added three new member to 'struct vcpu' in this series:
	- struct tasklet   vcpu_wakeup_tasklet;
	- struct list_head blocked_vcpu_list;
	- int              pre_pcpu;

I am trying to find a place specific to vmx to put them in, but the only one
I find is ' struct arch_vmx_struct ', however, I don't think this is a good
place, since it contains all vt-x stuff defined in the SDM. Do you think
'struct hvm_vcpu' is the right place?

Thanks,
Feng

> 
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-17  6:54     ` Wu, Feng
@ 2015-06-17  7:56       ` Jan Beulich
  2015-06-17  8:46         ` Wu, Feng
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-06-17  7:56 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir, george.dunlap, andrew.cooper3, xen-devel, Yang Z Zhang

>>> On 17.06.15 at 08:54, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, June 10, 2015 2:44 PM
>> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
>> > +    switch ( v->runstate.state )
>> > +    {
>> > +    case RUNSTATE_runnable:
>> > +    case RUNSTATE_offline:
>> > +        /*
>> > +         * We don't need to send notification event to a non-running
>> > +         * vcpu, the interrupt information will be delivered to it before
>> > +         * VM-ENTRY when the vcpu is scheduled to run next time.
>> > +         */
>> > +        pi_desc->sn = 1;
>> > +
>> > +        /*
>> > +         * If the state is transferred from RUNSTATE_blocked,
>> > +         * we should set 'NV' feild back to posted_intr_vector,
>> > +         * so the Posted-Interrupts can be delivered to the vCPU
>> > +         * by VT-d HW after it is scheduled to run.
>> > +         */
>> > +        if ( old_state == RUNSTATE_blocked )
>> > +        {
>> > +            do
>> > +            {
>> > +                old.control = new.control = pi_desc->control;
>> > +                new.nv = posted_intr_vector;
>> > +            }
>> > +            while ( cmpxchg(&pi_desc->control, old.control, new.control)
>> > +                    != old.control );
>> 
>> So why is it okay to do the SN update non-atomically when the
>> vector update is done atomically?
>> 
>> Also the curly braces do _not_ go on separate lines for do/while
>> constructs.
>> 
>> And then do you really need to atomically update the whole 64 bits
>> here, rather than just the nv field? By not making it a bit field
>> you could perhaps just write_atomic() it?
>>
> 
> Thinking more about this, maybe write_atomic() is not good for this, we need 
> use
> the LOCK prefix when updating the posted-interrupt descriptor.

The LOCK prefix can't be applied to simple loads and stores; they're
implicitly atomic.

Also - please remember to trim your replies.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-17  7:00     ` Wu, Feng
@ 2015-06-17  7:59       ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2015-06-17  7:59 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir, george.dunlap, andrew.cooper3, xen-devel, Yang Z Zhang

>>> On 17.06.15 at 09:00, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, June 10, 2015 2:44 PM
>> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
>> > --- a/xen/include/xen/sched.h
>> > +++ b/xen/include/xen/sched.h
>> > @@ -142,6 +142,8 @@ struct vcpu
>> >
>> >      int              processor;
>> >
>> > +    int              pre_pcpu;
>> 
>> This again doesn't belong into the common structure (and should
>> be accompanied by a comment, and should be unsigned int).
> 
> I added three new member to 'struct vcpu' in this series:
> 	- struct tasklet   vcpu_wakeup_tasklet;
> 	- struct list_head blocked_vcpu_list;
> 	- int              pre_pcpu;
> 
> I am trying to find a place specific to vmx to put them in, but the only one
> I find is ' struct arch_vmx_struct ', however, I don't think this is a good
> place, since it contains all vt-x stuff defined in the SDM. Do you think
> 'struct hvm_vcpu' is the right place?

No, arch_vmx_struct is. Taking active_list and active_cpu as
example, the structure clearly isn't limited to fields defined by
the SDM.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-17  7:56       ` Jan Beulich
@ 2015-06-17  8:46         ` Wu, Feng
  2015-06-17  8:59           ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Wu, Feng @ 2015-06-17  8:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, June 17, 2015 3:56 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU
> scheduling
> 
> >>> On 17.06.15 at 08:54, <feng.wu@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Wednesday, June 10, 2015 2:44 PM
> >> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> >> > +    switch ( v->runstate.state )
> >> > +    {
> >> > +    case RUNSTATE_runnable:
> >> > +    case RUNSTATE_offline:
> >> > +        /*
> >> > +         * We don't need to send notification event to a non-running
> >> > +         * vcpu, the interrupt information will be delivered to it before
> >> > +         * VM-ENTRY when the vcpu is scheduled to run next time.
> >> > +         */
> >> > +        pi_desc->sn = 1;
> >> > +
> >> > +        /*
> >> > +         * If the state is transferred from RUNSTATE_blocked,
> >> > +         * we should set 'NV' feild back to posted_intr_vector,
> >> > +         * so the Posted-Interrupts can be delivered to the vCPU
> >> > +         * by VT-d HW after it is scheduled to run.
> >> > +         */
> >> > +        if ( old_state == RUNSTATE_blocked )
> >> > +        {
> >> > +            do
> >> > +            {
> >> > +                old.control = new.control = pi_desc->control;
> >> > +                new.nv = posted_intr_vector;
> >> > +            }
> >> > +            while ( cmpxchg(&pi_desc->control, old.control,
> new.control)
> >> > +                    != old.control );
> >>
> >> So why is it okay to do the SN update non-atomically when the
> >> vector update is done atomically?
> >>
> >> Also the curly braces do _not_ go on separate lines for do/while
> >> constructs.
> >>
> >> And then do you really need to atomically update the whole 64 bits
> >> here, rather than just the nv field? By not making it a bit field
> >> you could perhaps just write_atomic() it?
> >>
> >
> > Thinking more about this, maybe write_atomic() is not good for this, we need
> > use
> > the LOCK prefix when updating the posted-interrupt descriptor.
> 
> The LOCK prefix can't be applied to simple loads and stores; they're
> implicitly atomic.
> 

Yes, I know LOCK prefix cannot be used together with 'mov', but
my concern is without LOCK, does it work well when this
write-atomic() operation and hardware setting 'ON' occur at the
same time?

Hardware updates the 'ON' filed like this:
(which is listed in the Spec.)

"Hardware performs a coherent atomic read-modify-write
operation of the posted-interrupt descriptor as follows:
- Read contents of the Posted Interrupt Descriptor, claiming
exclusive ownership of its hosting cache-line. If invalid
programming of Posted Interrupt Descriptor is detected,
release ownership of the cache-line, and block the interrupt
request.
- If above checks succeed, retrieve current values of Posted
Interrupt Requests (PIR bits 255:0), Outstanding Notification (ON),
Suppress Notification (SN), Notification Vector (NV), and
Notification Destination (NDST) fields in the Posted Interrupt
Descriptor.
- Modify the following descriptor field values atomically:
* Set bit in PIR corresponding to the Vector field value from
the IRTE
* Compute X = ((ON == 0) & (URG | (SN == 0)))
* If (X == 1), Set ON field.
- Promote the cache-line to be globally observable, so that
the modifications are visible to other caching agents. Hardware
may write-back the cache-line any time after this step."

If we don't use the LOCK prefix, 'NV' filed may be changed between
reading these filed and writing back by hardware, right? But for
'NV', maybe this is not an issue, since hardware doesn't change it,
It only check 'ON', 'SN' and updates 'ON' if needed. And that's
why we need use LOCK prefix to update 'SN', right? What is you
thoughts about this?

Sorry for the long description above, since the hardware guys
advise me to use LOCK prefix when updating the posted-interrupt
descriptor, I just don't want to make any mistake here.

Thanks,
Feng

> Also - please remember to trim your replies.
> 
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-17  8:46         ` Wu, Feng
@ 2015-06-17  8:59           ` Jan Beulich
  2015-06-17  9:14             ` Wu, Feng
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-06-17  8:59 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir, george.dunlap, andrew.cooper3, xen-devel, Yang Z Zhang

>>> On 17.06.15 at 10:46, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, June 17, 2015 3:56 PM
>> >>> On 17.06.15 at 08:54, <feng.wu@intel.com> wrote:
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Wednesday, June 10, 2015 2:44 PM
>> >> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
>> >> > +    switch ( v->runstate.state )
>> >> > +    {
>> >> > +    case RUNSTATE_runnable:
>> >> > +    case RUNSTATE_offline:
>> >> > +        /*
>> >> > +         * We don't need to send notification event to a non-running
>> >> > +         * vcpu, the interrupt information will be delivered to it before
>> >> > +         * VM-ENTRY when the vcpu is scheduled to run next time.
>> >> > +         */
>> >> > +        pi_desc->sn = 1;
>> >> > +
>> >> > +        /*
>> >> > +         * If the state is transferred from RUNSTATE_blocked,
>> >> > +         * we should set 'NV' feild back to posted_intr_vector,
>> >> > +         * so the Posted-Interrupts can be delivered to the vCPU
>> >> > +         * by VT-d HW after it is scheduled to run.
>> >> > +         */
>> >> > +        if ( old_state == RUNSTATE_blocked )
>> >> > +        {
>> >> > +            do
>> >> > +            {
>> >> > +                old.control = new.control = pi_desc->control;
>> >> > +                new.nv = posted_intr_vector;
>> >> > +            }
>> >> > +            while ( cmpxchg(&pi_desc->control, old.control,
>> new.control)
>> >> > +                    != old.control );
>> >>
>> >> So why is it okay to do the SN update non-atomically when the
>> >> vector update is done atomically?
>> >>
>> >> Also the curly braces do _not_ go on separate lines for do/while
>> >> constructs.
>> >>
>> >> And then do you really need to atomically update the whole 64 bits
>> >> here, rather than just the nv field? By not making it a bit field
>> >> you could perhaps just write_atomic() it?
>> >>
>> >
>> > Thinking more about this, maybe write_atomic() is not good for this, we need
>> > use
>> > the LOCK prefix when updating the posted-interrupt descriptor.
>> 
>> The LOCK prefix can't be applied to simple loads and stores; they're
>> implicitly atomic.
>> 
> 
> Yes, I know LOCK prefix cannot be used together with 'mov', but
> my concern is without LOCK, does it work well when this
> write-atomic() operation and hardware setting 'ON' occur at the
> same time?
> 
> Hardware updates the 'ON' filed like this:
> (which is listed in the Spec.)
> 
> "Hardware performs a coherent atomic read-modify-write
> operation of the posted-interrupt descriptor as follows:
> - Read contents of the Posted Interrupt Descriptor, claiming
> exclusive ownership of its hosting cache-line.

The "cache line" part here is the really important aspect. The CPU
will do the same for LOCKed transactions as well as simple stores.
But of course - I repeat this just to avoid any misunderstanding -
this works only if the store covers _only_ nv (recall that I brought
this up in context of trying to avoid using bitfields where possible).

> Sorry for the long description above, since the hardware guys
> advise me to use LOCK prefix when updating the posted-interrupt
> descriptor, I just don't want to make any mistake here.

And that advice is correct if you fiddle with the descriptor in
quantities covering more than the precise byte(s) you intend
to modify.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-17  8:59           ` Jan Beulich
@ 2015-06-17  9:14             ` Wu, Feng
  0 siblings, 0 replies; 64+ messages in thread
From: Wu, Feng @ 2015-06-17  9:14 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, June 17, 2015 4:59 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU
> scheduling
> 
> >>> On 17.06.15 at 10:46, <feng.wu@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Wednesday, June 17, 2015 3:56 PM
> >> >>> On 17.06.15 at 08:54, <feng.wu@intel.com> wrote:
> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Wednesday, June 10, 2015 2:44 PM
> >> >> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> >> >> > +    switch ( v->runstate.state )
> >> >> > +    {
> >> >> > +    case RUNSTATE_runnable:
> >> >> > +    case RUNSTATE_offline:
> >> >> > +        /*
> >> >> > +         * We don't need to send notification event to a non-running
> >> >> > +         * vcpu, the interrupt information will be delivered to it
> before
> >> >> > +         * VM-ENTRY when the vcpu is scheduled to run next time.
> >> >> > +         */
> >> >> > +        pi_desc->sn = 1;
> >> >> > +
> >> >> > +        /*
> >> >> > +         * If the state is transferred from RUNSTATE_blocked,
> >> >> > +         * we should set 'NV' feild back to posted_intr_vector,
> >> >> > +         * so the Posted-Interrupts can be delivered to the vCPU
> >> >> > +         * by VT-d HW after it is scheduled to run.
> >> >> > +         */
> >> >> > +        if ( old_state == RUNSTATE_blocked )
> >> >> > +        {
> >> >> > +            do
> >> >> > +            {
> >> >> > +                old.control = new.control = pi_desc->control;
> >> >> > +                new.nv = posted_intr_vector;
> >> >> > +            }
> >> >> > +            while ( cmpxchg(&pi_desc->control, old.control,
> >> new.control)
> >> >> > +                    != old.control );
> >> >>
> >> >> So why is it okay to do the SN update non-atomically when the
> >> >> vector update is done atomically?
> >> >>
> >> >> Also the curly braces do _not_ go on separate lines for do/while
> >> >> constructs.
> >> >>
> >> >> And then do you really need to atomically update the whole 64 bits
> >> >> here, rather than just the nv field? By not making it a bit field
> >> >> you could perhaps just write_atomic() it?
> >> >>
> >> >
> >> > Thinking more about this, maybe write_atomic() is not good for this, we
> need
> >> > use
> >> > the LOCK prefix when updating the posted-interrupt descriptor.
> >>
> >> The LOCK prefix can't be applied to simple loads and stores; they're
> >> implicitly atomic.
> >>
> >
> > Yes, I know LOCK prefix cannot be used together with 'mov', but
> > my concern is without LOCK, does it work well when this
> > write-atomic() operation and hardware setting 'ON' occur at the
> > same time?
> >
> > Hardware updates the 'ON' filed like this:
> > (which is listed in the Spec.)
> >
> > "Hardware performs a coherent atomic read-modify-write
> > operation of the posted-interrupt descriptor as follows:
> > - Read contents of the Posted Interrupt Descriptor, claiming
> > exclusive ownership of its hosting cache-line.
> 
> The "cache line" part here is the really important aspect. The CPU
> will do the same for LOCKed transactions as well as simple stores.
> But of course - I repeat this just to avoid any misunderstanding -
> this works only if the store covers _only_ nv (recall that I brought
> this up in context of trying to avoid using bitfields where possible).

Your clarification makes thing clear. Thank you very much!

Thanks,
Feng

> 
> > Sorry for the long description above, since the hardware guys
> > advise me to use LOCK prefix when updating the posted-interrupt
> > descriptor, I just don't want to make any mistake here.
> 
> And that advice is correct if you fiddle with the descriptor in
> quantities covering more than the precise byte(s) you intend
> to modify.
> 
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-06-15  9:35 ` Jan Beulich
@ 2015-06-15 11:38   ` Wu, Feng
  0 siblings, 0 replies; 64+ messages in thread
From: Wu, Feng @ 2015-06-15 11:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, June 15, 2015 5:35 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
> 
> >>> On 15.06.15 at 11:20, <feng.wu@intel.com> wrote:
> 
> >
> >> -----Original Message-----
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Friday, June 12, 2015 6:35 PM
> >> To: Wu, Feng
> >> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> >> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> >> Subject: RE: [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
> >>
> >> >>> On 12.06.15 at 11:40, <feng.wu@intel.com> wrote:
> >> >> -----Original Message-----
> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> >> >> > +static inline void setup_posted_irte(
> >> >> > +    struct iremap_entry *new_ire, struct pi_desc *pi_desc, uint8_t
> >> gvec)
> >> >> > +{
> >> >> > +    new_ire->post.urg = 0;
> >> >> > +    new_ire->post.vector = gvec;
> >> >> > +    new_ire->post.pda_l = (((u64)virt_to_maddr(pi_desc)) >>
> >> >> > +                           (32 - PDA_LOW_BIT)) &
> >> PDA_MASK(LOW);
> >> >> > +    new_ire->post.pda_h = (((u64)virt_to_maddr(pi_desc)) >> 32) &
> >> >> > +                           PDA_MASK(HIGH);
> >> >>
> >> >> Looking at this another time I can't see what the and-ing with
> >> >> PAD_MASK() is supposed to be good for.
> >> >
> >> > I cannot understand this well. Do you mean we don't need and
> PDA_MASK()
> >> > here?
> >>
> >> Correct - the bitfield width (where the data gets stored into) already
> >> enforces the intended truncation afaict.
> >
> > We may not need PDA_MASK(HIGH), but is PDA_MASK(LOW) really
> unnecessary
> > here?
> 
> I think so - iirc the bitfield is exactly the width you need (or if it
> wasn't, that would be a mistake). What you can't get rid of is the
> shifting.

You are right. I got the same result after removing the PDA_MASK.

Thanks,
Feng

> 
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-06-15  9:20 [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used Wu, Feng
@ 2015-06-15  9:35 ` Jan Beulich
  2015-06-15 11:38   ` Wu, Feng
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2015-06-15  9:35 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir, george.dunlap, andrew.cooper3, xen-devel, Yang Z Zhang

>>> On 15.06.15 at 11:20, <feng.wu@intel.com> wrote:

> 
>> -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Friday, June 12, 2015 6:35 PM
>> To: Wu, Feng
>> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
>> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org 
>> Subject: RE: [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
>> 
>> >>> On 12.06.15 at 11:40, <feng.wu@intel.com> wrote:
>> >> -----Original Message-----
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
>> >> > +static inline void setup_posted_irte(
>> >> > +    struct iremap_entry *new_ire, struct pi_desc *pi_desc, uint8_t
>> gvec)
>> >> > +{
>> >> > +    new_ire->post.urg = 0;
>> >> > +    new_ire->post.vector = gvec;
>> >> > +    new_ire->post.pda_l = (((u64)virt_to_maddr(pi_desc)) >>
>> >> > +                           (32 - PDA_LOW_BIT)) &
>> PDA_MASK(LOW);
>> >> > +    new_ire->post.pda_h = (((u64)virt_to_maddr(pi_desc)) >> 32) &
>> >> > +                           PDA_MASK(HIGH);
>> >>
>> >> Looking at this another time I can't see what the and-ing with
>> >> PAD_MASK() is supposed to be good for.
>> >
>> > I cannot understand this well. Do you mean we don't need and PDA_MASK()
>> > here?
>> 
>> Correct - the bitfield width (where the data gets stored into) already
>> enforces the intended truncation afaict.
> 
> We may not need PDA_MASK(HIGH), but is PDA_MASK(LOW) really unnecessary 
> here?

I think so - iirc the bitfield is exactly the width you need (or if it
wasn't, that would be a mistake). What you can't get rid of is the
shifting.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
@ 2015-06-15  9:20 Wu, Feng
  2015-06-15  9:35 ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Wu, Feng @ 2015-06-15  9:20 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir, george.dunlap, andrew.cooper3, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, June 12, 2015 6:35 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used
> 
> >>> On 12.06.15 at 11:40, <feng.wu@intel.com> wrote:
> >> -----Original Message-----
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >>> On 08.05.15 at 11:07, <feng.wu@intel.com> wrote:
> >> > +static inline void setup_posted_irte(
> >> > +    struct iremap_entry *new_ire, struct pi_desc *pi_desc, uint8_t
> gvec)
> >> > +{
> >> > +    new_ire->post.urg = 0;
> >> > +    new_ire->post.vector = gvec;
> >> > +    new_ire->post.pda_l = (((u64)virt_to_maddr(pi_desc)) >>
> >> > +                           (32 - PDA_LOW_BIT)) &
> PDA_MASK(LOW);
> >> > +    new_ire->post.pda_h = (((u64)virt_to_maddr(pi_desc)) >> 32) &
> >> > +                           PDA_MASK(HIGH);
> >>
> >> Looking at this another time I can't see what the and-ing with
> >> PAD_MASK() is supposed to be good for.
> >
> > I cannot understand this well. Do you mean we don't need and PDA_MASK()
> > here?
> 
> Correct - the bitfield width (where the data gets stored into) already
> enforces the intended truncation afaict.

We may not need PDA_MASK(HIGH), but is PDA_MASK(LOW) really unnecessary here?

Thanks,
Feng

> 
> Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2015-06-17  9:14 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-08  9:07 [RFC v2 00/15] Add VT-d Posted-Interrupts support Feng Wu
2015-05-08  9:07 ` [RFC v2 01/15] Vt-d Posted-intterrupt (PI) design Feng Wu
2015-05-08  9:07 ` [RFC v2 02/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature Feng Wu
2015-06-09 14:10   ` Jan Beulich
2015-05-08  9:07 ` [RFC v2 03/15] vt-d: VT-d Posted-Interrupts feature detection Feng Wu
2015-06-09 14:15   ` Jan Beulich
2015-05-08  9:07 ` [RFC v2 04/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts Feng Wu
2015-06-09 14:17   ` Jan Beulich
2015-05-08  9:07 ` [RFC v2 05/15] vmx: Initialize VT-d Posted-Interrupts Descriptor Feng Wu
2015-06-09 14:20   ` Jan Beulich
2015-05-08  9:07 ` [RFC v2 06/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts Feng Wu
2015-06-09 14:25   ` Jan Beulich
2015-06-16  6:18     ` Wu, Feng
2015-06-16  7:48       ` Jan Beulich
2015-05-08  9:07 ` [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used Feng Wu
2015-06-09 14:32   ` Jan Beulich
2015-06-12  9:40     ` Wu, Feng
2015-06-12 10:33       ` Jan Beulich
2015-06-10  6:17   ` Jan Beulich
2015-06-12  9:40     ` Wu, Feng
2015-06-12 10:34       ` Jan Beulich
2015-05-08  9:07 ` [RFC v2 08/15] Update IRTE according to guest interrupt config changes Feng Wu
2015-06-09 15:06   ` Jan Beulich
2015-06-16  8:08     ` Wu, Feng
2015-06-16  8:17       ` Jan Beulich
2015-05-08  9:07 ` [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU Feng Wu
2015-06-09 15:08   ` Jan Beulich
2015-06-12  9:40     ` Wu, Feng
2015-06-12 10:41       ` Jan Beulich
2015-05-08  9:07 ` [RFC v2 10/15] vmx: Define two per-cpu variables Feng Wu
2015-06-09 15:12   ` Jan Beulich
2015-05-08  9:07 ` [RFC v2 11/15] vmx: Add a global wake-up vector for VT-d Posted-Interrupts Feng Wu
2015-06-09 15:20   ` Jan Beulich
2015-06-12  9:40     ` Wu, Feng
2015-06-12 10:44       ` Jan Beulich
2015-05-08  9:07 ` [RFC v2 12/15] vmx: Properly handle notification event when vCPU is running Feng Wu
2015-06-09 16:12   ` Jan Beulich
2015-05-08  9:07 ` [RFC v2 13/15] Update Posted-Interrupts Descriptor during vCPU scheduling Feng Wu
2015-06-10  6:43   ` Jan Beulich
2015-06-16  0:17     ` Wu, Feng
2015-06-16  7:53       ` Jan Beulich
2015-06-16  8:13         ` Wu, Feng
2015-06-16  8:29           ` Jan Beulich
2015-06-16  8:56             ` Wu, Feng
2015-06-16  9:24               ` Jan Beulich
2015-06-17  6:52                 ` Wu, Feng
2015-06-17  6:54     ` Wu, Feng
2015-06-17  7:56       ` Jan Beulich
2015-06-17  8:46         ` Wu, Feng
2015-06-17  8:59           ` Jan Beulich
2015-06-17  9:14             ` Wu, Feng
2015-06-17  7:00     ` Wu, Feng
2015-06-17  7:59       ` Jan Beulich
2015-05-08  9:07 ` [RFC v2 14/15] Suppress posting interrupts when 'SN' is set Feng Wu
2015-06-10  6:48   ` Jan Beulich
2015-05-08  9:07 ` [RFC v2 15/15] Add a command line parameter for VT-d posted-interrupts Feng Wu
2015-06-10  6:52   ` Jan Beulich
2015-05-13  5:11 ` [RFC v2 00/15] Add VT-d Posted-Interrupts support Wu, Feng
2015-05-13  6:50   ` Jan Beulich
2015-05-13  7:04     ` Wu, Feng
2015-05-18  5:33 ` Tian, Kevin
2015-06-15  9:20 [RFC v2 07/15] vt-d: Add API to update IRTE when VT-d PI is used Wu, Feng
2015-06-15  9:35 ` Jan Beulich
2015-06-15 11:38   ` Wu, Feng

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.