* [PATCH v7 00/17] Add VT-d Posted-Interrupts support
@ 2015-09-11  8:28 Feng Wu
  2015-09-11  8:28 ` [PATCH v7 01/17] VT-d Posted-interrupt (PI) design Feng Wu
                   ` (16 more replies)
  0 siblings, 17 replies; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Feng Wu

VT-d Posted-Interrupts is an enhancement to the CPU-side Posted-Interrupt
feature. With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention while the guest is running in non-root mode.

You can find the VT-d Posted-Interrupts spec at the following URL:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html

Feng Wu (17):
  VT-d Posted-interrupt (PI) design
  Add cmpxchg16b support for x86-64
  iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
  vt-d: VT-d Posted-Interrupts feature detection
  vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  vmx: Add some helper functions for Posted-Interrupts
  vmx: Initialize VT-d Posted-Interrupts Descriptor
  vmx: Suppress posting interrupts when 'SN' is set
  VT-d: Remove pointless casts
  vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
  vt-d: Add API to update IRTE when VT-d PI is used
  x86: move some APIC related macros to apicdef.h
  Update IRTE according to guest interrupt config changes
  vmx: Properly handle notification event when vCPU is running
  vmx: VT-d posted-interrupt core logic handling
  VT-d: Dump the posted format IRTE
  Add a command line parameter for VT-d posted-interrupts

 docs/misc/vtd-pi.txt                   | 332 +++++++++++++++++++++++++++++++++
 docs/misc/xen-command-line.markdown    |   9 +-
 xen/arch/x86/domain.c                  |  21 +++
 xen/arch/x86/hvm/hvm.c                 |   6 +
 xen/arch/x86/hvm/vlapic.c              |   5 -
 xen/arch/x86/hvm/vmx/vmcs.c            |  24 +++
 xen/arch/x86/hvm/vmx/vmx.c             | 312 ++++++++++++++++++++++++++++++-
 xen/common/schedule.c                  |   2 +
 xen/drivers/passthrough/io.c           | 118 +++++++++++-
 xen/drivers/passthrough/iommu.c        |  16 +-
 xen/drivers/passthrough/vtd/intremap.c | 213 ++++++++++++++++-----
 xen/drivers/passthrough/vtd/iommu.c    |  14 +-
 xen/drivers/passthrough/vtd/iommu.h    |  51 +++--
 xen/drivers/passthrough/vtd/utils.c    |  42 +++--
 xen/include/asm-arm/domain.h           |   2 +
 xen/include/asm-x86/apicdef.h          |   3 +
 xen/include/asm-x86/domain.h           |   3 +
 xen/include/asm-x86/hvm/hvm.h          |   4 +
 xen/include/asm-x86/hvm/vmx/vmcs.h     |  25 ++-
 xen/include/asm-x86/hvm/vmx/vmx.h      |  27 +++
 xen/include/asm-x86/iommu.h            |   2 +
 xen/include/asm-x86/x86_64/system.h    |  31 +++
 xen/include/xen/iommu.h                |   2 +-
 23 files changed, 1176 insertions(+), 88 deletions(-)
 create mode 100644 docs/misc/vtd-pi.txt

-- 
2.1.0


* [PATCH v7 01/17] VT-d Posted-interrupt (PI) design
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
@ 2015-09-11  8:28 ` Feng Wu
  2015-09-11  8:28 ` [PATCH v7 02/17] Add cmpxchg16b support for x86-64 Feng Wu
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
	Jan Beulich, Yang Zhang, Feng Wu

Add the design doc for VT-d PI.

CC: Kevin Tian <kevin.tian@intel.com>
CC: Yang Zhang <yang.z.zhang@intel.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Keir Fraser <keir@xen.org>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 docs/misc/vtd-pi.txt | 332 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 332 insertions(+)
 create mode 100644 docs/misc/vtd-pi.txt

diff --git a/docs/misc/vtd-pi.txt b/docs/misc/vtd-pi.txt
new file mode 100644
index 0000000..af5409a
--- /dev/null
+++ b/docs/misc/vtd-pi.txt
@@ -0,0 +1,332 @@
+Authors: Feng Wu <feng.wu@intel.com>
+
+VT-d Posted-interrupt (PI) design for Xen
+
+Background
+==========
+As virtualization develops, device assignment is required more and more
+often. However, today when a VM is running with assigned devices (such as
+a NIC), external interrupt handling for the assigned devices always needs
+VMM intervention.
+
+VT-d Posted-interrupt is an enhanced method for handling interrupts
+in the virtualization environment. Interrupt posting is the process by
+which an interrupt request is recorded in a memory-resident
+posted-interrupt-descriptor structure by the root-complex, followed by
+an optional notification event issued to the CPU complex.
+
+With VT-d Posted-interrupt we get the following advantages:
+- Direct delivery of external interrupts to running vCPUs without VMM
+intervention.
+- Reduced interrupt-migration complexity. On vCPU migration, software
+can atomically co-migrate all interrupts targeting the migrating vCPU. For
+virtual machines with assigned devices, migrating a vCPU across pCPUs
+either incurs the overhead of forwarding interrupts in software (e.g. via VMM
+generated IPIs), or the complexity of independently migrating each interrupt
+targeting the vCPU to the new pCPU. With VT-d PI enabled, however, the
+destination vCPU of an external interrupt from an assigned device is stored
+in the IRTE (i.e. the Posted-interrupt Descriptor Address); when the vCPU is
+migrated to another pCPU, we simply set the new pCPU in the 'NDST' field of
+the Posted-interrupt descriptor, which makes the interrupt migration
+automatic (see the illustration below).
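+
+For example, a migration then only needs an update along the lines of the
+sketch below (illustrative only; it ignores the xAPIC vs. x2APIC destination
+format and the atomicity the real update requires; 'new_pcpu' is a placeholder
+for the destination pCPU):
+
+    pi_desc->ndst = cpu_physical_id(new_pcpu);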
+
+Here is what Xen currently does for external interrupts from assigned devices:
+
+When a VM is running and an external interrupt from an assigned device occurs
+for it, a VM-Exit happens, then:
+
+vmx_do_extint() --> do_IRQ() --> __do_IRQ_guest() --> hvm_do_IRQ_dpci() -->
+raise_softirq_for(pirq_dpci) --> raise_softirq(HVM_DPCI_SOFTIRQ)
+
+softirq HVM_DPCI_SOFTIRQ is bound to dpci_softirq()
+
+dpci_softirq() --> hvm_dirq_assist() --> vmsi_deliver_pirq() --> vmsi_deliver() -->
+vmsi_inj_irq() --> vlapic_set_irq()
+
+vlapic_set_irq() does the following things (a simplified sketch follows):
+1. If CPU-side posted-interrupts are supported, call vmx_deliver_posted_intr()
+to deliver the virtual interrupt via the posted-interrupt infrastructure.
+2. Otherwise, set the related vIRR bit in the vLAPIC page and call vcpu_kick()
+to kick the related vCPU. Before VM-Entry, vmx_intr_assist() will inject the
+interrupt into the guest.
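+
+A minimal sketch of that dispatch (simplified pseudo-C; the real
+vlapic_set_irq() also handles trigger mode, EOI-exit bitmaps, etc.):
+
+void vlapic_set_irq(struct vlapic *vlapic, uint8_t vec, uint8_t trig)
+{
+    struct vcpu *target = vlapic_vcpu(vlapic);
+
+    if ( hvm_funcs.deliver_posted_intr )            /* CPU-side PI available */
+        hvm_funcs.deliver_posted_intr(target, vec); /* vmx_deliver_posted_intr() */
+    else if ( !vlapic_test_and_set_irr(vec, vlapic) )
+        vcpu_kick(target);                  /* injected via vmx_intr_assist() */
+}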
+
+However, once VT-d PI is supported, when a guest is running in non-root mode
+and an external interrupt from an assigned device occurs for it, no VM-Exit is
+needed; the guest can handle it entirely in non-root mode, thus avoiding all
+of the above code flow.
+
+Posted-interrupt Introduction
+=============================
+There are two components to the Posted-interrupt architecture:
+Processor Support and Root-Complex Support
+
+- Processor Support
+Posted-interrupt processing is a feature by which a processor processes
+the virtual interrupts by recording them as pending on the virtual-APIC
+page.
+
+Posted-interrupt processing is enabled by setting the process posted
+interrupts VM-execution control. The processing is performed in response
+to the arrival of an interrupt with the posted-interrupt notification vector.
+In response to such an interrupt, the processor processes virtual interrupts
+recorded in a data structure called a posted-interrupt descriptor.
+
+For more information about APICv and CPU-side Posted-interrupt, please refer
+to Chapter 29 (in particular Section 29.6) of the Intel SDM:
+http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
+
+- Root-Complex Support
+Interrupt posting is the process by which an interrupt request (from IOAPIC
+or MSI/MSI-X capable sources) is recorded in a memory-resident
+posted-interrupt-descriptor structure by the root-complex, followed by
+an optional notification event issued to the CPU complex. The interrupt
+request arriving at the root-complex carries the identity of the interrupt
+request source and a 'remapping-index'. The remapping-index is used to
+look up an entry in the memory-resident interrupt-remap-table. Unlike
+interrupt remapping, the interrupt-remap-table entry for a posted interrupt
+specifies a virtual-vector and a pointer to the posted-interrupt descriptor.
+The virtual-vector specifies the vector of the interrupt to be recorded in
+the posted-interrupt descriptor. The posted-interrupt descriptor hosts storage
+for the virtual-vectors and contains the attributes of the notification event
+(interrupt) to be issued to the CPU complex to inform CPU/software about pending
+interrupts recorded in the posted-interrupt descriptor.
+
+For more information about VT-d PI, please refer to
+http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
+
+Important Definitions
+=====================
+There are some changes to IRTE and posted-interrupt descriptor after
+VT-d PI is introduced:
+IRTE: Interrupt Remapping Table Entry
+Posted-interrupt Descriptor Address: the address of the posted-interrupt descriptor
+Virtual Vector: the guest vector of the interrupt
+URG: indicates if the interrupt is urgent
+
+Posted-interrupt descriptor:
+The Posted Interrupt Descriptor hosts the following fields:
+Posted Interrupt Request (PIR): Provide storage for posting (recording) interrupts (one bit
+per vector, for up to 256 vectors).
+
+Outstanding Notification (ON): Indicate if there is a notification event outstanding (not
+processed by processor or software) for this Posted Interrupt Descriptor. When this field is 0,
+hardware modifies it from 0 to 1 when generating a notification event, and the entity receiving
+the notification event (processor or software) resets it as part of posted interrupt processing.
+
+Suppress Notification (SN): Indicate if a notification event is to be suppressed (not
+generated) for non-urgent interrupt requests (interrupts processed through an IRTE with
+URG=0).
+
+Notification Vector (NV): Specify the vector for notification event (interrupt).
+
+Notification Destination (NDST): Specify the physical APIC-ID of the destination logical
+processor for the notification event.
+
+Design Overview
+==============
+In this design, we will cover the following items:
+1. Add a variable to control whether to enable VT-d posted-interrupts.
+2. VT-d PI feature detection.
+3. Extend the posted-interrupt descriptor structure to cover VT-d PI specific items.
+4. Extend the IRTE structure to support VT-d PI.
+5. Introduce a new global vector which is used for waking up blocked vCPUs.
+6. Update the IRTE when the guest modifies the interrupt configuration
+(MSI/MSI-X configuration).
+7. Update the posted-interrupt descriptor during vCPU scheduling (when the
+state of the vCPU transitions among RUNSTATE_running / RUNSTATE_blocked /
+RUNSTATE_runnable / RUNSTATE_offline).
+8. How to wake up a blocked vCPU when an interrupt is posted for it (wakeup
+notification handler).
+9. A new Xen boot command line parameter that lets the user control the VT-d
+PI feature.
+10. Multicast/broadcast and lowest-priority interrupt considerations.
+
+
+Implementation details
+======================
+- New variable to control VT-d PI
+
+Like the existing variable 'iommu_intremap' for interrupt remapping, it is
+straightforward to add a new one, 'iommu_intpost', for posted-interrupts.
+'iommu_intpost' is set only when interrupt remapping and VT-d posted-interrupts
+are both enabled.
+
+- VT-d PI feature detection.
+Bit 59 in the VT-d Capability Register reports VT-d Posted-interrupt support.
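+
+A capability-check macro for it (as added by the VT-d feature-detection patch
+in this series) looks like:
+
+#define cap_intr_post(c)       (((c) >> 59) & 1)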
+
+- Extend posted-interrupt descriptor structure to cover VT-d PI specific items.
+Here is the new structure for posted-interrupt descriptor:
+
+struct pi_desc {
+    DECLARE_BITMAP(pir, NR_VECTORS);
+    union {
+        struct
+        {
+        u16 on     : 1,  /* bit 256 - Outstanding Notification */
+            sn     : 1,  /* bit 257 - Suppress Notification */
+            rsvd_1 : 14; /* bit 271:258 - Reserved */
+        u8  nv;          /* bit 279:272 - Notification Vector */
+        u8  rsvd_2;      /* bit 287:280 - Reserved */
+        u32 ndst;        /* bit 319:288 - Notification Destination */
+        };
+        u64 control;
+    };
+    u32 rsvd[6];
+} __attribute__ ((aligned (64)));
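+
+Small helper functions (added by a later patch in this series) are used to
+manipulate this descriptor, for example:
+
+#define POSTED_INTR_ON  0
+#define POSTED_INTR_SN  1
+
+static inline int pi_test_on(struct pi_desc *pi_desc)
+{
+    return pi_desc->on;
+}
+
+static inline void pi_set_sn(struct pi_desc *pi_desc)
+{
+    set_bit(POSTED_INTR_SN, &pi_desc->control);
+}
+
+static inline void pi_clear_sn(struct pi_desc *pi_desc)
+{
+    clear_bit(POSTED_INTR_SN, &pi_desc->control);
+}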
+
+- Extend IRTE structure to support VT-d PI.
+
+Here is the new structure for IRTE:
+/* interrupt remap entry */
+struct iremap_entry {
+  union {
+    struct { u64 lo, hi; };
+    struct {
+        u16 p       : 1,
+            fpd     : 1,
+            dm      : 1,
+            rh      : 1,
+            tm      : 1,
+            dlm     : 3,
+            avail   : 4,
+            res_1   : 4;
+        u8  vector;
+        u8  res_2;
+        u32 dst;
+        u16 sid;
+        u16 sq      : 2,
+            svt     : 2,
+            res_3   : 12;
+        u32 res_4   : 32;
+    } remap;
+    struct {
+        u16 p       : 1,
+            fpd     : 1,
+            res_1   : 6,
+            avail   : 4,
+            res_2   : 2,
+            urg     : 1,
+            im      : 1;
+        u8  vector;
+        u8  res_3;
+        u32 res_4   : 6,
+            pda_l   : 26;
+        u16 sid;
+        u16 sq      : 2,
+            svt     : 2,
+            res_5   : 12;
+        u32 pda_h;
+    } post;
+  };
+};
+
+- Introduce a new global vector which is used to wake up the blocked vCPU.
+
+Currently, there is a global vector 'posted_intr_vector', which is used as the
+global notification vector for all vCPUs in the system. This vector is stored
+in the VMCS, and the CPU treats it as a _special_ vector, using it to notify
+the related pCPU when an interrupt is recorded in the posted-interrupt
+descriptor.
+
+This existing global vector is _special_ to the CPU, which handles it in a
+_special_ way compared to normal vectors; please refer to Section 29.6 of the
+Intel SDM
+http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
+for more information about how the CPU handles it.
+
+With VT-d PI, the VT-d engine can issue a notification event when an assigned
+device issues an interrupt. We need to add a new global vector to wake up
+blocked vCPUs; please refer to a later section of this design for how this new
+global vector is used.
+
+- Update IRTE when the guest modifies the interrupt configuration (MSI/MSI-X configuration).
+After VT-d PI is introduced, the format of IRTE is changed as follows:
+	Descriptor Address: the address of the posted-interrupt descriptor
+	Virtual Vector: the guest vector of the interrupt
+	URG: indicates if the interrupt is urgent
+	Other fields continue to have the same meaning
+
+'Descriptor Address' tells the destination vCPU of this interrupt, since
+each vCPU has a dedicated posted-interrupt descriptor.
+
+'Virtual Vector' tells the guest vector of the interrupt.
+
+When the guest changes the configuration of an interrupt, such as its CPU
+affinity or its vector, we need to update the associated IRTE accordingly; a
+minimal sketch of such an update follows.
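+
+An illustrative sketch of switching an IRTE into posted format (assuming the
+iremap_entry layout above, extended with a __uint128_t 'val' union member and
+the cmpxchg16b() helper added later in this series; 'ire', 'pi_desc' and
+'gvec' stand for the IRTE to update, the target vCPU's posted-interrupt
+descriptor and the guest vector):
+
+    struct iremap_entry old_ire = *ire, new_ire = { };
+    uint64_t pda = virt_to_maddr(pi_desc);  /* descriptor is 64-byte aligned */
+
+    new_ire.post.p   = old_ire.remap.p;     /* carry over present bit */
+    new_ire.post.fpd = old_ire.remap.fpd;
+    new_ire.post.sid = old_ire.remap.sid;   /* source-id validation fields */
+    new_ire.post.sq  = old_ire.remap.sq;
+    new_ire.post.svt = old_ire.remap.svt;
+    new_ire.post.im  = 1;                   /* posted format */
+    new_ire.post.vector = gvec;             /* guest (virtual) vector */
+    new_ire.post.pda_l = pda >> 6;          /* PDA bits [31:6] */
+    new_ire.post.pda_h = pda >> 32;         /* PDA bits [63:32] */
+
+    cmpxchg16b(&ire->val, &old_ire.val, &new_ire.val);  /* publish atomically */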
+
+- Update posted-interrupt descriptor during vCPU scheduling
+
+The basic idea here is (a sketch in pseudo-C follows the list):
+1. When the vCPU's state is RUNSTATE_running,
+        - Set 'NV' to 'posted_intr_vector'.
+        - Clear 'SN' to accept posted-interrupts.
+        - Set 'NDST' to the pCPU on which the vCPU will be running.
+2. When the vCPU's state is RUNSTATE_blocked,
+        - Set 'NV' to 'pi_wakeup_vector', so we can wake up the
+          related vCPU when a posted interrupt arrives for it.
+          Please refer to the above section about the new global vector.
+        - Clear 'SN' to accept posted-interrupts.
+3. When the vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
+        - Set 'SN' to suppress non-urgent interrupts.
+          (Currently, we only support non-urgent interrupts.)
+          While the vCPU is in RUNSTATE_runnable or RUNSTATE_offline there
+          is no need to accept posted-interrupt notification events, since
+          we do not change the scheduler's behaviour when an interrupt
+          occurs; we still wait for the next scheduling of the vCPU. When
+          external interrupts from assigned devices occur, they are recorded
+          in PIR and synced to the vIRR before VM-Entry.
+        - Set 'NV' to 'posted_intr_vector'.
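+
+A minimal, non-atomic sketch of the updates described above (assuming the
+pi_desc layout and the pi_set_sn()/pi_clear_sn() helpers shown earlier; 'v' is
+the vCPU, 'pi_desc' points to its descriptor, 'new_runstate' is the state it
+is entering; the real code must update the descriptor fields atomically):
+
+    unsigned int dest = cpu_physical_id(v->processor);
+
+    switch ( new_runstate )
+    {
+    case RUNSTATE_running:
+        pi_clear_sn(pi_desc);
+        pi_desc->nv = posted_intr_vector;
+        pi_desc->ndst = x2apic_enabled ? dest
+                                       : MASK_INSR(dest, PI_xAPIC_NDST_MASK);
+        break;
+    case RUNSTATE_blocked:
+        pi_clear_sn(pi_desc);
+        pi_desc->nv = pi_wakeup_vector;  /* wake the vCPU from the handler */
+        break;
+    default:                     /* RUNSTATE_runnable / RUNSTATE_offline */
+        pi_set_sn(pi_desc);      /* suppress non-urgent notifications */
+        pi_desc->nv = posted_intr_vector;
+        break;
+    }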
+
+- How to wake up a blocked vCPU when an interrupt is posted for it (wakeup notification handler).
+
+Here is the scenario for the usage of the new global vector:
+
+1. vCPU0 is running on pCPU0
+2. vCPU0 is blocked and vCPU1 is currently running on pCPU0
+3. An external interrupt from an assigned device occurs for vCPU0. If we
+still used 'posted_intr_vector' as the notification vector for vCPU0, the
+notification event for vCPU0 (the event goes to pCPU0) would be consumed
+by vCPU1 incorrectly (remember this is a special vector to the CPU). The worst
+case is that vCPU0 will never be woken up again, since the wakeup event
+for it is always consumed by other vCPUs incorrectly. So we need to introduce
+another global vector, named 'pi_wakeup_vector', to wake up the blocked vCPU.
+
+After using 'pi_wakeup_vector' for vCPU0, the VT-d engine will issue the
+notification event using this new vector. Since this new vector is not a
+special one to the CPU, it is just a normal vector: the CPU simply receives
+a normal external interrupt, and we get control in the handler of this new
+vector. There the hypervisor can do what is needed, such as waking up the
+blocked vCPU.
+
+Here is what we do for blocked vCPUs:
+1. Define a per-cpu list 'pi_blocked_vcpu', which stores the blocked
+vCPUs of that pCPU.
+2. When a vCPU's state is changed to RUNSTATE_blocked, insert the vCPU
+into the per-cpu list belonging to the pCPU it was running on.
+3. When the vCPU is unblocked, remove the vCPU from the related pCPU list.
+
+In the handler of 'pi_wakeup_vector', we do the following (a sketch follows):
+1. Get the current physical CPU.
+2. Iterate over the 'pi_blocked_vcpu' list of the current pCPU; for each
+entry whose 'ON' bit is set, unblock the associated vCPU.
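+
+A sketch of the handler (illustrative; the per-cpu list, lock and list-member
+names are placeholders, not the final implementation):
+
+static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
+{
+    unsigned int cpu = smp_processor_id();
+    struct vcpu *v;
+
+    ack_APIC_irq();
+
+    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
+
+    list_for_each_entry ( v, &per_cpu(pi_blocked_vcpu, cpu), pi_blocked_list )
+        if ( pi_test_on(&v->arch.hvm_vmx.pi_desc) )
+            vcpu_unblock(v);
+
+    spin_unlock(&per_cpu(pi_blocked_vcpu_lock, cpu));
+}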
+
+- New Xen boot command line parameter, which lets the user control the VT-d PI feature.
+
+Like 'intremap' for interrupt remapping, we add a new boot command line
+option 'intpost' for posted-interrupts.
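+
+As an illustration (assuming 'intpost' is parsed as a sub-option of the
+existing 'iommu=' command line option, as the changelog of the iommu_intpost
+patch suggests), the feature would be enabled with something like:
+
+    iommu=intpost
+
+noting that it only takes effect when interrupt remapping is also enabled.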
+
+- Multicast/broadcast and lowest-priority interrupt considerations.
+
+With VT-d PI, the destination vCPU information of an external interrupt
+from an assigned device is stored in the IRTE; this leads to the following
+design considerations:
+1. Multicast/broadcast interrupts cannot be posted.
+2. For lowest-priority interrupts, newer Intel CPUs/chipsets/root-complexes
+(starting from Nehalem) ignore the TPR value, and instead support two other
+ways (configurable by BIOS) of handling lowest-priority interrupts:
+	A) Round robin: In this method, the chipset simply delivers lowest-priority
+interrupts in a round-robin manner across all the available logical CPUs. While
+this provides good load balancing, it is not always ideal, as interrupts from
+the same device (like a NIC) end up running on all the CPUs, thrashing caches
+and taking locks. This led to the next scheme.
+	B) Vector hashing: In this method, hardware would apply a hash function
+on the vector value in the interrupt request, and use that hash to pick a logical
+CPU to route the lowest priority interrupt. This way, a given vector always goes
+to the same logical CPU, avoiding the thrashing problem above.
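+
+Conceptually, vector hashing amounts to something like the following (an
+illustrative formula only, not how any particular chipset implements the
+hash):
+
+    target_cpu = enabled_cpus[hash(vector) % nr_enabled_cpus];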
+
+So, the gist of the above is that lowest-priority interrupts have never
+really been delivered as "lowest priority" by recent physical hardware.
+
+Vector hashing is used in this design.
-- 
2.1.0


* [PATCH v7 02/17] Add cmpxchg16b support for x86-64
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
  2015-09-11  8:28 ` [PATCH v7 01/17] VT-d Posted-interrupt (PI) design Feng Wu
@ 2015-09-11  8:28 ` Feng Wu
  2015-09-22 13:50   ` Jan Beulich
  2015-09-11  8:28 ` [PATCH v7 03/17] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature Feng Wu
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Keir Fraser, Feng Wu, Jan Beulich, Andrew Cooper

This patch adds cmpxchg16b support for x86-64, so software
can perform 128-bit atomic write/read.
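
A minimal usage sketch (illustrative; 'p128' and 'compute_new_value()' are
placeholder names for a 16-byte-aligned __uint128_t location and its intended
new value):

    __uint128_t old = *p128, new = compute_new_value(old);

    if ( cmpxchg16b(p128, &old, &new) != old )
    {
        /* Another CPU updated *p128 first: re-read and retry. */
    }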

CC: Keir Fraser <keir@xen.org>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v7:
- Make the last two parameters of __cmpxchg16b() const
- Remove memory clobber
- Add run-time and build-time checks in cmpxchg16b()
- Cast the last two parameters to void * when calling __cmpxchg16b()

v6:
- Fix a typo

v5:
- Change back the parameters of __cmpxchg16b() to __uint128_t *
- Remove pointless cast for 'ptr'
- Remove pointless parentheses
- Use A constraint for the output

v4:
- Use pointer as the parameter of __cmpxchg16b().
- Use gcc's __uint128_t built-in type
- Make the parameters of __cmpxchg16b() void *

v3:
- Newly added.

 xen/include/asm-x86/x86_64/system.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/xen/include/asm-x86/x86_64/system.h b/xen/include/asm-x86/x86_64/system.h
index 662813a..defb770 100644
--- a/xen/include/asm-x86/x86_64/system.h
+++ b/xen/include/asm-x86/x86_64/system.h
@@ -6,6 +6,37 @@
                                    (unsigned long)(n),sizeof(*(ptr))))
 
 /*
+ * Atomic 16 bytes compare and exchange.  Compare OLD with MEM, if
+ * identical, store NEW in MEM.  Return the initial value in MEM.
+ * Success is indicated by comparing RETURN with OLD.
+ *
+ * This function can only be called when cpu_has_cx16 is true.
+ */
+
+static always_inline __uint128_t __cmpxchg16b(
+    volatile void *ptr, const __uint128_t *old, const __uint128_t *new)
+{
+    __uint128_t prev;
+    uint64_t new_high = *new >> 64;
+    uint64_t new_low = (uint64_t)*new;
+
+    ASSERT(cpu_has_cx16);
+
+    asm volatile ( "lock; cmpxchg16b %3"
+                   : "=A" (prev)
+                   : "c" (new_high), "b" (new_low), "m" (*__xg(ptr)), "0" (*old)
+                 );
+
+    return prev;
+}
+
+#define cmpxchg16b(ptr,o,n)                                            \
+    ( ({ ASSERT(((unsigned long)ptr & 0xF) == 0); }),                  \
+      (BUILD_BUG_ON(sizeof(*o) != sizeof(__uint128_t))),               \
+      (BUILD_BUG_ON(sizeof(*n) != sizeof(__uint128_t))),               \
+      (__cmpxchg16b((ptr), (void *)(o), (void *)(n))) )
+
+/*
  * This function causes value _o to be changed to _n at location _p.
  * If this access causes a fault then we return 1, otherwise we return 0.
  * If no fault occurs then _o is updated to the value we saw at _p. If this
-- 
2.1.0


* [PATCH v7 03/17] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
  2015-09-11  8:28 ` [PATCH v7 01/17] VT-d Posted-interrupt (PI) design Feng Wu
  2015-09-11  8:28 ` [PATCH v7 02/17] Add cmpxchg16b support for x86-64 Feng Wu
@ 2015-09-11  8:28 ` Feng Wu
  2015-09-11  8:28 ` [PATCH v7 04/17] vt-d: VT-d Posted-Interrupts feature detection Feng Wu
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Kevin Tian, Feng Wu, Jan Beulich

VT-d Posted-Interrupts is an enhancement to the CPU-side Posted-Interrupt
feature. With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention while the guest is running in non-root mode.

This patch adds the variable 'iommu_intpost' to control whether to enable
VT-d posted-interrupts in the generic IOMMU code.

CC: Jan Beulich <jbeulich@suse.com>
CC: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
v5:
- Remove the "if no intremap then no intpost" logic in parse_iommu_param(), which
  can be covered in iommu_setup()

v3:
- Remove pointless initializer for 'iommu_intpost'.
- Some adjustments to the "if no intremap then no intpost" logic.
    * For parse_iommu_param(), move it to the end of the function,
      so we don't need to add the same logic when introducing the
      new kernel parameter 'intpost' in a later patch.
    * Add this logic in iommu_setup() after iommu_hardware_setup()
      is called.

 xen/drivers/passthrough/iommu.c | 13 ++++++++++++-
 xen/include/xen/iommu.h         |  2 +-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index fc7831e..36d5cc0 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -51,6 +51,14 @@ bool_t __read_mostly iommu_passthrough;
 bool_t __read_mostly iommu_snoop = 1;
 bool_t __read_mostly iommu_qinval = 1;
 bool_t __read_mostly iommu_intremap = 1;
+
+/*
+ * In the current implementation of VT-d posted interrupts, in some extreme
+ * cases the per-cpu list which holds the blocked vCPUs can become very long,
+ * and this will affect the interrupt latency, so leave this feature off by
+ * default until we find a good solution to resolve it.
+ */
+bool_t __read_mostly iommu_intpost;
 bool_t __read_mostly iommu_hap_pt_share = 1;
 bool_t __read_mostly iommu_debug;
 bool_t __read_mostly amd_iommu_perdev_intremap = 1;
@@ -307,6 +315,9 @@ int __init iommu_setup(void)
         panic("Couldn't enable %s and iommu=required/force",
               !iommu_enabled ? "IOMMU" : "Interrupt Remapping");
 
+    if ( !iommu_intremap )
+        iommu_intpost = 0;
+
     if ( !iommu_enabled )
     {
         iommu_snoop = 0;
@@ -374,7 +385,7 @@ void iommu_crash_shutdown(void)
     const struct iommu_ops *ops = iommu_get_ops();
     if ( iommu_enabled )
         ops->crash_shutdown();
-    iommu_enabled = iommu_intremap = 0;
+    iommu_enabled = iommu_intremap = iommu_intpost = 0;
 }
 
 int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 8f3a20e..1f5d04a 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -30,7 +30,7 @@
 extern bool_t iommu_enable, iommu_enabled;
 extern bool_t force_iommu, iommu_verbose;
 extern bool_t iommu_workaround_bios_bug, iommu_igfx, iommu_passthrough;
-extern bool_t iommu_snoop, iommu_qinval, iommu_intremap;
+extern bool_t iommu_snoop, iommu_qinval, iommu_intremap, iommu_intpost;
 extern bool_t iommu_hap_pt_share;
 extern bool_t iommu_debug;
 extern bool_t amd_iommu_perdev_intremap;
-- 
2.1.0


* [PATCH v7 04/17] vt-d: VT-d Posted-Interrupts feature detection
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
                   ` (2 preceding siblings ...)
  2015-09-11  8:28 ` [PATCH v7 03/17] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature Feng Wu
@ 2015-09-11  8:28 ` Feng Wu
  2015-09-22 14:18   ` Jan Beulich
  2015-09-11  8:28 ` [PATCH v7 05/17] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts Feng Wu
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Yang Zhang, Kevin Tian, Feng Wu

VT-d Posted-Interrupts is an enhancement to the CPU-side Posted-Interrupt
feature. With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention while the guest is running in non-root mode.

CC: Yang Zhang <yang.z.zhang@intel.com>
CC: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v7:
- Remove pointless "if no iommu_intremap then disable iommu_intpost" logic
- Don't need to check !iommu_intremap or !iommu_intpost when setting iommu_intpost to 0

v5:
- Remove blank line

v4:
- Correct a logic error when setting iommu_intpost to 0

v3:
- Remove the "if no intremap then no intpost" logic in
  intel_vtd_setup(), it is covered in the iommu_setup().
- Add "if no intremap then no intpost" logic in the end
  of init_vtd_hw() which is called by vtd_resume().

So the logic exists in the following three places:
- parse_iommu_param()
- iommu_setup()
- init_vtd_hw()

 xen/drivers/passthrough/vtd/iommu.c | 14 ++++++++++++--
 xen/drivers/passthrough/vtd/iommu.h |  1 +
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 1dffc40..8dee731 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2147,8 +2147,8 @@ int __init intel_vtd_setup(void)
     }
 
     /* We enable the following features only if they are supported by all VT-d
-     * engines: Snoop Control, DMA passthrough, Queued Invalidation and
-     * Interrupt Remapping.
+     * engines: Snoop Control, DMA passthrough, Queued Invalidation, Interrupt
+     * Remapping, and Posted Interrupt
      */
     for_each_drhd_unit ( drhd )
     {
@@ -2176,6 +2176,14 @@ int __init intel_vtd_setup(void)
         if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
             iommu_intremap = 0;
 
+        /*
+         * We cannot use posted interrupt if X86_FEATURE_CX16 is
+         * not supported, since we count on this feature to
+         * atomically update 16-byte IRTE in posted format.
+         */
+        if ( !cap_intr_post(iommu->cap) || !cpu_has_cx16 )
+            iommu_intpost = 0;
+
         if ( !vtd_ept_page_compatible(iommu) )
             iommu_hap_pt_share = 0;
 
@@ -2201,6 +2209,7 @@ int __init intel_vtd_setup(void)
     P(iommu_passthrough, "Dom0 DMA Passthrough");
     P(iommu_qinval, "Queued Invalidation");
     P(iommu_intremap, "Interrupt Remapping");
+    P(iommu_intpost, "Posted Interrupt");
     P(iommu_hap_pt_share, "Shared EPT tables");
 #undef P
 
@@ -2220,6 +2229,7 @@ int __init intel_vtd_setup(void)
     iommu_passthrough = 0;
     iommu_qinval = 0;
     iommu_intremap = 0;
+    iommu_intpost = 0;
     return ret;
 }
 
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index ac71ed1..22abefe 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -61,6 +61,7 @@
 /*
  * Decoding Capability Register
  */
+#define cap_intr_post(c)       (((c) >> 59) & 1)
 #define cap_read_drain(c)      (((c) >> 55) & 1)
 #define cap_write_drain(c)     (((c) >> 54) & 1)
 #define cap_max_amask_val(c)   (((c) >> 48) & 0x3f)
-- 
2.1.0


* [PATCH v7 05/17] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
                   ` (3 preceding siblings ...)
  2015-09-11  8:28 ` [PATCH v7 04/17] vt-d: VT-d Posted-Interrupts feature detection Feng Wu
@ 2015-09-11  8:28 ` Feng Wu
  2015-09-22 14:20   ` Jan Beulich
  2015-09-11  8:28 ` [PATCH v7 06/17] vmx: Add some helper functions for Posted-Interrupts Feng Wu
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Keir Fraser, Kevin Tian, Feng Wu, Jan Beulich, Andrew Cooper

Extend struct pi_desc according to VT-d Posted-Interrupts Spec.

CC: Kevin Tian <kevin.tian@intel.com>
CC: Keir Fraser <keir@xen.org>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v7:
- Coding style.

v3:
- Use u32 instead of u64 for the bitfield in 'struct pi_desc'

 xen/include/asm-x86/hvm/vmx/vmcs.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index f1126d4..b7f78e3 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -80,8 +80,18 @@ struct vmx_domain {
 
 struct pi_desc {
     DECLARE_BITMAP(pir, NR_VECTORS);
-    u32 control;
-    u32 rsvd[7];
+    union {
+        struct {
+        u16     on     : 1,  /* bit 256 - Outstanding Notification */
+                sn     : 1,  /* bit 257 - Suppress Notification */
+                rsvd_1 : 14; /* bit 271:258 - Reserved */
+        u8      nv;          /* bit 279:272 - Notification Vector */
+        u8      rsvd_2;      /* bit 287:280 - Reserved */
+        u32     ndst;        /* bit 319:288 - Notification Destination */
+        };
+        u64 control;
+    };
+    u32 rsvd[6];
 } __attribute__ ((aligned (64)));
 
 #define ept_get_wl(ept)   ((ept)->ept_wl)
-- 
2.1.0


* [PATCH v7 06/17] vmx: Add some helper functions for Posted-Interrupts
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
                   ` (4 preceding siblings ...)
  2015-09-11  8:28 ` [PATCH v7 05/17] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts Feng Wu
@ 2015-09-11  8:28 ` Feng Wu
  2015-09-11  8:28 ` [PATCH v7 07/17] vmx: Initialize VT-d Posted-Interrupts Descriptor Feng Wu
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Keir Fraser, Kevin Tian, Feng Wu, Jan Beulich, Andrew Cooper

This patch adds some helper functions to manipulate the
Posted-Interrupts Descriptor.

CC: Kevin Tian <kevin.tian@intel.com>
CC: Keir Fraser <keir@xen.org>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v7:
- Use bitfield in pi_test_on() and pi_test_sn()

v4:
- Newly added

 xen/include/asm-x86/hvm/vmx/vmx.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 3fbfa44..8d91110 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -101,6 +101,7 @@ void vmx_update_cpu_exec_control(struct vcpu *v);
 void vmx_update_secondary_exec_control(struct vcpu *v);
 
 #define POSTED_INTR_ON  0
+#define POSTED_INTR_SN  1
 static inline int pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
 {
     return test_and_set_bit(vector, pi_desc->pir);
@@ -121,11 +122,31 @@ static inline int pi_test_and_clear_on(struct pi_desc *pi_desc)
     return test_and_clear_bit(POSTED_INTR_ON, &pi_desc->control);
 }
 
+static inline int pi_test_on(struct pi_desc *pi_desc)
+{
+    return pi_desc->on;
+}
+
 static inline unsigned long pi_get_pir(struct pi_desc *pi_desc, int group)
 {
     return xchg(&pi_desc->pir[group], 0);
 }
 
+static inline int pi_test_sn(struct pi_desc *pi_desc)
+{
+    return pi_desc->sn;
+}
+
+static inline void pi_set_sn(struct pi_desc *pi_desc)
+{
+    set_bit(POSTED_INTR_SN, &pi_desc->control);
+}
+
+static inline void pi_clear_sn(struct pi_desc *pi_desc)
+{
+    clear_bit(POSTED_INTR_SN, &pi_desc->control);
+}
+
 /*
  * Exit Reasons
  */
-- 
2.1.0


* [PATCH v7 07/17] vmx: Initialize VT-d Posted-Interrupts Descriptor
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
                   ` (5 preceding siblings ...)
  2015-09-11  8:28 ` [PATCH v7 06/17] vmx: Add some helper functions for Posted-Interrupts Feng Wu
@ 2015-09-11  8:28 ` Feng Wu
  2015-09-11  8:28 ` [PATCH v7 08/17] vmx: Suppress posting interrupts when 'SN' is set Feng Wu
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Keir Fraser, Kevin Tian, Feng Wu, Jan Beulich, Andrew Cooper

This patch initializes the VT-d Posted-interrupt Descriptor.

CC: Kevin Tian <kevin.tian@intel.com>
CC: Keir Fraser <keir@xen.org>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v7:
- Add comments to function 'pi_desc_init' to clarify why we
  update the posted-interrupt descriptor in a non-atomic way
  there.

v3:
- Move pi_desc_init() to xen/arch/x86/hvm/vmx/vmcs.c
- Remove the 'inline' flag of pi_desc_init()

 xen/arch/x86/hvm/vmx/vmcs.c       | 22 ++++++++++++++++++++++
 xen/include/asm-x86/hvm/vmx/vmx.h |  2 ++
 2 files changed, 24 insertions(+)

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index a0a97e7..5f67797 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -39,6 +39,7 @@
 #include <asm/flushtlb.h>
 #include <asm/shadow.h>
 #include <asm/tboot.h>
+#include <asm/apic.h>
 
 static bool_t __read_mostly opt_vpid_enabled = 1;
 boolean_param("vpid", opt_vpid_enabled);
@@ -951,6 +952,24 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 vmcs_encoding, u64 val)
     virtual_vmcs_exit(vvmcs);
 }
 
+/*
+ * This function is only called in a vCPU's initialization phase,
+ * so we can update the posted-interrupt descriptor in a non-atomic way.
+ */
+static void pi_desc_init(struct vcpu *v)
+{
+    uint32_t dest;
+
+    v->arch.hvm_vmx.pi_desc.nv = posted_intr_vector;
+
+    dest = cpu_physical_id(v->processor);
+
+    if ( x2apic_enabled )
+        v->arch.hvm_vmx.pi_desc.ndst = dest;
+    else
+        v->arch.hvm_vmx.pi_desc.ndst = MASK_INSR(dest, PI_xAPIC_NDST_MASK);
+}
+
 static int construct_vmcs(struct vcpu *v)
 {
     struct domain *d = v->domain;
@@ -1089,6 +1108,9 @@ static int construct_vmcs(struct vcpu *v)
 
     if ( cpu_has_vmx_posted_intr_processing )
     {
+        if ( iommu_intpost )
+            pi_desc_init(v);
+
         __vmwrite(PI_DESC_ADDR, virt_to_maddr(&v->arch.hvm_vmx.pi_desc));
         __vmwrite(POSTED_INTR_NOTIFICATION_VECTOR, posted_intr_vector);
     }
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 8d91110..70b254f 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -88,6 +88,8 @@ typedef enum {
 #define EPT_EMT_WB              6
 #define EPT_EMT_RSV2            7
 
+#define PI_xAPIC_NDST_MASK      0xFF00
+
 void vmx_asm_vmexit_handler(struct cpu_user_regs);
 void vmx_asm_do_vmentry(void);
 void vmx_intr_assist(void);
-- 
2.1.0


* [PATCH v7 08/17] vmx: Suppress posting interrupts when 'SN' is set
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
                   ` (6 preceding siblings ...)
  2015-09-11  8:28 ` [PATCH v7 07/17] vmx: Initialize VT-d Posted-Interrupts Descriptor Feng Wu
@ 2015-09-11  8:28 ` Feng Wu
  2015-09-22 14:23   ` Jan Beulich
  2015-09-11  8:28 ` [PATCH v7 09/17] VT-d: Remove pointless casts Feng Wu
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Keir Fraser, Kevin Tian, Feng Wu, Jan Beulich, Andrew Cooper

Currently, we don't support urgent interrupts; all interrupts
are recognized as non-urgent, so we cannot send a
posted-interrupt notification when 'SN' is set.

CC: Kevin Tian <kevin.tian@intel.com>
CC: Keir Fraser <keir@xen.org>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v7:
- Coding style
- Read the current pi_desc.control as the initial value of prev.control

v6:
- Add some comments

v5:
- keep the vcpu_kick() at the end of vmx_deliver_posted_intr()
- Keep the 'return' after calling __vmx_deliver_posted_interrupt()

v4:
- Coding style.
- V3 removes a vcpu_kick() from the eoi_exitmap_changed path
  incorrectly, fix it.

v3:
- use cmpxchg to test SN/ON and set ON

 xen/arch/x86/hvm/vmx/vmx.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index c32d863..5f01629 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1701,8 +1701,35 @@ static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
          */
         pi_set_on(&v->arch.hvm_vmx.pi_desc);
     }
-    else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
+    else
     {
+        struct pi_desc old, new, prev;
+
+        prev.control = v->arch.hvm_vmx.pi_desc.control;
+
+        do {
+            /*
+             * Currently, we don't support urgent interrupts; all
+             * interrupts are recognized as non-urgent, so we do not send
+             * a posted-interrupt notification when 'SN' is set.
+             * Besides that, if 'ON' is already set, a notification is
+             * already pending, so there is no need to send another one.
+             */
+            if ( pi_test_sn(&prev) || pi_test_on(&prev) )
+            {
+                vcpu_kick(v);
+                return;
+            }
+
+            old.control = v->arch.hvm_vmx.pi_desc.control &
+                          ~(1 << POSTED_INTR_ON | 1 << POSTED_INTR_SN);
+            new.control = v->arch.hvm_vmx.pi_desc.control |
+                          (1 << POSTED_INTR_ON);
+
+            prev.control = cmpxchg(&v->arch.hvm_vmx.pi_desc.control,
+                                   old.control, new.control);
+        } while ( prev.control != old.control );
+
         __vmx_deliver_posted_interrupt(v);
         return;
     }
-- 
2.1.0


* [PATCH v7 09/17] VT-d: Remove pointless casts
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
                   ` (7 preceding siblings ...)
  2015-09-11  8:28 ` [PATCH v7 08/17] vmx: Suppress posting interrupts when 'SN' is set Feng Wu
@ 2015-09-11  8:28 ` Feng Wu
  2015-09-22 14:30   ` Jan Beulich
  2015-09-11  8:28 ` [PATCH v7 10/17] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts Feng Wu
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Yang Zhang, Kevin Tian, Feng Wu

Remove pointless casts.

CC: Yang Zhang <yang.z.zhang@intel.com>
CC: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v7:
- Remove an 'u32' casting omitted in v5

v5:
- Newly added.

 xen/drivers/passthrough/vtd/utils.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/utils.c b/xen/drivers/passthrough/vtd/utils.c
index 44c4ef5..a75059f 100644
--- a/xen/drivers/passthrough/vtd/utils.c
+++ b/xen/drivers/passthrough/vtd/utils.c
@@ -234,10 +234,9 @@ static void dump_iommu_info(unsigned char key)
                     continue;
                 printk("  %04x:  %x   %x  %04x %08x %02x    %x   %x  %x  %x  %x"
                     "   %x %x\n", i,
-                    (u32)p->hi.svt, (u32)p->hi.sq, (u32)p->hi.sid,
-                    (u32)p->lo.dst, (u32)p->lo.vector, (u32)p->lo.avail,
-                    (u32)p->lo.dlm, (u32)p->lo.tm, (u32)p->lo.rh,
-                    (u32)p->lo.dm, (u32)p->lo.fpd, (u32)p->lo.p);
+                    p->hi.svt, p->hi.sq, p->hi.sid, p->lo.dst, p->lo.vector,
+                    p->lo.avail, p->lo.dlm, p->lo.tm, p->lo.rh, p->lo.dm,
+                    p->lo.fpd, p->lo.p);
                 print_cnt++;
             }
             if ( iremap_entries )
@@ -281,11 +280,10 @@ static void dump_iommu_info(unsigned char key)
 
                 printk("   %02x:  %04x   %x    %x   %x   %x   %x    %x"
                     "    %x     %02x\n", i,
-                    (u32)remap->index_0_14 | ((u32)remap->index_15 << 15),
-                    (u32)remap->format, (u32)remap->mask, (u32)remap->trigger,
-                    (u32)remap->irr, (u32)remap->polarity,
-                    (u32)remap->delivery_status, (u32)remap->delivery_mode,
-                    (u32)remap->vector);
+                    remap->index_0_14 | (remap->index_15 << 15),
+                    remap->format, remap->mask, remap->trigger, remap->irr,
+                    remap->polarity, remap->delivery_status, remap->delivery_mode,
+                    remap->vector);
             }
         }
     }
-- 
2.1.0


* [PATCH v7 10/17] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
                   ` (8 preceding siblings ...)
  2015-09-11  8:28 ` [PATCH v7 09/17] VT-d: Remove pointless casts Feng Wu
@ 2015-09-11  8:28 ` Feng Wu
  2015-09-22 14:28   ` Jan Beulich
  2015-09-11  8:29 ` [PATCH v7 11/17] vt-d: Add API to update IRTE when VT-d PI is used Feng Wu
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:28 UTC (permalink / raw)
  To: xen-devel; +Cc: Yang Zhang, Kevin Tian, Feng Wu

Extend struct iremap_entry according to VT-d Posted-Interrupts Spec.

CC: Yang Zhang <yang.z.zhang@intel.com>
CC: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
---
v7:
- Add a __uint128_t member to the union in struct iremap_entry

v4:
- res_4 is not a bitfield, correct it.
- Expose 'im' to remapped irte as well.

v3:
- Use u32 instead of u64 to define the bitfields in 'struct iremap_entry'
- Limit using bitfield if possible

 xen/drivers/passthrough/vtd/intremap.c | 92 +++++++++++++++++-----------------
 xen/drivers/passthrough/vtd/iommu.h    | 44 ++++++++++------
 xen/drivers/passthrough/vtd/utils.c    |  8 +--
 3 files changed, 81 insertions(+), 63 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index 987bbe9..e9fffa6 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -122,9 +122,9 @@ static u16 hpetid_to_bdf(unsigned int hpet_id)
 static void set_ire_sid(struct iremap_entry *ire,
                         unsigned int svt, unsigned int sq, unsigned int sid)
 {
-    ire->hi.svt = svt;
-    ire->hi.sq = sq;
-    ire->hi.sid = sid;
+    ire->remap.svt = svt;
+    ire->remap.sq = sq;
+    ire->remap.sid = sid;
 }
 
 static void set_ioapic_source_id(int apic_id, struct iremap_entry *ire)
@@ -219,7 +219,7 @@ static unsigned int alloc_remap_entry(struct iommu *iommu, unsigned int nr)
         else
             p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];
 
-        if ( p->lo_val || p->hi_val ) /* not a free entry */
+        if ( p->lo || p->hi ) /* not a free entry */
             found = 0;
         else if ( ++found == nr )
             break;
@@ -253,7 +253,7 @@ static int remap_entry_to_ioapic_rte(
     GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
                      iremap_entries, iremap_entry);
 
-    if ( iremap_entry->hi_val == 0 && iremap_entry->lo_val == 0 )
+    if ( iremap_entry->hi == 0 && iremap_entry->lo == 0 )
     {
         dprintk(XENLOG_ERR VTDPREFIX,
                 "%s: index (%d) get an empty entry!\n",
@@ -263,13 +263,13 @@ static int remap_entry_to_ioapic_rte(
         return -EFAULT;
     }
 
-    old_rte->vector = iremap_entry->lo.vector;
-    old_rte->delivery_mode = iremap_entry->lo.dlm;
-    old_rte->dest_mode = iremap_entry->lo.dm;
-    old_rte->trigger = iremap_entry->lo.tm;
+    old_rte->vector = iremap_entry->remap.vector;
+    old_rte->delivery_mode = iremap_entry->remap.dlm;
+    old_rte->dest_mode = iremap_entry->remap.dm;
+    old_rte->trigger = iremap_entry->remap.tm;
     old_rte->__reserved_2 = 0;
     old_rte->dest.logical.__reserved_1 = 0;
-    old_rte->dest.logical.logical_dest = iremap_entry->lo.dst >> 8;
+    old_rte->dest.logical.logical_dest = iremap_entry->remap.dst >> 8;
 
     unmap_vtd_domain_page(iremap_entries);
     spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
@@ -317,27 +317,28 @@ static int ioapic_rte_to_remap_entry(struct iommu *iommu,
     if ( rte_upper )
     {
         if ( x2apic_enabled )
-            new_ire.lo.dst = value;
+            new_ire.remap.dst = value;
         else
-            new_ire.lo.dst = (value >> 24) << 8;
+            new_ire.remap.dst = (value >> 24) << 8;
     }
     else
     {
         *(((u32 *)&new_rte) + 0) = value;
-        new_ire.lo.fpd = 0;
-        new_ire.lo.dm = new_rte.dest_mode;
-        new_ire.lo.tm = new_rte.trigger;
-        new_ire.lo.dlm = new_rte.delivery_mode;
+        new_ire.remap.fpd = 0;
+        new_ire.remap.dm = new_rte.dest_mode;
+        new_ire.remap.tm = new_rte.trigger;
+        new_ire.remap.dlm = new_rte.delivery_mode;
         /* Hardware require RH = 1 for LPR delivery mode */
-        new_ire.lo.rh = (new_ire.lo.dlm == dest_LowestPrio);
-        new_ire.lo.avail = 0;
-        new_ire.lo.res_1 = 0;
-        new_ire.lo.vector = new_rte.vector;
-        new_ire.lo.res_2 = 0;
+        new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
+        new_ire.remap.avail = 0;
+        new_ire.remap.res_1 = 0;
+        new_ire.remap.vector = new_rte.vector;
+        new_ire.remap.res_2 = 0;
 
         set_ioapic_source_id(IO_APIC_ID(apic), &new_ire);
-        new_ire.hi.res_1 = 0;
-        new_ire.lo.p = 1;     /* finally, set present bit */
+        new_ire.remap.res_3 = 0;
+        new_ire.remap.res_4 = 0;
+        new_ire.remap.p = 1;     /* finally, set present bit */
 
         /* now construct new ioapic rte entry */
         remap_rte->vector = new_rte.vector;
@@ -510,7 +511,7 @@ static int remap_entry_to_msi_msg(
     GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
                      iremap_entries, iremap_entry);
 
-    if ( iremap_entry->hi_val == 0 && iremap_entry->lo_val == 0 )
+    if ( iremap_entry->hi == 0 && iremap_entry->lo == 0 )
     {
         dprintk(XENLOG_ERR VTDPREFIX,
                 "%s: index (%d) get an empty entry!\n",
@@ -523,25 +524,25 @@ static int remap_entry_to_msi_msg(
     msg->address_hi = MSI_ADDR_BASE_HI;
     msg->address_lo =
         MSI_ADDR_BASE_LO |
-        ((iremap_entry->lo.dm == 0) ?
+        ((iremap_entry->remap.dm == 0) ?
             MSI_ADDR_DESTMODE_PHYS:
             MSI_ADDR_DESTMODE_LOGIC) |
-        ((iremap_entry->lo.dlm != dest_LowestPrio) ?
+        ((iremap_entry->remap.dlm != dest_LowestPrio) ?
             MSI_ADDR_REDIRECTION_CPU:
             MSI_ADDR_REDIRECTION_LOWPRI);
     if ( x2apic_enabled )
-        msg->dest32 = iremap_entry->lo.dst;
+        msg->dest32 = iremap_entry->remap.dst;
     else
-        msg->dest32 = (iremap_entry->lo.dst >> 8) & 0xff;
+        msg->dest32 = (iremap_entry->remap.dst >> 8) & 0xff;
     msg->address_lo |= MSI_ADDR_DEST_ID(msg->dest32);
 
     msg->data =
         MSI_DATA_TRIGGER_EDGE |
         MSI_DATA_LEVEL_ASSERT |
-        ((iremap_entry->lo.dlm != dest_LowestPrio) ?
+        ((iremap_entry->remap.dlm != dest_LowestPrio) ?
             MSI_DATA_DELIVERY_FIXED:
             MSI_DATA_DELIVERY_LOWPRI) |
-        iremap_entry->lo.vector;
+        iremap_entry->remap.vector;
 
     unmap_vtd_domain_page(iremap_entries);
     spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
@@ -600,29 +601,30 @@ static int msi_msg_to_remap_entry(
     memcpy(&new_ire, iremap_entry, sizeof(struct iremap_entry));
 
     /* Set interrupt remapping table entry */
-    new_ire.lo.fpd = 0;
-    new_ire.lo.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
-    new_ire.lo.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
-    new_ire.lo.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
+    new_ire.remap.fpd = 0;
+    new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
+    new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
+    new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
     /* Hardware require RH = 1 for LPR delivery mode */
-    new_ire.lo.rh = (new_ire.lo.dlm == dest_LowestPrio);
-    new_ire.lo.avail = 0;
-    new_ire.lo.res_1 = 0;
-    new_ire.lo.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
-                        MSI_DATA_VECTOR_MASK;
-    new_ire.lo.res_2 = 0;
+    new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
+    new_ire.remap.avail = 0;
+    new_ire.remap.res_1 = 0;
+    new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
+                            MSI_DATA_VECTOR_MASK;
+    new_ire.remap.res_2 = 0;
     if ( x2apic_enabled )
-        new_ire.lo.dst = msg->dest32;
+        new_ire.remap.dst = msg->dest32;
     else
-        new_ire.lo.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
-                          & 0xff) << 8;
+        new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
+                             & 0xff) << 8;
 
     if ( pdev )
         set_msi_source_id(pdev, &new_ire);
     else
         set_hpet_source_id(msi_desc->hpet_id, &new_ire);
-    new_ire.hi.res_1 = 0;
-    new_ire.lo.p = 1;    /* finally, set present bit */
+    new_ire.remap.res_3 = 0;
+    new_ire.remap.res_4 = 0;
+    new_ire.remap.p = 1;    /* finally, set present bit */
 
     /* now construct new MSI/MSI-X rte entry */
     remap_rte = (struct msi_msg_remap_entry *)msg;
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index 22abefe..b440b69 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -281,29 +281,45 @@ struct dma_pte {
 /* interrupt remap entry */
 struct iremap_entry {
   union {
-    u64 lo_val;
+    __uint128_t val;
+    struct { u64 lo, hi; };
     struct {
-        u64 p       : 1,
+        u16 p       : 1,
             fpd     : 1,
             dm      : 1,
             rh      : 1,
             tm      : 1,
             dlm     : 3,
             avail   : 4,
-            res_1   : 4,
-            vector  : 8,
-            res_2   : 8,
-            dst     : 32;
-    }lo;
-  };
-  union {
-    u64 hi_val;
+            res_1   : 3,
+            im      : 1;
+        u8  vector;
+        u8  res_2;
+        u32 dst;
+        u16 sid;
+        u16 sq      : 2,
+            svt     : 2,
+            res_3   : 12;
+        u32 res_4;
+    } remap;
     struct {
-        u64 sid     : 16,
-            sq      : 2,
+        u16 p       : 1,
+            fpd     : 1,
+            res_1   : 6,
+            avail   : 4,
+            res_2   : 2,
+            urg     : 1,
+            im      : 1;
+        u8  vector;
+        u8  res_3;
+        u32 res_4   : 6,
+            pda_l   : 26;
+        u16 sid;
+        u16 sq      : 2,
             svt     : 2,
-            res_1   : 44;
-    }hi;
+            res_5   : 12;
+        u32 pda_h;
+    } post;
   };
 };
 
diff --git a/xen/drivers/passthrough/vtd/utils.c b/xen/drivers/passthrough/vtd/utils.c
index a75059f..6daa156 100644
--- a/xen/drivers/passthrough/vtd/utils.c
+++ b/xen/drivers/passthrough/vtd/utils.c
@@ -230,13 +230,13 @@ static void dump_iommu_info(unsigned char key)
                 else
                     p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];
 
-                if ( !p->lo.p )
+                if ( !p->remap.p )
                     continue;
                 printk("  %04x:  %x   %x  %04x %08x %02x    %x   %x  %x  %x  %x"
                     "   %x %x\n", i,
-                    p->hi.svt, p->hi.sq, p->hi.sid, p->lo.dst, p->lo.vector,
-                    p->lo.avail, p->lo.dlm, p->lo.tm, p->lo.rh, p->lo.dm,
-                    p->lo.fpd, p->lo.p);
+                    p->remap.svt, p->remap.sq, p->remap.sid, p->remap.dst,
+                    p->remap.vector, p->remap.avail, p->remap.dlm, p->remap.tm,
+                    p->remap.rh, p->remap.dm, p->remap.fpd, p->remap.p);
                 print_cnt++;
             }
             if ( iremap_entries )
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 11/17] vt-d: Add API to update IRTE when VT-d PI is used
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
                   ` (9 preceding siblings ...)
  2015-09-11  8:28 ` [PATCH v7 10/17] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts Feng Wu
@ 2015-09-11  8:29 ` Feng Wu
  2015-09-22 14:42   ` Jan Beulich
  2015-09-11  8:29 ` [PATCH v7 12/17] x86: move some APIC related macros to apicdef.h Feng Wu
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:29 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Keir Fraser, Andrew Cooper, Jan Beulich, Yang Zhang, Feng Wu

This patch adds an API which is used to update the IRTE
for posted-interrupt when guest changes MSI/MSI-X information.
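
One detail of the diff below that is easy to misread is how the 64-byte-aligned
machine address of the posted-interrupt descriptor is split across the IRTE's
'pda_l' (26 bits) and 'pda_h' (32 bits) fields. A standalone sketch of that
packing, using a made-up address; the constant name matches the one introduced
by this patch:

    #include <stdio.h>
    #include <stdint.h>

    #define PDA_LOW_BIT 26                        /* as defined by this patch */

    int main(void)
    {
        uint64_t maddr = 0x23456789c0ULL;    /* hypothetical, 64-byte aligned */
        uint32_t pda_l = (maddr >> (32 - PDA_LOW_BIT)) /* i.e. >> 6 */
                         & ((1u << PDA_LOW_BIT) - 1);  /* 26-bit field truncates */
        uint32_t pda_h = maddr >> 32;
        /* Reassembly, as the IRTE dumping code later in this series does. */
        uint64_t back = ((uint64_t)pda_h << 32) | ((uint64_t)pda_l << 6);

        printf("pda_l=%#x pda_h=%#x reassembled=%#llx\n",
               pda_l, pda_h, (unsigned long long)back);
        return 0;
    }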

CC: Yang Zhang <yang.z.zhang@intel.com>
CC: Kevin Tian <kevin.tian@intel.com>
CC: Keir Fraser <keir@xen.org>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v7:
- Remove __uint128_t cast
- Remove Kevin's Ack due to a bug fix for v6
- Reword some comments
- Setup posted IRTE from zeroed structure

v6:
- In some error cases, the desc->lock will be unlocked twice, fix it.
- Coding style fix.
- Add some comments.

v5:
- Make some function parameters const
- Call "spin_unlock_irq(&desc->lock);" a little eariler
- Add "ASSERT(spin_is_locked(&pcidevs_lock))"
- -EBADSLT -> -ENODEV, EBADSLT has been removed in the latest Xen

v4:
- Don't inline setup_posted_irte()
- const struct pi_desc *pi_desc for setup_posted_irte()
- return -EINVAL when pirq_spin_lock_irq_desc() fails.
- Make some variables const
- Release irq desc lock earlier in pi_update_irte()
- Remove the pointless do-while() loop when doing cmpxchg16b()

v3:
- Remove "adding PDA_MASK()" when updating 'pda_l' and 'pda_h' for IRTE.
- Change the return type of pi_update_irte() to int.
- Remove some pointless printk message in pi_update_irte().
- Use structure assignment instead of memcpy() for irte copy.

 xen/drivers/passthrough/vtd/intremap.c | 121 +++++++++++++++++++++++++++++++++
 xen/drivers/passthrough/vtd/iommu.h    |   6 ++
 xen/include/asm-x86/iommu.h            |   2 +
 3 files changed, 129 insertions(+)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index e9fffa6..67cc223 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -899,3 +899,124 @@ void iommu_disable_x2apic_IR(void)
     for_each_drhd_unit ( drhd )
         disable_qinval(drhd->iommu);
 }
+
+static void setup_posted_irte(
+    struct iremap_entry *new_ire, const struct iremap_entry *old_ire,
+    const struct pi_desc *pi_desc, const uint8_t gvec)
+{
+    memset(new_ire, 0, sizeof(*new_ire));
+
+    if ( !old_ire->remap.im )
+    {
+        new_ire->post.p = old_ire->remap.p;
+        new_ire->post.fpd = old_ire->remap.fpd;
+        new_ire->post.sid = old_ire->remap.sid;
+        new_ire->post.sq = old_ire->remap.sq;
+        new_ire->post.svt = old_ire->remap.svt;
+        new_ire->post.urg = 0;
+    }
+    else
+    {
+        new_ire->post.p = old_ire->post.p;
+        new_ire->post.fpd = old_ire->post.fpd;
+        new_ire->post.sid = old_ire->post.sid;
+        new_ire->post.sq = old_ire->post.sq;
+        new_ire->post.svt = old_ire->post.svt;
+        new_ire->post.urg = old_ire->post.urg;
+    }
+
+    new_ire->post.im = 1;
+    new_ire->post.vector = gvec;
+    new_ire->post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
+    new_ire->post.pda_h = virt_to_maddr(pi_desc) >> 32;
+}
+
+/*
+ * This function is used to update the IRTE for posted-interrupt
+ * when guest changes MSI/MSI-X information.
+ */
+int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
+    const uint8_t gvec)
+{
+    struct irq_desc *desc;
+    const struct msi_desc *msi_desc;
+    int remap_index;
+    int rc = 0;
+    const struct pci_dev *pci_dev;
+    const struct acpi_drhd_unit *drhd;
+    struct iommu *iommu;
+    struct ir_ctrl *ir_ctrl;
+    struct iremap_entry *iremap_entries = NULL, *p = NULL;
+    struct iremap_entry new_ire, old_ire;
+    const struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    __uint128_t ret;
+
+    desc = pirq_spin_lock_irq_desc(pirq, NULL);
+    if ( !desc )
+        return -EINVAL;
+
+    msi_desc = desc->msi_desc;
+    if ( !msi_desc )
+    {
+        rc = -ENODEV;
+        goto unlock_out;
+    }
+
+    pci_dev = msi_desc->dev;
+    if ( !pci_dev )
+    {
+        rc = -ENODEV;
+        goto unlock_out;
+    }
+
+    remap_index = msi_desc->remap_index;
+
+    spin_unlock_irq(&desc->lock);
+
+    ASSERT(spin_is_locked(&pcidevs_lock));
+
+    /*
+     * FIXME: For performance reasons we should store the 'iommu' pointer in
+     * 'struct msi_desc' in some other place, so we don't need to waste
+     * time searching it here.
+     */
+    drhd = acpi_find_matched_drhd_unit(pci_dev);
+    if ( !drhd )
+        return -ENODEV;
+
+    iommu = drhd->iommu;
+    ir_ctrl = iommu_ir_ctrl(iommu);
+    if ( !ir_ctrl )
+        return -ENODEV;
+
+    spin_lock_irq(&ir_ctrl->iremap_lock);
+
+    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
+
+    old_ire = *p;
+
+    /* Setup/Update interrupt remapping table entry. */
+    setup_posted_irte(&new_ire, &old_ire, pi_desc, gvec);
+    ret = cmpxchg16b(p, &old_ire, &new_ire);
+
+    /*
+     * In the above we use cmpxchg16b() to atomically update the 128-bit
+     * IRTE, and the hardware cannot update the IRTE behind our back, so
+     * the value returned must equal old_ire; the ASSERT validates that.
+     */
+    ASSERT(ret == old_ire.val);
+
+    iommu_flush_cache_entry(p, sizeof(*p));
+    iommu_flush_iec_index(iommu, 0, remap_index);
+
+    unmap_vtd_domain_page(iremap_entries);
+
+    spin_unlock_irq(&ir_ctrl->iremap_lock);
+
+    return 0;
+
+ unlock_out:
+    spin_unlock_irq(&desc->lock);
+
+    return rc;
+}
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index b440b69..c55ee08 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -323,6 +323,12 @@ struct iremap_entry {
   };
 };
 
+/*
+ * The posted-interrupt descriptor address is 64 bits and 64-byte aligned,
+ * so only the upper 26 bits of its least significant 32 bits are used.
+ */
+#define PDA_LOW_BIT    26
+
 /* Max intr remapping table page order is 8, as max number of IRTEs is 64K */
 #define IREMAP_PAGE_ORDER  8
 
diff --git a/xen/include/asm-x86/iommu.h b/xen/include/asm-x86/iommu.h
index 29203d7..92f0900 100644
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -31,6 +31,8 @@ int iommu_supports_eim(void);
 int iommu_enable_x2apic_IR(void);
 void iommu_disable_x2apic_IR(void);
 
+int pi_update_irte(const struct vcpu *v, const struct pirq *pirq, const uint8_t gvec);
+
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*
  * Local variables:
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 12/17] x86: move some APIC related macros to apicdef.h
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
                   ` (10 preceding siblings ...)
  2015-09-11  8:29 ` [PATCH v7 11/17] vt-d: Add API to update IRTE when VT-d PI is used Feng Wu
@ 2015-09-11  8:29 ` Feng Wu
  2015-09-22 14:44   ` Jan Beulich
  2015-09-11  8:29 ` [PATCH v7 13/17] Update IRTE according to guest interrupt config changes Feng Wu
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Keir Fraser, Feng Wu, Jan Beulich, Andrew Cooper

Move some APIC related macros to apicdef.h, so they can be used
outside of vlapic.c.

CC: Keir Fraser <keir@xen.org>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v7:
- Put the Macros to the right place inside the file.

v6:
- Newly introduced.

 xen/arch/x86/hvm/vlapic.c     | 5 -----
 xen/include/asm-x86/apicdef.h | 3 +++
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index b893b40..9b7c871 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -65,11 +65,6 @@ static const unsigned int vlapic_lvt_mask[VLAPIC_LVT_NUM] =
      LVT_MASK
 };
 
-/* Following could belong in apicdef.h */
-#define APIC_SHORT_MASK                  0xc0000
-#define APIC_DEST_NOSHORT                0x0
-#define APIC_DEST_MASK                   0x800
-
 #define vlapic_lvt_vector(vlapic, lvt_type)                     \
     (vlapic_get_reg(vlapic, lvt_type) & APIC_VECTOR_MASK)
 
diff --git a/xen/include/asm-x86/apicdef.h b/xen/include/asm-x86/apicdef.h
index 6069fce..f197ff6 100644
--- a/xen/include/asm-x86/apicdef.h
+++ b/xen/include/asm-x86/apicdef.h
@@ -57,6 +57,8 @@
 #define			APIC_DEST_SELF		0x40000
 #define			APIC_DEST_ALLINC	0x80000
 #define			APIC_DEST_ALLBUT	0xC0000
+#define			APIC_SHORT_MASK		0xC0000
+#define			APIC_DEST_NOSHORT	0x0
 #define			APIC_ICR_RR_MASK	0x30000
 #define			APIC_ICR_RR_INVALID	0x00000
 #define			APIC_ICR_RR_INPROG	0x10000
@@ -64,6 +66,7 @@
 #define			APIC_INT_LEVELTRIG	0x08000
 #define			APIC_INT_ASSERT		0x04000
 #define			APIC_ICR_BUSY		0x01000
+#define			APIC_DEST_MASK		0x800
 #define			APIC_DEST_LOGICAL	0x00800
 #define			APIC_DEST_PHYSICAL	0x00000
 #define			APIC_DM_FIXED		0x00000
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 13/17] Update IRTE according to guest interrupt config changes
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
                   ` (11 preceding siblings ...)
  2015-09-11  8:29 ` [PATCH v7 12/17] x86: move some APIC related macros to apicdef.h Feng Wu
@ 2015-09-11  8:29 ` Feng Wu
  2015-09-22 14:51   ` Jan Beulich
  2015-09-11  8:29 ` [PATCH v7 14/17] vmx: Properly handle notification event when vCPU is running Feng Wu
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Feng Wu, Jan Beulich

When guest changes its interrupt configuration (such as, vector, etc.)
for direct-assigned devices, we need to update the associated IRTE
with the new guest vector, so external interrupts from the assigned
devices can be injected to guests without VM-Exit.

For lowest-priority interrupts, we use the vector-hashing mechanism to find
the destination vCPU. This follows the hardware behavior, since modern
Intel CPUs use vector hashing to handle the lowest-priority interrupt.

For multicast/broadcast destinations (more than one vCPU), we cannot handle
the interrupt via posting, so interrupt remapping is still used.
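
As a standalone illustration of the hashing step (made-up numbers, not code
from this series): with four matching destination vCPUs and guest vector 0x2b,
the interrupt is posted to the vCPU owning the fourth set bit of the
destination bitmap, because 0x2b % 4 == 3:

    #include <stdio.h>

    int main(void)
    {
        unsigned int gvec = 0x2b;        /* guest vector (example value)     */
        unsigned int dest_vcpus = 4;     /* vCPUs matching dest_id/dest_mode */
        unsigned int mod = gvec % dest_vcpus;

        /* The (mod + 1)-th set bit in the destination bitmap is chosen. */
        printf("gvec %#x with %u candidates -> candidate index %u\n",
               gvec, dest_vcpus, mod);
        return 0;
    }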

CC: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v7:
- Remove some pointless debug printk
- Fix a logic error when assigning 'delivery_mode'
- Adjust the definition of local variable 'idx'
- Add a dprintk if we cannot find the vCPU from 'pi_find_dest_vcpu'
- Add 'else if ( delivery_mode == dest_Fixed )' in 'pi_find_dest_vcpu'

v6:
- Use macro to replace plain numbers
- Correct the overflow error in a loop

v5:
- Make 'struct vcpu *vcpu' const

v4:
- Make some 'int' variables 'unsigned int' in pi_find_dest_vcpu()
- Make 'dest_id' uint32_t
- Rename 'size' to 'bitmap_array_size'
- find_next_bit() and find_first_bit() always return unsigned int,
  so no need to check whether the return value is less than 0.
- Message error level XENLOG_G_WARNING -> XENLOG_G_INFO
- Remove useless warning message
- Create a separate function vector_hashing_dest() to find the
  destination of lowest-priority interrupts.
- Change some comments

v3:
- Use bitmap to store the all the possible destination vCPUs of an
  interrupt, then trying to find the right destination from the bitmap
- Typo and some small changes

 xen/drivers/passthrough/io.c | 118 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 117 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index bda9374..5b0b11e 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -25,6 +25,7 @@
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
+#include <asm/io_apic.h>
 
 static DEFINE_PER_CPU(struct list_head, dpci_list);
 
@@ -198,6 +199,103 @@ void free_hvm_irq_dpci(struct hvm_irq_dpci *dpci)
     xfree(dpci);
 }
 
+/*
+ * This routine handles lowest-priority interrupts using the vector-hashing
+ * mechanism. As an example, modern Intel CPUs use this method to handle
+ * lowest-priority interrupts.
+ *
+ * Here are the details of the vector-hashing mechanism:
+ * 1. For lowest-priority interrupts, store all the possible destination
+ *    vCPUs in a bitmap.
+ * 2. Use "gvec % number of destination vCPUs" to find the right
+ *    destination vCPU in the bitmap for the lowest-priority interrupt.
+ */
+static struct vcpu *vector_hashing_dest(const struct domain *d,
+                                        uint32_t dest_id,
+                                        bool_t dest_mode,
+                                        uint8_t gvec)
+
+{
+    unsigned long *dest_vcpu_bitmap;
+    unsigned int dest_vcpus = 0;
+    unsigned int bitmap_array_size = BITS_TO_LONGS(d->max_vcpus);
+    struct vcpu *v, *dest = NULL;
+    unsigned int i;
+
+    dest_vcpu_bitmap = xzalloc_array(unsigned long, bitmap_array_size);
+    if ( !dest_vcpu_bitmap )
+        return NULL;
+
+    for_each_vcpu ( d, v )
+    {
+        if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, APIC_DEST_NOSHORT,
+                                dest_id, dest_mode) )
+            continue;
+
+        __set_bit(v->vcpu_id, dest_vcpu_bitmap);
+        dest_vcpus++;
+    }
+
+    if ( dest_vcpus != 0 )
+    {
+        unsigned int mod = gvec % dest_vcpus;
+        unsigned int idx = 0;
+
+        for ( i = 0; i <= mod; i++ )
+        {
+            idx = find_next_bit(dest_vcpu_bitmap, d->max_vcpus, idx) + 1;
+            BUG_ON(idx >= d->max_vcpus);
+        }
+
+        dest = d->vcpu[idx - 1];
+    }
+
+    xfree(dest_vcpu_bitmap);
+
+    return dest;
+}
+
+/*
+ * The purpose of this routine is to find the right destination vCPU for
+ * an interrupt which will be delivered by VT-d posted-interrupt. There
+ * are several cases as below:
+ *
+ * - For lowest-priority interrupts, use the vector-hashing mechanism to
+ *   find the destination.
+ * - Otherwise, for a single-destination interrupt, it is straightforward to
+ *   find the destination vCPU and return it.
+ * - For a multicast/broadcast destination, we cannot handle it via interrupt
+ *   posting, so return NULL.
+ */
+static struct vcpu *pi_find_dest_vcpu(const struct domain *d, uint32_t dest_id,
+                                      bool_t dest_mode, uint8_t delivery_mode,
+                                      uint8_t gvec)
+{
+    unsigned int dest_vcpus = 0;
+    struct vcpu *v, *dest = NULL;
+
+    if ( delivery_mode == dest_LowestPrio )
+        return vector_hashing_dest(d, dest_id, dest_mode, gvec);
+    else if ( delivery_mode == dest_Fixed )
+    {
+        for_each_vcpu ( d, v )
+        {
+            if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, APIC_DEST_NOSHORT,
+                                    dest_id, dest_mode) )
+                continue;
+
+            dest_vcpus++;
+            dest = v;
+        }
+
+        /* For fixed mode, we only handle single-destination interrupts. */
+        if ( dest_vcpus == 1 )
+            return dest;
+    }
+
+    return NULL;
+}
+
 int pt_irq_create_bind(
     struct domain *d, xen_domctl_bind_pt_irq_t *pt_irq_bind)
 {
@@ -256,7 +354,7 @@ int pt_irq_create_bind(
     {
     case PT_IRQ_TYPE_MSI:
     {
-        uint8_t dest, dest_mode;
+        uint8_t dest, dest_mode, delivery_mode;
         int dest_vcpu_id;
 
         if ( !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
@@ -329,11 +427,29 @@ int pt_irq_create_bind(
         /* Calculate dest_vcpu_id for MSI-type pirq migration. */
         dest = pirq_dpci->gmsi.gflags & VMSI_DEST_ID_MASK;
         dest_mode = !!(pirq_dpci->gmsi.gflags & VMSI_DM_MASK);
+        delivery_mode = (pirq_dpci->gmsi.gflags & VMSI_DELIV_MASK) >>
+                         GFLAGS_SHIFT_DELIV_MODE;
+
         dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
         pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
         spin_unlock(&d->event_lock);
         if ( dest_vcpu_id >= 0 )
             hvm_migrate_pirqs(d->vcpu[dest_vcpu_id]);
+
+        /* Use interrupt posting if it is supported. */
+        if ( iommu_intpost )
+        {
+            const struct vcpu *vcpu = pi_find_dest_vcpu(d, dest, dest_mode,
+                                          delivery_mode, pirq_dpci->gmsi.gvec);
+
+            if ( vcpu )
+                pi_update_irte(vcpu, info, pirq_dpci->gmsi.gvec);
+            else
+                dprintk(XENLOG_G_INFO,
+                        "d%d: deliver interrupt in remapping mode, gvec %02x\n",
+                        d->domain_id, pirq_dpci->gmsi.gvec);
+        }
+
         break;
     }
 
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 14/17] vmx: Properly handle notification event when vCPU is running
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
                   ` (12 preceding siblings ...)
  2015-09-11  8:29 ` [PATCH v7 13/17] Update IRTE according to guest interrupt config changes Feng Wu
@ 2015-09-11  8:29 ` Feng Wu
  2015-09-11  8:29 ` [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling Feng Wu
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Keir Fraser, Kevin Tian, Feng Wu, Jan Beulich, Andrew Cooper

When a vCPU is running in root mode and a notification event
has been injected to it, we need to set VCPU_KICK_SOFTIRQ for
the current pCPU, so the pending interrupt in PIRR will be
synced to vIRR in time, before the next VM entry.

CC: Kevin Tian <kevin.tian@intel.com>
CC: Keir Fraser <keir@xen.org>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
---
v7:
- Retain 'cli' in the comments to make it more understandable.
- Register another notification event handler when VT-d PI is enabled

v6:
- Ack the interrupt in the beginning of pi_notification_interrupt()

v4:
- Coding style.

v3:
- Make pi_notification_interrupt() static

 xen/arch/x86/hvm/vmx/vmx.c | 54 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 5f01629..8e41f4b 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1975,6 +1975,53 @@ static struct hvm_function_table __initdata vmx_function_table = {
     .altp2m_vcpu_emulate_vmfunc = vmx_vcpu_emulate_vmfunc,
 };
 
+/* Handle VT-d posted-interrupt when VCPU is running. */
+static void pi_notification_interrupt(struct cpu_user_regs *regs)
+{
+    ack_APIC_irq();
+    this_cpu(irq_count)++;
+
+    /*
+     * We get here when a vCPU is running in root-mode (such as via hypercall,
+     * or any other reasons which can result in VM-Exit), and before vCPU is
+     * back to non-root, external interrupts from an assigned device happen
+     * and a notification event is delivered to this logical CPU.
+     *
+     * We need to set VCPU_KICK_SOFTIRQ for the current cpu, just like
+     * __vmx_deliver_posted_interrupt() does, so the pending interrupt in PIRR
+     * will be synced to vIRR in time, before the next VM entry.
+     *
+     * Please refer to the following code fragments from
+     * xen/arch/x86/hvm/vmx/entry.S:
+     *
+     *     .Lvmx_do_vmentry
+     *
+     *      ......
+     *
+     *      point 1
+     *
+     *      cli
+     *      cmp  %ecx,(%rdx,%rax,1)
+     *      jnz  .Lvmx_process_softirqs
+     *
+     *      ......
+     *
+     *      je   .Lvmx_launch
+     *
+     *      ......
+     *
+     *     .Lvmx_process_softirqs:
+     *      sti
+     *      call do_softirq
+     *      jmp  .Lvmx_do_vmentry
+     *
+     * If VT-d engine issues a notification event at point 1 above, it cannot
+     * be delivered to the guest during this VM-entry without raising the
+     * softirq in this notification handler.
+     */
+    raise_softirq(VCPU_KICK_SOFTIRQ);
+}
+
 const struct hvm_function_table * __init start_vmx(void)
 {
     set_in_cr4(X86_CR4_VMXE);
@@ -2012,7 +2059,12 @@ const struct hvm_function_table * __init start_vmx(void)
     }
 
     if ( cpu_has_vmx_posted_intr_processing )
-        alloc_direct_apic_vector(&posted_intr_vector, event_check_interrupt);
+    {
+        if ( iommu_intpost )
+            alloc_direct_apic_vector(&posted_intr_vector, pi_notification_interrupt);
+        else
+            alloc_direct_apic_vector(&posted_intr_vector, event_check_interrupt);
+    }
     else
     {
         vmx_function_table.deliver_posted_intr = NULL;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
                   ` (13 preceding siblings ...)
  2015-09-11  8:29 ` [PATCH v7 14/17] vmx: Properly handle notification event when vCPU is running Feng Wu
@ 2015-09-11  8:29 ` Feng Wu
  2015-09-16 16:00   ` Dario Faggioli
  2015-09-16 17:18   ` Dario Faggioli
  2015-09-11  8:29 ` [PATCH v7 16/17] VT-d: Dump the posted format IRTE Feng Wu
  2015-09-11  8:29 ` [PATCH v7 17/17] Add a command line parameter for VT-d posted-interrupts Feng Wu
  16 siblings, 2 replies; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:29 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
	Dario Faggioli, Jan Beulich, Feng Wu

This patch includes the following aspects:
- Handling logic when vCPU is blocked:
    * Add a global vector to wake up the blocked vCPU
      when an interrupt is being posted to it (This part
      was suggested by Yang Zhang <yang.z.zhang@intel.com>).
    * Define two per-cpu variables:
          1. pi_blocked_vcpu:
            A list storing the vCPUs which were blocked
            on this pCPU.

          2. pi_blocked_vcpu_lock:
            The spinlock to protect pi_blocked_vcpu.

- Add some scheduler hooks, this part was suggested
  by Dario Faggioli <dario.faggioli@citrix.com>.
    * vmx_pre_ctx_switch_pi()
      It is called before context switch, we update the
      posted interrupt descriptor when the vCPU is preempted,
      go to sleep, or is blocked.

    * vmx_post_ctx_switch_pi()
      It is called after context switch, we update the posted
      interrupt descriptor when the vCPU is going to run.

    * arch_vcpu_wake_prepare()
      It will be called when waking up the vCPU, we update
      the posted interrupt descriptor when the vCPU is
      unblocked.
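
To summarize the effect of these hooks on the posted-interrupt descriptor,
here is a toy user-space model of the state transitions (field and hook names
follow the patch; the vector values are made up, and locking, the blocking
list itself and error handling are omitted):

    #include <stdio.h>
    #include <stdint.h>

    enum { POSTED_INTR_VECTOR = 0xf3, PI_WAKEUP_VECTOR = 0xf1 };   /* assumed */

    struct toy_pi_desc { uint8_t nv; unsigned int sn; uint32_t ndst; };

    /* Preempted / sleeping but still runnable: just suppress notifications. */
    static void pre_ctx_switch_runnable(struct toy_pi_desc *d) { d->sn = 1; }

    /* Blocking: route the notification to the wakeup vector on pi_block_cpu. */
    static void pre_ctx_switch_blocked(struct toy_pi_desc *d, uint32_t apic_id)
    {
        d->ndst = apic_id;           /* ...and join that pCPU's blocking list */
        d->sn = 0;
        d->nv = PI_WAKEUP_VECTOR;
    }

    /* Waking up: restore the normal notification vector. */
    static void wake_prepare(struct toy_pi_desc *d)
    {
        d->sn = 1;                   /* not running yet, no notification needed */
        d->nv = POSTED_INTR_VECTOR;  /* ...and leave the blocking list */
    }

    /* About to run: point NDST at the new pCPU and allow notifications again. */
    static void post_ctx_switch(struct toy_pi_desc *d, uint32_t apic_id)
    {
        d->ndst = apic_id;
        d->sn = 0;
    }

    int main(void)
    {
        struct toy_pi_desc d = { .nv = POSTED_INTR_VECTOR };

        pre_ctx_switch_blocked(&d, 2);
        printf("blocked: nv=%#x sn=%u ndst=%u\n", d.nv, d.sn, d.ndst);
        wake_prepare(&d);
        post_ctx_switch(&d, 5);
        printf("running: nv=%#x sn=%u ndst=%u\n", d.nv, d.sn, d.ndst);
        pre_ctx_switch_runnable(&d);
        printf("preempted: nv=%#x sn=%u ndst=%u\n", d.nv, d.sn, d.ndst);
        return 0;
    }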

CC: Keir Fraser <keir@xen.org>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Kevin Tian <kevin.tian@intel.com>
CC: George Dunlap <george.dunlap@eu.citrix.com>
CC: Dario Faggioli <dario.faggioli@citrix.com>
Suggested-by: Dario Faggioli <dario.faggioli@citrix.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
---
v7:
- Merge [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
  and "[PATCH v6 14/18] vmx: posted-interrupt handling when vCPU is blocked"
  into this patch, so it is self-contained and more convenient
  for code review.
- Make 'pi_blocked_vcpu' and 'pi_blocked_vcpu_lock' static
- Coding style
- Use per_cpu() instead of this_cpu() in pi_wakeup_interrupt()
- Move ack_APIC_irq() to the beginning of pi_wakeup_interrupt()
- Rename 'pi_ctxt_switch_from' to 'ctxt_switch_prepare'
- Rename 'pi_ctxt_switch_to' to 'ctxt_switch_cancel'
- Use 'has_hvm_container_vcpu' instead of 'is_hvm_vcpu'
- Use 'spin_lock' and 'spin_unlock' when the interrupt has been
  already disabled.
- Rename arch_vcpu_wake_prepare to vmx_vcpu_wake_prepare
- Define vmx_vcpu_wake_prepare in xen/arch/x86/hvm/hvm.c
- Call .pi_ctxt_switch_to() in __context_switch() instead of directly
  calling vmx_post_ctx_switch_pi() in vmx_ctxt_switch_to()
- Make .pi_block_cpu unsigned int
- Use list_del() instead of list_del_init()
- Coding style

One remaining item:
Jan has concern about calling vcpu_unblock() in vmx_pre_ctx_switch_pi(),
need Dario or George's input about this.

Changelog for "vmx: Add some scheduler hooks for VT-d posted interrupts"
v6:
- Add two static inline functions for pi context switch
- Fix typos

v5:
- Rename arch_vcpu_wake to arch_vcpu_wake_prepare
- Make arch_vcpu_wake_prepare() inline for ARM
- Merge the ARM dummy hook with together
- Changes to some code comments
- Leave 'pi_ctxt_switch_from' and 'pi_ctxt_switch_to' NULL if
  PI is disabled or the vCPU is not in HVM
- Coding style

v4:
- Newly added

Changlog for "vmx: posted-interrupt handling when vCPU is blocked"
v6:
- Fix some typos
- Ack the interrupt right after the spin_unlock in pi_wakeup_interrupt()

v4:
- Use local variables in pi_wakeup_interrupt()
- Remove vcpu from the blocked list when pi_desc.on==1, this
  avoids kicking the vcpu multiple times.
- Remove tasklet

v3:
- This patch is generated by merging the following three patches in v2:
   [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU
   [RFC v2 10/15] vmx: Define two per-cpu variables
   [RFC v2 11/15] vmx: Add a global wake-up vector for VT-d Posted-Interrupts
- rename 'vcpu_wakeup_tasklet' to 'pi_vcpu_wakeup_tasklet'
- Move the definition of 'pi_vcpu_wakeup_tasklet' to 'struct arch_vmx_struct'
- rename 'vcpu_wakeup_tasklet_handler' to 'pi_vcpu_wakeup_tasklet_handler'
- Make pi_wakeup_interrupt() static
- Rename 'blocked_vcpu_list' to 'pi_blocked_vcpu_list'
- move 'pi_blocked_vcpu_list' to 'struct arch_vmx_struct'
- Rename 'blocked_vcpu' to 'pi_blocked_vcpu'
- Rename 'blocked_vcpu_lock' to 'pi_blocked_vcpu_lock'

 xen/arch/x86/domain.c              |  21 ++++
 xen/arch/x86/hvm/hvm.c             |   6 +
 xen/arch/x86/hvm/vmx/vmcs.c        |   2 +
 xen/arch/x86/hvm/vmx/vmx.c         | 229 +++++++++++++++++++++++++++++++++++++
 xen/common/schedule.c              |   2 +
 xen/include/asm-arm/domain.h       |   2 +
 xen/include/asm-x86/domain.h       |   3 +
 xen/include/asm-x86/hvm/hvm.h      |   4 +
 xen/include/asm-x86/hvm/vmx/vmcs.h |  11 ++
 xen/include/asm-x86/hvm/vmx/vmx.h  |   4 +
 10 files changed, 284 insertions(+)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 045f6ff..d64d4eb 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1531,6 +1531,8 @@ static void __context_switch(void)
         }
         vcpu_restore_fpu_eager(n);
         n->arch.ctxt_switch_to(n);
+        if ( n->arch.pi_ctxt_switch_to )
+            n->arch.pi_ctxt_switch_to(n);
     }
 
     psr_ctxt_switch_to(nd);
@@ -1573,6 +1575,22 @@ static void __context_switch(void)
     per_cpu(curr_vcpu, cpu) = n;
 }
 
+static inline void ctxt_switch_prepare(struct vcpu *prev)
+{
+    /*
+     * When switching from non-idle to idle, we only do a lazy context switch.
+     * However, in order for posted interrupt (if available and enabled) to
+     * work properly, we at least need to update the descriptors.
+     */
+    if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
+        prev->arch.pi_ctxt_switch_from(prev);
+}
+
+static inline void ctxt_switch_cancel(struct vcpu *next)
+{
+    if ( next->arch.pi_ctxt_switch_to && !is_idle_vcpu(next) )
+        next->arch.pi_ctxt_switch_to(next);
+}
 
 void context_switch(struct vcpu *prev, struct vcpu *next)
 {
@@ -1605,9 +1623,12 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
 
     set_current(next);
 
+    ctxt_switch_prepare(prev);
+
     if ( (per_cpu(curr_vcpu, cpu) == next) ||
          (is_idle_domain(nextd) && cpu_online(cpu)) )
     {
+        ctxt_switch_cancel(next);
         local_irq_enable();
     }
     else
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index c957610..cfbb56f 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -6817,6 +6817,12 @@ bool_t altp2m_vcpu_emulate_ve(struct vcpu *v)
     return 0;
 }
 
+void arch_vcpu_wake_prepare(struct vcpu *v)
+{
+    if ( hvm_funcs.vcpu_wake_prepare )
+        hvm_funcs.vcpu_wake_prepare(v);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 5f67797..5abe960 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -661,6 +661,8 @@ int vmx_cpu_up(void)
     if ( cpu_has_vmx_vpid )
         vpid_sync_all();
 
+    vmx_pi_per_cpu_init(cpu);
+
     return 0;
 }
 
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 8e41f4b..f32e062 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -67,6 +67,8 @@ enum handler_return { HNDL_done, HNDL_unhandled, HNDL_exception_raised };
 
 static void vmx_ctxt_switch_from(struct vcpu *v);
 static void vmx_ctxt_switch_to(struct vcpu *v);
+static void vmx_pre_ctx_switch_pi(struct vcpu *v);
+static void vmx_post_ctx_switch_pi(struct vcpu *v);
 
 static int  vmx_alloc_vlapic_mapping(struct domain *d);
 static void vmx_free_vlapic_mapping(struct domain *d);
@@ -83,7 +85,21 @@ static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content);
 static void vmx_invlpg_intercept(unsigned long vaddr);
 static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
 
+/*
+ * We maintain a per-CPU linked list of vCPUs, so in the PI wakeup handler we
+ * can find which vCPU should be woken up.
+ */
+static DEFINE_PER_CPU(struct list_head, pi_blocked_vcpu);
+static DEFINE_PER_CPU(spinlock_t, pi_blocked_vcpu_lock);
+
 uint8_t __read_mostly posted_intr_vector;
+uint8_t __read_mostly pi_wakeup_vector;
+
+void vmx_pi_per_cpu_init(unsigned int cpu)
+{
+    INIT_LIST_HEAD(&per_cpu(pi_blocked_vcpu, cpu));
+    spin_lock_init(&per_cpu(pi_blocked_vcpu_lock, cpu));
+}
 
 static int vmx_domain_initialise(struct domain *d)
 {
@@ -106,10 +122,23 @@ static int vmx_vcpu_initialise(struct vcpu *v)
 
     spin_lock_init(&v->arch.hvm_vmx.vmcs_lock);
 
+    INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
+    INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_vcpu_on_set_list);
+
+    v->arch.hvm_vmx.pi_block_cpu = NR_CPUS;
+
+    spin_lock_init(&v->arch.hvm_vmx.pi_lock);
+
     v->arch.schedule_tail    = vmx_do_resume;
     v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
     v->arch.ctxt_switch_to   = vmx_ctxt_switch_to;
 
+    if ( iommu_intpost && has_hvm_container_vcpu(v) )
+    {
+        v->arch.pi_ctxt_switch_from = vmx_pre_ctx_switch_pi;
+        v->arch.pi_ctxt_switch_to = vmx_post_ctx_switch_pi;
+    }
+
     if ( (rc = vmx_create_vmcs(v)) != 0 )
     {
         dprintk(XENLOG_WARNING,
@@ -707,6 +736,155 @@ static void vmx_fpu_leave(struct vcpu *v)
     }
 }
 
+void vmx_vcpu_wake_prepare(struct vcpu *v)
+{
+    unsigned long flags;
+
+    if ( !iommu_intpost || !has_hvm_container_vcpu(v) ||
+         !has_arch_pdevs(v->domain) )
+        return;
+
+    spin_lock_irqsave(&v->arch.hvm_vmx.pi_lock, flags);
+
+    if ( !test_bit(_VPF_blocked, &v->pause_flags) )
+    {
+        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+        unsigned int pi_block_cpu;
+
+        /*
+         * We don't need to send a notification event to a non-running
+         * vcpu; the interrupt information will be delivered to it before
+         * VM-ENTRY when the vcpu is scheduled to run next time.
+         */
+        pi_set_sn(pi_desc);
+
+        /*
+         * Set 'NV' field back to posted_intr_vector, so the
+         * Posted-Interrupts can be delivered to the vCPU by
+         * VT-d HW after it is scheduled to run.
+         */
+        write_atomic(&pi_desc->nv, posted_intr_vector);
+
+        /*
+         * Delete the vCPU from the related block list
+         * if we are resuming from blocked state.
+         */
+        pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
+        if ( pi_block_cpu == NR_CPUS )
+            goto out;
+
+        spin_lock(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu));
+
+        /*
+         * v->arch.hvm_vmx.pi_block_cpu == NR_CPUS here means the vCPU was
+         * removed from the blocking list while we were acquiring the lock.
+         */
+        if ( v->arch.hvm_vmx.pi_block_cpu == NR_CPUS )
+        {
+            spin_unlock(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu));
+            goto out;
+        }
+
+        list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
+        v->arch.hvm_vmx.pi_block_cpu = NR_CPUS;
+        spin_unlock(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu));
+    }
+
+out:
+    spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock, flags);
+}
+
+static void vmx_pre_ctx_switch_pi(struct vcpu *v)
+{
+    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    unsigned long flags;
+
+    if ( !has_arch_pdevs(v->domain) )
+        return;
+
+    spin_lock_irqsave(&v->arch.hvm_vmx.pi_lock, flags);
+
+    if ( !test_bit(_VPF_blocked, &v->pause_flags) )
+    {
+        /*
+         * The vCPU has been preempted or has gone to sleep. We don't need
+         * to send a notification event to a non-running vcpu; the interrupt
+         * information will be delivered to it before VM-ENTRY when the vcpu
+         * is scheduled to run next time.
+         */
+        pi_set_sn(pi_desc);
+
+    }
+    else
+    {
+        struct pi_desc old, new;
+        unsigned int dest;
+
+        /*
+         * The vCPU is blocking, so we need to add it to one of the per-pCPU
+         * lists. We save v->processor in v->arch.hvm_vmx.pi_block_cpu to pick
+         * the per-CPU list; we also store it in the posted-interrupt
+         * descriptor as the destination of the wake-up notification event.
+         */
+        v->arch.hvm_vmx.pi_block_cpu = v->processor;
+
+        spin_lock(&per_cpu(pi_blocked_vcpu_lock, v->arch.hvm_vmx.pi_block_cpu));
+        list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
+                      &per_cpu(pi_blocked_vcpu, v->arch.hvm_vmx.pi_block_cpu));
+        spin_unlock(&per_cpu(pi_blocked_vcpu_lock,
+                    v->arch.hvm_vmx.pi_block_cpu));
+
+        do {
+            old.control = new.control = pi_desc->control;
+
+            /* Should not block the vCPU if an interrupt was posted for it. */
+            if ( pi_test_on(&old) )
+            {
+                spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock, flags);
+                vcpu_unblock(v);
+                return;
+            }
+
+            /*
+             * Change the 'NDST' field to v->arch.hvm_vmx.pi_block_cpu,
+             * so when external interrupts from assigned devices happen,
+             * the wakeup notification event will go to
+             * v->arch.hvm_vmx.pi_block_cpu, then in pi_wakeup_interrupt()
+             * we can find the vCPU in the right list to wake up.
+             */
+            dest = cpu_physical_id(v->arch.hvm_vmx.pi_block_cpu);
+
+            if ( x2apic_enabled )
+                new.ndst = dest;
+            else
+                new.ndst = MASK_INSR(dest, PI_xAPIC_NDST_MASK);
+
+            pi_clear_sn(&new);
+            new.nv = pi_wakeup_vector;
+        } while ( cmpxchg(&pi_desc->control, old.control, new.control) !=
+                  old.control );
+    }
+
+    spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock, flags);
+}
+
+static void vmx_post_ctx_switch_pi(struct vcpu *v)
+{
+    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+
+    if ( !has_arch_pdevs(v->domain) )
+        return;
+
+    if ( x2apic_enabled )
+        write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
+    else
+        write_atomic(&pi_desc->ndst,
+                     MASK_INSR(cpu_physical_id(v->processor),
+                     PI_xAPIC_NDST_MASK));
+
+    pi_clear_sn(pi_desc);
+}
+
 static void vmx_ctxt_switch_from(struct vcpu *v)
 {
     /*
@@ -1975,6 +2153,53 @@ static struct hvm_function_table __initdata vmx_function_table = {
     .altp2m_vcpu_emulate_vmfunc = vmx_vcpu_emulate_vmfunc,
 };
 
+/* Handle VT-d posted-interrupt when VCPU is blocked. */
+static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
+{
+    struct arch_vmx_struct *vmx, *tmp;
+    struct vcpu *v;
+    spinlock_t *lock = &per_cpu(pi_blocked_vcpu_lock, smp_processor_id());
+    struct list_head *blocked_vcpus =
+                       &per_cpu(pi_blocked_vcpu, smp_processor_id());
+    LIST_HEAD(list);
+
+    ack_APIC_irq();
+    this_cpu(irq_count)++;
+
+    spin_lock(lock);
+
+    /*
+     * XXX: The length of the list depends on how many vCPUs are currently
+     * blocked on this specific pCPU. This may hurt the interrupt latency
+     * if the list grows too long.
+     */
+    list_for_each_entry_safe(vmx, tmp, blocked_vcpus, pi_blocked_vcpu_list)
+    {
+        if ( pi_test_on(&vmx->pi_desc) )
+        {
+            list_del(&vmx->pi_blocked_vcpu_list);
+            vmx->pi_block_cpu = NR_CPUS;
+
+            /*
+             * We cannot call vcpu_unblock() here, since it also needs
+             * 'pi_blocked_vcpu_lock'; instead we store the vCPUs with ON
+             * set in another list and unblock them after we release
+             * 'pi_blocked_vcpu_lock'.
+             */
+            list_add_tail(&vmx->pi_vcpu_on_set_list, &list);
+        }
+    }
+
+    spin_unlock(lock);
+
+    list_for_each_entry_safe(vmx, tmp, &list, pi_vcpu_on_set_list)
+    {
+        v = container_of(vmx, struct vcpu, arch.hvm_vmx);
+        list_del(&vmx->pi_vcpu_on_set_list);
+        vcpu_unblock(v);
+    }
+}
+
 /* Handle VT-d posted-interrupt when VCPU is running. */
 static void pi_notification_interrupt(struct cpu_user_regs *regs)
 {
@@ -2061,7 +2286,11 @@ const struct hvm_function_table * __init start_vmx(void)
     if ( cpu_has_vmx_posted_intr_processing )
     {
         if ( iommu_intpost )
+        {
             alloc_direct_apic_vector(&posted_intr_vector, pi_notification_interrupt);
+            alloc_direct_apic_vector(&pi_wakeup_vector, pi_wakeup_interrupt);
+            vmx_function_table.vcpu_wake_prepare = vmx_vcpu_wake_prepare;
+        }
         else
             alloc_direct_apic_vector(&posted_intr_vector, event_check_interrupt);
     }
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 3eefed7..bc49098 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -412,6 +412,8 @@ void vcpu_wake(struct vcpu *v)
     unsigned long flags;
     spinlock_t *lock = vcpu_schedule_lock_irqsave(v, &flags);
 
+    arch_vcpu_wake_prepare(v);
+
     if ( likely(vcpu_runnable(v)) )
     {
         if ( v->runstate.state >= RUNSTATE_blocked )
diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h
index 56aa208..cffe2c6 100644
--- a/xen/include/asm-arm/domain.h
+++ b/xen/include/asm-arm/domain.h
@@ -301,6 +301,8 @@ static inline register_t vcpuid_to_vaffinity(unsigned int vcpuid)
     return vaff;
 }
 
+static inline void arch_vcpu_wake_prepare(struct vcpu *v) {}
+
 #endif /* __ASM_DOMAIN_H__ */
 
 /*
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 0fce09e..979210a 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -481,6 +481,9 @@ struct arch_vcpu
     void (*ctxt_switch_from) (struct vcpu *);
     void (*ctxt_switch_to) (struct vcpu *);
 
+    void (*pi_ctxt_switch_from) (struct vcpu *);
+    void (*pi_ctxt_switch_to) (struct vcpu *);
+
     struct vpmu_struct vpmu;
 
     /* Virtual Machine Extensions */
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 3cac64f..50c112f 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -212,6 +212,8 @@ struct hvm_function_table {
     void (*altp2m_vcpu_update_vmfunc_ve)(struct vcpu *v);
     bool_t (*altp2m_vcpu_emulate_ve)(struct vcpu *v);
     int (*altp2m_vcpu_emulate_vmfunc)(struct cpu_user_regs *regs);
+
+    void (*vcpu_wake_prepare)(struct vcpu *v);
 };
 
 extern struct hvm_function_table hvm_funcs;
@@ -545,6 +547,8 @@ static inline bool_t hvm_altp2m_supported(void)
     return hvm_funcs.altp2m_supported;
 }
 
+void arch_vcpu_wake_prepare(struct vcpu *v);
+
 #ifndef NDEBUG
 /* Permit use of the Forced Emulation Prefix in HVM guests */
 extern bool_t opt_hvm_fep;
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index b7f78e3..65d8523 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -160,6 +160,17 @@ struct arch_vmx_struct {
     struct page_info     *vmwrite_bitmap;
 
     struct page_info     *pml_pg;
+
+    struct list_head     pi_blocked_vcpu_list;
+    struct list_head     pi_vcpu_on_set_list;
+
+    /*
+     * Before the vCPU is blocked, it is added to the blocking list of the
+     * pCPU recorded in 'pi_block_cpu', so the VT-d engine can send the
+     * wakeup notification event there and the related vCPU can be woken up.
+     */
+    unsigned int         pi_block_cpu;
+    spinlock_t           pi_lock;
 };
 
 int vmx_create_vmcs(struct vcpu *v);
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 70b254f..2eaea32 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -28,6 +28,8 @@
 #include <asm/hvm/trace.h>
 #include <asm/hvm/vmx/vmcs.h>
 
+extern uint8_t pi_wakeup_vector;
+
 typedef union {
     struct {
         u64 r       :   1,  /* bit 0 - Read permission */
@@ -557,6 +559,8 @@ int alloc_p2m_hap_data(struct p2m_domain *p2m);
 void free_p2m_hap_data(struct p2m_domain *p2m);
 void p2m_init_hap_data(struct p2m_domain *p2m);
 
+void vmx_pi_per_cpu_init(unsigned int cpu);
+
 /* EPT violation qualifications definitions */
 #define _EPT_READ_VIOLATION         0
 #define EPT_READ_VIOLATION          (1UL<<_EPT_READ_VIOLATION)
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 16/17] VT-d: Dump the posted format IRTE
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
                   ` (14 preceding siblings ...)
  2015-09-11  8:29 ` [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling Feng Wu
@ 2015-09-11  8:29 ` Feng Wu
  2015-09-22 14:58   ` Jan Beulich
  2015-09-11  8:29 ` [PATCH v7 17/17] Add a command line parameter for VT-d posted-interrupts Feng Wu
  16 siblings, 1 reply; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Yang Zhang, Kevin Tian, Feng Wu

Add the utility to dump the posted format IRTE.

CC: Yang Zhang <yang.z.zhang@intel.com>
CC: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v7:
- Remove the two stage loop

v6:
- Fix a typo

v4:
- Newly added

 xen/drivers/passthrough/vtd/utils.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/utils.c b/xen/drivers/passthrough/vtd/utils.c
index 6daa156..54db519 100644
--- a/xen/drivers/passthrough/vtd/utils.c
+++ b/xen/drivers/passthrough/vtd/utils.c
@@ -203,6 +203,9 @@ static void dump_iommu_info(unsigned char key)
             ecap_intr_remap(iommu->ecap) ? "" : "not ",
             (status & DMA_GSTS_IRES) ? " and enabled" : "" );
 
+        printk("  Interrupt Posting: %ssupported.\n",
+            cap_intr_post(iommu->cap) ? "" : "not ");
+
         if ( status & DMA_GSTS_IRES )
         {
             /* Dump interrupt remapping table. */
@@ -213,8 +216,11 @@ static void dump_iommu_info(unsigned char key)
 
             printk("  Interrupt remapping table (nr_entry=%#x. "
                 "Only dump P=1 entries here):\n", nr_entry);
-            printk("       SVT  SQ   SID      DST  V  AVL DLM TM RH DM "
-                   "FPD P\n");
+            printk("R means remapped format, P means posted format.\n");
+            printk("R:       SVT  SQ   SID  V  AVL FPD      DST DLM TM RH DM "
+                   "P\n");
+            printk("P:       SVT  SQ   SID  V  AVL FPD              PDA  URG "
+                   "P\n");
             for ( i = 0; i < nr_entry; i++ )
             {
                 struct iremap_entry *p;
@@ -232,11 +238,21 @@ static void dump_iommu_info(unsigned char key)
 
                 if ( !p->remap.p )
                     continue;
-                printk("  %04x:  %x   %x  %04x %08x %02x    %x   %x  %x  %x  %x"
-                    "   %x %x\n", i,
-                    p->remap.svt, p->remap.sq, p->remap.sid, p->remap.dst,
-                    p->remap.vector, p->remap.avail, p->remap.dlm, p->remap.tm,
-                    p->remap.rh, p->remap.dm, p->remap.fpd, p->remap.p);
+                if ( !p->remap.im )
+                    printk("R:  %04x:  %x   %x  %04x %02x    %x   %x %08x   "
+                        "%x  %x  %x  %x %x\n", i,
+                        p->remap.svt, p->remap.sq, p->remap.sid,
+                        p->remap.vector, p->remap.avail, p->remap.fpd,
+                        p->remap.dst, p->remap.dlm, p->remap.tm, p->remap.rh,
+                        p->remap.dm, p->remap.p);
+                else
+                    printk("P:  %04x:  %x   %x  %04x %02x    %x   %x %16lx    "
+                        "%x %x\n", i,
+                        p->post.svt, p->post.sq, p->post.sid, p->post.vector,
+                        p->post.avail, p->post.fpd,
+                        ((u64)p->post.pda_h << 32) | ((u64)p->post.pda_l << 6),
+                        p->post.urg, p->post.p);
+
                 print_cnt++;
             }
             if ( iremap_entries )
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 17/17] Add a command line parameter for VT-d posted-interrupts
  2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
                   ` (15 preceding siblings ...)
  2015-09-11  8:29 ` [PATCH v7 16/17] VT-d: Dump the posted format IRTE Feng Wu
@ 2015-09-11  8:29 ` Feng Wu
  16 siblings, 0 replies; 86+ messages in thread
From: Feng Wu @ 2015-09-11  8:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Feng Wu, Jan Beulich

Enable VT-d Posted-Interrupts and add a command line
parameter for it.
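
A usage sketch (the exact layout of the boot entry is only illustrative;
posted-interrupts are off by default and are enabled on the hypervisor
command line alongside interrupt remapping):

    xen.gz ... iommu=intremap,intpost

The option has no effect unless interrupt remapping is available and enabled,
as noted in the documentation hunk below.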

CC: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
v6:
- Change the default value to 'false' in xen-command-line.markdown

 docs/misc/xen-command-line.markdown | 9 ++++++++-
 xen/drivers/passthrough/iommu.c     | 3 +++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index a2e427c..ecaf221 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -855,7 +855,7 @@ debug hypervisor only).
 > Default: `new` unless directed-EOI is supported
 
 ### iommu
-> `= List of [ <boolean> | force | required | intremap | qinval | snoop | sharept | dom0-passthrough | dom0-strict | amd-iommu-perdev-intremap | workaround_bios_bug | igfx | verbose | debug ]`
+> `= List of [ <boolean> | force | required | intremap | intpost | qinval | snoop | sharept | dom0-passthrough | dom0-strict | amd-iommu-perdev-intremap | workaround_bios_bug | igfx | verbose | debug ]`
 
 > Sub-options:
 
@@ -882,6 +882,13 @@ debug hypervisor only).
 >> Control the use of interrupt remapping (DMA remapping will always be enabled
 >> if IOMMU functionality is enabled).
 
+> `intpost`
+
+> Default: `false`
+
+>> Control the use of interrupt posting, which depends on the availability of
+>> interrupt remapping.
+
 > `qinval` (VT-d)
 
 > Default: `true`
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 36d5cc0..8d03076 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -38,6 +38,7 @@ static void iommu_dump_p2m_table(unsigned char key);
  *   no-snoop                   Disable VT-d Snoop Control
  *   no-qinval                  Disable VT-d Queued Invalidation
  *   no-intremap                Disable VT-d Interrupt Remapping
+ *   no-intpost                 Disable VT-d Interrupt Posting
  */
 custom_param("iommu", parse_iommu_param);
 bool_t __initdata iommu_enable = 1;
@@ -105,6 +106,8 @@ static void __init parse_iommu_param(char *s)
             iommu_qinval = val;
         else if ( !strcmp(s, "intremap") )
             iommu_intremap = val;
+        else if ( !strcmp(s, "intpost") )
+            iommu_intpost = val;
         else if ( !strcmp(s, "debug") )
         {
             iommu_debug = val;
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-11  8:29 ` [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling Feng Wu
@ 2015-09-16 16:00   ` Dario Faggioli
  2015-09-16 17:18   ` Dario Faggioli
  1 sibling, 0 replies; 86+ messages in thread
From: Dario Faggioli @ 2015-09-16 16:00 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper, xen-devel,
	Jan Beulich



On Fri, 2015-09-11 at 16:29 +0800, Feng Wu wrote:

> CC: Keir Fraser <keir@xen.org>
> CC: Jan Beulich <jbeulich@suse.com>
> CC: Andrew Cooper <andrew.cooper3@citrix.com>
> CC: Kevin Tian <kevin.tian@intel.com>
> CC: George Dunlap <george.dunlap@eu.citrix.com>
> CC: Dario Faggioli <dario.faggioli@citrix.com>
> Sugguested-by: Dario Faggioli <dario.faggioli@citrix.com>
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
>
Ehm, just one thing, for now...

> ---
> v7:
> - Merge [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
>   and "[PATCH v6 14/18] vmx: posted-interrupt handling when vCPU is blocked"
>   into this patch, so it is self-contained and more convenient
>   for code review.
> - Make 'pi_blocked_vcpu' and 'pi_blocked_vcpu_lock' static
> - Coding style
> - Use per_cpu() instead of this_cpu() in pi_wakeup_interrupt()
> - Move ack_APIC_irq() to the beginning of pi_wakeup_interrupt()
> - Rename 'pi_ctxt_switch_from' to 'ctxt_switch_prepare'
> - Rename 'pi_ctxt_switch_to' to 'ctxt_switch_cancel'
> - Use 'has_hvm_container_vcpu' instead of 'is_hvm_vcpu'
> - Use 'spin_lock' and 'spin_unlock' when the interrupt has been
>   already disabled.
> - Rename arch_vcpu_wake_prepare to vmx_vcpu_wake_prepare
> - Define vmx_vcpu_wake_prepare in xen/arch/x86/hvm/hvm.c
> - Call .pi_ctxt_switch_to() __context_switch() instead of directly
>   calling vmx_post_ctx_switch_pi() in vmx_ctxt_switch_to()
> - Make .pi_block_cpu unsigned int
> - Use list_del() instead of list_del_init()
> - Coding style
> 
...With so many changes having been done to this patch, I'd say you should
have dropped any Acked/Reviewed-by tag!

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-11  8:29 ` [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling Feng Wu
  2015-09-16 16:00   ` Dario Faggioli
@ 2015-09-16 17:18   ` Dario Faggioli
  2015-09-16 18:05     ` Dario Faggioli
  2015-09-17  8:00     ` Wu, Feng
  1 sibling, 2 replies; 86+ messages in thread
From: Dario Faggioli @ 2015-09-16 17:18 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper, xen-devel,
	Jan Beulich



On Fri, 2015-09-11 at 16:29 +0800, Feng Wu wrote:
> This patch includes the following aspects:
> - Handling logic when vCPU is blocked:
>     * Add a global vector to wake up the blocked vCPU
>       when an interrupt is being posted to it (This part
>       was sugguested by Yang Zhang <yang.z.zhang@intel.com>).
>     * Define two per-cpu variables:
>           1. pi_blocked_vcpu:
>             A list storing the vCPUs which were blocked
>             on this pCPU.
> 
>           2. pi_blocked_vcpu_lock:
>             The spinlock to protect pi_blocked_vcpu.
> 
> - Add some scheduler hooks, this part was suggested
>   by Dario Faggioli <dario.faggioli@citrix.com>.
>     * vmx_pre_ctx_switch_pi()
>       It is called before context switch, we update the
>       posted interrupt descriptor when the vCPU is preempted,
>       go to sleep, or is blocked.
> 
>     * vmx_post_ctx_switch_pi()
>       It is called after context switch, we update the posted
>       interrupt descriptor when the vCPU is going to run.
> 
>     * arch_vcpu_wake_prepare()
>       It will be called when waking up the vCPU, we update
>       the posted interrupt descriptor when the vCPU is
>       unblocked.
> 
> CC: Keir Fraser <keir@xen.org>
> CC: Jan Beulich <jbeulich@suse.com>
> CC: Andrew Cooper <andrew.cooper3@citrix.com>
> CC: Kevin Tian <kevin.tian@intel.com>
> CC: George Dunlap <george.dunlap@eu.citrix.com>
> CC: Dario Faggioli <dario.faggioli@citrix.com>
> Sugguested-by: Dario Faggioli <dario.faggioli@citrix.com>
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> v7:
> - Merge [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted interrupts
>   and "[PATCH v6 14/18] vmx: posted-interrupt handling when vCPU is blocked"
>   into this patch, so it is self-contained and more convenient
>   for code review.
> - Make 'pi_blocked_vcpu' and 'pi_blocked_vcpu_lock' static
> - Coding style
> - Use per_cpu() instead of this_cpu() in pi_wakeup_interrupt()
> - Move ack_APIC_irq() to the beginning of pi_wakeup_interrupt()
> - Rename 'pi_ctxt_switch_from' to 'ctxt_switch_prepare'
> - Rename 'pi_ctxt_switch_to' to 'ctxt_switch_cancel'
> - Use 'has_hvm_container_vcpu' instead of 'is_hvm_vcpu'
> - Use 'spin_lock' and 'spin_unlock' when the interrupt has been
>   already disabled.
> - Rename arch_vcpu_wake_prepare to vmx_vcpu_wake_prepare
> - Define vmx_vcpu_wake_prepare in xen/arch/x86/hvm/hvm.c
> - Call .pi_ctxt_switch_to() __context_switch() instead of directly
>   calling vmx_post_ctx_switch_pi() in vmx_ctxt_switch_to()
> - Make .pi_block_cpu unsigned int
> - Use list_del() instead of list_del_init()
> - Coding style
> 
> One remaining item:
> Jan has concern about calling vcpu_unblock() in vmx_pre_ctx_switch_pi(),
> need Dario or George's input about this.
> 
Hi,

Sorry for the delay in replying, I was on PTO for a few time.

Coming to the issue, well, it's a tough call.

First of all, Feng, have you tested this with a debug build of Xen? I'm
asking because it looks to me that you're ending up calling vcpu_wake()
with IRQ disabled which, if my brain is not too rusty after a few weeks
of vacation, should result in check_lock() (in xen/common/spinlock.c)
complaining, doesn't it?

In fact, in principle this is not too much different from what happens
in other places. More specifically, what we have is a vcpu being
re-inserted in  a runqueue, and the need for re-running the scheduler on
a(some) PCPU(s) is evaluated. That is similar to what happens in Credit2
(and in RTDS) in csched2_context_saved(), which is called from within
context_saved(), still from the context switch code (if
__CSFLAG_delayed_runq_add is true).

So it's not the thing per se that is that terrible, IMO. The differences
between that and your case are:
 - in the Credit2 case, it happens later down in the context switch
   path (which would look already better to me) and, more important,
   with IRQs already re-enabled;
 - in the Credit2 case, the effect that something like that can have on 
   the scheduler is much more evident, as it happens inside a scheduler
   hook, rather than buried down in arch specific code, which makes me a
   lot less concerned about the possibility of latent issues Jan was
   hinting at, with which I concur.

So, I guess, first of all, can you confirm whether or not it's exploding
in debug builds? And in either case (just tossing out ideas) would it be
possible to deal with the "interrupt already raised when blocking" case:
 - later in the context switching function ?
 - with another hook, perhaps in vcpu_block() ?

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-16 17:18   ` Dario Faggioli
@ 2015-09-16 18:05     ` Dario Faggioli
  2015-09-17  8:00     ` Wu, Feng
  1 sibling, 0 replies; 86+ messages in thread
From: Dario Faggioli @ 2015-09-16 18:05 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper, xen-devel,
	Jan Beulich



On Wed, 2015-09-16 at 19:18 +0200, Dario Faggioli wrote:
> On Fri, 2015-09-11 at 16:29 +0800, Feng Wu wrote:
> > One remaining item:
> > Jan has concern about calling vcpu_unblock() in vmx_pre_ctx_switch_pi(),
> > need Dario or George's input about this.
> > 
> Hi,
> 
> Sorry for the delay in replying, I was on PTO for a few time.
> 
> Coming to the issue, well, it's a though call.
> 
> First of all, Feng, have you tested this with a debug build of Xen? I'm
> asking because it looks to me that you're ending up calling vcpu_wake()
> with IRQ disabled which, if my brain is not too rusty after a few weeks
> of vacation, should result in check_lock() (in xen/common/spinlock.c)
> complaining, doesn't it?
> 
Mmm.. My bad (so, yes, I'm a bit rusty, I guess). check_lock(), in case
of spin_lock_irqsave(), is called _after_ local_irq_disable(), inside
_spin_lock_irqsave(), so the above should not be an issue. Sorry for
the noise. :-(

The rest of what's said below, and the fact that I agree with George's
and Jan's concerns, still stand. :-)

So, in particular...

> In fact, in principle this is not too much different from what happens
> in other places. More specifically, what we have is a vcpu being
> re-inserted in  a runqueue, and the need for re-running the scheduler on
> a(some) PCPU(s) is evaluated. That is similar to what happens in Credit2
> (and in RTDS) in csched2_context_saved(), which is called from within
> context_saved(), still from the context switch code (if
> __CSFLAG_delayed_runq_add is true).
> 
> So it's not the thing per se that is that terrible, IMO. The differences
> between that and your case are:
>  - in the Credit2 case, it happens later down in the context switch
>    path (which would look already better to me) and, more important,
>    with IRQs already re-enabled;
>  - in the Credit2 case, the effect that something like that can have on 
>    the scheduler is much more evident, as it happens inside a scheduler
>    hook, rather than buried down in arch specific code, which makes me a
>    lot less concerned about the possibility of latent issues Jan was
>    hinting at, with which I concur.
> 
> So, I guess, first of all, can you confirm whether or not it's exploding
> in debug builds? And in either case (just tossing out ideas) would it be
> possible to deal with the "interrupt already raised when blocking" case:
>  - later in the context switching function ?
>  - with another hook, perhaps in vcpu_block() ?
> 
... let me/us know what you think about these alternatives.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-16 17:18   ` Dario Faggioli
  2015-09-16 18:05     ` Dario Faggioli
@ 2015-09-17  8:00     ` Wu, Feng
  2015-09-17  8:48       ` Dario Faggioli
  1 sibling, 1 reply; 86+ messages in thread
From: Wu, Feng @ 2015-09-17  8:00 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Thursday, September 17, 2015 1:18 AM
> To: Wu, Feng
> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
> Cooper; Jan Beulich
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On Fri, 2015-09-11 at 16:29 +0800, Feng Wu wrote:
> > This patch includes the following aspects:
> > - Handling logic when vCPU is blocked:
> >     * Add a global vector to wake up the blocked vCPU
> >       when an interrupt is being posted to it (This part
> >       was sugguested by Yang Zhang <yang.z.zhang@intel.com>).
> >     * Define two per-cpu variables:
> >           1. pi_blocked_vcpu:
> >             A list storing the vCPUs which were blocked
> >             on this pCPU.
> >
> >           2. pi_blocked_vcpu_lock:
> >             The spinlock to protect pi_blocked_vcpu.
> >
> > - Add some scheduler hooks, this part was suggested
> >   by Dario Faggioli <dario.faggioli@citrix.com>.
> >     * vmx_pre_ctx_switch_pi()
> >       It is called before context switch, we update the
> >       posted interrupt descriptor when the vCPU is preempted,
> >       go to sleep, or is blocked.
> >
> >     * vmx_post_ctx_switch_pi()
> >       It is called after context switch, we update the posted
> >       interrupt descriptor when the vCPU is going to run.
> >
> >     * arch_vcpu_wake_prepare()
> >       It will be called when waking up the vCPU, we update
> >       the posted interrupt descriptor when the vCPU is
> >       unblocked.
> >
> > CC: Keir Fraser <keir@xen.org>
> > CC: Jan Beulich <jbeulich@suse.com>
> > CC: Andrew Cooper <andrew.cooper3@citrix.com>
> > CC: Kevin Tian <kevin.tian@intel.com>
> > CC: George Dunlap <george.dunlap@eu.citrix.com>
> > CC: Dario Faggioli <dario.faggioli@citrix.com>
> > Sugguested-by: Dario Faggioli <dario.faggioli@citrix.com>
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
> > ---
> > v7:
> > - Merge [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
> interrupts
> >   and "[PATCH v6 14/18] vmx: posted-interrupt handling when vCPU is
> blocked"
> >   into this patch, so it is self-contained and more convenient
> >   for code review.
> > - Make 'pi_blocked_vcpu' and 'pi_blocked_vcpu_lock' static
> > - Coding style
> > - Use per_cpu() instead of this_cpu() in pi_wakeup_interrupt()
> > - Move ack_APIC_irq() to the beginning of pi_wakeup_interrupt()
> > - Rename 'pi_ctxt_switch_from' to 'ctxt_switch_prepare'
> > - Rename 'pi_ctxt_switch_to' to 'ctxt_switch_cancel'
> > - Use 'has_hvm_container_vcpu' instead of 'is_hvm_vcpu'
> > - Use 'spin_lock' and 'spin_unlock' when the interrupt has been
> >   already disabled.
> > - Rename arch_vcpu_wake_prepare to vmx_vcpu_wake_prepare
> > - Define vmx_vcpu_wake_prepare in xen/arch/x86/hvm/hvm.c
> > - Call .pi_ctxt_switch_to() __context_switch() instead of directly
> >   calling vmx_post_ctx_switch_pi() in vmx_ctxt_switch_to()
> > - Make .pi_block_cpu unsigned int
> > - Use list_del() instead of list_del_init()
> > - Coding style
> >
> > One remaining item:
> > Jan has concern about calling vcpu_unblock() in vmx_pre_ctx_switch_pi(),
> > need Dario or George's input about this.
> >
> Hi,
> 
> Sorry for the delay in replying, I was on PTO for a few time.

That's fine. Thanks for your reply!

> 
> Coming to the issue, well, it's a though call.
> 
> First of all, Feng, have you tested this with a debug build of Xen? I'm
> asking because it looks to me that you're ending up calling vcpu_wake()
> with IRQ disabled which, if my brain is not too rusty after a few weeks
> of vacation, should result in check_lock() (in xen/common/spinlock.c)
> complaining, doesn't it?
> 
> In fact, in principle this is not too much different from what happens
> in other places. More specifically, what we have is a vcpu being
> re-inserted in  a runqueue, and the need for re-running the scheduler on
> a(some) PCPU(s) is evaluated. That is similar to what happens in Credit2
> (and in RTDS) in csched2_context_saved(), which is called from within
> context_saved(), still from the context switch code (if
> __CSFLAG_delayed_runq_add is true).
> 
> So it's not the thing per se that is that terrible, IMO. The differences
> between that and your case are:
>  - in the Credit2 case, it happens later down in the context switch
>    path (which would look already better to me) and, more important,
>    with IRQs already re-enabled;
>  - in the Credit2 case, the effect that something like that can have on
>    the scheduler is much more evident, as it happens inside a scheduler
>    hook, rather than buried down in arch specific code, which makes me a
>    lot less concerned about the possibility of latent issues Jan was
>    hinting at, with which I concur.
> 
> So, I guess, first of all, can you confirm whether or not it's exploding
> in debug builds?

Does the following information in Config.mk mean it is a debug build?

# A debug build of Xen and tools?
debug ?= y
debug_symbols ?= $(debug)

> And in either case (just tossing out ideas) would it be
> possible to deal with the "interrupt already raised when blocking" case:

Thanks for the suggestions below!

>  - later in the context switching function ?
In this case, we might need to set a flag in vmx_pre_ctx_switch_pi() instead
of calling vcpu_unblock() directly; then, when it returns to context_switch(),
we can check the flag and not actually block the vCPU. But I don't have a clear
picture of how to achieve this, so here are some questions from me:
- When we are in context_switch(), we have already made the following changes
to the vCPU's state:
	* sd->curr is set to next
	* vCPU's running state (both prev and next ) is changed by
	  vcpu_runstate_change()
	* next->is_running is set to 1
	* periodic timer for prev is stopped
	* periodic timer for next is setup
	......

So at what point should we perform the action to _unblock_ the vCPU? We
would need to roll back the earlier changes to the vCPU's state, right?

>  - with another hook, perhaps in vcpu_block() ?

We could check this in vcpu_block(); however, the logic here is that before
the vCPU is blocked we need to change the posted-interrupt descriptor, and
if the 'ON' bit is set while we are changing it (which means the VT-d
hardware has issued a notification event because an interrupt from the
assigned device has arrived), we don't need to block the vCPU and hence
there is no need to update the PI descriptor in this case. Checking in
vcpu_block() might not cover this logic.
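
Just to illustrate the logic I mean, here is a rough sketch of the block
path (pi_wakeup_vector is the vector from this series; pi_test_on() stands
for whichever helper tests the 'ON' bit; the rest is hand-waved and is not
the actual patch code):

/*
 * Hypothetical sketch: switch NV to the wakeup vector before blocking,
 * but give up if the hardware has already posted an interrupt (ON set),
 * in which case the vCPU must not block at all.
 */
static bool_t pi_try_block(struct vcpu *v)
{
    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
    struct pi_desc old, new;

    do {
        old.control = new.control = pi_desc->control;

        if ( pi_test_on(&old) )
            return 0;          /* notification already posted: don't block */

        new.nv = pi_wakeup_vector;
    } while ( cmpxchg(&pi_desc->control, old.control, new.control) !=
              old.control );

    return 1;                  /* safe to block, wakeup vector is in place */
}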

Thanks,
Feng

> 
> Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-17  8:00     ` Wu, Feng
@ 2015-09-17  8:48       ` Dario Faggioli
  2015-09-17  9:16         ` Wu, Feng
  2015-09-17  9:38         ` George Dunlap
  0 siblings, 2 replies; 86+ messages in thread
From: Dario Faggioli @ 2015-09-17  8:48 UTC (permalink / raw)
  To: Wu, Feng
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich



On Thu, 2015-09-17 at 08:00 +0000, Wu, Feng wrote:

> > -----Original Message-----
> > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]

> > So, I guess, first of all, can you confirm whether or not it's exploding
> > in debug builds?
> 
> Does the following information in Config.mk mean it is a debug build?
> 
> # A debug build of Xen and tools?
> debug ?= y
> debug_symbols ?= $(debug)
> 
I think so. But as I said in my other email, I was wrong, and this is
probably not an issue.

> > And in either case (just tossing out ideas) would it be
> > possible to deal with the "interrupt already raised when blocking" case:
> 
> Thanks for the suggestions below!
> 
:-)

> >  - later in the context switching function ?
> In this case, we might need to set a flag in vmx_pre_ctx_switch_pi() instead
> of calling vcpu_unblock() directly, then when it returns to context_switch(),
> we can check the flag and don't really block the vCPU. 
>
Yeah, and that would still be rather hard to understand and maintain,
IMO.

> But I don't have a clear
> picture about how to archive this, here are some questions from me:
> - When we are in context_switch(), we have done the following changes to
> vcpu's state:
> 	* sd->curr is set to next
> 	* vCPU's running state (both prev and next ) is changed by
> 	  vcpu_runstate_change()
> 	* next->is_running is set to 1
> 	* periodic timer for prev is stopped
> 	* periodic timer for next is setup
> 	......
> 
> So what point should we perform the action to _unblock_ the vCPU? We
> Need to roll back the formal changes to the vCPU's state, right?
> 
Mmm... not really. Not blocking prev does not mean that prev would be
kept running on the pCPU, and that's true for your current solution as
well! As you say yourself, you're already in the process of switching
between prev and next, at a point where it is already decided that next
will be the vCPU that runs. Not blocking means that prev is reinserted
into the runqueue, and a new invocation of the scheduler is (potentially)
queued as well (via raising SCHEDULE_SOFTIRQ, in __runq_tickle()), but it
is only when that new scheduling pass happens that prev will (potentially)
be selected to run again.

So, no, unless I'm completely missing your point, there wouldn't be any
rollback required. However, I would still like the other solution (doing
stuff in vcpu_block()) better (see below).

> >  - with another hook, perhaps in vcpu_block() ?
> 
> We could check this in vcpu_block(), however, the logic here is that before
> vCPU is blocked, we need to change the posted-interrupt descriptor,
> and during changing it, if 'ON' bit is set, which means VT-d hardware
> issues a notification event because interrupts from the assigned devices
> is coming, we don't need to block the vCPU and hence no need to update
> the PI descriptor in this case. 
>
Yep, I saw that. But would it be possible to do *everything* related to
blocking, including the update of the descriptor, in vcpu_block(), if no
interrupt has been raised yet at that time? I mean, would you, if
updating the descriptor in there, still get the event that allows you to
call vcpu_wake(), and hence vmx_vcpu_wake_prepare(), which would undo
the blocking, no matter whether that resulted in an actual context
switch already or not?

I appreciate that this narrows the window for such an event to happen by
quite a bit, making the logic itself a little less useful (it makes
things more similar to "regular" blocking vs. event delivery, though,
AFAICT), but if it's correct, and if it allows us to save the ugly
invocation of vcpu_unblock from context switch context, I'd give it a
try.

After all, this PI thing requires actions to be taken when a vCPU is
scheduled or descheduled because of blocking, unblocking and
preemptions, and it would seem natural to me to:
 - deal with blockings in vcpu_block()
 - deal with unblockings in vcpu_wake()
 - deal with preemptions in context_switch()

This does not mean that being able to consolidate some of the cases (like
blockings and preemptions, as in the current version of the code) would not
be a nice thing... but we don't want it at all costs. :-)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-17  8:48       ` Dario Faggioli
@ 2015-09-17  9:16         ` Wu, Feng
  2015-09-17  9:38         ` George Dunlap
  1 sibling, 0 replies; 86+ messages in thread
From: Wu, Feng @ 2015-09-17  9:16 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Thursday, September 17, 2015 4:48 PM
> To: Wu, Feng
> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
> Cooper; Jan Beulich
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On Thu, 2015-09-17 at 08:00 +0000, Wu, Feng wrote:
> 
> > > -----Original Message-----
> > > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> 
> > > So, I guess, first of all, can you confirm whether or not it's exploding
> > > in debug builds?
> >
> > Does the following information in Config.mk mean it is a debug build?
> >
> > # A debug build of Xen and tools?
> > debug ?= y
> > debug_symbols ?= $(debug)
> >
> I think so. But as I said in my other email, I was wrong, and this is
> probably not an issue.
> 
> > > And in either case (just tossing out ideas) would it be
> > > possible to deal with the "interrupt already raised when blocking" case:
> >
> > Thanks for the suggestions below!
> >
> :-)
> 
> > >  - later in the context switching function ?
> > In this case, we might need to set a flag in vmx_pre_ctx_switch_pi() instead
> > of calling vcpu_unblock() directly, then when it returns to context_switch(),
> > we can check the flag and don't really block the vCPU.
> >
> Yeah, and that would still be rather hard to understand and maintain,
> IMO.
> 
> > But I don't have a clear
> > picture about how to archive this, here are some questions from me:
> > - When we are in context_switch(), we have done the following changes to
> > vcpu's state:
> > 	* sd->curr is set to next
> > 	* vCPU's running state (both prev and next ) is changed by
> > 	  vcpu_runstate_change()
> > 	* next->is_running is set to 1
> > 	* periodic timer for prev is stopped
> > 	* periodic timer for next is setup
> > 	......
> >
> > So what point should we perform the action to _unblock_ the vCPU? We
> > Need to roll back the formal changes to the vCPU's state, right?
> >
> Mmm... not really. Not blocking prev does not mean that prev would be
> kept running on the pCPU, and that's true for your current solution as
> well! As you say yourself, you're already in the process of switching
> between prev and next, at a point where it's already a thing that next
> will be the vCPU that will run. Not blocking means that prev is
> reinserted to the runqueue, and a new invocation to the scheduler is
> (potentially) queued as well (via raising SCHEDULE_SOFTIRQ, in
> __runq_tickle()), but it's only when such new scheduling happens that
> prev will (potentially) be selected to run again.
> 
> So, no, unless I'm fully missing your point, there wouldn't be no
> rollback required. However, I still would like the other solution (doing
> stuff in vcpu_block()) better (see below).

Thanks for the detailed clarification. Yes, maybe my description above
is a little vague. Yes, the non-blocking vCPU should be put into the
runqueue. I shouldn't use the term "roll back". :)

> 
> > >  - with another hook, perhaps in vcpu_block() ?
> >
> > We could check this in vcpu_block(), however, the logic here is that before
> > vCPU is blocked, we need to change the posted-interrupt descriptor,
> > and during changing it, if 'ON' bit is set, which means VT-d hardware
> > issues a notification event because interrupts from the assigned devices
> > is coming, we don't need to block the vCPU and hence no need to update
> > the PI descriptor in this case.
> >
> Yep, I saw that. But could it be possible to do *everything* related to
> blocking, including the update of the descriptor, in vcpu_block(), if no
> interrupt have been raised yet at that time? I mean, would you, if
> updating the descriptor in there, still get the event that allows you to
> call vcpu_wake(), and hence vmx_vcpu_wake_prepare(), which would undo
> the blocking, no matter whether that resulted in an actual context
> switch already or not?
> 
> I appreciate that this narrows the window for such an event to happen by
> quite a bit, making the logic itself a little less useful (it makes
> things more similar to "regular" blocking vs. event delivery, though,
> AFAICT), but if it's correct, ad if it allows us to save the ugly
> invocation of vcpu_unblock from context switch context, I'd give it a
> try.
> 
> After all, this PI thing requires actions to be taken when a vCPU is
> scheduled or descheduled because of blocking, unblocking and
> preemptions, and it would seem natural to me to:
>  - deal with blockings in vcpu_block()
>  - deal with unblockings in vcpu_wake()
>  - deal with preemptions in context_switch()
> 
> This does not mean being able to consolidate some of the cases (like
> blockings and preemptions, in the current version of the code) were not
> a nice thing... But we don't want it at all costs . :-)

Yes, doing this in vcpu_block() is indeed an alternative; in fact, I also thought
about it before. I need to think about it more to see whether it meets all the
requirements. I will get back once I have a clear picture. Thank you!

Thanks,
Feng

> 
> Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-17  8:48       ` Dario Faggioli
  2015-09-17  9:16         ` Wu, Feng
@ 2015-09-17  9:38         ` George Dunlap
  2015-09-17  9:39           ` George Dunlap
  2015-09-17 11:44           ` George Dunlap
  1 sibling, 2 replies; 86+ messages in thread
From: George Dunlap @ 2015-09-17  9:38 UTC (permalink / raw)
  To: Dario Faggioli, Wu, Feng
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich

On 09/17/2015 09:48 AM, Dario Faggioli wrote:
> On Thu, 2015-09-17 at 08:00 +0000, Wu, Feng wrote:
> 
>>> -----Original Message-----
>>> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> 
>>> So, I guess, first of all, can you confirm whether or not it's exploding
>>> in debug builds?
>>
>> Does the following information in Config.mk mean it is a debug build?
>>
>> # A debug build of Xen and tools?
>> debug ?= y
>> debug_symbols ?= $(debug)
>>
> I think so. But as I said in my other email, I was wrong, and this is
> probably not an issue.
> 
>>> And in either case (just tossing out ideas) would it be
>>> possible to deal with the "interrupt already raised when blocking" case:
>>
>> Thanks for the suggestions below!
>>
> :-)
> 
>>>  - later in the context switching function ?
>> In this case, we might need to set a flag in vmx_pre_ctx_switch_pi() instead
>> of calling vcpu_unblock() directly, then when it returns to context_switch(),
>> we can check the flag and don't really block the vCPU. 
>>
> Yeah, and that would still be rather hard to understand and maintain,
> IMO.
> 
>> But I don't have a clear
>> picture about how to archive this, here are some questions from me:
>> - When we are in context_switch(), we have done the following changes to
>> vcpu's state:
>> 	* sd->curr is set to next
>> 	* vCPU's running state (both prev and next ) is changed by
>> 	  vcpu_runstate_change()
>> 	* next->is_running is set to 1
>> 	* periodic timer for prev is stopped
>> 	* periodic timer for next is setup
>> 	......
>>
>> So what point should we perform the action to _unblock_ the vCPU? We
>> Need to roll back the formal changes to the vCPU's state, right?
>>
> Mmm... not really. Not blocking prev does not mean that prev would be
> kept running on the pCPU, and that's true for your current solution as
> well! As you say yourself, you're already in the process of switching
> between prev and next, at a point where it's already a thing that next
> will be the vCPU that will run. Not blocking means that prev is
> reinserted to the runqueue, and a new invocation to the scheduler is
> (potentially) queued as well (via raising SCHEDULE_SOFTIRQ, in
> __runq_tickle()), but it's only when such new scheduling happens that
> prev will (potentially) be selected to run again.
> 
> So, no, unless I'm fully missing your point, there wouldn't be no
> rollback required. However, I still would like the other solution (doing
> stuff in vcpu_block()) better (see below).
> 
>>>  - with another hook, perhaps in vcpu_block() ?
>>
>> We could check this in vcpu_block(), however, the logic here is that before
>> vCPU is blocked, we need to change the posted-interrupt descriptor,
>> and during changing it, if 'ON' bit is set, which means VT-d hardware
>> issues a notification event because interrupts from the assigned devices
>> is coming, we don't need to block the vCPU and hence no need to update
>> the PI descriptor in this case. 
>>
> Yep, I saw that. But could it be possible to do *everything* related to
> blocking, including the update of the descriptor, in vcpu_block(), if no
> interrupt have been raised yet at that time? I mean, would you, if
> updating the descriptor in there, still get the event that allows you to
> call vcpu_wake(), and hence vmx_vcpu_wake_prepare(), which would undo
> the blocking, no matter whether that resulted in an actual context
> switch already or not?
> 
> I appreciate that this narrows the window for such an event to happen by
> quite a bit, making the logic itself a little less useful (it makes
> things more similar to "regular" blocking vs. event delivery, though,
> AFAICT), but if it's correct, ad if it allows us to save the ugly
> invocation of vcpu_unblock from context switch context, I'd give it a
> try.
> 
> After all, this PI thing requires actions to be taken when a vCPU is
> scheduled or descheduled because of blocking, unblocking and
> preemptions, and it would seem natural to me to:
>  - deal with blockings in vcpu_block()
>  - deal with unblockings in vcpu_wake()
>  - deal with preemptions in context_switch()
> 
> This does not mean being able to consolidate some of the cases (like
> blockings and preemptions, in the current version of the code) were not
> a nice thing... But we don't want it at all costs . :-)

So just to clarify the situation...

If a vcpu is configured for the "running" state (i.e., NV set to
"posted_intr_vector", notifications enabled), and an interrupt for it
arrives while we are in the hypervisor -- what happens?

Is it the case that the interrupt is not actually delivered to the
processor, but that the pending bit will be set in the pi field, so that
the interrupt will be delivered the next time the hypervisor returns
into the guest?

(I am assuming that is the case, because if the hypervisor *does* get an
interrupt, then it can just unblock it there.)

This sort of race condition -- where we get an interrupt to wake up a
vcpu as we're blocking -- is already handled for "old-style" interrupts
in vcpu_block:

void vcpu_block(void)
{
    struct vcpu *v = current;

    set_bit(_VPF_blocked, &v->pause_flags);

    /* Check for events /after/ blocking: avoids wakeup waiting race. */
    if ( local_events_need_delivery() )
    {
        clear_bit(_VPF_blocked, &v->pause_flags);
    }
    else
    {
        TRACE_2D(TRC_SCHED_BLOCK, v->domain->domain_id, v->vcpu_id);
        raise_softirq(SCHEDULE_SOFTIRQ);
    }
}

That is, we set _VPF_blocked, so that any interrupt which would wake it
up actually wakes it up, and then we check local_events_need_delivery()
to see if there were any that came in after we decided to block but
before we made sure that an interrupt would wake us up.

I think it would be best if we could keep all the logic that does the
same thing in the same place.  Which would mean in vcpu_block(), after
calling set_bit(_VPF_blocked), changing the NV to pi_wakeup_vector, and
then extending local_events_need_delivery() to also look for pending PI
events.
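
Concretely, something like the sketch below is what I have in mind
(arch_vcpu_block() is just a placeholder name for the new hook; treat the
whole thing as pseudo-code rather than a proposed patch):

void vcpu_block(void)
{
    struct vcpu *v = current;

    set_bit(_VPF_blocked, &v->pause_flags);

    /* PI: NV => pi_wakeup_vector, add v to the per-pCPU blocked list. */
    arch_vcpu_block(v);

    /*
     * Check for events /after/ blocking: avoids wakeup waiting race.
     * local_events_need_delivery() would be extended to also notice
     * interrupts already posted in the PI descriptor.
     */
    if ( local_events_need_delivery() )
    {
        clear_bit(_VPF_blocked, &v->pause_flags);
    }
    else
    {
        TRACE_2D(TRC_SCHED_BLOCK, v->domain->domain_id, v->vcpu_id);
        raise_softirq(SCHEDULE_SOFTIRQ);
    }
}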

Looking a bit more at your states, I think the actions that need to be
taken on all the transitions are these (collapsing 'runnable' and
'offline' into the same state):

blocked -> runnable (vcpu_wake)
 - NV = posted_intr_vector
 - Take vcpu off blocked list
 - SN = 1
runnable -> running (context_switch)
 - SN = 0
running -> runnable (context_switch)
 - SN = 1
running -> blocked (vcpu_block)
 - NV = pi_wakeup_vector
 - Add vcpu to blocked list
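
For the blocked -> runnable leg, a very rough sketch of the wake side (the
per-cpu list/lock and the pi_block_cpu field are the names used in this
patch's changelog, the list-entry field name is made up, and the locking is
hand-waved) could be:

static void vmx_pi_wake(struct vcpu *v)
{
    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
    unsigned int cpu = v->arch.hvm_vmx.pi_block_cpu;
    unsigned long flags;

    /* Take the vcpu off the blocked list of the pCPU it blocked on. */
    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock, cpu), flags);
    list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, cpu), flags);

    /* NV back to the "running" vector; suppress until it actually runs. */
    pi_desc->nv = posted_intr_vector;   /* atomicity hand-waved here */
    pi_set_sn(pi_desc);
}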

This actually has a few pretty nice properties:
1. You have a nice pair of complementary actions -- block / wake, run /
preempt
2. The potentially long actions with lists happen in vcpu_wake and
vcpu_block, not on the context switch path

And at this point, we don't have the "lazy context switch" issue
anymore, do we?  Because we've handled the "blocking" case in
vcpu_block(), we don't need to do anything in the main context_switch()
(which may do the lazy context switching into idle).  We only need to do
something in the actual __context_switch().

And at that point, could we actually get rid of the PI-specific context
switch hooks altogether, and just put the SN state changes required for
running->runnable and runnable->running in vmx_ctxt_switch_from() and
vmx_ctxt_switch_to()?
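
Something like the sketch below, called from vmx_ctxt_switch_from() /
vmx_ctxt_switch_to() (pi_set_sn()/pi_clear_sn() stand for the SN helpers
added earlier in the series; the iommu_intpost / has_hvm_container_vcpu()
guards are my guess at the right conditions):

/* running -> runnable: suppress notifications while descheduled. */
static void vmx_pi_switch_from(struct vcpu *v)
{
    if ( iommu_intpost && has_hvm_container_vcpu(v) )
        pi_set_sn(&v->arch.hvm_vmx.pi_desc);
}

/* runnable -> running: accept notifications again. */
static void vmx_pi_switch_to(struct vcpu *v)
{
    if ( iommu_intpost && has_hvm_container_vcpu(v) )
        pi_clear_sn(&v->arch.hvm_vmx.pi_desc);
}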

If so, then the only hooks we need to add are vcpu_block and vcpu_wake.
 To keep these consistent with other scheduling-related functions, I
would put these in arch_vcpu, next to ctxt_switch_from() and
ctxt_switch_to().

Thoughts?

 -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-17  9:38         ` George Dunlap
@ 2015-09-17  9:39           ` George Dunlap
  2015-09-17 11:44           ` George Dunlap
  1 sibling, 0 replies; 86+ messages in thread
From: George Dunlap @ 2015-09-17  9:39 UTC (permalink / raw)
  To: Dario Faggioli, Wu, Feng
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich

On 09/17/2015 10:38 AM, George Dunlap wrote:
> On 09/17/2015 09:48 AM, Dario Faggioli wrote:
>> On Thu, 2015-09-17 at 08:00 +0000, Wu, Feng wrote:
>>
>>>> -----Original Message-----
>>>> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
>>
>>>> So, I guess, first of all, can you confirm whether or not it's exploding
>>>> in debug builds?
>>>
>>> Does the following information in Config.mk mean it is a debug build?
>>>
>>> # A debug build of Xen and tools?
>>> debug ?= y
>>> debug_symbols ?= $(debug)
>>>
>> I think so. But as I said in my other email, I was wrong, and this is
>> probably not an issue.
>>
>>>> And in either case (just tossing out ideas) would it be
>>>> possible to deal with the "interrupt already raised when blocking" case:
>>>
>>> Thanks for the suggestions below!
>>>
>> :-)
>>
>>>>  - later in the context switching function ?
>>> In this case, we might need to set a flag in vmx_pre_ctx_switch_pi() instead
>>> of calling vcpu_unblock() directly, then when it returns to context_switch(),
>>> we can check the flag and don't really block the vCPU. 
>>>
>> Yeah, and that would still be rather hard to understand and maintain,
>> IMO.
>>
>>> But I don't have a clear
>>> picture about how to archive this, here are some questions from me:
>>> - When we are in context_switch(), we have done the following changes to
>>> vcpu's state:
>>> 	* sd->curr is set to next
>>> 	* vCPU's running state (both prev and next ) is changed by
>>> 	  vcpu_runstate_change()
>>> 	* next->is_running is set to 1
>>> 	* periodic timer for prev is stopped
>>> 	* periodic timer for next is setup
>>> 	......
>>>
>>> So what point should we perform the action to _unblock_ the vCPU? We
>>> Need to roll back the formal changes to the vCPU's state, right?
>>>
>> Mmm... not really. Not blocking prev does not mean that prev would be
>> kept running on the pCPU, and that's true for your current solution as
>> well! As you say yourself, you're already in the process of switching
>> between prev and next, at a point where it's already a thing that next
>> will be the vCPU that will run. Not blocking means that prev is
>> reinserted to the runqueue, and a new invocation to the scheduler is
>> (potentially) queued as well (via raising SCHEDULE_SOFTIRQ, in
>> __runq_tickle()), but it's only when such new scheduling happens that
>> prev will (potentially) be selected to run again.
>>
>> So, no, unless I'm fully missing your point, there wouldn't be no
>> rollback required. However, I still would like the other solution (doing
>> stuff in vcpu_block()) better (see below).
>>
>>>>  - with another hook, perhaps in vcpu_block() ?
>>>
>>> We could check this in vcpu_block(), however, the logic here is that before
>>> vCPU is blocked, we need to change the posted-interrupt descriptor,
>>> and during changing it, if 'ON' bit is set, which means VT-d hardware
>>> issues a notification event because interrupts from the assigned devices
>>> is coming, we don't need to block the vCPU and hence no need to update
>>> the PI descriptor in this case. 
>>>
>> Yep, I saw that. But could it be possible to do *everything* related to
>> blocking, including the update of the descriptor, in vcpu_block(), if no
>> interrupt have been raised yet at that time? I mean, would you, if
>> updating the descriptor in there, still get the event that allows you to
>> call vcpu_wake(), and hence vmx_vcpu_wake_prepare(), which would undo
>> the blocking, no matter whether that resulted in an actual context
>> switch already or not?
>>
>> I appreciate that this narrows the window for such an event to happen by
>> quite a bit, making the logic itself a little less useful (it makes
>> things more similar to "regular" blocking vs. event delivery, though,
>> AFAICT), but if it's correct, ad if it allows us to save the ugly
>> invocation of vcpu_unblock from context switch context, I'd give it a
>> try.
>>
>> After all, this PI thing requires actions to be taken when a vCPU is
>> scheduled or descheduled because of blocking, unblocking and
>> preemptions, and it would seem natural to me to:
>>  - deal with blockings in vcpu_block()
>>  - deal with unblockings in vcpu_wake()
>>  - deal with preemptions in context_switch()
>>
>> This does not mean being able to consolidate some of the cases (like
>> blockings and preemptions, in the current version of the code) were not
>> a nice thing... But we don't want it at all costs . :-)
> 
> So just to clarify the situation...

Er, and to clarify something else -- Technically I'm responding to Dario
here, but my mail is actually addressed to Wu Feng.  This was just a
good point to "put my oar in" to the conversation. :-)

 -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-17  9:38         ` George Dunlap
  2015-09-17  9:39           ` George Dunlap
@ 2015-09-17 11:44           ` George Dunlap
  2015-09-17 12:40             ` Dario Faggioli
  1 sibling, 1 reply; 86+ messages in thread
From: George Dunlap @ 2015-09-17 11:44 UTC (permalink / raw)
  To: Dario Faggioli, Wu, Feng
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich

On 09/17/2015 10:38 AM, George Dunlap wrote:
> Is it the case that the interrupt is not actually delivered to the
> processor, but that the pending bit will be set in the pi field, so that
> the interrupt will be delivered the next time the hypervisor returns
> into the guest?
> 
> (I am assuming that is the case, because if the hypervisor *does* get an
> interrupt, then it can just unblock it there.)

Actually, it looks like you *do* in fact get a
pi_notification_interrupt() in this case.  Could we check to see whether
the current vcpu is blocked and, if so, unblock it?
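
Something along these lines in the handler, I mean (rough sketch only;
whether it is safe to look at current like this at that point is exactly
the open question):

static void pi_notification_interrupt(struct cpu_user_regs *regs)
{
    ack_APIC_irq();

    /*
     * If the interrupt was posted while the current vCPU was on its way
     * to being blocked (_VPF_blocked already set but the context switch
     * not yet done), undo the blocking here rather than relying on the
     * wakeup vector.
     */
    if ( test_bit(_VPF_blocked, &current->pause_flags) )
        vcpu_unblock(current);
}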

I haven't yet decided whether I prefer my original suggestion of
switching the interrupt and putting things on the wake-up list in
vcpu_block(), or of deferring adding things to the wake-up list until
the actual context switch.

 -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-17 11:44           ` George Dunlap
@ 2015-09-17 12:40             ` Dario Faggioli
  2015-09-17 14:30               ` George Dunlap
  0 siblings, 1 reply; 86+ messages in thread
From: Dario Faggioli @ 2015-09-17 12:40 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Wu, Feng



On Thu, 2015-09-17 at 12:44 +0100, George Dunlap wrote:
> On 09/17/2015 10:38 AM, George Dunlap wrote:
> > Is it the case that the interrupt is not actually delivered to the
> > processor, but that the pending bit will be set in the pi field, so that
> > the interrupt will be delivered the next time the hypervisor returns
> > into the guest?
> > 
> > (I am assuming that is the case, because if the hypervisor *does* get an
> > interrupt, then it can just unblock it there.)
> 
> Actually, it looks like you *do* in fact get a
> pi_notification_interrupt() in this case.  Could we to check to see if
> the current vcpu is blocked and unblock it?
> 
> I haven't yet decided whether I prefer my original suggestion of
> switching the interrupt and putting things on the wake-up list in
> vcpu_block(), or of deferring adding things to the wake-up list until
> the actual context switch.
> 
Sorry, but I don't get what you mean by the latter.

In particular, I don't think I understand what you mean by, and how it
would work to, "defer[ring] adding things to the wake-up list until the
actual context switch"... In what case would you defer things to the
context switch?

Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-17 12:40             ` Dario Faggioli
@ 2015-09-17 14:30               ` George Dunlap
  2015-09-17 16:36                 ` Dario Faggioli
  2015-09-18  6:27                 ` Jan Beulich
  0 siblings, 2 replies; 86+ messages in thread
From: George Dunlap @ 2015-09-17 14:30 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Wu, Feng

On 09/17/2015 01:40 PM, Dario Faggioli wrote:
> On Thu, 2015-09-17 at 12:44 +0100, George Dunlap wrote:
>> On 09/17/2015 10:38 AM, George Dunlap wrote:
>>> Is it the case that the interrupt is not actually delivered to the
>>> processor, but that the pending bit will be set in the pi field, so that
>>> the interrupt will be delivered the next time the hypervisor returns
>>> into the guest?
>>>
>>> (I am assuming that is the case, because if the hypervisor *does* get an
>>> interrupt, then it can just unblock it there.)
>>
>> Actually, it looks like you *do* in fact get a
>> pi_notification_interrupt() in this case.  Could we to check to see if
>> the current vcpu is blocked and unblock it?
>>
>> I haven't yet decided whether I prefer my original suggestion of
>> switching the interrupt and putting things on the wake-up list in
>> vcpu_block(), or of deferring adding things to the wake-up list until
>> the actual context switch.
>>
> Sorry but I don't get what you mean with the latter.
> 
> I particular, I don't think I understand what you mean with and how it
> would work to "defer[ring] adding things to the wake-up list until
> actual context switch"... In what case would you defer stuff to context
> switch?

So one option is to do the "blocking" stuff in an arch-specific call
from vcpu_block():

vcpu_block()
 set(_VPF_blocked)
 v->arch.block()
  - Add v to pcpu.pi_blocked_vcpu
  - NV => pi_wakeup_vector
 local_events_need_delivery()
   hvm_vcpu_has_pending_irq()

 ...
 context_switch(): nothing

The other is to do the "blocking" stuff in the context switch (similar
to what's done now):

vcpu_block()
  set(_VPF_blocked)
  local_events_need_delivery()
    hvm_vcpu_has_pending_irq()
  ...
context_switch
   v->arch.block()
    - Add v to pcpu.pi_blocked_vcpu
    - NV => pi_wakeup_vector

If we do it the first way, and an interrupt comes in before the context
switch is finished, it will call pi_wakeup_vector, which will DTRT --
take v off the pi_blocked_vcpu list and call vcpu_unblock() (which in
turn will set NV back to posted_intr_vector).

If we do it the second way, and an interrupt comes in before the context
switch is finished, it will call posted_intr_vector.  We can, at that
point, check to see if the current vcpu is marked as blocked.  If it is,
we can call vcpu_unblock() without having to modify NV or worry about
adding / removing the vcpu from the pi_blocked_vcpu list.

The thing I like about the first one is that it makes PI blocking the
same as normal blocking -- everything happens in the same place; so the
code is cleaner and easier to understand.

The thing I like about the second one is that it cleverly avoids having
to do all the work of adding the vcpu to the list and then searching to
remove it if the vcpu in question gets woken up on the way to being
blocked.  So the code may end up being faster for workloads where that
happens frequently.

 -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-17 14:30               ` George Dunlap
@ 2015-09-17 16:36                 ` Dario Faggioli
  2015-09-18  6:27                 ` Jan Beulich
  1 sibling, 0 replies; 86+ messages in thread
From: Dario Faggioli @ 2015-09-17 16:36 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Wu, Feng



On Thu, 2015-09-17 at 15:30 +0100, George Dunlap wrote:
> On 09/17/2015 01:40 PM, Dario Faggioli wrote:

> >> I haven't yet decided whether I prefer my original suggestion of
> >> switching the interrupt and putting things on the wake-up list in
> >> vcpu_block(), or of deferring adding things to the wake-up list until
> >> the actual context switch.
> >>
> > Sorry but I don't get what you mean with the latter.
> > 
> > I particular, I don't think I understand what you mean with and how it
> > would work to "defer[ring] adding things to the wake-up list until
> > actual context switch"... In what case would you defer stuff to context
> > switch?
> 
> So one option is to do the "blocking" stuff in an arch-specific call
> from vcpu_block():
> 
> vcpu_block()
>  set(_VPF_blocked)
>  v->arch.block()
>   - Add v to pcpu.pi_blocked_vcpu
>   - NV => pi_wakeup_vector
>  local_events_need_delivery()
>    hvm_vcpu_has_pending_irq()
> 
>  ...
>  context_switch(): nothing
> 
> The other is to do the "blocking" stuff in the context switch (similar
> to what's done now):
> 
> vcpu_block()
>   set(_VPF_blocked)
>   local_events_need_delivery()
>     hvm_vcpu_has_pending_irq()
>   ...
> context_switch
>    v->arch.block()
>     - Add v to pcpu.pi_blocked_vcpu
>     - NV => pi_wakeup_vector
> 
Ok, thanks for elaborating.

> If we do it the first way, and an interrupt comes in before the context
> switch is finished, it will call pi_wakeup_vector, which will DTRT --
> take v off the pi_blocked_vcpu list and call vcpu_unblock() (which in
> turn will set NV back to posted_intr_vector).
> 
> If we do it the second way, and an interrupt comes in before the context
> switch is finished, it will call posted_intr_vector.  We can, at that
> point, check to see if the current vcpu is marked as blocked.  If it is,
> we can call vcpu_unblock() without having to modify NV or worry about
> adding / removing the vcpu from the pi_blocked_vcpu list.
> 
Right.

> The thing I like about the first one is that it makes PI blocking the
> same as normal blocking -- everything happens in the same place; so the
> code is cleaner and easier to understand.
> 
Indeed.

> The thing I like about the second one is that it cleverly avoids having
> to do all the work of adding the vcpu to the list and then searching to
> remove it if the vcpu in question gets woken up on the way to being
> blocked.  So the code may end up being faster for workloads where that
> happens frequently.
> 
Maybe some instrumentation to figure out how frequent this is could be
added, and a few tests performed, at least on a couple of (presumably)
common workloads? Feng?

One thing that may be worth giving some thought to is whether
interrupts are enabled or not. I mean, during vcpu_block()
SCHEDULE_SOFTIRQ is raised, for invoking the scheduler. Then (at least)
during schedule(), IRQs are disabled. They're re-enabled near the end of
schedule(), and re-disabled for the actual context switch ( ~= around
__context_switch()).

My point being that IRQs are going to be disabled for a significant
amount of time, between vcpu_block() and the actual context being
switched. And solution 2 requires IRQs to be enabled in order to avoid a
potentially pointless NV update, doesn't it? If yes, that may
(negatively) affect the probability of being able to actually benefit
from the optimization...

Anyway, I'm not sure. I think my gut feeling is in favour of solution
1, but, really, it's hard to tell without actually trying.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-17 14:30               ` George Dunlap
  2015-09-17 16:36                 ` Dario Faggioli
@ 2015-09-18  6:27                 ` Jan Beulich
  2015-09-18  9:22                   ` Dario Faggioli
  1 sibling, 1 reply; 86+ messages in thread
From: Jan Beulich @ 2015-09-18  6:27 UTC (permalink / raw)
  To: dario.faggioli, george.dunlap
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, feng.wu

>>> George Dunlap <george.dunlap@citrix.com> 09/17/15 4:30 PM >>>
>On 09/17/2015 01:40 PM, Dario Faggioli wrote:
>So one option is to do the "blocking" stuff in an arch-specific call
>from vcpu_block():
>
>vcpu_block()
>  set(_VPF_blocked)
>  v->arch.block()
>   - Add v to pcpu.pi_blocked_vcpu
>   - NV => pi_wakeup_vector
>  local_events_need_delivery()
>    hvm_vcpu_has_pending_irq()
>
>  ...
>  context_switch(): nothing
>
>The other is to do the "blocking" stuff in the context switch (similar
>to what's done now):
>
>vcpu_block()
>   set(_VPF_blocked)
>   local_events_need_delivery()
>     hvm_vcpu_has_pending_irq()
>   ...
>context_switch
>    v->arch.block()
>     - Add v to pcpu.pi_blocked_vcpu
>     - NV => pi_wakeup_vector
>
>If we do it the first way, and an interrupt comes in before the context
>switch is finished, it will call pi_wakeup_vector, which will DTRT --
>take v off the pi_blocked_vcpu list and call vcpu_unblock() (which in
>turn will set NV back to posted_intr_vector).
>
>If we do it the second way, and an interrupt comes in before the context
>switch is finished, it will call posted_intr_vector.  We can, at that
>point, check to see if the current vcpu is marked as blocked.  If it is,
>we can call vcpu_unblock() without having to modify NV or worry about
>adding / removing the vcpu from the pi_blocked_vcpu list.
>
>The thing I like about the first one is that it makes PI blocking the
>same as normal blocking -- everything happens in the same place; so the
>code is cleaner and easier to understand.
>
>The thing I like about the second one is that it cleverly avoids having
>to do all the work of adding the vcpu to the list and then searching to
>remove it if the vcpu in question gets woken up on the way to being
>blocked.  So the code may end up being faster for workloads where that
>happens frequently.

 But wouldn't such an optimization argument apply to some/all other
blocking ways too? I.e. shouldn't, if we were to go that route, other
early wakeups get treated equally? Unless that's wrong thinking on my part,
or it's being implemented that way, I'd clearly favor option 1 for
consistency reasons.

Jan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-18  6:27                 ` Jan Beulich
@ 2015-09-18  9:22                   ` Dario Faggioli
  2015-09-18 14:31                     ` George Dunlap
  0 siblings, 1 reply; 86+ messages in thread
From: Dario Faggioli @ 2015-09-18  9:22 UTC (permalink / raw)
  To: Jan Beulich, george.dunlap
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel, feng.wu



On Fri, 2015-09-18 at 00:27 -0600, Jan Beulich wrote:
> > George Dunlap <george.dunlap@citrix.com> 09/17/15 4:30 PM >>>
>
> > vcpu_block()
> >  set(_VPF_blocked)
> >  v->arch.block()
> >   - Add v to pcpu.pi_blocked_vcpu
> >   - NV => pi_wakeup_vector
> >  local_events_need_delivery()
> >    hvm_vcpu_has_pending_irq()
> >
> >  ...
> >  context_switch(): nothing
> >
> > The other is to do the "blocking" stuff in the context switch (similar
> > to what's done now):
> >
> > vcpu_block()
> >   set(_VPF_blocked)
> >   local_events_need_delivery()
> >     hvm_vcpu_has_pending_irq()
> >   ...
> > context_switch
> >    v->arch.block()
> >     - Add v to pcpu.pi_blocked_vcpu
> >     - NV => pi_wakeup_vector
> >
> > [...]
> >
> > The thing I like about the first one is that it makes PI blocking the
> > same as normal blocking -- everything happens in the same place; so the
> > code is cleaner and easier to understand.
> >
> > The thing I like about the second one is that it cleverly avoids having
> > to do all the work of adding the vcpu to the list and then searching to
> > remove it if the vcpu in question gets woken up on the way to being
> > blocked.  So the code may end up being faster for workloads where that
> > happens frequently.
> 
>  But wouldn't such an optimization argument apply to some/all other
> blocking ways too? I.e. shouldn't, if we were to go that route, other
> early wakeups get treated equally? 
>
Good point, actually.

However, with "regular blockings" this would probably be less of an
optimization.

In fact, in the PI case, George's solution 2 potentially avoids having
to switch the descriptor and manipulate the list of blocked vcpus.
In regular blockings there are no such things. We may be able to avoid
a context switch, but that also depends, AFAICT, on when the interrupt
exactly happens and/or is notified (i.e., before the call to schedule()
as opposed to after that and before the actual __context_switch()). I'm
not saying this wouldn't be better, but it's certainly optimizing less
than in the PI case.

> Unless that's wrong thinking of mine
> or being implemented that way, I'd clearly favor option 1 for
> consistency
> reasons.
> 
As said, me too. Perhaps we can go for option 1, which is simpler,
cleaner and more consistent, considering the current status of the
code. We can always investigate, in future, whether and how to
implement the optimization for all the blockings, if beneficial and
feasible, or have them diverge, if deemed worthwhile.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-18  9:22                   ` Dario Faggioli
@ 2015-09-18 14:31                     ` George Dunlap
  2015-09-18 14:34                       ` George Dunlap
  0 siblings, 1 reply; 86+ messages in thread
From: George Dunlap @ 2015-09-18 14:31 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, Keir Fraser, Andrew Cooper, George Dunlap,
	xen-devel, Jan Beulich, feng.wu

On Fri, Sep 18, 2015 at 10:22 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On Fri, 2015-09-18 at 00:27 -0600, Jan Beulich wrote:
>> > George Dunlap <george.dunlap@citrix.com> 09/17/15 4:30 PM >>>
>
>> > vcpu_block()
>> >  set(_VPF_blocked)
>> >  v->arch.block()
>> >   - Add v to pcpu.pi_blocked_vcpu
>> >   - NV => pi_wakeup_vector
>> >  local_events_need_delivery()
>> >    hvm_vcpu_has_pending_irq()
>> >
>> >  ...
>> >  context_switch(): nothing
>> >
>> > The other is to do the "blocking" stuff in the context switch (similar
>> > to what's done now):
>> >
>> > vcpu_block()
>> >   set(_VPF_blocked)
>> >   local_events_need_delivery()
>> >     hvm_vcpu_has_pending_irq()
>> >   ...
>> > context_switch
>> >    v->arch.block()
>> >     - Add v to pcpu.pi_blocked_vcpu
>> >     - NV => pi_wakeup_vector
>> >
>> > [...]
>> >
>> > The thing I like about the first one is that it makes PI blocking the
>> > same as normal blocking -- everything happens in the same place; so the
>> > code is cleaner and easier to understand.
>> >
>> > The thing I like about the second one is that it cleverly avoids having
>> > to do all the work of adding the vcpu to the list and then searching to
>> > remove it if the vcpu in question gets woken up on the way to being
>> > blocked.  So the code may end up being faster for workloads where that
>> > happens frequently.
>>
>>  But wouldn't such an optimization argument apply to some/all other
>> blocking ways too? I.e. shouldn't, if we were to go that route, other
>> early wakeups get treated equally?
>>
> Good point, actually.
>
> However, with "regular blockings" this would probably be less of an
> optimization.
>
> In fact, in PI case, George's solution 2 is, potentially, avoiding
> having to switch the descriptor and manipulating the list of blocked
> vcpus.
> In regular blockings there are no such things. We may be able to avoid
> a context switch, but that also depends, AFAICT, on when the interrupt
> exactly happens and/or is notified (i.e., before the call to schedule()
> as opposed to after that and before the actual __context_switch()). I'm
> not saying this wouldn't be better, but it's certainly optimizing less
> than in the PI case.
>
>> Unless that's wrong thinking of mine
>> or being implemented that way, I'd clearly favor option 1 for
>> consistency
>> reasons.
>>
> As said, me too. Perhaps we can go for option 1, which is simpler,
> cleaner and more consistent, considering the current status of the
> code. We can always investigate, in future, whether and how to
> implement the optimization for all the blockings, if beneficial and
> feasible, or have them diverge, if deemed worthwhile.

Sounds like a plan.

 -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-18 14:31                     ` George Dunlap
@ 2015-09-18 14:34                       ` George Dunlap
  0 siblings, 0 replies; 86+ messages in thread
From: George Dunlap @ 2015-09-18 14:34 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, Keir Fraser, Andrew Cooper, George Dunlap,
	xen-devel, Jan Beulich, feng.wu

On Fri, Sep 18, 2015 at 3:31 PM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
>> As said, me too. Perhaps we can go for option 1, which is simpler,
>> cleaner and more consistent, considering the current status of the
>> code. We can always investigate, in future, whether and how to
>> implement the optimization for all the blockings, if beneficial and fea
>> sible, or have them diverge, if deemed worthwhile.
>
> Sounds like a plan.

Er, just in case that idiom wasn't clear: Option 1 sounds like a
*good* plan, so unless Feng disagrees, let's go with that. :-)

 -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 02/17] Add cmpxchg16b support for x86-64
  2015-09-11  8:28 ` [PATCH v7 02/17] Add cmpxchg16b support for x86-64 Feng Wu
@ 2015-09-22 13:50   ` Jan Beulich
  2015-09-22 13:55     ` Wu, Feng
  0 siblings, 1 reply; 86+ messages in thread
From: Jan Beulich @ 2015-09-22 13:50 UTC (permalink / raw)
  To: Feng Wu; +Cc: Andrew Cooper, Keir Fraser, xen-devel

>>> On 11.09.15 at 10:28, <feng.wu@intel.com> wrote:

> --- a/xen/include/asm-x86/x86_64/system.h
> +++ b/xen/include/asm-x86/x86_64/system.h
> @@ -6,6 +6,37 @@
>                                     (unsigned long)(n),sizeof(*(ptr))))
>  
>  /*
> + * Atomic 16 bytes compare and exchange.  Compare OLD with MEM, if
> + * identical, store NEW in MEM.  Return the initial value in MEM.
> + * Success is indicated by comparing RETURN with OLD.
> + *
> + * This function can only be called when cpu_has_cx16 is true.
> + */
> +
> +static always_inline __uint128_t __cmpxchg16b(
> +    volatile void *ptr, const __uint128_t *old, const __uint128_t *new)
> +{
> +    __uint128_t prev;
> +    uint64_t new_high = *new >> 64;
> +    uint64_t new_low = (uint64_t)*new;

Pointless cast.

> +    ASSERT(cpu_has_cx16);
> +
> +    asm volatile ( "lock; cmpxchg16b %3"
> +                   : "=A" (prev)
> +                   : "c" (new_high), "b" (new_low), "m" (*__xg(ptr)), "0" (*old)
> +                 );

The closing parenthesis belongs on the previous line.

> +    return prev;
> +}
> +
> +#define cmpxchg16b(ptr,o,n)                                            \
> +    ( ({ ASSERT(((unsigned long)ptr & 0xF) == 0); }),                  \
> +      (BUILD_BUG_ON(sizeof(*o) != sizeof(__uint128_t))),               \
> +      (BUILD_BUG_ON(sizeof(*n) != sizeof(__uint128_t))),               \
> +      (__cmpxchg16b((ptr), (void *)(o), (void *)(n))) )

Sigh - this is _still_ not properly parenthesized, and evaluates ptr
twice:

#define cmpxchg16b(ptr, o, n) ({                       \
    volatile void *_p = (ptr);                         \
    ASSERT(!((unsigned long)_p & 0xf));                \
    BUILD_BUG_ON(sizeof(*(o)) != sizeof(__uint128_t)); \
    BUILD_BUG_ON(sizeof(*(n)) != sizeof(__uint128_t)); \
    __cmpxchg16b(_p, (void *)(o), (void *)(n));        \
})
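
As an illustration only (not part of the patch), a caller of this macro on
a 16-byte aligned object might look roughly like the sketch below; the type
and function names are invented for the example:

/* Purely illustrative; assumes cpu_has_cx16 has been checked. */
struct example_entry {
    uint64_t lo, hi;
} __attribute__((__aligned__(16)));

static void example_update(struct example_entry *p,
                           const struct example_entry *new)
{
    __uint128_t old, prev;

    do {
        /* Snapshot the current value; a racing writer merely makes the
         * cmpxchg below fail, and we retry with a fresh snapshot. */
        old = *(volatile __uint128_t *)p;
        prev = cmpxchg16b(p, &old, new);
    } while ( prev != old );
}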

Jan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 02/17] Add cmpxchg16b support for x86-64
  2015-09-22 13:50   ` Jan Beulich
@ 2015-09-22 13:55     ` Wu, Feng
  0 siblings, 0 replies; 86+ messages in thread
From: Wu, Feng @ 2015-09-22 13:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Keir Fraser, Wu, Feng, xen-devel



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, September 22, 2015 9:51 PM
> To: Wu, Feng
> Cc: Andrew Cooper; xen-devel@lists.xen.org; Keir Fraser
> Subject: Re: [PATCH v7 02/17] Add cmpxchg16b support for x86-64
> > +#define cmpxchg16b(ptr,o,n)                                            \
> > +    ( ({ ASSERT(((unsigned long)ptr & 0xF) == 0); }),                  \
> > +      (BUILD_BUG_ON(sizeof(*o) != sizeof(__uint128_t))),               \
> > +      (BUILD_BUG_ON(sizeof(*n) != sizeof(__uint128_t))),               \
> > +      (__cmpxchg16b((ptr), (void *)(o), (void *)(n))) )
> 
> Sigh - this is _still_ not properly parenthesized, and evaluates ptr
> twice:
> 
> #define cmpxchg16b(ptr, o, n) ({                       \
>     volatile void *_p = (ptr);                         \
>     ASSERT(!((unsigned long)_p & 0xf));                \
>     BUILD_BUG_ON(sizeof(*(o)) != sizeof(__uint128_t)); \
>     BUILD_BUG_ON(sizeof(*(n)) != sizeof(__uint128_t)); \
>     __cmpxchg16b(_p, (void *)(o), (void *)(n));        \
> })

Thanks a lot for the review!

Thanks,
Feng

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 04/17] vt-d: VT-d Posted-Interrupts feature detection
  2015-09-11  8:28 ` [PATCH v7 04/17] vt-d: VT-d Posted-Interrupts feature detection Feng Wu
@ 2015-09-22 14:18   ` Jan Beulich
  0 siblings, 0 replies; 86+ messages in thread
From: Jan Beulich @ 2015-09-22 14:18 UTC (permalink / raw)
  To: Feng Wu; +Cc: Yang Zhang, Kevin Tian, xen-devel

>>> On 11.09.15 at 10:28, <feng.wu@intel.com> wrote:
> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
albeit ...

> @@ -2220,6 +2229,7 @@ int __init intel_vtd_setup(void)
>      iommu_passthrough = 0;
>      iommu_qinval = 0;
>      iommu_intremap = 0;
> +    iommu_intpost = 0;
>      return ret;
>  }

... I think this clearing of flags (which you add as well as what was
there already) doesn't really belong here (but instead should again
be taken care of by generic IOMMU code).
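
A rough sketch of what taking care of it in the generic IOMMU code could
look like (purely illustrative -- where exactly this would live is the
open question; the flag names are the ones from the hunk above):

    /* On failure of the vendor-specific setup: */
    if ( rc )
    {
        iommu_passthrough = 0;
        iommu_qinval = 0;
        iommu_intremap = 0;
        iommu_intpost = 0;
    }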

Jan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 05/17] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-09-11  8:28 ` [PATCH v7 05/17] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts Feng Wu
@ 2015-09-22 14:20   ` Jan Beulich
  2015-09-23  1:02     ` Wu, Feng
  0 siblings, 1 reply; 86+ messages in thread
From: Jan Beulich @ 2015-09-22 14:20 UTC (permalink / raw)
  To: Feng Wu; +Cc: Andrew Cooper, Kevin Tian, Keir Fraser, xen-devel

>>> On 11.09.15 at 10:28, <feng.wu@intel.com> wrote:
> Extend struct pi_desc according to VT-d Posted-Interrupts Spec.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Acked-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
> v7:
> - Coding style.

Are you sure?

> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
> +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
> @@ -80,8 +80,18 @@ struct vmx_domain {
>  
>  struct pi_desc {
>      DECLARE_BITMAP(pir, NR_VECTORS);
> -    u32 control;
> -    u32 rsvd[7];
> +    union {
> +        struct {
> +        u16     on     : 1,  /* bit 256 - Outstanding Notification */
> +                sn     : 1,  /* bit 257 - Suppress Notification */
> +                rsvd_1 : 14; /* bit 271:258 - Reserved */
> +        u8      nv;          /* bit 279:272 - Notification Vector */
> +        u8      rsvd_2;      /* bit 287:280 - Reserved */
> +        u32     ndst;        /* bit 319:288 - Notification Destination */
> +        };

Clearly the body of the structure is still mis-indented.

Jan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 08/17] vmx: Suppress posting interrupts when 'SN' is set
  2015-09-11  8:28 ` [PATCH v7 08/17] vmx: Suppress posting interrupts when 'SN' is set Feng Wu
@ 2015-09-22 14:23   ` Jan Beulich
  0 siblings, 0 replies; 86+ messages in thread
From: Jan Beulich @ 2015-09-22 14:23 UTC (permalink / raw)
  To: Feng Wu; +Cc: Andrew Cooper, Kevin Tian, Keir Fraser, xen-devel

>>> On 11.09.15 at 10:28, <feng.wu@intel.com> wrote:
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -1701,8 +1701,35 @@ static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
>           */
>          pi_set_on(&v->arch.hvm_vmx.pi_desc);
>      }
> -    else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
> +    else
>      {
> +        struct pi_desc old, new, prev;
> +
> +        prev.control = v->arch.hvm_vmx.pi_desc.control;
> +
> +        do {
> +            /*
> +             * Currently, we don't support urgent interrupt, all
> +             * interrupts are recognized as non-urgent interrupt,
> +             * so we cannot send posted-interrupt when 'SN' is set.
> +             * Besides that, if 'ON' is already set, we cannot set
> +             * posted-interrupts as well.
> +             */
> +            if ( pi_test_sn(&prev) || pi_test_on(&prev) )
> +            {
> +                vcpu_kick(v);
> +                return;
> +            }
> +
> +            old.control = v->arch.hvm_vmx.pi_desc.control &
> +                          ~(1 << POSTED_INTR_ON | 1 << POSTED_INTR_SN);

With this fully parenthesized
Reviewed-by: Jan Beulich <jbeulich@suse.com>
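
For reference, the fully parenthesized form being asked for is presumably:

    old.control = v->arch.hvm_vmx.pi_desc.control &
                  ~((1 << POSTED_INTR_ON) | (1 << POSTED_INTR_SN));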

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 10/17] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
  2015-09-11  8:28 ` [PATCH v7 10/17] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts Feng Wu
@ 2015-09-22 14:28   ` Jan Beulich
  0 siblings, 0 replies; 86+ messages in thread
From: Jan Beulich @ 2015-09-22 14:28 UTC (permalink / raw)
  To: Feng Wu; +Cc: Yang Zhang, Kevin Tian, xen-devel

>>> On 11.09.15 at 10:28, <feng.wu@intel.com> wrote:
> Extend struct iremap_entry according to VT-d Posted-Interrupts Spec.
> 
> CC: Yang Zhang <yang.z.zhang@intel.com>
> CC: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> Acked-by: Kevin Tian <kevin.tian@intel.com>
> ---
> v7:
> - Add a __uint128_t member to the union in struct iremap_entry

How about making use of this e.g. ...

> @@ -219,7 +219,7 @@ static unsigned int alloc_remap_entry(struct iommu *iommu, unsigned int nr)
>          else
>              p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];
>  
> -        if ( p->lo_val || p->hi_val ) /* not a free entry */
> +        if ( p->lo || p->hi ) /* not a free entry */

... here and ...

> @@ -253,7 +253,7 @@ static int remap_entry_to_ioapic_rte(
>      GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
>                       iremap_entries, iremap_entry);
>  
> -    if ( iremap_entry->hi_val == 0 && iremap_entry->lo_val == 0 )
> +    if ( iremap_entry->hi == 0 && iremap_entry->lo == 0 )

... here (and perhaps elsewhere)?
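
For illustration, assuming the new 128-bit union member were named e.g.
"val" (the actual name in the patch may differ), these checks could
collapse to:

    if ( p->val )     /* not a free entry */

and

    if ( iremap_entry->val == 0 )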

Jan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 09/17] VT-d: Remove pointless casts
  2015-09-11  8:28 ` [PATCH v7 09/17] VT-d: Remove pointless casts Feng Wu
@ 2015-09-22 14:30   ` Jan Beulich
  0 siblings, 0 replies; 86+ messages in thread
From: Jan Beulich @ 2015-09-22 14:30 UTC (permalink / raw)
  To: Feng Wu; +Cc: Yang Zhang, Kevin Tian, xen-devel

>>> On 11.09.15 at 10:28, <feng.wu@intel.com> wrote:
> Remove pointless casts.
> 

Suggested-by: Jan Beulich <jbeulich@suse.com>

> Signed-off-by: Feng Wu <feng.wu@intel.com>
> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 11/17] vt-d: Add API to update IRTE when VT-d PI is used
  2015-09-11  8:29 ` [PATCH v7 11/17] vt-d: Add API to update IRTE when VT-d PI is used Feng Wu
@ 2015-09-22 14:42   ` Jan Beulich
  0 siblings, 0 replies; 86+ messages in thread
From: Jan Beulich @ 2015-09-22 14:42 UTC (permalink / raw)
  To: Feng Wu; +Cc: Yang Zhang, Andrew Cooper, Kevin Tian, Keir Fraser, xen-devel

>>> On 11.09.15 at 10:29, <feng.wu@intel.com> wrote:
> --- a/xen/drivers/passthrough/vtd/intremap.c
> +++ b/xen/drivers/passthrough/vtd/intremap.c
> @@ -899,3 +899,124 @@ void iommu_disable_x2apic_IR(void)
>      for_each_drhd_unit ( drhd )
>          disable_qinval(drhd->iommu);
>  }
> +
> +static void setup_posted_irte(
> +    struct iremap_entry *new_ire, const struct iremap_entry *old_ire,
> +    const struct pi_desc *pi_desc, const uint8_t gvec)
> +{
> +    memset(new_ire, sizeof(*new_ire), 0);
> +
> +    if (!old_ire->remap.im)

Coding style.

> +    {
> +        new_ire->post.p = old_ire->remap.p;
> +        new_ire->post.fpd = old_ire->remap.fpd;
> +        new_ire->post.sid = old_ire->remap.sid;
> +        new_ire->post.sq = old_ire->remap.sq;
> +        new_ire->post.svt = old_ire->remap.svt;
> +        new_ire->post.urg = 0;

No longer needed.

> +    }
> +    else
> +    {
> +        new_ire->post.p = old_ire->post.p;
> +        new_ire->post.fpd = old_ire->post.fpd;
> +        new_ire->post.sid = old_ire->post.sid;
> +        new_ire->post.sq = old_ire->post.sq;
> +        new_ire->post.svt = old_ire->post.svt;
> +        new_ire->post.urg = old_ire->post.urg;
> +    }
> +
> +    new_ire->post.im = 1;
> +    new_ire->post.vector = gvec;
> +    new_ire->post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
> +    new_ire->post.pda_h = virt_to_maddr(pi_desc) >> 32;
> +}

Quite a bit more explicit about what gets inherited and what gets
written anew. Thanks!

With the minor adjustment above done
Reviewed-by: Jan Beulich <jbeulich@suse.com>
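
As a side note on the quoted hunk: memset() takes (dst, value, size), so
the zero-initialisation at the top of setup_posted_irte() is presumably
meant to be:

    memset(new_ire, 0, sizeof(*new_ire));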

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 12/17] x86: move some APIC related macros to apicdef.h
  2015-09-11  8:29 ` [PATCH v7 12/17] x86: move some APIC related macros to apicdef.h Feng Wu
@ 2015-09-22 14:44   ` Jan Beulich
  0 siblings, 0 replies; 86+ messages in thread
From: Jan Beulich @ 2015-09-22 14:44 UTC (permalink / raw)
  To: Feng Wu; +Cc: Andrew Cooper, Keir Fraser, xen-devel

>>> On 11.09.15 at 10:29, <feng.wu@intel.com> wrote:
> Move some APIC related macros to apicdef.h, so they can be used
> outside of vlapic.c.
> 
> CC: Keir Fraser <keir@xen.org>
> CC: Jan Beulich <jbeulich@suse.com>
> CC: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v7:
> - Put the Macros to the right place inside the file.

Almost:

> --- a/xen/include/asm-x86/apicdef.h
> +++ b/xen/include/asm-x86/apicdef.h
> @@ -57,6 +57,8 @@
>  #define			APIC_DEST_SELF		0x40000
>  #define			APIC_DEST_ALLINC	0x80000
>  #define			APIC_DEST_ALLBUT	0xC0000
> +#define			APIC_SHORT_MASK		0xC0000
> +#define			APIC_DEST_NOSHORT	0x0

This last one would more naturally go ahead of APIC_DEST_SELF. And
both it and ...

> @@ -64,6 +66,7 @@
>  #define			APIC_INT_LEVELTRIG	0x08000
>  #define			APIC_INT_ASSERT		0x04000
>  #define			APIC_ICR_BUSY		0x01000
> +#define			APIC_DEST_MASK		0x800

... this one should be zero-padded to match the respective neighbors.

With that done
Acked-by: Jan Beulich <jbeulich@suse.com>
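
Concretely, with APIC_DEST_NOSHORT also moved ahead of APIC_DEST_SELF, the
zero-padded forms being asked for would look like:

#define			APIC_DEST_NOSHORT	0x00000
#define			APIC_DEST_MASK		0x00800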

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 13/17] Update IRTE according to guest interrupt config changes
  2015-09-11  8:29 ` [PATCH v7 13/17] Update IRTE according to guest interrupt config changes Feng Wu
@ 2015-09-22 14:51   ` Jan Beulich
  0 siblings, 0 replies; 86+ messages in thread
From: Jan Beulich @ 2015-09-22 14:51 UTC (permalink / raw)
  To: Feng Wu; +Cc: xen-devel

>>> On 11.09.15 at 10:29, <feng.wu@intel.com> wrote:
> @@ -198,6 +199,103 @@ void free_hvm_irq_dpci(struct hvm_irq_dpci *dpci)
>      xfree(dpci);
>  }
>  
> +/*
> + * This routine handles lowest-priority interrupts using vector-hashing
> + * mechanism. As an example, modern Intel CPUs use this method to handle
> + * lowest-priority interrupts.
> + *
> + * Here is the details about the vector-hashing mechanism:
> + * 1. For lowest-priority interrupts, store all the possible destination
> + *    vCPUs in an array.
> + * 2. Use "gvec % max number of destination vCPUs" to find the right
> + *    destination vCPU in the array for the lowest-priority interrupt.
> + */
> +static struct vcpu *vector_hashing_dest(const struct domain *d,
> +                                        uint32_t dest_id,
> +                                        bool_t dest_mode,
> +                                        uint8_t gvec)
> +
> +{
> +    unsigned long *dest_vcpu_bitmap;
> +    unsigned int dest_vcpus = 0;
> +    unsigned int bitmap_array_size = BITS_TO_LONGS(d->max_vcpus);

You use the variable just once. Ditch it, or at least make it const.

> +    struct vcpu *v, *dest = NULL;
> +    unsigned int i;
> +
> +    dest_vcpu_bitmap = xzalloc_array(unsigned long, bitmap_array_size);
> +    if ( !dest_vcpu_bitmap )
> +        return NULL;
> +
> +    for_each_vcpu ( d, v )
> +    {
> +        if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, APIC_DEST_NOSHORT,
> +                                dest_id, dest_mode) )
> +            continue;
> +
> +        __set_bit(v->vcpu_id, dest_vcpu_bitmap);
> +        dest_vcpus++;
> +    }
> +
> +    if ( dest_vcpus != 0 )
> +    {
> +        unsigned int mod = gvec % dest_vcpus;
> +        unsigned int idx = 0;
> +
> +        for ( i = 0; i <= mod; i++ )
> +        {
> +            idx = find_next_bit(dest_vcpu_bitmap, d->max_vcpus, idx) + 1;
> +            BUG_ON(idx >= d->max_vcpus);
> +        }
> +
> +        dest = d->vcpu[idx-1];

Blanks around the - please.

> +static struct vcpu *pi_find_dest_vcpu(const struct domain *d, uint32_t dest_id,
> +                                      bool_t dest_mode, uint8_t delivery_mode,
> +                                      uint8_t gvec)
> +{
> +    unsigned int dest_vcpus = 0;
> +    struct vcpu *v, *dest = NULL;
> +
> +    if ( delivery_mode == dest_LowestPrio )
> +        return vector_hashing_dest(d, dest_id, dest_mode, gvec);
> +    else if ( delivery_mode == dest_Fixed )

Pointless else. And perhaps this should be switch(delivery_mode)
anyway.

With those cosmetic issues addressed
Acked-by: Jan Beulich <jbeulich@suse.com>
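
For illustration, the switch() form being suggested might be shaped like
this (a sketch of the control flow only; the dest_Fixed handling is left
out here since it is not visible in the quoted hunk):

    switch ( delivery_mode )
    {
    case dest_LowestPrio:
        return vector_hashing_dest(d, dest_id, dest_mode, gvec);

    case dest_Fixed:
        /* ... existing dest_Fixed handling ... */
        break;

    default:
        break;
    }

    return NULL;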

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 16/17] VT-d: Dump the posted format IRTE
  2015-09-11  8:29 ` [PATCH v7 16/17] VT-d: Dump the posted format IRTE Feng Wu
@ 2015-09-22 14:58   ` Jan Beulich
  0 siblings, 0 replies; 86+ messages in thread
From: Jan Beulich @ 2015-09-22 14:58 UTC (permalink / raw)
  To: Feng Wu; +Cc: Yang Zhang, Kevin Tian, xen-devel

>>> On 11.09.15 at 10:29, <feng.wu@intel.com> wrote:
> Add the utility to dump the posted format IRTE.
> 
> CC: Yang Zhang <yang.z.zhang@intel.com>
> CC: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v7:
> - Remove the two stage loop

This looks quite a bit better now, thanks.

> --- a/xen/drivers/passthrough/vtd/utils.c
> +++ b/xen/drivers/passthrough/vtd/utils.c
> @@ -203,6 +203,9 @@ static void dump_iommu_info(unsigned char key)
>              ecap_intr_remap(iommu->ecap) ? "" : "not ",
>              (status & DMA_GSTS_IRES) ? " and enabled" : "" );
>  
> +        printk("  Interrupt Posting: %ssupported.\n",
> +            cap_intr_post(iommu->cap) ? "" : "not ");

Wrong indentation.

> @@ -213,8 +216,11 @@ static void dump_iommu_info(unsigned char key)
>  
>              printk("  Interrupt remapping table (nr_entry=%#x. "
>                  "Only dump P=1 entries here):\n", nr_entry);
> -            printk("       SVT  SQ   SID      DST  V  AVL DLM TM RH DM "
> -                   "FPD P\n");
> +            printk("R means remapped format, P means posted format.\n");
> +            printk("R:       SVT  SQ   SID  V  AVL FPD      DST DLM TM RH DM "
> +                   "P\n");
> +            printk("P:       SVT  SQ   SID  V  AVL FPD              PDA  URG "
> +                   "P\n");

Correct indentation, but this should really be a single line.

> @@ -232,11 +238,21 @@ static void dump_iommu_info(unsigned char key)
>  
>                  if ( !p->remap.p )
>                      continue;
> -                printk("  %04x:  %x   %x  %04x %08x %02x    %x   %x  %x  %x  %x"
> -                    "   %x %x\n", i,
> -                    p->remap.svt, p->remap.sq, p->remap.sid, p->remap.dst,
> -                    p->remap.vector, p->remap.avail, p->remap.dlm, p->remap.tm,
> -                    p->remap.rh, p->remap.dm, p->remap.fpd, p->remap.p);
> +                if ( !p->remap.im )
> +                    printk("R:  %04x:  %x   %x  %04x %02x    %x   %x %08x   "
> +                        "%x  %x  %x  %x %x\n", i,
> +                        p->remap.svt, p->remap.sq, p->remap.sid,
> +                        p->remap.vector, p->remap.avail, p->remap.fpd,
> +                        p->remap.dst, p->remap.dlm, p->remap.tm, p->remap.rh,
> +                        p->remap.dm, p->remap.p);
> +                else
> +                    printk("P:  %04x:  %x   %x  %04x %02x    %x   %x %16lx    "
> +                        "%x %x\n", i,
> +                        p->post.svt, p->post.sq, p->post.sid, p->post.vector,
> +                        p->post.avail, p->post.fpd,
> +                        ((u64)p->post.pda_h << 32) | (p->post.pda_l << 6),
> +                        p->post.urg, p->post.p);

Wrong indentation again, and the format strings also again should all
be on the same line.

Jan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 05/17] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-09-22 14:20   ` Jan Beulich
@ 2015-09-23  1:02     ` Wu, Feng
  2015-09-23  7:36       ` Jan Beulich
  0 siblings, 1 reply; 86+ messages in thread
From: Wu, Feng @ 2015-09-23  1:02 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Tian, Kevin, Keir Fraser, Wu, Feng, xen-devel



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, September 22, 2015 10:20 PM
> To: Wu, Feng
> Cc: Andrew Cooper; Tian, Kevin; xen-devel@lists.xen.org; Keir Fraser
> Subject: Re: [PATCH v7 05/17] vmx: Extend struct pi_desc to support VT-d
> Posted-Interrupts
> 
> >>> On 11.09.15 at 10:28, <feng.wu@intel.com> wrote:
> > Extend struct pi_desc according to VT-d Posted-Interrupts Spec.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> > Acked-by: Kevin Tian <kevin.tian@intel.com>
> > Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > ---
> > v7:
> > - Coding style.
> 
> Are you sure?
> 
> > --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
> > +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
> > @@ -80,8 +80,18 @@ struct vmx_domain {
> >
> >  struct pi_desc {
> >      DECLARE_BITMAP(pir, NR_VECTORS);
> > -    u32 control;
> > -    u32 rsvd[7];
> > +    union {
> > +        struct {
> > +        u16     on     : 1,  /* bit 256 - Outstanding Notification */
> > +                sn     : 1,  /* bit 257 - Suppress Notification */
> > +                rsvd_1 : 14; /* bit 271:258 - Reserved */
> > +        u8      nv;          /* bit 279:272 - Notification Vector */
> > +        u8      rsvd_2;      /* bit 287:280 - Reserved */
> > +    u32     ndst;        /* bit 319:288 - Notification Destination */
> > +        };
> 
> Clearly the body of the structure is still mis-indented.

Seeing from the code, this structure is well indented. where do you
think it has problem?

Thanks,
Feng

> 
> Jan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 05/17] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-09-23  1:02     ` Wu, Feng
@ 2015-09-23  7:36       ` Jan Beulich
  0 siblings, 0 replies; 86+ messages in thread
From: Jan Beulich @ 2015-09-23  7:36 UTC (permalink / raw)
  To: Feng Wu; +Cc: Andrew Cooper, Kevin Tian, Keir Fraser, xen-devel

>>> On 23.09.15 at 03:02, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Tuesday, September 22, 2015 10:20 PM
>> >>> On 11.09.15 at 10:28, <feng.wu@intel.com> wrote:
>> > --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
>> > +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
>> > @@ -80,8 +80,18 @@ struct vmx_domain {
>> >
>> >  struct pi_desc {
>> >      DECLARE_BITMAP(pir, NR_VECTORS);
>> > -    u32 control;
>> > -    u32 rsvd[7];
>> > +    union {
>> > +        struct {
>> > +        u16     on     : 1,  /* bit 256 - Outstanding Notification */
>> > +                sn     : 1,  /* bit 257 - Suppress Notification */
>> > +                rsvd_1 : 14; /* bit 271:258 - Reserved */
>> > +        u8      nv;          /* bit 279:272 - Notification Vector */
>> > +        u8      rsvd_2;      /* bit 287:280 - Reserved */
>> > +    u32     ndst;        /* bit 319:288 - Notification Destination */
>> > +        };
>> 
>> Clearly the body of the structure is still mis-indented.
> 
> Seeing from the code, this structure is well indented. where do you
> think it has problem?

I'm not sure what "seeing from the code" is supposed to mean, but
I already said where the problem is: The body of a structure (or
union or enum) needs to be indented one level (four space) more
than the line with the struct (or union or enum) keyword, as btw is
being done correctly above for the union containing the struct.
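
Concretely, with the body indented one level deeper, the visible part of
the hunk would read:

    union {
        struct {
            u16     on     : 1,  /* bit 256 - Outstanding Notification */
                    sn     : 1,  /* bit 257 - Suppress Notification */
                    rsvd_1 : 14; /* bit 271:258 - Reserved */
            u8      nv;          /* bit 279:272 - Notification Vector */
            u8      rsvd_2;      /* bit 287:280 - Reserved */
            u32     ndst;        /* bit 319:288 - Notification Destination */
        };
        /* ... remainder of the union as in the patch ... */
    };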

Jan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-24  7:51                       ` Jan Beulich
@ 2015-09-24  8:03                         ` Wu, Feng
  0 siblings, 0 replies; 86+ messages in thread
From: Wu, Feng @ 2015-09-24  8:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	Dario Faggioli, George Dunlap, xen-devel, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, September 24, 2015 3:52 PM
> To: Wu, Feng
> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; George Dunlap; Tian, Kevin;
> xen-devel@lists.xen.org; Keir Fraser
> Subject: RE: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> >>> On 24.09.15 at 03:50, <feng.wu@intel.com> wrote:
> > One issue is that the number of vmexits is far, far bigger than the number
> > of context switches. I tested it for quite a short time and it shows there
> > were 2910043 vmexits and 71733 context switches (I only counted the number
> > in __context_switch(), since we only change the PI state in this function).
> > If we change the PI state in each vmexit/vmentry, I am afraid this will
> > hurt performance.
> 
> Note that George has already asked whether the updating of the
> PI descriptor is expensive, without you answering. 

Updating the PI descriptor needs to be atomic, so I think it could be a
little expensive.
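
(For context, the kind of atomic update referred to here is a cmpxchg loop
on the descriptor's control word, roughly of the following shape -- a
sketch only, reusing the field and constant names from the earlier patches
in this series:)

    struct pi_desc old, new, *pi_desc = &v->arch.hvm_vmx.pi_desc;

    do {
        old.control = pi_desc->control;
        new.control = old.control | (1 << POSTED_INTR_SN);  /* e.g. set SN */
    } while ( cmpxchg(&pi_desc->control, old.control, new.control)
              != old.control );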

> If this is basically
> just a memory or VMCS field write, I don't think it really matters in
> which code path it sits, regardless of the frequency of either path
> being used. Also note that whatever measuring you do in an area
> like this, it'll only be an example, 

I DON'T think it is just an example; the number of vmexits is definitely
far larger than the number of context switches.

> unlikely to be representative of anything. 

I don't think so!

> Plus with the tendency to eliminate VMEXITs with newer
> hardware, the penalty of this sitting in the VMEXIT path ought to go
> down.

Broadwell is really very new hardware, and the number of VMEXITs and the
number of context switches are not of the same order of magnitude.

Thanks,
Feng

> 
> Jan
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-24  1:50                     ` Wu, Feng
  2015-09-24  3:35                       ` Dario Faggioli
@ 2015-09-24  7:51                       ` Jan Beulich
  2015-09-24  8:03                         ` Wu, Feng
  1 sibling, 1 reply; 86+ messages in thread
From: Jan Beulich @ 2015-09-24  7:51 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
	Dario Faggioli, George Dunlap, xen-devel

>>> On 24.09.15 at 03:50, <feng.wu@intel.com> wrote:
> One issue is that the number of vmexits is far, far bigger than the number
> of context switches. I tested it for quite a short time and it shows there
> were 2910043 vmexits and 71733 context switches (I only counted the number
> in __context_switch(), since we only change the PI state in this function).
> If we change the PI state in each vmexit/vmentry, I am afraid this will
> hurt performance.

Note that George has already asked whether the updating of the
PI descriptor is expensive, without you answering. If this is basically
just a memory or VMCS field write, I don't think it really matters in
which code path it sits, regardless of the frequency of either path
being used. Also note that whatever measuring you do in an area
like this, it'll only be an example, unlikely to be representative of
anything. Plus with the tendency to eliminate VMEXITs with newer
hardware, the penalty of this sitting in the VMEXIT path ought to go
down.

Jan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-24  1:50                     ` Wu, Feng
@ 2015-09-24  3:35                       ` Dario Faggioli
  2015-09-24  7:51                       ` Jan Beulich
  1 sibling, 0 replies; 86+ messages in thread
From: Dario Faggioli @ 2015-09-24  3:35 UTC (permalink / raw)
  To: Wu, Feng, George Dunlap, Jan Beulich
  Cc: George Dunlap, Andrew Cooper, Tian, Kevin, Keir Fraser, xen-devel


On Thu, 2015-09-24 at 01:50 +0000, Wu, Feng wrote:

> > -----Original Message-----
> > From: George Dunlap [mailto:george.dunlap@citrix.com]

> > This seems to me to be a cleaner design.  It removes the need to
> > modify any code on the context-switch path.  It moves modification of
> > NDST and most modifications of SN closer to where I think they
> > logically go.  It reduces the number of unnecessary PI interrupts
> > delivered to the hypervisor (by suppressing notifications most of the
> > time spent in the hypervisor), and automatically deals with the
> > "spurious interrupts during tasklet execution / vcpu offline lazy
> > context switch" issue we were talking about in the other thread.
> 
> One issue is that the number of vmexits is far, far bigger than the
> number of context switches. I tested it for quite a short time and it
> shows there were 2910043 vmexits and 71733 context switches (I only
> counted the number in __context_switch(), since we only change the PI
> state in this function). If we change the PI state in each
> vmexit/vmentry, I am afraid this will hurt performance.
> 
Interesting. Hard to tell without actually trying, though.

Personally, I'd agree with George and Jan that the vmexit solution is
more self contained, and hence preferable.

I don't really dislike the __context_switch() solution, though, and I'd
be fine to live with it, especially considering these numbers.

I guess the absolute best would be for you to prototype both, and try
gathering some performance numbers for comparison... Is this asking too
much? :-)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-23 15:25                   ` George Dunlap
  2015-09-23 15:38                     ` Jan Beulich
@ 2015-09-24  1:50                     ` Wu, Feng
  2015-09-24  3:35                       ` Dario Faggioli
  2015-09-24  7:51                       ` Jan Beulich
  1 sibling, 2 replies; 86+ messages in thread
From: Wu, Feng @ 2015-09-24  1:50 UTC (permalink / raw)
  To: George Dunlap, Jan Beulich
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	Dario Faggioli, xen-devel, Wu, Feng



> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: Wednesday, September 23, 2015 11:26 PM
> To: Wu, Feng; Jan Beulich
> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
> xen-devel@lists.xen.org; Keir Fraser
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On 09/23/2015 01:35 PM, Wu, Feng wrote:
> >
> >
> >> -----Original Message-----
> >> From: George Dunlap [mailto:george.dunlap@citrix.com]
> >> Sent: Wednesday, September 23, 2015 5:44 PM
> >> To: Jan Beulich; Wu, Feng
> >> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
> >> xen-devel@lists.xen.org; Keir Fraser
> >> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core
> logic
> >> handling
> >>
> >> On 09/22/2015 03:01 PM, Jan Beulich wrote:
> >>>>>> On 22.09.15 at 15:40, <feng.wu@intel.com> wrote:
> >>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Jan Beulich [mailto:JBeulich@suse.com]
> >>>>> Sent: Tuesday, September 22, 2015 5:00 PM
> >>>>> To: Wu, Feng
> >>>>> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; George Dunlap;
> Tian,
> >>>> Kevin;
> >>>>> xen-devel@lists.xen.org; Keir Fraser
> >>>>> Subject: RE: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt
> core
> >> logic
> >>>>> handling
> >>>>>
> >>>>>>>> On 22.09.15 at 09:19, <feng.wu@intel.com> wrote:
> >>>>>> However, I do find some issues with my proposal above, see below:
> >>>>>>
> >>>>>> 1. Set _VPF_blocked
> >>>>>> 2. ret = arch_block()
> >>>>>> 3. if ( ret || local_events_need_delivery() )
> >>>>>> 	clear_bit(_VPF_blocked, &v->pause_flags);
> >>>>>>
> >>>>>> After step #2, if ret == false, that means, we successfully changed the
> PI
> >>>>>> descriptor in arch_block(), if local_events_need_delivery() returns true,
> >>>>>> _VPF_blocked is cleared. After that, external interrupt come in, hence
> >>>>>> pi_wakeup_interrupt() --> vcpu_unblock(), but _VPF_blocked is cleared,
> >>>>>> so vcpu_unblock() does nothing, so the vCPU's PI state is incorrect.
> >>>>>>
> >>>>>> One possible solution for it is:
> >>>>>> - Disable the interrupts before the check in step #3 above
> >>>>>> - if local_events_need_delivery() returns true, undo all the operations
> >>>>>>  done in arch_block()
> >>>>>> - Enable interrupts after _VPF_blocked gets cleared.
> >>>>>
> >>>>> Couldn't this be taken care of by, if necessary, adjusting PI state
> >>>>> in vmx_do_resume()?
> >>>>
> >>>> What do you mean here? Could you please elaborate? Thanks!
> >>>
> >>> From the discussion so far I understand that all you're after is that
> >>> the PI descriptor is correct for the period of time while the guest
> >>> vCPU is actually executing in guest context. If that's a correct
> >>> understanding of mine, then setting the vector and flags back to
> >>> what's needed while the guest is running would be sufficient to be
> >>> done right before entering the guest, i.e. in vmx_do_resume().
> >>
> >> Along those lines, is setting the SN and NDST relatively expensive?
> >>
> >> If they are, then of course switching them in __context_switch() makes
> >> sense.
> >>
> >> But if setting them is fairly cheap, then we could just clear SN on
> >> every vmexit, and set SN and NDST on every vmentry, couldn't we?
> >
> > Do you mean _set_ SN (Suppress Notification) on vmexit and _clear_
> > SN on vmentry? I think this might be an alternative.
> 
> Er, yes, that's what I meant. :-)  Sorry, getting the set / clear
> "suppress notification" mixed up with the normal sti / cli (set / clear
> interrupts).
> 
> >
> >> Then we wouldn't need hooks on the context switch path at all (which are
> only
> >> there to catch running -> runnable and runnable -> running) -- we could
> >> just have the block/unblock hooks (to change NV and add / remove things
> >> from the list).
> >
> > We still need to _clear_ SN when vCPU is being blocked.
> 
> Yes, thanks for the reminder.
> 
> >> Setting NDST on vmentry also has the advantage of being more robust --
> >> you don't have to carefully think about cpu migration and all that; you
> >> simply set it right when you're about to actually use it.
> >
> > In the current solution, we set the NDST in the end of vmx_ctxt_switch_to(),
> > it also doesn't suffer from cpu migration, right?
> 
> It works, yes.
> 
> >> Looking at your comment in pi_notification_interrupt() -- I take it that
> >> "pending interrupt in PIRR will be synced to vIRR" somewhere in the call
> >> to vmx_intr_assist()?
> >
> > Yes.
> > vmx_intr_assist() --> hvm_vcpu_has_pending_irq()
> > --> vlapic_has_pending_irq() --> vlapic_find_highest_irr()
> > --> hvm_funcs.sync_pir_to_irr()
> >
> >> So if we were to set NDST and enable SN before
> >> that call, we should be able to set VCPU_KICK if we get an interrupt
> >> between vmx_intr_assist() and the actual vmentry as necessary.
> >
> > But if the interrupt comes in between last vmexit and enabling SN here,
> > it cannot be injected to guest during this vmentry. This will affect the
> > interrupt latency, each PI happening during vCPU is in root-mode needs
> > to be injected to the guest during the next vmentry.
> 
> I don't understand this.  I'm proposing this:
> 
> (actual vmexit)
> ...[1]
> set SN
> ...[2]
> (Hypervisor does stuff)
> vmx_do_vmentry
> set NDST
> clear SN
> ...[3]
> vmx_intr_assist()
> ...[4]
> cli
> (check for softirqs)
> ...[5]
> (actual vmentry)
> 
> Here is what happens if an interrupt is generated at the various [N]:
> 
> [1]: posted_interrupt_vector delivered to hypervisor.  VCPU_KICK set,
> but unnecessarily, since the pending interrupt will already be picked up
> by vmx_intr_assist().
> 
> [2]: No interrupt delivered to hypervisor.  Pending interrupt will be
> picked up by vmx_intr_assist().
> 
> [3]: posted_interrupt_vector delivered to hypervisor.  VCPU_KICK set,
> but unnecessarily, since pending interrupt will already be picked up by
> vmx_intr_assist().
> 
> [4]: posted_interrupt_vector delivered to hypervisor.  VCPU_KICK set
> necessarily this time.  check for softirqs causes it to retry vmentry,
> so vmx_intr_assist() can pick up the pending interrupt.
> 
> [5]: interrupt not delivered until interrupts are clear in the guest
> context, at which point it will be delivered directly to the guest.
> 
> This seems to me to have the same properties the solution you propose,
> except that in your solution, [2] behaves identically to [1] and [3].
> Have I missed something?

No, I think you covered everything! :) Nice analysis!!

> 
> > I think the current solution we have discussed these days is very clear,
> > and I am about to implement it. Does the above method have obvious
> > advantage compare to what we discussed so far?
> 
> This seems to me to be a cleaner design.  It removes the need to modify
> any code on the context-switch path.  It moves modification of NDST and
> most modifications of SN closer to where I think they logically go.  It
> reduces the number of unnecessary PI interrupts delivered to the
> hypervisor (by suppressing notifications most of the time spent in the
> hypervisor), and automatically deals with the "spurious interrupts
> during tasklet execution / vcpu offline lazy context switch" issue we
> were talking about in the other thread.

One issue is that the number of vmexits is far, far bigger than the number
of context switches. I tested it for quite a short time and it shows there
were 2910043 vmexits and 71733 context switches (I only counted the number
in __context_switch(), since we only change the PI state in this function).
If we change the PI state in each vmexit/vmentry, I am afraid this will
hurt performance.

Thanks,
Feng

> 
> On the other hand, it does add extra hooks on the vmentry/vmexit paths;
> but they seem very similar to me to the kind of hooks which are already
> there.
> 
> So, at the moment my *advice* is to look into setting SN / NDST on
> vmexit/vmentry, and only having hooks at block/unblock.
> 
> However, the __context_switch() path is also OK with me, if Jan and
> Dario are happy.
> 
> Jan / Dario, do you guys have any opinions / preferences?
> 
>  -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-23 15:25                   ` George Dunlap
@ 2015-09-23 15:38                     ` Jan Beulich
  2015-09-24  1:50                     ` Wu, Feng
  1 sibling, 0 replies; 86+ messages in thread
From: Jan Beulich @ 2015-09-23 15:38 UTC (permalink / raw)
  To: George Dunlap, Feng Wu
  Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
	Dario Faggioli, xen-devel

>>> On 23.09.15 at 17:25, <george.dunlap@citrix.com> wrote:
> So, at the moment my *advice* is to look into setting SN / NDST on
> vmexit/vmentry, and only having hooks at block/unblock.
> 
> However, the __context_switch() path is also OK with me, if Jan and
> Dario are happy.
> 
> Jan / Dario, do you guys have any opinions / preferences?

If this model works (and it looks to me as if it should), then I
second this being the better structured and hence preferable
approach.

Jan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-23 12:35                 ` Wu, Feng
@ 2015-09-23 15:25                   ` George Dunlap
  2015-09-23 15:38                     ` Jan Beulich
  2015-09-24  1:50                     ` Wu, Feng
  0 siblings, 2 replies; 86+ messages in thread
From: George Dunlap @ 2015-09-23 15:25 UTC (permalink / raw)
  To: Wu, Feng, Jan Beulich
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	Dario Faggioli, xen-devel

On 09/23/2015 01:35 PM, Wu, Feng wrote:
> 
> 
>> -----Original Message-----
>> From: George Dunlap [mailto:george.dunlap@citrix.com]
>> Sent: Wednesday, September 23, 2015 5:44 PM
>> To: Jan Beulich; Wu, Feng
>> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
>> xen-devel@lists.xen.org; Keir Fraser
>> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
>> handling
>>
>> On 09/22/2015 03:01 PM, Jan Beulich wrote:
>>>>>> On 22.09.15 at 15:40, <feng.wu@intel.com> wrote:
>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Jan Beulich [mailto:JBeulich@suse.com]
>>>>> Sent: Tuesday, September 22, 2015 5:00 PM
>>>>> To: Wu, Feng
>>>>> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; George Dunlap; Tian,
>>>> Kevin;
>>>>> xen-devel@lists.xen.org; Keir Fraser
>>>>> Subject: RE: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core
>> logic
>>>>> handling
>>>>>
>>>>>>>> On 22.09.15 at 09:19, <feng.wu@intel.com> wrote:
>>>>>> However, I do find some issues with my proposal above, see below:
>>>>>>
>>>>>> 1. Set _VPF_blocked
>>>>>> 2. ret = arch_block()
>>>>>> 3. if ( ret || local_events_need_delivery() )
>>>>>> 	clear_bit(_VPF_blocked, &v->pause_flags);
>>>>>>
>>>>>> After step #2, if ret == false, that means, we successfully changed the PI
>>>>>> descriptor in arch_block(), if local_events_need_delivery() returns true,
>>>>>> _VPF_blocked is cleared. After that, external interrupt come in, hence
>>>>>> pi_wakeup_interrupt() --> vcpu_unblock(), but _VPF_blocked is cleared,
>>>>>> so vcpu_unblock() does nothing, so the vCPU's PI state is incorrect.
>>>>>>
>>>>>> One possible solution for it is:
>>>>>> - Disable the interrupts before the check in step #3 above
>>>>>> - if local_events_need_delivery() returns true, undo all the operations
>>>>>>  done in arch_block()
>>>>>> - Enable interrupts after _VPF_blocked gets cleared.
>>>>>
>>>>> Couldn't this be taken care of by, if necessary, adjusting PI state
>>>>> in vmx_do_resume()?
>>>>
>>>> What do you mean here? Could you please elaborate? Thanks!
>>>
>>> From the discussion so far I understand that all you're after is that
>>> the PI descriptor is correct for the period of time while the guest
>>> vCPU is actually executing in guest context. If that's a correct
>>> understanding of mine, then setting the vector and flags back to
>>> what's needed while the guest is running would be sufficient to be
>>> done right before entering the guest, i.e. in vmx_do_resume().
>>
>> Along those lines, is setting the SN and NDST relatively expensive?
>>
>> If they are, then of course switching them in __context_switch() makes
>> sense.
>>
>> But if setting them is fairly cheap, then we could just clear SN on
>> every vmexit, and set SN and NDST on every vmentry, couldn't we?  
> 
> Do you mean _set_ SN (Suppress Notification) on vmexit and _clear_
> SN on vmentry? I think this might be an alternative.

Er, yes, that's what I meant. :-)  Sorry, getting the set / clear
"suppress notification" mixed up with the normal sti / cli (set / clear
interrupts).

> 
>> Then we wouldn't need hooks on the context switch path at all (which are only
>> there to catch running -> runnable and runnable -> running) -- we could
>> just have the block/unblock hooks (to change NV and add / remove things
>> from the list).
> 
> We still need to _clear_ SN when vCPU is being blocked.

Yes, thanks for the reminder.

>> Setting NDST on vmentry also has the advantage of being more robust --
>> you don't have to carefully think about cpu migration and all that; you
>> simply set it right when you're about to actually use it.
> 
> In the current solution, we set the NDST in the end of vmx_ctxt_switch_to(),
> it also doesn't suffer from cpu migration, right?

It works, yes.

>> Looking at your comment in pi_notification_interrupt() -- I take it that
>> "pending interrupt in PIRR will be synced to vIRR" somewhere in the call
>> to vmx_intr_assist()? 
> 
> Yes.
> vmx_intr_assist() --> hvm_vcpu_has_pending_irq()
> --> vlapic_has_pending_irq() --> vlapic_find_highest_irr()
> --> hvm_funcs.sync_pir_to_irr()
> 
>> So if we were to set NDST and enable SN before
>> that call, we should be able to set VCPU_KICK if we get an interrupt
>> between vmx_intr_assist() and the actual vmentry as necessary.
> 
> But if the interrupt comes in between last vmexit and enabling SN here,
> it cannot be injected to guest during this vmentry. This will affect the
> interrupt latency, each PI happening during vCPU is in root-mode needs
> to be injected to the guest during the next vmentry.

I don't understand this.  I'm proposing this:

(actual vmexit)
...[1]
set SN
...[2]
(Hypervisor does stuff)
vmx_do_vmentry
set NDST
clear SN
...[3]
vmx_intr_assist()
...[4]
cli
(check for softirqs)
...[5]
(actual vmentry)

Here is what happens if an interrupt is generated at the various [N]:

[1]: posted_interrupt_vector delivered to hypervisor.  VCPU_KICK set,
but unnecessarily, since the pending interrupt will already be picked up
by vmx_intr_assist().

[2]: No interrupt delivered to hypervisor.  Pending interrupt will be
picked up by vmx_intr_assist().

[3]: posted_interrupt_vector delivered to hypervisor.  VCPU_KICK set,
but unnecessarily, since pending interrupt will already be picked up by
vmx_intr_assist().

[4]: posted_interrupt_vector delivered to hypervisor.  VCPU_KICK set
necessarily this time.  check for softirqs causes it to retry vmentry,
so vmx_intr_assist() can pick up the pending interrupt.

[5]: interrupt not delivered until interrupts are clear in the guest
context, at which point it will be delivered directly to the guest.

This seems to me to have the same properties the solution you propose,
except that in your solution, [2] behaves identically to [1] and [3].
Have I missed something?
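
To make the comparison concrete, a minimal sketch of such vmexit/vmentry
hooks could look like the following (the hook names are invented here,
pi_set_sn()/pi_clear_sn() are assumed to exist along the lines of the
earlier helper patch, and the xAPIC vs. x2APIC destination-ID format is
glossed over):

/* On every VMEXIT: suppress notifications while in root mode. */
static void vmx_pi_vmexit_hook(struct vcpu *v)
{
    pi_set_sn(&v->arch.hvm_vmx.pi_desc);
}

/* Right before VMENTRY: aim notifications at the current pCPU and allow
 * them again, so anything pending is picked up via vmx_intr_assist() or
 * delivered directly to the guest. */
static void vmx_pi_vmentry_hook(struct vcpu *v)
{
    v->arch.hvm_vmx.pi_desc.ndst = cpu_physical_id(smp_processor_id());
    pi_clear_sn(&v->arch.hvm_vmx.pi_desc);
}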

> I think the current solution we have discussed these days is very clear,
> and I am about to implement it. Does the above method have an obvious
> advantage compared to what we have discussed so far?

This seems to me to be a cleaner design.  It removes the need to modify
any code on the context-switch path.  It moves modification of NDST and
most modifications of SN closer to where I think they logically go.  It
reduces the number of unnecessary PI interrupts delivered to the
hypervisor (by suppressing notifications most of the time spent in the
hypervisor), and automatically deals with the "spurious interrupts
during tasklet execution / vcpu offline lazy context switch" issue we
were talking about in the other thread.

On the other hand, it does add extra hooks on the vmentry/vmexit paths;
but they seem very similar to me to the kind of hooks which are already
there.

So, at the moment my *advice* is to look into setting SN / NDST on
vmexit/vmentry, and only having hooks at block/unblock.

However, the __context_switch() path is also OK with me, if Jan and
Dario are happy.

Jan / Dario, do you guys have any opinions / preferences?

 -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-23  9:44               ` George Dunlap
@ 2015-09-23 12:35                 ` Wu, Feng
  2015-09-23 15:25                   ` George Dunlap
  0 siblings, 1 reply; 86+ messages in thread
From: Wu, Feng @ 2015-09-23 12:35 UTC (permalink / raw)
  To: George Dunlap, Jan Beulich
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	Dario Faggioli, xen-devel, Wu, Feng



> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: Wednesday, September 23, 2015 5:44 PM
> To: Jan Beulich; Wu, Feng
> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
> xen-devel@lists.xen.org; Keir Fraser
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On 09/22/2015 03:01 PM, Jan Beulich wrote:
> >>>> On 22.09.15 at 15:40, <feng.wu@intel.com> wrote:
> >
> >>
> >>> -----Original Message-----
> >>> From: Jan Beulich [mailto:JBeulich@suse.com]
> >>> Sent: Tuesday, September 22, 2015 5:00 PM
> >>> To: Wu, Feng
> >>> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; George Dunlap; Tian,
> >> Kevin;
> >>> xen-devel@lists.xen.org; Keir Fraser
> >>> Subject: RE: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core
> logic
> >>> handling
> >>>
> >>>>>> On 22.09.15 at 09:19, <feng.wu@intel.com> wrote:
> >>>> However, I do find some issues with my proposal above, see below:
> >>>>
> >>>> 1. Set _VPF_blocked
> >>>> 2. ret = arch_block()
> >>>> 3. if ( ret || local_events_need_delivery() )
> >>>> 	clear_bit(_VPF_blocked, &v->pause_flags);
> >>>>
> >>>> After step #2, if ret == false, that means, we successfully changed the PI
> >>>> descriptor in arch_block(), if local_events_need_delivery() returns true,
> >>>> _VPF_blocked is cleared. After that, external interrupt come in, hence
> >>>> pi_wakeup_interrupt() --> vcpu_unblock(), but _VPF_blocked is cleared,
> >>>> so vcpu_unblock() does nothing, so the vCPU's PI state is incorrect.
> >>>>
> >>>> One possible solution for it is:
> >>>> - Disable the interrupts before the check in step #3 above
> >>>> - if local_events_need_delivery() returns true, undo all the operations
> >>>>  done in arch_block()
> >>>> - Enable interrupts after _VPF_blocked gets cleared.
> >>>
> >>> Couldn't this be taken care of by, if necessary, adjusting PI state
> >>> in vmx_do_resume()?
> >>
> >> What do you mean here? Could you please elaborate? Thanks!
> >
> > From the discussion so far I understand that all you're after is that
> > the PI descriptor is correct for the period of time while the guest
> > vCPU is actually executing in guest context. If that's a correct
> > understanding of mine, then setting the vector and flags back to
> > what's needed while the guest is running would be sufficient to be
> > done right before entering the guest, i.e. in vmx_do_resume().
> 
> Along those lines, is setting the SN and NDST relatively expensive?
> 
> If they are, then of course switching them in __context_switch() makes
> sense.
> 
> But if setting them is fairly cheap, then we could just clear SN on
> every vmexit, and set SN and NDST on every vmentry, couldn't we?  

Do you mean _set_ SN (Suppress Notification) on vmexit and _clear_
SN on vmentry? I think this might be an alternative.

> Then we wouldn't need hooks on the context switch path at all (which are only
> there to catch running -> runnable and runnable -> running) -- we could
> just have the block/unblock hooks (to change NV and add / remove things
> from the list).

We still need to _clear_ SN when vCPU is being blocked.

> 
> Setting NDST on vmentry also has the advantage of being more robust --
> you don't have to carefully think about cpu migration and all that; you
> simply set it right when you're about to actually use it.

In the current solution, we set the NDST in the end of vmx_ctxt_switch_to(),
it also doesn't suffer from cpu migration, right?

> 
> Looking at your comment in pi_notification_interrupt() -- I take it that
> "pending interrupt in PIRR will be synced to vIRR" somewhere in the call
> to vmx_intr_assist()? 

Yes.
vmx_intr_assist() --> hvm_vcpu_has_pending_irq()
--> vlapic_has_pending_irq() --> vlapic_find_highest_irr()
--> hvm_funcs.sync_pir_to_irr()

> So if we were to set NDST and enable SN before
> that call, we should be able to set VCPU_KICK if we get an interrupt
> between vmx_intr_assist() and the actual vmentry as necessary.

But if the interrupt comes in between last vmexit and enabling SN here,
it cannot be injected to guest during this vmentry. This will affect the
interrupt latency, each PI happening during vCPU is in root-mode needs
to be injected to the guest during the next vmentry.

I think the current solution we have discussed these days is very clear,
and I am about to implement it. Does the above method have an obvious
advantage compared to what we have discussed so far?

Thanks,
Feng

> 
>  -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22 14:01             ` Jan Beulich
@ 2015-09-23  9:44               ` George Dunlap
  2015-09-23 12:35                 ` Wu, Feng
  0 siblings, 1 reply; 86+ messages in thread
From: George Dunlap @ 2015-09-23  9:44 UTC (permalink / raw)
  To: Jan Beulich, Feng Wu
  Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
	Dario Faggioli, xen-devel

On 09/22/2015 03:01 PM, Jan Beulich wrote:
>>>> On 22.09.15 at 15:40, <feng.wu@intel.com> wrote:
> 
>>
>>> -----Original Message-----
>>> From: Jan Beulich [mailto:JBeulich@suse.com]
>>> Sent: Tuesday, September 22, 2015 5:00 PM
>>> To: Wu, Feng
>>> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; George Dunlap; Tian, 
>> Kevin;
>>> xen-devel@lists.xen.org; Keir Fraser
>>> Subject: RE: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
>>> handling
>>>
>>>>>> On 22.09.15 at 09:19, <feng.wu@intel.com> wrote:
>>>> However, I do find some issues with my proposal above, see below:
>>>>
>>>> 1. Set _VPF_blocked
>>>> 2. ret = arch_block()
>>>> 3. if ( ret || local_events_need_delivery() )
>>>> 	clear_bit(_VPF_blocked, &v->pause_flags);
>>>>
>>>> After step #2, if ret == false, that means, we successfully changed the PI
>>>> descriptor in arch_block(), if local_events_need_delivery() returns true,
>>>> _VPF_blocked is cleared. After that, external interrupt come in, hence
>>>> pi_wakeup_interrupt() --> vcpu_unblock(), but _VPF_blocked is cleared,
>>>> so vcpu_unblock() does nothing, so the vCPU's PI state is incorrect.
>>>>
>>>> One possible solution for it is:
>>>> - Disable the interrupts before the check in step #3 above
>>>> - if local_events_need_delivery() returns true, undo all the operations
>>>>  done in arch_block()
>>>> - Enable interrupts after _VPF_blocked gets cleared.
>>>
>>> Couldn't this be taken care of by, if necessary, adjusting PI state
>>> in vmx_do_resume()?
>>
>> What do you mean here? Could you please elaborate? Thanks!
> 
> From the discussion so far I understand that all you're after is that
> the PI descriptor is correct for the period of time while the guest
> vCPU is actually executing in guest context. If that's a correct
> understanding of mine, then setting the vector and flags back to
> what's needed while the guest is running would be sufficient to be
> done right before entering the guest, i.e. in vmx_do_resume().

Along those lines, is setting the SN and NDST relatively expensive?

If they are, then of course switching them in __context_switch() makes
sense.

But if setting them is fairly cheap, then we could just clear SN on
every vmexit, and set SN and NDST on every vmentry, couldn't we?  Then
we wouldn't need hooks on the context switch path at all (which are only
there to catch running -> runnable and runnable -> running) -- we could
just have the block/unblock hooks (to change NV and add / remove things
from the list).

Setting NDST on vmentry also has the advantage of being more robust --
you don't have to carefully think about cpu migration and all that; you
simply set it right when you're about to actually use it.

Looking at your comment in pi_notification_interrupt() -- I take it that
"pending interrupt in PIRR will be synced to vIRR" somewhere in the call
to vmx_intr_assist()?  So if we were to set NDST and enable SN before
that call, we should be able to set VCPU_KICK if we get an interrupt
between vmx_intr_assist() and the actual vmentry as necessary.

 -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-23  7:59                             ` Dario Faggioli
@ 2015-09-23  8:11                               ` Wu, Feng
  0 siblings, 0 replies; 86+ messages in thread
From: Wu, Feng @ 2015-09-23  8:11 UTC (permalink / raw)
  To: Dario Faggioli, George Dunlap
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Wednesday, September 23, 2015 4:00 PM
> To: Wu, Feng; George Dunlap
> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
> Cooper; Jan Beulich
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> >
> > I cannot think the bad effect of the spurious PI as well. I was just
> > a little
> > confused about we can do this and why we don't do this. Maybe
> > context_switch() is a critical path, if we can bear those spurious
> > PI,
> > it is not worth adding those logic in it at the cost of some
> > performance
> > lost during scheduling. Is this your concern?
> >
> The, however small, performance implications of even only checking
> whether the hooks should be invoked are certainly good to avoid,
> especially on non-PI-enabled (and even more so on non-VMX) hardware.
> 
> However, what I think is more important in this case is that not
> having the hooks in context_switch() yields a better result from an
> architectural and code-organization point of view.
> It makes both the context switch code, and the PI code, easier to
> understand and to maintain.

Good to know that, thanks for the explanation!

Thanks,
Feng

> 
> So, thanks to you and Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-23  5:52                           ` Wu, Feng
@ 2015-09-23  7:59                             ` Dario Faggioli
  2015-09-23  8:11                               ` Wu, Feng
  0 siblings, 1 reply; 86+ messages in thread
From: Dario Faggioli @ 2015-09-23  7:59 UTC (permalink / raw)
  To: Wu, Feng, George Dunlap
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich



On Wed, 2015-09-23 at 05:52 +0000, Wu, Feng wrote:
> George & Dario, thanks very much for sharing so much scheduler knowledge
> with me; it is very useful.
>
Well, we're lucky enough that it's our job to do that. :-D

> > > So the only downside to doing everything in block(), wake(), and
> > > __context_switch() is that if a VM is offline, or preempted by a
> > > tasklet, and an interrupt comes in, we will get a spurious PI
> > > (i.e.,
> > > one
> > > which interrupts us but we can't do anything useful about).
> > > 
> > Indeed. And that also seems bearable to me. Feng, are there reasons
> > why
> > you think it's not?
> 
> I cannot see any bad effect from the spurious PIs either. I was just
> a little confused about whether we can do this and why we don't.
> Maybe context_switch() is a critical path: if we can bear those
> spurious PIs, it is not worth adding that logic to it at the cost of
> some performance lost during scheduling. Is this your concern?
> 
The, however small, performance implications of even only checking
whether the hooks should be invoked are certainly good to avoid,
especially on non-PI-enabled (and even more so on non-VMX) hardware.

However, what I think is more important in this case is that not
having the hooks in context_switch() yields a better result from an
architectural and code-organization point of view.
It makes both the context switch code, and the PI code, easier to
understand and to maintain.

So, thanks to you and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-23  7:11             ` Dario Faggioli
@ 2015-09-23  7:20               ` Wu, Feng
  0 siblings, 0 replies; 86+ messages in thread
From: Wu, Feng @ 2015-09-23  7:20 UTC (permalink / raw)
  To: Dario Faggioli, George Dunlap, George Dunlap
  Cc: Tian, Kevin, Wu, Feng, Andrew Cooper, xen-devel, Jan Beulich,
	Keir Fraser



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Wednesday, September 23, 2015 3:12 PM
> To: Wu, Feng; George Dunlap; George Dunlap
> Cc: Jan Beulich; Tian, Kevin; Keir Fraser; Andrew Cooper;
> xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> >
> > Yes, we can go to blocked only from running state. let me clarify a
> > question first: Xen doesn't support kernel preemption, right?
> >
> No, it does not.
> 
> > ( i.e. we
> > can only perform context switch before returning to guest.)
> >
> Yes, we schedule only when SCHEDULE_SOFTIRQ is checked and found to be
> on.

Good, thanks for the confirmation.

Thanks,
Feng
> 
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-23  6:35           ` Wu, Feng
@ 2015-09-23  7:11             ` Dario Faggioli
  2015-09-23  7:20               ` Wu, Feng
  0 siblings, 1 reply; 86+ messages in thread
From: Dario Faggioli @ 2015-09-23  7:11 UTC (permalink / raw)
  To: Wu, Feng, George Dunlap, George Dunlap
  Cc: Andrew Cooper, Tian, Kevin, Keir Fraser, Jan Beulich, xen-devel



On Wed, 2015-09-23 at 06:35 +0000, Wu, Feng wrote:
> > From: George Dunlap [mailto:george.dunlap@citrix.com]
> > On 09/22/2015 08:19 AM, Wu, Feng wrote:

> > > In the arch_block() hook, we actually need to
> > > 	- Put vCPU to the blocking list
> > > 	- Set the NV to wakeup vector
> > > 	- Set NDST to the right pCPU
> > > 	- Clear SN
> > 
> > Nit: We shouldn't need to actually clear SN here; SN should already
> > be
> > clear because the vcpu should be currently running.  (I don't think
> > there's a way for a vcpu to go from runnable->blocked, is there?) 
> >  And
> > if it's just been running, then NDST should also already be the
> > correct
> > pcpu.
> 
> Yes, we can go to blocked only from running state. let me clarify a
> question first: Xen doesn't support kernel preemption, right?
>
No, it does not.

> ( i.e. we
> can only perform context switch before returning to guest.) 
>
Yes, we schedule only when SCHEDULE_SOFTIRQ is checked and found to be
on.

Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22 10:26         ` George Dunlap
@ 2015-09-23  6:35           ` Wu, Feng
  2015-09-23  7:11             ` Dario Faggioli
  0 siblings, 1 reply; 86+ messages in thread
From: Wu, Feng @ 2015-09-23  6:35 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli, George Dunlap
  Cc: Tian, Kevin, Wu, Feng, Andrew Cooper, xen-devel, Jan Beulich,
	Keir Fraser



> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: Tuesday, September 22, 2015 6:27 PM
> To: Wu, Feng; Dario Faggioli; George Dunlap
> Cc: Jan Beulich; Tian, Kevin; Keir Fraser; Andrew Cooper;
> xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On 09/22/2015 08:19 AM, Wu, Feng wrote:
> >
> >
> >> -----Original Message-----
> >> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> >> Sent: Monday, September 21, 2015 10:25 PM
> >> To: Wu, Feng; George Dunlap; George Dunlap
> >> Cc: Jan Beulich; Tian, Kevin; Keir Fraser; Andrew Cooper;
> >> xen-devel@lists.xen.org
> >> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core
> logic
> >> handling
> >>
> >> On Mon, 2015-09-21 at 12:22 +0000, Wu, Feng wrote:
> >>>
> >>>> -----Original Message-----
> >>>> From: George Dunlap [mailto:george.dunlap@citrix.com]
> >>
> >>>> You also need to check that local_events_need_delivery() will
> >>>> return
> >>>> "true" if you get an interrupt between that time and entering the
> >>>> hypervisor.  Will that happen automatically from
> >>>> hvm_local_events_need_delivery() -> hvm_vcpu_has_pending_irq() ->
> >>>> vlapic_has_pending_irq()?  Or will you need to add a hook in
> >>>> hvm_vcpu_has_pending_irq()?
> >>>
> >>> I think I don't need to add hook in hvm_vcpu_has_pending_irq(), what
> >>> I need
> >>> to do in vcpu_block() and do_poll() is as below:
> >>>
> >>> 1. set_bit(_VPF_blocked, &v->pause_flags);
> >>>
> >>> 2. ret = v->arch.arch_block(), in this hook, we can re-use the same
> >>> logic in
> >>> vmx_pre_ctx_switch_pi(), and check whether ON bit is set during
> >>> updating
> >>> posted-interrupt descriptor, can return 1 when ON is set
> >>>
> >> It think it would be ok for an hook to return a value (maybe, if doing
> >> that, let's pick variable names and use comments to explain what goes
> >> on as good as we can).
> >>
> >> I think I also see why you seem to prefer doing it that way, rather
> >> than hacking local_events_need_delivery(), but can you please elaborate
> >> on that? (My feeling is that you want to avoid having to update the
> >> data structures in between _VPF_blocked and the check
> >> local_events_need_delivery(), and then having to roll such update back
> >> if local_events_need_delivery() ends up being false, is that the
> >> case?).
> >
> > In the arch_block() hook, we actually need to
> > 	- Put vCPU to the blocking list
> > 	- Set the NV to wakeup vector
> > 	- Set NDST to the right pCPU
> > 	- Clear SN
> 
> Nit: We shouldn't need to actually clear SN here; SN should already be
> clear because the vcpu should be currently running.  (I don't think
> there's a way for a vcpu to go from runnable->blocked, is there?)  And
> if it's just been running, then NDST should also already be the correct
> pcpu.

Yes, we can go to blocked only from the running state. Let me clarify a
question first: Xen doesn't support kernel preemption, right? (I.e. we
can only perform a context switch before returning to the guest.) If that
is the case, we can make sure the pCPU is not changed for the running
vCPU before we set the NDST field in the PI descriptor, so we don't need
to update NDST.

> 
> > During we are updating the posted-interrupt descriptor, the VT-d
> > hardware can issue notification event and hence ON is set. If that is the
> > case, it is straightforward to return directly, and it doesn't make sense
> > we continue to do the above operations since we don't need to actually.
> 
> But checking to see if any interrupts have come in in the middle of your
> code just adds unnecessary complication.  We need to have the code in
> place to un-do all the blocking steps in the case that
> local_events_need_delivery() returns true anyway.
> 
> Additionally, while local_events_need_delivery() is only called from
> do_block and do_poll, hvm_local_events_need_delivery() is called from a
> number of other places, as is hvm_vcpu_has_pending_irq().  All those
> places presumably also need to know whether there's a PI pending to work
> properly.

local_events_need_delivery() can be called in other places, since it is wrapped
in hypercall_preempt_check(), which can be called in a bunch of places. But
that shouldn't be a problem here. In fact, the ON bit is checked in the
local_events_need_delivery() call path (this was added when the CPU-side PI
patches were merged years ago):

local_events_need_delivery() --> hvm_local_events_need_delivery()
--> hvm_vcpu_has_pending_irq() --> vlapic_has_pending_irq()
--> vlapic_find_highest_irr() --> hvm_funcs.sync_pir_to_irr()
--> pi_test_and_clear_on()

What we need to do is find a good way to recover the PI state in vcpu_block()
and do_poll() if local event delivery is needed. I need to think about this more.
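
(One possible shape of that, following the ordering George suggests further
down -- a hypothetical sketch with made-up hook names, which still needs to
be checked against the race above:)

/* Hypothetical sketch of the ordering in do_block()/do_poll(); the hooks
 * are made-up names standing in for the real PI descriptor updates. */
struct vcpu_sketch {
    unsigned long pause_flags;
};

#define VPF_BLOCKED_SKETCH 0x1UL

static void arch_pi_block(struct vcpu_sketch *v)
{
    (void)v;  /* NV = wakeup vector, add the vcpu to the per-pCPU blocking list */
}

static void arch_pi_unblock(struct vcpu_sketch *v)
{
    (void)v;  /* NV = posted-interrupt vector, remove from the blocking list */
}

static int local_events_need_delivery_sketch(void)
{
    return 0; /* placeholder: "is anything (event channel, vIRQ, PI) pending?" */
}

static void do_block_sketch(struct vcpu_sketch *v)
{
    v->pause_flags |= VPF_BLOCKED_SKETCH;   /* 1. mark the vcpu as blocked */
    arch_pi_block(v);                       /* 2. full transition of the PI state */

    if ( local_events_need_delivery_sketch() )
    {
        /* 3. something is already pending: undo everything, exactly as a
         * real wakeup would, instead of special-casing ON inside the hook. */
        arch_pi_unblock(v);
        v->pause_flags &= ~VPF_BLOCKED_SKETCH;
    }
}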

Thanks,
Feng

> 
> >> Code would look better, IMO, if we manage to fold that somehow inside
> >> local_events_need_delivery(), but that:
> >
> > As mentioned above, during updating the PI descriptor for blocking, we
> > need to check whether ON is set, and return if it is set. This logic cannot
> > be included in local_events_need_delivery(), since the update itself
> > has not much relationship with local_events_need_delivery().
> >
> > However, I do find some issues with my proposal above, see below:
> >
> > 1. Set _VPF_blocked
> > 2. ret = arch_block()
> > 3. if ( ret || local_events_need_delivery() )
> > 	clear_bit(_VPF_blocked, &v->pause_flags);
> >
> > After step #2, if ret == false, that means, we successfully changed the PI
> > descriptor in arch_block(), if local_events_need_delivery() returns true,
> > _VPF_blocked is cleared. After that, external interrupt come in, hence
> > pi_wakeup_interrupt() --> vcpu_unblock(), but _VPF_blocked is cleared,
> > so vcpu_unblock() does nothing, so the vCPU's PI state is incorrect.
> >
> > One possible solution for it is:
> > - Disable the interrupts before the check in step #3 above
> > - if local_events_need_delivery() returns true, undo all the operations
> >  done in arch_block()
> > - Enable interrupts after _VPF_blocked gets cleared.
> 
> Well, yes, if as a result of checking to see that there are interrupts
> you unblock a processor, then you need to do *all* the things related to
> unblocking, including switching the interrupt vector and removing it
> from the blocked list.
> 
> If we're going to put hooks here in do_block(), then the best thing to
> do is as much as possible to make PI interrupts act exactly the same as
> other interrupts; i.e.,
> 
> * Do a full transition to blocked (set _VPF_blocked, add vcpu to PI
> list, switch vector to wake)
> * Check to see if there are any pending interrupts (either event
> channels, virtual interrupts, or PIs)
> * If so, do a full transition to unblocked (clear _VPF_blocked, switch
> vector to PI, remove vcpu from list).
> 
> We should be able to order the operations such that if interrupts come
> in the middle nothing bad happens, without needing to actually disable
> interrupts.
> 
> OTOH -- I think if we grab a lock during an interrupt, we're not allowed
> to grab it with interrupts disabled, correct?  So we may end up having
> to disable interrupts anyway.
> 
>  -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22 14:38                         ` Dario Faggioli
@ 2015-09-23  5:52                           ` Wu, Feng
  2015-09-23  7:59                             ` Dario Faggioli
  0 siblings, 1 reply; 86+ messages in thread
From: Wu, Feng @ 2015-09-23  5:52 UTC (permalink / raw)
  To: Dario Faggioli, George Dunlap
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Tuesday, September 22, 2015 10:39 PM
> To: George Dunlap; Wu, Feng
> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
> Cooper; Jan Beulich
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On Tue, 2015-09-22 at 15:15 +0100, George Dunlap wrote:
> > On 09/22/2015 02:52 PM, Wu, Feng wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> 
> > > > Yes, the idle to vCPUB switch is covered by __context_switch(),
> > > > but
> > > it cannot change the PI state of vCPUA at that point. Like
> > > mentioned
> > > above, in this case, spurious PI interrupts happens.
> >
> > On the contrary, when __context_switch() is called for the idle ->
> > vcpuB
> > transition in the above scenario, it is *actually* context switching
> > from vcpuA to vcpuB, since vcpuA is still actually on the cpu.
> >  Which
> > means that if you add PI code into arch->ctxt_switch_from(), the case
> > you describe above will be handled automatically.
> >
> Yep, that's exactly what I was also writing... luckily, I've seen
> George's email before getting too far with that! :-P
> 
> That's the point of lazy context switch. The context of vCPUA is still
> on pCPUA during the time idle executes. If, at the next transition
> point, we switch from idle to vCPUA again, then nothing is needed, and
> we just saved one context saving and one context restoring operation.
> If it is something else, like in your case, we need to save vCPUA's
> context, which is actually what we have on pCPUA, despite idle (and not
> vCPUA itself) was what was running.

George & Dario, thanks very much for sharing so much scheduler knowledge
with me; it is very useful. I think I was making a mistake about how
__context_switch() works when I wrote the emails above. Now it is clear
that the scenario I mentioned above can be covered in __context_switch().

> 
> > So the only downside to doing everything in block(), wake(), and
> > __context_switch() is that if a VM is offline, or preempted by a
> > tasklet, and an interrupt comes in, we will get a spurious PI (i.e.,
> > one
> > which interrupts us but we can't do anything useful about).
> >
> Indeed. And that also seems bearable to me. Feng, are there reasons why
> you think it's not?

I cannot see any bad effect from the spurious PIs either. I was just a little
confused about whether we can do this and why we don't. Maybe
context_switch() is a critical path: if we can bear those spurious PIs,
it is not worth adding that logic to it at the cost of some performance
lost during scheduling. Is this your concern?

Thanks,
Feng

> 
> Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22 14:28                   ` George Dunlap
@ 2015-09-23  5:37                     ` Wu, Feng
  0 siblings, 0 replies; 86+ messages in thread
From: Wu, Feng @ 2015-09-23  5:37 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Wu, Feng



> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: Tuesday, September 22, 2015 10:28 PM
> To: Wu, Feng; Dario Faggioli
> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
> Cooper; Jan Beulich
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On 09/22/2015 02:25 PM, Wu, Feng wrote:
> >>> But if we want to avoid spurious PI interrupts when running idle, then
> >>> yes, we need *some* kind of a hook on the lazy context switch path.
> >>>
> >>> /me does some more thinking...
> >>
> >> To be honest, since we'll be get spurious PI interrupts in the
> >> hypervisor all the time anyway, I'm inclined at the moment not to worry
> >> to much about this case.
> >
> > Why do you say "we'll be get spurious PI interrupts in the  hypervisor all the
> time"?
> >
> > And could you please share what is your concern to handle this case to avoid
> > such spurious PI interrupts? Thanks!
> 
> So please correct me if I'm wrong in my understanding:
> 
> * When a vcpu is in runstate "running", with PI enabled, you have the PI
> vector set to "posted_interrupt_vector", SN=0.
> 
> * When in this state in non-root mode, PI interrupts result in an
> interrupt being delivered directly to the guest.
> 
> * When in this state in root mode, PI interrupts result in a
> posted_interrupt_vector interrupt being delivered to Xen.
> 
> Is that the case?

Exactly, it is how PI works today.

Thanks,
Feng

> 
> So basically, if the PI happens to come in when the guest is making a
> hypercall, or the guest is doing any other number of things that involve
> the hypervisor, then Xen will get a "spurious" PI interrupt -- spurious
> because there's nothing Xen actually needs to do about it; the guest
> interrupt will be delivered the next time we do a VMENTER.
> 
> So spurious PI interrupts are already going to happen from time to time
> -- they're not really that big of a deal.  Having them happen when a VM
> is running a tasklet or idle waiting for qemu isn't such a big deal either.



> 
>  -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22 14:15                       ` George Dunlap
@ 2015-09-22 14:38                         ` Dario Faggioli
  2015-09-23  5:52                           ` Wu, Feng
  0 siblings, 1 reply; 86+ messages in thread
From: Dario Faggioli @ 2015-09-22 14:38 UTC (permalink / raw)
  To: George Dunlap, Wu, Feng
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich



On Tue, 2015-09-22 at 15:15 +0100, George Dunlap wrote:
> On 09/22/2015 02:52 PM, Wu, Feng wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]

> > > Yes, the idle to vCPUB switch is covered by __context_switch(),
> > > but
> > it cannot change the PI state of vCPUA at that point. Like
> > mentioned
> > above, in this case, spurious PI interrupts happens.
> 
> On the contrary, when __context_switch() is called for the idle ->
> vcpuB
> transition in the above scenario, it is *actually* context switching
> from vcpuA to vcpuB, since vcpuA is still actually on the cpu.  
>  Which
> means that if you add PI code into arch->ctxt_switch_from(), the case
> you describe above will be handled automatically.
> 
Yep, that's exactly what I was also writing... luckily, I've seen
George's email before getting too far with that! :-P

That's the point of lazy context switch. The context of vCPUA is still
on pCPUA during the time idle executes. If, at the next transition
point, we switch from idle to vCPUA again, then nothing is needed, and
we just saved one context saving and one context restoring operation.
If it is something else, like in your case, we need to save vCPUA's
context, which is actually what we have on pCPUA, even though idle (and
not vCPUA itself) was what was running.

> So the only downside to doing everything in block(), wake(), and
> __context_switch() is that if a VM is offline, or preempted by a
> tasklet, and an interrupt comes in, we will get a spurious PI (i.e.,
> one
> which interrupts us but we can't do anything useful about).
> 
Indeed. And that also seems bearable to me. Feng, are there reasons why
you think it's not?

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22 13:25                 ` Wu, Feng
  2015-09-22 13:40                   ` Dario Faggioli
@ 2015-09-22 14:28                   ` George Dunlap
  2015-09-23  5:37                     ` Wu, Feng
  1 sibling, 1 reply; 86+ messages in thread
From: George Dunlap @ 2015-09-22 14:28 UTC (permalink / raw)
  To: Wu, Feng, Dario Faggioli
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich

On 09/22/2015 02:25 PM, Wu, Feng wrote:
>>> But if we want to avoid spurious PI interrupts when running idle, then
>>> yes, we need *some* kind of a hook on the lazy context switch path.
>>>
>>> /me does some more thinking...
>>
>> To be honest, since we'll be getting spurious PI interrupts in the
>> hypervisor all the time anyway, I'm inclined at the moment not to worry
>> too much about this case.
> 
> Why do you say "we'll be getting spurious PI interrupts in the hypervisor all the time"?
> 
> And could you please share what your concern is with handling this case to avoid
> such spurious PI interrupts? Thanks!

So please correct me if I'm wrong in my understanding:

* When a vcpu is in runstate "running", with PI enabled, you have the PI
vector set to "posted_interrupt_vector", SN=0.

* When in this state in non-root mode, PI interrupts result in an
interrupt being delivered directly to the guest.

* When in this state in root mode, PI interrupts result in a
posted_interrupt_vector interrupt being delivered to Xen.

Is that the case?

So basically, if the PI happens to come in when the guest is making a
hypercall, or the guest is doing any other number of things that involve
the hypervisor, then Xen will get a "spurious" PI interrupt -- spurious
because there's nothing Xen actually needs to do about it; the guest
interrupt will be delivered the next time we do a VMENTER.

So spurious PI interrupts are already going to happen from time to time
-- they're not really that big of a deal.  Having them happen when a VM
is running a tasklet or idle waiting for qemu isn't such a big deal either.
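
(For reference, the descriptor fields this thread keeps coming back to --
ON, SN, NV, NDST -- sit roughly in a structure like the one below; this is
an approximation for the discussion, not a quote from the patch series:)

#include <stdint.h>

/* Approximate layout of the VT-d posted-interrupt descriptor: a 64-byte,
 * cache-line aligned structure. */
struct pi_desc_approx {
    uint32_t pir[8];             /* Posted Interrupt Requests: one bit per vector */
    union {
        struct {
            uint16_t on   : 1;   /* Outstanding Notification */
            uint16_t sn   : 1;   /* Suppress Notification */
            uint16_t rsvd_1 : 14;
            uint8_t  nv;         /* Notification Vector */
            uint8_t  rsvd_2;
            uint32_t ndst;       /* Notification Destination (pCPU APIC ID) */
        };
        uint64_t control;
    };
    uint32_t rsvd_3[6];
} __attribute__((aligned(64)));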

 -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22 13:52                     ` Wu, Feng
@ 2015-09-22 14:15                       ` George Dunlap
  2015-09-22 14:38                         ` Dario Faggioli
  0 siblings, 1 reply; 86+ messages in thread
From: George Dunlap @ 2015-09-22 14:15 UTC (permalink / raw)
  To: Wu, Feng, Dario Faggioli
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich

On 09/22/2015 02:52 PM, Wu, Feng wrote:
> 
> 
>> -----Original Message-----
>> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
>> Sent: Tuesday, September 22, 2015 9:40 PM
>> To: Wu, Feng; George Dunlap
>> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
>> Cooper; Jan Beulich
>> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
>> handling
>>
>> On Tue, 2015-09-22 at 13:25 +0000, Wu, Feng wrote:
>>>
>>
>>>> -----Original Message-----
>>>> From: George Dunlap [mailto:george.dunlap@citrix.com]
>>
>>> Specifically, consider the following scheduling case happened on
>>> pCPUA:
>>> vCPUA --> idle --> vCPUB
>>>
> >> 1. First, vCPUA is running on pCPUA, so the NDST field in PI
>>> descriptor is pCPUA
>>> 2. Then vCPUA is switched out and idle is switched in running in
>>> pCPUA
>>> 3. Sometime later, vCPUB is switched in on pCPUA. However, the NDST
>>> field
>>> for vCPUA is still pCPUA, and currently, vCPUB is running on it. That
>>> means
>>> the spurious PI interrupts for vCPUA can disturb vCPUB (because the
>>> NDST
>>> field is pCPUA), it seems not so good to me.
>>>
>> Mmm... Ok, but you're not saying what caused the transition from vCPUA
>> to idle, and from idle to vCPUB. That matters because, if this all
>> happened because blockings and wakeups, it's nothing to do with lazy
>> context switch, which is what we are discussing here (in the sense that
>> PI data structures would, in those cases, be updated appropriately in
>> block and wake hooks).
> 
> Like George mentioned before, let's assume it is because of tasklets, or
> because the vCPU is going offline to wait for the device model's IO
> operations, so idle is switched in.
> 
>>
>> Also, if we're going from vCPUA to vCPUB, even if there's idle in
>> between, that can't be done via lazy context switch.
> 
> Yes, in the above scenario, the idle to vCPUB transition has nothing to
> do with lazy context switch; however, the point here is that the vCPUA
> to idle transition is related to lazy context switch, and if we don't
> set SN for vCPUA here, it will remain clear and the NDST field of vCPUA
> will remain pCPUA, even when some time later vCPUB is running on it.
> In that case, the spurious PI interrupts for vCPUA can disturb vCPUB.
> 
>> In fact, in this
>> case, __context_switch() must be called at some point (during the idle-
>> ->vCPUB switch, if my understanding is correct), and the hook will
>> actually get called!
> 
> Yes, the idle to vCPUB switch is covered by __context_switch(), but
> it cannot change the PI state of vCPUA at that point. As mentioned
> above, in this case, spurious PI interrupts happen.

On the contrary, when __context_switch() is called for the idle -> vcpuB
transition in the above scenario, it is *actually* context switching
from vcpuA to vcpuB, since vcpuA is still actually on the cpu.   Which
means that if you add PI code into arch->ctxt_switch_from(), the case
you describe above will be handled automatically.

So the only downside to doing everything in block(), wake(), and
__context_switch() is that if a VM is offline, or preempted by a
tasklet, and an interrupt comes in, we will get a spurious PI (i.e., one
which interrupts us but we can't do anything useful about).
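
(Concretely, the kind of hooks I have in mind -- a hypothetical sketch only,
with made-up field and helper names:)

struct pi_state_sketch {
    unsigned int sn;     /* Suppress Notification */
    unsigned int ndst;   /* Notification Destination (APIC ID) */
};

/* Called when a (non-idle) vcpu's state is really saved off the pCPU --
 * which, with lazy switching, also covers vcpuA -> idle -> vcpuB. */
static void pi_ctxt_switch_from(struct pi_state_sketch *pi)
{
    pi->sn = 1;                  /* stop notifications aimed at a pCPU we are leaving */
}

/* Called when a vcpu's state is loaded onto the pCPU it will run on. */
static void pi_ctxt_switch_to(struct pi_state_sketch *pi, unsigned int apic_id)
{
    pi->ndst = apic_id;          /* retarget notifications at the new pCPU */
    pi->sn = 0;                  /* and allow them again */
}

The point being that the 'from' side runs whenever a vcpu's state is
genuinely saved off a pCPU, so the stale-NDST case above is covered.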

 -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22 13:40           ` Wu, Feng
@ 2015-09-22 14:01             ` Jan Beulich
  2015-09-23  9:44               ` George Dunlap
  0 siblings, 1 reply; 86+ messages in thread
From: Jan Beulich @ 2015-09-22 14:01 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
	Dario Faggioli, George Dunlap, xen-devel

>>> On 22.09.15 at 15:40, <feng.wu@intel.com> wrote:

> 
>> -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Tuesday, September 22, 2015 5:00 PM
>> To: Wu, Feng
>> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; George Dunlap; Tian, 
> Kevin;
>> xen-devel@lists.xen.org; Keir Fraser
>> Subject: RE: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
>> handling
>> 
>> >>> On 22.09.15 at 09:19, <feng.wu@intel.com> wrote:
>> > However, I do find some issues with my proposal above, see below:
>> >
>> > 1. Set _VPF_blocked
>> > 2. ret = arch_block()
>> > 3. if ( ret || local_events_need_delivery() )
>> > 	clear_bit(_VPF_blocked, &v->pause_flags);
>> >
>> > After step #2, if ret == false, that means, we successfully changed the PI
>> > descriptor in arch_block(), if local_events_need_delivery() returns true,
>> > _VPF_blocked is cleared. After that, external interrupt come in, hence
>> > pi_wakeup_interrupt() --> vcpu_unblock(), but _VPF_blocked is cleared,
>> > so vcpu_unblock() does nothing, so the vCPU's PI state is incorrect.
>> >
>> > One possible solution for it is:
>> > - Disable the interrupts before the check in step #3 above
>> > - if local_events_need_delivery() returns true, undo all the operations
>> >  done in arch_block()
>> > - Enable interrupts after _VPF_blocked gets cleared.
>> 
>> Couldn't this be taken care of by, if necessary, adjusting PI state
>> in vmx_do_resume()?
> 
> What do you mean here? Could you please elaborate? Thanks!

From the discussion so far I understand that all you're after is that
the PI descriptor is correct for the period of time while the guest
vCPU is actually executing in guest context. If that's a correct
understanding of mine, then setting the vector and flags back to
what's needed while the guest is running would be sufficient to be
done right before entering the guest, i.e. in vmx_do_resume().
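
(In terms of placement, something like the following; the helper is
hypothetical and only meant to show where the adjustment would sit, namely
as one of the last things done before (re)entering the guest:)

struct vcpu_pi_sketch {
    unsigned int nv;      /* Notification Vector */
    unsigned int sn;      /* Suppress Notification */
    unsigned int ndst;    /* Notification Destination */
};

/* Hypothetical helper: make the descriptor correct for guest execution,
 * regardless of what the blocking path may have left behind.  It would be
 * invoked from the resume path, right before the guest is (re)entered. */
static void pi_fixup_for_guest_entry(struct vcpu_pi_sketch *pi,
                                     unsigned int posted_intr_vector,
                                     unsigned int this_apic_id)
{
    pi->nv = posted_intr_vector;   /* notifications post directly while in guest */
    pi->ndst = this_apic_id;       /* ... to the pCPU we are about to run on */
    pi->sn = 0;                    /* ... and are not suppressed */
}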

Jan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22 13:40                   ` Dario Faggioli
@ 2015-09-22 13:52                     ` Wu, Feng
  2015-09-22 14:15                       ` George Dunlap
  0 siblings, 1 reply; 86+ messages in thread
From: Wu, Feng @ 2015-09-22 13:52 UTC (permalink / raw)
  To: Dario Faggioli, George Dunlap
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Tuesday, September 22, 2015 9:40 PM
> To: Wu, Feng; George Dunlap
> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
> Cooper; Jan Beulich
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On Tue, 2015-09-22 at 13:25 +0000, Wu, Feng wrote:
> >
> 
> > > -----Original Message-----
> > > From: George Dunlap [mailto:george.dunlap@citrix.com]
> 
> > Specifically, consider the following scheduling case happened on
> > pCPUA:
> > vCPUA --> idle --> vCPUB
> >
> > 1. First, vCPUA is running on pCPUA, so the NDST field in PI
> > descriptor is pCPUA
> > 2. Then vCPUA is switched out and idle is switched in running in
> > pCPUA
> > 3. Sometime later, vCPUB is switched in on pCPUA. However, the NDST
> > field
> > for vCPUA is still pCPUA, and currently, vCPUB is running on it. That
> > means
> > the spurious PI interrupts for vCPUA can disturb vCPUB (because the
> > NDST
> > field is pCPUA), it seems not so good to me.
> >
> Mmm... Ok, but you're not saying what caused the transition from vCPUA
> to idle, and from idle to vCPUB. That matters because, if this all
> happened because blockings and wakeups, it's nothing to do with lazy
> context switch, which is what we are discussing here (in the sense that
> PI data structures would, in those cases, be updated appropriately in
> block and wake hooks).

Like George mentioned before, let's assume it is because of tasklets, or
because the vCPU is going offline to wait for the device model's IO
operations, so idle is switched in.

> 
> Also, if we're going from vCPUA to vCPUB, even if there's idle in
> between, that can't be done via lazy context switch.

Yes, in the above scenario, the idle to vCPUB transition has nothing to
do with lazy context switch; however, the point here is that the vCPUA
to idle transition is related to lazy context switch, and if we don't
set SN for vCPUA here, it will remain clear and the NDST field of vCPUA
will remain pCPUA, even when some time later vCPUB is running on it.
In that case, the spurious PI interrupts for vCPUA can disturb vCPUB.

> In fact, in this
> case, __context_switch() must be called at some point (during the idle-
> ->vCPUB switch, if my understanding is correct), and the hook will
> actually get called!

Yes, the idle to vCPUB switch is covered by __context_switch(), but
it cannot change the PI state of vCPUA at that point. As mentioned
above, in this case, spurious PI interrupts happen.

Thanks,
Feng

> 
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22  8:59         ` Jan Beulich
@ 2015-09-22 13:40           ` Wu, Feng
  2015-09-22 14:01             ` Jan Beulich
  0 siblings, 1 reply; 86+ messages in thread
From: Wu, Feng @ 2015-09-22 13:40 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	Dario Faggioli, George Dunlap, xen-devel, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, September 22, 2015 5:00 PM
> To: Wu, Feng
> Cc: Andrew Cooper; Dario Faggioli; George Dunlap; George Dunlap; Tian, Kevin;
> xen-devel@lists.xen.org; Keir Fraser
> Subject: RE: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> >>> On 22.09.15 at 09:19, <feng.wu@intel.com> wrote:
> > However, I do find some issues with my proposal above, see below:
> >
> > 1. Set _VPF_blocked
> > 2. ret = arch_block()
> > 3. if ( ret || local_events_need_delivery() )
> > 	clear_bit(_VPF_blocked, &v->pause_flags);
> >
> > After step #2, if ret == false, that means, we successfully changed the PI
> > descriptor in arch_block(), if local_events_need_delivery() returns true,
> > _VPF_blocked is cleared. After that, external interrupt come in, hence
> > pi_wakeup_interrupt() --> vcpu_unblock(), but _VPF_blocked is cleared,
> > so vcpu_unblock() does nothing, so the vCPU's PI state is incorrect.
> >
> > One possible solution for it is:
> > - Disable the interrupts before the check in step #3 above
> > - if local_events_need_delivery() returns true, undo all the operations
> >  done in arch_block()
> > - Enable interrupts after _VPF_blocked gets cleared.
> 
> Couldn't this be taken care of by, if necessary, adjusting PI state
> in vmx_do_resume()?

What do you mean here? Could you please elaborate? Thanks!

Thanks,
Feng

> 
> Jan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22 13:25                 ` Wu, Feng
@ 2015-09-22 13:40                   ` Dario Faggioli
  2015-09-22 13:52                     ` Wu, Feng
  2015-09-22 14:28                   ` George Dunlap
  1 sibling, 1 reply; 86+ messages in thread
From: Dario Faggioli @ 2015-09-22 13:40 UTC (permalink / raw)
  To: Wu, Feng, George Dunlap
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich



On Tue, 2015-09-22 at 13:25 +0000, Wu, Feng wrote:
> 

> > -----Original Message-----
> > From: George Dunlap [mailto:george.dunlap@citrix.com]

> Specifically, consider the following scheduling case happened on
> pCPUA:
> vCPUA --> idle --> vCPUB
> 
> 1. First, vCPUA is running on pCPUA, so the NDST field in PI
> descriptor is pCPUA
> 2. Then vCPUA is switched out and idle is switched in running in
> pCPUA
> 3. Sometime later, vCPUB is switched in on pCPUA. However, the NDST
> field
> for vCPUA is still pCPUA, and currently, vCPUB is running on it. That
> means
> the spurious PI interrupts for vCPUA can disturb vCPUB (because the
> NDST
> field is pCPUA), it seems not so good to me.
> 
Mmm... Ok, but you're not saying what caused the transition from vCPUA
to idle, and from idle to vCPUB. That matters because, if this all
happened because of blockings and wakeups, it has nothing to do with lazy
context switch, which is what we are discussing here (in the sense that
the PI data structures would, in those cases, be updated appropriately in
the block and wake hooks).

Also, if we're going from vCPUA to vCPUB, even if there's idle in
between, that can't be done via lazy context switch. In fact, in this
case, __context_switch() must be called at some point (during the
idle->vCPUB switch, if my understanding is correct), and the hook will
actually get called!

Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22 10:46               ` George Dunlap
@ 2015-09-22 13:25                 ` Wu, Feng
  2015-09-22 13:40                   ` Dario Faggioli
  2015-09-22 14:28                   ` George Dunlap
  0 siblings, 2 replies; 86+ messages in thread
From: Wu, Feng @ 2015-09-22 13:25 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Wu, Feng



> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: Tuesday, September 22, 2015 6:46 PM
> To: Wu, Feng; Dario Faggioli
> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
> Cooper; Jan Beulich
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On 09/22/2015 11:43 AM, George Dunlap wrote:
> > On 09/22/2015 06:10 AM, Wu, Feng wrote:
> >>
> >>
> >>> -----Original Message-----
> >>> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> >>> Sent: Monday, September 21, 2015 10:12 PM
> >>> To: Wu, Feng; George Dunlap
> >>> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap;
> Andrew
> >>> Cooper; Jan Beulich
> >>> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core
> logic
> >>> handling
> >>>
> >>> On Mon, 2015-09-21 at 13:50 +0000, Wu, Feng wrote:
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> >>>
> >>>>> Note that, in case of preemptions, we are switching from a non-idle
> >>>>> vcpu to another, non-idle, vcpu, so lazy context switching to the
> >>>>> idle
> >>>>> vcpu should not have much to do with this case...
> >>>>
> >>>> So do you mean in preemptions, we cannot switch from non-idle to idle
> >>>> or
> >>>> Idle to non-idle, i.e, we can only switch from non-idle to non-idle?
> >>>>
> >>> That's what I meant. It's the definition of 'preemption' and of 'idle
>> task/vcpu', AFAICT. I mean, the idle vcpu has the lowest possible
> >>> priority ever, so it can't really preempt anyone.
> >>>
> >>> OTOH, if the idle vcpu is running, that means there weren't any active
> >>> vcpus because, e.g., all were blocked; therefore, any active vcpu
>> wanting to run would have to wake up (and hence go through the proper
> >>> wake up logic) before trying to preempt idle.
> >>>
> >>> There is one possible caveat: tasklets. In fact (as you can check
> >>> yourself by looking, in the code, for tasklet_work_scheduled), it is
> >>> possible that we force the idle vcpu to execute, even when we have
> >>> active and runnable vcpus, to let it process tasklets. I'm not really
> >>> sure whether this could be a problem for you or not, can you have a
> >>> look (and/or, a try) and report back?
> >>
> >> Thanks for your information about the tasklets part, it is very important,
> >> consider the following scenario:
> >>
> >> non-idle vCPUA --> idle (tasklet) --> non-idle vCPUA
> >>
> >> The above context switch will use the lazy context switch and cannot be
> >> covered in __context_switch(), we need to change SN's state during the
> >> "non-idle to idle" and "idle to non-idle" transition, so that means we need
> >> add the PI related logic in context_switch() instead of only in
> __context_switch().
> >>
> >> Besides that, it is more robust to do the PI switch in context_switch() which
> >> can cover lazy context switch. Maybe in future, there are some other
> >> feature needing execute task _inside_ an idle context, and it may introduce
> >> some issues due to no PI state transition, and it is a little hard to debug.
> >
> > So a transition like the above could happen in the case of a cpu that's
> > gone offline (e.g., to allow the devicemodel to handle an IO); or, as
> > you say, if we're doing urgent work in a tasklet such that it preempts a
> > running task.
> >
> > One option would be to just ignore this -- in which case we would get
> > spurious PI interrupts, but no other major issues, right?

Specifically, consider the following scheduling case happening on pCPUA:
vCPUA --> idle --> vCPUB

1. First, vCPUA is running on pCPUA, so the NDST field in the PI descriptor is pCPUA
2. Then vCPUA is switched out and idle is switched in, running on pCPUA
3. Sometime later, vCPUB is switched in on pCPUA. However, the NDST field
for vCPUA is still pCPUA, and currently vCPUB is running on it. That means
the spurious PI interrupts for vCPUA can disturb vCPUB (because the NDST
field is pCPUA), which seems not so good to me.

> >
> > But if we want to avoid spurious PI interrupts when running idle, then
> > yes, we need *some* kind of a hook on the lazy context switch path.
> >
> > /me does some more thinking...
> 
> To be honest, since we'll be getting spurious PI interrupts in the
> hypervisor all the time anyway, I'm inclined at the moment not to worry
> too much about this case.

Why do you say "we'll be getting spurious PI interrupts in the hypervisor all the time"?

And could you please share what your concern is with handling this case to avoid
such spurious PI interrupts? Thanks!

Thanks,
Feng

> 
>  -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22 10:43             ` George Dunlap
@ 2015-09-22 10:46               ` George Dunlap
  2015-09-22 13:25                 ` Wu, Feng
  0 siblings, 1 reply; 86+ messages in thread
From: George Dunlap @ 2015-09-22 10:46 UTC (permalink / raw)
  To: Wu, Feng, Dario Faggioli
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich

On 09/22/2015 11:43 AM, George Dunlap wrote:
> On 09/22/2015 06:10 AM, Wu, Feng wrote:
>>
>>
>>> -----Original Message-----
>>> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
>>> Sent: Monday, September 21, 2015 10:12 PM
>>> To: Wu, Feng; George Dunlap
>>> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
>>> Cooper; Jan Beulich
>>> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
>>> handling
>>>
>>> On Mon, 2015-09-21 at 13:50 +0000, Wu, Feng wrote:
>>>>
>>>>> -----Original Message-----
>>>>> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
>>>
>>>>> Note that, in case of preemptions, we are switching from a non-idle
>>>>> vcpu to another, non-idle, vcpu, so lazy context switching to the
>>>>> idle
>>>>> vcpu should not have much to do with this case...
>>>>
>>>> So do you mean in preemptions, we cannot switch from non-idle to idle
>>>> or
>>>> Idle to non-idle, i.e, we can only switch from non-idle to non-idle?
>>>>
>>> That's what I meant. It's the definition of 'preemption' and of 'idle
>>> task/vcpu', AFAICT. I mean, the idle vcpu has the lowest possible
>>> priority ever, so it can't really preempt anyone.
>>>
>>> OTOH, if the idle vcpu is running, that means there weren't any active
>>> vcpus because, e.g., all were blocked; therefore, any active vcpu
>>> wanting to run would have to wake up (and hence go through the proper
>>> wake up logic) before trying to preempt idle.
>>>
>>> There is one possible caveat: tasklets. In fact (as you can check
>>> yourself by looking, in the code, for tasklet_work_scheduled), it is
>>> possible that we force the idle vcpu to execute, even when we have
>>> active and runnable vcpus, to let it process tasklets. I'm not really
>>> sure whether this could be a problem for you or not, can you have a
>>> look (and/or, a try) and report back?
>>
>> Thanks for your information about the tasklets part, it is very important,
>> consider the following scenario:
>>
>> non-idle vCPUA --> idle (tasklet) --> non-idle vCPUA
>>
>> The above context switch will use the lazy context switch and cannot be
>> covered in __context_switch(), we need to change SN's state during the
>> "non-idle to idle" and "idle to non-idle" transition, so that means we need
>> add the PI related logic in context_switch() instead of only in __context_switch().
>>
>> Besides that, it is more robust to do the PI switch in context_switch() which
>> can cover lazy context switch. Maybe in future, there are some other
>> feature needing execute task _inside_ an idle context, and it may introduce
>> some issues due to no PI state transition, and it is a little hard to debug.
> 
> So a transition like the above could happen in the case of a cpu that's
> gone offline (e.g., to allow the devicemodel to handle an IO); or, as
> you say, if we're doing urgent work in a tasklet such that it preempts a
> running task.
> 
> One option would be to just ignore this -- in which case we would get
> spurious PI interrupts, but no other major issues, right?
> 
> But if we want to avoid spurious PI interrupts when running idle, then
> yes, we need *some* kind of a hook on the lazy context switch path.
> 
> /me does some more thinking...

To be honest, since we'll be getting spurious PI interrupts in the
hypervisor all the time anyway, I'm inclined at the moment not to worry
too much about this case.

 -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22  5:10           ` Wu, Feng
@ 2015-09-22 10:43             ` George Dunlap
  2015-09-22 10:46               ` George Dunlap
  0 siblings, 1 reply; 86+ messages in thread
From: George Dunlap @ 2015-09-22 10:43 UTC (permalink / raw)
  To: Wu, Feng, Dario Faggioli
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich

On 09/22/2015 06:10 AM, Wu, Feng wrote:
> 
> 
>> -----Original Message-----
>> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
>> Sent: Monday, September 21, 2015 10:12 PM
>> To: Wu, Feng; George Dunlap
>> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
>> Cooper; Jan Beulich
>> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
>> handling
>>
>> On Mon, 2015-09-21 at 13:50 +0000, Wu, Feng wrote:
>>>
>>>> -----Original Message-----
>>>> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
>>
>>>> Note that, in case of preemptions, we are switching from a non-idle
>>>> vcpu to another, non-idle, vcpu, so lazy context switching to the
>>>> idle
>>>> vcpu should not have much to do with this case...
>>>
>>> So do you mean in preemptions, we cannot switch from non-idle to idle
>>> or
>>> Idle to non-idle, i.e, we can only switch from non-idle to non-idle?
>>>
>> That's what I meant. It's the definition of 'preemption' and of 'idle
>> task/vcpu', AFAICT. I mean, the idle vcpu has the lowest possible
>> priority ever, so it can't really preempt anyone.
>>
>> OTOH, if the idle vcpu is running, that means there weren't any active
>> vcpus because, e.g., all were blocked; therefore, any active vcpu
>> wanting to run would have to wake up (and hence go through the proper
>> wake up logic) before trying to preempt idle.
>>
>> There is one possible caveat: tasklets. In fact (as you can check
>> yourself by looking, in the code, for tasklet_work_scheduled), it is
>> possible that we force the idle vcpu to execute, even when we have
>> active and runnable vcpus, to let it process tasklets. I'm not really
>> sure whether this could be a problem for you or not, can you have a
>> look (and/or, a try) and report back?
> 
> Thanks for your information about the tasklets part, it is very important,
> consider the following scenario:
> 
> non-idle vCPUA --> idle (tasklet) --> non-idle vCPUA
> 
> The above context switch will use the lazy context switch and cannot be
> covered in __context_switch(), we need to change SN's state during the
> "non-idle to idle" and "idle to non-idle" transition, so that means we need
> add the PI related logic in context_switch() instead of only in __context_switch().
> 
> Besides that, it is more robust to do the PI switch in context_switch() which
> can cover lazy context switch. Maybe in future, there are some other
> feature needing execute task _inside_ an idle context, and it may introduce
> some issues due to no PI state transition, and it is a little hard to debug.

So a transition like the above could happen in the case of a cpu that's
gone offline (e.g., to allow the devicemodel to handle an IO); or, as
you say, if we're doing urgent work in a tasklet such that it preempts a
running task.

One option would be to just ignore this -- in which case we would get
spurious PI interrupts, but no other major issues, right?

But if we want to avoid spurious PI interrupts when running idle, then
yes, we need *some* kind of a hook on the lazy context switch path.

/me does some more thinking...

 -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22  7:19       ` Wu, Feng
  2015-09-22  8:59         ` Jan Beulich
@ 2015-09-22 10:26         ` George Dunlap
  2015-09-23  6:35           ` Wu, Feng
  1 sibling, 1 reply; 86+ messages in thread
From: George Dunlap @ 2015-09-22 10:26 UTC (permalink / raw)
  To: Wu, Feng, Dario Faggioli, George Dunlap
  Cc: Andrew Cooper, Tian, Kevin, Keir Fraser, Jan Beulich, xen-devel

On 09/22/2015 08:19 AM, Wu, Feng wrote:
> 
> 
>> -----Original Message-----
>> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
>> Sent: Monday, September 21, 2015 10:25 PM
>> To: Wu, Feng; George Dunlap; George Dunlap
>> Cc: Jan Beulich; Tian, Kevin; Keir Fraser; Andrew Cooper;
>> xen-devel@lists.xen.org
>> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
>> handling
>>
>> On Mon, 2015-09-21 at 12:22 +0000, Wu, Feng wrote:
>>>
>>>> -----Original Message-----
>>>> From: George Dunlap [mailto:george.dunlap@citrix.com]
>>
>>>> You also need to check that local_events_need_delivery() will
>>>> return
>>>> "true" if you get an interrupt between that time and entering the
>>>> hypervisor.  Will that happen automatically from
>>>> hvm_local_events_need_delivery() -> hvm_vcpu_has_pending_irq() ->
>>>> vlapic_has_pending_irq()?  Or will you need to add a hook in
>>>> hvm_vcpu_has_pending_irq()?
>>>
>>> I think I don't need to add hook in hvm_vcpu_has_pending_irq(), what
>>> I need
>>> to do in vcpu_block() and do_poll() is as below:
>>>
>>> 1. set_bit(_VPF_blocked, &v->pause_flags);
>>>
>>> 2. ret = v->arch.arch_block(), in this hook, we can re-use the same
>>> logic in
>>> vmx_pre_ctx_switch_pi(), and check whether ON bit is set during
>>> updating
>>> posted-interrupt descriptor, can return 1 when ON is set
>>>
>> I think it would be ok for a hook to return a value (maybe, if doing
>> that, let's pick variable names and use comments to explain what goes
>> on as well as we can).
>>
>> I think I also see why you seem to prefer doing it that way, rather
>> than hacking local_events_need_delivery(), but can you please elaborate
>> on that? (My feeling is that you want to avoid having to update the
>> data structures in between _VPF_blocked and the check
>> local_events_need_delivery(), and then having to roll such update back
>> if local_events_need_delivery() ends up being false, is that the
>> case?).
> 
> In the arch_block() hook, we actually need to
> 	- Put vCPU to the blocking list
> 	- Set the NV to wakeup vector
> 	- Set NDST to the right pCPU
> 	- Clear SN

Nit: We shouldn't need to actually clear SN here; SN should already be
clear because the vcpu should be currently running.  (I don't think
there's a way for a vcpu to go from runnable->blocked, is there?)  And
if it's just been running, then NDST should also already be the correct
pcpu.

> While we are updating the posted-interrupt descriptor, the VT-d
> hardware can issue a notification event and hence set ON. If that is the
> case, it is straightforward to return directly; it doesn't make sense to
> continue with the above operations since they are no longer needed.

But checking to see if any interrupts have come in in the middle of your
code just adds unnecessary complication.  We need to have the code in
place to un-do all the blocking steps in the case that
local_events_need_delivery() returns true anyway.

Additionally, while local_events_need_delivery() is only called from
do_block and do_poll, hvm_local_events_need_delivery() is called from a
number of other places, as is hvm_vcpu_has_pending_irq().  All those
places presumably also need to know whether there's a PI pending to work
properly.

>> Code would look better, IMO, if we manage to fold that somehow inside
>> local_events_need_delivery(), but that:
> 
> As mentioned above, while updating the PI descriptor for blocking, we
> need to check whether ON is set, and return if it is set. This logic cannot
> be included in local_events_need_delivery(), since the update itself
> does not have much to do with local_events_need_delivery().
> 
> However, I do find some issues with my proposal above, see below:
> 
> 1. Set _VPF_blocked
> 2. ret = arch_block()
> 3. if ( ret || local_events_need_delivery() )
> 	clear_bit(_VPF_blocked, &v->pause_flags);
> 
> After step #2, if ret == false, that means we successfully changed the PI
> descriptor in arch_block(). If local_events_need_delivery() then returns true,
> _VPF_blocked is cleared. If an external interrupt comes in after that,
> pi_wakeup_interrupt() --> vcpu_unblock() runs, but since _VPF_blocked is
> already clear, vcpu_unblock() does nothing, so the vCPU's PI state is left incorrect.
> 
> One possible solution for it is:
> - Disable interrupts before the check in step #3 above
> - If local_events_need_delivery() returns true, undo all the operations
>   done in arch_block()
> - Re-enable interrupts after _VPF_blocked gets cleared.

Well, yes, if, as a result of checking to see that there are interrupts,
you unblock the vcpu, then you need to do *all* the things related to
unblocking, including switching the interrupt vector and removing the vcpu
from the blocked list.

If we're going to put hooks here in do_block(), then the best thing to
do is as much as possible to make PI interrupts act exactly the same as
other interrupts; i.e.,

* Do a full transition to blocked (set _VPF_blocked, add vcpu to PI
list, switch vector to wake)
* Check to see if there are any pending interrupts (either event
channels, virtual interrupts, or PIs)
* If so, do a full transition to unblocked (clear _VPF_blocked, switch
vector to PI, remove vcpu from list).

We should be able to order the operations such that if interrupts come
in the middle nothing bad happens, without needing to actually disable
interrupts.
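
To make that concrete, here is a rough sketch of the ordering I have in mind,
reusing the helper names from this thread (arch_pi_block()/arch_pi_unblock()
are purely illustrative names, not anything from the actual series):

void vcpu_block(void)
{
    struct vcpu *v = current;

    set_bit(_VPF_blocked, &v->pause_flags);

    /*
     * Full transition to "blocked": add the vCPU to the per-pCPU list,
     * NV = pi_wakeup_vector, NDST = the pCPU owning that list.
     */
    arch_pi_block(v);

    /*
     * Check for events /after/ blocking: avoids the wakeup waiting race.
     * local_events_need_delivery() is assumed here to also see a pending
     * posted interrupt (ON set in the descriptor), as discussed above.
     */
    if ( local_events_need_delivery() )
    {
        /* Full transition back: NV = posted_intr_vector, off the list. */
        arch_pi_unblock(v);
        clear_bit(_VPF_blocked, &v->pause_flags);
    }
    else
    {
        TRACE_2D(TRC_SCHED_BLOCK, v->domain->domain_id, v->vcpu_id);
        raise_softirq(SCHEDULE_SOFTIRQ);
    }
}

The intent is the same as in the existing vcpu_block(): each individual step
has to be safe against a wakeup interrupt landing in the middle, which is
what the rest of this sub-thread is about.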

OTOH -- I think if we grab a lock in interrupt context, we're not allowed
to grab it elsewhere with interrupts enabled, correct?  So we may end up
having to disable interrupts anyway.

 -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-22  7:19       ` Wu, Feng
@ 2015-09-22  8:59         ` Jan Beulich
  2015-09-22 13:40           ` Wu, Feng
  2015-09-22 10:26         ` George Dunlap
  1 sibling, 1 reply; 86+ messages in thread
From: Jan Beulich @ 2015-09-22  8:59 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, Keir Fraser, George Dunlap, Andrew Cooper,
	Dario Faggioli, George Dunlap, xen-devel

>>> On 22.09.15 at 09:19, <feng.wu@intel.com> wrote:
> However, I do find some issues with my proposal above, see below:
> 
> 1. Set _VPF_blocked
> 2. ret = arch_block()
> 3. if ( ret || local_events_need_delivery() )
> 	clear_bit(_VPF_blocked, &v->pause_flags);
> 
> After step #2, if ret == false, that means we successfully changed the PI
> descriptor in arch_block(). If local_events_need_delivery() then returns true,
> _VPF_blocked is cleared. If an external interrupt comes in after that,
> pi_wakeup_interrupt() --> vcpu_unblock() runs, but since _VPF_blocked is
> already clear, vcpu_unblock() does nothing, so the vCPU's PI state is left incorrect.
> 
> One possible solution for it is:
> - Disable interrupts before the check in step #3 above
> - If local_events_need_delivery() returns true, undo all the operations
>   done in arch_block()
> - Re-enable interrupts after _VPF_blocked gets cleared.

Couldn't this be taken care of by, if necessary, adjusting PI state
in vmx_do_resume()?

Jan

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-21 14:24     ` Dario Faggioli
@ 2015-09-22  7:19       ` Wu, Feng
  2015-09-22  8:59         ` Jan Beulich
  2015-09-22 10:26         ` George Dunlap
  0 siblings, 2 replies; 86+ messages in thread
From: Wu, Feng @ 2015-09-22  7:19 UTC (permalink / raw)
  To: Dario Faggioli, George Dunlap, George Dunlap
  Cc: Tian, Kevin, Wu, Feng, Andrew Cooper, xen-devel, Jan Beulich,
	Keir Fraser



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Monday, September 21, 2015 10:25 PM
> To: Wu, Feng; George Dunlap; George Dunlap
> Cc: Jan Beulich; Tian, Kevin; Keir Fraser; Andrew Cooper;
> xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On Mon, 2015-09-21 at 12:22 +0000, Wu, Feng wrote:
> >
> > > -----Original Message-----
> > > From: George Dunlap [mailto:george.dunlap@citrix.com]
> 
> > > You also need to check that local_events_need_delivery() will
> > > return
> > > "true" if you get an interrupt between that time and entering the
> > > hypervisor.  Will that happen automatically from
> > > hvm_local_events_need_delivery() -> hvm_vcpu_has_pending_irq() ->
> > > vlapic_has_pending_irq()?  Or will you need to add a hook in
> > > hvm_vcpu_has_pending_irq()?
> >
> > I think I don't need to add hook in hvm_vcpu_has_pending_irq(), what
> > I need
> > to do in vcpu_block() and do_poll() is as below:
> >
> > 1. set_bit(_VPF_blocked, &v->pause_flags);
> >
> > 2. ret = v->arch.arch_block(), in this hook, we can re-use the same
> > logic in
> > vmx_pre_ctx_switch_pi(), and check whether ON bit is set during
> > updating
> > posted-interrupt descriptor, can return 1 when ON is set
> >
> I think it would be ok for a hook to return a value (maybe, if doing
> that, let's pick variable names and use comments to explain what goes
> on as well as we can).
> 
> I think I also see why you seem to prefer doing it that way, rather
> than hacking local_events_need_delivery(), but can you please elaborate
> on that? (My feeling is that you want to avoid having to update the
> data structures in between _VPF_blocked and the check
> local_events_need_delivery(), and then having to roll such update back
> if local_events_need_delivery() ends up being false, is that the
> case?).

In the arch_block() hook, we actually need to
	- Put vCPU to the blocking list
	- Set the NV to wakeup vector
	- Set NDST to the right pCPU
	- Clear SN

While we are updating the posted-interrupt descriptor, the VT-d
hardware can issue a notification event and hence set ON. If that is the
case, it is straightforward to return directly; it doesn't make sense to
continue with the above operations since they are no longer needed.

> 
> Code would look better, IMO, if we manage to fold that somehow inside
> local_events_need_delivery(), but that:

As mentioned above, while updating the PI descriptor for blocking, we
need to check whether ON is set, and return if it is set. This logic cannot
be included in local_events_need_delivery(), since the update itself
does not have much to do with local_events_need_delivery().

However, I do find some issues with my proposal above, see below:

1. Set _VPF_blocked
2. ret = arch_block()
3. if ( ret || local_events_need_delivery() )
	clear_bit(_VPF_blocked, &v->pause_flags);

After step #2, if ret == false, that means we successfully changed the PI
descriptor in arch_block(). If local_events_need_delivery() then returns true,
_VPF_blocked is cleared. If an external interrupt comes in after that,
pi_wakeup_interrupt() --> vcpu_unblock() runs, but since _VPF_blocked is
already clear, vcpu_unblock() does nothing, so the vCPU's PI state is left incorrect.

One possible solution for it is:
- Disable interrupts before the check in step #3 above
- If local_events_need_delivery() returns true, undo all the operations
  done in arch_block()
- Re-enable interrupts after _VPF_blocked gets cleared.

It is a little annoying.
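
A minimal sketch of that sequence, following the steps above (arch_block()
and arch_unblock() are just the hypothetical hooks discussed in this thread,
not actual code from the series):

    struct vcpu *v = current;
    unsigned long flags;
    int ret;

    set_bit(_VPF_blocked, &v->pause_flags);

    /* Returns 1, without touching the descriptor, if ON was already set. */
    ret = v->arch.arch_block();

    local_irq_save(flags);

    if ( ret || local_events_need_delivery() )
    {
        if ( !ret )
            v->arch.arch_unblock();   /* undo the PI descriptor changes */
        clear_bit(_VPF_blocked, &v->pause_flags);
    }

    local_irq_restore(flags);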

Thanks,
Feng


>  1. is hard to tell without actually seeing how the code will end up
>     being
>  2. might be my opinion only, so let's see what others think.
> 
> Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-21 14:11         ` Dario Faggioli
@ 2015-09-22  5:10           ` Wu, Feng
  2015-09-22 10:43             ` George Dunlap
  0 siblings, 1 reply; 86+ messages in thread
From: Wu, Feng @ 2015-09-22  5:10 UTC (permalink / raw)
  To: Dario Faggioli, George Dunlap
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Monday, September 21, 2015 10:12 PM
> To: Wu, Feng; George Dunlap
> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
> Cooper; Jan Beulich
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On Mon, 2015-09-21 at 13:50 +0000, Wu, Feng wrote:
> >
> > > -----Original Message-----
> > > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> 
> > > Note that, in case of preemptions, we are switching from a non-idle
> > > vcpu to another, non-idle, vcpu, so lazy context switching to the
> > > idle
> > > vcpu should not have much to do with this case...
> >
> > So do you mean in preemptions, we cannot switch from non-idle to idle
> > or
> > Idle to non-idle, i.e, we can only switch from non-idle to non-idle?
> >
> That's what I meant. It's the definition of 'preemption' and of 'idle
> task/vcpu', AFAICT. I mean, the idle vcpu has the lowest possible
> priority ever, so it can't really preempt anyone.
> 
> OTOH, if the idle vcpu is running, that means there weren't any active
> vcpus because, e.g., all were blocked; therefore, any active vcpu
> wanting to run would have to wake up (and hence go through the proper
> wake up logic) before trying to preempt idle.
> 
> There is one possible caveat: tasklets. In fact (as you can check
> yourself by looking, in the code, for tasklet_work_scheduled), it is
> possible that we force the idle vcpu to execute, even when we have
> active and runnable vcpus, to let it process tasklets. I'm not really
> sure whether this could be a problem for you or not, can you have a
> look (and/or, a try) and report back?

Thanks for the information about tasklets; it is very important.
Consider the following scenario:

non-idle vCPUA --> idle (tasklet) --> non-idle vCPUA

The above transitions use lazy context switching and are not covered by
__context_switch(); we need to change SN's state during the
"non-idle to idle" and "idle to non-idle" transitions, which means we need
to add the PI-related logic in context_switch() instead of only in __context_switch().

Besides that, it is more robust to do the PI switch in context_switch(), which
also covers lazy context switches. In the future, some other feature may need
to execute work _inside_ an idle context, and the missing PI state transition
could introduce issues that are hard to debug.
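
To illustrate, the kind of hooks context_switch() itself would then have to
call, even on the lazy path, could look roughly like this (field and helper
names such as pi_desc in arch.hvm_vmx and pi_set_sn() -- the counterpart of
the pi_clear_sn() used elsewhere in this thread -- are assumptions, not the
actual patch):

static void vmx_pi_switch_from(struct vcpu *prev)
{
    /* Leaving the pCPU (preemption or switch to idle): suppress
     * notifications while this vCPU is not running. */
    if ( !is_idle_vcpu(prev) )
        pi_set_sn(&prev->arch.hvm_vmx.pi_desc);
}

static void vmx_pi_switch_to(struct vcpu *next)
{
    struct pi_desc *pi_desc = &next->arch.hvm_vmx.pi_desc;
    unsigned int dest = cpu_physical_id(smp_processor_id());

    if ( is_idle_vcpu(next) )
        return;

    /* Point NDST at the pCPU we are about to run on; the real code would
     * use a cmpxchg loop like the one quoted elsewhere in this thread. */
    pi_desc->ndst = x2apic_enabled ? dest
                                   : MASK_INSR(dest, PI_xAPIC_NDST_MASK);
    pi_clear_sn(pi_desc);
}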

Thanks,
Feng

> 
> Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-21 12:22   ` Wu, Feng
@ 2015-09-21 14:24     ` Dario Faggioli
  2015-09-22  7:19       ` Wu, Feng
  0 siblings, 1 reply; 86+ messages in thread
From: Dario Faggioli @ 2015-09-21 14:24 UTC (permalink / raw)
  To: Wu, Feng, George Dunlap, George Dunlap
  Cc: Andrew Cooper, Tian, Kevin, Keir Fraser, Jan Beulich, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 2062 bytes --]

On Mon, 2015-09-21 at 12:22 +0000, Wu, Feng wrote:
> 
> > -----Original Message-----
> > From: George Dunlap [mailto:george.dunlap@citrix.com]

> > You also need to check that local_events_need_delivery() will
> > return
> > "true" if you get an interrupt between that time and entering the
> > hypervisor.  Will that happen automatically from
> > hvm_local_events_need_delivery() -> hvm_vcpu_has_pending_irq() ->
> > vlapic_has_pending_irq()?  Or will you need to add a hook in
> > hvm_vcpu_has_pending_irq()?
> 
> I think I don't need to add a hook in hvm_vcpu_has_pending_irq(); what
> I need to do in vcpu_block() and do_poll() is as below:
> 
> 1. set_bit(_VPF_blocked, &v->pause_flags);
> 
> 2. ret = v->arch.arch_block(); in this hook, we can re-use the same
> logic as in vmx_pre_ctx_switch_pi(), check whether the ON bit is set
> while updating the posted-interrupt descriptor, and return 1 when ON
> is set
>
I think it would be ok for a hook to return a value (maybe, if doing
that, let's pick variable names and use comments to explain what goes
on as well as we can).

I think I also see why you seem to prefer doing it that way, rather
than hacking local_events_need_delivery(), but can you please elaborate
on that? (My feeling is that you want to avoid having to update the
data structures in between _VPF_blocked and the check
local_events_need_delivery(), and then having to roll such update back
if local_events_need_delivery() ends up being false, is that the
case?).

Code would look better, IMO, if we manage to fold that somehow inside
local_events_need_delivery(), but that:
 1. is hard to tell without actually seeing how the code will end up 
    being
 2. might be my opinion only, so let's see what others think.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-21 13:50       ` Wu, Feng
@ 2015-09-21 14:11         ` Dario Faggioli
  2015-09-22  5:10           ` Wu, Feng
  0 siblings, 1 reply; 86+ messages in thread
From: Dario Faggioli @ 2015-09-21 14:11 UTC (permalink / raw)
  To: Wu, Feng, George Dunlap
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich


[-- Attachment #1.1: Type: text/plain, Size: 1635 bytes --]

On Mon, 2015-09-21 at 13:50 +0000, Wu, Feng wrote:
> 
> > -----Original Message-----
> > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]

> > Note that, in case of preemptions, we are switching from a non-idle
> > vcpu to another, non-idle, vcpu, so lazy context switching to the
> > idle
> > vcpu should not have much to do with this case... 
> 
> So do you mean in preemptions, we cannot switch from non-idle to idle
> or
> Idle to non-idle, i.e, we can only switch from non-idle to non-idle?
> 
That's what I meant. It's the definition of 'preemption' and of 'idle
task/vcpu', AFAICT. I mean, the idle vcpu has the lowest possible
priority ever, so it can't really preempt anyone.

OTOH, if the idle vcpu is running, that means there weren't any active
vcpus because, e.g., all were blocked; therefore, any active vcpu
wanting to run would have to wake up (and hence go through the proper
wake up logic) before trying to preempt idle.

There is one possible caveat: tasklets. In fact (as you can check
yourself by looking, in the code, for tasklet_work_scheduled), it is
possible that we force the idle vcpu to execute, even when we have
active and runnable vcpus, to let it process tasklets. I'm not really
sure whether this could be a problem for you or not, can you have a
look (and/or, a try) and report back?

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-21 13:31     ` Dario Faggioli
@ 2015-09-21 13:50       ` Wu, Feng
  2015-09-21 14:11         ` Dario Faggioli
  0 siblings, 1 reply; 86+ messages in thread
From: Wu, Feng @ 2015-09-21 13:50 UTC (permalink / raw)
  To: Dario Faggioli, George Dunlap
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Monday, September 21, 2015 9:32 PM
> To: Wu, Feng; George Dunlap
> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
> Cooper; Jan Beulich
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On Mon, 2015-09-21 at 11:59 +0000, Wu, Feng wrote:
> >
> > > -----Original Message-----
> > > From: George Dunlap [mailto:george.dunlap@citrix.com]
> 
> > > > I think the handling for lazy context switch is not only for the
> > > > blocking case,
> > > > we still need to do something for lazy context switch even we
> > > > handled the
> > > > blocking case in vcpu_block(), such as,
> > > > 1. For non-idle -> idle
> > > > - set 'SN'
> > >
> > > If we set SN in vcpu_block(), then we don't need to set it on
> > > context
> > > switch -- 是不是?
> >
> > For the preemption case (not the blocking case), we still need to
> > clear/set SN, and this has nothing to do with
> > vcpu_block()/vcpu_wake(), right? Am I missing something here?
> > BTW, your Chinese is good! :)
> >
> Well, sure, preemptions are fine being dealt with during context
> switches.
> 
> AFAICR, George was suggesting investigating the possibility of doing
> that within the already existing arch specific part of the context
> switch itself. Have you checked whether that would be possible? If yes,
> it really would be great, IMO.

This is George's suggestion about this:

" And at that point, could we actually get rid of the PI-specific context
switch hooks altogether, and just put the SN state changes required for
running->runnable and runnable->running in vmx_ctxt_switch_from() and
vmx_ctxt_switch_to()?"

However, vmx_ctxt_switch_from() and vmx_ctxt_switch_to() are called from
__context_switch(), which still cannot cover the "non-idle -> idle" and
"idle -> non-idle" lazy transitions. And I think we need to change SN's state
during those two transitions.
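
Roughly, the shape of the path is (heavily simplified sketch, not the actual
Xen code):

void context_switch(struct vcpu *prev, struct vcpu *next)
{
    unsigned int cpu = smp_processor_id();

    /* ... runstate, timer and sd->curr updates ... */

    if ( is_idle_vcpu(next) || per_cpu(curr_vcpu, cpu) == next )
    {
        /*
         * Lazy switch: __context_switch() -- and with it
         * vmx_ctxt_switch_from()/vmx_ctxt_switch_to() -- is skipped,
         * so any SN/NDST update placed only in those hooks never runs.
         */
    }
    else
        __context_switch();
}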

> 
> Note that, in case of preemptions, we are switching from a non-idle
> vcpu to another, non-idle, vcpu, so lazy context switching to the idle
> vcpu should not have much to do with this case... 

So do you mean that in preemptions we cannot switch from non-idle to idle or
idle to non-idle, i.e., we can only switch from non-idle to non-idle?

Thanks,
Feng

> Was this something
> you were saying something/asking about above? (seems so, but I can't be
> sure, so I thought I better ask :-) ).



> 
> Regards,
> Dario
> 
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-21 11:59   ` Wu, Feng
@ 2015-09-21 13:31     ` Dario Faggioli
  2015-09-21 13:50       ` Wu, Feng
  0 siblings, 1 reply; 86+ messages in thread
From: Dario Faggioli @ 2015-09-21 13:31 UTC (permalink / raw)
  To: Wu, Feng, George Dunlap
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich


[-- Attachment #1.1: Type: text/plain, Size: 1685 bytes --]

On Mon, 2015-09-21 at 11:59 +0000, Wu, Feng wrote:
> 
> > -----Original Message-----
> > From: George Dunlap [mailto:george.dunlap@citrix.com]

> > > I think the handling for lazy context switch is not only for the
> > > blocking case,
> > > we still need to do something for lazy context switch even we
> > > handled the
> > > blocking case in vcpu_block(), such as,
> > > 1. For non-idle -> idle
> > > - set 'SN'
> > 
> > If we set SN in vcpu_block(), then we don't need to set it on
> > context
> > switch -- 是不是?
> 
> For the preemption case (not the blocking case), we still need to
> clear/set SN, and this has nothing to do with
> vcpu_block()/vcpu_wake(), right? Am I missing something here?
> BTW, your Chinese is good! :)
> 
Well, sure, preemptions are fine being dealt with during context
switches.

AFAICR, George was suggesting investigating the possibility of doing
that within the already existing arch specific part of the context
switch itself. Have you checked whether that would be possible? If yes,
it really would be great, IMO.

Note that, in case of preemptions, we are switching from a non-idle
vcpu to another, non-idle, vcpu, so lazy context switching to the idle
vcpu should not have much to do with this case... Was this something
you were saying something/asking about above? (seems so, but I can't be
sure, so I thought I better ask :-) ).

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-21  9:54 ` George Dunlap
@ 2015-09-21 12:22   ` Wu, Feng
  2015-09-21 14:24     ` Dario Faggioli
  0 siblings, 1 reply; 86+ messages in thread
From: Wu, Feng @ 2015-09-21 12:22 UTC (permalink / raw)
  To: George Dunlap, George Dunlap, Dario Faggioli
  Cc: Tian, Kevin, Wu, Feng, Andrew Cooper, xen-devel, Jan Beulich,
	Keir Fraser



> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: Monday, September 21, 2015 5:54 PM
> To: Wu, Feng; George Dunlap; Dario Faggioli
> Cc: Jan Beulich; Tian, Kevin; Keir Fraser; Andrew Cooper;
> xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On 09/21/2015 06:09 AM, Wu, Feng wrote:
> >
> >
> >> -----Original Message-----
> >> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of
> George
> >> Dunlap
> >> Sent: Friday, September 18, 2015 10:34 PM
> >> To: Dario Faggioli
> >> Cc: Jan Beulich; George Dunlap; Tian, Kevin; Keir Fraser; Andrew Cooper;
> >> xen-devel@lists.xen.org; Wu, Feng
> >> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core
> logic
> >> handling
> >>
> >> On Fri, Sep 18, 2015 at 3:31 PM, George Dunlap
> >> <George.Dunlap@eu.citrix.com> wrote:
> >>>> As said, me too. Perhaps we can go for option 1, which is simpler,
> >>>> cleaner and more consistent, considering the current status of the
> >>>> code. We can always investigate, in future, whether and how to
> >>>> implement the optimization for all the blockings, if beneficial and fea
> >>>> sible, or have them diverge, if deemed worthwhile.
> >>>
> >>> Sounds like a plan.
> >>
> >> Er, just in case that idiom wasn't clear: Option 1 sounds like a
> >> *good* plan, so unless Feng disagrees, let's go with that. :-)
> >
> > Sorry for the late response, I was on leave last Friday.
> >
> > Thanks for your discussions and suggestions. I have one question about option
> 1.
> > I find that there are two places where '_VPF_blocked' can get set:
> vcpu_block()
> > and do_poll(). After putting the logic in vcpu_block(), do we need to care
> about
> > do_poll(). I don't know the purpose of do_poll() and the usage case of it.
> > Dario/George, could you please share some knowledge about it? Thanks a lot!
> 
> Yes, you'll need to make the callback everywhere _VPF_blocked is set.
> 
> Normally you'd want to try to refactor both of those to share a common
> codepath, but it looks like there are specific reasons why they have to
> be different codepaths; so you'll just have to make the callback in both
> places (after setting VPF_blocked).
> 
> You also need to check that local_events_need_delivery() will return
> "true" if you get an interrupt between that time and entering the
> hypervisor.  Will that happen automatically from
> hvm_local_events_need_delivery() -> hvm_vcpu_has_pending_irq() ->
> vlapic_has_pending_irq()?  Or will you need to add a hook in
> hvm_vcpu_has_pending_irq()?

I think I don't need to add a hook in hvm_vcpu_has_pending_irq(); what I need
to do in vcpu_block() and do_poll() is as below:

1. set_bit(_VPF_blocked, &v->pause_flags);

2. ret = v->arch.arch_block(); in this hook, we can re-use the same logic as in
vmx_pre_ctx_switch_pi(), check whether the ON bit is set while updating the
posted-interrupt descriptor, and return 1 when ON is set, something like the
following:

        do {
            old.control = new.control = pi_desc->control;

            /* Should not block the vCPU if an interrupt was posted for it. */
            if ( pi_test_on(&old) )
            {
                spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock, flags);
                return 1;
            }

            /*
             * Change the 'NDST' field to v->arch.hvm_vmx.pi_block_cpu,
             * so when external interrupts from assigned devices happen,
             * the wakeup notification event will go to
             * v->arch.hvm_vmx.pi_block_cpu, then in pi_wakeup_interrupt()
             * we can find the vCPU in the right list to wake up.
             */
            dest = cpu_physical_id(v->arch.hvm_vmx.pi_block_cpu);

            if ( x2apic_enabled )
                new.ndst = dest;
            else
                new.ndst = MASK_INSR(dest, PI_xAPIC_NDST_MASK);

            pi_clear_sn(&new);
            new.nv = pi_wakeup_vector;
        } while ( cmpxchg(&pi_desc->control, old.control, new.control) !=
                  old.control );

3. After returning from arch_block() we can simply check:
	if ( ret || local_events_need_delivery() )
		Don't block the vCPU;
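
For completeness, the wakeup side that the NDST/NV change above is aiming at
would look roughly like this (a sketch only: the per-pCPU list, its lock and
the pi_blocked_vcpu_list member are illustrative names for the blocking list
discussed in this thread, not the actual patch):

static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
{
    struct arch_vmx_struct *vmx, *tmp;
    unsigned int cpu = smp_processor_id();

    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));

    /* Wake every vCPU parked on this pCPU whose descriptor has ON set. */
    list_for_each_entry_safe ( vmx, tmp,
                               &per_cpu(pi_blocked_vcpu, cpu),
                               pi_blocked_vcpu_list )
    {
        if ( pi_test_on(&vmx->pi_desc) )
        {
            list_del_init(&vmx->pi_blocked_vcpu_list);
            vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
        }
    }

    spin_unlock(&per_cpu(pi_blocked_vcpu_lock, cpu));

    ack_APIC_irq();
}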

Thanks,
Feng


> 
>  -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-21  9:18 ` George Dunlap
@ 2015-09-21 11:59   ` Wu, Feng
  2015-09-21 13:31     ` Dario Faggioli
  0 siblings, 1 reply; 86+ messages in thread
From: Wu, Feng @ 2015-09-21 11:59 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Wu, Feng



> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: Monday, September 21, 2015 5:19 PM
> To: Wu, Feng; Dario Faggioli
> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
> Cooper; Jan Beulich
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On 09/21/2015 06:08 AM, Wu, Feng wrote:
> >
> >
> >> -----Original Message-----
> >> From: George Dunlap [mailto:george.dunlap@citrix.com]
> >> Sent: Thursday, September 17, 2015 5:38 PM
> >> To: Dario Faggioli; Wu, Feng
> >> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap;
> Andrew
> >> Cooper; Jan Beulich
> >> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core
> logic
> >> handling
> >>
> >> On 09/17/2015 09:48 AM, Dario Faggioli wrote:
> >>> On Thu, 2015-09-17 at 08:00 +0000, Wu, Feng wrote:
> >>>
> >>>>> -----Original Message-----
> >>>>> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> >>>
> >>>>> So, I guess, first of all, can you confirm whether or not it's exploding
> >>>>> in debug builds?
> >>>>
> >>>> Does the following information in Config.mk mean it is a debug build?
> >>>>
> >>>> # A debug build of Xen and tools?
> >>>> debug ?= y
> >>>> debug_symbols ?= $(debug)
> >>>>
> >>> I think so. But as I said in my other email, I was wrong, and this is
> >>> probably not an issue.
> >>>
> >>>>> And in either case (just tossing out ideas) would it be
> >>>>> possible to deal with the "interrupt already raised when blocking" case:
> >>>>
> >>>> Thanks for the suggestions below!
> >>>>
> >>> :-)
> >>>
> >>>>>  - later in the context switching function ?
> >>>> In this case, we might need to set a flag in vmx_pre_ctx_switch_pi()
> instead
> >>>> of calling vcpu_unblock() directly, then when it returns to
> context_switch(),
> >>>> we can check the flag and don't really block the vCPU.
> >>>>
> >>> Yeah, and that would still be rather hard to understand and maintain,
> >>> IMO.
> >>>
> >>>> But I don't have a clear
> >>>> picture about how to achieve this, here are some questions from me:
> >>>> - When we are in context_switch(), we have done the following changes to
> >>>> vcpu's state:
> >>>> 	* sd->curr is set to next
> >>>> 	* vCPU's running state (both prev and next ) is changed by
> >>>> 	  vcpu_runstate_change()
> >>>> 	* next->is_running is set to 1
> >>>> 	* periodic timer for prev is stopped
> >>>> 	* periodic timer for next is setup
> >>>> 	......
> >>>>
> >>>> So what point should we perform the action to _unblock_ the vCPU? We
> >>>> Need to roll back the formal changes to the vCPU's state, right?
> >>>>
> >>> Mmm... not really. Not blocking prev does not mean that prev would be
> >>> kept running on the pCPU, and that's true for your current solution as
> >>> well! As you say yourself, you're already in the process of switching
> >>> between prev and next, at a point where it's already a thing that next
> >>> will be the vCPU that will run. Not blocking means that prev is
> >>> reinserted to the runqueue, and a new invocation to the scheduler is
> >>> (potentially) queued as well (via raising SCHEDULE_SOFTIRQ, in
> >>> __runq_tickle()), but it's only when such new scheduling happens that
> >>> prev will (potentially) be selected to run again.
> >>>
> >>> So, no, unless I'm fully missing your point, there wouldn't be no
> >>> rollback required. However, I still would like the other solution (doing
> >>> stuff in vcpu_block()) better (see below).
> >>>
> >>>>>  - with another hook, perhaps in vcpu_block() ?
> >>>>
> >>>> We could check this in vcpu_block(), however, the logic here is that before
> >>>> vCPU is blocked, we need to change the posted-interrupt descriptor,
> >>>> and during changing it, if 'ON' bit is set, which means VT-d hardware
> >>>> issues a notification event because interrupts from the assigned devices
> >>>> is coming, we don't need to block the vCPU and hence no need to update
> >>>> the PI descriptor in this case.
> >>>>
> >>> Yep, I saw that. But could it be possible to do *everything* related to
> >>> blocking, including the update of the descriptor, in vcpu_block(), if no
> >>> interrupt have been raised yet at that time? I mean, would you, if
> >>> updating the descriptor in there, still get the event that allows you to
> >>> call vcpu_wake(), and hence vmx_vcpu_wake_prepare(), which would undo
> >>> the blocking, no matter whether that resulted in an actual context
> >>> switch already or not?
> >>>
> >>> I appreciate that this narrows the window for such an event to happen by
> >>> quite a bit, making the logic itself a little less useful (it makes
> >>> things more similar to "regular" blocking vs. event delivery, though,
> >>> AFAICT), but if it's correct, ad if it allows us to save the ugly
> >>> invocation of vcpu_unblock from context switch context, I'd give it a
> >>> try.
> >>>
> >>> After all, this PI thing requires actions to be taken when a vCPU is
> >>> scheduled or descheduled because of blocking, unblocking and
> >>> preemptions, and it would seem natural to me to:
> >>>  - deal with blockings in vcpu_block()
> >>>  - deal with unblockings in vcpu_wake()
> >>>  - deal with preemptions in context_switch()
> >>>
> >>> This does not mean being able to consolidate some of the cases (like
> >>> blockings and preemptions, in the current version of the code) were not
> >>> a nice thing... But we don't want it at all costs . :-)
> >>
> >> So just to clarify the situation...
> >>
> >> If a vcpu configured for the "running" state (i.e., NV set to
> >> "posted_intr_vector", notifications enabled), and an interrupt happens
> >> in the hypervisor -- what happens?
> >>
> >> Is it the case that the interrupt is not actually delivered to the
> >> processor, but that the pending bit will be set in the pi field, so that
> >> the interrupt will be delivered the next time the hypervisor returns
> >> into the guest?
> >>
> >> (I am assuming that is the case, because if the hypervisor *does* get an
> >> interrupt, then it can just unblock it there.)
> >>
> >> This sort of race condition -- where we get an interrupt to wake up a
> >> vcpu as we're blocking -- is already handled for "old-style" interrupts
> >> in vcpu_block:
> >>
> >> void vcpu_block(void)
> >> {
> >>     struct vcpu *v = current;
> >>
> >>     set_bit(_VPF_blocked, &v->pause_flags);
> >>
> >>     /* Check for events /after/ blocking: avoids wakeup waiting race. */
> >>     if ( local_events_need_delivery() )
> >>     {
> >>         clear_bit(_VPF_blocked, &v->pause_flags);
> >>     }
> >>     else
> >>     {
> >>         TRACE_2D(TRC_SCHED_BLOCK, v->domain->domain_id,
> v->vcpu_id);
> >>         raise_softirq(SCHEDULE_SOFTIRQ);
> >>     }
> >> }
> >>
> >> That is, we set _VPF_blocked, so that any interrupt which would wake it
> >> up actually wakes it up, and then we check local_events_need_delivery()
> >> to see if there were any that came in after we decided to block but
> >> before we made sure that an interrupt would wake us up.
> >>
> >> I think it would be best if we could keep all the logic that does the
> >> same thing in the same place.  Which would mean in vcpu_block(), after
> >> calling set_bit(_VPF_blocked), changing the NV to pi_wakeup_vector, and
> >> then extending local_events_need_delivery() to also look for pending PI
> >> events.
> >>
> >> Looking a bit more at your states, I think the actions that need to be
> >> taken on all the transitions are these (collapsing 'runnable' and
> >> 'offline' into the same state):
> >>
> >> blocked -> runnable (vcpu_wake)
> >>  - NV = posted_intr_vector
> >>  - Take vcpu off blocked list
> >>  - SN = 1
> >> runnable -> running (context_switch)
> >>  - SN = 0
> >
> > Need set the 'NDST' field to the right dest vCPU as well.
> >
> >> running -> runnable (context_switch)
> >>  - SN = 1
> >> running -> blocked (vcpu_block)
> >>  - NV = pi_wakeup_vector
> >>  - Add vcpu to blocked list
> >
> > Need set the 'NDST' field to the pCPU which owns the blocking list,
> > So we can wake up the vCPU from the right blocking list in the wakeup
> > event handler.
> >
> >>
> >> This actually has a few pretty nice properties:
> >> 1. You have a nice pair of complementary actions -- block / wake, run /
> >> preempt
> >> 2. The potentially long actions with lists happen in vcpu_wake and
> >> vcpu_block, not on the context switch path
> >>
> >> And at this point, you don't have the "lazy context switch" issue
> >> anymore, do we?  Because we've handled the "blocking" case in
> >> vcpu_block(), we don't need to do anything in the main context_switch()
> >> (which may do the lazy context switching into idle).  We only need to do
> >> something in the actual __context_switch().
> >
> > I think the handling for lazy context switch is not only for the blocking case,
> > we still need to do something for lazy context switch even if we handled the
> > blocking case in vcpu_block(), such as,
> > 1. For non-idle -> idle
> > - set 'SN'
> 
> If we set SN in vcpu_block(), then we don't need to set it on context
> switch -- 是不是?

For the preemption case (not the blocking case), we still need to clear/set SN,
and this has nothing to do with vcpu_block()/vcpu_wake(), right? Am I missing
something here? BTW, your Chinese is good! :)

> 
> > 2. For idle -> non-idle
> > - clear 'SN'
> > - set the 'NDST' field to the right cpu the vCPU is going to run on. (Maybe
> > this one doesn't belong to lazy context switch, if the cpu of the non-idle
> > vCPU was changed, then per_cpu(curr_vcpu, cpu) != next in context_switch(),
> > hence it will go to __context_switch() directly, right?)
> 
> If we clear SN* in vcpu_wake(), then we don't need to clear it on a
> context switch. 

The same as above.

> And the only way we can transition from "idle lazy
> context switch" to "runnable" is if the vcpu was the last vcpu running
> on this pcpu -- in which case, NDST should already be set to this pcpu,
> right?

Yes, like I mentioned above, in this case the NDST field is not changed, so
we don't need to update it for the lazy idle vcpu to "runnable" transition.
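
Putting the transition table and the NDST notes together, the wake side
(blocked -> runnable) could then look roughly like this (illustrative sketch
only; pi_lock and pi_block_cpu are the fields mentioned earlier in this
thread, while the list member name, the NR_CPUS sentinel and the locking are
simplified assumptions):

static void vmx_pi_unblock(struct vcpu *v)
{
    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
    unsigned long flags;

    spin_lock_irqsave(&v->arch.hvm_vmx.pi_lock, flags);

    /* Take the vCPU off the per-pCPU blocking list it was parked on;
     * locking is simplified here, the real code must also serialise
     * against pi_wakeup_interrupt(). */
    if ( v->arch.hvm_vmx.pi_block_cpu != NR_CPUS )
    {
        list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
        v->arch.hvm_vmx.pi_block_cpu = NR_CPUS;
    }

    /*
     * Notifications go back to the normal posted-interrupt vector.  NDST
     * is left alone: it is refreshed when the vCPU is context switched in,
     * and SN handling stays on the context switch path as discussed.
     */
    pi_desc->nv = posted_intr_vector;

    spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock, flags);
}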

Thanks,
Feng

> 
>  -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-21  5:09 Wu, Feng
@ 2015-09-21  9:54 ` George Dunlap
  2015-09-21 12:22   ` Wu, Feng
  0 siblings, 1 reply; 86+ messages in thread
From: George Dunlap @ 2015-09-21  9:54 UTC (permalink / raw)
  To: Wu, Feng, George Dunlap, Dario Faggioli
  Cc: Andrew Cooper, Tian, Kevin, Keir Fraser, Jan Beulich, xen-devel

On 09/21/2015 06:09 AM, Wu, Feng wrote:
> 
> 
>> -----Original Message-----
>> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of George
>> Dunlap
>> Sent: Friday, September 18, 2015 10:34 PM
>> To: Dario Faggioli
>> Cc: Jan Beulich; George Dunlap; Tian, Kevin; Keir Fraser; Andrew Cooper;
>> xen-devel@lists.xen.org; Wu, Feng
>> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
>> handling
>>
>> On Fri, Sep 18, 2015 at 3:31 PM, George Dunlap
>> <George.Dunlap@eu.citrix.com> wrote:
>>>> As said, me too. Perhaps we can go for option 1, which is simpler,
>>>> cleaner and more consistent, considering the current status of the
>>>> code. We can always investigate, in future, whether and how to
>>>> implement the optimization for all the blockings, if beneficial and fea
>>>> sible, or have them diverge, if deemed worthwhile.
>>>
>>> Sounds like a plan.
>>
>> Er, just in case that idiom wasn't clear: Option 1 sounds like a
>> *good* plan, so unless Feng disagrees, let's go with that. :-)
> 
> Sorry for the late response, I was on leave last Friday.
> 
> Thanks for your discussions and suggestions. I have one question about option 1.
> I find that there are two places where '_VPF_blocked' can get set: vcpu_block()
> and do_poll(). After putting the logic in vcpu_block(), do we need to care about
> do_poll()? I don't know the purpose of do_poll() or its use case.
> Dario/George, could you please share some knowledge about it? Thanks a lot!

Yes, you'll need to make the callback everywhere _VPF_blocked is set.

Normally you'd want to try to refactor both of those to share a common
codepath, but it looks like there are specific reasons why they have to
be different codepaths; so you'll just have to make the callback in both
places (after setting VPF_blocked).

You also need to check that local_events_need_delivery() will return
"true" if you get an interrupt between that time and entering the
hypervisor.  Will that happen automatically from
hvm_local_events_need_delivery() -> hvm_vcpu_has_pending_irq() ->
vlapic_has_pending_irq()?  Or will you need to add a hook in
hvm_vcpu_has_pending_irq()?
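
Just to sketch the kind of check that would have to become reachable from
that path (pi_test_on() is the helper already used in this thread; the
function name, and where exactly it gets wired in -- vlapic_has_pending_irq(),
a new hook in hvm_vcpu_has_pending_irq(), or the arch_block() return value --
is precisely the open question):

static bool_t vmx_pi_has_pending_interrupt(const struct vcpu *v)
{
    /* ON set => the device already posted an interrupt that has not yet
     * been injected into the guest. */
    return pi_test_on(&v->arch.hvm_vmx.pi_desc);
}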

 -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
  2015-09-21  5:08 [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling Wu, Feng
@ 2015-09-21  9:18 ` George Dunlap
  2015-09-21 11:59   ` Wu, Feng
  0 siblings, 1 reply; 86+ messages in thread
From: George Dunlap @ 2015-09-21  9:18 UTC (permalink / raw)
  To: Wu, Feng, Dario Faggioli
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich

On 09/21/2015 06:08 AM, Wu, Feng wrote:
> 
> 
>> -----Original Message-----
>> From: George Dunlap [mailto:george.dunlap@citrix.com]
>> Sent: Thursday, September 17, 2015 5:38 PM
>> To: Dario Faggioli; Wu, Feng
>> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
>> Cooper; Jan Beulich
>> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
>> handling
>>
>> On 09/17/2015 09:48 AM, Dario Faggioli wrote:
>>> On Thu, 2015-09-17 at 08:00 +0000, Wu, Feng wrote:
>>>
>>>>> -----Original Message-----
>>>>> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
>>>
>>>>> So, I guess, first of all, can you confirm whether or not it's exploding
>>>>> in debug builds?
>>>>
>>>> Does the following information in Config.mk mean it is a debug build?
>>>>
>>>> # A debug build of Xen and tools?
>>>> debug ?= y
>>>> debug_symbols ?= $(debug)
>>>>
>>> I think so. But as I said in my other email, I was wrong, and this is
>>> probably not an issue.
>>>
>>>>> And in either case (just tossing out ideas) would it be
>>>>> possible to deal with the "interrupt already raised when blocking" case:
>>>>
>>>> Thanks for the suggestions below!
>>>>
>>> :-)
>>>
>>>>>  - later in the context switching function ?
>>>> In this case, we might need to set a flag in vmx_pre_ctx_switch_pi() instead
>>>> of calling vcpu_unblock() directly, then when it returns to context_switch(),
>>>> we can check the flag and don't really block the vCPU.
>>>>
>>> Yeah, and that would still be rather hard to understand and maintain,
>>> IMO.
>>>
>>>> But I don't have a clear
>>>> picture about how to achieve this, here are some questions from me:
>>>> - When we are in context_switch(), we have done the following changes to
>>>> vcpu's state:
>>>> 	* sd->curr is set to next
>>>> 	* vCPU's running state (both prev and next ) is changed by
>>>> 	  vcpu_runstate_change()
>>>> 	* next->is_running is set to 1
>>>> 	* periodic timer for prev is stopped
>>>> 	* periodic timer for next is setup
>>>> 	......
>>>>
>>>> So what point should we perform the action to _unblock_ the vCPU? We
>>>> Need to roll back the formal changes to the vCPU's state, right?
>>>>
>>> Mmm... not really. Not blocking prev does not mean that prev would be
>>> kept running on the pCPU, and that's true for your current solution as
>>> well! As you say yourself, you're already in the process of switching
>>> between prev and next, at a point where it's already a thing that next
>>> will be the vCPU that will run. Not blocking means that prev is
>>> reinserted to the runqueue, and a new invocation to the scheduler is
>>> (potentially) queued as well (via raising SCHEDULE_SOFTIRQ, in
>>> __runq_tickle()), but it's only when such new scheduling happens that
>>> prev will (potentially) be selected to run again.
>>>
>>> So, no, unless I'm fully missing your point, there wouldn't be no
>>> rollback required. However, I still would like the other solution (doing
>>> stuff in vcpu_block()) better (see below).
>>>
>>>>>  - with another hook, perhaps in vcpu_block() ?
>>>>
>>>> We could check this in vcpu_block(), however, the logic here is that before
>>>> vCPU is blocked, we need to change the posted-interrupt descriptor,
>>>> and during changing it, if 'ON' bit is set, which means VT-d hardware
>>>> issues a notification event because interrupts from the assigned devices
>>>> is coming, we don't need to block the vCPU and hence no need to update
>>>> the PI descriptor in this case.
>>>>
>>> Yep, I saw that. But could it be possible to do *everything* related to
>>> blocking, including the update of the descriptor, in vcpu_block(), if no
>>> interrupt have been raised yet at that time? I mean, would you, if
>>> updating the descriptor in there, still get the event that allows you to
>>> call vcpu_wake(), and hence vmx_vcpu_wake_prepare(), which would undo
>>> the blocking, no matter whether that resulted in an actual context
>>> switch already or not?
>>>
>>> I appreciate that this narrows the window for such an event to happen by
>>> quite a bit, making the logic itself a little less useful (it makes
>>> things more similar to "regular" blocking vs. event delivery, though,
>>> AFAICT), but if it's correct, ad if it allows us to save the ugly
>>> invocation of vcpu_unblock from context switch context, I'd give it a
>>> try.
>>>
>>> After all, this PI thing requires actions to be taken when a vCPU is
>>> scheduled or descheduled because of blocking, unblocking and
>>> preemptions, and it would seem natural to me to:
>>>  - deal with blockings in vcpu_block()
>>>  - deal with unblockings in vcpu_wake()
>>>  - deal with preemptions in context_switch()
>>>
>>> This does not mean being able to consolidate some of the cases (like
>>> blockings and preemptions, in the current version of the code) were not
>>> a nice thing... But we don't want it at all costs . :-)
>>
>> So just to clarify the situation...
>>
>> If a vcpu configured for the "running" state (i.e., NV set to
>> "posted_intr_vector", notifications enabled), and an interrupt happens
>> in the hypervisor -- what happens?
>>
>> Is it the case that the interrupt is not actually delivered to the
>> processor, but that the pending bit will be set in the pi field, so that
>> the interrupt will be delivered the next time the hypervisor returns
>> into the guest?
>>
>> (I am assuming that is the case, because if the hypervisor *does* get an
>> interrupt, then it can just unblock it there.)
>>
>> This sort of race condition -- where we get an interrupt to wake up a
>> vcpu as we're blocking -- is already handled for "old-style" interrupts
>> in vcpu_block:
>>
>> void vcpu_block(void)
>> {
>>     struct vcpu *v = current;
>>
>>     set_bit(_VPF_blocked, &v->pause_flags);
>>
>>     /* Check for events /after/ blocking: avoids wakeup waiting race. */
>>     if ( local_events_need_delivery() )
>>     {
>>         clear_bit(_VPF_blocked, &v->pause_flags);
>>     }
>>     else
>>     {
>>         TRACE_2D(TRC_SCHED_BLOCK, v->domain->domain_id, v->vcpu_id);
>>         raise_softirq(SCHEDULE_SOFTIRQ);
>>     }
>> }
>>
>> That is, we set _VPF_blocked, so that any interrupt which would wake it
>> up actually wakes it up, and then we check local_events_need_delivery()
>> to see if there were any that came in after we decided to block but
>> before we made sure that an interrupt would wake us up.
>>
>> I think it would be best if we could keep all the logic that does the
>> same thing in the same place.  Which would mean in vcpu_block(), after
>> calling set_bit(_VPF_blocked), changing the NV to pi_wakeup_vector, and
>> then extending local_events_need_delivery() to also look for pending PI
>> events.
>>
>> Looking a bit more at your states, I think the actions that need to be
>> taken on all the transitions are these (collapsing 'runnable' and
>> 'offline' into the same state):
>>
>> blocked -> runnable (vcpu_wake)
>>  - NV = posted_intr_vector
>>  - Take vcpu off blocked list
>>  - SN = 1
>> runnable -> running (context_switch)
>>  - SN = 0
> 
> Need set the 'NDST' field to the right dest vCPU as well.
> 
>> running -> runnable (context_switch)
>>  - SN = 1
>> running -> blocked (vcpu_block)
>>  - NV = pi_wakeup_vector
>>  - Add vcpu to blocked list
> 
> Need set the 'NDST' field to the pCPU which owns the blocking list,
> So we can wake up the vCPU from the right blocking list in the wakeup
> event handler.
> 
>>
>> This actually has a few pretty nice properties:
>> 1. You have a nice pair of complementary actions -- block / wake, run /
>> preempt
>> 2. The potentially long actions with lists happen in vcpu_wake and
>> vcpu_block, not on the context switch path
>>
>> And at this point, you don't have the "lazy context switch" issue
>> anymore, do we?  Because we've handled the "blocking" case in
>> vcpu_block(), we don't need to do anything in the main context_switch()
>> (which may do the lazy context switching into idle).  We only need to do
>> something in the actual __context_switch().
> 
> I think the handling for lazy context switch is not only for the blocking case,
> we still need to do something for lazy context switch even if we handled the
> blocking case in vcpu_block(), such as,
> 1. For non-idle -> idle
> - set 'SN'

If we set SN in vcpu_block(), then we don't need to set it on context
switch -- 是不是?

> 2. For idle -> non-idle
> - clear 'SN'
> - set the 'NDST' field to the right cpu the vCPU is going to run on. (Maybe
> this one doesn't belong to lazy context switch, if the cpu of the non-idle
> vCPU was changed, then per_cpu(curr_vcpu, cpu) != next in context_switch(),
> hence it will go to __context_switch() directly, right?)

If we clear SN* in vcpu_wake(), then we don't need to clear it on a
context switch.  And the only way we can transition from "idle lazy
context switch" to "runnable" is if the vcpu was the last vcpu running
on this pcpu -- in which case, NDST should already be set to this pcpu,
right?

 -George


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
@ 2015-09-21  5:09 Wu, Feng
  2015-09-21  9:54 ` George Dunlap
  0 siblings, 1 reply; 86+ messages in thread
From: Wu, Feng @ 2015-09-21  5:09 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli
  Cc: Tian, Kevin, Keir Fraser, Andrew Cooper, George Dunlap,
	xen-devel, Jan Beulich, Wu, Feng



> -----Original Message-----
> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of George
> Dunlap
> Sent: Friday, September 18, 2015 10:34 PM
> To: Dario Faggioli
> Cc: Jan Beulich; George Dunlap; Tian, Kevin; Keir Fraser; Andrew Cooper;
> xen-devel@lists.xen.org; Wu, Feng
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On Fri, Sep 18, 2015 at 3:31 PM, George Dunlap
> <George.Dunlap@eu.citrix.com> wrote:
> >> As said, me too. Perhaps we can go for option 1, which is simpler,
> >> cleaner and more consistent, considering the current status of the
> >> code. We can always investigate, in future, whether and how to
> >> implement the optimization for all the blockings, if beneficial and fea
> >> sible, or have them diverge, if deemed worthwhile.
> >
> > Sounds like a plan.
> 
> Er, just in case that idiom wasn't clear: Option 1 sounds like a
> *good* plan, so unless Feng disagrees, let's go with that. :-)

Sorry for the late response, I was on leave last Friday.

Thanks for your discussions and suggestions. I have one question about option 1.
I find that there are two places where '_VPF_blocked' can get set: vcpu_block()
and do_poll(). After putting the logic in vcpu_block(), do we need to care about
do_poll(). I don't know the purpose of do_poll() and the usage case of it.
Dario/George, could you please share some knowledge about it? Thanks a lot!

Thanks,
Feng


> 
>  -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling
@ 2015-09-21  5:08 Wu, Feng
  2015-09-21  9:18 ` George Dunlap
  0 siblings, 1 reply; 86+ messages in thread
From: Wu, Feng @ 2015-09-21  5:08 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel, Jan Beulich, Wu, Feng



> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@citrix.com]
> Sent: Thursday, September 17, 2015 5:38 PM
> To: Dario Faggioli; Wu, Feng
> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
> Cooper; Jan Beulich
> Subject: Re: [Xen-devel] [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic
> handling
> 
> On 09/17/2015 09:48 AM, Dario Faggioli wrote:
> > On Thu, 2015-09-17 at 08:00 +0000, Wu, Feng wrote:
> >
> >>> -----Original Message-----
> >>> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> >
> >>> So, I guess, first of all, can you confirm whether or not it's exploding
> >>> in debug builds?
> >>
> >> Does the following information in Config.mk mean it is a debug build?
> >>
> >> # A debug build of Xen and tools?
> >> debug ?= y
> >> debug_symbols ?= $(debug)
> >>
> > I think so. But as I said in my other email, I was wrong, and this is
> > probably not an issue.
> >
> >>> And in either case (just tossing out ideas) would it be
> >>> possible to deal with the "interrupt already raised when blocking" case:
> >>
> >> Thanks for the suggestions below!
> >>
> > :-)
> >
> >>>  - later in the context switching function ?
> >> In this case, we might need to set a flag in vmx_pre_ctx_switch_pi() instead
> >> of calling vcpu_unblock() directly, then when it returns to context_switch(),
> >> we can check the flag and not actually block the vCPU.
> >>
> > Yeah, and that would still be rather hard to understand and maintain,
> > IMO.
> >
> >> But I don't have a clear
> >> picture about how to achieve this; here are some questions from me:
> >> - When we are in context_switch(), we have done the following changes to
> >> vcpu's state:
> >> 	* sd->curr is set to next
> >> 	* vCPU's running state (both prev and next ) is changed by
> >> 	  vcpu_runstate_change()
> >> 	* next->is_running is set to 1
> >> 	* periodic timer for prev is stopped
> >> 	* periodic timer for next is setup
> >> 	......
> >>
> >> So at what point should we perform the action to _unblock_ the vCPU? We
> >> need to roll back the above changes to the vCPU's state, right?
> >>
> > Mmm... not really. Not blocking prev does not mean that prev would be
> > kept running on the pCPU, and that's true for your current solution as
> > well! As you say yourself, you're already in the process of switching
> > between prev and next, at a point where it's already a thing that next
> > will be the vCPU that will run. Not blocking means that prev is
> > reinserted to the runqueue, and a new invocation to the scheduler is
> > (potentially) queued as well (via raising SCHEDULE_SOFTIRQ, in
> > __runq_tickle()), but it's only when such new scheduling happens that
> > prev will (potentially) be selected to run again.
> >
> > So, no, unless I'm fully missing your point, there wouldn't be any
> > rollback required. However, I still would like the other solution (doing
> > stuff in vcpu_block()) better (see below).
> >
> >>>  - with another hook, perhaps in vcpu_block()?
> >>
> >> We could check this in vcpu_block(); however, the logic here is that before
> >> the vCPU is blocked, we need to change the posted-interrupt descriptor,
> >> and while changing it, if the 'ON' bit is set (which means the VT-d hardware
> >> has issued a notification event because an interrupt from an assigned device
> >> is coming), we don't need to block the vCPU and hence there is no need to
> >> update the PI descriptor in this case.
> >>
> > Yep, I saw that. But could it be possible to do *everything* related to
> > blocking, including the update of the descriptor, in vcpu_block(), if no
> > interrupt has been raised yet at that time? I mean, would you, if
> > updating the descriptor in there, still get the event that allows you to
> > call vcpu_wake(), and hence vmx_vcpu_wake_prepare(), which would undo
> > the blocking, no matter whether that resulted in an actual context
> > switch already or not?
> >
> > I appreciate that this narrows the window for such an event to happen by
> > quite a bit, making the logic itself a little less useful (it makes
> > things more similar to "regular" blocking vs. event delivery, though,
> > AFAICT), but if it's correct, and if it allows us to avoid the ugly
> > invocation of vcpu_unblock from context switch context, I'd give it a
> > try.
> >
> > After all, this PI thing requires actions to be taken when a vCPU is
> > scheduled or descheduled because of blocking, unblocking and
> > preemptions, and it would seem natural to me to:
> >  - deal with blockings in vcpu_block()
> >  - deal with unblockings in vcpu_wake()
> >  - deal with preemptions in context_switch()
> >
> > This does not mean that being able to consolidate some of the cases (like
> > blockings and preemptions, in the current version of the code) wouldn't be
> > a nice thing... But we don't want it at all costs. :-)
> 
> So just to clarify the situation...
> 
> If a vcpu is configured for the "running" state (i.e., NV set to
> "posted_intr_vector", notifications enabled), and an interrupt for it
> arrives while the CPU is executing in the hypervisor -- what happens?
> 
> Is it the case that the interrupt is not actually delivered to the
> processor, but that the pending bit will be set in the PI descriptor, so that
> the interrupt will be delivered the next time the hypervisor returns
> into the guest?
> 
> (I am assuming that is the case, because if the hypervisor *does* get an
> interrupt, then it can just unblock the vcpu there.)
> 
> This sort of race condition -- where we get an interrupt to wake up a
> vcpu as we're blocking -- is already handled for "old-style" interrupts
> in vcpu_block:
> 
> void vcpu_block(void)
> {
>     struct vcpu *v = current;
> 
>     set_bit(_VPF_blocked, &v->pause_flags);
> 
>     /* Check for events /after/ blocking: avoids wakeup waiting race. */
>     if ( local_events_need_delivery() )
>     {
>         clear_bit(_VPF_blocked, &v->pause_flags);
>     }
>     else
>     {
>         TRACE_2D(TRC_SCHED_BLOCK, v->domain->domain_id, v->vcpu_id);
>         raise_softirq(SCHEDULE_SOFTIRQ);
>     }
> }
> 
> That is, we set _VPF_blocked, so that any interrupt which would wake it
> up actually wakes it up, and then we check local_events_need_delivery()
> to see if there were any that came in after we decided to block but
> before we made sure that an interrupt would wake us up.
> 
> I think it would be best if we could keep all the logic that does the
> same thing in the same place.  Which would mean in vcpu_block(), after
> calling set_bit(_VPF_blocked), changing the NV to pi_wakeup_vector, and
> then extending local_events_need_delivery() to also look for pending PI
> events.
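
For illustration, a minimal sketch of what that extension might look like
(pi_test_on() is whichever 'ON'-bit helper the series ends up providing, and
the other names here are assumptions, not the actual interface):

/* Does this vCPU have a posted interrupt outstanding, i.e. is the 'ON'
 * bit of its PI descriptor set?  Only meaningful when VT-d PI is in use. */
static inline int hvm_pi_event_pending(struct vcpu *v)
{
    if ( !iommu_intpost || !has_hvm_container_vcpu(v) )
        return 0;

    return pi_test_on(&v->arch.hvm_vmx.pi_desc);
}

/* local_events_need_delivery() (or the HVM helper it calls) would then
 * also return true when hvm_pi_event_pending(current) is true, so the
 * "check for events /after/ blocking" path catches posted interrupts
 * that arrived while we were setting _VPF_blocked. */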
> 
> Looking a bit more at your states, I think the actions that need to be
> taken on all the transitions are these (collapsing 'runnable' and
> 'offline' into the same state):
> 
> blocked -> runnable (vcpu_wake)
>  - NV = posted_intr_vector
>  - Take vcpu off blocked list
>  - SN = 1
> runnable -> running (context_switch)
>  - SN = 0

We also need to set the 'NDST' field to the pCPU the vCPU is going to run on.

> running -> runnable (context_switch)
>  - SN = 1
> running -> blocked (vcpu_block)
>  - NV = pi_wakeup_vector
>  - Add vcpu to blocked list

We also need to set the 'NDST' field to the pCPU which owns the blocking list,
so we can wake up the vCPU from the right blocking list in the wakeup
event handler.
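
For example, the "running -> blocked" update could look roughly like this
(sketch only: the list/lock names are placeholders, and the real code must
update NV/NDST atomically, e.g. with the cmpxchg16b helper from patch 02, so
that a racing notification is not lost):

/* Sketch of the block-side PI update, called from vcpu_block(). */
static void pi_prepare_block(struct vcpu *v)
{
    struct pi_desc *pi = &v->arch.hvm_vmx.pi_desc;
    unsigned int cpu = v->processor;   /* pCPU owning the blocking list */
    unsigned long flags;

    /* Route future notifications to this pCPU's wakeup handler. */
    pi->ndst = cpu_physical_id(cpu);
    pi->nv = pi_wakeup_vector;

    /* Make the vCPU findable by the wakeup event handler. */
    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock, cpu), flags);
    list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
                  &per_cpu(pi_blocked_vcpu, cpu));
    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, cpu), flags);
}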

> 
> This actually has a few pretty nice properties:
> 1. You have a nice pair of complementary actions -- block / wake, run /
> preempt
> 2. The potentially long actions with lists happen in vcpu_wake and
> vcpu_block, not on the context switch path
> 
> And at this point, we don't have the "lazy context switch" issue
> anymore, do we?  Because we've handled the "blocking" case in
> vcpu_block(), we don't need to do anything in the main context_switch()
> (which may do the lazy context switching into idle).  We only need to do
> something in the actual __context_switch().

I think the handling for the lazy context switch is not only about the blocking
case; we still need to do something on the lazy context switch path even if we
handle the blocking case in vcpu_block(), such as:

1. For non-idle -> idle
- set 'SN'

2. For idle -> non-idle
- clear 'SN'
- set the 'NDST' field to the pCPU the vCPU is going to run on. (Maybe
this one doesn't belong to the lazy context switch: if the pCPU of the non-idle
vCPU was changed, then per_cpu(curr_vcpu, cpu) != next in context_switch(),
hence it will go to __context_switch() directly, right?)
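
Roughly, as a sketch only (pi_set_sn()/pi_clear_sn() are the helper names I
assume from the earlier helper patch, and the "has an assigned device" check
is simplified):

/* non-idle -> idle: suppress posted notifications while descheduled. */
static void vmx_pi_switch_from(struct vcpu *v)
{
    if ( iommu_intpost && has_arch_pdevs(v->domain) )
        pi_set_sn(&v->arch.hvm_vmx.pi_desc);
}

/* idle -> non-idle: point notifications at the new pCPU, then unsuppress. */
static void vmx_pi_switch_to(struct vcpu *v)
{
    struct pi_desc *pi = &v->arch.hvm_vmx.pi_desc;

    if ( !iommu_intpost || !has_arch_pdevs(v->domain) )
        return;

    pi->ndst = cpu_physical_id(smp_processor_id());
    pi_clear_sn(pi);
}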

Thanks,
Feng

> 
> And at that point, could we actually get rid of the PI-specific context
> switch hooks altogether, and just put the SN state changes required for
> running->runnable and runnable->running in vmx_ctxt_switch_from() and
> vmx_ctxt_switch_to()?
> 
> If so, then the only hooks we need to add are vcpu_block and vcpu_wake.
>  To keep these consistent with other scheduling-related functions, I
> would put these in arch_vcpu, next to ctxt_switch_from() and
> ctxt_switch_to().
> 
> Thoughts?
> 
>  -George

^ permalink raw reply	[flat|nested] 86+ messages in thread

end of thread, other threads:[~2015-09-24  8:03 UTC | newest]

Thread overview: 86+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-11  8:28 [PATCH v7 00/17] Add VT-d Posted-Interrupts support Feng Wu
2015-09-11  8:28 ` [PATCH v7 01/17] VT-d Posted-intterrupt (PI) design Feng Wu
2015-09-11  8:28 ` [PATCH v7 02/17] Add cmpxchg16b support for x86-64 Feng Wu
2015-09-22 13:50   ` Jan Beulich
2015-09-22 13:55     ` Wu, Feng
2015-09-11  8:28 ` [PATCH v7 03/17] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature Feng Wu
2015-09-11  8:28 ` [PATCH v7 04/17] vt-d: VT-d Posted-Interrupts feature detection Feng Wu
2015-09-22 14:18   ` Jan Beulich
2015-09-11  8:28 ` [PATCH v7 05/17] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts Feng Wu
2015-09-22 14:20   ` Jan Beulich
2015-09-23  1:02     ` Wu, Feng
2015-09-23  7:36       ` Jan Beulich
2015-09-11  8:28 ` [PATCH v7 06/17] vmx: Add some helper functions for Posted-Interrupts Feng Wu
2015-09-11  8:28 ` [PATCH v7 07/17] vmx: Initialize VT-d Posted-Interrupts Descriptor Feng Wu
2015-09-11  8:28 ` [PATCH v7 08/17] vmx: Suppress posting interrupts when 'SN' is set Feng Wu
2015-09-22 14:23   ` Jan Beulich
2015-09-11  8:28 ` [PATCH v7 09/17] VT-d: Remove pointless casts Feng Wu
2015-09-22 14:30   ` Jan Beulich
2015-09-11  8:28 ` [PATCH v7 10/17] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts Feng Wu
2015-09-22 14:28   ` Jan Beulich
2015-09-11  8:29 ` [PATCH v7 11/17] vt-d: Add API to update IRTE when VT-d PI is used Feng Wu
2015-09-22 14:42   ` Jan Beulich
2015-09-11  8:29 ` [PATCH v7 12/17] x86: move some APIC related macros to apicdef.h Feng Wu
2015-09-22 14:44   ` Jan Beulich
2015-09-11  8:29 ` [PATCH v7 13/17] Update IRTE according to guest interrupt config changes Feng Wu
2015-09-22 14:51   ` Jan Beulich
2015-09-11  8:29 ` [PATCH v7 14/17] vmx: Properly handle notification event when vCPU is running Feng Wu
2015-09-11  8:29 ` [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling Feng Wu
2015-09-16 16:00   ` Dario Faggioli
2015-09-16 17:18   ` Dario Faggioli
2015-09-16 18:05     ` Dario Faggioli
2015-09-17  8:00     ` Wu, Feng
2015-09-17  8:48       ` Dario Faggioli
2015-09-17  9:16         ` Wu, Feng
2015-09-17  9:38         ` George Dunlap
2015-09-17  9:39           ` George Dunlap
2015-09-17 11:44           ` George Dunlap
2015-09-17 12:40             ` Dario Faggioli
2015-09-17 14:30               ` George Dunlap
2015-09-17 16:36                 ` Dario Faggioli
2015-09-18  6:27                 ` Jan Beulich
2015-09-18  9:22                   ` Dario Faggioli
2015-09-18 14:31                     ` George Dunlap
2015-09-18 14:34                       ` George Dunlap
2015-09-11  8:29 ` [PATCH v7 16/17] VT-d: Dump the posted format IRTE Feng Wu
2015-09-22 14:58   ` Jan Beulich
2015-09-11  8:29 ` [PATCH v7 17/17] Add a command line parameter for VT-d posted-interrupts Feng Wu
2015-09-21  5:08 [PATCH v7 15/17] vmx: VT-d posted-interrupt core logic handling Wu, Feng
2015-09-21  9:18 ` George Dunlap
2015-09-21 11:59   ` Wu, Feng
2015-09-21 13:31     ` Dario Faggioli
2015-09-21 13:50       ` Wu, Feng
2015-09-21 14:11         ` Dario Faggioli
2015-09-22  5:10           ` Wu, Feng
2015-09-22 10:43             ` George Dunlap
2015-09-22 10:46               ` George Dunlap
2015-09-22 13:25                 ` Wu, Feng
2015-09-22 13:40                   ` Dario Faggioli
2015-09-22 13:52                     ` Wu, Feng
2015-09-22 14:15                       ` George Dunlap
2015-09-22 14:38                         ` Dario Faggioli
2015-09-23  5:52                           ` Wu, Feng
2015-09-23  7:59                             ` Dario Faggioli
2015-09-23  8:11                               ` Wu, Feng
2015-09-22 14:28                   ` George Dunlap
2015-09-23  5:37                     ` Wu, Feng
2015-09-21  5:09 Wu, Feng
2015-09-21  9:54 ` George Dunlap
2015-09-21 12:22   ` Wu, Feng
2015-09-21 14:24     ` Dario Faggioli
2015-09-22  7:19       ` Wu, Feng
2015-09-22  8:59         ` Jan Beulich
2015-09-22 13:40           ` Wu, Feng
2015-09-22 14:01             ` Jan Beulich
2015-09-23  9:44               ` George Dunlap
2015-09-23 12:35                 ` Wu, Feng
2015-09-23 15:25                   ` George Dunlap
2015-09-23 15:38                     ` Jan Beulich
2015-09-24  1:50                     ` Wu, Feng
2015-09-24  3:35                       ` Dario Faggioli
2015-09-24  7:51                       ` Jan Beulich
2015-09-24  8:03                         ` Wu, Feng
2015-09-22 10:26         ` George Dunlap
2015-09-23  6:35           ` Wu, Feng
2015-09-23  7:11             ` Dario Faggioli
2015-09-23  7:20               ` Wu, Feng
