All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/13] Add VT-d Posted-Interrupts support for KVM
@ 2014-11-10  6:26 ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel, Feng Wu

VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.

You can find the VT-d Posted-Interrtups Spec. in the following URL:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html

Feng Wu (13):
  iommu/vt-d: VT-d Posted-Interrupts feature detection
  KVM: Initialize VT-d Posted-Interrtups Descriptor
  KVM: Add KVM_CAP_PI to detect VT-d Posted-Interrtups
  iommu/vt-d: Adjust 'struct irte' to better suit for VT-d
    Posted-Interrupts
  KVM: Update IRTE according to guest interrupt configuration changes
  KVM: Add some helper functions for Posted-Interrupts
  x86, irq: Define a global vector for VT-d Posted-Interrupts
  KVM: Update Posted-Interrupts descriptor during VCPU scheduling
  KVM: Change NDST field after VCPU scheduling
  KVM: Add the handler for Wake-up Vector
  KVM: Suppress posted-interrupt when 'SN' is set
  iommu/vt-d: No need to migrating irq for VT-d Posted-Interrtups
  iommu/vt-d: Add a command line parameter for VT-d posted-interrupts

 arch/x86/include/asm/entry_arch.h    |    2 +
 arch/x86/include/asm/hardirq.h       |    1 +
 arch/x86/include/asm/hw_irq.h        |    2 +
 arch/x86/include/asm/irq_remapping.h |    7 +
 arch/x86/include/asm/irq_vectors.h   |    1 +
 arch/x86/include/asm/kvm_host.h      |    9 ++
 arch/x86/kernel/apic/apic.c          |    1 +
 arch/x86/kernel/entry_64.S           |    2 +
 arch/x86/kernel/irq.c                |   27 ++++
 arch/x86/kernel/irqinit.c            |    2 +
 arch/x86/kvm/vmx.c                   |  257 +++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c                   |   53 ++++++-
 drivers/iommu/amd_iommu.c            |    6 +
 drivers/iommu/intel_irq_remapping.c  |   83 +++++++++--
 drivers/iommu/irq_remapping.c        |   20 +++
 drivers/iommu/irq_remapping.h        |    8 +
 include/linux/dmar.h                 |   30 ++++-
 include/linux/intel-iommu.h          |    1 +
 include/linux/kvm_host.h             |   25 ++++
 include/uapi/linux/kvm.h             |    2 +
 virt/kvm/assigned-dev.c              |  141 +++++++++++++++++++
 virt/kvm/irq_comm.c                  |    4 +-
 virt/kvm/irqchip.c                   |   11 --
 virt/kvm/kvm_main.c                  |   14 ++
 24 files changed, 667 insertions(+), 42 deletions(-)


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 00/13] Add VT-d Posted-Interrupts support for KVM
@ 2014-11-10  6:26 ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb-DgEjT+Ai2ygdnm+yROfE0A, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.

You can find the VT-d Posted-Interrtups Spec. in the following URL:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html

Feng Wu (13):
  iommu/vt-d: VT-d Posted-Interrupts feature detection
  KVM: Initialize VT-d Posted-Interrtups Descriptor
  KVM: Add KVM_CAP_PI to detect VT-d Posted-Interrtups
  iommu/vt-d: Adjust 'struct irte' to better suit for VT-d
    Posted-Interrupts
  KVM: Update IRTE according to guest interrupt configuration changes
  KVM: Add some helper functions for Posted-Interrupts
  x86, irq: Define a global vector for VT-d Posted-Interrupts
  KVM: Update Posted-Interrupts descriptor during VCPU scheduling
  KVM: Change NDST field after VCPU scheduling
  KVM: Add the handler for Wake-up Vector
  KVM: Suppress posted-interrupt when 'SN' is set
  iommu/vt-d: No need to migrating irq for VT-d Posted-Interrtups
  iommu/vt-d: Add a command line parameter for VT-d posted-interrupts

 arch/x86/include/asm/entry_arch.h    |    2 +
 arch/x86/include/asm/hardirq.h       |    1 +
 arch/x86/include/asm/hw_irq.h        |    2 +
 arch/x86/include/asm/irq_remapping.h |    7 +
 arch/x86/include/asm/irq_vectors.h   |    1 +
 arch/x86/include/asm/kvm_host.h      |    9 ++
 arch/x86/kernel/apic/apic.c          |    1 +
 arch/x86/kernel/entry_64.S           |    2 +
 arch/x86/kernel/irq.c                |   27 ++++
 arch/x86/kernel/irqinit.c            |    2 +
 arch/x86/kvm/vmx.c                   |  257 +++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c                   |   53 ++++++-
 drivers/iommu/amd_iommu.c            |    6 +
 drivers/iommu/intel_irq_remapping.c  |   83 +++++++++--
 drivers/iommu/irq_remapping.c        |   20 +++
 drivers/iommu/irq_remapping.h        |    8 +
 include/linux/dmar.h                 |   30 ++++-
 include/linux/intel-iommu.h          |    1 +
 include/linux/kvm_host.h             |   25 ++++
 include/uapi/linux/kvm.h             |    2 +
 virt/kvm/assigned-dev.c              |  141 +++++++++++++++++++
 virt/kvm/irq_comm.c                  |    4 +-
 virt/kvm/irqchip.c                   |   11 --
 virt/kvm/kvm_main.c                  |   14 ++
 24 files changed, 667 insertions(+), 42 deletions(-)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 01/13] iommu/vt-d: VT-d Posted-Interrupts feature detection
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel, Feng Wu

VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.

This patch adds feature detection logic for VT-d posted-interrupt.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 drivers/iommu/intel_irq_remapping.c |   13 +++++++++++++
 drivers/iommu/irq_remapping.c       |    4 ++++
 drivers/iommu/irq_remapping.h       |    5 +++++
 include/linux/intel-iommu.h         |    1 +
 4 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 7c80661..f99f0f1 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -580,6 +580,19 @@ static int __init intel_irq_remapping_supported(void)
 		if (!ecap_ir_support(iommu->ecap))
 			return 0;
 
+	/* VT-d posted-interrupt feature detection*/
+	if (disable_irq_post == 0)
+		for_each_drhd_unit(drhd) {
+			struct intel_iommu *iommu = drhd->iommu;
+
+			if (!cap_pi_support(iommu->cap)) {
+				irq_post_enabled = 0;
+				disable_irq_post = 1;
+				break;
+			}
+			irq_post_enabled = 1;
+		}
+
 	return 1;
 }
 
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 74a1767..2f8ee00 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -23,6 +23,10 @@ int irq_remap_broken;
 int disable_sourceid_checking;
 int no_x2apic_optout;
 
+int disable_irq_post = 1;
+int irq_post_enabled = 0;
+EXPORT_SYMBOL_GPL(irq_post_enabled);
+
 static struct irq_remap_ops *remap_ops;
 
 static int msi_alloc_remapped_irq(struct pci_dev *pdev, int irq, int nvec);
diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
index fde250f..7bb5913 100644
--- a/drivers/iommu/irq_remapping.h
+++ b/drivers/iommu/irq_remapping.h
@@ -37,6 +37,9 @@ extern int disable_sourceid_checking;
 extern int no_x2apic_optout;
 extern int irq_remapping_enabled;
 
+extern int disable_irq_post;
+extern int irq_post_enabled;
+
 struct irq_remap_ops {
 	/* Check whether Interrupt Remapping is supported */
 	int (*supported)(void);
@@ -91,6 +94,8 @@ extern struct irq_remap_ops amd_iommu_irq_ops;
 #define irq_remapping_enabled 0
 #define disable_irq_remap     1
 #define irq_remap_broken      0
+#define disable_irq_post      1
+#define irq_post_enabled      0
 
 #endif /* CONFIG_IRQ_REMAP */
 
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index a65208a..5b1a124 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -87,6 +87,7 @@ static inline void dmar_writeq(void __iomem *addr, u64 val)
 /*
  * Decoding Capability Register
  */
+#define cap_pi_support(c)	(((c) >> 59) & 1)
 #define cap_read_drain(c)	(((c) >> 55) & 1)
 #define cap_write_drain(c)	(((c) >> 54) & 1)
 #define cap_max_amask_val(c)	(((c) >> 48) & 0x3f)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 01/13] iommu/vt-d: VT-d Posted-Interrupts feature detection
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb-DgEjT+Ai2ygdnm+yROfE0A, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.

This patch adds feature detection logic for VT-d posted-interrupt.

Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/iommu/intel_irq_remapping.c |   13 +++++++++++++
 drivers/iommu/irq_remapping.c       |    4 ++++
 drivers/iommu/irq_remapping.h       |    5 +++++
 include/linux/intel-iommu.h         |    1 +
 4 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 7c80661..f99f0f1 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -580,6 +580,19 @@ static int __init intel_irq_remapping_supported(void)
 		if (!ecap_ir_support(iommu->ecap))
 			return 0;
 
+	/* VT-d posted-interrupt feature detection*/
+	if (disable_irq_post == 0)
+		for_each_drhd_unit(drhd) {
+			struct intel_iommu *iommu = drhd->iommu;
+
+			if (!cap_pi_support(iommu->cap)) {
+				irq_post_enabled = 0;
+				disable_irq_post = 1;
+				break;
+			}
+			irq_post_enabled = 1;
+		}
+
 	return 1;
 }
 
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 74a1767..2f8ee00 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -23,6 +23,10 @@ int irq_remap_broken;
 int disable_sourceid_checking;
 int no_x2apic_optout;
 
+int disable_irq_post = 1;
+int irq_post_enabled = 0;
+EXPORT_SYMBOL_GPL(irq_post_enabled);
+
 static struct irq_remap_ops *remap_ops;
 
 static int msi_alloc_remapped_irq(struct pci_dev *pdev, int irq, int nvec);
diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
index fde250f..7bb5913 100644
--- a/drivers/iommu/irq_remapping.h
+++ b/drivers/iommu/irq_remapping.h
@@ -37,6 +37,9 @@ extern int disable_sourceid_checking;
 extern int no_x2apic_optout;
 extern int irq_remapping_enabled;
 
+extern int disable_irq_post;
+extern int irq_post_enabled;
+
 struct irq_remap_ops {
 	/* Check whether Interrupt Remapping is supported */
 	int (*supported)(void);
@@ -91,6 +94,8 @@ extern struct irq_remap_ops amd_iommu_irq_ops;
 #define irq_remapping_enabled 0
 #define disable_irq_remap     1
 #define irq_remap_broken      0
+#define disable_irq_post      1
+#define irq_post_enabled      0
 
 #endif /* CONFIG_IRQ_REMAP */
 
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index a65208a..5b1a124 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -87,6 +87,7 @@ static inline void dmar_writeq(void __iomem *addr, u64 val)
 /*
  * Decoding Capability Register
  */
+#define cap_pi_support(c)	(((c) >> 59) & 1)
 #define cap_read_drain(c)	(((c) >> 55) & 1)
 #define cap_write_drain(c)	(((c) >> 54) & 1)
 #define cap_max_amask_val(c)	(((c) >> 48) & 0x3f)
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel, Feng Wu

This patch initialize the VT-d Posted-interrupt Descritpor.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 arch/x86/include/asm/irq_remapping.h |    1 +
 arch/x86/kernel/apic/apic.c          |    1 +
 arch/x86/kvm/vmx.c                   |   56 ++++++++++++++++++++++++++++++++-
 3 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index b7747c4..a3cc437 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -57,6 +57,7 @@ extern bool setup_remapped_irq(int irq,
 			       struct irq_chip *chip);
 
 void irq_remap_modify_chip_defaults(struct irq_chip *chip);
+extern int irq_post_enabled;
 
 #else  /* CONFIG_IRQ_REMAP */
 
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ba6cc04..987408d 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -162,6 +162,7 @@ __setup("apicpmtimer", setup_apicpmtimer);
 #endif
 
 int x2apic_mode;
+EXPORT_SYMBOL_GPL(x2apic_mode);
 #ifdef CONFIG_X86_X2APIC
 /* x2apic enabled before OS handover */
 int x2apic_preenabled;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3e556c6..a4670d3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -45,6 +45,7 @@
 #include <asm/perf_event.h>
 #include <asm/debugreg.h>
 #include <asm/kexec.h>
+#include <asm/irq_remapping.h>
 
 #include "trace.h"
 
@@ -408,13 +409,32 @@ struct nested_vmx {
 };
 
 #define POSTED_INTR_ON  0
+#define POSTED_INTR_SN  1
+
 /* Posted-Interrupt Descriptor */
 struct pi_desc {
 	u32 pir[8];     /* Posted interrupt requested */
-	u32 control;	/* bit 0 of control is outstanding notification bit */
-	u32 rsvd[7];
+	union {
+		struct {
+			u64	on	: 1,
+				sn	: 1,
+				rsvd_1	: 13,
+				ndm	: 1,
+				nv	: 8,
+				rsvd_2	: 8,
+				ndst	: 32;
+		};
+		u64 control;
+	};
+	u32 rsvd[6];
 } __aligned(64);
 
+static void pi_clear_sn(struct pi_desc *pi_desc)
+{
+	return clear_bit(POSTED_INTR_SN,
+			(unsigned long *)&pi_desc->control);
+}
+
 static bool pi_test_and_set_on(struct pi_desc *pi_desc)
 {
 	return test_and_set_bit(POSTED_INTR_ON,
@@ -4396,6 +4416,33 @@ static void ept_set_mmio_spte_mask(void)
 	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
 }
 
+static bool pi_desc_init(struct vcpu_vmx *vmx)
+{
+	unsigned int dest;
+
+	if (irq_post_enabled == 0)
+		return true;
+
+	/*
+	 * Initialize Posted-Interrupt Descriptor
+	 */
+
+	pi_clear_sn(&vmx->pi_desc);
+	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
+
+	/* Physical mode for Notificaiton Event */
+	vmx->pi_desc.ndm = 0;
+	dest = cpu_physical_id(vmx->vcpu.cpu);
+
+	if (x2apic_mode)
+		vmx->pi_desc.ndst = dest;
+	else
+		vmx->pi_desc.ndst = (dest << 8) & 0xFF00;
+
+	return true;
+}
+
+
 /*
  * Sets up the vmcs for emulated real mode.
  */
@@ -4439,6 +4486,11 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 
 		vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
 		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
+
+		if (!pi_desc_init(vmx)) {
+			printk(KERN_ERR "Initialize PI descriptor error!\n");
+			return 1;
+		}
 	}
 
 	if (ple_gap) {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb-DgEjT+Ai2ygdnm+yROfE0A, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

This patch initialize the VT-d Posted-interrupt Descritpor.

Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 arch/x86/include/asm/irq_remapping.h |    1 +
 arch/x86/kernel/apic/apic.c          |    1 +
 arch/x86/kvm/vmx.c                   |   56 ++++++++++++++++++++++++++++++++-
 3 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index b7747c4..a3cc437 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -57,6 +57,7 @@ extern bool setup_remapped_irq(int irq,
 			       struct irq_chip *chip);
 
 void irq_remap_modify_chip_defaults(struct irq_chip *chip);
+extern int irq_post_enabled;
 
 #else  /* CONFIG_IRQ_REMAP */
 
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ba6cc04..987408d 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -162,6 +162,7 @@ __setup("apicpmtimer", setup_apicpmtimer);
 #endif
 
 int x2apic_mode;
+EXPORT_SYMBOL_GPL(x2apic_mode);
 #ifdef CONFIG_X86_X2APIC
 /* x2apic enabled before OS handover */
 int x2apic_preenabled;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3e556c6..a4670d3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -45,6 +45,7 @@
 #include <asm/perf_event.h>
 #include <asm/debugreg.h>
 #include <asm/kexec.h>
+#include <asm/irq_remapping.h>
 
 #include "trace.h"
 
@@ -408,13 +409,32 @@ struct nested_vmx {
 };
 
 #define POSTED_INTR_ON  0
+#define POSTED_INTR_SN  1
+
 /* Posted-Interrupt Descriptor */
 struct pi_desc {
 	u32 pir[8];     /* Posted interrupt requested */
-	u32 control;	/* bit 0 of control is outstanding notification bit */
-	u32 rsvd[7];
+	union {
+		struct {
+			u64	on	: 1,
+				sn	: 1,
+				rsvd_1	: 13,
+				ndm	: 1,
+				nv	: 8,
+				rsvd_2	: 8,
+				ndst	: 32;
+		};
+		u64 control;
+	};
+	u32 rsvd[6];
 } __aligned(64);
 
+static void pi_clear_sn(struct pi_desc *pi_desc)
+{
+	return clear_bit(POSTED_INTR_SN,
+			(unsigned long *)&pi_desc->control);
+}
+
 static bool pi_test_and_set_on(struct pi_desc *pi_desc)
 {
 	return test_and_set_bit(POSTED_INTR_ON,
@@ -4396,6 +4416,33 @@ static void ept_set_mmio_spte_mask(void)
 	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
 }
 
+static bool pi_desc_init(struct vcpu_vmx *vmx)
+{
+	unsigned int dest;
+
+	if (irq_post_enabled == 0)
+		return true;
+
+	/*
+	 * Initialize Posted-Interrupt Descriptor
+	 */
+
+	pi_clear_sn(&vmx->pi_desc);
+	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
+
+	/* Physical mode for Notificaiton Event */
+	vmx->pi_desc.ndm = 0;
+	dest = cpu_physical_id(vmx->vcpu.cpu);
+
+	if (x2apic_mode)
+		vmx->pi_desc.ndst = dest;
+	else
+		vmx->pi_desc.ndst = (dest << 8) & 0xFF00;
+
+	return true;
+}
+
+
 /*
  * Sets up the vmcs for emulated real mode.
  */
@@ -4439,6 +4486,11 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 
 		vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
 		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
+
+		if (!pi_desc_init(vmx)) {
+			printk(KERN_ERR "Initialize PI descriptor error!\n");
+			return 1;
+		}
 	}
 
 	if (ple_gap) {
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 03/13] KVM: Add KVM_CAP_PI to detect VT-d Posted-Interrtups
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel, Feng Wu

This patch adds KVM_CAP_PI to detect VT-d Posted-Interrtups
feature for QEMU.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 arch/x86/kvm/x86.c       |    4 ++++
 include/uapi/linux/kvm.h |    1 +
 2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0033df3..b447a98 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -63,6 +63,7 @@
 #include <asm/xcr.h>
 #include <asm/pvclock.h>
 #include <asm/div64.h>
+#include <asm/irq_remapping.h>
 
 #define MAX_IO_MSRS 256
 #define KVM_MAX_MCE_BANKS 32
@@ -2775,6 +2776,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_TSC_DEADLINE_TIMER:
 		r = boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER);
 		break;
+	case KVM_CAP_PI:
+		r = irq_post_enabled;
+		break;
 	default:
 		r = 0;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6076882..7593c52 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -761,6 +761,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_FIXUP_HCALL 103
 #define KVM_CAP_PPC_ENABLE_HCALL 104
 #define KVM_CAP_CHECK_EXTENSION_VM 105
+#define KVM_CAP_PI 106
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 03/13] KVM: Add KVM_CAP_PI to detect VT-d Posted-Interrtups
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb-DgEjT+Ai2ygdnm+yROfE0A, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

This patch adds KVM_CAP_PI to detect VT-d Posted-Interrtups
feature for QEMU.

Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 arch/x86/kvm/x86.c       |    4 ++++
 include/uapi/linux/kvm.h |    1 +
 2 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0033df3..b447a98 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -63,6 +63,7 @@
 #include <asm/xcr.h>
 #include <asm/pvclock.h>
 #include <asm/div64.h>
+#include <asm/irq_remapping.h>
 
 #define MAX_IO_MSRS 256
 #define KVM_MAX_MCE_BANKS 32
@@ -2775,6 +2776,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_TSC_DEADLINE_TIMER:
 		r = boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER);
 		break;
+	case KVM_CAP_PI:
+		r = irq_post_enabled;
+		break;
 	default:
 		r = 0;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6076882..7593c52 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -761,6 +761,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_FIXUP_HCALL 103
 #define KVM_CAP_PPC_ENABLE_HCALL 104
 #define KVM_CAP_CHECK_EXTENSION_VM 105
+#define KVM_CAP_PI 106
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 04/13] iommu/vt-d: Adjust 'struct irte' to better suit for VT-d Posted-Interrupts
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel, Feng Wu

This patch adjusts the definition of 'struct irte', so that we can
add the VT-d Posted-Interrtups format in this structure later.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 drivers/iommu/intel_irq_remapping.c |   35 +++++++++++++++++++----------------
 include/linux/dmar.h                |    4 ++--
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index f99f0f1..776da10 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -310,9 +310,9 @@ static void set_irte_sid(struct irte *irte, unsigned int svt,
 {
 	if (disable_sourceid_checking)
 		svt = SVT_NO_VERIFY;
-	irte->svt = svt;
-	irte->sq = sq;
-	irte->sid = sid;
+	irte->irq_remap_high.svt = svt;
+	irte->irq_remap_high.sq = sq;
+	irte->irq_remap_high.sid = sid;
 }
 
 static int set_ioapic_sid(struct irte *irte, int apic)
@@ -917,8 +917,8 @@ static void prepare_irte(struct irte *irte, int vector,
 {
 	memset(irte, 0, sizeof(*irte));
 
-	irte->present = 1;
-	irte->dst_mode = apic->irq_dest_mode;
+	irte->irq_remap_low.present = 1;
+	irte->irq_remap_low.dst_mode = apic->irq_dest_mode;
 	/*
 	 * Trigger mode in the IRTE will always be edge, and for IO-APIC, the
 	 * actual level or edge trigger will be setup in the IO-APIC
@@ -926,11 +926,11 @@ static void prepare_irte(struct irte *irte, int vector,
 	 * For more details, see the comments (in io_apic.c) explainig IO-APIC
 	 * irq migration in the presence of interrupt-remapping.
 	*/
-	irte->trigger_mode = 0;
-	irte->dlvry_mode = apic->irq_delivery_mode;
-	irte->vector = vector;
-	irte->dest_id = IRTE_DEST(dest);
-	irte->redir_hint = 1;
+	irte->irq_remap_low.trigger_mode = 0;
+	irte->irq_remap_low.dlvry_mode = apic->irq_delivery_mode;
+	irte->irq_remap_low.vector = vector;
+	irte->irq_remap_low.dest_id = IRTE_DEST(dest);
+	irte->irq_remap_low.redir_hint = 1;
 }
 
 static int intel_setup_ioapic_entry(int irq,
@@ -973,10 +973,13 @@ static int intel_setup_ioapic_entry(int irq,
 		"Redir_hint:%d Trig_Mode:%d Dlvry_Mode:%X "
 		"Avail:%X Vector:%02X Dest:%08X "
 		"SID:%04X SQ:%X SVT:%X)\n",
-		attr->ioapic, irte.present, irte.fpd, irte.dst_mode,
-		irte.redir_hint, irte.trigger_mode, irte.dlvry_mode,
-		irte.avail, irte.vector, irte.dest_id,
-		irte.sid, irte.sq, irte.svt);
+		attr->ioapic, irte.irq_remap_low.present,
+		irte.irq_remap_low.fpd, irte.irq_remap_low.dst_mode,
+		irte.irq_remap_low.redir_hint, irte.irq_remap_low.trigger_mode,
+		irte.irq_remap_low.dlvry_mode, irte.irq_remap_low.avail,
+		irte.irq_remap_low.vector, irte.irq_remap_low.dest_id,
+		irte.irq_remap_high.sid, irte.irq_remap_high.sq,
+		irte.irq_remap_high.svt);
 
 	entry = (struct IR_IO_APIC_route_entry *)route_entry;
 	memset(entry, 0, sizeof(*entry));
@@ -1046,8 +1049,8 @@ intel_ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
 		return err;
 	}
 
-	irte.vector = cfg->vector;
-	irte.dest_id = IRTE_DEST(dest);
+	irte.irq_remap_low.vector = cfg->vector;
+	irte.irq_remap_low.dest_id = IRTE_DEST(dest);
 
 	/*
 	 * Atomically updates the IRTE with the new destination, vector
diff --git a/include/linux/dmar.h b/include/linux/dmar.h
index 593fff9..8be5d42 100644
--- a/include/linux/dmar.h
+++ b/include/linux/dmar.h
@@ -159,7 +159,7 @@ struct irte {
 				vector		: 8,
 				__reserved_2	: 8,
 				dest_id		: 32;
-		};
+		} irq_remap_low;
 		__u64 low;
 	};
 
@@ -169,7 +169,7 @@ struct irte {
 				sq		: 2,
 				svt		: 2,
 				__reserved_3	: 44;
-		};
+		} irq_remap_high;
 		__u64 high;
 	};
 };
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 04/13] iommu/vt-d: Adjust 'struct irte' to better suit for VT-d Posted-Interrupts
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb-DgEjT+Ai2ygdnm+yROfE0A, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

This patch adjusts the definition of 'struct irte', so that we can
add the VT-d Posted-Interrtups format in this structure later.

Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/iommu/intel_irq_remapping.c |   35 +++++++++++++++++++----------------
 include/linux/dmar.h                |    4 ++--
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index f99f0f1..776da10 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -310,9 +310,9 @@ static void set_irte_sid(struct irte *irte, unsigned int svt,
 {
 	if (disable_sourceid_checking)
 		svt = SVT_NO_VERIFY;
-	irte->svt = svt;
-	irte->sq = sq;
-	irte->sid = sid;
+	irte->irq_remap_high.svt = svt;
+	irte->irq_remap_high.sq = sq;
+	irte->irq_remap_high.sid = sid;
 }
 
 static int set_ioapic_sid(struct irte *irte, int apic)
@@ -917,8 +917,8 @@ static void prepare_irte(struct irte *irte, int vector,
 {
 	memset(irte, 0, sizeof(*irte));
 
-	irte->present = 1;
-	irte->dst_mode = apic->irq_dest_mode;
+	irte->irq_remap_low.present = 1;
+	irte->irq_remap_low.dst_mode = apic->irq_dest_mode;
 	/*
 	 * Trigger mode in the IRTE will always be edge, and for IO-APIC, the
 	 * actual level or edge trigger will be setup in the IO-APIC
@@ -926,11 +926,11 @@ static void prepare_irte(struct irte *irte, int vector,
 	 * For more details, see the comments (in io_apic.c) explainig IO-APIC
 	 * irq migration in the presence of interrupt-remapping.
 	*/
-	irte->trigger_mode = 0;
-	irte->dlvry_mode = apic->irq_delivery_mode;
-	irte->vector = vector;
-	irte->dest_id = IRTE_DEST(dest);
-	irte->redir_hint = 1;
+	irte->irq_remap_low.trigger_mode = 0;
+	irte->irq_remap_low.dlvry_mode = apic->irq_delivery_mode;
+	irte->irq_remap_low.vector = vector;
+	irte->irq_remap_low.dest_id = IRTE_DEST(dest);
+	irte->irq_remap_low.redir_hint = 1;
 }
 
 static int intel_setup_ioapic_entry(int irq,
@@ -973,10 +973,13 @@ static int intel_setup_ioapic_entry(int irq,
 		"Redir_hint:%d Trig_Mode:%d Dlvry_Mode:%X "
 		"Avail:%X Vector:%02X Dest:%08X "
 		"SID:%04X SQ:%X SVT:%X)\n",
-		attr->ioapic, irte.present, irte.fpd, irte.dst_mode,
-		irte.redir_hint, irte.trigger_mode, irte.dlvry_mode,
-		irte.avail, irte.vector, irte.dest_id,
-		irte.sid, irte.sq, irte.svt);
+		attr->ioapic, irte.irq_remap_low.present,
+		irte.irq_remap_low.fpd, irte.irq_remap_low.dst_mode,
+		irte.irq_remap_low.redir_hint, irte.irq_remap_low.trigger_mode,
+		irte.irq_remap_low.dlvry_mode, irte.irq_remap_low.avail,
+		irte.irq_remap_low.vector, irte.irq_remap_low.dest_id,
+		irte.irq_remap_high.sid, irte.irq_remap_high.sq,
+		irte.irq_remap_high.svt);
 
 	entry = (struct IR_IO_APIC_route_entry *)route_entry;
 	memset(entry, 0, sizeof(*entry));
@@ -1046,8 +1049,8 @@ intel_ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
 		return err;
 	}
 
-	irte.vector = cfg->vector;
-	irte.dest_id = IRTE_DEST(dest);
+	irte.irq_remap_low.vector = cfg->vector;
+	irte.irq_remap_low.dest_id = IRTE_DEST(dest);
 
 	/*
 	 * Atomically updates the IRTE with the new destination, vector
diff --git a/include/linux/dmar.h b/include/linux/dmar.h
index 593fff9..8be5d42 100644
--- a/include/linux/dmar.h
+++ b/include/linux/dmar.h
@@ -159,7 +159,7 @@ struct irte {
 				vector		: 8,
 				__reserved_2	: 8,
 				dest_id		: 32;
-		};
+		} irq_remap_low;
 		__u64 low;
 	};
 
@@ -169,7 +169,7 @@ struct irte {
 				sq		: 2,
 				svt		: 2,
 				__reserved_3	: 44;
-		};
+		} irq_remap_high;
 		__u64 high;
 	};
 };
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel, Feng Wu

When guest changes its interrupt configuration (such as, vector, etc.)
for direct-assigned devices, we need to update the associated IRTE
with the new guest vector, so external interrupts from the assigned
devices can be injected to guests without VM-Exit.

The current method of handling guest lowest priority interrtups
is to use a counter 'apic_arb_prio' for each VCPU, we choose the
VCPU with smallest 'apic_arb_prio' and then increase it by 1.
However, for VT-d PI, we cannot re-use this, since we no longer
have control to 'apic_arb_prio' with posted interrupt direct
delivery by Hardware.

Here, we introduce a similiar way with 'apic_arb_prio' to handle
guest lowest priority interrtups when VT-d PI is used. Here is the
ideas:
- Each VCPU has a counter 'round_robin_counter'.
- When guests sets an interrupts to lowest priority, we choose
the VCPU with smallest 'round_robin_counter' as the destination,
then increase it.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 arch/x86/include/asm/irq_remapping.h |    6 ++
 arch/x86/include/asm/kvm_host.h      |    2 +
 arch/x86/kvm/vmx.c                   |   12 +++
 arch/x86/kvm/x86.c                   |   11 +++
 drivers/iommu/amd_iommu.c            |    6 ++
 drivers/iommu/intel_irq_remapping.c  |   28 +++++++
 drivers/iommu/irq_remapping.c        |    9 ++
 drivers/iommu/irq_remapping.h        |    3 +
 include/linux/dmar.h                 |   26 ++++++
 include/linux/kvm_host.h             |   22 +++++
 include/uapi/linux/kvm.h             |    1 +
 virt/kvm/assigned-dev.c              |  141 ++++++++++++++++++++++++++++++++++
 virt/kvm/irq_comm.c                  |    4 +-
 virt/kvm/irqchip.c                   |   11 ---
 14 files changed, 269 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index a3cc437..32d6cc4 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -51,6 +51,7 @@ extern void compose_remapped_msi_msg(struct pci_dev *pdev,
 				     unsigned int irq, unsigned int dest,
 				     struct msi_msg *msg, u8 hpet_id);
 extern int setup_hpet_msi_remapped(unsigned int irq, unsigned int id);
+extern int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32 vector);
 extern void panic_if_irq_remap(const char *msg);
 extern bool setup_remapped_irq(int irq,
 			       struct irq_cfg *cfg,
@@ -88,6 +89,11 @@ static inline int setup_hpet_msi_remapped(unsigned int irq, unsigned int id)
 	return -ENODEV;
 }
 
+static inline int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32 vector)
+{
+	return -ENODEV;
+}
+
 static inline void panic_if_irq_remap(const char *msg)
 {
 }
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6ed0c30..0630161 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -358,6 +358,7 @@ struct kvm_vcpu_arch {
 	struct kvm_lapic *apic;    /* kernel irqchip context */
 	unsigned long apic_attention;
 	int32_t apic_arb_prio;
+	int32_t round_robin_counter;
 	int mp_state;
 	u64 ia32_misc_enable_msr;
 	bool tpr_access_reporting;
@@ -771,6 +772,7 @@ struct kvm_x86_ops {
 	int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
 
 	void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
+	u64 (*get_pi_desc_addr)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a4670d3..ae91b72 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -544,6 +544,11 @@ static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
+{
+	return &(to_vmx(vcpu)->pi_desc);
+}
+
 #define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
 #define FIELD(number, name)	[number] = VMCS12_OFFSET(name)
 #define FIELD64(number, name)	[number] = VMCS12_OFFSET(name), \
@@ -4280,6 +4285,11 @@ static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu *vcpu)
 	return;
 }
 
+static u64 vmx_get_pi_desc_addr(struct kvm_vcpu *vcpu)
+{
+	return __pa((u64)vcpu_to_pi_desc(vcpu));
+}
+
 /*
  * Set up the vmcs's constant host-state fields, i.e., host-state fields that
  * will not change in the lifetime of the guest.
@@ -9232,6 +9242,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.check_nested_events = vmx_check_nested_events,
 
 	.sched_in = vmx_sched_in,
+
+	.get_pi_desc_addr = vmx_get_pi_desc_addr,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b447a98..0c19d15 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7735,6 +7735,17 @@ bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_arch_has_noncoherent_dma);
 
+int kvm_update_pi_irte_common(struct kvm *kvm, struct kvm_vcpu *vcpu,
+			u32 guest_vector, int host_irq)
+{
+	u64 pi_desc_addr = kvm_x86_ops->get_pi_desc_addr(vcpu);
+
+	if (update_pi_irte(host_irq, pi_desc_addr, guest_vector))
+		return -1;
+
+	return 0;
+}
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 505a9ad..a36fdc7 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -4280,6 +4280,11 @@ static int alloc_hpet_msi(unsigned int irq, unsigned int id)
 	return 0;
 }
 
+static int dummy_update_pi_irte(int irq, u64 pi_desc_addr, u32 vector)
+{
+	return -EINVAL;
+}
+
 struct irq_remap_ops amd_iommu_irq_ops = {
 	.supported		= amd_iommu_supported,
 	.prepare		= amd_iommu_prepare,
@@ -4294,5 +4299,6 @@ struct irq_remap_ops amd_iommu_irq_ops = {
 	.msi_alloc_irq		= msi_alloc_irq,
 	.msi_setup_irq		= msi_setup_irq,
 	.alloc_hpet_msi		= alloc_hpet_msi,
+	.update_pi_irte         = dummy_update_pi_irte,
 };
 #endif
diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 776da10..87c02fe 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -1172,6 +1172,33 @@ static int intel_alloc_hpet_msi(unsigned int irq, unsigned int id)
 	return ret;
 }
 
+static int intel_update_pi_irte(int irq, u64 pi_desc_addr, u32 vector)
+{
+	struct irte irte;
+
+	if (get_irte(irq, &irte))
+		return -1;
+
+	irte.irq_post_low.urg = 0;
+	irte.irq_post_low.vector = vector;
+	irte.irq_post_low.pda_l = (pi_desc_addr >> (32 - PDA_LOW_BIT)) &
+			~(-1UL << PDA_LOW_BIT);
+	irte.irq_post_high.pda_h = (pi_desc_addr >> 32) &
+			~(-1UL << PDA_HIGH_BIT);
+
+	irte.irq_post_low.__reserved_1 = 0;
+	irte.irq_post_low.__reserved_2 = 0;
+	irte.irq_post_low.__reserved_3 = 0;
+	irte.irq_post_high.__reserved_4 = 0;
+
+	irte.irq_post_low.pst = 1;
+
+	if (modify_irte(irq, &irte))
+		return -1;
+
+	return 0;
+}
+
 struct irq_remap_ops intel_irq_remap_ops = {
 	.supported		= intel_irq_remapping_supported,
 	.prepare		= dmar_table_init,
@@ -1186,4 +1213,5 @@ struct irq_remap_ops intel_irq_remap_ops = {
 	.msi_alloc_irq		= intel_msi_alloc_irq,
 	.msi_setup_irq		= intel_msi_setup_irq,
 	.alloc_hpet_msi		= intel_alloc_hpet_msi,
+	.update_pi_irte         = intel_update_pi_irte,
 };
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 2f8ee00..0e36860 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -362,6 +362,15 @@ int setup_hpet_msi_remapped(unsigned int irq, unsigned int id)
 	return default_setup_hpet_msi(irq, id);
 }
 
+int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32 vector)
+{
+	if (!remap_ops || !remap_ops->update_pi_irte)
+		return -ENODEV;
+
+	return remap_ops->update_pi_irte(irq, pi_desc_addr, vector);
+}
+EXPORT_SYMBOL_GPL(update_pi_irte);
+
 void panic_if_irq_remap(const char *msg)
 {
 	if (irq_remapping_enabled)
diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
index 7bb5913..2d8f740 100644
--- a/drivers/iommu/irq_remapping.h
+++ b/drivers/iommu/irq_remapping.h
@@ -84,6 +84,9 @@ struct irq_remap_ops {
 
 	/* Setup interrupt remapping for an HPET MSI */
 	int (*alloc_hpet_msi)(unsigned int, unsigned int);
+
+	/* Update IRTE for posted-interrupt */
+	int (*update_pi_irte)(int irq, u64 pi_desc_addr, u32 vector);
 };
 
 extern struct irq_remap_ops intel_irq_remap_ops;
diff --git a/include/linux/dmar.h b/include/linux/dmar.h
index 8be5d42..e1ff4f7 100644
--- a/include/linux/dmar.h
+++ b/include/linux/dmar.h
@@ -160,6 +160,20 @@ struct irte {
 				__reserved_2	: 8,
 				dest_id		: 32;
 		} irq_remap_low;
+
+		struct {
+			__u64   present		: 1,
+				fpd		: 1,
+				__reserved_1	: 6,
+				avail	: 4,
+				__reserved_2	: 2,
+				urg		: 1,
+				pst		: 1,
+				vector	: 8,
+				__reserved_3	: 14,
+				pda_l	: 26;
+		} irq_post_low;
+
 		__u64 low;
 	};
 
@@ -170,10 +184,22 @@ struct irte {
 				svt		: 2,
 				__reserved_3	: 44;
 		} irq_remap_high;
+
+		struct {
+			__u64	sid:	16,
+				sq:		2,
+				svt:	2,
+				__reserved_4:	12,
+				pda_h:	32;
+		} irq_post_high;
+
 		__u64 high;
 	};
 };
 
+#define PDA_LOW_BIT    26
+#define PDA_HIGH_BIT   32
+
 enum {
 	IRQ_REMAP_XAPIC_MODE,
 	IRQ_REMAP_X2APIC_MODE,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ea53b04..6bb8287 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -335,6 +335,25 @@ struct kvm_kernel_irq_routing_entry {
 	struct hlist_node link;
 };
 
+#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
+
+struct kvm_irq_routing_table {
+	int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
+	struct kvm_kernel_irq_routing_entry *rt_entries;
+	u32 nr_rt_entries;
+	/*
+	 * Array indexed by gsi. Each entry contains list of irq chips
+	 * the gsi is connected to.
+	 */
+	struct hlist_head map[0];
+};
+
+#else
+
+struct kvm_irq_routing_table {};
+
+#endif
+
 #ifndef KVM_PRIVATE_MEM_SLOTS
 #define KVM_PRIVATE_MEM_SLOTS 0
 #endif
@@ -766,6 +785,9 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
 				   struct kvm_irq_ack_notifier *kian);
 int kvm_request_irq_source_id(struct kvm *kvm);
 void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
+void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
+				   struct kvm_lapic_irq *irq);
+bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq);
 
 #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT
 int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 7593c52..509223a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1027,6 +1027,7 @@ struct kvm_s390_ucas_mapping {
 #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct kvm_xen_hvm_config)
 #define KVM_SET_CLOCK             _IOW(KVMIO,  0x7b, struct kvm_clock_data)
 #define KVM_GET_CLOCK             _IOR(KVMIO,  0x7c, struct kvm_clock_data)
+#define KVM_ASSIGN_DEV_PI_UPDATE  _IOR(KVMIO,  0x7d, __u32)
 /* Available with KVM_CAP_PIT_STATE2 */
 #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct kvm_pit_state2)
 #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index e05000e..e154009 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -326,6 +326,135 @@ void kvm_free_all_assigned_devices(struct kvm *kvm)
 	}
 }
 
+int __weak kvm_update_pi_irte_common(struct kvm *kvm, struct kvm_vcpu *vcpu,
+					u32 guest_vector, int host_irq)
+{
+	return 0;
+}
+
+int kvm_compare_rr_counter(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
+{
+	return vcpu1->arch.round_robin_counter -
+			vcpu2->arch.round_robin_counter;
+}
+
+bool kvm_pi_find_dest_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
+				struct kvm_vcpu **dest_vcpu)
+{
+	int i, r = 0;
+	struct kvm_vcpu *vcpu, *dest = NULL;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (!kvm_apic_present(vcpu))
+			continue;
+
+		if (!kvm_apic_match_dest(vcpu, NULL, irq->shorthand,
+					irq->dest_id, irq->dest_mode))
+			continue;
+
+		if (!kvm_is_dm_lowest_prio(irq)) {
+			r++;
+			*dest_vcpu = vcpu;
+		} else if (kvm_lapic_enabled(vcpu)) {
+			if (!dest)
+				dest = vcpu;
+			else if (kvm_compare_rr_counter(vcpu, dest) < 0)
+				dest = vcpu;
+		}
+	}
+
+	if (dest) {
+		dest->arch.round_robin_counter++;
+		*dest_vcpu = dest;
+		return true;
+	} else if (r == 1)
+		return true;
+
+	return false;
+}
+
+static int __kvm_update_pi_irte(struct kvm *kvm, int host_irq, int guest_irq)
+{
+	struct kvm_kernel_irq_routing_entry *e;
+	struct kvm_irq_routing_table *irq_rt;
+	struct kvm_lapic_irq irq;
+	struct kvm_vcpu *vcpu;
+	int idx, ret = -EINVAL;
+
+	idx = srcu_read_lock(&kvm->irq_srcu);
+	irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
+	ASSERT(guest_irq < irq_rt->nr_rt_entries);
+
+	hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
+		if (e->type != KVM_IRQ_ROUTING_MSI)
+			continue;
+		/*
+		 * VT-d posted-interrupt has the following
+		 * limitations:
+		 *  - No support for posting multicast/broadcast
+		 *    interrupts to a VCPU
+		 * Still use interrupt remapping for these
+		 * kind of interrupts
+		 */
+
+		kvm_set_msi_irq(e, &irq);
+		if (!kvm_pi_find_dest_vcpu(kvm, &irq, &vcpu)) {
+			printk(KERN_INFO "%s: can not find the target VCPU\n",
+					__func__);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		if (kvm_update_pi_irte_common(kvm, vcpu, irq.vector,
+				host_irq)) {
+			printk(KERN_INFO "%s: failed to update PI IRTE\n",
+					__func__);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	ret = 0;
+out:
+	srcu_read_unlock(&kvm->irq_srcu, idx);
+	return ret;
+}
+
+int kvm_update_pi_irte(struct kvm *kvm, u32 dev_id)
+{
+	int i, rc = -1;
+	struct kvm_assigned_dev_kernel *dev;
+
+	mutex_lock(&kvm->lock);
+	dev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head, dev_id);
+	if (!dev) {
+		printk(KERN_INFO "%s: cannot find the assigned dev.\n",
+				__func__);
+		rc = -1;
+		goto out;
+	}
+
+	BUG_ON(dev->irq_requested_type == 0);
+
+	if ((dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI) &&
+		(dev->dev->msi_enabled == 1)) {
+			__kvm_update_pi_irte(kvm,
+					dev->host_irq, dev->guest_irq);
+	} else if ((dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) &&
+		(dev->dev->msix_enabled == 1)) {
+		for (i = 0; i < dev->entries_nr; i++) {
+			__kvm_update_pi_irte(kvm,
+					dev->host_msix_entries[i].vector,
+					dev->guest_msix_entries[i].vector);
+		}
+	}
+
+out:
+	rc = 0;
+	mutex_unlock(&kvm->lock);
+	return rc;
+}
+
 static int assigned_device_enable_host_intx(struct kvm *kvm,
 					    struct kvm_assigned_dev_kernel *dev)
 {
@@ -1017,6 +1146,18 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
 		r = kvm_vm_ioctl_set_pci_irq_mask(kvm, &assigned_dev);
 		break;
 	}
+	case KVM_ASSIGN_DEV_PI_UPDATE: {
+		u32 dev_id;
+
+		r = -EFAULT;
+		if (copy_from_user(&dev_id, argp, sizeof(dev_id)))
+			goto out;
+		r = kvm_update_pi_irte(kvm, dev_id);
+		if (r)
+			goto out;
+		break;
+
+	}
 	default:
 		r = -ENOTTY;
 		break;
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 963b899..f51aed3 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -55,7 +55,7 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
 				line_status);
 }
 
-inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
+bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
 {
 #ifdef CONFIG_IA64
 	return irq->delivery_mode ==
@@ -106,7 +106,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 	return r;
 }
 
-static inline void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
+void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
 				   struct kvm_lapic_irq *irq)
 {
 	trace_kvm_msi_set_irq(e->msi.address_lo, e->msi.data);
diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
index 7f256f3..cdf29a6 100644
--- a/virt/kvm/irqchip.c
+++ b/virt/kvm/irqchip.c
@@ -31,17 +31,6 @@
 #include <trace/events/kvm.h>
 #include "irq.h"
 
-struct kvm_irq_routing_table {
-	int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
-	struct kvm_kernel_irq_routing_entry *rt_entries;
-	u32 nr_rt_entries;
-	/*
-	 * Array indexed by gsi. Each entry contains list of irq chips
-	 * the gsi is connected to.
-	 */
-	struct hlist_head map[0];
-};
-
 int kvm_irq_map_gsi(struct kvm *kvm,
 		    struct kvm_kernel_irq_routing_entry *entries, int gsi)
 {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb-DgEjT+Ai2ygdnm+yROfE0A, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

When guest changes its interrupt configuration (such as, vector, etc.)
for direct-assigned devices, we need to update the associated IRTE
with the new guest vector, so external interrupts from the assigned
devices can be injected to guests without VM-Exit.

The current method of handling guest lowest priority interrtups
is to use a counter 'apic_arb_prio' for each VCPU, we choose the
VCPU with smallest 'apic_arb_prio' and then increase it by 1.
However, for VT-d PI, we cannot re-use this, since we no longer
have control to 'apic_arb_prio' with posted interrupt direct
delivery by Hardware.

Here, we introduce a similiar way with 'apic_arb_prio' to handle
guest lowest priority interrtups when VT-d PI is used. Here is the
ideas:
- Each VCPU has a counter 'round_robin_counter'.
- When guests sets an interrupts to lowest priority, we choose
the VCPU with smallest 'round_robin_counter' as the destination,
then increase it.

Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 arch/x86/include/asm/irq_remapping.h |    6 ++
 arch/x86/include/asm/kvm_host.h      |    2 +
 arch/x86/kvm/vmx.c                   |   12 +++
 arch/x86/kvm/x86.c                   |   11 +++
 drivers/iommu/amd_iommu.c            |    6 ++
 drivers/iommu/intel_irq_remapping.c  |   28 +++++++
 drivers/iommu/irq_remapping.c        |    9 ++
 drivers/iommu/irq_remapping.h        |    3 +
 include/linux/dmar.h                 |   26 ++++++
 include/linux/kvm_host.h             |   22 +++++
 include/uapi/linux/kvm.h             |    1 +
 virt/kvm/assigned-dev.c              |  141 ++++++++++++++++++++++++++++++++++
 virt/kvm/irq_comm.c                  |    4 +-
 virt/kvm/irqchip.c                   |   11 ---
 14 files changed, 269 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index a3cc437..32d6cc4 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -51,6 +51,7 @@ extern void compose_remapped_msi_msg(struct pci_dev *pdev,
 				     unsigned int irq, unsigned int dest,
 				     struct msi_msg *msg, u8 hpet_id);
 extern int setup_hpet_msi_remapped(unsigned int irq, unsigned int id);
+extern int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32 vector);
 extern void panic_if_irq_remap(const char *msg);
 extern bool setup_remapped_irq(int irq,
 			       struct irq_cfg *cfg,
@@ -88,6 +89,11 @@ static inline int setup_hpet_msi_remapped(unsigned int irq, unsigned int id)
 	return -ENODEV;
 }
 
+static inline int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32 vector)
+{
+	return -ENODEV;
+}
+
 static inline void panic_if_irq_remap(const char *msg)
 {
 }
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6ed0c30..0630161 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -358,6 +358,7 @@ struct kvm_vcpu_arch {
 	struct kvm_lapic *apic;    /* kernel irqchip context */
 	unsigned long apic_attention;
 	int32_t apic_arb_prio;
+	int32_t round_robin_counter;
 	int mp_state;
 	u64 ia32_misc_enable_msr;
 	bool tpr_access_reporting;
@@ -771,6 +772,7 @@ struct kvm_x86_ops {
 	int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
 
 	void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
+	u64 (*get_pi_desc_addr)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a4670d3..ae91b72 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -544,6 +544,11 @@ static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
+{
+	return &(to_vmx(vcpu)->pi_desc);
+}
+
 #define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
 #define FIELD(number, name)	[number] = VMCS12_OFFSET(name)
 #define FIELD64(number, name)	[number] = VMCS12_OFFSET(name), \
@@ -4280,6 +4285,11 @@ static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu *vcpu)
 	return;
 }
 
+static u64 vmx_get_pi_desc_addr(struct kvm_vcpu *vcpu)
+{
+	return __pa((u64)vcpu_to_pi_desc(vcpu));
+}
+
 /*
  * Set up the vmcs's constant host-state fields, i.e., host-state fields that
  * will not change in the lifetime of the guest.
@@ -9232,6 +9242,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.check_nested_events = vmx_check_nested_events,
 
 	.sched_in = vmx_sched_in,
+
+	.get_pi_desc_addr = vmx_get_pi_desc_addr,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b447a98..0c19d15 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7735,6 +7735,17 @@ bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_arch_has_noncoherent_dma);
 
+int kvm_update_pi_irte_common(struct kvm *kvm, struct kvm_vcpu *vcpu,
+			u32 guest_vector, int host_irq)
+{
+	u64 pi_desc_addr = kvm_x86_ops->get_pi_desc_addr(vcpu);
+
+	if (update_pi_irte(host_irq, pi_desc_addr, guest_vector))
+		return -1;
+
+	return 0;
+}
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 505a9ad..a36fdc7 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -4280,6 +4280,11 @@ static int alloc_hpet_msi(unsigned int irq, unsigned int id)
 	return 0;
 }
 
+static int dummy_update_pi_irte(int irq, u64 pi_desc_addr, u32 vector)
+{
+	return -EINVAL;
+}
+
 struct irq_remap_ops amd_iommu_irq_ops = {
 	.supported		= amd_iommu_supported,
 	.prepare		= amd_iommu_prepare,
@@ -4294,5 +4299,6 @@ struct irq_remap_ops amd_iommu_irq_ops = {
 	.msi_alloc_irq		= msi_alloc_irq,
 	.msi_setup_irq		= msi_setup_irq,
 	.alloc_hpet_msi		= alloc_hpet_msi,
+	.update_pi_irte         = dummy_update_pi_irte,
 };
 #endif
diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 776da10..87c02fe 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -1172,6 +1172,33 @@ static int intel_alloc_hpet_msi(unsigned int irq, unsigned int id)
 	return ret;
 }
 
+static int intel_update_pi_irte(int irq, u64 pi_desc_addr, u32 vector)
+{
+	struct irte irte;
+
+	if (get_irte(irq, &irte))
+		return -1;
+
+	irte.irq_post_low.urg = 0;
+	irte.irq_post_low.vector = vector;
+	irte.irq_post_low.pda_l = (pi_desc_addr >> (32 - PDA_LOW_BIT)) &
+			~(-1UL << PDA_LOW_BIT);
+	irte.irq_post_high.pda_h = (pi_desc_addr >> 32) &
+			~(-1UL << PDA_HIGH_BIT);
+
+	irte.irq_post_low.__reserved_1 = 0;
+	irte.irq_post_low.__reserved_2 = 0;
+	irte.irq_post_low.__reserved_3 = 0;
+	irte.irq_post_high.__reserved_4 = 0;
+
+	irte.irq_post_low.pst = 1;
+
+	if (modify_irte(irq, &irte))
+		return -1;
+
+	return 0;
+}
+
 struct irq_remap_ops intel_irq_remap_ops = {
 	.supported		= intel_irq_remapping_supported,
 	.prepare		= dmar_table_init,
@@ -1186,4 +1213,5 @@ struct irq_remap_ops intel_irq_remap_ops = {
 	.msi_alloc_irq		= intel_msi_alloc_irq,
 	.msi_setup_irq		= intel_msi_setup_irq,
 	.alloc_hpet_msi		= intel_alloc_hpet_msi,
+	.update_pi_irte         = intel_update_pi_irte,
 };
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 2f8ee00..0e36860 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -362,6 +362,15 @@ int setup_hpet_msi_remapped(unsigned int irq, unsigned int id)
 	return default_setup_hpet_msi(irq, id);
 }
 
+int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32 vector)
+{
+	if (!remap_ops || !remap_ops->update_pi_irte)
+		return -ENODEV;
+
+	return remap_ops->update_pi_irte(irq, pi_desc_addr, vector);
+}
+EXPORT_SYMBOL_GPL(update_pi_irte);
+
 void panic_if_irq_remap(const char *msg)
 {
 	if (irq_remapping_enabled)
diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
index 7bb5913..2d8f740 100644
--- a/drivers/iommu/irq_remapping.h
+++ b/drivers/iommu/irq_remapping.h
@@ -84,6 +84,9 @@ struct irq_remap_ops {
 
 	/* Setup interrupt remapping for an HPET MSI */
 	int (*alloc_hpet_msi)(unsigned int, unsigned int);
+
+	/* Update IRTE for posted-interrupt */
+	int (*update_pi_irte)(int irq, u64 pi_desc_addr, u32 vector);
 };
 
 extern struct irq_remap_ops intel_irq_remap_ops;
diff --git a/include/linux/dmar.h b/include/linux/dmar.h
index 8be5d42..e1ff4f7 100644
--- a/include/linux/dmar.h
+++ b/include/linux/dmar.h
@@ -160,6 +160,20 @@ struct irte {
 				__reserved_2	: 8,
 				dest_id		: 32;
 		} irq_remap_low;
+
+		struct {
+			__u64   present		: 1,
+				fpd		: 1,
+				__reserved_1	: 6,
+				avail	: 4,
+				__reserved_2	: 2,
+				urg		: 1,
+				pst		: 1,
+				vector	: 8,
+				__reserved_3	: 14,
+				pda_l	: 26;
+		} irq_post_low;
+
 		__u64 low;
 	};
 
@@ -170,10 +184,22 @@ struct irte {
 				svt		: 2,
 				__reserved_3	: 44;
 		} irq_remap_high;
+
+		struct {
+			__u64	sid:	16,
+				sq:		2,
+				svt:	2,
+				__reserved_4:	12,
+				pda_h:	32;
+		} irq_post_high;
+
 		__u64 high;
 	};
 };
 
+#define PDA_LOW_BIT    26
+#define PDA_HIGH_BIT   32
+
 enum {
 	IRQ_REMAP_XAPIC_MODE,
 	IRQ_REMAP_X2APIC_MODE,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ea53b04..6bb8287 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -335,6 +335,25 @@ struct kvm_kernel_irq_routing_entry {
 	struct hlist_node link;
 };
 
+#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
+
+struct kvm_irq_routing_table {
+	int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
+	struct kvm_kernel_irq_routing_entry *rt_entries;
+	u32 nr_rt_entries;
+	/*
+	 * Array indexed by gsi. Each entry contains list of irq chips
+	 * the gsi is connected to.
+	 */
+	struct hlist_head map[0];
+};
+
+#else
+
+struct kvm_irq_routing_table {};
+
+#endif
+
 #ifndef KVM_PRIVATE_MEM_SLOTS
 #define KVM_PRIVATE_MEM_SLOTS 0
 #endif
@@ -766,6 +785,9 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
 				   struct kvm_irq_ack_notifier *kian);
 int kvm_request_irq_source_id(struct kvm *kvm);
 void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
+void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
+				   struct kvm_lapic_irq *irq);
+bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq);
 
 #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT
 int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 7593c52..509223a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1027,6 +1027,7 @@ struct kvm_s390_ucas_mapping {
 #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct kvm_xen_hvm_config)
 #define KVM_SET_CLOCK             _IOW(KVMIO,  0x7b, struct kvm_clock_data)
 #define KVM_GET_CLOCK             _IOR(KVMIO,  0x7c, struct kvm_clock_data)
+#define KVM_ASSIGN_DEV_PI_UPDATE  _IOR(KVMIO,  0x7d, __u32)
 /* Available with KVM_CAP_PIT_STATE2 */
 #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct kvm_pit_state2)
 #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index e05000e..e154009 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -326,6 +326,135 @@ void kvm_free_all_assigned_devices(struct kvm *kvm)
 	}
 }
 
+int __weak kvm_update_pi_irte_common(struct kvm *kvm, struct kvm_vcpu *vcpu,
+					u32 guest_vector, int host_irq)
+{
+	return 0;
+}
+
+int kvm_compare_rr_counter(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
+{
+	return vcpu1->arch.round_robin_counter -
+			vcpu2->arch.round_robin_counter;
+}
+
+bool kvm_pi_find_dest_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
+				struct kvm_vcpu **dest_vcpu)
+{
+	int i, r = 0;
+	struct kvm_vcpu *vcpu, *dest = NULL;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (!kvm_apic_present(vcpu))
+			continue;
+
+		if (!kvm_apic_match_dest(vcpu, NULL, irq->shorthand,
+					irq->dest_id, irq->dest_mode))
+			continue;
+
+		if (!kvm_is_dm_lowest_prio(irq)) {
+			r++;
+			*dest_vcpu = vcpu;
+		} else if (kvm_lapic_enabled(vcpu)) {
+			if (!dest)
+				dest = vcpu;
+			else if (kvm_compare_rr_counter(vcpu, dest) < 0)
+				dest = vcpu;
+		}
+	}
+
+	if (dest) {
+		dest->arch.round_robin_counter++;
+		*dest_vcpu = dest;
+		return true;
+	} else if (r == 1)
+		return true;
+
+	return false;
+}
+
+static int __kvm_update_pi_irte(struct kvm *kvm, int host_irq, int guest_irq)
+{
+	struct kvm_kernel_irq_routing_entry *e;
+	struct kvm_irq_routing_table *irq_rt;
+	struct kvm_lapic_irq irq;
+	struct kvm_vcpu *vcpu;
+	int idx, ret = -EINVAL;
+
+	idx = srcu_read_lock(&kvm->irq_srcu);
+	irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
+	ASSERT(guest_irq < irq_rt->nr_rt_entries);
+
+	hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
+		if (e->type != KVM_IRQ_ROUTING_MSI)
+			continue;
+		/*
+		 * VT-d posted-interrupt has the following
+		 * limitations:
+		 *  - No support for posting multicast/broadcast
+		 *    interrupts to a VCPU
+		 * Still use interrupt remapping for these
+		 * kind of interrupts
+		 */
+
+		kvm_set_msi_irq(e, &irq);
+		if (!kvm_pi_find_dest_vcpu(kvm, &irq, &vcpu)) {
+			printk(KERN_INFO "%s: can not find the target VCPU\n",
+					__func__);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		if (kvm_update_pi_irte_common(kvm, vcpu, irq.vector,
+				host_irq)) {
+			printk(KERN_INFO "%s: failed to update PI IRTE\n",
+					__func__);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	ret = 0;
+out:
+	srcu_read_unlock(&kvm->irq_srcu, idx);
+	return ret;
+}
+
+int kvm_update_pi_irte(struct kvm *kvm, u32 dev_id)
+{
+	int i, rc = -1;
+	struct kvm_assigned_dev_kernel *dev;
+
+	mutex_lock(&kvm->lock);
+	dev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head, dev_id);
+	if (!dev) {
+		printk(KERN_INFO "%s: cannot find the assigned dev.\n",
+				__func__);
+		rc = -1;
+		goto out;
+	}
+
+	BUG_ON(dev->irq_requested_type == 0);
+
+	if ((dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI) &&
+		(dev->dev->msi_enabled == 1)) {
+			__kvm_update_pi_irte(kvm,
+					dev->host_irq, dev->guest_irq);
+	} else if ((dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) &&
+		(dev->dev->msix_enabled == 1)) {
+		for (i = 0; i < dev->entries_nr; i++) {
+			__kvm_update_pi_irte(kvm,
+					dev->host_msix_entries[i].vector,
+					dev->guest_msix_entries[i].vector);
+		}
+	}
+
+out:
+	rc = 0;
+	mutex_unlock(&kvm->lock);
+	return rc;
+}
+
 static int assigned_device_enable_host_intx(struct kvm *kvm,
 					    struct kvm_assigned_dev_kernel *dev)
 {
@@ -1017,6 +1146,18 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
 		r = kvm_vm_ioctl_set_pci_irq_mask(kvm, &assigned_dev);
 		break;
 	}
+	case KVM_ASSIGN_DEV_PI_UPDATE: {
+		u32 dev_id;
+
+		r = -EFAULT;
+		if (copy_from_user(&dev_id, argp, sizeof(dev_id)))
+			goto out;
+		r = kvm_update_pi_irte(kvm, dev_id);
+		if (r)
+			goto out;
+		break;
+
+	}
 	default:
 		r = -ENOTTY;
 		break;
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 963b899..f51aed3 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -55,7 +55,7 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
 				line_status);
 }
 
-inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
+bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
 {
 #ifdef CONFIG_IA64
 	return irq->delivery_mode ==
@@ -106,7 +106,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 	return r;
 }
 
-static inline void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
+void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
 				   struct kvm_lapic_irq *irq)
 {
 	trace_kvm_msi_set_irq(e->msi.address_lo, e->msi.data);
diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
index 7f256f3..cdf29a6 100644
--- a/virt/kvm/irqchip.c
+++ b/virt/kvm/irqchip.c
@@ -31,17 +31,6 @@
 #include <trace/events/kvm.h>
 #include "irq.h"
 
-struct kvm_irq_routing_table {
-	int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
-	struct kvm_kernel_irq_routing_entry *rt_entries;
-	u32 nr_rt_entries;
-	/*
-	 * Array indexed by gsi. Each entry contains list of irq chips
-	 * the gsi is connected to.
-	 */
-	struct hlist_head map[0];
-};
-
 int kvm_irq_map_gsi(struct kvm *kvm,
 		    struct kvm_kernel_irq_routing_entry *entries, int gsi)
 {
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 06/13] KVM: Add some helper functions for Posted-Interrupts
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel, Feng Wu

This patch adds three helper functions to manipulate the Posted-
Interrtups Decriptor.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 arch/x86/kvm/vmx.c |   18 ++++++++++++++++++
 1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ae91b72..f41111f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -435,6 +435,24 @@ static void pi_clear_sn(struct pi_desc *pi_desc)
 			(unsigned long *)&pi_desc->control);
 }
 
+static void pi_set_sn(struct pi_desc *pi_desc)
+{
+	return set_bit(POSTED_INTR_SN,
+			(unsigned long *)&pi_desc->control);
+}
+
+static int pi_test_on(struct pi_desc *pi_desc)
+{
+	return test_bit(POSTED_INTR_ON,
+			(unsigned long *)&pi_desc->control);
+}
+
+static int pi_test_sn(struct pi_desc *pi_desc)
+{
+	return test_bit(POSTED_INTR_SN,
+			(unsigned long *)&pi_desc->control);
+}
+
 static bool pi_test_and_set_on(struct pi_desc *pi_desc)
 {
 	return test_and_set_bit(POSTED_INTR_ON,
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 06/13] KVM: Add some helper functions for Posted-Interrupts
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb-DgEjT+Ai2ygdnm+yROfE0A, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

This patch adds three helper functions to manipulate the Posted-
Interrtups Decriptor.

Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 arch/x86/kvm/vmx.c |   18 ++++++++++++++++++
 1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ae91b72..f41111f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -435,6 +435,24 @@ static void pi_clear_sn(struct pi_desc *pi_desc)
 			(unsigned long *)&pi_desc->control);
 }
 
+static void pi_set_sn(struct pi_desc *pi_desc)
+{
+	return set_bit(POSTED_INTR_SN,
+			(unsigned long *)&pi_desc->control);
+}
+
+static int pi_test_on(struct pi_desc *pi_desc)
+{
+	return test_bit(POSTED_INTR_ON,
+			(unsigned long *)&pi_desc->control);
+}
+
+static int pi_test_sn(struct pi_desc *pi_desc)
+{
+	return test_bit(POSTED_INTR_SN,
+			(unsigned long *)&pi_desc->control);
+}
+
 static bool pi_test_and_set_on(struct pi_desc *pi_desc)
 {
 	return test_and_set_bit(POSTED_INTR_ON,
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 07/13] x86, irq: Define a global vector for VT-d Posted-Interrupts
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel, Feng Wu

Currently, we use a global vector as the Posted-Interrupts
Notification Event for all the VCPUs in the system. We need
to introduce another global vector for VT-d Posted-Interrtups,
which will be used to wakeup the sleep VCPU when an external
interrupt from a direct-assigned device happens for that VCPU.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 arch/x86/include/asm/entry_arch.h  |    2 ++
 arch/x86/include/asm/hardirq.h     |    1 +
 arch/x86/include/asm/hw_irq.h      |    2 ++
 arch/x86/include/asm/irq_vectors.h |    1 +
 arch/x86/kernel/entry_64.S         |    2 ++
 arch/x86/kernel/irq.c              |   27 +++++++++++++++++++++++++++
 arch/x86/kernel/irqinit.c          |    2 ++
 7 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index dc5fa66..27ca0af 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -23,6 +23,8 @@ BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
 #ifdef CONFIG_HAVE_KVM
 BUILD_INTERRUPT3(kvm_posted_intr_ipi, POSTED_INTR_VECTOR,
 		 smp_kvm_posted_intr_ipi)
+BUILD_INTERRUPT3(kvm_posted_intr_wakeup_ipi, POSTED_INTR_WAKEUP_VECTOR,
+		 smp_kvm_posted_intr_wakeup_ipi)
 #endif
 
 /*
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 0f5fb6b..9866065 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -14,6 +14,7 @@ typedef struct {
 #endif
 #ifdef CONFIG_HAVE_KVM
 	unsigned int kvm_posted_intr_ipis;
+	unsigned int kvm_posted_intr_wakeup_ipis;
 #endif
 	unsigned int x86_platform_ipis;	/* arch dependent */
 	unsigned int apic_perf_irqs;
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 4615906..559563c 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -29,6 +29,7 @@
 extern asmlinkage void apic_timer_interrupt(void);
 extern asmlinkage void x86_platform_ipi(void);
 extern asmlinkage void kvm_posted_intr_ipi(void);
+extern asmlinkage void kvm_posted_intr_wakeup_ipi(void);
 extern asmlinkage void error_interrupt(void);
 extern asmlinkage void irq_work_interrupt(void);
 
@@ -92,6 +93,7 @@ extern void trace_call_function_single_interrupt(void);
 #define trace_irq_move_cleanup_interrupt  irq_move_cleanup_interrupt
 #define trace_reboot_interrupt  reboot_interrupt
 #define trace_kvm_posted_intr_ipi kvm_posted_intr_ipi
+#define trace_kvm_posted_intr_wakeup_ipi kvm_posted_intr_wakeup_ipi
 #endif /* CONFIG_TRACING */
 
 /* IOAPIC */
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 5702d7e..1343349 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -105,6 +105,7 @@
 /* Vector for KVM to deliver posted interrupt IPI */
 #ifdef CONFIG_HAVE_KVM
 #define POSTED_INTR_VECTOR		0xf2
+#define POSTED_INTR_WAKEUP_VECTOR	0xf1
 #endif
 
 /*
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index df088bb..7663aaa 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1004,6 +1004,8 @@ apicinterrupt X86_PLATFORM_IPI_VECTOR \
 #ifdef CONFIG_HAVE_KVM
 apicinterrupt3 POSTED_INTR_VECTOR \
 	kvm_posted_intr_ipi smp_kvm_posted_intr_ipi
+apicinterrupt3 POSTED_INTR_WAKEUP_VECTOR \
+	kvm_posted_intr_wakeup_ipi smp_kvm_posted_intr_wakeup_ipi
 #endif
 
 #ifdef CONFIG_X86_MCE_THRESHOLD
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 922d285..47408c3 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -237,6 +237,9 @@ __visible void smp_x86_platform_ipi(struct pt_regs *regs)
 }
 
 #ifdef CONFIG_HAVE_KVM
+void (*wakeup_handler_callback)(void) = NULL;
+EXPORT_SYMBOL_GPL(wakeup_handler_callback);
+
 /*
  * Handler for POSTED_INTERRUPT_VECTOR.
  */
@@ -256,6 +259,30 @@ __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
 
 	set_irq_regs(old_regs);
 }
+
+/*
+ * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
+ */
+__visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+
+	ack_APIC_irq();
+
+	irq_enter();
+
+	exit_idle();
+
+	inc_irq_stat(kvm_posted_intr_wakeup_ipis);
+
+	if (wakeup_handler_callback)
+		wakeup_handler_callback();
+
+	irq_exit();
+
+	set_irq_regs(old_regs);
+}
+
 #endif
 
 __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs)
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index 4de73ee..659cde3 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -168,6 +168,8 @@ static void __init apic_intr_init(void)
 #ifdef CONFIG_HAVE_KVM
 	/* IPI for KVM to deliver posted interrupt */
 	alloc_intr_gate(POSTED_INTR_VECTOR, kvm_posted_intr_ipi);
+	/* IPI for KVM to deliver interrupt to wake up tasks */
+	alloc_intr_gate(POSTED_INTR_WAKEUP_VECTOR, kvm_posted_intr_wakeup_ipi);
 #endif
 
 	/* IPI vectors for APIC spurious and error interrupts */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 07/13] x86, irq: Define a global vector for VT-d Posted-Interrupts
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb-DgEjT+Ai2ygdnm+yROfE0A, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

Currently, we use a global vector as the Posted-Interrupts
Notification Event for all the VCPUs in the system. We need
to introduce another global vector for VT-d Posted-Interrtups,
which will be used to wakeup the sleep VCPU when an external
interrupt from a direct-assigned device happens for that VCPU.

Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 arch/x86/include/asm/entry_arch.h  |    2 ++
 arch/x86/include/asm/hardirq.h     |    1 +
 arch/x86/include/asm/hw_irq.h      |    2 ++
 arch/x86/include/asm/irq_vectors.h |    1 +
 arch/x86/kernel/entry_64.S         |    2 ++
 arch/x86/kernel/irq.c              |   27 +++++++++++++++++++++++++++
 arch/x86/kernel/irqinit.c          |    2 ++
 7 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index dc5fa66..27ca0af 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -23,6 +23,8 @@ BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
 #ifdef CONFIG_HAVE_KVM
 BUILD_INTERRUPT3(kvm_posted_intr_ipi, POSTED_INTR_VECTOR,
 		 smp_kvm_posted_intr_ipi)
+BUILD_INTERRUPT3(kvm_posted_intr_wakeup_ipi, POSTED_INTR_WAKEUP_VECTOR,
+		 smp_kvm_posted_intr_wakeup_ipi)
 #endif
 
 /*
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 0f5fb6b..9866065 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -14,6 +14,7 @@ typedef struct {
 #endif
 #ifdef CONFIG_HAVE_KVM
 	unsigned int kvm_posted_intr_ipis;
+	unsigned int kvm_posted_intr_wakeup_ipis;
 #endif
 	unsigned int x86_platform_ipis;	/* arch dependent */
 	unsigned int apic_perf_irqs;
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 4615906..559563c 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -29,6 +29,7 @@
 extern asmlinkage void apic_timer_interrupt(void);
 extern asmlinkage void x86_platform_ipi(void);
 extern asmlinkage void kvm_posted_intr_ipi(void);
+extern asmlinkage void kvm_posted_intr_wakeup_ipi(void);
 extern asmlinkage void error_interrupt(void);
 extern asmlinkage void irq_work_interrupt(void);
 
@@ -92,6 +93,7 @@ extern void trace_call_function_single_interrupt(void);
 #define trace_irq_move_cleanup_interrupt  irq_move_cleanup_interrupt
 #define trace_reboot_interrupt  reboot_interrupt
 #define trace_kvm_posted_intr_ipi kvm_posted_intr_ipi
+#define trace_kvm_posted_intr_wakeup_ipi kvm_posted_intr_wakeup_ipi
 #endif /* CONFIG_TRACING */
 
 /* IOAPIC */
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 5702d7e..1343349 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -105,6 +105,7 @@
 /* Vector for KVM to deliver posted interrupt IPI */
 #ifdef CONFIG_HAVE_KVM
 #define POSTED_INTR_VECTOR		0xf2
+#define POSTED_INTR_WAKEUP_VECTOR	0xf1
 #endif
 
 /*
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index df088bb..7663aaa 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1004,6 +1004,8 @@ apicinterrupt X86_PLATFORM_IPI_VECTOR \
 #ifdef CONFIG_HAVE_KVM
 apicinterrupt3 POSTED_INTR_VECTOR \
 	kvm_posted_intr_ipi smp_kvm_posted_intr_ipi
+apicinterrupt3 POSTED_INTR_WAKEUP_VECTOR \
+	kvm_posted_intr_wakeup_ipi smp_kvm_posted_intr_wakeup_ipi
 #endif
 
 #ifdef CONFIG_X86_MCE_THRESHOLD
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 922d285..47408c3 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -237,6 +237,9 @@ __visible void smp_x86_platform_ipi(struct pt_regs *regs)
 }
 
 #ifdef CONFIG_HAVE_KVM
+void (*wakeup_handler_callback)(void) = NULL;
+EXPORT_SYMBOL_GPL(wakeup_handler_callback);
+
 /*
  * Handler for POSTED_INTERRUPT_VECTOR.
  */
@@ -256,6 +259,30 @@ __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
 
 	set_irq_regs(old_regs);
 }
+
+/*
+ * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
+ */
+__visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+
+	ack_APIC_irq();
+
+	irq_enter();
+
+	exit_idle();
+
+	inc_irq_stat(kvm_posted_intr_wakeup_ipis);
+
+	if (wakeup_handler_callback)
+		wakeup_handler_callback();
+
+	irq_exit();
+
+	set_irq_regs(old_regs);
+}
+
 #endif
 
 __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs)
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index 4de73ee..659cde3 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -168,6 +168,8 @@ static void __init apic_intr_init(void)
 #ifdef CONFIG_HAVE_KVM
 	/* IPI for KVM to deliver posted interrupt */
 	alloc_intr_gate(POSTED_INTR_VECTOR, kvm_posted_intr_ipi);
+	/* IPI for KVM to deliver interrupt to wake up tasks */
+	alloc_intr_gate(POSTED_INTR_WAKEUP_VECTOR, kvm_posted_intr_wakeup_ipi);
 #endif
 
 	/* IPI vectors for APIC spurious and error interrupts */
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 08/13] KVM: Update Posted-Interrupts descriptor during VCPU scheduling
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel, Feng Wu

Update Posted-Interrupts descriptor according to the
following rules:
- Before VCPU block, set 'NV' to POSTED_INTR_WAKEUP_VECTOR
- After VCPU block, set 'NV' back to POSTED_INTR_VECTOR

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 arch/x86/include/asm/kvm_host.h |    5 ++
 arch/x86/kvm/vmx.c              |   83 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              |   16 +++++++
 virt/kvm/kvm_main.c             |   11 +++++
 4 files changed, 115 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0630161..71cfe3e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -773,6 +773,8 @@ struct kvm_x86_ops {
 
 	void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
 	u64 (*get_pi_desc_addr)(struct kvm_vcpu *vcpu);
+	int (*vcpu_pre_block)(struct kvm_vcpu *vcpu);
+	void (*vcpu_post_block)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_arch_async_pf {
@@ -1095,4 +1097,7 @@ int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
 void kvm_handle_pmu_event(struct kvm_vcpu *vcpu);
 void kvm_deliver_pmi(struct kvm_vcpu *vcpu);
 
+int kvm_arch_vcpu_pre_block(struct kvm_vcpu *vcpu);
+void kvm_arch_vcpu_post_block(struct kvm_vcpu *vcpu);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f41111f..4c1a966 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9153,6 +9153,86 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 		shrink_ple_window(vcpu);
 }
 
+static int vmx_vcpu_pre_block(struct kvm_vcpu *vcpu)
+{
+	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+	struct pi_desc old;
+	struct pi_desc new;
+
+	if (!irq_post_enabled)
+		return 0;
+
+	memset(&old, 0, sizeof(old));
+	memset(&new, 0, sizeof(new));
+
+	do {
+		old.control = new.control = pi_desc->control;
+
+		/*
+		 * A posted-interrupt happened in the one of the
+		 * following two cases:
+		 * 1. After the latest pir-to-virr sync operation
+		 * in kvm_arch_vcpu_runnable() function
+		 * 2. In this do-while() loop, a posted-interrupt
+		 * occurs.
+		 *
+		 * For either of above cases, we should not block
+		 * the VCPU.
+		 */
+		if (pi_test_on(pi_desc) == 1) {
+			/*
+			 * Need to set this flag, then the inject will
+			 * be synced from PIR to vIRR before VM-ENTRY.
+			 * In fact, for guest IPI case, in function
+			 * vmx_deliver_posted_interrupt(), this flags
+			 * has already been set, but if the interrupt
+			 * is injected by VT-d PI hardware, we need
+			 * to set this.
+			 */
+			kvm_make_request(KVM_REQ_EVENT, vcpu);
+			return 1;
+		}
+
+		pi_clear_sn(&new);
+
+		/* set 'NV' to 'wakeup vector' */
+		new.nv = POSTED_INTR_WAKEUP_VECTOR;
+	} while (cmpxchg(&pi_desc->control, old.control, new.control)
+			!= old.control);
+
+	return 0;
+}
+
+static void vmx_vcpu_post_block(struct kvm_vcpu *vcpu)
+{
+	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+	struct pi_desc old;
+	struct pi_desc new;
+	unsigned int dest = 0;
+
+	if (!irq_post_enabled)
+		return;
+
+	pi_set_sn(pi_desc);
+
+	do {
+		old.control = new.control = pi_desc->control;
+
+		dest = cpu_physical_id(vcpu->cpu);
+
+		if (x2apic_mode)
+			new.ndst = dest;
+		else
+			new.ndst = (dest << 8) & 0xFF00;
+
+		/* set 'NV' to 'notification vector' */
+		new.nv = POSTED_INTR_VECTOR;
+	} while (cmpxchg(&pi_desc->control, old.control, new.control)
+			!= old.control);
+
+	pi_clear_sn(pi_desc);
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -9262,6 +9342,9 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.sched_in = vmx_sched_in,
 
 	.get_pi_desc_addr = vmx_get_pi_desc_addr,
+
+	.vcpu_pre_block = vmx_vcpu_pre_block,
+	.vcpu_post_block = vmx_vcpu_post_block,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c19d15..d0c8bb2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7746,6 +7746,22 @@ int kvm_update_pi_irte_common(struct kvm *kvm, struct kvm_vcpu *vcpu,
 	return 0;
 }
 
+int kvm_arch_vcpu_pre_block(struct kvm_vcpu *vcpu)
+{
+	if (kvm_x86_ops->vcpu_pre_block)
+		return kvm_x86_ops->vcpu_pre_block(vcpu);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_arch_vcpu_pre_block);
+
+void kvm_arch_vcpu_post_block(struct kvm_vcpu *vcpu)
+{
+	if (kvm_x86_ops->vcpu_post_block)
+		kvm_x86_ops->vcpu_post_block(vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_arch_vcpu_post_block);
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 25ffac9..1be1a45 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1754,7 +1754,18 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 		if (signal_pending(current))
 			break;
 
+#ifdef CONFIG_X86
+		if (kvm_arch_vcpu_pre_block(vcpu) == 1) {
+			kvm_make_request(KVM_REQ_UNHALT, vcpu);
+			break;
+		}
+#endif
+
 		schedule();
+
+#ifdef CONFIG_X86
+		kvm_arch_vcpu_post_block(vcpu);
+#endif
 	}
 
 	finish_wait(&vcpu->wq, &wait);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 08/13] KVM: Update Posted-Interrupts descriptor during VCPU scheduling
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb-DgEjT+Ai2ygdnm+yROfE0A, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

Update Posted-Interrupts descriptor according to the
following rules:
- Before VCPU block, set 'NV' to POSTED_INTR_WAKEUP_VECTOR
- After VCPU block, set 'NV' back to POSTED_INTR_VECTOR

Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 arch/x86/include/asm/kvm_host.h |    5 ++
 arch/x86/kvm/vmx.c              |   83 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              |   16 +++++++
 virt/kvm/kvm_main.c             |   11 +++++
 4 files changed, 115 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0630161..71cfe3e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -773,6 +773,8 @@ struct kvm_x86_ops {
 
 	void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
 	u64 (*get_pi_desc_addr)(struct kvm_vcpu *vcpu);
+	int (*vcpu_pre_block)(struct kvm_vcpu *vcpu);
+	void (*vcpu_post_block)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_arch_async_pf {
@@ -1095,4 +1097,7 @@ int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
 void kvm_handle_pmu_event(struct kvm_vcpu *vcpu);
 void kvm_deliver_pmi(struct kvm_vcpu *vcpu);
 
+int kvm_arch_vcpu_pre_block(struct kvm_vcpu *vcpu);
+void kvm_arch_vcpu_post_block(struct kvm_vcpu *vcpu);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f41111f..4c1a966 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9153,6 +9153,86 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 		shrink_ple_window(vcpu);
 }
 
+static int vmx_vcpu_pre_block(struct kvm_vcpu *vcpu)
+{
+	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+	struct pi_desc old;
+	struct pi_desc new;
+
+	if (!irq_post_enabled)
+		return 0;
+
+	memset(&old, 0, sizeof(old));
+	memset(&new, 0, sizeof(new));
+
+	do {
+		old.control = new.control = pi_desc->control;
+
+		/*
+		 * A posted-interrupt happened in the one of the
+		 * following two cases:
+		 * 1. After the latest pir-to-virr sync operation
+		 * in kvm_arch_vcpu_runnable() function
+		 * 2. In this do-while() loop, a posted-interrupt
+		 * occurs.
+		 *
+		 * For either of above cases, we should not block
+		 * the VCPU.
+		 */
+		if (pi_test_on(pi_desc) == 1) {
+			/*
+			 * Need to set this flag, then the inject will
+			 * be synced from PIR to vIRR before VM-ENTRY.
+			 * In fact, for guest IPI case, in function
+			 * vmx_deliver_posted_interrupt(), this flags
+			 * has already been set, but if the interrupt
+			 * is injected by VT-d PI hardware, we need
+			 * to set this.
+			 */
+			kvm_make_request(KVM_REQ_EVENT, vcpu);
+			return 1;
+		}
+
+		pi_clear_sn(&new);
+
+		/* set 'NV' to 'wakeup vector' */
+		new.nv = POSTED_INTR_WAKEUP_VECTOR;
+	} while (cmpxchg(&pi_desc->control, old.control, new.control)
+			!= old.control);
+
+	return 0;
+}
+
+static void vmx_vcpu_post_block(struct kvm_vcpu *vcpu)
+{
+	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+	struct pi_desc old;
+	struct pi_desc new;
+	unsigned int dest = 0;
+
+	if (!irq_post_enabled)
+		return;
+
+	pi_set_sn(pi_desc);
+
+	do {
+		old.control = new.control = pi_desc->control;
+
+		dest = cpu_physical_id(vcpu->cpu);
+
+		if (x2apic_mode)
+			new.ndst = dest;
+		else
+			new.ndst = (dest << 8) & 0xFF00;
+
+		/* set 'NV' to 'notification vector' */
+		new.nv = POSTED_INTR_VECTOR;
+	} while (cmpxchg(&pi_desc->control, old.control, new.control)
+			!= old.control);
+
+	pi_clear_sn(pi_desc);
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -9262,6 +9342,9 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.sched_in = vmx_sched_in,
 
 	.get_pi_desc_addr = vmx_get_pi_desc_addr,
+
+	.vcpu_pre_block = vmx_vcpu_pre_block,
+	.vcpu_post_block = vmx_vcpu_post_block,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c19d15..d0c8bb2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7746,6 +7746,22 @@ int kvm_update_pi_irte_common(struct kvm *kvm, struct kvm_vcpu *vcpu,
 	return 0;
 }
 
+int kvm_arch_vcpu_pre_block(struct kvm_vcpu *vcpu)
+{
+	if (kvm_x86_ops->vcpu_pre_block)
+		return kvm_x86_ops->vcpu_pre_block(vcpu);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_arch_vcpu_pre_block);
+
+void kvm_arch_vcpu_post_block(struct kvm_vcpu *vcpu)
+{
+	if (kvm_x86_ops->vcpu_post_block)
+		kvm_x86_ops->vcpu_post_block(vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_arch_vcpu_post_block);
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 25ffac9..1be1a45 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1754,7 +1754,18 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 		if (signal_pending(current))
 			break;
 
+#ifdef CONFIG_X86
+		if (kvm_arch_vcpu_pre_block(vcpu) == 1) {
+			kvm_make_request(KVM_REQ_UNHALT, vcpu);
+			break;
+		}
+#endif
+
 		schedule();
+
+#ifdef CONFIG_X86
+		kvm_arch_vcpu_post_block(vcpu);
+#endif
 	}
 
 	finish_wait(&vcpu->wq, &wait);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 09/13] KVM: Change NDST field after VCPU scheduling
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel, Feng Wu

This patch changes the NDST filed of Posted-Interrupts
Descriptor after VCPU is scheduled to another physical
CPU.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 arch/x86/kvm/vmx.c |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4c1a966..fa77714 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1906,6 +1906,31 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
 		vmx->loaded_vmcs->cpu = cpu;
 	}
+
+	if (irq_post_enabled && (vcpu->cpu != cpu)) {
+		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+		struct pi_desc old, new;
+		unsigned int dest;
+
+		memset(&old, 0, sizeof(old));
+		memset(&new, 0, sizeof(new));
+
+		pi_set_sn(pi_desc);
+
+		do {
+			old.control = new.control = pi_desc->control;
+
+			dest = cpu_physical_id(cpu);
+
+			if (x2apic_mode)
+				new.ndst = dest;
+			else
+				new.ndst = (dest << 8) & 0xFF00;
+
+		} while (cmpxchg(&pi_desc->control, old.control,
+				new.control) != old.control);
+		pi_clear_sn(pi_desc);
+	}
 }
 
 static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 09/13] KVM: Change NDST field after VCPU scheduling
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb-DgEjT+Ai2ygdnm+yROfE0A, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

This patch changes the NDST filed of Posted-Interrupts
Descriptor after VCPU is scheduled to another physical
CPU.

Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 arch/x86/kvm/vmx.c |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4c1a966..fa77714 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1906,6 +1906,31 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
 		vmx->loaded_vmcs->cpu = cpu;
 	}
+
+	if (irq_post_enabled && (vcpu->cpu != cpu)) {
+		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+		struct pi_desc old, new;
+		unsigned int dest;
+
+		memset(&old, 0, sizeof(old));
+		memset(&new, 0, sizeof(new));
+
+		pi_set_sn(pi_desc);
+
+		do {
+			old.control = new.control = pi_desc->control;
+
+			dest = cpu_physical_id(cpu);
+
+			if (x2apic_mode)
+				new.ndst = dest;
+			else
+				new.ndst = (dest << 8) & 0xFF00;
+
+		} while (cmpxchg(&pi_desc->control, old.control,
+				new.control) != old.control);
+		pi_clear_sn(pi_desc);
+	}
 }
 
 static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 10/13] KVM: Add the handler for Wake-up Vector
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel, Feng Wu

When VCPU is blocked and an external interrupts from assigned
devices is delivered to it, VT-d Posted-Interrupts mechanism
will deliver a interrrupt to the associated physical CPU with
Wake-up Vector. In its handler, we find the destination VCPU
and wake up it.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 arch/x86/include/asm/kvm_host.h |    2 +
 arch/x86/kvm/vmx.c              |   52 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              |   22 +++++++++++-----
 include/linux/kvm_host.h        |    3 ++
 virt/kvm/kvm_main.c             |    3 ++
 5 files changed, 75 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 71cfe3e..ca231a3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -99,6 +99,8 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
 
 #define ASYNC_PF_PER_VCPU 64
 
+extern void (*wakeup_handler_callback)(void);
+
 enum kvm_reg {
 	VCPU_REGS_RAX = 0,
 	VCPU_REGS_RCX = 1,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index fa77714..51d2c8a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -822,6 +822,13 @@ static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
 static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
 static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
 
+/*
+ * We maintian a per-CPU linked-list of VCPU, so in wakeup_handler() we
+ * can find which VCPU should be waken up.
+ */
+static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
+static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
+
 static unsigned long *vmx_io_bitmap_a;
 static unsigned long *vmx_io_bitmap_b;
 static unsigned long *vmx_msr_bitmap_legacy;
@@ -2813,6 +2820,8 @@ static int hardware_enable(void)
 		return -EBUSY;
 
 	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
+	INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
+	spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
 
 	/*
 	 * Now we can enable the vmclear operation in kdump
@@ -9183,6 +9192,7 @@ static int vmx_vcpu_pre_block(struct kvm_vcpu *vcpu)
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
 	struct pi_desc old;
 	struct pi_desc new;
+	unsigned long flags;
 
 	if (!irq_post_enabled)
 		return 0;
@@ -9222,9 +9232,22 @@ static int vmx_vcpu_pre_block(struct kvm_vcpu *vcpu)
 
 		/* set 'NV' to 'wakeup vector' */
 		new.nv = POSTED_INTR_WAKEUP_VECTOR;
+
+		/*
+		 * We should save physical cpu id here, vcpu->cpu may
+		 * be changed due to preemption, in that case, this
+		 * do-while loop will run again.
+		 */
+		vcpu->wakeup_cpu = vcpu->cpu;
 	} while (cmpxchg(&pi_desc->control, old.control, new.control)
 			!= old.control);
 
+	spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
+			vcpu->wakeup_cpu), flags);
+	list_add_tail(&vcpu->blocked_vcpu_list,
+			&per_cpu(blocked_vcpu_on_cpu, vcpu->wakeup_cpu));
+	spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
+			vcpu->wakeup_cpu), flags);
 	return 0;
 }
 
@@ -9234,6 +9257,7 @@ static void vmx_vcpu_post_block(struct kvm_vcpu *vcpu)
 	struct pi_desc old;
 	struct pi_desc new;
 	unsigned int dest = 0;
+	unsigned long flags;
 
 	if (!irq_post_enabled)
 		return;
@@ -9255,6 +9279,13 @@ static void vmx_vcpu_post_block(struct kvm_vcpu *vcpu)
 	} while (cmpxchg(&pi_desc->control, old.control, new.control)
 			!= old.control);
 
+	spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
+				vcpu->wakeup_cpu), flags);
+	list_del(&vcpu->blocked_vcpu_list);
+	spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
+				vcpu->wakeup_cpu), flags);
+	vcpu->wakeup_cpu = -1;
+
 	pi_clear_sn(pi_desc);
 }
 
@@ -9372,6 +9403,25 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.vcpu_post_block = vmx_vcpu_post_block,
 };
 
+/*
+ * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
+ */
+void wakeup_handler(void)
+{
+	struct kvm_vcpu *vcpu;
+	int cpu = smp_processor_id();
+
+	spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
+	list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
+			blocked_vcpu_list) {
+		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+
+		if (pi_test_on(pi_desc) == 1)
+			kvm_vcpu_kick(vcpu);
+	}
+	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
+}
+
 static int __init vmx_init(void)
 {
 	int r, i, msr;
@@ -9486,6 +9536,8 @@ static int __init vmx_init(void)
 
 	update_ple_window_actual_max();
 
+	wakeup_handler_callback = wakeup_handler;
+
 	return 0;
 
 out7:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d0c8bb2..2061b3d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6156,6 +6156,21 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_vcpu_reload_apic_access_page(vcpu);
 	}
 
+	/*
+	 * Since posted-interrupts can be set by VT-d HW now, in this
+	 * case, KVM_REQ_EVENT is not set. We move the following
+	 * operations out of the if statement.
+	 */
+	if (kvm_lapic_enabled(vcpu)) {
+		/*
+		 * Update architecture specific hints for APIC
+		 * virtual interrupt delivery.
+		 */
+		if (kvm_x86_ops->hwapic_irr_update)
+			kvm_x86_ops->hwapic_irr_update(vcpu,
+				kvm_lapic_find_highest_irr(vcpu));
+	}
+
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
 		kvm_apic_accept_events(vcpu);
 		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
@@ -6172,13 +6187,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_x86_ops->enable_irq_window(vcpu);
 
 		if (kvm_lapic_enabled(vcpu)) {
-			/*
-			 * Update architecture specific hints for APIC
-			 * virtual interrupt delivery.
-			 */
-			if (kvm_x86_ops->hwapic_irr_update)
-				kvm_x86_ops->hwapic_irr_update(vcpu,
-					kvm_lapic_find_highest_irr(vcpu));
 			update_cr8_intercept(vcpu);
 			kvm_lapic_sync_to_vapic(vcpu);
 		}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6bb8287..614b4ba 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -239,6 +239,9 @@ struct kvm_vcpu {
 	unsigned long requests;
 	unsigned long guest_debug;
 
+	int wakeup_cpu;
+	struct list_head blocked_vcpu_list;
+
 	struct mutex mutex;
 	struct kvm_run *run;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1be1a45..fb3e504 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -224,6 +224,9 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	init_waitqueue_head(&vcpu->wq);
 	kvm_async_pf_vcpu_init(vcpu);
 
+	vcpu->wakeup_cpu = -1;
+	INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
+
 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!page) {
 		r = -ENOMEM;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 10/13] KVM: Add the handler for Wake-up Vector
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb-DgEjT+Ai2ygdnm+yROfE0A, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

When VCPU is blocked and an external interrupts from assigned
devices is delivered to it, VT-d Posted-Interrupts mechanism
will deliver a interrrupt to the associated physical CPU with
Wake-up Vector. In its handler, we find the destination VCPU
and wake up it.

Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 arch/x86/include/asm/kvm_host.h |    2 +
 arch/x86/kvm/vmx.c              |   52 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              |   22 +++++++++++-----
 include/linux/kvm_host.h        |    3 ++
 virt/kvm/kvm_main.c             |    3 ++
 5 files changed, 75 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 71cfe3e..ca231a3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -99,6 +99,8 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
 
 #define ASYNC_PF_PER_VCPU 64
 
+extern void (*wakeup_handler_callback)(void);
+
 enum kvm_reg {
 	VCPU_REGS_RAX = 0,
 	VCPU_REGS_RCX = 1,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index fa77714..51d2c8a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -822,6 +822,13 @@ static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
 static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
 static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
 
+/*
+ * We maintian a per-CPU linked-list of VCPU, so in wakeup_handler() we
+ * can find which VCPU should be waken up.
+ */
+static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
+static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
+
 static unsigned long *vmx_io_bitmap_a;
 static unsigned long *vmx_io_bitmap_b;
 static unsigned long *vmx_msr_bitmap_legacy;
@@ -2813,6 +2820,8 @@ static int hardware_enable(void)
 		return -EBUSY;
 
 	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
+	INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
+	spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
 
 	/*
 	 * Now we can enable the vmclear operation in kdump
@@ -9183,6 +9192,7 @@ static int vmx_vcpu_pre_block(struct kvm_vcpu *vcpu)
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
 	struct pi_desc old;
 	struct pi_desc new;
+	unsigned long flags;
 
 	if (!irq_post_enabled)
 		return 0;
@@ -9222,9 +9232,22 @@ static int vmx_vcpu_pre_block(struct kvm_vcpu *vcpu)
 
 		/* set 'NV' to 'wakeup vector' */
 		new.nv = POSTED_INTR_WAKEUP_VECTOR;
+
+		/*
+		 * We should save physical cpu id here, vcpu->cpu may
+		 * be changed due to preemption, in that case, this
+		 * do-while loop will run again.
+		 */
+		vcpu->wakeup_cpu = vcpu->cpu;
 	} while (cmpxchg(&pi_desc->control, old.control, new.control)
 			!= old.control);
 
+	spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
+			vcpu->wakeup_cpu), flags);
+	list_add_tail(&vcpu->blocked_vcpu_list,
+			&per_cpu(blocked_vcpu_on_cpu, vcpu->wakeup_cpu));
+	spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
+			vcpu->wakeup_cpu), flags);
 	return 0;
 }
 
@@ -9234,6 +9257,7 @@ static void vmx_vcpu_post_block(struct kvm_vcpu *vcpu)
 	struct pi_desc old;
 	struct pi_desc new;
 	unsigned int dest = 0;
+	unsigned long flags;
 
 	if (!irq_post_enabled)
 		return;
@@ -9255,6 +9279,13 @@ static void vmx_vcpu_post_block(struct kvm_vcpu *vcpu)
 	} while (cmpxchg(&pi_desc->control, old.control, new.control)
 			!= old.control);
 
+	spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
+				vcpu->wakeup_cpu), flags);
+	list_del(&vcpu->blocked_vcpu_list);
+	spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
+				vcpu->wakeup_cpu), flags);
+	vcpu->wakeup_cpu = -1;
+
 	pi_clear_sn(pi_desc);
 }
 
@@ -9372,6 +9403,25 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.vcpu_post_block = vmx_vcpu_post_block,
 };
 
+/*
+ * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
+ */
+void wakeup_handler(void)
+{
+	struct kvm_vcpu *vcpu;
+	int cpu = smp_processor_id();
+
+	spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
+	list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
+			blocked_vcpu_list) {
+		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+
+		if (pi_test_on(pi_desc) == 1)
+			kvm_vcpu_kick(vcpu);
+	}
+	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
+}
+
 static int __init vmx_init(void)
 {
 	int r, i, msr;
@@ -9486,6 +9536,8 @@ static int __init vmx_init(void)
 
 	update_ple_window_actual_max();
 
+	wakeup_handler_callback = wakeup_handler;
+
 	return 0;
 
 out7:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d0c8bb2..2061b3d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6156,6 +6156,21 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_vcpu_reload_apic_access_page(vcpu);
 	}
 
+	/*
+	 * Since posted-interrupts can be set by VT-d HW now, in this
+	 * case, KVM_REQ_EVENT is not set. We move the following
+	 * operations out of the if statement.
+	 */
+	if (kvm_lapic_enabled(vcpu)) {
+		/*
+		 * Update architecture specific hints for APIC
+		 * virtual interrupt delivery.
+		 */
+		if (kvm_x86_ops->hwapic_irr_update)
+			kvm_x86_ops->hwapic_irr_update(vcpu,
+				kvm_lapic_find_highest_irr(vcpu));
+	}
+
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
 		kvm_apic_accept_events(vcpu);
 		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
@@ -6172,13 +6187,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_x86_ops->enable_irq_window(vcpu);
 
 		if (kvm_lapic_enabled(vcpu)) {
-			/*
-			 * Update architecture specific hints for APIC
-			 * virtual interrupt delivery.
-			 */
-			if (kvm_x86_ops->hwapic_irr_update)
-				kvm_x86_ops->hwapic_irr_update(vcpu,
-					kvm_lapic_find_highest_irr(vcpu));
 			update_cr8_intercept(vcpu);
 			kvm_lapic_sync_to_vapic(vcpu);
 		}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6bb8287..614b4ba 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -239,6 +239,9 @@ struct kvm_vcpu {
 	unsigned long requests;
 	unsigned long guest_debug;
 
+	int wakeup_cpu;
+	struct list_head blocked_vcpu_list;
+
 	struct mutex mutex;
 	struct kvm_run *run;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1be1a45..fb3e504 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -224,6 +224,9 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	init_waitqueue_head(&vcpu->wq);
 	kvm_async_pf_vcpu_init(vcpu);
 
+	vcpu->wakeup_cpu = -1;
+	INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
+
 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!page) {
 		r = -ENOMEM;
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 11/13] KVM: Suppress posted-interrupt when 'SN' is set
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel, Feng Wu

Currently, we don't support urgent interrupt, all interrupts
are recognized as non-urgent interrupt, so we cannot send
posted-interrupt when 'SN' is set.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 arch/x86/kvm/vmx.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 51d2c8a..495cfbd 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4306,15 +4306,22 @@ static int vmx_vm_has_apicv(struct kvm *kvm)
 static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	int r;
+	int r, sn;
 
 	if (pi_test_and_set_pir(vector, &vmx->pi_desc))
 		return;
 
+	/*
+	 * Currently, we don't support urgent interrupt, all interrupts
+	 * are recognized as non-urgent interrupt, so we cannot send
+	 * posted-interrupt when 'SN' is set.
+	 */
+	sn = pi_test_sn(&vmx->pi_desc);
+
 	r = pi_test_and_set_on(&vmx->pi_desc);
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 #ifdef CONFIG_SMP
-	if (!r && (vcpu->mode == IN_GUEST_MODE))
+	if (!r && !sn && (vcpu->mode == IN_GUEST_MODE))
 		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
 				POSTED_INTR_VECTOR);
 	else
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 11/13] KVM: Suppress posted-interrupt when 'SN' is set
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb-DgEjT+Ai2ygdnm+yROfE0A, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

Currently, we don't support urgent interrupt, all interrupts
are recognized as non-urgent interrupt, so we cannot send
posted-interrupt when 'SN' is set.

Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 arch/x86/kvm/vmx.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 51d2c8a..495cfbd 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4306,15 +4306,22 @@ static int vmx_vm_has_apicv(struct kvm *kvm)
 static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	int r;
+	int r, sn;
 
 	if (pi_test_and_set_pir(vector, &vmx->pi_desc))
 		return;
 
+	/*
+	 * Currently, we don't support urgent interrupt, all interrupts
+	 * are recognized as non-urgent interrupt, so we cannot send
+	 * posted-interrupt when 'SN' is set.
+	 */
+	sn = pi_test_sn(&vmx->pi_desc);
+
 	r = pi_test_and_set_on(&vmx->pi_desc);
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 #ifdef CONFIG_SMP
-	if (!r && (vcpu->mode == IN_GUEST_MODE))
+	if (!r && !sn && (vcpu->mode == IN_GUEST_MODE))
 		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
 				POSTED_INTR_VECTOR);
 	else
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 12/13] iommu/vt-d: No need to migrating irq for VT-d Posted-Interrtups
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel, Feng Wu

We don't need to migrate the irqs for VT-d Posted-Interrtups here.
When 'pst' is set in IRTE, the associated irq will be posted to
guests instead of interrupt remapping. The destination of the
interrupt is set in Posted-Interrupts Descriptor, and the migration
happens during VCPU scheduling.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 drivers/iommu/intel_irq_remapping.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 87c02fe..249e2b1 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -1038,6 +1038,13 @@ intel_ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	if (get_irte(irq, &irte))
 		return -EBUSY;
 
+	/*
+	 * If the interrupt is for posting, it is used by guests,
+	 * we cannot change IRTE here.
+	 */
+	if (irte.irq_post_low.pst == 1)
+		return 0;
+
 	err = assign_irq_vector(irq, cfg, mask);
 	if (err)
 		return err;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 12/13] iommu/vt-d: No need to migrating irq for VT-d Posted-Interrtups
@ 2014-11-10  6:26   ` Feng Wu
  0 siblings, 0 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb-DgEjT+Ai2ygdnm+yROfE0A, pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ, joro-zLv9SwRftAIdnm+yROfE0A,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

We don't need to migrate the irqs for VT-d Posted-Interrtups here.
When 'pst' is set in IRTE, the associated irq will be posted to
guests instead of interrupt remapping. The destination of the
interrupt is set in Posted-Interrupts Descriptor, and the migration
happens during VCPU scheduling.

Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/iommu/intel_irq_remapping.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 87c02fe..249e2b1 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -1038,6 +1038,13 @@ intel_ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	if (get_irte(irq, &irte))
 		return -EBUSY;
 
+	/*
+	 * If the interrupt is for posting, it is used by guests,
+	 * we cannot change IRTE here.
+	 */
+	if (irte.irq_post_low.pst == 1)
+		return 0;
+
 	err = assign_irq_vector(irq, cfg, mask);
 	if (err)
 		return err;
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 13/13] iommu/vt-d: Add a command line parameter for VT-d posted-interrupts
  2014-11-10  6:26 ` Feng Wu
                   ` (12 preceding siblings ...)
  (?)
@ 2014-11-10  6:26 ` Feng Wu
  2014-11-10 18:15     ` Thomas Gleixner
  2014-11-10 21:57     ` Alex Williamson
  -1 siblings, 2 replies; 129+ messages in thread
From: Feng Wu @ 2014-11-10  6:26 UTC (permalink / raw)
  To: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel, Feng Wu

Enable VT-d Posted-Interrtups and add a command line
parameter for it.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 drivers/iommu/irq_remapping.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 0e36860..3cb9429 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -23,7 +23,7 @@ int irq_remap_broken;
 int disable_sourceid_checking;
 int no_x2apic_optout;
 
-int disable_irq_post = 1;
+int disable_irq_post = 0;
 int irq_post_enabled = 0;
 EXPORT_SYMBOL_GPL(irq_post_enabled);
 
@@ -206,6 +206,13 @@ static __init int setup_irqremap(char *str)
 }
 early_param("intremap", setup_irqremap);
 
+static __init int setup_nointpost(char *str)
+{
+	disable_irq_post = 1;
+	return 0;
+}
+early_param("nointpost", setup_nointpost);
+
 void __init setup_irq_remapping_ops(void)
 {
 	remap_ops = &intel_irq_remap_ops;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2014-11-10 18:15     ` Thomas Gleixner
  0 siblings, 0 replies; 129+ messages in thread
From: Thomas Gleixner @ 2014-11-10 18:15 UTC (permalink / raw)
  To: Feng Wu
  Cc: gleb, pbonzini, David Woodhouse, joro, mingo, H. Peter Anvin,
	x86, kvm, iommu, LKML, Jiang Liu

On Mon, 10 Nov 2014, Feng Wu wrote:

> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.

Can you please talk to Jiang and synchronize your work with his
refactoring of the x86 interrupt handling subsystem.

I want this stuff cleaned up first before we add new stuff to it.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2014-11-10 18:15     ` Thomas Gleixner
  0 siblings, 0 replies; 129+ messages in thread
From: Thomas Gleixner @ 2014-11-10 18:15 UTC (permalink / raw)
  To: Feng Wu
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, gleb-DgEjT+Ai2ygdnm+yROfE0A,
	x86-DgEjT+Ai2ygdnm+yROfE0A, LKML,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, H. Peter Anvin,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, David Woodhouse, Jiang Liu

On Mon, 10 Nov 2014, Feng Wu wrote:

> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.

Can you please talk to Jiang and synchronize your work with his
refactoring of the x86 interrupt handling subsystem.

I want this stuff cleaned up first before we add new stuff to it.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor
@ 2014-11-10 21:57     ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2014-11-10 21:57 UTC (permalink / raw)
  To: Feng Wu
  Cc: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86, kvm, iommu,
	linux-kernel

On Mon, 2014-11-10 at 14:26 +0800, Feng Wu wrote:
> This patch initialize the VT-d Posted-interrupt Descritpor.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
>  arch/x86/include/asm/irq_remapping.h |    1 +
>  arch/x86/kernel/apic/apic.c          |    1 +
>  arch/x86/kvm/vmx.c                   |   56 ++++++++++++++++++++++++++++++++-
>  3 files changed, 56 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
> index b7747c4..a3cc437 100644
> --- a/arch/x86/include/asm/irq_remapping.h
> +++ b/arch/x86/include/asm/irq_remapping.h
> @@ -57,6 +57,7 @@ extern bool setup_remapped_irq(int irq,
>  			       struct irq_chip *chip);
>  
>  void irq_remap_modify_chip_defaults(struct irq_chip *chip);
> +extern int irq_post_enabled;
>  
>  #else  /* CONFIG_IRQ_REMAP */
>  
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index ba6cc04..987408d 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -162,6 +162,7 @@ __setup("apicpmtimer", setup_apicpmtimer);
>  #endif
>  
>  int x2apic_mode;
> +EXPORT_SYMBOL_GPL(x2apic_mode);
>  #ifdef CONFIG_X86_X2APIC
>  /* x2apic enabled before OS handover */
>  int x2apic_preenabled;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 3e556c6..a4670d3 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -45,6 +45,7 @@
>  #include <asm/perf_event.h>
>  #include <asm/debugreg.h>
>  #include <asm/kexec.h>
> +#include <asm/irq_remapping.h>
>  
>  #include "trace.h"
>  
> @@ -408,13 +409,32 @@ struct nested_vmx {
>  };
>  
>  #define POSTED_INTR_ON  0
> +#define POSTED_INTR_SN  1
> +
>  /* Posted-Interrupt Descriptor */
>  struct pi_desc {
>  	u32 pir[8];     /* Posted interrupt requested */
> -	u32 control;	/* bit 0 of control is outstanding notification bit */
> -	u32 rsvd[7];
> +	union {
> +		struct {
> +			u64	on	: 1,
> +				sn	: 1,
> +				rsvd_1	: 13,
> +				ndm	: 1,
> +				nv	: 8,
> +				rsvd_2	: 8,
> +				ndst	: 32;
> +		};
> +		u64 control;
> +	};
> +	u32 rsvd[6];
>  } __aligned(64);
>  
> +static void pi_clear_sn(struct pi_desc *pi_desc)
> +{
> +	return clear_bit(POSTED_INTR_SN,
> +			(unsigned long *)&pi_desc->control);
> +}
> +
>  static bool pi_test_and_set_on(struct pi_desc *pi_desc)
>  {
>  	return test_and_set_bit(POSTED_INTR_ON,
> @@ -4396,6 +4416,33 @@ static void ept_set_mmio_spte_mask(void)
>  	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
>  }
>  
> +static bool pi_desc_init(struct vcpu_vmx *vmx)
> +{
> +	unsigned int dest;
> +
> +	if (irq_post_enabled == 0)
> +		return true;
> +
> +	/*
> +	 * Initialize Posted-Interrupt Descriptor
> +	 */
> +
> +	pi_clear_sn(&vmx->pi_desc);
> +	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
> +
> +	/* Physical mode for Notificaiton Event */
> +	vmx->pi_desc.ndm = 0;
> +	dest = cpu_physical_id(vmx->vcpu.cpu);
> +
> +	if (x2apic_mode)
> +		vmx->pi_desc.ndst = dest;
> +	else
> +		vmx->pi_desc.ndst = (dest << 8) & 0xFF00;
> +
> +	return true;

Why does this bother to return anything since it can only return true?

> +}
> +
> +
>  /*
>   * Sets up the vmcs for emulated real mode.
>   */
> @@ -4439,6 +4486,11 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>  
>  		vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
>  		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
> +
> +		if (!pi_desc_init(vmx)) {

And therefore this cannot happen.

> +			printk(KERN_ERR "Initialize PI descriptor error!\n");
> +			return 1;

This is the wrong error anyway, vmx_create_vcpu() returns ERR_PTR(1)
which fails the reverse IS_ERR()

Thanks,
Alex

> +		}
>  	}
>  
>  	if (ple_gap) {




^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor
@ 2014-11-10 21:57     ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2014-11-10 21:57 UTC (permalink / raw)
  To: Feng Wu
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, gleb-DgEjT+Ai2ygdnm+yROfE0A,
	x86-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ

On Mon, 2014-11-10 at 14:26 +0800, Feng Wu wrote:
> This patch initialize the VT-d Posted-interrupt Descritpor.
> 
> Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> ---
>  arch/x86/include/asm/irq_remapping.h |    1 +
>  arch/x86/kernel/apic/apic.c          |    1 +
>  arch/x86/kvm/vmx.c                   |   56 ++++++++++++++++++++++++++++++++-
>  3 files changed, 56 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
> index b7747c4..a3cc437 100644
> --- a/arch/x86/include/asm/irq_remapping.h
> +++ b/arch/x86/include/asm/irq_remapping.h
> @@ -57,6 +57,7 @@ extern bool setup_remapped_irq(int irq,
>  			       struct irq_chip *chip);
>  
>  void irq_remap_modify_chip_defaults(struct irq_chip *chip);
> +extern int irq_post_enabled;
>  
>  #else  /* CONFIG_IRQ_REMAP */
>  
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index ba6cc04..987408d 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -162,6 +162,7 @@ __setup("apicpmtimer", setup_apicpmtimer);
>  #endif
>  
>  int x2apic_mode;
> +EXPORT_SYMBOL_GPL(x2apic_mode);
>  #ifdef CONFIG_X86_X2APIC
>  /* x2apic enabled before OS handover */
>  int x2apic_preenabled;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 3e556c6..a4670d3 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -45,6 +45,7 @@
>  #include <asm/perf_event.h>
>  #include <asm/debugreg.h>
>  #include <asm/kexec.h>
> +#include <asm/irq_remapping.h>
>  
>  #include "trace.h"
>  
> @@ -408,13 +409,32 @@ struct nested_vmx {
>  };
>  
>  #define POSTED_INTR_ON  0
> +#define POSTED_INTR_SN  1
> +
>  /* Posted-Interrupt Descriptor */
>  struct pi_desc {
>  	u32 pir[8];     /* Posted interrupt requested */
> -	u32 control;	/* bit 0 of control is outstanding notification bit */
> -	u32 rsvd[7];
> +	union {
> +		struct {
> +			u64	on	: 1,
> +				sn	: 1,
> +				rsvd_1	: 13,
> +				ndm	: 1,
> +				nv	: 8,
> +				rsvd_2	: 8,
> +				ndst	: 32;
> +		};
> +		u64 control;
> +	};
> +	u32 rsvd[6];
>  } __aligned(64);
>  
> +static void pi_clear_sn(struct pi_desc *pi_desc)
> +{
> +	return clear_bit(POSTED_INTR_SN,
> +			(unsigned long *)&pi_desc->control);
> +}
> +
>  static bool pi_test_and_set_on(struct pi_desc *pi_desc)
>  {
>  	return test_and_set_bit(POSTED_INTR_ON,
> @@ -4396,6 +4416,33 @@ static void ept_set_mmio_spte_mask(void)
>  	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
>  }
>  
> +static bool pi_desc_init(struct vcpu_vmx *vmx)
> +{
> +	unsigned int dest;
> +
> +	if (irq_post_enabled == 0)
> +		return true;
> +
> +	/*
> +	 * Initialize Posted-Interrupt Descriptor
> +	 */
> +
> +	pi_clear_sn(&vmx->pi_desc);
> +	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
> +
> +	/* Physical mode for Notificaiton Event */
> +	vmx->pi_desc.ndm = 0;
> +	dest = cpu_physical_id(vmx->vcpu.cpu);
> +
> +	if (x2apic_mode)
> +		vmx->pi_desc.ndst = dest;
> +	else
> +		vmx->pi_desc.ndst = (dest << 8) & 0xFF00;
> +
> +	return true;

Why does this bother to return anything since it can only return true?

> +}
> +
> +
>  /*
>   * Sets up the vmcs for emulated real mode.
>   */
> @@ -4439,6 +4486,11 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>  
>  		vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
>  		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
> +
> +		if (!pi_desc_init(vmx)) {

And therefore this cannot happen.

> +			printk(KERN_ERR "Initialize PI descriptor error!\n");
> +			return 1;

This is the wrong error anyway, vmx_create_vcpu() returns ERR_PTR(1)
which fails the reverse IS_ERR()

Thanks,
Alex

> +		}
>  	}
>  
>  	if (ple_gap) {

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes
@ 2014-11-10 21:57     ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2014-11-10 21:57 UTC (permalink / raw)
  To: Feng Wu
  Cc: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86, kvm, iommu,
	linux-kernel

On Mon, 2014-11-10 at 14:26 +0800, Feng Wu wrote:
> When guest changes its interrupt configuration (such as, vector, etc.)
> for direct-assigned devices, we need to update the associated IRTE
> with the new guest vector, so external interrupts from the assigned
> devices can be injected to guests without VM-Exit.
> 
> The current method of handling guest lowest priority interrtups
> is to use a counter 'apic_arb_prio' for each VCPU, we choose the
> VCPU with smallest 'apic_arb_prio' and then increase it by 1.
> However, for VT-d PI, we cannot re-use this, since we no longer
> have control to 'apic_arb_prio' with posted interrupt direct
> delivery by Hardware.
> 
> Here, we introduce a similiar way with 'apic_arb_prio' to handle
> guest lowest priority interrtups when VT-d PI is used. Here is the
> ideas:
> - Each VCPU has a counter 'round_robin_counter'.
> - When guests sets an interrupts to lowest priority, we choose
> the VCPU with smallest 'round_robin_counter' as the destination,
> then increase it.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
>  arch/x86/include/asm/irq_remapping.h |    6 ++
>  arch/x86/include/asm/kvm_host.h      |    2 +
>  arch/x86/kvm/vmx.c                   |   12 +++
>  arch/x86/kvm/x86.c                   |   11 +++
>  drivers/iommu/amd_iommu.c            |    6 ++
>  drivers/iommu/intel_irq_remapping.c  |   28 +++++++
>  drivers/iommu/irq_remapping.c        |    9 ++
>  drivers/iommu/irq_remapping.h        |    3 +
>  include/linux/dmar.h                 |   26 ++++++
>  include/linux/kvm_host.h             |   22 +++++
>  include/uapi/linux/kvm.h             |    1 +
>  virt/kvm/assigned-dev.c              |  141 ++++++++++++++++++++++++++++++++++
>  virt/kvm/irq_comm.c                  |    4 +-
>  virt/kvm/irqchip.c                   |   11 ---
>  14 files changed, 269 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
> index a3cc437..32d6cc4 100644
> --- a/arch/x86/include/asm/irq_remapping.h
> +++ b/arch/x86/include/asm/irq_remapping.h
> @@ -51,6 +51,7 @@ extern void compose_remapped_msi_msg(struct pci_dev *pdev,
>  				     unsigned int irq, unsigned int dest,
>  				     struct msi_msg *msg, u8 hpet_id);
>  extern int setup_hpet_msi_remapped(unsigned int irq, unsigned int id);
> +extern int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32 vector);
>  extern void panic_if_irq_remap(const char *msg);
>  extern bool setup_remapped_irq(int irq,
>  			       struct irq_cfg *cfg,
> @@ -88,6 +89,11 @@ static inline int setup_hpet_msi_remapped(unsigned int irq, unsigned int id)
>  	return -ENODEV;
>  }
>  
> +static inline int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32 vector)
> +{
> +	return -ENODEV;
> +}
> +
>  static inline void panic_if_irq_remap(const char *msg)
>  {
>  }
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6ed0c30..0630161 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -358,6 +358,7 @@ struct kvm_vcpu_arch {
>  	struct kvm_lapic *apic;    /* kernel irqchip context */
>  	unsigned long apic_attention;
>  	int32_t apic_arb_prio;
> +	int32_t round_robin_counter;
>  	int mp_state;
>  	u64 ia32_misc_enable_msr;
>  	bool tpr_access_reporting;
> @@ -771,6 +772,7 @@ struct kvm_x86_ops {
>  	int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
>  
>  	void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
> +	u64 (*get_pi_desc_addr)(struct kvm_vcpu *vcpu);
>  };
>  
>  struct kvm_arch_async_pf {
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index a4670d3..ae91b72 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -544,6 +544,11 @@ static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
>  	return container_of(vcpu, struct vcpu_vmx, vcpu);
>  }
>  
> +struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
> +{
> +	return &(to_vmx(vcpu)->pi_desc);
> +}
> +
>  #define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
>  #define FIELD(number, name)	[number] = VMCS12_OFFSET(name)
>  #define FIELD64(number, name)	[number] = VMCS12_OFFSET(name), \
> @@ -4280,6 +4285,11 @@ static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu *vcpu)
>  	return;
>  }
>  
> +static u64 vmx_get_pi_desc_addr(struct kvm_vcpu *vcpu)
> +{
> +	return __pa((u64)vcpu_to_pi_desc(vcpu));
> +}
> +
>  /*
>   * Set up the vmcs's constant host-state fields, i.e., host-state fields that
>   * will not change in the lifetime of the guest.
> @@ -9232,6 +9242,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
>  	.check_nested_events = vmx_check_nested_events,
>  
>  	.sched_in = vmx_sched_in,
> +
> +	.get_pi_desc_addr = vmx_get_pi_desc_addr,
>  };
>  
>  static int __init vmx_init(void)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b447a98..0c19d15 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7735,6 +7735,17 @@ bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
>  }
>  EXPORT_SYMBOL_GPL(kvm_arch_has_noncoherent_dma);
>  
> +int kvm_update_pi_irte_common(struct kvm *kvm, struct kvm_vcpu *vcpu,
> +			u32 guest_vector, int host_irq)
> +{
> +	u64 pi_desc_addr = kvm_x86_ops->get_pi_desc_addr(vcpu);
> +
> +	if (update_pi_irte(host_irq, pi_desc_addr, guest_vector))
> +		return -1;
> +
> +	return 0;
> +}
> +
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 505a9ad..a36fdc7 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -4280,6 +4280,11 @@ static int alloc_hpet_msi(unsigned int irq, unsigned int id)
>  	return 0;
>  }
>  
> +static int dummy_update_pi_irte(int irq, u64 pi_desc_addr, u32 vector)
> +{
> +	return -EINVAL;
> +}
> +
>  struct irq_remap_ops amd_iommu_irq_ops = {
>  	.supported		= amd_iommu_supported,
>  	.prepare		= amd_iommu_prepare,
> @@ -4294,5 +4299,6 @@ struct irq_remap_ops amd_iommu_irq_ops = {
>  	.msi_alloc_irq		= msi_alloc_irq,
>  	.msi_setup_irq		= msi_setup_irq,
>  	.alloc_hpet_msi		= alloc_hpet_msi,
> +	.update_pi_irte         = dummy_update_pi_irte,
>  };
>  #endif
> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> index 776da10..87c02fe 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -1172,6 +1172,33 @@ static int intel_alloc_hpet_msi(unsigned int irq, unsigned int id)
>  	return ret;
>  }
>  
> +static int intel_update_pi_irte(int irq, u64 pi_desc_addr, u32 vector)
> +{
> +	struct irte irte;
> +
> +	if (get_irte(irq, &irte))
> +		return -1;
> +
> +	irte.irq_post_low.urg = 0;
> +	irte.irq_post_low.vector = vector;
> +	irte.irq_post_low.pda_l = (pi_desc_addr >> (32 - PDA_LOW_BIT)) &
> +			~(-1UL << PDA_LOW_BIT);
> +	irte.irq_post_high.pda_h = (pi_desc_addr >> 32) &
> +			~(-1UL << PDA_HIGH_BIT);
> +
> +	irte.irq_post_low.__reserved_1 = 0;
> +	irte.irq_post_low.__reserved_2 = 0;
> +	irte.irq_post_low.__reserved_3 = 0;
> +	irte.irq_post_high.__reserved_4 = 0;
> +
> +	irte.irq_post_low.pst = 1;
> +
> +	if (modify_irte(irq, &irte))
> +		return -1;
> +
> +	return 0;
> +}
> +
>  struct irq_remap_ops intel_irq_remap_ops = {
>  	.supported		= intel_irq_remapping_supported,
>  	.prepare		= dmar_table_init,
> @@ -1186,4 +1213,5 @@ struct irq_remap_ops intel_irq_remap_ops = {
>  	.msi_alloc_irq		= intel_msi_alloc_irq,
>  	.msi_setup_irq		= intel_msi_setup_irq,
>  	.alloc_hpet_msi		= intel_alloc_hpet_msi,
> +	.update_pi_irte         = intel_update_pi_irte,

Extending irq_remap_ops should really be a separate patch from it's use
by KVM.

>  };
> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> index 2f8ee00..0e36860 100644
> --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -362,6 +362,15 @@ int setup_hpet_msi_remapped(unsigned int irq, unsigned int id)
>  	return default_setup_hpet_msi(irq, id);
>  }
>  
> +int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32 vector)
> +{
> +	if (!remap_ops || !remap_ops->update_pi_irte)
> +		return -ENODEV;
> +
> +	return remap_ops->update_pi_irte(irq, pi_desc_addr, vector);
> +}
> +EXPORT_SYMBOL_GPL(update_pi_irte);
> +
>  void panic_if_irq_remap(const char *msg)
>  {
>  	if (irq_remapping_enabled)
> diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
> index 7bb5913..2d8f740 100644
> --- a/drivers/iommu/irq_remapping.h
> +++ b/drivers/iommu/irq_remapping.h
> @@ -84,6 +84,9 @@ struct irq_remap_ops {
>  
>  	/* Setup interrupt remapping for an HPET MSI */
>  	int (*alloc_hpet_msi)(unsigned int, unsigned int);
> +
> +	/* Update IRTE for posted-interrupt */
> +	int (*update_pi_irte)(int irq, u64 pi_desc_addr, u32 vector);
>  };
>  
>  extern struct irq_remap_ops intel_irq_remap_ops;
> diff --git a/include/linux/dmar.h b/include/linux/dmar.h
> index 8be5d42..e1ff4f7 100644
> --- a/include/linux/dmar.h
> +++ b/include/linux/dmar.h
> @@ -160,6 +160,20 @@ struct irte {
>  				__reserved_2	: 8,
>  				dest_id		: 32;
>  		} irq_remap_low;
> +
> +		struct {
> +			__u64   present		: 1,
> +				fpd		: 1,
> +				__reserved_1	: 6,
> +				avail	: 4,
> +				__reserved_2	: 2,
> +				urg		: 1,
> +				pst		: 1,
> +				vector	: 8,
> +				__reserved_3	: 14,
> +				pda_l	: 26;
> +		} irq_post_low;
> +
>  		__u64 low;
>  	};
>  
> @@ -170,10 +184,22 @@ struct irte {
>  				svt		: 2,
>  				__reserved_3	: 44;
>  		} irq_remap_high;
> +
> +		struct {
> +			__u64	sid:	16,
> +				sq:		2,
> +				svt:	2,
> +				__reserved_4:	12,
> +				pda_h:	32;
> +		} irq_post_high;
> +
>  		__u64 high;
>  	};
>  };
>  
> +#define PDA_LOW_BIT    26
> +#define PDA_HIGH_BIT   32
> +
>  enum {
>  	IRQ_REMAP_XAPIC_MODE,
>  	IRQ_REMAP_X2APIC_MODE,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ea53b04..6bb8287 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -335,6 +335,25 @@ struct kvm_kernel_irq_routing_entry {
>  	struct hlist_node link;
>  };
>  
> +#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
> +
> +struct kvm_irq_routing_table {
> +	int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
> +	struct kvm_kernel_irq_routing_entry *rt_entries;
> +	u32 nr_rt_entries;
> +	/*
> +	 * Array indexed by gsi. Each entry contains list of irq chips
> +	 * the gsi is connected to.
> +	 */
> +	struct hlist_head map[0];
> +};
> +
> +#else
> +
> +struct kvm_irq_routing_table {};
> +
> +#endif
> +
>  #ifndef KVM_PRIVATE_MEM_SLOTS
>  #define KVM_PRIVATE_MEM_SLOTS 0
>  #endif
> @@ -766,6 +785,9 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
>  				   struct kvm_irq_ack_notifier *kian);
>  int kvm_request_irq_source_id(struct kvm *kvm);
>  void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
> +void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
> +				   struct kvm_lapic_irq *irq);
> +bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq);
>  
>  #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT
>  int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 7593c52..509223a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1027,6 +1027,7 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct kvm_xen_hvm_config)
>  #define KVM_SET_CLOCK             _IOW(KVMIO,  0x7b, struct kvm_clock_data)
>  #define KVM_GET_CLOCK             _IOR(KVMIO,  0x7c, struct kvm_clock_data)
> +#define KVM_ASSIGN_DEV_PI_UPDATE  _IOR(KVMIO,  0x7d, __u32)
>  /* Available with KVM_CAP_PIT_STATE2 */
>  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct kvm_pit_state2)
>  #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0, struct kvm_pit_state2)

Needs an accompanying Documentation/virtual/kvm/api.txt update.

> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index e05000e..e154009 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c


Since legacy KVM device assignment is effectively deprecated, have you
considered how we might do this with VFIO?  Thanks,

Alex


> @@ -326,6 +326,135 @@ void kvm_free_all_assigned_devices(struct kvm *kvm)
>  	}
>  }
>  
> +int __weak kvm_update_pi_irte_common(struct kvm *kvm, struct kvm_vcpu *vcpu,
> +					u32 guest_vector, int host_irq)
> +{
> +	return 0;
> +}
> +
> +int kvm_compare_rr_counter(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
> +{
> +	return vcpu1->arch.round_robin_counter -
> +			vcpu2->arch.round_robin_counter;
> +}
> +
> +bool kvm_pi_find_dest_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
> +				struct kvm_vcpu **dest_vcpu)
> +{
> +	int i, r = 0;
> +	struct kvm_vcpu *vcpu, *dest = NULL;
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		if (!kvm_apic_present(vcpu))
> +			continue;
> +
> +		if (!kvm_apic_match_dest(vcpu, NULL, irq->shorthand,
> +					irq->dest_id, irq->dest_mode))
> +			continue;
> +
> +		if (!kvm_is_dm_lowest_prio(irq)) {
> +			r++;
> +			*dest_vcpu = vcpu;
> +		} else if (kvm_lapic_enabled(vcpu)) {
> +			if (!dest)
> +				dest = vcpu;
> +			else if (kvm_compare_rr_counter(vcpu, dest) < 0)
> +				dest = vcpu;
> +		}
> +	}
> +
> +	if (dest) {
> +		dest->arch.round_robin_counter++;
> +		*dest_vcpu = dest;
> +		return true;
> +	} else if (r == 1)
> +		return true;
> +
> +	return false;
> +}
> +
> +static int __kvm_update_pi_irte(struct kvm *kvm, int host_irq, int guest_irq)
> +{
> +	struct kvm_kernel_irq_routing_entry *e;
> +	struct kvm_irq_routing_table *irq_rt;
> +	struct kvm_lapic_irq irq;
> +	struct kvm_vcpu *vcpu;
> +	int idx, ret = -EINVAL;
> +
> +	idx = srcu_read_lock(&kvm->irq_srcu);
> +	irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
> +	ASSERT(guest_irq < irq_rt->nr_rt_entries);
> +
> +	hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
> +		if (e->type != KVM_IRQ_ROUTING_MSI)
> +			continue;
> +		/*
> +		 * VT-d posted-interrupt has the following
> +		 * limitations:
> +		 *  - No support for posting multicast/broadcast
> +		 *    interrupts to a VCPU
> +		 * Still use interrupt remapping for these
> +		 * kind of interrupts
> +		 */
> +
> +		kvm_set_msi_irq(e, &irq);
> +		if (!kvm_pi_find_dest_vcpu(kvm, &irq, &vcpu)) {
> +			printk(KERN_INFO "%s: can not find the target VCPU\n",
> +					__func__);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		if (kvm_update_pi_irte_common(kvm, vcpu, irq.vector,
> +				host_irq)) {
> +			printk(KERN_INFO "%s: failed to update PI IRTE\n",
> +					__func__);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +	}
> +
> +	ret = 0;
> +out:
> +	srcu_read_unlock(&kvm->irq_srcu, idx);
> +	return ret;
> +}
> +
> +int kvm_update_pi_irte(struct kvm *kvm, u32 dev_id)
> +{
> +	int i, rc = -1;
> +	struct kvm_assigned_dev_kernel *dev;
> +
> +	mutex_lock(&kvm->lock);
> +	dev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head, dev_id);
> +	if (!dev) {
> +		printk(KERN_INFO "%s: cannot find the assigned dev.\n",
> +				__func__);
> +		rc = -1;
> +		goto out;
> +	}
> +
> +	BUG_ON(dev->irq_requested_type == 0);
> +
> +	if ((dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI) &&
> +		(dev->dev->msi_enabled == 1)) {
> +			__kvm_update_pi_irte(kvm,
> +					dev->host_irq, dev->guest_irq);
> +	} else if ((dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) &&
> +		(dev->dev->msix_enabled == 1)) {
> +		for (i = 0; i < dev->entries_nr; i++) {
> +			__kvm_update_pi_irte(kvm,
> +					dev->host_msix_entries[i].vector,
> +					dev->guest_msix_entries[i].vector);
> +		}
> +	}
> +
> +out:
> +	rc = 0;
> +	mutex_unlock(&kvm->lock);
> +	return rc;
> +}
> +
>  static int assigned_device_enable_host_intx(struct kvm *kvm,
>  					    struct kvm_assigned_dev_kernel *dev)
>  {
> @@ -1017,6 +1146,18 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
>  		r = kvm_vm_ioctl_set_pci_irq_mask(kvm, &assigned_dev);
>  		break;
>  	}
> +	case KVM_ASSIGN_DEV_PI_UPDATE: {
> +		u32 dev_id;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&dev_id, argp, sizeof(dev_id)))
> +			goto out;
> +		r = kvm_update_pi_irte(kvm, dev_id);
> +		if (r)
> +			goto out;
> +		break;
> +
> +	}
>  	default:
>  		r = -ENOTTY;
>  		break;
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index 963b899..f51aed3 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -55,7 +55,7 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
>  				line_status);
>  }
>  
> -inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
> +bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
>  {
>  #ifdef CONFIG_IA64
>  	return irq->delivery_mode ==
> @@ -106,7 +106,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  	return r;
>  }
>  
> -static inline void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
> +void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
>  				   struct kvm_lapic_irq *irq)
>  {
>  	trace_kvm_msi_set_irq(e->msi.address_lo, e->msi.data);
> diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
> index 7f256f3..cdf29a6 100644
> --- a/virt/kvm/irqchip.c
> +++ b/virt/kvm/irqchip.c
> @@ -31,17 +31,6 @@
>  #include <trace/events/kvm.h>
>  #include "irq.h"
>  
> -struct kvm_irq_routing_table {
> -	int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
> -	struct kvm_kernel_irq_routing_entry *rt_entries;
> -	u32 nr_rt_entries;
> -	/*
> -	 * Array indexed by gsi. Each entry contains list of irq chips
> -	 * the gsi is connected to.
> -	 */
> -	struct hlist_head map[0];
> -};
> -
>  int kvm_irq_map_gsi(struct kvm *kvm,
>  		    struct kvm_kernel_irq_routing_entry *entries, int gsi)
>  {




^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes
@ 2014-11-10 21:57     ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2014-11-10 21:57 UTC (permalink / raw)
  To: Feng Wu
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, gleb-DgEjT+Ai2ygdnm+yROfE0A,
	x86-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ

On Mon, 2014-11-10 at 14:26 +0800, Feng Wu wrote:
> When guest changes its interrupt configuration (such as, vector, etc.)
> for direct-assigned devices, we need to update the associated IRTE
> with the new guest vector, so external interrupts from the assigned
> devices can be injected to guests without VM-Exit.
> 
> The current method of handling guest lowest priority interrtups
> is to use a counter 'apic_arb_prio' for each VCPU, we choose the
> VCPU with smallest 'apic_arb_prio' and then increase it by 1.
> However, for VT-d PI, we cannot re-use this, since we no longer
> have control to 'apic_arb_prio' with posted interrupt direct
> delivery by Hardware.
> 
> Here, we introduce a similiar way with 'apic_arb_prio' to handle
> guest lowest priority interrtups when VT-d PI is used. Here is the
> ideas:
> - Each VCPU has a counter 'round_robin_counter'.
> - When guests sets an interrupts to lowest priority, we choose
> the VCPU with smallest 'round_robin_counter' as the destination,
> then increase it.
> 
> Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> ---
>  arch/x86/include/asm/irq_remapping.h |    6 ++
>  arch/x86/include/asm/kvm_host.h      |    2 +
>  arch/x86/kvm/vmx.c                   |   12 +++
>  arch/x86/kvm/x86.c                   |   11 +++
>  drivers/iommu/amd_iommu.c            |    6 ++
>  drivers/iommu/intel_irq_remapping.c  |   28 +++++++
>  drivers/iommu/irq_remapping.c        |    9 ++
>  drivers/iommu/irq_remapping.h        |    3 +
>  include/linux/dmar.h                 |   26 ++++++
>  include/linux/kvm_host.h             |   22 +++++
>  include/uapi/linux/kvm.h             |    1 +
>  virt/kvm/assigned-dev.c              |  141 ++++++++++++++++++++++++++++++++++
>  virt/kvm/irq_comm.c                  |    4 +-
>  virt/kvm/irqchip.c                   |   11 ---
>  14 files changed, 269 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
> index a3cc437..32d6cc4 100644
> --- a/arch/x86/include/asm/irq_remapping.h
> +++ b/arch/x86/include/asm/irq_remapping.h
> @@ -51,6 +51,7 @@ extern void compose_remapped_msi_msg(struct pci_dev *pdev,
>  				     unsigned int irq, unsigned int dest,
>  				     struct msi_msg *msg, u8 hpet_id);
>  extern int setup_hpet_msi_remapped(unsigned int irq, unsigned int id);
> +extern int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32 vector);
>  extern void panic_if_irq_remap(const char *msg);
>  extern bool setup_remapped_irq(int irq,
>  			       struct irq_cfg *cfg,
> @@ -88,6 +89,11 @@ static inline int setup_hpet_msi_remapped(unsigned int irq, unsigned int id)
>  	return -ENODEV;
>  }
>  
> +static inline int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32 vector)
> +{
> +	return -ENODEV;
> +}
> +
>  static inline void panic_if_irq_remap(const char *msg)
>  {
>  }
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6ed0c30..0630161 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -358,6 +358,7 @@ struct kvm_vcpu_arch {
>  	struct kvm_lapic *apic;    /* kernel irqchip context */
>  	unsigned long apic_attention;
>  	int32_t apic_arb_prio;
> +	int32_t round_robin_counter;
>  	int mp_state;
>  	u64 ia32_misc_enable_msr;
>  	bool tpr_access_reporting;
> @@ -771,6 +772,7 @@ struct kvm_x86_ops {
>  	int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
>  
>  	void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
> +	u64 (*get_pi_desc_addr)(struct kvm_vcpu *vcpu);
>  };
>  
>  struct kvm_arch_async_pf {
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index a4670d3..ae91b72 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -544,6 +544,11 @@ static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
>  	return container_of(vcpu, struct vcpu_vmx, vcpu);
>  }
>  
> +struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
> +{
> +	return &(to_vmx(vcpu)->pi_desc);
> +}
> +
>  #define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
>  #define FIELD(number, name)	[number] = VMCS12_OFFSET(name)
>  #define FIELD64(number, name)	[number] = VMCS12_OFFSET(name), \
> @@ -4280,6 +4285,11 @@ static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu *vcpu)
>  	return;
>  }
>  
> +static u64 vmx_get_pi_desc_addr(struct kvm_vcpu *vcpu)
> +{
> +	return __pa((u64)vcpu_to_pi_desc(vcpu));
> +}
> +
>  /*
>   * Set up the vmcs's constant host-state fields, i.e., host-state fields that
>   * will not change in the lifetime of the guest.
> @@ -9232,6 +9242,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
>  	.check_nested_events = vmx_check_nested_events,
>  
>  	.sched_in = vmx_sched_in,
> +
> +	.get_pi_desc_addr = vmx_get_pi_desc_addr,
>  };
>  
>  static int __init vmx_init(void)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b447a98..0c19d15 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7735,6 +7735,17 @@ bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
>  }
>  EXPORT_SYMBOL_GPL(kvm_arch_has_noncoherent_dma);
>  
> +int kvm_update_pi_irte_common(struct kvm *kvm, struct kvm_vcpu *vcpu,
> +			u32 guest_vector, int host_irq)
> +{
> +	u64 pi_desc_addr = kvm_x86_ops->get_pi_desc_addr(vcpu);
> +
> +	if (update_pi_irte(host_irq, pi_desc_addr, guest_vector))
> +		return -1;
> +
> +	return 0;
> +}
> +
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 505a9ad..a36fdc7 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -4280,6 +4280,11 @@ static int alloc_hpet_msi(unsigned int irq, unsigned int id)
>  	return 0;
>  }
>  
> +static int dummy_update_pi_irte(int irq, u64 pi_desc_addr, u32 vector)
> +{
> +	return -EINVAL;
> +}
> +
>  struct irq_remap_ops amd_iommu_irq_ops = {
>  	.supported		= amd_iommu_supported,
>  	.prepare		= amd_iommu_prepare,
> @@ -4294,5 +4299,6 @@ struct irq_remap_ops amd_iommu_irq_ops = {
>  	.msi_alloc_irq		= msi_alloc_irq,
>  	.msi_setup_irq		= msi_setup_irq,
>  	.alloc_hpet_msi		= alloc_hpet_msi,
> +	.update_pi_irte         = dummy_update_pi_irte,
>  };
>  #endif
> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> index 776da10..87c02fe 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -1172,6 +1172,33 @@ static int intel_alloc_hpet_msi(unsigned int irq, unsigned int id)
>  	return ret;
>  }
>  
> +static int intel_update_pi_irte(int irq, u64 pi_desc_addr, u32 vector)
> +{
> +	struct irte irte;
> +
> +	if (get_irte(irq, &irte))
> +		return -1;
> +
> +	irte.irq_post_low.urg = 0;
> +	irte.irq_post_low.vector = vector;
> +	irte.irq_post_low.pda_l = (pi_desc_addr >> (32 - PDA_LOW_BIT)) &
> +			~(-1UL << PDA_LOW_BIT);
> +	irte.irq_post_high.pda_h = (pi_desc_addr >> 32) &
> +			~(-1UL << PDA_HIGH_BIT);
> +
> +	irte.irq_post_low.__reserved_1 = 0;
> +	irte.irq_post_low.__reserved_2 = 0;
> +	irte.irq_post_low.__reserved_3 = 0;
> +	irte.irq_post_high.__reserved_4 = 0;
> +
> +	irte.irq_post_low.pst = 1;
> +
> +	if (modify_irte(irq, &irte))
> +		return -1;
> +
> +	return 0;
> +}
> +
>  struct irq_remap_ops intel_irq_remap_ops = {
>  	.supported		= intel_irq_remapping_supported,
>  	.prepare		= dmar_table_init,
> @@ -1186,4 +1213,5 @@ struct irq_remap_ops intel_irq_remap_ops = {
>  	.msi_alloc_irq		= intel_msi_alloc_irq,
>  	.msi_setup_irq		= intel_msi_setup_irq,
>  	.alloc_hpet_msi		= intel_alloc_hpet_msi,
> +	.update_pi_irte         = intel_update_pi_irte,

Extending irq_remap_ops should really be a separate patch from it's use
by KVM.

>  };
> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> index 2f8ee00..0e36860 100644
> --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -362,6 +362,15 @@ int setup_hpet_msi_remapped(unsigned int irq, unsigned int id)
>  	return default_setup_hpet_msi(irq, id);
>  }
>  
> +int update_pi_irte(unsigned int irq, u64 pi_desc_addr, u32 vector)
> +{
> +	if (!remap_ops || !remap_ops->update_pi_irte)
> +		return -ENODEV;
> +
> +	return remap_ops->update_pi_irte(irq, pi_desc_addr, vector);
> +}
> +EXPORT_SYMBOL_GPL(update_pi_irte);
> +
>  void panic_if_irq_remap(const char *msg)
>  {
>  	if (irq_remapping_enabled)
> diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
> index 7bb5913..2d8f740 100644
> --- a/drivers/iommu/irq_remapping.h
> +++ b/drivers/iommu/irq_remapping.h
> @@ -84,6 +84,9 @@ struct irq_remap_ops {
>  
>  	/* Setup interrupt remapping for an HPET MSI */
>  	int (*alloc_hpet_msi)(unsigned int, unsigned int);
> +
> +	/* Update IRTE for posted-interrupt */
> +	int (*update_pi_irte)(int irq, u64 pi_desc_addr, u32 vector);
>  };
>  
>  extern struct irq_remap_ops intel_irq_remap_ops;
> diff --git a/include/linux/dmar.h b/include/linux/dmar.h
> index 8be5d42..e1ff4f7 100644
> --- a/include/linux/dmar.h
> +++ b/include/linux/dmar.h
> @@ -160,6 +160,20 @@ struct irte {
>  				__reserved_2	: 8,
>  				dest_id		: 32;
>  		} irq_remap_low;
> +
> +		struct {
> +			__u64   present		: 1,
> +				fpd		: 1,
> +				__reserved_1	: 6,
> +				avail	: 4,
> +				__reserved_2	: 2,
> +				urg		: 1,
> +				pst		: 1,
> +				vector	: 8,
> +				__reserved_3	: 14,
> +				pda_l	: 26;
> +		} irq_post_low;
> +
>  		__u64 low;
>  	};
>  
> @@ -170,10 +184,22 @@ struct irte {
>  				svt		: 2,
>  				__reserved_3	: 44;
>  		} irq_remap_high;
> +
> +		struct {
> +			__u64	sid:	16,
> +				sq:		2,
> +				svt:	2,
> +				__reserved_4:	12,
> +				pda_h:	32;
> +		} irq_post_high;
> +
>  		__u64 high;
>  	};
>  };
>  
> +#define PDA_LOW_BIT    26
> +#define PDA_HIGH_BIT   32
> +
>  enum {
>  	IRQ_REMAP_XAPIC_MODE,
>  	IRQ_REMAP_X2APIC_MODE,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ea53b04..6bb8287 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -335,6 +335,25 @@ struct kvm_kernel_irq_routing_entry {
>  	struct hlist_node link;
>  };
>  
> +#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
> +
> +struct kvm_irq_routing_table {
> +	int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
> +	struct kvm_kernel_irq_routing_entry *rt_entries;
> +	u32 nr_rt_entries;
> +	/*
> +	 * Array indexed by gsi. Each entry contains list of irq chips
> +	 * the gsi is connected to.
> +	 */
> +	struct hlist_head map[0];
> +};
> +
> +#else
> +
> +struct kvm_irq_routing_table {};
> +
> +#endif
> +
>  #ifndef KVM_PRIVATE_MEM_SLOTS
>  #define KVM_PRIVATE_MEM_SLOTS 0
>  #endif
> @@ -766,6 +785,9 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
>  				   struct kvm_irq_ack_notifier *kian);
>  int kvm_request_irq_source_id(struct kvm *kvm);
>  void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
> +void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
> +				   struct kvm_lapic_irq *irq);
> +bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq);
>  
>  #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT
>  int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 7593c52..509223a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1027,6 +1027,7 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct kvm_xen_hvm_config)
>  #define KVM_SET_CLOCK             _IOW(KVMIO,  0x7b, struct kvm_clock_data)
>  #define KVM_GET_CLOCK             _IOR(KVMIO,  0x7c, struct kvm_clock_data)
> +#define KVM_ASSIGN_DEV_PI_UPDATE  _IOR(KVMIO,  0x7d, __u32)
>  /* Available with KVM_CAP_PIT_STATE2 */
>  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct kvm_pit_state2)
>  #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0, struct kvm_pit_state2)

Needs an accompanying Documentation/virtual/kvm/api.txt update.

> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index e05000e..e154009 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c


Since legacy KVM device assignment is effectively deprecated, have you
considered how we might do this with VFIO?  Thanks,

Alex


> @@ -326,6 +326,135 @@ void kvm_free_all_assigned_devices(struct kvm *kvm)
>  	}
>  }
>  
> +int __weak kvm_update_pi_irte_common(struct kvm *kvm, struct kvm_vcpu *vcpu,
> +					u32 guest_vector, int host_irq)
> +{
> +	return 0;
> +}
> +
> +int kvm_compare_rr_counter(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
> +{
> +	return vcpu1->arch.round_robin_counter -
> +			vcpu2->arch.round_robin_counter;
> +}
> +
> +bool kvm_pi_find_dest_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
> +				struct kvm_vcpu **dest_vcpu)
> +{
> +	int i, r = 0;
> +	struct kvm_vcpu *vcpu, *dest = NULL;
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		if (!kvm_apic_present(vcpu))
> +			continue;
> +
> +		if (!kvm_apic_match_dest(vcpu, NULL, irq->shorthand,
> +					irq->dest_id, irq->dest_mode))
> +			continue;
> +
> +		if (!kvm_is_dm_lowest_prio(irq)) {
> +			r++;
> +			*dest_vcpu = vcpu;
> +		} else if (kvm_lapic_enabled(vcpu)) {
> +			if (!dest)
> +				dest = vcpu;
> +			else if (kvm_compare_rr_counter(vcpu, dest) < 0)
> +				dest = vcpu;
> +		}
> +	}
> +
> +	if (dest) {
> +		dest->arch.round_robin_counter++;
> +		*dest_vcpu = dest;
> +		return true;
> +	} else if (r == 1)
> +		return true;
> +
> +	return false;
> +}
> +
> +static int __kvm_update_pi_irte(struct kvm *kvm, int host_irq, int guest_irq)
> +{
> +	struct kvm_kernel_irq_routing_entry *e;
> +	struct kvm_irq_routing_table *irq_rt;
> +	struct kvm_lapic_irq irq;
> +	struct kvm_vcpu *vcpu;
> +	int idx, ret = -EINVAL;
> +
> +	idx = srcu_read_lock(&kvm->irq_srcu);
> +	irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
> +	ASSERT(guest_irq < irq_rt->nr_rt_entries);
> +
> +	hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
> +		if (e->type != KVM_IRQ_ROUTING_MSI)
> +			continue;
> +		/*
> +		 * VT-d posted-interrupt has the following
> +		 * limitations:
> +		 *  - No support for posting multicast/broadcast
> +		 *    interrupts to a VCPU
> +		 * Still use interrupt remapping for these
> +		 * kind of interrupts
> +		 */
> +
> +		kvm_set_msi_irq(e, &irq);
> +		if (!kvm_pi_find_dest_vcpu(kvm, &irq, &vcpu)) {
> +			printk(KERN_INFO "%s: can not find the target VCPU\n",
> +					__func__);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		if (kvm_update_pi_irte_common(kvm, vcpu, irq.vector,
> +				host_irq)) {
> +			printk(KERN_INFO "%s: failed to update PI IRTE\n",
> +					__func__);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +	}
> +
> +	ret = 0;
> +out:
> +	srcu_read_unlock(&kvm->irq_srcu, idx);
> +	return ret;
> +}
> +
> +int kvm_update_pi_irte(struct kvm *kvm, u32 dev_id)
> +{
> +	int i, rc = -1;
> +	struct kvm_assigned_dev_kernel *dev;
> +
> +	mutex_lock(&kvm->lock);
> +	dev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head, dev_id);
> +	if (!dev) {
> +		printk(KERN_INFO "%s: cannot find the assigned dev.\n",
> +				__func__);
> +		rc = -1;
> +		goto out;
> +	}
> +
> +	BUG_ON(dev->irq_requested_type == 0);
> +
> +	if ((dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI) &&
> +		(dev->dev->msi_enabled == 1)) {
> +			__kvm_update_pi_irte(kvm,
> +					dev->host_irq, dev->guest_irq);
> +	} else if ((dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) &&
> +		(dev->dev->msix_enabled == 1)) {
> +		for (i = 0; i < dev->entries_nr; i++) {
> +			__kvm_update_pi_irte(kvm,
> +					dev->host_msix_entries[i].vector,
> +					dev->guest_msix_entries[i].vector);
> +		}
> +	}
> +
> +out:
> +	rc = 0;
> +	mutex_unlock(&kvm->lock);
> +	return rc;
> +}
> +
>  static int assigned_device_enable_host_intx(struct kvm *kvm,
>  					    struct kvm_assigned_dev_kernel *dev)
>  {
> @@ -1017,6 +1146,18 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
>  		r = kvm_vm_ioctl_set_pci_irq_mask(kvm, &assigned_dev);
>  		break;
>  	}
> +	case KVM_ASSIGN_DEV_PI_UPDATE: {
> +		u32 dev_id;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&dev_id, argp, sizeof(dev_id)))
> +			goto out;
> +		r = kvm_update_pi_irte(kvm, dev_id);
> +		if (r)
> +			goto out;
> +		break;
> +
> +	}
>  	default:
>  		r = -ENOTTY;
>  		break;
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index 963b899..f51aed3 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -55,7 +55,7 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
>  				line_status);
>  }
>  
> -inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
> +bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
>  {
>  #ifdef CONFIG_IA64
>  	return irq->delivery_mode ==
> @@ -106,7 +106,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  	return r;
>  }
>  
> -static inline void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
> +void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
>  				   struct kvm_lapic_irq *irq)
>  {
>  	trace_kvm_msi_set_irq(e->msi.address_lo, e->msi.data);
> diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
> index 7f256f3..cdf29a6 100644
> --- a/virt/kvm/irqchip.c
> +++ b/virt/kvm/irqchip.c
> @@ -31,17 +31,6 @@
>  #include <trace/events/kvm.h>
>  #include "irq.h"
>  
> -struct kvm_irq_routing_table {
> -	int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
> -	struct kvm_kernel_irq_routing_entry *rt_entries;
> -	u32 nr_rt_entries;
> -	/*
> -	 * Array indexed by gsi. Each entry contains list of irq chips
> -	 * the gsi is connected to.
> -	 */
> -	struct hlist_head map[0];
> -};
> -
>  int kvm_irq_map_gsi(struct kvm *kvm,
>  		    struct kvm_kernel_irq_routing_entry *entries, int gsi)
>  {

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 13/13] iommu/vt-d: Add a command line parameter for VT-d posted-interrupts
@ 2014-11-10 21:57     ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2014-11-10 21:57 UTC (permalink / raw)
  To: Feng Wu
  Cc: gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86, kvm, iommu,
	linux-kernel

On Mon, 2014-11-10 at 14:26 +0800, Feng Wu wrote:
> Enable VT-d Posted-Interrtups and add a command line
> parameter for it.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
>  drivers/iommu/irq_remapping.c |    9 ++++++++-
>  1 files changed, 8 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> index 0e36860..3cb9429 100644
> --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -23,7 +23,7 @@ int irq_remap_broken;
>  int disable_sourceid_checking;
>  int no_x2apic_optout;
>  
> -int disable_irq_post = 1;
> +int disable_irq_post = 0;
>  int irq_post_enabled = 0;
>  EXPORT_SYMBOL_GPL(irq_post_enabled);
>  
> @@ -206,6 +206,13 @@ static __init int setup_irqremap(char *str)
>  }
>  early_param("intremap", setup_irqremap);
>  
> +static __init int setup_nointpost(char *str)
> +{
> +	disable_irq_post = 1;
> +	return 0;
> +}
> +early_param("nointpost", setup_nointpost);
> +

Documentation/kernel-parameters.txt?  Thanks,

Alex

>  void __init setup_irq_remapping_ops(void)
>  {
>  	remap_ops = &intel_irq_remap_ops;




^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 13/13] iommu/vt-d: Add a command line parameter for VT-d posted-interrupts
@ 2014-11-10 21:57     ` Alex Williamson
  0 siblings, 0 replies; 129+ messages in thread
From: Alex Williamson @ 2014-11-10 21:57 UTC (permalink / raw)
  To: Feng Wu
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, gleb-DgEjT+Ai2ygdnm+yROfE0A,
	x86-DgEjT+Ai2ygdnm+yROfE0A, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, tglx-hfZtesqFncYOwBW4kG4KsQ,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ

On Mon, 2014-11-10 at 14:26 +0800, Feng Wu wrote:
> Enable VT-d Posted-Interrtups and add a command line
> parameter for it.
> 
> Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> ---
>  drivers/iommu/irq_remapping.c |    9 ++++++++-
>  1 files changed, 8 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> index 0e36860..3cb9429 100644
> --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -23,7 +23,7 @@ int irq_remap_broken;
>  int disable_sourceid_checking;
>  int no_x2apic_optout;
>  
> -int disable_irq_post = 1;
> +int disable_irq_post = 0;
>  int irq_post_enabled = 0;
>  EXPORT_SYMBOL_GPL(irq_post_enabled);
>  
> @@ -206,6 +206,13 @@ static __init int setup_irqremap(char *str)
>  }
>  early_param("intremap", setup_irqremap);
>  
> +static __init int setup_nointpost(char *str)
> +{
> +	disable_irq_post = 1;
> +	return 0;
> +}
> +early_param("nointpost", setup_nointpost);
> +

Documentation/kernel-parameters.txt?  Thanks,

Alex

>  void __init setup_irq_remapping_ops(void)
>  {
>  	remap_ops = &intel_irq_remap_ops;

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2014-11-10 18:15     ` Thomas Gleixner
@ 2014-11-11  2:28       ` Jiang Liu
  -1 siblings, 0 replies; 129+ messages in thread
From: Jiang Liu @ 2014-11-11  2:28 UTC (permalink / raw)
  To: Thomas Gleixner, Feng Wu
  Cc: gleb, pbonzini, David Woodhouse, joro, mingo, H. Peter Anvin,
	x86, kvm, iommu, LKML

On 2014/11/11 2:15, Thomas Gleixner wrote:
> On Mon, 10 Nov 2014, Feng Wu wrote:
> 
>> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
>> With VT-d Posted-Interrupts enabled, external interrupts from
>> direct-assigned devices can be delivered to guests without VMM
>> intervention when guest is running in non-root mode.
> 
> Can you please talk to Jiang and synchronize your work with his
> refactoring of the x86 interrupt handling subsystem.
> 
> I want this stuff cleaned up first before we add new stuff to it.
Hi Thomas,
	Just talked with Feng, we will focused on refactor first and
then add posted interrupt support.
Regards!
Gerry

> 
> Thanks,
> 
> 	tglx
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2014-11-11  2:28       ` Jiang Liu
  0 siblings, 0 replies; 129+ messages in thread
From: Jiang Liu @ 2014-11-11  2:28 UTC (permalink / raw)
  To: Thomas Gleixner, Feng Wu
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, gleb-DgEjT+Ai2ygdnm+yROfE0A,
	x86-DgEjT+Ai2ygdnm+yROfE0A, LKML,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, H. Peter Anvin,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, David Woodhouse

On 2014/11/11 2:15, Thomas Gleixner wrote:
> On Mon, 10 Nov 2014, Feng Wu wrote:
> 
>> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
>> With VT-d Posted-Interrupts enabled, external interrupts from
>> direct-assigned devices can be delivered to guests without VMM
>> intervention when guest is running in non-root mode.
> 
> Can you please talk to Jiang and synchronize your work with his
> refactoring of the x86 interrupt handling subsystem.
> 
> I want this stuff cleaned up first before we add new stuff to it.
Hi Thomas,
	Just talked with Feng, we will focused on refactor first and
then add posted interrupt support.
Regards!
Gerry

> 
> Thanks,
> 
> 	tglx
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* RE: several messages
@ 2014-11-11  6:37         ` Wu, Feng
  0 siblings, 0 replies; 129+ messages in thread
From: Wu, Feng @ 2014-11-11  6:37 UTC (permalink / raw)
  To: Jiang Liu, Thomas Gleixner
  Cc: gleb, pbonzini, David Woodhouse, joro, mingo, H. Peter Anvin,
	x86, kvm, iommu, LKML



> -----Original Message-----
> From: Jiang Liu [mailto:jiang.liu@linux.intel.com]
> Sent: Tuesday, November 11, 2014 10:29 AM
> To: Thomas Gleixner; Wu, Feng
> Cc: gleb@kernel.org; pbonzini@redhat.com; David Woodhouse;
> joro@8bytes.org; mingo@redhat.com; H. Peter Anvin; x86@kernel.org;
> kvm@vger.kernel.org; iommu@lists.linux-foundation.org; LKML
> Subject: Re: several messages
> 
> On 2014/11/11 2:15, Thomas Gleixner wrote:
> > On Mon, 10 Nov 2014, Feng Wu wrote:
> >
> >> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> >> With VT-d Posted-Interrupts enabled, external interrupts from
> >> direct-assigned devices can be delivered to guests without VMM
> >> intervention when guest is running in non-root mode.
> >
> > Can you please talk to Jiang and synchronize your work with his
> > refactoring of the x86 interrupt handling subsystem.
> >
> > I want this stuff cleaned up first before we add new stuff to it.
> Hi Thomas,
> 	Just talked with Feng, we will focused on refactor first and
> then add posted interrupt support.
> Regards!
> Gerry

No problem!

Thanks,
Feng

> 
> >
> > Thanks,
> >
> > 	tglx
> >

^ permalink raw reply	[flat|nested] 129+ messages in thread

* RE: several messages
@ 2014-11-11  6:37         ` Wu, Feng
  0 siblings, 0 replies; 129+ messages in thread
From: Wu, Feng @ 2014-11-11  6:37 UTC (permalink / raw)
  To: Jiang Liu, Thomas Gleixner
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, gleb-DgEjT+Ai2ygdnm+yROfE0A,
	x86-DgEjT+Ai2ygdnm+yROfE0A, LKML,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, H. Peter Anvin,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, David Woodhouse



> -----Original Message-----
> From: Jiang Liu [mailto:jiang.liu-VuQAYsv1563Yd54FQh9/CA@public.gmane.org]
> Sent: Tuesday, November 11, 2014 10:29 AM
> To: Thomas Gleixner; Wu, Feng
> Cc: gleb-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; pbonzini-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; David Woodhouse;
> joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org; mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; H. Peter Anvin; x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org;
> kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org; LKML
> Subject: Re: several messages
> 
> On 2014/11/11 2:15, Thomas Gleixner wrote:
> > On Mon, 10 Nov 2014, Feng Wu wrote:
> >
> >> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> >> With VT-d Posted-Interrupts enabled, external interrupts from
> >> direct-assigned devices can be delivered to guests without VMM
> >> intervention when guest is running in non-root mode.
> >
> > Can you please talk to Jiang and synchronize your work with his
> > refactoring of the x86 interrupt handling subsystem.
> >
> > I want this stuff cleaned up first before we add new stuff to it.
> Hi Thomas,
> 	Just talked with Feng, we will focused on refactor first and
> then add posted interrupt support.
> Regards!
> Gerry

No problem!

Thanks,
Feng

> 
> >
> > Thanks,
> >
> > 	tglx
> >

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor
@ 2014-11-11 13:35     ` Jiang Liu
  0 siblings, 0 replies; 129+ messages in thread
From: Jiang Liu @ 2014-11-11 13:35 UTC (permalink / raw)
  To: Feng Wu, gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel

On 2014/11/10 14:26, Feng Wu wrote:
> This patch initialize the VT-d Posted-interrupt Descritpor.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
>  arch/x86/include/asm/irq_remapping.h |    1 +
>  arch/x86/kernel/apic/apic.c          |    1 +
>  arch/x86/kvm/vmx.c                   |   56 ++++++++++++++++++++++++++++++++-
>  3 files changed, 56 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
> index b7747c4..a3cc437 100644
> --- a/arch/x86/include/asm/irq_remapping.h
> +++ b/arch/x86/include/asm/irq_remapping.h
> @@ -57,6 +57,7 @@ extern bool setup_remapped_irq(int irq,
>  			       struct irq_chip *chip);
>  
>  void irq_remap_modify_chip_defaults(struct irq_chip *chip);
> +extern int irq_post_enabled;
>  
>  #else  /* CONFIG_IRQ_REMAP */
>  
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index ba6cc04..987408d 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -162,6 +162,7 @@ __setup("apicpmtimer", setup_apicpmtimer);
>  #endif
>  
>  int x2apic_mode;
> +EXPORT_SYMBOL_GPL(x2apic_mode);
>  #ifdef CONFIG_X86_X2APIC
>  /* x2apic enabled before OS handover */
>  int x2apic_preenabled;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 3e556c6..a4670d3 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -45,6 +45,7 @@
>  #include <asm/perf_event.h>
>  #include <asm/debugreg.h>
>  #include <asm/kexec.h>
> +#include <asm/irq_remapping.h>
>  
>  #include "trace.h"
>  
> @@ -408,13 +409,32 @@ struct nested_vmx {
>  };
>  
>  #define POSTED_INTR_ON  0
> +#define POSTED_INTR_SN  1
> +
>  /* Posted-Interrupt Descriptor */
>  struct pi_desc {
>  	u32 pir[8];     /* Posted interrupt requested */
> -	u32 control;	/* bit 0 of control is outstanding notification bit */
> -	u32 rsvd[7];
> +	union {
> +		struct {
> +			u64	on	: 1,
> +				sn	: 1,
> +				rsvd_1	: 13,
> +				ndm	: 1,
> +				nv	: 8,
> +				rsvd_2	: 8,
> +				ndst	: 32;
> +		};
> +		u64 control;
> +	};
> +	u32 rsvd[6];
>  } __aligned(64);
>  
> +static void pi_clear_sn(struct pi_desc *pi_desc)
> +{
> +	return clear_bit(POSTED_INTR_SN,
> +			(unsigned long *)&pi_desc->control);
> +}
> +
>  static bool pi_test_and_set_on(struct pi_desc *pi_desc)
>  {
>  	return test_and_set_bit(POSTED_INTR_ON,
> @@ -4396,6 +4416,33 @@ static void ept_set_mmio_spte_mask(void)
>  	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
>  }
>  
> +static bool pi_desc_init(struct vcpu_vmx *vmx)
> +{
> +	unsigned int dest;
> +
> +	if (irq_post_enabled == 0)
> +		return true;
> +
> +	/*
> +	 * Initialize Posted-Interrupt Descriptor
> +	 */
> +
> +	pi_clear_sn(&vmx->pi_desc);
> +	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
> +
> +	/* Physical mode for Notificaiton Event */
> +	vmx->pi_desc.ndm = 0;
> +	dest = cpu_physical_id(vmx->vcpu.cpu);
> +
> +	if (x2apic_mode)
Hi Feng,
	Could you try to use x2apic_enabled() here so you don't
need to export x2apic_mode?
Regards!
Gerry
> +		vmx->pi_desc.ndst = dest;
> +	else
> +		vmx->pi_desc.ndst = (dest << 8) & 0xFF00;
> +
> +	return true;
> +}
> +
> +
>  /*
>   * Sets up the vmcs for emulated real mode.
>   */
> @@ -4439,6 +4486,11 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>  
>  		vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
>  		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
> +
> +		if (!pi_desc_init(vmx)) {
> +			printk(KERN_ERR "Initialize PI descriptor error!\n");
> +			return 1;
> +		}
>  	}
>  
>  	if (ple_gap) {
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor
@ 2014-11-11 13:35     ` Jiang Liu
  0 siblings, 0 replies; 129+ messages in thread
From: Jiang Liu @ 2014-11-11 13:35 UTC (permalink / raw)
  To: Feng Wu, gleb-DgEjT+Ai2ygdnm+yROfE0A,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	joro-zLv9SwRftAIdnm+yROfE0A, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

On 2014/11/10 14:26, Feng Wu wrote:
> This patch initialize the VT-d Posted-interrupt Descritpor.
> 
> Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> ---
>  arch/x86/include/asm/irq_remapping.h |    1 +
>  arch/x86/kernel/apic/apic.c          |    1 +
>  arch/x86/kvm/vmx.c                   |   56 ++++++++++++++++++++++++++++++++-
>  3 files changed, 56 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
> index b7747c4..a3cc437 100644
> --- a/arch/x86/include/asm/irq_remapping.h
> +++ b/arch/x86/include/asm/irq_remapping.h
> @@ -57,6 +57,7 @@ extern bool setup_remapped_irq(int irq,
>  			       struct irq_chip *chip);
>  
>  void irq_remap_modify_chip_defaults(struct irq_chip *chip);
> +extern int irq_post_enabled;
>  
>  #else  /* CONFIG_IRQ_REMAP */
>  
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index ba6cc04..987408d 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -162,6 +162,7 @@ __setup("apicpmtimer", setup_apicpmtimer);
>  #endif
>  
>  int x2apic_mode;
> +EXPORT_SYMBOL_GPL(x2apic_mode);
>  #ifdef CONFIG_X86_X2APIC
>  /* x2apic enabled before OS handover */
>  int x2apic_preenabled;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 3e556c6..a4670d3 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -45,6 +45,7 @@
>  #include <asm/perf_event.h>
>  #include <asm/debugreg.h>
>  #include <asm/kexec.h>
> +#include <asm/irq_remapping.h>
>  
>  #include "trace.h"
>  
> @@ -408,13 +409,32 @@ struct nested_vmx {
>  };
>  
>  #define POSTED_INTR_ON  0
> +#define POSTED_INTR_SN  1
> +
>  /* Posted-Interrupt Descriptor */
>  struct pi_desc {
>  	u32 pir[8];     /* Posted interrupt requested */
> -	u32 control;	/* bit 0 of control is outstanding notification bit */
> -	u32 rsvd[7];
> +	union {
> +		struct {
> +			u64	on	: 1,
> +				sn	: 1,
> +				rsvd_1	: 13,
> +				ndm	: 1,
> +				nv	: 8,
> +				rsvd_2	: 8,
> +				ndst	: 32;
> +		};
> +		u64 control;
> +	};
> +	u32 rsvd[6];
>  } __aligned(64);
>  
> +static void pi_clear_sn(struct pi_desc *pi_desc)
> +{
> +	return clear_bit(POSTED_INTR_SN,
> +			(unsigned long *)&pi_desc->control);
> +}
> +
>  static bool pi_test_and_set_on(struct pi_desc *pi_desc)
>  {
>  	return test_and_set_bit(POSTED_INTR_ON,
> @@ -4396,6 +4416,33 @@ static void ept_set_mmio_spte_mask(void)
>  	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
>  }
>  
> +static bool pi_desc_init(struct vcpu_vmx *vmx)
> +{
> +	unsigned int dest;
> +
> +	if (irq_post_enabled == 0)
> +		return true;
> +
> +	/*
> +	 * Initialize Posted-Interrupt Descriptor
> +	 */
> +
> +	pi_clear_sn(&vmx->pi_desc);
> +	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
> +
> +	/* Physical mode for Notificaiton Event */
> +	vmx->pi_desc.ndm = 0;
> +	dest = cpu_physical_id(vmx->vcpu.cpu);
> +
> +	if (x2apic_mode)
Hi Feng,
	Could you try to use x2apic_enabled() here so you don't
need to export x2apic_mode?
Regards!
Gerry
> +		vmx->pi_desc.ndst = dest;
> +	else
> +		vmx->pi_desc.ndst = (dest << 8) & 0xFF00;
> +
> +	return true;
> +}
> +
> +
>  /*
>   * Sets up the vmcs for emulated real mode.
>   */
> @@ -4439,6 +4486,11 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>  
>  		vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
>  		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
> +
> +		if (!pi_desc_init(vmx)) {
> +			printk(KERN_ERR "Initialize PI descriptor error!\n");
> +			return 1;
> +		}
>  	}
>  
>  	if (ple_gap) {
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 01/13] iommu/vt-d: VT-d Posted-Interrupts feature detection
@ 2014-11-11 13:38     ` Jiang Liu
  0 siblings, 0 replies; 129+ messages in thread
From: Jiang Liu @ 2014-11-11 13:38 UTC (permalink / raw)
  To: Feng Wu, gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel



On 2014/11/10 14:26, Feng Wu wrote:
> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.
> 
> This patch adds feature detection logic for VT-d posted-interrupt.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
>  drivers/iommu/intel_irq_remapping.c |   13 +++++++++++++
>  drivers/iommu/irq_remapping.c       |    4 ++++
>  drivers/iommu/irq_remapping.h       |    5 +++++
>  include/linux/intel-iommu.h         |    1 +
>  4 files changed, 23 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> index 7c80661..f99f0f1 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -580,6 +580,19 @@ static int __init intel_irq_remapping_supported(void)
>  		if (!ecap_ir_support(iommu->ecap))
>  			return 0;
>  
> +	/* VT-d posted-interrupt feature detection*/
> +	if (disable_irq_post == 0)
> +		for_each_drhd_unit(drhd) {
> +			struct intel_iommu *iommu = drhd->iommu;
Hi Feng,
	You may use for_each_active_iommu() here.
Regards!
Gerry

> +
> +			if (!cap_pi_support(iommu->cap)) {
> +				irq_post_enabled = 0;
> +				disable_irq_post = 1;
> +				break;
> +			}
> +			irq_post_enabled = 1;
> +		}
> +
>  	return 1;
>  }
>  
> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> index 74a1767..2f8ee00 100644
> --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -23,6 +23,10 @@ int irq_remap_broken;
>  int disable_sourceid_checking;
>  int no_x2apic_optout;
>  
> +int disable_irq_post = 1;
> +int irq_post_enabled = 0;
> +EXPORT_SYMBOL_GPL(irq_post_enabled);
> +
>  static struct irq_remap_ops *remap_ops;
>  
>  static int msi_alloc_remapped_irq(struct pci_dev *pdev, int irq, int nvec);
> diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
> index fde250f..7bb5913 100644
> --- a/drivers/iommu/irq_remapping.h
> +++ b/drivers/iommu/irq_remapping.h
> @@ -37,6 +37,9 @@ extern int disable_sourceid_checking;
>  extern int no_x2apic_optout;
>  extern int irq_remapping_enabled;
>  
> +extern int disable_irq_post;
> +extern int irq_post_enabled;
> +
>  struct irq_remap_ops {
>  	/* Check whether Interrupt Remapping is supported */
>  	int (*supported)(void);
> @@ -91,6 +94,8 @@ extern struct irq_remap_ops amd_iommu_irq_ops;
>  #define irq_remapping_enabled 0
>  #define disable_irq_remap     1
>  #define irq_remap_broken      0
> +#define disable_irq_post      1
> +#define irq_post_enabled      0
>  
>  #endif /* CONFIG_IRQ_REMAP */
>  
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index a65208a..5b1a124 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -87,6 +87,7 @@ static inline void dmar_writeq(void __iomem *addr, u64 val)
>  /*
>   * Decoding Capability Register
>   */
> +#define cap_pi_support(c)	(((c) >> 59) & 1)
>  #define cap_read_drain(c)	(((c) >> 55) & 1)
>  #define cap_write_drain(c)	(((c) >> 54) & 1)
>  #define cap_max_amask_val(c)	(((c) >> 48) & 0x3f)
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 01/13] iommu/vt-d: VT-d Posted-Interrupts feature detection
@ 2014-11-11 13:38     ` Jiang Liu
  0 siblings, 0 replies; 129+ messages in thread
From: Jiang Liu @ 2014-11-11 13:38 UTC (permalink / raw)
  To: Feng Wu, gleb-DgEjT+Ai2ygdnm+yROfE0A,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	joro-zLv9SwRftAIdnm+yROfE0A, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA



On 2014/11/10 14:26, Feng Wu wrote:
> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.
> 
> This patch adds feature detection logic for VT-d posted-interrupt.
> 
> Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> ---
>  drivers/iommu/intel_irq_remapping.c |   13 +++++++++++++
>  drivers/iommu/irq_remapping.c       |    4 ++++
>  drivers/iommu/irq_remapping.h       |    5 +++++
>  include/linux/intel-iommu.h         |    1 +
>  4 files changed, 23 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> index 7c80661..f99f0f1 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -580,6 +580,19 @@ static int __init intel_irq_remapping_supported(void)
>  		if (!ecap_ir_support(iommu->ecap))
>  			return 0;
>  
> +	/* VT-d posted-interrupt feature detection*/
> +	if (disable_irq_post == 0)
> +		for_each_drhd_unit(drhd) {
> +			struct intel_iommu *iommu = drhd->iommu;
Hi Feng,
	You may use for_each_active_iommu() here.
Regards!
Gerry

> +
> +			if (!cap_pi_support(iommu->cap)) {
> +				irq_post_enabled = 0;
> +				disable_irq_post = 1;
> +				break;
> +			}
> +			irq_post_enabled = 1;
> +		}
> +
>  	return 1;
>  }
>  
> diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
> index 74a1767..2f8ee00 100644
> --- a/drivers/iommu/irq_remapping.c
> +++ b/drivers/iommu/irq_remapping.c
> @@ -23,6 +23,10 @@ int irq_remap_broken;
>  int disable_sourceid_checking;
>  int no_x2apic_optout;
>  
> +int disable_irq_post = 1;
> +int irq_post_enabled = 0;
> +EXPORT_SYMBOL_GPL(irq_post_enabled);
> +
>  static struct irq_remap_ops *remap_ops;
>  
>  static int msi_alloc_remapped_irq(struct pci_dev *pdev, int irq, int nvec);
> diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
> index fde250f..7bb5913 100644
> --- a/drivers/iommu/irq_remapping.h
> +++ b/drivers/iommu/irq_remapping.h
> @@ -37,6 +37,9 @@ extern int disable_sourceid_checking;
>  extern int no_x2apic_optout;
>  extern int irq_remapping_enabled;
>  
> +extern int disable_irq_post;
> +extern int irq_post_enabled;
> +
>  struct irq_remap_ops {
>  	/* Check whether Interrupt Remapping is supported */
>  	int (*supported)(void);
> @@ -91,6 +94,8 @@ extern struct irq_remap_ops amd_iommu_irq_ops;
>  #define irq_remapping_enabled 0
>  #define disable_irq_remap     1
>  #define irq_remap_broken      0
> +#define disable_irq_post      1
> +#define irq_post_enabled      0
>  
>  #endif /* CONFIG_IRQ_REMAP */
>  
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index a65208a..5b1a124 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -87,6 +87,7 @@ static inline void dmar_writeq(void __iomem *addr, u64 val)
>  /*
>   * Decoding Capability Register
>   */
> +#define cap_pi_support(c)	(((c) >> 59) & 1)
>  #define cap_read_drain(c)	(((c) >> 55) & 1)
>  #define cap_write_drain(c)	(((c) >> 54) & 1)
>  #define cap_max_amask_val(c)	(((c) >> 48) & 0x3f)
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 04/13] iommu/vt-d: Adjust 'struct irte' to better suit for VT-d Posted-Interrupts
@ 2014-11-11 13:43     ` Jiang Liu
  0 siblings, 0 replies; 129+ messages in thread
From: Jiang Liu @ 2014-11-11 13:43 UTC (permalink / raw)
  To: Feng Wu, gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel

Hi Feng,
	Other than this solution, how about introducing new
struct irte_pi for posted interrupt?

On 2014/11/10 14:26, Feng Wu wrote:
> This patch adjusts the definition of 'struct irte', so that we can
> add the VT-d Posted-Interrtups format in this structure later.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
>  drivers/iommu/intel_irq_remapping.c |   35 +++++++++++++++++++----------------
>  include/linux/dmar.h                |    4 ++--
>  2 files changed, 21 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> index f99f0f1..776da10 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -310,9 +310,9 @@ static void set_irte_sid(struct irte *irte, unsigned int svt,
>  {
>  	if (disable_sourceid_checking)
>  		svt = SVT_NO_VERIFY;
> -	irte->svt = svt;
> -	irte->sq = sq;
> -	irte->sid = sid;
> +	irte->irq_remap_high.svt = svt;
> +	irte->irq_remap_high.sq = sq;
> +	irte->irq_remap_high.sid = sid;
>  }
>  
>  static int set_ioapic_sid(struct irte *irte, int apic)
> @@ -917,8 +917,8 @@ static void prepare_irte(struct irte *irte, int vector,
>  {
>  	memset(irte, 0, sizeof(*irte));
>  
> -	irte->present = 1;
> -	irte->dst_mode = apic->irq_dest_mode;
> +	irte->irq_remap_low.present = 1;
> +	irte->irq_remap_low.dst_mode = apic->irq_dest_mode;
>  	/*
>  	 * Trigger mode in the IRTE will always be edge, and for IO-APIC, the
>  	 * actual level or edge trigger will be setup in the IO-APIC
> @@ -926,11 +926,11 @@ static void prepare_irte(struct irte *irte, int vector,
>  	 * For more details, see the comments (in io_apic.c) explainig IO-APIC
>  	 * irq migration in the presence of interrupt-remapping.
>  	*/
> -	irte->trigger_mode = 0;
> -	irte->dlvry_mode = apic->irq_delivery_mode;
> -	irte->vector = vector;
> -	irte->dest_id = IRTE_DEST(dest);
> -	irte->redir_hint = 1;
> +	irte->irq_remap_low.trigger_mode = 0;
> +	irte->irq_remap_low.dlvry_mode = apic->irq_delivery_mode;
> +	irte->irq_remap_low.vector = vector;
> +	irte->irq_remap_low.dest_id = IRTE_DEST(dest);
> +	irte->irq_remap_low.redir_hint = 1;
>  }
>  
>  static int intel_setup_ioapic_entry(int irq,
> @@ -973,10 +973,13 @@ static int intel_setup_ioapic_entry(int irq,
>  		"Redir_hint:%d Trig_Mode:%d Dlvry_Mode:%X "
>  		"Avail:%X Vector:%02X Dest:%08X "
>  		"SID:%04X SQ:%X SVT:%X)\n",
> -		attr->ioapic, irte.present, irte.fpd, irte.dst_mode,
> -		irte.redir_hint, irte.trigger_mode, irte.dlvry_mode,
> -		irte.avail, irte.vector, irte.dest_id,
> -		irte.sid, irte.sq, irte.svt);
> +		attr->ioapic, irte.irq_remap_low.present,
> +		irte.irq_remap_low.fpd, irte.irq_remap_low.dst_mode,
> +		irte.irq_remap_low.redir_hint, irte.irq_remap_low.trigger_mode,
> +		irte.irq_remap_low.dlvry_mode, irte.irq_remap_low.avail,
> +		irte.irq_remap_low.vector, irte.irq_remap_low.dest_id,
> +		irte.irq_remap_high.sid, irte.irq_remap_high.sq,
> +		irte.irq_remap_high.svt);
>  
>  	entry = (struct IR_IO_APIC_route_entry *)route_entry;
>  	memset(entry, 0, sizeof(*entry));
> @@ -1046,8 +1049,8 @@ intel_ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
>  		return err;
>  	}
>  
> -	irte.vector = cfg->vector;
> -	irte.dest_id = IRTE_DEST(dest);
> +	irte.irq_remap_low.vector = cfg->vector;
> +	irte.irq_remap_low.dest_id = IRTE_DEST(dest);
>  
>  	/*
>  	 * Atomically updates the IRTE with the new destination, vector
> diff --git a/include/linux/dmar.h b/include/linux/dmar.h
> index 593fff9..8be5d42 100644
> --- a/include/linux/dmar.h
> +++ b/include/linux/dmar.h
> @@ -159,7 +159,7 @@ struct irte {
>  				vector		: 8,
>  				__reserved_2	: 8,
>  				dest_id		: 32;
> -		};
> +		} irq_remap_low;
>  		__u64 low;
>  	};
>  
> @@ -169,7 +169,7 @@ struct irte {
>  				sq		: 2,
>  				svt		: 2,
>  				__reserved_3	: 44;
> -		};
> +		} irq_remap_high;
>  		__u64 high;
>  	};
>  };
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 04/13] iommu/vt-d: Adjust 'struct irte' to better suit for VT-d Posted-Interrupts
@ 2014-11-11 13:43     ` Jiang Liu
  0 siblings, 0 replies; 129+ messages in thread
From: Jiang Liu @ 2014-11-11 13:43 UTC (permalink / raw)
  To: Feng Wu, gleb-DgEjT+Ai2ygdnm+yROfE0A,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	joro-zLv9SwRftAIdnm+yROfE0A, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

Hi Feng,
	Other than this solution, how about introducing new
struct irte_pi for posted interrupt?

On 2014/11/10 14:26, Feng Wu wrote:
> This patch adjusts the definition of 'struct irte', so that we can
> add the VT-d Posted-Interrtups format in this structure later.
> 
> Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> ---
>  drivers/iommu/intel_irq_remapping.c |   35 +++++++++++++++++++----------------
>  include/linux/dmar.h                |    4 ++--
>  2 files changed, 21 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> index f99f0f1..776da10 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -310,9 +310,9 @@ static void set_irte_sid(struct irte *irte, unsigned int svt,
>  {
>  	if (disable_sourceid_checking)
>  		svt = SVT_NO_VERIFY;
> -	irte->svt = svt;
> -	irte->sq = sq;
> -	irte->sid = sid;
> +	irte->irq_remap_high.svt = svt;
> +	irte->irq_remap_high.sq = sq;
> +	irte->irq_remap_high.sid = sid;
>  }
>  
>  static int set_ioapic_sid(struct irte *irte, int apic)
> @@ -917,8 +917,8 @@ static void prepare_irte(struct irte *irte, int vector,
>  {
>  	memset(irte, 0, sizeof(*irte));
>  
> -	irte->present = 1;
> -	irte->dst_mode = apic->irq_dest_mode;
> +	irte->irq_remap_low.present = 1;
> +	irte->irq_remap_low.dst_mode = apic->irq_dest_mode;
>  	/*
>  	 * Trigger mode in the IRTE will always be edge, and for IO-APIC, the
>  	 * actual level or edge trigger will be setup in the IO-APIC
> @@ -926,11 +926,11 @@ static void prepare_irte(struct irte *irte, int vector,
>  	 * For more details, see the comments (in io_apic.c) explainig IO-APIC
>  	 * irq migration in the presence of interrupt-remapping.
>  	*/
> -	irte->trigger_mode = 0;
> -	irte->dlvry_mode = apic->irq_delivery_mode;
> -	irte->vector = vector;
> -	irte->dest_id = IRTE_DEST(dest);
> -	irte->redir_hint = 1;
> +	irte->irq_remap_low.trigger_mode = 0;
> +	irte->irq_remap_low.dlvry_mode = apic->irq_delivery_mode;
> +	irte->irq_remap_low.vector = vector;
> +	irte->irq_remap_low.dest_id = IRTE_DEST(dest);
> +	irte->irq_remap_low.redir_hint = 1;
>  }
>  
>  static int intel_setup_ioapic_entry(int irq,
> @@ -973,10 +973,13 @@ static int intel_setup_ioapic_entry(int irq,
>  		"Redir_hint:%d Trig_Mode:%d Dlvry_Mode:%X "
>  		"Avail:%X Vector:%02X Dest:%08X "
>  		"SID:%04X SQ:%X SVT:%X)\n",
> -		attr->ioapic, irte.present, irte.fpd, irte.dst_mode,
> -		irte.redir_hint, irte.trigger_mode, irte.dlvry_mode,
> -		irte.avail, irte.vector, irte.dest_id,
> -		irte.sid, irte.sq, irte.svt);
> +		attr->ioapic, irte.irq_remap_low.present,
> +		irte.irq_remap_low.fpd, irte.irq_remap_low.dst_mode,
> +		irte.irq_remap_low.redir_hint, irte.irq_remap_low.trigger_mode,
> +		irte.irq_remap_low.dlvry_mode, irte.irq_remap_low.avail,
> +		irte.irq_remap_low.vector, irte.irq_remap_low.dest_id,
> +		irte.irq_remap_high.sid, irte.irq_remap_high.sq,
> +		irte.irq_remap_high.svt);
>  
>  	entry = (struct IR_IO_APIC_route_entry *)route_entry;
>  	memset(entry, 0, sizeof(*entry));
> @@ -1046,8 +1049,8 @@ intel_ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
>  		return err;
>  	}
>  
> -	irte.vector = cfg->vector;
> -	irte.dest_id = IRTE_DEST(dest);
> +	irte.irq_remap_low.vector = cfg->vector;
> +	irte.irq_remap_low.dest_id = IRTE_DEST(dest);
>  
>  	/*
>  	 * Atomically updates the IRTE with the new destination, vector
> diff --git a/include/linux/dmar.h b/include/linux/dmar.h
> index 593fff9..8be5d42 100644
> --- a/include/linux/dmar.h
> +++ b/include/linux/dmar.h
> @@ -159,7 +159,7 @@ struct irte {
>  				vector		: 8,
>  				__reserved_2	: 8,
>  				dest_id		: 32;
> -		};
> +		} irq_remap_low;
>  		__u64 low;
>  	};
>  
> @@ -169,7 +169,7 @@ struct irte {
>  				sq		: 2,
>  				svt		: 2,
>  				__reserved_3	: 44;
> -		};
> +		} irq_remap_high;
>  		__u64 high;
>  	};
>  };
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 12/13] iommu/vt-d: No need to migrating irq for VT-d Posted-Interrtups
@ 2014-11-11 13:48     ` Jiang Liu
  0 siblings, 0 replies; 129+ messages in thread
From: Jiang Liu @ 2014-11-11 13:48 UTC (permalink / raw)
  To: Feng Wu, gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel



On 2014/11/10 14:26, Feng Wu wrote:
> We don't need to migrate the irqs for VT-d Posted-Interrtups here.
> When 'pst' is set in IRTE, the associated irq will be posted to
> guests instead of interrupt remapping. The destination of the
> interrupt is set in Posted-Interrupts Descriptor, and the migration
> happens during VCPU scheduling.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
>  drivers/iommu/intel_irq_remapping.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> index 87c02fe..249e2b1 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -1038,6 +1038,13 @@ intel_ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
>  	if (get_irte(irq, &irte))
>  		return -EBUSY;
>  
> +	/*
> +	 * If the interrupt is for posting, it is used by guests,
> +	 * we cannot change IRTE here.
> +	 */
> +	if (irte.irq_post_low.pst == 1)
> +		return 0;
Hi Feng,
	You should return some error code instead of 0, otherwise the
irq core will get confused.

> +
>  	err = assign_irq_vector(irq, cfg, mask);
>  	if (err)
>  		return err;
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 12/13] iommu/vt-d: No need to migrating irq for VT-d Posted-Interrtups
@ 2014-11-11 13:48     ` Jiang Liu
  0 siblings, 0 replies; 129+ messages in thread
From: Jiang Liu @ 2014-11-11 13:48 UTC (permalink / raw)
  To: Feng Wu, gleb-DgEjT+Ai2ygdnm+yROfE0A,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	joro-zLv9SwRftAIdnm+yROfE0A, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA



On 2014/11/10 14:26, Feng Wu wrote:
> We don't need to migrate the irqs for VT-d Posted-Interrtups here.
> When 'pst' is set in IRTE, the associated irq will be posted to
> guests instead of interrupt remapping. The destination of the
> interrupt is set in Posted-Interrupts Descriptor, and the migration
> happens during VCPU scheduling.
> 
> Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> ---
>  drivers/iommu/intel_irq_remapping.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
> index 87c02fe..249e2b1 100644
> --- a/drivers/iommu/intel_irq_remapping.c
> +++ b/drivers/iommu/intel_irq_remapping.c
> @@ -1038,6 +1038,13 @@ intel_ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
>  	if (get_irte(irq, &irte))
>  		return -EBUSY;
>  
> +	/*
> +	 * If the interrupt is for posting, it is used by guests,
> +	 * we cannot change IRTE here.
> +	 */
> +	if (irte.irq_post_low.pst == 1)
> +		return 0;
Hi Feng,
	You should return some error code instead of 0, otherwise the
irq core will get confused.

> +
>  	err = assign_irq_vector(irq, cfg, mask);
>  	if (err)
>  		return err;
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* RE: [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor
@ 2014-11-20  4:53       ` Wu, Feng
  0 siblings, 0 replies; 129+ messages in thread
From: Wu, Feng @ 2014-11-20  4:53 UTC (permalink / raw)
  To: Jiang Liu, gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel, Wu, Feng



> -----Original Message-----
> From: Jiang Liu [mailto:jiang.liu@linux.intel.com]
> Sent: Tuesday, November 11, 2014 9:36 PM
> To: Wu, Feng; gleb@kernel.org; pbonzini@redhat.com;
> dwmw2@infradead.org; joro@8bytes.org; tglx@linutronix.de;
> mingo@redhat.com; hpa@zytor.com; x86@kernel.org
> Cc: kvm@vger.kernel.org; iommu@lists.linux-foundation.org;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor
> 
> On 2014/11/10 14:26, Feng Wu wrote:
> > This patch initialize the VT-d Posted-interrupt Descritpor.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> >  arch/x86/include/asm/irq_remapping.h |    1 +
> >  arch/x86/kernel/apic/apic.c          |    1 +
> >  arch/x86/kvm/vmx.c                   |   56
> ++++++++++++++++++++++++++++++++-
> >  3 files changed, 56 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/irq_remapping.h
> b/arch/x86/include/asm/irq_remapping.h
> > index b7747c4..a3cc437 100644
> > --- a/arch/x86/include/asm/irq_remapping.h
> > +++ b/arch/x86/include/asm/irq_remapping.h
> > @@ -57,6 +57,7 @@ extern bool setup_remapped_irq(int irq,
> >  			       struct irq_chip *chip);
> >
> >  void irq_remap_modify_chip_defaults(struct irq_chip *chip);
> > +extern int irq_post_enabled;
> >
> >  #else  /* CONFIG_IRQ_REMAP */
> >
> > diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> > index ba6cc04..987408d 100644
> > --- a/arch/x86/kernel/apic/apic.c
> > +++ b/arch/x86/kernel/apic/apic.c
> > @@ -162,6 +162,7 @@ __setup("apicpmtimer", setup_apicpmtimer);
> >  #endif
> >
> >  int x2apic_mode;
> > +EXPORT_SYMBOL_GPL(x2apic_mode);
> >  #ifdef CONFIG_X86_X2APIC
> >  /* x2apic enabled before OS handover */
> >  int x2apic_preenabled;
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 3e556c6..a4670d3 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -45,6 +45,7 @@
> >  #include <asm/perf_event.h>
> >  #include <asm/debugreg.h>
> >  #include <asm/kexec.h>
> > +#include <asm/irq_remapping.h>
> >
> >  #include "trace.h"
> >
> > @@ -408,13 +409,32 @@ struct nested_vmx {
> >  };
> >
> >  #define POSTED_INTR_ON  0
> > +#define POSTED_INTR_SN  1
> > +
> >  /* Posted-Interrupt Descriptor */
> >  struct pi_desc {
> >  	u32 pir[8];     /* Posted interrupt requested */
> > -	u32 control;	/* bit 0 of control is outstanding notification bit */
> > -	u32 rsvd[7];
> > +	union {
> > +		struct {
> > +			u64	on	: 1,
> > +				sn	: 1,
> > +				rsvd_1	: 13,
> > +				ndm	: 1,
> > +				nv	: 8,
> > +				rsvd_2	: 8,
> > +				ndst	: 32;
> > +		};
> > +		u64 control;
> > +	};
> > +	u32 rsvd[6];
> >  } __aligned(64);
> >
> > +static void pi_clear_sn(struct pi_desc *pi_desc)
> > +{
> > +	return clear_bit(POSTED_INTR_SN,
> > +			(unsigned long *)&pi_desc->control);
> > +}
> > +
> >  static bool pi_test_and_set_on(struct pi_desc *pi_desc)
> >  {
> >  	return test_and_set_bit(POSTED_INTR_ON,
> > @@ -4396,6 +4416,33 @@ static void ept_set_mmio_spte_mask(void)
> >  	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
> >  }
> >
> > +static bool pi_desc_init(struct vcpu_vmx *vmx)
> > +{
> > +	unsigned int dest;
> > +
> > +	if (irq_post_enabled == 0)
> > +		return true;
> > +
> > +	/*
> > +	 * Initialize Posted-Interrupt Descriptor
> > +	 */
> > +
> > +	pi_clear_sn(&vmx->pi_desc);
> > +	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
> > +
> > +	/* Physical mode for Notificaiton Event */
> > +	vmx->pi_desc.ndm = 0;
> > +	dest = cpu_physical_id(vmx->vcpu.cpu);
> > +
> > +	if (x2apic_mode)
> Hi Feng,
> 	Could you try to use x2apic_enabled() here so you don't
> need to export x2apic_mode?
> Regards!
> Gerry

In that case, we should also export x2apic_enabled(), right?

Thanks,
Feng

> > +		vmx->pi_desc.ndst = dest;
> > +	else
> > +		vmx->pi_desc.ndst = (dest << 8) & 0xFF00;
> > +
> > +	return true;
> > +}
> > +
> > +
> >  /*
> >   * Sets up the vmcs for emulated real mode.
> >   */
> > @@ -4439,6 +4486,11 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> *vmx)
> >
> >  		vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
> >  		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
> > +
> > +		if (!pi_desc_init(vmx)) {
> > +			printk(KERN_ERR "Initialize PI descriptor error!\n");
> > +			return 1;
> > +		}
> >  	}
> >
> >  	if (ple_gap) {
> >

^ permalink raw reply	[flat|nested] 129+ messages in thread

* RE: [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor
@ 2014-11-20  4:53       ` Wu, Feng
  0 siblings, 0 replies; 129+ messages in thread
From: Wu, Feng @ 2014-11-20  4:53 UTC (permalink / raw)
  To: Jiang Liu, gleb-DgEjT+Ai2ygdnm+yROfE0A,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	joro-zLv9SwRftAIdnm+yROfE0A, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA



> -----Original Message-----
> From: Jiang Liu [mailto:jiang.liu-VuQAYsv1563Yd54FQh9/CA@public.gmane.org]
> Sent: Tuesday, November 11, 2014 9:36 PM
> To: Wu, Feng; gleb-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; pbonzini-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org;
> dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org; joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org; tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org;
> mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org; x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
> Cc: kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org;
> linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: Re: [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor
> 
> On 2014/11/10 14:26, Feng Wu wrote:
> > This patch initialize the VT-d Posted-interrupt Descritpor.
> >
> > Signed-off-by: Feng Wu <feng.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> > ---
> >  arch/x86/include/asm/irq_remapping.h |    1 +
> >  arch/x86/kernel/apic/apic.c          |    1 +
> >  arch/x86/kvm/vmx.c                   |   56
> ++++++++++++++++++++++++++++++++-
> >  3 files changed, 56 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/irq_remapping.h
> b/arch/x86/include/asm/irq_remapping.h
> > index b7747c4..a3cc437 100644
> > --- a/arch/x86/include/asm/irq_remapping.h
> > +++ b/arch/x86/include/asm/irq_remapping.h
> > @@ -57,6 +57,7 @@ extern bool setup_remapped_irq(int irq,
> >  			       struct irq_chip *chip);
> >
> >  void irq_remap_modify_chip_defaults(struct irq_chip *chip);
> > +extern int irq_post_enabled;
> >
> >  #else  /* CONFIG_IRQ_REMAP */
> >
> > diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> > index ba6cc04..987408d 100644
> > --- a/arch/x86/kernel/apic/apic.c
> > +++ b/arch/x86/kernel/apic/apic.c
> > @@ -162,6 +162,7 @@ __setup("apicpmtimer", setup_apicpmtimer);
> >  #endif
> >
> >  int x2apic_mode;
> > +EXPORT_SYMBOL_GPL(x2apic_mode);
> >  #ifdef CONFIG_X86_X2APIC
> >  /* x2apic enabled before OS handover */
> >  int x2apic_preenabled;
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 3e556c6..a4670d3 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -45,6 +45,7 @@
> >  #include <asm/perf_event.h>
> >  #include <asm/debugreg.h>
> >  #include <asm/kexec.h>
> > +#include <asm/irq_remapping.h>
> >
> >  #include "trace.h"
> >
> > @@ -408,13 +409,32 @@ struct nested_vmx {
> >  };
> >
> >  #define POSTED_INTR_ON  0
> > +#define POSTED_INTR_SN  1
> > +
> >  /* Posted-Interrupt Descriptor */
> >  struct pi_desc {
> >  	u32 pir[8];     /* Posted interrupt requested */
> > -	u32 control;	/* bit 0 of control is outstanding notification bit */
> > -	u32 rsvd[7];
> > +	union {
> > +		struct {
> > +			u64	on	: 1,
> > +				sn	: 1,
> > +				rsvd_1	: 13,
> > +				ndm	: 1,
> > +				nv	: 8,
> > +				rsvd_2	: 8,
> > +				ndst	: 32;
> > +		};
> > +		u64 control;
> > +	};
> > +	u32 rsvd[6];
> >  } __aligned(64);
> >
> > +static void pi_clear_sn(struct pi_desc *pi_desc)
> > +{
> > +	return clear_bit(POSTED_INTR_SN,
> > +			(unsigned long *)&pi_desc->control);
> > +}
> > +
> >  static bool pi_test_and_set_on(struct pi_desc *pi_desc)
> >  {
> >  	return test_and_set_bit(POSTED_INTR_ON,
> > @@ -4396,6 +4416,33 @@ static void ept_set_mmio_spte_mask(void)
> >  	kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
> >  }
> >
> > +static bool pi_desc_init(struct vcpu_vmx *vmx)
> > +{
> > +	unsigned int dest;
> > +
> > +	if (irq_post_enabled == 0)
> > +		return true;
> > +
> > +	/*
> > +	 * Initialize Posted-Interrupt Descriptor
> > +	 */
> > +
> > +	pi_clear_sn(&vmx->pi_desc);
> > +	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
> > +
> > +	/* Physical mode for Notificaiton Event */
> > +	vmx->pi_desc.ndm = 0;
> > +	dest = cpu_physical_id(vmx->vcpu.cpu);
> > +
> > +	if (x2apic_mode)
> Hi Feng,
> 	Could you try to use x2apic_enabled() here so you don't
> need to export x2apic_mode?
> Regards!
> Gerry

In that case, we should also export x2apic_enabled(), right?

Thanks,
Feng

> > +		vmx->pi_desc.ndst = dest;
> > +	else
> > +		vmx->pi_desc.ndst = (dest << 8) & 0xFF00;
> > +
> > +	return true;
> > +}
> > +
> > +
> >  /*
> >   * Sets up the vmcs for emulated real mode.
> >   */
> > @@ -4439,6 +4486,11 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> *vmx)
> >
> >  		vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
> >  		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
> > +
> > +		if (!pi_desc_init(vmx)) {
> > +			printk(KERN_ERR "Initialize PI descriptor error!\n");
> > +			return 1;
> > +		}
> >  	}
> >
> >  	if (ple_gap) {
> >

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor
@ 2014-11-20  5:00         ` Jiang Liu
  0 siblings, 0 replies; 129+ messages in thread
From: Jiang Liu @ 2014-11-20  5:00 UTC (permalink / raw)
  To: Wu, Feng, gleb, pbonzini, dwmw2, joro, tglx, mingo, hpa, x86
  Cc: kvm, iommu, linux-kernel

On 2014/11/20 12:53, Wu, Feng wrote:
> 
> 
>> -----Original Message-----
<snit>
>>> +	/*
>>> +	 * Initialize Posted-Interrupt Descriptor
>>> +	 */
>>> +
>>> +	pi_clear_sn(&vmx->pi_desc);
>>> +	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
>>> +
>>> +	/* Physical mode for Notificaiton Event */
>>> +	vmx->pi_desc.ndm = 0;
>>> +	dest = cpu_physical_id(vmx->vcpu.cpu);
>>> +
>>> +	if (x2apic_mode)
>> Hi Feng,
>> 	Could you try to use x2apic_enabled() here so you don't
>> need to export x2apic_mode?
>> Regards!
>> Gerry
> 
> In that case, we should also export x2apic_enabled(), right?
Hi Feng,
	x2apic_enabled() is a static inline function:)
Regards!
Gerry

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor
@ 2014-11-20  5:00         ` Jiang Liu
  0 siblings, 0 replies; 129+ messages in thread
From: Jiang Liu @ 2014-11-20  5:00 UTC (permalink / raw)
  To: Wu, Feng, gleb-DgEjT+Ai2ygdnm+yROfE0A,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ,
	joro-zLv9SwRftAIdnm+yROfE0A, tglx-hfZtesqFncYOwBW4kG4KsQ,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	x86-DgEjT+Ai2ygdnm+yROfE0A
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, kvm-u79uwXL29TY76Z2rM5mHXA

On 2014/11/20 12:53, Wu, Feng wrote:
> 
> 
>> -----Original Message-----
<snit>
>>> +	/*
>>> +	 * Initialize Posted-Interrupt Descriptor
>>> +	 */
>>> +
>>> +	pi_clear_sn(&vmx->pi_desc);
>>> +	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
>>> +
>>> +	/* Physical mode for Notificaiton Event */
>>> +	vmx->pi_desc.ndm = 0;
>>> +	dest = cpu_physical_id(vmx->vcpu.cpu);
>>> +
>>> +	if (x2apic_mode)
>> Hi Feng,
>> 	Could you try to use x2apic_enabled() here so you don't
>> need to export x2apic_mode?
>> Regards!
>> Gerry
> 
> In that case, we should also export x2apic_enabled(), right?
Hi Feng,
	x2apic_enabled() is a static inline function:)
Regards!
Gerry

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2023-06-13 17:37 ` mptcp: drop legacy code.: Tests Results MPTCP CI
@ 2023-06-16 22:54   ` Mat Martineau
  0 siblings, 0 replies; 129+ messages in thread
From: Mat Martineau @ 2023-06-16 22:54 UTC (permalink / raw)
  To: Paolo Abeni, mptcp

[-- Attachment #1: Type: text/plain, Size: 6401 bytes --]

On Mon, 12 Jun 2023, Paolo Abeni wrote:

> Thanks to the previous patch we can finally drop the "temporary hack"
> used to detect rx eof.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> "previous patch" should be replace with a proper commit reference
> once such patch is merged on -net, unless we target both patches on
> the same tree (but this one is really net-next material, while the fix
> is for -net)

Hi Paolo,

To be clear, the "previous patch" is "mptcp: consolidate fallback and non 
fallback state machine"?


> ---
> net/mptcp/protocol.c | 49 --------------------------------------------
> net/mptcp/protocol.h |  5 +----
> 2 files changed, 1 insertion(+), 53 deletions(-)

Hooray for deleting code!

Reviewed-by: Mat Martineau <martineau@kernel.org>

>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 8f3e50065c13..feaedfd2b3eb 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -898,49 +898,6 @@ bool mptcp_schedule_work(struct sock *sk)
> 	return false;
> }
>
> -void mptcp_subflow_eof(struct sock *sk)
> -{
> -	if (!test_and_set_bit(MPTCP_WORK_EOF, &mptcp_sk(sk)->flags))
> -		mptcp_schedule_work(sk);
> -}
> -
> -static void mptcp_check_for_eof(struct mptcp_sock *msk)
> -{
> -	struct mptcp_subflow_context *subflow;
> -	struct sock *sk = (struct sock *)msk;
> -	int receivers = 0;
> -
> -	mptcp_for_each_subflow(msk, subflow)
> -		receivers += !subflow->rx_eof;
> -	if (receivers)
> -		return;
> -
> -	if (!(sk->sk_shutdown & RCV_SHUTDOWN)) {
> -		/* hopefully temporary hack: propagate shutdown status
> -		 * to msk, when all subflows agree on it
> -		 */
> -		WRITE_ONCE(sk->sk_shutdown, sk->sk_shutdown | RCV_SHUTDOWN);
> -
> -		smp_mb__before_atomic(); /* SHUTDOWN must be visible first */
> -		sk->sk_data_ready(sk);
> -	}
> -
> -	switch (sk->sk_state) {
> -	case TCP_ESTABLISHED:
> -		inet_sk_state_store(sk, TCP_CLOSE_WAIT);
> -		break;
> -	case TCP_FIN_WAIT1:
> -		inet_sk_state_store(sk, TCP_CLOSING);
> -		break;
> -	case TCP_FIN_WAIT2:
> -		inet_sk_state_store(sk, TCP_CLOSE);
> -		break;
> -	default:
> -		return;
> -	}
> -	mptcp_close_wake_up(sk);
> -}
> -
> static struct sock *mptcp_subflow_recv_lookup(const struct mptcp_sock *msk)
> {
> 	struct mptcp_subflow_context *subflow;
> @@ -2193,9 +2150,6 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
> 				break;
> 			}
>
> -			if (test_and_clear_bit(MPTCP_WORK_EOF, &msk->flags))
> -				mptcp_check_for_eof(msk);
> -
> 			if (sk->sk_shutdown & RCV_SHUTDOWN) {
> 				/* race breaker: the shutdown could be after the
> 				 * previous receive queue check
> @@ -2726,9 +2680,6 @@ static void mptcp_worker(struct work_struct *work)
>
> 	mptcp_pm_nl_work(msk);
>
> -	if (test_and_clear_bit(MPTCP_WORK_EOF, &msk->flags))
> -		mptcp_check_for_eof(msk);
> -
> 	mptcp_check_send_data_fin(sk);
> 	mptcp_check_data_fin_ack(sk);
> 	mptcp_check_data_fin(sk);
> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index d2e59cf33f57..528586e2ed73 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
> @@ -113,7 +113,6 @@
> /* MPTCP socket atomic flags */
> #define MPTCP_NOSPACE		1
> #define MPTCP_WORK_RTX		2
> -#define MPTCP_WORK_EOF		3
> #define MPTCP_FALLBACK_DONE	4
> #define MPTCP_WORK_CLOSE_SUBFLOW 5
>
> @@ -481,14 +480,13 @@ struct mptcp_subflow_context {
> 		send_mp_fail : 1,
> 		send_fastclose : 1,
> 		send_infinite_map : 1,
> -		rx_eof : 1,
> 		remote_key_valid : 1,        /* received the peer key from */
> 		disposable : 1,	    /* ctx can be free at ulp release time */
> 		stale : 1,	    /* unable to snd/rcv data, do not use for xmit */
> 		local_id_valid : 1, /* local_id is correctly initialized */
> 		valid_csum_seen : 1,        /* at least one csum validated */
> 		is_mptfo : 1,	    /* subflow is doing TFO */
> -		__unused : 8;
> +		__unused : 9;
> 	enum mptcp_data_avail data_avail;
> 	bool	scheduled;
> 	u32	remote_nonce;
> @@ -744,7 +742,6 @@ static inline u64 mptcp_expand_seq(u64 old_seq, u64 cur_seq, bool use_64bit)
> void __mptcp_check_push(struct sock *sk, struct sock *ssk);
> void __mptcp_data_acked(struct sock *sk);
> void __mptcp_error_report(struct sock *sk);
> -void mptcp_subflow_eof(struct sock *sk);
> bool mptcp_update_rcv_data_fin(struct mptcp_sock *msk, u64 data_fin_seq, bool use_64bit);
> static inline bool mptcp_data_fin_enabled(const struct mptcp_sock *msk)
> {
> -- 
> 2.40.1
>
>
>

On Tue, 13 Jun 2023, MPTCP CI wrote:

> Hi Paolo,
>
> Thank you for your modifications, that's great!
>
> Our CI did some validations and here is its report:
>
> - KVM Validation: normal (except selftest_mptcp_join):
>  - Success! ✅:
>  - Task: https://cirrus-ci.com/task/6108358801883136
>  - Summary: https://api.cirrus-ci.com/v1/artifact/task/6108358801883136/summary/summary.txt
>
> - KVM Validation: debug (only selftest_mptcp_join):
>  - Unstable: 1 failed test(s): selftest_mptcp_join 🔴:
>  - Task: https://cirrus-ci.com/task/4630615174152192
>  - Summary: https://api.cirrus-ci.com/v1/artifact/task/4630615174152192/summary/summary.txt
>
> - KVM Validation: normal (only selftest_mptcp_join):
>  - Success! ✅:
>  - Task: https://cirrus-ci.com/task/5545408848461824
>  - Summary: https://api.cirrus-ci.com/v1/artifact/task/5545408848461824/summary/summary.txt
>
> - KVM Validation: debug (except selftest_mptcp_join):
>  - Success! ✅:
>  - Task: https://cirrus-ci.com/task/6671308755304448
>  - Summary: https://api.cirrus-ci.com/v1/artifact/task/6671308755304448/summary/summary.txt
>
> Initiator: Matthieu Baerts
> Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/bdbf7858d22c
>
>
> If there are some issues, you can reproduce them using the same environment as
> the one used by the CI thanks to a docker image, e.g.:
>
>    $ cd [kernel source code]
>    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
>        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
>        auto-debug
>
> For more details:
>
>    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker
>
>
> Please note that despite all the efforts that have been already done to have a
> stable tests suite when executed on a public CI like here, it is possible some
> reported issues are not due to your modifications. Still, do not hesitate to
> help us improve that ;-)
>
> Cheers,
> MPTCP GH Action bot
> Bot operated by Matthieu Baerts (Tessares)
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2022-09-02 18:34       ` Jeff King
@ 2022-09-02 18:44         ` Junio C Hamano
  0 siblings, 0 replies; 129+ messages in thread
From: Junio C Hamano @ 2022-09-02 18:44 UTC (permalink / raw)
  To: Jeff King
  Cc: Eric Sunshine, Johannes Schindelin,
	Eric Sunshine via GitGitGadget,
	Ævar Arnfjörð Bjarmason, Git List, Elijah Newren,
	Fabian Stelzer

Jeff King <peff@peff.net> writes:

> Thanks for this response and especially the links. My initial gut
> response was similar to Dscho's. Which is not surprising, because it
> apparently was also my initial response to chainlint.sed back then. ;)
>
> But I do think that chainlint.sed has proven itself to be both useful
> and not much of a maintenance burden. My only real complaint was the
> additional runtime in a few corner cases, and that is exactly what
> you're addressing here.

I have nothing to add to the above ;-)  Thanks all (including Dscho
who made us be more explicit in pros-and-cons).



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2022-09-02 18:16     ` Eric Sunshine
@ 2022-09-02 18:34       ` Jeff King
  2022-09-02 18:44         ` Junio C Hamano
  0 siblings, 1 reply; 129+ messages in thread
From: Jeff King @ 2022-09-02 18:34 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Johannes Schindelin, Eric Sunshine via GitGitGadget,
	Ævar Arnfjörð Bjarmason, Git List, Elijah Newren,
	Fabian Stelzer

On Fri, Sep 02, 2022 at 02:16:21PM -0400, Eric Sunshine wrote:

> > It would be one thing if we could use a well-maintained third-party tool
> > to do this job. But adding this to our plate? I hope we can avoid that.
> 
> I understand your concerns about review and maintenance burden, and
> you're not the first to make such observations; when chainlint.sed was
> submitted, it was greeted with similar concerns[1,2], all very
> understandable. The key takeaway[3] from that conversation, though,
> was that, unlike user-facing features which must be reviewed in detail
> and maintained in perpetuity, this is a mere developer aid which can
> be easily ejected from the project if it ever becomes a maintenance
> burden or shows itself to be unreliable. Potential maintenance burden
> aside, a very real benefit of such a tool is that it should help
> prevent bugs from slipping into the project going forward[4], which is
> indeed the aim of all our developer-focused aids.

Thanks for this response and especially the links. My initial gut
response was similar to Dscho's. Which is not surprising, because it
apparently was also my initial response to chainlint.sed back then. ;)

But I do think that chainlint.sed has proven itself to be both useful
and not much of a maintenance burden. My only real complaint was the
additional runtime in a few corner cases, and that is exactly what
you're addressing here.

I'm not excited about carefully reviewing it. At the same time, given
the low stakes, I'm kind of willing to accept that between the tests and
the results of running it on the current code base, the proof is in the
pudding.

-Peff

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2022-09-02 12:42   ` several messages Johannes Schindelin
@ 2022-09-02 18:16     ` Eric Sunshine
  2022-09-02 18:34       ` Jeff King
  0 siblings, 1 reply; 129+ messages in thread
From: Eric Sunshine @ 2022-09-02 18:16 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Eric Sunshine via GitGitGadget,
	Ævar Arnfjörð Bjarmason, Git List, Jeff King,
	Elijah Newren, Fabian Stelzer

On Fri, Sep 2, 2022 at 8:42 AM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> On Thu, 1 Sep 2022, Eric Sunshine via GitGitGadget wrote:
> >  t/chainlint.pl                                | 730 ++++++++++++++++++
> >  t/chainlint.sed                               | 399 ----------
> >  t/chainlint/blank-line-before-esac.expect     |  18 +
> >  t/chainlint/blank-line-before-esac.test       |  19 +
> >  ...
>
> This looks like it was a lot of work. And that it would be a lot of work
> to review, too, and certainly even more work to maintain.
>
> Are we really sure that we want to burden the Git project with this much
> stuff that is not actually related to Git's core functionality?
>
> It would be one thing if we could use a well-maintained third-party tool
> to do this job. But adding this to our plate? I hope we can avoid that.

I understand your concerns about review and maintenance burden, and
you're not the first to make such observations; when chainlint.sed was
submitted, it was greeted with similar concerns[1,2], all very
understandable. The key takeaway[3] from that conversation, though,
was that, unlike user-facing features which must be reviewed in detail
and maintained in perpetuity, this is a mere developer aid which can
be easily ejected from the project if it ever becomes a maintenance
burden or shows itself to be unreliable. Potential maintenance burden
aside, a very real benefit of such a tool is that it should help
prevent bugs from slipping into the project going forward[4], which is
indeed the aim of all our developer-focused aids.

In more practical terms, despite initial concerns, in the 4+ years
since its introduction, the maintenance cost of chainlint.sed has been
nearly zero. Very early on, there was a report[5] that chainlint.sed
was showing a false-positive in a `contrib` test script; the developer
quickly responded with a fix[6]. The only other maintenance issues
were a couple dead-simple changes[7,8] to shorten "labels" to support
older versions of `sed`. (As for the chainlint self-tests, the
maintenance cost has been exactly zero). My hope is that chainlint.pl
should have a similar track-record, but it can easily be dropped from
the project if not.

[1]: https://lore.kernel.org/git/xmqqk1q11mkj.fsf@gitster-ct.c.googlers.com/
[2]: https://lore.kernel.org/git/20180712165608.GA10515@sigill.intra.peff.net/
[3]: https://lore.kernel.org/git/CAPig+cRmAkiYqFXwRAkQALDoOo-79r2iAumdEJEZhBnETvL-fw@mail.gmail.com/
[4]: https://lore.kernel.org/git/xmqqin5kw7q3.fsf@gitster-ct.c.googlers.com/
[5]: https://lore.kernel.org/git/20180730181356.GA156463@aiede.svl.corp.google.com/
[6]: https://lore.kernel.org/git/20180807082135.60913-1-sunshine@sunshineco.com/
[7]: https://lore.kernel.org/git/20180824152016.20286-5-avarab@gmail.com/
[8]: https://lore.kernel.org/git/d15ed626de65c51ef2ba31020eeb2111fb8e091f.1596675905.git.gitgitgadget@gmail.com/

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2022-09-01  0:29 ` [PATCH 18/18] t: retire unused chainlint.sed Eric Sunshine via GitGitGadget
@ 2022-09-02 12:42   ` Johannes Schindelin
  2022-09-02 18:16     ` Eric Sunshine
  0 siblings, 1 reply; 129+ messages in thread
From: Johannes Schindelin @ 2022-09-02 12:42 UTC (permalink / raw)
  To: Eric Sunshine via GitGitGadget, Ævar Arnfjörð Bjarmason
  Cc: git, Jeff King, Elijah Newren, Fabian Stelzer, Eric Sunshine

Hi Eric,

On Thu, 1 Sep 2022, Eric Sunshine via GitGitGadget wrote:

>  contrib/buildsystems/CMakeLists.txt           |   2 +-
>  t/Makefile                                    |  49 +-
>  t/README                                      |   5 -
>  t/chainlint.pl                                | 730 ++++++++++++++++++
>  t/chainlint.sed                               | 399 ----------
>  t/chainlint/blank-line-before-esac.expect     |  18 +
>  t/chainlint/blank-line-before-esac.test       |  19 +
>  t/chainlint/block.expect                      |  15 +-
>  t/chainlint/block.test                        |  15 +-
>  t/chainlint/chain-break-background.expect     |   9 +
>  t/chainlint/chain-break-background.test       |  10 +
>  t/chainlint/chain-break-continue.expect       |  12 +
>  t/chainlint/chain-break-continue.test         |  13 +
>  t/chainlint/chain-break-false.expect          |   9 +
>  t/chainlint/chain-break-false.test            |  10 +
>  t/chainlint/chain-break-return-exit.expect    |  19 +
>  t/chainlint/chain-break-return-exit.test      |  23 +
>  t/chainlint/chain-break-status.expect         |   9 +
>  t/chainlint/chain-break-status.test           |  11 +
>  t/chainlint/chained-block.expect              |   9 +
>  t/chainlint/chained-block.test                |  11 +
>  t/chainlint/chained-subshell.expect           |  10 +
>  t/chainlint/chained-subshell.test             |  13 +
>  .../command-substitution-subsubshell.expect   |   2 +
>  .../command-substitution-subsubshell.test     |   3 +
>  t/chainlint/complex-if-in-cuddled-loop.expect |   2 +-
>  t/chainlint/double-here-doc.expect            |   2 +
>  t/chainlint/double-here-doc.test              |  12 +
>  t/chainlint/dqstring-line-splice.expect       |   3 +
>  t/chainlint/dqstring-line-splice.test         |   7 +
>  t/chainlint/dqstring-no-interpolate.expect    |  11 +
>  t/chainlint/dqstring-no-interpolate.test      |  15 +
>  t/chainlint/empty-here-doc.expect             |   3 +
>  t/chainlint/empty-here-doc.test               |   5 +
>  t/chainlint/exclamation.expect                |   4 +
>  t/chainlint/exclamation.test                  |   8 +
>  t/chainlint/for-loop-abbreviated.expect       |   5 +
>  t/chainlint/for-loop-abbreviated.test         |   6 +
>  t/chainlint/for-loop.expect                   |   4 +-
>  t/chainlint/function.expect                   |  11 +
>  t/chainlint/function.test                     |  13 +
>  t/chainlint/here-doc-indent-operator.expect   |   5 +
>  t/chainlint/here-doc-indent-operator.test     |  13 +
>  t/chainlint/here-doc-multi-line-string.expect |   3 +-
>  t/chainlint/if-condition-split.expect         |   7 +
>  t/chainlint/if-condition-split.test           |   8 +
>  t/chainlint/if-in-loop.expect                 |   2 +-
>  t/chainlint/if-in-loop.test                   |   2 +-
>  t/chainlint/loop-detect-failure.expect        |  15 +
>  t/chainlint/loop-detect-failure.test          |  17 +
>  t/chainlint/loop-detect-status.expect         |  18 +
>  t/chainlint/loop-detect-status.test           |  19 +
>  t/chainlint/loop-in-if.expect                 |   2 +-
>  t/chainlint/loop-upstream-pipe.expect         |  10 +
>  t/chainlint/loop-upstream-pipe.test           |  11 +
>  t/chainlint/multi-line-string.expect          |  11 +-
>  t/chainlint/nested-loop-detect-failure.expect |  31 +
>  t/chainlint/nested-loop-detect-failure.test   |  35 +
>  t/chainlint/nested-subshell.expect            |   2 +-
>  t/chainlint/one-liner-for-loop.expect         |   9 +
>  t/chainlint/one-liner-for-loop.test           |  10 +
>  t/chainlint/return-loop.expect                |   5 +
>  t/chainlint/return-loop.test                  |   6 +
>  t/chainlint/semicolon.expect                  |   2 +-
>  t/chainlint/sqstring-in-sqstring.expect       |   4 +
>  t/chainlint/sqstring-in-sqstring.test         |   5 +
>  t/chainlint/t7900-subtree.expect              |  13 +-
>  t/chainlint/token-pasting.expect              |  27 +
>  t/chainlint/token-pasting.test                |  32 +
>  t/chainlint/while-loop.expect                 |   4 +-
>  t/t0027-auto-crlf.sh                          |   7 +-
>  t/t3070-wildmatch.sh                          |   5 -
>  t/test-lib.sh                                 |  12 +-
>  73 files changed, 1439 insertions(+), 449 deletions(-)
>  create mode 100755 t/chainlint.pl
>  delete mode 100644 t/chainlint.sed
>  create mode 100644 t/chainlint/blank-line-before-esac.expect
>  create mode 100644 t/chainlint/blank-line-before-esac.test
>  create mode 100644 t/chainlint/chain-break-background.expect
>  create mode 100644 t/chainlint/chain-break-background.test
>  create mode 100644 t/chainlint/chain-break-continue.expect
>  create mode 100644 t/chainlint/chain-break-continue.test
>  create mode 100644 t/chainlint/chain-break-false.expect
>  create mode 100644 t/chainlint/chain-break-false.test
>  create mode 100644 t/chainlint/chain-break-return-exit.expect
>  create mode 100644 t/chainlint/chain-break-return-exit.test
>  create mode 100644 t/chainlint/chain-break-status.expect
>  create mode 100644 t/chainlint/chain-break-status.test
>  create mode 100644 t/chainlint/chained-block.expect
>  create mode 100644 t/chainlint/chained-block.test
>  create mode 100644 t/chainlint/chained-subshell.expect
>  create mode 100644 t/chainlint/chained-subshell.test
>  create mode 100644 t/chainlint/command-substitution-subsubshell.expect
>  create mode 100644 t/chainlint/command-substitution-subsubshell.test
>  create mode 100644 t/chainlint/double-here-doc.expect
>  create mode 100644 t/chainlint/double-here-doc.test
>  create mode 100644 t/chainlint/dqstring-line-splice.expect
>  create mode 100644 t/chainlint/dqstring-line-splice.test
>  create mode 100644 t/chainlint/dqstring-no-interpolate.expect
>  create mode 100644 t/chainlint/dqstring-no-interpolate.test
>  create mode 100644 t/chainlint/empty-here-doc.expect
>  create mode 100644 t/chainlint/empty-here-doc.test
>  create mode 100644 t/chainlint/exclamation.expect
>  create mode 100644 t/chainlint/exclamation.test
>  create mode 100644 t/chainlint/for-loop-abbreviated.expect
>  create mode 100644 t/chainlint/for-loop-abbreviated.test
>  create mode 100644 t/chainlint/function.expect
>  create mode 100644 t/chainlint/function.test
>  create mode 100644 t/chainlint/here-doc-indent-operator.expect
>  create mode 100644 t/chainlint/here-doc-indent-operator.test
>  create mode 100644 t/chainlint/if-condition-split.expect
>  create mode 100644 t/chainlint/if-condition-split.test
>  create mode 100644 t/chainlint/loop-detect-failure.expect
>  create mode 100644 t/chainlint/loop-detect-failure.test
>  create mode 100644 t/chainlint/loop-detect-status.expect
>  create mode 100644 t/chainlint/loop-detect-status.test
>  create mode 100644 t/chainlint/loop-upstream-pipe.expect
>  create mode 100644 t/chainlint/loop-upstream-pipe.test
>  create mode 100644 t/chainlint/nested-loop-detect-failure.expect
>  create mode 100644 t/chainlint/nested-loop-detect-failure.test
>  create mode 100644 t/chainlint/one-liner-for-loop.expect
>  create mode 100644 t/chainlint/one-liner-for-loop.test
>  create mode 100644 t/chainlint/return-loop.expect
>  create mode 100644 t/chainlint/return-loop.test
>  create mode 100644 t/chainlint/sqstring-in-sqstring.expect
>  create mode 100644 t/chainlint/sqstring-in-sqstring.test
>  create mode 100644 t/chainlint/token-pasting.expect
>  create mode 100644 t/chainlint/token-pasting.test

This looks like it was a lot of work. And that it would be a lot of work
to review, too, and certainly even more work to maintain.

Are we really sure that we want to burden the Git project with this much
stuff that is not actually related to Git's core functionality?

It would be one thing if we could use a well-maintained third-party tool
to do this job. But adding this to our plate? I hope we can avoid that.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2022-06-16 15:27 ` selftests: mptcp: tweak simult_flows for debug kernels.: Tests Results MPTCP CI
@ 2022-06-17 22:13   ` Mat Martineau
  0 siblings, 0 replies; 129+ messages in thread
From: Mat Martineau @ 2022-06-17 22:13 UTC (permalink / raw)
  To: Paolo Abeni, mptcp

[-- Attachment #1: Type: text/plain, Size: 3986 bytes --]

On Thu, 16 Jun 2022, Paolo Abeni wrote:

> The mentioned test measures the transfer run-time to verify
> that the user-space program is able to use the full aggregate B/W.
>
> Even on (virtual) link-speed-bound tests, debug kernel can slow
> down the transfer enough to cause sporadic test failures.
>
> Instead of unconditionally raising the maximum allowed run-time,
> tweak when the running kernel is a debug one, and use some simple/
> rough heuristic to guess such scenarios.
>
> Note: this intentionally avoids looking for /boot/config-<version> as
> the latter file is not always available in our reference CI
> environments.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Looks good, runs fine in my vm with debug kernel config:

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>

> ---
> tools/testing/selftests/net/mptcp/simult_flows.sh | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/net/mptcp/simult_flows.sh b/tools/testing/selftests/net/mptcp/simult_flows.sh
> index f441ff7904fc..141fcf0d40d1 100755
> --- a/tools/testing/selftests/net/mptcp/simult_flows.sh
> +++ b/tools/testing/selftests/net/mptcp/simult_flows.sh
> @@ -12,6 +12,7 @@ timeout_test=$((timeout_poll * 2 + 1))
> test_cnt=1
> ret=0
> bail=0
> +slack=50
>
> usage() {
> 	echo "Usage: $0 [ -b ] [ -c ] [ -d ]"
> @@ -52,6 +53,7 @@ setup()
> 	cout=$(mktemp)
> 	capout=$(mktemp)
> 	size=$((2 * 2048 * 4096))
> +
> 	dd if=/dev/zero of=$small bs=4096 count=20 >/dev/null 2>&1
> 	dd if=/dev/zero of=$large bs=4096 count=$((size / 4096)) >/dev/null 2>&1
>
> @@ -104,6 +106,13 @@ setup()
> 	ip -net "$ns3" route add default via dead:beef:3::2
>
> 	ip netns exec "$ns3" ./pm_nl_ctl limits 1 1
> +
> +	# debug build can slow down measurably the test program
> +	# we use quite tight time limit on the run-time, to ensure
> +	# maximum B/W usage.
> +	# Use the kmemleak file presence as a rough estimate for this being
> +	# a debug kernel and increase the maximum run-time accordingly
> +	[ -f /sys/kernel/debug/kmemleak ] && slack=$((slack+200))
> }
>
> # $1: ns, $2: port
> @@ -241,7 +250,7 @@ run_test()
>
> 	# mptcp_connect will do some sleeps to allow the mp_join handshake
> 	# completion (see mptcp_connect): 200ms on each side, add some slack
> -	time=$((time + 450))
> +	time=$((time + 400 + $slack))
>
> 	printf "%-60s" "$msg"
> 	do_transfer $small $large $time
> -- 
> 2.35.3
>
>
>

On Thu, 16 Jun 2022, MPTCP CI wrote:

> Hi Paolo,
>
> Thank you for your modifications, that's great!
>
> Our CI did some validations and here is its report:
>
> - KVM Validation: normal:
>  - Success! ✅:
>  - Task: https://cirrus-ci.com/task/5708948505362432
>  - Summary: https://api.cirrus-ci.com/v1/artifact/task/5708948505362432/summary/summary.txt
>
> - KVM Validation: debug:
>  - Unstable: 3 failed test(s): packetdrill_add_addr selftest_diag selftest_mptcp_join 🔴:
>  - Task: https://cirrus-ci.com/task/5145998551941120
>  - Summary: https://api.cirrus-ci.com/v1/artifact/task/5145998551941120/summary/summary.txt
>
> Initiator: Patchew Applier
> Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/727243b29682
>
>
> If there are some issues, you can reproduce them using the same environment as
> the one used by the CI thanks to a docker image, e.g.:
>
>    $ cd [kernel source code]
>    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
>        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
>        auto-debug
>
> For more details:
>
>    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker
>
>
> Please note that despite all the efforts that have been already done to have a
> stable tests suite when executed on a public CI like here, it is possible some
> reported issues are not due to your modifications. Still, do not hesitate to
> help us improve that ;-)
>
> Cheers,
> MPTCP GH Action bot
> Bot operated by Matthieu Baerts (Tessares)
>
>

--
Mat Martineau
Intel

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2016-01-27 10:09     ` Thomas Gleixner
  (?)
@ 2016-01-29 13:21     ` Borislav Petkov
  -1 siblings, 0 replies; 129+ messages in thread
From: Borislav Petkov @ 2016-01-29 13:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Ingo Molnar, x86, linux-kernel, Brian Gerst,
	Dave Hansen, Linus Torvalds, Oleg Nesterov, linux-mm,
	Andrey Ryabinin

On Wed, Jan 27, 2016 at 11:09:04AM +0100, Thomas Gleixner wrote:
> On Mon, 25 Jan 2016, Andy Lutomirski wrote:
> > This is a straightforward speedup on Ivy Bridge and newer, IIRC.
> > (I tested on Skylake.  INVPCID is not available on Sandy Bridge.
> > I don't have Ivy Bridge, Haswell or Broadwell to test on, so I
> > could be wrong as to when the feature was introduced.)
>
> Haswell and Broadwell have it. No idea about ivy bridge.

I have an IVB model 58. It doesn't have it:

CPUID_0x00000007: EAX=0x00000000, EBX=0x00000281, ECX=0x00000000, EDX=0x00000000

INVPCID should be EBX[10].

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2016-01-25 18:57 ` Ingo Molnar
@ 2016-01-27 10:09     ` Thomas Gleixner
  0 siblings, 0 replies; 129+ messages in thread
From: Thomas Gleixner @ 2016-01-27 10:09 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: x86, linux-kernel, Borislav Petkov, Brian Gerst, Dave Hansen,
	Linus Torvalds, Oleg Nesterov, linux-mm, Andrey Ryabinin

On Mon, 25 Jan 2016, Andy Lutomirski wrote:
> This is a straightforward speedup on Ivy Bridge and newer, IIRC.
> (I tested on Skylake.  INVPCID is not available on Sandy Bridge.
> I don't have Ivy Bridge, Haswell or Broadwell to test on, so I
> could be wrong as to when the feature was introduced.)

Haswell and Broadwell have it. No idea about ivy bridge.
 
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2016-01-27 10:09     ` Thomas Gleixner
  0 siblings, 0 replies; 129+ messages in thread
From: Thomas Gleixner @ 2016-01-27 10:09 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: x86, linux-kernel, Borislav Petkov, Brian Gerst, Dave Hansen,
	Linus Torvalds, Oleg Nesterov, linux-mm, Andrey Ryabinin

On Mon, 25 Jan 2016, Andy Lutomirski wrote:
> This is a straightforward speedup on Ivy Bridge and newer, IIRC.
> (I tested on Skylake.  INVPCID is not available on Sandy Bridge.
> I don't have Ivy Bridge, Haswell or Broadwell to test on, so I
> could be wrong as to when the feature was introduced.)

Haswell and Broadwell have it. No idea about ivy bridge.
 
Thanks,

	tglx

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2014-07-04  4:13             ` Nicolas Pitre
@ 2014-07-04 17:45               ` Abhilash Kesavan
  -1 siblings, 0 replies; 129+ messages in thread
From: Abhilash Kesavan @ 2014-07-04 17:45 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: linux-samsung-soc, linux-arm-kernel, Kukjin Kim,
	Lorenzo Pieralisi, Andrew Bresticker, Douglas Anderson

Hi Nicolas,

On Fri, Jul 4, 2014 at 9:43 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Fri, 4 Jul 2014, Abhilash Kesavan wrote:
>
>> Hi Nicolas,
>>
>> On Fri, Jul 4, 2014 at 12:30 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> > On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
>> >
>> >> Hi Nicolas,
>> >>
>> >> On Thu, Jul 3, 2014 at 9:15 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> >> > On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
>> >> >
>> >> >> On Thu, Jul 3, 2014 at 6:59 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> >> >> > Please, let's avoid going that route.  There is no such special handling
>> >> >> > needed if the API is sufficient.  And the provided API allows you to
>> >> >> > suspend a CPU or shut it down.  It shouldn't matter at that level if
>> >> >> > this is due to a cluster switch or a hotplug event. Do you really need
>> >> >> > something else?
>> >> >> No, just one local flag for suspend should be enough for me. Will remove these.
>> >> >
>> >> > [...]
>> >> >
>> >> >> Changes in v5:
>> >> >>       - Removed the MCPM flags and just used a local flag to
>> >> >>       indicate that we are suspending.
>> >> >
>> >> > [...]
>> >> >
>> >> >> -static void exynos_power_down(void)
>> >> >> +static void exynos_mcpm_power_down(u64 residency)
>> >> >>  {
>> >> >>       unsigned int mpidr, cpu, cluster;
>> >> >>       bool last_man = false, skip_wfi = false;
>> >> >> @@ -150,7 +153,12 @@ static void exynos_power_down(void)
>> >> >>       BUG_ON(__mcpm_cluster_state(cluster) != CLUSTER_UP);
>> >> >>       cpu_use_count[cpu][cluster]--;
>> >> >>       if (cpu_use_count[cpu][cluster] == 0) {
>> >> >> -             exynos_cpu_power_down(cpunr);
>> >> >> +             /*
>> >> >> +              * Bypass power down for CPU0 during suspend. This is being
>> >> >> +              * taken care by the SYS_PWR_CFG bit in CORE0_SYS_PWR_REG.
>> >> >> +              */
>> >> >> +             if ((cpunr != 0) || (residency != S5P_CHECK_SLEEP))
>> >> >> +                     exynos_cpu_power_down(cpunr);
>> >> >>
>> >> >>               if (exynos_cluster_unused(cluster)) {
>> >> >>                       exynos_cluster_power_down(cluster);
>> >> >> @@ -209,6 +217,11 @@ static void exynos_power_down(void)
>> >> >>       /* Not dead at this point?  Let our caller cope. */
>> >> >>  }
>> >> >>
>> >> >> +static void exynos_power_down(void)
>> >> >> +{
>> >> >> +     exynos_mcpm_power_down(0);
>> >> >> +}
>> >> >
>> >> > [...]
>> >> >
>> >> >> +static int notrace exynos_mcpm_cpu_suspend(unsigned long arg)
>> >> >> +{
>> >> >> +     /* MCPM works with HW CPU identifiers */
>> >> >> +     unsigned int mpidr = read_cpuid_mpidr();
>> >> >> +     unsigned int cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
>> >> >> +     unsigned int cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
>> >> >> +
>> >> >> +     __raw_writel(0x0, sysram_base_addr + EXYNOS5420_CPU_STATE);
>> >> >> +
>> >> >> +     mcpm_set_entry_vector(cpu, cluster, exynos_cpu_resume);
>> >> >> +
>> >> >> +     /*
>> >> >> +      * Pass S5P_CHECK_SLEEP flag to the MCPM back-end to indicate that
>> >> >> +      * we are suspending the system and need to skip CPU0 power down.
>> >> >> +      */
>> >> >> +     mcpm_cpu_suspend(S5P_CHECK_SLEEP);
>> >> >
>> >> > NAK.
>> >> >
>> >> > When I say "local flag with local meaning", I don't want you to smuggle
>> >> > that flag through a public interface either.  I find it rather inelegant
>> >> > to have the notion of special handling for CPU0 being spread around like
>> >> > that.
>> >> >
>> >> > If CPU0 is special, then it should be dealth with in one place only,
>> >> > ideally in the MCPM backend, so the rest of the kernel doesn't have to
>> >> > care.
>> >> >
>> >> > Again, here's what I mean:
>> >> >
>> >> > static void exynos_mcpm_down_handler(int flags)
>> >> > {
>> >> >         [...]
>> >> >         if ((cpunr != 0) || !(flags & SKIP_CPU_POWERDOWN_IF_CPU0))
>> >> >                 exynos_cpu_power_down(cpunr);
>> >> >         [...]
>> >> > }
>> >> >
>> >> > static void exynos_mcpm_power_down()
>> >> > {
>> >> >         exynos_mcpm_down_handler(0);
>> >> > }
>> >> >
>> >> > static void exynos_mcpm_suspend(u64 residency)
>> >> > {
>> >> >         /*
>> >> >          * Theresidency argument is ignored for now.
>> >> >          * However, in the CPU suspend case, we bypass power down for
>> >> >          * CPU0 as this is being taken care by the SYS_PWR_CFG bit in
>> >> >          * CORE0_SYS_PWR_REG.
>> >> >          */
>> >> >         exynos_mcpm_down_handler(SKIP_CPU_POWERDOWN_IF_CPU0);
>> >> > }
>> >> >
>> >> > And SKIP_CPU_POWERDOWN_IF_CPU0 is defined in and visible to
>> >> > mcpm-exynos.c only.
>> >> Sorry if I am being dense, but the exynos_mcpm_suspend function would
>> >> get called from both the bL cpuidle driver as well the exynos pm code.
>> >
>> > What is that exynos pm code actually doing?
>> exynos pm code is shared across Exynos4 and 5 SoCs. It primarily does
>> a series of register configurations which are required to put the
>> system to sleep (some parts of these are SoC specific and others
>> common). It also populates the suspend_ops for exynos. In the current
>> patch, exynos_suspend_enter() is where I have plugged in the
>> mcpm_cpu_suspend call.
>>
>> This patch is based on the S2R series for 5420
>> (http://comments.gmane.org/gmane.linux.kernel.samsung-soc/33898), you
>> may also have a look at that for a clearer picture.
>> >
>> >> We want to skip CPU0 only in case of the S2R case i.e. the pm code and
>> >> keep the CPU0 power down code for all other cases including CPUIdle.
>> >
>> > OK.  If so I missed that somehow (hint hint).
>> >
>> >> If I call exynos_mcpm_down_handler with the flag in
>> >> exynos_mcpm_suspend(), CPUIdle will also skip CPU0 isn't it ?
>> >
>> > As it is, yes.  You could logically use an infinite residency time
>> > (something like U64_MAX) to distinguish S2RAM from other types of
>> > suspend.
>> OK, I will use this rather than the S5P_CHECK_SLEEP macro.
>
> Another suggestion which might possibly be better: why not looking for
> the SYS_PWR_CFG bit in exynos_cpu_power_down() directly?  After all,
> exynos_cpu_power_down() is semantically supposed to do what its name
> suggest and could simply do nothing if the proper conditions are already
> in place.
I have implemented this and it works fine. Patch coming up.

Regards,
Abhilash
>
>
> Nicolas

^ permalink raw reply	[flat|nested] 129+ messages in thread

* several messages
@ 2014-07-04 17:45               ` Abhilash Kesavan
  0 siblings, 0 replies; 129+ messages in thread
From: Abhilash Kesavan @ 2014-07-04 17:45 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Nicolas,

On Fri, Jul 4, 2014 at 9:43 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Fri, 4 Jul 2014, Abhilash Kesavan wrote:
>
>> Hi Nicolas,
>>
>> On Fri, Jul 4, 2014 at 12:30 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> > On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
>> >
>> >> Hi Nicolas,
>> >>
>> >> On Thu, Jul 3, 2014 at 9:15 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> >> > On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
>> >> >
>> >> >> On Thu, Jul 3, 2014 at 6:59 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> >> >> > Please, let's avoid going that route.  There is no such special handling
>> >> >> > needed if the API is sufficient.  And the provided API allows you to
>> >> >> > suspend a CPU or shut it down.  It shouldn't matter at that level if
>> >> >> > this is due to a cluster switch or a hotplug event. Do you really need
>> >> >> > something else?
>> >> >> No, just one local flag for suspend should be enough for me. Will remove these.
>> >> >
>> >> > [...]
>> >> >
>> >> >> Changes in v5:
>> >> >>       - Removed the MCPM flags and just used a local flag to
>> >> >>       indicate that we are suspending.
>> >> >
>> >> > [...]
>> >> >
>> >> >> -static void exynos_power_down(void)
>> >> >> +static void exynos_mcpm_power_down(u64 residency)
>> >> >>  {
>> >> >>       unsigned int mpidr, cpu, cluster;
>> >> >>       bool last_man = false, skip_wfi = false;
>> >> >> @@ -150,7 +153,12 @@ static void exynos_power_down(void)
>> >> >>       BUG_ON(__mcpm_cluster_state(cluster) != CLUSTER_UP);
>> >> >>       cpu_use_count[cpu][cluster]--;
>> >> >>       if (cpu_use_count[cpu][cluster] == 0) {
>> >> >> -             exynos_cpu_power_down(cpunr);
>> >> >> +             /*
>> >> >> +              * Bypass power down for CPU0 during suspend. This is being
>> >> >> +              * taken care by the SYS_PWR_CFG bit in CORE0_SYS_PWR_REG.
>> >> >> +              */
>> >> >> +             if ((cpunr != 0) || (residency != S5P_CHECK_SLEEP))
>> >> >> +                     exynos_cpu_power_down(cpunr);
>> >> >>
>> >> >>               if (exynos_cluster_unused(cluster)) {
>> >> >>                       exynos_cluster_power_down(cluster);
>> >> >> @@ -209,6 +217,11 @@ static void exynos_power_down(void)
>> >> >>       /* Not dead at this point?  Let our caller cope. */
>> >> >>  }
>> >> >>
>> >> >> +static void exynos_power_down(void)
>> >> >> +{
>> >> >> +     exynos_mcpm_power_down(0);
>> >> >> +}
>> >> >
>> >> > [...]
>> >> >
>> >> >> +static int notrace exynos_mcpm_cpu_suspend(unsigned long arg)
>> >> >> +{
>> >> >> +     /* MCPM works with HW CPU identifiers */
>> >> >> +     unsigned int mpidr = read_cpuid_mpidr();
>> >> >> +     unsigned int cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
>> >> >> +     unsigned int cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
>> >> >> +
>> >> >> +     __raw_writel(0x0, sysram_base_addr + EXYNOS5420_CPU_STATE);
>> >> >> +
>> >> >> +     mcpm_set_entry_vector(cpu, cluster, exynos_cpu_resume);
>> >> >> +
>> >> >> +     /*
>> >> >> +      * Pass S5P_CHECK_SLEEP flag to the MCPM back-end to indicate that
>> >> >> +      * we are suspending the system and need to skip CPU0 power down.
>> >> >> +      */
>> >> >> +     mcpm_cpu_suspend(S5P_CHECK_SLEEP);
>> >> >
>> >> > NAK.
>> >> >
>> >> > When I say "local flag with local meaning", I don't want you to smuggle
>> >> > that flag through a public interface either.  I find it rather inelegant
>> >> > to have the notion of special handling for CPU0 being spread around like
>> >> > that.
>> >> >
>> >> > If CPU0 is special, then it should be dealth with in one place only,
>> >> > ideally in the MCPM backend, so the rest of the kernel doesn't have to
>> >> > care.
>> >> >
>> >> > Again, here's what I mean:
>> >> >
>> >> > static void exynos_mcpm_down_handler(int flags)
>> >> > {
>> >> >         [...]
>> >> >         if ((cpunr != 0) || !(flags & SKIP_CPU_POWERDOWN_IF_CPU0))
>> >> >                 exynos_cpu_power_down(cpunr);
>> >> >         [...]
>> >> > }
>> >> >
>> >> > static void exynos_mcpm_power_down()
>> >> > {
>> >> >         exynos_mcpm_down_handler(0);
>> >> > }
>> >> >
>> >> > static void exynos_mcpm_suspend(u64 residency)
>> >> > {
>> >> >         /*
>> >> >          * Theresidency argument is ignored for now.
>> >> >          * However, in the CPU suspend case, we bypass power down for
>> >> >          * CPU0 as this is being taken care by the SYS_PWR_CFG bit in
>> >> >          * CORE0_SYS_PWR_REG.
>> >> >          */
>> >> >         exynos_mcpm_down_handler(SKIP_CPU_POWERDOWN_IF_CPU0);
>> >> > }
>> >> >
>> >> > And SKIP_CPU_POWERDOWN_IF_CPU0 is defined in and visible to
>> >> > mcpm-exynos.c only.
>> >> Sorry if I am being dense, but the exynos_mcpm_suspend function would
>> >> get called from both the bL cpuidle driver as well the exynos pm code.
>> >
>> > What is that exynos pm code actually doing?
>> exynos pm code is shared across Exynos4 and 5 SoCs. It primarily does
>> a series of register configurations which are required to put the
>> system to sleep (some parts of these are SoC specific and others
>> common). It also populates the suspend_ops for exynos. In the current
>> patch, exynos_suspend_enter() is where I have plugged in the
>> mcpm_cpu_suspend call.
>>
>> This patch is based on the S2R series for 5420
>> (http://comments.gmane.org/gmane.linux.kernel.samsung-soc/33898), you
>> may also have a look at that for a clearer picture.
>> >
>> >> We want to skip CPU0 only in case of the S2R case i.e. the pm code and
>> >> keep the CPU0 power down code for all other cases including CPUIdle.
>> >
>> > OK.  If so I missed that somehow (hint hint).
>> >
>> >> If I call exynos_mcpm_down_handler with the flag in
>> >> exynos_mcpm_suspend(), CPUIdle will also skip CPU0 isn't it ?
>> >
>> > As it is, yes.  You could logically use an infinite residency time
>> > (something like U64_MAX) to distinguish S2RAM from other types of
>> > suspend.
>> OK, I will use this rather than the S5P_CHECK_SLEEP macro.
>
> Another suggestion which might possibly be better: why not looking for
> the SYS_PWR_CFG bit in exynos_cpu_power_down() directly?  After all,
> exynos_cpu_power_down() is semantically supposed to do what its name
> suggest and could simply do nothing if the proper conditions are already
> in place.
I have implemented this and it works fine. Patch coming up.

Regards,
Abhilash
>
>
> Nicolas

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2014-07-03 20:00           ` Abhilash Kesavan
@ 2014-07-04  4:13             ` Nicolas Pitre
  -1 siblings, 0 replies; 129+ messages in thread
From: Nicolas Pitre @ 2014-07-04  4:13 UTC (permalink / raw)
  To: Abhilash Kesavan
  Cc: linux-samsung-soc, linux-arm-kernel, Kukjin Kim,
	Lorenzo Pieralisi, Andrew Bresticker, Douglas Anderson

On Fri, 4 Jul 2014, Abhilash Kesavan wrote:

> Hi Nicolas,
> 
> On Fri, Jul 4, 2014 at 12:30 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
> >
> >> Hi Nicolas,
> >>
> >> On Thu, Jul 3, 2014 at 9:15 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> >> > On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
> >> >
> >> >> On Thu, Jul 3, 2014 at 6:59 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> >> >> > Please, let's avoid going that route.  There is no such special handling
> >> >> > needed if the API is sufficient.  And the provided API allows you to
> >> >> > suspend a CPU or shut it down.  It shouldn't matter at that level if
> >> >> > this is due to a cluster switch or a hotplug event. Do you really need
> >> >> > something else?
> >> >> No, just one local flag for suspend should be enough for me. Will remove these.
> >> >
> >> > [...]
> >> >
> >> >> Changes in v5:
> >> >>       - Removed the MCPM flags and just used a local flag to
> >> >>       indicate that we are suspending.
> >> >
> >> > [...]
> >> >
> >> >> -static void exynos_power_down(void)
> >> >> +static void exynos_mcpm_power_down(u64 residency)
> >> >>  {
> >> >>       unsigned int mpidr, cpu, cluster;
> >> >>       bool last_man = false, skip_wfi = false;
> >> >> @@ -150,7 +153,12 @@ static void exynos_power_down(void)
> >> >>       BUG_ON(__mcpm_cluster_state(cluster) != CLUSTER_UP);
> >> >>       cpu_use_count[cpu][cluster]--;
> >> >>       if (cpu_use_count[cpu][cluster] == 0) {
> >> >> -             exynos_cpu_power_down(cpunr);
> >> >> +             /*
> >> >> +              * Bypass power down for CPU0 during suspend. This is being
> >> >> +              * taken care by the SYS_PWR_CFG bit in CORE0_SYS_PWR_REG.
> >> >> +              */
> >> >> +             if ((cpunr != 0) || (residency != S5P_CHECK_SLEEP))
> >> >> +                     exynos_cpu_power_down(cpunr);
> >> >>
> >> >>               if (exynos_cluster_unused(cluster)) {
> >> >>                       exynos_cluster_power_down(cluster);
> >> >> @@ -209,6 +217,11 @@ static void exynos_power_down(void)
> >> >>       /* Not dead at this point?  Let our caller cope. */
> >> >>  }
> >> >>
> >> >> +static void exynos_power_down(void)
> >> >> +{
> >> >> +     exynos_mcpm_power_down(0);
> >> >> +}
> >> >
> >> > [...]
> >> >
> >> >> +static int notrace exynos_mcpm_cpu_suspend(unsigned long arg)
> >> >> +{
> >> >> +     /* MCPM works with HW CPU identifiers */
> >> >> +     unsigned int mpidr = read_cpuid_mpidr();
> >> >> +     unsigned int cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
> >> >> +     unsigned int cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
> >> >> +
> >> >> +     __raw_writel(0x0, sysram_base_addr + EXYNOS5420_CPU_STATE);
> >> >> +
> >> >> +     mcpm_set_entry_vector(cpu, cluster, exynos_cpu_resume);
> >> >> +
> >> >> +     /*
> >> >> +      * Pass S5P_CHECK_SLEEP flag to the MCPM back-end to indicate that
> >> >> +      * we are suspending the system and need to skip CPU0 power down.
> >> >> +      */
> >> >> +     mcpm_cpu_suspend(S5P_CHECK_SLEEP);
> >> >
> >> > NAK.
> >> >
> >> > When I say "local flag with local meaning", I don't want you to smuggle
> >> > that flag through a public interface either.  I find it rather inelegant
> >> > to have the notion of special handling for CPU0 being spread around like
> >> > that.
> >> >
> >> > If CPU0 is special, then it should be dealth with in one place only,
> >> > ideally in the MCPM backend, so the rest of the kernel doesn't have to
> >> > care.
> >> >
> >> > Again, here's what I mean:
> >> >
> >> > static void exynos_mcpm_down_handler(int flags)
> >> > {
> >> >         [...]
> >> >         if ((cpunr != 0) || !(flags & SKIP_CPU_POWERDOWN_IF_CPU0))
> >> >                 exynos_cpu_power_down(cpunr);
> >> >         [...]
> >> > }
> >> >
> >> > static void exynos_mcpm_power_down()
> >> > {
> >> >         exynos_mcpm_down_handler(0);
> >> > }
> >> >
> >> > static void exynos_mcpm_suspend(u64 residency)
> >> > {
> >> >         /*
> >> >          * Theresidency argument is ignored for now.
> >> >          * However, in the CPU suspend case, we bypass power down for
> >> >          * CPU0 as this is being taken care by the SYS_PWR_CFG bit in
> >> >          * CORE0_SYS_PWR_REG.
> >> >          */
> >> >         exynos_mcpm_down_handler(SKIP_CPU_POWERDOWN_IF_CPU0);
> >> > }
> >> >
> >> > And SKIP_CPU_POWERDOWN_IF_CPU0 is defined in and visible to
> >> > mcpm-exynos.c only.
> >> Sorry if I am being dense, but the exynos_mcpm_suspend function would
> >> get called from both the bL cpuidle driver as well the exynos pm code.
> >
> > What is that exynos pm code actually doing?
> exynos pm code is shared across Exynos4 and 5 SoCs. It primarily does
> a series of register configurations which are required to put the
> system to sleep (some parts of these are SoC specific and others
> common). It also populates the suspend_ops for exynos. In the current
> patch, exynos_suspend_enter() is where I have plugged in the
> mcpm_cpu_suspend call.
> 
> This patch is based on the S2R series for 5420
> (http://comments.gmane.org/gmane.linux.kernel.samsung-soc/33898), you
> may also have a look at that for a clearer picture.
> >
> >> We want to skip CPU0 only in case of the S2R case i.e. the pm code and
> >> keep the CPU0 power down code for all other cases including CPUIdle.
> >
> > OK.  If so I missed that somehow (hint hint).
> >
> >> If I call exynos_mcpm_down_handler with the flag in
> >> exynos_mcpm_suspend(), CPUIdle will also skip CPU0 isn't it ?
> >
> > As it is, yes.  You could logically use an infinite residency time
> > (something like U64_MAX) to distinguish S2RAM from other types of
> > suspend.
> OK, I will use this rather than the S5P_CHECK_SLEEP macro.

Another suggestion which might possibly be better: why not looking for 
the SYS_PWR_CFG bit in exynos_cpu_power_down() directly?  After all, 
exynos_cpu_power_down() is semantically supposed to do what its name 
suggest and could simply do nothing if the proper conditions are already 
in place.


Nicolas

^ permalink raw reply	[flat|nested] 129+ messages in thread

* several messages
@ 2014-07-04  4:13             ` Nicolas Pitre
  0 siblings, 0 replies; 129+ messages in thread
From: Nicolas Pitre @ 2014-07-04  4:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 4 Jul 2014, Abhilash Kesavan wrote:

> Hi Nicolas,
> 
> On Fri, Jul 4, 2014 at 12:30 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
> >
> >> Hi Nicolas,
> >>
> >> On Thu, Jul 3, 2014 at 9:15 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> >> > On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
> >> >
> >> >> On Thu, Jul 3, 2014 at 6:59 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> >> >> > Please, let's avoid going that route.  There is no such special handling
> >> >> > needed if the API is sufficient.  And the provided API allows you to
> >> >> > suspend a CPU or shut it down.  It shouldn't matter at that level if
> >> >> > this is due to a cluster switch or a hotplug event. Do you really need
> >> >> > something else?
> >> >> No, just one local flag for suspend should be enough for me. Will remove these.
> >> >
> >> > [...]
> >> >
> >> >> Changes in v5:
> >> >>       - Removed the MCPM flags and just used a local flag to
> >> >>       indicate that we are suspending.
> >> >
> >> > [...]
> >> >
> >> >> -static void exynos_power_down(void)
> >> >> +static void exynos_mcpm_power_down(u64 residency)
> >> >>  {
> >> >>       unsigned int mpidr, cpu, cluster;
> >> >>       bool last_man = false, skip_wfi = false;
> >> >> @@ -150,7 +153,12 @@ static void exynos_power_down(void)
> >> >>       BUG_ON(__mcpm_cluster_state(cluster) != CLUSTER_UP);
> >> >>       cpu_use_count[cpu][cluster]--;
> >> >>       if (cpu_use_count[cpu][cluster] == 0) {
> >> >> -             exynos_cpu_power_down(cpunr);
> >> >> +             /*
> >> >> +              * Bypass power down for CPU0 during suspend. This is being
> >> >> +              * taken care by the SYS_PWR_CFG bit in CORE0_SYS_PWR_REG.
> >> >> +              */
> >> >> +             if ((cpunr != 0) || (residency != S5P_CHECK_SLEEP))
> >> >> +                     exynos_cpu_power_down(cpunr);
> >> >>
> >> >>               if (exynos_cluster_unused(cluster)) {
> >> >>                       exynos_cluster_power_down(cluster);
> >> >> @@ -209,6 +217,11 @@ static void exynos_power_down(void)
> >> >>       /* Not dead at this point?  Let our caller cope. */
> >> >>  }
> >> >>
> >> >> +static void exynos_power_down(void)
> >> >> +{
> >> >> +     exynos_mcpm_power_down(0);
> >> >> +}
> >> >
> >> > [...]
> >> >
> >> >> +static int notrace exynos_mcpm_cpu_suspend(unsigned long arg)
> >> >> +{
> >> >> +     /* MCPM works with HW CPU identifiers */
> >> >> +     unsigned int mpidr = read_cpuid_mpidr();
> >> >> +     unsigned int cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
> >> >> +     unsigned int cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
> >> >> +
> >> >> +     __raw_writel(0x0, sysram_base_addr + EXYNOS5420_CPU_STATE);
> >> >> +
> >> >> +     mcpm_set_entry_vector(cpu, cluster, exynos_cpu_resume);
> >> >> +
> >> >> +     /*
> >> >> +      * Pass S5P_CHECK_SLEEP flag to the MCPM back-end to indicate that
> >> >> +      * we are suspending the system and need to skip CPU0 power down.
> >> >> +      */
> >> >> +     mcpm_cpu_suspend(S5P_CHECK_SLEEP);
> >> >
> >> > NAK.
> >> >
> >> > When I say "local flag with local meaning", I don't want you to smuggle
> >> > that flag through a public interface either.  I find it rather inelegant
> >> > to have the notion of special handling for CPU0 being spread around like
> >> > that.
> >> >
> >> > If CPU0 is special, then it should be dealth with in one place only,
> >> > ideally in the MCPM backend, so the rest of the kernel doesn't have to
> >> > care.
> >> >
> >> > Again, here's what I mean:
> >> >
> >> > static void exynos_mcpm_down_handler(int flags)
> >> > {
> >> >         [...]
> >> >         if ((cpunr != 0) || !(flags & SKIP_CPU_POWERDOWN_IF_CPU0))
> >> >                 exynos_cpu_power_down(cpunr);
> >> >         [...]
> >> > }
> >> >
> >> > static void exynos_mcpm_power_down()
> >> > {
> >> >         exynos_mcpm_down_handler(0);
> >> > }
> >> >
> >> > static void exynos_mcpm_suspend(u64 residency)
> >> > {
> >> >         /*
> >> >          * Theresidency argument is ignored for now.
> >> >          * However, in the CPU suspend case, we bypass power down for
> >> >          * CPU0 as this is being taken care by the SYS_PWR_CFG bit in
> >> >          * CORE0_SYS_PWR_REG.
> >> >          */
> >> >         exynos_mcpm_down_handler(SKIP_CPU_POWERDOWN_IF_CPU0);
> >> > }
> >> >
> >> > And SKIP_CPU_POWERDOWN_IF_CPU0 is defined in and visible to
> >> > mcpm-exynos.c only.
> >> Sorry if I am being dense, but the exynos_mcpm_suspend function would
> >> get called from both the bL cpuidle driver as well the exynos pm code.
> >
> > What is that exynos pm code actually doing?
> exynos pm code is shared across Exynos4 and 5 SoCs. It primarily does
> a series of register configurations which are required to put the
> system to sleep (some parts of these are SoC specific and others
> common). It also populates the suspend_ops for exynos. In the current
> patch, exynos_suspend_enter() is where I have plugged in the
> mcpm_cpu_suspend call.
> 
> This patch is based on the S2R series for 5420
> (http://comments.gmane.org/gmane.linux.kernel.samsung-soc/33898), you
> may also have a look at that for a clearer picture.
> >
> >> We want to skip CPU0 only in case of the S2R case i.e. the pm code and
> >> keep the CPU0 power down code for all other cases including CPUIdle.
> >
> > OK.  If so I missed that somehow (hint hint).
> >
> >> If I call exynos_mcpm_down_handler with the flag in
> >> exynos_mcpm_suspend(), CPUIdle will also skip CPU0 isn't it ?
> >
> > As it is, yes.  You could logically use an infinite residency time
> > (something like U64_MAX) to distinguish S2RAM from other types of
> > suspend.
> OK, I will use this rather than the S5P_CHECK_SLEEP macro.

Another suggestion which might possibly be better: why not looking for 
the SYS_PWR_CFG bit in exynos_cpu_power_down() directly?  After all, 
exynos_cpu_power_down() is semantically supposed to do what its name 
suggest and could simply do nothing if the proper conditions are already 
in place.


Nicolas

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2014-07-03 19:00         ` Nicolas Pitre
@ 2014-07-03 20:00           ` Abhilash Kesavan
  -1 siblings, 0 replies; 129+ messages in thread
From: Abhilash Kesavan @ 2014-07-03 20:00 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: linux-samsung-soc, linux-arm-kernel, Kukjin Kim,
	Lorenzo Pieralisi, Andrew Bresticker, Douglas Anderson

Hi Nicolas,

On Fri, Jul 4, 2014 at 12:30 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
>
>> Hi Nicolas,
>>
>> On Thu, Jul 3, 2014 at 9:15 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> > On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
>> >
>> >> On Thu, Jul 3, 2014 at 6:59 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> >> > Please, let's avoid going that route.  There is no such special handling
>> >> > needed if the API is sufficient.  And the provided API allows you to
>> >> > suspend a CPU or shut it down.  It shouldn't matter at that level if
>> >> > this is due to a cluster switch or a hotplug event. Do you really need
>> >> > something else?
>> >> No, just one local flag for suspend should be enough for me. Will remove these.
>> >
>> > [...]
>> >
>> >> Changes in v5:
>> >>       - Removed the MCPM flags and just used a local flag to
>> >>       indicate that we are suspending.
>> >
>> > [...]
>> >
>> >> -static void exynos_power_down(void)
>> >> +static void exynos_mcpm_power_down(u64 residency)
>> >>  {
>> >>       unsigned int mpidr, cpu, cluster;
>> >>       bool last_man = false, skip_wfi = false;
>> >> @@ -150,7 +153,12 @@ static void exynos_power_down(void)
>> >>       BUG_ON(__mcpm_cluster_state(cluster) != CLUSTER_UP);
>> >>       cpu_use_count[cpu][cluster]--;
>> >>       if (cpu_use_count[cpu][cluster] == 0) {
>> >> -             exynos_cpu_power_down(cpunr);
>> >> +             /*
>> >> +              * Bypass power down for CPU0 during suspend. This is being
>> >> +              * taken care by the SYS_PWR_CFG bit in CORE0_SYS_PWR_REG.
>> >> +              */
>> >> +             if ((cpunr != 0) || (residency != S5P_CHECK_SLEEP))
>> >> +                     exynos_cpu_power_down(cpunr);
>> >>
>> >>               if (exynos_cluster_unused(cluster)) {
>> >>                       exynos_cluster_power_down(cluster);
>> >> @@ -209,6 +217,11 @@ static void exynos_power_down(void)
>> >>       /* Not dead at this point?  Let our caller cope. */
>> >>  }
>> >>
>> >> +static void exynos_power_down(void)
>> >> +{
>> >> +     exynos_mcpm_power_down(0);
>> >> +}
>> >
>> > [...]
>> >
>> >> +static int notrace exynos_mcpm_cpu_suspend(unsigned long arg)
>> >> +{
>> >> +     /* MCPM works with HW CPU identifiers */
>> >> +     unsigned int mpidr = read_cpuid_mpidr();
>> >> +     unsigned int cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
>> >> +     unsigned int cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
>> >> +
>> >> +     __raw_writel(0x0, sysram_base_addr + EXYNOS5420_CPU_STATE);
>> >> +
>> >> +     mcpm_set_entry_vector(cpu, cluster, exynos_cpu_resume);
>> >> +
>> >> +     /*
>> >> +      * Pass S5P_CHECK_SLEEP flag to the MCPM back-end to indicate that
>> >> +      * we are suspending the system and need to skip CPU0 power down.
>> >> +      */
>> >> +     mcpm_cpu_suspend(S5P_CHECK_SLEEP);
>> >
>> > NAK.
>> >
>> > When I say "local flag with local meaning", I don't want you to smuggle
>> > that flag through a public interface either.  I find it rather inelegant
>> > to have the notion of special handling for CPU0 being spread around like
>> > that.
>> >
>> > If CPU0 is special, then it should be dealth with in one place only,
>> > ideally in the MCPM backend, so the rest of the kernel doesn't have to
>> > care.
>> >
>> > Again, here's what I mean:
>> >
>> > static void exynos_mcpm_down_handler(int flags)
>> > {
>> >         [...]
>> >         if ((cpunr != 0) || !(flags & SKIP_CPU_POWERDOWN_IF_CPU0))
>> >                 exynos_cpu_power_down(cpunr);
>> >         [...]
>> > }
>> >
>> > static void exynos_mcpm_power_down()
>> > {
>> >         exynos_mcpm_down_handler(0);
>> > }
>> >
>> > static void exynos_mcpm_suspend(u64 residency)
>> > {
>> >         /*
>> >          * Theresidency argument is ignored for now.
>> >          * However, in the CPU suspend case, we bypass power down for
>> >          * CPU0 as this is being taken care by the SYS_PWR_CFG bit in
>> >          * CORE0_SYS_PWR_REG.
>> >          */
>> >         exynos_mcpm_down_handler(SKIP_CPU_POWERDOWN_IF_CPU0);
>> > }
>> >
>> > And SKIP_CPU_POWERDOWN_IF_CPU0 is defined in and visible to
>> > mcpm-exynos.c only.
>> Sorry if I am being dense, but the exynos_mcpm_suspend function would
>> get called from both the bL cpuidle driver as well the exynos pm code.
>
> What is that exynos pm code actually doing?
exynos pm code is shared across Exynos4 and 5 SoCs. It primarily does
a series of register configurations which are required to put the
system to sleep (some parts of these are SoC specific and others
common). It also populates the suspend_ops for exynos. In the current
patch, exynos_suspend_enter() is where I have plugged in the
mcpm_cpu_suspend call.

This patch is based on the S2R series for 5420
(http://comments.gmane.org/gmane.linux.kernel.samsung-soc/33898), you
may also have a look at that for a clearer picture.
>
>> We want to skip CPU0 only in case of the S2R case i.e. the pm code and
>> keep the CPU0 power down code for all other cases including CPUIdle.
>
> OK.  If so I missed that somehow (hint hint).
>
>> If I call exynos_mcpm_down_handler with the flag in
>> exynos_mcpm_suspend(), CPUIdle will also skip CPU0 isn't it ?
>
> As it is, yes.  You could logically use an infinite residency time
> (something like U64_MAX) to distinguish S2RAM from other types of
> suspend.
OK, I will use this rather than the S5P_CHECK_SLEEP macro.
>
> Yet, why is this SYS_PWR_CFG bit set outside of MCPM?  Couldn't the MCPM
> backend handle it directly instead of expecting some other entity to do
> it?
Low power modes such as Sleep, Low Power Audio, AFTR (ARM Off Top
Running) require a series of register configurations as specified by
the UM to enter/exit them. All the exynos SoCs including 5420, do such
configurations (including sys_pwr_reg setup) as part of the
exynos_pm_prepare function in pm.c and so we just need to skip the cpu
power down.

Regards,
Abhilash
>
>
> Nicolas

^ permalink raw reply	[flat|nested] 129+ messages in thread

* several messages
@ 2014-07-03 20:00           ` Abhilash Kesavan
  0 siblings, 0 replies; 129+ messages in thread
From: Abhilash Kesavan @ 2014-07-03 20:00 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Nicolas,

On Fri, Jul 4, 2014 at 12:30 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
>
>> Hi Nicolas,
>>
>> On Thu, Jul 3, 2014 at 9:15 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> > On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
>> >
>> >> On Thu, Jul 3, 2014 at 6:59 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> >> > Please, let's avoid going that route.  There is no such special handling
>> >> > needed if the API is sufficient.  And the provided API allows you to
>> >> > suspend a CPU or shut it down.  It shouldn't matter at that level if
>> >> > this is due to a cluster switch or a hotplug event. Do you really need
>> >> > something else?
>> >> No, just one local flag for suspend should be enough for me. Will remove these.
>> >
>> > [...]
>> >
>> >> Changes in v5:
>> >>       - Removed the MCPM flags and just used a local flag to
>> >>       indicate that we are suspending.
>> >
>> > [...]
>> >
>> >> -static void exynos_power_down(void)
>> >> +static void exynos_mcpm_power_down(u64 residency)
>> >>  {
>> >>       unsigned int mpidr, cpu, cluster;
>> >>       bool last_man = false, skip_wfi = false;
>> >> @@ -150,7 +153,12 @@ static void exynos_power_down(void)
>> >>       BUG_ON(__mcpm_cluster_state(cluster) != CLUSTER_UP);
>> >>       cpu_use_count[cpu][cluster]--;
>> >>       if (cpu_use_count[cpu][cluster] == 0) {
>> >> -             exynos_cpu_power_down(cpunr);
>> >> +             /*
>> >> +              * Bypass power down for CPU0 during suspend. This is being
>> >> +              * taken care by the SYS_PWR_CFG bit in CORE0_SYS_PWR_REG.
>> >> +              */
>> >> +             if ((cpunr != 0) || (residency != S5P_CHECK_SLEEP))
>> >> +                     exynos_cpu_power_down(cpunr);
>> >>
>> >>               if (exynos_cluster_unused(cluster)) {
>> >>                       exynos_cluster_power_down(cluster);
>> >> @@ -209,6 +217,11 @@ static void exynos_power_down(void)
>> >>       /* Not dead at this point?  Let our caller cope. */
>> >>  }
>> >>
>> >> +static void exynos_power_down(void)
>> >> +{
>> >> +     exynos_mcpm_power_down(0);
>> >> +}
>> >
>> > [...]
>> >
>> >> +static int notrace exynos_mcpm_cpu_suspend(unsigned long arg)
>> >> +{
>> >> +     /* MCPM works with HW CPU identifiers */
>> >> +     unsigned int mpidr = read_cpuid_mpidr();
>> >> +     unsigned int cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
>> >> +     unsigned int cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
>> >> +
>> >> +     __raw_writel(0x0, sysram_base_addr + EXYNOS5420_CPU_STATE);
>> >> +
>> >> +     mcpm_set_entry_vector(cpu, cluster, exynos_cpu_resume);
>> >> +
>> >> +     /*
>> >> +      * Pass S5P_CHECK_SLEEP flag to the MCPM back-end to indicate that
>> >> +      * we are suspending the system and need to skip CPU0 power down.
>> >> +      */
>> >> +     mcpm_cpu_suspend(S5P_CHECK_SLEEP);
>> >
>> > NAK.
>> >
>> > When I say "local flag with local meaning", I don't want you to smuggle
>> > that flag through a public interface either.  I find it rather inelegant
>> > to have the notion of special handling for CPU0 being spread around like
>> > that.
>> >
>> > If CPU0 is special, then it should be dealth with in one place only,
>> > ideally in the MCPM backend, so the rest of the kernel doesn't have to
>> > care.
>> >
>> > Again, here's what I mean:
>> >
>> > static void exynos_mcpm_down_handler(int flags)
>> > {
>> >         [...]
>> >         if ((cpunr != 0) || !(flags & SKIP_CPU_POWERDOWN_IF_CPU0))
>> >                 exynos_cpu_power_down(cpunr);
>> >         [...]
>> > }
>> >
>> > static void exynos_mcpm_power_down()
>> > {
>> >         exynos_mcpm_down_handler(0);
>> > }
>> >
>> > static void exynos_mcpm_suspend(u64 residency)
>> > {
>> >         /*
>> >          * Theresidency argument is ignored for now.
>> >          * However, in the CPU suspend case, we bypass power down for
>> >          * CPU0 as this is being taken care by the SYS_PWR_CFG bit in
>> >          * CORE0_SYS_PWR_REG.
>> >          */
>> >         exynos_mcpm_down_handler(SKIP_CPU_POWERDOWN_IF_CPU0);
>> > }
>> >
>> > And SKIP_CPU_POWERDOWN_IF_CPU0 is defined in and visible to
>> > mcpm-exynos.c only.
>> Sorry if I am being dense, but the exynos_mcpm_suspend function would
>> get called from both the bL cpuidle driver as well the exynos pm code.
>
> What is that exynos pm code actually doing?
exynos pm code is shared across Exynos4 and 5 SoCs. It primarily does
a series of register configurations which are required to put the
system to sleep (some parts of these are SoC specific and others
common). It also populates the suspend_ops for exynos. In the current
patch, exynos_suspend_enter() is where I have plugged in the
mcpm_cpu_suspend call.

This patch is based on the S2R series for 5420
(http://comments.gmane.org/gmane.linux.kernel.samsung-soc/33898), you
may also have a look at that for a clearer picture.
>
>> We want to skip CPU0 only in case of the S2R case i.e. the pm code and
>> keep the CPU0 power down code for all other cases including CPUIdle.
>
> OK.  If so I missed that somehow (hint hint).
>
>> If I call exynos_mcpm_down_handler with the flag in
>> exynos_mcpm_suspend(), CPUIdle will also skip CPU0 isn't it ?
>
> As it is, yes.  You could logically use an infinite residency time
> (something like U64_MAX) to distinguish S2RAM from other types of
> suspend.
OK, I will use this rather than the S5P_CHECK_SLEEP macro.
>
> Yet, why is this SYS_PWR_CFG bit set outside of MCPM?  Couldn't the MCPM
> backend handle it directly instead of expecting some other entity to do
> it?
Low power modes such as Sleep, Low Power Audio, AFTR (ARM Off Top
Running) require a series of register configurations as specified by
the UM to enter/exit them. All the exynos SoCs including 5420, do such
configurations (including sys_pwr_reg setup) as part of the
exynos_pm_prepare function in pm.c and so we just need to skip the cpu
power down.

Regards,
Abhilash
>
>
> Nicolas

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2014-07-03 16:19       ` Abhilash Kesavan
@ 2014-07-03 19:00         ` Nicolas Pitre
  -1 siblings, 0 replies; 129+ messages in thread
From: Nicolas Pitre @ 2014-07-03 19:00 UTC (permalink / raw)
  To: Abhilash Kesavan
  Cc: linux-samsung-soc, linux-arm-kernel, Kukjin Kim,
	Lorenzo Pieralisi, Andrew Bresticker, Douglas Anderson

On Thu, 3 Jul 2014, Abhilash Kesavan wrote:

> Hi Nicolas,
> 
> On Thu, Jul 3, 2014 at 9:15 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
> >
> >> On Thu, Jul 3, 2014 at 6:59 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> >> > Please, let's avoid going that route.  There is no such special handling
> >> > needed if the API is sufficient.  And the provided API allows you to
> >> > suspend a CPU or shut it down.  It shouldn't matter at that level if
> >> > this is due to a cluster switch or a hotplug event. Do you really need
> >> > something else?
> >> No, just one local flag for suspend should be enough for me. Will remove these.
> >
> > [...]
> >
> >> Changes in v5:
> >>       - Removed the MCPM flags and just used a local flag to
> >>       indicate that we are suspending.
> >
> > [...]
> >
> >> -static void exynos_power_down(void)
> >> +static void exynos_mcpm_power_down(u64 residency)
> >>  {
> >>       unsigned int mpidr, cpu, cluster;
> >>       bool last_man = false, skip_wfi = false;
> >> @@ -150,7 +153,12 @@ static void exynos_power_down(void)
> >>       BUG_ON(__mcpm_cluster_state(cluster) != CLUSTER_UP);
> >>       cpu_use_count[cpu][cluster]--;
> >>       if (cpu_use_count[cpu][cluster] == 0) {
> >> -             exynos_cpu_power_down(cpunr);
> >> +             /*
> >> +              * Bypass power down for CPU0 during suspend. This is being
> >> +              * taken care by the SYS_PWR_CFG bit in CORE0_SYS_PWR_REG.
> >> +              */
> >> +             if ((cpunr != 0) || (residency != S5P_CHECK_SLEEP))
> >> +                     exynos_cpu_power_down(cpunr);
> >>
> >>               if (exynos_cluster_unused(cluster)) {
> >>                       exynos_cluster_power_down(cluster);
> >> @@ -209,6 +217,11 @@ static void exynos_power_down(void)
> >>       /* Not dead at this point?  Let our caller cope. */
> >>  }
> >>
> >> +static void exynos_power_down(void)
> >> +{
> >> +     exynos_mcpm_power_down(0);
> >> +}
> >
> > [...]
> >
> >> +static int notrace exynos_mcpm_cpu_suspend(unsigned long arg)
> >> +{
> >> +     /* MCPM works with HW CPU identifiers */
> >> +     unsigned int mpidr = read_cpuid_mpidr();
> >> +     unsigned int cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
> >> +     unsigned int cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
> >> +
> >> +     __raw_writel(0x0, sysram_base_addr + EXYNOS5420_CPU_STATE);
> >> +
> >> +     mcpm_set_entry_vector(cpu, cluster, exynos_cpu_resume);
> >> +
> >> +     /*
> >> +      * Pass S5P_CHECK_SLEEP flag to the MCPM back-end to indicate that
> >> +      * we are suspending the system and need to skip CPU0 power down.
> >> +      */
> >> +     mcpm_cpu_suspend(S5P_CHECK_SLEEP);
> >
> > NAK.
> >
> > When I say "local flag with local meaning", I don't want you to smuggle
> > that flag through a public interface either.  I find it rather inelegant
> > to have the notion of special handling for CPU0 being spread around like
> > that.
> >
> > If CPU0 is special, then it should be dealth with in one place only,
> > ideally in the MCPM backend, so the rest of the kernel doesn't have to
> > care.
> >
> > Again, here's what I mean:
> >
> > static void exynos_mcpm_down_handler(int flags)
> > {
> >         [...]
> >         if ((cpunr != 0) || !(flags & SKIP_CPU_POWERDOWN_IF_CPU0))
> >                 exynos_cpu_power_down(cpunr);
> >         [...]
> > }
> >
> > static void exynos_mcpm_power_down()
> > {
> >         exynos_mcpm_down_handler(0);
> > }
> >
> > static void exynos_mcpm_suspend(u64 residency)
> > {
> >         /*
> >          * Theresidency argument is ignored for now.
> >          * However, in the CPU suspend case, we bypass power down for
> >          * CPU0 as this is being taken care by the SYS_PWR_CFG bit in
> >          * CORE0_SYS_PWR_REG.
> >          */
> >         exynos_mcpm_down_handler(SKIP_CPU_POWERDOWN_IF_CPU0);
> > }
> >
> > And SKIP_CPU_POWERDOWN_IF_CPU0 is defined in and visible to
> > mcpm-exynos.c only.
> Sorry if I am being dense, but the exynos_mcpm_suspend function would
> get called from both the bL cpuidle driver as well the exynos pm code.

What is that exynos pm code actually doing?

> We want to skip CPU0 only in case of the S2R case i.e. the pm code and
> keep the CPU0 power down code for all other cases including CPUIdle.

OK.  If so I missed that somehow (hint hint).

> If I call exynos_mcpm_down_handler with the flag in
> exynos_mcpm_suspend(), CPUIdle will also skip CPU0 isn't it ?

As it is, yes.  You could logically use an infinite residency time 
(something like U64_MAX) to distinguish S2RAM from other types of 
suspend.

Yet, why is this SYS_PWR_CFG bit set outside of MCPM?  Couldn't the MCPM 
backend handle it directly instead of expecting some other entity to do 
it?


Nicolas

^ permalink raw reply	[flat|nested] 129+ messages in thread

* several messages
@ 2014-07-03 19:00         ` Nicolas Pitre
  0 siblings, 0 replies; 129+ messages in thread
From: Nicolas Pitre @ 2014-07-03 19:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 3 Jul 2014, Abhilash Kesavan wrote:

> Hi Nicolas,
> 
> On Thu, Jul 3, 2014 at 9:15 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
> >
> >> On Thu, Jul 3, 2014 at 6:59 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> >> > Please, let's avoid going that route.  There is no such special handling
> >> > needed if the API is sufficient.  And the provided API allows you to
> >> > suspend a CPU or shut it down.  It shouldn't matter at that level if
> >> > this is due to a cluster switch or a hotplug event. Do you really need
> >> > something else?
> >> No, just one local flag for suspend should be enough for me. Will remove these.
> >
> > [...]
> >
> >> Changes in v5:
> >>       - Removed the MCPM flags and just used a local flag to
> >>       indicate that we are suspending.
> >
> > [...]
> >
> >> -static void exynos_power_down(void)
> >> +static void exynos_mcpm_power_down(u64 residency)
> >>  {
> >>       unsigned int mpidr, cpu, cluster;
> >>       bool last_man = false, skip_wfi = false;
> >> @@ -150,7 +153,12 @@ static void exynos_power_down(void)
> >>       BUG_ON(__mcpm_cluster_state(cluster) != CLUSTER_UP);
> >>       cpu_use_count[cpu][cluster]--;
> >>       if (cpu_use_count[cpu][cluster] == 0) {
> >> -             exynos_cpu_power_down(cpunr);
> >> +             /*
> >> +              * Bypass power down for CPU0 during suspend. This is being
> >> +              * taken care by the SYS_PWR_CFG bit in CORE0_SYS_PWR_REG.
> >> +              */
> >> +             if ((cpunr != 0) || (residency != S5P_CHECK_SLEEP))
> >> +                     exynos_cpu_power_down(cpunr);
> >>
> >>               if (exynos_cluster_unused(cluster)) {
> >>                       exynos_cluster_power_down(cluster);
> >> @@ -209,6 +217,11 @@ static void exynos_power_down(void)
> >>       /* Not dead at this point?  Let our caller cope. */
> >>  }
> >>
> >> +static void exynos_power_down(void)
> >> +{
> >> +     exynos_mcpm_power_down(0);
> >> +}
> >
> > [...]
> >
> >> +static int notrace exynos_mcpm_cpu_suspend(unsigned long arg)
> >> +{
> >> +     /* MCPM works with HW CPU identifiers */
> >> +     unsigned int mpidr = read_cpuid_mpidr();
> >> +     unsigned int cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
> >> +     unsigned int cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
> >> +
> >> +     __raw_writel(0x0, sysram_base_addr + EXYNOS5420_CPU_STATE);
> >> +
> >> +     mcpm_set_entry_vector(cpu, cluster, exynos_cpu_resume);
> >> +
> >> +     /*
> >> +      * Pass S5P_CHECK_SLEEP flag to the MCPM back-end to indicate that
> >> +      * we are suspending the system and need to skip CPU0 power down.
> >> +      */
> >> +     mcpm_cpu_suspend(S5P_CHECK_SLEEP);
> >
> > NAK.
> >
> > When I say "local flag with local meaning", I don't want you to smuggle
> > that flag through a public interface either.  I find it rather inelegant
> > to have the notion of special handling for CPU0 being spread around like
> > that.
> >
> > If CPU0 is special, then it should be dealth with in one place only,
> > ideally in the MCPM backend, so the rest of the kernel doesn't have to
> > care.
> >
> > Again, here's what I mean:
> >
> > static void exynos_mcpm_down_handler(int flags)
> > {
> >         [...]
> >         if ((cpunr != 0) || !(flags & SKIP_CPU_POWERDOWN_IF_CPU0))
> >                 exynos_cpu_power_down(cpunr);
> >         [...]
> > }
> >
> > static void exynos_mcpm_power_down()
> > {
> >         exynos_mcpm_down_handler(0);
> > }
> >
> > static void exynos_mcpm_suspend(u64 residency)
> > {
> >         /*
> >          * Theresidency argument is ignored for now.
> >          * However, in the CPU suspend case, we bypass power down for
> >          * CPU0 as this is being taken care by the SYS_PWR_CFG bit in
> >          * CORE0_SYS_PWR_REG.
> >          */
> >         exynos_mcpm_down_handler(SKIP_CPU_POWERDOWN_IF_CPU0);
> > }
> >
> > And SKIP_CPU_POWERDOWN_IF_CPU0 is defined in and visible to
> > mcpm-exynos.c only.
> Sorry if I am being dense, but the exynos_mcpm_suspend function would
> get called from both the bL cpuidle driver as well the exynos pm code.

What is that exynos pm code actually doing?

> We want to skip CPU0 only in case of the S2R case i.e. the pm code and
> keep the CPU0 power down code for all other cases including CPUIdle.

OK.  If so I missed that somehow (hint hint).

> If I call exynos_mcpm_down_handler with the flag in
> exynos_mcpm_suspend(), CPUIdle will also skip CPU0 isn't it ?

As it is, yes.  You could logically use an infinite residency time 
(something like U64_MAX) to distinguish S2RAM from other types of 
suspend.

Yet, why is this SYS_PWR_CFG bit set outside of MCPM?  Couldn't the MCPM 
backend handle it directly instead of expecting some other entity to do 
it?


Nicolas

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2014-07-03 15:45     ` Nicolas Pitre
@ 2014-07-03 16:19       ` Abhilash Kesavan
  -1 siblings, 0 replies; 129+ messages in thread
From: Abhilash Kesavan @ 2014-07-03 16:19 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: linux-samsung-soc, linux-arm-kernel, Kukjin Kim,
	Lorenzo Pieralisi, Andrew Bresticker, Douglas Anderson

Hi Nicolas,

On Thu, Jul 3, 2014 at 9:15 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
>
>> On Thu, Jul 3, 2014 at 6:59 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> > Please, let's avoid going that route.  There is no such special handling
>> > needed if the API is sufficient.  And the provided API allows you to
>> > suspend a CPU or shut it down.  It shouldn't matter at that level if
>> > this is due to a cluster switch or a hotplug event. Do you really need
>> > something else?
>> No, just one local flag for suspend should be enough for me. Will remove these.
>
> [...]
>
>> Changes in v5:
>>       - Removed the MCPM flags and just used a local flag to
>>       indicate that we are suspending.
>
> [...]
>
>> -static void exynos_power_down(void)
>> +static void exynos_mcpm_power_down(u64 residency)
>>  {
>>       unsigned int mpidr, cpu, cluster;
>>       bool last_man = false, skip_wfi = false;
>> @@ -150,7 +153,12 @@ static void exynos_power_down(void)
>>       BUG_ON(__mcpm_cluster_state(cluster) != CLUSTER_UP);
>>       cpu_use_count[cpu][cluster]--;
>>       if (cpu_use_count[cpu][cluster] == 0) {
>> -             exynos_cpu_power_down(cpunr);
>> +             /*
>> +              * Bypass power down for CPU0 during suspend. This is being
>> +              * taken care by the SYS_PWR_CFG bit in CORE0_SYS_PWR_REG.
>> +              */
>> +             if ((cpunr != 0) || (residency != S5P_CHECK_SLEEP))
>> +                     exynos_cpu_power_down(cpunr);
>>
>>               if (exynos_cluster_unused(cluster)) {
>>                       exynos_cluster_power_down(cluster);
>> @@ -209,6 +217,11 @@ static void exynos_power_down(void)
>>       /* Not dead at this point?  Let our caller cope. */
>>  }
>>
>> +static void exynos_power_down(void)
>> +{
>> +     exynos_mcpm_power_down(0);
>> +}
>
> [...]
>
>> +static int notrace exynos_mcpm_cpu_suspend(unsigned long arg)
>> +{
>> +     /* MCPM works with HW CPU identifiers */
>> +     unsigned int mpidr = read_cpuid_mpidr();
>> +     unsigned int cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
>> +     unsigned int cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
>> +
>> +     __raw_writel(0x0, sysram_base_addr + EXYNOS5420_CPU_STATE);
>> +
>> +     mcpm_set_entry_vector(cpu, cluster, exynos_cpu_resume);
>> +
>> +     /*
>> +      * Pass S5P_CHECK_SLEEP flag to the MCPM back-end to indicate that
>> +      * we are suspending the system and need to skip CPU0 power down.
>> +      */
>> +     mcpm_cpu_suspend(S5P_CHECK_SLEEP);
>
> NAK.
>
> When I say "local flag with local meaning", I don't want you to smuggle
> that flag through a public interface either.  I find it rather inelegant
> to have the notion of special handling for CPU0 being spread around like
> that.
>
> If CPU0 is special, then it should be dealth with in one place only,
> ideally in the MCPM backend, so the rest of the kernel doesn't have to
> care.
>
> Again, here's what I mean:
>
> static void exynos_mcpm_down_handler(int flags)
> {
>         [...]
>         if ((cpunr != 0) || !(flags & SKIP_CPU_POWERDOWN_IF_CPU0))
>                 exynos_cpu_power_down(cpunr);
>         [...]
> }
>
> static void exynos_mcpm_power_down()
> {
>         exynos_mcpm_down_handler(0);
> }
>
> static void exynos_mcpm_suspend(u64 residency)
> {
>         /*
>          * Theresidency argument is ignored for now.
>          * However, in the CPU suspend case, we bypass power down for
>          * CPU0 as this is being taken care by the SYS_PWR_CFG bit in
>          * CORE0_SYS_PWR_REG.
>          */
>         exynos_mcpm_down_handler(SKIP_CPU_POWERDOWN_IF_CPU0);
> }
>
> And SKIP_CPU_POWERDOWN_IF_CPU0 is defined in and visible to
> mcpm-exynos.c only.
Sorry if I am being dense, but the exynos_mcpm_suspend function would
get called from both the bL cpuidle driver as well the exynos pm code.
We want to skip CPU0 only in case of the S2R case i.e. the pm code and
keep the CPU0 power down code for all other cases including CPUIdle.
If I call exynos_mcpm_down_handler with the flag in
exynos_mcpm_suspend(), CPUIdle will also skip CPU0 isn't it ?

Regards,
Abhilash
>
>
> Nicolas

^ permalink raw reply	[flat|nested] 129+ messages in thread

* several messages
@ 2014-07-03 16:19       ` Abhilash Kesavan
  0 siblings, 0 replies; 129+ messages in thread
From: Abhilash Kesavan @ 2014-07-03 16:19 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Nicolas,

On Thu, Jul 3, 2014 at 9:15 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Thu, 3 Jul 2014, Abhilash Kesavan wrote:
>
>> On Thu, Jul 3, 2014 at 6:59 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>> > Please, let's avoid going that route.  There is no such special handling
>> > needed if the API is sufficient.  And the provided API allows you to
>> > suspend a CPU or shut it down.  It shouldn't matter at that level if
>> > this is due to a cluster switch or a hotplug event. Do you really need
>> > something else?
>> No, just one local flag for suspend should be enough for me. Will remove these.
>
> [...]
>
>> Changes in v5:
>>       - Removed the MCPM flags and just used a local flag to
>>       indicate that we are suspending.
>
> [...]
>
>> -static void exynos_power_down(void)
>> +static void exynos_mcpm_power_down(u64 residency)
>>  {
>>       unsigned int mpidr, cpu, cluster;
>>       bool last_man = false, skip_wfi = false;
>> @@ -150,7 +153,12 @@ static void exynos_power_down(void)
>>       BUG_ON(__mcpm_cluster_state(cluster) != CLUSTER_UP);
>>       cpu_use_count[cpu][cluster]--;
>>       if (cpu_use_count[cpu][cluster] == 0) {
>> -             exynos_cpu_power_down(cpunr);
>> +             /*
>> +              * Bypass power down for CPU0 during suspend. This is being
>> +              * taken care by the SYS_PWR_CFG bit in CORE0_SYS_PWR_REG.
>> +              */
>> +             if ((cpunr != 0) || (residency != S5P_CHECK_SLEEP))
>> +                     exynos_cpu_power_down(cpunr);
>>
>>               if (exynos_cluster_unused(cluster)) {
>>                       exynos_cluster_power_down(cluster);
>> @@ -209,6 +217,11 @@ static void exynos_power_down(void)
>>       /* Not dead at this point?  Let our caller cope. */
>>  }
>>
>> +static void exynos_power_down(void)
>> +{
>> +     exynos_mcpm_power_down(0);
>> +}
>
> [...]
>
>> +static int notrace exynos_mcpm_cpu_suspend(unsigned long arg)
>> +{
>> +     /* MCPM works with HW CPU identifiers */
>> +     unsigned int mpidr = read_cpuid_mpidr();
>> +     unsigned int cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
>> +     unsigned int cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
>> +
>> +     __raw_writel(0x0, sysram_base_addr + EXYNOS5420_CPU_STATE);
>> +
>> +     mcpm_set_entry_vector(cpu, cluster, exynos_cpu_resume);
>> +
>> +     /*
>> +      * Pass S5P_CHECK_SLEEP flag to the MCPM back-end to indicate that
>> +      * we are suspending the system and need to skip CPU0 power down.
>> +      */
>> +     mcpm_cpu_suspend(S5P_CHECK_SLEEP);
>
> NAK.
>
> When I say "local flag with local meaning", I don't want you to smuggle
> that flag through a public interface either.  I find it rather inelegant
> to have the notion of special handling for CPU0 being spread around like
> that.
>
> If CPU0 is special, then it should be dealth with in one place only,
> ideally in the MCPM backend, so the rest of the kernel doesn't have to
> care.
>
> Again, here's what I mean:
>
> static void exynos_mcpm_down_handler(int flags)
> {
>         [...]
>         if ((cpunr != 0) || !(flags & SKIP_CPU_POWERDOWN_IF_CPU0))
>                 exynos_cpu_power_down(cpunr);
>         [...]
> }
>
> static void exynos_mcpm_power_down()
> {
>         exynos_mcpm_down_handler(0);
> }
>
> static void exynos_mcpm_suspend(u64 residency)
> {
>         /*
>          * Theresidency argument is ignored for now.
>          * However, in the CPU suspend case, we bypass power down for
>          * CPU0 as this is being taken care by the SYS_PWR_CFG bit in
>          * CORE0_SYS_PWR_REG.
>          */
>         exynos_mcpm_down_handler(SKIP_CPU_POWERDOWN_IF_CPU0);
> }
>
> And SKIP_CPU_POWERDOWN_IF_CPU0 is defined in and visible to
> mcpm-exynos.c only.
Sorry if I am being dense, but the exynos_mcpm_suspend function would
get called from both the bL cpuidle driver as well the exynos pm code.
We want to skip CPU0 only in case of the S2R case i.e. the pm code and
keep the CPU0 power down code for all other cases including CPUIdle.
If I call exynos_mcpm_down_handler with the flag in
exynos_mcpm_suspend(), CPUIdle will also skip CPU0 isn't it ?

Regards,
Abhilash
>
>
> Nicolas

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2014-07-03 14:46 ` [PATCH v5] " Abhilash Kesavan
@ 2014-07-03 15:45     ` Nicolas Pitre
  0 siblings, 0 replies; 129+ messages in thread
From: Nicolas Pitre @ 2014-07-03 15:45 UTC (permalink / raw)
  To: Abhilash Kesavan, Abhilash Kesavan
  Cc: linux-samsung-soc, linux-arm-kernel, kgene.kim,
	Lorenzo Pieralisi, Andrew Bresticker, Douglas Anderson

On Thu, 3 Jul 2014, Abhilash Kesavan wrote:

> On Thu, Jul 3, 2014 at 6:59 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > Please, let's avoid going that route.  There is no such special handling
> > needed if the API is sufficient.  And the provided API allows you to
> > suspend a CPU or shut it down.  It shouldn't matter at that level if
> > this is due to a cluster switch or a hotplug event. Do you really need
> > something else?
> No, just one local flag for suspend should be enough for me. Will remove these.

[...]

> Changes in v5:
> 	- Removed the MCPM flags and just used a local flag to
> 	indicate that we are suspending.

[...]

> -static void exynos_power_down(void)
> +static void exynos_mcpm_power_down(u64 residency)
>  {
>  	unsigned int mpidr, cpu, cluster;
>  	bool last_man = false, skip_wfi = false;
> @@ -150,7 +153,12 @@ static void exynos_power_down(void)
>  	BUG_ON(__mcpm_cluster_state(cluster) != CLUSTER_UP);
>  	cpu_use_count[cpu][cluster]--;
>  	if (cpu_use_count[cpu][cluster] == 0) {
> -		exynos_cpu_power_down(cpunr);
> +		/*
> +		 * Bypass power down for CPU0 during suspend. This is being
> +		 * taken care by the SYS_PWR_CFG bit in CORE0_SYS_PWR_REG.
> +		 */
> +		if ((cpunr != 0) || (residency != S5P_CHECK_SLEEP))
> +			exynos_cpu_power_down(cpunr);
>  
>  		if (exynos_cluster_unused(cluster)) {
>  			exynos_cluster_power_down(cluster);
> @@ -209,6 +217,11 @@ static void exynos_power_down(void)
>  	/* Not dead at this point?  Let our caller cope. */
>  }
>  
> +static void exynos_power_down(void)
> +{
> +	exynos_mcpm_power_down(0);
> +}

[...]

> +static int notrace exynos_mcpm_cpu_suspend(unsigned long arg)
> +{
> +	/* MCPM works with HW CPU identifiers */
> +	unsigned int mpidr = read_cpuid_mpidr();
> +	unsigned int cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
> +	unsigned int cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
> +
> +	__raw_writel(0x0, sysram_base_addr + EXYNOS5420_CPU_STATE);
> +
> +	mcpm_set_entry_vector(cpu, cluster, exynos_cpu_resume);
> +
> +	/*
> +	 * Pass S5P_CHECK_SLEEP flag to the MCPM back-end to indicate that
> +	 * we are suspending the system and need to skip CPU0 power down.
> +	 */
> +	mcpm_cpu_suspend(S5P_CHECK_SLEEP);

NAK.

When I say "local flag with local meaning", I don't want you to smuggle 
that flag through a public interface either.  I find it rather inelegant 
to have the notion of special handling for CPU0 being spread around like 
that.

If CPU0 is special, then it should be dealth with in one place only, 
ideally in the MCPM backend, so the rest of the kernel doesn't have to 
care.

Again, here's what I mean:

static void exynos_mcpm_down_handler(int flags)
{
	[...]
	if ((cpunr != 0) || !(flags & SKIP_CPU_POWERDOWN_IF_CPU0))
		exynos_cpu_power_down(cpunr);
	[...]
}

static void exynos_mcpm_power_down()
{
	exynos_mcpm_down_handler(0);
}

static void exynos_mcpm_suspend(u64 residency)
{
	/*
	 * Theresidency argument is ignored for now.
	 * However, in the CPU suspend case, we bypass power down for 
	 * CPU0 as this is being taken care by the SYS_PWR_CFG bit in
	 * CORE0_SYS_PWR_REG.
	 */
	exynos_mcpm_down_handler(SKIP_CPU_POWERDOWN_IF_CPU0);
}

And SKIP_CPU_POWERDOWN_IF_CPU0 is defined in and visible to 
mcpm-exynos.c only.


Nicolas

^ permalink raw reply	[flat|nested] 129+ messages in thread

* several messages
@ 2014-07-03 15:45     ` Nicolas Pitre
  0 siblings, 0 replies; 129+ messages in thread
From: Nicolas Pitre @ 2014-07-03 15:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 3 Jul 2014, Abhilash Kesavan wrote:

> On Thu, Jul 3, 2014 at 6:59 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > Please, let's avoid going that route.  There is no such special handling
> > needed if the API is sufficient.  And the provided API allows you to
> > suspend a CPU or shut it down.  It shouldn't matter at that level if
> > this is due to a cluster switch or a hotplug event. Do you really need
> > something else?
> No, just one local flag for suspend should be enough for me. Will remove these.

[...]

> Changes in v5:
> 	- Removed the MCPM flags and just used a local flag to
> 	indicate that we are suspending.

[...]

> -static void exynos_power_down(void)
> +static void exynos_mcpm_power_down(u64 residency)
>  {
>  	unsigned int mpidr, cpu, cluster;
>  	bool last_man = false, skip_wfi = false;
> @@ -150,7 +153,12 @@ static void exynos_power_down(void)
>  	BUG_ON(__mcpm_cluster_state(cluster) != CLUSTER_UP);
>  	cpu_use_count[cpu][cluster]--;
>  	if (cpu_use_count[cpu][cluster] == 0) {
> -		exynos_cpu_power_down(cpunr);
> +		/*
> +		 * Bypass power down for CPU0 during suspend. This is being
> +		 * taken care by the SYS_PWR_CFG bit in CORE0_SYS_PWR_REG.
> +		 */
> +		if ((cpunr != 0) || (residency != S5P_CHECK_SLEEP))
> +			exynos_cpu_power_down(cpunr);
>  
>  		if (exynos_cluster_unused(cluster)) {
>  			exynos_cluster_power_down(cluster);
> @@ -209,6 +217,11 @@ static void exynos_power_down(void)
>  	/* Not dead at this point?  Let our caller cope. */
>  }
>  
> +static void exynos_power_down(void)
> +{
> +	exynos_mcpm_power_down(0);
> +}

[...]

> +static int notrace exynos_mcpm_cpu_suspend(unsigned long arg)
> +{
> +	/* MCPM works with HW CPU identifiers */
> +	unsigned int mpidr = read_cpuid_mpidr();
> +	unsigned int cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
> +	unsigned int cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
> +
> +	__raw_writel(0x0, sysram_base_addr + EXYNOS5420_CPU_STATE);
> +
> +	mcpm_set_entry_vector(cpu, cluster, exynos_cpu_resume);
> +
> +	/*
> +	 * Pass S5P_CHECK_SLEEP flag to the MCPM back-end to indicate that
> +	 * we are suspending the system and need to skip CPU0 power down.
> +	 */
> +	mcpm_cpu_suspend(S5P_CHECK_SLEEP);

NAK.

When I say "local flag with local meaning", I don't want you to smuggle 
that flag through a public interface either.  I find it rather inelegant 
to have the notion of special handling for CPU0 being spread around like 
that.

If CPU0 is special, then it should be dealth with in one place only, 
ideally in the MCPM backend, so the rest of the kernel doesn't have to 
care.

Again, here's what I mean:

static void exynos_mcpm_down_handler(int flags)
{
	[...]
	if ((cpunr != 0) || !(flags & SKIP_CPU_POWERDOWN_IF_CPU0))
		exynos_cpu_power_down(cpunr);
	[...]
}

static void exynos_mcpm_power_down()
{
	exynos_mcpm_down_handler(0);
}

static void exynos_mcpm_suspend(u64 residency)
{
	/*
	 * Theresidency argument is ignored for now.
	 * However, in the CPU suspend case, we bypass power down for 
	 * CPU0 as this is being taken care by the SYS_PWR_CFG bit in
	 * CORE0_SYS_PWR_REG.
	 */
	exynos_mcpm_down_handler(SKIP_CPU_POWERDOWN_IF_CPU0);
}

And SKIP_CPU_POWERDOWN_IF_CPU0 is defined in and visible to 
mcpm-exynos.c only.


Nicolas

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2010-07-15  9:39   ` Patrick McHardy
@ 2010-07-15 10:17     ` Jan Engelhardt
  0 siblings, 0 replies; 129+ messages in thread
From: Jan Engelhardt @ 2010-07-15 10:17 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Michael S. Tsirkin, David S. Miller, Randy Dunlap,
	netfilter-devel, netfilter, coreteam, linux-kernel, netdev, kvm,
	herbert

On Thursday 2010-07-15 11:36, Patrick McHardy wrote:
>> +struct xt_CHECKSUM_info {
>> +	u_int8_t operation;	/* bitset of operations */
>
>Please use __u8 in public header files.
>
>> +#include <linux/in.h>
>
>Do you really need in.h?
>
>> + * $Id$
>
>Please no CVS IDs.
>
>> +	switch (c) {
>> +	case 'F':
>> +		if (*flags)
>> +			xtables_error(PARAMETER_PROBLEM,
>> +			        "CHECKSUM target: Only use --checksum-fill ONCE!");
>
>There is a helper function called xtables_param_act for checking double
>arguments and printing a standarized error message.

I took care of these for Xt-a.


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2009-09-08 19:32 ` Giuliano Pochini
@ 2009-09-08 22:56   ` Mark Hills
  0 siblings, 0 replies; 129+ messages in thread
From: Mark Hills @ 2009-09-08 22:56 UTC (permalink / raw)
  To: Giuliano Pochini; +Cc: Takashi Iwai, alsa-devel

On Tue, 8 Sep 2009, Giuliano Pochini wrote:

> On Sun, 6 Sep 2009 15:16:57 +0100 (BST)
> Mark Hills <mark@pogo.org.uk> wrote:
>
>> I have an Echo Layla 3G in my workstation. It works for audio, but does
>> not recover from ACPI suspend to RAM.
>
> I does not recover because there is not suspend/resume support. I wrote
> almost complete support a lot of time ago which wasn't merged due to two
> unsolved issues. The resurrection procedure (actually it is a reinit from
> scratch) was very unreliable on my Gina24 for unknown reasons and there
> was an atomicity issue in the rawmidi interface.
> I can manage to provide you a patch for testing in a few days.

Thanks Giuliano and Takashi for the replies.

The reason for my quietness is I decided loading/unloading the module was 
a hack so I followed Takashi's documentation, aiming to implement some 
kind of suspend/resume.

However didn't have a great deal of luck; it's made especially hard with 
not having a separate serial console or way of viewing debug messages when 
the machine locks. It seemed that the amount of de-initialisation affects 
the ability to reinitialise the Layla 3G card from scratch. Then I ran out 
of time.

It would be great to see the patch, I'd be very happy to help with 
testing. Thanks for your work on this.

-- 
Mark

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2009-02-09 21:27   ` jamal
@ 2009-02-09 21:44     ` Jan Engelhardt
  0 siblings, 0 replies; 129+ messages in thread
From: Jan Engelhardt @ 2009-02-09 21:44 UTC (permalink / raw)
  To: jamal; +Cc: Patrick McHardy, Pablo Neira Ayuso, netfilter-devel







On Monday 2009-02-09 22:27, jamal wrote:
>On Mon, 2009-02-09 at 22:04 +0100, Jan Engelhardt wrote:
>> 
>> I do not think a library should call exit() and cause the main program 
>> to terminate; 
>> to this end it might be best to add a
>> 
>> 	void (*exit_error)(..
>> 
>> function pointer to the xtables_global struct you are proposing.
>
>Thanks for the suggestion - it sounds reasonable. 
>Note, however, grep says there are about 700 references  to exit_error()
>- so my intent of moving it into xtables.c is for usability more than
>anything.

Hm you are right; much of the code assumes that exit_error()
never returns. We need to stick to that for the time being.

>> If you could convert iptables.c and friends to also make use of this, 
>> that'd be great.
>
>There are only 3 definitions as far as i can see. If i can convert those
>to use that global struct then I should be able to compile that basic
>program.

Yep.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2009-02-09 21:04 ` several messages Jan Engelhardt
@ 2009-02-09 21:27   ` jamal
  2009-02-09 21:44     ` Jan Engelhardt
  0 siblings, 1 reply; 129+ messages in thread
From: jamal @ 2009-02-09 21:27 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Patrick McHardy, Pablo Neira Ayuso, netfilter-devel

On Mon, 2009-02-09 at 22:04 +0100, Jan Engelhardt wrote:

> 
> I do not think a library should call exit() and cause the main program 
> to terminate; 
> to this end it might be best to add a
> 
> 	void (*exit_error)(..
> 
> function pointer to the xtables_global struct you are proposing.
> 
> 

Thanks for the suggestion - it sounds reasonable. 
Note, however, grep says there are about 700 references  to exit_error()
- so my intent of moving it into xtables.c is for usability more than
anything.

> If you could convert iptables.c and friends to also make use of this, 
> that'd be great.

There are only 3 definitions as far as i can see. If i can convert those
to use that global struct then I should be able to compile that basic
program.

cheers,
jamal


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2009-02-09 20:57 [PATCH] libxtables: Introduce global params structuring jamal
@ 2009-02-09 21:04 ` Jan Engelhardt
  2009-02-09 21:27   ` jamal
  0 siblings, 1 reply; 129+ messages in thread
From: Jan Engelhardt @ 2009-02-09 21:04 UTC (permalink / raw)
  To: jamal; +Cc: Patrick McHardy, Pablo Neira Ayuso, netfilter-devel


On Monday 2009-02-09 21:45, jamal wrote:
>
>Ok, I just synced with latest git. I will send you a few patches first.
>My path to resolving tc/ipt is to start with being able to take a basic
>useless program like:
>
>----------
>#include <xtables.h>
>int main(int argc, char **argv) {
>
>        return 0;
>}
>--------
>
>then compile and link with "gcc useless.c -lxtables -ldl"
>
>As it is right now i have to define in the minimal exit_error()

I do not think a library should call exit() and cause the main program 
to terminate; to this end it might be best to add a

	void (*exit_error)(..

function pointer to the xtables_global struct you are proposing.


>
>Here's the basic change.

If you could convert iptables.c and friends to also make use of this, 
that'd be great.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2008-11-26 21:06         ` Tobias Müller
@ 2008-11-27  0:57           ` Jiri Kosina
  0 siblings, 0 replies; 129+ messages in thread
From: Jiri Kosina @ 2008-11-27  0:57 UTC (permalink / raw)
  To: J.R. Mauro, Tobias Müller; +Cc: Jan Scholz, jirislaby, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 406 bytes --]

On Wed, 26 Nov 2008, J.R. Mauro wrote:

> There is one bluetooth model and one USB model.

On Wed, 26 Nov 2008, Tobias Müller wrote:

> I own the USB variant and these are the right id for that. The wireless 
> IDs were from another patch I merged together with mine. I don't have a 
> wireless version.

OK, so therefore you confirm that Jan's patch is OK, right?

Thanks a lot,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2008-10-19 19:40   ` several messages Jiri Kosina
  2008-10-19 20:06       ` Justin Mattock
@ 2008-10-19 22:09     ` Jiri Slaby
  1 sibling, 0 replies; 129+ messages in thread
From: Jiri Slaby @ 2008-10-19 22:09 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: linux-input, linux-kernel, Steven Noonan, Justin Mattock,
	Sven Anders, Marcel Holtmann, linux-bluetooth

Jiri Kosina napsal(a):
> On Sun, 19 Oct 2008, Jiri Slaby wrote:
> 
>> +enum hid_type {
>> +	HID_TYPE_UNKNOWN = 0,
>> +	HID_TYPE_MOUSE,
>> +	HID_TYPE_KEYBOARD
>> +};
>> +
> 
> Do we really need the HID_TYPE_KEYBOARD at all? It's not used anywhere in 
> the code. I'd propose to add it when it is actually needed. I.e. have the 
> enum contain something like HID_TYPE_MOUSE HID_TYPE_OTHER for now, and add 
> whatever will become necessary in the future, what do you think?

I would use unknown rather than other, since on bluetooth mouse is unknown
not other, if you don't mind?

Or did you mean tristate unknown, mouse and other?

Thanks for review.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2008-10-19 20:06       ` Justin Mattock
  0 siblings, 0 replies; 129+ messages in thread
From: Justin Mattock @ 2008-10-19 20:06 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Jiri Slaby, linux-input, linux-kernel, Steven Noonan,
	Sven Anders, Marcel Holtmann, linux-bluetooth

On Sun, Oct 19, 2008 at 12:40 PM, Jiri Kosina <jkosina@suse.cz> wrote:
> On Sun, 19 Oct 2008, Jiri Slaby wrote:
>
>> +enum hid_type {
>> +     HID_TYPE_UNKNOWN = 0,
>> +     HID_TYPE_MOUSE,
>> +     HID_TYPE_KEYBOARD
>> +};
>> +
>
> Do we really need the HID_TYPE_KEYBOARD at all? It's not used anywhere in
> the code. I'd propose to add it when it is actually needed. I.e. have the
> enum contain something like HID_TYPE_MOUSE HID_TYPE_OTHER for now, and add
> whatever will become necessary in the future, what do you think?
>
>
> On Sun, 19 Oct 2008, Jiri Slaby wrote:
>
>> +/**
>> + * hid_mouse_ignore_list - mouse devices which must not be held by the hid layer
>> + */
>
> I think a more descriptive comment would be appropriate here. It might not
> be obvious on the first sight why this needs to be done separately from
> the generic hid_blacklist. I.e. something like
>
> /**
>  * There are composite devices for which we want to ignore only a certain
>  * interface. This is a list of devices for which only the mouse interface
>  * will be ignored.
>  */
>
> maybe?
>
> Thanks,
>
> --
> Jiri Kosina
> SUSE Labs
>

I can agree with that, whats the point having something
there if it not being used,(just eating up precious space);

-- 
Justin P. Mattock

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2008-10-19 20:06       ` Justin Mattock
  0 siblings, 0 replies; 129+ messages in thread
From: Justin Mattock @ 2008-10-19 20:06 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Jiri Slaby, linux-input-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Steven Noonan, Sven Anders,
	Marcel Holtmann, linux-bluetooth-u79uwXL29TY76Z2rM5mHXA

On Sun, Oct 19, 2008 at 12:40 PM, Jiri Kosina <jkosina-AlSwsSmVLrQ@public.gmane.org> wrote:
> On Sun, 19 Oct 2008, Jiri Slaby wrote:
>
>> +enum hid_type {
>> +     HID_TYPE_UNKNOWN = 0,
>> +     HID_TYPE_MOUSE,
>> +     HID_TYPE_KEYBOARD
>> +};
>> +
>
> Do we really need the HID_TYPE_KEYBOARD at all? It's not used anywhere in
> the code. I'd propose to add it when it is actually needed. I.e. have the
> enum contain something like HID_TYPE_MOUSE HID_TYPE_OTHER for now, and add
> whatever will become necessary in the future, what do you think?
>
>
> On Sun, 19 Oct 2008, Jiri Slaby wrote:
>
>> +/**
>> + * hid_mouse_ignore_list - mouse devices which must not be held by the hid layer
>> + */
>
> I think a more descriptive comment would be appropriate here. It might not
> be obvious on the first sight why this needs to be done separately from
> the generic hid_blacklist. I.e. something like
>
> /**
>  * There are composite devices for which we want to ignore only a certain
>  * interface. This is a list of devices for which only the mouse interface
>  * will be ignored.
>  */
>
> maybe?
>
> Thanks,
>
> --
> Jiri Kosina
> SUSE Labs
>

I can agree with that, whats the point having something
there if it not being used,(just eating up precious space);

-- 
Justin P. Mattock

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2008-10-19 14:15 ` [PATCH 2/2] HID: fix appletouch regression Jiri Slaby
@ 2008-10-19 19:40   ` Jiri Kosina
  2008-10-19 20:06       ` Justin Mattock
  2008-10-19 22:09     ` Jiri Slaby
  0 siblings, 2 replies; 129+ messages in thread
From: Jiri Kosina @ 2008-10-19 19:40 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: linux-input, linux-kernel, Steven Noonan, Justin Mattock,
	Sven Anders, Marcel Holtmann, linux-bluetooth

On Sun, 19 Oct 2008, Jiri Slaby wrote:

> +enum hid_type {
> +	HID_TYPE_UNKNOWN = 0,
> +	HID_TYPE_MOUSE,
> +	HID_TYPE_KEYBOARD
> +};
> +

Do we really need the HID_TYPE_KEYBOARD at all? It's not used anywhere in 
the code. I'd propose to add it when it is actually needed. I.e. have the 
enum contain something like HID_TYPE_MOUSE HID_TYPE_OTHER for now, and add 
whatever will become necessary in the future, what do you think?


On Sun, 19 Oct 2008, Jiri Slaby wrote:

> +/**
> + * hid_mouse_ignore_list - mouse devices which must not be held by the hid layer
> + */

I think a more descriptive comment would be appropriate here. It might not 
be obvious on the first sight why this needs to be done separately from 
the generic hid_blacklist. I.e. something like

/** 
 * There are composite devices for which we want to ignore only a certain
 * interface. This is a list of devices for which only the mouse interface 
 * will be ignored.
 */

maybe?

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
       [not found] <9E397A467F4DB34884A1FD0D5D27CF43018903F96E@msxaoa4.twosigma.com>
@ 2008-06-12 16:54 ` Benjamin L. Shi
  0 siblings, 0 replies; 129+ messages in thread
From: Benjamin L. Shi @ 2008-06-12 16:54 UTC (permalink / raw)
  To: xfs




Index: fs/xfs/xfs_iomap.c
===================================================================
RCS file: /src/linux/2.6.18/fs/xfs/xfs_iomap.c,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -u -r1.1.1.1 -r1.2
--- fs/xfs/xfs_iomap.c  29 Sep 2006 13:45:19 -0000      1.1.1.1
+++ fs/xfs/xfs_iomap.c  12 Jun 2008 15:59:10 -0000      1.2
@@ -706,11 +706,24 @@
         * then we must have run out of space - flush delalloc, and retry..
         */
        if (nimaps == 0) {
+               if ((mp->m_flags & XFS_MOUNT_FULL) != 0) {
+                       if (mp->m_sb.sb_fdblocks < 500) {
+                               //      printk("full again %llu\n",
+                               //              mp->m_sb.sb_fdblocks);
+                                       return XFS_ERROR(ENOSPC);
+                       } else {
+                               //      printk("not full again %llu\n",
+                               //              mp->m_sb.sb_fdblocks);
+                                       mp->m_flags &= ~XFS_MOUNT_FULL;
+                       }
+               }
                xfs_iomap_enter_trace(XFS_IOMAP_WRITE_NOSPACE,
                                        io, offset, count);
-               if (xfs_flush_space(ip, &fsynced, &ioflag))
+               if (xfs_flush_space(ip, &fsynced, &ioflag)) {
+                       mp->m_flags |= XFS_MOUNT_FULL;
+                       //printk("set full %llu\n", mp->m_sb.sb_fdblocks);
                        return XFS_ERROR(ENOSPC);
-
+               }
                error = 0;
                goto retry;
        }
Index: fs/xfs/xfs_mount.h
===================================================================
RCS file: /src/linux/2.6.18/fs/xfs/xfs_mount.h,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -u -r1.1.1.1 -r1.2
--- fs/xfs/xfs_mount.h  29 Sep 2006 13:45:19 -0000      1.1.1.1
+++ fs/xfs/xfs_mount.h  12 Jun 2008 15:59:10 -0000      1.2
@@ -459,6 +459,7 @@
                                                 * I/O size in stat() */
 #define XFS_MOUNT_NO_PERCPU_SB (1ULL << 23)    /* don't use per-cpu
superblock
                                                   counters */
+#define XFS_MOUNT_FULL         (1ULL << 24)


 /*



>
> On Fri, 6 Oct 2006, David Chinner wrote:
>
>>> The backtrace looked like this:
>>>
>>> ... nfsd_write nfsd_vfs_write vfs_writev do_readv_writev
>>> xfs_file_writev
>>> xfs_write generic_file_buffered_write xfs_get_blocks __xfs_get_blocks
>>> xfs_bmap xfs_iomap xfs_iomap_write_delay xfs_flush_space
>>> xfs_flush_device
>>> schedule_timeout_uninterruptible.
>>
>> Ahhh, this gets hit on the ->prepare_write path
>> (xfs_iomap_write_delay()),
>
> Yes.
>
>> not the allocate path (xfs_iomap_write_allocate()). Sorry - I got myself
>> (and probably everyone else) confused there which why I suspected sync
>> writes - they trigger the allocate path in the write call. I don't think
>> 2.6.18 will change anything.
>>
>> FWIW, I don't think we can avoid this sleep when we first hit ENOSPC
>> conditions, but perhaps once we are certain of the ENOSPC status
>> we can tag the filesystem with this state (say an xfs_mount flag)
>> and only clear that tag when something is freed. We could then
>> use the tag to avoid continually trying extremely hard to allocate
>> space when we know there is none available....
>
> Yes! That's what I was trying to suggest  << OLE Object: Picture (Device
> Independent Bitmap) >> . Thank you.
>
> Is that hard to do?
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2007-02-22 22:58     ` Guennadi Liakhovetski
  (?)
@ 2007-02-23 11:17       ` Johannes Berg
  -1 siblings, 0 replies; 129+ messages in thread
From: Johannes Berg @ 2007-02-23 11:17 UTC (permalink / raw)
  To: Guennadi Liakhovetski
  Cc: David Brownell, linuxppc-dev, rtc-linux, linux-pm, Torrance,
	Alessandro Zummo, john stultz, Andrew Morton, Linux Kernel list

[-- Attachment #1: Type: text/plain, Size: 378 bytes --]

On Thu, 2007-02-22 at 23:58 +0100, Guennadi Liakhovetski wrote:

> I think, we only want 1, right? And the latter seems to be more generic / 
> platform independent? And as a side-effect, powermac would have to migrate 
> to generic rtc:-)

Can we migrate all of powerpc to genrtc? But yes, I agree. Had enough to
do though already to get suspend working :)

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2007-02-23 11:17       ` Johannes Berg
  0 siblings, 0 replies; 129+ messages in thread
From: Johannes Berg @ 2007-02-23 11:17 UTC (permalink / raw)
  To: Guennadi Liakhovetski
  Cc: Alessandro Zummo, rtc-linux, Torrance, linux-pm,
	Linux Kernel list, David Brownell, linuxppc-dev, john stultz,
	Andrew Morton


[-- Attachment #1.1: Type: text/plain, Size: 378 bytes --]

On Thu, 2007-02-22 at 23:58 +0100, Guennadi Liakhovetski wrote:

> I think, we only want 1, right? And the latter seems to be more generic / 
> platform independent? And as a side-effect, powermac would have to migrate 
> to generic rtc:-)

Can we migrate all of powerpc to genrtc? But yes, I agree. Had enough to
do though already to get suspend working :)

johannes

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

[-- Attachment #2: Type: text/plain, Size: 146 bytes --]

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2007-02-23 11:17       ` Johannes Berg
  0 siblings, 0 replies; 129+ messages in thread
From: Johannes Berg @ 2007-02-23 11:17 UTC (permalink / raw)
  To: Guennadi Liakhovetski
  Cc: Alessandro Zummo, rtc-linux, Torrance, linux-pm,
	Linux Kernel list, David Brownell, linuxppc-dev, john stultz,
	Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 378 bytes --]

On Thu, 2007-02-22 at 23:58 +0100, Guennadi Liakhovetski wrote:

> I think, we only want 1, right? And the latter seems to be more generic / 
> platform independent? And as a side-effect, powermac would have to migrate 
> to generic rtc:-)

Can we migrate all of powerpc to genrtc? But yes, I agree. Had enough to
do though already to get suspend working :)

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2007-02-22 22:58     ` Guennadi Liakhovetski
  (?)
@ 2007-02-23  1:15       ` David Brownell
  -1 siblings, 0 replies; 129+ messages in thread
From: David Brownell @ 2007-02-23  1:15 UTC (permalink / raw)
  To: Guennadi Liakhovetski
  Cc: Alessandro Zummo, Andrew Morton, Johannes Berg, john stultz,
	Linux Kernel list, linux-pm, linuxppc-dev, rtc-linux, Torrance

On Thursday 22 February 2007 2:58 pm, Guennadi Liakhovetski wrote:
> 
> I think, we only want 1, right? And the latter seems to be more generic / 
> platform independent? And as a side-effect, powermac would have to migrate 
> to generic rtc:-)

I'd certainly think that restoring the system clock should be, as much
as possible, in platform-agnostic code.  Like the generic RTC framework.

And hmm, that powermac/time.c file replicates other RTC code...

Minor obstacle:  removing the EXPERIMENTAL label from that code.

- Dave

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2007-02-23  1:15       ` David Brownell
  0 siblings, 0 replies; 129+ messages in thread
From: David Brownell @ 2007-02-23  1:15 UTC (permalink / raw)
  To: Guennadi Liakhovetski
  Cc: linuxppc-dev, Torrance, Alessandro Zummo, rtc-linux, john stultz,
	Andrew Morton, linux-pm, Johannes Berg, Linux Kernel list

On Thursday 22 February 2007 2:58 pm, Guennadi Liakhovetski wrote:
> 
> I think, we only want 1, right? And the latter seems to be more generic / 
> platform independent? And as a side-effect, powermac would have to migrate 
> to generic rtc:-)

I'd certainly think that restoring the system clock should be, as much
as possible, in platform-agnostic code.  Like the generic RTC framework.

And hmm, that powermac/time.c file replicates other RTC code...

Minor obstacle:  removing the EXPERIMENTAL label from that code.

- Dave

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2007-02-23  1:15       ` David Brownell
  0 siblings, 0 replies; 129+ messages in thread
From: David Brownell @ 2007-02-23  1:15 UTC (permalink / raw)
  To: Guennadi Liakhovetski
  Cc: Alessandro Zummo, rtc-linux, john stultz, linux-pm,
	Linux Kernel list, Torrance, linuxppc-dev, Andrew Morton,
	Johannes Berg

On Thursday 22 February 2007 2:58 pm, Guennadi Liakhovetski wrote:
> 
> I think, we only want 1, right? And the latter seems to be more generic / 
> platform independent? And as a side-effect, powermac would have to migrate 
> to generic rtc:-)

I'd certainly think that restoring the system clock should be, as much
as possible, in platform-agnostic code.  Like the generic RTC framework.

And hmm, that powermac/time.c file replicates other RTC code...

Minor obstacle:  removing the EXPERIMENTAL label from that code.

- Dave

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2007-02-22  3:50 ` [patch 6/6] rtc suspend()/resume() restores system clock David Brownell
  2007-02-22 22:58     ` Guennadi Liakhovetski
@ 2007-02-22 22:58     ` Guennadi Liakhovetski
  0 siblings, 0 replies; 129+ messages in thread
From: Guennadi Liakhovetski @ 2007-02-22 22:58 UTC (permalink / raw)
  To: Johannes Berg, David Brownell
  Cc: linuxppc-dev, rtc-linux, linux-pm, Torrance, Alessandro Zummo,
	john stultz, Andrew Morton, Linux Kernel list

of the following 2 patches:

On Mon, 5 Feb 2007, Johannes Berg wrote:

> This patch removes the time suspend/restore code that was done through
> a PMU notifier in arch/platforms/powermac/time.c.
> 
> Instead, we introduce arch/powerpc/sysdev/timer.c which creates a sys
> device and handles time of day suspend/resume through that.
> 
> Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
> Cc: Andrew Morton <akpm@osdl.org>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>

[patch trimmed]

On Wed, 21 Feb 2007, David Brownell wrote:

> RTC class suspend/resume support, re-initializing the system clock on resume
> >from the clock used to initialize it at boot time.
> 
>  - Inlining the same code used by ARM, which saves and restores the
>    delta between a selected RTC and the current system wall-clock time.
> 
>  - Removes calls to that ARM code from AT91, OMAP, and S3C RTCs.
> 
> This goes on top of the patch series removing "struct class_device" usage
> >from the RTC framework.  That makes class suspend()/resume() work.
> 
> Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
> 
> ---
>  drivers/rtc/Kconfig          |   24 +++++++++----
>  drivers/rtc/class.c          |   74 +++++++++++++++++++++++++++++++++++++++++++
>  drivers/rtc/rtc-at91rm9200.c |   30 -----------------
>  drivers/rtc/rtc-omap.c       |   17 ---------
>  drivers/rtc/rtc-s3c.c        |   22 ------------
>  5 files changed, 91 insertions(+), 76 deletions(-)

[patch trimmed]

I think, we only want 1, right? And the latter seems to be more generic / 
platform independent? And as a side-effect, powermac would have to migrate 
to generic rtc:-)

Thanks
Guennadi
---
Guennadi Liakhovetski

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2007-02-22 22:58     ` Guennadi Liakhovetski
  0 siblings, 0 replies; 129+ messages in thread
From: Guennadi Liakhovetski @ 2007-02-22 22:58 UTC (permalink / raw)
  To: Johannes Berg, David Brownell
  Cc: linuxppc-dev, Alessandro Zummo, rtc-linux, john stultz,
	Andrew Morton, linux-pm, Linux Kernel list, Torrance

of the following 2 patches:

On Mon, 5 Feb 2007, Johannes Berg wrote:

> This patch removes the time suspend/restore code that was done through
> a PMU notifier in arch/platforms/powermac/time.c.
> 
> Instead, we introduce arch/powerpc/sysdev/timer.c which creates a sys
> device and handles time of day suspend/resume through that.
> 
> Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
> Cc: Andrew Morton <akpm@osdl.org>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>

[patch trimmed]

On Wed, 21 Feb 2007, David Brownell wrote:

> RTC class suspend/resume support, re-initializing the system clock on resume
> >from the clock used to initialize it at boot time.
> 
>  - Inlining the same code used by ARM, which saves and restores the
>    delta between a selected RTC and the current system wall-clock time.
> 
>  - Removes calls to that ARM code from AT91, OMAP, and S3C RTCs.
> 
> This goes on top of the patch series removing "struct class_device" usage
> >from the RTC framework.  That makes class suspend()/resume() work.
> 
> Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
> 
> ---
>  drivers/rtc/Kconfig          |   24 +++++++++----
>  drivers/rtc/class.c          |   74 +++++++++++++++++++++++++++++++++++++++++++
>  drivers/rtc/rtc-at91rm9200.c |   30 -----------------
>  drivers/rtc/rtc-omap.c       |   17 ---------
>  drivers/rtc/rtc-s3c.c        |   22 ------------
>  5 files changed, 91 insertions(+), 76 deletions(-)

[patch trimmed]

I think, we only want 1, right? And the latter seems to be more generic / 
platform independent? And as a side-effect, powermac would have to migrate 
to generic rtc:-)

Thanks
Guennadi
---
Guennadi Liakhovetski

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2007-02-22 22:58     ` Guennadi Liakhovetski
  0 siblings, 0 replies; 129+ messages in thread
From: Guennadi Liakhovetski @ 2007-02-22 22:58 UTC (permalink / raw)
  To: Johannes Berg, David Brownell
  Cc: Alessandro Zummo, rtc-linux, john stultz, linux-pm,
	Linux Kernel list, Torrance, linuxppc-dev, Andrew Morton

of the following 2 patches:

On Mon, 5 Feb 2007, Johannes Berg wrote:

> This patch removes the time suspend/restore code that was done through
> a PMU notifier in arch/platforms/powermac/time.c.
> 
> Instead, we introduce arch/powerpc/sysdev/timer.c which creates a sys
> device and handles time of day suspend/resume through that.
> 
> Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
> Cc: Andrew Morton <akpm@osdl.org>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>

[patch trimmed]

On Wed, 21 Feb 2007, David Brownell wrote:

> RTC class suspend/resume support, re-initializing the system clock on resume
> >from the clock used to initialize it at boot time.
> 
>  - Inlining the same code used by ARM, which saves and restores the
>    delta between a selected RTC and the current system wall-clock time.
> 
>  - Removes calls to that ARM code from AT91, OMAP, and S3C RTCs.
> 
> This goes on top of the patch series removing "struct class_device" usage
> >from the RTC framework.  That makes class suspend()/resume() work.
> 
> Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
> 
> ---
>  drivers/rtc/Kconfig          |   24 +++++++++----
>  drivers/rtc/class.c          |   74 +++++++++++++++++++++++++++++++++++++++++++
>  drivers/rtc/rtc-at91rm9200.c |   30 -----------------
>  drivers/rtc/rtc-omap.c       |   17 ---------
>  drivers/rtc/rtc-s3c.c        |   22 ------------
>  5 files changed, 91 insertions(+), 76 deletions(-)

[patch trimmed]

I think, we only want 1, right? And the latter seems to be more generic / 
platform independent? And as a side-effect, powermac would have to migrate 
to generic rtc:-)

Thanks
Guennadi
---
Guennadi Liakhovetski

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2006-10-06  0:33               ` David Chinner
@ 2006-10-06 13:25                 ` Stephane Doyon
  -1 siblings, 0 replies; 129+ messages in thread
From: Stephane Doyon @ 2006-10-06 13:25 UTC (permalink / raw)
  To: David Chinner; +Cc: Trond Myklebust, xfs, nfs, Shailendra Tripathi

On Fri, 6 Oct 2006, David Chinner wrote:

> On Thu, Oct 05, 2006 at 11:39:45AM -0400, Stephane Doyon wrote:
>>
>> I hadn't realized that the issue isn't just with the final flush on
>> close(). It's actually been flushing all along, delaying some of the
>> subsequent write()s, getting NOSPC errors but not reporting them until the
>> end.
>
> Other NFS clients will report an ENOSPC on the next write() or close()
> if the error is reported during async writeback. The clients that typically
> do this throw away any unwritten data as well on the basis that the
> application was returned an error ASAP and it is now Somebody Else's
> Problem (i.e. the application needs to handle it from there).

Well the client wouldn't necessarily have to throw away cached data. It 
could conceivably be made to return ENOSPC on some subsequent write. It 
would need to throw away the data for that write, but not necessarily 
destroy its cache. It could then clear the error condition and allow the 
application to keep trying if it wants to...

>> Would it be incorrect for a subsequent write to return the error that
>> occurred while flushing data from previous writes? Then the app could
>> decide whether to continue and retry or not. But I guess I can see how
>> that might get convoluted.
>
> .... there's many entertaining hoops to jump through to do this
> reliably.

I imagine there would be...

> For example: when you have large amounts of cached data, expedient
> error reporting and tossing unwritten data leads to much faster
> error recovery than trying to write every piece of data (hence the
> Irix use of this method).

In my case, I didn't think I was caching that much data though, only a few 
hundred MBs, and I wouldn't have minded so much if an error had been 
returned after that much. The way it's implemented though, I can write an 
unbounded amount of data through that cache and not be told of the problem 
until I close or fsync. It may not be technically wrong, but given the 
outrageous delay I saw in my particular situation, it felt pretty 
suboptimal.

> There's no clear right or wrong approach here - both have their
> advantages and disadvantages for different workloads. If it
> weren't for the sub-optimal behaviour of XFS in this case, you
> probably wouldn't have even cared about this....

Indeed not! In fact, changing the client is not practical for me, what I 
need is a fix for the XFS behavior. I just thought it was also worth 
reporting what I perceived to be an issue with the NFS client.

Thanks

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2006-10-06 13:25                 ` Stephane Doyon
  0 siblings, 0 replies; 129+ messages in thread
From: Stephane Doyon @ 2006-10-06 13:25 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs, nfs, Shailendra Tripathi, Trond Myklebust

On Fri, 6 Oct 2006, David Chinner wrote:

> On Thu, Oct 05, 2006 at 11:39:45AM -0400, Stephane Doyon wrote:
>>
>> I hadn't realized that the issue isn't just with the final flush on
>> close(). It's actually been flushing all along, delaying some of the
>> subsequent write()s, getting NOSPC errors but not reporting them until the
>> end.
>
> Other NFS clients will report an ENOSPC on the next write() or close()
> if the error is reported during async writeback. The clients that typically
> do this throw away any unwritten data as well on the basis that the
> application was returned an error ASAP and it is now Somebody Else's
> Problem (i.e. the application needs to handle it from there).

Well the client wouldn't necessarily have to throw away cached data. It 
could conceivably be made to return ENOSPC on some subsequent write. It 
would need to throw away the data for that write, but not necessarily 
destroy its cache. It could then clear the error condition and allow the 
application to keep trying if it wants to...

>> Would it be incorrect for a subsequent write to return the error that
>> occurred while flushing data from previous writes? Then the app could
>> decide whether to continue and retry or not. But I guess I can see how
>> that might get convoluted.
>
> .... there's many entertaining hoops to jump through to do this
> reliably.

I imagine there would be...

> For example: when you have large amounts of cached data, expedient
> error reporting and tossing unwritten data leads to much faster
> error recovery than trying to write every piece of data (hence the
> Irix use of this method).

In my case, I didn't think I was caching that much data though, only a few 
hundred MBs, and I wouldn't have minded so much if an error had been 
returned after that much. The way it's implemented though, I can write an 
unbounded amount of data through that cache and not be told of the problem 
until I close or fsync. It may not be technically wrong, but given the 
outrageous delay I saw in my particular situation, it felt pretty 
suboptimal.

> There's no clear right or wrong approach here - both have their
> advantages and disadvantages for different workloads. If it
> weren't for the sub-optimal behaviour of XFS in this case, you
> probably wouldn't have even cared about this....

Indeed not! In fact, changing the client is not practical for me, what I 
need is a fix for the XFS behavior. I just thought it was also worth 
reporting what I perceived to be an issue with the NFS client.

Thanks

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2006-10-05 23:29               ` David Chinner
@ 2006-10-06 13:03                 ` Stephane Doyon
  -1 siblings, 0 replies; 129+ messages in thread
From: Stephane Doyon @ 2006-10-06 13:03 UTC (permalink / raw)
  To: David Chinner; +Cc: Trond Myklebust, xfs, nfs, Shailendra Tripathi

On Fri, 6 Oct 2006, David Chinner wrote:

>> The backtrace looked like this:
>>
>> ... nfsd_write nfsd_vfs_write vfs_writev do_readv_writev xfs_file_writev
>> xfs_write generic_file_buffered_write xfs_get_blocks __xfs_get_blocks
>> xfs_bmap xfs_iomap xfs_iomap_write_delay xfs_flush_space xfs_flush_device
>> schedule_timeout_uninterruptible.
>
> Ahhh, this gets hit on the ->prepare_write path (xfs_iomap_write_delay()),

Yes.

> not the allocate path (xfs_iomap_write_allocate()). Sorry - I got myself
> (and probably everyone else) confused there which why I suspected sync
> writes - they trigger the allocate path in the write call. I don't think
> 2.6.18 will change anything.
>
> FWIW, I don't think we can avoid this sleep when we first hit ENOSPC
> conditions, but perhaps once we are certain of the ENOSPC status
> we can tag the filesystem with this state (say an xfs_mount flag)
> and only clear that tag when something is freed. We could then
> use the tag to avoid continually trying extremely hard to allocate
> space when we know there is none available....

Yes! That's what I was trying to suggest :-). Thank you.

Is that hard to do?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2006-10-06 13:03                 ` Stephane Doyon
  0 siblings, 0 replies; 129+ messages in thread
From: Stephane Doyon @ 2006-10-06 13:03 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs, nfs, Shailendra Tripathi, Trond Myklebust

On Fri, 6 Oct 2006, David Chinner wrote:

>> The backtrace looked like this:
>>
>> ... nfsd_write nfsd_vfs_write vfs_writev do_readv_writev xfs_file_writev
>> xfs_write generic_file_buffered_write xfs_get_blocks __xfs_get_blocks
>> xfs_bmap xfs_iomap xfs_iomap_write_delay xfs_flush_space xfs_flush_device
>> schedule_timeout_uninterruptible.
>
> Ahhh, this gets hit on the ->prepare_write path (xfs_iomap_write_delay()),

Yes.

> not the allocate path (xfs_iomap_write_allocate()). Sorry - I got myself
> (and probably everyone else) confused there which why I suspected sync
> writes - they trigger the allocate path in the write call. I don't think
> 2.6.18 will change anything.
>
> FWIW, I don't think we can avoid this sleep when we first hit ENOSPC
> conditions, but perhaps once we are certain of the ENOSPC status
> we can tag the filesystem with this state (say an xfs_mount flag)
> and only clear that tag when something is freed. We could then
> use the tag to avoid continually trying extremely hard to allocate
> space when we know there is none available....

Yes! That's what I was trying to suggest :-). Thank you.

Is that hard to do?

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2006-10-05 15:39             ` Stephane Doyon
@ 2006-10-06  0:33               ` David Chinner
  -1 siblings, 0 replies; 129+ messages in thread
From: David Chinner @ 2006-10-06  0:33 UTC (permalink / raw)
  To: Stephane Doyon
  Cc: Trond Myklebust, David Chinner, xfs, nfs, Shailendra Tripathi

On Thu, Oct 05, 2006 at 11:39:45AM -0400, Stephane Doyon wrote:
> 
> I hadn't realized that the issue isn't just with the final flush on 
> close(). It's actually been flushing all along, delaying some of the 
> subsequent write()s, getting NOSPC errors but not reporting them until the 
> end.

Other NFS clients will report an ENOSPC on the next write() or close()
if the error is reported during async writeback. The clients that typically
do this throw away any unwritten data as well on the basis that the
application was returned an error ASAP and it is now Somebody Else's
Problem (i.e. the application needs to handle it from there).

> I understand that since my application did not request any syncing, the 
> system cannot guarantee to report errors until cached data has been 
> flushed. But some data has indeed been flushed with an error; can't this 
> be reported earlier than on close?

It could, but...

> Would it be incorrect for a subsequent write to return the error that 
> occurred while flushing data from previous writes? Then the app could 
> decide whether to continue and retry or not. But I guess I can see how 
> that might get convoluted.

.... there's many entertaining hoops to jump through to do this
reliably.

FWIW, these are simply two different approaches to handling ENOSPC
(and other server) errors.  Mostly it comes down to how the ppl who
implemented the NFS client think it's best to handle the errors in
the scenarios that they most care about.

For example: when you have large amounts of cached data, expedient
error reporting and tossing unwritten data leads to much faster
error recovery than trying to write every piece of data (hence the
Irix use of this method).

OTOH, when you really want as much of the data to get to the server,
regardless of whether you lose some (e.g.  log files) before
reporting an error then you try to write every bit of data before
telling the application.

There's no clear right or wrong approach here - both have their
advantages and disadvantages for different workloads. If it
weren't for the sub-optimal behaviour of XFS in this case, you
probably wouldn't have even cared about this....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2006-10-06  0:33               ` David Chinner
  0 siblings, 0 replies; 129+ messages in thread
From: David Chinner @ 2006-10-06  0:33 UTC (permalink / raw)
  To: Stephane Doyon
  Cc: xfs, David Chinner, nfs, Shailendra Tripathi, Trond Myklebust

On Thu, Oct 05, 2006 at 11:39:45AM -0400, Stephane Doyon wrote:
> 
> I hadn't realized that the issue isn't just with the final flush on 
> close(). It's actually been flushing all along, delaying some of the 
> subsequent write()s, getting NOSPC errors but not reporting them until the 
> end.

Other NFS clients will report an ENOSPC on the next write() or close()
if the error is reported during async writeback. The clients that typically
do this throw away any unwritten data as well on the basis that the
application was returned an error ASAP and it is now Somebody Else's
Problem (i.e. the application needs to handle it from there).

> I understand that since my application did not request any syncing, the 
> system cannot guarantee to report errors until cached data has been 
> flushed. But some data has indeed been flushed with an error; can't this 
> be reported earlier than on close?

It could, but...

> Would it be incorrect for a subsequent write to return the error that 
> occurred while flushing data from previous writes? Then the app could 
> decide whether to continue and retry or not. But I guess I can see how 
> that might get convoluted.

.... there's many entertaining hoops to jump through to do this
reliably.

FWIW, these are simply two different approaches to handling ENOSPC
(and other server) errors.  Mostly it comes down to how the ppl who
implemented the NFS client think it's best to handle the errors in
the scenarios that they most care about.

For example: when you have large amounts of cached data, expedient
error reporting and tossing unwritten data leads to much faster
error recovery than trying to write every piece of data (hence the
Irix use of this method).

OTOH, when you really want as much of the data to get to the server,
regardless of whether you lose some (e.g.  log files) before
reporting an error then you try to write every bit of data before
telling the application.

There's no clear right or wrong approach here - both have their
advantages and disadvantages for different workloads. If it
weren't for the sub-optimal behaviour of XFS in this case, you
probably wouldn't have even cared about this....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2006-10-05 16:33             ` Stephane Doyon
@ 2006-10-05 23:29               ` David Chinner
  -1 siblings, 0 replies; 129+ messages in thread
From: David Chinner @ 2006-10-05 23:29 UTC (permalink / raw)
  To: Stephane Doyon
  Cc: David Chinner, Trond Myklebust, xfs, nfs, Shailendra Tripathi

On Thu, Oct 05, 2006 at 12:33:05PM -0400, Stephane Doyon wrote:
> retrying, just plowing on...
> 
> >this would trigger a 500ms sleep on every write.  That's the right
> >sort of ballpark for the slowness you were seeing - 5GB / 32k * 0.5s
> >= ~22 hours....
> >
> >This got fixed in 2.6.18-rc6 -
> 
> You mean commit 4be536debe3f7b0c right? (Actually -rc7 I believe...) I do 
> have that one in my kernel. My kernel is 2.6.17 plus assorted XFS fixes.
> 
> >can you retry with a 2.6.18 server
> >and see if your problem goes away?
> 
> Unfortunately it will be several days before I have a chance to do that.
> 
> The backtrace looked like this:
> 
> ... nfsd_write nfsd_vfs_write vfs_writev do_readv_writev xfs_file_writev 
> xfs_write generic_file_buffered_write xfs_get_blocks __xfs_get_blocks 
> xfs_bmap xfs_iomap xfs_iomap_write_delay xfs_flush_space xfs_flush_device 
> schedule_timeout_uninterruptible.

Ahhh, this gets hit on the ->prepare_write path (xfs_iomap_write_delay()),
not the allocate path (xfs_iomap_write_allocate()). Sorry - I got myself
(and probably everyone else) confused there which why I suspected sync
writes - they trigger the allocate path in the write call. I don't think
2.6.18 will change anything.

FWIW, I don't think we can avoid this sleep when we first hit ENOSPC
conditions, but perhaps once we are certain of the ENOSPC status
we can tag the filesystem with this state (say an xfs_mount flag)
and only clear that tag when something is freed. We could then
use the tag to avoid continually trying extremely hard to allocate
space when we know there is none available....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2006-10-05 23:29               ` David Chinner
  0 siblings, 0 replies; 129+ messages in thread
From: David Chinner @ 2006-10-05 23:29 UTC (permalink / raw)
  To: Stephane Doyon
  Cc: xfs, David Chinner, nfs, Shailendra Tripathi, Trond Myklebust

On Thu, Oct 05, 2006 at 12:33:05PM -0400, Stephane Doyon wrote:
> retrying, just plowing on...
> 
> >this would trigger a 500ms sleep on every write.  That's the right
> >sort of ballpark for the slowness you were seeing - 5GB / 32k * 0.5s
> >= ~22 hours....
> >
> >This got fixed in 2.6.18-rc6 -
> 
> You mean commit 4be536debe3f7b0c right? (Actually -rc7 I believe...) I do 
> have that one in my kernel. My kernel is 2.6.17 plus assorted XFS fixes.
> 
> >can you retry with a 2.6.18 server
> >and see if your problem goes away?
> 
> Unfortunately it will be several days before I have a chance to do that.
> 
> The backtrace looked like this:
> 
> ... nfsd_write nfsd_vfs_write vfs_writev do_readv_writev xfs_file_writev 
> xfs_write generic_file_buffered_write xfs_get_blocks __xfs_get_blocks 
> xfs_bmap xfs_iomap xfs_iomap_write_delay xfs_flush_space xfs_flush_device 
> schedule_timeout_uninterruptible.

Ahhh, this gets hit on the ->prepare_write path (xfs_iomap_write_delay()),
not the allocate path (xfs_iomap_write_allocate()). Sorry - I got myself
(and probably everyone else) confused there which why I suspected sync
writes - they trigger the allocate path in the write call. I don't think
2.6.18 will change anything.

FWIW, I don't think we can avoid this sleep when we first hit ENOSPC
conditions, but perhaps once we are certain of the ENOSPC status
we can tag the filesystem with this state (say an xfs_mount flag)
and only clear that tag when something is freed. We could then
use the tag to avoid continually trying extremely hard to allocate
space when we know there is none available....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2006-10-05  8:30           ` David Chinner
@ 2006-10-05 16:33             ` Stephane Doyon
  -1 siblings, 0 replies; 129+ messages in thread
From: Stephane Doyon @ 2006-10-05 16:33 UTC (permalink / raw)
  To: David Chinner; +Cc: Trond Myklebust, xfs, nfs, Shailendra Tripathi

On Thu, 5 Oct 2006, David Chinner wrote:

> On Tue, Oct 03, 2006 at 09:39:55AM -0400, Stephane Doyon wrote:
>> Sorry for insisting, but it seems to me there's still a problem in need of
>> fixing: when writing a 5GB file over NFS to an XFS file system and hitting
>> ENOSPC, it takes on the order of 22hours before my application gets an
>> error, whereas it would normally take about 2minutes if the file system
>> did not become full.
>>
>> Perhaps I was being a bit too "constructive" and drowned my point in
>> explanations and proposed workarounds... You are telling me that neither
>> NFS nor XFS is doing anything wrong, and I can understand your points of
>> view, but surely that behavior isn't considered acceptable?
>
> I agree that this a little extreme and I can't recall of seeing
> anything like this before, but I can see how that may happen if the
> NFS client continues to try to write every dirty page after getting
> an ENOSPC and each one of those writes has to wait for 500ms.
>
> However, you did not mention what kernel version you are running.
> One recent bug (introduced by a fix for deadlocks at ENOSPC) could
> allow oversubscription of free space to occur in XFS, resulting in

I do have that fix in my kernel. (I'm the one who pointed you to the patch 
that introduced that particular problem.)

> the write being allowed to proceed (i.e. sufficient space for the
> data blocks) but then failing the allocation because there weren't
> enough blocks put aside for potential btree splits that occur during
> allocation. If the linux client is using sync writes on retry, then

The writes from nfsd shouldn't be sync. Technically it's not even 
retrying, just plowing on...

> this would trigger a 500ms sleep on every write.  That's the right
> sort of ballpark for the slowness you were seeing - 5GB / 32k * 0.5s
> = ~22 hours....
>
> This got fixed in 2.6.18-rc6 -

You mean commit 4be536debe3f7b0c right? (Actually -rc7 I believe...) I do 
have that one in my kernel. My kernel is 2.6.17 plus assorted XFS fixes.

> can you retry with a 2.6.18 server
> and see if your problem goes away?

Unfortunately it will be several days before I have a chance to do that.

The backtrace looked like this:

... nfsd_write nfsd_vfs_write vfs_writev do_readv_writev xfs_file_writev 
xfs_write generic_file_buffered_write xfs_get_blocks __xfs_get_blocks 
xfs_bmap xfs_iomap xfs_iomap_write_delay xfs_flush_space xfs_flush_device 
schedule_timeout_uninterruptible.

with a 500ms sleep in xfs_flush_device().

Thanks

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2006-10-05 16:33             ` Stephane Doyon
  0 siblings, 0 replies; 129+ messages in thread
From: Stephane Doyon @ 2006-10-05 16:33 UTC (permalink / raw)
  To: David Chinner; +Cc: xfs, nfs, Shailendra Tripathi, Trond Myklebust

On Thu, 5 Oct 2006, David Chinner wrote:

> On Tue, Oct 03, 2006 at 09:39:55AM -0400, Stephane Doyon wrote:
>> Sorry for insisting, but it seems to me there's still a problem in need of
>> fixing: when writing a 5GB file over NFS to an XFS file system and hitting
>> ENOSPC, it takes on the order of 22hours before my application gets an
>> error, whereas it would normally take about 2minutes if the file system
>> did not become full.
>>
>> Perhaps I was being a bit too "constructive" and drowned my point in
>> explanations and proposed workarounds... You are telling me that neither
>> NFS nor XFS is doing anything wrong, and I can understand your points of
>> view, but surely that behavior isn't considered acceptable?
>
> I agree that this a little extreme and I can't recall of seeing
> anything like this before, but I can see how that may happen if the
> NFS client continues to try to write every dirty page after getting
> an ENOSPC and each one of those writes has to wait for 500ms.
>
> However, you did not mention what kernel version you are running.
> One recent bug (introduced by a fix for deadlocks at ENOSPC) could
> allow oversubscription of free space to occur in XFS, resulting in

I do have that fix in my kernel. (I'm the one who pointed you to the patch 
that introduced that particular problem.)

> the write being allowed to proceed (i.e. sufficient space for the
> data blocks) but then failing the allocation because there weren't
> enough blocks put aside for potential btree splits that occur during
> allocation. If the linux client is using sync writes on retry, then

The writes from nfsd shouldn't be sync. Technically it's not even 
retrying, just plowing on...

> this would trigger a 500ms sleep on every write.  That's the right
> sort of ballpark for the slowness you were seeing - 5GB / 32k * 0.5s
> = ~22 hours....
>
> This got fixed in 2.6.18-rc6 -

You mean commit 4be536debe3f7b0c right? (Actually -rc7 I believe...) I do 
have that one in my kernel. My kernel is 2.6.17 plus assorted XFS fixes.

> can you retry with a 2.6.18 server
> and see if your problem goes away?

Unfortunately it will be several days before I have a chance to do that.

The backtrace looked like this:

... nfsd_write nfsd_vfs_write vfs_writev do_readv_writev xfs_file_writev 
xfs_write generic_file_buffered_write xfs_get_blocks __xfs_get_blocks 
xfs_bmap xfs_iomap xfs_iomap_write_delay xfs_flush_space xfs_flush_device 
schedule_timeout_uninterruptible.

with a 500ms sleep in xfs_flush_device().

Thanks

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2006-10-03 16:40           ` Trond Myklebust
@ 2006-10-05 15:39             ` Stephane Doyon
  -1 siblings, 0 replies; 129+ messages in thread
From: Stephane Doyon @ 2006-10-05 15:39 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: David Chinner, xfs, nfs, Shailendra Tripathi

On Tue, 3 Oct 2006, Trond Myklebust wrote:

> On Tue, 2006-10-03 at 09:39 -0400, Stephane Doyon wrote:
>> Sorry for insisting, but it seems to me there's still a problem in need of
>> fixing: when writing a 5GB file over NFS to an XFS file system and hitting
>> ENOSPC, it takes on the order of 22hours before my application gets an
>> error, whereas it would normally take about 2minutes if the file system
>> did not become full.
>>
>> Perhaps I was being a bit too "constructive" and drowned my point in
>> explanations and proposed workarounds... You are telling me that neither
>> NFS nor XFS is doing anything wrong, and I can understand your points of
>> view, but surely that behavior isn't considered acceptable?
>
> Sure it is.

If you say so :-).

> You are allowing the kernel to cache 5GB, and that means you
> only get the error message when close() completes.

But it's not actually caching the entire 5GB at once... I guess you're 
saying that doesn't matter...?

> If you want faster error reporting, there are modes like O_SYNC,
> O_DIRECT, that will attempt to flush the data more quickly. In addition,
> you can force flushing using fsync().

What if the program is a standard utility like cp?

> Finally, you can tweak the VM into
> flushing more often using /proc/sys/vm.

It doesn't look to me like a question of degrees about how early to flush. 
Actually my client can't possibly be caching all of 5GB, it doesn't have 
the RAM or swap for that. Tracing it more carefully, it appears dirty data 
starts being flushed after a few hundred MBs. No error is returned on the 
subsequent writes, only on the final close(). I see some of the write() 
calls are delayed, presumably when the machine reaches the dirty 
threshold. So I don't see how the vm settings can help in this case.

I hadn't realized that the issue isn't just with the final flush on 
close(). It's actually been flushing all along, delaying some of the 
subsequent write()s, getting NOSPC errors but not reporting them until the 
end.

I understand that since my application did not request any syncing, the 
system cannot guarantee to report errors until cached data has been 
flushed. But some data has indeed been flushed with an error; can't this 
be reported earlier than on close?

Would it be incorrect for a subsequent write to return the error that 
occurred while flushing data from previous writes? Then the app could 
decide whether to continue and retry or not. But I guess I can see how 
that might get convoluted.

Thanks for your patience,

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2006-10-05 15:39             ` Stephane Doyon
  0 siblings, 0 replies; 129+ messages in thread
From: Stephane Doyon @ 2006-10-05 15:39 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: David Chinner, nfs, Shailendra Tripathi, xfs

On Tue, 3 Oct 2006, Trond Myklebust wrote:

> On Tue, 2006-10-03 at 09:39 -0400, Stephane Doyon wrote:
>> Sorry for insisting, but it seems to me there's still a problem in need of
>> fixing: when writing a 5GB file over NFS to an XFS file system and hitting
>> ENOSPC, it takes on the order of 22hours before my application gets an
>> error, whereas it would normally take about 2minutes if the file system
>> did not become full.
>>
>> Perhaps I was being a bit too "constructive" and drowned my point in
>> explanations and proposed workarounds... You are telling me that neither
>> NFS nor XFS is doing anything wrong, and I can understand your points of
>> view, but surely that behavior isn't considered acceptable?
>
> Sure it is.

If you say so :-).

> You are allowing the kernel to cache 5GB, and that means you
> only get the error message when close() completes.

But it's not actually caching the entire 5GB at once... I guess you're 
saying that doesn't matter...?

> If you want faster error reporting, there are modes like O_SYNC,
> O_DIRECT, that will attempt to flush the data more quickly. In addition,
> you can force flushing using fsync().

What if the program is a standard utility like cp?

> Finally, you can tweak the VM into
> flushing more often using /proc/sys/vm.

It doesn't look to me like a question of degrees about how early to flush. 
Actually my client can't possibly be caching all of 5GB, it doesn't have 
the RAM or swap for that. Tracing it more carefully, it appears dirty data 
starts being flushed after a few hundred MBs. No error is returned on the 
subsequent writes, only on the final close(). I see some of the write() 
calls are delayed, presumably when the machine reaches the dirty 
threshold. So I don't see how the vm settings can help in this case.

I hadn't realized that the issue isn't just with the final flush on 
close(). It's actually been flushing all along, delaying some of the 
subsequent write()s, getting NOSPC errors but not reporting them until the 
end.

I understand that since my application did not request any syncing, the 
system cannot guarantee to report errors until cached data has been 
flushed. But some data has indeed been flushed with an error; can't this 
be reported earlier than on close?

Would it be incorrect for a subsequent write to return the error that 
occurred while flushing data from previous writes? Then the app could 
decide whether to continue and retry or not. But I guess I can see how 
that might get convoluted.

Thanks for your patience,

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2006-10-03 13:39         ` Stephane Doyon
@ 2006-10-05  8:30           ` David Chinner
  -1 siblings, 0 replies; 129+ messages in thread
From: David Chinner @ 2006-10-05  8:30 UTC (permalink / raw)
  To: Stephane Doyon
  Cc: Trond Myklebust, David Chinner, xfs, nfs, Shailendra Tripathi

On Tue, Oct 03, 2006 at 09:39:55AM -0400, Stephane Doyon wrote:
> Sorry for insisting, but it seems to me there's still a problem in need of 
> fixing: when writing a 5GB file over NFS to an XFS file system and hitting 
> ENOSPC, it takes on the order of 22hours before my application gets an 
> error, whereas it would normally take about 2minutes if the file system 
> did not become full.
>
> Perhaps I was being a bit too "constructive" and drowned my point in 
> explanations and proposed workarounds... You are telling me that neither 
> NFS nor XFS is doing anything wrong, and I can understand your points of 
> view, but surely that behavior isn't considered acceptable?

I agree that this a little extreme and I can't recall of seeing
anything like this before, but I can see how that may happen if the
NFS client continues to try to write every dirty page after getting
an ENOSPC and each one of those writes has to wait for 500ms.

However, you did not mention what kernel version you are running.
One recent bug (introduced by a fix for deadlocks at ENOSPC) could
allow oversubscription of free space to occur in XFS, resulting in
the write being allowed to proceed (i.e. sufficient space for the
data blocks) but then failing the allocation because there weren't
enough blocks put aside for potential btree splits that occur during
allocation. If the linux client is using sync writes on retry, then
this would trigger a 500ms sleep on every write.  That's the right
sort of ballpark for the slowness you were seeing - 5GB / 32k * 0.5s
= ~22 hours....

This got fixed in 2.6.18-rc6 - can you retry with a 2.6.18 server
and see if your problem goes away?

Cheers,

Dave.

-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2006-10-05  8:30           ` David Chinner
  0 siblings, 0 replies; 129+ messages in thread
From: David Chinner @ 2006-10-05  8:30 UTC (permalink / raw)
  To: Stephane Doyon
  Cc: xfs, David Chinner, nfs, Shailendra Tripathi, Trond Myklebust

On Tue, Oct 03, 2006 at 09:39:55AM -0400, Stephane Doyon wrote:
> Sorry for insisting, but it seems to me there's still a problem in need of 
> fixing: when writing a 5GB file over NFS to an XFS file system and hitting 
> ENOSPC, it takes on the order of 22hours before my application gets an 
> error, whereas it would normally take about 2minutes if the file system 
> did not become full.
>
> Perhaps I was being a bit too "constructive" and drowned my point in 
> explanations and proposed workarounds... You are telling me that neither 
> NFS nor XFS is doing anything wrong, and I can understand your points of 
> view, but surely that behavior isn't considered acceptable?

I agree that this a little extreme and I can't recall of seeing
anything like this before, but I can see how that may happen if the
NFS client continues to try to write every dirty page after getting
an ENOSPC and each one of those writes has to wait for 500ms.

However, you did not mention what kernel version you are running.
One recent bug (introduced by a fix for deadlocks at ENOSPC) could
allow oversubscription of free space to occur in XFS, resulting in
the write being allowed to proceed (i.e. sufficient space for the
data blocks) but then failing the allocation because there weren't
enough blocks put aside for potential btree splits that occur during
allocation. If the linux client is using sync writes on retry, then
this would trigger a 500ms sleep on every write.  That's the right
sort of ballpark for the slowness you were seeing - 5GB / 32k * 0.5s
= ~22 hours....

This got fixed in 2.6.18-rc6 - can you retry with a 2.6.18 server
and see if your problem goes away?

Cheers,

Dave.

-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2006-10-03 13:39         ` Stephane Doyon
@ 2006-10-03 16:40           ` Trond Myklebust
  -1 siblings, 0 replies; 129+ messages in thread
From: Trond Myklebust @ 2006-10-03 16:40 UTC (permalink / raw)
  To: Stephane Doyon; +Cc: David Chinner, xfs, nfs, Shailendra Tripathi

On Tue, 2006-10-03 at 09:39 -0400, Stephane Doyon wrote:
> Sorry for insisting, but it seems to me there's still a problem in need of 
> fixing: when writing a 5GB file over NFS to an XFS file system and hitting 
> ENOSPC, it takes on the order of 22hours before my application gets an 
> error, whereas it would normally take about 2minutes if the file system 
> did not become full.
> 
> Perhaps I was being a bit too "constructive" and drowned my point in 
> explanations and proposed workarounds... You are telling me that neither 
> NFS nor XFS is doing anything wrong, and I can understand your points of 
> view, but surely that behavior isn't considered acceptable?

Sure it is. You are allowing the kernel to cache 5GB, and that means you
only get the error message when close() completes.

If you want faster error reporting, there are modes like O_SYNC,
O_DIRECT, that will attempt to flush the data more quickly. In addition,
you can force flushing using fsync(). Finally, you can tweak the VM into
flushing more often using /proc/sys/vm.

Cheers,
 Trond

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2006-10-03 16:40           ` Trond Myklebust
  0 siblings, 0 replies; 129+ messages in thread
From: Trond Myklebust @ 2006-10-03 16:40 UTC (permalink / raw)
  To: Stephane Doyon; +Cc: David Chinner, nfs, Shailendra Tripathi, xfs

On Tue, 2006-10-03 at 09:39 -0400, Stephane Doyon wrote:
> Sorry for insisting, but it seems to me there's still a problem in need of 
> fixing: when writing a 5GB file over NFS to an XFS file system and hitting 
> ENOSPC, it takes on the order of 22hours before my application gets an 
> error, whereas it would normally take about 2minutes if the file system 
> did not become full.
> 
> Perhaps I was being a bit too "constructive" and drowned my point in 
> explanations and proposed workarounds... You are telling me that neither 
> NFS nor XFS is doing anything wrong, and I can understand your points of 
> view, but surely that behavior isn't considered acceptable?

Sure it is. You are allowing the kernel to cache 5GB, and that means you
only get the error message when close() completes.

If you want faster error reporting, there are modes like O_SYNC,
O_DIRECT, that will attempt to flush the data more quickly. In addition,
you can force flushing using fsync(). Finally, you can tweak the VM into
flushing more often using /proc/sys/vm.

Cheers,
 Trond


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2006-10-02 22:30     ` David Chinner
@ 2006-10-03 13:39         ` Stephane Doyon
  0 siblings, 0 replies; 129+ messages in thread
From: Stephane Doyon @ 2006-10-03 13:39 UTC (permalink / raw)
  To: Trond Myklebust, David Chinner; +Cc: xfs, nfs, Shailendra Tripathi

Sorry for insisting, but it seems to me there's still a problem in need of 
fixing: when writing a 5GB file over NFS to an XFS file system and hitting 
ENOSPC, it takes on the order of 22hours before my application gets an 
error, whereas it would normally take about 2minutes if the file system 
did not become full.

Perhaps I was being a bit too "constructive" and drowned my point in 
explanations and proposed workarounds... You are telling me that neither 
NFS nor XFS is doing anything wrong, and I can understand your points of 
view, but surely that behavior isn't considered acceptable?

On Tue, 26 Sep 2006, Trond Myklebust wrote:

> On Tue, 2006-09-26 at 16:05 -0400, Stephane Doyon wrote:
>> I suppose it's not technically wrong to try to flush all the pages of the
>> file, but if the server file system is full then it will be at its worse.
>> Also if you happened to be on a slower link and have a big cache to flush,
>> you're waiting around for very little gain.
>
> That all assumes that nobody fixes the problem on the server. If
> somebody notices, and actually removes an unused file, then you may be
> happy that the kernel preserved the last 80% of the apache log file that
> was being written out.
>
> ENOSPC is a transient error: that is why the current behaviour exists.

On Tue, 3 Oct 2006, David Chinner wrote:

> This deep in the XFS allocation functions, we cannot tell if we hold
> the i_mutex or not, and it plays no part in determining if we have
> space or not. Hence we don't touch it here.


> I doubt it's a good idea for an NFS server, either.
[...]
> Remember that XFS, like most filesystems, trades off speed for
> correctness as we approach ENOSPC. Many parts of XFS slow down as we
> approach ENOSPC, and this is just one example of where we need to be
> correct, not fast.
[...]
> IMO, this is a non-problem.  You're talking about optimising a
> relatively rare corner case where correctness is more important than
> speed and your test case is highly artificial. AFAIC, if you are
> running at ENOSPC then you get what performance is appropriate for
> correctness and if you are continually runing at ENOSPC, then buy
> some more disks.....

My recipe to reproduce the problem locally is admittedly somewhat 
artificial, but the problematic usage definitely isn't: simply an app on 
an NFS client that happens to fill up a file system. There must be some 
way to handle this better.

Thanks

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
@ 2006-10-03 13:39         ` Stephane Doyon
  0 siblings, 0 replies; 129+ messages in thread
From: Stephane Doyon @ 2006-10-03 13:39 UTC (permalink / raw)
  To: Trond Myklebust, David Chinner; +Cc: nfs, Shailendra Tripathi, xfs

Sorry for insisting, but it seems to me there's still a problem in need of 
fixing: when writing a 5GB file over NFS to an XFS file system and hitting 
ENOSPC, it takes on the order of 22hours before my application gets an 
error, whereas it would normally take about 2minutes if the file system 
did not become full.

Perhaps I was being a bit too "constructive" and drowned my point in 
explanations and proposed workarounds... You are telling me that neither 
NFS nor XFS is doing anything wrong, and I can understand your points of 
view, but surely that behavior isn't considered acceptable?

On Tue, 26 Sep 2006, Trond Myklebust wrote:

> On Tue, 2006-09-26 at 16:05 -0400, Stephane Doyon wrote:
>> I suppose it's not technically wrong to try to flush all the pages of the
>> file, but if the server file system is full then it will be at its worse.
>> Also if you happened to be on a slower link and have a big cache to flush,
>> you're waiting around for very little gain.
>
> That all assumes that nobody fixes the problem on the server. If
> somebody notices, and actually removes an unused file, then you may be
> happy that the kernel preserved the last 80% of the apache log file that
> was being written out.
>
> ENOSPC is a transient error: that is why the current behaviour exists.

On Tue, 3 Oct 2006, David Chinner wrote:

> This deep in the XFS allocation functions, we cannot tell if we hold
> the i_mutex or not, and it plays no part in determining if we have
> space or not. Hence we don't touch it here.


> I doubt it's a good idea for an NFS server, either.
[...]
> Remember that XFS, like most filesystems, trades off speed for
> correctness as we approach ENOSPC. Many parts of XFS slow down as we
> approach ENOSPC, and this is just one example of where we need to be
> correct, not fast.
[...]
> IMO, this is a non-problem.  You're talking about optimising a
> relatively rare corner case where correctness is more important than
> speed and your test case is highly artificial. AFAIC, if you are
> running at ENOSPC then you get what performance is appropriate for
> correctness and if you are continually runing at ENOSPC, then buy
> some more disks.....

My recipe to reproduce the problem locally is admittedly somewhat 
artificial, but the problematic usage definitely isn't: simply an app on 
an NFS client that happens to fill up a file system. There must be some 
way to handle this better.

Thanks


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2006-04-11 20:30   ` Greg KH
  2006-04-11 23:46     ` Jan Engelhardt
@ 2006-04-12  0:36     ` Nix
  1 sibling, 0 replies; 129+ messages in thread
From: Nix @ 2006-04-12  0:36 UTC (permalink / raw)
  To: Greg KH; +Cc: Jan Engelhardt, linux-kernel, stable, torvalds

On 11 Apr 2006, Greg KH whispered secretively:
> The first one went out last night, as it was a real issue that affected
> people and I had already waited longer than I felt comfortable with, due
> to travel issues I had (two different talks in two different cities in
> two different days.)
> 
> The second one went out today, because it was reported today.  Should I
> have waited until tomorrow to see if something else came up?

Indeed.

On top of that, they're `only' local DoSes, so many admins (i.e. those
without untrusted local users) will probably not have installed .3 yet:
and anyone with untrusted local users probably has someone whose entire
job is handling security anyway.


There's nothing wrong with rapid-fire -stables; either the issue is (in
the judgement of the ones doing the installation) critical, in which
case it should get out as fast as possible, or it isn't, in which case
the local installing admins can put it off for a day or so themselves to
see if another release comes out immediately afterwards.

-- 
`On a scale of 1-10, X's "brokenness rating" is 1.1, but that's only
 because bringing Windows into the picture rescaled "brokenness" by
 a factor of 10.' --- Peter da Silva

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2006-04-11 20:30   ` Greg KH
@ 2006-04-11 23:46     ` Jan Engelhardt
  2006-04-12  0:36     ` Nix
  1 sibling, 0 replies; 129+ messages in thread
From: Jan Engelhardt @ 2006-04-11 23:46 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, stable, torvalds

>
>The first one went out last night, as it was a real issue that affected
>people and I had already waited longer than I felt comfortable with, due
>to travel issues I had (two different talks in two different cities in
>two different days.)
>
>The second one went out today, because it was reported today.  Should I
>have waited until tomorrow to see if something else came up?
>
No of course not, I did not know the first one was already due long time.


[Sigh, pine changed the subject header and I did not notice.]


Jan Engelhardt
-- 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2006-04-11 19:04 ` several messages Jan Engelhardt
  2006-04-11 19:20   ` Boris B. Zhmurov
@ 2006-04-11 20:30   ` Greg KH
  2006-04-11 23:46     ` Jan Engelhardt
  2006-04-12  0:36     ` Nix
  1 sibling, 2 replies; 129+ messages in thread
From: Greg KH @ 2006-04-11 20:30 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: linux-kernel, stable, torvalds

On Tue, Apr 11, 2006 at 09:04:42PM +0200, Jan Engelhardt wrote:
> 
> >Date: Tue, 11 Apr 2006 09:26:20 -0700
> >Subject: Linux 2.6.16.3
> >David Howells:
> >      Keys: Fix oops when adding key to non-keyring [CVE-2006-1522]
> 
> >Date: Tue, 11 Apr 2006 10:33:23 -0700
> >Subject: Linux 2.6.16.4
> >Oleg Nesterov:
> >      RCU signal handling [CVE-2006-1523]
> 
> Now admins spend another hour this day just to upgrade.
> These two patches could have been queued until the end of the day. Maybe 
> another one turns up soon.

The first one went out last night, as it was a real issue that affected
people and I had already waited longer than I felt comfortable with, due
to travel issues I had (two different talks in two different cities in
two different days.)

The second one went out today, because it was reported today.  Should I
have waited until tomorrow to see if something else came up?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2006-04-11 19:04 ` several messages Jan Engelhardt
@ 2006-04-11 19:20   ` Boris B. Zhmurov
  2006-04-11 20:30   ` Greg KH
  1 sibling, 0 replies; 129+ messages in thread
From: Boris B. Zhmurov @ 2006-04-11 19:20 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Greg KH, linux-kernel, stable, torvalds

Hello, Jan Engelhardt.

On 11.04.2006 23:04 you said the following:


> Now admins spend another hour this day just to upgrade.

It's admin's job, isn't it?


> These two patches could have been queued until the end of the day. Maybe 
> another one turns up soon.
> Jan Engelhardt

Hmm... Interesting. Are you blaming security officers for doing their 
job? Please, don't! And many many thanks to Greg for giving us security 
patches as soon as possible.


-- 
Boris B. Zhmurov
mailto: bb@kernelpanic.ru
"wget http://kernelpanic.ru/bb_public_key.pgp -O - | gpg --import"


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2006-04-11 17:33 Linux 2.6.16.4 Greg KH
@ 2006-04-11 19:04 ` Jan Engelhardt
  2006-04-11 19:20   ` Boris B. Zhmurov
  2006-04-11 20:30   ` Greg KH
  0 siblings, 2 replies; 129+ messages in thread
From: Jan Engelhardt @ 2006-04-11 19:04 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, stable, torvalds


>Date: Tue, 11 Apr 2006 09:26:20 -0700
>Subject: Linux 2.6.16.3
>David Howells:
>      Keys: Fix oops when adding key to non-keyring [CVE-2006-1522]

>Date: Tue, 11 Apr 2006 10:33:23 -0700
>Subject: Linux 2.6.16.4
>Oleg Nesterov:
>      RCU signal handling [CVE-2006-1523]

Now admins spend another hour this day just to upgrade.
These two patches could have been queued until the end of the day. Maybe 
another one turns up soon.


Jan Engelhardt
-- 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2005-05-13 15:07                       ` Christoph Hellwig
@ 2005-05-13 15:38                         ` Dmitry Yusupov
  0 siblings, 0 replies; 129+ messages in thread
From: Dmitry Yusupov @ 2005-05-13 15:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: mingz, Guennadi Liakhovetski, FUJITA Tomonori,
	Vladislav Bolkhovitin, iet-dev, linux-scsi, Sander, David Hollis,
	Maciej Soltysiak, linux-kernel

On Fri, 2005-05-13 at 16:07 +0100, Christoph Hellwig wrote:
> On Fri, May 13, 2005 at 08:04:16AM -0700, Dmitry Yusupov wrote:
> > You could tell this to school's computer class teacher... Serious SAN
> > deployment will always be based either on FC or iSCSI for the reasons I
> > explained before.
> 
> Just FYI Steeleye ships a very successful clustering product that builds
> on nbd.

AFAIK, it is used for Data Replication purposes only. Correct me if I'm
wrong...



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2005-05-13 15:04                     ` Dmitry Yusupov
@ 2005-05-13 15:07                       ` Christoph Hellwig
  2005-05-13 15:38                         ` Dmitry Yusupov
  0 siblings, 1 reply; 129+ messages in thread
From: Christoph Hellwig @ 2005-05-13 15:07 UTC (permalink / raw)
  To: Dmitry Yusupov
  Cc: Christoph Hellwig, mingz, Guennadi Liakhovetski, FUJITA Tomonori,
	Vladislav Bolkhovitin, iet-dev, linux-scsi, Sander, David Hollis,
	Maciej Soltysiak, linux-kernel

On Fri, May 13, 2005 at 08:04:16AM -0700, Dmitry Yusupov wrote:
> You could tell this to school's computer class teacher... Serious SAN
> deployment will always be based either on FC or iSCSI for the reasons I
> explained before.

Just FYI Steeleye ships a very successful clustering product that builds
on nbd.


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2005-05-13  8:12                   ` Christoph Hellwig
@ 2005-05-13 15:04                     ` Dmitry Yusupov
  2005-05-13 15:07                       ` Christoph Hellwig
  0 siblings, 1 reply; 129+ messages in thread
From: Dmitry Yusupov @ 2005-05-13 15:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: mingz, Guennadi Liakhovetski, FUJITA Tomonori,
	Vladislav Bolkhovitin, iet-dev, linux-scsi, Sander, David Hollis,
	Maciej Soltysiak, linux-kernel

On Fri, 2005-05-13 at 09:12 +0100, Christoph Hellwig wrote:
> On Thu, May 12, 2005 at 11:32:12AM -0700, Dmitry Yusupov wrote:
> > On Wed, 2005-05-11 at 22:16 -0400, Ming Zhang wrote:
> > > iscsi is scsi over ip.
> > 
> > correction. iSCSI today has RFC at least for two transports - TCP/IP and
> > iSER/RDMA(in finalized progress) with RDMA over Infiniband or RNIC. And
> > I think people start writing initial draft for SCTP/IP transport...
> > 
> > >From this perspective, iSCSI certainly more advanced and matured
> > comparing to NBD variations. 
> 
> It's for certainly much more complicated (in marketing speak that's usually
> called advanced) but far less mature.
> 
> If you want network storage to just work use nbd.

You could tell this to school's computer class teacher... Serious SAN
deployment will always be based either on FC or iSCSI for the reasons I
explained before.

I do not disagree, nbd is nice and simple and for sure has its own
deployment space.

Dmitry


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2005-05-12 18:32                 ` Dmitry Yusupov
@ 2005-05-13  8:12                   ` Christoph Hellwig
  2005-05-13 15:04                     ` Dmitry Yusupov
  0 siblings, 1 reply; 129+ messages in thread
From: Christoph Hellwig @ 2005-05-13  8:12 UTC (permalink / raw)
  To: Dmitry Yusupov
  Cc: mingz, Guennadi Liakhovetski, FUJITA Tomonori,
	Vladislav Bolkhovitin, iet-dev, linux-scsi, Sander, David Hollis,
	Maciej Soltysiak, linux-kernel

On Thu, May 12, 2005 at 11:32:12AM -0700, Dmitry Yusupov wrote:
> On Wed, 2005-05-11 at 22:16 -0400, Ming Zhang wrote:
> > iscsi is scsi over ip.
> 
> correction. iSCSI today has RFC at least for two transports - TCP/IP and
> iSER/RDMA(in finalized progress) with RDMA over Infiniband or RNIC. And
> I think people start writing initial draft for SCTP/IP transport...
> 
> >From this perspective, iSCSI certainly more advanced and matured
> comparing to NBD variations. 

It's for certainly much more complicated (in marketing speak that's usually
called advanced) but far less mature.

If you want network storage to just work use nbd.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2005-05-12  2:16               ` Ming Zhang
@ 2005-05-12 18:32                 ` Dmitry Yusupov
  2005-05-13  8:12                   ` Christoph Hellwig
  0 siblings, 1 reply; 129+ messages in thread
From: Dmitry Yusupov @ 2005-05-12 18:32 UTC (permalink / raw)
  To: mingz
  Cc: Guennadi Liakhovetski, FUJITA Tomonori, Vladislav Bolkhovitin,
	iet-dev, linux-scsi, Sander, David Hollis, Maciej Soltysiak,
	linux-kernel

On Wed, 2005-05-11 at 22:16 -0400, Ming Zhang wrote:
> iscsi is scsi over ip.

correction. iSCSI today has RFC at least for two transports - TCP/IP and
iSER/RDMA(in finalized progress) with RDMA over Infiniband or RNIC. And
I think people start writing initial draft for SCTP/IP transport...

>From this perspective, iSCSI certainly more advanced and matured
comparing to NBD variations. 

> usb disk is scsi over usb.
> so just a different transport.
> u are rite. ;)
> 
> ming
> 
> On Wed, 2005-05-11 at 23:26 +0200, Guennadi Liakhovetski wrote:
> > Hello and thanks for the replies
> > 
> > On Wed, 11 May 2005, FUJITA Tomonori wrote:
> > > The iSCSI protocol simply encapsulates the SCSI protocol into the
> > > TCP/IP protocol, and carries packets over IP networks. You can handle
> > ...
> > 
> > On Wed, 11 May 2005, Vladislav Bolkhovitin wrote:
> > > Actually, this is property not of iSCSI target itself, but of any SCSI target.
> > > So, we implemented it as part of our SCSI target mid-level (SCST,
> > > http://scst.sourceforge.net), therefore any target driver working over it will
> > > automatically benefit from this feature. Unfortunately, currently available
> > > only target drivers for Qlogic 2x00 cards and for poor UNH iSCSI target (that
> > > works not too reliable and only with very specific initiators). The published
> > ...
> > 
> > The above confirms basically my understanding apart from one "minor" 
> > confusion - I thought, that parallel to hardware solutions pure software 
> > implementations were possible / being developed, like a driver, that 
> > implements a SCSI LDD API on one side, and forwards packets to an IP 
> > stack, say, over an ethernet card - on the initiator side. And a counter 
> > part on the target side. Similarly to the USB mass-storage and storage 
> > gadget drivers?
> > 
> > Thanks
> > Guennadi
> > ---
> > Guennadi Liakhovetski
> > 
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2005-05-11 21:26             ` several messages Guennadi Liakhovetski
  2005-05-12  2:16               ` Ming Zhang
@ 2005-05-12 10:17               ` Vladislav Bolkhovitin
  1 sibling, 0 replies; 129+ messages in thread
From: Vladislav Bolkhovitin @ 2005-05-12 10:17 UTC (permalink / raw)
  To: Guennadi Liakhovetski
  Cc: FUJITA Tomonori, iscsitarget-devel, linux-scsi, dmitry_yus,
	Sander, David Hollis, Maciej Soltysiak, linux-kernel

Guennadi Liakhovetski wrote:
> Hello and thanks for the replies
> 
> On Wed, 11 May 2005, FUJITA Tomonori wrote:
> 
>>The iSCSI protocol simply encapsulates the SCSI protocol into the
>>TCP/IP protocol, and carries packets over IP networks. You can handle
> 
> ...
> 
> On Wed, 11 May 2005, Vladislav Bolkhovitin wrote:
> 
>>Actually, this is property not of iSCSI target itself, but of any SCSI target.
>>So, we implemented it as part of our SCSI target mid-level (SCST,
>>http://scst.sourceforge.net), therefore any target driver working over it will
>>automatically benefit from this feature. Unfortunately, currently available
>>only target drivers for Qlogic 2x00 cards and for poor UNH iSCSI target (that
>>works not too reliable and only with very specific initiators). The published
> 
> ...
> 
> The above confirms basically my understanding apart from one "minor" 
> confusion - I thought, that parallel to hardware solutions pure software 
> implementations were possible / being developed, like a driver, that 
> implements a SCSI LDD API on one side, and forwards packets to an IP 
> stack, say, over an ethernet card - on the initiator side. And a counter 
> part on the target side. Similarly to the USB mass-storage and storage 
> gadget drivers?

There is some confusion in the SCSI world between SCSI as a transport 
and SCSI as a commands set and software communication protocol, which 
works above the transport. So, you can implement SCSI transport at any 
software (eg iSCSI) or hardware (parallel SCSI, Fibre Channel, SATA, 
etc.) way, but if the SCSI message passing protocol is used overall 
system remains SCSI with all protocol obligations like task management. 
So, pure software SCSI solution is possible. BTW, there are pure 
hardware iSCSI implementations as well.

Vlad

> Thanks
> Guennadi
> ---
> Guennadi Liakhovetski
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
> 


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2005-05-11 21:26             ` several messages Guennadi Liakhovetski
@ 2005-05-12  2:16               ` Ming Zhang
  2005-05-12 18:32                 ` Dmitry Yusupov
  2005-05-12 10:17               ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 129+ messages in thread
From: Ming Zhang @ 2005-05-12  2:16 UTC (permalink / raw)
  To: Guennadi Liakhovetski
  Cc: FUJITA Tomonori, Vladislav Bolkhovitin, iet-dev, linux-scsi,
	Dmitry Yusupov, Sander, David Hollis, Maciej Soltysiak,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1703 bytes --]

iscsi is scsi over ip.
usb disk is scsi over usb.
so just a different transport.
u are rite. ;)

ming

On Wed, 2005-05-11 at 23:26 +0200, Guennadi Liakhovetski wrote:
> Hello and thanks for the replies
> 
> On Wed, 11 May 2005, FUJITA Tomonori wrote:
> > The iSCSI protocol simply encapsulates the SCSI protocol into the
> > TCP/IP protocol, and carries packets over IP networks. You can handle
> ...
> 
> On Wed, 11 May 2005, Vladislav Bolkhovitin wrote:
> > Actually, this is property not of iSCSI target itself, but of any SCSI target.
> > So, we implemented it as part of our SCSI target mid-level (SCST,
> > http://scst.sourceforge.net), therefore any target driver working over it will
> > automatically benefit from this feature. Unfortunately, currently available
> > only target drivers for Qlogic 2x00 cards and for poor UNH iSCSI target (that
> > works not too reliable and only with very specific initiators). The published
> ...
> 
> The above confirms basically my understanding apart from one "minor" 
> confusion - I thought, that parallel to hardware solutions pure software 
> implementations were possible / being developed, like a driver, that 
> implements a SCSI LDD API on one side, and forwards packets to an IP 
> stack, say, over an ethernet card - on the initiator side. And a counter 
> part on the target side. Similarly to the USB mass-storage and storage 
> gadget drivers?
> 
> Thanks
> Guennadi
> ---
> Guennadi Liakhovetski
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2005-05-11  8:56           ` Vladislav Bolkhovitin
@ 2005-05-11 21:26             ` Guennadi Liakhovetski
  2005-05-12  2:16               ` Ming Zhang
  2005-05-12 10:17               ` Vladislav Bolkhovitin
  0 siblings, 2 replies; 129+ messages in thread
From: Guennadi Liakhovetski @ 2005-05-11 21:26 UTC (permalink / raw)
  To: FUJITA Tomonori, Vladislav Bolkhovitin
  Cc: iscsitarget-devel, linux-scsi, dmitry_yus, Sander, David Hollis,
	Maciej Soltysiak, linux-kernel

Hello and thanks for the replies

On Wed, 11 May 2005, FUJITA Tomonori wrote:
> The iSCSI protocol simply encapsulates the SCSI protocol into the
> TCP/IP protocol, and carries packets over IP networks. You can handle
...

On Wed, 11 May 2005, Vladislav Bolkhovitin wrote:
> Actually, this is property not of iSCSI target itself, but of any SCSI target.
> So, we implemented it as part of our SCSI target mid-level (SCST,
> http://scst.sourceforge.net), therefore any target driver working over it will
> automatically benefit from this feature. Unfortunately, currently available
> only target drivers for Qlogic 2x00 cards and for poor UNH iSCSI target (that
> works not too reliable and only with very specific initiators). The published
...

The above confirms basically my understanding apart from one "minor" 
confusion - I thought, that parallel to hardware solutions pure software 
implementations were possible / being developed, like a driver, that 
implements a SCSI LDD API on one side, and forwards packets to an IP 
stack, say, over an ethernet card - on the initiator side. And a counter 
part on the target side. Similarly to the USB mass-storage and storage 
gadget drivers?

Thanks
Guennadi
---
Guennadi Liakhovetski


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2004-05-11  8:45 2.6.6-rc3-mm2 (4KSTACK) Helge Hafting
@ 2004-05-11 17:59 ` Bill Davidsen
  0 siblings, 0 replies; 129+ messages in thread
From: Bill Davidsen @ 2004-05-11 17:59 UTC (permalink / raw)
  To: Horst von Brand, Andrew Morton, Helge Hafting; +Cc: Linux Kernel Mailing List

Thank all of you for this information. This is an interesting way to
overcome the kernel memory fragmentation issue. I would have thought it
was more important to address having the memory so fragmented that there
is no 8k chunk left "even with many megabytes free" as someone wrote.


On Mon, 10 May 2004, Horst von Brand wrote:

> Bill Davidsen <davidsen@tmr.com> said:
> 
> [...]
> 
> > I tried 4k stack, I couldn't measure any improvement in anything (as in 
> > no visible speedup or saving in memory).
> 
> 4K stacks lets the kernel create new threads/processes as long as there is
> free memory; with 8K stacks it needs two consecutive free page frames in
> physical memory, when memory is fragmented (and large) they are hard to
> come by...
> -- 
> Dr. Horst H. von Brand                   User #22616 counter.li.org
> Departamento de Informatica                     Fono: +56 32 654431
> Universidad Tecnica Federico Santa Maria              +56 32 654239
> Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513
> 

On Mon, 10 May 2004, Andrew Morton wrote:

> Horst von Brand <vonbrand@inf.utfsm.cl> wrote:
> >
> > Bill Davidsen <davidsen@tmr.com> said:
> > 
> > [...]
> > 
> > > I tried 4k stack, I couldn't measure any improvement in anything (as in 
> > > no visible speedup or saving in memory).
> > 
> > 4K stacks lets the kernel create new threads/processes as long as there is
> > free memory; with 8K stacks it needs two consecutive free page frames in
> > physical memory, when memory is fragmented (and large) they are hard to
> > come by...
> 
> This is true to a surprising extent.  A couple of weeks ago I observed my
> 256MB box freeing over 20MB of pages before it could successfully acquire a
> single 1-order page.
> 
> That was during an updatedb run.
> 
> And a 1-order GFP_NOFS allocation was actually livelocking, because
> !__GFP_FS allocations aren't allowed to enter dentry reclaim.  Which is why
> VFS caches are now forced to use 0-order allocations.
> 
> 

On Tue, 11 May 2004, Helge Hafting wrote:

> Bill Davidsen wrote:
> 
> > Arjan van de Ven wrote:
> >
> >>> It's probably a Bad Idea to push this to Linus before the vendors 
> >>> that have
> >>> significant market-impact issues (again - anybody other than NVidia 
> >>> here?)
> >>> have gotten their stuff cleaned up...
> >>
> >>
> >>
> >> Ok I don't want to start a flamewar but... Do we want to hold linux back
> >> until all binary only module vendors have caught up ??
> >
> >
> > My questions is, hold it back from what? Having the 4k option is fine, 
> > it's just eliminating the current default which I think is 
> > undesirable. I tried 4k stack, I couldn't measure any improvement in 
> > anything (as in no visible speedup or saving in memory). 
> 
> The memory saving is usually modest: 4k per thread. It might make a 
> difference for
> those with many thousands of threads.   I believe this is unswappable 
> memory,
> which is much more valuable than ordinary process memory.
> 
> More interesting is that it removes one way for fork() to fail. With 8k 
> stacks,
> the new process needs to allocate two consecutive pages for those 8k.  That
> might be impossible due to fragmentation, even if there are megabytes of 
> free
> memory. Such a problem usually only shows up after a long time.  Now we 
> only need to
> allocate a single page, which always works as long as there is any free 
> memory at all.
> 
> > For an embedded system, where space is tight and the code paths well 
> > known, sure, but I haven't been able to find or generate any objective 
> > improvement, other than some posts saying smaller is always better. 
> > Nothing slows a system down like a crash, even if it isn't followed by 
> > a restore from backup.
> 
> Consider the case when your server (web/mail/other) fails to fork, and then
> you can't login because that requires fork() too.  4k stacks remove this 
> scenario,
> and is a stability improvement.
> 
> Helge Hafting
> 

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2003-04-22 23:38   ` Rick Lindsley
@ 2003-04-23  9:17     ` Ingo Molnar
  0 siblings, 0 replies; 129+ messages in thread
From: Ingo Molnar @ 2003-04-23  9:17 UTC (permalink / raw)
  To: Rick Lindsley; +Cc: Bill Davidsen, Dave Jones, linux-kernel


On Tue, 22 Apr 2003, Rick Lindsley wrote:

> True.  I have a hunch (and it's only a hunch -- no hard data!) that two
> threads that are sharing the same data will do better if they can be
> located on a physical/sibling processor group.  For workloads where you
> really do have two distinct processes, or even threads but which are
> operating on wholly different portions of data or code, moving them to
> separate physical processors may be warranted.  The key is whether the
> work of one sibling is destroying the cache of another.

If two threads have a workload that wants to be co-scheduled then the SMP
scheduler will do damage to them anyway - independently of any HT
scheduling decisions. One solution for such specific cases is to use the
CPU-binding API to move those threads to the same physical CPU. If there's
some common class of applications where this is the common case, then we
could start thinking about automatic support for them.

	Ingo


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2003-04-22 22:16 ` several messages Bill Davidsen
@ 2003-04-22 23:38   ` Rick Lindsley
  2003-04-23  9:17     ` Ingo Molnar
  0 siblings, 1 reply; 129+ messages in thread
From: Rick Lindsley @ 2003-04-22 23:38 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Dave Jones, Ingo Molnar, linux-kernel

    Have you done any tests with a threaded process running on a single CPU in
    the siblings? If they are sharing data and locks in the same cache it's
    not obvious (to me at least) that it would be faster in two CPUs having to
    do updates. That's a question, not an implication that it is significantly
    better in just one, a threaded program with only two threads is not as
    likely to be doing the same thing in both, perhaps.

True.  I have a hunch (and it's only a hunch -- no hard data!) that
two threads that are sharing the same data will do better if they can
be located on a physical/sibling processor group.  For workloads where
you really do have two distinct processes, or even threads but which are
operating on wholly different portions of data or code, moving them to
separate physical processors may be warranted.  The key is whether the
work of one sibling is destroying the cache of another.

Rick

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2003-04-22 10:34 [patch] HT scheduler, sched-2.5.68-A9 Ingo Molnar
@ 2003-04-22 22:16 ` Bill Davidsen
  2003-04-22 23:38   ` Rick Lindsley
  0 siblings, 1 reply; 129+ messages in thread
From: Bill Davidsen @ 2003-04-22 22:16 UTC (permalink / raw)
  To: Dave Jones, Ingo Molnar; +Cc: Rick Lindsley, linux-kernel

On Tue, 22 Apr 2003, Dave Jones wrote:

> On Mon, Apr 21, 2003 at 03:13:31PM -0400, Ingo Molnar wrote:
> 
>  > +/*
>  > + * Is there a way to do this via Kconfig?
>  > + */
>  > +#if CONFIG_NR_SIBLINGS_2
>  > +# define CONFIG_NR_SIBLINGS 2
>  > +#elif CONFIG_NR_SIBLINGS_4
>  > +# define CONFIG_NR_SIBLINGS 4
>  > +#else
>  > +# define CONFIG_NR_SIBLINGS 0
>  > +#endif
>  > +
> 
> Maybe this would be better resolved at runtime ?
> With the above patch, you'd need three seperate kernel images
> to run optimally on a system in each of the cases.
> The 'vendor kernel' scenario here looks ugly to me.
> 
>  > +#if CONFIG_NR_SIBLINGS
>  > +# define CONFIG_SHARE_RUNQUEUE 1
>  > +#else
>  > +# define CONFIG_SHARE_RUNQUEUE 0
>  > +#endif
> 
> And why can't this just be a
> 
> 	if (ht_enabled)
> 		shared_runqueue = 1;
> 
> Dumping all this into the config system seems to be the
> wrong direction IMHO. The myriad of runtime knobs in the
> scheduler already is bad enough, without introducing
> compile time ones as well.

May I add my "I don't understand this, either" at this point? It seems
desirable to have this particular value determined at runtime. 

On Tue, 22 Apr 2003, Ingo Molnar wrote:

> 
> On Tue, 22 Apr 2003, Rick Lindsley wrote:

> > [...] Are we assuming that because both a physical processor and its
> > sibling are not idle, that it is better to move a task from the sibling
> > to a physical processor?  In other words, we are presuming that the case
> > where the task on the physical processor and the task(s) on the
> > sibling(s) are actually benefitting from the relationship is rare?
> 
> yes. This 'un-sharing' of contexts happens unconditionally, whenever we
> notice the situation. (ie. whenever a CPU goes completely idle and notices
> an overloaded physical CPU.) On the HT system i have i have measure this
> to be a beneficial move even for the most trivial things like infinite
> loop-counting.
> 
> the more per-logical-CPU cache a given SMT implementation has, the less
> beneficial this move becomes - in that case the system should rather be
> set up as a NUMA topology and scheduled via the NUMA scheduler.

Have you done any tests with a threaded process running on a single CPU in
the siblings? If they are sharing data and locks in the same cache it's
not obvious (to me at least) that it would be faster in two CPUs having to
do updates. That's a question, not an implication that it is significantly
better in just one, a threaded program with only two threads is not as
likely to be doing the same thing in both, perhaps.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2003-01-27 16:46 ` several messages Bill Davidsen
@ 2003-01-27 16:59   ` David Woodhouse
  0 siblings, 0 replies; 129+ messages in thread
From: David Woodhouse @ 2003-01-27 16:59 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Hal Duston, linux-kernel, Olaf Titz


davidsen@tmr.com said:
>  You make my point for me, there is no magic, and when building a
> module it should require that the directory be specified by either a
> command line option (as noted below) or by being built as part of a
> source tree. There *is* no good default in that particular case.

/lib/modules/`uname -r`/build _is_ a good default for a module to build 
again. It is correct in more cases than a simple failure to do anything.

For _installing_, the correct place to install the built objects is surely
/lib/modules/$(VERSION).$(PATCHLEVEL).$(SUBLEVEL)$(EXTRAVERSION) where the 
variables are obtained from the top-level Makefile in the kernel sources 
you built against.

You _default_ to building and installing for the current kernel, if it's
installed properly. But of course you should be permitted to override that 
on the command line.


> Related for sure, the point I was making was that there is no good
> default place to put modules built outside a kernel source tree (and
> probably also when built for multiple kernels).

I disagree. Modutils will look in only one place -- the /lib/modules/... 
directory corresponding to the kernel version for which you built each 
module. Each module, therefore, should go into the directory corresponding 
to the version of the kernel against which it was built.

Finding the appropriate _installation_ directory is trivial, surely? You 
can even find it from the 'kernel_version' stamp _inside_ the object file, 
without any other information?

--
dwmw2



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: several messages
  2003-01-23  0:20 ANN: LKMB (Linux Kernel Module Builder) version 0.1.16 Hal Duston
@ 2003-01-27 16:46 ` Bill Davidsen
  2003-01-27 16:59   ` David Woodhouse
  0 siblings, 1 reply; 129+ messages in thread
From: Bill Davidsen @ 2003-01-27 16:46 UTC (permalink / raw)
  To: David Woodhouse, Hal Duston; +Cc: linux-kernel, Olaf Titz

On Wed, 22 Jan 2003, David Woodhouse wrote:

> 
> davidsen@tmr.com said:
> >  `uname -r` is the kernel version of the running kernel. It is NOT by
> > magic the kernel version of the kernel you are building... 
> 
> Er, yes. And what's your point? 
> 
> There is _no_ magic that will find the kernel you want to build against
> today without any input from you. Using the build tree for the
> currently-running kernel, if installed in the standard place, is as good a
> default as any. Of course you should be permitted to override that default.

You make my point for me, there is no magic, and when building a module it
should require that the directory be specified by either a command line
option (as noted below) or by being built as part of a source tree. There
*is* no good default in that particular case.


On Wed, 22 Jan 2003, Hal Duston wrote:

> I use "INSTALL_MOD_PATH=put/the/modules/here/instead/of/lib/modules" in my
> .profile or whatever in order to drop the modules into another directory
> at "make modules_install" time.  Is this one of the things folks are
> talking about?

Related for sure, the point I was making was that there is no good default
place to put modules built outside a kernel source tree (and probably also
when built for multiple kernels). I was suggesting that the module tree of
the running kernel might be a really poor choice. I don't think I was
clear in my first post, I was not suggesting a better default, I was
suggesting that any default is likely to bite.

I'm not unhappy that Mr. Woodhouse disagrees, I just think he missed my
point the first time and I'm trying to clarify.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 129+ messages in thread

end of thread, other threads:[~2023-06-16 22:54 UTC | newest]

Thread overview: 129+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-10  6:26 [PATCH 00/13] Add VT-d Posted-Interrupts support for KVM Feng Wu
2014-11-10  6:26 ` Feng Wu
2014-11-10  6:26 ` [PATCH 01/13] iommu/vt-d: VT-d Posted-Interrupts feature detection Feng Wu
2014-11-10  6:26   ` Feng Wu
2014-11-11 13:38   ` Jiang Liu
2014-11-11 13:38     ` Jiang Liu
2014-11-10  6:26 ` [PATCH 02/13] KVM: Initialize VT-d Posted-Interrtups Descriptor Feng Wu
2014-11-10  6:26   ` Feng Wu
2014-11-10 21:57   ` Alex Williamson
2014-11-10 21:57     ` Alex Williamson
2014-11-11 13:35   ` Jiang Liu
2014-11-11 13:35     ` Jiang Liu
2014-11-20  4:53     ` Wu, Feng
2014-11-20  4:53       ` Wu, Feng
2014-11-20  5:00       ` Jiang Liu
2014-11-20  5:00         ` Jiang Liu
2014-11-10  6:26 ` [PATCH 03/13] KVM: Add KVM_CAP_PI to detect VT-d Posted-Interrtups Feng Wu
2014-11-10  6:26   ` Feng Wu
2014-11-10  6:26 ` [PATCH 04/13] iommu/vt-d: Adjust 'struct irte' to better suit for VT-d Posted-Interrupts Feng Wu
2014-11-10  6:26   ` Feng Wu
2014-11-11 13:43   ` Jiang Liu
2014-11-11 13:43     ` Jiang Liu
2014-11-10  6:26 ` [PATCH 05/13] KVM: Update IRTE according to guest interrupt configuration changes Feng Wu
2014-11-10  6:26   ` Feng Wu
2014-11-10 21:57   ` Alex Williamson
2014-11-10 21:57     ` Alex Williamson
2014-11-10  6:26 ` [PATCH 06/13] KVM: Add some helper functions for Posted-Interrupts Feng Wu
2014-11-10  6:26   ` Feng Wu
2014-11-10  6:26 ` [PATCH 07/13] x86, irq: Define a global vector for VT-d Posted-Interrupts Feng Wu
2014-11-10  6:26   ` Feng Wu
2014-11-10  6:26 ` [PATCH 08/13] KVM: Update Posted-Interrupts descriptor during VCPU scheduling Feng Wu
2014-11-10  6:26   ` Feng Wu
2014-11-10  6:26 ` [PATCH 09/13] KVM: Change NDST field after " Feng Wu
2014-11-10  6:26   ` Feng Wu
2014-11-10  6:26 ` [PATCH 10/13] KVM: Add the handler for Wake-up Vector Feng Wu
2014-11-10  6:26   ` Feng Wu
2014-11-10  6:26 ` [PATCH 11/13] KVM: Suppress posted-interrupt when 'SN' is set Feng Wu
2014-11-10  6:26   ` Feng Wu
2014-11-10  6:26 ` [PATCH 12/13] iommu/vt-d: No need to migrating irq for VT-d Posted-Interrtups Feng Wu
2014-11-10  6:26   ` Feng Wu
2014-11-11 13:48   ` Jiang Liu
2014-11-11 13:48     ` Jiang Liu
2014-11-10  6:26 ` [PATCH 13/13] iommu/vt-d: Add a command line parameter for VT-d posted-interrupts Feng Wu
2014-11-10 18:15   ` several messages Thomas Gleixner
2014-11-10 18:15     ` Thomas Gleixner
2014-11-11  2:28     ` Jiang Liu
2014-11-11  2:28       ` Jiang Liu
2014-11-11  6:37       ` Wu, Feng
2014-11-11  6:37         ` Wu, Feng
2014-11-10 21:57   ` [PATCH 13/13] iommu/vt-d: Add a command line parameter for VT-d posted-interrupts Alex Williamson
2014-11-10 21:57     ` Alex Williamson
  -- strict thread matches above, loose matches on Subject: below --
2023-06-12 16:02 [PATCH mptcp-next] mptcp: drop legacy code Paolo Abeni
2023-06-13 17:37 ` mptcp: drop legacy code.: Tests Results MPTCP CI
2023-06-16 22:54   ` several messages Mat Martineau
2022-09-01  0:29 [PATCH 00/18] make test "linting" more comprehensive Eric Sunshine via GitGitGadget
2022-09-01  0:29 ` [PATCH 18/18] t: retire unused chainlint.sed Eric Sunshine via GitGitGadget
2022-09-02 12:42   ` several messages Johannes Schindelin
2022-09-02 18:16     ` Eric Sunshine
2022-09-02 18:34       ` Jeff King
2022-09-02 18:44         ` Junio C Hamano
2022-06-16 13:55 [PATCH mptcp-next] selftests: mptcp: tweak simult_flows for debug kernels Paolo Abeni
2022-06-16 15:27 ` selftests: mptcp: tweak simult_flows for debug kernels.: Tests Results MPTCP CI
2022-06-17 22:13   ` several messages Mat Martineau
2016-01-25 18:37 [PATCH v2 0/3] x86/mm: INVPCID support Andy Lutomirski
2016-01-25 18:57 ` Ingo Molnar
2016-01-27 10:09   ` several messages Thomas Gleixner
2016-01-27 10:09     ` Thomas Gleixner
2016-01-29 13:21     ` Borislav Petkov
2014-07-03  5:02 [RFC PATCH v4] ARM: EXYNOS: Use MCPM call-backs to support S2R on Exynos5420 Abhilash Kesavan
2014-07-03 14:46 ` [PATCH v5] " Abhilash Kesavan
2014-07-03 15:45   ` several messages Nicolas Pitre
2014-07-03 15:45     ` Nicolas Pitre
2014-07-03 16:19     ` Abhilash Kesavan
2014-07-03 16:19       ` Abhilash Kesavan
2014-07-03 19:00       ` Nicolas Pitre
2014-07-03 19:00         ` Nicolas Pitre
2014-07-03 20:00         ` Abhilash Kesavan
2014-07-03 20:00           ` Abhilash Kesavan
2014-07-04  4:13           ` Nicolas Pitre
2014-07-04  4:13             ` Nicolas Pitre
2014-07-04 17:45             ` Abhilash Kesavan
2014-07-04 17:45               ` Abhilash Kesavan
2010-07-11 15:06 [PATCHv2] netfilter: add CHECKSUM target Michael S. Tsirkin
2010-07-11 15:14 ` [PATCHv3] extensions: libxt_CHECKSUM extension Michael S. Tsirkin
2010-07-15  9:39   ` Patrick McHardy
2010-07-15 10:17     ` several messages Jan Engelhardt
2009-09-06 14:16 Layla 3G does not recover from ACPI Suspend Mark Hills
2009-09-08 19:32 ` Giuliano Pochini
2009-09-08 22:56   ` several messages Mark Hills
2009-02-09 20:57 [PATCH] libxtables: Introduce global params structuring jamal
2009-02-09 21:04 ` several messages Jan Engelhardt
2009-02-09 21:27   ` jamal
2009-02-09 21:44     ` Jan Engelhardt
2008-11-26 14:33 [PATCH 0/1] HID: hid_apple is not used for apple alu wireless keyboards Jan Scholz
2008-11-26 14:33 ` [PATCH 1/1] HID: Apple alu wireless keyboards are bluetooth devices Jan Scholz
2008-11-26 14:54   ` Jiri Kosina
2008-11-26 15:17     ` Jan Scholz
2008-11-26 15:33       ` Jiri Kosina
2008-11-26 21:06         ` Tobias Müller
2008-11-27  0:57           ` several messages Jiri Kosina
2008-10-19 14:15 [PATCH 1/2] HID: add hid_type Jiri Slaby
2008-10-19 14:15 ` [PATCH 2/2] HID: fix appletouch regression Jiri Slaby
2008-10-19 19:40   ` several messages Jiri Kosina
2008-10-19 20:06     ` Justin Mattock
2008-10-19 20:06       ` Justin Mattock
2008-10-19 22:09     ` Jiri Slaby
     [not found] <9E397A467F4DB34884A1FD0D5D27CF43018903F96E@msxaoa4.twosigma.com>
2008-06-12 16:54 ` Benjamin L. Shi
     [not found] <200702211929.17203.david-b@pacbell.net>
2007-02-22  3:50 ` [patch 6/6] rtc suspend()/resume() restores system clock David Brownell
2007-02-22 22:58   ` several messages Guennadi Liakhovetski
2007-02-22 22:58     ` Guennadi Liakhovetski
2007-02-22 22:58     ` Guennadi Liakhovetski
2007-02-23  1:15     ` David Brownell
2007-02-23  1:15       ` David Brownell
2007-02-23  1:15       ` David Brownell
2007-02-23 11:17     ` Johannes Berg
2007-02-23 11:17       ` Johannes Berg
2007-02-23 11:17       ` Johannes Berg
2006-09-26 18:51 Long sleep with i_mutex in xfs_flush_device(), affects NFS service Stephane Doyon
2006-09-27 11:33 ` Shailendra Tripathi
2006-10-02 14:45   ` Stephane Doyon
2006-10-02 22:30     ` David Chinner
2006-10-03 13:39       ` several messages Stephane Doyon
2006-10-03 13:39         ` Stephane Doyon
2006-10-03 16:40         ` Trond Myklebust
2006-10-03 16:40           ` Trond Myklebust
2006-10-05 15:39           ` Stephane Doyon
2006-10-05 15:39             ` Stephane Doyon
2006-10-06  0:33             ` David Chinner
2006-10-06  0:33               ` David Chinner
2006-10-06 13:25               ` Stephane Doyon
2006-10-06 13:25                 ` Stephane Doyon
2006-10-05  8:30         ` David Chinner
2006-10-05  8:30           ` David Chinner
2006-10-05 16:33           ` Stephane Doyon
2006-10-05 16:33             ` Stephane Doyon
2006-10-05 23:29             ` David Chinner
2006-10-05 23:29               ` David Chinner
2006-10-06 13:03               ` Stephane Doyon
2006-10-06 13:03                 ` Stephane Doyon
2006-04-11 17:33 Linux 2.6.16.4 Greg KH
2006-04-11 19:04 ` several messages Jan Engelhardt
2006-04-11 19:20   ` Boris B. Zhmurov
2006-04-11 20:30   ` Greg KH
2006-04-11 23:46     ` Jan Engelhardt
2006-04-12  0:36     ` Nix
2005-05-04 17:31 ata over ethernet question Maciej Soltysiak
2005-05-04 19:48 ` David Hollis
2005-05-04 21:17   ` Re[2]: " Maciej Soltysiak
2005-05-05 15:09     ` David Hollis
2005-05-07 15:05       ` Sander
2005-05-10 22:00         ` Guennadi Liakhovetski
2005-05-11  8:56           ` Vladislav Bolkhovitin
2005-05-11 21:26             ` several messages Guennadi Liakhovetski
2005-05-12  2:16               ` Ming Zhang
2005-05-12 18:32                 ` Dmitry Yusupov
2005-05-13  8:12                   ` Christoph Hellwig
2005-05-13 15:04                     ` Dmitry Yusupov
2005-05-13 15:07                       ` Christoph Hellwig
2005-05-13 15:38                         ` Dmitry Yusupov
2005-05-12 10:17               ` Vladislav Bolkhovitin
2004-05-11  8:45 2.6.6-rc3-mm2 (4KSTACK) Helge Hafting
2004-05-11 17:59 ` several messages Bill Davidsen
2003-04-22 10:34 [patch] HT scheduler, sched-2.5.68-A9 Ingo Molnar
2003-04-22 22:16 ` several messages Bill Davidsen
2003-04-22 23:38   ` Rick Lindsley
2003-04-23  9:17     ` Ingo Molnar
2003-01-23  0:20 ANN: LKMB (Linux Kernel Module Builder) version 0.1.16 Hal Duston
2003-01-27 16:46 ` several messages Bill Davidsen
2003-01-27 16:59   ` David Woodhouse

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.