* [RFC PATCH v2 0/7] x86/idle: add halt poll support
@ 2017-08-29 11:46 Yang Zhang
  2017-08-29 11:46   ` Yang Zhang
                   ` (12 more replies)
  0 siblings, 13 replies; 47+ messages in thread
From: Yang Zhang @ 2017-08-29 11:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar, dmatlack, agraf,
	peterz, linux-doc, Yang Zhang

Some latency-intensive workloads see an obvious performance
drop when running inside a VM. The main reason is that the overhead
is amplified when running inside a VM. The largest cost I have seen
is in the idle path.

This series introduces a new mechanism to poll for a while before
entering the idle state. If a reschedule is needed during the poll,
we avoid going through the heavy-overhead path.

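A minimal sketch of the idea (illustrative only; the real
implementation is kvm_idle_poll in patch 2, and the poll budget is
adapted at runtime by the later patches):

	/* sketch: spin briefly before committing to the expensive idle path */
	static void idle_poll_sketch(unsigned long budget_ns)
	{
		ktime_t stop = ktime_add_ns(ktime_get(), budget_ns);

		do {
			if (need_resched())
				return;	/* got work: skip the real idle path */
			cpu_relax();
		} while (ktime_before(ktime_get(), stop));
	}
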
Here is the data we get when running the contextswitch benchmark to
measure latency (lower is better):

   1. w/o patch:
      2493.14 ns/ctxsw -- 200.3 %CPU
   
   2. w/ patch:
      halt_poll_threshold=10000 -- 1485.96 ns/ctxsw -- 201.0 %CPU
      halt_poll_threshold=20000 -- 1391.26 ns/ctxsw -- 200.7 %CPU
      halt_poll_threshold=30000 -- 1488.55 ns/ctxsw -- 200.1 %CPU
      halt_poll_threshold=500000 -- 1159.14 ns/ctxsw -- 201.5 %CPU
   
   3. kvm dynamic poll
      halt_poll_ns=10000 -- 2296.11 ns/ctxsw -- 201.2 %CPU
      halt_poll_ns=20000 -- 2599.7 ns/ctxsw -- 201.7 %CPU
      halt_poll_ns=30000 -- 2588.68 ns/ctxsw -- 211.6 %CPU
      halt_poll_ns=500000 -- 2423.20 ns/ctxsw -- 229.2 %CPU
   
   4. idle=poll
      2050.1 ns/ctxsw -- 1003 %CPU
   
   5. idle=mwait
      2188.06 ns/ctxsw -- 206.3 %CPU

Here is the data we get when running the netperf benchmark (higher is
better):

   1. w/o patch:
      14556.8 bits/s  -- 144.2 %CPU

   2. w/ patch:
      halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
      halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
      halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
      halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
      halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU

   3. kvm dynamic poll
      halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
      halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
      halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
      halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
      halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU

   4. idle=poll in guest
      18441.3 bits/s -- 1003 %CPU

   5. idle=mwait in guest
      15760.6  bits/s  -- 157.6 %CPU

V1 -> V2:
- integrate the smart halt poll into paravirt code
- use idle_stamp instead of check_poll
- since it is hard to tell whether the vCPU is the only task on the
  pCPU, we don't consider that case in this series (may improve it in
  the future)

Yang Zhang (7):
  x86/paravirt: Add pv_idle_ops to paravirt ops
  KVM guest: register kvm_idle_poll for pv_idle_ops
  sched/idle: Add poll before enter real idle path
  x86/paravirt: Add update in x86/paravirt pv_idle_ops
  Documentation: Add three sysctls for smart idle poll
  KVM guest: introduce smart idle poll algorithm
  sched/idle: update poll time when wakeup from idle

 Documentation/sysctl/kernel.txt       | 25 +++++++++++++
 arch/x86/include/asm/paravirt.h       | 10 ++++++
 arch/x86/include/asm/paravirt_types.h |  7 ++++
 arch/x86/kernel/kvm.c                 | 67 +++++++++++++++++++++++++++++++++++
 arch/x86/kernel/paravirt.c            | 11 ++++++
 arch/x86/kernel/process.c             |  7 ++++
 include/linux/kernel.h                |  6 ++++
 include/linux/sched/idle.h            |  4 +++
 kernel/sched/core.c                   |  4 +++
 kernel/sched/idle.c                   |  9 +++++
 kernel/sysctl.c                       | 23 ++++++++++++
 11 files changed, 173 insertions(+)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC PATCH v2 1/7] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-08-29 11:46 [RFC PATCH v2 0/7] x86/idle: add halt poll support Yang Zhang
@ 2017-08-29 11:46   ` Yang Zhang
  2017-08-29 11:46 ` [RFC PATCH v2 2/7] KVM guest: register kvm_idle_poll for pv_idle_ops Yang Zhang
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 47+ messages in thread
From: Yang Zhang @ 2017-08-29 11:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar, dmatlack, agraf,
	peterz, linux-doc, Yang Zhang, Quan Xu, Jeremy Fitzhardinge,
	Chris Wright, Alok Kataria, Rusty Russell, Ingo Molnar,
	H. Peter Anvin, x86, Andy Lutomirski, Kirill A. Shutemov,
	Pan Xinhui, Kees Cook, virtualization

So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called
in the idle path and polls for a while before we enter the real idle
state.

In virtualization, the idle path includes several heavy operations,
including timer access (LAPIC timer or TSC deadline timer), which hurt
performance, especially for latency-intensive workloads such as
message-passing tasks. The cost mainly comes from the VM-exit, which
is a hardware context switch between the VM and the hypervisor. Our
solution is to poll for a while and not enter the real idle path if we
observe a schedule event during polling.

Polling may waste CPU cycles, so we adopt a smart polling mechanism to
reduce useless polling.
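
As an illustration (hypothetical code, not part of this patch;
kvm_idle_poll in patch 2 is the real user), a guest opts in by
overriding the nop default:

	/* sketch: a guest-specific poll routine for pv_idle_ops.poll */
	static void example_idle_poll(void)
	{
		/* spin until a reschedule is pending or a small budget expires */
		ktime_t stop = ktime_add_ns(ktime_get(), 10000);	/* 10us, example */

		while (!need_resched() && ktime_before(ktime_get(), stop))
			cpu_relax();
	}

	static void __init example_guest_idle_init(void)
	{
		pv_idle_ops.poll = example_idle_poll;	/* default is paravirt_nop */
	}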

Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Quan Xu <quan.xu0@gmail.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: virtualization@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/paravirt.h       | 5 +++++
 arch/x86/include/asm/paravirt_types.h | 6 ++++++
 arch/x86/kernel/paravirt.c            | 6 ++++++
 3 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 9ccac19..6d46760 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -202,6 +202,11 @@ static inline unsigned long long paravirt_read_pmc(int counter)
 
 #define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter))
 
+static inline void paravirt_idle_poll(void)
+{
+	PVOP_VCALL0(pv_idle_ops.poll);
+}
+
 static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned entries)
 {
 	PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 9ffc36b..cf45726 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -324,6 +324,10 @@ struct pv_lock_ops {
 	struct paravirt_callee_save vcpu_is_preempted;
 } __no_randomize_layout;
 
+struct pv_idle_ops {
+	void (*poll)(void);
+} __no_randomize_layout;
+
 /* This contains all the paravirt structures: we get a convenient
  * number for each function using the offset which we use to indicate
  * what to patch. */
@@ -334,6 +338,7 @@ struct paravirt_patch_template {
 	struct pv_irq_ops pv_irq_ops;
 	struct pv_mmu_ops pv_mmu_ops;
 	struct pv_lock_ops pv_lock_ops;
+	struct pv_idle_ops pv_idle_ops;
 } __no_randomize_layout;
 
 extern struct pv_info pv_info;
@@ -343,6 +348,7 @@ struct paravirt_patch_template {
 extern struct pv_irq_ops pv_irq_ops;
 extern struct pv_mmu_ops pv_mmu_ops;
 extern struct pv_lock_ops pv_lock_ops;
+extern struct pv_idle_ops pv_idle_ops;
 
 #define PARAVIRT_PATCH(x)					\
 	(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index bc0a849..1b5b247 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -128,6 +128,7 @@ static void *get_call_destination(u8 type)
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 		.pv_lock_ops = pv_lock_ops,
 #endif
+		.pv_idle_ops = pv_idle_ops,
 	};
 	return *((void **)&tmpl + type);
 }
@@ -312,6 +313,10 @@ struct pv_time_ops pv_time_ops = {
 	.steal_clock = native_steal_clock,
 };
 
+struct pv_idle_ops pv_idle_ops = {
+	.poll = paravirt_nop,
+};
+
 __visible struct pv_irq_ops pv_irq_ops = {
 	.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
 	.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
@@ -471,3 +476,4 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
 EXPORT_SYMBOL    (pv_mmu_ops);
 EXPORT_SYMBOL_GPL(pv_info);
 EXPORT_SYMBOL    (pv_irq_ops);
+EXPORT_SYMBOL    (pv_idle_ops);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH v2 2/7] KVM guest: register kvm_idle_poll for pv_idle_ops
  2017-08-29 11:46 [RFC PATCH v2 0/7] x86/idle: add halt poll support Yang Zhang
  2017-08-29 11:46   ` Yang Zhang
@ 2017-08-29 11:46 ` Yang Zhang
  2017-08-29 11:46 ` [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path Yang Zhang
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 47+ messages in thread
From: Yang Zhang @ 2017-08-29 11:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar, dmatlack, agraf,
	peterz, linux-doc, Yang Zhang, Quan Xu, Ingo Molnar,
	H. Peter Anvin, x86

Although smart idle poll has nothing to do with paravirt as such, it
cannot bring any benefit to native execution, so we only enable it
when Linux runs as a KVM guest (it can be extended to other
hypervisors such as Xen, Hyper-V and VMware).

Introduce the per-CPU variable poll_duration_ns to control the max
poll time.

Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Quan Xu <quan.xu0@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/kvm.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index d04e30e..7d84a02 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -75,6 +75,7 @@ static int parse_no_kvmclock_vsyscall(char *arg)
 
 early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
 
+static DEFINE_PER_CPU(unsigned long, poll_duration_ns);
 static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
@@ -357,6 +358,29 @@ static void kvm_guest_cpu_init(void)
 		kvm_register_steal_time();
 }
 
+static void kvm_idle_poll(void)
+{
+	unsigned long poll_duration = this_cpu_read(poll_duration_ns);
+	ktime_t start, cur, stop;
+
+	start = cur = ktime_get();
+	stop = ktime_add_ns(ktime_get(), poll_duration);
+
+	do {
+		if (need_resched())
+			break;
+		cur = ktime_get();
+	} while (ktime_before(cur, stop));
+}
+
+static void kvm_guest_idle_init(void)
+{
+	if (!kvm_para_available())
+		return;
+
+	pv_idle_ops.poll = kvm_idle_poll;
+}
+
 static void kvm_pv_disable_apf(void)
 {
 	if (!__this_cpu_read(apf_reason.enabled))
@@ -492,6 +516,8 @@ void __init kvm_guest_init(void)
 	kvm_guest_cpu_init();
 #endif
 
+	kvm_guest_idle_init();
+
 	/*
 	 * Hard lockup detection is enabled by default. Disable it, as guests
 	 * can get false positives too easily, for example if the host is
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
  2017-08-29 11:46 [RFC PATCH v2 0/7] x86/idle: add halt poll support Yang Zhang
  2017-08-29 11:46   ` Yang Zhang
  2017-08-29 11:46 ` [RFC PATCH v2 2/7] KVM guest: register kvm_idle_poll for pv_idle_ops Yang Zhang
@ 2017-08-29 11:46 ` Yang Zhang
  2017-08-29 12:45   ` Peter Zijlstra
  2017-08-29 14:39   ` Borislav Petkov
  2017-08-29 11:46 ` [RFC PATCH v2 4/7] x86/paravirt: Add update in x86/paravirt pv_idle_ops Yang Zhang
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 47+ messages in thread
From: Yang Zhang @ 2017-08-29 11:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar, dmatlack, agraf,
	peterz, linux-doc, Yang Zhang, Quan Xu, Ingo Molnar,
	H. Peter Anvin, x86, Borislav Petkov, Kyle Huey, Andy Lutomirski,
	Len Brown

Add a poll in do_idle(). For a UP VM, if there is a runnable task, it
will not go into the idle path, so we only enable the poll in SMP VMs.
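
The resulting call path is do_idle() -> arch_cpu_idle_poll() ->
paravirt_idle_poll() -> pv_idle_ops.poll; the __weak
arch_cpu_idle_poll() stub keeps this a no-op on architectures and
configurations that do not provide the x86/PARAVIRT implementation.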

Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Quan Xu <quan.xu0@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Kyle Huey <me@kylehuey.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/process.c | 7 +++++++
 kernel/sched/idle.c       | 2 ++
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 3ca1980..def4113 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -332,6 +332,13 @@ void arch_cpu_idle(void)
 	x86_idle();
 }
 
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
+void arch_cpu_idle_poll(void)
+{
+	paravirt_idle_poll();
+}
+#endif
+
 /*
  * We use this if we don't have any better idle routine..
  */
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6c23e30..b374744 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 }
 
 /* Weak implementations for optional arch specific functions */
+void __weak arch_cpu_idle_poll(void) { }
 void __weak arch_cpu_idle_prepare(void) { }
 void __weak arch_cpu_idle_enter(void) { }
 void __weak arch_cpu_idle_exit(void) { }
@@ -219,6 +220,7 @@ static void do_idle(void)
 	 */
 
 	__current_set_polling();
+	arch_cpu_idle_poll();
 	quiet_vmstat();
 	tick_nohz_idle_enter();
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH v2 4/7] x86/paravirt: Add update in x86/paravirt pv_idle_ops
  2017-08-29 11:46 [RFC PATCH v2 0/7] x86/idle: add halt poll support Yang Zhang
                   ` (3 preceding siblings ...)
  2017-08-29 11:46 ` [RFC PATCH v2 4/7] x86/paravirt: Add update in x86/paravirt pv_idle_ops Yang Zhang
@ 2017-08-29 11:46 ` Yang Zhang
  2017-08-29 11:46 ` [RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll Yang Zhang
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 47+ messages in thread
From: Yang Zhang @ 2017-08-29 11:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar, dmatlack, agraf,
	peterz, linux-doc, Yang Zhang, Quan Xu, Jeremy Fitzhardinge,
	Chris Wright, Alok Kataria, Rusty Russell, Ingo Molnar,
	H. Peter Anvin, x86, Andy Lutomirski, Kirill A. Shutemov,
	Pan Xinhui, Kees Cook, virtualization

The .update op is used to adjust the next poll time.

Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Quan Xu <quan.xu0@gmail.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: virtualization@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/paravirt.h       | 5 +++++
 arch/x86/include/asm/paravirt_types.h | 1 +
 arch/x86/kernel/paravirt.c            | 1 +
 3 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 6d46760..32e1c06 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -207,6 +207,11 @@ static inline void paravirt_idle_poll(void)
 	PVOP_VCALL0(pv_idle_ops.poll);
 }
 
+static inline void paravirt_idle_update_poll_duration(unsigned long duration)
+{
+	PVOP_VCALL1(pv_idle_ops.update, duration);
+}
+
 static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned entries)
 {
 	PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index cf45726..3b4f95a 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -326,6 +326,7 @@ struct pv_lock_ops {
 
 struct pv_idle_ops {
 	void (*poll)(void);
+	void (*update)(unsigned long);
 } __no_randomize_layout;
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 1b5b247..a11b2c2 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -315,6 +315,7 @@ struct pv_time_ops pv_time_ops = {
 
 struct pv_idle_ops pv_idle_ops = {
 	.poll = paravirt_nop,
+	.update = paravirt_nop,
 };
 
 __visible struct pv_irq_ops pv_irq_ops = {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll
  2017-08-29 11:46 [RFC PATCH v2 0/7] x86/idle: add halt poll support Yang Zhang
@ 2017-08-29 11:46   ` Yang Zhang
  2017-08-29 11:46 ` [RFC PATCH v2 2/7] KVM guest: register kvm_idle_poll for pv_idle_ops Yang Zhang
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 47+ messages in thread
From: Yang Zhang @ 2017-08-29 11:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar, dmatlack, agraf,
	peterz, linux-doc, Yang Zhang, Quan Xu, Jonathan Corbet,
	Jeremy Fitzhardinge, Chris Wright, Alok Kataria, Rusty Russell,
	Ingo Molnar, H. Peter Anvin, x86, Luis R. Rodriguez, Kees Cook,
	Mauro Carvalho Chehab, Krzysztof Kozlowski, Josh Poimboeuf,
	Andrew Morton, Petr Mladek, Jessica Yu, Larry Finger, zijun_hu,
	Baoquan He, Johannes Berg, Ian Abbott, virtualization,
	linux-fsdevel

To reduce the cost of polling, we introduce three sysctls to control
the poll time.
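
Once applied, the knobs show up under /proc/sys/kernel/ (registered as
halt_poll_threshold, halt_poll_grow and halt_poll_shrink in
kernel/sysctl.c below). For example, writing 200000 to
halt_poll_threshold allows up to 200us of polling, in line with the
halt_poll_threshold values used in the cover-letter benchmarks.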

Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Quan Xu <quan.xu0@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jessica Yu <jeyu@redhat.com>
Cc: Larry Finger <Larry.Finger@lwfinger.net>
Cc: zijun_hu <zijun_hu@htc.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Ian Abbott <abbotti@mev.co.uk>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org
---
 Documentation/sysctl/kernel.txt | 25 +++++++++++++++++++++++++
 arch/x86/kernel/paravirt.c      |  4 ++++
 include/linux/kernel.h          |  6 ++++++
 kernel/sysctl.c                 | 23 +++++++++++++++++++++++
 4 files changed, 58 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index bac23c1..67447b8 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -63,6 +63,9 @@ show up in /proc/sys/kernel:
 - perf_event_max_stack
 - perf_event_max_contexts_per_stack
 - pid_max
+- poll_grow                   [ X86 only ]
+- poll_shrink                 [ X86 only ]
+- poll_threshold_ns           [ X86 only ]
 - powersave-nap               [ PPC only ]
 - printk
 - printk_delay
@@ -702,6 +705,28 @@ kernel tries to allocate a number starting from this one.
 
 ==============================================================
 
+poll_grow: (X86 only)
+
+This parameter is the multiplier used in grow_poll_ns() to increase the poll time.
+By default, the value is 2.
+
+==============================================================
+poll_shrink: (X86 only)
+
+This parameter is the divisor used in shrink_poll_ns() to reduce the poll time.
+By default, the value is 2.
+
+==============================================================
+poll_threshold_ns: (X86 only)
+
+This parameter controls the max poll time before entering the real idle path.
+This parameter is expected to take effect only when running inside a VM;
+it would make no sense to turn it on on bare metal.
+By default, the value is 0, which means don't poll. It is recommended to
+set a non-zero value when running latency-bound workloads inside a VM.
+
+==============================================================
+
 powersave-nap: (PPC only)
 
 If set, Linux-PPC will use the 'nap' mode of powersaving,
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index a11b2c2..0b92f8f 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -318,6 +318,10 @@ struct pv_idle_ops pv_idle_ops = {
 	.update = paravirt_nop,
 };
 
+unsigned long poll_threshold_ns;
+unsigned int poll_shrink = 2;
+unsigned int poll_grow = 2;
+
 __visible struct pv_irq_ops pv_irq_ops = {
 	.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
 	.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index bd6d96c..6cb2820 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -462,6 +462,12 @@ extern __scanf(2, 0)
 
 extern bool crash_kexec_post_notifiers;
 
+#ifdef CONFIG_PARAVIRT
+extern unsigned long poll_threshold_ns;
+extern unsigned int poll_shrink;
+extern unsigned int poll_grow;
+#endif
+
 /*
  * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
  * holds a CPU number which is executing panic() currently. A value of
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..9b86a8f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1229,6 +1229,29 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_PARAVIRT
+	{
+		.procname       = "halt_poll_threshold",
+		.data           = &poll_threshold_ns,
+		.maxlen         = sizeof(unsigned long),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
+	{
+		.procname       = "halt_poll_grow",
+		.data           = &poll_grow,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
+	{
+		.procname       = "halt_poll_shrink",
+		.data           = &poll_shrink,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
+#endif
 	{ }
 };
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH v2 6/7] KVM guest: introduce smart idle poll algorithm
  2017-08-29 11:46 [RFC PATCH v2 0/7] x86/idle: add halt poll support Yang Zhang
                   ` (6 preceding siblings ...)
  2017-08-29 11:46   ` Yang Zhang
@ 2017-08-29 11:46 ` Yang Zhang
  2017-08-29 11:46 ` [RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle Yang Zhang
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 47+ messages in thread
From: Yang Zhang @ 2017-08-29 11:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar, dmatlack, agraf,
	peterz, linux-doc, Yang Zhang, Quan Xu, Ingo Molnar,
	H. Peter Anvin, x86

Use smart idle poll to reduce useless polling when the system is idle.
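
As a worked example with the defaults (poll_grow = 2, poll_shrink = 2)
and, say, halt_poll_threshold = 200000: a run of idle periods shorter
than the threshold grows the per-CPU poll_duration_ns
0 -> 10000 -> 20000 -> 40000 -> ... capped at 200000, while an idle
period longer than the threshold halves it again.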

Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Quan Xu <quan.xu0@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/kvm.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7d84a02..ffc8307 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -373,12 +373,53 @@ static void kvm_idle_poll(void)
 	} while (ktime_before(cur, stop));
 }
 
+static unsigned int grow_poll_ns(unsigned int old, unsigned int grow,
+				 unsigned int max)
+{
+	unsigned int val;
+
+	/* set base poll time to 10000ns */
+	if (old == 0 && grow)
+		return 10000;
+
+	val = old * grow;
+	if (val > max)
+		val = max;
+
+	return val;
+}
+
+static unsigned int shrink_poll_ns(unsigned int old, unsigned int shrink)
+{
+	if (shrink == 0)
+		return 0;
+
+	return old / shrink;
+}
+
+static int kvm_idle_update_poll_duration(unsigned long idle_duration)
+{
+	unsigned long poll_duration = this_cpu_read(poll_duration_ns);
+
+	if (poll_duration && idle_duration > poll_threshold_ns)
+		poll_duration = shrink_poll_ns(poll_duration, poll_shrink);
+	else if (poll_duration < poll_threshold_ns &&
+		 idle_duration < poll_threshold_ns)
+		poll_duration = grow_poll_ns(poll_duration, poll_grow,
+					     poll_threshold_ns);
+
+	this_cpu_write(poll_duration_ns, poll_duration);
+
+	return 0;
+}
+
 static void kvm_guest_idle_init(void)
 {
 	if (!kvm_para_available())
 		return;
 
 	pv_idle_ops.poll = kvm_idle_poll;
+	pv_idle_ops.update = kvm_idle_update_poll_duration;
 }
 
 static void kvm_pv_disable_apf(void)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle
  2017-08-29 11:46 [RFC PATCH v2 0/7] x86/idle: add halt poll support Yang Zhang
                   ` (7 preceding siblings ...)
  2017-08-29 11:46 ` [RFC PATCH v2 6/7] KVM guest: introduce smart idle poll algorithm Yang Zhang
@ 2017-08-29 11:46 ` Yang Zhang
  2017-08-29 12:46   ` Peter Zijlstra
  2017-08-29 11:58 ` [RFC PATCH v2 0/7] x86/idle: add halt poll support Alexander Graf
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Yang Zhang @ 2017-08-29 11:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar, dmatlack, agraf,
	peterz, linux-doc, Yang Zhang, Quan Xu, Ingo Molnar

ttwu_do_wakeup() updates avg_idle on wakeup from idle. Here we simply
reuse this logic to update the poll time. It may be a little late to
update the poll time in ttwu_do_wakeup(), but the test results show no
obvious performance gap compared with updating it in the irq handler.

One problem is that idle_stamp is only used by the CFS scheduler. But
that is fine, since CFS is the default scheduling policy, and
considering only it should be enough.
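
The resulting update chain is ttwu_do_wakeup() ->
update_poll_duration(rq->avg_idle) ->
paravirt_idle_update_poll_duration() -> pv_idle_ops.update, which a
KVM guest implements as kvm_idle_update_poll_duration in the previous
patch.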

Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Quan Xu <quan.xu0@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
---
 include/linux/sched/idle.h | 4 ++++
 kernel/sched/core.c        | 4 ++++
 kernel/sched/idle.c        | 7 +++++++
 3 files changed, 15 insertions(+)

diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index 5ca63eb..6e0554d 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -12,6 +12,10 @@ enum cpu_idle_type {
 
 extern void wake_up_if_idle(int cpu);
 
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
+extern void update_poll_duration(unsigned long idle_duration);
+#endif
+
 /*
  * Idle thread specific functions to determine the need_resched
  * polling state.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0869b20..25be9a3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1678,6 +1678,10 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
 		if (rq->avg_idle > max)
 			rq->avg_idle = max;
 
+#if defined(CONFIG_PARAVIRT)
+		update_poll_duration(rq->avg_idle);
+#endif
+
 		rq->idle_stamp = 0;
 	}
 #endif
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index b374744..7eb8559 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -101,6 +101,13 @@ void __cpuidle default_idle_call(void)
 	}
 }
 
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
+void update_poll_duration(unsigned long idle_duration)
+{
+	paravirt_idle_update_poll_duration(idle_duration);
+}
+#endif
+
 static int call_cpuidle(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		      int next_state)
 {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-08-29 11:46 [RFC PATCH v2 0/7] x86/idle: add halt poll support Yang Zhang
                   ` (8 preceding siblings ...)
  2017-08-29 11:46 ` [RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle Yang Zhang
@ 2017-08-29 11:58 ` Alexander Graf
  2017-09-01  6:21   ` Yang Zhang
  2017-08-29 13:03 ` Andi Kleen
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 47+ messages in thread
From: Alexander Graf @ 2017-08-29 11:58 UTC (permalink / raw)
  To: Yang Zhang, linux-kernel
  Cc: kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar, dmatlack, peterz,
	linux-doc

On 08/29/2017 01:46 PM, Yang Zhang wrote:
> Some latency-intensive workloads see an obvious performance
> drop when running inside a VM. The main reason is that the overhead
> is amplified when running inside a VM. The largest cost I have seen
> is in the idle path.
>
> This series introduces a new mechanism to poll for a while before
> entering the idle state. If a reschedule is needed during the poll,
> we avoid going through the heavy-overhead path.
>
> Here is the data we get when running the contextswitch benchmark to
> measure latency (lower is better):
>
>     1. w/o patch:
>        2493.14 ns/ctxsw -- 200.3 %CPU
>     
>     2. w/ patch:
>        halt_poll_threshold=10000 -- 1485.96 ns/ctxsw -- 201.0 %CPU
>        halt_poll_threshold=20000 -- 1391.26 ns/ctxsw -- 200.7 %CPU
>        halt_poll_threshold=30000 -- 1488.55 ns/ctxsw -- 200.1 %CPU
>        halt_poll_threshold=500000 -- 1159.14 ns/ctxsw -- 201.5 %CPU
>     
>     3. kvm dynamic poll
>        halt_poll_ns=10000 -- 2296.11 ns/ctxsw -- 201.2 %CPU
>        halt_poll_ns=20000 -- 2599.7 ns/ctxsw -- 201.7 %CPU
>        halt_poll_ns=30000 -- 2588.68 ns/ctxsw -- 211.6 %CPU
>        halt_poll_ns=500000 -- 2423.20 ns/ctxsw -- 229.2 %CPU
>     
>     4. idle=poll
>        2050.1 ns/ctxsw -- 1003 %CPU
>     
>     5. idle=mwait
>        2188.06 ns/ctxsw -- 206.3 %CPU

Could you please try to create another metric for guest-initiated,
host-aborted mwait?

For a quick benchmark, reserve 4 registers for a magic value, set them 
to the magic value before you enter MWAIT in the guest. Then allow 
native MWAIT execution on the host. If you see the guest wants to enter 
with the 4 registers containing the magic contents and no events are 
pending, directly go into the vcpu block function on the host.

That way any time a guest gets naturally aborted while in mwait, it will 
only reenter mwait when an event actually occurred. While the guest is 
normally running (and nobody else wants to run on the host), we just 
stay in guest context, but with a sleeping CPU.

Overall, that might give us even better performance, as it allows for 
turbo boost and HT to work properly.


Alex

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
  2017-08-29 11:46 ` [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path Yang Zhang
@ 2017-08-29 12:45   ` Peter Zijlstra
  2017-09-01  5:57     ` Quan Xu
  2017-08-29 14:39   ` Borislav Petkov
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2017-08-29 12:45 UTC (permalink / raw)
  To: Yang Zhang
  Cc: linux-kernel, kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar,
	dmatlack, agraf, linux-doc, Quan Xu, Ingo Molnar, H. Peter Anvin,
	x86, Borislav Petkov, Kyle Huey, Andy Lutomirski, Len Brown

On Tue, Aug 29, 2017 at 11:46:37AM +0000, Yang Zhang wrote:
> Add a poll in do_idle(). For a UP VM, if there is a runnable task, it
> will not go into the idle path, so we only enable the poll in SMP VMs.
> 
> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
> Signed-off-by: Quan Xu <quan.xu0@gmail.com>

Broken SoB chain.

> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 6c23e30..b374744 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
>  }
>  
>  /* Weak implementations for optional arch specific functions */
> +void __weak arch_cpu_idle_poll(void) { }
>  void __weak arch_cpu_idle_prepare(void) { }
>  void __weak arch_cpu_idle_enter(void) { }

And not a word on why we need a new arch hook. What's wrong with
arch_cpu_idle_enter() for instance?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle
  2017-08-29 11:46 ` [RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle Yang Zhang
@ 2017-08-29 12:46   ` Peter Zijlstra
  2017-09-01  7:30     ` Yang Zhang
  2017-09-29 10:29     ` Quan Xu
  0 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2017-08-29 12:46 UTC (permalink / raw)
  To: Yang Zhang
  Cc: linux-kernel, kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar,
	dmatlack, agraf, linux-doc, Quan Xu, Ingo Molnar

On Tue, Aug 29, 2017 at 11:46:41AM +0000, Yang Zhang wrote:
> ttwu_do_wakeup() updates avg_idle on wakeup from idle. Here we simply
> reuse this logic to update the poll time. It may be a little late to
> update the poll time in ttwu_do_wakeup(), but the test results show no
> obvious performance gap compared with updating it in the irq handler.
> 
> One problem is that idle_stamp is only used by the CFS scheduler. But
> that is fine, since CFS is the default scheduling policy, and
> considering only it should be enough.
> 
> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
> Signed-off-by: Quan Xu <quan.xu0@gmail.com>

Same broken SoB chain, and not a useful word on why you need to adjust
this crap to begin with. What you want that poll duration to be related
to is the cost of a VMEXIT/VMENTER cycle, not however long we happened
to be idle.

So no.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-08-29 11:46 [RFC PATCH v2 0/7] x86/idle: add halt poll support Yang Zhang
                   ` (9 preceding siblings ...)
  2017-08-29 11:58 ` [RFC PATCH v2 0/7] x86/idle: add halt poll support Alexander Graf
@ 2017-08-29 13:03 ` Andi Kleen
  2017-08-29 14:02 ` Wanpeng Li
  2017-08-29 14:56 ` Michael S. Tsirkin
  12 siblings, 0 replies; 47+ messages in thread
From: Andi Kleen @ 2017-08-29 13:03 UTC (permalink / raw)
  To: Yang Zhang
  Cc: linux-kernel, kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar,
	dmatlack, agraf, peterz, linux-doc

Yang Zhang <yang.zhang.wz@gmail.com> writes:

> Some latency-intensive workloads see an obvious performance
> drop when running inside a VM. The main reason is that the overhead
> is amplified when running inside a VM. The largest cost I have seen
> is in the idle path.

You could test with https://lkml.org/lkml/2017/7/9/204
which optimizes these idle paths with a fast path.

If that works it would be a far better solution than
explicit tunables and basically giving up by polling.

-Andi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/7] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-08-29 11:46   ` Yang Zhang
@ 2017-08-29 13:55     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 47+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-08-29 13:55 UTC (permalink / raw)
  To: Yang Zhang, xen-devel, jgross, Boris Ostrovsky
  Cc: linux-kernel, kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar,
	dmatlack, agraf, peterz, linux-doc, Quan Xu, Jeremy Fitzhardinge,
	Chris Wright, Alok Kataria, Rusty Russell, Ingo Molnar,
	H. Peter Anvin, x86, Andy Lutomirski, Kirill A. Shutemov,
	Pan Xinhui, Kees Cook, virtualization

On Tue, Aug 29, 2017 at 11:46:35AM +0000, Yang Zhang wrote:
> So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called
> in the idle path and polls for a while before we enter the real idle
> state.
> 
> In virtualization, the idle path includes several heavy operations,
> including timer access (LAPIC timer or TSC deadline timer), which hurt
> performance, especially for latency-intensive workloads such as
> message-passing tasks. The cost mainly comes from the VM-exit, which
> is a hardware context switch between the VM and the hypervisor. Our
> solution is to poll for a while and not enter the real idle path if we
> observe a schedule event during polling.
> 
> Polling may waste CPU cycles, so we adopt a smart polling mechanism to
> reduce useless polling.
> 
> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
> Cc: Jeremy Fitzhardinge <jeremy@goop.org>
> Cc: Chris Wright <chrisw@sous-sol.org>
> Cc: Alok Kataria <akataria@vmware.com>
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: virtualization@lists.linux-foundation.org
> Cc: linux-kernel@vger.kernel.org

Adding xen-devel.

Juergen, we really should replace Jeremy's name with xen-devel or
your name.. Wasn't there a patch by you that took over some of the
maintainership?

> ---
>  arch/x86/include/asm/paravirt.h       | 5 +++++
>  arch/x86/include/asm/paravirt_types.h | 6 ++++++
>  arch/x86/kernel/paravirt.c            | 6 ++++++
>  3 files changed, 17 insertions(+)
> 
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index 9ccac19..6d46760 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -202,6 +202,11 @@ static inline unsigned long long paravirt_read_pmc(int counter)
>  
>  #define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter))
>  
> +static inline void paravirt_idle_poll(void)
> +{
> +	PVOP_VCALL0(pv_idle_ops.poll);
> +}
> +
>  static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned entries)
>  {
>  	PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 9ffc36b..cf45726 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -324,6 +324,10 @@ struct pv_lock_ops {
>  	struct paravirt_callee_save vcpu_is_preempted;
>  } __no_randomize_layout;
>  
> +struct pv_idle_ops {
> +	void (*poll)(void);
> +} __no_randomize_layout;
> +
>  /* This contains all the paravirt structures: we get a convenient
>   * number for each function using the offset which we use to indicate
>   * what to patch. */
> @@ -334,6 +338,7 @@ struct paravirt_patch_template {
>  	struct pv_irq_ops pv_irq_ops;
>  	struct pv_mmu_ops pv_mmu_ops;
>  	struct pv_lock_ops pv_lock_ops;
> +	struct pv_idle_ops pv_idle_ops;
>  } __no_randomize_layout;
>  
>  extern struct pv_info pv_info;
> @@ -343,6 +348,7 @@ struct paravirt_patch_template {
>  extern struct pv_irq_ops pv_irq_ops;
>  extern struct pv_mmu_ops pv_mmu_ops;
>  extern struct pv_lock_ops pv_lock_ops;
> +extern struct pv_idle_ops pv_idle_ops;
>  
>  #define PARAVIRT_PATCH(x)					\
>  	(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index bc0a849..1b5b247 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -128,6 +128,7 @@ static void *get_call_destination(u8 type)
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>  		.pv_lock_ops = pv_lock_ops,
>  #endif
> +		.pv_idle_ops = pv_idle_ops,
>  	};
>  	return *((void **)&tmpl + type);
>  }
> @@ -312,6 +313,10 @@ struct pv_time_ops pv_time_ops = {
>  	.steal_clock = native_steal_clock,
>  };
>  
> +struct pv_idle_ops pv_idle_ops = {
> +	.poll = paravirt_nop,
> +};
> +
>  __visible struct pv_irq_ops pv_irq_ops = {
>  	.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
>  	.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
> @@ -471,3 +476,4 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
>  EXPORT_SYMBOL    (pv_mmu_ops);
>  EXPORT_SYMBOL_GPL(pv_info);
>  EXPORT_SYMBOL    (pv_irq_ops);
> +EXPORT_SYMBOL    (pv_idle_ops);
> -- 
> 1.8.3.1
> 
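
For reference, a later patch in this series hooks a KVM-specific poll
routine into the new op. A minimal sketch of such a hook-up (the body
of kvm_idle_poll() here is illustrative rather than the actual patch;
poll_threshold_ns is the tunable added later in the series):

	/* Sketch: guest-side registration of a pv idle poll routine. */
	static void kvm_idle_poll(void)
	{
		u64 start = ktime_to_ns(ktime_get());

		/* Spin until a task is runnable or the poll budget expires. */
		while (!need_resched() &&
		       ktime_to_ns(ktime_get()) - start < poll_threshold_ns)
			cpu_relax();
	}

	static void __init kvm_guest_idle_init(void)
	{
		if (kvm_para_available())
			pv_idle_ops.poll = kvm_idle_poll;
	}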

* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-08-29 11:46 [RFC PATCH v2 0/7] x86/idle: add halt poll support Yang Zhang
                   ` (10 preceding siblings ...)
  2017-08-29 13:03 ` Andi Kleen
@ 2017-08-29 14:02 ` Wanpeng Li
  2017-08-29 14:27   ` Konrad Rzeszutek Wilk
                     ` (3 more replies)
  2017-08-29 14:56 ` Michael S. Tsirkin
  12 siblings, 4 replies; 47+ messages in thread
From: Wanpeng Li @ 2017-08-29 14:02 UTC (permalink / raw)
  To: Yang Zhang
  Cc: linux-kernel, kvm, Wanpeng Li, Michael S. Tsirkin, Paolo Bonzini,
	Thomas Gleixner, Radim Krcmar, David Matlack, Alexander Graf,
	Peter Zijlstra, linux-doc

> Here is the data we get when running benchmark netperf:
>
>    2. w/ patch:
>       halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
>       halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
>       halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
>       halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
>       halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU
>
>    3. kvm dynamic poll
>       halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
>       halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
>       halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
>       halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
>       halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU
>

Actually I'm not sure how much sense it makes to introduce this pv
stuff and to duplicate the adaptive halt-polling logic that has
already been done in kvm, w/o obvious benefit for real workload like
netperf. In addition, as you mentioned offline to me, enabling both this
patchset and the adaptive halt-polling logic in kvm simultaneously can
result in more cpu power consumption. I remember that David from
Google mentioned at KVM Forum that Windows Event Objects can get a 2x
latency improvement, which means that the adaptive halt-polling
in kvm should be enabled by default. So if windows guests and
linux guests are mixed on the same host, then this patchset will
result in more cpu power consumption if the customer enables the
polling in the linux guests. Anyway, if the patchset is finally
accepted by the maintainers, I will introduce a generic adaptive
halt-polling framework in kvm to avoid the duplicate logic.

Regards,
Wanpeng Li


* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-08-29 14:02 ` Wanpeng Li
@ 2017-08-29 14:27   ` Konrad Rzeszutek Wilk
  2017-08-29 14:36   ` Michael S. Tsirkin
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 47+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-08-29 14:27 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Yang Zhang, linux-kernel, kvm, Wanpeng Li, Michael S. Tsirkin,
	Paolo Bonzini, Thomas Gleixner, Radim Krcmar, David Matlack,
	Alexander Graf, Peter Zijlstra, linux-doc

On Tue, Aug 29, 2017 at 10:02:15PM +0800, Wanpeng Li wrote:
> > Here is the data we get when running benchmark netperf:
> >
> >    2. w/ patch:
> >       halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
> >       halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
> >       halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
> >       halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
> >       halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU
> >
> >    3. kvm dynamic poll
> >       halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
> >       halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
> >       halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
> >       halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
> >       halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU
> >
> 
> Actually I'm not sure how much sense it makes to introduce this pv
> stuff and to duplicate the adaptive halt-polling logic that has
> already been done in kvm, w/o obvious benefit for real workload like
> netperf. In addition, as you mentioned offline to me, enabling both this

"real workload like netperf"? 

That is not a real workload. That is a synthetic one.

> patchset and the adaptive halt-polling logic in kvm simultaneously can
> result in more cpu power consumption. I remember that David from
> Google mentioned at KVM Forum that Windows Event Objects can get a 2x
> latency improvement, which means that the adaptive halt-polling
> in kvm should be enabled by default. So if windows guests and
> linux guests are mixed on the same host, then this patchset will
> result in more cpu power consumption if the customer enables the
> polling in the linux guests. Anyway, if the patchset is finally

More CPU power consumption sounds like a bad idea, does it not?

> accepted by the maintainers, I will introduce a generic adaptive
> halt-polling framework in kvm to avoid the duplicate logic.
> 
> Regards,
> Wanpeng Li


* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-08-29 14:02 ` Wanpeng Li
  2017-08-29 14:27   ` Konrad Rzeszutek Wilk
@ 2017-08-29 14:36   ` Michael S. Tsirkin
  2017-09-01  6:32   ` Yang Zhang
  2017-09-01  6:44   ` Yang Zhang
  3 siblings, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2017-08-29 14:36 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Yang Zhang, linux-kernel, kvm, Wanpeng Li, Paolo Bonzini,
	Thomas Gleixner, Radim Krcmar, David Matlack, Alexander Graf,
	Peter Zijlstra, linux-doc

On Tue, Aug 29, 2017 at 10:02:15PM +0800, Wanpeng Li wrote:
> Actually I'm not sure how much sense it makes to introduce this pv
> stuff and to duplicate the adaptive halt-polling logic that has
> already been done in kvm, w/o obvious benefit for real workload like
> netperf.

In fact, I would really like to better understand why the polling
in kvm even helps.  Switching to the idle task is supposed to be really
cheap as you are not losing context.  In case of e.g. network polling
you gain the interrupt latency, but in case of kvm it's just an IPI
which is converted to a memory write when using mwait. Is mwait more
costly than commonly thought? Or is the idle driver too aggressive in
putting the CPU into deep sleep?

I think this analysis is something that would benefit
bare-metal/containers as well.

-- 
MST


* Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
  2017-08-29 11:46 ` [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path Yang Zhang
  2017-08-29 12:45   ` Peter Zijlstra
@ 2017-08-29 14:39   ` Borislav Petkov
  2017-09-01  6:49     ` Quan Xu
  1 sibling, 1 reply; 47+ messages in thread
From: Borislav Petkov @ 2017-08-29 14:39 UTC (permalink / raw)
  To: Yang Zhang
  Cc: linux-kernel, kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar,
	dmatlack, agraf, peterz, linux-doc, Quan Xu, Ingo Molnar,
	H. Peter Anvin, x86, Kyle Huey, Andy Lutomirski, Len Brown

On Tue, Aug 29, 2017 at 11:46:37AM +0000, Yang Zhang wrote:
> Add poll in do_idle. For a UP VM, if there is a running task, it will
> not go into the idle path, so we only enable poll in SMP VMs.
> 
> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Kyle Huey <me@kylehuey.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Len Brown <len.brown@intel.com>
> Cc: linux-kernel@vger.kernel.org
> ---
>  arch/x86/kernel/process.c | 7 +++++++
>  kernel/sched/idle.c       | 2 ++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 3ca1980..def4113 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -332,6 +332,13 @@ void arch_cpu_idle(void)
>  	x86_idle();
>  }
>  
> +#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
> +void arch_cpu_idle_poll(void)
> +{
> +	paravirt_idle_poll();
> +}
> +#endif

So this will get called on any system which has CONFIG_PARAVIRT enabled,
*even* if it is not running as a guest.

Huh?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
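
A guard along the following lines would address that concern (a sketch,
not something posted in the thread): only poll when the kernel actually
runs under a hypervisor.

	#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
	void arch_cpu_idle_poll(void)
	{
		/* Bare metal has no vmexit cost to amortize; skip polling. */
		if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
			paravirt_idle_poll();
	}
	#endif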


* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-08-29 11:46 [RFC PATCH v2 0/7] x86/idle: add halt poll support Yang Zhang
                   ` (11 preceding siblings ...)
  2017-08-29 14:02 ` Wanpeng Li
@ 2017-08-29 14:56 ` Michael S. Tsirkin
  2017-09-13 11:56   ` Yang Zhang
  12 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2017-08-29 14:56 UTC (permalink / raw)
  To: Yang Zhang
  Cc: linux-kernel, kvm, wanpeng.li, pbonzini, tglx, rkrcmar, dmatlack,
	agraf, peterz, linux-doc

On Tue, Aug 29, 2017 at 11:46:34AM +0000, Yang Zhang wrote:
> Some latency-intensive workloads will see an obvious performance
> drop when running inside a VM.

But are we trading a lot of CPU for a bit of lower latency?

> The main reason is that the overhead
> is amplified when running inside a VM. The largest cost I have seen is
> in the idle path.
> 
> This patch introduces a new mechanism to poll for a while before
> entering the idle state. If a reschedule is needed during the poll,
> then we don't need to go through the heavy overhead path.

Isn't it the job of an idle driver to find the best way to
halt the CPU?

It looks like just by adding a cstate we can make it
halt at higher latencies only. And at lower latencies,
if it's doing a good job we can hopefully use mwait to
stop the CPU.

In fact I have been experimenting with exactly that.
Some initial results are encouraging but I could use help
with testing and especially tuning. If you can help,
please let me know!

The patch below is not intended for upstream - it's just
the fastest way I found to test things.
So it just uses command line arguments to configure the guest;
the right thing would be a combination of ACPI and CPUIDs,
but let's decide whether we need this first.

RFC dontmerge PATCH intel_idle: add pv cstates when running on kvm

Usage:

kvm_pv_mwait - enables the feature. Note: you must have a recent
		host that allows guests to execute mwait without an exit,
		otherwise you will just get 100% CPU.

kvm_halt_target_residency - halt above this target residency.
		Should probably be a function of the cost of
		halt+wakeup.

kvm_halt_native - set to 0 if your VCPU does not match host.
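
As a usage illustration (assuming intel_idle is built in, so the module
parameters above go on the guest kernel command line):

	intel_idle.kvm_pv_mwait=1 intel_idle.kvm_halt_target_residency=400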

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index c2ae819..6fa58ad 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -65,8 +65,10 @@
 #include <asm/intel-family.h>
 #include <asm/mwait.h>
 #include <asm/msr.h>
+#include <linux/kvm_para.h>
 
 #define INTEL_IDLE_VERSION "0.4.1"
+#define PREFIX "intel_idle: "
 
 static struct cpuidle_driver intel_idle_driver = {
 	.name = "intel_idle",
@@ -94,6 +96,7 @@ struct idle_cpu {
 };
 
 static const struct idle_cpu *icpu;
+static struct idle_cpu icpus;
 static struct cpuidle_device __percpu *intel_idle_cpuidle_devices;
 static int intel_idle(struct cpuidle_device *dev,
 			struct cpuidle_driver *drv, int index);
@@ -119,6 +122,49 @@ static struct cpuidle_state *cpuidle_state_table;
 #define flg2MWAIT(flags) (((flags) >> 24) & 0xFF)
 #define MWAIT2flg(eax) ((eax & 0xFF) << 24)
 
+static int intel_halt(struct cpuidle_device *dev,
+			struct cpuidle_driver *drv, int index)
+{
+	printk_once(KERN_ERR "safe_halt started\n");
+	safe_halt();
+	printk_once(KERN_ERR "safe_halt done\n");
+	return index;
+}
+
+static int kvm_halt_target_residency = 400; /* Halt above this target residency */
+module_param(kvm_halt_target_residency, int, 0444);
+static int kvm_halt_native = 1; /* Use native mwait substates */
+module_param(kvm_halt_native, int, 0444);
+static int kvm_pv_mwait = 0; /* Whether to do mwait within KVM */
+module_param(kvm_pv_mwait, int, 0444);
+
+static struct cpuidle_state kvm_halt_cstate = {
+	.name = "HALT-KVM",
+	.desc = "HALT",
+	.flags = MWAIT2flg(0x10),
+	.exit_latency = 0,
+	.target_residency = 0,
+	.enter = &intel_halt,
+};
+
+static struct cpuidle_state kvm_cstates[] = {
+	{
+		.name = "C1-NHM",
+		.desc = "MWAIT 0x00",
+		.flags = MWAIT2flg(0x00),
+		.exit_latency = 3,
+		.target_residency = 6,
+		.enter = &intel_idle,
+		.enter_freeze = intel_idle_freeze, },
+	{
+		.name = "HALT-KVM",
+		.desc = "HALT",
+		.flags = MWAIT2flg(0x10),
+		.exit_latency = 30,
+		.target_residency = 399,
+		.enter = &intel_halt, }
+};
+
 /*
  * States are indexed by the cstate number,
  * which is also the index into the MWAIT hint array.
@@ -927,8 +973,11 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		tick_broadcast_enter();
 
+	printk_once(KERN_ERR "mwait_idle_with_hints started\n");
 	mwait_idle_with_hints(eax, ecx);
 
+	printk_once(KERN_ERR "mwait_idle_with_hints done\n");
+
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		tick_broadcast_exit();
 
@@ -989,6 +1038,10 @@ static const struct idle_cpu idle_cpu_tangier = {
 	.state_table = tangier_cstates,
 };
 
+static const struct idle_cpu idle_cpu_kvm = {
+	.state_table = kvm_cstates,
+};
+
 static const struct idle_cpu idle_cpu_lincroft = {
 	.state_table = atom_cstates,
 	.auto_demotion_disable_flags = ATM_LNC_C6_AUTO_DEMOTE,
@@ -1061,7 +1115,7 @@ static const struct idle_cpu idle_cpu_dnv = {
 };
 
 #define ICPU(model, cpu) \
-	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_MWAIT, (unsigned long)&cpu }
+	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)&cpu }
 
 static const struct x86_cpu_id intel_idle_ids[] __initconst = {
 	ICPU(INTEL_FAM6_NEHALEM_EP,		idle_cpu_nehalem),
@@ -1125,19 +1180,39 @@ static int __init intel_idle_probe(void)
 		return -ENODEV;
 	}
 
-	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
-		return -ENODEV;
+	icpus = *(struct idle_cpu *)id->driver_data;
+
+	if (kvm_pv_mwait) {
+
+		if (!kvm_halt_native)
+			icpus = idle_cpu_kvm;
+
+		pr_debug(PREFIX "MWAIT enabled by KVM\n");
+		mwait_substates = 0x1;
+		/*
+		 * these MSRs do not work on kvm maybe they should?
+		 * more likely we need to poke at CPUID before using MSRs
+		 */
+		icpus.auto_demotion_disable_flags = 0;
+		icpus.disable_promotion_to_c1e = 0;
+	} else {
+		if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
+			return -ENODEV;
+
+		if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
+			return -ENODEV;
 
-	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
+		cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
 
-	if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) ||
-	    !(ecx & CPUID5_ECX_INTERRUPT_BREAK) ||
-	    !mwait_substates)
+		if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) ||
+		    !(ecx & CPUID5_ECX_INTERRUPT_BREAK) ||
+		    !mwait_substates)
 			return -ENODEV;
 
-	pr_debug("MWAIT substates: 0x%x\n", mwait_substates);
+		pr_debug(PREFIX "MWAIT substates: 0x%x\n", mwait_substates);
+	}
 
-	icpu = (const struct idle_cpu *)id->driver_data;
+	icpu = &icpus;
 	cpuidle_state_table = icpu->state_table;
 
 	pr_debug("v" INTEL_IDLE_VERSION " model 0x%X\n",
@@ -1340,6 +1415,11 @@ static void __init intel_idle_cpuidle_driver_init(void)
 		    (cpuidle_state_table[cstate].enter_freeze == NULL))
 			break;
 
+		if (kvm_pv_mwait &&
+		    cpuidle_state_table[cstate].target_residency >=
+		    kvm_halt_target_residency)
+			break;
+
 		if (cstate + 1 > max_cstate) {
 			pr_info("max_cstate %d reached\n", max_cstate);
 			break;
@@ -1353,7 +1433,7 @@ static void __init intel_idle_cpuidle_driver_init(void)
 					& MWAIT_SUBSTATE_MASK;
 
 		/* if NO sub-states for this state in CPUID, skip it */
-		if (num_substates == 0)
+		if (num_substates == 0 && !kvm_pv_mwait)
 			continue;
 
 		/* if state marked as disabled, skip it */
@@ -1375,6 +1455,20 @@ static void __init intel_idle_cpuidle_driver_init(void)
 		drv->state_count += 1;
 	}
 
+	if (kvm_halt_native && kvm_pv_mwait) {
+		drv->states[drv->state_count] =	/* structure copy */
+			kvm_halt_cstate;
+		drv->states[drv->state_count].exit_latency =
+			drv->state_count > 1 ?
+			drv->states[drv->state_count - 1].exit_latency + 1 : 1;
+		drv->states[drv->state_count].target_residency =
+			kvm_halt_target_residency;
+
+		drv->state_count += 1;
+	}
+
+	printk(KERN_ERR "detected states: %d\n\n",  drv->state_count);
+
 	if (icpu->byt_auto_demotion_disable_flag) {
 		wrmsrl(MSR_CC6_DEMOTION_POLICY_CONFIG, 0);
 		wrmsrl(MSR_MC6_DEMOTION_POLICY_CONFIG, 0);
@@ -1452,7 +1546,8 @@ static int __init intel_idle_init(void)
 		goto init_driver_fail;
 	}
 
-	if (boot_cpu_has(X86_FEATURE_ARAT))	/* Always Reliable APIC Timer */
+	if (boot_cpu_has(X86_FEATURE_ARAT) ||	/* Always Reliable APIC Timer */
+	    kvm_pv_mwait)
 		lapic_timer_reliable_states = LAPIC_TIMER_ALWAYS_RELIABLE;
 
 	retval = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "idle/intel:online",
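
In effect the guest keeps the native MWAIT states whose target
residency is below kvm_halt_target_residency and gets a HALT state
above it. As a worked example with the default
kvm_halt_target_residency=400: a predicted sleep of 200ns selects an
in-guest MWAIT state and the vcpu never leaves guest context, while a
predicted sleep of 1ms selects HALT, taking the vmexit so the host can
run something else or enter a deep C-state.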


* Re: [RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll
  2017-08-29 11:46   ` Yang Zhang
@ 2017-08-29 17:20     ` Luis R. Rodriguez
  -1 siblings, 0 replies; 47+ messages in thread
From: Luis R. Rodriguez @ 2017-08-29 17:20 UTC (permalink / raw)
  To: Yang Zhang
  Cc: linux-kernel, kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar,
	dmatlack, agraf, peterz, linux-doc, Quan Xu, Jonathan Corbet,
	Juergen Gross, Jeremy Fitzhardinge, Chris Wright, Alok Kataria,
	Rusty Russell, Ingo Molnar, H. Peter Anvin, x86,
	Luis R. Rodriguez, Kees Cook, Mauro Carvalho Chehab,
	Krzysztof Kozlowski, Josh Poimboeuf, Andrew Morton, Petr Mladek,
	Jessica Yu, Larry Finger, zijun_hu, Baoquan He, Johannes Berg,
	Ian Abbott, virtualization, linux-fsdevel

On Tue, Aug 29, 2017 at 11:46:39AM +0000, Yang Zhang wrote:
> To reduce the cost of polling, we introduce three sysctls to control
> the poll time.

This commit does not describe in any way the fact that these knobs are
all for and only for PARAVIRT.

> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index bac23c1..67447b8 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -63,6 +63,9 @@ show up in /proc/sys/kernel:
>  - perf_event_max_stack
>  - perf_event_max_contexts_per_stack
>  - pid_max
> +- poll_grow                   [ X86 only ]
> +- poll_shrink                 [ X86 only ]
> +- poll_threshold_ns           [ X86 only ]

How about paravirt_ prefix?

>  - powersave-nap               [ PPC only ]
>  - printk
>  - printk_delay
> @@ -702,6 +705,28 @@ kernel tries to allocate a number starting from this one.
>  
>  ==============================================================
>  
> +poll_grow: (X86 only)
> +
> +This parameter is multiplied in the grow_poll_ns() to increase the poll time.
> +By default, the values is 2.
> +
> +==============================================================
> +poll_shrink: (X86 only)
> +
> +This parameter is divided in the shrink_poll_ns() to reduce the poll time.
> +By default, the values is 2.

These don't say anything about this being related to paravirt.

> +
> +==============================================================
> +poll_threshold_ns: (X86 only)
> +
> +This parameter controls the max poll time before entering real idle path.
> +This parameter is expected to take effect only when running inside a VM.
> +It would make no sense to turn on it in bare mental.

turn it on? Or modify, or use it?

> +By default, the values is 0 means don't poll. It is recommended to change
> +the value to non-zero if running latency-bound workloads inside VM.
> +
> +==============================================================
> +
>  powersave-nap: (PPC only)
>  
>  If set, Linux-PPC will use the 'nap' mode of powersaving,
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index a11b2c2..0b92f8f 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -318,6 +318,10 @@ struct pv_idle_ops pv_idle_ops = {
>  	.update = paravirt_nop,
>  };
>  
> +unsigned long poll_threshold_ns;
> +unsigned int poll_shrink = 2;
> +unsigned int poll_grow = 2;
> +
>  __visible struct pv_irq_ops pv_irq_ops = {
>  	.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
>  	.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index bd6d96c..6cb2820 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -462,6 +462,12 @@ extern __scanf(2, 0)
>  
>  extern bool crash_kexec_post_notifiers;
>  
> +#ifdef CONFIG_PARAVIRT
> +extern unsigned long poll_threshold_ns;
> +extern unsigned int poll_shrink;
> +extern unsigned int poll_grow;
> +#endif

What are these if we are on bare metal but support
paravirt? The docs are not clear in any way about it.

> +
>  /*
>   * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
>   * holds a CPU number which is executing panic() currently. A value of
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6648fbb..9b86a8f 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1229,6 +1229,29 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>  		.extra2		= &one,
>  	},
>  #endif
> +#ifdef CONFIG_PARAVIRT
> +	{
> +		.procname       = "halt_poll_threshold",
> +		.data           = &poll_threshold_ns,
> +		.maxlen         = sizeof(unsigned long),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec,
> +	},

proc_doulongvec_minmax() may be more appropriate here? What is the range?
Also, what existing user did you see that used proc_dointvec() but had an
unsigned long? If the tests below reveal this is invalid, we would need to
go fix them as well.

> +	{
> +		.procname       = "halt_poll_grow",
> +		.data           = &poll_grow,
> +		.maxlen         = sizeof(unsigned int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec,
> +	},
> +	{
> +		.procname       = "halt_poll_shrink",
> +		.data           = &poll_shrink,
> +		.maxlen         = sizeof(unsigned int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec,

We now have the much more appropriate proc_douintvec() and
proc_douintvec_minmax(). It seems you want to support only unsigned int
for two of these, so those would be more appropriate.
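
That is, something along these lines (a sketch of the corrected
entries; bounds omitted):

	{
		.procname	= "halt_poll_threshold",
		.data		= &poll_threshold_ns,
		.maxlen		= sizeof(unsigned long),
		.mode		= 0644,
		.proc_handler	= proc_doulongvec_minmax,
	},
	{
		.procname	= "halt_poll_grow",
		.data		= &poll_grow,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_douintvec,
	},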

To test things out you can replicate your values by using or extending
the test driver lib/test_sysctl.c (CONFIG_TEST_SYSCTL=m) with your types
of values, and then fuzz-test them with arbitrary values to ensure they
accept only the input you really think they should get.

Once done please extend the script:

tools/testing/selftests/sysctl/sysctl.sh

to cover the different tests you have run; we want tests to be generic.

  Luis
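
For reference, the grow_poll_ns()/shrink_poll_ns() helpers that the
quoted documentation mentions are introduced by a later patch in the
series and are not quoted in this sub-thread. Their shape is roughly
(a sketch; the max bound is illustrative):

	static unsigned long grow_poll_ns(unsigned long old, unsigned int grow,
					  unsigned long max)
	{
		unsigned long val = old * grow;

		return min(val, max);
	}

	static unsigned long shrink_poll_ns(unsigned long old, unsigned int shrink)
	{
		return shrink ? old / shrink : 0;
	}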


* Re: [RFC PATCH v2 1/7] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-08-29 13:55     ` Konrad Rzeszutek Wilk
  (?)
@ 2017-08-30  7:33     ` Juergen Gross
  -1 siblings, 0 replies; 47+ messages in thread
From: Juergen Gross @ 2017-08-30  7:33 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Yang Zhang, xen-devel, Boris Ostrovsky
  Cc: linux-kernel, kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar,
	dmatlack, agraf, peterz, linux-doc, Quan Xu, Jeremy Fitzhardinge,
	Chris Wright, Alok Kataria, Rusty Russell, Ingo Molnar,
	H. Peter Anvin, x86, Andy Lutomirski, Kirill A. Shutemov,
	Pan Xinhui, Kees Cook, virtualization

On 29/08/17 15:55, Konrad Rzeszutek Wilk wrote:
> On Tue, Aug 29, 2017 at 11:46:35AM +0000, Yang Zhang wrote:
>> So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in
>> the idle path and polls for a while before we enter the real idle
>> state.
>>
>> In virtualization, the idle path includes several heavy operations,
>> including timer access (LAPIC timer or TSC deadline timer), which hurt
>> performance, especially for latency-intensive workloads like message
>> passing tasks. The cost mainly comes from the vmexit, which is a
>> hardware context switch between the VM and the hypervisor. Our solution
>> is to poll for a while and not enter the real idle path if we can get a
>> schedule event during polling.
>>
>> Polling may waste CPU cycles, so we adopt a smart polling mechanism to
>> reduce useless polling.
>>
>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>> Cc: Jeremy Fitzhardinge <jeremy@goop.org>
>> Cc: Chris Wright <chrisw@sous-sol.org>
>> Cc: Alok Kataria <akataria@vmware.com>
>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: x86@kernel.org
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: virtualization@lists.linux-foundation.org
>> Cc: linux-kernel@vger.kernel.org
> 
> Adding xen-devel.
> 
> Juergen, we really should replace Jeremy's name with xen-devel or
> your name..

I wouldn't mind being added. What does Jeremy think of being removed?

> Wasn't there a patch by you that took over some of the 
> maintainership?

I added include/linux/hypervisor.h to the PARAVIRT section and offered
to maintain it in case the PARAVIRT maintainers didn't want to.


Juergen

* Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
  2017-08-29 12:45   ` Peter Zijlstra
@ 2017-09-01  5:57     ` Quan Xu
  2017-09-14  8:41       ` Quan Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Quan Xu @ 2017-09-01  5:57 UTC (permalink / raw)
  To: Peter Zijlstra, Yang Zhang
  Cc: linux-kernel, kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar,
	dmatlack, agraf, linux-doc, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, Kyle Huey, Andy Lutomirski, Len Brown

on 2017/8/29 20:45, Peter Zijlstra wrote:

> On Tue, Aug 29, 2017 at 11:46:37AM +0000, Yang Zhang wrote:
>> Add poll in do_idle. For a UP VM, if there is a running task, it will
>> not go into the idle path, so we only enable poll in SMP VMs.
>>
>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
> Broken SoB chain.
   Peter, I can't follow 'Broken SoB chain'.. could you explain a bit more about it?

   -Quan

>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index 6c23e30..b374744 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
>>   }
>>   
>>   /* Weak implementations for optional arch specific functions */
>> +void __weak arch_cpu_idle_poll(void) { }
>>   void __weak arch_cpu_idle_prepare(void) { }
>>   void __weak arch_cpu_idle_enter(void) { }
> And not a word on why we need a new arch hook. What's wrong with
> arch_cpu_idle_enter() for instance?


* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-08-29 11:58 ` [RFC PATCH v2 0/7] x86/idle: add halt poll support Alexander Graf
@ 2017-09-01  6:21   ` Yang Zhang
  0 siblings, 0 replies; 47+ messages in thread
From: Yang Zhang @ 2017-09-01  6:21 UTC (permalink / raw)
  To: Alexander Graf, linux-kernel
  Cc: kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar, dmatlack, peterz,
	linux-doc

On 2017/8/29 19:58, Alexander Graf wrote:
> On 08/29/2017 01:46 PM, Yang Zhang wrote:
>> Some latency-intensive workloads will see an obvious performance
>> drop when running inside a VM. The main reason is that the overhead
>> is amplified when running inside a VM. The largest cost I have seen is
>> in the idle path.
>>
>> This patch introduces a new mechanism to poll for a while before
>> entering the idle state. If a reschedule is needed during the poll,
>> then we don't need to go through the heavy overhead path.
>>
>> Here is the data we get when running benchmark contextswitch to measure
>> the latency(lower is better):
>>
>>     1. w/o patch:
>>        2493.14 ns/ctxsw -- 200.3 %CPU
>>     2. w/ patch:
>>        halt_poll_threshold=10000 -- 1485.96ns/ctxsw -- 201.0 %CPU
>>        halt_poll_threshold=20000 -- 1391.26 ns/ctxsw -- 200.7 %CPU
>>        halt_poll_threshold=30000 -- 1488.55 ns/ctxsw -- 200.1 %CPU
>>        halt_poll_threshold=500000 -- 1159.14 ns/ctxsw -- 201.5 %CPU
>>     3. kvm dynamic poll
>>        halt_poll_ns=10000 -- 2296.11 ns/ctxsw -- 201.2 %CPU
>>        halt_poll_ns=20000 -- 2599.7 ns/ctxsw -- 201.7 %CPU
>>        halt_poll_ns=30000 -- 2588.68 ns/ctxsw -- 211.6 %CPU
>>        halt_poll_ns=500000 -- 2423.20 ns/ctxsw -- 229.2 %CPU
>>     4. idle=poll
>>        2050.1 ns/ctxsw -- 1003 %CPU
>>     5. idle=mwait
>>        2188.06 ns/ctxsw -- 206.3 %CPU
> 
> Could you please try to create another metric for guest-initiated,
> host-aborted mwait?
> 
> For a quick benchmark, reserve 4 registers for a magic value, set them 
> to the magic value before you enter MWAIT in the guest. Then allow 
> native MWAIT execution on the host. If you see the guest wants to enter 

I guess you want to allow native MWAIT execution in the guest, not on the host?

> with the 4 registers containing the magic contents and no events are 
> pending, directly go into the vcpu block function on the host.

Mmm.. it is not very clear to me. If the guest executes MWAIT without a
vmexit, how can we check the registers?

> 
> That way any time a guest gets naturally aborted while in mwait, it will 
> only reenter mwait when an event actually occurred. While the guest is 
> normally running (and nobody else wants to run on the host), we just 
> stay in guest context, but with a sleeping CPU.
> 
> Overall, that might give us even better performance, as it allows for 
> turbo boost and HT to work properly.

In our testing, we have enough cores (32 cores) but only 10 vCPUs, so in
the best case, we may see the same performance as polling.

-- 
Yang
Alibaba Cloud Computing


* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-08-29 14:02 ` Wanpeng Li
  2017-08-29 14:27   ` Konrad Rzeszutek Wilk
  2017-08-29 14:36   ` Michael S. Tsirkin
@ 2017-09-01  6:32   ` Yang Zhang
  2017-09-01  6:52     ` Wanpeng Li
  2017-09-01  6:44   ` Yang Zhang
  3 siblings, 1 reply; 47+ messages in thread
From: Yang Zhang @ 2017-09-01  6:32 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: linux-kernel, kvm, Wanpeng Li, Michael S. Tsirkin, Paolo Bonzini,
	Thomas Gleixner, Radim Krcmar, David Matlack, Alexander Graf,
	Peter Zijlstra, linux-doc

On 2017/8/29 22:02, Wanpeng Li wrote:
>> Here is the data we get when running benchmark netperf:
>>
>>     2. w/ patch:
>>        halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
>>        halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
>>        halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
>>        halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
>>        halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU
>>
>>     3. kvm dynamic poll
>>        halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
>>        halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
>>        halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
>>        halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
>>        halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU
>>
> 
> Actually I'm not sure how much sense it makes to introduce this pv
> stuff and to duplicate the adaptive halt-polling logic that has
> already been done in kvm, w/o obvious benefit for real workload like
> netperf. In addition, as you mentioned offline to me, enabling both this
> patchset and the adaptive halt-polling logic in kvm simultaneously can
> result in more cpu power consumption. I remember that David from

If we use polling on the KVM side, it will consume more cpu than on the
guest side. If we use both, then we get the same performance as enabling
only guest-side polling, but it will cost more cpu because of the
KVM-side polling. It means we should disable the KVM-side polling, since
it cannot give much improvement over the guest side and only consumes
more cpu and adds latency.

> Google mentioned at KVM Forum that Windows Event Objects can get a 2x
> latency improvement, which means that the adaptive halt-polling
> in kvm should be enabled by default. So if windows guests and
> linux guests are mixed on the same host, then this patchset will
> result in more cpu power consumption if the customer enables the
> polling in the linux guests. Anyway, if the patchset is finally
> accepted by the maintainers, I will introduce a generic adaptive
> halt-polling framework in kvm to avoid the duplicate logic.

We will add more conditions to the current algorithm in the future. But
it's ok to use the one copy of the logic for now; we will do it in the
next version.


-- 
Yang
Alibaba Cloud Computing


* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-08-29 14:02 ` Wanpeng Li
                     ` (2 preceding siblings ...)
  2017-09-01  6:32   ` Yang Zhang
@ 2017-09-01  6:44   ` Yang Zhang
  2017-09-01  6:58     ` Wanpeng Li
  3 siblings, 1 reply; 47+ messages in thread
From: Yang Zhang @ 2017-09-01  6:44 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: linux-kernel, kvm, Wanpeng Li, Michael S. Tsirkin, Paolo Bonzini,
	Thomas Gleixner, Radim Krcmar, David Matlack, Alexander Graf,
	Peter Zijlstra, linux-doc

On 2017/8/29 22:02, Wanpeng Li wrote:
>> Here is the data we get when running benchmark netperf:
>>
>>     2. w/ patch:
>>        halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
>>        halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
>>        halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
>>        halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
>>        halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU
>>
>>     3. kvm dynamic poll
>>        halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
>>        halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
>>        halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
>>        halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
>>        halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU
>>
> 
> Actually I'm not sure how much sense it makes to introduce this pv
> stuff and the duplicate adaptive halt-polling logic as what has
> already been done in kvm w/o obvious benefit for real workload like
> netperf. In addition, as you mentioned offline to me, enable both the
> patchset and the adaptive halt-polling logic in kvm simultaneously can
> result in more cpu power consumption. I remembered that David from

No. If we use poll on the KVM side, it will consume more CPU than on
the guest side. If we use both, then we get the same performance as
enabling only guest-side polling, but it costs more CPU because of the
polling on the KVM side. It means we should disable KVM-side polling,
since it cannot give much improvement over guest-side polling while
consuming more CPU.

> Google mentioned that Windows Event Objects can get 2x latency
> improvement in KVM FORUM, which means that the adaptive halt-polling
> in kvm should be enabled by default. So if the windows guests and
> linux guests are mixed on the same host, then this patchset will
> result in more cpu power consumption if the customer enable the
> polling in the linux guest. Anyway, if the patchset is finally

As I said last time, there are already users running idle=poll in their
VMs, and you *cannot* prevent them from doing so. This patch provides a
better solution than unconditional polling; we did not introduce
anything worse.
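For context, what makes it better than unconditional polling is that
the window adapts, in the spirit of KVM's host-side halt_poll_ns
heuristic: grow the window after a near miss, shrink it when polling
was clearly wasted. A minimal sketch of that style of heuristic follows;
the constants and names are illustrative, not taken from the series:

	/* Illustrative grow/shrink heuristic; not the series' exact algorithm. */
	static u64 poll_window_ns = 10000;

	static void update_poll_window(u64 idle_ns, u64 threshold_ns)
	{
		if (idle_ns <= poll_window_ns)
			return;			/* polling paid off: keep the window */

		if (idle_ns < threshold_ns)	/* near miss: a longer poll would
						 * have caught this wakeup, so grow */
			poll_window_ns = min(poll_window_ns * 2, threshold_ns);
		else				/* long idle: polling was wasted, shrink */
			poll_window_ns /= 2;
	}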

> acceptable by maintainer, I will introduce the generic adaptive
> halt-polling framework in kvm to avoid the duplicate logic.

We will add more conditions to the current algorithm in the future. But
it is fine to use one copy of the code for now; we will do that in the
next version.

> 
> Regards,
> Wanpeng Li
> 


-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
  2017-08-29 14:39   ` Borislav Petkov
@ 2017-09-01  6:49     ` Quan Xu
  2017-09-29 10:39       ` Quan Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Quan Xu @ 2017-09-01  6:49 UTC (permalink / raw)
  To: Borislav Petkov, Yang Zhang
  Cc: linux-kernel, kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar,
	dmatlack, agraf, peterz, linux-doc, Ingo Molnar, H. Peter Anvin,
	x86, Kyle Huey, Andy Lutomirski, Len Brown

on 2017/8/29 22:39, Borislav Petkov wrote:
> On Tue, Aug 29, 2017 at 11:46:37AM +0000, Yang Zhang wrote:
>> Add polling in do_idle. For a UP VM, if there is a running task, it will
>> not go into the idle path, so we only enable polling for SMP VMs.
>>
>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: x86@kernel.org
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Kyle Huey <me@kylehuey.com>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Len Brown <len.brown@intel.com>
>> Cc: linux-kernel@vger.kernel.org
>> ---
>>   arch/x86/kernel/process.c | 7 +++++++
>>   kernel/sched/idle.c       | 2 ++
>>   2 files changed, 9 insertions(+)
>>
>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>> index 3ca1980..def4113 100644
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -332,6 +332,13 @@ void arch_cpu_idle(void)
>>   	x86_idle();
>>   }
>>   
>> +#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
>> +void arch_cpu_idle_poll(void)
>> +{
>> +	paravirt_idle_poll();
>> +}
>> +#endif
> So this will get called on any system which has CONFIG_PARAVIRT enabled
> *even* if they're not running any guests.
>
> Huh?
    Borislav,
    yes, this will get called on any system which has CONFIG_PARAVIRT
enabled.

    But if it is not running as a guest, the callback is paravirt_nop(),
which IIUC is similar to the other paravirt_* hooks, such as
paravirt_pgd_free().
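
To make that concrete: pv_idle_ops.poll defaults to paravirt_nop (as
patch 1 below shows), and a real poller such as patch 2's kvm_idle_poll
would only be installed when the kernel detects it is running on KVM. A
sketch of that registration; the init function name here is made up for
illustration:

	/* Sketch only; kvm_guest_idle_init() is an illustrative name. */
	static void __init kvm_guest_idle_init(void)
	{
		if (!kvm_para_available())
			return;		/* bare metal / non-KVM: poll stays paravirt_nop */

		pv_idle_ops.poll = kvm_idle_poll;	/* patch 2's poll routine */
	}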

  - Quan

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 1/7] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-08-29 13:55     ` Konrad Rzeszutek Wilk
                       ` (2 preceding siblings ...)
  (?)
@ 2017-09-01  6:50     ` Yang Zhang
  -1 siblings, 0 replies; 47+ messages in thread
From: Yang Zhang @ 2017-09-01  6:50 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, xen-devel, jgross, Boris Ostrovsky
  Cc: linux-kernel, kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar,
	dmatlack, agraf, peterz, linux-doc, Quan Xu, Jeremy Fitzhardinge,
	Chris Wright, Alok Kataria, Rusty Russell, Ingo Molnar,
	H. Peter Anvin, x86, Andy Lutomirski, Kirill A. Shutemov,
	Pan Xinhui, Kees Cook, virtualization

On 2017/8/29 21:55, Konrad Rzeszutek Wilk wrote:
> On Tue, Aug 29, 2017 at 11:46:35AM +0000, Yang Zhang wrote:
>> So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in
>> the idle path and polls for a while before we enter the real idle
>> state.
>>
>> In virtualization, the idle path includes several heavy operations,
>> including timer access (LAPIC timer or TSC deadline timer), which hurt
>> performance especially for latency-intensive workloads like message
>> passing tasks. The cost mainly comes from the vmexit, which is a
>> hardware context switch between the VM and the hypervisor. Our solution
>> is to poll for a while and not enter the real idle path if we get a
>> schedule event during polling.
>>
>> Polling may waste CPU, so we adopt a smart polling mechanism to
>> reduce useless polling.
>>
>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>> Cc: Jeremy Fitzhardinge <jeremy@goop.org>
>> Cc: Chris Wright <chrisw@sous-sol.org>
>> Cc: Alok Kataria <akataria@vmware.com>
>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: x86@kernel.org
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: virtualization@lists.linux-foundation.org
>> Cc: linux-kernel@vger.kernel.org
> 
> Adding xen-devel.
> 
> Juergen, we really should replace Jeremy's name with xen-devel or
> your name. Wasn't there a patch by you that took over some of the
> maintainership?

Hi Konrad, I didn't test it on the Xen side since I don't have the
environment, but I can add it for Xen in the next version if you think
it is useful for Xen as well.

> 
>> ---
>>   arch/x86/include/asm/paravirt.h       | 5 +++++
>>   arch/x86/include/asm/paravirt_types.h | 6 ++++++
>>   arch/x86/kernel/paravirt.c            | 6 ++++++
>>   3 files changed, 17 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
>> index 9ccac19..6d46760 100644
>> --- a/arch/x86/include/asm/paravirt.h
>> +++ b/arch/x86/include/asm/paravirt.h
>> @@ -202,6 +202,11 @@ static inline unsigned long long paravirt_read_pmc(int counter)
>>   
>>   #define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter))
>>   
>> +static inline void paravirt_idle_poll(void)
>> +{
>> +	PVOP_VCALL0(pv_idle_ops.poll);
>> +}
>> +
>>   static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned entries)
>>   {
>>   	PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>> index 9ffc36b..cf45726 100644
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -324,6 +324,10 @@ struct pv_lock_ops {
>>   	struct paravirt_callee_save vcpu_is_preempted;
>>   } __no_randomize_layout;
>>   
>> +struct pv_idle_ops {
>> +	void (*poll)(void);
>> +} __no_randomize_layout;
>> +
>>   /* This contains all the paravirt structures: we get a convenient
>>    * number for each function using the offset which we use to indicate
>>    * what to patch. */
>> @@ -334,6 +338,7 @@ struct paravirt_patch_template {
>>   	struct pv_irq_ops pv_irq_ops;
>>   	struct pv_mmu_ops pv_mmu_ops;
>>   	struct pv_lock_ops pv_lock_ops;
>> +	struct pv_idle_ops pv_idle_ops;
>>   } __no_randomize_layout;
>>   
>>   extern struct pv_info pv_info;
>> @@ -343,6 +348,7 @@ struct paravirt_patch_template {
>>   extern struct pv_irq_ops pv_irq_ops;
>>   extern struct pv_mmu_ops pv_mmu_ops;
>>   extern struct pv_lock_ops pv_lock_ops;
>> +extern struct pv_idle_ops pv_idle_ops;
>>   
>>   #define PARAVIRT_PATCH(x)					\
>>   	(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
>> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
>> index bc0a849..1b5b247 100644
>> --- a/arch/x86/kernel/paravirt.c
>> +++ b/arch/x86/kernel/paravirt.c
>> @@ -128,6 +128,7 @@ static void *get_call_destination(u8 type)
>>   #ifdef CONFIG_PARAVIRT_SPINLOCKS
>>   		.pv_lock_ops = pv_lock_ops,
>>   #endif
>> +		.pv_idle_ops = pv_idle_ops,
>>   	};
>>   	return *((void **)&tmpl + type);
>>   }
>> @@ -312,6 +313,10 @@ struct pv_time_ops pv_time_ops = {
>>   	.steal_clock = native_steal_clock,
>>   };
>>   
>> +struct pv_idle_ops pv_idle_ops = {
>> +	.poll = paravirt_nop,
>> +};
>> +
>>   __visible struct pv_irq_ops pv_irq_ops = {
>>   	.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
>>   	.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
>> @@ -471,3 +476,4 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
>>   EXPORT_SYMBOL    (pv_mmu_ops);
>>   EXPORT_SYMBOL_GPL(pv_info);
>>   EXPORT_SYMBOL    (pv_irq_ops);
>> +EXPORT_SYMBOL    (pv_idle_ops);
>> -- 
>> 1.8.3.1
>>


-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-09-01  6:32   ` Yang Zhang
@ 2017-09-01  6:52     ` Wanpeng Li
  0 siblings, 0 replies; 47+ messages in thread
From: Wanpeng Li @ 2017-09-01  6:52 UTC (permalink / raw)
  To: Yang Zhang
  Cc: linux-kernel, kvm, Wanpeng Li, Michael S. Tsirkin, Paolo Bonzini,
	Thomas Gleixner, Radim Krcmar, David Matlack, Alexander Graf,
	Peter Zijlstra, linux-doc

2017-09-01 14:32 GMT+08:00 Yang Zhang <yang.zhang.wz@gmail.com>:
> On 2017/8/29 22:02, Wanpeng Li wrote:
>>>
>>> Here is the data we get when running benchmark netperf:
>>>
>>>     2. w/ patch:
>>>        halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
>>>        halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
>>>        halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
>>>        halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
>>>        halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU
>>>
>>>     3. kvm dynamic poll
>>>        halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
>>>        halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
>>>        halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
>>>        halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
>>>        halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU
>>>
>>
>> Actually I'm not sure how much sense it makes to introduce this pv
>> stuff and the duplicate adaptive halt-polling logic as what has
>> already been done in kvm w/o obvious benefit for real workload like
>> netperf. In addition, as you mentioned offline to me, enable both the
>> patchset and the adaptive halt-polling logic in kvm simultaneously can
>> result in more cpu power consumption. I remembered that David from
>
>
> If we use poll on the KVM side, it will consume more CPU than on the
> guest side. If we use both, then we get the same performance as enabling
> only guest-side polling, but it costs more CPU because of the polling on
> the KVM side. It means we should disable KVM-side polling, since it
> cannot give much improvement over guest-side polling while consuming
> more CPU and adding latency.

How can message-passing workloads in a Windows guest survive?

Regards,
Wanpeng Li

>
>> Google mentioned that Windows Event Objects can get 2x latency
>> improvement in KVM FORUM, which means that the adaptive halt-polling
>> in kvm should be enabled by default. So if the windows guests and
>> linux guests are mixed on the same host, then this patchset will
>> result in more cpu power consumption if the customer enable the
>> polling in the linux guest. Anyway, if the patchset is finally
>> acceptable by maintainer, I will introduce the generic adaptive
>> halt-polling framework in kvm to avoid the duplicate logic.
>
>
> We will add more conditions to the current algorithm in the future. But
> it is fine to use the one copy for now; we will do that in the next
> version.
>
>
>
> --
> Yang
> Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-09-01  6:44   ` Yang Zhang
@ 2017-09-01  6:58     ` Wanpeng Li
  2017-09-01  7:53       ` Yang Zhang
  0 siblings, 1 reply; 47+ messages in thread
From: Wanpeng Li @ 2017-09-01  6:58 UTC (permalink / raw)
  To: Yang Zhang
  Cc: linux-kernel, kvm, Wanpeng Li, Michael S. Tsirkin, Paolo Bonzini,
	Thomas Gleixner, Radim Krcmar, David Matlack, Alexander Graf,
	Peter Zijlstra, linux-doc

2017-09-01 14:44 GMT+08:00 Yang Zhang <yang.zhang.wz@gmail.com>:
> On 2017/8/29 22:02, Wanpeng Li wrote:
>>>
>>> Here is the data we get when running benchmark netperf:
>>>
>>>     2. w/ patch:
>>>        halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
>>>        halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
>>>        halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
>>>        halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
>>>        halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU
>>>
>>>     3. kvm dynamic poll
>>>        halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
>>>        halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
>>>        halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
>>>        halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
>>>        halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU
>>>
>>
>> Actually I'm not sure how much sense it makes to introduce this pv
>> stuff and the duplicate adaptive halt-polling logic as what has
>> already been done in kvm w/o obvious benefit for real workload like
>> netperf. In addition, as you mentioned offline to me, enable both the
>> patchset and the adaptive halt-polling logic in kvm simultaneously can
>> result in more cpu power consumption. I remembered that David from
>
>
> No. If we use poll on the KVM side, it will consume more CPU than on
> the guest side. If we use both, then we get the same performance as
> enabling only guest-side polling, but it costs more CPU because of the
> polling on the KVM side. It means we should disable KVM-side polling,
> since it cannot give much improvement over guest-side polling while
> consuming more CPU.

The customers should have enough knowledge about what the tuning knobs
you expose actually mean.

Regards,
Wanpeng Li

>
>> Google mentioned that Windows Event Objects can get 2x latency
>> improvement in KVM FORUM, which means that the adaptive halt-polling
>> in kvm should be enabled by default. So if the windows guests and
>> linux guests are mixed on the same host, then this patchset will
>> result in more cpu power consumption if the customer enable the
>> polling in the linux guest. Anyway, if the patchset is finally
>
>
> As I said last time, there are already users running idle=poll in their
> VMs, and you *cannot* prevent them from doing so. This patch provides a
> better solution than unconditional polling; we did not introduce
> anything worse.
>
>> acceptable by maintainer, I will introduce the generic adaptive
>> halt-polling framework in kvm to avoid the duplicate logic.
>
>
> We will add more conditions to the current algorithm in the future. But
> it is fine to use one copy of the code for now; we will do that in the
> next version.
>
>>
>> Regards,
>> Wanpeng Li
>>
>
>
> --
> Yang
> Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle
  2017-08-29 12:46   ` Peter Zijlstra
@ 2017-09-01  7:30     ` Yang Zhang
  2017-09-29 10:29     ` Quan Xu
  1 sibling, 0 replies; 47+ messages in thread
From: Yang Zhang @ 2017-09-01  7:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar,
	dmatlack, agraf, linux-doc, Quan Xu, Ingo Molnar

On 2017/8/29 20:46, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 11:46:41AM +0000, Yang Zhang wrote:
>> In ttwu_do_wakeup, avg_idle is updated on wakeup from idle. Here we
>> just reuse this logic to update the poll time. It may be a little late
>> to update the poll in ttwu_do_wakeup, but test results show no obvious
>> performance gap compared with updating the poll in the irq handler.
>>
>> One problem is that idle_stamp is only used by the CFS scheduler. But
>> that is OK, since CFS is the default scheduling policy and considering
>> only it should be enough.
>>
>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
> 
> Same broken SoB chain, and not a useful word on why you need to adjust
> this crap to begin with. What you want that poll duration to be related
> to is the cost of a VMEXIT/VMENTER cycle, not however long we happened
> to be idle.

Actually, we should compare the cost of a VMEXIT/VMENTER cycle with the
real duration spent in idle. We have a rough number for the cost of one
VMEXIT/VMENTER (about 2k~4k cycles, depending on the underlying CPU),
and the idle path introduces 4~5 VMEXIT/VMENTER cycles, which adds about
7us of latency on average. So we set the poll duration to 10us by
default.
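
Spelling that estimate out (a back-of-the-envelope check; the ~2.5 GHz
clock is an assumed figure, not stated in the thread):

	4~5 exits * 2000~4000 cycles/exit = 8000 ~ 20000 cycles
	at ~2.5 GHz: 8000 / 2.5e9 ≈ 3.2 us, 20000 / 2.5e9 ≈ 8 us

so the upper part of the range matches the "about 7us" figure, and a
10us window covers it.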

Another problem is that there is no good way to measure the duration
spent in idle; avg_idle is the only way I have found so far. Do you have
any suggestion for doing it better? Thanks.
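
For reference, the avg_idle bookkeeping being reused sits in the wakeup
path and looks roughly like this in kernels of that era (reconstructed
from memory of kernel/sched/core.c around v4.13, so treat it as a
sketch):

	/* ttwu_do_wakeup(), sketch: */
	if (rq->idle_stamp) {
		u64 delta = rq_clock(rq) - rq->idle_stamp;
		u64 max = 2 * rq->max_idle_balance_cost;

		update_avg(&rq->avg_idle, delta);	/* avg += (delta - avg) / 8 */
		if (rq->avg_idle > max)
			rq->avg_idle = max;

		rq->idle_stamp = 0;
	}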

-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-09-01  6:58     ` Wanpeng Li
@ 2017-09-01  7:53       ` Yang Zhang
  0 siblings, 0 replies; 47+ messages in thread
From: Yang Zhang @ 2017-09-01  7:53 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: linux-kernel, kvm, Wanpeng Li, Michael S. Tsirkin, Paolo Bonzini,
	Thomas Gleixner, Radim Krcmar, David Matlack, Alexander Graf,
	Peter Zijlstra, linux-doc

On 2017/9/1 14:58, Wanpeng Li wrote:
> 2017-09-01 14:44 GMT+08:00 Yang Zhang <yang.zhang.wz@gmail.com>:
>> On 2017/8/29 22:02, Wanpeng Li wrote:
>>>>
>>>> Here is the data we get when running benchmark netperf:
>>>>
>>>>      2. w/ patch:
>>>>         halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
>>>>         halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
>>>>         halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
>>>>         halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
>>>>         halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU
>>>>
>>>>      3. kvm dynamic poll
>>>>         halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
>>>>         halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
>>>>         halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
>>>>         halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
>>>>         halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU
>>>>
>>>
>>> Actually I'm not sure how much sense it makes to introduce this pv
>>> stuff and the duplicate adaptive halt-polling logic as what has
>>> already been done in kvm w/o obvious benefit for real workload like
>>> netperf. In addition, as you mentioned offline to me, enable both the
>>> patchset and the adaptive halt-polling logic in kvm simultaneously can
>>> result in more cpu power consumption. I remembered that David from
>>
>>
>> No. If we use poll on the KVM side, it will consume more CPU than on
>> the guest side. If we use both, then we get the same performance as
>> enabling only guest-side polling, but it costs more CPU because of the
>> polling on the KVM side. It means we should disable KVM-side polling,
>> since it cannot give much improvement over guest-side polling while
>> consuming more CPU.
> 
> The customers should have enough knowledge about what the tuning knobs
> you expose actually mean.

We have applied this patch to a customized kernel for some real
customers and got positive feedback from them, since the CPU no longer
runs at 100% even when there is no task running. This also lets them
give more CPUs to other tasks and reduces the power consumption in
their racks. Don't you think that is better?

> 
> Regards,
> Wanpeng Li
> 
>>
>>> Google mentioned that Windows Event Objects can get 2x latency
>>> improvement in KVM FORUM, which means that the adaptive halt-polling
>>> in kvm should be enabled by default. So if the windows guests and
>>> linux guests are mixed on the same host, then this patchset will
>>> result in more cpu power consumption if the customer enable the
>>> polling in the linux guest. Anyway, if the patchset is finally
>>
>>
>> As I said last time, there are already users running idle=poll in their
>> VMs, and you *cannot* prevent them from doing so. This patch provides a
>> better solution than unconditional polling; we did not introduce
>> anything worse.
>>
>>> acceptable by maintainer, I will introduce the generic adaptive
>>> halt-polling framework in kvm to avoid the duplicate logic.
>>
>>
>> We will add more conditions to the current algorithm in the future. But
>> it is fine to use one copy of the code for now; we will do that in the
>> next version.
>>
>>>
>>> Regards,
>>> Wanpeng Li
>>>
>>
>>
>> --
>> Yang
>> Alibaba Cloud Computing


-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-08-29 14:56 ` Michael S. Tsirkin
@ 2017-09-13 11:56   ` Yang Zhang
  2017-09-14  8:36     ` Quan Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Yang Zhang @ 2017-09-13 11:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, kvm, wanpeng.li, pbonzini, tglx, rkrcmar, dmatlack,
	agraf, peterz, linux-doc, Quan Xu

On 2017/8/29 22:56, Michael S. Tsirkin wrote:
> On Tue, Aug 29, 2017 at 11:46:34AM +0000, Yang Zhang wrote:
>> Some latency-intensive workload will see obviously performance
>> drop when running inside VM.
> 
> But are we trading a lot of CPU for a bit of lower latency?
> 
>> The main reason is that the overhead
>> is amplified when running inside VM. The most cost i have seen is
>> inside idle path.
>>
>> This patch introduces a new mechanism to poll for a while before
>> entering idle state. If schedule is needed during poll, then we
>> don't need to goes through the heavy overhead path.
> 
> Isn't it the job of an idle driver to find the best way to
> halt the CPU?
> 
> It looks like just by adding a cstate we can make it
> halt at higher latencies only. And at lower latencies,
> if it's doing a good job we can hopefully use mwait to
> stop the CPU.
> 
> In fact I have been experimenting with exactly that.
> Some initial results are encouraging but I could use help
> with testing and especially tuning. If you can help
> pls let me know!

Quan, can you help test it and report the results? Thanks.

-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-09-13 11:56   ` Yang Zhang
@ 2017-09-14  8:36     ` Quan Xu
  2017-09-14  9:19       ` Wanpeng Li
  0 siblings, 1 reply; 47+ messages in thread
From: Quan Xu @ 2017-09-14  8:36 UTC (permalink / raw)
  To: Yang Zhang, Michael S. Tsirkin
  Cc: linux-kernel, kvm, wanpeng.li, pbonzini, tglx, rkrcmar, dmatlack,
	agraf, peterz, linux-doc



on 2017/9/13 19:56, Yang Zhang wrote:
> On 2017/8/29 22:56, Michael S. Tsirkin wrote:
>> On Tue, Aug 29, 2017 at 11:46:34AM +0000, Yang Zhang wrote:
>>> Some latency-intensive workload will see obviously performance
>>> drop when running inside VM.
>>
>> But are we trading a lot of CPU for a bit of lower latency?
>>
>>> The main reason is that the overhead
>>> is amplified when running inside VM. The most cost i have seen is
>>> inside idle path.
>>>
>>> This patch introduces a new mechanism to poll for a while before
>>> entering idle state. If schedule is needed during poll, then we
>>> don't need to goes through the heavy overhead path.
>>
>> Isn't it the job of an idle driver to find the best way to
>> halt the CPU?
>>
>> It looks like just by adding a cstate we can make it
>> halt at higher latencies only. And at lower latencies,
>> if it's doing a good job we can hopefully use mwait to
>> stop the CPU.
>>
>> In fact I have been experimenting with exactly that.
>> Some initial results are encouraging but I could use help
>> with testing and especially tuning. If you can help
>> pls let me know!
>
> Quan, can you help test it and report the results? Thanks.
>

Hi MST,

I have tested the patch "intel_idle: add pv cstates when running on
kvm" on a recent host that allows guests to execute mwait without an
exit. I have also tested our patch "[RFC PATCH v2 0/7] x86/idle: add
halt poll support", upstream Linux, and idle=poll.

The following are the results (which look better than ever before, as I
ran the test cases on a more powerful machine):

For __netperf__, the first column is the transaction rate per second,
the second column is CPU utilization.

1. upstream linux

       28371.7 bits/s -- 76.6 %CPU

2. idle=poll

       34372 bit/s -- 999.3 %CPU

3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support",  with different 
values of parameter 'halt_poll_threshold':

       28362.7 bits/s -- 74.7  %CPU (halt_poll_threshold=10000)
       32949.5 bits/s -- 82.5  %CPU (halt_poll_threshold=20000)
       39717.9 bits/s -- 104.1 %CPU (halt_poll_threshold=30000)
       40137.9 bits/s -- 104.4 %CPU (halt_poll_threshold=40000)
       40079.8 bits/s -- 105.6 %CPU (halt_poll_threshold=50000)


4. "intel_idle: add pv cstates when running on kvm"

       33041.8 bits/s  -- 999.4 %CPU





For __ctxsw__, the first column is the time per process context switch,
the second column is CPU utilization.

1. upstream linux

       3624.19 ns/ctxsw -- 191.9 %CPU

2. idle=poll

       3419.66 ns/ctxsw -- 999.2 %CPU

3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different 
values of parameter 'halt_poll_threshold':

       1123.40 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=10000)
       1127.38 ns/ctxsw -- 199.7 %CPU (halt_poll_threshold=20000)
       1113.58 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=30000)
       1117.12 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=40000)
       1121.62 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=50000)

  4.  "intel_idle: add pv cstates when running on kvm"

       3427.59 ns/ctxsw -- 999.4 %CPU

-Quan

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
  2017-09-01  5:57     ` Quan Xu
@ 2017-09-14  8:41       ` Quan Xu
  2017-09-14  9:18         ` Borislav Petkov
  0 siblings, 1 reply; 47+ messages in thread
From: Quan Xu @ 2017-09-14  8:41 UTC (permalink / raw)
  To: Peter Zijlstra, Yang Zhang
  Cc: linux-kernel, kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar,
	dmatlack, agraf, linux-doc, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, Kyle Huey, Andy Lutomirski, Len Brown



on 2017/9/1 13:57, Quan Xu wrote:
> on 2017/8/29 20:45, Peter Zijlstra wrote:
>
>> On Tue, Aug 29, 2017 at 11:46:37AM +0000, Yang Zhang wrote:
>>> Add polling in do_idle. For a UP VM, if there is a running task, it will
>>> not go into the idle path, so we only enable polling for SMP VMs.
>>>
>>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>> Broken SoB chain.
>   Peter,  I can't follow 'Broken SoB chain'.. could you explain more 
> about it?
>
     Peter, Ping..

Quan



>   -Quan
>
>>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>>> index 6c23e30..b374744 100644
>>> --- a/kernel/sched/idle.c
>>> +++ b/kernel/sched/idle.c
>>> @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
>>>   }
>>>     /* Weak implementations for optional arch specific functions */
>>> +void __weak arch_cpu_idle_poll(void) { }
>>>   void __weak arch_cpu_idle_prepare(void) { }
>>>   void __weak arch_cpu_idle_enter(void) { }
>> And not a word on why we need a new arch hook. What's wrong with
>> arch_cpu_idle_enter() for instance?
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
  2017-09-14  8:41       ` Quan Xu
@ 2017-09-14  9:18         ` Borislav Petkov
  0 siblings, 0 replies; 47+ messages in thread
From: Borislav Petkov @ 2017-09-14  9:18 UTC (permalink / raw)
  To: Quan Xu
  Cc: Peter Zijlstra, Yang Zhang, linux-kernel, kvm, wanpeng.li, mst,
	pbonzini, tglx, rkrcmar, dmatlack, agraf, linux-doc, Ingo Molnar,
	H. Peter Anvin, x86, Kyle Huey, Andy Lutomirski, Len Brown

On Thu, Sep 14, 2017 at 04:41:39PM +0800, Quan Xu wrote:
> > > On Tue, Aug 29, 2017 at 11:46:37AM +0000, Yang Zhang wrote:
> > > > Add polling in do_idle. For a UP VM, if there is a running task, it
> > > > will not go into the idle path, so we only enable polling for SMP VMs.
> > > > 
> > > > Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
> > > > Signed-off-by: Quan Xu <quan.xu0@gmail.com>
> > > Broken SoB chain.
> >   Peter,  I can't follow 'Broken SoB chain'.. could you explain more
> > about it?
> > 
>     Peter, Ping..

The SOB chain needs to show the path from the author to the upstream
maintainer. Yours has Yang before you, which doesn't say what his role
is.

Read Documentation/process/submitting-patches.rst, section 11.
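
A hypothetical example of a chain that documents the handoff: when the
author and the person posting the patch differ, the changelog starts
with an explicit From: line naming the author, the author's
Signed-off-by comes first, and the submitter's comes last:

	From: Patch Author <author@example.com>

	<changelog text>

	Signed-off-by: Patch Author <author@example.com>
	Signed-off-by: Patch Submitter <submitter@example.com>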

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-09-14  8:36     ` Quan Xu
@ 2017-09-14  9:19       ` Wanpeng Li
  2017-09-14  9:40         ` Quan Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Wanpeng Li @ 2017-09-14  9:19 UTC (permalink / raw)
  To: Quan Xu
  Cc: Yang Zhang, Michael S. Tsirkin, linux-kernel, kvm, Wanpeng Li,
	Paolo Bonzini, Thomas Gleixner, Radim Krcmar, David Matlack,
	Alexander Graf, Peter Zijlstra, linux-doc

2017-09-14 16:36 GMT+08:00 Quan Xu <quan.xu0@gmail.com>:
>
>
> on 2017/9/13 19:56, Yang Zhang wrote:
>>
>> On 2017/8/29 22:56, Michael S. Tsirkin wrote:
>>>
>>> On Tue, Aug 29, 2017 at 11:46:34AM +0000, Yang Zhang wrote:
>>>>
>>>> Some latency-intensive workload will see obviously performance
>>>> drop when running inside VM.
>>>
>>>
>>> But are we trading a lot of CPU for a bit of lower latency?
>>>
>>>> The main reason is that the overhead
>>>> is amplified when running inside VM. The most cost i have seen is
>>>> inside idle path.
>>>>
>>>> This patch introduces a new mechanism to poll for a while before
>>>> entering idle state. If schedule is needed during poll, then we
>>>> don't need to goes through the heavy overhead path.
>>>
>>>
>>> Isn't it the job of an idle driver to find the best way to
>>> halt the CPU?
>>>
>>> It looks like just by adding a cstate we can make it
>>> halt at higher latencies only. And at lower latencies,
>>> if it's doing a good job we can hopefully use mwait to
>>> stop the CPU.
>>>
>>> In fact I have been experimenting with exactly that.
>>> Some initial results are encouraging but I could use help
>>> with testing and especially tuning. If you can help
>>> pls let me know!
>>
>>
>> Quan, can you help test it and report the results? Thanks.
>>
>
> Hi MST,
>
> I have tested the patch "intel_idle: add pv cstates when running on
> kvm" on a recent host that allows guests to execute mwait without an
> exit. I have also tested our patch "[RFC PATCH v2 0/7] x86/idle: add
> halt poll support", upstream Linux, and idle=poll.
>
> The following are the results (which look better than ever before, as I
> ran the test cases on a more powerful machine):
>
> For __netperf__, the first column is the transaction rate per second,
> the second column is CPU utilization.
>
> 1. upstream linux

This "upstream linux" means that disables the kvm adaptive
halt-polling after confirm with Xu Quan.

Regards,
Wanpeng Li

>
>       28371.7 bits/s -- 76.6 %CPU
>
> 2. idle=poll
>
>       34372 bit/s -- 999.3 %CPU
>
> 3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support",  with different
> values of parameter 'halt_poll_threshold':
>
>       28362.7 bits/s -- 74.7  %CPU (halt_poll_threshold=10000)
>       32949.5 bits/s -- 82.5  %CPU (halt_poll_threshold=20000)
>       39717.9 bits/s -- 104.1 %CPU (halt_poll_threshold=30000)
>       40137.9 bits/s -- 104.4 %CPU (halt_poll_threshold=40000)
>       40079.8 bits/s -- 105.6 %CPU (halt_poll_threshold=50000)
>
>
> 4. "intel_idle: add pv cstates when running on kvm"
>
>       33041.8 bits/s  -- 999.4 %CPU
>
>
>
>
>
> For __ctxsw__, the first column is the time per process context switch,
> the second column is CPU utilization.
>
> 1. upstream linux
>
>       3624.19 ns/ctxsw -- 191.9 %CPU
>
> 2. idle=poll
>
>       3419.66 ns/ctxsw -- 999.2 %CPU
>
> 3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different
> values of parameter 'halt_poll_threshold':
>
>       1123.40 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=10000)
>       1127.38 ns/ctxsw -- 199.7 %CPU (halt_poll_threshold=20000)
>       1113.58 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=30000)
>       1117.12 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=40000)
>       1121.62 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=50000)
>
>  4.  "intel_idle: add pv cstates when running on kvm"
>
>       3427.59 ns/ctxsw -- 999.4 %CPU
>
> -Quan

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
  2017-09-14  9:19       ` Wanpeng Li
@ 2017-09-14  9:40         ` Quan Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Quan Xu @ 2017-09-14  9:40 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Yang Zhang, Michael S. Tsirkin, linux-kernel, kvm, Wanpeng Li,
	Paolo Bonzini, Thomas Gleixner, Radim Krcmar, David Matlack,
	Alexander Graf, Peter Zijlstra, linux-doc



On 2017/9/14 17:19, Wanpeng Li wrote:
> 2017-09-14 16:36 GMT+08:00 Quan Xu <quan.xu0@gmail.com>:
>>
>> on 2017/9/13 19:56, Yang Zhang wrote:
>>> On 2017/8/29 22:56, Michael S. Tsirkin wrote:
>>>> On Tue, Aug 29, 2017 at 11:46:34AM +0000, Yang Zhang wrote:
>>>>> Some latency-intensive workload will see obviously performance
>>>>> drop when running inside VM.
>>>>
>>>> But are we trading a lot of CPU for a bit of lower latency?
>>>>
>>>>> The main reason is that the overhead
>>>>> is amplified when running inside VM. The most cost i have seen is
>>>>> inside idle path.
>>>>>
>>>>> This patch introduces a new mechanism to poll for a while before
>>>>> entering idle state. If schedule is needed during poll, then we
>>>>> don't need to goes through the heavy overhead path.
>>>>
>>>> Isn't it the job of an idle driver to find the best way to
>>>> halt the CPU?
>>>>
>>>> It looks like just by adding a cstate we can make it
>>>> halt at higher latencies only. And at lower latencies,
>>>> if it's doing a good job we can hopefully use mwait to
>>>> stop the CPU.
>>>>
>>>> In fact I have been experimenting with exactly that.
>>>> Some initial results are encouraging but I could use help
>>>> with testing and especially tuning. If you can help
>>>> pls let me know!
>>>
>>> Quan, can you help test it and report the results? Thanks.
>>>
>> Hi MST,
>>
>> I have tested the patch "intel_idle: add pv cstates when running on
>> kvm" on a recent host that allows guests to execute mwait without an
>> exit. I have also tested our patch "[RFC PATCH v2 0/7] x86/idle: add
>> halt poll support", upstream Linux, and idle=poll.
>>
>> The following are the results (which look better than ever before, as I
>> ran the test cases on a more powerful machine):
>>
>> For __netperf__, the first column is the transaction rate per second,
>> the second column is CPU utilization.
>>
>> 1. upstream linux
> This "upstream linux" means that disables the kvm adaptive
> halt-polling after confirm with Xu Quan.


upstream linux -- the source code is just upstream Linux, without our
patch or MST's patch.
Yes, we disable kvm halt-polling (halt_poll_ns=0) for _all_ of the
following cases.

Quan


> Regards,
> Wanpeng Li
>
>>        28371.7 bits/s -- 76.6 %CPU
>>
>> 2. idle=poll
>>
>>        34372 bit/s -- 999.3 %CPU
>>
>> 3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support",  with different
>> values of parameter 'halt_poll_threshold':
>>
>>        28362.7 bits/s -- 74.7  %CPU (halt_poll_threshold=10000)
>>        32949.5 bits/s -- 82.5  %CPU (halt_poll_threshold=20000)
>>        39717.9 bits/s -- 104.1 %CPU (halt_poll_threshold=30000)
>>        40137.9 bits/s -- 104.4 %CPU (halt_poll_threshold=40000)
>>        40079.8 bits/s -- 105.6 %CPU (halt_poll_threshold=50000)
>>
>>
>> 4. "intel_idle: add pv cstates when running on kvm"
>>
>>        33041.8 bits/s  -- 999.4 %CPU
>>
>>
>>
>>
>>
>> For __ctxsw__, the first column is the time per process context switch,
>> the second column is CPU utilization.
>>
>> 1. upstream linux
>>
>>        3624.19 ns/ctxsw -- 191.9 %CPU
>>
>> 2. idle=poll
>>
>>        3419.66 ns/ctxsw -- 999.2 %CPU
>>
>> 3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different
>> values of parameter 'halt_poll_threshold':
>>
>>        1123.40 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=10000)
>>        1127.38 ns/ctxsw -- 199.7 %CPU (halt_poll_threshold=20000)
>>        1113.58 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=30000)
>>        1117.12 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=40000)
>>        1121.62 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=50000)
>>
>>   4.  "intel_idle: add pv cstates when running on kvm"
>>
>>        3427.59 ns/ctxsw -- 999.4 %CPU
>>
>> -Quan

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle
  2017-08-29 12:46   ` Peter Zijlstra
  2017-09-01  7:30     ` Yang Zhang
@ 2017-09-29 10:29     ` Quan Xu
  1 sibling, 0 replies; 47+ messages in thread
From: Quan Xu @ 2017-09-29 10:29 UTC (permalink / raw)
  To: Peter Zijlstra, Yang Zhang
  Cc: quan.xu0, linux-kernel, kvm, wanpeng.li, mst, pbonzini, tglx,
	rkrcmar, dmatlack, agraf, linux-doc, Ingo Molnar



On 2017/8/29 20:46, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 11:46:41AM +0000, Yang Zhang wrote:
>> In ttwu_do_wakeup, avg_idle is updated on wakeup from idle. Here we
>> just reuse this logic to update the poll time. It may be a little late
>> to update the poll in ttwu_do_wakeup, but test results show no obvious
>> performance gap compared with updating the poll in the irq handler.
>>
>> One problem is that idle_stamp is only used by the CFS scheduler. But
>> that is OK, since CFS is the default scheduling policy and considering
>> only it should be enough.
>>
>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
> Same broken SoB chain, and not a useful word on why you need to adjust
> this crap to begin with. What you want that poll duration to be related
> to is the cost of a VMEXIT/VMENTER cycle, not however long we happened
> to be idle.
>
> So no.

Peter,

I think you are right.

IIUC, the time we happened to be idle may contain a chain of
VMEXIT/VMENTER cycles, which (except for the last one) would mostly be
just idle loops, right?

As you mentioned, the poll duration should be related to the cost of
__a__ VMEXIT/VMENTER cycle. However, it is very difficult to measure a
VMEXIT/VMENTER cycle accurately from a kvm guest; we could find an
approximation by dropping the idle loops from the time we happened to
be idle. Does that make sense?
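
One way to read that approximation as a formula (a formalization of the
sentence above, not taken from the thread): with t_idle the measured
idle residency of an episode, t_loops the time attributable to the idle
loops themselves, and n_exits the number of VMEXIT/VMENTER cycles in
the episode:

	exit_entry_cost ≈ (t_idle - t_loops) / n_exits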

Quan

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
  2017-09-01  6:49     ` Quan Xu
@ 2017-09-29 10:39       ` Quan Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Quan Xu @ 2017-09-29 10:39 UTC (permalink / raw)
  To: Borislav Petkov, Yang Zhang
  Cc: linux-kernel, kvm, wanpeng.li, mst, pbonzini, tglx, rkrcmar,
	dmatlack, agraf, peterz, linux-doc, Ingo Molnar, H. Peter Anvin,
	x86, Kyle Huey, Andy Lutomirski, Len Brown



On 2017/9/1 14:49, Quan Xu wrote:
> on 2017/8/29 22:39, Borislav Petkov wrote:
>> On Tue, Aug 29, 2017 at 11:46:37AM +0000, Yang Zhang wrote:
>>> Add polling in do_idle. For a UP VM, if there is a running task, it will
>>> not go into the idle path, so we only enable polling for SMP VMs.
>>>
>>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>> Cc: x86@kernel.org
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Borislav Petkov <bp@alien8.de>
>>> Cc: Kyle Huey <me@kylehuey.com>
>>> Cc: Andy Lutomirski <luto@kernel.org>
>>> Cc: Len Brown <len.brown@intel.com>
>>> Cc: linux-kernel@vger.kernel.org
>>> ---
>>>   arch/x86/kernel/process.c | 7 +++++++
>>>   kernel/sched/idle.c       | 2 ++
>>>   2 files changed, 9 insertions(+)
>>>
>>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>>> index 3ca1980..def4113 100644
>>> --- a/arch/x86/kernel/process.c
>>> +++ b/arch/x86/kernel/process.c
>>> @@ -332,6 +332,13 @@ void arch_cpu_idle(void)
>>>       x86_idle();
>>>   }
>>>   +#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
>>> +void arch_cpu_idle_poll(void)
>>> +{
>>> +    paravirt_idle_poll();
>>> +}
>>> +#endif
>> So this will get called on any system which has CONFIG_PARAVIRT enabled
>> *even* if they're not running any guests.
>>
>> Huh?
      Borislav,
      on second thought, it is better to run this under an IF condition,
to make sure we are actually running as a guest.

       Quan
> Borislav,
> yes, this will get called on any system which has CONFIG_PARAVIRT
> enabled.
>
> But if it is not running as a guest, the callback is paravirt_nop(),
> which IIUC is similar to the other paravirt_* hooks, such as
> paravirt_pgd_free().
>
>  - Quan

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2017-09-29 10:40 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-29 11:46 [RFC PATCH v2 0/7] x86/idle: add halt poll support Yang Zhang
2017-08-29 11:46 ` [RFC PATCH v2 1/7] x86/paravirt: Add pv_idle_ops to paravirt ops Yang Zhang
2017-08-29 11:46   ` Yang Zhang
2017-08-29 13:55   ` Konrad Rzeszutek Wilk
2017-08-29 13:55     ` Konrad Rzeszutek Wilk
2017-08-30  7:33     ` Juergen Gross
2017-08-30  7:33     ` Juergen Gross
2017-09-01  6:50     ` Yang Zhang
2017-09-01  6:50     ` Yang Zhang
2017-08-29 11:46 ` [RFC PATCH v2 2/7] KVM guest: register kvm_idle_poll for pv_idle_ops Yang Zhang
2017-08-29 11:46 ` [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path Yang Zhang
2017-08-29 12:45   ` Peter Zijlstra
2017-09-01  5:57     ` Quan Xu
2017-09-14  8:41       ` Quan Xu
2017-09-14  9:18         ` Borislav Petkov
2017-08-29 14:39   ` Borislav Petkov
2017-09-01  6:49     ` Quan Xu
2017-09-29 10:39       ` Quan Xu
2017-08-29 11:46 ` [RFC PATCH v2 4/7] x86/paravirt: Add update in x86/paravirt pv_idle_ops Yang Zhang
2017-08-29 11:46 ` Yang Zhang
2017-08-29 11:46 ` [RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll Yang Zhang
2017-08-29 11:46 ` Yang Zhang
2017-08-29 11:46   ` Yang Zhang
2017-08-29 17:20   ` Luis R. Rodriguez
2017-08-29 17:20   ` Luis R. Rodriguez
2017-08-29 17:20     ` Luis R. Rodriguez
2017-08-29 11:46 ` [RFC PATCH v2 6/7] KVM guest: introduce smart idle poll algorithm Yang Zhang
2017-08-29 11:46 ` [RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle Yang Zhang
2017-08-29 12:46   ` Peter Zijlstra
2017-09-01  7:30     ` Yang Zhang
2017-09-29 10:29     ` Quan Xu
2017-08-29 11:58 ` [RFC PATCH v2 0/7] x86/idle: add halt poll support Alexander Graf
2017-09-01  6:21   ` Yang Zhang
2017-08-29 13:03 ` Andi Kleen
2017-08-29 14:02 ` Wanpeng Li
2017-08-29 14:27   ` Konrad Rzeszutek Wilk
2017-08-29 14:36   ` Michael S. Tsirkin
2017-09-01  6:32   ` Yang Zhang
2017-09-01  6:52     ` Wanpeng Li
2017-09-01  6:44   ` Yang Zhang
2017-09-01  6:58     ` Wanpeng Li
2017-09-01  7:53       ` Yang Zhang
2017-08-29 14:56 ` Michael S. Tsirkin
2017-09-13 11:56   ` Yang Zhang
2017-09-14  8:36     ` Quan Xu
2017-09-14  9:19       ` Wanpeng Li
2017-09-14  9:40         ` Quan Xu
