* [PATCH RFC v3 0/6] x86/idle: add halt poll support
@ 2017-11-13 10:05 Quan Xu
  2017-11-13 10:06 ` [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops Quan Xu
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Quan Xu @ 2017-11-13 10:05 UTC (permalink / raw)
  To: kvm, linux-doc, linux-fsdevel, linux-kernel, virtualization, x86,
	xen-devel
  Cc: Yang Zhang

From: Yang Zhang <yang.zhang.wz@gmail.com>

Some latency-intensive workloads see an obvious performance drop when
running inside a VM. The main reason is that the overhead is amplified when
running inside a VM, and the largest cost I have seen is in the idle path.

This series introduces a new mechanism to poll for a while before entering
the idle state. If a reschedule is needed during the poll, we do not need to
go through the heavy overhead path.
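
For illustration, here is a minimal sketch of the poll-before-halt idea (the
actual guest-side implementation is in patches 2/6 and 6/6; this is a
simplified outline, not the patch code itself):

	/* Poll briefly instead of halting; stop as soon as work shows up. */
	static void idle_poll_sketch(unsigned long poll_duration_ns)
	{
		ktime_t stop = ktime_add_ns(ktime_get(), poll_duration_ns);

		while (!need_resched() && ktime_before(ktime_get(), stop))
			cpu_relax();
	}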

Here is the data we get when running the context-switch benchmark to
measure latency (lower is better):

   1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
     3402.9 ns/ctxsw -- 199.8 %CPU

   2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
      halt_poll_threshold=10000  -- 1151.4 ns/ctxsw -- 200.1 %CPU
      halt_poll_threshold=20000  -- 1149.7 ns/ctxsw -- 199.9 %CPU
      halt_poll_threshold=30000  -- 1151.0 ns/ctxsw -- 199.9 %CPU
      halt_poll_threshold=40000  -- 1155.4 ns/ctxsw -- 199.3 %CPU
      halt_poll_threshold=50000  -- 1161.0 ns/ctxsw -- 200.0 %CPU
      halt_poll_threshold=100000 -- 1163.8 ns/ctxsw -- 200.4 %CPU
      halt_poll_threshold=300000 -- 1159.4 ns/ctxsw -- 201.9 %CPU
      halt_poll_threshold=500000 -- 1163.5 ns/ctxsw -- 205.5 %CPU

   3. w/ kvm dynamic poll:
      halt_poll_ns=10000  -- 3470.5 ns/ctxsw -- 199.6 %CPU
      halt_poll_ns=20000  -- 3273.0 ns/ctxsw -- 199.7 %CPU
      halt_poll_ns=30000  -- 3628.7 ns/ctxsw -- 199.4 %CPU
      halt_poll_ns=40000  -- 2280.6 ns/ctxsw -- 199.5 %CPU
      halt_poll_ns=50000  -- 3200.3 ns/ctxsw -- 199.7 %CPU
      halt_poll_ns=100000 -- 2186.6 ns/ctxsw -- 199.6 %CPU
      halt_poll_ns=300000 -- 3178.7 ns/ctxsw -- 199.6 %CPU
      halt_poll_ns=500000 -- 3505.4 ns/ctxsw -- 199.7 %CPU

   4. w/patch and w/ kvm dynamic poll:

      halt_poll_ns=10000 & halt_poll_threshold=10000  -- 1155.5 ns/ctxsw -- 199.8 %CPU
      halt_poll_ns=10000 & halt_poll_threshold=20000  -- 1165.6 ns/ctxsw -- 199.8 %CPU
      halt_poll_ns=10000 & halt_poll_threshold=30000  -- 1161.1 ns/ctxsw -- 200.0 %CPU

      halt_poll_ns=20000 & halt_poll_threshold=10000  -- 1158.1 ns/ctxsw -- 199.8 %CPU
      halt_poll_ns=20000 & halt_poll_threshold=20000  -- 1161.0 ns/ctxsw -- 199.7 %CPU
      halt_poll_ns=20000 & halt_poll_threshold=30000  -- 1163.7 ns/ctxsw -- 199.9 %CPU

      halt_poll_ns=30000 & halt_poll_threshold=10000  -- 1158.7 ns/ctxsw -- 199.7 %CPU
      halt_poll_ns=30000 & halt_poll_threshold=20000  -- 1153.8 ns/ctxsw -- 199.8 %CPU
      halt_poll_ns=30000 & halt_poll_threshold=30000  -- 1155.1 ns/ctxsw -- 199.8 %CPU

   5. idle=poll
      3957.57 ns/ctxsw --  999.4%CPU

Here is the data we get when running the netperf benchmark:

   1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
      29031.6 bit/s -- 76.1 %CPU

   2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
      halt_poll_threshold=10000  -- 29021.7 bit/s -- 105.1 %CPU
      halt_poll_threshold=20000  -- 33463.5 bit/s -- 128.2 %CPU
      halt_poll_threshold=30000  -- 34436.4 bit/s -- 127.8 %CPU
      halt_poll_threshold=40000  -- 35563.3 bit/s -- 129.6 %CPU
      halt_poll_threshold=50000  -- 35787.7 bit/s -- 129.4 %CPU
      halt_poll_threshold=100000 -- 35477.7 bit/s -- 130.0 %CPU
      halt_poll_threshold=300000 -- 35730.0 bit/s -- 132.4 %CPU
      halt_poll_threshold=500000 -- 34978.4 bit/s -- 134.2 %CPU

   3. w/ kvm dynamic poll:
      halt_poll_ns=10000  -- 28849.8 bit/s -- 75.2  %CPU
      halt_poll_ns=20000  -- 29004.8 bit/s -- 76.1  %CPU
      halt_poll_ns=30000  -- 35662.0 bit/s -- 199.7 %CPU
      halt_poll_ns=40000  -- 35874.8 bit/s -- 187.5 %CPU
      halt_poll_ns=50000  -- 35603.1 bit/s -- 199.8 %CPU
      halt_poll_ns=100000 -- 35588.8 bit/s -- 200.0 %CPU
      halt_poll_ns=300000 -- 35912.4 bit/s -- 200.0 %CPU
      halt_poll_ns=500000 -- 35735.6 bit/s -- 200.0 %CPU

   4. w/patch and w/ kvm dynamic poll:

      halt_poll_ns=10000 & halt_poll_threshold=10000  -- 29427.9 bit/s -- 107.8 %CPU
      halt_poll_ns=10000 & halt_poll_threshold=20000  -- 33048.4 bit/s -- 128.1 %CPU
      halt_poll_ns=10000 & halt_poll_threshold=30000  -- 35129.8 bit/s -- 129.1 %CPU

      halt_poll_ns=20000 & halt_poll_threshold=10000  -- 31091.3 bit/s -- 130.3 %CPU
      halt_poll_ns=20000 & halt_poll_threshold=20000  -- 33587.9 bit/s -- 128.9 %CPU
      halt_poll_ns=20000 & halt_poll_threshold=30000  -- 35532.9 bit/s -- 129.1 %CPU

      halt_poll_ns=30000 & halt_poll_threshold=10000  -- 35633.1 bit/s -- 199.4 %CPU
      halt_poll_ns=30000 & halt_poll_threshold=20000  -- 42225.3 bit/s -- 198.7 %CPU
      halt_poll_ns=30000 & halt_poll_threshold=30000  -- 42210.7 bit/s -- 200.3 %CPU

   5. idle=poll
      37081.7 bit/s -- 998.1 %CPU

---
V2 -> V3:
- Move the poll update into arch/. In v3, the poll update is based on the
  duration of the last idle loop, measured from tick_nohz_idle_enter to
  tick_nohz_idle_exit, and we try our best not to interfere with the
  scheduler/idle code. (This seems not to follow Peter's v2 comment; however,
  we had a f2f discussion about it in Prague.) A rough illustrative sketch of
  such an update follows this list.
- Enhance the patch descriptions.
- Enhance the Documentation and sysctls.
- Tested with IRQ_TIMINGS related code, which does not seem to work so far.
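
A rough, purely illustrative sketch of that adaptive update (the real
algorithm is in patch 6/6, which is not quoted here; the last-idle-duration
input comes from patch 4/6, while the names and grow/shrink factors below
are assumptions made for this sketch, not code from the series):

	static unsigned long halt_poll_threshold = 100000;	/* cap in ns, sysctl-controlled */

	static void poll_update_sketch(unsigned long *poll_ns, unsigned long last_idle_ns)
	{
		if (last_idle_ns < *poll_ns)
			/* Woke up quickly: polling likely pays off, grow the window. */
			*poll_ns = min(*poll_ns * 2, halt_poll_threshold);
		else
			/* Long idle: polling wasted cycles, shrink the window. */
			*poll_ns /= 2;
	}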

V1 -> V2:
- Integrate the smart halt poll into the paravirt code.
- Use idle_stamp instead of check_poll.
- Since it is hard to determine whether the vCPU is the only task on its
  pCPU, we do not consider that case in this series. (May improve it in the
  future.)

---
Quan Xu (4):
  x86/paravirt: Add pv_idle_ops to paravirt ops
  KVM guest: register kvm_idle_poll for pv_idle_ops
  Documentation: Add three sysctls for smart idle poll
  tick: get duration of the last idle loop

Yang Zhang (2):
  sched/idle: Add a generic poll before enter real idle path
  KVM guest: introduce smart idle poll algorithm

 Documentation/sysctl/kernel.txt       |   35 ++++++++++++++++
 arch/x86/include/asm/paravirt.h       |    5 ++
 arch/x86/include/asm/paravirt_types.h |    6 +++
 arch/x86/kernel/kvm.c                 |   73 +++++++++++++++++++++++++++++++++
 arch/x86/kernel/paravirt.c            |   10 +++++
 arch/x86/kernel/process.c             |    7 +++
 include/linux/kernel.h                |    6 +++
 include/linux/tick.h                  |    2 +
 kernel/sched/idle.c                   |    2 +
 kernel/sysctl.c                       |   34 +++++++++++++++
 kernel/time/tick-sched.c              |   11 +++++
 kernel/time/tick-sched.h              |    3 +
 12 files changed, 194 insertions(+), 0 deletions(-)



* [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-11-13 10:05 [PATCH RFC v3 0/6] x86/idle: add halt poll support Quan Xu
@ 2017-11-13 10:06 ` Quan Xu
  2017-11-13 10:53   ` Juergen Gross
  2017-11-13 10:06 ` [PATCH RFC v3 2/6] KVM guest: register kvm_idle_poll for pv_idle_ops Quan Xu
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Quan Xu @ 2017-11-13 10:06 UTC (permalink / raw)
  To: kvm, linux-doc, linux-fsdevel, linux-kernel, virtualization, x86,
	xen-devel
  Cc: Yang Zhang, Juergen Gross, Quan Xu, Rusty Russell, Ingo Molnar,
	H. Peter Anvin, Alok Kataria, Thomas Gleixner

From: Quan Xu <quan.xu0@gmail.com>

So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in the
idle path and polls for a while before we enter the real idle state.

In virtualization, the idle path includes several heavy operations,
including timer access (LAPIC timer or TSC deadline timer), which hurt
performance especially for latency-intensive workloads like message-passing
tasks. The cost mainly comes from the VM-exit, which is a hardware context
switch between the virtual machine and the hypervisor. Our solution is to
poll for a while and not enter the real idle path if we get a schedule
event during polling.

Polling may waste CPU, so we adopt a smart polling mechanism to reduce
useless polling.
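
For reference, the call chain added by this series is roughly:

	do_idle()                         /* kernel/sched/idle.c */
	  -> arch_cpu_idle_poll()         /* weak hook added in patch 3/6 */
	       -> paravirt_idle_poll()    /* this patch */
	            -> pv_idle_ops.poll() /* e.g. kvm_idle_poll(), patch 2/6 */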

Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Quan Xu <quan.xu0@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
---
 arch/x86/include/asm/paravirt.h       |    5 +++++
 arch/x86/include/asm/paravirt_types.h |    6 ++++++
 arch/x86/kernel/paravirt.c            |    6 ++++++
 3 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index fd81228..3c83727 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -198,6 +198,11 @@ static inline unsigned long long paravirt_read_pmc(int counter)
 
 #define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter))
 
+static inline void paravirt_idle_poll(void)
+{
+	PVOP_VCALL0(pv_idle_ops.poll);
+}
+
 static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned entries)
 {
 	PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 10cc3b9..95c0e3e 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -313,6 +313,10 @@ struct pv_lock_ops {
 	struct paravirt_callee_save vcpu_is_preempted;
 } __no_randomize_layout;
 
+struct pv_idle_ops {
+	void (*poll)(void);
+} __no_randomize_layout;
+
 /* This contains all the paravirt structures: we get a convenient
  * number for each function using the offset which we use to indicate
  * what to patch. */
@@ -323,6 +327,7 @@ struct paravirt_patch_template {
 	struct pv_irq_ops pv_irq_ops;
 	struct pv_mmu_ops pv_mmu_ops;
 	struct pv_lock_ops pv_lock_ops;
+	struct pv_idle_ops pv_idle_ops;
 } __no_randomize_layout;
 
 extern struct pv_info pv_info;
@@ -332,6 +337,7 @@ struct paravirt_patch_template {
 extern struct pv_irq_ops pv_irq_ops;
 extern struct pv_mmu_ops pv_mmu_ops;
 extern struct pv_lock_ops pv_lock_ops;
+extern struct pv_idle_ops pv_idle_ops;
 
 #define PARAVIRT_PATCH(x)					\
 	(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 19a3e8f..67cab22 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -128,6 +128,7 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 		.pv_lock_ops = pv_lock_ops,
 #endif
+		.pv_idle_ops = pv_idle_ops,
 	};
 	return *((void **)&tmpl + type);
 }
@@ -312,6 +313,10 @@ struct pv_time_ops pv_time_ops = {
 	.steal_clock = native_steal_clock,
 };
 
+struct pv_idle_ops pv_idle_ops = {
+	.poll = paravirt_nop,
+};
+
 __visible struct pv_irq_ops pv_irq_ops = {
 	.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
 	.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
@@ -463,3 +468,4 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
 EXPORT_SYMBOL    (pv_mmu_ops);
 EXPORT_SYMBOL_GPL(pv_info);
 EXPORT_SYMBOL    (pv_irq_ops);
+EXPORT_SYMBOL    (pv_idle_ops);
-- 
1.7.1



* [PATCH RFC v3 2/6] KVM guest: register kvm_idle_poll for pv_idle_ops
  2017-11-13 10:05 [PATCH RFC v3 0/6] x86/idle: add halt poll support Quan Xu
  2017-11-13 10:06 ` [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops Quan Xu
@ 2017-11-13 10:06 ` Quan Xu
  2017-11-13 10:06 ` [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path Quan Xu
  2017-11-15 21:31 ` [Xen-devel] [PATCH RFC v3 0/6] x86/idle: add halt poll support Konrad Rzeszutek Wilk
  3 siblings, 0 replies; 32+ messages in thread
From: Quan Xu @ 2017-11-13 10:06 UTC (permalink / raw)
  To: kvm, linux-doc, linux-fsdevel, linux-kernel, virtualization, x86,
	xen-devel
  Cc: Yang Zhang, Quan Xu, Radim Krčmář,
	Ingo Molnar, H. Peter Anvin, Paolo Bonzini, Thomas Gleixner

From: Quan Xu <quan.xu0@gmail.com>

Although the smart idle poll has nothing to do with paravirt, it cannot
bring any benefit to native. So we only enable it when Linux runs as a KVM
guest (it can also be extended to other hypervisors like Xen, Hyper-V and
VMware).

Introduce the per-CPU variable poll_duration_ns to control the maximum
poll time.

Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Quan Xu <quan.xu0@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/kvm.c |   26 ++++++++++++++++++++++++++
 1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8bb9594..2a6e402 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -75,6 +75,7 @@ static int parse_no_kvmclock_vsyscall(char *arg)
 
 early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
 
+static DEFINE_PER_CPU(unsigned long, poll_duration_ns);
 static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
@@ -364,6 +365,29 @@ static void kvm_guest_cpu_init(void)
 		kvm_register_steal_time();
 }
 
+static void kvm_idle_poll(void)
+{
+	unsigned long poll_duration = this_cpu_read(poll_duration_ns);
+	ktime_t start, cur, stop;
+
+	start = cur = ktime_get();
+	stop = ktime_add_ns(ktime_get(), poll_duration);
+
+	do {
+		if (need_resched())
+			break;
+		cur = ktime_get();
+	} while (ktime_before(cur, stop));
+}
+
+static void kvm_guest_idle_init(void)
+{
+	if (!kvm_para_available())
+		return;
+
+	pv_idle_ops.poll = kvm_idle_poll;
+}
+
 static void kvm_pv_disable_apf(void)
 {
 	if (!__this_cpu_read(apf_reason.enabled))
@@ -499,6 +523,8 @@ void __init kvm_guest_init(void)
 	kvm_guest_cpu_init();
 #endif
 
+	kvm_guest_idle_init();
+
 	/*
 	 * Hard lockup detection is enabled by default. Disable it, as guests
 	 * can get false positives too easily, for example if the host is
-- 
1.7.1



* [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
  2017-11-13 10:05 [PATCH RFC v3 0/6] x86/idle: add halt poll support Quan Xu
  2017-11-13 10:06 ` [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops Quan Xu
  2017-11-13 10:06 ` [PATCH RFC v3 2/6] KVM guest: register kvm_idle_poll for pv_idle_ops Quan Xu
@ 2017-11-13 10:06 ` Quan Xu
  2017-11-15 12:11   ` Peter Zijlstra
  2017-11-15 21:31 ` [Xen-devel] [PATCH RFC v3 0/6] x86/idle: add halt poll support Konrad Rzeszutek Wilk
  3 siblings, 1 reply; 32+ messages in thread
From: Quan Xu @ 2017-11-13 10:06 UTC (permalink / raw)
  To: kvm, linux-doc, linux-fsdevel, linux-kernel, virtualization, x86,
	xen-devel
  Cc: Yang Zhang, Len Brown, Quan Xu, Tom Lendacky, Peter Zijlstra,
	Kyle Huey, Ingo Molnar, Borislav Petkov, Andy Lutomirski,
	H. Peter Anvin, Thomas Gleixner, Tobias Klauser

From: Yang Zhang <yang.zhang.wz@gmail.com>

Implement a generic idle poll which resembles the functionality found in
arch/. Provide a weak arch_cpu_idle_poll() function which can be overridden
by architecture code if needed.

Interrupts may arrive in idle loops without causing a reschedule. In a KVM
guest, each of them costs a VM-entry/VM-exit cycle: a VM-entry to deliver
the interrupt and a VM-exit immediately afterwards, which makes this more
expensive than on bare metal. Add a generic idle poll before entering the
real idle path. When a reschedule event is pending, we can bypass the real
idle path.

Signed-off-by: Quan Xu <quan.xu0@gmail.com>
Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Kyle Huey <me@kylehuey.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Tobias Klauser <tklauser@distanz.ch>
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/process.c |    7 +++++++
 kernel/sched/idle.c       |    2 ++
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c676853..f7db8b5 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -333,6 +333,13 @@ void arch_cpu_idle(void)
 	x86_idle();
 }
 
+#ifdef CONFIG_PARAVIRT
+void arch_cpu_idle_poll(void)
+{
+	paravirt_idle_poll();
+}
+#endif
+
 /*
  * We use this if we don't have any better idle routine..
  */
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 257f4f0..df7c422 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 }
 
 /* Weak implementations for optional arch specific functions */
+void __weak arch_cpu_idle_poll(void) { }
 void __weak arch_cpu_idle_prepare(void) { }
 void __weak arch_cpu_idle_enter(void) { }
 void __weak arch_cpu_idle_exit(void) { }
@@ -219,6 +220,7 @@ static void do_idle(void)
 	 */
 
 	__current_set_polling();
+	arch_cpu_idle_poll();
 	quiet_vmstat();
 	tick_nohz_idle_enter();
 
-- 
1.7.1



* Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-11-13 10:06 ` [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops Quan Xu
@ 2017-11-13 10:53   ` Juergen Gross
  2017-11-13 11:09     ` Wanpeng Li
  2017-11-14  7:02     ` Quan Xu
  0 siblings, 2 replies; 32+ messages in thread
From: Juergen Gross @ 2017-11-13 10:53 UTC (permalink / raw)
  To: Quan Xu, kvm, linux-doc, linux-fsdevel, linux-kernel,
	virtualization, x86, xen-devel
  Cc: Quan Xu, Yang Zhang, Alok Kataria, Rusty Russell,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On 13/11/17 11:06, Quan Xu wrote:
> From: Quan Xu <quan.xu0@gmail.com>
> 
> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called
> in idle path which will poll for a while before we enter the real idle
> state.
> 
> In virtualization, idle path includes several heavy operations
> includes timer access(LAPIC timer or TSC deadline timer) which will
> hurt performance especially for latency intensive workload like message
> passing task. The cost is mainly from the vmexit which is a hardware
> context switch between virtual machine and hypervisor. Our solution is
> to poll for a while and do not enter real idle path if we can get the
> schedule event during polling.
> 
> Poll may cause the CPU waste so we adopt a smart polling mechanism to
> reduce the useless poll.
> 
> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Alok Kataria <akataria@vmware.com>
> Cc: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: virtualization@lists.linux-foundation.org
> Cc: linux-kernel@vger.kernel.org
> Cc: xen-devel@lists.xenproject.org

Hmm, is the idle entry path really so critical to performance that a new
pvops function is necessary? Wouldn't a function pointer, maybe guarded
by a static key, be enough? A further advantage would be that this would
work on other architectures, too.


Juergen


* Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-11-13 10:53   ` Juergen Gross
@ 2017-11-13 11:09     ` Wanpeng Li
  2017-11-14  7:02     ` Quan Xu
  1 sibling, 0 replies; 32+ messages in thread
From: Wanpeng Li @ 2017-11-13 11:09 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Quan Xu, kvm, linux-doc,
	open list:FILESYSTEMS (VFS and infrastructure),
	linux-kernel, virtualization, the arch/x86 maintainers,
	xen-devel, Quan Xu, Yang Zhang, Alok Kataria, Rusty Russell,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin

2017-11-13 18:53 GMT+08:00 Juergen Gross <jgross@suse.com>:
> On 13/11/17 11:06, Quan Xu wrote:
>> From: Quan Xu <quan.xu0@gmail.com>
>>
>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called
>> in idle path which will poll for a while before we enter the real idle
>> state.
>>
>> In virtualization, idle path includes several heavy operations
>> includes timer access(LAPIC timer or TSC deadline timer) which will
>> hurt performance especially for latency intensive workload like message
>> passing task. The cost is mainly from the vmexit which is a hardware
>> context switch between virtual machine and hypervisor. Our solution is
>> to poll for a while and do not enter real idle path if we can get the
>> schedule event during polling.
>>
>> Poll may cause the CPU waste so we adopt a smart polling mechanism to
>> reduce the useless poll.
>>
>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>> Cc: Juergen Gross <jgross@suse.com>
>> Cc: Alok Kataria <akataria@vmware.com>
>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: x86@kernel.org
>> Cc: virtualization@lists.linux-foundation.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: xen-devel@lists.xenproject.org
>
> Hmm, is the idle entry path really so critical to performance that a new
> pvops function is necessary? Wouldn't a function pointer, maybe guarded
> by a static key, be enough? A further advantage would be that this would
> work on other architectures, too.

There is a "Adaptive halt-polling" which are merged to upstream more
than two years ago avoids to thread the critical path and has already
been ported to other architectures. https://lkml.org/lkml/2015/9/3/615

Regards,
Wanpeng Li


* Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-11-13 10:53   ` Juergen Gross
  2017-11-13 11:09     ` Wanpeng Li
@ 2017-11-14  7:02     ` Quan Xu
  2017-11-14  7:12       ` Wanpeng Li
  2017-11-14  7:30       ` Juergen Gross
  1 sibling, 2 replies; 32+ messages in thread
From: Quan Xu @ 2017-11-14  7:02 UTC (permalink / raw)
  To: Juergen Gross, Quan Xu, kvm, linux-doc, linux-fsdevel,
	linux-kernel, virtualization, x86, xen-devel
  Cc: Yang Zhang, Alok Kataria, Rusty Russell, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin



On 2017/11/13 18:53, Juergen Gross wrote:
> On 13/11/17 11:06, Quan Xu wrote:
>> From: Quan Xu <quan.xu0@gmail.com>
>>
>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called
>> in idle path which will poll for a while before we enter the real idle
>> state.
>>
>> In virtualization, idle path includes several heavy operations
>> includes timer access(LAPIC timer or TSC deadline timer) which will
>> hurt performance especially for latency intensive workload like message
>> passing task. The cost is mainly from the vmexit which is a hardware
>> context switch between virtual machine and hypervisor. Our solution is
>> to poll for a while and do not enter real idle path if we can get the
>> schedule event during polling.
>>
>> Poll may cause the CPU waste so we adopt a smart polling mechanism to
>> reduce the useless poll.
>>
>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>> Cc: Juergen Gross <jgross@suse.com>
>> Cc: Alok Kataria <akataria@vmware.com>
>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: x86@kernel.org
>> Cc: virtualization@lists.linux-foundation.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: xen-devel@lists.xenproject.org
> Hmm, is the idle entry path really so critical to performance that a new
> pvops function is necessary?
Juergen, here is the data we get when running the netperf benchmark:
  1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
     29031.6 bit/s -- 76.1 %CPU

  2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
     35787.7 bit/s -- 129.4 %CPU

  3. w/ kvm dynamic poll:
     35735.6 bit/s -- 200.0 %CPU

  4. w/patch and w/ kvm dynamic poll:
     42225.3 bit/s -- 198.7 %CPU

  5. idle=poll
     37081.7 bit/s -- 998.1 %CPU



  With this patch, we improve performance by 23%. We could even improve
  performance by 45.4% if we use the patch together with kvm dynamic poll.
  Also, the CPU cost is much lower than in the 'idle=poll' case.

> Wouldn't a function pointer, maybe guarded
> by a static key, be enough? A further advantage would be that this would
> work on other architectures, too.

I assume this feature will be ported to other archs. A new pvops keeps the
code clean and easy to maintain. I also tried to add it into the existing
pvops, but it doesn't fit.



Quan
Alibaba Cloud
>
> Juergen
>


* Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-11-14  7:02     ` Quan Xu
@ 2017-11-14  7:12       ` Wanpeng Li
  2017-11-14  8:15         ` Quan Xu
  2017-11-14  7:30       ` Juergen Gross
  1 sibling, 1 reply; 32+ messages in thread
From: Wanpeng Li @ 2017-11-14  7:12 UTC (permalink / raw)
  To: Quan Xu
  Cc: Juergen Gross, Quan Xu, kvm, linux-doc,
	open list:FILESYSTEMS (VFS and infrastructure),
	linux-kernel, virtualization, the arch/x86 maintainers,
	xen-devel, Yang Zhang, Alok Kataria, Rusty Russell,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin

2017-11-14 15:02 GMT+08:00 Quan Xu <quan.xu0@gmail.com>:
>
>
> On 2017/11/13 18:53, Juergen Gross wrote:
>>
>> On 13/11/17 11:06, Quan Xu wrote:
>>>
>>> From: Quan Xu <quan.xu0@gmail.com>
>>>
>>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called
>>> in idle path which will poll for a while before we enter the real idle
>>> state.
>>>
>>> In virtualization, idle path includes several heavy operations
>>> includes timer access(LAPIC timer or TSC deadline timer) which will
>>> hurt performance especially for latency intensive workload like message
>>> passing task. The cost is mainly from the vmexit which is a hardware
>>> context switch between virtual machine and hypervisor. Our solution is
>>> to poll for a while and do not enter real idle path if we can get the
>>> schedule event during polling.
>>>
>>> Poll may cause the CPU waste so we adopt a smart polling mechanism to
>>> reduce the useless poll.
>>>
>>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>>> Cc: Juergen Gross <jgross@suse.com>
>>> Cc: Alok Kataria <akataria@vmware.com>
>>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>> Cc: x86@kernel.org
>>> Cc: virtualization@lists.linux-foundation.org
>>> Cc: linux-kernel@vger.kernel.org
>>> Cc: xen-devel@lists.xenproject.org
>>
>> Hmm, is the idle entry path really so critical to performance that a new
>> pvops function is necessary?
>
> Juergen, Here is the data we get when running benchmark netperf:
>  1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
>     29031.6 bit/s -- 76.1 %CPU
>
>  2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
>     35787.7 bit/s -- 129.4 %CPU
>
>  3. w/ kvm dynamic poll:
>     35735.6 bit/s -- 200.0 %CPU

Actually, we can reduce the CPU utilization by sleeping for a period of
time, as has already been done in the poll logic of the I/O subsystem. Then
we can improve the algorithm in KVM instead of introducing another,
duplicate one in the KVM guest.

Regards,
Wanpeng Li

>
>  4. w/patch and w/ kvm dynamic poll:
>     42225.3 bit/s -- 198.7 %CPU
>
>  5. idle=poll
>     37081.7 bit/s -- 998.1 %CPU
>
>
>
>  w/ this patch, we will improve performance by 23%.. even we could improve
>  performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the
>  cost of CPU is much lower than 'idle=poll' case..
>
>> Wouldn't a function pointer, maybe guarded
>> by a static key, be enough? A further advantage would be that this would
>> work on other architectures, too.
>
>
> I assume this feature will be ported to other archs.. a new pvops makes code
> clean and easy to maintain. also I tried to add it into existed pvops, but
> it
> doesn't match.
>
>
>
> Quan
> Alibaba Cloud
>>
>>
>> Juergen
>>
>


* Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-11-14  7:02     ` Quan Xu
  2017-11-14  7:12       ` Wanpeng Li
@ 2017-11-14  7:30       ` Juergen Gross
  2017-11-14  9:38         ` Quan Xu
  1 sibling, 1 reply; 32+ messages in thread
From: Juergen Gross @ 2017-11-14  7:30 UTC (permalink / raw)
  To: Quan Xu, Quan Xu, kvm, linux-doc, linux-fsdevel, linux-kernel,
	virtualization, x86, xen-devel
  Cc: Yang Zhang, Alok Kataria, Rusty Russell, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin

On 14/11/17 08:02, Quan Xu wrote:
> 
> 
> On 2017/11/13 18:53, Juergen Gross wrote:
>> On 13/11/17 11:06, Quan Xu wrote:
>>> From: Quan Xu <quan.xu0@gmail.com>
>>>
>>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called
>>> in idle path which will poll for a while before we enter the real idle
>>> state.
>>>
>>> In virtualization, idle path includes several heavy operations
>>> includes timer access(LAPIC timer or TSC deadline timer) which will
>>> hurt performance especially for latency intensive workload like message
>>> passing task. The cost is mainly from the vmexit which is a hardware
>>> context switch between virtual machine and hypervisor. Our solution is
>>> to poll for a while and do not enter real idle path if we can get the
>>> schedule event during polling.
>>>
>>> Poll may cause the CPU waste so we adopt a smart polling mechanism to
>>> reduce the useless poll.
>>>
>>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>>> Cc: Juergen Gross <jgross@suse.com>
>>> Cc: Alok Kataria <akataria@vmware.com>
>>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>> Cc: x86@kernel.org
>>> Cc: virtualization@lists.linux-foundation.org
>>> Cc: linux-kernel@vger.kernel.org
>>> Cc: xen-devel@lists.xenproject.org
>> Hmm, is the idle entry path really so critical to performance that a new
>> pvops function is necessary?
> Juergen, Here is the data we get when running benchmark netperf:
>  1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
>     29031.6 bit/s -- 76.1 %CPU
> 
>  2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
>     35787.7 bit/s -- 129.4 %CPU
> 
>  3. w/ kvm dynamic poll:
>     35735.6 bit/s -- 200.0 %CPU
> 
>  4. w/patch and w/ kvm dynamic poll:
>     42225.3 bit/s -- 198.7 %CPU
> 
>  5. idle=poll
>     37081.7 bit/s -- 998.1 %CPU
> 
> 
> 
>  w/ this patch, we will improve performance by 23%.. even we could improve
>  performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the
>  cost of CPU is much lower than 'idle=poll' case..

I don't question the general idea. I just think pvops isn't the best way
to implement it.

>> Wouldn't a function pointer, maybe guarded
>> by a static key, be enough? A further advantage would be that this would
>> work on other architectures, too.
> 
> I assume this feature will be ported to other archs.. a new pvops makes
> code
> clean and easy to maintain. also I tried to add it into existed pvops,
> but it
> doesn't match.

You are aware that pvops is x86 only?

I really don't see the big difference in maintainability compared to the
static key / function pointer variant:

void (*guest_idle_poll_func)(void);
struct static_key guest_idle_poll_key __read_mostly;

static inline void guest_idle_poll(void)
{
	if (static_key_false(&guest_idle_poll_key))
		guest_idle_poll_func();
}

And KVM would just need to set guest_idle_poll_func and enable the
static key. Works on non-x86 architectures, too.
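
A minimal sketch of that registration, using the example names above (they
are illustrative, not existing kernel symbols):

	/* e.g. from kvm_guest_init(): */
	guest_idle_poll_func = kvm_idle_poll;
	static_key_slow_inc(&guest_idle_poll_key);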


Juergen


* Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-11-14  7:12       ` Wanpeng Li
@ 2017-11-14  8:15         ` Quan Xu
  2017-11-14  8:22           ` Wanpeng Li
  0 siblings, 1 reply; 32+ messages in thread
From: Quan Xu @ 2017-11-14  8:15 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Juergen Gross, Quan Xu, kvm, linux-doc,
	open list:FILESYSTEMS (VFS and infrastructure),
	linux-kernel, virtualization, the arch/x86 maintainers,
	xen-devel, Yang Zhang, Alok Kataria, Rusty Russell,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin



On 2017/11/14 15:12, Wanpeng Li wrote:
> 2017-11-14 15:02 GMT+08:00 Quan Xu <quan.xu0@gmail.com>:
>>
>> On 2017/11/13 18:53, Juergen Gross wrote:
>>> On 13/11/17 11:06, Quan Xu wrote:
>>>> From: Quan Xu <quan.xu0@gmail.com>
>>>>
>>>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called
>>>> in idle path which will poll for a while before we enter the real idle
>>>> state.
>>>>
>>>> In virtualization, idle path includes several heavy operations
>>>> includes timer access(LAPIC timer or TSC deadline timer) which will
>>>> hurt performance especially for latency intensive workload like message
>>>> passing task. The cost is mainly from the vmexit which is a hardware
>>>> context switch between virtual machine and hypervisor. Our solution is
>>>> to poll for a while and do not enter real idle path if we can get the
>>>> schedule event during polling.
>>>>
>>>> Poll may cause the CPU waste so we adopt a smart polling mechanism to
>>>> reduce the useless poll.
>>>>
>>>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>>>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>>>> Cc: Juergen Gross <jgross@suse.com>
>>>> Cc: Alok Kataria <akataria@vmware.com>
>>>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>>> Cc: x86@kernel.org
>>>> Cc: virtualization@lists.linux-foundation.org
>>>> Cc: linux-kernel@vger.kernel.org
>>>> Cc: xen-devel@lists.xenproject.org
>>> Hmm, is the idle entry path really so critical to performance that a new
>>> pvops function is necessary?
>> Juergen, Here is the data we get when running benchmark netperf:
>>   1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
>>      29031.6 bit/s -- 76.1 %CPU
>>
>>   2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
>>      35787.7 bit/s -- 129.4 %CPU
>>
>>   3. w/ kvm dynamic poll:
>>      35735.6 bit/s -- 200.0 %CPU
> Actually we can reduce the CPU utilization by sleeping a period of
> time as what has already been done in the poll logic of IO subsystem,
> then we can improve the algorithm in kvm instead of introduing another
> duplicate one in the kvm guest.
We really appreciate upstream's kvm dynamic poll mechanism, which is
really helpful for a lot of scenarios.

However, as the description says, in virtualization the idle path includes
several heavy operations, including timer access (LAPIC timer or TSC
deadline timer), which hurt performance especially for latency-intensive
workloads like message-passing tasks. The cost mainly comes from the
VM-exit, which is a hardware context switch between the virtual machine
and the hypervisor.

For upstream's kvm dynamic poll mechanism, even if you could provide a
better algorithm, how could you bypass the timer access (LAPIC timer or TSC
deadline timer), or the hardware context switch between the virtual machine
and the hypervisor? I know this is a tradeoff.

Furthermore, here is the data we get when running the context-switch
benchmark to measure latency (lower is better):

1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
   3402.9 ns/ctxsw -- 199.8 %CPU

2. w/ patch and disable kvm dynamic poll:
   1163.5 ns/ctxsw -- 205.5 %CPU

3. w/ kvm dynamic poll:
   2280.6 ns/ctxsw -- 199.5 %CPU

So these two solutions are quite similar, but not duplicates.

That's also why we add a generic idle poll before entering the real idle
path. When a reschedule event is pending, we can bypass the real idle path.


Quan
Alibaba Cloud




> Regards,
> Wanpeng Li
>
>>   4. w/patch and w/ kvm dynamic poll:
>>      42225.3 bit/s -- 198.7 %CPU
>>
>>   5. idle=poll
>>      37081.7 bit/s -- 998.1 %CPU
>>
>>
>>
>>   w/ this patch, we will improve performance by 23%.. even we could improve
>>   performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the
>>   cost of CPU is much lower than 'idle=poll' case..
>>
>>> Wouldn't a function pointer, maybe guarded
>>> by a static key, be enough? A further advantage would be that this would
>>> work on other architectures, too.
>>
>> I assume this feature will be ported to other archs.. a new pvops makes code
>> clean and easy to maintain. also I tried to add it into existed pvops, but
>> it
>> doesn't match.
>>
>>
>>
>> Quan
>> Alibaba Cloud
>>>
>>> Juergen
>>>


* Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-11-14  8:15         ` Quan Xu
@ 2017-11-14  8:22           ` Wanpeng Li
  2017-11-14 10:23             ` Quan Xu
  0 siblings, 1 reply; 32+ messages in thread
From: Wanpeng Li @ 2017-11-14  8:22 UTC (permalink / raw)
  To: Quan Xu
  Cc: Juergen Gross, Quan Xu, kvm, linux-doc,
	open list:FILESYSTEMS (VFS and infrastructure),
	linux-kernel, virtualization, the arch/x86 maintainers,
	xen-devel, Yang Zhang, Alok Kataria, Rusty Russell,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin

2017-11-14 16:15 GMT+08:00 Quan Xu <quan.xu0@gmail.com>:
>
>
> On 2017/11/14 15:12, Wanpeng Li wrote:
>>
>> 2017-11-14 15:02 GMT+08:00 Quan Xu <quan.xu0@gmail.com>:
>>>
>>>
>>> On 2017/11/13 18:53, Juergen Gross wrote:
>>>>
>>>> On 13/11/17 11:06, Quan Xu wrote:
>>>>>
>>>>> From: Quan Xu <quan.xu0@gmail.com>
>>>>>
>>>>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called
>>>>> in idle path which will poll for a while before we enter the real idle
>>>>> state.
>>>>>
>>>>> In virtualization, idle path includes several heavy operations
>>>>> includes timer access(LAPIC timer or TSC deadline timer) which will
>>>>> hurt performance especially for latency intensive workload like message
>>>>> passing task. The cost is mainly from the vmexit which is a hardware
>>>>> context switch between virtual machine and hypervisor. Our solution is
>>>>> to poll for a while and do not enter real idle path if we can get the
>>>>> schedule event during polling.
>>>>>
>>>>> Poll may cause the CPU waste so we adopt a smart polling mechanism to
>>>>> reduce the useless poll.
>>>>>
>>>>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>>>>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>>>>> Cc: Juergen Gross <jgross@suse.com>
>>>>> Cc: Alok Kataria <akataria@vmware.com>
>>>>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>>>> Cc: x86@kernel.org
>>>>> Cc: virtualization@lists.linux-foundation.org
>>>>> Cc: linux-kernel@vger.kernel.org
>>>>> Cc: xen-devel@lists.xenproject.org
>>>>
>>>> Hmm, is the idle entry path really so critical to performance that a new
>>>> pvops function is necessary?
>>>
>>> Juergen, Here is the data we get when running benchmark netperf:
>>>   1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
>>>      29031.6 bit/s -- 76.1 %CPU
>>>
>>>   2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
>>>      35787.7 bit/s -- 129.4 %CPU
>>>
>>>   3. w/ kvm dynamic poll:
>>>      35735.6 bit/s -- 200.0 %CPU
>>
>> Actually we can reduce the CPU utilization by sleeping a period of
>> time as what has already been done in the poll logic of IO subsystem,
>> then we can improve the algorithm in kvm instead of introduing another
>> duplicate one in the kvm guest.
>
> We really appreciate upstream's kvm dynamic poll mechanism, which is
> really helpful for a lot of scenario..
>
> However, as description said, in virtualization, idle path includes
> several heavy operations includes timer access (LAPIC timer or TSC
> deadline timer) which will hurt performance especially for latency
> intensive workload like message passing task. The cost is mainly from
> the vmexit which is a hardware context switch between virtual machine
> and hypervisor.
>
> for upstream's kvm dynamic poll mechanism, even you could provide a
> better algorism, how could you bypass timer access (LAPIC timer or TSC
> deadline timer), or a hardware context switch between virtual machine
> and hypervisor. I know these is a tradeoff.
>
> Furthermore, here is the data we get when running benchmark contextswitch
> to measure the latency(lower is better):
>
> 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
>   3402.9 ns/ctxsw -- 199.8 %CPU
>
> 2. w/ patch and disable kvm dynamic poll:
>   1163.5 ns/ctxsw -- 205.5 %CPU
>
> 3. w/ kvm dynamic poll:
>   2280.6 ns/ctxsw -- 199.5 %CPU
>
> so, these tow solution are quite similar, but not duplicate..
>
> that's also why to add a generic idle poll before enter real idle path.
> When a reschedule event is pending, we can bypass the real idle path.
>

There is similar logic in the idle governor/driver, so how does this
patchset influence the decision in the idle governor/driver when
running on bare metal (power management is not exposed to the guest, so
we will not enter the idle driver in the guest)?

Regards,
Wanpeng Li


* Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-11-14  7:30       ` Juergen Gross
@ 2017-11-14  9:38         ` Quan Xu
  2017-11-14 10:27           ` Juergen Gross
  0 siblings, 1 reply; 32+ messages in thread
From: Quan Xu @ 2017-11-14  9:38 UTC (permalink / raw)
  To: Juergen Gross, Quan Xu, kvm, linux-doc, linux-fsdevel,
	linux-kernel, virtualization, x86, xen-devel
  Cc: Yang Zhang, Alok Kataria, Rusty Russell, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin



On 2017/11/14 15:30, Juergen Gross wrote:
> On 14/11/17 08:02, Quan Xu wrote:
>>
>> On 2017/11/13 18:53, Juergen Gross wrote:
>>> On 13/11/17 11:06, Quan Xu wrote:
>>>> From: Quan Xu <quan.xu0@gmail.com>
>>>>
>>>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called
>>>> in idle path which will poll for a while before we enter the real idle
>>>> state.
>>>>
>>>> In virtualization, idle path includes several heavy operations
>>>> includes timer access(LAPIC timer or TSC deadline timer) which will
>>>> hurt performance especially for latency intensive workload like message
>>>> passing task. The cost is mainly from the vmexit which is a hardware
>>>> context switch between virtual machine and hypervisor. Our solution is
>>>> to poll for a while and do not enter real idle path if we can get the
>>>> schedule event during polling.
>>>>
>>>> Poll may cause the CPU waste so we adopt a smart polling mechanism to
>>>> reduce the useless poll.
>>>>
>>>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>>>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>>>> Cc: Juergen Gross <jgross@suse.com>
>>>> Cc: Alok Kataria <akataria@vmware.com>
>>>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>>> Cc: x86@kernel.org
>>>> Cc: virtualization@lists.linux-foundation.org
>>>> Cc: linux-kernel@vger.kernel.org
>>>> Cc: xen-devel@lists.xenproject.org
>>> Hmm, is the idle entry path really so critical to performance that a new
>>> pvops function is necessary?
>> Juergen, Here is the data we get when running benchmark netperf:
>>   1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
>>      29031.6 bit/s -- 76.1 %CPU
>>
>>   2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
>>      35787.7 bit/s -- 129.4 %CPU
>>
>>   3. w/ kvm dynamic poll:
>>      35735.6 bit/s -- 200.0 %CPU
>>
>>   4. w/patch and w/ kvm dynamic poll:
>>      42225.3 bit/s -- 198.7 %CPU
>>
>>   5. idle=poll
>>      37081.7 bit/s -- 998.1 %CPU
>>
>>
>>
>>   w/ this patch, we will improve performance by 23%.. even we could improve
>>   performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the
>>   cost of CPU is much lower than 'idle=poll' case..
> I don't question the general idea. I just think pvops isn't the best way
> to implement it.
>
>>> Wouldn't a function pointer, maybe guarded
>>> by a static key, be enough? A further advantage would be that this would
>>> work on other architectures, too.
>> I assume this feature will be ported to other archs.. a new pvops makes

       Sorry, a typo: s/other archs/other hypervisors/.
       It refers to hypervisors like Xen, Hyper-V and VMware.

>> code
>> clean and easy to maintain. also I tried to add it into existed pvops,
>> but it
>> doesn't match.
> You are aware that pvops is x86 only?

yes, I'm aware..

> I really don't see the big difference in maintainability compared to the
> static key / function pointer variant:
>
> void (*guest_idle_poll_func)(void);
> struct static_key guest_idle_poll_key __read_mostly;
>
> static inline void guest_idle_poll(void)
> {
> 	if (static_key_false(&guest_idle_poll_key))
> 		guest_idle_poll_func();
> }



Thank you for your sample code :)
I agree there is no big difference. I think we are discussing two things:
  1) x86 VMs on different hypervisors
  2) VMs of different archs on the KVM hypervisor

What I want to do is x86 VMs on different hypervisors, such as KVM / Xen /
Hyper-V.

> And KVM would just need to set guest_idle_poll_func and enable the
> static key. Works on non-x86 architectures, too.
>

As with 'pv_mmu_ops', Hyper-V and Xen can implement their own functions
for 'pv_mmu_ops'; I think it is the same for pv_idle_ops.

With the above explanation, do you still think I need to define the static
key / function pointer variant?

BTW, any interest in porting it to Xen HVM guests? :)

Quan
Alibaba Cloud


* Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-11-14  8:22           ` Wanpeng Li
@ 2017-11-14 10:23             ` Quan Xu
  0 siblings, 0 replies; 32+ messages in thread
From: Quan Xu @ 2017-11-14 10:23 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Juergen Gross, Quan Xu, kvm, linux-doc,
	open list:FILESYSTEMS (VFS and infrastructure),
	linux-kernel, virtualization, the arch/x86 maintainers,
	xen-devel, Yang Zhang, Alok Kataria, Rusty Russell,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin



On 2017/11/14 16:22, Wanpeng Li wrote:
> 2017-11-14 16:15 GMT+08:00 Quan Xu <quan.xu0@gmail.com>:
>>
>> On 2017/11/14 15:12, Wanpeng Li wrote:
>>> 2017-11-14 15:02 GMT+08:00 Quan Xu <quan.xu0@gmail.com>:
>>>>
>>>> On 2017/11/13 18:53, Juergen Gross wrote:
>>>>> On 13/11/17 11:06, Quan Xu wrote:
>>>>>> From: Quan Xu <quan.xu0@gmail.com>
>>>>>>
>>>>>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called
>>>>>> in idle path which will poll for a while before we enter the real idle
>>>>>> state.
>>>>>>
>>>>>> In virtualization, idle path includes several heavy operations
>>>>>> includes timer access(LAPIC timer or TSC deadline timer) which will
>>>>>> hurt performance especially for latency intensive workload like message
>>>>>> passing task. The cost is mainly from the vmexit which is a hardware
>>>>>> context switch between virtual machine and hypervisor. Our solution is
>>>>>> to poll for a while and do not enter real idle path if we can get the
>>>>>> schedule event during polling.
>>>>>>
>>>>>> Poll may cause the CPU waste so we adopt a smart polling mechanism to
>>>>>> reduce the useless poll.
>>>>>>
>>>>>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>>>>>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>>>>>> Cc: Juergen Gross <jgross@suse.com>
>>>>>> Cc: Alok Kataria <akataria@vmware.com>
>>>>>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>>>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>>>>> Cc: x86@kernel.org
>>>>>> Cc: virtualization@lists.linux-foundation.org
>>>>>> Cc: linux-kernel@vger.kernel.org
>>>>>> Cc: xen-devel@lists.xenproject.org
>>>>> Hmm, is the idle entry path really so critical to performance that a new
>>>>> pvops function is necessary?
>>>> Juergen, Here is the data we get when running benchmark netperf:
>>>>    1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
>>>>       29031.6 bit/s -- 76.1 %CPU
>>>>
>>>>    2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
>>>>       35787.7 bit/s -- 129.4 %CPU
>>>>
>>>>    3. w/ kvm dynamic poll:
>>>>       35735.6 bit/s -- 200.0 %CPU
>>> Actually we can reduce the CPU utilization by sleeping a period of
>>> time as what has already been done in the poll logic of IO subsystem,
>>> then we can improve the algorithm in kvm instead of introduing another
>>> duplicate one in the kvm guest.
>> We really appreciate upstream's kvm dynamic poll mechanism, which is
>> really helpful for a lot of scenario..
>>
>> However, as description said, in virtualization, idle path includes
>> several heavy operations includes timer access (LAPIC timer or TSC
>> deadline timer) which will hurt performance especially for latency
>> intensive workload like message passing task. The cost is mainly from
>> the vmexit which is a hardware context switch between virtual machine
>> and hypervisor.
>>
>> for upstream's kvm dynamic poll mechanism, even you could provide a
>> better algorism, how could you bypass timer access (LAPIC timer or TSC
>> deadline timer), or a hardware context switch between virtual machine
>> and hypervisor. I know these is a tradeoff.
>>
>> Furthermore, here is the data we get when running benchmark contextswitch
>> to measure the latency(lower is better):
>>
>> 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
>>    3402.9 ns/ctxsw -- 199.8 %CPU
>>
>> 2. w/ patch and disable kvm dynamic poll:
>>    1163.5 ns/ctxsw -- 205.5 %CPU
>>
>> 3. w/ kvm dynamic poll:
>>    2280.6 ns/ctxsw -- 199.5 %CPU
>>
>> so, these tow solution are quite similar, but not duplicate..
>>
>> that's also why to add a generic idle poll before enter real idle path.
>> When a reschedule event is pending, we can bypass the real idle path.
>>
> There is a similar logic in the idle governor/driver, so how this
> patchset influence the decision in the idle governor/driver when
> running on bare-metal(power managment is not exposed to the guest so
> we will not enter into idle driver in the guest)?
>

This is expected to take effect only when running as a virtual machine with
the proper CONFIG_* enabled. It cannot work on bare metal, even with the
proper CONFIG_* enabled.

Quan
Alibaba Cloud


* Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-11-14  9:38         ` Quan Xu
@ 2017-11-14 10:27           ` Juergen Gross
  2017-11-14 11:43             ` Quan Xu
  0 siblings, 1 reply; 32+ messages in thread
From: Juergen Gross @ 2017-11-14 10:27 UTC (permalink / raw)
  To: Quan Xu, Quan Xu, kvm, linux-doc, linux-fsdevel, linux-kernel,
	virtualization, x86, xen-devel
  Cc: Yang Zhang, Alok Kataria, Rusty Russell, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin

On 14/11/17 10:38, Quan Xu wrote:
> 
> 
> On 2017/11/14 15:30, Juergen Gross wrote:
>> On 14/11/17 08:02, Quan Xu wrote:
>>>
>>> On 2017/11/13 18:53, Juergen Gross wrote:
>>>> On 13/11/17 11:06, Quan Xu wrote:
>>>>> From: Quan Xu <quan.xu0@gmail.com>
>>>>>
>>>>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called
>>>>> in idle path which will poll for a while before we enter the real idle
>>>>> state.
>>>>>
>>>>> In virtualization, idle path includes several heavy operations
>>>>> includes timer access(LAPIC timer or TSC deadline timer) which will
>>>>> hurt performance especially for latency intensive workload like
>>>>> message
>>>>> passing task. The cost is mainly from the vmexit which is a hardware
>>>>> context switch between virtual machine and hypervisor. Our solution is
>>>>> to poll for a while and do not enter real idle path if we can get the
>>>>> schedule event during polling.
>>>>>
>>>>> Poll may cause the CPU waste so we adopt a smart polling mechanism to
>>>>> reduce the useless poll.
>>>>>
>>>>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>>>>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>>>>> Cc: Juergen Gross <jgross@suse.com>
>>>>> Cc: Alok Kataria <akataria@vmware.com>
>>>>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>>>> Cc: x86@kernel.org
>>>>> Cc: virtualization@lists.linux-foundation.org
>>>>> Cc: linux-kernel@vger.kernel.org
>>>>> Cc: xen-devel@lists.xenproject.org
>>>> Hmm, is the idle entry path really so critical to performance that a
>>>> new
>>>> pvops function is necessary?
>>> Juergen, Here is the data we get when running benchmark netperf:
>>>   1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
>>>      29031.6 bit/s -- 76.1 %CPU
>>>
>>>   2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
>>>      35787.7 bit/s -- 129.4 %CPU
>>>
>>>   3. w/ kvm dynamic poll:
>>>      35735.6 bit/s -- 200.0 %CPU
>>>
>>>   4. w/patch and w/ kvm dynamic poll:
>>>      42225.3 bit/s -- 198.7 %CPU
>>>
>>>   5. idle=poll
>>>      37081.7 bit/s -- 998.1 %CPU
>>>
>>>
>>>
>>>   w/ this patch, we will improve performance by 23%.. even we could
>>> improve
>>>   performance by 45.4%, if we use w/patch and w/ kvm dynamic poll.
>>> also the
>>>   cost of CPU is much lower than 'idle=poll' case..
>> I don't question the general idea. I just think pvops isn't the best way
>> to implement it.
>>
>>>> Wouldn't a function pointer, maybe guarded
>>>> by a static key, be enough? A further advantage would be that this
>>>> would
>>>> work on other architectures, too.
>>> I assume this feature will be ported to other archs.. a new pvops makes
> 
>       sorry, a typo.. /other archs/other hypervisors/
>       it refers hypervisor like Xen, HyperV and VMware)..
> 
>>> code
>>> clean and easy to maintain. also I tried to add it into existed pvops,
>>> but it
>>> doesn't match.
>> You are aware that pvops is x86 only?
> 
> yes, I'm aware..
> 
>> I really don't see the big difference in maintainability compared to the
>> static key / function pointer variant:
>>
>> void (*guest_idle_poll_func)(void);
>> struct static_key guest_idle_poll_key __read_mostly;
>>
>> static inline void guest_idle_poll(void)
>> {
>>     if (static_key_false(&guest_idle_poll_key))
>>         guest_idle_poll_func();
>> }
> 
> 
> 
> thank you for your sample code :)
> I agree there is no big difference.. I think we are discussion for two
> things:
>  1) x86 VM on different hypervisors
>  2) different archs VM on kvm hypervisor
> 
> What I want to do is x86 VM on different hypervisors, such as kvm / xen
> / hyperv ..

Why limit the solution to x86 if the more general solution isn't
harder?

As you didn't give any reason why the pvops approach is better, other than
that you don't care about non-x86 platforms, you won't get an "Ack" from me
for this patch.

> 
>> And KVM would just need to set guest_idle_poll_func and enable the
>> static key. Works on non-x86 architectures, too.
>>
> 
> .. referring to 'pv_mmu_ops', HyperV and Xen can implement their own
> functions for 'pv_mmu_ops'.
> I think it is the same for pv_idle_ops.
> 
> with the above explanation, do you still think I need to define the static
> key/function pointer variant?
> 
> btw, any interest to port it to Xen HVM guest? :)

Maybe. But this should work for Xen on ARM, too.


Juergen

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-11-14 10:27           ` Juergen Gross
@ 2017-11-14 11:43             ` Quan Xu
  2017-11-14 11:58               ` Juergen Gross
  0 siblings, 1 reply; 32+ messages in thread
From: Quan Xu @ 2017-11-14 11:43 UTC (permalink / raw)
  To: Juergen Gross, Quan Xu, kvm, linux-doc, linux-fsdevel,
	linux-kernel, virtualization, x86, xen-devel
  Cc: Yang Zhang, Alok Kataria, Rusty Russell, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin



On 2017/11/14 18:27, Juergen Gross wrote:
> On 14/11/17 10:38, Quan Xu wrote:
>>
>> On 2017/11/14 15:30, Juergen Gross wrote:
>>> On 14/11/17 08:02, Quan Xu wrote:
>>>> On 2017/11/13 18:53, Juergen Gross wrote:
>>>>> On 13/11/17 11:06, Quan Xu wrote:
>>>>>> From: Quan Xu <quan.xu0@gmail.com>
>>>>>>
>>>>>> So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called
>>>>>> in the idle path and polls for a while before we enter the real idle
>>>>>> state.
>>>>>>
>>>>>> In virtualization, the idle path includes several heavy operations,
>>>>>> including timer access (LAPIC timer or TSC deadline timer), which will
>>>>>> hurt performance, especially for latency-intensive workloads like
>>>>>> message-passing tasks. The cost comes mainly from the vmexit, which is
>>>>>> a hardware context switch between the virtual machine and the
>>>>>> hypervisor. Our solution is to poll for a while and not enter the real
>>>>>> idle path if we get a schedule event during polling.
>>>>>>
>>>>>> Polling may waste CPU cycles, so we adopt a smart polling mechanism to
>>>>>> reduce useless polling.
>>>>>>
>>>>>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>>>>>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>>>>>> Cc: Juergen Gross <jgross@suse.com>
>>>>>> Cc: Alok Kataria <akataria@vmware.com>
>>>>>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>>>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>>>>> Cc: x86@kernel.org
>>>>>> Cc: virtualization@lists.linux-foundation.org
>>>>>> Cc: linux-kernel@vger.kernel.org
>>>>>> Cc: xen-devel@lists.xenproject.org
>>>>> Hmm, is the idle entry path really so critical to performance that a
>>>>> new
>>>>> pvops function is necessary?
>>>> Juergen, Here is the data we get when running benchmark netperf:
>>>>    1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
>>>>       29031.6 bit/s -- 76.1 %CPU
>>>>
>>>>    2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
>>>>       35787.7 bit/s -- 129.4 %CPU
>>>>
>>>>    3. w/ kvm dynamic poll:
>>>>       35735.6 bit/s -- 200.0 %CPU
>>>>
>>>>    4. w/patch and w/ kvm dynamic poll:
>>>>       42225.3 bit/s -- 198.7 %CPU
>>>>
>>>>    5. idle=poll
>>>>       37081.7 bit/s -- 998.1 %CPU
>>>>
>>>>
>>>>
>>>>    w/ this patch, we improve performance by 23%.. we could even improve
>>>>    performance by 45.4% if we use the patch together with kvm dynamic poll.
>>>>    also the CPU cost is much lower than in the 'idle=poll' case..
>>> I don't question the general idea. I just think pvops isn't the best way
>>> to implement it.
>>>
>>>>> Wouldn't a function pointer, maybe guarded
>>>>> by a static key, be enough? A further advantage would be that this
>>>>> would
>>>>> work on other architectures, too.
>>>> I assume this feature will be ported to other archs.. a new pvops makes
>>        sorry, a typo.. /other archs/other hypervisors/
>>        it refers to hypervisors like Xen, HyperV and VMware)..
>>
>>>> code
>>>> clean and easy to maintain. also I tried to add it into existing pvops,
>>>> but it
>>>> doesn't match.
>>> You are aware that pvops is x86 only?
>> yes, I'm aware..
>>
>>> I really don't see the big difference in maintainability compared to the
>>> static key / function pointer variant:
>>>
>>> void (*guest_idle_poll_func)(void);
>>> struct static_key guest_idle_poll_key __read_mostly;
>>>
>>> static inline void guest_idle_poll(void)
>>> {
>>>      if (static_key_false(&guest_idle_poll_key))
>>>          guest_idle_poll_func();
>>> }
>>
>>
>> thank you for your sample code :)
>> I agree there is no big difference.. I think we are discussing two
>> things:
>>   1) x86 VM on different hypervisors
>>   2) VMs of different archs on the kvm hypervisor
>>
>> What I want to do is x86 VM on different hypervisors, such as kvm / xen
>> / hyperv ..
> Why limit the solution to x86 if the more general solution isn't
> harder?
>
> As you didn't give any reason why the pvops approach is better other
> than you don't care for non-x86 platforms you won't get an "Ack" from
> me for this patch.


It just looks a little odd to me. I understand you care about non-x86 archs.

Are you aware of 'pv_time_ops' for the arm64/arm/x86 archs, defined in
    - arch/arm64/include/asm/paravirt.h
    - arch/x86/include/asm/paravirt_types.h
    - arch/arm/include/asm/paravirt.h

I am unfamiliar with arm code. IIUC, if you'd implement pv_idle_ops
for the arm/arm64 archs, you'd define the same structure in
    - arch/arm64/include/asm/paravirt.h     or
    - arch/arm/include/asm/paravirt.h

.. instead of a static key / function.

then implement a real function in
    - arch/arm/kernel/paravirt.c.

Also I wonder HOW/WHERE a static key/function would be defined so that
both x86 and non-x86 archs benefit?

Quan
Alibaba Cloud

>>> And KVM would just need to set guest_idle_poll_func and enable the
>>> static key. Works on non-x86 architectures, too.
>>>
>> .. referring to 'pv_mmu_ops', HyperV and Xen can implement their own
>> functions for 'pv_mmu_ops'.
>> I think it is the same for pv_idle_ops.
>>
>> with the above explanation, do you still think I need to define the static
>> key/function pointer variant?
>>
>> btw, any interest to port it to Xen HVM guest? :)
> Maybe. But this should work for Xen on ARM, too.
>
>
> Juergen
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
  2017-11-14 11:43             ` Quan Xu
@ 2017-11-14 11:58               ` Juergen Gross
  0 siblings, 0 replies; 32+ messages in thread
From: Juergen Gross @ 2017-11-14 11:58 UTC (permalink / raw)
  To: Quan Xu, Quan Xu, kvm, linux-doc, linux-fsdevel, linux-kernel,
	virtualization, x86, xen-devel
  Cc: Yang Zhang, Alok Kataria, Rusty Russell, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin

On 14/11/17 12:43, Quan Xu wrote:
> 
> 
> On 2017/11/14 18:27, Juergen Gross wrote:
>> On 14/11/17 10:38, Quan Xu wrote:
>>>
>>> On 2017/11/14 15:30, Juergen Gross wrote:
>>>> On 14/11/17 08:02, Quan Xu wrote:
>>>>> On 2017/11/13 18:53, Juergen Gross wrote:
>>>>>> On 13/11/17 11:06, Quan Xu wrote:
>>>>>>> From: Quan Xu <quan.xu0@gmail.com>
>>>>>>>
>>>>>>> So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called
>>>>>>> in the idle path and polls for a while before we enter the real idle
>>>>>>> state.
>>>>>>>
>>>>>>> In virtualization, the idle path includes several heavy operations,
>>>>>>> including timer access (LAPIC timer or TSC deadline timer), which will
>>>>>>> hurt performance, especially for latency-intensive workloads like
>>>>>>> message-passing tasks. The cost comes mainly from the vmexit, which
>>>>>>> is a hardware context switch between the virtual machine and the
>>>>>>> hypervisor. Our solution is to poll for a while and not enter the
>>>>>>> real idle path if we get a schedule event during polling.
>>>>>>>
>>>>>>> Polling may waste CPU cycles, so we adopt a smart polling mechanism
>>>>>>> to reduce useless polling.
>>>>>>>
>>>>>>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>>>>>>> Signed-off-by: Quan Xu <quan.xu0@gmail.com>
>>>>>>> Cc: Juergen Gross <jgross@suse.com>
>>>>>>> Cc: Alok Kataria <akataria@vmware.com>
>>>>>>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>>>>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>>>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>>>>>> Cc: x86@kernel.org
>>>>>>> Cc: virtualization@lists.linux-foundation.org
>>>>>>> Cc: linux-kernel@vger.kernel.org
>>>>>>> Cc: xen-devel@lists.xenproject.org
>>>>>> Hmm, is the idle entry path really so critical to performance that a
>>>>>> new
>>>>>> pvops function is necessary?
>>>>> Juergen, Here is the data we get when running benchmark netperf:
>>>>>    1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
>>>>>       29031.6 bit/s -- 76.1 %CPU
>>>>>
>>>>>    2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
>>>>>       35787.7 bit/s -- 129.4 %CPU
>>>>>
>>>>>    3. w/ kvm dynamic poll:
>>>>>       35735.6 bit/s -- 200.0 %CPU
>>>>>
>>>>>    4. w/patch and w/ kvm dynamic poll:
>>>>>       42225.3 bit/s -- 198.7 %CPU
>>>>>
>>>>>    5. idle=poll
>>>>>       37081.7 bit/s -- 998.1 %CPU
>>>>>
>>>>>
>>>>>
>>>>>    w/ this patch, we improve performance by 23%.. we could even improve
>>>>>    performance by 45.4% if we use the patch together with kvm dynamic poll.
>>>>>    also the CPU cost is much lower than in the 'idle=poll' case..
>>>> I don't question the general idea. I just think pvops isn't the best
>>>> way
>>>> to implement it.
>>>>
>>>>>> Wouldn't a function pointer, maybe guarded
>>>>>> by a static key, be enough? A further advantage would be that this
>>>>>> would
>>>>>> work on other architectures, too.
>>>>> I assume this feature will be ported to other archs.. a new pvops
>>>>> makes
>>>        sorry, a typo.. /other archs/other hypervisors/
>>>        it refers to hypervisors like Xen, HyperV and VMware)..
>>>
>>>>> code
>>>>> clean and easy to maintain. also I tried to add it into existing pvops,
>>>>> but it
>>>>> doesn't match.
>>>> You are aware that pvops is x86 only?
>>> yes, I'm aware..
>>>
>>>> I really don't see the big difference in maintainability compared to
>>>> the
>>>> static key / function pointer variant:
>>>>
>>>> void (*guest_idle_poll_func)(void);
>>>> struct static_key guest_idle_poll_key __read_mostly;
>>>>
>>>> static inline void guest_idle_poll(void)
>>>> {
>>>>      if (static_key_false(&guest_idle_poll_key))
>>>>          guest_idle_poll_func();
>>>> }
>>>
>>>
>>> thank you for your sample code :)
>>> I agree there is no big difference.. I think we are discussing two
>>> things:
>>>   1) x86 VM on different hypervisors
>>>   2) VMs of different archs on the kvm hypervisor
>>>
>>> What I want to do is x86 VM on different hypervisors, such as kvm / xen
>>> / hyperv ..
>> Why limit the solution to x86 if the more general solution isn't
>> harder?
>>
>> As you didn't give any reason why the pvops approach is better other
>> than you don't care for non-x86 platforms you won't get an "Ack" from
>> me for this patch.
> 
> 
> It just looks a little odd to me. I understand you care about non-x86
> archs.
> 
> Are you aware of 'pv_time_ops' for the arm64/arm/x86 archs, defined in
>    - arch/arm64/include/asm/paravirt.h
>    - arch/x86/include/asm/paravirt_types.h
>    - arch/arm/include/asm/paravirt.h

Yes, I know. This is just a hack to make it compile. Other than the
same names this has nothing to do with pvops, but is just a function
vector.

> I am unfamiliar with arm code. IIUC, if you'd implement pv_idle_ops
> for the arm/arm64 archs, you'd define the same structure in
>    - arch/arm64/include/asm/paravirt.h     or
>    - arch/arm/include/asm/paravirt.h
> 
> .. instead of a static key / function.
> 
> then implement a real function in
>    - arch/arm/kernel/paravirt.c.

So just to use pvops you want to implement it in each arch instead
of using a mechanism available everywhere?

> Also I wonder HOW/WHERE a static key/function would be defined so that
> both x86 and non-x86 archs benefit?

What? There are plenty of examples in the kernel.
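
Just to sketch it (untested, the KVM-side names below are invented for
illustration): the function pointer and the key live in common code, the
generic idle path calls the inline helper, and the KVM guest registers its
poll function at init time:

/* common code, e.g. kernel/sched/idle.c */
void (*guest_idle_poll_func)(void);
struct static_key guest_idle_poll_key __read_mostly;

static inline void guest_idle_poll(void)
{
    if (static_key_false(&guest_idle_poll_key))
        guest_idle_poll_func();
}

/* KVM guest code, e.g. arch/x86/kernel/kvm.c, called from kvm_guest_init() */
static void kvm_idle_poll(void)
{
    /* poll for a bounded time, bail out early on need_resched() */
}

static void __init kvm_idle_poll_register(void)
{
    guest_idle_poll_func = kvm_idle_poll;
    static_key_slow_inc(&guest_idle_poll_key);
}

No pvops and no per-arch paravirt structure needed, and any hypervisor or
architecture can use it.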

Please stop wasting my time. Either write a patch which is acceptable
or let it be. I won't take your pvops approach without a really good
reason to do so. And so far you haven't given any reason other than
you are too lazy to write a proper patch, sorry.


Juergen

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
  2017-11-13 10:06 ` [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path Quan Xu
@ 2017-11-15 12:11   ` Peter Zijlstra
  2017-11-15 22:03     ` Thomas Gleixner
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2017-11-15 12:11 UTC (permalink / raw)
  To: Quan Xu
  Cc: kvm, linux-doc, linux-fsdevel, linux-kernel, virtualization, x86,
	xen-devel, Yang Zhang, Quan Xu, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Borislav Petkov, Kyle Huey, Len Brown,
	Andy Lutomirski, Tom Lendacky, Tobias Klauser

On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote:
> From: Yang Zhang <yang.zhang.wz@gmail.com>
> 
> Implement a generic idle poll which resembles the functionality
> found in arch/. Provide weak arch_cpu_idle_poll function which
> can be overridden by the architecture code if needed.

No, we want less of those magic hooks, not more.

> Interrupts may arrive in idle loops without causing a reschedule.
> In a KVM guest, this costs several VM-exit/VM-entry cycles: a VM-entry
> for the interrupt and a VM-exit immediately afterwards. This also
> becomes more expensive than on bare metal. Add a generic idle poll
> before entering the real idle path. When a reschedule event is pending,
> we can bypass the real idle path.

Why not do a HV specific idle driver?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Xen-devel] [PATCH RFC v3 0/6] x86/idle: add halt poll support
  2017-11-13 10:05 [PATCH RFC v3 0/6] x86/idle: add halt poll support Quan Xu
                   ` (2 preceding siblings ...)
  2017-11-13 10:06 ` [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path Quan Xu
@ 2017-11-15 21:31 ` Konrad Rzeszutek Wilk
  2017-11-20  7:18   ` Quan Xu
  3 siblings, 1 reply; 32+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-11-15 21:31 UTC (permalink / raw)
  To: Quan Xu
  Cc: kvm, linux-doc, linux-fsdevel, linux-kernel, virtualization, x86,
	xen-devel, Yang Zhang

On Mon, Nov 13, 2017 at 06:05:59PM +0800, Quan Xu wrote:
> From: Yang Zhang <yang.zhang.wz@gmail.com>
> 
> Some latency-intensive workload have seen obviously performance
> drop when running inside VM. The main reason is that the overhead
> is amplified when running inside VM. The most cost I have seen is
> inside idle path.

Meaning a VMEXIT b/c it is a 'halt' operation ? And then going
back in guest (VMRESUME) takes time. And hence your latency gets
all whacked b/c of this?

So if I understand - you want to use your _full_ timeslice (of the guest)
without ever (or as rarely as possible) going into the hypervisor?

Which means in effect you don't care about power-saving or CPUfreq
savings, you just want to eat the full CPU for snack?

> 
> This patch introduces a new mechanism to poll for a while before
> entering idle state. If schedule is needed during poll, then we
> don't need to goes through the heavy overhead path.

Schedule of what? The guest or the host?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
  2017-11-15 12:11   ` Peter Zijlstra
@ 2017-11-15 22:03     ` Thomas Gleixner
  2017-11-16  8:45       ` Peter Zijlstra
  2017-11-16  9:12       ` Quan Xu
  0 siblings, 2 replies; 32+ messages in thread
From: Thomas Gleixner @ 2017-11-15 22:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Quan Xu, kvm, linux-doc, linux-fsdevel, LKML, virtualization,
	x86, xen-devel, Yang Zhang, Quan Xu, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov, Kyle Huey, Len Brown, Andy Lutomirski,
	Tom Lendacky, Tobias Klauser, Daniel Lezcano

On Wed, 15 Nov 2017, Peter Zijlstra wrote:

> On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote:
> > From: Yang Zhang <yang.zhang.wz@gmail.com>
> > 
> > Implement a generic idle poll which resembles the functionality
> > found in arch/. Provide weak arch_cpu_idle_poll function which
> > can be overridden by the architecture code if needed.
> 
> No, we want less of those magic hooks, not more.
> 
> > Interrupts may arrive in idle loops without causing a reschedule.
> > In a KVM guest, this costs several VM-exit/VM-entry cycles: a VM-entry
> > for the interrupt and a VM-exit immediately afterwards. This also
> > becomes more expensive than on bare metal. Add a generic idle poll
> > before entering the real idle path. When a reschedule event is pending,
> > we can bypass the real idle path.
> 
> Why not do a HV specific idle driver?

If I understand the problem correctly then he wants to avoid the heavy
lifting in tick_nohz_idle_enter() in the first place, but there is already
an interesting quirk there which makes it exit early.  See commit
3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this commit
looks similar. But lets not proliferate that. I'd rather see that go away.

But the irq_timings stuff is heading into the same direction, with a more
complex prediction logic which should tell you pretty good how long that
idle period is going to be and in case of an interrupt heavy workload this
would skip the extra work of stopping and restarting the tick and provide a
very good input into a polling decision.

This can be handled either in a HV specific idle driver or even in the
generic core code. If the interrupt does not arrive within the predicted
time, then you can assume that the flood stopped and invoke halt or
whatever.

That avoids all of that 'tunable and tweakable' x86 specific hackery and
utilizes common functionality which is mostly there already.
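
Roughly (a hand-waving sketch; predicted_next_event() is made up and would
be fed from the irq timings prediction, the next timer expiry and the next
scheduling event):

static void poll_then_idle(void)
{
	u64 now  = local_clock();
	u64 next = predicted_next_event(now);

	/* poll only up to the predicted arrival of the next event */
	while (!need_resched() && local_clock() < next)
		cpu_relax();

	/* nothing showed up within the predicted window: the flood stopped */
	if (!need_resched())
		default_idle_call();
}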

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
  2017-11-15 22:03     ` Thomas Gleixner
@ 2017-11-16  8:45       ` Peter Zijlstra
  2017-11-16  8:58         ` Thomas Gleixner
  2017-11-16  9:29         ` Quan Xu
  2017-11-16  9:12       ` Quan Xu
  1 sibling, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2017-11-16  8:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Quan Xu, kvm, linux-doc, linux-fsdevel, LKML, virtualization,
	x86, xen-devel, Yang Zhang, Quan Xu, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov, Kyle Huey, Len Brown, Andy Lutomirski,
	Tom Lendacky, Tobias Klauser, Daniel Lezcano

On Wed, Nov 15, 2017 at 11:03:08PM +0100, Thomas Gleixner wrote:
> If I understand the problem correctly then he wants to avoid the heavy
> lifting in tick_nohz_idle_enter() in the first place, but there is already
> an interesting quirk there which makes it exit early. 

Sure. And there are people who want to do the same for native.

Adding more ugly and special cases just isn't the way to go about doing
that.

I'm fairly sure I've told the various groups that want to tinker with
this to work together on this. I've also in fairly significant detail
sketched how to rework the idle code and idle predictors.

At this point I'm too tired to dig any of that up, so I'll just keep
saying no to patches that don't even attempt to go in the right
direction.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
  2017-11-16  8:45       ` Peter Zijlstra
@ 2017-11-16  8:58         ` Thomas Gleixner
  2017-11-16  9:29         ` Quan Xu
  1 sibling, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2017-11-16  8:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Quan Xu, kvm, linux-doc, linux-fsdevel, LKML, virtualization,
	x86, xen-devel, Yang Zhang, Quan Xu, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov, Kyle Huey, Len Brown, Andy Lutomirski,
	Tom Lendacky, Tobias Klauser, Daniel Lezcano

On Thu, 16 Nov 2017, Peter Zijlstra wrote:

> On Wed, Nov 15, 2017 at 11:03:08PM +0100, Thomas Gleixner wrote:
> > If I understand the problem correctly then he wants to avoid the heavy
> > lifting in tick_nohz_idle_enter() in the first place, but there is already
> > an interesting quirk there which makes it exit early. 
> 
> Sure. And there are people who want to do the same for native.
> 
> Adding more ugly and special cases just isn't the way to go about doing
> that.
> 
> I'm fairly sure I've told the various groups that want to tinker with
> this to work together on this. I've also in fairly significant detail
> sketched how to rework the idle code and idle predictors.
> 
> At this point I'm too tired to dig any of that up, so I'll just keep
> saying no to patches that don't even attempt to go in the right
> direction.

That's why I said: But lets not proliferate that. I'd rather see that go
away.

And yes, the VM folks should talk to those who are trying to solve similar
problems for native (embedded/mobile).

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
  2017-11-15 22:03     ` Thomas Gleixner
  2017-11-16  8:45       ` Peter Zijlstra
@ 2017-11-16  9:12       ` Quan Xu
  2017-11-16  9:45         ` Daniel Lezcano
  2017-11-16  9:53         ` Thomas Gleixner
  1 sibling, 2 replies; 32+ messages in thread
From: Quan Xu @ 2017-11-16  9:12 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra
  Cc: Quan Xu, kvm, linux-doc, linux-fsdevel, LKML, virtualization,
	x86, xen-devel, Yang Zhang, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov, Kyle Huey, Len Brown, Andy Lutomirski,
	Tom Lendacky, Tobias Klauser, Daniel Lezcano



On 2017-11-16 06:03, Thomas Gleixner wrote:
> On Wed, 15 Nov 2017, Peter Zijlstra wrote:
>
>> On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote:
>>> From: Yang Zhang <yang.zhang.wz@gmail.com>
>>>
>>> Implement a generic idle poll which resembles the functionality
>>> found in arch/. Provide weak arch_cpu_idle_poll function which
>>> can be overridden by the architecture code if needed.
>> No, we want less of those magic hooks, not more.
>>
>>> Interrupts may arrive in idle loops without causing a reschedule.
>>> In a KVM guest, this costs several VM-exit/VM-entry cycles: a VM-entry
>>> for the interrupt and a VM-exit immediately afterwards. This also
>>> becomes more expensive than on bare metal. Add a generic idle poll
>>> before entering the real idle path. When a reschedule event is pending,
>>> we can bypass the real idle path.
>> Why not do a HV specific idle driver?
> If I understand the problem correctly then he wants to avoid the heavy
> lifting in tick_nohz_idle_enter() in the first place, but there is already
> an interesting quirk there which makes it exit early.  See commit
> 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this commit
> looks similar. But lets not proliferate that. I'd rather see that go away.

agreed.

Even though we can get more benefit in a kvm guest than commit 3c5d92a0cfb5
("nohz: Introduce arch_needs_cpu") did, I won't proliferate that..

> But the irq_timings stuff is heading into the same direction, with a more
> complex prediction logic which should tell you pretty good how long that
> idle period is going to be and in case of an interrupt heavy workload this
> would skip the extra work of stopping and restarting the tick and provide a
> very good input into a polling decision.


interesting. I have tested with IRQ_TIMINGS related code, which seems 
not working so far.
Also I'd like to help as much as I can.
> This can be handled either in a HV specific idle driver or even in the
> generic core code. If the interrupt does not arrive within the predicted
> time, then you can assume that the flood stopped and invoke halt or
> whatever.
>
> That avoids all of that 'tunable and tweakable' x86 specific hackery and
> utilizes common functionality which is mostly there already.
here is some sample code. Poll for a while before entering halt in
cpuidle_enter_state().
If I get a reschedule event, then don't try to enter halt.  (I hope this
is the right direction, as Peter mentioned in another email)

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, 
struct cpuidle_driver *drv,
                 target_state = &drv->states[index];
         }

+#ifdef CONFIG_PARAVIRT
+       paravirt_idle_poll();
+
+       if (need_resched())
+               return -EBUSY;
+#endif
+
         /* Take note of the planned idle state. */
         sched_idle_set_state(target_state);




thanks,

Quan
Alibaba Cloud

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
  2017-11-16  8:45       ` Peter Zijlstra
  2017-11-16  8:58         ` Thomas Gleixner
@ 2017-11-16  9:29         ` Quan Xu
  2017-11-16  9:47           ` Thomas Gleixner
  1 sibling, 1 reply; 32+ messages in thread
From: Quan Xu @ 2017-11-16  9:29 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: Quan Xu, kvm, linux-doc, linux-fsdevel, LKML, virtualization,
	x86, xen-devel, Yang Zhang, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov, Kyle Huey, Len Brown, Andy Lutomirski,
	Tom Lendacky, Tobias Klauser, Daniel Lezcano



On 2017-11-16 16:45, Peter Zijlstra wrote:
> On Wed, Nov 15, 2017 at 11:03:08PM +0100, Thomas Gleixner wrote:
>> If I understand the problem correctly then he wants to avoid the heavy
>> lifting in tick_nohz_idle_enter() in the first place, but there is already
>> an interesting quirk there which makes it exit early.
> Sure. And there are people who want to do the same for native.
>
> Adding more ugly and special cases just isn't the way to go about doing
> that.
>
> I'm fairly sure I've told the various groups that want to tinker with
> this to work together on this. I've also in fairly significant detail
> sketched how to rework the idle code and idle predictors.
>
> At this point I'm too tired to dig any of that up, so I'll just keep
> saying no to patches that don't even attempt to go in the right
> direction.
Peter, take care.

I really have considered this factor, and try my best not to interfere 
with scheduler/idle code.
if irq_timings code is ready, I can use it directly. I think irq_timings 
is not an easy task, I'd
like to help as much as I can.  Also I won't try to touch tick_nohz* code 
again.

as tglx suggested, this can be handled either in a HV specific idle driver or even in the generic core code.

I hope this is in the right direction.

Quan
Alibaba Cloud

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
  2017-11-16  9:12       ` Quan Xu
@ 2017-11-16  9:45         ` Daniel Lezcano
  2017-11-20  7:05           ` Quan Xu
  2017-11-16  9:53         ` Thomas Gleixner
  1 sibling, 1 reply; 32+ messages in thread
From: Daniel Lezcano @ 2017-11-16  9:45 UTC (permalink / raw)
  To: Quan Xu, Thomas Gleixner, Peter Zijlstra
  Cc: Quan Xu, kvm, linux-doc, linux-fsdevel, LKML, virtualization,
	x86, xen-devel, Yang Zhang, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov, Kyle Huey, Len Brown, Andy Lutomirski,
	Tom Lendacky, Tobias Klauser

On 16/11/2017 10:12, Quan Xu wrote:
> 
> 
> On 2017-11-16 06:03, Thomas Gleixner wrote:
>> On Wed, 15 Nov 2017, Peter Zijlstra wrote:
>>
>>> On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote:
>>>> From: Yang Zhang <yang.zhang.wz@gmail.com>
>>>>
>>>> Implement a generic idle poll which resembles the functionality
>>>> found in arch/. Provide weak arch_cpu_idle_poll function which
>>>> can be overridden by the architecture code if needed.
>>> No, we want less of those magic hooks, not more.
>>>
>>>> Interrupts may arrive in idle loops without causing a reschedule.
>>>> In a KVM guest, this costs several VM-exit/VM-entry cycles: a VM-entry
>>>> for the interrupt and a VM-exit immediately afterwards. This also
>>>> becomes more expensive than on bare metal. Add a generic idle poll
>>>> before entering the real idle path. When a reschedule event is pending,
>>>> we can bypass the real idle path.
>>> Why not do a HV specific idle driver?
>> If I understand the problem correctly then he wants to avoid the heavy
>> lifting in tick_nohz_idle_enter() in the first place, but there is
>> already
>> an interesting quirk there which makes it exit early.  See commit
>> 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this
>> commit
>> looks similar. But lets not proliferate that. I'd rather see that go
>> away.
> 
> agreed.
> 
> Even though we can get more benefit in a kvm guest than commit 3c5d92a0cfb5
> ("nohz: Introduce arch_needs_cpu") did, I won't proliferate that..
> 
>> But the irq_timings stuff is heading into the same direction, with a more
>> complex prediction logic which should tell you pretty good how long that
>> idle period is going to be and in case of an interrupt heavy workload
>> this
>> would skip the extra work of stopping and restarting the tick and
>> provide a
>> very good input into a polling decision.
> 
> 
> interesting. I have tested with IRQ_TIMINGS related code, which seems
> not working so far.

I don't know how you tested it, can you elaborate what you meant by
"seems not working so far" ?

There is still some work to do to be more efficient. The prediction
based on the irq timings is all right if the interrupts have a simple
periodicity. But as soon as there is a pattern, the current code can't
handle it properly and does bad predictions.

I'm working on a self-learning pattern detection which is too heavy for
the kernel, and with it we should be able to properly detect the
patterns and re-adjust the period if it changes. I'm in the process of
making it suitable for kernel code (both math and perf).

One improvement which can be done right now and which can help you is
the interrupt rate on the CPU. It is possible to compute it, and that
will give accurate information for the polling decision.



-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
  2017-11-16  9:29         ` Quan Xu
@ 2017-11-16  9:47           ` Thomas Gleixner
  0 siblings, 0 replies; 32+ messages in thread
From: Thomas Gleixner @ 2017-11-16  9:47 UTC (permalink / raw)
  To: Quan Xu
  Cc: Peter Zijlstra, Quan Xu, kvm, linux-doc, linux-fsdevel, LKML,
	virtualization, x86, xen-devel, Yang Zhang, Ingo Molnar,
	H. Peter Anvin, Borislav Petkov, Kyle Huey, Len Brown,
	Andy Lutomirski, Tom Lendacky, Tobias Klauser, Daniel Lezcano

On Thu, 16 Nov 2017, Quan Xu wrote:
> On 2017-11-16 16:45, Peter Zijlstra wrote:
> 
> I really have considered this factor, and try my best not to interfere with
> scheduler/idle code.
>
> if irq_timings code is ready, I can use it directly. I think irq_timings
> is not an easy task, I'd like to help as much as I can.

It's not a question whether irq_timings code is ready or not.

The infrastructure is there. I said that before and I'm happy to repeat:

 All parties who need this kind of prediction should:

     1) Talk to each other

     2) Work together to make it usable for _ALL_ use cases

 AFAICT, that never happened. Otherwise there would be either progress on
 that or at least a reasonable explanation WHY it cannot be done.

Peter and myself made it entirely clear several times in the past that this
must be solved at the generic level without any magic hackery in random
places. But the only thing we've seen so far is the magic hackery coming
around in a different flavour some time after we rejected the last one.

We can play that game forever. The outcome is extremely predictable.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
  2017-11-16  9:12       ` Quan Xu
  2017-11-16  9:45         ` Daniel Lezcano
@ 2017-11-16  9:53         ` Thomas Gleixner
  2017-11-17 11:23           ` Quan Xu
  1 sibling, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2017-11-16  9:53 UTC (permalink / raw)
  To: Quan Xu
  Cc: Peter Zijlstra, Quan Xu, kvm, linux-doc, linux-fsdevel, LKML,
	virtualization, x86, xen-devel, Yang Zhang, Ingo Molnar,
	H. Peter Anvin, Borislav Petkov, Kyle Huey, Len Brown,
	Andy Lutomirski, Tom Lendacky, Tobias Klauser, Daniel Lezcano


On Thu, 16 Nov 2017, Quan Xu wrote:
> On 2017-11-16 06:03, Thomas Gleixner wrote:
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev,
> struct cpuidle_driver *drv,
>                 target_state = &drv->states[index];
>         }
> 
> +#ifdef CONFIG_PARAVIRT
> +       paravirt_idle_poll();
> +
> +       if (need_resched())
> +               return -EBUSY;
> +#endif

That's just plain wrong. We don't want to see any of this PARAVIRT crap in
anything outside the architecture/hypervisor interfacing code which really
needs it.

The problem can and must be solved at the generic level in the first place
to gather the data which can be used to make such decisions.

How that information is used might be either completely generic or requires
system specific variants. But as long as we don't have any information at
all we cannot discuss that.

Please sit down and write up which data needs to be considered to make
decisions about probabilistic polling. Then we need to compare and contrast
that with the data which is necessary to make power/idle state decisions.

I would be very surprised if this data would not overlap by at least 90%.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
  2017-11-16  9:53         ` Thomas Gleixner
@ 2017-11-17 11:23           ` Quan Xu
  2017-11-17 11:36             ` Thomas Gleixner
  0 siblings, 1 reply; 32+ messages in thread
From: Quan Xu @ 2017-11-17 11:23 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra
  Cc: Quan Xu, kvm, linux-doc, linux-fsdevel, LKML, virtualization,
	x86, xen-devel, Yang Zhang, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov, Kyle Huey, Len Brown, Andy Lutomirski,
	Tom Lendacky, Tobias Klauser, Daniel Lezcano



On 2017-11-16 17:53, Thomas Gleixner wrote:
> On Thu, 16 Nov 2017, Quan Xu wrote:
>> On 2017-11-16 06:03, Thomas Gleixner wrote:
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev,
>> struct cpuidle_driver *drv,
>>                  target_state = &drv->states[index];
>>          }
>>
>> +#ifdef CONFIG_PARAVIRT
>> +       paravirt_idle_poll();
>> +
>> +       if (need_resched())
>> +               return -EBUSY;
>> +#endif
> That's just plain wrong. We don't want to see any of this PARAVIRT crap in
> anything outside the architecture/hypervisor interfacing code which really
> needs it.
>
> The problem can and must be solved at the generic level in the first place
> to gather the data which can be used to make such decisions.
>
> How that information is used might be either completely generic or requires
> system specific variants. But as long as we don't have any information at
> all we cannot discuss that.
>
> Please sit down and write up which data needs to be considered to make
> decisions about probabilistic polling. Then we need to compare and contrast
> that with the data which is necessary to make power/idle state decisions.
>
> I would be very surprised if this data would not overlap by at least 90%.
>

Peter, tglx
Thanks for your comments..

rethink of this patch set,

1. which data needs to be considered to make decisions about probabilistic
polling

I really need to write up which data needs to be considered to make
decisions about probabilistic polling. Over the last several months,
I always focused on the data _from idle to reschedule_, then to bypass
the idle loops. unfortunately, this makes me touch scheduler/idle/nohz
code inevitably.

with tglx's suggestion, the data which is necessary to make power/idle
state decisions is the last idle state's residency time. IIUC this data
is the duration from idle to wakeup, which may be due to a reschedule irq
or another irq.

I also tested that the reschedule irq overlaps by more than 90% (tracing the
need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for
one minute.

given that overlap, I think I can feed the last idle state's residency time
into the decisions about probabilistic polling, as @dev->last_residency does.
it is much easier to get that data.


2. do a HV specific idle driver (function)

so far, power management is not exposed to the guest.. idle is simple for
a KVM guest,
calling "sti" / "hlt" (cpuidle_idle_call() --> default_idle_call())..
thanks to the Xen guys, who have implemented the paravirt framework. I can
implement it
as easily as the following:

              --- a/arch/x86/kernel/kvm.c
              +++ b/arch/x86/kernel/kvm.c
              @@ -465,6 +465,12 @@ static void __init 
kvm_apf_trap_init(void)
                      update_intr_gate(X86_TRAP_PF, async_page_fault);
               }

              +static __cpuidle void kvm_safe_halt(void)
              +{
          +        /* 1. POLL, if need_resched() --> return */
          +
              +        asm volatile("sti; hlt": : :"memory"); /* 2. halt */
              +
          +        /* 3. get the last idle state's residency time */
              +
          +        /* 4. update poll duration based on last idle state's 
residency time */
              +}
              +
               void __init kvm_guest_init(void)
               {
                      int i;
              @@ -490,6 +496,8 @@ void __init kvm_guest_init(void)
                      if (kvmclock_vsyscall)
                              kvm_setup_vsyscall_timeinfo();

              +       pv_irq_ops.safe_halt = kvm_safe_halt;
              +
               #ifdef CONFIG_SMP
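
To make steps 1-4 above a bit more concrete: the self-tuning could mirror
the grow/shrink logic of the host side halt polling. A rough, untested
sketch (all names and thresholds below are made up):

static unsigned int kvm_poll_ns = 10000;	/* per-vCPU in a real patch */

static __cpuidle void kvm_safe_halt(void)
{
	u64 start = local_clock(), residency;

	/* 1. poll; irqs are still disabled here, so only remote wakeups
	 *    (which set TIF_NEED_RESCHED before the IPI) are visible */
	while (!need_resched() && local_clock() - start < kvm_poll_ns)
		cpu_relax();

	if (need_resched()) {
		local_irq_enable();	/* safe_halt() returns with irqs on */
		return;
	}

	/* 2. real halt */
	asm volatile("sti; hlt": : :"memory");

	/* 3. last idle residency / 4. tune the poll window from it */
	residency = local_clock() - start;
	if (residency < 2 * kvm_poll_ns)
		kvm_poll_ns = min(kvm_poll_ns * 2, 500000U);	/* grow */
	else
		kvm_poll_ns = max(kvm_poll_ns / 2, 2000U);	/* shrink */
}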




then there is no need for me to introduce a new pvops, and I never need to
modify schedule/idle/nohz code again.
also I can narrow all of the code down to arch/x86/kernel/kvm.c.

If this is in the right direction, I will send a new patch set next week..

thanks,

Quan
Alibaba Cloud

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
  2017-11-17 11:23           ` Quan Xu
@ 2017-11-17 11:36             ` Thomas Gleixner
  2017-11-17 12:21               ` Quan Xu
  0 siblings, 1 reply; 32+ messages in thread
From: Thomas Gleixner @ 2017-11-17 11:36 UTC (permalink / raw)
  To: Quan Xu
  Cc: Peter Zijlstra, Quan Xu, kvm, linux-doc, linux-fsdevel, LKML,
	virtualization, x86, xen-devel, Yang Zhang, Ingo Molnar,
	H. Peter Anvin, Borislav Petkov, Kyle Huey, Len Brown,
	Andy Lutomirski, Tom Lendacky, Tobias Klauser, Daniel Lezcano


On Fri, 17 Nov 2017, Quan Xu wrote:
> On 2017-11-16 17:53, Thomas Gleixner wrote:
> > That's just plain wrong. We don't want to see any of this PARAVIRT crap in
> > anything outside the architecture/hypervisor interfacing code which really
> > needs it.
> > 
> > The problem can and must be solved at the generic level in the first place
> > to gather the data which can be used to make such decisions.
> > 
> > How that information is used might be either completely generic or requires
> > system specific variants. But as long as we don't have any information at
> > all we cannot discuss that.
> > 
> > Please sit down and write up which data needs to be considered to make
> > decisions about probabilistic polling. Then we need to compare and contrast
> > that with the data which is necessary to make power/idle state decisions.
> > 
> > I would be very surprised if this data would not overlap by at least 90%.
> > 
> 1. which data needs to be considered to make decisions about probabilistic polling
> 
> I really need to write up which data needs to be considered to make
> decisions about probabilistic polling. Over the last several months,
> I always focused on the data _from idle to reschedule_, then to bypass
> the idle loops. unfortunately, this makes me touch scheduler/idle/nohz
> code inevitably.
> 
> with tglx's suggestion, the data which is necessary to make power/idle
> state decisions is the last idle state's residency time. IIUC this data
> is the duration from idle to wakeup, which may be due to a reschedule irq
> or another irq.

That's part of the picture, but not complete.

> I also tested that the reschedule irq overlaps by more than 90% (tracing the
> need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for
> one minute.
> 
> given that overlap, I think I can feed the last idle state's residency time
> into the decisions about probabilistic polling, as @dev->last_residency does.
> it is much easier to get that data.

That's only true for your particular use case.

> 
> 2. do a HV specific idle driver (function)
> 
> so far, power management is not exposed to the guest.. idle is simple for a
> KVM guest,
> calling "sti" / "hlt" (cpuidle_idle_call() --> default_idle_call())..
> thanks to the Xen guys, who have implemented the paravirt framework. I can
> implement it
> as easily as the following:
> 
>               --- a/arch/x86/kernel/kvm.c

Your email client is using a very strange formatting. 

>               +++ b/arch/x86/kernel/kvm.c
>               @@ -465,6 +465,12 @@ static void __init kvm_apf_trap_init(void)
>                       update_intr_gate(X86_TRAP_PF, async_page_fault);
>                }
> 
>               +static __cpuidle void kvm_safe_halt(void)
>               +{
>               +        /* 1. POLL, if need_resched() --> return */
>               +
>               +        asm volatile("sti; hlt": : :"memory"); /* 2. halt */
>               +
>               +        /* 3. get the last idle state's residency time */
>               +
>               +        /* 4. update poll duration based on last idle state's
> residency time */
>               +}
>               +
>                void __init kvm_guest_init(void)
>                {
>                       int i;
>               @@ -490,6 +496,8 @@ void __init kvm_guest_init(void)
>                       if (kvmclock_vsyscall)
>                               kvm_setup_vsyscall_timeinfo();
> 
>               +       pv_irq_ops.safe_halt = kvm_safe_halt;
>               +
>                #ifdef CONFIG_SMP
> 
> 
> then there is no need for me to introduce a new pvops, and I never need to
> modify schedule/idle/nohz code again.
> also I can narrow all of the code down to arch/x86/kernel/kvm.c.
> 
> If this is in the right direction, I will send a new patch set next week..

This is definitely better than what you proposed so far, and implementing it
as a proof of concept seems to be worthwhile.

But I doubt that this is the final solution. It's not generic and not
necessarily suitable for all use case scenarios.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
  2017-11-17 11:36             ` Thomas Gleixner
@ 2017-11-17 12:21               ` Quan Xu
  0 siblings, 0 replies; 32+ messages in thread
From: Quan Xu @ 2017-11-17 12:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Quan Xu, kvm, linux-doc, linux-fsdevel, LKML,
	virtualization, x86, xen-devel, Yang Zhang, Ingo Molnar,
	H. Peter Anvin, Borislav Petkov, Kyle Huey, Len Brown,
	Andy Lutomirski, Tom Lendacky, Tobias Klauser, Daniel Lezcano



On 2017-11-17 19:36, Thomas Gleixner wrote:
> On Fri, 17 Nov 2017, Quan Xu wrote:
>> On 2017-11-16 17:53, Thomas Gleixner wrote:
>>> That's just plain wrong. We don't want to see any of this PARAVIRT crap in
>>> anything outside the architecture/hypervisor interfacing code which really
>>> needs it.
>>>
>>> The problem can and must be solved at the generic level in the first place
>>> to gather the data which can be used to make such decisions.
>>>
>>> How that information is used might be either completely generic or requires
>>> system specific variants. But as long as we don't have any information at
>>> all we cannot discuss that.
>>>
>>> Please sit down and write up which data needs to be considered to make
>>> decisions about probabilistic polling. Then we need to compare and contrast
>>> that with the data which is necessary to make power/idle state decisions.
>>>
>>> I would be very surprised if this data would not overlap by at least 90%.
>>>
>> 1. which data needs to be considered to make decisions about probabilistic polling
>>
>> I really need to write up which data needs to be considered to make
>> decisions about probabilistic polling. Over the last several months,
>> I always focused on the data _from idle to reschedule_, then to bypass
>> the idle loops. unfortunately, this makes me touch scheduler/idle/nohz
>> code inevitably.
>>
>> with tglx's suggestion, the data which is necessary to make power/idle
>> state decisions is the last idle state's residency time. IIUC this data
>> is the duration from idle to wakeup, which may be due to a reschedule irq
>> or another irq.
> That's part of the picture, but not complete.

tglx, could you share more? I am very curious about it..

>> I also tested that the reschedule irq overlaps by more than 90% (tracing the
>> need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for
>> one minute.
>>
>> given that overlap, I think I can feed the last idle state's residency time
>> into the decisions about probabilistic polling, as @dev->last_residency does.
>> it is much easier to get that data.
> That's only true for your particular use case.
>
>> 2. do a HV specific idle driver (function)
>>
>> so far, power management is not exposed to the guest.. idle is simple for a
>> KVM guest,
>> calling "sti" / "hlt" (cpuidle_idle_call() --> default_idle_call())..
>> thanks to the Xen guys, who have implemented the paravirt framework. I can
>> implement it
>> as easily as the following:
>>
>>               --- a/arch/x86/kernel/kvm.c
> Your email client is using a very strange formatting.

my bad, I inserted spaces to highlight the code.

> This is definitely better than what you proposed so far, and implementing it
> as a proof of concept seems to be worthwhile.
>
> But I doubt that this is the final solution. It's not generic and not
> necessarily suitable for all use case scenarios.
>
>
yes, I am exhausted :):)


could you tell me what the gap is for it to be generic and suitable for
all use case scenarios? is it the lack of irq/idle predictors?

I really want to upstream it for all public cloud users/providers..

as the kvm host has a similar one, is it possible to upstream it with the
following conditions? :
    1). add a QEMU configuration to enable or disable it, disabled by
default.
    2). add some "TODO" comments near the code.
    3). ...


anyway, thanks for your help..

Quan
Alibaba Cloud

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
  2017-11-16  9:45         ` Daniel Lezcano
@ 2017-11-20  7:05           ` Quan Xu
  2017-11-20 18:01             ` Daniel Lezcano
  0 siblings, 1 reply; 32+ messages in thread
From: Quan Xu @ 2017-11-20  7:05 UTC (permalink / raw)
  To: Daniel Lezcano, Thomas Gleixner, Peter Zijlstra
  Cc: Quan Xu, kvm, linux-doc, linux-fsdevel, LKML, virtualization,
	x86, xen-devel, Yang Zhang, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov, Kyle Huey, Len Brown, Andy Lutomirski,
	Tom Lendacky, Tobias Klauser



On 2017-11-16 17:45, Daniel Lezcano wrote:
> On 16/11/2017 10:12, Quan Xu wrote:
>>
>> On 2017-11-16 06:03, Thomas Gleixner wrote:
>>> On Wed, 15 Nov 2017, Peter Zijlstra wrote:
>>>
>>>> On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote:
>>>>> From: Yang Zhang <yang.zhang.wz@gmail.com>
>>>>>
>>>>> Implement a generic idle poll which resembles the functionality
>>>>> found in arch/. Provide weak arch_cpu_idle_poll function which
>>>>> can be overridden by the architecture code if needed.
>>>> No, we want less of those magic hooks, not more.
>>>>
>>>>> Interrupts may arrive in idle loops without causing a reschedule.
>>>>> In a KVM guest, this costs several VM-exit/VM-entry cycles: a VM-entry
>>>>> for the interrupt and a VM-exit immediately afterwards. This also
>>>>> becomes more expensive than on bare metal. Add a generic idle poll
>>>>> before entering the real idle path. When a reschedule event is pending,
>>>>> we can bypass the real idle path.
>>>> Why not do a HV specific idle driver?
>>> If I understand the problem correctly then he wants to avoid the heavy
>>> lifting in tick_nohz_idle_enter() in the first place, but there is
>>> already
>>> an interesting quirk there which makes it exit early.  See commit
>>> 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this
>>> commit
>>> looks similar. But lets not proliferate that. I'd rather see that go
>>> away.
>> agreed.
>>
>> Even though we can get more benefit in a kvm guest than commit 3c5d92a0cfb5
>> ("nohz: Introduce arch_needs_cpu") did, I won't proliferate that..
>>
>>> But the irq_timings stuff is heading into the same direction, with a more
>>> complex prediction logic which should tell you pretty good how long that
>>> idle period is going to be and in case of an interrupt heavy workload
>>> this
>>> would skip the extra work of stopping and restarting the tick and
>>> provide a
>>> very good input into a polling decision.
>>
>> interesting. I have tested with IRQ_TIMINGS related code, which seems
>> not working so far.
> I don't know how you tested it, can you elaborate what you meant by
> "seems not working so far" ?

Daniel, I tried to enable IRQ_TIMINGS* manually and used
irq_timings_next_event()
to return an estimation of the earliest interrupt. However I got a constant.

> There are still some work to do to be more efficient. The prediction
> based on the irq timings is all right if the interrupts have a simple
> periodicity. But as soon as there is a pattern, the current code can't
> handle it properly and does bad predictions.
>
> I'm working on a self-learning pattern detection which is too heavy for
> the kernel, and with it we should be able to detect properly the
> patterns and re-ajust the period if it changes. I'm in the process of
> making it suitable for kernel code (both math and perf).
>
> One improvement which can be done right now and which can help you is
> the interrupts rate on the CPU. It is possible to compute it and that
> will give an accurate information for the polling decision.
>
>
As tglx said, talk to each other / work together to make it usable for 
all use cases.
could you share how to enable it to get the interrupts rate on the CPU? 
I can try it
in cloud scenario. of course, I'd like to work with you to improve it.

Quan
Alibaba Cloud

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Xen-devel] [PATCH RFC v3 0/6] x86/idle: add halt poll support
  2017-11-15 21:31 ` [Xen-devel] [PATCH RFC v3 0/6] x86/idle: add halt poll support Konrad Rzeszutek Wilk
@ 2017-11-20  7:18   ` Quan Xu
  0 siblings, 0 replies; 32+ messages in thread
From: Quan Xu @ 2017-11-20  7:18 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Quan Xu
  Cc: kvm, linux-doc, linux-fsdevel, linux-kernel, virtualization, x86,
	xen-devel, Yang Zhang



On 2017-11-16 05:31, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 13, 2017 at 06:05:59PM +0800, Quan Xu wrote:
>> From: Yang Zhang <yang.zhang.wz@gmail.com>
>>
>> Some latency-intensive workload have seen obviously performance
>> drop when running inside VM. The main reason is that the overhead
>> is amplified when running inside VM. The most cost I have seen is
>> inside idle path.
> Meaning a VMEXIT b/c it is a 'halt' operation ? And then going
> back in guest (VMRESUME) takes time. And hence your latency gets
> all whacked b/c of this?
    Konrad, I can't follow 'b/c' here.. sorry.

> So if I understand - you want to use your _full_ timeslice (of the guest)
> without ever (or as rarely as possible) going into the hypervisor?
     as much as possible.

> Which means in effect you don't care about power-saving or CPUfreq
> savings, you just want to eat the full CPU for snack?
   actually, we care about power-saving. The poll duration is
self-tuning, otherwise it would be almost the same as
   'halt=poll'. Also we always report the CPU usage along with the benchmark
results of netperf/ctxsw. We got much better
   performance with only a limited increase in CPU usage.


>> This patch introduces a new mechanism to poll for a while before
>> entering idle state. If schedule is needed during poll, then we
>> don't need to goes through the heavy overhead path.
> Schedule of what? The guest or the host?
   a reschedule by the guest scheduler..
   it is the guest.


Quan
Alibaba Cloud
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
  2017-11-20  7:05           ` Quan Xu
@ 2017-11-20 18:01             ` Daniel Lezcano
  0 siblings, 0 replies; 32+ messages in thread
From: Daniel Lezcano @ 2017-11-20 18:01 UTC (permalink / raw)
  To: Quan Xu, Thomas Gleixner, Peter Zijlstra
  Cc: Quan Xu, kvm, linux-doc, linux-fsdevel, LKML, virtualization,
	x86, xen-devel, Yang Zhang, Ingo Molnar, H. Peter Anvin,
	Borislav Petkov, Kyle Huey, Len Brown, Andy Lutomirski,
	Tom Lendacky, Tobias Klauser

On 20/11/2017 08:05, Quan Xu wrote:

[ ... ]

>>>> But the irq_timings stuff is heading into the same direction, with a
>>>> more
>>>> complex prediction logic which should tell you pretty good how long
>>>> that
>>>> idle period is going to be and in case of an interrupt heavy workload
>>>> this
>>>> would skip the extra work of stopping and restarting the tick and
>>>> provide a
>>>> very good input into a polling decision.
>>>
>>> interesting. I have tested with IRQ_TIMINGS related code, which seems
>>> not working so far.
>> I don't know how you tested it, can you elaborate what you meant by
>> "seems not working so far" ?
> 
> Daniel, I tried to enable IRQ_TIMINGS* manually and used
> irq_timings_next_event()
> to return an estimation of the earliest interrupt. However I got a constant.

The irq timings gives you an indication of the next interrupt deadline.

This information is a piece of the puzzle, you need to combine it with
the next timer expiration, and the next scheduling event. Then take the
earliest event in a timeline basis.
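
Conceptually something like this (pseudo-C; next_timer_event() and
next_sched_event() are made-up helpers, irq_timings_next_event() is the
prediction you already tried):

static bool idle_poll_worthwhile(u64 threshold_ns)
{
	u64 now        = local_clock();
	u64 next_irq   = irq_timings_next_event(now);	/* irq timings prediction */
	u64 next_timer = next_timer_event(now);		/* made-up helper */
	u64 next_sched = next_sched_event(now);		/* made-up helper */
	u64 next       = min3(next_irq, next_timer, next_sched);

	/* poll only if the earliest expected event is close enough */
	return next - now < threshold_ns;
}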

Using the trivial scheme above will work well with workload like videos
or mp3 but will fail as soon as the interrupts are not coming in a
regular basis and this is where the pattern recognition algorithm must act.

>> There is still some work to do to be more efficient. The prediction
>> based on the irq timings is all right if the interrupts have a simple
>> periodicity. But as soon as there is a pattern, the current code can't
>> handle it properly and does bad predictions.
>>
>> I'm working on a self-learning pattern detection which is too heavy for
>> the kernel, and with it we should be able to properly detect the
>> patterns and re-adjust the period if it changes. I'm in the process of
>> making it suitable for kernel code (both math and perf).
>>
>> One improvement which can be done right now and which can help you is
>> the interrupt rate on the CPU. It is possible to compute it, and that
>> will give accurate information for the polling decision.
>>
>>
> As tglx said, talk to each other / work together to make it usable for
> all use cases.
> could you share how to enable it to get the interrupts rate on the CPU?
> I can try it
> in cloud scenario. of course, I'd like to work with you to improve it.

Sure, I will be glad if we can collaborate. I have some draft code but
before sharing it I would like we define what is the rate and what kind
of information we expect to infer from it. From my point of view it is a
value indicating the interrupt period per CPU, a short value indicates a
high number of interrupts on the CPU.

This value must decay with the time, the question here is what decay
function we apply to the rate from the last timestamp ?
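
For instance, one simple option (just a sketch, the constants are
arbitrary) is an EWMA of the inter-interrupt period: long gaps pull the
estimate up, so the rate implicitly decays when interrupts stop arriving.

struct irq_rate {
	u64 last_ts;	/* ns, timestamp of the last interrupt on this CPU */
	u64 period;	/* ns, decayed mean inter-interrupt period */
};

static void irq_rate_update(struct irq_rate *r, u64 now)
{
	u64 delta = now - r->last_ts;

	/* new period = 7/8 * old period + 1/8 * last interval */
	r->period  = (7 * r->period + delta) >> 3;
	r->last_ts = now;
}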




-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2017-11-20 18:01 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-13 10:05 [PATCH RFC v3 0/6] x86/idle: add halt poll support Quan Xu
2017-11-13 10:06 ` [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops Quan Xu
2017-11-13 10:53   ` Juergen Gross
2017-11-13 11:09     ` Wanpeng Li
2017-11-14  7:02     ` Quan Xu
2017-11-14  7:12       ` Wanpeng Li
2017-11-14  8:15         ` Quan Xu
2017-11-14  8:22           ` Wanpeng Li
2017-11-14 10:23             ` Quan Xu
2017-11-14  7:30       ` Juergen Gross
2017-11-14  9:38         ` Quan Xu
2017-11-14 10:27           ` Juergen Gross
2017-11-14 11:43             ` Quan Xu
2017-11-14 11:58               ` Juergen Gross
2017-11-13 10:06 ` [PATCH RFC v3 2/6] KVM guest: register kvm_idle_poll for pv_idle_ops Quan Xu
2017-11-13 10:06 ` [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path Quan Xu
2017-11-15 12:11   ` Peter Zijlstra
2017-11-15 22:03     ` Thomas Gleixner
2017-11-16  8:45       ` Peter Zijlstra
2017-11-16  8:58         ` Thomas Gleixner
2017-11-16  9:29         ` Quan Xu
2017-11-16  9:47           ` Thomas Gleixner
2017-11-16  9:12       ` Quan Xu
2017-11-16  9:45         ` Daniel Lezcano
2017-11-20  7:05           ` Quan Xu
2017-11-20 18:01             ` Daniel Lezcano
2017-11-16  9:53         ` Thomas Gleixner
2017-11-17 11:23           ` Quan Xu
2017-11-17 11:36             ` Thomas Gleixner
2017-11-17 12:21               ` Quan Xu
2017-11-15 21:31 ` [Xen-devel] [PATCH RFC v3 0/6] x86/idle: add halt poll support Konrad Rzeszutek Wilk
2017-11-20  7:18   ` Quan Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).