linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] x86/idle: add halt poll support
@ 2017-06-22 11:22 root
  2017-06-22 11:22 ` [PATCH 1/2] x86/idle: add halt poll for halt idle root
                   ` (3 more replies)
  0 siblings, 4 replies; 37+ messages in thread
From: root @ 2017-06-22 11:22 UTC (permalink / raw)
  To: tglx, mingo, hpa, pbonzini
  Cc: x86, corbet, tony.luck, bp, peterz, mchehab, akpm, krzk,
	jpoimboe, luto, borntraeger, thgarnie, rgerst, minipli,
	douly.fnst, nicstange, fweisbec, dvlasenk, bristot,
	yamada.masahiro, mika.westerberg, yu.c.chen, aaron.lu, rostedt,
	me, len.brown, prarit, hidehiro.kawai.ez, fengtiantian, pmladek,
	jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, matt, mgorman, linux-kernel, linux-doc,
	linux-edac, kvm, Yang Zhang

From: Yang Zhang <yang.zhang.wz@gmail.com>

Some latency-intensive workload will see obviously performance
drop when running inside VM. The main reason is that the overhead
is amplified when running inside VM. The most cost i have seen is
inside idle path.
This patch introduces a new mechanism to poll for a while before
entering idle state. If schedule is needed during poll, then we 
don't need to goes through the heavy overhead path.

Here is the data i get when running benchmark contextswitch
(https://github.com/tsuna/contextswitch)
before patch:
2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
after patch:
2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)


Yang Zhang (2):
  x86/idle: add halt poll for halt idle
  x86/idle: use dynamic halt poll

 Documentation/sysctl/kernel.txt          | 24 ++++++++++
 arch/x86/include/asm/processor.h         |  6 +++
 arch/x86/kernel/apic/apic.c              |  6 +++
 arch/x86/kernel/apic/vector.c            |  1 +
 arch/x86/kernel/cpu/mcheck/mce_amd.c     |  2 +
 arch/x86/kernel/cpu/mcheck/therm_throt.c |  2 +
 arch/x86/kernel/cpu/mcheck/threshold.c   |  2 +
 arch/x86/kernel/irq.c                    |  5 ++
 arch/x86/kernel/irq_work.c               |  2 +
 arch/x86/kernel/process.c                | 80 ++++++++++++++++++++++++++++++++
 arch/x86/kernel/smp.c                    |  6 +++
 include/linux/kernel.h                   |  5 ++
 kernel/sched/idle.c                      |  3 ++
 kernel/sysctl.c                          | 23 +++++++++
 14 files changed, 167 insertions(+)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/2] x86/idle: add halt poll for halt idle
  2017-06-22 11:22 [PATCH 0/2] x86/idle: add halt poll support root
@ 2017-06-22 11:22 ` root
  2017-06-22 14:23   ` Thomas Gleixner
  2017-08-16  4:04   ` Michael S. Tsirkin
  2017-06-22 11:22 ` [PATCH 2/2] x86/idle: use dynamic halt poll root
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 37+ messages in thread
From: root @ 2017-06-22 11:22 UTC (permalink / raw)
  To: tglx, mingo, hpa, pbonzini
  Cc: x86, corbet, tony.luck, bp, peterz, mchehab, akpm, krzk,
	jpoimboe, luto, borntraeger, thgarnie, rgerst, minipli,
	douly.fnst, nicstange, fweisbec, dvlasenk, bristot,
	yamada.masahiro, mika.westerberg, yu.c.chen, aaron.lu, rostedt,
	me, len.brown, prarit, hidehiro.kawai.ez, fengtiantian, pmladek,
	jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, matt, mgorman, linux-kernel, linux-doc,
	linux-edac, kvm, Yang Zhang

From: Yang Zhang <yang.zhang.wz@gmail.com>

This patch introduce a new mechanism to poll for a while before
entering idle state.

David has a topic in KVM forum to describe the problem on current KVM VM
when running some message passing workload in KVM forum. Also, there
are some work to improve the performance in KVM, like halt polling in KVM.
But we still has 4 MSR wirtes and HLT vmexit when going into halt idle
which introduce lot of latency.

Halt polling in KVM provide the capbility to not schedule out VCPU when
it is the only task in this pCPU. Unlike it, this patch will let VCPU polls
for a while if there is no work inside VCPU to elimiate heavy vmexit during
in/out idle. The potential impact is it will cost more CPU cycle since we
are doing polling and may impact other task which waiting on the same
physical CPU in host.

Here is the data i get when running benchmark contextswitch
(https://github.com/tsuna/contextswitch)

before patch:
2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)

after patch:
2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)

Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
---
 Documentation/sysctl/kernel.txt | 10 ++++++++++
 arch/x86/kernel/process.c       | 21 +++++++++++++++++++++
 include/linux/kernel.h          |  3 +++
 kernel/sched/idle.c             |  3 +++
 kernel/sysctl.c                 |  9 +++++++++
 5 files changed, 46 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index bac23c1..4e71bfe 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -63,6 +63,7 @@ show up in /proc/sys/kernel:
 - perf_event_max_stack
 - perf_event_max_contexts_per_stack
 - pid_max
+- poll_threshold_ns        [ X86 only ]
 - powersave-nap               [ PPC only ]
 - printk
 - printk_delay
@@ -702,6 +703,15 @@ kernel tries to allocate a number starting from this one.
 
 ==============================================================
 
+poll_threshold_ns: (X86 only)
+
+This parameter used to control the max wait time to poll before going
+into real idle state. By default, the values is 0 means don't poll.
+It is recommended to change the value to non-zero if running latency-bound
+workloads in VM.
+
+==============================================================
+
 powersave-nap: (PPC only)
 
 If set, Linux-PPC will use the 'nap' mode of powersaving,
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 0bb8842..6361783 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -39,6 +39,10 @@
 #include <asm/desc.h>
 #include <asm/prctl.h>
 
+#ifdef CONFIG_HYPERVISOR_GUEST
+unsigned long poll_threshold_ns;
+#endif
+
 /*
  * per-CPU TSS segments. Threads are completely 'soft' on Linux,
  * no more per-task TSS's. The TSS size is kept cacheline-aligned
@@ -313,6 +317,23 @@ static inline void play_dead(void)
 }
 #endif
 
+#ifdef CONFIG_HYPERVISOR_GUEST
+void arch_cpu_idle_poll(void)
+{
+	ktime_t start, cur, stop;
+
+	if (poll_threshold_ns) {
+		start = cur = ktime_get();
+		stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
+		do {
+			if (need_resched())
+				break;
+			cur = ktime_get();
+		} while (ktime_before(cur, stop));
+	}
+}
+#endif
+
 void arch_cpu_idle_enter(void)
 {
 	tsc_verify_tsc_adjust(false);
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 13bc08a..04cf774 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -460,6 +460,9 @@ extern __scanf(2, 0)
 extern int sysctl_panic_on_stackoverflow;
 
 extern bool crash_kexec_post_notifiers;
+#ifdef CONFIG_HYPERVISOR_GUEST
+extern unsigned long poll_threshold_ns;
+#endif
 
 /*
  * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 2a25a9e..e789f99 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 }
 
 /* Weak implementations for optional arch specific functions */
+void __weak arch_cpu_idle_poll(void) { }
 void __weak arch_cpu_idle_prepare(void) { }
 void __weak arch_cpu_idle_enter(void) { }
 void __weak arch_cpu_idle_exit(void) { }
@@ -219,6 +220,8 @@ static void do_idle(void)
 	 */
 
 	__current_set_polling();
+	arch_cpu_idle_poll();
+
 	tick_nohz_idle_enter();
 
 	while (!need_resched()) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4dfba1a..9174d57 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1203,6 +1203,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_HYPERVISOR_GUEST
+	{
+		.procname	= "halt_poll_threshold",
+		.data		= &poll_threshold_ns,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 	{ }
 };
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-06-22 11:22 [PATCH 0/2] x86/idle: add halt poll support root
  2017-06-22 11:22 ` [PATCH 1/2] x86/idle: add halt poll for halt idle root
@ 2017-06-22 11:22 ` root
  2017-06-22 11:51   ` Paolo Bonzini
                     ` (2 more replies)
  2017-06-22 11:32 ` [PATCH 0/2] x86/idle: add halt poll support Yang Zhang
  2017-06-22 11:50 ` Wanpeng Li
  3 siblings, 3 replies; 37+ messages in thread
From: root @ 2017-06-22 11:22 UTC (permalink / raw)
  To: tglx, mingo, hpa, pbonzini
  Cc: x86, corbet, tony.luck, bp, peterz, mchehab, akpm, krzk,
	jpoimboe, luto, borntraeger, thgarnie, rgerst, minipli,
	douly.fnst, nicstange, fweisbec, dvlasenk, bristot,
	yamada.masahiro, mika.westerberg, yu.c.chen, aaron.lu, rostedt,
	me, len.brown, prarit, hidehiro.kawai.ez, fengtiantian, pmladek,
	jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, matt, mgorman, linux-kernel, linux-doc,
	linux-edac, kvm, Yang Zhang

From: Yang Zhang <yang.zhang.wz@gmail.com>

use dynamic poll to reduce the cost when the event is not occurred during
poll. The idea is similar to current dynamic halt poll inside KVM:
Before entering idle, we will record the time. After wakeup from idle
(nomally, this is in interrupt handler), we will record the time too.
Then we will check whether we need to grow/shrink the poll time depands
on how long the CPU stay inside idle state.

There are two new sysctl to change poll time dynamically:
poll_shrink, poll_grow

Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
---
 Documentation/sysctl/kernel.txt          | 14 ++++++++
 arch/x86/include/asm/processor.h         |  6 ++++
 arch/x86/kernel/apic/apic.c              |  6 ++++
 arch/x86/kernel/apic/vector.c            |  1 +
 arch/x86/kernel/cpu/mcheck/mce_amd.c     |  2 ++
 arch/x86/kernel/cpu/mcheck/therm_throt.c |  2 ++
 arch/x86/kernel/cpu/mcheck/threshold.c   |  2 ++
 arch/x86/kernel/irq.c                    |  5 +++
 arch/x86/kernel/irq_work.c               |  2 ++
 arch/x86/kernel/process.c                | 59 ++++++++++++++++++++++++++++++++
 arch/x86/kernel/smp.c                    |  6 ++++
 include/linux/kernel.h                   |  2 ++
 kernel/sysctl.c                          | 14 ++++++++
 13 files changed, 121 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 4e71bfe..76043b4 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -63,6 +63,8 @@ show up in /proc/sys/kernel:
 - perf_event_max_stack
 - perf_event_max_contexts_per_stack
 - pid_max
+- poll_grow                [ X86 only ]
+- poll_shrink              [ X86 only ]
 - poll_threshold_ns        [ X86 only ]
 - powersave-nap               [ PPC only ]
 - printk
@@ -703,6 +705,18 @@ kernel tries to allocate a number starting from this one.
 
 ==============================================================
 
+poll_grow: (X86 only)
+
+This parameter is multiplied in the grow_poll_ns() to increase the poll time.
+By default, the values is 2.
+
+==============================================================
+poll_shrink: (X86 only)
+
+This parameter is divided in the shrink_poll_ns() to reduce the poll time.
+By default, the values is 2.
+
+==============================================================
 poll_threshold_ns: (X86 only)
 
 This parameter used to control the max wait time to poll before going
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3cada99..cf952ed 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -931,4 +931,10 @@ static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
 
 void stop_this_cpu(void *dummy);
 void df_debug(struct pt_regs *regs, long error_code);
+#ifdef CONFIG_HYPERVISOR_GUEST
+extern void check_poll(void);
+#else
+static inline void check_poll(void) {}
+#endif
+
 #endif /* _ASM_X86_PROCESSOR_H */
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 2d75faf..37b16b6 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -962,6 +962,7 @@ __visible void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
 	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
 	entering_ack_irq();
+	check_poll();
 	local_apic_timer_interrupt();
 	exiting_irq();
 
@@ -981,6 +982,7 @@ __visible void __irq_entry smp_trace_apic_timer_interrupt(struct pt_regs *regs)
 	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
 	entering_ack_irq();
+	check_poll();
 	trace_local_timer_entry(LOCAL_TIMER_VECTOR);
 	local_apic_timer_interrupt();
 	trace_local_timer_exit(LOCAL_TIMER_VECTOR);
@@ -1863,6 +1865,7 @@ static void __smp_spurious_interrupt(u8 vector)
 __visible void __irq_entry smp_spurious_interrupt(struct pt_regs *regs)
 {
 	entering_irq();
+	check_poll();
 	__smp_spurious_interrupt(~regs->orig_ax);
 	exiting_irq();
 }
@@ -1872,6 +1875,7 @@ __visible void __irq_entry smp_trace_spurious_interrupt(struct pt_regs *regs)
 	u8 vector = ~regs->orig_ax;
 
 	entering_irq();
+	check_poll();
 	trace_spurious_apic_entry(vector);
 	__smp_spurious_interrupt(vector);
 	trace_spurious_apic_exit(vector);
@@ -1921,6 +1925,7 @@ static void __smp_error_interrupt(struct pt_regs *regs)
 __visible void __irq_entry smp_error_interrupt(struct pt_regs *regs)
 {
 	entering_irq();
+	check_poll();
 	__smp_error_interrupt(regs);
 	exiting_irq();
 }
@@ -1928,6 +1933,7 @@ __visible void __irq_entry smp_error_interrupt(struct pt_regs *regs)
 __visible void __irq_entry smp_trace_error_interrupt(struct pt_regs *regs)
 {
 	entering_irq();
+	check_poll();
 	trace_error_apic_entry(ERROR_APIC_VECTOR);
 	__smp_error_interrupt(regs);
 	trace_error_apic_exit(ERROR_APIC_VECTOR);
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index f3557a1..77fc6ed 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -564,6 +564,7 @@ asmlinkage __visible void __irq_entry smp_irq_move_cleanup_interrupt(void)
 	unsigned vector, me;
 
 	entering_ack_irq();
+	check_poll();
 
 	/* Prevent vectors vanishing under us */
 	raw_spin_lock(&vector_lock);
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 6e4a047..7f984d6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -819,6 +819,7 @@ static inline void __smp_deferred_error_interrupt(void)
 asmlinkage __visible void __irq_entry smp_deferred_error_interrupt(void)
 {
 	entering_irq();
+	check_poll();
 	__smp_deferred_error_interrupt();
 	exiting_ack_irq();
 }
@@ -826,6 +827,7 @@ asmlinkage __visible void __irq_entry smp_deferred_error_interrupt(void)
 asmlinkage __visible void __irq_entry smp_trace_deferred_error_interrupt(void)
 {
 	entering_irq();
+	check_poll();
 	trace_deferred_error_apic_entry(DEFERRED_ERROR_VECTOR);
 	__smp_deferred_error_interrupt();
 	trace_deferred_error_apic_exit(DEFERRED_ERROR_VECTOR);
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index d7cc190..d420b42 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -400,6 +400,7 @@ static inline void __smp_thermal_interrupt(void)
 smp_thermal_interrupt(struct pt_regs *regs)
 {
 	entering_irq();
+	check_poll();
 	__smp_thermal_interrupt();
 	exiting_ack_irq();
 }
@@ -408,6 +409,7 @@ static inline void __smp_thermal_interrupt(void)
 smp_trace_thermal_interrupt(struct pt_regs *regs)
 {
 	entering_irq();
+	check_poll();
 	trace_thermal_apic_entry(THERMAL_APIC_VECTOR);
 	__smp_thermal_interrupt();
 	trace_thermal_apic_exit(THERMAL_APIC_VECTOR);
diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
index bb0e75ee..77858ba 100644
--- a/arch/x86/kernel/cpu/mcheck/threshold.c
+++ b/arch/x86/kernel/cpu/mcheck/threshold.c
@@ -26,6 +26,7 @@ static inline void __smp_threshold_interrupt(void)
 asmlinkage __visible void __irq_entry smp_threshold_interrupt(void)
 {
 	entering_irq();
+	check_poll();
 	__smp_threshold_interrupt();
 	exiting_ack_irq();
 }
@@ -33,6 +34,7 @@ asmlinkage __visible void __irq_entry smp_threshold_interrupt(void)
 asmlinkage __visible void __irq_entry smp_trace_threshold_interrupt(void)
 {
 	entering_irq();
+	check_poll();
 	trace_threshold_apic_entry(THRESHOLD_APIC_VECTOR);
 	__smp_threshold_interrupt();
 	trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index f34fe74..65ff260 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -230,6 +230,7 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 
 	entering_irq();
 
+	check_poll();
 	/* entering_irq() tells RCU that we're not quiescent.  Check it. */
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
 
@@ -269,6 +270,7 @@ __visible void __irq_entry smp_x86_platform_ipi(struct pt_regs *regs)
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
 	entering_ack_irq();
+	check_poll();
 	__smp_x86_platform_ipi();
 	exiting_irq();
 	set_irq_regs(old_regs);
@@ -295,6 +297,7 @@ __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
 	entering_ack_irq();
+	check_poll();
 	inc_irq_stat(kvm_posted_intr_ipis);
 	exiting_irq();
 	set_irq_regs(old_regs);
@@ -308,6 +311,7 @@ __visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs)
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
 	entering_ack_irq();
+	check_poll();
 	inc_irq_stat(kvm_posted_intr_wakeup_ipis);
 	kvm_posted_intr_wakeup_handler();
 	exiting_irq();
@@ -320,6 +324,7 @@ __visible void __irq_entry smp_trace_x86_platform_ipi(struct pt_regs *regs)
 	struct pt_regs *old_regs = set_irq_regs(regs);
 
 	entering_ack_irq();
+	check_poll();
 	trace_x86_platform_ipi_entry(X86_PLATFORM_IPI_VECTOR);
 	__smp_x86_platform_ipi();
 	trace_x86_platform_ipi_exit(X86_PLATFORM_IPI_VECTOR);
diff --git a/arch/x86/kernel/irq_work.c b/arch/x86/kernel/irq_work.c
index 2754878..2c4b6cd 100644
--- a/arch/x86/kernel/irq_work.c
+++ b/arch/x86/kernel/irq_work.c
@@ -20,6 +20,7 @@ static inline void __smp_irq_work_interrupt(void)
 __visible void __irq_entry smp_irq_work_interrupt(struct pt_regs *regs)
 {
 	ipi_entering_ack_irq();
+	check_poll();
 	__smp_irq_work_interrupt();
 	exiting_irq();
 }
@@ -27,6 +28,7 @@ __visible void __irq_entry smp_irq_work_interrupt(struct pt_regs *regs)
 __visible void __irq_entry smp_trace_irq_work_interrupt(struct pt_regs *regs)
 {
 	ipi_entering_ack_irq();
+	check_poll();
 	trace_irq_work_entry(IRQ_WORK_VECTOR);
 	__smp_irq_work_interrupt();
 	trace_irq_work_exit(IRQ_WORK_VECTOR);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 6361783..e5238a8 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -41,6 +41,10 @@
 
 #ifdef CONFIG_HYPERVISOR_GUEST
 unsigned long poll_threshold_ns;
+unsigned int poll_shrink = 2;
+unsigned int poll_grow = 2;
+DEFINE_PER_CPU(unsigned long, poll_begin_ns);
+DEFINE_PER_CPU(unsigned long, poll_ns);
 #endif
 
 /*
@@ -318,6 +322,57 @@ static inline void play_dead(void)
 #endif
 
 #ifdef CONFIG_HYPERVISOR_GUEST
+static unsigned int grow_poll_ns(unsigned int old, unsigned int grow,
+				      unsigned int max)
+{
+	unsigned int val;
+
+	/* 10us as base poll duration */
+	if (old == 0 && grow)
+		return 10000;
+
+	val = old * grow;
+	if (val > max)
+		val = max;
+
+	return val;
+}
+
+static unsigned int shrink_poll_ns(unsigned int old, unsigned int shrink)
+{
+	if (shrink == 0)
+		return 0;
+
+	return old / shrink;
+}
+
+void check_poll(void)
+{
+	unsigned int val, poll_duration;
+	unsigned long begin_ns, now_ns;
+
+	if (!poll_threshold_ns)
+		return;
+
+	begin_ns = this_cpu_read(poll_begin_ns);
+	/* Not from halt state */
+	if (!begin_ns)
+		return;
+
+	now_ns = ktime_to_ns(ktime_get());
+	poll_duration = this_cpu_read(poll_ns);
+
+	if (poll_duration && now_ns - begin_ns > poll_threshold_ns)
+		val = shrink_poll_ns(poll_duration, poll_shrink);
+	else if (poll_duration < poll_threshold_ns &&
+		 now_ns - begin_ns < poll_threshold_ns)
+		val = grow_poll_ns(poll_duration, poll_grow, poll_threshold_ns);
+
+	this_cpu_write(poll_ns, val);
+	this_cpu_write(poll_begin_ns, 0);
+
+}
+
 void arch_cpu_idle_poll(void)
 {
 	ktime_t start, cur, stop;
@@ -359,6 +414,10 @@ void arch_cpu_idle(void)
 void __cpuidle default_idle(void)
 {
 	trace_cpu_idle_rcuidle(1, smp_processor_id());
+#ifdef CONFIG_HYPERVISOR_GUEST
+	if (poll_threshold_ns)
+		this_cpu_write(poll_begin_ns, ktime_to_ns(ktime_get()));
+#endif
 	safe_halt();
 	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
 }
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index d798c0d..81a3961 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -265,6 +265,7 @@ static inline void __smp_reschedule_interrupt(void)
 __visible void __irq_entry smp_reschedule_interrupt(struct pt_regs *regs)
 {
 	ack_APIC_irq();
+	check_poll();
 	__smp_reschedule_interrupt();
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
@@ -280,6 +281,7 @@ __visible void __irq_entry smp_trace_reschedule_interrupt(struct pt_regs *regs)
 	 * to nest.
 	 */
 	ipi_entering_ack_irq();
+	check_poll();
 	trace_reschedule_entry(RESCHEDULE_VECTOR);
 	__smp_reschedule_interrupt();
 	trace_reschedule_exit(RESCHEDULE_VECTOR);
@@ -298,6 +300,7 @@ static inline void __smp_call_function_interrupt(void)
 __visible void __irq_entry smp_call_function_interrupt(struct pt_regs *regs)
 {
 	ipi_entering_ack_irq();
+	check_poll();
 	__smp_call_function_interrupt();
 	exiting_irq();
 }
@@ -306,6 +309,7 @@ __visible void __irq_entry smp_call_function_interrupt(struct pt_regs *regs)
 smp_trace_call_function_interrupt(struct pt_regs *regs)
 {
 	ipi_entering_ack_irq();
+	check_poll();
 	trace_call_function_entry(CALL_FUNCTION_VECTOR);
 	__smp_call_function_interrupt();
 	trace_call_function_exit(CALL_FUNCTION_VECTOR);
@@ -322,6 +326,7 @@ static inline void __smp_call_function_single_interrupt(void)
 smp_call_function_single_interrupt(struct pt_regs *regs)
 {
 	ipi_entering_ack_irq();
+	check_poll();
 	__smp_call_function_single_interrupt();
 	exiting_irq();
 }
@@ -330,6 +335,7 @@ static inline void __smp_call_function_single_interrupt(void)
 smp_trace_call_function_single_interrupt(struct pt_regs *regs)
 {
 	ipi_entering_ack_irq();
+	check_poll();
 	trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR);
 	__smp_call_function_single_interrupt();
 	trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR);
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 04cf774..e901b26 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -462,6 +462,8 @@ extern __scanf(2, 0)
 extern bool crash_kexec_post_notifiers;
 #ifdef CONFIG_HYPERVISOR_GUEST
 extern unsigned long poll_threshold_ns;
+extern unsigned int poll_shrink;
+extern unsigned int poll_grow;
 #endif
 
 /*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9174d57..82776eb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,6 +1211,20 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "halt_poll_grow",
+		.data		= &poll_grow,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "halt_poll_shrink",
+		.data		= &poll_shrink,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #endif
 	{ }
 };
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] x86/idle: add halt poll support
  2017-06-22 11:22 [PATCH 0/2] x86/idle: add halt poll support root
  2017-06-22 11:22 ` [PATCH 1/2] x86/idle: add halt poll for halt idle root
  2017-06-22 11:22 ` [PATCH 2/2] x86/idle: use dynamic halt poll root
@ 2017-06-22 11:32 ` Yang Zhang
  2017-06-22 11:50 ` Wanpeng Li
  3 siblings, 0 replies; 37+ messages in thread
From: Yang Zhang @ 2017-06-22 11:32 UTC (permalink / raw)
  To: tglx, mingo, hpa, pbonzini
  Cc: x86, corbet, tony.luck, bp, peterz, mchehab, akpm, krzk,
	jpoimboe, luto, borntraeger, thgarnie, rgerst, minipli,
	douly.fnst, nicstange, fweisbec, dvlasenk, bristot,
	yamada.masahiro, mika.westerberg, yu.c.chen, aaron.lu, rostedt,
	me, len.brown, prarit, hidehiro.kawai.ez, fengtiantian, pmladek,
	jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, matt, mgorman, linux-kernel, linux-doc,
	linux-edac, kvm

On 2017/6/22 19:22, root wrote:
> From: Yang Zhang <yang.zhang.wz@gmail.com>

Sorry to use wrong username to send patch because i am using a new
machine which don't setup the git config well.

>
> Some latency-intensive workload will see obviously performance
> drop when running inside VM. The main reason is that the overhead
> is amplified when running inside VM. The most cost i have seen is
> inside idle path.
> This patch introduces a new mechanism to poll for a while before
> entering idle state. If schedule is needed during poll, then we
> don't need to goes through the heavy overhead path.
>
> Here is the data i get when running benchmark contextswitch
> (https://github.com/tsuna/contextswitch)
> before patch:
> 2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
> after patch:
> 2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)
>
>
> Yang Zhang (2):
>   x86/idle: add halt poll for halt idle
>   x86/idle: use dynamic halt poll
>
>  Documentation/sysctl/kernel.txt          | 24 ++++++++++
>  arch/x86/include/asm/processor.h         |  6 +++
>  arch/x86/kernel/apic/apic.c              |  6 +++
>  arch/x86/kernel/apic/vector.c            |  1 +
>  arch/x86/kernel/cpu/mcheck/mce_amd.c     |  2 +
>  arch/x86/kernel/cpu/mcheck/therm_throt.c |  2 +
>  arch/x86/kernel/cpu/mcheck/threshold.c   |  2 +
>  arch/x86/kernel/irq.c                    |  5 ++
>  arch/x86/kernel/irq_work.c               |  2 +
>  arch/x86/kernel/process.c                | 80 ++++++++++++++++++++++++++++++++
>  arch/x86/kernel/smp.c                    |  6 +++
>  include/linux/kernel.h                   |  5 ++
>  kernel/sched/idle.c                      |  3 ++
>  kernel/sysctl.c                          | 23 +++++++++
>  14 files changed, 167 insertions(+)
>


-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] x86/idle: add halt poll support
  2017-06-22 11:22 [PATCH 0/2] x86/idle: add halt poll support root
                   ` (2 preceding siblings ...)
  2017-06-22 11:32 ` [PATCH 0/2] x86/idle: add halt poll support Yang Zhang
@ 2017-06-22 11:50 ` Wanpeng Li
  2017-06-23  4:08   ` Yang Zhang
  3 siblings, 1 reply; 37+ messages in thread
From: Wanpeng Li @ 2017-06-22 11:50 UTC (permalink / raw)
  To: root
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paolo Bonzini,
	the arch/x86 maintainers, Jonathan Corbet, tony.luck,
	Borislav Petkov, Peter Zijlstra, mchehab, Andrew Morton, krzk,
	jpoimboe, Andy Lutomirski, Christian Borntraeger, thgarnie,
	rgerst, minipli, douly.fnst, nicstange, Frederic Weisbecker,
	dvlasenk, Daniel Bristot de Oliveira, yamada.masahiro,
	mika.westerberg, Chen Yu, aaron.lu, Steven Rostedt, me,
	Len Brown, Prarit Bhargava, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, Matt Fleming, Mel Gorman, linux-kernel,
	linux-doc, linux-edac, kvm

2017-06-22 19:22 GMT+08:00 root <yang.zhang.wz@gmail.com>:
> From: Yang Zhang <yang.zhang.wz@gmail.com>
>
> Some latency-intensive workload will see obviously performance
> drop when running inside VM. The main reason is that the overhead
> is amplified when running inside VM. The most cost i have seen is
> inside idle path.
> This patch introduces a new mechanism to poll for a while before
> entering idle state. If schedule is needed during poll, then we
> don't need to goes through the heavy overhead path.
>
> Here is the data i get when running benchmark contextswitch
> (https://github.com/tsuna/contextswitch)
> before patch:
> 2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
> after patch:
> 2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)

If you test this after disabling the adaptive halt-polling in kvm?
What's the performance data of w/ this patchset and w/o the adaptive
halt-polling in kvm, and w/o this patchset and w/ the adaptive
halt-polling in kvm? In addition, both linux and windows guests can
get benefit as we have already done this in kvm.

Regards,
Wanpeng Li

> Yang Zhang (2):
>   x86/idle: add halt poll for halt idle
>   x86/idle: use dynamic halt poll
>
>  Documentation/sysctl/kernel.txt          | 24 ++++++++++
>  arch/x86/include/asm/processor.h         |  6 +++
>  arch/x86/kernel/apic/apic.c              |  6 +++
>  arch/x86/kernel/apic/vector.c            |  1 +
>  arch/x86/kernel/cpu/mcheck/mce_amd.c     |  2 +
>  arch/x86/kernel/cpu/mcheck/therm_throt.c |  2 +
>  arch/x86/kernel/cpu/mcheck/threshold.c   |  2 +
>  arch/x86/kernel/irq.c                    |  5 ++
>  arch/x86/kernel/irq_work.c               |  2 +
>  arch/x86/kernel/process.c                | 80 ++++++++++++++++++++++++++++++++
>  arch/x86/kernel/smp.c                    |  6 +++
>  include/linux/kernel.h                   |  5 ++
>  kernel/sched/idle.c                      |  3 ++
>  kernel/sysctl.c                          | 23 +++++++++
>  14 files changed, 167 insertions(+)
>
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-06-22 11:22 ` [PATCH 2/2] x86/idle: use dynamic halt poll root
@ 2017-06-22 11:51   ` Paolo Bonzini
  2017-06-23  3:58     ` Yang Zhang
  2017-06-22 14:32   ` Thomas Gleixner
  2017-06-22 22:46   ` kbuild test robot
  2 siblings, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2017-06-22 11:51 UTC (permalink / raw)
  To: root, tglx, mingo, hpa
  Cc: x86, corbet, tony.luck, bp, peterz, mchehab, akpm, krzk,
	jpoimboe, luto, borntraeger, thgarnie, rgerst, minipli,
	douly.fnst, nicstange, fweisbec, dvlasenk, bristot,
	yamada.masahiro, mika.westerberg, yu.c.chen, aaron.lu, rostedt,
	me, len.brown, prarit, hidehiro.kawai.ez, fengtiantian, pmladek,
	jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, matt, mgorman, linux-kernel, linux-doc,
	linux-edac, kvm



On 22/06/2017 13:22, root wrote:
>  ==============================================================
>  
> +poll_grow: (X86 only)
> +
> +This parameter is multiplied in the grow_poll_ns() to increase the poll time.
> +By default, the values is 2.
> +
> +==============================================================
> +poll_shrink: (X86 only)
> +
> +This parameter is divided in the shrink_poll_ns() to reduce the poll time.
> +By default, the values is 2.

Even before starting the debate on whether this is a good idea or a bad
idea, KVM reduces the polling value to the minimum (10 us) by default
when polling fails.  Also, it shouldn't be bound to
CONFIG_HYPERVISOR_GUEST, since there's nothing specific to virtual
machines here.

Regarding the good/bad idea part, KVM's polling is made much more
acceptable by single_task_running().  At least you need to integrate it
with paravirtualization.  If the VM is scheduled out, you shrink the
polling period.  There is already vcpu_is_preempted for this, it is used
by mutexes.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] x86/idle: add halt poll for halt idle
  2017-06-22 11:22 ` [PATCH 1/2] x86/idle: add halt poll for halt idle root
@ 2017-06-22 14:23   ` Thomas Gleixner
  2017-06-23  4:05     ` Yang Zhang
  2017-08-16  4:04   ` Michael S. Tsirkin
  1 sibling, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2017-06-22 14:23 UTC (permalink / raw)
  To: root
  Cc: mingo, hpa, pbonzini, x86, corbet, tony.luck, bp, peterz,
	mchehab, akpm, krzk, jpoimboe, luto, borntraeger, thgarnie,
	rgerst, minipli, douly.fnst, nicstange, fweisbec, dvlasenk,
	bristot, yamada.masahiro, mika.westerberg, yu.c.chen, aaron.lu,
	rostedt, me, len.brown, prarit, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, matt, mgorman, linux-kernel, linux-doc,
	linux-edac, kvm

On Thu, 22 Jun 2017, root wrote:
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -39,6 +39,10 @@
>  #include <asm/desc.h>
>  #include <asm/prctl.h>
>  
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +unsigned long poll_threshold_ns;
> +#endif
> +
>  /*
>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
>   * no more per-task TSS's. The TSS size is kept cacheline-aligned
> @@ -313,6 +317,23 @@ static inline void play_dead(void)
>  }
>  #endif
>  
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +void arch_cpu_idle_poll(void)
> +{
> +	ktime_t start, cur, stop;
> +
> +	if (poll_threshold_ns) {
> +		start = cur = ktime_get();
> +		stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
> +		do {
> +			if (need_resched())
> +				break;
> +			cur = ktime_get();
> +		} while (ktime_before(cur, stop));
> +	}
> +}
> +#endif

Aside of the whole approach being debatable, what's the reason to make this
depend on CONFIG_HYPERVISOR_GUEST and to move it into x86. If that
mechanism is worthwhile then it should go into the generic code and not
into x86. There is absolutely nothing x86 specific in that patch.

Also the CONFIG_HYPERVISOR_GUEST dependency is silly. Distro kernels ship
with CONFIG_HYPERVISOR_GUEST=y so this also gets into affect on bare metal.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-06-22 11:22 ` [PATCH 2/2] x86/idle: use dynamic halt poll root
  2017-06-22 11:51   ` Paolo Bonzini
@ 2017-06-22 14:32   ` Thomas Gleixner
  2017-06-23  4:04     ` Yang Zhang
  2017-06-22 22:46   ` kbuild test robot
  2 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2017-06-22 14:32 UTC (permalink / raw)
  To: root
  Cc: mingo, hpa, pbonzini, x86, corbet, tony.luck, bp, peterz,
	mchehab, akpm, krzk, jpoimboe, luto, borntraeger, thgarnie,
	rgerst, minipli, douly.fnst, nicstange, fweisbec, dvlasenk,
	bristot, yamada.masahiro, mika.westerberg, yu.c.chen, aaron.lu,
	rostedt, me, len.brown, prarit, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, matt, mgorman, linux-kernel, linux-doc,
	linux-edac, kvm

On Thu, 22 Jun 2017, root wrote:
> @@ -962,6 +962,7 @@ __visible void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
>  	 * interrupt lock, which is the WrongThing (tm) to do.
>  	 */
>  	entering_ack_irq();
> +	check_poll();

No way, that we sprinkle this function into every interrupt hotpath. There
are enough genuine ways to do that w/o touching a gazillion of files.

>  #ifdef CONFIG_HYPERVISOR_GUEST
> +static unsigned int grow_poll_ns(unsigned int old, unsigned int grow,
> +				      unsigned int max)
> +{
> +	unsigned int val;
> +
> +	/* 10us as base poll duration */
> +	if (old == 0 && grow)
> +		return 10000;
> +
> +	val = old * grow;
> +	if (val > max)
> +		val = max;
> +
> +	return val;
> +}
> +
> +static unsigned int shrink_poll_ns(unsigned int old, unsigned int shrink)
> +{
> +	if (shrink == 0)
> +		return 0;
> +
> +	return old / shrink;
> +}
> +
> +void check_poll(void)
> +{
> +	unsigned int val, poll_duration;
> +	unsigned long begin_ns, now_ns;
> +
> +	if (!poll_threshold_ns)
> +		return;

If at all then this needs to be a static key based decision.

> +
> +	begin_ns = this_cpu_read(poll_begin_ns);
> +	/* Not from halt state */
> +	if (!begin_ns)
> +		return;

If you integrate this stuff into the proper place, then the whole mess goes
away. We really do not need another facility to track idle state. We have
enough already, really.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-06-22 11:22 ` [PATCH 2/2] x86/idle: use dynamic halt poll root
  2017-06-22 11:51   ` Paolo Bonzini
  2017-06-22 14:32   ` Thomas Gleixner
@ 2017-06-22 22:46   ` kbuild test robot
  2 siblings, 0 replies; 37+ messages in thread
From: kbuild test robot @ 2017-06-22 22:46 UTC (permalink / raw)
  To: root
  Cc: kbuild-all, tglx, mingo, hpa, pbonzini, x86, corbet, tony.luck,
	bp, peterz, mchehab, akpm, krzk, jpoimboe, luto, borntraeger,
	thgarnie, rgerst, minipli, douly.fnst, nicstange, fweisbec,
	dvlasenk, bristot, yamada.masahiro, mika.westerberg, yu.c.chen,
	aaron.lu, rostedt, me, len.brown, prarit, hidehiro.kawai.ez,
	fengtiantian, pmladek, jeyu, Larry.Finger, zijun_hu, luisbg,
	johannes.berg, niklas.soderlund+renesas, zlpnobody, adobriyan,
	fgao, ebiederm, subashab, arnd, matt, mgorman, linux-kernel,
	linux-doc, linux-edac, kvm, Yang Zhang

[-- Attachment #1: Type: text/plain, Size: 5003 bytes --]

Hi Yang,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.12-rc6]
[cannot apply to tip/x86/core next-20170622]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/root/x86-idle-add-halt-poll-for-halt-idle/20170623-061318
config: i386-randconfig-x016-06222129 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   In file included from arch/x86/include/asm/preempt.h:5:0,
                    from include/linux/preempt.h:80,
                    from include/linux/spinlock.h:50,
                    from include/linux/mmzone.h:7,
                    from include/linux/gfp.h:5,
                    from include/linux/mm.h:9,
                    from arch/x86/kernel/process.c:5:
   arch/x86/kernel/process.c: In function 'check_poll':
>> arch/x86/include/asm/percpu.h:109:3: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized]
      asm(op "l %1,"__percpu_arg(0)  \
      ^~~
   arch/x86/kernel/process.c:351:15: note: 'val' was declared here
     unsigned int val, poll_duration;
                  ^~~
--
   In file included from arch/x86/include/asm/preempt.h:5:0,
                    from include/linux/preempt.h:80,
                    from include/linux/spinlock.h:50,
                    from include/linux/mmzone.h:7,
                    from include/linux/gfp.h:5,
                    from include/linux/mm.h:9,
                    from arch/x86//kernel/process.c:5:
   arch/x86//kernel/process.c: In function 'check_poll':
>> arch/x86/include/asm/percpu.h:109:3: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized]
      asm(op "l %1,"__percpu_arg(0)  \
      ^~~
   arch/x86//kernel/process.c:351:15: note: 'val' was declared here
     unsigned int val, poll_duration;
                  ^~~

vim +/val +109 arch/x86/include/asm/percpu.h

0f5e4816 arch/x86/include/asm/percpu.h Tejun Heo      2009-10-29   93  		pto_T__ pto_tmp__;			\
0f5e4816 arch/x86/include/asm/percpu.h Tejun Heo      2009-10-29   94  		pto_tmp__ = (val);			\
23b764d0 arch/x86/include/asm/percpu.h Andi Kleen     2010-06-10   95  		(void)pto_tmp__;			\
bc9e3be2 include/asm-x86/percpu.h      Joe Perches    2008-03-23   96  	}						\
3334052a include/asm-x86/percpu.h      travis@sgi.com 2008-01-30   97  	switch (sizeof(var)) {				\
3334052a include/asm-x86/percpu.h      travis@sgi.com 2008-01-30   98  	case 1:						\
87b26406 arch/x86/include/asm/percpu.h Brian Gerst    2009-01-19   99  		asm(op "b %1,"__percpu_arg(0)		\
3334052a include/asm-x86/percpu.h      travis@sgi.com 2008-01-30  100  		    : "+m" (var)			\
0f5e4816 arch/x86/include/asm/percpu.h Tejun Heo      2009-10-29  101  		    : "qi" ((pto_T__)(val)));		\
3334052a include/asm-x86/percpu.h      travis@sgi.com 2008-01-30  102  		break;					\
3334052a include/asm-x86/percpu.h      travis@sgi.com 2008-01-30  103  	case 2:						\
87b26406 arch/x86/include/asm/percpu.h Brian Gerst    2009-01-19  104  		asm(op "w %1,"__percpu_arg(0)		\
3334052a include/asm-x86/percpu.h      travis@sgi.com 2008-01-30  105  		    : "+m" (var)			\
0f5e4816 arch/x86/include/asm/percpu.h Tejun Heo      2009-10-29  106  		    : "ri" ((pto_T__)(val)));		\
3334052a include/asm-x86/percpu.h      travis@sgi.com 2008-01-30  107  		break;					\
3334052a include/asm-x86/percpu.h      travis@sgi.com 2008-01-30  108  	case 4:						\
87b26406 arch/x86/include/asm/percpu.h Brian Gerst    2009-01-19 @109  		asm(op "l %1,"__percpu_arg(0)		\
3334052a include/asm-x86/percpu.h      travis@sgi.com 2008-01-30  110  		    : "+m" (var)			\
0f5e4816 arch/x86/include/asm/percpu.h Tejun Heo      2009-10-29  111  		    : "ri" ((pto_T__)(val)));		\
3334052a include/asm-x86/percpu.h      travis@sgi.com 2008-01-30  112  		break;					\
9939ddaf arch/x86/include/asm/percpu.h Tejun Heo      2009-01-13  113  	case 8:						\
87b26406 arch/x86/include/asm/percpu.h Brian Gerst    2009-01-19  114  		asm(op "q %1,"__percpu_arg(0)		\
9939ddaf arch/x86/include/asm/percpu.h Tejun Heo      2009-01-13  115  		    : "+m" (var)			\
0f5e4816 arch/x86/include/asm/percpu.h Tejun Heo      2009-10-29  116  		    : "re" ((pto_T__)(val)));		\
9939ddaf arch/x86/include/asm/percpu.h Tejun Heo      2009-01-13  117  		break;					\

:::::: The code at line 109 was first introduced by commit
:::::: 87b264065880fa696c121dad8498a60524e0f6de x86-64: Use absolute displacements for per-cpu accesses.

:::::: TO: Brian Gerst <brgerst@gmail.com>
:::::: CC: Tejun Heo <tj@kernel.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 23347 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-06-22 11:51   ` Paolo Bonzini
@ 2017-06-23  3:58     ` Yang Zhang
  2017-06-27 11:22       ` Yang Zhang
  0 siblings, 1 reply; 37+ messages in thread
From: Yang Zhang @ 2017-06-23  3:58 UTC (permalink / raw)
  To: Paolo Bonzini, tglx, mingo, hpa
  Cc: x86, corbet, tony.luck, bp, peterz, mchehab, akpm, krzk,
	jpoimboe, luto, borntraeger, thgarnie, rgerst, minipli,
	douly.fnst, nicstange, fweisbec, dvlasenk, bristot,
	yamada.masahiro, mika.westerberg, yu.c.chen, aaron.lu, rostedt,
	me, len.brown, prarit, hidehiro.kawai.ez, fengtiantian, pmladek,
	jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, matt, mgorman, linux-kernel, linux-doc,
	linux-edac, kvm

On 2017/6/22 19:51, Paolo Bonzini wrote:
>
>
> On 22/06/2017 13:22, root wrote:
>>  ==============================================================
>>
>> +poll_grow: (X86 only)
>> +
>> +This parameter is multiplied in the grow_poll_ns() to increase the poll time.
>> +By default, the values is 2.
>> +
>> +==============================================================
>> +poll_shrink: (X86 only)
>> +
>> +This parameter is divided in the shrink_poll_ns() to reduce the poll time.
>> +By default, the values is 2.
>
> Even before starting the debate on whether this is a good idea or a bad
> idea, KVM reduces the polling value to the minimum (10 us) by default

I noticed it. It looks like the logic inside KVM is more reasonable. I 
will do more testing to compare the two.

> when polling fails.  Also, it shouldn't be bound to
> CONFIG_HYPERVISOR_GUEST, since there's nothing specific to virtual
> machines here.

Yes. The original idea to use CONFIG_HYPERVISOR_GUEST because this 
mechanism will only helpful inside VM. But as Thomas mentioned on other 
thread it is wrong to use it since most distribution kernel will set it 
to yes and still affect the bare metal. I will integrate it with 
paravirtualizaion part as you suggested in below.

>
> Regarding the good/bad idea part, KVM's polling is made much more
> acceptable by single_task_running().  At least you need to integrate it
> with paravirtualization.  If the VM is scheduled out, you shrink the
> polling period.  There is already vcpu_is_preempted for this, it is used
> by mutexes.

I have considered single_task_running() before. But since there is no 
such paravirtual interface currently and i am not sure whether it is a 
information leak from host if introducing such interface, so i didn't do 
it. Do you mean vcpu_is_preempted can do the same thing? I check the 
code and seems it only tells whether the VCPU is scheduled out or not 
which cannot satisfy the needs.

>
> Paolo
>


-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-06-22 14:32   ` Thomas Gleixner
@ 2017-06-23  4:04     ` Yang Zhang
  0 siblings, 0 replies; 37+ messages in thread
From: Yang Zhang @ 2017-06-23  4:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, hpa, pbonzini, x86, corbet, tony.luck, bp, peterz,
	mchehab, akpm, krzk, jpoimboe, luto, borntraeger, thgarnie,
	rgerst, minipli, douly.fnst, nicstange, fweisbec, dvlasenk,
	bristot, yamada.masahiro, mika.westerberg, yu.c.chen, aaron.lu,
	rostedt, me, len.brown, prarit, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, matt, mgorman, linux-kernel, linux-doc,
	linux-edac, kvm

On 2017/6/22 22:32, Thomas Gleixner wrote:
> On Thu, 22 Jun 2017, root wrote:
>> @@ -962,6 +962,7 @@ __visible void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
>>  	 * interrupt lock, which is the WrongThing (tm) to do.
>>  	 */
>>  	entering_ack_irq();
>> +	check_poll();
>
> No way, that we sprinkle this function into every interrupt hotpath. There
> are enough genuine ways to do that w/o touching a gazillion of files.

I will find a more correct place to call this function.

>
>>  #ifdef CONFIG_HYPERVISOR_GUEST
>> +static unsigned int grow_poll_ns(unsigned int old, unsigned int grow,
>> +				      unsigned int max)
>> +{
>> +	unsigned int val;
>> +
>> +	/* 10us as base poll duration */
>> +	if (old == 0 && grow)
>> +		return 10000;
>> +
>> +	val = old * grow;
>> +	if (val > max)
>> +		val = max;
>> +
>> +	return val;
>> +}
>> +
>> +static unsigned int shrink_poll_ns(unsigned int old, unsigned int shrink)
>> +{
>> +	if (shrink == 0)
>> +		return 0;
>> +
>> +	return old / shrink;
>> +}
>> +
>> +void check_poll(void)
>> +{
>> +	unsigned int val, poll_duration;
>> +	unsigned long begin_ns, now_ns;
>> +
>> +	if (!poll_threshold_ns)
>> +		return;
>
> If at all then this needs to be a static key based decision.

Sure, will do it.

>
>> +
>> +	begin_ns = this_cpu_read(poll_begin_ns);
>> +	/* Not from halt state */
>> +	if (!begin_ns)
>> +		return;
>
> If you integrate this stuff into the proper place, then the whole mess goes
> away. We really do not need another facility to track idle state. We have
> enough already, really.

Agree, I will check current code to find a more proper way to do the check.

>
> Thanks,
>
> 	tglx
>


-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] x86/idle: add halt poll for halt idle
  2017-06-22 14:23   ` Thomas Gleixner
@ 2017-06-23  4:05     ` Yang Zhang
  0 siblings, 0 replies; 37+ messages in thread
From: Yang Zhang @ 2017-06-23  4:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, hpa, pbonzini, x86, corbet, tony.luck, bp, peterz,
	mchehab, akpm, krzk, jpoimboe, luto, borntraeger, thgarnie,
	rgerst, minipli, douly.fnst, nicstange, fweisbec, dvlasenk,
	bristot, yamada.masahiro, mika.westerberg, yu.c.chen, aaron.lu,
	rostedt, me, len.brown, prarit, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, matt, mgorman, linux-kernel, linux-doc,
	linux-edac, kvm

On 2017/6/22 22:23, Thomas Gleixner wrote:
> On Thu, 22 Jun 2017, root wrote:
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -39,6 +39,10 @@
>>  #include <asm/desc.h>
>>  #include <asm/prctl.h>
>>
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +unsigned long poll_threshold_ns;
>> +#endif
>> +
>>  /*
>>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
>>   * no more per-task TSS's. The TSS size is kept cacheline-aligned
>> @@ -313,6 +317,23 @@ static inline void play_dead(void)
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +void arch_cpu_idle_poll(void)
>> +{
>> +	ktime_t start, cur, stop;
>> +
>> +	if (poll_threshold_ns) {
>> +		start = cur = ktime_get();
>> +		stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
>> +		do {
>> +			if (need_resched())
>> +				break;
>> +			cur = ktime_get();
>> +		} while (ktime_before(cur, stop));
>> +	}
>> +}
>> +#endif
>
> Aside of the whole approach being debatable, what's the reason to make this
> depend on CONFIG_HYPERVISOR_GUEST and to move it into x86. If that
> mechanism is worthwhile then it should go into the generic code and not
> into x86. There is absolutely nothing x86 specific in that patch.
>
> Also the CONFIG_HYPERVISOR_GUEST dependency is silly. Distro kernels ship
> with CONFIG_HYPERVISOR_GUEST=y so this also gets into affect on bare metal.

You are right. As Paolo suggested, i will integrate it with 
paravalization code.

>
> Thanks,
>
> 	tglx
>


-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] x86/idle: add halt poll support
  2017-06-22 11:50 ` Wanpeng Li
@ 2017-06-23  4:08   ` Yang Zhang
  2017-06-23  4:35     ` Wanpeng Li
  0 siblings, 1 reply; 37+ messages in thread
From: Yang Zhang @ 2017-06-23  4:08 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paolo Bonzini,
	the arch/x86 maintainers, Jonathan Corbet, tony.luck,
	Borislav Petkov, Peter Zijlstra, mchehab, Andrew Morton, krzk,
	jpoimboe, Andy Lutomirski, Christian Borntraeger, thgarnie,
	rgerst, minipli, douly.fnst, nicstange, Frederic Weisbecker,
	dvlasenk, Daniel Bristot de Oliveira, yamada.masahiro,
	mika.westerberg, Chen Yu, aaron.lu, Steven Rostedt, me,
	Len Brown, Prarit Bhargava, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, Matt Fleming, Mel Gorman, linux-kernel,
	linux-doc, linux-edac, kvm

On 2017/6/22 19:50, Wanpeng Li wrote:
> 2017-06-22 19:22 GMT+08:00 root <yang.zhang.wz@gmail.com>:
>> From: Yang Zhang <yang.zhang.wz@gmail.com>
>>
>> Some latency-intensive workload will see obviously performance
>> drop when running inside VM. The main reason is that the overhead
>> is amplified when running inside VM. The most cost i have seen is
>> inside idle path.
>> This patch introduces a new mechanism to poll for a while before
>> entering idle state. If schedule is needed during poll, then we
>> don't need to goes through the heavy overhead path.
>>
>> Here is the data i get when running benchmark contextswitch
>> (https://github.com/tsuna/contextswitch)
>> before patch:
>> 2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
>> after patch:
>> 2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)
>
> If you test this after disabling the adaptive halt-polling in kvm?
> What's the performance data of w/ this patchset and w/o the adaptive
> halt-polling in kvm, and w/o this patchset and w/ the adaptive
> halt-polling in kvm? In addition, both linux and windows guests can
> get benefit as we have already done this in kvm.

I will provide more data in next version. But it doesn't conflict with 
current halt polling inside kvm. This is just another enhancement.

>
> Regards,
> Wanpeng Li
>
>> Yang Zhang (2):
>>   x86/idle: add halt poll for halt idle
>>   x86/idle: use dynamic halt poll
>>
>>  Documentation/sysctl/kernel.txt          | 24 ++++++++++
>>  arch/x86/include/asm/processor.h         |  6 +++
>>  arch/x86/kernel/apic/apic.c              |  6 +++
>>  arch/x86/kernel/apic/vector.c            |  1 +
>>  arch/x86/kernel/cpu/mcheck/mce_amd.c     |  2 +
>>  arch/x86/kernel/cpu/mcheck/therm_throt.c |  2 +
>>  arch/x86/kernel/cpu/mcheck/threshold.c   |  2 +
>>  arch/x86/kernel/irq.c                    |  5 ++
>>  arch/x86/kernel/irq_work.c               |  2 +
>>  arch/x86/kernel/process.c                | 80 ++++++++++++++++++++++++++++++++
>>  arch/x86/kernel/smp.c                    |  6 +++
>>  include/linux/kernel.h                   |  5 ++
>>  kernel/sched/idle.c                      |  3 ++
>>  kernel/sysctl.c                          | 23 +++++++++
>>  14 files changed, 167 insertions(+)
>>
>> --
>> 1.8.3.1
>>


-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] x86/idle: add halt poll support
  2017-06-23  4:08   ` Yang Zhang
@ 2017-06-23  4:35     ` Wanpeng Li
  2017-06-23  6:49       ` Yang Zhang
  0 siblings, 1 reply; 37+ messages in thread
From: Wanpeng Li @ 2017-06-23  4:35 UTC (permalink / raw)
  To: Yang Zhang
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paolo Bonzini,
	the arch/x86 maintainers, Jonathan Corbet, tony.luck,
	Borislav Petkov, Peter Zijlstra, mchehab, Andrew Morton, krzk,
	jpoimboe, Andy Lutomirski, Christian Borntraeger, Thomas Garnier,
	Robert Gerst, Mathias Krause, douly.fnst, Nicolai Stange,
	Frederic Weisbecker, dvlasenk, Daniel Bristot de Oliveira,
	yamada.masahiro, mika.westerberg, Chen Yu, aaron.lu,
	Steven Rostedt, Kyle Huey, Len Brown, Prarit Bhargava,
	hidehiro.kawai.ez, fengtiantian, pmladek, jeyu, Larry.Finger,
	zijun_hu, luisbg, johannes.berg, niklas.soderlund+renesas,
	zlpnobody, Alexey Dobriyan, fgao, ebiederm,
	Subash Abhinov Kasiviswanathan, Arnd Bergmann, Matt Fleming,
	Mel Gorman, linux-kernel, linux-doc, linux-edac, kvm

2017-06-23 12:08 GMT+08:00 Yang Zhang <yang.zhang.wz@gmail.com>:
> On 2017/6/22 19:50, Wanpeng Li wrote:
>>
>> 2017-06-22 19:22 GMT+08:00 root <yang.zhang.wz@gmail.com>:
>>>
>>> From: Yang Zhang <yang.zhang.wz@gmail.com>
>>>
>>> Some latency-intensive workload will see obviously performance
>>> drop when running inside VM. The main reason is that the overhead
>>> is amplified when running inside VM. The most cost i have seen is
>>> inside idle path.
>>> This patch introduces a new mechanism to poll for a while before
>>> entering idle state. If schedule is needed during poll, then we
>>> don't need to goes through the heavy overhead path.
>>>
>>> Here is the data i get when running benchmark contextswitch
>>> (https://github.com/tsuna/contextswitch)
>>> before patch:
>>> 2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
>>> after patch:
>>> 2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)
>>
>>
>> If you test this after disabling the adaptive halt-polling in kvm?
>> What's the performance data of w/ this patchset and w/o the adaptive
>> halt-polling in kvm, and w/o this patchset and w/ the adaptive
>> halt-polling in kvm? In addition, both linux and windows guests can
>> get benefit as we have already done this in kvm.
>
>
> I will provide more data in next version. But it doesn't conflict with

Another case I can think of is w/ both this patchset and the adaptive
halt-polling in kvm.

> current halt polling inside kvm. This is just another enhancement.

I haven't looked closely at the patchset, but if the poll fails in the
guest, there may sometimes be another poll on the kvm side on top of
it. In addition, in my testing the adaptive halt-polling in kvm has a
performance penalty when the pCPU is heavily overcommitted, even though
it checks single_task_running(). From inside the guest it is hard to
know accurately whether other tasks are waiting on the pCPU, which
makes things worse. Depending on vcpu_is_preempted() or steal time may
not be accurate or direct enough.

So I'm not sure how much sense adaptive halt-polling in both the guest
and kvm makes. I would prefer to keep adaptive halt-polling only in
kvm (then linux, windows, and other guests all benefit) and avoid
churning the core x86 path.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] x86/idle: add halt poll support
  2017-06-23  4:35     ` Wanpeng Li
@ 2017-06-23  6:49       ` Yang Zhang
  2017-06-27 14:00         ` Radim Krčmář
  0 siblings, 1 reply; 37+ messages in thread
From: Yang Zhang @ 2017-06-23  6:49 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Paolo Bonzini,
	the arch/x86 maintainers, Jonathan Corbet, tony.luck,
	Borislav Petkov, Peter Zijlstra, mchehab, Andrew Morton, krzk,
	jpoimboe, Andy Lutomirski, Christian Borntraeger, Thomas Garnier,
	Robert Gerst, Mathias Krause, douly.fnst, Nicolai Stange,
	Frederic Weisbecker, dvlasenk, Daniel Bristot de Oliveira,
	yamada.masahiro, mika.westerberg, Chen Yu, aaron.lu,
	Steven Rostedt, Kyle Huey, Len Brown, Prarit Bhargava,
	hidehiro.kawai.ez, fengtiantian, pmladek, jeyu, Larry.Finger,
	zijun_hu, luisbg, johannes.berg, niklas.soderlund+renesas,
	zlpnobody, Alexey Dobriyan, fgao, ebiederm,
	Subash Abhinov Kasiviswanathan, Arnd Bergmann, Matt Fleming,
	Mel Gorman, linux-kernel, linux-doc, linux-edac, kvm

On 2017/6/23 12:35, Wanpeng Li wrote:
> 2017-06-23 12:08 GMT+08:00 Yang Zhang <yang.zhang.wz@gmail.com>:
>> On 2017/6/22 19:50, Wanpeng Li wrote:
>>>
>>> 2017-06-22 19:22 GMT+08:00 root <yang.zhang.wz@gmail.com>:
>>>>
>>>> From: Yang Zhang <yang.zhang.wz@gmail.com>
>>>>
>>>> Some latency-intensive workload will see obviously performance
>>>> drop when running inside VM. The main reason is that the overhead
>>>> is amplified when running inside VM. The most cost i have seen is
>>>> inside idle path.
>>>> This patch introduces a new mechanism to poll for a while before
>>>> entering idle state. If schedule is needed during poll, then we
>>>> don't need to goes through the heavy overhead path.
>>>>
>>>> Here is the data i get when running benchmark contextswitch
>>>> (https://github.com/tsuna/contextswitch)
>>>> before patch:
>>>> 2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
>>>> after patch:
>>>> 2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)
>>>
>>>
>>> If you test this after disabling the adaptive halt-polling in kvm?
>>> What's the performance data of w/ this patchset and w/o the adaptive
>>> halt-polling in kvm, and w/o this patchset and w/ the adaptive
>>> halt-polling in kvm? In addition, both linux and windows guests can
>>> get benefit as we have already done this in kvm.
>>
>>
>> I will provide more data in next version. But it doesn't conflict with
>
> Another case I can think of is w/ both this patchset and the adaptive
> halt-polling in kvm.
>
>> current halt polling inside kvm. This is just another enhancement.
>
> I didn't look close to the patchset, however, maybe there is another
> poll in the kvm part again sometimes if you fails the poll in the
> guest. In addition, the adaptive halt-polling in kvm has performance
> penalty when the pCPU is heavily overcommitted though there is a
> single_task_running() in my testing, it is hard to accurately aware
> whether there are other tasks waiting on the pCPU in the guest which
> will make it worser. Depending on vcpu_is_preempted() or steal time
> maybe not accurately or directly.
>
> So I'm not sure how much sense it makes by adaptive halt-polling in
> both guest and kvm. I prefer to just keep adaptive halt-polling in
> kvm(then both linux/windows or other guests can get benefit) and avoid
> to churn the core x86 path.

This mechanism is not specific to KVM. It is a kernel feature that can 
benefit a guest running in any x86 virtualization environment: KVM, 
Xen, VMware, Hyper-V. An administrator can configure KVM to use 
adaptive halt polling, but he cannot control whether users poll inside 
the guest. Lots of users set idle=poll inside the guest to improve 
performance, which occupies even more CPU cycles. This mechanism is an 
enhancement to that, not to KVM halt polling.
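
To make this concrete, here is a minimal sketch of such a guest-side
poll: spin on a per-cpu window before hlt, widen it after a useful
poll and narrow it after a wasted one. poll_grow and poll_shrink
mirror the sysctls documented in patch 2/2; the other names, the
bounds and the exact policy here are illustrative, not the actual
patch:

#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/sched.h>

static unsigned int poll_grow = 2, poll_shrink = 2;	/* the sysctls */
static DEFINE_PER_CPU(unsigned long, poll_ns) = 10000;	/* 10us start */

static void halt_poll_then_idle(void)
{
	unsigned long ns = this_cpu_read(poll_ns);
	u64 start = sched_clock();

	while (sched_clock() - start < ns) {
		if (need_resched()) {
			/* useful poll: widen the window next time */
			this_cpu_write(poll_ns,
				       min(ns * poll_grow, 500000UL));
			return;	/* wakeup found, skip hlt entirely */
		}
		cpu_relax();
	}
	/* wasted poll: narrow the window, then halt as before */
	this_cpu_write(poll_ns, max(ns / poll_shrink, 10000UL));
	default_idle();
}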

>
> Regards,
> Wanpeng Li
>


-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-06-23  3:58     ` Yang Zhang
@ 2017-06-27 11:22       ` Yang Zhang
  2017-06-27 12:07         ` Paolo Bonzini
  0 siblings, 1 reply; 37+ messages in thread
From: Yang Zhang @ 2017-06-27 11:22 UTC (permalink / raw)
  To: Paolo Bonzini, tglx, mingo, hpa
  Cc: x86, corbet, tony.luck, bp, peterz, mchehab, akpm, krzk,
	jpoimboe, luto, borntraeger, thgarnie, rgerst, minipli,
	douly.fnst, nicstange, fweisbec, dvlasenk, bristot,
	yamada.masahiro, mika.westerberg, yu.c.chen, aaron.lu, rostedt,
	me, len.brown, prarit, hidehiro.kawai.ez, fengtiantian, pmladek,
	jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, matt, mgorman, linux-kernel, linux-doc,
	linux-edac, kvm

On 2017/6/23 11:58, Yang Zhang wrote:
> On 2017/6/22 19:51, Paolo Bonzini wrote:
>>
>>
>> On 22/06/2017 13:22, root wrote:
>>>  ==============================================================
>>>
>>> +poll_grow: (X86 only)
>>> +
>>> +This parameter is multiplied in the grow_poll_ns() to increase the
>>> poll time.
>>> +By default, the values is 2.
>>> +
>>> +==============================================================
>>> +poll_shrink: (X86 only)
>>> +
>>> +This parameter is divided in the shrink_poll_ns() to reduce the poll
>>> time.
>>> +By default, the values is 2.
>>
>> Even before starting the debate on whether this is a good idea or a bad
>> idea, KVM reduces the polling value to the minimum (10 us) by default
>
> I noticed it. It looks like the logic inside KVM is more reasonable. I
> will do more testing to compare the two.
>
>> when polling fails.  Also, it shouldn't be bound to
>> CONFIG_HYPERVISOR_GUEST, since there's nothing specific to virtual
>> machines here.
>
> Yes. The original idea to use CONFIG_HYPERVISOR_GUEST because this
> mechanism will only helpful inside VM. But as Thomas mentioned on other
> thread it is wrong to use it since most distribution kernel will set it
> to yes and still affect the bare metal. I will integrate it with
> paravirtualizaion part as you suggested in below.
>
>>
>> Regarding the good/bad idea part, KVM's polling is made much more
>> acceptable by single_task_running().  At least you need to integrate it
>> with paravirtualization.  If the VM is scheduled out, you shrink the
>> polling period.  There is already vcpu_is_preempted for this, it is used
>> by mutexes.
>
> I have considered single_task_running() before. But since there is no
> such paravirtual interface currently and i am not sure whether it is a
> information leak from host if introducing such interface, so i didn't do
> it. Do you mean vcpu_is_preempted can do the same thing? I check the
> code and seems it only tells whether the VCPU is scheduled out or not
> which cannot satisfy the needs.

Hi Paolo

Can you help to answer my confusion? I have double-checked the code but 
still don't get your point. Do you think it is necessary to introduce a 
paravirtual interface to expose single_task_running() to the guest?

-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-06-27 11:22       ` Yang Zhang
@ 2017-06-27 12:07         ` Paolo Bonzini
  2017-06-27 12:23           ` Wanpeng Li
  0 siblings, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2017-06-27 12:07 UTC (permalink / raw)
  To: Yang Zhang, tglx, mingo, hpa
  Cc: x86, corbet, tony.luck, bp, peterz, mchehab, akpm, krzk,
	jpoimboe, luto, borntraeger, thgarnie, rgerst, minipli,
	douly.fnst, nicstange, fweisbec, dvlasenk, bristot,
	yamada.masahiro, mika.westerberg, yu.c.chen, aaron.lu, rostedt,
	me, len.brown, prarit, hidehiro.kawai.ez, fengtiantian, pmladek,
	jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, matt, mgorman, linux-kernel, linux-doc,
	linux-edac, kvm



On 27/06/2017 13:22, Yang Zhang wrote:
>>>
>>> Regarding the good/bad idea part, KVM's polling is made much more
>>> acceptable by single_task_running().  At least you need to integrate it
>>> with paravirtualization.  If the VM is scheduled out, you shrink the
>>> polling period.  There is already vcpu_is_preempted for this, it is used
>>> by mutexes.
>>
>> I have considered single_task_running() before. But since there is no
>> such paravirtual interface currently and i am not sure whether it is a
>> information leak from host if introducing such interface, so i didn't do
>> it. Do you mean vcpu_is_preempted can do the same thing? I check the
>> code and seems it only tells whether the VCPU is scheduled out or not
>> which cannot satisfy the needs.
> 
> Can you help to answer my confusion? I have double checked the code, but
> still not get your point. Do you think it is necessary to introduce an
> paravirtual interface to expose single_task_running() to guest?

I think vcpu_is_preempted is a good enough replacement.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-06-27 12:07         ` Paolo Bonzini
@ 2017-06-27 12:23           ` Wanpeng Li
  2017-06-27 12:28             ` Paolo Bonzini
  0 siblings, 1 reply; 37+ messages in thread
From: Wanpeng Li @ 2017-06-27 12:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Yang Zhang, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	the arch/x86 maintainers, Jonathan Corbet, tony.luck,
	Borislav Petkov, Peter Zijlstra, mchehab, Andrew Morton, krzk,
	jpoimboe, Andy Lutomirski, Christian Borntraeger, Thomas Garnier,
	Robert Gerst, Mathias Krause, douly.fnst, Nicolai Stange,
	Frederic Weisbecker, dvlasenk, Daniel Bristot de Oliveira,
	yamada.masahiro, mika.westerberg, Chen Yu, aaron.lu,
	Steven Rostedt, Kyle Huey, Len Brown, Prarit Bhargava,
	hidehiro.kawai.ez, fengtiantian, pmladek, jeyu, Larry.Finger,
	zijun_hu, luisbg, johannes.berg, niklas.soderlund+renesas,
	zlpnobody, Alexey Dobriyan, fgao, ebiederm,
	Subash Abhinov Kasiviswanathan, Arnd Bergmann, Matt Fleming,
	Mel Gorman, linux-kernel, linux-doc, linux-edac, kvm

2017-06-27 20:07 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 27/06/2017 13:22, Yang Zhang wrote:
>>>>
>>>> Regarding the good/bad idea part, KVM's polling is made much more
>>>> acceptable by single_task_running().  At least you need to integrate it
>>>> with paravirtualization.  If the VM is scheduled out, you shrink the
>>>> polling period.  There is already vcpu_is_preempted for this, it is used
>>>> by mutexes.
>>>
>>> I have considered single_task_running() before. But since there is no
>>> such paravirtual interface currently and i am not sure whether it is a
>>> information leak from host if introducing such interface, so i didn't do
>>> it. Do you mean vcpu_is_preempted can do the same thing? I check the
>>> code and seems it only tells whether the VCPU is scheduled out or not
>>> which cannot satisfy the needs.
>>
>> Can you help to answer my confusion? I have double checked the code, but
>> still not get your point. Do you think it is necessary to introduce an
>> paravirtual interface to expose single_task_running() to guest?
>
> I think vcpu_is_preempted is a good enough replacement.

For example, vcpu->arch.st.steal.preempted is 0 when the vCPU is
scheduled in at vmentry; if several tasks are then enqueued on the same
pCPU, waiting on the cfs red-black tree, the guest should avoid polling
in this scenario, yet vcpu_is_preempted() returns false and the guest
decides to poll.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-06-27 12:23           ` Wanpeng Li
@ 2017-06-27 12:28             ` Paolo Bonzini
  2017-06-27 13:40               ` Radim Krčmář
  0 siblings, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2017-06-27 12:28 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Yang Zhang, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	the arch/x86 maintainers, Jonathan Corbet, tony.luck,
	Borislav Petkov, Peter Zijlstra, mchehab, Andrew Morton, krzk,
	jpoimboe, Andy Lutomirski, Christian Borntraeger, Thomas Garnier,
	Robert Gerst, Mathias Krause, douly.fnst, Nicolai Stange,
	Frederic Weisbecker, dvlasenk, Daniel Bristot de Oliveira,
	yamada.masahiro, mika.westerberg, Chen Yu, aaron.lu,
	Steven Rostedt, Kyle Huey, Len Brown, Prarit Bhargava,
	hidehiro.kawai.ez, fengtiantian, pmladek, jeyu, Larry.Finger,
	zijun_hu, luisbg, johannes.berg, niklas.soderlund+renesas,
	zlpnobody, Alexey Dobriyan, fgao, ebiederm,
	Subash Abhinov Kasiviswanathan, Arnd Bergmann, Matt Fleming,
	Mel Gorman, linux-kernel, linux-doc, linux-edac, kvm



On 27/06/2017 14:23, Wanpeng Li wrote:
>>>> I have considered single_task_running() before. But since there is no
>>>> such paravirtual interface currently and i am not sure whether it is a
>>>> information leak from host if introducing such interface, so i didn't do
>>>> it. Do you mean vcpu_is_preempted can do the same thing? I check the
>>>> code and seems it only tells whether the VCPU is scheduled out or not
>>>> which cannot satisfy the needs.
>>> Can you help to answer my confusion? I have double checked the code, but
>>> still not get your point. Do you think it is necessary to introduce an
>>> paravirtual interface to expose single_task_running() to guest?
>
> I think vcpu_is_preempted is a good enough replacement.
> For example, vcpu->arch.st.steal.preempted is 0 when the vCPU is sched
> in and vmentry, then several tasks are enqueued on the same pCPU and
> waiting on cfs red-black tree, the guest should avoid to poll in this
> scenario, however, vcpu_is_preempted returns false and guest decides
> to poll.

... which is not necessarily _wrong_.  It's just a different heuristic.

In the end, the guest could even run with "idle=poll", and there's
little the host scheduler can do about it, except treat it as a
CPU-bound task.

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-06-27 12:28             ` Paolo Bonzini
@ 2017-06-27 13:40               ` Radim Krčmář
  2017-06-27 13:56                 ` Paolo Bonzini
  0 siblings, 1 reply; 37+ messages in thread
From: Radim Krčmář @ 2017-06-27 13:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Wanpeng Li, Yang Zhang, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers, Jonathan Corbet,
	tony.luck, Borislav Petkov, Peter Zijlstra, mchehab,
	Andrew Morton, krzk, jpoimboe, Andy Lutomirski,
	Christian Borntraeger, Thomas Garnier, Robert Gerst,
	Mathias Krause, douly.fnst, Nicolai Stange, Frederic Weisbecker,
	dvlasenk, Daniel Bristot de Oliveira, yamada.masahiro,
	mika.westerberg, Chen Yu, aaron.lu, Steven Rostedt, Kyle Huey,
	Len Brown, Prarit Bhargava, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, Alexey Dobriyan, fgao,
	ebiederm, Subash Abhinov Kasiviswanathan, Arnd Bergmann,
	Matt Fleming, Mel Gorman, linux-kernel, linux-doc, linux-edac,
	kvm

2017-06-27 14:28+0200, Paolo Bonzini:
> On 27/06/2017 14:23, Wanpeng Li wrote:
>>>>> I have considered single_task_running() before. But since there is no
>>>>> such paravirtual interface currently and i am not sure whether it is a
>>>>> information leak from host if introducing such interface, so i didn't do
>>>>> it. Do you mean vcpu_is_preempted can do the same thing? I check the
>>>>> code and seems it only tells whether the VCPU is scheduled out or not
>>>>> which cannot satisfy the needs.
>>>> Can you help to answer my confusion? I have double checked the code, but
>>>> still not get your point. Do you think it is necessary to introduce an
>>>> paravirtual interface to expose single_task_running() to guest?
>>
>> I think vcpu_is_preempted is a good enough replacement.
>> For example, vcpu->arch.st.steal.preempted is 0 when the vCPU is sched
>> in and vmentry, then several tasks are enqueued on the same pCPU and
>> waiting on cfs red-black tree, the guest should avoid to poll in this
>> scenario, however, vcpu_is_preempted returns false and guest decides
>> to poll.
> 
> ... which is not necessarily _wrong_.  It's just a different heuristic.

Right, it's just harder to use than host's single_task_running() -- the
VCPU calling vcpu_is_preempted() is never preempted, so we have to look
at other VCPUs that are not halted, but still preempted.

If we see some ratio of preempted VCPUs (> 0?), then we stop polling and
yield to the host.  Working under the assumption that there is work for
this PCPU if other VCPUs have stuff to do.  The downside is that it
misses information about host's topology, so it would be hard to make it
work well.
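
In code, the heuristic would be something like the sketch below, with
"not halted" approximated by !idle_cpu() and the ratio simplified to
"any"; other_vcpus_preempted() is a hypothetical helper, not an
existing kernel function:

#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/smp.h>

/* Stop polling as soon as any other runnable VCPU is preempted,
 * assuming the host pCPU then has work besides running us. */
static bool other_vcpus_preempted(void)
{
	int cpu, this_cpu = smp_processor_id();

	for_each_online_cpu(cpu) {
		if (cpu == this_cpu)
			continue;	/* we run, so never "preempted" */
		if (!idle_cpu(cpu) && vcpu_is_preempted(cpu))
			return true;
	}
	return false;
}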

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-06-27 13:40               ` Radim Krčmář
@ 2017-06-27 13:56                 ` Paolo Bonzini
  2017-06-27 14:22                   ` Radim Krčmář
  0 siblings, 1 reply; 37+ messages in thread
From: Paolo Bonzini @ 2017-06-27 13:56 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: Wanpeng Li, Yang Zhang, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers, Jonathan Corbet,
	tony.luck, Borislav Petkov, Peter Zijlstra, mchehab,
	Andrew Morton, krzk, jpoimboe, Andy Lutomirski,
	Christian Borntraeger, Thomas Garnier, Robert Gerst,
	Mathias Krause, douly.fnst, Nicolai Stange, Frederic Weisbecker,
	dvlasenk, Daniel Bristot de Oliveira, yamada.masahiro,
	mika.westerberg, Chen Yu, aaron.lu, Steven Rostedt, Kyle Huey,
	Len Brown, Prarit Bhargava, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, Alexey Dobriyan, fgao,
	ebiederm, Subash Abhinov Kasiviswanathan, Arnd Bergmann,
	Matt Fleming, Mel Gorman, linux-kernel, linux-doc, linux-edac,
	kvm



On 27/06/2017 15:40, Radim Krčmář wrote:
>> ... which is not necessarily _wrong_.  It's just a different heuristic.
> Right, it's just harder to use than host's single_task_running() -- the
> VCPU calling vcpu_is_preempted() is never preempted, so we have to look
> at other VCPUs that are not halted, but still preempted.
> 
> If we see some ratio of preempted VCPUs (> 0?), then we stop polling and
> yield to the host.  Working under the assumption that there is work for
> this PCPU if other VCPUs have stuff to do.  The downside is that it
> misses information about host's topology, so it would be hard to make it
> work well.

I would just use vcpu_is_preempted on the current CPU.  From guest POV
this option is really a "f*** everyone else" setting just like
idle=poll, only a little more polite.

If we've been preempted and we were polling, there are two cases.  If an
interrupt was queued while the guest was preempted, the poll will be
treated as successful anyway.  If it hasn't, let others run---but
really that's not because the guest wants to be polite, it's to avoid
the scheduler penalizing it excessively.

So until it's preempted, I think it's okay if the guest doesn't care
about others.  You wouldn't use this option anyway in overcommitted
situations.

(I'm still not very convinced about the idea).

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] x86/idle: add halt poll support
  2017-06-23  6:49       ` Yang Zhang
@ 2017-06-27 14:00         ` Radim Krčmář
  0 siblings, 0 replies; 37+ messages in thread
From: Radim Krčmář @ 2017-06-27 14:00 UTC (permalink / raw)
  To: Yang Zhang
  Cc: Wanpeng Li, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Paolo Bonzini, the arch/x86 maintainers, Jonathan Corbet,
	tony.luck, Borislav Petkov, Peter Zijlstra, mchehab,
	Andrew Morton, krzk, jpoimboe, Andy Lutomirski,
	Christian Borntraeger, Thomas Garnier, Robert Gerst,
	Mathias Krause, douly.fnst, Nicolai Stange, Frederic Weisbecker,
	dvlasenk, Daniel Bristot de Oliveira, yamada.masahiro,
	mika.westerberg, Chen Yu, aaron.lu, Steven Rostedt, Kyle Huey,
	Len Brown, Prarit Bhargava, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, Alexey Dobriyan, fgao,
	ebiederm, Subash Abhinov Kasiviswanathan, Arnd Bergmann,
	Matt Fleming, Mel Gorman, linux-kernel, linux-doc, linux-edac,
	kvm

2017-06-23 14:49+0800, Yang Zhang:
> On 2017/6/23 12:35, Wanpeng Li wrote:
> > 2017-06-23 12:08 GMT+08:00 Yang Zhang <yang.zhang.wz@gmail.com>:
> > > On 2017/6/22 19:50, Wanpeng Li wrote:
> > > > 
> > > > 2017-06-22 19:22 GMT+08:00 root <yang.zhang.wz@gmail.com>:
> > > > > 
> > > > > From: Yang Zhang <yang.zhang.wz@gmail.com>
> > > > > 
> > > > > Some latency-intensive workload will see obviously performance
> > > > > drop when running inside VM. The main reason is that the overhead
> > > > > is amplified when running inside VM. The most cost i have seen is
> > > > > inside idle path.
> > > > > This patch introduces a new mechanism to poll for a while before
> > > > > entering idle state. If schedule is needed during poll, then we
> > > > > don't need to goes through the heavy overhead path.
> > > > > 
> > > > > Here is the data i get when running benchmark contextswitch
> > > > > (https://github.com/tsuna/contextswitch)
> > > > > before patch:
> > > > > 2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
> > > > > after patch:
> > > > > 2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)
> > > > 
> > > > 
> > > > If you test this after disabling the adaptive halt-polling in kvm?
> > > > What's the performance data of w/ this patchset and w/o the adaptive
> > > > halt-polling in kvm, and w/o this patchset and w/ the adaptive
> > > > halt-polling in kvm? In addition, both linux and windows guests can
> > > > get benefit as we have already done this in kvm.
> > > 
> > > 
> > > I will provide more data in next version. But it doesn't conflict with
> > 
> > Another case I can think of is w/ both this patchset and the adaptive
> > halt-polling in kvm.
> > 
> > > current halt polling inside kvm. This is just another enhancement.
> > 
> > I didn't look close to the patchset, however, maybe there is another
> > poll in the kvm part again sometimes if you fails the poll in the
> > guest. In addition, the adaptive halt-polling in kvm has performance
> > penalty when the pCPU is heavily overcommitted though there is a
> > single_task_running() in my testing, it is hard to accurately aware
> > whether there are other tasks waiting on the pCPU in the guest which
> > will make it worser. Depending on vcpu_is_preempted() or steal time
> > maybe not accurately or directly.
> > 
> > So I'm not sure how much sense it makes by adaptive halt-polling in
> > both guest and kvm. I prefer to just keep adaptive halt-polling in
> > kvm(then both linux/windows or other guests can get benefit) and avoid
> > to churn the core x86 path.
> 
> This mechanism is not specific to KVM. It is a kernel feature which can
> benefit guest when running inside X86 virtualization environment. The guest
> includes KVM,Xen,VMWARE,Hyper-v. Administrator can control KVM to use
> adaptive halt poll but he cannot control the user to use halt polling inside
> guest. Lots of user set idle=poll inside guest to improve performance which
> occupy more CPU cycles. This mechanism is a enhancement to it not to KVM
> halt polling.

Users of idle=poll shouldn't overcommit, so the goal seems to be energy
savings without crippling the guest performance too much ...

Wouldn't switching to idle=mwait work as well?

Thanks.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-06-27 13:56                 ` Paolo Bonzini
@ 2017-06-27 14:22                   ` Radim Krčmář
  2017-06-27 14:26                     ` Paolo Bonzini
  2017-07-03  9:28                     ` Yang Zhang
  0 siblings, 2 replies; 37+ messages in thread
From: Radim Krčmář @ 2017-06-27 14:22 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Wanpeng Li, Yang Zhang, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers, Jonathan Corbet,
	tony.luck, Borislav Petkov, Peter Zijlstra, mchehab,
	Andrew Morton, krzk, jpoimboe, Andy Lutomirski,
	Christian Borntraeger, Thomas Garnier, Robert Gerst,
	Mathias Krause, douly.fnst, Nicolai Stange, Frederic Weisbecker,
	dvlasenk, Daniel Bristot de Oliveira, yamada.masahiro,
	mika.westerberg, Chen Yu, aaron.lu, Steven Rostedt, Kyle Huey,
	Len Brown, Prarit Bhargava, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, Alexey Dobriyan, fgao,
	ebiederm, Subash Abhinov Kasiviswanathan, Arnd Bergmann,
	Matt Fleming, Mel Gorman, linux-kernel, linux-doc, linux-edac,
	kvm

2017-06-27 15:56+0200, Paolo Bonzini:
> On 27/06/2017 15:40, Radim Krčmář wrote:
>>> ... which is not necessarily _wrong_.  It's just a different heuristic.
>> Right, it's just harder to use than host's single_task_running() -- the
>> VCPU calling vcpu_is_preempted() is never preempted, so we have to look
>> at other VCPUs that are not halted, but still preempted.
>> 
>> If we see some ratio of preempted VCPUs (> 0?), then we stop polling and
>> yield to the host.  Working under the assumption that there is work for
>> this PCPU if other VCPUs have stuff to do.  The downside is that it
>> misses information about host's topology, so it would be hard to make it
>> work well.
> 
> I would just use vcpu_is_preempted on the current CPU.  From guest POV
> this option is really a "f*** everyone else" setting just like
> idle=poll, only a little more polite.

vcpu_is_preempted() on current cpu cannot return true, AFAIK.

> If we've been preempted and we were polling, there are two cases.  If an
> interrupt was queued while the guest was preempted, the poll will be
> treated as successful anyway.

I think the poll should be treated as invalid if the window has expired
while the VCPU was preempted -- the guest can't tell whether the
interrupt still arrived within the poll window (unless we added
paravirt for that), so it shouldn't waste time waiting for it.

>                                If it hasn't, let others run---but really
> that's not because the guest wants to be polite, it's to avoid that the
> scheduler penalizes it excessively.

This sounds like a VM entry just to do an immediate VM exit, so paravirt
seems better here as well ... (the guest telling the host about its
window -- which could also be used to rule it out as a target in the
pause loop random kick.)

> So until it's preempted, I think it's okay if the guest doesn't care
> about others.  You wouldn't use this option anyway in overcommitted
> situations.
> 
> (I'm still not very convinced about the idea).

Me neither.  (The same mechanism is applicable to bare-metal, but was
never used there, so I would rather bring the guest behavior closer to
bare-metal.)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-06-27 14:22                   ` Radim Krčmář
@ 2017-06-27 14:26                     ` Paolo Bonzini
  2017-07-03  9:28                     ` Yang Zhang
  1 sibling, 0 replies; 37+ messages in thread
From: Paolo Bonzini @ 2017-06-27 14:26 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: Wanpeng Li, Yang Zhang, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers, Jonathan Corbet,
	tony.luck, Borislav Petkov, Peter Zijlstra, mchehab,
	Andrew Morton, krzk, jpoimboe, Andy Lutomirski,
	Christian Borntraeger, Thomas Garnier, Robert Gerst,
	Mathias Krause, douly.fnst, Nicolai Stange, Frederic Weisbecker,
	dvlasenk, Daniel Bristot de Oliveira, yamada.masahiro,
	mika.westerberg, Chen Yu, aaron.lu, Steven Rostedt, Kyle Huey,
	Len Brown, Prarit Bhargava, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, Alexey Dobriyan, fgao,
	ebiederm, Subash Abhinov Kasiviswanathan, Arnd Bergmann,
	Matt Fleming, Mel Gorman, linux-kernel, linux-doc, linux-edac,
	kvm



On 27/06/2017 16:22, Radim Krčmář wrote:
> vcpu_is_preempted() on current cpu cannot return true, AFAIK.

Of course.  I must have been thinking of an older version of the
vcpu_is_preempted patch (at some point the guest was the one that set
preempted to 0).

Paolo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-06-27 14:22                   ` Radim Krčmář
  2017-06-27 14:26                     ` Paolo Bonzini
@ 2017-07-03  9:28                     ` Yang Zhang
  2017-07-03 10:06                       ` Thomas Gleixner
                                         ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Yang Zhang @ 2017-07-03  9:28 UTC (permalink / raw)
  To: Radim Krčmář, Paolo Bonzini
  Cc: Wanpeng Li, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	the arch/x86 maintainers, Jonathan Corbet, tony.luck,
	Borislav Petkov, Peter Zijlstra, mchehab, Andrew Morton, krzk,
	jpoimboe, Andy Lutomirski, Christian Borntraeger, Thomas Garnier,
	Robert Gerst, Mathias Krause, douly.fnst, Nicolai Stange,
	Frederic Weisbecker, dvlasenk, Daniel Bristot de Oliveira,
	yamada.masahiro, mika.westerberg, Chen Yu, aaron.lu,
	Steven Rostedt, Kyle Huey, Len Brown, Prarit Bhargava,
	hidehiro.kawai.ez, fengtiantian, pmladek, jeyu, Larry.Finger,
	zijun_hu, luisbg, johannes.berg, niklas.soderlund+renesas,
	zlpnobody, Alexey Dobriyan, fgao, ebiederm,
	Subash Abhinov Kasiviswanathan, Arnd Bergmann, Matt Fleming,
	Mel Gorman, linux-kernel, linux-doc, linux-edac, kvm

On 2017/6/27 22:22, Radim Krčmář wrote:
> 2017-06-27 15:56+0200, Paolo Bonzini:
>> On 27/06/2017 15:40, Radim Krčmář wrote:
>>>> ... which is not necessarily _wrong_.  It's just a different heuristic.
>>> Right, it's just harder to use than host's single_task_running() -- the
>>> VCPU calling vcpu_is_preempted() is never preempted, so we have to look
>>> at other VCPUs that are not halted, but still preempted.
>>>
>>> If we see some ratio of preempted VCPUs (> 0?), then we stop polling and
>>> yield to the host.  Working under the assumption that there is work for
>>> this PCPU if other VCPUs have stuff to do.  The downside is that it
>>> misses information about host's topology, so it would be hard to make it
>>> work well.
>>
>> I would just use vcpu_is_preempted on the current CPU.  From guest POV
>> this option is really a "f*** everyone else" setting just like
>> idle=poll, only a little more polite.
>
> vcpu_is_preempted() on current cpu cannot return true, AFAIK.
>
>> If we've been preempted and we were polling, there are two cases.  If an
>> interrupt was queued while the guest was preempted, the poll will be
>> treated as successful anyway.
>
> I think the poll should be treated as invalid if the window has expired
> while the VCPU was preempted -- the guest can't tell whether the
> interrupt arrived still within the poll window (unless we added paravirt
> for that), so it shouldn't be wasting time waiting for it.
>
>>                                If it hasn't, let others run---but really
>> that's not because the guest wants to be polite, it's to avoid that the
>> scheduler penalizes it excessively.
>
> This sounds like a VM entry just to do an immediate VM exit, so paravirt
> seems better here as well ... (the guest telling the host about its
> window -- which could also be used to rule it out as a target in the
> pause loop random kick.)
>
>> So until it's preempted, I think it's okay if the guest doesn't care
>> about others.  You wouldn't use this option anyway in overcommitted
>> situations.
>>
>> (I'm still not very convinced about the idea).
>
> Me neither.  (The same mechanism is applicable to bare-metal, but was
> never used there, so I would rather bring the guest behavior closer to
> bare-metal.)
>

The background is that we (Alibaba Cloud) get more and more complaints 
from our customers about both KVM and Xen compared to bare metal. After 
investigation, the root cause is known to us: the big cost of 
message-passing workloads (David showed it at KVM Forum 2015)

A typical message-passing workload looks like this:
vcpu 0                             vcpu 1
1. send ipi                     2.  doing hlt
3. go into idle                 4.  receive ipi and wake up from hlt
5. write APIC timer twice       6.  write APIC timer twice to
   to stop sched timer              reprogram sched timer
7. doing hlt                    8.  handle task and send ipi to
                                    vcpu 0
9. same as 4                    10. same as 3

One transaction introduces about 12 vmexits (2 hlt and 10 MSR writes). 
The cost of these vmexits degrades performance severely. The Linux 
kernel already provides idle=poll to mitigate the trend, but it only 
eliminates the IPI and hlt vmexits; it does nothing about starting and 
stopping the sched timer. A compromise would be to turn off NOHZ, but 
that is not the default config for new distributions. The same goes for 
halt-poll in KVM: it only addresses the cost of scheduling in/out on 
the host and cannot help such workloads much.

The purpose of this patch is to improve the current idle=poll mechanism 
to use dynamic polling, and to poll before touching the sched timer. It 
need not be a virtualization-specific feature, but bare metal seems to 
have a low cost for accessing the MSR, so I want to enable it only in 
VMs. Though the idea behind the patch may not fit all conditions 
perfectly, it looks no worse than what we have now.
How about we keep the current implementation and I integrate the patch 
into the paravirtualization part as Paolo suggested? We can continue to 
discuss it, and I will continue to refine it if anyone has better 
suggestions?
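
To make the placement concrete, a rough sketch of the intended idle
entry is below. poll_for_wakeup() stands for the dynamic poll from
patch 1/2; the real path in kernel/sched/idle.c is more involved:

/* idle entry with poll-before-nohz, heavily simplified */
static void idle_enter(void)
{
	/* 1. poll first: a wakeup here costs neither the hlt vmexit
	 *    nor the sched-timer stop/reprogram MSR writes */
	if (poll_for_wakeup())
		return;			/* back to the scheduler */

	/* 2. only now pay for stopping the sched timer */
	tick_nohz_idle_enter();
	default_idle();			/* hlt, woken by interrupt */
	tick_nohz_idle_exit();
}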


-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-07-03  9:28                     ` Yang Zhang
@ 2017-07-03 10:06                       ` Thomas Gleixner
  2017-07-04  2:19                         ` Yang Zhang
  2017-07-04 14:13                       ` Radim Krčmář
  2017-07-04 22:28                       ` Wanpeng Li
  2 siblings, 1 reply; 37+ messages in thread
From: Thomas Gleixner @ 2017-07-03 10:06 UTC (permalink / raw)
  To: Yang Zhang
  Cc: Radim Krčmář,
	Paolo Bonzini, Wanpeng Li, Ingo Molnar, H. Peter Anvin,
	the arch/x86 maintainers, Jonathan Corbet, tony.luck,
	Borislav Petkov, Peter Zijlstra, mchehab, Andrew Morton, krzk,
	jpoimboe, Andy Lutomirski, Christian Borntraeger, Thomas Garnier,
	Robert Gerst, Mathias Krause, douly.fnst, Nicolai Stange,
	Frederic Weisbecker, dvlasenk, Daniel Bristot de Oliveira,
	yamada.masahiro, mika.westerberg, Chen Yu, aaron.lu,
	Steven Rostedt, Kyle Huey, Len Brown, Prarit Bhargava,
	hidehiro.kawai.ez, fengtiantian, pmladek, jeyu, Larry.Finger,
	zijun_hu, luisbg, johannes.berg, niklas.soderlund+renesas,
	zlpnobody, Alexey Dobriyan, fgao, ebiederm,
	Subash Abhinov Kasiviswanathan, Arnd Bergmann, Matt Fleming,
	Mel Gorman, linux-kernel, linux-doc, linux-edac, kvm

On Mon, 3 Jul 2017, Yang Zhang wrote:
> The background is that we(Alibaba Cloud) do get more and more complaints from
> our customers in both KVM and Xen compare to bare-mental.After investigations,
> the root cause is known to us: big cost in message passing workload(David show
> it in KVM forum 2015)
> 
> A typical message workload like below:
> vcpu 0                             vcpu 1
> 1. send ipi                     2.  doing hlt
> 3. go into idle                 4.  receive ipi and wake up from hlt
> 5. write APIC time twice        6.  write APIC time twice to
>    to stop sched timer              reprogram sched timer
> 7. doing hlt                    8.  handle task and send ipi to
>                                     vcpu 0
> 9. same to 4.                   10. same to 3
> 
> One transaction will introduce about 12 vmexits(2 hlt and 10 msr write). The
> cost of such vmexits will degrades performance severely. Linux kernel already
> provide idle=poll to mitigate the trend. But it only eliminates the IPI and
> hlt vmexit. It has nothing to do with start/stop sched timer. A compromise
> would be to turn off NOHZ kernel, but it is not the default config for new
> distributions.

You still can turn it off on the kernel command line via nohz=off

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-07-03 10:06                       ` Thomas Gleixner
@ 2017-07-04  2:19                         ` Yang Zhang
  0 siblings, 0 replies; 37+ messages in thread
From: Yang Zhang @ 2017-07-04  2:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Radim Krčmář,
	Paolo Bonzini, Wanpeng Li, Ingo Molnar, H. Peter Anvin,
	the arch/x86 maintainers, Jonathan Corbet, tony.luck,
	Borislav Petkov, Peter Zijlstra, mchehab, Andrew Morton, krzk,
	jpoimboe, Andy Lutomirski, Christian Borntraeger, Thomas Garnier,
	Robert Gerst, Mathias Krause, douly.fnst, Nicolai Stange,
	Frederic Weisbecker, dvlasenk, Daniel Bristot de Oliveira,
	yamada.masahiro, mika.westerberg, Chen Yu, aaron.lu,
	Steven Rostedt, Kyle Huey, Len Brown, Prarit Bhargava,
	hidehiro.kawai.ez, fengtiantian, pmladek, jeyu, Larry.Finger,
	zijun_hu, luisbg, johannes.berg, niklas.soderlund+renesas,
	zlpnobody, Alexey Dobriyan, fgao, ebiederm,
	Subash Abhinov Kasiviswanathan, Arnd Bergmann, Matt Fleming,
	Mel Gorman, linux-kernel, linux-doc, linux-edac, kvm

On 2017/7/3 18:06, Thomas Gleixner wrote:
> On Mon, 3 Jul 2017, Yang Zhang wrote:
>> The background is that we(Alibaba Cloud) do get more and more complaints from
>> our customers in both KVM and Xen compare to bare-mental.After investigations,
>> the root cause is known to us: big cost in message passing workload(David show
>> it in KVM forum 2015)
>>
>> A typical message workload like below:
>> vcpu 0                             vcpu 1
>> 1. send ipi                     2.  doing hlt
>> 3. go into idle                 4.  receive ipi and wake up from hlt
>> 5. write APIC time twice        6.  write APIC time twice to
>>    to stop sched timer              reprogram sched timer
>> 7. doing hlt                    8.  handle task and send ipi to
>>                                     vcpu 0
>> 9. same to 4.                   10. same to 3
>>
>> One transaction will introduce about 12 vmexits(2 hlt and 10 msr write). The
>> cost of such vmexits will degrades performance severely. Linux kernel already
>> provide idle=poll to mitigate the trend. But it only eliminates the IPI and
>> hlt vmexit. It has nothing to do with start/stop sched timer. A compromise
>> would be to turn off NOHZ kernel, but it is not the default config for new
>> distributions.
>
> You still can turn if off on the kernel command line via nohz=off

You are right. Senior users will turn it off manually, but that only 
solves the sched-timer part; they still have the IPI/hlt problem. 
Another point is that we release the distribution image to customers 
without any extra configuration, to avoid a mismatch between VM and 
bare metal. Changing such a configuration needs a reboot, but some 
customers' businesses cannot be interrupted once the service has 
started (like online gaming). It would be better if we could provide a 
sysctl interface to allow run-time modification. By the way, idle=poll 
seems too heavy to use.
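
Such a run-time knob is just a table entry in kernel/sysctl.c. A
minimal sketch with an illustrative name (the patch's real entries are
poll_grow and poll_shrink):

#include <linux/sysctl.h>

static unsigned int halt_poll_grow = 2;

static struct ctl_table halt_poll_table[] = {
	{
		.procname	= "halt_poll_grow",
		.data		= &halt_poll_grow,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,		/* root-writable */
		.proc_handler	= proc_dointvec,
	},
	{ }
};
/* then: sysctl -w kernel.halt_poll_grow=4, no reboot needed */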



>
> Thanks,
>
> 	tglx
>


-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-07-03  9:28                     ` Yang Zhang
  2017-07-03 10:06                       ` Thomas Gleixner
@ 2017-07-04 14:13                       ` Radim Krčmář
  2017-07-04 14:50                         ` Thomas Gleixner
  2017-07-13 11:49                         ` Yang Zhang
  2017-07-04 22:28                       ` Wanpeng Li
  2 siblings, 2 replies; 37+ messages in thread
From: Radim Krčmář @ 2017-07-04 14:13 UTC (permalink / raw)
  To: Yang Zhang
  Cc: Paolo Bonzini, Wanpeng Li, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers, Jonathan Corbet,
	tony.luck, Borislav Petkov, Peter Zijlstra, mchehab,
	Andrew Morton, krzk, jpoimboe, Andy Lutomirski,
	Christian Borntraeger, Thomas Garnier, Robert Gerst,
	Mathias Krause, douly.fnst, Nicolai Stange, Frederic Weisbecker,
	dvlasenk, Daniel Bristot de Oliveira, yamada.masahiro,
	mika.westerberg, Chen Yu, aaron.lu, Steven Rostedt, Kyle Huey,
	Len Brown, Prarit Bhargava, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, Alexey Dobriyan, fgao,
	ebiederm, Subash Abhinov Kasiviswanathan, Arnd Bergmann,
	Matt Fleming, Mel Gorman, linux-kernel, linux-doc, linux-edac,
	kvm

2017-07-03 17:28+0800, Yang Zhang:
> The background is that we(Alibaba Cloud) do get more and more complaints
> from our customers in both KVM and Xen compare to bare-mental.After
> investigations, the root cause is known to us: big cost in message passing
> workload(David show it in KVM forum 2015)
> 
> A typical message workload like below:
> vcpu 0                             vcpu 1
> 1. send ipi                     2.  doing hlt
> 3. go into idle                 4.  receive ipi and wake up from hlt
> 5. write APIC time twice        6.  write APIC time twice to
>    to stop sched timer              reprogram sched timer

One write is enough to disable/re-enable the APIC timer -- why does
Linux use two?

> 7. doing hlt                    8.  handle task and send ipi to
>                                     vcpu 0
> 9. same to 4.                   10. same to 3
> 
> One transaction will introduce about 12 vmexits(2 hlt and 10 msr write). The
> cost of such vmexits will degrades performance severely.

Yeah, sounds like too much ... I understood that there are

  IPI from 1 to 2
  4 * APIC timer
  IPI from 2 to 1

which adds to 6 MSR writes -- what are the other 4?

>                                                          Linux kernel
> already provide idle=poll to mitigate the trend. But it only eliminates the
> IPI and hlt vmexit. It has nothing to do with start/stop sched timer. A
> compromise would be to turn off NOHZ kernel, but it is not the default
> config for new distributions. Same for halt-poll in KVM, it only solve the
> cost from schedule in/out in host and can not help such workload much.
> 
> The purpose of this patch we want to improve current idle=poll mechanism to

Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow
down the sibling hyperthread.  MWAIT solves the IPI problem, but doesn't
get rid of the timer one.
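
For reference, MWAIT-based idle is roughly the following (cf.
mwait_idle() in arch/x86/kernel/process.c, simplified here); the
monitored address is whatever the waker writes, e.g. the thread flags
word that TIF_NEED_RESCHED lives in:

#include <linux/sched.h>
#include <asm/mwait.h>

static void mwait_idle_sketch(void)
{
	__monitor(&current_thread_info()->flags, 0, 0);	/* arm monitor */
	if (!need_resched())
		__sti_mwait(0, 0);	/* sti;mwait: wakes on a write to
					 * the monitored line (e.g. a waker
					 * setting TIF_NEED_RESCHED) or on
					 * an interrupt */
}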

> use dynamic polling and do poll before touch sched timer. It should not be a
> virtualization specific feature but seems bare mental have low cost to
> access the MSR. So i want to only enable it in VM. Though the idea below the
> patch may not so perfect to fit all conditions, it looks no worse than now.

It adds code to hot-paths (interrupt handlers) while trying to optimize
an idle-path, which is suspicious.

> How about we keep current implementation and i integrate the patch to
> para-virtualize part as Paolo suggested? We can continue discuss it and i
> will continue to refine it if anyone has better suggestions?

I think there is a nicer solution to avoid the expensive timer rewrite:
Linux uses one-shot APIC timers and getting the timer interrupt is about
as expensive as programming the timer, so the guest can keep the timer
armed, but not re-arm it after the expiration if the CPU is idle.

This should also mitigate the problem with short idle periods, but the
optimized window is anywhere between 0 to 1ms.

Do you see disadvantages of this combined with MWAIT?

Thanks.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-07-04 14:13                       ` Radim Krčmář
@ 2017-07-04 14:50                         ` Thomas Gleixner
  2017-07-13 11:49                         ` Yang Zhang
  1 sibling, 0 replies; 37+ messages in thread
From: Thomas Gleixner @ 2017-07-04 14:50 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: Yang Zhang, Paolo Bonzini, Wanpeng Li, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers, Jonathan Corbet,
	tony.luck, Borislav Petkov, Peter Zijlstra, mchehab,
	Andrew Morton, krzk, jpoimboe, Andy Lutomirski,
	Christian Borntraeger, Thomas Garnier, Robert Gerst,
	Mathias Krause, douly.fnst, Nicolai Stange, Frederic Weisbecker,
	dvlasenk, Daniel Bristot de Oliveira, yamada.masahiro,
	mika.westerberg, Chen Yu, aaron.lu, Steven Rostedt, Kyle Huey,
	Len Brown, Prarit Bhargava, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, Alexey Dobriyan, fgao,
	ebiederm, Subash Abhinov Kasiviswanathan, Arnd Bergmann,
	Matt Fleming, Mel Gorman, linux-kernel, linux-doc, linux-edac,
	kvm

On Tue, 4 Jul 2017, Radim Krčmář wrote:
> I think there is a nicer solution to avoid the expensive timer rewrite:
> Linux uses one-shot APIC timers and getting the timer interrupt is about
> as expensive as programming the timer, so the guest can keep the timer
> armed, but not re-arm it after the expiration if the CPU is idle.

So much for the theory. The NOHZ logic has to reprogram the timer when
the next expiry is farther away than the next tick. Otherwise you get
one pointless wakeup on every idle entry, which defeats the whole
purpose of NOHZ, namely avoiding exactly that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-07-03  9:28                     ` Yang Zhang
  2017-07-03 10:06                       ` Thomas Gleixner
  2017-07-04 14:13                       ` Radim Krčmář
@ 2017-07-04 22:28                       ` Wanpeng Li
  2 siblings, 0 replies; 37+ messages in thread
From: Wanpeng Li @ 2017-07-04 22:28 UTC (permalink / raw)
  To: Yang Zhang
  Cc: Radim Krčmář,
	Paolo Bonzini, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	the arch/x86 maintainers, Jonathan Corbet, tony.luck,
	Borislav Petkov, Peter Zijlstra, mchehab, Andrew Morton, krzk,
	jpoimboe, Andy Lutomirski, Christian Borntraeger, Thomas Garnier,
	Robert Gerst, Mathias Krause, douly.fnst, Nicolai Stange,
	Frederic Weisbecker, dvlasenk, Daniel Bristot de Oliveira,
	yamada.masahiro, mika.westerberg, Chen Yu, aaron.lu,
	Steven Rostedt, Kyle Huey, Len Brown, Prarit Bhargava,
	hidehiro.kawai.ez, fengtiantian, pmladek, jeyu, Larry.Finger,
	zijun_hu, luisbg, johannes.berg, niklas.soderlund+renesas,
	zlpnobody, Alexey Dobriyan, fgao, ebiederm,
	Subash Abhinov Kasiviswanathan, Arnd Bergmann, Matt Fleming,
	Mel Gorman, linux-kernel, linux-doc, linux-edac, kvm

2017-07-03 17:28 GMT+08:00 Yang Zhang <yang.zhang.wz@gmail.com>:
> On 2017/6/27 22:22, Radim Krčmář wrote:
>>
>> 2017-06-27 15:56+0200, Paolo Bonzini:
>>>
>>> On 27/06/2017 15:40, Radim Krčmář wrote:
>>>>>
>>>>> ... which is not necessarily _wrong_.  It's just a different heuristic.
>>>>
>>>> Right, it's just harder to use than host's single_task_running() -- the
>>>> VCPU calling vcpu_is_preempted() is never preempted, so we have to look
>>>> at other VCPUs that are not halted, but still preempted.
>>>>
>>>> If we see some ratio of preempted VCPUs (> 0?), then we stop polling and
>>>> yield to the host.  Working under the assumption that there is work for
>>>> this PCPU if other VCPUs have stuff to do.  The downside is that it
>>>> misses information about host's topology, so it would be hard to make it
>>>> work well.
>>>
>>>
>>> I would just use vcpu_is_preempted on the current CPU.  From guest POV
>>> this option is really a "f*** everyone else" setting just like
>>> idle=poll, only a little more polite.
>>
>>
>> vcpu_is_preempted() on current cpu cannot return true, AFAIK.
>>
>>> If we've been preempted and we were polling, there are two cases.  If an
>>> interrupt was queued while the guest was preempted, the poll will be
>>> treated as successful anyway.
>>
>>
>> I think the poll should be treated as invalid if the window has expired
>> while the VCPU was preempted -- the guest can't tell whether the
>> interrupt arrived still within the poll window (unless we added paravirt
>> for that), so it shouldn't be wasting time waiting for it.
>>
>>>                                If it hasn't, let others run---but really
>>> that's not because the guest wants to be polite, it's to avoid that the
>>> scheduler penalizes it excessively.
>>
>>
>> This sounds like a VM entry just to do an immediate VM exit, so paravirt
>> seems better here as well ... (the guest telling the host about its
>> window -- which could also be used to rule it out as a target in the
>> pause loop random kick.)
>>
>>> So until it's preempted, I think it's okay if the guest doesn't care
>>> about others.  You wouldn't use this option anyway in overcommitted
>>> situations.
>>>
>>> (I'm still not very convinced about the idea).
>>
>>
>> Me neither.  (The same mechanism is applicable to bare-metal, but was
>> never used there, so I would rather bring the guest behavior closer to
>> bare-metal.)
>>
>
> The background is that we(Alibaba Cloud) do get more and more complaints
> from our customers in both KVM and Xen compare to bare-mental.After
> investigations, the root cause is known to us: big cost in message passing
> workload(David show it in KVM forum 2015)
>
> A typical message workload like below:
> vcpu 0                             vcpu 1
> 1. send ipi                     2.  doing hlt
> 3. go into idle                 4.  receive ipi and wake up from hlt
> 5. write APIC time twice        6.  write APIC time twice to
>    to stop sched timer              reprogram sched timer

I didn't find where these two scenarios program the APIC timer twice
each rather than once; could you point out the code?

Regards,
Wanpeng Li

> 7. doing hlt                    8.  handle task and send ipi to
>                                     vcpu 0
> 9. same to 4.                   10. same to 3
>
> One transaction will introduce about 12 vmexits(2 hlt and 10 msr write). The
> cost of such vmexits will degrades performance severely. Linux kernel
> already provide idle=poll to mitigate the trend. But it only eliminates the
> IPI and hlt vmexit. It has nothing to do with start/stop sched timer. A
> compromise would be to turn off NOHZ kernel, but it is not the default
> config for new distributions. Same for halt-poll in KVM, it only solve the
> cost from schedule in/out in host and can not help such workload much.
>
> The purpose of this patch we want to improve current idle=poll mechanism to
> use dynamic polling and do poll before touch sched timer. It should not be a
> virtualization specific feature but seems bare mental have low cost to
> access the MSR. So i want to only enable it in VM. Though the idea below the
> patch may not so perfect to fit all conditions, it looks no worse than now.
> How about we keep current implementation and i integrate the patch to
> para-virtualize part as Paolo suggested? We can continue discuss it and i
> will continue to refine it if anyone has better suggestions?
>
>
>
> --
> Yang
> Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-07-04 14:13                       ` Radim Krčmář
  2017-07-04 14:50                         ` Thomas Gleixner
@ 2017-07-13 11:49                         ` Yang Zhang
  2017-07-14  9:37                           ` Alexander Graf
  1 sibling, 1 reply; 37+ messages in thread
From: Yang Zhang @ 2017-07-13 11:49 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: Paolo Bonzini, Wanpeng Li, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers, Jonathan Corbet,
	tony.luck, Borislav Petkov, Peter Zijlstra, mchehab,
	Andrew Morton, krzk, jpoimboe, Andy Lutomirski,
	Christian Borntraeger, Thomas Garnier, Robert Gerst,
	Mathias Krause, douly.fnst, Nicolai Stange, Frederic Weisbecker,
	dvlasenk, Daniel Bristot de Oliveira, yamada.masahiro,
	mika.westerberg, Chen Yu, aaron.lu, Steven Rostedt, Kyle Huey,
	Len Brown, Prarit Bhargava, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, Alexey Dobriyan, fgao,
	ebiederm, Subash Abhinov Kasiviswanathan, Arnd Bergmann,
	Matt Fleming, Mel Gorman, linux-kernel, linux-doc, linux-edac,
	kvm

On 2017/7/4 22:13, Radim Krčmář wrote:
> 2017-07-03 17:28+0800, Yang Zhang:
>> The background is that we(Alibaba Cloud) do get more and more complaints
>> from our customers in both KVM and Xen compare to bare-mental.After
>> investigations, the root cause is known to us: big cost in message passing
>> workload(David show it in KVM forum 2015)
>>
>> A typical message workload like below:
>> vcpu 0                             vcpu 1
>> 1. send ipi                     2.  doing hlt
>> 3. go into idle                 4.  receive ipi and wake up from hlt
>> 5. write APIC time twice        6.  write APIC time twice to
>>    to stop sched timer              reprogram sched timer
>
> One write is enough to disable/re-enable the APIC timer -- why does
> Linux use two?

One write is to remove the timer and the other is to reprogram it. 
Normally there is only one write, to remove the timer, but in some 
cases it will reprogram it as well.

>
>> 7. doing hlt                    8.  handle task and send ipi to
>>                                     vcpu 0
>> 9. same to 4.                   10. same to 3
>>
>> One transaction will introduce about 12 vmexits(2 hlt and 10 msr write). The
>> cost of such vmexits will degrades performance severely.
>
> Yeah, sounds like too much ... I understood that there are
>
>   IPI from 1 to 2
>   4 * APIC timer
>   IPI from 2 to 1
>
> which adds to 6 MSR writes -- what are the other 4?

In the worst case, each timer event will touch the APIC timer twice, 
so it will add an additional 4 MSR writes. But this is not always true.
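
Putting the thread's numbers together (assuming x2APIC, where both the
ICR and the timer register are MSRs), the worst case adds up to the
cover letter's figure:

  IPI vcpu0 -> vcpu1 (ICR write)                    1 MSR write
  IPI vcpu1 -> vcpu0 (ICR write)                    1 MSR write
  4 sched-timer stop/reprogram events               4 MSR writes
  worst case, each event writes the timer twice    +4 MSR writes
  2 hlt                                             2 vmexits
  ---------------------------------------------------------------
  total                            10 MSR writes + 2 hlt = 12 vmexits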

>
>>                                                          Linux kernel
>> already provide idle=poll to mitigate the trend. But it only eliminates the
>> IPI and hlt vmexit. It has nothing to do with start/stop sched timer. A
>> compromise would be to turn off NOHZ kernel, but it is not the default
>> config for new distributions. Same for halt-poll in KVM, it only solve the
>> cost from schedule in/out in host and can not help such workload much.
>>
>> The purpose of this patch we want to improve current idle=poll mechanism to
>
> Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow
> down the sibling hyperthread.  MWAIT solves the IPI problem, but doesn't
> get rid of the timer one.

Yes, I can try it. But MWAIT will not yield the CPU; it only helps the 
sibling hyperthread, as you mentioned.

>
>> use dynamic polling and do poll before touch sched timer. It should not be a
>> virtualization specific feature but seems bare mental have low cost to
>> access the MSR. So i want to only enable it in VM. Though the idea below the
>> patch may not so perfect to fit all conditions, it looks no worse than now.
>
> It adds code to hot-paths (interrupt handlers) while trying to optimize
> an idle-path, which is suspicious.
>
>> How about we keep current implementation and i integrate the patch to
>> para-virtualize part as Paolo suggested? We can continue discuss it and i
>> will continue to refine it if anyone has better suggestions?
>
> I think there is a nicer solution to avoid the expensive timer rewrite:
> Linux uses one-shot APIC timers and getting the timer interrupt is about
> as expensive as programming the timer, so the guest can keep the timer
> armed, but not re-arm it after the expiration if the CPU is idle.
>
> This should also mitigate the problem with short idle periods, but the
> optimized window is anywhere between 0 to 1ms.
>
> Do you see disadvantages of this combined with MWAIT?
>
> Thanks.
>


-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-07-13 11:49                         ` Yang Zhang
@ 2017-07-14  9:37                           ` Alexander Graf
  2017-07-17  9:26                             ` Yang Zhang
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Graf @ 2017-07-14  9:37 UTC (permalink / raw)
  To: Yang Zhang, Radim Krčmář
  Cc: Paolo Bonzini, Wanpeng Li, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers, Jonathan Corbet,
	tony.luck, Borislav Petkov, Peter Zijlstra, mchehab,
	Andrew Morton, krzk, jpoimboe, Andy Lutomirski,
	Christian Borntraeger, Thomas Garnier, Robert Gerst,
	Mathias Krause, douly.fnst, Nicolai Stange, Frederic Weisbecker,
	dvlasenk, Daniel Bristot de Oliveira, yamada.masahiro,
	mika.westerberg, Chen Yu, aaron.lu, Steven Rostedt, Kyle Huey,
	Len Brown, Prarit Bhargava, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, Alexey Dobriyan, fgao,
	ebiederm, Subash Abhinov Kasiviswanathan, Arnd Bergmann,
	Matt Fleming, Mel Gorman, linux-kernel, linux-doc, linux-edac,
	kvm



On 13.07.17 13:49, Yang Zhang wrote:
> On 2017/7/4 22:13, Radim Krčmář wrote:
>> 2017-07-03 17:28+0800, Yang Zhang:
>>> The background is that we(Alibaba Cloud) do get more and more complaints
>>> from our customers in both KVM and Xen compare to bare-mental.After
>>> investigations, the root cause is known to us: big cost in message 
>>> passing
>>> workload(David show it in KVM forum 2015)
>>>
>>> A typical message workload like below:
>>> vcpu 0                             vcpu 1
>>> 1. send ipi                     2.  doing hlt
>>> 3. go into idle                 4.  receive ipi and wake up from hlt
>>> 5. write APIC time twice        6.  write APIC time twice to
>>>    to stop sched timer              reprogram sched timer
>>
>> One write is enough to disable/re-enable the APIC timer -- why does
>> Linux use two?
> 
> One is to remove the timer and another one is to reprogram the timer. 
> Normally, only one write to remove the timer.But in some cases, it will 
> reprogram it.
> 
>>
>>> 7. doing hlt                    8.  handle task and send ipi to
>>>                                     vcpu 0
>>> 9. same to 4.                   10. same to 3
>>>
>>> One transaction will introduce about 12 vmexits(2 hlt and 10 msr 
>>> write). The
>>> cost of such vmexits will degrades performance severely.
>>
>> Yeah, sounds like too much ... I understood that there are
>>
>>   IPI from 1 to 2
>>   4 * APIC timer
>>   IPI from 2 to 1
>>
>> which adds to 6 MSR writes -- what are the other 4?
> 
> In the worst case, each timer will touch APIC timer twice.So it will add 
> additional 4 msr writse. But this is  not always true.
> 
>>
>>>                                                          Linux kernel
>>> already provide idle=poll to mitigate the trend. But it only 
>>> eliminates the
>>> IPI and hlt vmexit. It has nothing to do with start/stop sched timer. A
>>> compromise would be to turn off NOHZ kernel, but it is not the default
>>> config for new distributions. Same for halt-poll in KVM, it only 
>>> solve the
>>> cost from schedule in/out in host and can not help such workload much.
>>>
>>> The purpose of this patch we want to improve current idle=poll 
>>> mechanism to
>>
>> Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow
>> down the sibling hyperthread.  MWAIT solves the IPI problem, but doesn't
>> get rid of the timer one.
> 
> Yes, i can try it. But MWAIT will not yield CPU, it only helps the 
> sibling hyperthread as you mentioned.

If you implement proper MWAIT emulation that conditionally gets en- or 
disabled depending on the same halt poll dynamics that we already have 
for in-host HLT handling, it will also yield the CPU.

As for the timer - are you sure the problem is really the overhead of 
the timer configuration, not the latency that it takes to actually fire 
the guest timer?

One major problem I see is that we configure the host hrtimer to fire at 
the point in time when the guest wants to see a timer event. But in a 
virtual environment, the point in time when we have to start switching 
to the VM really should be a bit *before* the guest wants to be woken 
up, as it takes quite some time to switch back into the VM context.


Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-07-14  9:37                           ` Alexander Graf
@ 2017-07-17  9:26                             ` Yang Zhang
  2017-07-17  9:54                               ` Alexander Graf
  0 siblings, 1 reply; 37+ messages in thread
From: Yang Zhang @ 2017-07-17  9:26 UTC (permalink / raw)
  To: Alexander Graf, Radim Krčmář
  Cc: Paolo Bonzini, Wanpeng Li, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers, Jonathan Corbet,
	tony.luck, Borislav Petkov, Peter Zijlstra, mchehab,
	Andrew Morton, krzk, jpoimboe, Andy Lutomirski,
	Christian Borntraeger, Thomas Garnier, Robert Gerst,
	Mathias Krause, douly.fnst, Nicolai Stange, Frederic Weisbecker,
	dvlasenk, Daniel Bristot de Oliveira, yamada.masahiro,
	mika.westerberg, Chen Yu, aaron.lu, Steven Rostedt, Kyle Huey,
	Len Brown, Prarit Bhargava, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, Alexey Dobriyan, fgao,
	ebiederm, Subash Abhinov Kasiviswanathan, Arnd Bergmann,
	Matt Fleming, Mel Gorman, linux-kernel, linux-doc, linux-edac,
	kvm

On 2017/7/14 17:37, Alexander Graf wrote:
>
>
> On 13.07.17 13:49, Yang Zhang wrote:
>> On 2017/7/4 22:13, Radim Krčmář wrote:
>>> 2017-07-03 17:28+0800, Yang Zhang:
>>>> The background is that we(Alibaba Cloud) do get more and more
>>>> complaints
>>>> from our customers in both KVM and Xen compare to bare-mental.After
>>>> investigations, the root cause is known to us: big cost in message
>>>> passing
>>>> workload(David show it in KVM forum 2015)
>>>>
>>>> A typical message workload like below:
>>>> vcpu 0                             vcpu 1
>>>> 1. send ipi                     2.  doing hlt
>>>> 3. go into idle                 4.  receive ipi and wake up from hlt
>>>> 5. write APIC time twice        6.  write APIC time twice to
>>>>    to stop sched timer              reprogram sched timer
>>>
>>> One write is enough to disable/re-enable the APIC timer -- why does
>>> Linux use two?
>>
>> One is to remove the timer and another one is to reprogram the timer.
>> Normally, only one write to remove the timer.But in some cases, it
>> will reprogram it.
>>
>>>
>>>> 7. doing hlt                    8.  handle task and send ipi to
>>>>                                     vcpu 0
>>>> 9. same to 4.                   10. same to 3
>>>>
>>>> One transaction will introduce about 12 vmexits(2 hlt and 10 msr
>>>> write). The
>>>> cost of such vmexits will degrades performance severely.
>>>
>>> Yeah, sounds like too much ... I understood that there are
>>>
>>>   IPI from 1 to 2
>>>   4 * APIC timer
>>>   IPI from 2 to 1
>>>
>>> which adds to 6 MSR writes -- what are the other 4?
>>
>> In the worst case, each timer will touch APIC timer twice.So it will
>> add additional 4 msr writse. But this is  not always true.
>>
>>>
>>>>                                                          Linux kernel
>>>> already provide idle=poll to mitigate the trend. But it only
>>>> eliminates the
>>>> IPI and hlt vmexit. It has nothing to do with start/stop sched timer. A
>>>> compromise would be to turn off NOHZ kernel, but it is not the default
>>>> config for new distributions. Same for halt-poll in KVM, it only
>>>> solve the
>>>> cost from schedule in/out in host and can not help such workload much.
>>>>
>>>> The purpose of this patch we want to improve current idle=poll
>>>> mechanism to
>>>
>>> Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow
>>> down the sibling hyperthread.  MWAIT solves the IPI problem, but doesn't
>>> get rid of the timer one.
>>
>> Yes, i can try it. But MWAIT will not yield CPU, it only helps the
>> sibling hyperthread as you mentioned.
>
> If you implement proper MWAIT emulation that conditionally gets en- or
> disabled depending on the same halt poll dynamics that we already have
> for in-host HLT handling, it will also yield the CPU.

It is hard to do . If we not intercept MWAIT instruction, there is no 
chance to wake up the CPU unless an interrupt arrived or a store to the 
address armed by MONITOR which is the same with idle=polling.

>
> As for the timer - are you sure the problem is really the overhead of
> the timer configuration, not the latency that it takes to actually fire
> the guest timer?

No, the main cost is introduced by vmexit, includes IPIs, Timer program, 
HLT. David detailed it in KVM forum, you can search "Message Passing 
Workloads in KVM" in google and the first link give the whole analysis 
of the problem.

>
> One major problem I see is that we configure the host hrtimer to fire at
> the point in time when the guest wants to see a timer event. But in a
> virtual environment, the point in time when we have to start switching
> to the VM really should be a bit *before* the guest wants to be woken
> up, as it takes quite some time to switch back into the VM context.
>
>
> Alex


-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-07-17  9:26                             ` Yang Zhang
@ 2017-07-17  9:54                               ` Alexander Graf
  2017-07-17 12:50                                 ` Yang Zhang
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Graf @ 2017-07-17  9:54 UTC (permalink / raw)
  To: Yang Zhang, Radim Krčmář
  Cc: Paolo Bonzini, Wanpeng Li, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers, Jonathan Corbet,
	tony.luck, Borislav Petkov, Peter Zijlstra, mchehab,
	Andrew Morton, krzk, jpoimboe, Andy Lutomirski,
	Christian Borntraeger, Thomas Garnier, Robert Gerst,
	Mathias Krause, douly.fnst, Nicolai Stange, Frederic Weisbecker,
	dvlasenk, Daniel Bristot de Oliveira, yamada.masahiro,
	mika.westerberg, Chen Yu, aaron.lu, Steven Rostedt, Kyle Huey,
	Len Brown, Prarit Bhargava, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, Alexey Dobriyan, fgao,
	ebiederm, Subash Abhinov Kasiviswanathan, Arnd Bergmann,
	Matt Fleming, Mel Gorman, linux-kernel, linux-doc, linux-edac,
	kvm



On 17.07.17 11:26, Yang Zhang wrote:
> On 2017/7/14 17:37, Alexander Graf wrote:
>>
>>
>> On 13.07.17 13:49, Yang Zhang wrote:
>>> On 2017/7/4 22:13, Radim Krčmář wrote:
>>>> 2017-07-03 17:28+0800, Yang Zhang:
>>>>> The background is that we(Alibaba Cloud) do get more and more
>>>>> complaints
>>>>> from our customers in both KVM and Xen compare to bare-mental.After
>>>>> investigations, the root cause is known to us: big cost in message
>>>>> passing
>>>>> workload(David show it in KVM forum 2015)
>>>>>
>>>>> A typical message workload like below:
>>>>> vcpu 0                             vcpu 1
>>>>> 1. send ipi                     2.  doing hlt
>>>>> 3. go into idle                 4.  receive ipi and wake up from hlt
>>>>> 5. write APIC time twice        6.  write APIC time twice to
>>>>>    to stop sched timer              reprogram sched timer
>>>>
>>>> One write is enough to disable/re-enable the APIC timer -- why does
>>>> Linux use two?
>>>
>>> One is to remove the timer and another one is to reprogram the timer.
>>> Normally, only one write to remove the timer.But in some cases, it
>>> will reprogram it.
>>>
>>>>
>>>>> 7. doing hlt                    8.  handle task and send ipi to
>>>>>                                     vcpu 0
>>>>> 9. same to 4.                   10. same to 3
>>>>>
>>>>> One transaction will introduce about 12 vmexits(2 hlt and 10 msr
>>>>> write). The
>>>>> cost of such vmexits will degrades performance severely.
>>>>
>>>> Yeah, sounds like too much ... I understood that there are
>>>>
>>>>   IPI from 1 to 2
>>>>   4 * APIC timer
>>>>   IPI from 2 to 1
>>>>
>>>> which adds to 6 MSR writes -- what are the other 4?
>>>
>>> In the worst case, each timer will touch APIC timer twice.So it will
>>> add additional 4 msr writse. But this is  not always true.
>>>
>>>>
>>>>>                                                          Linux kernel
>>>>> already provide idle=poll to mitigate the trend. But it only
>>>>> eliminates the
>>>>> IPI and hlt vmexit. It has nothing to do with start/stop sched 
>>>>> timer. A
>>>>> compromise would be to turn off NOHZ kernel, but it is not the default
>>>>> config for new distributions. Same for halt-poll in KVM, it only
>>>>> solve the
>>>>> cost from schedule in/out in host and can not help such workload much.
>>>>>
>>>>> The purpose of this patch we want to improve current idle=poll
>>>>> mechanism to
>>>>
>>>> Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow
>>>> down the sibling hyperthread.  MWAIT solves the IPI problem, but 
>>>> doesn't
>>>> get rid of the timer one.
>>>
>>> Yes, i can try it. But MWAIT will not yield CPU, it only helps the
>>> sibling hyperthread as you mentioned.
>>
>> If you implement proper MWAIT emulation that conditionally gets en- or
>> disabled depending on the same halt poll dynamics that we already have
>> for in-host HLT handling, it will also yield the CPU.
> 
> It is hard to do . If we not intercept MWAIT instruction, there is no 
> chance to wake up the CPU unless an interrupt arrived or a store to the 
> address armed by MONITOR which is the same with idle=polling.

Yes, but you can reconfigure the VMCS/VMCB to trap on MWAIT or not trap 
on it. That's something that idle=polling does not give you at all - a 
guest vcpu will always use 100% CPU.

The only really tricky part is how to limit the effect of MONITOR on 
nested page table maintenance. But if we just set the MONITOR cache size 
to 4k, well behaved guests should ideally always give us the one same 
page for wakeup - which we can then leave marked as trapping.

> 
>>
>> As for the timer - are you sure the problem is really the overhead of
>> the timer configuration, not the latency that it takes to actually fire
>> the guest timer?
> 
> No, the main cost is introduced by vmexit, includes IPIs, Timer program, 
> HLT. David detailed it in KVM forum, you can search "Message Passing 
> Workloads in KVM" in google and the first link give the whole analysis 
> of the problem.

During time critical message passing you want to keep both vCPUs inside 
the guest, yes. That again is something that guest exposed MWAIT would 
buy you.

The problem is that overcommitting CPU is very expensive with anything 
that does not set the guests idle at all. And not everyone can afford to 
throw more CPUs at problems :).


Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] x86/idle: use dynamic halt poll
  2017-07-17  9:54                               ` Alexander Graf
@ 2017-07-17 12:50                                 ` Yang Zhang
  0 siblings, 0 replies; 37+ messages in thread
From: Yang Zhang @ 2017-07-17 12:50 UTC (permalink / raw)
  To: Alexander Graf, Radim Krčmář
  Cc: Paolo Bonzini, Wanpeng Li, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, the arch/x86 maintainers, Jonathan Corbet,
	tony.luck, Borislav Petkov, Peter Zijlstra, mchehab,
	Andrew Morton, krzk, jpoimboe, Andy Lutomirski,
	Christian Borntraeger, Thomas Garnier, Robert Gerst,
	Mathias Krause, douly.fnst, Nicolai Stange, Frederic Weisbecker,
	dvlasenk, Daniel Bristot de Oliveira, yamada.masahiro,
	mika.westerberg, Chen Yu, aaron.lu, Steven Rostedt, Kyle Huey,
	Len Brown, Prarit Bhargava, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, Alexey Dobriyan, fgao,
	ebiederm, Subash Abhinov Kasiviswanathan, Arnd Bergmann,
	Matt Fleming, Mel Gorman, linux-kernel, linux-doc, linux-edac,
	kvm

On 2017/7/17 17:54, Alexander Graf wrote:
>
>
> On 17.07.17 11:26, Yang Zhang wrote:
>> On 2017/7/14 17:37, Alexander Graf wrote:
>>>
>>>
>>> On 13.07.17 13:49, Yang Zhang wrote:
>>>> On 2017/7/4 22:13, Radim Krčmář wrote:
>>>>> 2017-07-03 17:28+0800, Yang Zhang:
>>>>>> The background is that we(Alibaba Cloud) do get more and more
>>>>>> complaints
>>>>>> from our customers in both KVM and Xen compare to bare-mental.After
>>>>>> investigations, the root cause is known to us: big cost in message
>>>>>> passing
>>>>>> workload(David show it in KVM forum 2015)
>>>>>>
>>>>>> A typical message workload like below:
>>>>>> vcpu 0                             vcpu 1
>>>>>> 1. send ipi                     2.  doing hlt
>>>>>> 3. go into idle                 4.  receive ipi and wake up from hlt
>>>>>> 5. write APIC time twice        6.  write APIC time twice to
>>>>>>    to stop sched timer              reprogram sched timer
>>>>>
>>>>> One write is enough to disable/re-enable the APIC timer -- why does
>>>>> Linux use two?
>>>>
>>>> One is to remove the timer and another one is to reprogram the timer.
>>>> Normally, only one write to remove the timer.But in some cases, it
>>>> will reprogram it.
>>>>
>>>>>
>>>>>> 7. doing hlt                    8.  handle task and send ipi to
>>>>>>                                     vcpu 0
>>>>>> 9. same to 4.                   10. same to 3
>>>>>>
>>>>>> One transaction will introduce about 12 vmexits(2 hlt and 10 msr
>>>>>> write). The
>>>>>> cost of such vmexits will degrades performance severely.
>>>>>
>>>>> Yeah, sounds like too much ... I understood that there are
>>>>>
>>>>>   IPI from 1 to 2
>>>>>   4 * APIC timer
>>>>>   IPI from 2 to 1
>>>>>
>>>>> which adds to 6 MSR writes -- what are the other 4?
>>>>
>>>> In the worst case, each timer will touch APIC timer twice.So it will
>>>> add additional 4 msr writse. But this is  not always true.
>>>>
>>>>>
>>>>>>                                                          Linux kernel
>>>>>> already provide idle=poll to mitigate the trend. But it only
>>>>>> eliminates the
>>>>>> IPI and hlt vmexit. It has nothing to do with start/stop sched
>>>>>> timer. A
>>>>>> compromise would be to turn off NOHZ kernel, but it is not the
>>>>>> default
>>>>>> config for new distributions. Same for halt-poll in KVM, it only
>>>>>> solve the
>>>>>> cost from schedule in/out in host and can not help such workload
>>>>>> much.
>>>>>>
>>>>>> The purpose of this patch we want to improve current idle=poll
>>>>>> mechanism to
>>>>>
>>>>> Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow
>>>>> down the sibling hyperthread.  MWAIT solves the IPI problem, but
>>>>> doesn't
>>>>> get rid of the timer one.
>>>>
>>>> Yes, i can try it. But MWAIT will not yield CPU, it only helps the
>>>> sibling hyperthread as you mentioned.
>>>
>>> If you implement proper MWAIT emulation that conditionally gets en- or
>>> disabled depending on the same halt poll dynamics that we already have
>>> for in-host HLT handling, it will also yield the CPU.
>>
>> It is hard to do . If we not intercept MWAIT instruction, there is no
>> chance to wake up the CPU unless an interrupt arrived or a store to
>> the address armed by MONITOR which is the same with idle=polling.
>
> Yes, but you can reconfigure the VMCS/VMCB to trap on MWAIT or not trap
> on it. That's something that idle=polling does not give you at all - a
> guest vcpu will always use 100% CPU.

There are two things we need to figure out:
1. How and when to reconfigure the VMCS? Currently, all the knowledge 
are from guest, we don't know when to reconfigure it. Also, we cannot 
prevent guest from using MWAIT in other place if it see the feature.

2. If guest execute MWAIT without trap, since there is no way to set 
timeout for it, that would be a waste of CPU too.


>
> The only really tricky part is how to limit the effect of MONITOR on
> nested page table maintenance. But if we just set the MONITOR cache size
> to 4k, well behaved guests should ideally always give us the one same
> page for wakeup - which we can then leave marked as trapping.
>
>>
>>>
>>> As for the timer - are you sure the problem is really the overhead of
>>> the timer configuration, not the latency that it takes to actually fire
>>> the guest timer?
>>
>> No, the main cost is introduced by vmexit, includes IPIs, Timer
>> program, HLT. David detailed it in KVM forum, you can search "Message
>> Passing Workloads in KVM" in google and the first link give the whole
>> analysis of the problem.
>
> During time critical message passing you want to keep both vCPUs inside
> the guest, yes. That again is something that guest exposed MWAIT would
> buy you.

I think MWAIT only helps sibling hyper-threading case. But in real 
Cloud, hyper-threading is not always turning on, i.e. most products of 
Azure and some products of Alibaba Cloud. So it shouldn't be a big problem.

>
> The problem is that overcommitting CPU is very expensive with anything
> that does not set the guests idle at all. And not everyone can afford to
> throw more CPUs at problems :).

Agree, that's the reason why we choose dynamically halt polling. But on 
other side, the cloud vendor has the knowledge to control whether turn 
on it or not. The only problem is that there is no such way for us to do 
currently.

>
>
> Alex


-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] x86/idle: add halt poll for halt idle
  2017-06-22 11:22 ` [PATCH 1/2] x86/idle: add halt poll for halt idle root
  2017-06-22 14:23   ` Thomas Gleixner
@ 2017-08-16  4:04   ` Michael S. Tsirkin
  2017-08-17  7:29     ` Yang Zhang
  1 sibling, 1 reply; 37+ messages in thread
From: Michael S. Tsirkin @ 2017-08-16  4:04 UTC (permalink / raw)
  To: root
  Cc: tglx, mingo, hpa, pbonzini, x86, corbet, tony.luck, bp, peterz,
	mchehab, akpm, krzk, jpoimboe, luto, borntraeger, thgarnie,
	rgerst, minipli, douly.fnst, nicstange, fweisbec, dvlasenk,
	bristot, yamada.masahiro, mika.westerberg, yu.c.chen, aaron.lu,
	rostedt, me, len.brown, prarit, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, matt, mgorman, linux-kernel, linux-doc,
	linux-edac, kvm

On Thu, Jun 22, 2017 at 11:22:13AM +0000, root wrote:
> From: Yang Zhang <yang.zhang.wz@gmail.com>
> 
> This patch introduce a new mechanism to poll for a while before
> entering idle state.
> 
> David has a topic in KVM forum to describe the problem on current KVM VM
> when running some message passing workload in KVM forum. Also, there
> are some work to improve the performance in KVM, like halt polling in KVM.
> But we still has 4 MSR wirtes and HLT vmexit when going into halt idle
> which introduce lot of latency.
> 
> Halt polling in KVM provide the capbility to not schedule out VCPU when
> it is the only task in this pCPU. Unlike it, this patch will let VCPU polls
> for a while if there is no work inside VCPU to elimiate heavy vmexit during
> in/out idle. The potential impact is it will cost more CPU cycle since we
> are doing polling and may impact other task which waiting on the same
> physical CPU in host.

I wonder whether you considered doing this in an idle driver.
I have a prototype patch combining this with mwait within guest -
I can post it if you are interested.


> Here is the data i get when running benchmark contextswitch
> (https://github.com/tsuna/contextswitch)
> 
> before patch:
> 2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
> 
> after patch:
> 2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)
> 
> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
> ---
>  Documentation/sysctl/kernel.txt | 10 ++++++++++
>  arch/x86/kernel/process.c       | 21 +++++++++++++++++++++
>  include/linux/kernel.h          |  3 +++
>  kernel/sched/idle.c             |  3 +++
>  kernel/sysctl.c                 |  9 +++++++++
>  5 files changed, 46 insertions(+)
> 
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index bac23c1..4e71bfe 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -63,6 +63,7 @@ show up in /proc/sys/kernel:
>  - perf_event_max_stack
>  - perf_event_max_contexts_per_stack
>  - pid_max
> +- poll_threshold_ns        [ X86 only ]
>  - powersave-nap               [ PPC only ]
>  - printk
>  - printk_delay
> @@ -702,6 +703,15 @@ kernel tries to allocate a number starting from this one.
>  
>  ==============================================================
>  
> +poll_threshold_ns: (X86 only)
> +
> +This parameter used to control the max wait time to poll before going
> +into real idle state. By default, the values is 0 means don't poll.
> +It is recommended to change the value to non-zero if running latency-bound
> +workloads in VM.
> +
> +==============================================================
> +
>  powersave-nap: (PPC only)
>  
>  If set, Linux-PPC will use the 'nap' mode of powersaving,
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 0bb8842..6361783 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -39,6 +39,10 @@
>  #include <asm/desc.h>
>  #include <asm/prctl.h>
>  
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +unsigned long poll_threshold_ns;
> +#endif
> +
>  /*
>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
>   * no more per-task TSS's. The TSS size is kept cacheline-aligned
> @@ -313,6 +317,23 @@ static inline void play_dead(void)
>  }
>  #endif
>  
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +void arch_cpu_idle_poll(void)
> +{
> +	ktime_t start, cur, stop;
> +
> +	if (poll_threshold_ns) {
> +		start = cur = ktime_get();
> +		stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
> +		do {
> +			if (need_resched())
> +				break;
> +			cur = ktime_get();
> +		} while (ktime_before(cur, stop));
> +	}
> +}
> +#endif
> +
>  void arch_cpu_idle_enter(void)
>  {
>  	tsc_verify_tsc_adjust(false);
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 13bc08a..04cf774 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -460,6 +460,9 @@ extern __scanf(2, 0)
>  extern int sysctl_panic_on_stackoverflow;
>  
>  extern bool crash_kexec_post_notifiers;
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +extern unsigned long poll_threshold_ns;
> +#endif
>  
>  /*
>   * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 2a25a9e..e789f99 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
>  }
>  
>  /* Weak implementations for optional arch specific functions */
> +void __weak arch_cpu_idle_poll(void) { }
>  void __weak arch_cpu_idle_prepare(void) { }
>  void __weak arch_cpu_idle_enter(void) { }
>  void __weak arch_cpu_idle_exit(void) { }
> @@ -219,6 +220,8 @@ static void do_idle(void)
>  	 */
>  
>  	__current_set_polling();
> +	arch_cpu_idle_poll();
> +
>  	tick_nohz_idle_enter();
>  
>  	while (!need_resched()) {
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 4dfba1a..9174d57 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1203,6 +1203,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>  		.extra2		= &one,
>  	},
>  #endif
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +	{
> +		.procname	= "halt_poll_threshold",
> +		.data		= &poll_threshold_ns,
> +		.maxlen		= sizeof(unsigned long),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +#endif
>  	{ }
>  };
>  
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] x86/idle: add halt poll for halt idle
  2017-08-16  4:04   ` Michael S. Tsirkin
@ 2017-08-17  7:29     ` Yang Zhang
  0 siblings, 0 replies; 37+ messages in thread
From: Yang Zhang @ 2017-08-17  7:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: tglx, mingo, hpa, pbonzini, x86, corbet, tony.luck, bp, peterz,
	mchehab, akpm, krzk, jpoimboe, luto, borntraeger, thgarnie,
	rgerst, minipli, douly.fnst, nicstange, fweisbec, dvlasenk,
	bristot, yamada.masahiro, mika.westerberg, yu.c.chen, aaron.lu,
	rostedt, me, len.brown, prarit, hidehiro.kawai.ez, fengtiantian,
	pmladek, jeyu, Larry.Finger, zijun_hu, luisbg, johannes.berg,
	niklas.soderlund+renesas, zlpnobody, adobriyan, fgao, ebiederm,
	subashab, arnd, matt, mgorman, linux-kernel, linux-doc,
	linux-edac, kvm

On 2017/8/16 12:04, Michael S. Tsirkin wrote:
> On Thu, Jun 22, 2017 at 11:22:13AM +0000, root wrote:
>> From: Yang Zhang <yang.zhang.wz@gmail.com>
>>
>> This patch introduce a new mechanism to poll for a while before
>> entering idle state.
>>
>> David has a topic in KVM forum to describe the problem on current KVM VM
>> when running some message passing workload in KVM forum. Also, there
>> are some work to improve the performance in KVM, like halt polling in KVM.
>> But we still has 4 MSR wirtes and HLT vmexit when going into halt idle
>> which introduce lot of latency.
>>
>> Halt polling in KVM provide the capbility to not schedule out VCPU when
>> it is the only task in this pCPU. Unlike it, this patch will let VCPU polls
>> for a while if there is no work inside VCPU to elimiate heavy vmexit during
>> in/out idle. The potential impact is it will cost more CPU cycle since we
>> are doing polling and may impact other task which waiting on the same
>> physical CPU in host.
>
> I wonder whether you considered doing this in an idle driver.
> I have a prototype patch combining this with mwait within guest -
> I can post it if you are interested.

I am not sure mwait can solve this problem. But yes, if you have any 
prototype patch, i can do a test. Also, i am working on next version 
with better approach. I will post it ASAP.

>
>
>> Here is the data i get when running benchmark contextswitch
>> (https://github.com/tsuna/contextswitch)
>>
>> before patch:
>> 2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
>>
>> after patch:
>> 2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)
>>
>> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
>> ---
>>  Documentation/sysctl/kernel.txt | 10 ++++++++++
>>  arch/x86/kernel/process.c       | 21 +++++++++++++++++++++
>>  include/linux/kernel.h          |  3 +++
>>  kernel/sched/idle.c             |  3 +++
>>  kernel/sysctl.c                 |  9 +++++++++
>>  5 files changed, 46 insertions(+)
>>
>> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
>> index bac23c1..4e71bfe 100644
>> --- a/Documentation/sysctl/kernel.txt
>> +++ b/Documentation/sysctl/kernel.txt
>> @@ -63,6 +63,7 @@ show up in /proc/sys/kernel:
>>  - perf_event_max_stack
>>  - perf_event_max_contexts_per_stack
>>  - pid_max
>> +- poll_threshold_ns        [ X86 only ]
>>  - powersave-nap               [ PPC only ]
>>  - printk
>>  - printk_delay
>> @@ -702,6 +703,15 @@ kernel tries to allocate a number starting from this one.
>>
>>  ==============================================================
>>
>> +poll_threshold_ns: (X86 only)
>> +
>> +This parameter used to control the max wait time to poll before going
>> +into real idle state. By default, the values is 0 means don't poll.
>> +It is recommended to change the value to non-zero if running latency-bound
>> +workloads in VM.
>> +
>> +==============================================================
>> +
>>  powersave-nap: (PPC only)
>>
>>  If set, Linux-PPC will use the 'nap' mode of powersaving,
>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>> index 0bb8842..6361783 100644
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -39,6 +39,10 @@
>>  #include <asm/desc.h>
>>  #include <asm/prctl.h>
>>
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +unsigned long poll_threshold_ns;
>> +#endif
>> +
>>  /*
>>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
>>   * no more per-task TSS's. The TSS size is kept cacheline-aligned
>> @@ -313,6 +317,23 @@ static inline void play_dead(void)
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +void arch_cpu_idle_poll(void)
>> +{
>> +	ktime_t start, cur, stop;
>> +
>> +	if (poll_threshold_ns) {
>> +		start = cur = ktime_get();
>> +		stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
>> +		do {
>> +			if (need_resched())
>> +				break;
>> +			cur = ktime_get();
>> +		} while (ktime_before(cur, stop));
>> +	}
>> +}
>> +#endif
>> +
>>  void arch_cpu_idle_enter(void)
>>  {
>>  	tsc_verify_tsc_adjust(false);
>> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
>> index 13bc08a..04cf774 100644
>> --- a/include/linux/kernel.h
>> +++ b/include/linux/kernel.h
>> @@ -460,6 +460,9 @@ extern __scanf(2, 0)
>>  extern int sysctl_panic_on_stackoverflow;
>>
>>  extern bool crash_kexec_post_notifiers;
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +extern unsigned long poll_threshold_ns;
>> +#endif
>>
>>  /*
>>   * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index 2a25a9e..e789f99 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
>>  }
>>
>>  /* Weak implementations for optional arch specific functions */
>> +void __weak arch_cpu_idle_poll(void) { }
>>  void __weak arch_cpu_idle_prepare(void) { }
>>  void __weak arch_cpu_idle_enter(void) { }
>>  void __weak arch_cpu_idle_exit(void) { }
>> @@ -219,6 +220,8 @@ static void do_idle(void)
>>  	 */
>>
>>  	__current_set_polling();
>> +	arch_cpu_idle_poll();
>> +
>>  	tick_nohz_idle_enter();
>>
>>  	while (!need_resched()) {
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 4dfba1a..9174d57 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -1203,6 +1203,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>>  		.extra2		= &one,
>>  	},
>>  #endif
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +	{
>> +		.procname	= "halt_poll_threshold",
>> +		.data		= &poll_threshold_ns,
>> +		.maxlen		= sizeof(unsigned long),
>> +		.mode		= 0644,
>> +		.proc_handler	= proc_dointvec,
>> +	},
>> +#endif
>>  	{ }
>>  };
>>
>> --
>> 1.8.3.1


-- 
Yang
Alibaba Cloud Computing

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2017-08-17  7:29 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-22 11:22 [PATCH 0/2] x86/idle: add halt poll support root
2017-06-22 11:22 ` [PATCH 1/2] x86/idle: add halt poll for halt idle root
2017-06-22 14:23   ` Thomas Gleixner
2017-06-23  4:05     ` Yang Zhang
2017-08-16  4:04   ` Michael S. Tsirkin
2017-08-17  7:29     ` Yang Zhang
2017-06-22 11:22 ` [PATCH 2/2] x86/idle: use dynamic halt poll root
2017-06-22 11:51   ` Paolo Bonzini
2017-06-23  3:58     ` Yang Zhang
2017-06-27 11:22       ` Yang Zhang
2017-06-27 12:07         ` Paolo Bonzini
2017-06-27 12:23           ` Wanpeng Li
2017-06-27 12:28             ` Paolo Bonzini
2017-06-27 13:40               ` Radim Krčmář
2017-06-27 13:56                 ` Paolo Bonzini
2017-06-27 14:22                   ` Radim Krčmář
2017-06-27 14:26                     ` Paolo Bonzini
2017-07-03  9:28                     ` Yang Zhang
2017-07-03 10:06                       ` Thomas Gleixner
2017-07-04  2:19                         ` Yang Zhang
2017-07-04 14:13                       ` Radim Krčmář
2017-07-04 14:50                         ` Thomas Gleixner
2017-07-13 11:49                         ` Yang Zhang
2017-07-14  9:37                           ` Alexander Graf
2017-07-17  9:26                             ` Yang Zhang
2017-07-17  9:54                               ` Alexander Graf
2017-07-17 12:50                                 ` Yang Zhang
2017-07-04 22:28                       ` Wanpeng Li
2017-06-22 14:32   ` Thomas Gleixner
2017-06-23  4:04     ` Yang Zhang
2017-06-22 22:46   ` kbuild test robot
2017-06-22 11:32 ` [PATCH 0/2] x86/idle: add halt poll support Yang Zhang
2017-06-22 11:50 ` Wanpeng Li
2017-06-23  4:08   ` Yang Zhang
2017-06-23  4:35     ` Wanpeng Li
2017-06-23  6:49       ` Yang Zhang
2017-06-27 14:00         ` Radim Krčmář

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).