* [PATCH v2 0/4] implement vcpu preempted check
From: Pan Xinhui @ 2016-06-28 14:43 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390
  Cc: peterz, mingo, mpe, paulus, benh, paulmck, waiman.long,
	will.deacon, boqun.feng, dave, schwidefsky, Pan Xinhui

change from v1:
	a simpler definition of the default vcpu_is_preempted
	skip the machine type check on ppc, and add config. remove the dedicated macro.
	add one patch to drop the overload of rwsem_spin_on_owner and mutex_spin_on_owner.
	add more comments
	thanks to Boqun's and Peter's suggestions.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

We introduce the interface bool vcpu_is_preempted(int cpu) and use it in the spin
loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants also caused RCU stalls before we applied this patch set.
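The pattern at all three sites is roughly the sketch below. This is an
illustration of the idea only, not code from the series; lock_is_available()
and lock->owner are made-up placeholders:

	/*
	 * Sketch: give up optimistic spinning once the owner's vCPU has
	 * been scheduled out by the host, since spinning on a preempted
	 * owner only burns cycles.
	 */
	while (!lock_is_available(lock)) {
		if (need_resched() || vcpu_is_preempted(task_cpu(lock->owner)))
			break;		/* stop spinning, block instead */
		cpu_relax_lowlatency();
	}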

Pan Xinhui (4):
  kernel/sched: introduce vcpu preempted check interface
  powerpc/spinlock: support vcpu preempted check
  locking/osq: Drop the overload of osq_lock()
  kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner

 arch/powerpc/include/asm/spinlock.h | 18 ++++++++++++++++++
 include/linux/sched.h               | 12 ++++++++++++
 kernel/locking/mutex.c              | 15 +++++++++++++--
 kernel/locking/osq_lock.c           | 10 +++++++++-
 kernel/locking/rwsem-xadd.c         | 16 +++++++++++++---
 5 files changed, 65 insertions(+), 6 deletions(-)

-- 
2.4.11


* [PATCH v2 1/4] kernel/sched: introduce vcpu preempted check interface
From: Pan Xinhui @ 2016-06-28 14:43 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390
  Cc: peterz, mingo, mpe, paulus, benh, paulmck, waiman.long,
	will.deacon, boqun.feng, dave, schwidefsky, Pan Xinhui

This patch supports fixing the lock holder preemption issue.

For kernel users, we can use bool vcpu_is_preempted(int cpu) to detect
whether a vcpu is preempted or not.

The default implementation is a macro defined to false, so the compiler
can optimize it out if the arch does not support such a vcpu preempted check.
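For illustration, an architecture opts in by defining the symbol before the
generic default is seen. The helper below is hypothetical; patch 2 shows the
real powerpc version:

	/* in an arch's asm/spinlock.h, seen before the generic default */
	#define vcpu_is_preempted vcpu_is_preempted
	static inline bool vcpu_is_preempted(int cpu)
	{
		/* hypothetical arch helper, named here for the sketch only */
		return arch_cpu_is_scheduled_out(cpu);
	}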

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
---
 include/linux/sched.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada..cbe0574 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3293,6 +3293,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
 
 #endif /* CONFIG_SMP */
 
+/*
+ * In order to deal with various lock holder preemption issues, provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu)	false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
-- 
2.4.11


* [PATCH v2 2/4] powerpc/spinlock: support vcpu preempted check
From: Pan Xinhui @ 2016-06-28 14:43 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390
  Cc: peterz, mingo, mpe, paulus, benh, paulmck, waiman.long,
	will.deacon, boqun.feng, dave, schwidefsky, Pan Xinhui

This is to fix some lock holder preemption issues. Some lock
implementations do a spin loop before acquiring the lock itself. The
kernel now has an interface, bool vcpu_is_preempted(int cpu). It takes a
cpu as its parameter and returns true if that cpu is preempted, so the
kernel can break out of such spin loops based on its return value.

As the kernel already uses this interface, let's support it.

Only pSeries needs to support it. One fact is that powerNV is built into
the same kernel image as pSeries, so in principle we would need to return
false when running as powerNV. Another fact is that lppaca->yield_count
stays zero on powerNV, so we can just skip the machine type check.
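To spell out the parity trick (this restates the check added below; the
even/odd convention is how the existing powerpc spinlock code already
interprets yield_count):

	/*
	 * The hypervisor bumps yield_count when it schedules the vCPU out
	 * and again when it dispatches it, so the count is even while the
	 * vCPU runs and odd while it is scheduled out.
	 */
	u32 yc = be32_to_cpu(lppaca_of(cpu).yield_count);
	bool preempted = yc & 1;	/* odd => currently preempted */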

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/spinlock.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 523673d..3ac9fcb 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,24 @@
 #define SYNC_IO
 #endif
 
+/*
+ * This lets the kernel check whether a cpu is preempted or not.
+ * Then we can fix some lock holder preemption issues.
+ */
+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	/*
+	 * pSeries and powerNV can be built into the same kernel image. In
+	 * principle we would need to return false directly if we are
+	 * running as powerNV. However, the yield_count is always zero on
+	 * powerNV, so we can skip the machine type check.
+	 */
+	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 	return lock.slock == 0;
-- 
2.4.11


* [PATCH v2 3/4] locking/osq: Drop the overload of osq_lock()
From: Pan Xinhui @ 2016-06-28 14:43 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390
  Cc: peterz, mingo, mpe, paulus, benh, paulmck, waiman.long,
	will.deacon, boqun.feng, dave, schwidefsky, Pan Xinhui

An over-committed guest with more vCPUs than pCPUs has heavy overhead in
osq_lock().

This is because vCPU A holds the osq lock and yields out, while vCPU B
waits for the per_cpu node->locked to be set. IOW, vCPU B waits for vCPU A
to run and unlock the osq lock.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see if a
vCPU is currently running or not. So break the spin loop when it returns
true.

test case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

after patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
---
 kernel/locking/osq_lock.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..39d1385 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr)
 	return cpu_nr + 1;
 }
 
+static inline int node_cpu(struct optimistic_spin_node *node)
+{
+	return node->cpu - 1;
+}
+
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
 	int cpu_nr = encoded_cpu_val - 1;
@@ -118,8 +123,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	while (!READ_ONCE(node->locked)) {
 		/*
 		 * If we need to reschedule bail... so we can block.
+		 * Use vcpu_is_preempted to detect the lock holder preemption
+		 * issue and break. vcpu_is_preempted is a macro defined to false
+		 * if the arch does not support the vcpu preempted check.
 		 */
-		if (need_resched())
+		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
 			goto unqueue;
 
 		cpu_relax_lowlatency();
-- 
2.4.11


* [PATCH v2 4/4] kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
From: Pan Xinhui @ 2016-06-28 14:43 UTC (permalink / raw)
  To: linux-kernel, linuxppc-dev, virtualization, linux-s390
  Cc: peterz, mingo, mpe, paulus, benh, paulmck, waiman.long,
	will.deacon, boqun.feng, dave, schwidefsky, Pan Xinhui

An over-committed guest with more vCPUs than pCPUs has heavy overhead in
the two spin_on_owner variants. This is caused by the lock holder
preemption issue.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see if a
vCPU is currently running or not. So break the spin loops when it returns
true.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

after patch:
 9.99%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 5.28%  sched-messaging  [unknown]         [H] 0xc0000000000768e0
 4.27%  sched-messaging  [kernel.vmlinux]  [k] __copy_tofrom_user_power7
 3.77%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 3.24%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.02%  sched-messaging  [kernel.vmlinux]  [k] system_call
 2.69%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task

Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
---
 kernel/locking/mutex.c      | 15 +++++++++++++--
 kernel/locking/rwsem-xadd.c | 16 +++++++++++++---
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 79d2d76..ef0451b2 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -236,7 +236,13 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 		 */
 		barrier();
 
-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * Use vcpu_is_preempted to detect the lock holder preemption
+		 * issue and break. vcpu_is_preempted is a macro defined to false
+		 * if the arch does not support the vcpu preempted check.
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			ret = false;
 			break;
 		}
@@ -261,8 +267,13 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
 
 	rcu_read_lock();
 	owner = READ_ONCE(lock->owner);
+
+	/*
+	 * Because of the lock holder preemption issue, we also skip
+	 * spinning if the task is not on a cpu or its cpu is preempted.
+	 */
 	if (owner)
-		retval = owner->on_cpu;
+		retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 	rcu_read_unlock();
 	/*
 	 * if lock->owner is not set, the mutex owner may have just acquired
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 09e30c6..828ca7c 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -319,7 +319,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		goto done;
 	}
 
-	ret = owner->on_cpu;
+	/*
+	 * Because of the lock holder preemption issue, we also skip
+	 * spinning if the task is not on a cpu or its cpu is preempted.
+	 */
+	ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 done:
 	rcu_read_unlock();
 	return ret;
@@ -340,8 +344,14 @@ bool rwsem_spin_on_owner(struct rw_semaphore *sem, struct task_struct *owner)
 		 */
 		barrier();
 
-		/* abort spinning when need_resched or owner is not running */
-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * Abort spinning when need_resched, the owner is not running,
+		 * or the owner's cpu is preempted. vcpu_is_preempted is a macro
+		 * defined to false if the arch does not support the vcpu
+		 * preempted check.
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			rcu_read_unlock();
 			return false;
 		}
-- 
2.4.11


* Re: [PATCH v2 2/4] powerpc/spinlock: support vcpu preempted check
From: Wanpeng Li @ 2016-07-05  9:57 UTC (permalink / raw)
  To: Pan Xinhui
  Cc: linux-kernel, linuxppc-dev, virtualization, linux-s390,
	Peter Zijlstra, Ingo Molnar, mpe, Paul Mackerras, benh,
	Paul McKenney, Waiman Long, will.deacon, boqun.feng,
	Davidlohr Bueso, schwidefsky

Hi Xinhui,
2016-06-28 22:43 GMT+08:00 Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>:
> This is to fix some lock holder preemption issues. Some lock
> implementations do a spin loop before acquiring the lock itself. The
> kernel now has an interface, bool vcpu_is_preempted(int cpu). It takes a
> cpu as its parameter and returns true if that cpu is preempted, so the
> kernel can break out of such spin loops based on its return value.
>
> As the kernel already uses this interface, let's support it.
>
> Only pSeries needs to support it. One fact is that powerNV is built into
> the same kernel image as pSeries, so in principle we would need to return
> false when running as powerNV. Another fact is that lppaca->yield_count
> stays zero on powerNV, so we can just skip the machine type check.

Lock holder vCPU preemption can be detected by hardware pSeries or
paravirt method?

Regards,
Wanpeng Li


* Re: [PATCH v2 2/4] powerpc/spinlock: support vcpu preempted check
From: xinhui @ 2016-07-06  4:58 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: linux-kernel, linuxppc-dev, virtualization, linux-s390,
	Peter Zijlstra, Ingo Molnar, mpe, Paul Mackerras, benh,
	Paul McKenney, Waiman Long, will.deacon, boqun.feng,
	Davidlohr Bueso, schwidefsky

Hi, Wanpeng

On 2016-07-05 17:57, Wanpeng Li wrote:
> Hi Xinhui,
> 2016-06-28 22:43 GMT+08:00 Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>:
>> This is to fix some lock holder preemption issues. Some lock
>> implementations do a spin loop before acquiring the lock itself. The
>> kernel now has an interface, bool vcpu_is_preempted(int cpu). It takes a
>> cpu as its parameter and returns true if that cpu is preempted, so the
>> kernel can break out of such spin loops based on its return value.
>>
>> As the kernel already uses this interface, let's support it.
>>
>> Only pSeries needs to support it. One fact is that powerNV is built into
>> the same kernel image as pSeries, so in principle we would need to return
>> false when running as powerNV. Another fact is that lppaca->yield_count
>> stays zero on powerNV, so we can just skip the machine type check.
>
> Lock holder vCPU preemption can be detected by hardware pSeries or
> paravirt method?
>
There is one shared struct between the kernel and PowerVM/KVM, and we read the yield_count of this struct to detect whether a vcpu is running or not.
So it's easy for ppc to implement such an interface. Note that yield_count is set by PowerVM/KVM,
and only pSeries can run a guest for now. :)

I also reviewed the x86 related code; it looks like we need to add one hypercall to get such vcpu preemption info?

thanks
xinhui
> Regards,
> Wanpeng Li
>


* Re: [PATCH v2 2/4] powerpc/spinlock: support vcpu preempted check
From: Wanpeng Li @ 2016-07-06  6:46 UTC (permalink / raw)
  To: xinhui
  Cc: linux-kernel, linuxppc-dev, virtualization, linux-s390,
	Peter Zijlstra, Ingo Molnar, mpe, Paul Mackerras, benh,
	Paul McKenney, Waiman Long, will.deacon, boqun.feng,
	Davidlohr Bueso, schwidefsky, Paolo Bonzini, kvm

Cc Paolo, kvm ml
2016-07-06 12:58 GMT+08:00 xinhui <xinhui.pan@linux.vnet.ibm.com>:
> Hi, Wanpeng
>
> On 2016-07-05 17:57, Wanpeng Li wrote:
>>
>> Hi Xinhui,
>> 2016-06-28 22:43 GMT+08:00 Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>:
>>>
>>> This is to fix some lock holder preemption issues. Some lock
>>> implementations do a spin loop before acquiring the lock itself. The
>>> kernel now has an interface, bool vcpu_is_preempted(int cpu). It takes
>>> a cpu as its parameter and returns true if that cpu is preempted, so
>>> the kernel can break out of such spin loops based on its return value.
>>>
>>> As the kernel already uses this interface, let's support it.
>>>
>>> Only pSeries needs to support it. One fact is that powerNV is built into
>>> the same kernel image as pSeries, so in principle we would need to return
>>> false when running as powerNV. Another fact is that lppaca->yield_count
>>> stays zero on powerNV, so we can just skip the machine type check.
>>
>>
>> Lock holder vCPU preemption can be detected by hardware pSeries or
>> paravirt method?
>>
> There is one shared struct between the kernel and PowerVM/KVM, and we read
> the yield_count of this struct to detect whether a vcpu is running or not.
> So it's easy for ppc to implement such an interface. Note that yield_count
> is set by PowerVM/KVM,
> and only pSeries can run a guest for now. :)
>
> I also reviewed the x86 related code; it looks like we need to add one
> hypercall to get such vcpu preemption info?

There is no such stuff to record lock holder in x86 kvm, maybe we
don't need to depend on PLE handler algorithm to guess it if we can
know lock holder vCPU directly.

Regards,
Wanpeng Li


* Re: [PATCH v2 0/4] implement vcpu preempted check
From: Peter Zijlstra @ 2016-07-06  6:52 UTC (permalink / raw)
  To: Pan Xinhui
  Cc: linux-kernel, linuxppc-dev, virtualization, linux-s390, mingo,
	mpe, paulus, benh, paulmck, waiman.long, will.deacon, boqun.feng,
	dave, schwidefsky, pbonzini

On Tue, Jun 28, 2016 at 10:43:07AM -0400, Pan Xinhui wrote:
> change from v1:
> 	a simpler definition of the default vcpu_is_preempted
> 	skip the machine type check on ppc, and add config. remove the dedicated macro.
> 	add one patch to drop the overload of rwsem_spin_on_owner and mutex_spin_on_owner.
> 	add more comments
> 	thanks to Boqun's and Peter's suggestions.
> 
> This patch set aims to fix lock holder preemption issues.
> 
> test-case:
> perf record -a perf bench sched messaging -g 400 -p && perf report
> 
> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
> 
> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in the spin
> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
> These spin_on_owner variants also caused RCU stalls before we applied this patch set.
> 

Paolo, could you help out with an (x86) KVM interface for this?

Waiman, could you see if you can utilize this to get rid of the
SPIN_THRESHOLD in qspinlock_paravirt?


* Re: [PATCH v2 0/4] implement vcpu preempted check
From: Juergen Gross @ 2016-07-06  7:47 UTC (permalink / raw)
  To: Pan Xinhui
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, virtualization,
	linux-s390, mingo, mpe, paulus, benh, paulmck, waiman.long,
	will.deacon, boqun.feng, dave, schwidefsky, pbonzini,
	xen-devel-request

On 06/07/16 08:52, Peter Zijlstra wrote:
> On Tue, Jun 28, 2016 at 10:43:07AM -0400, Pan Xinhui wrote:
>> change from v1:
>> 	a simpler definition of the default vcpu_is_preempted
>> 	skip the machine type check on ppc, and add config. remove the dedicated macro.
>> 	add one patch to drop the overload of rwsem_spin_on_owner and mutex_spin_on_owner.
>> 	add more comments
>> 	thanks to Boqun's and Peter's suggestions.
>>
>> This patch set aims to fix lock holder preemption issues.
>>
>> test-case:
>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>
>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>
>> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in the spin
>> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>> These spin_on_owner variants also caused RCU stalls before we applied this patch set.
>>
> 
> Paolo, could you help out with an (x86) KVM interface for this?

Xen support of this interface should be rather easy. Could you please
Cc: xen-devel-request@lists.xenproject.org in the next version?


Juergen


* Re: [PATCH v2 2/4] powerpc/spinlock: support vcpu preempted check
From: Peter Zijlstra @ 2016-07-06  7:58 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: xinhui, linux-kernel, linuxppc-dev, virtualization, linux-s390,
	Ingo Molnar, mpe, Paul Mackerras, benh, Paul McKenney,
	Waiman Long, will.deacon, boqun.feng, Davidlohr Bueso,
	schwidefsky, Paolo Bonzini, kvm

On Wed, Jul 06, 2016 at 02:46:34PM +0800, Wanpeng Li wrote:
> > So it's easy for ppc to implement such an interface. Note that yield_count
> > is set by PowerVM/KVM,
> > and only pSeries can run a guest for now. :)
> >
> > I also reviewed the x86 related code; it looks like we need to add one
> > hypercall to get such vcpu preemption info?
> 
> There is no such stuff to record lock holder in x86 kvm, maybe we
> don't need to depend on PLE handler algorithm to guess it if we can
> know lock holder vCPU directly.

x86/kvm has vcpu->preempted to indicate if a vcpu is currently preempted
or not. I'm just not sure if that is visible to the guest or how to make
it so.
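
For reference, vcpu->preempted is maintained from the preempt notifiers in
virt/kvm/kvm_main.c, roughly like this (paraphrased from memory, check the
source):

	static void kvm_sched_out(struct preempt_notifier *pn,
				  struct task_struct *next)
	{
		struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

		/* scheduled out while still runnable => preempted */
		if (current->state == TASK_RUNNING)
			vcpu->preempted = true;
		kvm_arch_vcpu_put(vcpu);
	}

So the host side already tracks it; what is missing is exporting it into
guest-visible memory.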


* Re: [PATCH v2 0/4] implement vcpu preempted check
  2016-07-06  7:47   ` Juergen Gross
@ 2016-07-06  8:19     ` Peter Zijlstra
  2016-07-06  8:38       ` Juergen Gross
                         ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Peter Zijlstra @ 2016-07-06  8:19 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Pan Xinhui, linux-kernel, linuxppc-dev, virtualization,
	linux-s390, mingo, mpe, paulus, benh, paulmck, waiman.long,
	will.deacon, boqun.feng, dave, schwidefsky, pbonzini,
	xen-devel-request

On Wed, Jul 06, 2016 at 09:47:18AM +0200, Juergen Gross wrote:
> On 06/07/16 08:52, Peter Zijlstra wrote:

> > Paolo, could you help out with an (x86) KVM interface for this?
> 
> Xen support of this interface should be rather easy. Could you please
> Cc: xen-devel-request@lists.xenproject.org in the next version?

So meta question; aren't all you virt people looking at the regular
virtualization list? Or should we really dig out all the various
hypervisor lists and Cc them?


* Re: [PATCH v2 2/4] powerpc/spinlock: support vcpu preempted check
From: Wanpeng Li @ 2016-07-06  8:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: xinhui, linux-kernel, linuxppc-dev, virtualization, linux-s390,
	Ingo Molnar, mpe, Paul Mackerras, benh, Paul McKenney,
	Waiman Long, will.deacon, boqun.feng, Davidlohr Bueso,
	schwidefsky, Paolo Bonzini, kvm

2016-07-06 15:58 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
> On Wed, Jul 06, 2016 at 02:46:34PM +0800, Wanpeng Li wrote:
>> > So it's easy for ppc to implement such an interface. Note that yield_count
>> > is set by PowerVM/KVM,
>> > and only pSeries can run a guest for now. :)
>> >
>> > I also reviewed the x86 related code; it looks like we need to add one
>> > hypercall to get such vcpu preemption info?
>>
>> There is no such stuff to record lock holder in x86 kvm, maybe we
>> don't need to depend on PLE handler algorithm to guess it if we can
>> know lock holder vCPU directly.
>
> x86/kvm has vcpu->preempted to indicate if a vcpu is currently preempted
> or not. I'm just not sure if that is visible to the guest or how to make
> it so.

Yeah, I missed it. I can volunteer to do it if there is any idea;
ping Paolo. :)

Regards,
Wanpeng Li


* Re: [PATCH v2 0/4] implement vcpu preempted check
From: Juergen Gross @ 2016-07-06  8:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pan Xinhui, linux-kernel, linuxppc-dev, virtualization,
	linux-s390, mingo, mpe, paulus, benh, paulmck, waiman.long,
	will.deacon, boqun.feng, dave, schwidefsky, pbonzini,
	xen-devel-request

On 06/07/16 10:19, Peter Zijlstra wrote:
> On Wed, Jul 06, 2016 at 09:47:18AM +0200, Juergen Gross wrote:
>> On 06/07/16 08:52, Peter Zijlstra wrote:
> 
>>> Paolo, could you help out with an (x86) KVM interface for this?
>>
>> Xen support of this interface should be rather easy. Could you please
>> Cc: xen-devel-request@lists.xenproject.org in the next version?
> 
> So meta question; aren't all you virt people looking at the regular
> virtualization list? Or should we really dig out all the various
> hypervisor lists and Cc them?

Hmm, good question. Up to now I didn't look at the virtualization list,
just changed that. :-)

Can't speak for the other virt people.


Juergen


* Re: [PATCH v2 0/4] implement vcpu preempted check
From: xinhui @ 2016-07-06 10:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linuxppc-dev, virtualization, linux-s390, mingo,
	mpe, paulus, benh, paulmck, waiman.long, will.deacon, boqun.feng,
	dave, schwidefsky, pbonzini



On 2016-07-06 14:52, Peter Zijlstra wrote:
> On Tue, Jun 28, 2016 at 10:43:07AM -0400, Pan Xinhui wrote:
>> change from v1:
>> 	a simpler definition of the default vcpu_is_preempted
>> 	skip the machine type check on ppc, and add config. remove the dedicated macro.
>> 	add one patch to drop the overload of rwsem_spin_on_owner and mutex_spin_on_owner.
>> 	add more comments
>> 	thanks to Boqun's and Peter's suggestions.
>>
>> This patch set aims to fix lock holder preemption issues.
>>
>> test-case:
>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>
>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>   5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>   3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>   3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>   3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>   2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>
>> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in the spin
>> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>> These spin_on_owner variants also caused RCU stalls before we applied this patch set.
>>
>
> Paolo, could you help out with an (x86) KVM interface for this?
>
> Waiman, could you see if you can utilize this to get rid of the
> SPIN_THRESHOLD in qspinlock_paravirt?
>
Hmm, maybe something like below: a wait_node can go into pv_wait() earlier, as soon as the prev cpu is preempted.
But for the wait_head, as qspinlock does not record the lock holder correctly (thanks to lock stealing), the vcpu preemption check might get wrong results.

Waiman, I have used a hash table to keep the lock holder in my ppc implementation patch. I think we could do something similar in generic code? A rough sketch follows the diff below.

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 74c4a86..40560e8 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -312,7 +312,8 @@ pv_wait_early(struct pv_node *prev, int loop)
         if ((loop & PV_PREV_CHECK_MASK) != 0)
                 return false;
  
-       return READ_ONCE(prev->state) != vcpu_running;
+       return READ_ONCE(prev->state) != vcpu_running ||
+                               vcpu_is_preempted(prev->cpu);
  }
  
  /*
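
A very rough sketch of that lock-holder hash (all names here are made up for
illustration, not from any posted patch; hash collisions only cost a
spurious wait, never a missed wakeup):

	#include <linux/hash.h>

	#define HOLDER_HASH_BITS	10
	/* cpu believed to hold the lock hashed to this slot, or -1 */
	static int pv_lock_holder[1 << HOLDER_HASH_BITS];

	/* called when a cpu acquires (cpu >= 0) or releases (-1) the lock */
	static inline void pv_record_holder(struct qspinlock *lock, int cpu)
	{
		pv_lock_holder[hash_ptr(lock, HOLDER_HASH_BITS)] = cpu;
	}

	/* in the wait_head path, before burning SPIN_THRESHOLD iterations: */
	int holder = pv_lock_holder[hash_ptr(lock, HOLDER_HASH_BITS)];

	if (holder >= 0 && vcpu_is_preempted(holder))
		goto wait;	/* skip straight to pv_wait() */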


* Re: [PATCH v2 2/4] powerpc/spinlock: support vcpu preempted check
From: xinhui @ 2016-07-06 10:18 UTC (permalink / raw)
  To: Wanpeng Li, Peter Zijlstra
  Cc: linux-kernel, linuxppc-dev, virtualization, linux-s390,
	Ingo Molnar, mpe, Paul Mackerras, benh, Paul McKenney,
	Waiman Long, will.deacon, boqun.feng, Davidlohr Bueso,
	schwidefsky, Paolo Bonzini, kvm


On 2016-07-06 16:32, Wanpeng Li wrote:
> 2016-07-06 15:58 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
>> On Wed, Jul 06, 2016 at 02:46:34PM +0800, Wanpeng Li wrote:
>>>> So it's easy for ppc to implement such an interface. Note that yield_count
>>>> is set by PowerVM/KVM,
>>>> and only pSeries can run a guest for now. :)
>>>>
>>>> I also reviewed the x86 related code; it looks like we need to add one
>>>> hypercall to get such vcpu preemption info?
>>>
>>> There is no such stuff to record lock holder in x86 kvm, maybe we
>>> don't need to depend on PLE handler algorithm to guess it if we can
>>> know lock holder vCPU directly.
>>
>> x86/kvm has vcpu->preempted to indicate if a vcpu is currently preempted
>> or not. I'm just not sure if that is visible to the guest or how to make
>> it so.
>
> Yeah, I missed it. I can volunteer to do it if there is any idea;
> ping Paolo. :)
>
glad to know that. :)


> Regards,
> Wanpeng Li
>


* Re: [PATCH v2 0/4] implement vcpu preempted check
From: Paolo Bonzini @ 2016-07-06 10:44 UTC (permalink / raw)
  To: Peter Zijlstra, Pan Xinhui
  Cc: linux-s390, dave, mpe, boqun.feng, will.deacon, linux-kernel,
	waiman.long, virtualization, mingo, paulus, benh, schwidefsky,
	paulmck, linuxppc-dev



On 06/07/2016 08:52, Peter Zijlstra wrote:
> On Tue, Jun 28, 2016 at 10:43:07AM -0400, Pan Xinhui wrote:
>> change from v1:
>> 	a simpler definition of the default vcpu_is_preempted
>> 	skip the machine type check on ppc, and add config. remove the dedicated macro.
>> 	add one patch to drop the overload of rwsem_spin_on_owner and mutex_spin_on_owner.
>> 	add more comments
>> 	thanks to Boqun's and Peter's suggestions.
>>
>> This patch set aims to fix lock holder preemption issues.
>>
>> test-case:
>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>
>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>
>> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in the spin
>> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>> These spin_on_owner variants also caused RCU stalls before we applied this patch set.
>>
> 
> Paolo, could you help out with an (x86) KVM interface for this?

If it's just for spin loops, you can check if the version field in the
steal time structure has changed.
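
A minimal sketch of that idea on the guest side, assuming the existing
per-cpu steal_time mapping from arch/x86/kernel/kvm.c (the helper name is
made up):

	static DEFINE_PER_CPU(u32, last_steal_version);

	/* true if @cpu's vCPU was (re)scheduled since the previous check */
	static bool vcpu_was_preempted(int cpu)
	{
		struct kvm_steal_time *st = &per_cpu(steal_time, cpu);
		u32 v = READ_ONCE(st->version);

		if (v == per_cpu(last_steal_version, cpu))
			return false;
		per_cpu(last_steal_version, cpu) = v;
		return true;
	}

Note this only says the vCPU was scheduled out at some point since the
previous check, not that it is preempted right now; see the follow-ups below.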

Paolo

> Waiman, could you see if you can utilize this to get rid of the
> SPIN_THRESHOLD in qspinlock_paravirt?
> 


* Re: [PATCH v2 2/4] powerpc/spinlock: support vcpu preempted check
From: Balbir Singh @ 2016-07-06 10:54 UTC (permalink / raw)
  To: Pan Xinhui, linux-kernel, linuxppc-dev, virtualization, linux-s390
  Cc: dave, peterz, mpe, boqun.feng, will.deacon, waiman.long, mingo,
	paulus, benh, schwidefsky, paulmck

On Tue, 2016-06-28 at 10:43 -0400, Pan Xinhui wrote:
> This is to fix some lock holder preemption issues. Some other locks
> implementation do a spin loop before acquiring the lock itself. Currently
> kernel has an interface of bool vcpu_is_preempted(int cpu). It take the cpu
								^^ takes
> as parameter and return true if the cpu is preempted. Then kernel can break
> the spin loops upon on the retval of vcpu_is_preempted.
> 
> As kernel has used this interface, So lets support it.
> 
> Only pSeries need supoort it. And the fact is powerNV are built into same
		   ^^ support
> kernel image with pSeries. So we need return false if we are runnig as
> powerNV. The another fact is that lppaca->yiled_count keeps zero on
					  ^^ yield
> powerNV. So we can just skip the machine type.
> 
> Suggested-by: Boqun Feng <boqun.feng@gmail.com>
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/spinlock.h | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
> index 523673d..3ac9fcb 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -52,6 +52,24 @@
>  #define SYNC_IO
>  #endif
>  
> +/*
> + * This support kernel to check if one cpu is preempted or not.
> + * Then we can fix some lock holder preemption issue.
> + */
> +#ifdef CONFIG_PPC_PSERIES
> +#define vcpu_is_preempted vcpu_is_preempted
> +static inline bool vcpu_is_preempted(int cpu)
> +{
> +	/*
> +	 * pSeries and powerNV can be built into same kernel image. In
> +	 * principle we need return false directly if we are running as
> +	 * powerNV. However the yield_count is always zero on powerNV, So
> +	 * skip such machine type check

Or you could use the ppc_md interface callbacks if required, but your
solution works as well

> +	 */
> +	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
> +}
> +#endif
> +
>  static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
>  {
>  	return lock.slock == 0;


Balbir Singh.


* Re: [PATCH v2 0/4] implement vcpu preempted check
From: Peter Zijlstra @ 2016-07-06 11:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Pan Xinhui, linux-s390, dave, mpe, boqun.feng, will.deacon,
	linux-kernel, waiman.long, virtualization, mingo, paulus, benh,
	schwidefsky, paulmck, linuxppc-dev

On Wed, Jul 06, 2016 at 12:44:58PM +0200, Paolo Bonzini wrote:
> > Paolo, could you help out with an (x86) KVM interface for this?
> 
> If it's just for spin loops, you can check if the version field in the
> steal time structure has changed.

That would require remembering the old value, no?

That would work with a previous interface proposal, see:

  http://lkml.kernel.org/r/1466937715-6683-2-git-send-email-xinhui.pan@linux.vnet.ibm.com

the vcpu_get_yield_count() thing would match that I think.

However the current proposal:

  http://lkml.kernel.org/r/1467124991-13164-2-git-send-email-xinhui.pan@linux.vnet.ibm.com

dropped that in favour of only vcpu_is_preempted(), which requires being
able to tell if a (remote) vcpu is currently running or not, which iirc,
isn't possible with the steal time sequence count.


* Re: [PATCH v2 0/4] implement vcpu preempted check
From: Wanpeng Li @ 2016-07-06 12:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, Pan Xinhui, linux-s390, Davidlohr Bueso, mpe,
	boqun.feng, will.deacon, linux-kernel, Waiman Long,
	virtualization, Ingo Molnar, Paul Mackerras, benh, schwidefsky,
	Paul McKenney, linuxppc-dev, kvm

2016-07-06 18:44 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 06/07/2016 08:52, Peter Zijlstra wrote:
>> On Tue, Jun 28, 2016 at 10:43:07AM -0400, Pan Xinhui wrote:
>>> change from v1:
>>>      a simpler definition of the default vcpu_is_preempted
>>>      skip the machine type check on ppc, and add config. remove the dedicated macro.
>>>      add one patch to drop the overload of rwsem_spin_on_owner and mutex_spin_on_owner.
>>>      add more comments
>>>      thanks to Boqun's and Peter's suggestions.
>>>
>>> This patch set aims to fix lock holder preemption issues.
>>>
>>> test-case:
>>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>>
>>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>>
>>> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in the spin
>>> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>>> These spin_on_owner variants also caused RCU stalls before we applied this patch set.
>>>
>>
>> Paolo, could you help out with an (x86) KVM interface for this?
>
> If it's just for spin loops, you can check if the version field in the
> steal time structure has changed.

Steal time will not be updated until just before the next vmentry, except
on a wrmsr to MSR_KVM_STEAL_TIME. So it can't show that the vCPU is
preempted right now, right?

Regards,
Wanpeng Li


* Re: [PATCH v2 0/4] implement vcpu preempted check
From: Paolo Bonzini @ 2016-07-06 12:28 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Pan Xinhui, linux-s390, Davidlohr Bueso, mpe,
	boqun.feng, will.deacon, linux-kernel, Waiman Long,
	virtualization, Ingo Molnar, Paul Mackerras, benh, schwidefsky,
	Paul McKenney, linuxppc-dev, kvm



On 06/07/2016 14:08, Wanpeng Li wrote:
> 2016-07-06 18:44 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>>
>>
>> On 06/07/2016 08:52, Peter Zijlstra wrote:
>>> On Tue, Jun 28, 2016 at 10:43:07AM -0400, Pan Xinhui wrote:
>>>> change from v1:
>>>>      a simpler definition of the default vcpu_is_preempted
>>>>      skip the machine type check on ppc, and add config. remove the dedicated macro.
>>>>      add one patch to drop the overload of rwsem_spin_on_owner and mutex_spin_on_owner.
>>>>      add more comments
>>>>      thanks to Boqun's and Peter's suggestions.
>>>>
>>>> This patch set aims to fix lock holder preemption issues.
>>>>
>>>> test-case:
>>>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>>>
>>>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>>>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>>>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>>>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>>>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>>>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>>>
>>>> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in the spin
>>>> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>>>> These spin_on_owner variants also caused RCU stalls before we applied this patch set.
>>>
>>> Paolo, could you help out with an (x86) KVM interface for this?
>>
>> If it's just for spin loops, you can check if the version field in the
>> steal time structure has changed.
> 
> Steal time will not be updated until just before the next vmentry, except
> on a wrmsr to MSR_KVM_STEAL_TIME. So it can't show that the vCPU is
> preempted right now, right?

Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
VCPU has been scheduled out since the last time the guest reset the bit.
 The guest can use an xchg to test-and-clear it.  The bit can be
accessed at any time, independent of the version field.
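
On the guest side that proposal might look like the sketch below
(KVM_ST_FLAG_PREEMPT and the use of pad[0] are the hypothetical ABI just
described, not an existing one; steal_time is the per-cpu mapping from
arch/x86/kernel/kvm.c):

	static bool vcpu_is_preempted(int cpu)
	{
		struct kvm_steal_time *st = &per_cpu(steal_time, cpu);

		if (!(READ_ONCE(st->flags) & KVM_ST_FLAG_PREEMPT))
			return false;	/* host does not publish the bit */

		/* test-and-clear: scheduled out since the guest last cleared it? */
		return xchg(&st->pad[0], 0) != 0;
	}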

Paolo


* Re: [PATCH v2 0/4] implement vcpu preempted check
From: Paolo Bonzini @ 2016-07-06 12:44 UTC (permalink / raw)
  To: Peter Zijlstra, Juergen Gross
  Cc: linux-s390, xen-devel-request, dave, benh, Pan Xinhui,
	boqun.feng, will.deacon, linux-kernel, waiman.long,
	virtualization, mingo, paulus, mpe, schwidefsky, paulmck,
	linuxppc-dev



On 06/07/2016 10:19, Peter Zijlstra wrote:
>>> Paolo, could you help out with an (x86) KVM interface for this?
>> > 
>> > Xen support of this interface should be rather easy. Could you please
>> > Cc: xen-devel-request@lists.xenproject.org in the next version?
> So meta question; aren't all you virt people looking at the regular
> virtualization list? Or should we really dig out all the various
> hypervisor lists and Cc them?

I at least skim the subjects.

Paolo


* Re: [PATCH v2 0/4] implement vcpu preempted check
From: Wanpeng Li @ 2016-07-06 13:03 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, Pan Xinhui, linux-s390, Davidlohr Bueso, mpe,
	boqun.feng, will.deacon, linux-kernel, Waiman Long,
	virtualization, Ingo Molnar, Paul Mackerras, benh, schwidefsky,
	Paul McKenney, linuxppc-dev, kvm

2016-07-06 20:28 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 06/07/2016 14:08, Wanpeng Li wrote:
>> 2016-07-06 18:44 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>>>
>>>
>>> On 06/07/2016 08:52, Peter Zijlstra wrote:
>>>> On Tue, Jun 28, 2016 at 10:43:07AM -0400, Pan Xinhui wrote:
>>>>> change from v1:
>>>>>      a simpler definition of the default vcpu_is_preempted
>>>>>      skip the machine type check on ppc, and add config. remove the dedicated macro.
>>>>>      add one patch to drop the overload of rwsem_spin_on_owner and mutex_spin_on_owner.
>>>>>      add more comments
>>>>>      thanks to Boqun's and Peter's suggestions.
>>>>>
>>>>> This patch set aims to fix lock holder preemption issues.
>>>>>
>>>>> test-case:
>>>>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>>>>
>>>>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>>>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>>>>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>>>>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>>>>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>>>>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>>>>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>>>>
>>>>> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in the spin
>>>>> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>>>>> These spin_on_owner variants also caused RCU stalls before we applied this patch set.
>>>>
>>>> Paolo, could you help out with an (x86) KVM interface for this?
>>>
>>> If it's just for spin loops, you can check if the version field in the
>>> steal time structure has changed.
>>
>> Steal time will not be updated until just before the next vmentry, except
>> on a wrmsr to MSR_KVM_STEAL_TIME. So it can't show that the vCPU is
>> preempted right now, right?
>
> Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
> indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
> VCPU has been scheduled out since the last time the guest reset the bit.
>  The guest can use an xchg to test-and-clear it.  The bit can be
> accessed at any time, independent of the version field.

I will try to implement it tomorrow, thanks for your proposal. :)

Regards,
Wanpeng Li


* Re: [PATCH v2 0/4] implement vcpu preempted check
From: Christian Borntraeger @ 2016-07-06 16:56 UTC (permalink / raw)
  To: Peter Zijlstra, Juergen Gross
  Cc: Pan Xinhui, linux-kernel, linuxppc-dev, virtualization,
	linux-s390, mingo, mpe, paulus, benh, paulmck, waiman.long,
	will.deacon, boqun.feng, dave, schwidefsky, pbonzini,
	xen-devel-request, Martin Schwidefsky

On 07/06/2016 10:19 AM, Peter Zijlstra wrote:
> On Wed, Jul 06, 2016 at 09:47:18AM +0200, Juergen Gross wrote:
>> On 06/07/16 08:52, Peter Zijlstra wrote:
> 
>>> Paolo, could you help out with an (x86) KVM interface for this?
>>
>> Xen support of this interface should be rather easy. Could you please
>> Cc: xen-devel-request@lists.xenproject.org in the next version?
> 
> So meta question; aren't all you virt people looking at the regular
> virtualization list? Or should we really dig out all the various
> hypervisor lists and Cc them?
> 

Some of the kvm on s390 team reads this, but I would assume that the base s390 team
does not (Martin can you confirm?) as the main focus was z/VM and LPAR. So maybe adding
linux-s390@vger for generic things does make sense.


* Re: [PATCH v2 0/4] implement vcpu preempted check
From: Wanpeng Li @ 2016-07-07  8:48 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, Pan Xinhui, linux-s390, Davidlohr Bueso, mpe,
	boqun.feng, will.deacon, linux-kernel, Waiman Long,
	virtualization, Ingo Molnar, Paul Mackerras, benh, schwidefsky,
	Paul McKenney, linuxppc-dev, kvm

2016-07-06 20:28 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 06/07/2016 14:08, Wanpeng Li wrote:
>> 2016-07-06 18:44 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>>>
>>>
>>> On 06/07/2016 08:52, Peter Zijlstra wrote:
>>>> On Tue, Jun 28, 2016 at 10:43:07AM -0400, Pan Xinhui wrote:
>>>>> change from v1:
>>>>>      a simpler definition of the default vcpu_is_preempted
>>>>>      skip the machine type check on ppc, and add config. remove the dedicated macro.
>>>>>      add one patch to drop the overload of rwsem_spin_on_owner and mutex_spin_on_owner.
>>>>>      add more comments
>>>>>      thanks to Boqun's and Peter's suggestions.
>>>>>
>>>>> This patch set aims to fix lock holder preemption issues.
>>>>>
>>>>> test-case:
>>>>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>>>>
>>>>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>>>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>>>>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>>>>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>>>>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>>>>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>>>>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>>>>
>>>>> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in the spin
>>>>> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>>>>> These spin_on_owner variants also caused RCU stalls before we applied this patch set.
>>>>
>>>> Paolo, could you help out with an (x86) KVM interface for this?
>>>
>>> If it's just for spin loops, you can check if the version field in the
>>> steal time structure has changed.
>>
>> Steal time will not be updated until just before the next vmentry, except
>> on a wrmsr to MSR_KVM_STEAL_TIME. So it can't show that the vCPU is
>> preempted right now, right?
>
> Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
> indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
> VCPU has been scheduled out since the last time the guest reset the bit.
>  The guest can use an xchg to test-and-clear it.  The bit can be
> accessed at any time, independent of the version field.

If one vCPU is preempted, and the guest checks it several times before
this vCPU is scheduled in, then the first time we will get "vCPU is
preempted"; however, since the field is cleared, the second time we
will get "vCPU is running".

Do you mean we should call record_steal_time() in both kvm_sched_in()
and kvm_sched_out() to record this field? Btw, should we keep both
vcpu->preempted and kvm_steal_time's "vCPU preempted" field present
simultaneously?

Regards,
Wanpeng Li


* Re: [PATCH v2 0/4] implement vcpu preempted check
From: Peter Zijlstra @ 2016-07-07  9:42 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Paolo Bonzini, Pan Xinhui, linux-s390, Davidlohr Bueso, mpe,
	boqun.feng, will.deacon, linux-kernel, Waiman Long,
	virtualization, Ingo Molnar, Paul Mackerras, benh, schwidefsky,
	Paul McKenney, linuxppc-dev, kvm

On Thu, Jul 07, 2016 at 04:48:05PM +0800, Wanpeng Li wrote:
> 2016-07-06 20:28 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
> > Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
> > indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
> > VCPU has been scheduled out since the last time the guest reset the bit.
> >  The guest can use an xchg to test-and-clear it.  The bit can be
> > accessed at any time, independent of the version field.
> 
> If one vCPU is preempted, and the guest checks it several times before
> this vCPU is scheduled in, then the first time we will get "vCPU is
> preempted"; however, since the field is cleared, the second time we
> will get "vCPU is running".
> 
> Do you mean we should call record_steal_time() in both kvm_sched_in()
> and kvm_sched_out() to record this field? Btw, should we keep both
> vcpu->preempted and kvm_steal_time's "vCPU preempted" field present
> simultaneously?

I suspect you want something like so; except this has holes in it.

We clear KVM_ST_PAD_PREEMPT before disabling preemption and we set it
after enabling it; this means that if we get preempted in between, the
vcpu is reported as running even though it very much is not.

Fixing that requires much larger surgery.

---
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b2766723c951..117270df43b6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1997,8 +1997,29 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
 	vcpu->arch.pv_time_enabled = false;
 }
 
+static void update_steal_time_preempt(struct kvm_vcpu *vcpu)
+{
+	struct kvm_steal_time *st;
+
+	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+		return;
+
+	if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+		return;
+
+	st = &vcpu->arch.st.steal;
+
+	st->pad[KVM_ST_PAD_PREEMPT] = 1; /* we've stopped running */
+
+	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+		st, sizeof(struct kvm_steal_time));
+}
+
 static void record_steal_time(struct kvm_vcpu *vcpu)
 {
+	struct kvm_steal_time *st;
+
 	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
 		return;
 
@@ -2006,29 +2027,34 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
 		return;
 
-	if (vcpu->arch.st.steal.version & 1)
-		vcpu->arch.st.steal.version += 1;  /* first time write, random junk */
+	st = &vcpu->arch.st.steal;
+
+	if (st->version & 1) {
+		st->flags = KVM_ST_FLAG_PREEMPT;
+		st->version += 1;  /* first time write, random junk */
+	}
 
-	vcpu->arch.st.steal.version += 1;
+	st->version += 1;
 
 	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
-		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+		st, sizeof(struct kvm_steal_time));
 
 	smp_wmb();
 
-	vcpu->arch.st.steal.steal += current->sched_info.run_delay -
+	st->steal += current->sched_info.run_delay -
 		vcpu->arch.st.last_steal;
 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
+	st->pad[KVM_ST_PAD_PREEMPT] = 0; /* we're about to start running */
 
 	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
-		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+		st, sizeof(struct kvm_steal_time));
 
 	smp_wmb();
 
-	vcpu->arch.st.steal.version += 1;
+	st->version += 1;
 
 	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
-		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+		st, sizeof(struct kvm_steal_time));
 }
 
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
@@ -6693,6 +6719,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	preempt_enable();
 
+	update_steal_time_preempt(vcpu);
+
 	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
 
 	/*

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/4] implement vcpu preempted check
  2016-07-07  9:42           ` Peter Zijlstra
@ 2016-07-07 10:12             ` Wanpeng Li
  2016-07-07 10:27               ` Wanpeng Li
  2016-07-07 11:08               ` Peter Zijlstra
  2016-07-07 11:09             ` Peter Zijlstra
  2016-07-07 11:21             ` Peter Zijlstra
  2 siblings, 2 replies; 36+ messages in thread
From: Wanpeng Li @ 2016-07-07 10:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paolo Bonzini, Pan Xinhui, linux-s390, Davidlohr Bueso, mpe,
	boqun.feng, will.deacon, linux-kernel, Waiman Long,
	virtualization, Ingo Molnar, Paul Mackerras, benh, schwidefsky,
	Paul McKenney, linuxppc-dev, kvm

2016-07-07 17:42 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
> On Thu, Jul 07, 2016 at 04:48:05PM +0800, Wanpeng Li wrote:
>> 2016-07-06 20:28 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>> > Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
>> > indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
>> > VCPU has been scheduled out since the last time the guest reset the bit.
>> >  The guest can use an xchg to test-and-clear it.  The bit can be
>> > accessed at any time, independent of the version field.
>>
>> If one vCPU is preempted, and the guest checks it several times before
>> the vCPU is scheduled in again, then the first time we get "vCPU is
>> preempted"; however, since the field is cleared by that first check,
>> the second time we get "vCPU is running".
>>
>> Do you mean we should call record_steal_time() in both kvm_sched_in()
>> and kvm_sched_out() to record this field? Btw, should we keep both
>> vcpu->preempted and kvm_steal_time's "vCPU preempted" field present
>> simultaneously?
>
> I suspect you want something like so; except this has holes in it.
>
> We clear KVM_ST_PAD_PREEMPT before disabling preemption and we set it
> after enabling it; this means that if we get preempted in between, the
> vcpu is reported as running even though it very much is not.

Paolo also pointed this out to me offline yesterday: "Please change
pad[12] to "__u32 preempted; __u32 pad[11];" too, and remember to
update Documentation/virtual/kvm/msr.txt!". Btw, doing this in the
preemption notifier means the vCPU really was preempted on the host,
whereas keying off vmexit has different semantics, I think.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/4] implement vcpu preempted check
  2016-07-07 10:12             ` Wanpeng Li
@ 2016-07-07 10:27               ` Wanpeng Li
  2016-07-07 11:15                 ` Peter Zijlstra
  2016-07-07 11:08               ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: Wanpeng Li @ 2016-07-07 10:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paolo Bonzini, Pan Xinhui, linux-s390, Davidlohr Bueso, mpe,
	boqun.feng, will.deacon, linux-kernel, Waiman Long,
	virtualization, Ingo Molnar, Paul Mackerras, benh, schwidefsky,
	Paul McKenney, linuxppc-dev, kvm

2016-07-07 18:12 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-07-07 17:42 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
>> On Thu, Jul 07, 2016 at 04:48:05PM +0800, Wanpeng Li wrote:
>>> 2016-07-06 20:28 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>>> > Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
>>> > indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
>>> > VCPU has been scheduled out since the last time the guest reset the bit.
>>> >  The guest can use an xchg to test-and-clear it.  The bit can be
>>> > accessed at any time, independent of the version field.
>>>
>>> If one vCPU is preempted, and the guest checks it several times before
>>> the vCPU is scheduled in again, then the first time we get "vCPU is
>>> preempted"; however, since the field is cleared by that first check,
>>> the second time we get "vCPU is running".
>>>
>>> Do you mean we should call record_steal_time() in both kvm_sched_in()
>>> and kvm_sched_out() to record this field? Btw, should we keep both
>>> vcpu->preempted and kvm_steal_time's "vCPU preempted" field present
>>> simultaneously?
>>
>> I suspect you want something like so; except this has holes in it.
>>
>> We clear KVM_ST_PAD_PREEMPT before disabling preemption and we set it
>> after enabling it; this means that if we get preempted in between, the
>> vcpu is reported as running even though it very much is not.
>
> Paolo also pointed this out to me offline yesterday: "Please change
> pad[12] to "__u32 preempted; __u32 pad[11];" too, and remember to
> update Documentation/virtual/kvm/msr.txt!". Btw, doing this in the
> preemption notifier means the vCPU really was preempted on the host,
> whereas keying off vmexit has different semantics, I think.

In addition, I see that Xen's vcpu_runstate_info::state is updated
during schedule, so I think I can do this similarly through the kvm
preemption notifier. IIUC, the Xen hypervisor has a
VCPUOP_get_runstate_info hypercall implementation, so the desired
interface can be implemented if a hypercall callsite is added in domU.
I can add a hypercall to kvm similarly.
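
Just to make that concrete, a hypothetical sketch of the hypercall
route (KVM_HC_VCPU_PREEMPTED is made up here; only the
kvm_hypercall1() helper exists today):

/*
 * Invented hypercall variant mirroring Xen's
 * VCPUOP_get_runstate_info. KVM_HC_VCPU_PREEMPTED does not exist;
 * note each check would also cost a vmexit, unlike a shared page.
 */
static bool kvm_vcpu_is_preempted(int cpu)
{
	return !!kvm_hypercall1(KVM_HC_VCPU_PREEMPTED, cpu);
}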

Paolo, thoughts?

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/4] implement vcpu preempted check
  2016-07-07 10:12             ` Wanpeng Li
  2016-07-07 10:27               ` Wanpeng Li
@ 2016-07-07 11:08               ` Peter Zijlstra
  1 sibling, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2016-07-07 11:08 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Paolo Bonzini, Pan Xinhui, linux-s390, Davidlohr Bueso, mpe,
	boqun.feng, will.deacon, linux-kernel, Waiman Long,
	virtualization, Ingo Molnar, Paul Mackerras, benh, schwidefsky,
	Paul McKenney, linuxppc-dev, kvm

On Thu, Jul 07, 2016 at 06:12:51PM +0800, Wanpeng Li wrote:

> Btw, do this in preemption
> notifier means that the vCPU is real preempted on host, however,
> depends on vmexit is different semantic I think.

Not sure; suppose the vcpu is about to reenter, e.g. we're in
vcpu_enter_guest() but before the preempt_disable(), and the thread
gets preempted. Are we then not preempted? The vcpu might still very
much be in the running state but had to service a vmexit due to an MSR
or IO port access or whatnot.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/4] implement vcpu preempted check
  2016-07-07  9:42           ` Peter Zijlstra
  2016-07-07 10:12             ` Wanpeng Li
@ 2016-07-07 11:09             ` Peter Zijlstra
  2016-07-07 11:21             ` Peter Zijlstra
  2 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2016-07-07 11:09 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Paolo Bonzini, Pan Xinhui, linux-s390, Davidlohr Bueso, mpe,
	boqun.feng, will.deacon, linux-kernel, Waiman Long,
	virtualization, Ingo Molnar, Paul Mackerras, benh, schwidefsky,
	Paul McKenney, linuxppc-dev, kvm

On Thu, Jul 07, 2016 at 11:42:15AM +0200, Peter Zijlstra wrote:
> I suspect you want something like so; except this has holes in it.
>
> We clear KVM_ST_PAD_PREEMPT before disabling preemption and we set it
> after enabling it; this means that if we get preempted in between, the
> vcpu is reported as running even though it very much is not.
> 
> Fixing that requires much larger surgery.

Note that this same hole is already a 'problem' for steal time
accounting. The thread can accrue further delays (iow steal time) after
we've called record_steal_time(). These delays will go unaccounted.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/4] implement vcpu preempted check
  2016-07-07 10:27               ` Wanpeng Li
@ 2016-07-07 11:15                 ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2016-07-07 11:15 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Paolo Bonzini, Pan Xinhui, linux-s390, Davidlohr Bueso, mpe,
	boqun.feng, will.deacon, linux-kernel, Waiman Long,
	virtualization, Ingo Molnar, Paul Mackerras, benh, schwidefsky,
	Paul McKenney, linuxppc-dev, kvm

On Thu, Jul 07, 2016 at 06:27:26PM +0800, Wanpeng Li wrote:
> In addition, I see that Xen's vcpu_runstate_info::state is updated
> during schedule, so I think I can do this similarly through the kvm
> preemption notifier. IIUC, the Xen hypervisor has a
> VCPUOP_get_runstate_info hypercall implementation, so the desired
> interface can be implemented if a hypercall callsite is added in domU.
> I can add a hypercall to kvm similarly.

So I suspect Xen has the page it's writing to pinned in memory, so
that a write to it is guaranteed not to fault.

Otherwise I cannot see this working.

That is part of the larger surgery required for KVM steal time to get
'fixed'. Currently that steal time stuff uses kvm_write_guest_cached()
which appears to be able to fault in guest pages.

Or I'm not reading this stuff right; which is entirely possible.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/4] implement vcpu preempted check
  2016-07-07  9:42           ` Peter Zijlstra
  2016-07-07 10:12             ` Wanpeng Li
  2016-07-07 11:09             ` Peter Zijlstra
@ 2016-07-07 11:21             ` Peter Zijlstra
  2 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2016-07-07 11:21 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Paolo Bonzini, Pan Xinhui, linux-s390, Davidlohr Bueso, mpe,
	boqun.feng, will.deacon, linux-kernel, Waiman Long,
	virtualization, Ingo Molnar, Paul Mackerras, benh, schwidefsky,
	Paul McKenney, linuxppc-dev, kvm

On Thu, Jul 07, 2016 at 11:42:15AM +0200, Peter Zijlstra wrote:
> +static void update_steal_time_preempt(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_steal_time *st;
> +
> +	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
> +		return;
> +
> +	if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
> +		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
> +		return;
> +
> +	st = &vcpu->arch.st.steal;
> +
> +	st->pad[KVM_ST_PAD_PREEMPT] = 1; /* we've stopped running */

So maybe have this be:

	... = kvm_vcpu_running();

That avoids marking the vcpu preempted when we do hlt and such.

> +	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
> +		st, sizeof(struct kvm_steal_time));
> +}

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/4] implement vcpu preempted check
  2016-07-06  6:52 ` [PATCH v2 0/4] implement vcpu preempted check Peter Zijlstra
                     ` (2 preceding siblings ...)
  2016-07-06 10:44   ` Paolo Bonzini
@ 2016-07-11 15:10   ` Waiman Long
  2016-07-12  4:16     ` Juergen Gross
  3 siblings, 1 reply; 36+ messages in thread
From: Waiman Long @ 2016-07-11 15:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pan Xinhui, linux-kernel, linuxppc-dev, virtualization,
	linux-s390, mingo, mpe, paulus, benh, paulmck, will.deacon,
	boqun.feng, dave, schwidefsky, pbonzini

On 07/06/2016 02:52 AM, Peter Zijlstra wrote:
> On Tue, Jun 28, 2016 at 10:43:07AM -0400, Pan Xinhui wrote:
>> changes from v1:
>> 	a simpler definition of the default vcpu_is_preempted
>> 	skip machine type check on ppc, and add config. remove dedicated macro.
>> 	add one patch to drop the overload of rwsem_spin_on_owner and mutex_spin_on_owner.
>> 	add more comments
>> 	thanks to Boqun's and Peter's suggestions.
>>
>> This patch set aims to fix lock holder preemption issues.
>>
>> test-case:
>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>
>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>   5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>   3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>   3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>   3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>   2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>
>> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in some spin
>> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>> These spin_on_owner variants also caused RCU stalls before we applied this patch set
>>
> Paolo, could you help out with an (x86) KVM interface for this?
>
> Waiman, could you see if you can utilize this to get rid of the
> SPIN_THRESHOLD in qspinlock_paravirt?

That API is certainly useful for making the paravirt spinlock perform 
better. However, I am not sure we can completely get rid of the 
SPIN_THRESHOLD at this point. It is not just kvm; the Xen code needs 
to be modified as well.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/4] implement vcpu preempted check
  2016-07-11 15:10   ` Waiman Long
@ 2016-07-12  4:16     ` Juergen Gross
  2016-07-12 18:16       ` Waiman Long
  0 siblings, 1 reply; 36+ messages in thread
From: Juergen Gross @ 2016-07-12  4:16 UTC (permalink / raw)
  To: Waiman Long, Peter Zijlstra
  Cc: linux-s390, dave, benh, Pan Xinhui, boqun.feng, will.deacon,
	linux-kernel, virtualization, mingo, paulus, mpe, schwidefsky,
	pbonzini, paulmck, linuxppc-dev

On 11/07/16 17:10, Waiman Long wrote:
> On 07/06/2016 02:52 AM, Peter Zijlstra wrote:
>> On Tue, Jun 28, 2016 at 10:43:07AM -0400, Pan Xinhui wrote:
>>> changes from v1:
>>>     a simpler definition of the default vcpu_is_preempted
>>>     skip machine type check on ppc, and add config. remove dedicated
>>> macro.
>>>     add one patch to drop the overload of rwsem_spin_on_owner and
>>> mutex_spin_on_owner.
>>>     add more comments
>>>     thanks to Boqun's and Peter's suggestions.
>>>
>>> This patch set aims to fix lock holder preemption issues.
>>>
>>> test-case:
>>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>>
>>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>>   5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>>   3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>>   3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>>   3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>>   2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>>
>>> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in
>>> some spin
>>> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>>> These spin_on_owner variants also caused RCU stalls before we applied
>>> this patch set
>>>
>> Paolo, could you help out with an (x86) KVM interface for this?
>>
>> Waiman, could you see if you can utilize this to get rid of the
>> SPIN_THRESHOLD in qspinlock_paravirt?
> 
> That API is certainly useful for making the paravirt spinlock perform
> better. However, I am not sure we can completely get rid of the
> SPIN_THRESHOLD at this point. It is not just kvm; the Xen code needs
> to be modified as well.

This should be rather easy. The relevant information is included in the
runstate data mapped into kernel memory. I can provide a patch for Xen
if needed.
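
Roughly along these lines (untested sketch; it assumes the per-cpu
runstate area that arch/x86/xen/time.c already registers is made
visible outside that file):

/*
 * Untested sketch. RUNSTATE_runnable means the vCPU wants to run but
 * the hypervisor has scheduled it out, i.e. it was preempted.
 */
#include <xen/interface/vcpu.h>

DECLARE_PER_CPU(struct vcpu_runstate_info, xen_runstate);

static bool xen_vcpu_is_preempted(int cpu)
{
	return per_cpu(xen_runstate, cpu).state == RUNSTATE_runnable;
}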


Juergen

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/4] implement vcpu preempted check
  2016-07-12  4:16     ` Juergen Gross
@ 2016-07-12 18:16       ` Waiman Long
  0 siblings, 0 replies; 36+ messages in thread
From: Waiman Long @ 2016-07-12 18:16 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Peter Zijlstra, linux-s390, dave, benh, Pan Xinhui, boqun.feng,
	will.deacon, linux-kernel, virtualization, mingo, paulus, mpe,
	schwidefsky, pbonzini, paulmck, linuxppc-dev

On 07/12/2016 12:16 AM, Juergen Gross wrote:
> On 11/07/16 17:10, Waiman Long wrote:
>> On 07/06/2016 02:52 AM, Peter Zijlstra wrote:
>>> On Tue, Jun 28, 2016 at 10:43:07AM -0400, Pan Xinhui wrote:
>>>> changes from v1:
>>>>      a simpler definition of the default vcpu_is_preempted
>>>>      skip machine type check on ppc, and add config. remove dedicated
>>>> macro.
>>>>      add one patch to drop the overload of rwsem_spin_on_owner and
>>>> mutex_spin_on_owner.
>>>>      add more comments
>>>>      thanks to Boqun's and Peter's suggestions.
>>>>
>>>> This patch set aims to fix lock holder preemption issues.
>>>>
>>>> test-case:
>>>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>>>
>>>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>>>    5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>>>    3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>>>    3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>>>    3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>>>    2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>>>
>>>> We introduce the interface bool vcpu_is_preempted(int cpu) and use it in
>>>> some spin
>>>> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>>>> These spin_on_owner variants also caused RCU stalls before we applied
>>>> this patch set
>>>>
>>> Paolo, could you help out with an (x86) KVM interface for this?
>>>
>>> Waiman, could you see if you can utilize this to get rid of the
>>> SPIN_THRESHOLD in qspinlock_paravirt?
>> That API is certainly useful for making the paravirt spinlock perform
>> better. However, I am not sure we can completely get rid of the
>> SPIN_THRESHOLD at this point. It is not just kvm; the Xen code needs
>> to be modified as well.
> This should be rather easy. The relevant information is included in the
> runstate data mapped into kernel memory. I can provide a patch for Xen
> if needed.
>
>
> Juergen

Thanks for the offer. We will wait until Xinhui's patch comes through 
before working on the next step.

As for eliminating the SPIN_THRESHOLD, the queue head may not always 
have the right CPU number for the lock holder, so I don't think we can 
eliminate the threshold for queue-head spinning. I think we can 
eliminate the SPIN_THRESHOLD spinning for the other queue node vCPUs, 
though.
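
Something like this sketch for the queue-node case (paraphrased, not
the actual qspinlock_paravirt.h code; it also omits the vcpu_halted
state handshake the real code would need):

/*
 * Sketch only: a queue node knows its predecessor's CPU, so it can
 * spin until the lock propagates or that vCPU is preempted, with no
 * fixed SPIN_THRESHOLD.
 */
static void pv_wait_node_sketch(struct mcs_spinlock *node, int prev_cpu)
{
	struct pv_node *pn = (struct pv_node *)node;

	while (!READ_ONCE(node->locked)) {
		if (vcpu_is_preempted(prev_cpu)) {
			/* don't burn cycles behind a preempted vCPU */
			pv_wait(&pn->state, vcpu_halted);
			continue;	/* woken up, re-check the lock */
		}
		cpu_relax();
	}
}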

Cheers,
Longman

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/4] powerpc/spinlock: support vcpu preempted check
  2016-07-06 10:54   ` Balbir Singh
@ 2016-07-15 15:35     ` Pan Xinhui
  0 siblings, 0 replies; 36+ messages in thread
From: Pan Xinhui @ 2016-07-15 15:35 UTC (permalink / raw)
  To: Balbir Singh, Pan Xinhui, linux-kernel, linuxppc-dev,
	virtualization, linux-s390
  Cc: dave, peterz, mpe, boqun.feng, will.deacon, waiman.long, mingo,
	paulus, benh, schwidefsky, paulmck

Hi, Balbir
	Sorry for the late response, I missed reading your mail.

On 16/7/6 18:54, Balbir Singh wrote:
> On Tue, 2016-06-28 at 10:43 -0400, Pan Xinhui wrote:
>> This is to fix some lock holder preemption issues. Some other locks
>> implementation do a spin loop before acquiring the lock itself. Currently
>> kernel has an interface of bool vcpu_is_preempted(int cpu). It take the cpu
> 								^^ takes
>> as a parameter and returns true if the cpu is preempted. Then the kernel
>> can break the spin loops based on the retval of vcpu_is_preempted.
>>
>> As the kernel already uses this interface, let's support it.
>>
>> Only pSeries need supoort it. And the fact is powerNV are built into same
> 		   ^^ support
>> kernel image with pSeries. So we need return false if we are runnig as
>> powerNV. The another fact is that lppaca->yiled_count keeps zero on
> 					  ^^ yield
>> powerNV. So we can just skip the machine type.
>>

Blame me, I do need to avoid such typos.
Thanks for pointing them out.

>> Suggested-by: Boqun Feng <boqun.feng@gmail.com>
>> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/include/asm/spinlock.h | 18 ++++++++++++++++++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
>> index 523673d..3ac9fcb 100644
>> --- a/arch/powerpc/include/asm/spinlock.h
>> +++ b/arch/powerpc/include/asm/spinlock.h
>> @@ -52,6 +52,24 @@
>>  #define SYNC_IO
>>  #endif
>>
>> +/*
>> + * This lets the kernel check whether one cpu is preempted or not,
>> + * so we can fix some lock holder preemption issues.
>> + */
>> +#ifdef CONFIG_PPC_PSERIES
>> +#define vcpu_is_preempted vcpu_is_preempted
>> +static inline bool vcpu_is_preempted(int cpu)
>> +{
>> +	/*
>> +	 * pSeries and powerNV can be built into the same kernel image. In
>> +	 * principle we should return false directly if we are running as
>> +	 * powerNV. However, the yield_count is always zero on powerNV, so
>> +	 * we can skip the machine type check
>
> Or you could use the ppc_md interface callbacks if required, but your
> solution works as well
>

Thanks, so I can keep my code as is.

thanks
xinhui

>> +	 */
>> +	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
>> +}
>> +#endif
>> +
>>  static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
>>  {
>>  	return lock.slock == 0;
>
>
> Balbir Singh.
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread

Thread overview: 36+ messages
2016-06-28 14:43 [PATCH v2 0/4] implement vcpu preempted check Pan Xinhui
2016-06-28 14:43 ` [PATCH v2 1/4] kernel/sched: introduce vcpu preempted check interface Pan Xinhui
2016-06-28 14:43 ` [PATCH v2 2/4] powerpc/spinlock: support vcpu preempted check Pan Xinhui
2016-07-05  9:57   ` Wanpeng Li
2016-07-06  4:58     ` xinhui
2016-07-06  6:46       ` Wanpeng Li
2016-07-06  7:58         ` Peter Zijlstra
2016-07-06  8:32           ` Wanpeng Li
2016-07-06 10:18             ` xinhui
2016-07-06 10:54   ` Balbir Singh
2016-07-15 15:35     ` Pan Xinhui
2016-06-28 14:43 ` [PATCH v2 3/4] locking/osq: Drop the overload of osq_lock() Pan Xinhui
2016-06-28 14:43 ` [PATCH v2 4/4] kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner Pan Xinhui
2016-07-06  6:52 ` [PATCH v2 0/4] implement vcpu preempted check Peter Zijlstra
2016-07-06  7:47   ` Juergen Gross
2016-07-06  8:19     ` Peter Zijlstra
2016-07-06  8:38       ` Juergen Gross
2016-07-06 12:44       ` Paolo Bonzini
2016-07-06 16:56       ` Christian Borntraeger
2016-07-06 10:05   ` xinhui
2016-07-06 10:44   ` Paolo Bonzini
2016-07-06 11:59     ` Peter Zijlstra
2016-07-06 12:08     ` Wanpeng Li
2016-07-06 12:28       ` Paolo Bonzini
2016-07-06 13:03         ` Wanpeng Li
2016-07-07  8:48         ` Wanpeng Li
2016-07-07  9:42           ` Peter Zijlstra
2016-07-07 10:12             ` Wanpeng Li
2016-07-07 10:27               ` Wanpeng Li
2016-07-07 11:15                 ` Peter Zijlstra
2016-07-07 11:08               ` Peter Zijlstra
2016-07-07 11:09             ` Peter Zijlstra
2016-07-07 11:21             ` Peter Zijlstra
2016-07-11 15:10   ` Waiman Long
2016-07-12  4:16     ` Juergen Gross
2016-07-12 18:16       ` Waiman Long
