linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 17:42 [PATCH] locking/osq: Drop the overload of osq lock Pan Xinhui
@ 2016-06-25 14:24 ` Peter Zijlstra
  2016-06-25 15:21   ` Boqun Feng
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2016-06-25 14:24 UTC (permalink / raw)
  To: Pan Xinhui; +Cc: linux-kernel, mingo, dave, will.deacon, Waiman.Long, benh

On Sat, Jun 25, 2016 at 01:42:03PM -0400, Pan Xinhui wrote:
> An over-committed guest with more vCPUs than pCPUs has a heavy overload
> in osq_lock().
> 
> This is because vCPU A hold the osq lock and yield out, vCPU B wait
> per_cpu node->locked to be set. IOW, vCPU B wait vCPU A to run and
> unlock the osq lock. Even there is need_resched(), it did not help on
> such scenario.
> 
> To fix such bad issue, add a threshold in one while-loop of osq_lock().
> The value of threshold is somehow equal to SPIN_THRESHOLD.

Blergh, virt ...

So yes, lock holder preemption sucks. You would also want to limit the
immediate spin on owner.

Also; I really hate these random number spin-loop thresholds.

Is it at all possible to get feedback from your LPAR stuff that the vcpu
was preempted? Because at that point we can add do something like:


	int vpc = vcpu_preempt_count();

	...

	for (;;) {

		/* the big spin loop */

		if (need_resched() || vpc != vcpu_preempt_count())
			/* bail */

	}


With a default implementation like:

static inline int vcpu_preempt_count(void)
{
	return 0;
}

So the compiler can make it all go away.


But on virt muck it would stop spinning the moment the vcpu gets
preempted, which is the right moment I'm thinking.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 14:24 ` Peter Zijlstra
@ 2016-06-25 15:21   ` Boqun Feng
  2016-06-25 16:09     ` Peter Zijlstra
  2016-06-25 16:15     ` Peter Zijlstra
  0 siblings, 2 replies; 31+ messages in thread
From: Boqun Feng @ 2016-06-25 15:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pan Xinhui, linux-kernel, mingo, dave, will.deacon, Waiman.Long, benh

[-- Attachment #1: Type: text/plain, Size: 1941 bytes --]

On Sat, Jun 25, 2016 at 04:24:47PM +0200, Peter Zijlstra wrote:
> On Sat, Jun 25, 2016 at 01:42:03PM -0400, Pan Xinhui wrote:
> > An over-committed guest with more vCPUs than pCPUs has a heavy overload
> > in osq_lock().
> > 
> > This is because vCPU A hold the osq lock and yield out, vCPU B wait
> > per_cpu node->locked to be set. IOW, vCPU B wait vCPU A to run and
> > unlock the osq lock. Even there is need_resched(), it did not help on
> > such scenario.
> > 
> > To fix such bad issue, add a threshold in one while-loop of osq_lock().
> > The value of threshold is somehow equal to SPIN_THRESHOLD.
> 
> Blergh, virt ...
> 
> So yes, lock holder preemption sucks. You would also want to limit the
> immediate spin on owner.
> 
> Also; I really hate these random number spin-loop thresholds.
> 
> Is it at all possible to get feedback from your LPAR stuff that the vcpu
> was preempted? Because at that point we can add do something like:
> 

Good point!

> 
> 	int vpc = vcpu_preempt_count();
> 
> 	...
> 
> 	for (;;) {
> 
> 		/* the big spin loop */
> 
> 		if (need_resched() || vpc != vcpu_preempt_count())

So on PPC, we have lppaca::yield_count to detect when an vcpu is
preempted, if the yield_count is even, the vcpu is running, otherwise it
is preempted(__spin_yield() is a user of this).

Therefore it makes more sense we

		if (need_resched() || vcpu_is_preempted(old))

here, and implement vcpu_is_preempted() on PPC as

bool vcpu_is_preempted(int cpu)
{
	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1)
}

Thoughts?

Regards,
Boqun

> 			/* bail */
> 
> 	}
> 
> 
> With a default implementation like:
> 
> static inline int vcpu_preempt_count(void)
> {
> 	return 0;
> }
> 
> So the compiler can make it all go away.
> 
> 
> But on virt muck it would stop spinning the moment the vcpu gets
> preempted, which is the right moment I'm thinking.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 15:21   ` Boqun Feng
@ 2016-06-25 16:09     ` Peter Zijlstra
  2016-06-25 16:13       ` Peter Zijlstra
  2016-06-25 16:28       ` Boqun Feng
  2016-06-25 16:15     ` Peter Zijlstra
  1 sibling, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-06-25 16:09 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Pan Xinhui, linux-kernel, mingo, dave, will.deacon, Waiman.Long, benh

On Sat, Jun 25, 2016 at 11:21:30PM +0800, Boqun Feng wrote:
> > 
> > 	int vpc = vcpu_preempt_count();
> > 
> > 	...
> > 
> > 	for (;;) {
> > 
> > 		/* the big spin loop */
> > 
> > 		if (need_resched() || vpc != vcpu_preempt_count())
> 
> So on PPC, we have lppaca::yield_count to detect when an vcpu is

Which sounds like just the value we want.. And I suspect that on x86 KVM
and Xen have similar numbers stashed away someplace.

> preempted, if the yield_count is even, the vcpu is running, otherwise it
> is preempted(__spin_yield() is a user of this).
> 
> Therefore it makes more sense we
> 
> 		if (need_resched() || vcpu_is_preempted(old))
> 
> here, and implement vcpu_is_preempted() on PPC as
> 
> bool vcpu_is_preempted(int cpu)
> {
> 	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1)
> }
> 
> Thoughts?

That works here, but it would not work for the need_resched() in
mutex_spin_on_owner() and mutex_optimistic_spin() which need equal
treatment.

Because those too we want to limit.

The count thing, while a little more cumbersome, is more widely
applicable than just the one OSQ case where we happen to have a cpu
number.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 16:09     ` Peter Zijlstra
@ 2016-06-25 16:13       ` Peter Zijlstra
  2016-06-25 17:27         ` panxinhui
  2016-06-25 16:28       ` Boqun Feng
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2016-06-25 16:13 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Pan Xinhui, linux-kernel, mingo, dave, will.deacon, Waiman.Long, benh

On Sat, Jun 25, 2016 at 06:09:22PM +0200, Peter Zijlstra wrote:
> That works here, but it would not work for the need_resched() in
> mutex_spin_on_owner() and mutex_optimistic_spin() which need equal
> treatment.
> 
> Because those too we want to limit.
> 
> The count thing, while a little more cumbersome, is more widely
> applicable than just the one OSQ case where we happen to have a cpu
> number.

Although I suppose that mutex_spin_on_owner() (and with that the rsem
variant) could use task_cpu(lock->owner) once we've established that the
owner pointer is still valid.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 15:21   ` Boqun Feng
  2016-06-25 16:09     ` Peter Zijlstra
@ 2016-06-25 16:15     ` Peter Zijlstra
  2016-06-25 16:45       ` Boqun Feng
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2016-06-25 16:15 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Pan Xinhui, linux-kernel, mingo, dave, will.deacon, Waiman.Long, benh

On Sat, Jun 25, 2016 at 11:21:30PM +0800, Boqun Feng wrote:
> So on PPC, we have lppaca::yield_count to detect when an vcpu is
> preempted, if the yield_count is even, the vcpu is running, otherwise it
> is preempted(__spin_yield() is a user of this).
> 
> Therefore it makes more sense we
> 
> 		if (need_resched() || vcpu_is_preempted(old))
> 
> here, and implement vcpu_is_preempted() on PPC as
> 
> bool vcpu_is_preempted(int cpu)
> {
> 	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1)
> }
> 
> Thoughts?

Would that not have issues where the owner cpu is kept running but the
spinner (ie. _this_ vcpu) gets preempted? I would think that in that
case we too want to stop spinning.

Although, if all vcpus are scheduled equal, it might not matter on
average.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 16:09     ` Peter Zijlstra
  2016-06-25 16:13       ` Peter Zijlstra
@ 2016-06-25 16:28       ` Boqun Feng
  2016-06-25 18:28         ` Peter Zijlstra
  1 sibling, 1 reply; 31+ messages in thread
From: Boqun Feng @ 2016-06-25 16:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pan Xinhui, linux-kernel, mingo, dave, will.deacon, Waiman.Long, benh

[-- Attachment #1: Type: text/plain, Size: 1642 bytes --]

On Sat, Jun 25, 2016 at 06:09:22PM +0200, Peter Zijlstra wrote:
> On Sat, Jun 25, 2016 at 11:21:30PM +0800, Boqun Feng wrote:
> > > 
> > > 	int vpc = vcpu_preempt_count();
> > > 
> > > 	...
> > > 
> > > 	for (;;) {
> > > 
> > > 		/* the big spin loop */
> > > 
> > > 		if (need_resched() || vpc != vcpu_preempt_count())
> > 
> > So on PPC, we have lppaca::yield_count to detect when an vcpu is
> 
> Which sounds like just the value we want.. And I suspect that on x86 KVM
> and Xen have similar numbers stashed away someplace.
> 
> > preempted, if the yield_count is even, the vcpu is running, otherwise it
> > is preempted(__spin_yield() is a user of this).
> > 
> > Therefore it makes more sense we
> > 
> > 		if (need_resched() || vcpu_is_preempted(old))
> > 
> > here, and implement vcpu_is_preempted() on PPC as
> > 
> > bool vcpu_is_preempted(int cpu)
> > {
> > 	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1)
> > }
> > 
> > Thoughts?
> 
> That works here, but it would not work for the need_resched() in
> mutex_spin_on_owner() and mutex_optimistic_spin() which need equal
> treatment.
> 
> Because those too we want to limit.
> 
> The count thing, while a little more cumbersome, is more widely
> applicable than just the one OSQ case where we happen to have a cpu
> number.
> 

But if we don't have a cpu number, which vcpu's preemption are we
trying to detect? I think the logic here is that if _this_ vcpu sees the
_owner_ vcpu is preempted, it should just stop spinning. Therefore, we
need to know the owner cpu number.

Am I missing something here?

Regards,
Boqun


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 16:15     ` Peter Zijlstra
@ 2016-06-25 16:45       ` Boqun Feng
  2016-06-25 17:27         ` panxinhui
  0 siblings, 1 reply; 31+ messages in thread
From: Boqun Feng @ 2016-06-25 16:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Pan Xinhui, linux-kernel, mingo, dave, will.deacon, Waiman.Long, benh

[-- Attachment #1: Type: text/plain, Size: 1419 bytes --]

On Sat, Jun 25, 2016 at 06:15:40PM +0200, Peter Zijlstra wrote:
> On Sat, Jun 25, 2016 at 11:21:30PM +0800, Boqun Feng wrote:
> > So on PPC, we have lppaca::yield_count to detect when an vcpu is
> > preempted, if the yield_count is even, the vcpu is running, otherwise it
> > is preempted(__spin_yield() is a user of this).
> > 
> > Therefore it makes more sense we
> > 
> > 		if (need_resched() || vcpu_is_preempted(old))
> > 
> > here, and implement vcpu_is_preempted() on PPC as
> > 
> > bool vcpu_is_preempted(int cpu)
> > {
> > 	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1)
> > }
> > 
> > Thoughts?
> 
> Would that not have issues where the owner cpu is kept running but the
> spinner (ie. _this_ vcpu) gets preempted? I would think that in that
> case we too want to stop spinning.
> 

I don't think we want(or need) to stop the spinning of _this_ vcpu in
that case? Because it has already been preempted, when it gets back to
run, the owner may still be running and haven't set ->locked to 1 yet,
which means spinning on this vcpu is still worthwhile.

I think the proper logic here is that in the optimistic spin queue, if
any one found its predecessor's vcpu was preempted, it should stop
spinning, because it's very likely that it would not see ->locked
becoming 1 in a short time.

> Although, if all vcpus are scheduled equal, it might not matter on
> average.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 16:13       ` Peter Zijlstra
@ 2016-06-25 17:27         ` panxinhui
  2016-06-25 19:12           ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: panxinhui @ 2016-06-25 17:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: panxinhui, Boqun Feng, Pan Xinhui, linux-kernel, mingo, dave,
	will.deacon, Waiman.Long, benh


> 在 2016年6月26日,00:13,Peter Zijlstra <peterz@infradead.org> 写道:
> 
> On Sat, Jun 25, 2016 at 06:09:22PM +0200, Peter Zijlstra wrote:
>> That works here, but it would not work for the need_resched() in
>> mutex_spin_on_owner() and mutex_optimistic_spin() which need equal
>> treatment.
>> 
>> Because those too we want to limit.
>> 
>> The count thing, while a little more cumbersome, is more widely
>> applicable than just the one OSQ case where we happen to have a cpu
>> number.
> 
> Although I suppose that mutex_spin_on_owner() (and with that the rsem
> variant) could use task_cpu(lock->owner) once we've established that the
> owner pointer is still valid.
> 
> 
yes, What I am going to fix next is these XXX_spin_on_owner, including  mutex_spin_on_owner, rwsem_spin_on_owner ….

by the way I still think mutex_unlock has a big overload too. 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 16:45       ` Boqun Feng
@ 2016-06-25 17:27         ` panxinhui
  2016-06-25 19:20           ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: panxinhui @ 2016-06-25 17:27 UTC (permalink / raw)
  To: Boqun Feng
  Cc: panxinhui, Peter Zijlstra, Pan Xinhui, linux-kernel, mingo, dave,
	will.deacon, Waiman.Long, benh


> 在 2016年6月26日,00:45,Boqun Feng <boqun.feng@gmail.com> 写道:
> 
> On Sat, Jun 25, 2016 at 06:15:40PM +0200, Peter Zijlstra wrote:
>> On Sat, Jun 25, 2016 at 11:21:30PM +0800, Boqun Feng wrote:
>>> So on PPC, we have lppaca::yield_count to detect when an vcpu is
>>> preempted, if the yield_count is even, the vcpu is running, otherwise it
>>> is preempted(__spin_yield() is a user of this).
>>> 
>>> Therefore it makes more sense we
>>> 
>>> 		if (need_resched() || vcpu_is_preempted(old))
>>> 
>>> here, and implement vcpu_is_preempted() on PPC as
>>> 
>>> bool vcpu_is_preempted(int cpu)
>>> {
>>> 	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1)
>>> }
>>> 
>>> Thoughts?
>> 
>> Would that not have issues where the owner cpu is kept running but the
>> spinner (ie. _this_ vcpu) gets preempted? I would think that in that
>> case we too want to stop spinning.
>> 
> 
do  you mean that the spinner detect itself had yield out during the big spin loop?

It is very possible to happen.  BUT if spinner(on this vcpu) yield out, the next spinner would break the spin loop.
AND if spinner detect itself yield out once, it’s very possible to get the osq lock soon as long as the ower vcpu is running.

SO I think we need just check the owner vcpu’s yield_count.

> I don't think we want(or need) to stop the spinning of _this_ vcpu in
> that case? Because it has already been preempted, when it gets back to
> run, the owner may still be running and haven't set ->locked to 1 yet,
> which means spinning on this vcpu is still worthwhile.
> 

> I think the proper logic here is that in the optimistic spin queue, if
> any one found its predecessor's vcpu was preempted, it should stop
> spinning, because it's very likely that it would not see ->locked
> becoming 1 in a short time.
> 
agree!!!
this  vcpu need yield out too if the owner’s vcpu has yield out.


>> Although, if all vcpus are scheduled equal, it might not matter on
>> average.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH] locking/osq: Drop the overload of osq lock
@ 2016-06-25 17:42 Pan Xinhui
  2016-06-25 14:24 ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Pan Xinhui @ 2016-06-25 17:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, dave, will.deacon, Waiman.Long, benh, Pan Xinhui

An over-committed guest with more vCPUs than pCPUs has a heavy overload
in osq_lock().

This is because vCPU A hold the osq lock and yield out, vCPU B wait
per_cpu node->locked to be set. IOW, vCPU B wait vCPU A to run and
unlock the osq lock. Even there is need_resched(), it did not help on
such scenario.

To fix such bad issue, add a threshold in one while-loop of osq_lock().
The value of threshold is somehow equal to SPIN_THRESHOLD.

perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

after patch:
7.62%  sched-messaging  [kernel.kallsyms]  [k] wait_consider_task
7.30%  sched-messaging  [kernel.kallsyms]  [k] _raw_write_lock_irq
5.93%  sched-messaging  [kernel.kallsyms]  [k] mutex_unlock
5.74%  sched-messaging  [unknown]          [H] 0xc000000000077590
4.37%  sched-messaging  [kernel.kallsyms]  [k] __copy_tofrom_user_powe
2.58%  sched-messaging  [kernel.kallsyms]  [k] system_call

Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
---
 kernel/locking/osq_lock.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..922fe5d 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -81,12 +81,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 	return next;
 }
 
+/* The threahold should take nearly 0.5ms on most archs */
+#define OSQ_SPIN_THRESHOLD (1 << 15)
+
 bool osq_lock(struct optimistic_spin_queue *lock)
 {
 	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
 	struct optimistic_spin_node *prev, *next;
 	int curr = encode_cpu(smp_processor_id());
 	int old;
+	int loops;
 
 	node->locked = 0;
 	node->next = NULL;
@@ -118,8 +122,14 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	while (!READ_ONCE(node->locked)) {
 		/*
 		 * If we need to reschedule bail... so we can block.
+		 * An over-committed guest with more vCPUs than pCPUs
+		 * might fall in this loop and cause a huge overload.
+		 * This is because vCPU A(prev) hold the osq lock and yield out,
+		 * vCPU B(node) wait ->locked to be set, IOW, wait till
+		 * vCPU A run and unlock the osq lock.
+		 * NOTE that vCPU A and vCPU B might run on same physical cpu.
 		 */
-		if (need_resched())
+		if (need_resched() || loops++ == OSQ_SPIN_THRESHOLD)
 			goto unqueue;
 
 		cpu_relax_lowlatency();
-- 
2.4.11

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 16:28       ` Boqun Feng
@ 2016-06-25 18:28         ` Peter Zijlstra
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-06-25 18:28 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Pan Xinhui, linux-kernel, mingo, dave, will.deacon, Waiman.Long, benh

On Sun, Jun 26, 2016 at 12:28:13AM +0800, Boqun Feng wrote:
> On Sat, Jun 25, 2016 at 06:09:22PM +0200, Peter Zijlstra wrote:
> > That works here, but it would not work for the need_resched() in
> > mutex_spin_on_owner() and mutex_optimistic_spin() which need equal
> > treatment.
> > 
> > Because those too we want to limit.
> > 
> > The count thing, while a little more cumbersome, is more widely
> > applicable than just the one OSQ case where we happen to have a cpu
> > number.
> > 
> 
> But if we don't have a cpu number, which vcpu's preemption are we
> trying to detect? 

_this_ vcpu's preemption. If the yield count of this cpu changes, we
know this vcpu has been scheduled and we should stop spinning.

This is similar to the need_resched() case, which check is _this_ cpu
should reschedule.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 17:27         ` panxinhui
@ 2016-06-25 19:12           ` Peter Zijlstra
  2016-06-26  4:59             ` panxinhui
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2016-06-25 19:12 UTC (permalink / raw)
  To: panxinhui
  Cc: Boqun Feng, Pan Xinhui, linux-kernel, mingo, dave, will.deacon,
	Waiman.Long, benh

On Sun, Jun 26, 2016 at 01:27:51AM +0800, panxinhui wrote:

> by the way I still think mutex_unlock has a big overload too. 

Do you mean overhead?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 17:27         ` panxinhui
@ 2016-06-25 19:20           ` Peter Zijlstra
  2016-06-25 19:29             ` Peter Zijlstra
                               ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-06-25 19:20 UTC (permalink / raw)
  To: panxinhui
  Cc: Boqun Feng, Pan Xinhui, linux-kernel, mingo, dave, will.deacon,
	Waiman.Long, benh

On Sun, Jun 26, 2016 at 01:27:56AM +0800, panxinhui wrote:
> >> Would that not have issues where the owner cpu is kept running but the
> >> spinner (ie. _this_ vcpu) gets preempted? I would think that in that
> >> case we too want to stop spinning.
> >> 
> > 
> do  you mean that the spinner detect itself had yield out during the
> big spin loop?
> 
> It is very possible to happen.  BUT if spinner(on this vcpu) yield
> out, the next spinner would break the spin loop.  AND if spinner
> detect itself yield out once, it’s very possible to get the osq lock
> soon as long as the ower vcpu is running.
> 
> SO I think we need just check the owner vcpu’s yield_count.

I had a quick look at KVM and it looks like it only has
kvm_cpu::preempted, which would suggest the interface boqun proposed.

We'll have to look at many of the other virt platforms as well to see
what they can do.

We could also provide _both_ interfaces and a platform can implement
whichever variant (or both) it can.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 19:20           ` Peter Zijlstra
@ 2016-06-25 19:29             ` Peter Zijlstra
  2016-06-26  2:26             ` Boqun Feng
  2016-06-26  5:21             ` panxinhui
  2 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-06-25 19:29 UTC (permalink / raw)
  To: panxinhui
  Cc: Boqun Feng, Pan Xinhui, linux-kernel, mingo, dave, will.deacon,
	Waiman.Long, benh

On Sat, Jun 25, 2016 at 09:20:25PM +0200, Peter Zijlstra wrote:
> On Sun, Jun 26, 2016 at 01:27:56AM +0800, panxinhui wrote:
> > >> Would that not have issues where the owner cpu is kept running but the
> > >> spinner (ie. _this_ vcpu) gets preempted? I would think that in that
> > >> case we too want to stop spinning.
> > >> 
> > > 
> > do  you mean that the spinner detect itself had yield out during the
> > big spin loop?
> > 
> > It is very possible to happen.  BUT if spinner(on this vcpu) yield
> > out, the next spinner would break the spin loop.  AND if spinner
> > detect itself yield out once, it’s very possible to get the osq lock
> > soon as long as the ower vcpu is running.
> > 
> > SO I think we need just check the owner vcpu’s yield_count.
> 
> I had a quick look at KVM and it looks like it only has
> kvm_cpu::preempted, which would suggest the interface boqun proposed.

Xen seems to have vcpu_runstate_info::state where any !0 state means its
not running, so that too allows implementing that variant.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 19:20           ` Peter Zijlstra
  2016-06-25 19:29             ` Peter Zijlstra
@ 2016-06-26  2:26             ` Boqun Feng
  2016-06-26  5:21             ` panxinhui
  2 siblings, 0 replies; 31+ messages in thread
From: Boqun Feng @ 2016-06-26  2:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: panxinhui, Pan Xinhui, linux-kernel, mingo, dave, will.deacon,
	Waiman.Long, benh

[-- Attachment #1: Type: text/plain, Size: 1220 bytes --]

On Sat, Jun 25, 2016 at 09:20:25PM +0200, Peter Zijlstra wrote:
> On Sun, Jun 26, 2016 at 01:27:56AM +0800, panxinhui wrote:
> > >> Would that not have issues where the owner cpu is kept running but the
> > >> spinner (ie. _this_ vcpu) gets preempted? I would think that in that
> > >> case we too want to stop spinning.
> > >> 
> > > 
> > do  you mean that the spinner detect itself had yield out during the
> > big spin loop?
> > 
> > It is very possible to happen.  BUT if spinner(on this vcpu) yield
> > out, the next spinner would break the spin loop.  AND if spinner
> > detect itself yield out once, it’s very possible to get the osq lock
> > soon as long as the ower vcpu is running.
> > 
> > SO I think we need just check the owner vcpu’s yield_count.
> 
> I had a quick look at KVM and it looks like it only has
> kvm_cpu::preempted, which would suggest the interface boqun proposed.
> 
> We'll have to look at many of the other virt platforms as well to see
> what they can do.
> 
> We could also provide _both_ interfaces and a platform can implement
> whichever variant (or both) it can.
> 

Make sense ;-)

Lemme cook something for further discussions.

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 19:12           ` Peter Zijlstra
@ 2016-06-26  4:59             ` panxinhui
  2016-06-27  7:55               ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: panxinhui @ 2016-06-26  4:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: panxinhui, Boqun Feng, Pan Xinhui, linux-kernel, mingo, dave,
	will.deacon, Waiman.Long, benh


> 在 2016年6月26日,03:12,Peter Zijlstra <peterz@infradead.org> 写道:
> 
> On Sun, Jun 26, 2016 at 01:27:51AM +0800, panxinhui wrote:
> 
>> by the way I still think mutex_unlock has a big overload too. 
> 
> Do you mean overhead?
> 
oh, maybe you are right. 
mutex_unlock ’s implementation uses inc_return variant on  ppc, and that’s expensive. I am thinking of using  cmpxchg instead. 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-25 19:20           ` Peter Zijlstra
  2016-06-25 19:29             ` Peter Zijlstra
  2016-06-26  2:26             ` Boqun Feng
@ 2016-06-26  5:21             ` panxinhui
  2016-06-26  6:10               ` Boqun Feng
  2 siblings, 1 reply; 31+ messages in thread
From: panxinhui @ 2016-06-26  5:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: panxinhui, Boqun Feng, Pan Xinhui, linux-kernel, mingo, dave,
	will.deacon, Waiman.Long, benh


> 在 2016年6月26日,03:20,Peter Zijlstra <peterz@infradead.org> 写道:
> 
> On Sun, Jun 26, 2016 at 01:27:56AM +0800, panxinhui wrote:
>>>> Would that not have issues where the owner cpu is kept running but the
>>>> spinner (ie. _this_ vcpu) gets preempted? I would think that in that
>>>> case we too want to stop spinning.
>>>> 
>>> 
>> do  you mean that the spinner detect itself had yield out during the
>> big spin loop?
>> 
>> It is very possible to happen.  BUT if spinner(on this vcpu) yield
>> out, the next spinner would break the spin loop.  AND if spinner
>> detect itself yield out once, it’s very possible to get the osq lock
>> soon as long as the ower vcpu is running.
>> 
>> SO I think we need just check the owner vcpu’s yield_count.
> 
> I had a quick look at KVM and it looks like it only has
> kvm_cpu::preempted, which would suggest the interface boqun proposed.
> 
> We'll have to look at many of the other virt platforms as well to see
> what they can do.
> 
> We could also provide _both_ interfaces and a platform can implement
> whichever variant (or both) it can.
> 
the kvm code on ppc has implemented  yield_count.
It let me feel a little relaxed. :)

looks like we could  introduce the interface like below.

bool vcpu_is_preempted(int cpu)
{
	return arch_vcpu_is_preempted(cpu);
}

#ifdef arch_vcpu_has_yield_count 
bool vcpu_has_preemtped_once(int cpu, unsigned int yield_count)
 {
	return arch_get_vcpu_yield_count() != yield_count;
}

#else
bool vcpu_has_preemtped_once(int cpu, unsigned int yield_count)
{
	/*just let called know it is preepmpted*/
	return vcpu_is_preempted(cpu);
}
#endif

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-26  5:21             ` panxinhui
@ 2016-06-26  6:10               ` Boqun Feng
  2016-06-26  6:58                 ` panxinhui
  2016-06-26  6:59                 ` Boqun Feng
  0 siblings, 2 replies; 31+ messages in thread
From: Boqun Feng @ 2016-06-26  6:10 UTC (permalink / raw)
  To: panxinhui
  Cc: Peter Zijlstra, Pan Xinhui, linux-kernel, mingo, dave,
	will.deacon, Waiman.Long, benh

On Sun, Jun 26, 2016 at 01:21:04PM +0800, panxinhui wrote:
> 
> > 在 2016年6月26日,03:20,Peter Zijlstra <peterz@infradead.org> 写道:
> > 
> > On Sun, Jun 26, 2016 at 01:27:56AM +0800, panxinhui wrote:
> >>>> Would that not have issues where the owner cpu is kept running but the
> >>>> spinner (ie. _this_ vcpu) gets preempted? I would think that in that
> >>>> case we too want to stop spinning.
> >>>> 
> >>> 
> >> do  you mean that the spinner detect itself had yield out during the
> >> big spin loop?
> >> 
> >> It is very possible to happen.  BUT if spinner(on this vcpu) yield
> >> out, the next spinner would break the spin loop.  AND if spinner
> >> detect itself yield out once, it’s very possible to get the osq lock
> >> soon as long as the ower vcpu is running.
> >> 
> >> SO I think we need just check the owner vcpu’s yield_count.
> > 
> > I had a quick look at KVM and it looks like it only has
> > kvm_cpu::preempted, which would suggest the interface boqun proposed.
> > 
> > We'll have to look at many of the other virt platforms as well to see
> > what they can do.
> > 
> > We could also provide _both_ interfaces and a platform can implement
> > whichever variant (or both) it can.
> > 
> the kvm code on ppc has implemented  yield_count.
> It let me feel a little relaxed. :)
> 
> looks like we could  introduce the interface like below.
> 
> bool vcpu_is_preempted(int cpu)
> {
> 	return arch_vcpu_is_preempted(cpu);
> }
> 
> #ifdef arch_vcpu_has_yield_count 
> bool vcpu_has_preemtped_once(int cpu, unsigned int yield_count)
>  {
> 	return arch_get_vcpu_yield_count() != yield_count;
> }
> 
> #else
> bool vcpu_has_preemtped_once(int cpu, unsigned int yield_count)
> {
> 	/*just let called know it is preepmpted*/
> 	return vcpu_is_preempted(cpu);
> }
> #endif
> 

Unfortunately, on PPC, we compile pseries code along with powernv code
in a kernel binary, therefore we need to wire the proper primitives at
runtime.

I've cooked the following patch, please note I haven't test this, this
is just a POC.

Regards,
Boqun

---------------->8
virt: Introduce vcpu preemption detection functions

Lock holder preemption is an serious problem in virtualized environment
for locking. Different archs and hypervisors employ different ways to
try to solve the problem in different locking mechanisms. And recently
Pan Xinhui found a significant performance drop in osq_lock(), which
could be solved by providing a mechanism to detect the preemption of
vcpu.

Therefore this patch introduces several vcpu preemption detection
primitives, which allows locking primtives to save the time of
unnecesarry spinning. These primitives act as an abstract layer between
locking functions and arch or hypervisor related code.

There are two sets of primitives:

1.	vcpu_preempt_count() and vcpu_has_preempted(), they must be used
	pairwisely in a same preempt disable critical section. And they
	could detect whether a vcpu preemption happens between them.

2.	vcpu_is_preempted(), used to check whether other cpu's vcpu is
	preempted.

This patch also implements those primitives on pseries and wire them up.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/platforms/pseries/Kconfig |  1 +
 arch/powerpc/platforms/pseries/setup.c | 28 +++++++++++
 include/linux/vcpu_preempt.h           | 90 ++++++++++++++++++++++++++++++++++
 3 files changed, 119 insertions(+)
 create mode 100644 include/linux/vcpu_preempt.h

diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
index bec90fb30425..7d24c3e48878 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -21,6 +21,7 @@ config PPC_PSERIES
 	select HOTPLUG_CPU if SMP
 	select ARCH_RANDOM
 	select PPC_DOORBELL
+	select HAS_VCPU_PREEMPTION_DETECTION
 	default y
 
 config PPC_SPLPAR
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 9883bc7ea007..5d4aed54e039 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -42,6 +42,7 @@
 #include <linux/of.h>
 #include <linux/of_pci.h>
 #include <linux/kexec.h>
+#include <linux/vcpu_preempt.h>
 
 #include <asm/mmu.h>
 #include <asm/processor.h>
@@ -501,6 +502,31 @@ static void __init find_and_init_phbs(void)
 	of_pci_check_probe_only();
 }
 
+struct vcpu_preempt_ops vcpu_preempt_ops = DEFAULT_VCPU_PREEMPT_OPS;
+EXPORT_SYMBOL(vcpu_preempt_ops);
+
+static long pseries_vcpu_preempt_count(void)
+{
+	return be32_to_cpu(get_lppaca()->yield_count);
+}
+
+static bool pseries_vcpu_is_preempted(int cpu)
+{
+	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+
+static bool pseries_vcpu_has_preempted(long vpc)
+{
+	return pseries_vcpu_preempt_count() != vpc;
+}
+
+static void __init pseries_setup_vcpu_preempt_ops(void)
+{
+	vcpu_preempt_ops.preempt_count = pseries_vcpu_preempt_count;
+	vcpu_preempt_ops.is_preempted = pseries_vcpu_is_preempted;
+	vcpu_preempt_ops.has_preempted = pseries_vcpu_has_preempted;
+}
+
 static void __init pSeries_setup_arch(void)
 {
 	set_arch_panic_timeout(10, ARCH_PANIC_TIMEOUT);
@@ -549,6 +575,8 @@ static void __init pSeries_setup_arch(void)
 				"%ld\n", rc);
 		}
 	}
+
+	pseries_setup_vcpu_preempt_ops();
 }
 
 static int __init pSeries_init_panel(void)
diff --git a/include/linux/vcpu_preempt.h b/include/linux/vcpu_preempt.h
new file mode 100644
index 000000000000..4a88414bb83f
--- /dev/null
+++ b/include/linux/vcpu_preempt.h
@@ -0,0 +1,90 @@
+/*
+ * Primitives for checking the vcpu preemption from the guest.
+ */
+
+static long __vcpu_preempt_count(void)
+{
+	return 0;
+}
+
+static bool __vcpu_has_preempted(long vpc)
+{
+	return false;
+}
+
+static bool __vcpu_is_preempted(int cpu)
+{
+	return false;
+}
+
+struct vcpu_preempt_ops {
+	/*
+	 * Get the current vcpu's "preempt count", which is going to use for
+	 * checking whether the current vcpu has ever been preempted
+	 */
+	long (*preempt_count)(void);
+
+	/*
+	 * Return whether a vcpu is preempted
+	 */
+	bool (*is_preempted)(int cpu);
+
+	/*
+	 * Given a "vcpu preempt count", Return whether a vcpu preemption ever
+	 * happened after the .preempt_count() was called.
+	 */
+	bool (*has_preempted)(long vpc);
+};
+
+struct vcpu_preempt_ops vcpu_preempt_ops;
+
+/* Default boilerplate */
+#define DEFAULT_VCPU_PREEMPT_OPS			\
+	{						\
+		.preempt_count = __vcpu_preempt_count,	\
+		.is_preempted = __vcpu_is_preempted,	\
+		.has_preempted = __vcpu_has_preempted	\
+	}
+
+#ifdef CONFIG_HAS_VCPU_PREEMPTION_DETECTION
+/*
+ * vcpu_preempt_count: Get the current cpu's "vcpu preempt count"(vpc).
+ *
+ * The vpc is used for checking whether the current vcpu has ever been
+ * preempted via vcpu_has_preempted().
+ *
+ * This function and vcpu_has_preepmted() should be called in the same
+ * preemption disabled critical section.
+ */
+static long vcpu_preempt_count(void)
+{
+	return vcpu_preempt_ops.preempt_count();
+}
+
+/*
+ * vcpu_is_preempted: Check whether @cpu's vcpu is preempted.
+ */
+static bool vcpu_is_preempted(int cpu)
+{
+	return vcpu_preempt_ops.is_preempted(cpu);
+}
+
+/*
+ * vcpu_has_preepmted: Check whether the current cpu's vcpu has ever been
+ * preempted.
+ *
+ * The checked duration is between the vcpu_preempt_count() which returns @vpc
+ * is called and this function called.
+ *
+ * This function and corresponding vcpu_preempt_count() should be in the same
+ * preemption disabled cirtial section.
+ */
+static bool vcpu_has_preempted(long vpc)
+{
+	return vcpu_preempt_ops.has_preempt(vpc);
+}
+#else /* CONFIG_HAS_VCPU_PREEMPTION_DETECTION */
+#define vcpu_preempt_count() __vcpu_preempt_count()
+#define vcpu_is_preempted(cpu) __vcpu_is_preempted(cpu)
+#define vcpu_has_preempted(vpc) __vcpu_has_preempted(vpc)
+#endif /* CONFIG_HAS_VCPU_PREEPMTION_DETECTION */
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-26  6:10               ` Boqun Feng
@ 2016-06-26  6:58                 ` panxinhui
  2016-06-26 14:11                   ` Boqun Feng
  2016-06-26  6:59                 ` Boqun Feng
  1 sibling, 1 reply; 31+ messages in thread
From: panxinhui @ 2016-06-26  6:58 UTC (permalink / raw)
  To: Boqun Feng
  Cc: panxinhui, Peter Zijlstra, Pan Xinhui, linux-kernel, mingo, dave,
	will.deacon, Waiman.Long, benh


> 在 2016年6月26日,14:10,Boqun Feng <boqun.feng@gmail.com> 写道:
> 
> On Sun, Jun 26, 2016 at 01:21:04PM +0800, panxinhui wrote:
>> 
>>> 在 2016年6月26日,03:20,Peter Zijlstra <peterz@infradead.org> 写道:
>>> 
>>> On Sun, Jun 26, 2016 at 01:27:56AM +0800, panxinhui wrote:
>>>>>> Would that not have issues where the owner cpu is kept running but the
>>>>>> spinner (ie. _this_ vcpu) gets preempted? I would think that in that
>>>>>> case we too want to stop spinning.
>>>>>> 
>>>>> 
>>>> do  you mean that the spinner detect itself had yield out during the
>>>> big spin loop?
>>>> 
>>>> It is very possible to happen.  BUT if spinner(on this vcpu) yield
>>>> out, the next spinner would break the spin loop.  AND if spinner
>>>> detect itself yield out once, it’s very possible to get the osq lock
>>>> soon as long as the ower vcpu is running.
>>>> 
>>>> SO I think we need just check the owner vcpu’s yield_count.
>>> 
>>> I had a quick look at KVM and it looks like it only has
>>> kvm_cpu::preempted, which would suggest the interface boqun proposed.
>>> 
>>> We'll have to look at many of the other virt platforms as well to see
>>> what they can do.
>>> 
>>> We could also provide _both_ interfaces and a platform can implement
>>> whichever variant (or both) it can.
>>> 
>> the kvm code on ppc has implemented  yield_count.
>> It let me feel a little relaxed. :)
>> 
>> looks like we could  introduce the interface like below.
>> 
>> bool vcpu_is_preempted(int cpu)
>> {
>> 	return arch_vcpu_is_preempted(cpu);
>> }
>> 
>> #ifdef arch_vcpu_has_yield_count 
>> bool vcpu_has_preemtped_once(int cpu, unsigned int yield_count)
>> {
>> 	return arch_get_vcpu_yield_count() != yield_count;
>> }
>> 
>> #else
>> bool vcpu_has_preemtped_once(int cpu, unsigned int yield_count)
>> {
>> 	/*just let called know it is preepmpted*/
>> 	return vcpu_is_preempted(cpu);
>> }
>> #endif
>> 
> 
> Unfortunately, on PPC, we compile pseries code along with powernv code
> in a kernel binary, therefore we need to wire the proper primitives at
> runtime.
> 
no, we can use same vcpu preempted check ops actually.
just like how arch_spin_lock() does.

if (SHARED_PROCESSOR)
                                __spin_yield(lock); 

so we can
u32 arch_get_vcpu_yield_count(int cpu)
{
	if (SHARED_PROCESSOR)
                                return be32_to_cpu(lppaca_of(cpu).yield_count;
	return 0;
}

and all these things can be don inline then.

> I've cooked the following patch, please note I haven't test this, this
> is just a POC.
> 
> Regards,
> Boqun
> 
> ---------------->8
> virt: Introduce vcpu preemption detection functions
> 
> Lock holder preemption is an serious problem in virtualized environment
> for locking. Different archs and hypervisors employ different ways to
> try to solve the problem in different locking mechanisms. And recently
> Pan Xinhui found a significant performance drop in osq_lock(), which
> could be solved by providing a mechanism to detect the preemption of
> vcpu.
> 
> Therefore this patch introduces several vcpu preemption detection
> primitives, which allows locking primtives to save the time of
> unnecesarry spinning. These primitives act as an abstract layer between
> locking functions and arch or hypervisor related code.
> 
> There are two sets of primitives:
> 
> 1.	vcpu_preempt_count() and vcpu_has_preempted(), they must be used
> 	pairwisely in a same preempt disable critical section. And they
> 	could detect whether a vcpu preemption happens between them.
> 
> 2.	vcpu_is_preempted(), used to check whether other cpu's vcpu is
> 	preempted.
> 
> This patch also implements those primitives on pseries and wire them up.
> 
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
> arch/powerpc/platforms/pseries/Kconfig |  1 +
> arch/powerpc/platforms/pseries/setup.c | 28 +++++++++++
> include/linux/vcpu_preempt.h           | 90 ++++++++++++++++++++++++++++++++++
> 3 files changed, 119 insertions(+)
> create mode 100644 include/linux/vcpu_preempt.h
> 
> diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
> index bec90fb30425..7d24c3e48878 100644
> --- a/arch/powerpc/platforms/pseries/Kconfig
> +++ b/arch/powerpc/platforms/pseries/Kconfig
> @@ -21,6 +21,7 @@ config PPC_PSERIES
> 	select HOTPLUG_CPU if SMP
> 	select ARCH_RANDOM
> 	select PPC_DOORBELL
> +	select HAS_VCPU_PREEMPTION_DETECTION
> 	default y
> 
it is already too many config there…
as I comment above, we could make these things inline, so looks like we needn't to add such config.
 
> config PPC_SPLPAR
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index 9883bc7ea007..5d4aed54e039 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -42,6 +42,7 @@
> #include <linux/of.h>
> #include <linux/of_pci.h>
> #include <linux/kexec.h>
> +#include <linux/vcpu_preempt.h>
> 
> #include <asm/mmu.h>
> #include <asm/processor.h>
> @@ -501,6 +502,31 @@ static void __init find_and_init_phbs(void)
> 	of_pci_check_probe_only();
> }
> 
> +struct vcpu_preempt_ops vcpu_preempt_ops = DEFAULT_VCPU_PREEMPT_OPS;
> +EXPORT_SYMBOL(vcpu_preempt_ops);
> +
> +static long pseries_vcpu_preempt_count(void)
> +{
> +	return be32_to_cpu(get_lppaca()->yield_count);
> +}
> +
> +static bool pseries_vcpu_is_preempted(int cpu)
> +{
> +	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
> +}
> +
> +static bool pseries_vcpu_has_preempted(long vpc)
> +{
> +	return pseries_vcpu_preempt_count() != vpc;
> +}
> +
> +static void __init pseries_setup_vcpu_preempt_ops(void)
> +{
> +	vcpu_preempt_ops.preempt_count = pseries_vcpu_preempt_count;
> +	vcpu_preempt_ops.is_preempted = pseries_vcpu_is_preempted;
> +	vcpu_preempt_ops.has_preempted = pseries_vcpu_has_preempted;
> +}
> +
> static void __init pSeries_setup_arch(void)
> {
> 	set_arch_panic_timeout(10, ARCH_PANIC_TIMEOUT);
> @@ -549,6 +575,8 @@ static void __init pSeries_setup_arch(void)
> 				"%ld\n", rc);
> 		}
> 	}
> +
> +	pseries_setup_vcpu_preempt_ops();
> }
> 
> static int __init pSeries_init_panel(void)
> diff --git a/include/linux/vcpu_preempt.h b/include/linux/vcpu_preempt.h
> new file mode 100644
> index 000000000000..4a88414bb83f
> --- /dev/null
> +++ b/include/linux/vcpu_preempt.h

How about add them in sched.h
I just think  they are similar. correct me if  I got a mistake :)

thanks
xinhui

> @@ -0,0 +1,90 @@
> +/*
> + * Primitives for checking the vcpu preemption from the guest.
> + */
> +
> +static long __vcpu_preempt_count(void)
> +{
> +	return 0;
> +}
> +
> +static bool __vcpu_has_preempted(long vpc)
> +{
> +	return false;
> +}
> +
> +static bool __vcpu_is_preempted(int cpu)
> +{
> +	return false;
> +}
> +
> +struct vcpu_preempt_ops {
> +	/*
> +	 * Get the current vcpu's "preempt count", which is going to use for
> +	 * checking whether the current vcpu has ever been preempted
> +	 */
> +	long (*preempt_count)(void);
> +
> +	/*
> +	 * Return whether a vcpu is preempted
> +	 */
> +	bool (*is_preempted)(int cpu);
> +
> +	/*
> +	 * Given a "vcpu preempt count", Return whether a vcpu preemption ever
> +	 * happened after the .preempt_count() was called.
> +	 */
> +	bool (*has_preempted)(long vpc);
> +};
> +
> +struct vcpu_preempt_ops vcpu_preempt_ops;
> +
> +/* Default boilerplate */
> +#define DEFAULT_VCPU_PREEMPT_OPS			\
> +	{						\
> +		.preempt_count = __vcpu_preempt_count,	\
> +		.is_preempted = __vcpu_is_preempted,	\
> +		.has_preempted = __vcpu_has_preempted	\
> +	}
> +
> +#ifdef CONFIG_HAS_VCPU_PREEMPTION_DETECTION
> +/*
> + * vcpu_preempt_count: Get the current cpu's "vcpu preempt count"(vpc).
> + *
> + * The vpc is used for checking whether the current vcpu has ever been
> + * preempted via vcpu_has_preempted().
> + *
> + * This function and vcpu_has_preepmted() should be called in the same
> + * preemption disabled critical section.
> + */
> +static long vcpu_preempt_count(void)
> +{
> +	return vcpu_preempt_ops.preempt_count();
> +}
> +
> +/*
> + * vcpu_is_preempted: Check whether @cpu's vcpu is preempted.
> + */
> +static bool vcpu_is_preempted(int cpu)
> +{
> +	return vcpu_preempt_ops.is_preempted(cpu);
> +}
> +
> +/*
> + * vcpu_has_preepmted: Check whether the current cpu's vcpu has ever been
> + * preempted.
> + *
> + * The checked duration is between the vcpu_preempt_count() which returns @vpc
> + * is called and this function called.
> + *
> + * This function and corresponding vcpu_preempt_count() should be in the same
> + * preemption disabled cirtial section.
> + */
> +static bool vcpu_has_preempted(long vpc)
> +{
> +	return vcpu_preempt_ops.has_preempt(vpc);
> +}
> +#else /* CONFIG_HAS_VCPU_PREEMPTION_DETECTION */
> +#define vcpu_preempt_count() __vcpu_preempt_count()
> +#define vcpu_is_preempted(cpu) __vcpu_is_preempted(cpu)
> +#define vcpu_has_preempted(vpc) __vcpu_has_preempted(vpc)
> +#endif /* CONFIG_HAS_VCPU_PREEPMTION_DETECTION */
> -- 
> 2.9.0
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-26  6:10               ` Boqun Feng
  2016-06-26  6:58                 ` panxinhui
@ 2016-06-26  6:59                 ` Boqun Feng
  2016-06-26  7:08                   ` panxinhui
  2016-06-27  6:45                   ` Boqun Feng
  1 sibling, 2 replies; 31+ messages in thread
From: Boqun Feng @ 2016-06-26  6:59 UTC (permalink / raw)
  To: panxinhui
  Cc: Peter Zijlstra, Pan Xinhui, linux-kernel, mingo, dave,
	will.deacon, Waiman.Long, benh

On Sun, Jun 26, 2016 at 02:10:57PM +0800, Boqun Feng wrote:
> On Sun, Jun 26, 2016 at 01:21:04PM +0800, panxinhui wrote:
> > 
> > > 在 2016年6月26日,03:20,Peter Zijlstra <peterz@infradead.org> 写道:
> > > 
> > > On Sun, Jun 26, 2016 at 01:27:56AM +0800, panxinhui wrote:
> > >>>> Would that not have issues where the owner cpu is kept running but the
> > >>>> spinner (ie. _this_ vcpu) gets preempted? I would think that in that
> > >>>> case we too want to stop spinning.
> > >>>> 
> > >>> 
> > >> do  you mean that the spinner detect itself had yield out during the
> > >> big spin loop?
> > >> 
> > >> It is very possible to happen.  BUT if spinner(on this vcpu) yield
> > >> out, the next spinner would break the spin loop.  AND if spinner
> > >> detect itself yield out once, it’s very possible to get the osq lock
> > >> soon as long as the ower vcpu is running.
> > >> 
> > >> SO I think we need just check the owner vcpu’s yield_count.
> > > 
> > > I had a quick look at KVM and it looks like it only has
> > > kvm_cpu::preempted, which would suggest the interface boqun proposed.
> > > 
> > > We'll have to look at many of the other virt platforms as well to see
> > > what they can do.
> > > 
> > > We could also provide _both_ interfaces and a platform can implement
> > > whichever variant (or both) it can.
> > > 
> > the kvm code on ppc has implemented  yield_count.
> > It let me feel a little relaxed. :)
> > 
> > looks like we could  introduce the interface like below.
> > 
> > bool vcpu_is_preempted(int cpu)
> > {
> > 	return arch_vcpu_is_preempted(cpu);
> > }
> > 
> > #ifdef arch_vcpu_has_yield_count 
> > bool vcpu_has_preemtped_once(int cpu, unsigned int yield_count)
> >  {
> > 	return arch_get_vcpu_yield_count() != yield_count;
> > }
> > 
> > #else
> > bool vcpu_has_preemtped_once(int cpu, unsigned int yield_count)
> > {
> > 	/*just let called know it is preepmpted*/
> > 	return vcpu_is_preempted(cpu);
> > }
> > #endif
> > 
> 
> Unfortunately, on PPC, we compile pseries code along with powernv code
> in a kernel binary, therefore we need to wire the proper primitives at
> runtime.
> 
> I've cooked the following patch, please note I haven't test this, this
> is just a POC.
> 
> Regards,
> Boqun
> 
> ---------------->8
> virt: Introduce vcpu preemption detection functions
> 
> Lock holder preemption is an serious problem in virtualized environment
> for locking. Different archs and hypervisors employ different ways to
> try to solve the problem in different locking mechanisms. And recently
> Pan Xinhui found a significant performance drop in osq_lock(), which
> could be solved by providing a mechanism to detect the preemption of
> vcpu.
> 
> Therefore this patch introduces several vcpu preemption detection
> primitives, which allows locking primtives to save the time of
> unnecesarry spinning. These primitives act as an abstract layer between
> locking functions and arch or hypervisor related code.
> 
> There are two sets of primitives:
> 
> 1.	vcpu_preempt_count() and vcpu_has_preempted(), they must be used
> 	pairwisely in a same preempt disable critical section. And they
> 	could detect whether a vcpu preemption happens between them.
> 
> 2.	vcpu_is_preempted(), used to check whether other cpu's vcpu is
> 	preempted.
> 
> This patch also implements those primitives on pseries and wire them up.
> 
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
>  arch/powerpc/platforms/pseries/Kconfig |  1 +
>  arch/powerpc/platforms/pseries/setup.c | 28 +++++++++++
>  include/linux/vcpu_preempt.h           | 90 ++++++++++++++++++++++++++++++++++
>  3 files changed, 119 insertions(+)
>  create mode 100644 include/linux/vcpu_preempt.h
> 
> diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
> index bec90fb30425..7d24c3e48878 100644
> --- a/arch/powerpc/platforms/pseries/Kconfig
> +++ b/arch/powerpc/platforms/pseries/Kconfig
> @@ -21,6 +21,7 @@ config PPC_PSERIES
>  	select HOTPLUG_CPU if SMP
>  	select ARCH_RANDOM
>  	select PPC_DOORBELL
> +	select HAS_VCPU_PREEMPTION_DETECTION
>  	default y
>  
>  config PPC_SPLPAR
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index 9883bc7ea007..5d4aed54e039 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -42,6 +42,7 @@
>  #include <linux/of.h>
>  #include <linux/of_pci.h>
>  #include <linux/kexec.h>
> +#include <linux/vcpu_preempt.h>
>  
>  #include <asm/mmu.h>
>  #include <asm/processor.h>
> @@ -501,6 +502,31 @@ static void __init find_and_init_phbs(void)
>  	of_pci_check_probe_only();
>  }
>  
> +struct vcpu_preempt_ops vcpu_preempt_ops = DEFAULT_VCPU_PREEMPT_OPS;
> +EXPORT_SYMBOL(vcpu_preempt_ops);
> +
> +static long pseries_vcpu_preempt_count(void)
> +{
> +	return be32_to_cpu(get_lppaca()->yield_count);
> +}
> +
> +static bool pseries_vcpu_is_preempted(int cpu)
> +{
> +	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
> +}
> +
> +static bool pseries_vcpu_has_preempted(long vpc)
> +{
> +	return pseries_vcpu_preempt_count() != vpc;
> +}
> +
> +static void __init pseries_setup_vcpu_preempt_ops(void)
> +{
> +	vcpu_preempt_ops.preempt_count = pseries_vcpu_preempt_count;
> +	vcpu_preempt_ops.is_preempted = pseries_vcpu_is_preempted;
> +	vcpu_preempt_ops.has_preempted = pseries_vcpu_has_preempted;
> +}
> +
>  static void __init pSeries_setup_arch(void)
>  {
>  	set_arch_panic_timeout(10, ARCH_PANIC_TIMEOUT);
> @@ -549,6 +575,8 @@ static void __init pSeries_setup_arch(void)
>  				"%ld\n", rc);
>  		}
>  	}
> +
> +	pseries_setup_vcpu_preempt_ops();
>  }
>  
>  static int __init pSeries_init_panel(void)
> diff --git a/include/linux/vcpu_preempt.h b/include/linux/vcpu_preempt.h
> new file mode 100644
> index 000000000000..4a88414bb83f
> --- /dev/null
> +++ b/include/linux/vcpu_preempt.h
> @@ -0,0 +1,90 @@
> +/*
> + * Primitives for checking the vcpu preemption from the guest.
> + */
> +
> +static long __vcpu_preempt_count(void)
> +{
> +	return 0;
> +}
> +
> +static bool __vcpu_has_preempted(long vpc)
> +{
> +	return false;
> +}
> +
> +static bool __vcpu_is_preempted(int cpu)
> +{
> +	return false;
> +}
> +
> +struct vcpu_preempt_ops {
> +	/*
> +	 * Get the current vcpu's "preempt count", which is going to use for
> +	 * checking whether the current vcpu has ever been preempted
> +	 */
> +	long (*preempt_count)(void);
> +
> +	/*
> +	 * Return whether a vcpu is preempted
> +	 */
> +	bool (*is_preempted)(int cpu);
> +
> +	/*
> +	 * Given a "vcpu preempt count", Return whether a vcpu preemption ever
> +	 * happened after the .preempt_count() was called.
> +	 */
> +	bool (*has_preempted)(long vpc);
> +};
> +
> +struct vcpu_preempt_ops vcpu_preempt_ops;
> +

This should be:

extern struct vcpu_preempt_ops vcpu_preempt_ops;

And I tested this one along with modified version of Xinhui's patch.

The test showed that even in a not over-committed system, the overhead
of mutex_lock() could go down from 2.58%(w/o patches) to 0.87%(w/
patches).

Xinhui, I did also modifiy the title and commit log of your patch, feel
free to take anything into your patch.

Regards,
Boqun

----------------->8
locking/osq: Stop spinning in osq_lock if related vcpus get preempted

In a virtualized environment, a cpu may hold the osq_lock while its vcpu
gets preempted, which makes the spinning of other cpus in the osq
pointless, because it may be quite a while for the holder to get back to
run and notify its successor to go.

To fix this, we use the vcpu preemption detection mechanism to check
whether the holder is preempted out, if so we just stop spinning in the
osq_lock(). Morever, if the spinner has ever been preempted, it also
indicates that the spining in osq_lock() is pointless, so we detect this
and stop spinning too.

Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
---
 kernel/locking/osq_lock.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a37857ab55..bed5a56a6aed 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -1,6 +1,7 @@
 #include <linux/percpu.h>
 #include <linux/sched.h>
 #include <linux/osq_lock.h>
+#include <linux/vcpu_preempt.h>
 
 /*
  * An MCS like lock especially tailored for optimistic spinning for sleeping
@@ -87,6 +88,8 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	struct optimistic_spin_node *prev, *next;
 	int curr = encode_cpu(smp_processor_id());
 	int old;
+	int loops;
+	long vpc;
 
 	node->locked = 0;
 	node->next = NULL;
@@ -106,6 +109,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	node->prev = prev;
 	WRITE_ONCE(prev->next, node);
 
+	old = old - 1;
+	vpc = vcpu_preempt_count();
+
 	/*
 	 * Normally @prev is untouchable after the above store; because at that
 	 * moment unlock can proceed and wipe the node element from stack.
@@ -118,8 +124,14 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	while (!READ_ONCE(node->locked)) {
 		/*
 		 * If we need to reschedule bail... so we can block.
+		 * An over-committed guest with more vCPUs than pCPUs
+		 * might fall in this loop and cause a huge overload.
+		 * This is because vCPU A(prev) hold the osq lock and yield out,
+		 * vCPU B(node) wait ->locked to be set, IOW, wait till
+		 * vCPU A run and unlock the osq lock.
+		 * NOTE that vCPU A and vCPU B might run on same physical cpu.
 		 */
-		if (need_resched())
+		if (need_resched() || vcpu_is_preempted(old) || vcpu_has_preempted(vpc))
 			goto unqueue;
 
 		cpu_relax_lowlatency();

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-26  6:59                 ` Boqun Feng
@ 2016-06-26  7:08                   ` panxinhui
  2016-06-26 14:29                     ` Boqun Feng
  2016-06-27  6:45                   ` Boqun Feng
  1 sibling, 1 reply; 31+ messages in thread
From: panxinhui @ 2016-06-26  7:08 UTC (permalink / raw)
  To: Boqun Feng
  Cc: panxinhui, Peter Zijlstra, Pan Xinhui, linux-kernel, mingo, dave,
	will.deacon, Waiman.Long, benh


> 在 2016年6月26日,14:59,Boqun Feng <boqun.feng@gmail.com> 写道:
> 
> On Sun, Jun 26, 2016 at 02:10:57PM +0800, Boqun Feng wrote:
>> On Sun, Jun 26, 2016 at 01:21:04PM +0800, panxinhui wrote:
>>> 
>>>> 在 2016年6月26日,03:20,Peter Zijlstra <peterz@infradead.org> 写道:
>>>> 
>>>> On Sun, Jun 26, 2016 at 01:27:56AM +0800, panxinhui wrote:
>>>>>>> Would that not have issues where the owner cpu is kept running but the
>>>>>>> spinner (ie. _this_ vcpu) gets preempted? I would think that in that
>>>>>>> case we too want to stop spinning.
>>>>>>> 
>>>>>> 
>>>>> do  you mean that the spinner detect itself had yield out during the
>>>>> big spin loop?
>>>>> 
>>>>> It is very possible to happen.  BUT if spinner(on this vcpu) yield
>>>>> out, the next spinner would break the spin loop.  AND if spinner
>>>>> detect itself yield out once, it’s very possible to get the osq lock
>>>>> soon as long as the ower vcpu is running.
>>>>> 
>>>>> SO I think we need just check the owner vcpu’s yield_count.
>>>> 
>>>> I had a quick look at KVM and it looks like it only has
>>>> kvm_cpu::preempted, which would suggest the interface boqun proposed.
>>>> 
>>>> We'll have to look at many of the other virt platforms as well to see
>>>> what they can do.
>>>> 
>>>> We could also provide _both_ interfaces and a platform can implement
>>>> whichever variant (or both) it can.
>>>> 
>>> the kvm code on ppc has implemented  yield_count.
>>> It let me feel a little relaxed. :)
>>> 
>>> looks like we could  introduce the interface like below.
>>> 
>>> bool vcpu_is_preempted(int cpu)
>>> {
>>> 	return arch_vcpu_is_preempted(cpu);
>>> }
>>> 
>>> #ifdef arch_vcpu_has_yield_count 
>>> bool vcpu_has_preemtped_once(int cpu, unsigned int yield_count)
>>> {
>>> 	return arch_get_vcpu_yield_count() != yield_count;
>>> }
>>> 
>>> #else
>>> bool vcpu_has_preemtped_once(int cpu, unsigned int yield_count)
>>> {
>>> 	/*just let called know it is preepmpted*/
>>> 	return vcpu_is_preempted(cpu);
>>> }
>>> #endif
>>> 
>> 
>> Unfortunately, on PPC, we compile pseries code along with powernv code
>> in a kernel binary, therefore we need to wire the proper primitives at
>> runtime.
>> 
>> I've cooked the following patch, please note I haven't test this, this
>> is just a POC.
>> 
>> Regards,
>> Boqun
>> 
>> ---------------->8
>> virt: Introduce vcpu preemption detection functions
>> 
>> Lock holder preemption is an serious problem in virtualized environment
>> for locking. Different archs and hypervisors employ different ways to
>> try to solve the problem in different locking mechanisms. And recently
>> Pan Xinhui found a significant performance drop in osq_lock(), which
>> could be solved by providing a mechanism to detect the preemption of
>> vcpu.
>> 
>> Therefore this patch introduces several vcpu preemption detection
>> primitives, which allows locking primtives to save the time of
>> unnecesarry spinning. These primitives act as an abstract layer between
>> locking functions and arch or hypervisor related code.
>> 
>> There are two sets of primitives:
>> 
>> 1.	vcpu_preempt_count() and vcpu_has_preempted(), they must be used
>> 	pairwisely in a same preempt disable critical section. And they
>> 	could detect whether a vcpu preemption happens between them.
>> 
>> 2.	vcpu_is_preempted(), used to check whether other cpu's vcpu is
>> 	preempted.
>> 
>> This patch also implements those primitives on pseries and wire them up.
>> 
>> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
>> ---
>> arch/powerpc/platforms/pseries/Kconfig |  1 +
>> arch/powerpc/platforms/pseries/setup.c | 28 +++++++++++
>> include/linux/vcpu_preempt.h           | 90 ++++++++++++++++++++++++++++++++++
>> 3 files changed, 119 insertions(+)
>> create mode 100644 include/linux/vcpu_preempt.h
>> 
>> diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
>> index bec90fb30425..7d24c3e48878 100644
>> --- a/arch/powerpc/platforms/pseries/Kconfig
>> +++ b/arch/powerpc/platforms/pseries/Kconfig
>> @@ -21,6 +21,7 @@ config PPC_PSERIES
>> 	select HOTPLUG_CPU if SMP
>> 	select ARCH_RANDOM
>> 	select PPC_DOORBELL
>> +	select HAS_VCPU_PREEMPTION_DETECTION
>> 	default y
>> 
>> config PPC_SPLPAR
>> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
>> index 9883bc7ea007..5d4aed54e039 100644
>> --- a/arch/powerpc/platforms/pseries/setup.c
>> +++ b/arch/powerpc/platforms/pseries/setup.c
>> @@ -42,6 +42,7 @@
>> #include <linux/of.h>
>> #include <linux/of_pci.h>
>> #include <linux/kexec.h>
>> +#include <linux/vcpu_preempt.h>
>> 
>> #include <asm/mmu.h>
>> #include <asm/processor.h>
>> @@ -501,6 +502,31 @@ static void __init find_and_init_phbs(void)
>> 	of_pci_check_probe_only();
>> }
>> 
>> +struct vcpu_preempt_ops vcpu_preempt_ops = DEFAULT_VCPU_PREEMPT_OPS;
>> +EXPORT_SYMBOL(vcpu_preempt_ops);
>> +
>> +static long pseries_vcpu_preempt_count(void)
>> +{
>> +	return be32_to_cpu(get_lppaca()->yield_count);
>> +}
>> +
>> +static bool pseries_vcpu_is_preempted(int cpu)
>> +{
>> +	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
>> +}
>> +
>> +static bool pseries_vcpu_has_preempted(long vpc)
>> +{
>> +	return pseries_vcpu_preempt_count() != vpc;
>> +}
>> +
>> +static void __init pseries_setup_vcpu_preempt_ops(void)
>> +{
>> +	vcpu_preempt_ops.preempt_count = pseries_vcpu_preempt_count;
>> +	vcpu_preempt_ops.is_preempted = pseries_vcpu_is_preempted;
>> +	vcpu_preempt_ops.has_preempted = pseries_vcpu_has_preempted;
>> +}
>> +
>> static void __init pSeries_setup_arch(void)
>> {
>> 	set_arch_panic_timeout(10, ARCH_PANIC_TIMEOUT);
>> @@ -549,6 +575,8 @@ static void __init pSeries_setup_arch(void)
>> 				"%ld\n", rc);
>> 		}
>> 	}
>> +
>> +	pseries_setup_vcpu_preempt_ops();
>> }
>> 
>> static int __init pSeries_init_panel(void)
>> diff --git a/include/linux/vcpu_preempt.h b/include/linux/vcpu_preempt.h
>> new file mode 100644
>> index 000000000000..4a88414bb83f
>> --- /dev/null
>> +++ b/include/linux/vcpu_preempt.h
>> @@ -0,0 +1,90 @@
>> +/*
>> + * Primitives for checking the vcpu preemption from the guest.
>> + */
>> +
>> +static long __vcpu_preempt_count(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +static bool __vcpu_has_preempted(long vpc)
>> +{
>> +	return false;
>> +}
>> +
>> +static bool __vcpu_is_preempted(int cpu)
>> +{
>> +	return false;
>> +}
>> +
>> +struct vcpu_preempt_ops {
>> +	/*
>> +	 * Get the current vcpu's "preempt count", which is going to use for
>> +	 * checking whether the current vcpu has ever been preempted
>> +	 */
>> +	long (*preempt_count)(void);
>> +
>> +	/*
>> +	 * Return whether a vcpu is preempted
>> +	 */
>> +	bool (*is_preempted)(int cpu);
>> +
>> +	/*
>> +	 * Given a "vcpu preempt count", Return whether a vcpu preemption ever
>> +	 * happened after the .preempt_count() was called.
>> +	 */
>> +	bool (*has_preempted)(long vpc);
>> +};
>> +
>> +struct vcpu_preempt_ops vcpu_preempt_ops;
>> +
> 
> This should be:
> 
> extern struct vcpu_preempt_ops vcpu_preempt_ops;
> 
> And I tested this one along with modified version of Xinhui's patch.
> 
> The test showed that even in a not over-committed system, the overhead
> of mutex_lock() could go down from 2.58%(w/o patches) to 0.87%(w/
> patches).
> 
> Xinhui, I did also modifiy the title and commit log of your patch, feel
> free to take anything into your patch.
> 

thanks boqun, I will review your patch.

> Regards,
> Boqun
> 
> ----------------->8
> locking/osq: Stop spinning in osq_lock if related vcpus get preempted
> 
> In a virtualized environment, a cpu may hold the osq_lock while its vcpu
> gets preempted, which makes the spinning of other cpus in the osq
> pointless, because it may be quite a while for the holder to get back to
> run and notify its successor to go.
> 
> To fix this, we use the vcpu preemption detection mechanism to check
> whether the holder is preempted out, if so we just stop spinning in the
> osq_lock(). Morever, if the spinner has ever been preempted, it also
> indicates that the spining in osq_lock() is pointless, so we detect this
> and stop spinning too.
> 
> Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
> ---
> kernel/locking/osq_lock.c | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 05a37857ab55..bed5a56a6aed 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -1,6 +1,7 @@
> #include <linux/percpu.h>
> #include <linux/sched.h>
> #include <linux/osq_lock.h>
> +#include <linux/vcpu_preempt.h>
> 
> /*
>  * An MCS like lock especially tailored for optimistic spinning for sleeping
> @@ -87,6 +88,8 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> 	struct optimistic_spin_node *prev, *next;
> 	int curr = encode_cpu(smp_processor_id());
> 	int old;
> +	int loops;
> +	long vpc;
> 
> 	node->locked = 0;
> 	node->next = NULL;
> @@ -106,6 +109,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> 	node->prev = prev;
> 	WRITE_ONCE(prev->next, node);
> 
> +	old = old - 1;
> +	vpc = vcpu_preempt_count();
> +
> 	/*
> 	 * Normally @prev is untouchable after the above store; because at that
> 	 * moment unlock can proceed and wipe the node element from stack.
> @@ -118,8 +124,14 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> 	while (!READ_ONCE(node->locked)) {
> 		/*
> 		 * If we need to reschedule bail... so we can block.
> +		 * An over-committed guest with more vCPUs than pCPUs
> +		 * might fall in this loop and cause a huge overload.
> +		 * This is because vCPU A(prev) hold the osq lock and yield out,
> +		 * vCPU B(node) wait ->locked to be set, IOW, wait till
> +		 * vCPU A run and unlock the osq lock.
> +		 * NOTE that vCPU A and vCPU B might run on same physical cpu.
> 		 */
> -		if (need_resched())
> +		if (need_resched() || vcpu_is_preempted(old) || vcpu_has_preempted(vpc))
> 			goto unqueue;
> 

the prev might change, so we need  read node->prev every loop, then check vcpu preempted.

> 		cpu_relax_lowlatency();

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-26  6:58                 ` panxinhui
@ 2016-06-26 14:11                   ` Boqun Feng
  2016-06-26 15:54                     ` panxinhui
  0 siblings, 1 reply; 31+ messages in thread
From: Boqun Feng @ 2016-06-26 14:11 UTC (permalink / raw)
  To: panxinhui
  Cc: Peter Zijlstra, Pan Xinhui, linux-kernel, mingo, dave,
	will.deacon, Waiman.Long, benh

[-- Attachment #1: Type: text/plain, Size: 9473 bytes --]

On Sun, Jun 26, 2016 at 02:58:00PM +0800, panxinhui wrote:
[snip]
> >> 
> > 
> > Unfortunately, on PPC, we compile pseries code along with powernv code
> > in a kernel binary, therefore we need to wire the proper primitives at
> > runtime.
> > 
> no, we can use same vcpu preempted check ops actually.
> just like how arch_spin_lock() does.
> 
> if (SHARED_PROCESSOR)
>                                 __spin_yield(lock); 
> 
> so we can
> u32 arch_get_vcpu_yield_count(int cpu)
> {
> 	if (SHARED_PROCESSOR)

SHARED_PROCESSOR is true only if a PPC_PSERIES kernel runs in shared
processor mode. Using this here will rule out the situation where a
PPC_PSERIES kernel(ppc guest) runs in dedicated processor mode. I don't
think we can rule out the dedicated processor mode, because even in the
dedicated mode, the lock holder may still be preempted. So why? Why do
we only want the yield_count in the shared processor mode?

>                                 return be32_to_cpu(lppaca_of(cpu).yield_count;
> 	return 0;
> }
> 
> and all these things can be don inline then.
> 
> > I've cooked the following patch, please note I haven't test this, this
> > is just a POC.
> > 
> > Regards,
> > Boqun
> > 
> > ---------------->8
> > virt: Introduce vcpu preemption detection functions
> > 
> > Lock holder preemption is an serious problem in virtualized environment
> > for locking. Different archs and hypervisors employ different ways to
> > try to solve the problem in different locking mechanisms. And recently
> > Pan Xinhui found a significant performance drop in osq_lock(), which
> > could be solved by providing a mechanism to detect the preemption of
> > vcpu.
> > 
> > Therefore this patch introduces several vcpu preemption detection
> > primitives, which allows locking primtives to save the time of
> > unnecesarry spinning. These primitives act as an abstract layer between
> > locking functions and arch or hypervisor related code.
> > 
> > There are two sets of primitives:
> > 
> > 1.	vcpu_preempt_count() and vcpu_has_preempted(), they must be used
> > 	pairwisely in a same preempt disable critical section. And they
> > 	could detect whether a vcpu preemption happens between them.
> > 
> > 2.	vcpu_is_preempted(), used to check whether other cpu's vcpu is
> > 	preempted.
> > 
> > This patch also implements those primitives on pseries and wire them up.
> > 
> > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> > ---
> > arch/powerpc/platforms/pseries/Kconfig |  1 +
> > arch/powerpc/platforms/pseries/setup.c | 28 +++++++++++
> > include/linux/vcpu_preempt.h           | 90 ++++++++++++++++++++++++++++++++++
> > 3 files changed, 119 insertions(+)
> > create mode 100644 include/linux/vcpu_preempt.h
> > 
> > diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
> > index bec90fb30425..7d24c3e48878 100644
> > --- a/arch/powerpc/platforms/pseries/Kconfig
> > +++ b/arch/powerpc/platforms/pseries/Kconfig
> > @@ -21,6 +21,7 @@ config PPC_PSERIES
> > 	select HOTPLUG_CPU if SMP
> > 	select ARCH_RANDOM
> > 	select PPC_DOORBELL
> > +	select HAS_VCPU_PREEMPTION_DETECTION
> > 	default y
> > 
> it is already too many config there…

Why is this a problem? It's a Kconfig file, there are not too many
readability issues here. It simply serve as a database for the
relationships between different config options. menuconfig and its
friends can do their job well. BTW, have you read config X86? ;-)

> as I comment above, we could make these things inline, so looks like we needn't to add such config.
>  

Another reason, I didn't make those primitives arch-specific inline
functions, is that they are actually not only arch-specific but also
(virtual-)environment-specific, or hypervisor-specific, that is their
implementations differ when the kernel runs on different hypervisors(KVM
and Xen, for example).

If you make the implementation part inline functions, you end up
handling different virtual environment every time the functions are
called, which is a waste of time. We are lucky because phyp and kvm
share the same interface of yield count, but this might not be true for
other archs and other hypervisors.

> > config PPC_SPLPAR
> > diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> > index 9883bc7ea007..5d4aed54e039 100644
> > --- a/arch/powerpc/platforms/pseries/setup.c
> > +++ b/arch/powerpc/platforms/pseries/setup.c
> > @@ -42,6 +42,7 @@
> > #include <linux/of.h>
> > #include <linux/of_pci.h>
> > #include <linux/kexec.h>
> > +#include <linux/vcpu_preempt.h>
> > 
> > #include <asm/mmu.h>
> > #include <asm/processor.h>
> > @@ -501,6 +502,31 @@ static void __init find_and_init_phbs(void)
> > 	of_pci_check_probe_only();
> > }
> > 
> > +struct vcpu_preempt_ops vcpu_preempt_ops = DEFAULT_VCPU_PREEMPT_OPS;
> > +EXPORT_SYMBOL(vcpu_preempt_ops);
> > +
> > +static long pseries_vcpu_preempt_count(void)
> > +{
> > +	return be32_to_cpu(get_lppaca()->yield_count);
> > +}
> > +
> > +static bool pseries_vcpu_is_preempted(int cpu)
> > +{
> > +	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
> > +}
> > +
> > +static bool pseries_vcpu_has_preempted(long vpc)
> > +{
> > +	return pseries_vcpu_preempt_count() != vpc;
> > +}
> > +
> > +static void __init pseries_setup_vcpu_preempt_ops(void)
> > +{
> > +	vcpu_preempt_ops.preempt_count = pseries_vcpu_preempt_count;
> > +	vcpu_preempt_ops.is_preempted = pseries_vcpu_is_preempted;
> > +	vcpu_preempt_ops.has_preempted = pseries_vcpu_has_preempted;
> > +}
> > +
> > static void __init pSeries_setup_arch(void)
> > {
> > 	set_arch_panic_timeout(10, ARCH_PANIC_TIMEOUT);
> > @@ -549,6 +575,8 @@ static void __init pSeries_setup_arch(void)
> > 				"%ld\n", rc);
> > 		}
> > 	}
> > +
> > +	pseries_setup_vcpu_preempt_ops();
> > }
> > 
> > static int __init pSeries_init_panel(void)
> > diff --git a/include/linux/vcpu_preempt.h b/include/linux/vcpu_preempt.h
> > new file mode 100644
> > index 000000000000..4a88414bb83f
> > --- /dev/null
> > +++ b/include/linux/vcpu_preempt.h
> 
> How about add them in sched.h

I don't have a strong opinion on which header file to use. TBH, I was
trying to use some header files for parvirt stuff, but failed to find
one, so I add a new one because I don't want people confused between
preemption of vcpus and preemption of tasks. But once again, I don't
have a strong opinion ;-)

> I just think  they are similar. correct me if  I got a mistake :)
> 
> thanks
> xinhui
> 
> > @@ -0,0 +1,90 @@
> > +/*
> > + * Primitives for checking the vcpu preemption from the guest.
> > + */
> > +
> > +static long __vcpu_preempt_count(void)
> > +{
> > +	return 0;
> > +}
> > +
> > +static bool __vcpu_has_preempted(long vpc)
> > +{
> > +	return false;
> > +}
> > +
> > +static bool __vcpu_is_preempted(int cpu)
> > +{
> > +	return false;
> > +}
> > +
> > +struct vcpu_preempt_ops {
> > +	/*
> > +	 * Get the current vcpu's "preempt count", which is going to use for
> > +	 * checking whether the current vcpu has ever been preempted
> > +	 */
> > +	long (*preempt_count)(void);
> > +
> > +	/*
> > +	 * Return whether a vcpu is preempted
> > +	 */
> > +	bool (*is_preempted)(int cpu);
> > +
> > +	/*
> > +	 * Given a "vcpu preempt count", Return whether a vcpu preemption ever
> > +	 * happened after the .preempt_count() was called.
> > +	 */
> > +	bool (*has_preempted)(long vpc);
> > +};
> > +
> > +struct vcpu_preempt_ops vcpu_preempt_ops;
> > +
> > +/* Default boilerplate */
> > +#define DEFAULT_VCPU_PREEMPT_OPS			\
> > +	{						\
> > +		.preempt_count = __vcpu_preempt_count,	\
> > +		.is_preempted = __vcpu_is_preempted,	\
> > +		.has_preempted = __vcpu_has_preempted	\
> > +	}
> > +
> > +#ifdef CONFIG_HAS_VCPU_PREEMPTION_DETECTION
> > +/*
> > + * vcpu_preempt_count: Get the current cpu's "vcpu preempt count"(vpc).
> > + *
> > + * The vpc is used for checking whether the current vcpu has ever been
> > + * preempted via vcpu_has_preempted().
> > + *
> > + * This function and vcpu_has_preepmted() should be called in the same
> > + * preemption disabled critical section.
> > + */
> > +static long vcpu_preempt_count(void)
> > +{
> > +	return vcpu_preempt_ops.preempt_count();
> > +}
> > +
> > +/*
> > + * vcpu_is_preempted: Check whether @cpu's vcpu is preempted.
> > + */
> > +static bool vcpu_is_preempted(int cpu)
> > +{
> > +	return vcpu_preempt_ops.is_preempted(cpu);
> > +}
> > +
> > +/*
> > + * vcpu_has_preepmted: Check whether the current cpu's vcpu has ever been
> > + * preempted.
> > + *
> > + * The checked duration is between the vcpu_preempt_count() which returns @vpc
> > + * is called and this function called.
> > + *
> > + * This function and corresponding vcpu_preempt_count() should be in the same
> > + * preemption disabled cirtial section.
> > + */
> > +static bool vcpu_has_preempted(long vpc)
> > +{
> > +	return vcpu_preempt_ops.has_preempt(vpc);
> > +}
> > +#else /* CONFIG_HAS_VCPU_PREEMPTION_DETECTION */
> > +#define vcpu_preempt_count() __vcpu_preempt_count()
> > +#define vcpu_is_preempted(cpu) __vcpu_is_preempted(cpu)
> > +#define vcpu_has_preempted(vpc) __vcpu_has_preempted(vpc)
> > +#endif /* CONFIG_HAS_VCPU_PREEPMTION_DETECTION */
> > -- 
> > 2.9.0
> > 
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-26  7:08                   ` panxinhui
@ 2016-06-26 14:29                     ` Boqun Feng
  2016-06-26 15:11                       ` panxinhui
  0 siblings, 1 reply; 31+ messages in thread
From: Boqun Feng @ 2016-06-26 14:29 UTC (permalink / raw)
  To: panxinhui
  Cc: Peter Zijlstra, Pan Xinhui, linux-kernel, mingo, dave,
	will.deacon, Waiman.Long, benh

[-- Attachment #1: Type: text/plain, Size: 1642 bytes --]

On Sun, Jun 26, 2016 at 03:08:20PM +0800, panxinhui wrote:
[snip]
> > @@ -106,6 +109,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> > 	node->prev = prev;
> > 	WRITE_ONCE(prev->next, node);
> > 
> > +	old = old - 1;
> > +	vpc = vcpu_preempt_count();
> > +
> > 	/*
> > 	 * Normally @prev is untouchable after the above store; because at that
> > 	 * moment unlock can proceed and wipe the node element from stack.
> > @@ -118,8 +124,14 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> > 	while (!READ_ONCE(node->locked)) {
> > 		/*
> > 		 * If we need to reschedule bail... so we can block.
> > +		 * An over-committed guest with more vCPUs than pCPUs
> > +		 * might fall in this loop and cause a huge overload.
> > +		 * This is because vCPU A(prev) hold the osq lock and yield out,
> > +		 * vCPU B(node) wait ->locked to be set, IOW, wait till
> > +		 * vCPU A run and unlock the osq lock.
> > +		 * NOTE that vCPU A and vCPU B might run on same physical cpu.
> > 		 */
> > -		if (need_resched())
> > +		if (need_resched() || vcpu_is_preempted(old) || vcpu_has_preempted(vpc))
> > 			goto unqueue;
> > 
> 
> the prev might change, so we need  read node->prev every loop, then check vcpu preempted.
> 

Right you are on the possibility of the prev's change, however, even if
we reread node->prev, the prev is still not stable after we read, that
is the prev can change after we read in the loop and before we check the
vcpu preemption in the next loop, therefore whether the reread is
worthwhile, depends on some real tests I think.

Regards,
Boqun

> > 		cpu_relax_lowlatency();
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-26 14:29                     ` Boqun Feng
@ 2016-06-26 15:11                       ` panxinhui
  0 siblings, 0 replies; 31+ messages in thread
From: panxinhui @ 2016-06-26 15:11 UTC (permalink / raw)
  To: Boqun Feng
  Cc: panxinhui, Peter Zijlstra, Pan Xinhui, linux-kernel, mingo, dave,
	will.deacon, Waiman.Long, benh


> 在 2016年6月26日,22:29,Boqun Feng <boqun.feng@gmail.com> 写道:
> 
> On Sun, Jun 26, 2016 at 03:08:20PM +0800, panxinhui wrote:
> [snip]
>>> @@ -106,6 +109,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>>> 	node->prev = prev;
>>> 	WRITE_ONCE(prev->next, node);
>>> 
>>> +	old = old - 1;
>>> +	vpc = vcpu_preempt_count();
>>> +
>>> 	/*
>>> 	 * Normally @prev is untouchable after the above store; because at that
>>> 	 * moment unlock can proceed and wipe the node element from stack.
>>> @@ -118,8 +124,14 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>>> 	while (!READ_ONCE(node->locked)) {
>>> 		/*
>>> 		 * If we need to reschedule bail... so we can block.
>>> +		 * An over-committed guest with more vCPUs than pCPUs
>>> +		 * might fall in this loop and cause a huge overload.
>>> +		 * This is because vCPU A(prev) hold the osq lock and yield out,
>>> +		 * vCPU B(node) wait ->locked to be set, IOW, wait till
>>> +		 * vCPU A run and unlock the osq lock.
>>> +		 * NOTE that vCPU A and vCPU B might run on same physical cpu.
>>> 		 */
>>> -		if (need_resched())
>>> +		if (need_resched() || vcpu_is_preempted(old) || vcpu_has_preempted(vpc))
>>> 			goto unqueue;
>>> 
>> 
>> the prev might change, so we need  read node->prev every loop, then check vcpu preempted.
>> 
> 
> Right you are on the possibility of the prev's change, however, even if
> we reread node->prev, the prev is still not stable after we read, that
> is the prev can change after we read in the loop and before we check the

but the next loops we will detect the owner is preempted, right?

> vcpu preemption in the next loop, therefore whether the reread is

well,think the case below.

3 vcpu  try to get the osq lock. IF you don’t re-read the prev.

A				B 									C	
preempted 	
			detect A preempted, then break loops 
			and do unqueue
											spin loops, because B (the prev ) is running,



			preempted
											detect B preempted then break loops and do queue.// looks like it’ too late to break the loops 
running


SO we can get the rules, 
IF the queue nodes looks like
vcpu A,B,C,D,E,F,G,….. more spinners

say, if vcpu  B,F is preempted, then, we could know vcpu C,D,E should stop the spinning. right? and all other vcpu after F, say G,H,…should stop spinning too.
 

thanks
xinhui

> Regards,
> Boqun
> 
>>> 		cpu_relax_lowlatency();
>> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-26 14:11                   ` Boqun Feng
@ 2016-06-26 15:54                     ` panxinhui
  0 siblings, 0 replies; 31+ messages in thread
From: panxinhui @ 2016-06-26 15:54 UTC (permalink / raw)
  To: Boqun Feng
  Cc: panxinhui, Peter Zijlstra, Pan Xinhui, linux-kernel, mingo, dave,
	will.deacon, Waiman.Long, benh


> 在 2016年6月26日,22:11,Boqun Feng <boqun.feng@gmail.com> 写道:
> 
> On Sun, Jun 26, 2016 at 02:58:00PM +0800, panxinhui wrote:
> [snip]
>>>> 
>>> 
>>> Unfortunately, on PPC, we compile pseries code along with powernv code
>>> in a kernel binary, therefore we need to wire the proper primitives at
>>> runtime.
>>> 
>> no, we can use same vcpu preempted check ops actually.
>> just like how arch_spin_lock() does.
>> 
>> if (SHARED_PROCESSOR)
>>                                __spin_yield(lock); 
>> 
>> so we can
>> u32 arch_get_vcpu_yield_count(int cpu)
>> {
>> 	if (SHARED_PROCESSOR)
> 
> SHARED_PROCESSOR is true only if a PPC_PSERIES kernel runs in shared
> processor mode. Using this here will rule out the situation where a
> PPC_PSERIES kernel(ppc guest) runs indedicated processor mode. I don’t
> think we can rule out the dedicated processor mode, because even in the
well, copied from the codes

/*
 * We are using a non architected field to determine if a partition is
 * shared or dedicated. This currently works on both KVM and PHYP, but
 * we will have to transition to something better.
 */
#define LPPACA_OLD_SHARED_PROC          2

static inline bool lppaca_shared_proc(struct lppaca *l)
{
        return !!(l->__old_status & LPPACA_OLD_SHARED_PROC);
}
 
OKay, I have no idea about the difference between dedicated processor mode and the so-called shared..
Let me talk to you f2f monday….


> dedicated mode, the lock holder may still be preempted. So why? Why do
> we only want the yield_count in the shared processor mode?
> 
>>                                return be32_to_cpu(lppaca_of(cpu).yield_count;
>> 	return 0;
>> }
>> 
>> and all these things can be don inline then.
>> 
>>> I've cooked the following patch, please note I haven't test this, this
>>> is just a POC.
>>> 
>>> Regards,
>>> Boqun
>>> 
>>> ---------------->8
>>> virt: Introduce vcpu preemption detection functions
>>> 
>>> Lock holder preemption is an serious problem in virtualized environment
>>> for locking. Different archs and hypervisors employ different ways to
>>> try to solve the problem in different locking mechanisms. And recently
>>> Pan Xinhui found a significant performance drop in osq_lock(), which
>>> could be solved by providing a mechanism to detect the preemption of
>>> vcpu.
>>> 
>>> Therefore this patch introduces several vcpu preemption detection
>>> primitives, which allows locking primtives to save the time of
>>> unnecesarry spinning. These primitives act as an abstract layer between
>>> locking functions and arch or hypervisor related code.
>>> 
>>> There are two sets of primitives:
>>> 
>>> 1.	vcpu_preempt_count() and vcpu_has_preempted(), they must be used
>>> 	pairwisely in a same preempt disable critical section. And they
>>> 	could detect whether a vcpu preemption happens between them.
>>> 
>>> 2.	vcpu_is_preempted(), used to check whether other cpu's vcpu is
>>> 	preempted.
>>> 
>>> This patch also implements those primitives on pseries and wire them up.
>>> 
>>> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
>>> ---
>>> arch/powerpc/platforms/pseries/Kconfig |  1 +
>>> arch/powerpc/platforms/pseries/setup.c | 28 +++++++++++
>>> include/linux/vcpu_preempt.h           | 90 ++++++++++++++++++++++++++++++++++
>>> 3 files changed, 119 insertions(+)
>>> create mode 100644 include/linux/vcpu_preempt.h
>>> 
>>> diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
>>> index bec90fb30425..7d24c3e48878 100644
>>> --- a/arch/powerpc/platforms/pseries/Kconfig
>>> +++ b/arch/powerpc/platforms/pseries/Kconfig
>>> @@ -21,6 +21,7 @@ config PPC_PSERIES
>>> 	select HOTPLUG_CPU if SMP
>>> 	select ARCH_RANDOM
>>> 	select PPC_DOORBELL
>>> +	select HAS_VCPU_PREEMPTION_DETECTION
>>> 	default y
>>> 
>> it is already too many config there…
> 
> Why is this a problem? It's a Kconfig file, there are not too many
> readability issues here. It simply serve as a database for the
> relationships between different config options. menuconfig and its
just a personal favor…

I also a little hate the callback ops if it is not necessary to use, as it breaks my reading of the codes :-
AND same reason, I don’t want to vim .config then /any config_

anyway, this is not a big problem…


> friends can do their job well. BTW, have you read config X86? ;-)
> 
>> as I comment above, we could make these things inline, so looks like we needn't to add such config.
>> 
> 
> Another reason, I didn't make those primitives arch-specific inline
> functions, is that they are actually not only arch-specific but also
> (virtual-)environment-specific, or hypervisor-specific, that is their
> implementations differ when the kernel runs on different hypervisors(KVM
> and Xen, for example).
> 
> If you make the implementation part inline functions, you end up
> handling different virtual environment every time the functions are
> called, which is a waste of time. We are lucky because phyp and kvm
well, the call is also expensive.

> share the same interface of yield count, but this might not be true for
> other archs and other hypervisors.
> 

fair enough.

>>> config PPC_SPLPAR
>>> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
>>> index 9883bc7ea007..5d4aed54e039 100644
>>> --- a/arch/powerpc/platforms/pseries/setup.c
>>> +++ b/arch/powerpc/platforms/pseries/setup.c
>>> @@ -42,6 +42,7 @@
>>> #include <linux/of.h>
>>> #include <linux/of_pci.h>
>>> #include <linux/kexec.h>
>>> +#include <linux/vcpu_preempt.h>
>>> 
>>> #include <asm/mmu.h>
>>> #include <asm/processor.h>
>>> @@ -501,6 +502,31 @@ static void __init find_and_init_phbs(void)
>>> 	of_pci_check_probe_only();
>>> }
>>> 
>>> +struct vcpu_preempt_ops vcpu_preempt_ops = DEFAULT_VCPU_PREEMPT_OPS;
>>> +EXPORT_SYMBOL(vcpu_preempt_ops);
>>> +
>>> +static long pseries_vcpu_preempt_count(void)
>>> +{
>>> +	return be32_to_cpu(get_lppaca()->yield_count);
>>> +}
>>> +
>>> +static bool pseries_vcpu_is_preempted(int cpu)
>>> +{
>>> +	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
>>> +}
>>> +
>>> +static bool pseries_vcpu_has_preempted(long vpc)
>>> +{
>>> +	return pseries_vcpu_preempt_count() != vpc;
>>> +}
>>> +
>>> +static void __init pseries_setup_vcpu_preempt_ops(void)
>>> +{
>>> +	vcpu_preempt_ops.preempt_count = pseries_vcpu_preempt_count;
>>> +	vcpu_preempt_ops.is_preempted = pseries_vcpu_is_preempted;
>>> +	vcpu_preempt_ops.has_preempted = pseries_vcpu_has_preempted;
>>> +}
>>> +
>>> static void __init pSeries_setup_arch(void)
>>> {
>>> 	set_arch_panic_timeout(10, ARCH_PANIC_TIMEOUT);
>>> @@ -549,6 +575,8 @@ static void __init pSeries_setup_arch(void)
>>> 				"%ld\n", rc);
>>> 		}
>>> 	}
>>> +
>>> +	pseries_setup_vcpu_preempt_ops();
>>> }
>>> 
>>> static int __init pSeries_init_panel(void)
>>> diff --git a/include/linux/vcpu_preempt.h b/include/linux/vcpu_preempt.h
>>> new file mode 100644
>>> index 000000000000..4a88414bb83f
>>> --- /dev/null
>>> +++ b/include/linux/vcpu_preempt.h
>> 
>> How about add them in sched.h
> 
> I don't have a strong opinion on which header file to use. TBH, I was
> trying to use some header files for parvirt stuff, but failed to find
> one, so I add a new one because I don't want people confused between
> preemption of vcpus and preemption of tasks. But once again, I don't
> have a strong opinion ;-)
> 
>> I just think  they are similar. correct me if  I got a mistake :)
>> 
>> thanks
>> xinhui
>> 
>>> @@ -0,0 +1,90 @@
>>> +/*
>>> + * Primitives for checking the vcpu preemption from the guest.
>>> + */
>>> +
>>> +static long __vcpu_preempt_count(void)
>>> +{
>>> +	return 0;
>>> +}
>>> +
>>> +static bool __vcpu_has_preempted(long vpc)
>>> +{
>>> +	return false;
>>> +}
>>> +
>>> +static bool __vcpu_is_preempted(int cpu)
>>> +{
>>> +	return false;
>>> +}
>>> +
>>> +struct vcpu_preempt_ops {
>>> +	/*
>>> +	 * Get the current vcpu's "preempt count", which is going to use for
>>> +	 * checking whether the current vcpu has ever been preempted
>>> +	 */
>>> +	long (*preempt_count)(void);
>>> +

>>> 
>>> +	/*
>>> +	 * Return whether a vcpu is preempted
>>> +	 */
>>> +	bool (*is_preempted)(int cpu);
>>> +
>>> +	/*
>>> +	 * Given a "vcpu preempt count", Return whether a vcpu preemption ever
>>> +	 * happened after the .preempt_count() was called.
>>> +	 */
>>> +	bool (*has_preempted)(long vpc);
>>> +};
>>> +
>>> +struct vcpu_preempt_ops vcpu_preempt_ops;
>>> +
>>> +/* Default boilerplate */
>>> +#define DEFAULT_VCPU_PREEMPT_OPS			\
>>> +	{						\
>>> +		.preempt_count = __vcpu_preempt_count,	\
>>> +		.is_preempted = __vcpu_is_preempted,	\
>>> +		.has_preempted = __vcpu_has_preempted	\
>>> +	}
>>> +
>>> +#ifdef CONFIG_HAS_VCPU_PREEMPTION_DETECTION
>>> +/*
>>> + * vcpu_preempt_count: Get the current cpu's "vcpu preempt count"(vpc).
>>> + *
>>> + * The vpc is used for checking whether the current vcpu has ever been
>>> + * preempted via vcpu_has_preempted().
>>> + *
>>> + * This function and vcpu_has_preepmted() should be called in the same
>>> + * preemption disabled critical section.
>>> + */
>>> +static long vcpu_preempt_count(void)
>>> +{
>>> +	return vcpu_preempt_ops.preempt_count();
>>> +}
>>> +
>>> +/*
>>> + * vcpu_is_preempted: Check whether @cpu's vcpu is preempted.
>>> + */
>>> +static bool vcpu_is_preempted(int cpu)
>>> +{
>>> +	return vcpu_preempt_ops.is_preempted(cpu);
>>> +}
>>> +
>>> +/*
>>> + * vcpu_has_preepmted: Check whether the current cpu's vcpu has ever been
>>> + * preempted.
>>> + *
>>> + * The checked duration is between the vcpu_preempt_count() which returns @vpc
>>> + * is called and this function called.
>>> + *
>>> + * This function and corresponding vcpu_preempt_count() should be in the same
>>> + * preemption disabled cirtial section.
>>> + */
>>> +static bool vcpu_has_preempted(long vpc)
>>> +{
>>> +	return vcpu_preempt_ops.has_preempt(vpc);
>>> +}
>>> +#else /* CONFIG_HAS_VCPU_PREEMPTION_DETECTION */
>>> +#define vcpu_preempt_count() __vcpu_preempt_count()
>>> +#define vcpu_is_preempted(cpu) __vcpu_is_preempted(cpu)
>>> +#define vcpu_has_preempted(vpc) __vcpu_has_preempted(vpc)
>>> +#endif /* CONFIG_HAS_VCPU_PREEPMTION_DETECTION */
>>> -- 
>>> 2.9.0
>>> 
>> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-26  6:59                 ` Boqun Feng
  2016-06-26  7:08                   ` panxinhui
@ 2016-06-27  6:45                   ` Boqun Feng
  2016-06-27  7:36                     ` xinhui
  2016-06-27  8:09                     ` Peter Zijlstra
  1 sibling, 2 replies; 31+ messages in thread
From: Boqun Feng @ 2016-06-27  6:45 UTC (permalink / raw)
  To: panxinhui
  Cc: Peter Zijlstra, Pan Xinhui, linux-kernel, mingo, dave,
	will.deacon, Waiman.Long, benh

On Sun, Jun 26, 2016 at 02:59:26PM +0800, Boqun Feng wrote:
[snip]
> 
> This should be:
> 
> extern struct vcpu_preempt_ops vcpu_preempt_ops;
> 
> And I tested this one along with modified version of Xinhui's patch.
> 
> The test showed that even in a not over-committed system, the overhead
> of mutex_lock() could go down from 2.58%(w/o patches) to 0.87%(w/
> patches).
> 

Clearly, I made several more mistakes in the patch, and the vcpu
preemption could never been activated, so please ignore this bench
result.

Here is a new version of vcpu preempt. Thoughts?

Regards,
Boqun

---------------->8
virt: Introduce vcpu preemption detection functions

Lock holder preemption is an serious problem in virtualized environment
for locking. Different archs and hypervisors employ different ways to
try to solve the problem in different locking mechanisms. And recently
Pan Xinhui found a significant performance drop in osq_lock(), which
could be solved by providing a mechanism to detect the preemption of
vcpu.

Therefore this patch introduces several vcpu preemption detection
primitives, which allows locking primtives to save the time of
unnecesarry spinning. These primitives act as an abstract layer between
locking functions and arch or hypervisor related code.

There are two sets of primitives:

1.	vcpu_preempt_count() and vcpu_has_preempted(), they must be used
	pairwisely in a same preempt disable critical section. And they
	could detect whether a vcpu preemption happens between them.

2.	vcpu_is_preempted(), used to check whether other cpu's vcpu is
	preempted.

This patch also implements those primitives on pseries and wire them up.


Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/powerpc/platforms/pseries/Kconfig |  1 +
 arch/powerpc/platforms/pseries/setup.c | 28 ++++++++++++
 include/linux/vcpu_preempt.h           | 82 ++++++++++++++++++++++++++++++++++
 kernel/Kconfig.preempt                 |  5 ++-
 4 files changed, 115 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/vcpu_preempt.h

diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
index bec90fb30425..7d24c3e48878 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -21,6 +21,7 @@ config PPC_PSERIES
 	select HOTPLUG_CPU if SMP
 	select ARCH_RANDOM
 	select PPC_DOORBELL
+	select HAS_VCPU_PREEMPTION_DETECTION
 	default y
 
 config PPC_SPLPAR
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 9883bc7ea007..5d4aed54e039 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -42,6 +42,7 @@
 #include <linux/of.h>
 #include <linux/of_pci.h>
 #include <linux/kexec.h>
+#include <linux/vcpu_preempt.h>
 
 #include <asm/mmu.h>
 #include <asm/processor.h>
@@ -501,6 +502,31 @@ static void __init find_and_init_phbs(void)
 	of_pci_check_probe_only();
 }
 
+struct vcpu_preempt_ops vcpu_preempt_ops = DEFAULT_VCPU_PREEMPT_OPS;
+EXPORT_SYMBOL(vcpu_preempt_ops);
+
+static long pseries_vcpu_preempt_count(void)
+{
+	return be32_to_cpu(get_lppaca()->yield_count);
+}
+
+static bool pseries_vcpu_is_preempted(int cpu)
+{
+	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+
+static bool pseries_vcpu_has_preempted(long vpc)
+{
+	return pseries_vcpu_preempt_count() != vpc;
+}
+
+static void __init pseries_setup_vcpu_preempt_ops(void)
+{
+	vcpu_preempt_ops.preempt_count = pseries_vcpu_preempt_count;
+	vcpu_preempt_ops.is_preempted = pseries_vcpu_is_preempted;
+	vcpu_preempt_ops.has_preempted = pseries_vcpu_has_preempted;
+}
+
 static void __init pSeries_setup_arch(void)
 {
 	set_arch_panic_timeout(10, ARCH_PANIC_TIMEOUT);
@@ -549,6 +575,8 @@ static void __init pSeries_setup_arch(void)
 				"%ld\n", rc);
 		}
 	}
+
+	pseries_setup_vcpu_preempt_ops();
 }
 
 static int __init pSeries_init_panel(void)
diff --git a/include/linux/vcpu_preempt.h b/include/linux/vcpu_preempt.h
new file mode 100644
index 000000000000..e76e054a7917
--- /dev/null
+++ b/include/linux/vcpu_preempt.h
@@ -0,0 +1,82 @@
+/*
+ * Primitives for checking the vcpu preemption from the guest.
+ */
+
+static long __vcpu_preempt_count(void)
+{
+	return 0;
+}
+
+static bool __vcpu_has_preempted(long vpc)
+{
+	return false;
+}
+
+static bool __vcpu_is_preempted(int cpu)
+{
+	return false;
+}
+
+struct vcpu_preempt_ops {
+	/*
+	 * Get the current vcpu's "preempt count", which is going to use for
+	 * checking whether the current vcpu has ever been preempted
+	 */
+	long (*preempt_count)(void);
+
+	/*
+	 * Return whether a vcpu is preempted
+	 */
+	bool (*is_preempted)(int cpu);
+
+	/*
+	 * Given a "vcpu preempt count", Return whether a vcpu preemption ever
+	 * happened after the .preempt_count() was called.
+	 */
+	bool (*has_preempted)(long vpc);
+};
+
+extern struct vcpu_preempt_ops vcpu_preempt_ops;
+
+/* Default boilerplate */
+#define DEFAULT_VCPU_PREEMPT_OPS			\
+	{						\
+		.preempt_count = __vcpu_preempt_count,	\
+		.is_preempted = __vcpu_is_preempted,	\
+		.has_preempted = __vcpu_has_preempted	\
+	}
+
+#ifdef CONFIG_HAS_VCPU_PREEMPTION_DETECTION
+/*
+ * vcpu_preempt_count: Get the current cpu's "vcpu preempt count"(vpc).
+ *
+ * The vpc is used for checking whether the current vcpu has ever been
+ * preempted via vcpu_has_preempted().
+ *
+ * This function and vcpu_has_preepmted() should be called in the same
+ * preemption disabled critical section.
+ */
+#define vcpu_preempt_count()	vcpu_preempt_ops.preempt_count()
+
+/*
+ * vcpu_is_preempted: Check whether @cpu's vcpu is preempted.
+ */
+#define vcpu_is_preempted(cpu)	vcpu_preempt_ops.is_preempted(cpu)
+
+/*
+ * vcpu_has_preepmted: Check whether the current cpu's vcpu has ever been
+ * preempted.
+ *
+ * The checked duration is between the vcpu_preempt_count() which returns @vpc
+ * is called and this function called.
+ *
+ * This function and corresponding vcpu_preempt_count() should be in the same
+ * preemption disabled cirtial section.
+ */
+#define vcpu_has_preempted(vpc)	vcpu_preempt_ops.has_preempted(vpc)
+
+#else /* CONFIG_HAS_VCPU_PREEMPTION_DETECTION */
+#define vcpu_preempt_count() __vcpu_preempt_count()
+#define vcpu_is_preempted(cpu) __vcpu_is_preempted(cpu)
+#define vcpu_has_preempted(vpc) __vcpu_has_preempted(vpc)
+#endif /* CONFIG_HAS_VCPU_PREEPMTION_DETECTION */
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 3f9c97419f02..cb88bc3a2fc6 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -55,4 +55,7 @@ config PREEMPT
 endchoice
 
 config PREEMPT_COUNT
-       bool
\ No newline at end of file
+       bool
+
+config HAS_VCPU_PREEMPTION_DETECTION
+       bool
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-27  6:45                   ` Boqun Feng
@ 2016-06-27  7:36                     ` xinhui
  2016-06-27  8:09                     ` Peter Zijlstra
  1 sibling, 0 replies; 31+ messages in thread
From: xinhui @ 2016-06-27  7:36 UTC (permalink / raw)
  To: Boqun Feng, panxinhui
  Cc: Peter Zijlstra, linux-kernel, mingo, dave, will.deacon,
	Waiman.Long, benh



On 2016年06月27日 14:45, Boqun Feng wrote:
> On Sun, Jun 26, 2016 at 02:59:26PM +0800, Boqun Feng wrote:
> [snip]
>>
>> This should be:
>>
>> extern struct vcpu_preempt_ops vcpu_preempt_ops;
>>
>> And I tested this one along with modified version of Xinhui's patch.
>>
>> The test showed that even in a not over-committed system, the overhead
>> of mutex_lock() could go down from 2.58%(w/o patches) to 0.87%(w/
>> patches).
>>
>
> Clearly, I made several more mistakes in the patch, and the vcpu
> preemption could never been activated, so please ignore this bench
> result.
>
> Here is a new version of vcpu preempt. Thoughts?
>
> Regards,
> Boqun
>
> ---------------->8
> virt: Introduce vcpu preemption detection functions
>
> Lock holder preemption is an serious problem in virtualized environment
> for locking. Different archs and hypervisors employ different ways to
> try to solve the problem in different locking mechanisms. And recently
> Pan Xinhui found a significant performance drop in osq_lock(), which
> could be solved by providing a mechanism to detect the preemption of
> vcpu.
>
> Therefore this patch introduces several vcpu preemption detection
> primitives, which allows locking primtives to save the time of
> unnecesarry spinning. These primitives act as an abstract layer between
> locking functions and arch or hypervisor related code.
>
> There are two sets of primitives:
>
> 1.	vcpu_preempt_count() and vcpu_has_preempted(), they must be used
> 	pairwisely in a same preempt disable critical section. And they
> 	could detect whether a vcpu preemption happens between them.
>
> 2.	vcpu_is_preempted(), used to check whether other cpu's vcpu is
> 	preempted.
>
> This patch also implements those primitives on pseries and wire them up.
>
>
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
>   arch/powerpc/platforms/pseries/Kconfig |  1 +
>   arch/powerpc/platforms/pseries/setup.c | 28 ++++++++++++
>   include/linux/vcpu_preempt.h           | 82 ++++++++++++++++++++++++++++++++++
>   kernel/Kconfig.preempt                 |  5 ++-
>   4 files changed, 115 insertions(+), 1 deletion(-)
>   create mode 100644 include/linux/vcpu_preempt.h
>
> diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
> index bec90fb30425..7d24c3e48878 100644
> --- a/arch/powerpc/platforms/pseries/Kconfig
> +++ b/arch/powerpc/platforms/pseries/Kconfig
> @@ -21,6 +21,7 @@ config PPC_PSERIES
>   	select HOTPLUG_CPU if SMP
>   	select ARCH_RANDOM
>   	select PPC_DOORBELL
> +	select HAS_VCPU_PREEMPTION_DETECTION
>   	default y
>
>   config PPC_SPLPAR
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index 9883bc7ea007..5d4aed54e039 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -42,6 +42,7 @@
>   #include <linux/of.h>
>   #include <linux/of_pci.h>
>   #include <linux/kexec.h>
> +#include <linux/vcpu_preempt.h>
>
>   #include <asm/mmu.h>
>   #include <asm/processor.h>
> @@ -501,6 +502,31 @@ static void __init find_and_init_phbs(void)
>   	of_pci_check_probe_only();
>   }
>
> +struct vcpu_preempt_ops vcpu_preempt_ops = DEFAULT_VCPU_PREEMPT_OPS;
> +EXPORT_SYMBOL(vcpu_preempt_ops);
> +
> +static long pseries_vcpu_preempt_count(void)
> +{
> +	return be32_to_cpu(get_lppaca()->yield_count);
> +}
> +
> +static bool pseries_vcpu_is_preempted(int cpu)
> +{
> +	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
> +}
> +
> +static bool pseries_vcpu_has_preempted(long vpc)
> +{
> +	return pseries_vcpu_preempt_count() != vpc;
> +}
> +
> +static void __init pseries_setup_vcpu_preempt_ops(void)
> +{
> +	vcpu_preempt_ops.preempt_count = pseries_vcpu_preempt_count;
> +	vcpu_preempt_ops.is_preempted = pseries_vcpu_is_preempted;
> +	vcpu_preempt_ops.has_preempted = pseries_vcpu_has_preempted;
> +}
> +
>   static void __init pSeries_setup_arch(void)
>   {
>   	set_arch_panic_timeout(10, ARCH_PANIC_TIMEOUT);
> @@ -549,6 +575,8 @@ static void __init pSeries_setup_arch(void)
>   				"%ld\n", rc);
>   		}
>   	}
> +
> +	pseries_setup_vcpu_preempt_ops();
>   }
>
>   static int __init pSeries_init_panel(void)
> diff --git a/include/linux/vcpu_preempt.h b/include/linux/vcpu_preempt.h
> new file mode 100644
> index 000000000000..e76e054a7917
> --- /dev/null
> +++ b/include/linux/vcpu_preempt.h
> @@ -0,0 +1,82 @@
> +/*
> + * Primitives for checking the vcpu preemption from the guest.
> + */
> +
> +static long __vcpu_preempt_count(void)
> +{
> +	return 0;
> +}
> +
> +static bool __vcpu_has_preempted(long vpc)
> +{
> +	return false;
> +}
> +
> +static bool __vcpu_is_preempted(int cpu)
> +{
> +	return false;
> +}
> +
> +struct vcpu_preempt_ops {
> +	/*
> +	 * Get the current vcpu's "preempt count", which is going to use for
> +	 * checking whether the current vcpu has ever been preempted
> +	 */
> +	long (*preempt_count)(void);
> +
> +	/*
> +	 * Return whether a vcpu is preempted
> +	 */
> +	bool (*is_preempted)(int cpu);
> +
> +	/*
> +	 * Given a "vcpu preempt count", Return whether a vcpu preemption ever
> +	 * happened after the .preempt_count() was called.
> +	 */
> +	bool (*has_preempted)(long vpc);
> +};
> +
> +extern struct vcpu_preempt_ops vcpu_preempt_ops;
> +
> +/* Default boilerplate */
> +#define DEFAULT_VCPU_PREEMPT_OPS			\
> +	{						\
> +		.preempt_count = __vcpu_preempt_count,	\
> +		.is_preempted = __vcpu_is_preempted,	\
> +		.has_preempted = __vcpu_has_preempted	\
> +	}
> +
> +#ifdef CONFIG_HAS_VCPU_PREEMPTION_DETECTION
> +/*
> + * vcpu_preempt_count: Get the current cpu's "vcpu preempt count"(vpc).
> + *
> + * The vpc is used for checking whether the current vcpu has ever been
> + * preempted via vcpu_has_preempted().
> + *
> + * This function and vcpu_has_preepmted() should be called in the same
> + * preemption disabled critical section.
> + */
> +#define vcpu_preempt_count()	vcpu_preempt_ops.preempt_count()
> +
> +/*
> + * vcpu_is_preempted: Check whether @cpu's vcpu is preempted.
> + */
> +#define vcpu_is_preempted(cpu)	vcpu_preempt_ops.is_preempted(cpu)
> +
> +/*
> + * vcpu_has_preepmted: Check whether the current cpu's vcpu has ever been
> + * preempted.
> + *
> + * The checked duration is between the vcpu_preempt_count() which returns @vpc
> + * is called and this function called.
> + *
> + * This function and corresponding vcpu_preempt_count() should be in the same
> + * preemption disabled cirtial section.
> + */
> +#define vcpu_has_preempted(vpc)	vcpu_preempt_ops.has_preempted(vpc)
> +
> +#else /* CONFIG_HAS_VCPU_PREEMPTION_DETECTION */
> +#define vcpu_preempt_count() __vcpu_preempt_count()
> +#define vcpu_is_preempted(cpu) __vcpu_is_preempted(cpu)
> +#define vcpu_has_preempted(vpc) __vcpu_has_preempted(vpc)
> +#endif /* CONFIG_HAS_VCPU_PREEPMTION_DETECTION */

AND if possible, We can hide the detail about vcpu preempted check. IOW, caller better do not know vcpu_preempt_count() .

currently, users have to code like below.

long vpc = vcpu_preempt_count();
while()
	if (vcpu_has_preempted(vpc))
		break;
the vpc is long, this the implementation detail.



SO we can introduce :
vcpu_preempt_critical_start(void);
{
	/* record the yield count*/
	__this_cpu_write(vpc, vcpu_preempt_count());
}
vcpu_preempt_critical_end(void)
{
	// do nothing.
}

vcpu_has_preempted(void)
{
	vcpu_preempt_ops.has_preempted(__this_cpu_read(vpc));
}


then we can code like below

vcpu_preempt_critical_start()
while()
	if (vcpu_has_preempted())
		break;
vcpu_preempt_critical_end()

looks a little better?

thanks
xinhui

> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
> index 3f9c97419f02..cb88bc3a2fc6 100644
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -55,4 +55,7 @@ config PREEMPT
>   endchoice
>
>   config PREEMPT_COUNT
> -       bool
> \ No newline at end of file
> +       bool
> +
> +config HAS_VCPU_PREEMPTION_DETECTION
> +       bool
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-26  4:59             ` panxinhui
@ 2016-06-27  7:55               ` Peter Zijlstra
  2016-06-27 10:19                 ` xinhui
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2016-06-27  7:55 UTC (permalink / raw)
  To: panxinhui
  Cc: Boqun Feng, Pan Xinhui, linux-kernel, mingo, dave, will.deacon,
	Waiman.Long, benh

On Sun, Jun 26, 2016 at 12:59:01PM +0800, panxinhui wrote:
> 
> > 在 2016年6月26日,03:12,Peter Zijlstra <peterz@infradead.org> 写道:
> > 
> > On Sun, Jun 26, 2016 at 01:27:51AM +0800, panxinhui wrote:
> > 
> >> by the way I still think mutex_unlock has a big overload too. 
> > 
> > Do you mean overhead?
> > 
> oh, maybe you are right. 

> mutex_unlock ’s implementation uses inc_return variant on  ppc, and
> that’s expensive. I am thinking of using  cmpxchg instead. 

That statement doesn't make any sense. PPC is an LL/SC arch, inc_return
and cmpxchg are the 'same' LL/SC loop.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-27  6:45                   ` Boqun Feng
  2016-06-27  7:36                     ` xinhui
@ 2016-06-27  8:09                     ` Peter Zijlstra
  2016-06-27 10:31                       ` Boqun Feng
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2016-06-27  8:09 UTC (permalink / raw)
  To: Boqun Feng
  Cc: panxinhui, Pan Xinhui, linux-kernel, mingo, dave, will.deacon,
	Waiman.Long, benh

On Mon, Jun 27, 2016 at 02:45:06PM +0800, Boqun Feng wrote:
> +++ b/include/linux/vcpu_preempt.h
> @@ -0,0 +1,82 @@
> +/*
> + * Primitives for checking the vcpu preemption from the guest.
> + */
> +
> +static long __vcpu_preempt_count(void)
> +{
> +	return 0;
> +}
> +
> +static bool __vcpu_has_preempted(long vpc)
> +{
> +	return false;
> +}
> +
> +static bool __vcpu_is_preempted(int cpu)
> +{
> +	return false;
> +}
> +
> +struct vcpu_preempt_ops {
> +	/*
> +	 * Get the current vcpu's "preempt count", which is going to use for
> +	 * checking whether the current vcpu has ever been preempted
> +	 */
> +	long (*preempt_count)(void);
> +
> +	/*
> +	 * Return whether a vcpu is preempted
> +	 */
> +	bool (*is_preempted)(int cpu);
> +
> +	/*
> +	 * Given a "vcpu preempt count", Return whether a vcpu preemption ever
> +	 * happened after the .preempt_count() was called.
> +	 */
> +	bool (*has_preempted)(long vpc);
> +};
> +
> +extern struct vcpu_preempt_ops vcpu_preempt_ops;
> +
> +/* Default boilerplate */
> +#define DEFAULT_VCPU_PREEMPT_OPS			\
> +	{						\
> +		.preempt_count = __vcpu_preempt_count,	\
> +		.is_preempted = __vcpu_is_preempted,	\
> +		.has_preempted = __vcpu_has_preempted	\
> +	}
> +
> +#ifdef CONFIG_HAS_VCPU_PREEMPTION_DETECTION
> +/*
> + * vcpu_preempt_count: Get the current cpu's "vcpu preempt count"(vpc).
> + *
> + * The vpc is used for checking whether the current vcpu has ever been
> + * preempted via vcpu_has_preempted().
> + *
> + * This function and vcpu_has_preepmted() should be called in the same
> + * preemption disabled critical section.
> + */
> +#define vcpu_preempt_count()	vcpu_preempt_ops.preempt_count()
> +
> +/*
> + * vcpu_is_preempted: Check whether @cpu's vcpu is preempted.
> + */
> +#define vcpu_is_preempted(cpu)	vcpu_preempt_ops.is_preempted(cpu)
> +
> +/*
> + * vcpu_has_preepmted: Check whether the current cpu's vcpu has ever been
> + * preempted.
> + *
> + * The checked duration is between the vcpu_preempt_count() which returns @vpc
> + * is called and this function called.
> + *
> + * This function and corresponding vcpu_preempt_count() should be in the same
> + * preemption disabled cirtial section.
> + */
> +#define vcpu_has_preempted(vpc)	vcpu_preempt_ops.has_preempted(vpc)
> +
> +#else /* CONFIG_HAS_VCPU_PREEMPTION_DETECTION */
> +#define vcpu_preempt_count() __vcpu_preempt_count()
> +#define vcpu_is_preempted(cpu) __vcpu_is_preempted(cpu)
> +#define vcpu_has_preempted(vpc) __vcpu_has_preempted(vpc)
> +#endif /* CONFIG_HAS_VCPU_PREEPMTION_DETECTION */

No, this is entirely insane, also broken.

No vectors, no actual function calls, nothing like that. You want the
below to completely compile away and generate the exact 100% same code
it does today.

> +++ b/kernel/locking/osq_lock.c
> @@ -1,6 +1,7 @@
>  #include <linux/percpu.h>
>  #include <linux/sched.h>
>  #include <linux/osq_lock.h>
> +#include <linux/vcpu_preempt.h>
>  
>  /*
>   * An MCS like lock especially tailored for optimistic spinning for sleeping
> @@ -87,6 +88,8 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	struct optimistic_spin_node *prev, *next;
>  	int curr = encode_cpu(smp_processor_id());
>  	int old;
> +	int loops;
> +	long vpc;
>  
>  	node->locked = 0;
>  	node->next = NULL;
> @@ -106,6 +109,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	node->prev = prev;
>  	WRITE_ONCE(prev->next, node);
>  
> +	old = old - 1;

That's just nasty, and could result in an unconditional decrement being
issues, even though its never used.

> +	vpc = vcpu_preempt_count();
> +
>  	/*
>  	 * Normally @prev is untouchable after the above store; because at that
>  	 * moment unlock can proceed and wipe the node element from stack.
> @@ -118,8 +124,14 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	while (!READ_ONCE(node->locked)) {
>  		/*
>  		 * If we need to reschedule bail... so we can block.
> +		 * An over-committed guest with more vCPUs than pCPUs
> +		 * might fall in this loop and cause a huge overload.
> +		 * This is because vCPU A(prev) hold the osq lock and yield out,
> +		 * vCPU B(node) wait ->locked to be set, IOW, wait till
> +		 * vCPU A run and unlock the osq lock.
> +		 * NOTE that vCPU A and vCPU B might run on same physical cpu.
>  		 */
> -		if (need_resched())
> +		if (need_resched() || vcpu_is_preempted(old) || vcpu_has_preempted(vpc))
>  			goto unqueue;
>  
>  		cpu_relax_lowlatency();
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-27  7:55               ` Peter Zijlstra
@ 2016-06-27 10:19                 ` xinhui
  0 siblings, 0 replies; 31+ messages in thread
From: xinhui @ 2016-06-27 10:19 UTC (permalink / raw)
  To: Peter Zijlstra, panxinhui
  Cc: Boqun Feng, linux-kernel, mingo, dave, will.deacon, Waiman.Long, benh



On 2016年06月27日 15:55, Peter Zijlstra wrote:
> On Sun, Jun 26, 2016 at 12:59:01PM +0800, panxinhui wrote:
>>
>>> 在 2016年6月26日,03:12,Peter Zijlstra <peterz@infradead.org> 写道:
>>>
>>> On Sun, Jun 26, 2016 at 01:27:51AM +0800, panxinhui wrote:
>>>
>>>> by the way I still think mutex_unlock has a big overload too.
>>>
>>> Do you mean overhead?
>>>
>> oh, maybe you are right.
>
>> mutex_unlock ’s implementation uses inc_return variant on  ppc, and
>> that’s expensive. I am thinking of using  cmpxchg instead.
>
> That statement doesn't make any sense. PPC is an LL/SC arch, inc_return
> and cmpxchg are the 'same' LL/SC loop.
>
This is a little optimize.
if there are lock waiters, the lockval is minus X, when we call unlock, it will inc the lockval, if it is <= 0, enter unlockslowpath to wakeup the waiters, and set lockval to 1 in the slowpath.
SO there is no need to inc lockval if it is already a minus number. therefore we can save one store or loads/stores in LL/SC loops

the base idea is from code below,
if (!atomic_read(&lk)//no need to call atomic_inc_return which is expensive.
	atomic_inc_return(&lk))

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] locking/osq: Drop the overload of osq lock
  2016-06-27  8:09                     ` Peter Zijlstra
@ 2016-06-27 10:31                       ` Boqun Feng
  0 siblings, 0 replies; 31+ messages in thread
From: Boqun Feng @ 2016-06-27 10:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: panxinhui, Pan Xinhui, linux-kernel, mingo, dave, will.deacon,
	Waiman.Long, benh

[-- Attachment #1: Type: text/plain, Size: 2549 bytes --]

On Mon, Jun 27, 2016 at 10:09:59AM +0200, Peter Zijlstra wrote:
[snip]
> 
> No, this is entirely insane, also broken.
> 
> No vectors, no actual function calls, nothing like that. You want the
> below to completely compile away and generate the exact 100% same code
> it does today.
> 

Point taken.

As Xinhui also posted something similar, which worked better on not
effecting the generated code if not enabled. I think I'd better to drop
this workload to him ;-)

> > +++ b/kernel/locking/osq_lock.c
> > @@ -1,6 +1,7 @@
> >  #include <linux/percpu.h>
> >  #include <linux/sched.h>
> >  #include <linux/osq_lock.h>
> > +#include <linux/vcpu_preempt.h>
> >  
> >  /*
> >   * An MCS like lock especially tailored for optimistic spinning for sleeping
> > @@ -87,6 +88,8 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> >  	struct optimistic_spin_node *prev, *next;
> >  	int curr = encode_cpu(smp_processor_id());
> >  	int old;
> > +	int loops;
> > +	long vpc;
> >  
> >  	node->locked = 0;
> >  	node->next = NULL;
> > @@ -106,6 +109,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> >  	node->prev = prev;
> >  	WRITE_ONCE(prev->next, node);
> >  
> > +	old = old - 1;
> 
> That's just nasty, and could result in an unconditional decrement being
> issues, even though its never used.
> 

Right, better to calculate this decrement at the argument of
vcpu_is_preempted() callsite, and define the primitive of host code as

#define vcpu_is_preempted(cpu) false

, which could probably be optimized out.

Regards,
Boqun

> > +	vpc = vcpu_preempt_count();
> > +
> >  	/*
> >  	 * Normally @prev is untouchable after the above store; because at that
> >  	 * moment unlock can proceed and wipe the node element from stack.
> > @@ -118,8 +124,14 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> >  	while (!READ_ONCE(node->locked)) {
> >  		/*
> >  		 * If we need to reschedule bail... so we can block.
> > +		 * An over-committed guest with more vCPUs than pCPUs
> > +		 * might fall in this loop and cause a huge overload.
> > +		 * This is because vCPU A(prev) hold the osq lock and yield out,
> > +		 * vCPU B(node) wait ->locked to be set, IOW, wait till
> > +		 * vCPU A run and unlock the osq lock.
> > +		 * NOTE that vCPU A and vCPU B might run on same physical cpu.
> >  		 */
> > -		if (need_resched())
> > +		if (need_resched() || vcpu_is_preempted(old) || vcpu_has_preempted(vpc))
> >  			goto unqueue;
> >  
> >  		cpu_relax_lowlatency();
> > 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2016-06-27 10:27 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-25 17:42 [PATCH] locking/osq: Drop the overload of osq lock Pan Xinhui
2016-06-25 14:24 ` Peter Zijlstra
2016-06-25 15:21   ` Boqun Feng
2016-06-25 16:09     ` Peter Zijlstra
2016-06-25 16:13       ` Peter Zijlstra
2016-06-25 17:27         ` panxinhui
2016-06-25 19:12           ` Peter Zijlstra
2016-06-26  4:59             ` panxinhui
2016-06-27  7:55               ` Peter Zijlstra
2016-06-27 10:19                 ` xinhui
2016-06-25 16:28       ` Boqun Feng
2016-06-25 18:28         ` Peter Zijlstra
2016-06-25 16:15     ` Peter Zijlstra
2016-06-25 16:45       ` Boqun Feng
2016-06-25 17:27         ` panxinhui
2016-06-25 19:20           ` Peter Zijlstra
2016-06-25 19:29             ` Peter Zijlstra
2016-06-26  2:26             ` Boqun Feng
2016-06-26  5:21             ` panxinhui
2016-06-26  6:10               ` Boqun Feng
2016-06-26  6:58                 ` panxinhui
2016-06-26 14:11                   ` Boqun Feng
2016-06-26 15:54                     ` panxinhui
2016-06-26  6:59                 ` Boqun Feng
2016-06-26  7:08                   ` panxinhui
2016-06-26 14:29                     ` Boqun Feng
2016-06-26 15:11                       ` panxinhui
2016-06-27  6:45                   ` Boqun Feng
2016-06-27  7:36                     ` xinhui
2016-06-27  8:09                     ` Peter Zijlstra
2016-06-27 10:31                       ` Boqun Feng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).