linux-kernel.vger.kernel.org archive mirror
* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
       [not found] <3d2459c7-defd-a47e-6cea-007c10cecaac@asrmicro.com>
@ 2017-07-26 14:16 ` Thomas Gleixner
  2017-07-27  1:29   ` qiaozhou
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2017-07-26 14:16 UTC (permalink / raw)
  To: qiaozhou
  Cc: John Stultz, sboyd, LKML, Wang Wilbur, Marc Zyngier, Will Deacon,
	Peter Zijlstra

On Wed, 26 Jul 2017, qiaozhou wrote:

Cc'ed ARM folks.

> I'd like to ask for suggestions on how to fix a contention issue between
> expire_timers and try_to_del_timer_sync. Thanks in advance.
> The issue is a hard lockup detected on our platform (arm64, one cluster
> with 4 A53 cores and the other with 4 A73 cores). The sequence is as below:
> 1. core0 checks expired timers and wakes up a process on the A73, in step 1.3.
> 2. core4 starts to run the process and tries to delete the timer
> synchronously, in step 2.1.
> 3. Before core0 can reach 1.4 and take the lock, core4 has already run from
> 2.1 to 2.6 in its while loop. In step 2.4 it fails, since the timer is still
> the running timer.
> 
> core0(a53): core4(a73):
> run_timer_softirq
> __run_timers
>     spin_lock_irq(&base->lock)
>         expire_timers()
>             1.1: base->running_timer = timer;
>             1.2: spin_unlock_irq(&base->lock);
>             1.3: call_timer_fn(timer, fn, data);
> schedule()
>             1.4: spin_lock_irq(&base->lock);     del_timer_sync(&timer)
>             1.5: back to 1.1 in while loop
> 2.1: try_to_del_timer_sync()
>         2.2: get_timer_base()
>         2.3: spin_lock_irqsave(&base->lock, flags);
>           2.4: check "base->running_timer != timer"
>         2.5: spin_unlock_irqrestore(&base->lock, flags);
>    2.6:cpu_relax(); (back to 2.1 in while loop)

This is horribly formatted.

> Core0 runs at 832MHz and core4 at 1.8GHz, and the A73 is also a much more
> powerful design than the A53. So what actually happens is that the A73 keeps
> looping through spin_lock -> check running_timer -> spin_unlock ->
> spin_lock -> ..., while the A53 cannot complete even one store-exclusive
> before it could enter WFE to wait for the lock release. It just keeps
> looping at +0x30 and +0x3c in the asm below, because stxr keeps failing.
> 
> So it livelocks: the A53 can't get out of the ldaxr/stxr loop, while the
> A73 can't pass the running_timer check. Eventually the hard lockup is
> triggered. (BTW, the issue takes a long time to reproduce.)
> 
> <_raw_spin_lock_irq+0x2c>:   prfm    pstl1strm, [x19]
> <_raw_spin_lock_irq+0x30>:   ldaxr   w0, [x19]
> <_raw_spin_lock_irq+0x34>:   add     w1, w0, #0x10, lsl #12
> <_raw_spin_lock_irq+0x38>:   stxr    w2, w1, [x19]
> <_raw_spin_lock_irq+0x3c>:   cbnz    w2, 0xffffff80089f52e0 <_raw_spin_lock_irq+0x30>
> <_raw_spin_lock_irq+0x40>:   eor     w1, w0, w0, ror #16
> <_raw_spin_lock_irq+0x44>:   cbz     w1, 0xffffff80089f530c <_raw_spin_lock_irq+0x5c>
> <_raw_spin_lock_irq+0x48>:   sevl
> <_raw_spin_lock_irq+0x4c>:   wfe
> <_raw_spin_lock_irq+0x50>:   ldaxrh  w2, [x19]
> <_raw_spin_lock_irq+0x54>:   eor     w1, w2, w0, lsr #16
> <_raw_spin_lock_irq+0x58>:   cbnz    w1, 0xffffff80089f52fc <_raw_spin_lock_irq+0x4c>
> 
> The loop on the A53 is only 4 instructions, while the loop on the A73 is
> ~100 instructions. Still, the A53 never manages to complete its
> store-exclusive. It may be related to the ldaxr/stxr cost, core frequency,
> core count, etc.
> 
> I have no idea how to fix this in the spinlock/unlock implementation, so I
> tried to fix it in the timer code instead. I've prepared a rough patch, but
> I'm not sure it's the right direction for solving this issue. Could you
> give some suggestions? Thanks.
> 
> From fb8fbfeb8f9f92fdadd9920ce234fd433bc883e1 Mon Sep 17 00:00:00 2001
> From: Qiao Zhou <qiaozhou@asrmicro.com>
> Date: Wed, 26 Jul 2017 20:30:33 +0800
> Subject: [PATCH] RFC: timers: try to fix contention between expire_timers and
>  del_timer_sync
> 
> try to fix the contention between expire_timers and del_timer_sync by
> adding a TIMER_WOKEN state, which means that the timer has already
> woken up the corresponding process.

Timers do a lot more things than waking up a process.

Just because this happens in a particular case for you where a wakeup is
involved this does not mean, that this is generally true.

And that whole flag business is completely broken. There is a comment in
call_timer_fn() which says:

        /*
         * It is permissible to free the timer from inside the
         * function that is called from it ....

So you _CANNOT_ touch timer after the function returned. It might be gone
already.

For that particular timer case we can clear base->running_timer w/o the
lock held (see patch below), but this kind of

     lock -> test -> unlock -> retry

loops are all over the place in the kernel, so this is going to hurt you
sooner than later in some other place.

Thanks,

	tglx

8<------------

--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1301,10 +1301,12 @@ static void expire_timers(struct timer_b
 		if (timer->flags & TIMER_IRQSAFE) {
 			raw_spin_unlock(&base->lock);
 			call_timer_fn(timer, fn, data);
+			base->running_timer = NULL;
 			raw_spin_lock(&base->lock);
 		} else {
 			raw_spin_unlock_irq(&base->lock);
 			call_timer_fn(timer, fn, data);
+			base->running_timer = NULL;
 			raw_spin_lock_irq(&base->lock);
 		}
 	}

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-07-26 14:16 ` [Question]: try to fix contention between expire_timers and try_to_del_timer_sync Thomas Gleixner
@ 2017-07-27  1:29   ` qiaozhou
  2017-07-27 15:14     ` Will Deacon
  2017-07-28  1:10     ` Vikram Mulukutla
  0 siblings, 2 replies; 23+ messages in thread
From: qiaozhou @ 2017-07-27  1:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: John Stultz, sboyd, LKML, Wang Wilbur, Marc Zyngier, Will Deacon,
	Peter Zijlstra


On 2017-07-26 22:16, Thomas Gleixner wrote:
> On Wed, 26 Jul 2017, qiaozhou wrote:
>
> Cc'ed ARM folks.
>
>> I'd like to ask for suggestions on how to fix a contention issue between
>> expire_timers and try_to_del_timer_sync. Thanks in advance.
>> The issue is a hard lockup detected on our platform (arm64, one cluster
>> with 4 A53 cores and the other with 4 A73 cores). The sequence is as below:
>> 1. core0 checks expired timers and wakes up a process on the A73, in step 1.3.
>> 2. core4 starts to run the process and tries to delete the timer
>> synchronously, in step 2.1.
>> 3. Before core0 can reach 1.4 and take the lock, core4 has already run from
>> 2.1 to 2.6 in its while loop. In step 2.4 it fails, since the timer is still
>> the running timer.
>>
>> core0(a53): core4(a73):
>> run_timer_softirq
>> __run_timers
>>      spin_lock_irq(&base->lock)
>>          expire_timers()
>>              1.1: base->running_timer = timer;
>>              1.2: spin_unlock_irq(&base->lock);
>>              1.3: call_timer_fn(timer, fn, data);
>> schedule()
>>              1.4: spin_lock_irq(&base->lock);     del_timer_sync(&timer)
>>              1.5: back to 1.1 in while loop
>> 2.1: try_to_del_timer_sync()
>>          2.2: get_timer_base()
>>          2.3: spin_lock_irqsave(&base->lock, flags);
>>            2.4: check "base->running_timer != timer"
>>          2.5: spin_unlock_irqrestore(&base->lock, flags);
>>     2.6:cpu_relax(); (back to 2.1 in while loop)
> This is horribly formatted.
Sorry for the formatting. I've updated the calling sequence on the two cores below.

core0(a53):
run_timer_softirq
__run_timers
     spin_lock_irq(&base->lock)
         expire_timers()
             1.1: base->running_timer = timer;
             1.2: spin_unlock_irq(&base->lock);
             1.3: call_timer_fn(timer, fn, data);
             1.4: spin_lock_irq(&base->lock);
             1.5: back to 1.1 in while loop

core4(a73):
schedule()
   del_timer_sync(&timer)
      2.1: try_to_del_timer_sync()
         2.2: get_timer_base()
         2.3: spin_lock_irqsave(&base->lock, flags);
         2.4: check "base->running_timer != timer"
         2.5: spin_unlock_irqrestore(&base->lock, flags);
      2.6:cpu_relax(); (back to 2.1 in while loop)

>
>> Core0 runs at 832MHz and core4 at 1.8GHz, and the A73 is also a much more
>> powerful design than the A53. So what actually happens is that the A73
>> keeps looping through spin_lock -> check running_timer -> spin_unlock ->
>> spin_lock -> ..., while the A53 cannot complete even one store-exclusive
>> before it could enter WFE to wait for the lock release. It just keeps
>> looping at +0x30 and +0x3c in the asm below, because stxr keeps failing.
>>
>> So it livelocks: the A53 can't get out of the ldaxr/stxr loop, while the
>> A73 can't pass the running_timer check. Eventually the hard lockup is
>> triggered. (BTW, the issue takes a long time to reproduce.)
>>
>> <_raw_spin_lock_irq+0x2c>:   prfm    pstl1strm, [x19]
>> <_raw_spin_lock_irq+0x30>:   ldaxr   w0, [x19]
>> <_raw_spin_lock_irq+0x34>:   add     w1, w0, #0x10, lsl #12
>> <_raw_spin_lock_irq+0x38>:   stxr    w2, w1, [x19]
>> <_raw_spin_lock_irq+0x3c>:   cbnz    w2, 0xffffff80089f52e0 <_raw_spin_lock_irq+0x30>
>> <_raw_spin_lock_irq+0x40>:   eor     w1, w0, w0, ror #16
>> <_raw_spin_lock_irq+0x44>:   cbz     w1, 0xffffff80089f530c <_raw_spin_lock_irq+0x5c>
>> <_raw_spin_lock_irq+0x48>:   sevl
>> <_raw_spin_lock_irq+0x4c>:   wfe
>> <_raw_spin_lock_irq+0x50>:   ldaxrh  w2, [x19]
>> <_raw_spin_lock_irq+0x54>:   eor     w1, w2, w0, lsr #16
>> <_raw_spin_lock_irq+0x58>:   cbnz    w1, 0xffffff80089f52fc <_raw_spin_lock_irq+0x4c>
>>
>> The loop on the A53 is only 4 instructions, while the loop on the A73 is
>> ~100 instructions. Still, the A53 never manages to complete its
>> store-exclusive. It may be related to the ldaxr/stxr cost, core frequency,
>> core count, etc.
>>
>> I have no idea how to fix this in the spinlock/unlock implementation, so I
>> tried to fix it in the timer code instead. I've prepared a rough patch, but
>> I'm not sure it's the right direction for solving this issue. Could you
>> give some suggestions? Thanks.
>>
>>  From fb8fbfeb8f9f92fdadd9920ce234fd433bc883e1 Mon Sep 17 00:00:00 2001
>> From: Qiao Zhou <qiaozhou@asrmicro.com>
>> Date: Wed, 26 Jul 2017 20:30:33 +0800
>> Subject: [PATCH] RFC: timers: try to fix contention between expire_timers and
>>   del_timer_sync
>>
>> try to fix the contention between expire_timers and del_timer_sync by
>> adding a TIMER_WOKEN state, which means that the timer has already
>> woken up the corresponding process.
> Timers do a lot more things than waking up a process.
>
> Just because this happens in a particular case for you where a wakeup is
> involved this does not mean, that this is generally true.
Yes, you're right. My intention was that once the timer function has 
executed, maybe we can do something.
>
> And that whole flag business is completely broken. There is a comment in
> call_timer_fn() which says:
>
>          /*
>           * It is permissible to free the timer from inside the
>           * function that is called from it ....
>
> So you _CANNOT_ touch timer after the function returned. It might be gone
> already.
You're right. Touching the timer here is risky.
>
> For that particular timer case we can clear base->running_timer w/o the
> lock held (see patch below), but this kind of
>
>       lock -> test -> unlock -> retry
>
> loops are all over the place in the kernel, so this is going to hurt you
> sooner than later in some other place.
It's true, this is how spinlocks are normally and widely used in the 
kernel. I'll also ask the ARM experts whether we can do something to 
avoid or reduce the chance of such an issue. ARMv8.1 has a single 
instruction (ldadda) that can replace the ldaxr/stxr loop; hopefully it 
improves things and reduces the chance.
>
> Thanks,
>
> 	tglx
>
> 8<------------
>
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1301,10 +1301,12 @@ static void expire_timers(struct timer_b
>   		if (timer->flags & TIMER_IRQSAFE) {
>   			raw_spin_unlock(&base->lock);
>   			call_timer_fn(timer, fn, data);
> +			base->running_timer = NULL;
>   			raw_spin_lock(&base->lock);
>   		} else {
>   			raw_spin_unlock_irq(&base->lock);
>   			call_timer_fn(timer, fn, data);
> +			base->running_timer = NULL;
>   			raw_spin_lock_irq(&base->lock);
>   		}
>   	}
It should work for this particular issue and I'll test it. Previously I 
thought it was unsafe to touch base->running_timer without holding the lock.

Thanks a lot for all the suggestions.

Best Regards
Qiao


* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-07-27  1:29   ` qiaozhou
@ 2017-07-27 15:14     ` Will Deacon
  2017-07-27 15:19       ` Thomas Gleixner
  2017-07-28  1:10     ` Vikram Mulukutla
  1 sibling, 1 reply; 23+ messages in thread
From: Will Deacon @ 2017-07-27 15:14 UTC (permalink / raw)
  To: qiaozhou
  Cc: Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Peter Zijlstra

On Thu, Jul 27, 2017 at 09:29:20AM +0800, qiaozhou wrote:
> On 2017-07-26 22:16, Thomas Gleixner wrote:
> >--- a/kernel/time/timer.c
> >+++ b/kernel/time/timer.c
> >@@ -1301,10 +1301,12 @@ static void expire_timers(struct timer_b
> >  		if (timer->flags & TIMER_IRQSAFE) {
> >  			raw_spin_unlock(&base->lock);
> >  			call_timer_fn(timer, fn, data);
> >+			base->running_timer = NULL;
> >  			raw_spin_lock(&base->lock);
> >  		} else {
> >  			raw_spin_unlock_irq(&base->lock);
> >  			call_timer_fn(timer, fn, data);
> >+			base->running_timer = NULL;
> >  			raw_spin_lock_irq(&base->lock);
> >  		}
> >  	}
> It should work for this particular issue and I'll test it. Previously I
> thought it was unsafe to touch base->running_timer without holding the lock.

I think it works out in practice because base->lock and base->running_timer
share a cacheline, so they end up being ordered correctly. We should probably
be using READ_ONCE/WRITE_ONCE for accessing the running_timer field, though.

One thing I don't get, though, is why try_to_del_timer_sync needs to check
base->running_timer at all. Given that it holds base->lock, can't it be the
one that sets it to NULL?

Will


* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-07-27 15:14     ` Will Deacon
@ 2017-07-27 15:19       ` Thomas Gleixner
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2017-07-27 15:19 UTC (permalink / raw)
  To: Will Deacon
  Cc: qiaozhou, John Stultz, sboyd, LKML, Wang Wilbur, Marc Zyngier,
	Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 1774 bytes --]

On Thu, 27 Jul 2017, Will Deacon wrote:
> On Thu, Jul 27, 2017 at 09:29:20AM +0800, qiaozhou wrote:
> > On 2017-07-26 22:16, Thomas Gleixner wrote:
> > >--- a/kernel/time/timer.c
> > >+++ b/kernel/time/timer.c
> > >@@ -1301,10 +1301,12 @@ static void expire_timers(struct timer_b
> > >  		if (timer->flags & TIMER_IRQSAFE) {
> > >  			raw_spin_unlock(&base->lock);
> > >  			call_timer_fn(timer, fn, data);
> > >+			base->running_timer = NULL;
> > >  			raw_spin_lock(&base->lock);
> > >  		} else {
> > >  			raw_spin_unlock_irq(&base->lock);
> > >  			call_timer_fn(timer, fn, data);
> > >+			base->running_timer = NULL;
> > >  			raw_spin_lock_irq(&base->lock);
> > >  		}
> > >  	}
> > It should work for this particular issue and I'll test it. Previously I
> > thought it was unsafe to touch base->running_timer without holding the lock.
> 
> I think it works out in practice because base->lock and base->running_timer
> share a cacheline, so they end up being ordered correctly. We should probably
> be using READ_ONCE/WRITE_ONCE for accessing the running_timer field, though.
> 
> One thing I don't get, though, is why try_to_del_timer_sync needs to check
> base->running_timer at all. Given that it holds base->lock, can't it be the
> one that sets it to NULL?

No. The timer callback code does:

    base->running_timer = timer;
    spin_unlock(base->lock);
    fn(timer);
    spin_lock(base->lock);
    base->running_timer = NULL;

So for del_timer_sync() the only way to figure out whether the timer
callback is running is to check base->running_timer. We cannot store that
state in the timer itself, because we cannot clear the state when the
callback returns: the timer might have been freed in the callback. Yes,
that's nasty, but reality.

Thanks,

	tglx

    


* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-07-27  1:29   ` qiaozhou
  2017-07-27 15:14     ` Will Deacon
@ 2017-07-28  1:10     ` Vikram Mulukutla
  2017-07-28  9:28       ` Peter Zijlstra
  2017-07-28  9:28       ` Will Deacon
  1 sibling, 2 replies; 23+ messages in thread
From: Vikram Mulukutla @ 2017-07-28  1:10 UTC (permalink / raw)
  To: qiaozhou
  Cc: Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Will Deacon, Peter Zijlstra, linux-kernel-owner,
	sudeep.holla

[-- Attachment #1: Type: text/plain, Size: 2969 bytes --]



cc: Sudeep Holla

On 2017-07-26 18:29, qiaozhou wrote:
> On 2017-07-26 22:16, Thomas Gleixner wrote:
>> On Wed, 26 Jul 2017, qiaozhou wrote:
>> 
>> Cc'ed ARM folks.
>> 

<snip>

>> 
>> For that particular timer case we can clear base->running_timer w/o 
>> the
>> lock held (see patch below), but this kind of
>> 
>>       lock -> test -> unlock -> retry
>> 
>> loops are all over the place in the kernel, so this is going to hurt 
>> you
>> sooner than later in some other place.
> It's true. This is the way spinlock is used normally and widely in
> kernel. I'll also ask ARM experts whether we can do something to avoid
> or reduce the chance of such issue. ARMv8.1 has one single
> instruction(ldadda) to replace the ldaxr/stxr loop. Hope it can
> improve and reduce the chance.

I think we should have this discussion now - I brought this up earlier [1]
and I promised a test case that I completely forgot about - but here it
is (attached). Essentially a big CPU in an acquire-check-release loop
will have an unfair advantage over a little CPU concurrently attempting
to acquire the same lock, in spite of the ticket implementation. If the
big CPU needs the little CPU to make forward progress: livelock.

We've run into the same loop construct in other spots in the kernel, and
the reason that a real symptom is so rare is that the retry loop on the
big CPU needs to be interrupted just once by, say, an IRQ/FIQ and the
livelock is broken. If the entire retry loop is within an
interrupt-disabled critical section, the odds of livelocking are much
higher.

An example of the problem on a previous kernel is here [2]. Changes to
the workqueue code since then may have fixed this particular instance.

One solution was to use udelay(1) in such loops instead of cpu_relax(),
but that's not very 'relaxing'. I'm not sure if there's something we
could do within the ticket spinlock implementation to deal with this.

Note that I ran my test on a 4.9 kernel, so it didn't include any
spinlock implementation changes since then. The test schedules two
threads, one on a big CPU and one on a little CPU. The big CPU thread
does the lock/unlock/retry loop for a full second with interrupts
disabled, while the little CPU thread attempts to acquire the same lock
but enables interrupts after every successful lock+unlock. With
unfairness, the little CPU may take up to 1 second (or at least several
milliseconds) just to acquire the lock once. This varies depending on
the IPC difference and the frequencies of the big and little ARM64 CPUs:

Big CPU frequency | Little CPU frequency | Max time for little to acquire lock
2GHz              | 1.5GHz               | 133 microseconds
2GHz              | 300MHz               | 734 milliseconds

Thanks,
Vikram

[1] - https://lkml.org/lkml/2016/11/17/934
[2] - https://goo.gl/uneFjt

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-measure-spinlock-fairness-across-differently-capable.patch --]
[-- Type: text/x-diff; name=0001-measure-spinlock-fairness-across-differently-capable.patch, Size: 7285 bytes --]

From 51d6186b620a9e354a0d40af06aef1c1299ca223 Mon Sep 17 00:00:00 2001
From: Vikram Mulukutla <markivx@codeaurora.org>
Date: Thu, 27 Jul 2017 12:14:48 -0700
Subject: [PATCH] measure spinlock fairness across differently capable CPUs

How to run this test:
1) compile and boot
2) echo 1 > /sys/module/test/parameters/run_test
3) sleep 5
4) echo 0 > /sys/module/test/parameters/run_test

The test schedules two threads on two separate CPUs
and attempts to acquire the same spinlock from both
threads with a certain loop construct.
(it assumes cpu0 is 'Little' and cpu4 is 'Big'. This
can be changed in the code)

After running the test, check these counters:
cat /sys/module/test/parameters/big_time_us
cat /sys/module/test/parameters/little_time_us

If unfairness exists, little_time should be close to 1 second or
a fairly large millisecond value.
test.c has comments that explain why this is so.

Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
---
 kernel/sched/Makefile |   2 +-
 kernel/sched/test.c   | 204 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 205 insertions(+), 1 deletion(-)
 create mode 100644 kernel/sched/test.c

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index f6cce95..542a1c7 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -15,7 +15,7 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
 endif
 
-obj-y += core.o loadavg.o clock.o cputime.o
+obj-y += core.o loadavg.o clock.o cputime.o test.o
 obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o
 obj-y += wait.o swait.o completion.o idle.o
 obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o energy.o sched_avg.o
diff --git a/kernel/sched/test.c b/kernel/sched/test.c
new file mode 100644
index 0000000..5dd3b0d
--- /dev/null
+++ b/kernel/sched/test.c
@@ -0,0 +1,204 @@
+/* Copyright (c) 2014-2016, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Big.Little spinlock unfairness test by Vikram Mulukutla
+ */
+
+#include <linux/init.h>
+#include <linux/notifier.h>
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/cpufreq.h>
+#include <linux/kthread.h>
+#include <linux/sched.h>
+#include <linux/sched/rt.h>
+#include <linux/module.h>
+#include <linux/delay.h>
+
+static DEFINE_SPINLOCK(mylock);
+
+static int run_test;
+static int big_counter, little_counter;
+module_param(big_counter, int, 0644);
+module_param(little_counter, int, 0644);
+
+static int big_time_us, little_time_us;
+module_param(big_time_us, int, 0644);
+module_param(little_time_us, int, 0644);
+
+volatile int sync = 0;
+
+struct testdata {
+	int cpu;
+	int *counter;
+	int *time;
+};
+
+struct testdata littledata = {
+	.cpu = 0,
+	.counter = &little_counter,
+	.time = &little_time_us,
+};
+
+struct testdata bigdata = {
+	.cpu = 4,
+	.counter = &big_counter,
+	.time = &big_time_us,
+};
+
+/*
+ * This is the little CPU thread. It attempts to get the lock, disabling
+ * and enabling interrupts before and after the critical section, and
+ * checks how long it took to get the lock. Ideally this time
+ * should be reflective of just one iteration of the critical section of the
+ * big cpu thread but since this little CPU always loses the ticket strex,
+ * it ends up waiting a full second.
+ */
+static int __ref lockunlock(void *data)
+{
+	struct testdata *testdata = data;
+	u64 spin_start_time, spin_stop_time;
+
+	sched_setaffinity(current->pid, cpumask_of(testdata->cpu));
+
+	while (!sync);
+
+	local_irq_disable();
+	spin_lock(&mylock);
+	while (1) {
+		spin_unlock(&mylock);
+		local_irq_enable();
+
+		cpu_relax();
+
+		local_irq_disable();
+
+		spin_start_time = sched_clock();
+		spin_lock(&mylock);
+		spin_stop_time = sched_clock();
+
+		if (*testdata->time < ((spin_stop_time - spin_start_time)/NSEC_PER_USEC))
+			*testdata->time = (spin_stop_time - spin_start_time)/NSEC_PER_USEC;
+		*testdata->counter = *testdata->counter + 1;
+		if (!run_test)
+			break;
+	}
+	spin_unlock(&mylock);
+	local_irq_enable();
+
+	return 0;
+}
+
+/*
+ * This is the big CPU thread. Ideally it should contend and have its ticket
+ * go in after the little CPU. However what ends up happening is
+ * - if it's running fast enough - this big CPU keeps winning the strex every
+ * time and is able to obtain and release the lock for a full second before it
+ * takes a break, at which point the little CPU wins (finally).
+ *
+ * Note that if there exists a loop construct such as this one in the kernel,
+ * and there is no 'break', then we have a potential livelock situation if
+ * this big CPU needs the little CPU to obtain the lock for the former to
+ * break out of the loop.
+ */
+static int __ref lockunlock_noirq(void *data)
+{
+	struct testdata *testdata = data;
+	u64 spin_start_time, spin_stop_time, time_since_start, test_start_time;
+	bool rollover;
+
+	sched_setaffinity(current->pid, cpumask_of(testdata->cpu));
+
+	while (!sync);
+
+	local_irq_disable();
+	spin_lock(&mylock);
+	test_start_time = sched_clock();
+	while (1) {
+		spin_unlock(&mylock);
+
+		time_since_start = sched_clock() - test_start_time;
+		rollover = time_since_start >= NSEC_PER_SEC;
+
+		/*
+		 * We take a break after 1 second to allow watchdog etc. and
+		 * potentially a user shell to work!
+		 */
+		if (rollover) {
+			local_irq_enable();
+			test_start_time = sched_clock();
+		}
+
+		/*
+		 * Changing this cpu_relax to a udelay(1) will likely
+		 * allow the little CPU enough cycles so it finally wins
+		 * the strex battle.
+		 */
+		cpu_relax();
+
+		if (rollover)
+			local_irq_disable();
+
+		spin_start_time = sched_clock();
+		spin_lock(&mylock);
+		spin_stop_time = sched_clock();
+
+		if (*testdata->time < ((spin_stop_time - spin_start_time)/NSEC_PER_USEC))
+			*testdata->time = (spin_stop_time - spin_start_time)/NSEC_PER_USEC;
+		*testdata->counter = *testdata->counter + 1;
+		if (!run_test)
+			break;
+	}
+	spin_unlock(&mylock);
+	local_irq_enable();
+
+	return 0;
+}
+
+static int runtest_param_set(const char *val, struct kernel_param *kp)
+{
+	int ret;
+	int old_val = run_test;
+
+	ret = param_set_int(val, kp);
+
+	if (ret)
+		return ret;
+
+	/* If run_test is not zero or one, ignore. */
+	if (run_test >> 1) {
+		run_test = old_val;
+		return -EINVAL;
+	}
+
+	if (run_test) {
+		sync = 0;
+
+		*littledata.time = 0;
+		*littledata.counter = 0;
+		*bigdata.time = 0;
+		*bigdata.counter = 0;
+
+		kthread_run(lockunlock, (void *) &littledata,
+						"test/%d", littledata.cpu);
+		kthread_run(lockunlock_noirq, (void *) &bigdata,
+						"test/%d", bigdata.cpu);
+		sync=1;
+	}
+
+	return 0;
+}
+
+module_param_call(run_test, runtest_param_set, param_get_int,
+		  &run_test, S_IRUGO|S_IWUSR);
+
+
+
-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project



* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-07-28  1:10     ` Vikram Mulukutla
@ 2017-07-28  9:28       ` Peter Zijlstra
  2017-07-28 19:11         ` Vikram Mulukutla
  2017-07-28  9:28       ` Will Deacon
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2017-07-28  9:28 UTC (permalink / raw)
  To: Vikram Mulukutla
  Cc: qiaozhou, Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Will Deacon, linux-kernel-owner, sudeep.holla

On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:

> I think we should have this discussion now - I brought this up earlier [1]
> and I promised a test case that I completely forgot about - but here it
> is (attached). Essentially a Big CPU in an acquire-check-release loop
> will have an unfair advantage over a little CPU concurrently attempting
> to acquire the same lock, in spite of the ticket implementation. If the Big
> CPU needs the little CPU to make forward progress: livelock.

This needs to be fixed in hardware. There really isn't anything the
software can sanely do about it.

It also doesn't have anything to do with the spinlock implementation.
Ticket or not, it's a fundamental problem of LL/SC. In any situation where
we rely on atomics for forward progress, this can happen.

The little core (or really any core) should hold on to the locked
cacheline for a while and not instantly relinquish it, giving it a chance
to reach the SC.


* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-07-28  1:10     ` Vikram Mulukutla
  2017-07-28  9:28       ` Peter Zijlstra
@ 2017-07-28  9:28       ` Will Deacon
  2017-07-28 19:09         ` Vikram Mulukutla
  1 sibling, 1 reply; 23+ messages in thread
From: Will Deacon @ 2017-07-28  9:28 UTC (permalink / raw)
  To: Vikram Mulukutla
  Cc: qiaozhou, Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Peter Zijlstra, linux-kernel-owner, sudeep.holla

On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
> On 2017-07-26 18:29, qiaozhou wrote:
> >On 2017-07-26 22:16, Thomas Gleixner wrote:
> >>On Wed, 26 Jul 2017, qiaozhou wrote:
> >>For that particular timer case we can clear base->running_timer w/o the
> >>lock held (see patch below), but this kind of
> >>
> >>      lock -> test -> unlock -> retry
> >>
> >>loops are all over the place in the kernel, so this is going to hurt you
> >>sooner than later in some other place.
> >It's true, this is how spinlocks are normally and widely used in the
> >kernel. I'll also ask the ARM experts whether we can do something to
> >avoid or reduce the chance of such an issue. ARMv8.1 has a single
> >instruction (ldadda) that can replace the ldaxr/stxr loop; hopefully
> >it improves things and reduces the chance.
> 
> I think we should have this discussion now - I brought this up earlier [1]
> and I promised a test case that I completely forgot about - but here it
> is (attached). Essentially a Big CPU in an acquire-check-release loop
> will have an unfair advantage over a little CPU concurrently attempting
> to acquire the same lock, in spite of the ticket implementation. If the
> Big CPU needs the little CPU to make forward progress: livelock.
> 
> We've run into the same loop construct in other spots in the kernel, and
> the reason that a real symptom is so rare is that the retry loop on the
> 'Big' CPU needs to be interrupted just once by, say, an IRQ/FIQ and the
> livelock is broken. If the entire retry loop is within an interrupt-disabled
> critical section then the odds of livelocking are much higher.
> 
> An example of the problem on a previous kernel is here [2]. Changes to the
> workqueue code since then may have fixed this particular instance.
> 
> One solution was to use udelay(1) in such loops instead of cpu_relax(), but
> that's not very 'relaxing'. I'm not sure if there's something we could do
> within the ticket spin-lock implementation to deal with this.

Does bodging cpu_relax to back-off to wfe after a while help? The event
stream will wake it up if nothing else does. Nasty patch below, but I'd be
interested to know whether or not it helps.

Will

--->8

diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index 64c9e78f9882..1f5a29c8612e 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -149,9 +149,11 @@ extern void release_thread(struct task_struct *);
 
 unsigned long get_wchan(struct task_struct *p);
 
+void __cpu_relax(unsigned long pc);
+
 static inline void cpu_relax(void)
 {
-	asm volatile("yield" ::: "memory");
+	__cpu_relax(_THIS_IP_);
 }
 
 /* Thread switching */
diff --git a/arch/arm64/kernel/arm64ksyms.c b/arch/arm64/kernel/arm64ksyms.c
index 67368c7329c0..be8a698ea680 100644
--- a/arch/arm64/kernel/arm64ksyms.c
+++ b/arch/arm64/kernel/arm64ksyms.c
@@ -72,6 +72,8 @@ EXPORT_SYMBOL(_mcount);
 NOKPROBE_SYMBOL(_mcount);
 #endif
 
+EXPORT_SYMBOL(__cpu_relax);
+
 	/* arm-smccc */
 EXPORT_SYMBOL(__arm_smccc_smc);
 EXPORT_SYMBOL(__arm_smccc_hvc);
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 659ae8094ed5..c394c3704b7f 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -403,6 +403,31 @@ unsigned long get_wchan(struct task_struct *p)
 	return ret;
 }
 
+static DEFINE_PER_CPU(u64, __cpu_relax_data);
+
+#define CPU_RELAX_WFE_THRESHOLD	10000
+void __cpu_relax(unsigned long pc)
+{
+	u64 new, old = raw_cpu_read(__cpu_relax_data);
+	u32 old_pc, new_pc;
+	bool wfe = false;
+
+	old_pc = (u32)old;
+	new = new_pc = (u32)pc;
+
+	if (old_pc == new_pc) {
+		if ((old >> 32) > CPU_RELAX_WFE_THRESHOLD) {
+			asm volatile("sevl; wfe; wfe\n" ::: "memory");
+			wfe = true;
+		} else {
+			new = old + (1UL << 32);
+		}
+	}
+
+	if (this_cpu_cmpxchg(__cpu_relax_data, old, new) == old && !wfe)
+		asm volatile("yield" ::: "memory");
+}
+
 unsigned long arch_align_stack(unsigned long sp)
 {
 	if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)

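For reference, the lock -> test -> unlock -> retry construct this patch works
around can be sketched in userspace. This is a hypothetical pthread analogue
of the try_to_del_timer_sync() pattern, not the kernel code; all names here
are illustrative:

```c
#include <pthread.h>
#include <stdatomic.h>

/* Userspace model: one thread stands in for the CPU running the timer
 * callback, the other for the CPU trying to delete the timer.  The
 * deleter's lock/test/unlock/retry loop is the part that can starve a
 * slower CPU on asymmetric hardware. */
static pthread_mutex_t base_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int running_timer = 1;    /* non-zero: callback in flight */

static void *expire_thread(void *unused)
{
	/* stands in for call_timer_fn() completing and the base lock
	 * being retaken to clear base->running_timer */
	pthread_mutex_lock(&base_lock);
	atomic_store(&running_timer, 0);
	pthread_mutex_unlock(&base_lock);
	return NULL;
}

static int try_to_del_timer(void)
{
	for (;;) {
		pthread_mutex_lock(&base_lock);           /* lock   */
		int busy = atomic_load(&running_timer);   /* test   */
		pthread_mutex_unlock(&base_lock);         /* unlock */
		if (!busy)
			return 0;
		/* retry: cpu_relax() in the kernel; a fast CPU can spin
		 * here and keep winning the lock against a slow one */
	}
}
```

The deleter loop above maps to steps 2.1-2.6 from the first mail in the
thread; the livelock needs real asymmetric hardware to reproduce, so this
sketch only illustrates the shape of the construct.
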
^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-07-28  9:28       ` Will Deacon
@ 2017-07-28 19:09         ` Vikram Mulukutla
  2017-07-31 11:20           ` qiaozhou
  2017-07-31 13:13           ` Will Deacon
  0 siblings, 2 replies; 23+ messages in thread
From: Vikram Mulukutla @ 2017-07-28 19:09 UTC (permalink / raw)
  To: Will Deacon
  Cc: qiaozhou, Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Peter Zijlstra, linux-kernel-owner, sudeep.holla

On 2017-07-28 02:28, Will Deacon wrote:
> On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:

<snip>

>> 
>> I think we should have this discussion now - I brought this up earlier 
>> [1]
>> and I promised a test case that I completely forgot about - but here 
>> it
>> is (attached). Essentially a Big CPU in an acquire-check-release loop
>> will have an unfair advantage over a little CPU concurrently 
>> attempting
>> to acquire the same lock, in spite of the ticket implementation. If 
>> the Big
>> CPU needs the little CPU to make forward progress : livelock.
>> 

<snip>

>> 
>> One solution was to use udelay(1) in such loops instead of 
>> cpu_relax(), but
>> that's not very 'relaxing'. I'm not sure if there's something we could 
>> do
>> within the ticket spin-lock implementation to deal with this.
> 
> Does bodging cpu_relax to back-off to wfe after a while help? The event
> stream will wake it up if nothing else does. Nasty patch below, but I'd 
> be
> interested to know whether or not it helps.
> 
> Will
> 
This does seem to help. Here's some data after 5 runs with and without 
the patch.

time = max time taken to acquire lock
counter = number of times lock acquired

cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
Without the cpu_relax() bodging patch:
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
   117893us|       2349144|        2us|       6748236|
   571260us|       2125651|        2us|       7643264|
    19780us|       2392770|        2us|       5987203|
    19948us|       2395413|        2us|       5977286|
    19822us|       2429619|        2us|       5768252|
    19888us|       2444940|        2us|       5675657|
=====================================================

cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
With the cpu_relax() bodging patch:
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
        3us|       2737438|        2us|       6907147|
        2us|       2742478|        2us|       6902241|
      132us|       2745636|        2us|       6876485|
        3us|       2744554|        2us|       6898048|
        3us|       2741391|        2us|       6882901|
=====================================================

The patch also seems to have helped with fairness in general
allowing more work to be done if the CPU frequencies are more
closely matched (I don't know if this translates to real world
performance - probably not). The counter values are higher
with the patch.

time = max time taken to acquire lock
counter = number of times lock acquired

cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
Without the cpu_relax() bodging patch:
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
        2us|       5240654|        1us|       5339009|
        2us|       5287797|       97us|       5327073|
        2us|       5237634|        1us|       5334694|
        2us|       5236676|       88us|       5333582|
       84us|       5285880|       84us|       5329489|
=====================================================

cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
With the cpu_relax() bodging patch:
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
      140us|      10449121|        1us|      11154596|
        1us|      10757081|        1us|      11479395|
       83us|      10237109|        1us|      10902557|
        2us|       9871101|        1us|      10514313|
        2us|       9758763|        1us|      10391849|
=====================================================


Thanks,
Vikram

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-07-28  9:28       ` Peter Zijlstra
@ 2017-07-28 19:11         ` Vikram Mulukutla
  0 siblings, 0 replies; 23+ messages in thread
From: Vikram Mulukutla @ 2017-07-28 19:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: qiaozhou, Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Will Deacon, linux-kernel-owner, sudeep.holla

On 2017-07-28 02:28, Peter Zijlstra wrote:
> On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
> 
>> I think we should have this discussion now - I brought this up earlier 
>> [1]
>> and I promised a test case that I completely forgot about - but here 
>> it
>> is (attached). Essentially a Big CPU in an acquire-check-release loop
>> will have an unfair advantage over a little CPU concurrently 
>> attempting
>> to acquire the same lock, in spite of the ticket implementation. If 
>> the Big
>> CPU needs the little CPU to make forward progress : livelock.
> 
> This needs to be fixed in hardware. There really isn't anything the
> software can sanely do about it.
> 
> It also doesn't have anything to do with the spinlock implementation.
> Ticket or not, its a fundamental problem of LL/SC. Any situation where
> we use atomics for fwd progress guarantees this can happen.
> 

Agreed, it seems like trying to build a fair SW protocol on top of unfair
HW. But if we can minimally change such loop constructs to address this
(all instances I've seen so far use cpu_relax) it would save a lot of
hours spent debugging these problems. Lots of b.L devices out there :-)

It's also possible that such a workaround may help contention performance,
since the big CPU may have to wait for, say, a tick before breaking out of
that loop (the non-livelock scenario where the entire loop isn't in a
critical section).

> The little core (or really any core) should hold on to the locked
> cacheline for a while and not insta relinquish it. Giving it a chance 
> to
> reach the SC.

Thanks,
Vikram

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-07-28 19:09         ` Vikram Mulukutla
@ 2017-07-31 11:20           ` qiaozhou
  2017-08-01  7:37             ` qiaozhou
  2017-07-31 13:13           ` Will Deacon
  1 sibling, 1 reply; 23+ messages in thread
From: qiaozhou @ 2017-07-31 11:20 UTC (permalink / raw)
  To: Vikram Mulukutla, Will Deacon
  Cc: Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Peter Zijlstra, linux-kernel-owner, sudeep.holla



On 2017-07-29 03:09, Vikram Mulukutla wrote:
> On 2017-07-28 02:28, Will Deacon wrote:
>> On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
> 
> <snip>
> 
>>>
>>> I think we should have this discussion now - I brought this up 
>>> earlier [1]
>>> and I promised a test case that I completely forgot about - but here it
>>> is (attached). Essentially a Big CPU in an acquire-check-release loop
>>> will have an unfair advantage over a little CPU concurrently attempting
>>> to acquire the same lock, in spite of the ticket implementation. If 
>>> the Big
>>> CPU needs the little CPU to make forward progress : livelock.
>>>
> 
> <snip>
> 
>>>
>>> One solution was to use udelay(1) in such loops instead of 
>>> cpu_relax(), but
>>> that's not very 'relaxing'. I'm not sure if there's something we 
>>> could do
>>> within the ticket spin-lock implementation to deal with this.
>>
>> Does bodging cpu_relax to back-off to wfe after a while help? The event
>> stream will wake it up if nothing else does. Nasty patch below, but 
>> I'd be
>> interested to know whether or not it helps.
>>
>> Will
>>
The patch also helps a lot on my platform. (Though it does cause a 
deadlock (related to udelay) in the uart driver in early boot, and I'm not 
sure whether it's a uart driver issue. I've just worked around it for now.)

Platform: 4 a53(832MHz) + 4 a73(1.8GHz)
Test condition #1:
     a. core2: a53, while loop (spinlock, spin_unlock)
     b. core7: a73, while loop (spinlock, spin_unlock, cpu_relax)

Test result: recording the number of lock acquisitions (a53, a73) and the 
max lock-acquire time (a53), over 20 seconds

Without cpu_relax bodging patch:
===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
                182|          38371616|               1,951,954|
                202|          38427652|               2,261,319|
                210|          38477427|              15,309,597|
                207|          38494479|               6,656,453|
                220|          38422283|               2,064,155|
===============================================================

With cpu_relax bodging patch:
===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
            1849898|          37799379|                 131,255|
            1574172|          38557653|                  38,410|
            1924777|          37831725|                  42,999|
            1477665|          38723741|                  52,087|
            1865793|          38007741|                 783,965|
===============================================================

Also add some workload to the whole system to check the result.
Test condition #2: based on #1
     c. core6: a73, 1.8GHz, run "while(1);" loop

With cpu_relax bodging patch:
===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
                 20|          42563981|               2,317,070|
                 10|          42652793|               4,210,944|
                  9|          42651075|               5,691,834|
                 28|          42652591|               4,539,555|
                 10|          42652801|               5,850,639|
===============================================================

Also hotplug out other cores.
Test condition #3: based on #1
     d. hotplug out core1/3/4/5/6, keep core0 for scheduling

With cpu_relax bodging patch:
===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
                447|          42652450|                 309,549|
                515|          42650382|                 337,661|
                415|          42646669|                 628,525|
                431|          42651137|                 365,862|
                464|          42648916|                 379,934|
===============================================================

The last two tests are the actual cases where the hard-lockup is 
triggered on my platform. So I gathered some data, and it shows that the 
a53 needs a much longer time to acquire the lock.

All tests are done in Android, black screen with a USB cable attached. The 
data is not as pretty as Vikram's. It might be related to CPU topology, 
core count, CCI frequency, etc. (I'll do another test with both a53 and 
a73 running at 1.2GHz, to check whether it's the core frequency that 
leads to the major difference.)

> This does seem to help. Here's some data after 5 runs with and without 
> the patch.
> 
> time = max time taken to acquire lock
> counter = number of times lock acquired
> 
> cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
> Without the cpu_relax() bodging patch:
> =====================================================
> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
> ==========|==============|===========|==============|
>    117893us|       2349144|        2us|       6748236|
>    571260us|       2125651|        2us|       7643264|
>     19780us|       2392770|        2us|       5987203|
>     19948us|       2395413|        2us|       5977286|
>     19822us|       2429619|        2us|       5768252|
>     19888us|       2444940|        2us|       5675657|
> =====================================================
> 
> cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
> With the cpu_relax() bodging patch:
> =====================================================
> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
> ==========|==============|===========|==============|
>         3us|       2737438|        2us|       6907147|
>         2us|       2742478|        2us|       6902241|
>       132us|       2745636|        2us|       6876485|
>         3us|       2744554|        2us|       6898048|
>         3us|       2741391|        2us|       6882901|
> =====================================================
> 
> The patch also seems to have helped with fairness in general
> allowing more work to be done if the CPU frequencies are more
> closely matched (I don't know if this translates to real world
> performance - probably not). The counter values are higher
> with the patch.
> 
> time = max time taken to acquire lock
> counter = number of times lock acquired
> 
> cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
> Without the cpu_relax() bodging patch:
> =====================================================
> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
> ==========|==============|===========|==============|
>         2us|       5240654|        1us|       5339009|
>         2us|       5287797|       97us|       5327073|
>         2us|       5237634|        1us|       5334694|
>         2us|       5236676|       88us|       5333582|
>        84us|       5285880|       84us|       5329489|
> =====================================================
> 
> cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
> With the cpu_relax() bodging patch:
> =====================================================
> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
> ==========|==============|===========|==============|
>       140us|      10449121|        1us|      11154596|
>         1us|      10757081|        1us|      11479395|
>        83us|      10237109|        1us|      10902557|
>         2us|       9871101|        1us|      10514313|
>         2us|       9758763|        1us|      10391849|
> =====================================================
> 
> 
> Thanks,
> Vikram
> 


* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-07-28 19:09         ` Vikram Mulukutla
  2017-07-31 11:20           ` qiaozhou
@ 2017-07-31 13:13           ` Will Deacon
  2017-08-03 23:25             ` Vikram Mulukutla
  1 sibling, 1 reply; 23+ messages in thread
From: Will Deacon @ 2017-07-31 13:13 UTC (permalink / raw)
  To: Vikram Mulukutla
  Cc: qiaozhou, Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Peter Zijlstra, linux-kernel-owner, sudeep.holla

Hi Vikram,

On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote:
> On 2017-07-28 02:28, Will Deacon wrote:
> >On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
> 
> <snip>
> 
> >>
> >>I think we should have this discussion now - I brought this up earlier
> >>[1]
> >>and I promised a test case that I completely forgot about - but here it
> >>is (attached). Essentially a Big CPU in an acquire-check-release loop
> >>will have an unfair advantage over a little CPU concurrently attempting
> >>to acquire the same lock, in spite of the ticket implementation. If the
> >>Big
> >>CPU needs the little CPU to make forward progress : livelock.
> >>
> 
> <snip>
> 
> >>
> >>One solution was to use udelay(1) in such loops instead of cpu_relax(),
> >>but
> >>that's not very 'relaxing'. I'm not sure if there's something we could
> >>do
> >>within the ticket spin-lock implementation to deal with this.
> >
> >Does bodging cpu_relax to back-off to wfe after a while help? The event
> >stream will wake it up if nothing else does. Nasty patch below, but I'd be
> >interested to know whether or not it helps.
> >
> >Will
> >
> This does seem to help. Here's some data after 5 runs with and without the
> patch.

Blimey, that does seem to make a difference. Shame it's so ugly! Would you
be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I had
it set to 10000 in the diff I posted, but that might be higher than optimal.
It would be interesting to see if it correlates with num_possible_cpus()
for the highly contended case.

Will
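
To experiment with threshold values without rebuilding a kernel, the patch's
per-call-site counter can be modeled in single-threaded userspace. This
sketches the bookkeeping logic only (the cmpxchg, yield and wfe themselves
are elided); CPU_RELAX_WFE_THRESHOLD matches the value from the diff:

```c
#include <stdint.h>
#include <stdbool.h>

#define CPU_RELAX_WFE_THRESHOLD	10000

/* low 32 bits: last caller pc; high 32 bits: consecutive-call count */
static uint64_t relax_data;

/* returns true when the caller should drop into the wfe back-off path */
static bool cpu_relax_sim(uint32_t pc)
{
	uint64_t old = relax_data;
	uint32_t old_pc = (uint32_t)old;
	uint64_t new = pc;                 /* counter field restarts at zero */
	bool wfe = false;

	if (old_pc == pc) {
		if ((old >> 32) > CPU_RELAX_WFE_THRESHOLD)
			wfe = true;        /* the patch does sevl; wfe; wfe */
		else
			new = old + (1ULL << 32);  /* bump the per-pc counter */
	}
	relax_data = new;                  /* this_cpu_cmpxchg() in the patch */
	return wfe;
}
```

With the threshold at 10000, the wfe path fires roughly once per ~10k
consecutive relaxes from the same call site, then the counter resets;
lowering the define models a more aggressive back-off.
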


* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-07-31 11:20           ` qiaozhou
@ 2017-08-01  7:37             ` qiaozhou
  2017-08-03 23:32               ` Vikram Mulukutla
  0 siblings, 1 reply; 23+ messages in thread
From: qiaozhou @ 2017-08-01  7:37 UTC (permalink / raw)
  To: Vikram Mulukutla, Will Deacon
  Cc: Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Peter Zijlstra, linux-kernel-owner, sudeep.holla,
	Zhou Qiao



On 2017-07-31 19:20, qiaozhou wrote:
> 
> 
> On 2017-07-29 03:09, Vikram Mulukutla wrote:
>> On 2017-07-28 02:28, Will Deacon wrote:
>>> On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
>>
>> <snip>
>>>
>>> Does bodging cpu_relax to back-off to wfe after a while help? The event
>>> stream will wake it up if nothing else does. Nasty patch below, but 
>>> I'd be
>>> interested to know whether or not it helps.
>>>
>>> Will
>>>
> The patch also helps a lot on my platform. (Though it does cause 
> deadlock(related with udelay) in uart driver in early boot, and not sure 
> it's uart driver issue. Just workaround it firstly)
> 
> Platform: 4 a53(832MHz) + 4 a73(1.8GHz)
> Test condition #1:
>      a. core2: a53, while loop (spinlock, spin_unlock)
>      b. core7: a73, while loop (spinlock, spin_unlock, cpu_relax)
> 
> Test result: recording the lock acquire times(a53, a73), max lock 
> acquired time(a53), in 20 seconds
> 
> Without cpu_relax bodging patch:
> ===============================================================
> |a53 locked times | a73 locked times | a53 max locked time(us)|
> ==================|==================|========================|
>                 182|          38371616|               1,951,954|
>                 202|          38427652|               2,261,319|
>                 210|          38477427|              15,309,597|
>                 207|          38494479|               6,656,453|
>                 220|          38422283|               2,064,155|
> ===============================================================
> 
> With cpu_relax bodging patch:
> ===============================================================
> |a53 locked times | a73 locked times | a53 max locked time(us)|
> ==================|==================|========================|
>             1849898|          37799379|                 131,255|
>             1574172|          38557653|                  38,410|
>             1924777|          37831725|                  42,999|
>             1477665|          38723741|                  52,087|
>             1865793|          38007741|                 783,965|
> ===============================================================
> 
> Also add some workload to the whole system to check the result.
> Test condition #2: based on #1
>      c. core6: a73, 1.8GHz, run "while(1);" loop
> 
> With cpu_relax bodging patch:
> ===============================================================
> |a53 locked times | a73 locked times | a53 max locked time(us)|
> ==================|==================|========================|
>                  20|          42563981|               2,317,070|
>                  10|          42652793|               4,210,944|
>                   9|          42651075|               5,691,834|
>                  28|          42652591|               4,539,555|
>                  10|          42652801|               5,850,639|
> ===============================================================
> 
> Also hotplug out other cores.
> Test condition #3: based on #1
>      d. hotplug out core1/3/4/5/6, keep core0 for scheduling
> 
> With cpu_relax bodging patch:
> ===============================================================
> |a53 locked times | a73 locked times | a53 max locked time(us)|
> ==================|==================|========================|
>                 447|          42652450|                 309,549|
>                 515|          42650382|                 337,661|
>                 415|          42646669|                 628,525|
>                 431|          42651137|                 365,862|
>                 464|          42648916|                 379,934|
> ===============================================================
> 
> The last two tests are the actual cases where the hard-lockup is 
> triggered on my platform. So I gathered some data, and it shows that a53 
> needs much longer time to acquire the lock.
> 
> All tests are done in android, black screen with USB cable attached. The 
> data is not so pretty as Vikram's. It might be related with cpu 
> topology, core numbers, CCI frequency etc. (I'll do another test with 
> both a53 and a73 running at 1.2GHz, to check whether it's the core 
> frequency which leads to the major difference.)
> 
Tested the contention with the a53 and a73 cores at the same frequency.
Platform: 4 a53(1248MHz) + 4 a73(1248MHz)
Test condition #4:
      a. core2: a53, while loop (spinlock, spin_unlock)
      b. core7: a73, while loop (spinlock, spin_unlock)
===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
           12945632|          13021576|                      14|
           12934181|          13059230|                      16|
           12987186|          13059016|                      49|
           12958583|          13038884|                      24|
           14637546|          14672522|                      14|
===============================================================

The lock counts are almost the same, and the max lock-acquire time on the 
a53 also drops. On my platform, core frequency seems to be the key 
factor.

>> This does seem to help. Here's some data after 5 runs with and without 
>> the patch.
>>
>> time = max time taken to acquire lock
>> counter = number of times lock acquired
>>
>> cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
>> Without the cpu_relax() bodging patch:
>> =====================================================
>> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
>> ==========|==============|===========|==============|
>>    117893us|       2349144|        2us|       6748236|
>>    571260us|       2125651|        2us|       7643264|
>>     19780us|       2392770|        2us|       5987203|
>>     19948us|       2395413|        2us|       5977286|
>>     19822us|       2429619|        2us|       5768252|
>>     19888us|       2444940|        2us|       5675657|
>> =====================================================
>>
>> cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
>> With the cpu_relax() bodging patch:
>> =====================================================
>> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
>> ==========|==============|===========|==============|
>>         3us|       2737438|        2us|       6907147|
>>         2us|       2742478|        2us|       6902241|
>>       132us|       2745636|        2us|       6876485|
>>         3us|       2744554|        2us|       6898048|
>>         3us|       2741391|        2us|       6882901|
>> =====================================================
>> 
>> The patch also seems to have helped with fairness in general
>> allowing more work to be done if the CPU frequencies are more
>> closely matched (I don't know if this translates to real world
>> performance - probably not). The counter values are higher
>> with the patch.
>>
>> time = max time taken to acquire lock
>> counter = number of times lock acquired
>>
>> cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
>> Without the cpu_relax() bodging patch:
>> =====================================================
>> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
>> ==========|==============|===========|==============|
>>         2us|       5240654|        1us|       5339009|
>>         2us|       5287797|       97us|       5327073|
>>         2us|       5237634|        1us|       5334694|
>>         2us|       5236676|       88us|       5333582|
>>        84us|       5285880|       84us|       5329489|
>> =====================================================
>>
>> cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
>> With the cpu_relax() bodging patch:
>> =====================================================
>> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
>> ==========|==============|===========|==============|
>>       140us|      10449121|        1us|      11154596|
>>         1us|      10757081|        1us|      11479395|
>>        83us|      10237109|        1us|      10902557|
>>         2us|       9871101|        1us|      10514313|
>>         2us|       9758763|        1us|      10391849|
>> =====================================================

Also applied Vikram's patch and ran a test:

cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
Without cpu_relax bodging patch
=====================================================
cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
==========|==============|===========|==============|
      16505|          5243|          2|      12487322|
      16494|          5619|          1|      12013291|
      16498|          5276|          2|      11706824|
      16494|          7123|          1|      12532355|
      16470|          7208|          2|      11784617|
=====================================================

cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
With cpu_relax bodging patch:
=====================================================
cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
==========|==============|===========|==============|
       3991|        140714|          1|      11430528|
       4018|        144371|          1|      11430528|
       4034|        143250|          1|      11427011|
       4330|        147345|          1|      11423583|
       4752|        138273|          1|      11433241|
=====================================================

It shows some improvement, but not as good as Vikram's data. The big core 
still has a much better chance of acquiring the lock.
>>
>> Thanks,
>> Vikram
>>


* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-07-31 13:13           ` Will Deacon
@ 2017-08-03 23:25             ` Vikram Mulukutla
  2017-08-15 18:40               ` Will Deacon
  0 siblings, 1 reply; 23+ messages in thread
From: Vikram Mulukutla @ 2017-08-03 23:25 UTC (permalink / raw)
  To: Will Deacon
  Cc: qiaozhou, Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Peter Zijlstra, linux-kernel-owner, sudeep.holla

Hi Will,

On 2017-07-31 06:13, Will Deacon wrote:
> Hi Vikram,
> 
> On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote:
>> On 2017-07-28 02:28, Will Deacon wrote:
>> >On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:

>> >
>> This does seem to help. Here's some data after 5 runs with and without 
>> the
>> patch.
> 
> Blimey, that does seem to make a difference. Shame it's so ugly! Would 
> you
> be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I 
> had
> it set to 10000 in the diff I posted, but that might be higher than 
> optimal.
> It would be interesting to see if it correlates with num_possible_cpus()
> for the highly contended case.
> 
> Will

Sorry for the late response - I should hopefully have some more data with
different thresholds before the week is finished or on Monday.

Thanks,
Vikram
-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-08-01  7:37             ` qiaozhou
@ 2017-08-03 23:32               ` Vikram Mulukutla
  2017-08-04  3:15                 ` qiaozhou
  0 siblings, 1 reply; 23+ messages in thread
From: Vikram Mulukutla @ 2017-08-03 23:32 UTC (permalink / raw)
  To: qiaozhou
  Cc: Will Deacon, Thomas Gleixner, John Stultz, sboyd, LKML,
	Wang Wilbur, Marc Zyngier, Peter Zijlstra, linux-kernel-owner,
	sudeep.holla


Hi Qiao,


On 2017-08-01 00:37, qiaozhou wrote:
> On 2017-07-31 19:20, qiaozhou wrote:
>> 
>> 

<snip>

>>> =====================================================
>>> Also apply Vikram's patch and have a test.
> 
> cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
> Without cpu_relax bodging patch
> =====================================================
> cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
> ==========|==============|===========|==============|
>      16505|          5243|          2|      12487322|
>      16494|          5619|          1|      12013291|
>      16498|          5276|          2|      11706824|
>      16494|          7123|          1|      12532355|
>      16470|          7208|          2|      11784617|
> =====================================================
> 
> cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
> With cpu_relax bodging patch:
> =====================================================
> cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
> ==========|==============|===========|==============|
>       3991|        140714|          1|      11430528|
>       4018|        144371|          1|      11430528|
>       4034|        143250|          1|      11427011|
>       4330|        147345|          1|      11423583|
>       4752|        138273|          1|      11433241|
> =====================================================
> 
> It shows some improvement, but not as good as Vikram's data. The big
> core still has a much better chance of acquiring the lock.
>>> 
>>> Thanks,
>>> Vikram
>>> 

Thanks for your data! I'll check on one of our other platforms to see
if I see similar behavior. This may have something to do with the
event-stream on your platform or the A53 revision as Sudeep pointed
out here [1] - something to check I suppose...

Thanks,
Vikram

[1] - https://lkml.org/lkml/2016/11/21/458

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-08-03 23:32               ` Vikram Mulukutla
@ 2017-08-04  3:15                 ` qiaozhou
  0 siblings, 0 replies; 23+ messages in thread
From: qiaozhou @ 2017-08-04  3:15 UTC (permalink / raw)
  To: Vikram Mulukutla
  Cc: Will Deacon, Thomas Gleixner, John Stultz, sboyd, LKML,
	Wang Wilbur, Marc Zyngier, Peter Zijlstra, linux-kernel-owner,
	sudeep.holla



On 2017-08-04 07:32, Vikram Mulukutla wrote:
> 
> Hi Qiao,
> 
> 
> On 2017-08-01 00:37, qiaozhou wrote:
>> On 2017-07-31 19:20, qiaozhou wrote:
>>>
>>>
> 
> <snip>
> 
>>>> =====================================================
>>>> Also apply Vikram's patch and have a test.
>>
>> cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
>> Without cpu_relax bodging patch
>> =====================================================
>> cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
>> ==========|==============|===========|==============|
>>      16505|          5243|          2|      12487322|
>>      16494|          5619|          1|      12013291|
>>      16498|          5276|          2|      11706824|
>>      16494|          7123|          1|      12532355|
>>      16470|          7208|          2|      11784617|
>> =====================================================
>>
>> cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
>> With cpu_relax bodging patch:
>> =====================================================
>> cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
>> ==========|==============|===========|==============|
>>       3991|        140714|          1|      11430528|
>>       4018|        144371|          1|      11430528|
>>       4034|        143250|          1|      11427011|
>>       4330|        147345|          1|      11423583|
>>       4752|        138273|          1|      11433241|
>> =====================================================
>>
>> It shows some improvement, but not as good as Vikram's data. The big
>> core still has a much better chance of acquiring the lock.
>>>>
>>>> Thanks,
>>>> Vikram
>>>>
> 
> Thanks for your data! I'll check on one of our other platforms to see
> if I see similar behavior. This may have something to do with the
> event-stream on your platform or the A53 revision as Sudeep pointed
> out here [1] - something to check I suppose...
Thanks for the reminder. Our HW IP has already fixed the bug Sudeep 
mentioned. I'm also checking the global exclusive monitor implementation 
on our platform with our ASIC guys; it might be related to snoop 
transactions, or to implementation details in the global exclusive monitor.

Thanks a lot.
Qiao
> 
> Thanks,
> Vikram
> 
> [1] - https://lkml.org/lkml/2016/11/21/458
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-08-03 23:25             ` Vikram Mulukutla
@ 2017-08-15 18:40               ` Will Deacon
  2017-08-25 19:48                 ` Vikram Mulukutla
  0 siblings, 1 reply; 23+ messages in thread
From: Will Deacon @ 2017-08-15 18:40 UTC (permalink / raw)
  To: Vikram Mulukutla
  Cc: qiaozhou, Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Peter Zijlstra, linux-kernel-owner, sudeep.holla

Hi Vikram,

On Thu, Aug 03, 2017 at 04:25:12PM -0700, Vikram Mulukutla wrote:
> On 2017-07-31 06:13, Will Deacon wrote:
> >On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote:
> >>On 2017-07-28 02:28, Will Deacon wrote:
> >>>On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
> 
> >>>
> >>This does seem to help. Here's some data after 5 runs with and without
> >>the
> >>patch.
> >
> >Blimey, that does seem to make a difference. Shame it's so ugly! Would you
> >be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I had
> >it set to 10000 in the diff I posted, but that might be higher than
> >optimal.
> >It would be interesting to see if it correlates with num_possible_cpus()
> >for the highly contended case.
> >
> >Will
> 
> Sorry for the late response - I should hopefully have some more data with
> different thresholds before the week is finished or on Monday.

Did you get anywhere with the threshold heuristic?

Will

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-08-15 18:40               ` Will Deacon
@ 2017-08-25 19:48                 ` Vikram Mulukutla
  2017-08-25 20:25                   ` Vikram Mulukutla
  2017-08-28 23:12                   ` Vikram Mulukutla
  0 siblings, 2 replies; 23+ messages in thread
From: Vikram Mulukutla @ 2017-08-25 19:48 UTC (permalink / raw)
  To: Will Deacon
  Cc: qiaozhou, Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Peter Zijlstra, linux-kernel-owner, sudeep.holla


Hi Will,

On 2017-08-15 11:40, Will Deacon wrote:
> Hi Vikram,
> 
> On Thu, Aug 03, 2017 at 04:25:12PM -0700, Vikram Mulukutla wrote:
>> On 2017-07-31 06:13, Will Deacon wrote:
>> >On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote:
>> >>On 2017-07-28 02:28, Will Deacon wrote:
>> >>>On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
>> 
>> >>>
>> >>This does seem to help. Here's some data after 5 runs with and without
>> >>the
>> >>patch.
>> >
>> >Blimey, that does seem to make a difference. Shame it's so ugly! Would you
>> >be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I had
>> >it set to 10000 in the diff I posted, but that might be higher than
>> >optimal.
>> >It would be interesting to see if it correlates with num_possible_cpus()
>> >for the highly contended case.
>> >
>> >Will
>> 
>> Sorry for the late response - I should hopefully have some more data 
>> with
>> different thresholds before the week is finished or on Monday.
> 
> Did you get anywhere with the threshold heuristic?
> 
> Will

Here's some data from experiments that I finally got to today. I decided
to recompile for every value of the threshold. I was doing a binary
search of sorts and then started reducing by orders of magnitude. There
are pairs of rows here:

Row1 is with cpu0 (little) at 300MHz and cpu4 at 1.9GHz
Row2 is with cpu0 (little) at 1.5GHz and cpu4 at 1.9GHz

It looks like even with the threshold set to 1, we don't hit the worst
case of a single instance of locking taking a long time, probably a
consequence of how the test is designed? However, as we reduce the
threshold, the fairness in terms of how many times each CPU acquires
the lock skews towards the big CPU, starting with threshold=500.

If I understand the code correctly, the upper 32 bits of an ARM64
virtual address will overflow when 1 is added to it, and so we'll keep
WFE'ing on every subsequent cpu_relax invoked from the same PC, until
we cross the hard-coded threshold, right?


CPU_RELAX_WFE_THRESHOLD = 5000 and 2500 (very similar results)
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
         2|       2763059|          2|       7323169|
         0|      11477590|          1|      12110373|

         3|       2762788|          2|       7329881|
         1|      11557987|          1|      12042980|

         3|       2765912|          2|       7308789|
         1|      11470294|          1|      12120074|

         3|       2761793|          2|       7333907|
         1|      11431841|          1|      12155046|

         3|       2762402|          2|       7328843|
         1|      11495705|          1|      12096518|

         3|       2764392|          2|       7308640|
         1|      11479146|          1|      12111419|
=====================================================

CPU_RELAX_WFE_THRESHOLD = 500
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
         3|       2338277|          2|      10052592|
         1|       6963131|          1|      18103639|

         3|       2337982|          2|      10037188|
         1|       6979396|          1|      18082811|

         3|       2337282|          2|      10052184|
         0|       6954990|          1|      18109860|

         3|       2338737|          2|      10039556|
         1|       7185046|          1|      17809240|

         4|       2338857|          2|      10027407|
         1|       6958274|          1|      18111394|

         4|       2340208|          2|      10031173|
         0|       7097088|          1|      17921861|
-----------------------------------------------------
=====================================================

CPU_RELAX_WFE_THRESHOLD = 50
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
         4|       1219792|          2|      18005180|
         0|       1252767|          1|      25296935|

         4|       1219312|          2|      18049566|
         1|       1252625|          1|      25227292|

         4|       1219884|          2|      18020775|
         1|       1252363|          1|      25298387|

         4|       1220862|          2|      18012062|
         1|       1251827|          1|      25283787|

         4|       1220489|          2|      18010055|
         0|       1251729|          1|      25272917|

         3|       1220088|          2|      18027279|
         0|       1253264|          1|      25268834|
-----------------------------------------------------
=====================================================

CPU_RELAX_WFE_THRESHOLD = 10
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
         3|        298604|          1|      23784805|
         0|        293511|          1|      24604172|

         3|        294707|          2|      23857487|
         0|        292519|          1|      24564209|

         4|        294199|          1|      23832180|
         0|        293840|          1|      24593323|

         4|        294314|          1|      23853353|
         0|        293609|          1|      24635190|

         4|        293802|          1|      23836764|
         0|        293322|          1|      24553212|

         3|        293658|          1|      23889801|
         0|        292663|          1|      24552118|
-----------------------------------------------------
=====================================================

CPU_RELAX_WFE_THRESHOLD = 5
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
         3|        173061|          1|      22332479|
         0|        173759|          1|      23774009|

         3|        174471|          1|      22342362|
         0|        173161|          1|      23814466|

         3|        173851|          2|      22235422|
         0|        172734|          1|      23705848|

         2|        173452|          1|      22255166|
         0|        172830|          1|      23824301|

         2|        173028|          1|      22390297|
         0|        172336|          1|      23836407|

         3|        172968|          1|      22285954|
         0|        173207|          1|      23844900|
-----------------------------------------------------
=====================================================

CPU_RELAX_WFE_THRESHOLD = 1
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
         2|         64245|          1|       6266848|
         0|         77117|          1|      20917346|

         2|         71310|          1|       5184106|
         1|         77426|          1|      21040797|

         3|         71335|          2|       5024650|
         0|         77167|          1|      20934429|

         3|         71295|          1|       5361696|
         0|         77377|          1|      20902970|

         2|         71357|          1|       5302482|
         0|         77278|          1|      20967106|

         3|         71158|          1|       5214564|
         0|         77334|          1|      21022485|
-----------------------------------------------------
=====================================================

Thanks,
Vikram

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-08-25 19:48                 ` Vikram Mulukutla
@ 2017-08-25 20:25                   ` Vikram Mulukutla
  2017-08-28 23:12                   ` Vikram Mulukutla
  1 sibling, 0 replies; 23+ messages in thread
From: Vikram Mulukutla @ 2017-08-25 20:25 UTC (permalink / raw)
  To: Will Deacon
  Cc: qiaozhou, Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Peter Zijlstra, linux-kernel-owner, sudeep.holla



On 2017-08-25 12:48, Vikram Mulukutla wrote:

> 
> If I understand the code correctly, the upper 32 bits of an ARM64 
> virtual
> address will overflow when 1 is added to it, and so we'll keep WFE'ing 
> on
> every subsequent cpu_relax invoked from the same PC, until we cross the
> hard-coded threshold, right?
> 

Oops, misread that. The second time we enter cpu_relax() from the same
PC, we do a WFE. Then we stop doing the WFE until we hit the threshold
using the per-cpu counter. So with a higher threshold, we wait for more
cpu_relax() calls before starting the WFE again.

So a lower threshold implies we should hit the WFE branch sooner. It
seems that since my test keeps the while loop going for a full 5
seconds, a lower threshold will obviously result in more WFEs and a
lower lock-acquired count.

I guess we want a high threshold, but not so high that the little CPU
has to wait a while before the big CPU counts up to the threshold, is
that correct?

Thanks,
Vikram

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-08-25 19:48                 ` Vikram Mulukutla
  2017-08-25 20:25                   ` Vikram Mulukutla
@ 2017-08-28 23:12                   ` Vikram Mulukutla
  2017-09-06 11:19                     ` qiaozhou
  2017-09-25 11:02                     ` qiaozhou
  1 sibling, 2 replies; 23+ messages in thread
From: Vikram Mulukutla @ 2017-08-28 23:12 UTC (permalink / raw)
  To: Will Deacon
  Cc: qiaozhou, Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Peter Zijlstra, linux-kernel-owner, sudeep.holla

Hi Will,

On 2017-08-25 12:48, Vikram Mulukutla wrote:
> Hi Will,
> 
> On 2017-08-15 11:40, Will Deacon wrote:
>> Hi Vikram,
>> 
>> On Thu, Aug 03, 2017 at 04:25:12PM -0700, Vikram Mulukutla wrote:
>>> On 2017-07-31 06:13, Will Deacon wrote:
>>> >On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote:
>>> >>On 2017-07-28 02:28, Will Deacon wrote:
>>> >>>On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
>>> 
>>> >>>
>>> >>This does seem to help. Here's some data after 5 runs with and without
>>> >>the
>>> >>patch.
>>> >
>>> >Blimey, that does seem to make a difference. Shame it's so ugly! Would you
>>> >be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I had
>>> >it set to 10000 in the diff I posted, but that might be higher than
>>> >optimal.
>>> >It would be interesting to see if it correlates with num_possible_cpus()
>>> >for the highly contended case.
>>> >
>>> >Will
>>> 
>>> Sorry for the late response - I should hopefully have some more data 
>>> with
>>> different thresholds before the week is finished or on Monday.
>> 
>> Did you get anywhere with the threshold heuristic?
>> 
>> Will
> 
> Here's some data from experiments that I finally got to today. I 
> decided
> to recompile for every value of the threshold. Was doing a binary 
> search
> of sorts and then started reducing by orders of magnitude. There pairs
> of rows here:
> 

Well here's something interesting. I tried a different platform and
found that the workaround doesn't help much at all, similar to Qiao's
observation on his b.L chipset. Something to do with the WFE
implementation or event-stream?

I modified your patch to use a __delay(1) in place of the WFEs and this
was the result (still with the 10k threshold). The worst-case lock time
for cpu0 drastically improves. Given that cpu0 re-enables interrupts
between each lock attempt in my test case, I think the lock count
matters less here.

cpu_relax() patch with WFEs (original workaround):
(pairs of rows, first row is with c0 at 300MHz, second
with c0 at 1.9GHz. Both rows have cpu4 at 2.3GHz. Max time
is in microseconds.)
------------------------------------------------------|
c0 max time| c0 lock count| c4 max time| c4 lock count|
------------------------------------------------------|
      999843|            25|           2|      12988498| -> c0/cpu0 at 300MHz
           0|       8421132|           1|       9152979| -> c0/cpu0 at 1.9GHz
------------------------------------------------------|
      999860|           160|           2|      12963487|
           1|       8418492|           1|       9158001|
------------------------------------------------------|
      999381|           734|           2|      12988636|
           1|       8387562|           1|       9128056|
------------------------------------------------------|
      989800|           750|           3|      12996473|
           1|       8389091|           1|       9112444|
------------------------------------------------------|

cpu_relax() patch with __delay(1):
(pairs of rows, first row is with c0 at 300MHz, second
with c0 at 1.9GHz. Both rows have cpu4 at 2.3GHz. Max time
is in microseconds.)
------------------------------------------------------|
c0 max time| c0 lock count| c4 max time| c4 lock count|
------------------------------------------------------|
        7703|         1532|            2|      13035203| -> c0/cpu0 at 300MHz
           1|      8511686|            1|       8550411| -> c0/cpu0 at 1.9GHz
------------------------------------------------------|
        7801|         1561|            2|      13040188|
           1|      8553985|            1|       8609853|
------------------------------------------------------|
        3953|         1576|            2|      13049991|
           1|      8576370|            1|       8611533|
------------------------------------------------------|
        3953|         1557|            2|      13030553|
           1|      8509020|            1|       8543883|
------------------------------------------------------|

I should also note that my earlier kernel was 4.9-stable based
and the one above was on a 4.4-stable based kernel.

Thanks,
Vikram

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-08-28 23:12                   ` Vikram Mulukutla
@ 2017-09-06 11:19                     ` qiaozhou
  2017-09-25 11:02                     ` qiaozhou
  1 sibling, 0 replies; 23+ messages in thread
From: qiaozhou @ 2017-09-06 11:19 UTC (permalink / raw)
  To: Vikram Mulukutla, Will Deacon
  Cc: Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Peter Zijlstra, linux-kernel-owner, sudeep.holla



On 2017-08-29 07:12, Vikram Mulukutla wrote:
> 
> Well here's something interesting. I tried a different platform and 
> found that
> the workaround doesn't help much at all, similar to Qiao's observation 
> on his b.L
> chipset. Something to do with the WFE implementation or event-stream?

Hi Vikram,

I did some experiments tuning the DDR controller (and DDR RAM)
frequency and the CCI frequency. The results are below:

cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
cci: 832M
dclk: DDR controller clock (data rate = 4 * dclk)
With cpu_relax bodging patch:
==============================================================
dclk   | cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
=======|===========|==============|===========|==============|
78M    |       8906|         55438|         13|       4015789|
156M   |       5964|         75109|          4|       8229050|
500M   |        102|       5984783|          1|       6400885|
600M   |         16|       6233601|          1|       6504718|
==============================================================

I suspect that the global exclusive monitor in the DDR controller may
play an important part. When the DDR frequency is high enough, it seems
to handle the exclusive requests efficiently and fairly.

If the CCI frequency is reduced to a lower value, the little core's
result drops a lot again.

cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
cci: 416M
dclk: DDR controller clock (data rate = 4 * dclk)
With cpu_relax bodging patch:
==============================================================
dclk   | cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
=======|===========|==============|===========|==============|
78M    |       8837|         10596|         11|       3873635|
156M   |      17597|         10211|          4|       6513493|
500M   |      10888|         13214|          2|       8916396|
600M   |       8934|         15842|          2|       9394124|
==============================================================

I guess the result on your different platform might be related to the
DDR frequency too.

Best Regards
Qiao

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-08-28 23:12                   ` Vikram Mulukutla
  2017-09-06 11:19                     ` qiaozhou
@ 2017-09-25 11:02                     ` qiaozhou
  2017-10-02 14:14                       ` Will Deacon
  1 sibling, 1 reply; 23+ messages in thread
From: qiaozhou @ 2017-09-25 11:02 UTC (permalink / raw)
  To: Vikram Mulukutla, Will Deacon
  Cc: Thomas Gleixner, John Stultz, sboyd, LKML, Wang Wilbur,
	Marc Zyngier, Peter Zijlstra, linux-kernel-owner, sudeep.holla

Hi Will,

Will this bodging patch be merged? It can solve the livelock issue on
arm64 platforms (or at least improve it a lot).

I suspected that the CCI frequency might affect the contention between
the little and big cores, but on my platform it has little impact. In
fact, it is the frequency of the external DDR controller that affects
the contention (my last reply has detailed data). The lock might be
flushed out of the cache after entering WFE and loaded back from DDR
into the cache when woken up. (I guessed that's why the external DDR
frequency matters.)

Even with the lowest DDR frequency (78M) on my platform, the little
core's maximum delay to acquire the lock drops to ~10 ms with this
bodging patch, while without the patch the delay can reach tens of
seconds in my testing, as discussed previously. So I'm wondering
whether it will be pushed into mainline, or whether more data is still
needed?

Thanks a lot.

Best Regards
Qiao

On 2017-08-29 07:12, Vikram Mulukutla wrote:
> Hi Will,
> 
> On 2017-08-25 12:48, Vikram Mulukutla wrote:
>> Hi Will,
>>
>> On 2017-08-15 11:40, Will Deacon wrote:
>>> Hi Vikram,
>>>
>>> On Thu, Aug 03, 2017 at 04:25:12PM -0700, Vikram Mulukutla wrote:
>>>> On 2017-07-31 06:13, Will Deacon wrote:
>>>> >On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote:
>>>> >>On 2017-07-28 02:28, Will Deacon wrote:
>>>> >>>On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
>>>>
>>>> >>>
>>>> >>This does seem to help. Here's some data after 5 runs with and 
>>>> without
>>>> >>the
>>>> >>patch.
>>>> >
>>>> >Blimey, that does seem to make a difference. Shame it's so ugly! 
>>>> Would you
>>>> >be able to experiment with other values for 
>>>> CPU_RELAX_WFE_THRESHOLD? I had
>>>> >it set to 10000 in the diff I posted, but that might be higher than
>>>> >optimal.
>>>> >It would be interesting to see if it correlates with 
>>>> num_possible_cpus()
>>>> >for the highly contended case.
>>>> >
>>>> >Will
>>>>
>>>> Sorry for the late response - I should hopefully have some more data 
>>>> with
>>>> different thresholds before the week is finished or on Monday.
>>>
>>> Did you get anywhere with the threshold heuristic?
>>>
>>> Will
>>
>> Here's some data from experiments that I finally got to today. I decided
>> to recompile for every value of the threshold. Was doing a binary search
>> of sorts and then started reducing by orders of magnitude. There pairs
>> of rows here:
>>
> 
> Well here's something interesting. I tried a different platform and 
> found that
> the workaround doesn't help much at all, similar to Qiao's observation 
> on his b.L
> chipset. Something to do with the WFE implementation or event-stream?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-09-25 11:02                     ` qiaozhou
@ 2017-10-02 14:14                       ` Will Deacon
  2017-10-11  8:33                         ` qiaozhou
  0 siblings, 1 reply; 23+ messages in thread
From: Will Deacon @ 2017-10-02 14:14 UTC (permalink / raw)
  To: qiaozhou
  Cc: Vikram Mulukutla, Thomas Gleixner, John Stultz, sboyd, LKML,
	Wang Wilbur, Marc Zyngier, Peter Zijlstra, linux-kernel-owner,
	sudeep.holla

Hi Qiao,

On Mon, Sep 25, 2017 at 07:02:03PM +0800, qiaozhou wrote:
> Will this bodging patch be merged? It can solve the livelock issue on arm64
> platforms (at least improve a lot).

Whilst it seemed to help in some cases, I'm not keen to merge it until
we've identified a sensible way to parameterise the threshold and can
also explain why it doesn't appear to help at all on some platforms. I
guess I could respin a version that takes the threshold on the cmdline
and also checks that the event stream is enabled, since that might
help with further benchmarking.

Would that be useful?

Will

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
  2017-10-02 14:14                       ` Will Deacon
@ 2017-10-11  8:33                         ` qiaozhou
  0 siblings, 0 replies; 23+ messages in thread
From: qiaozhou @ 2017-10-11  8:33 UTC (permalink / raw)
  To: Will Deacon
  Cc: Vikram Mulukutla, Thomas Gleixner, John Stultz, sboyd, LKML,
	Wang Wilbur, Marc Zyngier, Peter Zijlstra, linux-kernel-owner,
	sudeep.holla

Hi Will,

On 2017-10-02 22:14, Will Deacon wrote:
> Hi Qiao,
> 
> On Mon, Sep 25, 2017 at 07:02:03PM +0800, qiaozhou wrote:
>> Will this bodging patch be merged? It can solve the livelock issue on arm64
>> platforms (at least improve a lot).
> 
> Whilst it seemed to help in some cases, I'm not keen to merge it until
> we've identified a sensible way to parameterise the threshold and also
> explain why it doesn't appear to help at all on some platforms. I guess
> I could respin a version that takes the threshold on the cmdline and
> also checks to see that the event stream is enabled, since that might
> help with further benchmarking.
> 
> Would that be useful?

Got it. Please let me know if you want to verify it on my platform. Many 
thanks.

Best Regards
Qiao

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-10-11  8:33 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <3d2459c7-defd-a47e-6cea-007c10cecaac@asrmicro.com>
2017-07-26 14:16 ` [Question]: try to fix contention between expire_timers and try_to_del_timer_sync Thomas Gleixner
2017-07-27  1:29   ` qiaozhou
2017-07-27 15:14     ` Will Deacon
2017-07-27 15:19       ` Thomas Gleixner
2017-07-28  1:10     ` Vikram Mulukutla
2017-07-28  9:28       ` Peter Zijlstra
2017-07-28 19:11         ` Vikram Mulukutla
2017-07-28  9:28       ` Will Deacon
2017-07-28 19:09         ` Vikram Mulukutla
2017-07-31 11:20           ` qiaozhou
2017-08-01  7:37             ` qiaozhou
2017-08-03 23:32               ` Vikram Mulukutla
2017-08-04  3:15                 ` qiaozhou
2017-07-31 13:13           ` Will Deacon
2017-08-03 23:25             ` Vikram Mulukutla
2017-08-15 18:40               ` Will Deacon
2017-08-25 19:48                 ` Vikram Mulukutla
2017-08-25 20:25                   ` Vikram Mulukutla
2017-08-28 23:12                   ` Vikram Mulukutla
2017-09-06 11:19                     ` qiaozhou
2017-09-25 11:02                     ` qiaozhou
2017-10-02 14:14                       ` Will Deacon
2017-10-11  8:33                         ` qiaozhou

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).