linux-kernel.vger.kernel.org archive mirror
* [Problem] Cache line starvation
@ 2018-09-21 12:02 Sebastian Andrzej Siewior
  2018-09-21 12:13 ` Thomas Gleixner
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-09-21 12:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Daniel Wagner, Peter Zijlstra, Will Deacon, x86, Linus Torvalds,
	H. Peter Anvin, Boqun Feng, Paul E. McKenney

We reproducibly observe cache line starvation on a Core2Duo E6850 (2
cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4
cores).

The problem can be triggered with a v4.9-RT kernel by starting

    cyclictest -S -p98 -m  -i2000 -b 200

and as "load"

    stress-ng --ptrace 4

The reported maximal latency is usually less than 60us. If the problem
triggers, values around 400us, 800us or even more are reported. The
upper limit is the -i parameter.

Reproduction with 4.9-RT is almost immediate on Core2Duo, ARM64 and SKL,
but it took 7.5 hours to trigger on v4.14-RT on the Core2Duo.

Instrumentation always shows the same picture:

CPU0                                         CPU1
=> do_syscall_64                              => do_syscall_64
=> SyS_ptrace                                   => syscall_slow_exit_work
=> ptrace_check_attach                          => ptrace_do_notify / rt_read_unlock 
=> wait_task_inactive                              rt_spin_lock_slowunlock()
   -> while task_running()                         __rt_mutex_unlock_common()
  /   check_task_state()                           mark_wakeup_next_waiter()
 |     raw_spin_lock_irq(&p->pi_lock);             raw_spin_lock(&current->pi_lock);
 |     .                                               .
 |     raw_spin_unlock_irq(&p->pi_lock);               .
  \  cpu_relax()                                       .
   -                                                   .
    *IRQ*                                          <lock acquired>

In the error case we observe that the while() loop is repeated more than
5000 times, which indicates that CPU0 can acquire and release pi_lock over
and over. CPU1 on the other side makes no progress, waiting for the same
lock with interrupts disabled.

This continues until an IRQ hits CPU0. Once CPU0 starts processing the IRQ
the other CPU is able to acquire pi_lock and the situation relaxes.
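
For illustration, the access pattern boils down to the sketch below: two
threads hammering one lock word, one of them in a lock/unlock/relax loop.
This is a deliberately dumb userspace model (a plain test-and-set lock and
a made-up 100us reporting threshold are assumptions), not the kernel code
and not a reliable reproducer -- as discussed later in the thread, a
userspace reproducer has not worked so far. It only shows the shape of the
contention on &p->pi_lock.

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static atomic_int lock;				/* stand-in for p->pi_lock */

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void lock_it(atomic_int *l)
{
	int expected = 0;

	/* plain test-and-set loop; the kernel lock is a ticket/queued lock */
	while (!atomic_compare_exchange_weak(l, &expected, 1))
		expected = 0;
}

static void unlock_it(atomic_int *l)
{
	atomic_store(l, 0);
}

/* the wait_task_inactive() side: lock, peek, unlock, relax -- forever */
static void *cpu0(void *arg)
{
	for (;;) {
		lock_it(&lock);			/* check_task_state() */
		unlock_it(&lock);
		asm volatile("" ::: "memory");	/* stands in for cpu_relax() */
	}
	return NULL;
}

/* the mark_wakeup_next_waiter() side: a single acquisition that can starve */
static void *cpu1(void *arg)
{
	for (;;) {
		uint64_t t0 = now_ns();

		lock_it(&lock);
		unlock_it(&lock);
		if (now_ns() - t0 > 100000)	/* report stalls > 100us */
			printf("stall: %llu ns\n",
			       (unsigned long long)(now_ns() - t0));
	}
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, cpu0, NULL);
	pthread_create(&t1, NULL, cpu1, NULL);
	pthread_join(t0, NULL);			/* runs until interrupted */
	return 0;
}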

This matches Daniel Wagner's observations which he described in [0] on
v4.4-RT.

It took us weeks to hunt this down as it shows true Heisenbug behaviour:
instrumentation in the wrong place makes it vanish immediately.

Other data points:

Replacing the single rep_nop() in cpu_relax() with two consecutive
rep_nop()s makes it harder to trigger on the Core2Duo machine. Adding a
third one seems to resolve it completely, though no long-term runs have
been done yet. Also, our attempt to instrument the lock()/unlock() sequence
more deeply with rdtsc() timestamps made it go away. On SKL the identical
instrumentation hides it.

Peter suggested doing a clwb(&p->pi_lock); before the cpu_relax() in
wait_task_inactive(), which on both the Core2Duo and the SKL gets
runtime-patched to clflush(). That hides it as well.
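
Roughly like this (a condensed sketch of the wait loop only, not the actual
patch; clwb() is the x86 helper which the alternatives machinery patches
down to clflushopt()/clflush() on CPUs lacking CLWB):

	while (task_running(rq, p)) {
		/* check_task_state() takes and drops p->pi_lock in here */
		clwb(&p->pi_lock);	/* write the line back before spinning
					 * again; patched to clflush() on both
					 * the Core2Duo and the SKL */
		cpu_relax();
	}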

On ARM64 we suspected something like this a week ago without having real
proof and added a delay loop to cpu_relax(). That "resolved" the issue. By
now we have traces on ARM64 which prove it to be the same issue.

Daniel reported that disabling ticket locks on 4.4 makes the problem go
away, but he hasn't run a long-term test yet, and as we saw with 4.14 it
can take quite a while.

While it is pretty reliably reproducible with an RT kernel, this is not an
RT kernel issue at all. Mainline has similar or even the same code
patterns, but it's harder to observe as there are no latency guarantees for
an observer like cyclictest. The fact that an interrupt makes it go away
makes it even harder to observe: the rare stalls are canceled early and
simply go unnoticed.

[0] https://lkml.kernel.org/r/4d878d3e-bd1b-3d6e-bdde-4ce840191093@monom.org

Sebastian


* Re: [Problem] Cache line starvation
  2018-09-21 12:02 [Problem] Cache line starvation Sebastian Andrzej Siewior
@ 2018-09-21 12:13 ` Thomas Gleixner
  2018-09-21 12:50   ` Sebastian Andrzej Siewior
  2018-09-21 12:20 ` Peter Zijlstra
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2018-09-21 12:13 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Daniel Wagner, Peter Zijlstra, Will Deacon, x86,
	Linus Torvalds, H. Peter Anvin, Boqun Feng, Paul E. McKenney

On Fri, 21 Sep 2018, Sebastian Andrzej Siewior wrote:
> We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4
> cores).

We tried to reproduce on AMD, but so far no failure.

Thanks,

	tglx


* Re: [Problem] Cache line starvation
  2018-09-21 12:02 [Problem] Cache line starvation Sebastian Andrzej Siewior
  2018-09-21 12:13 ` Thomas Gleixner
@ 2018-09-21 12:20 ` Peter Zijlstra
  2018-09-21 12:54   ` Thomas Gleixner
  2018-10-03  7:51   ` Catalin Marinas
  2018-09-26  7:34 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2018-09-21 12:20 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Daniel Wagner, Will Deacon, x86, Linus Torvalds,
	H. Peter Anvin, Boqun Feng, Paul E. McKenney

On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4
> cores).
> 
> The problem can be triggered with a v4.9-RT kernel by starting

> Daniel reported that disabling ticket locks on 4.4 makes the problem go
> away, but he hasn't run a long time test yet and as we saw with 4.14 it can
> take quite a while.

On 4.4 and 4.9 ARM64 still uses ticket locks. So I'm very interested to
know if the ticket locks on x86 really fix it or just make it harder to
trigger.

I've been looking at qspinlock in the light of this and there is indeed
room for improvement. The ticket lock certainly is much simpler.
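
For reference, the entire ticket lock is just two counters; a minimal C11
sketch (not the kernel's arch code, and without the paravirt and relaxation
details) shows where the fairness comes from:

#include <stdatomic.h>

struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
};

static void ticket_lock(struct ticket_lock *l)
{
	/* one unconditional fetch_add: the queueing order is decided here
	 * and a later arrival can never overtake an earlier one */
	unsigned int me = atomic_fetch_add(&l->next, 1);

	while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
		;	/* cpu_relax() would go here */
}

static void ticket_unlock(struct ticket_lock *l)
{
	/* only the owner writes owner, so a plain increment is fine */
	atomic_store_explicit(&l->owner,
			      atomic_load_explicit(&l->owner,
						   memory_order_relaxed) + 1,
			      memory_order_release);
}

A test-and-set or cmpxchg-style acquire has no such hand-over: whoever wins
the next cache-line arbitration wins the lock, which is exactly where the
starvation can come from.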


* Re: [Problem] Cache line starvation
  2018-09-21 12:13 ` Thomas Gleixner
@ 2018-09-21 12:50   ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-09-21 12:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Daniel Wagner, Peter Zijlstra, Will Deacon, x86,
	Linus Torvalds, H. Peter Anvin, Boqun Feng, Paul E. McKenney

On 2018-09-21 14:13:19 [+0200], Thomas Gleixner wrote:
> On Fri, 21 Sep 2018, Sebastian Andrzej Siewior wrote:
> > We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> > cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4
> > cores).
> 
> We tried to reproduce on AMD, but so far no failure.

After tweaking the config a little, it just triggered after ~8 minutes
on an A10-7800.

> Thanks,
> 
> 	tglx

Sebastian


* Re: [Problem] Cache line starvation
  2018-09-21 12:20 ` Peter Zijlstra
@ 2018-09-21 12:54   ` Thomas Gleixner
  2018-10-03  7:51   ` Catalin Marinas
  1 sibling, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2018-09-21 12:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, linux-kernel, Daniel Wagner,
	Will Deacon, x86, Linus Torvalds, H. Peter Anvin, Boqun Feng,
	Paul E. McKenney

On Fri, 21 Sep 2018, Peter Zijlstra wrote:
> On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> > We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> > cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4
> > cores).
> > 
> > The problem can be triggered with a v4.9-RT kernel by starting
> 
> > Daniel reported that disabling ticket locks on 4.4 makes the problem go
> > away, but he hasn't run a long time test yet and as we saw with 4.14 it can
> > take quite a while.
> 
> On 4.4 and 4.9 ARM64 still uses ticket locks. So I'm very interested to
> know if the ticket locks on x86 really fix or just make it harder.

We can run 4.4 with ticket locks over the weekend.

Thanks,

	tglx



* Re: [Problem] Cache line starvation
  2018-09-21 12:02 [Problem] Cache line starvation Sebastian Andrzej Siewior
  2018-09-21 12:13 ` Thomas Gleixner
  2018-09-21 12:20 ` Peter Zijlstra
@ 2018-09-26  7:34 ` Peter Zijlstra
  2018-09-26  8:04   ` Thomas Gleixner
  2018-09-26 12:53 ` Will Deacon
  2018-10-02  6:31 ` Daniel Wagner
  4 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2018-09-26  7:34 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Daniel Wagner, Will Deacon, x86, Linus Torvalds,
	H. Peter Anvin, Boqun Feng, Paul E. McKenney

On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> Instrumentation show always the picture:
> 
> CPU0                                         CPU1
> => do_syscall_64                              => do_syscall_64
> => SyS_ptrace                                   => syscall_slow_exit_work
> => ptrace_check_attach                          => ptrace_do_notify / rt_read_unlock 
> => wait_task_inactive                              rt_spin_lock_slowunlock()
>    -> while task_running()                         __rt_mutex_unlock_common()
>   /   check_task_state()                           mark_wakeup_next_waiter()
>  |     raw_spin_lock_irq(&p->pi_lock);             raw_spin_lock(&current->pi_lock);
>  |     .                                               .
>  |     raw_spin_unlock_irq(&p->pi_lock);               .
>   \  cpu_relax()                                       .
>    -                                                   .
>     *IRQ*                                          <lock acquired>
> 
> In the error case we observe that the while() loop is repeated more than
> 5000 times which indicates that the pi_lock can be acquired. CPU1 on the
> other side does not make progress waiting for the same lock with interrupts
> disabled.

I've tried really hard to reproduce this in userspace, but so far have
not had any luck. Looks to be a really tricky thing to make happen.


* Re: [Problem] Cache line starvation
  2018-09-26  7:34 ` Peter Zijlstra
@ 2018-09-26  8:04   ` Thomas Gleixner
  0 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2018-09-26  8:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, linux-kernel, Daniel Wagner,
	Will Deacon, x86, Linus Torvalds, H. Peter Anvin, Boqun Feng,
	Paul E. McKenney

On Wed, 26 Sep 2018, Peter Zijlstra wrote:
> On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> > Instrumentation show always the picture:
> > 
> > CPU0                                         CPU1
> > => do_syscall_64                              => do_syscall_64
> > => SyS_ptrace                                   => syscall_slow_exit_work
> > => ptrace_check_attach                          => ptrace_do_notify / rt_read_unlock 
> > => wait_task_inactive                              rt_spin_lock_slowunlock()
> >    -> while task_running()                         __rt_mutex_unlock_common()
> >   /   check_task_state()                           mark_wakeup_next_waiter()
> >  |     raw_spin_lock_irq(&p->pi_lock);             raw_spin_lock(&current->pi_lock);
> >  |     .                                               .
> >  |     raw_spin_unlock_irq(&p->pi_lock);               .
> >   \  cpu_relax()                                       .
> >    -                                                   .
> >     *IRQ*                                          <lock acquired>
> > 
> > In the error case we observe that the while() loop is repeated more than
> > 5000 times which indicates that the pi_lock can be acquired. CPU1 on the
> > other side does not make progress waiting for the same lock with interrupts
> > disabled.
> 
> I've tried really hard to reproduce this in userspace, but so far have
> not had any luck. Looks to be a real tricky thing to make happen.

It's probably as tricky to write a reproducer as it was to instrument the
thing. I assume it's a combination of code sequences on both CPUs which
involves other (unrelated) lock instructions on the way.

Thanks,

	tglx



* Re: [Problem] Cache line starvation
  2018-09-21 12:02 [Problem] Cache line starvation Sebastian Andrzej Siewior
                   ` (2 preceding siblings ...)
  2018-09-26  7:34 ` Peter Zijlstra
@ 2018-09-26 12:53 ` Will Deacon
  2018-09-27 14:25   ` Kurt Kanzenbach
  2018-10-02  6:31 ` Daniel Wagner
  4 siblings, 1 reply; 21+ messages in thread
From: Will Deacon @ 2018-09-26 12:53 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Daniel Wagner, Peter Zijlstra, x86, Linus Torvalds,
	H. Peter Anvin, Boqun Feng, Paul E. McKenney

Hi all,

On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4
> cores).
> 
> Instrumentation show always the picture:
> 
> CPU0                                         CPU1
> => do_syscall_64                              => do_syscall_64
> => SyS_ptrace                                   => syscall_slow_exit_work
> => ptrace_check_attach                          => ptrace_do_notify / rt_read_unlock 
> => wait_task_inactive                              rt_spin_lock_slowunlock()
>    -> while task_running()                         __rt_mutex_unlock_common()
>   /   check_task_state()                           mark_wakeup_next_waiter()
>  |     raw_spin_lock_irq(&p->pi_lock);             raw_spin_lock(&current->pi_lock);
>  |     .                                               .
>  |     raw_spin_unlock_irq(&p->pi_lock);               .
>   \  cpu_relax()                                       .
>    -                                                   .
>     *IRQ*                                          <lock acquired>
> 
> In the error case we observe that the while() loop is repeated more than
> 5000 times which indicates that the pi_lock can be acquired. CPU1 on the
> other side does not make progress waiting for the same lock with interrupts
> disabled.
> 
> This continues until an IRQ hits CPU0. Once CPU0 starts processing the IRQ
> the other CPU is able to acquire pi_lock and the situation relaxes.
> 
> Peter suggested to do a clwb(&p->pi_lock); before the cpu_relax() in
> wait_task_inactive() which on both the Core2Duo and the SKL gets runtime
> patched to clflush(). That hides it as well.

Given the broadcast nature of cache-flushing, I'd be pretty nervous about
adding it on anything other than a case-by-case basis. That doesn't sound
like something we'd want to maintain... It would also be interesting to know
whether the problem is actually before the cache (i.e. if the lock actually
sits in the store buffer on CPU0). Does MFENCE/DSB after the unlock() help at
all?
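
Something along these lines in the wait_task_inactive() loop would answer
that (debug hack only, not a fix; mb() is MFENCE on x86 and a DSB-based
barrier on arm64, and the surrounding lines are condensed from the trace
above):

	raw_spin_lock_irq(&p->pi_lock);
	/* ... check_task_state() ... */
	raw_spin_unlock_irq(&p->pi_lock);
	mb();		/* drain the store buffer so the released lock word
			 * is globally visible before we spin and retake it */
	cpu_relax();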

We've previously seen something similar to this on arm64 in big/little
systems where the big cores can loop around and re-take a spinlock before
the little guys can get in the queue or take a ticket. I bodged that in
cpu_relax(), but there's a magic heuristic which I couldn't figure out how
to specify:

https://lkml.org/lkml/2017/7/28/172

For A72 (which is the core I think you're using) it would be interesting to
try both:

	(1) Removing the prfm instruction from spin_lock(), and
	(2) Setting bit 42 of CPUACTLR_EL1 on each CPU (probably needs a
	    firmware change)

That should prevent the lock() operation from speculatively pulling in the
cacheline in a unique state.

More recent Arm CPUs have atomic instructions which, apart from CAS,
*should* avoid this starvation issue entirely.

Will


* Re: [Problem] Cache line starvation
  2018-09-26 12:53 ` Will Deacon
@ 2018-09-27 14:25   ` Kurt Kanzenbach
  2018-09-27 14:41     ` Kurt Kanzenbach
  0 siblings, 1 reply; 21+ messages in thread
From: Kurt Kanzenbach @ 2018-09-27 14:25 UTC (permalink / raw)
  To: Will Deacon
  Cc: Sebastian Andrzej Siewior, linux-kernel, Daniel Wagner,
	Peter Zijlstra, x86, Linus Torvalds, H. Peter Anvin, Boqun Feng,
	Paul E. McKenney, Mark Rutland

Hi Will,

On Wed, Sep 26, 2018 at 01:53:02PM +0100, Will Deacon wrote:
> Hi all,
>
> On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> > We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> > cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4
> > cores).
> >
> > Instrumentation show always the picture:
> >
> > CPU0                                         CPU1
> > => do_syscall_64                              => do_syscall_64
> > => SyS_ptrace                                   => syscall_slow_exit_work
> > => ptrace_check_attach                          => ptrace_do_notify / rt_read_unlock
> > => wait_task_inactive                              rt_spin_lock_slowunlock()
> >    -> while task_running()                         __rt_mutex_unlock_common()
> >   /   check_task_state()                           mark_wakeup_next_waiter()
> >  |     raw_spin_lock_irq(&p->pi_lock);             raw_spin_lock(&current->pi_lock);
> >  |     .                                               .
> >  |     raw_spin_unlock_irq(&p->pi_lock);               .
> >   \  cpu_relax()                                       .
> >    -                                                   .
> >     *IRQ*                                          <lock acquired>
> >
> > In the error case we observe that the while() loop is repeated more than
> > 5000 times which indicates that the pi_lock can be acquired. CPU1 on the
> > other side does not make progress waiting for the same lock with interrupts
> > disabled.
> >
> > This continues until an IRQ hits CPU0. Once CPU0 starts processing the IRQ
> > the other CPU is able to acquire pi_lock and the situation relaxes.
> >
> > Peter suggested to do a clwb(&p->pi_lock); before the cpu_relax() in
> > wait_task_inactive() which on both the Core2Duo and the SKL gets runtime
> > patched to clflush(). That hides it as well.
>
> Given the broadcast nature of cache-flushing, I'd be pretty nervous about
> adding it on anything other than a case-by-case basis. That doesn't sound
> like something we'd want to maintain... It would also be interesting to know
> whether the problem is actually before the cache (i.e. if the lock actually
> sits in the store buffer on CPU0). Does MFENCE/DSB after the unlock() help at
> all?
>
> We've previously seen something similar to this on arm64 in big/little
> systems where the big cores can loop around and re-take a spinlock before
> the little guys can get in the queue or take a ticket. I bodged that in
> cpu_relax(), but there's a magic heuristic which I couldn't figure out how
> to specify:
>
> https://lkml.org/lkml/2017/7/28/172
>
> For A72 (which is the core I think you're using) it would be interesting to
> try both:
>
> 	(1) Removing the prfm instruction from spin_lock(), and
> 	(2) Setting bit 42 of CPUACTLR_EL1 on each CPU (probably needs a
> 	    firmware change)

Correct, we use the Cortex-A72.

I followed your suggestions. I've removed the prefetch instructions from
the spin lock implementation in the v4.9 kernel. In addition I've
modified armv8/start.S in U-Boot to set bit 42 in CPUACTLR_EL1
(S3_1_c15_c2_0). We've also made sure that this bit is actually written
on each CPU by reading the register value back in the kernel.

However, the issue still triggers. With stress-ng we're able to generate
latencies in the millisecond range. The only workaround we've found so far
is to add a "delay" in cpu_relax().

Any ideas what we can test further?

Thanks,
Kurt

>
> That should prevent the lock() operation from speculatively pulling in the
> cacheline in a unique state.
>
> More recent Arm CPUs have atomic instructions which, apart from CAS,
> *should* avoid this starvation issue entirely.
>
> Will
>


* Re: [Problem] Cache line starvation
  2018-09-27 14:25   ` Kurt Kanzenbach
@ 2018-09-27 14:41     ` Kurt Kanzenbach
  2018-09-27 14:47       ` Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Kurt Kanzenbach @ 2018-09-27 14:41 UTC (permalink / raw)
  To: Will Deacon
  Cc: Sebastian Andrzej Siewior, linux-kernel, Daniel Wagner,
	Peter Zijlstra, x86, Linus Torvalds, H. Peter Anvin, Boqun Feng,
	Paul E. McKenney, Mark Rutland

Hi Will,

On Thu, Sep 27, 2018 at 04:25:47PM +0200, Kurt Kanzenbach wrote:
> Hi Will,
>
> On Wed, Sep 26, 2018 at 01:53:02PM +0100, Will Deacon wrote:
> > Hi all,
> >
> > On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> > > We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> > > cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4
> > > cores).
> > >
> > > Instrumentation show always the picture:
> > >
> > > CPU0                                         CPU1
> > > => do_syscall_64                              => do_syscall_64
> > > => SyS_ptrace                                   => syscall_slow_exit_work
> > > => ptrace_check_attach                          => ptrace_do_notify / rt_read_unlock
> > > => wait_task_inactive                              rt_spin_lock_slowunlock()
> > >    -> while task_running()                         __rt_mutex_unlock_common()
> > >   /   check_task_state()                           mark_wakeup_next_waiter()
> > >  |     raw_spin_lock_irq(&p->pi_lock);             raw_spin_lock(&current->pi_lock);
> > >  |     .                                               .
> > >  |     raw_spin_unlock_irq(&p->pi_lock);               .
> > >   \  cpu_relax()                                       .
> > >    -                                                   .
> > >     *IRQ*                                          <lock acquired>
> > >
> > > In the error case we observe that the while() loop is repeated more than
> > > 5000 times which indicates that the pi_lock can be acquired. CPU1 on the
> > > other side does not make progress waiting for the same lock with interrupts
> > > disabled.
> > >
> > > This continues until an IRQ hits CPU0. Once CPU0 starts processing the IRQ
> > > the other CPU is able to acquire pi_lock and the situation relaxes.
> > >
> > > Peter suggested to do a clwb(&p->pi_lock); before the cpu_relax() in
> > > wait_task_inactive() which on both the Core2Duo and the SKL gets runtime
> > > patched to clflush(). That hides it as well.
> >
> > Given the broadcast nature of cache-flushing, I'd be pretty nervous about
> > adding it on anything other than a case-by-case basis. That doesn't sound
> > like something we'd want to maintain... It would also be interesting to know
> > whether the problem is actually before the cache (i.e. if the lock actually
> > sits in the store buffer on CPU0). Does MFENCE/DSB after the unlock() help at
> > all?
> >
> > We've previously seen something similar to this on arm64 in big/little
> > systems where the big cores can loop around and re-take a spinlock before
> > the little guys can get in the queue or take a ticket. I bodged that in
> > cpu_relax(), but there's a magic heuristic which I couldn't figure out how
> > to specify:
> >
> > https://lkml.org/lkml/2017/7/28/172
> >
> > For A72 (which is the core I think you're using) it would be interesting to
> > try both:
> >
> > 	(1) Removing the prfm instruction from spin_lock(), and
> > 	(2) Setting bit 42 of CPUACTLR_EL1 on each CPU (probably needs a
> > 	    firmware change)
>
> correct, we use the Cortex A72.
>
> I followed your suggestions. I've removed the prefetch instructions from
> the spin lock implementation in the v4.9 kernel. In addition I've
> modified armv8/start.S in U-Boot to setup bit 42 in CPUACTLR_EL1
> (S3_1_c15_c2_0). We've also made sure, that this bit is actually written
> for each CPU by reading their register value in the kernel.
>
> However, the issue still triggers fine. With stress-ng we're able to
> generate latency in millisecond range. The only workaround we've found
> so far is to add a "delay" in cpu_relax().

It might be interesting for you how we added the delay. We've used:

static inline void cpu_relax(void)
{
	/* volatile forces a real load/store on the stack per iteration */
	volatile int i = 0;

	asm volatile("yield" ::: "memory");
	/* crude busy-wait: gives the other CPU a window to grab the line */
	while (i++ <= 1000);
}

Of course it's not efficient, but it works.

Thanks,
Kurt

>
> Any ideas, what we can test further?
>
> Thanks,
> Kurt
>
> >
> > That should prevent the lock() operation from speculatively pulling in the
> > cacheline in a unique state.
> >
> > More recent Arm CPUs have atomic instructions which, apart from CAS,
> > *should* avoid this starvation issue entirely.
> >
> > Will
> >


* Re: [Problem] Cache line starvation
  2018-09-27 14:41     ` Kurt Kanzenbach
@ 2018-09-27 14:47       ` Thomas Gleixner
  2018-09-28  9:05         ` Kurt Kanzenbach
  2018-09-28 19:26         ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 21+ messages in thread
From: Thomas Gleixner @ 2018-09-27 14:47 UTC (permalink / raw)
  To: Kurt Kanzenbach
  Cc: Will Deacon, Sebastian Andrzej Siewior, linux-kernel,
	Daniel Wagner, Peter Zijlstra, x86, Linus Torvalds,
	H. Peter Anvin, Boqun Feng, Paul E. McKenney, Mark Rutland

On Thu, 27 Sep 2018, Kurt Kanzenbach wrote:
> On Thu, Sep 27, 2018 at 04:25:47PM +0200, Kurt Kanzenbach wrote:
> > However, the issue still triggers fine. With stress-ng we're able to
> > generate latency in millisecond range. The only workaround we've found
> > so far is to add a "delay" in cpu_relax().
> 
> It might interesting for you, how we added the delay. We've used:
> 
> static inline void cpu_relax(void)
> {
> 	volatile int i = 0;
> 
> 	asm volatile("yield" ::: "memory");
> 	while (i++ <= 1000);
> }
> 
> Of course it's not efficient, but it works.

I wonder if it's just the store on the stack which makes it work. I've seen
that when instrumenting x86: when the careful instrumentation just stayed
in registers, it failed. Once it was too much and the stack got involved,
it vanished.

Thanks,

	tglx


* Re: [Problem] Cache line starvation
  2018-09-27 14:47       ` Thomas Gleixner
@ 2018-09-28  9:05         ` Kurt Kanzenbach
  2018-09-28 15:26           ` Kurt Kanzenbach
  2018-09-28 19:26         ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 21+ messages in thread
From: Kurt Kanzenbach @ 2018-09-28  9:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Will Deacon, Sebastian Andrzej Siewior, linux-kernel,
	Daniel Wagner, Peter Zijlstra, x86, Linus Torvalds,
	H. Peter Anvin, Boqun Feng, Paul E. McKenney, Mark Rutland

Hi Thomas,

On Thu, Sep 27, 2018 at 04:47:47PM +0200, Thomas Gleixner wrote:
> On Thu, 27 Sep 2018, Kurt Kanzenbach wrote:
> > On Thu, Sep 27, 2018 at 04:25:47PM +0200, Kurt Kanzenbach wrote:
> > > However, the issue still triggers fine. With stress-ng we're able to
> > > generate latency in millisecond range. The only workaround we've found
> > > so far is to add a "delay" in cpu_relax().
> >
> > It might interesting for you, how we added the delay. We've used:
> >
> > static inline void cpu_relax(void)
> > {
> > 	volatile int i = 0;
> >
> > 	asm volatile("yield" ::: "memory");
> > 	while (i++ <= 1000);
> > }
> >
> > Of course it's not efficient, but it works.
>
> I wonder if it's just the store on the stack which makes it work. I've seen
> that when instrumenting x86. When the careful instrumentation just stayed
> in registers it failed. Once it was too much and stack got involved it
> vanished away.

I've performed more tests: Adding a store to a global variable just
before calling cpu_relax() doesn't help. Furthermore, adding up to 20
yield instructions (just like you did on x86) didn't work either.

Thanks,
Kurt


* Re: [Problem] Cache line starvation
  2018-09-28  9:05         ` Kurt Kanzenbach
@ 2018-09-28 15:26           ` Kurt Kanzenbach
  0 siblings, 0 replies; 21+ messages in thread
From: Kurt Kanzenbach @ 2018-09-28 15:26 UTC (permalink / raw)
  To: Will Deacon
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, linux-kernel,
	Daniel Wagner, Peter Zijlstra, x86, Linus Torvalds,
	H. Peter Anvin, Boqun Feng, Paul E. McKenney, Mark Rutland

On Fri, Sep 28, 2018 at 11:05:21AM +0200, Kurt Kanzenbach wrote:
> Hi Thomas,
>
> On Thu, Sep 27, 2018 at 04:47:47PM +0200, Thomas Gleixner wrote:
> > On Thu, 27 Sep 2018, Kurt Kanzenbach wrote:
> > > On Thu, Sep 27, 2018 at 04:25:47PM +0200, Kurt Kanzenbach wrote:
> > > > However, the issue still triggers fine. With stress-ng we're able to
> > > > generate latency in millisecond range. The only workaround we've found
> > > > so far is to add a "delay" in cpu_relax().
> > >
> > > It might interesting for you, how we added the delay. We've used:
> > >
> > > static inline void cpu_relax(void)
> > > {
> > > 	volatile int i = 0;
> > >
> > > 	asm volatile("yield" ::: "memory");
> > > 	while (i++ <= 1000);
> > > }
> > >
> > > Of course it's not efficient, but it works.
> >
> > I wonder if it's just the store on the stack which makes it work. I've seen
> > that when instrumenting x86. When the careful instrumentation just stayed
> > in registers it failed. Once it was too much and stack got involved it
> > vanished away.
>
> I've performed more tests: Adding a store to a global variable just
> before calling cpu_relax() doesn't help. Furthermore, adding up to 20
> yield instructions (just like you did on x86) didn't work either.

In addition, the stress-ng test triggers on v4.14-rt and v4.18-rt as
well.

As v4.18-rt still uses the old spin lock implementation, I've backported
the qspinlock implementation to v4.18-rt. The commits I've identified
are:

 - 598865c5f32d ("arm64: barrier: Implement smp_cond_load_relaxed")
 - c11090474d70 ("arm64: locking: Replace ticket lock implementation with qspinlock")

With these commits applied it's still possible to trigger the issue, but
it takes longer.

Did I miss anything?

Thanks,
Kurt


* Re: [Problem] Cache line starvation
  2018-09-27 14:47       ` Thomas Gleixner
  2018-09-28  9:05         ` Kurt Kanzenbach
@ 2018-09-28 19:26         ` Sebastian Andrzej Siewior
  2018-09-28 19:34           ` Thomas Gleixner
  1 sibling, 1 reply; 21+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-09-28 19:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kurt Kanzenbach, Will Deacon, linux-kernel, Daniel Wagner,
	Peter Zijlstra, x86, Linus Torvalds, H. Peter Anvin, Boqun Feng,
	Paul E. McKenney, Mark Rutland

On 2018-09-27 16:47:47 [+0200], Thomas Gleixner wrote:
> I wonder if it's just the store on the stack which makes it work. I've seen
> that when instrumenting x86. When the careful instrumentation just stayed
> in registers it failed. Once it was too much and stack got involved it
> vanished away.

Added two load/stores into wait_task_inactive() and it still triggers
(on the Core2Duo).

> Thanks,
> 
> 	tglx

Sebastian


* Re: [Problem] Cache line starvation
  2018-09-28 19:26         ` Sebastian Andrzej Siewior
@ 2018-09-28 19:34           ` Thomas Gleixner
  0 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2018-09-28 19:34 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Kurt Kanzenbach, Will Deacon, linux-kernel, Daniel Wagner,
	Peter Zijlstra, x86, Linus Torvalds, H. Peter Anvin, Boqun Feng,
	Paul E. McKenney, Mark Rutland

On Fri, 28 Sep 2018, Sebastian Andrzej Siewior wrote:

> On 2018-09-27 16:47:47 [+0200], Thomas Gleixner wrote:
> > I wonder if it's just the store on the stack which makes it work. I've seen
> > that when instrumenting x86. When the careful instrumentation just stayed
> > in registers it failed. Once it was too much and stack got involved it
> > vanished away.
> 
> Added two load/stores into wait_task_inactive() and it still triggers
> (on the core2duo).

So it was some interaction with rdtsc() which got stored on the
stack. Heisenbugs are lovely.

Thanks,

	tglx


* Re: [Problem] Cache line starvation
  2018-09-21 12:02 [Problem] Cache line starvation Sebastian Andrzej Siewior
                   ` (3 preceding siblings ...)
  2018-09-26 12:53 ` Will Deacon
@ 2018-10-02  6:31 ` Daniel Wagner
  4 siblings, 0 replies; 21+ messages in thread
From: Daniel Wagner @ 2018-10-02  6:31 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Daniel Wagner, Peter Zijlstra, Will Deacon, x86,
	Linus Torvalds, H. Peter Anvin, Boqun Feng, Paul E. McKenney

On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> This matches Daniel Wagner's observations which he described in [0] on
> v4.4-RT.

Peter Z recommended dropping to ticket spinlocks instead of trying to port
back all the qspinlock changes to v4.4-rt.

With ticket spinlocks, 'stress-ng --ptrace 4' ran for 50 hours without
a problem (before it was seconds) and my normal workload for -rt testing
ran for 60 hours without a problem (before it broke within 24h).

The cyclictest max values went slightly down from 32us to 30us, but
that might just be a coincidence.

Thanks,
Daniel


* Re: [Problem] Cache line starvation
  2018-09-21 12:20 ` Peter Zijlstra
  2018-09-21 12:54   ` Thomas Gleixner
@ 2018-10-03  7:51   ` Catalin Marinas
  2018-10-03  8:07     ` Thomas Gleixner
  2018-10-03  8:23     ` Peter Zijlstra
  1 sibling, 2 replies; 21+ messages in thread
From: Catalin Marinas @ 2018-10-03  7:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: bigeasy, Linux Kernel Mailing List, daniel.wagner, Will Deacon,
	x86, Linus Torvalds, H. Peter Anvin, boqun.feng, Paul McKenney

On Fri, 21 Sep 2018 at 13:22, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> > We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> > cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4
> > cores).
> >
> > The problem can be triggered with a v4.9-RT kernel by starting
>
> > Daniel reported that disabling ticket locks on 4.4 makes the problem go
> > away, but he hasn't run a long time test yet and as we saw with 4.14 it can
> > take quite a while.
>
> On 4.4 and 4.9 ARM64 still uses ticket locks. So I'm very interested to
> know if the ticket locks on x86 really fix or just make it harder.
>
> I've been looking at qspinlock in the light of this and there is indeed
> room for improvement. The ticket lock certainly is much simpler.

FWIW, in the qspinlock TLA+ model [1], if I replace the
atomic_fetch_or() model with a try_cmpxchg loop, it violates the
liveness properties with only 2 CPUs: one CPU keeps locking/unlocking,
hence changing the lock value, while the other repeatedly fails the
cmpxchg. Your latest qspinlock patches seem to address this (I couldn't
get it to fail, but the model is only sequentially consistent). Not
sure that's what Sebastian is seeing, but without your proposed
qspinlock changes, ticket spinlocks may be a better bet for RT.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/qspinlock.tla
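
To make the distinction concrete, the two variants correspond roughly to
the following (a C11-atomics sketch, not the kernel code; 0x100 stands for
the pending bit in qspinlock's word layout and is used here only for
illustration):

#include <stdatomic.h>

#define PENDING	0x100U

/* modelled as atomic_fetch_or(): one unconditional RMW, cannot fail, so
 * the CPU executing it always makes forward progress */
static unsigned int set_pending_fetch_or(atomic_uint *lock)
{
	return atomic_fetch_or(lock, PENDING);
}

/* modelled as a try_cmpxchg loop: the RMW is conditional on the observed
 * value.  If the other CPU keeps locking/unlocking, the value keeps
 * changing, every cmpxchg fails and this CPU can loop forever -- the
 * liveness violation the model finds with just two CPUs */
static unsigned int set_pending_cmpxchg(atomic_uint *lock)
{
	unsigned int old = atomic_load(lock);

	while (!atomic_compare_exchange_weak(lock, &old, old | PENDING))
		;	/* 'old' is refreshed by the failed cmpxchg */
	return old;
}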

-- 
Catalin


* Re: [Problem] Cache line starvation
  2018-10-03  7:51   ` Catalin Marinas
@ 2018-10-03  8:07     ` Thomas Gleixner
  2018-10-03  8:28       ` Peter Zijlstra
  2018-10-03  8:23     ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2018-10-03  8:07 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Peter Zijlstra, bigeasy, Linux Kernel Mailing List,
	daniel.wagner, Will Deacon, x86, Linus Torvalds, H. Peter Anvin,
	boqun.feng, Paul McKenney

On Wed, 3 Oct 2018, Catalin Marinas wrote:

> On Fri, 21 Sep 2018 at 13:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> > > We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> > > cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4
> > > cores).
> > >
> > > The problem can be triggered with a v4.9-RT kernel by starting
> >
> > > Daniel reported that disabling ticket locks on 4.4 makes the problem go
> > > away, but he hasn't run a long time test yet and as we saw with 4.14 it can
> > > take quite a while.
> >
> > On 4.4 and 4.9 ARM64 still uses ticket locks. So I'm very interested to
> > know if the ticket locks on x86 really fix or just make it harder.
> >
> > I've been looking at qspinlock in the light of this and there is indeed
> > room for improvement. The ticket lock certainly is much simpler.
> 
> FWIW, in the qspinlock TLA+ model [1], if I replace the
> atomic_fetch_or() model with a try_cmpxchg loop, it violates the
> liveness properties with only 2 CPUs as one keeps locking/unlocking,
> hence changing the lock value, while the other repeatedly fails the
> cmpxchg. Your latest qspinlock patches seem to address this (couldn't
> get it to fail but the model is only sequentially consistent). Not
> sure that's what Sebastian is seeing but without your proposed
> qspinlock changes, ticket spinlocks may be a better bet for RT.

Except that the ARM64 ticket locks do not prevent the starvation issue.
Neither do the qspinlocks on ARM64 on later kernels.

Thanks,

	tglx


* Re: [Problem] Cache line starvation
  2018-10-03  7:51   ` Catalin Marinas
  2018-10-03  8:07     ` Thomas Gleixner
@ 2018-10-03  8:23     ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2018-10-03  8:23 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: bigeasy, Linux Kernel Mailing List, daniel.wagner, Will Deacon,
	x86, Linus Torvalds, H. Peter Anvin, boqun.feng, Paul McKenney

On Wed, Oct 03, 2018 at 08:51:50AM +0100, Catalin Marinas wrote:
> On Fri, 21 Sep 2018 at 13:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> > > We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> > > cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4
> > > cores).
> > >
> > > The problem can be triggered with a v4.9-RT kernel by starting
> >
> > > Daniel reported that disabling ticket locks on 4.4 makes the problem go
> > > away, but he hasn't run a long time test yet and as we saw with 4.14 it can
> > > take quite a while.
> >
> > On 4.4 and 4.9 ARM64 still uses ticket locks. So I'm very interested to
> > know if the ticket locks on x86 really fix or just make it harder.
> >
> > I've been looking at qspinlock in the light of this and there is indeed
> > room for improvement. The ticket lock certainly is much simpler.
> 
> FWIW, in the qspinlock TLA+ model [1], if I replace the
> atomic_fetch_or() model with a try_cmpxchg loop, it violates the
> liveness properties with only 2 CPUs as one keeps locking/unlocking,
> hence changing the lock value, while the other repeatedly fails the
> cmpxchg. Your latest qspinlock patches seem to address this (couldn't
> get it to fail but the model is only sequentially consistent). Not
> sure that's what Sebastian is seeing but without your proposed
> qspinlock changes, ticket spinlocks may be a better bet for RT.

Right, and agreed. I did raise that point when you initially proposed
that fetch_or() for liveness.



* Re: [Problem] Cache line starvation
  2018-10-03  8:07     ` Thomas Gleixner
@ 2018-10-03  8:28       ` Peter Zijlstra
  2018-10-03 10:43         ` Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2018-10-03  8:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Catalin Marinas, bigeasy, Linux Kernel Mailing List,
	daniel.wagner, Will Deacon, x86, Linus Torvalds, H. Peter Anvin,
	boqun.feng, Paul McKenney

On Wed, Oct 03, 2018 at 10:07:05AM +0200, Thomas Gleixner wrote:
> On Wed, 3 Oct 2018, Catalin Marinas wrote:
> 
> > On Fri, 21 Sep 2018 at 13:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> > > > We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> > > > cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4
> > > > cores).

> Except that the ARM64 ticket locks are not preventing the starvation
> issue. Neither do the qrlocks on ARM64 on later kernels.

That A72 is ARMv8-A afaict, which doesn't have the fancy LSE atomics
(new in ARMv8.1) and thus ends up using LL/SC. And LL/SC has issues
similar to cmpxchg loops.

AFAICT the Cortex-A75 is the first core that has LSE (that's ARMv8.2-A).
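
Concretely, the same C atomic compiles to very different things (sketch;
exact codegen depends on compiler and -march flags):

#include <stdatomic.h>

/* e.g. the "take a ticket" step of a lock acquisition */
unsigned int take_ticket(atomic_uint *next)
{
	/*
	 * With LSE (-march=armv8.1-a and later) this is a single LDADDA:
	 * one atomic read-modify-write that cannot fail and be retried.
	 *
	 * Without LSE (ARMv8.0, e.g. Cortex-A72) it is an LDAXR/ADD/STXR
	 * loop: lose the exclusive monitor because another CPU grabbed the
	 * line and the STXR fails, so the whole sequence restarts -- the
	 * same failure mode as a cmpxchg loop.
	 */
	return atomic_fetch_add_explicit(next, 1, memory_order_acquire);
}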


* Re: [Problem] Cache line starvation
  2018-10-03  8:28       ` Peter Zijlstra
@ 2018-10-03 10:43         ` Thomas Gleixner
  0 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2018-10-03 10:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Catalin Marinas, bigeasy, Linux Kernel Mailing List,
	daniel.wagner, Will Deacon, x86, Linus Torvalds, H. Peter Anvin,
	boqun.feng, Paul McKenney

On Wed, 3 Oct 2018, Peter Zijlstra wrote:
> On Wed, Oct 03, 2018 at 10:07:05AM +0200, Thomas Gleixner wrote:
> > On Wed, 3 Oct 2018, Catalin Marinas wrote:
> > 
> > > On Fri, 21 Sep 2018 at 13:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > > > On Fri, Sep 21, 2018 at 02:02:26PM +0200, Sebastian Andrzej Siewior wrote:
> > > > > We reproducibly observe cache line starvation on a Core2Duo E6850 (2
> > > > > cores), a i5-6400 SKL (4 cores) and on a NXP LS2044A ARM Cortex-A72 (4
> > > > > cores).
> 
> > Except that the ARM64 ticket locks are not preventing the starvation
> > issue. Neither do the qrlocks on ARM64 on later kernels.
> 
> That A72 is ARMv8-A afaict, that doesn't have the fancy LSE bits (which
> are new in ARMv8.1) on and thus ends up using LL/SC. And LL/SC has
> similar issues as cmpxchg loops.
> 
> AFAICT Cortex-A75 is the first that has LSE on (that's an ARMv8.2-A).

I know. That doesn't help though. We need a solution for ARMv8-A, and the
magic delay loop we have right now to work around it is more than
disturbing.

Thanks,

	tglx

