linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
@ 2015-12-10 19:43 David Daney
       [not found] ` <SN1PR07MB21577C72379C8440A208D6BC9EEA0@SN1PR07MB2157.namprd07.prod.outlook.com>
  2015-12-11  9:59 ` Will Deacon
  0 siblings, 2 replies; 24+ messages in thread
From: David Daney @ 2015-12-10 19:43 UTC (permalink / raw)
  To: Will Deacon, Davidlohr Bueso, Peter Zijlstra (Intel),
	Thomas Gleixner, Paul E. McKenney, Ingo Molnar
  Cc: Linux Kernel Mailing List, linux-arm-kernel, Pinski, Andrew

Hi,

We are getting soft lockup OOPs on Cavium CN88XX (A.K.A. ThunderX), 
which is an arm64 implementation.

A typical failure shows multiple threads stuck in mutex operations like 
this:

.
.
.
[   68.909873] Task dump for CPU 18:
[   68.909876] systemd-udevd   R  running task        0   537    534 
0x00000002
[   68.909877] Call trace:
[   68.909880] [<fffffe0000088858>] dump_backtrace+0x0/0x17c
[   68.909883] [<fffffe00000889f8>] show_stack+0x24/0x2c
[   68.909885] [<fffffe00000c4210>] sched_show_task+0xb0/0x104
[   68.909888] [<fffffe00000c682c>] dump_cpu_task+0x48/0x54
[   68.909890] [<fffffe00000ee5e0>] rcu_dump_cpu_stacks+0x9c/0xec
[   68.909893] [<fffffe00000f2c9c>] rcu_check_callbacks+0x524/0xa18
[   68.909896] [<fffffe00000f83a0>] update_process_times+0x44/0x74
[   68.909899] [<fffffe00001078d4>] tick_sched_timer+0x78/0x1ac
[   68.909901] [<fffffe00000f8b74>] __hrtimer_run_queues+0x148/0x2d4
[   68.909903] [<fffffe00000f9464>] hrtimer_interrupt+0xb0/0x1f4
[   68.909906] [<fffffe000056e6e8>] arch_timer_handler_phys+0x3c/0x48
[   68.909909] [<fffffe00000e7fd4>] handle_percpu_devid_irq+0xb0/0x1b0
[   68.909912] [<fffffe00000e33c4>] generic_handle_irq+0x34/0x4c
[   68.909914] [<fffffe00000e3738>] __handle_domain_irq+0x90/0xfc
[   68.909916] [<fffffe0000081d80>] gic_handle_irq+0x90/0x18c
[   68.909918] Exception stack(0xfffffe03f14e3920 to 0xfffffe03f14e3a40)
[   68.909921] 3920: fffffe03fd5c5800 fffffe0000c55800 fffffe03f14e3a80 
fffffe00000dabd8
[   68.909924] 3940: 00000000a0000145 0000000000000015 fffffe03e9602400 
fffffe00002fddb0
[   68.909927] 3960: 0000000000000000 0000000000000000 fffffe03fd5c5810 
fffffe03f14e0000
[   68.909929] 3980: 0000000000000001 ffffffffff000000 fffffe03db307e38 
0000000000000000
[   68.909932] 39a0: 0000000000737973 00000000ffffffff 0000000000000000 
000000003b364d50
[   68.909935] 39c0: 0000000000000018 ffffffffa99641af 0016fd71b6000000 
003b9aca00000000
[   68.909937] 39e0: fffffe00001f1508 000003ff9b9fd028 000003ffed7a0a10 
fffffe03fd5c5800
[   68.909940] 3a00: fffffe0000c55800 fffffe0000cea1c8 fffffe03fd5a5800 
fffffe0000ca2eb0
[   68.909943] 3a20: 0000000000000015 fffffe03e9602400 fffffe0000cea1c8 
fffffe0000712000
[   68.909945] [<fffffe0000084ce8>] el1_irq+0x68/0xd8
[   68.909948] [<fffffe00000da03c>] mutex_optimistic_spin+0x9c/0x1d0
[   68.909951] [<fffffe00006fe4b8>] __mutex_lock_slowpath+0x44/0x158
[   68.909953] [<fffffe00006fe620>] mutex_lock+0x54/0x58
[   68.909956] [<fffffe0000265efc>] kernfs_iop_permission+0x38/0x70
[   68.909959] [<fffffe00001fbf50>] __inode_permission+0x88/0xd8
[   68.909961] [<fffffe00001fbfd0>] inode_permission+0x30/0x6c
[   68.909964] [<fffffe00001fe26c>] link_path_walk+0x68/0x4d4
[   68.909966] [<fffffe00001ffa14>] path_openat+0xb4/0x2bc
[   68.909968] [<fffffe000020123c>] do_filp_open+0x74/0xd0
[   68.909971] [<fffffe00001f13e4>] do_sys_open+0x14c/0x228
[   68.909973] [<fffffe00001f1544>] SyS_openat+0x3c/0x48
[   68.909976] [<fffffe00000851f0>] el0_svc_naked+0x24/0x28
.
.
.

Reverting 81a43adae3b9 (locking/mutex: Use acquire/release semantics) 
Makes the problem go away.

At this point it is unknown if this patch is incorrect, or if the 
underlying ARM64 atomic_*_{acquire,release} primitives are defective, or 
if the problem lies elsewhere.

I am not requesting any specific action with this e-mail, but wanted to 
draw attention to the issue.  Undoubtedly we will be able to provide 
more detailed information about the issue in the coming days.

Thanks,
David Daney


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
       [not found] ` <SN1PR07MB21577C72379C8440A208D6BC9EEA0@SN1PR07MB2157.namprd07.prod.outlook.com>
@ 2015-12-11  3:29   ` Andrew Pinski
  2015-12-11  4:51     ` Andrew Pinski
  2015-12-11  7:33     ` Peter Zijlstra
  0 siblings, 2 replies; 24+ messages in thread
From: Andrew Pinski @ 2015-12-11  3:29 UTC (permalink / raw)
  To: Will Deacon, Davidlohr Bueso, Peter Zijlstra (Intel),
	Thomas Gleixner, Paul E. McKenney, Ingo Molnar,
	Linux Kernel Mailing List, linux-arm-kernel, Andrew

On Thu, Dec 10, 2015 at 11:44 AM, David Danny wrote:
>
> Hi,
>
> We are getting soft lockup OOPs on Cavium CN88XX (A.K.A. ThunderX), which is an arm64 implementation.

I get a slightly different OOPs and reverting
c55a6ffa6285e29f874ed403979472631ec70bff I was able to boot.
What I saw with osq_lock.c was that osq_wait_next is called for both
lock and unlock case so it might need both barriers.
The other question comes does atomic_cmpxchg_release have release
semantics when the compare fails?  Right now it does not.

Thanks,
Andrew


>
> A typical failure shows multiple threads stuck in mutex operations like
> this:
>
> .
> .
> .
> [   68.909873] Task dump for CPU 18:
> [   68.909876] systemd-udevd   R  running task        0   537    534
> 0x00000002
> [   68.909877] Call trace:
> [   68.909880] [<fffffe0000088858>] dump_backtrace+0x0/0x17c
> [   68.909883] [<fffffe00000889f8>] show_stack+0x24/0x2c
> [   68.909885] [<fffffe00000c4210>] sched_show_task+0xb0/0x104
> [   68.909888] [<fffffe00000c682c>] dump_cpu_task+0x48/0x54
> [   68.909890] [<fffffe00000ee5e0>] rcu_dump_cpu_stacks+0x9c/0xec
> [   68.909893] [<fffffe00000f2c9c>] rcu_check_callbacks+0x524/0xa18
> [   68.909896] [<fffffe00000f83a0>] update_process_times+0x44/0x74
> [   68.909899] [<fffffe00001078d4>] tick_sched_timer+0x78/0x1ac
> [   68.909901] [<fffffe00000f8b74>] __hrtimer_run_queues+0x148/0x2d4
> [   68.909903] [<fffffe00000f9464>] hrtimer_interrupt+0xb0/0x1f4
> [   68.909906] [<fffffe000056e6e8>] arch_timer_handler_phys+0x3c/0x48
> [   68.909909] [<fffffe00000e7fd4>] handle_percpu_devid_irq+0xb0/0x1b0
> [   68.909912] [<fffffe00000e33c4>] generic_handle_irq+0x34/0x4c
> [   68.909914] [<fffffe00000e3738>] __handle_domain_irq+0x90/0xfc
> [   68.909916] [<fffffe0000081d80>] gic_handle_irq+0x90/0x18c
> [   68.909918] Exception stack(0xfffffe03f14e3920 to 0xfffffe03f14e3a40)
> [   68.909921] 3920: fffffe03fd5c5800 fffffe0000c55800 fffffe03f14e3a80
> fffffe00000dabd8
> [   68.909924] 3940: 00000000a0000145 0000000000000015 fffffe03e9602400
> fffffe00002fddb0
> [   68.909927] 3960: 0000000000000000 0000000000000000 fffffe03fd5c5810
> fffffe03f14e0000
> [   68.909929] 3980: 0000000000000001 ffffffffff000000 fffffe03db307e38
> 0000000000000000
> [   68.909932] 39a0: 0000000000737973 00000000ffffffff 0000000000000000
> 000000003b364d50
> [   68.909935] 39c0: 0000000000000018 ffffffffa99641af 0016fd71b6000000
> 003b9aca00000000
> [   68.909937] 39e0: fffffe00001f1508 000003ff9b9fd028 000003ffed7a0a10
> fffffe03fd5c5800
> [   68.909940] 3a00: fffffe0000c55800 fffffe0000cea1c8 fffffe03fd5a5800
> fffffe0000ca2eb0
> [   68.909943] 3a20: 0000000000000015 fffffe03e9602400 fffffe0000cea1c8
> fffffe0000712000
> [   68.909945] [<fffffe0000084ce8>] el1_irq+0x68/0xd8
> [   68.909948] [<fffffe00000da03c>] mutex_optimistic_spin+0x9c/0x1d0
> [   68.909951] [<fffffe00006fe4b8>] __mutex_lock_slowpath+0x44/0x158
> [   68.909953] [<fffffe00006fe620>] mutex_lock+0x54/0x58
> [   68.909956] [<fffffe0000265efc>] kernfs_iop_permission+0x38/0x70
> [   68.909959] [<fffffe00001fbf50>] __inode_permission+0x88/0xd8
> [   68.909961] [<fffffe00001fbfd0>] inode_permission+0x30/0x6c
> [   68.909964] [<fffffe00001fe26c>] link_path_walk+0x68/0x4d4
> [   68.909966] [<fffffe00001ffa14>] path_openat+0xb4/0x2bc
> [   68.909968] [<fffffe000020123c>] do_filp_open+0x74/0xd0
> [   68.909971] [<fffffe00001f13e4>] do_sys_open+0x14c/0x228
> [   68.909973] [<fffffe00001f1544>] SyS_openat+0x3c/0x48
> [   68.909976] [<fffffe00000851f0>] el0_svc_naked+0x24/0x28
> .
> .
> .
>
> Reverting 81a43adae3b9 (locking/mutex: Use acquire/release semantics) Makes the problem go away.
>
> At this point it is unknown if this patch is incorrect, or if the underlying ARM64 atomic_*_{acquire,release} primitives are defective, or if the problem lies elsewhere.
>
> I am not requesting any specific action with this e-mail, but wanted to draw attention to the issue.  Undoubtedly we will be able to provide more detailed information about the issue in the coming days.
>
> Thanks,
> David Daney
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11  3:29   ` FW: " Andrew Pinski
@ 2015-12-11  4:51     ` Andrew Pinski
  2015-12-11  8:41       ` Peter Zijlstra
  2015-12-11  7:33     ` Peter Zijlstra
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Pinski @ 2015-12-11  4:51 UTC (permalink / raw)
  To: Will Deacon, Davidlohr Bueso, Peter Zijlstra (Intel),
	Thomas Gleixner, Paul E. McKenney, Ingo Molnar,
	Linux Kernel Mailing List, linux-arm-kernel, Andrew

On Thu, Dec 10, 2015 at 7:29 PM, Andrew Pinski <pinskia@gmail.com> wrote:
> On Thu, Dec 10, 2015 at 11:44 AM, David Danny wrote:
>>
>> Hi,
>>
>> We are getting soft lockup OOPs on Cavium CN88XX (A.K.A. ThunderX), which is an arm64 implementation.
>
> I get a slightly different OOPs and reverting
> c55a6ffa6285e29f874ed403979472631ec70bff I was able to boot.
> What I saw with osq_lock.c was that osq_wait_next is called for both
> lock and unlock case so it might need both barriers.
> The other question comes does atomic_cmpxchg_release have release
> semantics when the compare fails?  Right now it does not.

So looking further I think I understand what is going wrong and why
c55a6ffa6285e29f874ed403979472631ec70bff is incorrect.

The compare and swap inside osq_lock needs to be both release and
acquire semantics memory barriers because the stores (to node) need to
be visible to the other cores before the setting of lock->tail
happens.
Because if node->next is is up to date, we might end up in
osq_wait_next and waiting in an infinite loop while waiting on
ourselves.

I think we should revert c55a6ffa6285e29f874ed403979472631ec70bff
fully as mentioned for the reasons above.

Thanks,
Andrew Pinski


>
> Thanks,
> Andrew
>
>
>>
>> A typical failure shows multiple threads stuck in mutex operations like
>> this:
>>
>> .
>> .
>> .
>> [   68.909873] Task dump for CPU 18:
>> [   68.909876] systemd-udevd   R  running task        0   537    534
>> 0x00000002
>> [   68.909877] Call trace:
>> [   68.909880] [<fffffe0000088858>] dump_backtrace+0x0/0x17c
>> [   68.909883] [<fffffe00000889f8>] show_stack+0x24/0x2c
>> [   68.909885] [<fffffe00000c4210>] sched_show_task+0xb0/0x104
>> [   68.909888] [<fffffe00000c682c>] dump_cpu_task+0x48/0x54
>> [   68.909890] [<fffffe00000ee5e0>] rcu_dump_cpu_stacks+0x9c/0xec
>> [   68.909893] [<fffffe00000f2c9c>] rcu_check_callbacks+0x524/0xa18
>> [   68.909896] [<fffffe00000f83a0>] update_process_times+0x44/0x74
>> [   68.909899] [<fffffe00001078d4>] tick_sched_timer+0x78/0x1ac
>> [   68.909901] [<fffffe00000f8b74>] __hrtimer_run_queues+0x148/0x2d4
>> [   68.909903] [<fffffe00000f9464>] hrtimer_interrupt+0xb0/0x1f4
>> [   68.909906] [<fffffe000056e6e8>] arch_timer_handler_phys+0x3c/0x48
>> [   68.909909] [<fffffe00000e7fd4>] handle_percpu_devid_irq+0xb0/0x1b0
>> [   68.909912] [<fffffe00000e33c4>] generic_handle_irq+0x34/0x4c
>> [   68.909914] [<fffffe00000e3738>] __handle_domain_irq+0x90/0xfc
>> [   68.909916] [<fffffe0000081d80>] gic_handle_irq+0x90/0x18c
>> [   68.909918] Exception stack(0xfffffe03f14e3920 to 0xfffffe03f14e3a40)
>> [   68.909921] 3920: fffffe03fd5c5800 fffffe0000c55800 fffffe03f14e3a80
>> fffffe00000dabd8
>> [   68.909924] 3940: 00000000a0000145 0000000000000015 fffffe03e9602400
>> fffffe00002fddb0
>> [   68.909927] 3960: 0000000000000000 0000000000000000 fffffe03fd5c5810
>> fffffe03f14e0000
>> [   68.909929] 3980: 0000000000000001 ffffffffff000000 fffffe03db307e38
>> 0000000000000000
>> [   68.909932] 39a0: 0000000000737973 00000000ffffffff 0000000000000000
>> 000000003b364d50
>> [   68.909935] 39c0: 0000000000000018 ffffffffa99641af 0016fd71b6000000
>> 003b9aca00000000
>> [   68.909937] 39e0: fffffe00001f1508 000003ff9b9fd028 000003ffed7a0a10
>> fffffe03fd5c5800
>> [   68.909940] 3a00: fffffe0000c55800 fffffe0000cea1c8 fffffe03fd5a5800
>> fffffe0000ca2eb0
>> [   68.909943] 3a20: 0000000000000015 fffffe03e9602400 fffffe0000cea1c8
>> fffffe0000712000
>> [   68.909945] [<fffffe0000084ce8>] el1_irq+0x68/0xd8
>> [   68.909948] [<fffffe00000da03c>] mutex_optimistic_spin+0x9c/0x1d0
>> [   68.909951] [<fffffe00006fe4b8>] __mutex_lock_slowpath+0x44/0x158
>> [   68.909953] [<fffffe00006fe620>] mutex_lock+0x54/0x58
>> [   68.909956] [<fffffe0000265efc>] kernfs_iop_permission+0x38/0x70
>> [   68.909959] [<fffffe00001fbf50>] __inode_permission+0x88/0xd8
>> [   68.909961] [<fffffe00001fbfd0>] inode_permission+0x30/0x6c
>> [   68.909964] [<fffffe00001fe26c>] link_path_walk+0x68/0x4d4
>> [   68.909966] [<fffffe00001ffa14>] path_openat+0xb4/0x2bc
>> [   68.909968] [<fffffe000020123c>] do_filp_open+0x74/0xd0
>> [   68.909971] [<fffffe00001f13e4>] do_sys_open+0x14c/0x228
>> [   68.909973] [<fffffe00001f1544>] SyS_openat+0x3c/0x48
>> [   68.909976] [<fffffe00000851f0>] el0_svc_naked+0x24/0x28
>> .
>> .
>> .
>>
>> Reverting 81a43adae3b9 (locking/mutex: Use acquire/release semantics) Makes the problem go away.
>>
>> At this point it is unknown if this patch is incorrect, or if the underlying ARM64 atomic_*_{acquire,release} primitives are defective, or if the problem lies elsewhere.
>>
>> I am not requesting any specific action with this e-mail, but wanted to draw attention to the issue.  Undoubtedly we will be able to provide more detailed information about the issue in the coming days.
>>
>> Thanks,
>> David Daney
>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11  3:29   ` FW: " Andrew Pinski
  2015-12-11  4:51     ` Andrew Pinski
@ 2015-12-11  7:33     ` Peter Zijlstra
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2015-12-11  7:33 UTC (permalink / raw)
  To: Andrew Pinski
  Cc: Will Deacon, Davidlohr Bueso, Thomas Gleixner, Paul E. McKenney,
	Ingo Molnar, Linux Kernel Mailing List, linux-arm-kernel, Andrew

On Thu, Dec 10, 2015 at 07:29:34PM -0800, Andrew Pinski wrote:
> On Thu, Dec 10, 2015 at 11:44 AM, David Danny wrote:
> >
> > Hi,
> >
> > We are getting soft lockup OOPs on Cavium CN88XX (A.K.A. ThunderX), which is an arm64 implementation.
> 
> I get a slightly different OOPs and reverting
> c55a6ffa6285e29f874ed403979472631ec70bff I was able to boot.
> What I saw with osq_lock.c was that osq_wait_next is called for both
> lock and unlock case so it might need both barriers.
> The other question comes does atomic_cmpxchg_release have release
> semantics when the compare fails?  Right now it does not.
> 

Out cmpxchg primites imply no barrier on failure, this is documented
somewhere.. /me searches..

---

commit ed2de9f74ecbbf3063d29b2334e7b455d7f35189
Author: Will Deacon <will.deacon@arm.com>
Date:   Thu Jul 16 16:10:06 2015 +0100

    locking/Documentation: Clarify failed cmpxchg() memory ordering semantics
    
    A failed cmpxchg does not provide any memory ordering guarantees, a
    property that is used to optimise the cmpxchg implementations on Alpha,
    PowerPC and arm64.
    
    This patch updates atomic_ops.txt and memory-barriers.txt to reflect
    this.
    
    Signed-off-by: Will Deacon <will.deacon@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Douglas Hatch <doug.hatch@hp.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Scott J Norton <scott.norton@hp.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Waiman Long <waiman.long@hp.com>
    Link: http://lkml.kernel.org/r/20150716151006.GH26390@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt
index dab6da3382d9..b19fc34efdb1 100644
--- a/Documentation/atomic_ops.txt
+++ b/Documentation/atomic_ops.txt
@@ -266,7 +266,9 @@ with the given old and new values. Like all atomic_xxx operations,
 atomic_cmpxchg will only satisfy its atomicity semantics as long as all
 other accesses of *v are performed through atomic_xxx operations.
 
-atomic_cmpxchg must provide explicit memory barriers around the operation.
+atomic_cmpxchg must provide explicit memory barriers around the operation,
+although if the comparison fails then no memory ordering guarantees are
+required.
 
 The semantics for atomic_cmpxchg are the same as those defined for 'cas'
 below.
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 13feb697271f..18fc860df1be 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -2383,9 +2383,7 @@ about the state (old or new) implies an SMP-conditional general memory barrier
 explicit lock operations, described later).  These include:
 
 	xchg();
-	cmpxchg();
 	atomic_xchg();			atomic_long_xchg();
-	atomic_cmpxchg();		atomic_long_cmpxchg();
 	atomic_inc_return();		atomic_long_inc_return();
 	atomic_dec_return();		atomic_long_dec_return();
 	atomic_add_return();		atomic_long_add_return();
@@ -2398,7 +2396,9 @@ about the state (old or new) implies an SMP-conditional general memory barrier
 	test_and_clear_bit();
 	test_and_change_bit();
 
-	/* when succeeds (returns 1) */
+	/* when succeeds */
+	cmpxchg();
+	atomic_cmpxchg();		atomic_long_cmpxchg();
 	atomic_add_unless();		atomic_long_add_unless();
 
 These are used for such things as implementing ACQUIRE-class and RELEASE-class

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11  4:51     ` Andrew Pinski
@ 2015-12-11  8:41       ` Peter Zijlstra
  2015-12-11 12:04         ` Will Deacon
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2015-12-11  8:41 UTC (permalink / raw)
  To: Andrew Pinski
  Cc: Will Deacon, Davidlohr Bueso, Thomas Gleixner, Paul E. McKenney,
	Ingo Molnar, Linux Kernel Mailing List, linux-arm-kernel

On Thu, Dec 10, 2015 at 08:51:34PM -0800, Andrew Pinski wrote:

> So looking further I think I understand what is going wrong and why
> c55a6ffa6285e29f874ed403979472631ec70bff is incorrect.

The osq_wait_next() call in osq_lock() is when we fail the lock. This is
effectively trylock() semantics and like for cmpxchg a failed trylock
has no implied barrier semantics.  So from that POV osq_wait_next() does
not need to provide ACQUIRE semantics.

In osq_unlock() there's an xchg() in front, which implies full barriers
and thereby provides RELEASE semantics for that part of osq_unlock(), so
again, from this POV osq_wait_next() does not need to provide RELEASE
semantics.

> The compare and swap inside osq_lock needs to be both release and
> acquire semantics memory barriers because the stores (to node) need to
> be visible to the other cores before the setting of lock->tail
> happens.

I'm a wee bit confused on what exactly you mean. Both stores to @node:

 1) osq_wait_next(): next = xchg(&node->next, NULL)
 2) osq_unlock():    next = xchg(&node->next, NULL)

are xchg() calls which imply full ordering (sequential consistency).

Similarly the store before osq_wait_next() in osq_lock(), namely:

  cmpxchg(&prev->node, node, NULL)

is fully ordered.

So I cannot see any store being delayed past the
atomic_cmpxchg_acquire().

Now you mention 'compare and swap inside osq_lock' which I take to be
the latter; and it _is_ fully ordered.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-10 19:43 Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX) David Daney
       [not found] ` <SN1PR07MB21577C72379C8440A208D6BC9EEA0@SN1PR07MB2157.namprd07.prod.outlook.com>
@ 2015-12-11  9:59 ` Will Deacon
  1 sibling, 0 replies; 24+ messages in thread
From: Will Deacon @ 2015-12-11  9:59 UTC (permalink / raw)
  To: David Daney
  Cc: Davidlohr Bueso, Peter Zijlstra (Intel),
	Thomas Gleixner, Paul E. McKenney, Ingo Molnar,
	Linux Kernel Mailing List, linux-arm-kernel, Pinski, Andrew

On Thu, Dec 10, 2015 at 11:43:46AM -0800, David Daney wrote:
> We are getting soft lockup OOPs on Cavium CN88XX (A.K.A. ThunderX), which is
> an arm64 implementation.

[...]

> At this point it is unknown if this patch is incorrect, or if the underlying
> ARM64 atomic_*_{acquire,release} primitives are defective, or if the problem
> lies elsewhere.

Are you using the ll/sc or lse versions of the atomics? In the case of
the former, are they inline or out-of-line (this depends on whether or
not you've selected CONFIG_ARM64_LSE_ATOMICS and whether or not you have
toolchain support)?

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11  8:41       ` Peter Zijlstra
@ 2015-12-11 12:04         ` Will Deacon
  2015-12-11 12:13           ` Peter Zijlstra
                             ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Will Deacon @ 2015-12-11 12:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Pinski, Davidlohr Bueso, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar, Linux Kernel Mailing List,
	linux-arm-kernel

Hi all,

On Fri, Dec 11, 2015 at 09:41:33AM +0100, Peter Zijlstra wrote:
> On Thu, Dec 10, 2015 at 08:51:34PM -0800, Andrew Pinski wrote:
> 
> > So looking further I think I understand what is going wrong and why
> > c55a6ffa6285e29f874ed403979472631ec70bff is incorrect.
> 
> The osq_wait_next() call in osq_lock() is when we fail the lock. This is
> effectively trylock() semantics and like for cmpxchg a failed trylock
> has no implied barrier semantics.  So from that POV osq_wait_next() does
> not need to provide ACQUIRE semantics.
> 
> In osq_unlock() there's an xchg() in front, which implies full barriers
> and thereby provides RELEASE semantics for that part of osq_unlock(), so
> again, from this POV osq_wait_next() does not need to provide RELEASE
> semantics.
> 
> > The compare and swap inside osq_lock needs to be both release and
> > acquire semantics memory barriers because the stores (to node) need to
> > be visible to the other cores before the setting of lock->tail
> > happens.
> 
> I'm a wee bit confused on what exactly you mean. Both stores to @node:
> 
>  1) osq_wait_next(): next = xchg(&node->next, NULL)
>  2) osq_unlock():    next = xchg(&node->next, NULL)
> 
> are xchg() calls which imply full ordering (sequential consistency).

I think Andrew meant the atomic_xchg_acquire at the start of osq_lock,
as opposed to "compare and swap". In which case, it does look like
there's a bug here because there is nothing to order the initialisation
of the node fields with publishing of the node, whether that's
indirectly as a result of setting the tail to the current CPU or
directly as a result of the WRITE_ONCE.

Andrew, David: does making that atomic_xchg_acquire and atomic_xchg
fix things for you?

I don't fully grok what 81a43adae3b9 has to do with any of this, so
maybe there's another bug too.

Will

--->8

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d092a0c9c2d4..05a37857ab55 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -93,10 +93,12 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	node->cpu = curr;
 
 	/*
-	 * ACQUIRE semantics, pairs with corresponding RELEASE
-	 * in unlock() uncontended, or fastpath.
+	 * We need both ACQUIRE (pairs with corresponding RELEASE in
+	 * unlock() uncontended, or fastpath) and RELEASE (to publish
+	 * the node fields we just initialised) semantics when updating
+	 * the lock tail.
 	 */
-	old = atomic_xchg_acquire(&lock->tail, curr);
+	old = atomic_xchg(&lock->tail, curr);
 	if (old == OSQ_UNLOCKED_VAL)
 		return true;
 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11 12:04         ` Will Deacon
@ 2015-12-11 12:13           ` Peter Zijlstra
  2015-12-11 12:18             ` Will Deacon
  2015-12-11 14:17           ` Davidlohr Bueso
  2015-12-17 21:52           ` Jeremy Linton
  2 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2015-12-11 12:13 UTC (permalink / raw)
  To: Will Deacon
  Cc: Andrew Pinski, Davidlohr Bueso, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar, Linux Kernel Mailing List,
	linux-arm-kernel

On Fri, Dec 11, 2015 at 12:04:19PM +0000, Will Deacon wrote:
> I think Andrew meant the atomic_xchg_acquire at the start of osq_lock,
> as opposed to "compare and swap". In which case, it does look like
> there's a bug here because there is nothing to order the initialisation
> of the node fields with publishing of the node, whether that's
> indirectly as a result of setting the tail to the current CPU or
> directly as a result of the WRITE_ONCE.

Agreed, this does indeed look like a bug. If confirmed please write a
shiny changelog and I'll queue asap.

> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d092a0c9c2d4..05a37857ab55 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -93,10 +93,12 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	node->cpu = curr;
>  
>  	/*
> -	 * ACQUIRE semantics, pairs with corresponding RELEASE
> -	 * in unlock() uncontended, or fastpath.
> +	 * We need both ACQUIRE (pairs with corresponding RELEASE in
> +	 * unlock() uncontended, or fastpath) and RELEASE (to publish
> +	 * the node fields we just initialised) semantics when updating
> +	 * the lock tail.
>  	 */
> -	old = atomic_xchg_acquire(&lock->tail, curr);
> +	old = atomic_xchg(&lock->tail, curr);
>  	if (old == OSQ_UNLOCKED_VAL)
>  		return true;
>  

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11 12:13           ` Peter Zijlstra
@ 2015-12-11 12:18             ` Will Deacon
  2015-12-11 12:26               ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2015-12-11 12:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Pinski, Davidlohr Bueso, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar, Linux Kernel Mailing List,
	linux-arm-kernel, david.daney

On Fri, Dec 11, 2015 at 01:13:19PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 11, 2015 at 12:04:19PM +0000, Will Deacon wrote:
> > I think Andrew meant the atomic_xchg_acquire at the start of osq_lock,
> > as opposed to "compare and swap". In which case, it does look like
> > there's a bug here because there is nothing to order the initialisation
> > of the node fields with publishing of the node, whether that's
> > indirectly as a result of setting the tail to the current CPU or
> > directly as a result of the WRITE_ONCE.
> 
> Agreed, this does indeed look like a bug. If confirmed please write a
> shiny changelog and I'll queue asap.

Yup. I've failed to reproduce the issue locally, so we'll need to wait
for Andrew and/or David to get back to us first.

Will

> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index d092a0c9c2d4..05a37857ab55 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -93,10 +93,12 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> >  	node->cpu = curr;
> >  
> >  	/*
> > -	 * ACQUIRE semantics, pairs with corresponding RELEASE
> > -	 * in unlock() uncontended, or fastpath.
> > +	 * We need both ACQUIRE (pairs with corresponding RELEASE in
> > +	 * unlock() uncontended, or fastpath) and RELEASE (to publish
> > +	 * the node fields we just initialised) semantics when updating
> > +	 * the lock tail.
> >  	 */
> > -	old = atomic_xchg_acquire(&lock->tail, curr);
> > +	old = atomic_xchg(&lock->tail, curr);
> >  	if (old == OSQ_UNLOCKED_VAL)
> >  		return true;
> >  
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11 12:18             ` Will Deacon
@ 2015-12-11 12:26               ` Peter Zijlstra
  2015-12-11 13:33                 ` Will Deacon
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2015-12-11 12:26 UTC (permalink / raw)
  To: Will Deacon
  Cc: Andrew Pinski, Davidlohr Bueso, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar, Linux Kernel Mailing List,
	linux-arm-kernel, david.daney

On Fri, Dec 11, 2015 at 12:18:00PM +0000, Will Deacon wrote:
> On Fri, Dec 11, 2015 at 01:13:19PM +0100, Peter Zijlstra wrote:
> > On Fri, Dec 11, 2015 at 12:04:19PM +0000, Will Deacon wrote:
> > > I think Andrew meant the atomic_xchg_acquire at the start of osq_lock,
> > > as opposed to "compare and swap". In which case, it does look like
> > > there's a bug here because there is nothing to order the initialisation
> > > of the node fields with publishing of the node, whether that's
> > > indirectly as a result of setting the tail to the current CPU or
> > > directly as a result of the WRITE_ONCE.
> > 
> > Agreed, this does indeed look like a bug. If confirmed please write a
> > shiny changelog and I'll queue asap.
> 
> Yup. I've failed to reproduce the issue locally, so we'll need to wait
> for Andrew and/or David to get back to us first.

While we're there, the acquire in osq_wait_next() seems somewhat ill
documented too.

I _think_ we need ACQUIRE semantics there because we want to strictly
order the lock-unqueue A,B,C steps and we get that with:

 A: SC
 B: ACQ
 C: Relaxed

Similarly for unlock we want the WRITE_ONCE to happen after
osq_wait_next, but in that case we can even rely on the control
dependency there.


As noted in a previous email, the ACQUIRE for osq_wait_next() does not
come from its use in lock since its on the fail path, and trylock
failure doesn't imply any barriers.

Not should it have RELEASE semantics for its use in unlock, since we
already have that covered by the xchg() done prior.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11 12:26               ` Peter Zijlstra
@ 2015-12-11 13:33                 ` Will Deacon
  2015-12-11 13:48                   ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Will Deacon @ 2015-12-11 13:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Pinski, Davidlohr Bueso, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar, Linux Kernel Mailing List,
	linux-arm-kernel, david.daney

On Fri, Dec 11, 2015 at 01:26:47PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 11, 2015 at 12:18:00PM +0000, Will Deacon wrote:
> > On Fri, Dec 11, 2015 at 01:13:19PM +0100, Peter Zijlstra wrote:
> > > On Fri, Dec 11, 2015 at 12:04:19PM +0000, Will Deacon wrote:
> > > > I think Andrew meant the atomic_xchg_acquire at the start of osq_lock,
> > > > as opposed to "compare and swap". In which case, it does look like
> > > > there's a bug here because there is nothing to order the initialisation
> > > > of the node fields with publishing of the node, whether that's
> > > > indirectly as a result of setting the tail to the current CPU or
> > > > directly as a result of the WRITE_ONCE.
> > > 
> > > Agreed, this does indeed look like a bug. If confirmed please write a
> > > shiny changelog and I'll queue asap.
> > 
> > Yup. I've failed to reproduce the issue locally, so we'll need to wait
> > for Andrew and/or David to get back to us first.
> 
> While we're there, the acquire in osq_wait_next() seems somewhat ill
> documented too.
> 
> I _think_ we need ACQUIRE semantics there because we want to strictly
> order the lock-unqueue A,B,C steps and we get that with:
> 
>  A: SC
>  B: ACQ
>  C: Relaxed
> 
> Similarly for unlock we want the WRITE_ONCE to happen after
> osq_wait_next, but in that case we can even rely on the control
> dependency there.

Even for the lock-unqueue case, isn't B->C ordered by a control dependency
because C consists only of stores?

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11 13:33                 ` Will Deacon
@ 2015-12-11 13:48                   ` Peter Zijlstra
  2015-12-11 14:06                     ` Will Deacon
  2015-12-11 22:35                     ` Paul E. McKenney
  0 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2015-12-11 13:48 UTC (permalink / raw)
  To: Will Deacon
  Cc: Andrew Pinski, Davidlohr Bueso, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar, Linux Kernel Mailing List,
	linux-arm-kernel, david.daney

On Fri, Dec 11, 2015 at 01:33:14PM +0000, Will Deacon wrote:
> On Fri, Dec 11, 2015 at 01:26:47PM +0100, Peter Zijlstra wrote:

> > While we're there, the acquire in osq_wait_next() seems somewhat ill
> > documented too.
> > 
> > I _think_ we need ACQUIRE semantics there because we want to strictly
> > order the lock-unqueue A,B,C steps and we get that with:
> > 
> >  A: SC
> >  B: ACQ
> >  C: Relaxed
> > 
> > Similarly for unlock we want the WRITE_ONCE to happen after
> > osq_wait_next, but in that case we can even rely on the control
> > dependency there.
> 
> Even for the lock-unqueue case, isn't B->C ordered by a control dependency
> because C consists only of stores?

Hmm, indeed. So we could go fully relaxed on it I suppose, since the
same is true for the unlock site.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11 13:48                   ` Peter Zijlstra
@ 2015-12-11 14:06                     ` Will Deacon
  2015-12-11 17:11                       ` Peter Zijlstra
  2015-12-11 22:35                     ` Paul E. McKenney
  1 sibling, 1 reply; 24+ messages in thread
From: Will Deacon @ 2015-12-11 14:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Pinski, Davidlohr Bueso, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar, Linux Kernel Mailing List,
	linux-arm-kernel, david.daney

On Fri, Dec 11, 2015 at 02:48:03PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 11, 2015 at 01:33:14PM +0000, Will Deacon wrote:
> > On Fri, Dec 11, 2015 at 01:26:47PM +0100, Peter Zijlstra wrote:
> 
> > > While we're there, the acquire in osq_wait_next() seems somewhat ill
> > > documented too.
> > > 
> > > I _think_ we need ACQUIRE semantics there because we want to strictly
> > > order the lock-unqueue A,B,C steps and we get that with:
> > > 
> > >  A: SC
> > >  B: ACQ
> > >  C: Relaxed
> > > 
> > > Similarly for unlock we want the WRITE_ONCE to happen after
> > > osq_wait_next, but in that case we can even rely on the control
> > > dependency there.
> > 
> > Even for the lock-unqueue case, isn't B->C ordered by a control dependency
> > because C consists only of stores?
> 
> Hmm, indeed. So we could go fully relaxed on it I suppose, since the
> same is true for the unlock site.

In which case, we should be able to relax the xchg in there (osq_wait_next)
too, right?

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11 12:04         ` Will Deacon
  2015-12-11 12:13           ` Peter Zijlstra
@ 2015-12-11 14:17           ` Davidlohr Bueso
  2015-12-17 21:52           ` Jeremy Linton
  2 siblings, 0 replies; 24+ messages in thread
From: Davidlohr Bueso @ 2015-12-11 14:17 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Andrew Pinski, Davidlohr Bueso, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar, Linux Kernel Mailing List,
	linux-arm-kernel

On Fri, 11 Dec 2015, Will Deacon wrote:

>I think Andrew meant the atomic_xchg_acquire at the start of osq_lock,
>as opposed to "compare and swap". In which case, it does look like
>there's a bug here because there is nothing to order the initialisation
>of the node fields with publishing of the node, whether that's
>indirectly as a result of setting the tail to the current CPU or
>directly as a result of the WRITE_ONCE.

Sorry I'm late to the party.

Duh yes this is obviously bogus, and worse I recall triggering a similar
tail initialization issue in osq_lock on some experimental work on x86,
so this is very much a point of failure. Ack.

>
>Andrew, David: does making that atomic_xchg_acquire and atomic_xchg
>fix things for you?
>
>I don't fully grok what 81a43adae3b9 has to do with any of this, so
>maybe there's another bug too.

I think this is mainly because mutex_optimistic_spin is where the stack
shows the lockup, which really translates to c55a6ffa62.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11 14:06                     ` Will Deacon
@ 2015-12-11 17:11                       ` Peter Zijlstra
  2015-12-11 17:24                         ` Will Deacon
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2015-12-11 17:11 UTC (permalink / raw)
  To: Will Deacon
  Cc: Andrew Pinski, Davidlohr Bueso, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar, Linux Kernel Mailing List,
	linux-arm-kernel, david.daney

On Fri, Dec 11, 2015 at 02:06:49PM +0000, Will Deacon wrote:
> On Fri, Dec 11, 2015 at 02:48:03PM +0100, Peter Zijlstra wrote:
> > On Fri, Dec 11, 2015 at 01:33:14PM +0000, Will Deacon wrote:
> > > On Fri, Dec 11, 2015 at 01:26:47PM +0100, Peter Zijlstra wrote:
> > 
> > > > While we're there, the acquire in osq_wait_next() seems somewhat ill
> > > > documented too.
> > > > 
> > > > I _think_ we need ACQUIRE semantics there because we want to strictly
> > > > order the lock-unqueue A,B,C steps and we get that with:
> > > > 
> > > >  A: SC
> > > >  B: ACQ
> > > >  C: Relaxed
> > > > 
> > > > Similarly for unlock we want the WRITE_ONCE to happen after
> > > > osq_wait_next, but in that case we can even rely on the control
> > > > dependency there.
> > > 
> > > Even for the lock-unqueue case, isn't B->C ordered by a control dependency
> > > because C consists only of stores?
> > 
> > Hmm, indeed. So we could go fully relaxed on it I suppose, since the
> > same is true for the unlock site.
> 
> In which case, we should be able to relax the xchg in there (osq_wait_next)
> too, right?

Can I have second thoughts an confuse matters again? ;-)

A RmW-acq is a load-acquire+store. That means the store is _after_ the
load and thus not required for the completion of the control dependency.

Therefore the store in question can reorder inside the conditional
control block's stores.

Hmm?

Suggesting this acquire is in fact also wrong, since we need full order
ops to guarantee full order both in lock and unlock.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11 17:11                       ` Peter Zijlstra
@ 2015-12-11 17:24                         ` Will Deacon
  0 siblings, 0 replies; 24+ messages in thread
From: Will Deacon @ 2015-12-11 17:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Pinski, Davidlohr Bueso, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar, Linux Kernel Mailing List,
	linux-arm-kernel, david.daney

On Fri, Dec 11, 2015 at 06:11:28PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 11, 2015 at 02:06:49PM +0000, Will Deacon wrote:
> > On Fri, Dec 11, 2015 at 02:48:03PM +0100, Peter Zijlstra wrote:
> > > On Fri, Dec 11, 2015 at 01:33:14PM +0000, Will Deacon wrote:
> > > > On Fri, Dec 11, 2015 at 01:26:47PM +0100, Peter Zijlstra wrote:
> > > 
> > > > > While we're there, the acquire in osq_wait_next() seems somewhat ill
> > > > > documented too.
> > > > > 
> > > > > I _think_ we need ACQUIRE semantics there because we want to strictly
> > > > > order the lock-unqueue A,B,C steps and we get that with:
> > > > > 
> > > > >  A: SC
> > > > >  B: ACQ
> > > > >  C: Relaxed
> > > > > 
> > > > > Similarly for unlock we want the WRITE_ONCE to happen after
> > > > > osq_wait_next, but in that case we can even rely on the control
> > > > > dependency there.
> > > > 
> > > > Even for the lock-unqueue case, isn't B->C ordered by a control dependency
> > > > because C consists only of stores?
> > > 
> > > Hmm, indeed. So we could go fully relaxed on it I suppose, since the
> > > same is true for the unlock site.
> > 
> > In which case, we should be able to relax the xchg in there (osq_wait_next)
> > too, right?
> 
> Can I have second thoughts an confuse matters again? ;-)
> 
> A RmW-acq is a load-acquire+store. That means the store is _after_ the
> load and thus not required for the completion of the control dependency.
> 
> Therefore the store in question can reorder inside the conditional
> control block's stores.
> 
> Hmm?

Ah yeah, it's the same thing we were discussing the other day! Whilst
there is a form of control dependency from the SC part of the LL/SC
sequence, it doesn't guarantee ordering in the same way that a
load->store control dependency does. That is, it orders subsequent
writes to be afterwards in the coherence order but it doesn't ensure
multi-copy atomicity for readers.

Now, in this case, &lock->tail is only ever accessed by other cmpxchg
operations, so I think it does actually work using just the control
dependency. Worst case, a concurrent osq_wait_next gets a stale value
in the atomic_read, but that's not a correctness problem.

Will

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11 13:48                   ` Peter Zijlstra
  2015-12-11 14:06                     ` Will Deacon
@ 2015-12-11 22:35                     ` Paul E. McKenney
  2015-12-14 18:49                       ` One Thousand Gnomes
  2015-12-14 20:28                       ` FW: " Peter Zijlstra
  1 sibling, 2 replies; 24+ messages in thread
From: Paul E. McKenney @ 2015-12-11 22:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Andrew Pinski, Davidlohr Bueso, Thomas Gleixner,
	Ingo Molnar, Linux Kernel Mailing List, linux-arm-kernel,
	david.daney

On Fri, Dec 11, 2015 at 02:48:03PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 11, 2015 at 01:33:14PM +0000, Will Deacon wrote:
> > On Fri, Dec 11, 2015 at 01:26:47PM +0100, Peter Zijlstra wrote:
> 
> > > While we're there, the acquire in osq_wait_next() seems somewhat ill
> > > documented too.
> > > 
> > > I _think_ we need ACQUIRE semantics there because we want to strictly
> > > order the lock-unqueue A,B,C steps and we get that with:
> > > 
> > >  A: SC
> > >  B: ACQ
> > >  C: Relaxed
> > > 
> > > Similarly for unlock we want the WRITE_ONCE to happen after
> > > osq_wait_next, but in that case we can even rely on the control
> > > dependency there.
> > 
> > Even for the lock-unqueue case, isn't B->C ordered by a control dependency
> > because C consists only of stores?
> 
> Hmm, indeed. So we could go fully relaxed on it I suppose, since the
> same is true for the unlock site.

I am probably missing quite a bit on this thread, but don't x86 MMIO
accesses to frame buffers need to interact with something more heavyweight
than an x86 release store or acquire load in order to remain confined
to the resulting critical section?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11 22:35                     ` Paul E. McKenney
@ 2015-12-14 18:49                       ` One Thousand Gnomes
  2015-12-14 20:31                         ` Peter Zijlstra
  2015-12-15  4:36                         ` Paul E. McKenney
  2015-12-14 20:28                       ` FW: " Peter Zijlstra
  1 sibling, 2 replies; 24+ messages in thread
From: One Thousand Gnomes @ 2015-12-14 18:49 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Will Deacon, Andrew Pinski, Davidlohr Bueso,
	Thomas Gleixner, Ingo Molnar, Linux Kernel Mailing List,
	linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>,
	david.daney@cavium.com

On Fri, 11 Dec 2015 14:35:40 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Fri, Dec 11, 2015 at 02:48:03PM +0100, Peter Zijlstra wrote:
> > On Fri, Dec 11, 2015 at 01:33:14PM +0000, Will Deacon wrote:
> > > On Fri, Dec 11, 2015 at 01:26:47PM +0100, Peter Zijlstra wrote:
> > 
> > > > While we're there, the acquire in osq_wait_next() seems somewhat ill
> > > > documented too.
> > > > 
> > > > I _think_ we need ACQUIRE semantics there because we want to strictly
> > > > order the lock-unqueue A,B,C steps and we get that with:
> > > > 
> > > >  A: SC
> > > >  B: ACQ
> > > >  C: Relaxed
> > > > 
> > > > Similarly for unlock we want the WRITE_ONCE to happen after
> > > > osq_wait_next, but in that case we can even rely on the control
> > > > dependency there.
> > > 
> > > Even for the lock-unqueue case, isn't B->C ordered by a control dependency
> > > because C consists only of stores?
> > 
> > Hmm, indeed. So we could go fully relaxed on it I suppose, since the
> > same is true for the unlock site.
> 
> I am probably missing quite a bit on this thread, but don't x86 MMIO
> accesses to frame buffers need to interact with something more heavyweight
> than an x86 release store or acquire load in order to remain confined
> to the resulting critical section?

Depends upon the device and the mapping. There are also CPU errata
related to write combining on older CPUs (notably Pentium Pro era) which
result in ordering errors with write combining unless deliberately fenced.

Any PCI access isn't constrained to the critical section unless a PCI
read from the same device is done and completes before exiting. Even then
on processors with a separate APIC bus (PPro, PII I think) interrupts are
asynchronous on their own bus.

The PCI posting rules also apply to DMA.

Finally we run the IDT WinChip in out-of-order store mode not full x86
compatibility which while uniprocessor does mean the correct fences
matter.

Just to ensure total confusion some video cards have MMIO areas that are
not in fact memory but a FIFO rigged to look like a block of RAM for
speed of writing. In those cases the rules are a bit card dependant.

But seriously are there any cases we actually care about this for osq ?

Alan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11 22:35                     ` Paul E. McKenney
  2015-12-14 18:49                       ` One Thousand Gnomes
@ 2015-12-14 20:28                       ` Peter Zijlstra
  2015-12-15  4:36                         ` Paul E. McKenney
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2015-12-14 20:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Will Deacon, Andrew Pinski, Davidlohr Bueso, Thomas Gleixner,
	Ingo Molnar, Linux Kernel Mailing List, linux-arm-kernel,
	david.daney

On Fri, Dec 11, 2015 at 02:35:40PM -0800, Paul E. McKenney wrote:
> On Fri, Dec 11, 2015 at 02:48:03PM +0100, Peter Zijlstra wrote:
> > On Fri, Dec 11, 2015 at 01:33:14PM +0000, Will Deacon wrote:
> > > On Fri, Dec 11, 2015 at 01:26:47PM +0100, Peter Zijlstra wrote:
> > 
> > > > While we're there, the acquire in osq_wait_next() seems somewhat ill
> > > > documented too.
> > > > 
> > > > I _think_ we need ACQUIRE semantics there because we want to strictly
> > > > order the lock-unqueue A,B,C steps and we get that with:
> > > > 
> > > >  A: SC
> > > >  B: ACQ
> > > >  C: Relaxed
> > > > 
> > > > Similarly for unlock we want the WRITE_ONCE to happen after
> > > > osq_wait_next, but in that case we can even rely on the control
> > > > dependency there.
> > > 
> > > Even for the lock-unqueue case, isn't B->C ordered by a control dependency
> > > because C consists only of stores?
> > 
> > Hmm, indeed. So we could go fully relaxed on it I suppose, since the
> > same is true for the unlock site.
> 
> I am probably missing quite a bit on this thread, but don't x86 MMIO
> accesses to frame buffers need to interact with something more heavyweight
> than an x86 release store or acquire load in order to remain confined
> to the resulting critical section?

So on x86 there really isn't a problem because every atomic op (and
there's plenty here) will be a full barrier.

That is, even if you were to replace everything with _relaxed() ops, it
would still work as 'expected' on x86.

ppc/arm64 will crash and burn, but that's another story.

But the important point here was that osq_wait_next() is never relied
upon to provide either the ACQUIRE semantics for osq_lock() not the
RELEASE semantics for osq_unlock(). Those are provided by other ops.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-14 18:49                       ` One Thousand Gnomes
@ 2015-12-14 20:31                         ` Peter Zijlstra
  2015-12-15  4:36                         ` Paul E. McKenney
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2015-12-14 20:31 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Paul E. McKenney, Will Deacon, Andrew Pinski, Davidlohr Bueso,
	Thomas Gleixner, Ingo Molnar, Linux Kernel Mailing List

On Mon, Dec 14, 2015 at 06:49:31PM +0000, One Thousand Gnomes wrote:
> 
> But seriously are there any cases we actually care about this for osq ?

So I think what PaulMck is worried about is that one would expect things
like:

	mutex_lock();
	MMIO(++var);
	mutex_unlock();

(the same example Linus gave but with a mutex instead of a spinlock) to
just work.

Now, I haven't checked the code, but I'm not sure we ever rely on osq to
provide the mutex ACQUIRE barrier, since we always need to acquire the
mutex variable itself after we've acquired the osq 'lock'.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-14 18:49                       ` One Thousand Gnomes
  2015-12-14 20:31                         ` Peter Zijlstra
@ 2015-12-15  4:36                         ` Paul E. McKenney
  1 sibling, 0 replies; 24+ messages in thread
From: Paul E. McKenney @ 2015-12-15  4:36 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Peter Zijlstra, Will Deacon, Andrew Pinski, Davidlohr Bueso,
	Thomas Gleixner, Ingo Molnar, Linux Kernel Mailing List

On Mon, Dec 14, 2015 at 06:49:31PM +0000, One Thousand Gnomes wrote:
> On Fri, 11 Dec 2015 14:35:40 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Fri, Dec 11, 2015 at 02:48:03PM +0100, Peter Zijlstra wrote:
> > > On Fri, Dec 11, 2015 at 01:33:14PM +0000, Will Deacon wrote:
> > > > On Fri, Dec 11, 2015 at 01:26:47PM +0100, Peter Zijlstra wrote:
> > > 
> > > > > While we're there, the acquire in osq_wait_next() seems somewhat ill
> > > > > documented too.
> > > > > 
> > > > > I _think_ we need ACQUIRE semantics there because we want to strictly
> > > > > order the lock-unqueue A,B,C steps and we get that with:
> > > > > 
> > > > >  A: SC
> > > > >  B: ACQ
> > > > >  C: Relaxed
> > > > > 
> > > > > Similarly for unlock we want the WRITE_ONCE to happen after
> > > > > osq_wait_next, but in that case we can even rely on the control
> > > > > dependency there.
> > > > 
> > > > Even for the lock-unqueue case, isn't B->C ordered by a control dependency
> > > > because C consists only of stores?
> > > 
> > > Hmm, indeed. So we could go fully relaxed on it I suppose, since the
> > > same is true for the unlock site.
> > 
> > I am probably missing quite a bit on this thread, but don't x86 MMIO
> > accesses to frame buffers need to interact with something more heavyweight
> > than an x86 release store or acquire load in order to remain confined
> > to the resulting critical section?
> 
> Depends upon the device and the mapping. There are also CPU errata
> related to write combining on older CPUs (notably Pentium Pro era) which
> result in ordering errors with write combining unless deliberately fenced.
> 
> Any PCI access isn't constrained to the critical section unless a PCI
> read from the same device is done and completes before exiting. Even then
> on processors with a separate APIC bus (PPro, PII I think) interrupts are
> asynchronous on their own bus.
> 
> The PCI posting rules also apply to DMA.
> 
> Finally we run the IDT WinChip in out-of-order store mode not full x86
> compatibility which while uniprocessor does mean the correct fences
> matter.
> 
> Just to ensure total confusion some video cards have MMIO areas that are
> not in fact memory but a FIFO rigged to look like a block of RAM for
> speed of writing. In those cases the rules are a bit card dependant.

Sounds like the usual fun and excitement!  ;-)

> But seriously are there any cases we actually care about this for osq ?

Apparently not, given Peter's email.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-14 20:28                       ` FW: " Peter Zijlstra
@ 2015-12-15  4:36                         ` Paul E. McKenney
  0 siblings, 0 replies; 24+ messages in thread
From: Paul E. McKenney @ 2015-12-15  4:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Andrew Pinski, Davidlohr Bueso, Thomas Gleixner,
	Ingo Molnar, Linux Kernel Mailing List, linux-arm-kernel,
	david.daney

On Mon, Dec 14, 2015 at 09:28:55PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 11, 2015 at 02:35:40PM -0800, Paul E. McKenney wrote:
> > On Fri, Dec 11, 2015 at 02:48:03PM +0100, Peter Zijlstra wrote:
> > > On Fri, Dec 11, 2015 at 01:33:14PM +0000, Will Deacon wrote:
> > > > On Fri, Dec 11, 2015 at 01:26:47PM +0100, Peter Zijlstra wrote:
> > > 
> > > > > While we're there, the acquire in osq_wait_next() seems somewhat ill
> > > > > documented too.
> > > > > 
> > > > > I _think_ we need ACQUIRE semantics there because we want to strictly
> > > > > order the lock-unqueue A,B,C steps and we get that with:
> > > > > 
> > > > >  A: SC
> > > > >  B: ACQ
> > > > >  C: Relaxed
> > > > > 
> > > > > Similarly for unlock we want the WRITE_ONCE to happen after
> > > > > osq_wait_next, but in that case we can even rely on the control
> > > > > dependency there.
> > > > 
> > > > Even for the lock-unqueue case, isn't B->C ordered by a control dependency
> > > > because C consists only of stores?
> > > 
> > > Hmm, indeed. So we could go fully relaxed on it I suppose, since the
> > > same is true for the unlock site.
> > 
> > I am probably missing quite a bit on this thread, but don't x86 MMIO
> > accesses to frame buffers need to interact with something more heavyweight
> > than an x86 release store or acquire load in order to remain confined
> > to the resulting critical section?
> 
> So on x86 there really isn't a problem because every atomic op (and
> there's plenty here) will be a full barrier.
> 
> That is, even if you were to replace everything with _relaxed() ops, it
> would still work as 'expected' on x86.
> 
> ppc/arm64 will crash and burn, but that's another story.
> 
> But the important point here was that osq_wait_next() is never relied
> upon to provide either the ACQUIRE semantics for osq_lock() not the
> RELEASE semantics for osq_unlock(). Those are provided by other ops.

OK, good to know!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: FW: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
  2015-12-11 12:04         ` Will Deacon
  2015-12-11 12:13           ` Peter Zijlstra
  2015-12-11 14:17           ` Davidlohr Bueso
@ 2015-12-17 21:52           ` Jeremy Linton
  2 siblings, 0 replies; 24+ messages in thread
From: Jeremy Linton @ 2015-12-17 21:52 UTC (permalink / raw)
  To: Will Deacon, Peter Zijlstra
  Cc: Davidlohr Bueso, Andrew Pinski, Linux Kernel Mailing List,
	Thomas Gleixner, Paul E. McKenney, Ingo Molnar, linux-arm-kernel

On 12/11/2015 06:04 AM, Will Deacon wrote:
> I think Andrew meant the atomic_xchg_acquire at the start of osq_lock,
> as opposed to "compare and swap". In which case, it does look like
> there's a bug here because there is nothing to order the initialisation
> of the node fields with publishing of the node, whether that's
> indirectly as a result of setting the tail to the current CPU or
> directly as a result of the WRITE_ONCE.
>
> Andrew, David: does making that atomic_xchg_acquire and atomic_xchg
> fix things for you?
>
> I don't fully grok what 81a43adae3b9 has to do with any of this, so
> maybe there's another bug too.

Will,

I ran into this problem while getting a machine to boot with ACPI
yesterday. Your suggested fix works on that machine.

Thanks, so for that patch.

Tested-by: Jeremy Linton <jeremy.linton@arm.com>






IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX)
@ 2015-12-11 17:43 Andrew Pinski
  0 siblings, 0 replies; 24+ messages in thread
From: Andrew Pinski @ 2015-12-11 17:43 UTC (permalink / raw)
  To: Will Deacon, dave
  Cc: Peter Zijlstra, Andrew, Davidlohr Bueso, Thomas Gleixner,
	Paul E. McKenney, Ingo Molnar, Linux Kernel Mailing List,
	linux-arm-kernel, David Daney

On Fri, Dec 11, 2015 at 6:18 AM, Davidlohr Bueso wrote:
>
> On Fri, 11 Dec 2015, Will Deacon wrote:
>
>>I think Andrew meant the atomic_xchg_acquire at the start of osq_lock,
>>as opposed to "compare and swap". In which case, it does look like
>>there's a bug here because there is nothing to order the initialisation
>>of the node fields with publishing of the node, whether that's
>>indirectly as a result of setting the tail to the current CPU or
>>directly as a result of the WRITE_ONCE.
>
> Sorry I'm late to the party.
>
> Duh yes this is obviously bogus, and worse I recall triggering a similar tail initialization issue in osq_lock on some experimental work on x86, so this is very much a point of failure. Ack.
>
>>
>>Andrew, David: does making that atomic_xchg_acquire and atomic_xchg fix
>>things for you?

Yes that works for me.  And yes that looks like the correct fix.

>>
>>I don't fully grok what 81a43adae3b9 has to do with any of this, so
>>maybe there's another bug too.
>
> I think this is mainly because mutex_optimistic_spin is where the stack shows the lockup, which really translates to c55a6ffa62.

Yes as mutex_optimistic_spin calls into osq_lock/osq_unlock.  And
81a43adae3b9 changed mutex.c which David thought was where the issue
was located rather than not what mutex_optimistic_spin called.

Thanks,
Andrew Pinski

>
> Thanks,
> Davidlohr

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-12-17 21:52 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-10 19:43 Commit 81a43adae3b9 (locking/mutex: Use acquire/release semantics) causing failures on arm64 (ThunderX) David Daney
     [not found] ` <SN1PR07MB21577C72379C8440A208D6BC9EEA0@SN1PR07MB2157.namprd07.prod.outlook.com>
2015-12-11  3:29   ` FW: " Andrew Pinski
2015-12-11  4:51     ` Andrew Pinski
2015-12-11  8:41       ` Peter Zijlstra
2015-12-11 12:04         ` Will Deacon
2015-12-11 12:13           ` Peter Zijlstra
2015-12-11 12:18             ` Will Deacon
2015-12-11 12:26               ` Peter Zijlstra
2015-12-11 13:33                 ` Will Deacon
2015-12-11 13:48                   ` Peter Zijlstra
2015-12-11 14:06                     ` Will Deacon
2015-12-11 17:11                       ` Peter Zijlstra
2015-12-11 17:24                         ` Will Deacon
2015-12-11 22:35                     ` Paul E. McKenney
2015-12-14 18:49                       ` One Thousand Gnomes
2015-12-14 20:31                         ` Peter Zijlstra
2015-12-15  4:36                         ` Paul E. McKenney
2015-12-14 20:28                       ` FW: " Peter Zijlstra
2015-12-15  4:36                         ` Paul E. McKenney
2015-12-11 14:17           ` Davidlohr Bueso
2015-12-17 21:52           ` Jeremy Linton
2015-12-11  7:33     ` Peter Zijlstra
2015-12-11  9:59 ` Will Deacon
2015-12-11 17:43 Andrew Pinski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).