linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC][PATCH] Fix a race between rwsem and the scheduler
@ 2016-08-30  8:49 Balbir Singh
  2016-08-30  9:13 ` Nicholas Piggin
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Balbir Singh @ 2016-08-30  8:49 UTC (permalink / raw)
  To: LKML, Peter Zijlstra
  Cc: Oleg Nesterov, Benjamin Herrenschmidt, Nicholas Piggin



The origin of the issue I've seen seems to be related to
rwsem spin lock stealing. Basically I see the system deadlock'd in the
following state

I have a system with multiple threads and

Most of the threads are stuck doing

[67272.593915] --- interrupt: e81 at _raw_spin_lock_irqsave+0xa4/0x130
[67272.593915]     LR = _raw_spin_lock_irqsave+0x9c/0x130
[67272.700996] [c000000012857ae0] [c00000000012453c] rwsem_wake+0xcc/0x110
[67272.749283] [c000000012857b20] [c0000000001215d8] up_write+0x78/0x90
[67272.788965] [c000000012857b50] [c00000000028153c] unlink_anon_vmas+0x15c/0x2c0
[67272.798782] [c000000012857bc0] [c00000000026f5c0] free_pgtables+0xf0/0x1c0
[67272.842528] [c000000012857c10] [c00000000027c9a0] exit_mmap+0x100/0x1a0
[67272.872947] [c000000012857cd0] [c0000000000b4a98] mmput+0xa8/0x1b0
[67272.898432] [c000000012857d00] [c0000000000bc50c] do_exit+0x33c/0xc30
[67272.944721] [c000000012857dc0] [c0000000000bcee4] do_group_exit+0x64/0x100
[67272.969014] [c000000012857e00] [c0000000000bcfac] SyS_exit_group+0x2c/0x30
[67272.978971] [c000000012857e30] [c000000000009204] system_call+0x38/0xb4
[67272.999016] Instruction dump:

They are spinning on the sem->wait_lock, the holder of sem->wait_lock has
irq's disabled and is doing

[c00000037930fb30] c0000000000f724c try_to_wake_up+0x6c/0x570
[c00000037930fbb0] c000000000124328 __rwsem_do_wake+0x1f8/0x260
[c00000037930fc00] c0000000001244b4 rwsem_wake+0x84/0x110
[c00000037930fc40] c000000000121598 up_write+0x78/0x90
[c00000037930fc70] c000000000281a54 anon_vma_fork+0x184/0x1d0
[c00000037930fcc0] c0000000000b68e0 copy_process.isra.5+0x14c0/0x1870
[c00000037930fda0] c0000000000b6e68 _do_fork+0xa8/0x4b0
[c00000037930fe30] c000000000009460 ppc_clone+0x8/0xc

The offset of try_to_wake_up is actually misleading, it is actually stuck
doing the following in try_to_wake_up

while (p->on_cpu)
	cpu_relax();

Analysis

The issue is triggered, due to the following race

CPU1					CPU2

while () {
  if (cond)
    break;
  do {
    schedule();
    set_current_state(TASK_UN..)
  } while (!cond);
					rwsem_wake()
					  spin_lock_irqsave(wait_lock)
  raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
}					  try_to_wake_up()
set_current_state(TASK_RUNNING);	  ..
list_del(&waiter.list);

CPU2 wakes up CPU1, but before it can get the wait_lock and set
current state to TASK_RUNNING the following occurs

CPU3
(stole the rwsem before waiter can be woken up from queue)
up_write()
rwsem_wake()
raw_spin_lock_irqsave(wait_lock)
if (!list_empty)
  wake_up_process()
  try_to_wake_up()
  raw_spin_lock_irqsave(p->pi_lock)
  ..
  if (p->on_rq && ttwu_wakeup())
  ..
  while (p->on_cpu)
    cpu_relax()
  ..

CPU3 tries to wake up the task on CPU1 again since it finds
it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
after CPU2, CPU3 got it.

CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly since p->on_rq
is checked under pi_lock, I've noticed that try_to_wake_up() finds
p->on_rq to be 0. This was the most confusing bit of the analysis,
but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
check is not reliable without this fix IMHO. The race is visible
(based on the analysis) only when ttwu_queue() does a remote wakeup
via ttwu_queue_remote. In which case the p->on_rq change is not
done uder the pi_lock.

The result is that after a while the entire system locks up on
the raw_spin_irqlock_save(wait_lock) and the holder spins infintely

Reproduction of the issue

The issue can be reproduced after a long run on my system with 80
threads and having to tweak available memory to very low and running
memory stress-ng mmapfork test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but thats work in progress. I am still testing the
changes on my still in a loop and the tests seem OK thus far.

Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 kernel/sched/core.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2a906f2..582c684 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	success = 1; /* we're going to change ->state */
 	cpu = task_cpu(p);
 
+	/*
+	 * Ensure we see on_rq and p_state consistently
+	 *
+	 * For example in __rwsem_down_write_failed(), we have
+	 *    [S] ->on_rq = 1				[L] ->state
+	 *    MB					 RMB
+	 *    [S] ->state = TASK_UNINTERRUPTIBLE	[L] ->on_rq
+	 * In the absence of the RMB p->on_rq can be observed to be 0
+	 * and we end up spinning indefinitely in while (p->on_cpu)
+	 */
+	smp_rmb();
 	if (p->on_rq && ttwu_remote(p, wake_flags))
 		goto stat;
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-30  8:49 [RFC][PATCH] Fix a race between rwsem and the scheduler Balbir Singh
@ 2016-08-30  9:13 ` Nicholas Piggin
  2016-08-30 12:19 ` Peter Zijlstra
  2016-08-30 12:58 ` Oleg Nesterov
  2 siblings, 0 replies; 27+ messages in thread
From: Nicholas Piggin @ 2016-08-30  9:13 UTC (permalink / raw)
  To: Balbir Singh; +Cc: LKML, Peter Zijlstra, Oleg Nesterov, Benjamin Herrenschmidt

On Tue, 30 Aug 2016 18:49:37 +1000
Balbir Singh <bsingharora@gmail.com> wrote:

> The origin of the issue I've seen seems to be related to
> rwsem spin lock stealing. Basically I see the system deadlock'd in the
> following state

BTW. this is not really to do with rwsems, but purely a scheduler bug.
It was seen in other callers too that follow similar pattern, but even
when those don't follow this pattern that leads to deadlock, it could
lead to a long busywait in the waker until the wakee next sleeps or
gets preempted.

> 
> I have a system with multiple threads and
> 
> Most of the threads are stuck doing
> 
> [67272.593915] --- interrupt: e81 at _raw_spin_lock_irqsave+0xa4/0x130
> [67272.593915]     LR = _raw_spin_lock_irqsave+0x9c/0x130
> [67272.700996] [c000000012857ae0] [c00000000012453c] rwsem_wake+0xcc/0x110
> [67272.749283] [c000000012857b20] [c0000000001215d8] up_write+0x78/0x90
> [67272.788965] [c000000012857b50] [c00000000028153c] unlink_anon_vmas+0x15c/0x2c0
> [67272.798782] [c000000012857bc0] [c00000000026f5c0] free_pgtables+0xf0/0x1c0
> [67272.842528] [c000000012857c10] [c00000000027c9a0] exit_mmap+0x100/0x1a0
> [67272.872947] [c000000012857cd0] [c0000000000b4a98] mmput+0xa8/0x1b0
> [67272.898432] [c000000012857d00] [c0000000000bc50c] do_exit+0x33c/0xc30
> [67272.944721] [c000000012857dc0] [c0000000000bcee4] do_group_exit+0x64/0x100
> [67272.969014] [c000000012857e00] [c0000000000bcfac] SyS_exit_group+0x2c/0x30
> [67272.978971] [c000000012857e30] [c000000000009204] system_call+0x38/0xb4
> [67272.999016] Instruction dump:
> 
> They are spinning on the sem->wait_lock, the holder of sem->wait_lock has
> irq's disabled and is doing
> 
> [c00000037930fb30] c0000000000f724c try_to_wake_up+0x6c/0x570
> [c00000037930fbb0] c000000000124328 __rwsem_do_wake+0x1f8/0x260
> [c00000037930fc00] c0000000001244b4 rwsem_wake+0x84/0x110
> [c00000037930fc40] c000000000121598 up_write+0x78/0x90
> [c00000037930fc70] c000000000281a54 anon_vma_fork+0x184/0x1d0
> [c00000037930fcc0] c0000000000b68e0 copy_process.isra.5+0x14c0/0x1870
> [c00000037930fda0] c0000000000b6e68 _do_fork+0xa8/0x4b0
> [c00000037930fe30] c000000000009460 ppc_clone+0x8/0xc
> 
> The offset of try_to_wake_up is actually misleading, it is actually stuck
> doing the following in try_to_wake_up
> 
> while (p->on_cpu)
> 	cpu_relax();
> 
> Analysis
> 
> The issue is triggered, due to the following race
> 
> CPU1					CPU2
> 
> while () {
>   if (cond)
>     break;
>   do {
>     schedule();
>     set_current_state(TASK_UN..)
>   } while (!cond);
> 					rwsem_wake()
> 					  spin_lock_irqsave(wait_lock)
>   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
> }					  try_to_wake_up()
> set_current_state(TASK_RUNNING);	  ..
> list_del(&waiter.list);
> 
> CPU2 wakes up CPU1, but before it can get the wait_lock and set
> current state to TASK_RUNNING the following occurs
> 
> CPU3
> (stole the rwsem before waiter can be woken up from queue)
> up_write()
> rwsem_wake()
> raw_spin_lock_irqsave(wait_lock)
> if (!list_empty)
>   wake_up_process()
>   try_to_wake_up()
>   raw_spin_lock_irqsave(p->pi_lock)
>   ..
>   if (p->on_rq && ttwu_wakeup())
>   ..
>   while (p->on_cpu)
>     cpu_relax()
>   ..
> 
> CPU3 tries to wake up the task on CPU1 again since it finds
> it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
> after CPU2, CPU3 got it.
> 
> CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
> the task is spinning on the wait_lock. Interestingly since p->on_rq
> is checked under pi_lock, I've noticed that try_to_wake_up() finds
> p->on_rq to be 0. This was the most confusing bit of the analysis,
> but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
> check is not reliable without this fix IMHO. The race is visible
> (based on the analysis) only when ttwu_queue() does a remote wakeup
> via ttwu_queue_remote. In which case the p->on_rq change is not
> done uder the pi_lock.
> 
> The result is that after a while the entire system locks up on
> the raw_spin_irqlock_save(wait_lock) and the holder spins infintely
> 
> Reproduction of the issue
> 
> The issue can be reproduced after a long run on my system with 80
> threads and having to tweak available memory to very low and running
> memory stress-ng mmapfork test. It usually takes a long time to
> reproduce. I am trying to work on a test case that can reproduce
> the issue faster, but thats work in progress. I am still testing the
> changes on my still in a loop and the tests seem OK thus far.
> 
> Big thanks to Benjamin and Nick for helping debug this as well.
> Ben helped catch the missing barrier, Nick caught every missing
> bit in my theory
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> 
> Signed-off-by: Balbir Singh <bsingharora@gmail.com>
> ---
>  kernel/sched/core.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2a906f2..582c684 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	success = 1; /* we're going to change ->state */
>  	cpu = task_cpu(p);
>  
> +	/*
> +	 * Ensure we see on_rq and p_state consistently
> +	 *
> +	 * For example in __rwsem_down_write_failed(), we have
> +	 *    [S] ->on_rq = 1				[L] ->state
> +	 *    MB					 RMB
> +	 *    [S] ->state = TASK_UNINTERRUPTIBLE	[L] ->on_rq
> +	 * In the absence of the RMB p->on_rq can be observed to be 0
> +	 * and we end up spinning indefinitely in while (p->on_cpu)
> +	 */
> +	smp_rmb();
>  	if (p->on_rq && ttwu_remote(p, wake_flags))
>  		goto stat;
>  

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-30  8:49 [RFC][PATCH] Fix a race between rwsem and the scheduler Balbir Singh
  2016-08-30  9:13 ` Nicholas Piggin
@ 2016-08-30 12:19 ` Peter Zijlstra
  2016-08-30 13:04   ` Oleg Nesterov
  2016-08-31  3:41   ` Balbir Singh
  2016-08-30 12:58 ` Oleg Nesterov
  2 siblings, 2 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-08-30 12:19 UTC (permalink / raw)
  To: Balbir Singh; +Cc: LKML, Oleg Nesterov, Benjamin Herrenschmidt, Nicholas Piggin

On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:
> 
> 
> The origin of the issue I've seen seems to be related to
> rwsem spin lock stealing. Basically I see the system deadlock'd in the
> following state

As Nick says (good to see you're back Nick!), this is unrelated to
rwsems.

This is true for pretty much every blocking wait loop out there, they
all do:

	for (;;) {
		current->state = UNINTERRUPTIBLE;
		smp_mb();
		if (cond)
			break;
		schedule();
	}
	current->state = RUNNING;

Which, if the wakeup is spurious, is just the pattern you need.

> +++ b/kernel/sched/core.c
> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	success = 1; /* we're going to change ->state */
>  	cpu = task_cpu(p);
>  
> +	/*
> +	 * Ensure we see on_rq and p_state consistently
> +	 *
> +	 * For example in __rwsem_down_write_failed(), we have
> +	 *    [S] ->on_rq = 1				[L] ->state
> +	 *    MB					 RMB

There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
to PPC, is _not_ MB. It is however sufficient for this case.

> +	 *    [S] ->state = TASK_UNINTERRUPTIBLE	[L] ->on_rq
> +	 * In the absence of the RMB p->on_rq can be observed to be 0
> +	 * and we end up spinning indefinitely in while (p->on_cpu)
> +	 */


	/*
	 * Ensure we load p->on_rq _after_ p->state, otherwise it would
	 * be possible to, falsely, observe p->on_rq == 0 and get stuck
	 * in smp_cond_load_acquire() below.
	 *
	 * sched_ttwu_pending()			try_to_wake_up()
	 *   [S] p->on_rq = 1;			[L] P->state
	 *       UNLOCK rq->lock
	 *
	 * schedule()				RMB
	 *       LOCK rq->lock
	 *       UNLOCK rq->lock
	 *
	 * [task p]
	 *   [S] p->state = UNINTERRUPTIBLE	[L] p->on_rq
	 *
	 * Pairs with the UNLOCK+LOCK on rq->lock from the
	 * last wakeup of our task and the schedule that got our task
	 * current.
	 */

> +	smp_rmb();
>  	if (p->on_rq && ttwu_remote(p, wake_flags))
>  		goto stat;
>  


Now, this has been present for a fair while, I suspect ever since we
reworked the wakeup path to not use rq->lock twice. Curious you only now
hit it.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-30  8:49 [RFC][PATCH] Fix a race between rwsem and the scheduler Balbir Singh
  2016-08-30  9:13 ` Nicholas Piggin
  2016-08-30 12:19 ` Peter Zijlstra
@ 2016-08-30 12:58 ` Oleg Nesterov
  2016-08-31  3:25   ` Balbir Singh
  2 siblings, 1 reply; 27+ messages in thread
From: Oleg Nesterov @ 2016-08-30 12:58 UTC (permalink / raw)
  To: Balbir Singh
  Cc: LKML, Peter Zijlstra, Benjamin Herrenschmidt, Nicholas Piggin

On 08/30, Balbir Singh wrote:
>
> The origin of the issue I've seen seems to be related to
> rwsem spin lock stealing. Basically I see the system deadlock'd in the
> following state
> 
> I have a system with multiple threads and
> 
> Most of the threads are stuck doing
> 
> [67272.593915] --- interrupt: e81 at _raw_spin_lock_irqsave+0xa4/0x130
> [67272.593915]     LR = _raw_spin_lock_irqsave+0x9c/0x130
> [67272.700996] [c000000012857ae0] [c00000000012453c] rwsem_wake+0xcc/0x110
> [67272.749283] [c000000012857b20] [c0000000001215d8] up_write+0x78/0x90
> [67272.788965] [c000000012857b50] [c00000000028153c] unlink_anon_vmas+0x15c/0x2c0
> [67272.798782] [c000000012857bc0] [c00000000026f5c0] free_pgtables+0xf0/0x1c0
> [67272.842528] [c000000012857c10] [c00000000027c9a0] exit_mmap+0x100/0x1a0
> [67272.872947] [c000000012857cd0] [c0000000000b4a98] mmput+0xa8/0x1b0
> [67272.898432] [c000000012857d00] [c0000000000bc50c] do_exit+0x33c/0xc30
> [67272.944721] [c000000012857dc0] [c0000000000bcee4] do_group_exit+0x64/0x100
> [67272.969014] [c000000012857e00] [c0000000000bcfac] SyS_exit_group+0x2c/0x30
> [67272.978971] [c000000012857e30] [c000000000009204] system_call+0x38/0xb4
> [67272.999016] Instruction dump:
> 
> They are spinning on the sem->wait_lock, the holder of sem->wait_lock has
> irq's disabled and is doing
> 
> [c00000037930fb30] c0000000000f724c try_to_wake_up+0x6c/0x570
> [c00000037930fbb0] c000000000124328 __rwsem_do_wake+0x1f8/0x260
> [c00000037930fc00] c0000000001244b4 rwsem_wake+0x84/0x110
> [c00000037930fc40] c000000000121598 up_write+0x78/0x90
> [c00000037930fc70] c000000000281a54 anon_vma_fork+0x184/0x1d0
> [c00000037930fcc0] c0000000000b68e0 copy_process.isra.5+0x14c0/0x1870
> [c00000037930fda0] c0000000000b6e68 _do_fork+0xa8/0x4b0
> [c00000037930fe30] c000000000009460 ppc_clone+0x8/0xc
> 
> The offset of try_to_wake_up is actually misleading, it is actually stuck
> doing the following in try_to_wake_up
> 
> while (p->on_cpu)
> 	cpu_relax();
> 
> Analysis
> 
> The issue is triggered, due to the following race
> 
> CPU1					CPU2
> 
> while () {
>   if (cond)
>     break;
>   do {
>     schedule();
>     set_current_state(TASK_UN..)
>   } while (!cond);
> 					rwsem_wake()
> 					  spin_lock_irqsave(wait_lock)
>   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
> }					  try_to_wake_up()
> set_current_state(TASK_RUNNING);	  ..
> list_del(&waiter.list);
> 
> CPU2 wakes up CPU1, but before it can get the wait_lock and set
> current state to TASK_RUNNING the following occurs
> 
> CPU3
> (stole the rwsem before waiter can be woken up from queue)
> up_write()
> rwsem_wake()
> raw_spin_lock_irqsave(wait_lock)
> if (!list_empty)
>   wake_up_process()
>   try_to_wake_up()
>   raw_spin_lock_irqsave(p->pi_lock)
>   ..
>   if (p->on_rq && ttwu_wakeup())
>   ..
>   while (p->on_cpu)
>     cpu_relax()
>   ..
> 
> CPU3 tries to wake up the task on CPU1 again since it finds
> it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
> after CPU2, CPU3 got it.
> 
> CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
> the task is spinning on the wait_lock. Interestingly since p->on_rq
> is checked under pi_lock, I've noticed that try_to_wake_up() finds
> p->on_rq to be 0. This was the most confusing bit of the analysis,
> but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
> check is not reliable without this fix IMHO. The race is visible
> (based on the analysis) only when ttwu_queue() does a remote wakeup
> via ttwu_queue_remote. In which case the p->on_rq change is not
> done uder the pi_lock.
> 
> The result is that after a while the entire system locks up on
> the raw_spin_irqlock_save(wait_lock) and the holder spins infintely
> 
> Reproduction of the issue
> 
> The issue can be reproduced after a long run on my system with 80
> threads and having to tweak available memory to very low and running
> memory stress-ng mmapfork test. It usually takes a long time to
> reproduce. I am trying to work on a test case that can reproduce
> the issue faster, but thats work in progress. I am still testing the
> changes on my still in a loop and the tests seem OK thus far.
> 
> Big thanks to Benjamin and Nick for helping debug this as well.
> Ben helped catch the missing barrier, Nick caught every missing
> bit in my theory
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> 
> Signed-off-by: Balbir Singh <bsingharora@gmail.com>
> ---
>  kernel/sched/core.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2a906f2..582c684 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	success = 1; /* we're going to change ->state */
>  	cpu = task_cpu(p);
>  
> +	/*
> +	 * Ensure we see on_rq and p_state consistently
> +	 *
> +	 * For example in __rwsem_down_write_failed(), we have
> +	 *    [S] ->on_rq = 1				[L] ->state
> +	 *    MB					 RMB
> +	 *    [S] ->state = TASK_UNINTERRUPTIBLE	[L] ->on_rq
> +	 * In the absence of the RMB p->on_rq can be observed to be 0
> +	 * and we end up spinning indefinitely in while (p->on_cpu)
> +	 */
> +	smp_rmb();

I think the patch is fine... but unless I am totally confused this
is not specific to __rwsem_down_write_failed(). ttwu() can hang the
same way if the target simply does

	schedule_timeout();
	current->state = TASK_INTERRUPTIBLE;
	current->state = TASK_RUNNING;

And. I am not sure I understand where this MB above comes from.

Oleg.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-30 12:19 ` Peter Zijlstra
@ 2016-08-30 13:04   ` Oleg Nesterov
  2016-08-30 14:13     ` Peter Zijlstra
  2016-08-30 21:25     ` Benjamin Herrenschmidt
  2016-08-31  3:41   ` Balbir Singh
  1 sibling, 2 replies; 27+ messages in thread
From: Oleg Nesterov @ 2016-08-30 13:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Balbir Singh, LKML, Benjamin Herrenschmidt, Nicholas Piggin

On 08/30, Peter Zijlstra wrote:
>
> 	/*
> 	 * Ensure we load p->on_rq _after_ p->state, otherwise it would
> 	 * be possible to, falsely, observe p->on_rq == 0 and get stuck
> 	 * in smp_cond_load_acquire() below.
> 	 *
> 	 * sched_ttwu_pending()			try_to_wake_up()
> 	 *   [S] p->on_rq = 1;			[L] P->state
> 	 *       UNLOCK rq->lock
> 	 *
> 	 * schedule()				RMB
> 	 *       LOCK rq->lock
> 	 *       UNLOCK rq->lock
> 	 *
> 	 * [task p]
> 	 *   [S] p->state = UNINTERRUPTIBLE	[L] p->on_rq
> 	 *
> 	 * Pairs with the UNLOCK+LOCK on rq->lock from the
> 	 * last wakeup of our task and the schedule that got our task
> 	 * current.
> 	 */

Confused... how this connects to UNLOCK+LOCK on rq->lock? A LOAD can
leak into the critical section.

But context switch should imply mb() we can rely on?

Oleg.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-30 13:04   ` Oleg Nesterov
@ 2016-08-30 14:13     ` Peter Zijlstra
  2016-08-30 16:57       ` Oleg Nesterov
  2016-08-30 21:25     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-08-30 14:13 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Balbir Singh, LKML, Benjamin Herrenschmidt, Nicholas Piggin

On Tue, Aug 30, 2016 at 03:04:27PM +0200, Oleg Nesterov wrote:
> On 08/30, Peter Zijlstra wrote:
> >
> > 	/*
> > 	 * Ensure we load p->on_rq _after_ p->state, otherwise it would
> > 	 * be possible to, falsely, observe p->on_rq == 0 and get stuck
> > 	 * in smp_cond_load_acquire() below.
> > 	 *
> > 	 * sched_ttwu_pending()			try_to_wake_up()
> > 	 *   [S] p->on_rq = 1;			[L] P->state
> > 	 *       UNLOCK rq->lock
> > 	 *
> > 	 * schedule()				RMB
> > 	 *       LOCK rq->lock
> > 	 *       UNLOCK rq->lock
> > 	 *
> > 	 * [task p]
> > 	 *   [S] p->state = UNINTERRUPTIBLE	[L] p->on_rq
> > 	 *
> > 	 * Pairs with the UNLOCK+LOCK on rq->lock from the
> > 	 * last wakeup of our task and the schedule that got our task
> > 	 * current.
> > 	 */
> 
> Confused... how this connects to UNLOCK+LOCK on rq->lock? A LOAD can
> leak into the critical section.

How so? That LOCK+UNLOCK which is leaky, UNLOCK+LOCK is a read/write
barrier (just not an MB because it lacks full transitivity).

> But context switch should imply mb() we can rely on?

Not sure it should, on x86 switch_mm does a CR3 write and that is
serializing, but switch_to() doesn't need to do anything iirc.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-30 14:13     ` Peter Zijlstra
@ 2016-08-30 16:57       ` Oleg Nesterov
  2016-08-30 18:34         ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Oleg Nesterov @ 2016-08-30 16:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Balbir Singh, LKML, Benjamin Herrenschmidt, Nicholas Piggin

On 08/30, Peter Zijlstra wrote:
> On Tue, Aug 30, 2016 at 03:04:27PM +0200, Oleg Nesterov wrote:
> > On 08/30, Peter Zijlstra wrote:
> > >
> > > 	/*
> > > 	 * Ensure we load p->on_rq _after_ p->state, otherwise it would
> > > 	 * be possible to, falsely, observe p->on_rq == 0 and get stuck
> > > 	 * in smp_cond_load_acquire() below.
> > > 	 *
> > > 	 * sched_ttwu_pending()			try_to_wake_up()
> > > 	 *   [S] p->on_rq = 1;			[L] P->state
> > > 	 *       UNLOCK rq->lock
> > > 	 *
> > > 	 * schedule()				RMB
> > > 	 *       LOCK rq->lock
> > > 	 *       UNLOCK rq->lock
> > > 	 *
> > > 	 * [task p]
> > > 	 *   [S] p->state = UNINTERRUPTIBLE	[L] p->on_rq
> > > 	 *
> > > 	 * Pairs with the UNLOCK+LOCK on rq->lock from the
> > > 	 * last wakeup of our task and the schedule that got our task
> > > 	 * current.
> > > 	 */
> >
> > Confused... how this connects to UNLOCK+LOCK on rq->lock? A LOAD can
> > leak into the critical section.
>
> How so? That LOCK+UNLOCK which is leaky, UNLOCK+LOCK is a read/write
> barrier (just not an MB because it lacks full transitivity).

Ah, I have wrongly read the "Pairs with the UNLOCK+LOCK" as
"Pairs with the LOCK+UNLOCK". And didn't notice this even after I
copy-and-pasted this part.

> > But context switch should imply mb() we can rely on?
>
> Not sure it should, on x86 switch_mm does a CR3 write and that is
> serializing, but switch_to() doesn't need to do anything iirc.

Documentation/memory-barriers.txt says

	schedule() and similar imply full memory barriers.

and I (wrongly?) interpreted this as if this is also true for 2
different threadds.

I mean, I thought that the LOAD/STORE's done by some task can't
be re-ordered with LOAD/STORE's done by another task which was
running on the same CPU. Wrong?

Oleg.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-30 16:57       ` Oleg Nesterov
@ 2016-08-30 18:34         ` Peter Zijlstra
  2016-08-30 21:28           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-08-30 18:34 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Balbir Singh, LKML, Benjamin Herrenschmidt, Nicholas Piggin

On Tue, Aug 30, 2016 at 06:57:47PM +0200, Oleg Nesterov wrote:
> On 08/30, Peter Zijlstra wrote:
> > On Tue, Aug 30, 2016 at 03:04:27PM +0200, Oleg Nesterov wrote:

> > > But context switch should imply mb() we can rely on?
> >
> > Not sure it should, on x86 switch_mm does a CR3 write and that is
> > serializing, but switch_to() doesn't need to do anything iirc.
> 
> Documentation/memory-barriers.txt says
> 
> 	schedule() and similar imply full memory barriers.
> 
> and I (wrongly?) interpreted this as if this is also true for 2
> different threadds.

I'm not actually sure it does. There is the comment from 8643cda549ca4
which explain the program order guarantees.

But I'm not sure who or what would simply a full smp_mb() when you call
schedule() -- I mean, its true on x86, but that's 'trivial'.

> I mean, I thought that the LOAD/STORE's done by some task can't
> be re-ordered with LOAD/STORE's done by another task which was
> running on the same CPU. Wrong?

If so, I'm not sure how :/

So smp_mb__before_spinlock() stops stores from @prev, and the ACQUIRE
from spin_lock(&rq->lock) stops both loads/stores from @next, but afaict
nothing stops the loads from @prev seeing stores from @next.

Also not sure this matters though, if they're threads in the same
process its a data race already and nobody cares. If they're not threads
in the same process, they're separated by address space and can't 'see'
each other anyway.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-30 13:04   ` Oleg Nesterov
  2016-08-30 14:13     ` Peter Zijlstra
@ 2016-08-30 21:25     ` Benjamin Herrenschmidt
  2016-08-31  7:20       ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2016-08-30 21:25 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra; +Cc: Balbir Singh, LKML, Nicholas Piggin

On Tue, 2016-08-30 at 15:04 +0200, Oleg Nesterov wrote:
> 
> Confused... how this connects to UNLOCK+LOCK on rq->lock? A LOAD can
> leak into the critical section.
> 
> But context switch should imply mb() we can rely on?

Between setting of ->on_rq and returning to the task so it can
change its state back to [UN]INTERRUPTIBLE, there will be at least one
write barrier (spin unlock of the rq), possibly even a full barrier
(context switch). The write barrier is enough so I didn't dig to make
sure we always context switch in the scenario we're looking at but I
think we do.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-30 18:34         ` Peter Zijlstra
@ 2016-08-30 21:28           ` Benjamin Herrenschmidt
  2016-08-31  7:18             ` Peter Zijlstra
  2016-08-31 13:31             ` Peter Zijlstra
  0 siblings, 2 replies; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2016-08-30 21:28 UTC (permalink / raw)
  To: Peter Zijlstra, Oleg Nesterov; +Cc: Balbir Singh, LKML, Nicholas Piggin

On Tue, 2016-08-30 at 20:34 +0200, Peter Zijlstra wrote:
> 
> I'm not actually sure it does. There is the comment from 8643cda549ca4
> which explain the program order guarantees.
> 
> But I'm not sure who or what would simply a full smp_mb() when you call
> schedule() -- I mean, its true on x86, but that's 'trivial'.

It's always been a requirement that if you actually context switch a
full mb() is implied (though that isn't the case if you don't actually
switch, ie, you are back to RUNNING before you even hit schedule).

On powerpc we have a sync deep in _switch to achieve that.

This is necessary so that a process who wakes up on a different CPU sees
all of its own load/stores.

> > I mean, I thought that the LOAD/STORE's done by some task can't
> > be re-ordered with LOAD/STORE's done by another task which was
> > running on the same CPU. Wrong?
> 
> If so, I'm not sure how :/
> 
> So smp_mb__before_spinlock() stops stores from @prev, and the ACQUIRE
> from spin_lock(&rq->lock) stops both loads/stores from @next, but afaict
> nothing stops the loads from @prev seeing stores from @next.
> 
> Also not sure this matters though, if they're threads in the same
> process its a data race already and nobody cares. If they're not threads
> in the same process, they're separated by address space and can't 'see'
> each other anyway.

The architecture switch_to() has to do the right thing.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-30 12:58 ` Oleg Nesterov
@ 2016-08-31  3:25   ` Balbir Singh
  0 siblings, 0 replies; 27+ messages in thread
From: Balbir Singh @ 2016-08-31  3:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Peter Zijlstra, Benjamin Herrenschmidt, Nicholas Piggin



On 30/08/16 22:58, Oleg Nesterov wrote:
> On 08/30, Balbir Singh wrote:
>>
>> The origin of the issue I've seen seems to be related to
>> rwsem spin lock stealing. Basically I see the system deadlock'd in the
>> following state
>>
>> I have a system with multiple threads and
>>
>> Most of the threads are stuck doing
>>
>> [67272.593915] --- interrupt: e81 at _raw_spin_lock_irqsave+0xa4/0x130
>> [67272.593915]     LR = _raw_spin_lock_irqsave+0x9c/0x130
>> [67272.700996] [c000000012857ae0] [c00000000012453c] rwsem_wake+0xcc/0x110
>> [67272.749283] [c000000012857b20] [c0000000001215d8] up_write+0x78/0x90
>> [67272.788965] [c000000012857b50] [c00000000028153c] unlink_anon_vmas+0x15c/0x2c0
>> [67272.798782] [c000000012857bc0] [c00000000026f5c0] free_pgtables+0xf0/0x1c0
>> [67272.842528] [c000000012857c10] [c00000000027c9a0] exit_mmap+0x100/0x1a0
>> [67272.872947] [c000000012857cd0] [c0000000000b4a98] mmput+0xa8/0x1b0
>> [67272.898432] [c000000012857d00] [c0000000000bc50c] do_exit+0x33c/0xc30
>> [67272.944721] [c000000012857dc0] [c0000000000bcee4] do_group_exit+0x64/0x100
>> [67272.969014] [c000000012857e00] [c0000000000bcfac] SyS_exit_group+0x2c/0x30
>> [67272.978971] [c000000012857e30] [c000000000009204] system_call+0x38/0xb4
>> [67272.999016] Instruction dump:
>>
>> They are spinning on the sem->wait_lock, the holder of sem->wait_lock has
>> irq's disabled and is doing
>>
>> [c00000037930fb30] c0000000000f724c try_to_wake_up+0x6c/0x570
>> [c00000037930fbb0] c000000000124328 __rwsem_do_wake+0x1f8/0x260
>> [c00000037930fc00] c0000000001244b4 rwsem_wake+0x84/0x110
>> [c00000037930fc40] c000000000121598 up_write+0x78/0x90
>> [c00000037930fc70] c000000000281a54 anon_vma_fork+0x184/0x1d0
>> [c00000037930fcc0] c0000000000b68e0 copy_process.isra.5+0x14c0/0x1870
>> [c00000037930fda0] c0000000000b6e68 _do_fork+0xa8/0x4b0
>> [c00000037930fe30] c000000000009460 ppc_clone+0x8/0xc
>>
>> The offset of try_to_wake_up is actually misleading, it is actually stuck
>> doing the following in try_to_wake_up
>>
>> while (p->on_cpu)
>> 	cpu_relax();
>>
>> Analysis
>>
>> The issue is triggered, due to the following race
>>
>> CPU1					CPU2
>>
>> while () {
>>   if (cond)
>>     break;
>>   do {
>>     schedule();
>>     set_current_state(TASK_UN..)
>>   } while (!cond);
>> 					rwsem_wake()
>> 					  spin_lock_irqsave(wait_lock)
>>   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
>> }					  try_to_wake_up()
>> set_current_state(TASK_RUNNING);	  ..
>> list_del(&waiter.list);
>>
>> CPU2 wakes up CPU1, but before it can get the wait_lock and set
>> current state to TASK_RUNNING the following occurs
>>
>> CPU3
>> (stole the rwsem before waiter can be woken up from queue)
>> up_write()
>> rwsem_wake()
>> raw_spin_lock_irqsave(wait_lock)
>> if (!list_empty)
>>   wake_up_process()
>>   try_to_wake_up()
>>   raw_spin_lock_irqsave(p->pi_lock)
>>   ..
>>   if (p->on_rq && ttwu_wakeup())
>>   ..
>>   while (p->on_cpu)
>>     cpu_relax()
>>   ..
>>
>> CPU3 tries to wake up the task on CPU1 again since it finds
>> it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
>> after CPU2, CPU3 got it.
>>
>> CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
>> the task is spinning on the wait_lock. Interestingly since p->on_rq
>> is checked under pi_lock, I've noticed that try_to_wake_up() finds
>> p->on_rq to be 0. This was the most confusing bit of the analysis,
>> but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
>> check is not reliable without this fix IMHO. The race is visible
>> (based on the analysis) only when ttwu_queue() does a remote wakeup
>> via ttwu_queue_remote. In which case the p->on_rq change is not
>> done uder the pi_lock.
>>
>> The result is that after a while the entire system locks up on
>> the raw_spin_irqlock_save(wait_lock) and the holder spins infintely
>>
>> Reproduction of the issue
>>
>> The issue can be reproduced after a long run on my system with 80
>> threads and having to tweak available memory to very low and running
>> memory stress-ng mmapfork test. It usually takes a long time to
>> reproduce. I am trying to work on a test case that can reproduce
>> the issue faster, but thats work in progress. I am still testing the
>> changes on my still in a loop and the tests seem OK thus far.
>>
>> Big thanks to Benjamin and Nick for helping debug this as well.
>> Ben helped catch the missing barrier, Nick caught every missing
>> bit in my theory
>>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Nicholas Piggin <npiggin@gmail.com>
>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>
>> Signed-off-by: Balbir Singh <bsingharora@gmail.com>
>> ---
>>  kernel/sched/core.c | 11 +++++++++++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2a906f2..582c684 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>  	success = 1; /* we're going to change ->state */
>>  	cpu = task_cpu(p);
>>  
>> +	/*
>> +	 * Ensure we see on_rq and p_state consistently
>> +	 *
>> +	 * For example in __rwsem_down_write_failed(), we have
>> +	 *    [S] ->on_rq = 1				[L] ->state
>> +	 *    MB					 RMB
>> +	 *    [S] ->state = TASK_UNINTERRUPTIBLE	[L] ->on_rq
>> +	 * In the absence of the RMB p->on_rq can be observed to be 0
>> +	 * and we end up spinning indefinitely in while (p->on_cpu)
>> +	 */
>> +	smp_rmb();
> 
> I think the patch is fine... but unless I am totally confused this
> is not specific to __rwsem_down_write_failed(). ttwu() can hang the
> same way if the target simply does
> 
> 	schedule_timeout();
> 	current->state = TASK_INTERRUPTIBLE;

Yes

> 	current->state = TASK_RUNNING;
> 
> And. I am not sure I understand where this MB above comes from.

I think Ben pointed that out earlier and you found it in the documentation as well
at the end of __switch

Balbir Singh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-30 12:19 ` Peter Zijlstra
  2016-08-30 13:04   ` Oleg Nesterov
@ 2016-08-31  3:41   ` Balbir Singh
  2016-08-31  7:28     ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: Balbir Singh @ 2016-08-31  3:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Oleg Nesterov, Benjamin Herrenschmidt, Nicholas Piggin,
	Alexey Kardashevskiy



On 30/08/16 22:19, Peter Zijlstra wrote:
> On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:
>>
>>
>> The origin of the issue I've seen seems to be related to
>> rwsem spin lock stealing. Basically I see the system deadlock'd in the
>> following state
> 
> As Nick says (good to see you're back Nick!), this is unrelated to
> rwsems.
> 
> This is true for pretty much every blocking wait loop out there, they
> all do:
> 
> 	for (;;) {
> 		current->state = UNINTERRUPTIBLE;
> 		smp_mb();
> 		if (cond)
> 			break;
> 		schedule();
> 	}
> 	current->state = RUNNING;
> 
> Which, if the wakeup is spurious, is just the pattern you need.

Yes True! My bad Alexey had seen the same basic pattern, I should have been clearer
in my commit log. Should I resend the patch?

> 
>> +++ b/kernel/sched/core.c
>> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>  	success = 1; /* we're going to change ->state */
>>  	cpu = task_cpu(p);
>>  
>> +	/*
>> +	 * Ensure we see on_rq and p_state consistently
>> +	 *
>> +	 * For example in __rwsem_down_write_failed(), we have
>> +	 *    [S] ->on_rq = 1				[L] ->state
>> +	 *    MB					 RMB
> 
> There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
> to PPC, is _not_ MB. It is however sufficient for this case.
> 

The MB comes from the __switch_to() in schedule(). Ben mentioned it in a 
different thread.

>> +	 *    [S] ->state = TASK_UNINTERRUPTIBLE	[L] ->on_rq
>> +	 * In the absence of the RMB p->on_rq can be observed to be 0
>> +	 * and we end up spinning indefinitely in while (p->on_cpu)
>> +	 */
> 
> 
> 	/*
> 	 * Ensure we load p->on_rq _after_ p->state, otherwise it would
> 	 * be possible to, falsely, observe p->on_rq == 0 and get stuck
> 	 * in smp_cond_load_acquire() below.
> 	 *
> 	 * sched_ttwu_pending()			try_to_wake_up()
> 	 *   [S] p->on_rq = 1;			[L] P->state
> 	 *       UNLOCK rq->lock
> 	 *
> 	 * schedule()				RMB
> 	 *       LOCK rq->lock
> 	 *       UNLOCK rq->lock
> 	 *
> 	 * [task p]
> 	 *   [S] p->state = UNINTERRUPTIBLE	[L] p->on_rq
> 	 *
> 	 * Pairs with the UNLOCK+LOCK on rq->lock from the
> 	 * last wakeup of our task and the schedule that got our task
> 	 * current.
> 	 */
> 
>> +	smp_rmb();
>>  	if (p->on_rq && ttwu_remote(p, wake_flags))
>>  		goto stat;
>>  
> 
> 
> Now, this has been present for a fair while, I suspect ever since we
> reworked the wakeup path to not use rq->lock twice. Curious you only now
> hit it.
> 

Yes, I just hit it a a week or two back and I needed to collect data to
explain why p->on_rq got to 0. Hitting it requires extreme stress -- for me
I needed a system with large threads and less memory running stress-ng.
Reproducing the problem takes an unpredictable amount of time.

Balbir Singh.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-30 21:28           ` Benjamin Herrenschmidt
@ 2016-08-31  7:18             ` Peter Zijlstra
  2016-08-31 10:56               ` Benjamin Herrenschmidt
  2016-08-31 13:31             ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-08-31  7:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Oleg Nesterov, Balbir Singh, LKML, Nicholas Piggin

On Wed, Aug 31, 2016 at 07:28:18AM +1000, Benjamin Herrenschmidt wrote:
> It's always been a requirement that if you actually context switch a
> full mb() is implied ...

> On powerpc we have a sync deep in _switch to achieve that.

OK, fair enough. I must've missed it in the x86 switch_to, must be one
of those implied serializing instructions I'm not too familiar with.

> (though that isn't the case if you don't actually
> switch, ie, you are back to RUNNING before you even hit schedule).

Right, which invalidates the claim that schedule() implies a full mb,

> This is necessary so that a process who wakes up on a different CPU sees
> all of its own load/stores.

Don't actually think its needed for that, see the comment from
8643cda549ca4, the scheduler has enough barriers to guarantee
Program-Order for tasks without that.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-30 21:25     ` Benjamin Herrenschmidt
@ 2016-08-31  7:20       ` Peter Zijlstra
  2016-08-31 10:55         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-08-31  7:20 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Oleg Nesterov, Balbir Singh, LKML, Nicholas Piggin

On Wed, Aug 31, 2016 at 07:25:01AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2016-08-30 at 15:04 +0200, Oleg Nesterov wrote:
> > 
> > Confused... how this connects to UNLOCK+LOCK on rq->lock? A LOAD can
> > leak into the critical section.
> > 
> > But context switch should imply mb() we can rely on?
> 
> Between setting of ->on_rq and returning to the task so it can
> change its state back to [UN]INTERRUPTIBLE, there will be at least one
> write barrier (spin unlock of the rq),

spin-unlock is _not_ a write barrier, its a RELEASE barrier, and is not
sufficient for this.

> possibly even a full barrier
> (context switch). The write barrier is enough so I didn't dig to make
> sure we always context switch in the scenario we're looking at but I
> think we do.

There is enough, you just need to pair the RELEASE with an ACQUIRE to
get a full load-store barrier.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-31  3:41   ` Balbir Singh
@ 2016-08-31  7:28     ` Peter Zijlstra
  2016-08-31 10:17       ` Balbir Singh
                         ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-08-31  7:28 UTC (permalink / raw)
  To: Balbir Singh
  Cc: LKML, Oleg Nesterov, Benjamin Herrenschmidt, Nicholas Piggin,
	Alexey Kardashevskiy

On Wed, Aug 31, 2016 at 01:41:33PM +1000, Balbir Singh wrote:
> On 30/08/16 22:19, Peter Zijlstra wrote:
> > On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:
> >>
> >>
> >> The origin of the issue I've seen seems to be related to
> >> rwsem spin lock stealing. Basically I see the system deadlock'd in the
> >> following state
> > 
> > As Nick says (good to see you're back Nick!), this is unrelated to
> > rwsems.
> > 
> > This is true for pretty much every blocking wait loop out there, they
> > all do:
> > 
> > 	for (;;) {
> > 		current->state = UNINTERRUPTIBLE;
> > 		smp_mb();
> > 		if (cond)
> > 			break;
> > 		schedule();
> > 	}
> > 	current->state = RUNNING;
> > 
> > Which, if the wakeup is spurious, is just the pattern you need.
> 
> Yes True! My bad Alexey had seen the same basic pattern, I should have been clearer
> in my commit log. Should I resend the patch?

Yes please.

> > There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
> > to PPC, is _not_ MB. It is however sufficient for this case.
> > 
> 
> The MB comes from the __switch_to() in schedule(). Ben mentioned it in a 
> different thread.

Right, although even without that, there is sufficient ordering, as the
rq unlock from the wakeup, coupled with the rq lock from the schedule
already form a load-store barrier.

> > Now, this has been present for a fair while, I suspect ever since we
> > reworked the wakeup path to not use rq->lock twice. Curious you only now
> > hit it.
> > 
> 
> Yes, I just hit it a a week or two back and I needed to collect data to
> explain why p->on_rq got to 0. Hitting it requires extreme stress -- for me
> I needed a system with large threads and less memory running stress-ng.
> Reproducing the problem takes an unpredictable amount of time.

What hardware do you see this on, is it shiny new Power8 chips which
have never before seen deep queues or something. Or is it 'regular' old
Power7 like stuff?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-31  7:28     ` Peter Zijlstra
@ 2016-08-31 10:17       ` Balbir Singh
  2016-08-31 10:57       ` Benjamin Herrenschmidt
  2016-09-01  1:48       ` Alexey Kardashevskiy
  2 siblings, 0 replies; 27+ messages in thread
From: Balbir Singh @ 2016-08-31 10:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Oleg Nesterov, Benjamin Herrenschmidt, Nicholas Piggin,
	Alexey Kardashevskiy



On 31/08/16 17:28, Peter Zijlstra wrote:
> On Wed, Aug 31, 2016 at 01:41:33PM +1000, Balbir Singh wrote:
>> On 30/08/16 22:19, Peter Zijlstra wrote:
>>> On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:
>>>>
>>>>
>>>> The origin of the issue I've seen seems to be related to
>>>> rwsem spin lock stealing. Basically I see the system deadlock'd in the
>>>> following state
>>>
>>> As Nick says (good to see you're back Nick!), this is unrelated to
>>> rwsems.
>>>
>>> This is true for pretty much every blocking wait loop out there, they
>>> all do:
>>>
>>> 	for (;;) {
>>> 		current->state = UNINTERRUPTIBLE;
>>> 		smp_mb();
>>> 		if (cond)
>>> 			break;
>>> 		schedule();
>>> 	}
>>> 	current->state = RUNNING;
>>>
>>> Which, if the wakeup is spurious, is just the pattern you need.
>>
>> Yes True! My bad Alexey had seen the same basic pattern, I should have been clearer
>> in my commit log. Should I resend the patch?
> 
> Yes please.
> 

Done, just now. I've tried to generalize the issue, but I've kept the
example

>>> There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
>>> to PPC, is _not_ MB. It is however sufficient for this case.
>>>
>>
>> The MB comes from the __switch_to() in schedule(). Ben mentioned it in a 
>> different thread.
> 
> Right, although even without that, there is sufficient ordering, as the
> rq unlock from the wakeup, coupled with the rq lock from the schedule
> already form a load-store barrier.
> 
>>> Now, this has been present for a fair while, I suspect ever since we
>>> reworked the wakeup path to not use rq->lock twice. Curious you only now
>>> hit it.
>>>
>>
>> Yes, I just hit it a a week or two back and I needed to collect data to
>> explain why p->on_rq got to 0. Hitting it requires extreme stress -- for me
>> I needed a system with large threads and less memory running stress-ng.
>> Reproducing the problem takes an unpredictable amount of time.
> 
> What hardware do you see this on, is it shiny new Power8 chips which
> have never before seen deep queues or something. Or is it 'regular' old
> Power7 like stuff?
> 

I don't think the issue is processor specific, it is probabilistic, but
I've not tested on Power7. I am seeing it on a Power8 system


Balbir Singh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-31  7:20       ` Peter Zijlstra
@ 2016-08-31 10:55         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2016-08-31 10:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Oleg Nesterov, Balbir Singh, LKML, Nicholas Piggin

On Wed, 2016-08-31 at 09:20 +0200, Peter Zijlstra wrote:
> On Wed, Aug 31, 2016 at 07:25:01AM +1000, Benjamin Herrenschmidt wrote:
> > 
> > On Tue, 2016-08-30 at 15:04 +0200, Oleg Nesterov wrote:
> > > 
> > > 
> > > Confused... how this connects to UNLOCK+LOCK on rq->lock? A LOAD can
> > > leak into the critical section.
> > > 
> > > But context switch should imply mb() we can rely on?
> > 
> > Between setting of ->on_rq and returning to the task so it can
> > change its state back to [UN]INTERRUPTIBLE, there will be at least one
> > write barrier (spin unlock of the rq),
> 
> spin-unlock is _not_ a write barrier, its a RELEASE barrier, and is not
> sufficient for this.

Ah yes well it's an lwsync so it's a wmb for us :-) .

> > possibly even a full barrier
> > (context switch). The write barrier is enough so I didn't dig to make
> > sure we always context switch in the scenario we're looking at but I
> > think we do.
> 
> There is enough, you just need to pair the RELEASE with an ACQUIRE to
> get a full load-store barrier.

Right so I *think* there will be at least the release of the rq_lock by
the IPI followed by schedule itself taking and releasing it again, but
I can't vouch for it. As I said, I didn't dig deeper on that side of
things as for us a spin_unlock is a write barrier and for the write
side that concerns me here it's sufficient ;-) It's the read side that
has a problem.

That said you may want to investigate more to make sure there is no way
out of schedule where that spin_unlock is the only thing between
setting on_rq and coming out (which leads to setting the task state).

I suspect there will be at least one more re-aquisition & release of
the rq lock but I may be wrong.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-31  7:18             ` Peter Zijlstra
@ 2016-08-31 10:56               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2016-08-31 10:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Oleg Nesterov, Balbir Singh, LKML, Nicholas Piggin

On Wed, 2016-08-31 at 09:18 +0200, Peter Zijlstra wrote:
> On Wed, Aug 31, 2016 at 07:28:18AM +1000, Benjamin Herrenschmidt
> wrote:
> > 
> > It's always been a requirement that if you actually context switch
> > a
> > full mb() is implied ...
> 
> > 
> > On powerpc we have a sync deep in _switch to achieve that.
> 
> OK, fair enough. I must've missed it in the x86 switch_to, must be
> one
> of those implied serializing instructions I'm not too familiar with.
> 
> > 
> > (though that isn't the case if you don't actually
> > switch, ie, you are back to RUNNING before you even hit schedule).
> 
> Right, which invalidates the claim that schedule() implies a full mb,

Right, it's only full mb if you actually schedule to another process :-
)

> > 
> > This is necessary so that a process who wakes up on a different CPU
> > sees
> > all of its own load/stores.
> 
> Don't actually think its needed for that, see the comment from
> 8643cda549ca4, the scheduler has enough barriers to guarantee
> Program-Order for tasks without that.
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-31  7:28     ` Peter Zijlstra
  2016-08-31 10:17       ` Balbir Singh
@ 2016-08-31 10:57       ` Benjamin Herrenschmidt
  2016-09-01  1:48       ` Alexey Kardashevskiy
  2 siblings, 0 replies; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2016-08-31 10:57 UTC (permalink / raw)
  To: Peter Zijlstra, Balbir Singh
  Cc: LKML, Oleg Nesterov, Nicholas Piggin, Alexey Kardashevskiy

On Wed, 2016-08-31 at 09:28 +0200, Peter Zijlstra wrote:
> 
> What hardware do you see this on, is it shiny new Power8 chips which
> have never before seen deep queues or something. Or is it 'regular'
> old Power7 like stuff?

Power8 which isn't *that* new these days...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-30 21:28           ` Benjamin Herrenschmidt
  2016-08-31  7:18             ` Peter Zijlstra
@ 2016-08-31 13:31             ` Peter Zijlstra
  2016-08-31 21:47               ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-08-31 13:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Oleg Nesterov, Balbir Singh, LKML, Nicholas Piggin

On Wed, Aug 31, 2016 at 07:28:18AM +1000, Benjamin Herrenschmidt wrote:

> On powerpc we have a sync deep in _switch to achieve that.

OK, for giggles, could you (or Balbir) check what happens if you take
that sync out?

There should be enough serialization in the generic code to cover the
case that code mentions.

ARM64 has a stronger barrier in its context switch code, but that's
because they need to sync against external agents (like their TLB and
cache) and no amount of generic locking is going to cover that.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-31 13:31             ` Peter Zijlstra
@ 2016-08-31 21:47               ` Benjamin Herrenschmidt
  2016-09-01  6:49                 ` Balbir Singh
  2016-09-01  6:57                 ` Peter Zijlstra
  0 siblings, 2 replies; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2016-08-31 21:47 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Oleg Nesterov, Balbir Singh, LKML, Nicholas Piggin

On Wed, 2016-08-31 at 15:31 +0200, Peter Zijlstra wrote:
> On Wed, Aug 31, 2016 at 07:28:18AM +1000, Benjamin Herrenschmidt
> wrote:
> 
> > 
> > On powerpc we have a sync deep in _switch to achieve that.
> 
> OK, for giggles, could you (or Balbir) check what happens if you take
> that sync out?
> 
> There should be enough serialization in the generic code to cover the
> case that code mentions.
> 
> ARM64 has a stronger barrier in its context switch code, but that's
> because they need to sync against external agents (like their TLB and
> cache) and no amount of generic locking is going to cover that.

The problem is no amount of testing can tell you it works for sure :-)

I would be nervous not having a real full sync in _switch. All we have
along the scheduler path is lwsync's and our isync based load construct
for spin_lock, I'm not sure what other assumptions we have around that
sync in there...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-31  7:28     ` Peter Zijlstra
  2016-08-31 10:17       ` Balbir Singh
  2016-08-31 10:57       ` Benjamin Herrenschmidt
@ 2016-09-01  1:48       ` Alexey Kardashevskiy
  2016-09-01 12:16         ` Alexey Kardashevskiy
  2 siblings, 1 reply; 27+ messages in thread
From: Alexey Kardashevskiy @ 2016-09-01  1:48 UTC (permalink / raw)
  To: Peter Zijlstra, Balbir Singh
  Cc: LKML, Oleg Nesterov, Benjamin Herrenschmidt, Nicholas Piggin

On 31/08/16 17:28, Peter Zijlstra wrote:
> On Wed, Aug 31, 2016 at 01:41:33PM +1000, Balbir Singh wrote:
>> On 30/08/16 22:19, Peter Zijlstra wrote:
>>> On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:
>>>>
>>>>
>>>> The origin of the issue I've seen seems to be related to
>>>> rwsem spin lock stealing. Basically I see the system deadlock'd in the
>>>> following state
>>>
>>> As Nick says (good to see you're back Nick!), this is unrelated to
>>> rwsems.
>>>
>>> This is true for pretty much every blocking wait loop out there, they
>>> all do:
>>>
>>> 	for (;;) {
>>> 		current->state = UNINTERRUPTIBLE;
>>> 		smp_mb();
>>> 		if (cond)
>>> 			break;
>>> 		schedule();
>>> 	}
>>> 	current->state = RUNNING;
>>>
>>> Which, if the wakeup is spurious, is just the pattern you need.
>>
>> Yes True! My bad Alexey had seen the same basic pattern, I should have been clearer
>> in my commit log. Should I resend the patch?
> 
> Yes please.
> 
>>> There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
>>> to PPC, is _not_ MB. It is however sufficient for this case.
>>>
>>
>> The MB comes from the __switch_to() in schedule(). Ben mentioned it in a 
>> different thread.
> 
> Right, although even without that, there is sufficient ordering, as the
> rq unlock from the wakeup, coupled with the rq lock from the schedule
> already form a load-store barrier.
> 
>>> Now, this has been present for a fair while, I suspect ever since we
>>> reworked the wakeup path to not use rq->lock twice. Curious you only now
>>> hit it.
>>>
>>
>> Yes, I just hit it a week or two back and I needed to collect data to
>> explain why p->on_rq got to 0. Hitting it requires extreme stress -- for me
>> I needed a system with a large number of threads and relatively little
>> memory running stress-ng. Reproducing the problem takes an unpredictable
>> amount of time.
> 
> What hardware do you see this on? Is it shiny new Power8 chips which
> have never before seen deep queues or something, or is it 'regular' old
> Power7-like stuff?

I am seeing it on POWER8 with KVM and 2 guests, each having 3 virtio-net
devices with vhost enabled; all virtio-net devices are connected to the
same virtual bridge on the host (via /dev/tap*) and are doing lots of
traffic, just between these 2 guests.

I remember doing the same test on POWER7 more than 2 years ago and finding
missing barriers in virtio but nothing like this one. But POWER7 is
seriously slower than POWER8, so it seems nobody bothered to load it that
heavily.

I wonder how to reproduce the bug more quickly: sometimes it runs for days
with no fault, but sometimes it fails within the first 30 minutes (the
backtraces from the 2 stuck CPUs are the same though). Does anyone have an
idea (kernel hacks, taskset, type of traffic, etc.)? As Nick suggested, I
changed cpus_share_cache() to return "false" so that ttwu_queue() would
always go via the ttwu_queue_remote() path, but this did not make any
difference.
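
For reference, the hack was just this (shown from memory, in
kernel/sched/core.c if I recall correctly; the stock function compares the
LLC domain ids of the two cpus):

        bool cpus_share_cache(int this_cpu, int that_cpu)
        {
                return false;   /* force ttwu_queue() down the ttwu_queue_remote() path */
        }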



-- 
Alexey

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-31 21:47               ` Benjamin Herrenschmidt
@ 2016-09-01  6:49                 ` Balbir Singh
  2016-09-01  6:57                 ` Peter Zijlstra
  1 sibling, 0 replies; 27+ messages in thread
From: Balbir Singh @ 2016-09-01  6:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Peter Zijlstra
  Cc: Oleg Nesterov, LKML, Nicholas Piggin



On 01/09/16 07:47, Benjamin Herrenschmidt wrote:
> On Wed, 2016-08-31 at 15:31 +0200, Peter Zijlstra wrote:
>> On Wed, Aug 31, 2016 at 07:28:18AM +1000, Benjamin Herrenschmidt
>> wrote:
>>
>>>
>>> On powerpc we have a sync deep in _switch to achieve that.
>>
>> OK, for giggles, could you (or Balbir) check what happens if you take
>> that sync out?
>>
>> There should be enough serialization in the generic code to cover the
>> case that code mentions.
>>
>> ARM64 has a stronger barrier in its context switch code, but that's
>> because they need to sync against external agents (like their TLB and
>> cache) and no amount of generic locking is going to cover that.
> 
> The problem is no amount of testing can tell you it works for sure :-)
> 
> I would be nervous not having a real full sync in _switch. All we have
> along the scheduler path is lwsyncs and our isync-based load construct
> for spin_lock; I'm not sure what other assumptions we have around that
> sync in there...
> 

I would agree; I am not sure of the assumptions either.

Balbir Singh.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-08-31 21:47               ` Benjamin Herrenschmidt
  2016-09-01  6:49                 ` Balbir Singh
@ 2016-09-01  6:57                 ` Peter Zijlstra
  2016-09-01 14:17                   ` Boqun Feng
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-09-01  6:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Oleg Nesterov, Balbir Singh, LKML, Nicholas Piggin

On Thu, Sep 01, 2016 at 07:47:10AM +1000, Benjamin Herrenschmidt wrote:

> > OK, for giggles, could you (or Balbir) check what happens if you take
> > that sync out?

> The problem is no amount of testing can tell you it works for sure :-)

It breaking does prove the negative though, so still interesting.

> I would be nervous not having a real full sync in _switch. All we have
> along the scheduler path is lwsyncs and our isync-based load construct
> for spin_lock; I'm not sure what other assumptions we have around that
> sync in there...

Only one way to find out  ;-)

I'm not saying you should commit that change, just curious if (and how
fast) it would come apart.

At the very least we could update the comment that goes with that sync.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-09-01  1:48       ` Alexey Kardashevskiy
@ 2016-09-01 12:16         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 27+ messages in thread
From: Alexey Kardashevskiy @ 2016-09-01 12:16 UTC (permalink / raw)
  To: Peter Zijlstra, Balbir Singh
  Cc: LKML, Oleg Nesterov, Benjamin Herrenschmidt, Nicholas Piggin,
	Michael S. Tsirkin

On 01/09/16 11:48, Alexey Kardashevskiy wrote:
> On 31/08/16 17:28, Peter Zijlstra wrote:
>> On Wed, Aug 31, 2016 at 01:41:33PM +1000, Balbir Singh wrote:
>>> On 30/08/16 22:19, Peter Zijlstra wrote:
>>>> On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:
>>>>>
>>>>>
>>>>> The origin of the issue I've seen seems to be related to
>>>>> rwsem spin lock stealing. Basically I see the system deadlock'd in the
>>>>> following state
>>>>
>>>> As Nick says (good to see you're back Nick!), this is unrelated to
>>>> rwsems.
>>>>
>>>> This is true for pretty much every blocking wait loop out there, they
>>>> all do:
>>>>
>>>> 	for (;;) {
>>>> 		current->state = UNINTERRUPTIBLE;
>>>> 		smp_mb();
>>>> 		if (cond)
>>>> 			break;
>>>> 		schedule();
>>>> 	}
>>>> 	current->state = RUNNING;
>>>>
>>>> Which, if the wakeup is spurious, is just the pattern you need.
>>>
>>> Yes, true! My bad; Alexey had seen the same basic pattern, and I should have
>>> been clearer in my commit log. Should I resend the patch?
>>
>> Yes please.
>>
>>>> There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
>>>> to PPC, is _not_ MB. It is however sufficient for this case.
>>>>
>>>
>>> The MB comes from the __switch_to() in schedule(). Ben mentioned it in a 
>>> different thread.
>>
>> Right, although even without that, there is sufficient ordering, as the
>> rq unlock from the wakeup, coupled with the rq lock from the schedule
>> already form a load-store barrier.
>>
>>>> Now, this has been present for a fair while, I suspect ever since we
>>>> reworked the wakeup path to not use rq->lock twice. Curious you only now
>>>> hit it.
>>>>
>>>
>>> Yes, I just hit it a week or two back and I needed to collect data to
>>> explain why p->on_rq got to 0. Hitting it requires extreme stress -- for me
>>> I needed a system with a large number of threads and relatively little
>>> memory running stress-ng. Reproducing the problem takes an unpredictable
>>> amount of time.
>>
>> What hardware do you see this on? Is it shiny new Power8 chips which
>> have never before seen deep queues or something, or is it 'regular' old
>> Power7-like stuff?
> 
> I am seeing it on POWER8 with KVM and 2 guests, each having 3 virtio-net
> devices with vhost enabled; all virtio-net devices are connected to the
> same virtual bridge on the host (via /dev/tap*) and are doing lots of
> traffic, just between these 2 guests.
> 
> I remember doing the same test on POWER7 more than 2 years ago and finding
> missing barriers in virtio but nothing like this one. But POWER7 is
> seriously slower than POWER8, so it seems nobody bothered to load it that
> heavily.
> 
> I wonder how to reproduce the bug more quickly: sometimes it runs for days
> with no fault, but sometimes it fails within the first 30 minutes (the
> backtraces from the 2 stuck CPUs are the same though). Does anyone have an
> idea (kernel hacks, taskset, type of traffic, etc.)? As Nick suggested, I
> changed cpus_share_cache() to return "false" so that ttwu_queue() would
> always go via the ttwu_queue_remote() path, but this did not make any
> difference.


Below are the backtraces of the 2 stuck CPUs (they could not be stopped in
xmon like the others; I got these via the BMC). I am also adding Michael to
cc:, just in case...

0:mon> t 0xc000000fd4ab79b0
[c000000fd4ab79b0] c0000000001330c4 do_raw_spin_lock+0x1f4/0x260
[c000000fd4ab79f0] c000000000a368e8 _raw_spin_lock_irqsave+0x98/0xd0
[c000000fd4ab7a30] c00000000011ecbc remove_wait_queue+0x2c/0x70
[c000000fd4ab7a70] d0000000166e0364 vhost_poll_stop+0x34/0x60 [vhost]
[c000000fd4ab7aa0] d0000000167a08c8 handle_rx+0x168/0x8d0 [vhost_net]
[c000000fd4ab7c80] d0000000166e04e0 vhost_worker+0x120/0x1d0 [vhost]
[c000000fd4ab7d00] c0000000000ebad8 kthread+0x118/0x140
[c000000fd4ab7e30] c0000000000098e8 ret_from_kernel_thread+0x5c/0x74

0:mon> t 0xc000001fff41b5a0
[c000001fff41b5a0] c0000000000fc3ac try_to_wake_up+0x5c/0x6c0
[c000001fff41b630] d0000000166e0fcc vhost_work_queue+0x6c/0x90 [vhost]
[c000001fff41b660] d0000000166e205c vhost_poll_wakeup+0x3c/0x50 [vhost]
[c000001fff41b680] c00000000011e6d4 __wake_up_common+0x84/0xf0
[c000001fff41b6e0] c00000000011ee98 __wake_up_sync_key+0x68/0xa0
[c000001fff41b730] c0000000008defd0 sock_def_readable+0x80/0x180
[c000001fff41b760] d0000000167438ac tun_net_xmit+0x52c/0x5d0 [tun]
[c000001fff41b7b0] c000000000906dd8 dev_hard_start_xmit+0x178/0x450
[c000001fff41b870] c000000000938d6c sch_direct_xmit+0x11c/0x250
[c000001fff41b910] c000000000907840 __dev_queue_xmit+0x560/0x8c0
[c000001fff41b9c0] d000000016553c80 br_dev_queue_push_xmit+0xb0/0x250 [bridge]
[c000001fff41ba00] d000000016553e54 br_forward_finish+0x34/0xe0 [bridge]
[c000001fff41ba80] d000000016554044 __br_forward.isra.0+0x144/0x220 [bridge]
[c000001fff41baf0] d000000016555e78 br_handle_frame_finish+0x148/0x570 [bridge]
[c000001fff41bb80] d00000001655654c br_handle_frame+0x24c/0x400 [bridge]
[c000001fff41bc10] c0000000008fed28 __netif_receive_skb_core+0x498/0xc60
[c000001fff41bcf0] c000000000904314 process_backlog+0x114/0x220
[c000001fff41bd60] c00000000090343c net_rx_action+0x36c/0x520
[c000001fff41be70] c0000000000c5068 __do_softirq+0x258/0x5b0
[c000001fff41bf90] c000000000028bd0 call_do_softirq+0x14/0x24
[c000000fc675b950] c0000000000155f8 do_softirq_own_stack+0x58/0xa0
[c000000fc675b990] c0000000000c4cb4 do_softirq.part.3+0x74/0xa0
[c000000fc675b9c0] c0000000008fe674 netif_rx_ni+0x1d4/0x230
[c000000fc675ba00] d0000000167415f8 tun_get_user+0x408/0xbc0 [tun]
[c000000fc675baf0] d000000016741e04 tun_sendmsg+0x54/0x90 [tun]
[c000000fc675bb30] d0000000167a1f68 handle_tx+0x298/0x5f0 [vhost_net]
[c000000fc675bc80] d0000000166e04e0 vhost_worker+0x120/0x1d0 [vhost]
[c000000fc675bd00] c0000000000ebad8 kthread+0x118/0x140
[c000000fc675be30] c0000000000098e8 ret_from_kernel_thread+0x5c/0x74
0:mon>




-- 
Alexey

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-09-01  6:57                 ` Peter Zijlstra
@ 2016-09-01 14:17                   ` Boqun Feng
  2016-09-01 15:33                     ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Boqun Feng @ 2016-09-01 14:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, Oleg Nesterov, Balbir Singh, LKML,
	Nicholas Piggin

On Thu, Sep 01, 2016 at 08:57:38AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 01, 2016 at 07:47:10AM +1000, Benjamin Herrenschmidt wrote:
> 
> > > OK, for giggles, could you (or Balbir) check what happens if you take
> > > that sync out?
> 
> > The problem is no amount of testing can tell you it works for sure :-)
> 
> It breaking does prove the negative though, so still interesting.
> 
> > I would be nervous not having a real full sync in _switch. All we have
> > along the scheduler path is lwsyncs and our isync-based load construct
> > for spin_lock; I'm not sure what other assumptions we have around that
> > sync in there...
> 
> Only one way to find out  ;-)
> 
> I'm not saying you should commit that change, just curious if (and how
> fast) it would come apart.
> 
> At the very least we could update the comment that goes with that sync.

Could there be some code that relies on the full barrier semantics of
schedule() to provide transitivity?

IIUC, the Program Order Guarantee you stated before try_to_wake_up()
only has ordering effect on wakers and wakees.
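
To make the question concrete, a made-up example of what relying on
transitivity would look like (X, Y, r1, r2 and t1 are illustrative names,
not an in-tree user I know of):

        int X, Y, r1, r2;
        struct task_struct *t1;

        /* CPU0, the waker */
        X = 1;
        wake_up_process(t1);

        /* CPU1 == t1, once it has actually been woken */
        r1 = READ_ONCE(X);              /* 1, by the waker/wakee pairing */
        WRITE_ONCE(Y, 1);

        /* CPU2, a third party */
        if (READ_ONCE(Y) == 1)
                r2 = READ_ONCE(X);      /* 1 only if the wakeup ordering is
                                         * transitive; the pairwise guarantee
                                         * alone does not promise it */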

Regards,
Boqun

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC][PATCH] Fix a race between rwsem and the scheduler
  2016-09-01 14:17                   ` Boqun Feng
@ 2016-09-01 15:33                     ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-09-01 15:33 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Benjamin Herrenschmidt, Oleg Nesterov, Balbir Singh, LKML,
	Nicholas Piggin

On Thu, Sep 01, 2016 at 10:17:57PM +0800, Boqun Feng wrote:
> On Thu, Sep 01, 2016 at 08:57:38AM +0200, Peter Zijlstra wrote:
> Could there be some code that relies on the full barrier semantics of
> schedule() to provide transitivity?

Could, sure, who knows. RCU might, although Paul typically sticks in
smp_mb just to be safe.

The kernel coming apart when you remove that sync would prove this
fairly quickly though.

Like Ben said, it not coming apart doesn't prove anything. It coming
apart does however prove something, namely that it is required :-)

> IIUC, the Program Order Guarantee you stated before try_to_wake_up()
> only has ordering effect on wakers and wakees.

Right, the scheduler only does RCpc guarantees.
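
Which, taking the made-up example from your mail, means code that wants a
third CPU to be able to rely on the chain has to supply the stronger
ordering itself; roughly (illustrative only, not a recommendation for any
particular call site):

        /* CPU1, in the woken task */
        r1 = READ_ONCE(X);      /* ordered against the waker's store by the RCpc pairing */
        smp_mb();               /* upgrade: make the chain visible to third parties */
        WRITE_ONCE(Y, 1);

        /* CPU2 */
        if (READ_ONCE(Y) == 1) {
                smp_rmb();
                r2 = READ_ONCE(X);      /* should now be guaranteed to observe 1;
                                         * a general barrier sits in the middle of
                                         * the chain, cf. the transitivity section
                                         * of memory-barriers.txt */
        }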

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2016-09-01 15:33 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-30  8:49 [RFC][PATCH] Fix a race between rwsem and the scheduler Balbir Singh
2016-08-30  9:13 ` Nicholas Piggin
2016-08-30 12:19 ` Peter Zijlstra
2016-08-30 13:04   ` Oleg Nesterov
2016-08-30 14:13     ` Peter Zijlstra
2016-08-30 16:57       ` Oleg Nesterov
2016-08-30 18:34         ` Peter Zijlstra
2016-08-30 21:28           ` Benjamin Herrenschmidt
2016-08-31  7:18             ` Peter Zijlstra
2016-08-31 10:56               ` Benjamin Herrenschmidt
2016-08-31 13:31             ` Peter Zijlstra
2016-08-31 21:47               ` Benjamin Herrenschmidt
2016-09-01  6:49                 ` Balbir Singh
2016-09-01  6:57                 ` Peter Zijlstra
2016-09-01 14:17                   ` Boqun Feng
2016-09-01 15:33                     ` Peter Zijlstra
2016-08-30 21:25     ` Benjamin Herrenschmidt
2016-08-31  7:20       ` Peter Zijlstra
2016-08-31 10:55         ` Benjamin Herrenschmidt
2016-08-31  3:41   ` Balbir Singh
2016-08-31  7:28     ` Peter Zijlstra
2016-08-31 10:17       ` Balbir Singh
2016-08-31 10:57       ` Benjamin Herrenschmidt
2016-09-01  1:48       ` Alexey Kardashevskiy
2016-09-01 12:16         ` Alexey Kardashevskiy
2016-08-30 12:58 ` Oleg Nesterov
2016-08-31  3:25   ` Balbir Singh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).