[RFC][PATCH] Fix a race between rwsem and the scheduler

* [RFC][PATCH] Fix a race between rwsem and the scheduler
@ 2016-08-30  8:49 Balbir Singh
  2016-08-30  9:13 ` Nicholas Piggin
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Balbir Singh @ 2016-08-30  8:49 UTC (permalink / raw)
  To: LKML, Peter Zijlstra
  Cc: Oleg Nesterov, Benjamin Herrenschmidt, Nicholas Piggin



The origin of the issue I've seen seems to be related to
rwsem spin lock stealing. Basically I see the system deadlock'd in the
following state

I have a system with multiple threads and

Most of the threads are stuck doing

[67272.593915] --- interrupt: e81 at _raw_spin_lock_irqsave+0xa4/0x130
[67272.593915]     LR = _raw_spin_lock_irqsave+0x9c/0x130
[67272.700996] [c000000012857ae0] [c00000000012453c] rwsem_wake+0xcc/0x110
[67272.749283] [c000000012857b20] [c0000000001215d8] up_write+0x78/0x90
[67272.788965] [c000000012857b50] [c00000000028153c] unlink_anon_vmas+0x15c/0x2c0
[67272.798782] [c000000012857bc0] [c00000000026f5c0] free_pgtables+0xf0/0x1c0
[67272.842528] [c000000012857c10] [c00000000027c9a0] exit_mmap+0x100/0x1a0
[67272.872947] [c000000012857cd0] [c0000000000b4a98] mmput+0xa8/0x1b0
[67272.898432] [c000000012857d00] [c0000000000bc50c] do_exit+0x33c/0xc30
[67272.944721] [c000000012857dc0] [c0000000000bcee4] do_group_exit+0x64/0x100
[67272.969014] [c000000012857e00] [c0000000000bcfac] SyS_exit_group+0x2c/0x30
[67272.978971] [c000000012857e30] [c000000000009204] system_call+0x38/0xb4
[67272.999016] Instruction dump:

They are spinning on the sem->wait_lock, the holder of sem->wait_lock has
irq's disabled and is doing

[c00000037930fb30] c0000000000f724c try_to_wake_up+0x6c/0x570
[c00000037930fbb0] c000000000124328 __rwsem_do_wake+0x1f8/0x260
[c00000037930fc00] c0000000001244b4 rwsem_wake+0x84/0x110
[c00000037930fc40] c000000000121598 up_write+0x78/0x90
[c00000037930fc70] c000000000281a54 anon_vma_fork+0x184/0x1d0
[c00000037930fcc0] c0000000000b68e0 copy_process.isra.5+0x14c0/0x1870
[c00000037930fda0] c0000000000b6e68 _do_fork+0xa8/0x4b0
[c00000037930fe30] c000000000009460 ppc_clone+0x8/0xc

The offset of try_to_wake_up is actually misleading, it is actually stuck
doing the following in try_to_wake_up

while (p->on_cpu)
	cpu_relax();

Analysis

The issue is triggered, due to the following race

CPU1					CPU2

while () {
  if (cond)
    break;
  do {
    schedule();
    set_current_state(TASK_UN..)
  } while (!cond);
					rwsem_wake()
					  spin_lock_irqsave(wait_lock)
  raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
}					  try_to_wake_up()
set_current_state(TASK_RUNNING);	  ..
list_del(&waiter.list);

CPU2 wakes up CPU1, but before it can get the wait_lock and set
current state to TASK_RUNNING the following occurs

CPU3
(stole the rwsem before waiter can be woken up from queue)
up_write()
rwsem_wake()
raw_spin_lock_irqsave(wait_lock)
if (!list_empty)
  wake_up_process()
  try_to_wake_up()
  raw_spin_lock_irqsave(p->pi_lock)
  ..
  if (p->on_rq && ttwu_wakeup())
  ..
  while (p->on_cpu)
    cpu_relax()
  ..

CPU3 tries to wake up the task on CPU1 again since it finds
it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
after CPU2, CPU3 got it.

CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly since p->on_rq
is checked under pi_lock, I've noticed that try_to_wake_up() finds
p->on_rq to be 0. This was the most confusing bit of the analysis,
but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
check is not reliable without this fix IMHO. The race is visible
(based on the analysis) only when ttwu_queue() does a remote wakeup
via ttwu_queue_remote. In which case the p->on_rq change is not
done uder the pi_lock.

The result is that after a while the entire system locks up on
the raw_spin_irqlock_save(wait_lock) and the holder spins infintely

Reproduction of the issue

The issue can be reproduced after a long run on my system with 80
threads and having to tweak available memory to very low and running
memory stress-ng mmapfork test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but thats work in progress. I am still testing the
changes on my still in a loop and the tests seem OK thus far.

Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 kernel/sched/core.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2a906f2..582c684 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	success = 1; /* we're going to change ->state */
 	cpu = task_cpu(p);
 
+	/*
+	 * Ensure we see on_rq and p_state consistently
+	 *
+	 * For example in __rwsem_down_write_failed(), we have
+	 *    [S] ->on_rq = 1				[L] ->state
+	 *    MB					 RMB
+	 *    [S] ->state = TASK_UNINTERRUPTIBLE	[L] ->on_rq
+	 * In the absence of the RMB p->on_rq can be observed to be 0
+	 * and we end up spinning indefinitely in while (p->on_cpu)
+	 */
+	smp_rmb();
 	if (p->on_rq && ttwu_remote(p, wake_flags))
 		goto stat;
 
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread