* lockdep trace from posix timers
@ 2012-07-24 20:36 Dave Jones
  2012-07-27 16:20 ` Dave Jones
  2012-08-16 18:07 ` Peter Zijlstra
  0 siblings, 2 replies; 54+ messages in thread
From: Dave Jones @ 2012-07-24 20:36 UTC (permalink / raw)
  To: Linux Kernel; +Cc: Thomas Gleixner

Linus tree as of 5fecc9d8f59e765c2a48379dd7c6f5cf88c7d75a

	Dave

======================================================
[ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
3.5.0+ #122 Not tainted
------------------------------------------------------
trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
blocked:  (tasklist_lock){.+.+..}, instance: ffffffff81c05098, at: [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0

and this task is already holding:
blocked:  (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
which would create a new lock dependency:
 (&(&new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}

but this new dependency connects a HARDIRQ-irq-safe lock:
 (&(&new_timer->it_lock)->rlock){-.-...}
... which became HARDIRQ-irq-safe at:
  [<ffffffff810d8e31>] __lock_acquire+0x7e1/0x1ae0
  [<ffffffff810da83d>] lock_acquire+0xad/0x220
  [<ffffffff8168a15d>] _raw_spin_lock_irqsave+0x7d/0xd0
  [<ffffffff81093c17>] posix_timer_fn+0x37/0xe0
  [<ffffffff8109a534>] __run_hrtimer+0x94/0x4c0
  [<ffffffff8109b157>] hrtimer_interrupt+0xf7/0x230
  [<ffffffff816953a9>] smp_apic_timer_interrupt+0x69/0x99
  [<ffffffff8169402f>] apic_timer_interrupt+0x6f/0x80
  [<ffffffff811644e9>] __generic_file_aio_write+0x239/0x450
  [<ffffffff81164773>] generic_file_aio_write+0x73/0xe0
  [<ffffffff81258832>] ext4_file_write+0xc2/0x280
  [<ffffffff811cfb76>] do_sync_write+0xe6/0x120
  [<ffffffff811d04af>] vfs_write+0xaf/0x190
  [<ffffffff811d07ed>] sys_write+0x4d/0x90
  [<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f

to a HARDIRQ-irq-unsafe lock:
 (&(&p->alloc_lock)->rlock){+.+...}
... which became HARDIRQ-irq-unsafe at:
...  [<ffffffff810d8c37>] __lock_acquire+0x5e7/0x1ae0
  [<ffffffff810da83d>] lock_acquire+0xad/0x220
  [<ffffffff816895f6>] _raw_spin_lock+0x46/0x80
  [<ffffffff811d8882>] set_task_comm+0x32/0x180
  [<ffffffff81095848>] kthreadd+0x38/0x2e0
  [<ffffffff81694934>] kernel_thread_helper+0x4/0x10

other info that might help us debug this:

Chain exists of:
  &(&new_timer->it_lock)->rlock --> tasklist_lock --> &(&p->alloc_lock)->rlock

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&p->alloc_lock)->rlock);
                               local_irq_disable();
                               lock(&(&new_timer->it_lock)->rlock);
                               lock(tasklist_lock);
  <Interrupt>
    lock(&(&new_timer->it_lock)->rlock);

 *** DEADLOCK ***

1 lock on stack by trinity-child2/5327:
 #0: blocked:  (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0

the dependencies between HARDIRQ-irq-safe lock and the holding lock:
-> (&(&new_timer->it_lock)->rlock){-.-...} ops: 584 {
   IN-HARDIRQ-W at:
                    [<ffffffff810d8e31>] __lock_acquire+0x7e1/0x1ae0
                    [<ffffffff810da83d>] lock_acquire+0xad/0x220
                    [<ffffffff8168a15d>] _raw_spin_lock_irqsave+0x7d/0xd0
                    [<ffffffff81093c17>] posix_timer_fn+0x37/0xe0
                    [<ffffffff8109a534>] __run_hrtimer+0x94/0x4c0
                    [<ffffffff8109b157>] hrtimer_interrupt+0xf7/0x230
                    [<ffffffff816953a9>] smp_apic_timer_interrupt+0x69/0x99
                    [<ffffffff8169402f>] apic_timer_interrupt+0x6f/0x80
                    [<ffffffff811644e9>] __generic_file_aio_write+0x239/0x450
                    [<ffffffff81164773>] generic_file_aio_write+0x73/0xe0
                    [<ffffffff81258832>] ext4_file_write+0xc2/0x280
                    [<ffffffff811cfb76>] do_sync_write+0xe6/0x120
                    [<ffffffff811d04af>] vfs_write+0xaf/0x190
                    [<ffffffff811d07ed>] sys_write+0x4d/0x90
                    [<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f
   IN-SOFTIRQ-W at:
                    [<ffffffff810d8bfe>] __lock_acquire+0x5ae/0x1ae0
                    [<ffffffff810da83d>] lock_acquire+0xad/0x220
                    [<ffffffff8168a15d>] _raw_spin_lock_irqsave+0x7d/0xd0
                    [<ffffffff81093c17>] posix_timer_fn+0x37/0xe0
                    [<ffffffff8109a534>] __run_hrtimer+0x94/0x4c0
                    [<ffffffff8109b157>] hrtimer_interrupt+0xf7/0x230
                    [<ffffffff8109b2bb>] __hrtimer_peek_ahead_timers.part.27+0x2b/0x30
                    [<ffffffff8109b309>] hrtimer_peek_ahead_timers+0x49/0xa0
                    [<ffffffff8109b391>] run_hrtimer_softirq+0x31/0x40
                    [<ffffffff810746e0>] __do_softirq+0xf0/0x400
                    [<ffffffff81074b2c>] run_ksoftirqd+0x13c/0x230
                    [<ffffffff81095517>] kthread+0xb7/0xc0
                    [<ffffffff81694934>] kernel_thread_helper+0x4/0x10
   INITIAL USE at:
                   [<ffffffff810d8956>] __lock_acquire+0x306/0x1ae0
                   [<ffffffff810da83d>] lock_acquire+0xad/0x220
                   [<ffffffff8168a15d>] _raw_spin_lock_irqsave+0x7d/0xd0
                   [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
                   [<ffffffff810947a4>] sys_timer_delete+0x44/0x100
                   [<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f
 }
 ... key      at: [<ffffffff820ac8a0>] __key.29841+0x0/0x8
 ... acquired at:
   [<ffffffff810d820b>] check_irq_usage+0x5b/0xe0
   [<ffffffff810d93da>] __lock_acquire+0xd8a/0x1ae0
   [<ffffffff810da83d>] lock_acquire+0xad/0x220
   [<ffffffff81689b59>] _raw_read_lock+0x49/0x80
   [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
   [<ffffffff81094786>] sys_timer_delete+0x26/0x100
   [<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f


the dependencies between the lock to be acquired and HARDIRQ-irq-unsafe lock:
 -> (&(&p->alloc_lock)->rlock){+.+...} ops: 1189942 {
    HARDIRQ-ON-W at:
                      [<ffffffff810d8c37>] __lock_acquire+0x5e7/0x1ae0
                      [<ffffffff810da83d>] lock_acquire+0xad/0x220
                      [<ffffffff816895f6>] _raw_spin_lock+0x46/0x80
                      [<ffffffff811d8882>] set_task_comm+0x32/0x180
                      [<ffffffff81095848>] kthreadd+0x38/0x2e0
                      [<ffffffff81694934>] kernel_thread_helper+0x4/0x10
    SOFTIRQ-ON-W at:
                      [<ffffffff810d8c6c>] __lock_acquire+0x61c/0x1ae0
                      [<ffffffff810da83d>] lock_acquire+0xad/0x220
                      [<ffffffff816895f6>] _raw_spin_lock+0x46/0x80
                      [<ffffffff811d8882>] set_task_comm+0x32/0x180
                      [<ffffffff81095848>] kthreadd+0x38/0x2e0
                      [<ffffffff81694934>] kernel_thread_helper+0x4/0x10
    INITIAL USE at:
                     [<ffffffff810d8956>] __lock_acquire+0x306/0x1ae0
                     [<ffffffff810da83d>] lock_acquire+0xad/0x220
                     [<ffffffff816895f6>] _raw_spin_lock+0x46/0x80
                     [<ffffffff811d8882>] set_task_comm+0x32/0x180
                     [<ffffffff81095848>] kthreadd+0x38/0x2e0
                     [<ffffffff81694934>] kernel_thread_helper+0x4/0x10
  }
  ... key      at: [<ffffffff820883a0>] __key.45832+0x0/0x8
  ... acquired at:
   [<ffffffff810da83d>] lock_acquire+0xad/0x220
   [<ffffffff816895f6>] _raw_spin_lock+0x46/0x80
   [<ffffffff812d5f2e>] keyctl_session_to_parent+0xde/0x490
   [<ffffffff812d634d>] sys_keyctl+0x6d/0x1a0
   [<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f

-> (tasklist_lock){.+.+..} ops: 35423 {
   HARDIRQ-ON-R at:
                    [<ffffffff810d8b1e>] __lock_acquire+0x4ce/0x1ae0
                    [<ffffffff810da83d>] lock_acquire+0xad/0x220
                    [<ffffffff81689b59>] _raw_read_lock+0x49/0x80
                    [<ffffffff810701a2>] do_wait+0xb2/0x360
                    [<ffffffff81072095>] sys_wait4+0x75/0xf0
                    [<ffffffff8108a482>] wait_for_helper+0x82/0xb0
                    [<ffffffff81694934>] kernel_thread_helper+0x4/0x10
   SOFTIRQ-ON-R at:
                    [<ffffffff810d8c6c>] __lock_acquire+0x61c/0x1ae0
                    [<ffffffff810da83d>] lock_acquire+0xad/0x220
                    [<ffffffff81689b59>] _raw_read_lock+0x49/0x80
                    [<ffffffff810701a2>] do_wait+0xb2/0x360
                    [<ffffffff81072095>] sys_wait4+0x75/0xf0
                    [<ffffffff8108a482>] wait_for_helper+0x82/0xb0
                    [<ffffffff81694934>] kernel_thread_helper+0x4/0x10
   INITIAL USE at:
                   [<ffffffff810d8956>] __lock_acquire+0x306/0x1ae0
                   [<ffffffff810da83d>] lock_acquire+0xad/0x220
                   [<ffffffff81689f0c>] _raw_write_lock_irq+0x5c/0xa0
                   [<ffffffff81068a41>] copy_process.part.22+0x1051/0x1750
                   [<ffffffff810692d0>] do_fork+0x140/0x4e0
                   [<ffffffff81024606>] kernel_thread+0x76/0x80
                   [<ffffffff81664602>] rest_init+0x26/0x154
                   [<ffffffff81efecb3>] start_kernel+0x3f8/0x405
                   [<ffffffff81efe356>] x86_64_start_reservations+0x131/0x135
                   [<ffffffff81efe4a2>] x86_64_start_kernel+0x148/0x157
 }
 ... key      at: [<ffffffff81c05098>] tasklist_lock+0x18/0x80
 ... acquired at:
   [<ffffffff810d820b>] check_irq_usage+0x5b/0xe0
   [<ffffffff810d93da>] __lock_acquire+0xd8a/0x1ae0
   [<ffffffff810da83d>] lock_acquire+0xad/0x220
   [<ffffffff81689b59>] _raw_read_lock+0x49/0x80
   [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
   [<ffffffff81094786>] sys_timer_delete+0x26/0x100
   [<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f


stack backtrace:
Pid: 5327, comm: trinity-child2 Not tainted 3.5.0+ #122
Call Trace:
 [<ffffffff810d8194>] check_usage+0x4e4/0x500
 [<ffffffff81023729>] ? native_sched_clock+0x19/0x80
 [<ffffffff810d59a8>] ? trace_hardirqs_off_caller+0x28/0xd0
 [<ffffffff81023729>] ? native_sched_clock+0x19/0x80
 [<ffffffff810d820b>] check_irq_usage+0x5b/0xe0
 [<ffffffff810d93da>] __lock_acquire+0xd8a/0x1ae0
 [<ffffffff810d8956>] ? __lock_acquire+0x306/0x1ae0
 [<ffffffff810d59a8>] ? trace_hardirqs_off_caller+0x28/0xd0
 [<ffffffff810da2a5>] ? lock_release_non_nested+0x175/0x320
 [<ffffffff810da83d>] lock_acquire+0xad/0x220
 [<ffffffff8109762b>] ? posix_cpu_timer_del+0x2b/0xe0
 [<ffffffff81689b59>] _raw_read_lock+0x49/0x80
 [<ffffffff8109762b>] ? posix_cpu_timer_del+0x2b/0xe0
 [<ffffffff81093d95>] ? __lock_timer+0xd5/0x1f0
 [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
 [<ffffffff81094786>] sys_timer_delete+0x26/0x100
 [<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f



* Re: lockdep trace from posix timers
  2012-07-24 20:36 lockdep trace from posix timers Dave Jones
@ 2012-07-27 16:20 ` Dave Jones
  2012-08-16 12:54   ` Ming Lei
  2012-08-16 18:07 ` Peter Zijlstra
  1 sibling, 1 reply; 54+ messages in thread
From: Dave Jones @ 2012-07-27 16:20 UTC (permalink / raw)
  To: Linux Kernel, Thomas Gleixner

On Tue, Jul 24, 2012 at 04:36:13PM -0400, Dave Jones wrote:
 > Linus tree as of 5fecc9d8f59e765c2a48379dd7c6f5cf88c7d75a
 > 
 > 	Dave
 > 
 > ======================================================
 > [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
 > 3.5.0+ #122 Not tainted
 > ------------------------------------------------------
 > trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 > blocked:  (tasklist_lock){.+.+..}, instance: ffffffff81c05098, at: [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
 > 
 > and this task is already holding:
 > blocked:  (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
 > which would create a new lock dependency:
 >  (&(&new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
 > 
 > but this new dependency connects a HARDIRQ-irq-safe lock:
 >  (&(&new_timer->it_lock)->rlock){-.-...}
 > ... which became HARDIRQ-irq-safe at:

Shall I start bisecting this ? I can trigger it very easily, but if you
can give me a set of commits to narrow down, it'll speed up the bisection.

	Dave


* Re: lockdep trace from posix timers
  2012-07-27 16:20 ` Dave Jones
@ 2012-08-16 12:54   ` Ming Lei
  2012-08-16 14:03     ` Dave Jones
  0 siblings, 1 reply; 54+ messages in thread
From: Ming Lei @ 2012-08-16 12:54 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, Thomas Gleixner

On Sat, Jul 28, 2012 at 12:20 AM, Dave Jones <davej@redhat.com> wrote:
> On Tue, Jul 24, 2012 at 04:36:13PM -0400, Dave Jones wrote:
>  > Linus tree as of 5fecc9d8f59e765c2a48379dd7c6f5cf88c7d75a
>  >
>  >      Dave
>  >
>  > ======================================================
>  > [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
>  > 3.5.0+ #122 Not tainted
>  > ------------------------------------------------------
>  > trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
>  > blocked:  (tasklist_lock){.+.+..}, instance: ffffffff81c05098, at: [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
>  >
>  > and this task is already holding:
>  > blocked:  (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
>  > which would create a new lock dependency:
>  >  (&(&new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
>  >
>  > but this new dependency connects a HARDIRQ-irq-safe lock:
>  >  (&(&new_timer->it_lock)->rlock){-.-...}
>  > ... which became HARDIRQ-irq-safe at:
>
> Shall I start bisecting this ? I can trigger it very easily, but if you
> can give me a set of commits to narrow down, it'll speed up the bisection.

It should be a real possible deadlock. Could you test the patch below to
see if it fixes the warning?

--
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 125cb67..29f6a8e 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -420,7 +420,7 @@ static int posix_cpu_timer_del(struct k_itimer *timer)
 	int ret = 0;

 	if (likely(p != NULL)) {
-		read_lock(&tasklist_lock);
+		/* tasklist_lock held already in timer_delete */
 		if (unlikely(p->sighand == NULL)) {
 			/*
 			 * We raced with the reaping of the task.
@@ -435,7 +435,6 @@ static int posix_cpu_timer_del(struct k_itimer *timer)
 				list_del(&timer->it.cpu.entry);
 			spin_unlock(&p->sighand->siglock);
 		}
-		read_unlock(&tasklist_lock);

 		if (!ret)
 			put_task_struct(p);
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 69185ae..222d24c 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -884,15 +884,31 @@ SYSCALL_DEFINE1(timer_delete, timer_t, timer_id)
 	struct k_itimer *timer;
 	unsigned long flags;

+	/*
+	 * hold tasklist_lock to protect tsk->sighand which might be
+	 * accessed inside ->timer_del. It should be held before
+	 * timer->it_lock to avoid the below deadlock:
+	 * 	CPU0			CPU1
+	 *	lock(tasklist_lock)
+	 *				timer_delete()
+	 *					lock(timer->it_lock)
+	 *					lock(tasklist_lock)
+	 *	<timer interrupt>
+	 *		lock(timer->it_lock)
+	 */
+	read_lock(&tasklist_lock);
 retry_delete:
 	timer = lock_timer(timer_id, &flags);
-	if (!timer)
+	if (!timer) {
+		read_unlock(&tasklist_lock);
 		return -EINVAL;
+	}

 	if (timer_delete_hook(timer) == TIMER_RETRY) {
 		unlock_timer(timer, flags);
 		goto retry_delete;
 	}
+	read_unlock(&tasklist_lock);

 	spin_lock(&current->sighand->siglock);
 	list_del(&timer->list);


Thanks,
-- 
Ming Lei


* Re: lockdep trace from posix timers
  2012-08-16 12:54   ` Ming Lei
@ 2012-08-16 14:03     ` Dave Jones
  0 siblings, 0 replies; 54+ messages in thread
From: Dave Jones @ 2012-08-16 14:03 UTC (permalink / raw)
  To: Ming Lei; +Cc: Linux Kernel, Thomas Gleixner

On Thu, Aug 16, 2012 at 08:54:31PM +0800, Ming Lei wrote:
 > On Sat, Jul 28, 2012 at 12:20 AM, Dave Jones <davej@redhat.com> wrote:
 > > On Tue, Jul 24, 2012 at 04:36:13PM -0400, Dave Jones wrote:
 > >  > Linus tree as of 5fecc9d8f59e765c2a48379dd7c6f5cf88c7d75a
 > >  >
 > >  >      Dave
 > >  >
 > >  > ======================================================
 > >  > [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
 > >  > 3.5.0+ #122 Not tainted
 > >  > ------------------------------------------------------
 > >  > trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 > >  > blocked:  (tasklist_lock){.+.+..}, instance: ffffffff81c05098, at: [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
 > >  >
 > >  > and this task is already holding:
 > >  > blocked:  (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
 > >  > which would create a new lock dependency:
 > >  >  (&(&new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
 > >  >
 > >  > but this new dependency connects a HARDIRQ-irq-safe lock:
 > >  >  (&(&new_timer->it_lock)->rlock){-.-...}
 > >  > ... which became HARDIRQ-irq-safe at:
 > >
 > > Shall I start bisecting this ? I can trigger it very easily, but if you
 > > can give me a set of commits to narrow down, it'll speed up the bisection.
 > 
 > It should be a real possible deadlock. Could you test the patch below to
 > see if it fixes the warning?

I've not managed to hit it in a while. It seems very dependent upon
specific builds for some reason.  Very strange.

	Dave 


* Re: lockdep trace from posix timers
  2012-07-24 20:36 lockdep trace from posix timers Dave Jones
  2012-07-27 16:20 ` Dave Jones
@ 2012-08-16 18:07 ` Peter Zijlstra
  2012-08-17 15:14   ` Oleg Nesterov
  1 sibling, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-16 18:07 UTC (permalink / raw)
  To: Dave Jones
  Cc: Linux Kernel, Thomas Gleixner, rostedt, dhowells, Oleg Nesterov, Al Viro

On Tue, 2012-07-24 at 16:36 -0400, Dave Jones wrote:

> ======================================================
> [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> 3.5.0+ #122 Not tainted
> ------------------------------------------------------
> trinity-child2/5327 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> blocked:  (tasklist_lock){.+.+..}, instance: ffffffff81c05098, at: [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
> 
> and this task is already holding:
> blocked:  (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0
> which would create a new lock dependency:
>  (&(&new_timer->it_lock)->rlock){-.-...} -> (tasklist_lock){.+.+..}
> 
> but this new dependency connects a HARDIRQ-irq-safe lock:

> to a HARDIRQ-irq-unsafe lock:
>  (&(&p->alloc_lock)->rlock){+.+...}

> other info that might help us debug this:
> 
> Chain exists of:
>   &(&new_timer->it_lock)->rlock --> tasklist_lock --> &(&p->alloc_lock)->rlock
> 
>  Possible interrupt unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&(&p->alloc_lock)->rlock);
>                                local_irq_disable();
>                                lock(&(&new_timer->it_lock)->rlock);
>                                lock(tasklist_lock);
>   <Interrupt>
>     lock(&(&new_timer->it_lock)->rlock);
> 
>  *** DEADLOCK ***
> 
> 1 lock on stack by trinity-child2/5327:
>  #0: blocked:  (&(&new_timer->it_lock)->rlock){-.-...}, instance: ffff880143bce170, at: [<ffffffff81093d49>] __lock_timer+0x89/0x1f0


> the dependencies between the lock to be acquired and HARDIRQ-irq-unsafe lock:

>    [<ffffffff810da83d>] lock_acquire+0xad/0x220
>    [<ffffffff816895f6>] _raw_spin_lock+0x46/0x80
>    [<ffffffff812d5f2e>] keyctl_session_to_parent+0xde/0x490
>    [<ffffffff812d634d>] sys_keyctl+0x6d/0x1a0
>    [<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f

> stack backtrace:
> Pid: 5327, comm: trinity-child2 Not tainted 3.5.0+ #122
> Call Trace:
>  [<ffffffff810d8194>] check_usage+0x4e4/0x500
>  [<ffffffff81023729>] ? native_sched_clock+0x19/0x80
>  [<ffffffff810d59a8>] ? trace_hardirqs_off_caller+0x28/0xd0
>  [<ffffffff81023729>] ? native_sched_clock+0x19/0x80
>  [<ffffffff810d820b>] check_irq_usage+0x5b/0xe0
>  [<ffffffff810d93da>] __lock_acquire+0xd8a/0x1ae0
>  [<ffffffff810d8956>] ? __lock_acquire+0x306/0x1ae0
>  [<ffffffff810d59a8>] ? trace_hardirqs_off_caller+0x28/0xd0
>  [<ffffffff810da2a5>] ? lock_release_non_nested+0x175/0x320
>  [<ffffffff810da83d>] lock_acquire+0xad/0x220
>  [<ffffffff8109762b>] ? posix_cpu_timer_del+0x2b/0xe0
>  [<ffffffff81689b59>] _raw_read_lock+0x49/0x80
>  [<ffffffff8109762b>] ? posix_cpu_timer_del+0x2b/0xe0
>  [<ffffffff81093d95>] ? __lock_timer+0xd5/0x1f0
>  [<ffffffff8109762b>] posix_cpu_timer_del+0x2b/0xe0
>  [<ffffffff81094786>] sys_timer_delete+0x26/0x100
>  [<ffffffff8169336d>] system_call_fastpath+0x1a/0x1f


So we have:


 sys_keyctl()
   keyctl_session_to_parent()
     write_lock_irq(&tasklist_lock)
     task_lock(parent)		parent->alloc_lock

VS

  sys_timer_delete()
    lock_timer()		timer->it_lock
    posix_cpu_timer_del()
      read_lock(&tasklist_lock)


Creating:

  timer->it_lock -> tasklist_lock -> task->alloc_lock

And since it_lock is IRQ-safe and alloc_lock isn't, you've got the IRQ
inversion deadlock reported.

The task_lock() in keyctl_session_to_parent() comes from Al who didn't
think it necessary to write a changelog in d35abdb2.

David, Al, anybody want to have a go at fixing this?


* Re: lockdep trace from posix timers
  2012-08-16 18:07 ` Peter Zijlstra
@ 2012-08-17 15:14   ` Oleg Nesterov
  2012-08-17 15:17     ` Oleg Nesterov
  2012-08-20  7:15     ` lockdep trace from posix timers Peter Zijlstra
  0 siblings, 2 replies; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-17 15:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/16, Peter Zijlstra wrote:
>
>      write_lock_irq(&tasklist_lock)
>      task_lock(parent)		parent->alloc_lock

And this is already wrong. See the comment above task_lock().
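
For reference, the relevant part of that comment (include/linux/sched.h)
reads, roughly:

	/*
	 * ...
	 * Nests both inside and outside of read_lock(&tasklist_lock).
	 * It must not be nested with write_lock_irq(&tasklist_lock),
	 * neither inside nor outside.
	 */
	static inline void task_lock(struct task_struct *p)
	{
		spin_lock(&p->alloc_lock);
	}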

> And since it_lock is IRQ-safe and alloc_lock isn't, you've got the IRQ
> inversion deadlock reported.

Yes. Or, IOW, write_lock(tasklist) is IRQ-safe and thus it can't nest
with alloc_lock.

> David, Al, anybody want to have a go at fixing this?

I still think that task_work_add() should synchronize with exit_task_work()
itself and fail if necessary. But I wasn't able to convince Al ;)

Oleg.



* Re: lockdep trace from posix timers
  2012-08-17 15:14   ` Oleg Nesterov
@ 2012-08-17 15:17     ` Oleg Nesterov
  2012-08-17 16:40       ` task_work_add() should not succeed unconditionally (Was: lockdep trace from posix timers) Oleg Nesterov
  2012-08-20  7:15     ` lockdep trace from posix timers Peter Zijlstra
  1 sibling, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-17 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/17, Oleg Nesterov wrote:
>
> On 08/16, Peter Zijlstra wrote:
> >
> >      write_lock_irq(&tasklist_lock)
> >      task_lock(parent)		parent->alloc_lock
>
> And this is already wrong. See the comment above task_lock().
>
> > And since it_lock is IRQ-safe and alloc_lock isn't, you've got the IRQ
> > inversion deadlock reported.
>
> Yes. Or, IOW, write_lock(tasklist) is IRQ-safe and thus it can't nest
> with alloc_lock.
>
> > David, Al, anybody want to have a go at fixing this?
>
> I still think that task_work_add() should synchronize with exit_task_work()
> itself and fail if necessary. But I wasn't able to convince Al ;)

And this is my old patch: http://marc.info/?l=linux-kernel&m=134082268721700
It should be re-diffed of course.

Oleg.



* task_work_add() should not succeed unconditionally (Was: lockdep trace from posix timers)
  2012-08-17 15:17     ` Oleg Nesterov
@ 2012-08-17 16:40       ` Oleg Nesterov
  0 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-17 16:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/17, Oleg Nesterov wrote:
>
> On 08/17, Oleg Nesterov wrote:
> >
> > On 08/16, Peter Zijlstra wrote:
> > >
> > >      write_lock_irq(&tasklist_lock)
> > >      task_lock(parent)		parent->alloc_lock
> >
> > And this is already wrong. See the comment above task_lock().
> >
> > > And since it_lock is IRQ-safe and alloc_lock isn't, you've got the IRQ
> > > inversion deadlock reported.
> >
> > Yes. Or, IOW, write_lock(tasklist) is IRQ-safe and thus it can't nest
> > with alloc_lock.
> >
> > > David, Al, anybody want to have a go at fixing this?
> >
> > I still think that task_work_add() should synchronize with exit_task_work()
> > itself and fail if necessary. But I wasn't able to convince Al ;)
>
> And this is my old patch: http://marc.info/?l=linux-kernel&m=134082268721700
> It should be re-diffed of course.

Something like below. Uncompiled/untested, I need to re-check and test.
Now we can remove that task_lock() and rely on task_work_add().

Al, what do you think?

Oleg.

--- x/include/linux/task_work.h
+++ x/include/linux/task_work.h
@@ -18,8 +18,7 @@ void task_work_run(void);
 
 static inline void exit_task_work(struct task_struct *task)
 {
-	if (unlikely(task->task_works))
-		task_work_run();
+	task_work_run();
 }
 
 #endif	/* _LINUX_TASK_WORK_H */
--- x/kernel/task_work.c
+++ x/kernel/task_work.c
@@ -2,29 +2,35 @@
 #include <linux/task_work.h>
 #include <linux/tracehook.h>
 
+#define TWORK_EXITED	((struct callback_head *)1)
+
 int
 task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
 {
 	struct callback_head *last, *first;
 	unsigned long flags;
+	int err = -ESRCH;
 
 	/*
-	 * Not inserting the new work if the task has already passed
-	 * exit_task_work() is the responisbility of callers.
+	 * We must not insert the new work if the exiting task has already
+	 * passed task_work_run().
 	 */
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	last = task->task_works;
-	first = last ? last->next : twork;
-	twork->next = first;
-	if (last)
-		last->next = twork;
-	task->task_works = twork;
+	if (likely(task->task_works != TWORK_EXITED)) {
+		last = task->task_works;
+		first = last ? last->next : twork;
+		twork->next = first;
+		if (last)
+			last->next = twork;
+		task->task_works = twork;
+		err = 0;
+	}
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 
 	/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
-	if (notify)
+	if (!err && notify)
 		set_notify_resume(task);
-	return 0;
+	return err;
 }
 
 struct callback_head *
@@ -35,7 +41,7 @@ task_work_cancel(struct task_struct *tas
 
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 	last = task->task_works;
-	if (last) {
+	if (last && last != TWORK_EXITED) {
 		struct callback_head *q = last, *p = q->next;
 		while (1) {
 			if (p->func == func) {
@@ -63,7 +69,12 @@ void task_work_run(void)
 	while (1) {
 		raw_spin_lock_irq(&task->pi_lock);
 		p = task->task_works;
-		task->task_works = NULL;
+		/*
+		 * twork->func() can do task_work_add(), do not
+		 * set TWORK_EXITED until the list becomes empty.
+		 */
+		task->task_works = (!p && (task->flags & PF_EXITING))
+					? TWORK_EXITED : NULL;
 		raw_spin_unlock_irq(&task->pi_lock);
 
 		if (unlikely(!p))
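
And then a caller can simply check the return value instead of pinning the
target with task_lock(). Untested sketch, the function and callback names
below are made up:

	static int queue_work_on_task(struct task_struct *task,
				      task_work_func_t func)
	{
		struct callback_head *work;

		work = kmalloc(sizeof(*work), GFP_KERNEL);
		if (!work)
			return -ENOMEM;

		init_task_work(work, func);
		if (task_work_add(task, work, true)) {
			/* target already passed task_work_run(), it is exiting */
			kfree(work);
			return -ESRCH;
		}

		return 0;
	}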



* Re: lockdep trace from posix timers
  2012-08-17 15:14   ` Oleg Nesterov
  2012-08-17 15:17     ` Oleg Nesterov
@ 2012-08-20  7:15     ` Peter Zijlstra
  2012-08-20 11:44       ` Peter Zijlstra
                         ` (2 more replies)
  1 sibling, 3 replies; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-20  7:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On Fri, 2012-08-17 at 17:14 +0200, Oleg Nesterov wrote:
> I still think that task_work_add() should synchronize with exit_task_work()
> itself and fail if necessary. But I wasn't able to convince Al ;)

I'm not at all sure how that relates to needing task_lock() in the
keyctl stuff.

Also, can't task_work use llist stuff? That would also avoid using
->pi_lock.
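
Something along these lines (hand-waving sketch; it assumes ->task_works is
turned into an llist_head and callback_head grows an llist_node, neither of
which is true today):

	#include <linux/llist.h>

	/* producer: lock-free LIFO push, no ->pi_lock */
	static void task_work_push(struct task_struct *task,
				   struct llist_node *node)
	{
		llist_add(node, &task->task_works_list);
	}

	/* consumer: steal the whole list in one go */
	static struct llist_node *task_work_grab(struct task_struct *task)
	{
		return llist_del_all(&task->task_works_list);
	}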


* Re: lockdep trace from posix timers
  2012-08-20  7:15     ` lockdep trace from posix timers Peter Zijlstra
@ 2012-08-20 11:44       ` Peter Zijlstra
  2012-08-20 11:46         ` Peter Zijlstra
                           ` (3 more replies)
  2012-08-20 14:55       ` Oleg Nesterov
  2012-08-20 15:43       ` Oleg Nesterov
  2 siblings, 4 replies; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-20 11:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On Mon, 2012-08-20 at 09:15 +0200, Peter Zijlstra wrote:
> On Fri, 2012-08-17 at 17:14 +0200, Oleg Nesterov wrote:
> > I still think that task_work_add() should synchronize with exit_task_work()
> > itself and fail if necessary. But I wasn't able to convince Al ;)
> 
> I'm not at all sure how that relates to needing task_lock() in the
> keyctl stuff.
> 
> Also, can't task_work use llist stuff? That would also avoid using
> ->pi_lock.

How about something like the below?

---
 include/linux/task_work.h |   7 +--
 kernel/exit.c             |   2 +-
 kernel/task_work.c        | 120 ++++++++++++++++++++++++----------------------
 3 files changed, 65 insertions(+), 64 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index fb46b03..f365416 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -15,11 +15,6 @@ init_task_work(struct callback_head *twork, task_work_func_t func)
 int task_work_add(struct task_struct *task, struct callback_head *twork, bool);
 struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
 void task_work_run(void);
-
-static inline void exit_task_work(struct task_struct *task)
-{
-	if (unlikely(task->task_works))
-		task_work_run();
-}
+void task_work_exit(void);
 
 #endif	/* _LINUX_TASK_WORK_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index f65345f..92aa94b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -992,7 +992,7 @@ void do_exit(long code)
 	exit_shm(tsk);
 	exit_files(tsk);
 	exit_fs(tsk);
-	exit_task_work(tsk);
+	task_work_exit();
 	check_stack_usage();
 	exit_thread();
 
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 91d4e17..e5eac14 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -2,79 +2,85 @@
 #include <linux/task_work.h>
 #include <linux/tracehook.h>
 
+static void task_work_nop(struct callback_head *work)
+{
+}
+
+static struct callback_head dead = { 
+	.next = NULL,
+	.func = task_work_nop,
+};
+
 int
-task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
+task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
 {
-	struct callback_head *last, *first;
-	unsigned long flags;
-
-	/*
-	 * Not inserting the new work if the task has already passed
-	 * exit_task_work() is the responisbility of callers.
-	 */
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	last = task->task_works;
-	first = last ? last->next : twork;
-	twork->next = first;
-	if (last)
-		last->next = twork;
-	task->task_works = twork;
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+	struct callback_head **head = &task->task_works;
+	struct callback_head *entry, *old_entry;
+
+	entry = &head;
+	for (;;) {
+		if (entry == &dead)
+			return -ESRCH;
+
+		old_entry = entry;
+		work->next = entry;
+		entry = cmpxchg(head, old_entry, work);
+		if (entry == old_entry)
+			break;
+	}
 
 	/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
 	if (notify)
 		set_notify_resume(task);
+
 	return 0;
 }
 
 struct callback_head *
 task_work_cancel(struct task_struct *task, task_work_func_t func)
 {
-	unsigned long flags;
-	struct callback_head *last, *res = NULL;
-
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	last = task->task_works;
-	if (last) {
-		struct callback_head *q = last, *p = q->next;
-		while (1) {
-			if (p->func == func) {
-				q->next = p->next;
-				if (p == last)
-					task->task_works = q == p ? NULL : q;
-				res = p;
-				break;
-			}
-			if (p == last)
-				break;
-			q = p;
-			p = q->next;
+	struct callback_head **workp, *work;
+
+again:
+	workp = &task->task_works;
+	work = *workp;
+	while (work) {
+		if (work->func == func) {
+			if (cmpxchg(workp, work, work->next) == work)
+				return work;
+			goto again;
 		}
+
+		workp = &work->next;
+		work = *workp;
 	}
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-	return res;
+
+	return NULL;
 }
 
-void task_work_run(void)
+static void __task_work_run(struct callback_head *tail)
 {
-	struct task_struct *task = current;
-	struct callback_head *p, *q;
-
-	while (1) {
-		raw_spin_lock_irq(&task->pi_lock);
-		p = task->task_works;
-		task->task_works = NULL;
-		raw_spin_unlock_irq(&task->pi_lock);
-
-		if (unlikely(!p))
-			return;
-
-		q = p->next; /* head */
-		p->next = NULL; /* cut it */
-		while (q) {
-			p = q->next;
-			q->func(q);
-			q = p;
+	struct callback_head **head = &current->task_works;
+
+	do {
+		struct callback_head *work = xchg(head, NULL);
+		while (work) {
+			struct callback_head *next = ACCESS_ONCE(work->next);
+
+			WARN_ON_ONCE(work == &dead);
+
+			work->func(work);
+			work = next;
 		}
-	}
+	} while (cmpxchg(head, NULL, tail) != NULL);
+}
+
+void task_work_run(void)
+{
+	__task_work_run(NULL);
+}
+
+void task_work_exit(void)
+{
+	__task_work_run(&dead);
 }



* Re: lockdep trace from posix timers
  2012-08-20 11:44       ` Peter Zijlstra
@ 2012-08-20 11:46         ` Peter Zijlstra
  2012-08-20 11:50         ` Peter Zijlstra
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-20 11:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On Mon, 2012-08-20 at 13:44 +0200, Peter Zijlstra wrote:
> > I'm not at all sure how that relates to needing task_lock() in the
> > keyctl stuff. 

Ah, is it used to serialize against exit_mm() so that the !->mm check
will suffice to avoid queuing works past exit_task_work() ?





* Re: lockdep trace from posix timers
  2012-08-20 11:44       ` Peter Zijlstra
  2012-08-20 11:46         ` Peter Zijlstra
@ 2012-08-20 11:50         ` Peter Zijlstra
  2012-08-20 12:19           ` Steven Rostedt
  2012-08-20 14:59         ` Oleg Nesterov
  2012-08-20 15:05         ` Oleg Nesterov
  3 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-20 11:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On Mon, 2012-08-20 at 13:44 +0200, Peter Zijlstra wrote:
>  int
> +task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
>  {
> +       struct callback_head **head = &task->task_works;
> +       struct callback_head *entry, *old_entry;
> +
> +       entry = &head;

My compiler just alerted me that: 

	entry = *head; 

works a lot better..

> +       for (;;) {
> +               if (entry == &dead)
> +                       return -ESRCH;
> +
> +               old_entry = entry;
> +               work->next = entry;
> +               entry = cmpxchg(head, old_entry, work);
> +               if (entry == old_entry)
> +                       break;
> +       }
>  
>         /* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
>         if (notify)
>                 set_notify_resume(task);
> +
>         return 0;
>  } 


* Re: lockdep trace from posix timers
  2012-08-20 11:50         ` Peter Zijlstra
@ 2012-08-20 12:19           ` Steven Rostedt
  2012-08-20 12:20             ` Peter Zijlstra
  0 siblings, 1 reply; 54+ messages in thread
From: Steven Rostedt @ 2012-08-20 12:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Dave Jones, Linux Kernel, Thomas Gleixner,
	dhowells, Al Viro

On Mon, 2012-08-20 at 13:50 +0200, Peter Zijlstra wrote:
> On Mon, 2012-08-20 at 13:44 +0200, Peter Zijlstra wrote:
> >  int
> > +task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
> >  {
> > +       struct callback_head **head = &task->task_works;
> > +       struct callback_head *entry, *old_entry;
> > +
> > +       entry = &head;
> 
> My compiler just alerted me that: 
> 
> 	entry = *head; 
> 
> works a lot better..
> 
> > +       for (;;) {
> > +               if (entry == &dead)

But your compiler likes "entry == &dead"? ;-)

-- Steve

> > +                       return -ESRCH;
> > +
> > +               old_entry = entry;
> > +               work->next = entry;
> > +               entry = cmpxchg(head, old_entry, work);
> > +               if (entry == old_entry)
> > +                       break;
> > +       }
> >  
> >         /* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
> >         if (notify)
> >                 set_notify_resume(task);
> > +
> >         return 0;
> >  } 




* Re: lockdep trace from posix timers
  2012-08-20 12:19           ` Steven Rostedt
@ 2012-08-20 12:20             ` Peter Zijlstra
  0 siblings, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-20 12:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Dave Jones, Linux Kernel, Thomas Gleixner,
	dhowells, Al Viro

On Mon, 2012-08-20 at 08:19 -0400, Steven Rostedt wrote:
> 
> > > +       for (;;) {
> > > +               if (entry == &dead)
> 
> But your compiler likes "entry == &dead"? ;-)
> 
Yes, fancy as GCC is these days, it doesn't quite bother with the
meta-physical questions just yet.


* Re: lockdep trace from posix timers
  2012-08-20  7:15     ` lockdep trace from posix timers Peter Zijlstra
  2012-08-20 11:44       ` Peter Zijlstra
@ 2012-08-20 14:55       ` Oleg Nesterov
  2012-08-20 15:43       ` Oleg Nesterov
  2 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-20 14:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/20, Peter Zijlstra wrote:
>
> On Fri, 2012-08-17 at 17:14 +0200, Oleg Nesterov wrote:
> > I still think that task_work_add() should synchronize with exit_task_work()
> > itself and fail if necessary. But I wasn't able to convince Al ;)
>
> I'm not at all sure how that relates to needing task_lock() in the
> keyctl stuff.

To serialize with exit_mm() which clears ->mm.

We shouldn't do task_work_add(task) if the exiting task has already
passed exit_task_work(). There is no way to do this after ed3e694d7
(and I think this is wrong), so keyctl relies on the fact that
exit_task_work() is called after exit_mm().
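
IOW, the ordering in do_exit() is (simplified):

	void do_exit(long code)
	{
		...
		exit_mm(tsk);		/* task_lock(tsk); tsk->mm = NULL; ... */
		...
		exit_files(tsk);
		exit_fs(tsk);
		exit_task_work(tsk);	/* after this, a queued work is never run */
		...
	}

so task_lock(parent) plus the ->mm != NULL check guarantees the parent has
not yet reached exit_task_work().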

> Also, can't task_work use llist stuff? That would also avoid using
> ->pi_lock.

Not sure.... task_work_add/run can use cmpxchg, but _cancel can't.

Oleg.



* Re: lockdep trace from posix timers
  2012-08-20 11:44       ` Peter Zijlstra
  2012-08-20 11:46         ` Peter Zijlstra
  2012-08-20 11:50         ` Peter Zijlstra
@ 2012-08-20 14:59         ` Oleg Nesterov
  2012-08-20 15:10           ` Peter Zijlstra
  2012-08-20 15:05         ` Oleg Nesterov
  3 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-20 14:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/20, Peter Zijlstra wrote:
>
>  task_work_cancel(struct task_struct *task, task_work_func_t func)
>  {
> ...
> +
> +again:
> +	workp = &task->task_works;
> +	work = *workp;
> +	while (work) {
> +		if (work->func == func) {

But this all can race with task_work_run() if task != current.

This "work" can be freed/reused. And it should only return !NULL
if twork->func() was not called.
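
IOW, something like this can happen:

	task_work_cancel(task, func)		task (in task_work_run)
	----------------------------		-----------------------
	work = *workp;	/* != NULL */
						xchg(head, NULL);
						work->func(work);
						/* work may be freed/reused */
	if (work->func == func)
		/* reads freed memory */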

Oleg.



* Re: lockdep trace from posix timers
  2012-08-20 11:44       ` Peter Zijlstra
                           ` (2 preceding siblings ...)
  2012-08-20 14:59         ` Oleg Nesterov
@ 2012-08-20 15:05         ` Oleg Nesterov
  2012-08-20 15:12           ` Peter Zijlstra
  3 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-20 15:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/20, Peter Zijlstra wrote:
>
> +static void __task_work_run(struct callback_head *tail)
>  {
> -	struct task_struct *task = current;
> -	struct callback_head *p, *q;
> -
> -	while (1) {
> -		raw_spin_lock_irq(&task->pi_lock);
> -		p = task->task_works;
> -		task->task_works = NULL;
> -		raw_spin_unlock_irq(&task->pi_lock);
> -
> -		if (unlikely(!p))
> -			return;
> -
> -		q = p->next; /* head */
> -		p->next = NULL; /* cut it */
> -		while (q) {
> -			p = q->next;
> -			q->func(q);
> -			q = p;
> +	struct callback_head **head = &current->task_works;
> +
> +	do {
> +		struct callback_head *work = xchg(head, NULL);
> +		while (work) {
> +			struct callback_head *next = ACCESS_ONCE(work->next);
> +
> +			WARN_ON_ONCE(work == &dead);
> +
> +			work->func(work);
> +			work = next;
>  		}
> -	}
> +	} while (cmpxchg(head, NULL, tail) != NULL);

Yes, we can add the explicit argument to __task_work_run(), but it can
check PF_EXITING instead, this looks simpler to me.

Note also your patch breaks fifo, but this is fixable.

Oleg.



* Re: lockdep trace from posix timers
  2012-08-20 14:59         ` Oleg Nesterov
@ 2012-08-20 15:10           ` Peter Zijlstra
  2012-08-20 15:27             ` Peter Zijlstra
  0 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-20 15:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On Mon, 2012-08-20 at 16:59 +0200, Oleg Nesterov wrote:
> On 08/20, Peter Zijlstra wrote:
> >
> >  task_work_cancel(struct task_struct *task, task_work_func_t func)
> >  {
> > ...
> > +
> > +again:
> > +	workp = &task->task_works;
> > +	work = *workp;
> > +	while (work) {
> > +		if (work->func == func) {
> 
> But this all can race with task_work_run() if task != current.
> 
> This "work" can be freed/reused. And it should only return !NULL
> if twork->func() was not called.

Ah, because we could be iterating the list after the xchg done by run.

Hmm.. 


* Re: lockdep trace from posix timers
  2012-08-20 15:05         ` Oleg Nesterov
@ 2012-08-20 15:12           ` Peter Zijlstra
  2012-08-20 15:41             ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-20 15:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On Mon, 2012-08-20 at 17:05 +0200, Oleg Nesterov wrote:
> On 08/20, Peter Zijlstra wrote:
> >
> > +static void __task_work_run(struct callback_head *tail)
> >  {
> > -	struct task_struct *task = current;
> > -	struct callback_head *p, *q;
> > -
> > -	while (1) {
> > -		raw_spin_lock_irq(&task->pi_lock);
> > -		p = task->task_works;
> > -		task->task_works = NULL;
> > -		raw_spin_unlock_irq(&task->pi_lock);
> > -
> > -		if (unlikely(!p))
> > -			return;
> > -
> > -		q = p->next; /* head */
> > -		p->next = NULL; /* cut it */
> > -		while (q) {
> > -			p = q->next;
> > -			q->func(q);
> > -			q = p;
> > +	struct callback_head **head = &current->task_works;
> > +
> > +	do {
> > +		struct callback_head *work = xchg(head, NULL);
> > +		while (work) {
> > +			struct callback_head *next = ACCESS_ONCE(work->next);
> > +
> > +			WARN_ON_ONCE(work == &dead);
> > +
> > +			work->func(work);
> > +			work = next;
> >  		}
> > -	}
> > +	} while (cmpxchg(head, NULL, tail) != NULL);
> 
> Yes, we can add the explicit argument to __task_work_run(), but it can
> check PF_EXITING instead, this looks simpler to me.

I guess we could.. but I thought the explicit callback was simpler ;-)

> Note also your patch breaks fifo, but this is fixable.

Why do you care about the order? Iterating a single linked queue in fifo
seems more expensive than useful.


* Re: lockdep trace from posix timers
  2012-08-20 15:10           ` Peter Zijlstra
@ 2012-08-20 15:27             ` Peter Zijlstra
  2012-08-20 15:32               ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-20 15:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On Mon, 2012-08-20 at 17:10 +0200, Peter Zijlstra wrote:
> On Mon, 2012-08-20 at 16:59 +0200, Oleg Nesterov wrote:
> > On 08/20, Peter Zijlstra wrote:
> > >
> > >  task_work_cancel(struct task_struct *task, task_work_func_t func)
> > >  {
> > > ...
> > > +
> > > +again:
> > > +	workp = &task->task_works;
> > > +	work = *workp;
> > > +	while (work) {
> > > +		if (work->func == func) {
> > 
> > But this all can race with task_work_run() if task != current.
> > 
> > This "work" can be freed/reused. And it should only return !NULL
> > if twork->func() was not called.
> 
> Ah, because we could be iterating the list after the xchg done by run.

I guess we could steal the entire list and requeue it afterwards and
lift TIF_NOTIFY_RESUME along with it.. but I can't say that's pretty.






* Re: lockdep trace from posix timers
  2012-08-20 15:27             ` Peter Zijlstra
@ 2012-08-20 15:32               ` Oleg Nesterov
  2012-08-20 15:46                 ` Peter Zijlstra
  0 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-20 15:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/20, Peter Zijlstra wrote:
>
> On Mon, 2012-08-20 at 17:10 +0200, Peter Zijlstra wrote:
> > On Mon, 2012-08-20 at 16:59 +0200, Oleg Nesterov wrote:
> > > On 08/20, Peter Zijlstra wrote:
> > > >
> > > >  task_work_cancel(struct task_struct *task, task_work_func_t func)
> > > >  {
> > > > ...
> > > > +
> > > > +again:
> > > > +	workp = &task->task_works;
> > > > +	work = *workp;
> > > > +	while (work) {
> > > > +		if (work->func == func) {
> > >
> > > But this all can race with task_work_run() if task != current.
> > >
> > > This "work" can be freed/reused. And it should only return !NULL
> > > if twork->func() was not called.
> >
> > Ah, because we could be iterating the list after the xchg done by run.
>
> I guess we could steal the entire list and requeue it afterwards and
> lift TIF_NOTIFY_RESUME along with it..

We can't. This can race with exit_task_work(). And this can break
fput(), the task can return to the userspace without __fput().

> but I can't say that's pretty.

Yes ;)

Oleg.



* Re: lockdep trace from posix timers
  2012-08-20 15:12           ` Peter Zijlstra
@ 2012-08-20 15:41             ` Oleg Nesterov
  2012-08-20 15:56               ` Peter Zijlstra
  0 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-20 15:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/20, Peter Zijlstra wrote:
>
> On Mon, 2012-08-20 at 17:05 +0200, Oleg Nesterov wrote:
> >
> > Yes, we can add the explicit argument to __task_work_run(), but it can
> > check PF_EXITING instead, this looks simpler to me.
>
> I guess we could.. but I thought the explicit callback was simpler ;-)

I won't insist. The patch I sent uses PF_EXITING and the fake
"struct callback_head* TWORK_EXITED", but this looks almost the same.

> > Note also your patch breaks fifo, but this is fixable.
>
> Why do you care about the order?

IMHO, this is just more natural.

For example. keyctl_session_to_parent() does _cancel only to protect
from exploits doing keyctl(KEYCTL_SESSION_TO_PARENT) in an endless
loop. It could simply do task_work_add(), but in this case we need
fifo for correctness.

> Iterating a single linked queue in fifo
> seems more expensive than useful.

Currently the list is fifo (we add to the last element), this is O(1).

But the list should be short, we can reverse it in _run() if we change
task_work_add() to add to the head.

Oleg.



* Re: lockdep trace from posix timers
  2012-08-20  7:15     ` lockdep trace from posix timers Peter Zijlstra
  2012-08-20 11:44       ` Peter Zijlstra
  2012-08-20 14:55       ` Oleg Nesterov
@ 2012-08-20 15:43       ` Oleg Nesterov
  2012-08-20 15:48         ` Peter Zijlstra
  2 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-20 15:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/20, Peter Zijlstra wrote:
>
> Also, can't task_work use llist stuff? That would also avoid using
> ->pi_lock.

BTW, do you think it is really important to try to avoid ->pi_lock?

Oleg.



* Re: lockdep trace from posix timers
  2012-08-20 15:32               ` Oleg Nesterov
@ 2012-08-20 15:46                 ` Peter Zijlstra
  2012-08-20 15:58                   ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-20 15:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On Mon, 2012-08-20 at 17:32 +0200, Oleg Nesterov wrote:

> > I guess we could steal the entire list and requeue it afterwards and
> > lift TIF_NOTIFY_RESUME along with it..
> 
> We can't. This can race with exit_task_work(). And this can break
> fput(), the task can return to the userspace without __fput().

So we could put that spinlock back around cancel and run and leave add
lockless. That'd solve my immediate problem but its not something I'm
proud of.

All schemes I can come up with end up being broken or effectively the
same as the above proposal.


* Re: lockdep trace from posix timers
  2012-08-20 15:43       ` Oleg Nesterov
@ 2012-08-20 15:48         ` Peter Zijlstra
  2012-08-20 15:58           ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-20 15:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On Mon, 2012-08-20 at 17:43 +0200, Oleg Nesterov wrote:
> BTW, do you think it is really important to try to avoid ->pi_lock?

It is for me, I'm doing task_work_add() from under rq->lock.. :/ I could
fudge around and delay, but not having to would be nice.


* Re: lockdep trace from posix timers
  2012-08-20 15:41             ` Oleg Nesterov
@ 2012-08-20 15:56               ` Peter Zijlstra
  2012-08-20 16:10                 ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-20 15:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On Mon, 2012-08-20 at 17:41 +0200, Oleg Nesterov wrote:

> I won't insist. The patch I sent uses PF_EXITING and the fake
> "struct callback_head* TWORK_EXITED", but this looks almost the same.

Right, I used a fake callback_head because it avoided a few special
cases since it's a dereferenceable pointer.

> > > Note also your patch breaks fifo, but this is fixable.
> >
> > Why do you care about the order?
> 
> IMHO, this is just more natural.

Depends on what you're used to I guess ;-) Both RCU and irq_work are
filo, this seems to be the natural way for single linked lists.

> For example. keyctl_session_to_parent() does _cancel only to protect
> from exploits doing keyctl(KEYCTL_SESSION_TO_PARENT) in an endless
> loop. It could simply do task_work_add(), but in this case we need
> fifo for correctness.

I'm not entirely sure I see; not doing the cancel would delay the free
until the execution of key_change_session_keyring()? Doing that keyctl()
in an indefinite loop involves going back to userspace, so where's the
resource issue?

Also, I'm not seeing where the FIFO requirement comes from.

> > Iterating a single linked queue in fifo
> > seems more expensive than useful.
> 
> Currently the list is fifo (we add to the last element), this is O(1).

depends on what way you look at the list I guess, with a single linked
list there's only one end you can add to in O(1), so we're calling that
the tail?

> But the list should be short, we can reverse it in _run() if we change
> task_work_add() to add to the head.

Reversing a (single linked) list is O(n^2).. which is indeed doable for
short lists, but why assume it's short?


* Re: lockdep trace from posix timers
  2012-08-20 15:46                 ` Peter Zijlstra
@ 2012-08-20 15:58                   ` Oleg Nesterov
  2012-08-20 16:03                     ` Peter Zijlstra
  0 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-20 15:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/20, Peter Zijlstra wrote:
>
> On Mon, 2012-08-20 at 17:32 +0200, Oleg Nesterov wrote:
>
> > > I guess we could steal the entire list and requeue it afterwards and
> > > lift TIF_NOTIFY_RESUME along with it..
> >
> > We can't. This can race with exit_task_work(). And this can break
> > fput(), the task can return to the userspace without __fput().
>
> So we could put that spinlock back around cancel and run and leave add
> lockless. That'd solve my immediate problem but its not something I'm
> proud of.

Which problem?

We can probably use bit_spin_lock() and avoid ->pi_lock.

Of course, we can add the new lock into task_struct, but this is not nice.

Oleg.



* Re: lockdep trace from posix timers
  2012-08-20 15:48         ` Peter Zijlstra
@ 2012-08-20 15:58           ` Oleg Nesterov
  0 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-20 15:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/20, Peter Zijlstra wrote:
>
> On Mon, 2012-08-20 at 17:43 +0200, Oleg Nesterov wrote:
> > BTW, do you think it is really important to try to avoid ->pi_lock?
>
> It is for me, I'm doing task_work_add() from under rq->lock.. :/ I could
> fudge around and delay, but not having to would be nice.

Aha, got it.

At first glance, bit_spin_lock() can work. I'll try to make the patches
tomorrow.

Oleg.



* Re: lockdep trace from posix timers
  2012-08-20 15:58                   ` Oleg Nesterov
@ 2012-08-20 16:03                     ` Peter Zijlstra
  0 siblings, 0 replies; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-20 16:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On Mon, 2012-08-20 at 17:58 +0200, Oleg Nesterov wrote:
> On 08/20, Peter Zijlstra wrote:
> >
> > On Mon, 2012-08-20 at 17:32 +0200, Oleg Nesterov wrote:
> >
> > > > I guess we could steal the entire list and requeue it afterwards and
> > > > lift TIF_NOTIFY_RESUME along with it..
> > >
> > > We can't. This can race with exit_task_work(). And this can break
> > > fput(), the task can return to the userspace without __fput().
> >
> > So we could put that spinlock back around cancel and run and leave add
> > lockless. That'd solve my immediate problem but its not something I'm
> > proud of.
> 
> Which problem?

/me doing task_work_add() from under rq->lock..

> We can probably use bit_spin_lock() and avoid ->pi_lock.

tglx will kill us both for even thinking of bit-spinlocks.

> Of course, we can add the new lock into task_struct, but this is not nice.

If we can limit the lock to cancel/run I'm ok.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: lockdep trace from posix timers
  2012-08-20 15:56               ` Peter Zijlstra
@ 2012-08-20 16:10                 ` Oleg Nesterov
  2012-08-20 16:19                   ` Peter Zijlstra
  0 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-20 16:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/20, Peter Zijlstra wrote:
>
> On Mon, 2012-08-20 at 17:41 +0200, Oleg Nesterov wrote:
>
> > IMHO, this is just more natural.
>
> Depends on what you're used to I guess ;-)

I have to agree ;)

> > For example. keyctl_session_to_parent() does _cancel only to protect
> > from exploits doing keyctl(KEYCTL_SESSION_TO_PARENT) in an endless
> > loop. It could simply do task_work_add(), but in this case we need
> > fifo for correctness.
>
> I'm not entirely sure I see, not doing the cancel would delay the free
> until the executing of key_change_session_keyring()? doing that keyctl()
> in an indefinite loop involves going back to userspace, so where's the
> resource issue?

But the child does task_work_add(current->parent). IOW, there are 2
different tasks. Now suppose that ->parent sleeps.

> Also, I'm not seeing where the FIFO requirement comes from.

Again, suppose that ->parent sleeps. The last KEYCTL_SESSION_TO_PARENT
request should "win". Although I agree, this is not that important.
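
Roughly the pattern being discussed, as a sketch (this is not the actual
keyctl code; put_superseded_request() is a made-up placeholder for
dropping the old request):

static int queue_keyring_replacement(struct task_struct *parent,
				     struct callback_head *newwork)
{
	struct callback_head *oldwork;
	int ret;

	/* drop a previous request the sleeping parent has not consumed yet */
	oldwork = task_work_cancel(parent, key_change_session_keyring);

	/* queue the new one, the parent applies it on return to userspace */
	ret = task_work_add(parent, newwork, true);

	if (oldwork)
		put_superseded_request(oldwork);	/* hypothetical cleanup */
	return ret;
}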

> > > Iterating a single linked queue in fifo
> > > seems more expensive than useful.
> >
> > Currently the list is fifo (we add to the last element), this is O(1).
>
> depends on what way you look at the list I guess, with a single linked
> list there's only one end you can add to in O(1), so we're calling that
> the tail?

Sorry, can't understand...

task->task_works points to the last node in the circular single-linked list,
task_work_add() adds the new element after the last one and updates
task->task_works. This is O(1).
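
A minimal sketch of that O(1) append (same shape as the current
task_work_add(), locking omitted): ->task_works is the tail of a
circular singly-linked list, so the head is tail->next and adding only
touches two pointers.

static void append_work(struct callback_head **tail, struct callback_head *work)
{
	struct callback_head *last = *tail;

	work->next = last ? last->next : work;	/* new tail points at the head */
	if (last)
		last->next = work;		/* old tail links to the new one */
	*tail = work;				/* ->task_works tracks the tail */
}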

> > But the list should be short, we can reverse it in _run() if we change
> > task_work_add() to add to the head.
>
> Reversing a (single linked) list is O(n^2)..

Hmm. This is O(n). You can simply iterate over this list once, changing
the ->next pointer to point back.
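
Something like this single pass, as a sketch (assuming a NULL-terminated
list, i.e. the add-to-head variant):

static struct callback_head *reverse_works(struct callback_head *work)
{
	struct callback_head *head = NULL, *next;

	while (work) {
		next = work->next;
		work->next = head;	/* flip the link to the previous node */
		head = work;
		work = next;
	}
	return head;	/* old last element is the new first one */
}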

> which is indeed doable for
> short lists, but why assume it's short?

I agree, it is better to not do this.

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: lockdep trace from posix timers
  2012-08-20 16:10                 ` Oleg Nesterov
@ 2012-08-20 16:19                   ` Peter Zijlstra
  2012-08-20 16:23                     ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-20 16:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On Mon, 2012-08-20 at 18:10 +0200, Oleg Nesterov wrote:
> task->task_works points to the last node in the circular single-linked list,
> task_work_add() adds the new element after the last one and updates
> task->task_works. This is O(1).

Agreed, the way I was looking at that is: ->task_works points to the
head and we put a new one in front, that too is O(1) ;-)

> > > But the list should be short, we can reverse it in _run() if we change
> > > task_work_add() to add to the head.
> >
> > Reversing a (single linked) list is O(n^2)..
> 
> Hmm. This is O(n). You can simply iterate over this list once, changing
> the ->next pointer to point back. 

OK, I'm going to stop and step away from the computer now.. clearly I'm
more than useless today :/

But yeah.. that could be done.

Anyway, would taking ->pi_lock over _cancel and _run suffice?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: lockdep trace from posix timers
  2012-08-20 16:19                   ` Peter Zijlstra
@ 2012-08-20 16:23                     ` Oleg Nesterov
  2012-08-21 18:27                       ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-20 16:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/20, Peter Zijlstra wrote:
>
> Anyway, would taking ->pi_lock over _cancel and _run suffice?

I think yes... But I can't think properly today.

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: lockdep trace from posix timers
  2012-08-20 16:23                     ` Oleg Nesterov
@ 2012-08-21 18:27                       ` Oleg Nesterov
  2012-08-21 18:34                         ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-21 18:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/20, Oleg Nesterov wrote:
>
> On 08/20, Peter Zijlstra wrote:
> >
> > Anyway, would taking ->pi_lock over _cancel and _run suffice?
>
> I think yes... But I can't think properly today.

OK, how about the code below?

Oleg.


#define TWORK_EXITED	((struct callback_head*)1)

static inline bool cmp_xchg(struct callback_head **ptr, void *old, void *new)
{
	return cmpxchg(ptr, old, new) == old;
}

int
task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
{
	do {
		twork->next = ACCESS_ONCE(task->task_works);
		if (unlikely(twork->next == TWORK_EXITED))
			return -ESRCH;
	} while (cmp_xchg(&task->task_works, twork->next, twork));

	/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
	if (notify)
		set_notify_resume(task);
	return 0;
}

struct callback_head *
task_work_cancel(struct task_struct *task, task_work_func_t func)
{
	struct callback_head *twork = NULL, **pprev = &task->task_works;
	unsigned long flags;

	raw_spin_lock_irqsave(&task->pi_lock, flags);
	if (likely(*pprev != TWORK_EXITED)) {
		for (; (twork = *pprev); pprev = &twork->next) {
			read_barrier_depends();
			if (twork->func != func)
				continue;

			while (!cmp_xchg(pprev, twork, twork->next))
				// COMMENT ABOUT THE RACE WITH _add()
				pprev = &(*pprev)->next;
			break;
		}
	}
	raw_spin_unlock_irqrestore(&task->pi_lock, flags);

	return twork;
}

void task_work_run(void)
{
	struct task_struct *task = current;
	struct callback_head *twork, *marker;

	for (;;) {
		raw_spin_lock_irq(&task->pi_lock);
		do {
			twork = ACCESS_ONCE(task->task_works);
			marker = (!twork && (task->flags & PF_EXITING)) ?
				TWORK_EXITED : NULL;
		} while (cmp_xchg(&task->task_works, twork, marker));
		raw_spin_unlock_irq(&task->pi_lock);

		if (!twork)
			break;

		...run works...
	}
}


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: lockdep trace from posix timers
  2012-08-21 18:27                       ` Oleg Nesterov
@ 2012-08-21 18:34                         ` Oleg Nesterov
  2012-08-24 18:56                           ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-21 18:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/21, Oleg Nesterov wrote:
>
> static inline bool cmp_xchg(struct callback_head **ptr, void *old, void *new)
> {
> 	return cmpxchg(ptr, old, new) == old;
> }
>
> int
> task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
> {
> 	do {
> 		twork->next = ACCESS_ONCE(task->task_works);
> 		if (unlikely(twork->next == TWORK_EXITED))
> 			return -ESRCH;
> 	} while (cmp_xchg(&task->task_works, twork->next, twork));
          ^^^^^^^^^^^^^^^

while (!cmp_xchg(...))

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: lockdep trace from posix timers
  2012-08-21 18:34                         ` Oleg Nesterov
@ 2012-08-24 18:56                           ` Oleg Nesterov
  2012-08-26 19:11                             ` [PATCH 0/4] (Was: lockdep trace from posix timers) Oleg Nesterov
  2012-08-28 16:29                             ` lockdep trace from posix timers Peter Zijlstra
  0 siblings, 2 replies; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-24 18:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/21, Oleg Nesterov wrote:
>
> > task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
> > {
> > 	do {
> > 		twork->next = ACCESS_ONCE(task->task_works);
> > 		if (unlikely(twork->next == TWORK_EXITED))
> > 			return -ESRCH;
> > 	} while (cmp_xchg(&task->task_works, twork->next, twork));
>           ^^^^^^^^^^^^^^^
>
> while (!cmp_xchg(...))

and _cancel can be simplified a bit.

OK, I actually tested the code below, seems to work.

Peter, if you think it can work for you and if you agree with
the implementation I will be happy to send the patch.

The only change outside of task_work.c is that exit_task_work()
should call task_work_run() unconditionally.

As for cmp_xchg below... Of course it is not strictly needed.
But otoh, personally I'd like to have this helper (probably
renamed) somewhere in include/linux. Perhaps this is just me,
but cmpxchg() always confuses me, and most users only need
success_or_not from cmpxchg.

Oleg.


#define cmp_xchg(ptr, _old, new)	\
({					\
	__typeof__(_old) old = (_old);	\
	cmpxchg(ptr, old, new) == old;	\
})

#define TWORK_EXITED	((struct callback_head*)1)

int
task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
{
	do {
		work->next = ACCESS_ONCE(task->task_works);
		if (unlikely(work->next == TWORK_EXITED))
			return -ESRCH;
	} while (!cmp_xchg(&task->task_works, work->next, work));

	/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
	if (notify)
		set_notify_resume(task);
	return 0;
}

struct callback_head *
task_work_cancel(struct task_struct *task, task_work_func_t func)
{
	struct callback_head **pprev = &task->task_works;
	struct callback_head *work = NULL;
	unsigned long flags;

	raw_spin_lock_irqsave(&task->pi_lock, flags);
	if (likely(*pprev != TWORK_EXITED)) {
		while ((work = *pprev)) {
			read_barrier_depends();
			if (work->func != func)
				pprev = &work->next;
			else if (cmp_xchg(pprev, work, work->next))
				break;
		}
	}
	raw_spin_unlock_irqrestore(&task->pi_lock, flags);

	return work;
}

void task_work_run(void)
{
	struct task_struct *task = current;
	struct callback_head *work, *head, *next;

	for (;;) {
		raw_spin_lock_irq(&task->pi_lock);
		do {
			work = ACCESS_ONCE(task->task_works);
			head = !work && (task->flags & PF_EXITING) ?
				TWORK_EXITED : NULL;
		} while (!cmp_xchg(&task->task_works, work, head));
		raw_spin_unlock_irq(&task->pi_lock);

		if (!work)
			break;

#if PETERZ_WONT_ARGUE_AGAINST_FIFO_TOO_MUCH
		head = NULL;
		do {
			next = work->next;
			work->next = head;
			head = work;
			work = next;
		} while (work);

		work = head;
#endif
		do {
			next = work->next;
			work->func(work);
			work = next;
			cond_resched();
		} while (work);
	}
}


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 0/4] (Was: lockdep trace from posix timers)
  2012-08-24 18:56                           ` Oleg Nesterov
@ 2012-08-26 19:11                             ` Oleg Nesterov
  2012-08-26 19:12                               ` [PATCH 1/4] task_work: make task_work_add() lockless Oleg Nesterov
                                                 ` (4 more replies)
  2012-08-28 16:29                             ` lockdep trace from posix timers Peter Zijlstra
  1 sibling, 5 replies; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-26 19:11 UTC (permalink / raw)
  To: Peter Zijlstra, Al Viro
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells,
	Linus Torvalds

On 08/24, Oleg Nesterov wrote:
>
> Peter, if you think it can work for you and if you agree with
> the implementation I will be happy to send the patch.

I think I should try anyway ;)

To simplify the review, I attached the resulting code below.

Changes:

	- Comments.

	- Not sure this is really better, but task_work_run()
	  does not need to actually take pi_lock, unlock_wait
	  is enough.

	  However, in this case the dummy entry is better than
	  the fake pointer.

Oleg.

#include <linux/spinlock.h>
#include <linux/task_work.h>
#include <linux/tracehook.h>

static struct callback_head work_exited; /* all we need is ->next == NULL */

int
task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
{
	struct callback_head *head;

	do {
		head = ACCESS_ONCE(task->task_works);
		if (unlikely(head == &work_exited))
			return -ESRCH;
		work->next = head;
	} while (cmpxchg(&task->task_works, head, work) != head);

	if (notify)
		set_notify_resume(task);
	return 0;
}

struct callback_head *
task_work_cancel(struct task_struct *task, task_work_func_t func)
{
	struct callback_head **pprev = &task->task_works;
	struct callback_head *work = NULL;
	unsigned long flags;
	/*
	 * If cmpxchg() fails we continue without updating pprev.
	 * Either we raced with task_work_add() which added the
	 * new entry before this work, we will find it again. Or
	 * we raced with task_work_run(), *pprev == NULL/exited.
	 */
	raw_spin_lock_irqsave(&task->pi_lock, flags);
	while ((work = ACCESS_ONCE(*pprev))) {
		read_barrier_depends();
		if (work->func != func)
			pprev = &work->next;
		else if (cmpxchg(pprev, work, work->next) == work)
			break;
	}
	raw_spin_unlock_irqrestore(&task->pi_lock, flags);

	return work;
}

void task_work_run(void)
{
	struct task_struct *task = current;
	struct callback_head *work, *head, *next;

	for (;;) {
		/*
		 * work->func() can do task_work_add(), do not set
		 * work_exited unless the list is empty.
		 */
		do {
			work = ACCESS_ONCE(task->task_works);
			head = !work && (task->flags & PF_EXITING) ?
				&work_exited : NULL;
		} while (cmpxchg(&task->task_works, work, head) != work);

		if (!work)
			break;
		/*
		 * Synchronize with task_work_cancel(). It can't remove
		 * the first entry == work, cmpxchg(task_works) should
		 * fail, but it can play with *work and other entries.
		 */
		raw_spin_unlock_wait(&task->pi_lock);
		smp_mb();

		/* Reverse the list to run the works in fifo order */
		head = NULL;
		do {
			next = work->next;
			work->next = head;
			head = work;
			work = next;
		} while (work);

		work = head;
		do {
			next = work->next;
			work->func(work);
			work = next;
			cond_resched();
		} while (work);
	}
}


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 1/4] task_work: make task_work_add() lockless
  2012-08-26 19:11                             ` [PATCH 0/4] (Was: lockdep trace from posix timers) Oleg Nesterov
@ 2012-08-26 19:12                               ` Oleg Nesterov
  2012-09-14  6:08                                 ` [tip:core/urgent] task_work: Make " tip-bot for Oleg Nesterov
  2012-09-24 19:27                                 ` [PATCH 1/4] task_work: make " Geert Uytterhoeven
  2012-08-26 19:12                               ` [PATCH 2/4] task_work: task_work_add() should not succeed after exit_task_work() Oleg Nesterov
                                                 ` (3 subsequent siblings)
  4 siblings, 2 replies; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-26 19:12 UTC (permalink / raw)
  To: Peter Zijlstra, Al Viro
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells,
	Linus Torvalds

Change task_works to use llist-like code to avoid pi_lock
in task_work_add(), this makes it useable under rq->lock.

task_work_cancel() and task_work_run() still use pi_lock
to synchronize with each other.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/task_work.c |   95 ++++++++++++++++++++++++++-------------------------
 1 files changed, 48 insertions(+), 47 deletions(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index d320d44..f13ec0b 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -3,25 +3,18 @@
 #include <linux/tracehook.h>
 
 int
-task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
+task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
 {
-	struct callback_head *last, *first;
-	unsigned long flags;
-
+	struct callback_head *head;
 	/*
 	 * Not inserting the new work if the task has already passed
 	 * exit_task_work() is the responisbility of callers.
 	 */
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	last = task->task_works;
-	first = last ? last->next : twork;
-	twork->next = first;
-	if (last)
-		last->next = twork;
-	task->task_works = twork;
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+	do {
+		head = ACCESS_ONCE(task->task_works);
+		work->next = head;
+	} while (cmpxchg(&task->task_works, head, work) != head);
 
-	/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
 	if (notify)
 		set_notify_resume(task);
 	return 0;
@@ -30,52 +23,60 @@ task_work_add(struct task_struct *task, struct callback_head *twork, bool notify
 struct callback_head *
 task_work_cancel(struct task_struct *task, task_work_func_t func)
 {
+	struct callback_head **pprev = &task->task_works;
+	struct callback_head *work = NULL;
 	unsigned long flags;
-	struct callback_head *last, *res = NULL;
-
+	/*
+	 * If cmpxchg() fails we continue without updating pprev.
+	 * Either we raced with task_work_add() which added the
+	 * new entry before this work, we will find it again. Or
+	 * we raced with task_work_run(), *pprev == NULL.
+	 */
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	last = task->task_works;
-	if (last) {
-		struct callback_head *q = last, *p = q->next;
-		while (1) {
-			if (p->func == func) {
-				q->next = p->next;
-				if (p == last)
-					task->task_works = q == p ? NULL : q;
-				res = p;
-				break;
-			}
-			if (p == last)
-				break;
-			q = p;
-			p = q->next;
-		}
+	while ((work = ACCESS_ONCE(*pprev))) {
+		read_barrier_depends();
+		if (work->func != func)
+			pprev = &work->next;
+		else if (cmpxchg(pprev, work, work->next) == work)
+			break;
 	}
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-	return res;
+
+	return work;
 }
 
 void task_work_run(void)
 {
 	struct task_struct *task = current;
-	struct callback_head *p, *q;
+	struct callback_head *work, *head, *next;
 
-	while (1) {
-		raw_spin_lock_irq(&task->pi_lock);
-		p = task->task_works;
-		task->task_works = NULL;
-		raw_spin_unlock_irq(&task->pi_lock);
+	for (;;) {
+		work = xchg(&task->task_works, NULL);
+		if (!work)
+			break;
+		/*
+		 * Synchronize with task_work_cancel(). It can't remove
+		 * the first entry == work, cmpxchg(task_works) should
+		 * fail, but it can play with *work and other entries.
+		 */
+		raw_spin_unlock_wait(&task->pi_lock);
+		smp_mb();
 
-		if (unlikely(!p))
-			return;
+		/* Reverse the list to run the works in fifo order */
+		head = NULL;
+		do {
+			next = work->next;
+			work->next = head;
+			head = work;
+			work = next;
+		} while (work);
 
-		q = p->next; /* head */
-		p->next = NULL; /* cut it */
-		while (q) {
-			p = q->next;
-			q->func(q);
-			q = p;
+		work = head;
+		do {
+			next = work->next;
+			work->func(work);
+			work = next;
 			cond_resched();
-		}
+		} while (work);
 	}
 }
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 2/4] task_work: task_work_add() should not succeed after exit_task_work()
  2012-08-26 19:11                             ` [PATCH 0/4] (Was: lockdep trace from posix timers) Oleg Nesterov
  2012-08-26 19:12                               ` [PATCH 1/4] task_work: make task_work_add() lockless Oleg Nesterov
@ 2012-08-26 19:12                               ` Oleg Nesterov
  2012-09-14  6:09                                 ` [tip:core/urgent] " tip-bot for Oleg Nesterov
  2012-08-26 19:12                               ` [PATCH 3/4] task_work: revert d35abdb2 "hold task_lock around checks in keyctl" Oleg Nesterov
                                                 ` (2 subsequent siblings)
  4 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-26 19:12 UTC (permalink / raw)
  To: Peter Zijlstra, Al Viro
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells,
	Linus Torvalds

ed3e694d "move exit_task_work() past exit_files() et.al" destroyed
the add/exit synchronization we had, the caller itself should ensure
task_work_add() can't race with the exiting task.

However, this is not convenient/simple, and the only user which tries
to do this is buggy (see the next patch). Unless the task is current,
there is simply no way to do this in general.

Change exit_task_work()->task_work_run() to use the dummy "work_exited"
entry to let task_work_add() know it should fail.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/task_work.h |    3 +--
 kernel/task_work.c        |   22 ++++++++++++++++------
 2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index fb46b03..ca5a1cf 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -18,8 +18,7 @@ void task_work_run(void);
 
 static inline void exit_task_work(struct task_struct *task)
 {
-	if (unlikely(task->task_works))
-		task_work_run();
+	task_work_run();
 }
 
 #endif	/* _LINUX_TASK_WORK_H */
diff --git a/kernel/task_work.c b/kernel/task_work.c
index f13ec0b..65bd3c9 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -2,16 +2,17 @@
 #include <linux/task_work.h>
 #include <linux/tracehook.h>
 
+static struct callback_head work_exited; /* all we need is ->next == NULL */
+
 int
 task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
 {
 	struct callback_head *head;
-	/*
-	 * Not inserting the new work if the task has already passed
-	 * exit_task_work() is the responisbility of callers.
-	 */
+
 	do {
 		head = ACCESS_ONCE(task->task_works);
+		if (unlikely(head == &work_exited))
+			return -ESRCH;
 		work->next = head;
 	} while (cmpxchg(&task->task_works, head, work) != head);
 
@@ -30,7 +31,7 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
 	 * If cmpxchg() fails we continue without updating pprev.
 	 * Either we raced with task_work_add() which added the
 	 * new entry before this work, we will find it again. Or
-	 * we raced with task_work_run(), *pprev == NULL.
+	 * we raced with task_work_run(), *pprev == NULL/exited.
 	 */
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 	while ((work = ACCESS_ONCE(*pprev))) {
@@ -51,7 +52,16 @@ void task_work_run(void)
 	struct callback_head *work, *head, *next;
 
 	for (;;) {
-		work = xchg(&task->task_works, NULL);
+		/*
+		 * work->func() can do task_work_add(), do not set
+		 * work_exited unless the list is empty.
+		 */
+		do {
+			work = ACCESS_ONCE(task->task_works);
+			head = !work && (task->flags & PF_EXITING) ?
+				&work_exited : NULL;
+		} while (cmpxchg(&task->task_works, work, head) != work);
+
 		if (!work)
 			break;
 		/*
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 3/4] task_work: revert d35abdb2 "hold task_lock around checks in keyctl"
  2012-08-26 19:11                             ` [PATCH 0/4] (Was: lockdep trace from posix timers) Oleg Nesterov
  2012-08-26 19:12                               ` [PATCH 1/4] task_work: make task_work_add() lockless Oleg Nesterov
  2012-08-26 19:12                               ` [PATCH 2/4] task_work: task_work_add() should not succeed after exit_task_work() Oleg Nesterov
@ 2012-08-26 19:12                               ` Oleg Nesterov
  2012-09-14  6:10                                 ` [tip:core/urgent] task_work: Revert " hold " tip-bot for Oleg Nesterov
  2012-08-26 19:12                               ` [PATCH 4/4] task_work: simplify the usage in ptrace_notify() and get_signal_to_deliver() Oleg Nesterov
  2012-09-06 18:01                               ` [PATCH 0/4] (Was: lockdep trace from posix timers) Oleg Nesterov
  4 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-26 19:12 UTC (permalink / raw)
  To: Peter Zijlstra, Al Viro
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells,
	Linus Torvalds

This reverts commit d35abdb28824cf74f0a106a0f9c6f3ff700a35bf.

task_lock() was added to ensure exit_mm() and thus exit_task_work() is
not possible before task_work_add().

This is wrong, task_lock() must not be nested with write_lock(tasklist).
And this is no longer needed, task_work_add() fails if it is called
after exit_task_work().

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 security/keys/keyctl.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 3364fbf..6cfc647 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -1486,7 +1486,6 @@ long keyctl_session_to_parent(void)
 	oldwork = NULL;
 	parent = me->real_parent;
 
-	task_lock(parent);
 	/* the parent mustn't be init and mustn't be a kernel thread */
 	if (parent->pid <= 1 || !parent->mm)
 		goto unlock;
@@ -1530,7 +1529,6 @@ long keyctl_session_to_parent(void)
 	if (!ret)
 		newwork = NULL;
 unlock:
-	task_unlock(parent);
 	write_unlock_irq(&tasklist_lock);
 	rcu_read_unlock();
 	if (oldwork)
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 4/4] task_work: simplify the usage in ptrace_notify() and get_signal_to_deliver()
  2012-08-26 19:11                             ` [PATCH 0/4] (Was: lockdep trace from posix timers) Oleg Nesterov
                                                 ` (2 preceding siblings ...)
  2012-08-26 19:12                               ` [PATCH 3/4] task_work: revert d35abdb2 "hold task_lock around checks in keyctl" Oleg Nesterov
@ 2012-08-26 19:12                               ` Oleg Nesterov
  2012-09-14  6:11                                 ` [tip:core/urgent] task_work: Simplify " tip-bot for Oleg Nesterov
  2012-09-06 18:01                               ` [PATCH 0/4] (Was: lockdep trace from posix timers) Oleg Nesterov
  4 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-26 19:12 UTC (permalink / raw)
  To: Peter Zijlstra, Al Viro
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells,
	Linus Torvalds

ptrace_notify() and get_signal_to_deliver() do unnecessary things
before task_work_run().

1. smp_mb__after_clear_bit() is not needed, test_and_clear_bit()
   implies mb().

2. And we do not need the barrier at all, in this case we only
   care about the "synchronous" works added by the task itself.

3. No need to clear TIF_NOTIFY_RESUME, and we should not assume
   task_works is the only user of this flag.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/signal.c |   18 ++++--------------
 1 files changed, 4 insertions(+), 14 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index be4f856..2c681f1 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1971,13 +1971,8 @@ static void ptrace_do_notify(int signr, int exit_code, int why)
 void ptrace_notify(int exit_code)
 {
 	BUG_ON((exit_code & (0x7f | ~0xffff)) != SIGTRAP);
-	if (unlikely(current->task_works)) {
-		if (test_and_clear_ti_thread_flag(current_thread_info(),
-						   TIF_NOTIFY_RESUME)) {
-			smp_mb__after_clear_bit();
-			task_work_run();
-		}
-	}
+	if (unlikely(current->task_works))
+		task_work_run();
 
 	spin_lock_irq(&current->sighand->siglock);
 	ptrace_do_notify(SIGTRAP, exit_code, CLD_TRAPPED);
@@ -2198,13 +2193,8 @@ int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
 	struct signal_struct *signal = current->signal;
 	int signr;
 
-	if (unlikely(current->task_works)) {
-		if (test_and_clear_ti_thread_flag(current_thread_info(),
-						   TIF_NOTIFY_RESUME)) {
-			smp_mb__after_clear_bit();
-			task_work_run();
-		}
-	}
+	if (unlikely(current->task_works))
+		task_work_run();
 
 	if (unlikely(uprobe_deny_signal()))
 		return 0;
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: lockdep trace from posix timers
  2012-08-24 18:56                           ` Oleg Nesterov
  2012-08-26 19:11                             ` [PATCH 0/4] (Was: lockdep trace from posix timers) Oleg Nesterov
@ 2012-08-28 16:29                             ` Peter Zijlstra
  2012-08-28 17:01                               ` Oleg Nesterov
  1 sibling, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-28 16:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On Fri, 2012-08-24 at 20:56 +0200, Oleg Nesterov wrote:
> 
> Peter, if you think it can work for you and if you agree with
> the implementation I will be happy to send the patch. 

Yeah I think it would work, but I'm not sure why you're introducing the
cmp_xchg helper just for this..

Anyway, how about something like the below? It pops the works one by one
when running; that way the cancel will only return NULL when the work is
either being executed or has already been executed.

( And yeah, I know, it's not FIFO ;-)

---
 include/linux/task_work.h |    7 +--
 kernel/exit.c             |    2 +-
 kernel/task_work.c        |  130 +++++++++++++++++++++++++--------------------
 3 files changed, 75 insertions(+), 64 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index fb46b03..f365416 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -15,11 +15,6 @@ init_task_work(struct callback_head *twork, task_work_func_t func)
 int task_work_add(struct task_struct *task, struct callback_head *twork, bool);
 struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
 void task_work_run(void);
-
-static inline void exit_task_work(struct task_struct *task)
-{
-	if (unlikely(task->task_works))
-		task_work_run();
-}
+void task_work_exit(void);
 
 #endif	/* _LINUX_TASK_WORK_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index f65345f..92aa94b 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -992,7 +992,7 @@ void do_exit(long code)
 	exit_shm(tsk);
 	exit_files(tsk);
 	exit_fs(tsk);
-	exit_task_work(tsk);
+	task_work_exit();
 	check_stack_usage();
 	exit_thread();
 
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 91d4e17..7767924 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -2,79 +2,95 @@
 #include <linux/task_work.h>
 #include <linux/tracehook.h>
 
+static void task_work_nop(struct callback_head *work)
+{
+}
+
+static struct callback_head dead = {
+	.next = NULL,
+	.func = task_work_nop,
+};
+
 int
-task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
+task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
 {
-	struct callback_head *last, *first;
-	unsigned long flags;
-
-	/*
-	 * Not inserting the new work if the task has already passed
-	 * exit_task_work() is the responisbility of callers.
-	 */
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	last = task->task_works;
-	first = last ? last->next : twork;
-	twork->next = first;
-	if (last)
-		last->next = twork;
-	task->task_works = twork;
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+	struct callback_head **head = &task->task_works;
+	struct callback_head *entry, *old_entry;
+
+	entry = *head;
+	for (;;) {
+		if (entry == &dead)
+			return -ESRCH;
+
+		old_entry = entry;
+		work->next = entry;
+		entry = cmpxchg(head, old_entry, work);
+		if (entry == old_entry)
+			break;
+	}
 
 	/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
 	if (notify)
 		set_notify_resume(task);
+
 	return 0;
 }
 
 struct callback_head *
 task_work_cancel(struct task_struct *task, task_work_func_t func)
 {
-	unsigned long flags;
-	struct callback_head *last, *res = NULL;
-
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	last = task->task_works;
-	if (last) {
-		struct callback_head *q = last, *p = q->next;
-		while (1) {
-			if (p->func == func) {
-				q->next = p->next;
-				if (p == last)
-					task->task_works = q == p ? NULL : q;
-				res = p;
-				break;
-			}
-			if (p == last)
-				break;
-			q = p;
-			p = q->next;
+	struct callback_head **workp, *work;
+
+again:
+	workp = &task->task_works;
+	work = *workp;
+	while (work) {
+		if (work->func == func) {
+			if (cmpxchg(workp, work, work->next) == work)
+				return work;
+			goto again;
 		}
+
+		workp = &work->next;
+		work = *workp;
 	}
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-	return res;
+
+	return NULL;
 }
 
-void task_work_run(void)
+static struct callback_head *task_work_pop(void)
 {
-	struct task_struct *task = current;
-	struct callback_head *p, *q;
-
-	while (1) {
-		raw_spin_lock_irq(&task->pi_lock);
-		p = task->task_works;
-		task->task_works = NULL;
-		raw_spin_unlock_irq(&task->pi_lock);
-
-		if (unlikely(!p))
-			return;
-
-		q = p->next; /* head */
-		p->next = NULL; /* cut it */
-		while (q) {
-			p = q->next;
-			q->func(q);
-			q = p;
-		}
+	struct callback_head **head = &current->task_works;
+	struct callback_head *entry, *old_entry;
+
+	entry = *head;
+	for (;;) {
+		if (!entry || entry == &dead)
+			return NULL;
+
+		old_entry = entry;
+		entry = cmpxchg(head, entry, entry->next);
+		if (entry == old_entry)
+			break;
 	}
+
+	return entry;
+}
+
+void task_work_run(void)
+{
+	struct callback_head *work;
+
+	for (work = task_work_pop(); work; work = task_work_pop())
+		work->func(work);
+}
+
+void task_work_exit(void)
+{
+	struct callback_head **head = &current->task_works;
+
+again:
+	task_work_run();
+	if (cmpxchg(head, NULL, &dead) != NULL)
+		goto again;
 }



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: lockdep trace from posix timers
  2012-08-28 16:29                             ` lockdep trace from posix timers Peter Zijlstra
@ 2012-08-28 17:01                               ` Oleg Nesterov
  2012-08-28 17:12                                 ` Oleg Nesterov
  2012-08-28 17:28                                 ` Peter Zijlstra
  0 siblings, 2 replies; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-28 17:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/28, Peter Zijlstra wrote:
>
> On Fri, 2012-08-24 at 20:56 +0200, Oleg Nesterov wrote:
> >
> > Peter, if you think it can work for you and if you agree with
> > the implementation I will be happy to send the patch.
>
> Yeah I think it would work, but I'm not sure why you're introducing the
> cmp_xchg helper just for this..

Please look at patches 1-4 I sent (only 1-2 are relevant); I removed
this helper there. I still think it makes sense, but of course not in
task_work.c.

>  struct callback_head *
>  task_work_cancel(struct task_struct *task, task_work_func_t func)
>  {
> -	unsigned long flags;
> -	struct callback_head *last, *res = NULL;
> -
> -	raw_spin_lock_irqsave(&task->pi_lock, flags);
> -	last = task->task_works;
> -	if (last) {
> -		struct callback_head *q = last, *p = q->next;
> -		while (1) {
> -			if (p->func == func) {
> -				q->next = p->next;
> -				if (p == last)
> -					task->task_works = q == p ? NULL : q;
> -				res = p;
> -				break;
> -			}
> -			if (p == last)
> -				break;
> -			q = p;
> -			p = q->next;
> +	struct callback_head **workp, *work;
> +
> +again:
> +	workp = &task->task_works;
> +	work = *workp;
> +	while (work) {
> +		if (work->func == func) {

But you can't dereference this pointer. Without some locking this
can race with another task_work_cancel() or task_work_run(), this
work can be freed/unmapped/reused.

> +			if (cmpxchg(workp, work, work->next) == work)
> +				return work;

Or this can race with task_work_cancel(work) + task_work_add(work).
cmpxchg() can succeed even if work->func is already different.

> +static callback_head *task_work_pop(void)
>  {
> -	struct task_struct *task = current;
> -	struct callback_head *p, *q;
> -
> -	while (1) {
> -		raw_spin_lock_irq(&task->pi_lock);
> -		p = task->task_works;
> -		task->task_works = NULL;
> -		raw_spin_unlock_irq(&task->pi_lock);
> -
> -		if (unlikely(!p))
> -			return;
> -
> -		q = p->next; /* head */
> -		p->next = NULL; /* cut it */
> -		while (q) {
> -			p = q->next;
> -			q->func(q);
> -			q = p;
> -		}
> +	struct callback_head **head = &current->task_work;
> +	struct callback_head *entry, *old_entry;
> +
> +	entry = *head;
> +	for (;;) {
> +		if (!entry || entry == &dead)
> +			return NULL;
> +
> +		old_entry = entry;
> +		entry = cmpxchg(head, entry, entry->next);

Well, this obviously means cmpxchg() for each entry...

> ( And yeah, I know, it's not FIFO ;-)

Cough. akpm didn't like fifo, Linus disliked it too...

And now you! What's going on??? ;)

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: lockdep trace from posix timers
  2012-08-28 17:01                               ` Oleg Nesterov
@ 2012-08-28 17:12                                 ` Oleg Nesterov
  2012-08-28 17:28                                 ` Peter Zijlstra
  1 sibling, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-28 17:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/28, Oleg Nesterov wrote:
>
> On 08/28, Peter Zijlstra wrote:
> >
> > +again:
> > +	workp = &task->task_works;
> > +	work = *workp;
> > +	while (work) {
> > +		if (work->func == func) {
>
> But you can't dereference this pointer. Without some locking this
> can race with another task_work_cancel() or task_work_run(), this
> work can be freed/unmapped/reused.
>
> > +			if (cmpxchg(workp, work, work->next) == work)
> > +				return work;
>
> Or this can race with task_work_cancel(work) + task_work_add(work).
> cmpxchg() can succeed even if work->func is already different.

Even simpler, this can race with another task_work_cancel() which
is going to remove work->next.

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: lockdep trace from posix timers
  2012-08-28 17:01                               ` Oleg Nesterov
  2012-08-28 17:12                                 ` Oleg Nesterov
@ 2012-08-28 17:28                                 ` Peter Zijlstra
  2012-08-29 15:25                                   ` Oleg Nesterov
  1 sibling, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2012-08-28 17:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On Tue, 2012-08-28 at 19:01 +0200, Oleg Nesterov wrote:
> >  struct callback_head *
> >  task_work_cancel(struct task_struct *task, task_work_func_t func)
> >  {
> > +     struct callback_head **workp, *work;
> > +
> > +again:
> > +     workp = &task->task_works;
> > +     work = *workp;
> > +     while (work) {
> > +             if (work->func == func) {
> 
> But you can't dereference this pointer. Without some locking this
> can race with another task_work_cancel() or task_work_run(), this
> work can be freed/unmapped/reused.
> 
> > +                     if (cmpxchg(workp, work, work->next) == work)
> > +                             return work;
> 
> Or this can race with task_work_cancel(work) + task_work_add(work).
> cmpxchg() can succeed even if work->func is already different. 

Bah.. you and your races ;-)

Surely we can do this locklessly.. I'll go try harder still.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: lockdep trace from posix timers
  2012-08-28 17:28                                 ` Peter Zijlstra
@ 2012-08-29 15:25                                   ` Oleg Nesterov
  0 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2012-08-29 15:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells, Al Viro

On 08/28, Peter Zijlstra wrote:
>
> Surely we can do this locklessly.. I'll go try harder still.

I doubt...

Even ignoring the work->func check, somehow you need to ensure that
work->next == new can't be changed during cmpxchg(..., new).

Anyway, if this is possible, can't you do it on top of the patches 1-4
I sent? They are simple, and solve the problems we discussed.

Off-topic. This is really minor, but can't we simplify llist_add()?

	static inline bool llist_add(struct llist_node *new, struct llist_head *head)
	{
		struct llist_node *old;

		do {
			old = ACCESS_ONCE(head->first);
			new->next = old;
		} while (cmpxchg(&head->first, old, new) != old);

		return old == NULL;
	}

looks simpler and saves a couple of insns. The likely case should
assume that cmpxchg() succeeds after the 1st attempt. If it fails,
another LOAD from head->first should be very cheap.

And note this ACCESS_ONCE(head->first) above. I think that (in theory)
the current code needs it too. But only in theory, I guess.

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/4] (Was: lockdep trace from posix timers)
  2012-08-26 19:11                             ` [PATCH 0/4] (Was: lockdep trace from posix timers) Oleg Nesterov
                                                 ` (3 preceding siblings ...)
  2012-08-26 19:12                               ` [PATCH 4/4] task_work: simplify the usage in ptrace_notify() and get_signal_to_deliver() Oleg Nesterov
@ 2012-09-06 18:01                               ` Oleg Nesterov
  2012-09-06 18:35                                 ` Peter Zijlstra
  4 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2012-09-06 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Al Viro
  Cc: Dave Jones, Linux Kernel, Thomas Gleixner, rostedt, dhowells,
	Linus Torvalds

Ping...

Al, will you agree with these changes?

Peter, do you think you can do your make-it-lockless patch (hehe, I
think this is not possible ;) on top?

On 08/26, Oleg Nesterov wrote:
>
> On 08/24, Oleg Nesterov wrote:
> >
> > Peter, if you think it can work for you and if you agree with
> > the implementation I will be happy to send the patch.
>
> I think I should try anyway ;)
>
> To simplify the review, I attached the resulting code below.
>
> Changes:
>
> 	- Comments.
>
> 	- Not sure this is really better, but task_work_run()
> 	  does not need to actually take pi_lock, unlock_wait
> 	  is enough.
>
> 	  However, in this case the dummy entry is better than
> 	  the fake pointer.
>
> Oleg.
>
> #include <linux/spinlock.h>
> #include <linux/task_work.h>
> #include <linux/tracehook.h>
>
> static struct callback_head work_exited; /* all we need is ->next == NULL */
>
> int
> task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
> {
> 	struct callback_head *head;
>
> 	do {
> 		head = ACCESS_ONCE(task->task_works);
> 		if (unlikely(head == &work_exited))
> 			return -ESRCH;
> 		work->next = head;
> 	} while (cmpxchg(&task->task_works, head, work) != head);
>
> 	if (notify)
> 		set_notify_resume(task);
> 	return 0;
> }
>
> struct callback_head *
> task_work_cancel(struct task_struct *task, task_work_func_t func)
> {
> 	struct callback_head **pprev = &task->task_works;
> 	struct callback_head *work = NULL;
> 	unsigned long flags;
> 	/*
> 	 * If cmpxchg() fails we continue without updating pprev.
> 	 * Either we raced with task_work_add() which added the
> 	 * new entry before this work, we will find it again. Or
> 	 * we raced with task_work_run(), *pprev == NULL/exited.
> 	 */
> 	raw_spin_lock_irqsave(&task->pi_lock, flags);
> 	while ((work = ACCESS_ONCE(*pprev))) {
> 		read_barrier_depends();
> 		if (work->func != func)
> 			pprev = &work->next;
> 		else if (cmpxchg(pprev, work, work->next) == work)
> 			break;
> 	}
> 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
>
> 	return work;
> }
>
> void task_work_run(void)
> {
> 	struct task_struct *task = current;
> 	struct callback_head *work, *head, *next;
>
> 	for (;;) {
> 		/*
> 		 * work->func() can do task_work_add(), do not set
> 		 * work_exited unless the list is empty.
> 		 */
> 		do {
> 			work = ACCESS_ONCE(task->task_works);
> 			head = !work && (task->flags & PF_EXITING) ?
> 				&work_exited : NULL;
> 		} while (cmpxchg(&task->task_works, work, head) != work);
>
> 		if (!work)
> 			break;
> 		/*
> 		 * Synchronize with task_work_cancel(). It can't remove
> 		 * the first entry == work, cmpxchg(task_works) should
> 		 * fail, but it can play with *work and other entries.
> 		 */
> 		raw_spin_unlock_wait(&task->pi_lock);
> 		smp_mb();
>
> 		/* Reverse the list to run the works in fifo order */
> 		head = NULL;
> 		do {
> 			next = work->next;
> 			work->next = head;
> 			head = work;
> 			work = next;
> 		} while (work);
>
> 		work = head;
> 		do {
> 			next = work->next;
> 			work->func(work);
> 			work = next;
> 			cond_resched();
> 		} while (work);
> 	}
> }


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/4] (Was: lockdep trace from posix timers)
  2012-09-06 18:01                               ` [PATCH 0/4] (Was: lockdep trace from posix timers) Oleg Nesterov
@ 2012-09-06 18:35                                 ` Peter Zijlstra
  2012-09-07 13:13                                   ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Peter Zijlstra @ 2012-09-06 18:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Dave Jones, Linux Kernel, Thomas Gleixner, rostedt,
	dhowells, Linus Torvalds

On Thu, 2012-09-06 at 20:01 +0200, Oleg Nesterov wrote:
> Ping...

Right, email backlog :-)

> Peter, do you think you can do your make-it-lockless patch (hehe, I
> think this is not possible ;) on top?

Sure, I was trying to see if I could play games with the _cancel
semantics that would satisfy the two callsites and be possible. No joy
yet though.

Anyway, do you want me to take these patches through -tip?


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 0/4] (Was: lockdep trace from posix timers)
  2012-09-06 18:35                                 ` Peter Zijlstra
@ 2012-09-07 13:13                                   ` Oleg Nesterov
  0 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2012-09-07 13:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Al Viro, Dave Jones, Linux Kernel, Thomas Gleixner, rostedt,
	dhowells, Linus Torvalds

On 09/06, Peter Zijlstra wrote:
>
> Anyway, do you want me to take these patches through -tip?

This would be great, thanks.

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [tip:core/urgent] task_work: Make task_work_add() lockless
  2012-08-26 19:12                               ` [PATCH 1/4] task_work: make task_work_add() lockless Oleg Nesterov
@ 2012-09-14  6:08                                 ` tip-bot for Oleg Nesterov
  2012-09-24 19:27                                 ` [PATCH 1/4] task_work: make " Geert Uytterhoeven
  1 sibling, 0 replies; 54+ messages in thread
From: tip-bot for Oleg Nesterov @ 2012-09-14  6:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, a.p.zijlstra, peterz, viro,
	akpm, tglx, oleg

Commit-ID:  ac3d0da8f3290b3d394cdb7f50604424a7cd6092
Gitweb:     http://git.kernel.org/tip/ac3d0da8f3290b3d394cdb7f50604424a7cd6092
Author:     Oleg Nesterov <oleg@redhat.com>
AuthorDate: Sun, 26 Aug 2012 21:12:09 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 13 Sep 2012 16:47:33 +0200

task_work: Make task_work_add() lockless

Change task_work's to use llist-like code to avoid pi_lock
in task_work_add(), this makes it useable under rq->lock.

task_work_cancel() and task_work_run() still use pi_lock
to synchronize with each other.

(This is in preparation for a deadlock fix.)

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120826191209.GA4221@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/task_work.c |   95 ++++++++++++++++++++++++++-------------------------
 1 files changed, 48 insertions(+), 47 deletions(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index d320d44..f13ec0b 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -3,25 +3,18 @@
 #include <linux/tracehook.h>
 
 int
-task_work_add(struct task_struct *task, struct callback_head *twork, bool notify)
+task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
 {
-	struct callback_head *last, *first;
-	unsigned long flags;
-
+	struct callback_head *head;
 	/*
 	 * Not inserting the new work if the task has already passed
 	 * exit_task_work() is the responisbility of callers.
 	 */
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	last = task->task_works;
-	first = last ? last->next : twork;
-	twork->next = first;
-	if (last)
-		last->next = twork;
-	task->task_works = twork;
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+	do {
+		head = ACCESS_ONCE(task->task_works);
+		work->next = head;
+	} while (cmpxchg(&task->task_works, head, work) != head);
 
-	/* test_and_set_bit() implies mb(), see tracehook_notify_resume(). */
 	if (notify)
 		set_notify_resume(task);
 	return 0;
@@ -30,52 +23,60 @@ task_work_add(struct task_struct *task, struct callback_head *twork, bool notify
 struct callback_head *
 task_work_cancel(struct task_struct *task, task_work_func_t func)
 {
+	struct callback_head **pprev = &task->task_works;
+	struct callback_head *work = NULL;
 	unsigned long flags;
-	struct callback_head *last, *res = NULL;
-
+	/*
+	 * If cmpxchg() fails we continue without updating pprev.
+	 * Either we raced with task_work_add() which added the
+	 * new entry before this work, we will find it again. Or
+	 * we raced with task_work_run(), *pprev == NULL.
+	 */
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	last = task->task_works;
-	if (last) {
-		struct callback_head *q = last, *p = q->next;
-		while (1) {
-			if (p->func == func) {
-				q->next = p->next;
-				if (p == last)
-					task->task_works = q == p ? NULL : q;
-				res = p;
-				break;
-			}
-			if (p == last)
-				break;
-			q = p;
-			p = q->next;
-		}
+	while ((work = ACCESS_ONCE(*pprev))) {
+		read_barrier_depends();
+		if (work->func != func)
+			pprev = &work->next;
+		else if (cmpxchg(pprev, work, work->next) == work)
+			break;
 	}
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-	return res;
+
+	return work;
 }
 
 void task_work_run(void)
 {
 	struct task_struct *task = current;
-	struct callback_head *p, *q;
+	struct callback_head *work, *head, *next;
 
-	while (1) {
-		raw_spin_lock_irq(&task->pi_lock);
-		p = task->task_works;
-		task->task_works = NULL;
-		raw_spin_unlock_irq(&task->pi_lock);
+	for (;;) {
+		work = xchg(&task->task_works, NULL);
+		if (!work)
+			break;
+		/*
+		 * Synchronize with task_work_cancel(). It can't remove
+		 * the first entry == work, cmpxchg(task_works) should
+		 * fail, but it can play with *work and other entries.
+		 */
+		raw_spin_unlock_wait(&task->pi_lock);
+		smp_mb();
 
-		if (unlikely(!p))
-			return;
+		/* Reverse the list to run the works in fifo order */
+		head = NULL;
+		do {
+			next = work->next;
+			work->next = head;
+			head = work;
+			work = next;
+		} while (work);
 
-		q = p->next; /* head */
-		p->next = NULL; /* cut it */
-		while (q) {
-			p = q->next;
-			q->func(q);
-			q = p;
+		work = head;
+		do {
+			next = work->next;
+			work->func(work);
+			work = next;
 			cond_resched();
-		}
+		} while (work);
 	}
 }

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [tip:core/urgent] task_work: task_work_add() should not succeed after exit_task_work()
  2012-08-26 19:12                               ` [PATCH 2/4] task_work: task_work_add() should not succeed after exit_task_work() Oleg Nesterov
@ 2012-09-14  6:09                                 ` tip-bot for Oleg Nesterov
  0 siblings, 0 replies; 54+ messages in thread
From: tip-bot for Oleg Nesterov @ 2012-09-14  6:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, a.p.zijlstra, viro, akpm, tglx, oleg

Commit-ID:  9da33de62431c7839f98156720862262272a8380
Gitweb:     http://git.kernel.org/tip/9da33de62431c7839f98156720862262272a8380
Author:     Oleg Nesterov <oleg@redhat.com>
AuthorDate: Sun, 26 Aug 2012 21:12:11 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 13 Sep 2012 16:47:34 +0200

task_work: task_work_add() should not succeed after exit_task_work()

ed3e694d "move exit_task_work() past exit_files() et.al" destroyed
the add/exit synchronization we had, the caller itself should ensure
task_work_add() can't race with the exiting task.

However, this is not convenient/simple, and the only user which tries
to do this is buggy (see the next patch). Unless the task is current,
there is simply no way to do this in general.

Change exit_task_work()->task_work_run() to use the dummy "work_exited"
entry to let task_work_add() know it should fail.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120826191211.GA4228@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/task_work.h |    3 +--
 kernel/task_work.c        |   22 ++++++++++++++++------
 2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index fb46b03..ca5a1cf 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -18,8 +18,7 @@ void task_work_run(void);
 
 static inline void exit_task_work(struct task_struct *task)
 {
-	if (unlikely(task->task_works))
-		task_work_run();
+	task_work_run();
 }
 
 #endif	/* _LINUX_TASK_WORK_H */
diff --git a/kernel/task_work.c b/kernel/task_work.c
index f13ec0b..65bd3c9 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -2,16 +2,17 @@
 #include <linux/task_work.h>
 #include <linux/tracehook.h>
 
+static struct callback_head work_exited; /* all we need is ->next == NULL */
+
 int
 task_work_add(struct task_struct *task, struct callback_head *work, bool notify)
 {
 	struct callback_head *head;
-	/*
-	 * Not inserting the new work if the task has already passed
-	 * exit_task_work() is the responisbility of callers.
-	 */
+
 	do {
 		head = ACCESS_ONCE(task->task_works);
+		if (unlikely(head == &work_exited))
+			return -ESRCH;
 		work->next = head;
 	} while (cmpxchg(&task->task_works, head, work) != head);
 
@@ -30,7 +31,7 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
 	 * If cmpxchg() fails we continue without updating pprev.
 	 * Either we raced with task_work_add() which added the
 	 * new entry before this work, we will find it again. Or
-	 * we raced with task_work_run(), *pprev == NULL.
+	 * we raced with task_work_run(), *pprev == NULL/exited.
 	 */
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 	while ((work = ACCESS_ONCE(*pprev))) {
@@ -51,7 +52,16 @@ void task_work_run(void)
 	struct callback_head *work, *head, *next;
 
 	for (;;) {
-		work = xchg(&task->task_works, NULL);
+		/*
+		 * work->func() can do task_work_add(), do not set
+		 * work_exited unless the list is empty.
+		 */
+		do {
+			work = ACCESS_ONCE(task->task_works);
+			head = !work && (task->flags & PF_EXITING) ?
+				&work_exited : NULL;
+		} while (cmpxchg(&task->task_works, work, head) != work);
+
 		if (!work)
 			break;
 		/*

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [tip:core/urgent] task_work: Revert " hold task_lock around checks in keyctl"
  2012-08-26 19:12                               ` [PATCH 3/4] task_work: revert d35abdb2 "hold task_lock around checks in keyctl" Oleg Nesterov
@ 2012-09-14  6:10                                 ` tip-bot for Oleg Nesterov
  0 siblings, 0 replies; 54+ messages in thread
From: tip-bot for Oleg Nesterov @ 2012-09-14  6:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, a.p.zijlstra, davej, viro,
	akpm, tglx, oleg

Commit-ID:  b3f68f16dbcde6fcdf0fd27695391ff7e9d41233
Gitweb:     http://git.kernel.org/tip/b3f68f16dbcde6fcdf0fd27695391ff7e9d41233
Author:     Oleg Nesterov <oleg@redhat.com>
AuthorDate: Sun, 26 Aug 2012 21:12:14 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 13 Sep 2012 16:47:36 +0200

task_work: Revert "hold task_lock around checks in keyctl"

This reverts commit d35abdb28824cf74f0a106a0f9c6f3ff700a35bf.

task_lock() was added to ensure that exit_mm(), and thus
exit_task_work(), cannot run before task_work_add().

This is wrong: task_lock() must not be nested inside
write_lock(tasklist_lock). And it is no longer needed:
task_work_add() now fails if it is called after exit_task_work().
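
For context, a condensed sketch of the locking in
keyctl_session_to_parent() around this change.  Error paths and the
keyring work are elided; the lock calls are the ones visible in the
hunks below, and the task_work_add() call is paraphrased from the
surrounding function:

	/* with d35abdb2 (now reverted): alloc_lock nested inside the
	 * irq-safe tasklist_lock */
	write_lock_irq(&tasklist_lock);
	task_lock(parent);			/* takes parent->alloc_lock */
	...
	ret = task_work_add(parent, newwork, true);
	task_unlock(parent);
	write_unlock_irq(&tasklist_lock);

	/* after the revert: no alloc_lock under tasklist_lock; a parent
	 * that has already exited is caught by task_work_add() failing */
	write_lock_irq(&tasklist_lock);
	...
	ret = task_work_add(parent, newwork, true);
	if (!ret)
		newwork = NULL;
	write_unlock_irq(&tasklist_lock);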

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120826191214.GA4231@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 security/keys/keyctl.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 3364fbf..6cfc647 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -1486,7 +1486,6 @@ long keyctl_session_to_parent(void)
 	oldwork = NULL;
 	parent = me->real_parent;
 
-	task_lock(parent);
 	/* the parent mustn't be init and mustn't be a kernel thread */
 	if (parent->pid <= 1 || !parent->mm)
 		goto unlock;
@@ -1530,7 +1529,6 @@ long keyctl_session_to_parent(void)
 	if (!ret)
 		newwork = NULL;
 unlock:
-	task_unlock(parent);
 	write_unlock_irq(&tasklist_lock);
 	rcu_read_unlock();
 	if (oldwork)


* [tip:core/urgent] task_work: Simplify the usage in ptrace_notify() and get_signal_to_deliver()
  2012-08-26 19:12                               ` [PATCH 4/4] task_work: simplify the usage in ptrace_notify() and get_signal_to_deliver() Oleg Nesterov
@ 2012-09-14  6:11                                 ` tip-bot for Oleg Nesterov
  0 siblings, 0 replies; 54+ messages in thread
From: tip-bot for Oleg Nesterov @ 2012-09-14  6:11 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, a.p.zijlstra, viro, akpm, tglx, oleg

Commit-ID:  f784e8a7989c0da3062d04bfea3db90f41e8f738
Gitweb:     http://git.kernel.org/tip/f784e8a7989c0da3062d04bfea3db90f41e8f738
Author:     Oleg Nesterov <oleg@redhat.com>
AuthorDate: Sun, 26 Aug 2012 21:12:17 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 13 Sep 2012 16:47:37 +0200

task_work: Simplify the usage in ptrace_notify() and get_signal_to_deliver()

ptrace_notify() and get_signal_to_deliver() do unnecessary things
before task_work_run():

1. smp_mb__after_clear_bit() is not needed: test_and_clear_bit()
   already implies a full mb(). (See the sketch after this list.)

2. In fact the barrier is not needed at all; in this case we only
   care about the "synchronous" works added by the task itself.

3. There is no need to clear TIF_NOTIFY_RESUME, and we should not
   assume task_works is the only user of this flag.
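
For reference, the bitop semantics point 1 relies on, sketched with
made-up names (MY_FLAG, flags and do_work() are not from this patch):

	#include <linux/bitops.h>

	#define MY_FLAG	0
	static unsigned long flags;

	static void do_work(void) { }

	static void needs_explicit_barrier(void)
	{
		clear_bit(MY_FLAG, &flags);
		/* clear_bit() is not a barrier; order it against later accesses */
		smp_mb__after_clear_bit();
		do_work();
	}

	static void barrier_already_implied(void)
	{
		/* value-returning atomic: full barrier on both sides, so no
		 * explicit smp_mb__after_clear_bit() is needed */
		if (test_and_clear_bit(MY_FLAG, &flags))
			do_work();
	}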

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120826191217.GA4238@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/signal.c |   18 ++++--------------
 1 files changed, 4 insertions(+), 14 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index be4f856..2c681f1 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1971,13 +1971,8 @@ static void ptrace_do_notify(int signr, int exit_code, int why)
 void ptrace_notify(int exit_code)
 {
 	BUG_ON((exit_code & (0x7f | ~0xffff)) != SIGTRAP);
-	if (unlikely(current->task_works)) {
-		if (test_and_clear_ti_thread_flag(current_thread_info(),
-						   TIF_NOTIFY_RESUME)) {
-			smp_mb__after_clear_bit();
-			task_work_run();
-		}
-	}
+	if (unlikely(current->task_works))
+		task_work_run();
 
 	spin_lock_irq(&current->sighand->siglock);
 	ptrace_do_notify(SIGTRAP, exit_code, CLD_TRAPPED);
@@ -2198,13 +2193,8 @@ int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
 	struct signal_struct *signal = current->signal;
 	int signr;
 
-	if (unlikely(current->task_works)) {
-		if (test_and_clear_ti_thread_flag(current_thread_info(),
-						   TIF_NOTIFY_RESUME)) {
-			smp_mb__after_clear_bit();
-			task_work_run();
-		}
-	}
+	if (unlikely(current->task_works))
+		task_work_run();
 
 	if (unlikely(uprobe_deny_signal()))
 		return 0;


* Re: [PATCH 1/4] task_work: make task_work_add() lockless
  2012-08-26 19:12                               ` [PATCH 1/4] task_work: make task_work_add() lockless Oleg Nesterov
  2012-09-14  6:08                                 ` [tip:core/urgent] task_work: Make " tip-bot for Oleg Nesterov
@ 2012-09-24 19:27                                 ` Geert Uytterhoeven
  2012-09-24 20:37                                   ` Oleg Nesterov
  1 sibling, 1 reply; 54+ messages in thread
From: Geert Uytterhoeven @ 2012-09-24 19:27 UTC (permalink / raw)
  To: Oleg Nesterov, Yoshinori Sato
  Cc: Peter Zijlstra, Al Viro, Dave Jones, Linux Kernel,
	Thomas Gleixner, rostedt, dhowells, Linus Torvalds, Linux-Next

On Sun, Aug 26, 2012 at 9:12 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> Change task_works to use llist-like code to avoid pi_lock
> in task_work_add(), this makes it useable under rq->lock.
>
> task_work_cancel() and task_work_run() still use pi_lock
> to synchronize with each other.
>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c

> @@ -30,52 +23,60 @@ task_work_add(struct task_struct *task, struct callback_head *twork, bool notify
>  struct callback_head *
>  task_work_cancel(struct task_struct *task, task_work_func_t func)
>  {
> +       struct callback_head **pprev = &task->task_works;
> +       struct callback_head *work = NULL;
>         unsigned long flags;
> -       struct callback_head *last, *res = NULL;
> -
> +       /*
> +        * If cmpxchg() fails we continue without updating pprev.
> +        * Either we raced with task_work_add() which added the
> +        * new entry before this work, we will find it again. Or
> +        * we raced with task_work_run(), *pprev == NULL.
> +        */
>         raw_spin_lock_irqsave(&task->pi_lock, flags);
> -       last = task->task_works;
> -       if (last) {
> -               struct callback_head *q = last, *p = q->next;
> -               while (1) {
> -                       if (p->func == func) {
> -                               q->next = p->next;
> -                               if (p == last)
> -                                       task->task_works = q == p ? NULL : q;
> -                               res = p;
> -                               break;
> -                       }
> -                       if (p == last)
> -                               break;
> -                       q = p;
> -                       p = q->next;
> -               }
> +       while ((work = ACCESS_ONCE(*pprev))) {
> +               read_barrier_depends();

Woops, h8300 doesn't have read_barrier_depends():
kernel/task_work.c:38:3: error: implicit declaration of function
'read_barrier_depends' [-Werror=implicit-function-declaration]
cc1: some warnings being treated as errors
make[2]: *** [kernel/task_work.o] Error 1

http://kisskb.ellerman.id.au/kisskb/buildresult/7238385/

Perhaps an empty definition is fine? Most architectures have:

#define read_barrier_depends() do { } while(0)

The same issue on c6x just got fixed in -next in commit
2c8c2366077da5645bf9063b3cf5c94ecb16f691 ("c6x: use
asm-generic/barrier.h")

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: [PATCH 1/4] task_work: make task_work_add() lockless
  2012-09-24 19:27                                 ` [PATCH 1/4] task_work: make " Geert Uytterhoeven
@ 2012-09-24 20:37                                   ` Oleg Nesterov
  0 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2012-09-24 20:37 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Yoshinori Sato, Peter Zijlstra, Al Viro, Dave Jones,
	Linux Kernel, Thomas Gleixner, rostedt, dhowells, Linus Torvalds,
	Linux-Next

On 09/24, Geert Uytterhoeven wrote:
>
> On Sun, Aug 26, 2012 at 9:12 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > +       while ((work = ACCESS_ONCE(*pprev))) {
> > +               read_barrier_depends();

Hmm. This should be smp_read_barrier_depends(), but this doesn't matter.

> Woops, h8300 doesn't have read_barrier_depends():
> kernel/task_work.c:38:3: error: implicit declaration of function
> 'read_barrier_depends' [-Werror=implicit-function-declaration]
> cc1: some warnings being treated as errors
> make[2]: *** [kernel/task_work.o] Error 1

Thanks...

> http://kisskb.ellerman.id.au/kisskb/buildresult/7238385/
>
> Perhaps an empty definition is fine? Most architectures have:
>
> #define read_barrier_depends() do { } while(0)

Yes. arch/h8300/include/asm/barrier.h has

	#define smp_read_barrier_depends()      read_barrier_depends()

so probably it should define read_barrier_depends() as well?
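
Something along these lines, perhaps (untested, just a guess at the
minimal fix):

	/* arch/h8300/include/asm/barrier.h */
	#define read_barrier_depends()	do { } while (0)

with the existing

	#define smp_read_barrier_depends()	read_barrier_depends()

left as is.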

Oleg.




Thread overview: 54+ messages
-- links below jump to the message on this page --
2012-07-24 20:36 lockdep trace from posix timers Dave Jones
2012-07-27 16:20 ` Dave Jones
2012-08-16 12:54   ` Ming Lei
2012-08-16 14:03     ` Dave Jones
2012-08-16 18:07 ` Peter Zijlstra
2012-08-17 15:14   ` Oleg Nesterov
2012-08-17 15:17     ` Oleg Nesterov
2012-08-17 16:40       ` task_work_add() should not succeed unconditionally (Was: lockdep trace from posix timers) Oleg Nesterov
2012-08-20  7:15     ` lockdep trace from posix timers Peter Zijlstra
2012-08-20 11:44       ` Peter Zijlstra
2012-08-20 11:46         ` Peter Zijlstra
2012-08-20 11:50         ` Peter Zijlstra
2012-08-20 12:19           ` Steven Rostedt
2012-08-20 12:20             ` Peter Zijlstra
2012-08-20 14:59         ` Oleg Nesterov
2012-08-20 15:10           ` Peter Zijlstra
2012-08-20 15:27             ` Peter Zijlstra
2012-08-20 15:32               ` Oleg Nesterov
2012-08-20 15:46                 ` Peter Zijlstra
2012-08-20 15:58                   ` Oleg Nesterov
2012-08-20 16:03                     ` Peter Zijlstra
2012-08-20 15:05         ` Oleg Nesterov
2012-08-20 15:12           ` Peter Zijlstra
2012-08-20 15:41             ` Oleg Nesterov
2012-08-20 15:56               ` Peter Zijlstra
2012-08-20 16:10                 ` Oleg Nesterov
2012-08-20 16:19                   ` Peter Zijlstra
2012-08-20 16:23                     ` Oleg Nesterov
2012-08-21 18:27                       ` Oleg Nesterov
2012-08-21 18:34                         ` Oleg Nesterov
2012-08-24 18:56                           ` Oleg Nesterov
2012-08-26 19:11                             ` [PATCH 0/4] (Was: lockdep trace from posix timers) Oleg Nesterov
2012-08-26 19:12                               ` [PATCH 1/4] task_work: make task_work_add() lockless Oleg Nesterov
2012-09-14  6:08                                 ` [tip:core/urgent] task_work: Make " tip-bot for Oleg Nesterov
2012-09-24 19:27                                 ` [PATCH 1/4] task_work: make " Geert Uytterhoeven
2012-09-24 20:37                                   ` Oleg Nesterov
2012-08-26 19:12                               ` [PATCH 2/4] task_work: task_work_add() should not succeed after exit_task_work() Oleg Nesterov
2012-09-14  6:09                                 ` [tip:core/urgent] " tip-bot for Oleg Nesterov
2012-08-26 19:12                               ` [PATCH 3/4] task_work: revert d35abdb2 "hold task_lock around checks in keyctl" Oleg Nesterov
2012-09-14  6:10                                 ` [tip:core/urgent] task_work: Revert " hold " tip-bot for Oleg Nesterov
2012-08-26 19:12                               ` [PATCH 4/4] task_work: simplify the usage in ptrace_notify() and get_signal_to_deliver() Oleg Nesterov
2012-09-14  6:11                                 ` [tip:core/urgent] task_work: Simplify " tip-bot for Oleg Nesterov
2012-09-06 18:01                               ` [PATCH 0/4] (Was: lockdep trace from posix timers) Oleg Nesterov
2012-09-06 18:35                                 ` Peter Zijlstra
2012-09-07 13:13                                   ` Oleg Nesterov
2012-08-28 16:29                             ` lockdep trace from posix timers Peter Zijlstra
2012-08-28 17:01                               ` Oleg Nesterov
2012-08-28 17:12                                 ` Oleg Nesterov
2012-08-28 17:28                                 ` Peter Zijlstra
2012-08-29 15:25                                   ` Oleg Nesterov
2012-08-20 14:55       ` Oleg Nesterov
2012-08-20 15:43       ` Oleg Nesterov
2012-08-20 15:48         ` Peter Zijlstra
2012-08-20 15:58           ` Oleg Nesterov
