linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
@ 2014-05-12 20:45 Thomas Gleixner
  2014-05-12 20:45 ` [patch 1/3] rtmutex: Add missing deadlock check Thomas Gleixner
                   ` (4 more replies)
  0 siblings, 5 replies; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-12 20:45 UTC (permalink / raw)
  To: LKML
  Cc: Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Clark Williams,
	Paul McKenney, Lai Jiangshan, Roland McGrath, Carlos ODonell,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

Dave reported recently that the trinity syscall fuzzer triggers a
warning in the rtmutex code via the futex syscall.

It took me quite a while to understand the issue, until I was able to
write a minimalistic reproducer.

The first two patches address the issues and the third one is adding
an unrelated sanity check to prevent user space from assigning PI
futexes to kernel threads.

I ran it through my usual tests (except of one, see below), Darrens
futex tester and it holds up against a 400 threads trinity whacking
for well above 12hrs now.

Staring three days into futex.c/rtmutex.c and a few GB of traces
definitly brings one in a lousy mood. But I discovered two things
which made me outright grumpy:

1) I wanted to add a test case for the rtmutex deadlock detection
   issue and found out that the in kernel rt-mutex tester which I
   wrote for the purpose of validating rt-mutex corner cases without
   the futex maze involved. And it actually catches issues which are
   not covered by lockdep.

   That tester got broken by a performance improvement patch 3 years
   ago: commit 8161239a8 (rtmutex: Simplify PI algorithm and make
   highest prio task get lock)

   The changelog of that commit sayed:

       need updated after this patch is accepted
           1) Document/*
	   2) the testcase scripts/rt-tester/t4-l2-pi-deboost.tst
   
   The changelog is correct in that point: It still needs to be
   updated.

   But it should not have been merged at all before this got updated
   and fixed!

   So I'm not blaming Lai for this, though he should have followed up
   as well. I'm blaming the maintainer who merged that code.
 
   Steven, I told you more than once, that your damned performance and
   feature mania is completely unacceptable.

   When the patch was posted, you were reviewing it. When I wanted to
   test it, it did not apply and you told me you'll fix it and pick it
   up. Sure you made it apply and you sticked it into a git tree and
   let Ingo pull it.

   I didn't pay much attention as I was happy with the idea in
   general. I trusted your review and judgement. My problem.

   Why on earth did you merge that patch even if the changelog mentions
   that Documentation was left stale and the tester broken?

   This is beyond sloppy and outright stupid.

   Get this fixed now!

   And those fixes are not going to be merged without my Reviewed-by.

   I'm not going to let your multi-reader patches anywhere near RT,
   unless this Documentation and tester issue is resolved.

   After that I'm going through them with a fine comb, as I really lost
   the last modicom of hope, that you can do something without leaving
   trainwrecks left and right.


2) glibc fun

   While writing an isolated test case for the trinity wreckage, I
   found out that glibc has an interesting misfeature:

   strace tells me:

   futex(0x600e00, FUTEX_LOCK_PI_PRIVATE, 1) = -1 EINVAL (Invalid argument)

   but the return value of pthread_mutex_lock() is 0

   Yay, for another master piece of trainwreck engineering!

   I checked the glibc source and of course neither the lock nor the unlock path
   gets any error code from the syscall propagated.

   The handling of -EDEADLOCK is even more impressive. Instead of
   propagating it to the caller something in the guts of glibc calls pause().

     futex(0x601300, FUTEX_LOCK_PI_PRIVATE, 1) = -1 EDEADLK (Resource deadlock avoided)
     pause(

   So any kind of futex wreckage which is not detectable by glibc
   itself ends up in a disaster one way or the other. How is Joe user
   of this stuff who reads the manpages supposed to find out that the
   kernel detected an inconsistent user space variable?

   If someone of the glibc folks can be bothered to look after that,
   I'm happy to provide the simple test cases, but I'm sure that
   looking at the code which simply ignores or pauses on kernel error
   return values is enough to figure out why its broken.

   This does not only apply to the PI version, the bog standard
   futexes have error returns from the kernel as well, which are
   handled in the same absurd way.

Yours grumpy

      tglx



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-12 20:45 [patch 0/3] futex/rtmutex: Fix issues exposed by trinity Thomas Gleixner
@ 2014-05-12 20:45 ` Thomas Gleixner
  2014-05-13  5:51   ` Lai Jiangshan
  2014-05-12 20:45 ` [patch 2/3] futex: Add another early deadlock detection check Thomas Gleixner
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-12 20:45 UTC (permalink / raw)
  To: LKML
  Cc: Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Clark Williams,
	Paul McKenney, Lai Jiangshan, Roland McGrath, Carlos ODonell,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

[-- Attachment #1: rtmutex-add-missing-deadlock-check.patch --]
[-- Type: text/plain, Size: 2267 bytes --]

If a task T holds rtmutex L and has pending waiters on that lock then
an attempt of task T to recursivly lock L can escape the deadlock
detection if T itself does not end up being the top most waiter on
that lock. So it happily enqueues itself in the waiter list.

This was exposed by Dave Jones trinity syscall fuzzer:
  http://lkml.kernel.org/r/20140429151655.GA14277@redhat.com

The fix for the issue at hand is simple and more comment than actual
code: Test whether the new waiter task owns one of the locks in the
lock chain and handle it the same way as the other deadlock sites.

The problem has been in the rtmutex code forever.

I would have added a testcase for such a scenario to the rtmutex
tester, but the tester got wreckaged with commit 8161239a8 (rtmutex:
Simplify PI algorithm and make highest prio task get lock). So that
becomes a separate issue. Sigh!

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
---
 kernel/locking/rtmutex.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/locking/rtmutex.c
===================================================================
--- linux-2.6.orig/kernel/locking/rtmutex.c
+++ linux-2.6/kernel/locking/rtmutex.c
@@ -284,7 +284,7 @@ static int rt_mutex_adjust_prio_chain(st
 				      struct rt_mutex_waiter *orig_waiter,
 				      struct task_struct *top_task)
 {
-	struct rt_mutex *lock;
+	struct rt_mutex *lock = orig_lock;
 	struct rt_mutex_waiter *waiter, *top_waiter = orig_waiter;
 	int detect_deadlock, ret = 0, depth = 0;
 	unsigned long flags;
@@ -339,6 +339,22 @@ static int rt_mutex_adjust_prio_chain(st
 		goto out_unlock_pi;
 
 	/*
+	 * Deadlock check for the following scenario:
+	 *
+	 * T holds lock L and has waiters
+	 * T locks L again, but does not end up as it's own top waiter
+	 *
+	 * So we would drop out at the next check without noticing.
+	 *
+	 * Note, we need to check for orig_waiter as it might be NULL
+	 * when deboosting!
+	 */
+	if (orig_waiter && orig_waiter->task == rt_mutex_owner(lock)) {
+		ret = deadlock_detect ? -EDEADLK : 0;
+		goto out_unlock_pi;
+	}
+
+	/*
 	 * Drop out, when the task has no waiters. Note,
 	 * top_waiter can be NULL, when we are in the deboosting
 	 * mode!



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch 2/3] futex: Add another early deadlock detection check
  2014-05-12 20:45 [patch 0/3] futex/rtmutex: Fix issues exposed by trinity Thomas Gleixner
  2014-05-12 20:45 ` [patch 1/3] rtmutex: Add missing deadlock check Thomas Gleixner
@ 2014-05-12 20:45 ` Thomas Gleixner
  2014-05-19 12:22   ` [tip:core/urgent] " tip-bot for Thomas Gleixner
  2014-05-12 20:45 ` [patch 3/3] futex: Prevent attaching to kernel threads Thomas Gleixner
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-12 20:45 UTC (permalink / raw)
  To: LKML
  Cc: Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Clark Williams,
	Paul McKenney, Lai Jiangshan, Roland McGrath, Carlos ODonell,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

[-- Attachment #1: futex-add-early-deadlock-detection.patch --]
[-- Type: text/plain, Size: 4619 bytes --]

Dave Jones trinity syscall fuzzer exposed an issue in the deadlock
detection code of rtmutex:
  http://lkml.kernel.org/r/20140429151655.GA14277@redhat.com

That underlying issue has been fixed with a patch to the rtmutex code,
but the futex code must not call into rtmutex in that case because
    - it can detect that issue early
    - it avoids a different and more complex fixup for backing out

If the user space variable got manipulated to 0x80000000 which means
no lock holder, but the waiters bit set and an active pi_state in the
kernel is found we can figure out the recursive locking issue by
looking at the pi_state owner. If that is the current task, then we
can safely return -EDEADLK.

The check should have been added in commit 59fa62451 (futex: Handle
futex_pi OWNER_DIED take over correctly) already, but I did not see
the above issue caused by user space manipulation back then.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
---
 kernel/futex.c |   47 ++++++++++++++++++++++++++++++++++-------------
 1 file changed, 34 insertions(+), 13 deletions(-)

Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c
+++ linux-2.6/kernel/futex.c
@@ -745,7 +745,8 @@ void exit_pi_state_list(struct task_stru
 
 static int
 lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
-		union futex_key *key, struct futex_pi_state **ps)
+		union futex_key *key, struct futex_pi_state **ps,
+		struct task_struct *task)
 {
 	struct futex_pi_state *pi_state = NULL;
 	struct futex_q *this, *next;
@@ -786,6 +787,16 @@ lookup_pi_state(u32 uval, struct futex_h
 					return -EINVAL;
 			}
 
+			/*
+			 * Protect against a corrupted uval. If uval
+			 * is 0x80000000 then pid is 0 and the waiter
+			 * bit is set. So the deadlock check in the
+			 * calling code has failed and we did not fall
+			 * into the check above due to !pid.
+			 */
+			if (task && pi_state->owner == task)
+				return -EDEADLK;
+
 			atomic_inc(&pi_state->refcount);
 			*ps = pi_state;
 
@@ -935,7 +946,7 @@ retry:
 	 * We dont have the lock. Look up the PI state (or create it if
 	 * we are the first waiter):
 	 */
-	ret = lookup_pi_state(uval, hb, key, ps);
+	ret = lookup_pi_state(uval, hb, key, ps, task);
 
 	if (unlikely(ret)) {
 		switch (ret) {
@@ -1347,7 +1358,7 @@ void requeue_pi_wake_futex(struct futex_
  *
  * Return:
  *  0 - failed to acquire the lock atomically;
- *  1 - acquired the lock;
+ * >0 - acquired the lock, return value is vpid of the top_waiter
  * <0 - error
  */
 static int futex_proxy_trylock_atomic(u32 __user *pifutex,
@@ -1358,7 +1369,7 @@ static int futex_proxy_trylock_atomic(u3
 {
 	struct futex_q *top_waiter = NULL;
 	u32 curval;
-	int ret;
+	int ret, vpid;
 
 	if (get_futex_value_locked(&curval, pifutex))
 		return -EFAULT;
@@ -1386,11 +1397,13 @@ static int futex_proxy_trylock_atomic(u3
 	 * the contended case or if set_waiters is 1.  The pi_state is returned
 	 * in ps in contended cases.
 	 */
+	vpid = task_pid_vnr(top_waiter->task);
 	ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task,
 				   set_waiters);
-	if (ret == 1)
+	if (ret == 1) {
 		requeue_pi_wake_futex(top_waiter, key2, hb2);
-
+		return vpid;
+	}
 	return ret;
 }
 
@@ -1421,7 +1434,6 @@ static int futex_requeue(u32 __user *uad
 	struct futex_pi_state *pi_state = NULL;
 	struct futex_hash_bucket *hb1, *hb2;
 	struct futex_q *this, *next;
-	u32 curval2;
 
 	if (requeue_pi) {
 		/*
@@ -1509,16 +1521,25 @@ retry_private:
 		 * At this point the top_waiter has either taken uaddr2 or is
 		 * waiting on it.  If the former, then the pi_state will not
 		 * exist yet, look it up one more time to ensure we have a
-		 * reference to it.
+		 * reference to it. If the lock was taken, ret contains the
+		 * vpid of the top waiter task.
 		 */
-		if (ret == 1) {
+		if (ret > 0) {
 			WARN_ON(pi_state);
 			drop_count++;
 			task_count++;
-			ret = get_futex_value_locked(&curval2, uaddr2);
-			if (!ret)
-				ret = lookup_pi_state(curval2, hb2, &key2,
-						      &pi_state);
+			/*
+			 * If we acquired the lock, then the user
+			 * space value of uaddr2 should be vpid. It
+			 * cannot be changed by the top waiter as it
+			 * is blocked on hb2 lock if it tries to do
+			 * so. If something fiddled with it behind our
+			 * back the pi state lookup might unearth
+			 * it. So we rather use the known value than
+			 * rereading and handing potential crap to
+			 * lookup_pi_state.
+			 */
+			ret = lookup_pi_state(ret, hb2, &key2, &pi_state, NULL);
 		}
 
 		switch (ret) {



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch 3/3] futex: Prevent attaching to kernel threads
  2014-05-12 20:45 [patch 0/3] futex/rtmutex: Fix issues exposed by trinity Thomas Gleixner
  2014-05-12 20:45 ` [patch 1/3] rtmutex: Add missing deadlock check Thomas Gleixner
  2014-05-12 20:45 ` [patch 2/3] futex: Add another early deadlock detection check Thomas Gleixner
@ 2014-05-12 20:45 ` Thomas Gleixner
  2014-05-12 20:54   ` Peter Zijlstra
                     ` (2 more replies)
  2014-05-12 21:37 ` [patch 0/3] futex/rtmutex: Fix issues exposed by trinity Steven Rostedt
  2014-05-13  3:54 ` Darren Hart
  4 siblings, 3 replies; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-12 20:45 UTC (permalink / raw)
  To: LKML
  Cc: Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Clark Williams,
	Paul McKenney, Lai Jiangshan, Roland McGrath, Carlos ODonell,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

[-- Attachment #1: futex-prevent-attaching-futex-to-kernel-threads.patch --]
[-- Type: text/plain, Size: 935 bytes --]

We happily allow userspace to declare a random kernel thread to be the
owner of a user space PI futex.

Found while analysing the fallout of Dave Jones syscall fuzzer.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
---
 kernel/futex.c |    7 +++++++
 1 file changed, 7 insertions(+)

Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c
+++ linux-2.6/kernel/futex.c
@@ -682,6 +682,8 @@ static struct task_struct * futex_find_g
 	p = find_task_by_vpid(pid);
 	if (p)
 		get_task_struct(p);
+	else
+		p = NULL;
 
 	rcu_read_unlock();
 
@@ -814,6 +816,11 @@ lookup_pi_state(u32 uval, struct futex_h
 	if (!p)
 		return -ESRCH;
 
+	if (!p->mm) {
+		put_task_struct(p);
+		return -EPERM;
+	}
+
 	/*
 	 * We need to look at the task state flags to figure out,
 	 * whether the task is exiting. To protect against the do_exit



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 3/3] futex: Prevent attaching to kernel threads
  2014-05-12 20:45 ` [patch 3/3] futex: Prevent attaching to kernel threads Thomas Gleixner
@ 2014-05-12 20:54   ` Peter Zijlstra
  2014-05-12 21:16     ` Thomas Gleixner
  2014-05-12 21:59   ` Davidlohr Bueso
  2014-05-19 12:22   ` [tip:core/urgent] " tip-bot for Thomas Gleixner
  2 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2014-05-12 20:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Dave Jones, Linus Torvalds, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Steven Rostedt, Clark Williams, Paul McKenney,
	Lai Jiangshan, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Mon, May 12, 2014 at 08:45:35PM -0000, Thomas Gleixner wrote:
> We happily allow userspace to declare a random kernel thread to be the
> owner of a user space PI futex.
> 
> Found while analysing the fallout of Dave Jones syscall fuzzer.

Did you also still want to check the ppid for _PRIVATE futexes?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 3/3] futex: Prevent attaching to kernel threads
  2014-05-12 20:54   ` Peter Zijlstra
@ 2014-05-12 21:16     ` Thomas Gleixner
  0 siblings, 0 replies; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-12 21:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Dave Jones, Linus Torvalds, Darren Hart, Davidlohr Bueso,
	Ingo Molnar, Steven Rostedt, Clark Williams, Paul McKenney,
	Lai Jiangshan, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Mon, 12 May 2014, Peter Zijlstra wrote:

> On Mon, May 12, 2014 at 08:45:35PM -0000, Thomas Gleixner wrote:
> > We happily allow userspace to declare a random kernel thread to be the
> > owner of a user space PI futex.
> > 
> > Found while analysing the fallout of Dave Jones syscall fuzzer.
> 
> Did you also still want to check the ppid for _PRIVATE futexes?

Yes.

I'm still twisting my brain how to confine the non shared case w/o
going through loops and hoops. I'm not really sure, whether we can do
something about that without making it extremly painful, but we really
should try hard.

If the non shared case turns out to be a hopeless case, then we go for
the easy private confinement or make the shared case actually painfull
enough that people who care about it figure it out :)

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-12 20:45 [patch 0/3] futex/rtmutex: Fix issues exposed by trinity Thomas Gleixner
                   ` (2 preceding siblings ...)
  2014-05-12 20:45 ` [patch 3/3] futex: Prevent attaching to kernel threads Thomas Gleixner
@ 2014-05-12 21:37 ` Steven Rostedt
  2014-05-12 21:52   ` Thomas Gleixner
  2014-05-13  6:37   ` Ingo Molnar
  2014-05-13  3:54 ` Darren Hart
  4 siblings, 2 replies; 53+ messages in thread
From: Steven Rostedt @ 2014-05-12 21:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Clark Williams, Paul McKenney,
	Lai Jiangshan, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Mon, 12 May 2014 20:45:32 -0000
Thomas Gleixner <tglx@linutronix.de> wrote:

 
>    Steven, I told you more than once, that your damned performance and
>    feature mania is completely unacceptable.


Yes you have, and now you are yelling at me for something I did 3 years
ago. But this isn't one of my "performance and feature mania" things
you bitched at me about. This is something *you* asked me to look at!

http://marc.info/?l=linux-kernel&m=129242481625261

The "stress test" you recommend back then was just to port it to -rt
and test it there. Which I did, and it found various issues where I
helped Lai fix.

> 
>    When the patch was posted, you were reviewing it. When I wanted to
>    test it, it did not apply and you told me you'll fix it and pick it
>    up. Sure you made it apply and you sticked it into a git tree and
>    let Ingo pull it.
> 
>    I didn't pay much attention as I was happy with the idea in
>    general. I trusted your review and judgement. My problem.
> 
>    Why on earth did you merge that patch even if the changelog mentions
>    that Documentation was left stale and the tester broken?

Sure, I should have pushed Lai for that at the time. Lessons learned. I
was expecting he would do so after we got to a final version of the
patch but I was also working on other things at the time and forgot to
remind him.

I'm surprised you didn't mention to me to make sure the documentation
and test case got fixed before this went further. In fact, that was
never mentioned. These patches went through 4 revisions over a month,
and I even posted another RFC for them to get into the -rt tree.


> 
>    This is beyond sloppy and outright stupid.

Thomas, listen to yourself. You are calling something that got in
without updating the documentation and test case "beyond sloppy and
outright stupid". Sure, I'll agree it was a little sloppy, and today I'm
more on the ball to get things like this right (as we push much harder
today on documentation coming in along with code changes). But the
change log had:

    need updated after this patch is accepted
    1) Document/*
    2) the testcase scripts/rt-tester/t4-l2-pi-deboost.tst

I agree this should be fixed, but you temper tantrum over this seems a
bit over the top.


> 
>    Get this fixed now!
> 
>    And those fixes are not going to be merged without my Reviewed-by.
> 
>    I'm not going to let your multi-reader patches anywhere near RT,
>    unless this Documentation and tester issue is resolved.
> 
>    After that I'm going through them with a fine comb, as I really lost
>    the last modicom of hope, that you can do something without leaving
>    trainwrecks left and right.


And it boils down to this. In the past, I have screwed up where you
were the one (and usually the only one) that got hit by that mess. But
now you seem to take any little mistake done by me as a sign of
complete incompetence.

I'm sorry you feel this way.

I'll try to fix this up as I get a chance (I still have a day job to
do).

-- Steve



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-12 21:37 ` [patch 0/3] futex/rtmutex: Fix issues exposed by trinity Steven Rostedt
@ 2014-05-12 21:52   ` Thomas Gleixner
  2014-05-12 22:08     ` Steven Rostedt
  2014-05-13  6:37   ` Ingo Molnar
  1 sibling, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-12 21:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Clark Williams, Paul McKenney,
	Lai Jiangshan, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Mon, 12 May 2014, Steven Rostedt wrote:
> On Mon, 12 May 2014 20:45:32 -0000
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
>  
> >    Steven, I told you more than once, that your damned performance and
> >    feature mania is completely unacceptable.
> 
> 
> Yes you have, and now you are yelling at me for something I did 3 years
> ago. But this isn't one of my "performance and feature mania" things
> you bitched at me about. This is something *you* asked me to look at!
> 
> http://marc.info/?l=linux-kernel&m=129242481625261
> 
> The "stress test" you recommend back then was just to port it to -rt
> and test it there. Which I did, and it found various issues where I
> helped Lai fix.

And that's an excuse to merge stuff which leaves trainwrecks behind it?

Steven, start getting adult and admit your failures.

I'm cutting it right here as the following is just your attempt to
redirect the failure to me.

Yes, I made a mistake. The very mistake of trusting you.

That'll not happen again.

	Thomas



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 3/3] futex: Prevent attaching to kernel threads
  2014-05-12 20:45 ` [patch 3/3] futex: Prevent attaching to kernel threads Thomas Gleixner
  2014-05-12 20:54   ` Peter Zijlstra
@ 2014-05-12 21:59   ` Davidlohr Bueso
  2014-05-12 22:18     ` Thomas Gleixner
  2014-05-19 12:22   ` [tip:core/urgent] " tip-bot for Thomas Gleixner
  2 siblings, 1 reply; 53+ messages in thread
From: Davidlohr Bueso @ 2014-05-12 21:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Ingo Molnar, Steven Rostedt, Clark Williams, Paul McKenney,
	Lai Jiangshan, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Mon, 2014-05-12 at 20:45 +0000, Thomas Gleixner wrote:
> We happily allow userspace to declare a random kernel thread to be the
> owner of a user space PI futex.
> 
> Found while analysing the fallout of Dave Jones syscall fuzzer.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: stable@vger.kernel.org
> ---
>  kernel/futex.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> Index: linux-2.6/kernel/futex.c
> ===================================================================
> --- linux-2.6.orig/kernel/futex.c
> +++ linux-2.6/kernel/futex.c
> @@ -682,6 +682,8 @@ static struct task_struct * futex_find_g
>         p = find_task_by_vpid(pid);
>         if (p)
>                 get_task_struct(p);
> +       else
> +               p = NULL;
>  
>         rcu_read_unlock();
>  
> @@ -814,6 +816,11 @@ lookup_pi_state(u32 uval, struct futex_h
>         if (!p)
>                 return -ESRCH;
>  
> +       if (!p->mm) {
> +               put_task_struct(p);
> +               return -EPERM;
> +       }

It might make sense to check the mm inside futex_find_get_task(), and
simply never take the refcount if NULL.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-12 21:52   ` Thomas Gleixner
@ 2014-05-12 22:08     ` Steven Rostedt
  2014-05-12 22:37       ` Thomas Gleixner
  0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2014-05-12 22:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Clark Williams, Paul McKenney,
	Lai Jiangshan, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Mon, 12 May 2014 23:52:07 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Mon, 12 May 2014, Steven Rostedt wrote:
> > On Mon, 12 May 2014 20:45:32 -0000
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> >  
> > >    Steven, I told you more than once, that your damned performance and
> > >    feature mania is completely unacceptable.
> > 
> > 
> > Yes you have, and now you are yelling at me for something I did 3 years
> > ago. But this isn't one of my "performance and feature mania" things
> > you bitched at me about. This is something *you* asked me to look at!
> > 
> > http://marc.info/?l=linux-kernel&m=129242481625261
> > 
> > The "stress test" you recommend back then was just to port it to -rt
> > and test it there. Which I did, and it found various issues where I
> > helped Lai fix.
> 
> And that's an excuse to merge stuff which leaves trainwrecks behind it?

A trainwreck that wasn't noticed in 3 years.

> 
> Steven, start getting adult and admit your failures.

I did admit I made a mistake. If you read the rest of my email that you
cut off, I said "Sure, I should have pushed Lai for that at the time.
Lessons learned. I was expecting he would do so after we got to a final
version of the patch but I was also working on other things at the time
and forgot to remind him."

I even stated that I'm doing better now. That means I have learned from
my mistakes. And continue to do so.

> 
> I'm cutting it right here as the following is just your attempt to
> redirect the failure to me.

No, I'm not trying to redirect the failure to you. I was trying to get
you to understand that actual scope of it.


> 
> Yes, I made a mistake. The very mistake of trusting you.
> 
> That'll not happen again.

I expect nothing less from you.

-- Steve

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 3/3] futex: Prevent attaching to kernel threads
  2014-05-12 21:59   ` Davidlohr Bueso
@ 2014-05-12 22:18     ` Thomas Gleixner
  0 siblings, 0 replies; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-12 22:18 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: LKML, Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Ingo Molnar, Steven Rostedt, Clark Williams, Paul McKenney,
	Lai Jiangshan, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Mon, 12 May 2014, Davidlohr Bueso wrote:
> On Mon, 2014-05-12 at 20:45 +0000, Thomas Gleixner wrote:
> > We happily allow userspace to declare a random kernel thread to be the
> > owner of a user space PI futex.
> > 
> > Found while analysing the fallout of Dave Jones syscall fuzzer.
> > 
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > Cc: stable@vger.kernel.org
> > ---
> >  kernel/futex.c |    7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > Index: linux-2.6/kernel/futex.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/futex.c
> > +++ linux-2.6/kernel/futex.c
> > @@ -682,6 +682,8 @@ static struct task_struct * futex_find_g
> >         p = find_task_by_vpid(pid);
> >         if (p)
> >                 get_task_struct(p);
> > +       else
> > +               p = NULL;

Duh, that hunk is actually left over from the attempt to do what you
suggest below :(

> >  
> >         rcu_read_unlock();
> >  
> > @@ -814,6 +816,11 @@ lookup_pi_state(u32 uval, struct futex_h
> >         if (!p)
> >                 return -ESRCH;
> >  
> > +       if (!p->mm) {
> > +               put_task_struct(p);
> > +               return -EPERM;
> > +       }
> 
> It might make sense to check the mm inside futex_find_get_task(), and
> simply never take the refcount if NULL.

Right, but for some reason it was a more ugly patch than the one I
posted (w/o the stupid hunk above) and I tried to minimize the impact
for stable backporting.

Thanks,

	tglx





^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-12 22:08     ` Steven Rostedt
@ 2014-05-12 22:37       ` Thomas Gleixner
  2014-05-12 23:18         ` Steven Rostedt
  0 siblings, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-12 22:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Clark Williams, Paul McKenney,
	Lai Jiangshan, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Mon, 12 May 2014, Steven Rostedt wrote:
> I did admit I made a mistake. If you read the rest of my email that you
> cut off, I said "Sure, I should have pushed Lai for that at the time.
> Lessons learned. I was expecting he would do so after we got to a final
> version of the patch but I was also working on other things at the time
> and forgot to remind him."
> 
> I even stated that I'm doing better now. That means I have learned from
> my mistakes. And continue to do so.

Sure, with your latest multi reader patches you break even the compile
with CONFIG_DEBUG_RT_MUTEXES=y due to:

+               debug_rt_mutex_print_deadlock(waiter);

I'm really impressed by your improvements.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-12 22:37       ` Thomas Gleixner
@ 2014-05-12 23:18         ` Steven Rostedt
  0 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2014-05-12 23:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Clark Williams, Paul McKenney,
	Lai Jiangshan, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Tue, 13 May 2014 00:37:21 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Mon, 12 May 2014, Steven Rostedt wrote:
> > I did admit I made a mistake. If you read the rest of my email that you
> > cut off, I said "Sure, I should have pushed Lai for that at the time.
> > Lessons learned. I was expecting he would do so after we got to a final
> > version of the patch but I was also working on other things at the time
> > and forgot to remind him."
> > 
> > I even stated that I'm doing better now. That means I have learned from
> > my mistakes. And continue to do so.
> 
> Sure, with your latest multi reader patches you break even the compile
> with CONFIG_DEBUG_RT_MUTEXES=y due to:
> 
> +               debug_rt_mutex_print_deadlock(waiter);
> 
> I'm really impressed by your improvements.
> 

Really Thomas? You are judging my work flow with a patch that I posted
as RFC hoping to get benchmarks from, which I've only received a few
for, and I'm hoping to get more.

I've never posted that patch for inclusion, because it does need more
testing and even needs more cleaning up. I only posted it hoping to get
numbers to know if it is worth pursuing or not.

I even tell Clark not to use it as it is not ready for prime time yet.

And because you have never even acknowledged that patch until now,
I've pretty much considered it dead.

-- Steve

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-12 20:45 [patch 0/3] futex/rtmutex: Fix issues exposed by trinity Thomas Gleixner
                   ` (3 preceding siblings ...)
  2014-05-12 21:37 ` [patch 0/3] futex/rtmutex: Fix issues exposed by trinity Steven Rostedt
@ 2014-05-13  3:54 ` Darren Hart
  2014-05-13  9:08   ` Thomas Gleixner
  2014-05-14  6:58   ` Carlos O'Donell
  4 siblings, 2 replies; 53+ messages in thread
From: Darren Hart @ 2014-05-13  3:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Clark Williams,
	Paul McKenney, Lai Jiangshan, Roland McGrath, Carlos ODonell,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

On Mon, May 12, 2014 at 08:45:32PM -0000, Thomas Gleixner wrote:
> Dave reported recently that the trinity syscall fuzzer triggers a
> warning in the rtmutex code via the futex syscall.
> 
> It took me quite a while to understand the issue, until I was able to
> write a minimalistic reproducer.
> 
> The first two patches address the issues and the third one is adding
> an unrelated sanity check to prevent user space from assigning PI
> futexes to kernel threads.
> 
> I ran it through my usual tests (except of one, see below), Darrens
> futex tester and it holds up against a 400 threads trinity whacking
> for well above 12hrs now.
> 
> Staring three days into futex.c/rtmutex.c and a few GB of traces
> definitly brings one in a lousy mood. But I discovered two things
> which made me outright grumpy:
> 
> 
> 2) glibc fun
> 
>    While writing an isolated test case for the trinity wreckage, I
>    found out that glibc has an interesting misfeature:
> 
>    strace tells me:
> 
>    futex(0x600e00, FUTEX_LOCK_PI_PRIVATE, 1) = -1 EINVAL (Invalid argument)
> 
>    but the return value of pthread_mutex_lock() is 0

So something is clearly wrong there - however, were you looking at the comments
(sorry, I mean the C code), or the implementation (all the ASM)? The only way
I've been able to be sure in the past is to delete the ASM files and recompile
using the C files. Hopefully we'll be able to drop all the ASM in the pthread
calls soonish (measured in years in glibc development time scales).... sigh.

>    Yay, for another master piece of trainwreck engineering!
> 
>    I checked the glibc source and of course neither the lock nor the unlock path
>    gets any error code from the syscall propagated.
> 
>    The handling of -EDEADLOCK is even more impressive. Instead of
>    propagating it to the caller something in the guts of glibc calls pause().
> 
>      futex(0x601300, FUTEX_LOCK_PI_PRIVATE, 1) = -1 EDEADLK (Resource deadlock avoided)
>      pause(
> 

Gotta love comments like these though - such trust!:

	/* The mutex is locked.  The kernel will now take care of
           everything. */

IIRC, glibc takes the approach that if this operation fails, there is no way for
it to recovery "properly", and so it chooses to:

	/* Delay the thread indefinitely. */

I believe the thinking goes that if we get to here, then the lock is in an
inconsistent state (between kernel and userspace). I don't have an answer for
why pausing forever would be preferable to returning an error however...

>    So any kind of futex wreckage which is not detectable by glibc
>    itself ends up in a disaster one way or the other. How is Joe user
>    of this stuff who reads the manpages supposed to find out that the
>    kernel detected an inconsistent user space variable?
> 
>    If someone of the glibc folks can be bothered to look after that,
>    I'm happy to provide the simple test cases, but I'm sure that
>    looking at the code which simply ignores or pauses on kernel error
>    return values is enough to figure out why its broken.

Was there a spot where it ignores it completely? I suppose if it's anything
other than ESRCH or EDEADLK... yuck...

> 
>    This does not only apply to the PI version, the bog standard
>    futexes have error returns from the kernel as well, which are
>    handled in the same absurd way.

We do have the attention of some of the new glibc maintainers with the requeue
PI work, and some redhat folks are digging into other PI related spec issues.
I'll bring this up in those discussions and see if there might be something we
can do here.

--
Darren

> 
> Yours grumpy
> 
>       tglx
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-12 20:45 ` [patch 1/3] rtmutex: Add missing deadlock check Thomas Gleixner
@ 2014-05-13  5:51   ` Lai Jiangshan
  2014-05-13  8:45     ` Thomas Gleixner
                       ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Lai Jiangshan @ 2014-05-13  5:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Clark Williams,
	Paul McKenney, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

Hi, Thomas,

I think this patch is just a workaround, it is not the proper fix.
you need a updated deadlock-check mechanism:

- (old) skip the check when top_waiter != task_top_pi_waiter(task)
+ (new) skip the check when top_waiter->prio > task->prio

	/*
	 * Drop out, when the task has no waiters. Note,
	 * top_waiter can be NULL, when we are in the deboosting
	 * mode!
	 */
	if (top_waiter && (!task_has_pi_waiters(task) ||
			   top_waiter != task_top_pi_waiter(task)))
		goto out_unlock_pi;


(also need to update the code in other places respectively)

On 05/13/2014 04:45 AM, Thomas Gleixner wrote:
>  	/*
> +	 * Deadlock check for the following scenario:
> +	 *
> +	 * T holds lock L and has waiters
> +	 * T locks L again, but does not end up as it's own top waiter

ABBA problem (TA TB TC TD are of the same priority)

TA holds lock LA, and try to lock LB which TC already has waited on
TB holds lock LB, and try to lock LA which TD already has waited on

I think this check can't detect it IIUC.

> +	 *
> +	 * So we would drop out at the next check without noticing.
> +	 *
> +	 * Note, we need to check for orig_waiter as it might be NULL
> +	 * when deboosting!
> +	 */
> +	if (orig_waiter && orig_waiter->task == rt_mutex_owner(lock)) {

when non-first-loop, it is already checked.

> +		ret = deadlock_detect ? -EDEADLK : 0;
> +		goto out_unlock_pi;
> +	}

I considered you blamed to me.
I would feel better if you directly blamed to me.

Thanks
Lai

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-12 21:37 ` [patch 0/3] futex/rtmutex: Fix issues exposed by trinity Steven Rostedt
  2014-05-12 21:52   ` Thomas Gleixner
@ 2014-05-13  6:37   ` Ingo Molnar
  1 sibling, 0 replies; 53+ messages in thread
From: Ingo Molnar @ 2014-05-13  6:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, LKML, Dave Jones, Linus Torvalds,
	Peter Zijlstra, Darren Hart, Davidlohr Bueso, Clark Williams,
	Paul McKenney, Lai Jiangshan, Roland McGrath, Carlos ODonell,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior


* Steven Rostedt <rostedt@goodmis.org> wrote:

> >    This is beyond sloppy and outright stupid.
> 
> Thomas, listen to yourself. You are calling something that got in
> without updating the documentation and test case "beyond sloppy and
> outright stupid". Sure, I'll agree it was a little sloppy, and today I'm
> more on the ball to get things like this right (as we push much harder
> today on documentation coming in along with code changes). But the
> change log had:
> 
>     need updated after this patch is accepted
>     1) Document/*
>     2) the testcase scripts/rt-tester/t4-l2-pi-deboost.tst
> 
> I agree this should be fixed, but you temper tantrum over this seems 
> a bit over the top.

Well, the thing is, it's not just about the breakage, but the 
changelog and the patch in itself already violates a very basic and 
fundamental kernel workflow pattern we try to follow all the time for 
core kernel changes: when speedups are introduced then fixes are never 
'promised' to be done in an uncertain future, but are done preferably 
before (or together) with the speedup.

Especially when there's an ABI involved.

The "break and fix later" kind of pattern might not be as problematic 
for debug features which are not critical to users, but for ABI facing 
changes it's deadly and inacceptable.

Also, the problems with the broken commit already begin with the 
readability of the changelog:

1)

The very first sentence:

  > In current rtmutex,

Should be:

  > In the current rtmutex code,

2)

Also, the 'Example' that comes next is just a flow of sentences 
without any vertical alignment (tabs, spaces, etc.), making it harder 
to review.

3)

Plus, here:

  "time3
   A or other high prio task sleeps, but we have passed some time
   The B and C's prio are changed in the period (time2 ~ time3)"

where does the first sentence end?

4)

  "when requiring lock,"

I suspect that wanted to say 'reacquiring' the lock.

5)

  "Other advantage of this patch:"

s/advantage/advantages/

6)

  "The states of a rtmutex are reduced a half, easier to read the code."

s/reduced a half/reduced by half

7)

 it is unrelated/unneed boosting 

s/unneed/unneeded

?

8)

The diffstat is 'frickin hug:

 kernel/futex.c          |  22 +++++-----
 kernel/rtmutex-debug.c  |   1 -
 kernel/rtmutex.c        | 318 +++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------------------------------------------
 kernel/rtmutex_common.h |  16 +------
 4 files changed, 127 insertions(+), 230 deletions(-)

When a changelog is harder to read due to basic grammar and 
typographical errors, and when the patch is large, it's harder to 
notice more fundamental problems as well.

(I've attached the full changelog below for reference.)

I should have noticed these warning signs when pulling your tree, I 
generally read over changelogs and patches I pull, but missed this - 
not sure what fell into me.

My bad, will be more careful next time around.

Thanks,

	Ingo

=====================================>
>From 8161239a8bcce9ad6b537c04a1fa3b5c68bae693 Mon Sep 17 00:00:00 2001
From: Lai Jiangshan <laijs@cn.fujitsu.com>
Date: Fri, 14 Jan 2011 17:09:41 +0800
Subject: [PATCH] rtmutex: Simplify PI algorithm and make highest prio task get
 lock

In current rtmutex, the pending owner may be boosted by the tasks
in the rtmutex's waitlist when the pending owner is deboosted
or a task in the waitlist is boosted. This boosting is unrelated,
because the pending owner does not really take the rtmutex.
It is not reasonable.

Example.

time1:
A(high prio) onwers the rtmutex.
B(mid prio) and C (low prio) in the waitlist.

time2
A release the lock, B becomes the pending owner
A(or other high prio task) continues to run. B's prio is lower
than A, so B is just queued at the runqueue.

time3
A or other high prio task sleeps, but we have passed some time
The B and C's prio are changed in the period (time2 ~ time3)
due to boosting or deboosting. Now C has the priority higher
than B. ***Is it reasonable that C has to boost B and help B to
get the rtmutex?

NO!! I think, it is unrelated/unneed boosting before B really
owns the rtmutex. We should give C a chance to beat B and
win the rtmutex.

This is the motivation of this patch. This patch *ensures*
only the top waiter or higher priority task can take the lock.

How?
1) we don't dequeue the top waiter when unlock, if the top waiter
   is changed, the old top waiter will fail and go to sleep again.
2) when requiring lock, it will get the lock when the lock is not taken and:
   there is no waiter OR higher priority than waiters OR it is top waiter.
3) In any time, the top waiter is changed, the top waiter will be woken up.

The algorithm is much simpler than before, no pending owner, no
boosting for pending owner.

Other advantage of this patch:
1) The states of a rtmutex are reduced a half, easier to read the code.
2) the codes become shorter.
3) top waiter is not dequeued until it really take the lock:
   they will retain FIFO when it is stolen.

Not advantage nor disadvantage
1) Even we may wakeup multiple waiters(any time when top waiter changed),
   we hardly cause "thundering herd",
   the number of wokenup task is likely 1 or very little.
2) two APIs are changed.
   rt_mutex_owner() will not return pending owner, it will return NULL when
                    the top waiter is going to take the lock.
   rt_mutex_next_owner() always return the top waiter.
	                 will not return NULL if we have waiters
                         because the top waiter is not dequeued.

   I have fixed the code that use these APIs.

need updated after this patch is accepted
1) Document/*
2) the testcase scripts/rt-tester/t4-l2-pi-deboost.tst

Signed-off-by:  Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4D3012D5.4060709@cn.fujitsu.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/futex.c          |  22 ++--
 kernel/rtmutex-debug.c  |   1 -
 kernel/rtmutex.c        | 318 +++++++++++++++++-------------------------------
 kernel/rtmutex_common.h |  16 +--
 4 files changed, 127 insertions(+), 230 deletions(-)


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-13  5:51   ` Lai Jiangshan
@ 2014-05-13  8:45     ` Thomas Gleixner
  2014-05-13  8:48     ` Peter Zijlstra
  2014-05-14 12:59     ` Thomas Gleixner
  2 siblings, 0 replies; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-13  8:45 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: LKML, Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Clark Williams,
	Paul McKenney, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

Lai,

On Tue, 13 May 2014, Lai Jiangshan wrote:
> I think this patch is just a workaround, it is not the proper fix.
> you need a updated deadlock-check mechanism:
> 
> - (old) skip the check when top_waiter != task_top_pi_waiter(task)
> + (new) skip the check when top_waiter->prio > task->prio
> 
> 	/*
> 	 * Drop out, when the task has no waiters. Note,
> 	 * top_waiter can be NULL, when we are in the deboosting
> 	 * mode!
> 	 */
> 	if (top_waiter && (!task_has_pi_waiters(task) ||
> 			   top_waiter != task_top_pi_waiter(task)))
> 		goto out_unlock_pi;
> 
> 
> (also need to update the code in other places respectively)

Ok, I did not think it through fully and I want to have the rtmutex
tester working again so we can check for this without going through
futex hoops and loops. I had no time yet to look into that as I needed
to understand the futex issue which exposed it first.

> On 05/13/2014 04:45 AM, Thomas Gleixner wrote:
> >  	/*
> > +	 * Deadlock check for the following scenario:
> > +	 *
> > +	 * T holds lock L and has waiters
> > +	 * T locks L again, but does not end up as it's own top waiter
> 
> ABBA problem (TA TB TC TD are of the same priority)
> 
> TA holds lock LA, and try to lock LB which TC already has waited on
> TB holds lock LB, and try to lock LA which TD already has waited on
> 
> I think this check can't detect it IIUC.
> 
> > +	 *
> > +	 * So we would drop out at the next check without noticing.
> > +	 *
> > +	 * Note, we need to check for orig_waiter as it might be NULL
> > +	 * when deboosting!
> > +	 */
> > +	if (orig_waiter && orig_waiter->task == rt_mutex_owner(lock)) {
> 
> when non-first-loop, it is already checked.

Right, but we must check it for the first loop as well. And that check
was not there ever, so it's not your problem. I verified against a
kernel w/o your optimization.
 
> > +		ret = deadlock_detect ? -EDEADLK : 0;
> > +		goto out_unlock_pi;
> > +	}
> 
> I considered you blamed to me.
> I would feel better if you directly blamed to me.

I blamed you as well for not following up and updating the stuff you
broke.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-13  5:51   ` Lai Jiangshan
  2014-05-13  8:45     ` Thomas Gleixner
@ 2014-05-13  8:48     ` Peter Zijlstra
  2014-05-13 16:12       ` Paul E. McKenney
  2014-05-13 19:42       ` Thomas Gleixner
  2014-05-14 12:59     ` Thomas Gleixner
  2 siblings, 2 replies; 53+ messages in thread
From: Peter Zijlstra @ 2014-05-13  8:48 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Thomas Gleixner, LKML, Dave Jones, Linus Torvalds, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Clark Williams,
	Paul McKenney, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Tue, May 13, 2014 at 01:51:21PM +0800, Lai Jiangshan wrote:
> I considered you blamed to me.
> I would feel better if you directly blamed to me.

Well, the way I see it all you're to blame for is 'forgetting' to update
the rt mutex test for 3 years, and while that is unfortunate, the much
bigger fail is merging the patch without it to begin with.

So in that respect Steven is primarily responsible for merging
incomplete work.

Now Thomas was entirely grumpy from staring at bugs for a long while,
which is entirely understandable. That said I don't think he went
overboard calling Steven out on this.

Now, if you and Steve get this sorted, nothing really happened except
that Thomas got grumpy, which is entirely normal, what else would he be?
:-)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-13  3:54 ` Darren Hart
@ 2014-05-13  9:08   ` Thomas Gleixner
  2014-05-14  7:06     ` Carlos O'Donell
  2014-05-14  6:58   ` Carlos O'Donell
  1 sibling, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-13  9:08 UTC (permalink / raw)
  To: Darren Hart
  Cc: LKML, Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Clark Williams,
	Paul McKenney, Lai Jiangshan, Roland McGrath, Carlos ODonell,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

On Mon, 12 May 2014, Darren Hart wrote:
> On Mon, May 12, 2014 at 08:45:32PM -0000, Thomas Gleixner wrote:
> >    strace tells me:
> > 
> >    futex(0x600e00, FUTEX_LOCK_PI_PRIVATE, 1) = -1 EINVAL (Invalid argument)
> > 
> >    but the return value of pthread_mutex_lock() is 0
> 
> So something is clearly wrong there - however, were you looking at the comments
> (sorry, I mean the C code), or the implementation (all the ASM)? The only way
> I've been able to be sure in the past is to delete the ASM files and recompile
> using the C files. Hopefully we'll be able to drop all the ASM in the pthread
> calls soonish (measured in years in glibc development time scales).... sigh.

The C implementation does:

	    if (INTERNAL_SYSCALL_ERROR_P (e, __err)
		&& (INTERNAL_SYSCALL_ERRNO (e, __err) == ESRCH
		    || INTERNAL_SYSCALL_ERRNO (e, __err) == EDEADLK))
	      {
		assert (INTERNAL_SYSCALL_ERRNO (e, __err) != EDEADLK
			|| (kind != PTHREAD_MUTEX_ERRORCHECK_NP
			    && kind != PTHREAD_MUTEX_RECURSIVE_NP));
		/* ESRCH can happen only for non-robust PI mutexes where
		   the owner of the lock died.  */
		assert (INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust);

		/* Delay the thread indefinitely.  */
		while (1)
		  pause_not_cancel ();
	      }

So anything else than ESRCH and EDEADLK is ignored and then the thing
happily returns 0 at the end. Unlock is the same:

	{
	  int robust = mutex->__data.__kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP;
	  int private = (robust
			 ? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
			 : PTHREAD_MUTEX_PSHARED (mutex));
	  INTERNAL_SYSCALL_DECL (__err);
	  INTERNAL_SYSCALL (futex, __err, 2, &mutex->__data.__lock,
			    __lll_private_flag (FUTEX_UNLOCK_PI, private));
	}

Quality stuff that.

Thanks,

	tglx

 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-13  8:48     ` Peter Zijlstra
@ 2014-05-13 16:12       ` Paul E. McKenney
  2014-05-13 19:42       ` Thomas Gleixner
  1 sibling, 0 replies; 53+ messages in thread
From: Paul E. McKenney @ 2014-05-13 16:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Thomas Gleixner, LKML, Dave Jones, Linus Torvalds,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Steven Rostedt,
	Clark Williams, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Tue, May 13, 2014 at 10:48:15AM +0200, Peter Zijlstra wrote:
> On Tue, May 13, 2014 at 01:51:21PM +0800, Lai Jiangshan wrote:
> > I considered you blamed to me.
> > I would feel better if you directly blamed to me.
> 
> Well, the way I see it all you're to blame for is 'forgetting' to update
> the rt mutex test for 3 years, and while that is unfortunate, the much
> bigger fail is merging the patch without it to begin with.
> 
> So in that respect Steven is primarily responsible for merging
> incomplete work.
> 
> Now Thomas was entirely grumpy from staring at bugs for a long while,
> which is entirely understandable. That said I don't think he went
> overboard calling Steven out on this.
> 
> Now, if you and Steve get this sorted, nothing really happened except
> that Thomas got grumpy, which is entirely normal, what else would he be?
> :-)

What Peter said!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-13  8:48     ` Peter Zijlstra
  2014-05-13 16:12       ` Paul E. McKenney
@ 2014-05-13 19:42       ` Thomas Gleixner
  2014-05-13 20:20         ` Steven Rostedt
  1 sibling, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-13 19:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, LKML, Dave Jones, Linus Torvalds, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Clark Williams,
	Paul McKenney, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Tue, 13 May 2014, Peter Zijlstra wrote:
> 
> Now, if you and Steve get this sorted, nothing really happened except
> that Thomas got grumpy, which is entirely normal, what else would he be?
> :-)

Who is that grumpy Thomas dude, should I know him?

Lai, Steven,

before you waste lots of time on the tester, I want to look at it
whether we can simplify it or even rewrite it from scratch. I glanced
at it today and I really can't remember what kind of substances were
involved when I wrote this almost a decade ago.

The whole schedule_rt_mutex mechanism was mostly done to create
controlled lock stealing scenarios and deal with the BKL
oddities.

With Lai's simplification and the demise of BKL I'm quite sure we do
not need it anymore.

So we can just get rid of the complexity in schedule_rt_mutex() and
replace it with a simple:

    while (!td->continue)
    	  schedule();

That would also make the teardown and reset of the whole thing
manageable. Right now it's easy to create a situation where unrolling
stuff gets almost impossible except by pushing the reset button.

The state readouts can be done directly via the rtmutexes and the task
structs.

Thoughts?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-13 19:42       ` Thomas Gleixner
@ 2014-05-13 20:20         ` Steven Rostedt
  2014-05-13 20:36           ` Paul E. McKenney
  0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2014-05-13 20:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Lai Jiangshan, LKML, Dave Jones, Linus Torvalds,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Clark Williams,
	Paul McKenney, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Tue, 13 May 2014 21:42:54 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Tue, 13 May 2014, Peter Zijlstra wrote:
> > 
> > Now, if you and Steve get this sorted, nothing really happened except
> > that Thomas got grumpy, which is entirely normal, what else would he be?
> > :-)
> 
> Who is that grumpy Thomas dude, should I know him?
> 
> Lai, Steven,
> 
> before you waste lots of time on the tester, I want to look at it
> whether we can simplify it or even rewrite it from scratch. I glanced
> at it today and I really can't remember what kind of substances were
> involved when I wrote this almost a decade ago.

Thank God. /me removes the ton of trace_printk()s in the code as well
as all the trace_marker.write("%s" %(line)) from the test to figure out
what was going on.

> 
> The whole schedule_rt_mutex mechanism was mostly done to create
> controlled lock stealing scenarios and deal with the BKL
> oddities.
> 
> With Lai's simplification and the demise of BKL I'm quite sure we do
> not need it anymore.
> 
> So we can just get rid of the complexity in schedule_rt_mutex() and
> replace it with a simple:
> 
>     while (!td->continue)
>     	  schedule();
> 
> That would also make the teardown and reset of the whole thing
> manageable. Right now it's easy to create a situation where unrolling
> stuff gets almost impossible except by pushing the reset button.
> 
> The state readouts can be done directly via the rtmutexes and the task
> structs.
> 
> Thoughts?
> 

What about having a module that creates a bunch of threads and forces
all the scenarios that we want to test? Wouldn't it be easier to do
than to have a userspace interface to dictate commands to the kernel?

-- Steve

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-13 20:20         ` Steven Rostedt
@ 2014-05-13 20:36           ` Paul E. McKenney
  2014-05-13 21:27             ` Thomas Gleixner
  0 siblings, 1 reply; 53+ messages in thread
From: Paul E. McKenney @ 2014-05-13 20:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Peter Zijlstra, Lai Jiangshan, LKML, Dave Jones,
	Linus Torvalds, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Clark Williams, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Tue, May 13, 2014 at 04:20:41PM -0400, Steven Rostedt wrote:
> On Tue, 13 May 2014 21:42:54 +0200 (CEST)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Tue, 13 May 2014, Peter Zijlstra wrote:
> > > 
> > > Now, if you and Steve get this sorted, nothing really happened except
> > > that Thomas got grumpy, which is entirely normal, what else would he be?
> > > :-)
> > 
> > Who is that grumpy Thomas dude, should I know him?

;-) ;-) ;-)

> > Lai, Steven,
> > 
> > before you waste lots of time on the tester, I want to look at it
> > whether we can simplify it or even rewrite it from scratch. I glanced
> > at it today and I really can't remember what kind of substances were
> > involved when I wrote this almost a decade ago.
> 
> Thank God. /me removes the ton of trace_printk()s in the code as well
> as all the trace_marker.write("%s" %(line)) from the test to figure out
> what was going on.
> 
> > 
> > The whole schedule_rt_mutex mechanism was mostly done to create
> > controlled lock stealing scenarios and deal with the BKL
> > oddities.
> > 
> > With Lai's simplification and the demise of BKL I'm quite sure we do
> > not need it anymore.
> > 
> > So we can just get rid of the complexity in schedule_rt_mutex() and
> > replace it with a simple:
> > 
> >     while (!td->continue)
> >     	  schedule();
> > 
> > That would also make the teardown and reset of the whole thing
> > manageable. Right now it's easy to create a situation where unrolling
> > stuff gets almost impossible except by pushing the reset button.
> > 
> > The state readouts can be done directly via the rtmutexes and the task
> > structs.
> > 
> > Thoughts?
> > 
> 
> What about having a module that creates a bunch of threads and forces
> all the scenarios that we want to test? Wouldn't it be easier to do
> than to have a userspace interface to dictate commands to the kernel?

I second this approach!  The kernel environment makes it -much- easier
to force races and other conditions, which turns into much simpler and
more effective tests.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-13 20:36           ` Paul E. McKenney
@ 2014-05-13 21:27             ` Thomas Gleixner
  2014-05-13 22:00               ` Paul E. McKenney
  0 siblings, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-13 21:27 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Peter Zijlstra, Lai Jiangshan, LKML, Dave Jones,
	Linus Torvalds, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Clark Williams, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Tue, 13 May 2014, Paul E. McKenney wrote:
> On Tue, May 13, 2014 at 04:20:41PM -0400, Steven Rostedt wrote:
> > What about having a module that creates a bunch of threads and forces
> > all the scenarios that we want to test? Wouldn't it be easier to do
> > than to have a userspace interface to dictate commands to the kernel?
> 
> I second this approach!  The kernel environment makes it -much- easier
> to force races and other conditions, which turns into much simpler and
> more effective tests.

The point of the rtmutex tester was NOT to force races and stuff, it
was intended to be a formal test for certain static scenarios:

   - verify boosting / debosting
   - verify set_scheduler interaction
   - verify deadlock detection 

The latter was incomplete and therefor missed the futex wreckage :(

Having a formal checker makes a lot of sense.

Plastering the code with a gazillion of trace_printks, waiting several
hours for each iteration and staring into several GB of traces just to
figure out, that it is an algorithmic issue, is utter waste of time
and nerves. And that stuff is definitely complex enough to justify a
static checker.

Back then when I wrote it, it unearthed quite some logic bugs. And I
needed the schedule_rt_mutex() hack to verify the BKL interaction and
the lock steal machinery, which made it impossible to be a module. It
could have been done, but that'd have been even more ugly hackery.

So I made it a user space interface to add/modify test cases without
recompiling the kernel.  But now with BKL and the lock steal muck
gone, we simply might kill it.

Now that allows a module, but then I'm still not sure whether formal
verification rules are fun to code in C. There are certainly better
ways than the *.tst rules I defined back then. But yes, we could add a
similar cryptic thing with static arrays of OP/Data pairs in C.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-13 21:27             ` Thomas Gleixner
@ 2014-05-13 22:00               ` Paul E. McKenney
  2014-05-13 22:44                 ` Steven Rostedt
  0 siblings, 1 reply; 53+ messages in thread
From: Paul E. McKenney @ 2014-05-13 22:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, Peter Zijlstra, Lai Jiangshan, LKML, Dave Jones,
	Linus Torvalds, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Clark Williams, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Tue, May 13, 2014 at 11:27:16PM +0200, Thomas Gleixner wrote:
> On Tue, 13 May 2014, Paul E. McKenney wrote:
> > On Tue, May 13, 2014 at 04:20:41PM -0400, Steven Rostedt wrote:
> > > What about having a module that creates a bunch of threads and forces
> > > all the scenarios that we want to test? Wouldn't it be easier to do
> > > than to have a userspace interface to dictate commands to the kernel?
> > 
> > I second this approach!  The kernel environment makes it -much- easier
> > to force races and other conditions, which turns into much simpler and
> > more effective tests.
> 
> The point of the rtmutex tester was NOT to force races and stuff, it
> was intended to be a formal test for certain static scenarios:
> 
>    - verify boosting / debosting
>    - verify set_scheduler interaction
>    - verify deadlock detection 
> 
> The latter was incomplete and therefor missed the futex wreckage :(
> 
> Having a formal checker makes a lot of sense.
> 
> Plastering the code with a gazillion of trace_printks, waiting several
> hours for each iteration and staring into several GB of traces just to
> figure out, that it is an algorithmic issue, is utter waste of time
> and nerves. And that stuff is definitely complex enough to justify a
> static checker.
> 
> Back then when I wrote it, it unearthed quite some logic bugs. And I
> needed the schedule_rt_mutex() hack to verify the BKL interaction and
> the lock steal machinery, which made it impossible to be a module. It
> could have been done, but that'd have been even more ugly hackery.
> 
> So I made it a user space interface to add/modify test cases without
> recompiling the kernel.  But now with BKL and the lock steal muck
> gone, we simply might kill it.
> 
> Now that allows a module, but then I'm still not sure whether formal
> verification rules are fun to code in C. There are certainly better
> ways than the *.tst rules I defined back then. But yes, we could add a
> similar cryptic thing with static arrays of OP/Data pairs in C.

Good points -- I was indeed thinking about stress testing instead of
algorithmic testing.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-13 22:00               ` Paul E. McKenney
@ 2014-05-13 22:44                 ` Steven Rostedt
  2014-05-13 23:27                   ` Paul E. McKenney
  2014-05-14  6:42                   ` Thomas Gleixner
  0 siblings, 2 replies; 53+ messages in thread
From: Steven Rostedt @ 2014-05-13 22:44 UTC (permalink / raw)
  To: paulmck
  Cc: Thomas Gleixner, Peter Zijlstra, Lai Jiangshan, LKML, Dave Jones,
	Linus Torvalds, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Clark Williams, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Tue, 13 May 2014 15:00:09 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

 
> Good points -- I was indeed thinking about stress testing instead of
> algorithmic testing.

But doesn't lockdep use algorithmic tests too?

-- Steve

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-13 22:44                 ` Steven Rostedt
@ 2014-05-13 23:27                   ` Paul E. McKenney
  2014-05-13 23:53                     ` Steven Rostedt
  2014-05-14  6:42                   ` Thomas Gleixner
  1 sibling, 1 reply; 53+ messages in thread
From: Paul E. McKenney @ 2014-05-13 23:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Peter Zijlstra, Lai Jiangshan, LKML, Dave Jones,
	Linus Torvalds, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Clark Williams, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Tue, May 13, 2014 at 06:44:30PM -0400, Steven Rostedt wrote:
> On Tue, 13 May 2014 15:00:09 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> 
> > Good points -- I was indeed thinking about stress testing instead of
> > algorithmic testing.
> 
> But doesn't lockdep use algorithmic tests too?

I suppose you could argue that there is no such thing as non-algorithmic
testing, given that all test code uses an algorithm of some sort.  Perhaps
with the exception of letting your pet walk across the keyboard.  ;-)

Perhaps I should have instead said that I was thinking about random
testing instead of formal testing?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-13 23:27                   ` Paul E. McKenney
@ 2014-05-13 23:53                     ` Steven Rostedt
  2014-05-14  0:12                       ` Paul E. McKenney
  2014-05-14  6:54                       ` Thomas Gleixner
  0 siblings, 2 replies; 53+ messages in thread
From: Steven Rostedt @ 2014-05-13 23:53 UTC (permalink / raw)
  To: paulmck
  Cc: Thomas Gleixner, Peter Zijlstra, Lai Jiangshan, LKML, Dave Jones,
	Linus Torvalds, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Clark Williams, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Tue, 13 May 2014 16:27:11 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Tue, May 13, 2014 at 06:44:30PM -0400, Steven Rostedt wrote:
> > On Tue, 13 May 2014 15:00:09 -0700
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > 
> > > Good points -- I was indeed thinking about stress testing instead of
> > > algorithmic testing.
> > 
> > But doesn't lockdep use algorithmic tests too?
> 
> I suppose you could argue that there is no such thing as non-algorithmic
> testing, given that all test code uses an algorithm of some sort.  Perhaps
> with the exception of letting your pet walk across the keyboard.  ;-)
> 
> Perhaps I should have instead said that I was thinking about random
> testing instead of formal testing?

Actually it still applies, but I was mistaken, it's not lockdep itself,
it's the LOCKING_API_SELFTESTS. They are a form of formal testing as
suppose to random testing.

See lib/locking-selftest.c.

That looks more like something we can do for the rtmutex code, or even
add to it.

-- Steve

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-13 23:53                     ` Steven Rostedt
@ 2014-05-14  0:12                       ` Paul E. McKenney
  2014-05-14  6:54                       ` Thomas Gleixner
  1 sibling, 0 replies; 53+ messages in thread
From: Paul E. McKenney @ 2014-05-14  0:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, Peter Zijlstra, Lai Jiangshan, LKML, Dave Jones,
	Linus Torvalds, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Clark Williams, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Tue, May 13, 2014 at 07:53:36PM -0400, Steven Rostedt wrote:
> On Tue, 13 May 2014 16:27:11 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Tue, May 13, 2014 at 06:44:30PM -0400, Steven Rostedt wrote:
> > > On Tue, 13 May 2014 15:00:09 -0700
> > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > 
> > > 
> > > > Good points -- I was indeed thinking about stress testing instead of
> > > > algorithmic testing.
> > > 
> > > But doesn't lockdep use algorithmic tests too?
> > 
> > I suppose you could argue that there is no such thing as non-algorithmic
> > testing, given that all test code uses an algorithm of some sort.  Perhaps
> > with the exception of letting your pet walk across the keyboard.  ;-)
> > 
> > Perhaps I should have instead said that I was thinking about random
> > testing instead of formal testing?
> 
> Actually it still applies, but I was mistaken, it's not lockdep itself,
> it's the LOCKING_API_SELFTESTS. They are a form of formal testing as
> suppose to random testing.
> 
> See lib/locking-selftest.c.
> 
> That looks more like something we can do for the rtmutex code, or even
> add to it.

Ah, got it!  That could work, though I would be tempted to try
automatically generating the C code/tables/whatever from some behavioral
specification.  Of course, there is always the speculation about how I
might feel about that approach after giving into such temptation...  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-13 22:44                 ` Steven Rostedt
  2014-05-13 23:27                   ` Paul E. McKenney
@ 2014-05-14  6:42                   ` Thomas Gleixner
  1 sibling, 0 replies; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-14  6:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: paulmck, Peter Zijlstra, Lai Jiangshan, LKML, Dave Jones,
	Linus Torvalds, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Clark Williams, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Tue, 13 May 2014, Steven Rostedt wrote:

> On Tue, 13 May 2014 15:00:09 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
>  
> > Good points -- I was indeed thinking about stress testing instead of
> > algorithmic testing.
> 
> But doesn't lockdep use algorithmic tests too?

Kinda, but only for the deadlock detection.

I'm not sure if we want to inflict PI correctness and set_scheduler()
interaction to lockdep.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-13 23:53                     ` Steven Rostedt
  2014-05-14  0:12                       ` Paul E. McKenney
@ 2014-05-14  6:54                       ` Thomas Gleixner
       [not found]                         ` <CAGChsmO9GO1Z2VBbw7uLtTXpYowdoUQbK8C3=Dt2jtGAnc6D2A@mail.gmail.com>
  1 sibling, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-14  6:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: paulmck, Peter Zijlstra, Lai Jiangshan, LKML, Dave Jones,
	Linus Torvalds, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Clark Williams, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior



On Tue, 13 May 2014, Steven Rostedt wrote:

> On Tue, 13 May 2014 16:27:11 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Tue, May 13, 2014 at 06:44:30PM -0400, Steven Rostedt wrote:
> > > On Tue, 13 May 2014 15:00:09 -0700
> > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > 
> > > 
> > > > Good points -- I was indeed thinking about stress testing instead of
> > > > algorithmic testing.
> > > 
> > > But doesn't lockdep use algorithmic tests too?
> > 
> > I suppose you could argue that there is no such thing as non-algorithmic
> > testing, given that all test code uses an algorithm of some sort.  Perhaps
> > with the exception of letting your pet walk across the keyboard.  ;-)
> > 
> > Perhaps I should have instead said that I was thinking about random
> > testing instead of formal testing?
> 
> Actually it still applies, but I was mistaken, it's not lockdep itself,
> it's the LOCKING_API_SELFTESTS. They are a form of formal testing as
> suppose to random testing.
> 
> See lib/locking-selftest.c.
> 
> That looks more like something we can do for the rtmutex code, or even
> add to it.

It's called from very early init. So no threads ...

Thinking about it with a bit more awake brain, we probably can do it
completely from user space via futex except for a single case, which
we handle in the futex code: recursive AA of a single task.

I want to look into ABBA et al detection for the futexes anyway, so
that might be sufficient. Need to do a few experiments.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-13  3:54 ` Darren Hart
  2014-05-13  9:08   ` Thomas Gleixner
@ 2014-05-14  6:58   ` Carlos O'Donell
  2014-05-14  9:22     ` Peter Zijlstra
  2014-05-14  9:53     ` Thomas Gleixner
  1 sibling, 2 replies; 53+ messages in thread
From: Carlos O'Donell @ 2014-05-14  6:58 UTC (permalink / raw)
  To: Darren Hart, Thomas Gleixner
  Cc: LKML, Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Clark Williams,
	Paul McKenney, Lai Jiangshan, Roland McGrath, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On 05/12/2014 11:54 PM, Darren Hart wrote:
> On Mon, May 12, 2014 at 08:45:32PM -0000, Thomas Gleixner wrote:
>> Dave reported recently that the trinity syscall fuzzer triggers a
>> warning in the rtmutex code via the futex syscall.
>>
>> It took me quite a while to understand the issue, until I was able to
>> write a minimalistic reproducer.
>>
>> The first two patches address the issues and the third one is adding
>> an unrelated sanity check to prevent user space from assigning PI
>> futexes to kernel threads.
>>
>> I ran it through my usual tests (except of one, see below), Darrens
>> futex tester and it holds up against a 400 threads trinity whacking
>> for well above 12hrs now.
>>
>> Staring three days into futex.c/rtmutex.c and a few GB of traces
>> definitly brings one in a lousy mood. But I discovered two things
>> which made me outright grumpy:
>>
>>
>> 2) glibc fun
>>
>>    While writing an isolated test case for the trinity wreckage, I
>>    found out that glibc has an interesting misfeature:
>>
>>    strace tells me:
>>
>>    futex(0x600e00, FUTEX_LOCK_PI_PRIVATE, 1) = -1 EINVAL (Invalid argument)
>>
>>    but the return value of pthread_mutex_lock() is 0
> 
> So something is clearly wrong there - however, were you looking at the comments
> (sorry, I mean the C code), or the implementation (all the ASM)? The only way
> I've been able to be sure in the past is to delete the ASM files and recompile
> using the C files. Hopefully we'll be able to drop all the ASM in the pthread
> calls soonish (measured in years in glibc development time scales).... sigh.

I agree. Something is wrong. There are *few* cases where we might probe the
kernel to test for things, but you'd only see that failed futex syscall once.
If this is repeating for each call to pthread_mutex_lock, then I would appreciate
a test case I could use to debug this and then stick into the regression tests
for glibc.
 
>>    Yay, for another master piece of trainwreck engineering!
>>
>>    I checked the glibc source and of course neither the lock nor the unlock path
>>    gets any error code from the syscall propagated.

If there is a specific test case you have in mind where we fail to return an
appropriate error to the caller, please provide it.

>>    The handling of -EDEADLOCK is even more impressive. Instead of
>>    propagating it to the caller something in the guts of glibc calls pause().
>>
>>      futex(0x601300, FUTEX_LOCK_PI_PRIVATE, 1) = -1 EDEADLK (Resource deadlock avoided)
>>      pause(
>>
> 
> Gotta love comments like these though - such trust!:
> 
> 	/* The mutex is locked.  The kernel will now take care of
>            everything. */
> 
> IIRC, glibc takes the approach that if this operation fails, there is no way for
> it to recovery "properly", and so it chooses to:
> 
> 	/* Delay the thread indefinitely. */
> 
> I believe the thinking goes that if we get to here, then the lock is in an
> inconsistent state (between kernel and userspace). I don't have an answer for
> why pausing forever would be preferable to returning an error however...

What error would we return?

This particular case is a serious error for which we have no good error code
to return to userspace. It's an implementation defect, a bug, we should probably
assert instead of pausing.

We can't cancel the stuck thread because pthread_mutex_lock is not a cancellation
point.

In practice the rest of the application can make forward progress with a single
thread stuck. You can attach the debugger and inspect state, so it's useful
from that perspective.

>>    So any kind of futex wreckage which is not detectable by glibc
>>    itself ends up in a disaster one way or the other. How is Joe user
>>    of this stuff who reads the manpages supposed to find out that the
>>    kernel detected an inconsistent user space variable?
>>
>>    If someone of the glibc folks can be bothered to look after that,
>>    I'm happy to provide the simple test cases, but I'm sure that
>>    looking at the code which simply ignores or pauses on kernel error
>>    return values is enough to figure out why its broken.
> 
> Was there a spot where it ignores it completely? I suppose if it's anything
> other than ESRCH or EDEADLK... yuck...

What additional error codes are documented?

What additional error codes are returned?

Keep in mind glibc has to maintain the POSIX semantics for the other error codes.

We can't blindly return any error code, and if we did what would the user do
with that information? It's easy to say "return any and all error codes", but
in practice the user has to be able to do something actionable with that error.
If the error is indicative of a kernel bug, I'd be inclined to simply assert
or abort immediately. Then you get a core file for debugging.

>>    This does not only apply to the PI version, the bog standard
>>    futexes have error returns from the kernel as well, which are
>>    handled in the same absurd way.
> 
> We do have the attention of some of the new glibc maintainers with the requeue
> PI work, and some redhat folks are digging into other PI related spec issues.
> I'll bring this up in those discussions and see if there might be something we
> can do here.

Hi Darren! :-)

Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-13  9:08   ` Thomas Gleixner
@ 2014-05-14  7:06     ` Carlos O'Donell
  2014-05-14 10:26       ` Thomas Gleixner
  0 siblings, 1 reply; 53+ messages in thread
From: Carlos O'Donell @ 2014-05-14  7:06 UTC (permalink / raw)
  To: Thomas Gleixner, Darren Hart
  Cc: LKML, Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Clark Williams,
	Paul McKenney, Lai Jiangshan, Roland McGrath, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On 05/13/2014 05:08 AM, Thomas Gleixner wrote:
> On Mon, 12 May 2014, Darren Hart wrote:
>> On Mon, May 12, 2014 at 08:45:32PM -0000, Thomas Gleixner wrote:
>>>    strace tells me:
>>>
>>>    futex(0x600e00, FUTEX_LOCK_PI_PRIVATE, 1) = -1 EINVAL (Invalid argument)
>>>
>>>    but the return value of pthread_mutex_lock() is 0
>>
>> So something is clearly wrong there - however, were you looking at the comments
>> (sorry, I mean the C code), or the implementation (all the ASM)? The only way
>> I've been able to be sure in the past is to delete the ASM files and recompile
>> using the C files. Hopefully we'll be able to drop all the ASM in the pthread
>> calls soonish (measured in years in glibc development time scales).... sigh.
> 
> The C implementation does:
> 
> 	    if (INTERNAL_SYSCALL_ERROR_P (e, __err)
> 		&& (INTERNAL_SYSCALL_ERRNO (e, __err) == ESRCH
> 		    || INTERNAL_SYSCALL_ERRNO (e, __err) == EDEADLK))
> 	      {
> 		assert (INTERNAL_SYSCALL_ERRNO (e, __err) != EDEADLK
> 			|| (kind != PTHREAD_MUTEX_ERRORCHECK_NP
> 			    && kind != PTHREAD_MUTEX_RECURSIVE_NP));
> 		/* ESRCH can happen only for non-robust PI mutexes where
> 		   the owner of the lock died.  */
> 		assert (INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust);
> 
> 		/* Delay the thread indefinitely.  */
> 		while (1)
> 		  pause_not_cancel ();
> 	      }
> 
> So anything else than ESRCH and EDEADLK is ignored and then the thing
> happily returns 0 at the end. Unlock is the same:

The code is valid so long as you expect only ESRCH and EDEADLK
to be the only errors the kernel returns.

What other error codes are returned and under what conditions?

Are those other errors actionable by the user?

I'm inclined to abort() on anything but some agreed upon set of error returns.

Since anything other than the agreed upon set of error returns is a failure
in the coordinated implementation between glibc and the kernel.
 
> 	{
> 	  int robust = mutex->__data.__kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP;
> 	  int private = (robust
> 			 ? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
> 			 : PTHREAD_MUTEX_PSHARED (mutex));
> 	  INTERNAL_SYSCALL_DECL (__err);
> 	  INTERNAL_SYSCALL (futex, __err, 2, &mutex->__data.__lock,
> 			    __lll_private_flag (FUTEX_UNLOCK_PI, private));
> 	}

Isn't the the same issue as before? This code presumes that because the
atomic operation completed successfully that the kernel state is OK.
Assuming otherwise slows down the fast path in the unlock just to check
for kernel bugs. Is the returned error in any way actionable by the user?

Cheers,
Carlos.

Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-14  6:58   ` Carlos O'Donell
@ 2014-05-14  9:22     ` Peter Zijlstra
  2014-05-14 21:17       ` Carlos O'Donell
  2014-05-14  9:53     ` Thomas Gleixner
  1 sibling, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2014-05-14  9:22 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Darren Hart, Thomas Gleixner, LKML, Dave Jones, Linus Torvalds,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Steven Rostedt,
	Clark Williams, Paul McKenney, Lai Jiangshan, Roland McGrath,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

[-- Attachment #1: Type: text/plain, Size: 1936 bytes --]

On Wed, May 14, 2014 at 02:58:05AM -0400, Carlos O'Donell wrote:
> >>    The handling of -EDEADLOCK is even more impressive. Instead of
> >>    propagating it to the caller something in the guts of glibc calls pause().
> >>
> >>      futex(0x601300, FUTEX_LOCK_PI_PRIVATE, 1) = -1 EDEADLK (Resource deadlock avoided)
> >>      pause(
> >>
> > 
> > Gotta love comments like these though - such trust!:
> > 
> > 	/* The mutex is locked.  The kernel will now take care of
> >            everything. */
> > 
> > IIRC, glibc takes the approach that if this operation fails, there is no way for
> > it to recovery "properly", and so it chooses to:
> > 
> > 	/* Delay the thread indefinitely. */
> > 
> > I believe the thinking goes that if we get to here, then the lock is in an
> > inconsistent state (between kernel and userspace). I don't have an answer for
> > why pausing forever would be preferable to returning an error however...
> 
> What error would we return?

EDEADLK is a valid user return for pthread_mutex_lock() as per:

  http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html

> This particular case is a serious error for which we have no good error code
> to return to userspace. It's an implementation defect, a bug, we should probably
> assert instead of pausing.

No, its perfectly fine to have a lock sequence abort with -EDEADLK.
Userspace should release its locks and re-attempt.

You can implement usable locking schemes using this error, like
wound/wait locking.

> We can't cancel the stuck thread because pthread_mutex_lock is not a cancellation
> point.
> 
> In practice the rest of the application can make forward progress with a single
> thread stuck. You can attach the debugger and inspect state, so it's useful
> from that perspective.

That's just totally braindead. Return EDEADLK to userspace already, let
the user deal with it.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-14  6:58   ` Carlos O'Donell
  2014-05-14  9:22     ` Peter Zijlstra
@ 2014-05-14  9:53     ` Thomas Gleixner
  2014-05-14 10:07       ` Peter Zijlstra
  1 sibling, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-14  9:53 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Darren Hart, LKML, Dave Jones, Linus Torvalds, Peter Zijlstra,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Steven Rostedt,
	Clark Williams, Paul McKenney, Lai Jiangshan, Roland McGrath,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

On Wed, 14 May 2014, Carlos O'Donell wrote:
> I agree. Something is wrong. There are *few* cases where we might probe the
> kernel to test for things, but you'd only see that failed futex syscall once.
> If this is repeating for each call to pthread_mutex_lock, then I would appreciate
> a test case I could use to debug this and then stick into the regression tests
> for glibc.

That happens if user space corrupts the lock->data. Here you go:

#define _GNU_SOURCE
#include <pthread.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

#include <sys/syscall.h>

#define gettid() syscall(__NR_gettid)

static pthread_mutex_t m1;

static void *fn1(void *d)
{
	pthread_mutex_lock(&m1);
	sleep(60);
	return NULL;
}

static void *fn2(void *d)
{
	uint32_t *p = (uint32_t *) &m1;

	*p = (uint32_t) 0x80000000 | gettid();
	sleep(60);
	return NULL;
}

int main(void)
{
	pthread_mutexattr_t ma;
	pthread_t t1, t2;
	uint32_t *p;
	int ret;

	pthread_mutexattr_init(&ma);
	pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&m1, &ma);

	p = (uint32_t *) &m1;
	*p = (uint32_t) gettid();

	pthread_create(&t1, NULL, fn1, NULL);
	sleep(1);
	pthread_create(&t2, NULL, fn2, NULL);
	sleep(1);

	ret = pthread_mutex_lock(&m1);
	printf("ret %d %08x\n", ret, *p);
	return 0;
}

# strace ./fail
...
futex(0x600e20, FUTEX_LOCK_PI_PRIVATE, 1) = -1 EINVAL (Invalid argument)
...

# ./fail
ret 0 80002991

And note, you cannot detect the inconsistency of this in user
space. Simply because in case of PI futexes the kernel manipulates the
user space value and you have no idea what's going on in paralell.

So you need to handle the return value from the kernel proper. Whether
you assert or return -EINVAL to the caller is a different story, but
returning 0 is horribly wrong.

I prefer EINVAL as this matches the spec in the widest sense:

The pthread_mutex_lock(), pthread_mutex_trylock() and
pthread_mutex_unlock() functions may fail if:

  [EINVAL]
    The value specified by mutex does not refer to an initialised mutex object. 

And a wreckaged mutex is not a proper initialized one.

> > IIRC, glibc takes the approach that if this operation fails, there is no way for
> > it to recovery "properly", and so it chooses to:
> > 
> > 	/* Delay the thread indefinitely. */
> > 
> > I believe the thinking goes that if we get to here, then the lock is in an
> > inconsistent state (between kernel and userspace). I don't have an answer for
> > why pausing forever would be preferable to returning an error however...
> 
> What error would we return?
>
> This particular case is a serious error for which we have no good error code
> to return to userspace. It's an implementation defect, a bug, we should probably
> assert instead of pausing.

Errm.

http://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_mutex_lock.html

 The pthread_mutex_lock() function may fail if:

  [EDEADLK]
    The current thread already owns the mutex. 

That's a exactly the error code, which the kernel returns when it
detects a deadlock. 

And glibc returns EDEADLK at a lot of places already. So in that case
it's not a serious error? Because it's detected by glibc. You can't be
serious about that.

So why is a kernel detected deadlock different? Because it detects not
only AA, it detects ABBA and more. But it's still a dead lock. And
while posix spec only talks about AA, it's the very same issue.

So why not propagate this to the caller so he gets an alert right away
instead of letting him attach a debugger, and scratch his head and
lookup glibc source to find out why the hell glibc called pause.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-14  9:53     ` Thomas Gleixner
@ 2014-05-14 10:07       ` Peter Zijlstra
  2014-05-14 10:28         ` Thomas Gleixner
  0 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2014-05-14 10:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Carlos O'Donell, Darren Hart, LKML, Dave Jones,
	Linus Torvalds, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Steven Rostedt, Clark Williams, Paul McKenney, Lai Jiangshan,
	Roland McGrath, Jakub Jelinek, Michael Kerrisk,
	Sebastian Andrzej Siewior

[-- Attachment #1: Type: text/plain, Size: 1572 bytes --]

On Wed, May 14, 2014 at 11:53:44AM +0200, Thomas Gleixner wrote:
> > What error would we return?
> >
> > This particular case is a serious error for which we have no good error code
> > to return to userspace. It's an implementation defect, a bug, we should probably
> > assert instead of pausing.
> 
> Errm.
> 
> http://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_mutex_lock.html
> 
>  The pthread_mutex_lock() function may fail if:
> 
>   [EDEADLK]
>     The current thread already owns the mutex. 
> 
> That's a exactly the error code, which the kernel returns when it
> detects a deadlock. 
> 
> And glibc returns EDEADLK at a lot of places already. So in that case
> it's not a serious error? Because it's detected by glibc. You can't be
> serious about that.
> 
> So why is a kernel detected deadlock different? Because it detects not
> only AA, it detects ABBA and more. But it's still a dead lock. And
> while posix spec only talks about AA, it's the very same issue.
> 
> So why not propagate this to the caller so he gets an alert right away
> instead of letting him attach a debugger, and scratch his head and
> lookup glibc source to find out why the hell glibc called pause.

  http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html

  The pthread_mutex_lock() function may fail if:

  [EDEADLK]
      A deadlock condition was detected or the current thread already owns the mutex.

Which is explicitly wider than the AA recursion and fully supports the
full lock graph traversal we do.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-14  7:06     ` Carlos O'Donell
@ 2014-05-14 10:26       ` Thomas Gleixner
  2014-05-14 20:59         ` Carlos O'Donell
  0 siblings, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-14 10:26 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Darren Hart, LKML, Dave Jones, Linus Torvalds, Peter Zijlstra,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Steven Rostedt,
	Clark Williams, Paul McKenney, Lai Jiangshan, Roland McGrath,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

On Wed, 14 May 2014, Carlos O'Donell wrote:
> On 05/13/2014 05:08 AM, Thomas Gleixner wrote:
> > So anything else than ESRCH and EDEADLK is ignored and then the thing
> > happily returns 0 at the end. Unlock is the same:
> 
> The code is valid so long as you expect only ESRCH and EDEADLK
> to be the only errors the kernel returns.

Brilliant assumption, really. So we rather return 0 to the user than
having a catch for that case?

You know that this is how a large portion of security holes are
created, right?

> What other error codes are returned and under what conditions?
 
   -ENOMEM  Kernel can't allocate state.

   -EFAULT  Accessing the user space futex failed. Might be a mapping
    	    zapped by a different thread or no valid access (e.g. RO
    	    mapping)

   -EPERM   No permission to attach yourself to a futex owned by someone
   	    else or to unlock it.

   -EINVAL  Inconsistant state detected.

   -ETIMEDOUT
	    timed op timed out	

   -EWOULDBLOCK

	   For non PI indicates that the caller raced against another
	   thread.

	   For PI also returned if trylock fails.

   -ESRCH  Owner died

   -EAGAIN Owner is about to die, but has not yet finished the cleanup	   

> Are those other errors actionable by the user?

Define actionable. Most of them are in my opinion.

Some of them like -EWOULDBLOCK for non PI or -EAGAIN are for glibc
internal consumption.

> Since anything other than the agreed upon set of error returns is a failure
> in the coordinated implementation between glibc and the kernel.

Well. Back then when we implemented pi and robust futex support this
was done hand in hand with glibc folks and they knew about ALL the
error codes we return. I'm quite sure that we did not add any new one
since then, but I really can't be bothered to look.

> > 	{
> > 	  int robust = mutex->__data.__kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP;
> > 	  int private = (robust
> > 			 ? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
> > 			 : PTHREAD_MUTEX_PSHARED (mutex));
> > 	  INTERNAL_SYSCALL_DECL (__err);
> > 	  INTERNAL_SYSCALL (futex, __err, 2, &mutex->__data.__lock,
> > 			    __lll_private_flag (FUTEX_UNLOCK_PI, private));
> > 	}
> 
> Isn't the the same issue as before? This code presumes that because the
> atomic operation completed successfully that the kernel state is OK.
> Assuming otherwise slows down the fast path in the unlock just to check
> for kernel bugs. Is the returned error in any way actionable by the user?

We return EPERM, EFAULT and EINVAL on unlock.

   EPERM    Tells you that the user space / kernel state is
	    inconsistent, because kernel thinks the owner is not current.
   	    
   EINVAL   That would be a glibc bug calling unlock pi for a non futex,
   	    which has non pi waiters in the kernel.

	    That one was explicitely there from the very beginning.

And as I demonstrated in the other mail, it's easy to corrupt state
w/o glibc even being able to notice. So you better handle that
gracefully instead of making utterly stupid assumptions.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-14 10:07       ` Peter Zijlstra
@ 2014-05-14 10:28         ` Thomas Gleixner
  2014-05-16 17:55           ` Carlos O'Donell
  0 siblings, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-14 10:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Carlos O'Donell, Darren Hart, LKML, Dave Jones,
	Linus Torvalds, Darren Hart, Davidlohr Bueso, Ingo Molnar,
	Steven Rostedt, Clark Williams, Paul McKenney, Lai Jiangshan,
	Roland McGrath, Jakub Jelinek, Michael Kerrisk,
	Sebastian Andrzej Siewior

On Wed, 14 May 2014, Peter Zijlstra wrote:
> On Wed, May 14, 2014 at 11:53:44AM +0200, Thomas Gleixner wrote:
> > > What error would we return?
> > >
> > > This particular case is a serious error for which we have no good error code
> > > to return to userspace. It's an implementation defect, a bug, we should probably
> > > assert instead of pausing.
> > 
> > Errm.
> > 
> > http://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_mutex_lock.html
> > 
> >  The pthread_mutex_lock() function may fail if:
> > 
> >   [EDEADLK]
> >     The current thread already owns the mutex. 
> > 
> > That's a exactly the error code, which the kernel returns when it
> > detects a deadlock. 
> > 
> > And glibc returns EDEADLK at a lot of places already. So in that case
> > it's not a serious error? Because it's detected by glibc. You can't be
> > serious about that.
> > 
> > So why is a kernel detected deadlock different? Because it detects not
> > only AA, it detects ABBA and more. But it's still a dead lock. And
> > while posix spec only talks about AA, it's the very same issue.
> > 
> > So why not propagate this to the caller so he gets an alert right away
> > instead of letting him attach a debugger, and scratch his head and
> > lookup glibc source to find out why the hell glibc called pause.
> 
>   http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html

Yuck. I should not have used the first link Gurgle brought up.
 
>   The pthread_mutex_lock() function may fail if:
> 
>   [EDEADLK]
>       A deadlock condition was detected or the current thread already owns the mutex.
> 
> Which is explicitly wider than the AA recursion and fully supports the
> full lock graph traversal we do.
 
Definitely. It's what the kernel does :)

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
  2014-05-13  5:51   ` Lai Jiangshan
  2014-05-13  8:45     ` Thomas Gleixner
  2014-05-13  8:48     ` Peter Zijlstra
@ 2014-05-14 12:59     ` Thomas Gleixner
  2 siblings, 0 replies; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-14 12:59 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: LKML, Dave Jones, Linus Torvalds, Peter Zijlstra, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Clark Williams,
	Paul McKenney, Roland McGrath, Carlos ODonell, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On Tue, 13 May 2014, Lai Jiangshan wrote:

> Hi, Thomas,
> 
> I think this patch is just a workaround, it is not the proper fix.
> you need a updated deadlock-check mechanism:
> 
> - (old) skip the check when top_waiter != task_top_pi_waiter(task)
> + (new) skip the check when top_waiter->prio > task->prio

I don't think that helps. 
 
> >  	/*
> > +	 * Deadlock check for the following scenario:
> > +	 *
> > +	 * T holds lock L and has waiters
> > +	 * T locks L again, but does not end up as it's own top waiter
> 
> ABBA problem (TA TB TC TD are of the same priority)
> 
> TA holds lock LA, and try to lock LB which TC already has waited on
> TB holds lock LB, and try to lock LA which TD already has waited on
> 
> I think this check can't detect it IIUC.

Right it doesn't.
 
> 	/*
> 	 * Drop out, when the task has no waiters. Note,
> 	 * top_waiter can be NULL, when we are in the deboosting
> 	 * mode!
> 	 */
> 	if (top_waiter && (!task_has_pi_waiters(task) ||
> 			   top_waiter != task_top_pi_waiter(task)))
> 		goto out_unlock_pi;

So the issue here is, that we break out of the chain walk even if
deadlock detection is enabled.

The break out is correct for the boost case w/o deadlock detection, so
we won't do any pointless work.

For the deadlock detection case, we need to continue. But we should
store, that we are not the top_waiter so we can avoid the requeue
business when walking the chain.

Thanks,

	tglx





^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 1/3] rtmutex: Add missing deadlock check
       [not found]                         ` <CAGChsmO9GO1Z2VBbw7uLtTXpYowdoUQbK8C3=Dt2jtGAnc6D2A@mail.gmail.com>
@ 2014-05-14 13:33                           ` Thomas Gleixner
  0 siblings, 0 replies; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-14 13:33 UTC (permalink / raw)
  To: Darren Hart
  Cc: Carlos ODonell, laijs, Dave Jones, Michael Kerrisk,
	Jakub Jelinek, Roland McGrath, Linus Torvalds, Peter Zijlstra,
	Steven Rostedt, Davidlohr Bueso, Clark Williams,
	Paul E. McKenney, LKML, Sebastian Andrzej Siewior, Ingo Molnar

On Wed, 14 May 2014, Darren Hart wrote:
> On May 13, 2014 11:54 PM, "Thomas Gleixner" <tglx@linutronix.de> wrote:
> 
> I'll gladly accept new functional/algorithm type tests to futextest
> (performance stuff is probably going to perf instead, and Dave Miller beat

Dave Jones :)

> me to the fuzz tester with Trinity). But I think there is evidence we
> should build out the futex functional testing, and futextest seems a
> logical place for it.

Though as long as glibc is being silly we need to use the raw syscall
for it.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-14 10:26       ` Thomas Gleixner
@ 2014-05-14 20:59         ` Carlos O'Donell
  2014-05-14 22:54           ` Thomas Gleixner
                             ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Carlos O'Donell @ 2014-05-14 20:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Darren Hart, LKML, Dave Jones, Linus Torvalds, Peter Zijlstra,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Steven Rostedt,
	Clark Williams, Paul McKenney, Lai Jiangshan, Roland McGrath,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

On 05/14/2014 06:26 AM, Thomas Gleixner wrote:
> On Wed, 14 May 2014, Carlos O'Donell wrote:
>> On 05/13/2014 05:08 AM, Thomas Gleixner wrote:
>>> So anything else than ESRCH and EDEADLK is ignored and then the thing
>>> happily returns 0 at the end. Unlock is the same:
>>
>> The code is valid so long as you expect only ESRCH and EDEADLK
>> to be the only errors the kernel returns.
> 
> Brilliant assumption, really. So we rather return 0 to the user than
> having a catch for that case?

That's correct. At the time the argument was that performance of
the hot path was the only thing that mattered. For each additional
error code we add the hot path instruction length is enlarged.
I'm not arguing for or against, but simply stating the design
criteria that I know were part of the original work. Thus it would
have been perfectly valid to ignore all error codes except those
that mattered to the implementation, leaving failures like ENOMEM
to bubble up elsewhere instead of slowing down threads.
 
> You know that this is how a large portion of security holes are
> created, right?

I don't have any data to back up that claim, but I agree that
choosing performance as the only metric has resulted in code that
does not handle error conditions gracefully. Here we are 8 years
later talking about maintenance and robustness.

I will make my personal opinion clear:

- Internal defects should raise immediate assertions.

- Real problems like resource availability, deadlocks, and
  other recoverable errors should result in the API returning
  an appropriate error code that must not diverge from the POSIX
  definitions for those codes (when such a definition exists).

I'm not a believer in "only the hot path matters", there are such
things as robustness and error detection, and they matter.
 
>> What other error codes are returned and under what conditions?
>  
>    -ENOMEM  Kernel can't allocate state.
> 
>    -EFAULT  Accessing the user space futex failed. Might be a mapping
>     	    zapped by a different thread or no valid access (e.g. RO
>     	    mapping)
> 
>    -EPERM   No permission to attach yourself to a futex owned by someone
>    	    else or to unlock it.
> 
>    -EINVAL  Inconsistant state detected.
> 
>    -ETIMEDOUT
> 	    timed op timed out	
> 
>    -EWOULDBLOCK
> 
> 	   For non PI indicates that the caller raced against another
> 	   thread.
> 
> 	   For PI also returned if trylock fails.
> 
>    -ESRCH  Owner died
> 
>    -EAGAIN Owner is about to die, but has not yet finished the cleanup	   

Awesome. I like the direction this conversation is taking, since this is
actionable for me to fix the implementation.

Have we added all of this to the linux kernel man page entry for the
futex syscall? I consider Michael Kerrisk the trusted gate keeper for
documenting syscall APIs in sufficient detail that both the kernel and
glibc can agree on the usage (or any C library wishing to use the
interfaces).

I see Darren Hart has restarted the conversation with Michael to update
the futex() syscall documentation, which is the right way forward there.

>> Are those other errors actionable by the user?
> 
> Define actionable. Most of them are in my opinion.
> 
> Some of them like -EWOULDBLOCK for non PI or -EAGAIN are for glibc
> internal consumption.

That's what I mean. Can the user do anything useful with the error?

In the case of ENOMEM it's immediately obvious that the user could
attempt to give back memory to the kernel and retry the operation.

However, if glibc detects an inconsistency between the kernel and
glibc, I am not excited about returning EINVAL to the user, I would
rather abort or assert that the data structures have been corrupted.
The principal here being to abort or assert as close to the point at
which the corruption occurred to help the developer debug the problem.
I would argue that returning EINVAL to a library or application that
already manged to corrupt state is unlikely to know how to properly
handle the error anyway.

Would you agree?

>> Since anything other than the agreed upon set of error returns is a failure
>> in the coordinated implementation between glibc and the kernel.
> 
> Well. Back then when we implemented pi and robust futex support this
> was done hand in hand with glibc folks and they knew about ALL the
> error codes we return. I'm quite sure that we did not add any new one
> since then, but I really can't be bothered to look.

That's fine. If you assert the design is this way, then I trust you.
What I need from you is a list of all possible error returns, and you've
provided that, we're ready to fix things.

>>> 	{
>>> 	  int robust = mutex->__data.__kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP;
>>> 	  int private = (robust
>>> 			 ? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
>>> 			 : PTHREAD_MUTEX_PSHARED (mutex));
>>> 	  INTERNAL_SYSCALL_DECL (__err);
>>> 	  INTERNAL_SYSCALL (futex, __err, 2, &mutex->__data.__lock,
>>> 			    __lll_private_flag (FUTEX_UNLOCK_PI, private));
>>> 	}
>>
>> Isn't the the same issue as before? This code presumes that because the
>> atomic operation completed successfully that the kernel state is OK.
>> Assuming otherwise slows down the fast path in the unlock just to check
>> for kernel bugs. Is the returned error in any way actionable by the user?
> 
> We return EPERM, EFAULT and EINVAL on unlock.
> 
>    EPERM    Tells you that the user space / kernel state is
> 	    inconsistent, because kernel thinks the owner is not current.

I would assert on that.
    	    
>    EINVAL   That would be a glibc bug calling unlock pi for a non futex,
>    	    which has non pi waiters in the kernel.
> 
> 	    That one was explicitely there from the very beginning.

I would also assert on this.

> And as I demonstrated in the other mail, it's easy to corrupt state
> w/o glibc even being able to notice. So you better handle that
> gracefully instead of making utterly stupid assumptions.

The assumptions are predicated on a particular design that doesn't
seem to match your expectations. I accept that in this case you're
correct. We should be catching state corruption and asserting. It's
useful and prevents the program from continuing to corrupt other
potentially more important program state.

As I mentioned before I'm not overly worried about a few extra
instructions in the hot path. I am worried about programs that run
as if they were correct but in fact are corrupting state. Debugging
this is both costly and painful.

Nobody has looked at this code in years, likely because all parties
thought they were done, but had in fact agreed to distinct expectations
and assumptions about the implementation.

I want it fix, and I'm in a position to move resources to fix it.

Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-14  9:22     ` Peter Zijlstra
@ 2014-05-14 21:17       ` Carlos O'Donell
  2014-05-14 23:11         ` Thomas Gleixner
  2014-05-15  8:07         ` Peter Zijlstra
  0 siblings, 2 replies; 53+ messages in thread
From: Carlos O'Donell @ 2014-05-14 21:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Darren Hart, Thomas Gleixner, LKML, Dave Jones, Linus Torvalds,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Steven Rostedt,
	Clark Williams, Paul McKenney, Lai Jiangshan, Roland McGrath,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

On 05/14/2014 05:22 AM, Peter Zijlstra wrote:
>>> I believe the thinking goes that if we get to here, then the lock is in an
>>> inconsistent state (between kernel and userspace). I don't have an answer for
>>> why pausing forever would be preferable to returning an error however...
>>
>> What error would we return?
> 
> EDEADLK is a valid user return for pthread_mutex_lock() as per:
> 
>   http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html

How is that correct? It isn't a deadlock we've detected but inconsistent
state between glibc and the kernel. In this case glibc should assert.
Delaying indefinitely with pause() never seems correct (despite that being
what we do today).
 
>> This particular case is a serious error for which we have no good error code
>> to return to userspace. It's an implementation defect, a bug, we should probably
>> assert instead of pausing.
> 
> No, its perfectly fine to have a lock sequence abort with -EDEADLK.
> Userspace should release its locks and re-attempt.

I agree. If I can prove that it's actually a deadlock, and
that unlock/relock will work to fix it, then we can arrange for glibc
to return EDEADLK.

> You can implement usable locking schemes using this error, like
> wound/wait locking.

Agreed.

>> We can't cancel the stuck thread because pthread_mutex_lock is not a cancellation
>> point.
>>
>> In practice the rest of the application can make forward progress with a single
>> thread stuck. You can attach the debugger and inspect state, so it's useful
>> from that perspective.
> 
> That's just totally braindead. Return EDEADLK to userspace already, let
> the user deal with it.

Not all cases where EDEADLK returns is it such a case that the user
can make forward progress, it might be a corrupt state, in which case
if we detect the corrupt state I would assert. Otherwise, yes, we can
return EDEADLK and let the user figure it out.

Does that make sense?

Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-14 20:59         ` Carlos O'Donell
@ 2014-05-14 22:54           ` Thomas Gleixner
  2014-05-15  7:37           ` Peter Zijlstra
  2014-05-15  8:25           ` Peter Zijlstra
  2 siblings, 0 replies; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-14 22:54 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Darren Hart, LKML, Dave Jones, Linus Torvalds, Peter Zijlstra,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Steven Rostedt,
	Clark Williams, Paul McKenney, Lai Jiangshan, Roland McGrath,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

On Wed, 14 May 2014, Carlos O'Donell wrote:
> On 05/14/2014 06:26 AM, Thomas Gleixner wrote:
> > On Wed, 14 May 2014, Carlos O'Donell wrote:
> >> On 05/13/2014 05:08 AM, Thomas Gleixner wrote:
> >>> So anything else than ESRCH and EDEADLK is ignored and then the thing
> >>> happily returns 0 at the end. Unlock is the same:
> >>
> >> The code is valid so long as you expect only ESRCH and EDEADLK
> >> to be the only errors the kernel returns.
> > 
> > Brilliant assumption, really. So we rather return 0 to the user than
> > having a catch for that case?
> 
> That's correct. At the time the argument was that performance of
> the hot path was the only thing that mattered. For each additional
> error code we add the hot path instruction length is enlarged.
> I'm not arguing for or against, but simply stating the design
> criteria that I know were part of the original work. Thus it would
> have been perfectly valid to ignore all error codes except those
> that mattered to the implementation, leaving failures like ENOMEM
> to bubble up elsewhere instead of slowing down threads.

Why are you even trying to excuse these design decisions which are
outright stupid or worse?

1) Ignoring return vales which were already known at the time when the
   code was implemented is either complete incompetence or malice.

2) Ignoring return values in general is a brainwrecked idea. If the
   other party returns something you cannot deal with, whether caused
   by typo/stupidity or the need to introduce a new one does not
   matter.
  
   Simply because a glibc based test case for an interface will not
   expose that typo/brainfart or the new real requirement to return
   something different.

> I will make my personal opinion clear:
> 
> - Internal defects should raise immediate assertions.
> 
> - Real problems like resource availability, deadlocks, and
>   other recoverable errors should result in the API returning
>   an appropriate error code that must not diverge from the POSIX
>   definitions for those codes (when such a definition exists).
> 
> I'm not a believer in "only the hot path matters", there are such
> things as robustness and error detection, and they matter.

We are on the same page then.
  
> Awesome. I like the direction this conversation is taking, since this is
> actionable for me to fix the implementation.

Same here.
 
> >> Are those other errors actionable by the user?
> > 
> > Define actionable. Most of them are in my opinion.
> > 
> > Some of them like -EWOULDBLOCK for non PI or -EAGAIN are for glibc
> > internal consumption.
> 
> That's what I mean. Can the user do anything useful with the error?

I completely understand what you mean. Just our interpretations might
differ here and there, but that's a pure technical problem as long as
we agree that ignoring stuff is not the way to go :)
 
> However, if glibc detects an inconsistency between the kernel and
> glibc, I am not excited about returning EINVAL to the user, I would
> rather abort or assert that the data structures have been corrupted.
> The principal here being to abort or assert as close to the point at
> which the corruption occurred to help the developer debug the problem.
> I would argue that returning EINVAL to a library or application that
> already manged to corrupt state is unlikely to know how to properly
> handle the error anyway.
> 
> Would you agree?

I personally would prefer the return value, so I might be able to shut
down at least parts of my app gracefully, but that's more a
philosophical problem.

Putting my personal preference aside, I agree that it's better to
assert as 99% of callers will just ignore the return value anyway

> > We return EPERM, EFAULT and EINVAL on unlock.
> > 
> >    EPERM    Tells you that the user space / kernel state is
> > 	    inconsistent, because kernel thinks the owner is not current.
> 
> I would assert on that.

Fair enough.
     	    
> >    EINVAL   That would be a glibc bug calling unlock pi for a non futex,
> >    	    which has non pi waiters in the kernel.
> > 
> > 	    That one was explicitely there from the very beginning.
> 
> I would also assert on this.

Of course. There's nothing the user can do about it.
 
> > And as I demonstrated in the other mail, it's easy to corrupt state
> > w/o glibc even being able to notice. So you better handle that
> > gracefully instead of making utterly stupid assumptions.
> 
> The assumptions are predicated on a particular design that doesn't
> seem to match your expectations. I accept that in this case you're
> correct. We should be catching state corruption and asserting. It's
> useful and prevents the program from continuing to corrupt other
> potentially more important program state.

Agreed.
 
> As I mentioned before I'm not overly worried about a few extra
> instructions in the hot path. I am worried about programs that run
> as if they were correct but in fact are corrupting state. Debugging
> this is both costly and painful.

It's extremly painful for someone who is not familiar with the guts of
the kernel and the glibc details. Probably leading to very creative
"workarounds" ...
 
> Nobody has looked at this code in years, likely because all parties
> thought they were done, but had in fact agreed to distinct expectations
> and assumptions about the implementation.
> 
> I want it fix, and I'm in a position to move resources to fix it.

Great. If you need any help on particular details what the kernel side
is expecting and returning, I'm happy to help.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-14 21:17       ` Carlos O'Donell
@ 2014-05-14 23:11         ` Thomas Gleixner
  2014-05-16 17:54           ` Carlos O'Donell
  2014-05-15  8:07         ` Peter Zijlstra
  1 sibling, 1 reply; 53+ messages in thread
From: Thomas Gleixner @ 2014-05-14 23:11 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Peter Zijlstra, Darren Hart, LKML, Dave Jones, Linus Torvalds,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Steven Rostedt,
	Clark Williams, Paul McKenney, Lai Jiangshan, Roland McGrath,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

On Wed, 14 May 2014, Carlos O'Donell wrote:
> On 05/14/2014 05:22 AM, Peter Zijlstra wrote:
> >>> I believe the thinking goes that if we get to here, then the lock is in an
> >>> inconsistent state (between kernel and userspace). I don't have an answer for
> >>> why pausing forever would be preferable to returning an error however...
> >>
> >> What error would we return?
> > 
> > EDEADLK is a valid user return for pthread_mutex_lock() as per:
> > 
> >   http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html
> 
> How is that correct? It isn't a deadlock we've detected but inconsistent
> state between glibc and the kernel. In this case glibc should assert.
> Delaying indefinitely with pause() never seems correct (despite that being
> what we do today).

If there is inconsistent state detected then the kernel will return
-EPERM or -EINVAL. So lets put inconsistent state aside.

In glibc you only can detect the simple AA dead lock, i.e lock owner
tries to lock the lock it owns again. Trivial, right ?

But glibc has no idea which lock chains are involved and might lead to
a dead lock caused by nested locking, simplest and most popular being
ABBA.

The kernel can (if the implementation is fixed, patch is available
already) very well detect ABBA and even more complex nested lock
deadlocks. So it rightfully returns -EDEADLK and that is completely
correct versus the spec and the call site can do something about it.

And that's not different from the glibc detected AA deadlock at
all. It's just detected by a different mechanism.

On kernel side we currently provide this service only for the PI
futexes because we have a kernel side state representation as long as
the user space state is not corrupted.

Back then when it was implemented the dead lock detection actually
worked and was agreed on by both sides - kernel and glibc - to be
usefull and essential to the whole endavour.

Hope that helps.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-14 20:59         ` Carlos O'Donell
  2014-05-14 22:54           ` Thomas Gleixner
@ 2014-05-15  7:37           ` Peter Zijlstra
  2014-05-15  8:25           ` Peter Zijlstra
  2 siblings, 0 replies; 53+ messages in thread
From: Peter Zijlstra @ 2014-05-15  7:37 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Thomas Gleixner, Darren Hart, LKML, Dave Jones, Linus Torvalds,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Steven Rostedt,
	Clark Williams, Paul McKenney, Lai Jiangshan, Roland McGrath,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

[-- Attachment #1: Type: text/plain, Size: 636 bytes --]

On Wed, May 14, 2014 at 04:59:58PM -0400, Carlos O'Donell wrote:
> Nobody has looked at this code in years, likely because all parties
> thought they were done, but had in fact agreed to distinct expectations
> and assumptions about the implementation.

That's just not true, we've been >< close to hijacking all the
libpthread functionality from glibc through a kernel DSO in order to get
some movement.

Glibc has been known broken and unwilling to change for years and its
lead to immense frustration with people.

> I want it fix, and I'm in a position to move resources to fix it.

That's great news, and about bloody time too :-)

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-14 21:17       ` Carlos O'Donell
  2014-05-14 23:11         ` Thomas Gleixner
@ 2014-05-15  8:07         ` Peter Zijlstra
  2014-05-16 18:14           ` Carlos O'Donell
  1 sibling, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2014-05-15  8:07 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Darren Hart, Thomas Gleixner, LKML, Dave Jones, Linus Torvalds,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Steven Rostedt,
	Clark Williams, Paul McKenney, Lai Jiangshan, Roland McGrath,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

[-- Attachment #1: Type: text/plain, Size: 482 bytes --]

On Wed, May 14, 2014 at 05:17:35PM -0400, Carlos O'Donell wrote:
> > No, its perfectly fine to have a lock sequence abort with -EDEADLK.
> > Userspace should release its locks and re-attempt.
> 
> I agree. If I can prove that it's actually a deadlock, and
> that unlock/relock will work to fix it, then we can arrange for glibc
> to return EDEADLK.

The only reason the kernel would return EDEADLK is because its walked
the lock graph and determined its well, a deadlock.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-14 20:59         ` Carlos O'Donell
  2014-05-14 22:54           ` Thomas Gleixner
  2014-05-15  7:37           ` Peter Zijlstra
@ 2014-05-15  8:25           ` Peter Zijlstra
  2014-05-16 18:21             ` Carlos O'Donell
  2 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2014-05-15  8:25 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Thomas Gleixner, Darren Hart, LKML, Dave Jones, Linus Torvalds,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Steven Rostedt,
	Clark Williams, Paul McKenney, Lai Jiangshan, Roland McGrath,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

[-- Attachment #1: Type: text/plain, Size: 1029 bytes --]

On Wed, May 14, 2014 at 04:59:58PM -0400, Carlos O'Donell wrote:
> I will make my personal opinion clear:
> 
> - Internal defects should raise immediate assertions.
> 
> - Real problems like resource availability, deadlocks, and
>   other recoverable errors should result in the API returning
>   an appropriate error code that must not diverge from the POSIX
>   definitions for those codes (when such a definition exists).
> 
> I'm not a believer in "only the hot path matters", there are such
> things as robustness and error detection, and they matter.

Awesome. In case of doubt though, I would prefer a return to an assert,
just in case userspace actually does know wtf its doing ;-)

Granted, that seems to be very rare, but still, its entirely annoying
for those few people who do care to get dead programs.

Alternatively, we could have something like you have for the allocator
(which is, afaik, also considered a hot path) these env variables like
MALLOC_CHECK_ to influence this edge behaviour.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-14 23:11         ` Thomas Gleixner
@ 2014-05-16 17:54           ` Carlos O'Donell
  0 siblings, 0 replies; 53+ messages in thread
From: Carlos O'Donell @ 2014-05-16 17:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Darren Hart, LKML, Dave Jones, Linus Torvalds,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Steven Rostedt,
	Clark Williams, Paul McKenney, Lai Jiangshan, Roland McGrath,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

On 05/14/2014 07:11 PM, Thomas Gleixner wrote:
> On Wed, 14 May 2014, Carlos O'Donell wrote:
>> On 05/14/2014 05:22 AM, Peter Zijlstra wrote:
>>>>> I believe the thinking goes that if we get to here, then the lock is in an
>>>>> inconsistent state (between kernel and userspace). I don't have an answer for
>>>>> why pausing forever would be preferable to returning an error however...
>>>>
>>>> What error would we return?
>>>
>>> EDEADLK is a valid user return for pthread_mutex_lock() as per:
>>>
>>>   http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html
>>
>> How is that correct? It isn't a deadlock we've detected but inconsistent
>> state between glibc and the kernel. In this case glibc should assert.
>> Delaying indefinitely with pause() never seems correct (despite that being
>> what we do today).
> 
> If there is inconsistent state detected then the kernel will return
> -EPERM or -EINVAL. So lets put inconsistent state aside.

OK.
 
> In glibc you only can detect the simple AA dead lock, i.e lock owner
> tries to lock the lock it owns again. Trivial, right ?

Agreed.

> But glibc has no idea which lock chains are involved and might lead to
> a dead lock caused by nested locking, simplest and most popular being
> ABBA.

OK.

> The kernel can (if the implementation is fixed, patch is available
> already) very well detect ABBA and even more complex nested lock
> deadlocks. So it rightfully returns -EDEADLK and that is completely
> correct versus the spec and the call site can do something about it.

OK.

> And that's not different from the glibc detected AA deadlock at
> all. It's just detected by a different mechanism.

OK.

> On kernel side we currently provide this service only for the PI
> futexes because we have a kernel side state representation as long as
> the user space state is not corrupted.

OK.

> Back then when it was implemented the dead lock detection actually
> worked and was agreed on by both sides - kernel and glibc - to be
> usefull and essential to the whole endavour.

I agree that ignoring the situation of corrupted or inconsistent
state we should be returning EDEADLK to userspace.

We'll cleanup glibc.

Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-14 10:28         ` Thomas Gleixner
@ 2014-05-16 17:55           ` Carlos O'Donell
  0 siblings, 0 replies; 53+ messages in thread
From: Carlos O'Donell @ 2014-05-16 17:55 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra
  Cc: Darren Hart, LKML, Dave Jones, Linus Torvalds, Darren Hart,
	Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Clark Williams,
	Paul McKenney, Lai Jiangshan, Roland McGrath, Jakub Jelinek,
	Michael Kerrisk, Sebastian Andrzej Siewior

On 05/14/2014 06:28 AM, Thomas Gleixner wrote:
> On Wed, 14 May 2014, Peter Zijlstra wrote:
>> On Wed, May 14, 2014 at 11:53:44AM +0200, Thomas Gleixner wrote:
>>>> What error would we return?
>>>>
>>>> This particular case is a serious error for which we have no good error code
>>>> to return to userspace. It's an implementation defect, a bug, we should probably
>>>> assert instead of pausing.
>>>
>>> Errm.
>>>
>>> http://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_mutex_lock.html
>>>
>>>  The pthread_mutex_lock() function may fail if:
>>>
>>>   [EDEADLK]
>>>     The current thread already owns the mutex. 
>>>
>>> That's a exactly the error code, which the kernel returns when it
>>> detects a deadlock. 
>>>
>>> And glibc returns EDEADLK at a lot of places already. So in that case
>>> it's not a serious error? Because it's detected by glibc. You can't be
>>> serious about that.
>>>
>>> So why is a kernel detected deadlock different? Because it detects not
>>> only AA, it detects ABBA and more. But it's still a dead lock. And
>>> while posix spec only talks about AA, it's the very same issue.
>>>
>>> So why not propagate this to the caller so he gets an alert right away
>>> instead of letting him attach a debugger, and scratch his head and
>>> lookup glibc source to find out why the hell glibc called pause.
>>
>>   http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html
> 
> Yuck. I should not have used the first link Gurgle brought up.

For the record the correct link is for POSIX Issue 7 (Issue 8 under development).

http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutex_lock.html

The Issue 7 version has a nice table :}

Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-15  8:07         ` Peter Zijlstra
@ 2014-05-16 18:14           ` Carlos O'Donell
  0 siblings, 0 replies; 53+ messages in thread
From: Carlos O'Donell @ 2014-05-16 18:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Darren Hart, Thomas Gleixner, LKML, Dave Jones, Linus Torvalds,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Steven Rostedt,
	Clark Williams, Paul McKenney, Lai Jiangshan, Roland McGrath,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

On 05/15/2014 04:07 AM, Peter Zijlstra wrote:
> On Wed, May 14, 2014 at 05:17:35PM -0400, Carlos O'Donell wrote:
>>> No, its perfectly fine to have a lock sequence abort with -EDEADLK.
>>> Userspace should release its locks and re-attempt.
>>
>> I agree. If I can prove that it's actually a deadlock, and
>> that unlock/relock will work to fix it, then we can arrange for glibc
>> to return EDEADLK.
> 
> The only reason the kernel would return EDEADLK is because its walked
> the lock graph and determined its well, a deadlock.

Perfect. No further comments from me then.

Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity
  2014-05-15  8:25           ` Peter Zijlstra
@ 2014-05-16 18:21             ` Carlos O'Donell
  0 siblings, 0 replies; 53+ messages in thread
From: Carlos O'Donell @ 2014-05-16 18:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Darren Hart, LKML, Dave Jones, Linus Torvalds,
	Darren Hart, Davidlohr Bueso, Ingo Molnar, Steven Rostedt,
	Clark Williams, Paul McKenney, Lai Jiangshan, Roland McGrath,
	Jakub Jelinek, Michael Kerrisk, Sebastian Andrzej Siewior

On 05/15/2014 04:25 AM, Peter Zijlstra wrote:
> On Wed, May 14, 2014 at 04:59:58PM -0400, Carlos O'Donell wrote:
>> I will make my personal opinion clear:
>>
>> - Internal defects should raise immediate assertions.
>>
>> - Real problems like resource availability, deadlocks, and
>>   other recoverable errors should result in the API returning
>>   an appropriate error code that must not diverge from the POSIX
>>   definitions for those codes (when such a definition exists).
>>
>> I'm not a believer in "only the hot path matters", there are such
>> things as robustness and error detection, and they matter.
> 
> Awesome. In case of doubt though, I would prefer a return to an assert,
> just in case userspace actually does know wtf its doing ;-)

No. In that case the person who knows attaches a debugger to determine
why the internal state is inconsistent. That may require kernel or glibc
debugging and asserting as close to the point of corruption is the only
useful behaviour. I know it's painful, but the number of people who know
what they are doing is vanishingly small compared to the other set.

> Granted, that seems to be very rare, but still, its entirely annoying
> for those few people who do care to get dead programs.
> 
> Alternatively, we could have something like you have for the allocator
> (which is, afaik, also considered a hot path) these env variables like
> MALLOC_CHECK_ to influence this edge behaviour.

We are considering a runtime tunnables framework to unify all of these
kinds of tweaks into a stable API. Given that asserting or not asserting
does not impact the standards conformance we could make that a tunnable
with the default being to assert. The tunnables framework is still pie
in the sky because we need a low-overhead framework to check the global
tunnables. However, we need them, as I've mentioned before as an example
we have an ancient 40MB stack cache in glibc for thread stack reuse that
nobody remembers why it was tuned to that value. Magic.

Cheers,
Carlos.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [tip:core/urgent] futex: Add another early deadlock detection check
  2014-05-12 20:45 ` [patch 2/3] futex: Add another early deadlock detection check Thomas Gleixner
@ 2014-05-19 12:22   ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 53+ messages in thread
From: tip-bot for Thomas Gleixner @ 2014-05-19 12:22 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, williams, torvalds, peterz, jakub, rostedt, bigeasy, tglx,
	laijs, davidlohr, linux-kernel, hpa, darren, davej, paulmck,
	carlos, roland, mtk.manpages

Commit-ID:  866293ee54227584ffcb4a42f69c1f365974ba7f
Gitweb:     http://git.kernel.org/tip/866293ee54227584ffcb4a42f69c1f365974ba7f
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 12 May 2014 20:45:34 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Mon, 19 May 2014 21:18:49 +0900

futex: Add another early deadlock detection check

Dave Jones trinity syscall fuzzer exposed an issue in the deadlock
detection code of rtmutex:
  http://lkml.kernel.org/r/20140429151655.GA14277@redhat.com

That underlying issue has been fixed with a patch to the rtmutex code,
but the futex code must not call into rtmutex in that case because
    - it can detect that issue early
    - it avoids a different and more complex fixup for backing out

If the user space variable got manipulated to 0x80000000 which means
no lock holder, but the waiters bit set and an active pi_state in the
kernel is found we can figure out the recursive locking issue by
looking at the pi_state owner. If that is the current task, then we
can safely return -EDEADLK.

The check should have been added in commit 59fa62451 (futex: Handle
futex_pi OWNER_DIED take over correctly) already, but I did not see
the above issue caused by user space manipulation back then.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Carlos ODonell <carlos@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20140512201701.097349971@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
---
 kernel/futex.c | 47 ++++++++++++++++++++++++++++++++++-------------
 1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 5f58927..7c68225 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -745,7 +745,8 @@ void exit_pi_state_list(struct task_struct *curr)
 
 static int
 lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
-		union futex_key *key, struct futex_pi_state **ps)
+		union futex_key *key, struct futex_pi_state **ps,
+		struct task_struct *task)
 {
 	struct futex_pi_state *pi_state = NULL;
 	struct futex_q *this, *next;
@@ -786,6 +787,16 @@ lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
 					return -EINVAL;
 			}
 
+			/*
+			 * Protect against a corrupted uval. If uval
+			 * is 0x80000000 then pid is 0 and the waiter
+			 * bit is set. So the deadlock check in the
+			 * calling code has failed and we did not fall
+			 * into the check above due to !pid.
+			 */
+			if (task && pi_state->owner == task)
+				return -EDEADLK;
+
 			atomic_inc(&pi_state->refcount);
 			*ps = pi_state;
 
@@ -935,7 +946,7 @@ retry:
 	 * We dont have the lock. Look up the PI state (or create it if
 	 * we are the first waiter):
 	 */
-	ret = lookup_pi_state(uval, hb, key, ps);
+	ret = lookup_pi_state(uval, hb, key, ps, task);
 
 	if (unlikely(ret)) {
 		switch (ret) {
@@ -1347,7 +1358,7 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
  *
  * Return:
  *  0 - failed to acquire the lock atomically;
- *  1 - acquired the lock;
+ * >0 - acquired the lock, return value is vpid of the top_waiter
  * <0 - error
  */
 static int futex_proxy_trylock_atomic(u32 __user *pifutex,
@@ -1358,7 +1369,7 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex,
 {
 	struct futex_q *top_waiter = NULL;
 	u32 curval;
-	int ret;
+	int ret, vpid;
 
 	if (get_futex_value_locked(&curval, pifutex))
 		return -EFAULT;
@@ -1386,11 +1397,13 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex,
 	 * the contended case or if set_waiters is 1.  The pi_state is returned
 	 * in ps in contended cases.
 	 */
+	vpid = task_pid_vnr(top_waiter->task);
 	ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task,
 				   set_waiters);
-	if (ret == 1)
+	if (ret == 1) {
 		requeue_pi_wake_futex(top_waiter, key2, hb2);
-
+		return vpid;
+	}
 	return ret;
 }
 
@@ -1421,7 +1434,6 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 	struct futex_pi_state *pi_state = NULL;
 	struct futex_hash_bucket *hb1, *hb2;
 	struct futex_q *this, *next;
-	u32 curval2;
 
 	if (requeue_pi) {
 		/*
@@ -1509,16 +1521,25 @@ retry_private:
 		 * At this point the top_waiter has either taken uaddr2 or is
 		 * waiting on it.  If the former, then the pi_state will not
 		 * exist yet, look it up one more time to ensure we have a
-		 * reference to it.
+		 * reference to it. If the lock was taken, ret contains the
+		 * vpid of the top waiter task.
 		 */
-		if (ret == 1) {
+		if (ret > 0) {
 			WARN_ON(pi_state);
 			drop_count++;
 			task_count++;
-			ret = get_futex_value_locked(&curval2, uaddr2);
-			if (!ret)
-				ret = lookup_pi_state(curval2, hb2, &key2,
-						      &pi_state);
+			/*
+			 * If we acquired the lock, then the user
+			 * space value of uaddr2 should be vpid. It
+			 * cannot be changed by the top waiter as it
+			 * is blocked on hb2 lock if it tries to do
+			 * so. If something fiddled with it behind our
+			 * back the pi state lookup might unearth
+			 * it. So we rather use the known value than
+			 * rereading and handing potential crap to
+			 * lookup_pi_state.
+			 */
+			ret = lookup_pi_state(ret, hb2, &key2, &pi_state, NULL);
 		}
 
 		switch (ret) {

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [tip:core/urgent] futex: Prevent attaching to kernel threads
  2014-05-12 20:45 ` [patch 3/3] futex: Prevent attaching to kernel threads Thomas Gleixner
  2014-05-12 20:54   ` Peter Zijlstra
  2014-05-12 21:59   ` Davidlohr Bueso
@ 2014-05-19 12:22   ` tip-bot for Thomas Gleixner
  2 siblings, 0 replies; 53+ messages in thread
From: tip-bot for Thomas Gleixner @ 2014-05-19 12:22 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, williams, torvalds, peterz, jakub, rostedt, bigeasy, tglx,
	laijs, davidlohr, linux-kernel, hpa, darren, davej, paulmck,
	carlos, roland, mtk.manpages

Commit-ID:  f0d71b3dcb8332f7971b5f2363632573e6d9486a
Gitweb:     http://git.kernel.org/tip/f0d71b3dcb8332f7971b5f2363632573e6d9486a
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 12 May 2014 20:45:35 +0000
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Mon, 19 May 2014 21:18:49 +0900

futex: Prevent attaching to kernel threads

We happily allow userspace to declare a random kernel thread to be the
owner of a user space PI futex.

Found while analysing the fallout of Dave Jones syscall fuzzer.

We also should validate the thread group for private futexes and find
some fast way to validate whether the "alleged" owner has RW access on
the file which backs the SHM, but that's a separate issue.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Carlos ODonell <carlos@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20140512201701.194824402@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
---
 kernel/futex.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 7c68225..81dbe77 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -814,6 +814,11 @@ lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
 	if (!p)
 		return -ESRCH;
 
+	if (!p->mm) {
+		put_task_struct(p);
+		return -EPERM;
+	}
+
 	/*
 	 * We need to look at the task state flags to figure out,
 	 * whether the task is exiting. To protect against the do_exit

^ permalink raw reply related	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2014-05-19 12:24 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-12 20:45 [patch 0/3] futex/rtmutex: Fix issues exposed by trinity Thomas Gleixner
2014-05-12 20:45 ` [patch 1/3] rtmutex: Add missing deadlock check Thomas Gleixner
2014-05-13  5:51   ` Lai Jiangshan
2014-05-13  8:45     ` Thomas Gleixner
2014-05-13  8:48     ` Peter Zijlstra
2014-05-13 16:12       ` Paul E. McKenney
2014-05-13 19:42       ` Thomas Gleixner
2014-05-13 20:20         ` Steven Rostedt
2014-05-13 20:36           ` Paul E. McKenney
2014-05-13 21:27             ` Thomas Gleixner
2014-05-13 22:00               ` Paul E. McKenney
2014-05-13 22:44                 ` Steven Rostedt
2014-05-13 23:27                   ` Paul E. McKenney
2014-05-13 23:53                     ` Steven Rostedt
2014-05-14  0:12                       ` Paul E. McKenney
2014-05-14  6:54                       ` Thomas Gleixner
     [not found]                         ` <CAGChsmO9GO1Z2VBbw7uLtTXpYowdoUQbK8C3=Dt2jtGAnc6D2A@mail.gmail.com>
2014-05-14 13:33                           ` Thomas Gleixner
2014-05-14  6:42                   ` Thomas Gleixner
2014-05-14 12:59     ` Thomas Gleixner
2014-05-12 20:45 ` [patch 2/3] futex: Add another early deadlock detection check Thomas Gleixner
2014-05-19 12:22   ` [tip:core/urgent] " tip-bot for Thomas Gleixner
2014-05-12 20:45 ` [patch 3/3] futex: Prevent attaching to kernel threads Thomas Gleixner
2014-05-12 20:54   ` Peter Zijlstra
2014-05-12 21:16     ` Thomas Gleixner
2014-05-12 21:59   ` Davidlohr Bueso
2014-05-12 22:18     ` Thomas Gleixner
2014-05-19 12:22   ` [tip:core/urgent] " tip-bot for Thomas Gleixner
2014-05-12 21:37 ` [patch 0/3] futex/rtmutex: Fix issues exposed by trinity Steven Rostedt
2014-05-12 21:52   ` Thomas Gleixner
2014-05-12 22:08     ` Steven Rostedt
2014-05-12 22:37       ` Thomas Gleixner
2014-05-12 23:18         ` Steven Rostedt
2014-05-13  6:37   ` Ingo Molnar
2014-05-13  3:54 ` Darren Hart
2014-05-13  9:08   ` Thomas Gleixner
2014-05-14  7:06     ` Carlos O'Donell
2014-05-14 10:26       ` Thomas Gleixner
2014-05-14 20:59         ` Carlos O'Donell
2014-05-14 22:54           ` Thomas Gleixner
2014-05-15  7:37           ` Peter Zijlstra
2014-05-15  8:25           ` Peter Zijlstra
2014-05-16 18:21             ` Carlos O'Donell
2014-05-14  6:58   ` Carlos O'Donell
2014-05-14  9:22     ` Peter Zijlstra
2014-05-14 21:17       ` Carlos O'Donell
2014-05-14 23:11         ` Thomas Gleixner
2014-05-16 17:54           ` Carlos O'Donell
2014-05-15  8:07         ` Peter Zijlstra
2014-05-16 18:14           ` Carlos O'Donell
2014-05-14  9:53     ` Thomas Gleixner
2014-05-14 10:07       ` Peter Zijlstra
2014-05-14 10:28         ` Thomas Gleixner
2014-05-16 17:55           ` Carlos O'Donell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).