* improve futex on -RT by avoiding the double wake-up
@ 2015-04-07 15:03 Sebastian Andrzej Siewior
  2015-04-07 15:03 ` [PATCH 1/3] futex: avoid double wake up in PI futex wait / wake on -RT Sebastian Andrzej Siewior
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-04-07 15:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Darren Hart,
	Steven Rostedt, fredrik.markstrom, Davidlohr Bueso,
	Manfred Spraul

On -RT we see a double wake up if the waiter has a higher priority than
the process doing the wake up. Patch #1 avoids it in the PI wake code,
patch #2 in the non-PI variant (and only for a single waiter).

#3 is a small optimisation for message queues, avoiding a "cpu_relax()"
loop.
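
The scenario can be reproduced from user space with a priority-inheritance
mutex shared between a low and a high priority SCHED_FIFO thread pinned to
one CPU. A minimal sketch (the priorities, sleeps and thread names are
assumptions mirroring the trace in patch #1; the med_prio thread and the
tracing setup are omitted, and SCHED_FIFO requires suitable privileges):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

static pthread_mutex_t pi_lock;

static void set_fifo(int prio)
{
	struct sched_param sp = { .sched_priority = prio };

	pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}

static void *low_fn(void *arg)
{
	(void)arg;
	set_fifo(1);
	pthread_mutex_lock(&pi_lock);
	usleep(1000);			/* high_fn blocks on the PI futex now */
	pthread_mutex_unlock(&pi_lock);	/* the boost/deboost dance happens here */
	return NULL;
}

static void *high_fn(void *arg)
{
	(void)arg;
	set_fifo(90);
	usleep(100);			/* let low_fn take the lock first */
	pthread_mutex_lock(&pi_lock);	/* FUTEX_LOCK_PI: boosts low_fn */
	pthread_mutex_unlock(&pi_lock);
	return NULL;
}

int main(void)
{
	pthread_mutexattr_t ma;
	pthread_t l, h;
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set);	/* one CPU, as in the trace */

	pthread_mutexattr_init(&ma);
	pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&pi_lock, &ma);

	pthread_create(&l, NULL, low_fn, NULL);
	pthread_create(&h, NULL, high_fn, NULL);
	pthread_join(l, NULL);
	pthread_join(h, NULL);
	return 0;
}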


* [PATCH 1/3] futex: avoid double wake up in PI futex wait / wake on -RT
  2015-04-07 15:03 improve futex on -RT by avoiding the double wake-up Sebastian Andrzej Siewior
@ 2015-04-07 15:03 ` Sebastian Andrzej Siewior
  2015-04-07 18:41   ` Thomas Gleixner
  2015-04-07 15:03 ` [PATCH 2/3] futex: avoid double wake up in futex_wake() " Sebastian Andrzej Siewior
  2015-04-07 15:03 ` [PATCH 3/3] ipc/mqueue: remove STATE_PENDING Sebastian Andrzej Siewior
  2 siblings, 1 reply; 27+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-04-07 15:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Darren Hart,
	Steven Rostedt, fredrik.markstrom, Davidlohr Bueso,
	Manfred Spraul, Sebastian Andrzej Siewior

From: Thomas Gleixner <tglx@linutronix.de>

The boosted priority is reverted after the unlock but before the
futex_hash_bucket (hb) has been accessed. The result is that we boost the
task, deboost the task, boost again for the hb lock, deboost again.
A sched trace of this scenario looks like the following:

| med_prio-93  sched_wakeup: comm=high_prio pid=92 prio=9 success=1 target_cpu=000
| med_prio-93  sched_switch: prev_comm=med_prio prev_pid=93 prev_prio=29 prev_state=R ==> next_comm=high_prio next_pid=92 next_prio=9
|high_prio-92  sched_pi_setprio: comm=low_prio pid=91 oldprio=120 newprio=9
|high_prio-92  sched_switch: prev_comm=high_prio prev_pid=92 prev_prio=9 prev_state=S ==> next_comm=low_prio next_pid=91 next_prio=9
| low_prio-91  sched_wakeup: comm=high_prio pid=92 prio=9 success=1 target_cpu=000
| low_prio-91  sched_pi_setprio: comm=low_prio pid=91 oldprio=9 newprio=120
| low_prio-91  sched_switch: prev_comm=low_prio prev_pid=91 prev_prio=120 prev_state=R+ ==> next_comm=high_prio next_pid=92 next_prio=9
|high_prio-92  sched_pi_setprio: comm=low_prio pid=91 oldprio=120 newprio=9
|high_prio-92  sched_switch: prev_comm=high_prio prev_pid=92 prev_prio=9 prev_state=D ==> next_comm=low_prio next_pid=91 next_prio=9
| low_prio-91  sched_wakeup: comm=high_prio pid=92 prio=9 success=1 target_cpu=000
| low_prio-91  sched_pi_setprio: comm=low_prio pid=91 oldprio=9 newprio=120
| low_prio-91  sched_switch: prev_comm=low_prio prev_pid=91 prev_prio=120 prev_state=R+ ==> next_comm=high_prio next_pid=92 next_prio=9

We see four sched_pi_setprio() invocations but ideally two would be enough.
The patch avoids the double wakeup by performing the wake up only once the
hb lock has been released. The same test case:

| med_prio-21  sched_wakeup: comm=high_prio pid=20 prio=9 success=1 target_cpu=000
| med_prio-21  sched_switch: prev_comm=med_prio prev_pid=21 prev_prio=29 prev_state=R ==> next_comm=high_prio next_pid=20 next_prio=9
|high_prio-20  sched_pi_setprio: comm=low_prio pid=19 oldprio=120 newprio=9
|high_prio-20  sched_switch: prev_comm=high_prio prev_pid=20 prev_prio=9 prev_state=S ==> next_comm=low_prio next_pid=19 next_prio=9
| low_prio-19  sched_wakeup: comm=high_prio pid=20 prio=9 success=1 target_cpu=000
| low_prio-19  sched_pi_setprio: comm=low_prio pid=19 oldprio=9 newprio=120
| low_prio-19  sched_switch: prev_comm=low_prio prev_pid=19 prev_prio=120 prev_state=R+ ==> next_comm=high_prio next_pid=20 next_prio=9

Only two sched_pi_setprio() invocations, as one would expect and as seen
without -RT.
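
In short, the fix is an ordering change (a simplified sketch; the full
change is in the diff below):

	/* Before: the rtmutex unlock deboosts while hb->lock is still
	 * held, so the freshly woken waiter boosts us again for hb->lock. */
	rt_mutex_unlock(&pi_state->pi_mutex);	/* wake + deboost */
	spin_unlock(&hb->lock);

	/* After: note whether a deboost is due, drop hb->lock first. */
	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex);	/* wake only */
	spin_unlock(&hb->lock);
	if (deboost)
		rt_mutex_adjust_prio(current);	/* deboost without hb->lock held */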

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex.c                  | 32 +++++++++++++++++++++++++++++---
 kernel/locking/rtmutex.c        | 39 ++++++++++++++++++++++++++++-----------
 kernel/locking/rtmutex_common.h |  4 ++++
 3 files changed, 61 insertions(+), 14 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 2579e407ff67..b38abe3573a8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1122,11 +1122,13 @@ static void wake_futex(struct futex_q *q)
 	put_task_struct(p);
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+			 struct futex_hash_bucket *hb)
 {
 	struct task_struct *new_owner;
 	struct futex_pi_state *pi_state = this->pi_state;
 	u32 uninitialized_var(curval), newval;
+	bool deboost;
 	int ret = 0;
 
 	if (!pi_state)
@@ -1178,7 +1180,17 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
 	raw_spin_unlock_irq(&new_owner->pi_lock);
 
 	raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
-	rt_mutex_unlock(&pi_state->pi_mutex);
+
+	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex);
+
+	/*
+	 * We deboost after dropping hb->lock. That prevents a double
+	 * wakeup on RT.
+	 */
+	spin_unlock(&hb->lock);
+
+	if (deboost)
+		rt_mutex_adjust_prio(current);
 
 	return 0;
 }
@@ -2412,13 +2424,26 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	 */
 	match = futex_top_waiter(hb, &key);
 	if (match) {
-		ret = wake_futex_pi(uaddr, uval, match);
+		ret = wake_futex_pi(uaddr, uval, match, hb);
+
+		/*
+		 * In case of success wake_futex_pi dropped the hash
+		 * bucket lock.
+		 */
+		if (!ret)
+			goto out_putkey;
+
 		/*
 		 * The atomic access to the futex value generated a
 		 * pagefault, so retry the user-access and the wakeup:
 		 */
 		if (ret == -EFAULT)
 			goto pi_faulted;
+
+		/*
+		 * wake_futex_pi has detected invalid state. Tell user
+		 * space.
+		 */
 		goto out_unlock;
 	}
 
@@ -2439,6 +2464,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 
 out_unlock:
 	spin_unlock(&hb->lock);
+out_putkey:
 	put_futex_key(&key);
 	return ret;
 
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index b73279367087..9d858106ba0c 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -298,7 +298,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
  * of task. We do not use the spin_xx_mutex() variants here as we are
  * outside of the debug path.)
  */
-static void rt_mutex_adjust_prio(struct task_struct *task)
+void rt_mutex_adjust_prio(struct task_struct *task)
 {
 	unsigned long flags;
 
@@ -955,8 +955,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 /*
  * Wake up the next waiter on the lock.
  *
- * Remove the top waiter from the current tasks pi waiter list and
- * wake it up.
+ * Remove the top waiter from the current tasks pi waiter list take a
+ * reference and return a pointer to it.
  *
  * Called with lock->wait_lock held.
  */
@@ -1253,7 +1253,7 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
 /*
  * Slow path to release a rt-mutex:
  */
-static void __sched
+static bool __sched
 rt_mutex_slowunlock(struct rt_mutex *lock)
 {
 	raw_spin_lock(&lock->wait_lock);
@@ -1296,7 +1296,7 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
 	while (!rt_mutex_has_waiters(lock)) {
 		/* Drops lock->wait_lock ! */
 		if (unlock_rt_mutex_safe(lock) == true)
-			return;
+			return false;
 		/* Relock the rtmutex and try again */
 		raw_spin_lock(&lock->wait_lock);
 	}
@@ -1309,8 +1309,7 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
 
 	raw_spin_unlock(&lock->wait_lock);
 
-	/* Undo pi boosting if necessary: */
-	rt_mutex_adjust_prio(current);
+	return true;
 }
 
 /*
@@ -1361,12 +1360,14 @@ rt_mutex_fasttrylock(struct rt_mutex *lock,
 
 static inline void
 rt_mutex_fastunlock(struct rt_mutex *lock,
-		    void (*slowfn)(struct rt_mutex *lock))
+		    bool (*slowfn)(struct rt_mutex *lock))
 {
-	if (likely(rt_mutex_cmpxchg(lock, current, NULL)))
+	if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
 		rt_mutex_deadlock_account_unlock(current);
-	else
-		slowfn(lock);
+	} else if (slowfn(lock)) {
+		/* Undo pi boosting if necessary: */
+		rt_mutex_adjust_prio(current);
+	}
 }
 
 /**
@@ -1461,6 +1462,22 @@ void __sched rt_mutex_unlock(struct rt_mutex *lock)
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 
 /**
+ * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
+ * @lock: the rt_mutex to be unlocked
+ *
+ * Returns: true/false indicating whether priority adjustment is
+ * required or not.
+ */
+bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock)
+{
+	if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
+		rt_mutex_deadlock_account_unlock(current);
+		return false;
+	}
+	return rt_mutex_slowunlock(lock);
+}
+
+/**
  * rt_mutex_destroy - mark a mutex unusable
  * @lock: the mutex to be destroyed
  *
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 855212501407..1bcf43099b16 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -132,6 +132,10 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 				      struct rt_mutex_waiter *waiter);
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
 
+extern bool rt_mutex_futex_unlock(struct rt_mutex *lock);
+
+extern void rt_mutex_adjust_prio(struct task_struct *task);
+
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # include "rtmutex-debug.h"
 #else
-- 
2.1.4


* [PATCH 2/3] futex: avoid double wake up in futex_wake() on -RT
  2015-04-07 15:03 improve futex on -RT by avoiding the double wake-up Sebastian Andrzej Siewior
  2015-04-07 15:03 ` [PATCH 1/3] futex: avoid double wake up in PI futex wait / wake on -RT Sebastian Andrzej Siewior
@ 2015-04-07 15:03 ` Sebastian Andrzej Siewior
  2015-04-07 19:47   ` Thomas Gleixner
  2015-04-07 15:03 ` [PATCH 3/3] ipc/mqueue: remove STATE_PENDING Sebastian Andrzej Siewior
  2 siblings, 1 reply; 27+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-04-07 15:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Darren Hart,
	Steven Rostedt, fredrik.markstrom, Davidlohr Bueso,
	Manfred Spraul, Sebastian Andrzej Siewior

futex_wake() wakes the waiter while holding the hb->lock. This leads to
a similar double wake up on -RT if the waiter has a higher priority than
the process performing the wake up.

This patch allocates space for one task and lets futex_wake() wake up
the task once the hb->lock has been dropped. If futex_wake() is used to
wake up more than one task the behaviour remains unchanged. With a
single waiter, however, we avoid the double wake up on -RT.
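
The core of the change, schematically (a sketch built from the helpers in
the diff below, not the complete function):

	/* Before: the wakeup happens with hb->lock held; on -RT a higher
	 * priority waiter preempts us and boosts us for hb->lock. */
	spin_lock(&hb->lock);
	wake_futex(this);		/* wake_up_state() under the lock */
	spin_unlock(&hb->lock);

	/* After (single-waiter case): detach under the lock, wake later. */
	spin_lock(&hb->lock);
	waiter = __wake_futex(this);	/* holds a task ref, lock_ptr = NULL */
	spin_unlock(&hb->lock);
	if (waiter) {
		wake_up_state(waiter, TASK_NORMAL);
		put_task_struct(waiter);
	}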

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex.c | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index b38abe3573a8..f3e633e16973 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1088,16 +1088,12 @@ static void __unqueue_futex(struct futex_q *q)
 	hb_waiters_dec(hb);
 }
 
-/*
- * The hash bucket lock must be held when this is called.
- * Afterwards, the futex_q must not be accessed.
- */
-static void wake_futex(struct futex_q *q)
+static struct task_struct *__wake_futex(struct futex_q *q)
 {
 	struct task_struct *p = q->task;
 
 	if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n"))
-		return;
+		return NULL;
 
 	/*
 	 * We set q->lock_ptr = NULL _before_ we wake up the task. If
@@ -1117,6 +1113,19 @@ static void wake_futex(struct futex_q *q)
 	 */
 	smp_wmb();
 	q->lock_ptr = NULL;
+	return p;
+}
+
+/*
+ * The hash bucket lock must be held when this is called.
+ * Afterwards, the futex_q must not be accessed.
+ */
+static void wake_futex(struct futex_q *q)
+{
+	struct task_struct *p = __wake_futex(q);
+
+	if (!p)
+		return;
 
 	wake_up_state(p, TASK_NORMAL);
 	put_task_struct(p);
@@ -1228,6 +1237,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
 	union futex_key key = FUTEX_KEY_INIT;
+	struct task_struct *waiter = NULL;
 	int ret;
 
 	if (!bitset)
@@ -1256,14 +1266,23 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 			if (!(this->bitset & bitset))
 				continue;
 
-			wake_futex(this);
+			if (nr_wake == 1)
+				waiter = __wake_futex(this);
+			else
+				wake_futex(this);
 			if (++ret >= nr_wake)
 				break;
 		}
 	}
 
 	spin_unlock(&hb->lock);
+
 out_put_key:
+	if (waiter) {
+		wake_up_state(waiter, TASK_NORMAL);
+		put_task_struct(waiter);
+	}
+
 	put_futex_key(&key);
 out:
 	return ret;
-- 
2.1.4


* [PATCH 3/3] ipc/mqueue: remove STATE_PENDING
  2015-04-07 15:03 improve futex on -RT by avoiding the double wake-up Sebastian Andrzej Siewior
  2015-04-07 15:03 ` [PATCH 1/3] futex: avoid double wake up in PI futex wait / wake on -RT Sebastian Andrzej Siewior
  2015-04-07 15:03 ` [PATCH 2/3] futex: avoid double wake up in futex_wake() " Sebastian Andrzej Siewior
@ 2015-04-07 15:03 ` Sebastian Andrzej Siewior
  2015-04-07 17:48   ` Manfred Spraul
  2 siblings, 1 reply; 27+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-04-07 15:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Darren Hart,
	Steven Rostedt, fredrik.markstrom, Davidlohr Bueso,
	Manfred Spraul, Sebastian Andrzej Siewior

This patch moves the wake_up_process() invocation so it is not done under
the info->lock. With this change, the waiter is woken up once it is
"ready", which means its state is STATE_READY, and it no longer needs to
loop on SMP while it is still in STATE_PENDING.
In the timeout case we still need to grab the info->lock to verify the state.

This change should also avoid the introduction of preempt_disable() in -RT,
which would otherwise be required to prevent a busy loop polling for the
STATE_PENDING -> STATE_READY transition when the waiter has a higher
priority than the waker.
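
The resulting handover order on the send side, schematically (a sketch of
the ordering established by the diff below, not the complete code):

	spin_lock(&info->lock);
	receiver->msg = message;	/* hand over the message */
	r_task = receiver->task;
	get_task_struct(r_task);	/* keep the task alive past the unlock */
	smp_wmb();			/* order ->msg store before ->state */
	receiver->state = STATE_READY;	/* receiver may now proceed on its own */
	spin_unlock(&info->lock);

	if (r_task) {
		wake_up_process(r_task);	/* harmless if the receiver already left */
		put_task_struct(r_task);
	}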

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 ipc/mqueue.c | 45 ++++++++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 7635a1cf99f3..95d179ed0923 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -47,7 +47,6 @@
 #define RECV		1
 
 #define STATE_NONE	0
-#define STATE_PENDING	1
 #define STATE_READY	2
 
 struct posix_msg_tree_node {
@@ -577,9 +576,6 @@ static int wq_sleep(struct mqueue_inode_info *info, int sr,
 		time = schedule_hrtimeout_range_clock(timeout, 0,
 			HRTIMER_MODE_ABS, CLOCK_REALTIME);
 
-		while (ewp->state == STATE_PENDING)
-			cpu_relax();
-
 		if (ewp->state == STATE_READY) {
 			retval = 0;
 			goto out;
@@ -909,9 +905,8 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
  * bypasses the message array and directly hands the message over to the
  * receiver.
  * The receiver accepts the message and returns without grabbing the queue
- * spinlock. Therefore an intermediate STATE_PENDING state and memory barriers
- * are necessary. The same algorithm is used for sysv semaphores, see
- * ipc/sem.c for more details.
+ * spinlock. The same algorithm is used for sysv semaphores, see ipc/sem.c
+ * for more details.
  *
  * The same algorithm is used for senders.
  */
@@ -919,36 +914,41 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
 /* pipelined_send() - send a message directly to the task waiting in
  * sys_mq_timedreceive() (without inserting message into a queue).
  */
-static inline void pipelined_send(struct mqueue_inode_info *info,
+static struct task_struct *pipelined_send(struct mqueue_inode_info *info,
 				  struct msg_msg *message,
 				  struct ext_wait_queue *receiver)
 {
+	struct task_struct *r_task;
+
 	receiver->msg = message;
 	list_del(&receiver->list);
-	receiver->state = STATE_PENDING;
-	wake_up_process(receiver->task);
+	r_task = receiver->task;
+	get_task_struct(r_task);
 	smp_wmb();
 	receiver->state = STATE_READY;
+	return r_task;
 }
 
 /* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
  * gets its message and put to the queue (we have one free place for sure). */
-static inline void pipelined_receive(struct mqueue_inode_info *info)
+static struct task_struct *pipelined_receive(struct mqueue_inode_info *info)
 {
+	struct task_struct *r_sender;
 	struct ext_wait_queue *sender = wq_get_first_waiter(info, SEND);
 
 	if (!sender) {
 		/* for poll */
 		wake_up_interruptible(&info->wait_q);
-		return;
+		return NULL;
 	}
 	if (msg_insert(sender->msg, info))
-		return;
+		return NULL;
 	list_del(&sender->list);
-	sender->state = STATE_PENDING;
-	wake_up_process(sender->task);
+	r_sender = sender->task;
+	get_task_struct(r_sender);
 	smp_wmb();
 	sender->state = STATE_READY;
+	return r_sender;
 }
 
 SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr,
@@ -961,6 +961,7 @@ SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr,
 	struct ext_wait_queue *receiver;
 	struct msg_msg *msg_ptr;
 	struct mqueue_inode_info *info;
+	struct task_struct *r_task = NULL;
 	ktime_t expires, *timeout = NULL;
 	struct timespec ts;
 	struct posix_msg_tree_node *new_leaf = NULL;
@@ -1049,7 +1050,7 @@ SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr,
 	} else {
 		receiver = wq_get_first_waiter(info, RECV);
 		if (receiver) {
-			pipelined_send(info, msg_ptr, receiver);
+			r_task = pipelined_send(info, msg_ptr, receiver);
 		} else {
 			/* adds message to the queue */
 			ret = msg_insert(msg_ptr, info);
@@ -1062,6 +1063,10 @@ SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr,
 	}
 out_unlock:
 	spin_unlock(&info->lock);
+	if (r_task) {
+		wake_up_process(r_task);
+		put_task_struct(r_task);
+	}
 out_free:
 	if (ret)
 		free_msg(msg_ptr);
@@ -1149,14 +1154,20 @@ SYSCALL_DEFINE5(mq_timedreceive, mqd_t, mqdes, char __user *, u_msg_ptr,
 			msg_ptr = wait.msg;
 		}
 	} else {
+		struct task_struct *r_sender;
+
 		msg_ptr = msg_get(info);
 
 		inode->i_atime = inode->i_mtime = inode->i_ctime =
 				CURRENT_TIME;
 
 		/* There is now free space in queue. */
-		pipelined_receive(info);
+		r_sender = pipelined_receive(info);
 		spin_unlock(&info->lock);
+		if (r_sender) {
+			wake_up_process(r_sender);
+			put_task_struct(r_sender);
+		}
 		ret = 0;
 	}
 	if (ret == 0) {
-- 
2.1.4


* Re: [PATCH 3/3] ipc/mqueue: remove STATE_PENDING
  2015-04-07 15:03 ` [PATCH 3/3] ipc/mqueue: remove STATE_PENDING Sebastian Andrzej Siewior
@ 2015-04-07 17:48   ` Manfred Spraul
  2015-04-07 18:28     ` Thomas Gleixner
  2015-04-10 14:37     ` [PATCH v2] " Sebastian Andrzej Siewior
  0 siblings, 2 replies; 27+ messages in thread
From: Manfred Spraul @ 2015-04-07 17:48 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Darren Hart,
	Steven Rostedt, fredrik.markstrom, Davidlohr Bueso

On 04/07/2015 05:03 PM, Sebastian Andrzej Siewior wrote:
> This patch moves the wake_up_process() invocation so it is not done under
> the info->lock. With this change, the waiter is woken up once it is
> "ready", which means its state is STATE_READY, and it no longer needs to
> loop on SMP while it is still in STATE_PENDING.
> In the timeout case we still need to grab the info->lock to verify the state.
>
> This change should also avoid the introduction of preempt_disable() in -RT,
> which would otherwise be required to prevent a busy loop polling for the
> STATE_PENDING -> STATE_READY transition when the waiter has a higher
> priority than the waker.

> @@ -909,9 +905,8 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
>    * bypasses the message array and directly hands the message over to the
>    * receiver.
>    * The receiver accepts the message and returns without grabbing the queue
> - * spinlock. Therefore an intermediate STATE_PENDING state and memory barriers
> - * are necessary. The same algorithm is used for sysv semaphores, see
> - * ipc/sem.c for more details.
> + * spinlock. The same algorithm is used for sysv semaphores, see ipc/sem.c
> + * for more details.
No. With your change, ipc/sem.c and ipc/mqueue.c use different algorithms.
Please update the comment and describe the new approach:

Current approach:
- set pointer to message
- STATE_PENDING
- wake_up_process()
- STATE_READY
     (now the receiver can continue)

New approach:
- set pointer to message
- get_task_struct
- STATE_READY
     (now the receiver can continue, e.g. woken up due to an unrelated 
SIGKILL)
- wake_up_process()
- put_task_struct()


> +		if (r_sender) {
> +			wake_up_process(r_sender);
> +			put_task_struct(r_sender);
> +		}
>   		ret = 0;
Could you double-check that it is safe to call wake_up_process on a 
killed and reaped thread, only with a get_task_struct reference?

And: please test it, too. (patch the kernel so that you can trigger this 
case).

--
     Manfred


* Re: [PATCH 3/3] ipc/mqueue: remove STATE_PENDING
  2015-04-07 17:48   ` Manfred Spraul
@ 2015-04-07 18:28     ` Thomas Gleixner
  2015-04-10 14:37     ` [PATCH v2] " Sebastian Andrzej Siewior
  1 sibling, 0 replies; 27+ messages in thread
From: Thomas Gleixner @ 2015-04-07 18:28 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Sebastian Andrzej Siewior, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Darren Hart, Steven Rostedt, fredrik.markstrom,
	Davidlohr Bueso

On Tue, 7 Apr 2015, Manfred Spraul wrote:
> On 04/07/2015 05:03 PM, Sebastian Andrzej Siewior wrote:
> > + * spinlock. The same algorithm is used for sysv semaphores, see ipc/sem.c
> > + * for more details.
> > No. With your change, ipc/sem.c and ipc/mqueue.c use different algorithms.
> Please update the comment and describe the new approach:
> 
> Current approach:
> - set pointer to message
> - STATE_PENDING
> - wake_up_process()
> - STATE_READY
>     (now the receiver can continue)
> 
> New approach:
> - set pointer to message
> - get_task_struct
> - STATE_READY
>     (now the receiver can continue, e.g. woken up due to an unrelated
> SIGKILL)
> - wake_up_process()
> - put_task_struct()
> 
> 
> > +		if (r_sender) {
> > +			wake_up_process(r_sender);
> > +			put_task_struct(r_sender);
> > +		}
> >   		ret = 0;
> Could you double-check that it is safe to call wake_up_process on a killed
> and reaped thread, only with a get_task_struct reference?

Yes. It is safe to call wake_up_process() on a dead thread if you hold
a ref.

wake_up_process()
  return try_to_wake_up(p, TASK_NORMAL, 0);
 
try_to_wake_up()
  raw_spin_lock_irqsave(&p->pi_lock, flags);
  if (!(p->state & state))
    goto out;

TASK_NORMAL == (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

That makes try_to_wake_up() a NOOP on a task with state TASK_DEAD. We
have quite some code in the kernel which relies on this.
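
As a pattern (a schematic restatement of the above, not new kernel code):

	get_task_struct(p);	/* the ref keeps the task_struct valid */
	/* ... p may exit and be reaped here; p->state becomes TASK_DEAD ... */
	wake_up_process(p);	/* NOOP if !(p->state & TASK_NORMAL), e.g. TASK_DEAD */
	put_task_struct(p);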

Thanks,

	tglx

* Re: [PATCH 1/3] futex: avoid double wake up in PI futex wait / wake on -RT
  2015-04-07 15:03 ` [PATCH 1/3] futex: avoid double wake up in PI futex wait / wake on -RT Sebastian Andrzej Siewior
@ 2015-04-07 18:41   ` Thomas Gleixner
  2015-04-10 14:42     ` [PATCH 1/3 v2] " Sebastian Andrzej Siewior
  0 siblings, 1 reply; 27+ messages in thread
From: Thomas Gleixner @ 2015-04-07 18:41 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Darren Hart,
	Steven Rostedt, fredrik.markstrom, Davidlohr Bueso,
	Manfred Spraul

On Tue, 7 Apr 2015, Sebastian Andrzej Siewior wrote:
>  /*
>   * Wake up the next waiter on the lock.
>   *
> - * Remove the top waiter from the current tasks pi waiter list and
> - * wake it up.
> + * Remove the top waiter from the current tasks pi waiter list take a
> + * reference and return a pointer to it.

That comment is stale from an earlier version of that patch. That
should be:

+ * Remove the top waiter from the current tasks pi waiter list,
+ * wake it up and return whether the current task needs to undo
+ * a potential priority boosting.

Thanks,

	tglx

* Re: [PATCH 2/3] futex: avoid double wake up in futex_wake() on -RT
  2015-04-07 15:03 ` [PATCH 2/3] futex: avoid double wake up in futex_wake() " Sebastian Andrzej Siewior
@ 2015-04-07 19:47   ` Thomas Gleixner
  2015-04-10 16:11     ` [PATCH 2/3 v2] " Sebastian Andrzej Siewior
  0 siblings, 1 reply; 27+ messages in thread
From: Thomas Gleixner @ 2015-04-07 19:47 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Darren Hart,
	Steven Rostedt, fredrik.markstrom, Davidlohr Bueso,
	Manfred Spraul

On Tue, 7 Apr 2015, Sebastian Andrzej Siewior wrote:
> futex_wake() wakes the waiter while holding the hb->lock. This leads to
> a similar double wake up on -RT if the waiter has a higher priority than
> the process performing the wake up.

Well, the non pi wakeup is designed in a way that the waiter side does
not take the hash bucket lock.

So if you observe hb->lock contention and the resulting PI
boosting dance on RT, then it's not the wakeup/waiter exit path. That
must be something like this:

T1   		  T2
wakeup()

  -->preemption

		  sys_exit(futex_wait)
		  ...
		  sys_enter(futex_wake);
		    lock(hb->lock);
 
> -/*
> - * The hash bucket lock must be held when this is called.
> - * Afterwards, the futex_q must not be accessed.
> - */

The comment should stay here, because you are not allowed to access
the futex_q after q->lock_ptr has been set to NULL and that happens in
this function.

> -static void wake_futex(struct futex_q *q)
> +static struct task_struct *__wake_futex(struct futex_q *q)
>  {
>  	struct task_struct *p = q->task;
>  
>  	if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n"))
> -		return;
> +		return NULL;
>  
>  	/*
>  	 * We set q->lock_ptr = NULL _before_ we wake up the task. If
> @@ -1117,6 +1113,19 @@ static void wake_futex(struct futex_q *q)
>  	 */
>  	smp_wmb();
>  	q->lock_ptr = NULL;
> +	return p;

...

> @@ -1256,14 +1266,23 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
>  			if (!(this->bitset & bitset))
>  				continue;
>  
> -			wake_futex(this);
> +			if (nr_wake == 1)
> +				waiter = __wake_futex(this);
> +			else
> +				wake_futex(this);
>  			if (++ret >= nr_wake)
>  				break;
>  		}
>  	}
>  
>  	spin_unlock(&hb->lock);
> +
>  out_put_key:
> +	if (waiter) {
> +		wake_up_state(waiter, TASK_NORMAL);
> +		put_task_struct(waiter);
> +	}

This should go before out_put_key, because none of the other code
paths which jump there can set waiter.

Thanks,

	tglx

* [PATCH v2] ipc/mqueue: remove STATE_PENDING
  2015-04-07 17:48   ` Manfred Spraul
  2015-04-07 18:28     ` Thomas Gleixner
@ 2015-04-10 14:37     ` Sebastian Andrzej Siewior
  2015-04-23 22:18       ` Thomas Gleixner
  1 sibling, 1 reply; 27+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-04-10 14:37 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Darren Hart, Steven Rostedt, fredrik.markstrom, Davidlohr Bueso

This patch moves the wake_up_process() invocation so it is not done under
the info->lock. With this change, the waiter is woken up once it is
"ready", which means its state is STATE_READY, and it no longer needs to
loop on SMP while it is still in STATE_PENDING.
In the timeout case we still need to grab the info->lock to verify the state.

This change should also avoid the introduction of preempt_disable() in -RT,
which would otherwise be required to prevent a busy loop polling for the
STATE_PENDING -> STATE_READY transition when the waiter has a higher
priority than the waker.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
v1…v2: update/correct the comment describing how the algorithm works now.

* Manfred Spraul | 2015-04-07 19:48:01 [+0200]:
>No. With your change, ipc/sem.c and ipc/mqueue.c use different algorithms.
>Please update the comment and describe the new approach:
>
>Current approach:
>- set pointer to message
>- STATE_PENDING
>- wake_up_process()
>- STATE_READY
>    (now the receiver can continue)
>
>New approach:
>- set pointer to message
>- get_task_struct
>- STATE_READY
>    (now the receiver can continue, e.g. woken up due to an unrelated
>SIGKILL)
>- wake_up_process()
>- put_task_struct()

Updated. Does it meet your expectations?

>
>>+		if (r_sender) {
>>+			wake_up_process(r_sender);
>>+			put_task_struct(r_sender);
>>+		}
>>  		ret = 0;
>Could you double-check that it is safe to call wake_up_process on a
>killed and reaped thread, only with a get_task_struct reference?
tglx answered that part.

>And: please test it, too. (patch the kernel so that you can trigger
>this case).

Why patch? Isn't this triggered if you have a reader waiting and you
send a message?

 ipc/mqueue.c | 51 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 34 insertions(+), 17 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 7635a1cf99f3..bfdc2219a470 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -47,7 +47,6 @@
 #define RECV		1
 
 #define STATE_NONE	0
-#define STATE_PENDING	1
 #define STATE_READY	2
 
 struct posix_msg_tree_node {
@@ -577,9 +576,6 @@ static int wq_sleep(struct mqueue_inode_info *info, int sr,
 		time = schedule_hrtimeout_range_clock(timeout, 0,
 			HRTIMER_MODE_ABS, CLOCK_REALTIME);
 
-		while (ewp->state == STATE_PENDING)
-			cpu_relax();
-
 		if (ewp->state == STATE_READY) {
 			retval = 0;
 			goto out;
@@ -909,9 +905,14 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
  * bypasses the message array and directly hands the message over to the
  * receiver.
  * The receiver accepts the message and returns without grabbing the queue
- * spinlock. Therefore an intermediate STATE_PENDING state and memory barriers
- * are necessary. The same algorithm is used for sysv semaphores, see
- * ipc/sem.c for more details.
+ * spinlock. The used algorithm is different from sysv semaphores (ipc/sem.c):
+ * - set pointer to message
+ * - hold a reference of the task to be woken up
+ * - update its state (to STATE_READY). Now the receiver can continue.
+ * - wake up the process after the lock is dropped. Should the process wake up
+ *   before this wakeup (due to a timeout or a signal) it will either see
+ *   STATE_READY and continue or acquire the lock to check the state again.
+ * - put the reference to the task to be woken up.
  *
  * The same algorithm is used for senders.
  */
@@ -919,36 +920,41 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
 /* pipelined_send() - send a message directly to the task waiting in
  * sys_mq_timedreceive() (without inserting message into a queue).
  */
-static inline void pipelined_send(struct mqueue_inode_info *info,
+static struct task_struct *pipelined_send(struct mqueue_inode_info *info,
 				  struct msg_msg *message,
 				  struct ext_wait_queue *receiver)
 {
+	struct task_struct *r_task;
+
 	receiver->msg = message;
 	list_del(&receiver->list);
-	receiver->state = STATE_PENDING;
-	wake_up_process(receiver->task);
+	r_task = receiver->task;
+	get_task_struct(r_task);
 	smp_wmb();
 	receiver->state = STATE_READY;
+	return r_task;
 }
 
 /* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
  * gets its message and put to the queue (we have one free place for sure). */
-static inline void pipelined_receive(struct mqueue_inode_info *info)
+static struct task_struct *pipelined_receive(struct mqueue_inode_info *info)
 {
+	struct task_struct *r_sender;
 	struct ext_wait_queue *sender = wq_get_first_waiter(info, SEND);
 
 	if (!sender) {
 		/* for poll */
 		wake_up_interruptible(&info->wait_q);
-		return;
+		return NULL;
 	}
 	if (msg_insert(sender->msg, info))
-		return;
+		return NULL;
 	list_del(&sender->list);
-	sender->state = STATE_PENDING;
-	wake_up_process(sender->task);
+	r_sender = sender->task;
+	get_task_struct(r_sender);
 	smp_wmb();
 	sender->state = STATE_READY;
+	return r_sender;
 }
 
 SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr,
@@ -961,6 +967,7 @@ SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr,
 	struct ext_wait_queue *receiver;
 	struct msg_msg *msg_ptr;
 	struct mqueue_inode_info *info;
+	struct task_struct *r_task = NULL;
 	ktime_t expires, *timeout = NULL;
 	struct timespec ts;
 	struct posix_msg_tree_node *new_leaf = NULL;
@@ -1049,7 +1056,7 @@ SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr,
 	} else {
 		receiver = wq_get_first_waiter(info, RECV);
 		if (receiver) {
-			pipelined_send(info, msg_ptr, receiver);
+			r_task = pipelined_send(info, msg_ptr, receiver);
 		} else {
 			/* adds message to the queue */
 			ret = msg_insert(msg_ptr, info);
@@ -1062,6 +1069,10 @@ SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr,
 	}
 out_unlock:
 	spin_unlock(&info->lock);
+	if (r_task) {
+		wake_up_process(r_task);
+		put_task_struct(r_task);
+	}
 out_free:
 	if (ret)
 		free_msg(msg_ptr);
@@ -1149,14 +1160,20 @@ SYSCALL_DEFINE5(mq_timedreceive, mqd_t, mqdes, char __user *, u_msg_ptr,
 			msg_ptr = wait.msg;
 		}
 	} else {
+		struct task_struct *r_sender;
+
 		msg_ptr = msg_get(info);
 
 		inode->i_atime = inode->i_mtime = inode->i_ctime =
 				CURRENT_TIME;
 
 		/* There is now free space in queue. */
-		pipelined_receive(info);
+		r_sender = pipelined_receive(info);
 		spin_unlock(&info->lock);
+		if (r_sender) {
+			wake_up_process(r_sender);
+			put_task_struct(r_sender);
+		}
 		ret = 0;
 	}
 	if (ret == 0) {
-- 
2.1.4


* [PATCH 1/3 v2] futex: avoid double wake up in PI futex wait / wake on -RT
  2015-04-07 18:41   ` Thomas Gleixner
@ 2015-04-10 14:42     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 27+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-04-10 14:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Darren Hart,
	Steven Rostedt, fredrik.markstrom, Davidlohr Bueso,
	Manfred Spraul

From: Thomas Gleixner <tglx@linutronix.de>
Date: Wed, 18 Feb 2015 20:17:31 +0100

The boosted priority is reverted after the unlock but before the
futex_hash_bucket (hb) has been accessed. The result is that we boost the
task, deboost the task, boost again for the hb lock, deboost again.
A sched trace of this scenario looks like the following:

| med_prio-93  sched_wakeup: comm=high_prio pid=92 prio=9 success=1 target_cpu=000
| med_prio-93  sched_switch: prev_comm=med_prio prev_pid=93 prev_prio=29 prev_state=R ==> next_comm=high_prio next_pid=92 next_prio=9
|high_prio-92  sched_pi_setprio: comm=low_prio pid=91 oldprio=120 newprio=9
|high_prio-92  sched_switch: prev_comm=high_prio prev_pid=92 prev_prio=9 prev_state=S ==> next_comm=low_prio next_pid=91 next_prio=9
| low_prio-91  sched_wakeup: comm=high_prio pid=92 prio=9 success=1 target_cpu=000
| low_prio-91  sched_pi_setprio: comm=low_prio pid=91 oldprio=9 newprio=120
| low_prio-91  sched_switch: prev_comm=low_prio prev_pid=91 prev_prio=120 prev_state=R+ ==> next_comm=high_prio next_pid=92 next_prio=9
|high_prio-92  sched_pi_setprio: comm=low_prio pid=91 oldprio=120 newprio=9
|high_prio-92  sched_switch: prev_comm=high_prio prev_pid=92 prev_prio=9 prev_state=D ==> next_comm=low_prio next_pid=91 next_prio=9
| low_prio-91  sched_wakeup: comm=high_prio pid=92 prio=9 success=1 target_cpu=000
| low_prio-91  sched_pi_setprio: comm=low_prio pid=91 oldprio=9 newprio=120
| low_prio-91  sched_switch: prev_comm=low_prio prev_pid=91 prev_prio=120 prev_state=R+ ==> next_comm=high_prio next_pid=92 next_prio=9

We see four sched_pi_setprio() invocations but ideally two would be enough.
The patch avoids the double wakeup by performing the wake up only once the
hb lock has been released. The same test case:

| med_prio-21  sched_wakeup: comm=high_prio pid=20 prio=9 success=1 target_cpu=000
| med_prio-21  sched_switch: prev_comm=med_prio prev_pid=21 prev_prio=29 prev_state=R ==> next_comm=high_prio next_pid=20 next_prio=9
|high_prio-20  sched_pi_setprio: comm=low_prio pid=19 oldprio=120 newprio=9
|high_prio-20  sched_switch: prev_comm=high_prio prev_pid=20 prev_prio=9 prev_state=S ==> next_comm=low_prio next_pid=19 next_prio=9
| low_prio-19  sched_wakeup: comm=high_prio pid=20 prio=9 success=1 target_cpu=000
| low_prio-19  sched_pi_setprio: comm=low_prio pid=19 oldprio=9 newprio=120
| low_prio-19  sched_switch: prev_comm=low_prio prev_pid=19 prev_prio=120 prev_state=R+ ==> next_comm=high_prio next_pid=20 next_prio=9

Only two sched_pi_setprio() invocations, as one would expect and as seen
without -RT.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
v1…v2: update the stale comment above the wakeup of the next waiter

 kernel/futex.c                  | 32 +++++++++++++++++++++++++++++---
 kernel/locking/rtmutex.c        | 40 +++++++++++++++++++++++++++++-----------
 kernel/locking/rtmutex_common.h |  4 ++++
 3 files changed, 62 insertions(+), 14 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 2579e407ff67..b38abe3573a8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1122,11 +1122,13 @@ static void wake_futex(struct futex_q *q)
 	put_task_struct(p);
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+			 struct futex_hash_bucket *hb)
 {
 	struct task_struct *new_owner;
 	struct futex_pi_state *pi_state = this->pi_state;
 	u32 uninitialized_var(curval), newval;
+	bool deboost;
 	int ret = 0;
 
 	if (!pi_state)
@@ -1178,7 +1180,17 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
 	raw_spin_unlock_irq(&new_owner->pi_lock);
 
 	raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
-	rt_mutex_unlock(&pi_state->pi_mutex);
+
+	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex);
+
+	/*
+	 * We deboost after dropping hb->lock. That prevents a double
+	 * wakeup on RT.
+	 */
+	spin_unlock(&hb->lock);
+
+	if (deboost)
+		rt_mutex_adjust_prio(current);
 
 	return 0;
 }
@@ -2412,13 +2424,26 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	 */
 	match = futex_top_waiter(hb, &key);
 	if (match) {
-		ret = wake_futex_pi(uaddr, uval, match);
+		ret = wake_futex_pi(uaddr, uval, match, hb);
+
+		/*
+		 * In case of success wake_futex_pi dropped the hash
+		 * bucket lock.
+		 */
+		if (!ret)
+			goto out_putkey;
+
 		/*
 		 * The atomic access to the futex value generated a
 		 * pagefault, so retry the user-access and the wakeup:
 		 */
 		if (ret == -EFAULT)
 			goto pi_faulted;
+
+		/*
+		 * wake_futex_pi has detected invalid state. Tell user
+		 * space.
+		 */
 		goto out_unlock;
 	}
 
@@ -2439,6 +2464,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 
 out_unlock:
 	spin_unlock(&hb->lock);
+out_putkey:
 	put_futex_key(&key);
 	return ret;
 
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index b73279367087..339082905492 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -298,7 +298,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
  * of task. We do not use the spin_xx_mutex() variants here as we are
  * outside of the debug path.)
  */
-static void rt_mutex_adjust_prio(struct task_struct *task)
+void rt_mutex_adjust_prio(struct task_struct *task)
 {
 	unsigned long flags;
 
@@ -955,8 +955,9 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 /*
  * Wake up the next waiter on the lock.
  *
- * Remove the top waiter from the current tasks pi waiter list and
- * wake it up.
+ * Remove the top waiter from the current tasks pi waiter list,
+ * wake it up and return whether the current task needs to undo
+ * a potential priority boosting.
  *
  * Called with lock->wait_lock held.
  */
@@ -1253,7 +1254,7 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
 /*
  * Slow path to release a rt-mutex:
  */
-static void __sched
+static bool __sched
 rt_mutex_slowunlock(struct rt_mutex *lock)
 {
 	raw_spin_lock(&lock->wait_lock);
@@ -1296,7 +1297,7 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
 	while (!rt_mutex_has_waiters(lock)) {
 		/* Drops lock->wait_lock ! */
 		if (unlock_rt_mutex_safe(lock) == true)
-			return;
+			return false;
 		/* Relock the rtmutex and try again */
 		raw_spin_lock(&lock->wait_lock);
 	}
@@ -1309,8 +1310,7 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
 
 	raw_spin_unlock(&lock->wait_lock);
 
-	/* Undo pi boosting if necessary: */
-	rt_mutex_adjust_prio(current);
+	return true;
 }
 
 /*
@@ -1361,12 +1361,14 @@ rt_mutex_fasttrylock(struct rt_mutex *lock,
 
 static inline void
 rt_mutex_fastunlock(struct rt_mutex *lock,
-		    void (*slowfn)(struct rt_mutex *lock))
+		    bool (*slowfn)(struct rt_mutex *lock))
 {
-	if (likely(rt_mutex_cmpxchg(lock, current, NULL)))
+	if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
 		rt_mutex_deadlock_account_unlock(current);
-	else
-		slowfn(lock);
+	} else if (slowfn(lock)) {
+		/* Undo pi boosting if necessary: */
+		rt_mutex_adjust_prio(current);
+	}
 }
 
 /**
@@ -1461,6 +1463,22 @@ void __sched rt_mutex_unlock(struct rt_mutex *lock)
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 
 /**
+ * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
+ * @lock: the rt_mutex to be unlocked
+ *
+ * Returns: true/false indicating whether priority adjustment is
+ * required or not.
+ */
+bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock)
+{
+	if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
+		rt_mutex_deadlock_account_unlock(current);
+		return false;
+	}
+	return rt_mutex_slowunlock(lock);
+}
+
+/**
  * rt_mutex_destroy - mark a mutex unusable
  * @lock: the mutex to be destroyed
  *
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 855212501407..1bcf43099b16 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -132,6 +132,10 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 				      struct rt_mutex_waiter *waiter);
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
 
+extern bool rt_mutex_futex_unlock(struct rt_mutex *lock);
+
+extern void rt_mutex_adjust_prio(struct task_struct *task);
+
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # include "rtmutex-debug.h"
 #else
-- 
2.1.4


* [PATCH 2/3 v2] futex: avoid double wake up in futex_wake() on -RT
  2015-04-07 19:47   ` Thomas Gleixner
@ 2015-04-10 16:11     ` Sebastian Andrzej Siewior
  2015-04-13  3:02       ` Davidlohr Bueso
  0 siblings, 1 reply; 27+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-04-10 16:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Darren Hart,
	Steven Rostedt, fredrik.markstrom, Davidlohr Bueso,
	Manfred Spraul

futex_wake() wakes the waiter while holding the hb->lock. The waiter
does not take the hb->lock and can leave the kernel. However, the next
futex operation on the same futex will map to the same hb->lock, and we
see a small dance around the lock including prio-boosting and a context
switch:

low prio task FUTEX_WAKE on high prio
| ft-1489  [000] ....1..    81.167501: sys_enter: sys_futex (8049f60, 1, 1, 0, 0, 0)
| ft-1489  [000] dN..311    81.167504: sched_wakeup: pid=1490 prio=94
| ft-1489  [000] d...311    81.167520: sched_switch: prev_pid=1489 prev_prio=120 prev_state=R+ ==> next_pid=1490 next_prio=94
| ft-1490  [000] ....1..    81.167522: sys_exit: sys_futex = 0

high prio task FUTEX_WAKE on low prio
| ft-1490  [000] ....1..    81.167528: sys_enter: sys_futex (8049f5c, 1, 1, 0, 0, 0)
| ft-1490  [000] ....1..    81.167530: sys_exit: sys_futex = 0

high prio task waits in FUTEX_WAIT, hb->lock still owned by low prio task
| ft-1490  [000] ....1..    81.167534: sys_enter: sys_futex (8049f60, 0, 1, 0, 0, 0)
| ft-1490  [000] d...411    81.167895: sched_pi_setprio: pid=1489 oldprio=120 newprio=94
| ft-1490  [000] d...311    81.167901: sched_switch: prev_pid=1490 prev_prio=94 prev_state=D ==> next_pid=1489 next_prio=94
| ft-1489  [000] d...411    81.167910: sched_wakeup: pid=1490 prio=94
| ft-1489  [000] d...311    81.167912: sched_pi_setprio: pid=1489 oldprio=94 newprio=120
| ft-1489  [000] d...311    81.167915: sched_switch: prev_pid=1489 prev_prio=120 prev_state=R+ ==> next_pid=1490 next_prio=94
| ft-1490  [000] d...3..    81.167922: sched_switch: prev_pid=1490 prev_prio=94 prev_state=S ==> next_pid=1489 next_prio=120
| ft-1489  [000] ....1..    81.167924: sys_exit: sys_futex = 1

This patch delays the wakeup of the process untill the hb->lock is
dropped to avoid boosting + context switch to obtain the lock.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
v1…v2:
    - update patch description
    - move the comment to __wake_futex()
    - move the wakeup before the out_put_key label in futex_wake()

 kernel/futex.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index b38abe3573a8..658f4d05cd6f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1092,12 +1092,12 @@ static void __unqueue_futex(struct futex_q *q)
  * The hash bucket lock must be held when this is called.
  * Afterwards, the futex_q must not be accessed.
  */
-static void wake_futex(struct futex_q *q)
+static struct task_struct *__wake_futex(struct futex_q *q)
 {
 	struct task_struct *p = q->task;
 
 	if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n"))
-		return;
+		return NULL;
 
 	/*
 	 * We set q->lock_ptr = NULL _before_ we wake up the task. If
@@ -1117,6 +1117,15 @@ static void wake_futex(struct futex_q *q)
 	 */
 	smp_wmb();
 	q->lock_ptr = NULL;
+	return p;
+}
+
+static void wake_futex(struct futex_q *q)
+{
+	struct task_struct *p = __wake_futex(q);
+
+	if (!p)
+		return;
 
 	wake_up_state(p, TASK_NORMAL);
 	put_task_struct(p);
@@ -1228,6 +1237,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
 	union futex_key key = FUTEX_KEY_INIT;
+	struct task_struct *waiter = NULL;
 	int ret;
 
 	if (!bitset)
@@ -1256,13 +1266,21 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 			if (!(this->bitset & bitset))
 				continue;
 
-			wake_futex(this);
+			if (nr_wake == 1)
+				waiter = __wake_futex(this);
+			else
+				wake_futex(this);
 			if (++ret >= nr_wake)
 				break;
 		}
 	}
 
 	spin_unlock(&hb->lock);
+	if (waiter) {
+		wake_up_state(waiter, TASK_NORMAL);
+		put_task_struct(waiter);
+	}
+
 out_put_key:
 	put_futex_key(&key);
 out:
-- 
2.1.4


* Re: [PATCH 2/3 v2] futex: avoid double wake up in futex_wake() on -RT
  2015-04-10 16:11     ` [PATCH 2/3 v2] " Sebastian Andrzej Siewior
@ 2015-04-13  3:02       ` Davidlohr Bueso
  2015-04-16  5:09         ` Davidlohr Bueso
  0 siblings, 1 reply; 27+ messages in thread
From: Davidlohr Bueso @ 2015-04-13  3:02 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Darren Hart, Steven Rostedt, fredrik.markstrom, Manfred Spraul

On Fri, 2015-04-10 at 18:11 +0200, Sebastian Andrzej Siewior wrote:
> This patch delays the wakeup of the process untill
						^^^ until

>  the hb->lock is
> dropped to avoid boosting + context switch to obtain the lock.

Doing the wakeups while holding the lock is also a general performance
issue for futex_wake. The problem is then dealing with spurious wakeups
(wacky drivers), which makes no difference wrt nr_wake.
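
For reference, tolerating spurious wakeups on the waiter side amounts to
re-checking the condition in a loop; a minimal userspace sketch (the raw
syscall wrapper is an assumption here, glibc provides no futex() wrapper):

#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex_wait(unsigned int *uaddr, unsigned int val)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAIT_PRIVATE, val, NULL, NULL, 0);
}

/* Block until *f becomes non-zero. Every return from FUTEX_WAIT,
 * spurious or not, re-checks the condition before proceeding. */
static void wait_for_nonzero(unsigned int *f)
{
	while (__atomic_load_n(f, __ATOMIC_ACQUIRE) == 0) {
		if (futex_wait(f, 0) == -1 && errno != EAGAIN && errno != EINTR)
			break;	/* real error: give up rather than spin */
	}
}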

Thanks,
Davidlohr


* Re: [PATCH 2/3 v2] futex: avoid double wake up in futex_wake() on -RT
  2015-04-13  3:02       ` Davidlohr Bueso
@ 2015-04-16  5:09         ` Davidlohr Bueso
  2015-04-16  9:19           ` Thomas Gleixner
  0 siblings, 1 reply; 27+ messages in thread
From: Davidlohr Bueso @ 2015-04-16  5:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Darren Hart, Steven Rostedt, fredrik.markstrom, Manfred Spraul,
	Arnaldo Carvalho de Melo

On Sun, 2015-04-12 at 20:02 -0700, Davidlohr Bueso wrote:
> Doing the wakeups while holding the lock is also a general performance
> issue for futex_wake. The problem being dealing with spurious wakeups
> (wacky drivers), which makes no difference wrt nr_wake.

So I did some measurements with the patch below (Cc'ing Arnaldo for
perf-bench consideration, albeit probably still pretty crude) and by
doing the lockless wakeups, on avg we reduce contending waking threads'
latency by about 2x for each thread, which indicates that the overall
speedup scales with the number of futex_wake'ers.

I guess now we have the code, the numbers. I go back to auditing drivers
*sigh*. In any case any important core-code already deals with spurious
wakeups (the last silly offender being sysv sems), so I'm really not
_that_ concerned -- in fact, Peter, your patch to trigger them seems to
not trigger any issues anymore. But perhaps it's late and I'm in lala
land.

Thanks,
Davidlohr

8<-------------------------------------------------------------
[PATCH] perf-bench/futex: Support parallel wakeup threads

The futex-wake benchmark only measures wakeups done within
a single process. While this has value in its own, it does
not really generate any hb->lock contention. A new -W option
is added, specifying concurrent waker threads. Naturally,
different combinations of numbers of blocking and waker
threads will exhibit different information.

TODO: Improve verbosity of run-to-run stats in parallel.

Kinda-Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 tools/perf/bench/futex-wake.c | 237 ++++++++++++++++++++++++++++++------------
 1 file changed, 173 insertions(+), 64 deletions(-)

diff --git a/tools/perf/bench/futex-wake.c b/tools/perf/bench/futex-wake.c
index 929f762..67014a2 100644
--- a/tools/perf/bench/futex-wake.c
+++ b/tools/perf/bench/futex-wake.c
@@ -1,11 +1,11 @@
 /*
- * Copyright (C) 2013  Davidlohr Bueso <davidlohr@hp.com>
+ * Copyright (C) 2013, 2015  Davidlohr Bueso.
  *
  * futex-wake: Block a bunch of threads on a futex and wake'em up, N at a time.
  *
  * This program is particularly useful to measure the latency of nthread wakeups
- * in non-error situations:  all waiters are queued and all wake calls wakeup
- * one or more tasks, and thus the waitqueue is never empty.
+ * in non-error situations:  all waiters are queued and all futex_wake calls
+ * wakeup one or more tasks, and thus the waitqueue is never empty.
  */
 
 #include "../perf.h"
@@ -21,7 +21,7 @@
 #include <sys/time.h>
 #include <pthread.h>
 
-/* all threads will block on the same futex */
+/* all threads will block on the same futex -- hash bucket chaos ;) */
 static u_int32_t futex1 = 0;
 
 /*
@@ -30,16 +30,27 @@ static u_int32_t futex1 = 0;
  */
 static unsigned int nwakes = 1;
 
-pthread_t *worker;
-static bool done = false, silent = false, fshared = false;
+static pthread_t *blocked_worker, *waking_worker;
+static bool done = false, silent = false;
+static bool fshared = false;
 static pthread_mutex_t thread_lock;
 static pthread_cond_t thread_parent, thread_worker;
-static struct stats waketime_stats, wakeup_stats;
-static unsigned int ncpus, threads_starting, nthreads = 0;
+static unsigned int ncpus, threads_starting;
+static struct stats *waketime_stats, *wakeup_stats;
+static unsigned int nblocked_threads = 0, nwaking_threads = 0;
+static unsigned int repeat = 0, wakeup_limit = 1;
+static unsigned int *idxp;
 static int futex_flag = 0;
 
+#define parallel (wakeup_limit > 1 || nwaking_threads > 1)
+
 static const struct option options[] = {
-	OPT_UINTEGER('t', "threads", &nthreads, "Specify amount of threads"),
+	OPT_UINTEGER('t', "threads", &nblocked_threads, "Specify amount of threads"),
+	/*
+	 * The difference between w and W is that W threads will perform w wakeups.
+	 * If -W is not passed, the program defaults to 1 waking thread.
+	 */
+	OPT_UINTEGER('W', "nwakers", &nwaking_threads, "Specify amount of waking threads (default 1)"),
 	OPT_UINTEGER('w', "nwakes",  &nwakes,   "Specify amount of threads to wake at once"),
 	OPT_BOOLEAN( 's', "silent",  &silent,   "Silent mode: do not display data/details"),
 	OPT_BOOLEAN( 'S', "shared",  &fshared,  "Use shared futexes instead of private ones"),
@@ -51,7 +62,46 @@ static const char * const bench_futex_wake_usage[] = {
 	NULL
 };
 
-static void *workerfn(void *arg __maybe_unused)
+static void print_summary(void)
+{
+	unsigned int i, wakeup_avg;
+	double waketime_avg, waketime_stddev;
+	struct stats __waketime_stats, __wakeup_stats;
+
+	for (i = 0; i < wakeup_limit; i++) {
+		double waketime = avg_stats(&waketime_stats[i]);
+		unsigned int wakeup = avg_stats(&wakeup_stats[i]);
+
+		update_stats(&__wakeup_stats, wakeup);
+		update_stats(&__waketime_stats, waketime);
+	}
+
+	waketime_avg = avg_stats(&__waketime_stats);
+	waketime_stddev = stddev_stats(&__waketime_stats);
+	wakeup_avg = avg_stats(&__wakeup_stats);
+
+	printf("Wokeup %d of %d threads in %.4f ms (+-%.2f%%)\n",
+	       wakeup_avg * wakeup_limit,
+	       nblocked_threads,
+	       waketime_avg/1e3,
+	       rel_stddev_stats(waketime_stddev, waketime_avg));
+}
+
+static void print_banner(void)
+{
+	if (parallel)
+		printf("Run summary [PID %d]: blocking on %d threads (at [%s] "
+		       "futex %p), %d threads waking up %d at a time.\n\n",
+		       getpid(), nblocked_threads, fshared ? "shared":"private",
+		       &futex1, nwaking_threads, nwakes);
+	else
+		printf("Run summary [PID %d]: blocking on %d threads (at [%s] "
+		       "futex %p), waking up %d at a time.\n\n",
+		       getpid(), nblocked_threads, fshared ? "shared":"private",
+		       &futex1, nwakes);
+}
+
+static void *blocked_workerfn(void *arg __maybe_unused)
 {
 	pthread_mutex_lock(&thread_lock);
 	threads_starting--;
@@ -64,36 +114,63 @@ static void *workerfn(void *arg __maybe_unused)
 	return NULL;
 }
 
-static void print_summary(void)
-{
-	double waketime_avg = avg_stats(&waketime_stats);
-	double waketime_stddev = stddev_stats(&waketime_stats);
-	unsigned int wakeup_avg = avg_stats(&wakeup_stats);
-
-	printf("Wokeup %d of %d threads in %.4f ms (+-%.2f%%)\n",
-	       wakeup_avg,
-	       nthreads,
-	       waketime_avg/1e3,
-	       rel_stddev_stats(waketime_stddev, waketime_avg));
-}
-
 static void block_threads(pthread_t *w,
 			  pthread_attr_t thread_attr)
 {
 	cpu_set_t cpu;
 	unsigned int i;
 
-	threads_starting = nthreads;
+	threads_starting = nblocked_threads;
 
 	/* create and block all threads */
-	for (i = 0; i < nthreads; i++) {
+	for (i = 0; i < nblocked_threads; i++) {
 		CPU_ZERO(&cpu);
 		CPU_SET(i % ncpus, &cpu);
 
 		if (pthread_attr_setaffinity_np(&thread_attr, sizeof(cpu_set_t), &cpu))
 			err(EXIT_FAILURE, "pthread_attr_setaffinity_np");
 
-		if (pthread_create(&w[i], &thread_attr, workerfn, NULL))
+		if (pthread_create(&w[i], &thread_attr, blocked_workerfn, NULL))
+			err(EXIT_FAILURE, "pthread_create");
+	}
+}
+
+static void *waking_workerfn(void *arg)
+{
+	unsigned int nwoken = 0, idx = *(unsigned int *)arg;
+	unsigned int limit = parallel ? nwakes : nblocked_threads;
+	struct timeval start, end, runtime;
+
+	gettimeofday(&start, NULL);
+	while (nwoken != limit)
+		nwoken += futex_wake(&futex1, nwakes, futex_flag);
+
+	gettimeofday(&end, NULL);
+	timersub(&end, &start, &runtime);
+
+	update_stats(&wakeup_stats[idx], nwoken);
+	update_stats(&waketime_stats[idx], runtime.tv_usec);
+
+	if (!silent && !parallel) {
+		printf("[Run %d]: Wokeup %d of %d threads in %.4f ms\n",
+		       repeat, nwoken, nblocked_threads, runtime.tv_usec/1e3);
+	}
+
+	pthread_exit(NULL);
+	return NULL;
+}
+
+static void wakeup_threads(pthread_t *w)
+{
+	unsigned int i, *idx;
+	idxp = calloc(wakeup_limit, sizeof(unsigned int));
+	if (!idxp)
+		errx(EXIT_FAILURE, "calloc");
+
+	/* create the waking threads */
+	for (i = 0; i < wakeup_limit; i++) {
+		*(idxp + i) = i;
+		if (pthread_create(&w[i], NULL, waking_workerfn, (void *)(idxp+i)))
 			err(EXIT_FAILURE, "pthread_create");
 	}
 }
@@ -105,11 +182,40 @@ static void toggle_done(int sig __maybe_unused,
 	done = true;
 }
 
+static void alloc_workers(void)
+{
+	waking_worker = calloc(nwaking_threads, sizeof(pthread_t));
+	if (!waking_worker)
+		err(EXIT_FAILURE, "calloc");
+
+	blocked_worker = calloc(nblocked_threads, sizeof(pthread_t));
+	if (!blocked_worker)
+		err(EXIT_FAILURE, "calloc");
+}
+
+static void alloc_init_stats(void)
+{
+	unsigned int i;
+
+	waketime_stats = calloc(wakeup_limit, sizeof(struct stats));
+	if (!waketime_stats)
+		err(EXIT_FAILURE, "calloc");
+
+	wakeup_stats = calloc(wakeup_limit, sizeof(struct stats));
+	if (!wakeup_stats)
+		err(EXIT_FAILURE, "calloc");
+
+	for (i = 0; i < wakeup_limit; i++) {
+		init_stats(&wakeup_stats[i]);
+		init_stats(&waketime_stats[i]);
+	}
+}
+
 int bench_futex_wake(int argc, const char **argv,
 		     const char *prefix __maybe_unused)
 {
 	int ret = 0;
-	unsigned int i, j;
+	unsigned int i;
 	struct sigaction act;
 	pthread_attr_t thread_attr;
 
@@ -125,35 +231,45 @@ int bench_futex_wake(int argc, const char **argv,
 	act.sa_sigaction = toggle_done;
 	sigaction(SIGINT, &act, NULL);
 
-	if (!nthreads)
-		nthreads = ncpus;
-
-	worker = calloc(nthreads, sizeof(*worker));
-	if (!worker)
-		err(EXIT_FAILURE, "calloc");
+	if (!nblocked_threads)
+		nblocked_threads = ncpus;
+
+	/* sanity checks */
+	if (parallel) {
+		if (nwaking_threads & 0x1)
+			nwaking_threads++;
+
+		if (nwaking_threads > nblocked_threads)
+			nwaking_threads = nblocked_threads;
+
+		if (nblocked_threads % nwaking_threads)
+			errx(EXIT_FAILURE, "nblocked_threads must be perfectly divisible by nwaking_threads");
+		/*
+		 * Each waking thread will wake up nwakes tasks in a single
+		 * futex_wake() call; e.g. with 8 blocked threads and 4
+		 * wakers, each waker wakes up 2 tasks per call.
+		 */
+		nwakes = nblocked_threads / nwaking_threads;
+		wakeup_limit = nwaking_threads;
+	}
 
 	if (!fshared)
 		futex_flag = FUTEX_PRIVATE_FLAG;
 
-	printf("Run summary [PID %d]: blocking on %d threads (at [%s] futex %p), "
-	       "waking up %d at a time.\n\n",
-	       getpid(), nthreads, fshared ? "shared":"private",  &futex1, nwakes);
+	alloc_workers();
+	alloc_init_stats();
+
+	print_banner();
 
-	init_stats(&wakeup_stats);
-	init_stats(&waketime_stats);
 	pthread_attr_init(&thread_attr);
 	pthread_mutex_init(&thread_lock, NULL);
 	pthread_cond_init(&thread_parent, NULL);
 	pthread_cond_init(&thread_worker, NULL);
 
-	for (j = 0; j < bench_repeat && !done; j++) {
-		unsigned int nwoken = 0;
-		struct timeval start, end, runtime;
-
+	for (repeat = 0; repeat < bench_repeat && !done; repeat++) {
 		/* create, launch & block all threads */
-		block_threads(worker, thread_attr);
+		block_threads(blocked_worker, thread_attr);
 
-		/* make sure all threads are already blocked */
+		/* ensure all threads are already blocked */
 		pthread_mutex_lock(&thread_lock);
 		while (threads_starting)
 			pthread_cond_wait(&thread_parent, &thread_lock);
@@ -162,37 +278,30 @@ int bench_futex_wake(int argc, const char **argv,
 
 		usleep(100000);
 
-		/* Ok, all threads are patiently blocked, start waking folks up */
-		gettimeofday(&start, NULL);
-		while (nwoken != nthreads)
-			nwoken += futex_wake(&futex1, nwakes, futex_flag);
-		gettimeofday(&end, NULL);
-		timersub(&end, &start, &runtime);
-
-		update_stats(&wakeup_stats, nwoken);
-		update_stats(&waketime_stats, runtime.tv_usec);
-
-		if (!silent) {
-			printf("[Run %d]: Wokeup %d of %d threads in %.4f ms\n",
-			       j + 1, nwoken, nthreads, runtime.tv_usec/1e3);
-		}
+		/* ok, all threads are patiently blocked, start waking folks up */
+		wakeup_threads(waking_worker);
+
+		/* reap the wakers before their index slots are freed */
+		for (i = 0; i < wakeup_limit; i++) {
+			ret = pthread_join(waking_worker[i], NULL);
+			if (ret)
+				err(EXIT_FAILURE, "pthread_join");
+		}
+		free(idxp);
 
-		for (i = 0; i < nthreads; i++) {
-			ret = pthread_join(worker[i], NULL);
+		/* we're done with this run */
+		for (i = 0; i < nblocked_threads; i++) {
+			ret = pthread_join(blocked_worker[i], NULL);
 			if (ret)
 				err(EXIT_FAILURE, "pthread_join");
 		}
-
 	}
 
-	/* cleanup & report results */
+	/* report and cleanup */
+	print_summary();
+
 	pthread_cond_destroy(&thread_parent);
 	pthread_cond_destroy(&thread_worker);
 	pthread_mutex_destroy(&thread_lock);
 	pthread_attr_destroy(&thread_attr);
 
-	print_summary();
+	free(waking_worker);
+	free(blocked_worker);
+	free(wakeup_stats);
+	free(waketime_stats);
 
-	free(worker);
 	return ret;
 }
-- 
2.1.4




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3 v2] futex: avoid double wake up in futex_wake() on -RT
  2015-04-16  5:09         ` Davidlohr Bueso
@ 2015-04-16  9:19           ` Thomas Gleixner
  2015-04-16 10:16             ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Thomas Gleixner @ 2015-04-16  9:19 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Sebastian Andrzej Siewior, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Darren Hart, Steven Rostedt, fredrik.markstrom,
	Manfred Spraul, Arnaldo Carvalho de Melo

On Wed, 15 Apr 2015, Davidlohr Bueso wrote:
> On Sun, 2015-04-12 at 20:02 -0700, Davidlohr Bueso wrote:
> > Doing the wakeups while holding the lock is also a general performance
> > issue for futex_wake. The problem being dealing with spurious wakeups
> > (wacky drivers), which makes no difference wrt nr_wake.
> 
> So I did some measurements with the patch below (Cc'ing Arnaldo for
> perf-bench consideration, albeit probably still pretty crude) and by
> doing the lockless wakeups, on avg we cut each contending waking
> thread's latency roughly in half, which indicates that the overall
> speedup scales with the number of futex_wake'ers.
> 
> I guess now we have the code and the numbers. I go back to auditing
> drivers *sigh*. In any case any important core code already deals with
> spurious wakeups (the last silly offender being sysv sems), so I'm
> really not _that_ concerned -- in fact, Peter, your patch to trigger
> them no longer triggers any issues. But perhaps it's late and I'm in
> lala land.

OTOH, we have quite a lot of other code in the kernel which can generate
spurious wakeups. Just look at signals.

CPU0				CPU1

T1 random_syscall()		
   schedule_interruptible()

				Send process-wide signal, wake T1
				because it's the first target
T2 do_stuff()
   handle_signal()
   schedule()

T1 Deal with the spurious wakeup
  
So any code which does not handle a spurious wakeup is broken
independent of the futex changes. So really nothing to worry about.
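
The canonical way to cope is the recheck loop every waiter is supposed
to use anyway; a generic sketch (with "condition" standing in for
whatever the caller actually waits on):

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (condition)			/* recheck the real wait condition */
			break;
		if (signal_pending(current))	/* interruptible sleepers honor signals */
			break;
		schedule();			/* a spurious wakeup just loops again */
	}
	__set_current_state(TASK_RUNNING);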

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3 v2] futex: avoid double wake up in futex_wake() on -RT
  2015-04-16  9:19           ` Thomas Gleixner
@ 2015-04-16 10:16             ` Peter Zijlstra
  2015-04-16 10:49               ` Thomas Gleixner
  2015-04-16 14:42               ` Davidlohr Bueso
  0 siblings, 2 replies; 27+ messages in thread
From: Peter Zijlstra @ 2015-04-16 10:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Davidlohr Bueso, Sebastian Andrzej Siewior, linux-kernel,
	Ingo Molnar, Darren Hart, Steven Rostedt, fredrik.markstrom,
	Manfred Spraul, Arnaldo Carvalho de Melo

On Thu, Apr 16, 2015 at 11:19:41AM +0200, Thomas Gleixner wrote:
> So any code which does not handle a spurious wakeup is broken
> independent of the futex changes. So really nothing to worry about.

Back when we did this:

 http://lkml.iu.edu/hypermail/linux/kernel/1109.1/01941.html
 http://lkml.iu.edu/hypermail/linux/kernel/1109.1/01943.html

Things came apart -- notably sysvsems.

And yes it's true that anything not dealing with spurious wakeups is
broken, but there's still quite a lot of broken code out there, although
I think we fixed all the really common ones.

But if we decide we want to go do this, I'd propose we reintroduce this
delayed wake list thing again.

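The shape of that pattern, for reference -- collect the wakees under the
lock, issue the wakeups after dropping it. A sketch using the wake_q
naming the mqueue patch later in this thread ends up with (the lock and
task here are placeholders):

	WAKE_Q(wake_q);

	spin_lock(&lock);
	/* ... decide, under the lock, which task needs waking ... */
	wake_q_add(&wake_q, task);
	spin_unlock(&lock);

	/*
	 * The actual wakeup happens after the unlock, so a woken
	 * high-prio task never immediately blocks on the lock that
	 * its waker still holds.
	 */
	wake_up_q(&wake_q);
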
^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3 v2] futex: avoid double wake up in futex_wake() on -RT
  2015-04-16 10:16             ` Peter Zijlstra
@ 2015-04-16 10:49               ` Thomas Gleixner
  2015-04-16 14:42               ` Davidlohr Bueso
  1 sibling, 0 replies; 27+ messages in thread
From: Thomas Gleixner @ 2015-04-16 10:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, Sebastian Andrzej Siewior, linux-kernel,
	Ingo Molnar, Darren Hart, Steven Rostedt, fredrik.markstrom,
	Manfred Spraul, Arnaldo Carvalho de Melo

On Thu, 16 Apr 2015, Peter Zijlstra wrote:

> On Thu, Apr 16, 2015 at 11:19:41AM +0200, Thomas Gleixner wrote:
> > So any code which does not handle a spurious wakeup is broken
> > independent of the futex changes. So really nothing to worry about.
> 
> Back when we did this:
> 
>  http://lkml.iu.edu/hypermail/linux/kernel/1109.1/01941.html
>  http://lkml.iu.edu/hypermail/linux/kernel/1109.1/01943.html
> 
> Things came apart -- notably sysvsems.
> 
> And yes it's true that anything not dealing with spurious wakeups is
> broken, but there's still quite a lot of broken code out there, although
> I think we fixed all the really common ones.
> 
> But if we decide we want to go do this, I'd propose we reintroduce this
> delayed wake list thing again.

Indeed. Forgot about that completely.

Thanks,

	tglx
 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3 v2] futex: avoid double wake up in futex_wake() on -RT
  2015-04-16 10:16             ` Peter Zijlstra
  2015-04-16 10:49               ` Thomas Gleixner
@ 2015-04-16 14:42               ` Davidlohr Bueso
  2015-04-16 15:54                 ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: Davidlohr Bueso @ 2015-04-16 14:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, linux-kernel,
	Ingo Molnar, Darren Hart, Steven Rostedt, fredrik.markstrom,
	Manfred Spraul, Arnaldo Carvalho de Melo

On Thu, 2015-04-16 at 12:16 +0200, Peter Zijlstra wrote:
> But if we decide we want to go do this, I'd propose we reintroduce this
> delayed wake list thing again.

Given that futexes aren't the only potential users, I definitely agree.
Lemme clean up the patches and I'll resend. Now, one thing I wonder
about is whether we should bother making the delayed wake list a plist
instead -- as not all users would consider rt-tasks the way futexes do.

Thanks,
Davidlohr

 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3 v2] futex: avoid double wake up in futex_wake() on -RT
  2015-04-16 14:42               ` Davidlohr Bueso
@ 2015-04-16 15:54                 ` Peter Zijlstra
  2015-04-16 16:22                   ` Davidlohr Bueso
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2015-04-16 15:54 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, linux-kernel,
	Ingo Molnar, Darren Hart, Steven Rostedt, fredrik.markstrom,
	Manfred Spraul, Arnaldo Carvalho de Melo

On Thu, Apr 16, 2015 at 07:42:55AM -0700, Davidlohr Bueso wrote:
> On Thu, 2015-04-16 at 12:16 +0200, Peter Zijlstra wrote:
> > But if we decide we want to go do this, I'd propose we reintroduce this
> > delayed wake list thing again.
> 
> Given that futexes aren't the only potential users, I definitely agree.
> Lemme clean up the patches and I'll resend. Now, one thing I wonder
> about is whether we should bother making the delayed wake list a plist
> instead -- as not all users would consider rt-tasks the way futexes do.

plists don't work and should not be used for tasks anymore. I suppose I
should go rip them out of futexes too. If you want to make the thing
priority-aware we should probably abstract the rb-tree from rtmutex.c



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3 v2] futex: avoid double wake up in futex_wake() on -RT
  2015-04-16 15:54                 ` Peter Zijlstra
@ 2015-04-16 16:22                   ` Davidlohr Bueso
  0 siblings, 0 replies; 27+ messages in thread
From: Davidlohr Bueso @ 2015-04-16 16:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, linux-kernel,
	Ingo Molnar, Darren Hart, Steven Rostedt, fredrik.markstrom,
	Manfred Spraul, Arnaldo Carvalho de Melo

On Thu, 2015-04-16 at 17:54 +0200, Peter Zijlstra wrote:
> On Thu, Apr 16, 2015 at 07:42:55AM -0700, Davidlohr Bueso wrote:
> > On Thu, 2015-04-16 at 12:16 +0200, Peter Zijlstra wrote:
> > > But if we decide we want to go do this, I'd propose we reintroduce this
> > > delayed wake list thing again.
> > 
> > Given that futexes aren't the only potential users, I definitely agree.
> > Lemme clean up the patches and I'll resend. Now, one thing I wonder
> > about is whether we should bother making the delayed wake list a plist
> > instead -- as not all users would consider rt-tasks the way futexes do.
> 
> plists don't work and should not be used for tasks anymore. I suppose I
> should go rip them out of futexes too. If you want to make the thing
> priority-aware we should probably abstract the rb-tree from rtmutex.c

Hmm yeah I noticed that, but a tree for this thing seems like overkill
imho. I mean, at least wrt futexes, I don't think there would ever be an
nr_wake value - within a single futex call, as we use the list on the
stack - large enough to even remotely justify the data structure.

Oh and that reminds me, we should update the rtmutex docs, which still
mention plist -- but replacing those nice ascii art lists with trees is
evil ;)

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] ipc/mqueue: remove STATE_PENDING
  2015-04-10 14:37     ` [PATCH v2] " Sebastian Andrzej Siewior
@ 2015-04-23 22:18       ` Thomas Gleixner
  2015-04-28  3:24         ` Davidlohr Bueso
  0 siblings, 1 reply; 27+ messages in thread
From: Thomas Gleixner @ 2015-04-23 22:18 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Manfred Spraul, LKML, Peter Zijlstra, Ingo Molnar, Darren Hart,
	Steven Rostedt, fredrik.markstrom, Davidlohr Bueso,
	Paul E. McKenney

On Fri, 10 Apr 2015, Sebastian Andrzej Siewior wrote:
> >And: please test it, too. (patch the kernel so that you can trigger
> >this case).
> 
> Why patch? Isn't this triggered if you have a reader waiting and you
> send a message?

Manfred referred to the exit race. Though you can spare that exercise
as it is safe by definition if you hold a ref on the task.

Can you please convert that over to Peter's lockless wake queues so we
do not reimplement the same thing open-coded here.

> +static struct task_struct *pipelined_send(struct mqueue_inode_info *info,
>  				  struct msg_msg *message,
>  				  struct ext_wait_queue *receiver)
>  {
> +	struct task_struct *r_task;
> +
>  	receiver->msg = message;
>  	list_del(&receiver->list);
> -	receiver->state = STATE_PENDING;
> -	wake_up_process(receiver->task);
> +	r_task = receiver->task;
> +	get_task_struct(r_task);
>  	smp_wmb();

While we are at it. The barrier here and the one in pipelined_receive
are not documented and they are missing a proper pairing on the read
side. The comment which you removed was pretty vague about the purpose
of the barrier as well.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] ipc/mqueue: remove STATE_PENDING
  2015-04-23 22:18       ` Thomas Gleixner
@ 2015-04-28  3:24         ` Davidlohr Bueso
  2015-04-28 12:37           ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Davidlohr Bueso @ 2015-04-28  3:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sebastian Andrzej Siewior, Manfred Spraul, LKML, Peter Zijlstra,
	Ingo Molnar, Darren Hart, Steven Rostedt, fredrik.markstrom,
	Paul E. McKenney

On Fri, 2015-04-24 at 00:18 +0200, Thomas Gleixner wrote:
> Can you please convert that over to Peters lockless wake queues so we
> do not reimplement the same thing open coded here.

So I'd like to include this in my v2 of the wake_q stuff, along with the
futex conversion. What do you guys think of the following?

Thanks,
Davidlohr

8<-------------------------------------------------------------
Subject: [PATCH] ipc/mqueue: lockless pipelined wakeups

This patch moves the wake_up_process() invocation so it is not done under
the info->lock by making use of a lockless wake_q. With this change, the
waiter is woken up once its state is set to STATE_READY and no longer
needs to busy-loop on SMP while it is still in STATE_PENDING. In the
timeout case we still need to grab the info->lock to verify the state.

This change should also avoid the introduction of preempt_disable() in
-RT, which would otherwise be needed to avoid a busy-loop that polls for
the STATE_PENDING -> STATE_READY transition if the waiter has a higher
priority than the waker.

Additionally, this patch micro-optimizes wq_sleep by using the cheaper
cousin of set_current_state(TASK_INTERRUPTIBLE), as we will block no
matter what, thus getting rid of the implied barrier. Secondly, and
related to the lockless wakeups, it comments the smp_wmb and adds
barrier pairing on the reader side.

Based-on-work-from: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 ipc/mqueue.c | 52 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 33 insertions(+), 19 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 3aaea7f..11c7b92 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -47,8 +47,7 @@
 #define RECV		1
 
 #define STATE_NONE	0
-#define STATE_PENDING	1
-#define STATE_READY	2
+#define STATE_READY	1
 
 struct posix_msg_tree_node {
 	struct rb_node		rb_node;
@@ -571,15 +570,13 @@ static int wq_sleep(struct mqueue_inode_info *info, int sr,
 	wq_add(info, sr, ewp);
 
 	for (;;) {
-		set_current_state(TASK_INTERRUPTIBLE);
+		__set_current_state(TASK_INTERRUPTIBLE);
 
 		spin_unlock(&info->lock);
 		time = schedule_hrtimeout_range_clock(timeout, 0,
 			HRTIMER_MODE_ABS, CLOCK_REALTIME);
 
-		while (ewp->state == STATE_PENDING)
-			cpu_relax();
-
+		smp_rmb(); /* pairs with smp_wmb() in pipelined_send/receive */
 		if (ewp->state == STATE_READY) {
 			retval = 0;
 			goto out;
@@ -909,9 +906,14 @@ out_name:
  * bypasses the message array and directly hands the message over to the
  * receiver.
  * The receiver accepts the message and returns without grabbing the queue
- * spinlock. Therefore an intermediate STATE_PENDING state and memory barriers
- * are necessary. The same algorithm is used for sysv semaphores, see
- * ipc/sem.c for more details.
+ * spinlock. The algorithm used differs from sysv semaphores (ipc/sem.c):
+ *
+ * - Set pointer to message.
+ * - Queue the receiver task for later wakeup (without the info->lock).
+ * - Update its state to STATE_READY. Now the receiver can continue.
+ * - Wake up the process after the lock is dropped. Should the process wake up
+ *   before this wakeup (due to a timeout or a signal) it will either see
+ *   STATE_READY and continue or acquire the lock to check the state again.
  *
  * The same algorithm is used for senders.
  */
@@ -919,21 +921,29 @@ out_name:
 /* pipelined_send() - send a message directly to the task waiting in
  * sys_mq_timedreceive() (without inserting message into a queue).
  */
-static inline void pipelined_send(struct mqueue_inode_info *info,
+static inline void pipelined_send(struct wake_q_head *wake_q,
+				  struct mqueue_inode_info *info,
 				  struct msg_msg *message,
 				  struct ext_wait_queue *receiver)
 {
 	receiver->msg = message;
 	list_del(&receiver->list);
-	receiver->state = STATE_PENDING;
-	wake_up_process(receiver->task);
-	smp_wmb();
+	wake_q_add(wake_q, receiver->task);
+	/*
+	 * Ensure that updating receiver->state is the last
+	 * write operation: As once set, the receiver can continue,
+	 * and if we don't have the reference count from the wake_q,
+	 * yet, at that point we can later have a use-after-free
+	 * condition and bogus wakeup.
+	 */
+	smp_wmb(); /* pairs with smp_rmb() in wq_sleep */
 	receiver->state = STATE_READY;
 }
 
 /* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
  * gets its message and put to the queue (we have one free place for sure). */
-static inline void pipelined_receive(struct mqueue_inode_info *info)
+static inline void pipelined_receive(struct wake_q_head *wake_q,
+				     struct mqueue_inode_info *info)
 {
 	struct ext_wait_queue *sender = wq_get_first_waiter(info, SEND);
 
@@ -944,10 +954,10 @@ static inline void pipelined_receive(struct mqueue_inode_info *info)
 	}
 	if (msg_insert(sender->msg, info))
 		return;
+
 	list_del(&sender->list);
-	sender->state = STATE_PENDING;
-	wake_up_process(sender->task);
-	smp_wmb();
+	wake_q_add(wake_q, sender->task);
+	smp_wmb(); /* pairs with smp_rmb() in wq_sleep */
 	sender->state = STATE_READY;
 }
 
@@ -965,6 +975,7 @@ SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr,
 	struct timespec ts;
 	struct posix_msg_tree_node *new_leaf = NULL;
 	int ret = 0;
+	WAKE_Q(wake_q);
 
 	if (u_abs_timeout) {
 		int res = prepare_timeout(u_abs_timeout, &expires, &ts);
@@ -1049,7 +1060,7 @@ SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr,
 	} else {
 		receiver = wq_get_first_waiter(info, RECV);
 		if (receiver) {
-			pipelined_send(info, msg_ptr, receiver);
+			pipelined_send(&wake_q, info, msg_ptr, receiver);
 		} else {
 			/* adds message to the queue */
 			ret = msg_insert(msg_ptr, info);
@@ -1062,6 +1073,7 @@ SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr,
 	}
 out_unlock:
 	spin_unlock(&info->lock);
+	wake_up_q(&wake_q);
 out_free:
 	if (ret)
 		free_msg(msg_ptr);
@@ -1084,6 +1096,7 @@ SYSCALL_DEFINE5(mq_timedreceive, mqd_t, mqdes, char __user *, u_msg_ptr,
 	ktime_t expires, *timeout = NULL;
 	struct timespec ts;
 	struct posix_msg_tree_node *new_leaf = NULL;
+	WAKE_Q(wake_q);
 
 	if (u_abs_timeout) {
 		int res = prepare_timeout(u_abs_timeout, &expires, &ts);
@@ -1155,8 +1168,9 @@ SYSCALL_DEFINE5(mq_timedreceive, mqd_t, mqdes, char __user *, u_msg_ptr,
 				CURRENT_TIME;
 
 		/* There is now free space in queue. */
-		pipelined_receive(info);
+		pipelined_receive(&wake_q, info);
 		spin_unlock(&info->lock);
+		wake_up_q(&wake_q);
 		ret = 0;
 	}
 	if (ret == 0) {
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] ipc/mqueue: remove STATE_PENDING
  2015-04-28  3:24         ` Davidlohr Bueso
@ 2015-04-28 12:37           ` Peter Zijlstra
  2015-04-28 16:36             ` Davidlohr Bueso
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2015-04-28 12:37 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, Manfred Spraul, LKML,
	Ingo Molnar, Darren Hart, Steven Rostedt, fredrik.markstrom,
	Paul E. McKenney

On Mon, Apr 27, 2015 at 08:24:53PM -0700, Davidlohr Bueso wrote:
> +static inline void pipelined_send(struct wake_q_head *wake_q,
> +				  struct mqueue_inode_info *info,
>  				  struct msg_msg *message,
>  				  struct ext_wait_queue *receiver)
>  {
>  	receiver->msg = message;
>  	list_del(&receiver->list);
> +	wake_q_add(wake_q, receiver->task);
> +	/*
> +	 * Ensure that updating receiver->state is the last
> +	 * write operation: As once set, the receiver can continue,
> +	 * and if we don't have the reference count from the wake_q,
> +	 * yet, at that point we can later have a use-after-free
> +	 * condition and bogus wakeup.
> +	 */
> +	smp_wmb(); /* pairs with smp_rmb() in wq_sleep */

You have this barrier because we cannot rely on a failed cmpxchg()
actually being a full barrier, right?

>  	receiver->state = STATE_READY;
>  }



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] ipc/mqueue: remove STATE_PENDING
  2015-04-28 12:37           ` Peter Zijlstra
@ 2015-04-28 16:36             ` Davidlohr Bueso
  2015-04-28 16:43               ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Davidlohr Bueso @ 2015-04-28 16:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, Manfred Spraul, LKML,
	Ingo Molnar, Darren Hart, Steven Rostedt, fredrik.markstrom,
	Paul E. McKenney

On Tue, 2015-04-28 at 14:37 +0200, Peter Zijlstra wrote:
> On Mon, Apr 27, 2015 at 08:24:53PM -0700, Davidlohr Bueso wrote:
> > +static inline void pipelined_send(struct wake_q_head *wake_q,
> > +				  struct mqueue_inode_info *info,
> >  				  struct msg_msg *message,
> >  				  struct ext_wait_queue *receiver)
> >  {
> >  	receiver->msg = message;
> >  	list_del(&receiver->list);
> > +	wake_q_add(wake_q, receiver->task);
> > +	/*
> > +	 * Ensure that updating receiver->state is the last
> > +	 * write operation: As once set, the receiver can continue,
> > +	 * and if we don't have the reference count from the wake_q,
> > +	 * yet, at that point we can later have a use-after-free
> > +	 * condition and bogus wakeup.
> > +	 */
> > +	smp_wmb(); /* pairs with smp_rmb() in wq_sleep */
> 
> You have this barrier because we cannot rely on a failed cmpxchg()
> actually being a full barrier, right?

Failed cmpxchg() calls imply that the task is already queued and thus is
never added twice (a duplicate add, which I cannot see occurring in this
patch), so there is nothing wrong wrt the bogus wakeups mentioned in the
comment.

This barrier is not added by this patch though. Currently we have it
serializing the wake_up_process() against the STATE_READY store, for
similar reasons. Because there is no task refcounting going on, the task
can easily disappear underneath us if the state is set before the
wakeup. I applied the same judgment here.

Thanks,
Davidlohr

> 
> >  	receiver->state = STATE_READY;
> >  }
> 
> 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] ipc/mqueue: remove STATE_PENDING
  2015-04-28 16:36             ` Davidlohr Bueso
@ 2015-04-28 16:43               ` Peter Zijlstra
  2015-04-28 16:59                 ` Davidlohr Bueso
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2015-04-28 16:43 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, Manfred Spraul, LKML,
	Ingo Molnar, Darren Hart, Steven Rostedt, fredrik.markstrom,
	Paul E. McKenney

On Tue, Apr 28, 2015 at 09:36:50AM -0700, Davidlohr Bueso wrote:
> On Tue, 2015-04-28 at 14:37 +0200, Peter Zijlstra wrote:
> > On Mon, Apr 27, 2015 at 08:24:53PM -0700, Davidlohr Bueso wrote:
> > > +static inline void pipelined_send(struct wake_q_head *wake_q,
> > > +				  struct mqueue_inode_info *info,
> > >  				  struct msg_msg *message,
> > >  				  struct ext_wait_queue *receiver)
> > >  {
> > >  	receiver->msg = message;
> > >  	list_del(&receiver->list);
> > > +	wake_q_add(wake_q, receiver->task);
> > > +	/*
> > > +	 * Ensure that updating receiver->state is the last
> > > +	 * write operation: As once set, the receiver can continue,
> > > +	 * and if we don't have the reference count from the wake_q,
> > > +	 * yet, at that point we can later have a use-after-free
> > > +	 * condition and bogus wakeup.
> > > +	 */
> > > +	smp_wmb(); /* pairs with smp_rmb() in wq_sleep */
> > 
> > You have this barrier because we cannot rely on a failed cmpxchg()
> > actually being a full barrier, right?
> 
> Failed cmpxchg() calls imply that the task is already queued and thus is
> never added twice (a duplicate add, which I cannot see occurring in this
> patch), so there is nothing wrong wrt the bogus wakeups mentioned in the
> comment.
> 
> This barrier is not added by this patch though. Currently we have it
> serializing the wake_up_process() against the STATE_READY store, for
> similar reasons. Because there is no task refcounting going on, the task
> can easily disappear underneath us if the state is set before the
> wakeup. I applied the same judgment here.

Well, if you can 'guarantee' the cmpxchg will not fail, you can then
rely on the fact that cmpxchg implies a full barrier, which would
obviate the need for the wmb.
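
For reference, that cmpxchg is the one wake_q_add() uses to claim the
task's wake_q node. Roughly (paraphrasing the current implementation
from memory, so treat the details as approximate):

	void wake_q_add(struct wake_q_head *head, struct task_struct *task)
	{
		struct wake_q_node *node = &task->wake_q;

		/*
		 * A failed cmpxchg() means node->next was already
		 * non-NULL: the task is queued (by us or by someone
		 * else) and will get its wakeup from whoever queued
		 * it, so don't take a reference or queue it twice.
		 */
		if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
			return;

		/* the queued task stays alive until wake_up_q() drops this */
		get_task_struct(task);

		*head->lastp = node;
		head->lastp = &node->next;
	}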

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] ipc/mqueue: remove STATE_PENDING
  2015-04-28 16:43               ` Peter Zijlstra
@ 2015-04-28 16:59                 ` Davidlohr Bueso
  2015-04-29 19:44                   ` Manfred Spraul
  0 siblings, 1 reply; 27+ messages in thread
From: Davidlohr Bueso @ 2015-04-28 16:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, Manfred Spraul, LKML,
	Ingo Molnar, Darren Hart, Steven Rostedt, fredrik.markstrom,
	Paul E. McKenney

On Tue, 2015-04-28 at 18:43 +0200, Peter Zijlstra wrote:
> Well, if you can 'guarantee' the cmpxchg will not fail, you can then
> rely on the fact that cmpxchg implies a full barrier, which would
> obviate the need for the wmb.

Yes, assuming it implies barriers on both sides. And we could obviously
remove the need for pairing. With wake_q being local to wq_sleep() I
cannot see duplicate tasks trying to add themselves in the list. Failed
cmpxchg should only occur when users start misusing the wake_q.

Manfred, do you have any objections to this? Perhaps I've missed the
real purpose of the barriers.

Thanks,
Davidlohr



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] ipc/mqueue: remove STATE_PENDING
  2015-04-28 16:59                 ` Davidlohr Bueso
@ 2015-04-29 19:44                   ` Manfred Spraul
  2015-04-30 18:46                     ` Davidlohr Bueso
  0 siblings, 1 reply; 27+ messages in thread
From: Manfred Spraul @ 2015-04-29 19:44 UTC (permalink / raw)
  To: Davidlohr Bueso, Peter Zijlstra
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior, LKML, Ingo Molnar,
	Darren Hart, Steven Rostedt, fredrik.markstrom, Paul E. McKenney

Hi Davidlohr,

On 04/28/2015 06:59 PM, Davidlohr Bueso wrote:
> On Tue, 2015-04-28 at 18:43 +0200, Peter Zijlstra wrote:
>> Well, if you can 'guarantee' the cmpxchg will not fail, you can then
>> rely on the fact that cmpxchg implies a full barrier, which would
>> obviate the need for the wmb.
> Yes, assuming it implies barriers on both sides. And we could obviously
> remove the need for pairing. With wake_q being local to wq_sleep() I
> cannot see duplicate tasks trying to add themselves in the list. Failed
> cmpxchg should only occur when users start misusing the wake_q.
>
> Manfred, do you have any objections to this? Perhaps I've missed the
> real purpose of the barriers.
I don't remember the details either, so let's check what should happen:

CPU1: sender copies message to kernel memory
  aaaa
CPU1: sender does receiver->msg = message;
   ** barrier 1
CPU1: sender does receiver->state = STATE_READY;

CPU2: receiver notices receiver->state = STATE_READY;
   ** barrier 2
CPU2: receiver reads receiver->msg
  bbbb
CPU2: receiver reads *receiver->msg

Failures would be:
- write to receiver->state is visible before the write to receiver->msg 
or to *receiver->msg
   ** barrier 1 needs to be an smp_wmb()
- cpu 2 reads receiver->msg before receiver->state
   ** barrier 2 needs to be an smp_rmb().

As far as I can see, no barrier is needed in pos aaaa or bbbb.
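
Transcribed into the patch's names, the required pairing looks like
this (a sketch, not the literal hunks):

	/* CPU1: sender, pipelined_send(), under info->lock */
	receiver->msg = message;	/* message body already copied */
	smp_wmb();			/* barrier 1 */
	receiver->state = STATE_READY;

	/* CPU2: receiver, wq_sleep(), after waking up */
	if (ewp->state == STATE_READY) {
		smp_rmb();		/* barrier 2 */
		msg = ewp->msg;		/* guaranteed to see the sender's writes */
	}

These days the same pairing could equally be written as an
smp_store_release()/smp_load_acquire() pair.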

With regards to failed cmpxchg():
I don't see that mqueue could cause it by itself.

Who is allowed to use wake_q?
If it is permitted to use wake_q for e.g. timeout/signal delivery 
wakeup, then that user might have a pending wakeup stored in the task 
struct.

--
     Manfred

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] ipc/mqueue: remove STATE_PENDING
  2015-04-29 19:44                   ` Manfred Spraul
@ 2015-04-30 18:46                     ` Davidlohr Bueso
  0 siblings, 0 replies; 27+ messages in thread
From: Davidlohr Bueso @ 2015-04-30 18:46 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior, LKML,
	Ingo Molnar, Darren Hart, Steven Rostedt, fredrik.markstrom,
	Paul E. McKenney

On Wed, 2015-04-29 at 21:44 +0200, Manfred Spraul wrote:
> Hi Davidlohr,
> 
> On 04/28/2015 06:59 PM, Davidlohr Bueso wrote:
> > On Tue, 2015-04-28 at 18:43 +0200, Peter Zijlstra wrote:
> >> Well, if you can 'guarantee' the cmpxchg will not fail, you can then
> >> rely on the fact that cmpxchg implies a full barrier, which would
> >> obviate the need for the wmb.
> > Yes, assuming it implies barriers on both sides. And we could obviously
> > remove the need for pairing. With wake_q being local to wq_sleep() I
> > cannot see duplicate tasks trying to add themselves in the list. Failed
> > cmpxchg should only occur when users start misusing the wake_q.
> >
> > Manfred, do you have any objections to this? Perhaps I've missed the
> > real purpose of the barriers.
> I don't remember the details either, so let's check what should happen:
> 
> CPU1: sender copies message to kernel memory
>   aaaa
> CPU1: sender does receiver->msg = message;
>    ** barrier 1
> CPU1: sender does receiver->state = STATE_READY;
> 
> CPU2: receiver notices receiver->state = STATE_READY;
>    ** barrier 2
> CPU2: receiver reads receiver->msg
>   bbbb
> CPU2: receiver reads *receiver->msg
> 
> Failures would be:
> - write to receiver->state is visible before the write to receiver->msg 
> or to *receiver->msg
>    ** barrier 1 needs to be an smp_wmb()
> - cpu 2 reads receiver->msg before receiver->state
>    ** barrier 2 needs to be an smp_rmb().
> 
> As far as I can see, no barrier is needed in pos aaaa or bbbb.

Thanks for confirming.

> 
> With regards to failed cmpxchg():
> I don't see that mqueue could cause it by itself.

Agreed.

> 
> Who is allowed to use wake_q?
> If it is permitted to use wake_q for e.g. timeout/signal delivery 
> wakeup, then that user might have a pending wakeup stored in the task 
> struct.

No, this is not the case. All users are expected to do the wakeup right
away.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2015-04-30 18:47 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-07 15:03 improve futex on -RT by avoiding the double wake-up Sebastian Andrzej Siewior
2015-04-07 15:03 ` [PATCH 1/3] futex: avoid double wake up in PI futex wait / wake on -RT Sebastian Andrzej Siewior
2015-04-07 18:41   ` Thomas Gleixner
2015-04-10 14:42     ` [PATCH 1/3 v2] " Sebastian Andrzej Siewior
2015-04-07 15:03 ` [PATCH 2/3] futex: avoid double wake up in futex_wake() " Sebastian Andrzej Siewior
2015-04-07 19:47   ` Thomas Gleixner
2015-04-10 16:11     ` [PATCH 2/3 v2] " Sebastian Andrzej Siewior
2015-04-13  3:02       ` Davidlohr Bueso
2015-04-16  5:09         ` Davidlohr Bueso
2015-04-16  9:19           ` Thomas Gleixner
2015-04-16 10:16             ` Peter Zijlstra
2015-04-16 10:49               ` Thomas Gleixner
2015-04-16 14:42               ` Davidlohr Bueso
2015-04-16 15:54                 ` Peter Zijlstra
2015-04-16 16:22                   ` Davidlohr Bueso
2015-04-07 15:03 ` [PATCH 3/3] ipc/mqueue: remove STATE_PENDING Sebastian Andrzej Siewior
2015-04-07 17:48   ` Manfred Spraul
2015-04-07 18:28     ` Thomas Gleixner
2015-04-10 14:37     ` [PATCH v2] " Sebastian Andrzej Siewior
2015-04-23 22:18       ` Thomas Gleixner
2015-04-28  3:24         ` Davidlohr Bueso
2015-04-28 12:37           ` Peter Zijlstra
2015-04-28 16:36             ` Davidlohr Bueso
2015-04-28 16:43               ` Peter Zijlstra
2015-04-28 16:59                 ` Davidlohr Bueso
2015-04-29 19:44                   ` Manfred Spraul
2015-04-30 18:46                     ` Davidlohr Bueso
