* [PATCH v3 0/6] ANDROID: binder: RT priority inheritance
@ 2017-10-26 14:07 Martijn Coenen
  2017-10-26 14:07 ` [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance Martijn Coenen
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Martijn Coenen @ 2017-10-26 14:07 UTC (permalink / raw)
  To: gregkh, john.stultz, tkjos, arve, sherryy, tglx, peterz, amit.pundir
  Cc: linux-kernel, devel, maco, Martijn Coenen

Changes since v2 [1]:
- All patches in v2 not related to priority inheritance were merged,
  and hence removed from this series
- Fixed using the wrong mask in node scheduler policy calculation,
  originally reported by Ganesh Mahendran <opensource.ganesh@gmail.com>
- Fixed using an uninitialized value for desired_prio,
  originally reported by Ganesh Mahendran <opensource.ganesh@gmail.com>
  
Changes since v1 [2]:
- added more detailed commit messages and comments to the priority
  inheritance patches, including rationale for not using
  sched_setscheduler() directly, or rt_mutex prio inheritance.
  No functional changes.

[1]: https://lkml.kernel.org/r/20170831080430.118765-1-maco@android.com
[2]: https://lkml.kernel.org/r/20170825093335.100892-1-maco@android.com

---

This patch series introduces support for priority inheritance of real-time
scheduling policies in binder. With the introduction of Android Treble,
functionality that used to be in a single process is now split over two or
more processes, which communicate using binder IPC. For latency sensitive
operations such as sensor events, Bluetooth audio and rendering, inheriting
the (real-time) priority of the caller is crucial to meet requirements.

The implementation in this series directly calls into the scheduler to
modify priorities, since I haven't found a way to make this work correctly
with rt_mutex or other existing priority inheritance mechanisms. The main
reasons a PI rt_mutex doesn't work well are the following:
1) Binder supports asynchronous transactions, where a caller isn't blocked
   on a result; therefore, the caller also cannot block on an rt_mutex.
2) Binder supports the concept of 'node priority', where the priority of a
   call is not determined by the caller, but by the endpoint the caller
   calls in to.
3) There may not necessarily be any binder threads waiting to handle a
   transaction, so the caller doesn't always have a thread to change the
   priority of; instead, the first thread to pick up the work changes its
   own priority.
4) rt_mutex doesn't support non-RT policies (though a patch was sent to
   LKML once to address this).

More details in the patches themselves. I have found the current approach
to be reliable, but I'm happy to look into suggestions to make this work
with rt_mutex, or use other infrastructure.

All patches have already been reviewed by Android engineers and are
merged in Android's common kernel trees.

Martijn Coenen (6):
  ANDROID: binder: add support for RT prio inheritance.
  ANDROID: binder: add min sched_policy to node.
  ANDROID: binder: improve priority inheritance.
  ANDROID: binder: add RT inheritance flag to node.
  ANDROID: binder: don't check prio permissions on restore.
  ANDROID: binder: Add tracing for binder priority inheritance.

 drivers/android/binder.c            | 294 ++++++++++++++++++++++++++++++++----
 drivers/android/binder_trace.h      |  24 +++
 include/uapi/linux/android/binder.h |  49 +++++-
 3 files changed, 334 insertions(+), 33 deletions(-)

-- 
2.15.0.rc2.357.g7e34df9404-goog

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance.
  2017-10-26 14:07 [PATCH v3 0/6] ANDROID: binder: RT priority inheritance Martijn Coenen
@ 2017-10-26 14:07 ` Martijn Coenen
  2017-11-15 13:01   ` Peter Zijlstra
  2017-10-26 14:07 ` [PATCH v3 2/6] ANDROID: binder: add min sched_policy to node Martijn Coenen
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Martijn Coenen @ 2017-10-26 14:07 UTC (permalink / raw)
  To: gregkh, john.stultz, tkjos, arve, sherryy, tglx, peterz, amit.pundir
  Cc: linux-kernel, devel, maco, Martijn Coenen

Adds support for SCHED_BATCH/SCHED_FIFO/SCHED_RR priority inheritance
to the binder driver. The desired behavior is as follows:

Each thread in the binder threadpool runs at a default priority, which is
typically nice 0.

Binder nodes (endpoints that can receive binder transactions) can have a
minimum priority associated with them, which means that all transactions
on this node must run at least at this priority.

Let's say a synchronous transaction is made from task T1 in process P1
to process P2, into node N1 which has a 'N1_min_priority':
1) T1 wakes up a task T2 in P2, then blocks on a waitqueue for reply
2) T2 determines prio=max(prio_of_T1, N1_min_priority);
3) T2 sets its own priority to prio, and stores its old prio
4) T2 returns to userspace, does work
5) Eventually, T2 returns a reply transaction to the driver
6) T2 queues the reply to T1 and wakes it up
7) T2 restores its own priority to what it was

For an asynchronous transaction:
1) T1 wakes up a task T2 in P2, returns to userspace (no work left)
2) T2 determines prio=max(default_prio, N1_min_priority)
3) T2 sets its own priority to prio, and stores its old prio
4) T2 returns to userspace, does work, completes
5) T2 calls back into kernel for more work
6) T2 restores its own priority to its default priority
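The priority computation in step 2 of both flows can be sketched in plain
C. This is an illustrative sketch, not driver code: pick_transaction_prio()
is an invented helper name (the struct mirrors the binder_priority this
patch adds), and since kernel priority values are inverted (a numerically
lower value means a higher priority), the max() above becomes a pick of
the smaller number:

```c
/* Illustrative only: the struct mirrors the binder_priority added by
 * this patch; pick_transaction_prio() is an invented helper name.
 */
struct binder_priority {
	unsigned int sched_policy;
	int prio;	/* kernel units: [0..99] RT, [100..139] CFS */
};

static struct binder_priority
pick_transaction_prio(struct binder_priority caller,
		      struct binder_priority node_min)
{
	/* A numerically lower node minimum means a higher priority,
	 * so the node minimum wins only in that case.
	 */
	if (node_min.prio < caller.prio)
		return node_min;
	return caller;
}
```

For the asynchronous flow, `caller` would hold the process default
priority rather than the caller's own.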

This still leaves a race condition, where T2 wakes up and gets preempted
before it has a chance to change its own priority. This is addressed in
one of the next patches in the series, by letting T1 change the priority
of T2 *before* waking it up.

Signed-off-by: Martijn Coenen <maco@android.com>
---
 drivers/android/binder.c | 217 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 188 insertions(+), 29 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index 95a96a254e5d..be6e7e753013 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -77,6 +77,7 @@
 #endif
 
 #include <uapi/linux/android/binder.h>
+#include <uapi/linux/sched/types.h>
 #include "binder_alloc.h"
 #include "binder_trace.h"
 
@@ -463,6 +464,22 @@ enum binder_deferred_state {
 	BINDER_DEFERRED_RELEASE      = 0x04,
 };
 
+/**
+ * struct binder_priority - scheduler policy and priority
+ * @sched_policy:           scheduler policy
+ * @prio:                   [100..139] for SCHED_NORMAL/BATCH, [0..99] for FIFO/RR
+ *
+ * The binder driver supports inheriting the following scheduler policies:
+ * SCHED_NORMAL
+ * SCHED_BATCH
+ * SCHED_FIFO
+ * SCHED_RR
+ */
+struct binder_priority {
+	unsigned int sched_policy;
+	int prio;
+};
+
 /**
  * struct binder_proc - binder process bookkeeping
  * @proc_node:            element for binder_procs list
@@ -542,7 +559,7 @@ struct binder_proc {
 	int requested_threads;
 	int requested_threads_started;
 	int tmp_ref;
-	long default_priority;
+	struct binder_priority default_priority;
 	struct dentry *debugfs_entry;
 	struct binder_alloc alloc;
 	struct binder_context *context;
@@ -624,8 +641,8 @@ struct binder_transaction {
 	struct binder_buffer *buffer;
 	unsigned int	code;
 	unsigned int	flags;
-	long	priority;
-	long	saved_priority;
+	struct binder_priority	priority;
+	struct binder_priority	saved_priority;
 	kuid_t	sender_euid;
 	/**
 	 * @lock:  protects @from, @to_proc, and @to_thread
@@ -1051,22 +1068,142 @@ static void binder_wakeup_proc_ilocked(struct binder_proc *proc)
 	binder_wakeup_thread_ilocked(proc, thread, /* sync = */false);
 }
 
-static void binder_set_nice(long nice)
+static bool is_rt_policy(int policy)
+{
+	return policy == SCHED_FIFO || policy == SCHED_RR;
+}
+
+static bool is_fair_policy(int policy)
+{
+	return policy == SCHED_NORMAL || policy == SCHED_BATCH;
+}
+
+static bool binder_supported_policy(int policy)
+{
+	return is_fair_policy(policy) || is_rt_policy(policy);
+}
+
+static int to_userspace_prio(int policy, int kernel_priority)
+{
+	if (is_fair_policy(policy))
+		return PRIO_TO_NICE(kernel_priority);
+	else
+		return MAX_USER_RT_PRIO - 1 - kernel_priority;
+}
+
+static int to_kernel_prio(int policy, int user_priority)
+{
+	if (is_fair_policy(policy))
+		return NICE_TO_PRIO(user_priority);
+	else
+		return MAX_USER_RT_PRIO - 1 - user_priority;
+}
+
+/**
+ * binder_set_priority() - sets the scheduler priority of a task
+ * @task:	task to set priority on
+ * @desired:	desired priority to run at
+ *
+ * The scheduler policy of tasks is changed explicitly, because we want to
+ * support a few distinct features:
+ * 1) If the requested priority is higher than the maximum allowed priority,
+ *    we want to "clip" at the highest supported priority.
+ * 2) For a future patch, we need to allow changing the priority of a task
+ *    with a different UID; when we make a binder transaction from process A
+ *    to process B with different UIDs, A must be able to set B's priority
+ *    before B wakes up to handle the call. If B were to raise its own priority
+ *    after waking up, a race condition exists where B gets preempted before
+ *    it can raise its own priority.
+ *
+ * Feature 2) sounds like something a rt_mutex would solve, for example by
+ * having the caller proxy lock an rt_mutex on behalf of the callee, and then
+ * sleeping on it. But we have a few requirements that don't work with this
+ * approach:
+ * 1) binder supports a "minimum node priority", meaning that all transactions
+ *    into a node must run at this priority at a minimum. This means that the
+ *    desired priority for handling a transaction is not necessarily equal to
+ *    the priority of the caller.
+ * 2) binder supports asynchronous transactions, where the caller is not blocked
+ *    on transaction completion; so, it also can't be blocked on an rt_mutex.
+ * 3) similarly, there may not necessarily be a thread waiting for
+ *    transactions at the time the call is made, so we don't know who to proxy-
+ *    lock the lock for.
+ * 4) binder supports nested transactions, where A can call into B, and B can
+ *    call back into A before returning a reply to the original transaction.
+ *    This means that if A is blocked on an rt_mutex B holds, B must first wake
+ *    up A to handle a new transaction, and only then can it proxy-lock and try
+ *    to acquire the new rt_mutex. This leaves a race condition where B
+ *    temporarily runs at its original priority.
+ * 5) rt_mutex does not currently support PI for CFS tasks.
+ */
+static void binder_set_priority(struct task_struct *task,
+				struct binder_priority desired)
 {
-	long min_nice;
+	int priority; /* user-space prio value */
+	bool has_cap_nice;
+	unsigned int policy = desired.sched_policy;
 
-	if (can_nice(current, nice)) {
-		set_user_nice(current, nice);
+	if (task->policy == policy && task->normal_prio == desired.prio)
 		return;
+
+	/*
+	 * We need to check capabilities because sched_setscheduler() doesn't
+	 * allow changing the scheduler of a task with a different UID. This
+	 * in turn means we have to use sched_setscheduler_nocheck(), but we
+	 * still want to check that the target process is allowed to run at
+	 * the desired priority, so we do these checks here.
+	 *
+	 * Another reason is that we want to clip at the highest "allowed"
+	 * limit, so if a process has an RLIMIT_RTPRIO of 50, and we ask it to
+	 * run at 99, instead of declining the request, we run at prio 50.
+	 */
+	has_cap_nice = has_capability_noaudit(task, CAP_SYS_NICE);
+
+	priority = to_userspace_prio(policy, desired.prio);
+
+	if (is_rt_policy(policy) && !has_cap_nice) {
+		long max_rtprio = task_rlimit(task, RLIMIT_RTPRIO);
+
+		if (max_rtprio == 0) {
+			/* Fall back to SCHED_NORMAL */
+			policy = SCHED_NORMAL;
+			priority = MIN_NICE;
+		} else if (priority > max_rtprio) {
+			priority = max_rtprio;
+		}
 	}
-	min_nice = rlimit_to_nice(rlimit(RLIMIT_NICE));
-	binder_debug(BINDER_DEBUG_PRIORITY_CAP,
-		     "%d: nice value %ld not allowed use %ld instead\n",
-		      current->pid, nice, min_nice);
-	set_user_nice(current, min_nice);
-	if (min_nice <= MAX_NICE)
-		return;
-	binder_user_error("%d RLIMIT_NICE not set\n", current->pid);
+
+	if (is_fair_policy(policy) && !has_cap_nice) {
+		long min_nice = rlimit_to_nice(task_rlimit(task, RLIMIT_NICE));
+
+		if (min_nice > MAX_NICE) {
+			binder_user_error("%d RLIMIT_NICE not set\n",
+					  task->pid);
+			return;
+		} else if (priority < min_nice) {
+			priority = min_nice;
+		}
+	}
+
+	if (policy != desired.sched_policy ||
+	    to_kernel_prio(policy, priority) != desired.prio)
+		binder_debug(BINDER_DEBUG_PRIORITY_CAP,
+			     "%d: priority %d not allowed, using %d instead\n",
+			      task->pid, desired.prio,
+			      to_kernel_prio(policy, priority));
+
+	/* Set the actual priority */
+	if (task->policy != policy || is_rt_policy(policy)) {
+		struct sched_param params;
+
+		params.sched_priority = is_rt_policy(policy) ? priority : 0;
+
+		sched_setscheduler_nocheck(task,
+					   policy | SCHED_RESET_ON_FORK,
+					   &params);
+	}
+	if (is_fair_policy(policy))
+		set_user_nice(task, priority);
 }
 
 static struct binder_node *binder_get_node_ilocked(struct binder_proc *proc,
@@ -1151,7 +1288,8 @@ static struct binder_node *binder_init_node_ilocked(
 	node->ptr = ptr;
 	node->cookie = cookie;
 	node->work.type = BINDER_WORK_NODE;
-	node->min_priority = flags & FLAT_BINDER_FLAG_PRIORITY_MASK;
+	node->min_priority = NICE_TO_PRIO(
+			flags & FLAT_BINDER_FLAG_PRIORITY_MASK);
 	node->accept_fds = !!(flags & FLAT_BINDER_FLAG_ACCEPTS_FDS);
 	spin_lock_init(&node->lock);
 	INIT_LIST_HEAD(&node->work.entry);
@@ -2688,7 +2826,7 @@ static void binder_transaction(struct binder_proc *proc,
 		}
 		thread->transaction_stack = in_reply_to->to_parent;
 		binder_inner_proc_unlock(proc);
-		binder_set_nice(in_reply_to->saved_priority);
+		binder_set_priority(current, in_reply_to->saved_priority);
 		target_thread = binder_get_txn_from_and_acq_inner(in_reply_to);
 		if (target_thread == NULL) {
 			return_error = BR_DEAD_REPLY;
@@ -2853,7 +2991,15 @@ static void binder_transaction(struct binder_proc *proc,
 	t->to_thread = target_thread;
 	t->code = tr->code;
 	t->flags = tr->flags;
-	t->priority = task_nice(current);
+	if (!(t->flags & TF_ONE_WAY) &&
+	    binder_supported_policy(current->policy)) {
+		/* Inherit supported policies for synchronous transactions */
+		t->priority.sched_policy = current->policy;
+		t->priority.prio = current->normal_prio;
+	} else {
+		/* Otherwise, fall back to the default priority */
+		t->priority = target_proc->default_priority;
+	}
 
 	trace_binder_transaction(reply, t, target_node);
 
@@ -3761,7 +3907,7 @@ static int binder_thread_read(struct binder_proc *proc,
 			wait_event_interruptible(binder_user_error_wait,
 						 binder_stop_on_user_error < 2);
 		}
-		binder_set_nice(proc->default_priority);
+		binder_set_priority(current, proc->default_priority);
 	}
 
 	if (non_block) {
@@ -3973,16 +4119,21 @@ static int binder_thread_read(struct binder_proc *proc,
 		BUG_ON(t->buffer == NULL);
 		if (t->buffer->target_node) {
 			struct binder_node *target_node = t->buffer->target_node;
+			struct binder_priority prio = t->priority;
 
 			tr.target.ptr = target_node->ptr;
 			tr.cookie =  target_node->cookie;
-			t->saved_priority = task_nice(current);
-			if (t->priority < target_node->min_priority &&
-			    !(t->flags & TF_ONE_WAY))
-				binder_set_nice(t->priority);
-			else if (!(t->flags & TF_ONE_WAY) ||
-				 t->saved_priority > target_node->min_priority)
-				binder_set_nice(target_node->min_priority);
+			t->saved_priority.sched_policy = current->policy;
+			t->saved_priority.prio = current->normal_prio;
+			if (target_node->min_priority < t->priority.prio) {
+				/* The min_priority on a node can currently
+				 * only use SCHED_NORMAL, so we can just
+				 * hardcode this here.
+				 */
+				prio.sched_policy = SCHED_NORMAL;
+				prio.prio = target_node->min_priority;
+			}
+			binder_set_priority(current, prio);
 			cmd = BR_TRANSACTION;
 		} else {
 			tr.target.ptr = 0;
@@ -4630,7 +4781,14 @@ static int binder_open(struct inode *nodp, struct file *filp)
 	get_task_struct(current->group_leader);
 	proc->tsk = current->group_leader;
 	INIT_LIST_HEAD(&proc->todo);
-	proc->default_priority = task_nice(current);
+	if (binder_supported_policy(current->policy)) {
+		proc->default_priority.sched_policy = current->policy;
+		proc->default_priority.prio = current->normal_prio;
+	} else {
+		proc->default_priority.sched_policy = SCHED_NORMAL;
+		proc->default_priority.prio = NICE_TO_PRIO(0);
+	}
+
 	binder_dev = container_of(filp->private_data, struct binder_device,
 				  miscdev);
 	proc->context = &binder_dev->context;
@@ -4922,13 +5080,14 @@ static void print_binder_transaction_ilocked(struct seq_file *m,
 	spin_lock(&t->lock);
 	to_proc = t->to_proc;
 	seq_printf(m,
-		   "%s %d: %p from %d:%d to %d:%d code %x flags %x pri %ld r%d",
+		   "%s %d: %p from %d:%d to %d:%d code %x flags %x pri %d:%d r%d",
 		   prefix, t->debug_id, t,
 		   t->from ? t->from->proc->pid : 0,
 		   t->from ? t->from->pid : 0,
 		   to_proc ? to_proc->pid : 0,
 		   t->to_thread ? t->to_thread->pid : 0,
-		   t->code, t->flags, t->priority, t->need_reply);
+		   t->code, t->flags, t->priority.sched_policy,
+		   t->priority.prio, t->need_reply);
 	spin_unlock(&t->lock);
 
 	if (proc != to_proc) {
-- 
2.15.0.rc2.357.g7e34df9404-goog


* [PATCH v3 2/6] ANDROID: binder: add min sched_policy to node.
  2017-10-26 14:07 [PATCH v3 0/6] ANDROID: binder: RT priority inheritance Martijn Coenen
  2017-10-26 14:07 ` [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance Martijn Coenen
@ 2017-10-26 14:07 ` Martijn Coenen
  2017-11-15 13:02   ` Peter Zijlstra
  2017-10-26 14:07 ` [PATCH v3 3/6] ANDROID: binder: improve priority inheritance Martijn Coenen
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Martijn Coenen @ 2017-10-26 14:07 UTC (permalink / raw)
  To: gregkh, john.stultz, tkjos, arve, sherryy, tglx, peterz, amit.pundir
  Cc: linux-kernel, devel, maco, Martijn Coenen

This change adds flags to flat_binder_object.flags
to indicate a minimum scheduling policy for the
node. It also clarifies the valid value range for
the priority bits in the flags.

Internally, we use the priority map that the kernel
uses, e.g. [0..99] for real-time policies and [100..139]
for the SCHED_NORMAL/SCHED_BATCH policies.
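As a self-contained sketch of that map: MAX_RT_PRIO and the
NICE_TO_PRIO()/PRIO_TO_NICE() macros mirror the kernel's own scheduler
definitions (reproduced here only for illustration), and
rt_user_to_kernel() is an invented name for the RT half of the driver's
to_kernel_prio():

```c
/* Kernel priority map, as described above. MAX_RT_PRIO and the
 * NICE_TO_PRIO()/PRIO_TO_NICE() macros mirror the kernel's own
 * definitions; they are redefined here for illustration only.
 */
#define MAX_RT_PRIO      100
#define MAX_USER_RT_PRIO 100
#define NICE_TO_PRIO(nice) ((nice) + MAX_RT_PRIO + 20) /* -20..19 -> 100..139 */
#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20)

/* Userspace RT priorities run [1..99] with higher meaning more urgent;
 * kernel values run [0..99] with lower meaning more urgent.
 */
static int rt_user_to_kernel(int user_prio)
{
	return MAX_USER_RT_PRIO - 1 - user_prio;
}
```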

Signed-off-by: Martijn Coenen <maco@android.com>
---
 drivers/android/binder.c            | 28 +++++++++++++++++--------
 include/uapi/linux/android/binder.h | 41 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index be6e7e753013..5d51a17dba12 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -352,6 +352,8 @@ struct binder_error {
  *                        and by @lock)
  * @has_async_transaction: async transaction to node in progress
  *                        (protected by @lock)
+ * @sched_policy:         minimum scheduling policy for node
+ *                        (invariant after initialized)
  * @accept_fds:           file descriptor operations supported for node
  *                        (invariant after initialized)
  * @min_priority:         minimum scheduling priority
@@ -391,6 +393,7 @@ struct binder_node {
 		/*
 		 * invariant after initialization
 		 */
+		u8 sched_policy:2;
 		u8 accept_fds:1;
 		u8 min_priority;
 	};
@@ -1256,6 +1259,7 @@ static struct binder_node *binder_init_node_ilocked(
 	binder_uintptr_t ptr = fp ? fp->binder : 0;
 	binder_uintptr_t cookie = fp ? fp->cookie : 0;
 	__u32 flags = fp ? fp->flags : 0;
+	s8 priority;
 
 	assert_spin_locked(&proc->inner_lock);
 
@@ -1288,8 +1292,10 @@ static struct binder_node *binder_init_node_ilocked(
 	node->ptr = ptr;
 	node->cookie = cookie;
 	node->work.type = BINDER_WORK_NODE;
-	node->min_priority = NICE_TO_PRIO(
-			flags & FLAT_BINDER_FLAG_PRIORITY_MASK);
+	priority = flags & FLAT_BINDER_FLAG_PRIORITY_MASK;
+	node->sched_policy = (flags & FLAT_BINDER_FLAG_SCHED_POLICY_MASK) >>
+		FLAT_BINDER_FLAG_SCHED_POLICY_SHIFT;
+	node->min_priority = to_kernel_prio(node->sched_policy, priority);
 	node->accept_fds = !!(flags & FLAT_BINDER_FLAG_ACCEPTS_FDS);
 	spin_lock_init(&node->lock);
 	INIT_LIST_HEAD(&node->work.entry);
@@ -4125,12 +4131,17 @@ static int binder_thread_read(struct binder_proc *proc,
 			tr.cookie =  target_node->cookie;
 			t->saved_priority.sched_policy = current->policy;
 			t->saved_priority.prio = current->normal_prio;
-			if (target_node->min_priority < t->priority.prio) {
-				/* The min_priority on a node can currently
-				 * only use SCHED_NORMAL, so we can just
-				 * hardcode this here.
+			if (target_node->min_priority < t->priority.prio ||
+			    (target_node->min_priority == t->priority.prio &&
+			     target_node->sched_policy == SCHED_FIFO)) {
+				/*
+				 * In case the minimum priority on the node is
+				 * higher (lower value), use that priority. If
+				 * the priority is the same, but the node uses
+				 * SCHED_FIFO, prefer SCHED_FIFO, since it can
+				 * run unbounded, unlike SCHED_RR.
 				 */
-				prio.sched_policy = SCHED_NORMAL;
+				prio.sched_policy = target_node->sched_policy;
 				prio.prio = target_node->min_priority;
 			}
 			binder_set_priority(current, prio);
@@ -5205,8 +5216,9 @@ static void print_binder_node_nilocked(struct seq_file *m,
 	hlist_for_each_entry(ref, &node->refs, node_entry)
 		count++;
 
-	seq_printf(m, "  node %d: u%016llx c%016llx hs %d hw %d ls %d lw %d is %d iw %d tr %d",
+	seq_printf(m, "  node %d: u%016llx c%016llx pri %d:%d hs %d hw %d ls %d lw %d is %d iw %d tr %d",
 		   node->debug_id, (u64)node->ptr, (u64)node->cookie,
+		   node->sched_policy, node->min_priority,
 		   node->has_strong_ref, node->has_weak_ref,
 		   node->local_strong_refs, node->local_weak_refs,
 		   node->internal_strong_refs, count, node->tmp_refs);
diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
index 84a9a0944e13..b3bced80adea 100644
--- a/include/uapi/linux/android/binder.h
+++ b/include/uapi/linux/android/binder.h
@@ -37,9 +37,48 @@ enum {
 	BINDER_TYPE_PTR		= B_PACK_CHARS('p', 't', '*', B_TYPE_LARGE),
 };
 
-enum {
+/**
+ * enum flat_binder_object_shifts: shift values for flat_binder_object_flags
+ * @FLAT_BINDER_FLAG_SCHED_POLICY_SHIFT: shift for getting scheduler policy.
+ *
+ */
+enum flat_binder_object_shifts {
+	FLAT_BINDER_FLAG_SCHED_POLICY_SHIFT = 9,
+};
+
+/**
+ * enum flat_binder_object_flags - flags for use in flat_binder_object.flags
+ */
+enum flat_binder_object_flags {
+	/**
+	 * @FLAT_BINDER_FLAG_PRIORITY_MASK: bit-mask for min scheduler priority
+	 *
+	 * These bits can be used to set the minimum scheduler priority
+	 * at which transactions into this node should run. Valid values
+	 * in these bits depend on the scheduler policy encoded in
+	 * @FLAT_BINDER_FLAG_SCHED_POLICY_MASK.
+	 *
+	 * For SCHED_NORMAL/SCHED_BATCH, the valid range is [-20..19].
+	 * For SCHED_FIFO/SCHED_RR, the valid range is [1..99].
+	 */
 	FLAT_BINDER_FLAG_PRIORITY_MASK = 0xff,
+	/**
+	 * @FLAT_BINDER_FLAG_ACCEPTS_FDS: whether the node accepts fds.
+	 */
 	FLAT_BINDER_FLAG_ACCEPTS_FDS = 0x100,
+	/**
+	 * @FLAT_BINDER_FLAG_SCHED_POLICY_MASK: bit-mask for scheduling policy
+	 *
+	 * These two bits can be used to set the min scheduling policy at which
+	 * transactions on this node should run. These match the UAPI
+	 * scheduler policy values, e.g.:
+	 * 00b: SCHED_NORMAL
+	 * 01b: SCHED_FIFO
+	 * 10b: SCHED_RR
+	 * 11b: SCHED_BATCH
+	 */
+	FLAT_BINDER_FLAG_SCHED_POLICY_MASK =
+		3U << FLAT_BINDER_FLAG_SCHED_POLICY_SHIFT,
 };
 
 #ifdef BINDER_IPC_32BIT
-- 
2.15.0.rc2.357.g7e34df9404-goog


* [PATCH v3 3/6] ANDROID: binder: improve priority inheritance.
  2017-10-26 14:07 [PATCH v3 0/6] ANDROID: binder: RT priority inheritance Martijn Coenen
  2017-10-26 14:07 ` [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance Martijn Coenen
  2017-10-26 14:07 ` [PATCH v3 2/6] ANDROID: binder: add min sched_policy to node Martijn Coenen
@ 2017-10-26 14:07 ` Martijn Coenen
  2017-11-15 13:03   ` Peter Zijlstra
  2017-10-26 14:07 ` [PATCH v3 4/6] ANDROID: binder: add RT inheritance flag to node Martijn Coenen
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Martijn Coenen @ 2017-10-26 14:07 UTC (permalink / raw)
  To: gregkh, john.stultz, tkjos, arve, sherryy, tglx, peterz, amit.pundir
  Cc: linux-kernel, devel, maco, Martijn Coenen

Improve priority inheritance by raising the
priority of a thread selected for a transaction
*before* we wake it up.

Also, delay restoring the priority when handling
a reply until after we wake up the process
receiving it.
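The reordering can be illustrated with a small sketch that merely records
the order of operations; every name here is invented for illustration, and
the logging stands in for the real priority-setting and wakeup calls:

```c
/* Records the order in which a caller acts on the selected thread.
 * All names are invented; set_prio() and wake_up_thread() stand in
 * for the driver's priority-setting and wakeup calls.
 */
enum { OP_SET_PRIO = 1, OP_WAKE_UP = 2 };

static int log_ops[4];
static int log_n;

static void set_prio(int tid)       { log_ops[log_n++] = OP_SET_PRIO; (void)tid; }
static void wake_up_thread(int tid) { log_ops[log_n++] = OP_WAKE_UP;  (void)tid; }

static void send_transaction(int target_tid)
{
	set_prio(target_tid);       /* raise target to the transaction priority */
	wake_up_thread(target_tid); /* only then make it runnable */
}
```

Doing these two steps in the opposite order leaves a window where the
woken thread runs at its old priority before it can raise itself, which
is the race this patch closes.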

Signed-off-by: Martijn Coenen <maco@android.com>
---
 drivers/android/binder.c | 74 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 53 insertions(+), 21 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index 5d51a17dba12..4c42e268b1a5 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -610,6 +610,7 @@ enum {
  * @is_dead:              thread is dead and awaiting free
  *                        when outstanding transactions are cleaned up
  *                        (protected by @proc->inner_lock)
+ * @task:                 struct task_struct for this thread
  *
  * Bookkeeping structure for binder threads.
  */
@@ -628,6 +629,7 @@ struct binder_thread {
 	struct binder_stats stats;
 	atomic_t tmp_ref;
 	bool is_dead;
+	struct task_struct *task;
 };
 
 struct binder_transaction {
@@ -646,6 +648,7 @@ struct binder_transaction {
 	unsigned int	flags;
 	struct binder_priority	priority;
 	struct binder_priority	saved_priority;
+	bool    set_priority_called;
 	kuid_t	sender_euid;
 	/**
 	 * @lock:  protects @from, @to_proc, and @to_thread
@@ -1209,6 +1212,38 @@ static void binder_set_priority(struct task_struct *task,
 		set_user_nice(task, priority);
 }
 
+static void binder_transaction_priority(struct task_struct *task,
+					struct binder_transaction *t,
+					struct binder_priority node_prio)
+{
+	struct binder_priority desired_prio;
+
+	if (t->set_priority_called)
+		return;
+
+	t->set_priority_called = true;
+	t->saved_priority.sched_policy = task->policy;
+	t->saved_priority.prio = task->normal_prio;
+
+	desired_prio.prio = t->priority.prio;
+	desired_prio.sched_policy = t->priority.sched_policy;
+
+	if (node_prio.prio < t->priority.prio ||
+	    (node_prio.prio == t->priority.prio &&
+	     node_prio.sched_policy == SCHED_FIFO)) {
+		/*
+		 * In case the minimum priority on the node is
+		 * higher (lower value), use that priority. If
+		 * the priority is the same, but the node uses
+		 * SCHED_FIFO, prefer SCHED_FIFO, since it can
+		 * run unbounded, unlike SCHED_RR.
+		 */
+		desired_prio = node_prio;
+	}
+
+	binder_set_priority(task, desired_prio);
+}
+
 static struct binder_node *binder_get_node_ilocked(struct binder_proc *proc,
 						   binder_uintptr_t ptr)
 {
@@ -2682,11 +2717,15 @@ static bool binder_proc_transaction(struct binder_transaction *t,
 {
 	struct list_head *target_list = NULL;
 	struct binder_node *node = t->buffer->target_node;
+	struct binder_priority node_prio;
 	bool oneway = !!(t->flags & TF_ONE_WAY);
 	bool wakeup = true;
 
 	BUG_ON(!node);
 	binder_node_lock(node);
+	node_prio.prio = node->min_priority;
+	node_prio.sched_policy = node->sched_policy;
+
 	if (oneway) {
 		BUG_ON(thread);
 		if (node->has_async_transaction) {
@@ -2708,12 +2747,14 @@ static bool binder_proc_transaction(struct binder_transaction *t,
 	if (!thread && !target_list)
 		thread = binder_select_thread_ilocked(proc);
 
-	if (thread)
+	if (thread) {
 		target_list = &thread->todo;
-	else if (!target_list)
+		binder_transaction_priority(thread->task, t, node_prio);
+	} else if (!target_list) {
 		target_list = &proc->todo;
-	else
+	} else {
 		BUG_ON(target_list != &node->async_todo);
+	}
 
 	binder_enqueue_work_ilocked(&t->work, target_list);
 
@@ -2832,7 +2873,6 @@ static void binder_transaction(struct binder_proc *proc,
 		}
 		thread->transaction_stack = in_reply_to->to_parent;
 		binder_inner_proc_unlock(proc);
-		binder_set_priority(current, in_reply_to->saved_priority);
 		target_thread = binder_get_txn_from_and_acq_inner(in_reply_to);
 		if (target_thread == NULL) {
 			return_error = BR_DEAD_REPLY;
@@ -3234,6 +3274,7 @@ static void binder_transaction(struct binder_proc *proc,
 		binder_enqueue_work_ilocked(&t->work, &target_thread->todo);
 		binder_inner_proc_unlock(target_proc);
 		wake_up_interruptible_sync(&target_thread->wait);
+		binder_set_priority(current, in_reply_to->saved_priority);
 		binder_free_transaction(in_reply_to);
 	} else if (!(t->flags & TF_ONE_WAY)) {
 		BUG_ON(t->buffer->async_transaction != 0);
@@ -3328,6 +3369,7 @@ static void binder_transaction(struct binder_proc *proc,
 
 	BUG_ON(thread->return_error.cmd != BR_OK);
 	if (in_reply_to) {
+		binder_set_priority(current, in_reply_to->saved_priority);
 		thread->return_error.cmd = BR_TRANSACTION_COMPLETE;
 		binder_enqueue_work(thread->proc,
 				    &thread->return_error.work,
@@ -4125,26 +4167,13 @@ static int binder_thread_read(struct binder_proc *proc,
 		BUG_ON(t->buffer == NULL);
 		if (t->buffer->target_node) {
 			struct binder_node *target_node = t->buffer->target_node;
-			struct binder_priority prio = t->priority;
+			struct binder_priority node_prio;
 
 			tr.target.ptr = target_node->ptr;
 			tr.cookie =  target_node->cookie;
-			t->saved_priority.sched_policy = current->policy;
-			t->saved_priority.prio = current->normal_prio;
-			if (target_node->min_priority < t->priority.prio ||
-			    (target_node->min_priority == t->priority.prio &&
-			     target_node->sched_policy == SCHED_FIFO)) {
-				/*
-				 * In case the minimum priority on the node is
-				 * higher (lower value), use that priority. If
-				 * the priority is the same, but the node uses
-				 * SCHED_FIFO, prefer SCHED_FIFO, since it can
-				 * run unbounded, unlike SCHED_RR.
-				 */
-				prio.sched_policy = target_node->sched_policy;
-				prio.prio = target_node->min_priority;
-			}
-			binder_set_priority(current, prio);
+			node_prio.sched_policy = target_node->sched_policy;
+			node_prio.prio = target_node->min_priority;
+			binder_transaction_priority(current, t, node_prio);
 			cmd = BR_TRANSACTION;
 		} else {
 			tr.target.ptr = 0;
@@ -4320,6 +4349,8 @@ static struct binder_thread *binder_get_thread_ilocked(
 	binder_stats_created(BINDER_STAT_THREAD);
 	thread->proc = proc;
 	thread->pid = current->pid;
+	get_task_struct(current);
+	thread->task = current;
 	atomic_set(&thread->tmp_ref, 0);
 	init_waitqueue_head(&thread->wait);
 	INIT_LIST_HEAD(&thread->todo);
@@ -4370,6 +4401,7 @@ static void binder_free_thread(struct binder_thread *thread)
 	BUG_ON(!list_empty(&thread->todo));
 	binder_stats_deleted(BINDER_STAT_THREAD);
 	binder_proc_dec_tmpref(thread->proc);
+	put_task_struct(thread->task);
 	kfree(thread);
 }
 
-- 
2.15.0.rc2.357.g7e34df9404-goog


* [PATCH v3 4/6] ANDROID: binder: add RT inheritance flag to node.
  2017-10-26 14:07 [PATCH v3 0/6] ANDROID: binder: RT priority inheritance Martijn Coenen
                   ` (2 preceding siblings ...)
  2017-10-26 14:07 ` [PATCH v3 3/6] ANDROID: binder: improve priority inheritance Martijn Coenen
@ 2017-10-26 14:07 ` Martijn Coenen
  2017-11-15 13:05   ` Peter Zijlstra
  2017-10-26 14:07 ` [PATCH v3 5/6] ANDROID: binder: don't check prio permissions on restore Martijn Coenen
  2017-10-26 14:07 ` [PATCH v3 6/6] ANDROID: binder: Add tracing for binder priority inheritance Martijn Coenen
  5 siblings, 1 reply; 20+ messages in thread
From: Martijn Coenen @ 2017-10-26 14:07 UTC (permalink / raw)
  To: gregkh, john.stultz, tkjos, arve, sherryy, tglx, peterz, amit.pundir
  Cc: linux-kernel, devel, maco, Martijn Coenen

Allows a binder node to specify whether it wants to
inherit real-time scheduling policy from a caller. This
inheritance may not always be desirable, for example in
cases where the binder call runs untrusted and therefore
potentially unbounded code.

Signed-off-by: Martijn Coenen <maco@android.com>
---
 drivers/android/binder.c            | 21 +++++++++++++++------
 include/uapi/linux/android/binder.h |  8 ++++++++
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index 4c42e268b1a5..5958a0876fe8 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -358,6 +358,8 @@ struct binder_error {
  *                        (invariant after initialized)
  * @min_priority:         minimum scheduling priority
  *                        (invariant after initialized)
+ * @inherit_rt:           inherit RT scheduling policy from caller
+ *                        (invariant after initialized)
  * @async_todo:           list of async work items
  *                        (protected by @proc->inner_lock)
  *
@@ -394,6 +396,7 @@ struct binder_node {
 		 * invariant after initialization
 		 */
 		u8 sched_policy:2;
+		u8 inherit_rt:1;
 		u8 accept_fds:1;
 		u8 min_priority;
 	};
@@ -1214,9 +1217,10 @@ static void binder_set_priority(struct task_struct *task,
 
 static void binder_transaction_priority(struct task_struct *task,
 					struct binder_transaction *t,
-					struct binder_priority node_prio)
+					struct binder_priority node_prio,
+					bool inherit_rt)
 {
-	struct binder_priority desired_prio;
+	struct binder_priority desired_prio = t->priority;
 
 	if (t->set_priority_called)
 		return;
@@ -1225,8 +1229,10 @@ static void binder_transaction_priority(struct task_struct *task,
 	t->saved_priority.sched_policy = task->policy;
 	t->saved_priority.prio = task->normal_prio;
 
-	desired_prio.prio = t->priority.prio;
-	desired_prio.sched_policy = t->priority.sched_policy;
+	if (!inherit_rt && is_rt_policy(desired_prio.sched_policy)) {
+		desired_prio.prio = NICE_TO_PRIO(0);
+		desired_prio.sched_policy = SCHED_NORMAL;
+	}
 
 	if (node_prio.prio < t->priority.prio ||
 	    (node_prio.prio == t->priority.prio &&
@@ -1332,6 +1338,7 @@ static struct binder_node *binder_init_node_ilocked(
 		FLAT_BINDER_FLAG_SCHED_POLICY_SHIFT;
 	node->min_priority = to_kernel_prio(node->sched_policy, priority);
 	node->accept_fds = !!(flags & FLAT_BINDER_FLAG_ACCEPTS_FDS);
+	node->inherit_rt = !!(flags & FLAT_BINDER_FLAG_INHERIT_RT);
 	spin_lock_init(&node->lock);
 	INIT_LIST_HEAD(&node->work.entry);
 	INIT_LIST_HEAD(&node->async_todo);
@@ -2749,7 +2756,8 @@ static bool binder_proc_transaction(struct binder_transaction *t,
 
 	if (thread) {
 		target_list = &thread->todo;
-		binder_transaction_priority(thread->task, t, node_prio);
+		binder_transaction_priority(thread->task, t, node_prio,
+					    node->inherit_rt);
 	} else if (!target_list) {
 		target_list = &proc->todo;
 	} else {
@@ -4173,7 +4181,8 @@ static int binder_thread_read(struct binder_proc *proc,
 			tr.cookie =  target_node->cookie;
 			node_prio.sched_policy = target_node->sched_policy;
 			node_prio.prio = target_node->min_priority;
-			binder_transaction_priority(current, t, node_prio);
+			binder_transaction_priority(current, t, node_prio,
+						    target_node->inherit_rt);
 			cmd = BR_TRANSACTION;
 		} else {
 			tr.target.ptr = 0;
diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
index b3bced80adea..5539933b3491 100644
--- a/include/uapi/linux/android/binder.h
+++ b/include/uapi/linux/android/binder.h
@@ -79,6 +79,14 @@ enum flat_binder_object_flags {
 	 */
 	FLAT_BINDER_FLAG_SCHED_POLICY_MASK =
 		3U << FLAT_BINDER_FLAG_SCHED_POLICY_SHIFT,
+
+	/**
+	 * @FLAT_BINDER_FLAG_INHERIT_RT: whether the node inherits RT policy
+	 *
+	 * Only when set, calls into this node will inherit a real-time
+	 * scheduling policy from the caller (for synchronous transactions).
+	 */
+	FLAT_BINDER_FLAG_INHERIT_RT = 0x800,
 };
 
 #ifdef BINDER_IPC_32BIT
-- 
2.15.0.rc2.357.g7e34df9404-goog


* [PATCH v3 5/6] ANDROID: binder: don't check prio permissions on restore.
  2017-10-26 14:07 [PATCH v3 0/6] ANDROID: binder: RT priority inheritance Martijn Coenen
                   ` (3 preceding siblings ...)
  2017-10-26 14:07 ` [PATCH v3 4/6] ANDROID: binder: add RT inheritance flag to node Martijn Coenen
@ 2017-10-26 14:07 ` Martijn Coenen
  2017-11-15 13:08   ` Peter Zijlstra
  2017-10-26 14:07 ` [PATCH v3 6/6] ANDROID: binder: Add tracing for binder priority inheritance Martijn Coenen
  5 siblings, 1 reply; 20+ messages in thread
From: Martijn Coenen @ 2017-10-26 14:07 UTC (permalink / raw)
  To: gregkh, john.stultz, tkjos, arve, sherryy, tglx, peterz, amit.pundir
  Cc: linux-kernel, devel, maco, Martijn Coenen

Because we have disabled RT priority inheritance for
the regular binder domain, the following can happen:

1) thread A (prio 98) calls into thread B
2) because RT prio inheritance is disabled, thread B
   runs at the lowest nice (prio 100) instead
3) thread B calls back into A; A will run at prio 100
   for the duration of the transaction
4) when thread A is done with the call from B, we try
   to restore the prio back to 98; this fails because
   the process neither holds CAP_SYS_NICE nor has
   RLIMIT_RTPRIO set.

While the proper fix going forward will be to
correctly apply CAP_SYS_NICE or RLIMIT_RTPRIO,
for now it seems reasonable not to check permissions
on the restore path.

Signed-off-by: Martijn Coenen <maco@android.com>
---
 drivers/android/binder.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index 5958a0876fe8..af1c3bb3f53e 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -1109,9 +1109,10 @@ static int to_kernel_prio(int policy, int user_priority)
 }
 
 /**
- * binder_set_priority() - sets the scheduler priority of a task
+ * binder_do_set_priority() - sets the scheduler priority of a task
  * @task:	task to set priority on
  * @desired:	desired priority to run at
+ * @verify:	verify whether @task is allowed to run at @desired prio
  *
  * The scheduler policy of tasks is changed explicitly, because we want to
  * support a few distinct features:
@@ -1145,8 +1146,9 @@ static int to_kernel_prio(int policy, int user_priority)
  *    temporarily runs at its original priority.
  * 5) rt_mutex does not currently support PI for CFS tasks.
  */
-static void binder_set_priority(struct task_struct *task,
-				struct binder_priority desired)
+static void binder_do_set_priority(struct task_struct *task,
+				   struct binder_priority desired,
+				   bool verify)
 {
 	int priority; /* user-space prio value */
 	bool has_cap_nice;
@@ -1170,7 +1172,7 @@ static void binder_set_priority(struct task_struct *task,
 
 	priority = to_userspace_prio(policy, desired.prio);
 
-	if (is_rt_policy(policy) && !has_cap_nice) {
+	if (verify && is_rt_policy(policy) && !has_cap_nice) {
 		long max_rtprio = task_rlimit(task, RLIMIT_RTPRIO);
 
 		if (max_rtprio == 0) {
@@ -1182,7 +1184,7 @@ static void binder_set_priority(struct task_struct *task,
 		}
 	}
 
-	if (is_fair_policy(policy) && !has_cap_nice) {
+	if (verify && is_fair_policy(policy) && !has_cap_nice) {
 		long min_nice = rlimit_to_nice(task_rlimit(task, RLIMIT_NICE));
 
 		if (min_nice > MAX_NICE) {
@@ -1215,6 +1217,18 @@ static void binder_set_priority(struct task_struct *task,
 		set_user_nice(task, priority);
 }
 
+static void binder_set_priority(struct task_struct *task,
+				struct binder_priority desired)
+{
+	binder_do_set_priority(task, desired, /* verify = */ true);
+}
+
+static void binder_restore_priority(struct task_struct *task,
+				    struct binder_priority desired)
+{
+	binder_do_set_priority(task, desired, /* verify = */ false);
+}
+
 static void binder_transaction_priority(struct task_struct *task,
 					struct binder_transaction *t,
 					struct binder_priority node_prio,
@@ -3282,7 +3296,7 @@ static void binder_transaction(struct binder_proc *proc,
 		binder_enqueue_work_ilocked(&t->work, &target_thread->todo);
 		binder_inner_proc_unlock(target_proc);
 		wake_up_interruptible_sync(&target_thread->wait);
-		binder_set_priority(current, in_reply_to->saved_priority);
+		binder_restore_priority(current, in_reply_to->saved_priority);
 		binder_free_transaction(in_reply_to);
 	} else if (!(t->flags & TF_ONE_WAY)) {
 		BUG_ON(t->buffer->async_transaction != 0);
@@ -3377,7 +3391,7 @@ static void binder_transaction(struct binder_proc *proc,
 
 	BUG_ON(thread->return_error.cmd != BR_OK);
 	if (in_reply_to) {
-		binder_set_priority(current, in_reply_to->saved_priority);
+		binder_restore_priority(current, in_reply_to->saved_priority);
 		thread->return_error.cmd = BR_TRANSACTION_COMPLETE;
 		binder_enqueue_work(thread->proc,
 				    &thread->return_error.work,
@@ -3963,7 +3977,7 @@ static int binder_thread_read(struct binder_proc *proc,
 			wait_event_interruptible(binder_user_error_wait,
 						 binder_stop_on_user_error < 2);
 		}
-		binder_set_priority(current, proc->default_priority);
+		binder_restore_priority(current, proc->default_priority);
 	}
 
 	if (non_block) {
-- 
2.15.0.rc2.357.g7e34df9404-goog


* [PATCH v3 6/6] ANDROID: binder: Add tracing for binder priority inheritance.
  2017-10-26 14:07 [PATCH v3 0/6] ANDROID: binder: RT priority inheritance Martijn Coenen
                   ` (4 preceding siblings ...)
  2017-10-26 14:07 ` [PATCH v3 5/6] ANDROID: binder: don't check prio permissions on restore Martijn Coenen
@ 2017-10-26 14:07 ` Martijn Coenen
  5 siblings, 0 replies; 20+ messages in thread
From: Martijn Coenen @ 2017-10-26 14:07 UTC (permalink / raw)
  To: gregkh, john.stultz, tkjos, arve, sherryy, tglx, peterz, amit.pundir
  Cc: linux-kernel, devel, maco, Martijn Coenen

This makes it easy to trace and visualize priority
inheritance in the binder driver.

Signed-off-by: Martijn Coenen <maco@android.com>
---
 drivers/android/binder.c       |  4 ++++
 drivers/android/binder_trace.h | 24 ++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index af1c3bb3f53e..9bdffa7f9529 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -1203,6 +1203,10 @@ static void binder_do_set_priority(struct task_struct *task,
 			      task->pid, desired.prio,
 			      to_kernel_prio(policy, priority));
 
+	trace_binder_set_priority(task->tgid, task->pid, task->normal_prio,
+				  to_kernel_prio(policy, priority),
+				  desired.prio);
+
 	/* Set the actual priority */
 	if (task->policy != policy || is_rt_policy(policy)) {
 		struct sched_param params;
diff --git a/drivers/android/binder_trace.h b/drivers/android/binder_trace.h
index 76e3b9c8a8a2..b11dffc521e8 100644
--- a/drivers/android/binder_trace.h
+++ b/drivers/android/binder_trace.h
@@ -85,6 +85,30 @@ DEFINE_BINDER_FUNCTION_RETURN_EVENT(binder_ioctl_done);
 DEFINE_BINDER_FUNCTION_RETURN_EVENT(binder_write_done);
 DEFINE_BINDER_FUNCTION_RETURN_EVENT(binder_read_done);
 
+TRACE_EVENT(binder_set_priority,
+	TP_PROTO(int proc, int thread, unsigned int old_prio,
+		 unsigned int new_prio, unsigned int desired_prio),
+	TP_ARGS(proc, thread, old_prio, new_prio, desired_prio),
+
+	TP_STRUCT__entry(
+		__field(int, proc)
+		__field(int, thread)
+		__field(unsigned int, old_prio)
+		__field(unsigned int, new_prio)
+		__field(unsigned int, desired_prio)
+	),
+	TP_fast_assign(
+		__entry->proc = proc;
+		__entry->thread = thread;
+		__entry->old_prio = old_prio;
+		__entry->new_prio = new_prio;
+		__entry->desired_prio = desired_prio;
+	),
+	TP_printk("proc=%d thread=%d old=%d => new=%d desired=%d",
+		  __entry->proc, __entry->thread, __entry->old_prio,
+		  __entry->new_prio, __entry->desired_prio)
+);
+
 TRACE_EVENT(binder_wait_for_work,
 	TP_PROTO(bool proc_work, bool transaction_stack, bool thread_todo),
 	TP_ARGS(proc_work, transaction_stack, thread_todo),
-- 
2.15.0.rc2.357.g7e34df9404-goog


* Re: [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance.
  2017-10-26 14:07 ` [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance Martijn Coenen
@ 2017-11-15 13:01   ` Peter Zijlstra
  2017-11-16  9:18     ` Martijn Coenen
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2017-11-15 13:01 UTC (permalink / raw)
  To: Martijn Coenen
  Cc: gregkh, john.stultz, tkjos, arve, sherryy, tglx, amit.pundir,
	linux-kernel, devel, maco

On Thu, Oct 26, 2017 at 04:07:45PM +0200, Martijn Coenen wrote:
> +/**
> + * binder_set_priority() - sets the scheduler priority of a task
> + * @task:	task to set priority on
> + * @desired:	desired priority to run at
> + *
> + * The scheduler policy of tasks is changed explicitly, because we want to
> + * support a few distinct features:
> + * 1) If the requested priority is higher than the maximum allowed priority,
> + *    we want to "clip" at the highest supported priority.
> + * 2) For a future patch, we need to allow changing the priority of a task
> + *    with a different UID; when we make a binder transaction from process A
> + *    to process B with different UIDs, A must be able to set B's priority
> + *    before B wakes up to handle the call. If B were to raise its own priority
> + *    after waking up, a race condition exists where B gets preempted before
> + *    it can raise its own priority.
> + *
> + * Feature 2) sounds like something a rt_mutex would solve, for example by
> + * having the caller proxy lock an rt_mutex on behalf of the callee, and then
> + * sleeping on it. But we have a few requirements that don't work with this
> + * approach:
> + * 1) binder supports a "minimum node priority", meaning that all transactions
> + *    into a node must run at this priority at a minimum. This means that the
> + *    desired priority for handling a transaction is not necessarily equal to
> + *    the priority of the caller.

Isn't that the same as running everything in that node (whatever that
is) at that given prio?

> + * 2) binder supports asynchronous transactions, where the caller is not blocked
> + *    on transaction completion; so, it also can't be blocked on an rt_mutex.

This is not a theoretically sound thing to do... For a system to be
real-time it not only needs to guarantee forward progress but also have
bounded runtime.

Elevating another task's priority to your own and then not blocking
effectively injects time. At the very least this time should be taken
into consideration when doing runtime analysis of the system.

Also, since its async, why do we care? The original task is still making
progress. Only at the point where we start to wait for completion does
it make sense to boost. Otherwise you end up having to charge one task
double time.

> + * 3) similarly, there may not necessarily be a thread waiting for
> + *    transactions at the time the call is made, so we don't know who to proxy-
> + *    lock the lock for.

Non-issue; you have the exact same problem now. Your interface takes a
@task argument, so you have to have a @task either way around.

> + * 4) binder supports nested transactions, where A can call into B, and B can
> + *    call back into A before returning a reply to the original transaction.
> + *    This means that if A is blocked on an rt_mutex B holds, B must first wake
> + *    up A to handle a new transaction, and only then can it proxy-lock and try
> + *    to acquire the new rt_mutex. This leaves a race condition where B
> + *    temporarily runs at its original priority.

That doesn't sound like something a well formed (RT) program should do
in the first place.

You could play horrible games with lock ownership, but YUCK!!

> + * 5) rt_mutex does not currently support PI for CFS tasks.

Neither do you.. just inheriting the nice value is not correct for a WFQ
style scheduler.

> + */


What you're proposing is a bunch of make-work-ish hacks on a system that
isn't designed for RT.


* Re: [PATCH v3 2/6] ANDROID: binder: add min sched_policy to node.
  2017-10-26 14:07 ` [PATCH v3 2/6] ANDROID: binder: add min sched_policy to node Martijn Coenen
@ 2017-11-15 13:02   ` Peter Zijlstra
  2017-11-16  9:27     ` Martijn Coenen
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2017-11-15 13:02 UTC (permalink / raw)
  To: Martijn Coenen
  Cc: gregkh, john.stultz, tkjos, arve, sherryy, tglx, amit.pundir,
	linux-kernel, devel, maco

On Thu, Oct 26, 2017 at 04:07:46PM +0200, Martijn Coenen wrote:
> This change adds flags to flat_binder_object.flags
> to allow indicating a minimum scheduling policy for
> the node. It also clarifies the valid value range
> for the priority bits in the flags.
> 
> Internally, we use the priority map that the kernel
> uses, e.g. [0..99] for real-time policies and [100..139]
> for the SCHED_NORMAL/SCHED_BATCH policies.

I will break that without consideration if I have to. That really isn't
something anybody outside of the core code should rely on.


* Re: [PATCH v3 3/6] ANDROID: binder: improve priority inheritance.
  2017-10-26 14:07 ` [PATCH v3 3/6] ANDROID: binder: improve priority inheritance Martijn Coenen
@ 2017-11-15 13:03   ` Peter Zijlstra
  2017-11-16  9:31     ` Martijn Coenen
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2017-11-15 13:03 UTC (permalink / raw)
  To: Martijn Coenen
  Cc: gregkh, john.stultz, tkjos, arve, sherryy, tglx, amit.pundir,
	linux-kernel, devel, maco

On Thu, Oct 26, 2017 at 04:07:47PM +0200, Martijn Coenen wrote:
> By raising the priority of a thread selected for
> a transaction *before* we wake it up.
> 
> Delay restoring the priority when doing a reply
> until after we wake-up the process receiving
> the reply.

What happens if a thread dies?


* Re: [PATCH v3 4/6] ANDROID: binder: add RT inheritance flag to node.
  2017-10-26 14:07 ` [PATCH v3 4/6] ANDROID: binder: add RT inheritance flag to node Martijn Coenen
@ 2017-11-15 13:05   ` Peter Zijlstra
  2017-11-16  9:37     ` Martijn Coenen
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2017-11-15 13:05 UTC (permalink / raw)
  To: Martijn Coenen
  Cc: gregkh, john.stultz, tkjos, arve, sherryy, tglx, amit.pundir,
	linux-kernel, devel, maco

On Thu, Oct 26, 2017 at 04:07:48PM +0200, Martijn Coenen wrote:
> Allows a binder node to specify whether it wants to
> inherit real-time scheduling policy from a caller. This
> inheritance may not always be desirable, for example in
> cases where the binder call runs untrusted and therefore
> potentially unbounded code.

Isn't that backwards? That is, binder should enforce not inheriting
across a trust boundary, not let a node opt out voluntary.


* Re: [PATCH v3 5/6] ANDROID: binder: don't check prio permissions on restore.
  2017-10-26 14:07 ` [PATCH v3 5/6] ANDROID: binder: don't check prio permissions on restore Martijn Coenen
@ 2017-11-15 13:08   ` Peter Zijlstra
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2017-11-15 13:08 UTC (permalink / raw)
  To: Martijn Coenen
  Cc: gregkh, john.stultz, tkjos, arve, sherryy, tglx, amit.pundir,
	linux-kernel, devel, maco

On Thu, Oct 26, 2017 at 04:07:49PM +0200, Martijn Coenen wrote:
> Because we have disabled RT priority inheritance for
> the regular binder domain, the following can happen:
> 
> 1) thread A (prio 98) calls into thread B
> 2) because RT prio inheritance is disabled, thread B
>    runs at the lowest nice (prio 100) instead
> 3) thread B calls back into A; A will run at prio 100
>    for the duration of the transaction

That sounds wrong.. Why would you ever downgrade A?


* Re: [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance.
  2017-11-15 13:01   ` Peter Zijlstra
@ 2017-11-16  9:18     ` Martijn Coenen
  2017-11-16 11:27       ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Martijn Coenen @ 2017-11-16  9:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Greg KH, John Stultz, Todd Kjos, Arve Hjønnevåg,
	Sherry Yang, Thomas Gleixner, Amit Pundir, LKML, devel,
	Martijn Coenen

Thanks Peter for looking at this, more inline.

On Wed, Nov 15, 2017 at 2:01 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> + * 1) binder supports a "minimum node priority", meaning that all transactions
>> + *    into a node must run at this priority at a minimum. This means that the
>> + *    desired priority for handling a transaction is not necessarily equal to
>> + *    the priority of the caller.
>
> Isn't that the same as running everything in that node (whatever that
> is) at that given prio?

Not for synchronous transactions; in that case it's max(min_node_prio,
caller_prio). One reason min_node_prio exists is that for certain
critical nodes, we don't want calls into it to run at a very low
priority, because the implementation of said calls may take locks that
are contended. I realize one possible solution to that is to use PI
mutexes in userspace, but Android doesn't support that today.
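
To make the sync rule concrete, here is a minimal standalone sketch
(illustrative C with hypothetical names, not the driver's actual
code). Kernel priority values are inverted, a lower number means a
higher priority, so taking the "max" of two priorities means taking
the numerically smaller value:

```c
/*
 * Illustrative sketch of the rule above, not binder's actual code:
 * a synchronous transaction runs at max(min_node_prio, caller_prio).
 * Kernel prio values are inverted (lower number = higher priority),
 * so the "max" priority is the numerically smaller value.
 */
int sync_transaction_prio(int min_node_prio, int caller_prio)
{
	return caller_prio < min_node_prio ? caller_prio : min_node_prio;
}
```

For example, an RT caller at prio 98 calling into a node with minimum
prio 120 keeps running at 98, while a nice-0 caller at prio 120
calling into a node with minimum prio 110 is handled at 110.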

>
>> + * 2) binder supports asynchronous transactions, where the caller is not blocked
>> + *    on transaction completion; so, it also can't be blocked on an rt_mutex.
>
> This is not a theoretically sound thing to do... For a system to be
> real-time it not only needs to guarantee forward progress but also have
> bounded runtime.
>
> Elevating another task's priority to your own and then not blocking
> effectively injects time. At the very least this time should be taken
> into consideration when doing runtime analysis of the system.

In case of asynchronous transactions, we never inherit the priority of
the caller. We run at max(default_prio, min_node_prio). For the
default node configuration, this means that we don't change priority
at all for async transactions (they all run at nice 0). But for nodes
that have a minimum priority set, we want calls into the node to run
at that priority instead.
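
Under the same inverted-value convention, the async rule can be
sketched as a hypothetical helper (again illustrative, not the
driver's code):

```c
/*
 * Hypothetical helper illustrating the async rule described above:
 * the handler runs at max(default_prio, min_node_prio); the caller's
 * priority is never inherited.  Lower value = higher priority, so
 * the "max" priority is the numerically smaller value.
 */
int async_transaction_prio(int default_prio, int min_node_prio)
{
	return default_prio < min_node_prio ? default_prio : min_node_prio;
}
```

So with a default prio of 120 (nice 0) and no meaningful node minimum
nothing changes, but a node with an RT minimum of, say, prio 90 gets
its handler boosted to 90 before it wakes up.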

>
> Also, since its async, why do we care? The original task is still making
> progress. Only at the point where we start to wait for completion does
> it make sense to boost. Otherwise you end up having to charge one task
> double time.

Yeah, so this is actually not about the caller; it's about certain
critical nodes that need to process events at a high priority. For
example, consider a node that receives sensor events using async
transactions. It's imperative that the threads handling these events
run at real-time priority from the moment they wake up. Binder's
thread model has a generic threadpool for the entire process; threads
aren't dedicated to specific nodes. So to avoid such a thread being
preempted, we need to raise its priority even before waking it up, and
the only place to do it is from the caller.

>
>> + * 3) similarly, there may not necessarily be a thread waiting for
>> + *    transactions at the time the call is made, so we don't know who to proxy-
>> + *    lock the lock for.
>
> Non-issue; you have the exact same problem now. Your interface takes a
> @task argument, so you have to have a @task either way around.

But in my current implementation once a thread does call in for more
work and finds the work pending, it will raise its own priority. I
don't see how this is possible with rt_mutex, because the thread is
not yet available at the moment the caller goes to sleep.

>
>> + * 4) binder supports nested transactions, where A can call into B, and B can
>> + *    call back into A before returning a reply to the original transaction.
>> + *    This means that if A is blocked on an rt_mutex B holds, B must first wake
>> + *    up A to handle a new transaction, and only then can it proxy-lock and try
>> + *    to acquire the new rt_mutex. This leaves a race condition where B
>> + *    temporarily runs at its original priority.
>
> That doesn't sound like something a well formed (RT) program should do
> in the first place.

What do you mean specifically - nested transactions?

>
> You could play horrible games with lock ownership, but YUCK!!
>
>> + * 5) rt_mutex does not currently support PI for CFS tasks.
>
> Neither do you.. just inheriting the nice value is not correct for a WFQ
> style scheduler.

I don't think we're trying to make this fit perfectly with WFQ - we're
just trying to improve the latency on Android devices where it counts,
by telling the scheduler which threads are important. Without these
patches, we would often see binder threads receiving sensor events get
preempted in the kernel, and the delay would cause issues with things
like video stabilization, etc.

> What you're proposing is a bunch of make-work-ish hacks on a system that
> isn't designed for RT.

What parts of the system aren't designed for RT? I'm trying to
understand if it's something in the way binder currently works - eg
the use of asynchronous transactions and generic threadpools - or
something in this patchset in particular. How should we deal with
high-priority sensor events that go over binder IPC in your opinion?


* Re: [PATCH v3 2/6] ANDROID: binder: add min sched_policy to node.
  2017-11-15 13:02   ` Peter Zijlstra
@ 2017-11-16  9:27     ` Martijn Coenen
  0 siblings, 0 replies; 20+ messages in thread
From: Martijn Coenen @ 2017-11-16  9:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Greg KH, John Stultz, Todd Kjos, Arve Hjønnevåg,
	Sherry Yang, Thomas Gleixner, Amit Pundir, LKML, devel,
	Martijn Coenen

On Wed, Nov 15, 2017 at 2:02 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> Internally, we use the priority map that the kernel
>> uses, e.g. [0..99] for real-time policies and [100..139]
>> for the SCHED_NORMAL/SCHED_BATCH policies.
>
> I will break that without consideration if I have to. That really isn't
> something anybody outside of the core code should rely on.

I mostly used these values because they make debugging easier, since
the same range is used in other places in the kernel (and in trace
events). All
priority calculations use things that are in sched headers, like
NICE_TO_PRIO, PRIO_TO_NICE, and MAX_USER_RT_PRIO. So things wouldn't
necessarily break if you just changed the value ranges. If you
inverted the range, that would be a problem...
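
For reference, the mapping under discussion, reconstructed from the
formulas in the kernel's sched headers (constants reproduced here
only so the example is self-contained):

```c
/* Mirrors the macros in include/linux/sched/prio.h */
#define MAX_RT_PRIO	100			/* RT prios occupy [0..99] */
#define NICE_WIDTH	40			/* nice values span -20..19 */
#define DEFAULT_PRIO	(MAX_RT_PRIO + NICE_WIDTH / 2)	/* 120 */
#define NICE_TO_PRIO(nice)	((nice) + DEFAULT_PRIO)
#define PRIO_TO_NICE(prio)	((prio) - DEFAULT_PRIO)
```

With these, nice -20 maps to kernel prio 100, nice 0 to 120, and nice
19 to 139, giving the [100..139] range for SCHED_NORMAL/SCHED_BATCH
quoted above.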


* Re: [PATCH v3 3/6] ANDROID: binder: improve priority inheritance.
  2017-11-15 13:03   ` Peter Zijlstra
@ 2017-11-16  9:31     ` Martijn Coenen
  0 siblings, 0 replies; 20+ messages in thread
From: Martijn Coenen @ 2017-11-16  9:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Greg KH, John Stultz, Todd Kjos, Arve Hjønnevåg,
	Sherry Yang, Thomas Gleixner, Amit Pundir, LKML, devel,
	Martijn Coenen

On Wed, Nov 15, 2017 at 2:03 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Oct 26, 2017 at 04:07:47PM +0200, Martijn Coenen wrote:
>> By raising the priority of a thread selected for
>> a transaction *before* we wake it up.
>>
>> Delay restoring the priority when doing a reply
>> until after we wake-up the process receiving
>> the reply.
>
> What happens if a thread dies?

Binder threads should never exit, but if they do, the whole process
exits. Anyway in that case we still restore the priority of this
thread - it's in the error path in the "@@ -3328,6 +3369,7 @@"
section.


* Re: [PATCH v3 4/6] ANDROID: binder: add RT inheritance flag to node.
  2017-11-15 13:05   ` Peter Zijlstra
@ 2017-11-16  9:37     ` Martijn Coenen
  0 siblings, 0 replies; 20+ messages in thread
From: Martijn Coenen @ 2017-11-16  9:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Greg KH, John Stultz, Todd Kjos, Arve Hjønnevåg,
	Sherry Yang, Thomas Gleixner, Amit Pundir, LKML, devel,
	Martijn Coenen

On Wed, Nov 15, 2017 at 2:05 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Oct 26, 2017 at 04:07:48PM +0200, Martijn Coenen wrote:
>> Allows a binder node to specify whether it wants to
>> inherit real-time scheduling policy from a caller. This
>> inheritance may not always be desirable, for example in
>> cases where the binder call runs untrusted and therefore
>> potentially unbounded code.
>
> Isn't that backwards? That is, binder should enforce not inheriting
> across a trust boundary, not let a node opt out voluntary.

That's a good point. In practice on Android, nodes are constructed
with the default policy of "don't inherit", and we have an internal
API (not available to untrusted code) to change it. That said,
somebody could talk directly to the kernel driver and present a node
with the flag set. I'll look into this. That said, I think we want
this change regardless of enforcing across the trust boundary; that
is, we don't want to inherit by default even if the callee is trusted.
In Android O, all of the framework runs with binder PI disabled; it's
mostly the HALs serving latency-critical data that need it.


* Re: [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance.
  2017-11-16  9:18     ` Martijn Coenen
@ 2017-11-16 11:27       ` Peter Zijlstra
  2017-11-16 13:03         ` Martijn Coenen
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2017-11-16 11:27 UTC (permalink / raw)
  To: Martijn Coenen
  Cc: Greg KH, John Stultz, Todd Kjos, Arve Hjønnevåg, Sherry Yang,
	Thomas Gleixner, Amit Pundir, LKML, devel, Martijn Coenen

On Thu, Nov 16, 2017 at 10:18:00AM +0100, Martijn Coenen wrote:
> Thanks Peter for looking at this, more inline.

So I think I'm having a very hard time understanding things because I've
no idea how android works. I've made a number of assumptions below;
please bear with me.

> On Wed, Nov 15, 2017 at 2:01 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> + * 1) binder supports a "minimum node priority", meaning that all transactions
> >> + *    into a node must run at this priority at a minimum. This means that the
> >> + *    desired priority for handling a transaction is not necessarily equal to
> >> + *    the priority of the caller.
> >
> > Isn't that the same as running everything in that node (whatever that
> > is) at that given prio?
> 
> Not for synchronous transactions; in that case it's max(min_node_prio,
> caller_prio). 

But that's exactly what you get!? How would it ever get below
min_node_prio? PI only ever (temporarily) raises prio, it never lowers
it.

> One reason min_node_prio exists is that for certain
> critical nodes, we don't want calls into it to run at a very low
> priority, because the implementation of said calls may take locks that
> are contended. I realize one possible solution to that is to use PI
> mutexes in userspace, but Android doesn't support that today.

Fix that?

> >> + * 2) binder supports asynchronous transactions, where the caller is not blocked
> >> + *    on transaction completion; so, it also can't be blocked on an rt_mutex.
> >
> > This is not a theoretically sound thing to do... For a system to be
> > real-time it not only needs to guarantee forward progress but also have
> > bounded runtime.
> >
> > Elevating another task's priority to your own and then not blocking
> > effectively injects time. At the very least this time should be taking
> > into consideration when doing runtime analysis of the system.
> 
> In case of asynchronous transactions, we never inherit the priority of
> the caller. We run at max(default_prio, min_node_prio). For the
> default node configuration, this means that we don't change priority
> at all for async transactions (they all run at nice 0). But for nodes
> that have a minimum priority set, we want calls into the node to run
> at that priority instead.

I'm confused -- again. Why not run those threads at the min prio to
begin with?

Also, this doesn't seem related to PI.

> > Also, since its async, why do we care? The original task is still making
> > progress. Only at the point where we start to wait for completion does
> > it make sense to boost. Otherwise you end up having to charge one task
> > double time.
> 
> Yeah, so this is actually not about the caller; it's about certain
> critical nodes that need to process events at a high priority. For
> example, consider a node that receives sensor events using async
> transactions.

I'm not understanding; are you saying that some app does an async
request for sensor data, then later this sensor triggers and we need to
route the data back to the app?

But above you said that async requests do not boost the priority; so
then I would expect this sensor's data to land on regular min-prio
thread for dispatch.

I'm unclear on how sensors/devices are hooked up. But I'm assuming you
have a thread that poll()s the device, and this thread is part of a
system service process and that service in turn is a node on the RPC
network.

I would expect the service to have set the thread(s) at the relevant
priority for the device(s) it handles (aka. min prio ?).

> It's imperative that the threads handling these events
> run at real-time priority from the moment they wake up.

Sure, see above, ensure the thread(s) that poll()s and handles the
relevant device(s) are of appropriate priority a priori. Then you don't
need to fix anything in a hurry.

> Binder's thread model has a generic threadpool for the entire process;
> threads aren't dedicated to specific nodes. So to avoid such a thread
> being preempted, we need to raise its priority even before waking it
> up, and the only place to do it is from the caller.

*blink* what? Binder has its own threads? It's not just an RPC/ORB?

> >> + * 3) similarly, there may not necessarily be a thread waiting for
> >> + *    transactions at the time the call is made, so we don't know who to proxy-
> >> + *    lock the lock for.
> >
> > Non-issue; you have the exact same problem now. Your interface takes a
> > @task argument, so you have to have a @task either way around.
> 
> But in my current implementation once a thread does call in for more
> work and finds the work pending, it will raise its own priority.

That smells like fail; if you need to raise your prio it means you are
low(er) prio and could at that point be subject to inversion.

Assuming no idle threads in the pool, and all 'busy' at lower priority.
A medium prio something is running, keeping these pool thingies from
asking for more work. A high prio request comes in... nobody around to
service it. You lose, game over.

> I don't see how this is possible with rt_mutex, because the thread is
> not yet available at the moment the caller goes to sleep.

The fact that you need queueing and do not have a task ready to accept
means you've already lost the game.

> >> + * 4) binder supports nested transactions, where A can call into B, and B can
> >> + *    call back into A before returning a reply to the original transaction.
> >> + *    This means that if A is blocked on an rt_mutex B holds, B must first wake
> >> + *    up A to handle a new transaction, and only then can it proxy-lock and try
> >> + *    to acquire the new rt_mutex. This leaves a race condition where B
> >> + *    temporarily runs at its original priority.
> >
> > That doesn't sound like something a well formed (RT) program should do
> > in the first place.
> 
> What do you mean specifically - nested transactions?

Yeah, that sounds really dodgy. It's been a _long_ time since I read the
RT CORBA stuff, so I'm not sure if there's a sensible solution to the
problem, but it sounds really really dodgy.

Regular RPCs are synchronous; if A asks B to do something, A blocks. B
can then not turn around and ask A something. That is called a deadlock.

The only thing B can do at that point is complete the RPC. You can of
course design your RPC such that it can 'fail' the call and then A
re-issues the call with additional information or somesuch.

Alternatively, if you consider A and B to be services/nodes and you get
service A to call into B and then have B call back into A, you need to
guarantee there are sufficient tasks/threads to complete the entire
transaction _ahead_of_time_. That is, when task 1 of A does the request
into B and blocks, it needs to have reserved another thread (2) of A to
be idle and waiting for B's request.

Without the advance reservation you'll run into deadlocks when the
thread-pool runs dry.

IOW, seriously dodgy stuff best avoided.

> I don't think we're trying to make this fit perfectly with WFQ - we're
> just trying to improve the latency on Android devices where it counts,
> by telling the scheduler which threads are important. Without these
> patches, we would often see binder threads receiving sensor events get
> preempted in the kernel, and the delay would cause issues with things
> like video stabilization, etc.

So I really don't understand how all this works.. how can 'important'
device threads not already be of the right priority. All this fixing up
after the fact just reeks of fail.

The device thread then stuffs its data into whatever RPC channels are
waiting for it and goes back to waiting. Each of those channels wake up
the threads waiting for it (at their own respective priorities) and life
goes on.

If the thread blocks while writing into one of those channels, you again
lose.

> > What you're proposing is a bunch of make-work-ish hacks on a system that
> > isn't designed for RT.
> 
> What parts of the system aren't designed for RT? 

None of it, as far as I can tell. The shared thread pools, the quick-fix
priority, the nested transaction stuff, those all reek of utter fail and
confusion.

> I'm trying to understand if it's something in the way binder currently
> works - eg the use of asynchronous transactions and generic
> threadpools - or something in this patchset in particular. How should
> we deal with high-priority sensor events that go over binder IPC in
> your opinion?

Have tasks at appropriate priorities a priori wherever and whenever
possible. PI is a (least bad) fix for when two threads interact on a
shared resource. The best fix is taking out the shared resource.

You don't have a shared resource in most of the scenarios described
here.

Use dedicated IPC channels, and avoid mixing priority msgs on a channel.
If you have to, make sure the channels are priority ordered.

The primary rule for systems is simplicity, RT systems more so than
others. What you're describing is far too complicated and riddled with
holes.

Also, have a look at RT CORBA, there's a really nice open-source one
here:

  http://www.cs.wustl.edu/~schmidt/TAO.html

That group also has a whole bunch of research papers on the subject. As
said, it's been many years since I used any of that, so I'm bound to not
know all the new shiny stuff ;-)

(As a bonus, it's written in a semi-sane language, as opposed to most of
Android :-)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance.
  2017-11-16 11:27       ` Peter Zijlstra
@ 2017-11-16 13:03         ` Martijn Coenen
  2017-11-16 15:10           ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Martijn Coenen @ 2017-11-16 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Greg KH, John Stultz, Todd Kjos, Arve Hjønnevåg, Sherry Yang,
	Thomas Gleixner, Amit Pundir, LKML, devel, Martijn Coenen

On Thu, Nov 16, 2017 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Wed, Nov 15, 2017 at 2:01 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> >> + * 1) binder supports a "minimum node priority", meaning that all transactions
>> >> + *    into a node must run at this priority at a minimum. This means that the
>> >> + *    desired priority for handling a transaction is not necessarily equal to
>> >> + *    the priority of the caller.
>> >
>> > Isn't that the same as running everything in that node (whatever that
>> > is) at that given prio?
>>
>> Not for synchronous transactions; in that case it's max(min_node_prio,
>> caller_prio).
>
> But that's exactly what you get!? How would it ever get below
> min_node_prio? PI only ever (temporarily) raises prio, it never lowers
> it.

Ah I guess we have different definitions of inheritance then. This
behavior is also related to binder's threadpool - all those threads
run at a default prio of 120 (nice 0), and call into the kernel to
wait for work. If somebody then makes a call with a lower prio (>120),
we do change the binder thread that handles that call to use the same
low prio as well. Why run at nice 0 if the caller runs at nice 10?

>
>> One reason min_node_prio exists is that for certain
>> critical nodes, we don't want calls into it to run at a very low
>> priority, because the implementation of said calls may take locks that
>> are contended. I realize one possible solution to that is to use PI
>> mutexes in userspace, but Android doesn't support that today.
>
> Fix that?

That's probably easier said than done :) I don't know why we don't
support them, I'll ask around.

>
> I'm confused -- again. Why not run those threads at the min prio to
> begin with?

This also gets back to binder's threadpool; they are generic threads
in a process waiting for work at nice 0. The prio they get depends on
the work they get, and in case of synchronous transactions, who calls
into them. If we had dedicated threads for each node, we could indeed
run them at the min_prio of that node by default. But this is not how
binder works today.

> Also, this doesn't seem related to PI.

Yes, in case of asynchronous transactions it's not PI at all.

> I'm not understanding; are you saying that some app does an async
> request for sensor data, then later this sensor triggers and we need to
> route the data back to the app?

The way this typically works is that an interested app creates a
binder node to receive sensor data on (basically a callback). The app
then passes this node into the Sensor HAL and says "call me back when
you have relevant sensor data". When the Sensor HAL has data, it makes
calls on this node with the relevant data. Because the receiving app
process has a few generic binder threads that may receive any type of
data, not just sensor data, we want to raise the priority only when
handling sensor data, and restore it to the default when the thread is
done.

>
> But above you said that async requests do not boost the priority; so
> then I would expect this sensor's data to land on regular min-prio
> thread for dispatch.

As you found out later in this e-mail, this doesn't work because of
binder's current threadpool design.

> *blink* what? Binder has its own threads? Its not just an RPC/ORB?

Yeah :-)

> That smells like fail; if you need to raise your prio it means you are
> low(er) prio and could at that point be subject to inversion.

True - it does expose a race condition, but it's still better than not
raising the priority at all. I'm looking into selecting a busy thread
and boosting that instead. Though in general we don't run into this very
often, because we size the threadpool large enough.

>
> Assuming no idle threads in the pool, and all 'busy' at lower priority.
> A medium prio something is running, keeping these pool thingies from
> asking for more work. A high prio request comes in.. nobody around to
> service. You loose, game over.

Agree, we're trying to fix that.


> Yeah, that sounds really dodgy. I've been a _long_ time since I read the
> RT CORBA stuff, so I'm not sure if there's a sensible solution to the
> problem, but it sounds really really dodgy.
>
> Regular RPCs are synchronous; if A asks B to do something, A blocks. B
> can then not turn around and ask A something. That is called a deadlock.

In case of binder, this actually works, we call it "recursive calls";
the same thread in A that called into B can receive an incoming call
from B and handle that. Of course you still need to be careful when
that happens - eg if you hold a lock when making the first outbound
call, you better make sure you can't take that same lock on any
inbound call that can result from it. "Don't hold locks over
synchronous binder calls" is a widely adopted Android design paradigm.


> Alternatively, if you consider A and B to be services/nodes and you get
> service A to call into B and then have B call back into A, you need to
> guarantee there are sufficient tasks/threads to complete the entire
> transaction _ahead_of_time_. That is, when task 1 of A does the request
> into B and blocks, it needs to have reserved another thread (2) of A to
> be idle and waiting for B's request.

See above - we're guaranteed to run on the same thread that made the
original call.

> So I really don't understand how all this works.. how can 'important'
> device threads not already be of the right priority. All this fixing up
> after the fact just reeks of fail.

This gets back to the binder threadpool design. That actually
pre-dates my involvement with binder, but there are probably good
reasons for it being the way it is. For example, some Android
processes host as many as 100 binder nodes, some with different prio
requirements; having one or more threads dedicated to each of those
individual nodes seems quite wasteful.

>
>> What parts of the system aren't designed for RT?
>
> None of it as far I can tell. The shared thread pools, the quick fix
> priority, the nested transaction stuff, those all reek of utter fail and
> confusion.

Ack. FWIW I don't think it's really fail and confusion - I believe
these were conscious design choices for binder and at the time I doubt
that RT was even considered. Like every system binder has its warts,
but it has worked very well for Android. The same holds for this
patchset - it has very measurable improvements on Android system
performance, and I don't think the code or behavior is incorrect. Its
major weakness is that it has to work with the existing threadpool
paradigm and therefore depends on some internal scheduler APIs.

> Have tasks at appropriate priorities a priory wherever and whenever
> possible. PI is a (least bad) fix for when two threads interact on a
> shared resource. The best fix is taking out the shared resource.
>
> You don't have a shared resource in most of the scenarios described
> here.
>
> Use dedicated IPC channels, and avoid mixing priority msgs on a channel.
> If you have to, make sure the channels are priority ordered.
>
> The primary rule for systems is simplicity, RT systems more so than
> others. What you're describing is far too complicated and riddled with
> holes.
>
> Also, have a look at RT CORBA, there's a really nice open-source one
> here:
>
>   http://www.cs.wustl.edu/~schmidt/TAO.html
>
> That group also has a whole bunch of research papers on the subject. As
> said, its been many years since I used any of that so I'm bound to not
> know all the new shiny stuff ;-)
>
> (As a bonus, its written in a semi sane language, as opposed to most of
> Android :-)

Thanks for all the feedback. I'll do some more reading and talking
internally. As far as this series goes: since all the things we talked
about really only affect Android, is your main concern with
merging this the use of some internal scheduler APIs? In general I'd
really like to keep the upstream binder driver as close as possible to
what we use in Android, and until we have an alternative for PI (if we
can agree on something and build it), this is a pretty significant
delta.


* Re: [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance.
  2017-11-16 13:03         ` Martijn Coenen
@ 2017-11-16 15:10           ` Peter Zijlstra
  2017-11-17 10:23             ` Martijn Coenen
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2017-11-16 15:10 UTC (permalink / raw)
  To: Martijn Coenen
  Cc: Greg KH, John Stultz, Todd Kjos, Arve Hjønnevåg, Sherry Yang,
	Thomas Gleixner, Amit Pundir, LKML, devel, Martijn Coenen

On Thu, Nov 16, 2017 at 02:03:13PM +0100, Martijn Coenen wrote:
> On Thu, Nov 16, 2017 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote:

> > But that's exactly what you get!? How would it ever get below
> > min_node_prio? PI only ever (temporarily) raises prio, it never lowers
> > it.
> 
> Ah I guess we have different definitions of inheritance then.

Well, I go by the one described in all the real-time computing texts;
also found on Wikipedia FWIW:

  https://en.wikipedia.org/wiki/Priority_inheritance

> This behavior is also related to binder's threadpool - all those
> threads run at a default prio of 120 (nice 0), and call into the
> kernel to wait for work. If somebody then makes a call with a lower
> prio (>120), we do change the binder thread that handles that call to
> use the same low prio as well. Why run at nice 0 if the caller runs at
> nice 10?

Not immediately saying it doesn't make sense in the sync-rpc scenario;
just saying that calling it PI is terribly confusing.

> Yes, in case of asynchronous transactions it's not PI at all.

It might be good to untangle some of that.

> > I'm not understanding; are you saying that some app does an async
> > request for sensor data, then later this sensor triggers and we need to
> > route the data back to the app?
> 
> The way this typically works is that an interested app creates a
> binder node to receive sensor data on (basically a callback). The app
> then passes this node into the Sensor HAL and says "call me back when
> you have relevant sensor data". When the Sensor HAL has data, it makes
> calls on this node with the relevant data. Because the receiving app
> process has a few generic binder threads that may receive any type of
> data, not just sensor data, we want to raise the priority only when
> handling sensor data, and restore it to the default when the thread is
> done.

I'm still struggling with the fact that _binder_ has threads and not the
apps themselves. Totally weird that.

But yes, that sounds roughly similar. App requests data through
subscription, event happens/gets delivered, app does something.

> > That smells like fail; if you need to raise your prio it means you are
> > low(er) prio and could at that point be subject to inversion.
> 
> True - it does expose a race condition, but it's still better than not
> raising the priority at all. I'm looking into selecting a busy thread
> and boost that. Though in general we don't run into this very often by
> sizing the threadpool large enough.

Right; but real-time systems are about guarantees, not about mostly and
on average.

But yes, you touch upon one way to have mixed priority thread pools. You
need a max prio receiver thread that is basically always runnable, this
is the one thread doing poll(). It is the thread you can immediately
assign your rt_mutex ownership to, etc.

This thread then selects a victim thread to handle the received msg. In
case of no idle thread available, it will queue the msg (on priority
obviously) and assign ownership of the rt_mutex to any one random
thread.

This thread will then get boosted, complete the current work and find
the highest prio message to continue with.

And while this sounds pretty straightforward, it gets quite tricky when
you get a second message to queue etc. IIRC you have to always 'boost'
the lowest priority running thread, such that we end up guaranteeing to
run the N (size of thread pool) highest prio msgs or somesuch.

Ideally you can use dedicated threads for the RT part of the workload
and avoid all this pain.

> > Regular RPCs are synchronous; if A asks B to do something, A blocks. B
> > can then not turn around and ask A something. That is called a deadlock.
> 
> In case of binder, this actually works, we call it "recursive calls";
> the same thread in A that called into B can receive an incoming call
> from B and handle that. Of course you still need to be careful when
> that happens - eg if you hold a lock when making the first outbound
> call, you better make sure you can't take that same lock on any
> inbound call that can result from it. "Don't hold locks over
> synchronous binder calls" is a well-adapted Android design paradigm.

Sounds really weird to me though; I'd be curious to know why this
'feature' was created.

> >> What parts of the system aren't designed for RT?
> >
> > None of it as far I can tell. The shared thread pools, the quick fix
> > priority, the nested transaction stuff, those all reek of utter fail and
> > confusion.
> 
> Ack. FWIW I don't think it's really fail and confusion - I believe
> these were conscious design choices for binder and at the time I doubt
> that RT was even considered.

That's more or less what I was trying to say. The design never
considered RT and trying to retrofit RT into it makes for 'interesting'
things.

> Like every system binder has its warts, but it has worked very well
> for Android. The same holds for this patchset - it has very measurable
> improvements on Android system performance, and I don't think the code
> or behavior is incorrect. It's major weakness is that it has to work
> with the existing threadpool paradigm and therefore depends on some
> internal scheduler APIs.

I take it changing this stuff is 'difficult' since much of it has
been directly exposed to apps? And you'll need to build a parallel
interface and slowly migrate apps away from it?

> Thanks for all the feedback. I'll do some more reading and talking
> internally. As far as this series goes: since all the things we talked
> about which really only affect Android, is your main concern with
> merging this the use of some internal scheduler APIs? 

Yeah I suppose so. Also I think the comments could be made clearer to
avoid some of the confusion we've had here.

> In general I'd really like to keep the upstream binder driver as close
> as possible to what we use in Android, and until we have an
> alternative for PI (if we can agree on something and build it), this
> is a pretty significant delta.

I understand.


* Re: [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance.
  2017-11-16 15:10           ` Peter Zijlstra
@ 2017-11-17 10:23             ` Martijn Coenen
  0 siblings, 0 replies; 20+ messages in thread
From: Martijn Coenen @ 2017-11-17 10:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Greg KH, John Stultz, Todd Kjos, Arve Hjønnevåg, Sherry Yang,
	Thomas Gleixner, Amit Pundir, LKML, devel, Martijn Coenen

On Thu, Nov 16, 2017 at 4:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> Well, I go by the one described in all the real-time computing texts;
> also found on Wikipedia FWIW:
>
>   https://en.wikipedia.org/wiki/Priority_inheritance

Guess I was taking inheritance too literally :-)

>
>> This behavior is also related to binder's threadpool - all those
>> threads run at a default prio of 120 (nice 0), and call into the
>> kernel to wait for work. If somebody then makes a call with a lower
>> prio (>120), we do change the binder thread that handles that call to
>> use the same low prio as well. Why run at nice 0 if the caller runs at
>> nice 10?
>
> Not immediately saying it doesn't make sense in the sync-rpc scenario;
> just saying that calling it PI is terribly confusing.
>
>> Yes, in case of asynchronous transactions it's not PI at all.
>
> It might be good to untangle some of that.

Yeah, I could possibly separate the part that deals with synchronous
transactions out, since that is "true PI".

> I'm still struggling with the fact that _binder_ has threads and not the
> apps themselves. Totally weird that.

Just to clarify - the threads do belong to the app; they are spawned
from the app process in userspace (sometimes on request of the
kernel), and call into the kernel driver for work. But, the app itself
is merely responsible for starting the threadpool; the libbinder
userspace library takes care of the rest. Beyond that, apps just
receive incoming calls on one of these threads, and that's it.

> Right; but real-time systems are about guarantees, not about mostly and
> on average.

Agreed. I wouldn't go so far as to claim Android is RT, but we'll take
any gains we can get.

> But yes, you touch upon one way to have mixed priority thread pools. You
> need a max prio receiver thread that is basically always runnable, this
> is the one thread doing poll(). It is the thread you can immediately
> assign your rt_mutex ownership to etc..

This is an interesting design; I'll think about whether I can make it work
with synchronous transactions. One concern is that running such a
thread at max prio means it would always preempt other work,
regardless of the importance of the incoming binder transaction.
Though I guess if you assign rt_mutex ownership to that thread before
even waking it up, it doesn't need to be max prio - it will be at the
prio it needs to be at.

Also I'm not sure how to make this work with async transactions that
need to be handled at a certain prio; either we already have a thread
lying around running at that prio, but if we don't we need to change
the prio before waking it up (which is more or less what this series
does).

> Sounds really weird to me though; I'd be curious to know why this
> 'feature' was created.

I think it may be because binder tries to abstract away from the fact
that a node can be remote - for userspace a binder object can be both
local and remote, and it shouldn't have to care. If you have a local
function call chain A() -> B() -> C(), then these functions obviously
all run on the same thread. But if A() and C() live in process P1 and
B lives in process P2, having A() and C() run in the same thread in
process P1 preserves those semantics. Arve (on CC) probably knows more
about this.

> I'm taking it changing this stuff is 'difficult' since much of it has
> been directly exposed to apps? And you'll need to build a parallel
> interface and slowly migrate apps away from it?

I think the whole threadpool design is actually abstracted pretty
nicely from apps, but we do have transaction ordering semantics that
we need to preserve; also the binder protocol between the kernel
driver and userspace is UAPI, and it can be hard to make large changes
to it while retaining support for an old Android userspace.

> Yeah I suppose so. Also I think the comments could be made clearer to
> avoid some of the confusion we've had here.

A lot of what we talked about is also about how binder works in
general. I'm writing more documentation about that internally, and I
hope to some day add Documentation/ipc/binder.txt or something
like that :-)


end of thread, other threads:[~2017-11-17 10:23 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-26 14:07 [PATCH v3 0/6] ANDROID: binder: RT priority inheritance Martijn Coenen
2017-10-26 14:07 ` [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance Martijn Coenen
2017-11-15 13:01   ` Peter Zijlstra
2017-11-16  9:18     ` Martijn Coenen
2017-11-16 11:27       ` Peter Zijlstra
2017-11-16 13:03         ` Martijn Coenen
2017-11-16 15:10           ` Peter Zijlstra
2017-11-17 10:23             ` Martijn Coenen
2017-10-26 14:07 ` [PATCH v3 2/6] ANDROID: binder: add min sched_policy to node Martijn Coenen
2017-11-15 13:02   ` Peter Zijlstra
2017-11-16  9:27     ` Martijn Coenen
2017-10-26 14:07 ` [PATCH v3 3/6] ANDROID: binder: improve priority inheritance Martijn Coenen
2017-11-15 13:03   ` Peter Zijlstra
2017-11-16  9:31     ` Martijn Coenen
2017-10-26 14:07 ` [PATCH v3 4/6] ANDROID: binder: add RT inheritance flag to node Martijn Coenen
2017-11-15 13:05   ` Peter Zijlstra
2017-11-16  9:37     ` Martijn Coenen
2017-10-26 14:07 ` [PATCH v3 5/6] ANDROID: binder: don't check prio permissions on restore Martijn Coenen
2017-11-15 13:08   ` Peter Zijlstra
2017-10-26 14:07 ` [PATCH v3 6/6] ANDROID: binder: Add tracing for binder priority inheritance Martijn Coenen
