From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S933462AbdCaRyb (ORCPT );
        Fri, 31 Mar 2017 13:54:31 -0400
Received: from mail-pg0-f49.google.com ([74.125.83.49]:34129 "EHLO
        mail-pg0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S933781AbdCaRy2 (ORCPT );
        Fri, 31 Mar 2017 13:54:28 -0400
From: Douglas Anderson <dianders@chromium.org>
To: gregkh@linuxfoundation.org, arve@android.com, riandrews@android.com
Cc: tkjos@google.com, Douglas Anderson <dianders@chromium.org>,
        devel@driverdev.osuosl.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH] binder: Don't require the binder lock when killed in
        binder_thread_read()
Date: Fri, 31 Mar 2017 10:53:41 -0700
Message-Id: <20170331175341.19889-1-dianders@chromium.org>
X-Mailer: git-send-email 2.12.2.564.g063fe858b8-goog
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Sometimes when we're out of memory the OOM killer decides to kill a
process that's in binder_thread_read().  If we happen to be waiting
for work, we'll get the kill signal and wake up.  That's good.

...but then we try to grab the binder lock before we return.  That's
bad.

The problem is that someone else might be holding the one true global
binder lock.  If that other process is blocked, then we can't finish
exiting.  In the worst case, the other process is itself blocked
waiting for memory, and then we'll have a really hard time exiting.

On older kernels that don't have the OOM reaper (or something
similar), like kernel 4.4, this is a really big problem and we end up
with a simple deadlock because:

* Once we pick a process to OOM kill we won't pick another--we first
  wait for the process we picked to die.  The reasoning is that we've
  given the doomed process access to special memory pools so it can
  quit quickly, and we don't have special pool memory to go around.

* We don't have any type of "special access donation" that would give
  the mutex holder our special access.

On a 4.4 kernel with binder patches, we easily see this happen.  We
just attempted this OOM kill:

  Killed process 4132 (Binder_1) total-vm:949904kB, anon-rss:4kB, file-rss:0kB

The doomed task:

  Stack traceback for pid 4132
  4132  3380  0  0  D  Binder_1
  Call trace:
    __switch_to+0x9c/0xa8
    __schedule+0x504/0x740
    schedule+0x88/0xa8
    schedule_preempt_disabled+0x28/0x44
    __mutex_lock_slowpath+0xf8/0x1a4
    mutex_lock+0x4c/0x68
    binder_thread_read+0x68c/0x11bc
    binder_ioctl_write_read.constprop.46+0x1e8/0x318
    binder_ioctl+0x370/0x778
    compat_SyS_ioctl+0x134/0x10ac
    el0_svc_naked+0x24/0x28

The binder lock holder:

  4001  3380  0  4  R  Binder_7
  Call trace:
    __switch_to+0x9c/0xa8
    __schedule+0x504/0x740
    preempt_schedule_common+0x28/0x48
    preempt_schedule+0x24/0x2c
    _raw_spin_unlock+0x44/0x50
    list_lru_count_one+0x40/0x50
    super_cache_count+0x68/0xa0
    shrink_slab.part.54+0xfc/0x4a4
    shrink_zone+0xa8/0x198
    try_to_free_pages+0x408/0x590
    __alloc_pages_nodemask+0x60c/0x95c
    __read_swap_cache_async+0x98/0x214
    read_swap_cache_async+0x48/0x7c
    swapin_readahead+0x188/0x1b8
    handle_mm_fault+0xbf8/0x1050
    do_page_fault+0x140/0x2dc
    do_mem_abort+0x64/0xd8
  Exception stack:
    < ... cut ... >
    el1_da+0x18/0x78
    binder_ioctl+0x370/0x778
    compat_SyS_ioctl+0x134/0x10ac
    el0_svc_naked+0x24/0x28

There's really not a lot of reason to grab the binder lock when we're
just going to exit anyway.  Add a little bit of complexity to the code
to allow binder_thread_read() to sometimes return without the lock
held.
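In sketch form, the shape of the change looks something like this
(illustrative only, not the actual binder code; "some_dev",
"dev->lock", and work_available() are made-up stand-ins):

  static int do_read(struct some_dev *dev, bool *locked)
  {
          int ret;

          mutex_unlock(&dev->lock);       /* drop the big lock to sleep */
          *locked = false;

          /* wait_event_freezable() returns nonzero if a signal woke us */
          ret = wait_event_freezable(dev->wait, work_available(dev));
          if (ret)
                  return ret;             /* killed: bail out lockless */

          /* only re-take the big lock if nobody is trying to kill us */
          if (mutex_lock_interruptible(&dev->lock))
                  return -ERESTARTSYS;
          *locked = true;

          /* ... normal processing with dev->lock held ... */
          return 0;
  }

The caller tracks ownership through the "locked" out-parameter and
only unlocks on the way out if we still hold the mutex.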
NOTE: to do this safely we need to make sure that we can safely do
these things _without_ the global binder lock:
* Muck with proc->ready_threads
* Clear a bitfield in thread->looper

It appears that those two operations don't need to be done together
and it's OK to have different locking solutions for the two.  Thus:

1. We change thread->looper to atomic_t and use atomic ops to muck
   with it.  This means we're not clobbering someone else's work with
   our read/modify/write.  Note that I haven't confirmed that every
   modification of "thread->looper" can be done without the binder
   mutex or without some coarser-grained lock.  ...but clearing the
   BINDER_LOOPER_STATE_WAITING bit should be fine: the only place
   looking at it is binder_deferred_flush().  Also: while erroring
   out we may also clear BINDER_LOOPER_STATE_NEED_RETURN.  This looks
   to be OK, though the code isn't trivial to follow.

2. We add a new fine-grained mutex (per "proc" structure) to guard
   all of the "_threads" counters.  Technically the only value we're
   modifying is "proc->ready_threads" and we're just decrementing it
   (so maybe we could use atomic_t here, too?).  ...but it appears
   that all of the "_threads" counters work together, so protecting
   them together seems to make the most sense.  Note that to avoid
   deadlock:
   * We never grab the fine-grained mutex first and _then_ the
     binder mutex.
   * We only hold the fine-grained mutex for short periods and never
     do anything blocking while holding it.

To sum it all up:
1. This patch does improve the chances that we will be able to kill
   a task quickly when we want to.  That means this patch has some
   merit.
2. Though this patch has merit, it isn't actually all that critical
   because:
   2a) On newer kernels the OOM reaper should keep us from getting
       deadlocked.
   2b) Even on old kernels there are other deadlock cases we hit
       even if we fix binder in this way.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
It's unclear to me if we really want this patch since, as per the
description above, it's not all that critical.  I also don't
personally have enough context with Binder to prove that every change
it makes is guaranteed not to screw something up.  I post it in case
someone is interested, or in case someone who knows the code well can
say "yes, this is right and good". :)

This patch was tested against the Chrome OS 4.4 kernel.  It was only
compile-tested against upstream since I don't have binder up and
running there.
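As an aside for reviewers who haven't used the bitwise atomics: the
looper conversion is mechanical.  A minimal sketch (not code from this
patch; STATE_WAITING is a made-up flag bit standing in for
BINDER_LOOPER_STATE_WAITING):

  #include <linux/atomic.h>
  #include <linux/types.h>

  #define STATE_WAITING 0x10              /* made-up flag bit */

  static atomic_t looper = ATOMIC_INIT(0);

  static void set_waiting(void)
  {
          atomic_or(STATE_WAITING, &looper);      /* was: looper |= ... */
  }

  static void clear_waiting(void)
  {
          atomic_and(~STATE_WAITING, &looper);    /* was: looper &= ~... */
  }

  static bool is_waiting(void)
  {
          return atomic_read(&looper) & STATE_WAITING;
  }

Two threads doing plain |= and &= on the same int can lose one of the
two updates (each is a separate read/modify/write); atomic_or() and
atomic_and() can't.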
 drivers/android/binder.c | 116 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 81 insertions(+), 35 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index aae4d8d4be36..220b74b7100d 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -18,6 +18,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include <asm/cacheflush.h>
+#include <linux/atomic.h>
 #include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/freezer.h>
@@ -347,10 +348,13 @@ struct binder_proc {
 	wait_queue_head_t wait;
 	struct binder_stats stats;
 	struct list_head delivered_death;
+
 	int max_threads;
 	int requested_threads;
 	int requested_threads_started;
 	int ready_threads;
+	struct mutex threads_lock;
+
 	long default_priority;
 	struct dentry *debugfs_entry;
 	struct binder_context *context;
@@ -369,7 +373,7 @@ struct binder_thread {
 	struct binder_proc *proc;
 	struct rb_node rb_node;
 	int pid;
-	int looper;
+	atomic_t looper;
 	struct binder_transaction *transaction_stack;
 	struct list_head todo;
 	uint32_t return_error; /* Write failed, return error code in read buf */
@@ -458,6 +462,15 @@ static inline void binder_lock(const char *tag)
 	trace_binder_locked(tag);
 }
 
+static inline int __must_check binder_lock_interruptible(const char *tag)
+{
+	trace_binder_lock(tag);
+	if (mutex_lock_interruptible(&binder_main_lock))
+		return -ERESTARTSYS;
+	trace_binder_locked(tag);
+	return 0;
+}
+
 static inline void binder_unlock(const char *tag)
 {
 	trace_binder_unlock(tag);
@@ -2450,39 +2463,41 @@ static int binder_thread_write(struct binder_proc *proc,
 		}
 
 		case BC_REGISTER_LOOPER:
+			mutex_lock(&proc->threads_lock);
 			binder_debug(BINDER_DEBUG_THREADS,
 				     "%d:%d BC_REGISTER_LOOPER\n",
 				     proc->pid, thread->pid);
-			if (thread->looper & BINDER_LOOPER_STATE_ENTERED) {
-				thread->looper |= BINDER_LOOPER_STATE_INVALID;
+			if (atomic_read(&thread->looper) & BINDER_LOOPER_STATE_ENTERED) {
+				atomic_or(BINDER_LOOPER_STATE_INVALID, &thread->looper);
 				binder_user_error("%d:%d ERROR: BC_REGISTER_LOOPER called after BC_ENTER_LOOPER\n",
 					proc->pid, thread->pid);
 			} else if (proc->requested_threads == 0) {
-				thread->looper |= BINDER_LOOPER_STATE_INVALID;
+				atomic_or(BINDER_LOOPER_STATE_INVALID, &thread->looper);
 				binder_user_error("%d:%d ERROR: BC_REGISTER_LOOPER called without request\n",
 					proc->pid, thread->pid);
 			} else {
 				proc->requested_threads--;
 				proc->requested_threads_started++;
 			}
-			thread->looper |= BINDER_LOOPER_STATE_REGISTERED;
+			atomic_or(BINDER_LOOPER_STATE_REGISTERED, &thread->looper);
+			mutex_unlock(&proc->threads_lock);
 			break;
 		case BC_ENTER_LOOPER:
 			binder_debug(BINDER_DEBUG_THREADS,
 				     "%d:%d BC_ENTER_LOOPER\n",
 				     proc->pid, thread->pid);
-			if (thread->looper & BINDER_LOOPER_STATE_REGISTERED) {
-				thread->looper |= BINDER_LOOPER_STATE_INVALID;
+			if (atomic_read(&thread->looper) & BINDER_LOOPER_STATE_REGISTERED) {
+				atomic_or(BINDER_LOOPER_STATE_INVALID, &thread->looper);
 				binder_user_error("%d:%d ERROR: BC_ENTER_LOOPER called after BC_REGISTER_LOOPER\n",
 					proc->pid, thread->pid);
 			}
-			thread->looper |= BINDER_LOOPER_STATE_ENTERED;
+			atomic_or(BINDER_LOOPER_STATE_ENTERED, &thread->looper);
 			break;
 		case BC_EXIT_LOOPER:
 			binder_debug(BINDER_DEBUG_THREADS,
 				     "%d:%d BC_EXIT_LOOPER\n",
 				     proc->pid, thread->pid);
-			thread->looper |= BINDER_LOOPER_STATE_EXITED;
+			atomic_or(BINDER_LOOPER_STATE_EXITED, &thread->looper);
 			break;
 
 		case BC_REQUEST_DEATH_NOTIFICATION:
@@ -2538,7 +2553,7 @@ static int binder_thread_write(struct binder_proc *proc,
 			ref->death = death;
 			if (ref->node->proc == NULL) {
 				ref->death->work.type = BINDER_WORK_DEAD_BINDER;
-				if (thread->looper & (BINDER_LOOPER_STATE_REGISTERED | BINDER_LOOPER_STATE_ENTERED)) {
+				if (atomic_read(&thread->looper) & (BINDER_LOOPER_STATE_REGISTERED | BINDER_LOOPER_STATE_ENTERED)) {
 					list_add_tail(&ref->death->work.entry, &thread->todo);
 				} else {
 					list_add_tail(&ref->death->work.entry, &proc->todo);
@@ -2562,7 +2577,7 @@ static int binder_thread_write(struct binder_proc *proc,
 				ref->death = NULL;
 				if (list_empty(&death->work.entry)) {
 					death->work.type = BINDER_WORK_CLEAR_DEATH_NOTIFICATION;
-					if (thread->looper & (BINDER_LOOPER_STATE_REGISTERED | BINDER_LOOPER_STATE_ENTERED)) {
+					if (atomic_read(&thread->looper) & (BINDER_LOOPER_STATE_REGISTERED | BINDER_LOOPER_STATE_ENTERED)) {
 						list_add_tail(&death->work.entry, &thread->todo);
 					} else {
 						list_add_tail(&death->work.entry, &proc->todo);
@@ -2604,7 +2619,7 @@ static int binder_thread_write(struct binder_proc *proc,
 				list_del_init(&death->work.entry);
 				if (death->work.type == BINDER_WORK_DEAD_BINDER_AND_CLEAR) {
 					death->work.type = BINDER_WORK_CLEAR_DEATH_NOTIFICATION;
-					if (thread->looper & (BINDER_LOOPER_STATE_REGISTERED | BINDER_LOOPER_STATE_ENTERED)) {
+					if (atomic_read(&thread->looper) & (BINDER_LOOPER_STATE_REGISTERED | BINDER_LOOPER_STATE_ENTERED)) {
 						list_add_tail(&death->work.entry, &thread->todo);
 					} else {
 						list_add_tail(&death->work.entry, &proc->todo);
@@ -2638,19 +2653,20 @@ static int binder_has_proc_work(struct binder_proc *proc,
 				struct binder_thread *thread)
 {
 	return !list_empty(&proc->todo) ||
-		(thread->looper & BINDER_LOOPER_STATE_NEED_RETURN);
+		(atomic_read(&thread->looper) & BINDER_LOOPER_STATE_NEED_RETURN);
 }
 
 static int binder_has_thread_work(struct binder_thread *thread)
 {
 	return !list_empty(&thread->todo) || thread->return_error != BR_OK ||
-		(thread->looper & BINDER_LOOPER_STATE_NEED_RETURN);
+		(atomic_read(&thread->looper) & BINDER_LOOPER_STATE_NEED_RETURN);
 }
 
 static int binder_thread_read(struct binder_proc *proc,
 			      struct binder_thread *thread,
 			      binder_uintptr_t binder_buffer, size_t size,
-			      binder_size_t *consumed, int non_block)
+			      binder_size_t *consumed, int non_block,
+			      bool *locked)
 {
 	void __user *buffer = (void __user *)(uintptr_t)binder_buffer;
 	void __user *ptr = buffer + *consumed;
@@ -2687,21 +2703,24 @@ static int binder_thread_read(struct binder_proc *proc,
 		goto done;
 	}
 
-
-	thread->looper |= BINDER_LOOPER_STATE_WAITING;
+	atomic_or(BINDER_LOOPER_STATE_WAITING, &thread->looper);
+	mutex_lock(&proc->threads_lock);
 	if (wait_for_proc_work)
 		proc->ready_threads++;
+	mutex_unlock(&proc->threads_lock);
 
 	binder_unlock(__func__);
+	*locked = false;
 
 	trace_binder_wait_for_work(wait_for_proc_work,
 				   !!thread->transaction_stack,
 				   !list_empty(&thread->todo));
 	if (wait_for_proc_work) {
-		if (!(thread->looper & (BINDER_LOOPER_STATE_REGISTERED |
-					BINDER_LOOPER_STATE_ENTERED))) {
+		if (!(atomic_read(&thread->looper) & (BINDER_LOOPER_STATE_REGISTERED |
+						      BINDER_LOOPER_STATE_ENTERED))) {
 			binder_user_error("%d:%d ERROR: Thread waiting for process work before calling BC_REGISTER_LOOPER or BC_ENTER_LOOPER (state %x)\n",
-				proc->pid, thread->pid, thread->looper);
+					  proc->pid, thread->pid,
+					  atomic_read(&thread->looper));
 			wait_event_interruptible(binder_user_error_wait,
 						 binder_stop_on_user_error < 2);
 		}
@@ -2719,14 +2738,23 @@ static int binder_thread_read(struct binder_proc *proc,
 			ret = wait_event_freezable(thread->wait,
 						   binder_has_thread_work(thread));
 	}
-	binder_lock(__func__);
-
+	/*
+	 * Update these _without_ grabbing the binder lock since we might be
+	 * about to return an error.
+	 */
+	mutex_lock(&proc->threads_lock);
 	if (wait_for_proc_work)
 		proc->ready_threads--;
-	thread->looper &= ~BINDER_LOOPER_STATE_WAITING;
+	mutex_unlock(&proc->threads_lock);
+	atomic_and(~BINDER_LOOPER_STATE_WAITING, &thread->looper);
+
+	if (ret)
+		return ret;
 
+	ret = binder_lock_interruptible(__func__);
 	if (ret)
 		return ret;
+	*locked = true;
 
 	while (1) {
 		uint32_t cmd;
@@ -2743,7 +2771,7 @@ static int binder_thread_read(struct binder_proc *proc,
 		} else {
 			/* no data added */
 			if (ptr - buffer == 4 &&
-			    !(thread->looper & BINDER_LOOPER_STATE_NEED_RETURN))
+			    !(atomic_read(&thread->looper) & BINDER_LOOPER_STATE_NEED_RETURN))
 				goto retry;
 			break;
 		}
@@ -2957,19 +2985,23 @@ static int binder_thread_read(struct binder_proc *proc,
 
 done:
 	*consumed = ptr - buffer;
+	mutex_lock(&proc->threads_lock);
 	if (proc->requested_threads + proc->ready_threads == 0 &&
 	    proc->requested_threads_started < proc->max_threads &&
-	    (thread->looper & (BINDER_LOOPER_STATE_REGISTERED |
+	    (atomic_read(&thread->looper) & (BINDER_LOOPER_STATE_REGISTERED |
 	     BINDER_LOOPER_STATE_ENTERED)) /* the user-space code fails to */
 	     /*spawn a new thread if we leave this out */) {
 		proc->requested_threads++;
 		binder_debug(BINDER_DEBUG_THREADS,
 			     "%d:%d BR_SPAWN_LOOPER\n",
 			     proc->pid, thread->pid);
-		if (put_user(BR_SPAWN_LOOPER, (uint32_t __user *)buffer))
+		if (put_user(BR_SPAWN_LOOPER, (uint32_t __user *)buffer)) {
+			mutex_unlock(&proc->threads_lock);
 			return -EFAULT;
+		}
 		binder_stat_br(proc, thread, BR_SPAWN_LOOPER);
 	}
+	mutex_unlock(&proc->threads_lock);
 	return 0;
 }
 
@@ -3051,7 +3083,7 @@ static struct binder_thread *binder_get_thread(struct binder_proc *proc)
 		INIT_LIST_HEAD(&thread->todo);
 		rb_link_node(&thread->rb_node, parent, p);
 		rb_insert_color(&thread->rb_node, &proc->threads);
-		thread->looper |= BINDER_LOOPER_STATE_NEED_RETURN;
+		atomic_or(BINDER_LOOPER_STATE_NEED_RETURN, &thread->looper);
 		thread->return_error = BR_OK;
 		thread->return_error2 = BR_OK;
 	}
@@ -3133,7 +3165,8 @@ static unsigned int binder_poll(struct file *filp,
 
 static int binder_ioctl_write_read(struct file *filp,
 				   unsigned int cmd, unsigned long arg,
-				   struct binder_thread *thread)
+				   struct binder_thread *thread,
+				   bool *locked)
 {
 	int ret = 0;
 	struct binder_proc *proc = filp->private_data;
@@ -3172,7 +3205,7 @@ static int binder_ioctl_write_read(struct file *filp,
 		ret = binder_thread_read(proc, thread, bwr.read_buffer,
 					 bwr.read_size,
 					 &bwr.read_consumed,
-					 filp->f_flags & O_NONBLOCK);
+					 filp->f_flags & O_NONBLOCK, locked);
 		trace_binder_read_done(ret);
 		if (!list_empty(&proc->todo))
 			wake_up_interruptible(&proc->wait);
@@ -3243,6 +3276,7 @@ static long binder_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 	struct binder_thread *thread;
 	unsigned int size = _IOC_SIZE(cmd);
 	void __user *ubuf = (void __user *)arg;
+	bool locked;
 
 	/*pr_info("binder_ioctl: %d:%d %x %lx\n",
 			proc->pid, current->pid, cmd, arg);*/
@@ -3258,6 +3292,7 @@ static long binder_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		goto err_unlocked;
 
 	binder_lock(__func__);
+	locked = true;
 	thread = binder_get_thread(proc);
 	if (thread == NULL) {
 		ret = -ENOMEM;
@@ -3266,15 +3301,18 @@ static long binder_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 
 	switch (cmd) {
 	case BINDER_WRITE_READ:
-		ret = binder_ioctl_write_read(filp, cmd, arg, thread);
+		ret = binder_ioctl_write_read(filp, cmd, arg, thread, &locked);
 		if (ret)
 			goto err;
 		break;
 	case BINDER_SET_MAX_THREADS:
+		mutex_lock(&proc->threads_lock);
 		if (copy_from_user(&proc->max_threads, ubuf, sizeof(proc->max_threads))) {
 			ret = -EINVAL;
+			mutex_unlock(&proc->threads_lock);
 			goto err;
 		}
+		mutex_unlock(&proc->threads_lock);
 		break;
 	case BINDER_SET_CONTEXT_MGR:
 		ret = binder_ioctl_set_ctx_mgr(filp);
@@ -3308,8 +3346,9 @@ static long binder_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 	ret = 0;
 err:
 	if (thread)
-		thread->looper &= ~BINDER_LOOPER_STATE_NEED_RETURN;
-	binder_unlock(__func__);
+		atomic_and(~BINDER_LOOPER_STATE_NEED_RETURN, &thread->looper);
+	if (locked)
+		binder_unlock(__func__);
 	wait_event_interruptible(binder_user_error_wait, binder_stop_on_user_error < 2);
 	if (ret && ret != -ERESTARTSYS)
 		pr_info("%d:%d ioctl %x %lx returned %d\n", proc->pid, current->pid, cmd, arg, ret);
@@ -3473,6 +3512,7 @@ static int binder_open(struct inode *nodp, struct file *filp)
 	binder_dev = container_of(filp->private_data, struct binder_device,
 				  miscdev);
 	proc->context = &binder_dev->context;
+	mutex_init(&proc->threads_lock);
 
 	binder_lock(__func__);
 
@@ -3521,8 +3561,9 @@ static void binder_deferred_flush(struct binder_proc *proc)
 	for (n = rb_first(&proc->threads); n != NULL; n = rb_next(n)) {
 		struct binder_thread *thread = rb_entry(n, struct binder_thread, rb_node);
-		thread->looper |= BINDER_LOOPER_STATE_NEED_RETURN;
-		if (thread->looper & BINDER_LOOPER_STATE_WAITING) {
+		atomic_or(BINDER_LOOPER_STATE_NEED_RETURN, &thread->looper);
+		if (atomic_read(&thread->looper) &
+		    BINDER_LOOPER_STATE_WAITING) {
 			wake_up_interruptible(&thread->wait);
 			wake_count++;
 		}
@@ -3827,7 +3868,8 @@ static void print_binder_thread(struct seq_file *m,
 	size_t start_pos = m->count;
 	size_t header_pos;
 
-	seq_printf(m, "  thread %d: l %02x\n", thread->pid, thread->looper);
+	seq_printf(m, "  thread %d: l %02x\n", thread->pid,
+		   atomic_read(&thread->looper));
 	header_pos = m->count;
 	t = thread->transaction_stack;
 	while (t) {
@@ -4019,6 +4061,8 @@ static void print_binder_proc_stats(struct seq_file *m,
 	struct rb_node *n;
 	int count, strong, weak;
 
+	mutex_lock(&proc->threads_lock);
+
 	seq_printf(m, "proc %d\n", proc->pid);
 	seq_printf(m, "context %s\n", proc->context->name);
 	count = 0;
@@ -4064,6 +4108,8 @@ static void print_binder_proc_stats(struct seq_file *m,
 	seq_printf(m, "  pending transactions: %d\n", count);
 
 	print_binder_stats(m, "  ", &proc->stats);
+
+	mutex_unlock(&proc->threads_lock);
 }
 
-- 
2.12.2.564.g063fe858b8-goog