From: Axel Rasmussen <axelrasmussen@google.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	Chinwen Chang <chinwen.chang@mediatek.com>,
	Daniel Jordan <daniel.m.jordan@oracle.com>,
	David Rientjes <rientjes@google.com>,
	Davidlohr Bueso <dbueso@suse.de>, Ingo Molnar <mingo@redhat.com>,
	Jann Horn <jannh@google.com>,
	Laurent Dufour <ldufour@linux.ibm.com>,
	Michel Lespinasse <walken@google.com>,
	Stephen Rothwell <sfr@canb.auug.org.au>,
	Steven Rostedt <rostedt@goodmis.org>,
	Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang Shao <laoar.shao@gmail.com>,
	davem@davemloft.net, dsahern@kernel.org,
	gregkh@linuxfoundation.org, kuba@kernel.org,
	liuhangbin@gmail.com, tj@kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Axel Rasmussen <axelrasmussen@google.com>
Subject: [PATCH] mm: mmap_lock: fix use-after-free race and css ref leak in tracepoints
Date: Mon, 30 Nov 2020 15:35:04 -0800
Message-ID: <20201130233504.3725241-1-axelrasmussen@google.com>

syzbot reported[1] a use-after-free introduced by commit 0f818c4bc1f3.
The bug is that an ongoing trace event can race with the tracepoint
being disabled (and therefore with the _unreg() callback being called).
Consider this ordering:

T1: trace event fires, get_mm_memcg_path() is called
T1: get_memcg_path_buf() returns a buffer pointer
T2: trace_mmap_lock_unreg() is called, buffers are freed
T1: cgroup_path() is called with the now-freed buffer

The solution in this commit is to modify trace_mmap_lock_unreg() to
first stop new buffers from being handed out, and then to wait (spin)
until all existing buffer references are dropped (i.e., until those
trace events complete).
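
For illustration, here is a minimal userspace analogue of that scheme in
C11 atomics (the names and the busy-wait are illustrative only; the
actual patch below uses atomic_t, atomic_inc_not_zero(), and
READ_ONCE()/WRITE_ONCE()):

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stddef.h>

  static atomic_bool unreg_started;
  static atomic_int inflight_rc = 1; /* +1 held by the registration */

  /* Trace-event side: try to take a reference on the shared buffers. */
  static char *try_get_buf(char *buf)
  {
          int rc;

          /* Racy fast-path check; correctness comes from the refcount. */
          if (atomic_load(&unreg_started))
                  return NULL;

          /* inc-not-zero: take a reference only if one is still held. */
          rc = atomic_load(&inflight_rc);
          do {
                  if (rc == 0)
                          return NULL;
          } while (!atomic_compare_exchange_weak(&inflight_rc, &rc, rc + 1));

          return buf;
  }

  /* Trace-event side: done with the buffer, drop the reference. */
  static void put_buf(void)
  {
          atomic_fetch_sub(&inflight_rc, 1);
  }

  /* Unregister side: gate new users, drop the +1, spin out stragglers. */
  static void unreg(void)
  {
          atomic_store(&unreg_started, true);
          atomic_fetch_sub(&inflight_rc, 1);
          while (atomic_load(&inflight_rc))
                  ; /* trace events are short, so this spin is short */
          /* now safe to free the buffers */
  }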

I have a simple reproducer program which spins up two pools of threads,
each doing the following in a tight loop:

  Pool 1:
  mmap(NULL, 4096, PROT_READ | PROT_WRITE,
       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
  munmap()

  Pool 2:
  echo 1 > /sys/kernel/debug/tracing/events/mmap_lock/enable
  echo 0 > /sys/kernel/debug/tracing/events/mmap_lock/enable

This triggers the use-after-free very quickly. With this patch, I let it
run for an hour without any BUGs.
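
For reference, a sketch of such a reproducer (details like the thread
counts, error handling, and the tracefs mount path are illustrative,
not the exact program I ran):

  #include <fcntl.h>
  #include <pthread.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define ENABLE "/sys/kernel/debug/tracing/events/mmap_lock/enable"

  /* Pool 1: map and unmap an anonymous page as fast as possible. */
  static void *mmap_pool(void *arg)
  {
          (void)arg;
          for (;;) {
                  void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                  if (p != MAP_FAILED)
                          munmap(p, 4096);
          }
          return NULL;
  }

  /* Pool 2: toggle the tracepoint on and off as fast as possible. */
  static void *toggle_pool(void *arg)
  {
          int fd = open(ENABLE, O_WRONLY);

          (void)arg;
          while (fd >= 0) {
                  if (write(fd, "1", 1) < 0 || write(fd, "0", 1) < 0)
                          break;
          }
          return NULL;
  }

  int main(void)
  {
          pthread_t t;
          int i;

          for (i = 0; i < 4; i++)
                  pthread_create(&t, NULL, mmap_pool, NULL);
          for (i = 0; i < 4; i++)
                  pthread_create(&t, NULL, toggle_pool, NULL);
          pause();
          return 0;
  }

(Build with -pthread and run as root, so the tracefs writes succeed.)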

While fixing this, I also noticed and fixed a css reference leak:
previously we called get_mem_cgroup_from_mm(), but never called
css_put() to release that reference. get_mm_memcg_path() now does this
properly.
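
That is, every successful get_mem_cgroup_from_mm() must be paired with
a css_put() on the returned css (fragment in kernel context; see the
patched get_mm_memcg_path() below for the full version):

  struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);

  if (memcg != NULL) {
          /* ... use memcg->css.cgroup ... */
          css_put(&memcg->css); /* drop the ref taken above */
  }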

[1]: https://syzkaller.appspot.com/bug?extid=19e6dd9943972fa1c58a

Fixes: 0f818c4bc1f3 ("mm: mmap_lock: add tracepoints around lock acquisition")
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 mm/mmap_lock.c | 100 +++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 85 insertions(+), 15 deletions(-)

diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 12af8f1b8a14..be38dc58278b 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -3,6 +3,7 @@
 #include <trace/events/mmap_lock.h>
 
 #include <linux/mm.h>
+#include <linux/atomic.h>
 #include <linux/cgroup.h>
 #include <linux/memcontrol.h>
 #include <linux/mmap_lock.h>
@@ -18,13 +19,28 @@ EXPORT_TRACEPOINT_SYMBOL(mmap_lock_released);
 #ifdef CONFIG_MEMCG
 
 /*
- * Our various events all share the same buffer (because we don't want or need
- * to allocate a set of buffers *per event type*), so we need to protect against
- * concurrent _reg() and _unreg() calls, and count how many _reg() calls have
- * been made.
+ * This is unfortunately complicated... _reg() and _unreg() may be called
+ * in parallel, separately for each of our three event types. To save memory,
+ * all of the event types share the same buffers. Furthermore, trace events
+ * might happen in parallel with _unreg(); we need to ensure we don't free the
+ * buffers before all inflights have finished. Because these events happen
+ * "frequently", we also want to prevent new inflights from starting once the
+ * _unreg() process begins. And, for performance reasons, we want to avoid any
+ * locking in the trace event path.
+ *
+ * So:
+ *
+ * - Use a spinlock to serialize _reg() and _unreg() calls.
+ * - Keep track of nested _reg() calls with a lock-protected counter.
+ * - Define a flag indicating whether or not unregistration has begun (and
+ *   therefore that there should be no new buffer uses going forward).
+ * - Keep track of inflight buffer users with a reference count.
  */
 static DEFINE_SPINLOCK(reg_lock);
-static int reg_refcount;
+static int reg_types_rc; /* Protected by reg_lock. */
+static bool unreg_started; /* Doesn't need synchronization. */
+/* atomic_t instead of refcount_t, as we want ordered inc without locks. */
+static atomic_t inflight_rc = ATOMIC_INIT(0);
 
 /*
  * Size of the buffer for memcg path names. Ignoring stack trace support,
@@ -46,9 +62,14 @@ int trace_mmap_lock_reg(void)
 	unsigned long flags;
 	int cpu;
 
+	/*
+	 * Serialize _reg() and _unreg(). Without this, e.g. _unreg() might
+	 * start cleaning up while _reg() is only partially completed.
+	 */
 	spin_lock_irqsave(&reg_lock, flags);
 
-	if (reg_refcount++)
+	/* If the refcount is going 0->1, proceed with allocating buffers. */
+	if (reg_types_rc++)
 		goto out;
 
 	for_each_possible_cpu(cpu) {
@@ -62,6 +83,11 @@ int trace_mmap_lock_reg(void)
 		per_cpu(memcg_path_buf_idx, cpu) = 0;
 	}
 
+	/* Reset unreg_started flag, allowing new trace events. */
+	WRITE_ONCE(unreg_started, false);
+	/* Add the registration +1 to the inflight refcount. */
+	atomic_inc(&inflight_rc);
+
 out:
 	spin_unlock_irqrestore(&reg_lock, flags);
 	return 0;
@@ -74,7 +100,8 @@ int trace_mmap_lock_reg(void)
 			break;
 	}
 
-	--reg_refcount;
+	/* Since we failed, undo the earlier increment. */
+	--reg_types_rc;
 
 	spin_unlock_irqrestore(&reg_lock, flags);
 	return -ENOMEM;
@@ -87,9 +114,23 @@ void trace_mmap_lock_unreg(void)
 
 	spin_lock_irqsave(&reg_lock, flags);
 
-	if (--reg_refcount)
+	/* If the refcount is going 1->0, proceed with freeing buffers. */
+	if (--reg_types_rc)
 		goto out;
 
+	/* This was the last registration; start preventing new events... */
+	WRITE_ONCE(unreg_started, true);
+	/* Remove the registration +1 from the inflight refcount. */
+	atomic_dec(&inflight_rc);
+	/*
+	 * Wait for inflight refcount to be zero (all inflights stopped). Since
+	 * we have a spinlock we can't sleep, so just spin. Because trace events
+	 * are "fast", and because we stop new inflights from starting at this
+	 * point with unreg_started, this should be a short spin.
+	 */
+	while (atomic_read(&inflight_rc))
+		barrier();
+
 	for_each_possible_cpu(cpu) {
 		kfree(per_cpu(memcg_path_buf, cpu));
 	}
@@ -102,6 +143,20 @@ static inline char *get_memcg_path_buf(void)
 {
 	int idx;
 
+	/*
+	 * If unregistration is happening, stop. Yes, this check is racy;
+	 * that's fine. It just means _unreg() might spin waiting for an extra
+	 * event or two. Use-after-free is actually prevented by the refcount.
+	 */
+	if (READ_ONCE(unreg_started))
+		return NULL;
+	/*
+	 * Take a reference, unless the registration +1 has been released
+	 * and there aren't already existing inflights (refcount is zero).
+	 */
+	if (!atomic_inc_not_zero(&inflight_rc))
+		return NULL;
+
 	idx = this_cpu_add_return(memcg_path_buf_idx, MEMCG_PATH_BUF_SIZE) -
 	      MEMCG_PATH_BUF_SIZE;
 	return &this_cpu_read(memcg_path_buf)[idx];
@@ -110,27 +165,42 @@ static inline char *get_memcg_path_buf(void)
 static inline void put_memcg_path_buf(void)
 {
 	this_cpu_sub(memcg_path_buf_idx, MEMCG_PATH_BUF_SIZE);
+	/* We're done with this buffer; drop the reference. */
+	atomic_dec(&inflight_rc);
 }
 
 /*
  * Write the given mm_struct's memcg path to a percpu buffer, and return a
- * pointer to it. If the path cannot be determined, NULL is returned.
+ * pointer to it. If the path cannot be determined, or no buffer was available
+ * (because the trace event is being unregistered), NULL is returned.
  *
  * Note: buffers are allocated per-cpu to avoid locking, so preemption must be
  * disabled by the caller before calling us, and re-enabled only after the
  * caller is done with the pointer.
+ *
+ * The caller must call put_memcg_path_buf() once the buffer is no longer
+ * needed. This must be done while preemption is still disabled.
  */
 static const char *get_mm_memcg_path(struct mm_struct *mm)
 {
+	char *buf = NULL;
 	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
 
-	if (memcg != NULL && likely(memcg->css.cgroup != NULL)) {
-		char *buf = get_memcg_path_buf();
+	if (memcg == NULL)
+		goto out;
+	if (unlikely(memcg->css.cgroup == NULL))
+		goto out_put;
 
-		cgroup_path(memcg->css.cgroup, buf, MEMCG_PATH_BUF_SIZE);
-		return buf;
-	}
-	return NULL;
+	buf = get_memcg_path_buf();
+	if (buf == NULL)
+		goto out_put;
+
+	cgroup_path(memcg->css.cgroup, buf, MEMCG_PATH_BUF_SIZE);
+
+out_put:
+	css_put(&memcg->css);
+out:
+	return buf;
 }
 
 #define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
-- 
2.29.2.454.gaff20da3a2-goog

