From: Shakeel Butt <shakeelb@google.com>
To: Axel Rasmussen <axelrasmussen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Chinwen Chang <chinwen.chang@mediatek.com>,
	Daniel Jordan <daniel.m.jordan@oracle.com>,
	David Rientjes <rientjes@google.com>,
	Davidlohr Bueso <dbueso@suse.de>, Ingo Molnar <mingo@redhat.com>,
	Jann Horn <jannh@google.com>,
	Laurent Dufour <ldufour@linux.ibm.com>,
	Michel Lespinasse <walken@google.com>,
	Stephen Rothwell <sfr@canb.auug.org.au>,
	Steven Rostedt <rostedt@goodmis.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Yafang Shao <laoar.shao@gmail.com>,
	"David S . Miller" <davem@davemloft.net>,
	dsahern@kernel.org,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Jakub Kicinski <kuba@kernel.org>,
	liuhangbin@gmail.com, Tejun Heo <tj@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux MM <linux-mm@kvack.org>
Subject: Re: [PATCH] mm: mmap_lock: fix use-after-free race and css ref leak in tracepoints
Date: Mon, 30 Nov 2020 17:33:57 -0800
Message-ID: <CALvZod42+o7naLOkpo9Jngmhru-aR4K6RCuTk7TukCikAYrDbQ@mail.gmail.com>
In-Reply-To: <20201130233504.3725241-1-axelrasmussen@google.com>

On Mon, Nov 30, 2020 at 3:43 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> syzbot reported[1] a use-after-free introduced in 0f818c4bc1f3. The bug
> is that an ongoing trace event might race with the tracepoint being
> disabled (and therefore the _unreg() callback being called). Consider
> this ordering:
>
> T1: trace event fires, get_mm_memcg_path() is called
> T1: get_memcg_path_buf() returns a buffer pointer
> T2: trace_mmap_lock_unreg() is called, buffers are freed
> T1: cgroup_path() is called with the now-freed buffer

Any reason to use cgroup_path instead of cgroup_ino? There are other
examples of tracepoints using cgroup_ino, with no need to allocate
buffers. Also, cgroup namespaces might complicate the path usage.
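
Something along these lines (just a rough, untested sketch; the helper
name is made up and this is not from the patch) would avoid the per-cpu
buffers and the _reg()/_unreg() lifetime problem entirely:

#include <linux/cgroup.h>
#include <linux/memcontrol.h>
#include <linux/mm.h>

/* Sketch only: record the memcg inode number instead of formatting a
 * path, so the trace event needs no buffer at all.
 */
static unsigned long get_mm_memcg_ino(struct mm_struct *mm)
{
	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
	unsigned long ino = 0;

	if (memcg) {
		if (likely(memcg->css.cgroup))
			ino = cgroup_ino(memcg->css.cgroup);
		/* Drop the ref taken by get_mem_cgroup_from_mm(). */
		css_put(&memcg->css);
	}
	return ino;
}

The inode number is also independent of the caller's cgroup namespace,
which sidesteps the path question.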

>
> The solution in this commit is to modify trace_mmap_lock_unreg() to
> first stop new buffers from being handed out, and then to wait (spin)
> until any existing buffer references are dropped (i.e., those trace
> events complete).
>
> I have a simple reproducer program which spins up two pools of threads,
> doing the following in a tight loop:
>
>   Pool 1:
>   mmap(NULL, 4096, PROT_READ | PROT_WRITE,
>        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
>   munmap()
>
>   Pool 2:
>   echo 1 > /sys/kernel/debug/tracing/events/mmap_lock/enable
>   echo 0 > /sys/kernel/debug/tracing/events/mmap_lock/enable
>
> This triggers the use-after-free very quickly. With this patch, I let it
> run for an hour without any BUGs.
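
(For anyone wanting to try this, a userspace sketch along the lines
described above might look like the following; the thread counts, the
tracefs path, and the error handling are my guesses, not taken from
Axel's actual program.)

/* Build: gcc -O2 -pthread repro.c -o repro
 * Run as root with tracefs/debugfs mounted.
 */
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define ENABLE_PATH "/sys/kernel/debug/tracing/events/mmap_lock/enable"

/* Pool 1: hammer mmap()/munmap() in a tight loop. */
static void *map_loop(void *arg)
{
	for (;;) {
		void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p != MAP_FAILED)
			munmap(p, 4096);
	}
	return NULL;
}

/* Pool 2: toggle the mmap_lock trace events on and off. */
static void *toggle_loop(void *arg)
{
	int fd = open(ENABLE_PATH, O_WRONLY);

	if (fd < 0)
		return NULL;
	for (;;) {
		if (write(fd, "1", 1) < 0 || write(fd, "0", 1) < 0)
			break;
	}
	close(fd);
	return NULL;
}

int main(void)
{
	pthread_t threads[8];
	int i;

	for (i = 0; i < 4; i++)
		pthread_create(&threads[i], NULL, map_loop, NULL);
	for (; i < 8; i++)
		pthread_create(&threads[i], NULL, toggle_loop, NULL);

	/* Runs until the kernel hits a BUG (or forever with the fix). */
	pthread_join(threads[0], NULL);
	return 0;
}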
>
> While fixing this, I also noticed and fixed a css ref leak. Previously
> we called get_mem_cgroup_from_mm(), but we never called css_put() to
> release that reference. get_mm_memcg_path() now does this properly.
>
> [1]: https://syzkaller.appspot.com/bug?extid=19e6dd9943972fa1c58a
>
> Fixes: 0f818c4bc1f3 ("mm: mmap_lock: add tracepoints around lock acquisition")

The original patch is still in the mm tree, so its SHA1 is not yet
stable. Usually Andrew squashes such fixes into the original patch.

> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> ---
>  mm/mmap_lock.c | 100 +++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 85 insertions(+), 15 deletions(-)
>
> diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> index 12af8f1b8a14..be38dc58278b 100644
> --- a/mm/mmap_lock.c
> +++ b/mm/mmap_lock.c
> @@ -3,6 +3,7 @@
>  #include <trace/events/mmap_lock.h>
>
>  #include <linux/mm.h>
> +#include <linux/atomic.h>
>  #include <linux/cgroup.h>
>  #include <linux/memcontrol.h>
>  #include <linux/mmap_lock.h>
> @@ -18,13 +19,28 @@ EXPORT_TRACEPOINT_SYMBOL(mmap_lock_released);
>  #ifdef CONFIG_MEMCG
>
>  /*
> - * Our various events all share the same buffer (because we don't want or need
> - * to allocate a set of buffers *per event type*), so we need to protect against
> - * concurrent _reg() and _unreg() calls, and count how many _reg() calls have
> - * been made.
> + * This is unfortunately complicated... _reg() and _unreg() may be called
> + * in parallel, separately for each of our three event types. To save memory,
> + * all of the event types share the same buffers. Furthermore, trace events
> + * might happen in parallel with _unreg(); we need to ensure we don't free the
> + * buffers before all inflights have finished. Because these events happen
> + * "frequently", we also want to prevent new inflights from starting once the
> + * _unreg() process begins. And, for performance reasons, we want to avoid any
> + * locking in the trace event path.
> + *
> + * So:
> + *
> + * - Use a spinlock to serialize _reg() and _unreg() calls.
> + * - Keep track of nested _reg() calls with a lock-protected counter.
> + * - Define a flag indicating whether or not unregistration has begun (and
> + *   therefore that there should be no new buffer uses going forward).
> + * - Keep track of inflight buffer users with a reference count.
>   */
>  static DEFINE_SPINLOCK(reg_lock);
> -static int reg_refcount;
> +static int reg_types_rc; /* Protected by reg_lock. */
> +static bool unreg_started; /* Doesn't need synchronization. */
> +/* atomic_t instead of refcount_t, as we want ordered inc without locks. */
> +static atomic_t inflight_rc = ATOMIC_INIT(0);
>
>  /*
>   * Size of the buffer for memcg path names. Ignoring stack trace support,
> @@ -46,9 +62,14 @@ int trace_mmap_lock_reg(void)
>         unsigned long flags;
>         int cpu;
>
> +       /*
> +        * Serialize _reg() and _unreg(). Without this, e.g. _unreg() might
> +        * start cleaning up while _reg() is only partially completed.
> +        */
>         spin_lock_irqsave(&reg_lock, flags);
>
> -       if (reg_refcount++)
> +       /* If the refcount is going 0->1, proceed with allocating buffers. */
> +       if (reg_types_rc++)
>                 goto out;
>
>         for_each_possible_cpu(cpu) {
> @@ -62,6 +83,11 @@ int trace_mmap_lock_reg(void)
>                 per_cpu(memcg_path_buf_idx, cpu) = 0;
>         }
>
> +       /* Reset unreg_started flag, allowing new trace events. */
> +       WRITE_ONCE(unreg_started, false);
> +       /* Add the registration +1 to the inflight refcount. */
> +       atomic_inc(&inflight_rc);
> +
>  out:
>         spin_unlock_irqrestore(&reg_lock, flags);
>         return 0;
> @@ -74,7 +100,8 @@ int trace_mmap_lock_reg(void)
>                         break;
>         }
>
> -       --reg_refcount;
> +       /* Since we failed, undo the earlier increment. */
> +       --reg_types_rc;
>
>         spin_unlock_irqrestore(&reg_lock, flags);
>         return -ENOMEM;
> @@ -87,9 +114,23 @@ void trace_mmap_lock_unreg(void)
>
>         spin_lock_irqsave(&reg_lock, flags);
>
> -       if (--reg_refcount)
> +       /* If the refcount is going 1->0, proceed with freeing buffers. */
> +       if (--reg_types_rc)
>                 goto out;
>
> +       /* This was the last registration; start preventing new events... */
> +       WRITE_ONCE(unreg_started, true);
> +       /* Remove the registration +1 from the inflight refcount. */
> +       atomic_dec(&inflight_rc);
> +       /*
> +        * Wait for inflight refcount to be zero (all inflights stopped). Since
> +        * we have a spinlock we can't sleep, so just spin. Because trace events
> +        * are "fast", and because we stop new inflights from starting at this
> +        * point with unreg_started, this should be a short spin.
> +        */
> +       while (atomic_read(&inflight_rc))
> +               barrier();
> +
>         for_each_possible_cpu(cpu) {
>                 kfree(per_cpu(memcg_path_buf, cpu));
>         }
> @@ -102,6 +143,20 @@ static inline char *get_memcg_path_buf(void)
>  {
>         int idx;
>
> +       /*
> +        * If unregistration is happening, stop. Yes, this check is racy;
> +        * that's fine. It just means _unreg() might spin waiting for an extra
> +        * event or two. Use-after-free is actually prevented by the refcount.
> +        */
> +       if (READ_ONCE(unreg_started))
> +               return NULL;
> +       /*
> +        * Take a reference, unless the registration +1 has been released
> +        * and there aren't already existing inflights (refcount is zero).
> +        */
> +       if (!atomic_inc_not_zero(&inflight_rc))
> +               return NULL;
> +
>         idx = this_cpu_add_return(memcg_path_buf_idx, MEMCG_PATH_BUF_SIZE) -
>               MEMCG_PATH_BUF_SIZE;
>         return &this_cpu_read(memcg_path_buf)[idx];
> @@ -110,27 +165,42 @@ static inline char *get_memcg_path_buf(void)
>  static inline void put_memcg_path_buf(void)
>  {
>         this_cpu_sub(memcg_path_buf_idx, MEMCG_PATH_BUF_SIZE);
> +       /* We're done with this buffer; drop the reference. */
> +       atomic_dec(&inflight_rc);
>  }
>
>  /*
>   * Write the given mm_struct's memcg path to a percpu buffer, and return a
> - * pointer to it. If the path cannot be determined, NULL is returned.
> + * pointer to it. If the path cannot be determined, or no buffer was available
> + * (because the trace event is being unregistered), NULL is returned.
>   *
>   * Note: buffers are allocated per-cpu to avoid locking, so preemption must be
>   * disabled by the caller before calling us, and re-enabled only after the
>   * caller is done with the pointer.
> + *
> + * The caller must call put_memcg_path_buf() once the buffer is no longer
> + * needed. This must be done while preemption is still disabled.
>   */
>  static const char *get_mm_memcg_path(struct mm_struct *mm)
>  {
> +       char *buf = NULL;
>         struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
>
> -       if (memcg != NULL && likely(memcg->css.cgroup != NULL)) {
> -               char *buf = get_memcg_path_buf();
> +       if (memcg == NULL)
> +               goto out;
> +       if (unlikely(memcg->css.cgroup == NULL))
> +               goto out_put;
>
> -               cgroup_path(memcg->css.cgroup, buf, MEMCG_PATH_BUF_SIZE);
> -               return buf;
> -       }
> -       return NULL;
> +       buf = get_memcg_path_buf();
> +       if (buf == NULL)
> +               goto out_put;
> +
> +       cgroup_path(memcg->css.cgroup, buf, MEMCG_PATH_BUF_SIZE);
> +
> +out_put:
> +       css_put(&memcg->css);
> +out:
> +       return buf;
>  }
>
>  #define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
> --
> 2.29.2.454.gaff20da3a2-goog
>
