All of lore.kernel.org
 help / color / mirror / Atom feed
From: Muchun Song <songmuchun@bytedance.com>
To: Roman Gushchin <guro@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>,
	Christoph Lameter <cl@linux.com>,
	Pekka Enberg <penberg@kernel.org>,
	David Rientjes <rientjes@google.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Shakeel Butt <shakeelb@google.com>,
	Linux Memory Management List <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [External] Re: [PATCH v2] mm: memcg/slab: fix memory leak at non-root kmem_cache destroy
Date: Thu, 16 Jul 2020 23:54:37 +0800	[thread overview]
Message-ID: <CAMZfGtVMY3DqH4XxwkfY0AekD9EFvAN2xaavRUjxYK_s3yA89w@mail.gmail.com> (raw)
In-Reply-To: <20200716154633.GA15421@carbon.DHCP.thefacebook.com>

On Thu, Jul 16, 2020 at 11:46 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Jul 16, 2020 at 01:07:02PM +0800, Muchun Song wrote:
> > On Thu, Jul 16, 2020 at 1:54 AM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Thu, Jul 16, 2020 at 12:50:22AM +0800, Muchun Song wrote:
> > > > If the kmem_cache refcount is greater than one, we should not
> > > > mark the root kmem_cache as dying. If we mark the root kmem_cache
> > > > dying incorrectly, the non-root kmem_cache can never be destroyed.
> > > > It resulted in memory leak when memcg was destroyed. We can use the
> > > > following steps to reproduce.
> > > >
> > > >   1) Use kmem_cache_create() to create a new kmem_cache named A.
> > > >   2) Coincidentally, the kmem_cache A is an alias for kmem_cache B,
> > > >      so the refcount of B is just increased.
> > > >   3) Use kmem_cache_destroy() to destroy the kmem_cache A, just
> > > >      decrease the B's refcount but mark the B as dying.
> > > >   4) Create a new memory cgroup and alloc memory from the kmem_cache
> > > >      B. It leads to create a non-root kmem_cache for allocating memory.
> > > >   5) When destroy the memory cgroup created in the step 4), the
> > > >      non-root kmem_cache can never be destroyed.
> > > >
> > > > If we repeat steps 4) and 5), this will cause a lot of memory leak.
> > > > So only when refcount reach zero, we mark the root kmem_cache as dying.
> > > >
> > > > Fixes: 92ee383f6daa ("mm: fix race between kmem_cache destroy, create and deactivate")
> > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > Reviewed-by: Shakeel Butt <shakeelb@google.com>
> > > > ---
> > > >
> > > > changelog in v2:
> > > >  1) Fix a confusing typo in the commit log.
> > >
> > > Ok, now I see the problem. Thank you for fixing the commit log!
> > >
> > > >  2) Remove flush_memcg_workqueue() for !CONFIG_MEMCG_KMEM.
> > > >  3) Introduce a new helper memcg_set_kmem_cache_dying() to fix a race
> > > >     condition between flush_memcg_workqueue() and slab_unmergeable().
> > > >
> > > >  mm/slab_common.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++-------
> > > >  1 file changed, 47 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > > > index 8c1ffbf7de45..c4958116e3fd 100644
> > > > --- a/mm/slab_common.c
> > > > +++ b/mm/slab_common.c
> > > > @@ -258,6 +258,11 @@ static void memcg_unlink_cache(struct kmem_cache *s)
> > > >               list_del(&s->memcg_params.kmem_caches_node);
> > > >       }
> > > >  }
> > > > +
> > > > +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s)
> > > > +{
> > > > +     return is_root_cache(s) && s->memcg_params.dying;
> > > > +}
> > > >  #else
> > > >  static inline int init_memcg_params(struct kmem_cache *s,
> > > >                                   struct kmem_cache *root_cache)
> > > > @@ -272,6 +277,11 @@ static inline void destroy_memcg_params(struct kmem_cache *s)
> > > >  static inline void memcg_unlink_cache(struct kmem_cache *s)
> > > >  {
> > > >  }
> > > > +
> > > > +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s)
> > > > +{
> > > > +     return false;
> > > > +}
> > > >  #endif /* CONFIG_MEMCG_KMEM */
> > > >
> > > >  /*
> > > > @@ -326,6 +336,13 @@ int slab_unmergeable(struct kmem_cache *s)
> > > >       if (s->refcount < 0)
> > > >               return 1;
> > > >
> > > > +     /*
> > > > +      * If the kmem_cache is dying. We should also skip this
> > > > +      * kmem_cache.
> > > > +      */
> > > > +     if (memcg_kmem_cache_dying(s))
> > > > +             return 1;
> > > > +
> > > >       return 0;
> > > >  }
> > > >
> > > > @@ -886,12 +903,15 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
> > > >       return 0;
> > > >  }
> > > >
> > > > -static void flush_memcg_workqueue(struct kmem_cache *s)
> > > > +static void memcg_set_kmem_cache_dying(struct kmem_cache *s)
> > > >  {
> > > >       spin_lock_irq(&memcg_kmem_wq_lock);
> > > >       s->memcg_params.dying = true;
> > > >       spin_unlock_irq(&memcg_kmem_wq_lock);
> > > > +}
> > > >
> > > > +static void flush_memcg_workqueue(struct kmem_cache *s)
> > > > +{
> > > >       /*
> > > >        * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
> > > >        * sure all registered rcu callbacks have been invoked.
> > > > @@ -923,10 +943,6 @@ static inline int shutdown_memcg_caches(struct kmem_cache *s)
> > > >  {
> > > >       return 0;
> > > >  }
> > > > -
> > > > -static inline void flush_memcg_workqueue(struct kmem_cache *s)
> > > > -{
> > > > -}
> > > >  #endif /* CONFIG_MEMCG_KMEM */
> > > >
> > > >  void slab_kmem_cache_release(struct kmem_cache *s)
> > > > @@ -944,8 +960,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
> > > >       if (unlikely(!s))
> > > >               return;
> > > >
> > > > -     flush_memcg_workqueue(s);
> > > > -
> > > >       get_online_cpus();
> > > >       get_online_mems();
> > > >
> > > > @@ -955,6 +969,32 @@ void kmem_cache_destroy(struct kmem_cache *s)
> > > >       if (s->refcount)
> > > >               goto out_unlock;
> > > >
> > > > +#ifdef CONFIG_MEMCG_KMEM
> > > > +     memcg_set_kmem_cache_dying(s);
> > > > +
> > > > +     mutex_unlock(&slab_mutex);
> > >
> > > Hm, but in theory s->refcount can be increased here?
> >
> > I have tried my best to read all the codes that operate on s->refcount.
> > There is only one place which increases the s->refcount, it is the
> > __kmem_cache_alias(). If the kmem cache is dying, the slab_unmergeable()
> > can never return true for the dying kmem cache because it is the same slab_mutex
> > protection, so I think that there is not a problem, right?
>
> So the problem is that you're checking s->refcount under slab_mutex,
> then you're releasing the mutex and then set the dying bit. But nothing prevents

Maybe you miss something. The dying bit is set in the
memcg_set_kmem_cache_dying()
which is under the slab_mutex protection.  So I think there is no problem.

> someone else to clone the kmem_cache on a different CPU as soon as you've released
> the slab_mutex.
>
> However, I think there is a simple solution for this problem.
> Can you, please, check the following patch?
>
> --
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 37d48a56431d..d6661e7af01e 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -326,6 +326,11 @@ int slab_unmergeable(struct kmem_cache *s)
>         if (s->refcount < 0)
>                 return 1;
>
> +#ifdef CONFIG_MEMCG_KMEM
> +       if (s->memcg_params.dying)
> +               return 1;
> +#endif
> +
>         return 0;
>  }
>
> @@ -888,9 +893,20 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
>
>  static void flush_memcg_workqueue(struct kmem_cache *s)
>  {
> -       spin_lock_irq(&memcg_kmem_wq_lock);
> -       s->memcg_params.dying = true;
> -       spin_unlock_irq(&memcg_kmem_wq_lock);
> +       bool skip = false;
> +
> +       mutex_lock(&slab_mutex);
> +       if (s->refcount == 1) {
> +               spin_lock_irq(&memcg_kmem_wq_lock);
> +               s->memcg_params.dying = true;
> +               spin_unlock_irq(&memcg_kmem_wq_lock);
> +       } else {
> +               skip = true;
> +       }
> +       mutex_unlock(&slab_mutex);
> +
> +       if (skip)
> +               return;
>
>         /*
>          * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
> --
> 2.26.2
>
>
> --
> >
> > > So it doesn't solve the problem completely, but makes it less probable, right?
> > >
> > > I wonder if it's possible to (additionally) protect s->refcount with a
> > > memcg_kmem_wq_lock, so that we can check it in the context of flush_memcg_workqueue()?
> > >
> > > > +
> > > > +     put_online_mems();
> > > > +     put_online_cpus();
> > > > +
> > > > +     flush_memcg_workqueue(s);
> > > > +
> > > > +     get_online_cpus();
> > > > +     get_online_mems();
> > > > +
> > > > +     mutex_lock(&slab_mutex);
> > > > +
> > > > +     if (WARN(s->refcount,
> > > > +              "kmem_cache_destroy %s: Slab cache is still referenced\n",
> > > > +              s->name)) {
> > > > +             /*
> > > > +              * Reset the dying flag setted by memcg_set_kmem_cache_dying().
> > > > +              */
> > > > +             s->memcg_params.dying = false;
> > > > +             goto out_unlock;
> > > > +     }
> > > > +#endif
> > > > +
> > > >       err = shutdown_memcg_caches(s);
> > > >       if (!err)
> > > >               err = shutdown_cache(s);
> > > > --
> > > > 2.11.0
> > > >
> > >
> > > Other than the problem above your patch looks really good to me, however we should
> > > be really careful here, as it should in theory be back-ported to a big number
> > > of old stable kernels. And because it's (hopefully) fixed in 5.9, it's a backport-only
> > > patch.
> > >
> > > So I wonder if we can mitigate the problem by disabling cache sharing for some
> > > specific kmem_caches instead? Like for all with SLAB_ACCOUNT and maybe for all except
> > > a hard-coded list (if kmem accounting is enabled). Do you mind sharing any details
> > > on how this problem reveals itself in the real life?
> >
> > One day I debug another problem, coincidentally I executed the
> > following command:
> >
> >      cat /sys/kernel/debug/memcg_slabinfo | grep deact
> >
> > And I found an interesting thing, a lot of non-root kmem_cache's
> > active_objs is zero.
> > But why are they not destroyed? So I launched a detailed
> > investigation. Finally I found
> > out that the reason is that we mark the root kmem_cache as dying
> > incorrectly. I have
> > applied this patch on our server for several weeks. And this problem
> > has disappeared.
>
> Got it, it's a great catch!
>
> Thank you!
>
> Roman



-- 
Yours,
Muchun

  reply	other threads:[~2020-07-16 15:55 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-15 16:50 [PATCH v2] mm: memcg/slab: fix memory leak at non-root kmem_cache destroy Muchun Song
2020-07-15 17:54 ` Roman Gushchin
2020-07-16  5:07   ` [External] " Muchun Song
2020-07-16  5:07     ` Muchun Song
2020-07-16 15:46     ` Roman Gushchin
2020-07-16 15:54       ` Muchun Song [this message]
2020-07-16 15:54         ` Muchun Song
2020-07-16 16:16         ` Roman Gushchin
2020-07-16 16:29           ` Muchun Song
2020-07-16 16:29             ` Muchun Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAMZfGtVMY3DqH4XxwkfY0AekD9EFvAN2xaavRUjxYK_s3yA89w@mail.gmail.com \
    --to=songmuchun@bytedance.com \
    --cc=akpm@linux-foundation.org \
    --cc=cl@linux.com \
    --cc=guro@fb.com \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=penberg@kernel.org \
    --cc=rientjes@google.com \
    --cc=shakeelb@google.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.