From: Vladimir Davydov <vdavydov.dev@gmail.com>
To: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com, Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>, Rik van Riel <riel@surriel.com>,
	Shakeel Butt <shakeelb@google.com>,
	Christoph Lameter <cl@linux.com>,
	cgroups@vger.kernel.org, Waiman Long <longman@redhat.com>
Subject: Re: [PATCH v5 5/7] mm: rework non-root kmem_cache lifecycle management
Date: Tue, 28 May 2019 20:08:28 +0300
Message-ID: <20190528170828.zrkvcdsj3d3jzzzo@esperanza>
In-Reply-To: <20190521200735.2603003-6-guro@fb.com>

Hello Roman,

On Tue, May 21, 2019 at 01:07:33PM -0700, Roman Gushchin wrote:
> This commit makes several important changes in the lifecycle
> of a non-root kmem_cache, which also affect the lifecycle
> of a memory cgroup.
> 
> Currently each charged slab page has a page->mem_cgroup pointer
> to the memory cgroup and holds a reference to it.
> Kmem_caches are held by the memcg and are released with it.
> It means that no kmem_cache can be released unless at least one
> reference to the memcg exists, which is not optimal.
> 
> So the current scheme can be illustrated as:
> page->mem_cgroup->kmem_cache.
> 
> To implement the slab memory reparenting we need to invert the scheme
> into: page->kmem_cache->mem_cgroup.
> 
> Let's make every page hold a reference to the kmem_cache (we
> already have a stable pointer), and make each kmem_cache hold a single
> reference to the memory cgroup.

Is there any reason why we can't reference both the mem cgroup and the
kmem cache for each charged kmem page? I mean,

  page->mem_cgroup references mem_cgroup
  page->kmem_cache references kmem_cache
  mem_cgroup references kmem_cache while it's online

TBH, it seems to me that not taking a reference to the mem cgroup per
charged kmem page makes the code less straightforward, e.g. as you
mentioned in the commit log, we have to use mod_lruvec_state() for memcg
pages and mod_lruvec_page_state() for root pages.
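
With both references, the charge path could look roughly like this (just
a sketch on top of your patch; error handling and the root_cache case
are omitted, not tested):

  static int memcg_charge_slab(struct page *page, gfp_t gfp, int order,
                               struct kmem_cache *s)
  {
          struct mem_cgroup *memcg = s->memcg_params.memcg;
          int ret;

          ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
          if (ret)
                  return ret;

          /* the page pins both objects */
          css_get(&memcg->css);                    /* -> mem_cgroup */
          percpu_ref_get(&s->memcg_params.refcnt); /* -> kmem_cache */
          page->mem_cgroup = memcg;
          return 0;
  }

Then page->mem_cgroup would stay valid for charged pages, and the
mod_lruvec_page_state() helpers would keep working for them.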

> 
> To make this possible we need to introduce a new percpu refcounter
> for non-root kmem_caches. The counter is initialized in percpu
> mode, and is switched to atomic mode after deactivation, so we never
> shut down an active cache. The counter is bumped for every charged page
> and also for every running allocation. So the kmem_cache can't
> be released until all allocations complete.
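
(To make sure I understand: in terms of the percpu_ref API the lifecycle
above would map to something like this; just an illustration, not the
actual patch code:

  percpu_ref_init(&s->memcg_params.refcnt,
                  kmemcg_queue_cache_shutdown, 0, GFP_KERNEL);
                                             /* starts in percpu mode */
  percpu_ref_get(&s->memcg_params.refcnt);   /* charged page or running
                                                allocation */
  percpu_ref_put(&s->memcg_params.refcnt);   /* uncharge or completion */
  percpu_ref_kill(&s->memcg_params.refcnt);  /* deactivation: switch to
                                                atomic mode; the release
                                                callback fires when the
                                                count drops to zero */
)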
> 
> To shut down inactive empty kmem_caches, let's reuse the
> infrastructure of the RCU-delayed work queue, used previously for
> the deactivation. After the generalization, it's perfectly suited
> for our needs.
> 
> Since we can now release a kmem_cache at any moment after
> deactivation, let's call sysfs_slab_remove() only from the shutdown
> path. This makes the deactivation path simpler.

But a cache can be dangling for quite a while after the cgroup was taken
down, even after this patch, because there can still be pages charged to
it. The reason why we call sysfs_slab_remove() is to delete associated
files from sysfs ASAP. I'd try to preserve the current behavior if
possible.
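
I.e. I'd keep the shrink-then-remove logic in the deactivation path,
roughly like this (a sketch, modulo the renaming done earlier in the
series):

  /* mm/slub.c */
  void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
  {
          /*
           * If the cache is empty after shrinking, delete its sysfs
           * files right away instead of waiting for the last charged
           * page to go away.
           */
          if (!__kmem_cache_shrink(s))
                  sysfs_slab_remove(s);
  }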

> 
> Because we don't set the page->mem_cgroup pointer, we need to change
> the way memcg-level stats work for slab pages. We can't use
> mod_lruvec_page_state() helpers anymore, so switch over to
> mod_lruvec_state().

> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 4e5b4292a763..8d68de4a2341 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -727,9 +737,31 @@ static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
>  	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
>  }
>  
> +static void kmemcg_cache_shutdown_after_rcu(struct kmem_cache *s)
> +{
> +	WARN_ON(shutdown_cache(s));
> +}
> +
> +static void kmemcg_queue_cache_shutdown(struct percpu_ref *percpu_ref)
> +{
> +	struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache,
> +					    memcg_params.refcnt);
> +
> +	spin_lock(&memcg_kmem_wq_lock);

This code may be called from irq context AFAIU, so you should use an
irq-safe primitive.
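
I.e. something like:

  unsigned long flags;

  spin_lock_irqsave(&memcg_kmem_wq_lock, flags);
  ...
  spin_unlock_irqrestore(&memcg_kmem_wq_lock, flags);

(and the same at the other memcg_kmem_wq_lock call sites, otherwise
lockdep should complain about it).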

> +	if (s->memcg_params.root_cache->memcg_params.dying)
> +		goto unlock;
> +
> +	WARN_ON(s->memcg_params.work_fn);
> +	s->memcg_params.work_fn = kmemcg_cache_shutdown_after_rcu;
> +	call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);

I may be totally wrong here, but I have a suspicion we don't really need
rcu here.

As I see it, you add this code so as to prevent memcg_kmem_get_cache
from dereferencing a destroyed kmem cache. Can't we continue using
css_tryget_online for that? I mean, take rcu_read_lock() and try to get
a css reference. If you succeed, then the cgroup must be online, and
css_offline won't be called until you unlock rcu, right? This means that
the cache is guaranteed to be alive until then, because the cgroup holds
a reference to all its kmem caches until it's taken offline.
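
I.e. something along these lines in memcg_kmem_get_cache() (just a
sketch, I haven't thought through how it interacts with the rest of the
series):

  rcu_read_lock();
  memcg = mem_cgroup_from_current();
  ...
  if (!css_tryget_online(&memcg->css)) {
          rcu_read_unlock();
          return cachep;          /* fall back to the root cache */
  }
  /*
   * The cgroup was online when we took the reference and we are
   * still inside the rcu read section, so css_offline can't have
   * run, and hence the memcg cache can't have been destroyed.
   */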

> +unlock:
> +	spin_unlock(&memcg_kmem_wq_lock);
> +}
> +
>  static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
>  {
>  	__kmemcg_cache_deactivate_after_rcu(s);
> +	percpu_ref_kill(&s->memcg_params.refcnt);
>  }
>  
>  static void kmemcg_cache_deactivate(struct kmem_cache *s)
> @@ -854,8 +861,15 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
>  
>  static void flush_memcg_workqueue(struct kmem_cache *s)
>  {
> +	/*
> +	 * memcg_params.dying is synchronized using slab_mutex AND
> +	 * memcg_kmem_wq_lock spinlock, because it's not always
> +	 * possible to grab slab_mutex.
> +	 */
>  	mutex_lock(&slab_mutex);
> +	spin_lock(&memcg_kmem_wq_lock);
>  	s->memcg_params.dying = true;
> +	spin_unlock(&memcg_kmem_wq_lock);

I would switch completely from the mutex to the new spin lock;
acquiring them both looks weird.
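
I.e. just (a sketch):

  static void flush_memcg_workqueue(struct kmem_cache *s)
  {
          spin_lock_irq(&memcg_kmem_wq_lock);
          s->memcg_params.dying = true;
          spin_unlock_irq(&memcg_kmem_wq_lock);
          ...
  }

with all readers of the dying flag checking it under the spin lock
instead of slab_mutex.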

>  	mutex_unlock(&slab_mutex);
>  
>  	/*
