Re: [External] Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal

From: Roman Gushchin <guro@fb.com>
To: Muchun Song <songmuchun@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>,
	Matthew Wilcox <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Shakeel Butt <shakeelb@google.com>,
	Yang Shi <shy828301@gmail.com>, <alexs@kernel.org>,
	Alexander Duyck <alexander.h.duyck@linux.intel.com>,
	Wei Yang <richard.weiyang@gmail.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>
Subject: Re: [External] Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
Date: Fri, 30 Apr 2021 20:10:46 -0700	[thread overview]
Message-ID: <YIzGtpAwO+2YOPA6@carbon.dhcp.thefacebook.com> (raw)
In-Reply-To: <CAMZfGtXawtMT4JfBtDLZ+hES4iEHFboe2UgJee_s-NhZR5faAw@mail.gmail.com>

On Fri, Apr 30, 2021 at 04:32:39PM +0800, Muchun Song wrote:
> On Fri, Apr 30, 2021 at 11:27 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Apr 29, 2021 at 06:39:40PM -0700, Roman Gushchin wrote:
> > > On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> > > > On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > > > > In our server, we found a suspected memory leak problem. The kmalloc-32
> > > > > consumes more than 6GB of memory. Other kmem_caches consume less than 2GB
> > > > > memory.
> > > > >
> > > > > After our in-depth analysis, the memory consumption of kmalloc-32 slab
> > > > > cache is the cause of list_lru_one allocation.
> > > > >
> > > > >   crash> p memcg_nr_cache_ids
> > > > >   memcg_nr_cache_ids = $2 = 24574
> > > > >
> > > > > memcg_nr_cache_ids is very large and memory consumption of each list_lru
> > > > > can be calculated with the following formula.
> > > > >
> > > > >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> > > > >
> > > > > There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
> > > > >
> > > > >   crash> list super_blocks | wc -l
> > > > >   952
> > > >
> > > > The more I see people trying to work around this, the more I think
> > > > that the way memcgs have been grafted into the list_lru is back to
> > > > front.
> > > >
> > > > We currently allocate scope for every memcg to be able to tracked on
> > > > every not on every superblock instantiated in the system, regardless
> > > > of whether that superblock is even accessible to that memcg.
> > > >
> > > > These huge memcg counts come from container hosts where memcgs are
> > > > confined to just a small subset of the total number of superblocks
> > > > that instantiated at any given point in time.
> > > >
> > > > IOWs, for these systems with huge container counts, list_lru does
> > > > not need the capability of tracking every memcg on every superblock.
> > > >
> > > > What it comes down to is that the list_lru is only needed for a
> > > > given memcg if that memcg is instatiating and freeing objects on a
> > > > given list_lru.
> > > >
> > > > Which makes me think we should be moving more towards "add the memcg
> > > > to the list_lru at the first insert" model rather than "instantiate
> > > > all at memcg init time just in case". The model we originally came
> > > > up with for supprting memcgs is really starting to show it's limits,
> > > > and we should address those limitations rahter than hack more
> > > > complexity into the system that does nothing to remove the
> > > > limitations that are causing the problems in the first place.
> > >
> > > I totally agree.
> > >
> > > It looks like the initial implementation of the whole kernel memory accounting
> > > and memcg-aware shrinkers was based on the idea that the number of memory
> > > cgroups is relatively small and stable.
> >
> > Yes, that was one of the original assumptions - tens to maybe low
> > hundreds of memcgs at most. The other was that memcgs weren't NUMA
> > aware, and so would only need a single LRU list per memcg. Hence the
> > total overhead even with "lots" of memcgsi and superblocks the
> > overhead wasn't that great.
> >
> > Then came "memcgs need to be NUMA aware" because of the size of the
> > machines they were being use for resrouce management in, and that
> > greatly increased the per-memcg, per LRU overhead. Now we're talking
> > about needing to support a couple of orders of magnitude more memcgs
> > and superblocks than were originally designed for.
> >
> > So, really, we're way beyond the original design scope of this
> > subsystem now.
> 
> Got it. So it is better to allocate the structure of the list_lru_node
> dynamically. We should only allocate it when it is really demanded.
> But allocating memory by using GFP_ATOMIC in list_lru_add() is
> not a good idea. So we should allocate the memory out of
> list_lru_add(). I can propose an approach that may work.
> 
> Before start, we should know about the following rules of list lrus.
> 
> - Only objects allocated with __GFP_ACCOUNT need to allocate
>   the struct list_lru_node.
> - The caller of allocating memory must know which list_lru the
>   object will insert.
> 
> So we can allocate struct list_lru_node when allocating the
> object instead of allocating it when list_lru_add().  It is easy, because
> we already know the list_lru and memcg which the object belongs
> to. So we can introduce a new helper to allocate the object and
> list_lru_node. Like below.
> 
> void *list_lru_kmem_cache_alloc(struct list_lru *lru, struct kmem_cache *s,
>                                 gfp_t gfpflags)
> {
>         void *ret = kmem_cache_alloc(s, gfpflags);
> 
>         if (ret && (gfpflags & __GFP_ACCOUNT)) {
>                 struct mem_cgroup *memcg = mem_cgroup_from_obj(ret);
> 
>                 if (mem_cgroup_is_root(memcg))
>                         return ret;
> 
>                 /* Allocate per-memcg list_lru_node, if it already
> allocated, do nothing. */
>                 memcg_list_lru_node_alloc(lru, memcg,
> page_to_nid(virt_to_page(ret)), gfpflags);
>         }
> 
>         return ret;
> }
> 
> If the user wants to insert the allocated object to its lru list in
> the feature. The
> user should use list_lru_kmem_cache_alloc() instead of kmem_cache_alloc().
> I have looked at the code closely. There are 3 different kmem_caches that
> need to use this new API to allocate memory. They are inode_cachep,
> dentry_cache and radix_tree_node_cachep. I think that it is easy to migrate.
> 
> Hi Roman and Dave,
> 
> What do you think about this approach? If there is no problem, I can provide
> a preliminary patchset within a week.

At a very first glance it looks similar to what Bharata proposed, but with some
additional tricks. It would be nice to find a common ground here. In general,
I think it's a right direction.

In general I believe we might need some more fundamental changes, but I don't
have a specific recipe yet. I need to think more of it.

Thanks!