From: Roman Gushchin <guro@fb.com>
To: Greg Thelen <gthelen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Shakeel Butt <shakeelb@google.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Kernel Team <Kernel-team@fb.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>, Rik van Riel <riel@surriel.com>,
	Christoph Lameter <cl@linux.com>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>
Subject: Re: [PATCH v4 0/7] mm: reparent slab memory on cgroup removal
Date: Wed, 5 Jun 2019 17:33:00 +0000	[thread overview]
Message-ID: <20190605173256.GB10098@tower.DHCP.thefacebook.com> (raw)
In-Reply-To: <xr93ef48v5ub.fsf@gthelen.svl.corp.google.com>

On Wed, Jun 05, 2019 at 12:39:24AM -0700, Greg Thelen wrote:
> Roman Gushchin <guro@fb.com> wrote:
> 
> > # Why do we need this?
> >
> > We've noticed that the number of dying cgroups is steadily growing on most
> > of our hosts in production. The following investigation revealed an issue
> > in userspace memory reclaim code [1], accounting of kernel stacks [2],
> > and also the main reason: slab objects.
> >
> > The underlying problem is quite simple: any page charged
> > to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless
> > all charged pages are gone. If a slab object is actively used by other cgroups,
> > it won't be reclaimed, and will prevent the origin cgroup from being reclaimed.
> >
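
A toy model of that pinning in plain userspace C (hypothetical names, not
kernel code): a single charged object that is still in use by another
cgroup is enough to keep a dying memcg pinned indefinitely.

#include <assert.h>
#include <stdio.h>

struct mem_cgroup {
	int refs;	/* every charged page holds one reference */
	int online;
};

static void charge(struct mem_cgroup *memcg)
{
	memcg->refs++;
}

int main(void)
{
	struct mem_cgroup memcg = { .refs = 1, .online = 1 }; /* css ref */

	charge(&memcg);		/* e.g. a dentry charged to this cgroup */
	memcg.online = 0;	/* rmdir: the cgroup is now "dying" */

	/* The dentry is still used by another cgroup, so it is never
	 * reclaimed, and the dying memcg can never drop to zero refs: */
	assert(memcg.refs > 1);
	printf("dying memcg pinned by %d charged object(s)\n",
	       memcg.refs - 1);
	return 0;
}
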
> > Slab objects, and first of all the vfs cache, are shared between
> > cgroups that use the same underlying fs, and, what's even more
> > important, they are shared between multiple generations of the same
> > workload. So if something runs periodically, each time in a new cgroup
> > (as systemd does), we accumulate many dying cgroups.
> >
> > Strictly speaking, pagecache isn't different here, but there is one
> > key distinction: we disable protection and apply some extra pressure
> > on the LRUs of dying cgroups, and these LRUs contain all their charged
> > pages. My experiments show that with kernel memory accounting disabled
> > the number of dying cgroups stabilizes at a relatively small value
> > (~100, depending on memory pressure and the cgroup creation rate),
> > while with kernel memory accounting enabled it grows fairly steadily
> > into the thousands.
> >
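
For anyone reproducing this comparison: kernel memory accounting can be
disabled at boot with the cgroup.memory kernel parameter (the exact setup
used in these experiments isn't stated, but presumably something like):

    cgroup.memory=nokmem
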
> > Memory cgroups are quite complex and large objects (mostly due to
> > percpu stats), so this leads to noticeable memory losses. The memory
> > occupied by dying cgroups is measured in hundreds of megabytes; I've
> > even seen a host with more than 100GB of memory wasted on dying
> > cgroups. This degrades performance over uptime and generally limits
> > the usage of cgroups.
> >
> > My previous attempt [3] to fix the problem by applying extra pressure
> > on slab shrinker lists caused regressions with xfs and ext4 and has
> > been reverted [4]. The subsequent attempts to find the right balance
> > [5, 6] were not successful.
> >
> > So instead of trying to find a balance that may not exist, let's
> > reparent the accounted slab caches to the parent cgroup on cgroup
> > removal.
> >
> >
> > # Implementation approach
> >
> > There is, however, a significant problem with reparenting slab
> > memory: there is no list of charged pages. Some of them are on
> > shrinker lists, but not all. Introducing a new list is really not an
> > option.
> >
> > But fortunately there is a way forward: every slab page has a stable pointer
> > to the corresponding kmem_cache. So the idea is to reparent kmem_caches
> > instead of slab pages.
> >
> > It's actually simpler and cheaper, but it requires some underlying
> > changes (a toy model of the resulting lifecycle follows below):
> > 1) Make kmem_caches hold a single reference to the memory cgroup,
> >    instead of a separate reference per slab page.
> > 2) Stop setting the page->mem_cgroup pointer for memcg slab pages and
> >    use the page->kmem_cache->memcg indirection instead. It's used only
> >    on slab page release, so it shouldn't be a big issue.
> > 3) Introduce a refcounter for non-root slab caches, which is required
> >    to be able to destroy kmem_caches when they become empty and to
> >    release the associated memory cgroup.
> >
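
A userspace toy model of the three changes above (all names hypothetical,
not actual kernel code): the non-root cache holds one reference on its
memcg, each slab page holds a reference on its cache, and freeing the
last page destroys the cache, which in turn releases the memcg.

#include <stdio.h>

struct mem_cgroup {
	int refs;
};

struct kmem_cache {
	struct mem_cgroup *memcg;	/* (1) one reference per cache */
	long pages;			/* (3) refcount: live slab pages */
};

struct page {
	struct kmem_cache *cache;	/* (2) memcg reached only through
					 * page->cache->memcg */
};

static void memcg_put(struct mem_cgroup *memcg)
{
	if (--memcg->refs == 0)
		printf("mem_cgroup released\n");
}

static void cache_put(struct kmem_cache *cache)
{
	if (--cache->pages == 0) {	/* cache became empty: destroy it */
		printf("kmem_cache destroyed\n");
		memcg_put(cache->memcg);
	}
}

static void free_slab_page(struct page *page)
{
	cache_put(page->cache);	/* the only place the indirection is used */
}

int main(void)
{
	struct mem_cgroup memcg = { .refs = 2 };  /* css ref + cache ref */
	struct kmem_cache cache = { &memcg, 2 };
	struct page p1 = { &cache }, p2 = { &cache };

	memcg_put(&memcg);	/* rmdir: the css reference goes away */
	free_slab_page(&p1);
	free_slab_page(&p2);	/* last page: cache and then memcg go too */
	return 0;
}
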
> > There is a bonus: currently we release empty kmem_caches on cgroup
> > removal, but all others wait for the memory cgroup itself to be
> > released. These refactorings allow kmem_caches to be released as soon
> > as they become inactive and free.
> >
> > Some additional implementation details are provided in corresponding
> > commit messages.
> >
> > # Results
> >
> > Below is the average number of dying cgroups on two groups of our
> > production hosts. They run a web frontend workload under moderate
> > memory pressure. As you can see, with kernel memory reparenting the
> > number stabilizes in the 60s range, while with the original version
> > it grows almost linearly and shows no sign of plateauing. The
> > difference in slab and percpu usage between the patched and unpatched
> > versions also grows linearly; in 7 days it exceeded 200MB.
> >
> > day           0    1    2    3    4    5    6    7
> > original     56  362  628  752 1070 1250 1490 1560
> > patched      23   46   51   55   60   57   67   69
> > mem diff(Mb) 22   74  123  152  164  182  214  241
> 
> No objection to the idea, but a question...

Hi Greg!

> In patched kernels, does slabinfo (or similar) show the list of
> reparented slab caches?  A pile of zombie kmem_caches is certainly
> better than a pile of zombie mem_cgroups.  But it still seems like it
> might cause degradation - does cache_reap() walk an ever-growing set
> of zombie caches?

It's not a pile of zombie kmem_caches vs a pile of zombie mem_cgroups.
It's a smaller pile of zombie kmem_caches vs a larger pile of zombie
kmem_caches *and* a pile of zombie mem_cgroups. The patchset makes the
number of zombie kmem_caches smaller, not larger.

Re slabinfo and other debug interfaces: I do not change anything here.

> 
> We've found it useful to add a slabinfo_full file which includes zombie
> kmem_caches with their memcg_name.  This can help hunt down zombies.

I'm not sure we need to add a permanent debug interface, because something like
drgn ( https://github.com/osandov/drgn ) can be used instead.
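
For example, a rough drgn sketch along these lines, run against a live
kernel or a vmcore, can walk the global slab_caches list and print the
owning memcg of each non-root cache. The field names below assume a
kernel of roughly this era with CONFIG_MEMCG_KMEM and may differ on
other versions:

from drgn.helpers.linux.list import list_for_each_entry

for s in list_for_each_entry('struct kmem_cache',
                             prog['slab_caches'].address_of_(), 'list'):
    name = s.name.string_().decode()
    if s.memcg_params.root_cache:       # non-root (per-memcg) cache
        cgrp = s.memcg_params.memcg.css.cgroup.kn.name.string_().decode()
        print(name, '->', cgrp)
    else:
        print(name, '(root cache)')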

If you think we lack some necessary debug interfaces, I'm totally open
here, but that's not part of this patchset. Let's talk about them
separately.

Thank you for looking into it!

Roman

Thread overview: 30+ messages
2019-05-14 21:39 [PATCH v4 0/7] mm: reparent slab memory on cgroup removal Roman Gushchin
2019-05-14 21:39 ` [PATCH v4 1/7] mm: postpone kmem_cache memcg pointer initialization to memcg_link_cache() Roman Gushchin
2019-05-14 21:39 ` [PATCH v4 2/7] mm: generalize postponed non-root kmem_cache deactivation Roman Gushchin
2019-05-14 21:39 ` [PATCH v4 3/7] mm: introduce __memcg_kmem_uncharge_memcg() Roman Gushchin
2019-05-14 21:39 ` [PATCH v4 4/7] mm: unify SLAB and SLUB page accounting Roman Gushchin
2019-05-14 21:39 ` [PATCH v4 5/7] mm: rework non-root kmem_cache lifecycle management Roman Gushchin
2019-05-15  0:06   ` Shakeel Butt
2019-05-20 14:54     ` Waiman Long
2019-05-20 17:56       ` Roman Gushchin
2019-05-21 18:39     ` Waiman Long
2019-05-21 19:23       ` Roman Gushchin
2019-05-21 19:35         ` Waiman Long
2019-05-15 14:00   ` Christopher Lameter
2019-05-15 14:11     ` Shakeel Butt
2019-05-23  0:58   ` [mm] e52271917f: BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h kernel test robot
2019-05-23 21:00     ` Roman Gushchin
2019-05-14 21:39 ` [PATCH v4 6/7] mm: reparent slab memory on cgroup removal Roman Gushchin
2019-05-15  0:10   ` Shakeel Butt
2019-05-14 21:39 ` [PATCH v4 7/7] mm: fix /proc/kpagecgroup interface for slab pages Roman Gushchin
2019-05-15  0:16   ` Shakeel Butt
2019-06-05  7:39 ` [PATCH v4 0/7] mm: reparent slab memory on cgroup removal Greg Thelen
2019-06-05 17:33   ` Roman Gushchin [this message]
