From: Yosry Ahmed <yosryahmed@google.com>
To: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	 Shakeel Butt <shakeelb@google.com>,
	Muchun Song <muchun.song@linux.dev>,
	 "Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Zefan Li <lizefan.x@bytedance.com>,  Yu Zhao <yuzhao@google.com>,
	Luis Chamberlain <mcgrof@kernel.org>,
	Kees Cook <keescook@chromium.org>,
	 Iurii Zaikin <yzaikin@google.com>,
	"T.J. Mercier" <tjmercier@google.com>,
	 Greg Thelen <gthelen@google.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	 cgroups@vger.kernel.org
Subject: Re: [RFC PATCH 0/8] memory recharging for offline memcgs
Date: Fri, 21 Jul 2023 11:47:49 -0700
Message-ID: <CAJD7tkatz1JhKVj_iP9J0H7fPJnUSurZkCT1iJTJ=+qRen_nLQ@mail.gmail.com>
In-Reply-To: <ZLrN1BE42Tsybm6j@slm.duckdns.org>

On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote:
> > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote:
> > > memory at least in our case. The sharing across them comes down to things
> > > like some common library pages which don't really account for much these
> > > days.
> >
> > Keep in mind that even a single page charged to a memcg and used by
> > another memcg is sufficient to result in a zombie memcg.
>
> I mean, yeah, that's a separate issue or rather a subset which isn't all
> that controversial. That can be deterministically solved by reparenting to
> the parent like how slab is handled. I think the "deterministic" part is
> important here. As you said, even a single page can pin a dying cgroup.

There are serious flaws with reparenting that I mentioned above. We do
it for kernel memory, but that's because we really have no other
choice. Oftentimes the memory is not reclaimable and we cannot find an
owner for it. This doesn't mean it's the right answer for user memory.

The semantics are new compared to normal charging (as opposed to
recharging, as I explain below). There is an extra layer of
indirection that we did not (as far as I know) measure the impact of.
Parents end up with pages that they never used, and we have no
observability into where they came from. Most importantly, over time
user memory will keep accumulating at the root, reducing the accuracy
and usefulness of accounting. That is effectively an accounting leak
and a reduction of capacity: memory that is not attributed to any
user, i.e. system overhead.
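
As a toy illustration of what reparenting means for accounting (the
structures and names below are made up for this sketch in plain C,
they are not the actual kernel code), the charge never disappears on
offline, it just moves up:

/* Toy model of reparenting, for illustration only. */
#include <stdio.h>

struct toy_memcg {
	struct toy_memcg *parent;
	long nr_charged;		/* pages charged to this group */
};

/*
 * On offline, push any remaining charge up to the parent so the dying
 * group no longer pins it (similar in spirit to how slab objcgs are
 * reparented).
 */
static void toy_reparent(struct toy_memcg *dying)
{
	if (dying->parent) {
		dying->parent->nr_charged += dying->nr_charged;
		dying->nr_charged = 0;
	}
}

int main(void)
{
	struct toy_memcg root = { .parent = NULL, .nr_charged = 0 };
	struct toy_memcg job  = { .parent = &root, .nr_charged = 42 };

	toy_reparent(&job);
	printf("root now holds %ld pages it never used\n", root.nr_charged);
	return 0;
}

It is deterministic, yes, but it is also exactly how pages pile up at
ancestors with no record of who actually used them.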

>
> > > > Keep in mind that the environment is dynamic, workloads are constantly
> > > > coming and going. Even if we find the perfect nesting to appropriately
> > > > scope resources, some rescheduling may render the hierarchy obsolete
> > > > and require us to start over.
> > >
> > > Can you please go into more details on how much memory is shared for what
> > > across unrelated dynamic workloads? That sounds different from other use
> > > cases.
> >
> > I am trying to collect more information from our fleet, but the
> > application restarting in a different cgroup is not what is happening
> > in our case. It is not easy to find out exactly what is going on on
>
> This is the point that Johannes raised but I don't think the current
> proposal would make things more deterministic. From what I can see, it
> actually pushes it towards even less predictability. Currently, yeah, some
> pages may end up in cgroups which aren't the majority user but it at least
> is clear how that would happen. The proposed change adds layers of
> indeterministic behaviors on top. I don't think that's the direction we want
> to go.

I believe recharging is being mis-framed here :)

Recharging semantics are not new; recharging is a shortcut for a
process that already happens, just focused on offline memcgs. Let's
take a step back.

It is common practice (at least to my knowledge) to try to reclaim
memory from a cgroup before deleting it (by lowering the limit or
using memory.reclaim). Reclaim heuristics are biased towards
reclaiming memory from offline cgroups. After the memory is reclaimed,
if it is used again by a different process, it will be refaulted and
charged again (aka recharged) to the memcg of the new user.
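
For example, the usual "reclaim then delete" pattern on cgroup v2
looks something like the following (the path and the amount here are
purely illustrative):

/* Illustrative only: reclaim from a memcg, then remove it. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/sys/fs/cgroup/my-job/memory.reclaim", O_WRONLY);

	if (fd >= 0) {
		/* Ask the kernel to proactively reclaim ~64M from the group. */
		if (write(fd, "64M", 3) < 0)
			perror("memory.reclaim");
		close(fd);
	}

	/* Then try to remove the (hopefully now empty) cgroup. */
	if (rmdir("/sys/fs/cgroup/my-job"))
		perror("rmdir");
	return 0;
}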

What recharging is doing is *not* anything new. It is effectively
doing what reclaim + refault would do above, with an efficient
shortcut. It avoids the unnecessary fault, avoids disrupting the
workload that will access the memory after it is reclaimed, and cleans
up zombie memcgs' memory faster than reclaim would. Moreover, it works
for memory that may not be reclaimable (e.g. because of lack of swap).

All the indeterministic behaviors in recharging are exactly the
indeterministic behaviors in reclaim. It is very similar. We iterate
the LRUs, try to isolate and lock folios, etc. This is what reclaim
does. Recharging is basically lightweight reclaim + charging again (as
opposed to fully reclaiming the memory then refaulting it).
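
In (very) simplified form, with all names made up for illustration and
none of the isolation/locking details the real patches have to handle,
a recharge pass is roughly:

/* Stand-in types; the real code operates on struct folio / lruvec. */
struct toy_memcg {
	long nr_charged;
};

struct toy_folio {
	struct toy_memcg *memcg;	/* who is currently charged */
	int locked;			/* stand-in for the folio lock */
};

static int toy_trylock(struct toy_folio *f)
{
	if (f->locked)
		return 0;
	f->locked = 1;
	return 1;
}

/* Move the charge for one folio from the old memcg to the new user. */
static void toy_recharge_one(struct toy_folio *f, struct toy_memcg *newcg)
{
	f->memcg->nr_charged--;
	newcg->nr_charged++;
	f->memcg = newcg;
}

/*
 * Walk the offline memcg's folios and recharge what we can, skipping
 * anything we fail to lock -- the same best-effort behavior reclaim
 * already has when it isolates folios from the LRUs.
 */
static void toy_recharge_pass(struct toy_folio *lru, int n,
			      struct toy_memcg *newcg)
{
	int i;

	for (i = 0; i < n; i++) {
		if (!toy_trylock(&lru[i]))
			continue;
		toy_recharge_one(&lru[i], newcg);
		lru[i].locked = 0;
	}
}

The only difference from reclaim + refault is that the data stays
where it is; the charge is what moves.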

We are not introducing new indeterminism or charging semantics.
Recharging does exactly what would happen when we reclaim zombie
memory. It is just more efficient and accelerated.

>
> > machines and where the memory is coming from due to the
> > indeterministic nature of charging. The goal of this proposal is to
> > let the kernel handle leftover memory in zombie memcgs because it is
> > not always obvious to userspace what's going on (like it's not obvious
> > to me now where exactly is the sharing happening :) ).
> >
> > One thing to note is that in some cases, maybe a userspace bug or
> > failed cleanup is a reason for the zombie memcgs. Ideally, this
> > wouldn't happen, but it would be nice to have a fallback mechanism in
> > the kernel if it does.
>
> I'm not disagreeing on that. Our handling of pages owned by dying cgroups
> isn't great but I don't think the proposed change is an acceptable solution.

I hope the above arguments change your mind :)

>
> Thanks.
>
> --
> tejun

