From: Michal Hocko <mhocko@suse.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yosry Ahmed <yosryahmed@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeelb@google.com>,
	Muchun Song <muchun.song@linux.dev>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
	Yu Zhao <yuzhao@google.com>, Luis Chamberlain <mcgrof@kernel.org>,
	Kees Cook <keescook@chromium.org>,
	Iurii Zaikin <yzaikin@google.com>,
	"T.J. Mercier" <tjmercier@google.com>,
	Greg Thelen <gthelen@google.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org
Subject: Re: [RFC PATCH 0/8] memory recharging for offline memcgs
Date: Tue, 1 Aug 2023 11:54:47 +0200
Message-ID: <ZMjWZwYkhxSNLU+q@dhcp22.suse.cz>
In-Reply-To: <20230720153515.GA1003248@cmpxchg.org>

[Sorry for being late to this discussion]

On Thu 20-07-23 11:35:15, Johannes Weiner wrote:
[...]
> I'm super skeptical of this proposal.

Agreed.
 
> Recharging *might* be the most desirable semantics from a user pov,
> but only if it applies consistently to the whole memory footprint.
> There is no mention of slab allocations such as inodes, dentries,
> network buffers etc. which can be a significant part of a cgroup's
> footprint. These are currently reparented. I don't think doing one
> thing with half of the memory, and a totally different thing with the
> other half upon cgroup deletion is going to be acceptable semantics.
> 
> It appears this also brings back the reliability issue that caused us
> to deprecate charge moving. The recharge path has trylocks, LRU
> isolation attempts, and GFP_ATOMIC allocations. These introduce a variable
> error rate into the relocation process, which causes pages that should
> belong to the same domain to be scattered around all over the place.
> It also means that zombie pinning still exists, but it's now even more
> influenced by timing and race conditions, and so less predictable.
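
Just to illustrate the reliability point: each recharge attempt is a
chain of can-fail steps along these lines. This is a minimal sketch,
not code from the patchset; mem_cgroup_charge_atomic() and
uncharge_old_and_commit() are made-up helpers for illustration:

/*
 * Best-effort recharge of one folio. Every step below can fail
 * under contention or memory pressure, so some folios move and
 * others stay behind in the dead memcg.
 */
static bool try_recharge_folio(struct folio *folio, struct mem_cgroup *to)
{
	/* Racy: reclaim or compaction may have isolated the folio. */
	if (!folio_isolate_lru(folio))
		return false;

	/* Trylock: any contention and we simply give up. */
	if (!folio_trylock(folio)) {
		folio_putback_lru(folio);
		return false;
	}

	/* GFP_ATOMIC charge (hypothetical helper): fails under pressure. */
	if (mem_cgroup_charge_atomic(folio, to)) {
		folio_unlock(folio);
		folio_putback_lru(folio);
		return false;
	}

	uncharge_old_and_commit(folio, to);	/* hypothetical helper */
	folio_unlock(folio);
	folio_putback_lru(folio);
	return true;
}
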
> 
> There are two issues being conflated here:
> 
> a) the problem of zombie cgroups, and
> 
> b) who controls resources that outlive the control domain.
> 
> For a), reparenting is still the most reasonable proposal. It's
> reliable for one, but it also fixes the problem fully within the
> established, user-facing semantics: resources that belong to a cgroup
> also hierarchically belong to all ancestral groups; if those resources
> outlive the last-level control domain, they continue to belong to the
> parents. This is how it works today, and this is how it continues to
> work with reparenting. The only difference is that those resources
> no longer pin a dead cgroup, but instead are physically linked to
> the next online ancestor. Since dead cgroups have no effective control
> parameters anymore, this is semantically equivalent - it's just a
> more memory-efficient implementation of the exact same thing.
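
For context, this is roughly what already happens for kernel memory
today via obj_cgroup, and what Muchun's series extends to LRU folios,
as I understand it. A heavily simplified sketch of the reparenting
step at offline time (locking, RCU and refcounting omitted):

static void reparent_objcgs_sketch(struct mem_cgroup *memcg)
{
	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
	struct obj_cgroup *objcg;

	/*
	 * Charged objects reference an obj_cgroup rather than the
	 * memcg directly, so redirecting that pointer to the parent
	 * drops the references that would otherwise pin the zombie.
	 */
	list_for_each_entry(objcg, &memcg->objcg_list, list)
		WRITE_ONCE(objcg->memcg, parent);
	list_splice(&memcg->objcg_list, &parent->objcg_list);
}
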
> 
> b) is a discussion totally separate from this. We can argue what we
> want this behavior to be, but I'd argue strongly that whatever we do
> here should apply to all resources managed by the controller equally.
> 
> It could also be argued that if you don't want to lose control over a
> set of resources, then maybe don't delete their control domain while
> they are still alive and in use. For example, when restarting a
> workload, and the new instance is expected to have largely the same
> workingset, consider reusing the cgroup instead of making a new one.
> 
> For the zombie problem, I think we should merge Muchun's patches
> ASAP. They've been proposed several times, they have Roman's reviews
> and acks, and they do not change user-facing semantics. There is no
> good reason not to merge them.

Yes, fully agreed on both points. The problem with zombies is real,
but reparenting should address it for the most part. Ownership is a
different problem; we have discussed it at LSFMM this year, and in
the past as well I believe. What we probably need is a concept of
taking ownership of memory (something like madvise(MADV_OWN, range),
or fadvise for fd-based resources). This would allow the caller to
take over the said resource (e.g. assume its memcg charge).
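
To make that concrete: a new instance of a restarted workload that
inherits, say, a shared tmpfs mapping could claim the charges roughly
like this. MADV_OWN does not exist; the flag value below is made up
purely to illustrate what the proposed interface could look like:

#include <sys/mman.h>

#ifndef MADV_OWN
#define MADV_OWN	25	/* hypothetical advice value */
#endif

/*
 * Ask the kernel to recharge the pages backing [addr, addr + len)
 * to the caller's memcg, making ownership explicit.
 */
static int claim_region(void *addr, size_t len)
{
	return madvise(addr, len, MADV_OWN);
}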

I understand that would require some changes to existing workloads.
Whatever the interface ends up being, it has to be explicit; otherwise
we run into problems with unaccounted resources that sit around
without any actual owner, and with nondeterministic, timing-dependent
hopping between owners. In other words, nobody should be able to drop
responsibility for an object while it is still consuming resources.

-- 
Michal Hocko
SUSE Labs

