* [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-04-11 23:36 ` T.J. Mercier
From: T.J. Mercier @ 2023-04-11 23:36 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-mm, cgroups, Yosry Ahmed, Tejun Heo, Shakeel Butt,
	Muchun Song, Johannes Weiner, Roman Gushchin, Alistair Popple,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao

When a memcg is removed by userspace it gets offlined by the kernel.
Offline memcgs are hidden from user space, but they still live in the
kernel until their reference count drops to 0. New allocations cannot
be charged to offline memcgs, but existing allocations charged to
offline memcgs remain charged, and hold a reference to the memcg.

As such, an offline memcg can remain in the kernel indefinitely,
becoming a zombie memcg. The accumulation of a large number of zombie
memcgs leads to increased system overhead (mainly percpu data in struct
mem_cgroup). It also causes some kernel operations that scale with the
number of memcgs to become less efficient (e.g. reclaim).
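
On a cgroup v2 system, one way to observe this accumulation from
userspace is the nr_dying_descendants counter in the root cgroup.stat
file, which counts cgroups that have been removed but not yet freed
(it is not memcg-specific, but zombie memcgs are the usual reason for
it growing). A minimal sketch, assuming cgroup2 is mounted at
/sys/fs/cgroup:

/* zombies.c: print the number of dying (removed but unfreed) cgroups. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        FILE *f = fopen("/sys/fs/cgroup/cgroup.stat", "r");
        char key[64];
        long val;

        if (!f) {
                perror("cgroup.stat");
                return 1;
        }
        /* cgroup.stat is a flat list of "key value" pairs. */
        while (fscanf(f, "%63s %ld", key, &val) == 2) {
                if (!strcmp(key, "nr_dying_descendants"))
                        printf("dying cgroups: %ld\n", val);
        }
        fclose(f);
        return 0;
}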

There are currently out-of-tree solutions that attempt to
periodically clean up zombie memcgs by reclaiming from them. However,
that is not effective for non-reclaimable memory, which would be
better reparented or recharged to an online cgroup. There are also
proposed changes that would benefit from recharging of shared
resources like pinned pages or DMA buffer pages.
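
As a rough illustration, and not a description of any particular
out-of-tree implementation, such periodic cleanup essentially walks
the memcg tree and forces reclaim on every offline memcg. A sketch
assuming roughly 6.3-era internal interfaces (mem_cgroup_iter(),
mem_cgroup_online(), page_counter_read(),
try_to_free_mem_cgroup_pages()); exact signatures vary across kernel
versions:

/* Sketch of a periodic "zombie reaper": walk the memcg tree and try
 * to reclaim everything still charged to offline memcgs. Locking,
 * rate limiting and error handling are omitted. */
static void reap_zombie_memcgs(void)
{
        struct mem_cgroup *memcg;

        for (memcg = mem_cgroup_iter(NULL, NULL, NULL); memcg;
             memcg = mem_cgroup_iter(NULL, memcg, NULL)) {
                unsigned long nr_pages;

                if (mem_cgroup_online(memcg))
                        continue;       /* only offline (zombie) memcgs */

                nr_pages = page_counter_read(&memcg->memory);
                if (nr_pages)
                        try_to_free_mem_cgroup_pages(memcg, nr_pages,
                                                     GFP_KERNEL,
                                                     MEMCG_RECLAIM_MAY_SWAP);
        }
}

As noted above, this only helps for reclaimable memory; pinned or
otherwise unreclaimable pages keep the zombie alive.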

Suggested attendees:
Yosry Ahmed <yosryahmed@google.com>
Yu Zhao <yuzhao@google.com>
T.J. Mercier <tjmercier@google.com>
Tejun Heo <tj@kernel.org>
Shakeel Butt <shakeelb@google.com>
Muchun Song <muchun.song@linux.dev>
Johannes Weiner <hannes@cmpxchg.org>
Roman Gushchin <roman.gushchin@linux.dev>
Alistair Popple <apopple@nvidia.com>
Jason Gunthorpe <jgg@nvidia.com>
Kalesh Singh <kaleshsingh@google.com>


* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-04-11 23:48   ` Yosry Ahmed
From: Yosry Ahmed @ 2023-04-11 23:48 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: lsf-pc, linux-mm, cgroups, Tejun Heo, Shakeel Butt, Muchun Song,
	Johannes Weiner, Roman Gushchin, Alistair Popple,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao

On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier <tjmercier@google.com> wrote:
>
> When a memcg is removed by userspace it gets offlined by the kernel.
> Offline memcgs are hidden from user space, but they still live in the
> kernel until their reference count drops to 0. New allocations cannot
> be charged to offline memcgs, but existing allocations charged to
> offline memcgs remain charged, and hold a reference to the memcg.
>
> As such, an offline memcg can remain in the kernel indefinitely,
> becoming a zombie memcg. The accumulation of a large number of zombie
> memcgs lead to increased system overhead (mainly percpu data in struct
> mem_cgroup). It also causes some kernel operations that scale with the
> number of memcgs to become less efficient (e.g. reclaim).
>
> There are currently out-of-tree solutions which attempt to
> periodically clean up zombie memcgs by reclaiming from them. However
> that is not effective for non-reclaimable memory, which it would be
> better to reparent or recharge to an online cgroup. There are also
> proposed changes that would benefit from recharging for shared
> resources like pinned pages, or DMA buffer pages.

I am very interested in attending this discussion; it's something that
I have been actively looking into -- specifically recharging pages of
offlined memcgs.

>
> Suggested attendees:
> Yosry Ahmed <yosryahmed@google.com>
> Yu Zhao <yuzhao@google.com>
> T.J. Mercier <tjmercier@google.com>
> Tejun Heo <tj@kernel.org>
> Shakeel Butt <shakeelb@google.com>
> Muchun Song <muchun.song@linux.dev>
> Johannes Weiner <hannes@cmpxchg.org>
> Roman Gushchin <roman.gushchin@linux.dev>
> Alistair Popple <apopple@nvidia.com>
> Jason Gunthorpe <jgg@nvidia.com>
> Kalesh Singh <kaleshsingh@google.com>


* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-04-25 11:36     ` Yosry Ahmed
From: Yosry Ahmed @ 2023-04-25 11:36 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: lsf-pc, linux-mm, cgroups, Tejun Heo, Shakeel Butt, Muchun Song,
	Johannes Weiner, Roman Gushchin, Alistair Popple,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao, Matthew Wilcox,
	David Rientjes, Greg Thelen

 +David Rientjes +Greg Thelen +Matthew Wilcox

On Tue, Apr 11, 2023 at 4:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier <tjmercier@google.com> wrote:
> >
> > When a memcg is removed by userspace it gets offlined by the kernel.
> > Offline memcgs are hidden from user space, but they still live in the
> > kernel until their reference count drops to 0. New allocations cannot
> > be charged to offline memcgs, but existing allocations charged to
> > offline memcgs remain charged, and hold a reference to the memcg.
> >
> > As such, an offline memcg can remain in the kernel indefinitely,
> > becoming a zombie memcg. The accumulation of a large number of zombie
> > memcgs lead to increased system overhead (mainly percpu data in struct
> > mem_cgroup). It also causes some kernel operations that scale with the
> > number of memcgs to become less efficient (e.g. reclaim).
> >
> > There are currently out-of-tree solutions which attempt to
> > periodically clean up zombie memcgs by reclaiming from them. However
> > that is not effective for non-reclaimable memory, which it would be
> > better to reparent or recharge to an online cgroup. There are also
> > proposed changes that would benefit from recharging for shared
> > resources like pinned pages, or DMA buffer pages.
>
> I am very interested in attending this discussion, it's something that
> I have been actively looking into -- specifically recharging pages of
> offlined memcgs.
>
> >
> > Suggested attendees:
> > Yosry Ahmed <yosryahmed@google.com>
> > Yu Zhao <yuzhao@google.com>
> > T.J. Mercier <tjmercier@google.com>
> > Tejun Heo <tj@kernel.org>
> > Shakeel Butt <shakeelb@google.com>
> > Muchun Song <muchun.song@linux.dev>
> > Johannes Weiner <hannes@cmpxchg.org>
> > Roman Gushchin <roman.gushchin@linux.dev>
> > Alistair Popple <apopple@nvidia.com>
> > Jason Gunthorpe <jgg@nvidia.com>
> > Kalesh Singh <kaleshsingh@google.com>

I was hoping I would bring a more complete idea to this thread, but
here is what I have so far.

The idea is to recharge the memory charged to memcgs when they are
offlined. I like to think of the options we have to deal with memory
charged to offline memcgs as a toolkit. This toolkit includes:

(a) Evict memory.

This is the simplest option: just evict the memory.

For file-backed pages, this writes them back to their backing files,
uncharging and freeing the page. The next access will read the page
again and the faulting process’s memcg will be charged.

For swap-backed pages (anon/shmem), this swaps them out. Swapping out
a page charged to an offline memcg uncharges the page and charges the
swap to its parent. The next access will swap in the page and the
parent will be charged. This is effectively deferred recharging to the
parent.

Pros:
- Simple.

Cons:
- Behavior is different for file-backed vs. swap-backed pages: for
swap-backed pages, the memory is recharged to the parent (aka
reparented), not charged to the "rightful" user.
- Next access will incur higher latency, especially if the pages are active.

(b) Direct recharge to the parent

This can be done for any page and should be simple as the pages are
already hierarchically charged to the parent.

Pros:
- Simple.

Cons:
- If a different memcg is using the memory, it will keep taxing the
parent indefinitely. The same "not the rightful user" argument applies.

(c) Direct recharge to the mapper

This can be done for any mapped page by walking the rmap and
identifying the memcg of the process(es) mapping the page.

Pros:
- Memory is recharged to the “rightful” user.

Cons:
- More complicated: the “rightful” user’s memcg might run into an OOM
situation, which in this case will be unpredictable and hard to
correlate with an allocation.
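
To make the mechanics of (c) concrete, below is a rough sketch of the
rmap-based lookup. rmap_walk(), struct rmap_walk_control and
get_mem_cgroup_from_mm() are existing kernel interfaces;
recharge_folio_to() is a hypothetical helper standing in for the
actual charge-moving code, and folio locking and reference counting
are omitted:

struct recharge_arg {
        struct mem_cgroup *target;
};

static bool recharge_rmap_one(struct folio *folio,
                              struct vm_area_struct *vma,
                              unsigned long addr, void *arg)
{
        struct recharge_arg *ra = arg;

        /* Take the memcg of the first mm found mapping the folio. */
        ra->target = get_mem_cgroup_from_mm(vma->vm_mm);
        return false;   /* returning false stops the rmap walk */
}

static void recharge_to_mapper(struct folio *folio)
{
        struct recharge_arg ra = { };
        struct rmap_walk_control rwc = {
                .rmap_one = recharge_rmap_one,
                .arg = &ra,
        };

        rmap_walk(folio, &rwc);
        if (ra.target) {
                recharge_folio_to(folio, ra.target);    /* hypothetical */
                mem_cgroup_put(ra.target);
        }
}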

(d) Deferred recharging

This is a mixture of (b) & (c) above. It is a two-step process. We
first recharge the memory to the parent, which should be simple and
reliable. Then, we mark the pages so that the next time they are
accessed or mapped we recharge them to the "rightful" user.

For mapped pages, we can use the numa balancing approach of protecting
the mapping (while the vma is still accessible), and then in the fault
path recharge the page. This is better than eviction because the fault
on the next access is minor, and better than direct recharging to the
mapper in the sense that the charge is correlated with an
allocation/mapping. Of course, it is more complicated: we have to
handle different protection interactions (e.g. what if the page is
already protected?). Another disadvantage is that the recharging
happens in the context of a page fault, rather than asynchronously in
the case of directly recharging to the mapper. Page faults are more
latency sensitive, although this shouldn't be a common path.
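
A very rough sketch of the fault-path half of this idea;
folio_needs_recharge() and recharge_folio_to() are hypothetical
helpers, and the machinery that protects the mappings in the first
place is not shown:

/* Called from the fault path for a page that was "protected" for
 * recharging, NUMA-balancing style. The faulting task's memcg picks
 * up the charge. */
static void recharge_on_fault(struct vm_fault *vmf, struct folio *folio)
{
        struct mem_cgroup *memcg;

        if (!folio_needs_recharge(folio))       /* hypothetical check */
                return;

        memcg = get_mem_cgroup_from_mm(vmf->vma->vm_mm);
        recharge_folio_to(folio, memcg);        /* hypothetical */
        mem_cgroup_put(memcg);
}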

For unmapped pages, I am struggling to find a way that is simple
enough to recharge the memory on the next access. My first intuition
was to add a hook to folio_mark_accessed(), but I was quickly told
that this is not invoked in all access paths to unmapped pages (e.g.
writes through fds). We can also add a hook to folio_mark_dirty() to
add more coverage, but it seems like this path is fragile, and it
would be ideal if there were a shared, well-defined common path (or
paths) for all accesses to unmapped pages. I would imagine that if such
a path exists or can be forged, it would probably be in the page cache
code somewhere.

For both cases, if a new mapping is created, we can do recharging there.

Pros:
- Memory is recharged to the “rightful” user, eventually.
- The charge is predictable and correlates to a user's access.
- Less overhead on next access than eviction.

Cons:
- The memory will remain charged to the parent until the next access
happens, if it ever happens.
- Worse overhead on next access than directly recharging to the mapper.

With this (incompletely defined) toolkit, a recharging algorithm can
look like this (as a rough example):

- If the page is file-backed:
  - Unmapped? evict (a).
  - Mapped? recharge to the mapper -- direct (c) or deferred (d).
- If the page is swap-backed:
  - Unmapped? deferred recharge to the next accessor (d).
  - Mapped? recharge to the mapper -- direct (c) or deferred (d).
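
Expressed as code, the decision could look roughly like the sketch
below; folio_mapped() and folio_is_file_lru() are existing predicates,
while evict_folio(), recharge_to_mapper() and
mark_for_deferred_recharge() stand in for the (a)/(c)/(d) tools
discussed above:

static void dispose_zombie_folio(struct folio *folio)
{
        if (folio_is_file_lru(folio)) {         /* file-backed */
                if (!folio_mapped(folio))
                        evict_folio(folio);                     /* (a) */
                else
                        recharge_to_mapper(folio);              /* (c)/(d) */
        } else {                                /* swap-backed */
                if (!folio_mapped(folio))
                        mark_for_deferred_recharge(folio);      /* (d) */
                else
                        recharge_to_mapper(folio);              /* (c)/(d) */
        }
}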

There are, of course, open questions:
1) How do we do deferred recharging for unmapped pages? Is deferred
recharging even a reliable option to begin with? What if the pages are
never accessed again?

2) How do we avoid hiding kernel bugs (e.g. extraneous references)
with recharging? Ideally, all recharged pages eventually end up
somewhere other than root, such that accumulation of recharged pages
at root signals a kernel bug.

3) What do we do about swapped pages charged to offline memcgs? Even
if we recharge all charged pages, preexisting swap entries will pin
the offline memcg. Do we walk all swap_cgroup entries and reparent the
swap entries?

Again, I was hoping to come up with a more concrete proposal, but as
LSF/MM/BPF is approaching, I wanted to share my thoughts on the
mailing list looking for any feedback.

Thanks!


* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-04-25 18:42       ` Waiman Long
From: Waiman Long @ 2023-04-25 18:42 UTC (permalink / raw)
  To: Yosry Ahmed, T.J. Mercier
  Cc: lsf-pc, linux-mm, cgroups, Tejun Heo, Shakeel Butt, Muchun Song,
	Johannes Weiner, Roman Gushchin, Alistair Popple,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao, Matthew Wilcox,
	David Rientjes, Greg Thelen

On 4/25/23 07:36, Yosry Ahmed wrote:
>   +David Rientjes +Greg Thelen +Matthew Wilcox
>
> On Tue, Apr 11, 2023 at 4:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>> On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier <tjmercier@google.com> wrote:
>>> When a memcg is removed by userspace it gets offlined by the kernel.
>>> Offline memcgs are hidden from user space, but they still live in the
>>> kernel until their reference count drops to 0. New allocations cannot
>>> be charged to offline memcgs, but existing allocations charged to
>>> offline memcgs remain charged, and hold a reference to the memcg.
>>>
>>> As such, an offline memcg can remain in the kernel indefinitely,
>>> becoming a zombie memcg. The accumulation of a large number of zombie
>>> memcgs lead to increased system overhead (mainly percpu data in struct
>>> mem_cgroup). It also causes some kernel operations that scale with the
>>> number of memcgs to become less efficient (e.g. reclaim).
>>>
>>> There are currently out-of-tree solutions which attempt to
>>> periodically clean up zombie memcgs by reclaiming from them. However
>>> that is not effective for non-reclaimable memory, which it would be
>>> better to reparent or recharge to an online cgroup. There are also
>>> proposed changes that would benefit from recharging for shared
>>> resources like pinned pages, or DMA buffer pages.
>> I am very interested in attending this discussion, it's something that
>> I have been actively looking into -- specifically recharging pages of
>> offlined memcgs.
>>
>>> Suggested attendees:
>>> Yosry Ahmed <yosryahmed@google.com>
>>> Yu Zhao <yuzhao@google.com>
>>> T.J. Mercier <tjmercier@google.com>
>>> Tejun Heo <tj@kernel.org>
>>> Shakeel Butt <shakeelb@google.com>
>>> Muchun Song <muchun.song@linux.dev>
>>> Johannes Weiner <hannes@cmpxchg.org>
>>> Roman Gushchin <roman.gushchin@linux.dev>
>>> Alistair Popple <apopple@nvidia.com>
>>> Jason Gunthorpe <jgg@nvidia.com>
>>> Kalesh Singh <kaleshsingh@google.com>
> I was hoping I would bring a more complete idea to this thread, but
> here is what I have so far.
>
> The idea is to recharge the memory charged to memcgs when they are
> offlined. I like to think of the options we have to deal with memory
> charged to offline memcgs as a toolkit. This toolkit includes:
>
> (a) Evict memory.
>
> This is the simplest option, just evict the memory.
>
> For file-backed pages, this writes them back to their backing files,
> uncharging and freeing the page. The next access will read the page
> again and the faulting process’s memcg will be charged.
>
> For swap-backed pages (anon/shmem), this swaps them out. Swapping out
> a page charged to an offline memcg uncharges the page and charges the
> swap to its parent. The next access will swap in the page and the
> parent will be charged. This is effectively deferred recharging to the
> parent.
>
> Pros:
> - Simple.
>
> Cons:
> - Behavior is different for file-backed vs. swap-backed pages, for
> swap-backed pages, the memory is recharged to the parent (aka
> reparented), not charged to the "rightful" user.
> - Next access will incur higher latency, especially if the pages are active.
>
> (b) Direct recharge to the parent
>
> This can be done for any page and should be simple as the pages are
> already hierarchically charged to the parent.
>
> Pros:
> - Simple.
>
> Cons:
> - If a different memcg is using the memory, it will keep taxing the
> parent indefinitely. Same not the "rightful" user argument.

Muchun had actually posted a patch to do this last year. See

https://lore.kernel.org/all/20220621125658.64935-10-songmuchun@bytedance.com/T/#me9dbbce85e2f3c4e5f34b97dbbdb5f79d77ce147

I am wondering if he is going to post an updated version of that or not.
Anyway, I am looking forward to learning about the result of this
discussion even though I am not a conference invitee.

Thanks,
Longman




* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-04-25 18:53         ` Yosry Ahmed
From: Yosry Ahmed @ 2023-04-25 18:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: T.J. Mercier, lsf-pc, linux-mm, cgroups, Tejun Heo, Shakeel Butt,
	Muchun Song, Johannes Weiner, Roman Gushchin, Alistair Popple,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao, Matthew Wilcox,
	David Rientjes, Greg Thelen

On Tue, Apr 25, 2023 at 11:42 AM Waiman Long <longman@redhat.com> wrote:
>
> On 4/25/23 07:36, Yosry Ahmed wrote:
> >   +David Rientjes +Greg Thelen +Matthew Wilcox
> >
> > On Tue, Apr 11, 2023 at 4:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >> On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier <tjmercier@google.com> wrote:
> >>> When a memcg is removed by userspace it gets offlined by the kernel.
> >>> Offline memcgs are hidden from user space, but they still live in the
> >>> kernel until their reference count drops to 0. New allocations cannot
> >>> be charged to offline memcgs, but existing allocations charged to
> >>> offline memcgs remain charged, and hold a reference to the memcg.
> >>>
> >>> As such, an offline memcg can remain in the kernel indefinitely,
> >>> becoming a zombie memcg. The accumulation of a large number of zombie
> >>> memcgs lead to increased system overhead (mainly percpu data in struct
> >>> mem_cgroup). It also causes some kernel operations that scale with the
> >>> number of memcgs to become less efficient (e.g. reclaim).
> >>>
> >>> There are currently out-of-tree solutions which attempt to
> >>> periodically clean up zombie memcgs by reclaiming from them. However
> >>> that is not effective for non-reclaimable memory, which it would be
> >>> better to reparent or recharge to an online cgroup. There are also
> >>> proposed changes that would benefit from recharging for shared
> >>> resources like pinned pages, or DMA buffer pages.
> >> I am very interested in attending this discussion, it's something that
> >> I have been actively looking into -- specifically recharging pages of
> >> offlined memcgs.
> >>
> >>> Suggested attendees:
> >>> Yosry Ahmed <yosryahmed@google.com>
> >>> Yu Zhao <yuzhao@google.com>
> >>> T.J. Mercier <tjmercier@google.com>
> >>> Tejun Heo <tj@kernel.org>
> >>> Shakeel Butt <shakeelb@google.com>
> >>> Muchun Song <muchun.song@linux.dev>
> >>> Johannes Weiner <hannes@cmpxchg.org>
> >>> Roman Gushchin <roman.gushchin@linux.dev>
> >>> Alistair Popple <apopple@nvidia.com>
> >>> Jason Gunthorpe <jgg@nvidia.com>
> >>> Kalesh Singh <kaleshsingh@google.com>
> > I was hoping I would bring a more complete idea to this thread, but
> > here is what I have so far.
> >
> > The idea is to recharge the memory charged to memcgs when they are
> > offlined. I like to think of the options we have to deal with memory
> > charged to offline memcgs as a toolkit. This toolkit includes:
> >
> > (a) Evict memory.
> >
> > This is the simplest option, just evict the memory.
> >
> > For file-backed pages, this writes them back to their backing files,
> > uncharging and freeing the page. The next access will read the page
> > again and the faulting process’s memcg will be charged.
> >
> > For swap-backed pages (anon/shmem), this swaps them out. Swapping out
> > a page charged to an offline memcg uncharges the page and charges the
> > swap to its parent. The next access will swap in the page and the
> > parent will be charged. This is effectively deferred recharging to the
> > parent.
> >
> > Pros:
> > - Simple.
> >
> > Cons:
> > - Behavior is different for file-backed vs. swap-backed pages, for
> > swap-backed pages, the memory is recharged to the parent (aka
> > reparented), not charged to the "rightful" user.
> > - Next access will incur higher latency, especially if the pages are active.
> >
> > (b) Direct recharge to the parent
> >
> > This can be done for any page and should be simple as the pages are
> > already hierarchically charged to the parent.
> >
> > Pros:
> > - Simple.
> >
> > Cons:
> > - If a different memcg is using the memory, it will keep taxing the
> > parent indefinitely. Same not the "rightful" user argument.
>
> Muchun had actually posted patch to do this last year. See
>
> https://lore.kernel.org/all/20220621125658.64935-10-songmuchun@bytedance.com/T/#me9dbbce85e2f3c4e5f34b97dbbdb5f79d77ce147
>
> I am wondering if he is going to post an updated version of that or not.
> Anyway, I am looking forward to learn about the result of this
> discussion even thought I am not a conference invitee.

There are a couple of problems that were brought up back then, mainly
that memory will be reparented to the root memcg eventually,
practically escaping accounting. Shared resources may eventually end
up unaccounted. Ideally, we would come up with a scheme where
the memory is charged to the real user, instead of just to the parent.

Consider the case where processes in memcg A and B are both using
memory that is charged to memcg A. If memcg A goes offline, and we
reparent the memory, memcg B keeps using the memory for free, taxing
A's parent, or the entire system if that's root.

Also, if there is a kernel bug and pages are being pinned
unnecessarily, those pages will never be reclaimed and will stick
around and eventually be reparented to the root memcg. If being
reparented to the root memcg is a legitimate action, you can't easily
tell whether pages are sticking around because they are legitimately
being used by someone or because there is a kernel bug.

>
> Thanks,
> Longman
>
>


* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-04-26 20:15           ` Waiman Long
From: Waiman Long @ 2023-04-26 20:15 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: T.J. Mercier, lsf-pc, linux-mm, cgroups, Tejun Heo, Shakeel Butt,
	Muchun Song, Johannes Weiner, Roman Gushchin, Alistair Popple,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao, Matthew Wilcox,
	David Rientjes, Greg Thelen

On 4/25/23 14:53, Yosry Ahmed wrote:
> On Tue, Apr 25, 2023 at 11:42 AM Waiman Long <longman@redhat.com> wrote:
>> On 4/25/23 07:36, Yosry Ahmed wrote:
>>>    +David Rientjes +Greg Thelen +Matthew Wilcox
>>>
>>> On Tue, Apr 11, 2023 at 4:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>> On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier <tjmercier@google.com> wrote:
>>>>> When a memcg is removed by userspace it gets offlined by the kernel.
>>>>> Offline memcgs are hidden from user space, but they still live in the
>>>>> kernel until their reference count drops to 0. New allocations cannot
>>>>> be charged to offline memcgs, but existing allocations charged to
>>>>> offline memcgs remain charged, and hold a reference to the memcg.
>>>>>
>>>>> As such, an offline memcg can remain in the kernel indefinitely,
>>>>> becoming a zombie memcg. The accumulation of a large number of zombie
>>>>> memcgs lead to increased system overhead (mainly percpu data in struct
>>>>> mem_cgroup). It also causes some kernel operations that scale with the
>>>>> number of memcgs to become less efficient (e.g. reclaim).
>>>>>
>>>>> There are currently out-of-tree solutions which attempt to
>>>>> periodically clean up zombie memcgs by reclaiming from them. However
>>>>> that is not effective for non-reclaimable memory, which it would be
>>>>> better to reparent or recharge to an online cgroup. There are also
>>>>> proposed changes that would benefit from recharging for shared
>>>>> resources like pinned pages, or DMA buffer pages.
>>>> I am very interested in attending this discussion, it's something that
>>>> I have been actively looking into -- specifically recharging pages of
>>>> offlined memcgs.
>>>>
>>>>> Suggested attendees:
>>>>> Yosry Ahmed <yosryahmed@google.com>
>>>>> Yu Zhao <yuzhao@google.com>
>>>>> T.J. Mercier <tjmercier@google.com>
>>>>> Tejun Heo <tj@kernel.org>
>>>>> Shakeel Butt <shakeelb@google.com>
>>>>> Muchun Song <muchun.song@linux.dev>
>>>>> Johannes Weiner <hannes@cmpxchg.org>
>>>>> Roman Gushchin <roman.gushchin@linux.dev>
>>>>> Alistair Popple <apopple@nvidia.com>
>>>>> Jason Gunthorpe <jgg@nvidia.com>
>>>>> Kalesh Singh <kaleshsingh@google.com>
>>> I was hoping I would bring a more complete idea to this thread, but
>>> here is what I have so far.
>>>
>>> The idea is to recharge the memory charged to memcgs when they are
>>> offlined. I like to think of the options we have to deal with memory
>>> charged to offline memcgs as a toolkit. This toolkit includes:
>>>
>>> (a) Evict memory.
>>>
>>> This is the simplest option, just evict the memory.
>>>
>>> For file-backed pages, this writes them back to their backing files,
>>> uncharging and freeing the page. The next access will read the page
>>> again and the faulting process’s memcg will be charged.
>>>
>>> For swap-backed pages (anon/shmem), this swaps them out. Swapping out
>>> a page charged to an offline memcg uncharges the page and charges the
>>> swap to its parent. The next access will swap in the page and the
>>> parent will be charged. This is effectively deferred recharging to the
>>> parent.
>>>
>>> Pros:
>>> - Simple.
>>>
>>> Cons:
>>> - Behavior is different for file-backed vs. swap-backed pages, for
>>> swap-backed pages, the memory is recharged to the parent (aka
>>> reparented), not charged to the "rightful" user.
>>> - Next access will incur higher latency, especially if the pages are active.
>>>
>>> (b) Direct recharge to the parent
>>>
>>> This can be done for any page and should be simple as the pages are
>>> already hierarchically charged to the parent.
>>>
>>> Pros:
>>> - Simple.
>>>
>>> Cons:
>>> - If a different memcg is using the memory, it will keep taxing the
>>> parent indefinitely. Same not the "rightful" user argument.
>> Muchun had actually posted patch to do this last year. See
>>
>> https://lore.kernel.org/all/20220621125658.64935-10-songmuchun@bytedance.com/T/#me9dbbce85e2f3c4e5f34b97dbbdb5f79d77ce147
>>
>> I am wondering if he is going to post an updated version of that or not.
>> Anyway, I am looking forward to learn about the result of this
>> discussion even thought I am not a conference invitee.
> There are a couple of problems that were brought up back then, mainly
> that memory will be reparented to the root memcg eventually,
> practically escaping accounting. Shared resources may end up being
> eventually unaccounted. Ideally, we can come up with a scheme where
> the memory is charged to the real user, instead of just to the parent.
>
> Consider the case where processes in memcg A and B are both using
> memory that is charged to memcg A. If memcg A goes offline, and we
> reparent the memory, memcg B keeps using the memory for free, taxing
> A's parent, or the entire system if that's root.
>
> Also, if there is a kernel bug and a page is being pinned
> unnecessarily, those pages will never be reclaimed and will stick
> around and eventually be reparented to the root memcg. If being
> reparented to the root memcg is a legitimate action, you can't simply
> tell apart if pages are sticking around just because they are being
> used by someone or if there is a kernel bug.

This is certainly a valid concern. We are currently doing reparenting
for slab objects. However, physical pages have a higher probability of
being shared by different tasks. I do hope that we can come to an
agreement soon on how best to address this issue.

Thanks,
Longman



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-04-26 20:15           ` Waiman Long
  0 siblings, 0 replies; 51+ messages in thread
From: Waiman Long @ 2023-04-26 20:15 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: T.J. Mercier, lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Shakeel Butt, Muchun Song, Johannes Weiner,
	Roman Gushchin, Alistair Popple, Jason Gunthorpe, Kalesh Singh,
	Yu Zhao, Matthew Wilcox, David Rientjes, Greg Thelen

On 4/25/23 14:53, Yosry Ahmed wrote:
> On Tue, Apr 25, 2023 at 11:42 AM Waiman Long <longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> On 4/25/23 07:36, Yosry Ahmed wrote:
>>>    +David Rientjes +Greg Thelen +Matthew Wilcox
>>>
>>> On Tue, Apr 11, 2023 at 4:48 PM Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>> On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier <tjmercier-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>> When a memcg is removed by userspace it gets offlined by the kernel.
>>>>> Offline memcgs are hidden from user space, but they still live in the
>>>>> kernel until their reference count drops to 0. New allocations cannot
>>>>> be charged to offline memcgs, but existing allocations charged to
>>>>> offline memcgs remain charged, and hold a reference to the memcg.
>>>>>
>>>>> As such, an offline memcg can remain in the kernel indefinitely,
>>>>> becoming a zombie memcg. The accumulation of a large number of zombie
>>>>> memcgs lead to increased system overhead (mainly percpu data in struct
>>>>> mem_cgroup). It also causes some kernel operations that scale with the
>>>>> number of memcgs to become less efficient (e.g. reclaim).
>>>>>
>>>>> There are currently out-of-tree solutions which attempt to
>>>>> periodically clean up zombie memcgs by reclaiming from them. However
>>>>> that is not effective for non-reclaimable memory, which it would be
>>>>> better to reparent or recharge to an online cgroup. There are also
>>>>> proposed changes that would benefit from recharging for shared
>>>>> resources like pinned pages, or DMA buffer pages.
>>>> I am very interested in attending this discussion, it's something that
>>>> I have been actively looking into -- specifically recharging pages of
>>>> offlined memcgs.
>>>>
>>>>> Suggested attendees:
>>>>> Yosry Ahmed <yosryahmed-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>>>> Yu Zhao <yuzhao-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>>>> T.J. Mercier <tjmercier-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>>>> Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>>>>> Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>>>> Muchun Song <muchun.song-fxUVXftIFDnyG1zEObXtfA@public.gmane.org>
>>>>> Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
>>>>> Roman Gushchin <roman.gushchin-fxUVXftIFDnyG1zEObXtfA@public.gmane.org>
>>>>> Alistair Popple <apopple-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
>>>>> Jason Gunthorpe <jgg-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
>>>>> Kalesh Singh <kaleshsingh-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>> I was hoping I would bring a more complete idea to this thread, but
>>> here is what I have so far.
>>>
>>> The idea is to recharge the memory charged to memcgs when they are
>>> offlined. I like to think of the options we have to deal with memory
>>> charged to offline memcgs as a toolkit. This toolkit includes:
>>>
>>> (a) Evict memory.
>>>
>>> This is the simplest option, just evict the memory.
>>>
>>> For file-backed pages, this writes them back to their backing files,
>>> uncharging and freeing the page. The next access will read the page
>>> again and the faulting process’s memcg will be charged.
>>>
>>> For swap-backed pages (anon/shmem), this swaps them out. Swapping out
>>> a page charged to an offline memcg uncharges the page and charges the
>>> swap to its parent. The next access will swap in the page and the
>>> parent will be charged. This is effectively deferred recharging to the
>>> parent.
>>>
>>> Pros:
>>> - Simple.
>>>
>>> Cons:
>>> - Behavior is different for file-backed vs. swap-backed pages, for
>>> swap-backed pages, the memory is recharged to the parent (aka
>>> reparented), not charged to the "rightful" user.
>>> - Next access will incur higher latency, especially if the pages are active.
>>>
>>> (b) Direct recharge to the parent
>>>
>>> This can be done for any page and should be simple as the pages are
>>> already hierarchically charged to the parent.
>>>
>>> Pros:
>>> - Simple.
>>>
>>> Cons:
>>> - If a different memcg is using the memory, it will keep taxing the
>>> parent indefinitely. Same not the "rightful" user argument.
>> Muchun had actually posted patch to do this last year. See
>>
>> https://lore.kernel.org/all/20220621125658.64935-10-songmuchun-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org/T/#me9dbbce85e2f3c4e5f34b97dbbdb5f79d77ce147
>>
>> I am wondering if he is going to post an updated version of that or not.
>> Anyway, I am looking forward to learn about the result of this
>> discussion even thought I am not a conference invitee.
> There are a couple of problems that were brought up back then, mainly
> that memory will be reparented to the root memcg eventually,
> practically escaping accounting. Shared resources may end up being
> eventually unaccounted. Ideally, we can come up with a scheme where
> the memory is charged to the real user, instead of just to the parent.
>
> Consider the case where processes in memcg A and B are both using
> memory that is charged to memcg A. If memcg A goes offline, and we
> reparent the memory, memcg B keeps using the memory for free, taxing
> A's parent, or the entire system if that's root.
>
> Also, if there is a kernel bug and pages are being pinned
> unnecessarily, those pages will never be reclaimed and will stick
> around and eventually be reparented to the root memcg. If being
> reparented to the root memcg is a legitimate action, you can't simply
> tell whether pages are sticking around because they are being
> used by someone or because there is a kernel bug.

This is certainly a valid concern. We are currently doing reparenting
for slab objects. However, physical pages have a higher probability of
being shared by different tasks. I do hope that we can come to an
agreement soon on how best to address this issue.
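
To make that reparenting step concrete, here is a minimal toy model in
plain C (invented structures, nothing like the kernel's obj_cgroup
machinery): the charge of a zombie simply moves up to the nearest online
ancestor, which is also why leaked references eventually pile up at the
root.

#include <stdio.h>
#include <stdbool.h>

/* Toy model of a memcg hierarchy; not kernel code. */
struct memcg {
    const char *name;
    struct memcg *parent;
    bool online;
    long charged_bytes;
};

/* Reparent a charge from an offline memcg to its nearest online ancestor. */
static struct memcg *reparent_charge(struct memcg *owner, long bytes)
{
    struct memcg *target = owner;

    while (target && !target->online)
        target = target->parent;        /* falls back to the root eventually */

    owner->charged_bytes -= bytes;
    if (target)
        target->charged_bytes += bytes;
    return target;
}

int main(void)
{
    struct memcg root   = { "root",   NULL,    true,  0 };
    struct memcg parent = { "parent", &root,   false, 0 };    /* already offline */
    struct memcg child  = { "child",  &parent, false, 4096 }; /* zombie holding 4K */

    struct memcg *t = reparent_charge(&child, 4096);
    printf("charge moved to %s\n", t ? t->name : "nowhere");
    return 0;
}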

Thanks,
Longman


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-01 16:38       ` Roman Gushchin
  0 siblings, 0 replies; 51+ messages in thread
From: Roman Gushchin @ 2023-05-01 16:38 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: T.J. Mercier, lsf-pc, linux-mm, cgroups, Tejun Heo, Shakeel Butt,
	Muchun Song, Johannes Weiner, Alistair Popple, Jason Gunthorpe,
	Kalesh Singh, Yu Zhao, Matthew Wilcox, David Rientjes,
	Greg Thelen

On Tue, Apr 25, 2023 at 04:36:53AM -0700, Yosry Ahmed wrote:
>  +David Rientjes +Greg Thelen +Matthew Wilcox

Hi Yosry!

Sorry for being late to the party, I was offline for a week.

> 
> On Tue, Apr 11, 2023 at 4:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier <tjmercier@google.com> wrote:
> > >
> > > When a memcg is removed by userspace it gets offlined by the kernel.
> > > Offline memcgs are hidden from user space, but they still live in the
> > > kernel until their reference count drops to 0. New allocations cannot
> > > be charged to offline memcgs, but existing allocations charged to
> > > offline memcgs remain charged, and hold a reference to the memcg.
> > >
> > > As such, an offline memcg can remain in the kernel indefinitely,
> > > becoming a zombie memcg. The accumulation of a large number of zombie
> > > memcgs lead to increased system overhead (mainly percpu data in struct
> > > mem_cgroup). It also causes some kernel operations that scale with the
> > > number of memcgs to become less efficient (e.g. reclaim).

The problem is even more fundamental:
1) offline memcgs are (almost) fully functional memcgs from the kernel's point
   of view,
2) if memcg A allocates some memory, goes offline and now memcg B is using this
   memory, the memory is effectively shared between memcgs A and B,
3) sharing memory was never really supported by memcgs.

If memory is shared between memcgs, most memcg functionality is broken, aside
from the case where memcgs are working in a very predictable and coordinated way.
In general, all counters and stats become racy (whoever allocated it first
pays the full price, others get it for free), and memory limits and protections
are based on those same counters.
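
As a toy illustration of that "whoever allocated it first pays" behavior
(plain C, invented structures, nothing like the real charging code):

#include <stdio.h>

struct memcg { const char *name; long charged; };
struct page  { struct memcg *memcg; };   /* toy model: one owner per page */

/* Charge the page to @m only if nobody owns it yet ("first toucher pays"). */
static void charge_if_unowned(struct page *p, struct memcg *m, long bytes)
{
    if (!p->memcg) {
        p->memcg = m;
        m->charged += bytes;
    }
    /* Subsequent users of the same page get it "for free". */
}

int main(void)
{
    struct memcg a = { "A", 0 }, b = { "B", 0 };
    struct page shared = { NULL };

    charge_if_unowned(&shared, &a, 4096);   /* A faults it in first and pays */
    charge_if_unowned(&shared, &b, 4096);   /* B uses it for free            */

    printf("A=%ld B=%ld\n", a.charged, b.charged);  /* prints A=4096 B=0 */
    return 0;
}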

Depending on the % of memory shared, the rate at which memcgs are created and
destroyed, memory pressure, and other workload-dependent factors, the problem
can be significant or not.

One way to tackle this problem is to stop using memcgs as wrappers for
individual processes or workloads and use them more as performance classes.
This means assigning them more statically and with less memory sharing.
However, I admit it's the opposite direction from where everything has gone
for the last decade or so.

> > >
> > > There are currently out-of-tree solutions which attempt to
> > > periodically clean up zombie memcgs by reclaiming from them. However
> > > that is not effective for non-reclaimable memory, which it would be
> > > better to reparent or recharge to an online cgroup. There are also
> > > proposed changes that would benefit from recharging for shared
> > > resources like pinned pages, or DMA buffer pages.
> >
> > I am very interested in attending this discussion, it's something that
> > I have been actively looking into -- specifically recharging pages of
> > offlined memcgs.
> >
> > >
> > > Suggested attendees:
> > > Yosry Ahmed <yosryahmed@google.com>
> > > Yu Zhao <yuzhao@google.com>
> > > T.J. Mercier <tjmercier@google.com>
> > > Tejun Heo <tj@kernel.org>
> > > Shakeel Butt <shakeelb@google.com>
> > > Muchun Song <muchun.song@linux.dev>
> > > Johannes Weiner <hannes@cmpxchg.org>
> > > Roman Gushchin <roman.gushchin@linux.dev>
> > > Alistair Popple <apopple@nvidia.com>
> > > Jason Gunthorpe <jgg@nvidia.com>
> > > Kalesh Singh <kaleshsingh@google.com>
> 
> I was hoping I would bring a more complete idea to this thread, but
> here is what I have so far.
> 
> The idea is to recharge the memory charged to memcgs when they are
> offlined. I like to think of the options we have to deal with memory
> charged to offline memcgs as a toolkit. This toolkit includes:
> 
> (a) Evict memory.
> 
> This is the simplest option, just evict the memory.
> 
> For file-backed pages, this writes them back to their backing files,
> uncharging and freeing the page. The next access will read the page
> again and the faulting process’s memcg will be charged.
> 
> For swap-backed pages (anon/shmem), this swaps them out. Swapping out
> a page charged to an offline memcg uncharges the page and charges the
> swap to its parent. The next access will swap in the page and the
> parent will be charged. This is effectively deferred recharging to the
> parent.
> 
> Pros:
> - Simple.
> 
> Cons:
> - Behavior differs for file-backed vs. swap-backed pages: for
> swap-backed pages, the memory is recharged to the parent (aka
> reparented), not charged to the "rightful" user.
> - Next access will incur higher latency, especially if the pages are active.

Generally I think it's a good solution iff there is not much of memory sharing
with other memcgs. But in practice there is a high chance that some very hot
pages (e.g. shlib pages shared by pretty much everyone) will get evicted.
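
As a side note, the userspace-driven reclaim that the proposal mentions
can be done today through cgroup v2's memory.reclaim interface (5.19+),
and reclaiming from an online ancestor should also reach pages still
charged to its offline descendants. A minimal sketch, with a made-up
cgroup path:

#include <stdio.h>

/* Ask the kernel to reclaim up to 64M from one cgroup subtree.
 * The cgroup path is hypothetical; adjust to your hierarchy. */
int main(void)
{
    const char *path = "/sys/fs/cgroup/example.slice/memory.reclaim";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror("fopen");
        return 1;
    }
    /* memory.reclaim accepts a byte count, with an optional K/M/G suffix. */
    if (fputs("64M\n", f) == EOF)
        perror("fputs");
    fclose(f);
    return 0;
}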

> 
> (b) Direct recharge to the parent
> 
> This can be done for any page and should be simple as the pages are
> already hierarchically charged to the parent.
> 
> Pros:
> - Simple.
> 
> Cons:
> - If a different memcg is using the memory, it will keep taxing the
> parent indefinitely. Same "not the rightful user" argument as above.

It worked for slabs and other kmem objects to reduce the severity of the memcg
zombie clogging. Muchun posted patches for lru pages. I believe it's a decent
way to solve the zombie problem, but it doesn't solve any issues with the memory
sharing.

> 
> (c) Direct recharge to the mapper
> 
> This can be done for any mapped page by walking the rmap and
> identifying the memcg of the process(es) mapping the page.
> 
> Pros:
> - Memory is recharged to the “rightful” user.
> 
> Cons:
> - More complicated, the “rightful” user’s memcg might run into an OOM
> situation – which in this case will be unpredictable and hard to
> correlate with an allocation.
> 
> (d) Deferred recharging
> 
> This is a mixture of (b) & (c) above. It is a two-step process. We
> first recharge the memory to the parent, which should be simple and
> reliable. Then, we mark the pages so that the next time they are
> accessed or mapped we recharge them to the "rightful" user.
> 
> For mapped pages, we can use the numa balancing approach of protecting
> the mapping (while the vma is still accessible), and then in the fault
> path recharge the page. This is better than eviction because the fault
> on the next access is minor, and better than direct recharging to the
> mapping in the sense that the charge is correlated with an
> allocation/mapping. Of course, it is more complicated, we have to
> handle different protection interactions (e.g. what if the page is
> already protected?). Another disadvantage is that the recharging
> happens in the context of a page fault, rather than asynchronously in
> the case of directly recharging to the mapper. Page faults are more
> latency sensitive, although this shouldn't be a common path.
> 
> For unmapped pages, I am struggling to find a way that is simple
> enough to recharge the memory on the next access. My first intuition
> was to add a hook to folio_mark_accessed(), but I was quickly told
> that this is not invoked in all access paths to unmapped pages (e.g.
> writes through fds). We can also add a hook to folio_mark_dirty() to
> add more coverage, but it seems like this path is fragile, and it
> would be ideal if there is a shared well-defined common path (or
> paths) for all accesses to unmapped pages. I would imagine if such a
> path exists or can be forged it would probably be in the page cache
> code somewhere.

The problem is that we'd need to add hooks and checks into many hot paths,
so the performance penalty will likely be severe.
But of course hard to tell without actual patches.
> 
> For both cases, if a new mapping is created, we can do recharging there.
> 
> Pros:
> - Memory is recharged to the “rightful” user, eventually.
> - The charge is predictable and correlates to a user's access.
> - Less overhead on next access than eviction.
> 
> Cons:
> - The memory will remain charged to the parent until the next access
> happens, if it ever happens.
> - Worse overhead on next access than directly recharging to the mapper.
> 
> With this (incompletely defined) toolkit, a recharging algorithm can
> look like this (as a rough example):
> 
> - If the page is file-backed:
>   - Unmapped? evict (a).
>   - Mapped? recharge to the mapper -- direct (c) or deferred (d).
> - If the page is swap-backed:
>   - Unmapped? deferred recharge to the next accessor (d).
>   - Mapped? recharge to the mapper -- direct (c) or deferred (d).
> 
> There are, of course, open questions:
> 1) How do we do deferred recharging for unmapped pages? Is deferred
> recharging even a reliable option to begin with? What if the pages are
> never accessed again?

I believe the real question is how to handle memory shared between memcgs.
Dealing with offline memcgs is just a specific case of this problem.
> 
> Again, I was hoping to come up with a more concrete proposal, but as
> LSF/MM/BPF is approaching, I wanted to share my thoughts on the
> mailing list looking for any feedback.

Thank you for bringing it in!


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
  2023-05-01 16:38       ` Roman Gushchin
@ 2023-05-02  7:18         ` Yosry Ahmed
  -1 siblings, 0 replies; 51+ messages in thread
From: Yosry Ahmed @ 2023-05-02  7:18 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: T.J. Mercier, lsf-pc, linux-mm, cgroups, Tejun Heo, Shakeel Butt,
	Muchun Song, Johannes Weiner, Alistair Popple, Jason Gunthorpe,
	Kalesh Singh, Yu Zhao, Matthew Wilcox, David Rientjes,
	Greg Thelen

Thanks Roman for taking a look!

On Mon, May 1, 2023 at 9:38 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Tue, Apr 25, 2023 at 04:36:53AM -0700, Yosry Ahmed wrote:
> >  +David Rientjes +Greg Thelen +Matthew Wilcox
>
> Hi Yosry!
>
> Sorry for being late to the party, I was offline for a week.
>
> >
> > On Tue, Apr 11, 2023 at 4:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier <tjmercier@google.com> wrote:
> > > >
> > > > When a memcg is removed by userspace it gets offlined by the kernel.
> > > > Offline memcgs are hidden from user space, but they still live in the
> > > > kernel until their reference count drops to 0. New allocations cannot
> > > > be charged to offline memcgs, but existing allocations charged to
> > > > offline memcgs remain charged, and hold a reference to the memcg.
> > > >
> > > > As such, an offline memcg can remain in the kernel indefinitely,
> > > > becoming a zombie memcg. The accumulation of a large number of zombie
> > > > memcgs lead to increased system overhead (mainly percpu data in struct
> > > > mem_cgroup). It also causes some kernel operations that scale with the
> > > > number of memcgs to become less efficient (e.g. reclaim).
>
> The problem is even more fundamental:
> 1) offline memcgs are (almost) fully functional memcgs from the kernel's point
>    of view,
> 2) if memcg A allocates some memory, goes offline and now memcg B is using this
>    memory, the memory is effectively shared between memcgs A and B,
> 3) sharing memory was never really supported by memcgs.
>
> If memory is shared between memcgs, most memcg functionality is broken, aside
> from the case where memcgs are working in a very predictable and coordinated way.
> In general, all counters and stats become racy (whoever allocated it first
> pays the full price, others get it for free), and memory limits and protections
> are based on those same counters.

100% agreed. This is why we introduced (and tried to upstream) a
memcg= mount option to enforce deterministic charging of shared
resources, but this addresses one part of the more general problem as
you state below.
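
For anyone who hasn't seen that proposal: the idea is that the mounter
picks the memcg that all pages of, say, a tmpfs mount get charged to, so
shared data has one deterministic owner. A rough sketch of what using it
could look like; the option string and cgroup path are assumptions based
on the proposal, not a merged interface:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Hypothetical: charge everything in this tmpfs to one agreed-upon
     * memcg so shared data has a deterministic owner. Requires the
     * out-of-tree memcg= patch; the target directory is assumed to exist. */
    const char *opts = "size=256M,memcg=/sys/fs/cgroup/shared.slice";

    if (mount("tmpfs", "/mnt/shared", "tmpfs", 0, opts) != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}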

>
> Depending on the % of memory shared, the rate at which memcgs are created and
> destroyed, memory pressure, and other workload-dependent factors, the problem
> can be significant or not.
>
> One way to tackle this problem is to stop using memcgs as wrappers for
> individual processes or workloads and use them more as performance classes.
> This means assigning them more statically and with less memory sharing.
> However, I admit it's the opposite direction from where everything has gone
> for the last decade or so.

This smells like cgroup v3, or even something completely new. Given
that we are still using cgroup v1 in Google's prodkernel, this sounds
scary to poor me :)

>
> > > >
> > > > There are currently out-of-tree solutions which attempt to
> > > > periodically clean up zombie memcgs by reclaiming from them. However
> > > > that is not effective for non-reclaimable memory, which it would be
> > > > better to reparent or recharge to an online cgroup. There are also
> > > > proposed changes that would benefit from recharging for shared
> > > > resources like pinned pages, or DMA buffer pages.
> > >
> > > I am very interested in attending this discussion, it's something that
> > > I have been actively looking into -- specifically recharging pages of
> > > offlined memcgs.
> > >
> > > >
> > > > Suggested attendees:
> > > > Yosry Ahmed <yosryahmed@google.com>
> > > > Yu Zhao <yuzhao@google.com>
> > > > T.J. Mercier <tjmercier@google.com>
> > > > Tejun Heo <tj@kernel.org>
> > > > Shakeel Butt <shakeelb@google.com>
> > > > Muchun Song <muchun.song@linux.dev>
> > > > Johannes Weiner <hannes@cmpxchg.org>
> > > > Roman Gushchin <roman.gushchin@linux.dev>
> > > > Alistair Popple <apopple@nvidia.com>
> > > > Jason Gunthorpe <jgg@nvidia.com>
> > > > Kalesh Singh <kaleshsingh@google.com>
> >
> > I was hoping I would bring a more complete idea to this thread, but
> > here is what I have so far.
> >
> > The idea is to recharge the memory charged to memcgs when they are
> > offlined. I like to think of the options we have to deal with memory
> > charged to offline memcgs as a toolkit. This toolkit includes:
> >
> > (a) Evict memory.
> >
> > This is the simplest option, just evict the memory.
> >
> > For file-backed pages, this writes them back to their backing files,
> > uncharging and freeing the page. The next access will read the page
> > again and the faulting process’s memcg will be charged.
> >
> > For swap-backed pages (anon/shmem), this swaps them out. Swapping out
> > a page charged to an offline memcg uncharges the page and charges the
> > swap to its parent. The next access will swap in the page and the
> > parent will be charged. This is effectively deferred recharging to the
> > parent.
> >
> > Pros:
> > - Simple.
> >
> > Cons:
> > - Behavior differs for file-backed vs. swap-backed pages: for
> > swap-backed pages, the memory is recharged to the parent (aka
> > reparented), not charged to the "rightful" user.
> > - Next access will incur higher latency, especially if the pages are active.
>
> Generally I think it's a good solution iff there is not much of memory sharing
> with other memcgs. But in practice there is a high chance that some very hot
> pages (e.g. shlib pages shared by pretty much everyone) will get evicted.
>

Agreed, but I guess it depends on how often those pages are charged to
a memcg that is being offlined. I can easily imagine a scenario where
they keep bouncing between memcgs and getting evicted every time,
though.

> >
> > (b) Direct recharge to the parent
> >
> > This can be done for any page and should be simple as the pages are
> > already hierarchically charged to the parent.
> >
> > Pros:
> > - Simple.
> >
> > Cons:
> > - If a different memcg is using the memory, it will keep taxing the
> > parent indefinitely. Same "not the rightful user" argument as above.
>
> It worked for slabs and other kmem objects to reduce the severity of the memcg
> zombie clogging. Muchun posted patches for lru pages. I believe it's a decent
> way to solve the zombie problem, but it doesn't solve any issues with the memory
> sharing.

It has its pros and cons. We have less potential for pinning offline
memcgs, but if the pinning is coming from a bug or a reference leak,
we have much less chance of finding it as the slab/kmem objects will
silently be reparented, and eventually accumulate at the root.

For slab/kmem, there isn't much we can do. For user pages, it should
be much easier to attribute the pages to a process/memcg, hence
recharging is a more compelling option.
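
Roughly, option (c) for user pages boils down to something like the
following toy model (plain C, invented structures; the real thing would
walk the rmap under the appropriate locks): pick any mapper whose memcg
is still online and move the charge there, and fall back to reparenting
otherwise.

#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

struct memcg { const char *name; bool online; long charged; };
struct task  { struct memcg *memcg; };
struct page  {
    struct memcg *memcg;        /* current (possibly offline) owner */
    struct task **mappers;      /* toy stand-in for the rmap        */
    size_t nr_mappers;
};

/* Recharge the page to the first mapper with an online memcg, if any. */
static bool recharge_to_mapper(struct page *p, long bytes)
{
    for (size_t i = 0; i < p->nr_mappers; i++) {
        struct memcg *m = p->mappers[i]->memcg;

        if (m && m->online && m != p->memcg) {
            p->memcg->charged -= bytes;
            m->charged += bytes;
            p->memcg = m;
            return true;
        }
    }
    return false;   /* nobody suitable: fall back to reparenting */
}

int main(void)
{
    struct memcg dead = { "offline", false, 4096 };
    struct memcg live = { "B", true, 0 };
    struct task  t = { &live };
    struct task *maps[] = { &t };
    struct page  pg = { &dead, maps, 1 };

    if (recharge_to_mapper(&pg, 4096))
        printf("recharged to %s\n", pg.memcg->name);   /* recharged to B */
    return 0;
}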

>
> >
> > (c) Direct recharge to the mapper
> >
> > This can be done for any mapped page by walking the rmap and
> > identifying the memcg of the process(es) mapping the page.
> >
> > Pros:
> > - Memory is recharged to the “rightful” user.
> >
> > Cons:
> > - More complicated, the “rightful” user’s memcg might run into an OOM
> > situation – which in this case will be unpredictable and hard to
> > correlate with an allocation.
> >
> > (d) Deferred recharging
> >
> > This is a mixture of (b) & (c) above. It is a two-step process. We
> > first recharge the memory to the parent, which should be simple and
> > reliable. Then, we mark the pages so that the next time they are
> > accessed or mapped we recharge them to the "rightful" user.
> >
> > For mapped pages, we can use the numa balancing approach of protecting
> > the mapping (while the vma is still accessible), and then in the fault
> > path recharge the page. This is better than eviction because the fault
> > on the next access is minor, and better than direct recharging to the
> > mapping in the sense that the charge is correlated with an
> > allocation/mapping. Of course, it is more complicated, we have to
> > handle different protection interactions (e.g. what if the page is
> > already protected?). Another disadvantage is that the recharging
> > happens in the context of a page fault, rather than asynchronously in
> > the case of directly recharging to the mapper. Page faults are more
> > latency sensitive, although this shouldn't be a common path.
> >
> > For unmapped pages, I am struggling to find a way that is simple
> > enough to recharge the memory on the next access. My first intuition
> > was to add a hook to folio_mark_accessed(), but I was quickly told
> > that this is not invoked in all access paths to unmapped pages (e.g.
> > writes through fds). We can also add a hook to folio_mark_dirty() to
> > add more coverage, but it seems like this path is fragile, and it
> > would be ideal if there is a shared well-defined common path (or
> > paths) for all accesses to unmapped pages. I would imagine if such a
> > path exists or can be forged it would probably be in the page cache
> > code somewhere.
>
> The problem is that we'd need to add hooks and checks into many hot paths,
> so the performance penalty will likely be severe.
> But of course hard to tell without actual patches.

Not always, I guess. If the pages are mapped and we choose to walk the
rmap and recharge, this can be done completely asynchronously as far
as I can tell.

For deferred recharging, yeah in some cases we may need to add hooks
in relatively hot paths. It shouldn't be something that's happening
very frequently though -- at least I hope. Also, if we add hooks to
the fault path and/or file read/write paths to do recharging, it
should be relatively okay. Charging happens in these paths anyway if
we end up allocating a new page to satisfy the fault/read/write, so in
that sense it's not completely unheard of.

I don't have specific patches, so like you say I can't tell for sure,
and even if I had patches I would assume it's highly workload
dependent.
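
To spell out what the deferred variant (d) would mean mechanically, here
is a toy model (plain C, invented structures, no real kernel hooks): at
offline time the charge is parked at the parent and the page is flagged;
the next fault-like access that sees the flag moves the charge to the
accessor.

#include <stdio.h>
#include <stdbool.h>

struct memcg { const char *name; long charged; };
struct page  {
    struct memcg *memcg;
    bool recharge_pending;      /* set when the original owner went offline */
};

/* Step 1 (at offline time): reparent and mark for later recharging. */
static void defer_recharge(struct page *p, struct memcg *parent, long bytes)
{
    p->memcg->charged -= bytes;
    parent->charged += bytes;
    p->memcg = parent;
    p->recharge_pending = true;
}

/* Step 2 (in a fault-like path): recharge to whoever actually touches it. */
static void on_access(struct page *p, struct memcg *accessor, long bytes)
{
    if (p->recharge_pending && accessor != p->memcg) {
        p->memcg->charged -= bytes;
        accessor->charged += bytes;
        p->memcg = accessor;
    }
    p->recharge_pending = false;
}

int main(void)
{
    struct memcg dead = { "offline", 4096 }, parent = { "parent", 0 }, b = { "B", 0 };
    struct page pg = { &dead, false };

    defer_recharge(&pg, &parent, 4096); /* charge parked at the parent */
    on_access(&pg, &b, 4096);           /* first access moves it to B  */
    printf("parent=%ld B=%ld\n", parent.charged, b.charged);  /* 0 and 4096 */
    return 0;
}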

> >
> > For both cases, if a new mapping is created, we can do recharging there.
> >
> > Pros:
> > - Memory is recharged to the “rightful” user, eventually.
> > - The charge is predictable and correlates to a user's access.
> > - Less overhead on next access than eviction.
> >
> > Cons:
> > - The memory will remain charged to the parent until the next access
> > happens, if it ever happens.
> > - Worse overhead on next access than directly recharging to the mapper.
> >
> > With this (incompletely defined) toolkit, a recharging algorithm can
> > look like this (as a rough example):
> >
> > - If the page is file-backed:
> >   - Unmapped? evict (a).
> >   - Mapped? recharge to the mapper -- direct (c) or deferred (d).
> > - If the page is swap-backed:
> >   - Unmapped? deferred recharge to the next accessor (d).
> >   - Mapped? recharge to the mapper -- direct (c) or deferred (d).
> >
> > There are, of course, open questions:
> > 1) How do we do deferred recharging for unmapped pages? Is deferred
> > recharging even a reliable option to begin with? What if the pages are
> > never accessed again?
>
> I believe the real question is how to handle memory shared between memcgs.
> Dealing with offline memcgs is just a specific case of this problem.

Agreed, but if we can't find a solution to the sharing problem in the
not-so-long term, we may want to address the offline memcgs problem for
now. I am happy to discuss the more generic sharing problem in all
cases; we are affected by it in ways other than offline memcgs as well
(e.g. shared tmpfs mounts).

> >
> > Again, I was hoping to come up with a more concrete proposal, but as
> > LSF/MM/BPF is approaching, I wanted to share my thoughts on the
> > mailing list looking for any feedback.
>
> Thank you for bringing it in!


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
  2023-05-01 16:38       ` Roman Gushchin
@ 2023-05-02 20:02         ` Yosry Ahmed
  -1 siblings, 0 replies; 51+ messages in thread
From: Yosry Ahmed @ 2023-05-02 20:02 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: T.J. Mercier, lsf-pc, linux-mm, cgroups, Tejun Heo, Shakeel Butt,
	Muchun Song, Johannes Weiner, Alistair Popple, Jason Gunthorpe,
	Kalesh Singh, Yu Zhao, Matthew Wilcox, David Rientjes,
	Greg Thelen

On Mon, May 1, 2023 at 9:38 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Tue, Apr 25, 2023 at 04:36:53AM -0700, Yosry Ahmed wrote:
> >  +David Rientjes +Greg Thelen +Matthew Wilcox
>
> Hi Yosry!
>
> Sorry for being late to the party, I was offline for a week.
>
> >
> > On Tue, Apr 11, 2023 at 4:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier <tjmercier@google.com> wrote:
> > > >
> > > > When a memcg is removed by userspace it gets offlined by the kernel.
> > > > Offline memcgs are hidden from user space, but they still live in the
> > > > kernel until their reference count drops to 0. New allocations cannot
> > > > be charged to offline memcgs, but existing allocations charged to
> > > > offline memcgs remain charged, and hold a reference to the memcg.
> > > >
> > > > As such, an offline memcg can remain in the kernel indefinitely,
> > > > becoming a zombie memcg. The accumulation of a large number of zombie
> > > > memcgs lead to increased system overhead (mainly percpu data in struct
> > > > mem_cgroup). It also causes some kernel operations that scale with the
> > > > number of memcgs to become less efficient (e.g. reclaim).
>
> The problem is even more fundamental:
> 1) offline memcgs are (almost) fully functional memcgs from the kernel's point
>    of view,
> 2) if memcg A allocates some memory, goes offline and now memcg B is using this
>    memory, the memory is effectively shared between memcgs A and B,
> 3) sharing memory was never really supported by memcgs.
>
> If memory is shared between memcgs, most memcg functionality is broken, aside
> from the case where memcgs are working in a very predictable and coordinated way.
> In general, all counters and stats become racy (whoever allocated it first
> pays the full price, others get it for free), and memory limits and protections
> are based on those same counters.
>
> Depending on the % of memory shared, the rate at which memcgs are created and
> destroyed, memory pressure, and other workload-dependent factors, the problem
> can be significant or not.
>
> One way to tackle this problem is to stop using memcgs as wrappers for
> individual processes or workloads and use them more as performance classes.
> This means assigning them more statically and with less memory sharing.
> However, I admit it's the opposite direction from where everything has gone
> for the last decade or so.
>
> > > >
> > > > There are currently out-of-tree solutions which attempt to
> > > > periodically clean up zombie memcgs by reclaiming from them. However
> > > > that is not effective for non-reclaimable memory, which it would be
> > > > better to reparent or recharge to an online cgroup. There are also
> > > > proposed changes that would benefit from recharging for shared
> > > > resources like pinned pages, or DMA buffer pages.
> > >
> > > I am very interested in attending this discussion, it's something that
> > > I have been actively looking into -- specifically recharging pages of
> > > offlined memcgs.
> > >
> > > >
> > > > Suggested attendees:
> > > > Yosry Ahmed <yosryahmed@google.com>
> > > > Yu Zhao <yuzhao@google.com>
> > > > T.J. Mercier <tjmercier@google.com>
> > > > Tejun Heo <tj@kernel.org>
> > > > Shakeel Butt <shakeelb@google.com>
> > > > Muchun Song <muchun.song@linux.dev>
> > > > Johannes Weiner <hannes@cmpxchg.org>
> > > > Roman Gushchin <roman.gushchin@linux.dev>
> > > > Alistair Popple <apopple@nvidia.com>
> > > > Jason Gunthorpe <jgg@nvidia.com>
> > > > Kalesh Singh <kaleshsingh@google.com>
> >
> > I was hoping I would bring a more complete idea to this thread, but
> > here is what I have so far.
> >
> > The idea is to recharge the memory charged to memcgs when they are
> > offlined. I like to think of the options we have to deal with memory
> > charged to offline memcgs as a toolkit. This toolkit includes:
> >
> > (a) Evict memory.
> >
> > This is the simplest option, just evict the memory.
> >
> > For file-backed pages, this writes them back to their backing files,
> > uncharging and freeing the page. The next access will read the page
> > again and the faulting process’s memcg will be charged.
> >
> > For swap-backed pages (anon/shmem), this swaps them out. Swapping out
> > a page charged to an offline memcg uncharges the page and charges the
> > swap to its parent. The next access will swap in the page and the
> > parent will be charged. This is effectively deferred recharging to the
> > parent.
> >
> > Pros:
> > - Simple.
> >
> > Cons:
> > - Behavior differs for file-backed vs. swap-backed pages: for
> > swap-backed pages, the memory is recharged to the parent (aka
> > reparented), not charged to the "rightful" user.
> > - Next access will incur higher latency, especially if the pages are active.
>
> Generally I think it's a good solution iff there is not much of memory sharing
> with other memcgs. But in practice there is a high chance that some very hot
> pages (e.g. shlib pages shared by pretty much everyone) will get evicted.

One other thing that I recently noticed: if a memcg has some pages
swapped out (or userspace deliberately frees memory from a memcg before
removing it) and the memcg is then offlined, the swapped-out pages will
hold the memcg hostage until the next fault recharges or reparents
them.

If the pages are truly cold, this can take a while (or never happen).

Separate from recharging, one thing we can do is have a kthread loop
over swap_cgroup and reparent swap entries charged to offline memcgs
to unpin them. Ideally, we will mark those entries such that they get
recharged upon the next fault instead of just keeping them at the
parent.

Just thinking out loud here.
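
In code form, what I have in mind is roughly this toy model (plain C,
invented types; the real thing would iterate the swap_cgroup map from a
kthread):

#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

struct memcg {
    const char *name;
    struct memcg *parent;
    bool online;
    long swap_charged;
};
struct swap_entry {
    struct memcg *owner;
    bool recharge_on_swapin;    /* hint for the eventual swap-in path */
};

/* One pass of the hypothetical scanner. */
static void scan_swap_entries(struct swap_entry *ents, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct memcg *m = ents[i].owner;

        if (!m || m->online)
            continue;
        /* Find the nearest online ancestor and move the swap charge there. */
        struct memcg *t = m->parent;
        while (t && !t->online)
            t = t->parent;
        if (!t)
            continue;
        m->swap_charged--;
        t->swap_charged++;
        ents[i].owner = t;
        ents[i].recharge_on_swapin = true;
    }
}

int main(void)
{
    struct memcg root = { "root", NULL, true, 0 };
    struct memcg zombie = { "zombie", &root, false, 2 };
    struct swap_entry ents[] = { { &zombie, false }, { &zombie, false } };

    scan_swap_entries(ents, 2);
    printf("root now holds %ld swap charges\n", root.swap_charged);  /* 2 */
    return 0;
}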

>
> >
> > (b) Direct recharge to the parent
> >
> > This can be done for any page and should be simple as the pages are
> > already hierarchically charged to the parent.
> >
> > Pros:
> > - Simple.
> >
> > Cons:
> > - If a different memcg is using the memory, it will keep taxing the
> > parent indefinitely. Same "not the rightful user" argument as above.
>
> It worked for slabs and other kmem objects to reduce the severity of the memcg
> zombie clogging. Muchun posted patches for lru pages. I believe it's a decent
> way to solve the zombie problem, but it doesn't solve any issues with the memory
> sharing.
>
> >
> > (c) Direct recharge to the mapper
> >
> > This can be done for any mapped page by walking the rmap and
> > identifying the memcg of the process(es) mapping the page.
> >
> > Pros:
> > - Memory is recharged to the “rightful” user.
> >
> > Cons:
> > - More complicated, the “rightful” user’s memcg might run into an OOM
> > situation – which in this case will be unpredictable and hard to
> > correlate with an allocation.
> >
> > (d) Deferred recharging
> >
> > This is a mixture of (b) & (c) above. It is a two-step process. We
> > first recharge the memory to the parent, which should be simple and
> > reliable. Then, we mark the pages so that the next time they are
> > accessed or mapped we recharge them to the "rightful" user.
> >
> > For mapped pages, we can use the numa balancing approach of protecting
> > the mapping (while the vma is still accessible), and then in the fault
> > path recharge the page. This is better than eviction because the fault
> > on the next access is minor, and better than direct recharging to the
> > mapping in the sense that the charge is correlated with an
> > allocation/mapping. Of course, it is more complicated, we have to
> > handle different protection interactions (e.g. what if the page is
> > already protected?). Another disadvantage is that the recharging
> > happens in the context of a page fault, rather than asynchronously in
> > the case of directly recharging to the mapper. Page faults are more
> > latency sensitive, although this shouldn't be a common path.
> >
> > For unmapped pages, I am struggling to find a way that is simple
> > enough to recharge the memory on the next access. My first intuition
> > was to add a hook to folio_mark_accessed(), but I was quickly told
> > that this is not invoked in all access paths to unmapped pages (e.g.
> > writes through fds). We can also add a hook to folio_mark_dirty() to
> > add more coverage, but it seems like this path is fragile, and it
> > would be ideal if there is a shared well-defined common path (or
> > paths) for all accesses to unmapped pages. I would imagine if such a
> > path exists or can be forged it would probably be in the page cache
> > code somewhere.
>
> The problem is that we'd need to add hooks and checks into many hot paths,
> so the performance penalty will likely be severe.
> But of course hard to tell without actual patches.
> >
> > For both cases, if a new mapping is created, we can do recharging there.
> >
> > Pros:
> > - Memory is recharged to the “rightful” user, eventually.
> > - The charge is predictable and correlates to a user's access.
> > - Less overhead on next access than eviction.
> >
> > Cons:
> > - The memory will remain charged to the parent until the next access
> > happens, if it ever happens.
> > - Worse overhead on next access than directly recharging to the mapper.
> >
> > With this (incompletely defined) toolkit, a recharging algorithm can
> > look like this (as a rough example):
> >
> > - If the page is file-backed:
> >   - Unmapped? evict (a).
> >   - Mapped? recharge to the mapper -- direct (c) or deferred (d).
> > - If the page is swap-backed:
> >   - Unmapped? deferred recharge to the next accessor (d).
> >   - Mapped? recharge to the mapper -- direct (c) or deferred (d).
> >
> > There are, of course, open questions:
> > 1) How do we do deferred recharging for unmapped pages? Is deferred
> > recharging even a reliable option to begin with? What if the pages are
> > never accessed again?
>
> I believe the real question is how to handle memory shared between memcgs.
> Dealing with offline memcgs is just a specific case of this problem.
> >
> > Again, I was hoping to come up with a more concrete proposal, but as
> > LSF/MM/BPF is approaching, I wanted to share my thoughts on the
> > mailing list looking for any feedback.
>
> Thank you for bringing it in!


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-03 22:15   ` Chris Li
  0 siblings, 0 replies; 51+ messages in thread
From: Chris Li @ 2023-05-03 22:15 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: lsf-pc, linux-mm, cgroups, Yosry Ahmed, Tejun Heo, Shakeel Butt,
	Muchun Song, Johannes Weiner, Roman Gushchin, Alistair Popple,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao

Hi T.J.,

On Tue, Apr 11, 2023 at 04:36:37PM -0700, T.J. Mercier wrote:
> When a memcg is removed by userspace it gets offlined by the kernel.
> Offline memcgs are hidden from user space, but they still live in the
> kernel until their reference count drops to 0. New allocations cannot
> be charged to offline memcgs, but existing allocations charged to
> offline memcgs remain charged, and hold a reference to the memcg.
> 
> As such, an offline memcg can remain in the kernel indefinitely,
> becoming a zombie memcg. The accumulation of a large number of zombie
> memcgs lead to increased system overhead (mainly percpu data in struct
> mem_cgroup). It also causes some kernel operations that scale with the
> number of memcgs to become less efficient (e.g. reclaim).
> 
> There are currently out-of-tree solutions which attempt to
> periodically clean up zombie memcgs by reclaiming from them. However
> that is not effective for non-reclaimable memory, which it would be
> better to reparent or recharge to an online cgroup. There are also
> proposed changes that would benefit from recharging for shared
> resources like pinned pages, or DMA buffer pages.

I am also interested in this topic. T.J. and I have had some offline
discussion about this, and we have some proposals to solve this
problem.

I will share the write-up here for the upcoming LSF/MM discussion.


Shared Memory Cgroup Controllers

= Introduction

The current memory cgroup controller does not support shared memory objects.
For memory that is shared between different processes, it is not obvious
which process should get charged. Google has an internal tmpfs “memcg=”
mount option to charge tmpfs data to a specific memcg that is often
different from the memcg the charging processes run in. However, it faces
some difficulties when the charged memcg exits and becomes a zombie memcg.
Other approaches include “re-parenting” the memcg charge to the parent
memcg, which has its own problem: if the charge is huge, iterating over it
for reparenting can be costly.

= Proposed Solution

The proposed solution is to add a new type of memory controller for shared
memory usage, e.g. tmpfs, hugetlb, file system mmap and dma_buf. This shared
memory cgroup controller object will have the same life cycle as the
underlying shared memory.

Processes cannot be added to the shared memory cgroup. Instead, the shared
memory cgroup can be added to a memcg using a “smemcg” API file, similar to
adding a process into the “tasks” API file.
When a smemcg is added to a memcg, the amount of its memory that is shared
with the memcg's processes is accounted for as part of the memcg's
“memory.current”. The memory.current of the memcg is then made up of two
parts: 1) the processes' anonymous memory and 2) the memory shared from the
smemcg.

When the memcg “memory.current” reaches the limit, the kernel will actively
try to reclaim from the memcg to keep “smemcg memory + process anonymous
memory” within the limit. Further memory allocations by processes in the
memcg will fail if the limit cannot be met. If repeated reclaim attempts
fail to bring the memcg “memory.current” within the limit, a process in
this memcg will get OOM killed.
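
To make that accounting concrete, here is a tiny userspace toy model of
the idea. It is an illustration only: the struct names, fields and helpers
(smemcg, memcg, memory_current(), try_charge_anon()) are made up for this
sketch and do not correspond to any existing kernel interface.

/*
 * Userspace toy model only; names and layout are invented for this sketch.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct smemcg {
        const char *name;       /* e.g. the tmpfs mount it belongs to */
        size_t charge;          /* shared memory, charged once to this object */
};

struct memcg {
        const char *name;
        size_t anon;            /* the processes' anonymous memory */
        size_t limit;           /* the memcg limit */
        struct smemcg *shared[4];  /* smemcgs added via the "smemcg" file */
        int nr_shared;
};

/* memory.current = anonymous memory + memory shared from attached smemcgs */
static size_t memory_current(const struct memcg *m)
{
        size_t cur = m->anon;

        for (int i = 0; i < m->nr_shared; i++)
                cur += m->shared[i]->charge;
        return cur;
}

/*
 * A new allocation only succeeds if the combined usage stays within the
 * limit; in the proposal the kernel would first try to reclaim, then OOM.
 */
static bool try_charge_anon(struct memcg *m, size_t bytes)
{
        if (memory_current(m) + bytes > m->limit)
                return false;
        m->anon += bytes;
        return true;
}

int main(void)
{
        struct smemcg tmpfs_smemcg = { .name = "tmpfs", .charge = 64 << 20 };
        struct memcg a = { .name = "A", .limit = 128 << 20 };

        a.shared[a.nr_shared++] = &tmpfs_smemcg;        /* add smemcg to A */
        printf("32M anon: %s\n", try_charge_anon(&a, 32 << 20) ? "ok" : "fail");
        printf("64M anon: %s\n", try_charge_anon(&a, 64 << 20) ? "ok" : "fail");
        printf("memory.current = %zu MiB\n", memory_current(&a) >> 20);
        return 0;
}

With the 64M smemcg attached and a 128M limit, the second anonymous
allocation fails because “smemcg memory + process anonymous memory” would
exceed the limit.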

= Benefits

The benefits of this solution include:
* No zombie memcgs. The life cycle of the smemcg matches the shared memory file system or dma_buf.
* No reparenting. The shared memory is charged only once, to the smemcg object. A memcg can include a smemcg as part of its memory usage. When the processes exit and the memcg gets deleted, the charge remains with the smemcg object.
* A much cleaner mental model for the smemcg: each shared memory page is charged to exactly one smemcg, only once.
* Note that the same smemcg can be added to more than one memcg, which better describes the sharing relationship.

Chris



> Suggested attendees:
> Yosry Ahmed <yosryahmed@google.com>
> Yu Zhao <yuzhao@google.com>
> T.J. Mercier <tjmercier@google.com>
> Tejun Heo <tj@kernel.org>
> Shakeel Butt <shakeelb@google.com>
> Muchun Song <muchun.song@linux.dev>
> Johannes Weiner <hannes@cmpxchg.org>
> Roman Gushchin <roman.gushchin@linux.dev>
> Alistair Popple <apopple@nvidia.com>
> Jason Gunthorpe <jgg@nvidia.com>
> Kalesh Singh <kaleshsingh@google.com>
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-04 11:58     ` Alistair Popple
  0 siblings, 0 replies; 51+ messages in thread
From: Alistair Popple @ 2023-05-04 11:58 UTC (permalink / raw)
  To: Chris Li
  Cc: T.J. Mercier, lsf-pc, linux-mm, cgroups, Yosry Ahmed, Tejun Heo,
	Shakeel Butt, Muchun Song, Johannes Weiner, Roman Gushchin,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao


Chris Li <chrisl@kernel.org> writes:

> Hi T.J.,
>
> On Tue, Apr 11, 2023 at 04:36:37PM -0700, T.J. Mercier wrote:
>> When a memcg is removed by userspace it gets offlined by the kernel.
>> Offline memcgs are hidden from user space, but they still live in the
>> kernel until their reference count drops to 0. New allocations cannot
>> be charged to offline memcgs, but existing allocations charged to
>> offline memcgs remain charged, and hold a reference to the memcg.
>> 
>> As such, an offline memcg can remain in the kernel indefinitely,
>> becoming a zombie memcg. The accumulation of a large number of zombie
>> memcgs lead to increased system overhead (mainly percpu data in struct
>> mem_cgroup). It also causes some kernel operations that scale with the
>> number of memcgs to become less efficient (e.g. reclaim).
>> 
>> There are currently out-of-tree solutions which attempt to
>> periodically clean up zombie memcgs by reclaiming from them. However
>> that is not effective for non-reclaimable memory, which it would be
>> better to reparent or recharge to an online cgroup. There are also
>> proposed changes that would benefit from recharging for shared
>> resources like pinned pages, or DMA buffer pages.
>
> I am also interested in this topic. T.J. and I have some offline
> discussion about this. We have some proposals to solve this
> problem.
>
> I will share the write up here for the up coming LSF/MM discussion.

Unfortunately I won't be attending LSF/MM in person this year but I am
interested in this topic as well from the point of view of limiting
pinned pages with cgroups. I am hoping to revive this patch series soon:

https://lore.kernel.org/linux-mm/cover.c238416f0e82377b449846dbb2459ae9d7030c8e.1675669136.git-series.apopple@nvidia.com/

The main problem with this series was getting agreement on whether to
add pinning as a separate cgroup (which is what the series currently
does) or whether it should be part of a per-page memcg limit.

The issue with per-page memcg limits is what to do for shared
mappings. The below suggestion sounds promising because the pins for
shared pages could be charged to the smemcg. However, I'm not sure how it
would solve the problem of a process in cgroup A being able to raise the
pin count of cgroup B when pinning a smemcg page, which was one of the
reasons I introduced a new cgroup controller.

> Shared Memory Cgroup Controllers
>
> = Introduction
>
> The current memory cgroup controller does not support shared memory
> objects. For the memory that is shared between different processes, it
> is not obvious which process should get charged. Google has some
> internal tmpfs “memcg=” mount option to charge tmpfs data to a
> specific memcg that’s often different from where charging processes
> run. However it faces some difficulties when the charged memcg exits
> and the charged memcg becomes a zombie memcg.
> Other approaches include “re-parenting” the memcg charge to the parent
> memcg. Which has its own problem. If the charge is huge, iteration of
> the reparenting can be costly.
>
> = Proposed Solution
>
> The proposed solution is to add a new type of memory controller for
> shared memory usage. E.g. tmpfs, hugetlb, file system mmap and
> dma_buf. This shared memory cgroup controller object will have the
> same life cycle of the underlying shared memory.
>
> Processes can not be added to the shared memory cgroup. Instead the
> shared memory cgroup can be added to the memcg using a “smemcg” API
> file, similar to adding a process into the “tasks” API file.
> When a smemcg is added to the memcg, the amount of memory that has
> been shared in the memcg process will be accounted for as the part of
> the memcg “memory.current”.The memory.current of the memcg is make up
> of two parts, 1) the processes anonymous memory and 2) the memory
> shared from smemcg.
>
> When the memcg “memory.current” is raised to the limit. The kernel
> will active try to reclaim for the memcg to make “smemcg memory +
> process anonymous memory” within the limit.

That means a process in one cgroup could force reclaim of smemcg memory
in use by a process in another cgroup right? I guess that's no different
to the current situation though.

> Further memory allocation
> within those memcg processes will fail if the limit can not be
> followed. If many reclaim attempts fail to bring the memcg
> “memory.current” within the limit, the process in this memcg will get
> OOM killed.

How would this work if, say, a charge from cgroup A to a smemcg attached to
both cgroup A and B caused cgroup B to go over its memory limit, and not
enough memory could be reclaimed from cgroup B? OOM killing a process in
cgroup B due to a charge from cgroup A doesn't sound like a good idea.

> = Benefits
>
> The benefits of this solution include:
> * No zombie memcg. The life cycle of the smemcg match the share memory file system or dma_buf.

If we added pinning it could get a bit messier, as it would have to hang
around until the driver unpinned the pages. But I don't think that's a
problem.

> * No reparenting. The shared memory only charge once to the smemcg
> object. A memcg can include a smemcg to as part of the memcg memory
> usage. When process exit and memcg get deleted, the charge remain to
> the smemcg object.
> * Much cleaner mental model of the smemcg, each share memory page is charge to one smemcg only once.
> * Notice the same smemcg can add to more than one memcg. It can better
> describe the shared memory relation.
>
> Chris
>
>
>
>> Suggested attendees:
>> Yosry Ahmed <yosryahmed@google.com>
>> Yu Zhao <yuzhao@google.com>
>> T.J. Mercier <tjmercier@google.com>
>> Tejun Heo <tj@kernel.org>
>> Shakeel Butt <shakeelb@google.com>
>> Muchun Song <muchun.song@linux.dev>
>> Johannes Weiner <hannes@cmpxchg.org>
>> Roman Gushchin <roman.gushchin@linux.dev>
>> Alistair Popple <apopple@nvidia.com>
>> Jason Gunthorpe <jgg@nvidia.com>
>> Kalesh Singh <kaleshsingh@google.com>
>> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-04 15:31       ` Chris Li
  0 siblings, 0 replies; 51+ messages in thread
From: Chris Li @ 2023-05-04 15:31 UTC (permalink / raw)
  To: Alistair Popple
  Cc: T.J. Mercier, lsf-pc, linux-mm, cgroups, Yosry Ahmed, Tejun Heo,
	Shakeel Butt, Muchun Song, Johannes Weiner, Roman Gushchin,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao

On Thu, May 04, 2023 at 09:58:56PM +1000, Alistair Popple wrote:
> 
> Chris Li <chrisl@kernel.org> writes:
> 
> > Hi T.J.,
> >
> > On Tue, Apr 11, 2023 at 04:36:37PM -0700, T.J. Mercier wrote:
> >> When a memcg is removed by userspace it gets offlined by the kernel.
> >> Offline memcgs are hidden from user space, but they still live in the
> >> kernel until their reference count drops to 0. New allocations cannot
> >> be charged to offline memcgs, but existing allocations charged to
> >> offline memcgs remain charged, and hold a reference to the memcg.
> >> 
> >> As such, an offline memcg can remain in the kernel indefinitely,
> >> becoming a zombie memcg. The accumulation of a large number of zombie
> >> memcgs lead to increased system overhead (mainly percpu data in struct
> >> mem_cgroup). It also causes some kernel operations that scale with the
> >> number of memcgs to become less efficient (e.g. reclaim).
> >> 
> >> There are currently out-of-tree solutions which attempt to
> >> periodically clean up zombie memcgs by reclaiming from them. However
> >> that is not effective for non-reclaimable memory, which it would be
> >> better to reparent or recharge to an online cgroup. There are also
> >> proposed changes that would benefit from recharging for shared
> >> resources like pinned pages, or DMA buffer pages.
> >
> > I am also interested in this topic. T.J. and I have some offline
> > discussion about this. We have some proposals to solve this
> > problem.
> >
> > I will share the write up here for the up coming LSF/MM discussion.
> 
> Unfortunately I won't be attending LSF/MM in person this year but I am

Will you be able to join virtually?

> interested in this topic as well from the point of view of limiting
> pinned pages with cgroups. I am hoping to revive this patch series soon:



> 
> https://lore.kernel.org/linux-mm/cover.c238416f0e82377b449846dbb2459ae9d7030c8e.1675669136.git-series.apopple@nvidia.com/
> 
> The main problem with this series was getting agreement on whether to
> add pinning as a separate cgroup (which is what the series currently
> does) or whether it should be part of a per-page memcg limit.
> 
> The issue with per-page memcg limits is what to do for shared
> mappings. The below suggestion sounds promising because the pins for
> shared pages could be charged to the smemcg. However I'm not sure how it
> would solve the problem of a process in cgroup A being able to raise the
> pin count of cgroup B when pinning a smemcg page which was one of the
> reason I introduced a new cgroup controller.

Now that I think of it, I can see the pin count memcg as a subtype of
smemcg.

The smemcg can have a limit as well; when it is added to a memcg, an
operation that would raise the pin-count smemcg charge over the smemcg
limit will fail.

Detailed tracking of the shared/unshared behavior can be modeled by the
smemcg as a step 2 feature.

There are four different kinds of operations that can be performed on a
smemcg (a rough sketch follows the list):

1) allocate/charge memory. The charge is added to the per-smemcg charge
counter and checked against the per-smemcg limit; ENOMEM if it would go
over the limit.

2) free/uncharge memory. Similar to the above, just subtracting from the
counter.

3) share/mmap already charged memory. This does not change the smemcg
charge counter; it adds to a per <smemcg, memcg> borrow counter. It is
possible to put a limit on that counter as well, even though I haven't
given much thought to how useful that is. It would limit how much memory
can be mapped from the smemcg.

4) unshare/munmap already charged memory. That reduces the per
<smemcg, memcg> borrow counter.
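
A rough userspace-only sketch of those four operations, just to make the
split between the per-smemcg charge counter and the per <smemcg, memcg>
borrow counter concrete. All names and numbers below are hypothetical;
this is not proposed kernel code.

/*
 * Userspace toy model only; all names are invented for this sketch.
 */
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

struct smemcg {
        size_t charge;          /* pages charged once, to the smemcg itself */
        size_t limit;           /* per-smemcg limit */
};

struct borrow {                 /* one counter per <smemcg, memcg> pair */
        struct smemcg *s;
        size_t borrowed;        /* how much of s this memcg maps (or pins) */
};

/* 1) allocate/charge: raises the smemcg charge, checked against its limit */
static int smemcg_charge(struct smemcg *s, size_t n)
{
        if (s->charge + n > s->limit)
                return -ENOMEM;
        s->charge += n;
        return 0;
}

/* 2) free/uncharge: the reverse */
static void smemcg_uncharge(struct smemcg *s, size_t n)
{
        s->charge -= n;
}

/* 3) share/mmap (or pin): the smemcg charge is unchanged, only the
 *    <smemcg, memcg> borrow counter of the mapping memcg grows */
static void smemcg_borrow(struct borrow *b, size_t n)
{
        b->borrowed += n;
}

/* 4) unshare/munmap (or unpin): drop the borrow counter again */
static void smemcg_unborrow(struct borrow *b, size_t n)
{
        b->borrowed -= n;
}

int main(void)
{
        struct smemcg s = { .limit = 100 };
        struct borrow a = { .s = &s }, b = { .s = &s };

        if (smemcg_charge(&s, 80))      /* A allocates 80 shared pages */
                return 1;
        smemcg_borrow(&a, 80);          /* ...and maps them */
        smemcg_borrow(&b, 30);          /* B maps 30 of them: no new charge */
        printf("charge=%zu borrow(A)=%zu borrow(B)=%zu\n",
               s.charge, a.borrowed, b.borrowed);
        smemcg_unborrow(&b, 30);
        smemcg_unborrow(&a, 80);
        smemcg_uncharge(&s, 80);
        return 0;
}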

Will that work for your pin memory usage?

> 
> > Shared Memory Cgroup Controllers
> >
> > = Introduction
> >
> > The current memory cgroup controller does not support shared memory
> > objects. For the memory that is shared between different processes, it
> > is not obvious which process should get charged. Google has some
> > internal tmpfs “memcg=” mount option to charge tmpfs data to a
> > specific memcg that’s often different from where charging processes
> > run. However it faces some difficulties when the charged memcg exits
> > and the charged memcg becomes a zombie memcg.
> > Other approaches include “re-parenting” the memcg charge to the parent
> > memcg. Which has its own problem. If the charge is huge, iteration of
> > the reparenting can be costly.
> >
> > = Proposed Solution
> >
> > The proposed solution is to add a new type of memory controller for
> > shared memory usage. E.g. tmpfs, hugetlb, file system mmap and
> > dma_buf. This shared memory cgroup controller object will have the
> > same life cycle of the underlying shared memory.
> >
> > Processes can not be added to the shared memory cgroup. Instead the
> > shared memory cgroup can be added to the memcg using a “smemcg” API
> > file, similar to adding a process into the “tasks” API file.
> > When a smemcg is added to the memcg, the amount of memory that has
> > been shared in the memcg process will be accounted for as the part of
> > the memcg “memory.current”.The memory.current of the memcg is make up
> > of two parts, 1) the processes anonymous memory and 2) the memory
> > shared from smemcg.
> >
> > When the memcg “memory.current” is raised to the limit. The kernel
> > will active try to reclaim for the memcg to make “smemcg memory +
> > process anonymous memory” within the limit.
> 
> That means a process in one cgroup could force reclaim of smemcg memory
> in use by a process in another cgroup right? I guess that's no different
> to the current situation though.
> 
> > Further memory allocation
> > within those memcg processes will fail if the limit can not be
> > followed. If many reclaim attempts fail to bring the memcg
> > “memory.current” within the limit, the process in this memcg will get
> > OOM killed.
> 
> How would this work if say a charge for cgroup A to a smemcg in both
> cgroup A and B would cause cgroup B to go over its memory limit and not
> enough memory could be reclaimed from cgroup B? OOM killing a process in
> cgroup B due to a charge from cgroup A doesn't sound like a good idea.

If we separate the charge counter from the borrow counter, that problem
is solved. When the smemcg is added to memcg A, we can have a policy
specifying that the <smemcg, memcg A> borrow counter is added into memcg
A's "memory.current".

If B did not map that page, that page will not be part of the
<smemcg, memcg B> borrow count, and B will not be punished.

However, if B did map that page, the <smemcg, memcg B> borrow counter
needs to increase as well, and B will be punished for it.

Will that work for your example situation?
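
Here is a minimal toy model of that policy, purely for illustration (the
structures and numbers are made up): each memcg's "memory.current"
includes its own <smemcg, memcg> borrow counter, so B only starts paying
once it actually maps the pages.

/*
 * Userspace toy model only; structures and numbers are invented.
 */
#include <stddef.h>
#include <stdio.h>

struct memcg {
        const char *name;
        size_t anon;            /* the memcg's own anonymous memory */
        size_t borrowed;        /* its <smemcg, memcg> borrow counter */
};

/* Policy: the borrow counter is accounted in this memcg's memory.current */
static size_t memory_current(const struct memcg *m)
{
        return m->anon + m->borrowed;
}

int main(void)
{
        struct memcg a = { "A", 10, 0 }, b = { "B", 20, 0 };

        a.borrowed += 80;       /* A allocates and maps 80 shared pages */
        printf("A=%zu B=%zu\n", memory_current(&a), memory_current(&b));
        b.borrowed += 80;       /* only when B maps them does B pay as well */
        printf("A=%zu B=%zu\n", memory_current(&a), memory_current(&b));
        return 0;
}

Before B maps the pages, only A's memory.current (90) reflects them; after
B maps them, B's memory.current grows to 100 as well, so pressure or OOM in
B can only come from B's own mappings.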

> > = Benefits
> >
> > The benefits of this solution include:
> > * No zombie memcg. The life cycle of the smemcg match the share memory file system or dma_buf.
> 
> If we added pinning it could get a bit messier, as it would have to hang
> around until the driver unpinned the pages. But I don't think that's a
> problem.


That is exactly the reason pinned memory can belong to a pin smemcg. You
just need to model the driver holding the pin refcount as one of the
share/mmap operations. Then the pin smemcg will not go away while there is
a pending pin refcount on it.

We can have different policy options on the smemcg.
For simple usage that doesn't care about the per-memcg borrow counter, it
can just add the smemcg's charge count to "memory.current".

Only users who care about the per-memcg usage of a smemcg need to maintain
the per <smemcg, memcg> borrow counter, at additional cost.

Chris


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-04 17:02     ` Shakeel Butt
  0 siblings, 0 replies; 51+ messages in thread
From: Shakeel Butt @ 2023-05-04 17:02 UTC (permalink / raw)
  To: Chris Li
  Cc: T.J. Mercier, lsf-pc, linux-mm, cgroups, Yosry Ahmed, Tejun Heo,
	Muchun Song, Johannes Weiner, Roman Gushchin, Alistair Popple,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao

On Wed, May 3, 2023 at 3:15 PM Chris Li <chrisl@kernel.org> wrote:
[...]
> I am also interested in this topic. T.J. and I have some offline
> discussion about this. We have some proposals to solve this
> problem.
>
> I will share the write up here for the up coming LSF/MM discussion.
>
>
> Shared Memory Cgroup Controllers
>
> = Introduction
>
> The current memory cgroup controller does not support shared memory objects. For the memory that is shared between different processes, it is not obvious which process should get charged. Google has some internal tmpfs “memcg=” mount option to charge tmpfs data to  a specific memcg that’s often different from where charging processes run. However it faces some difficulties when the charged memcg exits and the charged memcg becomes a zombie memcg.

What is the exact problem this proposal is solving? Is it the zombie
memcgs? To me that is just a side effect of memory shared between
different memcgs.

> Other approaches include “re-parenting” the memcg charge to the parent memcg. Which has its own problem. If the charge is huge, iteration of the reparenting can be costly.

What is the iteration of the reparenting? Are you referring to
reparenting the LRUs or something else?

>
> = Proposed Solution
>
> The proposed solution is to add a new type of memory controller for shared memory usage. E.g. tmpfs, hugetlb, file system mmap and dma_buf. This shared memory cgroup controller object will have the same life cycle of the underlying  shared memory.

I am confused by the relationship between the shared memory controller and
the underlying shared memory. What does the same life cycle mean? Are
the users expected to register the shared memory objects with the
smemcg? What about unnamed shared memory objects like MAP_SHARED or
memfds?

How does the charging work for smemcg? Is this new controller hierarchical?

>
> Processes can not be added to the shared memory cgroup. Instead the shared memory cgroup can be added to the memcg using a “smemcg” API file, similar to adding a process into the “tasks” API file.

Does the charge of the underlying shared memory live with the smemcg or
with the memcg where the smemcg is attached? Can a smemcg detach and
reattach to a different memcg?

> When a smemcg is added to the memcg, the amount of memory that has been shared in the memcg process will be accounted for as the part of the memcg “memory.current”.The memory.current of the memcg is make up of two parts, 1) the processes anonymous memory and 2) the memory shared from smemcg.

The above is somewhat giving the impression that the charge of shared
memory lives with smemcg. This can mess up or complicate the
hierarchical property of the original memcg.

>
> When the memcg “memory.current” is raised to the limit. The kernel will active try to reclaim for the memcg to make “smemcg memory + process anonymous memory” within the limit. Further memory allocation within those memcg processes will fail if the limit can not be followed. If many reclaim attempts fail to bring the memcg “memory.current” within the limit, the process in this memcg will get OOM killed.

The OOM killing for remote charging needs much more thought. Please
see https://lwn.net/Articles/787626/ for previous discussion on
related topic.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-04 17:36       ` Chris Li
  0 siblings, 0 replies; 51+ messages in thread
From: Chris Li @ 2023-05-04 17:36 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: T.J. Mercier, lsf-pc, linux-mm, cgroups, Yosry Ahmed, Tejun Heo,
	Muchun Song, Johannes Weiner, Roman Gushchin, Alistair Popple,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao

On Thu, May 04, 2023 at 10:02:38AM -0700, Shakeel Butt wrote:
> On Wed, May 3, 2023 at 3:15 PM Chris Li <chrisl@kernel.org> wrote:
> [...]
> > I am also interested in this topic. T.J. and I have some offline
> > discussion about this. We have some proposals to solve this
> > problem.
> >
> > I will share the write up here for the up coming LSF/MM discussion.
> >
> >
> > Shared Memory Cgroup Controllers
> >
> > = Introduction
> >
> > The current memory cgroup controller does not support shared memory objects. For the memory that is shared between different processes, it is not obvious which process should get charged. Google has some internal tmpfs “memcg=” mount option to charge tmpfs data to  a specific memcg that’s often different from where charging processes run. However it faces some difficulties when the charged memcg exits and the charged memcg becomes a zombie memcg.
> 
> What is the exact problem this proposal is solving? Is it the zombie
> memcgs? To me that is just a side effect of memory shared between
> different memcgs.

I am trying to get rid of zombie memcgs by using shared memory controllers.
That means also finding an alternative solution for the "memcg=" usage.
> 
> > Other approaches include “re-parenting” the memcg charge to the parent memcg. Which has its own problem. If the charge is huge, iteration of the reparenting can be costly.
> 
> What is the iteration of the reparenting? Are you referring to
> reparenting the LRUs or something else?

Yes, reparenting the LRU. As Yu pointed out to me offline, that LRU
iteration is on an offline memcg, so it shouldn't block anything
major.

Still, I think the smemcg offers more than recharging when it comes to
modeling the sharing relationship.

> >
> > = Proposed Solution
> >
> > The proposed solution is to add a new type of memory controller for shared memory usage. E.g. tmpfs, hugetlb, file system mmap and dma_buf. This shared memory cgroup controller object will have the same life cycle of the underlying  shared memory.
> 
> I am confused by the relationship between shared memory controller and
> the underlying shared memory. What does the same life cycle mean? Are

Same life cycle means that if the smemcg comes from a tmpfs, the smemcg has
the same life cycle as the tmpfs.

> the users expected to register the shared memory objects with the
> smemcg? What about unnamed shared memory objects like MAP_SHARED or


The user doesn't need to register shared memory objects. However, the file
system code might need to. I count that as the kernel, not the user.

The cgroup admin adds the smemcg into the memcg that shares it. We could
also add a policy option for the kernel to automatically add the smemcg to
the memcgs that share from it.

> memfds?

Each file system mount will have its own smemcg.

> 
> How does the charging work for smemcg? Is this new controller hierarchical?

A charge only happens once, on the smemcg.

Please see this thread for the distinction between the charge counter and the borrow counter:
https://lore.kernel.org/linux-mm/CALvZod4=+ANT6UR5h7Cp+0hKkVx6tPAaRa5iqBF=L2VBdMKERQ@mail.gmail.com/T/#m955cab80f70097d7c9a5be21c19c4851170fa052
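
To make the distinction concrete, here is a minimal userspace C sketch of
the bookkeeping I have in mind (all names are illustrative, nothing here is
an existing kernel interface): the smemcg carries a single charge counter
for the backing memory, and each <smemcg, memcg> pair carries a borrow
counter for what that memcg currently maps from it.

  /* Illustrative model only; not kernel code. */
  #include <errno.h>
  #include <stdio.h>

  struct smemcg {
      long charge;    /* charged once, when the shared memory is allocated */
      long limit;
  };

  struct borrow {     /* one instance per <smemcg, memcg> pair */
      struct smemcg *s;
      long borrowed;  /* how much of s this memcg currently maps */
  };

  /* allocate/charge: counted once, on the smemcg */
  static int smemcg_charge(struct smemcg *s, long nr)
  {
      if (s->charge + nr > s->limit)
          return -ENOMEM;
      s->charge += nr;
      return 0;
  }

  static void smemcg_uncharge(struct smemcg *s, long nr) { s->charge -= nr; }

  /* share/mmap already charged memory: only the borrow counter moves */
  static void smemcg_map(struct borrow *b, long nr)   { b->borrowed += nr; }
  static void smemcg_unmap(struct borrow *b, long nr) { b->borrowed -= nr; }

  int main(void)
  {
      struct smemcg s = { .charge = 0, .limit = 100 };
      struct borrow a = { .s = &s, .borrowed = 0 };
      struct borrow b = { .s = &s, .borrowed = 0 };

      smemcg_charge(&s, 60);  /* e.g. a tmpfs allocates 60 pages, charged once */
      smemcg_map(&a, 60);     /* memcg A maps them */
      smemcg_map(&b, 10);     /* memcg B maps a subset; no second charge */
      printf("charge=%ld A.borrow=%ld B.borrow=%ld\n",
             s.charge, a.borrowed, b.borrowed);

      smemcg_unmap(&b, 10);
      smemcg_unmap(&a, 60);
      smemcg_uncharge(&s, 60);
      return 0;
  }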

I haven't given much thought to the hierarchical issue yet.
My initial thinking is that the smemcg side is not hierarchical,
while the memcg side is.

Do you have specific examples where we can discuss hierarchical usage?

> > Processes can not be added to the shared memory cgroup. Instead the shared memory cgroup can be added to the memcg using a “smemcg” API file, similar to adding a process into the “tasks” API file.
> 
> Does the charge of the underlying shared memory live with the smemcg or the
> memcg where the smemcg is attached? Can a smemcg detach and reattach to a

The charge lives with the smemcg.

> different memcg?

A smemcg can be added to more than one memcg without being detached.
Please see the email above regarding borrow vs. charge.

> 
> > When a smemcg is added to the memcg, the amount of memory that has been shared in the memcg process will be accounted for as the part of the memcg “memory.current”.The memory.current of the memcg is make up of two parts, 1) the processes anonymous memory and 2) the memory shared from smemcg.
> 
> The above is somewhat giving the impression that the charge of shared
> memory lives with smemcg. This can mess up or complicate the
> hierarchical property of the original memcg.


I haven't given a lot of thought to that. Can you share an example of
how things can get messed up? I will see if I can use the smemcg model to
address it.

> > When the memcg “memory.current” is raised to the limit. The kernel will active try to reclaim for the memcg to make “smemcg memory + process anonymous memory” within the limit. Further memory allocation within those memcg processes will fail if the limit can not be followed. If many reclaim attempts fail to bring the memcg “memory.current” within the limit, the process in this memcg will get OOM killed.
> 
> The OOM killing for remote charging needs much more thought. Please
> see https://lwn.net/Articles/787626/ for previous discussion on
> related topic.

Yes, I just took a look. I think the new idea here is borrow vs. charge.

Let's come up with some detailed examples and try to break the smemcg
borrow model.

Chris


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-05 13:53         ` Alistair Popple
  0 siblings, 0 replies; 51+ messages in thread
From: Alistair Popple @ 2023-05-05 13:53 UTC (permalink / raw)
  To: Chris Li
  Cc: T.J. Mercier, lsf-pc, linux-mm, cgroups, Yosry Ahmed, Tejun Heo,
	Shakeel Butt, Muchun Song, Johannes Weiner, Roman Gushchin,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao


Chris Li <chrisl@kernel.org> writes:

> On Thu, May 04, 2023 at 09:58:56PM +1000, Alistair Popple wrote:
>> 
>> Chris Li <chrisl@kernel.org> writes:
>> 
>> > Hi T.J.,
>> >
>> > On Tue, Apr 11, 2023 at 04:36:37PM -0700, T.J. Mercier wrote:
>> >> When a memcg is removed by userspace it gets offlined by the kernel.
>> >> Offline memcgs are hidden from user space, but they still live in the
>> >> kernel until their reference count drops to 0. New allocations cannot
>> >> be charged to offline memcgs, but existing allocations charged to
>> >> offline memcgs remain charged, and hold a reference to the memcg.
>> >> 
>> >> As such, an offline memcg can remain in the kernel indefinitely,
>> >> becoming a zombie memcg. The accumulation of a large number of zombie
>> >> memcgs lead to increased system overhead (mainly percpu data in struct
>> >> mem_cgroup). It also causes some kernel operations that scale with the
>> >> number of memcgs to become less efficient (e.g. reclaim).
>> >> 
>> >> There are currently out-of-tree solutions which attempt to
>> >> periodically clean up zombie memcgs by reclaiming from them. However
>> >> that is not effective for non-reclaimable memory, which it would be
>> >> better to reparent or recharge to an online cgroup. There are also
>> >> proposed changes that would benefit from recharging for shared
>> >> resources like pinned pages, or DMA buffer pages.
>> >
>> > I am also interested in this topic. T.J. and I have some offline
>> > discussion about this. We have some proposals to solve this
>> > problem.
>> >
>> > I will share the write up here for the up coming LSF/MM discussion.
>> 
>> Unfortunately I won't be attending LSF/MM in person this year but I am
>
> Will you be able to join virtually?

I should be able to join afternoon sessions virtually.

>> interested in this topic as well from the point of view of limiting
>> pinned pages with cgroups. I am hoping to revive this patch series soon:
>
>> 
>> https://lore.kernel.org/linux-mm/cover.c238416f0e82377b449846dbb2459ae9d7030c8e.1675669136.git-series.apopple@nvidia.com/
>> 
>> The main problem with this series was getting agreement on whether to
>> add pinning as a separate cgroup (which is what the series currently
>> does) or whether it should be part of a per-page memcg limit.
>> 
>> The issue with per-page memcg limits is what to do for shared
>> mappings. The below suggestion sounds promising because the pins for
>> shared pages could be charged to the smemcg. However I'm not sure how it
>> would solve the problem of a process in cgroup A being able to raise the
>> pin count of cgroup B when pinning a smemcg page which was one of the
>> reason I introduced a new cgroup controller.
>
> Now that I think of it, I can see the pin count memcg as a subtype of
> smemcg.
>
> The smemcg can have a limit as well, when it add to a memcg, the operation
> raise the pin count smemcg charge over the smemcg limit will fail.

I'm not sure that works for the pinned scenario. If a smemcg already has
pinned pages adding it to another memcg shouldn't raise the pin count of
the memcg it's being added to. The pin counts should only be raised in
memcg's of processes actually requesting the page be pinned. See below
though, the idea of borrowing seems helpful.

So for pinning at least I don't see a per smemcg limit being useful.

> For the detail tracking of shared/unshared behavior, the smemcg can model it
> as a step 2 feature.
>
> There are four different kind of operation can perform on a smemcg:
>
> 1) allocate/charge memory. The charge will add on the per smemcg charge
> counter, check against the per smemcg limit. ENOMEM if it is over the limit.
>
> 2) free/uncharge memory. Similar to above just subtract the counter.
>
> 3) share/mmap already charged memory. This will not change the smemcg charge
> count, it will add to a per <smemcg, memcg> borrow counter. It is possible to
> put a limit on that counter as well, even though I haven't given too much thought
> of how useful it is. That will limit how much memory can mapped from the smemcg.

I would like to see the idea of a borrow counter fleshed out some more
but this sounds like it could work for the pinning scenario.

Pinning could be charged to the per <smemcg, memcg> borrow counter and
the pin limit would be enforced against that plus the anonymous pins.

Implementation wise we'd need a way to lookup both the smemcg of the
struct page and the memcg that the pinning task belongs to.
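
A rough sketch of how that lookup and charge could fit together, as plain C
with invented names (pin_charge(), pin_uncharge() and the structs below are
not real kernel interfaces): the pin path resolves the borrow counter for
the <smemcg of the page, memcg of the pinning task> pair and enforces the
memcg's pin limit against that counter plus its anonymous pins.

  /* Illustrative model only; every name here is made up for the sketch. */
  #include <stdio.h>

  struct pin_borrow {       /* pin charges for one <smemcg, memcg> pair */
      long pinned;
  };

  struct memcg_pins {       /* the memcg side used for the limit check */
      long anon_pinned;     /* pins charged directly to this memcg */
      long pin_limit;
  };

  /* Charge a pin on behalf of the pinning task's memcg. */
  static int pin_charge(struct memcg_pins *m, struct pin_borrow *pb, long nr)
  {
      if (m->anon_pinned + pb->pinned + nr > m->pin_limit)
          return -1;        /* over this memcg's pin limit */
      pb->pinned += nr;
      return 0;
  }

  /* Uncharge when the driver drops its pin. */
  static void pin_uncharge(struct pin_borrow *pb, long nr) { pb->pinned -= nr; }

  int main(void)
  {
      struct memcg_pins a = { .anon_pinned = 8, .pin_limit = 32 };
      struct pin_borrow shared = { .pinned = 0 };

      printf("first pin:  %d\n", pin_charge(&a, &shared, 16)); /* fits */
      printf("second pin: %d\n", pin_charge(&a, &shared, 16)); /* over limit */
      pin_uncharge(&shared, 16);
      return 0;
  }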

> 4) unshare/unmmap already charged memory. That will reduce the per <smemcg, memcg>
> borrow counter.

Actually this is where things might get a bit tricky for pinning. We'd
have to reduce the pin charge when a driver calls put_page(). But that
implies looking up the borrow counter / <smemcg, memcg> pair a driver
charged the page to.

I will have to give this idea some more thought though. Most drivers
don't store anything other than the struct page pointers, but my series
added an accounting struct which I think could reference the borrow
counter.

> Will that work for your pin memory usage?

I think it could help. I will give it some thought.

>> 
>> > Shared Memory Cgroup Controllers
>> >
>> > = Introduction
>> >
>> > The current memory cgroup controller does not support shared memory
>> > objects. For the memory that is shared between different processes, it
>> > is not obvious which process should get charged. Google has some
>> > internal tmpfs “memcg=” mount option to charge tmpfs data to a
>> > specific memcg that’s often different from where charging processes
>> > run. However it faces some difficulties when the charged memcg exits
>> > and the charged memcg becomes a zombie memcg.
>> > Other approaches include “re-parenting” the memcg charge to the parent
>> > memcg. Which has its own problem. If the charge is huge, iteration of
>> > the reparenting can be costly.
>> >
>> > = Proposed Solution
>> >
>> > The proposed solution is to add a new type of memory controller for
>> > shared memory usage. E.g. tmpfs, hugetlb, file system mmap and
>> > dma_buf. This shared memory cgroup controller object will have the
>> > same life cycle of the underlying shared memory.
>> >
>> > Processes can not be added to the shared memory cgroup. Instead the
>> > shared memory cgroup can be added to the memcg using a “smemcg” API
>> > file, similar to adding a process into the “tasks” API file.
>> > When a smemcg is added to the memcg, the amount of memory that has
>> > been shared in the memcg process will be accounted for as the part of
>> > the memcg “memory.current”.The memory.current of the memcg is make up
>> > of two parts, 1) the processes anonymous memory and 2) the memory
>> > shared from smemcg.
>> >
>> > When the memcg “memory.current” is raised to the limit. The kernel
>> > will active try to reclaim for the memcg to make “smemcg memory +
>> > process anonymous memory” within the limit.
>> 
>> That means a process in one cgroup could force reclaim of smemcg memory
>> in use by a process in another cgroup right? I guess that's no different
>> to the current situation though.
>> 
>> > Further memory allocation
>> > within those memcg processes will fail if the limit can not be
>> > followed. If many reclaim attempts fail to bring the memcg
>> > “memory.current” within the limit, the process in this memcg will get
>> > OOM killed.
>> 
>> How would this work if say a charge for cgroup A to a smemcg in both
>> cgroup A and B would cause cgroup B to go over its memory limit and not
>> enough memory could be reclaimed from cgroup B? OOM killing a process in
>> cgroup B due to a charge from cgroup A doesn't sound like a good idea.
>
> If we separate out the charge counter with the borrow counter, that problem
> will be solved. When smemcg is add to memcg A, we can have a policy specific
> that adding the <smemcg, memcg A> borrow counter into memcg A's "memory.current".
>
> If B did not map that page, that page will not be part of <smemcg, memcg B>
> borrow count. B will not be punished.
>
> However if B did map that page, The <smemcg, memcg B> need to increase as well.
> B will be punished for it.
>
> Will that work for your example situation?

I think so, although I have been looking at this more from the point of
view of pinning. It sounds like we could treat pinning in much the same
way as mapping though.
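
To put made-up numbers on the policy quoted above: say the smemcg has 100
pages charged, memcg A maps all 100 of them and memcg B maps 20. Then A's
memory.current would count its anonymous pages plus a 100-page <smemcg, A>
borrow, B's would count its anonymous pages plus a 20-page <smemcg, B>
borrow, and the smemcg's own charge counter stays at 100 no matter how many
memcgs map it. If B maps nothing, its borrow counter is 0 and B is never
punished for A's usage.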

>> > = Benefits
>> >
>> > The benefits of this solution include:
>> > * No zombie memcg. The life cycle of the smemcg match the share memory file system or dma_buf.
>> 
>> If we added pinning it could get a bit messier, as it would have to hang
>> around until the driver unpinned the pages. But I don't think that's a
>> problem.
>
>
> That is exactly the reason pin memory can belong to a pin smemcg. You just need
> to model the driver holding the pin ref count as one of the share/mmap operation.
>
> Then the pin smemcg will not go away if there is a pending pin ref count on it.
>
> We have have different policy option on smemcg.
> For the simple usage don't care the per memcg borrow counter, it can add the
> smemcg's charge count to "memory.current".
>
> Only the user who cares about per memcg usage of a smemcg will need to maintain
> per <smemcg, memcg> borrow counter, at additional cost.

Right, I think pinning drivers will always have to care about the borrow
counter so will have to track that.

> Chris


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-06 22:49           ` Chris Li
  0 siblings, 0 replies; 51+ messages in thread
From: Chris Li @ 2023-05-06 22:49 UTC (permalink / raw)
  To: Alistair Popple
  Cc: T.J. Mercier, lsf-pc, linux-mm, cgroups, Yosry Ahmed, Tejun Heo,
	Shakeel Butt, Muchun Song, Johannes Weiner, Roman Gushchin,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao

On Fri, May 05, 2023 at 11:53:24PM +1000, Alistair Popple wrote:
> 
> >> Unfortunately I won't be attending LSF/MM in person this year but I am
> >
> > Will you be able to join virtually?
> 
> I should be able to join afternoon sessions virtually.

Great.

> >> The issue with per-page memcg limits is what to do for shared
> >> mappings. The below suggestion sounds promising because the pins for
> >> shared pages could be charged to the smemcg. However I'm not sure how it
> >> would solve the problem of a process in cgroup A being able to raise the
> >> pin count of cgroup B when pinning a smemcg page which was one of the
> >> reason I introduced a new cgroup controller.
> >
> > Now that I think of it, I can see the pin count memcg as a subtype of
> > smemcg.
> >
> > The smemcg can have a limit as well, when it add to a memcg, the operation
> > raise the pin count smemcg charge over the smemcg limit will fail.
> 
> I'm not sure that works for the pinned scenario. If a smemcg already has
> pinned pages adding it to another memcg shouldn't raise the pin count of
> the memcg it's being added to. The pin counts should only be raised in
> memcg's of processes actually requesting the page be pinned. See below
> though, the idea of borrowing seems helpful.

I am very interested in letting smemcg support your pin usage case.
I read your patch thread a bit but I still feel a bit fuzzy about
the pin usage workflow.

If you have a more detailed write-up of the usage case, with a sequence
of interactions and the desired outcome, that would help me understand. Links to
the previous threads work too.

We can set up some meetings to discuss it as well.

> So for pinning at least I don't see a per smemcg limit being useful.

That is fine.  I see you are interested in the <smemcg, memcg> limit.

> > For the detail tracking of shared/unshared behavior, the smemcg can model it
> > as a step 2 feature.
> >
> > There are four different kind of operation can perform on a smemcg:
> >
> > 1) allocate/charge memory. The charge will add on the per smemcg charge
> > counter, check against the per smemcg limit. ENOMEM if it is over the limit.
> >
> > 2) free/uncharge memory. Similar to above just subtract the counter.
> >
> > 3) share/mmap already charged memory. This will not change the smemcg charge
> > count, it will add to a per <smemcg, memcg> borrow counter. It is possible to
> > put a limit on that counter as well, even though I haven't given too much thought
> > of how useful it is. That will limit how much memory can mapped from the smemcg.
> 
> I would like to see the idea of a borrow counter fleshed out some more
> but this sounds like it could work for the pinning scenario.
> 
> Pinning could be charged to the per <smemcg, memcg> borrow counter and
> the pin limit would be enforced against that plus the anonymous pins.
> 
> Implementation wise we'd need a way to lookup both the smemcg of the
> struct page and the memcg that the pinning task belongs to.

The page->memcg_data points to the pin smemcg. I am hoping pinning API or
the current memcg can get to the pinning memcg.

> > 4) unshare/unmmap already charged memory. That will reduce the per <smemcg, memcg>
> > borrow counter.
> 
> Actually this is where things might get a bit tricky for pinning. We'd
> have to reduce the pin charge when a driver calls put_page(). But that
> implies looking up the borrow counter / <smemcg, memcg> pair a driver
> charged the page to.

Is the pinned page shared between different memcgs or just one memcg?

If it is shared, can the put_page() API indicate which memcg it is
performing on behalf of?

> I will have to give this idea some more thought though. Most drivers
> don't store anything other than the struct page pointers, but my series
> added an accounting struct which I think could reference the borrow
> counter.

Ack.

> 
> > Will that work for your pin memory usage?
> 
> I think it could help. I will give it some thought.

Ack.
> 
> >> 
> >> > Shared Memory Cgroup Controllers
> >> >
> >> > = Introduction
> >> >
> >> > The current memory cgroup controller does not support shared memory
> >> > objects. For the memory that is shared between different processes, it
> >> > is not obvious which process should get charged. Google has some
> >> > internal tmpfs “memcg=” mount option to charge tmpfs data to a
> >> > specific memcg that’s often different from where charging processes
> >> > run. However it faces some difficulties when the charged memcg exits
> >> > and the charged memcg becomes a zombie memcg.
> >> > Other approaches include “re-parenting” the memcg charge to the parent
> >> > memcg. Which has its own problem. If the charge is huge, iteration of
> >> > the reparenting can be costly.
> >> >
> >> > = Proposed Solution
> >> >
> >> > The proposed solution is to add a new type of memory controller for
> >> > shared memory usage. E.g. tmpfs, hugetlb, file system mmap and
> >> > dma_buf. This shared memory cgroup controller object will have the
> >> > same life cycle of the underlying shared memory.
> >> >
> >> > Processes can not be added to the shared memory cgroup. Instead the
> >> > shared memory cgroup can be added to the memcg using a “smemcg” API
> >> > file, similar to adding a process into the “tasks” API file.
> >> > When a smemcg is added to the memcg, the amount of memory that has
> >> > been shared in the memcg process will be accounted for as the part of
> >> > the memcg “memory.current”.The memory.current of the memcg is make up
> >> > of two parts, 1) the processes anonymous memory and 2) the memory
> >> > shared from smemcg.
> >> >
> >> > When the memcg “memory.current” is raised to the limit. The kernel
> >> > will active try to reclaim for the memcg to make “smemcg memory +
> >> > process anonymous memory” within the limit.
> >> 
> >> That means a process in one cgroup could force reclaim of smemcg memory
> >> in use by a process in another cgroup right? I guess that's no different
> >> to the current situation though.
> >> 
> >> > Further memory allocation
> >> > within those memcg processes will fail if the limit can not be
> >> > followed. If many reclaim attempts fail to bring the memcg
> >> > “memory.current” within the limit, the process in this memcg will get
> >> > OOM killed.
> >> 
> >> How would this work if say a charge for cgroup A to a smemcg in both
> >> cgroup A and B would cause cgroup B to go over its memory limit and not
> >> enough memory could be reclaimed from cgroup B? OOM killing a process in
> >> cgroup B due to a charge from cgroup A doesn't sound like a good idea.
> >
> > If we separate out the charge counter with the borrow counter, that problem
> > will be solved. When smemcg is add to memcg A, we can have a policy specific
> > that adding the <smemcg, memcg A> borrow counter into memcg A's "memory.current".
> >
> > If B did not map that page, that page will not be part of <smemcg, memcg B>
> > borrow count. B will not be punished.
> >
> > However if B did map that page, The <smemcg, memcg B> need to increase as well.
> > B will be punished for it.
> >
> > Will that work for your example situation?
> 
> I think so, although I have been looking at this more from the point of
> view of pinning. It sounds like we could treat pinning in much the same
> way as mapping though.

Ack.
> 
> >> > = Benefits
> >> >
> >> > The benefits of this solution include:
> >> > * No zombie memcg. The life cycle of the smemcg match the share memory file system or dma_buf.
> >> 
> >> If we added pinning it could get a bit messier, as it would have to hang
> >> around until the driver unpinned the pages. But I don't think that's a
> >> problem.
> >
> >
> > That is exactly the reason pin memory can belong to a pin smemcg. You just need
> > to model the driver holding the pin ref count as one of the share/mmap operation.
> >
> > Then the pin smemcg will not go away if there is a pending pin ref count on it.
> >
> > We have have different policy option on smemcg.
> > For the simple usage don't care the per memcg borrow counter, it can add the
> > smemcg's charge count to "memory.current".
> >
> > Only the user who cares about per memcg usage of a smemcg will need to maintain
> > per <smemcg, memcg> borrow counter, at additional cost.
> 
> Right, I think pinning drivers will always have to care about the borrow
> counter so will have to track that.

Ack.

Chris



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-08  8:17             ` Alistair Popple
  0 siblings, 0 replies; 51+ messages in thread
From: Alistair Popple @ 2023-05-08  8:17 UTC (permalink / raw)
  To: Chris Li
  Cc: T.J. Mercier, lsf-pc, linux-mm, cgroups, Yosry Ahmed, Tejun Heo,
	Shakeel Butt, Muchun Song, Johannes Weiner, Roman Gushchin,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao


Chris Li <chrisl@kernel.org> writes:

> On Fri, May 05, 2023 at 11:53:24PM +1000, Alistair Popple wrote:
>> 
>> >> Unfortunately I won't be attending LSF/MM in person this year but I am
>> >
>> > Will you be able to join virtually?
>> 
>> I should be able to join afternoon sessions virtually.
>
> Great.

Actually, I don't have an invite so I might not make it. However, I believe
Jason Gunthorpe is there and he has been helping with this as well, so he
might be able to attend the session. Or we could discuss it in one of
the Linux MM biweeklies. (TBH I've been busy on other work and am only
just getting back up to speed on this myself).

>> >> The issue with per-page memcg limits is what to do for shared
>> >> mappings. The below suggestion sounds promising because the pins for
>> >> shared pages could be charged to the smemcg. However I'm not sure how it
>> >> would solve the problem of a process in cgroup A being able to raise the
>> >> pin count of cgroup B when pinning a smemcg page which was one of the
>> >> reason I introduced a new cgroup controller.
>> >
>> > Now that I think of it, I can see the pin count memcg as a subtype of
>> > smemcg.
>> >
>> > The smemcg can have a limit as well, when it add to a memcg, the operation
>> > raise the pin count smemcg charge over the smemcg limit will fail.
>> 
>> I'm not sure that works for the pinned scenario. If a smemcg already has
>> pinned pages adding it to another memcg shouldn't raise the pin count of
>> the memcg it's being added to. The pin counts should only be raised in
>> memcg's of processes actually requesting the page be pinned. See below
>> though, the idea of borrowing seems helpful.
>
> I am very interested in letting smemcg support your pin usage case.
> I read your patch thread a bit but I still feel a bit fuzzy about
> the pin usage workflow.

Thanks!

> If you have a more detailed write-up of the usage case, with a sequence
> of interactions and the desired outcome, that would help me understand. Links to
> the previous threads work too.

Unfortunately I don't (yet) have a detailed write-up. But the primary
use case we had in mind was sandboxing of containers and qemu instances
so we can limit the total amount of memory they can pin.

This is just like the existing RLIMIT, just with the limit assigned via
a cgroup instead of a per-process or per-user limit, as those can easily
be subverted, particularly the per-process limit.

The sandboxing requirement is what drove us to not use the existing
per-page memcg. For private mappings it would be fine because the page
would only ever be mapped by a process in a single cgroup (i.e. a single
memcg).

However, for shared mappings it isn't: processes in different cgroups
could be mapping the same page, but the accounting should happen in the
cgroup the pinning process is in, not the cgroup that happens to "own" the page.

It is also possible that the page might not be mapped at all. For
example a driver may be pinning a page with pin_user_pages(), but the
userspace process may have munmap()ped it.

For drivers, pinned memory is generally associated with some operation on
a file-descriptor. So the desired interaction/outcome is (a rough sketch
in code follows the list):

1. A process in cgroup A opens a file-descriptor.
2. It calls an ioctl() on the FD to pin memory.
3. The driver charges the memory to a counter and checks that it is under
   the limit.
4. If over the limit, the ioctl() fails.
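
Roughly, in code (a userspace model of the steps above; the ioctl handler
and counters are invented for illustration, none of this is an existing
API):

  /* Illustrative model of steps 1-4 above. */
  #include <errno.h>
  #include <stdio.h>

  struct pin_counter {
      long pinned;
      long limit;   /* limit assigned via the cgroup rather than RLIMIT */
  };

  /* Step 3: the driver charges before pinning and checks the limit. */
  static int driver_pin_ioctl(struct pin_counter *cg, long nr_pages)
  {
      if (cg->pinned + nr_pages > cg->limit)
          return -ENOMEM;       /* step 4: the ioctl() fails */
      cg->pinned += nr_pages;
      /* ... pin_user_pages() and remember what was charged ... */
      return 0;
  }

  int main(void)
  {
      struct pin_counter cgroup_a = { .pinned = 0, .limit = 256 };

      /* Steps 1-2: a process in cgroup A opens an FD and asks to pin. */
      printf("pin 200 pages: %d\n", driver_pin_ioctl(&cgroup_a, 200));
      printf("pin 100 pages: %d\n", driver_pin_ioctl(&cgroup_a, 100));
      return 0;
  }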

This is effectively how the vm_pinned/locked RLIMIT works today. Even if
a shared page is already pinned, the process should still be "punished"
for pinning the page. Hence the interest in the total <smemcg, memcg>
limit.
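
(To make that concrete with made-up numbers: if processes in cgroups A and
B each pin the same 10 shared pages through their own FDs, both the
<smemcg, A> and <smemcg, B> pin counts go up by 10 and count against each
cgroup's limit, even though they are the same underlying pages.)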

Pinned memory may also outlive the process that created it: drivers
associate it with a file-descriptor, not a process, and even if the FD is
closed there's nothing that says a driver has to unpin the memory then
(although most do).

> We can set up some meetings to discuss it as well.
>
>> So for pinning at least I don't see a per smemcg limit being useful.
>
> That is fine.  I see you are interested in the <smemcg, memcg> limit.

Right, because it sounds like it will allow multiple pins of the same
struct page to result in multiple charges. With the current memcg
implementation this isn't possible because a page can only be associated
with a single memcg.

A limit on just the smemcg doesn't seem useful from a sandboxing
perspective because processes from other cgroups can use up the limit.

>> > For the detail tracking of shared/unshared behavior, the smemcg can model it
>> > as a step 2 feature.
>> >
>> > There are four different kind of operation can perform on a smemcg:
>> >
>> > 1) allocate/charge memory. The charge will add on the per smemcg charge
>> > counter, check against the per smemcg limit. ENOMEM if it is over the limit.
>> >
>> > 2) free/uncharge memory. Similar to above just subtract the counter.
>> >
>> > 3) share/mmap already charged memory. This will not change the smemcg charge
>> > count, it will add to a per <smemcg, memcg> borrow counter. It is possible to
>> > put a limit on that counter as well, even though I haven't given too much thought
>> > of how useful it is. That will limit how much memory can mapped from the smemcg.
>> 
>> I would like to see the idea of a borrow counter fleshed out some more
>> but this sounds like it could work for the pinning scenario.
>> 
>> Pinning could be charged to the per <smemcg, memcg> borrow counter and
>> the pin limit would be enforced against that plus the anonymous pins.
>> 
>> Implementation wise we'd need a way to lookup both the smemcg of the
>> struct page and the memcg that the pinning task belongs to.
>
> The page->memcg_data points to the pin smemcg. I am hoping pinning API or
> the current memcg can get to the pinning memcg.

So the memcg to charge would come from the process doing the
pin_user_pages() rather than, say, page->memcg_data? Seems reasonable.

>> > 4) unshare/unmmap already charged memory. That will reduce the per <smemcg, memcg>
>> > borrow counter.
>> 
>> Actually this is where things might get a bit tricky for pinning. We'd
>> have to reduce the pin charge when a driver calls put_page(). But that
>> implies looking up the borrow counter / <smemcg, memcg> pair a driver
>> charged the page to.
>
> Is the pinned page shared between different memcgs or just one memcg?

In general it can be shared between different memcgs. Consider a shared
mapping shared with processes in two different cgroups (A and B). There
is nothing stopping each process from opening a file-descriptor and calling
an ioctl() to pin the shared page.

Each should be punished for pinning the page, in the sense that the pin
count for their respective cgroups must go up.

Drivers pinning shared pages is, I think, relatively rare, but it's
theoretically possible, and if we're going down the path of adding
limits for pinning to memcg it's something we need to deal with to make
sandboxing effective.

> If it is shared, can the put_page() API indicate which memcg it is
> performing on behalf of?

I think so - although it varies by driver.

Drivers have to store the array of pinned pages, so they should be able to
track the memcg with that as well. My series added a struct vm_account,
which would be the obvious place to keep that reference. Each set of pin
operations on an FD would need a new memcg reference though, so it would
add overhead for drivers that only pin a small number of pages at a
time.
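
For illustration, something along these lines (this is not the actual
vm_account layout from the series; the names are made up):

  /* Hypothetical driver-side record for one set of pins. */
  struct page;              /* kernel types, forward-declared for the sketch */
  struct mem_cgroup;
  struct smem_cgroup;       /* stand-in for whatever the smemcg object becomes */

  struct pin_account {
      struct page **pages;          /* the pinned pages */
      unsigned long nr_pages;
      struct mem_cgroup *memcg;     /* memcg charged at pin time */
      struct smem_cgroup *smemcg;   /* identifies the <smemcg, memcg> borrow counter */
  };

  /*
   * With the memcg reference kept here, the unpin path can find the same
   * <smemcg, memcg> counter it charged, instead of guessing from current.
   */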

Non-driver users such as the mlock() syscall don't keep a pinned-pages
array around, but they should be able to use the current memcg during
munlock().

>> I will have to give this idea some more thought though. Most drivers
>> don't store anything other than the struct page pointers, but my series
>> added an accounting struct which I think could reference the borrow
>> counter.
>
> Ack.
>
>> 
>> > Will that work for your pin memory usage?
>> 
>> I think it could help. I will give it some thought.
>
> Ack.
>> 
>> >> 
>> >> > Shared Memory Cgroup Controllers
>> >> >
>> >> > = Introduction
>> >> >
>> >> > The current memory cgroup controller does not support shared memory
>> >> > objects. For the memory that is shared between different processes, it
>> >> > is not obvious which process should get charged. Google has some
>> >> > internal tmpfs “memcg=” mount option to charge tmpfs data to a
>> >> > specific memcg that’s often different from where charging processes
>> >> > run. However it faces some difficulties when the charged memcg exits
>> >> > and the charged memcg becomes a zombie memcg.
>> >> > Other approaches include “re-parenting” the memcg charge to the parent
>> >> > memcg. Which has its own problem. If the charge is huge, iteration of
>> >> > the reparenting can be costly.
>> >> >
>> >> > = Proposed Solution
>> >> >
>> >> > The proposed solution is to add a new type of memory controller for
>> >> > shared memory usage. E.g. tmpfs, hugetlb, file system mmap and
>> >> > dma_buf. This shared memory cgroup controller object will have the
>> >> > same life cycle of the underlying shared memory.
>> >> >
>> >> > Processes can not be added to the shared memory cgroup. Instead the
>> >> > shared memory cgroup can be added to the memcg using a “smemcg” API
>> >> > file, similar to adding a process into the “tasks” API file.
>> >> > When a smemcg is added to the memcg, the amount of memory that has
>> >> > been shared in the memcg process will be accounted for as the part of
>> >> > the memcg “memory.current”.The memory.current of the memcg is make up
>> >> > of two parts, 1) the processes anonymous memory and 2) the memory
>> >> > shared from smemcg.
>> >> >
>> >> > When the memcg “memory.current” is raised to the limit. The kernel
>> >> > will active try to reclaim for the memcg to make “smemcg memory +
>> >> > process anonymous memory” within the limit.
>> >> 
>> >> That means a process in one cgroup could force reclaim of smemcg memory
>> >> in use by a process in another cgroup right? I guess that's no different
>> >> to the current situation though.
>> >> 
>> >> > Further memory allocation
>> >> > within those memcg processes will fail if the limit can not be
>> >> > followed. If many reclaim attempts fail to bring the memcg
>> >> > “memory.current” within the limit, the process in this memcg will get
>> >> > OOM killed.
>> >> 
>> >> How would this work if say a charge for cgroup A to a smemcg in both
>> >> cgroup A and B would cause cgroup B to go over its memory limit and not
>> >> enough memory could be reclaimed from cgroup B? OOM killing a process in
>> >> cgroup B due to a charge from cgroup A doesn't sound like a good idea.
>> >
>> > If we separate out the charge counter with the borrow counter, that problem
>> > will be solved. When smemcg is add to memcg A, we can have a policy specific
>> > that adding the <smemcg, memcg A> borrow counter into memcg A's "memory.current".
>> >
>> > If B did not map that page, that page will not be part of <smemcg, memcg B>
>> > borrow count. B will not be punished.
>> >
>> > However if B did map that page, The <smemcg, memcg B> need to increase as well.
>> > B will be punished for it.
>> >
>> > Will that work for your example situation?
>> 
>> I think so, although I have been looking at this more from the point of
>> view of pinning. It sounds like we could treat pinning in much the same
>> way as mapping though.
>
> Ack.
>> 
>> >> > = Benefits
>> >> >
>> >> > The benefits of this solution include:
>> >> > * No zombie memcg. The life cycle of the smemcg match the share memory file system or dma_buf.
>> >> 
>> >> If we added pinning it could get a bit messier, as it would have to hang
>> >> around until the driver unpinned the pages. But I don't think that's a
>> >> problem.
>> >
>> >
>> > That is exactly the reason pin memory can belong to a pin smemcg. You just need
>> > to model the driver holding the pin ref count as one of the share/mmap operation.
>> >
>> > Then the pin smemcg will not go away if there is a pending pin ref count on it.
>> >
>> > We have have different policy option on smemcg.
>> > For the simple usage don't care the per memcg borrow counter, it can add the
>> > smemcg's charge count to "memory.current".
>> >
>> > Only the user who cares about per memcg usage of a smemcg will need to maintain
>> > per <smemcg, memcg> borrow counter, at additional cost.
>> 
>> Right, I think pinning drivers will always have to care about the borrow
>> counter so will have to track that.
>
> Ack.
>
> Chris



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-08  8:17             ` Alistair Popple
  0 siblings, 0 replies; 51+ messages in thread
From: Alistair Popple @ 2023-05-08  8:17 UTC (permalink / raw)
  To: Chris Li
  Cc: T.J. Mercier, lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Yosry Ahmed, Tejun Heo, Shakeel Butt, Muchun Song,
	Johannes Weiner, Roman Gushchin, Jason Gunthorpe, Kalesh Singh,
	Yu Zhao


Chris Li <chrisl-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:

> On Fri, May 05, 2023 at 11:53:24PM +1000, Alistair Popple wrote:
>> 
>> >> Unfortunately I won't be attending LSF/MM in person this year but I am
>> >
>> > Will you be able to join virtually?
>> 
>> I should be able to join afternoon sessions virtually.
>
> Great.

Actually I don't have an invite so might not make it. However I believe
Jason Gunthorpe is there and he has been helping with this as well so he
might be able to attend the session. Or we could discuss it in one of
the Linux MM biweeklies. (TBH I've been busy on other work and am only
just getting back up to speed on this myself).

>> >> The issue with per-page memcg limits is what to do for shared
>> >> mappings. The below suggestion sounds promising because the pins for
>> >> shared pages could be charged to the smemcg. However I'm not sure how it
>> >> would solve the problem of a process in cgroup A being able to raise the
>> >> pin count of cgroup B when pinning a smemcg page which was one of the
>> >> reason I introduced a new cgroup controller.
>> >
>> > Now that I think of it, I can see the pin count memcg as a subtype of
>> > smemcg.
>> >
>> > The smemcg can have a limit as well, when it add to a memcg, the operation
>> > raise the pin count smemcg charge over the smemcg limit will fail.
>> 
>> I'm not sure that works for the pinned scenario. If a smemcg already has
>> pinned pages adding it to another memcg shouldn't raise the pin count of
>> the memcg it's being added to. The pin counts should only be raised in
>> memcg's of processes actually requesting the page be pinned. See below
>> though, the idea of borrowing seems helpful.
>
> I am very interested in letting smemcg support your pin usage case.
> I read your patch thread a bit but I still feel a bit fuzzy about
> the pin usage workflow.

Thanks!

> If you have some more detailed write up of the usage case, with a sequence
> of interaction and desired outcome that would help me understand. Links to
> the previous threads work too.

Unfortunately I don't (yet) have a detailed write up. But the primary
use case we had in mind was sandboxing of containers and qemu instances
to be able to limit the total amount of memory they can pin.

This is just like the existing RLIMIT, except with the limit assigned
via a cgroup instead of per process or per user, since those can easily
be subverted, particularly the per-process limit.

The sandboxing requirement is what drove us to not use the existing
per-page memcg. For private mappings it would be fine because the page
would only ever be mapped by a process in a single cgroup (ie. a single
memcg).

However for shared mappings it isn't - processes in different cgroups
could be mapping the same page but the accounting should happen to the
cgroup the process is in, not the cgroup that happens to "own" the page.

It is also possible that the page might not be mapped at all. For
example a driver may be pinning a page with pin_user_pages(), but the
userspace process may have munmap()ped it.

For drivers pinned memory is generally associated with some operation on
a file-descriptor. So the desired interaction/outcome is:

1. Process in cgroup A opens file-descriptor
2. Calls an ioctl() on the FD to pin memory.
3. Driver charges the memory to a counter and checks it's under the
   limit.
4. If over limit the ioctl() will fail.

This is effectively how the vm_pinned/locked RLIMIT works today. Even if
a shared page is already pinned the process should still be "punished"
for pinning the page. Hence the interest in the total <smemcg, memcg>
limit.
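
Concretely, step 3 above is just a try-charge against whatever counter
ends up holding the limit, done before the pages are actually pinned. A
very rough sketch - page_counter is used purely for illustration here,
since where that counter should actually live is exactly what is being
discussed:

#include <linux/page_counter.h>

/*
 * Step 3/4 of the flow above: charge the pin before taking it and fail
 * the ioctl if the configured limit would be exceeded. "pincg_counter"
 * stands in for whatever object ends up carrying the limit.
 */
static int pin_try_charge(struct page_counter *pincg_counter,
                          unsigned long nr_pages)
{
        struct page_counter *fail;

        if (!page_counter_try_charge(pincg_counter, nr_pages, &fail))
                return -ENOMEM;         /* step 4: the ioctl fails */

        return 0;                       /* caller may now pin_user_pages() */
}

static void pin_uncharge(struct page_counter *pincg_counter,
                         unsigned long nr_pages)
{
        page_counter_uncharge(pincg_counter, nr_pages);
}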

Pinned memory may also outlive the process that created it - drivers
associate it with a file-descriptor rather than a process, and even if
the FD is closed there's nothing that says a driver has to unpin the
memory then (although most do).

> We can set up some meetings to discuss it as well.
>
>> So for pinning at least I don't see a per smemcg limit being useful.
>
> That is fine.  I see you are interested in the <smemcg, memcg> limit.

Right, because it sounds like it will allow pinning the same struct page
multiple times to result in multiple charges. With the current memcg
implementation this isn't possible because a page can only be associated
with a single memcg.

A limit on just smemcg doesn't seem useful from a sandboxing perspective
because processes from other cgroups can use up the limit.

>> > For the detail tracking of shared/unshared behavior, the smemcg can model it
>> > as a step 2 feature.
>> >
>> > There are four different kind of operation can perform on a smemcg:
>> >
>> > 1) allocate/charge memory. The charge will add on the per smemcg charge
>> > counter, check against the per smemcg limit. ENOMEM if it is over the limit.
>> >
>> > 2) free/uncharge memory. Similar to above just subtract the counter.
>> >
>> > 3) share/mmap already charged memory. This will not change the smemcg charge
>> > count, it will add to a per <smemcg, memcg> borrow counter. It is possible to
>> > put a limit on that counter as well, even though I haven't given too much thought
>> > of how useful it is. That will limit how much memory can mapped from the smemcg.
>> 
>> I would like to see the idea of a borrow counter fleshed out some more
>> but this sounds like it could work for the pinning scenario.
>> 
>> Pinning could be charged to the per <smemcg, memcg> borrow counter and
>> the pin limit would be enforced against that plus the anonymous pins.
>> 
>> Implementation wise we'd need a way to lookup both the smemcg of the
>> struct page and the memcg that the pinning task belongs to.
>
> The page->memcg_data points to the pin smemcg. I am hoping pinning API or
> the current memcg can get to the pinning memcg.

So the memcg to charge would come from the process doing the
pin_user_pages() rather than say page->memcg_data? Seems reasonable.

>> > 4) unshare/unmmap already charged memory. That will reduce the per <smemcg, memcg>
>> > borrow counter.
>> 
>> Actually this is where things might get a bit tricky for pinning. We'd
>> have to reduce the pin charge when a driver calls put_page(). But that
>> implies looking up the borrow counter / <smemcg, memcg> pair a driver
>> charged the page to.
>
> Does the pin page share between different memcg or just one memcg?

In general it can be shared between different memcgs. Consider a
mapping shared between processes in two different cgroups (A and B).
There is nothing stopping each process from opening a file-descriptor
and calling an ioctl() to pin the shared page.

Each should be punished for pinning the page in the sense that the pin
count for their respective cgroups must go up.

Drivers pinning shared pages is, I think, relatively rare, but it's
theoretically possible, and if we're going down the path of adding
limits for pinning to memcg it's something we need to deal with to make
sandboxing effective.

> If it is shared, can the put_page() API indicate it is performing in behalf
> of which memcg?

I think so - although it varies by driver.

Drivers have to store the array of pinned pages, so they should be able
to track the memcg with that as well. My series added a struct
vm_account which would be the obvious place to keep that reference. Each
set of pin operations on a FD would need a new memcg reference though,
so it would add overhead for drivers that only pin a small number of
pages at a time.

Non-driver users such as the mlock() syscall don't keep a pinned pages
array around but they should be able to use the current memcg during
munlock().

>> I will have to give this idea some more tought though. Most drivers
>> don't store anything other than the struct page pointers, but my series
>> added an accounting struct which I think could reference the borrow
>> counter.
>
> Ack.
>
>> 
>> > Will that work for your pin memory usage?
>> 
>> I think it could help. I will give it some thought.
>
> Ack.
>> 
>> >> 
>> >> > Shared Memory Cgroup Controllers
>> >> >
>> >> > = Introduction
>> >> >
>> >> > The current memory cgroup controller does not support shared memory
>> >> > objects. For the memory that is shared between different processes, it
>> >> > is not obvious which process should get charged. Google has some
>> >> > internal tmpfs “memcg=” mount option to charge tmpfs data to a
>> >> > specific memcg that’s often different from where charging processes
>> >> > run. However it faces some difficulties when the charged memcg exits
>> >> > and the charged memcg becomes a zombie memcg.
>> >> > Other approaches include “re-parenting” the memcg charge to the parent
>> >> > memcg. Which has its own problem. If the charge is huge, iteration of
>> >> > the reparenting can be costly.
>> >> >
>> >> > = Proposed Solution
>> >> >
>> >> > The proposed solution is to add a new type of memory controller for
>> >> > shared memory usage. E.g. tmpfs, hugetlb, file system mmap and
>> >> > dma_buf. This shared memory cgroup controller object will have the
>> >> > same life cycle of the underlying shared memory.
>> >> >
>> >> > Processes can not be added to the shared memory cgroup. Instead the
>> >> > shared memory cgroup can be added to the memcg using a “smemcg” API
>> >> > file, similar to adding a process into the “tasks” API file.
>> >> > When a smemcg is added to the memcg, the amount of memory that has
>> >> > been shared in the memcg process will be accounted for as the part of
>> >> > the memcg “memory.current”.The memory.current of the memcg is make up
>> >> > of two parts, 1) the processes anonymous memory and 2) the memory
>> >> > shared from smemcg.
>> >> >
>> >> > When the memcg “memory.current” is raised to the limit. The kernel
>> >> > will active try to reclaim for the memcg to make “smemcg memory +
>> >> > process anonymous memory” within the limit.
>> >> 
>> >> That means a process in one cgroup could force reclaim of smemcg memory
>> >> in use by a process in another cgroup right? I guess that's no different
>> >> to the current situation though.
>> >> 
>> >> > Further memory allocation
>> >> > within those memcg processes will fail if the limit can not be
>> >> > followed. If many reclaim attempts fail to bring the memcg
>> >> > “memory.current” within the limit, the process in this memcg will get
>> >> > OOM killed.
>> >> 
>> >> How would this work if say a charge for cgroup A to a smemcg in both
>> >> cgroup A and B would cause cgroup B to go over its memory limit and not
>> >> enough memory could be reclaimed from cgroup B? OOM killing a process in
>> >> cgroup B due to a charge from cgroup A doesn't sound like a good idea.
>> >
>> > If we separate out the charge counter with the borrow counter, that problem
>> > will be solved. When smemcg is add to memcg A, we can have a policy specific
>> > that adding the <smemcg, memcg A> borrow counter into memcg A's "memory.current".
>> >
>> > If B did not map that page, that page will not be part of <smemcg, memcg B>
>> > borrow count. B will not be punished.
>> >
>> > However if B did map that page, The <smemcg, memcg B> need to increase as well.
>> > B will be punished for it.
>> >
>> > Will that work for your example situation?
>> 
>> I think so, although I have been looking at this more from the point of
>> view of pinning. It sounds like we could treat pinning in much the same
>> way as mapping though.
>
> Ack.
>> 
>> >> > = Benefits
>> >> >
>> >> > The benefits of this solution include:
>> >> > * No zombie memcg. The life cycle of the smemcg match the share memory file system or dma_buf.
>> >> 
>> >> If we added pinning it could get a bit messier, as it would have to hang
>> >> around until the driver unpinned the pages. But I don't think that's a
>> >> problem.
>> >
>> >
>> > That is exactly the reason pin memory can belong to a pin smemcg. You just need
>> > to model the driver holding the pin ref count as one of the share/mmap operation.
>> >
>> > Then the pin smemcg will not go away if there is a pending pin ref count on it.
>> >
>> > We have have different policy option on smemcg.
>> > For the simple usage don't care the per memcg borrow counter, it can add the
>> > smemcg's charge count to "memory.current".
>> >
>> > Only the user who cares about per memcg usage of a smemcg will need to maintain
>> > per <smemcg, memcg> borrow counter, at additional cost.
>> 
>> Right, I think pinning drivers will always have to care about the borrow
>> counter so will have to track that.
>
> Ack.
>
> Chris


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-10 14:51               ` Chris Li
  0 siblings, 0 replies; 51+ messages in thread
From: Chris Li @ 2023-05-10 14:51 UTC (permalink / raw)
  To: Alistair Popple
  Cc: T.J. Mercier, lsf-pc, linux-mm, cgroups, Yosry Ahmed, Tejun Heo,
	Shakeel Butt, Muchun Song, Johannes Weiner, Roman Gushchin,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao

Hi Alistair,

On Mon, May 08, 2023 at 06:17:04PM +1000, Alistair Popple wrote:
> Actually I don't have an invite so might not make it. However I believe
> Jason Gunthorpe is there and he has been helping with this as well so he
> might be able to attend the session. Or we could discuss it in one of

Yes, I talked to Jason Gunthorpe and asked him about the usage workflow
of the pin memory controller. He tells me that his original intent was
just to have something like RLIMIT but without the quirky limitations
of RLIMIT. It has nothing to do with sharing memory.
Shared memory was only brought up during the online discussion. I guess
shared memory has similar reference counting and faces similar
challenges with double counting.

> the Linux MM biweeklies. (TBH I've been busy on other work and am only
> just getting back up to speed on this myself).
> 
> > If you have some more detailed write up of the usage case, with a sequence
> > of interaction and desired outcome that would help me understand. Links to
> > the previous threads work too.
> 
> Unfortunately I don't (yet) have a detailed write up. But the primary
> use case we had in mind was sandboxing of containers and qemu instances
> to be able to limit the total amount of memory they can pin.

Ack.
 
> This is just like the existing RLIMIT, just with the limit assigned via
> a cgroup instead of a per-process or per-user limit as they can easily
> be subverted, particularly the per-process limit.

Ack.

> The sandboxing requirement is what drove us to not use the existing
> per-page memcg.

Not sure I understand what you mean here by "use the existing per-page
memcg". Do you mean you need a byte counter rather than a page counter?

On the other hand it is not important for understanding the rest of your
points. I can move on.

> For private mappings it would be fine because the page
> would only ever be mapped by a process in a single cgroup (ie. a single
> memcg).

Ack.
 
> However for shared mappings it isn't - processes in different cgroups
> could be mapping the same page but the accounting should happen to the
> cgroup the process is in, not the cgroup that happens to "own" the page.

Ack. That is actually the root of the shared memory problem. The model
of charging the first process does not work well for shared usage.

> It is also possible that the page might not be mapped at all. For
> example a driver may be pinning a page with pin_user_pages(), but the
> userspace process may have munmap()ped it.

Ack.
In that case the driver will need to hold a reference count for it, right?

> 
> For drivers pinned memory is generally associated with some operation on
> a file-descriptor. So the desired interaction/outcome is:
> 
> 1. Process in cgroup A opens file-descriptor
> 2. Calls an ioctl() on the FD to pin memory.
> 3. Driver charges the memory to a counter and checks it's under the
>    limit.
> 4. If over limit the ioctl() will fail.

Ack.

> 
> This is effectively how the vm_pinned/locked RLIMIT works today. Even if
> a shared page is already pinned the process should still be "punished"
> for pinning the page.

OK. This is the critical usage information that I want to know. Thanks
for the explanation.

So there are two different definitions of the pin page count:
1) sum(total set of pages that this memcg's processes issue pin ioctls on)
2) sum(total set of pinned pages this memcg's processes own a reference count on)

It seems you want 2).

If a page has three reference counts inside one memcg, e.g. it is mapped
three times, does it count towards the pin count three times or only once?

> Hence the interest in the total <smemcg, memcg>  limit.
> 
> Pinned memory may also outlive the process that created it - drivers
> associate it via a file-descriptor not a process and even if the FD is
> closed there's nothing say a driver has to unpin the memory then
> (although most do).

Ack.

> 
> > We can set up some meetings to discuss it as well.
> >
> >> So for pinning at least I don't see a per smemcg limit being useful.
> >
> > That is fine.  I see you are interested in the <smemcg, memcg> limit.
> 
> Right, because it sounds like it will allow pinning the same struct page
> multiple times to result in multiple charges. With the current memcg

Multiple times to different memcgs, I assume; please see the above
question regarding multiple times to the same memcg.

> implementation this isn't possible because a page can only be associated
> with a single memcg.

Ack.

> 
> A limit on just smemcg doesn't seem useful from a sandboxing perspective
> because processes from other cgroups can use up the limit.

Ack.

> >> Implementation wise we'd need a way to lookup both the smemcg of the
> >> struct page and the memcg that the pinning task belongs to.
> >
> > The page->memcg_data points to the pin smemcg. I am hoping pinning API or
> > the current memcg can get to the pinning memcg.
> 
> So the memcg to charge would come from the process doing the
> pin_user_pages() rather than say page->memcg_data? Seems reasonable.

That is more of a question for you: what is the desired behavior?
If charging the current process that performs the pin_user_pages()
works for you, great.

I agree that charging the pin count to the current process's memcg
seems to make sense.

> >> > 4) unshare/unmmap already charged memory. That will reduce the per <smemcg, memcg>
> >> > borrow counter.
> >> 
> >> Actually this is where things might get a bit tricky for pinning. We'd
> >> have to reduce the pin charge when a driver calls put_page(). But that
> >> implies looking up the borrow counter / <smemcg, memcg> pair a driver
> >> charged the page to.
> >
> > Does the pin page share between different memcg or just one memcg?
> 
> In general it can share between different memcg. Consider a shared
> mapping shared with processes in two different cgroups (A and B). There
> is nothing stopping each process opening a file-descriptor and calling
> an ioctl() to pin the shared page.

Ack.

> Each should be punished for pinning the page in the sense that the pin
> count for their respective cgroups must go up.

Ack. That clarifies my previous question. You want definition 2).

> Drivers pinning shared pages is I think relatively rare, but it's
> theorectically possible and if we're going down the path of adding
> limits for pinning to memcg it's something we need to deal with to make
> sandboxing effective.

The driver will still have a current process. Do you mean that in this
case the current process is not the right one to charge?
Another option could be to charge a default system/kernel smemcg or a
driver smemcg as well.

> 
> > If it is shared, can the put_page() API indicate it is performing in behalf
> > of which memcg?
> 
> I think so - although it varies by driver.
> 
> Drivers have to store the array of pages pinned so should be able to
> track the memcg with that as well. My series added a struct vm_account
> which would be the obvious place to keep that reference.

Where does the struct vm_account live?

> Each set of pin 
> operations on a FD would need a new memcg reference though so it would
> add overhead for drivers that only pin a small number of pages at a
> time.

Tracking a set is more complicated than allowing double counting of the
same page in the same smemcg. Again, mostly just collecting requirements
from you.

> 
> Non-driver users such as the mlock() syscall don't keep a pinned pages
> array around but they should be able to use the current memcg during
> munlock().

Ack.

Chris

> 
> >> I will have to give this idea some more tought though. Most drivers
> >> don't store anything other than the struct page pointers, but my series
> >> added an accounting struct which I think could reference the borrow
> >> counter.
> >
> > Ack.
> >
> >> 
> >> > Will that work for your pin memory usage?
> >> 
> >> I think it could help. I will give it some thought.
> >
> > Ack.
> >> 
> >> >> 
> >> >> > Shared Memory Cgroup Controllers
> >> >> >
> >> >> > = Introduction
> >> >> >
> >> >> > The current memory cgroup controller does not support shared memory
> >> >> > objects. For the memory that is shared between different processes, it
> >> >> > is not obvious which process should get charged. Google has some
> >> >> > internal tmpfs “memcg=” mount option to charge tmpfs data to a
> >> >> > specific memcg that’s often different from where charging processes
> >> >> > run. However it faces some difficulties when the charged memcg exits
> >> >> > and the charged memcg becomes a zombie memcg.
> >> >> > Other approaches include “re-parenting” the memcg charge to the parent
> >> >> > memcg. Which has its own problem. If the charge is huge, iteration of
> >> >> > the reparenting can be costly.
> >> >> >
> >> >> > = Proposed Solution
> >> >> >
> >> >> > The proposed solution is to add a new type of memory controller for
> >> >> > shared memory usage. E.g. tmpfs, hugetlb, file system mmap and
> >> >> > dma_buf. This shared memory cgroup controller object will have the
> >> >> > same life cycle of the underlying shared memory.
> >> >> >
> >> >> > Processes can not be added to the shared memory cgroup. Instead the
> >> >> > shared memory cgroup can be added to the memcg using a “smemcg” API
> >> >> > file, similar to adding a process into the “tasks” API file.
> >> >> > When a smemcg is added to the memcg, the amount of memory that has
> >> >> > been shared in the memcg process will be accounted for as the part of
> >> >> > the memcg “memory.current”.The memory.current of the memcg is make up
> >> >> > of two parts, 1) the processes anonymous memory and 2) the memory
> >> >> > shared from smemcg.
> >> >> >
> >> >> > When the memcg “memory.current” is raised to the limit. The kernel
> >> >> > will active try to reclaim for the memcg to make “smemcg memory +
> >> >> > process anonymous memory” within the limit.
> >> >> 
> >> >> That means a process in one cgroup could force reclaim of smemcg memory
> >> >> in use by a process in another cgroup right? I guess that's no different
> >> >> to the current situation though.
> >> >> 
> >> >> > Further memory allocation
> >> >> > within those memcg processes will fail if the limit can not be
> >> >> > followed. If many reclaim attempts fail to bring the memcg
> >> >> > “memory.current” within the limit, the process in this memcg will get
> >> >> > OOM killed.
> >> >> 
> >> >> How would this work if say a charge for cgroup A to a smemcg in both
> >> >> cgroup A and B would cause cgroup B to go over its memory limit and not
> >> >> enough memory could be reclaimed from cgroup B? OOM killing a process in
> >> >> cgroup B due to a charge from cgroup A doesn't sound like a good idea.
> >> >
> >> > If we separate out the charge counter with the borrow counter, that problem
> >> > will be solved. When smemcg is add to memcg A, we can have a policy specific
> >> > that adding the <smemcg, memcg A> borrow counter into memcg A's "memory.current".
> >> >
> >> > If B did not map that page, that page will not be part of <smemcg, memcg B>
> >> > borrow count. B will not be punished.
> >> >
> >> > However if B did map that page, The <smemcg, memcg B> need to increase as well.
> >> > B will be punished for it.
> >> >
> >> > Will that work for your example situation?
> >> 
> >> I think so, although I have been looking at this more from the point of
> >> view of pinning. It sounds like we could treat pinning in much the same
> >> way as mapping though.
> >
> > Ack.
> >> 
> >> >> > = Benefits
> >> >> >
> >> >> > The benefits of this solution include:
> >> >> > * No zombie memcg. The life cycle of the smemcg match the share memory file system or dma_buf.
> >> >> 
> >> >> If we added pinning it could get a bit messier, as it would have to hang
> >> >> around until the driver unpinned the pages. But I don't think that's a
> >> >> problem.
> >> >
> >> >
> >> > That is exactly the reason pin memory can belong to a pin smemcg. You just need
> >> > to model the driver holding the pin ref count as one of the share/mmap operation.
> >> >
> >> > Then the pin smemcg will not go away if there is a pending pin ref count on it.
> >> >
> >> > We have have different policy option on smemcg.
> >> > For the simple usage don't care the per memcg borrow counter, it can add the
> >> > smemcg's charge count to "memory.current".
> >> >
> >> > Only the user who cares about per memcg usage of a smemcg will need to maintain
> >> > per <smemcg, memcg> borrow counter, at additional cost.
> >> 
> >> Right, I think pinning drivers will always have to care about the borrow
> >> counter so will have to track that.
> >
> > Ack.
> >
> > Chris
> 
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-10 14:51               ` Chris Li
  0 siblings, 0 replies; 51+ messages in thread
From: Chris Li @ 2023-05-10 14:51 UTC (permalink / raw)
  To: Alistair Popple
  Cc: T.J. Mercier, lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Yosry Ahmed, Tejun Heo, Shakeel Butt, Muchun Song,
	Johannes Weiner, Roman Gushchin, Jason Gunthorpe, Kalesh Singh,
	Yu Zhao

Hi Alistair,

On Mon, May 08, 2023 at 06:17:04PM +1000, Alistair Popple wrote:
> Actually I don't have an invite so might not make it. However I believe
> Jason Gunthorpe is there and he has been helping with this as well so he
> might be able to attend the session. Or we could discuss it in one of

Yes, I talked to Jason Gunthorpe and asked him about the usage workflow
of the pin memory controller. He tells me that his original intent was
just to have something like RLIMIT but without the quirky limitations
of RLIMIT. It has nothing to do with sharing memory.
Shared memory was only brought up during the online discussion. I guess
shared memory has similar reference counting and faces similar
challenges with double counting.

> the Linux MM biweeklies. (TBH I've been busy on other work and am only
> just getting back up to speed on this myself).
> 
> > If you have some more detailed write up of the usage case, with a sequence
> > of interaction and desired outcome that would help me understand. Links to
> > the previous threads work too.
> 
> Unfortunately I don't (yet) have a detailed write up. But the primary
> use case we had in mind was sandboxing of containers and qemu instances
> to be able to limit the total amount of memory they can pin.

Ack.
 
> This is just like the existing RLIMIT, just with the limit assigned via
> a cgroup instead of a per-process or per-user limit as they can easily
> be subverted, particularly the per-process limit.

Ack.

> The sandboxing requirement is what drove us to not use the existing
> per-page memcg.

Not sure I understand what you mean here by "use the existing per-page
memcg". Do you mean you need a byte counter rather than a page counter?

On the other hand it is not important for understanding the rest of your
points. I can move on.

> For private mappings it would be fine because the page
> would only ever be mapped by a process in a single cgroup (ie. a single
> memcg).

Ack.
 
> However for shared mappings it isn't - processes in different cgroups
> could be mapping the same page but the accounting should happen to the
> cgroup the process is in, not the cgroup that happens to "own" the page.

Ack. That is actually the root of the shared memory problem. The model
of charging the first process does not work well for shared usage.

> It is also possible that the page might not be mapped at all. For
> example a driver may be pinning a page with pin_user_pages(), but the
> userspace process may have munmap()ped it.

Ack.
In that case the driver will need to hold a reference count for it, right?

> 
> For drivers pinned memory is generally associated with some operation on
> a file-descriptor. So the desired interaction/outcome is:
> 
> 1. Process in cgroup A opens file-descriptor
> 2. Calls an ioctl() on the FD to pin memory.
> 3. Driver charges the memory to a counter and checks it's under the
>    limit.
> 4. If over limit the ioctl() will fail.

Ack.

> 
> This is effectively how the vm_pinned/locked RLIMIT works today. Even if
> a shared page is already pinned the process should still be "punished"
> for pinning the page.

OK. This is the critical usage information that I want to know. Thanks
for the explanation.

So there are two different definitions of the pin page count:
1) sum(total set of pages that this memcg's processes issue pin ioctls on)
2) sum(total set of pinned pages this memcg's processes own a reference count on)

It seems you want 2).

If a page has three reference counts inside one memcg, e.g. it is mapped
three times, does it count towards the pin count three times or only once?

> Hence the interest in the total <smemcg, memcg>  limit.
> 
> Pinned memory may also outlive the process that created it - drivers
> associate it via a file-descriptor not a process and even if the FD is
> closed there's nothing say a driver has to unpin the memory then
> (although most do).

Ack.

> 
> > We can set up some meetings to discuss it as well.
> >
> >> So for pinning at least I don't see a per smemcg limit being useful.
> >
> > That is fine.  I see you are interested in the <smemcg, memcg> limit.
> 
> Right, because it sounds like it will allow pinning the same struct page
> multiple times to result in multiple charges. With the current memcg

Multiple times to different memcgs, I assume; please see the above
question regarding multiple times to the same memcg.

> implementation this isn't possible because a page can only be associated
> with a single memcg.

Ack.

> 
> A limit on just smemcg doesn't seem useful from a sandboxing perspective
> because processes from other cgroups can use up the limit.

Ack.

> >> Implementation wise we'd need a way to lookup both the smemcg of the
> >> struct page and the memcg that the pinning task belongs to.
> >
> > The page->memcg_data points to the pin smemcg. I am hoping pinning API or
> > the current memcg can get to the pinning memcg.
> 
> So the memcg to charge would come from the process doing the
> pin_user_pages() rather than say page->memcg_data? Seems reasonable.

That is more of a question for you: what is the desired behavior?
If charging the current process that performs the pin_user_pages()
works for you, great.

I agree that charging the pin count to the current process's memcg
seems to make sense.

> >> > 4) unshare/unmmap already charged memory. That will reduce the per <smemcg, memcg>
> >> > borrow counter.
> >> 
> >> Actually this is where things might get a bit tricky for pinning. We'd
> >> have to reduce the pin charge when a driver calls put_page(). But that
> >> implies looking up the borrow counter / <smemcg, memcg> pair a driver
> >> charged the page to.
> >
> > Does the pin page share between different memcg or just one memcg?
> 
> In general it can share between different memcg. Consider a shared
> mapping shared with processes in two different cgroups (A and B). There
> is nothing stopping each process opening a file-descriptor and calling
> an ioctl() to pin the shared page.

Ack.

> Each should be punished for pinning the page in the sense that the pin
> count for their respective cgroups must go up.

Ack. That clarifies my previous question. You want definition 2).

> Drivers pinning shared pages is I think relatively rare, but it's
> theorectically possible and if we're going down the path of adding
> limits for pinning to memcg it's something we need to deal with to make
> sandboxing effective.

The driver will still have a current process. Do you mean that in this
case the current process is not the right one to charge?
Another option could be to charge a default system/kernel smemcg or a
driver smemcg as well.

> 
> > If it is shared, can the put_page() API indicate it is performing in behalf
> > of which memcg?
> 
> I think so - although it varies by driver.
> 
> Drivers have to store the array of pages pinned so should be able to
> track the memcg with that as well. My series added a struct vm_account
> which would be the obvious place to keep that reference.

Where does the struct vm_account live?

> Each set of pin 
> operations on a FD would need a new memcg reference though so it would
> add overhead for drivers that only pin a small number of pages at a
> time.

Tracking a set is more complicated than allowing double counting of the
same page in the same smemcg. Again, mostly just collecting requirements
from you.

> 
> Non-driver users such as the mlock() syscall don't keep a pinned pages
> array around but they should be able to use the current memcg during
> munlock().

Ack.

Chris

> 
> >> I will have to give this idea some more tought though. Most drivers
> >> don't store anything other than the struct page pointers, but my series
> >> added an accounting struct which I think could reference the borrow
> >> counter.
> >
> > Ack.
> >
> >> 
> >> > Will that work for your pin memory usage?
> >> 
> >> I think it could help. I will give it some thought.
> >
> > Ack.
> >> 
> >> >> 
> >> >> > Shared Memory Cgroup Controllers
> >> >> >
> >> >> > = Introduction
> >> >> >
> >> >> > The current memory cgroup controller does not support shared memory
> >> >> > objects. For the memory that is shared between different processes, it
> >> >> > is not obvious which process should get charged. Google has some
> >> >> > internal tmpfs “memcg=” mount option to charge tmpfs data to a
> >> >> > specific memcg that’s often different from where charging processes
> >> >> > run. However it faces some difficulties when the charged memcg exits
> >> >> > and the charged memcg becomes a zombie memcg.
> >> >> > Other approaches include “re-parenting” the memcg charge to the parent
> >> >> > memcg. Which has its own problem. If the charge is huge, iteration of
> >> >> > the reparenting can be costly.
> >> >> >
> >> >> > = Proposed Solution
> >> >> >
> >> >> > The proposed solution is to add a new type of memory controller for
> >> >> > shared memory usage. E.g. tmpfs, hugetlb, file system mmap and
> >> >> > dma_buf. This shared memory cgroup controller object will have the
> >> >> > same life cycle of the underlying shared memory.
> >> >> >
> >> >> > Processes can not be added to the shared memory cgroup. Instead the
> >> >> > shared memory cgroup can be added to the memcg using a “smemcg” API
> >> >> > file, similar to adding a process into the “tasks” API file.
> >> >> > When a smemcg is added to the memcg, the amount of memory that has
> >> >> > been shared in the memcg process will be accounted for as the part of
> >> >> > the memcg “memory.current”.The memory.current of the memcg is make up
> >> >> > of two parts, 1) the processes anonymous memory and 2) the memory
> >> >> > shared from smemcg.
> >> >> >
> >> >> > When the memcg “memory.current” is raised to the limit. The kernel
> >> >> > will active try to reclaim for the memcg to make “smemcg memory +
> >> >> > process anonymous memory” within the limit.
> >> >> 
> >> >> That means a process in one cgroup could force reclaim of smemcg memory
> >> >> in use by a process in another cgroup right? I guess that's no different
> >> >> to the current situation though.
> >> >> 
> >> >> > Further memory allocation
> >> >> > within those memcg processes will fail if the limit can not be
> >> >> > followed. If many reclaim attempts fail to bring the memcg
> >> >> > “memory.current” within the limit, the process in this memcg will get
> >> >> > OOM killed.
> >> >> 
> >> >> How would this work if say a charge for cgroup A to a smemcg in both
> >> >> cgroup A and B would cause cgroup B to go over its memory limit and not
> >> >> enough memory could be reclaimed from cgroup B? OOM killing a process in
> >> >> cgroup B due to a charge from cgroup A doesn't sound like a good idea.
> >> >
> >> > If we separate out the charge counter with the borrow counter, that problem
> >> > will be solved. When smemcg is add to memcg A, we can have a policy specific
> >> > that adding the <smemcg, memcg A> borrow counter into memcg A's "memory.current".
> >> >
> >> > If B did not map that page, that page will not be part of <smemcg, memcg B>
> >> > borrow count. B will not be punished.
> >> >
> >> > However if B did map that page, The <smemcg, memcg B> need to increase as well.
> >> > B will be punished for it.
> >> >
> >> > Will that work for your example situation?
> >> 
> >> I think so, although I have been looking at this more from the point of
> >> view of pinning. It sounds like we could treat pinning in much the same
> >> way as mapping though.
> >
> > Ack.
> >> 
> >> >> > = Benefits
> >> >> >
> >> >> > The benefits of this solution include:
> >> >> > * No zombie memcg. The life cycle of the smemcg match the share memory file system or dma_buf.
> >> >> 
> >> >> If we added pinning it could get a bit messier, as it would have to hang
> >> >> around until the driver unpinned the pages. But I don't think that's a
> >> >> problem.
> >> >
> >> >
> >> > That is exactly the reason pin memory can belong to a pin smemcg. You just need
> >> > to model the driver holding the pin ref count as one of the share/mmap operation.
> >> >
> >> > Then the pin smemcg will not go away if there is a pending pin ref count on it.
> >> >
> >> > We have have different policy option on smemcg.
> >> > For the simple usage don't care the per memcg borrow counter, it can add the
> >> > smemcg's charge count to "memory.current".
> >> >
> >> > Only the user who cares about per memcg usage of a smemcg will need to maintain
> >> > per <smemcg, memcg> borrow counter, at additional cost.
> >> 
> >> Right, I think pinning drivers will always have to care about the borrow
> >> counter so will have to track that.
> >
> > Ack.
> >
> > Chris
> 
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
  2023-04-11 23:36 ` T.J. Mercier
                   ` (2 preceding siblings ...)
  (?)
@ 2023-05-12  3:08 ` Yosry Ahmed
  -1 siblings, 0 replies; 51+ messages in thread
From: Yosry Ahmed @ 2023-05-12  3:08 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: lsf-pc, linux-mm, cgroups, Tejun Heo, Shakeel Butt, Muchun Song,
	Johannes Weiner, Roman Gushchin, Alistair Popple,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao

[-- Attachment #1: Type: text/plain, Size: 1723 bytes --]

On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier <tjmercier@google.com> wrote:
>
> When a memcg is removed by userspace it gets offlined by the kernel.
> Offline memcgs are hidden from user space, but they still live in the
> kernel until their reference count drops to 0. New allocations cannot
> be charged to offline memcgs, but existing allocations charged to
> offline memcgs remain charged, and hold a reference to the memcg.
>
> As such, an offline memcg can remain in the kernel indefinitely,
> becoming a zombie memcg. The accumulation of a large number of zombie
> memcgs lead to increased system overhead (mainly percpu data in struct
> mem_cgroup). It also causes some kernel operations that scale with the
> number of memcgs to become less efficient (e.g. reclaim).
>
> There are currently out-of-tree solutions which attempt to
> periodically clean up zombie memcgs by reclaiming from them. However
> that is not effective for non-reclaimable memory, which it would be
> better to reparent or recharge to an online cgroup. There are also
> proposed changes that would benefit from recharging for shared
> resources like pinned pages, or DMA buffer pages.
>
> Suggested attendees:
> Yosry Ahmed <yosryahmed@google.com>
> Yu Zhao <yuzhao@google.com>
> T.J. Mercier <tjmercier@google.com>
> Tejun Heo <tj@kernel.org>
> Shakeel Butt <shakeelb@google.com>
> Muchun Song <muchun.song@linux.dev>
> Johannes Weiner <hannes@cmpxchg.org>
> Roman Gushchin <roman.gushchin@linux.dev>
> Alistair Popple <apopple@nvidia.com>
> Jason Gunthorpe <jgg@nvidia.com>
> Kalesh Singh <kaleshsingh@google.com>

For the record, here are the slides that were presented for this
discussion (attached).

[-- Attachment #2: [LSF_MM_BPF 2023] Reducing Zombie Memcgs.pdf --]
[-- Type: application/pdf, Size: 91559 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-12  8:45                 ` Alistair Popple
  0 siblings, 0 replies; 51+ messages in thread
From: Alistair Popple @ 2023-05-12  8:45 UTC (permalink / raw)
  To: Chris Li
  Cc: T.J. Mercier, lsf-pc, linux-mm, cgroups, Yosry Ahmed, Tejun Heo,
	Shakeel Butt, Muchun Song, Johannes Weiner, Roman Gushchin,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao


Chris Li <chrisl@kernel.org> writes:

> Hi Alistair,
>
> On Mon, May 08, 2023 at 06:17:04PM +1000, Alistair Popple wrote:
>> Actually I don't have an invite so might not make it. However I believe
>> Jason Gunthorpe is there and he has been helping with this as well so he
>> might be able to attend the session. Or we could discuss it in one of
>
> Yes, I talked to Jason Gunthorpe and asked him about the usage workflow
> of the pin memory controller. He tell me that his original intend is
> just to have something like RLIMIT but without the quirky limitation
> of the RLIMIT. It has nothing to do with sharing memory.
> Share memories are only brought up during the online discussion.  I guess
> the share memory has similar reference count and facing similar challenges
> on double counting.

Ok, good. Now I realise perhaps we delved into this discussion without
covering the background.

My original patch series implemented what Jason suggested. That was a
standalone pinscg controller (which perhaps should be implemented as a
misc controller) that behaves the same as RLIMIT does, just charged to a
pinscg rather than a process or user.

However review comments suggested it needed to be added as part of
memcg. As soon as we do that we have to address how we deal with shared
memory. If we stick with the original RLIMIT proposal this discussion
goes away, but based on feedback I think I need to at least investigate
integrating it into memcg to get anything merged.

[...]

>> However for shared mappings it isn't - processes in different cgroups
>> could be mapping the same page but the accounting should happen to the
>> cgroup the process is in, not the cgroup that happens to "own" the page.
>
> Ack. That is actually the root of the share memory problem. The model
> of charging to the first process does not work well for share usage.

Right. The RLIMIT approach avoids the shared memory problem by charging
every process in a pincg for every pin (see below). But I don't disagree
there is appeal to having pinning work in the same way as memcg hence
this discussion.

>> It is also possible that the page might not be mapped at all. For
>> example a driver may be pinning a page with pin_user_pages(), but the
>> userspace process may have munmap()ped it.
>
> Ack.
> In that case the driver will need to hold a reference count for it, right?

Correct. Drivers would normally drop that reference when the FD is
closed but there's nothing that says they have to.
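
For example, a typical driver release() looks something like the sketch
below (pin_demo_ctx is a made-up stand-in for the driver's per-FD state;
this is also where any pin charge would normally be uncharged):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Made-up stand-in for a driver's per-FD pin bookkeeping. */
struct pin_demo_ctx {
        struct page **pages;
        unsigned long npages;
};

/*
 * Typical, but not mandatory, cleanup when the FD goes away: drop the
 * pins that were taken on behalf of this FD.
 */
static int pin_demo_release(struct inode *inode, struct file *file)
{
        struct pin_demo_ctx *ctx = file->private_data;

        unpin_user_pages(ctx->pages, ctx->npages);
        kvfree(ctx->pages);
        kfree(ctx);
        return 0;
}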

[...]

> OK. This is the critical usage information that I want to know. Thanks for
> the explaination.
>
> So there are two different definition of the pin page count:
> 1) sum(total set of pages that this memcg process issue pin ioctl on)
> 2) sum(total set of pined page this memcg process own a reference count on)
>
> It seems you want 2).
>
> If a page has three reference counts inside one memcg, e.g. map three times. 
> Does the pin count three times or only once?

I'm going to be pedantic here because it's important - by "map three
times" I assume you mean "pin three times with an ioctl". For pinning it
doesn't matter if the page is actually mapped or not.

This is basically where we need to get everyone aligned. The RLIMIT
approach currently implemented by my patch series does (2). For example:

1. If a process in a pincg requests (eg. via driver ioctl) to pin a page
   it is charged against the pincg limit and will fail if going over
   limit.

2. If the same process requests another pin (doesn't matter if it's the
   same page or not) it will be charged again and can't go over limit.

3. If another process in the same pincg requests a page (again, doesn't
   matter if it's the same page or not) be pinned it will be charged
   against the limit.

4. If a process not in the pincg pins the same page it will not be
   charged against the pincg limit.

From my perspective I think (1) would be fine (or even preferable) if
and only if the sharing issues can be resolved. In that case it becomes
much easier to explain how to set the limit.

For example it could be set as a percentage of total memory allocated to
the memcg, because all that really matters is the first pin within a
given memcg.

Subsequent pins won't impact system performance or stability because
once the page is pinned once it may as well be pinned a hundred
times. The only reason I didn't take this approach in my series is that
it's currently impossible to figure out what to do in the shared case
because we have no way of mapping pages back to multiple memcgs to see
if they've already been charged to that memcg, so they would have to be
charged to a single memcg which isn't useful.

>> Hence the interest in the total <smemcg, memcg>  limit.
>> 
>> Pinned memory may also outlive the process that created it - drivers
>> associate it via a file-descriptor not a process and even if the FD is
>> closed there's nothing say a driver has to unpin the memory then
>> (although most do).
>
> Ack.
>
>> 
>> > We can set up some meetings to discuss it as well.
>> >
>> >> So for pinning at least I don't see a per smemcg limit being useful.
>> >
>> > That is fine.  I see you are interested in the <smemcg, memcg> limit.
>> 
>> Right, because it sounds like it will allow pinning the same struct page
>> multiple times to result in multiple charges. With the current memcg
>
> Multiple times to different memcgs I assume, please see the above question
> regard multiple times to the same memcg.

Right. If we have a page that is pinned by two processes in different
cgroups, each cgroup should be charged once (and IMHO only once) for it.

In other words if two processes in the same memcg pin the same page that
should only count as a single pin towards that memcg's pin limit, and
the pin would be uncharged when the final pinner unpins the page.
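
To spell the semantics out as a sketch (none of these helpers exist
today - pin_ref_get()/pin_ref_put() stand in for some per <page, memcg>
count along the lines of the borrow counter discussed above, and locking
is ignored):

#include <linux/memcontrol.h>
#include <linux/mm.h>

/*
 * Desired semantics, not an existing API: only the first pin of a page
 * within a memcg charges that memcg, and only the last unpin uncharges
 * it.
 */
static int memcg_pin_page(struct mem_cgroup *memcg, struct page *page)
{
        /* pin_ref_get() returns the new <page, memcg> pin count. */
        if (pin_ref_get(page, memcg) > 1)               /* hypothetical */
                return 0;       /* already charged to this memcg */

        if (!memcg_pin_try_charge(memcg, 1)) {          /* hypothetical */
                pin_ref_put(page, memcg);
                return -ENOMEM;
        }

        return 0;
}

static void memcg_unpin_page(struct mem_cgroup *memcg, struct page *page)
{
        /* pin_ref_put() returns the remaining <page, memcg> pin count. */
        if (pin_ref_put(page, memcg) == 0)              /* hypothetical */
                memcg_pin_uncharge(memcg, 1);           /* hypothetical */
}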

Note this is not what is implemented by the RLIMIT approach, which is
why it conflicts with my answer to the question above describing that
approach.

[...]

>> >> Implementation wise we'd need a way to lookup both the smemcg of the
>> >> struct page and the memcg that the pinning task belongs to.
>> >
>> > The page->memcg_data points to the pin smemcg. I am hoping pinning API or
>> > the current memcg can get to the pinning memcg.
>> 
>> So the memcg to charge would come from the process doing the
>> pin_user_pages() rather than say page->memcg_data? Seems reasonable.
>
> That is more of a question for you. What is the desired behavior.
> If charge the current process that perform the pin_user_pages()
> works for you. Great.

Argh, if only I had the powers and the desire to commit what worked
solely for me :-)

My current RLIMIT implementation works, but it is its own separate
thing, and there was a strong and reasonable preference from maintainers
to have this integrated with memcg.

But that opens up this whole set of problems around sharing, etc. which
we need to solve. We didn't want to invent our own set of rules for
sharing, etc. and changing memcg to support sharing in the way that was
needed seemed like a massive project.

However if we tackle that project and resolve the sharing issues for
memcg then I think adding a pin limit per-memcg shouldn't be so hard.

>> >> > 4) unshare/unmmap already charged memory. That will reduce the per <smemcg, memcg>
>> >> > borrow counter.
>> >> 
>> >> Actually this is where things might get a bit tricky for pinning. We'd
>> >> have to reduce the pin charge when a driver calls put_page(). But that
>> >> implies looking up the borrow counter / <smemcg, memcg> pair a driver
>> >> charged the page to.
>> >
>> > Does the pin page share between different memcg or just one memcg?
>> 
>> In general it can share between different memcg. Consider a shared
>> mapping shared with processes in two different cgroups (A and B). There
>> is nothing stopping each process opening a file-descriptor and calling
>> an ioctl() to pin the shared page.
>
> Ack.
>
>> Each should be punished for pinning the page in the sense that the pin
>> count for their respective cgroups must go up.
>
> Ack. That clarfy my previous question. You want definition 2)

I wouldn't say "want" so much as that seemed the quickest/easiest path
forward without having to fix the problems with support for shared pages
in memcg.

>> Drivers pinning shared pages is I think relatively rare, but it's
>> theorectically possible and if we're going down the path of adding
>> limits for pinning to memcg it's something we need to deal with to make
>> sandboxing effective.
>
> The driver will still have a current processor. Do you mean in this case,
> the current processor is not the right one to charge?
> Another option can be charge to a default system/kernel smemcg or a driver
> smemcg as well.

It's up to the driver to inform us which process should be
charged. Potentially it's not current. But my point here was that
drivers are pretty much always pinning pages in private mappings. That's
why we thought the trade-off of taking a pincg RLIMIT-style approach
that occasionally double charges pages would be fine, because dealing
with pinning a page in a shared mapping was both hard and rare.

But if we're solving the shared memcg problem that might at least change
the "hard" bit of that equation.

>> 
>> > If it is shared, can the put_page() API indicate it is performing in behalf
>> > of which memcg?
>> 
>> I think so - although it varies by driver.
>> 
>> Drivers have to store the array of pages pinned so should be able to
>> track the memcg with that as well. My series added a struct vm_account
>> which would be the obvious place to keep that reference.
>
> Where does the struct vm_account lives?

I am about to send a rebased version of my series, but typically drivers
keep some kind of per-FD context structure and we keep the vm_account
there.

The vm_account holds references to the task_struct/mm_struct/pinscg as
required.

>> Each set of pin 
>> operations on a FD would need a new memcg reference though so it would
>> add overhead for drivers that only pin a small number of pages at a
>> time.
>
> Set is more complicate then allow double counting the same page in the
> same smemcg. Again mostly just collecting requirement from you.

Right. Hence why we just went with double counting. I think it would be
hard to figure out if a particular page is already pinned by a
particular <memcg, smemcg> or not.

Thanks.

 - Alistair

>> 
>> Non-driver users such as the mlock() syscall don't keep a pinned pages
>> array around but they should be able to use the current memcg during
>> munlock().
>
> Ack.
>
> Chris
>
>> 
>> >> I will have to give this idea some more tought though. Most drivers
>> >> don't store anything other than the struct page pointers, but my series
>> >> added an accounting struct which I think could reference the borrow
>> >> counter.
>> >
>> > Ack.
>> >
>> >> 
>> >> > Will that work for your pin memory usage?
>> >> 
>> >> I think it could help. I will give it some thought.
>> >
>> > Ack.
>> >> 
>> >> >> 
>> >> >> > Shared Memory Cgroup Controllers
>> >> >> >
>> >> >> > = Introduction
>> >> >> >
>> >> >> > The current memory cgroup controller does not support shared memory
>> >> >> > objects. For the memory that is shared between different processes, it
>> >> >> > is not obvious which process should get charged. Google has some
>> >> >> > internal tmpfs “memcg=” mount option to charge tmpfs data to a
>> >> >> > specific memcg that’s often different from where charging processes
>> >> >> > run. However it faces some difficulties when the charged memcg exits
>> >> >> > and the charged memcg becomes a zombie memcg.
>> >> >> > Other approaches include “re-parenting” the memcg charge to the parent
>> >> >> > memcg. Which has its own problem. If the charge is huge, iteration of
>> >> >> > the reparenting can be costly.
>> >> >> >
>> >> >> > = Proposed Solution
>> >> >> >
>> >> >> > The proposed solution is to add a new type of memory controller for
>> >> >> > shared memory usage. E.g. tmpfs, hugetlb, file system mmap and
>> >> >> > dma_buf. This shared memory cgroup controller object will have the
>> >> >> > same life cycle of the underlying shared memory.
>> >> >> >
>> >> >> > Processes can not be added to the shared memory cgroup. Instead the
>> >> >> > shared memory cgroup can be added to the memcg using a “smemcg” API
>> >> >> > file, similar to adding a process into the “tasks” API file.
>> >> >> > When a smemcg is added to the memcg, the amount of memory that has
>> >> >> > been shared in the memcg process will be accounted for as the part of
>> >> >> > the memcg “memory.current”.The memory.current of the memcg is make up
>> >> >> > of two parts, 1) the processes anonymous memory and 2) the memory
>> >> >> > shared from smemcg.
>> >> >> >
>> >> >> > When the memcg “memory.current” is raised to the limit. The kernel
>> >> >> > will active try to reclaim for the memcg to make “smemcg memory +
>> >> >> > process anonymous memory” within the limit.
>> >> >> 
>> >> >> That means a process in one cgroup could force reclaim of smemcg memory
>> >> >> in use by a process in another cgroup right? I guess that's no different
>> >> >> to the current situation though.
>> >> >> 
>> >> >> > Further memory allocation
>> >> >> > within those memcg processes will fail if the limit can not be
>> >> >> > followed. If many reclaim attempts fail to bring the memcg
>> >> >> > “memory.current” within the limit, the process in this memcg will get
>> >> >> > OOM killed.
>> >> >> 
>> >> >> How would this work if say a charge for cgroup A to a smemcg in both
>> >> >> cgroup A and B would cause cgroup B to go over its memory limit and not
>> >> >> enough memory could be reclaimed from cgroup B? OOM killing a process in
>> >> >> cgroup B due to a charge from cgroup A doesn't sound like a good idea.
>> >> >
>> >> > If we separate out the charge counter with the borrow counter, that problem
>> >> > will be solved. When smemcg is add to memcg A, we can have a policy specific
>> >> > that adding the <smemcg, memcg A> borrow counter into memcg A's "memory.current".
>> >> >
>> >> > If B did not map that page, that page will not be part of <smemcg, memcg B>
>> >> > borrow count. B will not be punished.
>> >> >
>> >> > However if B did map that page, The <smemcg, memcg B> need to increase as well.
>> >> > B will be punished for it.
>> >> >
>> >> > Will that work for your example situation?
>> >> 
>> >> I think so, although I have been looking at this more from the point of
>> >> view of pinning. It sounds like we could treat pinning in much the same
>> >> way as mapping though.
>> >
>> > Ack.
>> >> 
>> >> >> > = Benefits
>> >> >> >
>> >> >> > The benefits of this solution include:
>> >> >> > * No zombie memcg. The life cycle of the smemcg match the share memory file system or dma_buf.
>> >> >> 
>> >> >> If we added pinning it could get a bit messier, as it would have to hang
>> >> >> around until the driver unpinned the pages. But I don't think that's a
>> >> >> problem.
>> >> >
>> >> >
>> >> > That is exactly the reason pin memory can belong to a pin smemcg. You just need
>> >> > to model the driver holding the pin ref count as one of the share/mmap operation.
>> >> >
>> >> > Then the pin smemcg will not go away if there is a pending pin ref count on it.
>> >> >
>> >> > We have have different policy option on smemcg.
>> >> > For the simple usage don't care the per memcg borrow counter, it can add the
>> >> > smemcg's charge count to "memory.current".
>> >> >
>> >> > Only the user who cares about per memcg usage of a smemcg will need to maintain
>> >> > per <smemcg, memcg> borrow counter, at additional cost.
>> >> 
>> >> Right, I think pinning drivers will always have to care about the borrow
>> >> counter so will have to track that.
>> >
>> > Ack.
>> >
>> > Chris
>> 
>> 



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-12 21:09                   ` Jason Gunthorpe
  0 siblings, 0 replies; 51+ messages in thread
From: Jason Gunthorpe @ 2023-05-12 21:09 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Chris Li, T.J. Mercier, lsf-pc, linux-mm, cgroups, Yosry Ahmed,
	Tejun Heo, Shakeel Butt, Muchun Song, Johannes Weiner,
	Roman Gushchin, Kalesh Singh, Yu Zhao

On Fri, May 12, 2023 at 06:45:13PM +1000, Alistair Popple wrote:

> However review comments suggested it needed to be added as part of
> memcg. As soon as we do that we have to address how we deal with shared
> memory. If we stick with the original RLIMIT proposal this discussion
> goes away, but based on feedback I think I need to at least investigate
> integrating it into memcg to get anything merged.

Personally I don't see how we can effectively solve the per-page
problem without also tracking all the owning memcgs for every
page. This means giving each struct page an array of memcgs.

I suspect this will be too expensive to be realistically
implementable.
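
Some rough numbers behind that concern, assuming 4 KiB pages, 8-byte
pointers, an arbitrary four-entry per-page array and a struct page of
roughly 64 bytes (and note that a fixed array still could not
represent an unbounded set of sharers):

#include <stdio.h>

int main(void)
{
        const double page_size = 4096;  /* assume 4 KiB pages */
        const double ptr_size = 8;      /* 64-bit pointers */
        const double entries = 4;       /* arbitrary small per-page array */
        const double struct_page = 64;  /* roughly today's struct page */
        const double extra = entries * ptr_size;

        printf("extra metadata per page: %.0f bytes\n", extra);
        printf("struct page growth: %.0f%%\n", 100.0 * extra / struct_page);
        printf("share of total RAM: %.2f%%\n", 100.0 * extra / page_size);
        return 0;
}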

If it is done then we may not even need a pin controller on its own,
as the main memcg should capture most of it (although it doesn't
distinguish between movable/swappable and non-swappable memory).

But this is all being done for the libvirt people, so it would be good
to involve them

Jason


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-16 12:21                     ` Alistair Popple
  0 siblings, 0 replies; 51+ messages in thread
From: Alistair Popple @ 2023-05-16 12:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Chris Li, T.J. Mercier, lsf-pc, linux-mm, cgroups, Yosry Ahmed,
	Tejun Heo, Shakeel Butt, Muchun Song, Johannes Weiner,
	Roman Gushchin, Kalesh Singh, Yu Zhao


Jason Gunthorpe <jgg@nvidia.com> writes:

> On Fri, May 12, 2023 at 06:45:13PM +1000, Alistair Popple wrote:
>
>> However review comments suggested it needed to be added as part of
>> memcg. As soon as we do that we have to address how we deal with shared
>> memory. If we stick with the original RLIMIT proposal this discussion
>> goes away, but based on feedback I think I need to at least investigate
>> integrating it into memcg to get anything merged.
>
> Personally I don't see how we can effectively solve the per-page
> problem without also tracking all the owning memcgs for every
> page. This means giving each struct page an array of memcgs
>
> I suspect this will be too expensive to be realistically
> implementable.

Yep, agree with that. Tracking the list of memcgs was the main problem
that prevented this.

> If it is done then we may not even need a pin controller on its own as
> the main memcg should capture most of it. (althought it doesn't
> distinguish between movable/swappable and non-swappable memory)
>
> But this is all being done for the libvirt people, so it would be good
> to involve them

Do you know of anyone specifically there who is interested in this?
I've rebased my series on the latest upstream and am about to resend
it, so it would be good to get some feedback from them.

Thanks.

> Jason



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-19 15:47                       ` Jason Gunthorpe
  0 siblings, 0 replies; 51+ messages in thread
From: Jason Gunthorpe @ 2023-05-19 15:47 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Chris Li, T.J. Mercier, lsf-pc, linux-mm, cgroups, Yosry Ahmed,
	Tejun Heo, Shakeel Butt, Muchun Song, Johannes Weiner,
	Roman Gushchin, Kalesh Singh, Yu Zhao

On Tue, May 16, 2023 at 10:21:10PM +1000, Alistair Popple wrote:
> 
> Jason Gunthorpe <jgg@nvidia.com> writes:
> 
> > On Fri, May 12, 2023 at 06:45:13PM +1000, Alistair Popple wrote:
> >
> >> However review comments suggested it needed to be added as part of
> >> memcg. As soon as we do that we have to address how we deal with shared
> >> memory. If we stick with the original RLIMIT proposal this discussion
> >> goes away, but based on feedback I think I need to at least investigate
> >> integrating it into memcg to get anything merged.
> >
> > Personally I don't see how we can effectively solve the per-page
> > problem without also tracking all the owning memcgs for every
> > page. This means giving each struct page an array of memcgs
> >
> > I suspect this will be too expensive to be realistically
> > implementable.
> 
> Yep, agree with that. Tracking the list of memcgs was the main problem
> that prevented this.
> 
> > If it is done then we may not even need a pin controller on its own as
> > the main memcg should capture most of it. (althought it doesn't
> > distinguish between movable/swappable and non-swappable memory)
> >
> > But this is all being done for the libvirt people, so it would be good
> > to involve them
> 
> Do you know of anyone specifically there that is interested in this?
> I've rebased my series on latest upstream and am about to resend it so
> would be good to get some feedback from them.

"Daniel P. Berrange" <berrange@redhat.com>
Alex Williamson <alex.williamson@redhat.com>

are my usual go-tos.

Thanks,
Jason


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-20 15:09                     ` Chris Li
  0 siblings, 0 replies; 51+ messages in thread
From: Chris Li @ 2023-05-20 15:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alistair Popple, T.J. Mercier, lsf-pc, linux-mm, cgroups,
	Yosry Ahmed, Tejun Heo, Shakeel Butt, Muchun Song,
	Johannes Weiner, Roman Gushchin, Kalesh Singh, Yu Zhao

On Fri, May 12, 2023 at 06:09:20PM -0300, Jason Gunthorpe wrote:
> On Fri, May 12, 2023 at 06:45:13PM +1000, Alistair Popple wrote:
> 
> > However review comments suggested it needed to be added as part of
> > memcg. As soon as we do that we have to address how we deal with shared
> > memory. If we stick with the original RLIMIT proposal this discussion
> > goes away, but based on feedback I think I need to at least investigate
> > integrating it into memcg to get anything merged.
> 
> Personally I don't see how we can effectively solve the per-page
> problem without also tracking all the owning memcgs for every
> page. This means giving each struct page an array of memcgs
> 
> I suspect this will be too expensive to be realistically
> implementable.

Agreed. To get a precise shared usage count, it needs to track
the usage at the page level.

It is possible to track just the <smemcg, memcg> pair at the
leaf node of the memcg hierarchy. At the parent memcg it needs to
avoid counting the same page twice, so it does need to track, per
page, the set of memcgs the page belongs to.
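
A rough model of that leaf-level pair counter, and of why naive
summation breaks at the parent (all names invented for illustration):

struct smemcg_model;
struct memcg_model;

/* One counter per <smemcg, memcg> pair, kept at the leaf memcg. */
struct borrow_counter {
        struct smemcg_model *smemcg;
        struct memcg_model *memcg;
        unsigned long pages;    /* pages of the smemcg this memcg uses */
};

/*
 * Summing the children's borrow counters over-counts at the parent: a
 * page used by two sibling memcgs shows up in two counters but should
 * be charged to the parent only once.  Avoiding that is what forces
 * per-page tracking of the memcg set.
 */
unsigned long parent_borrow_naive(const struct borrow_counter *bc, int nr)
{
        unsigned long sum = 0;

        for (int i = 0; i < nr; i++)
                sum += bc[i].pages;     /* may double count shared pages */
        return sum;
}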


Chris

> If it is done then we may not even need a pin controller on its own as
> the main memcg should capture most of it. (althought it doesn't
> distinguish between movable/swappable and non-swappable memory)
> But this is all being done for the libvirt people, so it would be good
> to involve them
> 
> Jason
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-20 15:31                   ` Chris Li
  0 siblings, 0 replies; 51+ messages in thread
From: Chris Li @ 2023-05-20 15:31 UTC (permalink / raw)
  To: Alistair Popple
  Cc: T.J. Mercier, lsf-pc, linux-mm, cgroups, Yosry Ahmed, Tejun Heo,
	Shakeel Butt, Muchun Song, Johannes Weiner, Roman Gushchin,
	Jason Gunthorpe, Kalesh Singh, Yu Zhao

Hi Alistair,

Sorry for the late reply. I have been super busy since returning from
LSF/MM and am still catching up on my emails.

On Fri, May 12, 2023 at 06:45:13PM +1000, Alistair Popple wrote:
> > Yes, I talked to Jason Gunthorpe and asked him about the usage workflow
> > of the pin memory controller. He tell me that his original intend is
> > just to have something like RLIMIT but without the quirky limitation
> > of the RLIMIT. It has nothing to do with sharing memory.
> > Share memories are only brought up during the online discussion.  I guess
> > the share memory has similar reference count and facing similar challenges
> > on double counting.
> 
> Ok, good. Now I realise perhaps we delved into this discussion without
> covering the background.
> 
> My original patch series implemented what Jason suggested. That was a
> standalone pinscg controller (which perhaps should be implemented as a
> misc controller) that behaves the same as RLIMIT does, just charged to a
> pinscg rather than a process or user.
> 
> However review comments suggested it needed to be added as part of
> memcg. As soon as we do that we have to address how we deal with shared
> memory. If we stick with the original RLIMIT proposal this discussion
> goes away, but based on feedback I think I need to at least investigate
> integrating it into memcg to get anything merged.
Ack.

> 
> [...]
> 
> >> However for shared mappings it isn't - processes in different cgroups
> >> could be mapping the same page but the accounting should happen to the
> >> cgroup the process is in, not the cgroup that happens to "own" the page.
> >
> > Ack. That is actually the root of the share memory problem. The model
> > of charging to the first process does not work well for share usage.
> 
> Right. The RLIMIT approach avoids the shared memory problem by charging
> every process in a pincg for every pin (see below). But I don't disagree
> there is appeal to having pinning work in the same way as memcg hence
> this discussion.

Ack.

> 
> >> It is also possible that the page might not be mapped at all. For
> >> example a driver may be pinning a page with pin_user_pages(), but the
> >> userspace process may have munmap()ped it.
> >
> > Ack.
> > In that case the driver will need to hold a reference count for it, right?
> 
> Correct. Drivers would normally drop that reference when the FD is
> closed but there's nothing that says they have to.

Ack.

> 
> [...]
> 
> > OK. This is the critical usage information that I want to know. Thanks for
> > the explaination.
> >
> > So there are two different definition of the pin page count:
> > 1) sum(total set of pages that this memcg process issue pin ioctl on)
> > 2) sum(total set of pined page this memcg process own a reference count on)
> >
> > It seems you want 2).
> >
> > If a page has three reference counts inside one memcg, e.g. map three times. 
> > Does the pin count three times or only once?
> 
> I'm going to be pedantic here because it's important - by "map three
> times" I assume you mean "pin three times with an ioctl". For pinning it
> doesn't matter if the page is actually mapped or not.

Pedantic is good.
I was actually thinking about it the other way around: pin the page
with an ioctl once, then map that page into other processes three
times with three virtual addresses.

My guess from your reply is that it counts as one pin only.

> 
> This is basically where we need to get everyone aligned. The RLIMIT
> approach currently implemented by my patch series does (2). For example:
> 
> 1. If a process in a pincg requests (eg. via driver ioctl) to pin a page
>    it is charged against the pincg limit and will fail if going over
>    limit.
> 
> 2. If the same process requests another pin (doesn't matter if it's the
>    same page or not) it will be charged again and can't go over limit.
> 
> 3. If another process in the same pincg requests a page (again, doesn't
>    matter if it's the same page or not) be pinned it will be charged
>    against the limit.

I see. You want to track and punish the number of times a process
issues a pin ioctl on the page.

> 
> 4. If a process not in the pincg pins the same page it will not be
>    charged against the pincg limit.

Ack, because it is not in the pincg.

> 
> From my perspective I think (1) would be fine (or even preferable) if
> and only if the sharing issues can be resolved. In that case it becomes
> much easier to explain how to set the limit.

Ack.

> For example it could be set as a percentage of total memory allocated to
> the memcg, because all that really matters is the first pin within a
> given memcg.
> 
> Subsequent pins won't impact system performance or stability because
> once the page is pinned once it may as well be pinned a hundred
> times. The only reason I didn't take this approach in my series is that
> it's currently impossible to figure out what to do in the shared case
> because we have no way of mapping pages back to multiple memcgs to see
> if they've already been charged to that memcg, so they would have to be
> charged to a single memcg which isn't useful.
> 

Ack.

> >> Hence the interest in the total <smemcg, memcg>  limit.
> >> 
> >> Pinned memory may also outlive the process that created it - drivers
> >> associate it via a file-descriptor not a process and even if the FD is
> >> closed there's nothing say a driver has to unpin the memory then
> >> (although most do).
> >
> > Ack.
> >
> >> 
> >> > We can set up some meetings to discuss it as well.
> >> >
> >> >> So for pinning at least I don't see a per smemcg limit being useful.
> >> >
> >> > That is fine.  I see you are interested in the <smemcg, memcg> limit.
> >> 
> >> Right, because it sounds like it will allow pinning the same struct page
> >> multiple times to result in multiple charges. With the current memcg
> >
> > Multiple times to different memcgs I assume, please see the above question
> > regard multiple times to the same memcg.
> 
> Right. If we have a page that is pinned by two processes in different
> cgroups each cgroup should be charged once (and IMHO only once) for it.
> 
Ack.

> In other words if two processes in the same memcg pin the same page that
> should only count as a single pin towards that memcg's pin limit, and
> the pin would be uncharged when the final pinner unpins the page.

Ack. You need to track the set of pages you have already pinned.
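
As a toy model of what tracking that set could mean (invented names,
no locking, and a small fixed-size table standing in for a scalable
structure such as an xarray): only the first pin of a page charges the
memcg, and only the last unpin drops the charge.

#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX 1024

struct pin_entry {
        unsigned long pfn;
        unsigned long refs;
};

struct memcg_pin_set {
        struct pin_entry entries[TOY_MAX];
        size_t nr;
        unsigned long charged;          /* pages charged to the pin limit */
};

static struct pin_entry *pin_find(struct memcg_pin_set *s, unsigned long pfn)
{
        for (size_t i = 0; i < s->nr; i++)
                if (s->entries[i].pfn == pfn)
                        return &s->entries[i];
        return NULL;
}

/* Returns true if this was the first pin in the memcg and was charged. */
bool memcg_pin_page(struct memcg_pin_set *s, unsigned long pfn)
{
        struct pin_entry *e = pin_find(s, pfn);

        if (e) {
                e->refs++;
                return false;           /* page already charged once */
        }
        if (s->nr == TOY_MAX)
                return false;           /* toy limit; real code would grow */
        s->entries[s->nr].pfn = pfn;
        s->entries[s->nr].refs = 1;
        s->nr++;
        s->charged++;                   /* first pin in this memcg: charge */
        return true;
}

/* Returns true if this was the last pin and the charge was dropped. */
bool memcg_unpin_page(struct memcg_pin_set *s, unsigned long pfn)
{
        struct pin_entry *e = pin_find(s, pfn);

        if (!e || --e->refs)
                return false;
        *e = s->entries[--s->nr];       /* swap-remove the emptied slot */
        s->charged--;
        return true;
}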

> Note this is not what is implemented by the RLIMIT approach hence why it
> conflicts with my answer to the above question which describes that
> approach.

Ack.

> 
> [...]
> 
> >> >> Implementation wise we'd need a way to lookup both the smemcg of the
> >> >> struct page and the memcg that the pinning task belongs to.
> >> >
> >> > The page->memcg_data points to the pin smemcg. I am hoping pinning API or
> >> > the current memcg can get to the pinning memcg.
> >> 
> >> So the memcg to charge would come from the process doing the
> >> pin_user_pages() rather than say page->memcg_data? Seems reasonable.
> >
> > That is more of a question for you. What is the desired behavior.
> > If charge the current process that perform the pin_user_pages()
> > works for you. Great.
> 
> Argh, if only I had the powers and the desire to commit what worked
> solely for me :-)

I want to learn about your use case first. Of course others might have
different use cases in mind. Let's start with yours.

> My current RLIMIT implementation works, but is it's own seperate thing
> and there was a strong and reasonable preference from maintainers to
> have this integrated with memcg.
> 
> But that opens up this whole set of problems around sharing, etc. which
> we need to solve. We didn't want to invent our own set of rules for
> sharing, etc. and changing memcg to support sharing in the way that was
> needed seemed like a massive project.
> 
> However if we tackle that project and resolve the sharing issues for
> memcg then I think adding a pin limit per-memcg shouldn't be so hard.

Ack.

I consider that the first step for smemcg should only include a simple
binary membership relationship. That will simplify things a lot, since
we don't need to track the per-page shared usage count.

However, it does not address your requirement of tracking the set of
sharers for each page. I haven't figured out a good, efficient way to
do that. If anyone knows, I am curious to hear it.
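
A minimal sketch of that binary-membership model (all names invented
for illustration): a memcg either has the smemcg attached or it does
not, and if attached, the smemcg's whole charge counts toward that
memcg's usage, with no per-page sharer tracking at all.

#define MAX_SMEMCG 8

struct smemcg_model {
        unsigned long charged;          /* total pages charged to the smemcg */
};

struct memcg_bin {
        unsigned long anon;                             /* private usage */
        struct smemcg_model *smemcgs[MAX_SMEMCG];       /* attached, empty slots unset */
};

/* memory.current under the binary model: private + whole attached smemcgs. */
unsigned long memcg_current(const struct memcg_bin *m)
{
        unsigned long usage = m->anon;

        for (int i = 0; i < MAX_SMEMCG; i++)
                if (m->smemcgs[i])
                        usage += m->smemcgs[i]->charged;
        return usage;
}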


> >> >> > 4) unshare/unmmap already charged memory. That will reduce the per <smemcg, memcg>
> >> >> > borrow counter.
> >> >> 
> >> >> Actually this is where things might get a bit tricky for pinning. We'd
> >> >> have to reduce the pin charge when a driver calls put_page(). But that
> >> >> implies looking up the borrow counter / <smemcg, memcg> pair a driver
> >> >> charged the page to.
> >> >
> >> > Does the pin page share between different memcg or just one memcg?
> >> 
> >> In general it can share between different memcg. Consider a shared
> >> mapping shared with processes in two different cgroups (A and B). There
> >> is nothing stopping each process opening a file-descriptor and calling
> >> an ioctl() to pin the shared page.
> >
> > Ack.
> >
> >> Each should be punished for pinning the page in the sense that the pin
> >> count for their respective cgroups must go up.
> >
> > Ack. That clarfy my previous question. You want definition 2)
> 
> I wouldn't say "want" so much as that seemed the quickest/easiest path
> forward without having to fix the problems with support for shared pages
> in memcg.

Ack.

> 
> >> Drivers pinning shared pages is I think relatively rare, but it's
> >> theorectically possible and if we're going down the path of adding
> >> limits for pinning to memcg it's something we need to deal with to make
> >> sandboxing effective.
> >
> > The driver will still have a current processor. Do you mean in this case,
> > the current processor is not the right one to charge?
> > Another option can be charge to a default system/kernel smemcg or a driver
> > smemcg as well.
> 
> It's up to the driver to inform us which process should be
> charged. Potentially it's not current. But my point here was drivers are
> pretty much always pinning pages in private mappings. Hence why we
> thought the trade-off taking a pincg RLIMIT style approach that
> occasionally double charges pages would be fine because dealing with
> pinning a page in a shared mapping was both hard and rare.
> 
> But if we're solving the shared memcg problem that might at least change
> the "hard" bit of that equation.

Ack. The smemcg can start with a simple binary membership relationship.
That does not help your pin controller much, though.

> >> > If it is shared, can the put_page() API indicate it is performing in behalf
> >> > of which memcg?
> >> 
> >> I think so - although it varies by driver.
> >> 
> >> Drivers have to store the array of pages pinned so should be able to
> >> track the memcg with that as well. My series added a struct vm_account
> >> which would be the obvious place to keep that reference.
> >
> > Where does the struct vm_account lives?
> 
> I am about to send a rebased version of my series, but typically drivers
> keep some kind of per-FD context structure and we keep the vm_account
> there.
> 
> The vm_account holds references to the task_struct/mm_struct/pinscg as
> required.

I will keep an eye out for your patches.

> 
> >> Each set of pin 
> >> operations on a FD would need a new memcg reference though so it would
> >> add overhead for drivers that only pin a small number of pages at a
> >> time.
> >
> > Set is more complicate then allow double counting the same page in the
> > same smemcg. Again mostly just collecting requirement from you.
> 
> Right. Hence why we just went with double counting. I think it would be
> hard to figure out if a particular page is already pinned by a
> particular <memcg, smemcg> or not.

Agreed. We can track it; the question is how to do it in a scalable
way. I don't have a solution I am happy with yet.

Chris



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-20 15:31                   ` Chris Li
  0 siblings, 0 replies; 51+ messages in thread
From: Chris Li @ 2023-05-20 15:31 UTC (permalink / raw)
  To: Alistair Popple
  Cc: T.J. Mercier, lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Yosry Ahmed, Tejun Heo, Shakeel Butt, Muchun Song,
	Johannes Weiner, Roman Gushchin, Jason Gunthorpe, Kalesh Singh,
	Yu Zhao

Hi Alistair,

Sorry for the late reply. Super busy after return from LSF/MM.
Catching up my emails.

On Fri, May 12, 2023 at 06:45:13PM +1000, Alistair Popple wrote:
> > Yes, I talked to Jason Gunthorpe and asked him about the usage workflow
> > of the pin memory controller. He tell me that his original intend is
> > just to have something like RLIMIT but without the quirky limitation
> > of the RLIMIT. It has nothing to do with sharing memory.
> > Share memories are only brought up during the online discussion.  I guess
> > the share memory has similar reference count and facing similar challenges
> > on double counting.
> 
> Ok, good. Now I realise perhaps we delved into this discussion without
> covering the background.
> 
> My original patch series implemented what Jason suggested. That was a
> standalone pinscg controller (which perhaps should be implemented as a
> misc controller) that behaves the same as RLIMIT does, just charged to a
> pinscg rather than a process or user.
> 
> However review comments suggested it needed to be added as part of
> memcg. As soon as we do that we have to address how we deal with shared
> memory. If we stick with the original RLIMIT proposal this discussion
> goes away, but based on feedback I think I need to at least investigate
> integrating it into memcg to get anything merged.
Ack.

> 
> [...]
> 
> >> However for shared mappings it isn't - processes in different cgroups
> >> could be mapping the same page but the accounting should happen to the
> >> cgroup the process is in, not the cgroup that happens to "own" the page.
> >
> > Ack. That is actually the root of the share memory problem. The model
> > of charging to the first process does not work well for share usage.
> 
> Right. The RLIMIT approach avoids the shared memory problem by charging
> every process in a pincg for every pin (see below). But I don't disagree
> there is appeal to having pinning work in the same way as memcg hence
> this discussion.

Ack.

> 
> >> It is also possible that the page might not be mapped at all. For
> >> example a driver may be pinning a page with pin_user_pages(), but the
> >> userspace process may have munmap()ped it.
> >
> > Ack.
> > In that case the driver will need to hold a reference count for it, right?
> 
> Correct. Drivers would normally drop that reference when the FD is
> closed but there's nothing that says they have to.

Ack.

> 
> [...]
> 
> > OK. This is the critical usage information that I want to know. Thanks for
> > the explaination.
> >
> > So there are two different definition of the pin page count:
> > 1) sum(total set of pages that this memcg process issue pin ioctl on)
> > 2) sum(total set of pined page this memcg process own a reference count on)
> >
> > It seems you want 2).
> >
> > If a page has three reference counts inside one memcg, e.g. map three times. 
> > Does the pin count three times or only once?
> 
> I'm going to be pedantic here because it's important - by "map three
> times" I assume you mean "pin three times with an ioctl". For pinning it
> doesn't matter if the page is actually mapped or not.

Pedantic is good.
I actually thinking about the other way around: pin the page with
ioctl once, then that page is map to other process three times with
three virtual addresses.

My guess from your reply is that it counts as one pin only.

> 
> This is basically where we need to get everyone aligned. The RLIMIT
> approach currently implemented by my patch series does (2). For example:
> 
> 1. If a process in a pincg requests (eg. via driver ioctl) to pin a page
>    it is charged against the pincg limit and will fail if going over
>    limit.
> 
> 2. If the same process requests another pin (doesn't matter if it's the
>    same page or not) it will be charged again and can't go over limit.
> 
> 3. If another process in the same pincg requests a page (again, doesn't
>    matter if it's the same page or not) be pinned it will be charged
>    against the limit.

I see. You want to track and punish the number of time process
issue pin ioctl on the page.

> 
> 4. If a process not in the pincg pins the same page it will not be
>    charged against the pincg limit.

Ack, because it is not in the pincg.

> 
> From my perspective I think (1) would be fine (or even preferable) if
> and only if the sharing issues can be resolved. In that case it becomes
> much easier to explain how to set the limit.

Ack.

> For example it could be set as a percentage of total memory allocated to
> the memcg, because all that really matters is the first pin within a
> given memcg.
> 
> Subsequent pins won't impact system performance or stability because
> once the page is pinned once it may as well be pinned a hundred
> times. The only reason I didn't take this approach in my series is that
> it's currently impossible to figure out what to do in the shared case
> because we have no way of mapping pages back to multiple memcgs to see
> if they've already been charged to that memcg, so they would have to be
> charged to a single memcg which isn't useful.
> 

Ack.

> >> Hence the interest in the total <smemcg, memcg>  limit.
> >> 
> >> Pinned memory may also outlive the process that created it - drivers
> >> associate it via a file-descriptor not a process and even if the FD is
> >> closed there's nothing say a driver has to unpin the memory then
> >> (although most do).
> >
> > Ack.
> >
> >> 
> >> > We can set up some meetings to discuss it as well.
> >> >
> >> >> So for pinning at least I don't see a per smemcg limit being useful.
> >> >
> >> > That is fine.  I see you are interested in the <smemcg, memcg> limit.
> >> 
> >> Right, because it sounds like it will allow pinning the same struct page
> >> multiple times to result in multiple charges. With the current memcg
> >
> > Multiple times to different memcgs I assume, please see the above question
> > regard multiple times to the same memcg.
> 
> Right. If we have a page that is pinned by two processes in different
> cgroups each cgroup should be charged once (and IMHO only once) for it.
> 
Ack.

> In other words if two processes in the same memcg pin the same page that
> should only count as a single pin towards that memcg's pin limit, and
> the pin would be uncharged when the final pinner unpins the page.

Ack. You need to track the set of page you already pin on. 

> Note this is not what is implemented by the RLIMIT approach hence why it
> conflicts with my answer to the above question which describes that
> approach.

Ack.

> 
> [...]
> 
> >> >> Implementation wise we'd need a way to lookup both the smemcg of the
> >> >> struct page and the memcg that the pinning task belongs to.
> >> >
> >> > The page->memcg_data points to the pin smemcg. I am hoping pinning API or
> >> > the current memcg can get to the pinning memcg.
> >> 
> >> So the memcg to charge would come from the process doing the
> >> pin_user_pages() rather than say page->memcg_data? Seems reasonable.
> >
> > That is more of a question for you. What is the desired behavior?
> > If charging the current process that performs the pin_user_pages()
> > works for you, great.
> 
> Argh, if only I had the powers and the desire to commit what worked
> solely for me :-)

I want to learn about your use case first. Of course others might have
different use cases in mind. Let's start with yours.

> My current RLIMIT implementation works, but it is its own separate thing
> and there was a strong and reasonable preference from maintainers to
> have this integrated with memcg.
> 
> But that opens up this whole set of problems around sharing, etc. which
> we need to solve. We didn't want to invent our own set of rules for
> sharing, etc. and changing memcg to support sharing in the way that was
> needed seemed like a massive project.
> 
> However if we tackle that project and resolve the sharing issues for
> memcg then I think adding a pin limit per-memcg shouldn't be so hard.

Ack.

I consider that the first step for smemcg should only include the simple
binary membership relationship. That will simplify things a lot, since we
don't need to track the per-page shared usage count.

However, that does not address your requirement of tracking the shared
usage set for each page. I haven't figured out a good, efficient way to do
that yet. If anyone knows one, I am curious to hear it.
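To make the binary-membership idea a bit more concrete, the shape I have in
mind is roughly the sketch below. None of this exists; the names are
invented:

#include <linux/list.h>
#include <linux/memcontrol.h>
#include <linux/page_counter.h>

/*
 * Hypothetical shared memcg: shared pages are charged once to the smemcg,
 * and a memcg either is or is not a member. There is no per-page,
 * per-member counting.
 */
struct smemcg {
	struct page_counter usage;	/* charged once per shared page */
	struct list_head members;	/* list of smemcg_member */
};

struct smemcg_member {
	struct mem_cgroup *memcg;	/* a memcg sharing this smemcg */
	struct list_head node;		/* linked on smemcg->members */
};

Membership alone is enough to attribute shared usage somewhere sensible,
but as noted above it cannot answer "has this page already been pinned from
this member memcg".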


> >> >> > 4) unshare/unmmap already charged memory. That will reduce the per <smemcg, memcg>
> >> >> > borrow counter.
> >> >> 
> >> >> Actually this is where things might get a bit tricky for pinning. We'd
> >> >> have to reduce the pin charge when a driver calls put_page(). But that
> >> >> implies looking up the borrow counter / <smemcg, memcg> pair a driver
> >> >> charged the page to.
> >> >
> >> > Is the pinned page shared between different memcgs or just one memcg?
> >> 
> >> In general it can share between different memcg. Consider a shared
> >> mapping shared with processes in two different cgroups (A and B). There
> >> is nothing stopping each process opening a file-descriptor and calling
> >> an ioctl() to pin the shared page.
> >
> > Ack.
> >
> >> Each should be punished for pinning the page in the sense that the pin
> >> count for their respective cgroups must go up.
> >
> > Ack. That clarifies my previous question. You want definition 2).
> 
> I wouldn't say "want" so much as that seemed the quickest/easiest path
> forward without having to fix the problems with support for shared pages
> in memcg.

Ack.

> 
> >> Drivers pinning shared pages is I think relatively rare, but it's
> >> theoretically possible and if we're going down the path of adding
> >> limits for pinning to memcg it's something we need to deal with to make
> >> sandboxing effective.
> >
> > The driver will still have a current process. Do you mean in this case,
> > the current process is not the right one to charge?
> > Another option could be to charge a default system/kernel smemcg or a driver
> > smemcg as well.
> 
> It's up to the driver to inform us which process should be
> charged. Potentially it's not current. But my point here was drivers are
> pretty much always pinning pages in private mappings. Hence why we
> thought the trade-off of taking a pincg RLIMIT-style approach that
> occasionally double charges pages would be fine, because dealing with
> pinning a page in a shared mapping was both hard and rare.
> 
> But if we're solving the shared memcg problem that might at least change
> the "hard" bit of that equation.

Ack. The smemcg can start with a simple binary membership relationship.
That does not help your pin controller much, though.

> >> > If it is shared, can the put_page() API indicate on behalf of which
> >> > memcg it is performed?
> >> 
> >> I think so - although it varies by driver.
> >> 
> >> Drivers have to store the array of pages pinned so should be able to
> >> track the memcg with that as well. My series added a struct vm_account
> >> which would be the obvious place to keep that reference.
> >
> > Where does the struct vm_account live?
> 
> I am about to send a rebased version of my series, but typically drivers
> keep some kind of per-FD context structure and we keep the vm_account
> there.
> 
> The vm_account holds references to the task_struct/mm_struct/pincg as
> required.

I will keep an eye out for your patches.
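For my own mental model, I am picturing the per-FD side roughly like the
sketch below; the context struct and field names are guesses on my part,
and struct vm_account is the one from your proposed (not yet merged)
series, so its contents may well differ:

/* Hypothetical driver per-FD context holding the accounting state. */
struct my_driver_ctx {
	struct vm_account vm_account;	/* holds task/mm/pincg references */
	struct page **pinned_pages;	/* pages pinned on behalf of this FD */
	unsigned long nr_pinned;
};

/*
 * Pin paths charge through ctx->vm_account; unpin/teardown paths uncharge
 * through the same vm_account, so the charge finds its way back to the
 * right place even if the pinning task has exited.
 */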

> 
> >> Each set of pin 
> >> operations on a FD would need a new memcg reference though so it would
> >> add overhead for drivers that only pin a small number of pages at a
> >> time.
> >
> > Tracking a set is more complicated than allowing double counting of the
> > same page in the same smemcg. Again, mostly just collecting requirements
> > from you.
> 
> Right. Hence why we just went with double counting. I think it would be
> hard to figure out if a particular page is already pinned by a
> particular <memcg, smemcg> or not.

Agree. We can track it; the question is how to do it in a scalable way.
I don't yet have a solution for that which I am happy with.
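The most direct structure I can think of is a per-smemcg table of borrow
counters keyed by memcg id, along the lines of the sketch below (purely
illustrative). The problem is that it adds a lookup to every charge and
uncharge, and the table has to be kept in sync as memcgs come and go:

#include <linux/gfp.h>
#include <linux/memcontrol.h>
#include <linux/xarray.h>

/* Hypothetical per-<smemcg, memcg> borrow counters, keyed by memcg id. */
struct smemcg_borrow {
	struct xarray counters;	/* mem_cgroup_id -> pages borrowed */
};

static void smemcg_note_borrow(struct smemcg_borrow *b, struct mem_cgroup *memcg)
{
	unsigned long id = mem_cgroup_id(memcg);
	void *entry;

	xa_lock(&b->counters);
	entry = xa_load(&b->counters, id);
	__xa_store(&b->counters, id,
		   xa_mk_value((entry ? xa_to_value(entry) : 0) + 1),
		   GFP_ATOMIC);
	xa_unlock(&b->counters);
}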

Chris


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
@ 2023-05-29 19:31                     ` Jason Gunthorpe
  0 siblings, 0 replies; 51+ messages in thread
From: Jason Gunthorpe @ 2023-05-29 19:31 UTC (permalink / raw)
  To: Chris Li
  Cc: Alistair Popple, T.J. Mercier, lsf-pc, linux-mm, cgroups,
	Yosry Ahmed, Tejun Heo, Shakeel Butt, Muchun Song,
	Johannes Weiner, Roman Gushchin, Kalesh Singh, Yu Zhao

On Sat, May 20, 2023 at 08:31:01AM -0700, Chris Li wrote:
> > This is basically where we need to get everyone aligned. The RLIMIT
> > approach currently implemented by my patch series does (2). For example:
> > 
> > 1. If a process in a pincg requests (eg. via driver ioctl) to pin a page
> >    it is charged against the pincg limit and will fail if going over
> >    limit.
> > 
> > 2. If the same process requests another pin (doesn't matter if it's the
> >    same page or not) it will be charged again and can't go over limit.
> > 
> > 3. If another process in the same pincg requests a page (again, doesn't
> >    matter if it's the same page or not) be pinned it will be charged
> >    against the limit.
> 
> I see. You want to track and punish the number of times a process
> issues a pin ioctl on a page.

Yes, because it is feasible to count that without a lot of overhead.

In a perfect world each cgroup would be charged exactly once while any
pin is active, regardless of how many times something in the cgroup
caused it to be pinned.
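(To illustrate what "exactly once" implies: something like per-folio state
mapping each pinning cgroup to a pin count, e.g. the hypothetical sketch
below, charging only on the 0 -> 1 transition and uncharging on 1 -> 0.
That is extra memory per folio and an extra lookup on every pin and unpin,
which is exactly the overhead nobody wants to pay.)

#include <linux/xarray.h>

/* Hypothetical per-folio pin ownership; nothing like this exists today. */
struct folio_pin_owners {
	struct xarray owners;	/* cgroup id -> number of active pins */
};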

Jason


^ permalink raw reply	[flat|nested] 51+ messages in thread


end of thread, other threads:[~2023-05-29 19:31 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-11 23:36 [LSF/MM/BPF TOPIC] Reducing zombie memcgs T.J. Mercier
2023-04-11 23:36 ` T.J. Mercier
2023-04-11 23:48 ` Yosry Ahmed
2023-04-11 23:48   ` Yosry Ahmed
2023-04-25 11:36   ` Yosry Ahmed
2023-04-25 11:36     ` Yosry Ahmed
2023-04-25 18:42     ` Waiman Long
2023-04-25 18:42       ` Waiman Long
2023-04-25 18:53       ` Yosry Ahmed
2023-04-25 18:53         ` Yosry Ahmed
2023-04-26 20:15         ` Waiman Long
2023-04-26 20:15           ` Waiman Long
2023-05-01 16:38     ` Roman Gushchin
2023-05-01 16:38       ` Roman Gushchin
2023-05-02  7:18       ` Yosry Ahmed
2023-05-02  7:18         ` Yosry Ahmed
2023-05-02 20:02       ` Yosry Ahmed
2023-05-02 20:02         ` Yosry Ahmed
2023-05-03 22:15 ` Chris Li
2023-05-03 22:15   ` Chris Li
2023-05-04 11:58   ` Alistair Popple
2023-05-04 11:58     ` Alistair Popple
2023-05-04 15:31     ` Chris Li
2023-05-04 15:31       ` Chris Li
2023-05-05 13:53       ` Alistair Popple
2023-05-05 13:53         ` Alistair Popple
2023-05-06 22:49         ` Chris Li
2023-05-06 22:49           ` Chris Li
2023-05-08  8:17           ` Alistair Popple
2023-05-08  8:17             ` Alistair Popple
2023-05-10 14:51             ` Chris Li
2023-05-10 14:51               ` Chris Li
2023-05-12  8:45               ` Alistair Popple
2023-05-12  8:45                 ` Alistair Popple
2023-05-12 21:09                 ` Jason Gunthorpe
2023-05-12 21:09                   ` Jason Gunthorpe
2023-05-16 12:21                   ` Alistair Popple
2023-05-16 12:21                     ` Alistair Popple
2023-05-19 15:47                     ` Jason Gunthorpe
2023-05-19 15:47                       ` Jason Gunthorpe
2023-05-20 15:09                   ` Chris Li
2023-05-20 15:09                     ` Chris Li
2023-05-20 15:31                 ` Chris Li
2023-05-20 15:31                   ` Chris Li
2023-05-29 19:31                   ` Jason Gunthorpe
2023-05-29 19:31                     ` Jason Gunthorpe
2023-05-04 17:02   ` Shakeel Butt
2023-05-04 17:02     ` Shakeel Butt
2023-05-04 17:36     ` Chris Li
2023-05-04 17:36       ` Chris Li
2023-05-12  3:08 ` Yosry Ahmed
