* [RFC Proposal] Deterministic memcg charging for shared memory
@ 2021-10-13 19:23 Mina Almasry
  2021-10-18 12:51 ` Mina Almasry
  2021-10-18 13:33 ` Michal Hocko
  0 siblings, 2 replies; 7+ messages in thread
From: Mina Almasry @ 2021-10-13 19:23 UTC (permalink / raw)
  To: Roman Gushchin, Shakeel Butt, Greg Thelen, Michal Hocko,
	Johannes Weiner, Hugh Dickins, Tejun Heo, Linux-MM,
	open list:FILESYSTEMS (VFS and infrastructure),
	cgroups, riel

Below is a proposal for deterministic charging of shared memory.
Please take a look and let me know if there are any major concerns:

Problem:
Currently shared memory is charged to the memcg of the allocating
process. This makes memory usage of processes accessing shared memory
a bit unpredictable since whichever process accesses the memory first
will get charged. We have a number of use cases where our userspace
would like deterministic charging of shared memory:

1. System services allocating memory for client jobs:
We have services (namely a network access service[1]) that provide
functionality for clients running on the machine and allocate memory
to carry out these services. The memory usage of these services
depends on the number of jobs running on the machine and the nature of
the requests made to the service, which makes the memory usage of
these services hard to predict and thus hard to limit via memory.max.
These system services would like a way to allocate memory and instruct
the kernel to charge this memory to the client’s memcg.

2. Shared filesystem between subtasks of a large job
Our infrastructure has large meta jobs such as kubernetes which
spawn multiple subtasks that share a tmpfs mount. These jobs and
their subtasks use that tmpfs mount for various purposes, such as
data sharing or persisting data across subtask restarts. In
kubernetes terminology, the meta job is similar to a pod and the
subtasks are containers under the pod. We want the shared memory to
be deterministically charged to the kubernetes pod, independent of
the lifetime of the containers under the pod.

3. Shared libraries and language runtimes shared between independent jobs.
We’d like to optimize memory usage on the machine by sharing
libraries and language runtimes of many of the processes running on
our machines in separate memcgs. This has the side effect that one
job may be unlucky enough to be the first to access many of the
libraries and may get oom killed as all the cached files get charged
to it.

Design:
My rough proposal to solve this problem is to simply add a
‘memcg=/path/to/memcg’ mount option for filesystems (namely tmpfs),
directing all the memory of the file system to be ‘remote charged’
to the cgroup provided by that memcg= option.
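
For illustration only, here is roughly what using such a mount could
look like from userspace. The memcg= option is the new thing being
proposed and does not exist yet; the cgroup path, mount point and
size below are made up, and whether the path should be absolute or
relative to the cgroup mount root is up for discussion:

/* Sketch: all memory of this tmpfs is remote charged to one memcg. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /*
         * Equivalent to:
         *   mount -t tmpfs -o size=1G,memcg=/jobs/client1 none /mnt/client1
         */
        if (mount("none", "/mnt/client1", "tmpfs", 0,
                  "size=1G,memcg=/jobs/client1")) {
                perror("mount");
                return 1;
        }
        /*
         * Any task writing under /mnt/client1 would then have those
         * pages charged to the memcg named by memcg=, not to its own
         * memcg.
         */
        return 0;
}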

Caveats:
1. One complication to address is the behavior when the target memcg
hits its memory.max limit because of remote charging. In this case
the oom-killer will be invoked, but it may not find anything to kill
in the target memcg being charged. In that case, I propose simply
failing the remote charge, which will cause the process executing
the remote charge to get an ENOMEM. This will be the documented
behavior of remote charging.
2. I would like to provide an initial implementation that adds this
support for tmpfs, while leaving the implementation generic enough for
myself or others to extend to more filesystems where they find the
feature useful.
3. I would like to implement this for both cgroups v2 _and_ cgroups
v1, as we still have cgroup v1 users. If this is unacceptable I can
provide the v2 implementation only, and maintain a local patch for the
v1 support.

If this proposal sounds good in principle, I have an experimental
implementation that I can make ready for review. Please let me know of
any concerns you may have. Thank you very much in advance!
Mina Almasry

[1] https://research.google/pubs/pub48630/



* Re: [RFC Proposal] Deterministic memcg charging for shared memory
  2021-10-13 19:23 [RFC Proposal] Deterministic memcg charging for shared memory Mina Almasry
@ 2021-10-18 12:51 ` Mina Almasry
  2021-10-18 13:33 ` Michal Hocko
  1 sibling, 0 replies; 7+ messages in thread
From: Mina Almasry @ 2021-10-18 12:51 UTC (permalink / raw)
  To: Roman Gushchin, Shakeel Butt, Greg Thelen, Michal Hocko,
	Johannes Weiner, Hugh Dickins, Tejun Heo, Linux-MM,
	open list:FILESYSTEMS (VFS and infrastructure),
	cgroups, riel

On Wed, Oct 13, 2021 at 12:23 PM Mina Almasry <almasrymina@google.com> wrote:
>
> Below is a proposal for deterministic charging of shared memory.
> Please take a look and let me know if there are any major concerns:
>

Friendly ping on the proposal below. If there are any issues you see
that I can address in the v1 I send for review, I would love to know.
And if the proposal seems fine as is I would also love to know.

Thanks!
Mina

> Problem:
> Currently shared memory is charged to the memcg of the allocating
> process. This makes memory usage of processes accessing shared memory
> a bit unpredictable since whichever process accesses the memory first
> will get charged. We have a number of use cases where our userspace
> would like deterministic charging of shared memory:
>
> 1. System services allocating memory for client jobs:
> We have services (namely a network access service[1]) that provide
> functionality for clients running on the machine and allocate memory
> to carry out these services. The memory usage of these services
> depends on the number of jobs running on the machine and the nature of
> the requests made to the service, which makes the memory usage of
> these services hard to predict and thus hard to limit via memory.max.
> These system services would like a way to allocate memory and instruct
> the kernel to charge this memory to the client’s memcg.
>
> 2. Shared filesystem between subtasks of a large job
> Our infrastructure has large meta jobs such as kubernetes which
> spawn multiple subtasks that share a tmpfs mount. These jobs and
> their subtasks use that tmpfs mount for various purposes, such as
> data sharing or persisting data across subtask restarts. In
> kubernetes terminology, the meta job is similar to a pod and the
> subtasks are containers under the pod. We want the shared memory to
> be deterministically charged to the kubernetes pod, independent of
> the lifetime of the containers under the pod.
>
> 3. Shared libraries and language runtimes shared between independent jobs.
> We’d like to optimize memory usage on the machine by sharing
> libraries and language runtimes of many of the processes running on
> our machines in separate memcgs. This has the side effect that one
> job may be unlucky enough to be the first to access many of the
> libraries and may get oom killed as all the cached files get charged
> to it.
>
> Design:
> My rough proposal to solve this problem is to simply add a
> ‘memcg=/path/to/memcg’ mount option for filesystems (namely tmpfs),
> directing all the memory of the file system to be ‘remote charged’
> to the cgroup provided by that memcg= option.
>
> Caveats:
> 1. One complication to address is the behavior when the target memcg
> hits its memory.max limit because of remote charging. In this case
> the oom-killer will be invoked, but it may not find anything to kill
> in the target memcg being charged. In that case, I propose simply
> failing the remote charge, which will cause the process executing
> the remote charge to get an ENOMEM. This will be the documented
> behavior of remote charging.
> 2. I would like to provide an initial implementation that adds this
> support for tmpfs, while leaving the implementation generic enough for
> myself or others to extend to more filesystems where they find the
> feature useful.
> 3. I would like to implement this for both cgroups v2 _and_ cgroups
> v1, as we still have cgroup v1 users. If this is unacceptable I can
> provide the v2 implementation only, and maintain a local patch for the
> v1 support.
>
> If this proposal sounds good in principle, I have an experimental
> implementation that I can make ready for review. Please let me know of
> any concerns you may have. Thank you very much in advance!
> Mina Almasry
>
> [1] https://research.google/pubs/pub48630/



* Re: [RFC Proposal] Deterministic memcg charging for shared memory
  2021-10-13 19:23 [RFC Proposal] Deterministic memcg charging for shared memory Mina Almasry
  2021-10-18 12:51 ` Mina Almasry
@ 2021-10-18 13:33 ` Michal Hocko
  2021-10-18 14:31   ` Mina Almasry
  1 sibling, 1 reply; 7+ messages in thread
From: Michal Hocko @ 2021-10-18 13:33 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Roman Gushchin, Shakeel Butt, Greg Thelen, Johannes Weiner,
	Hugh Dickins, Tejun Heo, Linux-MM,
	open list:FILESYSTEMS (VFS and infrastructure),
	cgroups, riel

On Wed 13-10-21 12:23:19, Mina Almasry wrote:
> Below is a proposal for deterministic charging of shared memory.
> Please take a look and let me know if there are any major concerns:
> 
> Problem:
> Currently shared memory is charged to the memcg of the allocating
> process. This makes memory usage of processes accessing shared memory
> a bit unpredictable since whichever process accesses the memory first
> will get charged. We have a number of use cases where our userspace
> would like deterministic charging of shared memory:
> 
> 1. System services allocating memory for client jobs:
> We have services (namely a network access service[1]) that provide
> functionality for clients running on the machine and allocate memory
> to carry out these services. The memory usage of these services
> depends on the number of jobs running on the machine and the nature of
> the requests made to the service, which makes the memory usage of
> these services hard to predict and thus hard to limit via memory.max.
> These system services would like a way to allocate memory and instruct
> the kernel to charge this memory to the client’s memcg.
> 
> 2. Shared filesystem between subtasks of a large job
> Our infrastructure has large meta jobs such as kubernetes which
> spawn multiple subtasks that share a tmpfs mount. These jobs and
> their subtasks use that tmpfs mount for various purposes, such as
> data sharing or persisting data across subtask restarts. In
> kubernetes terminology, the meta job is similar to a pod and the
> subtasks are containers under the pod. We want the shared memory to
> be deterministically charged to the kubernetes pod, independent of
> the lifetime of the containers under the pod.
>
> 3. Shared libraries and language runtimes shared between independent jobs.
> We’d like to optimize memory usage on the machine by sharing
> libraries and language runtimes of many of the processes running on
> our machines in separate memcgs. This has the side effect that one
> job may be unlucky enough to be the first to access many of the
> libraries and may get oom killed as all the cached files get charged
> to it.
> 
> Design:
> My rough proposal to solve this problem is to simply add a
> ‘memcg=/path/to/memcg’ mount option for filesystems (namely tmpfs),
> directing all the memory of the file system to be ‘remote charged’
> to the cgroup provided by that memcg= option.

Could you be more specific about how this matches the above mentioned
usecases?

What would/should happen if the target memcg doesn't exist, or stops
existing, under the remote charger's feet?

> Caveats:
> 1. One complication to address is the behavior when the target memcg
> hits its memory.max limit because of remote charging. In this case
> the oom-killer will be invoked, but it may not find anything to kill
> in the target memcg being charged. In that case, I propose simply
> failing the remote charge, which will cause the process executing
> the remote charge to get an ENOMEM. This will be the documented
> behavior of remote charging.

Say you are in a page fault (#PF) path. If you just return ENOMEM
then you will get a system-wide OOM killer via
pagefault_out_of_memory. This is very likely not something you want,
right? Even if we remove this behavior, which is another story, then
the #PF would have no other way than to keep retrying, which doesn't
really look great either.

The only "reasonable" way I can see right now is to kill the remote
charging task. That might result in some other problems though.

> 2. I would like to provide an initial implementation that adds this
> support for tmpfs, while leaving the implementation generic enough for
> myself or others to extend to more filesystems where they find the
> feature useful.

How do you envision other filesystems would implement that? Should the
information be persisted in some way?

I didn't have time to give this a lot of thought and more questions will
likely come. My initial reaction is that this will open a lot of
interesting corner cases which will be hard to deal with.
-- 
Michal Hocko
SUSE Labs



* Re: [RFC Proposal] Deterministic memcg charging for shared memory
  2021-10-18 13:33 ` Michal Hocko
@ 2021-10-18 14:31   ` Mina Almasry
  2021-10-20  9:09     ` Michal Hocko
  0 siblings, 1 reply; 7+ messages in thread
From: Mina Almasry @ 2021-10-18 14:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Shakeel Butt, Greg Thelen, Johannes Weiner,
	Hugh Dickins, Tejun Heo, Linux-MM,
	open list:FILESYSTEMS (VFS and infrastructure),
	cgroups, riel

On Mon, Oct 18, 2021 at 6:33 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 13-10-21 12:23:19, Mina Almasry wrote:
> > Below is a proposal for deterministic charging of shared memory.
> > Please take a look and let me know if there are any major concerns:
> >
> > Problem:
> > Currently shared memory is charged to the memcg of the allocating
> > process. This makes memory usage of processes accessing shared memory
> > a bit unpredictable since whichever process accesses the memory first
> > will get charged. We have a number of use cases where our userspace
> > would like deterministic charging of shared memory:
> >
> > 1. System services allocating memory for client jobs:
> > We have services (namely a network access service[1]) that provide
> > functionality for clients running on the machine and allocate memory
> > to carry out these services. The memory usage of these services
> > depends on the number of jobs running on the machine and the nature of
> > the requests made to the service, which makes the memory usage of
> > these services hard to predict and thus hard to limit via memory.max.
> > These system services would like a way to allocate memory and instruct
> > the kernel to charge this memory to the client’s memcg.
> >
> > 2. Shared filesystem between subtasks of a large job
> > Our infrastructure has large meta jobs such as kubernetes which
> > spawn multiple subtasks that share a tmpfs mount. These jobs and
> > their subtasks use that tmpfs mount for various purposes, such as
> > data sharing or persisting data across subtask restarts. In
> > kubernetes terminology, the meta job is similar to a pod and the
> > subtasks are containers under the pod. We want the shared memory to
> > be deterministically charged to the kubernetes pod, independent of
> > the lifetime of the containers under the pod.
> >
> > 3. Shared libraries and language runtimes shared between independent jobs.
> > We’d like to optimize memory usage on the machine by sharing
> > libraries and language runtimes of many of the processes running on
> > our machines in separate memcgs. This has the side effect that one
> > job may be unlucky enough to be the first to access many of the
> > libraries and may get oom killed as all the cached files get charged
> > to it.
> >
> > Design:
> > My rough proposal to solve this problem is to simply add a
> > ‘memcg=/path/to/memcg’ mount option for filesystems (namely tmpfs),
> > directing all the memory of the file system to be ‘remote charged’
> > to the cgroup provided by that memcg= option.
>
> Could you be more specific about how this matches the above mentioned
> usecases?
>

For the use cases I've listed respectively:
1. Our network service would mount a tmpfs with 'memcg=<path to
client's memcg>'. Any memory the service is allocating on behalf of
the client, the service will allocate inside of this tmpfs mount, thus
charging it to the client's memcg without risk of hitting the
service's limit (a rough sketch of what this could look like follows
after this list).
2. The large job (kubernetes pod) would mount a tmpfs with
'memcg=<path to large job's memcg>'. It will then share this tmpfs
mount with the subtasks (containers in the pod). The subtasks can then
allocate memory in the tmpfs, having it charged to the kubernetes job,
without risk of hitting the container's limit.
3. We would need to extend this functionality to other file systems
backed by persistent disk, then mount that file system with
'memcg=<dedicated shared library memcg>'. Jobs can then use the shared
library and any
memory allocated due to loading the shared library is charged to a
dedicated memcg, and not charged to the job using the shared library.
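
To make usecase #1 a bit more concrete, the service side could look
roughly like the sketch below. It assumes a tmpfs has already been
mounted with memcg=<client's memcg> at /mnt/client1; the file name
and sizes are made up:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* Per-client buffer; its pages are charged to the client's memcg. */
        size_t len = 64 << 20;
        int fd = open("/mnt/client1/net-buffers", O_RDWR | O_CREAT, 0600);

        if (fd < 0 || ftruncate(fd, len)) {
                perror("open/ftruncate");
                return 1;
        }
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         fd, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        /* Faulting these pages in charges the client, not the service. */
        memset(buf, 0, len);
        /* ... use buf to service the client's network requests ... */
        munmap(buf, len);
        close(fd);
        return 0;
}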

> What would/should happen if the target memcg doesn't exist, or stops
> existing, under the remote charger's feet?
>

My thinking is that the tmpfs acts as a charge target for the memcg
and blocks the memcg from being removed until the tmpfs mount is
unmounted, similar to when a user tries to rmdir a memcg with some
processes still attached to it. But I don't feel strongly about this,
and I'm happy to go with another approach if you have a strong opinion
about this.

> > Caveats:
> > 1. One complication to address is the behavior when the target memcg
> > hits its memory.max limit because of remote charging. In this case
> > the oom-killer will be invoked, but it may not find anything to kill
> > in the target memcg being charged. In that case, I propose simply
> > failing the remote charge, which will cause the process executing
> > the remote charge to get an ENOMEM. This will be the documented
> > behavior of remote charging.
>
> Say you are in a page fault (#PF) path. If you just return ENOMEM
> then you will get a system-wide OOM killer via
> pagefault_out_of_memory. This is very likely not something you want,
> right? Even if we remove this behavior, which is another story, then
> the #PF would have no other way than to keep retrying, which doesn't
> really look great either.
>
> The only "reasonable" way I can see right now is to kill the remote
> charging task. That might result in some other problems though.
>

Yes! That's exactly what I was thinking, and from discussions with
userspace folks interested in this it doesn't seem like a problem.
We'd kill the remote charging task and make it clear in the
documentation that this is the behavior and that userspace is
responsible for working around it.

Worthy of mention is that if processes A and B are sharing memory via
a tmpfs, they can set memcg=<common ancestor memcg of A and B>. Thus
the memory is charged to a common ancestor of memcgs A and B and if
the common ancestor hits its limit the oom-killer will get invoked and
should always find something to kill. This will also be documented and
the userspace can choose to go this route if they don't want to risk
being killed on pagefault.

> > 2. I would like to provide an initial implementation that adds this
> > support for tmpfs, while leaving the implementation generic enough for
> > myself or others to extend to more filesystems where they find the
> > feature useful.
>
> How do you envision other filesystems would implement that? Should the
> information be persisted in some way?
>

Yes, my initial implementation has a struct mem_cgroup pointer
hanging off the super block that is the memcg to charge, but I can
move it if there is somewhere else you feel is appropriate once I
send out the patches.
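
To give a rough idea of what I mean (field placement and naming are
very much up for discussion, so please treat this as illustration
rather than the actual patch):

struct shmem_sb_info {
        ...                             /* existing fields elided */
        struct mem_cgroup *memcg;       /* target of memcg=, or NULL */
};

The charge paths would then charge sbinfo->memcg, when it is set,
instead of the allocating task's memcg.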

> I didn't have time to give this a lot of thought and more questions will
> likely come. My initial reaction is that this will open a lot of
> interesting corner cases which will be hard to deal with.

Thank you very much for your review so far and please let me know if
you think of any more issues. My feeling is that hitting the remote
memcg limit and the oom-killing behavior surrounding that is by far
the most contentious issue. You don't seem completely revolted by what
I'm proposing there so I'm somewhat optimistic we can deal with the
rest of the corner cases :-)

> --
> Michal Hocko
> SUSE Labs



* Re: [RFC Proposal] Deterministic memcg charging for shared memory
  2021-10-18 14:31   ` Mina Almasry
@ 2021-10-20  9:09     ` Michal Hocko
  2021-10-20 17:46       ` Mina Almasry
  2021-10-20 19:02       ` Theodore Ts'o
  0 siblings, 2 replies; 7+ messages in thread
From: Michal Hocko @ 2021-10-20  9:09 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Roman Gushchin, Shakeel Butt, Greg Thelen, Johannes Weiner,
	Hugh Dickins, Tejun Heo, Linux-MM,
	open list:FILESYSTEMS (VFS and infrastructure),
	cgroups, riel

On Mon 18-10-21 07:31:58, Mina Almasry wrote:
> On Mon, Oct 18, 2021 at 6:33 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 13-10-21 12:23:19, Mina Almasry wrote:
> > > Below is a proposal for deterministic charging of shared memory.
> > > Please take a look and let me know if there are any major concerns:
> > >
> > > Problem:
> > > Currently shared memory is charged to the memcg of the allocating
> > > process. This makes memory usage of processes accessing shared memory
> > > a bit unpredictable since whichever process accesses the memory first
> > > will get charged. We have a number of use cases where our userspace
> > > would like deterministic charging of shared memory:
> > >
> > > 1. System services allocating memory for client jobs:
> > > We have services (namely a network access service[1]) that provide
> > > functionality for clients running on the machine and allocate memory
> > > to carry out these services. The memory usage of these services
> > > depends on the number of jobs running on the machine and the nature of
> > > the requests made to the service, which makes the memory usage of
> > > these services hard to predict and thus hard to limit via memory.max.
> > > These system services would like a way to allocate memory and instruct
> > > the kernel to charge this memory to the client’s memcg.
> > >
> > > 2. Shared filesystem between subtasks of a large job
> > > Our infrastructure has large meta jobs such as kubernetes which
> > > spawn multiple subtasks that share a tmpfs mount. These jobs and
> > > their subtasks use that tmpfs mount for various purposes, such as
> > > data sharing or persisting data across subtask restarts. In
> > > kubernetes terminology, the meta job is similar to a pod and the
> > > subtasks are containers under the pod. We want the shared memory to
> > > be deterministically charged to the kubernetes pod, independent of
> > > the lifetime of the containers under the pod.
> > >
> > > 3. Shared libraries and language runtimes shared between independent jobs.
> > > We’d like to optimize memory usage on the machine by sharing
> > > libraries and language runtimes of many of the processes running on
> > > our machines in separate memcgs. This has the side effect that one
> > > job may be unlucky enough to be the first to access many of the
> > > libraries and may get oom killed as all the cached files get charged
> > > to it.
> > >
> > > Design:
> > > My rough proposal to solve this problem is to simply add a
> > > ‘memcg=/path/to/memcg’ mount option for filesystems (namely tmpfs),
> > > directing all the memory of the file system to be ‘remote charged’
> > > to the cgroup provided by that memcg= option.
> >
> > Could you be more specific about how this matches the above mentioned
> > usecases?
> >
> 
> For the use cases I've listed respectively:
> 1. Our network service would mount a tmpfs with 'memcg=<path to
> client's memcg>'. Any memory the service is allocating on behalf of
> the client, the service will allocate inside of this tmpfs mount, thus
> charging it to the client's memcg without risk of hitting the
> service's limit.
> 2. The large job (kubernetes pod) would mount a tmpfs with
> 'memcg=<path to large job's memcg>'. It will then share this tmpfs
> mount with the subtasks (containers in the pod). The subtasks can then
> allocate memory in the tmpfs, having it charged to the kubernetes job,
> without risk of hitting the container's limit.

There is still a risk that the limit is hit for the memcg of the
shmem owner, right? What happens then? Isn't any of the shmem
consumers a DoS attack vector for everybody else consuming from that
same target memcg? In other words, aren't all of them effectively in
the same memory resource domain? If we allow the target memcg to
live outside of that resource domain then this opens interesting
questions about resource control in general, no? Something the
unified hierarchy was aiming to fix wrt cgroup v1.

You are saying that it is hard to properly set limits for the
respective services, but this would simply allow hiding a part of
the consumption somewhere else. Aren't you just shifting the problem
elsewhere? How do you configure the target memcg?

Do you have any numbers about the consumption variation and how big of a
problem that is in practice?

> 3. We would need to extend this functionality to other file systems
> backed by persistent disk, then mount that file system with
> 'memcg=<dedicated shared library memcg>'. Jobs can then use the shared
> library and any
> memory allocated due to loading the shared library is charged to a
> dedicated memcg, and not charged to the job using the shared library.

This is more of a question for fs people. My understanding is rather
limited so I cannot even imagine all the possible setups but just from
a very high level understanding bind mounts can get really interesting.
Can those disagree on the memcg? 

I am pretty sure I didn't get to think through this very deeply, my gut
feeling tells me that this will open many interesting questions and I am
not sure whether it solves more problems than it introduces at this moment.
I would be really curious what others think about this.
-- 
Michal Hocko
SUSE Labs



* Re: [RFC Proposal] Deterministic memcg charging for shared memory
  2021-10-20  9:09     ` Michal Hocko
@ 2021-10-20 17:46       ` Mina Almasry
  2021-10-20 19:02       ` Theodore Ts'o
  1 sibling, 0 replies; 7+ messages in thread
From: Mina Almasry @ 2021-10-20 17:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Shakeel Butt, Greg Thelen, Johannes Weiner,
	Hugh Dickins, Tejun Heo, Linux-MM,
	open list:FILESYSTEMS (VFS and infrastructure),
	cgroups, riel

On Wed, Oct 20, 2021 at 2:09 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 18-10-21 07:31:58, Mina Almasry wrote:
> > On Mon, Oct 18, 2021 at 6:33 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 13-10-21 12:23:19, Mina Almasry wrote:
> > > > Below is a proposal for deterministic charging of shared memory.
> > > > Please take a look and let me know if there are any major concerns:
> > > >
> > > > Problem:
> > > > Currently shared memory is charged to the memcg of the allocating
> > > > process. This makes memory usage of processes accessing shared memory
> > > > a bit unpredictable since whichever process accesses the memory first
> > > > will get charged. We have a number of use cases where our userspace
> > > > would like deterministic charging of shared memory:
> > > >
> > > > 1. System services allocating memory for client jobs:
> > > > We have services (namely a network access service[1]) that provide
> > > > functionality for clients running on the machine and allocate memory
> > > > to carry out these services. The memory usage of these services
> > > > depends on the number of jobs running on the machine and the nature of
> > > > the requests made to the service, which makes the memory usage of
> > > > these services hard to predict and thus hard to limit via memory.max.
> > > > These system services would like a way to allocate memory and instruct
> > > > the kernel to charge this memory to the client’s memcg.
> > > >
> > > > 2. Shared filesystem between subtasks of a large job
> > > > Our infrastructure has large meta jobs such as kubernetes which
> > > > spawn multiple subtasks that share a tmpfs mount. These jobs and
> > > > their subtasks use that tmpfs mount for various purposes, such as
> > > > data sharing or persisting data across subtask restarts. In
> > > > kubernetes terminology, the meta job is similar to a pod and the
> > > > subtasks are containers under the pod. We want the shared memory to
> > > > be deterministically charged to the kubernetes pod, independent of
> > > > the lifetime of the containers under the pod.
> > > >
> > > > 3. Shared libraries and language runtimes shared between independent jobs.
> > > > We’d like to optimize memory usage on the machine by sharing
> > > > libraries and language runtimes of many of the processes running on
> > > > our machines in separate memcgs. This has the side effect that one
> > > > job may be unlucky enough to be the first to access many of the
> > > > libraries and may get oom killed as all the cached files get charged
> > > > to it.
> > > >
> > > > Design:
> > > > My rough proposal to solve this problem is to simply add a
> > > > ‘memcg=/path/to/memcg’ mount option for filesystems (namely tmpfs),
> > > > directing all the memory of the file system to be ‘remote charged’
> > > > to the cgroup provided by that memcg= option.
> > >
> > > Could you be more specific about how this matches the above mentioned
> > > usecases?
> > >
> >
> > For the use cases I've listed respectively:
> > 1. Our network service would mount a tmpfs with 'memcg=<path to
> > client's memcg>'. Any memory the service is allocating on behalf of
> > the client, the service will allocate inside of this tmpfs mount, thus
> > charging it to the client's memcg without risk of hitting the
> > service's limit.
> > 2. The large job (kubernetes pod) would mount a tmpfs with
> > 'memcg=<path to large job's memcg>'. It will then share this tmpfs
> > mount with the subtasks (containers in the pod). The subtasks can then
> > allocate memory in the tmpfs, having it charged to the kubernetes job,
> > without risk of hitting the container's limit.
>
> There is still a risk that the limit is hit for the memcg of the
> shmem owner, right? What happens then? Isn't any of the shmem
> consumers a DoS attack vector for everybody else consuming from that
> same target memcg?

This is an interesting point and thanks for bringing it up. I think
there are a couple of things we can do about that:

1. Only allow root to mount a tmpfs with memcg=<anything>
2. Only processes allowed to enter the cgroup at mount time can mount
a tmpfs with memcg=<cgroup>

Both address this DoS attack vector adequately, I think. I have a
strong preference for solution (2), as it keeps the feature useful for
non-admins while still completely addressing the concern AFAICT.
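
To make (2) a little more concrete, the mount-time check could look
roughly like the sketch below. task_may_enter_memcg() is made up here
(it would be whatever permission check a write to cgroup.procs
already does); the cgroup lookup helpers should be existing APIs:

static struct mem_cgroup *memcg_from_mount_option(const char *path)
{
        struct cgroup *cgrp;
        struct cgroup_subsys_state *css;
        struct mem_cgroup *memcg;

        cgrp = cgroup_get_from_path(path);
        if (IS_ERR(cgrp))
                return ERR_CAST(cgrp);

        /* Grab the memory controller css for that cgroup. */
        css = cgroup_get_e_css(cgrp, &memory_cgrp_subsys);
        cgroup_put(cgrp);
        if (!css)
                return ERR_PTR(-ENOENT);
        memcg = mem_cgroup_from_css(css);

        /* Hypothetical: same permission as moving current into it. */
        if (!task_may_enter_memcg(current, memcg)) {
                css_put(css);
                return ERR_PTR(-EPERM);
        }

        return memcg;   /* reference held until unmount */
}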

> In other words, aren't all of them effectively in the same memory
> resource domain? If we allow the target memcg to live outside of
> that resource domain then this opens interesting questions about
> resource control in general, no? Something the unified hierarchy was
> aiming to fix wrt cgroup v1.
>

In my very humble opinion there are valid reasons for processes inside
the same resource domain to want the memory to be charged to one of
them and not the others, and there are valid reasons for the target
memcg living outside the resource domain without really fundamentally
breaking resource control. So breaking it down by use case:

W.r.t. usecase #1: our network service needs to allocate memory to
provide networking services to any job running on the system,
regardless of which cgroup that job is in. You could say it's
allocating memory to a cgroup outside its resource domain but IMHO
it's a bit of a grey area. After all the network service is servicing
network requests for jobs inside of that resource domain, and
allocating memory on behalf of those jobs and not for its 'own'
purposes in a sense.

W.r.t. usecase #2: the large job (the kubernetes pod) and all its
sub-tasks (the containers) are all in the same resource domain, but
we would still like the shared memory to be deterministically charged
to the large job rather than ending up as part of the sub-task usage.
I don't have concrete numbers (I'll work on getting them), but one
can imagine a large job which owns a 10GB tmpfs mount. The sub-tasks
each use, say, at most 10MB, but they need access to the shared 10GB
tmpfs mount. In that case the sub-tasks will get charged for the 10MB
they privately use and anywhere between 0-10GB for the shared memory.
In situations like these we'd like to enforce that subtasks only use
10MB of 'their' memory and that the total memory usage of the entire
job doesn't go beyond a certain limit, say 10.5GB or something. There
is no way to do that today AFAIK, but with deterministic shared
memory charging this is possible.

> You are saying that it is hard to properly set limits for the
> respective services, but this would simply allow hiding a part of
> the consumption somewhere else. Aren't you just shifting the problem
> elsewhere? How do you configure the target memcg?
>
> Do you have any numbers about the consumption variation and how big of a
> problem that is in practice?
>

So for use case #1 I have concrete numbers. Our network service
roughly allocates around 75MB/VM running on the machine. The exact
number depends on the network activity of the VM and this is just a
rough estimate. There are anywhere between 0 and 128 VMs running on
the machine. So the memory usage of the network service can be
anywhere from almost 0 to 9.6GB. Statically deciding the memory limit
of the
network service's cgroup is not possible. Increasing/decreasing the
limit when new VMs are scheduled on the machine is also not very
workable, as the amount of memory the service uses depends on the
network activity of the VM. Getting the limit wrong has disastrous
consequences as the service is unable to serve the entire machine.
With memcg= this becomes a trivial problem as each VM pads its memory
usage with extra headroom for network activity. If the headroom is
insufficient only that one VM is deprived of network services, not the
entire machine.

W.r.t usecase #2, I don't have concrete numbers, but I've outlined
above one (hopefully) reasonable scenario where we'd have an issue.

> > 3. We would need to extend this functionality to other file systems
> > backed by persistent disk, then mount that file system with
> > 'memcg=<dedicated shared library memcg>'. Jobs can then use the shared
> > library and any
> > memory allocated due to loading the shared library is charged to a
> > dedicated memcg, and not charged to the job using the shared library.
>
> This is more of a question for fs people. My understanding is rather
> limited so I cannot even imagine all the possible setups but just from
> a very high level understanding bind mounts can get really interesting.
> Can those disagree on the memcg?
>

I will admit I've thought about the tmpfs support as much as possible,
and that extending the support to other file systems will generate
more concerns, maybe even concerns specific to each file system. I'm
hoping
to build support for tmpfs and leave the implementation generic enough
to be extended afterwards.

I haven't thought about it thoroughly, but I would imagine bind mounts
shouldn't be able to disagree on the memcg. At least right now I can't
think of a compelling use case for that.

> I am pretty sure I didn't get to think through this very deeply, my gut
> feeling tells me that this will open many interesting questions and I am
> not sure whether it solves more problems than it introduces at this moment.
> I would be really curious what others think about this.

I would also obviously love to hear opinions of others (and thanks
Michal for taking the time to review). I think I'll work on posting a
patch series to the list and hope that solicits more opinions.

> --
> Michal Hocko
> SUSE Labs



* Re: [RFC Proposal] Deterministic memcg charging for shared memory
  2021-10-20  9:09     ` Michal Hocko
  2021-10-20 17:46       ` Mina Almasry
@ 2021-10-20 19:02       ` Theodore Ts'o
  1 sibling, 0 replies; 7+ messages in thread
From: Theodore Ts'o @ 2021-10-20 19:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mina Almasry, Roman Gushchin, Shakeel Butt, Greg Thelen,
	Johannes Weiner, Hugh Dickins, Tejun Heo, Linux-MM,
	open list:FILESYSTEMS (VFS and infrastructure),
	cgroups, riel

On Wed, Oct 20, 2021 at 11:09:07AM +0200, Michal Hocko wrote:
> > 3. We would need to extend this functionality to other file systems
> > backed by persistent disk, then mount that file system with
> > 'memcg=<dedicated shared library memcg>'. Jobs can then use the shared
> > library and any
> > memory allocated due to loading the shared library is charged to a
> > dedicated memcg, and not charged to the job using the shared library.
> 
> This is more of a question for fs people. My understanding is rather
> limited so I cannot even imagine all the possible setups but just from
> a very high level understanding bind mounts can get really interesting.
> Can those disagree on the memcg? 
> 
> I am pretty sure I didn't get to think through this very deeply, my gut
> feeling tells me that this will open many interesting questions and I am
> not sure whether it solves more problems than it introduces at this moment.
> I would be really curious what others think about this.

My understanding of the proposal is that the mount option would be on
the superblock, and it would not be a per-bind-mount option, ala the
ro mount option.  In other words, the designation of the target memcg
for which all tmpfs files would be charged would be something that
would be stored in the struct super.  I'm also going to assume that
the only thing that gets charged is memory for files that are backed
on the tmpfs.  So for example, if there is a MAP_PRIVATE mapping, the
base page would have been charged to the target memcg when the file
was originally created.  However, if the process tries to modify a
private mapping, then the page allocated for the copy-on-write would
get charged to the process's memcg, and not to the tmpfs's target
memcg.
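
To make that concrete, a toy program touching a file on such a tmpfs
would behave like this under those assumptions (the path is made up):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* The file lives on a tmpfs mounted with memcg=<target memcg>. */
        int fd = open("/mnt/shared/data", O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE,
                       fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        char c = p[0];  /* read fault: the backing tmpfs page is charged
                           to the target memcg from the mount option */
        p[0] = c + 1;   /* write fault: the private copy-on-write page
                           gets charged to this process's own memcg */
        munmap(p, 4096);
        close(fd);
        return 0;
}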

If we make these simplifying assumptions, then it should be fairly
simple.  Essentially, the model is that whenever we do the file system
equivalent of "block allocation", all of the pages associated with
that file system are charged to the target memcg.  That's pretty
straightforward, and is pretty easy to model and anticipate.

In fact, if the only use case was #3 (shared libraries and language
runtimes), this workload could be accommodated without needing any
kernel changes.  This could be done by simply having the setup process
run in the "target memcg" and copy all of the shared libraries and
runtime files into the tmpfs at setup time.  Those pages would get
charged to the memcg which first allocated the files, and that would
be the setup memcg.  Then, when the Kubernetes containers that use
these shared libraries and language runtimes map those pages read-only
into their task processes, the tmpfs pages are already charged to the
setup memcg, so they won't get charged to the task containers.  And I
*do* believe that it's much easier to anticipate how much memory will
be used by these shared files.  Today we have to give every task
container enough memory quota to cover the case where it happens to be
the first one launched and gets charged with all of the memory while
all of the other containers freeload off it; with a setup memcg we
don't need to provision for that worst case in each container.
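
Just to spell out that setup-time approach (paths are made up and the
copy loop is simplified):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Minimal copy loop; error handling trimmed for brevity. */
static int copy_file(const char *src, const char *dst)
{
        char buf[65536];
        ssize_t n;
        int in = open(src, O_RDONLY);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (in < 0 || out < 0)
                return -1;
        while ((n = read(in, buf, sizeof(buf))) > 0)
                if (write(out, buf, n) != n)
                        return -1;
        close(in);
        close(out);
        return n < 0 ? -1 : 0;
}

int main(void)
{
        /* Move ourselves into the setup memcg ("0" means the writer). */
        int fd = open("/sys/fs/cgroup/setup/cgroup.procs", O_WRONLY);

        if (fd < 0 || write(fd, "0", 1) != 1) {
                perror("cgroup.procs");
                return 1;
        }
        close(fd);

        /*
         * Pages allocated while populating the shared tmpfs are charged
         * to the setup memcg, with no new kernel features needed.
         */
        return copy_file("/usr/lib/libfoo.so",
                         "/mnt/shared-libs/libfoo.so") ? 1 : 0;
}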

Cheers,

						- Ted



