linux-mm.kvack.org archive mirror
* Memory reclaim protection and cgroup nesting (desktop use)
@ 2020-03-04  9:44 Benjamin Berg
  2020-03-04 16:30 ` Tejun Heo
  0 siblings, 1 reply; 10+ messages in thread
From: Benjamin Berg @ 2020-03-04  9:44 UTC (permalink / raw)
  To: cgroups, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3666 bytes --]

Hi,

TL;DR: I seem to need memory.min/memory.low to be set on each child
cgroup and not just the parents. Is this expected?


I have been experimenting with using cgroups to protect a GNOME
session. The intention is that GNOME Shell itself and other important
services remain responsive, even if the application workload is
thrashing. The long-term goal is to bridge the time until an OOM killer
like oomd, guided by memory pressure information, gets the system back
into normal conditions.

Note that I have done these tests without any swap and with huge
memory.min/memory.low values. I consider this scenario pathological,
however, it seems like a reasonable way to really exercise the cgroup
reclaim protection logic.

The resulting cgroup hierarchy looked something like:

-.slice
├─user.slice
│ └─user-1000.slice
│   ├─user@1000.service
│   │ ├─session.slice
│   │ │ ├─gsd-*.service
│   │ │ │ └─208803 /usr/libexec/gsd-rfkill
│   │ │ ├─gnome-shell-wayland.service
│   │ │ │ ├─208493 /usr/bin/gnome-shell
│   │ │ │ ├─208549 /usr/bin/Xwayland :0 -rootless -noreset -accessx -core -auth /run/user/1000/.mutter-Xwayla>
│   │ │ │ └─ …
│   │ └─apps.slice
│   │   ├─gnome-launched-tracker-miner-fs.desktop-208880.scope
│   │   │ └─208880 /usr/libexec/tracker-miner-fs
│   │   ├─dbus-:1.2-org.gnome.OnlineAccounts@0.service
│   │   │ └─208668 /usr/libexec/goa-daemon
│   │   ├─flatpak-org.gnome.Fractal-210350.scope
│   │   ├─gnome-terminal-server.service
│   │   │ ├─209261 /usr/libexec/gnome-terminal-server
│   │   │ ├─209434 bash
│   │   │ └─ … including the test load, i.e. "make -j32" of a C++ codebase


I also enabled the CPU and IO controllers in my tests, but I don't
think that is as relevant. The main thing is that I set
  memory.min: 2GiB
  memory.low: 4GiB

using systemd on all of

 * user.slice,
 * user-1000.slice,
 * user@1000.service,
 * session.slice and
 * everything inside session.slice
   (i.e. gnome-shell-wayland.service, gsd-*.service, …)

excluding apps.slice from protection.

(In a realistic scenario I expect to have swap and then reserving maybe
a few hundred MiB; DAMON might help with finding good values.)
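
For reference, expressed as systemctl calls the protection setup above
would look roughly like the sketch below (whether this goes through
"systemctl set-property" or unit drop-ins is an implementation detail,
and the --user split for the delegated part is my assumption here):

  systemctl set-property user.slice MemoryMin=2G MemoryLow=4G
  systemctl set-property user-1000.slice MemoryMin=2G MemoryLow=4G
  systemctl set-property user@1000.service MemoryMin=2G MemoryLow=4G
  systemctl --user set-property session.slice MemoryMin=2G MemoryLow=4G
  systemctl --user set-property gnome-shell-wayland.service MemoryMin=2G MemoryLow=4G
  # apps.slice deliberately gets no MemoryMin/MemoryLow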


At that point, the protection started working pretty much flawlessly,
i.e. my gnome-shell would continue to run without major page faulting
even though everything in apps.slice was thrashing heavily. The
mouse/keyboard remained completely responsive, and interacting with
applications worked much better thanks to knowing where input was
going, even if the applications themselves took seconds to react.

So far, so good. What surprises me is that I needed to set the
protection on the child cgroups (i.e. gnome-shell-wayland.service).
Without this, it would not work (reliably) and my gnome-shell would
still have a lot of re-faults to load libraries and other mmap'ed data
back into memory (I used "perf trace --no-syscalls -F" to trace this
and observed repeated faults on the same pages, e.g. when loading
functions for execution).
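
For completeness, the fault tracing boils down to "perf trace" with
page-fault events enabled, along the lines of the sketch below (the PID
lookup is only illustrative):

  perf trace --no-syscalls --pf maj -p $(pidof gnome-shell)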

Due to accounting effects, I would expect re-faults to happen at most
once in this scenario. At that point the page in question would be
accounted against the shell's cgroup and reclaim protection could kick
in. Unfortunately, that did not seem to happen unless the shell's
cgroup itself had protections and not just all of its parents.

Is it expected that I need to set limits on each child?

Benjamin

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: Memory reclaim protection and cgroup nesting (desktop use)
  2020-03-04  9:44 Memory reclaim protection and cgroup nesting (desktop use) Benjamin Berg
@ 2020-03-04 16:30 ` Tejun Heo
  2020-03-04 17:02   ` Jonathan Corbet
  2020-03-05 13:13   ` Benjamin Berg
  0 siblings, 2 replies; 10+ messages in thread
From: Tejun Heo @ 2020-03-04 16:30 UTC (permalink / raw)
  To: Benjamin Berg; +Cc: cgroups, linux-mm, Johannes Weiner

Hello,

(cc'ing Johannes and quoting whole msg)

On Wed, Mar 04, 2020 at 10:44:44AM +0100, Benjamin Berg wrote:
> Hi,
> 
> TL;DR: I seem to need memory.min/memory.low to be set on each child
> cgroup and not just the parents. Is this expected?

Yes, currently. However, v5.7+ will have a cgroup2 mount option to
propagate protection automatically.

  https://lore.kernel.org/linux-mm/20191219200718.15696-4-hannes@cmpxchg.org/
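
FWIW the option added by that series is "memory_recursiveprot"; on a
fresh cgroup2 mount, enabling it would look roughly like this (sketch,
mount point is just the usual one):

  mount -t cgroup2 -o memory_recursiveprot none /sys/fs/cgroup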

> I have been experimenting with using cgroups to protect a GNOME
> session. The intention is that the GNOME Shell itself and important
> other services remain responsive, even if the application workload is
> thrashing. The long term goal here is to bridge the time until an OOM
> killer like oomd would get the system back into normal conditions using
> memory pressure information.
> 
> Note that I have done these tests without any swap and with huge
> memory.min/memory.low values. I consider this scenario pathological,
> however, it seems like a reasonable way to really exercise the cgroup
> reclaim protection logic.

It's incomplete and more brittle in that the kernel has to treat a
large portion of memory usage as essentially memlocked.

> The resulting cgroup hierarchy looked something like:
> 
> -.slice
> ├─user.slice
> │ └─user-1000.slice
> │   ├─user@1000.service
> │   │ ├─session.slice
> │   │ │ ├─gsd-*.service
> │   │ │ │ └─208803 /usr/libexec/gsd-rfkill
> │   │ │ ├─gnome-shell-wayland.service
> │   │ │ │ ├─208493 /usr/bin/gnome-shell
> │   │ │ │ ├─208549 /usr/bin/Xwayland :0 -rootless -noreset -accessx -core -auth /run/user/1000/.mutter-Xwayla>
> │   │ │ │ └─ …
> │   │ └─apps.slice
> │   │   ├─gnome-launched-tracker-miner-fs.desktop-208880.scope
> │   │   │ └─208880 /usr/libexec/tracker-miner-fs
> │   │   ├─dbus-:1.2-org.gnome.OnlineAccounts@0.service
> │   │   │ └─208668 /usr/libexec/goa-daemon
> │   │   ├─flatpak-org.gnome.Fractal-210350.scope
> │   │   ├─gnome-terminal-server.service
> │   │   │ ├─209261 /usr/libexec/gnome-terminal-server
> │   │   │ ├─209434 bash
> │   │   │ └─ … including the test load, i.e. "make -j32" of a C++ codebase
> 
> 
> I also enabled the CPU and IO controllers in my tests, but I don't
> think that is as relevant. The main thing is that I set

CPU control isn't but IO is. Without working IO isolation, it's
relatively easy to drive the system into the ground given enough
stress outside the protected area.

>   memory.min: 2GiB
>   memory.low: 4GiB
> 
> using systemd on all of
> 
>  * user.slice,
>  * user-1000.slice,
>  * user@1000.service,
>  * session.slice and
>  * everything inside session.slice
>    (i.e. gnome-shell-wayland.service, gsd-*.service, …)
> 
> excluding apps.slice from protection.
> 
> (In a realistic scenario I expect to have swap and then reserving maybe
> a few hundred MiB; DAMON might help with finding good values.)

What's DAMON?

> At that point, the protection started working pretty much flawlessly.
> i.e. my gnome-shell would continue to run without major page faulting
> even though everything in apps.slice was thrashing heavily. The
> mouse/keyboard remained completely responsive, and interacting with
> applications ended up working much better thanks to knowing where input
> was going. Even if the applications themselves took seconds to react.
> 
> So far, so good. What surprises me is that I needed to set the
> protection on the child cgroups (i.e. gnome-shell-wayland.service).
> Without this, it would not work (reliably) and my gnome-shell would
> still have a lot of re-faults to load libraries and other mmap'ed data
> back into memory (I used "perf trace --no-syscalls -F" to trace this
> and observed repeated faults on the same pages, e.g. when loading
> functions for execution).
> 
> Due to accounting effects, I would expect re-faults to happen up to one
> time in this scenario. At that point the page in question will be
> accounted against the shell's cgroup and reclaim protection could kick
> in. Unfortunately, that did not seem to happen unless the shell's
> cgroup itself had protections and not just all of its parents.
> 
> Is it expected that I need to set limits on each child?

Yes, right now, memory.low needs to be configured all the way down to
the leaf to be effective, which can be rather cumbersome. As written
above, future kernels will be easier to work with in this respect.

Thanks.

-- 
tejun



* Re: Memory reclaim protection and cgroup nesting (desktop use)
  2020-03-04 16:30 ` Tejun Heo
@ 2020-03-04 17:02   ` Jonathan Corbet
  2020-03-04 17:10     ` Tejun Heo
  2020-03-05 13:13   ` Benjamin Berg
  1 sibling, 1 reply; 10+ messages in thread
From: Jonathan Corbet @ 2020-03-04 17:02 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Benjamin Berg, cgroups, linux-mm, Johannes Weiner

On Wed, 4 Mar 2020 11:30:44 -0500
Tejun Heo <tj@kernel.org> wrote:

> > (In a realistic scenario I expect to have swap and then reserving maybe
> > a few hundred MiB; DAMON might help with finding good values.)  
> 
> What's DAMON?

	https://lwn.net/Articles/812707/

jon



* Re: Memory reclaim protection and cgroup nesting (desktop use)
  2020-03-04 17:02   ` Jonathan Corbet
@ 2020-03-04 17:10     ` Tejun Heo
  2020-03-05  1:29       ` Daniel Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2020-03-04 17:10 UTC (permalink / raw)
  To: Jonathan Corbet; +Cc: Benjamin Berg, cgroups, linux-mm, Johannes Weiner

On Wed, Mar 04, 2020 at 10:02:31AM -0700, Jonathan Corbet wrote:
> On Wed, 4 Mar 2020 11:30:44 -0500
> Tejun Heo <tj@kernel.org> wrote:
> 
> > > (In a realistic scenario I expect to have swap and then reserving maybe
> > > a few hundred MiB; DAMON might help with finding good values.)  
> > 
> > What's DAMON?
> 
> 	https://lwn.net/Articles/812707/

Ah, thanks a lot for the link. That's neat. For determining workload
size, we're using senpai:

 https://github.com/facebookincubator/senpai

Thanks.

-- 
tejun



* Re: Memory reclaim protection and cgroup nesting (desktop use)
  2020-03-04 17:10     ` Tejun Heo
@ 2020-03-05  1:29       ` Daniel Xu
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel Xu @ 2020-03-05  1:29 UTC (permalink / raw)
  To: Tejun Heo, Jonathan Corbet
  Cc: Benjamin Berg, cgroups, linux-mm, Johannes Weiner

On Wed, Mar 4, 2020, at 9:10 AM, Tejun Heo wrote:
> On Wed, Mar 04, 2020 at 10:02:31AM -0700, Jonathan Corbet wrote:
> > On Wed, 4 Mar 2020 11:30:44 -0500
> > Tejun Heo <tj@kernel.org> wrote:
> > 
> > > > (In a realistic scenario I expect to have swap and then reserving maybe
> > > > a few hundred MiB; DAMON might help with finding good values.)  
> > > 
> > > What's DAMON?
> > 
> > 	https://lwn.net/Articles/812707/
> 
> Ah, thanks a lot for the link. That's neat. For determining workload
> size, we're using senpai:
> 
>  https://github.com/facebookincubator/senpai
[...]

For reference, senpai is currently deployed as an oomd plugin:
https://github.com/facebookincubator/oomd/blob/master/src/oomd/plugins/Senpai.cpp .

The python version might be a bit out of date -- Johannes probably knows.



* Re: Memory reclaim protection and cgroup nesting (desktop use)
  2020-03-04 16:30 ` Tejun Heo
  2020-03-04 17:02   ` Jonathan Corbet
@ 2020-03-05 13:13   ` Benjamin Berg
  2020-03-05 14:55     ` Tejun Heo
  1 sibling, 1 reply; 10+ messages in thread
From: Benjamin Berg @ 2020-03-05 13:13 UTC (permalink / raw)
  To: Tejun Heo; +Cc: cgroups, linux-mm, Johannes Weiner

[-- Attachment #1: Type: text/plain, Size: 7775 bytes --]

On Wed, 2020-03-04 at 11:30 -0500, Tejun Heo wrote:
> Hello,
> 
> (cc'ing Johannes and quoting whole msg)
> 
> On Wed, Mar 04, 2020 at 10:44:44AM +0100, Benjamin Berg wrote:
> > Hi,
> > 
> > TL;DR: I seem to need memory.min/memory.low to be set on each child
> > cgroup and not just the parents. Is this expected?
> 
> Yes, currently. However, v5.7+ will have a cgroup2 mount option to
> propagate protection automatically.
> 
>   https://lore.kernel.org/linux-mm/20191219200718.15696-4-hannes@cmpxchg.org/

Oh, that is an interesting discussion/development, thanks for the
pointer!

I think that mount option is great. It is also interesting to see that
I could achieve a similar effect by disabling the memory controller.
That makes sense, but it had not occurred to me.
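
If I understand it right, that would amount to something like the
following sketch (path is illustrative): not enabling the memory
controller for the children of session.slice, so that their usage is
accounted to, and protected at, session.slice itself.

  echo "-memory" > /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/session.slice/cgroup.subtree_control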


A major discussion point seemed to be that cgroups should be grouped by
their resource management needs rather than a logical hierarchy. I
think that the resource management needs actually map well enough to
the logical hierarchy in our case. The hierarchy looks like:

                         root
                       /     \
           system.slice       user.slice
          /    |              |         \
      cron  journal    user-1000.slice   user-1001.slice
                              |                      \
                      user@1000.service            [SAME]
                        |          |
                   apps.slice   session.slice
                       |             |
                  unprotected    protected

Where the tree underneath user@1000.service is managed separately from
the rest. Moving parts of this tree to other locations would require
interactions between two systemd processes.

But, all layers can actually have a purpose:
 * user.slice:
   - Admin/distribution defines resource guarantees/constraints that
     apply to all users as a group.
     (Assuming a single user system, the same guarantees as below.)
   - i.e. it defines the separation of system services vs. user.
 * user-.slice:
   - Admin defines resource guarantees/constraints for a single user
   - A separate policy dynamically changes this depending on whether
     the user is currently active (on seat0), ensuring that only the
      active user benefits.
 * user@1000.service:
   - Nothing needed here, just the point where resource management
     is delegated to the user's control.
 * session.slice:
   - User/session manages resources to ensure UI responsiveness

I think this actually makes sense, both from a hierarchical point of
view and for configuring resources. In particular the user-.slice
layer is important, because this grouping allows us to dynamically
adjust resource management. The obvious thing we can do there is to
prioritise the currently active user while also lowering resource
allocations for inactive users (e.g. graphical greeter still running in
the background).
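
Concretely, the policy daemon could do something along these lines when
the active seat changes (values are purely illustrative):

  systemctl set-property user-1000.slice MemoryLow=4G CPUWeight=500 IOWeight=500
  systemctl set-property user-1001.slice MemoryLow=0 CPUWeight=50 IOWeight=50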

Note that from my point of view the scenario that most concerns me is
resource competition between session.slice and its siblings. This
makes the hierarchy above even less important; we just need to give the
user enough control to do resource allocations within their own
subtree.

So, it seems to me that the suggested mount option should work well in
our scenario.

Benjamin


> > I have been experimenting with using cgroups to protect a GNOME
> > session. The intention is that the GNOME Shell itself and important
> > other services remain responsive, even if the application workload is
> > thrashing. The long term goal here is to bridge the time until an OOM
> > killer like oomd would get the system back into normal conditions using
> > memory pressure information.
> > 
> > Note that I have done these tests without any swap and with huge
> > memory.min/memory.low values. I consider this scenario pathological,
> > however, it seems like a reasonable way to really exercise the cgroup
> > reclaim protection logic.
> 
> It's incomplete and more brittle in that the kernel has to treat a
> large portion of memory usage as essentially memlocked.
> 
> > The resulting cgroup hierarchy looked something like:
> > 
> > -.slice
> > ├─user.slice
> > │ └─user-1000.slice
> > │   ├─user@1000.service
> > │   │ ├─session.slice
> > │   │ │ ├─gsd-*.service
> > │   │ │ │ └─208803 /usr/libexec/gsd-rfkill
> > │   │ │ ├─gnome-shell-wayland.service
> > │   │ │ │ ├─208493 /usr/bin/gnome-shell
> > │   │ │ │ ├─208549 /usr/bin/Xwayland :0 -rootless -noreset -accessx
> > -core -auth /run/user/1000/.mutter-Xwayla>
> > │   │ │ │ └─ …
> > │   │ └─apps.slice
> > │   │   ├─gnome-launched-tracker-miner-fs.desktop-208880.scope
> > │   │   │ └─208880 /usr/libexec/tracker-miner-fs
> > │   │   ├─dbus-:1.2-org.gnome.OnlineAccounts@0.service
> > │   │   │ └─208668 /usr/libexec/goa-daemon
> > │   │   ├─flatpak-org.gnome.Fractal-210350.scope
> > │   │   ├─gnome-terminal-server.service
> > │   │   │ ├─209261 /usr/libexec/gnome-terminal-server
> > │   │   │ ├─209434 bash
> > │   │   │ └─ … including the test load i.e. "make -j32" of a C++
> > codebase
> > 
> > 
> > I also enabled the CPU and IO controllers in my tests, but I don't
> > think that is as relevant. The main thing is that I set
> 
> CPU control isn't but IO is. Without working IO isolation, it's
> relatively easy to drive the system into the ground given enough
> stress outside the protected area.
> 
> >   memory.min: 2GiB
> >   memory.low: 4GiB
> > 
> > using systemd on all of
> > 
> >  * user.slice,
> >  * user-1000.slice,
> >  * user@1000.service,
> >  * session.slice and
> >  * everything inside session.slice
> >    (i.e. gnome-shell-wayland.service, gsd-*.service, …)
> > 
> > excluding apps.slice from protection.
> > 
> > (In a realistic scenario I expect to have swap and then reserving maybe
> > a few hundred MiB; DAMON might help with finding good values.)
> 
> What's DAMON?
> 
> > At that point, the protection started working pretty much flawlessly.
> > i.e. my gnome-shell would continue to run without major page faulting
> > even though everything in apps.slice was thrashing heavily. The
> > mouse/keyboard remained completely responsive, and interacting with
> > applications ended up working much better thanks to knowing where input
> > was going. Even if the applications themselves took seconds to react.
> > 
> > So far, so good. What surprises me is that I needed to set the
> > protection on the child cgroups (i.e. gnome-shell-wayland.service).
> > Without this, it would not work (reliably) and my gnome-shell would
> > still have a lot of re-faults to load libraries and other mmap'ed data
> > back into memory (I used "perf trace --no-syscalls -F" to trace this
> > and observed repeated faults on the same pages, e.g. when loading
> > functions for execution).
> > 
> > Due to accounting effects, I would expect re-faults to happen up to one
> > time in this scenario. At that point the page in question will be
> > accounted against the shell's cgroup and reclaim protection could kick
> > in. Unfortunately, that did not seem to happen unless the shell's
> > cgroup itself had protections and not just all of its parents.
> > 
> > Is it expected that I need to set limits on each child?
> 
> Yes, right now, memory.low needs to be configured all the way down to
> the leaf to be effective, which can be rather cumbersome. As written
> above, future kernels will be easier to work with in this respect.
> 
> Thanks.
> 

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: Memory reclaim protection and cgroup nesting (desktop use)
  2020-03-05 13:13   ` Benjamin Berg
@ 2020-03-05 14:55     ` Tejun Heo
  2020-03-05 15:12       ` Tejun Heo
  2020-03-05 15:27       ` Benjamin Berg
  0 siblings, 2 replies; 10+ messages in thread
From: Tejun Heo @ 2020-03-05 14:55 UTC (permalink / raw)
  To: Benjamin Berg; +Cc: cgroups, linux-mm, Johannes Weiner

Hello,

On Thu, Mar 05, 2020 at 02:13:58PM +0100, Benjamin Berg wrote:
> A major discussion point seemed to be that cgroups should be grouped by
> their resource management needs rather than a logical hierarchy. I
> think that the resource management needs actually map well enough to
> the logical hierarchy in our case. The hierarchy looks like:

Yeah, the two layouts share a lot of commonalities in most cases. It's
not like we usually wanna distribute resources completely unrelated to
how the system is composed logically.

>                          root
>                        /     \
>            system.slice       user.slice
>           /    |              |         \
>       cron  journal    user-1000.slice   user-1001.slice
>                               |                      \
>                       user@1000.service            [SAME]
>                         |          |
>                    apps.slice   session.slice
>                        |             |
>                   unprotected    protected
> 
...
> I think this actually makes sense. Both from an hierarchical point of
> view and also for configuring resources. In particular the user-.slice
> layer is important, because this grouping allows us to dynamically
> adjust resource management. The obvious thing we can do there is to
> prioritise the currently active user while also lowering resource
> allocations for inactive users (e.g. graphical greeter still running in
> the background).

Changing memory limits dynamically can lead to pretty abrupt system
behaviors depending on how big the swing is but memory.low and io/cpu
weights should behave fine.

> Note, that from my point of view the scenario that most concerns me is
> a resource competition between session.slice and its siblings. This
> makes the hierarchy above even less important; we just need to give the
> user enough control to do resource allocations within their own
> subtree.
> 
> So, it seems to me that the suggested mount option should work well in
> our scenario.

Sounds great. In our experience, what would help quite a lot is using
per-application cgroups more (e.g. containing each application as user
services) so that one misbehaving command can't overwhelm the session
and eventually when oomd has to kick in, it can identify and kill only
the culprit application rather than the whole session.
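
As a concrete illustration (unit and command names are made up), even
just something like

  systemd-run --user --unit=app-org.gnome.TextEditor-1234.service gnome-text-editor

gives each application its own cgroup that oomd can target.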

Thanks.

-- 
tejun



* Re: Memory reclaim protection and cgroup nesting (desktop use)
  2020-03-05 14:55     ` Tejun Heo
@ 2020-03-05 15:12       ` Tejun Heo
  2020-03-05 15:27       ` Benjamin Berg
  1 sibling, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2020-03-05 15:12 UTC (permalink / raw)
  To: Benjamin Berg; +Cc: Cgroups, Linux MM, Johannes Weiner

On Thu, Mar 05, 2020 at 09:55:54AM -0500, Tejun Heo wrote:
> Changing memory limits dynamically can lead to pretty abrupt system
> behaviors depending on how big the swing is but memory.low and io/cpu
> weights should behave fine.

A couple more things which might be helpful.

* memory.min/low are pretty forgiving. Semantically what it tells the
  kernel is "reclaim this guy as if the protected amount isn't being
  consumed" - if a cgroup consumes 8G and has 4G protection, it'd get
  the same reclaim pressure as a sibling who's consuming 4G without
  protection. While the range of "good" configuration is pretty wide,
  you can definitely push it too far to the point the rest of the
  system has to compete too hard for memory. In practice, setting
  memory protection to something like 50-75% of expected allocation
  seems to work well - it provides ample protection while allowing the
  system to be flexible when it needs to be. One important thing that
  we learned is that as resource configuration gets more rigid, it can
  also become more brittle.

* As for io.weight (and cpu.weight too), while prioritization is
  meaningful, what matters the most is avoiding situations where one
  consumer overwhelms the device. Simply configuring io.cost correctly
  and enabling it with default weights may achieve the majority of the
  benefits even without specific weight configurations. Please note
  that IO control has quite a few requirements to function
  correctly - it currently works well only on btrfs on physical (not
  md or dm) devices with all other IO controllers and wbt disabled.
  Hopefully, we'll be able to relax the requirements in the future but
  we aren't there yet.
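
To make those two points concrete, a rough sketch (cgroup path, device
number and the 75% figure are only illustrative):

  # protect ~75% of what the workload is currently using
  cur=$(cat /sys/fs/cgroup/user.slice/memory.current)
  echo $(( cur * 3 / 4 )) > /sys/fs/cgroup/user.slice/memory.low

  # enable io.cost for the backing device with default parameters
  echo "8:0 enable=1" > /sys/fs/cgroup/io.cost.qos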

With both memory and IO set up and oomd watching out for swap
depletion, our configurations show almost complete resource isolation:
no matter what we do in the unprotected portion of the system, it
doesn't affect the performance of the protected portion much, even when
the protected portion is running resource-hungry, latency-sensitive
workloads.

Thanks.


--
tejun



* Re: Memory reclaim protection and cgroup nesting (desktop use)
  2020-03-05 14:55     ` Tejun Heo
  2020-03-05 15:12       ` Tejun Heo
@ 2020-03-05 15:27       ` Benjamin Berg
  2020-03-05 15:39         ` Tejun Heo
  1 sibling, 1 reply; 10+ messages in thread
From: Benjamin Berg @ 2020-03-05 15:27 UTC (permalink / raw)
  To: Tejun Heo; +Cc: cgroups, linux-mm, Johannes Weiner

[-- Attachment #1: Type: text/plain, Size: 3581 bytes --]

On Thu, 2020-03-05 at 09:55 -0500, Tejun Heo wrote:
> Hello,
> 
> On Thu, Mar 05, 2020 at 02:13:58PM +0100, Benjamin Berg wrote:
> > A major discussion point seemed to be that cgroups should be grouped by
> > their resource management needs rather than a logical hierarchy. I
> > think that the resource management needs actually map well enough to
> > the logical hierarchy in our case. The hierarchy looks like:
> 
> Yeah, the two layouts share a lot of commonalities in most cases. It's
> not like we usually wanna distribute resources completely unrelated to
> how the system is composed logically.
> 
> >                          root
> >                        /     \
> >            system.slice       user.slice
> >           /    |              |         \
> >       cron  journal    user-1000.slice   user-1001.slice
> >                               |                      \
> >                       user@1000.service            [SAME]
> >                         |          |
> >                    apps.slice   session.slice
> >                        |             |
> >                   unprotected    protected
> > 
> ...
> > I think this actually makes sense, both from a hierarchical point of
> > view and for configuring resources. In particular the user-.slice
> > layer is important, because this grouping allows us to dynamically
> > adjust resource management. The obvious thing we can do there is to
> > prioritise the currently active user while also lowering resource
> > allocations for inactive users (e.g. graphical greeter still running in
> > the background).
> 
> Changing memory limits dynamically can lead to pretty abrupt system
> behaviors depending on how big the swing is but memory.low and io/cpu
> weights should behave fine.

Right, we'll need some daemon to handle this, so we could even smooth
out any change over a period of time. But it seems like that will not
be needed. I don't expect we'll want to change anything beyond
memory.low and io/cpu weights.

I opened
  https://github.com/systemd/systemd/issues/15028
to discuss this further. I'll update the ticket with more pointers and
information later.

> > Note, that from my point of view the scenario that most concerns me is
> > a resource competition between session.slice and its siblings. This
> > makes the hierarchy above even less important; we just need to give the
> > user enough control to do resource allocations within their own
> > subtree.
> > 
> > So, it seems to me that the suggested mount option should work well in
> > our scenario.
> 
> Sounds great. In our experience, what would help quite a lot is using
> per-application cgroups more (e.g. containing each application as user
> services) so that one misbehaving command can't overwhelm the session
> and eventually when oomd has to kick in, it can identify and kill only
> the culprit application rather than the whole session.

We are already trying to do this in GNOME. :)

Right now GNOME is only moving processes into cgroups after launching
them though (i.e. transient systemd scopes).

But, the goal here is to improve it further and launch all
applications directly using systemd (i.e. as systemd services). systemd
itself is going to define some standards to facilitate everything. And
we'll probably also need to update some XDG standards.

So, there are some plans already, but many details have not been solved
yet. But at least KDE and GNOME people are looking into integrating
well with systemd.

Benjamin

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: Memory reclaim protection and cgroup nesting (desktop use)
  2020-03-05 15:27       ` Benjamin Berg
@ 2020-03-05 15:39         ` Tejun Heo
  0 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2020-03-05 15:39 UTC (permalink / raw)
  To: Benjamin Berg; +Cc: cgroups, linux-mm, Johannes Weiner

Hello, Benjamin.

On Thu, Mar 05, 2020 at 04:27:19PM +0100, Benjamin Berg wrote:
> > Changing memory limits dynamically can lead to pretty abrupt system
> > behaviors depending on how big the swing is but memory.low and io/cpu
> > weights should behave fine.
> 
> Right, we'll need some daemon to handle this, so we could even smooth
> out any change over a period of time. But it seems like that will not
> be needed. I don't expect we'll want to change anything beyond
> memory.low and io/cpu weights.

Yeah, you don't need to baby memory.low and io/cpu weights at all.

> > Sounds great. In our experience, what would help quite a lot is using
> > per-application cgroups more (e.g. containing each application as user
> > services) so that one misbehaving command can't overwhelm the session
> > and eventually when oomd has to kick in, it can identify and kill only
> > the culprit application rather than the whole session.
> 
> We are already trying to do this in GNOME. :)

Awesome.

> Right now GNOME is only moving processes into cgroups after launching
> them though (i.e. transient systemd scopes).

Even just that would be plenty helpful.

> But, the goal here is to improve it further and launch all
> applications directly using systemd (i.e. as systemd services). systemd
> itself is going to define some standards to facilitate everything. And
> we'll probably also need to update some XDG standards.
> 
> So, there are some plans already, but many details have not been solved
> yet. But at least KDE and GNOME people are looking into integrating
> well with systemd.

Sounds great. Please let us know if there's anything we can help with.

Thanks.

-- 
tejun


