[2/2] mm: Consider subtrees in memory.events
diff mbox series

Message ID 20190123223144.GA10798@chrisdown.name
State Superseded
Headers show
Series
  • [1/2] mm: Rename ambiguously named memory.stat counters and functions
Related show

Commit Message

Chris Down Jan. 23, 2019, 10:31 p.m. UTC
memory.stat and other files already consider subtrees in their output,
and we should too in order to not present an inconsistent interface.

The current situation is fairly confusing, because people interacting
with cgroups expect hierarchical behaviour in the vein of memory.stat,
cgroup.events, and other files. For example, this causes confusion when
debugging reclaim events under low, as currently these always read "0"
at non-leaf memcg nodes, which frequently causes people to misdiagnose
breach behaviour. The same confusion applies to other counters in this
file when debugging issues.

Aggregation is done at write time instead of at read-time since these
counters aren't hot (unlike memory.stat which is per-page, so it does it
at read time), and it makes sense to bundle this with the file
notifications.

After this patch, events are propagated up the hierarchy:

    [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
    low 0
    high 0
    max 0
    oom 0
    oom_kill 0
    [root@ktst ~]# systemd-run -p MemoryMax=1 true
    Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
    [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
    low 0
    high 0
    max 7
    oom 1
    oom_kill 1

Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: kernel-team@fb.com
---
 include/linux/memcontrol.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Roman Gushchin Jan. 24, 2019, 12:24 a.m. UTC | #1
On Wed, Jan 23, 2019 at 05:31:44PM -0500, Chris Down wrote:
> memory.stat and other files already consider subtrees in their output,
> and we should too in order to not present an inconsistent interface.
> 
> The current situation is fairly confusing, because people interacting
> with cgroups expect hierarchical behaviour in the vein of memory.stat,
> cgroup.events, and other files. For example, this causes confusion when
> debugging reclaim events under low, as currently these always read "0"
> at non-leaf memcg nodes, which frequently causes people to misdiagnose
> breach behaviour. The same confusion applies to other counters in this
> file when debugging issues.
> 
> Aggregation is done at write time instead of at read-time since these
> counters aren't hot (unlike memory.stat which is per-page, so it does it
> at read time), and it makes sense to bundle this with the file
> notifications.

I agree with the consistency argument (matching cgroup.events, ...),
and it's definitely looks better for oom* events, but at the same time it feels
like a API break.

Just for example, let's say you have a delegated sub-tree with memory.max
set. Earlier, getting memory.high/max event meant that the whole sub-tree
is tight on memory, and, for example, led to shutdown of some parts of the tree.
After your change, it might mean that some sub-cgroup has reached its limit,
and probably doesn't matter on the top level.

Maybe it's still ok, but we definitely need to document it better. It feels
bad that different versions of the kernel will handle it differently, so
the userspace has to workaround it to actually use these events.

Also, please, make sure that it doesn't break memcg kselftests.

> 
> After this patch, events are propagated up the hierarchy:
> 
>    [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
>    low 0
>    high 0
>    max 0
>    oom 0
>    oom_kill 0
>    [root@ktst ~]# systemd-run -p MemoryMax=1 true
>    Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
>    [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
>    low 0
>    high 0
>    max 7
>    oom 1
>    oom_kill 1
> 
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> To: Andrew Morton <akpm@linux-foundation.org>

s/To/CC

> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Dennis Zhou <dennis@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: kernel-team@fb.com
> ---
> include/linux/memcontrol.h | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 380a212a8c52..5428b372def4 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -769,8 +769,10 @@ static inline void count_memcg_event_mm(struct mm_struct *mm,
> static inline void memcg_memory_event(struct mem_cgroup *memcg,
> 				      enum memcg_memory_event event)
> {
> -	atomic_long_inc(&memcg->memory_events[event]);
> -	cgroup_file_notify(&memcg->events_file);
> +	do {
> +		atomic_long_inc(&memcg->memory_events[event]);
> +		cgroup_file_notify(&memcg->events_file);
> +	} while ((memcg = parent_mem_cgroup(memcg)));

We don't have memory.events file for the root cgroup, so we can stop earlier.

Thanks!
Chris Down Jan. 24, 2019, 1:03 a.m. UTC | #2
Roman Gushchin writes:
>On Wed, Jan 23, 2019 at 05:31:44PM -0500, Chris Down wrote:
>> memory.stat and other files already consider subtrees in their output,
>> and we should too in order to not present an inconsistent interface.
>>
>> The current situation is fairly confusing, because people interacting
>> with cgroups expect hierarchical behaviour in the vein of memory.stat,
>> cgroup.events, and other files. For example, this causes confusion when
>> debugging reclaim events under low, as currently these always read "0"
>> at non-leaf memcg nodes, which frequently causes people to misdiagnose
>> breach behaviour. The same confusion applies to other counters in this
>> file when debugging issues.
>>
>> Aggregation is done at write time instead of at read-time since these
>> counters aren't hot (unlike memory.stat which is per-page, so it does it
>> at read time), and it makes sense to bundle this with the file
>> notifications.
>
>I agree with the consistency argument (matching cgroup.events, ...),
>and it's definitely looks better for oom* events, but at the same time it feels
>like a API break.
>
>Just for example, let's say you have a delegated sub-tree with memory.max
>set. Earlier, getting memory.high/max event meant that the whole sub-tree
>is tight on memory, and, for example, led to shutdown of some parts of the tree.
>After your change, it might mean that some sub-cgroup has reached its limit,
>and probably doesn't matter on the top level.

Yeah, this is something I was thinking about while writing it. I think there's 
an argument to be made either way, since functionally they can both represent 
the same feature set, just in different ways.

In the subtree-propagated version you can find the level of the hierarchy that 
the event fired at by checking parent events vs. their subtrees' events, and 
this also allows trivially setting up event watches per-subtree.

In the previous, non-propagated version, it's more trivial to work out the 
level as the event only appears in that memory.events file, but it's harder to 
actually find out about the existence of such an event because you need to keep 
a watch for each individual cgroup in the subtree at all times.

So I think there's a reasonable argument to be made in favour of considering 
subtrees.

1. I'm not aware of anyone major currently relying on using the individual 
subtree level to indicate only subtree-level events.
2. Also, being able to detect the level at which an event happened can be 
achieved in both versions by comparing event counters.
3. Having memory.events work like cgroup.events and others seems to fit with 
principle of least astonishment.

That said, I agree that there's a tradeoff here, but in my experience this 
behaviour more closely resembles user intuition and better matches the overall 
semantics around hierarchical behaviour we've generally established for cgroup 
v2.

>Maybe it's still ok, but we definitely need to document it better. It feels
>bad that different versions of the kernel will handle it differently, so
>the userspace has to workaround it to actually use these events.

That's perfectly reasonable. I'll update the documentation to match.

>Also, please, make sure that it doesn't break memcg kselftests.

For sure.

>We don't have memory.events file for the root cgroup, so we can stop earlier.

Oh yeah, I missed that when changing from a for loop to do/while. I'll fix that 
up, thanks.

Thanks for your feedback!
Michal Hocko Jan. 24, 2019, 8:22 a.m. UTC | #3
On Wed 23-01-19 17:31:44, Chris Down wrote:
> memory.stat and other files already consider subtrees in their output,
> and we should too in order to not present an inconsistent interface.
> 
> The current situation is fairly confusing, because people interacting
> with cgroups expect hierarchical behaviour in the vein of memory.stat,
> cgroup.events, and other files. For example, this causes confusion when
> debugging reclaim events under low, as currently these always read "0"
> at non-leaf memcg nodes, which frequently causes people to misdiagnose
> breach behaviour. The same confusion applies to other counters in this
> file when debugging issues.
> 
> Aggregation is done at write time instead of at read-time since these
> counters aren't hot (unlike memory.stat which is per-page, so it does it
> at read time), and it makes sense to bundle this with the file
> notifications.

I do not think we can do that for two reasons. It breaks the existing
semantic userspace might depend on and more importantly this is not a
correct behavior IMO.

You have to realize that stats are hierarchical because that is how we
account. Events represent a way to inform that something has happened at
the specific level of the tree though. If you do not setup low/high/max
limit then you simply cannot expect to be informed those get hit because
they cannot by definition. Or put it other way, if you are waiting for
those events you really want to know the (sub)tree they happened and if
you propagate the event up the hierarchy you have hard time to tell that
(you would basically have to exclude all but the lowest one and that is
an awkward semantic at best.

Maybe we want to document this better but I do not see we are going to
change the behavior.

> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

btw. I do not see this patch posted anywhere yet it already comes with
an ack. Have I just missed a previous version?
Tejun Heo Jan. 24, 2019, 3:21 p.m. UTC | #4
Hello, Michal.

On Thu, Jan 24, 2019 at 09:22:52AM +0100, Michal Hocko wrote:
> I do not think we can do that for two reasons. It breaks the existing
> semantic userspace might depend on and more importantly this is not a
> correct behavior IMO.

This is a valid concern but I'll come back to this later.

> You have to realize that stats are hierarchical because that is how we
> account. Events represent a way to inform that something has happened at
> the specific level of the tree though. If you do not setup low/high/max

This isn't true.  e.g. cgroup.events's populated event is
hierarchical.  Everything in cgroup should be hierarchical by default.

> limit then you simply cannot expect to be informed those get hit because
> they cannot by definition. Or put it other way, if you are waiting for
> those events you really want to know the (sub)tree they happened and if
> you propagate the event up the hierarchy you have hard time to tell that
> (you would basically have to exclude all but the lowest one and that is
> an awkward semantic at best.

I don't think it's a good idea to argue this for each piece of
information.  Again, everything should be hierarchical unless there
are clear and strong reasons against; otherwise, we end up with random
mix of hierarchical and flat behaviors, something that we want to
avoid the most - remember .use_hierarchy?.

> Maybe we want to document this better but I do not see we are going to
> change the behavior.

I beg you to reconsider.  This was a clear oversight and the cgroup2
usage is still relatively limited.  We sure can add local-specific
counters if needed but must not mix local and hierarchical counters
without a clear way to tell what's what.

Thanks.
Michal Hocko Jan. 24, 2019, 3:51 p.m. UTC | #5
On Thu 24-01-19 07:21:22, Tejun Heo wrote:
> Hello, Michal.
> 
> On Thu, Jan 24, 2019 at 09:22:52AM +0100, Michal Hocko wrote:
> > I do not think we can do that for two reasons. It breaks the existing
> > semantic userspace might depend on and more importantly this is not a
> > correct behavior IMO.
> 
> This is a valid concern but I'll come back to this later.
> 
> > You have to realize that stats are hierarchical because that is how we
> > account. Events represent a way to inform that something has happened at
> > the specific level of the tree though. If you do not setup low/high/max
> 
> This isn't true.  e.g. cgroup.events's populated event is
> hierarchical.  Everything in cgroup should be hierarchical by default.
> 
> > limit then you simply cannot expect to be informed those get hit because
> > they cannot by definition. Or put it other way, if you are waiting for
> > those events you really want to know the (sub)tree they happened and if
> > you propagate the event up the hierarchy you have hard time to tell that
> > (you would basically have to exclude all but the lowest one and that is
> > an awkward semantic at best.
> 
> I don't think it's a good idea to argue this for each piece of
> information.  Again, everything should be hierarchical unless there
> are clear and strong reasons against;

And I would argue that cgroups.events with its populated event is a
different thing because that is inherently a hierarchical property.  If
you compare that to low, min, max and oom those are events very specific
to the particular level of the hierarchy because it is an action at that
level that the event informs about. E.g. max event down the hierarchy is
a completely different thing from the upper level.  That sounds like a
pretty solid reason to differentiate here.

> otherwise, we end up with random
> mix of hierarchical and flat behaviors, something that we want to
> avoid the most - remember .use_hierarchy?.
> 
> > Maybe we want to document this better but I do not see we are going to
> > change the behavior.
> 
> I beg you to reconsider.  This was a clear oversight and the cgroup2
> usage is still relatively limited.  We sure can add local-specific
> counters if needed but must not mix local and hierarchical counters
> without a clear way to tell what's what.

If you really have a usecase for hierarchical events then I think the
only way without breaking existing userspace is to add a new set of
events. But I still believe that the current behavior makes a lot of
sense.
Johannes Weiner Jan. 24, 2019, 4 p.m. UTC | #6
On Thu, Jan 24, 2019 at 09:22:52AM +0100, Michal Hocko wrote:
> On Wed 23-01-19 17:31:44, Chris Down wrote:
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> btw. I do not see this patch posted anywhere yet it already comes with
> an ack. Have I just missed a previous version?

I reviewed it offline before Chris sent it out.

I agree with the sentiment that the non-hierarchical behavior was an
oversight, not a design decision.

The arguments against the change don't convince me: the added
difficulty of finding out local values is true for all other cgroup
files as well. This is traded off with being able to detect any
subtree state from the first level cgroups and drill down on-demand,
without having to scan the entire tree on each monitoring interval.
That's a trade-off we've made everywhere else, so this is simply an
inconsistency, not a legitimate exception to the rule.

We cannot fully eliminate a risk for regression, but it strikes me as
highly unlikely, given the extremely young age of cgroup2-based system
management and surrounding tooling.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Michal Hocko Jan. 24, 2019, 5:01 p.m. UTC | #7
On Thu 24-01-19 11:00:10, Johannes Weiner wrote:
[...]
> We cannot fully eliminate a risk for regression, but it strikes me as
> highly unlikely, given the extremely young age of cgroup2-based system
> management and surrounding tooling.

I am not really sure what you consider young but this interface is 4.0+
IIRC and the cgroup v2 is considered stable since 4.5 unless I
missrememeber and that is not a short time period in my book. Changing
interfaces now represents a non-trivial risk and so far I haven't heard
any actual usecase where the current semantic is actually wrong.
Inconsistency on its own is not a sufficient justification IMO.
Johannes Weiner Jan. 24, 2019, 6:23 p.m. UTC | #8
On Thu, Jan 24, 2019 at 06:01:17PM +0100, Michal Hocko wrote:
> On Thu 24-01-19 11:00:10, Johannes Weiner wrote:
> [...]
> > We cannot fully eliminate a risk for regression, but it strikes me as
> > highly unlikely, given the extremely young age of cgroup2-based system
> > management and surrounding tooling.
> 
> I am not really sure what you consider young but this interface is 4.0+
> IIRC and the cgroup v2 is considered stable since 4.5 unless I
> missrememeber and that is not a short time period in my book.

If you read my sentence again, I'm not talking about the kernel but
the surrounding infrastructure that consumes this data. The risk is
not dependent on the age of the interface age, but on its adoption.

> Changing interfaces now represents a non-trivial risk and so far I
> haven't heard any actual usecase where the current semantic is
> actually wrong.  Inconsistency on its own is not a sufficient
> justification IMO.

It can be seen either way, and in isolation it wouldn't be wrong to
count events on the local level. But we made that decision for the
entire interface, and this file is the odd one out now. From that
comprehensive perspective, yes, the behavior is wrong. It really
confuses people who are trying to use it, because they *do* expect it
to behave recursively.

I'm really having a hard time believing there are existing cgroup2
users with specific expectations for the non-recursive behavior...
Michal Hocko Jan. 25, 2019, 8:42 a.m. UTC | #9
On Thu 24-01-19 13:23:28, Johannes Weiner wrote:
> On Thu, Jan 24, 2019 at 06:01:17PM +0100, Michal Hocko wrote:
> > On Thu 24-01-19 11:00:10, Johannes Weiner wrote:
> > [...]
> > > We cannot fully eliminate a risk for regression, but it strikes me as
> > > highly unlikely, given the extremely young age of cgroup2-based system
> > > management and surrounding tooling.
> > 
> > I am not really sure what you consider young but this interface is 4.0+
> > IIRC and the cgroup v2 is considered stable since 4.5 unless I
> > missrememeber and that is not a short time period in my book.
> 
> If you read my sentence again, I'm not talking about the kernel but
> the surrounding infrastructure that consumes this data. The risk is
> not dependent on the age of the interface age, but on its adoption.

You really have to assume the user visible interface is consumed shortly
after it is exposed/considered stable in this case as cgroups v2 was
explicitly called unstable for a considerable period of time. This is a
general policy regarding user APIs in the kernel. I can see arguments a
next release after introduction or in similar cases but this is 3 years
ago. We already have distribution kernels based on 4.12 kernel and it is
old comparing to 5.0.

> > Changing interfaces now represents a non-trivial risk and so far I
> > haven't heard any actual usecase where the current semantic is
> > actually wrong.  Inconsistency on its own is not a sufficient
> > justification IMO.
> 
> It can be seen either way, and in isolation it wouldn't be wrong to
> count events on the local level. But we made that decision for the
> entire interface, and this file is the odd one out now. From that
> comprehensive perspective, yes, the behavior is wrong.

I do see your point about consistency. But it is also important to
consider the usability of this interface. As already mentioned, catching
an oom event at a level where the oom doesn't happen and having hard
time to identify that place without races is a not a straightforward API
to use. So it might be really the case that the api is actually usable
for its purpose.

> It really
> confuses people who are trying to use it, because they *do* expect it
> to behave recursively.

Then we should improve the documentation. But seriously these are no
strong reasons to change a long term semantic people might rely on.

> I'm really having a hard time believing there are existing cgroup2
> users with specific expectations for the non-recursive behavior...

I can certainly imagine monitoring tools to hook at levels where limits
are set and report events as they happen. It would be more than
confusing to receive events for reclaim/ooms that hasn't happened at
that level just because a delegated memcg down the hierarchy has decided
to set a more restrictive limits. Really this is a very unexpected
behavior change for anybody using that interface right now on anything
but leaf memcgs.
Tejun Heo Jan. 25, 2019, 4:51 p.m. UTC | #10
Hello, Michal.

On Fri, Jan 25, 2019 at 09:42:13AM +0100, Michal Hocko wrote:
> > If you read my sentence again, I'm not talking about the kernel but
> > the surrounding infrastructure that consumes this data. The risk is
> > not dependent on the age of the interface age, but on its adoption.
> 
> You really have to assume the user visible interface is consumed shortly
> after it is exposed/considered stable in this case as cgroups v2 was
> explicitly called unstable for a considerable period of time. This is a
> general policy regarding user APIs in the kernel. I can see arguments a
> next release after introduction or in similar cases but this is 3 years
> ago. We already have distribution kernels based on 4.12 kernel and it is
> old comparing to 5.0.

We do change userland-visible behaviors if the existing behavior is
buggy / misleading / confusing.  For example, we recently changed how
discard bytes are accounted (no longer included in write bytes or ios)
and even how mincore(2) behaves, both of which are far older than
cgroup2.

The main considerations are the blast radius and existing use cases in
these decisions.  Age does contribute to it but mostly because they
affect how widely the behavior may be depended upon.

> > > Changing interfaces now represents a non-trivial risk and so far I
> > > haven't heard any actual usecase where the current semantic is
> > > actually wrong.  Inconsistency on its own is not a sufficient
> > > justification IMO.
> > 
> > It can be seen either way, and in isolation it wouldn't be wrong to
> > count events on the local level. But we made that decision for the
> > entire interface, and this file is the odd one out now. From that
> > comprehensive perspective, yes, the behavior is wrong.
> 
> I do see your point about consistency. But it is also important to
> consider the usability of this interface. As already mentioned, catching
> an oom event at a level where the oom doesn't happen and having hard
> time to identify that place without races is a not a straightforward API
> to use. So it might be really the case that the api is actually usable
> for its purpose.

What if a user wants to monitor any ooms in the subtree tho, which is
a valid use case?  If local event monitoring is useful and it can be,
let's add separate events which are clearly identifiable to be local.
Right now, it's confusing like hell.

> > It really
> > confuses people who are trying to use it, because they *do* expect it
> > to behave recursively.
> 
> Then we should improve the documentation. But seriously these are no
> strong reasons to change a long term semantic people might rely on.

This is broken interface.  We're mixing local and hierarchical numbers
willy nilly without obvious way of telling them apart.

> > I'm really having a hard time believing there are existing cgroup2
> > users with specific expectations for the non-recursive behavior...
> 
> I can certainly imagine monitoring tools to hook at levels where limits
> are set and report events as they happen. It would be more than
> confusing to receive events for reclaim/ooms that hasn't happened at
> that level just because a delegated memcg down the hierarchy has decided
> to set a more restrictive limits. Really this is a very unexpected
> behavior change for anybody using that interface right now on anything
> but leaf memcgs.

Sure, there's some probability this change may cause some disruptions
although I'm pretty skeptical given that inner node event monitoring
is mostly useless right now.  However, there's also a lot of on-going
and future costs everyone is paying because the interface is so
confusing.

Thanks.
Michal Hocko Jan. 25, 2019, 5:37 p.m. UTC | #11
On Fri 25-01-19 08:51:52, Tejun Heo wrote:
[...]
> > I do see your point about consistency. But it is also important to
> > consider the usability of this interface. As already mentioned, catching
> > an oom event at a level where the oom doesn't happen and having hard
> > time to identify that place without races is a not a straightforward API
> > to use. So it might be really the case that the api is actually usable
> > for its purpose.
> 
> What if a user wants to monitor any ooms in the subtree tho, which is
> a valid use case?

How is that information useful without know which memcg the oom applies
to?

> If local event monitoring is useful and it can be,
> let's add separate events which are clearly identifiable to be local.
> Right now, it's confusing like hell.

From a backward compatible POV it should be a new interface added.
Please note that I understand that this might be confusing with the rest
of the cgroup APIs but considering that this is the first time somebody
is actually complaining and the interface is "production ready" for more
than three years I am not really sure the situation is all that bad.
Tejun Heo Jan. 25, 2019, 6:28 p.m. UTC | #12
Hello, Michal.

On Fri, Jan 25, 2019 at 06:37:13PM +0100, Michal Hocko wrote:
> > What if a user wants to monitor any ooms in the subtree tho, which is
> > a valid use case?
> 
> How is that information useful without know which memcg the oom applies
> to?

For example, a workload manager watching over a subtree for a job with
nested memory limits set by the job itself.  It wants to take action
(reporting and possibly other remediative actions) when something goes
wrong in the delegated subtree but isn't involved in how the subtree
is configured inside.

> > If local event monitoring is useful and it can be,
> > let's add separate events which are clearly identifiable to be local.
> > Right now, it's confusing like hell.
> 
> From a backward compatible POV it should be a new interface added.

That sure is an option for use cases like above but it has the
downside of carrying over the confusing interface into the indefinite
future.  Again, I'd like to point back at how we changed the
accounting write and trim accounting because the benefits outweighted
the risks.

> Please note that I understand that this might be confusing with the rest
> of the cgroup APIs but considering that this is the first time somebody
> is actually complaining and the interface is "production ready" for more
> than three years I am not really sure the situation is all that bad.

cgroup2 uptake hasn't progressed that fast.  None of the major distros
or container frameworks are currently shipping with it although many
are evaluating switching.  I don't think I'm too mistaken in that we
(FB) are at the bleeding edge in terms of adopting cgroup2 and its
various new features and are hitting these corner cases and oversights
in the process.  If there are noticeable breakages arising from this
change, we sure can backpaddle but I think the better course of action
is fixing them up while we can.

Thanks.
Michal Hocko Jan. 28, 2019, 12:51 p.m. UTC | #13
On Fri 25-01-19 10:28:08, Tejun Heo wrote:
> Hello, Michal.
> 
> On Fri, Jan 25, 2019 at 06:37:13PM +0100, Michal Hocko wrote:
> > > What if a user wants to monitor any ooms in the subtree tho, which is
> > > a valid use case?
> > 
> > How is that information useful without know which memcg the oom applies
> > to?
> 
> For example, a workload manager watching over a subtree for a job with
> nested memory limits set by the job itself.  It wants to take action
> (reporting and possibly other remediative actions) when something goes
> wrong in the delegated subtree but isn't involved in how the subtree
> is configured inside.

Yes, I understand this part, but it is not clear to me, _how_ to report
anything sensible without knowing _what_ has caused the event. You can
walk the cgroup hierarchy and compare cached results with new ones but
this is a) racy and b) clumsy.

> > > If local event monitoring is useful and it can be,
> > > let's add separate events which are clearly identifiable to be local.
> > > Right now, it's confusing like hell.
> > 
> > From a backward compatible POV it should be a new interface added.
> 
> That sure is an option for use cases like above but it has the
> downside of carrying over the confusing interface into the indefinite
> future.

I actually believe that this is not such a big deal. For one thing the
current events are actually helpful to watch the reclaim/setup behavior.

> Again, I'd like to point back at how we changed the
> accounting write and trim accounting because the benefits outweighted
> the risks.
> 
> > Please note that I understand that this might be confusing with the rest
> > of the cgroup APIs but considering that this is the first time somebody
> > is actually complaining and the interface is "production ready" for more
> > than three years I am not really sure the situation is all that bad.
> 
> cgroup2 uptake hasn't progressed that fast.  None of the major distros
> or container frameworks are currently shipping with it although many
> are evaluating switching.  I don't think I'm too mistaken in that we
> (FB) are at the bleeding edge in terms of adopting cgroup2 and its
> various new features and are hitting these corner cases and oversights
> in the process.  If there are noticeable breakages arising from this
> change, we sure can backpaddle but I think the better course of action
> is fixing them up while we can.

I do not really think you can go back. You cannot simply change semantic
back and forth because you just break new users.

Really, I do not see the semantic changing after more than 3 years of
production ready interface. If you really believe we need a hierarchical
notification mechanism for the reclaim activity then add a new one.
Tejun Heo Jan. 28, 2019, 2:28 p.m. UTC | #14
Hello, Michal.

On Mon, Jan 28, 2019 at 01:51:51PM +0100, Michal Hocko wrote:
> > For example, a workload manager watching over a subtree for a job with
> > nested memory limits set by the job itself.  It wants to take action
> > (reporting and possibly other remediative actions) when something goes
> > wrong in the delegated subtree but isn't involved in how the subtree
> > is configured inside.
> 
> Yes, I understand this part, but it is not clear to me, _how_ to report
> anything sensible without knowing _what_ has caused the event. You can
> walk the cgroup hierarchy and compare cached results with new ones but
> this is a) racy and b) clumsy.

All .events files generate aggregated stateful notifications.  For
anyone to do anything, they'd have to remember the previous state to
identify what actually happened.  Being hierarchical, it'd of course
need to walk down when an event triggers.

> > That sure is an option for use cases like above but it has the
> > downside of carrying over the confusing interface into the indefinite
> > future.
> 
> I actually believe that this is not such a big deal. For one thing the
> current events are actually helpful to watch the reclaim/setup behavior.

Sure, it isn't something critical.  It's just confusing and I think
it'd be better to improve.

> I do not really think you can go back. You cannot simply change semantic
> back and forth because you just break new users.
> 
> Really, I do not see the semantic changing after more than 3 years of
> production ready interface. If you really believe we need a hierarchical
> notification mechanism for the reclaim activity then add a new one.

I don't see it as black and white as you do.  Let's agree to disagree.
I'll ack the patch and note the disagreement.

Thanks.
Tejun Heo Jan. 28, 2019, 2:30 p.m. UTC | #15
On Wed, Jan 23, 2019 at 05:31:44PM -0500, Chris Down wrote:
> memory.stat and other files already consider subtrees in their output,
> and we should too in order to not present an inconsistent interface.
> 
> The current situation is fairly confusing, because people interacting
> with cgroups expect hierarchical behaviour in the vein of memory.stat,
> cgroup.events, and other files. For example, this causes confusion when
> debugging reclaim events under low, as currently these always read "0"
> at non-leaf memcg nodes, which frequently causes people to misdiagnose
> breach behaviour. The same confusion applies to other counters in this
> file when debugging issues.
> 
> Aggregation is done at write time instead of at read-time since these
> counters aren't hot (unlike memory.stat which is per-page, so it does it
> at read time), and it makes sense to bundle this with the file
> notifications.
...
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Tejun Heo <tj@kernel.org>

Michal has a valid counterpoint that this is a change in userland
visible behavior but to me this patch seems to be more of a bug fix
than anything else in that it's addressing an obvious inconsistency in
the interface.

Thanks.
Michal Hocko Jan. 28, 2019, 2:52 p.m. UTC | #16
On Mon 28-01-19 06:28:16, Tejun Heo wrote:
> Hello, Michal.
> 
> On Mon, Jan 28, 2019 at 01:51:51PM +0100, Michal Hocko wrote:
> > > For example, a workload manager watching over a subtree for a job with
> > > nested memory limits set by the job itself.  It wants to take action
> > > (reporting and possibly other remediative actions) when something goes
> > > wrong in the delegated subtree but isn't involved in how the subtree
> > > is configured inside.
> > 
> > Yes, I understand this part, but it is not clear to me, _how_ to report
> > anything sensible without knowing _what_ has caused the event. You can
> > walk the cgroup hierarchy and compare cached results with new ones but
> > this is a) racy and b) clumsy.
> 
> All .events files generate aggregated stateful notifications.  For
> anyone to do anything, they'd have to remember the previous state to
> identify what actually happened.  Being hierarchical, it'd of course
> need to walk down when an event triggers.

And how do you do that in a raceless fashion?

> > > That sure is an option for use cases like above but it has the
> > > downside of carrying over the confusing interface into the indefinite
> > > future.
> > 
> > I actually believe that this is not such a big deal. For one thing the
> > current events are actually helpful to watch the reclaim/setup behavior.
> 
> Sure, it isn't something critical.  It's just confusing and I think
> it'd be better to improve.
> 
> > I do not really think you can go back. You cannot simply change semantic
> > back and forth because you just break new users.
> > 
> > Really, I do not see the semantic changing after more than 3 years of
> > production ready interface. If you really believe we need a hierarchical
> > notification mechanism for the reclaim activity then add a new one.
> 
> I don't see it as black and white as you do.  Let's agree to disagree.
> I'll ack the patch and note the disagreement.

Considering the justification behhind this change I really do not see
other option than nack this change. There is simply no _strong_ reason
to change the behavior. Even if the current behavior is confusing, the
documentation can be improved to be more specific. If there is a strong
demand for hierarchical reporting then add a new interface. But I have
to say that I would consider such a reporting clumsy at best.
Tejun Heo Jan. 28, 2019, 2:54 p.m. UTC | #17
Hello,

On Mon, Jan 28, 2019 at 03:52:10PM +0100, Michal Hocko wrote:
> > All .events files generate aggregated stateful notifications.  For
> > anyone to do anything, they'd have to remember the previous state to
> > identify what actually happened.  Being hierarchical, it'd of course
> > need to walk down when an event triggers.
> 
> And how do you do that in a raceless fashion?

Hmm... I'm having trouble imagining why this would be a problem.  How
would it race?

Thanks.
Michal Hocko Jan. 28, 2019, 3:18 p.m. UTC | #18
On Mon 28-01-19 06:54:07, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jan 28, 2019 at 03:52:10PM +0100, Michal Hocko wrote:
> > > All .events files generate aggregated stateful notifications.  For
> > > anyone to do anything, they'd have to remember the previous state to
> > > identify what actually happened.  Being hierarchical, it'd of course
> > > need to walk down when an event triggers.
> > 
> > And how do you do that in a raceless fashion?
> 
> Hmm... I'm having trouble imagining why this would be a problem.  How
> would it race?

How do you make an atomic snapshot of the hierarchy state? Or you do
not need it because event counters are monotonic and you are willing to
sacrifice some lost or misinterpreted events? For example, you receive
an oom event while the two children increase the oom event counter. How
do you tell which one was the source of the event and which one is still
pending? Or is the ordering unimportant in general?

I can imagine you can live with this model, but having a hierarchical
reporting without a source of the event just sounds too clumsy from my
POV. But I guess this is getting tangent to the original patch.
Tejun Heo Jan. 28, 2019, 3:41 p.m. UTC | #19
Hello, Michal.

On Mon, Jan 28, 2019 at 04:18:59PM +0100, Michal Hocko wrote:
> How do you make an atomic snapshot of the hierarchy state? Or you do
> not need it because event counters are monotonic and you are willing to
> sacrifice some lost or misinterpreted events? For example, you receive
> an oom event while the two children increase the oom event counter. How
> do you tell which one was the source of the event and which one is still
> pending? Or is the ordering unimportant in general?

Hmm... This is straightforward stateful notification.  Imagine the
following hierarchy.  The numbers are the notification counters.

     A:0
   /   \
  B:0  C:0

Let's say B generates an event, soon followed by C.  If A's counter is
read after both B and C's events, nothing is missed.

Let's say it ends up generating two notifications and we end up
walking down inbetween B and C's events.  It would look like the
following.

     A:1
   /   \
  B:1  C:0

We first see A's 0 -> 1 and then start scanning the subtrees to find
out the origin.  We will notice B but let's say we visit C before C's
event gets registered (otherwise, nothing is missed).

But, no matter where you put C's event and notification, the
followings hold.

1. A's count will be different from what was seen before.
2. There will be another notification queued on A.

IOW, it's guaranteed that we'll notice and re-scan if we don't see C's
event this time.  The worst that can happen is scanning spuriously but
that's true even for local events.

This isn't a novel thing.  It's how aggregated stateful notifications
usually work (e.g. a lot of hardware interrupts behave this way).  The
notification is just saying "something might have changed here, please
take a look" and the interlocking is achieved by following specific
orders when propagating and reading the events.

> I can imagine you can live with this model, but having a hierarchical
> reporting without a source of the event just sounds too clumsy from my
> POV. But I guess this is getting tangent to the original patch.

It seems like your opinion is mostly based on misunderstanding.  Let's
keep the discussion focused on API stability.

Thanks.
Shakeel Butt Jan. 28, 2019, 3:59 p.m. UTC | #20
Hi Tejun,

On Fri, Jan 25, 2019 at 10:28 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Michal.
>
> On Fri, Jan 25, 2019 at 06:37:13PM +0100, Michal Hocko wrote:
> > > What if a user wants to monitor any ooms in the subtree tho, which is
> > > a valid use case?
> >
> > How is that information useful without know which memcg the oom applies
> > to?
>
> For example, a workload manager watching over a subtree for a job with
> nested memory limits set by the job itself.  It wants to take action
> (reporting and possibly other remediative actions) when something goes
> wrong in the delegated subtree but isn't involved in how the subtree
> is configured inside.
>

Why not make this configurable at the delegation boundary? As you
mentioned, there are jobs who want centralized workload manager to
watch over their subtrees while there can be jobs which want to
monitor their subtree themselves. For example I can have a job which
know how to act when one of the children cgroup goes OOM. However if
the root of that job goes OOM then the centralized workload manager
should do something about it. With this change, how to implement this
scenario? How will the central manager differentiates between that a
subtree of a job goes OOM or the root of that job? I guess from the
discussion it seems like the centralized manager has to traverse that
job's subtree to find the source of OOM.

Why can't we let the implementation of centralized manager easier by
allowing to configure the propagation of these notifications across
delegation boundary.

thanks,
Shakeel
Tejun Heo Jan. 28, 2019, 4:05 p.m. UTC | #21
Hello, Shakeel.

On Mon, Jan 28, 2019 at 07:59:33AM -0800, Shakeel Butt wrote:
> Why not make this configurable at the delegation boundary? As you
> mentioned, there are jobs who want centralized workload manager to
> watch over their subtrees while there can be jobs which want to
> monitor their subtree themselves. For example I can have a job which
> know how to act when one of the children cgroup goes OOM. However if
> the root of that job goes OOM then the centralized workload manager
> should do something about it. With this change, how to implement this
> scenario? How will the central manager differentiates between that a
> subtree of a job goes OOM or the root of that job? I guess from the
> discussion it seems like the centralized manager has to traverse that
> job's subtree to find the source of OOM.
> 
> Why can't we let the implementation of centralized manager easier by
> allowing to configure the propagation of these notifications across
> delegation boundary.

I think the right way to achieve the above would be having separate
recursive and local counters.

Thanks.
Shakeel Butt Jan. 28, 2019, 4:08 p.m. UTC | #22
Hi Tejun,

On Mon, Jan 28, 2019 at 8:05 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Shakeel.
>
> On Mon, Jan 28, 2019 at 07:59:33AM -0800, Shakeel Butt wrote:
> > Why not make this configurable at the delegation boundary? As you
> > mentioned, there are jobs who want centralized workload manager to
> > watch over their subtrees while there can be jobs which want to
> > monitor their subtree themselves. For example I can have a job which
> > know how to act when one of the children cgroup goes OOM. However if
> > the root of that job goes OOM then the centralized workload manager
> > should do something about it. With this change, how to implement this
> > scenario? How will the central manager differentiates between that a
> > subtree of a job goes OOM or the root of that job? I guess from the
> > discussion it seems like the centralized manager has to traverse that
> > job's subtree to find the source of OOM.
> >
> > Why can't we let the implementation of centralized manager easier by
> > allowing to configure the propagation of these notifications across
> > delegation boundary.
>
> I think the right way to achieve the above would be having separate
> recursive and local counters.
>

Do you envision a separate interface/file for recursive and local
counters? That would make notifications simpler but that is an
additional interface.

Shakeel
Tejun Heo Jan. 28, 2019, 4:12 p.m. UTC | #23
Hello,

On Mon, Jan 28, 2019 at 08:08:26AM -0800, Shakeel Butt wrote:
> Do you envision a separate interface/file for recursive and local
> counters? That would make notifications simpler but that is an
> additional interface.

I need to think more about it but my first throught is that a separate
file would make more sense given that separating notifications could
be useful.

Thanks.
Michal Hocko Jan. 28, 2019, 5:05 p.m. UTC | #24
On Mon 28-01-19 07:41:50, Tejun Heo wrote:
> Hello, Michal.
> 
> On Mon, Jan 28, 2019 at 04:18:59PM +0100, Michal Hocko wrote:
> > How do you make an atomic snapshot of the hierarchy state? Or you do
> > not need it because event counters are monotonic and you are willing to
> > sacrifice some lost or misinterpreted events? For example, you receive
> > an oom event while the two children increase the oom event counter. How
> > do you tell which one was the source of the event and which one is still
> > pending? Or is the ordering unimportant in general?
> 
> Hmm... This is straightforward stateful notification.  Imagine the
> following hierarchy.  The numbers are the notification counters.
> 
>      A:0
>    /   \
>   B:0  C:0
> 
> Let's say B generates an event, soon followed by C.  If A's counter is
> read after both B and C's events, nothing is missed.
> 
> Let's say it ends up generating two notifications and we end up
> walking down inbetween B and C's events.  It would look like the
> following.
> 
>      A:1
>    /   \
>   B:1  C:0
> 
> We first see A's 0 -> 1 and then start scanning the subtrees to find
> out the origin.  We will notice B but let's say we visit C before C's
> event gets registered (otherwise, nothing is missed).

Yeah, that is quite clear. But it also assumes that the hierarchy is
pretty stable but cgroups might go away at any time. I am not saying
that the aggregated events are not useful I am just saying that it is
quite non-trivial to use and catch all potential corner cases. Maybe I
am overcomplicating it but one thing is quite clear to me. The existing
semantic is really useful to watch for the reclaim behavior at the
current level of the tree. You really do not have to care what is
happening in the subtree when it is clear that the workload itself
is underprovisioned etc. Considering that such a semantic already
existis, somebody might depend on it and we likely want also aggregated
semantic then I really do not see why to risk regressions rather than
add a new memory.hierarchy_events and have both.
Tejun Heo Jan. 28, 2019, 5:49 p.m. UTC | #25
Hello, Michal.

On Mon, Jan 28, 2019 at 06:05:26PM +0100, Michal Hocko wrote:
> Yeah, that is quite clear. But it also assumes that the hierarchy is
> pretty stable but cgroups might go away at any time. I am not saying
> that the aggregated events are not useful I am just saying that it is
> quite non-trivial to use and catch all potential corner cases. Maybe I

It really isn't complicated and doesn't require stable subtree.

> am overcomplicating it but one thing is quite clear to me. The existing
> semantic is really useful to watch for the reclaim behavior at the
> current level of the tree. You really do not have to care what is
> happening in the subtree when it is clear that the workload itself
> is underprovisioned etc. Considering that such a semantic already
> existis, somebody might depend on it and we likely want also aggregated
> semantic then I really do not see why to risk regressions rather than
> add a new memory.hierarchy_events and have both.

The problem then is that most other things are hierarchical including
some fields in .events files, so if we try to add local stats and
events, there's no good way to add them.

Thanks.
Michal Hocko Jan. 29, 2019, 2:43 p.m. UTC | #26
On Mon 28-01-19 09:49:05, Tejun Heo wrote:
> Hello, Michal.
> 
> On Mon, Jan 28, 2019 at 06:05:26PM +0100, Michal Hocko wrote:
> > Yeah, that is quite clear. But it also assumes that the hierarchy is
> > pretty stable but cgroups might go away at any time. I am not saying
> > that the aggregated events are not useful I am just saying that it is
> > quite non-trivial to use and catch all potential corner cases. Maybe I
> 
> It really isn't complicated and doesn't require stable subtree.
> 
> > am overcomplicating it but one thing is quite clear to me. The existing
> > semantic is really useful to watch for the reclaim behavior at the
> > current level of the tree. You really do not have to care what is
> > happening in the subtree when it is clear that the workload itself
> > is underprovisioned etc. Considering that such a semantic already
> > existis, somebody might depend on it and we likely want also aggregated
> > semantic then I really do not see why to risk regressions rather than
> > add a new memory.hierarchy_events and have both.
> 
> The problem then is that most other things are hierarchical including
> some fields in .events files, so if we try to add local stats and
> events, there's no good way to add them.

All memcg events are represented non-hierarchical AFAICS
memcg_memory_event() simply accounts at the level when it happens. Or do
I miss something? Or are you talking about .events files for other
controllers?
Tejun Heo Jan. 29, 2019, 2:52 p.m. UTC | #27
Hello,

On Tue, Jan 29, 2019 at 03:43:06PM +0100, Michal Hocko wrote:
> All memcg events are represented non-hierarchical AFAICS
> memcg_memory_event() simply accounts at the level when it happens. Or do
> I miss something? Or are you talking about .events files for other
> controllers?

Yeah, cgroup.events and .stat files as some of the local stats would
be useful too, so if we don't flip memory.events we'll end up with sth
like cgroup.events.local, memory.events.tree and memory.stats.local,
which is gonna be hilarious.

If you aren't willing to change your mind, the only option seems to be
introducing a mount option to gate the flip and additions of local
files.  Most likely, userspace will enable the option by default
everywhere, so the end result will be exactly the same but I guess
it'll better address your concern.

Thanks.
Michal Hocko Jan. 30, 2019, 4:50 p.m. UTC | #28
On Tue 29-01-19 06:52:40, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jan 29, 2019 at 03:43:06PM +0100, Michal Hocko wrote:
> > All memcg events are represented non-hierarchical AFAICS
> > memcg_memory_event() simply accounts at the level when it happens. Or do
> > I miss something? Or are you talking about .events files for other
> > controllers?
> 
> Yeah, cgroup.events and .stat files as some of the local stats would
> be useful too, so if we don't flip memory.events we'll end up with sth
> like cgroup.events.local, memory.events.tree and memory.stats.local,
> which is gonna be hilarious.

Why cannot we simply have memory.events_tree and be done with it? Sure
the file names are not goin to be consistent which is a minus but that
ship has already sailed some time ago.

> If you aren't willing to change your mind, the only option seems to be
> introducing a mount option to gate the flip and additions of local
> files.  Most likely, userspace will enable the option by default
> everywhere, so the end result will be exactly the same but I guess
> it'll better address your concern.

How does the consumer of the API learns about the mount type?
Tejun Heo Jan. 30, 2019, 5:06 p.m. UTC | #29
Hello, Michal.

On Wed, Jan 30, 2019 at 05:50:58PM +0100, Michal Hocko wrote:
> > Yeah, cgroup.events and .stat files as some of the local stats would
> > be useful too, so if we don't flip memory.events we'll end up with sth
> > like cgroup.events.local, memory.events.tree and memory.stats.local,
> > which is gonna be hilarious.
> 
> Why cannot we simply have memory.events_tree and be done with it? Sure
> the file names are not goin to be consistent which is a minus but that
> ship has already sailed some time ago.

Because the overall cost of shitty interface will be way higher in the
longer term.  cgroup2 interface is far from perfect but is way better
than cgroup1 especially for the memory controller.  Why do you think
that is?

> > If you aren't willing to change your mind, the only option seems to be
> > introducing a mount option to gate the flip and additions of local
> > files.  Most likely, userspace will enable the option by default
> > everywhere, so the end result will be exactly the same but I guess
> > it'll better address your concern.
> 
> How does the consumer of the API learns about the mount type?

It's gonna be a mount flag exposed in /sys/kernel/cgroup/features.

Thanks.
Michal Hocko Jan. 30, 2019, 5:41 p.m. UTC | #30
On Wed 30-01-19 09:06:58, Tejun Heo wrote:
> Hello, Michal.
> 
> On Wed, Jan 30, 2019 at 05:50:58PM +0100, Michal Hocko wrote:
> > > Yeah, cgroup.events and .stat files as some of the local stats would
> > > be useful too, so if we don't flip memory.events we'll end up with sth
> > > like cgroup.events.local, memory.events.tree and memory.stats.local,
> > > which is gonna be hilarious.
> > 
> > Why cannot we simply have memory.events_tree and be done with it? Sure
> > the file names are not goin to be consistent which is a minus but that
> > ship has already sailed some time ago.
> 
> Because the overall cost of shitty interface will be way higher in the
> longer term.

But we are discussing the file name effectively. I do not see a long
term maintenance burden. Confusing? Probably yes but that is were the
documentation would be helpful.
Tejun Heo Jan. 30, 2019, 5:52 p.m. UTC | #31
On Wed, Jan 30, 2019 at 06:41:17PM +0100, Michal Hocko wrote:
> But we are discussing the file name effectively. I do not see a long
> term maintenance burden. Confusing? Probably yes but that is were the

Cost on user side.

> documentation would be helpful.

which is an a lot worse option with way higher total cost.

If you aren't against making it a mount option (and you shouldn't),
I think that'd be the best use of time for both of us.

Thanks.
Michal Hocko Jan. 30, 2019, 6:16 p.m. UTC | #32
On Wed 30-01-19 09:52:22, Tejun Heo wrote:
> On Wed, Jan 30, 2019 at 06:41:17PM +0100, Michal Hocko wrote:
> > But we are discussing the file name effectively. I do not see a long
> > term maintenance burden. Confusing? Probably yes but that is were the
> 
> Cost on user side.
> 
> > documentation would be helpful.
> 
> which is an a lot worse option with way higher total cost.

And how exactly does the cost get any smaller with a mount option. The
consumer of the API will likely not know there are two modes and check
for it to figure out how to interpret those values. Or maybe I still do
not follow how exactly is the mount option supposed to work.
Shakeel Butt Jan. 30, 2019, 7:11 p.m. UTC | #33
Hi Tejun,

On Wed, Jan 30, 2019 at 9:07 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Michal.
>
> On Wed, Jan 30, 2019 at 05:50:58PM +0100, Michal Hocko wrote:
> > > Yeah, cgroup.events and .stat files as some of the local stats would
> > > be useful too, so if we don't flip memory.events we'll end up with sth
> > > like cgroup.events.local, memory.events.tree and memory.stats.local,
> > > which is gonna be hilarious.
> >
> > Why cannot we simply have memory.events_tree and be done with it? Sure
> > the file names are not goin to be consistent which is a minus but that
> > ship has already sailed some time ago.
>
> Because the overall cost of shitty interface will be way higher in the
> longer term.  cgroup2 interface is far from perfect but is way better
> than cgroup1 especially for the memory controller.  Why do you think
> that is?
>

I thought you are fine with the separate interface for the hierarchical events.

https://lkml.kernel.org/r/20190128161201.GS50184@devbig004.ftw2.facebook.com

Is that not the case?

Shakeel
Johannes Weiner Jan. 30, 2019, 7:23 p.m. UTC | #34
On Mon, Jan 28, 2019 at 01:51:51PM +0100, Michal Hocko wrote:
> On Fri 25-01-19 10:28:08, Tejun Heo wrote:
> > On Fri, Jan 25, 2019 at 06:37:13PM +0100, Michal Hocko wrote:
> > > Please note that I understand that this might be confusing with the rest
> > > of the cgroup APIs but considering that this is the first time somebody
> > > is actually complaining and the interface is "production ready" for more
> > > than three years I am not really sure the situation is all that bad.
> > 
> > cgroup2 uptake hasn't progressed that fast.  None of the major distros
> > or container frameworks are currently shipping with it although many
> > are evaluating switching.  I don't think I'm too mistaken in that we
> > (FB) are at the bleeding edge in terms of adopting cgroup2 and its
> > various new features and are hitting these corner cases and oversights
> > in the process.  If there are noticeable breakages arising from this
> > change, we sure can backpaddle but I think the better course of action
> > is fixing them up while we can.
> 
> I do not really think you can go back. You cannot simply change semantic
> back and forth because you just break new users.
> 
> Really, I do not see the semantic changing after more than 3 years of
> production ready interface. If you really believe we need a hierarchical
> notification mechanism for the reclaim activity then add a new one.

This discussion needs to be more nuanced.

We change interfaces and user-visible behavior all the time when we
think nobody is likely to rely on it. Sometimes we change them after
decades of established behavior - for example the recent OOM killer
change to not kill children over parents.

The argument was made that it's very unlikely that we break any
existing user setups relying specifically on this behavior we are
trying to fix. I don't see a real dispute to this, other than a
repetition of "we can't change it after three years".

I also don't see a concrete description of a plausible scenario that
this change might break.

I would like to see a solid case for why this change is a notable risk
to actual users (interface age is not a criterium for other changes)
before discussing errata solutions.
Johannes Weiner Jan. 30, 2019, 7:27 p.m. UTC | #35
On Wed, Jan 30, 2019 at 11:11:44AM -0800, Shakeel Butt wrote:
> Hi Tejun,
> 
> On Wed, Jan 30, 2019 at 9:07 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello, Michal.
> >
> > On Wed, Jan 30, 2019 at 05:50:58PM +0100, Michal Hocko wrote:
> > > > Yeah, cgroup.events and .stat files as some of the local stats would
> > > > be useful too, so if we don't flip memory.events we'll end up with sth
> > > > like cgroup.events.local, memory.events.tree and memory.stats.local,
> > > > which is gonna be hilarious.
> > >
> > > Why cannot we simply have memory.events_tree and be done with it? Sure
> > > the file names are not goin to be consistent which is a minus but that
> > > ship has already sailed some time ago.
> >
> > Because the overall cost of shitty interface will be way higher in the
> > longer term.  cgroup2 interface is far from perfect but is way better
> > than cgroup1 especially for the memory controller.  Why do you think
> > that is?
> >
> 
> I thought you are fine with the separate interface for the hierarchical events.

Every other file in cgroup2 is hierarchical, but for recursive
memory.events you'd need to read memory.events_tree?

Do we hate our users that much? :(
Johannes Weiner Jan. 30, 2019, 7:30 p.m. UTC | #36
On Wed, Jan 30, 2019 at 02:27:12PM -0500, Johannes Weiner wrote:
> On Wed, Jan 30, 2019 at 11:11:44AM -0800, Shakeel Butt wrote:
> > Hi Tejun,
> > 
> > On Wed, Jan 30, 2019 at 9:07 AM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > Hello, Michal.
> > >
> > > On Wed, Jan 30, 2019 at 05:50:58PM +0100, Michal Hocko wrote:
> > > > > Yeah, cgroup.events and .stat files as some of the local stats would
> > > > > be useful too, so if we don't flip memory.events we'll end up with sth
> > > > > like cgroup.events.local, memory.events.tree and memory.stats.local,
> > > > > which is gonna be hilarious.
> > > >
> > > > Why cannot we simply have memory.events_tree and be done with it? Sure
> > > > the file names are not goin to be consistent which is a minus but that
> > > > ship has already sailed some time ago.
> > >
> > > Because the overall cost of shitty interface will be way higher in the
> > > longer term.  cgroup2 interface is far from perfect but is way better
> > > than cgroup1 especially for the memory controller.  Why do you think
> > > that is?
> > >
> > 
> > I thought you are fine with the separate interface for the hierarchical events.
> 
> Every other file in cgroup2 is hierarchical, but for recursive
> memory.events you'd need to read memory.events_tree?
> 
> Do we hate our users that much? :(

FTR, I would be okay with adding .local versions to existing files
where such a behavior could be useful. But that seems to be a separate
discussion from fixing memory.events here.
Shakeel Butt Jan. 30, 2019, 7:37 p.m. UTC | #37
On Wed, Jan 30, 2019 at 11:30 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Jan 30, 2019 at 02:27:12PM -0500, Johannes Weiner wrote:
> > On Wed, Jan 30, 2019 at 11:11:44AM -0800, Shakeel Butt wrote:
> > > Hi Tejun,
> > >
> > > On Wed, Jan 30, 2019 at 9:07 AM Tejun Heo <tj@kernel.org> wrote:
> > > >
> > > > Hello, Michal.
> > > >
> > > > On Wed, Jan 30, 2019 at 05:50:58PM +0100, Michal Hocko wrote:
> > > > > > Yeah, cgroup.events and .stat files as some of the local stats would
> > > > > > be useful too, so if we don't flip memory.events we'll end up with sth
> > > > > > like cgroup.events.local, memory.events.tree and memory.stats.local,
> > > > > > which is gonna be hilarious.
> > > > >
> > > > > Why cannot we simply have memory.events_tree and be done with it? Sure
> > > > > the file names are not goin to be consistent which is a minus but that
> > > > > ship has already sailed some time ago.
> > > >
> > > > Because the overall cost of shitty interface will be way higher in the
> > > > longer term.  cgroup2 interface is far from perfect but is way better
> > > > than cgroup1 especially for the memory controller.  Why do you think
> > > > that is?
> > > >
> > >
> > > I thought you are fine with the separate interface for the hierarchical events.
> >
> > Every other file in cgroup2 is hierarchical, but for recursive
> > memory.events you'd need to read memory.events_tree?
> >
> > Do we hate our users that much? :(
>
> FTR, I would be okay with adding .local versions to existing files
> where such a behavior could be useful. But that seems to be a separate
> discussion from fixing memory.events here.

Oh ok, the dispute is on the name of the interface. I am fine with
whatever the decision is made as we (Google) are still not using these
interfaces. However what's the way forward here?

thanks,
Shakeel
Michal Hocko Jan. 30, 2019, 8:05 p.m. UTC | #38
On Wed 30-01-19 14:23:45, Johannes Weiner wrote:
> On Mon, Jan 28, 2019 at 01:51:51PM +0100, Michal Hocko wrote:
> > On Fri 25-01-19 10:28:08, Tejun Heo wrote:
> > > On Fri, Jan 25, 2019 at 06:37:13PM +0100, Michal Hocko wrote:
> > > > Please note that I understand that this might be confusing with the rest
> > > > of the cgroup APIs but considering that this is the first time somebody
> > > > is actually complaining and the interface is "production ready" for more
> > > > than three years I am not really sure the situation is all that bad.
> > > 
> > > cgroup2 uptake hasn't progressed that fast.  None of the major distros
> > > or container frameworks are currently shipping with it although many
> > > are evaluating switching.  I don't think I'm too mistaken in that we
> > > (FB) are at the bleeding edge in terms of adopting cgroup2 and its
> > > various new features and are hitting these corner cases and oversights
> > > in the process.  If there are noticeable breakages arising from this
> > > change, we sure can backpaddle but I think the better course of action
> > > is fixing them up while we can.
> > 
> > I do not really think you can go back. You cannot simply change semantic
> > back and forth because you just break new users.
> > 
> > Really, I do not see the semantic changing after more than 3 years of
> > production ready interface. If you really believe we need a hierarchical
> > notification mechanism for the reclaim activity then add a new one.
> 
> This discussion needs to be more nuanced.
> 
> We change interfaces and user-visible behavior all the time when we
> think nobody is likely to rely on it. Sometimes we change them after
> decades of established behavior - for example the recent OOM killer
> change to not kill children over parents.

That is an implementation detail of a kernel internal functionality.
Most of changes in the kernel tend to have user visible effects. This is
not what we are discussing here. We are talking about a change of user
visibile API semantic change. And that is a completely different story.

> The argument was made that it's very unlikely that we break any
> existing user setups relying specifically on this behavior we are
> trying to fix. I don't see a real dispute to this, other than a
> repetition of "we can't change it after three years".
> 
> I also don't see a concrete description of a plausible scenario that
> this change might break.
> 
> I would like to see a solid case for why this change is a notable risk
> to actual users (interface age is not a criterium for other changes)
> before discussing errata solutions.

I thought I have already mentioned an example. Say you have an observer
on the top of a delegated cgroup hierarchy and you setup limits (e.g. hard
limit) on the root of it. If you get an OOM event then you know that the
whole hierarchy might be underprovisioned and perform some rebalancing.
Now you really do not care that somewhere down the delegated tree there
was an oom. Such a spurious event would just confuse the monitoring and
lead to wrong decisions.
Johannes Weiner Jan. 30, 2019, 9:31 p.m. UTC | #39
On Wed, Jan 30, 2019 at 09:05:59PM +0100, Michal Hocko wrote:
> On Wed 30-01-19 14:23:45, Johannes Weiner wrote:
> > On Mon, Jan 28, 2019 at 01:51:51PM +0100, Michal Hocko wrote:
> > > On Fri 25-01-19 10:28:08, Tejun Heo wrote:
> > > > On Fri, Jan 25, 2019 at 06:37:13PM +0100, Michal Hocko wrote:
> > > > > Please note that I understand that this might be confusing with the rest
> > > > > of the cgroup APIs but considering that this is the first time somebody
> > > > > is actually complaining and the interface is "production ready" for more
> > > > > than three years I am not really sure the situation is all that bad.
> > > > 
> > > > cgroup2 uptake hasn't progressed that fast.  None of the major distros
> > > > or container frameworks are currently shipping with it although many
> > > > are evaluating switching.  I don't think I'm too mistaken in that we
> > > > (FB) are at the bleeding edge in terms of adopting cgroup2 and its
> > > > various new features and are hitting these corner cases and oversights
> > > > in the process.  If there are noticeable breakages arising from this
> > > > change, we sure can backpaddle but I think the better course of action
> > > > is fixing them up while we can.
> > > 
> > > I do not really think you can go back. You cannot simply change semantic
> > > back and forth because you just break new users.
> > > 
> > > Really, I do not see the semantic changing after more than 3 years of
> > > production ready interface. If you really believe we need a hierarchical
> > > notification mechanism for the reclaim activity then add a new one.
> > 
> > This discussion needs to be more nuanced.
> > 
> > We change interfaces and user-visible behavior all the time when we
> > think nobody is likely to rely on it. Sometimes we change them after
> > decades of established behavior - for example the recent OOM killer
> > change to not kill children over parents.
> 
> That is an implementation detail of a kernel internal functionality.
> Most of changes in the kernel tend to have user visible effects. This is
> not what we are discussing here. We are talking about a change of user
> visibile API semantic change. And that is a completely different story.

I think drawing such a strong line between these two is a mistake. The
critical thing is whether we change something real people rely on.

It's possible somebody relies on the child killing behavior. But it's
fairly unlikely, which is why it's okay to risk the change.

> > The argument was made that it's very unlikely that we break any
> > existing user setups relying specifically on this behavior we are
> > trying to fix. I don't see a real dispute to this, other than a
> > repetition of "we can't change it after three years".
> > 
> > I also don't see a concrete description of a plausible scenario that
> > this change might break.
> > 
> > I would like to see a solid case for why this change is a notable risk
> > to actual users (interface age is not a criterium for other changes)
> > before discussing errata solutions.
> 
> I thought I have already mentioned an example. Say you have an observer
> on the top of a delegated cgroup hierarchy and you setup limits (e.g. hard
> limit) on the root of it. If you get an OOM event then you know that the
> whole hierarchy might be underprovisioned and perform some rebalancing.
> Now you really do not care that somewhere down the delegated tree there
> was an oom. Such a spurious event would just confuse the monitoring and
> lead to wrong decisions.

You can construct a usecase like this, as per above with OOM, but it's
incredibly unlikely for something like this to exist. There is plenty
of evidence on adoption rate that supports this: we know where the big
names in containerization are; we see the things we run into that have
not been reported yet etc.

Compare this to real problems this has already caused for
us. Multi-level control and monitoring is a fundamental concept of the
cgroup design, so naturally our infrastructure doesn't monitor and log
at the individual job level (too much data, and also kind of pointless
when the jobs are identical) but at aggregate parental levels.

Because of this wart, we have missed problematic configurations when
the low, high, max events were not propagated as expected (we log oom
separately, so we still noticed those). Even once we knew about it, we
had trouble tracking these configurations down for the same reason -
the data isn't logged, and won't be logged, at this level.

Adding a separate, hierarchical file would solve this one particular
problem for us, but it wouldn't fix this pitfall for all future users
of cgroup2 (which by all available evidence is still most of them) and
would be a wart on the interface that we'd carry forever.

Adding a note in cgroup-v2.txt doesn't make up for the fact that this
behavior flies in the face of basic UX concepts that underly the
hierarchical monitoring and control idea of the cgroup2fs.

The fact that the current behavior MIGHT HAVE a valid application does
not mean that THIS FILE should be providing it. It IS NOT an argument
against this patch here, just an argument for a separate patch that
adds this functionality in a way that is consistent with the rest of
the interface (e.g. systematically adding .local files).

The current semantics have real costs to real users. You cannot
dismiss them or handwave them away with a hypothetical regression.

I would really ask you to consider the real world usage and adoption
data we have on cgroup2, rather than insist on a black and white
answer to this situation.
Michal Hocko Jan. 31, 2019, 8:58 a.m. UTC | #40
On Wed 30-01-19 16:31:31, Johannes Weiner wrote:
> On Wed, Jan 30, 2019 at 09:05:59PM +0100, Michal Hocko wrote:
[...]
> > I thought I have already mentioned an example. Say you have an observer
> > on the top of a delegated cgroup hierarchy and you setup limits (e.g. hard
> > limit) on the root of it. If you get an OOM event then you know that the
> > whole hierarchy might be underprovisioned and perform some rebalancing.
> > Now you really do not care that somewhere down the delegated tree there
> > was an oom. Such a spurious event would just confuse the monitoring and
> > lead to wrong decisions.
> 
> You can construct a usecase like this, as per above with OOM, but it's
> incredibly unlikely for something like this to exist. There is plenty
> of evidence on adoption rate that supports this: we know where the big
> names in containerization are; we see the things we run into that have
> not been reported yet etc.
> 
> Compare this to real problems this has already caused for
> us. Multi-level control and monitoring is a fundamental concept of the
> cgroup design, so naturally our infrastructure doesn't monitor and log
> at the individual job level (too much data, and also kind of pointless
> when the jobs are identical) but at aggregate parental levels.
> 
> Because of this wart, we have missed problematic configurations when
> the low, high, max events were not propagated as expected (we log oom
> separately, so we still noticed those). Even once we knew about it, we
> had trouble tracking these configurations down for the same reason -
> the data isn't logged, and won't be logged, at this level.

Yes, I do understand that you might be interested in the hierarchical
accounting.

> Adding a separate, hierarchical file would solve this one particular
> problem for us, but it wouldn't fix this pitfall for all future users
> of cgroup2 (which by all available evidence is still most of them) and
> would be a wart on the interface that we'd carry forever.

I understand even this reasoning but if I have to chose between a risk
of user breakage that would require to reimplement the monitoring or an
API incosistency I vote for the first option. It is unfortunate but this
is the way we deal with APIs and compatibility.

> Adding a note in cgroup-v2.txt doesn't make up for the fact that this
> behavior flies in the face of basic UX concepts that underly the
> hierarchical monitoring and control idea of the cgroup2fs.
> 
> The fact that the current behavior MIGHT HAVE a valid application does
> not mean that THIS FILE should be providing it. It IS NOT an argument
> against this patch here, just an argument for a separate patch that
> adds this functionality in a way that is consistent with the rest of
> the interface (e.g. systematically adding .local files).
> 
> The current semantics have real costs to real users. You cannot
> dismiss them or handwave them away with a hypothetical regression.
> 
> I would really ask you to consider the real world usage and adoption
> data we have on cgroup2, rather than insist on a black and white
> answer to this situation.

Those users requiring the hierarchical beahvior can use the new file
without any risk of breakages so I really do not see why we should
undertake the risk and do it the other way around.
Johannes Weiner Jan. 31, 2019, 4:22 p.m. UTC | #41
On Thu, Jan 31, 2019 at 09:58:08AM +0100, Michal Hocko wrote:
> On Wed 30-01-19 16:31:31, Johannes Weiner wrote:
> > On Wed, Jan 30, 2019 at 09:05:59PM +0100, Michal Hocko wrote:
> [...]
> > > I thought I have already mentioned an example. Say you have an observer
> > > on the top of a delegated cgroup hierarchy and you setup limits (e.g. hard
> > > limit) on the root of it. If you get an OOM event then you know that the
> > > whole hierarchy might be underprovisioned and perform some rebalancing.
> > > Now you really do not care that somewhere down the delegated tree there
> > > was an oom. Such a spurious event would just confuse the monitoring and
> > > lead to wrong decisions.
> > 
> > You can construct a usecase like this, as per above with OOM, but it's
> > incredibly unlikely for something like this to exist. There is plenty
> > of evidence on adoption rate that supports this: we know where the big
> > names in containerization are; we see the things we run into that have
> > not been reported yet etc.
> > 
> > Compare this to real problems this has already caused for
> > us. Multi-level control and monitoring is a fundamental concept of the
> > cgroup design, so naturally our infrastructure doesn't monitor and log
> > at the individual job level (too much data, and also kind of pointless
> > when the jobs are identical) but at aggregate parental levels.
> > 
> > Because of this wart, we have missed problematic configurations when
> > the low, high, max events were not propagated as expected (we log oom
> > separately, so we still noticed those). Even once we knew about it, we
> > had trouble tracking these configurations down for the same reason -
> > the data isn't logged, and won't be logged, at this level.
> 
> Yes, I do understand that you might be interested in the hierarchical
> accounting.
> 
> > Adding a separate, hierarchical file would solve this one particular
> > problem for us, but it wouldn't fix this pitfall for all future users
> > of cgroup2 (which by all available evidence is still most of them) and
> > would be a wart on the interface that we'd carry forever.
> 
> I understand even this reasoning but if I have to chose between a risk
> of user breakage that would require to reimplement the monitoring or an
> API incosistency I vote for the first option. It is unfortunate but this
> is the way we deal with APIs and compatibility.

I don't know why you keep repeating this, it's simply not how Linux
API is maintained in practice.

In cgroup2, we fixed io.stat to not conflate discard IO and write IO:
636620b66d5d4012c4a9c86206013964d3986c4f

Linus changed the Vmalloc field semantics in /proc/meminfo after over
a decade, without a knob to restore it in production:

    If this breaks anything, we'll obviously have to re-introduce the code
    to compute this all and add the caching patches on top.  But if given
    the option, I'd really prefer to just remove this bad idea entirely
    rather than add even more code to work around our historical mistake
    that likely nobody really cares about.
    a5ad88ce8c7fae7ddc72ee49a11a75aa837788e0

Mel changed the zone_reclaim_mode default behavior after over a
decade:

    Those that require zone_reclaim_mode are likely to be able to
    detect when it needs to be enabled and tune appropriately so lets
    have a sensible default for the bulk of users.
    4f9b16a64753d0bb607454347036dc997fd03b82
    Acked-by: Michal Hocko <mhocko@suse.cz>

And then Mel changed the default zonelist ordering to pick saner
behavior for most users, followed by a complete removal of the zone
list ordering, after again, decades of existence of these things:

    commit c9bff3eebc09be23fbc868f5e6731666d23cbea3
    Author: Michal Hocko <mhocko@suse.com>
    Date:   Wed Sep 6 16:20:13 2017 -0700

        mm, page_alloc: rip out ZONELIST_ORDER_ZONE

And why did we do any of those things and risk user disruption every
single time? Because the existing behavior was not a good default, a
burden on people, and the risk of breakage was sufficiently low.

I don't see how this case is different, and you haven't provided any
arguments that would explain that.

> > Adding a note in cgroup-v2.txt doesn't make up for the fact that this
> > behavior flies in the face of basic UX concepts that underly the
> > hierarchical monitoring and control idea of the cgroup2fs.
> > 
> > The fact that the current behavior MIGHT HAVE a valid application does
> > not mean that THIS FILE should be providing it. It IS NOT an argument
> > against this patch here, just an argument for a separate patch that
> > adds this functionality in a way that is consistent with the rest of
> > the interface (e.g. systematically adding .local files).
> > 
> > The current semantics have real costs to real users. You cannot
> > dismiss them or handwave them away with a hypothetical regression.
> > 
> > I would really ask you to consider the real world usage and adoption
> > data we have on cgroup2, rather than insist on a black and white
> > answer to this situation.
> 
> Those users requiring the hierarchical beahvior can use the new file
> without any risk of breakages so I really do not see why we should
> undertake the risk and do it the other way around.

Okay, so let's find a way forward here.

1. A new memory.events_tree file or similar. This would give us a way
to get the desired hierarchical behavior. The downside is that it's
suggesting that ${x} and ${x}_tree are the local and hierarchical
versions of a cgroup file, and that's false everywhere else. Saying we
would document it is a cop-out and doesn't actually make the interface
less confusing (most people don't look at errata documentation until
they've been burned by unexpected behavior).

2. A runtime switch (cgroup mount option, sysctl, what have you) that
lets you switch between the local and the tree behavior. This would be
able to provide the desired semantics in a clean interface, while
still having the ability to support legacy users.

2a. A runtime switch that defaults to the local behavior.

2b. A runtime switch that defaults to the tree behavior.

The choice between 2a and 2b comes down to how big we evaluate the
risk that somebody has an existing dependency on the local behavior.

Given what we know about cgroup2 usage, and considering our previous
behavior in such matters, I'd say 2b is reasonable and in line with
how we tend to handle these things. On the tiny chance that somebody
is using the current behavior, they can flick the switch (until we add
the .local files, or simply use the switch forever).
Michal Hocko Feb. 1, 2019, 10:27 a.m. UTC | #42
On Thu 31-01-19 11:22:48, Johannes Weiner wrote:
> On Thu, Jan 31, 2019 at 09:58:08AM +0100, Michal Hocko wrote:
> > On Wed 30-01-19 16:31:31, Johannes Weiner wrote:
> > > On Wed, Jan 30, 2019 at 09:05:59PM +0100, Michal Hocko wrote:
> > [...]
> > > > I thought I have already mentioned an example. Say you have an observer
> > > > on the top of a delegated cgroup hierarchy and you setup limits (e.g. hard
> > > > limit) on the root of it. If you get an OOM event then you know that the
> > > > whole hierarchy might be underprovisioned and perform some rebalancing.
> > > > Now you really do not care that somewhere down the delegated tree there
> > > > was an oom. Such a spurious event would just confuse the monitoring and
> > > > lead to wrong decisions.
> > > 
> > > You can construct a usecase like this, as per above with OOM, but it's
> > > incredibly unlikely for something like this to exist. There is plenty
> > > of evidence on adoption rate that supports this: we know where the big
> > > names in containerization are; we see the things we run into that have
> > > not been reported yet etc.
> > > 
> > > Compare this to real problems this has already caused for
> > > us. Multi-level control and monitoring is a fundamental concept of the
> > > cgroup design, so naturally our infrastructure doesn't monitor and log
> > > at the individual job level (too much data, and also kind of pointless
> > > when the jobs are identical) but at aggregate parental levels.
> > > 
> > > Because of this wart, we have missed problematic configurations when
> > > the low, high, max events were not propagated as expected (we log oom
> > > separately, so we still noticed those). Even once we knew about it, we
> > > had trouble tracking these configurations down for the same reason -
> > > the data isn't logged, and won't be logged, at this level.
> > 
> > Yes, I do understand that you might be interested in the hierarchical
> > accounting.
> > 
> > > Adding a separate, hierarchical file would solve this one particular
> > > problem for us, but it wouldn't fix this pitfall for all future users
> > > of cgroup2 (which by all available evidence is still most of them) and
> > > would be a wart on the interface that we'd carry forever.
> > 
> > I understand even this reasoning but if I have to chose between a risk
> > of user breakage that would require to reimplement the monitoring or an
> > API incosistency I vote for the first option. It is unfortunate but this
> > is the way we deal with APIs and compatibility.
> 
> I don't know why you keep repeating this, it's simply not how Linux
> API is maintained in practice.
> 
> In cgroup2, we fixed io.stat to not conflate discard IO and write IO:
> 636620b66d5d4012c4a9c86206013964d3986c4f
> 
> Linus changed the Vmalloc field semantics in /proc/meminfo after over
> a decade, without a knob to restore it in production:
> 
>     If this breaks anything, we'll obviously have to re-introduce the code
>     to compute this all and add the caching patches on top.  But if given
>     the option, I'd really prefer to just remove this bad idea entirely
>     rather than add even more code to work around our historical mistake
>     that likely nobody really cares about.
>     a5ad88ce8c7fae7ddc72ee49a11a75aa837788e0
> 
> Mel changed the zone_reclaim_mode default behavior after over a
> decade:
> 
>     Those that require zone_reclaim_mode are likely to be able to
>     detect when it needs to be enabled and tune appropriately so lets
>     have a sensible default for the bulk of users.
>     4f9b16a64753d0bb607454347036dc997fd03b82
>     Acked-by: Michal Hocko <mhocko@suse.cz>
> 
> And then Mel changed the default zonelist ordering to pick saner
> behavior for most users, followed by a complete removal of the zone
> list ordering, after again, decades of existence of these things:
> 
>     commit c9bff3eebc09be23fbc868f5e6731666d23cbea3
>     Author: Michal Hocko <mhocko@suse.com>
>     Date:   Wed Sep 6 16:20:13 2017 -0700
> 
>         mm, page_alloc: rip out ZONELIST_ORDER_ZONE
> 
> And why did we do any of those things and risk user disruption every
> single time? Because the existing behavior was not a good default, a
> burden on people, and the risk of breakage was sufficiently low.
> 
> I don't see how this case is different, and you haven't provided any
> arguments that would explain that.

Because there is no simple way to revert in _this_ particular case. Once
you change the semantic of the file you cannot simply make it
non-hierarchical after somebody complains. You do not want to break both
worlds. See the difference?

[...]
> > Those users requiring the hierarchical beahvior can use the new file
> > without any risk of breakages so I really do not see why we should
> > undertake the risk and do it the other way around.
> 
> Okay, so let's find a way forward here.
> 
> 1. A new memory.events_tree file or similar. This would give us a way
> to get the desired hierarchical behavior. The downside is that it's
> suggesting that ${x} and ${x}_tree are the local and hierarchical
> versions of a cgroup file, and that's false everywhere else. Saying we
> would document it is a cop-out and doesn't actually make the interface
> less confusing (most people don't look at errata documentation until
> they've been burned by unexpected behavior).
> 
> 2. A runtime switch (cgroup mount option, sysctl, what have you) that
> lets you switch between the local and the tree behavior. This would be
> able to provide the desired semantics in a clean interface, while
> still having the ability to support legacy users.

With an obvious downside that one or the other usecase has to learn that
the current semantic is different than expected which is again something
that has to be documented so we are in the same "people don't look at
errata documentation...". Another obvious problem is that you might have
two workloads with different semantic expectations and then this option
simply falls flat.

> 2a. A runtime switch that defaults to the local behavior.
> 
> 2b. A runtime switch that defaults to the tree behavior.
> 
> The choice between 2a and 2b comes down to how big we evaluate the
> risk that somebody has an existing dependency on the local behavior.
> 
> Given what we know about cgroup2 usage, and considering our previous
> behavior in such matters, I'd say 2b is reasonable and in line with
> how we tend to handle these things. On the tiny chance that somebody
> is using the current behavior, they can flick the switch (until we add
> the .local files, or simply use the switch forever).

My preference is 1 but if there is a _larger_ consensus of different
cgroup v2  users that 2 is more preferred then I can live with that.
Johannes Weiner Feb. 1, 2019, 4:34 p.m. UTC | #43
On Fri, Feb 01, 2019 at 11:27:41AM +0100, Michal Hocko wrote:
> On Thu 31-01-19 11:22:48, Johannes Weiner wrote:
> > On Thu, Jan 31, 2019 at 09:58:08AM +0100, Michal Hocko wrote:
> > > On Wed 30-01-19 16:31:31, Johannes Weiner wrote:
> > > > On Wed, Jan 30, 2019 at 09:05:59PM +0100, Michal Hocko wrote:
> > > [...]
> > > > > I thought I have already mentioned an example. Say you have an observer
> > > > > on the top of a delegated cgroup hierarchy and you setup limits (e.g. hard
> > > > > limit) on the root of it. If you get an OOM event then you know that the
> > > > > whole hierarchy might be underprovisioned and perform some rebalancing.
> > > > > Now you really do not care that somewhere down the delegated tree there
> > > > > was an oom. Such a spurious event would just confuse the monitoring and
> > > > > lead to wrong decisions.
> > > > 
> > > > You can construct a usecase like this, as per above with OOM, but it's
> > > > incredibly unlikely for something like this to exist. There is plenty
> > > > of evidence on adoption rate that supports this: we know where the big
> > > > names in containerization are; we see the things we run into that have
> > > > not been reported yet etc.
> > > > 
> > > > Compare this to real problems this has already caused for
> > > > us. Multi-level control and monitoring is a fundamental concept of the
> > > > cgroup design, so naturally our infrastructure doesn't monitor and log
> > > > at the individual job level (too much data, and also kind of pointless
> > > > when the jobs are identical) but at aggregate parental levels.
> > > > 
> > > > Because of this wart, we have missed problematic configurations when
> > > > the low, high, max events were not propagated as expected (we log oom
> > > > separately, so we still noticed those). Even once we knew about it, we
> > > > had trouble tracking these configurations down for the same reason -
> > > > the data isn't logged, and won't be logged, at this level.
> > > 
> > > Yes, I do understand that you might be interested in the hierarchical
> > > accounting.
> > > 
> > > > Adding a separate, hierarchical file would solve this one particular
> > > > problem for us, but it wouldn't fix this pitfall for all future users
> > > > of cgroup2 (which by all available evidence is still most of them) and
> > > > would be a wart on the interface that we'd carry forever.
> > > 
> > > I understand even this reasoning but if I have to chose between a risk
> > > of user breakage that would require to reimplement the monitoring or an
> > > API incosistency I vote for the first option. It is unfortunate but this
> > > is the way we deal with APIs and compatibility.
> > 
> > I don't know why you keep repeating this, it's simply not how Linux
> > API is maintained in practice.
> > 
> > In cgroup2, we fixed io.stat to not conflate discard IO and write IO:
> > 636620b66d5d4012c4a9c86206013964d3986c4f
> > 
> > Linus changed the Vmalloc field semantics in /proc/meminfo after over
> > a decade, without a knob to restore it in production:
> > 
> >     If this breaks anything, we'll obviously have to re-introduce the code
> >     to compute this all and add the caching patches on top.  But if given
> >     the option, I'd really prefer to just remove this bad idea entirely
> >     rather than add even more code to work around our historical mistake
> >     that likely nobody really cares about.
> >     a5ad88ce8c7fae7ddc72ee49a11a75aa837788e0
> > 
> > Mel changed the zone_reclaim_mode default behavior after over a
> > decade:
> > 
> >     Those that require zone_reclaim_mode are likely to be able to
> >     detect when it needs to be enabled and tune appropriately so lets
> >     have a sensible default for the bulk of users.
> >     4f9b16a64753d0bb607454347036dc997fd03b82
> >     Acked-by: Michal Hocko <mhocko@suse.cz>
> > 
> > And then Mel changed the default zonelist ordering to pick saner
> > behavior for most users, followed by a complete removal of the zone
> > list ordering, after again, decades of existence of these things:
> > 
> >     commit c9bff3eebc09be23fbc868f5e6731666d23cbea3
> >     Author: Michal Hocko <mhocko@suse.com>
> >     Date:   Wed Sep 6 16:20:13 2017 -0700
> > 
> >         mm, page_alloc: rip out ZONELIST_ORDER_ZONE
> > 
> > And why did we do any of those things and risk user disruption every
> > single time? Because the existing behavior was not a good default, a
> > burden on people, and the risk of breakage was sufficiently low.
> > 
> > I don't see how this case is different, and you haven't provided any
> > arguments that would explain that.
> 
> Because there is no simple way to revert in _this_ particular case. Once
> you change the semantic of the file you cannot simply make it
> non-hierarchical after somebody complains. You do not want to break both
> worlds. See the difference?

Yes and no. We cannot revert if both cases are in use, but we can
support both cases; add the .local file, or - for binary compatibility
- add the compat mount flag to switch to the old behavior.

In the vmalloc and the zonelist_order_zone cases above, any users that
would rely on those features would have to blacklist certain kernel
versions in which they are not available.

In this case, any users that rely on the old behavior would have to
mount cgroup2 with the compat knob.

Arguably, it's easier to pass a mount option than it is to change the
entire kernel.

> > > Those users requiring the hierarchical beahvior can use the new file
> > > without any risk of breakages so I really do not see why we should
> > > undertake the risk and do it the other way around.
> > 
> > Okay, so let's find a way forward here.
> > 
> > 1. A new memory.events_tree file or similar. This would give us a way
> > to get the desired hierarchical behavior. The downside is that it's
> > suggesting that ${x} and ${x}_tree are the local and hierarchical
> > versions of a cgroup file, and that's false everywhere else. Saying we
> > would document it is a cop-out and doesn't actually make the interface
> > less confusing (most people don't look at errata documentation until
> > they've been burned by unexpected behavior).
> > 
> > 2. A runtime switch (cgroup mount option, sysctl, what have you) that
> > lets you switch between the local and the tree behavior. This would be
> > able to provide the desired semantics in a clean interface, while
> > still having the ability to support legacy users.
> 
> With an obvious downside that one or the other usecase has to learn that
> the current semantic is different than expected which is again something
> that has to be documented so we are in the same "people don't look at
> errata documentation...".

Of course, but that's where the "which situation is more likely" comes
in. That was the basis for all the historic changes I quoted above.

> > 2a. A runtime switch that defaults to the local behavior.
> > 
> > 2b. A runtime switch that defaults to the tree behavior.
> > 
> > The choice between 2a and 2b comes down to how big we evaluate the
> > risk that somebody has an existing dependency on the local behavior.
> > 
> > Given what we know about cgroup2 usage, and considering our previous
> > behavior in such matters, I'd say 2b is reasonable and in line with
> > how we tend to handle these things. On the tiny chance that somebody
> > is using the current behavior, they can flick the switch (until we add
> > the .local files, or simply use the switch forever).
> 
> My preference is 1 but if there is a _larger_ consensus of different
> cgroup v2  users that 2 is more preferred then I can live with that.

Thanks.

Patch
diff mbox series

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 380a212a8c52..5428b372def4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -769,8 +769,10 @@  static inline void count_memcg_event_mm(struct mm_struct *mm,
 static inline void memcg_memory_event(struct mem_cgroup *memcg,
 				      enum memcg_memory_event event)
 {
-	atomic_long_inc(&memcg->memory_events[event]);
-	cgroup_file_notify(&memcg->events_file);
+	do {
+		atomic_long_inc(&memcg->memory_events[event]);
+		cgroup_file_notify(&memcg->events_file);
+	} while ((memcg = parent_mem_cgroup(memcg)));
 }
 
 static inline void memcg_memory_event_mm(struct mm_struct *mm,