* Re: + mm-consider-subtrees-in-memoryevents.patch added to -mm tree
       [not found] <20190212224542.ZW63a%akpm@linux-foundation.org>
@ 2019-02-13 12:47 ` Michal Hocko
  2019-05-16 17:56   ` Johannes Weiner
  2019-05-17 13:00   ` Shakeel Butt
  0 siblings, 2 replies; 12+ messages in thread
From: Michal Hocko @ 2019-02-13 12:47 UTC (permalink / raw)
  To: akpm
  Cc: mm-commits, tj, hannes, guro, dennis, chris, cgroups mailinglist,
	linux-mm

On Tue 12-02-19 14:45:42, Andrew Morton wrote:
[...]
> From: Chris Down <chris@chrisdown.name>
> Subject: mm, memcg: consider subtrees in memory.events
> 
> memory.stat and other files already consider subtrees in their output, and
> we should too in order to not present an inconsistent interface.
> 
> The current situation is fairly confusing, because people interacting with
> cgroups expect hierarchical behaviour in the vein of memory.stat,
> cgroup.events, and other files.  For example, this causes confusion when
> debugging reclaim events under low, as currently these always read "0" at
> non-leaf memcg nodes, which frequently causes people to misdiagnose breach
> behaviour.  The same confusion applies to other counters in this file when
> debugging issues.
> 
> Aggregation is done at write time instead of at read-time since these
> counters aren't hot (unlike memory.stat which is per-page, so it does it
> at read time), and it makes sense to bundle this with the file
> notifications.
> 
> After this patch, events are propagated up the hierarchy:
> 
>     [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
>     low 0
>     high 0
>     max 0
>     oom 0
>     oom_kill 0
>     [root@ktst ~]# systemd-run -p MemoryMax=1 true
>     Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
>     [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
>     low 0
>     high 0
>     max 7
>     oom 1
>     oom_kill 1
> 
> As this is a change in behaviour, this can be reverted to the old
> behaviour by mounting with the `memory_localevents' flag set.  However, we
> use the new behaviour by default as there's a lack of evidence that there
> are any current users of memory.events that would find this change
> undesirable.
> 
> Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Dennis Zhou <dennis@kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

FTR: As I've already said here [1] I can live with this change as long
as there is a larger consensus among cgroup v2 users. So let's give this
some more time before merging to see whether there is such a consensus.

[1] http://lkml.kernel.org/r/20190201102515.GK11599@dhcp22.suse.cz

> ---
> 
>  Documentation/admin-guide/cgroup-v2.rst |    9 +++++++++
>  include/linux/cgroup-defs.h             |    5 +++++
>  include/linux/memcontrol.h              |   10 ++++++++--
>  kernel/cgroup/cgroup.c                  |   16 ++++++++++++++--
>  4 files changed, 36 insertions(+), 4 deletions(-)
> 
> --- a/Documentation/admin-guide/cgroup-v2.rst~mm-consider-subtrees-in-memoryevents
> +++ a/Documentation/admin-guide/cgroup-v2.rst
> @@ -177,6 +177,15 @@ cgroup v2 currently supports the followi
>  	ignored on non-init namespace mounts.  Please refer to the
>  	Delegation section for details.
>  
> +  memory_localevents
> +
> +        Only populate memory.events with data for the current cgroup,
> +        and not any subtrees. This is legacy behaviour, the default
> +        behaviour without this option is to include subtree counts.
> +        This option is system wide and can only be set on mount or
> +        modified through remount from the init namespace. The mount
> +        option is ignored on non-init namespace mounts.
> +
>  
>  Organizing Processes and Threads
>  --------------------------------
> --- a/include/linux/cgroup-defs.h~mm-consider-subtrees-in-memoryevents
> +++ a/include/linux/cgroup-defs.h
> @@ -83,6 +83,11 @@ enum {
>  	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
>  	 */
>  	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
> +
> +	/*
> +	 * Enable legacy local memory.events.
> +	 */
> +	CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 5),
>  };
>  
>  /* cftype->flags */
> --- a/include/linux/memcontrol.h~mm-consider-subtrees-in-memoryevents
> +++ a/include/linux/memcontrol.h
> @@ -789,8 +789,14 @@ static inline void count_memcg_event_mm(
>  static inline void memcg_memory_event(struct mem_cgroup *memcg,
>  				      enum memcg_memory_event event)
>  {
> -	atomic_long_inc(&memcg->memory_events[event]);
> -	cgroup_file_notify(&memcg->events_file);
> +	do {
> +		atomic_long_inc(&memcg->memory_events[event]);
> +		cgroup_file_notify(&memcg->events_file);
> +
> +		if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
> +			break;
> +	} while ((memcg = parent_mem_cgroup(memcg)) &&
> +		 !mem_cgroup_is_root(memcg));
>  }
>  
>  static inline void memcg_memory_event_mm(struct mm_struct *mm,
> --- a/kernel/cgroup/cgroup.c~mm-consider-subtrees-in-memoryevents
> +++ a/kernel/cgroup/cgroup.c
> @@ -1775,11 +1775,13 @@ int cgroup_show_path(struct seq_file *sf
>  
>  enum cgroup2_param {
>  	Opt_nsdelegate,
> +	Opt_memory_localevents,
>  	nr__cgroup2_params
>  };
>  
>  static const struct fs_parameter_spec cgroup2_param_specs[] = {
> -	fsparam_flag  ("nsdelegate",		Opt_nsdelegate),
> +	fsparam_flag("nsdelegate",		Opt_nsdelegate),
> +	fsparam_flag("memory_localevents",	Opt_memory_localevents),
>  	{}
>  };
>  
> @@ -1802,6 +1804,9 @@ static int cgroup2_parse_param(struct fs
>  	case Opt_nsdelegate:
>  		ctx->flags |= CGRP_ROOT_NS_DELEGATE;
>  		return 0;
> +	case Opt_memory_localevents:
> +		ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
> +		return 0;
>  	}
>  	return -EINVAL;
>  }
> @@ -1813,6 +1818,11 @@ static void apply_cgroup_root_flags(unsi
>  			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
>  		else
>  			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
> +
> +		if (root_flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
> +			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
> +		else
> +			cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_LOCAL_EVENTS;
>  	}
>  }
>  
> @@ -1820,6 +1830,8 @@ static int cgroup_show_options(struct se
>  {
>  	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
>  		seq_puts(seq, ",nsdelegate");
> +	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
> +		seq_puts(seq, ",memory_localevents");
>  	return 0;
>  }
>  
> @@ -6207,7 +6219,7 @@ static struct kobj_attribute cgroup_dele
>  static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
>  			     char *buf)
>  {
> -	return snprintf(buf, PAGE_SIZE, "nsdelegate\n");
> +	return snprintf(buf, PAGE_SIZE, "nsdelegate\nmemory_localevents\n");
>  }
>  static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
>  
> _
> 
> Patches currently in -mm which might be from chris@chrisdown.name are
> 
> mm-create-mem_cgroup_from_seq.patch
> mm-extract-memcg-maxable-seq_file-logic-to-seq_show_memcg_tunable.patch
> mm-proportional-memorylowmin-reclaim.patch
> mm-proportional-memorylowmin-reclaim-fix.patch
> mm-memcontrol-expose-thp-events-on-a-per-memcg-basis.patch
> mm-memcontrol-expose-thp-events-on-a-per-memcg-basis-fix-2.patch
> mm-make-memoryemin-the-baseline-for-utilisation-determination.patch
> mm-rename-ambiguously-named-memorystat-counters-and-functions.patch
> mm-consider-subtrees-in-memoryevents.patch

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + mm-consider-subtrees-in-memoryevents.patch added to -mm tree
  2019-02-13 12:47 ` + mm-consider-subtrees-in-memoryevents.patch added to -mm tree Michal Hocko
@ 2019-05-16 17:56   ` Johannes Weiner
  2019-05-16 18:10     ` Michal Hocko
  2019-05-17 13:00   ` Shakeel Butt
  1 sibling, 1 reply; 12+ messages in thread
From: Johannes Weiner @ 2019-05-16 17:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, mm-commits, tj, guro, dennis, chris, cgroups mailinglist, linux-mm

On Wed, Feb 13, 2019 at 01:47:29PM +0100, Michal Hocko wrote:
> On Tue 12-02-19 14:45:42, Andrew Morton wrote:
> [...]
> > From: Chris Down <chris@chrisdown.name>
> > Subject: mm, memcg: consider subtrees in memory.events
> > 
> > [...]
> 
> FTR: As I've already said here [1] I can live with this change as long
> as there is a larger consensus among cgroup v2 users. So let's give this
> some more time before merging to see whether there is such a consensus.
> 
> [1] http://lkml.kernel.org/r/20190201102515.GK11599@dhcp22.suse.cz

It's been three months without any objections. Can we merge this for
v5.2 please? We still have users complaining about this inconsistent
behavior (the last one was yesterday) and we'd rather not carry any
out of tree patches.



* Re: + mm-consider-subtrees-in-memoryevents.patch added to -mm tree
  2019-05-16 17:56   ` Johannes Weiner
@ 2019-05-16 18:10     ` Michal Hocko
  2019-05-16 19:39       ` Johannes Weiner
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2019-05-16 18:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, mm-commits, tj, guro, dennis, chris, cgroups mailinglist, linux-mm

On Thu 16-05-19 13:56:55, Johannes Weiner wrote:
> On Wed, Feb 13, 2019 at 01:47:29PM +0100, Michal Hocko wrote:
> > On Tue 12-02-19 14:45:42, Andrew Morton wrote:
> > [...]
> > > From: Chris Down <chris@chrisdown.name>
> > > Subject: mm, memcg: consider subtrees in memory.events
> > > 
> > > [...]
> > 
> > FTR: As I've already said here [1] I can live with this change as long
> > as there is a larger consensus among cgroup v2 users. So let's give this
> > some more time before merging to see whether there is such a consensus.
> > 
> > [1] http://lkml.kernel.org/r/20190201102515.GK11599@dhcp22.suse.cz
> 
> It's been three months without any objections.

It's been three months without any _feedback_ from anybody. It might
very well be true that people just do not read these emails or do not
care one way or another.

> Can we merge this for
> v5.2 please? We still have users complaining about this inconsistent
> behavior (the last one was yesterday) and we'd rather not carry any
> out of tree patches.

Could you point me to those complains or is this something internal?
-- 
Michal Hocko
SUSE Labs



* Re: + mm-consider-subtrees-in-memoryevents.patch added to -mm tree
  2019-05-16 18:10     ` Michal Hocko
@ 2019-05-16 19:39       ` Johannes Weiner
  2019-05-17 12:33         ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Weiner @ 2019-05-16 19:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, mm-commits, tj, guro, dennis, chris, cgroups mailinglist, linux-mm

On Thu, May 16, 2019 at 08:10:42PM +0200, Michal Hocko wrote:
> On Thu 16-05-19 13:56:55, Johannes Weiner wrote:
> > On Wed, Feb 13, 2019 at 01:47:29PM +0100, Michal Hocko wrote:
> > > On Tue 12-02-19 14:45:42, Andrew Morton wrote:
> > > [...]
> > > > From: Chris Down <chris@chrisdown.name>
> > > > Subject: mm, memcg: consider subtrees in memory.events
> > > > 
> > > > [...]
> > > 
> > > FTR: As I've already said here [1] I can live with this change as long
> > > as there is a larger consensus among cgroup v2 users. So let's give this
> > > some more time before merging to see whether there is such a consensus.
> > > 
> > > [1] http://lkml.kernel.org/r/20190201102515.GK11599@dhcp22.suse.cz
> > 
> > It's been three months without any objections.
> 
> It's been three months without any _feedback_ from anybody. It might
> very well be true that people just do not read these emails or do not
> care one way or another.

This is exactly the type of stuff that Mel was talking about at LSFMM
not even two weeks ago. How one objection, however absurd, can cause
"controversy" and block an effort to address a mistake we have made in
the past that is now actively causing problems for real users.

And now after stalling this fix for three months to wait for unlikely
objections, you're moving the goal post. This is frustrating.

Nobody else is speaking up because the current user base is very small
and because the idea that anybody has developed against and is relying
on the current problematic behavior is completely contrived. In
reality, the behavior surprises people and causes production issues.

> > Can we merge this for
> > v5.2 please? We still have users complaining about this inconsistent
> > behavior (the last one was yesterday) and we'd rather not carry any
> > out of tree patches.
> 
> Could you point me to those complains or is this something internal?

It's something internal, unfortunately, or I'd link to it.

In this report yesterday, the user missed OOM kills that occurred in
nested subgroups of individual job components. They monitor the entire
job status and health at the top-level "job" cgroup: total memory
usage, VM activity and trends from memory.stat, pressure for cpu, io,
memory etc. All of these are recursive. They assumed they could
monitor memory.events likewise and were left in the assumption that
everything was fine when in reality there was OOM killing going on in
one of the leaves.

Such negative surprises really suck. But what's worse is that now that
they are aware of it, there is still no good solution for them because
periodically polling the entire subtree for events in leaves is not
practical. There could be a lot of cgroups, which is why we put so
much effort recently into improving the hierarchical stat aggregation.

I'd really like to get this fixed, and preferably in a way that does
not deviate from upstream, and does not force the same downtimes and
wasted engineering hours on everybody who is going to switch to
cgroup2 in the next couple of years.



* Re: + mm-consider-subtrees-in-memoryevents.patch added to -mm tree
  2019-05-16 19:39       ` Johannes Weiner
@ 2019-05-17 12:33         ` Michal Hocko
  2019-05-17 13:00           ` Shakeel Butt
  2019-05-18  1:33           ` Johannes Weiner
  0 siblings, 2 replies; 12+ messages in thread
From: Michal Hocko @ 2019-05-17 12:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, mm-commits, tj, guro, dennis, chris, cgroups mailinglist, linux-mm

On Thu 16-05-19 15:39:43, Johannes Weiner wrote:
> On Thu, May 16, 2019 at 08:10:42PM +0200, Michal Hocko wrote:
> > On Thu 16-05-19 13:56:55, Johannes Weiner wrote:
> > > On Wed, Feb 13, 2019 at 01:47:29PM +0100, Michal Hocko wrote:
[...]
> > > > FTR: As I've already said here [1] I can live with this change as long
> > > > as there is a larger consensus among cgroup v2 users. So let's give this
> > > > some more time before merging to see whether there is such a consensus.
> > > > 
> > > > [1] http://lkml.kernel.org/r/20190201102515.GK11599@dhcp22.suse.cz
> > > 
> > > It's been three months without any objections.
> > 
> > It's been three months without any _feedback_ from anybody. It might
> > very well be true that people just do not read these emails or do not
> > care one way or another.
> 
> This is exactly the type of stuff that Mel was talking about at LSFMM
> not even two weeks ago. How one objection, however absurd, can cause
> "controversy" and block an effort to address a mistake we have made in
> the past that is now actively causing problems for real users.
> 
> And now after stalling this fix for three months to wait for unlikely
> objections, you're moving the goal post. This is frustrating.

I see your frustration but I find the above wording really unfair. Let me
remind you that this is a considerable user-visible change in the
semantics, and that always has to be evaluated carefully. A change that
would clearly regress anybody who relies on the current semantics. This
is not an internal implementation detail kind of thing.

I have suggested an option for the new behavior to be opt-in which
would be a regression safe option. You keep insisting that we absolutely
have to have hierarchical reporting by default for consistency reasons.
I do understand that argument but when I weigh consistency vs. potential
regression risk I rather go a conservative way. This is a traditional
way how we deal with semantic changes like this. There are always
exceptions possible and that is why I wanted to hear from other users of
cgroup v2, even from those who are not directly affected now.

If you feel so strongly about this topic and the suggested opt-in is an
absolute no-go then you are free to override my opinion here. I haven't
Nacked this patch.

> Nobody else is speaking up because the current user base is very small
> and because the idea that anybody has developed against and is relying
> on the current problematic behavior is completely contrived. In
> reality, the behavior surprises people and causes production issues.

I strongly suspect users usually do not follow discussions on our
mailing lists. They only come up later when something breaks and that
is too late. I do realize that this makes the above call for a wider
consensus harder but a lack of upstream bug reports also suggests that
people do not care or simply haven't noticed any issues due to the way
they use said interface (maybe deeper hierarchies are not that
common).

> > > Can we merge this for
> > > v5.2 please? We still have users complaining about this inconsistent
> > > behavior (the last one was yesterday) and we'd rather not carry any
> > > out of tree patches.
> > 
> > Could you point me to those complains or is this something internal?
> 
> It's something internal, unfortunately, or I'd link to it.
> 
> In this report yesterday, the user missed OOM kills that occured in
> nested subgroups of individual job components. They monitor the entire
> job status and health at the top-level "job" cgroup: total memory
> usage, VM activity and trends from memory.stat, pressure for cpu, io,
> memory etc. All of these are recursive. They assumed they could
> monitor memory.events likewise and were left in the assumption that
> everything was fine when in reality there was OOM killing going on in
> one of the leaves.

This kind of argument has been already mentioned during the discussion
and I understand it.
-- 
Michal Hocko
SUSE Labs



* Re: + mm-consider-subtrees-in-memoryevents.patch added to -mm tree
  2019-02-13 12:47 ` + mm-consider-subtrees-in-memoryevents.patch added to -mm tree Michal Hocko
  2019-05-16 17:56   ` Johannes Weiner
@ 2019-05-17 13:00   ` Shakeel Butt
  2019-05-17 19:04     ` Johannes Weiner
  1 sibling, 1 reply; 12+ messages in thread
From: Shakeel Butt @ 2019-05-17 13:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, mm-commits, Tejun Heo, Johannes Weiner,
	Roman Gushchin, Dennis Zhou, Chris Down, cgroups mailinglist,
	Linux MM

On Wed, Feb 13, 2019 at 4:47 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 12-02-19 14:45:42, Andrew Morton wrote:
> [...]
> > From: Chris Down <chris@chrisdown.name>
> > Subject: mm, memcg: consider subtrees in memory.events
> >
> > memory.stat and other files already consider subtrees in their output, and
> > we should too in order to not present an inconsistent interface.
> >
> > The current situation is fairly confusing, because people interacting with
> > cgroups expect hierarchical behaviour in the vein of memory.stat,
> > cgroup.events, and other files.  For example, this causes confusion when
> > debugging reclaim events under low, as currently these always read "0" at
> > non-leaf memcg nodes, which frequently causes people to misdiagnose breach
> > behaviour.  The same confusion applies to other counters in this file when
> > debugging issues.
> >
> > Aggregation is done at write time instead of at read-time since these
> > counters aren't hot (unlike memory.stat which is per-page, so it does it
> > at read time), and it makes sense to bundle this with the file

I think the above statement (memory.stat read-time aggregation) needs
to be fixed due to the latest changes.

> > notifications.
> >
> > After this patch, events are propagated up the hierarchy:
> >
> >     [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
> >     low 0
> >     high 0
> >     max 0
> >     oom 0
> >     oom_kill 0
> >     [root@ktst ~]# systemd-run -p MemoryMax=1 true
> >     Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
> >     [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
> >     low 0
> >     high 0
> >     max 7
> >     oom 1
> >     oom_kill 1
> >
> > As this is a change in behaviour, this can be reverted to the old
> > behaviour by mounting with the `memory_localevents' flag set.  However, we
> > use the new behaviour by default as there's a lack of evidence that there
> > are any current users of memory.events that would find this change
> > undesirable.
> >
> > Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
> > Signed-off-by: Chris Down <chris@chrisdown.name>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

However can we please have memory.events.local merged along with this one?

> > [...]


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + mm-consider-subtrees-in-memoryevents.patch added to -mm tree
  2019-05-17 12:33         ` Michal Hocko
@ 2019-05-17 13:00           ` Shakeel Butt
  2019-05-22  5:30             ` Suren Baghdasaryan
  2019-05-18  1:33           ` Johannes Weiner
  1 sibling, 1 reply; 12+ messages in thread
From: Shakeel Butt @ 2019-05-17 13:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, mm-commits, Tejun Heo,
	Roman Gushchin, Dennis Zhou, Chris Down, cgroups mailinglist,
	Linux MM

On Fri, May 17, 2019 at 5:33 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 16-05-19 15:39:43, Johannes Weiner wrote:
> > On Thu, May 16, 2019 at 08:10:42PM +0200, Michal Hocko wrote:
> > > On Thu 16-05-19 13:56:55, Johannes Weiner wrote:
> > > > On Wed, Feb 13, 2019 at 01:47:29PM +0100, Michal Hocko wrote:
> [...]
> > > > > FTR: As I've already said here [1] I can live with this change as long
> > > > > as there is a larger consensus among cgroup v2 users. So let's give this
> > > > > some more time before merging to see whether there is such a consensus.
> > > > >
> > > > > [1] http://lkml.kernel.org/r/20190201102515.GK11599@dhcp22.suse.cz
> > > >
> > > > It's been three months without any objections.
> > >
> > > It's been three months without any _feedback_ from anybody. It might
> > > very well be true that people just do not read these emails or do not
> > > care one way or another.
> >
> > This is exactly the type of stuff that Mel was talking about at LSFMM
> > not even two weeks ago. How one objection, however absurd, can cause
> > "controversy" and block an effort to address a mistake we have made in
> > the past that is now actively causing problems for real users.
> >
> > And now after stalling this fix for three months to wait for unlikely
> > objections, you're moving the goal post. This is frustrating.
>
> I see your frustration but I find the above wording really unfair. Let me
> remind you that this is a considerable user-visible change in the
> semantics, and that always has to be evaluated carefully. A change like
> this would clearly regress anybody who relies on the current semantics.
> This is not an internal-implementation-detail kind of thing.
>
> I have suggested an option for the new behavior to be opt-in, which
> would be a regression-safe option. You keep insisting that we absolutely
> have to have hierarchical reporting by default for consistency reasons.
> I do understand that argument, but when I weigh consistency vs. potential
> regression risk, I would rather go the conservative way. This is the
> traditional way we deal with semantic changes like this. There are always
> exceptions possible and that is why I wanted to hear from other users of
> cgroup v2, even from those who are not directly affected now.
>
> If you feel so strongly about this topic and the suggested opt-in is an
> absolute no-go then you are free to override my opinion here. I haven't
> Nacked this patch.
>
> > Nobody else is speaking up because the current user base is very small
> > and because the idea that anybody has developed against and is relying
> > on the current problematic behavior is completely contrived. In
> > reality, the behavior surprises people and causes production issues.
>
> I strongly suspect users usually do not follow discussions on our
> mailing lists. They only come up later when something breaks and that
> is too late. I do realize that this makes the above call for a wider
> consensus harder but a lack of upstream bug reports also suggests that
> people do not care or simply haven't noticed any issues due to the way
> they use the said interface (maybe deeper hierarchies are not that
> common).
>

I suspect that FB is the only one using cgroup v2 in production and
other (data center) users are still evaluating/exploring. Also, IMHO
the cgroup v2 users are on the bleeding edge: as new cgroup v2
features and controllers are added, the users either switch to the
latest kernel or backport. That might be the reason no one objected.
Also, none of the distributions has defaulted to v2 yet, so there are
not many transparent v2 users yet.

Shakeel



* Re: + mm-consider-subtrees-in-memoryevents.patch added to -mm tree
  2019-05-17 13:00   ` Shakeel Butt
@ 2019-05-17 19:04     ` Johannes Weiner
  0 siblings, 0 replies; 12+ messages in thread
From: Johannes Weiner @ 2019-05-17 19:04 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Andrew Morton, mm-commits, Tejun Heo,
	Roman Gushchin, Dennis Zhou, Chris Down, cgroups mailinglist,
	Linux MM

On Fri, May 17, 2019 at 06:00:11AM -0700, Shakeel Butt wrote:
> On Wed, Feb 13, 2019 at 4:47 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > notifications.
> > >
> > > After this patch, events are propagated up the hierarchy:
> > >
> > >     [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
> > >     low 0
> > >     high 0
> > >     max 0
> > >     oom 0
> > >     oom_kill 0
> > >     [root@ktst ~]# systemd-run -p MemoryMax=1 true
> > >     Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
> > >     [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
> > >     low 0
> > >     high 0
> > >     max 7
> > >     oom 1
> > >     oom_kill 1
> > >
> > > As this is a change in behaviour, this can be reverted to the old
> > > behaviour by mounting with the `memory_localevents' flag set.  However, we
> > > use the new behaviour by default as there's a lack of evidence that there
> > > are any current users of memory.events that would find this change
> > > undesirable.
> > >
> > > Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
> > > Signed-off-by: Chris Down <chris@chrisdown.name>
> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Thanks, Shakeel.

> However can we please have memory.events.local merged along with this one?

Could I ask you to send a patch for this? It's not really about the
code - that should be trivial. Rather it's about laying out the exact
usecase for that, which is harder for me/Chris/FB since we don't have
one. I imagine simliar arguments could be made for memory.stat.local,
memory.pressure.local etc. since they're also reporting events and
behavior manifesting in different levels of the cgroup subtree?



* Re: + mm-consider-subtrees-in-memoryevents.patch added to -mm tree
  2019-05-17 12:33         ` Michal Hocko
  2019-05-17 13:00           ` Shakeel Butt
@ 2019-05-18  1:33           ` Johannes Weiner
  2019-05-22  2:23             ` Andrew Morton
  1 sibling, 1 reply; 12+ messages in thread
From: Johannes Weiner @ 2019-05-18  1:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, mm-commits, tj, guro, dennis, chris, cgroups mailinglist, linux-mm

On Fri, May 17, 2019 at 02:33:10PM +0200, Michal Hocko wrote:
> On Thu 16-05-19 15:39:43, Johannes Weiner wrote:
> > On Thu, May 16, 2019 at 08:10:42PM +0200, Michal Hocko wrote:
> > > On Thu 16-05-19 13:56:55, Johannes Weiner wrote:
> > > > On Wed, Feb 13, 2019 at 01:47:29PM +0100, Michal Hocko wrote:
> [...]
> > > > > FTR: As I've already said here [1] I can live with this change as long
> > > > > as there is a larger consensus among cgroup v2 users. So let's give this
> > > > > some more time before merging to see whether there is such a consensus.
> > > > > 
> > > > > [1] http://lkml.kernel.org/r/20190201102515.GK11599@dhcp22.suse.cz
> > > > 
> > > > It's been three months without any objections.
> > > 
> > > It's been three months without any _feedback_ from anybody. It might
> > > very well be true that people just do not read these emails or do not
> > > care one way or another.
> > 
> > This is exactly the type of stuff that Mel was talking about at LSFMM
> > not even two weeks ago. How one objection, however absurd, can cause
> > "controversy" and block an effort to address a mistake we have made in
> > the past that is now actively causing problems for real users.
> > 
> > And now after stalling this fix for three months to wait for unlikely
> > objections, you're moving the goal post. This is frustrating.
> 
> I see your frustration but I find the above wording really unfair. Let me
> remind you that this is a considerable user-visible change in the
> semantics, and that always has to be evaluated carefully. A change like
> this would clearly regress anybody who relies on the current semantics.
> This is not an internal-implementation-detail kind of thing.
> 
> I have suggested an option for the new behavior to be opt-in, which
> would be a regression-safe option. You keep insisting that we absolutely
> have to have hierarchical reporting by default for consistency reasons.
> I do understand that argument, but when I weigh consistency vs. potential
> regression risk, I would rather go the conservative way. This is the
> traditional way we deal with semantic changes like this. There are always
> exceptions possible and that is why I wanted to hear from other users of
> cgroup v2, even from those who are not directly affected now.

I have acknowledged this concern in previous discussions. But the rule
is "don't break userspace", not "never change behavior". We do allow
the latter when it's highly unlikely that anyone would mind and the
new behavior is a much better default for current and future users.

Let me try to make the case for exactly this:

- Adoption data suggests that cgroup2 isn't really used yet. RHEL8 was
  just released with cgroup1 per default. Fedora is currently debating
  a switch. None of the other distros default to cgroup2. There is an
  article on the lwn frontpage *right now* about Docker planning on
  switching to cgroup2 in the near future. Kubernetes is on
  cgroup1. Android is on cgroup1. Shakeel agrees that Facebook is
  probably the only serious user of cgroup2 right now. The cloud and
  all mainstream container software is still on cgroup1.

- Using this particular part of the interface is a fairly advanced
  step in the cgroup2 adoption process. We've been using cgroup2 for a
  while and we've only now started running into this memory.events
  problem as we're enhancing our monitoring and automation
  infrastructure. If we're the only serious deployment, and we just
  started noticing it, what's the chance of regressing someone else?

- Violating expectations costs users time and money either way, but
  the status quo is much more costly: somebody who expects these
  events to be local could see events that did occur at an
  unexpectedly higher level of the tree. But somebody who expects
  these events to be hierarchical will miss real events entirely!

  Now, for an alarm and monitoring infrastructure, what is worse: to
  see occurring OOM kills reported at a tree level you don't expect?
  Or to *miss* occurring OOM kills that you're trying to look out for?

  Automatic remediation might not be as clear-cut, but for us, and I
  would assume many others, an unnecessary service restart or failover
  would have a shorter downtime than missing a restart after a kill.

- The status quo is more likely to violate expectations, given how the
  cgroup2 interface as a whole is designed.

  We have seen this in practice: memory.current is hierarchical,
  memory.stat is hierarchical, memory.pressure is hierarchical - users
  expect memory.events to be hierarchical. This misunderstanding has
  already cost us time and money.

  Chances are that, even if there were other users of memory.events,
  they're using the interface incorrectly and haven't noticed yet,
  rather than relying on the inconsistency.

  It's not a hypothetical, we have seen this with our fleet customers.

So combining what we know about

1. the current adoption rate
2. likely user expectations
3. the failure mode of missing pressure and OOM kill signals

means that going with the conservative option and not fixing this
inconsistency puts pretty much all users that will ever use this
interface at the risk of pain, outages and wasted engineering hours.

Making the fix available but opt-in has the same result for everybody
that isn't following this thread/patch along.

All that to protect an unlikely existing cgroup2 user from something
they are even less likely to have noticed, let alone rely on.

This sounds like a terrible trade-off to me. I don't think it's a
close call in this case.

I understand that we have to think harder and be more careful with
changes like this. The bar *should* be high. But in this case, there
doesn't seem to be a real risk of regression for anybody, while the
risk of the status quo causing problems is high and happening. These
circumstances should be part of the decision process.



* Re: + mm-consider-subtrees-in-memoryevents.patch added to -mm tree
  2019-05-18  1:33           ` Johannes Weiner
@ 2019-05-22  2:23             ` Andrew Morton
  2019-05-22 15:44               ` Johannes Weiner
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2019-05-22  2:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, mm-commits, tj, guro, dennis, chris,
	cgroups mailinglist, linux-mm

On Fri, 17 May 2019 21:33:48 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> - Adoption data suggests that cgroup2 isn't really used yet. RHEL8 was
>   just released with cgroup1 per default. Fedora is currently debating
>   a switch. None of the other distros default to cgroup2. There is an
>   article on the lwn frontpage *right now* about Docker planning on
>   switching to cgroup2 in the near future. Kubernetes is on
>   cgroup1. Android is on cgroup1. Shakeel agrees that Facebook is
>   probably the only serious user of cgroup2 right now. The cloud and
>   all mainstream container software is still on cgroup1.

I'm thinking we need a cc:stable so these forthcoming distros are more
likely to pick up the new behaviour?

Generally, your arguments sound good to me - I don't see evidence that
anyone is using cgroup2 in a manner which is serious enough to be
seriously affected by this change.



* Re: + mm-consider-subtrees-in-memoryevents.patch added to -mm tree
  2019-05-17 13:00           ` Shakeel Butt
@ 2019-05-22  5:30             ` Suren Baghdasaryan
  0 siblings, 0 replies; 12+ messages in thread
From: Suren Baghdasaryan @ 2019-05-22  5:30 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Johannes Weiner, Andrew Morton, mm-commits,
	Tejun Heo, Roman Gushchin, Dennis Zhou, Chris Down,
	cgroups mailinglist, Linux MM

On Fri, May 17, 2019 at 6:00 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, May 17, 2019 at 5:33 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 16-05-19 15:39:43, Johannes Weiner wrote:
> > > On Thu, May 16, 2019 at 08:10:42PM +0200, Michal Hocko wrote:
> > > > On Thu 16-05-19 13:56:55, Johannes Weiner wrote:
> > > > > On Wed, Feb 13, 2019 at 01:47:29PM +0100, Michal Hocko wrote:
> > [...]
> > > > > > FTR: As I've already said here [1] I can live with this change as long
> > > > > > as there is a larger consensus among cgroup v2 users. So let's give this
> > > > > > some more time before merging to see whether there is such a consensus.
> > > > > >
> > > > > > [1] http://lkml.kernel.org/r/20190201102515.GK11599@dhcp22.suse.cz
> > > > >
> > > > > It's been three months without any objections.
> > > >
> > > > It's been three months without any _feedback_ from anybody. It might
> > > > very well be true that people just do not read these emails or do not
> > > > care one way or another.
> > >
> > > This is exactly the type of stuff that Mel was talking about at LSFMM
> > > not even two weeks ago. How one objection, however absurd, can cause
> > > "controversy" and block an effort to address a mistake we have made in
> > > the past that is now actively causing problems for real users.
> > >
> > > And now after stalling this fix for three months to wait for unlikely
> > > objections, you're moving the goal post. This is frustrating.
> >
> > I see your frustration but I find the above wording really unfair. Let me
> > remind you that this is a considerable user-visible change in the
> > semantics, and that always has to be evaluated carefully. A change like
> > this would clearly regress anybody who relies on the current semantics.
> > This is not an internal-implementation-detail kind of thing.
> >
> > I have suggested an option for the new behavior to be opt-in, which
> > would be a regression-safe option. You keep insisting that we absolutely
> > have to have hierarchical reporting by default for consistency reasons.
> > I do understand that argument, but when I weigh consistency vs. potential
> > regression risk, I would rather go the conservative way. This is the
> > traditional way we deal with semantic changes like this. There are always
> > exceptions possible and that is why I wanted to hear from other users of
> > cgroup v2, even from those who are not directly affected now.
> >
> > If you feel so strongly about this topic and the suggested opt-in is an
> > absolute no-go then you are free to override my opinion here. I haven't
> > Nacked this patch.
> >
> > > Nobody else is speaking up because the current user base is very small
> > > and because the idea that anybody has developed against and is relying
> > > on the current problematic behavior is completely contrived. In
> > > reality, the behavior surprises people and causes production issues.
> >
> > I strongly suspect users usually do not follow discussions on our
> > mailing lists. They only come up later when something breaks and that
> > is too late. I do realize that this makes the above call for a wider
> > consensus harder but a lack of upstream bug reports also suggests that
> > people do not care or simply haven't noticed any issues due to the way
> > they use the said interface (maybe deeper hierarchies are not that
> > common).
> >
>
> I suspect that FB is the only one using cgroup v2 in production and
> other (data center) users are still evaluating/exploring. Also, IMHO
> the cgroup v2 users are on the bleeding edge: as new cgroup v2
> features and controllers are added, the users either switch to the
> latest kernel or backport. That might be the reason no one objected.
> Also, none of the distributions has defaulted to v2 yet, so there are
> not many transparent v2 users yet.

In Android we are not using cgroups v2 yet (and that's why I was
refraining from commenting earlier); however, when I was evaluating
them for future use I was disappointed that events do not propagate up
the hierarchy. One usecase that I was considering is to get a
notification when an OOM kill happens. With cgroups v2 we would be
forced to use a per-app hierarchy to avoid process migrations between
memcgs when an app changes its state (background/foreground). With
such a setup we would end up with many leaf cgroups. Polling each
individual leaf cgroup's memory.events file to detect OOM occurrences
would require lots of extra FDs registered with an epoll(). Having the
ability to poll a common parent cgroup to detect that one of the
leaves generated an OOM event would be way more frugal.
I realize this does not constitute a real-life usecase, but hopefully
possible usecases can provide some value too.
Thanks,
Suren.

> Shakeel
>



* Re: + mm-consider-subtrees-in-memoryevents.patch added to -mm tree
  2019-05-22  2:23             ` Andrew Morton
@ 2019-05-22 15:44               ` Johannes Weiner
  0 siblings, 0 replies; 12+ messages in thread
From: Johannes Weiner @ 2019-05-22 15:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, mm-commits, tj, guro, dennis, chris,
	cgroups mailinglist, linux-mm

On Tue, May 21, 2019 at 07:23:51PM -0700, Andrew Morton wrote:
> On Fri, 17 May 2019 21:33:48 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > - Adoption data suggests that cgroup2 isn't really used yet. RHEL8 was
> >   just released with cgroup1 per default. Fedora is currently debating
> >   a switch. None of the other distros default to cgroup2. There is an
> >   article on the lwn frontpage *right now* about Docker planning on
> >   switching to cgroup2 in the near future. Kubernetes is on
> >   cgroup1. Android is on cgroup1. Shakeel agrees that Facebook is
> >   probably the only serious user of cgroup2 right now. The cloud and
> >   all mainstream container software is still on cgroup1.
> 
> I'm thinking we need a cc:stable so these forthcoming distros are more
> likely to pick up the new behaviour?

Yup, makes sense to me. Thank you!



end of thread, other threads:[~2019-05-22 15:44 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20190212224542.ZW63a%akpm@linux-foundation.org>
2019-02-13 12:47 ` + mm-consider-subtrees-in-memoryevents.patch added to -mm tree Michal Hocko
2019-05-16 17:56   ` Johannes Weiner
2019-05-16 18:10     ` Michal Hocko
2019-05-16 19:39       ` Johannes Weiner
2019-05-17 12:33         ` Michal Hocko
2019-05-17 13:00           ` Shakeel Butt
2019-05-22  5:30             ` Suren Baghdasaryan
2019-05-18  1:33           ` Johannes Weiner
2019-05-22  2:23             ` Andrew Morton
2019-05-22 15:44               ` Johannes Weiner
2019-05-17 13:00   ` Shakeel Butt
2019-05-17 19:04     ` Johannes Weiner
