* [RFC PATCH] memcg: export knobs for the default cgroup hierarchy
@ 2014-07-16 14:39 Michal Hocko
  2014-07-16 15:58 ` Johannes Weiner
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2014-07-16 14:39 UTC (permalink / raw)
  To: linux-mm
  Cc: LKML, Johannes Weiner, Tejun Heo, Hugh Dickins, Greg Thelen,
	Vladimir Davydov, Glauber Costa, Andrew Morton,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

Starting with 8f9ac36d2cbb (cgroup: distinguish the default and legacy
hierarchies when handling cftypes) the memory cgroup controller doesn't
export any knobs because all of them are marked as legacy. The idea is
that only selected knobs are exported for the new cgroup API.

This patch exports the core knobs for the memory controller. The
following knobs are not and won't be available in the default (aka
unified) hierarchy:
- use_hierarchy - was one of the biggest mistakes made when the memory
  controller was introduced. It allows creating a hierarchical cgroup
  structure which doesn't have any hierarchical accounting. This leads
  to really strange configurations where other co-mounted controllers
  behave hierarchically while the memory controller doesn't.
  All controllers have to be hierarchical with the new cgroups API so
  this knob doesn't make any sense there.
- force_empty - was introduced primarily to drop memory before it gets
  reparented on group removal.  This alone doesn't sound fully justified
  because reparented pages which are not in use can also be reclaimed
  later, when there is memory pressure at the parent level.
  Another use case would be something like a per-memcg
  /proc/sys/vm/drop_caches, which doesn't sound like a great idea
  either. We are trying to get away from using it at the global level
  so we shouldn't encourage it at the per-memcg level either.
- soft_limit_in_bytes - was originally introduced to help recover from
  overcommit situations where the sum of the hard limits on the system
  is higher than the available memory. The group with the largest excess
  over its soft limit is reclaimed to help reduce memory pressure during
  global memory pressure.
  The primary problem with this tunable is that every memcg is soft
  unlimited by default, which is the reverse of what would be expected
  from such a knob.
  Another problem is that the soft limit is considered only during
  global memory pressure rather than under external memory pressure in
  general (e.g. triggered by hitting the limit on a parent up the
  hierarchy).
  There are other issues which are tied to the implementation (e.g.
  priority-0 reclaim used for the soft limit reclaim etc.) and which are
  really hard to fix without breaking potential users.
  There will be a replacement for the soft limit in the unified
  hierarchy and users will be encouraged to switch their configuration
  to the new scheme. Until it is available, users are advised to stay
  with the legacy cgroup API.

The TCP kmem sub-controller is not exported at this stage because it has
seen basically no traction since it was merged and it is not entirely
clear why the kmem controller cannot be used for the same purpose. Having
two controllers for tracking kernel memory allocations sounds like too
much. If there are use cases and reasons for not merging it into kmem
then we can reconsider and allow it for the new cgroups API later.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 Documentation/cgroups/memory.txt |  19 ++++---
 mm/memcontrol.c                  | 105 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 115 insertions(+), 9 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 02ab997a1ed2..a8f01497c5de 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -62,10 +62,10 @@ Brief summary of control files.
  memory.memsw.failcnt		 # show the number of memory+Swap hits limits
  memory.max_usage_in_bytes	 # show max memory usage recorded
  memory.memsw.max_usage_in_bytes # show max memory+Swap usage recorded
- memory.soft_limit_in_bytes	 # set/show soft limit of memory usage
+[D] memory.soft_limit_in_bytes	 # set/show soft limit of memory usage
  memory.stat			 # show various statistics
- memory.use_hierarchy		 # set/show hierarchical account enabled
- memory.force_empty		 # trigger forced move charge to parent
+[D] memory.use_hierarchy		 # set/show hierarchical account enabled
+[D] memory.force_empty		 # trigger forced move charge to parent
  memory.pressure_level		 # set memory pressure notifications
  memory.swappiness		 # set/show swappiness parameter of vmscan
 				 (See sysctl's vm.swappiness)
@@ -78,10 +78,15 @@ Brief summary of control files.
  memory.kmem.failcnt             # show the number of kernel memory usage hits limits
  memory.kmem.max_usage_in_bytes  # show max kernel memory usage recorded
 
- memory.kmem.tcp.limit_in_bytes  # set/show hard limit for tcp buf memory
- memory.kmem.tcp.usage_in_bytes  # show current tcp buf memory allocation
- memory.kmem.tcp.failcnt            # show the number of tcp buf memory usage hits limits
- memory.kmem.tcp.max_usage_in_bytes # show max tcp buf memory usage recorded
+[D] memory.kmem.tcp.limit_in_bytes  # set/show hard limit for tcp buf memory
+[D] memory.kmem.tcp.usage_in_bytes  # show current tcp buf memory allocation
+[D] memory.kmem.tcp.failcnt            # show the number of tcp buf memory usage hits limits
+[D] memory.kmem.tcp.max_usage_in_bytes # show max tcp buf memory usage recorded
+
+Knobs marked as [D] are considered deprecated and they won't be available in
+the new cgroup Unified hierarchy API (see
+Documentation/cgroups/unified-hierarchy.txt for more information). They are
+still available with the legacy hierarchy though.
 
 1. History
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fa99a3e2e427..9ed40a045d27 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5226,7 +5226,11 @@ out_kfree:
 	return ret;
 }
 
-static struct cftype mem_cgroup_files[] = {
+/*
+ * memcg knobs for the legacy cgroup API. No new files should be
+ * added here.
+ */
+static struct cftype legacy_mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
@@ -5334,6 +5338,100 @@ static struct cftype mem_cgroup_files[] = {
 	{ },	/* terminate */
 };
 
+/* memcg knobs for new cgroups API (default aka unified hierarchy) */
+static struct cftype dfl_mem_cgroup_files[] = {
+	{
+		.name = "usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "max_usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
+		.write = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "failcnt",
+		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "stat",
+		.seq_show = memcg_stat_show,
+	},
+	{
+		.name = "cgroup.event_control",		/* XXX: for compat */
+		.write = memcg_write_event_control,
+		.flags = CFTYPE_NO_PREFIX,
+		.mode = S_IWUGO,
+	},
+	{
+		.name = "swappiness",
+		.read_u64 = mem_cgroup_swappiness_read,
+		.write_u64 = mem_cgroup_swappiness_write,
+	},
+	{
+		.name = "move_charge_at_immigrate",
+		.read_u64 = mem_cgroup_move_charge_read,
+		.write_u64 = mem_cgroup_move_charge_write,
+	},
+	{
+		.name = "oom_control",
+		.seq_show = mem_cgroup_oom_control_read,
+		.write_u64 = mem_cgroup_oom_control_write,
+		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
+	},
+	{
+		.name = "pressure_level",
+	},
+#ifdef CONFIG_NUMA
+	{
+		.name = "numa_stat",
+		.seq_show = memcg_numa_stat_show,
+	},
+#endif
+#ifdef CONFIG_MEMCG_KMEM
+	{
+		.name = "kmem.limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+		.write = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "kmem.max_usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "kmem.usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "kmem.failcnt",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+#ifdef CONFIG_SLABINFO
+	{
+		.name = "kmem.slabinfo",
+		.seq_show = mem_cgroup_slabinfo_read,
+	},
+#endif
+#endif
+	{ },	/* terminate */
+};
+
 #ifdef CONFIG_MEMCG_SWAP
 static struct cftype memsw_cgroup_files[] = {
 	{
@@ -6266,7 +6364,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.attach = mem_cgroup_move_task,
 	.bind = mem_cgroup_bind,
-	.legacy_cftypes = mem_cgroup_files,
+	.legacy_cftypes = legacy_mem_cgroup_files,
+	.dfl_cftypes = dfl_mem_cgroup_files,
 	.early_init = 0,
 };
 
@@ -6285,6 +6384,8 @@ static void __init memsw_file_init(void)
 {
 	WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys,
 					  memsw_cgroup_files));
+	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys,
+					  memsw_cgroup_files));
 }
 
 static void __init enable_swap_cgroup(void)
-- 
2.0.1


* Re: [RFC PATCH] memcg: export knobs for the default cgroup hierarchy
  2014-07-16 14:39 [RFC PATCH] memcg: export knobs for the default cgroup hierarchy Michal Hocko
@ 2014-07-16 15:58 ` Johannes Weiner
  2014-07-17 13:45   ` Michal Hocko
                     ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Johannes Weiner @ 2014-07-16 15:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, LKML, Tejun Heo, Hugh Dickins, Greg Thelen,
	Vladimir Davydov, Glauber Costa, Andrew Morton,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote:
> Starting with 8f9ac36d2cbb (cgroup: distinguish the default and legacy
> hierarchies when handling cftypes) memory cgroup controller doesn't
> export any knobs because all of them are marked as legacy. The idea is
> that only selected knobs are exported for the new cgroup API.
> 
> This patch exports the core knobs for the memory controller. The
> following knobs are not and won't be available in the default (aka
> unified) hierarchy:
> - use_hierarchy - was one of the biggest mistakes when memory controller
>   was introduced. It allows for creating hierarchical cgroups structure
>   which doesn't have any hierarchical accounting. This leads to really
>   strange configurations where other co-mounted controllers behave
>   hierarchically while memory controller doesn't.
>   All controllers have to be hierarchical with the new cgroups API so
>   this knob doesn't make any sense here.
> - force_empty - has been introduced primarily to drop memory before it
>   gets reparented on the group removal.  This alone doesn't sound
>   fully justified because reparented pages which are not in use can be
>   reclaimed also later when there is a memory pressure on the parent
>   level.
>   Another use-case would be something like per-memcg /proc/sys/vm/drop_caches
>   which doesn't sound like a great idea either. We are trying to get
>   away from using it on the global level so we shouldn't allow that on
>   per-memcg level as well.
> - soft_limit_in_bytes - has been originally introduced to help to
>   recover from the overcommit situations where the overall hard limits
>   on the system are higher than the available memory. A group which has
>   the largest excess on the soft limit is reclaimed to help to reduce
>   memory pressure during the global memory pressure.
>   The primary problem with this tunable is that every memcg is soft
>   unlimited by default which is reverse to what would be expected from
>   such a knob.
>   Another problem is that soft limit is considered only during the
>   global memory pressure rather than on an external memory pressure in
>   general (e.g. triggered by the limit hit on a parent up the
>   hierarchy).
>   There are other issues which are tight to the implementation (e.g.
>   priority-0 reclaim used for the soft limit reclaim etc.) which are
>   really hard to fix without breaking potential users.
>   There will be a replacement for the soft limit in the unified
>   hierarchy and users will be encouraged to switch their configuration
>   to the new scheme. Until this is available users are suggested to stay
>   with the legacy cgroup API.
> 
> TCP kmem sub-controller is not exported at this stage because this one has
> seen basically no traction since it was merged and it is not entirely
> clear why kmem controller cannot be used for the same purpose. Having 2
> controllers for tracking kernel memory allocations sounds like too much.
> If there are use-cases and reasons for not merging it into kmem then we
> can reconsider and allow it for the new cgroups API later.

There is a reason why we start out empty on the default hierarchy: the
old interface is a complete cesspool.  We're not blindly carrying over
any of it.

Everything that is added in .dfl_cftypes is a new interface and it
needs to be treated as such: it needs a valid usecase to back it up,
and it needs to be evaluated whether the exported information or
control knob is a good way to support that usecase.

> @@ -5226,7 +5226,11 @@ out_kfree:
>  	return ret;
>  }
>  
> -static struct cftype mem_cgroup_files[] = {
> +/*
> + * memcg knobs for the legacy cgroup API. No new files should be
> + * added here.
> + */
> +static struct cftype legacy_mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
>  		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> @@ -5334,6 +5338,100 @@ static struct cftype mem_cgroup_files[] = {
>  	{ },	/* terminate */
>  };
>  
> +/* memcg knobs for new cgroups API (default aka unified hierarchy) */
> +static struct cftype dfl_mem_cgroup_files[] = {
> +	{
> +		.name = "usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> +		.read_u64 = mem_cgroup_read_u64,
> +	},

The _in_bytes suffix is pointless, we just need to document that
everything in memcg is in bytes, unless noted otherwise.

How about "memory.current"?

> +	{
> +		.name = "max_usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
> +		.write = mem_cgroup_reset,
> +		.read_u64 = mem_cgroup_read_u64,

What is this actually good for?  We have lazy cache shrinking, which
means that the high watermark of most workloads depends on the group
max limit.

> +	},
> +	{
> +		.name = "limit_in_bytes",
> +		.private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
> +		.write = mem_cgroup_write,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},

We already agreed that there will be a max, a high, a min, and a low
limit, why would you want to reintroduce the max limit as "limit"?

How about "memory.max"?

memory.min
memory.low
memory.high
memory.max
memory.current

> +	{
> +		.name = "failcnt",
> +		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
> +		.write = mem_cgroup_reset,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},

This indicates the number of times the function res_counter_charge()
was called and didn't succeed.

Let that sink in.

This is way too dependent on the current implementation.  If you want
a measure of how much pressure the group is under, look at vmpressure,
or add scanned/reclaimed reclaim statistics.  NAK.
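
(To make that concrete, here is a tiny user-space model of what the
counter actually records - a simplified sketch, not the kernel's
res_counter code: failcnt only counts charge attempts that bumped into
the limit, and says nothing about how much reclaim work was needed for
the charges that did succeed.)

#include <stdio.h>

/* simplified user-space model of a charge counter with a hard limit */
struct counter {
	unsigned long usage;
	unsigned long limit;
	unsigned long failcnt;
};

static int charge(struct counter *c, unsigned long val)
{
	if (c->usage + val > c->limit) {
		c->failcnt++;	/* this is all that failcnt ever records */
		return -1;
	}
	c->usage += val;
	return 0;
}

int main(void)
{
	struct counter c = { .usage = 0, .limit = 100, .failcnt = 0 };

	charge(&c, 60);	/* succeeds */
	charge(&c, 60);	/* fails, failcnt = 1 */
	charge(&c, 30);	/* succeeds; no trace of the earlier pressure */
	printf("usage=%lu failcnt=%lu\n", c.usage, c.failcnt);
	return 0;
}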

> +	{
> +		.name = "stat",
> +		.seq_show = memcg_stat_show,
> +	},

This file is a complete mess.

pgpgin/pgpgout originally refers to IO, here it means pages charged
and uncharged.  It's questionable whether we even need counters for
charges and uncharges.

mapped_file vs. NR_FILE_MAPPED is another eyesore.

We should generally try harder to be consistent with /proc/vmstat.
And rename it to "memory.vmstat" while we are at it.

Also, having local counters no longer makes sense as there won't be
tasks in intermediate nodes anymore and we are getting rid of
reparenting.  All counters should appear once, in their hierarchical
form.  The exception is the root_mem_cgroup, which should probably
have a separate file for the local counters.

> +	{
> +		.name = "cgroup.event_control",		/* XXX: for compat */
> +		.write = memcg_write_event_control,
> +		.flags = CFTYPE_NO_PREFIX,
> +		.mode = S_IWUGO,

Why?

> +	},
> +	{
> +		.name = "swappiness",
> +		.read_u64 = mem_cgroup_swappiness_read,
> +		.write_u64 = mem_cgroup_swappiness_write,

Do we actually need this?

> +	},
> +	{
> +		.name = "move_charge_at_immigrate",
> +		.read_u64 = mem_cgroup_move_charge_read,
> +		.write_u64 = mem_cgroup_move_charge_write,

This creates significant pain because pc->mem_cgroup becomes a moving
target during the lifetime of the page.  When this feature was added,
there was no justification - except that "some users feel [charges
staying in the group] to be strange" - and it was lacking the
necessary synchronization to make this work properly, so the cost of
this feature was anything but obvious during the initial submission.

I generally don't see why tasks should be moving between groups aside
from the initial setup phase.  And then they shouldn't have consumed
any amounts of memory that they couldn't afford to leave behind in the
root/parent.

So what is the usecase for this that would justify the immense cost in
terms of both performance and code complexity?

> +	},
> +	{
> +		.name = "oom_control",
> +		.seq_show = mem_cgroup_oom_control_read,
> +		.write_u64 = mem_cgroup_oom_control_write,
> +		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),

This needs a usecase description as well.

> +	},
> +	{
> +		.name = "pressure_level",

"memory.pressure"?

> +	},
> +#ifdef CONFIG_NUMA
> +	{
> +		.name = "numa_stat",
> +		.seq_show = memcg_numa_stat_show,
> +	},

This would also be a chance to clean up this file, which is suddenly
specifying memory size in pages rather than bytes.

> +#ifdef CONFIG_MEMCG_KMEM
> +	{
> +		.name = "kmem.limit_in_bytes",
> +		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> +		.write = mem_cgroup_write,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},

Does it really make sense to have a separate limit for kmem only?
IIRC, the reason we introduced this was that this memory is not
reclaimable and so we need to limit it.

But the opposite effect happened: because it's not reclaimable, the
separate kmem limit is actually unusable for any values smaller than
the overall memory limit: because there is no reclaim mechanism for
that limit, once you hit it, it's over, there is nothing you can do
anymore.  The problem isn't so much unreclaimable memory, the problem
is unreclaimable limits.

If the global case produces memory pressure through kernel memory
allocations, we reclaim page cache, anonymous pages, inodes, dentries
etc.  I think the same should happen for kmem: kmem should just be
accounted and limited in the overall memory limit of a group, and when
pressure arises, we go after anything that's reclaimable.

> +	{
> +		.name = "kmem.max_usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
> +		.write = mem_cgroup_reset,
> +		.read_u64 = mem_cgroup_read_u64,

As per above, I don't see that a high watermark is meaningful with
lazy cache shrinking.

> +	{
> +		.name = "kmem.usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
> +		.read_u64 = mem_cgroup_read_u64,
> +	},

We could just include slab counters, kernel stack pages etc. in the
statistics file, like /proc/vmstat does.

> +	{
> +		.name = "kmem.failcnt",
> +		.private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
> +		.write = mem_cgroup_reset,
> +		.read_u64 = mem_cgroup_read_u64,

NAK as per above.

> +#ifdef CONFIG_SLABINFO
> +	{
> +		.name = "kmem.slabinfo",
> +		.seq_show = mem_cgroup_slabinfo_read,
> +	},
> +#endif
> +#endif
> +	{ },	/* terminate */
> +};
> +
>  #ifdef CONFIG_MEMCG_SWAP
>  static struct cftype memsw_cgroup_files[] = {
>  	{
> @@ -6266,7 +6364,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
>  	.cancel_attach = mem_cgroup_cancel_attach,
>  	.attach = mem_cgroup_move_task,
>  	.bind = mem_cgroup_bind,
> -	.legacy_cftypes = mem_cgroup_files,
> +	.legacy_cftypes = legacy_mem_cgroup_files,
> +	.dfl_cftypes = dfl_mem_cgroup_files,

* Re: [RFC PATCH] memcg: export knobs for the default cgroup hierarchy
  2014-07-16 15:58 ` Johannes Weiner
@ 2014-07-17 13:45   ` Michal Hocko
  2014-07-18 15:44   ` Vladimir Davydov
  2014-07-21 13:22   ` Michal Hocko
  2 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2014-07-17 13:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, LKML, Tejun Heo, Hugh Dickins, Greg Thelen,
	Vladimir Davydov, Glauber Costa, Andrew Morton,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Wed 16-07-14 11:58:14, Johannes Weiner wrote:
> On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote:
[...]
> > +/* memcg knobs for new cgroups API (default aka unified hierarchy) */
> > +static struct cftype dfl_mem_cgroup_files[] = {
> > +	{
> > +		.name = "usage_in_bytes",
> > +		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> > +		.read_u64 = mem_cgroup_read_u64,
> > +	},
> 
> The _in_bytes suffix is pointless, we just need to document that
> everything in memcg is in bytes, unless noted otherwise.
> 
> How about "memory.current"?

I wanted old users to have to change as little as possible when moving
to the unified hierarchy so I didn't touch the old names.
Why should we make end users' lives harder? If there is general
agreement I have no problem with renaming; I just do not think it is
really necessary, because there is no real reason why configurations
which do not use any of the deprecated or unified-hierarchy-only
features shouldn't run in both unified and legacy hierarchies without
any changes.

I do realize that this is a _new_ API so we can make such radical
changes, but I am also aware that some people have to maintain their
stacks on top of different kernels and it really sucks to maintain two
different configurations. In such a case it would be easier for those
users to stay with the legacy mode, which is a fair option, but I would
much rather see them move to the new API sooner rather than later.

memory.usage would be a much better fit IMO.
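
(Either way, a configuration that has to work on both hierarchies can
probe for a knob before touching it; a minimal sketch, with a made-up
mount point and group name:)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* illustrative path; the real mount point depends on the setup */
	const char *knob =
		"/sys/fs/cgroup/memory/mygroup/memory.soft_limit_in_bytes";

	if (access(knob, F_OK) == 0)
		printf("legacy-only knob present, keep the old configuration\n");
	else
		printf("knob absent (unified hierarchy), use the new scheme\n");
	return 0;
}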

> > +	{
> > +		.name = "max_usage_in_bytes",
> > +		.private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
> > +		.write = mem_cgroup_reset,
> > +		.read_u64 = mem_cgroup_read_u64,
> 
> What is this actually good for?  We have lazy cache shrinking, which
> means that the high watermark of most workloads depends on the group
> max limit.

Well, that is a good question. I was going back and forth on disabling
this and failcnt before posting this RFC and ended up adding them as
they were never controversial.
I have no problem ditching them, though, because their usefulness is
quite dubious. If someone wants to see whether the hard limit can be
decreased without putting too much reclaim pressure on the group then
we have a notification mechanism for that.

> > +	},
> > +	{
> > +		.name = "limit_in_bytes",
> > +		.private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
> > +		.write = mem_cgroup_write,
> > +		.read_u64 = mem_cgroup_read_u64,
> > +	},
> 
> We already agreed that there will be a max, a high, a min, and a low
> limit, why would you want to reintroduce the max limit as "limit"?

Same as above: I didn't rename the knobs, for an easier transition. On
the other hand it is true that the name doesn't fit so nicely with the
new upcoming limits scheme. Is that reason sufficient to make users'
lives harder?

> How about "memory.max"?
> 
> memory.min
> memory.low
> memory.high
> memory.max
> memory.current
> 
> > +	{
> > +		.name = "failcnt",
> > +		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
> > +		.write = mem_cgroup_reset,
> > +		.read_u64 = mem_cgroup_read_u64,
> > +	},
> 
> This indicates the number of times the function res_counter_charge()
> was called and didn't succeed.
> 
> Let that sink in.
> 
> This is way too dependent on the current implementation. 

Well, not really. It doesn't depend on the res_counter directly; we just
happen to use it this way. Whatever we use in the future, there will
still be a moment when the hard (or whatever we call it) limit is hit.
Whether somebody depends on it is a question. I wouldn't be surprised if
somebody does, but I also think that making any decisions based on the
counter is dubious at best. But hey, users are really creative...

> If you want a measure of how much pressure the group is under, look at
> vmpressure, or add scanned/reclaimed reclaim statistics.  NAK.

OK, I will not miss this one either. As you say there is a better way to
measure the pressure and make decisions based on that.

> > +	{
> > +		.name = "stat",
> > +		.seq_show = memcg_stat_show,
> > +	},
> 
> This file is a complete mess.

It is!

> pgpgin/pgpgout originally refers to IO, here it means pages charged
> and uncharged.  It's questionable whether we even need counters for
> charges and uncharges.
> 
> mapped_file vs. NR_FILE_MAPPED is another eyesore.
> 
> We should generally try harder to be consistent with /proc/vmstat.
> And rename it to "memory.vmstat" while we are at it.

I definitely agree that we should converge to vmstat-like counters. We
should also fix the counters which do not make much sense. I just do not
want to end up with two sets of stats depending on whether we are in
default or legacy hierarchy.

> Also, having local counters no longer makes sense as there won't be
> tasks in intermediate nodes anymore and we are getting rid of
> reparenting. 

I am not sure we should get rid of reparenting. It makes sense to have
pages from gone memcgs in the parent so they are the first candidates
for reclaim.

> All counters should appear once, in their hierarchical form.

Each memcg has its local state (e.g. LRUs) so we should reflect that in
the stat file IMO. Or are there any plans to use a different mem_cgroup
structure for the intermediate nodes? I haven't heard anything like
that.

> The exception is the root_mem_cgroup, which should probably
> have a separate file for the local counters.
> 
> > +	{
> > +		.name = "cgroup.event_control",		/* XXX: for compat */
> > +		.write = memcg_write_event_control,
> > +		.flags = CFTYPE_NO_PREFIX,
> > +		.mode = S_IWUGO,
> 
> Why?

For the oom, thresholds and vmpressure notifications, obviously. Or do
we have any other means to do the same? Does it really make sense to
push all the current users to use something else?

I understand that cgroup core didn't want to support such a generic
tool. But I think it serves its purpose for memcg and it works
reasonably well.

I am certainly open to discussing alternatives.
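
(For reference, this is roughly how a listener uses the current
interface - a simplified sketch with an illustrative group path and no
error handling:)

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
	const char *grp = "/sys/fs/cgroup/memory/mygroup";	/* illustrative */
	char path[128], buf[64];
	int efd = eventfd(0, 0);
	int pfd, cfd;
	uint64_t cnt;

	snprintf(path, sizeof(path), "%s/memory.pressure_level", grp);
	pfd = open(path, O_RDONLY);
	snprintf(path, sizeof(path), "%s/cgroup.event_control", grp);
	cfd = open(path, O_WRONLY);

	/* registration string: "<event_fd> <target_fd> <args>" */
	snprintf(buf, sizeof(buf), "%d %d low", efd, pfd);
	write(cfd, buf, strlen(buf));

	read(efd, &cnt, sizeof(cnt));	/* blocks until a pressure event fires */
	printf("got %llu pressure event(s)\n", (unsigned long long)cnt);
	return 0;
}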

> > +	},
> > +	{
> > +		.name = "swappiness",
> > +		.read_u64 = mem_cgroup_swappiness_read,
> > +		.write_u64 = mem_cgroup_swappiness_write,
> 
> Do we actually need this?

Swappiness is a natural property of LRU (anon/file) scanning and the
LRUs belong to the memcg, so it should have a way to state its
preference. Consider container setups, for example: there will never be
one swappiness value that suits all of them.
 
> > +	},
> > +	{
> > +		.name = "move_charge_at_immigrate",
> > +		.read_u64 = mem_cgroup_move_charge_read,
> > +		.write_u64 = mem_cgroup_move_charge_write,
> 
> This creates significant pain because pc->mem_cgroup becomes a moving
> target during the lifetime of the page.  When this feature was added,
> there was no justification - except that "some users feel [charges
> staying in the group] to be strange" - and it was lacking the
> necessary synchronization to make this work properly, so the cost of
> this feature was anything but obvious during the initial submission.

Actually I think that charge moving along with tasks should be enabled
by default. If moving tasks between groups is to be supported (and I
think it should be) then leaving the charges and pages behind is more
than strange. Why does the task move in the first place? Just to get rid
of the responsibility for its previous memory consumption?

So I am OK with removing the knob, but I think we should move charges by
default in the unified hierarchy.

> I generally don't see why tasks should be moving between groups aside
> from the initial setup phase.  And then they shouldn't have consumed
> any amounts of memory that they couldn't afford to leave behind in the
> root/parent.
> 
> So what is the usecase for this that would justify the immense cost in
> terms of both performance and code complexity?

One of them would be cooperation with other controllers where moving a
task has its own meaning (e.g. to a cpu group with a different share or
a cpuset with a slightly different cpu/node set etc.). The memory
controller shouldn't disallow task moving.

Another one would be memory "load" balancing. Say you have two sets of
tasks (high and low priority) running on your system/container. High
priority tasks shouldn't be reclaimed from much because that would
increase their latencies. Low priority tasks can get reclaimed or even
OOM killed. A load balancer, an admin, a user putting a task into or out
of the background, or any other mechanism can decide to change the
"priority" of a task simply by moving it to the appropriate memcg.
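
(As a sketch of that kind of rebalancing - group name and pid are made
up - demoting a process is just a write to cgroup.procs:)

#include <stdio.h>
#include <sys/types.h>

/* move a whole process into another memcg by writing its pid there */
static int move_to_memcg(const char *group, pid_t pid)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/memory/%s/cgroup.procs", group);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", (int)pid);
	return fclose(f);
}

int main(void)
{
	return move_to_memcg("lowprio", 1234) ? 1 : 0;	/* illustrative values */
}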

> 
> > +	},
> > +	{
> > +		.name = "oom_control",
> > +		.seq_show = mem_cgroup_oom_control_read,
> > +		.write_u64 = mem_cgroup_oom_control_write,
> > +		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
> 
> This needs a usecase description as well.

There are more of them, I think. Btw. I have seen users who started
using memory cgroups just because they were able to control OOM from
userspace.

The strongest use case is handling OOM conditions gracefully. This
includes 1) allowing a proper shutdown of the service, 2) allowing the
killing of processes related to the OOM victim which do not make any
sense on their own, and 3) making a more workload-aware victim
selection.

I am fully aware that there has been a lot of abuse in the past, with
users pushing the feature to its limits, but that doesn't justify
removing an otherwise very useful feature.
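
(A rough sketch of such a userspace OOM handler - the group path is
illustrative and error handling is omitted: disable the in-kernel
killer, register an eventfd, and act when the group hits its limit:)

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
	const char *grp = "/sys/fs/cgroup/memory/service";	/* illustrative */
	char path[128], buf[64];
	int efd = eventfd(0, 0);
	int ofd, cfd;
	uint64_t cnt;

	snprintf(path, sizeof(path), "%s/memory.oom_control", grp);
	ofd = open(path, O_RDWR);
	write(ofd, "1", 1);			/* disable the in-kernel OOM killer */

	snprintf(path, sizeof(path), "%s/cgroup.event_control", grp);
	cfd = open(path, O_WRONLY);
	snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
	write(cfd, buf, strlen(buf));		/* register for OOM notifications */

	read(efd, &cnt, sizeof(cnt));		/* returns when the group is under OOM */
	/* here: shut the service down cleanly, kill helpers, or pick a victim */
	return 0;
}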

> 
> > +	},
> > +	{
> > +		.name = "pressure_level",
> 
> "memory.pressure"?
> 
> > +	},
> > +#ifdef CONFIG_NUMA
> > +	{
> > +		.name = "numa_stat",
> > +		.seq_show = memcg_numa_stat_show,
> > +	},
> 
> This would also be a chance to clean up this file, which is suddenly
> specifying memory size in pages rather than bytes.

OK, we should merge it with the stat file.

> > +#ifdef CONFIG_MEMCG_KMEM
> > +	{
> > +		.name = "kmem.limit_in_bytes",
> > +		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> > +		.write = mem_cgroup_write,
> > +		.read_u64 = mem_cgroup_read_u64,
> > +	},
> 
> Does it really make sense to have a separate limit for kmem only?
> IIRC, the reason we introduced this was that this memory is not
> reclaimable and so we need to limit it.

My recollection is different. Basically there are users who really do
not care about kmem accounting and do not want to pay the runtime
overhead. Documentation/cgroups/memory.txt then describes different
usecases in chapter 2.7.3. I am not a user of kmem myself and,
considering the current state of the extension, I have never encouraged
anybody to use it for anything but playing, so I cannot tell you which
of the scenarios is the most widespread.
I do not have any objections to leaving the kmem extension in legacy
mode for now and adding it later when it matures. In fact my original
patch did that, but then I decided to keep it because the current
limitations seem to be more implementation specific than semantic. And
Vladimir is doing a really great job of filling the gaps.

> But the opposite effect happened: because it's not reclaimable, the
> separate kmem limit is actually unusable for any values smaller than
> the overall memory limit: because there is no reclaim mechanism for
> that limit, once you hit it, it's over, there is nothing you can do
> anymore.  The problem isn't so much unreclaimable memory, the problem
> is unreclaimable limits.

This is a limitation of the current implementation.

> If the global case produces memory pressure through kernel memory
> allocations, we reclaim page cache, anonymous pages, inodes, dentries
> etc.  I think the same should happen for kmem: kmem should just be
> accounted and limited in the overall memory limit of a group, and when
> pressure arises, we go after anything that's reclaimable.
> 
> > +	{
> > +		.name = "kmem.max_usage_in_bytes",
> > +		.private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
> > +		.write = mem_cgroup_reset,
> > +		.read_u64 = mem_cgroup_read_u64,
> 
> As per above, I don't see that a high watermark is meaningful with
> lazy cache shrinking.

Sure, if the other max_usage_in_bytes goes away this one will go as well.

> > +	{
> > +		.name = "kmem.usage_in_bytes",
> > +		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
> > +		.read_u64 = mem_cgroup_read_u64,
> > +	},
> 
> We could just include slab counters, kernel stack pages etc. in the
> statistics file, like /proc/vmstat does.

Agreed.
 
> > +	{
> > +		.name = "kmem.failcnt",
> > +		.private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
> > +		.write = mem_cgroup_reset,
> > +		.read_u64 = mem_cgroup_read_u64,
> 
> NAK as per above.
> 
> > +#ifdef CONFIG_SLABINFO
> > +	{
> > +		.name = "kmem.slabinfo",
> > +		.seq_show = mem_cgroup_slabinfo_read,
> > +	},
> > +#endif
> > +#endif
> > +	{ },	/* terminate */
> > +};
> > +
> >  #ifdef CONFIG_MEMCG_SWAP
> >  static struct cftype memsw_cgroup_files[] = {
> >  	{
> > @@ -6266,7 +6364,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
> >  	.cancel_attach = mem_cgroup_cancel_attach,
> >  	.attach = mem_cgroup_move_task,
> >  	.bind = mem_cgroup_bind,
> > -	.legacy_cftypes = mem_cgroup_files,
> > +	.legacy_cftypes = legacy_mem_cgroup_files,
> > +	.dfl_cftypes = dfl_mem_cgroup_files,

-- 
Michal Hocko
SUSE Labs

* Re: [RFC PATCH] memcg: export knobs for the default cgroup hierarchy
  2014-07-16 15:58 ` Johannes Weiner
  2014-07-17 13:45   ` Michal Hocko
@ 2014-07-18 15:44   ` Vladimir Davydov
  2014-07-18 16:13     ` Johannes Weiner
  2014-07-21  9:07     ` Michal Hocko
  2014-07-21 13:22   ` Michal Hocko
  2 siblings, 2 replies; 13+ messages in thread
From: Vladimir Davydov @ 2014-07-18 15:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, linux-mm, LKML, Tejun Heo, Hugh Dickins,
	Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Wed, Jul 16, 2014 at 11:58:14AM -0400, Johannes Weiner wrote:
> On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote:
> > +#ifdef CONFIG_MEMCG_KMEM
> > +	{
> > +		.name = "kmem.limit_in_bytes",
> > +		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> > +		.write = mem_cgroup_write,
> > +		.read_u64 = mem_cgroup_read_u64,
> > +	},
> 
> Does it really make sense to have a separate limit for kmem only?
> IIRC, the reason we introduced this was that this memory is not
> reclaimable and so we need to limit it.
> 
> But the opposite effect happened: because it's not reclaimable, the
> separate kmem limit is actually unusable for any values smaller than
> the overall memory limit: because there is no reclaim mechanism for
> that limit, once you hit it, it's over, there is nothing you can do
> anymore.  The problem isn't so much unreclaimable memory, the problem
> is unreclaimable limits.
> 
> If the global case produces memory pressure through kernel memory
> allocations, we reclaim page cache, anonymous pages, inodes, dentries
> etc.  I think the same should happen for kmem: kmem should just be
> accounted and limited in the overall memory limit of a group, and when
> pressure arises, we go after anything that's reclaimable.

Personally, I don't think there's much sense in having a separate knob
for kmem limit either. Until we have a user with a sane use case for it,
let's not propagate it to the new interface.

Furthermore, even when we introduce kmem shrinking, the kmem-only limit
alone won't be very useful, because there are plenty of GFP_NOFS kmem
allocations, which make most of the slab shrinkers useless. To avoid
ENOMEMs in such a situation, we would have to introduce either a soft
kmem limit (watermark) or a kind of kmem precharge. This means that if
we decided to introduce a kmem-only limit, we'd eventually have to add
more knobs and write more code to make it usable w/o even knowing if
anyone would really benefit from it.

However, there might be users that only want user memory limiting and
don't want to pay the price of kmem accounting, which is pretty
expensive. Even if we implement percpu stocks for kmem, there will still
be noticeable overhead due to touching more cache lines on
kmalloc/kfree.

So I guess there should be a tunable which will allow toggling memcg
features. Maybe a bitmask, for future extensibility.

* Re: [RFC PATCH] memcg: export knobs for the default cgroup hierarchy
  2014-07-18 15:44   ` Vladimir Davydov
@ 2014-07-18 16:13     ` Johannes Weiner
  2014-07-21  9:07     ` Michal Hocko
  1 sibling, 0 replies; 13+ messages in thread
From: Johannes Weiner @ 2014-07-18 16:13 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Michal Hocko, linux-mm, LKML, Tejun Heo, Hugh Dickins,
	Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Fri, Jul 18, 2014 at 07:44:43PM +0400, Vladimir Davydov wrote:
> On Wed, Jul 16, 2014 at 11:58:14AM -0400, Johannes Weiner wrote:
> > On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote:
> > > +#ifdef CONFIG_MEMCG_KMEM
> > > +	{
> > > +		.name = "kmem.limit_in_bytes",
> > > +		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> > > +		.write = mem_cgroup_write,
> > > +		.read_u64 = mem_cgroup_read_u64,
> > > +	},
> > 
> > Does it really make sense to have a separate limit for kmem only?
> > IIRC, the reason we introduced this was that this memory is not
> > reclaimable and so we need to limit it.
> > 
> > But the opposite effect happened: because it's not reclaimable, the
> > separate kmem limit is actually unusable for any values smaller than
> > the overall memory limit: because there is no reclaim mechanism for
> > that limit, once you hit it, it's over, there is nothing you can do
> > anymore.  The problem isn't so much unreclaimable memory, the problem
> > is unreclaimable limits.
> > 
> > If the global case produces memory pressure through kernel memory
> > allocations, we reclaim page cache, anonymous pages, inodes, dentries
> > etc.  I think the same should happen for kmem: kmem should just be
> > accounted and limited in the overall memory limit of a group, and when
> > pressure arises, we go after anything that's reclaimable.
> 
> Personally, I don't think there's much sense in having a separate knob
> for kmem limit either. Until we have a user with a sane use case for it,
> let's not propagate it to the new interface.
> 
> Furthermore, even when we introduce kmem shrinking, the kmem-only limit
> alone won't be very useful, because there are plenty of GFP_NOFS kmem
> allocations, which make most of slab shrinkers useless. To avoid
> ENOMEM's in such situation, we would have to introduce either a soft
> kmem limit (watermark) or a kind of kmem precharges. This means if we
> decided to introduce kmem-only limit, we'd eventually have to add more
> knobs and write more code to make it usable w/o even knowing if anyone
> would really benefit from it.
> 
> However, there might be users that only want user memory limiting and
> don't want to pay the price of kmem accounting, which is pretty
> expensive. Even if we implement percpu stocks for kmem, there still will
> be noticeable overhead due to touching more cache lines on
> kmalloc/kfree.

Yes, we should not force everybody to take that cost in general, but
once you're using it, how much overhead is it really?  Charging
already happens in the slow path and we can batch it as you said.

I wonder if it would be enough to have the same granularity as the
swap controller; a config option and a global runtime toggle.

* Re: [RFC PATCH] memcg: export knobs for the default cgroup hierarchy
  2014-07-18 15:44   ` Vladimir Davydov
  2014-07-18 16:13     ` Johannes Weiner
@ 2014-07-21  9:07     ` Michal Hocko
  2014-07-21 11:46       ` Michal Hocko
  2014-07-21 11:48       ` Vladimir Davydov
  1 sibling, 2 replies; 13+ messages in thread
From: Michal Hocko @ 2014-07-21  9:07 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, linux-mm, LKML, Tejun Heo, Hugh Dickins,
	Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Fri 18-07-14 19:44:43, Vladimir Davydov wrote:
> On Wed, Jul 16, 2014 at 11:58:14AM -0400, Johannes Weiner wrote:
> > On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote:
> > > +#ifdef CONFIG_MEMCG_KMEM
> > > +	{
> > > +		.name = "kmem.limit_in_bytes",
> > > +		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> > > +		.write = mem_cgroup_write,
> > > +		.read_u64 = mem_cgroup_read_u64,
> > > +	},
> > 
> > Does it really make sense to have a separate limit for kmem only?
> > IIRC, the reason we introduced this was that this memory is not
> > reclaimable and so we need to limit it.
> > 
> > But the opposite effect happened: because it's not reclaimable, the
> > separate kmem limit is actually unusable for any values smaller than
> > the overall memory limit: because there is no reclaim mechanism for
> > that limit, once you hit it, it's over, there is nothing you can do
> > anymore.  The problem isn't so much unreclaimable memory, the problem
> > is unreclaimable limits.
> > 
> > If the global case produces memory pressure through kernel memory
> > allocations, we reclaim page cache, anonymous pages, inodes, dentries
> > etc.  I think the same should happen for kmem: kmem should just be
> > accounted and limited in the overall memory limit of a group, and when
> > pressure arises, we go after anything that's reclaimable.
> 
> Personally, I don't think there's much sense in having a separate knob
> for kmem limit either. Until we have a user with a sane use case for it,
> let's not propagate it to the new interface.

What about fork-bomb protection? I thought that was the primary usecase
for K < U? Or how can we handle that use case with a single limit? A
special gfp flag to not trigger the OOM path when called from some kmem
charge paths?

What about task_count, or whatever the name of the controller was that
got dropped and was suggested to be replaced by kmem accounting? I can
imagine that being implemented by a separate K limit which would be
roughly stack_size * task_count + a pillow for slab.
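
(For a rough feel of the numbers, assuming 16 KiB kernel stacks as on
x86-64 kernels of that time: a budget of 1024 tasks needs about

	1024 * 16 KiB = 16 MiB

for stacks, so K would be that plus whatever slab pillow the workload
needs - the stack part is the easy half of the sizing.)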
-- 
Michal Hocko
SUSE Labs

* Re: [RFC PATCH] memcg: export knobs for the default cgroup hierarchy
  2014-07-21  9:07     ` Michal Hocko
@ 2014-07-21 11:46       ` Michal Hocko
  2014-07-21 12:02         ` Tejun Heo
  2014-07-21 12:03         ` Vladimir Davydov
  2014-07-21 11:48       ` Vladimir Davydov
  1 sibling, 2 replies; 13+ messages in thread
From: Michal Hocko @ 2014-07-21 11:46 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, linux-mm, LKML, Tejun Heo, Hugh Dickins,
	Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Mon 21-07-14 11:07:24, Michal Hocko wrote:
> On Fri 18-07-14 19:44:43, Vladimir Davydov wrote:
> > On Wed, Jul 16, 2014 at 11:58:14AM -0400, Johannes Weiner wrote:
> > > On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote:
> > > > +#ifdef CONFIG_MEMCG_KMEM
> > > > +	{
> > > > +		.name = "kmem.limit_in_bytes",
> > > > +		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> > > > +		.write = mem_cgroup_write,
> > > > +		.read_u64 = mem_cgroup_read_u64,
> > > > +	},
> > > 
> > > Does it really make sense to have a separate limit for kmem only?
> > > IIRC, the reason we introduced this was that this memory is not
> > > reclaimable and so we need to limit it.
> > > 
> > > But the opposite effect happened: because it's not reclaimable, the
> > > separate kmem limit is actually unusable for any values smaller than
> > > the overall memory limit: because there is no reclaim mechanism for
> > > that limit, once you hit it, it's over, there is nothing you can do
> > > anymore.  The problem isn't so much unreclaimable memory, the problem
> > > is unreclaimable limits.
> > > 
> > > If the global case produces memory pressure through kernel memory
> > > allocations, we reclaim page cache, anonymous pages, inodes, dentries
> > > etc.  I think the same should happen for kmem: kmem should just be
> > > accounted and limited in the overall memory limit of a group, and when
> > > pressure arises, we go after anything that's reclaimable.
> > 
> > Personally, I don't think there's much sense in having a separate knob
> > for kmem limit either. Until we have a user with a sane use case for it,
> > let's not propagate it to the new interface.
> 
> What about fork-bomb forks protection? I thought that was the primary usecase
> for K < U? Or how can we handle that use case with a single limit? A
> special gfp flag to not trigger OOM path when called from some kmem
> charge paths?

Even then, I do not see how this fork-bomb prevention would work without
causing OOMs and killing other processes within the group. The danger
would still be contained in the group, preventing system-wide
disruption. Do we really want only such a narrow usecase?
-- 
Michal Hocko
SUSE Labs

* Re: [RFC PATCH] memcg: export knobs for the default cgroup hierarchy
  2014-07-21  9:07     ` Michal Hocko
  2014-07-21 11:46       ` Michal Hocko
@ 2014-07-21 11:48       ` Vladimir Davydov
  2014-07-21 12:09         ` Michal Hocko
  1 sibling, 1 reply; 13+ messages in thread
From: Vladimir Davydov @ 2014-07-21 11:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, LKML, Tejun Heo, Hugh Dickins,
	Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Mon, Jul 21, 2014 at 11:07:24AM +0200, Michal Hocko wrote:
> On Fri 18-07-14 19:44:43, Vladimir Davydov wrote:
> > On Wed, Jul 16, 2014 at 11:58:14AM -0400, Johannes Weiner wrote:
> > > On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote:
> > > > +#ifdef CONFIG_MEMCG_KMEM
> > > > +	{
> > > > +		.name = "kmem.limit_in_bytes",
> > > > +		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> > > > +		.write = mem_cgroup_write,
> > > > +		.read_u64 = mem_cgroup_read_u64,
> > > > +	},
> > > 
> > > Does it really make sense to have a separate limit for kmem only?
> > > IIRC, the reason we introduced this was that this memory is not
> > > reclaimable and so we need to limit it.
> > > 
> > > But the opposite effect happened: because it's not reclaimable, the
> > > separate kmem limit is actually unusable for any values smaller than
> > > the overall memory limit: because there is no reclaim mechanism for
> > > that limit, once you hit it, it's over, there is nothing you can do
> > > anymore.  The problem isn't so much unreclaimable memory, the problem
> > > is unreclaimable limits.
> > > 
> > > If the global case produces memory pressure through kernel memory
> > > allocations, we reclaim page cache, anonymous pages, inodes, dentries
> > > etc.  I think the same should happen for kmem: kmem should just be
> > > accounted and limited in the overall memory limit of a group, and when
> > > pressure arises, we go after anything that's reclaimable.
> > 
> > Personally, I don't think there's much sense in having a separate knob
> > for kmem limit either. Until we have a user with a sane use case for it,
> > let's not propagate it to the new interface.
> 
> What about fork-bomb forks protection? I thought that was the primary usecase
> for K < U? Or how can we handle that use case with a single limit? A
> special gfp flag to not trigger OOM path when called from some kmem
> charge paths?

Hmm, for a moment I thought that putting a fork-bomb inside a memory
cgroup with kmem accounting enabled and K=U would isolate it from the
rest of the system and that therefore there's no need for K<U, but now I
realize that's not quite right.

In contrast to user memory, thread stack allocations are of costly
order, cannot be swapped out, and on 32-bit systems consume the limited
resource of low memory. Although the latter two don't look like much of
a concern, the costly order of stack pages certainly does, I think.

Is this what you mean by saying we have to disable OOM on some kmem
charge paths? To prevent a global-level OOM that might trigger due to a
lack of high-order pages for task stacks?

> What about task_count or what was the name of the controller which was
> dropped and suggested to be replaced by kmem accounting? I can imagine
> that to be implemented by a separate K limit which would be roughtly
> stack_size * task_count + pillow for slab.

I wonder how big this pillow for slab should be...

Thanks.

* Re: [RFC PATCH] memcg: export knobs for the default cgroup hierarchy
  2014-07-21 11:46       ` Michal Hocko
@ 2014-07-21 12:02         ` Tejun Heo
  2014-07-21 12:03         ` Vladimir Davydov
  1 sibling, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2014-07-21 12:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vladimir Davydov, Johannes Weiner, linux-mm, LKML, Hugh Dickins,
	Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

Hello,

On Mon, Jul 21, 2014 at 01:46:55PM +0200, Michal Hocko wrote:
> Even then, I do not see how would this fork-bomb prevention work without
> causing OOMs and killing other processes within the group. The danger
> would be still contained in the group and prevent from the system wide
> disruption. Do we really want only such a narrow usecase?

Does that really matter?  I don't buy the usefulness of the various
suggested partial failure modes.  For example, is a fork bomb actually
something isolatable by not granting more forks?  Doing so is likely
to cripple the cgroup anyway, which apparently needed forking to
operate.  Such a partial failure mode would only be useful iff the
culprit is mostly isolated even within the cgroup, stops forking once
it starts to fail, the already-forked excess processes can be
identified and killed somehow without requiring forking in the cgroup,
and the fork failures in other parts of the cgroup hopefully haven't
broken the service provided by the cgroup yet.

In the long term, we should have per-cgroup OOM killing and terminate
the cgroups which fail to behave.  I think the value is in the ability
to contain such failures, not in partial failure modes that may or
may not be salvageable, with no way to systematically determine
which way the situation is.  Short of being able to detect which
specific processes are fork bombing and take them out, which I don't
think we can or should do, I believe that fork bomb protection should
be dealt with as an integral part of generic memcg operation.

Thanks.

-- 
tejun

* Re: [RFC PATCH] memcg: export knobs for the default cgroup hierarchy
  2014-07-21 11:46       ` Michal Hocko
  2014-07-21 12:02         ` Tejun Heo
@ 2014-07-21 12:03         ` Vladimir Davydov
  2014-07-21 12:49           ` Tejun Heo
  1 sibling, 1 reply; 13+ messages in thread
From: Vladimir Davydov @ 2014-07-21 12:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, LKML, Tejun Heo, Hugh Dickins,
	Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Mon, Jul 21, 2014 at 01:46:55PM +0200, Michal Hocko wrote:
> On Mon 21-07-14 11:07:24, Michal Hocko wrote:
> > On Fri 18-07-14 19:44:43, Vladimir Davydov wrote:
> > > On Wed, Jul 16, 2014 at 11:58:14AM -0400, Johannes Weiner wrote:
> > > > On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote:
> > > > > +#ifdef CONFIG_MEMCG_KMEM
> > > > > +	{
> > > > > +		.name = "kmem.limit_in_bytes",
> > > > > +		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> > > > > +		.write = mem_cgroup_write,
> > > > > +		.read_u64 = mem_cgroup_read_u64,
> > > > > +	},
> > > > 
> > > > Does it really make sense to have a separate limit for kmem only?
> > > > IIRC, the reason we introduced this was that this memory is not
> > > > reclaimable and so we need to limit it.
> > > > 
> > > > But the opposite effect happened: because it's not reclaimable, the
> > > > separate kmem limit is actually unusable for any values smaller than
> > > > the overall memory limit: because there is no reclaim mechanism for
> > > > that limit, once you hit it, it's over, there is nothing you can do
> > > > anymore.  The problem isn't so much unreclaimable memory, the problem
> > > > is unreclaimable limits.
> > > > 
> > > > If the global case produces memory pressure through kernel memory
> > > > allocations, we reclaim page cache, anonymous pages, inodes, dentries
> > > > etc.  I think the same should happen for kmem: kmem should just be
> > > > accounted and limited in the overall memory limit of a group, and when
> > > > pressure arises, we go after anything that's reclaimable.
> > > 
> > > Personally, I don't think there's much sense in having a separate knob
> > > for kmem limit either. Until we have a user with a sane use case for it,
> > > let's not propagate it to the new interface.
> > 
> > What about fork-bomb forks protection? I thought that was the primary usecase
> > for K < U? Or how can we handle that use case with a single limit? A
> > special gfp flag to not trigger OOM path when called from some kmem
> > charge paths?
> 
> Even then, I do not see how would this fork-bomb prevention work without
> causing OOMs and killing other processes within the group. The danger
> would be still contained in the group and prevent from the system wide
> disruption. Do we really want only such a narrow usecase?

I think it's all about how we're going to use memory cgroups. If we're
going to use them for application containers, there's simply no such
problem, because we only want to isolate a potentially dangerous process
group from the rest of the system. If we want to start a fully
virtualized OS inside a container, then we certainly need a kind of
numproc and/or kmem limiter to prevent processes inside a cgroup from
being OOM killed by a fork-bomb. IMHO, the latter will always be better
done by VMs, so it isn't a must-have for cgroups. I may be mistaken
though.

Thanks.

* Re: [RFC PATCH] memcg: export knobs for the default cgroup hierarchy
  2014-07-21 11:48       ` Vladimir Davydov
@ 2014-07-21 12:09         ` Michal Hocko
  0 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2014-07-21 12:09 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, linux-mm, LKML, Tejun Heo, Hugh Dickins,
	Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Mon 21-07-14 15:48:39, Vladimir Davydov wrote:
> On Mon, Jul 21, 2014 at 11:07:24AM +0200, Michal Hocko wrote:
> > On Fri 18-07-14 19:44:43, Vladimir Davydov wrote:
> > > On Wed, Jul 16, 2014 at 11:58:14AM -0400, Johannes Weiner wrote:
> > > > On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote:
> > > > > +#ifdef CONFIG_MEMCG_KMEM
> > > > > +	{
> > > > > +		.name = "kmem.limit_in_bytes",
> > > > > +		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> > > > > +		.write = mem_cgroup_write,
> > > > > +		.read_u64 = mem_cgroup_read_u64,
> > > > > +	},
> > > > 
> > > > Does it really make sense to have a separate limit for kmem only?
> > > > IIRC, the reason we introduced this was that this memory is not
> > > > reclaimable and so we need to limit it.
> > > > 
> > > > But the opposite effect happened: because it's not reclaimable, the
> > > > separate kmem limit is actually unusable for any values smaller than
> > > > the overall memory limit: because there is no reclaim mechanism for
> > > > that limit, once you hit it, it's over, there is nothing you can do
> > > > anymore.  The problem isn't so much unreclaimable memory, the problem
> > > > is unreclaimable limits.
> > > > 
> > > > If the global case produces memory pressure through kernel memory
> > > > allocations, we reclaim page cache, anonymous pages, inodes, dentries
> > > > etc.  I think the same should happen for kmem: kmem should just be
> > > > accounted and limited in the overall memory limit of a group, and when
> > > > pressure arises, we go after anything that's reclaimable.
> > > 
> > > Personally, I don't think there's much sense in having a separate knob
> > > for kmem limit either. Until we have a user with a sane use case for it,
> > > let's not propagate it to the new interface.
> > 
> > What about fork-bomb forks protection? I thought that was the primary usecase
> > for K < U? Or how can we handle that use case with a single limit? A
> > special gfp flag to not trigger OOM path when called from some kmem
> > charge paths?
> 
> Hmm, for a moment I thought that putting a fork-bomb inside a memory
> cgroup with kmem accounting enabled and K=U will isolate it from the
> rest of the system and therefore there's no need in K<U, but now I
> realize it's not quite right.
> 
> In contrast to user memory, thread stack allocations have costly order,
> they cannot be swapped out, and on 32-bit systems they will consume a
> limited resource of low mem. Although the latter two doesn't look like
> being of much concern, costly order of stack pages certainly does I
> think.
> 
> Is this what you mean by saying we have to disable OOM from some kmem
> charge paths? To prevent OOM on the global level that might trigger due
> to lack of high order pages for task stack?

No, I meant it for a different reason. If you simply cause an OOM from
e.g. a stack charge then you simply DoS your cgroup before you
effectively stop the fork-bomb, because the fork-bomb will usually have
a much smaller RSS than anything else in the group. So this is a case
where you really want to fail the allocation.

Maybe I just didn't understand what the single-limit proposal meant...

> > What about task_count or what was the name of the controller which was
> > dropped and suggested to be replaced by kmem accounting? I can imagine
> > that to be implemented by a separate K limit which would be roughtly
> > stack_size * task_count + pillow for slab.
> 
> I wonder how big this pillow for slab should be...

Well, it obviously depends on the load running in the group. It depends
on the amount of unreclaimable slab plus the
reclaimable-but-still-not-thrashing amount of slab. So the pillow should
be quite large, but that shouldn't be a big deal as kernel allocations
are usually a small part of U.
-- 
Michal Hocko
SUSE Labs

* Re: [RFC PATCH] memcg: export knobs for the default cgroup hierarchy
  2014-07-21 12:03         ` Vladimir Davydov
@ 2014-07-21 12:49           ` Tejun Heo
  0 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2014-07-21 12:49 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Michal Hocko, Johannes Weiner, linux-mm, LKML, Hugh Dickins,
	Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Mon, Jul 21, 2014 at 04:03:32PM +0400, Vladimir Davydov wrote:
> I think it's all about how we're going to use memory cgroups. If we're
> going to use them for application containers, there's simply no such
> problem, because we only want to isolate a potentially dangerous process
> group from the rest of the system. If we want to start a fully
> virtualized OS inside a container, then we certainly need a kind of

For shell environments, ulimit is a much better targeted protection
mechanism against fork bombs, and a process-granular OOM killer would
behave during fork bombing mostly equivalently to the way it would
behave in the host environment without cgroups.  I'm having a hard time
seeing why this would need any special treatment from cgroups.
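
(For completeness, that protection is a one-liner: "ulimit -u" in the
shell, or setrlimit() from C - a minimal sketch with an arbitrary cap:)

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	/* cap processes/threads for this user, like "ulimit -u 512" */
	struct rlimit rl = { .rlim_cur = 512, .rlim_max = 512 };

	if (setrlimit(RLIMIT_NPROC, &rl))
		perror("setrlimit");
	return 0;
}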

Thanks.

-- 
tejun

* Re: [RFC PATCH] memcg: export knobs for the default cgroup hierarchy
  2014-07-16 15:58 ` Johannes Weiner
  2014-07-17 13:45   ` Michal Hocko
  2014-07-18 15:44   ` Vladimir Davydov
@ 2014-07-21 13:22   ` Michal Hocko
  2 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2014-07-21 13:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, LKML, Tejun Heo, Hugh Dickins, Greg Thelen,
	Vladimir Davydov, Glauber Costa, Andrew Morton,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Wed 16-07-14 11:58:14, Johannes Weiner wrote:
> On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote:
[...]
> > +#ifdef CONFIG_MEMCG_KMEM
> > +	{
> > +		.name = "kmem.limit_in_bytes",
> > +		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> > +		.write = mem_cgroup_write,
> > +		.read_u64 = mem_cgroup_read_u64,
> > +	},
> 
> Does it really make sense to have a separate limit for kmem only?

It seems that this needs further discussion so I will drop it in the
next version of the patch and we can enable it or move to a single knob
for U+K later.
-- 
Michal Hocko
SUSE Labs
