* [PATCH] mm: memcontrol: reclaim when shrinking memory.high below usage
@ 2016-03-10 20:50 Johannes Weiner
  2016-03-11  7:55 ` Michal Hocko
  2016-03-11  8:34 ` Vladimir Davydov
  0 siblings, 2 replies; 13+ messages in thread
From: Johannes Weiner @ 2016-03-10 20:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, linux-mm, cgroups, linux-kernel,
	kernel-team

When setting memory.high below usage, nothing happens until the next
charge comes along, and then it will only reclaim its own charge and
not the now potentially huge excess of the new memory.high. This can
cause groups to stay in excess of their memory.high indefinitely.

To fix that, when shrinking memory.high, kick off a reclaim cycle that
goes after the delta.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8615b066b642..f7c9b4cbdf01 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4992,6 +4992,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 				 char *buf, size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long nr_pages;
 	unsigned long high;
 	int err;
 
@@ -5002,6 +5003,11 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 
 	memcg->high = high;
 
+	nr_pages = page_counter_read(&memcg->memory);
+	if (nr_pages > high)
+		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
+					     GFP_KERNEL, true);
+
 	memcg_wb_domain_size_changed(memcg);
 	return nbytes;
 }
-- 
2.7.2
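
To make the commit message above concrete, the effect can be observed from
userspace with a sketch along the following lines. The cgroup path is
hypothetical and assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup with
an existing group "test" that already has memory charged to it, and sufficient
privileges to write the knob. Without the patch, memory.current typically
stays near its old value after the write until the group charges again; with
the patch, the write itself reclaims toward the new limit (best effort).

/* Sketch: shrink memory.high below the current usage and re-read usage. */
#include <stdio.h>

static unsigned long read_ul(const char *path)
{
        unsigned long val = 0;
        FILE *f = fopen(path, "r");

        if (f) {
                if (fscanf(f, "%lu", &val) != 1)
                        val = 0;
                fclose(f);
        }
        return val;
}

int main(void)
{
        const char *cur  = "/sys/fs/cgroup/test/memory.current";
        const char *high = "/sys/fs/cgroup/test/memory.high";
        unsigned long usage = read_ul(cur);
        FILE *f = fopen(high, "w");

        if (!f)
                return 1;
        /* Set the high limit to half of the current usage (values in bytes). */
        fprintf(f, "%lu\n", usage / 2);
        fclose(f);

        /* Before the patch this still prints roughly the old usage;
         * with it, the write has already reclaimed toward the new high. */
        printf("usage before: %lu, after write: %lu\n", usage, read_ul(cur));
        return 0;
}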


* Re: [PATCH] mm: memcontrol: reclaim when shrinking memory.high below usage
  2016-03-10 20:50 [PATCH] mm: memcontrol: reclaim when shrinking memory.high below usage Johannes Weiner
@ 2016-03-11  7:55 ` Michal Hocko
  2016-03-11  8:34 ` Vladimir Davydov
  1 sibling, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2016-03-11  7:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu 10-03-16 15:50:13, Johannes Weiner wrote:
> When setting memory.high below usage, nothing happens until the next
> charge comes along, and then it will only reclaim its own charge and
> not the now potentially huge excess of the new memory.high. This can
> cause groups to stay in excess of their memory.high indefinitely.
> 
> To fix that, when shrinking memory.high, kick off a reclaim cycle that
> goes after the delta.

This has been the case since the knob was introduced, but I wouldn't
bother with the CC: stable # 4.0+ as this was still in experimental
mode. I guess we want to have it in 4.5 or put it into 4.5 stable.

> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/memcontrol.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8615b066b642..f7c9b4cbdf01 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4992,6 +4992,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  				 char *buf, size_t nbytes, loff_t off)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned long nr_pages;
>  	unsigned long high;
>  	int err;
>  
> @@ -5002,6 +5003,11 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  
>  	memcg->high = high;
>  
> +	nr_pages = page_counter_read(&memcg->memory);
> +	if (nr_pages > high)
> +		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> +					     GFP_KERNEL, true);
> +
>  	memcg_wb_domain_size_changed(memcg);
>  	return nbytes;
>  }
> -- 
> 2.7.2

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: memcontrol: reclaim when shrinking memory.high below usage
  2016-03-10 20:50 [PATCH] mm: memcontrol: reclaim when shrinking memory.high below usage Johannes Weiner
  2016-03-11  7:55 ` Michal Hocko
@ 2016-03-11  8:34 ` Vladimir Davydov
  2016-03-11  8:42   ` Michal Hocko
  2016-03-16  5:41   ` Johannes Weiner
  1 sibling, 2 replies; 13+ messages in thread
From: Vladimir Davydov @ 2016-03-11  8:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

On Thu, Mar 10, 2016 at 03:50:13PM -0500, Johannes Weiner wrote:
> When setting memory.high below usage, nothing happens until the next
> charge comes along, and then it will only reclaim its own charge and
> not the now potentially huge excess of the new memory.high. This can
> cause groups to stay in excess of their memory.high indefinitely.
> 
> To fix that, when shrinking memory.high, kick off a reclaim cycle that
> goes after the delta.

I agree that we should reclaim the high excess, but I don't think it's a
good idea to do it synchronously. Currently, the memory.low and memory.high
knobs can easily be used by a single-threaded load manager implemented
in userspace, because it doesn't need to care about potential stalls
caused by writes to these files. After this change a write to memory.high
might take a long time, seconds perhaps, so in order to react quickly to
changes in other cgroups, a load manager would have to spawn a thread for
each write to memory.high, which would complicate its implementation
significantly.

Since, in contrast to memory.max, the memory.high definition allows a
cgroup to breach it, I believe it would be better if we spawned
asynchronous reclaim work from the kernel on a write to memory.high
instead of doing this synchronously. I guess we could reuse
mem_cgroup->high_work for that.

Thanks,
Vladimir
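
For illustration, the asynchronous variant suggested above would amount to
something like the following fragment in memory_high_write(), punting the
work to the existing per-memcg high_work item instead of reclaiming in the
writer's context. This is only a sketch of the proposal, not the merged
behavior, and it assumes high_work's handler would be taught to go after the
full excess rather than a single charge batch.

        memcg->high = high;

        /*
         * Async alternative (sketch): instead of calling
         * try_to_free_mem_cgroup_pages() here and possibly blocking the
         * writer for a long time, let the per-memcg worker chase the
         * excess in the background.
         */
        if (page_counter_read(&memcg->memory) > high)
                schedule_work(&memcg->high_work);

        memcg_wb_domain_size_changed(memcg);
        return nbytes;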


* Re: [PATCH] mm: memcontrol: reclaim when shrinking memory.high below usage
  2016-03-11  8:34 ` Vladimir Davydov
@ 2016-03-11  8:42   ` Michal Hocko
  2016-03-11  9:13     ` Vladimir Davydov
  2016-03-16  5:41   ` Johannes Weiner
  1 sibling, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2016-03-11  8:42 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Andrew Morton, linux-mm, cgroups, linux-kernel,
	kernel-team

On Fri 11-03-16 11:34:40, Vladimir Davydov wrote:
> On Thu, Mar 10, 2016 at 03:50:13PM -0500, Johannes Weiner wrote:
> > When setting memory.high below usage, nothing happens until the next
> > charge comes along, and then it will only reclaim its own charge and
> > not the now potentially huge excess of the new memory.high. This can
> > cause groups to stay in excess of their memory.high indefinitely.
> > 
> > To fix that, when shrinking memory.high, kick off a reclaim cycle that
> > goes after the delta.
> 
> I agree that we should reclaim the high excess, but I don't think it's a
> good idea to do it synchronously. Currently, memory.low and memory.high
> knobs can be easily used by a single-threaded load manager implemented
> in userspace, because it doesn't need to care about potential stalls
> caused by writes to these files. After this change it might happen that
> a write to memory.high would take long, seconds perhaps, so in order to
> react quickly to changes in other cgroups, a load manager would have to
> spawn a thread per each write to memory.high, which would complicate its
> implementation significantly.

Is the complication on the managing part really an issue, though? Such a
manager would have to spawn a process/thread to change the .max already.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: memcontrol: reclaim when shrinking memory.high below usage
  2016-03-11  8:42   ` Michal Hocko
@ 2016-03-11  9:13     ` Vladimir Davydov
  2016-03-11  9:53       ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Vladimir Davydov @ 2016-03-11  9:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, linux-mm, cgroups, linux-kernel,
	kernel-team

On Fri, Mar 11, 2016 at 09:42:39AM +0100, Michal Hocko wrote:
> On Fri 11-03-16 11:34:40, Vladimir Davydov wrote:
> > On Thu, Mar 10, 2016 at 03:50:13PM -0500, Johannes Weiner wrote:
> > > When setting memory.high below usage, nothing happens until the next
> > > charge comes along, and then it will only reclaim its own charge and
> > > not the now potentially huge excess of the new memory.high. This can
> > > cause groups to stay in excess of their memory.high indefinitely.
> > > 
> > > To fix that, when shrinking memory.high, kick off a reclaim cycle that
> > > goes after the delta.
> > 
> > I agree that we should reclaim the high excess, but I don't think it's a
> > good idea to do it synchronously. Currently, memory.low and memory.high
> > knobs can be easily used by a single-threaded load manager implemented
> > in userspace, because it doesn't need to care about potential stalls
> > caused by writes to these files. After this change it might happen that
> > a write to memory.high would take long, seconds perhaps, so in order to
> > react quickly to changes in other cgroups, a load manager would have to
> > spawn a thread per each write to memory.high, which would complicate its
> > implementation significantly.
> 
> Is the complication on the managing part really an issue though. Such a
> manager would have to spawn a process/thread to change the .max already.

IMO memory.max is not something that has to be changed often. In most
cases it will be set on container start and stay put throughout the
container's lifetime. I can also imagine a case where memory.max is
changed for all containers when a container starts or stops, so as to
guarantee that if <= N containers out of M go mad, the system will
survive. In any case, memory.max is reconfigured rarely; it belongs to
the static configuration.

OTOH memory.low and memory.high are perfect candidates for dynamic
adjustment, based on the containers' memory demand/pressure. A load
manager might want to reconfigure these knobs, say, every 5 seconds.
Spawning a thread for each container that often would look unnecessarily
overcomplicated IMO.

Thanks,
Vladimir


* Re: [PATCH] mm: memcontrol: reclaim when shrinking memory.high below usage
  2016-03-11  9:13     ` Vladimir Davydov
@ 2016-03-11  9:53       ` Michal Hocko
  2016-03-11 11:49         ` Vladimir Davydov
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2016-03-11  9:53 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Andrew Morton, linux-mm, cgroups, linux-kernel,
	kernel-team

On Fri 11-03-16 12:13:04, Vladimir Davydov wrote:
> On Fri, Mar 11, 2016 at 09:42:39AM +0100, Michal Hocko wrote:
> > On Fri 11-03-16 11:34:40, Vladimir Davydov wrote:
> > > On Thu, Mar 10, 2016 at 03:50:13PM -0500, Johannes Weiner wrote:
> > > > When setting memory.high below usage, nothing happens until the next
> > > > charge comes along, and then it will only reclaim its own charge and
> > > > not the now potentially huge excess of the new memory.high. This can
> > > > cause groups to stay in excess of their memory.high indefinitely.
> > > > 
> > > > To fix that, when shrinking memory.high, kick off a reclaim cycle that
> > > > goes after the delta.
> > > 
> > > I agree that we should reclaim the high excess, but I don't think it's a
> > > good idea to do it synchronously. Currently, memory.low and memory.high
> > > knobs can be easily used by a single-threaded load manager implemented
> > > in userspace, because it doesn't need to care about potential stalls
> > > caused by writes to these files. After this change it might happen that
> > > a write to memory.high would take long, seconds perhaps, so in order to
> > > react quickly to changes in other cgroups, a load manager would have to
> > > spawn a thread per each write to memory.high, which would complicate its
> > > implementation significantly.
> > 
> > Is the complication on the managing part really an issue though. Such a
> > manager would have to spawn a process/thread to change the .max already.
> 
> IMO memory.max is not something that has to be changed often. In most
> cases it will be set on container start and stay put throughout
> container lifetime. I can also imagine a case when memory.max will be
> changed for all containers when a container starts or stops, so as to
> guarantee that if <= N containers of M go mad, the system will survive.
> In any case, memory.max is reconfigured rarely, it rather belongs to the
> static configuration.

I see
 
> OTOH memory.low and memory.high are perfect to be changed dynamically,
> basing on containers' memory demand/pressure. A load manager might want
> to reconfigure these knobs say every 5 seconds. Spawning a thread per
> each container that often would look unnecessarily overcomplicated IMO.

The question however is whether we want to hide a potentially costly
operation and have it unaccounted and hidden in the kworker context.
I mean, fork() + write() doesn't sound terribly complicated to me,
compared to adding rather subtle behavior in the kernel.
-- 
Michal Hocko
SUSE Labs
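
As an illustration of the fork() + write() pattern mentioned above, the
userspace side could look roughly like this hypothetical sketch (the cgroup
path is made up, and reaping the child via waitpid()/SIGCHLD is left out):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * The parent returns immediately; the child performs the write to
 * memory.high and may block in reclaim without stalling the caller.
 */
static void set_high_async(const char *path, unsigned long bytes)
{
        pid_t pid = fork();

        if (pid == 0) {
                FILE *f = fopen(path, "w");

                if (f) {
                        fprintf(f, "%lu\n", bytes);
                        fclose(f);
                }
                _exit(0);
        }
}

int main(void)
{
        set_high_async("/sys/fs/cgroup/test/memory.high", 256UL << 20);
        return 0;
}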


* Re: [PATCH] mm: memcontrol: reclaim when shrinking memory.high below usage
  2016-03-11  9:53       ` Michal Hocko
@ 2016-03-11 11:49         ` Vladimir Davydov
  2016-03-11 13:39           ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Vladimir Davydov @ 2016-03-11 11:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, linux-mm, cgroups, linux-kernel,
	kernel-team

On Fri, Mar 11, 2016 at 10:53:09AM +0100, Michal Hocko wrote:
> > OTOH memory.low and memory.high are perfect to be changed dynamically,
> > basing on containers' memory demand/pressure. A load manager might want
> > to reconfigure these knobs say every 5 seconds. Spawning a thread per
> > each container that often would look unnecessarily overcomplicated IMO.
> 
> The question however is whether we want to hide a potentially costly
> operation and have it unaccounted and hidden in the kworker context.

There's already mem_cgroup->high_work doing reclaim in an unaccounted
context quite often if tcp accounting is enabled. And there's kswapd.
The memory.high knob is for root only, so it can't be abused by an
unprivileged user. As for a privileged user, e.g. a load manager, it
can screw things up anyway, e.g. by configuring the sum of memory.low to
be greater than total RAM on the host and hence driving kswapd mad.

> I mean fork() + write() doesn't sound terribly complicated to me to have
> a rather subtle behavior in the kernel.

It'd just be a dubious API IMHO. With memory.max everything's clear: it
tries to reclaim memory hard, may stall for several seconds, may invoke
OOM, but if it finishes successfully we have memory.current less than
memory.max. With this patch the memory.high knob behaves rather strangely:
it might stall, but there's no guarantee you'll end up with memory.current
less than memory.high; moreover, according to the documentation it's OK to
have memory.current greater than memory.high, so what's the point in doing
synchronous reclaim that blocks the caller?

Thanks,
Vladimir


* Re: [PATCH] mm: memcontrol: reclaim when shrinking memory.high below usage
  2016-03-11 11:49         ` Vladimir Davydov
@ 2016-03-11 13:39           ` Michal Hocko
  2016-03-11 14:01             ` Vladimir Davydov
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2016-03-11 13:39 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Andrew Morton, linux-mm, cgroups, linux-kernel,
	kernel-team

On Fri 11-03-16 14:49:34, Vladimir Davydov wrote:
> On Fri, Mar 11, 2016 at 10:53:09AM +0100, Michal Hocko wrote:
> > > OTOH memory.low and memory.high are perfect to be changed dynamically,
> > > basing on containers' memory demand/pressure. A load manager might want
> > > to reconfigure these knobs say every 5 seconds. Spawning a thread per
> > > each container that often would look unnecessarily overcomplicated IMO.
> > 
> > The question however is whether we want to hide a potentially costly
> > operation and have it unaccounted and hidden in the kworker context.
> 
> There's already mem_cgroup->high_work doing reclaim in an unaccounted
> context quite often if tcp accounting is enabled.

I suspect this is done because the charging context cannot do much
better.

> And there's kswapd.
> memory.high knob is for the root only so it can't be abused by an
> unprivileged user. Regarding a privileged user, e.g. load manager, it
> can screw things up anyway, e.g. by configuring sum of memory.low to be
> greater than total RAM on the host and hence driving kswapd mad.

I am not worried about abuse. It is just weird to move something which
can perfectly well be done synchronously into an async mode.
 
> > I mean fork() + write() doesn't sound terribly complicated to me to have
> > a rather subtle behavior in the kernel.
> 
> It'd be just a dubious API IMHO. With memory.max everything's clear: it
> tries to reclaim memory hard, may stall for several seconds, may invoke
> OOM, but if it finishes successfully we have memory.current less than
> memory.max. With this patch memory.high knob behaves rather strangely:
> it might stall, but there's no guarantee you'll have memory.current less
> than memory.high; moreover, according to the documentation it's OK to
> have memory.current greater than memory.high, so what's the point in
> calling synchronous reclaim blocking the caller?

Even if the reclaim is best effort, that doesn't mean we should hide it
in an async context. There is simply no reason to do so. We do the same
for other knobs which perform a potentially expensive operation and do
not guarantee the result.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: memcontrol: reclaim when shrinking memory.high below usage
  2016-03-11 13:39           ` Michal Hocko
@ 2016-03-11 14:01             ` Vladimir Davydov
  2016-03-11 14:22               ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Vladimir Davydov @ 2016-03-11 14:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, linux-mm, cgroups, linux-kernel,
	kernel-team

On Fri, Mar 11, 2016 at 02:39:36PM +0100, Michal Hocko wrote:
> On Fri 11-03-16 14:49:34, Vladimir Davydov wrote:
> > On Fri, Mar 11, 2016 at 10:53:09AM +0100, Michal Hocko wrote:
> > > > OTOH memory.low and memory.high are perfect to be changed dynamically,
> > > > basing on containers' memory demand/pressure. A load manager might want
> > > > to reconfigure these knobs say every 5 seconds. Spawning a thread per
> > > > each container that often would look unnecessarily overcomplicated IMO.
> > > 
> > > The question however is whether we want to hide a potentially costly
> > > operation and have it unaccounted and hidden in the kworker context.
> > 
> > There's already mem_cgroup->high_work doing reclaim in an unaccounted
> > context quite often if tcp accounting is enabled.
> 
> I suspect this is done because the charging context cannot do much
> better.
> 
> > And there's kswapd.
> > memory.high knob is for the root only so it can't be abused by an
> > unprivileged user. Regarding a privileged user, e.g. load manager, it
> > can screw things up anyway, e.g. by configuring sum of memory.low to be
> > greater than total RAM on the host and hence driving kswapd mad.
> 
> I am not worried about abuse. It is just weird to move something which
> can be perfectly sync to an async mode.
>  
> > > I mean fork() + write() doesn't sound terribly complicated to me to have
> > > a rather subtle behavior in the kernel.
> > 
> > It'd be just a dubious API IMHO. With memory.max everything's clear: it
> > tries to reclaim memory hard, may stall for several seconds, may invoke
> > OOM, but if it finishes successfully we have memory.current less than
> > memory.max. With this patch memory.high knob behaves rather strangely:
> > it might stall, but there's no guarantee you'll have memory.current less
> > than memory.high; moreover, according to the documentation it's OK to
> > have memory.current greater than memory.high, so what's the point in
> > calling synchronous reclaim blocking the caller?
> 
> Even if the reclaim is best effort it doesn't mean we should hide it
> into an async context. There is simply no reason to do so. We do the
> some for other knobs which are performing a potentially expensive
> operation and do not guarantee the result.

IMO it depends on what a knob is used for. If it's for testing or
debugging or recovering the system (e.g. manual oom, compact,
drop_caches), this must be synchronous, but memory.high is going to be
tweaked at runtime during normal system operation every several seconds
or so, at least in my understanding. I understand your concern, and maybe
you're right in the end, but think about userspace that will probably
have to spawn thousands of threads every 5 seconds or so just to write to
a file. It's painful IMO.

Are there any hidden non-obvious implications of handing over reclaim to
a kernel worker on adjusting memory.high? Maybe I'm just missing
something obvious, and it can be really dangerous or sub-optimal.

Thanks,
Vladimir
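
The complication being described would look roughly like this on the
userspace side: a manager that refreshes many containers every few seconds
and, because a write may block in reclaim, pushes each update into its own
detached thread. Everything here (paths, container names, the sizing policy)
is hypothetical:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct update {
        char path[256];
        unsigned long bytes;
};

/* Each write gets its own thread so a slow, reclaiming write to one
 * container's memory.high cannot stall updates to the others. */
static void *write_high(void *arg)
{
        struct update *u = arg;
        FILE *f = fopen(u->path, "w");

        if (f) {
                fprintf(f, "%lu\n", u->bytes);
                fclose(f);
        }
        free(u);
        return NULL;
}

static void set_high(const char *cgroup, unsigned long bytes)
{
        struct update *u = malloc(sizeof(*u));
        pthread_t tid;

        if (!u)
                return;
        snprintf(u->path, sizeof(u->path),
                 "/sys/fs/cgroup/%s/memory.high", cgroup);
        u->bytes = bytes;
        if (pthread_create(&tid, NULL, write_high, u) == 0)
                pthread_detach(tid);
        else
                free(u);
}

int main(void)
{
        static const char *containers[] = { "ct1", "ct2", "ct3" };
        size_t i;

        for (;;) {
                for (i = 0; i < sizeof(containers) / sizeof(containers[0]); i++)
                        set_high(containers[i], 512UL << 20); /* placeholder policy */
                sleep(5);
        }
        return 0;
}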


* Re: [PATCH] mm: memcontrol: reclaim when shrinking memory.high below usage
  2016-03-11 14:01             ` Vladimir Davydov
@ 2016-03-11 14:22               ` Michal Hocko
  2016-03-11 14:46                 ` Vladimir Davydov
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2016-03-11 14:22 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Johannes Weiner, Andrew Morton, linux-mm, cgroups, linux-kernel,
	kernel-team

On Fri 11-03-16 17:01:46, Vladimir Davydov wrote:
> On Fri, Mar 11, 2016 at 02:39:36PM +0100, Michal Hocko wrote:
> > On Fri 11-03-16 14:49:34, Vladimir Davydov wrote:
> > > On Fri, Mar 11, 2016 at 10:53:09AM +0100, Michal Hocko wrote:
> > > > > OTOH memory.low and memory.high are perfect to be changed dynamically,
> > > > > basing on containers' memory demand/pressure. A load manager might want
> > > > > to reconfigure these knobs say every 5 seconds. Spawning a thread per
> > > > > each container that often would look unnecessarily overcomplicated IMO.
> > > > 
> > > > The question however is whether we want to hide a potentially costly
> > > > operation and have it unaccounted and hidden in the kworker context.
> > > 
> > > There's already mem_cgroup->high_work doing reclaim in an unaccounted
> > > context quite often if tcp accounting is enabled.
> > 
> > I suspect this is done because the charging context cannot do much
> > better.
> > 
> > > And there's kswapd.
> > > memory.high knob is for the root only so it can't be abused by an
> > > unprivileged user. Regarding a privileged user, e.g. load manager, it
> > > can screw things up anyway, e.g. by configuring sum of memory.low to be
> > > greater than total RAM on the host and hence driving kswapd mad.
> > 
> > I am not worried about abuse. It is just weird to move something which
> > can be perfectly sync to an async mode.
> >  
> > > > I mean fork() + write() doesn't sound terribly complicated to me to have
> > > > a rather subtle behavior in the kernel.
> > > 
> > > It'd be just a dubious API IMHO. With memory.max everything's clear: it
> > > tries to reclaim memory hard, may stall for several seconds, may invoke
> > > OOM, but if it finishes successfully we have memory.current less than
> > > memory.max. With this patch memory.high knob behaves rather strangely:
> > > it might stall, but there's no guarantee you'll have memory.current less
> > > than memory.high; moreover, according to the documentation it's OK to
> > > have memory.current greater than memory.high, so what's the point in
> > > calling synchronous reclaim blocking the caller?
> > 
> > Even if the reclaim is best effort it doesn't mean we should hide it
> > into an async context. There is simply no reason to do so. We do the
> > some for other knobs which are performing a potentially expensive
> > operation and do not guarantee the result.
> 
> IMO it depends on what a knob is used for. If it's for testing or
> debugging or recovering the system (e.g. manual oom, compact,
> drop_caches), this must be synchronous, but memory.high is going to be
> tweaked at runtime during normal system operation every several seconds
> or so,

Is this really going to happen in real life? And if so, is it really
probable that such an adjustment would cause such a large disruption?

> at least in my understanding. I understand your concern, and may
> be you're right in the end, but think about userspace that will probably
> have to spawn thousands threads every 5 seconds or so just to write to a
> file. It's painful IMO.
> 
> Are there any hidden non-obvious implications of handing over reclaim to
> a kernel worker on adjusting memory.high? May be, I'm just missing
> something obvious, and it can be really dangerous or sub-optimal.

I am just thinking about what would happen if workers start stacking up
because they can't be processed in time and then race with each other.
I mean, this would all be fixable, but I really fail to see how it makes
sense from the very beginning.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: memcontrol: reclaim when shrinking memory.high below usage
  2016-03-11 14:22               ` Michal Hocko
@ 2016-03-11 14:46                 ` Vladimir Davydov
  0 siblings, 0 replies; 13+ messages in thread
From: Vladimir Davydov @ 2016-03-11 14:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, linux-mm, cgroups, linux-kernel,
	kernel-team

On Fri, Mar 11, 2016 at 03:22:27PM +0100, Michal Hocko wrote:
> On Fri 11-03-16 17:01:46, Vladimir Davydov wrote:
> > On Fri, Mar 11, 2016 at 02:39:36PM +0100, Michal Hocko wrote:
> > > On Fri 11-03-16 14:49:34, Vladimir Davydov wrote:
> > > > On Fri, Mar 11, 2016 at 10:53:09AM +0100, Michal Hocko wrote:
> > > > > > OTOH memory.low and memory.high are perfect to be changed dynamically,
> > > > > > basing on containers' memory demand/pressure. A load manager might want
> > > > > > to reconfigure these knobs say every 5 seconds. Spawning a thread per
> > > > > > each container that often would look unnecessarily overcomplicated IMO.
> > > > > 
> > > > > The question however is whether we want to hide a potentially costly
> > > > > operation and have it unaccounted and hidden in the kworker context.
> > > > 
> > > > There's already mem_cgroup->high_work doing reclaim in an unaccounted
> > > > context quite often if tcp accounting is enabled.
> > > 
> > > I suspect this is done because the charging context cannot do much
> > > better.
> > > 
> > > > And there's kswapd.
> > > > memory.high knob is for the root only so it can't be abused by an
> > > > unprivileged user. Regarding a privileged user, e.g. load manager, it
> > > > can screw things up anyway, e.g. by configuring sum of memory.low to be
> > > > greater than total RAM on the host and hence driving kswapd mad.
> > > 
> > > I am not worried about abuse. It is just weird to move something which
> > > can be perfectly sync to an async mode.
> > >  
> > > > > I mean fork() + write() doesn't sound terribly complicated to me to have
> > > > > a rather subtle behavior in the kernel.
> > > > 
> > > > It'd be just a dubious API IMHO. With memory.max everything's clear: it
> > > > tries to reclaim memory hard, may stall for several seconds, may invoke
> > > > OOM, but if it finishes successfully we have memory.current less than
> > > > memory.max. With this patch memory.high knob behaves rather strangely:
> > > > it might stall, but there's no guarantee you'll have memory.current less
> > > > than memory.high; moreover, according to the documentation it's OK to
> > > > have memory.current greater than memory.high, so what's the point in
> > > > calling synchronous reclaim blocking the caller?
> > > 
> > > Even if the reclaim is best effort it doesn't mean we should hide it
> > > into an async context. There is simply no reason to do so. We do the
> > > some for other knobs which are performing a potentially expensive
> > > operation and do not guarantee the result.
> > 
> > IMO it depends on what a knob is used for. If it's for testing or
> > debugging or recovering the system (e.g. manual oom, compact,
> > drop_caches), this must be synchronous, but memory.high is going to be
> > tweaked at runtime during normal system operation every several seconds
> > or so,
> 
> Is this really going to happen in the real life?

I don't think anyone can say for sure now, but I think it's possible.

> And if yes is it really probable that such an adjustment would cause
> such a large disruption?

Dunno.

> 
> > at least in my understanding. I understand your concern, and may
> > be you're right in the end, but think about userspace that will probably
> > have to spawn thousands threads every 5 seconds or so just to write to a
> > file. It's painful IMO.
> > 
> > Are there any hidden non-obvious implications of handing over reclaim to
> > a kernel worker on adjusting memory.high? May be, I'm just missing
> > something obvious, and it can be really dangerous or sub-optimal.
> 
> I am just thinking about what would happen if workers just start
> stacking up because they couldn't be processed and then would race with
> each other.

We can use one work per memcg (high_work?) and make it requeue itself if
it finds that memory usage is above memory.high and there are
reclaimable pages, so no stacking should happen.

> I mean this all would be fixable but I really fail to see
> how that makes sense from the very beginning.

This is a user API we're defining right now. Changing it in future will
be troublesome.

Thanks,
Vladimir
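
A self-requeueing worker along the lines described above might look roughly
like the sketch below. It reuses names from the existing code (high_work,
page_counter_read(), try_to_free_mem_cgroup_pages()) but is only an
illustration of the idea, not something that was posted or merged:

static void high_work_func(struct work_struct *work)
{
        struct mem_cgroup *memcg = container_of(work, struct mem_cgroup,
                                                high_work);
        unsigned long high = READ_ONCE(memcg->high);
        unsigned long nr_pages = page_counter_read(&memcg->memory);

        if (nr_pages <= high)
                return;

        /* Reclaim the current excess; if nothing could be reclaimed, give
         * up instead of spinning on unreclaimable memory. */
        if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
                                          GFP_KERNEL, true))
                return;

        /* Still above memory.high? Requeue this work item rather than
         * letting callers stack up new ones. */
        if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high))
                schedule_work(&memcg->high_work);
}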


* Re: [PATCH] mm: memcontrol: reclaim when shrinking memory.high below usage
  2016-03-11  8:34 ` Vladimir Davydov
  2016-03-11  8:42   ` Michal Hocko
@ 2016-03-16  5:41   ` Johannes Weiner
  2016-03-16 14:47     ` Vladimir Davydov
  1 sibling, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2016-03-16  5:41 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

On Fri, Mar 11, 2016 at 11:34:40AM +0300, Vladimir Davydov wrote:
> On Thu, Mar 10, 2016 at 03:50:13PM -0500, Johannes Weiner wrote:
> > When setting memory.high below usage, nothing happens until the next
> > charge comes along, and then it will only reclaim its own charge and
> > not the now potentially huge excess of the new memory.high. This can
> > cause groups to stay in excess of their memory.high indefinitely.
> > 
> > To fix that, when shrinking memory.high, kick off a reclaim cycle that
> > goes after the delta.
> 
> I agree that we should reclaim the high excess, but I don't think it's a
> good idea to do it synchronously. Currently, memory.low and memory.high
> knobs can be easily used by a single-threaded load manager implemented
> in userspace, because it doesn't need to care about potential stalls
> caused by writes to these files. After this change it might happen that
> a write to memory.high would take long, seconds perhaps, so in order to
> react quickly to changes in other cgroups, a load manager would have to
> spawn a thread per each write to memory.high, which would complicate its
> implementation significantly.

While I do expect memory.high to be adjusted every once in a while, I
can't see anybody doing it by a significant fraction of the cgroup
every couple of seconds - or tighter than the working set; and dropping
use-once cache is cheap. What kind of use case would that be?

But even if we're wrong about it and this becomes a scalability issue,
the knob - even when reclaiming synchronously - makes no guarantees
about the target being met once the write finishes. It's a best-effort
mechanism. What would break if we made it async later on?


* Re: [PATCH] mm: memcontrol: reclaim when shrinking memory.high below usage
  2016-03-16  5:41   ` Johannes Weiner
@ 2016-03-16 14:47     ` Vladimir Davydov
  0 siblings, 0 replies; 13+ messages in thread
From: Vladimir Davydov @ 2016-03-16 14:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Mar 15, 2016 at 10:41:57PM -0700, Johannes Weiner wrote:
> On Fri, Mar 11, 2016 at 11:34:40AM +0300, Vladimir Davydov wrote:
> > On Thu, Mar 10, 2016 at 03:50:13PM -0500, Johannes Weiner wrote:
> > > When setting memory.high below usage, nothing happens until the next
> > > charge comes along, and then it will only reclaim its own charge and
> > > not the now potentially huge excess of the new memory.high. This can
> > > cause groups to stay in excess of their memory.high indefinitely.
> > > 
> > > To fix that, when shrinking memory.high, kick off a reclaim cycle that
> > > goes after the delta.
> > 
> > I agree that we should reclaim the high excess, but I don't think it's a
> > good idea to do it synchronously. Currently, memory.low and memory.high
> > knobs can be easily used by a single-threaded load manager implemented
> > in userspace, because it doesn't need to care about potential stalls
> > caused by writes to these files. After this change it might happen that
> > a write to memory.high would take long, seconds perhaps, so in order to
> > react quickly to changes in other cgroups, a load manager would have to
> > spawn a thread per each write to memory.high, which would complicate its
> > implementation significantly.
> 
> While I do expect memory.high to be adjusted every once in a while, I
> can't see anybody doing it by a significant fraction of the cgroup
> every couple of seconds - or tighter than the workingset; and dropping
> use-once cache is cheap. What kind of usecase would that be?

I agree that a load manager won't need to adjust memory.high by a
significant amount often, but there can be a lot of containers running,
so even if it takes 10 ms to adjust memory.high for one container, it
will take up to a second for 100 containers. I expect that a load
manager implementation will just blindly spawn a thread for each
memory.high update to be sure it won't be stalled for too long.

> 
> But even if we're wrong about it and this becomes a scalability issue,
> the knob - even when reclaiming synchroneously - makes no guarantees
> about the target being met once the write finishes. It's a best effort
> mechanism. What would break if we made it async later on?

You're right of course - we wouldn't be able to change async to sync,
though we could go the other way round. However, I'm afraid that by
making it sync from the very beginning we effectively force userspace
applications that need to update memory.{low,high} often to use
multi-threading. Not sure if it's that bad though.

Thanks,
Vladimir

