linux-kernel.vger.kernel.org archive mirror
From: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>, Greg Thelen <gthelen@google.com>,
	Hugh Dickins <hughd@google.com>,
	Motohiro Kosaki <Motohiro.Kosaki@us.fujitsu.com>,
	Glauber Costa <glommer@gmail.com>, Tejun Heo <tj@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Pavel Emelianov <xemul@parallels.com>,
	Konstantin Khorenko <khorenko@parallels.com>,
	LKML-MM <linux-mm@kvack.org>,
	LKML-cgroups <cgroups@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC] memory cgroup: my thoughts on memsw
Date: Thu, 11 Sep 2014 17:53:56 +0900	[thread overview]
Message-ID: <54116324.7000200@jp.fujitsu.com> (raw)
In-Reply-To: <20140911082311.GB4151@esperanza>

(2014/09/11 17:23), Vladimir Davydov wrote:
> On Thu, Sep 11, 2014 at 11:04:41AM +0900, Kamezawa Hiroyuki wrote:
>> (2014/09/09 19:39), Vladimir Davydov wrote:
>>
>>>> For your purpose, you need to implement your method in a system-wide way.
>>>> It seems crazy to set a per-cgroup anon limit for avoiding a system-wide OOM.
>>>> You'll need the help of system-wide cgroup-configuration middleware even if
>>>> you have a method in a cgroup. If you say the logic should be in the OS kernel,
>>>> please implement it as system-wide logic rather than in a cgroup.
>>>
>>> What if on global pressure a memory cgroup exceeding its soft limit is
>>> being reclaimed, but not fast enough, because it has a lot of anon
>>> memory? The global OOM won't be triggered then, because there's still
>>> progress, but the system will experience hard pressure due to the
>>> reclaimer runs. How can we detect if we should kill the container or
>>> not? It smells like one more heuristic to vmscan, IMO.
>>
>>
>> That's what you are trying to implement with the per-cgroup anon+swap limit; the
>> difference is heuristics applied by the system designer at container creation versus
>> heuristics applied by the kernel dynamically.
>
> anon+swap limit isn't a heuristic, it's a configuration!
>
> The difference is that the user usually knows *minimal* requirements of
> the app he's going to run in a container/VM. Based on that, he buys a
> container/VM with some predefined amount of RAM. From the whole system
> POV it's suboptimal to set the hard limit for the container by the user
> configuration, because there might be free memory, which could be used
> for file caches and hence lower disk load. If we had anon+swap hard
> limit, we could use it in conjunction with the soft limit instead of the
> hard limit. That would be more efficient than VM-like sand-boxing though
> still safe.
>
> When I'm talking about in-kernel heuristics, I mean a pile of
> hard-to-read functions with a bunch of obscure constants. This is much
> worse than providing the user with a convenient and flexible interface.
>
>> I said it should be done by a system/cloud container scheduler based on notifications.
>
> Basically, it's unsafe to hand this out to userspace completely. The
> system would be prone to DOS attacks from inside containers then.
>
>> But okay, let me think of kernel help in global reclaim.
>>
>>   - Assume "priority" is a value calculated as "usage - soft limit".
>>
>>   - weighted kswapd/direct reclaim
>>     => Based on the priority of each thread/cgroup, increase the "wait" in direct
>>        reclaim when memory is contended.
>>        A low-priority container will sleep longer until the memory contention is fixed.
>>
>>   - weighted anon allocation
>>     Similar to the above: if memory is contended, page-fault speed should be weighted
>>     based on priority (soft limit).
>>
>>   - off-cpu direct reclaim
>>     Run direct reclaim in a workqueue with a cpu mask. The cpu mask is a global setting
>>     per NUMA node, which determines the cpus available for reclaiming memory.
>>     "How to wait" may affect system performance, but this can allow the masked cpus
>>     to be used for more important jobs.
>
> That's what I call a bunch of heuristics. And actually I don't see how
> it'd help us against latency spikes caused by reclaimer runs, seems the
> set is still incomplete :-/
>
> For example, there are two cgroups, one having a huge soft limit excess
> and full of anon memory and another not exceeding its soft limit but
> using primarily clean file caches. This prioritizing/weighting stuff
> would result in shrinking the first group first on global pressure,
> though it's way slower than shrinking the second one.

The current implementation just round-robins over all memcgs under the tree.
With a re-designed soft limit this behavior will change, and you can change it.


> That means a latency spike in other containers.

Why? You said the other container contains just file caches.
A latency spike merely because file caches are dropped?
If the service is that sensitive, please use the hard limit.

Hmm.
How about raising kswapd's scheduling threshold in some situations?
A per-memcg kswapd that helps with the soft limit may work.

> The heuristics you proposed above
> will only make it non-critical - the system will get over sooner or
> later.

My idea is always based on there being a container manager on the system,
which can make sufficiently clever decisions based on a policy the admin specified.
IIUC, reducing the CPU hogging caused by memory pressure is always helpful.

> However, it's still a kind of DOS, which anon+swap hard limit would prevent.

By invoking the OOM killer.


> On Tue, Sep 09, 2014 at 02:39:43PM +0400, Vladimir Davydov wrote:
>> However, there's one thing, which made me start this discussion, and it
>> still bothers me. It's about memsw.limit_in_bytes knob itself.
>>
>> First, its value must be greater than or equal to memory.limit_in_bytes.
>> IMO, such a dependency in the user interface isn't great, but it isn't
>> the worst thing. What is worse, it only makes sense to set it to
>> infinity if one wants to fully make use of soft limits, as I pointed
>> out earlier.
>>
>> So, we have a userspace knob that suits only for strict sand-boxing when
>> one wants to hard-limit the amount of memory and swap an app can use.
>> When it comes to soft limits, you have to set it to infinity, and it'll
>> still be accounted at the cost of performance, but without any purpose.
>> It just seems meaningless to me.
>>
>> Not to mention that the knob itself is kind of confusing IMO. memsw
>> means memory+swap, so one would mistakenly think memsw.limit-mem.limit
>> is the limit on swap usage, but that's wrong.
>>
>> My point is that anon+swap accounting instead of the current
>> anon+file+swap memsw implementation would be more flexible. We could
>> still sandbox apps by setting hard anon+swap and memory limits, but it
>> would also be possible to make use of it in "soft" environments. It
>> wouldn't be mandatory though. If one doesn't like OOM, he can use
>> threshold notifications to restart the container when it starts to
>> behave badly. But if the user just doesn't want to bother about
>> configuration or is OK with OOM-killer, he could set hard anon+swap
>> limit. Besides, it would untie mem.limit knob from memsw.limit, which
>> would make the user interface simpler and cleaner.
>>
>> So, I think anon+swap limit would be more flexible than file+anon+swap
>> limit we have now. Is there any use case where anon+swap and anon+file
>> accounting couldn't satisfy the user requirements while the
>> anon+file+swap and anon+file pair could?
>
> I would appreciate if anybody could answer this.
>

I can't understand why you want to use the OOM killer for resource control.

Thanks,
-Kame



Thread overview: 19+ messages
2014-09-04 14:30 [RFC] memory cgroup: my thoughts on memsw Vladimir Davydov
2014-09-04 22:03 ` Kamezawa Hiroyuki
2014-09-05  8:28   ` Vladimir Davydov
2014-09-05 14:20     ` Kamezawa Hiroyuki
2014-09-05 16:00       ` Vladimir Davydov
2014-09-05 23:15         ` Kamezawa Hiroyuki
2014-09-08 11:01           ` Vladimir Davydov
2014-09-08 13:53             ` Kamezawa Hiroyuki
2014-09-09 10:39               ` Vladimir Davydov
2014-09-11  2:04                 ` Kamezawa Hiroyuki
2014-09-11  8:23                   ` Vladimir Davydov
2014-09-11  8:53                     ` Kamezawa Hiroyuki [this message]
2014-09-11  9:50                       ` Vladimir Davydov
2014-09-10 12:01               ` Vladimir Davydov
2014-09-11  1:22                 ` Kamezawa Hiroyuki
2014-09-11  7:03                   ` Vladimir Davydov
2014-09-15 19:14 ` Johannes Weiner
2014-09-16  1:34   ` Kamezawa Hiroyuki
2014-09-17 15:59   ` Vladimir Davydov
