Re: [RFC] memory cgroup: my thoughts on memsw

From: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>, Greg Thelen <gthelen@google.com>,
	Hugh Dickins <hughd@google.com>,
	Motohiro Kosaki <Motohiro.Kosaki@us.fujitsu.com>,
	Glauber Costa <glommer@gmail.com>, Tejun Heo <tj@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Pavel Emelianov <xemul@parallels.com>,
	Konstantin Khorenko <khorenko@parallels.com>,
	LKML-MM <linux-mm@kvack.org>,
	LKML-cgroups <cgroups@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC] memory cgroup: my thoughts on memsw
Date: Mon, 8 Sep 2014 22:53:48 +0900
Message-ID: <540DB4EC.6060100@jp.fujitsu.com>
In-Reply-To: <20140908110131.GA11812@esperanza>

(2014/09/08 20:01), Vladimir Davydov wrote:
> On Sat, Sep 06, 2014 at 08:15:44AM +0900, Kamezawa Hiroyuki wrote:
>> As you noticed, hitting anon+swap limit just means oom-kill.
>> My point is that using oom-killer for "server management" just seems crazy.
>>
>> Let me clarify things. Your proposal was:
>>   1. soft-limit will be a main feature for server management.
>>   2. Because of soft-limit, global memory reclaim runs.
>>   3. Using swap at global memory reclaim can cause poor performance.
>>   4. So, making use of OOM-Killer for avoiding swap.
>>
>> I can't agree with "4". I think
>>
>>   - don't configure swap.
>
> Suppose there are two containers, each having soft limit set to 50% of
> total system RAM. One of the containers eats 90% of the system RAM by
> allocating anonymous pages. Another starts using file caches and wants
> more than 10% of RAM to work w/o issuing disk reads. So what should we
> do then?
> We won't be able to shrink the first container to its soft
> limit, because there's no swap. Leaving it as is would be unfair from
> the second container's point of view. Kill it? But the whole system is
> going OK, because the working set of the second container is easily
> shrinkable. Besides there may be some progress in shrinking file caches
> from the first container.
>
>>   - use zram
>
> In fact this isn't different from the previous proposal (working w/o
> swap). ZRAM only compresses data while still storing them in RAM so we
> eventually may get into a situation where almost all RAM is full of
> compressed anon pages.
>

In the above 2 cases, "vmpressure" works fine.
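
A minimal userland sketch of how a management daemon might register for
vmpressure events, assuming the cgroup v1 eventfd interface (a line of the
form "<event_fd> <fd of memory.pressure_level> <level>" written to
cgroup.event_control). The helper names and the cgroup directory passed in
are hypothetical, and os.eventfd needs Python 3.10 or newer.

```python
import os

LEVELS = ("low", "medium", "critical")


def event_control_line(event_fd, target_fd, arg):
    """Build the registration line for cgroup.event_control."""
    return f"{event_fd} {target_fd} {arg}"


def watch_pressure(cgroup_dir, level="medium"):
    """Yield the eventfd counter each time the memcg signals pressure.

    cgroup_dir is a hypothetical cgroup v1 memory directory, for example
    one created by the container management software.
    """
    if level not in LEVELS:
        raise ValueError(f"unknown pressure level: {level}")
    efd = os.eventfd(0)
    pfd = os.open(os.path.join(cgroup_dir, "memory.pressure_level"),
                  os.O_RDONLY)
    try:
        with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as ctl:
            ctl.write(event_control_line(efd, pfd, level))
        while True:
            # Blocks until the kernel signals the eventfd.
            yield os.eventfd_read(efd)
    finally:
        os.close(pfd)
        os.close(efd)
```

The loop body is where a user-defined policy (kill, respawn, throttle) would
go; the kernel side only signals the eventfd.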

>>   - use SSD for swap
>
> Such a requirement might be OK in enterprise, but forcing SMB to update
> their hardware to run a piece of software is a no go. And again, SSD
> isn't infinite, we may use it up.
>
ditto.

>> Or
>>   - provide a way to notify usage of "anon+swap" to container management software.
>>
>>     Now we have "vmpressure". Container management software can kill or respawn a container
>>     using a user-defined policy for avoiding swap.
>>
>>     If you don't want to run kswapd at all, threshold notifier enhancement may be required.
>>
>> /proc/meminfo provides total number of ANON/CACHE pages.
>> Many things can be done in userland.
>
> AFAIK OOM-in-userspace-handling has been discussed many times, but
> there's still no agreement upon it. Basically it isn't reliable, because
> it can lead to a deadlock if the userspace handler won't be able to
> allocate memory to proceed or will get stuck in some other way. IMO
> there must be in-kernel OOM-handling as a last resort anyway. And
> actually we already have one - we may kill processes when they hit the
> memsw limit.
>
> But OK, you don't like OOM on hitting anon+swap limit and propose to
> introduce a kind of userspace notification instead, but the problem
> actually isn't *WHAT* we should do on hitting anon+swap limit, but *HOW*
> we should implement it (or should we implement it at all).

I'm not sure whether you're aware of it, but a "hardlimit" counter is too
expensive for your purpose.

If I were you, I'd use some lightweight counter like percpu_counter or
memcg's event handling system.
Have you seen how the threshold notifier or vmpressure works? Both are very lightweight.
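
To illustrate why a batched per-CPU counter is cheap, here is a small
userland model of the idea: each CPU accumulates a local delta and only
takes the shared lock when that delta reaches a batch size. This is a sketch
for illustration, not kernel code; the class name, the `folds` statistic and
the batch values are made up.

```python
import threading


class BatchedCounter:
    """Userland model of a batched per-CPU counter."""

    def __init__(self, nr_cpus, batch=32):
        self.count = 0              # shared, approximate total
        self.local = [0] * nr_cpus  # per-CPU deltas not yet folded in
        self.batch = batch
        self.lock = threading.Lock()
        self.folds = 0              # how often the shared lock was taken

    def add(self, cpu, amount):
        delta = self.local[cpu] + amount
        if abs(delta) >= self.batch:
            # Slow path: fold the whole local delta into the shared count.
            with self.lock:
                self.count += delta
                self.folds += 1
            self.local[cpu] = 0
        else:
            # Fast path: touch only this CPU's slot, no lock.
            self.local[cpu] = delta

    def read(self):
        """Cheap, possibly stale value."""
        return self.count

    def sum(self):
        """Exact value: shared count plus all unfolded local deltas."""
        with self.lock:
            return self.count + sum(self.local)
```

The trade-off is that read() can lag behind the true value by up to
batch * nr_cpus, which is fine for a notification threshold but not for
enforcing an exact hard limit.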

> No matter which way we go, in-kernel OOM or userland notifications, we have to
> *INTRODUCE ANON+SWAP ACCOUNTING* to achieve that so that on breaching a
> predefined threshold we could invoke OOM or issue a userland
> notification or both. And here goes the problem: there's anon+file and
> anon+file+swap resource counters, but no anon+swap counter. To react on
> anon+swap limit breaching, we must introduce one. I propose to *REUSE*
> memsw instead by slightly modifying its meaning.
>
You can see "anon+swap" via memcg's accounting.
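
For example, a userland tool could derive an approximate anon+swap figure
from the text of a cgroup v1 memory.stat file. The sketch below assumes the
"rss" and "swap" fields (and their "total_" hierarchical variants) are
present, which requires swap accounting to be enabled; the function names
are hypothetical.

```python
def parse_memcg_stat(text):
    """Parse "name value" lines from memory.stat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def anon_plus_swap(stats, hierarchical=False):
    """Return rss + swap in bytes; missing fields count as 0."""
    prefix = "total_" if hierarchical else ""
    return stats.get(prefix + "rss", 0) + stats.get(prefix + "swap", 0)
```

Feeding it the contents of memory.stat read from a container's cgroup
directory gives a number that management software can compare against its
own policy threshold. It is only an approximation, since "rss" also covers
swap cache pages.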

> What we would get then is the ability to react on potentially
> unreclaimable memory growth inside a container. What we would lose is
> the current implementation of memory+swap limit, *BUT* we would still be
> able to limit memory+swap usage by imposing limits on total memory and
> anon+swap usage.
>

As I keep saying, an anon+swap "hardlimit" just means OOM. I don't buy that.

>> And your idea can't help with swap-out caused by memory pressure that comes from "zones".
>
> It would help limit swap-out to a sane value.
>
>
> I'm sorry if I'm not clear or don't understand something that looks
> trivial to you.
>

It seems your purpose is to avoid a system-wide OOM situation. Right?

Implementing system-wide OOM-kill-avoidance logic in memcg doesn't
sound good to me. It should work under system-wide memory management logic.
If memcg can help with that, that's fine.

For your purpose, you need to implement your method in a system-wide way.
It seems crazy to set a per-cgroup anon limit for avoiding system-wide OOM.
You'll need the help of system-wide cgroup-configuration middleware even if
you have a method in a cgroup. If you say the logic should be in the OS kernel,
please implement it as system-wide logic rather than in a cgroup.

I think it's okay to add helper functionality in memcg if there is
system-wide OOM-avoidance logic.

Thanks,
-Kame