From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755466AbaIDWFs (ORCPT <rfc822;w@1wt.eu>);
	Thu, 4 Sep 2014 18:05:48 -0400
Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:39739 "EHLO
	fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754404AbaIDWFq (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Thu, 4 Sep 2014 18:05:46 -0400
X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4
Message-ID: <5408E1CD.3090004@jp.fujitsu.com>
Date: Fri, 05 Sep 2014 07:03:57 +0900
From: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
MIME-Version: 1.0
To: Vladimir Davydov <vdavydov@parallels.com>,
        Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@suse.cz>
CC: Greg Thelen <gthelen@google.com>, Hugh Dickins <hughd@google.com>,
        Motohiro Kosaki <Motohiro.Kosaki@us.fujitsu.com>,
        Glauber Costa <glommer@gmail.com>, Tejun Heo <tj@kernel.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Pavel Emelianov <xemul@parallels.com>,
        Konstantin Khorenko <khorenko@parallels.com>,
        LKML-MM <linux-mm@kvack.org>, LKML-cgroups <cgroups@vger.kernel.org>,
        LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC] memory cgroup: my thoughts on memsw
References: <20140904143055.GA20099@esperanza>
In-Reply-To: <20140904143055.GA20099@esperanza>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

(2014/09/04 23:30), Vladimir Davydov wrote:
> Hi,
>
> Over its long history the memory cgroup has been developed rapidly, but
> rather in a disordered manner. As a result, today we have a bunch of
> features that are practically unusable and wants redesign (soft limits)
> or even not working (kmem accounting), not talking about the messy user
> interface we have (the _in_bytes suffix is driving me mad :-).
>
> Fortunately, thanks to Tejun's unified cgroup hierarchy, we have a great
> chance to drop or redesign some of the old features and their
> interfaces. We should use this opportunity to examine every aspect of
> the memory cgroup design, because we will probably not be granted such a
> present in future.
>
> That's why I'm starting a series of RFC's with *my thoughts* not only on
> kmem accounting, which I've been trying to fix for a while, but also on
> other parts of the memory cgroup. I'll be happy if anybody reads this to
> the end, but please don't kick me too hard if something will look stupid
> to you :-)
>
>
> Today's topic is (surprisingly!) the memsw resource counter and where it
> fails to satisfy user requests.
>
> Let's start from the very beginning. The memory cgroup has basically two
> resource counters (not counting kmem, which is unusable anyway):
> mem_cgroup->res (configured by memory.limit), which counts the total
> amount of user pages charged to the cgroup, and mem_cgroup->memsw
> (memory.memsw.limit), which is basically res + the cgroup's swap usage.
> Obviously, memsw always has both the value and limit less than the value
> and limit of res. That said, we have three options:
>
>   - memory.limit=inf, memory.memsw.limit=inf
>     No limits, only accounting.
>
>   - memory.limit=L<inf, memory.memsw.limit=inf
>     Not allowed to use more than L bytes of user pages, but use as much
>     swap as you want.
>
>   - memory.limit=L<inf, memory.memsw.limit=S<inf, L<=S
>     Not allowed to use more than L bytes of user memory. Swap *plus*
>     memory usage is limited by S.
>
> When it comes to *hard* limits everything looks fine, but hard limits
> are not efficient for partitioning a large system among lots of
> containers, because it's hard to predict the right value for the limit,
> besides many workloads will do better when they are granted more file
> caches. There we need a kind of soft limit that is only used on global
> memory pressure to shrink containers exceeding it.
>
>
> Obviously the soft limit must be less than memory.limit and therefore
> memory.memsw.limit. And here comes a problem. Suppose admin sets a
> relatively high memsw.limit (say half of RAM) and a low soft limit for a
> container hoping it will use it for file caches when there's free
> memory, but when hard times come it will be shrunk back to the soft
> limit quickly. Suppose the container, instead of using the granted
> memory for caches, creates a lot of anonymous data filling up to its
> memsw limit (i.e. half of RAM). Then, when admin starts other
> containers, he might find out that they are effectively using only half
> of RAM. Why can this happen? See below.
>
> For example, if there's no or a little swap. It's pretty common for
> customers not to bother about creating TBs of swap to back TBs of RAM
> they have. One might propose to issue OOM if we can't reclaim anything
> from a container exceeding its soft limit. OK, let it be so, although
> it's still not agreed upon AFAIK.
>
> Another case. There's plenty of swap space out there so that we can swap
> out the guilty container completely. However, it will take us some
> reasonable amount of time especially if the container isn't standing
> still, but keeps touching its data. If other containers are mostly using
> file caches, they will experience heavy pressure for a long time, not
> saying about the slowdown caused by high disk usage. Unfair. One might
> object that we can set a limit on IO operations for the culprit (more
> limits and dependencies among them, I doubt admins will be happy!). This
> will slow it down and guarantee it won't be swapping back in pages that
> are being swapped out due to high memory pressure. However, disks have
> limited speed. That means, it doesn't solve the problem with unfair
> slowdown of other containers. What is worse, if we impose IO limit we
> will slow down swap out by ourselves! Because we shouldn't ignore IO
> limit for swap out, otherwise the system will be prune to DOS attacks
> targeted on disk from inside containers, which is what IO limit (as well
> as any other limit) is to protect against.
>
> Or perhaps, I'm missing something and malicious behaviour isn't
> considered when developing cgroups?!
>
>
> To sum it up, the current mem + memsw configuration scheme doesn't allow
> us to limit swap usage if we want to partition the system dynamically
> using soft limits. Actually, it also looks rather confusing to me. We
> have mem limit and mem+swap limit. I bet that from the first glance, an
> average admin will think it's possible to limit swap usage by setting
> the limits so that the difference between memory.memsw.limit and
> memory.limit equals the maximal swap usage, but (surprise!) it isn't
> really so. It holds if there's no global memory pressure, but otherwise
> swap usage is only limited by memory.memsw.limit! IMHO, it isn't
> something obvious.
>
>
> Finally, my understanding (may be crazy!) how the things should be
> configured. Just like now, there should be mem_cgroup->res accounting
> and limiting total user memory (cache+anon) usage for processes inside
> cgroups. This is where there's nothing to do. However, mem_cgroup->memsw
> should be reworked to account *only* memory that may be swapped out plus
> memory that has been swapped out (i.e. swap usage).
>
> This way, by setting memsw.limit (or how it should be called) less than
> memory soft limit we would solve the problem I described above. The
> container would be then allowed to use only file caches above its
> memsw.limit, which are usually easily shrinkable, and get OOM-kill while
> trying to eat too much swappable memory.
>
> The configuration will also be less confusing then IMO:
>
>   - memory.limit - container can't use memory above this
>   - memory.memsw.limit - container can't use swappable memory above this
>
>  From this it clearly follows maximal swap usage is limited by
> memory.memsw.limit.
>
> One more thought. Anon memory and file caches are different and should
> be handled differently, so mixing them both under the same counter looks
> strange to me. Moreover, they are *already* handled differently
> throughout the kernel - just look at mm/vmscan.c. Here are the
> differences between them I see:
>
>   - Anon memory is handled by the user application, while file caches are
>     all on the kernel. That means the application will *definitely* die
>     w/o anon memory. W/o file caches it usually can survive, but the more
>     caches it has the better it feels.
>
>   - Anon memory is not that easy to reclaim. Swap out is a really slow
>     process, because data are usually read/written w/o any specific
>     order. Dropping file caches is much easier. Typically we have lots of
>     clean pages there.
>
>   - Swap space is limited. And today, it's OK to have TBs of RAM and only
>     several GBs of swap. Customers simply don't want to waste their disk
>     space on that.
>
> IMO, these lead us to the need for limiting swap/swappable memory usage,
> but not swap+mem usage.
>
>
> Now, a bad thing about such a change (if it were ever considered).
> There's no way to convert old settings to new, i.e. if we currently have
>
>    mem <= L,
>    mem + swap <= S,
>    L <= S,
>
> we can set
>
>    mem <= L1,
>    swappable_mem <= S1,
>
> where either
>
> L1 = L, S1 = S
>
> or
>
> L1 = L, S1 = S - L,
>
> but both configurations won't be exactly the same. In the first case
> memory+swap usage will be limited by L+S, not by S. In the second case,
> although memory+swap<S, the container won't be able to use more than S-L
> anonymous memory. This is the price we would have to pay if we decided
> to go with this change...
>
>
> Questions, comments, complains, threats?
>

If one hits anon+swap limit, it just means OOM. Hitting limit means process's death.
Is it useful ?

Thanks,
-Kame