From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757388AbaIPBg0 (ORCPT ); Mon, 15 Sep 2014 21:36:26 -0400
Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:34807 "EHLO
	fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754752AbaIPBgX (ORCPT ); Mon, 15 Sep 2014 21:36:23 -0400
X-SecurityPolicyCheck: OK by SHieldMailChecker v2.2.3
X-SHieldMailCheckerPolicyVersion: FJ-ISEC-20140219-2
Message-ID: <541793BF.7070106@jp.fujitsu.com>
Date: Tue, 16 Sep 2014 10:34:55 +0900
From: Kamezawa Hiroyuki
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
MIME-Version: 1.0
To: Johannes Weiner , Vladimir Davydov
CC: Michal Hocko , Greg Thelen , Hugh Dickins , Motohiro Kosaki ,
	Glauber Costa , Tejun Heo , Andrew Morton , Pavel Emelianov ,
	Konstantin Khorenko , LKML-MM , LKML-cgroups , LKML
Subject: Re: [RFC] memory cgroup: my thoughts on memsw
References: <20140904143055.GA20099@esperanza> <20140915191435.GA8950@cmpxchg.org>
In-Reply-To: <20140915191435.GA8950@cmpxchg.org>
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
X-SecurityPolicyCheck-GC: OK by FENCE-Mail
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

(2014/09/16 4:14), Johannes Weiner wrote:
> Hi Vladimir,
>
> On Thu, Sep 04, 2014 at 06:30:55PM +0400, Vladimir Davydov wrote:
>> To sum it up, the current mem + memsw configuration scheme doesn't allow
>> us to limit swap usage if we want to partition the system dynamically
>> using soft limits. Actually, it also looks rather confusing to me. We
>> have mem limit and mem+swap limit. I bet that from the first glance, an
>> average admin will think it's possible to limit swap usage by setting
>> the limits so that the difference between memory.memsw.limit and
>> memory.limit equals the maximal swap usage, but (surprise!) it isn't
>> really so.
>> It holds if there's no global memory pressure, but otherwise
>> swap usage is only limited by memory.memsw.limit! IMHO, it isn't
>> something obvious.
>
> Agreed, memory+swap accounting & limiting is broken.
>
>> - Anon memory is handled by the user application, while file caches are
>>   all on the kernel. That means the application will *definitely* die
>>   w/o anon memory. W/o file caches it usually can survive, but the more
>>   caches it has the better it feels.
>>
>> - Anon memory is not that easy to reclaim. Swap out is a really slow
>>   process, because data are usually read/written w/o any specific
>>   order. Dropping file caches is much easier. Typically we have lots of
>>   clean pages there.
>>
>> - Swap space is limited. And today, it's OK to have TBs of RAM and only
>>   several GBs of swap. Customers simply don't want to waste their disk
>>   space on that.
>
>> Finally, my understanding (may be crazy!) how the things should be
>> configured. Just like now, there should be mem_cgroup->res accounting
>> and limiting total user memory (cache+anon) usage for processes inside
>> cgroups. This is where there's nothing to do. However, mem_cgroup->memsw
>> should be reworked to account *only* memory that may be swapped out plus
>> memory that has been swapped out (i.e. swap usage).
>
> But anon pages are not a resource, they are a swap space liability.
> Think of virtual memory vs. physical pages - the use of one does not
> necessarily result in the use of the other. Without memory pressure,
> anonymous pages do not consume swap space.
>
> What we *should* be accounting and limiting here is the actual finite
> resource: swap space. Whenever we try to swap a page, its owner
> should be charged for the swap space - or the swapout be rejected.
>
> For hard limit reclaim, the semantics of a swap space limit would be
> fairly obvious, because it's clear who the offender is.
>
> However, in an overcommitted machine, the amount of swap space used by
> a particular group depends just as much on the behavior of the other
> groups in the system, so the per-group swap limit should be enforced
> even during global reclaim to feed back pressure on whoever is causing
> the swapout. If reclaim fails, the global OOM killer triggers, which
> should then off the group with the biggest soft limit excess.
>
> As far as implementation goes, it should be doable to try-charge from
> add_to_swap() and keep the uncharging in swap_entry_free().
>
> We'll also have to extend the global OOM killer to be memcg-aware, but
> we've been meaning to do that anyway.
>

When we introduced the memsw limitation, we tried to avoid affecting global
memory reclaim. That is why we went with a memory+swap limitation.

Now, global memory reclaim is memcg-aware.
So, I think a swap limitation rather than anon+swap may be a choice.
The change will reduce res_counter access.

Hmm, it will be desirable to move anon pages to Unevictable if the memcg's
swap slot is 0.

Anyway, I think softlimit should be re-implemented first.
That will be the starting point.

Thanks,
-Kame