From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1754631AbaIOTPW (ORCPT ); Mon, 15 Sep 2014 15:15:22 -0400
Received: from gum.cmpxchg.org ([85.214.110.215]:40707 "EHLO
        gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1754529AbaIOTPU (ORCPT );
        Mon, 15 Sep 2014 15:15:20 -0400
Date: Mon, 15 Sep 2014 15:14:35 -0400
From: Johannes Weiner
To: Vladimir Davydov
Cc: Michal Hocko, Greg Thelen, Hugh Dickins, Kamezawa Hiroyuki,
        Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton,
        Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML
Subject: Re: [RFC] memory cgroup: my thoughts on memsw
Message-ID: <20140915191435.GA8950@cmpxchg.org>
References: <20140904143055.GA20099@esperanza>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140904143055.GA20099@esperanza>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Vladimir,

On Thu, Sep 04, 2014 at 06:30:55PM +0400, Vladimir Davydov wrote:
> To sum it up, the current mem + memsw configuration scheme doesn't allow
> us to limit swap usage if we want to partition the system dynamically
> using soft limits. Actually, it also looks rather confusing to me. We
> have mem limit and mem+swap limit. I bet that from the first glance, an
> average admin will think it's possible to limit swap usage by setting
> the limits so that the difference between memory.memsw.limit and
> memory.limit equals the maximal swap usage, but (surprise!) it isn't
> really so. It holds if there's no global memory pressure, but otherwise
> swap usage is only limited by memory.memsw.limit! IMHO, it isn't
> something obvious.

Agreed, memory+swap accounting & limiting is broken.

> - Anon memory is handled by the user application, while file caches are
>   all on the kernel. That means the application will *definitely* die
>   w/o anon memory. W/o file caches it usually can survive, but the more
>   caches it has the better it feels.
>
> - Anon memory is not that easy to reclaim. Swap out is a really slow
>   process, because data are usually read/written w/o any specific
>   order. Dropping file caches is much easier. Typically we have lots of
>   clean pages there.
>
> - Swap space is limited. And today, it's OK to have TBs of RAM and only
>   several GBs of swap. Customers simply don't want to waste their disk
>   space on that.

> Finally, my understanding (may be crazy!) how the things should be
> configured. Just like now, there should be mem_cgroup->res accounting
> and limiting total user memory (cache+anon) usage for processes inside
> cgroups. This is where there's nothing to do. However, mem_cgroup->memsw
> should be reworked to account *only* memory that may be swapped out plus
> memory that has been swapped out (i.e. swap usage).

But anon pages are not a resource, they are a swap space liability.
Think of virtual memory vs. physical pages - the use of one does not
necessarily result in the use of the other. Without memory pressure,
anonymous pages do not consume swap space.

What we *should* be accounting and limiting here is the actual finite
resource: swap space. Whenever we try to swap a page, its owner
should be charged for the swap space - or the swapout be rejected.

For hard limit reclaim, the semantics of a swap space limit would be
fairly obvious, because it's clear who the offender is.

However, in an overcommitted machine, the amount of swap space used by
a particular group depends just as much on the behavior of the other
groups in the system, so the per-group swap limit should be enforced
even during global reclaim to feed back pressure on whoever is causing
the swapout. If reclaim fails, the global OOM killer triggers, which
should then off the group with the biggest soft limit excess.
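To make the victim selection concrete, here is a rough userspace sketch of "pick the group with the biggest soft limit excess". It is a toy model, not kernel code: struct toy_group, soft_limit_excess() and pick_oom_victim() are names made up for illustration, and it ignores hierarchy entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-group state; NOT the kernel's struct mem_cgroup. */
struct toy_group {
    const char *name;
    unsigned long usage;      /* pages currently charged to the group */
    unsigned long soft_limit; /* soft limit, in pages */
};

/* How far a group is above its soft limit, or 0 if it is within it. */
unsigned long soft_limit_excess(const struct toy_group *g)
{
    return g->usage > g->soft_limit ? g->usage - g->soft_limit : 0;
}

/*
 * Return the group with the largest soft limit excess, or NULL if no
 * group exceeds its soft limit. On a tie the earlier entry wins.
 */
const struct toy_group *pick_oom_victim(const struct toy_group *groups,
                                        size_t n)
{
    const struct toy_group *victim = NULL;
    unsigned long worst = 0;
    size_t i;

    assert(groups != NULL || n == 0);

    for (i = 0; i < n; i++) {
        unsigned long excess = soft_limit_excess(&groups[i]);

        if (excess > worst) {
            worst = excess;
            victim = &groups[i];
        }
    }
    return victim;
}
```

With groups at 100/80, 500/450 and 900/1000 pages (usage/soft limit), the second one has the largest excess and would be picked; a group below its soft limit is never selected.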

As far as implementation goes, it should be doable to try-charge from
add_to_swap() and keep the uncharging in swap_entry_free(). We'll
also have to extend the global OOM killer to be memcg-aware, but we've
been meaning to do that anyway.
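Something along these lines is what I have in mind for the charge/uncharge pairing. This is only a userspace toy to show the semantics, not a patch: struct toy_swap_counter and both helpers are hypothetical, the counter is flat (no hierarchy), and there is no locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-group swap counter; not an existing kernel structure. */
struct toy_swap_counter {
    unsigned long usage; /* swap slots currently charged to the group */
    unsigned long limit; /* maximum swap slots the group may hold */
};

/*
 * Models the charge attempt that would sit in the add_to_swap() path.
 * On success the group owns the slots. On failure the swapout is
 * rejected and the page stays in memory.
 */
bool toy_swap_try_charge(struct toy_swap_counter *c, unsigned long nr)
{
    /* Already over the limit, e.g. because the limit was lowered. */
    if (c->usage > c->limit)
        return false;
    /* Compare against the remaining room to avoid overflow in usage + nr. */
    if (nr > c->limit - c->usage)
        return false;
    c->usage += nr;
    return true;
}

/* Models the uncharge that would sit in the swap_entry_free() path. */
void toy_swap_uncharge(struct toy_swap_counter *c, unsigned long nr)
{
    assert(nr <= c->usage);
    c->usage -= nr;
}
```

The point of the toy is that the limit is checked at swapout time regardless of who triggered reclaim, so a group at its swap limit gets its swapout rejected during global reclaim just as it would during its own hard limit reclaim.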