Date: Thu, 4 Sep 2014 18:30:55 +0400
From: Vladimir Davydov
To: Johannes Weiner, Michal Hocko
CC: Greg Thelen, Hugh Dickins, Kamezawa Hiroyuki, Motohiro Kosaki,
    Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov,
    Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML
Subject: [RFC] memory cgroup: my thoughts on memsw
Message-ID: <20140904143055.GA20099@esperanza>

Hi,

Over its long history the memory cgroup has been developed rapidly, but
in a rather disordered manner. As a result, today we have a bunch of
features that are practically unusable and want a redesign (soft limits)
or are not working at all (kmem accounting), not to mention the messy
user interface we have (the _in_bytes suffix is driving me mad :-).

Fortunately, thanks to Tejun's unified cgroup hierarchy, we have a great
chance to drop or redesign some of the old features and their
interfaces. We should use this opportunity to examine every aspect of
the memory cgroup design, because we will probably not be granted such a
present in the future.

That's why I'm starting a series of RFCs with *my thoughts* not only on
kmem accounting, which I've been trying to fix for a while, but also on
other parts of the memory cgroup. I'll be happy if anybody reads this to
the end, but please don't kick me too hard if something looks stupid to
you :-)

Today's topic is (surprisingly!)
the memsw resource counter and where it fails to satisfy user requests.

Let's start from the very beginning. The memory cgroup has basically two
resource counters (not counting kmem, which is unusable anyway):
mem_cgroup->res (configured by memory.limit), which counts the total
amount of user pages charged to the cgroup, and mem_cgroup->memsw
(memory.memsw.limit), which is basically res + the cgroup's swap usage.
Obviously, memsw always has both the value and limit no less than the
value and limit of res. That said, we have three options:

 - memory.limit=inf, memory.memsw.limit=inf
   No limits, only accounting.

 - memory.limit=L [...]

[...] res accounting and limiting total user memory (cache+anon) usage
for processes inside cgroups. This is where there's nothing to do.
However, mem_cgroup->memsw should be reworked to account *only* memory
that may be swapped out plus memory that has been swapped out (i.e. swap
usage).

This way, by setting memsw.limit (or whatever it should be called) less
than the memory soft limit we would solve the problem I described above.
The container would then be allowed to use only file caches above its
memsw.limit, which are usually easily shrinkable, and would get
OOM-killed while trying to eat too much swappable memory.

The configuration will also be less confusing then, IMO:

 - memory.limit - container can't use memory above this
 - memory.memsw.limit - container can't use swappable memory above this

From this it clearly follows that maximal swap usage is limited by
memory.memsw.limit.

One more thought. Anon memory and file caches are different and should
be handled differently, so mixing them both under the same counter looks
strange to me. Moreover, they are *already* handled differently
throughout the kernel - just look at mm/vmscan.c. Here are the
differences between them I see:

 - Anon memory is handled by the user application, while file caches are
   entirely up to the kernel. That means the application will
   *definitely* die w/o anon memory.
   W/o file caches it usually can survive, but the more caches it has
   the better it feels.

 - Anon memory is not that easy to reclaim. Swap out is a really slow
   process, because data are usually read/written w/o any specific
   order. Dropping file caches is much easier. Typically we have lots of
   clean pages there.

 - Swap space is limited. And today, it's OK to have TBs of RAM and only
   several GBs of swap. Customers simply don't want to waste their disk
   space on that.

IMO, these lead us to the need for limiting swap/swappable memory usage,
but not swap+mem usage.

Now, a bad thing about such a change (if it were ever considered).
There's no way to convert old settings to new, i.e. if we currently have

  mem <= L,
  mem + swap <= S,
  L <= S,

we can set

  mem <= L1,
  swappable_mem <= S1,

where either

  L1 = L, S1 = S
or
  L1 = L, S1 = S - L,

but neither configuration will be exactly equivalent to the old one. In
the first case memory+swap usage will be limited by L+S, not by S. In
the second case, although memory+swap