Date: Mon, 6 Jan 2020 11:17:13 +1100
From: Dave Chinner
To: Yafang Shao
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel@vger.kernel.org, Dave Chinner
Subject: Re: [PATCH v2 4/5] mm: make memcg visible to lru walker isolation function
Message-ID: <20200106001713.GH23195@dread.disaster.area>
References: <1577174006-13025-1-git-send-email-laoar.shao@gmail.com>
	<1577174006-13025-5-git-send-email-laoar.shao@gmail.com>
	<20200104033558.GD23195@dread.disaster.area>
	<20200104212331.GG23195@dread.disaster.area>

On Sun, Jan 05, 2020 at 09:43:11AM +0800, Yafang Shao wrote:
> On Sun, Jan 5, 2020 at 5:23 AM Dave Chinner wrote:
> > On Sat, Jan 04, 2020 at 03:26:13PM +0800, Yafang Shao wrote:
> > > On Sat, Jan 4, 2020 at 11:36 AM Dave Chinner wrote:
> > > > On Tue, Dec 24, 2019 at 02:53:25AM -0500, Yafang Shao wrote:
> > > > > The lru walker isolation function may use this memcg to do something,
> > > > > e.g. the inode isolation function will use the memcg to do inode
> > > > > protection in a followup patch. So make the memcg visible to the lru
> > > > > walker isolation function.
> > > > >
> > > > > Something that should be emphasized in this patch is that it replaces
> > > > > for_each_memcg_cache_index() with for_each_mem_cgroup() in
> > > > > list_lru_walk_node(). There is a gap between these two macros:
> > > > > for_each_mem_cgroup() depends on CONFIG_MEMCG while the other one
> > > > > depends on CONFIG_MEMCG_KMEM. But as list_lru_memcg_aware() returns
> > > > > false if CONFIG_MEMCG_KMEM is not configured, this replacement is
> > > > > safe.
> > > > >
> > > > > Cc: Dave Chinner
> > > > > Signed-off-by: Yafang Shao
> > > >
> > > > ....
> > > > > @@ -299,17 +299,15 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
> > > > >  				 list_lru_walk_cb isolate, void *cb_arg,
> > > > >  				 unsigned long *nr_to_walk)
> > > > >  {
> > > > > +	struct mem_cgroup *memcg;
> > > > >  	long isolated = 0;
> > > > > -	int memcg_idx;
> > > > >
> > > > > -	isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> > > > > -				      nr_to_walk);
> > > > > -	if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) {
> > > > > -		for_each_memcg_cache_index(memcg_idx) {
> > > > > +	if (list_lru_memcg_aware(lru)) {
> > > > > +		for_each_mem_cgroup(memcg) {
> > > > >  			struct list_lru_node *nlru = &lru->node[nid];
> > > > >
> > > > >  			spin_lock(&nlru->lock);
> > > > > -			isolated += __list_lru_walk_one(nlru, memcg_idx,
> > > > > +			isolated += __list_lru_walk_one(nlru, memcg,
> > > > >  							isolate, cb_arg,
> > > > >  							nr_to_walk);
> > > > >  			spin_unlock(&nlru->lock);
> > > > > @@ -317,7 +315,11 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
> > > > >  			if (*nr_to_walk <= 0)
> > > > >  				break;
> > > > >  		}
> > > > > +	} else {
> > > > > +		isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> > > > > +					      nr_to_walk);
> > > > >  	}
> > > > > +
> > > >
> > > > That's a change of behaviour. The old code always runs per-node
> > > > reclaim, then if the LRU is memcg aware it also runs the memcg
> > > > aware reclaim. The new code never runs global per-node reclaim
> > > > if the list is memcg aware, so shrinkers that are initialised
> > > > with the flags SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE seem
> > > > likely to have reclaim problems with mixed memcg/global memory
> > > > pressure scenarios.
> > > >
> > > > e.g. if all the memory is in the per-node lists, and the memcg needs
> > > > to reclaim memory because of a global shortage, it is now unable to
> > > > reclaim global memory.....
> > >
> > > Hi Dave,
> > >
> > > Thanks for your detailed explanation.
> > > But I have a different understanding.
> > > The difference between for_each_mem_cgroup(memcg) and
> > > for_each_memcg_cache_index(memcg_idx) is that
> > > for_each_mem_cgroup() includes the root_mem_cgroup while
> > > for_each_memcg_cache_index() excludes the root_mem_cgroup because
> > > its memcg_idx is -1.
> >
> > Except that the "root" memcg that for_each_mem_cgroup() starts from
> > is not the "global root" memcg - it is whatever memcg is passed down
> > in the shrink_control, wherever that sits in the cgroup tree
> > hierarchy. do_shrink_slab() only ever passes down the global root
> > memcg to the shrinkers when the global root memcg is passed to
> > shrink_slab(), and that does not iterate the memcg hierarchy - it
> > just wants to reclaim from global caches and non-memcg aware
> > shrinkers.
> >
> > IOWs, there are multiple changes in behaviour here - memcg specific
> > reclaim won't do global reclaim, and global reclaim will now iterate
> > all memcgs instead of just the global root memcg.
> >
> > > So it can reclaim global memory even if the list is memcg aware.
> > > Is that right ?
> >
> > If the memcg passed to this function is the root memcg, then yes,
> > it will behave as you suggest. But for the majority of memcg-context
> > reclaim, the memcg is not the root memcg and so they will not do
> > global reclaim anymore...
>
> Thanks for your reply.
> But I have to clarify that this change is in list_lru_walk_node(), and
> the memcg is not passed to this function from shrink_control.
> In order to make it more clear, I paste the function here.
>
> - The new function
> unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
> 				 list_lru_walk_cb isolate, void *cb_arg,
> 				 unsigned long *nr_to_walk)
> {
> 	struct mem_cgroup *memcg;	<<<< A local variable
> 	long isolated = 0;
>
> 	if (list_lru_memcg_aware(lru)) {
> 		for_each_mem_cgroup(memcg) {	<<<< scan all memcgs,
> 						     including root_mem_cgroup

Oh, wait, this is list_lru_walk_node(), which is not even called by
the shrinker code - this is the "iterate every object" code path. So
this isn't even part of the code path this patchset needs the memcg
for.

Gawd, what a mess this memcg stuff is when you look at it from the
outside.

After digging, I find that the root memcg has a kmemcg_id of -1.
There is special case code everywhere that checks the id for >= 0,
open codes memcg == root_mem_cgroup checks, calls
mem_cgroup_is_root(), etc. to avoid doing things like using the root
memcg kmemcg_id as an array index. And in the case of the list_lru,
this means it doesn't use the memcg lru indexes at all (i.e.
lru->node[nid].memcg_lrus[memcg_idx]), but quietly returns the
global list at lru->node[nid].lru when kmemcg_id < 0.

So, after a couple of hours of wading through the code, I finally
remember all the details of the existing code and understand how
this new code works - it relies entirely on the special casing of
the root memcg that makes it behave like memcgs aren't configured at
all. i.e. we select the global root lists when the root memcg is
passed around rather than using a memcg specific LRU list.

The old code used for_each_memcg_cache_index(), which never scanned
the root memcg:

#define for_each_memcg_cache_index(_idx)	\
	for ((_idx) = 0; (_idx) < memcg_nr_cache_ids; (_idx)++)

because it starts at idx = 0, and the root memcg sits at -1. Hence
the old code always needed to scan the root memcg (the global lists)
manually itself, which made the code really nice and clear because
it was the same for both non-memcg and memcg aware lists. i.e.
the global scan was not hidden away inside the memcg scanning.

IOWs, I find the existing code much easier for seeing the difference
between global and per-memcg iteration. IF we are keeping this "root
memcg context is special" architecture (more on that below), I'd
prefer that the code keeps that distinction, because having to
understand how the root memcg is special cased just to work out how
this code iterates the global caches is a waste of time and brain
power. Not everyone really wants to know about memcg internals....

We can clean up the code a lot by getting rid of the unnecessary
indenting by doing this:

	/* iterate the global lru first */
	isolated = list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
				     nr_to_walk);
	if (!list_lru_memcg_aware(lru))
		return isolated;

	nlru = &lru->node[nid];
	for_each_mem_cgroup(memcg) {
		/* already scanned the root memcg above */
		if (is_root_memcg(memcg))
			continue;

		if (*nr_to_walk <= 0)
			break;

		spin_lock(&nlru->lock);
		isolated += __list_lru_walk_one(nlru, memcg,
						isolate, cb_arg,
						nr_to_walk);
		spin_unlock(&nlru->lock);
	}
	return isolated;

And so to architecture...

This all makes me wonder why we still special case the memcg LRU
lists here. Ever since we went to per-node memcg lru lists (~2015,
iirc), there's been this special hidden hack for the root memcg to
use the global list, and it's one I have to read lots of code to
remind myself exists every time I have to dig into this code again.

I mean, if the list is not memcg aware, then it only needs a single
list per node - the root memcg list.
That could be easily stored in the memcg_lrus array for the node
rather than in a separate global list, and then the node iteration
code all starts to look like this:

	nlru = &lru->node[nid];
	for_each_mem_cgroup(memcg) {
		spin_lock(&nlru->lock);
		isolated += __list_lru_walk_one(nlru, memcg,
						isolate, cb_arg,
						nr_to_walk);
		spin_unlock(&nlru->lock);

		if (*nr_to_walk <= 0)
			break;

		/* non-memcg aware lists only have a root memcg list */
		if (!list_lru_memcg_aware(lru))
			break;
	}

Hence for walks in the !list_lru_memcg_aware(lru) case, the
list_lru_from_memcg() call in __list_lru_walk_one() always returns
just the root list. This makes everything use the same iteration
interface, and when memcg is configured out we simply code the
iterator to run once and list_lru_from_memcg() always returns the
one list...

i.e. the whole reason for having separate global and per-memcg LRU
lists went away back in 2015 when it was decided that we don't care
about the (potentially massive) memory usage of per-node memcg LRU
lists. However, the special casing of the root memcg was left in
place, and what your patch says to me is that we should really be
trying to make this special casing go away.

Moving everything in the list_lru to the memcg_lrus also means that
the global shrinker lists get the "objects in the lru to scan for
reclaim" shrinker bitmaps that optimise the high level shrinker
code. It means we could get rid of the separate global vs memcg
shrinker paths in vmscan.c - we just pass the root memcg into
shrink_slab_memcg() when we want to do global shrinker reclaim...

IMO, if memcgs are fundamental to the operation of systems these
days, such that the root memcg is treated no differently to all
other memcgs, then let's architect the whole stack that way, not
make the code hard to read by having little bits of code hide the
special casing of the root memcg layers deep where it's really hard
to find....
And, FWIW, I noticed that the list_lru memcg code assumes we only
ever put objects from memcg associated slab pages on the list_lru.
It calls memcg_from_slab_page(), which makes no attempt to verify
the page is actually a slab page. That's a landmine just waiting to
go boom - list_lru can store any type of object the user wants, not
just slab objects. e.g. the binder code stores pages on its
list_lru, not slab objects, and the only reason it doesn't go boom
is that its lru list is not configured to be memcg aware....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com