From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751638AbdJXQWT (ORCPT );
	Tue, 24 Oct 2017 12:22:19 -0400
Received: from mx2.suse.de ([195.135.220.15]:40323 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751336AbdJXQWR (ORCPT );
	Tue, 24 Oct 2017 12:22:17 -0400
Date: Tue, 24 Oct 2017 18:22:13 +0200
From: Michal Hocko <mhocko@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Greg Thelen <gthelen@google.com>, Shakeel Butt <shakeelb@google.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux MM <linux-mm@kvack.org>, linux-fsdevel@vger.kernel.org,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] fs, mm: account filp and names caches to kmemcg
Message-ID: <20171024162213.n6jrpz3t5pldkgxy@dhcp22.suse.cz>
References: <20171009202613.GA15027@cmpxchg.org>
 <20171010091430.giflzlayvjblx5bu@dhcp22.suse.cz>
 <20171010141733.GB16710@cmpxchg.org>
 <20171010142434.bpiqmsbb7gttrlcb@dhcp22.suse.cz>
 <20171012190312.GA5075@cmpxchg.org>
 <20171013063555.pa7uco43mod7vrkn@dhcp22.suse.cz>
 <20171013070001.mglwdzdrqjt47clz@dhcp22.suse.cz>
 <20171013152421.yf76n7jui3z5bbn4@dhcp22.suse.cz>
 <20171024160637.GB32340@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20171024160637.GB32340@cmpxchg.org>
User-Agent: NeoMutt/20170609 (1.8.3)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 24-10-17 12:06:37, Johannes Weiner wrote:
> On Fri, Oct 13, 2017 at 05:24:21PM +0200, Michal Hocko wrote:
> > Well, it actually occurred to me that this would trigger the global oom
> > killer in case no memcg specific victim can be found, which is definitely
> > not something we would like to do. This should work better. I am not
> > sure we can trigger this corner case, but we should cover it, and it
> > actually doesn't make the code much worse.
> > ---
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index d5f3a62887cf..7b370f070b82 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1528,26 +1528,40 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> >  
> >  static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> >  {
> > -	if (!current->memcg_may_oom)
> > -		return;
> >  	/*
> >  	 * We are in the middle of the charge context here, so we
> >  	 * don't want to block when potentially sitting on a callstack
> >  	 * that holds all kinds of filesystem and mm locks.
> >  	 *
> > -	 * Also, the caller may handle a failed allocation gracefully
> > -	 * (like optional page cache readahead) and so an OOM killer
> > -	 * invocation might not even be necessary.
> > +	 * cgroup v1 allows sync user space handling so we cannot afford
> > +	 * to get stuck here for that configuration. That's why we don't do
> > +	 * anything here except remember the OOM context and then deal with
> > +	 * it at the end of the page fault when the stack is unwound, the
> > +	 * locks are released, and when we know whether the fault was overall
> > +	 * successful.
> 
> How about
> 
> "cgroup1 allows disabling the OOM killer and waiting for outside
> handling until the charge can succeed; remember the context and put
> the task to sleep at the end of the page fault when all locks are
> released."

OK

> and then follow it directly with the branch that handles this:
> 
> 	if (memcg->oom_kill_disable) {
> 		css_get(&memcg->css);
> 		current->memcg_in_oom = memcg;
> 		...
> 		return false;
> 	}
> 
> 	return mem_cgroup_out_of_memory(memcg, mask, order);
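[Illustrative aside: filling in the "..." elided in the sketch above, the
whole reordered helper could look roughly like the following. This is a
minimal sketch assuming the bool return type and the current->memcg_*
bookkeeping from the quoted patch; it is not the code that eventually
landed.]

	static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
	{
		if (memcg->oom_kill_disable) {
			/*
			 * cgroup1 with the OOM killer disabled: remember the
			 * OOM context and let mem_cgroup_oom_synchronize()
			 * put the task to sleep at the end of the page
			 * fault, once all locks are released.
			 */
			if (!current->memcg_may_oom)
				return false;
			css_get(&memcg->css);
			current->memcg_in_oom = memcg;
			current->memcg_oom_gfp_mask = mask;
			current->memcg_oom_order = order;
			return false;
		}

		/*
		 * Otherwise the in-kernel OOM killer can be invoked right
		 * here; the caller retries the charge if a victim was found.
		 */
		return mem_cgroup_out_of_memory(memcg, mask, order);
	}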
> > +	 * On the other hand, the in-kernel OOM killer allows for async victim
> > +	 * memory reclaim (oom_reaper) and that means that we are not solely
> > +	 * relying on the oom victim to make forward progress, so we can stay
> > +	 * in the try_charge context and keep retrying as long as there
> > +	 * are oom victims to select.
> 
> I would put that part into try_charge, where that decision is made.

OK

> >  	 *
> > -	 * That's why we don't do anything here except remember the
> > -	 * OOM context and then deal with it at the end of the page
> > -	 * fault when the stack is unwound, the locks are released,
> > -	 * and when we know whether the fault was overall successful.
> > +	 * Please note that mem_cgroup_out_of_memory might fail to find a
> > +	 * victim and then we have to rely on mem_cgroup_oom_synchronize,
> > +	 * otherwise we would fall back to the global oom killer in
> > +	 * pagefault_out_of_memory
> 
> Ah, that's why... Ugh, that's really duct-tapey.

As you know, I really hate the #PF OOM path. We should get rid of it.

> >  	 */
> > +	if (!memcg->oom_kill_disable &&
> > +			mem_cgroup_out_of_memory(memcg, mask, order))
> > +		return true;
> > +
> > +	if (!current->memcg_may_oom)
> > +		return false;
> >  	css_get(&memcg->css);
> >  	current->memcg_in_oom = memcg;
> >  	current->memcg_oom_gfp_mask = mask;
> >  	current->memcg_oom_order = order;
> > +
> > +	return false;
> >  }
> >  
> >  /**
> > @@ -2007,8 +2021,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  
> >  	mem_cgroup_event(mem_over_limit, MEMCG_OOM);
> >  
> > -	mem_cgroup_oom(mem_over_limit, gfp_mask,
> > -			get_order(nr_pages * PAGE_SIZE));
> > +	if (mem_cgroup_oom(mem_over_limit, gfp_mask,
> > +				get_order(nr_pages * PAGE_SIZE))) {
> > +		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > +		goto retry;
> > +	}
> 
> As per the previous email, this has to goto force, otherwise we return
> -ENOMEM from syscalls once in a blue moon, which makes verification an
> absolute nightmare. The behavior should be reliable, without weird p99
> corner cases.
> 
> I think what we should be doing here is: if a charge fails, set up an
> oom context and force the charge; add mem_cgroup_oom_synchronize() to
> the end of syscalls and kernel-context faults.

What would prevent a runaway in case the only process in the memcg is
oom unkillable then?
-- 
Michal Hocko
SUSE Labs
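[Illustrative aside: the charge-side flow Johannes describes would remember
the OOM context and then force the charge rather than retrying or failing,
leaving the actual kill to mem_cgroup_oom_synchronize() once the syscall or
page fault unwinds. A sketch against try_charge's existing force: label and
locals; an assumption about the shape of the change, not the code as merged.
Whether anything throttles a memcg whose only process is unkillable is
exactly the open question above.]

	/* in try_charge(), once reclaim and the retry loop have failed: */
	mem_cgroup_event(mem_over_limit, MEMCG_OOM);

	/* Remember the OOM context for mem_cgroup_oom_synchronize(). */
	mem_cgroup_oom(mem_over_limit, gfp_mask,
		       get_order(nr_pages * PAGE_SIZE));

	/*
	 * Never fail the charge with -ENOMEM here: overcharge the memcg so
	 * syscall behavior stays deterministic, and resolve the excess after
	 * the stack is unwound and all locks are released.
	 */
	goto force;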