In-Reply-To: <20171031164008.GA32246@cmpxchg.org>
References: <20171019185218.12663-1-guro@fb.com> <20171019185218.12663-4-guro@fb.com> <20171031164008.GA32246@cmpxchg.org>
From: Shakeel Butt
Date: Tue, 31 Oct 2017 10:50:43 -0700
Subject: Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer
To: Johannes Weiner
Cc: Roman Gushchin, Linux MM, Vladimir Davydov, Tetsuo Handa, David Rientjes, Andrew Morton, Tejun Heo, kernel-team@fb.com, Cgroups, linux-doc@vger.kernel.org, LKML
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Oct 31, 2017 at 9:40 AM, Johannes Weiner wrote:
> On Tue, Oct 31, 2017 at 08:04:19AM -0700, Shakeel Butt wrote:
>> > +
>> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
>> > +{
>> > +	struct mem_cgroup *iter;
>> > +
>> > +	oc->chosen_memcg = NULL;
>> > +	oc->chosen_points = 0;
>> > +
>> > +	/*
>> > +	 * The oom_score is calculated for leaf memory cgroups (including
>> > +	 * the root memcg).
>> > +	 */
>> > +	rcu_read_lock();
>> > +	for_each_mem_cgroup_tree(iter, root) {
>> > +		long score;
>> > +
>> > +		if (memcg_has_children(iter) && iter != root_mem_cgroup)
>> > +			continue;
>> > +
>>
>> Cgroup v2 does not support charge migration between memcgs. So, there
>> can be intermediate nodes which may contain the major charge of the
>> processes in their leaf descendants.
>> Skipping such intermediate nodes will kind of protect such processes
>> from oom-killer (lower on the list to be killed). Is it ok to not
>> handle such scenario? If yes, shouldn't we document it?
>
> Tasks cannot be in intermediate nodes, so the only way you can end up
> in a situation like this is to start tasks fully, let them fault in
> their full workingset, then create child groups and move them there.
>
> That has attribution problems much wider than the OOM killer: any
> local limits you would set on a leaf cgroup like this ALSO won't
> control the memory of its tasks - as it's all sitting in the parent.
>
> We created the "no internal competition" rule exactly to prevent this
> situation.

Rather than the "no internal competition" restriction, I think "charge
migration" would have resolved that situation? Also, the "no internal
competition" restriction (I am assuming 'no internal competition' means
no tasks in internal nodes, please correct me if I am wrong) has made
"charge migration" hard to implement, and thus it was not added in
cgroup v2.

I know this is a parallel discussion, and excuse my ignorance, but what
are the other reasons behind "no internal competition" specifically for
the memory controller?

> To be consistent with that rule, we might want to disallow
> the creation of child groups once a cgroup has local memory charges.
>
> It's trivial to change the setup sequence to create the leaf cgroup
> first, then launch the workload from within.
>

Only if the cgroup hierarchy is centrally controlled and each task's
whole hierarchy is known in advance.

> Either way, this is nothing specific about the OOM killer.
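
To make the scenario discussed above concrete, here is a toy userspace
model in C. It is only a sketch and not kernel code: the `toy_memcg`
struct, the `toy_*` functions and the page counts are all hypothetical,
and "pick the leaf with the largest local charge" is a simplifying
stand-in for whatever scoring the patch actually does. It models group A
faulting in 1000 pages before child A/a is created, so the charge stays
in A while the tasks sit in A/a.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a memcg; every name here is hypothetical. */
struct toy_memcg {
	const char *name;
	long local_charge;          /* pages charged directly to this group */
	struct toy_memcg *child;    /* first child, NULL for a leaf */
	struct toy_memcg *sibling;  /* next sibling, NULL at end of list */
};

/*
 * Leaf-only selection: groups that have children are skipped, so any
 * charge left behind in an intermediate node is never considered.
 */
void toy_pick_leaf_only(struct toy_memcg *node, struct toy_memcg **best)
{
	for (; node; node = node->sibling) {
		if (node->child) {
			toy_pick_leaf_only(node->child, best);
			continue;
		}
		if (!*best || node->local_charge > (*best)->local_charge)
			*best = node;
	}
}

/* Hierarchical view: local charge plus the charge of all descendants. */
long toy_subtree_charge(const struct toy_memcg *node)
{
	const struct toy_memcg *c;
	long sum;

	assert(node != NULL);
	sum = node->local_charge;
	for (c = node->child; c; c = c->sibling)
		sum += toy_subtree_charge(c);
	return sum;
}

/* Example tree: A (1000 pages local) -> A/a (10 pages), sibling B (200). */
struct toy_memcg *toy_build_example(void)
{
	static struct toy_memcg a_leaf = { "A/a", 10, NULL, NULL };
	static struct toy_memcg b = { "B", 200, NULL, NULL };
	static struct toy_memcg a = { "A", 1000, &a_leaf, &b };

	return &a;
}
```

In this model the leaf-only walk selects B (200 pages), although the
subtree rooted at A accounts for 1010 pages, 1000 of which sit in the
intermediate node itself.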