From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752877AbdIEUYf (ORCPT ); Tue, 5 Sep 2017 16:24:35 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:39702 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752176AbdIEUYa (ORCPT ); Tue, 5 Sep 2017 16:24:30 -0400 Date: Tue, 5 Sep 2017 21:23:57 +0100 From: Roman Gushchin To: Michal Hocko CC: , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , David Rientjes , Andrew Morton , Tejun Heo , , , , Subject: Re: [v7 2/5] mm, oom: cgroup-aware OOM killer Message-ID: <20170905202357.GA10535@castle.DHCP.thefacebook.com> References: <20170904142108.7165-1-guro@fb.com> <20170904142108.7165-3-guro@fb.com> <20170905145700.fd7jjd37xf4tb55h@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20170905145700.fd7jjd37xf4tb55h@dhcp22.suse.cz> User-Agent: Mutt/1.8.3 (2017-05-23) X-Originating-IP: [2620:10d:c092:200::1:ad36] X-ClientProxiedBy: AM5PR0102CA0022.eurprd01.prod.exchangelabs.com (2603:10a6:206::35) To BL2PR15MB1074.namprd15.prod.outlook.com (2603:10b6:201:17::8) X-MS-PublicTrafficType: Email X-MS-Office365-Filtering-Correlation-Id: 1c04e0a9-4cb9-4aa1-1257-08d4f49c1200 X-Microsoft-Antispam: UriScan:;BCL:0;PCL:0;RULEID:(300000500095)(300135000095)(300000501095)(300135300095)(22001)(300000502095)(300135100095)(2017030254152)(300000503095)(300135400095)(2017052603199)(201703131423075)(201703031133081)(201702281549075)(300000504095)(300135200095)(300000505095)(300135600095)(300000506095)(300135500095);SRVR:BL2PR15MB1074; X-Microsoft-Exchange-Diagnostics: 1;BL2PR15MB1074;3:vDML48JJqyqCLqrPfle/DDLCaI4FBxEjJx5iBfxBI10rwa1UxiKeKFLcSsUoWY4CR/pEvrPDu+YwJnhIwtZJtk42DZnqRFEcLxRs//M21rajd5FOlF69fChFDdlrmqopd8hwrQ28vZg5AuTiU9ROHjne9k5aqyHUukL8cNzjBH3t/nlhoAaT6fwuYGgQVM5+yF031NooNil7CxU5kleDkUko90FVz9CFb9POoSI4a+YLWw4Fhbbl7xOj5I4gM9Rv;25:XFhokslYcshkqr5Ol9mPpjZF7MEEildZwiL4hqgaxPYtzdlZhv/91oQs1IF38QfaTi50+KWoGaLwQw26mh7tazLFZbJRS5Blo1LUMur/52IpYCQrG9W3+RjApJYjeLELdGTG8USZqvsNJYFmFeluBQc8QlYiOzxUKPVJPTrkOAf4vxhJKmSGujKcUToLA2VsWnNqWaKY93s18+qTHFx+cLz3Xq/QLBFdzsJfynXGH28+7XuwFFIA0j2+5B3qTShlU+N7YRQN+/fvA0qsf0cQntI9AU8swYD+3m3opD1yfJ6sIJEW4akkqsXkhz6ygwDRG7/5RW8XA7xc7AaEMPP+wg==;31:A3e4ZZrybPPGNicSd/YeynQavZuqAyTmqFX0iH74uRBKpy4ZXRJ33SkXMwVK/kXmnt5EWPpDQkPujxjBMCozo/StSV+YI6/5o7NwdjAx7VPDhOBIYata3o0waW7hY7u7/1lKQ8DxTh2Xpd/zRcFypIuNW79rfNm3jDDaVZMhw7Utq6016DF4We1JdvDADlliXcIVyFTBYB2KMsy0C5geiN65QJZTv6dx4EhBkE/Cnvk= X-MS-TrafficTypeDiagnostic: BL2PR15MB1074: X-Microsoft-Exchange-Diagnostics: 1;BL2PR15MB1074;20:Sda3J/PnnvMKrhVeONLhceuzqjNivCVPrQ3j8Brb1r3qiAYZ3sjK7xpbbnX32jDfszmdwqkw4EgyYkDXJESsZAebwS8cvJ39zvuCxHGnrX5DP0P0/Mn8GCGWYMPcku0uhOIL5SzSik0Xbn3lyFVZh88d6vvKh845oT1QFa4CMXDWj9zmKktVVjwEjntSol9KF3qUAi938l5qOIXO6vCjpHtaTztQB5ySjV5tWjYayJ4zrw/uwWhz2VsP9opR7SOqulKb1tmz3laSt8OPZjGtPrtZIS0kH1uqwi8pUIRXOROZ8538iD/fdt+BA9osXeCsd4i+nWPMJfvOEsUEQFmP0ddWRV++wJ3Jphv91KLzWUE1DPcWrqWSqyOIQPAl2ki6E28ODqHyJa7CyK+x9sbrTzL/FcUu4Hz/bEL0RxXjqCEkkj0lLiy/vjEeLJvq4vyuwZc6h6sfKzwnaCLoV9skA74UhRT0YXVYiVdvcDY/DEpsY/TggB2z8eZ3Pr7UQ7Cv;4:aE+h/3LJqohu1k/sBPBqn5rqaHW4ozdwNEA09bAjpCRsbnCLb8I3waS0y2aZ/q7axELzOeauA8FNaPEjKNgTgifLTAHz5va6PZWwQpWHmtBj8yzlzNdfi5TR6NDqlJ1HFYW42PfPN5j3wzUXoN5yiJ7C72BZUZaEht11/cl8nn4qcTaBvY2SqOxLzzQWeRg1TDSKRYYZLUwQJ8lua0nlmD12DAougQjdF9/LXMarinz1LoOOjA8nD+QXlMOBk2l0 X-Exchange-Antispam-Report-Test: UriScan:; X-Microsoft-Antispam-PRVS: X-Exchange-Antispam-Report-CFA-Test: BCL:0;PCL:0;RULEID:(100000700101)(100105000095)(100000701101)(100105300095)(100000702101)(100105100095)(6040450)(2401047)(8121501046)(5005006)(100000703101)(100105400095)(3002001)(93006095)(93001095)(10201501046)(6041248)(20161123555025)(20161123558100)(20161123560025)(201703131423075)(201702281528075)(201703061421075)(201703061406153)(20161123564025)(20161123562025)(6072148)(201708071742011)(100000704101)(100105200095)(100000705101)(100105500095);SRVR:BL2PR15MB1074;BCL:0;PCL:0;RULEID:(100000800101)(100110000095)(100000801101)(100110300095)(100000802101)(100110100095)(100000803101)(100110400095)(100000804101)(100110200095)(100000805101)(100110500095);SRVR:BL2PR15MB1074; X-Forefront-PRVS: 0421BF7135 X-Forefront-Antispam-Report: SFV:NSPM;SFS:(10019020)(6009001)(377424004)(24454002)(199003)(54534003)(189002)(76176999)(53936002)(4326008)(9686003)(2906002)(25786009)(101416001)(6916009)(6666003)(110136004)(39060400002)(2950100002)(6246003)(42186005)(81166006)(81156014)(50986999)(105586002)(305945005)(8676002)(106356001)(68736007)(86362001)(7736002)(55016002)(54906002)(33656002)(189998001)(47776003)(8936002)(478600001)(83506001)(5660300001)(6506006)(1076002)(7416002)(54356999)(50466002)(4001350100001)(23726003)(229853002)(97736004)(6116002)(18370500001)(42262002);DIR:OUT;SFP:1102;SCL:1;SRVR:BL2PR15MB1074;H:castle.DHCP.thefacebook.com;FPR:;SPF:None;PTR:InfoNoRecords;A:1;MX:1;LANG:en; X-Microsoft-Exchange-Diagnostics: =?us-ascii?Q?1;BL2PR15MB1074;23:v9zEqDau5CdDQsgM7Qcj/po5CXCePzKu26QxZLQ+u?= =?us-ascii?Q?xu1j1UjhS/XX2LtHwPKQID/y2wfKqLP1DEuN8KEW3EpCiwwqNYQ9229m7etz?= =?us-ascii?Q?J9vxLOoi1XkXvjGGww1JyBlo6wkYVNLGeDIJjfj2Hznzpha2YlRC9Pwwnzkl?= =?us-ascii?Q?CQwymqNSSjrpwxj+/FG6pxenn6Hgd+5bamnc96krXFEt/A4tDwyCk5aqglr7?= =?us-ascii?Q?kYB+jCzjNV+reeXpb/DUE0XaSPRacAtk4GIBv/+m7b4MYCunnMaUH089uqmX?= =?us-ascii?Q?YB0JnWmno6f3R1pmx3WyPdgQDOMnXScrQFiZBUKK+gX3KfBb46xefgpYH8E8?= =?us-ascii?Q?Kp6CLCJ/sSL3RNP4orz+UlhTJ73RT8RLScDZ7x5E1VIg2aSng5uq+Ei2XpZZ?= =?us-ascii?Q?nggbyeP5AGp0zkVl/FpysMDDrzDdQqOmt47uoEseMPGiWXlx+sKE0vxN+xkG?= =?us-ascii?Q?tWPo1mVH69oQmLCOvRC1IxdYmdd0ULRsyuu6dIbN3TNBsTxYWNjAHQTjx0MQ?= =?us-ascii?Q?gv+tLVQkvoM3Bw623SGIoy5fB6VZYj+zNGwyQtbY3NuUTQsw4kHDtIjm0JXf?= =?us-ascii?Q?byRm9utG/S0DVFmR3p2GLW+CKDtfZYBq5Vihrh+zNw8K2GL9+zrkaDrvCFzI?= =?us-ascii?Q?IXttEIt0JbcqOhPYjD5mLmy4pqvRjF3frYURbJf3qGH36AhAgtR7+QsH6NIa?= =?us-ascii?Q?xlUVcCo9uudpKpGrRf81pymGsrsplElhynkXOHO9BPCldhDED+ZZ4uFx9McO?= =?us-ascii?Q?f+ylwZ0wTOQS+eK3CMvF/gcDLbEMO1tn2B3iRk9qEdBAuCVbuEfX/B8O2OSp?= =?us-ascii?Q?qV+TmdEV+m1VFgE2NB3jTx6jAUVHetVs5D+pwSZX/f9iPakYqnG2iJYv49n4?= =?us-ascii?Q?deqyjbkzoErfC48M/iCGHmY/JiosZdwV5tUPYhGu8XQxpiZZfPn2Lm4HM1iC?= =?us-ascii?Q?DzeYePJ5NOMcBdeIy6SxTX8t3n6FoLpmBU/8ewnLEHB/txofDX2faQ3OkN8Z?= =?us-ascii?Q?b9npzjpxF82jHNZojwJIKMxfAEI6bp/HGpD5H8CBjcacXME/fKPUiW3RRM2N?= =?us-ascii?Q?tGpSQormmcYlP1UxPVYOyYCHN2PMLN1pDFwmj3kFDdN9zBNQKfuGdZWFXQim?= =?us-ascii?Q?0QLmkTmYM6BsaZfkO3jRPWH38zBZ611xjmn+B208a8rf6UfaY/WQxsgYWRjf?= =?us-ascii?Q?AHuJMNxiec32sF0Wk/iLk2lXBnJcjdiO68ew7qZ76yXyMCBOp7qd2JWdQ=3D?= =?us-ascii?Q?=3D?= X-Microsoft-Exchange-Diagnostics: 1;BL2PR15MB1074;6:rBZgJ7bjdiqhnz12G5LxmLVA1aalLp2vg7j5g8Zksagwk6T8aWfxxiqK451x2BzX30UM8VhlVmPjBcsJwupniNL6GTQ0UpOHQp1mPaCFK0Ry7fbMgPFHQGuy5ixrh9IC954B3WyCqj0uAiXP0YMpDgRjQj67aqR8w3Exn4gNV33YoKUprtqMED+7cxQdf+iqWKfi3gClFVZv/eOEQmpphFU2FsYgIPAg48RNyAvl9jdPLHubG5tg05K6RW57a0cshMkjT3STSWEn7fzM4AMKNLTuV30anH96DCuv1fczeEmRsTizB+S7/uuZpEaDozyno3kedpYPKr9Rov26XIgyRw==;5:zytE8KZHoDUgM3FRdRqe508hSNem0JJL/tL5jUuG5YvpfhOKJrS9OHXprPJwHGm9XfCKZBxurGfVXYmiFcTtklS/5WYOTuUaTIe2w6Nk5pRfWEZEofJyKOVEV6P5ig0rnmhMLPgzsXgIJUI+16RS4w==;24:9aVqP5JH9IHpHhILdr/Kpp4MTfqGWZXY0m0hXQzK3wJ9L7yn8jokVQP/uMS2jM6qDKSGdCd3vxN3ugZzpxfv3myf5GleRwst8aHQRFyBJbg=;7:f3dB6yYrU3Icpldkx0W1blwr+dfysvjZdQeklkglaBDI+pYLVGNG0g5e4DixP48EHcKMbBEXt/O0GKoh8arfneOF/CxP+cRbQ9QzucGcxR7I119rYFH7XPr9vOaOU/tiVArT/NWt9+dTz9CxqRSUBMYh3OD9H9/VUcEiM31s/6ixPObbndPG9cGj9ZNv4WwYxuzMt+H+LPSFGYptLKvoQ43XT/lt2qJhStfVe+sVZgY= SpamDiagnosticOutput: 1:99 SpamDiagnosticMetadata: NSPM X-Microsoft-Exchange-Diagnostics: 1;BL2PR15MB1074;20:jCmpEc/XNoiTmRGSo5OghASTWvpxAGxENDD66OgrwwFxiIA6UrYiZzc+BEXuao6Fr4+5LQnuewDHoluLXhld4hIUrl9fSEgl1qXJ+HgHX2Wym1/iP/ipeFIRyruTzbkOj5yvuSkLB6ZQphUVXyiVGqyUY+1Xr5phq2XS5lCisYk= X-MS-Exchange-CrossTenant-OriginalArrivalTime: 05 Sep 2017 20:24:10.1245 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-CrossTenant-Id: 8ae927fe-1255-47a7-a2af-5f3a069daaa2 X-MS-Exchange-Transport-CrossTenantHeadersStamped: BL2PR15MB1074 X-OriginatorOrg: fb.com X-Proofpoint-Spam-Reason: safe X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2017-09-05_09:,, signatures=0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Sep 05, 2017 at 04:57:00PM +0200, Michal Hocko wrote: > On Mon 04-09-17 15:21:05, Roman Gushchin wrote: > [...] > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index a69d23082abf..97813c56163b 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -2649,6 +2649,213 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) > > return ret; > > } > > > > +static long memcg_oom_badness(struct mem_cgroup *memcg, > > + const nodemask_t *nodemask) > > +{ > > + long points = 0; > > + int nid; > > + pg_data_t *pgdat; > > + > > + for_each_node_state(nid, N_MEMORY) { > > + if (nodemask && !node_isset(nid, *nodemask)) > > + continue; > > + > > + points += mem_cgroup_node_nr_lru_pages(memcg, nid, > > + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); > > Why don't you consider file LRUs here? What if there is a lot of page > cache which is not reclaimed because it is protected by memcg->low. > Should we hide that from the OOM killer? I'm not sure here. I agree with your argument, although memcg->low should not cause OOMs in the current implementation (which is a separate problem). Also I can imagine some edge cases with mlocked pagecache belonging to a process from a different cgroup. I would suggest to refine this later. > > [...] > > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc) > > +{ > > + struct mem_cgroup *iter, *parent; > > + > > + for_each_mem_cgroup_tree(iter, root) { > > + if (memcg_has_children(iter)) { > > + iter->oom_score = 0; > > + continue; > > + } > > Do we really need this check? If it is a mere optimization then > we should probably check for tasks in the memcg rather than > descendant. More on that below. The idea is to traverse memcg only once: we're resetting oom_score for non-leaf cgroups, and for each leaf cgroup calculate the score and propagate it upwards. > > > + > > + iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask); > > + > > + /* > > + * Ignore empty and non-eligible memory cgroups. > > + */ > > + if (iter->oom_score == 0) > > + continue; > > + > > + /* > > + * If there are inflight OOM victims, we don't need to look > > + * further for new victims. > > + */ > > + if (iter->oom_score == -1) { > > + oc->chosen_memcg = INFLIGHT_VICTIM; > > + mem_cgroup_iter_break(root, iter); > > + return; > > + } > > + > > + for (parent = parent_mem_cgroup(iter); parent && parent != root; > > + parent = parent_mem_cgroup(parent)) > > + parent->oom_score += iter->oom_score; > > Hmm. The changelog says "By default, it will look for the biggest leaf > cgroup, and kill the largest task inside." But you are accumulating > oom_score up the hierarchy and so parents will have higher score than > the layer of their children and the larger the sub-hierarchy the more > biased it will become. Say you have > root > /\ > / \ > A D > / \ > B C > > B (5), C(15) thus A(20) and D(20). Unless I am missing something we are > going to go down A path and then chose C even though D is the largest > leaf group, right? You're right, changelog is not accurate, I'll fix it. The behavior is correct, IMO. > > > + } > > + > > + for (;;) { > > + struct cgroup_subsys_state *css; > > + struct mem_cgroup *memcg = NULL; > > + long score = LONG_MIN; > > + > > + css_for_each_child(css, &root->css) { > > + struct mem_cgroup *iter = mem_cgroup_from_css(css); > > + > > + /* > > + * Ignore empty and non-eligible memory cgroups. > > + */ > > + if (iter->oom_score == 0) > > + continue; > > + > > + if (iter->oom_score > score) { > > + memcg = iter; > > + score = iter->oom_score; > > + } > > + } > > + > > + if (!memcg) { > > + if (oc->memcg && root == oc->memcg) { > > + oc->chosen_memcg = oc->memcg; > > + css_get(&oc->chosen_memcg->css); > > + oc->chosen_points = oc->memcg->oom_score; > > + } > > + break; > > + } > > + > > + if (memcg->oom_group || !memcg_has_children(memcg)) { > > + oc->chosen_memcg = memcg; > > + css_get(&oc->chosen_memcg->css); > > + oc->chosen_points = score; > > + break; > > + } > > + > > + root = memcg; > > + } > > +} > > + > [...] > > + /* > > + * For system-wide OOMs we should consider tasks in the root cgroup > > + * with oom_score larger than oc->chosen_points. > > + */ > > + if (!oc->memcg) { > > + select_victim_root_cgroup_task(oc); > > I do not understand why do we have to handle root cgroup specially here. > select_victim_memcg already iterates all memcgs in the oom hierarchy > (including root) so if the root memcg is the largest one then we > should simply consider it no? We don't have necessary stats for the root cgroup, so we can't calculate it's oom_score. > You are skipping root there because of > memcg_has_children but I suspect this and the whole accumulate up the > hierarchy approach just makes the whole thing more complex than necessary. With > "tasks only in leafs" cgroup policy we should only see any pages on LRUs > on the global root memcg and leaf cgroups. The same applies to memcg > stats. So why cannot we simply do the tree walk, calculate > badness/check the priority and select the largest memcg in one go? We have to traverse from top to bottom to make priority-based decision, but size-based oom_score is calculated as sum of descending leaf cgroup scores. For example: root /\ / \ A D / \ B C A and D have same priorities, B has larger priority than C. In this case we need to calculate size-based score for A, which requires summing oom_score of the sub-tree (B an C), despite we don't need it for choosing between B and C. Maybe I don't see it, but I don't know how to implement it more optimal. > > > @@ -810,6 +810,9 @@ static void __oom_kill_process(struct task_struct *victim) > > struct mm_struct *mm; > > bool can_oom_reap = true; > > > > + if (is_global_init(victim) || (victim->flags & PF_KTHREAD)) > > + return; > > + > > This will leak a reference to the victim AFACS Good catch! I didn't fix this after moving reference dropping into __oom_kill_process(). Fixed. Thanks!