Date: Fri, 23 Jun 2017 15:43:24 +0200
From: Michal Hocko
To: Roman Gushchin
Cc: linux-mm@kvack.org
Subject: Re: [RFC PATCH v2 0/7] cgroup-aware OOM killer
Message-ID: <20170623134323.GB5314@dhcp22.suse.cz>
In-Reply-To: <20170622171003.GB30035@castle>
References: <1496342115-3974-1-git-send-email-guro@fb.com> <20170609163022.GA9332@dhcp22.suse.cz> <20170622171003.GB30035@castle>

On Thu 22-06-17 18:10:03, Roman Gushchin wrote:
> Hi, Michal!
>
> Thank you very much for the review. I've tried to address your
> comments in v3 (sent yesterday), which is why it took some time to
> reply.

I will try to look at it sometime next week, hopefully.

[...]

> > - You seem to completely ignore the per-task oom_score_adj and
> >   override it by the memcg value. This makes some sense, but it can
> >   lead to unexpected behavior when somebody relies on the original
> >   semantics. E.g. a workload that would corrupt data when killed
> >   unexpectedly and so is protected by OOM_SCORE_ADJ_MIN. That
> >   assumption will now break when running inside a container. I do
> >   not have a good answer as to what the desirable behavior is, and
> >   maybe there is no universal answer. Maybe you just do not kill
> >   those tasks? But then you have to be careful when selecting a
> >   memcg victim. Hairy...
>
> I do not ignore it completely, but it matters only for root cgroup
> tasks, and inside a cgroup only when oom_kill_all_tasks is off.
>
> I believe the cgroup v2 requirement is good enough. I mean, you can't
> move from v1 to v2 without changing cgroup settings, and if we provide
> a per-cgroup oom_score_adj, that will be enough to reproduce the old
> behavior.
>
> Also, if you think it's necessary, I can add a sysctl to turn the
> cgroup-aware oom killer off completely and provide a compatibility
> mode. We can't really preserve the old system-wide behavior of the
> per-process oom_score_adj; it makes no sense in a containerized
> environment.

So what are you going to do with those applications that simply cannot
be killed and which set OOM_SCORE_ADJ_MIN explicitly? Are they
unsupported? How does a user find out? One way around this could be to
simply not kill tasks with OOM_SCORE_ADJ_MIN.
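Something along the lines of the following completely untested sketch
could do that in the kill-all path. oom_kill_memcg_member() is just a
made-up name for the per-task callback and oc->chosen_memcg comes from
your patch; mem_cgroup_scan_tasks(), OOM_SCORE_ADJ_MIN and
do_send_sig_info() are the existing interfaces:

	/* per-task callback for the oom_kill_all_tasks path */
	static int oom_kill_memcg_member(struct task_struct *task, void *unused)
	{
		/* honor the per-task setting even when killing the whole memcg */
		if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
			return 0;	/* spare this task, keep iterating */

		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, task, true);
		return 0;	/* continue the hierarchy walk */
	}

	/* once a memcg with oom_kill_all_tasks set has been selected */
	mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member, NULL);

But then the victim selection would have to account for such tasks as
well, otherwise we might end up selecting a memcg where nothing can be
killed at all.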
> > - While we are at it, oom_score_adj has turned out to be quite
> >   unusable for any sensible oom prioritization in my experience.
> >   Practically, it reduces to disabling or enforcing a task for
> >   selection. The scale is quite small as well. There certainly is a
> >   need for prioritization, and maybe a completely different API
> >   would be better. Maybe a simple priority in the (0, infinity)
> >   range would be better. Priority could be used either as the only
> >   criterion or as a tie breaker when the consumption of several
> >   memcgs is too close (this could be implemented differently for
> >   each strategy if we go the modules way).
>
> I had such a version, but then the question of how to compare root
> cgroup tasks and cgroups becomes even harder. But if you have any
> ideas here, that would be just great. I do not like this [-1000, 1000]
> range at all.

Dunno, I would have to think about that some more.

> > - oom_kill_all_tasks should be hierarchical and consistent within a
> >   hierarchy. Or maybe it should be applicable only to memcgs with
> >   tasks (leaf cgroups). Although selecting a memcg higher in the
> >   hierarchy to kill all tasks in that hierarchy makes some sense as
> >   well IMHO. Say you delegate a hierarchy to an unprivileged user
> >   and still want to contain that user.
>
> It's already hierarchical, or did I miss something? Please explain
> what you mean. If turned on for any cgroup (except the root), it
> forces the oom killer to treat the whole cgroup as a single entity and
> kill all the tasks belonging to it if it is selected as the oom
> victim.

My fault! I can see that mem_cgroup_scan_tasks will iterate over the
whole hierarchy (with oc->chosen_memcg as the root). Setting
oom_kill_all_tasks differently from the cgroups up the hierarchy can be
confusing, but it makes sense and is consistent with the other knobs.
-- 
Michal Hocko
SUSE Labs