From: Johannes Weiner <hannes@cmpxchg.org> To: Michal Hocko <mhocko@kernel.org> Cc: Greg Thelen <gthelen@google.com>, Shakeel Butt <shakeelb@google.com>, Alexander Viro <viro@zeniv.linux.org.uk>, Vladimir Davydov <vdavydov.dev@gmail.com>, Andrew Morton <akpm@linux-foundation.org>, Linux MM <linux-mm@kvack.org>, linux-fsdevel@vger.kernel.org, LKML <linux-kernel@vger.kernel.org> Subject: Re: [PATCH] fs, mm: account filp and names caches to kmemcg Date: Wed, 25 Oct 2017 12:44:02 -0400 [thread overview] Message-ID: <20171025164402.GA11582@cmpxchg.org> (raw) In-Reply-To: <20171025141221.xm4cqp2z6nunr6vy@dhcp22.suse.cz> On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote: > On Wed 25-10-17 09:11:51, Johannes Weiner wrote: > > On Wed, Oct 25, 2017 at 09:15:22AM +0200, Michal Hocko wrote: > [...] > > > ... we shouldn't make it more loose though. > > > > Then we can end this discussion right now. I pointed out right from > > the start that the only way to replace -ENOMEM with OOM killing in the > > syscall is to force charges. If we don't, we either deadlock or still > > return -ENOMEM occasionally. Nobody has refuted that this is the case. > > Yes this is true. I guess we are back to the non-failing allocations > discussion... Currently we are too ENOMEM happy for memcg !PF paths which > can lead to weird issues Greg has pointed out earlier. Going to opposite > direction to basically never ENOMEM and rather pretend a success (which > allows runaways for extreme setups with no oom eligible tasks) sounds > like going from one extreme to another. This basically means that those > charges will effectively GFP_NOFAIL. Too much to guarantee IMHO. We're talking in circles. Overrunning the hard limit in the syscall path isn't worse than an infinite loop in the page fault path. Once you make tasks ineligible for OOM, that's what you have to expect. It's the OOM disabling that results in extreme consequences across the board. It's silly to keep bringing this up as an example. > > They're a challenge as it is. The only sane options are to stick with > > the status quo, > > One thing that really worries me about the current status quo is that > the behavior depends on whether you run under memcg or not. The global > policy is "almost never fail unless something horrible is going on". > But we _do not_ guarantee that ENOMEM stays inside the kernel. > > So if we need to do something about that I would think we need an > universal solution rather than something memcg specific. Sure global > ENOMEMs are so rare that nobody will probably trigger those but that is > just a wishful thinking... The difference in current behavior comes from the fact that real-life workloads could trigger a deadlock with memcg looping indefinitely in the charge path, but not with the allocator looping indefinitely in the allocation path. That also means that if we change it like you proposed, it'll be much more likely to trigger -ENOMEM from syscalls inside memcg than it is outside. The margins are simply tighter inside cgroups. You often have only a few closely interacting tasks in there, and the probability for something to go wrong is much higher than on the system scope. But also, the system is constrained by physical limits. It's a hard wall; once you hit it, the kernel is dead, so we don't have a lot of options. Memcg on the other hand is a completely artificial limit. It doesn't make sense to me to pretend it's a hard wall to the point where we actively *create* for ourselves the downsides of a physical limitation. > So how about we start with a BIG FAT WARNING for the failure case? > Something resembling warn_alloc for the failure case. > > --- > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 5d9323028870..3ba62c73eee5 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > * victim and then we have rely on mem_cgroup_oom_synchronize otherwise > * we would fall back to the global oom killer in pagefault_out_of_memory > */ > - if (!memcg->oom_kill_disable && > - mem_cgroup_out_of_memory(memcg, mask, order)) > - return true; > + if (!memcg->oom_kill_disable) { > + if (mem_cgroup_out_of_memory(memcg, mask, order)) > + return true; > + > + WARN(!current->memcg_may_oom, > + "Memory cgroup charge failed because of no reclaimable memory! " > + "This looks like a misconfiguration or a kernel bug."); > + } That's crazy! We shouldn't create interfaces that make it possible to accidentally livelock the kernel. Then warn about it and let it crash. That is a DOS-level lack of OS abstraction. In such a situation, we should ignore oom_score_adj or ignore the hard limit. Even panic() would be better from a machine management point of view than leaving random tasks inside infinite loops. Why is OOM-disabling a thing? Why isn't this simply a "kill everything else before you kill me"? It's crashing the kernel in trying to protect a userspace application. How is that not insane?
WARNING: multiple messages have this Message-ID (diff)
From: Johannes Weiner <hannes@cmpxchg.org> To: Michal Hocko <mhocko@kernel.org> Cc: Greg Thelen <gthelen@google.com>, Shakeel Butt <shakeelb@google.com>, Alexander Viro <viro@zeniv.linux.org.uk>, Vladimir Davydov <vdavydov.dev@gmail.com>, Andrew Morton <akpm@linux-foundation.org>, Linux MM <linux-mm@kvack.org>, linux-fsdevel@vger.kernel.org, LKML <linux-kernel@vger.kernel.org> Subject: Re: [PATCH] fs, mm: account filp and names caches to kmemcg Date: Wed, 25 Oct 2017 12:44:02 -0400 [thread overview] Message-ID: <20171025164402.GA11582@cmpxchg.org> (raw) In-Reply-To: <20171025141221.xm4cqp2z6nunr6vy@dhcp22.suse.cz> On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote: > On Wed 25-10-17 09:11:51, Johannes Weiner wrote: > > On Wed, Oct 25, 2017 at 09:15:22AM +0200, Michal Hocko wrote: > [...] > > > ... we shouldn't make it more loose though. > > > > Then we can end this discussion right now. I pointed out right from > > the start that the only way to replace -ENOMEM with OOM killing in the > > syscall is to force charges. If we don't, we either deadlock or still > > return -ENOMEM occasionally. Nobody has refuted that this is the case. > > Yes this is true. I guess we are back to the non-failing allocations > discussion... Currently we are too ENOMEM happy for memcg !PF paths which > can lead to weird issues Greg has pointed out earlier. Going to opposite > direction to basically never ENOMEM and rather pretend a success (which > allows runaways for extreme setups with no oom eligible tasks) sounds > like going from one extreme to another. This basically means that those > charges will effectively GFP_NOFAIL. Too much to guarantee IMHO. We're talking in circles. Overrunning the hard limit in the syscall path isn't worse than an infinite loop in the page fault path. Once you make tasks ineligible for OOM, that's what you have to expect. It's the OOM disabling that results in extreme consequences across the board. It's silly to keep bringing this up as an example. > > They're a challenge as it is. The only sane options are to stick with > > the status quo, > > One thing that really worries me about the current status quo is that > the behavior depends on whether you run under memcg or not. The global > policy is "almost never fail unless something horrible is going on". > But we _do not_ guarantee that ENOMEM stays inside the kernel. > > So if we need to do something about that I would think we need an > universal solution rather than something memcg specific. Sure global > ENOMEMs are so rare that nobody will probably trigger those but that is > just a wishful thinking... The difference in current behavior comes from the fact that real-life workloads could trigger a deadlock with memcg looping indefinitely in the charge path, but not with the allocator looping indefinitely in the allocation path. That also means that if we change it like you proposed, it'll be much more likely to trigger -ENOMEM from syscalls inside memcg than it is outside. The margins are simply tighter inside cgroups. You often have only a few closely interacting tasks in there, and the probability for something to go wrong is much higher than on the system scope. But also, the system is constrained by physical limits. It's a hard wall; once you hit it, the kernel is dead, so we don't have a lot of options. Memcg on the other hand is a completely artificial limit. It doesn't make sense to me to pretend it's a hard wall to the point where we actively *create* for ourselves the downsides of a physical limitation. > So how about we start with a BIG FAT WARNING for the failure case? > Something resembling warn_alloc for the failure case. > > --- > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 5d9323028870..3ba62c73eee5 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) > * victim and then we have rely on mem_cgroup_oom_synchronize otherwise > * we would fall back to the global oom killer in pagefault_out_of_memory > */ > - if (!memcg->oom_kill_disable && > - mem_cgroup_out_of_memory(memcg, mask, order)) > - return true; > + if (!memcg->oom_kill_disable) { > + if (mem_cgroup_out_of_memory(memcg, mask, order)) > + return true; > + > + WARN(!current->memcg_may_oom, > + "Memory cgroup charge failed because of no reclaimable memory! " > + "This looks like a misconfiguration or a kernel bug."); > + } That's crazy! We shouldn't create interfaces that make it possible to accidentally livelock the kernel. Then warn about it and let it crash. That is a DOS-level lack of OS abstraction. In such a situation, we should ignore oom_score_adj or ignore the hard limit. Even panic() would be better from a machine management point of view than leaving random tasks inside infinite loops. Why is OOM-disabling a thing? Why isn't this simply a "kill everything else before you kill me"? It's crashing the kernel in trying to protect a userspace application. How is that not insane? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2017-10-25 16:44 UTC|newest] Thread overview: 104+ messages / expand[flat|nested] mbox.gz Atom feed top 2017-10-05 22:21 [PATCH] fs, mm: account filp and names caches to kmemcg Shakeel Butt 2017-10-05 22:21 ` Shakeel Butt 2017-10-06 7:59 ` Michal Hocko 2017-10-06 7:59 ` Michal Hocko 2017-10-06 19:33 ` Shakeel Butt 2017-10-06 19:33 ` Shakeel Butt 2017-10-09 6:24 ` Michal Hocko 2017-10-09 6:24 ` Michal Hocko 2017-10-09 17:52 ` Greg Thelen 2017-10-09 17:52 ` Greg Thelen 2017-10-09 18:04 ` Michal Hocko 2017-10-09 18:04 ` Michal Hocko 2017-10-09 18:17 ` Michal Hocko 2017-10-09 18:17 ` Michal Hocko 2017-10-10 9:10 ` Michal Hocko 2017-10-10 9:10 ` Michal Hocko 2017-10-10 22:21 ` Shakeel Butt 2017-10-10 22:21 ` Shakeel Butt 2017-10-11 9:09 ` Michal Hocko 2017-10-11 9:09 ` Michal Hocko 2017-10-09 20:26 ` Johannes Weiner 2017-10-09 20:26 ` Johannes Weiner 2017-10-10 9:14 ` Michal Hocko 2017-10-10 9:14 ` Michal Hocko 2017-10-10 14:17 ` Johannes Weiner 2017-10-10 14:17 ` Johannes Weiner 2017-10-10 14:24 ` Michal Hocko 2017-10-10 14:24 ` Michal Hocko 2017-10-12 19:03 ` Johannes Weiner 2017-10-12 19:03 ` Johannes Weiner 2017-10-12 23:57 ` Greg Thelen 2017-10-12 23:57 ` Greg Thelen 2017-10-13 6:51 ` Michal Hocko 2017-10-13 6:51 ` Michal Hocko 2017-10-13 6:35 ` Michal Hocko 2017-10-13 6:35 ` Michal Hocko 2017-10-13 7:00 ` Michal Hocko 2017-10-13 7:00 ` Michal Hocko 2017-10-13 15:24 ` Michal Hocko 2017-10-13 15:24 ` Michal Hocko 2017-10-24 12:18 ` Michal Hocko 2017-10-24 12:18 ` Michal Hocko 2017-10-24 17:54 ` Johannes Weiner 2017-10-24 17:54 ` Johannes Weiner 2017-10-24 16:06 ` Johannes Weiner 2017-10-24 16:06 ` Johannes Weiner 2017-10-24 16:22 ` Michal Hocko 2017-10-24 16:22 ` Michal Hocko 2017-10-24 17:23 ` Johannes Weiner 2017-10-24 17:23 ` Johannes Weiner 2017-10-24 17:55 ` Michal Hocko 2017-10-24 17:55 ` Michal Hocko 2017-10-24 18:58 ` Johannes Weiner 2017-10-24 18:58 ` Johannes Weiner 2017-10-24 20:15 ` Michal Hocko 2017-10-24 20:15 ` Michal Hocko 2017-10-25 6:51 ` Greg Thelen 2017-10-25 6:51 ` Greg Thelen 2017-10-25 7:15 ` Michal Hocko 2017-10-25 7:15 ` Michal Hocko 2017-10-25 13:11 ` Johannes Weiner 2017-10-25 13:11 ` Johannes Weiner 2017-10-25 14:12 ` Michal Hocko 2017-10-25 14:12 ` Michal Hocko 2017-10-25 16:44 ` Johannes Weiner [this message] 2017-10-25 16:44 ` Johannes Weiner 2017-10-25 17:29 ` Michal Hocko 2017-10-25 17:29 ` Michal Hocko 2017-10-25 18:11 ` Johannes Weiner 2017-10-25 18:11 ` Johannes Weiner 2017-10-25 19:00 ` Michal Hocko 2017-10-25 19:00 ` Michal Hocko 2017-10-25 21:13 ` Johannes Weiner 2017-10-25 21:13 ` Johannes Weiner 2017-10-25 22:49 ` Greg Thelen 2017-10-25 22:49 ` Greg Thelen 2017-10-26 7:49 ` Michal Hocko 2017-10-26 7:49 ` Michal Hocko 2017-10-26 12:45 ` Tetsuo Handa 2017-10-26 12:45 ` Tetsuo Handa 2017-10-26 14:31 ` Johannes Weiner 2017-10-26 14:31 ` Johannes Weiner 2017-10-26 19:56 ` Greg Thelen 2017-10-26 19:56 ` Greg Thelen 2017-10-27 8:20 ` Michal Hocko 2017-10-27 8:20 ` Michal Hocko 2017-10-27 20:50 ` Shakeel Butt 2017-10-27 20:50 ` Shakeel Butt 2017-10-30 8:29 ` Michal Hocko 2017-10-30 8:29 ` Michal Hocko 2017-10-30 19:28 ` Shakeel Butt 2017-10-30 19:28 ` Shakeel Butt 2017-10-31 8:00 ` Michal Hocko 2017-10-31 8:00 ` Michal Hocko 2017-10-31 16:49 ` Johannes Weiner 2017-10-31 16:49 ` Johannes Weiner 2017-10-31 18:50 ` Michal Hocko 2017-10-31 18:50 ` Michal Hocko 2017-10-24 15:45 ` Johannes Weiner 2017-10-24 15:45 ` Johannes Weiner 2017-10-24 16:30 ` Michal Hocko 2017-10-24 16:30 ` Michal Hocko 2017-10-10 23:32 ` Al Viro 2017-10-10 23:32 ` Al Viro
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20171025164402.GA11582@cmpxchg.org \ --to=hannes@cmpxchg.org \ --cc=akpm@linux-foundation.org \ --cc=gthelen@google.com \ --cc=linux-fsdevel@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-mm@kvack.org \ --cc=mhocko@kernel.org \ --cc=shakeelb@google.com \ --cc=vdavydov.dev@gmail.com \ --cc=viro@zeniv.linux.org.uk \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.