From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932211AbdJYVOL (ORCPT ); Wed, 25 Oct 2017 17:14:11 -0400 Received: from gum.cmpxchg.org ([85.214.110.215]:45200 "EHLO gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751668AbdJYVOJ (ORCPT ); Wed, 25 Oct 2017 17:14:09 -0400 Date: Wed, 25 Oct 2017 17:13:59 -0400 From: Johannes Weiner To: Michal Hocko Cc: Greg Thelen , Shakeel Butt , Alexander Viro , Vladimir Davydov , Andrew Morton , Linux MM , linux-fsdevel@vger.kernel.org, LKML Subject: Re: [PATCH] fs, mm: account filp and names caches to kmemcg Message-ID: <20171025211359.GA17899@cmpxchg.org> References: <20171024185854.GA6154@cmpxchg.org> <20171024201522.3z2fjnfywgx2egqx@dhcp22.suse.cz> <20171025071522.xyw4lsvdv4xsbhbo@dhcp22.suse.cz> <20171025131151.GA8210@cmpxchg.org> <20171025141221.xm4cqp2z6nunr6vy@dhcp22.suse.cz> <20171025164402.GA11582@cmpxchg.org> <20171025172924.i7du5wnkeihx2fgl@dhcp22.suse.cz> <20171025181106.GA14967@cmpxchg.org> <20171025190057.mqmnprhce7kvsfz7@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20171025190057.mqmnprhce7kvsfz7@dhcp22.suse.cz> User-Agent: Mutt/1.9.1 (2017-09-22) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote: > On Wed 25-10-17 14:11:06, Johannes Weiner wrote: > > "Safe" is a vague term, and it doesn't make much sense to me in this > > situation. The OOM behavior should be predictable and consistent. > > > > Yes, global might in the rarest cases also return -ENOMEM. Maybe. We > > don't have to do that in memcg because we're not physically limited. > > OK, so here seems to be the biggest disconnect. Being physically or > artificially constrained shouldn't make much difference IMHO. In both > cases the resource is simply limited for the consumer. And once all the > attempts to fit within the limit fail then the request for the resource > has to fail. It's a huge difference. In the global case, we have to make trade-offs to not deadlock the kernel. In the memcg case, we have to make a trade off between desirable OOM behavior and desirable meaning of memory.max. If we can borrow a resource temporarily from the ether to resolve the OOM situation, I don't see why we shouldn't. We're only briefly ignoring the limit to make sure the allocating task isn't preventing the OOM victim from exiting or the OOM reaper from reaping. It's more of an implementation detail than interface. The only scenario you brought up where this might be the permanent overrun is the single, oom-disabled task. And I explained why that is a silly argument, why that's the least problematic consequence of oom-disabling, and why it probably shouldn't even be configurable. The idea that memory.max must never be breached is an extreme and narrow view. As Greg points out, there are allocations we do not even track. There are other scenarios that force allocations. They may violate the limit on paper, but they're not notably weakening the goal of memory.max - isolating workloads from each other. Let's look at it this way. There are two deadlock relationships the OOM killer needs to solve between the triggerer and the potential OOM victim: #1 Memory. The triggerer needs memory that the victim has, but the victim needs some additional memory to release it. #2 Locks. The triggerer needs memory that the victim has, but the victim needs a lock the triggerer holds to release it. We have no qualms letting the victim temporarily (until the victim's exit) ignore memory.max to resolve the memory deadlock #1. I don't understand why it's such a stretch to let the triggerer temporarily (until the victim's exit) ignore memory.max to resolve the locks deadlock #2. [1] We need both for the OOM killer to function correctly. We've solved #1 both for memcg and globally. But we haven't solved #2. Global can still deadlock, and memcg copped out and returns -ENOMEM. Adding speculative OOM killing before the -ENOMEM makes things more muddy and unpredictable. It doesn't actually solve deadlock #2. [1] And arguably that's what we should be doing in the global case too: give the triggerer access to reserves. If you recall this thread here: https://patchwork.kernel.org/patch/6088511/ > > > So the only change I am really proposing is to keep retrying as long > > > as the oom killer makes a forward progress and ENOMEM otherwise. > > > > That's the behavior change I'm against. > > So just to make it clear you would be OK with the retry on successful > OOM killer invocation and force charge on oom failure, right? Yeah, that sounds reasonable to me.