From mboxrd@z Thu Jan 1 00:00:00 1970
From: "NeilBrown"
To: "Michal Hocko"
Cc: "Dave Chinner", "Vlastimil Babka", "Andrew Morton", "Theodore Ts'o",
 "Andreas Dilger", "Darrick J. Wong", "Matthew Wilcox", "Mel Gorman",
 "Jonathan Corbet", linux-xfs@vger.kernel.org, linux-ext4@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org
Subject: Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
In-reply-to:
References: <20211006231452.GF54211@dread.disaster.area>,
 <163364854551.31063.4377741712039731672@noble.neil.brown.name>,
 <20211008223649.GJ54211@dread.disaster.area>,
 <163398898675.17149.16715168325131099480@noble.neil.brown.name>,
 <163461794761.17149.1193247176490791274@noble.neil.brown.name>
Date: Fri, 22 Oct 2021 11:09:44 +1100
Message-id: <163486138482.17149.3261315484384960888@noble.neil.brown.name>

On Wed, 20 Oct 2021, Michal Hocko wrote:
> On Tue 19-10-21 15:32:27, Neil Brown wrote:

[clip lots of discussion where we are largely in agreement - happy to
see that!]

> > Presumably there is a real risk of deadlock if we just remove the
> > memory-reserves boosts of __GFP_NOFAIL.  Maybe it would be safe to
> > replace all current users of __GFP_NOFAIL with __GFP_NOFAIL|__GFP_HIGH,
> > and then remove the __GFP_HIGH where analysis suggests there is no risk
> > of deadlocks.
>
> I would much rather not bind those together and go other way around. If
> somebody can actually hit deadlocks (those are quite easy to spot as
> they do not go away) then we can talk about how to deal with them.
> Memory reserves can help only
> < this much.

I recall maybe 10 years ago Linus saying that he preferred simplicity
to mathematical provability for handling memory deadlocks (or something
like that).  I lean towards provability myself, but I do see the other
perspective.

We have mempools and they can provide strong guarantees (though they
are often over-allocated I think).  But they can be a bit clumsy.
I believe that DaveM is strongly against anything like that in the
network layer, so we strongly depend on GFP_MEMALLOC functionality for
swap-over-NFS.  I'm sure it is important elsewhere too.
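For anyone who hasn't bumped into them, the usual mempool pattern is
roughly the following - just a sketch, with the cache, the names, and
the reserve of 16 objects all made up for illustration:

#include <linux/mempool.h>
#include <linux/slab.h>

static struct kmem_cache *foo_cache;	/* made-up object cache */
static mempool_t *foo_pool;

static int foo_init(void)
{
	foo_cache = kmem_cache_create("foo", 128, 0, 0, NULL);
	if (!foo_cache)
		return -ENOMEM;
	/* Pre-allocate a reserve of 16 objects up front - this is the
	 * "over-allocated" part - so later requests can always be met. */
	foo_pool = mempool_create_slab_pool(16, foo_cache);
	if (!foo_pool) {
		kmem_cache_destroy(foo_cache);
		return -ENOMEM;
	}
	return 0;
}

static void *foo_get(void)
{
	/* With a blocking gfp mask this does not fail: if the slab
	 * allocation fails and the reserve is empty, it sleeps until
	 * someone calls foo_put(). */
	return mempool_alloc(foo_pool, GFP_NOFS);
}

static void foo_put(void *obj)
{
	mempool_free(obj, foo_pool);
}

The guarantee only covers that one pool, so every layer that might
allocate needs a pool of its own - which is where the clumsiness and
the over-allocation come from.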
Of course __GFP_HIGH and __GFP_ATOMIC provide an intermediate priority
level - more likely to fail than __GFP_MEMALLOC.  I suspect they should
not be seen as avoiding deadlock, only as improving service.  So using
them when we cannot wait might make sense, but there are probably other
circumstances.

> > Why is __GFP_NOFAIL better?
>
> Because the allocator can do something if it knows that the allocation
> cannot fail. E.g. give such an allocation a higher priority over those
> that are allowed to fail. This is not limited to memory reserves,
> although this is the only measure that is implemented currently IIRC.
> On the other hand if there is something interesting the caller can do
> directly - e.g. do internal object management like mempool does - then
> it is better to retry at that level.

It *can* do something, but I don't think it *should* do something - not
if that could have a negative impact on other threads.  Just because I
cannot fail, that doesn't mean someone else should fail to help me.
Maybe I should just wait longer.

> > > >  * Using this flag for costly allocations is _highly_ discouraged.
> >
> > This is unhelpful.  Saying something is "discouraged" carries an
> > implied threat.  This is open source and threats need to be open.
> > Why is it discouraged?  IF it is not forbidden, then it is clearly
> > permitted.  Maybe there are costs - so a clear statement of those
> > costs would be appropriate.
> > Also, what is a suitable alternative?
> >
> > Current code will trigger a WARN_ON, so it is effectively forbidden.
> > Maybe we should document that __GFP_NOFAIL is forbidden for orders
> > above 1, and that vmalloc() should be used instead (thanks for
> > proposing that patch!).
>
> I think we want to recommend kvmalloc as an alternative once vmalloc is
> NOFAIL aware.
>
> I will skip over some of the specifics regarding SLAB and NOFS usage if
> you do not mind and focus on points that have direct documentation
> consequences. Also I do not feel qualified commenting on either SLAB
> or FS internals.
>
> [...]
>
> > There is a lot of stuff there.... the bits that are important to me are:
> >
> >  - why is __GFP_NOFAIL preferred?  It is a valuable convenience, but I
> >    don't see that it is necessary
>
> I think it is preferred for one and a half reasons. It tells the
> allocator that this allocation cannot really fail and the caller
> doesn't have a very good/clever retry policy (e.g. like mempools
> mentioned above). The half reason would be for tracking purposes:
> git grep __GFP_NOFAIL is easier than trying to catch all sorts of
> while loops over allocation which do not do anything really
> interesting.

I think the one reason is misguided, as described above.
I think the half reason is good, and that we should introduce
memalloc_retry_wait() and encourage developers to use that for any
memalloc retry loop (see the sketch near the end of this mail).
__GFP_NOFAIL would then be a convenience flag which causes the
allocator (slab or alloc_page or whatever) to call
memalloc_retry_wait() and do the loop internally.
What exactly memalloc_retry_wait() does (if anything) can be decided
separately and changed as needed.

> >  - is it reasonable to use __GFP_HIGH when looping if there is a risk
> >    of deadlock?
>
> As I've said above. Memory reserves are a finite resource and as such
> they cannot fundamentally solve deadlocks. They can help prioritize
> though.

To be fair, they can solve 1 level of deadlock.  i.e. if you only need
to make one allocation to guarantee progress, then allocating from
reserves can help.  If you might need to make a second allocation
without freeing the first, then a single reserve pool cannot provide
guarantees (which is why we use mempools in layered block devices - md
over dm over loop over scsi).

> >  - Will __GFP_DIRECT_RECLAIM always result in a delay before failure?
> >    In that case it should be safe to loop around allocations using
> >    __GFP_DIRECT_RECLAIM without needing congestion_wait() (so it can
> >    just be removed).
>
> This is a good question and I do not think we have that documented
> anywhere. We do cond_resched() for sure. I do not think we guarantee a
> sleeping point in general. Maybe we should, I am not really sure.

If we add memalloc_retry_wait(), it wouldn't matter.  We would only
need to ensure that memalloc_retry_wait() waited if page_alloc didn't.

I think we should:
 - introduce memalloc_retry_wait() and use it for all memalloc retry
   loops, including __GFP_NOFAIL
 - drop all the priority boosts added for __GFP_NOFAIL
 - drop __GFP_ATOMIC and change all the code that tests for __GFP_ATOMIC
   to instead test for __GFP_HIGH.  __GFP_ATOMIC is NEVER used without
   __GFP_HIGH.  This gives a slight boost to several sites that use
   __GFP_HIGH explicitly.
 - choose a consistent order threshold for disallowing __GFP_NOFAIL
   (rmqueue uses "order > 1", __alloc_pages_slowpath uses
   "order > PAGE_ALLOC_COSTLY_ORDER"), test it once - early - and
   document kvmalloc as an alternative.
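To be concrete, the caller-side pattern I mean is roughly the
following.  This is only a sketch: memalloc_retry_wait() doesn't exist
yet, so the stub below, its signature and the placeholder back-off are
just illustrations of the idea, not a proposed implementation.

#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/jiffies.h>

/* Stand-in for the proposed helper, only so the sketch is
 * self-contained.  The real policy (whether to wait at all, and for
 * what) would live in MM and could change without touching callers. */
static inline void memalloc_retry_wait(gfp_t gfp_flags)
{
	schedule_timeout_uninterruptible(HZ / 50);	/* placeholder back-off */
}

/* An open-coded "cannot fail" loop for a caller that doesn't want to
 * pass __GFP_NOFAIL. */
static void *must_alloc(size_t size)
{
	void *p;

	while (!(p = kmalloc(size, GFP_NOFS)))
		memalloc_retry_wait(GFP_NOFS);

	return p;
}

kmalloc(size, GFP_NOFS | __GFP_NOFAIL) would then be exactly this loop
done inside slab, with the waiting policy kept in one place.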
Code can also loop if there is an alternative strategy for freeing up
memory.

Thanks,
NeilBrown