* [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL @ 2021-09-17 2:56 NeilBrown 2021-09-17 2:56 ` [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() NeilBrown ` (5 more replies) 0 siblings, 6 replies; 45+ messages in thread From: NeilBrown @ 2021-09-17 2:56 UTC (permalink / raw) To: Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko, . Dave Chinner, Jonathan Corbet Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc This second version: - add recipients for the Documentation/core-api changes - add fix for __alloc_pages_bulk() to handle GFP_NOFAIL - drops the annotations for congestion_wait() as being ineffective as that isn't really useful until an alternative is available - changes to GFP_NOFAIL documentation changes to focus on the possible deadlocks rather than the use of memory reserves - Improves ext4 and xfs patches based on feedback from Ted and Dave. The patches are independent, except that the last patch depends on the first. As mentioned last time: These are the easy bits. There are 5 calls to congestion_wait() and one to wait_iff_congested() in mm/ which need consideration. There are multiple calls to congestion_wait in fs/, particularly fs/f2fs/ which need to be addressed too. I'll try to form an opinion about these in coming weeks. (other interesting comment in original cover letter just duplicates observations made in the commit messages of individual patches). NeilBrown --- NeilBrown (6): MM: Support __GFP_NOFAIL in alloc_pages_bulk_*() and improve doco MM: improve documentation for __GFP_NOFAIL EXT4: Remove ENOMEM/congestion_wait() loops. EXT4: remove congestion_wait from ext4_bio_write_page, and simplify XFS: remove congestion_wait() loop from kmem_alloc() XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() Documentation/core-api/memory-allocation.rst | 25 ++++++++- fs/ext4/ext4.h | 2 +- fs/ext4/ext4_jbd2.c | 4 +- fs/ext4/ext4_jbd2.h | 14 +++--- fs/ext4/extents.c | 53 ++++++++------------ fs/ext4/extents_status.c | 35 +++++++------ fs/ext4/extents_status.h | 2 +- fs/ext4/ialloc.c | 3 +- fs/ext4/indirect.c | 2 +- fs/ext4/inode.c | 6 +-- fs/ext4/ioctl.c | 4 +- fs/ext4/page-io.c | 13 ++--- fs/ext4/super.c | 2 +- fs/jbd2/transaction.c | 8 +-- fs/xfs/kmem.c | 19 +++---- fs/xfs/xfs_buf.c | 14 +++--- include/linux/gfp.h | 6 ++- 17 files changed, 113 insertions(+), 99 deletions(-) -- Signature ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() 2021-09-17 2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown @ 2021-09-17 2:56 ` NeilBrown 2021-09-17 2:56 ` [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() NeilBrown ` (4 subsequent siblings) 5 siblings, 0 replies; 45+ messages in thread From: NeilBrown @ 2021-09-17 2:56 UTC (permalink / raw) To: Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko, . Dave Chinner, Jonathan Corbet Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc Documentation commment in gfp.h discourages indefinite retry loops on ENOMEM and says of __GFP_NOFAIL that it is definitely preferable to use the flag rather than opencode endless loop around allocator. congestion_wait() is indistinguishable from schedule_timeout_uninterruptible() in practice and it is not a good way to wait for memory to become available. So add __GFP_NOFAIL to gfp if failure is not an option, and remove the congestion_wait(). We now only loop when failure is an option, and alloc_bulk_pages_array() made some progres, but not enough. Signed-off-by: NeilBrown <neilb@suse.de> --- fs/xfs/xfs_buf.c | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index 5fa6cd947dd4..b19ab52c551b 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -352,7 +352,7 @@ xfs_buf_alloc_pages( if (flags & XBF_READ_AHEAD) gfp_mask |= __GFP_NORETRY; else - gfp_mask |= GFP_NOFS; + gfp_mask |= GFP_NOFS | __GFP_NOFAIL; /* Make sure that we have a page list */ bp->b_page_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE); @@ -372,8 +372,9 @@ xfs_buf_alloc_pages( /* * Bulk filling of pages can take multiple calls. Not filling the entire - * array is not an allocation failure, so don't back off if we get at - * least one extra page. + * array is not an allocation failure but is worth counting in + * xb_pages_retries statistics. If we don't even get one page, + * then this must be a READ_AHEAD and we should abort. */ for (;;) { long last = filled; @@ -385,16 +386,13 @@ xfs_buf_alloc_pages( break; } - if (filled != last) - continue; - - if (flags & XBF_READ_AHEAD) { + if (filled == last) { + ASSERT(flags & XBF_READ_AHEAD); xfs_buf_free_pages(bp); return -ENOMEM; } XFS_STATS_INC(bp->b_mount, xb_page_retries); - congestion_wait(BLK_RW_ASYNC, HZ / 50); } return 0; } ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() 2021-09-17 2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown 2021-09-17 2:56 ` [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() NeilBrown @ 2021-09-17 2:56 ` NeilBrown 2021-09-17 21:45 ` Dave Chinner 2021-09-17 2:56 ` [PATCH 1/6] MM: Support __GFP_NOFAIL in alloc_pages_bulk_*() and improve doco NeilBrown ` (3 subsequent siblings) 5 siblings, 1 reply; 45+ messages in thread From: NeilBrown @ 2021-09-17 2:56 UTC (permalink / raw) To: Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko, . Dave Chinner, Jonathan Corbet Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc Documentation commment in gfp.h discourages indefinite retry loops on ENOMEM and says of __GFP_NOFAIL that it is definitely preferable to use the flag rather than opencode endless loop around allocator. So remove the loop, instead specifying __GFP_NOFAIL if KM_MAYFAIL was not given. As we no longer have the opportunity to report a warning after some failures, clear __GFP_NOWARN so that the default warning (rate-limited to 1 ever 10 seconds) will be reported instead. Signed-off-by: NeilBrown <neilb@suse.de> --- fs/xfs/kmem.c | 19 ++++++------------- 1 file changed, 6 insertions(+), 13 deletions(-) diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c index 6f49bf39183c..575a58e61391 100644 --- a/fs/xfs/kmem.c +++ b/fs/xfs/kmem.c @@ -11,21 +11,14 @@ void * kmem_alloc(size_t size, xfs_km_flags_t flags) { - int retries = 0; gfp_t lflags = kmem_flags_convert(flags); - void *ptr; trace_kmem_alloc(size, flags, _RET_IP_); - do { - ptr = kmalloc(size, lflags); - if (ptr || (flags & KM_MAYFAIL)) - return ptr; - if (!(++retries % 100)) - xfs_err(NULL, - "%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)", - current->comm, current->pid, - (unsigned int)size, __func__, lflags); - congestion_wait(BLK_RW_ASYNC, HZ/50); - } while (1); + if (!(flags & KM_MAYFAIL)) { + lflags |= __GFP_NOFAIL; + lflags &= ~__GFP_NOWARN; + } + + return kmalloc(size, lflags); } ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() 2021-09-17 2:56 ` [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() NeilBrown @ 2021-09-17 21:45 ` Dave Chinner 0 siblings, 0 replies; 45+ messages in thread From: Dave Chinner @ 2021-09-17 21:45 UTC (permalink / raw) To: NeilBrown Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Fri, Sep 17, 2021 at 12:56:57PM +1000, NeilBrown wrote: > Documentation commment in gfp.h discourages indefinite retry loops on > ENOMEM and says of __GFP_NOFAIL that it > > is definitely preferable to use the flag rather than opencode > endless loop around allocator. > > So remove the loop, instead specifying __GFP_NOFAIL if KM_MAYFAIL was > not given. > > As we no longer have the opportunity to report a warning after some > failures, clear __GFP_NOWARN so that the default warning (rate-limited > to 1 ever 10 seconds) will be reported instead. > > Signed-off-by: NeilBrown <neilb@suse.de> > --- > fs/xfs/kmem.c | 19 ++++++------------- > 1 file changed, 6 insertions(+), 13 deletions(-) > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c > index 6f49bf39183c..575a58e61391 100644 > --- a/fs/xfs/kmem.c > +++ b/fs/xfs/kmem.c > @@ -11,21 +11,14 @@ > void * > kmem_alloc(size_t size, xfs_km_flags_t flags) > { > - int retries = 0; > gfp_t lflags = kmem_flags_convert(flags); > - void *ptr; > > trace_kmem_alloc(size, flags, _RET_IP_); > > - do { > - ptr = kmalloc(size, lflags); > - if (ptr || (flags & KM_MAYFAIL)) > - return ptr; > - if (!(++retries % 100)) > - xfs_err(NULL, > - "%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)", > - current->comm, current->pid, > - (unsigned int)size, __func__, lflags); > - congestion_wait(BLK_RW_ASYNC, HZ/50); > - } while (1); > + if (!(flags & KM_MAYFAIL)) { > + lflags |= __GFP_NOFAIL; > + lflags &= ~__GFP_NOWARN; > + } This logic should really be in kmem_flags_convert() where the gfp flags are set up. kmem_flags_convert() is only called by kmem_alloc() now so you should just be able to hack that logic to do exactly what is necessary. FWIW, We've kinda not been caring about warts in this code because the next step for kmem_alloc is to remove kmem_alloc/kmem_zalloc completely and replace all the callers with kmalloc/kzalloc being passed the correct gfp flags. There's about 30 kmem_alloc() callers and 45 kmem_zalloc() calls left to be converted before kmem.[ch] can go away completely.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 1/6] MM: Support __GFP_NOFAIL in alloc_pages_bulk_*() and improve doco 2021-09-17 2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown 2021-09-17 2:56 ` [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() NeilBrown 2021-09-17 2:56 ` [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() NeilBrown @ 2021-09-17 2:56 ` NeilBrown 2021-09-17 14:42 ` Mel Gorman 2021-09-17 2:56 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown ` (2 subsequent siblings) 5 siblings, 1 reply; 45+ messages in thread From: NeilBrown @ 2021-09-17 2:56 UTC (permalink / raw) To: Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko, . Dave Chinner, Jonathan Corbet Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc When alloc_pages_bulk_array() is called on an array that is partially allocated, the level of effort to get a single page is less than when the array was completely unallocated. This behaviour is inconsistent, but now fixed. One effect if this is that __GFP_NOFAIL will not ensure at least one page is allocated. Also clarify the expected success rate. __alloc_pages_bulk() will allocated one page according to @gfp, and may allocate more if that can be done cheaply. It is assumed that the caller values cheap allocation where possible and may decide to use what it has got, or to call again to get more. Acked-by: Mel Gorman <mgorman@suse.com> Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator") Signed-off-by: NeilBrown <neilb@suse.de> --- mm/page_alloc.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b37435c274cf..aa51016e49c5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5191,6 +5191,11 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order, * is the maximum number of pages that will be stored in the array. * * Returns the number of pages on the list or array. + * + * At least one page will be allocated if that is possible while + * remaining consistent with @gfp. Extra pages up to the requested + * total will be allocated opportunistically when doing so is + * significantly cheaper than having the caller repeat the request. */ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, nodemask_t *nodemask, int nr_pages, @@ -5292,7 +5297,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, pcp, pcp_list); if (unlikely(!page)) { /* Try and get at least one page */ - if (!nr_populated) + if (!nr_account) goto failed_irq; break; } ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH 1/6] MM: Support __GFP_NOFAIL in alloc_pages_bulk_*() and improve doco 2021-09-17 2:56 ` [PATCH 1/6] MM: Support __GFP_NOFAIL in alloc_pages_bulk_*() and improve doco NeilBrown @ 2021-09-17 14:42 ` Mel Gorman 2021-09-20 23:48 ` NeilBrown 2021-10-05 9:16 ` Vlastimil Babka 0 siblings, 2 replies; 45+ messages in thread From: Mel Gorman @ 2021-09-17 14:42 UTC (permalink / raw) To: NeilBrown Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Michal Hocko, Jesper Dangaard Brouer, . Dave Chinner, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc I'm top-posting to cc Jesper with full context of the patch. I don't have a problem with this patch other than the Fixes: being a bit marginal, I should have acked as Mel Gorman <mgorman@suse.de> and the @gfp in the comment should have been @gfp_mask. However, an assumption the API design made was that it should fail fast if memory is not quickly available but have at least one page in the array. I don't think the network use case cares about the situation where the array is already populated but I'd like Jesper to have the opportunity to think about it. It's possible he would prefer it's explicit and the check becomes (!nr_populated || ((gfp_mask & __GFP_NOFAIL) && !nr_account)) to state that __GFP_NOFAIL users are willing to take a potential latency penalty if the array is already partially populated but !__GFP_NOFAIL users would prefer fail-fast behaviour. I'm on the fence because while I wrote the implementation, it was based on other peoples requirements. On Fri, Sep 17, 2021 at 12:56:57PM +1000, NeilBrown wrote: > When alloc_pages_bulk_array() is called on an array that is partially > allocated, the level of effort to get a single page is less than when > the array was completely unallocated. This behaviour is inconsistent, > but now fixed. One effect if this is that __GFP_NOFAIL will not ensure > at least one page is allocated. > > Also clarify the expected success rate. __alloc_pages_bulk() will > allocated one page according to @gfp, and may allocate more if that can > be done cheaply. It is assumed that the caller values cheap allocation > where possible and may decide to use what it has got, or to call again > to get more. > > Acked-by: Mel Gorman <mgorman@suse.com> > Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator") > Signed-off-by: NeilBrown <neilb@suse.de> > --- > mm/page_alloc.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index b37435c274cf..aa51016e49c5 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -5191,6 +5191,11 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order, > * is the maximum number of pages that will be stored in the array. > * > * Returns the number of pages on the list or array. > + * > + * At least one page will be allocated if that is possible while > + * remaining consistent with @gfp. Extra pages up to the requested > + * total will be allocated opportunistically when doing so is > + * significantly cheaper than having the caller repeat the request. > */ > unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > nodemask_t *nodemask, int nr_pages, > @@ -5292,7 +5297,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > pcp, pcp_list); > if (unlikely(!page)) { > /* Try and get at least one page */ > - if (!nr_populated) > + if (!nr_account) > goto failed_irq; > break; > } > > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 1/6] MM: Support __GFP_NOFAIL in alloc_pages_bulk_*() and improve doco 2021-09-17 14:42 ` Mel Gorman @ 2021-09-20 23:48 ` NeilBrown 2021-10-05 9:16 ` Vlastimil Babka 1 sibling, 0 replies; 45+ messages in thread From: NeilBrown @ 2021-09-20 23:48 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Michal Hocko, Jesper Dangaard Brouer, Dave Chinner, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Sat, 18 Sep 2021, Mel Gorman wrote: > I'm top-posting to cc Jesper with full context of the patch. I don't > have a problem with this patch other than the Fixes: being a bit > marginal, I should have acked as Mel Gorman <mgorman@suse.de> and the > @gfp in the comment should have been @gfp_mask. > > However, an assumption the API design made was that it should fail fast > if memory is not quickly available but have at least one page in the > array. I don't think the network use case cares about the situation where > the array is already populated but I'd like Jesper to have the opportunity > to think about it. It's possible he would prefer it's explicit and the > check becomes > (!nr_populated || ((gfp_mask & __GFP_NOFAIL) && !nr_account)) to > state that __GFP_NOFAIL users are willing to take a potential latency > penalty if the array is already partially populated but !__GFP_NOFAIL > users would prefer fail-fast behaviour. I'm on the fence because while > I wrote the implementation, it was based on other peoples requirements. I can see that it could be desirable to not try too hard when we already have pages allocated, but maybe the best way to achieve that is for the called to clear __GFP_RECLAIM in that case. Alternately, callers that really want the __GFP_RECLAIM and __GFP_NOFAIL flags to be honoured could ensure that the array passed in is empty. That wouldn't be difficult (for current callers). In either case, the documentation should make it clear which flags are honoured when. Let's see what Jesper has to say. Thanks, NeilBrown > > On Fri, Sep 17, 2021 at 12:56:57PM +1000, NeilBrown wrote: > > When alloc_pages_bulk_array() is called on an array that is partially > > allocated, the level of effort to get a single page is less than when > > the array was completely unallocated. This behaviour is inconsistent, > > but now fixed. One effect if this is that __GFP_NOFAIL will not ensure > > at least one page is allocated. > > > > Also clarify the expected success rate. __alloc_pages_bulk() will > > allocated one page according to @gfp, and may allocate more if that can > > be done cheaply. It is assumed that the caller values cheap allocation > > where possible and may decide to use what it has got, or to call again > > to get more. > > > > Acked-by: Mel Gorman <mgorman@suse.com> > > Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator") > > Signed-off-by: NeilBrown <neilb@suse.de> > > --- > > mm/page_alloc.c | 7 ++++++- > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index b37435c274cf..aa51016e49c5 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -5191,6 +5191,11 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order, > > * is the maximum number of pages that will be stored in the array. > > * > > * Returns the number of pages on the list or array. > > + * > > + * At least one page will be allocated if that is possible while > > + * remaining consistent with @gfp. Extra pages up to the requested > > + * total will be allocated opportunistically when doing so is > > + * significantly cheaper than having the caller repeat the request. > > */ > > unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > > nodemask_t *nodemask, int nr_pages, > > @@ -5292,7 +5297,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, > > pcp, pcp_list); > > if (unlikely(!page)) { > > /* Try and get at least one page */ > > - if (!nr_populated) > > + if (!nr_account) > > goto failed_irq; > > break; > > } > > > > > > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 1/6] MM: Support __GFP_NOFAIL in alloc_pages_bulk_*() and improve doco 2021-09-17 14:42 ` Mel Gorman 2021-09-20 23:48 ` NeilBrown @ 2021-10-05 9:16 ` Vlastimil Babka 1 sibling, 0 replies; 45+ messages in thread From: Vlastimil Babka @ 2021-10-05 9:16 UTC (permalink / raw) To: Mel Gorman, NeilBrown Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Michal Hocko, Jesper Dangaard Brouer, . Dave Chinner, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On 9/17/21 16:42, Mel Gorman wrote: > I'm top-posting to cc Jesper with full context of the patch. I don't > have a problem with this patch other than the Fixes: being a bit > marginal, I should have acked as Mel Gorman <mgorman@suse.de> and the > @gfp in the comment should have been @gfp_mask. > > However, an assumption the API design made was that it should fail fast > if memory is not quickly available but have at least one page in the > array. I don't think the network use case cares about the situation where > the array is already populated but I'd like Jesper to have the opportunity > to think about it. It's possible he would prefer it's explicit and the > check becomes > (!nr_populated || ((gfp_mask & __GFP_NOFAIL) && !nr_account)) to Note that AFAICS nr_populated is an incomplete piece of information, as we initially only count pages in the page_array as nr_populated up to the first NULL pointer. So even before Neil's patch we could decide to allocate even if there are pre-existing pages, but placed later in the array. Which could be rather common if the array consumer starts from index 0? So with Neil's patch this at least becomes consistent, while the check suggested by Mel leaves there the weird dependency on where pre-existing pages appear in the page_array. ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-17 2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown ` (2 preceding siblings ...) 2021-09-17 2:56 ` [PATCH 1/6] MM: Support __GFP_NOFAIL in alloc_pages_bulk_*() and improve doco NeilBrown @ 2021-09-17 2:56 ` NeilBrown 2021-09-17 2:56 ` [PATCH 4/6] EXT4: remove congestion_wait from ext4_bio_write_page, and simplify NeilBrown 2021-09-17 2:56 ` [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL NeilBrown 5 siblings, 0 replies; 45+ messages in thread From: NeilBrown @ 2021-09-17 2:56 UTC (permalink / raw) To: Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko, . Dave Chinner, Jonathan Corbet Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc Indefinite loops waiting for memory allocation are discouraged by documentation in gfp.h which says the use of __GFP_NOFAIL that it is definitely preferable to use the flag rather than opencode endless loop around allocator. Such loops that use congestion_wait() are particularly unwise as congestion_wait() is indistinguishable from schedule_timeout_uninterruptible() in practice - and should be deprecated. So this patch changes the two loops in ext4_ext_truncate() to use __GFP_NOFAIL instead of looping. As the allocation is multiple layers deeper in the call stack, this requires passing the EXT4_EX_NOFAIL flag down and handling it in various places. __ext4_journal_start_sb() is given a new 'gfp_mask' argument which is passed through to jbd2__journal_start() and always receives GFP_NOFS except when called through ext4_journal_state_with_revoke from ext4_ext_remove_space(), which now can include __GFP_NOFAIL. jbd2__journal_start() is enhanced so that the gfp_t flags passed are used for *all* allocations. Signed-off-by: NeilBrown <neilb@suse.de> --- fs/ext4/ext4.h | 2 +- fs/ext4/ext4_jbd2.c | 4 ++- fs/ext4/ext4_jbd2.h | 14 +++++++----- fs/ext4/extents.c | 53 +++++++++++++++++++--------------------------- fs/ext4/extents_status.c | 35 +++++++++++++++++------------- fs/ext4/extents_status.h | 2 +- fs/ext4/ialloc.c | 3 ++- fs/ext4/indirect.c | 2 +- fs/ext4/inode.c | 6 +++-- fs/ext4/ioctl.c | 4 ++- fs/ext4/super.c | 2 +- fs/jbd2/transaction.c | 8 +++---- 12 files changed, 67 insertions(+), 68 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 90ff5acaf11f..52a34f5dfda2 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -3720,7 +3720,7 @@ extern int ext4_ext_map_blocks(handle_t *handle, struct inode *inode, struct ext4_map_blocks *map, int flags); extern int ext4_ext_truncate(handle_t *, struct inode *); extern int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, - ext4_lblk_t end); + ext4_lblk_t end, int nofail); extern void ext4_ext_init(struct super_block *); extern void ext4_ext_release(struct super_block *); extern long ext4_fallocate(struct file *file, int mode, loff_t offset, diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c index 6def7339056d..780fde556dc0 100644 --- a/fs/ext4/ext4_jbd2.c +++ b/fs/ext4/ext4_jbd2.c @@ -88,7 +88,7 @@ static int ext4_journal_check_start(struct super_block *sb) handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line, int type, int blocks, int rsv_blocks, - int revoke_creds) + int revoke_creds, gfp_t gfp_mask) { journal_t *journal; int err; @@ -103,7 +103,7 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line, if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)) return ext4_get_nojournal(); return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds, - GFP_NOFS, type, line); + gfp_mask, type, line); } int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle) diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h index 0e4fa644df01..da79be8ffff4 100644 --- a/fs/ext4/ext4_jbd2.h +++ b/fs/ext4/ext4_jbd2.h @@ -263,7 +263,7 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line, handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line, int type, int blocks, int rsv_blocks, - int revoke_creds); + int revoke_creds, gfp_t gfp_mask); int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle); #define EXT4_NOJOURNAL_MAX_REF_COUNT ((unsigned long) 4096) @@ -304,7 +304,8 @@ static inline int ext4_trans_default_revoke_credits(struct super_block *sb) #define ext4_journal_start_sb(sb, type, nblocks) \ __ext4_journal_start_sb((sb), __LINE__, (type), (nblocks), 0, \ - ext4_trans_default_revoke_credits(sb)) + ext4_trans_default_revoke_credits(sb), \ + GFP_NOFS) #define ext4_journal_start(inode, type, nblocks) \ __ext4_journal_start((inode), __LINE__, (type), (nblocks), 0, \ @@ -314,9 +315,9 @@ static inline int ext4_trans_default_revoke_credits(struct super_block *sb) __ext4_journal_start((inode), __LINE__, (type), (blocks), (rsv_blocks),\ ext4_trans_default_revoke_credits((inode)->i_sb)) -#define ext4_journal_start_with_revoke(inode, type, blocks, revoke_creds) \ - __ext4_journal_start((inode), __LINE__, (type), (blocks), 0, \ - (revoke_creds)) +#define ext4_journal_start_with_revoke(gfp_flags, inode, type, blocks, revoke_creds) \ + __ext4_journal_start_sb((inode)->i_sb, __LINE__, (type), (blocks),\ + 0, (revoke_creds), (gfp_flags)) static inline handle_t *__ext4_journal_start(struct inode *inode, unsigned int line, int type, @@ -324,7 +325,8 @@ static inline handle_t *__ext4_journal_start(struct inode *inode, int revoke_creds) { return __ext4_journal_start_sb(inode->i_sb, line, type, blocks, - rsv_blocks, revoke_creds); + rsv_blocks, revoke_creds, + GFP_NOFS); } #define ext4_journal_stop(handle) \ diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index c0de30f25185..c6c16aa8b618 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -1488,7 +1488,7 @@ static int ext4_ext_search_left(struct inode *inode, static int ext4_ext_search_right(struct inode *inode, struct ext4_ext_path *path, ext4_lblk_t *logical, ext4_fsblk_t *phys, - struct ext4_extent *ret_ex) + struct ext4_extent *ret_ex, int nofail) { struct buffer_head *bh = NULL; struct ext4_extent_header *eh; @@ -1565,7 +1565,7 @@ static int ext4_ext_search_right(struct inode *inode, while (++depth < path->p_depth) { /* subtract from p_depth to get proper eh_depth */ bh = read_extent_tree_block(inode, block, - path->p_depth - depth, 0); + path->p_depth - depth, nofail); if (IS_ERR(bh)) return PTR_ERR(bh); eh = ext_block_hdr(bh); @@ -1574,7 +1574,7 @@ static int ext4_ext_search_right(struct inode *inode, put_bh(bh); } - bh = read_extent_tree_block(inode, block, path->p_depth - depth, 0); + bh = read_extent_tree_block(inode, block, path->p_depth - depth, nofail); if (IS_ERR(bh)) return PTR_ERR(bh); eh = ext_block_hdr(bh); @@ -2773,7 +2773,7 @@ ext4_ext_more_to_rm(struct ext4_ext_path *path) } int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, - ext4_lblk_t end) + ext4_lblk_t end, int nofail) { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); int depth = ext_depth(inode); @@ -2781,6 +2781,10 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, struct partial_cluster partial; handle_t *handle; int i = 0, err = 0; + gfp_t gfp_mask = GFP_NOFS; + + if (nofail) + gfp_mask |= __GFP_NOFAIL; partial.pclu = 0; partial.lblk = 0; @@ -2789,7 +2793,8 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, ext_debug(inode, "truncate since %u to %u\n", start, end); /* probably first extent we're gonna free will be last in block */ - handle = ext4_journal_start_with_revoke(inode, EXT4_HT_TRUNCATE, + handle = ext4_journal_start_with_revoke(gfp_mask, + inode, EXT4_HT_TRUNCATE, depth + 1, ext4_free_metadata_revoke_credits(inode->i_sb, depth)); if (IS_ERR(handle)) @@ -2877,7 +2882,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, */ lblk = ex_end + 1; err = ext4_ext_search_right(inode, path, &lblk, &pblk, - NULL); + NULL, nofail); if (err < 0) goto out; if (pblk) { @@ -2899,10 +2904,6 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, } else { path = kcalloc(depth + 1, sizeof(struct ext4_ext_path), GFP_NOFS | __GFP_NOFAIL); - if (path == NULL) { - ext4_journal_stop(handle); - return -ENOMEM; - } path[0].p_maxdepth = path[0].p_depth = depth; path[0].p_hdr = ext_inode_hdr(inode); i = 0; @@ -2955,7 +2956,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, memset(path + i + 1, 0, sizeof(*path)); bh = read_extent_tree_block(inode, ext4_idx_pblock(path[i].p_idx), depth - i - 1, - EXT4_EX_NOCACHE); + EXT4_EX_NOCACHE | nofail); if (IS_ERR(bh)) { /* should we reset i_size? */ err = PTR_ERR(bh); @@ -4186,7 +4187,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode, if (err) goto out; ar.lright = map->m_lblk; - err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright, &ex2); + err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright, &ex2, 0); if (err < 0) goto out; @@ -4368,23 +4369,13 @@ int ext4_ext_truncate(handle_t *handle, struct inode *inode) last_block = (inode->i_size + sb->s_blocksize - 1) >> EXT4_BLOCK_SIZE_BITS(sb); -retry: err = ext4_es_remove_extent(inode, last_block, - EXT_MAX_BLOCKS - last_block); - if (err == -ENOMEM) { - cond_resched(); - congestion_wait(BLK_RW_ASYNC, HZ/50); - goto retry; - } + EXT_MAX_BLOCKS - last_block, + EXT4_EX_NOFAIL); if (err) return err; -retry_remove_space: - err = ext4_ext_remove_space(inode, last_block, EXT_MAX_BLOCKS - 1); - if (err == -ENOMEM) { - cond_resched(); - congestion_wait(BLK_RW_ASYNC, HZ/50); - goto retry_remove_space; - } + err = ext4_ext_remove_space(inode, last_block, EXT_MAX_BLOCKS - 1, + EXT4_EX_NOFAIL); return err; } @@ -5322,13 +5313,13 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len) ext4_discard_preallocations(inode, 0); ret = ext4_es_remove_extent(inode, punch_start, - EXT_MAX_BLOCKS - punch_start); + EXT_MAX_BLOCKS - punch_start, 0); if (ret) { up_write(&EXT4_I(inode)->i_data_sem); goto out_stop; } - ret = ext4_ext_remove_space(inode, punch_start, punch_stop - 1); + ret = ext4_ext_remove_space(inode, punch_start, punch_stop - 1, 0); if (ret) { up_write(&EXT4_I(inode)->i_data_sem); goto out_stop; @@ -5510,7 +5501,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) } ret = ext4_es_remove_extent(inode, offset_lblk, - EXT_MAX_BLOCKS - offset_lblk); + EXT_MAX_BLOCKS - offset_lblk, 0); if (ret) { up_write(&EXT4_I(inode)->i_data_sem); goto out_stop; @@ -5574,10 +5565,10 @@ ext4_swap_extents(handle_t *handle, struct inode *inode1, BUG_ON(!inode_is_locked(inode1)); BUG_ON(!inode_is_locked(inode2)); - *erp = ext4_es_remove_extent(inode1, lblk1, count); + *erp = ext4_es_remove_extent(inode1, lblk1, count, 0); if (unlikely(*erp)) return 0; - *erp = ext4_es_remove_extent(inode2, lblk2, count); + *erp = ext4_es_remove_extent(inode2, lblk2, count, 0); if (unlikely(*erp)) return 0; diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c index 9a3a8996aacf..7f7711a2ea44 100644 --- a/fs/ext4/extents_status.c +++ b/fs/ext4/extents_status.c @@ -144,9 +144,10 @@ static struct kmem_cache *ext4_es_cachep; static struct kmem_cache *ext4_pending_cachep; -static int __es_insert_extent(struct inode *inode, struct extent_status *newes); +static int __es_insert_extent(struct inode *inode, struct extent_status *newes, + int nofail); static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk, - ext4_lblk_t end, int *reserved); + ext4_lblk_t end, int *reserved, int nofail); static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan); static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan, struct ext4_inode_info *locked_ei); @@ -452,10 +453,11 @@ static void ext4_es_list_del(struct inode *inode) static struct extent_status * ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len, - ext4_fsblk_t pblk) + ext4_fsblk_t pblk, int nofail) { struct extent_status *es; - es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC); + es = kmem_cache_alloc(ext4_es_cachep, + GFP_ATOMIC | (nofail ? __GFP_NOFAIL : 0)); if (es == NULL) return NULL; es->es_lblk = lblk; @@ -754,7 +756,8 @@ static inline void ext4_es_insert_extent_check(struct inode *inode, } #endif -static int __es_insert_extent(struct inode *inode, struct extent_status *newes) +static int __es_insert_extent(struct inode *inode, struct extent_status *newes, + int nofail) { struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree; struct rb_node **p = &tree->root.rb_node; @@ -795,7 +798,7 @@ static int __es_insert_extent(struct inode *inode, struct extent_status *newes) } es = ext4_es_alloc_extent(inode, newes->es_lblk, newes->es_len, - newes->es_pblk); + newes->es_pblk, nofail); if (!es) return -ENOMEM; rb_link_node(&es->rb_node, parent, p); @@ -848,11 +851,11 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk, ext4_es_insert_extent_check(inode, &newes); write_lock(&EXT4_I(inode)->i_es_lock); - err = __es_remove_extent(inode, lblk, end, NULL); + err = __es_remove_extent(inode, lblk, end, NULL, 0); if (err != 0) goto error; retry: - err = __es_insert_extent(inode, &newes); + err = __es_insert_extent(inode, &newes, 0); if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb), 128, EXT4_I(inode))) goto retry; @@ -902,7 +905,7 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk, es = __es_tree_search(&EXT4_I(inode)->i_es_tree.root, lblk); if (!es || es->es_lblk > end) - __es_insert_extent(inode, &newes); + __es_insert_extent(inode, &newes, 0); write_unlock(&EXT4_I(inode)->i_es_lock); } @@ -1294,6 +1297,7 @@ static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end, * @lblk - first block in range * @end - last block in range * @reserved - number of cluster reservations released + * @nofail - EXT4_EX_NOFAIL if __GFP_NOFAIL should be used * * If @reserved is not NULL and delayed allocation is enabled, counts * block/cluster reservations freed by removing range and if bigalloc @@ -1301,7 +1305,7 @@ static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end, * error code on failure. */ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk, - ext4_lblk_t end, int *reserved) + ext4_lblk_t end, int *reserved, int nofail) { struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree; struct rb_node *node; @@ -1350,7 +1354,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk, orig_es.es_len - len2; ext4_es_store_pblock_status(&newes, block, ext4_es_status(&orig_es)); - err = __es_insert_extent(inode, &newes); + err = __es_insert_extent(inode, &newes, nofail); if (err) { es->es_lblk = orig_es.es_lblk; es->es_len = orig_es.es_len; @@ -1426,12 +1430,13 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk, * @inode - file containing range * @lblk - first block in range * @len - number of blocks to remove + * @nofail - EXT4_EX_NOFAIL if __GFP_NOFAIL should be used * * Reduces block/cluster reservation count and for bigalloc cancels pending * reservations as needed. Returns 0 on success, error code on failure. */ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk, - ext4_lblk_t len) + ext4_lblk_t len, int nofail) { ext4_lblk_t end; int err = 0; @@ -1456,7 +1461,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk, * is reclaimed. */ write_lock(&EXT4_I(inode)->i_es_lock); - err = __es_remove_extent(inode, lblk, end, &reserved); + err = __es_remove_extent(inode, lblk, end, &reserved, nofail); write_unlock(&EXT4_I(inode)->i_es_lock); ext4_es_print_tree(inode); ext4_da_release_space(inode, reserved); @@ -2003,11 +2008,11 @@ int ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk, write_lock(&EXT4_I(inode)->i_es_lock); - err = __es_remove_extent(inode, lblk, lblk, NULL); + err = __es_remove_extent(inode, lblk, lblk, NULL, 0); if (err != 0) goto error; retry: - err = __es_insert_extent(inode, &newes); + err = __es_insert_extent(inode, &newes, 0); if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb), 128, EXT4_I(inode))) goto retry; diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h index 4ec30a798260..23d77094a165 100644 --- a/fs/ext4/extents_status.h +++ b/fs/ext4/extents_status.h @@ -134,7 +134,7 @@ extern void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len, ext4_fsblk_t pblk, unsigned int status); extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk, - ext4_lblk_t len); + ext4_lblk_t len, int nofail); extern void ext4_es_find_extent_range(struct inode *inode, int (*match_fn)(struct extent_status *es), ext4_lblk_t lblk, ext4_lblk_t end, diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index f73e5eb43eae..82049fb94314 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -1079,7 +1079,8 @@ struct inode *__ext4_new_inode(struct user_namespace *mnt_userns, BUG_ON(nblocks <= 0); handle = __ext4_journal_start_sb(dir->i_sb, line_no, handle_type, nblocks, 0, - ext4_trans_default_revoke_credits(sb)); + ext4_trans_default_revoke_credits(sb), + GFP_NOFS); if (IS_ERR(handle)) { err = PTR_ERR(handle); ext4_std_error(sb, err); diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c index 89efa78ed4b2..910e87aea7be 100644 --- a/fs/ext4/indirect.c +++ b/fs/ext4/indirect.c @@ -1125,7 +1125,7 @@ void ext4_ind_truncate(handle_t *handle, struct inode *inode) return; } - ext4_es_remove_extent(inode, last_block, EXT_MAX_BLOCKS - last_block); + ext4_es_remove_extent(inode, last_block, EXT_MAX_BLOCKS - last_block, 0); /* * The orphan list entry will now protect us from any crash which diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index d18852d6029c..24246043d94b 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1575,7 +1575,7 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd, ext4_lblk_t start, last; start = index << (PAGE_SHIFT - inode->i_blkbits); last = end << (PAGE_SHIFT - inode->i_blkbits); - ext4_es_remove_extent(inode, start, last - start + 1); + ext4_es_remove_extent(inode, start, last - start + 1, 0); } pagevec_init(&pvec); @@ -4109,7 +4109,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length) ext4_discard_preallocations(inode, 0); ret = ext4_es_remove_extent(inode, first_block, - stop_block - first_block); + stop_block - first_block, 0); if (ret) { up_write(&EXT4_I(inode)->i_data_sem); goto out_stop; @@ -4117,7 +4117,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length) if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) ret = ext4_ext_remove_space(inode, first_block, - stop_block - 1); + stop_block - 1, 0); else ret = ext4_ind_remove_space(handle, inode, first_block, stop_block); diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c index 606dee9e08a3..e4de05a6b976 100644 --- a/fs/ext4/ioctl.c +++ b/fs/ext4/ioctl.c @@ -79,8 +79,8 @@ static void swap_inode_data(struct inode *inode1, struct inode *inode2) (ei1->i_flags & ~EXT4_FL_SHOULD_SWAP); ei2->i_flags = tmp | (ei2->i_flags & ~EXT4_FL_SHOULD_SWAP); swap(ei1->i_disksize, ei2->i_disksize); - ext4_es_remove_extent(inode1, 0, EXT_MAX_BLOCKS); - ext4_es_remove_extent(inode2, 0, EXT_MAX_BLOCKS); + ext4_es_remove_extent(inode1, 0, EXT_MAX_BLOCKS, 0); + ext4_es_remove_extent(inode2, 0, EXT_MAX_BLOCKS, 0); isize = i_size_read(inode1); i_size_write(inode1, i_size_read(inode2)); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 0775950ee84e..947e8376a35a 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1393,7 +1393,7 @@ void ext4_clear_inode(struct inode *inode) invalidate_inode_buffers(inode); clear_inode(inode); ext4_discard_preallocations(inode, 0); - ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS); + ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS, 0); dquot_drop(inode); if (EXT4_I(inode)->jinode) { jbd2_journal_release_jbd_inode(EXT4_JOURNAL(inode), diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 6a3caedd2285..23e0f003d43b 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -476,9 +476,9 @@ static int start_this_handle(journal_t *journal, handle_t *handle, } /* Allocate a new handle. This should probably be in a slab... */ -static handle_t *new_handle(int nblocks) +static handle_t *new_handle(int nblocks, gfp_t gfp) { - handle_t *handle = jbd2_alloc_handle(GFP_NOFS); + handle_t *handle = jbd2_alloc_handle(gfp); if (!handle) return NULL; handle->h_total_credits = nblocks; @@ -505,13 +505,13 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks, nblocks += DIV_ROUND_UP(revoke_records, journal->j_revoke_records_per_block); - handle = new_handle(nblocks); + handle = new_handle(nblocks, gfp_mask); if (!handle) return ERR_PTR(-ENOMEM); if (rsv_blocks) { handle_t *rsv_handle; - rsv_handle = new_handle(rsv_blocks); + rsv_handle = new_handle(rsv_blocks, gfp_mask); if (!rsv_handle) { jbd2_free_handle(handle); return ERR_PTR(-ENOMEM); ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH 4/6] EXT4: remove congestion_wait from ext4_bio_write_page, and simplify 2021-09-17 2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown ` (3 preceding siblings ...) 2021-09-17 2:56 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown @ 2021-09-17 2:56 ` NeilBrown 2021-09-17 2:56 ` [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL NeilBrown 5 siblings, 0 replies; 45+ messages in thread From: NeilBrown @ 2021-09-17 2:56 UTC (permalink / raw) To: Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko, . Dave Chinner, Jonathan Corbet Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc congestion_wait() is indistinguishable from schedule_timeout_uninterruptible(). It is best avoided and should be deprecated. It is not needed in ext4_bio_write_page(). There are two cases. If there are no ->io_bio yet, then it is appropriate to use __GFP_NOFAIL which does the waiting in a better place. The code already uses this flag on the second attempt. This patch changes to it always use that flag for this case. If there *are* ->io_bio (in which case the allocation was non-blocking) we submit the io and return the first case. No waiting is needed in this case. So remove the congestion_wait() call, and simplify the code so that the two cases are somewhat clearer. Remove the "if (io->io_bio)" before calling ext4_io_submit() as that test is performed internally by that function. Signed-off-by: NeilBrown <neilb@suse.de> --- fs/ext4/page-io.c | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index f038d578d8d8..3b6ece0d3ad6 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -506,7 +506,7 @@ int ext4_bio_write_page(struct ext4_io_submit *io, * can't happen in the common case of blocksize == PAGE_SIZE. */ if (fscrypt_inode_uses_fs_layer_crypto(inode) && nr_to_submit) { - gfp_t gfp_flags = GFP_NOFS; + gfp_t gfp_flags; unsigned int enc_bytes = round_up(len, i_blocksize(inode)); /* @@ -514,21 +514,18 @@ int ext4_bio_write_page(struct ext4_io_submit *io, * a waiting mask (i.e. request guaranteed allocation) on the * first page of the bio. Otherwise it can deadlock. */ + retry_encrypt: if (io->io_bio) gfp_flags = GFP_NOWAIT | __GFP_NOWARN; - retry_encrypt: + else + gfp_flags = GFP_NOFS | __GFP_NOFAIL; bounce_page = fscrypt_encrypt_pagecache_blocks(page, enc_bytes, 0, gfp_flags); if (IS_ERR(bounce_page)) { ret = PTR_ERR(bounce_page); if (ret == -ENOMEM && (io->io_bio || wbc->sync_mode == WB_SYNC_ALL)) { - gfp_flags = GFP_NOFS; - if (io->io_bio) - ext4_io_submit(io); - else - gfp_flags |= __GFP_NOFAIL; - congestion_wait(BLK_RW_ASYNC, HZ/50); + ext4_io_submit(io); goto retry_encrypt; } ^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-09-17 2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown ` (4 preceding siblings ...) 2021-09-17 2:56 ` [PATCH 4/6] EXT4: remove congestion_wait from ext4_bio_write_page, and simplify NeilBrown @ 2021-09-17 2:56 ` NeilBrown 2021-10-05 9:20 ` Vlastimil Babka 5 siblings, 1 reply; 45+ messages in thread From: NeilBrown @ 2021-09-17 2:56 UTC (permalink / raw) To: Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko, . Dave Chinner, Jonathan Corbet Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc __GFP_NOFAIL is documented both in gfp.h and memory-allocation.rst. The details are not entirely consistent. This patch ensures both places state that: - there is a risk of deadlock with reclaim/writeback/oom-kill - it should only be used when there is no real alternative - it is preferable to an endless loop - it is strongly discourages for costly-order allocations. Signed-off-by: NeilBrown <neilb@suse.de> --- Documentation/core-api/memory-allocation.rst | 25 ++++++++++++++++++++++++- include/linux/gfp.h | 6 +++++- 2 files changed, 29 insertions(+), 2 deletions(-) diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst index 5954ddf6ee13..8ea077465446 100644 --- a/Documentation/core-api/memory-allocation.rst +++ b/Documentation/core-api/memory-allocation.rst @@ -126,7 +126,30 @@ or another request. * ``GFP_KERNEL | __GFP_NOFAIL`` - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. - This might be really dangerous especially for larger orders. + Any attempt to use ``__GFP_NOFAIL`` for allocations larger than + order-1 (2 pages) will trigger a warning. + + Use of ``__GFP_NOFAIL`` can cause deadlocks so it should only be used + when there is no alternative, and then should be used with caution. + Deadlocks can happen if the calling process holds any resources + (e.g. locks) which might be needed for memory reclaim or write-back, + or which might prevent a process killed by the OOM killer from + successfully exiting. Where possible, locks should be released + before using ``__GFP_NOFAIL``. + + While this flag is best avoided, it is still preferable to endless + loops around the allocator. Endless loops may still be used when + there is a need to test for the process being killed + (fatal_signal_pending(current)). + + * ``GFP_NOFS | __GFP_NOFAIL`` - Loop endlessly instead of failing + when performing allocations in file system code. The same guidance + as for ``GFP_KERNEL | __GFP_NOFAIL`` applies with extra emphasis on + the possibility of deadlocks. ``GFP_NOFS`` often implies that + filesystem locks are held which might lead to blocking reclaim. + Preemptively flushing or reclaiming memory associated with such + locks might be appropriate before requesting a ``__GFP_NOFAIL`` + allocation. Selecting memory allocator ========================== diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 55b2ec1f965a..1d2a89e20b8b 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -209,7 +209,11 @@ struct vm_area_struct; * used only when there is no reasonable failure policy) but it is * definitely preferable to use the flag rather than opencode endless * loop around allocator. - * Using this flag for costly allocations is _highly_ discouraged. + * Use of this flag may lead to deadlocks if locks are held which would + * be needed for memory reclaim, write-back, or the timely exit of a + * process killed by the OOM-killer. Dropping any locks not absolutely + * needed is advisable before requesting a %__GFP_NOFAIL allocate. + * Using this flag for costly allocations (order>1) is _highly_ discouraged. */ #define __GFP_IO ((__force gfp_t)___GFP_IO) #define __GFP_FS ((__force gfp_t)___GFP_FS) ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-09-17 2:56 ` [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL NeilBrown @ 2021-10-05 9:20 ` Vlastimil Babka 2021-10-05 11:09 ` Michal Hocko 0 siblings, 1 reply; 45+ messages in thread From: Vlastimil Babka @ 2021-10-05 9:20 UTC (permalink / raw) To: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko, . Dave Chinner, Jonathan Corbet Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On 9/17/21 04:56, NeilBrown wrote: > __GFP_NOFAIL is documented both in gfp.h and memory-allocation.rst. > The details are not entirely consistent. > > This patch ensures both places state that: > - there is a risk of deadlock with reclaim/writeback/oom-kill > - it should only be used when there is no real alternative > - it is preferable to an endless loop > - it is strongly discourages for costly-order allocations. > > Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Nit below: > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 55b2ec1f965a..1d2a89e20b8b 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -209,7 +209,11 @@ struct vm_area_struct; > * used only when there is no reasonable failure policy) but it is > * definitely preferable to use the flag rather than opencode endless > * loop around allocator. > - * Using this flag for costly allocations is _highly_ discouraged. > + * Use of this flag may lead to deadlocks if locks are held which would > + * be needed for memory reclaim, write-back, or the timely exit of a > + * process killed by the OOM-killer. Dropping any locks not absolutely > + * needed is advisable before requesting a %__GFP_NOFAIL allocate. > + * Using this flag for costly allocations (order>1) is _highly_ discouraged. We define costly as 3, not 1. But sure it's best to avoid even order>0 for __GFP_NOFAIL. Advising order>1 seems arbitrary though? > */ > #define __GFP_IO ((__force gfp_t)___GFP_IO) > #define __GFP_FS ((__force gfp_t)___GFP_FS) > > > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-05 9:20 ` Vlastimil Babka @ 2021-10-05 11:09 ` Michal Hocko 2021-10-05 12:27 ` Vlastimil Babka 0 siblings, 1 reply; 45+ messages in thread From: Michal Hocko @ 2021-10-05 11:09 UTC (permalink / raw) To: Vlastimil Babka Cc: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, . Dave Chinner, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Tue 05-10-21 11:20:51, Vlastimil Babka wrote: [...] > > --- a/include/linux/gfp.h > > +++ b/include/linux/gfp.h > > @@ -209,7 +209,11 @@ struct vm_area_struct; > > * used only when there is no reasonable failure policy) but it is > > * definitely preferable to use the flag rather than opencode endless > > * loop around allocator. > > - * Using this flag for costly allocations is _highly_ discouraged. > > + * Use of this flag may lead to deadlocks if locks are held which would > > + * be needed for memory reclaim, write-back, or the timely exit of a > > + * process killed by the OOM-killer. Dropping any locks not absolutely > > + * needed is advisable before requesting a %__GFP_NOFAIL allocate. > > + * Using this flag for costly allocations (order>1) is _highly_ discouraged. > > We define costly as 3, not 1. But sure it's best to avoid even order>0 for > __GFP_NOFAIL. Advising order>1 seems arbitrary though? This is not completely arbitrary. We have a warning for any higher order allocation. rmqueue: WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); I do agree that "Using this flag for higher order allocations is _highly_ discouraged. > > */ > > #define __GFP_IO ((__force gfp_t)___GFP_IO) > > #define __GFP_FS ((__force gfp_t)___GFP_FS) > > > > > > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-05 11:09 ` Michal Hocko @ 2021-10-05 12:27 ` Vlastimil Babka 2021-10-06 23:14 ` Dave Chinner 0 siblings, 1 reply; 45+ messages in thread From: Vlastimil Babka @ 2021-10-05 12:27 UTC (permalink / raw) To: Michal Hocko Cc: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, . Dave Chinner, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On 10/5/21 13:09, Michal Hocko wrote: > On Tue 05-10-21 11:20:51, Vlastimil Babka wrote: > [...] >> > --- a/include/linux/gfp.h >> > +++ b/include/linux/gfp.h >> > @@ -209,7 +209,11 @@ struct vm_area_struct; >> > * used only when there is no reasonable failure policy) but it is >> > * definitely preferable to use the flag rather than opencode endless >> > * loop around allocator. >> > - * Using this flag for costly allocations is _highly_ discouraged. >> > + * Use of this flag may lead to deadlocks if locks are held which would >> > + * be needed for memory reclaim, write-back, or the timely exit of a >> > + * process killed by the OOM-killer. Dropping any locks not absolutely >> > + * needed is advisable before requesting a %__GFP_NOFAIL allocate. >> > + * Using this flag for costly allocations (order>1) is _highly_ discouraged. >> >> We define costly as 3, not 1. But sure it's best to avoid even order>0 for >> __GFP_NOFAIL. Advising order>1 seems arbitrary though? > > This is not completely arbitrary. We have a warning for any higher order > allocation. > rmqueue: > WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); Oh, I missed that. > I do agree that "Using this flag for higher order allocations is > _highly_ discouraged. Well, with the warning in place this is effectively forbidden, not just discouraged. >> > */ >> > #define __GFP_IO ((__force gfp_t)___GFP_IO) >> > #define __GFP_FS ((__force gfp_t)___GFP_FS) >> > >> > >> > > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-05 12:27 ` Vlastimil Babka @ 2021-10-06 23:14 ` Dave Chinner 2021-10-07 10:07 ` Michal Hocko 0 siblings, 1 reply; 45+ messages in thread From: Dave Chinner @ 2021-10-06 23:14 UTC (permalink / raw) To: Vlastimil Babka Cc: Michal Hocko, NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Tue, Oct 05, 2021 at 02:27:45PM +0200, Vlastimil Babka wrote: > On 10/5/21 13:09, Michal Hocko wrote: > > On Tue 05-10-21 11:20:51, Vlastimil Babka wrote: > > [...] > >> > --- a/include/linux/gfp.h > >> > +++ b/include/linux/gfp.h > >> > @@ -209,7 +209,11 @@ struct vm_area_struct; > >> > * used only when there is no reasonable failure policy) but it is > >> > * definitely preferable to use the flag rather than opencode endless > >> > * loop around allocator. > >> > - * Using this flag for costly allocations is _highly_ discouraged. > >> > + * Use of this flag may lead to deadlocks if locks are held which would > >> > + * be needed for memory reclaim, write-back, or the timely exit of a > >> > + * process killed by the OOM-killer. Dropping any locks not absolutely > >> > + * needed is advisable before requesting a %__GFP_NOFAIL allocate. > >> > + * Using this flag for costly allocations (order>1) is _highly_ discouraged. > >> > >> We define costly as 3, not 1. But sure it's best to avoid even order>0 for > >> __GFP_NOFAIL. Advising order>1 seems arbitrary though? > > > > This is not completely arbitrary. We have a warning for any higher order > > allocation. > > rmqueue: > > WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); > > Oh, I missed that. > > > I do agree that "Using this flag for higher order allocations is > > _highly_ discouraged. > > Well, with the warning in place this is effectively forbidden, not just > discouraged. Yup, especially as it doesn't obey __GFP_NOWARN. See commit de2860f46362 ("mm: Add kvrealloc()") as a direct result of unwittingly tripping over this warning when adding __GFP_NOFAIL annotations to replace open coded high-order kmalloc loops that have been in place for a couple of decades without issues. Personally I think that the way __GFP_NOFAIL is first of all recommended over open coded loops and then only later found to be effectively forbidden and needing to be replaced with open coded loops to be a complete mess. Not to mention on the impossibility of using __GFP_NOFAIL with kvmalloc() calls. Just what do we expect kmalloc_node(__GFP_NORETRY | __GFP_NOFAIL) to do, exactly? So, effectively, we have to open-code around kvmalloc() in situations where failure is not an option. Even if we pass __GFP_NOFAIL to __vmalloc(), it isn't guaranteed to succeed because of the "we won't honor gfp flags passed to __vmalloc" semantics it has. Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS kvmalloc via memalloc_nofs_save/restore(), so this behavioural restriction w.r.t. gfp flags just makes no sense at all. That leads to us having to go back to writing extremely custom open coded loops to avoid awful high-order kmalloc direct reclaim behaviour and still fall back to vmalloc and to still handle NOFAIL semantics we need: https://lore.kernel.org/linux-xfs/20210902095927.911100-8-david@fromorbit.com/ So, really, the problems are much deeper here than just badly documented, catch-22 rules for __GFP_NOFAIL - we can't even use __GFP_NOFAIL consistently across the allocation APIs because it changes allocation behaviours in unusable, self-defeating ways.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-06 23:14 ` Dave Chinner @ 2021-10-07 10:07 ` Michal Hocko 2021-10-07 23:15 ` NeilBrown 0 siblings, 1 reply; 45+ messages in thread From: Michal Hocko @ 2021-10-07 10:07 UTC (permalink / raw) To: Dave Chinner Cc: Vlastimil Babka, NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Thu 07-10-21 10:14:52, Dave Chinner wrote: > On Tue, Oct 05, 2021 at 02:27:45PM +0200, Vlastimil Babka wrote: > > On 10/5/21 13:09, Michal Hocko wrote: > > > On Tue 05-10-21 11:20:51, Vlastimil Babka wrote: > > > [...] > > >> > --- a/include/linux/gfp.h > > >> > +++ b/include/linux/gfp.h > > >> > @@ -209,7 +209,11 @@ struct vm_area_struct; > > >> > * used only when there is no reasonable failure policy) but it is > > >> > * definitely preferable to use the flag rather than opencode endless > > >> > * loop around allocator. > > >> > - * Using this flag for costly allocations is _highly_ discouraged. > > >> > + * Use of this flag may lead to deadlocks if locks are held which would > > >> > + * be needed for memory reclaim, write-back, or the timely exit of a > > >> > + * process killed by the OOM-killer. Dropping any locks not absolutely > > >> > + * needed is advisable before requesting a %__GFP_NOFAIL allocate. > > >> > + * Using this flag for costly allocations (order>1) is _highly_ discouraged. > > >> > > >> We define costly as 3, not 1. But sure it's best to avoid even order>0 for > > >> __GFP_NOFAIL. Advising order>1 seems arbitrary though? > > > > > > This is not completely arbitrary. We have a warning for any higher order > > > allocation. > > > rmqueue: > > > WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); > > > > Oh, I missed that. > > > > > I do agree that "Using this flag for higher order allocations is > > > _highly_ discouraged. > > > > Well, with the warning in place this is effectively forbidden, not just > > discouraged. > > Yup, especially as it doesn't obey __GFP_NOWARN. > > See commit de2860f46362 ("mm: Add kvrealloc()") as a direct result > of unwittingly tripping over this warning when adding __GFP_NOFAIL > annotations to replace open coded high-order kmalloc loops that have > been in place for a couple of decades without issues. > > Personally I think that the way __GFP_NOFAIL is first of all > recommended over open coded loops and then only later found to be > effectively forbidden and needing to be replaced with open coded > loops to be a complete mess. Well, there are two things. Opencoding something that _can_ be replaced by __GFP_NOFAIL and those that cannot because the respective allocator doesn't really support that semantic. kvmalloc is explicit about that IIRC. If you have a better way to consolidate the documentation then I am all for it. > Not to mention on the impossibility of using __GFP_NOFAIL with > kvmalloc() calls. Just what do we expect kmalloc_node(__GFP_NORETRY > | __GFP_NOFAIL) to do, exactly? This combination doesn't make any sense. Like others. Do you want us to list all combinations that make sense? > So, effectively, we have to open-code around kvmalloc() in > situations where failure is not an option. Even if we pass > __GFP_NOFAIL to __vmalloc(), it isn't guaranteed to succeed because > of the "we won't honor gfp flags passed to __vmalloc" semantics it > has. yes vmalloc doesn't support nofail semantic and it is not really trivial to craft it there. > Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc > fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS > kvmalloc via memalloc_nofs_save/restore(), so this behavioural > restriction w.r.t. gfp flags just makes no sense at all. GFP_NOFS (without using the scope API) has the same problem as NOFAIL in the vmalloc. Hence it is not supported. If you use the scope API then you can GFP_KERNEL for kvmalloc. This is clumsy but I am not sure how to define these conditions in a more sensible way. Special case NOFS if the scope api is in use? Why do you want an explicit NOFS then? > That leads to us having to go back to writing extremely custom open > coded loops to avoid awful high-order kmalloc direct reclaim > behaviour and still fall back to vmalloc and to still handle NOFAIL > semantics we need: > > https://lore.kernel.org/linux-xfs/20210902095927.911100-8-david@fromorbit.com/ It would be more productive to get to MM people rather than rant on a xfs specific patchse. Anyway, I can see a kvmalloc mode where the kmalloc allocation would be really a very optimistic one - like your effectively GFP_NOWAIT. Nobody has requested such a mode until now and I am not sure how we would sensibly describe that by a gfp mask. Btw. your GFP_NOWAIT | __GFP_NORETRY combination doesn't make any sense in the allocator context as the later is a reclaim mofifier which doesn't get applied when the reclaim is disabled (in your case by flags &= ~__GFP_DIRECT_RECLAIM). GFP flags are not that easy to build a coherent and usable apis. Something we carry as a baggage for a long time. > So, really, the problems are much deeper here than just badly > documented, catch-22 rules for __GFP_NOFAIL - we can't even use > __GFP_NOFAIL consistently across the allocation APIs because it > changes allocation behaviours in unusable, self-defeating ways.... GFP_NOFAIL sucks. Not all allocator can follow it for practical reasons. You are welcome to help document those awkward corner cases or fix them up if you have a good idea how. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-07 10:07 ` Michal Hocko @ 2021-10-07 23:15 ` NeilBrown 2021-10-08 7:48 ` Michal Hocko 0 siblings, 1 reply; 45+ messages in thread From: NeilBrown @ 2021-10-07 23:15 UTC (permalink / raw) To: Michal Hocko Cc: Dave Chinner, Vlastimil Babka, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Thu, 07 Oct 2021, Michal Hocko wrote: > On Thu 07-10-21 10:14:52, Dave Chinner wrote: > > On Tue, Oct 05, 2021 at 02:27:45PM +0200, Vlastimil Babka wrote: > > > On 10/5/21 13:09, Michal Hocko wrote: > > > > On Tue 05-10-21 11:20:51, Vlastimil Babka wrote: > > > > [...] > > > >> > --- a/include/linux/gfp.h > > > >> > +++ b/include/linux/gfp.h > > > >> > @@ -209,7 +209,11 @@ struct vm_area_struct; > > > >> > * used only when there is no reasonable failure policy) but it is > > > >> > * definitely preferable to use the flag rather than opencode endless > > > >> > * loop around allocator. > > > >> > - * Using this flag for costly allocations is _highly_ discouraged. > > > >> > + * Use of this flag may lead to deadlocks if locks are held which would > > > >> > + * be needed for memory reclaim, write-back, or the timely exit of a > > > >> > + * process killed by the OOM-killer. Dropping any locks not absolutely > > > >> > + * needed is advisable before requesting a %__GFP_NOFAIL allocate. > > > >> > + * Using this flag for costly allocations (order>1) is _highly_ discouraged. > > > >> > > > >> We define costly as 3, not 1. But sure it's best to avoid even order>0 for > > > >> __GFP_NOFAIL. Advising order>1 seems arbitrary though? > > > > > > > > This is not completely arbitrary. We have a warning for any higher order > > > > allocation. > > > > rmqueue: > > > > WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); > > > > > > Oh, I missed that. > > > > > > > I do agree that "Using this flag for higher order allocations is > > > > _highly_ discouraged. > > > > > > Well, with the warning in place this is effectively forbidden, not just > > > discouraged. > > > > Yup, especially as it doesn't obey __GFP_NOWARN. > > > > See commit de2860f46362 ("mm: Add kvrealloc()") as a direct result > > of unwittingly tripping over this warning when adding __GFP_NOFAIL > > annotations to replace open coded high-order kmalloc loops that have > > been in place for a couple of decades without issues. > > > > Personally I think that the way __GFP_NOFAIL is first of all > > recommended over open coded loops and then only later found to be > > effectively forbidden and needing to be replaced with open coded > > loops to be a complete mess. > > Well, there are two things. Opencoding something that _can_ be replaced > by __GFP_NOFAIL and those that cannot because the respective allocator > doesn't really support that semantic. kvmalloc is explicit about that > IIRC. If you have a better way to consolidate the documentation then I > am all for it. I think one thing that might help make the documentation better is to explicitly state *why* __GFP_NOFAIL is better than a loop. It occurs to me that while (!(p = kmalloc(sizeof(*p), GFP_KERNEL)); would behave much the same as adding __GFP_NOFAIL and dropping the 'while'. So why not? I certainly cannot see the need to add any delay to this loop as kmalloc does a fair bit of sleeping when permitted. I understand that __GFP_NOFAIL allows page_alloc to dip into reserves, but Mel holds that up as a reason *not* to use __GFP_NOFAIL as it can impact on other subsystems. Why not just let the caller decide if they deserve the boost, but oring in __GFP_ATOMIC or __GFP_MEMALLOC as appropriate. I assume there is a good reason. I vaguely remember the conversation that lead to __GFP_NOFAIL being introduced. I just cannot remember or deduce what the reason is. So it would be great to have it documented. > > > Not to mention on the impossibility of using __GFP_NOFAIL with > > kvmalloc() calls. Just what do we expect kmalloc_node(__GFP_NORETRY > > | __GFP_NOFAIL) to do, exactly? > > This combination doesn't make any sense. Like others. Do you want us to > list all combinations that make sense? I've been wondering about that. There seem to be sets of flags that are mutually exclusive. It is as though gfp_t is a struct of a few enums. 0, DMA32, DMA, HIGHMEM 0, FS, IO 0, ATOMIC, MEMALLOC, NOMEMALLOC, HIGH NORETRY, RETRY_MAYFAIL, 0, NOFAIL 0, KSWAPD_RECLAIM, DIRECT_RECLAIM 0, THISNODE, HARDWALL In a few cases there seem to be 3 bits where there are only 4 possibly combinations, so 2 bits would be enough. There is probably no real value is squeezing these into 2 bits, but clearly documenting the groups surely wouldn't hurt. Particularly highlighting the difference between related bits would help. The set with 'ATOMIC' is hard to wrap my mind around. They relate to ALLOC_HIGH and ALLOC_HARDER, but also to WMARK_NIN, WMARK_LOW, WMARK_HIGH ... I think. I wonder if FS,IO is really in the same set as DIRECT_RECLAIM as they all affect reclaim. Maybe FS and IO are only relevan if DIRECT_RECLAIM is set? I'd love to know that to expect if neither RETRY_MAYFAIL or NOFAIL is set. I guess it can fail, but it still tries harder than if RETRY_MAYFAIL is set.... Ahhhh... I found some documentation which mentions that RETRY_MAYFAIL doesn't trigger the oom killer. Is that it? So RETRY_NOKILLOOM might be a better name? > > > So, effectively, we have to open-code around kvmalloc() in > > situations where failure is not an option. Even if we pass > > __GFP_NOFAIL to __vmalloc(), it isn't guaranteed to succeed because > > of the "we won't honor gfp flags passed to __vmalloc" semantics it > > has. > > yes vmalloc doesn't support nofail semantic and it is not really trivial > to craft it there. > > > Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc > > fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS > > kvmalloc via memalloc_nofs_save/restore(), so this behavioural > > restriction w.r.t. gfp flags just makes no sense at all. > > GFP_NOFS (without using the scope API) has the same problem as NOFAIL in > the vmalloc. Hence it is not supported. If you use the scope API then > you can GFP_KERNEL for kvmalloc. This is clumsy but I am not sure how to > define these conditions in a more sensible way. Special case NOFS if the > scope api is in use? Why do you want an explicit NOFS then? It would seem to make sense for kvmalloc to WARN_ON if it is passed flags that does not allow it to use vmalloc. Such callers could then know they can either change to a direct kmalloc(), or change flags. Silently ignoring the 'v' in the function name sees like a poor choice. Thanks, NeilBrown > > > That leads to us having to go back to writing extremely custom open > > coded loops to avoid awful high-order kmalloc direct reclaim > > behaviour and still fall back to vmalloc and to still handle NOFAIL > > semantics we need: > > > > https://lore.kernel.org/linux-xfs/20210902095927.911100-8-david@fromorbit.com/ > > It would be more productive to get to MM people rather than rant on a > xfs specific patchse. Anyway, I can see a kvmalloc mode where the > kmalloc allocation would be really a very optimistic one - like your > effectively GFP_NOWAIT. Nobody has requested such a mode until now and I > am not sure how we would sensibly describe that by a gfp mask. > > Btw. your GFP_NOWAIT | __GFP_NORETRY combination doesn't make any sense > in the allocator context as the later is a reclaim mofifier which > doesn't get applied when the reclaim is disabled (in your case by flags > &= ~__GFP_DIRECT_RECLAIM). > > GFP flags are not that easy to build a coherent and usable apis. > Something we carry as a baggage for a long time. > > > So, really, the problems are much deeper here than just badly > > documented, catch-22 rules for __GFP_NOFAIL - we can't even use > > __GFP_NOFAIL consistently across the allocation APIs because it > > changes allocation behaviours in unusable, self-defeating ways.... > > GFP_NOFAIL sucks. Not all allocator can follow it for practical > reasons. You are welcome to help document those awkward corner cases or > fix them up if you have a good idea how. > > Thanks! > -- > Michal Hocko > SUSE Labs > > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-07 23:15 ` NeilBrown @ 2021-10-08 7:48 ` Michal Hocko 2021-10-08 22:36 ` Dave Chinner 0 siblings, 1 reply; 45+ messages in thread From: Michal Hocko @ 2021-10-08 7:48 UTC (permalink / raw) To: NeilBrown Cc: Dave Chinner, Vlastimil Babka, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Fri 08-10-21 10:15:45, Neil Brown wrote: > On Thu, 07 Oct 2021, Michal Hocko wrote: > > On Thu 07-10-21 10:14:52, Dave Chinner wrote: > > > On Tue, Oct 05, 2021 at 02:27:45PM +0200, Vlastimil Babka wrote: > > > > On 10/5/21 13:09, Michal Hocko wrote: > > > > > On Tue 05-10-21 11:20:51, Vlastimil Babka wrote: > > > > > [...] > > > > >> > --- a/include/linux/gfp.h > > > > >> > +++ b/include/linux/gfp.h > > > > >> > @@ -209,7 +209,11 @@ struct vm_area_struct; > > > > >> > * used only when there is no reasonable failure policy) but it is > > > > >> > * definitely preferable to use the flag rather than opencode endless > > > > >> > * loop around allocator. > > > > >> > - * Using this flag for costly allocations is _highly_ discouraged. > > > > >> > + * Use of this flag may lead to deadlocks if locks are held which would > > > > >> > + * be needed for memory reclaim, write-back, or the timely exit of a > > > > >> > + * process killed by the OOM-killer. Dropping any locks not absolutely > > > > >> > + * needed is advisable before requesting a %__GFP_NOFAIL allocate. > > > > >> > + * Using this flag for costly allocations (order>1) is _highly_ discouraged. > > > > >> > > > > >> We define costly as 3, not 1. But sure it's best to avoid even order>0 for > > > > >> __GFP_NOFAIL. Advising order>1 seems arbitrary though? > > > > > > > > > > This is not completely arbitrary. We have a warning for any higher order > > > > > allocation. > > > > > rmqueue: > > > > > WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); > > > > > > > > Oh, I missed that. > > > > > > > > > I do agree that "Using this flag for higher order allocations is > > > > > _highly_ discouraged. > > > > > > > > Well, with the warning in place this is effectively forbidden, not just > > > > discouraged. > > > > > > Yup, especially as it doesn't obey __GFP_NOWARN. > > > > > > See commit de2860f46362 ("mm: Add kvrealloc()") as a direct result > > > of unwittingly tripping over this warning when adding __GFP_NOFAIL > > > annotations to replace open coded high-order kmalloc loops that have > > > been in place for a couple of decades without issues. > > > > > > Personally I think that the way __GFP_NOFAIL is first of all > > > recommended over open coded loops and then only later found to be > > > effectively forbidden and needing to be replaced with open coded > > > loops to be a complete mess. > > > > Well, there are two things. Opencoding something that _can_ be replaced > > by __GFP_NOFAIL and those that cannot because the respective allocator > > doesn't really support that semantic. kvmalloc is explicit about that > > IIRC. If you have a better way to consolidate the documentation then I > > am all for it. > > I think one thing that might help make the documentation better is to > explicitly state *why* __GFP_NOFAIL is better than a loop. > > It occurs to me that > while (!(p = kmalloc(sizeof(*p), GFP_KERNEL)); > > would behave much the same as adding __GFP_NOFAIL and dropping the > 'while'. So why not? I certainly cannot see the need to add any delay > to this loop as kmalloc does a fair bit of sleeping when permitted. > > I understand that __GFP_NOFAIL allows page_alloc to dip into reserves, > but Mel holds that up as a reason *not* to use __GFP_NOFAIL as it can > impact on other subsystems. __GFP_NOFAIL usage is a risk on its own. It is a hard requirement that the allocator cannot back off. So it has to absolutely everything to suceed. Whether it cheats and dips into reserves or not is a mere implementation detail and a subject to the specific implementation. > Why not just let the caller decide if they > deserve the boost, but oring in __GFP_ATOMIC or __GFP_MEMALLOC as > appropriate. They can do that. Explicit access to memory reserves is allowed unless it is explicitly forbidden by NOMEMALLOC flag. > I assume there is a good reason. I vaguely remember the conversation > that lead to __GFP_NOFAIL being introduced. I just cannot remember or > deduce what the reason is. So it would be great to have it documented. The basic reason is that if the allocator knows this is must suceed allocation request then it can prioritize it in some way. A dumb kmalloc loop as you pictured it is likely much less optimal in that sense, isn't it? Compare that to mempool allocator which is non failing as well but it has some involved handling and that is certainly not a good fit for __GFP_NOFAIL in the page allocator. > > > Not to mention on the impossibility of using __GFP_NOFAIL with > > > kvmalloc() calls. Just what do we expect kmalloc_node(__GFP_NORETRY > > > | __GFP_NOFAIL) to do, exactly? > > > > This combination doesn't make any sense. Like others. Do you want us to > > list all combinations that make sense? > > I've been wondering about that. There seem to be sets of flags that are > mutually exclusive. It is as though gfp_t is a struct of a few enums. > > 0, DMA32, DMA, HIGHMEM > 0, FS, IO > 0, ATOMIC, MEMALLOC, NOMEMALLOC, HIGH > NORETRY, RETRY_MAYFAIL, 0, NOFAIL > 0, KSWAPD_RECLAIM, DIRECT_RECLAIM > 0, THISNODE, HARDWALL > > In a few cases there seem to be 3 bits where there are only 4 possibly > combinations, so 2 bits would be enough. There is probably no real > value is squeezing these into 2 bits, but clearly documenting the groups > surely wouldn't hurt. Particularly highlighting the difference between > related bits would help. Don't we have that already? We have them grouped by placement, watermarks, reclaim and action modifiers. Then we have useful combinations. I believe we can always improve on that and I am always ready to listen here. > The set with 'ATOMIC' is hard to wrap my mind around. > They relate to ALLOC_HIGH and ALLOC_HARDER, but also to WMARK_NIN, > WMARK_LOW, WMARK_HIGH ... I think. ALLOC* and WMARK* is an internal allocator concept and I believe users of gfp flags shouldn't really care or even know those exist. > I wonder if FS,IO is really in the same set as DIRECT_RECLAIM as they > all affect reclaim. Maybe FS and IO are only relevan if DIRECT_RECLAIM > is set? yes, this indeed the case. Page allocator doesn't go outside of its proper without the direct reclaim. > I'd love to know that to expect if neither RETRY_MAYFAIL or NOFAIL is > set. I guess it can fail, but it still tries harder than if > RETRY_MAYFAIL is set.... > Ahhhh... I found some documentation which mentions The reclaim behavior is described along with the respective modifiers. I believe we can thank you for this structure as you were the primary driving force to clarify the behavior. > that RETRY_MAYFAIL > doesn't trigger the oom killer. Is that it? So RETRY_NOKILLOOM might be > a better name? Again the those are implementation details and I am not sure we really want to bother users with all of them. This wold quickly become hairy and likely even outdated after some time. The documentation tries to describe different levels of involvement. NOWAIT - no direct reclaim, NORETRY - only a light attempt to reclaim, RETRY_MAYFAIL - try as hard as feasible, NOFAIL - cannot really fail. If we can improve the wording I am all for it. > > > So, effectively, we have to open-code around kvmalloc() in > > > situations where failure is not an option. Even if we pass > > > __GFP_NOFAIL to __vmalloc(), it isn't guaranteed to succeed because > > > of the "we won't honor gfp flags passed to __vmalloc" semantics it > > > has. > > > > yes vmalloc doesn't support nofail semantic and it is not really trivial > > to craft it there. > > > > > Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc > > > fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS > > > kvmalloc via memalloc_nofs_save/restore(), so this behavioural > > > restriction w.r.t. gfp flags just makes no sense at all. > > > > GFP_NOFS (without using the scope API) has the same problem as NOFAIL in > > the vmalloc. Hence it is not supported. If you use the scope API then > > you can GFP_KERNEL for kvmalloc. This is clumsy but I am not sure how to > > define these conditions in a more sensible way. Special case NOFS if the > > scope api is in use? Why do you want an explicit NOFS then? > > It would seem to make sense for kvmalloc to WARN_ON if it is passed > flags that does not allow it to use vmalloc. vmalloc is certainly not the hottest path in the kernel so I wouldn't be opposed. One should be careful that WARN_ON is effectively BUG_ON in some configurations but we are sinners from that perspective all over the place... Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-08 7:48 ` Michal Hocko @ 2021-10-08 22:36 ` Dave Chinner 2021-10-11 11:57 ` Michal Hocko 0 siblings, 1 reply; 45+ messages in thread From: Dave Chinner @ 2021-10-08 22:36 UTC (permalink / raw) To: Michal Hocko Cc: NeilBrown, Vlastimil Babka, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Fri, Oct 08, 2021 at 09:48:39AM +0200, Michal Hocko wrote: > On Fri 08-10-21 10:15:45, Neil Brown wrote: > > On Thu, 07 Oct 2021, Michal Hocko wrote: > > > On Thu 07-10-21 10:14:52, Dave Chinner wrote: > > > > On Tue, Oct 05, 2021 at 02:27:45PM +0200, Vlastimil Babka wrote: > > > > > On 10/5/21 13:09, Michal Hocko wrote: > > > > > > On Tue 05-10-21 11:20:51, Vlastimil Babka wrote: > > > > > > [...] > > > > > >> > --- a/include/linux/gfp.h > > > > > >> > +++ b/include/linux/gfp.h > > > > > >> > @@ -209,7 +209,11 @@ struct vm_area_struct; > > > > > >> > * used only when there is no reasonable failure policy) but it is > > > > > >> > * definitely preferable to use the flag rather than opencode endless > > > > > >> > * loop around allocator. > > > > > >> > - * Using this flag for costly allocations is _highly_ discouraged. > > > > > >> > + * Use of this flag may lead to deadlocks if locks are held which would > > > > > >> > + * be needed for memory reclaim, write-back, or the timely exit of a > > > > > >> > + * process killed by the OOM-killer. Dropping any locks not absolutely > > > > > >> > + * needed is advisable before requesting a %__GFP_NOFAIL allocate. > > > > > >> > + * Using this flag for costly allocations (order>1) is _highly_ discouraged. > > > > > >> > > > > > >> We define costly as 3, not 1. But sure it's best to avoid even order>0 for > > > > > >> __GFP_NOFAIL. Advising order>1 seems arbitrary though? > > > > > > > > > > > > This is not completely arbitrary. We have a warning for any higher order > > > > > > allocation. > > > > > > rmqueue: > > > > > > WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); > > > > > > > > > > Oh, I missed that. > > > > > > > > > > > I do agree that "Using this flag for higher order allocations is > > > > > > _highly_ discouraged. > > > > > > > > > > Well, with the warning in place this is effectively forbidden, not just > > > > > discouraged. > > > > > > > > Yup, especially as it doesn't obey __GFP_NOWARN. > > > > > > > > See commit de2860f46362 ("mm: Add kvrealloc()") as a direct result > > > > of unwittingly tripping over this warning when adding __GFP_NOFAIL > > > > annotations to replace open coded high-order kmalloc loops that have > > > > been in place for a couple of decades without issues. > > > > > > > > Personally I think that the way __GFP_NOFAIL is first of all > > > > recommended over open coded loops and then only later found to be > > > > effectively forbidden and needing to be replaced with open coded > > > > loops to be a complete mess. > > > > > > Well, there are two things. Opencoding something that _can_ be replaced > > > by __GFP_NOFAIL and those that cannot because the respective allocator > > > doesn't really support that semantic. kvmalloc is explicit about that > > > IIRC. If you have a better way to consolidate the documentation then I > > > am all for it. > > > > I think one thing that might help make the documentation better is to > > explicitly state *why* __GFP_NOFAIL is better than a loop. > > > > It occurs to me that > > while (!(p = kmalloc(sizeof(*p), GFP_KERNEL)); > > > > would behave much the same as adding __GFP_NOFAIL and dropping the > > 'while'. So why not? I certainly cannot see the need to add any delay > > to this loop as kmalloc does a fair bit of sleeping when permitted. > > > > I understand that __GFP_NOFAIL allows page_alloc to dip into reserves, > > but Mel holds that up as a reason *not* to use __GFP_NOFAIL as it can > > impact on other subsystems. > > __GFP_NOFAIL usage is a risk on its own. It is a hard requirement that > the allocator cannot back off. No, "allocator cannot back off" isn't a hard requirement for most GFP_NOFAIL uses. *Not failing the allocation* is the hard requirement. How long it takes for the allocation to actually succeed is irrelevant to most callers, and given that we are replacing loops that do while (!(p = kmalloc(sizeof(*p), GFP_KERNEL)) with __GFP_NOFAIL largely indicates that allocation *latency* and/or deadlocks are not an issue here. Indeed, if we deadlock in XFS because there is no memory available, that is *not a problem kmalloc() should be trying to solve*. THe problem is the caller being unable to handle allocation failure, so if allocation cannot make progress, that needs to be fixed by the caller getting rid of the unfailable allocation. The fact is that we've had these loops in production code for a couple of decades and these subsystems just aren't failing or deadlocking with such loops. IOWs, we don't need __GFP_NOFAIL to dig deep into reserves or drive the system to OOM killing - we just need to it keep retrying the same allocation until it succeeds. Put simply, we want "retry forever" semantics to match what production kernels have been doing for the past couple of decades, but all we've been given are "never fail" semantics that also do something different and potentially much more problematic. Do you see the difference here? __GFP_NOFAIL is not what we need in the vast majority of cases where it is used. We don't want the failing allocations to drive the machine hard into critical reserves, we just want the allocation to -eventually succeed- and if it doesn't, that's our problem to handle, not kmalloc().... > So it has to absolutely everything to > suceed. Whether it cheats and dips into reserves or not is a mere > implementation detail and a subject to the specific implementation. My point exactly: that's how the MM interprets __GFP_NOFAIL is supposed to provide callers with. What we are trying to tell you is that the semantics associated with __GFP_NOFAIL is not actually what we require, and it's the current semantics of __GFP_NOFAIL that cause all the "can't be applied consistently across the entire allocation APIs" problems.... > > > > So, effectively, we have to open-code around kvmalloc() in > > > > situations where failure is not an option. Even if we pass > > > > __GFP_NOFAIL to __vmalloc(), it isn't guaranteed to succeed because > > > > of the "we won't honor gfp flags passed to __vmalloc" semantics it > > > > has. > > > > > > yes vmalloc doesn't support nofail semantic and it is not really trivial > > > to craft it there. Yet retry-forever is trivial to implement across everything: kvmalloc(size, gfp_mask) { gfp_t flags = gfp_mask & ~__GFP_RETRY_FOREVER; do { p = __kvmalloc(size, flags) } while (!p && (gfp_mask & __GFP_RETRY_FOREVER)); return p; } That provides "allocation will eventually succeed" semantics just fine, yes? It doesn't guarantee forwards progress or success, just that *it won't fail*. It should be obvious what the difference between "retry forever" and __GFP_NOFAIL semantics are now, and why we don't actually want __GFP_NOFAIL. We just want __GFP_RETRY_FOREVER semantics that can be applied consistently across the entire allocation API regardless of whatever other flags are passed into the allocation: don't return until an allocation with the provided semantics succeeds. > > > > Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc > > > > fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS > > > > kvmalloc via memalloc_nofs_save/restore(), so this behavioural > > > > restriction w.r.t. gfp flags just makes no sense at all. > > > > > > GFP_NOFS (without using the scope API) has the same problem as NOFAIL in > > > the vmalloc. Hence it is not supported. If you use the scope API then > > > you can GFP_KERNEL for kvmalloc. This is clumsy but I am not sure how to > > > define these conditions in a more sensible way. Special case NOFS if the > > > scope api is in use? Why do you want an explicit NOFS then? Exactly my point - this is clumsy and a total mess. I'm not asking for an explicit GFP_NOFS, just pointing out that the documented restrictions that "vmalloc can only do GFP_KERNEL allocations" is completely wrong. vmalloc() { if (!(gfp_flags & __GFP_FS)) memalloc_nofs_save(); p = __vmalloc(gfp_flags | GFP_KERNEL) if (!(gfp_flags & __GFP_FS)) memalloc_nofs_restore(); } Yup, that's how simple it is to support GFP_NOFS support in vmalloc(). This goes along with the argument that "it's impossible to do GFP_NOFAIL with vmalloc" as I addressed above. These things are not impossible, but we hide behind "we don't want people to use vmalloc" as an excuse for having shitty behaviour whilst ignoring that vmalloc is *heavily used* by core subsystems like filesystems because they cannot rely on high order allocations succeeding.... It also points out that the scope API is highly deficient. We can do GFP_NOFS via the scope API, but we can't do anything else because *there is no scope API for other GFP flags*. Why don't we have a GFP_NOFAIL/__GFP_RETRY_FOREVER scope API? That would save us a lot of bother in XFS. What about GFP_DIRECT_RECLAIM? I'd really like to turn that off for allocations in the XFS transaction commit path (as noted already in this thread) because direct reclaim that can make no progress is actively harmful (as noted already in this thread) Like I said - this is more than just bad documentation - the problem is that the whole allocation API is an inconsistent mess of control mechanisms to begin with... > > It would seem to make sense for kvmalloc to WARN_ON if it is passed > > flags that does not allow it to use vmalloc. > > vmalloc is certainly not the hottest path in the kernel so I wouldn't be > opposed. kvmalloc is most certainly becoming one of the hottest paths in XFS. IOWs, arguments that "vmalloc is not a hot path" are simply invalid these days because they are simply untrue. e.g. the profiles I posted in this thread... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-08 22:36 ` Dave Chinner @ 2021-10-11 11:57 ` Michal Hocko 2021-10-11 21:49 ` NeilBrown 2021-10-13 2:32 ` Dave Chinner 0 siblings, 2 replies; 45+ messages in thread From: Michal Hocko @ 2021-10-11 11:57 UTC (permalink / raw) To: Dave Chinner Cc: NeilBrown, Vlastimil Babka, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Sat 09-10-21 09:36:49, Dave Chinner wrote: > On Fri, Oct 08, 2021 at 09:48:39AM +0200, Michal Hocko wrote: > > __GFP_NOFAIL usage is a risk on its own. It is a hard requirement that > > the allocator cannot back off. [...] > > No, "allocator cannot back off" isn't a hard requirement for most > GFP_NOFAIL uses. *Not failing the allocation* is the hard > requirement. We are talking about the same thing here I belive. By cannot back off I really mean cannot fail. Just for the clarification. > How long it takes for the allocation to actually succeed is > irrelevant to most callers, and given that we are replacing loops > that do > > while (!(p = kmalloc(sizeof(*p), GFP_KERNEL)) > > with __GFP_NOFAIL largely indicates that allocation *latency* and/or > deadlocks are not an issue here. Agreed. > Indeed, if we deadlock in XFS because there is no memory available, > that is *not a problem kmalloc() should be trying to solve*. THe > problem is the caller being unable to handle allocation failure, so > if allocation cannot make progress, that needs to be fixed by the > caller getting rid of the unfailable allocation. > > The fact is that we've had these loops in production code for a > couple of decades and these subsystems just aren't failing or > deadlocking with such loops. IOWs, we don't need __GFP_NOFAIL to dig > deep into reserves or drive the system to OOM killing - we just need > to it keep retrying the same allocation until it succeeds. > > Put simply, we want "retry forever" semantics to match what > production kernels have been doing for the past couple of decades, > but all we've been given are "never fail" semantics that also do > something different and potentially much more problematic. > > Do you see the difference here? __GFP_NOFAIL is not what we > need in the vast majority of cases where it is used. We don't want > the failing allocations to drive the machine hard into critical > reserves, we just want the allocation to -eventually succeed- and if > it doesn't, that's our problem to handle, not kmalloc().... I can see your point. I do have a recollection that there were some instance involved where an emergency access to memory reserves helped in OOM situations. Anway as I've tried to explain earlier that this all is an implementation detail users of the flag shouldn't really care about. If this heuristic is not doing any good then it should be removed. [...] > > > > > Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc > > > > > fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS > > > > > kvmalloc via memalloc_nofs_save/restore(), so this behavioural > > > > > restriction w.r.t. gfp flags just makes no sense at all. > > > > > > > > GFP_NOFS (without using the scope API) has the same problem as NOFAIL in > > > > the vmalloc. Hence it is not supported. If you use the scope API then > > > > you can GFP_KERNEL for kvmalloc. This is clumsy but I am not sure how to > > > > define these conditions in a more sensible way. Special case NOFS if the > > > > scope api is in use? Why do you want an explicit NOFS then? > > Exactly my point - this is clumsy and a total mess. I'm not asking > for an explicit GFP_NOFS, just pointing out that the documented > restrictions that "vmalloc can only do GFP_KERNEL allocations" is > completely wrong. > > vmalloc() > { > if (!(gfp_flags & __GFP_FS)) > memalloc_nofs_save(); > p = __vmalloc(gfp_flags | GFP_KERNEL) > if (!(gfp_flags & __GFP_FS)) > memalloc_nofs_restore(); > } > > Yup, that's how simple it is to support GFP_NOFS support in > vmalloc(). Yes, this would work from the functionality POV but it defeats the philosophy behind the scope API. Why would you even need this if the scope was defined by the caller of the allocator? The initial hope was to get rid of the NOFS abuse that can be seen in many filesystems. All allocations from the scope would simply inherit the NOFS semantic so an explicit NOFS shouldn't be really necessary, right? > This goes along with the argument that "it's impossible to do > GFP_NOFAIL with vmalloc" as I addressed above. These things are not > impossible, but we hide behind "we don't want people to use vmalloc" > as an excuse for having shitty behaviour whilst ignoring that > vmalloc is *heavily used* by core subsystems like filesystems > because they cannot rely on high order allocations succeeding.... I do not think there is any reason to discourage anybody from using vmalloc these days. 32b is dying out and vmalloc space is no longer a very scarce resource. > It also points out that the scope API is highly deficient. > We can do GFP_NOFS via the scope API, but we can't > do anything else because *there is no scope API for other GFP > flags*. > > Why don't we have a GFP_NOFAIL/__GFP_RETRY_FOREVER scope API? NO{FS,IO} where first flags to start this approach. And I have to admit the experiment was much less successful then I hoped for. There are still thousands of direct NOFS users so for some reason defining scopes is not an easy thing to do. I am not against NOFAIL scopes in principle but seeing the nofs "success" I am worried this will not go really well either and it is much more tricky as NOFAIL has much stronger requirements than NOFS. Just imagine how tricky this can be if you just call a library code that is not under your control within a NOFAIL scope. What if that library code decides to allocate (e.g. printk that would attempt to do an optimistic NOWAIT allocation). > That > would save us a lot of bother in XFS. What about GFP_DIRECT_RECLAIM? > I'd really like to turn that off for allocations in the XFS > transaction commit path (as noted already in this thread) because > direct reclaim that can make no progress is actively harmful (as > noted already in this thread) As always if you have reasonable usecases then it is best to bring them up on the MM list and we can discuss them. > Like I said - this is more than just bad documentation - the problem > is that the whole allocation API is an inconsistent mess of control > mechanisms to begin with... I am not going to disagree. There is a lot of historical baggage and it doesn't help that any change is really hard to review because this interface is used throughout the kernel. I have tried to change some most obvious inconsistencies and I can tell this has always been a frustrating experience with a very small "reward" in the end because there are so many other problems. That being said, I would more than love to have a consistent and well defined interface and if you want to spend a lot of time on that then be my guest. > > > It would seem to make sense for kvmalloc to WARN_ON if it is passed > > > flags that does not allow it to use vmalloc. > > > > vmalloc is certainly not the hottest path in the kernel so I wouldn't be > > opposed. > > kvmalloc is most certainly becoming one of the hottest paths in XFS. > IOWs, arguments that "vmalloc is not a hot path" are simply invalid > these days because they are simply untrue. e.g. the profiles I > posted in this thread... Is it such a hot path that a check for compatible flags would be visible in profiles though? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-11 11:57 ` Michal Hocko @ 2021-10-11 21:49 ` NeilBrown 2021-10-18 10:23 ` Michal Hocko 2021-10-13 2:32 ` Dave Chinner 1 sibling, 1 reply; 45+ messages in thread From: NeilBrown @ 2021-10-11 21:49 UTC (permalink / raw) To: Michal Hocko Cc: Dave Chinner, Vlastimil Babka, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Mon, 11 Oct 2021, Michal Hocko wrote: > On Sat 09-10-21 09:36:49, Dave Chinner wrote: > > > > Put simply, we want "retry forever" semantics to match what > > production kernels have been doing for the past couple of decades, > > but all we've been given are "never fail" semantics that also do > > something different and potentially much more problematic. > > > > Do you see the difference here? __GFP_NOFAIL is not what we > > need in the vast majority of cases where it is used. We don't want > > the failing allocations to drive the machine hard into critical > > reserves, we just want the allocation to -eventually succeed- and if > > it doesn't, that's our problem to handle, not kmalloc().... > > I can see your point. I do have a recollection that there were some > instance involved where an emergency access to memory reserves helped > in OOM situations. It might have been better to annotate those particular calls with __GFP_ATOMIC or similar rather then change GFP_NOFAIL for everyone. Too late to fix that now though I think. Maybe the best way forward is to discourage new uses of GFP_NOFAIL. We would need a well-documented replacement. > > Anway as I've tried to explain earlier that this all is an > implementation detail users of the flag shouldn't really care about. If > this heuristic is not doing any good then it should be removed. Maybe users shouldn't care about implementation details, but they do need to care about semantics and costs. We need to know when it is appropriate to use GFP_NOFAIL, and when it is not. And what alternatives there are when it is not appropriate. Just saying "try to avoid using it" and "requires careful analysis" isn't acceptable. Sometimes it is unavoidable and analysis can only be done with a clear understanding of costs. Possibly analysis can only be done with a clear understanding of the internal implementation details. > > > It also points out that the scope API is highly deficient. > > We can do GFP_NOFS via the scope API, but we can't > > do anything else because *there is no scope API for other GFP > > flags*. > > > > Why don't we have a GFP_NOFAIL/__GFP_RETRY_FOREVER scope API? > > NO{FS,IO} where first flags to start this approach. And I have to admit > the experiment was much less successful then I hoped for. There are > still thousands of direct NOFS users so for some reason defining scopes > is not an easy thing to do. I'm not certain your conclusion is valid. It could be that defining scopes is easy enough, but no one feels motivated to do it. We need to do more than provide functionality. We need to tell people. Repeatedly. And advertise widely. And propose patches to make use of the functionality. And... and... and... I think changing to the scope API is a good change, but it is conceptually a big change. It needs to be driven. > > I am not against NOFAIL scopes in principle but seeing the nofs > "success" I am worried this will not go really well either and it is > much more tricky as NOFAIL has much stronger requirements than NOFS. > Just imagine how tricky this can be if you just call a library code > that is not under your control within a NOFAIL scope. What if that > library code decides to allocate (e.g. printk that would attempt to do > an optimistic NOWAIT allocation). __GFP_NOMEMALLOC holds a lesson worth learning here. PF_MEMALLOC effectively adds __GFP_MEMALLOC to all allocations, but some call sites need to over-ride that because there are alternate strategies available. This need-to-over-ride doesn't apply to NOFS or NOIO as that really is a thread-wide state. But MEMALLOC and NOFAIL are different. Some call sites can reasonably handle failure locally. I imagine the scope-api would say something like "NO_ENOMEM". i.e. memory allocations can fail as long as ENOMEM is never returned. Any caller that sets __GFP_RETRY_MAYFAIL or __GFP_NORETRY or maybe some others which not be affected by the NO_ENOMEM scope. But a plain GFP_KERNEL would. Introducing the scope api would be a good opportunity to drop the priority boost and *just* block until success. Priority boosts could then be added (possibly as a scope) only where they are measurably needed. I think we have 28 process flags in use. So we can probably afford one more for PF_MEMALLOC_NO_ENOMEM. What other scope flags might be useful? PF_MEMALLOC_BOOST which added __GFP_ATOMIC but not __GFP_MEMALLOC ?? PF_MEMALLOC_NORECLAIM ?? > > > That > > would save us a lot of bother in XFS. What about GFP_DIRECT_RECLAIM? > > I'd really like to turn that off for allocations in the XFS > > transaction commit path (as noted already in this thread) because > > direct reclaim that can make no progress is actively harmful (as > > noted already in this thread) > > As always if you have reasonable usecases then it is best to bring them > up on the MM list and we can discuss them. We are on the MM lists now... let's discuss :-) Dave: How would you feel about an effort to change xfs to stop using GFP_NOFS, and to use memalloc_nofs_save/restore instead? Having a major filesystem make the transition would be a good test-case, and could be used to motivate other filesystems to follow. We could add and use memalloc_no_enomem_save() too. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-11 21:49 ` NeilBrown @ 2021-10-18 10:23 ` Michal Hocko 2021-10-19 4:32 ` NeilBrown 0 siblings, 1 reply; 45+ messages in thread From: Michal Hocko @ 2021-10-18 10:23 UTC (permalink / raw) To: NeilBrown Cc: Dave Chinner, Vlastimil Babka, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Tue 12-10-21 08:49:46, Neil Brown wrote: > On Mon, 11 Oct 2021, Michal Hocko wrote: > > On Sat 09-10-21 09:36:49, Dave Chinner wrote: > > > > > > Put simply, we want "retry forever" semantics to match what > > > production kernels have been doing for the past couple of decades, > > > but all we've been given are "never fail" semantics that also do > > > something different and potentially much more problematic. > > > > > > Do you see the difference here? __GFP_NOFAIL is not what we > > > need in the vast majority of cases where it is used. We don't want > > > the failing allocations to drive the machine hard into critical > > > reserves, we just want the allocation to -eventually succeed- and if > > > it doesn't, that's our problem to handle, not kmalloc().... > > > > I can see your point. I do have a recollection that there were some > > instance involved where an emergency access to memory reserves helped > > in OOM situations. > > It might have been better to annotate those particular calls with > __GFP_ATOMIC or similar rather then change GFP_NOFAIL for everyone. For historical reasons __GFP_ATOMIC is reserved for non sleeping allocations. __GFP_HIGH would be an alternative. > Too late to fix that now though I think. Maybe the best way forward is > to discourage new uses of GFP_NOFAIL. We would need a well-documented > replacement. I am not sure what that should be. Really if the memory reserves behavior of GFP_NOFAIL is really problematic then let's just reap it out. I do not see a new nofail like flag is due. > > Anway as I've tried to explain earlier that this all is an > > implementation detail users of the flag shouldn't really care about. If > > this heuristic is not doing any good then it should be removed. > > Maybe users shouldn't care about implementation details, but they do > need to care about semantics and costs. > We need to know when it is appropriate to use GFP_NOFAIL, and when it is > not. And what alternatives there are when it is not appropriate. > Just saying "try to avoid using it" and "requires careful analysis" > isn't acceptable. Sometimes it is unavoidable and analysis can only be > done with a clear understanding of costs. Possibly analysis can only be > done with a clear understanding of the internal implementation details. What we document currently is this * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller * cannot handle allocation failures. The allocation could block * indefinitely but will never return with failure. Testing for * failure is pointless. * New users should be evaluated carefully (and the flag should be * used only when there is no reasonable failure policy) but it is * definitely preferable to use the flag rather than opencode endless * loop around allocator. * Using this flag for costly allocations is _highly_ discouraged. so we tell when to use it - aka no reasonable failure policy. We put some discouragind language there. There is some discouraging language for high order allocations. Maybe we should suggest an alternative there. It seems there are usecases for those as well so we should implement a proper NOFAIL kvmalloc and recommend it for that instead. > > > It also points out that the scope API is highly deficient. > > > We can do GFP_NOFS via the scope API, but we can't > > > do anything else because *there is no scope API for other GFP > > > flags*. > > > > > > Why don't we have a GFP_NOFAIL/__GFP_RETRY_FOREVER scope API? > > > > NO{FS,IO} where first flags to start this approach. And I have to admit > > the experiment was much less successful then I hoped for. There are > > still thousands of direct NOFS users so for some reason defining scopes > > is not an easy thing to do. > > I'm not certain your conclusion is valid. It could be that defining > scopes is easy enough, but no one feels motivated to do it. > We need to do more than provide functionality. We need to tell people. > Repeatedly. And advertise widely. And propose patches to make use of > the functionality. And... and... and... Been there, done that for the low hanging fruit. Others were much more complex for me to follow up and I had other stuff on my table. > I think changing to the scope API is a good change, but it is > conceptually a big change. It needs to be driven. Agreed. > > I am not against NOFAIL scopes in principle but seeing the nofs > > "success" I am worried this will not go really well either and it is > > much more tricky as NOFAIL has much stronger requirements than NOFS. > > Just imagine how tricky this can be if you just call a library code > > that is not under your control within a NOFAIL scope. What if that > > library code decides to allocate (e.g. printk that would attempt to do > > an optimistic NOWAIT allocation). > > __GFP_NOMEMALLOC holds a lesson worth learning here. PF_MEMALLOC > effectively adds __GFP_MEMALLOC to all allocations, but some call sites > need to over-ride that because there are alternate strategies available. > This need-to-over-ride doesn't apply to NOFS or NOIO as that really is a > thread-wide state. But MEMALLOC and NOFAIL are different. Some call > sites can reasonably handle failure locally. > > I imagine the scope-api would say something like "NO_ENOMEM". i.e. > memory allocations can fail as long as ENOMEM is never returned. > Any caller that sets __GFP_RETRY_MAYFAIL or __GFP_NORETRY or maybe some > others which not be affected by the NO_ENOMEM scope. But a plain > GFP_KERNEL would. > > Introducing the scope api would be a good opportunity to drop the > priority boost and *just* block until success. Priority boosts could > then be added (possibly as a scope) only where they are measurably needed. > > I think we have 28 process flags in use. So we can probably afford one > more for PF_MEMALLOC_NO_ENOMEM. What other scope flags might be useful? > PF_MEMALLOC_BOOST which added __GFP_ATOMIC but not __GFP_MEMALLOC ?? > PF_MEMALLOC_NORECLAIM ?? I dunno. PF_MEMALLOC and its GFP_$FOO counterparts are quite hard to wrap my head around. I have never liked thos much TBH and building more on top sounds like step backward. I might be wrong but this sounds like even more work than NOFS scopes. > > > That > > > would save us a lot of bother in XFS. What about GFP_DIRECT_RECLAIM? > > > I'd really like to turn that off for allocations in the XFS > > > transaction commit path (as noted already in this thread) because > > > direct reclaim that can make no progress is actively harmful (as > > > noted already in this thread) > > > > As always if you have reasonable usecases then it is best to bring them > > up on the MM list and we can discuss them. > > We are on the MM lists now... let's discuss :-) Sure we can but this thread is a mix of so many topics that finding something useful will turn to be hard from my past experience. > Dave: How would you feel about an effort to change xfs to stop using > GFP_NOFS, and to use memalloc_nofs_save/restore instead? xfs is an example of a well behaved scope user. In fact the API has been largely based on xfs previous interface. There are still NOFS usesages in xfs which would be great to get rid of (e.g. the default mapping NOFS which was added due to lockdep false positives but that is unrelated). > Having a major > filesystem make the transition would be a good test-case, and could be > used to motivate other filesystems to follow. > We could add and use memalloc_no_enomem_save() too. ext has converted their transaction context to the scope API as well. There is still some explicit NOFS usage but I haven't checked details recently. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-18 10:23 ` Michal Hocko @ 2021-10-19 4:32 ` NeilBrown 2021-10-19 13:59 ` Michal Hocko 0 siblings, 1 reply; 45+ messages in thread From: NeilBrown @ 2021-10-19 4:32 UTC (permalink / raw) To: Michal Hocko Cc: Dave Chinner, Vlastimil Babka, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Mon, 18 Oct 2021, Michal Hocko wrote: > On Tue 12-10-21 08:49:46, Neil Brown wrote: > > On Mon, 11 Oct 2021, Michal Hocko wrote: > > > On Sat 09-10-21 09:36:49, Dave Chinner wrote: > > > > > > > > Put simply, we want "retry forever" semantics to match what > > > > production kernels have been doing for the past couple of decades, > > > > but all we've been given are "never fail" semantics that also do > > > > something different and potentially much more problematic. > > > > > > > > Do you see the difference here? __GFP_NOFAIL is not what we > > > > need in the vast majority of cases where it is used. We don't want > > > > the failing allocations to drive the machine hard into critical > > > > reserves, we just want the allocation to -eventually succeed- and if > > > > it doesn't, that's our problem to handle, not kmalloc().... > > > > > > I can see your point. I do have a recollection that there were some > > > instance involved where an emergency access to memory reserves helped > > > in OOM situations. > > > > It might have been better to annotate those particular calls with > > __GFP_ATOMIC or similar rather then change GFP_NOFAIL for everyone. > > For historical reasons __GFP_ATOMIC is reserved for non sleeping > allocations. __GFP_HIGH would be an alternative. Historical reasons certainly shouldn't be ignored. But they can be questioned. __GFP_ATOMIC is documented as "the caller cannot reclaim or sleep and is high priority". This seems to over-lap with __GFP_DIRECT_RECLAIM (which permits reclaim and is the only place where page_alloc sleeps ... I think). The effect of setting __GFP_ATOMIC is: - triggers WARN_ON if __GFP_DIRECT_RECLAIM is also set. - bypass memcg limits - ignore the watermark_boost_factor effect - clears ALLOC_CPUSET - sets ALLOC_HARDER which provides: - access to nr_reserved_highatomic reserves - access to 1/4 the low-watermark reserves (ALLOC_HIGH gives 1/2) Combine them and you get access to 5/8 of the reserves. It is also used by driver/iommu/tegra-smmu.c to decide if a spinlock should remain held, or should be dropped over the alloc_page(). That's .... not my favourite code. So apart from the tegra thing and the WARN_ON, there is nothing about __GFP_ATOMIC which suggests it should only be used for non-sleeping allocations. It *should* only be used for allocations with a high failure cost and relatively short time before the memory will be returned and that likely includes many non sleeping allocations. It isn't clear to me why an allocation that is willing to sleep (if absolutely necessary) shouldn't be able to benefit from the priority boost of __GFP_ATOMIC. Or at least of ALLOC_HARDER... Maybe __GFP_HIGH should get the memcg and watermark_boost benefits too? Given that we have ALLOC_HARDER and ALLOC_HIGH, it would seem to be sensible to export those two settings in GFP_foo, and not forbid one of them to be used with __GFP_DIRECT_RECLAIM. > > > Too late to fix that now though I think. Maybe the best way forward is > > to discourage new uses of GFP_NOFAIL. We would need a well-documented > > replacement. > > I am not sure what that should be. Really if the memory reserves > behavior of GFP_NOFAIL is really problematic then let's just reap it > out. I do not see a new nofail like flag is due. Presumably there is a real risk of deadlock if we just remove the memory-reserves boosts of __GFP_NOFAIL. Maybe it would be safe to replace all current users of __GFP_NOFAIL with __GFP_NOFAIL|__GFP_HIGH, and then remove the __GFP_HIGH where analysis suggests there is no risk of deadlocks. Or maybe rename the __GFP_NOFAIL flag and #define __GFP_NOFAIL to include __GFP_HIGH? This would certainly be a better result than adding a new flag. > > > > Anway as I've tried to explain earlier that this all is an > > > implementation detail users of the flag shouldn't really care about. If > > > this heuristic is not doing any good then it should be removed. > > > > Maybe users shouldn't care about implementation details, but they do > > need to care about semantics and costs. > > We need to know when it is appropriate to use GFP_NOFAIL, and when it is > > not. And what alternatives there are when it is not appropriate. > > Just saying "try to avoid using it" and "requires careful analysis" > > isn't acceptable. Sometimes it is unavoidable and analysis can only be > > done with a clear understanding of costs. Possibly analysis can only be > > done with a clear understanding of the internal implementation details. > > What we document currently is this > * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller > * cannot handle allocation failures. The allocation could block > * indefinitely but will never return with failure. Testing for > * failure is pointless. This implies it is incompatible with __GFP_NORETRY and (probably) requires __GFP_RECLAIM. That is worth documenting, and possibly also a WARN_ON. > * New users should be evaluated carefully (and the flag should be > * used only when there is no reasonable failure policy) but it is > * definitely preferable to use the flag rather than opencode endless > * loop around allocator. How do we perform this evaluation? And why is it preferable to a loop? There are times when a loop makes sense, if there might be some other event that could provide the needed memory ... or if a SIGKILL might make it irrelevant. slab allocators presumably shouldn't pass __GFP_NOFAIL to alloc_page(), but should instead loop around 1/ check if any existing slabs have space 2/ if not, try to allocate a new page Providing the latter blocks for a while but not indefinitely that should be optimal. Why is __GFP_NOFAIL better? > * Using this flag for costly allocations is _highly_ discouraged. This is unhelpful. Saying something is "discouraged" carries an implied threat. This is open source and threats need to be open. Why is it discouraged? IF it is not forbidden, then it is clearly permitted. Maybe there are costs - so a clear statement of those costs would be appropriate. Also, what is a suitable alternative? Current code will trigger a WARN_ON, so it is effectively forbidden. Maybe we should document that __GFP_NOFAIL is forbidden for orders above 1, and that vmalloc() should be used instead (thanks for proposing that patch!). But that would mean __GFP_NOFAIL cannot be used for slabs which happen to use large orders. Hmmm. it appears that slub.c disables __GFP_NOFAIL when it tries for a large order allocation, and slob.c never tries large order allocations. So this only affects slab.c. xfs makes heavy use of kmem_cache_zalloc with __GFP_NOFAIL. I wonder if any of these slabs have large order with slab.c. > > so we tell when to use it - aka no reasonable failure policy. We put > some discouragind language there. There is some discouraging language > for high order allocations. Maybe we should suggest an alternative > there. It seems there are usecases for those as well so we should > implement a proper NOFAIL kvmalloc and recommend it for that instead. yes- suggest an alternative and also say what the tradeoffs are. > > > > > It also points out that the scope API is highly deficient. > > > > We can do GFP_NOFS via the scope API, but we can't > > > > do anything else because *there is no scope API for other GFP > > > > flags*. > > > > > > > > Why don't we have a GFP_NOFAIL/__GFP_RETRY_FOREVER scope API? > > > > > > NO{FS,IO} where first flags to start this approach. And I have to admit > > > the experiment was much less successful then I hoped for. There are > > > still thousands of direct NOFS users so for some reason defining scopes > > > is not an easy thing to do. > > > > I'm not certain your conclusion is valid. It could be that defining > > scopes is easy enough, but no one feels motivated to do it. > > We need to do more than provide functionality. We need to tell people. > > Repeatedly. And advertise widely. And propose patches to make use of > > the functionality. And... and... and... > > Been there, done that for the low hanging fruit. Others were much more > complex for me to follow up and I had other stuff on my table. I have no doubt that is a slow and rather thankless task, with no real payoff until it is complete. It reminds me a bit of BKL removal and 64-bit time. I think it is worth doing though. Finding the balance between letting it consume you and just giving up would be a challenge. > > > I think changing to the scope API is a good change, but it is > > conceptually a big change. It needs to be driven. > > Agreed. > > > > I am not against NOFAIL scopes in principle but seeing the nofs > > > "success" I am worried this will not go really well either and it is > > > much more tricky as NOFAIL has much stronger requirements than NOFS. > > > Just imagine how tricky this can be if you just call a library code > > > that is not under your control within a NOFAIL scope. What if that > > > library code decides to allocate (e.g. printk that would attempt to do > > > an optimistic NOWAIT allocation). > > > > __GFP_NOMEMALLOC holds a lesson worth learning here. PF_MEMALLOC > > effectively adds __GFP_MEMALLOC to all allocations, but some call sites > > need to over-ride that because there are alternate strategies available. > > This need-to-over-ride doesn't apply to NOFS or NOIO as that really is a > > thread-wide state. But MEMALLOC and NOFAIL are different. Some call > > sites can reasonably handle failure locally. > > > > I imagine the scope-api would say something like "NO_ENOMEM". i.e. > > memory allocations can fail as long as ENOMEM is never returned. > > Any caller that sets __GFP_RETRY_MAYFAIL or __GFP_NORETRY or maybe some > > others which not be affected by the NO_ENOMEM scope. But a plain > > GFP_KERNEL would. > > > > Introducing the scope api would be a good opportunity to drop the > > priority boost and *just* block until success. Priority boosts could > > then be added (possibly as a scope) only where they are measurably needed. > > > > I think we have 28 process flags in use. So we can probably afford one > > more for PF_MEMALLOC_NO_ENOMEM. What other scope flags might be useful? > > PF_MEMALLOC_BOOST which added __GFP_ATOMIC but not __GFP_MEMALLOC ?? > > PF_MEMALLOC_NORECLAIM ?? > > I dunno. PF_MEMALLOC and its GFP_$FOO counterparts are quite hard to > wrap my head around. I have never liked thos much TBH and building more > on top sounds like step backward. I might be wrong but this sounds like > even more work than NOFS scopes. > > > > > That > > > > would save us a lot of bother in XFS. What about GFP_DIRECT_RECLAIM? > > > > I'd really like to turn that off for allocations in the XFS > > > > transaction commit path (as noted already in this thread) because > > > > direct reclaim that can make no progress is actively harmful (as > > > > noted already in this thread) > > > > > > As always if you have reasonable usecases then it is best to bring them > > > up on the MM list and we can discuss them. > > > > We are on the MM lists now... let's discuss :-) > > Sure we can but this thread is a mix of so many topics that finding > something useful will turn to be hard from my past experience. Unfortunately life is messy. I just wanted to remove all congestion_wait() calls. But that lead to __GFP_NOFAIL and to scopes allocation API, and there are still more twisty passages waiting. Sometimes you don't know what topic will usefully start a constructive thread until you've already figured out the answer :-( > > > Dave: How would you feel about an effort to change xfs to stop using > > GFP_NOFS, and to use memalloc_nofs_save/restore instead? > > xfs is an example of a well behaved scope user. In fact the API has been > largely based on xfs previous interface. There are still NOFS usesages > in xfs which would be great to get rid of (e.g. the default mapping NOFS > which was added due to lockdep false positives but that is unrelated). > > > Having a major > > filesystem make the transition would be a good test-case, and could be > > used to motivate other filesystems to follow. > > We could add and use memalloc_no_enomem_save() too. > > ext has converted their transaction context to the scope API as well. > There is still some explicit NOFS usage but I haven't checked details > recently. Of the directories in fs/, 42 contain no mention of GFP_NOFS 17 contain fewer than 10 The 10 with most frequent usage (including comments) are: 47 fs/afs/ 48 fs/f2fs/ 49 fs/nfs/ 54 fs/dlm/ 59 fs/ceph/ 66 fs/ext4/ 73 fs/ntfs3/ 73 fs/ocfs2/ 83 fs/ubifs/ 231 fs/btrfs/ xfs is 28 - came in number 12. Though there are 25 KM_NOFS allocations, which would push it up to 7th place. A few use GFP_NOIO - nfs(11) and f2fs(9) being the biggest users. So clearly there is work to do Maybe we could add something to checkpatch.pl to discourage the addition of new GFP_NOFS usage. There is a lot of stuff there.... the bits that are important to me are: - why is __GFP_NOFAIL preferred? It is a valuable convenience, but I don't see that it is necessary - is it reasonable to use __GFP_HIGH when looping if there is a risk of deadlock? - Will __GFP_DIRECT_RECLAIM always result in a delay before failure? In that case it should be safe to loop around allocations using __GFP_DIRECT_RECLAIM without needing congestion_wait() (so it can just be removed. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-19 4:32 ` NeilBrown @ 2021-10-19 13:59 ` Michal Hocko 2021-10-22 0:09 ` NeilBrown 0 siblings, 1 reply; 45+ messages in thread From: Michal Hocko @ 2021-10-19 13:59 UTC (permalink / raw) To: NeilBrown Cc: Dave Chinner, Vlastimil Babka, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Tue 19-10-21 15:32:27, Neil Brown wrote: > On Mon, 18 Oct 2021, Michal Hocko wrote: > > On Tue 12-10-21 08:49:46, Neil Brown wrote: > > > On Mon, 11 Oct 2021, Michal Hocko wrote: > > > > On Sat 09-10-21 09:36:49, Dave Chinner wrote: > > > > > > > > > > Put simply, we want "retry forever" semantics to match what > > > > > production kernels have been doing for the past couple of decades, > > > > > but all we've been given are "never fail" semantics that also do > > > > > something different and potentially much more problematic. > > > > > > > > > > Do you see the difference here? __GFP_NOFAIL is not what we > > > > > need in the vast majority of cases where it is used. We don't want > > > > > the failing allocations to drive the machine hard into critical > > > > > reserves, we just want the allocation to -eventually succeed- and if > > > > > it doesn't, that's our problem to handle, not kmalloc().... > > > > > > > > I can see your point. I do have a recollection that there were some > > > > instance involved where an emergency access to memory reserves helped > > > > in OOM situations. > > > > > > It might have been better to annotate those particular calls with > > > __GFP_ATOMIC or similar rather then change GFP_NOFAIL for everyone. > > > > For historical reasons __GFP_ATOMIC is reserved for non sleeping > > allocations. __GFP_HIGH would be an alternative. > > Historical reasons certainly shouldn't be ignored. But they can be > questioned. Agreed. Changing them is a more challenging task though. For example I really dislike how access to memory reserves is bound to "no reclaim" requirement. Ideally those should be completely orthogonal. I also do not think we need as many ways to ask for memory reserves as we have. Can a "regular" kernel developer tell a difference between __GFP_ATOMIC and __GFP_HIGH? I do not think so, unless one is willing to do ... > __GFP_ATOMIC is documented as "the caller cannot reclaim or sleep and is > high priority". > This seems to over-lap with __GFP_DIRECT_RECLAIM (which permits reclaim > and is the only place where page_alloc sleeps ... I think). > > The effect of setting __GFP_ATOMIC is: > - triggers WARN_ON if __GFP_DIRECT_RECLAIM is also set. > - bypass memcg limits > - ignore the watermark_boost_factor effect > - clears ALLOC_CPUSET > - sets ALLOC_HARDER which provides: > - access to nr_reserved_highatomic reserves > - access to 1/4 the low-watermark reserves (ALLOC_HIGH gives 1/2) > Combine them and you get access to 5/8 of the reserves. ... exactly this. And these are bunch of hacks developed over time and the baggage which is hard to change as I've said. Somebody with a sufficient time budget should start questioning all those and eventually make __GFP_ATOMIC a story of the past. > It is also used by driver/iommu/tegra-smmu.c to decide if a spinlock > should remain held, or should be dropped over the alloc_page(). That's > .... not my favourite code. Exactly! > So apart from the tegra thing and the WARN_ON, there is nothing about > __GFP_ATOMIC which suggests it should only be used for non-sleeping > allocations. The warning was added when the original GFP_ATOMIC was untangled from the reclaim implications to keep the "backward compatibility" IIRC. Mostly for IRQ handlers where the GFP_ATOMIC was used the most. My memory might fail me though. > It *should* only be used for allocations with a high failure cost and > relatively short time before the memory will be returned and that likely > includes many non sleeping allocations. It isn't clear to me why an > allocation that is willing to sleep (if absolutely necessary) shouldn't > be able to benefit from the priority boost of __GFP_ATOMIC. Or at least > of ALLOC_HARDER... I completely agree! As mentioned above memory reserves should be completely orthogonal. I am not sure we want an API for many different levels of reserves access. Do we need more than __GFP_HIGH? Maybe with a more descriptive name. > Maybe __GFP_HIGH should get the memcg and watermark_boost benefits too? > > Given that we have ALLOC_HARDER and ALLOC_HIGH, it would seem to be > sensible to export those two settings in GFP_foo, and not forbid one of > them to be used with __GFP_DIRECT_RECLAIM. I think ALLOC_HARDER should be kept internal implementation detail when the allocator needs to give somebody a boost on top of requests for internal balancing between requests. ALLOC_HIGH already matches __GFP_HIGH and that should be the way to ask for a boost explicitly IMO. We also have ALLOC_OOM as another level of internal memory reserves for OOM victims. Again something to be in hands of the allocator. > > > Too late to fix that now though I think. Maybe the best way forward is > > > to discourage new uses of GFP_NOFAIL. We would need a well-documented > > > replacement. > > > > I am not sure what that should be. Really if the memory reserves > > behavior of GFP_NOFAIL is really problematic then let's just reap it > > out. I do not see a new nofail like flag is due. > > Presumably there is a real risk of deadlock if we just remove the > memory-reserves boosts of __GFP_NOFAIL. Maybe it would be safe to > replace all current users of __GFP_NOFAIL with __GFP_NOFAIL|__GFP_HIGH, > and then remove the __GFP_HIGH where analysis suggests there is no risk > of deadlocks. I would much rather not bind those together and go other way around. If somebody can actually hit deadlocks (those are quite easy to spot as they do not go away) then we can talk about how to deal with them. Memory reserves can help only > < this much. > Or maybe rename the __GFP_NOFAIL flag and #define __GFP_NOFAIL to > include __GFP_HIGH? Wouldn't that lead to the __GFP_ATOMIC story again? > This would certainly be a better result than adding a new flag. > > > > > > > Anway as I've tried to explain earlier that this all is an > > > > implementation detail users of the flag shouldn't really care about. If > > > > this heuristic is not doing any good then it should be removed. > > > > > > Maybe users shouldn't care about implementation details, but they do > > > need to care about semantics and costs. > > > We need to know when it is appropriate to use GFP_NOFAIL, and when it is > > > not. And what alternatives there are when it is not appropriate. > > > Just saying "try to avoid using it" and "requires careful analysis" > > > isn't acceptable. Sometimes it is unavoidable and analysis can only be > > > done with a clear understanding of costs. Possibly analysis can only be > > > done with a clear understanding of the internal implementation details. > > > > What we document currently is this > > * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller > > * cannot handle allocation failures. The allocation could block > > * indefinitely but will never return with failure. Testing for > > * failure is pointless. > > This implies it is incompatible with __GFP_NORETRY and (probably) > requires __GFP_RECLAIM. That is worth documenting, and possibly also a > WARN_ON. Yes, I thought this would be obvious as those are reclaim modifiers so they require a reclaim. But I do see a point that being explicit here cannot hurt. Same with combining them together. It just doesn't make much sense to retry for ever and requesting noretry or retry and fail at the same time. Again a clarification cannot hurt though. > > * New users should be evaluated carefully (and the flag should be > > * used only when there is no reasonable failure policy) but it is > > * definitely preferable to use the flag rather than opencode endless > > * loop around allocator. > > How do we perform this evaluation? And why is it preferable to a loop? > There are times when a loop makes sense, if there might be some other > event that could provide the needed memory ... or if a SIGKILL might > make it irrelevant. > slab allocators presumably shouldn't pass __GFP_NOFAIL to alloc_page(), > but should instead loop around > 1/ check if any existing slabs have space > 2/ if not, try to allocate a new page > Providing the latter blocks for a while but not indefinitely that should > be optimal. > Why is __GFP_NOFAIL better? Because the allocator can do something if it knows that the allocation cannot fail. E.g. give such an allocation a higher priority over those that are allowed to fail. This is not limited to memory reserves, although this is the only measure that is implemented currently IIRC. On the other hand if there is something interesting the caller can do directly - e.g. do internal object management like mempool does - then it is better to retry at that level. > > * Using this flag for costly allocations is _highly_ discouraged. > > This is unhelpful. Saying something is "discouraged" carries an implied > threat. This is open source and threats need to be open. > Why is it discouraged? IF it is not forbidden, then it is clearly > permitted. Maybe there are costs - so a clear statement of those costs > would be appropriate. > Also, what is a suitable alternative? > > Current code will trigger a WARN_ON, so it is effectively forbidden. > Maybe we should document that __GFP_NOFAIL is forbidden for orders above > 1, and that vmalloc() should be used instead (thanks for proposing that > patch!). I think we want to recommend kvmalloc as an alternative once vmalloc is NOFAIL aware. I will skip over some of the specific regarding SLAB and NOFS usage if you do not mind and focus on points that have direct documentation consequences. Also I do not feel qualified commenting on neither SLAB nor FS internals. [...] > There is a lot of stuff there.... the bits that are important to me are: > > - why is __GFP_NOFAIL preferred? It is a valuable convenience, but I > don't see that it is necessary I think it is preferred for one and a half reasons. It tells allocator that this allocation cannot really fail and the caller doesn't have a very good/clever retry policy (e.g. like mempools mentioned above). The half reason would be for tracking purposes (git grep __GFP_NOFAIL) is easier than trying to catch all sorts of while loops over allocation which do not do anything really interesting. > - is it reasonable to use __GFP_HIGH when looping if there is a risk of > deadlock? As I've said above. Memory reserves are a finite resource and as such they cannot fundamentally solve deadlocks. They can help prioritize though. > - Will __GFP_DIRECT_RECLAIM always result in a delay before failure? In > that case it should be safe to loop around allocations using > __GFP_DIRECT_RECLAIM without needing congestion_wait() (so it can > just be removed. This is a good question and I do not think we have that documented anywhere. We do cond_resched() for sure. I do not think we guarantee a sleeping point in general. Maybe we should, I am not really sure. Thanks for good comments and tough questions -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-19 13:59 ` Michal Hocko @ 2021-10-22 0:09 ` NeilBrown 0 siblings, 0 replies; 45+ messages in thread From: NeilBrown @ 2021-10-22 0:09 UTC (permalink / raw) To: Michal Hocko Cc: Dave Chinner, Vlastimil Babka, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Wed, 20 Oct 2021, Michal Hocko wrote: > On Tue 19-10-21 15:32:27, Neil Brown wrote: [clip looks of discussion where we are largely in agreement - happy to see that!] > > Presumably there is a real risk of deadlock if we just remove the > > memory-reserves boosts of __GFP_NOFAIL. Maybe it would be safe to > > replace all current users of __GFP_NOFAIL with __GFP_NOFAIL|__GFP_HIGH, > > and then remove the __GFP_HIGH where analysis suggests there is no risk > > of deadlocks. > > I would much rather not bind those together and go other way around. If > somebody can actually hit deadlocks (those are quite easy to spot as > they do not go away) then we can talk about how to deal with them. > Memory reserves can help only > < this much. I recall maybe 10 years ago Linus saying that he preferred simplicity to mathematical provability for handling memory deadlocks (or something like that). I lean towards provability myself, but I do see the other perspective. We have mempools and they can provide strong guarantees (though they are often over-allocated I think). But they can be a bit clumsy. I believe that DaveM is strong against anything like that in the network layer, so we strongly depend on GFP_MEMALLOC functionality for swap-over-NFS. I'm sure it is important elsewhere too. Of course __GFP_HIGH and __GFP_ATOMIC provide an intermediate priority level - more likely to fail than __GFP_MEMALLOC. I suspect they should not be seen as avoiding deadlock, only as improving service. So using them when we cannot wait might make sense, but there are probably other circumstances. > > Why is __GFP_NOFAIL better? > > Because the allocator can do something if it knows that the allocation > cannot fail. E.g. give such an allocation a higher priority over those > that are allowed to fail. This is not limited to memory reserves, > although this is the only measure that is implemented currently IIRC. > On the other hand if there is something interesting the caller can do > directly - e.g. do internal object management like mempool does - then > it is better to retry at that level. It *can* do something, but I don't think it *should* do something - not if that could have a negative impact on other threads. Just because I cannot fail, that doesn't mean someone else should fail to help me. Maybe I should just wait longer. > > > > * Using this flag for costly allocations is _highly_ discouraged. > > > > This is unhelpful. Saying something is "discouraged" carries an implied > > threat. This is open source and threats need to be open. > > Why is it discouraged? IF it is not forbidden, then it is clearly > > permitted. Maybe there are costs - so a clear statement of those costs > > would be appropriate. > > Also, what is a suitable alternative? > > > > Current code will trigger a WARN_ON, so it is effectively forbidden. > > Maybe we should document that __GFP_NOFAIL is forbidden for orders above > > 1, and that vmalloc() should be used instead (thanks for proposing that > > patch!). > > I think we want to recommend kvmalloc as an alternative once vmalloc is > NOFAIL aware. > > I will skip over some of the specific regarding SLAB and NOFS usage if > you do not mind and focus on points that have direct documentation > consequences. Also I do not feel qualified commenting on neither SLAB > nor FS internals. > > [...] > > There is a lot of stuff there.... the bits that are important to me are: > > > > - why is __GFP_NOFAIL preferred? It is a valuable convenience, but I > > don't see that it is necessary > > I think it is preferred for one and a half reasons. It tells allocator > that this allocation cannot really fail and the caller doesn't have a > very good/clever retry policy (e.g. like mempools mentioned above). The > half reason would be for tracking purposes (git grep __GFP_NOFAIL) is > easier than trying to catch all sorts of while loops over allocation > which do not do anything really interesting. I think the one reason is misguided, as described above. I think the half reason is good, and that we should introduce memalloc_retry_wait() and encourage developers to use that for any memalloc retry loop. __GFP_NOFAIL would then be a convenience flag which causes the allocator (slab or alloc_page or whatever) to call memalloc_retry_wait() and do the loop internally. What exactly memalloc_retry_wait() does (if anything) can be decided separately and changed as needed. > > - is it reasonable to use __GFP_HIGH when looping if there is a risk of > > deadlock? > > As I've said above. Memory reserves are a finite resource and as such > they cannot fundamentally solve deadlocks. They can help prioritize > though. To be fair, they can solve 1 level of deadlock. i.e. if you only need to make one allocation to guarantee progress, then allocating from reserves can help. If you might need to make a second allocation without freeing the first - then a single reserve pool can provide guarantees (which is why we use mempool is layered block devices - md over dm over loop of scsi). > > > - Will __GFP_DIRECT_RECLAIM always result in a delay before failure? In > > that case it should be safe to loop around allocations using > > __GFP_DIRECT_RECLAIM without needing congestion_wait() (so it can > > just be removed. > > This is a good question and I do not think we have that documented > anywhere. We do cond_resched() for sure. I do not think we guarantee a > sleeping point in general. Maybe we should, I am not really sure. If we add memalloc_retry_wait(), it wouldn't matter. We would only need to ensure that memalloc_retry_wait() waited if page_alloc didn't. I think we should: - introduce memalloc_retry_wait() and use it for all malloc retry loops including __GFP_NOFAIL - drop all the priority boosts added for __GFP_NOFAIL - drop __GFP_ATOMIC and change all the code that tests for __GFP_ATOMIC to instead test for __GFP_HIGH. __GFP_ATOMIC is NEVER used without __GFP_HIGH. This give a slight boost to several sites that use __GFP_HIGH explicitly. - choose a consistent order threshold for disallowing __GFP_NOFAIL (rmqueue uses "order > 1", __alloc_pages_slowpath uses "order > PAGE_ALLOC_COSTLY_ORDER"), test it once - early - and document kvmalloc as an alternative. Code can also loop if there is an alternative strategy for freeing up memory. Thanks, NeilBrown Thanks, NeilBrown ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-11 11:57 ` Michal Hocko 2021-10-11 21:49 ` NeilBrown @ 2021-10-13 2:32 ` Dave Chinner 2021-10-13 8:26 ` Michal Hocko 1 sibling, 1 reply; 45+ messages in thread From: Dave Chinner @ 2021-10-13 2:32 UTC (permalink / raw) To: Michal Hocko Cc: NeilBrown, Vlastimil Babka, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Mon, Oct 11, 2021 at 01:57:36PM +0200, Michal Hocko wrote: > On Sat 09-10-21 09:36:49, Dave Chinner wrote: > > On Fri, Oct 08, 2021 at 09:48:39AM +0200, Michal Hocko wrote: > > > > > > Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc > > > > > > fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS > > > > > > kvmalloc via memalloc_nofs_save/restore(), so this behavioural > > > > > > restriction w.r.t. gfp flags just makes no sense at all. > > > > > > > > > > GFP_NOFS (without using the scope API) has the same problem as NOFAIL in > > > > > the vmalloc. Hence it is not supported. If you use the scope API then > > > > > you can GFP_KERNEL for kvmalloc. This is clumsy but I am not sure how to > > > > > define these conditions in a more sensible way. Special case NOFS if the > > > > > scope api is in use? Why do you want an explicit NOFS then? > > > > Exactly my point - this is clumsy and a total mess. I'm not asking > > for an explicit GFP_NOFS, just pointing out that the documented > > restrictions that "vmalloc can only do GFP_KERNEL allocations" is > > completely wrong. > > > > vmalloc() > > { > > if (!(gfp_flags & __GFP_FS)) > > memalloc_nofs_save(); > > p = __vmalloc(gfp_flags | GFP_KERNEL) > > if (!(gfp_flags & __GFP_FS)) > > memalloc_nofs_restore(); > > } > > > > Yup, that's how simple it is to support GFP_NOFS support in > > vmalloc(). > > Yes, this would work from the functionality POV but it defeats the > philosophy behind the scope API. Why would you even need this if the > scope was defined by the caller of the allocator? Who actually cares that vmalloc might be using the scoped API internally to implement GFP_NOFS or GFP_NOIO? Nobody at all. It is far more useful (and self documenting!) for one-off allocations to pass a GFP_NOFS flag than it is to use a scope API... > The initial hope was > to get rid of the NOFS abuse that can be seen in many filesystems. All > allocations from the scope would simply inherit the NOFS semantic so > an explicit NOFS shouldn't be really necessary, right? Yes, but I think you miss my point entirely: that the vmalloc restrictions on what gfp flags can be passed without making it entirely useless are completely arbitrary and non-sensical. > > This goes along with the argument that "it's impossible to do > > GFP_NOFAIL with vmalloc" as I addressed above. These things are not > > impossible, but we hide behind "we don't want people to use vmalloc" > > as an excuse for having shitty behaviour whilst ignoring that > > vmalloc is *heavily used* by core subsystems like filesystems > > because they cannot rely on high order allocations succeeding.... > > I do not think there is any reason to discourage anybody from using > vmalloc these days. 32b is dying out and vmalloc space is no longer a > very scarce resource. We are still discouraged from doing high order allocations and should only use pages directly. Not to mention that the API doesn't make it simple to use vmalloc as a direct replacement for high order kmalloc tends to discourage new users... > > It also points out that the scope API is highly deficient. > > We can do GFP_NOFS via the scope API, but we can't > > do anything else because *there is no scope API for other GFP > > flags*. > > > > Why don't we have a GFP_NOFAIL/__GFP_RETRY_FOREVER scope API? > > NO{FS,IO} where first flags to start this approach. And I have to admit > the experiment was much less successful then I hoped for. There are > still thousands of direct NOFS users so for some reason defining scopes > is not an easy thing to do. > > I am not against NOFAIL scopes in principle but seeing the nofs > "success" I am worried this will not go really well either and it is > much more tricky as NOFAIL has much stronger requirements than NOFS. > Just imagine how tricky this can be if you just call a library code > that is not under your control within a NOFAIL scope. What if that > library code decides to allocate (e.g. printk that would attempt to do > an optimistic NOWAIT allocation). I already asked you that _exact_ question earlier in the thread w.r.t. kvmalloc(GFP_NOFAIL) using optimistic NOWAIT kmalloc allocation. I asked you as a MM expert to define *and document* the behaviour that should result, not turn around and use the fact that it is undefined behaviour as a "this is too hard" excuse for not changing anything. THe fact is that the scope APIs are only really useful for certain contexts where restrictions are set by higher level functionality. For one-off allocation constraints the API sucks and we end up with crap like this (found in btrfs): /* * We're holding a transaction handle, so use a NOFS memory * allocation context to avoid deadlock if reclaim happens. */ nofs_flag = memalloc_nofs_save(); value = kmalloc(size, GFP_KERNEL); memalloc_nofs_restore(nofs_flag); But also from btrfs, this pattern is repeated in several places: nofs_flag = memalloc_nofs_save(); ctx = kvmalloc(struct_size(ctx, chunks, num_chunks), GFP_KERNEL); memalloc_nofs_restore(nofs_flag); This needs to use the scoped API because vmalloc doesn't support GFP_NOFS. So the poor "vmalloc needs scoped API" pattern is bleeding over into other code that doesn't have the problems vmalloc does. Do you see how this leads to poorly written code now? Or perhaps I should just point at ceph? /* * kvmalloc() doesn't fall back to the vmalloc allocator unless flags are * compatible with (a superset of) GFP_KERNEL. This is because while the * actual pages are allocated with the specified flags, the page table pages * are always allocated with GFP_KERNEL. * * ceph_kvmalloc() may be called with GFP_KERNEL, GFP_NOFS or GFP_NOIO. */ void *ceph_kvmalloc(size_t size, gfp_t flags) { void *p; if ((flags & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) { p = kvmalloc(size, flags); } else if ((flags & (__GFP_IO | __GFP_FS)) == __GFP_IO) { unsigned int nofs_flag = memalloc_nofs_save(); p = kvmalloc(size, GFP_KERNEL); memalloc_nofs_restore(nofs_flag); } else { unsigned int noio_flag = memalloc_noio_save(); p = kvmalloc(size, GFP_KERNEL); memalloc_noio_restore(noio_flag); } return p; } IOWs, a large number of the users of the scope API simply make [k]vmalloc() provide GFP_NOFS behaviour. ceph_kvmalloc() is pretty much a wrapper that indicates how all vmalloc functions should behave. Honour GFP_NOFS and GFP_NOIO by using the scope API internally. > > That > > would save us a lot of bother in XFS. What about GFP_DIRECT_RECLAIM? > > I'd really like to turn that off for allocations in the XFS > > transaction commit path (as noted already in this thread) because > > direct reclaim that can make no progress is actively harmful (as > > noted already in this thread) > > As always if you have reasonable usecases then it is best to bring them > up on the MM list and we can discuss them. They've been pointed out many times in the past, and I've pointed them out again in this thread. Telling me to "bring them up on the mm list" when that's exactly what I'm doing right now is not a helpful response. > > Like I said - this is more than just bad documentation - the problem > > is that the whole allocation API is an inconsistent mess of control > > mechanisms to begin with... > > I am not going to disagree. There is a lot of historical baggage and > it doesn't help that any change is really hard to review because this > interface is used throughout the kernel. I have tried to change some > most obvious inconsistencies and I can tell this has always been a > frustrating experience with a very small "reward" in the end because > there are so many other problems. Technical debt in the mm APIs is something the mm developers need to address, not the people who tell you it's a problem for them. Telling the messenger "do my job for me because I find it too frustrating to make progress myself" doesn't help anyone make progress. If you find it frustrating trying to get mm code changed, imagine what it feels like for someone on the outside asking for relatively basic things like a consistent control API.... > That being said, I would more than love to have a consistent and well > defined interface and if you want to spend a lot of time on that then be > my guest. My point exactly: saying "fix it yourself" is not a good response.... > > > > It would seem to make sense for kvmalloc to WARN_ON if it is passed > > > > flags that does not allow it to use vmalloc. > > > > > > vmalloc is certainly not the hottest path in the kernel so I wouldn't be > > > opposed. > > > > kvmalloc is most certainly becoming one of the hottest paths in XFS. > > IOWs, arguments that "vmalloc is not a hot path" are simply invalid > > these days because they are simply untrue. e.g. the profiles I > > posted in this thread... > > Is it such a hot path that a check for compatible flags would be visible > in profiles though? No, that doesn't even show up as noise - the overhead of global spinlock contention and direct reclaim are the elephants that profiles point to, not a couple of flag checks on function parameters... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-13 2:32 ` Dave Chinner @ 2021-10-13 8:26 ` Michal Hocko 2021-10-14 11:32 ` David Sterba 0 siblings, 1 reply; 45+ messages in thread From: Michal Hocko @ 2021-10-13 8:26 UTC (permalink / raw) To: Dave Chinner Cc: NeilBrown, Vlastimil Babka, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Wed 13-10-21 13:32:31, Dave Chinner wrote: > On Mon, Oct 11, 2021 at 01:57:36PM +0200, Michal Hocko wrote: > > On Sat 09-10-21 09:36:49, Dave Chinner wrote: > > > On Fri, Oct 08, 2021 at 09:48:39AM +0200, Michal Hocko wrote: > > > > > > > Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc > > > > > > > fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS > > > > > > > kvmalloc via memalloc_nofs_save/restore(), so this behavioural > > > > > > > restriction w.r.t. gfp flags just makes no sense at all. > > > > > > > > > > > > GFP_NOFS (without using the scope API) has the same problem as NOFAIL in > > > > > > the vmalloc. Hence it is not supported. If you use the scope API then > > > > > > you can GFP_KERNEL for kvmalloc. This is clumsy but I am not sure how to > > > > > > define these conditions in a more sensible way. Special case NOFS if the > > > > > > scope api is in use? Why do you want an explicit NOFS then? > > > > > > Exactly my point - this is clumsy and a total mess. I'm not asking > > > for an explicit GFP_NOFS, just pointing out that the documented > > > restrictions that "vmalloc can only do GFP_KERNEL allocations" is > > > completely wrong. > > > > > > vmalloc() > > > { > > > if (!(gfp_flags & __GFP_FS)) > > > memalloc_nofs_save(); > > > p = __vmalloc(gfp_flags | GFP_KERNEL) > > > if (!(gfp_flags & __GFP_FS)) > > > memalloc_nofs_restore(); > > > } > > > > > > Yup, that's how simple it is to support GFP_NOFS support in > > > vmalloc(). > > > > Yes, this would work from the functionality POV but it defeats the > > philosophy behind the scope API. Why would you even need this if the > > scope was defined by the caller of the allocator? > > Who actually cares that vmalloc might be using the scoped API > internally to implement GFP_NOFS or GFP_NOIO? Nobody at all. > It is far more useful (and self documenting!) for one-off allocations > to pass a GFP_NOFS flag than it is to use a scope API... I would agree with you if the explicit GFP_NOFS usage was consistent and actually justified in the majority cases. My experience tells me otherwise though. Many filesystems use the flag just because that is easier. That leads to a huge overuse of the flag that leads to practical problems. I was hoping that if we offer an API that would define problematic reclaim recursion scopes then it would reduce the abuse. I haven't expected this to happen overnight but it is few years and it seems it will not happen soon either. [...] > > > It also points out that the scope API is highly deficient. > > > We can do GFP_NOFS via the scope API, but we can't > > > do anything else because *there is no scope API for other GFP > > > flags*. > > > > > > Why don't we have a GFP_NOFAIL/__GFP_RETRY_FOREVER scope API? > > > > NO{FS,IO} where first flags to start this approach. And I have to admit > > the experiment was much less successful then I hoped for. There are > > still thousands of direct NOFS users so for some reason defining scopes > > is not an easy thing to do. > > > > I am not against NOFAIL scopes in principle but seeing the nofs > > "success" I am worried this will not go really well either and it is > > much more tricky as NOFAIL has much stronger requirements than NOFS. > > Just imagine how tricky this can be if you just call a library code > > that is not under your control within a NOFAIL scope. What if that > > library code decides to allocate (e.g. printk that would attempt to do > > an optimistic NOWAIT allocation). > > I already asked you that _exact_ question earlier in the thread > w.r.t. kvmalloc(GFP_NOFAIL) using optimistic NOWAIT kmalloc > allocation. I asked you as a MM expert to define *and document* the > behaviour that should result, not turn around and use the fact that > it is undefined behaviour as a "this is too hard" excuse for not > changing anything. Dave, you have "thrown" a lot of complains in previous emails and it is hard to tell rants from features requests apart. I am sorry but I believe it would be much more productive to continue this discussion if you could mild your tone. Can I ask you to break down your feature requests into separate emails so that we can discuss and track them separately rather in this quite a long thread which has IMHO diverghed from the initial topic. Thanks! > THe fact is that the scope APIs are only really useful for certain > contexts where restrictions are set by higher level functionality. > For one-off allocation constraints the API sucks and we end up with Could you be more specific about these one-off allocation constrains? What would be the reason to define one-off NO{FS,IO} allocation constrain? Or did you have your NOFAIL example in mind? > crap like this (found in btrfs): > > /* > * We're holding a transaction handle, so use a NOFS memory > * allocation context to avoid deadlock if reclaim happens. > */ > nofs_flag = memalloc_nofs_save(); > value = kmalloc(size, GFP_KERNEL); > memalloc_nofs_restore(nofs_flag); Yes this looks wrong indeed! If I were to review such a code I would ask why the scope cannot match the transaction handle context. IIRC jbd does that. I am aware of these patterns. I was pulled in some discussions in the past and in some it turned out that the constrain is not needed at all and in some cases that has led to a proper scope definition. As you point out in your other examples it just happens that it is easier to go an easy path and define scopes ad-hoc to work around allocation API limitations. [...] > IOWs, a large number of the users of the scope API simply make > [k]vmalloc() provide GFP_NOFS behaviour. ceph_kvmalloc() is pretty > much a wrapper that indicates how all vmalloc functions should > behave. Honour GFP_NOFS and GFP_NOIO by using the scope API > internally. I was discouraging from this behavior at vmalloc level to push people to use scopes properly - aka at the level where the reclaim recursion is really a problem. If that is infeasible in practice then we can re-evaluate of course. I was really hoping we can get rid of cargo cult GFP_NOFS usage this way but the reality often disagrees with hopes. All that being said, let's discuss [k]vmalloc constrains and usecases that need changes in a separate email thread. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-13 8:26 ` Michal Hocko @ 2021-10-14 11:32 ` David Sterba 2021-10-14 11:46 ` Michal Hocko 0 siblings, 1 reply; 45+ messages in thread From: David Sterba @ 2021-10-14 11:32 UTC (permalink / raw) To: Michal Hocko Cc: Dave Chinner, NeilBrown, Vlastimil Babka, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Wed, Oct 13, 2021 at 10:26:58AM +0200, Michal Hocko wrote: > > crap like this (found in btrfs): > > > > /* > > * We're holding a transaction handle, so use a NOFS memory > > * allocation context to avoid deadlock if reclaim happens. > > */ > > nofs_flag = memalloc_nofs_save(); > > value = kmalloc(size, GFP_KERNEL); > > memalloc_nofs_restore(nofs_flag); > > Yes this looks wrong indeed! If I were to review such a code I would ask > why the scope cannot match the transaction handle context. IIRC jbd does > that. Adding the transaction start/end as the NOFS scope is a long term plan and going on for years, because it's not a change we would need in btrfs, but rather a favor to MM to switch away from "GFP_NOFS everywhere because it's easy". The first step was to convert the easy cases. Almost all safe cases switching GFP_NOFS to GFP_KERNEL have happened. Another step is to convert GFP_NOFS to memalloc_nofs_save/GFP_KERNEL/memalloc_nofs_restore in contexts where we know we'd rely on the transaction NOFS scope in the future. Once this is implemented, the memalloc_nofs_* calls are deleted and it works as expected. Now you may argue that the switch could be changing GFP_NOFS to GFP_KERNEL at that time but that is not that easy to review or reason about in the whole transaction context in all allocations. This leads to code that was found in __btrfs_set_acl and called crap or wrong, because perhaps the background and the bigger plan is not immediately obvious. I hope the explanation above it puts it to the right perspective. The other class of scoped NOFS protection is around vmalloc-based allocations but that's for a different reason, would be solved by the same transaction start/end conversion as well. I'm working on that from time to time but this usually gets pushed down in the todo list. It's changing a lot of code, from what I've researched so far cannot be done at once and would probably introduce bugs hard to hit because of the external conditions (allocator, system load, ...). I have a plan to do that incrementally, adding assertions and converting functions in small batches to be able to catch bugs early, but I'm not exactly thrilled to start such endeavour in addition to normal development bug hunting. To get things moving again, I've refreshed the patch adding stubs and will try to find the best timing for merg to avoid patch conflicts, but no promises. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL 2021-10-14 11:32 ` David Sterba @ 2021-10-14 11:46 ` Michal Hocko 0 siblings, 0 replies; 45+ messages in thread From: Michal Hocko @ 2021-10-14 11:46 UTC (permalink / raw) To: David Sterba Cc: Dave Chinner, NeilBrown, Vlastimil Babka, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel, linux-doc On Thu 14-10-21 13:32:01, David Sterba wrote: > On Wed, Oct 13, 2021 at 10:26:58AM +0200, Michal Hocko wrote: > > > crap like this (found in btrfs): > > > > > > /* > > > * We're holding a transaction handle, so use a NOFS memory > > > * allocation context to avoid deadlock if reclaim happens. > > > */ > > > nofs_flag = memalloc_nofs_save(); > > > value = kmalloc(size, GFP_KERNEL); > > > memalloc_nofs_restore(nofs_flag); > > > > Yes this looks wrong indeed! If I were to review such a code I would ask > > why the scope cannot match the transaction handle context. IIRC jbd does > > that. > > Adding the transaction start/end as the NOFS scope is a long term plan > and going on for years, because it's not a change we would need in > btrfs, but rather a favor to MM to switch away from "GFP_NOFS everywhere > because it's easy". > > The first step was to convert the easy cases. Almost all safe cases > switching GFP_NOFS to GFP_KERNEL have happened. Another step is to > convert GFP_NOFS to memalloc_nofs_save/GFP_KERNEL/memalloc_nofs_restore > in contexts where we know we'd rely on the transaction NOFS scope in the > future. Once this is implemented, the memalloc_nofs_* calls are deleted > and it works as expected. Now you may argue that the switch could be > changing GFP_NOFS to GFP_KERNEL at that time but that is not that easy > to review or reason about in the whole transaction context in all > allocations. > > This leads to code that was found in __btrfs_set_acl and called crap > or wrong, because perhaps the background and the bigger plan is not > immediately obvious. I hope the explanation above it puts it to the > right perspective. Yes it helps. Thanks for the clarification because this is far from obvious and changelogs I've checked do not mention this high level plan. I would have gone with a /* TODO: remove me once transactions use scopes... */ but this is obviously your call. > > The other class of scoped NOFS protection is around vmalloc-based > allocations but that's for a different reason, would be solved by the > same transaction start/end conversion as well. > > I'm working on that from time to time but this usually gets pushed down > in the todo list. It's changing a lot of code, from what I've researched > so far cannot be done at once and would probably introduce bugs hard to > hit because of the external conditions (allocator, system load, ...). > > I have a plan to do that incrementally, adding assertions and converting > functions in small batches to be able to catch bugs early, but I'm not > exactly thrilled to start such endeavour in addition to normal > development bug hunting. > > To get things moving again, I've refreshed the patch adding stubs and > will try to find the best timing for merg to avoid patch conflicts, but > no promises. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 0/6] congestion_wait() and GFP_NOFAIL @ 2021-09-14 0:13 NeilBrown 2021-09-14 0:13 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown 0 siblings, 1 reply; 45+ messages in thread From: NeilBrown @ 2021-09-14 0:13 UTC (permalink / raw) To: Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel While working on an NFS issue recently I was informed (or maybe "reminded") that congestion_wait() doesn't really do what we think it does. It is indistinguishable from schedule_timeout_uninterruptible(). Some current users for congestion_wait() would be better suited by __GFP_NOFAIL. In related discussions it was pointed out that the __GFP_NOFAIL documentation could usefully clarify the costs of its use. So this set of patch addresses some of these issues. The patches are all independent and can safely be applied separately in different tress as appropriate. They: - add or improve documentation relating to these issues - make a tiny fix to the page_alloc_bulk_* - replace those calls to congestion_wait() which are simply waiting to retry a memory allocation. These are the easy bits. There are 5 calls to congestion_wait() and one to wait_iff_congested() in mm/ which need consideration. There are multiple calls to congestion_wait in fs/, particularly fs/f2fs/ which need to be addressed too. I'll try to form an opinion about these in coming weeks. Thanks, NeilBrown --- NeilBrown (6): MM: improve documentation for __GFP_NOFAIL MM: annotate congestion_wait() and wait_iff_congested() as ineffective. EXT4: Remove ENOMEM/congestion_wait() loops. EXT4: remove congestion_wait from ext4_bio_write_page, and simplify XFS: remove congestion_wait() loop from kmem_alloc() XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() fs/ext4/ext4.h | 2 +- fs/ext4/ext4_jbd2.c | 8 +++++- fs/ext4/extents.c | 49 ++++++++++++++----------------------- fs/ext4/extents_status.c | 35 ++++++++++++++------------ fs/ext4/extents_status.h | 2 +- fs/ext4/indirect.c | 2 +- fs/ext4/inode.c | 6 ++--- fs/ext4/ioctl.c | 4 +-- fs/ext4/page-io.c | 13 ++++------ fs/ext4/super.c | 2 +- fs/jbd2/transaction.c | 8 +++--- fs/xfs/kmem.c | 16 +++--------- fs/xfs/xfs_buf.c | 6 ++--- include/linux/backing-dev.h | 7 ++++++ mm/backing-dev.c | 9 +++++++ 15 files changed, 86 insertions(+), 83 deletions(-) -- Signature ^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-14 0:13 [PATCH 0/6] congestion_wait() and GFP_NOFAIL NeilBrown @ 2021-09-14 0:13 ` NeilBrown 2021-09-14 16:34 ` Mel Gorman 2021-09-15 0:28 ` Theodore Ts'o 0 siblings, 2 replies; 45+ messages in thread From: NeilBrown @ 2021-09-14 0:13 UTC (permalink / raw) To: Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel Indefinite loops waiting for memory allocation are discouraged by documentation in gfp.h which says the use of __GFP_NOFAIL that it is definitely preferable to use the flag rather than opencode endless loop around allocator. Such loops that use congestion_wait() are particularly unwise as congestion_wait() is indistinguishable from schedule_timeout_uninterruptible() in practice - and should be deprecated. So this patch changes the two loops in ext4_ext_truncate() to use __GFP_NOFAIL instead of looping. As the allocation is multiple layers deeper in the call stack, this requires passing the EXT4_EX_NOFAIL flag down and handling it in various places. Of particular interest is the ext4_journal_start family of calls which can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'. This could be seen as a blurring of types. However 'type' is 8 bits, and EXT4_EX_NOFAIL is a high bit, so it is safe in practice. jbd2__journal_start() is enhanced so that the gfp_t flags passed are used for *all* allocations. Signed-off-by: NeilBrown <neilb@suse.de> --- fs/ext4/ext4.h | 2 +- fs/ext4/ext4_jbd2.c | 8 +++++++- fs/ext4/extents.c | 49 +++++++++++++++++----------------------------- fs/ext4/extents_status.c | 35 +++++++++++++++++++-------------- fs/ext4/extents_status.h | 2 +- fs/ext4/indirect.c | 2 +- fs/ext4/inode.c | 6 +++--- fs/ext4/ioctl.c | 4 ++-- fs/ext4/super.c | 2 +- fs/jbd2/transaction.c | 8 ++++---- 10 files changed, 58 insertions(+), 60 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 90ff5acaf11f..52a34f5dfda2 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -3720,7 +3720,7 @@ extern int ext4_ext_map_blocks(handle_t *handle, struct inode *inode, struct ext4_map_blocks *map, int flags); extern int ext4_ext_truncate(handle_t *, struct inode *); extern int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, - ext4_lblk_t end); + ext4_lblk_t end, int nofail); extern void ext4_ext_init(struct super_block *); extern void ext4_ext_release(struct super_block *); extern long ext4_fallocate(struct file *file, int mode, loff_t offset, diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c index 6def7339056d..2bdda3b7a3e6 100644 --- a/fs/ext4/ext4_jbd2.c +++ b/fs/ext4/ext4_jbd2.c @@ -92,6 +92,12 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line, { journal_t *journal; int err; + gfp_t gfp_mask = GFP_NOFS; + + if (type & EXT4_EX_NOFAIL) { + gfp_mask |= __GFP_NOFAIL; + type &= ~EXT4_EX_NOFAIL; + } trace_ext4_journal_start(sb, blocks, rsv_blocks, revoke_creds, _RET_IP_); @@ -103,7 +109,7 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line, if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)) return ext4_get_nojournal(); return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds, - GFP_NOFS, type, line); + gfp_mask, type, line); } int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index c0de30f25185..b7bc12aedf78 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -1488,7 +1488,7 @@ static int ext4_ext_search_left(struct inode *inode, static int ext4_ext_search_right(struct inode *inode, struct ext4_ext_path *path, ext4_lblk_t *logical, ext4_fsblk_t *phys, - struct ext4_extent *ret_ex) + struct ext4_extent *ret_ex, int nofail) { struct buffer_head *bh = NULL; struct ext4_extent_header *eh; @@ -1565,7 +1565,7 @@ static int ext4_ext_search_right(struct inode *inode, while (++depth < path->p_depth) { /* subtract from p_depth to get proper eh_depth */ bh = read_extent_tree_block(inode, block, - path->p_depth - depth, 0); + path->p_depth - depth, nofail); if (IS_ERR(bh)) return PTR_ERR(bh); eh = ext_block_hdr(bh); @@ -1574,7 +1574,7 @@ static int ext4_ext_search_right(struct inode *inode, put_bh(bh); } - bh = read_extent_tree_block(inode, block, path->p_depth - depth, 0); + bh = read_extent_tree_block(inode, block, path->p_depth - depth, nofail); if (IS_ERR(bh)) return PTR_ERR(bh); eh = ext_block_hdr(bh); @@ -2773,7 +2773,7 @@ ext4_ext_more_to_rm(struct ext4_ext_path *path) } int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, - ext4_lblk_t end) + ext4_lblk_t end, int nofail) { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); int depth = ext_depth(inode); @@ -2789,7 +2789,8 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, ext_debug(inode, "truncate since %u to %u\n", start, end); /* probably first extent we're gonna free will be last in block */ - handle = ext4_journal_start_with_revoke(inode, EXT4_HT_TRUNCATE, + handle = ext4_journal_start_with_revoke(inode, + EXT4_HT_TRUNCATE | nofail, depth + 1, ext4_free_metadata_revoke_credits(inode->i_sb, depth)); if (IS_ERR(handle)) @@ -2877,7 +2878,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, */ lblk = ex_end + 1; err = ext4_ext_search_right(inode, path, &lblk, &pblk, - NULL); + NULL, nofail); if (err < 0) goto out; if (pblk) { @@ -2899,10 +2900,6 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, } else { path = kcalloc(depth + 1, sizeof(struct ext4_ext_path), GFP_NOFS | __GFP_NOFAIL); - if (path == NULL) { - ext4_journal_stop(handle); - return -ENOMEM; - } path[0].p_maxdepth = path[0].p_depth = depth; path[0].p_hdr = ext_inode_hdr(inode); i = 0; @@ -2955,7 +2952,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, memset(path + i + 1, 0, sizeof(*path)); bh = read_extent_tree_block(inode, ext4_idx_pblock(path[i].p_idx), depth - i - 1, - EXT4_EX_NOCACHE); + EXT4_EX_NOCACHE | nofail); if (IS_ERR(bh)) { /* should we reset i_size? */ err = PTR_ERR(bh); @@ -4186,7 +4183,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode, if (err) goto out; ar.lright = map->m_lblk; - err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright, &ex2); + err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright, &ex2, 0); if (err < 0) goto out; @@ -4368,23 +4365,13 @@ int ext4_ext_truncate(handle_t *handle, struct inode *inode) last_block = (inode->i_size + sb->s_blocksize - 1) >> EXT4_BLOCK_SIZE_BITS(sb); -retry: err = ext4_es_remove_extent(inode, last_block, - EXT_MAX_BLOCKS - last_block); - if (err == -ENOMEM) { - cond_resched(); - congestion_wait(BLK_RW_ASYNC, HZ/50); - goto retry; - } + EXT_MAX_BLOCKS - last_block, + EXT4_EX_NOFAIL); if (err) return err; -retry_remove_space: - err = ext4_ext_remove_space(inode, last_block, EXT_MAX_BLOCKS - 1); - if (err == -ENOMEM) { - cond_resched(); - congestion_wait(BLK_RW_ASYNC, HZ/50); - goto retry_remove_space; - } + err = ext4_ext_remove_space(inode, last_block, EXT_MAX_BLOCKS - 1, + EXT4_EX_NOFAIL); return err; } @@ -5322,13 +5309,13 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len) ext4_discard_preallocations(inode, 0); ret = ext4_es_remove_extent(inode, punch_start, - EXT_MAX_BLOCKS - punch_start); + EXT_MAX_BLOCKS - punch_start, 0); if (ret) { up_write(&EXT4_I(inode)->i_data_sem); goto out_stop; } - ret = ext4_ext_remove_space(inode, punch_start, punch_stop - 1); + ret = ext4_ext_remove_space(inode, punch_start, punch_stop - 1, 0); if (ret) { up_write(&EXT4_I(inode)->i_data_sem); goto out_stop; @@ -5510,7 +5497,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) } ret = ext4_es_remove_extent(inode, offset_lblk, - EXT_MAX_BLOCKS - offset_lblk); + EXT_MAX_BLOCKS - offset_lblk, 0); if (ret) { up_write(&EXT4_I(inode)->i_data_sem); goto out_stop; @@ -5574,10 +5561,10 @@ ext4_swap_extents(handle_t *handle, struct inode *inode1, BUG_ON(!inode_is_locked(inode1)); BUG_ON(!inode_is_locked(inode2)); - *erp = ext4_es_remove_extent(inode1, lblk1, count); + *erp = ext4_es_remove_extent(inode1, lblk1, count, 0); if (unlikely(*erp)) return 0; - *erp = ext4_es_remove_extent(inode2, lblk2, count); + *erp = ext4_es_remove_extent(inode2, lblk2, count, 0); if (unlikely(*erp)) return 0; diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c index 9a3a8996aacf..7f7711a2ea44 100644 --- a/fs/ext4/extents_status.c +++ b/fs/ext4/extents_status.c @@ -144,9 +144,10 @@ static struct kmem_cache *ext4_es_cachep; static struct kmem_cache *ext4_pending_cachep; -static int __es_insert_extent(struct inode *inode, struct extent_status *newes); +static int __es_insert_extent(struct inode *inode, struct extent_status *newes, + int nofail); static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk, - ext4_lblk_t end, int *reserved); + ext4_lblk_t end, int *reserved, int nofail); static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan); static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan, struct ext4_inode_info *locked_ei); @@ -452,10 +453,11 @@ static void ext4_es_list_del(struct inode *inode) static struct extent_status * ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len, - ext4_fsblk_t pblk) + ext4_fsblk_t pblk, int nofail) { struct extent_status *es; - es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC); + es = kmem_cache_alloc(ext4_es_cachep, + GFP_ATOMIC | (nofail ? __GFP_NOFAIL : 0)); if (es == NULL) return NULL; es->es_lblk = lblk; @@ -754,7 +756,8 @@ static inline void ext4_es_insert_extent_check(struct inode *inode, } #endif -static int __es_insert_extent(struct inode *inode, struct extent_status *newes) +static int __es_insert_extent(struct inode *inode, struct extent_status *newes, + int nofail) { struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree; struct rb_node **p = &tree->root.rb_node; @@ -795,7 +798,7 @@ static int __es_insert_extent(struct inode *inode, struct extent_status *newes) } es = ext4_es_alloc_extent(inode, newes->es_lblk, newes->es_len, - newes->es_pblk); + newes->es_pblk, nofail); if (!es) return -ENOMEM; rb_link_node(&es->rb_node, parent, p); @@ -848,11 +851,11 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk, ext4_es_insert_extent_check(inode, &newes); write_lock(&EXT4_I(inode)->i_es_lock); - err = __es_remove_extent(inode, lblk, end, NULL); + err = __es_remove_extent(inode, lblk, end, NULL, 0); if (err != 0) goto error; retry: - err = __es_insert_extent(inode, &newes); + err = __es_insert_extent(inode, &newes, 0); if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb), 128, EXT4_I(inode))) goto retry; @@ -902,7 +905,7 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk, es = __es_tree_search(&EXT4_I(inode)->i_es_tree.root, lblk); if (!es || es->es_lblk > end) - __es_insert_extent(inode, &newes); + __es_insert_extent(inode, &newes, 0); write_unlock(&EXT4_I(inode)->i_es_lock); } @@ -1294,6 +1297,7 @@ static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end, * @lblk - first block in range * @end - last block in range * @reserved - number of cluster reservations released + * @nofail - EXT4_EX_NOFAIL if __GFP_NOFAIL should be used * * If @reserved is not NULL and delayed allocation is enabled, counts * block/cluster reservations freed by removing range and if bigalloc @@ -1301,7 +1305,7 @@ static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end, * error code on failure. */ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk, - ext4_lblk_t end, int *reserved) + ext4_lblk_t end, int *reserved, int nofail) { struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree; struct rb_node *node; @@ -1350,7 +1354,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk, orig_es.es_len - len2; ext4_es_store_pblock_status(&newes, block, ext4_es_status(&orig_es)); - err = __es_insert_extent(inode, &newes); + err = __es_insert_extent(inode, &newes, nofail); if (err) { es->es_lblk = orig_es.es_lblk; es->es_len = orig_es.es_len; @@ -1426,12 +1430,13 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk, * @inode - file containing range * @lblk - first block in range * @len - number of blocks to remove + * @nofail - EXT4_EX_NOFAIL if __GFP_NOFAIL should be used * * Reduces block/cluster reservation count and for bigalloc cancels pending * reservations as needed. Returns 0 on success, error code on failure. */ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk, - ext4_lblk_t len) + ext4_lblk_t len, int nofail) { ext4_lblk_t end; int err = 0; @@ -1456,7 +1461,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk, * is reclaimed. */ write_lock(&EXT4_I(inode)->i_es_lock); - err = __es_remove_extent(inode, lblk, end, &reserved); + err = __es_remove_extent(inode, lblk, end, &reserved, nofail); write_unlock(&EXT4_I(inode)->i_es_lock); ext4_es_print_tree(inode); ext4_da_release_space(inode, reserved); @@ -2003,11 +2008,11 @@ int ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk, write_lock(&EXT4_I(inode)->i_es_lock); - err = __es_remove_extent(inode, lblk, lblk, NULL); + err = __es_remove_extent(inode, lblk, lblk, NULL, 0); if (err != 0) goto error; retry: - err = __es_insert_extent(inode, &newes); + err = __es_insert_extent(inode, &newes, 0); if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb), 128, EXT4_I(inode))) goto retry; diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h index 4ec30a798260..23d77094a165 100644 --- a/fs/ext4/extents_status.h +++ b/fs/ext4/extents_status.h @@ -134,7 +134,7 @@ extern void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len, ext4_fsblk_t pblk, unsigned int status); extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk, - ext4_lblk_t len); + ext4_lblk_t len, int nofail); extern void ext4_es_find_extent_range(struct inode *inode, int (*match_fn)(struct extent_status *es), ext4_lblk_t lblk, ext4_lblk_t end, diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c index 89efa78ed4b2..910e87aea7be 100644 --- a/fs/ext4/indirect.c +++ b/fs/ext4/indirect.c @@ -1125,7 +1125,7 @@ void ext4_ind_truncate(handle_t *handle, struct inode *inode) return; } - ext4_es_remove_extent(inode, last_block, EXT_MAX_BLOCKS - last_block); + ext4_es_remove_extent(inode, last_block, EXT_MAX_BLOCKS - last_block, 0); /* * The orphan list entry will now protect us from any crash which diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index d18852d6029c..24246043d94b 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1575,7 +1575,7 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd, ext4_lblk_t start, last; start = index << (PAGE_SHIFT - inode->i_blkbits); last = end << (PAGE_SHIFT - inode->i_blkbits); - ext4_es_remove_extent(inode, start, last - start + 1); + ext4_es_remove_extent(inode, start, last - start + 1, 0); } pagevec_init(&pvec); @@ -4109,7 +4109,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length) ext4_discard_preallocations(inode, 0); ret = ext4_es_remove_extent(inode, first_block, - stop_block - first_block); + stop_block - first_block, 0); if (ret) { up_write(&EXT4_I(inode)->i_data_sem); goto out_stop; @@ -4117,7 +4117,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length) if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) ret = ext4_ext_remove_space(inode, first_block, - stop_block - 1); + stop_block - 1, 0); else ret = ext4_ind_remove_space(handle, inode, first_block, stop_block); diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c index 606dee9e08a3..e4de05a6b976 100644 --- a/fs/ext4/ioctl.c +++ b/fs/ext4/ioctl.c @@ -79,8 +79,8 @@ static void swap_inode_data(struct inode *inode1, struct inode *inode2) (ei1->i_flags & ~EXT4_FL_SHOULD_SWAP); ei2->i_flags = tmp | (ei2->i_flags & ~EXT4_FL_SHOULD_SWAP); swap(ei1->i_disksize, ei2->i_disksize); - ext4_es_remove_extent(inode1, 0, EXT_MAX_BLOCKS); - ext4_es_remove_extent(inode2, 0, EXT_MAX_BLOCKS); + ext4_es_remove_extent(inode1, 0, EXT_MAX_BLOCKS, 0); + ext4_es_remove_extent(inode2, 0, EXT_MAX_BLOCKS, 0); isize = i_size_read(inode1); i_size_write(inode1, i_size_read(inode2)); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 0775950ee84e..947e8376a35a 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1393,7 +1393,7 @@ void ext4_clear_inode(struct inode *inode) invalidate_inode_buffers(inode); clear_inode(inode); ext4_discard_preallocations(inode, 0); - ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS); + ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS, 0); dquot_drop(inode); if (EXT4_I(inode)->jinode) { jbd2_journal_release_jbd_inode(EXT4_JOURNAL(inode), diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 6a3caedd2285..23e0f003d43b 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -476,9 +476,9 @@ static int start_this_handle(journal_t *journal, handle_t *handle, } /* Allocate a new handle. This should probably be in a slab... */ -static handle_t *new_handle(int nblocks) +static handle_t *new_handle(int nblocks, gfp_t gfp) { - handle_t *handle = jbd2_alloc_handle(GFP_NOFS); + handle_t *handle = jbd2_alloc_handle(gfp); if (!handle) return NULL; handle->h_total_credits = nblocks; @@ -505,13 +505,13 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks, nblocks += DIV_ROUND_UP(revoke_records, journal->j_revoke_records_per_block); - handle = new_handle(nblocks); + handle = new_handle(nblocks, gfp_mask); if (!handle) return ERR_PTR(-ENOMEM); if (rsv_blocks) { handle_t *rsv_handle; - rsv_handle = new_handle(rsv_blocks); + rsv_handle = new_handle(rsv_blocks, gfp_mask); if (!rsv_handle) { jbd2_free_handle(handle); return ERR_PTR(-ENOMEM); ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-14 0:13 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown @ 2021-09-14 16:34 ` Mel Gorman 2021-09-14 21:48 ` NeilBrown 2021-09-14 23:55 ` Dave Chinner 2021-09-15 0:28 ` Theodore Ts'o 1 sibling, 2 replies; 45+ messages in thread From: Mel Gorman @ 2021-09-14 16:34 UTC (permalink / raw) To: NeilBrown Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Jan Kara, Michal Hocko, Matthew Wilcox, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote: > Indefinite loops waiting for memory allocation are discouraged by > documentation in gfp.h which says the use of __GFP_NOFAIL that it > > is definitely preferable to use the flag rather than opencode endless > loop around allocator. > > Such loops that use congestion_wait() are particularly unwise as > congestion_wait() is indistinguishable from > schedule_timeout_uninterruptible() in practice - and should be > deprecated. > > So this patch changes the two loops in ext4_ext_truncate() to use > __GFP_NOFAIL instead of looping. > > As the allocation is multiple layers deeper in the call stack, this > requires passing the EXT4_EX_NOFAIL flag down and handling it in various > places. > > Of particular interest is the ext4_journal_start family of calls which > can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'. This could be seen > as a blurring of types. However 'type' is 8 bits, and EXT4_EX_NOFAIL is > a high bit, so it is safe in practice. > > jbd2__journal_start() is enhanced so that the gfp_t flags passed are > used for *all* allocations. > > Signed-off-by: NeilBrown <neilb@suse.de> I'm not a fan. GFP_NOFAIL allows access to emergency reserves increasing the risk of a livelock if memory is completely depleted where as some callers can afford to wait. The key event should be reclaim making progress. The hack below is intended to vaguely demonstrate how blocking can be based on reclaim making progress instead of "congestion" but has not even been booted. A more complete overhaul may involve introducing reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout, nodemask_t *nodemask) and reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout) and converting congestion_wait and wait_iff_congestion to calling reclaim_congestion_wait_nodemask which waits on the first usable node and then audit every single congestion_wait() user to see which API they should call. Further work would be to establish whether the page allocator should call reclaim_congestion_wait_nodemask() if direct reclaim is not making progress or whether that should be in vmscan.c. Conceivably, GFP_NOFAIL could then soften its access to emergency reserves but I haven't given it much thought. Yes it's significant work, but it would be a better than letting __GFP_NOFAIL propagate further and kicking us down the road. This hack is terrible, it's not the right way to do it, it's just to illustrate the idea of "waiting on memory should be based on reclaim making progress and not the state of storage" is not impossible. --8<-- diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 5c0318509f9e..5ed81c5746ec 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -832,6 +832,7 @@ typedef struct pglist_data { unsigned long node_spanned_pages; /* total size of physical page range, including holes */ int node_id; + wait_queue_head_t reclaim_wait; wait_queue_head_t kswapd_wait; wait_queue_head_t pfmemalloc_wait; struct task_struct *kswapd; /* Protected by diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 6122c78ce914..21a9cd693d12 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -13,6 +13,7 @@ #include <linux/module.h> #include <linux/writeback.h> #include <linux/device.h> +#include <linux/swap.h> #include <trace/events/writeback.h> struct backing_dev_info noop_backing_dev_info; @@ -1013,25 +1014,41 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync) EXPORT_SYMBOL(set_bdi_congested); /** - * congestion_wait - wait for a backing_dev to become uncongested - * @sync: SYNC or ASYNC IO - * @timeout: timeout in jiffies + * congestion_wait - the docs are now worthless but avoiding a rename * - * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit - * write congestion. If no backing_devs are congested then just wait for the - * next write to be completed. + * New thing -- wait for a timeout or reclaim to make progress */ long congestion_wait(int sync, long timeout) { + pg_data_t *pgdat; long ret; unsigned long start = jiffies; DEFINE_WAIT(wait); - wait_queue_head_t *wqh = &congestion_wqh[sync]; + wait_queue_head_t *wqh; - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); - ret = io_schedule_timeout(timeout); + /* Never let kswapd sleep on itself */ + if (current_is_kswapd()) + goto trace; + + /* + * Dangerous, local memory may be forbidden by cpuset or policies, + * use first eligible zone in zonelists node instead + */ + preempt_disable(); + pgdat = NODE_DATA(smp_processor_id()); + preempt_enable(); + wqh = &pgdat->reclaim_wait; + + /* + * Should probably check watermark of suitable zones here + * in case this is spuriously called + */ + + prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE); + ret = schedule_timeout(timeout); finish_wait(wqh, &wait); +trace: trace_writeback_congestion_wait(jiffies_to_usecs(timeout), jiffies_to_usecs(jiffies - start)); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5b09e71c9ce7..4b87b73d1264 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7418,6 +7418,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat) pgdat_init_split_queue(pgdat); pgdat_init_kcompactd(pgdat); + init_waitqueue_head(&pgdat->reclaim_wait); init_waitqueue_head(&pgdat->kswapd_wait); init_waitqueue_head(&pgdat->pfmemalloc_wait); diff --git a/mm/vmscan.c b/mm/vmscan.c index 158c9c93d03c..0ac2cf6be5e3 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2888,6 +2888,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) } while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL))); } +static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx); + static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) { struct reclaim_state *reclaim_state = current->reclaim_state; @@ -3070,6 +3072,18 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) sc)) goto again; + /* + * Might be race-prone, more appropriate to do this when exiting + * direct reclaim and when kswapd finds that pgdat is balanced. + * May also be appropriate to update pgdat_balanced to take + * a watermark level and wakeup when min watermarks are ok + * instead of waiting for the high watermark + */ + if (waitqueue_active(&pgdat->reclaim_wait) && + pgdat_balanced(pgdat, 0, ZONE_MOVABLE)) { + wake_up_interruptible(&pgdat->reclaim_wait); + } + /* * Kswapd gives up on balancing particular nodes after too * many failures to reclaim anything from them and goes to ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-14 16:34 ` Mel Gorman @ 2021-09-14 21:48 ` NeilBrown 2021-09-15 12:06 ` Michal Hocko 2021-09-14 23:55 ` Dave Chinner 1 sibling, 1 reply; 45+ messages in thread From: NeilBrown @ 2021-09-14 21:48 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Jan Kara, Michal Hocko, Matthew Wilcox, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel On Wed, 15 Sep 2021, Mel Gorman wrote: > On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote: > > Indefinite loops waiting for memory allocation are discouraged by > > documentation in gfp.h which says the use of __GFP_NOFAIL that it > > > > is definitely preferable to use the flag rather than opencode endless > > loop around allocator. > > > > Such loops that use congestion_wait() are particularly unwise as > > congestion_wait() is indistinguishable from > > schedule_timeout_uninterruptible() in practice - and should be > > deprecated. > > > > So this patch changes the two loops in ext4_ext_truncate() to use > > __GFP_NOFAIL instead of looping. > > > > As the allocation is multiple layers deeper in the call stack, this > > requires passing the EXT4_EX_NOFAIL flag down and handling it in various > > places. > > > > Of particular interest is the ext4_journal_start family of calls which > > can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'. This could be seen > > as a blurring of types. However 'type' is 8 bits, and EXT4_EX_NOFAIL is > > a high bit, so it is safe in practice. > > > > jbd2__journal_start() is enhanced so that the gfp_t flags passed are > > used for *all* allocations. > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > I'm not a fan. GFP_NOFAIL allows access to emergency reserves increasing > the risk of a livelock if memory is completely depleted where as some > callers can afford to wait. Maybe we should wind back and focus on the documentation patches. As quoted above, mm.h says: > > is definitely preferable to use the flag rather than opencode endless > > loop around allocator. but you seem to be saying that is wrong. I'd certainly like to get the documentation right before changing any code. Why does __GFP_NOFAIL access the reserves? Why not require that the relevant "Try harder" flag (__GFP_ATOMIC or __GFP_MEMALLOC) be included with __GFP_NOFAIL if that is justified? There are over 100 __GFP_NOFAIL allocation sites. I don't feel like reviewing them all and seeing if any really need a try-harder flag. Can we rename __GFP_NOFAIL to __GFP_NEVERFAIL and then #define __GFP_NOFAIL (__GFP_NEVERFAIL | __GFP_ATOMIC) and encourage the use of __GFP_NEVERFAIL in future? When __GFP_NOFAIL loops, it calls congestion_wait() internally. That certainly needs to be fixed and the ideas you present below are certainly worth considering when trying to understand how to address that. I'd rather fix it once there in page_alloc.c rather then export a waiting API like congestion_wait(). That would provide more flexibility. e.g. a newly freed page could be handed directly back to the waiter. Thanks, NeilBrown > > The key event should be reclaim making progress. The hack below is > intended to vaguely demonstrate how blocking can be based on reclaim > making progress instead of "congestion" but has not even been booted. A > more complete overhaul may involve introducing > reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout, nodemask_t *nodemask) > and > reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout) > and converting congestion_wait and wait_iff_congestion to calling > reclaim_congestion_wait_nodemask which waits on the first usable node > and then audit every single congestion_wait() user to see which API > they should call. Further work would be to establish whether the page allocator should > call reclaim_congestion_wait_nodemask() if direct reclaim is not making > progress or whether that should be in vmscan.c. Conceivably, GFP_NOFAIL > could then soften its access to emergency reserves but I haven't given > it much thought. > > Yes it's significant work, but it would be a better than letting > __GFP_NOFAIL propagate further and kicking us down the road. > > This hack is terrible, it's not the right way to do it, it's just to > illustrate the idea of "waiting on memory should be based on reclaim > making progress and not the state of storage" is not impossible. > > --8<-- > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 5c0318509f9e..5ed81c5746ec 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -832,6 +832,7 @@ typedef struct pglist_data { > unsigned long node_spanned_pages; /* total size of physical page > range, including holes */ > int node_id; > + wait_queue_head_t reclaim_wait; > wait_queue_head_t kswapd_wait; > wait_queue_head_t pfmemalloc_wait; > struct task_struct *kswapd; /* Protected by > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > index 6122c78ce914..21a9cd693d12 100644 > --- a/mm/backing-dev.c > +++ b/mm/backing-dev.c > @@ -13,6 +13,7 @@ > #include <linux/module.h> > #include <linux/writeback.h> > #include <linux/device.h> > +#include <linux/swap.h> > #include <trace/events/writeback.h> > > struct backing_dev_info noop_backing_dev_info; > @@ -1013,25 +1014,41 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync) > EXPORT_SYMBOL(set_bdi_congested); > > /** > - * congestion_wait - wait for a backing_dev to become uncongested > - * @sync: SYNC or ASYNC IO > - * @timeout: timeout in jiffies > + * congestion_wait - the docs are now worthless but avoiding a rename > * > - * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit > - * write congestion. If no backing_devs are congested then just wait for the > - * next write to be completed. > + * New thing -- wait for a timeout or reclaim to make progress > */ > long congestion_wait(int sync, long timeout) > { > + pg_data_t *pgdat; > long ret; > unsigned long start = jiffies; > DEFINE_WAIT(wait); > - wait_queue_head_t *wqh = &congestion_wqh[sync]; > + wait_queue_head_t *wqh; > > - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); > - ret = io_schedule_timeout(timeout); > + /* Never let kswapd sleep on itself */ > + if (current_is_kswapd()) > + goto trace; > + > + /* > + * Dangerous, local memory may be forbidden by cpuset or policies, > + * use first eligible zone in zonelists node instead > + */ > + preempt_disable(); > + pgdat = NODE_DATA(smp_processor_id()); > + preempt_enable(); > + wqh = &pgdat->reclaim_wait; > + > + /* > + * Should probably check watermark of suitable zones here > + * in case this is spuriously called > + */ > + > + prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE); > + ret = schedule_timeout(timeout); > finish_wait(wqh, &wait); > > +trace: > trace_writeback_congestion_wait(jiffies_to_usecs(timeout), > jiffies_to_usecs(jiffies - start)); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 5b09e71c9ce7..4b87b73d1264 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -7418,6 +7418,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat) > pgdat_init_split_queue(pgdat); > pgdat_init_kcompactd(pgdat); > > + init_waitqueue_head(&pgdat->reclaim_wait); > init_waitqueue_head(&pgdat->kswapd_wait); > init_waitqueue_head(&pgdat->pfmemalloc_wait); > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 158c9c93d03c..0ac2cf6be5e3 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2888,6 +2888,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) > } while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL))); > } > > +static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx); > + > static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) > { > struct reclaim_state *reclaim_state = current->reclaim_state; > @@ -3070,6 +3072,18 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) > sc)) > goto again; > > + /* > + * Might be race-prone, more appropriate to do this when exiting > + * direct reclaim and when kswapd finds that pgdat is balanced. > + * May also be appropriate to update pgdat_balanced to take > + * a watermark level and wakeup when min watermarks are ok > + * instead of waiting for the high watermark > + */ > + if (waitqueue_active(&pgdat->reclaim_wait) && > + pgdat_balanced(pgdat, 0, ZONE_MOVABLE)) { > + wake_up_interruptible(&pgdat->reclaim_wait); > + } > + > /* > * Kswapd gives up on balancing particular nodes after too > * many failures to reclaim anything from them and goes to > > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-14 21:48 ` NeilBrown @ 2021-09-15 12:06 ` Michal Hocko 2021-09-15 22:35 ` NeilBrown 0 siblings, 1 reply; 45+ messages in thread From: Michal Hocko @ 2021-09-15 12:06 UTC (permalink / raw) To: NeilBrown Cc: Mel Gorman, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Jan Kara, Matthew Wilcox, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel On Wed 15-09-21 07:48:11, Neil Brown wrote: > On Wed, 15 Sep 2021, Mel Gorman wrote: > > On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote: > > > Indefinite loops waiting for memory allocation are discouraged by > > > documentation in gfp.h which says the use of __GFP_NOFAIL that it > > > > > > is definitely preferable to use the flag rather than opencode endless > > > loop around allocator. > > > > > > Such loops that use congestion_wait() are particularly unwise as > > > congestion_wait() is indistinguishable from > > > schedule_timeout_uninterruptible() in practice - and should be > > > deprecated. > > > > > > So this patch changes the two loops in ext4_ext_truncate() to use > > > __GFP_NOFAIL instead of looping. > > > > > > As the allocation is multiple layers deeper in the call stack, this > > > requires passing the EXT4_EX_NOFAIL flag down and handling it in various > > > places. > > > > > > Of particular interest is the ext4_journal_start family of calls which > > > can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'. This could be seen > > > as a blurring of types. However 'type' is 8 bits, and EXT4_EX_NOFAIL is > > > a high bit, so it is safe in practice. > > > > > > jbd2__journal_start() is enhanced so that the gfp_t flags passed are > > > used for *all* allocations. > > > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > > > I'm not a fan. GFP_NOFAIL allows access to emergency reserves increasing > > the risk of a livelock if memory is completely depleted where as some > > callers can afford to wait. > > Maybe we should wind back and focus on the documentation patches. > As quoted above, mm.h says: > > > > is definitely preferable to use the flag rather than opencode endless > > > loop around allocator. > > but you seem to be saying that is wrong. I'd certainly like to get the > documentation right before changing any code. > > Why does __GFP_NOFAIL access the reserves? Why not require that the > relevant "Try harder" flag (__GFP_ATOMIC or __GFP_MEMALLOC) be included > with __GFP_NOFAIL if that is justified? Does 5020e285856c ("mm, oom: give __GFP_NOFAIL allocations access to memory reserves") help? I would be worried to make the semantic even more complex than already is. Access to memory reserves is an implementation detail that the page allocator does currently. Callers shouldn't really be worried about that. I do not ever remember any actual NOFAIL triggered memory exhaustion. I have seen that to happen for unrestricted access to memory reserves by OOM victim though. Hence cd04ae1e2dc8 ("mm, oom: do not rely on TIF_MEMDIE for memory reserves access"). We can consider something similar if NOFAIL allocation really tend to show a similar problem. We do not want callers to care about OOM sitauations for this kind of requests. __GFP_NOFAIL | __GFP_HIGH is certainly something that is a valid usage but I wouldn't base OOM behavior based on that. > There are over 100 __GFP_NOFAIL allocation sites. I don't feel like > reviewing them all and seeing if any really need a try-harder flag. > Can we rename __GFP_NOFAIL to __GFP_NEVERFAIL and then > #define __GFP_NOFAIL (__GFP_NEVERFAIL | __GFP_ATOMIC) > and encourage the use of __GFP_NEVERFAIL in future? Doesn't this add even more complexity? > When __GFP_NOFAIL loops, it calls congestion_wait() internally. That > certainly needs to be fixed and the ideas you present below are > certainly worth considering when trying to understand how to address > that. I'd rather fix it once there in page_alloc.c rather then export a > waiting API like congestion_wait(). That would provide more > flexibility. e.g. a newly freed page could be handed directly back to > the waiter. Completely agreed here. We really do not want people to open code NOFAIL unless they can do something really subsystem specific that would help to make a forward progress. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-15 12:06 ` Michal Hocko @ 2021-09-15 22:35 ` NeilBrown 2021-09-16 0:37 ` Dave Chinner 2021-09-16 6:52 ` Michal Hocko 0 siblings, 2 replies; 45+ messages in thread From: NeilBrown @ 2021-09-15 22:35 UTC (permalink / raw) To: Michal Hocko Cc: Mel Gorman, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Jan Kara, Matthew Wilcox, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel On Wed, 15 Sep 2021, Michal Hocko wrote: > On Wed 15-09-21 07:48:11, Neil Brown wrote: > > > > Why does __GFP_NOFAIL access the reserves? Why not require that the > > relevant "Try harder" flag (__GFP_ATOMIC or __GFP_MEMALLOC) be included > > with __GFP_NOFAIL if that is justified? > > Does 5020e285856c ("mm, oom: give __GFP_NOFAIL allocations access to > memory reserves") help? Yes, that helps. A bit. I'm not fond of the clause "the allocation request might have come with some locks held". What if it doesn't? Does it still have to pay the price. Should we not require that the caller indicate if any locks are held? That way callers which don't hold locks can use __GFP_NOFAIL without worrying about imposing on other code. Or is it so rare that __GFP_NOFAIL would be used without holding a lock that it doesn't matter? The other commit of interest is Commit: 6c18ba7a1899 ("mm: help __GFP_NOFAIL allocations which do not trigger OOM killer") I don't find the reasoning convincing. It is a bit like "Robbing Peter to pay Paul". It takes from the reserves to allow a __GFP_NOFAIL to proceed, with out any reason to think this particular allocation has any more 'right' to the reserves than anything else. While I don't like the reasoning in either of these, they do make it clear (to me) that the use of reserves is entirely an internal policy decision. They should *not* be seen as part of the API and callers should not have to be concerned about it when deciding whether to use __GFP_NOFAIL or not. The use of these reserves is, at most, a hypothetical problem. If it ever looks like becoming a real practical problem, it needs to be fixed internally to the page allocator. Maybe an extra water-mark which isn't quite as permissive as ALLOC_HIGH... I'm inclined to drop all references to reserves from the documentation for __GFP_NOFAIL. I think there are enough users already that adding a couple more isn't going to make problems substantially more likely. And more will be added anyway that the mm/ team won't have the opportunity or bandwidth to review. Meanwhile I'll see if I can understand the intricacies of alloc_page so that I can contibute to making it more predictable. Question: In those cases where an open-coded loop is appropriate, such as when you want to handle signals or can drop locks, how bad would it be to have a tight loop without any sleep? should_reclaim_retry() will sleep 100ms (sometimes...). Is that enough? __GFP_NOFAIL doesn't add any sleep when looping. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-15 22:35 ` NeilBrown @ 2021-09-16 0:37 ` Dave Chinner 2021-09-16 6:52 ` Michal Hocko 1 sibling, 0 replies; 45+ messages in thread From: Dave Chinner @ 2021-09-16 0:37 UTC (permalink / raw) To: NeilBrown Cc: Michal Hocko, Mel Gorman, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Jan Kara, Matthew Wilcox, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel On Thu, Sep 16, 2021 at 08:35:40AM +1000, NeilBrown wrote: > On Wed, 15 Sep 2021, Michal Hocko wrote: > > On Wed 15-09-21 07:48:11, Neil Brown wrote: > > > > > > Why does __GFP_NOFAIL access the reserves? Why not require that the > > > relevant "Try harder" flag (__GFP_ATOMIC or __GFP_MEMALLOC) be included > > > with __GFP_NOFAIL if that is justified? > > > > Does 5020e285856c ("mm, oom: give __GFP_NOFAIL allocations access to > > memory reserves") help? > > Yes, that helps. A bit. > > I'm not fond of the clause "the allocation request might have come with some > locks held". What if it doesn't? Does it still have to pay the price. > > Should we not require that the caller indicate if any locks are held? > That way callers which don't hold locks can use __GFP_NOFAIL without > worrying about imposing on other code. > > Or is it so rare that __GFP_NOFAIL would be used without holding a lock > that it doesn't matter? > > The other commit of interest is > > Commit: 6c18ba7a1899 ("mm: help __GFP_NOFAIL allocations which do not trigger OOM killer") > > I don't find the reasoning convincing. It is a bit like "Robbing Peter > to pay Paul". It takes from the reserves to allow a __GFP_NOFAIL to > proceed, with out any reason to think this particular allocation has any > more 'right' to the reserves than anything else. > > While I don't like the reasoning in either of these, they do make it > clear (to me) that the use of reserves is entirely an internal policy > decision. They should *not* be seen as part of the API and callers > should not have to be concerned about it when deciding whether to use > __GFP_NOFAIL or not. Agree totally with this - we just want to block until allocation succeeds, and if the -filesystem- deadlocks because allocation never succeeds then that's a problem that needs to be solved in the filesystem with a different memory allocation strategy... OTOH, setting up a single __GFP_NOFAIL call site with the ability to take the entire system down seems somewhat misguided. > The use of these reserves is, at most, a hypothetical problem. If it > ever looks like becoming a real practical problem, it needs to be fixed > internally to the page allocator. Maybe an extra water-mark which isn't > quite as permissive as ALLOC_HIGH... > > I'm inclined to drop all references to reserves from the documentation > for __GFP_NOFAIL. I think there are enough users already that adding a > couple more isn't going to make problems substantially more likely. And > more will be added anyway that the mm/ team won't have the opportunity > or bandwidth to review. Yup, we've been replacing open coded loops like in kmem_alloc() with explicit __GFP_NOFAIL usage for a while now: $ ▶ git grep __GFP_NOFAIL fs/xfs |wc -l 33 $ ANd we've got another 100 or so call sites planned for conversion to __GFP_NOFAIL. Hence the suggestion to remove the use of reserves from __GFP_NOFAIL seems like a sensible plan because it has never been necessary in the past for all the allocation sites we are converting from open coded loops to __GFP_NOFAIL... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-15 22:35 ` NeilBrown 2021-09-16 0:37 ` Dave Chinner @ 2021-09-16 6:52 ` Michal Hocko 1 sibling, 0 replies; 45+ messages in thread From: Michal Hocko @ 2021-09-16 6:52 UTC (permalink / raw) To: NeilBrown Cc: Mel Gorman, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Jan Kara, Matthew Wilcox, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel On Thu 16-09-21 08:35:40, Neil Brown wrote: > On Wed, 15 Sep 2021, Michal Hocko wrote: > > On Wed 15-09-21 07:48:11, Neil Brown wrote: > > > > > > Why does __GFP_NOFAIL access the reserves? Why not require that the > > > relevant "Try harder" flag (__GFP_ATOMIC or __GFP_MEMALLOC) be included > > > with __GFP_NOFAIL if that is justified? > > > > Does 5020e285856c ("mm, oom: give __GFP_NOFAIL allocations access to > > memory reserves") help? > > Yes, that helps. A bit. > > I'm not fond of the clause "the allocation request might have come with some > locks held". What if it doesn't? Does it still have to pay the price. > > Should we not require that the caller indicate if any locks are held? I do not think this would help much TBH. What if the lock in question doesn't impose any dependency through allocation problem? > That way callers which don't hold locks can use __GFP_NOFAIL without > worrying about imposing on other code. > > Or is it so rare that __GFP_NOFAIL would be used without holding a lock > that it doesn't matter? > > The other commit of interest is > > Commit: 6c18ba7a1899 ("mm: help __GFP_NOFAIL allocations which do not trigger OOM killer") > > I don't find the reasoning convincing. It is a bit like "Robbing Peter > to pay Paul". It takes from the reserves to allow a __GFP_NOFAIL to > proceed, with out any reason to think this particular allocation has any > more 'right' to the reserves than anything else. I do agree that this is not really optimal. I do not remember exact details but these changes were mostly based or inspired by extreme memory pressure testing by Tetsuo who has managed to trigger quite some corner cases. Especially those where NOFS was involved were problematic. > While I don't like the reasoning in either of these, they do make it > clear (to me) that the use of reserves is entirely an internal policy > decision. They should *not* be seen as part of the API and callers > should not have to be concerned about it when deciding whether to use > __GFP_NOFAIL or not. Yes. NOFAIL should have high enough bar to use - essentially there is no other way than use it - that memory reserves shouldn't be a road block. If we learn that existing users can seriously deplete memory reserves then we might need to reconsider the existing logic. So far there are no indications that NOFAIL would really cause any problems in that area. > The use of these reserves is, at most, a hypothetical problem. If it > ever looks like becoming a real practical problem, it needs to be fixed > internally to the page allocator. Maybe an extra water-mark which isn't > quite as permissive as ALLOC_HIGH... > > I'm inclined to drop all references to reserves from the documentation > for __GFP_NOFAIL. I have found your additions to the documentation useful. > I think there are enough users already that adding a > couple more isn't going to make problems substantially more likely. And > more will be added anyway that the mm/ team won't have the opportunity > or bandwidth to review. > > Meanwhile I'll see if I can understand the intricacies of alloc_page so > that I can contibute to making it more predictable. > > Question: In those cases where an open-coded loop is appropriate, such > as when you want to handle signals or can drop locks, how bad would it > be to have a tight loop without any sleep? > > should_reclaim_retry() will sleep 100ms (sometimes...). Is that enough? > __GFP_NOFAIL doesn't add any sleep when looping. Yeah, NOFAIL doesn't add any explicit sleep points. In general there is no guarantee that a sleepable allocation will sleep. We do cond_resched in general but sleeping is enforced only for worker contexts because WQ concurrency depends on an explicit sleeping. So to answer your question, if you really need to sleep between retries then you should do it manually but cond_resched can be implied. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-14 16:34 ` Mel Gorman 2021-09-14 21:48 ` NeilBrown @ 2021-09-14 23:55 ` Dave Chinner 2021-09-15 8:59 ` Mel Gorman 1 sibling, 1 reply; 45+ messages in thread From: Dave Chinner @ 2021-09-14 23:55 UTC (permalink / raw) To: Mel Gorman Cc: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Jan Kara, Michal Hocko, Matthew Wilcox, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel On Tue, Sep 14, 2021 at 05:34:32PM +0100, Mel Gorman wrote: > On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote: > > Indefinite loops waiting for memory allocation are discouraged by > > documentation in gfp.h which says the use of __GFP_NOFAIL that it > > > > is definitely preferable to use the flag rather than opencode endless > > loop around allocator. > > > > Such loops that use congestion_wait() are particularly unwise as > > congestion_wait() is indistinguishable from > > schedule_timeout_uninterruptible() in practice - and should be > > deprecated. > > > > So this patch changes the two loops in ext4_ext_truncate() to use > > __GFP_NOFAIL instead of looping. > > > > As the allocation is multiple layers deeper in the call stack, this > > requires passing the EXT4_EX_NOFAIL flag down and handling it in various > > places. > > > > Of particular interest is the ext4_journal_start family of calls which > > can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'. This could be seen > > as a blurring of types. However 'type' is 8 bits, and EXT4_EX_NOFAIL is > > a high bit, so it is safe in practice. > > > > jbd2__journal_start() is enhanced so that the gfp_t flags passed are > > used for *all* allocations. > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > I'm not a fan. GFP_NOFAIL allows access to emergency reserves increasing > the risk of a livelock if memory is completely depleted where as some > callers can afford to wait. Undocumented behaviour, never mentioned or communicated to users in any __GFP_NOFAIL discussion I've taken part in until now. How is it different to, say, GFP_ATOMIC? i.e. Does GFP_NOFAIL actually imply GFP_ATOMIC, or is there some other undocumented behaviour going on here? We've already go ~80 __GFP_NOFAIL allocation contexts in fs/ and the vast majority of the are GFP_KERNEL | __GFP_NOFAIL or GFP_NOFS | __GFP_NOFAIL, so some clarification on what this actually means would be really good... > The key event should be reclaim making progress. Yup, that's what we need, but I don't see why it needs to be exposed outside the allocation code at all. > The hack below is > intended to vaguely demonstrate how blocking can be based on reclaim > making progress instead of "congestion" but has not even been booted. A > more complete overhaul may involve introducing > reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout, nodemask_t *nodemask) > and > reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout) I think that's racy. There's no guarantee that the node we are currently running on matches the cpu/node id that we failed to allocate from. Pre-emptible kernels and all that. IOWs, I think needs to be completely internal to the reclaim infrastructure and based on the current context we are trying to reclaim from. That way "GFP_RETRY_FOREVER" allocation contexts don't have to jump through an ever changing tangle of hoops to make basic "never-fail" allocation semantics behave correctly. > and converting congestion_wait and wait_iff_congestion to calling > reclaim_congestion_wait_nodemask which waits on the first usable node > and then audit every single congestion_wait() user to see which API > they should call. Further work would be to establish whether the page allocator should > call reclaim_congestion_wait_nodemask() if direct reclaim is not making > progress or whether that should be in vmscan.c. Conceivably, GFP_NOFAIL > could then soften its access to emergency reserves but I haven't given > it much thought. > > Yes it's significant work, but it would be a better than letting > __GFP_NOFAIL propagate further and kicking us down the road. Unfortunately, that seems to ignore the fact that we still need never-fail allocation semantics for stable system performance. Like it or not the requirements for __GFP_NOFAIL (and "retry forever" equivalent semantics) or open coded endless retry loops are *never* going away. IOWs, I'd suggest that we should think about how to formally support "never-fail" allocation semantics in both the API and the implementation in such a way that we don't end up with this __GFP_NOFAIL catch-22 ever again. Having the memory reclaim code wait on forwards progress instead of congestion as you propose here would be a core part of providing "never-fail" allocations... > This hack is terrible, it's not the right way to do it, it's just to > illustrate the idea of "waiting on memory should be based on reclaim > making progress and not the state of storage" is not impossible. I've been saying that is how reclaim should work for years. :/ It was LFSMM 2013 or 2014 that I was advocating for memory reclaim to move to IO-less reclaim throttling based on the rate at which free pages are returned to the freelists similar to the way IO-less dirty page throttling is based on the rate dirty pages are cleaned. Relying on IO interactions (submitting IO or waiting for completion) for high level page state management has always been a bad way to throttle demand because it only provides indirect control and has poor feedback indication. It's also a good way to remove the dependency on direct reclaim - just sleep instead of duplicating the work that kswapd should already be doing in the background to reclaim pages... > --8<-- > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 5c0318509f9e..5ed81c5746ec 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -832,6 +832,7 @@ typedef struct pglist_data { > unsigned long node_spanned_pages; /* total size of physical page > range, including holes */ > int node_id; > + wait_queue_head_t reclaim_wait; > wait_queue_head_t kswapd_wait; > wait_queue_head_t pfmemalloc_wait; > struct task_struct *kswapd; /* Protected by > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > index 6122c78ce914..21a9cd693d12 100644 > --- a/mm/backing-dev.c > +++ b/mm/backing-dev.c > @@ -13,6 +13,7 @@ > #include <linux/module.h> > #include <linux/writeback.h> > #include <linux/device.h> > +#include <linux/swap.h> > #include <trace/events/writeback.h> > > struct backing_dev_info noop_backing_dev_info; > @@ -1013,25 +1014,41 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync) > EXPORT_SYMBOL(set_bdi_congested); > > /** > - * congestion_wait - wait for a backing_dev to become uncongested > - * @sync: SYNC or ASYNC IO > - * @timeout: timeout in jiffies > + * congestion_wait - the docs are now worthless but avoiding a rename > * > - * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit > - * write congestion. If no backing_devs are congested then just wait for the > - * next write to be completed. > + * New thing -- wait for a timeout or reclaim to make progress > */ > long congestion_wait(int sync, long timeout) > { > + pg_data_t *pgdat; > long ret; > unsigned long start = jiffies; > DEFINE_WAIT(wait); > - wait_queue_head_t *wqh = &congestion_wqh[sync]; > + wait_queue_head_t *wqh; > > - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); > - ret = io_schedule_timeout(timeout); > + /* Never let kswapd sleep on itself */ > + if (current_is_kswapd()) > + goto trace; I think this breaks the kswapd 100ms immediate reclaim backoff in shrink_node(). > + > + /* > + * Dangerous, local memory may be forbidden by cpuset or policies, > + * use first eligible zone in zonelists node instead > + */ > + preempt_disable(); > + pgdat = NODE_DATA(smp_processor_id()); > + preempt_enable(); > + wqh = &pgdat->reclaim_wait; This goes away if it is kept internal and is passed the reclaim pgdat context we just failed to reclaim pages from. > + > + /* > + * Should probably check watermark of suitable zones here > + * in case this is spuriously called > + */ Ditto. These hacks really make me think that an external "wait for memory reclaim to make progress before retrying allocation" behaviour is the wrong way to tackle this. It's always been a hack because open-coded retry loops had to be implemented everywhere for never-fail allocation semantics. Neil has the right idea by replacing such fail-never back-offs with actual allocation attempts that encapsulate waiting for reclaim to make progress. This needs to be a formally supported function of memory allocation, and then these backoffs can be properly integrated into the memory reclaim retry mechanism instead of being poorly grafted onto the side... Whether that be __GFP_NOFAIL or GFP_RETRY_FOREVER that doesn't have the "dip into reserves" behaviour of __GFP_NOFAIL (which we clearly don't need because open coded retry loops have clearly work well enough for production systems for many years), I don't really care. But I think the memory allocation subsystem needs to move beyond "ahhhh, never-fail is too hard!!!!" and take steps to integrate this behaviour properly so that it can be made to work a whole lot better than it currently does.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-14 23:55 ` Dave Chinner @ 2021-09-15 8:59 ` Mel Gorman 2021-09-15 12:20 ` Michal Hocko 2021-09-15 14:35 ` Mel Gorman 0 siblings, 2 replies; 45+ messages in thread From: Mel Gorman @ 2021-09-15 8:59 UTC (permalink / raw) To: Dave Chinner Cc: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Jan Kara, Michal Hocko, Matthew Wilcox, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel On Wed, Sep 15, 2021 at 09:55:35AM +1000, Dave Chinner wrote: > On Tue, Sep 14, 2021 at 05:34:32PM +0100, Mel Gorman wrote: > > On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote: > > > Indefinite loops waiting for memory allocation are discouraged by > > > documentation in gfp.h which says the use of __GFP_NOFAIL that it > > > > > > is definitely preferable to use the flag rather than opencode endless > > > loop around allocator. > > > > > > Such loops that use congestion_wait() are particularly unwise as > > > congestion_wait() is indistinguishable from > > > schedule_timeout_uninterruptible() in practice - and should be > > > deprecated. > > > > > > So this patch changes the two loops in ext4_ext_truncate() to use > > > __GFP_NOFAIL instead of looping. > > > > > > As the allocation is multiple layers deeper in the call stack, this > > > requires passing the EXT4_EX_NOFAIL flag down and handling it in various > > > places. > > > > > > Of particular interest is the ext4_journal_start family of calls which > > > can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'. This could be seen > > > as a blurring of types. However 'type' is 8 bits, and EXT4_EX_NOFAIL is > > > a high bit, so it is safe in practice. > > > > > > jbd2__journal_start() is enhanced so that the gfp_t flags passed are > > > used for *all* allocations. > > > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > > > I'm not a fan. GFP_NOFAIL allows access to emergency reserves increasing > > the risk of a livelock if memory is completely depleted where as some > > callers can afford to wait. > > Undocumented behaviour, never mentioned or communicated to users in > any __GFP_NOFAIL discussion I've taken part in until now. > > How is it different to, say, GFP_ATOMIC? i.e. Does GFP_NOFAIL > actually imply GFP_ATOMIC, or is there some other undocumented > behaviour going on here? > Hmm, it's similar but not the same as GFP_ATOMIC. The most severe aspect of depleting emergency reserves comes from this block which is relevant when the system is effectively OOM /* * XXX: GFP_NOFS allocations should rather fail than rely on * other request to make a forward progress. * We are in an unfortunate situation where out_of_memory cannot * do much for this context but let's try it to at least get * access to memory reserved if the current task is killed (see * out_of_memory). Once filesystems are ready to handle allocation * failures more gracefully we should just bail out here. */ /* Exhausted what can be done so it's blame time */ if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) { *did_some_progress = 1; /* * Help non-failing allocations by giving them access to memory * reserves */ if (gfp_mask & __GFP_NOFAIL) page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_NO_WATERMARKS, ac); } The less severe aspect comes from /* * Help non-failing allocations by giving them access to memory * reserves but do not use ALLOC_NO_WATERMARKS because this * could deplete whole memory reserves which would just make * the situation worse */ page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac); if (page) goto got_pg; This doesn't dip into reserves as much as an atomic allocation does but it competes with them. > We've already go ~80 __GFP_NOFAIL allocation contexts in fs/ and the > vast majority of the are GFP_KERNEL | __GFP_NOFAIL or GFP_NOFS | > __GFP_NOFAIL, so some clarification on what this actually means > would be really good... > I'm not sure how much clarity can be given. Whatever the documented semantics, at some point under the current implementation __GFP_NOFAIL potentially competes with the same reserves as GFP_ATOMIC and has a path where watermarks are ignored entirely. > > The key event should be reclaim making progress. > > Yup, that's what we need, but I don't see why it needs to be exposed > outside the allocation code at all. > Probably not. At least some of it could be contained within reclaim itself to block when reclaim is not making progress as opposed to anything congestion related. That might still livelock if no progress can be made but that's not new, the OOM hammer should eventually kick in. > > The hack below is > > intended to vaguely demonstrate how blocking can be based on reclaim > > making progress instead of "congestion" but has not even been booted. A > > more complete overhaul may involve introducing > > reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout, nodemask_t *nodemask) > > and > > reclaim_congestion_wait_nodemask(gfp_t gfp_mask, long timeout) > > I think that's racy. There's no guarantee that the node we are > currently running on matches the cpu/node id that we failed to > allocate from. I know, I commented + /* + * Dangerous, local memory may be forbidden by cpuset or policies, + * use first eligible zone in zonelists node instead + */ There may be multiple nodes "we failed to allocate from", but the first eligible node is definitely one of them. There is the possibility that the first eligible node may be completely unreclaimable (all anonymous, no swap) in which case the timeout kicks in. I don't think this should be a global workqueue because there will be spurious wakeups. > Pre-emptible kernels and all that. IOWs, I think > needs to be completely internal to the reclaim infrastructure and > based on the current context we are trying to reclaim from. > A further step could be something similar to capture_control whereby reclaimed pages are immediately assigned to tasks blocked on reclaim_congestion_wait. It may be excessively complicated and overkill. > That way "GFP_RETRY_FOREVER" allocation contexts don't have to jump > through an ever changing tangle of hoops to make basic "never-fail" > allocation semantics behave correctly. > True and I can see what that is desirable. What I'm saying is that right now, increasing the use of __GFP_NOFAIL may cause a different set of problems (unbounded retries combined with ATOMIC allocation failures) as they compete for similar resources. > > and converting congestion_wait and wait_iff_congestion to calling > > reclaim_congestion_wait_nodemask which waits on the first usable node > > and then audit every single congestion_wait() user to see which API > > they should call. Further work would be to establish whether the page allocator should > > call reclaim_congestion_wait_nodemask() if direct reclaim is not making > > progress or whether that should be in vmscan.c. Conceivably, GFP_NOFAIL > > could then soften its access to emergency reserves but I haven't given > > it much thought. > > > > Yes it's significant work, but it would be a better than letting > > __GFP_NOFAIL propagate further and kicking us down the road. > > Unfortunately, that seems to ignore the fact that we still need > never-fail allocation semantics for stable system performance. Like > it or not the requirements for __GFP_NOFAIL (and "retry forever" > equivalent semantics) or open coded endless retry loops > are *never* going away. > I'm aware there will be cases where never-fail allocation semantics are required, particularly in GFP_NOFS contexts. What I'm saying is that right now because throttling is based on imaginary "congestion" that increasing the use could result in live-lock like bugs when multiple users complete for similar emergency resources to atomic. Note that I didn't NACK this. > IOWs, I'd suggest that we should think about how to formally > support "never-fail" allocation semantics in both the API and the > implementation in such a way that we don't end up with this > __GFP_NOFAIL catch-22 ever again. Having the memory reclaim code > wait on forwards progress instead of congestion as you propose here > would be a core part of providing "never-fail" allocations... > > > This hack is terrible, it's not the right way to do it, it's just to > > illustrate the idea of "waiting on memory should be based on reclaim > > making progress and not the state of storage" is not impossible. > > I've been saying that is how reclaim should work for years. :/ > > It was LFSMM 2013 or 2014 that I was advocating for memory reclaim > to move to IO-less reclaim throttling based on the rate at which > free pages are returned to the freelists similar to the way IO-less > dirty page throttling is based on the rate dirty pages are cleaned. > I'm going to guess no one ever tried. > Relying on IO interactions (submitting IO or waiting for completion) > for high level page state management has always been a bad way to > throttle demand because it only provides indirect control and has > poor feedback indication. > Also true. > It's also a good way to remove the dependency on direct reclaim - > just sleep instead of duplicating the work that kswapd should > already be doing in the background to reclaim pages... > Even for direct reclaim, I do think that the number of direct reclaimers should be limited with the rest going to sleep. At some point, excessive direct reclaim tasks are simply hammering the lru lock. > > --8<-- > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index 5c0318509f9e..5ed81c5746ec 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -832,6 +832,7 @@ typedef struct pglist_data { > > unsigned long node_spanned_pages; /* total size of physical page > > range, including holes */ > > int node_id; > > + wait_queue_head_t reclaim_wait; > > wait_queue_head_t kswapd_wait; > > wait_queue_head_t pfmemalloc_wait; > > struct task_struct *kswapd; /* Protected by > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > > index 6122c78ce914..21a9cd693d12 100644 > > --- a/mm/backing-dev.c > > +++ b/mm/backing-dev.c > > @@ -13,6 +13,7 @@ > > #include <linux/module.h> > > #include <linux/writeback.h> > > #include <linux/device.h> > > +#include <linux/swap.h> > > #include <trace/events/writeback.h> > > > > struct backing_dev_info noop_backing_dev_info; > > @@ -1013,25 +1014,41 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync) > > EXPORT_SYMBOL(set_bdi_congested); > > > > /** > > - * congestion_wait - wait for a backing_dev to become uncongested > > - * @sync: SYNC or ASYNC IO > > - * @timeout: timeout in jiffies > > + * congestion_wait - the docs are now worthless but avoiding a rename > > * > > - * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit > > - * write congestion. If no backing_devs are congested then just wait for the > > - * next write to be completed. > > + * New thing -- wait for a timeout or reclaim to make progress > > */ > > long congestion_wait(int sync, long timeout) > > { > > + pg_data_t *pgdat; > > long ret; > > unsigned long start = jiffies; > > DEFINE_WAIT(wait); > > - wait_queue_head_t *wqh = &congestion_wqh[sync]; > > + wait_queue_head_t *wqh; > > > > - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); > > - ret = io_schedule_timeout(timeout); > > + /* Never let kswapd sleep on itself */ > > + if (current_is_kswapd()) > > + goto trace; > > I think this breaks the kswapd 100ms immediate reclaim backoff in > shrink_node(). > Yep, it is. That would definitely need better care. > > + > > + /* > > + * Dangerous, local memory may be forbidden by cpuset or policies, > > + * use first eligible zone in zonelists node instead > > + */ > > + preempt_disable(); > > + pgdat = NODE_DATA(smp_processor_id()); > > + preempt_enable(); > > + wqh = &pgdat->reclaim_wait; > > This goes away if it is kept internal and is passed the reclaim > pgdat context we just failed to reclaim pages from. > Yep, that would also work if this was called only from reclaim contexts or mm internally. Some helper would still be needed to implement an alternative congestion_wait that looks up the same information until congestion_wait callers can be removed. Again, I wasn't trying to offer a correct implementation, only illustrating that it's perfectly possible to throttle based on reclaim making progress instead of "congestion". > > + > > + /* > > + * Should probably check watermark of suitable zones here > > + * in case this is spuriously called > > + */ > > Ditto. > > These hacks really make me think that an external "wait for memory > reclaim to make progress before retrying allocation" behaviour is > the wrong way to tackle this. It's always been a hack because > open-coded retry loops had to be implemented everywhere for > never-fail allocation semantics. > > Neil has the right idea by replacing such fail-never back-offs with > actual allocation attempts that encapsulate waiting for reclaim to > make progress. This needs to be a formally supported function of > memory allocation, and then these backoffs can be properly > integrated into the memory reclaim retry mechanism instead of being > poorly grafted onto the side... > I'm not necessarily opposed to this. What I'm saying is that doing the conversion now *MIGHT* mean an increase in live-lock-like bugs because with the current implementation, the callers may not sleep/throttle in the same way the crappy "loop around congestion_wait" implementations did. > Whether that be __GFP_NOFAIL or GFP_RETRY_FOREVER that doesn't have > the "dip into reserves" behaviour of __GFP_NOFAIL (which we clearly > don't need because open coded retry loops have clearly work well > enough for production systems for many years), I don't really care. > I suspected this was true and that it might be appropriate for __GFP_NOFAIL to obey normal watermarks unless __GFP_HIGH is also specified if it's absolutely necessary but I'm not sure because I haven't put enough thought into it. > But I think the memory allocation subsystem needs to move beyond > "ahhhh, never-fail is too hard!!!!" and take steps to integrate this > behaviour properly so that it can be made to work a whole lot better > than it currently does.... > Again, not opposed. It's simply a heads-up that converting now may cause problems that manifest as livelock-like bugs unless, at minimum, internal reclaim bases throttling on some reclaim making progress instead of congestion_wait. Given my current load, I can't promise I'd find the time to follow through with converting the hack into a proper implementation but someone reading linux-mm might. Either way, I felt it was necessary to at least warn about the hazards. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-15 8:59 ` Mel Gorman @ 2021-09-15 12:20 ` Michal Hocko 2021-09-15 14:35 ` Mel Gorman 1 sibling, 0 replies; 45+ messages in thread From: Michal Hocko @ 2021-09-15 12:20 UTC (permalink / raw) To: Mel Gorman Cc: Dave Chinner, NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Jan Kara, Matthew Wilcox, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel On Wed 15-09-21 09:59:04, Mel Gorman wrote: > On Wed, Sep 15, 2021 at 09:55:35AM +1000, Dave Chinner wrote: > > That way "GFP_RETRY_FOREVER" allocation contexts don't have to jump > > through an ever changing tangle of hoops to make basic "never-fail" > > allocation semantics behave correctly. > > > > True and I can see what that is desirable. What I'm saying is that right > now, increasing the use of __GFP_NOFAIL may cause a different set of > problems (unbounded retries combined with ATOMIC allocation failures) as > they compete for similar resources. I have commented on reasoning behind the above code in other reply. Let me just comment on this particular concern. I completely do agree that any use of __GFP_NOFAIL should be carefully evaluated. This is a very strong recuirement and it should be used only as a last resort. On the other hand converting an existing open coded nofail code that _doesn't_ really do any clever tricks to allow a forward progress (e.g. dropping locks, kicking some internal caching mechinisms etc.) should just be turned into __GPF_NOFAIL. Not only it makes it easier to spot that code but it also allows the page allocator to behave consistently and predictably. If the existing heuristic wrt. memory reserves to GFP_NOFAIL turns out to be suboptimal we can fix it for all those users. Dropping the rest of the email which talks about reclaim changes because I will need much more time to digest that. [...] -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-15 8:59 ` Mel Gorman 2021-09-15 12:20 ` Michal Hocko @ 2021-09-15 14:35 ` Mel Gorman 2021-09-15 22:38 ` Dave Chinner 1 sibling, 1 reply; 45+ messages in thread From: Mel Gorman @ 2021-09-15 14:35 UTC (permalink / raw) To: Dave Chinner Cc: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Jan Kara, Michal Hocko, Matthew Wilcox, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel On Wed, Sep 15, 2021 at 09:59:04AM +0100, Mel Gorman wrote: > > Yup, that's what we need, but I don't see why it needs to be exposed > > outside the allocation code at all. > > > > Probably not. At least some of it could be contained within reclaim > itself to block when reclaim is not making progress as opposed to anything > congestion related. That might still livelock if no progress can be made > but that's not new, the OOM hammer should eventually kick in. > There are two sides to the reclaim-related throttling 1. throttling because zero progress is being made 2. throttling because there are too many dirty pages or pages under writeback cycling through the LRU too quickly. The dirty page aspects (and the removal of wait_iff_congested which is almost completely broken) could be done with something like the following (completly untested). The downside is that end_page_writeback() takes an atomic penalty if reclaim is throttled but at that point the system is struggling anyway so I doubt it matters. --8<-- diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index ac7f231b8825..9fb1f0ae273c 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -154,7 +154,6 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits) } long congestion_wait(int sync, long timeout); -long wait_iff_congested(int sync, long timeout); static inline bool mapping_can_writeback(struct address_space *mapping) { diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 6a1d79d84675..5a289ada48cb 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -841,6 +841,9 @@ typedef struct pglist_data { int node_id; wait_queue_head_t kswapd_wait; wait_queue_head_t pfmemalloc_wait; + wait_queue_head_t reclaim_wait; /* wq for throttling reclaim */ + atomic_t nr_reclaim_throttled; /* nr of throtted tasks */ + atomic_t nr_reclaim_written; /* nr pages written since throttled */ struct task_struct *kswapd; /* Protected by mem_hotplug_begin/end() */ int kswapd_order; diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 840d1ba84cf5..3bc759b81897 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -763,13 +763,6 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait, TP_ARGS(usec_timeout, usec_delayed) ); -DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested, - - TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed), - - TP_ARGS(usec_timeout, usec_delayed) -); - DECLARE_EVENT_CLASS(writeback_single_inode_template, TP_PROTO(struct inode *inode, diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 4a9d4e27d0d9..0ea1a105eae5 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -1041,51 +1041,3 @@ long congestion_wait(int sync, long timeout) return ret; } EXPORT_SYMBOL(congestion_wait); - -/** - * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes - * @sync: SYNC or ASYNC IO - * @timeout: timeout in jiffies - * - * In the event of a congested backing_dev (any backing_dev) this waits - * for up to @timeout jiffies for either a BDI to exit congestion of the - * given @sync queue or a write to complete. - * - * The return value is 0 if the sleep is for the full timeout. Otherwise, - * it is the number of jiffies that were still remaining when the function - * returned. return_value == timeout implies the function did not sleep. - */ -long wait_iff_congested(int sync, long timeout) -{ - long ret; - unsigned long start = jiffies; - DEFINE_WAIT(wait); - wait_queue_head_t *wqh = &congestion_wqh[sync]; - - /* - * If there is no congestion, yield if necessary instead - * of sleeping on the congestion queue - */ - if (atomic_read(&nr_wb_congested[sync]) == 0) { - cond_resched(); - - /* In case we scheduled, work out time remaining */ - ret = timeout - (jiffies - start); - if (ret < 0) - ret = 0; - - goto out; - } - - /* Sleep until uncongested or a write happens */ - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); - ret = io_schedule_timeout(timeout); - finish_wait(wqh, &wait); - -out: - trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout), - jiffies_to_usecs(jiffies - start)); - - return ret; -} -EXPORT_SYMBOL(wait_iff_congested); diff --git a/mm/filemap.c b/mm/filemap.c index dae481293b5d..b9be9afa4308 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1606,6 +1606,8 @@ void end_page_writeback(struct page *page) smp_mb__after_atomic(); wake_up_page(page, PG_writeback); put_page(page); + + acct_reclaim_writeback(page); } EXPORT_SYMBOL(end_page_writeback); diff --git a/mm/internal.h b/mm/internal.h index cf3cb933eba3..47e77009e0d5 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -34,6 +34,13 @@ void page_writeback_init(void); +void __acct_reclaim_writeback(struct page *page); +static inline void acct_reclaim_writeback(struct page *page) +{ + if (atomic_read(&page_pgdat(page)->nr_reclaim_throttled)) + __acct_reclaim_writeback(page); +} + vm_fault_t do_swap_page(struct vm_fault *vmf); void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b37435c274cf..d849ddfc1e51 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7396,6 +7396,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat) init_waitqueue_head(&pgdat->kswapd_wait); init_waitqueue_head(&pgdat->pfmemalloc_wait); + init_waitqueue_head(&pgdat->reclaim_wait); pgdat_page_ext_init(pgdat); lruvec_init(&pgdat->__lruvec); diff --git a/mm/vmscan.c b/mm/vmscan.c index 74296c2d1fed..b209564766b0 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1006,6 +1006,40 @@ static void handle_write_error(struct address_space *mapping, unlock_page(page); } +static void +reclaim_writeback_throttle(pg_data_t *pgdat, long timeout) +{ + wait_queue_head_t *wqh = &pgdat->reclaim_wait; + long ret; + DEFINE_WAIT(wait); + + atomic_inc(&pgdat->nr_reclaim_throttled); + + prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE); + ret = schedule_timeout(timeout); + finish_wait(&pgdat->reclaim_wait, &wait); + + if (atomic_dec_and_test(&pgdat->nr_reclaim_throttled)) + atomic_set(&pgdat->nr_reclaim_written, 0); + + /* TODO: Add tracepoint to track time sleeping */ +} + +/* + * Account for pages written if tasks are throttled waiting on dirty + * pages to clean. If enough pages have been cleaned since throttling + * started then wakeup the throttled tasks. + */ +void __acct_reclaim_writeback(struct page *page) +{ + pg_data_t *pgdat = page_pgdat(page); + int nr_written = atomic_inc_return(&pgdat->nr_reclaim_written); + int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled); + + if (nr_written > SWAP_CLUSTER_MAX * nr_throttled) + wake_up_interruptible(&pgdat->reclaim_wait); +} + /* possible outcome of pageout() */ typedef enum { /* failed to write page out, page is locked */ @@ -1412,9 +1446,8 @@ static unsigned int shrink_page_list(struct list_head *page_list, /* * The number of dirty pages determines if a node is marked - * reclaim_congested which affects wait_iff_congested. kswapd - * will stall and start writing pages if the tail of the LRU - * is all dirty unqueued pages. + * reclaim_congested. kswapd will stall and start writing + * pages if the tail of the LRU is all dirty unqueued pages. */ page_check_dirty_writeback(page, &dirty, &writeback); if (dirty || writeback) @@ -3180,19 +3213,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) * If kswapd scans pages marked for immediate * reclaim and under writeback (nr_immediate), it * implies that pages are cycling through the LRU - * faster than they are written so also forcibly stall. + * faster than they are written so forcibly stall + * until some pages complete writeback. */ if (sc->nr.immediate) - congestion_wait(BLK_RW_ASYNC, HZ/10); + reclaim_writeback_throttle(pgdat, HZ/10); } /* * Tag a node/memcg as congested if all the dirty pages * scanned were backed by a congested BDI and - * wait_iff_congested will stall. + * non-kswapd tasks will stall on reclaim_writeback_throttle. * * Legacy memcg will stall in page writeback so avoid forcibly - * stalling in wait_iff_congested(). + * stalling in reclaim_writeback_throttle(). */ if ((current_is_kswapd() || (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) && @@ -3208,7 +3242,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) if (!current_is_kswapd() && current_may_throttle() && !sc->hibernation_mode && test_bit(LRUVEC_CONGESTED, &target_lruvec->flags)) - wait_iff_congested(BLK_RW_ASYNC, HZ/10); + reclaim_writeback_throttle(pgdat, HZ/10); if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, sc)) @@ -4286,6 +4320,8 @@ static int kswapd(void *p) WRITE_ONCE(pgdat->kswapd_order, 0); WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES); + atomic_set(&pgdat->nr_reclaim_throttled, 0); + atomic_set(&pgdat->nr_reclaim_written, 0); for ( ; ; ) { bool ret; ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-15 14:35 ` Mel Gorman @ 2021-09-15 22:38 ` Dave Chinner 2021-09-16 9:00 ` Mel Gorman 0 siblings, 1 reply; 45+ messages in thread From: Dave Chinner @ 2021-09-15 22:38 UTC (permalink / raw) To: Mel Gorman Cc: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Jan Kara, Michal Hocko, Matthew Wilcox, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel On Wed, Sep 15, 2021 at 03:35:10PM +0100, Mel Gorman wrote: > On Wed, Sep 15, 2021 at 09:59:04AM +0100, Mel Gorman wrote: > > > Yup, that's what we need, but I don't see why it needs to be exposed > > > outside the allocation code at all. > > > > > > > Probably not. At least some of it could be contained within reclaim > > itself to block when reclaim is not making progress as opposed to anything > > congestion related. That might still livelock if no progress can be made > > but that's not new, the OOM hammer should eventually kick in. > > > > There are two sides to the reclaim-related throttling > > 1. throttling because zero progress is being made > 2. throttling because there are too many dirty pages or pages under > writeback cycling through the LRU too quickly. > > The dirty page aspects (and the removal of wait_iff_congested which is > almost completely broken) could be done with something like the following > (completly untested). The downside is that end_page_writeback() takes an > atomic penalty if reclaim is throttled but at that point the system is > struggling anyway so I doubt it matters. The atomics are pretty nasty, as is directly accessing the pgdat on every call to end_page_writeback(). Those will be performance limiting factors. Indeed, we don't use atomics for dirty page throttling, which does dirty page accounting via percpu counters on the BDI and doesn't require wakeups. Also, we've already got per-node and per-zone counters there for dirty/write pending stats, so do we actually need new counters and wakeups here? i.e. balance_dirty_pages() does not have an explicit wakeup - it bases it's sleep time on the (memcg aware) measured writeback rate on the BDI the page belongs to and the amount of outstanding dirty data on that BDI. i.e. it estimates fairly accurately what the wait time for this task should be given the dirty page demand and current writeback progress being made is and just sleeps for that length of time. Ideally, that's what should be happening here - we should be able to calculate a page cleaning rate estimation and then base the sleep time on that. No wakeups needed - when we've waited for the estimated time, we try to reclaim again... In fact, why can't this "too many dirty pages" case just use the balance_dirty_pages() infrastructure to do the "wait for writeback" reclaim backoff? Why do we even need to re-invent the wheel here? > diff --git a/mm/filemap.c b/mm/filemap.c > index dae481293b5d..b9be9afa4308 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1606,6 +1606,8 @@ void end_page_writeback(struct page *page) > smp_mb__after_atomic(); > wake_up_page(page, PG_writeback); > put_page(page); > + > + acct_reclaim_writeback(page); UAF - that would need to be before the put_page() call... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-15 22:38 ` Dave Chinner @ 2021-09-16 9:00 ` Mel Gorman 0 siblings, 0 replies; 45+ messages in thread From: Mel Gorman @ 2021-09-16 9:00 UTC (permalink / raw) To: Dave Chinner Cc: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger, Darrick J. Wong, Jan Kara, Michal Hocko, Matthew Wilcox, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel On Thu, Sep 16, 2021 at 08:38:58AM +1000, Dave Chinner wrote: > On Wed, Sep 15, 2021 at 03:35:10PM +0100, Mel Gorman wrote: > > On Wed, Sep 15, 2021 at 09:59:04AM +0100, Mel Gorman wrote: > > > > Yup, that's what we need, but I don't see why it needs to be exposed > > > > outside the allocation code at all. > > > > > > > > > > Probably not. At least some of it could be contained within reclaim > > > itself to block when reclaim is not making progress as opposed to anything > > > congestion related. That might still livelock if no progress can be made > > > but that's not new, the OOM hammer should eventually kick in. > > > > > > > There are two sides to the reclaim-related throttling > > > > 1. throttling because zero progress is being made > > 2. throttling because there are too many dirty pages or pages under > > writeback cycling through the LRU too quickly. > > > > The dirty page aspects (and the removal of wait_iff_congested which is > > almost completely broken) could be done with something like the following > > (completly untested). The downside is that end_page_writeback() takes an > > atomic penalty if reclaim is throttled but at that point the system is > > struggling anyway so I doubt it matters. > > The atomics are pretty nasty, as is directly accessing the pgdat on > every call to end_page_writeback(). Those will be performance > limiting factors. Indeed, we don't use atomics for dirty page > throttling, which does dirty page accounting via > percpu counters on the BDI and doesn't require wakeups. > Thanks for taking a look! From end_page_writeback, the first atomic operation is an atomic read which is READ_ONCE on most architectures (alpha is a counter example as it has a memory barrier but alpha is niche). The main atomic penalty is when the system is reclaim throttled but it can be a per-cpu node page state counter instead. That sacrifices accuracy for speed but in this context, I think that's ok. As for accessing the pgdat structure, every vmstat counter for the node involves a pgdat lookup as the API is page-based and so there are already a bunch of pgdat lookups in the IO path. > Also, we've already got per-node and per-zone counters there for > dirty/write pending stats, so do we actually need new counters and > wakeups here? > I think we need at least a new counter because dirty/write pending stats do not tell us how many pages were cleaned since reclaim started hitting problems with dirty pages at the tail of the LRU. Reading dirty/write_pending stats at two points of time cannot be used to infer how many pages were cleaned during the same interval. At minimum, we'd need nr_dirtied and a new nr_cleaned stat to infer pages cleaned between two points in time. That can be done but if the new counters is NR_THROTTLED_WRITTEN (NR_WRITTEN while reclaim throttled), we only need one field in struct zone to track nr_reclaim_throttled when throttling startsi (updated patch at the end of the mail). > i.e. balance_dirty_pages() does not have an explicit wakeup - it > bases it's sleep time on the (memcg aware) measured writeback rate > on the BDI the page belongs to and the amount of outstanding dirty > data on that BDI. i.e. it estimates fairly accurately what the wait > time for this task should be given the dirty page demand and current > writeback progress being made is and just sleeps for that length of > time. > > Ideally, that's what should be happening here - we should be able to > calculate a page cleaning rate estimation and then base the sleep > time on that. No wakeups needed - when we've waited for the > estimated time, we try to reclaim again... > > In fact, why can't this "too many dirty pages" case just use the > balance_dirty_pages() infrastructure to do the "wait for writeback" > reclaim backoff? Why do we even need to re-invent the wheel here? > Conceptually I can see what you are asking for but am finding it hard to translate it into an implementation. Dirty page throttling is throttling heavy writers on a task and bdi basis but does not care about the positioning of pages on the LRU or what node the page is allocated from. On the reclaim side, the concern is how many pages that are dirty or writeback at the tail of the LRU regardless of what task dirtied that page or BDI it belongs to. Hence I'm failing to see how the same rate-limiting mechanism could be used on the reclaim side. I guess we could look at the reclaim efficiency for a given task by tracking pages that could not be reclaimed due to dirty/writeback relative to pages that could be reclaimed and sleeping for increasing lengths of time unconditionally when the reclaim efficiency is low. However it's complex and would be hard to debug. It could hit serious problems in cases where there are both fast and slow bdi's with the pages backed by a slow bdi dominating the tail of the LRU -- it could throttle excessively prematurely. Alternatively, look at taking pages that are dirty/writeback off the inactive list like what is done for LRU_UNEVICTABLE pages and throttling based on a high rate of INACTIVE_FILE:LRU_UNEVICTABLE, but again, it's complex and could incur additional penalties in the end_page_writeback due to LRU manipulations. Both are essentially re-inventing a very complex wheel. I'm aware that what I'm proposing also has its problems. It could wake prematurely because all the pages cleaned were backed by a fast bdi when the pages it scanned were backed by a slow bdi. Prehaps this could be dealt with by tracking the estimated writeback speed of pages cleaned and comparing it against the estimated writeback speed of pages at the tail of the LRU but again, the complexity may be excessive. If the first solution is too complex, it'll get hit with the KISS hammer with a request to justify the complexity when the basis for comparison is a broken concept. So I want to start simple, all it has to be is better than congestion_wait/wait_iff_congested. If that still is not good enough, the more complex options will have a basis for comparison. > > diff --git a/mm/filemap.c b/mm/filemap.c > > index dae481293b5d..b9be9afa4308 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -1606,6 +1606,8 @@ void end_page_writeback(struct page *page) > > smp_mb__after_atomic(); > > wake_up_page(page, PG_writeback); > > put_page(page); > > + > > + acct_reclaim_writeback(page); > > UAF - that would need to be before the put_page() call... > UAF indeed. Here is another version of the same concept that avoids atomic updates from end_page_writeback() context and limits pgdat lookups. It's still not tested other than "it boots under kvm". diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index ac7f231b8825..9fb1f0ae273c 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -154,7 +154,6 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits) } long congestion_wait(int sync, long timeout); -long wait_iff_congested(int sync, long timeout); static inline bool mapping_can_writeback(struct address_space *mapping) { diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 6a1d79d84675..12a011912c3c 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -199,6 +199,7 @@ enum node_stat_item { NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ NR_DIRTIED, /* page dirtyings since bootup */ NR_WRITTEN, /* page writings since bootup */ + NR_THROTTLED_WRITTEN, /* NR_WRITTEN while reclaim throttled */ NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */ NR_FOLL_PIN_ACQUIRED, /* via: pin_user_page(), gup flag: FOLL_PIN */ NR_FOLL_PIN_RELEASED, /* pages returned via unpin_user_page() */ @@ -841,6 +842,10 @@ typedef struct pglist_data { int node_id; wait_queue_head_t kswapd_wait; wait_queue_head_t pfmemalloc_wait; + wait_queue_head_t reclaim_wait; /* wq for throttling reclaim */ + atomic_t nr_reclaim_throttled; /* nr of throtted tasks */ + unsigned long nr_reclaim_start; /* nr pages written while throttled + * when throttling started. */ struct task_struct *kswapd; /* Protected by mem_hotplug_begin/end() */ int kswapd_order; diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 840d1ba84cf5..3bc759b81897 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -763,13 +763,6 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait, TP_ARGS(usec_timeout, usec_delayed) ); -DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested, - - TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed), - - TP_ARGS(usec_timeout, usec_delayed) -); - DECLARE_EVENT_CLASS(writeback_single_inode_template, TP_PROTO(struct inode *inode, diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 4a9d4e27d0d9..0ea1a105eae5 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -1041,51 +1041,3 @@ long congestion_wait(int sync, long timeout) return ret; } EXPORT_SYMBOL(congestion_wait); - -/** - * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes - * @sync: SYNC or ASYNC IO - * @timeout: timeout in jiffies - * - * In the event of a congested backing_dev (any backing_dev) this waits - * for up to @timeout jiffies for either a BDI to exit congestion of the - * given @sync queue or a write to complete. - * - * The return value is 0 if the sleep is for the full timeout. Otherwise, - * it is the number of jiffies that were still remaining when the function - * returned. return_value == timeout implies the function did not sleep. - */ -long wait_iff_congested(int sync, long timeout) -{ - long ret; - unsigned long start = jiffies; - DEFINE_WAIT(wait); - wait_queue_head_t *wqh = &congestion_wqh[sync]; - - /* - * If there is no congestion, yield if necessary instead - * of sleeping on the congestion queue - */ - if (atomic_read(&nr_wb_congested[sync]) == 0) { - cond_resched(); - - /* In case we scheduled, work out time remaining */ - ret = timeout - (jiffies - start); - if (ret < 0) - ret = 0; - - goto out; - } - - /* Sleep until uncongested or a write happens */ - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); - ret = io_schedule_timeout(timeout); - finish_wait(wqh, &wait); - -out: - trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout), - jiffies_to_usecs(jiffies - start)); - - return ret; -} -EXPORT_SYMBOL(wait_iff_congested); diff --git a/mm/filemap.c b/mm/filemap.c index dae481293b5d..59187787fbfc 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1605,6 +1605,7 @@ void end_page_writeback(struct page *page) smp_mb__after_atomic(); wake_up_page(page, PG_writeback); + acct_reclaim_writeback(page); put_page(page); } EXPORT_SYMBOL(end_page_writeback); diff --git a/mm/internal.h b/mm/internal.h index cf3cb933eba3..cd8b892537a0 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -34,6 +34,14 @@ void page_writeback_init(void); +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page); +static inline void acct_reclaim_writeback(struct page *page) +{ + pg_data_t *pgdat = page_pgdat(page); + if (atomic_read(&pgdat->nr_reclaim_throttled)) + __acct_reclaim_writeback(pgdat, page); +} + vm_fault_t do_swap_page(struct vm_fault *vmf); void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b37435c274cf..d849ddfc1e51 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7396,6 +7396,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat) init_waitqueue_head(&pgdat->kswapd_wait); init_waitqueue_head(&pgdat->pfmemalloc_wait); + init_waitqueue_head(&pgdat->reclaim_wait); pgdat_page_ext_init(pgdat); lruvec_init(&pgdat->__lruvec); diff --git a/mm/vmscan.c b/mm/vmscan.c index 74296c2d1fed..f7908ed079f7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1006,6 +1006,43 @@ static void handle_write_error(struct address_space *mapping, unlock_page(page); } +static void +reclaim_writeback_throttle(pg_data_t *pgdat, long timeout) +{ + wait_queue_head_t *wqh = &pgdat->reclaim_wait; + long ret; + DEFINE_WAIT(wait); + + atomic_inc(&pgdat->nr_reclaim_throttled); + WRITE_ONCE(pgdat->nr_reclaim_start, + node_page_state(pgdat, NR_THROTTLED_WRITTEN)); + + prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE); + ret = schedule_timeout(timeout); + finish_wait(&pgdat->reclaim_wait, &wait); + atomic_dec(&pgdat->nr_reclaim_throttled); + + /* TODO: Add tracepoint to track time sleeping */ +} + +/* + * Account for pages written if tasks are throttled waiting on dirty + * pages to clean. If enough pages have been cleaned since throttling + * started then wakeup the throttled tasks. + */ +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page) +{ + unsigned long nr_written; + int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled); + + __inc_node_page_state(page, NR_THROTTLED_WRITTEN); + nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) - + READ_ONCE(pgdat->nr_reclaim_start); + + if (nr_written > SWAP_CLUSTER_MAX * nr_throttled) + wake_up_interruptible(&pgdat->reclaim_wait); +} + /* possible outcome of pageout() */ typedef enum { /* failed to write page out, page is locked */ @@ -1412,9 +1449,8 @@ static unsigned int shrink_page_list(struct list_head *page_list, /* * The number of dirty pages determines if a node is marked - * reclaim_congested which affects wait_iff_congested. kswapd - * will stall and start writing pages if the tail of the LRU - * is all dirty unqueued pages. + * reclaim_congested. kswapd will stall and start writing + * pages if the tail of the LRU is all dirty unqueued pages. */ page_check_dirty_writeback(page, &dirty, &writeback); if (dirty || writeback) @@ -3180,19 +3216,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) * If kswapd scans pages marked for immediate * reclaim and under writeback (nr_immediate), it * implies that pages are cycling through the LRU - * faster than they are written so also forcibly stall. + * faster than they are written so forcibly stall + * until some pages complete writeback. */ if (sc->nr.immediate) - congestion_wait(BLK_RW_ASYNC, HZ/10); + reclaim_writeback_throttle(pgdat, HZ/10); } /* * Tag a node/memcg as congested if all the dirty pages * scanned were backed by a congested BDI and - * wait_iff_congested will stall. + * non-kswapd tasks will stall on reclaim_writeback_throttle. * * Legacy memcg will stall in page writeback so avoid forcibly - * stalling in wait_iff_congested(). + * stalling in reclaim_writeback_throttle(). */ if ((current_is_kswapd() || (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) && @@ -3208,7 +3245,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) if (!current_is_kswapd() && current_may_throttle() && !sc->hibernation_mode && test_bit(LRUVEC_CONGESTED, &target_lruvec->flags)) - wait_iff_congested(BLK_RW_ASYNC, HZ/10); + reclaim_writeback_throttle(pgdat, HZ/10); if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, sc)) @@ -4286,6 +4323,7 @@ static int kswapd(void *p) WRITE_ONCE(pgdat->kswapd_order, 0); WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES); + atomic_set(&pgdat->nr_reclaim_throttled, 0); for ( ; ; ) { bool ret; diff --git a/mm/vmstat.c b/mm/vmstat.c index 8ce2620344b2..9b2bc9d61d4b 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1225,6 +1225,7 @@ const char * const vmstat_text[] = { "nr_vmscan_immediate_reclaim", "nr_dirtied", "nr_written", + "nr_throttled_written", "nr_kernel_misc_reclaimable", "nr_foll_pin_acquired", "nr_foll_pin_released", ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-14 0:13 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown 2021-09-14 16:34 ` Mel Gorman @ 2021-09-15 0:28 ` Theodore Ts'o 2021-09-15 5:25 ` NeilBrown 1 sibling, 1 reply; 45+ messages in thread From: Theodore Ts'o @ 2021-09-15 0:28 UTC (permalink / raw) To: NeilBrown Cc: Andrew Morton, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote: > > Of particular interest is the ext4_journal_start family of calls which > can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'. This could be seen > as a blurring of types. However 'type' is 8 bits, and EXT4_EX_NOFAIL is > a high bit, so it is safe in practice. I'm really not fond of this type blurring. What I'd suggeset doing instead is adding a "gfp_t gfp_mask" parameter to the __ext4_journal_start_sb(). With the exception of one call site in fs/ext4/ialloc.c, most of the callers of __ext4_journal_start_sb() are via #define helper macros or inline funcions. So it would just require adding a GFP_NOFS as an extra parameter to the various macros and inline functions which call __ext4_journal_start_sb() in ext4_jbd2.h. The function ext4_journal_start_with_revoke() is called exactly once so we could just bury the __GFP_NOFAIL in the definition of that macros, e.g.: #define ext4_journal_start_with_revoke(inode, type, blocks, revoke_creds) \ __ext4_journal_start((inode), __LINE__, (type), (blocks), 0, \ GFP_NOFS | __GFP_NOFAIL, (revoke_creds)) but it's probably better to do something like this: #define ext4_journal_start_with_revoke(gfp_mask, inode, type, blocks, revoke_creds) \ __ext4_journal_start((inode), __LINE__, (type), (blocks), 0, \ gfp_mask, (revoke_creds)) So it's explicit in the C function ext4_ext_remove_space() in fs/ext4/extents.c that we are explicitly requesting the __GFP_NOFAIL behavior. Does that make sense? - Ted ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-15 0:28 ` Theodore Ts'o @ 2021-09-15 5:25 ` NeilBrown 2021-09-15 17:02 ` Theodore Ts'o 0 siblings, 1 reply; 45+ messages in thread From: NeilBrown @ 2021-09-15 5:25 UTC (permalink / raw) To: Theodore Ts'o Cc: Andrew Morton, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel On Wed, 15 Sep 2021, Theodore Ts'o wrote: > On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote: > > > > Of particular interest is the ext4_journal_start family of calls which > > can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'. This could be seen > > as a blurring of types. However 'type' is 8 bits, and EXT4_EX_NOFAIL is > > a high bit, so it is safe in practice. > > I'm really not fond of this type blurring. What I'd suggeset doing > instead is adding a "gfp_t gfp_mask" parameter to the > __ext4_journal_start_sb(). With the exception of one call site in > fs/ext4/ialloc.c, most of the callers of __ext4_journal_start_sb() are > via #define helper macros or inline funcions. So it would just > require adding a GFP_NOFS as an extra parameter to the various macros > and inline functions which call __ext4_journal_start_sb() in > ext4_jbd2.h. > > The function ext4_journal_start_with_revoke() is called exactly once > so we could just bury the __GFP_NOFAIL in the definition of that > macros, e.g.: > > #define ext4_journal_start_with_revoke(inode, type, blocks, revoke_creds) \ > __ext4_journal_start((inode), __LINE__, (type), (blocks), 0, \ > GFP_NOFS | __GFP_NOFAIL, (revoke_creds)) > > but it's probably better to do something like this: > > #define ext4_journal_start_with_revoke(gfp_mask, inode, type, blocks, revoke_creds) \ > __ext4_journal_start((inode), __LINE__, (type), (blocks), 0, \ > gfp_mask, (revoke_creds)) > > So it's explicit in the C function ext4_ext_remove_space() in > fs/ext4/extents.c that we are explicitly requesting the __GFP_NOFAIL > behavior. > > Does that make sense? Mostly. Adding gfp_mask to __ext4_journal_start_sb() make perfect sense. There doesn't seem much point adding one to __ext4_journal_start(), we can have ext4_journal_start_with_revoke() call __ext4_journal_start_sb() directly. But I cannot see what it doesn't already do that. i.e. why have the inline __ext4_journal_start() at all? Is it OK if I don't use that for ext4_journal_start_with_revoke()? Thanks, NeilBrown ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops. 2021-09-15 5:25 ` NeilBrown @ 2021-09-15 17:02 ` Theodore Ts'o 0 siblings, 0 replies; 45+ messages in thread From: Theodore Ts'o @ 2021-09-15 17:02 UTC (permalink / raw) To: NeilBrown Cc: Andrew Morton, Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel On Wed, Sep 15, 2021 at 03:25:40PM +1000, NeilBrown wrote: > Adding gfp_mask to __ext4_journal_start_sb() make perfect sense. > There doesn't seem much point adding one to __ext4_journal_start(), > we can have ext4_journal_start_with_revoke() call > __ext4_journal_start_sb() directly. > But I cannot see what it doesn't already do that. > i.e. why have the inline __ext4_journal_start() at all? > Is it OK if I don't use that for ext4_journal_start_with_revoke()? Sure. I think the only reason why we have __ext4_journal_start() as an inline function at all was for historical reasons. That is, we modified __ext4_journal_start() so that it took a struct super, and instead of changing all of the macros which called __ext4_journal_start(), we named it to be __ext4_journal_start_sb() and added the inline definition of __ext4_journal_start() to avoid changing all of the existing users of __ext4_journal_start(). So sure, it's fine not to use that for ext4_journal_start_with_revoke(), and we probably should clean up the use of __ext4_journal_start() at some point. That's unrelated to your work, though. Cheers, - Ted ^ permalink raw reply [flat|nested] 45+ messages in thread
end of thread, other threads:[~2021-10-22 0:10 UTC | newest] Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-09-17 2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown 2021-09-17 2:56 ` [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() NeilBrown 2021-09-17 2:56 ` [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() NeilBrown 2021-09-17 21:45 ` Dave Chinner 2021-09-17 2:56 ` [PATCH 1/6] MM: Support __GFP_NOFAIL in alloc_pages_bulk_*() and improve doco NeilBrown 2021-09-17 14:42 ` Mel Gorman 2021-09-20 23:48 ` NeilBrown 2021-10-05 9:16 ` Vlastimil Babka 2021-09-17 2:56 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown 2021-09-17 2:56 ` [PATCH 4/6] EXT4: remove congestion_wait from ext4_bio_write_page, and simplify NeilBrown 2021-09-17 2:56 ` [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL NeilBrown 2021-10-05 9:20 ` Vlastimil Babka 2021-10-05 11:09 ` Michal Hocko 2021-10-05 12:27 ` Vlastimil Babka 2021-10-06 23:14 ` Dave Chinner 2021-10-07 10:07 ` Michal Hocko 2021-10-07 23:15 ` NeilBrown 2021-10-08 7:48 ` Michal Hocko 2021-10-08 22:36 ` Dave Chinner 2021-10-11 11:57 ` Michal Hocko 2021-10-11 21:49 ` NeilBrown 2021-10-18 10:23 ` Michal Hocko 2021-10-19 4:32 ` NeilBrown 2021-10-19 13:59 ` Michal Hocko 2021-10-22 0:09 ` NeilBrown 2021-10-13 2:32 ` Dave Chinner 2021-10-13 8:26 ` Michal Hocko 2021-10-14 11:32 ` David Sterba 2021-10-14 11:46 ` Michal Hocko -- strict thread matches above, loose matches on Subject: below -- 2021-09-14 0:13 [PATCH 0/6] congestion_wait() and GFP_NOFAIL NeilBrown 2021-09-14 0:13 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown 2021-09-14 16:34 ` Mel Gorman 2021-09-14 21:48 ` NeilBrown 2021-09-15 12:06 ` Michal Hocko 2021-09-15 22:35 ` NeilBrown 2021-09-16 0:37 ` Dave Chinner 2021-09-16 6:52 ` Michal Hocko 2021-09-14 23:55 ` Dave Chinner 2021-09-15 8:59 ` Mel Gorman 2021-09-15 12:20 ` Michal Hocko 2021-09-15 14:35 ` Mel Gorman 2021-09-15 22:38 ` Dave Chinner 2021-09-16 9:00 ` Mel Gorman 2021-09-15 0:28 ` Theodore Ts'o 2021-09-15 5:25 ` NeilBrown 2021-09-15 17:02 ` Theodore Ts'o
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).