linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL
@ 2021-09-17  2:56 NeilBrown
  2021-09-17  2:56 ` [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() NeilBrown
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: NeilBrown @ 2021-09-17  2:56 UTC (permalink / raw)
  To: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko,
	. Dave Chinner, Jonathan Corbet
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel, linux-doc

This second version:
  - add recipients for the Documentation/core-api changes
  - add fix for __alloc_pages_bulk() to handle GFP_NOFAIL
  - drops the annotations for congestion_wait() as being ineffective
    as that isn't really useful until an alternative is available
  - changes to GFP_NOFAIL documentation changes to focus on the possible
    deadlocks rather than the use of memory reserves
  - Improves ext4 and xfs patches based on feedback from Ted and Dave.

The patches are independent, except that the last patch depends on the
first.

As mentioned last time:

  These are the easy bits.  There are 5 calls to congestion_wait() and
  one to wait_iff_congested() in mm/ which need consideration.  There
  are multiple calls to congestion_wait in fs/, particularly fs/f2fs/
  which need to be addressed too.  I'll try to form an opinion about
  these in coming weeks.

(other interesting comment in original cover letter just duplicates
 observations made in the commit messages of individual patches).

NeilBrown


---

NeilBrown (6):
      MM: Support __GFP_NOFAIL in  alloc_pages_bulk_*() and improve doco
      MM: improve documentation for __GFP_NOFAIL
      EXT4: Remove ENOMEM/congestion_wait() loops.
      EXT4: remove congestion_wait from ext4_bio_write_page, and simplify
      XFS: remove congestion_wait() loop from kmem_alloc()
      XFS: remove congestion_wait() loop from xfs_buf_alloc_pages()


 Documentation/core-api/memory-allocation.rst | 25 ++++++++-
 fs/ext4/ext4.h                               |  2 +-
 fs/ext4/ext4_jbd2.c                          |  4 +-
 fs/ext4/ext4_jbd2.h                          | 14 +++---
 fs/ext4/extents.c                            | 53 ++++++++------------
 fs/ext4/extents_status.c                     | 35 +++++++------
 fs/ext4/extents_status.h                     |  2 +-
 fs/ext4/ialloc.c                             |  3 +-
 fs/ext4/indirect.c                           |  2 +-
 fs/ext4/inode.c                              |  6 +--
 fs/ext4/ioctl.c                              |  4 +-
 fs/ext4/page-io.c                            | 13 ++---
 fs/ext4/super.c                              |  2 +-
 fs/jbd2/transaction.c                        |  8 +--
 fs/xfs/kmem.c                                | 19 +++----
 fs/xfs/xfs_buf.c                             | 14 +++---
 include/linux/gfp.h                          |  6 ++-
 17 files changed, 113 insertions(+), 99 deletions(-)

--
Signature


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/6] MM: Support __GFP_NOFAIL in  alloc_pages_bulk_*() and improve doco
  2021-09-17  2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown
  2021-09-17  2:56 ` [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() NeilBrown
  2021-09-17  2:56 ` [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() NeilBrown
@ 2021-09-17  2:56 ` NeilBrown
  2021-09-17 14:42   ` Mel Gorman
  2021-09-17  2:56 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2021-09-17  2:56 UTC (permalink / raw)
  To: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko,
	. Dave Chinner, Jonathan Corbet
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel, linux-doc

When alloc_pages_bulk_array() is called on an array that is partially
allocated, the level of effort to get a single page is less than when
the array was completely unallocated.  This behaviour is inconsistent,
but now fixed.  One effect if this is that __GFP_NOFAIL will not ensure
at least one page is allocated.

Also clarify the expected success rate.  __alloc_pages_bulk() will
allocated one page according to @gfp, and may allocate more if that can
be done cheaply.  It is assumed that the caller values cheap allocation
where possible and may decide to use what it has got, or to call again
to get more.

Acked-by: Mel Gorman <mgorman@suse.com>
Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator")
Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/page_alloc.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c274cf..aa51016e49c5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5191,6 +5191,11 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
  * is the maximum number of pages that will be stored in the array.
  *
  * Returns the number of pages on the list or array.
+ *
+ * At least one page will be allocated if that is possible while
+ * remaining consistent with @gfp.  Extra pages up to the requested
+ * total will be allocated opportunistically when doing so is
+ * significantly cheaper than having the caller repeat the request.
  */
 unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 			nodemask_t *nodemask, int nr_pages,
@@ -5292,7 +5297,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 								pcp, pcp_list);
 		if (unlikely(!page)) {
 			/* Try and get at least one page */
-			if (!nr_populated)
+			if (!nr_account)
 				goto failed_irq;
 			break;
 		}



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
  2021-09-17  2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown
                   ` (4 preceding siblings ...)
  2021-09-17  2:56 ` [PATCH 4/6] EXT4: remove congestion_wait from ext4_bio_write_page, and simplify NeilBrown
@ 2021-09-17  2:56 ` NeilBrown
  2021-10-05  9:20   ` Vlastimil Babka
  5 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2021-09-17  2:56 UTC (permalink / raw)
  To: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko,
	. Dave Chinner, Jonathan Corbet
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel, linux-doc

__GFP_NOFAIL is documented both in gfp.h and memory-allocation.rst.
The details are not entirely consistent.

This patch ensures both places state that:
 - there is a risk of deadlock with reclaim/writeback/oom-kill
 - it should only be used when there is no real alternative
 - it is preferable to an endless loop
 - it is strongly discourages for costly-order allocations.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 Documentation/core-api/memory-allocation.rst |   25 ++++++++++++++++++++++++-
 include/linux/gfp.h                          |    6 +++++-
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst
index 5954ddf6ee13..8ea077465446 100644
--- a/Documentation/core-api/memory-allocation.rst
+++ b/Documentation/core-api/memory-allocation.rst
@@ -126,7 +126,30 @@ or another request.
 
   * ``GFP_KERNEL | __GFP_NOFAIL`` - overrides the default allocator behavior
     and all allocation requests will loop endlessly until they succeed.
-    This might be really dangerous especially for larger orders.
+    Any attempt to use ``__GFP_NOFAIL`` for allocations larger than
+    order-1 (2 pages) will trigger a warning.
+
+    Use of ``__GFP_NOFAIL`` can cause deadlocks so it should only be used
+    when there is no alternative, and then should be used with caution.
+    Deadlocks can happen if the calling process holds any resources
+    (e.g. locks) which might be needed for memory reclaim or write-back,
+    or which might prevent a process killed by the OOM killer from
+    successfully exiting.  Where possible, locks should be released
+    before using ``__GFP_NOFAIL``.
+
+    While this flag is best avoided, it is still preferable to endless
+    loops around the allocator.  Endless loops may still be used when
+    there is a need to test for the process being killed
+    (fatal_signal_pending(current)).
+
+  * ``GFP_NOFS | __GFP_NOFAIL`` - Loop endlessly instead of failing
+    when performing allocations in file system code.  The same guidance
+    as for ``GFP_KERNEL | __GFP_NOFAIL`` applies with extra emphasis on
+    the possibility of deadlocks.  ``GFP_NOFS`` often implies that
+    filesystem locks are held which might lead to blocking reclaim.
+    Preemptively flushing or reclaiming memory associated with such
+    locks might be appropriate before requesting a ``__GFP_NOFAIL``
+    allocation.
 
 Selecting memory allocator
 ==========================
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 55b2ec1f965a..1d2a89e20b8b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -209,7 +209,11 @@ struct vm_area_struct;
  * used only when there is no reasonable failure policy) but it is
  * definitely preferable to use the flag rather than opencode endless
  * loop around allocator.
- * Using this flag for costly allocations is _highly_ discouraged.
+ * Use of this flag may lead to deadlocks if locks are held which would
+ * be needed for memory reclaim, write-back, or the timely exit of a
+ * process killed by the OOM-killer.  Dropping any locks not absolutely
+ * needed is advisable before requesting a %__GFP_NOFAIL allocate.
+ * Using this flag for costly allocations (order>1) is _highly_ discouraged.
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)
 #define __GFP_FS	((__force gfp_t)___GFP_FS)



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.
  2021-09-17  2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown
                   ` (2 preceding siblings ...)
  2021-09-17  2:56 ` [PATCH 1/6] MM: Support __GFP_NOFAIL in alloc_pages_bulk_*() and improve doco NeilBrown
@ 2021-09-17  2:56 ` NeilBrown
  2021-09-17  2:56 ` [PATCH 4/6] EXT4: remove congestion_wait from ext4_bio_write_page, and simplify NeilBrown
  2021-09-17  2:56 ` [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL NeilBrown
  5 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-09-17  2:56 UTC (permalink / raw)
  To: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko,
	. Dave Chinner, Jonathan Corbet
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel, linux-doc

Indefinite loops waiting for memory allocation are discouraged by
documentation in gfp.h which says the use of __GFP_NOFAIL that it

 is definitely preferable to use the flag rather than opencode endless
 loop around allocator.

Such loops that use congestion_wait() are particularly unwise as
congestion_wait() is indistinguishable from
schedule_timeout_uninterruptible() in practice - and should be
deprecated.

So this patch changes the two loops in ext4_ext_truncate() to use
__GFP_NOFAIL instead of looping.

As the allocation is multiple layers deeper in the call stack, this
requires passing the EXT4_EX_NOFAIL flag down and handling it in various
places.

__ext4_journal_start_sb() is given a new 'gfp_mask' argument which is
passed through to jbd2__journal_start() and always receives GFP_NOFS
except when called through ext4_journal_state_with_revoke from
ext4_ext_remove_space(), which now can include __GFP_NOFAIL.

jbd2__journal_start() is enhanced so that the gfp_t flags passed are
used for *all* allocations.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/ext4/ext4.h           |    2 +-
 fs/ext4/ext4_jbd2.c      |    4 ++-
 fs/ext4/ext4_jbd2.h      |   14 +++++++-----
 fs/ext4/extents.c        |   53 +++++++++++++++++++---------------------------
 fs/ext4/extents_status.c |   35 +++++++++++++++++-------------
 fs/ext4/extents_status.h |    2 +-
 fs/ext4/ialloc.c         |    3 ++-
 fs/ext4/indirect.c       |    2 +-
 fs/ext4/inode.c          |    6 +++--
 fs/ext4/ioctl.c          |    4 ++-
 fs/ext4/super.c          |    2 +-
 fs/jbd2/transaction.c    |    8 +++----
 12 files changed, 67 insertions(+), 68 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 90ff5acaf11f..52a34f5dfda2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3720,7 +3720,7 @@ extern int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 			       struct ext4_map_blocks *map, int flags);
 extern int ext4_ext_truncate(handle_t *, struct inode *);
 extern int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
-				 ext4_lblk_t end);
+				 ext4_lblk_t end, int nofail);
 extern void ext4_ext_init(struct super_block *);
 extern void ext4_ext_release(struct super_block *);
 extern long ext4_fallocate(struct file *file, int mode, loff_t offset,
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 6def7339056d..780fde556dc0 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -88,7 +88,7 @@ static int ext4_journal_check_start(struct super_block *sb)
 
 handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
 				  int type, int blocks, int rsv_blocks,
-				  int revoke_creds)
+				  int revoke_creds, gfp_t gfp_mask)
 {
 	journal_t *journal;
 	int err;
@@ -103,7 +103,7 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
 	if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
 		return ext4_get_nojournal();
 	return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds,
-				   GFP_NOFS, type, line);
+				   gfp_mask, type, line);
 }
 
 int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle)
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 0e4fa644df01..da79be8ffff4 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -263,7 +263,7 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
 
 handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
 				  int type, int blocks, int rsv_blocks,
-				  int revoke_creds);
+				  int revoke_creds, gfp_t gfp_mask);
 int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle);
 
 #define EXT4_NOJOURNAL_MAX_REF_COUNT ((unsigned long) 4096)
@@ -304,7 +304,8 @@ static inline int ext4_trans_default_revoke_credits(struct super_block *sb)
 
 #define ext4_journal_start_sb(sb, type, nblocks)			\
 	__ext4_journal_start_sb((sb), __LINE__, (type), (nblocks), 0,	\
-				ext4_trans_default_revoke_credits(sb))
+				ext4_trans_default_revoke_credits(sb),	\
+				GFP_NOFS)
 
 #define ext4_journal_start(inode, type, nblocks)			\
 	__ext4_journal_start((inode), __LINE__, (type), (nblocks), 0,	\
@@ -314,9 +315,9 @@ static inline int ext4_trans_default_revoke_credits(struct super_block *sb)
 	__ext4_journal_start((inode), __LINE__, (type), (blocks), (rsv_blocks),\
 			     ext4_trans_default_revoke_credits((inode)->i_sb))
 
-#define ext4_journal_start_with_revoke(inode, type, blocks, revoke_creds) \
-	__ext4_journal_start((inode), __LINE__, (type), (blocks), 0,	\
-			     (revoke_creds))
+#define ext4_journal_start_with_revoke(gfp_flags, inode, type, blocks, revoke_creds) \
+	__ext4_journal_start_sb((inode)->i_sb, __LINE__, (type), (blocks),\
+				0, (revoke_creds), (gfp_flags))
 
 static inline handle_t *__ext4_journal_start(struct inode *inode,
 					     unsigned int line, int type,
@@ -324,7 +325,8 @@ static inline handle_t *__ext4_journal_start(struct inode *inode,
 					     int revoke_creds)
 {
 	return __ext4_journal_start_sb(inode->i_sb, line, type, blocks,
-				       rsv_blocks, revoke_creds);
+				       rsv_blocks, revoke_creds,
+				       GFP_NOFS);
 }
 
 #define ext4_journal_stop(handle) \
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c0de30f25185..c6c16aa8b618 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1488,7 +1488,7 @@ static int ext4_ext_search_left(struct inode *inode,
 static int ext4_ext_search_right(struct inode *inode,
 				 struct ext4_ext_path *path,
 				 ext4_lblk_t *logical, ext4_fsblk_t *phys,
-				 struct ext4_extent *ret_ex)
+				 struct ext4_extent *ret_ex, int nofail)
 {
 	struct buffer_head *bh = NULL;
 	struct ext4_extent_header *eh;
@@ -1565,7 +1565,7 @@ static int ext4_ext_search_right(struct inode *inode,
 	while (++depth < path->p_depth) {
 		/* subtract from p_depth to get proper eh_depth */
 		bh = read_extent_tree_block(inode, block,
-					    path->p_depth - depth, 0);
+					    path->p_depth - depth, nofail);
 		if (IS_ERR(bh))
 			return PTR_ERR(bh);
 		eh = ext_block_hdr(bh);
@@ -1574,7 +1574,7 @@ static int ext4_ext_search_right(struct inode *inode,
 		put_bh(bh);
 	}
 
-	bh = read_extent_tree_block(inode, block, path->p_depth - depth, 0);
+	bh = read_extent_tree_block(inode, block, path->p_depth - depth, nofail);
 	if (IS_ERR(bh))
 		return PTR_ERR(bh);
 	eh = ext_block_hdr(bh);
@@ -2773,7 +2773,7 @@ ext4_ext_more_to_rm(struct ext4_ext_path *path)
 }
 
 int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
-			  ext4_lblk_t end)
+			  ext4_lblk_t end, int nofail)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	int depth = ext_depth(inode);
@@ -2781,6 +2781,10 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 	struct partial_cluster partial;
 	handle_t *handle;
 	int i = 0, err = 0;
+	gfp_t gfp_mask = GFP_NOFS;
+
+	if (nofail)
+		gfp_mask |= __GFP_NOFAIL;
 
 	partial.pclu = 0;
 	partial.lblk = 0;
@@ -2789,7 +2793,8 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 	ext_debug(inode, "truncate since %u to %u\n", start, end);
 
 	/* probably first extent we're gonna free will be last in block */
-	handle = ext4_journal_start_with_revoke(inode, EXT4_HT_TRUNCATE,
+	handle = ext4_journal_start_with_revoke(gfp_mask,
+			inode, EXT4_HT_TRUNCATE,
 			depth + 1,
 			ext4_free_metadata_revoke_credits(inode->i_sb, depth));
 	if (IS_ERR(handle))
@@ -2877,7 +2882,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 			 */
 			lblk = ex_end + 1;
 			err = ext4_ext_search_right(inode, path, &lblk, &pblk,
-						    NULL);
+						    NULL, nofail);
 			if (err < 0)
 				goto out;
 			if (pblk) {
@@ -2899,10 +2904,6 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 	} else {
 		path = kcalloc(depth + 1, sizeof(struct ext4_ext_path),
 			       GFP_NOFS | __GFP_NOFAIL);
-		if (path == NULL) {
-			ext4_journal_stop(handle);
-			return -ENOMEM;
-		}
 		path[0].p_maxdepth = path[0].p_depth = depth;
 		path[0].p_hdr = ext_inode_hdr(inode);
 		i = 0;
@@ -2955,7 +2956,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 			memset(path + i + 1, 0, sizeof(*path));
 			bh = read_extent_tree_block(inode,
 				ext4_idx_pblock(path[i].p_idx), depth - i - 1,
-				EXT4_EX_NOCACHE);
+				EXT4_EX_NOCACHE | nofail);
 			if (IS_ERR(bh)) {
 				/* should we reset i_size? */
 				err = PTR_ERR(bh);
@@ -4186,7 +4187,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	if (err)
 		goto out;
 	ar.lright = map->m_lblk;
-	err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright, &ex2);
+	err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright, &ex2, 0);
 	if (err < 0)
 		goto out;
 
@@ -4368,23 +4369,13 @@ int ext4_ext_truncate(handle_t *handle, struct inode *inode)
 
 	last_block = (inode->i_size + sb->s_blocksize - 1)
 			>> EXT4_BLOCK_SIZE_BITS(sb);
-retry:
 	err = ext4_es_remove_extent(inode, last_block,
-				    EXT_MAX_BLOCKS - last_block);
-	if (err == -ENOMEM) {
-		cond_resched();
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
-		goto retry;
-	}
+				    EXT_MAX_BLOCKS - last_block,
+				    EXT4_EX_NOFAIL);
 	if (err)
 		return err;
-retry_remove_space:
-	err = ext4_ext_remove_space(inode, last_block, EXT_MAX_BLOCKS - 1);
-	if (err == -ENOMEM) {
-		cond_resched();
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
-		goto retry_remove_space;
-	}
+	err = ext4_ext_remove_space(inode, last_block, EXT_MAX_BLOCKS - 1,
+				    EXT4_EX_NOFAIL);
 	return err;
 }
 
@@ -5322,13 +5313,13 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 	ext4_discard_preallocations(inode, 0);
 
 	ret = ext4_es_remove_extent(inode, punch_start,
-				    EXT_MAX_BLOCKS - punch_start);
+				    EXT_MAX_BLOCKS - punch_start, 0);
 	if (ret) {
 		up_write(&EXT4_I(inode)->i_data_sem);
 		goto out_stop;
 	}
 
-	ret = ext4_ext_remove_space(inode, punch_start, punch_stop - 1);
+	ret = ext4_ext_remove_space(inode, punch_start, punch_stop - 1, 0);
 	if (ret) {
 		up_write(&EXT4_I(inode)->i_data_sem);
 		goto out_stop;
@@ -5510,7 +5501,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
 	}
 
 	ret = ext4_es_remove_extent(inode, offset_lblk,
-			EXT_MAX_BLOCKS - offset_lblk);
+				    EXT_MAX_BLOCKS - offset_lblk, 0);
 	if (ret) {
 		up_write(&EXT4_I(inode)->i_data_sem);
 		goto out_stop;
@@ -5574,10 +5565,10 @@ ext4_swap_extents(handle_t *handle, struct inode *inode1,
 	BUG_ON(!inode_is_locked(inode1));
 	BUG_ON(!inode_is_locked(inode2));
 
-	*erp = ext4_es_remove_extent(inode1, lblk1, count);
+	*erp = ext4_es_remove_extent(inode1, lblk1, count, 0);
 	if (unlikely(*erp))
 		return 0;
-	*erp = ext4_es_remove_extent(inode2, lblk2, count);
+	*erp = ext4_es_remove_extent(inode2, lblk2, count, 0);
 	if (unlikely(*erp))
 		return 0;
 
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 9a3a8996aacf..7f7711a2ea44 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -144,9 +144,10 @@
 static struct kmem_cache *ext4_es_cachep;
 static struct kmem_cache *ext4_pending_cachep;
 
-static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
+static int __es_insert_extent(struct inode *inode, struct extent_status *newes,
+			      int nofail);
 static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-			      ext4_lblk_t end, int *reserved);
+			      ext4_lblk_t end, int *reserved, int nofail);
 static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan);
 static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
 		       struct ext4_inode_info *locked_ei);
@@ -452,10 +453,11 @@ static void ext4_es_list_del(struct inode *inode)
 
 static struct extent_status *
 ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
-		     ext4_fsblk_t pblk)
+		     ext4_fsblk_t pblk, int nofail)
 {
 	struct extent_status *es;
-	es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
+	es = kmem_cache_alloc(ext4_es_cachep,
+			      GFP_ATOMIC | (nofail ? __GFP_NOFAIL : 0));
 	if (es == NULL)
 		return NULL;
 	es->es_lblk = lblk;
@@ -754,7 +756,8 @@ static inline void ext4_es_insert_extent_check(struct inode *inode,
 }
 #endif
 
-static int __es_insert_extent(struct inode *inode, struct extent_status *newes)
+static int __es_insert_extent(struct inode *inode, struct extent_status *newes,
+			      int nofail)
 {
 	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
 	struct rb_node **p = &tree->root.rb_node;
@@ -795,7 +798,7 @@ static int __es_insert_extent(struct inode *inode, struct extent_status *newes)
 	}
 
 	es = ext4_es_alloc_extent(inode, newes->es_lblk, newes->es_len,
-				  newes->es_pblk);
+				  newes->es_pblk, nofail);
 	if (!es)
 		return -ENOMEM;
 	rb_link_node(&es->rb_node, parent, p);
@@ -848,11 +851,11 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	ext4_es_insert_extent_check(inode, &newes);
 
 	write_lock(&EXT4_I(inode)->i_es_lock);
-	err = __es_remove_extent(inode, lblk, end, NULL);
+	err = __es_remove_extent(inode, lblk, end, NULL, 0);
 	if (err != 0)
 		goto error;
 retry:
-	err = __es_insert_extent(inode, &newes);
+	err = __es_insert_extent(inode, &newes, 0);
 	if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb),
 					  128, EXT4_I(inode)))
 		goto retry;
@@ -902,7 +905,7 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
 
 	es = __es_tree_search(&EXT4_I(inode)->i_es_tree.root, lblk);
 	if (!es || es->es_lblk > end)
-		__es_insert_extent(inode, &newes);
+		__es_insert_extent(inode, &newes, 0);
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 }
 
@@ -1294,6 +1297,7 @@ static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end,
  * @lblk - first block in range
  * @end - last block in range
  * @reserved - number of cluster reservations released
+ * @nofail - EXT4_EX_NOFAIL if __GFP_NOFAIL should be used
  *
  * If @reserved is not NULL and delayed allocation is enabled, counts
  * block/cluster reservations freed by removing range and if bigalloc
@@ -1301,7 +1305,7 @@ static unsigned int get_rsvd(struct inode *inode, ext4_lblk_t end,
  * error code on failure.
  */
 static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-			      ext4_lblk_t end, int *reserved)
+			      ext4_lblk_t end, int *reserved, int nofail)
 {
 	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
 	struct rb_node *node;
@@ -1350,7 +1354,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 					orig_es.es_len - len2;
 			ext4_es_store_pblock_status(&newes, block,
 						    ext4_es_status(&orig_es));
-			err = __es_insert_extent(inode, &newes);
+			err = __es_insert_extent(inode, &newes, nofail);
 			if (err) {
 				es->es_lblk = orig_es.es_lblk;
 				es->es_len = orig_es.es_len;
@@ -1426,12 +1430,13 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
  * @inode - file containing range
  * @lblk - first block in range
  * @len - number of blocks to remove
+ * @nofail - EXT4_EX_NOFAIL if __GFP_NOFAIL should be used
  *
  * Reduces block/cluster reservation count and for bigalloc cancels pending
  * reservations as needed. Returns 0 on success, error code on failure.
  */
 int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-			  ext4_lblk_t len)
+			  ext4_lblk_t len, int nofail)
 {
 	ext4_lblk_t end;
 	int err = 0;
@@ -1456,7 +1461,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	 * is reclaimed.
 	 */
 	write_lock(&EXT4_I(inode)->i_es_lock);
-	err = __es_remove_extent(inode, lblk, end, &reserved);
+	err = __es_remove_extent(inode, lblk, end, &reserved, nofail);
 	write_unlock(&EXT4_I(inode)->i_es_lock);
 	ext4_es_print_tree(inode);
 	ext4_da_release_space(inode, reserved);
@@ -2003,11 +2008,11 @@ int ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 
 	write_lock(&EXT4_I(inode)->i_es_lock);
 
-	err = __es_remove_extent(inode, lblk, lblk, NULL);
+	err = __es_remove_extent(inode, lblk, lblk, NULL, 0);
 	if (err != 0)
 		goto error;
 retry:
-	err = __es_insert_extent(inode, &newes);
+	err = __es_insert_extent(inode, &newes, 0);
 	if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb),
 					  128, EXT4_I(inode)))
 		goto retry;
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 4ec30a798260..23d77094a165 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -134,7 +134,7 @@ extern void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
 				 ext4_lblk_t len, ext4_fsblk_t pblk,
 				 unsigned int status);
 extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
-				 ext4_lblk_t len);
+				 ext4_lblk_t len, int nofail);
 extern void ext4_es_find_extent_range(struct inode *inode,
 				      int (*match_fn)(struct extent_status *es),
 				      ext4_lblk_t lblk, ext4_lblk_t end,
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index f73e5eb43eae..82049fb94314 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1079,7 +1079,8 @@ struct inode *__ext4_new_inode(struct user_namespace *mnt_userns,
 			BUG_ON(nblocks <= 0);
 			handle = __ext4_journal_start_sb(dir->i_sb, line_no,
 				 handle_type, nblocks, 0,
-				 ext4_trans_default_revoke_credits(sb));
+				 ext4_trans_default_revoke_credits(sb),
+				 GFP_NOFS);
 			if (IS_ERR(handle)) {
 				err = PTR_ERR(handle);
 				ext4_std_error(sb, err);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 89efa78ed4b2..910e87aea7be 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -1125,7 +1125,7 @@ void ext4_ind_truncate(handle_t *handle, struct inode *inode)
 			return;
 	}
 
-	ext4_es_remove_extent(inode, last_block, EXT_MAX_BLOCKS - last_block);
+	ext4_es_remove_extent(inode, last_block, EXT_MAX_BLOCKS - last_block, 0);
 
 	/*
 	 * The orphan list entry will now protect us from any crash which
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d18852d6029c..24246043d94b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1575,7 +1575,7 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd,
 		ext4_lblk_t start, last;
 		start = index << (PAGE_SHIFT - inode->i_blkbits);
 		last = end << (PAGE_SHIFT - inode->i_blkbits);
-		ext4_es_remove_extent(inode, start, last - start + 1);
+		ext4_es_remove_extent(inode, start, last - start + 1, 0);
 	}
 
 	pagevec_init(&pvec);
@@ -4109,7 +4109,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 		ext4_discard_preallocations(inode, 0);
 
 		ret = ext4_es_remove_extent(inode, first_block,
-					    stop_block - first_block);
+					    stop_block - first_block, 0);
 		if (ret) {
 			up_write(&EXT4_I(inode)->i_data_sem);
 			goto out_stop;
@@ -4117,7 +4117,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 
 		if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 			ret = ext4_ext_remove_space(inode, first_block,
-						    stop_block - 1);
+						    stop_block - 1, 0);
 		else
 			ret = ext4_ind_remove_space(handle, inode, first_block,
 						    stop_block);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 606dee9e08a3..e4de05a6b976 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -79,8 +79,8 @@ static void swap_inode_data(struct inode *inode1, struct inode *inode2)
 		(ei1->i_flags & ~EXT4_FL_SHOULD_SWAP);
 	ei2->i_flags = tmp | (ei2->i_flags & ~EXT4_FL_SHOULD_SWAP);
 	swap(ei1->i_disksize, ei2->i_disksize);
-	ext4_es_remove_extent(inode1, 0, EXT_MAX_BLOCKS);
-	ext4_es_remove_extent(inode2, 0, EXT_MAX_BLOCKS);
+	ext4_es_remove_extent(inode1, 0, EXT_MAX_BLOCKS, 0);
+	ext4_es_remove_extent(inode2, 0, EXT_MAX_BLOCKS, 0);
 
 	isize = i_size_read(inode1);
 	i_size_write(inode1, i_size_read(inode2));
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0775950ee84e..947e8376a35a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1393,7 +1393,7 @@ void ext4_clear_inode(struct inode *inode)
 	invalidate_inode_buffers(inode);
 	clear_inode(inode);
 	ext4_discard_preallocations(inode, 0);
-	ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS);
+	ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS, 0);
 	dquot_drop(inode);
 	if (EXT4_I(inode)->jinode) {
 		jbd2_journal_release_jbd_inode(EXT4_JOURNAL(inode),
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 6a3caedd2285..23e0f003d43b 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -476,9 +476,9 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 }
 
 /* Allocate a new handle.  This should probably be in a slab... */
-static handle_t *new_handle(int nblocks)
+static handle_t *new_handle(int nblocks, gfp_t gfp)
 {
-	handle_t *handle = jbd2_alloc_handle(GFP_NOFS);
+	handle_t *handle = jbd2_alloc_handle(gfp);
 	if (!handle)
 		return NULL;
 	handle->h_total_credits = nblocks;
@@ -505,13 +505,13 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
 
 	nblocks += DIV_ROUND_UP(revoke_records,
 				journal->j_revoke_records_per_block);
-	handle = new_handle(nblocks);
+	handle = new_handle(nblocks, gfp_mask);
 	if (!handle)
 		return ERR_PTR(-ENOMEM);
 	if (rsv_blocks) {
 		handle_t *rsv_handle;
 
-		rsv_handle = new_handle(rsv_blocks);
+		rsv_handle = new_handle(rsv_blocks, gfp_mask);
 		if (!rsv_handle) {
 			jbd2_free_handle(handle);
 			return ERR_PTR(-ENOMEM);



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 4/6] EXT4: remove congestion_wait from ext4_bio_write_page, and simplify
  2021-09-17  2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown
                   ` (3 preceding siblings ...)
  2021-09-17  2:56 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown
@ 2021-09-17  2:56 ` NeilBrown
  2021-09-17  2:56 ` [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL NeilBrown
  5 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-09-17  2:56 UTC (permalink / raw)
  To: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko,
	. Dave Chinner, Jonathan Corbet
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel, linux-doc

congestion_wait() is indistinguishable from
schedule_timeout_uninterruptible().  It is best avoided and should be
deprecated.

It is not needed in ext4_bio_write_page().  There are two cases.
If there are no ->io_bio yet, then it is appropriate to use __GFP_NOFAIL
which does the waiting in a better place.  The code already uses this
flag on the second attempt.  This patch changes to it always use that
flag for this case.

If there *are* ->io_bio (in which case the allocation was non-blocking)
we submit the io and return the first case.  No waiting is needed in
this case.

So remove the congestion_wait() call, and simplify the code so that the
two cases are somewhat clearer.

Remove the "if (io->io_bio)" before calling ext4_io_submit() as that
test is performed internally by that function.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/ext4/page-io.c |   13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index f038d578d8d8..3b6ece0d3ad6 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -506,7 +506,7 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 	 * can't happen in the common case of blocksize == PAGE_SIZE.
 	 */
 	if (fscrypt_inode_uses_fs_layer_crypto(inode) && nr_to_submit) {
-		gfp_t gfp_flags = GFP_NOFS;
+		gfp_t gfp_flags;
 		unsigned int enc_bytes = round_up(len, i_blocksize(inode));
 
 		/*
@@ -514,21 +514,18 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 		 * a waiting mask (i.e. request guaranteed allocation) on the
 		 * first page of the bio.  Otherwise it can deadlock.
 		 */
+	retry_encrypt:
 		if (io->io_bio)
 			gfp_flags = GFP_NOWAIT | __GFP_NOWARN;
-	retry_encrypt:
+		else
+			gfp_flags = GFP_NOFS | __GFP_NOFAIL;
 		bounce_page = fscrypt_encrypt_pagecache_blocks(page, enc_bytes,
 							       0, gfp_flags);
 		if (IS_ERR(bounce_page)) {
 			ret = PTR_ERR(bounce_page);
 			if (ret == -ENOMEM &&
 			    (io->io_bio || wbc->sync_mode == WB_SYNC_ALL)) {
-				gfp_flags = GFP_NOFS;
-				if (io->io_bio)
-					ext4_io_submit(io);
-				else
-					gfp_flags |= __GFP_NOFAIL;
-				congestion_wait(BLK_RW_ASYNC, HZ/50);
+				ext4_io_submit(io);
 				goto retry_encrypt;
 			}
 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc()
  2021-09-17  2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown
  2021-09-17  2:56 ` [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() NeilBrown
@ 2021-09-17  2:56 ` NeilBrown
  2021-09-17 21:45   ` Dave Chinner
  2021-09-17  2:56 ` [PATCH 1/6] MM: Support __GFP_NOFAIL in alloc_pages_bulk_*() and improve doco NeilBrown
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2021-09-17  2:56 UTC (permalink / raw)
  To: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko,
	. Dave Chinner, Jonathan Corbet
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel, linux-doc

Documentation commment in gfp.h discourages indefinite retry loops on
ENOMEM and says of __GFP_NOFAIL that it

    is definitely preferable to use the flag rather than opencode
    endless loop around allocator.

So remove the loop, instead specifying __GFP_NOFAIL if KM_MAYFAIL was
not given.

As we no longer have the opportunity to report a warning after some
failures, clear __GFP_NOWARN so that the default warning (rate-limited
to 1 ever 10 seconds) will be reported instead.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/xfs/kmem.c |   19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 6f49bf39183c..575a58e61391 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -11,21 +11,14 @@
 void *
 kmem_alloc(size_t size, xfs_km_flags_t flags)
 {
-	int	retries = 0;
 	gfp_t	lflags = kmem_flags_convert(flags);
-	void	*ptr;
 
 	trace_kmem_alloc(size, flags, _RET_IP_);
 
-	do {
-		ptr = kmalloc(size, lflags);
-		if (ptr || (flags & KM_MAYFAIL))
-			return ptr;
-		if (!(++retries % 100))
-			xfs_err(NULL,
-	"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
-				current->comm, current->pid,
-				(unsigned int)size, __func__, lflags);
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
-	} while (1);
+	if (!(flags & KM_MAYFAIL)) {
+		lflags |= __GFP_NOFAIL;
+		lflags &= ~__GFP_NOWARN;
+	}
+
+	return kmalloc(size, lflags);
 }



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages()
  2021-09-17  2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown
@ 2021-09-17  2:56 ` NeilBrown
  2021-09-17  2:56 ` [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() NeilBrown
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-09-17  2:56 UTC (permalink / raw)
  To: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko,
	. Dave Chinner, Jonathan Corbet
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel, linux-doc

Documentation commment in gfp.h discourages indefinite retry loops on
ENOMEM and says of __GFP_NOFAIL that it

    is definitely preferable to use the flag rather than opencode
    endless loop around allocator.

congestion_wait() is indistinguishable from
schedule_timeout_uninterruptible() in practice and it is not a good way
to wait for memory to become available.

So add __GFP_NOFAIL to gfp if failure is not an option, and remove the
congestion_wait().  We now only loop when failure is an option, and
alloc_bulk_pages_array() made some progres, but not enough.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/xfs/xfs_buf.c |   14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 5fa6cd947dd4..b19ab52c551b 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -352,7 +352,7 @@ xfs_buf_alloc_pages(
 	if (flags & XBF_READ_AHEAD)
 		gfp_mask |= __GFP_NORETRY;
 	else
-		gfp_mask |= GFP_NOFS;
+		gfp_mask |= GFP_NOFS | __GFP_NOFAIL;
 
 	/* Make sure that we have a page list */
 	bp->b_page_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
@@ -372,8 +372,9 @@ xfs_buf_alloc_pages(
 
 	/*
 	 * Bulk filling of pages can take multiple calls. Not filling the entire
-	 * array is not an allocation failure, so don't back off if we get at
-	 * least one extra page.
+	 * array is not an allocation failure but is worth counting in
+	 * xb_pages_retries statistics.  If we don't even get one page,
+	 * then this must be a READ_AHEAD and we should abort.
 	 */
 	for (;;) {
 		long	last = filled;
@@ -385,16 +386,13 @@ xfs_buf_alloc_pages(
 			break;
 		}
 
-		if (filled != last)
-			continue;
-
-		if (flags & XBF_READ_AHEAD) {
+		if (filled == last) {
+			ASSERT(flags & XBF_READ_AHEAD);
 			xfs_buf_free_pages(bp);
 			return -ENOMEM;
 		}
 
 		XFS_STATS_INC(bp->b_mount, xb_page_retries);
-		congestion_wait(BLK_RW_ASYNC, HZ / 50);
 	}
 	return 0;
 }



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/6] MM: Support __GFP_NOFAIL in  alloc_pages_bulk_*() and improve doco
  2021-09-17  2:56 ` [PATCH 1/6] MM: Support __GFP_NOFAIL in alloc_pages_bulk_*() and improve doco NeilBrown
@ 2021-09-17 14:42   ` Mel Gorman
  2021-09-20 23:48     ` NeilBrown
  2021-10-05  9:16     ` Vlastimil Babka
  0 siblings, 2 replies; 25+ messages in thread
From: Mel Gorman @ 2021-09-17 14:42 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Michal Hocko,
	Jesper Dangaard Brouer, . Dave Chinner, Jonathan Corbet,
	linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel, linux-doc

I'm top-posting to cc Jesper with full context of the patch. I don't
have a problem with this patch other than the Fixes: being a bit
marginal, I should have acked as Mel Gorman <mgorman@suse.de> and the
@gfp in the comment should have been @gfp_mask.

However, an assumption the API design made was that it should fail fast
if memory is not quickly available but have at least one page in the
array. I don't think the network use case cares about the situation where
the array is already populated but I'd like Jesper to have the opportunity
to think about it.  It's possible he would prefer it's explicit and the
check becomes
(!nr_populated || ((gfp_mask & __GFP_NOFAIL) && !nr_account)) to
state that __GFP_NOFAIL users are willing to take a potential latency
penalty if the array is already partially populated but !__GFP_NOFAIL
users would prefer fail-fast behaviour. I'm on the fence because while
I wrote the implementation, it was based on other peoples requirements.

On Fri, Sep 17, 2021 at 12:56:57PM +1000, NeilBrown wrote:
> When alloc_pages_bulk_array() is called on an array that is partially
> allocated, the level of effort to get a single page is less than when
> the array was completely unallocated.  This behaviour is inconsistent,
> but now fixed.  One effect if this is that __GFP_NOFAIL will not ensure
> at least one page is allocated.
> 
> Also clarify the expected success rate.  __alloc_pages_bulk() will
> allocated one page according to @gfp, and may allocate more if that can
> be done cheaply.  It is assumed that the caller values cheap allocation
> where possible and may decide to use what it has got, or to call again
> to get more.
> 
> Acked-by: Mel Gorman <mgorman@suse.com>
> Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator")
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  mm/page_alloc.c |    7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b37435c274cf..aa51016e49c5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5191,6 +5191,11 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>   * is the maximum number of pages that will be stored in the array.
>   *
>   * Returns the number of pages on the list or array.
> + *
> + * At least one page will be allocated if that is possible while
> + * remaining consistent with @gfp.  Extra pages up to the requested
> + * total will be allocated opportunistically when doing so is
> + * significantly cheaper than having the caller repeat the request.
>   */
>  unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  			nodemask_t *nodemask, int nr_pages,
> @@ -5292,7 +5297,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  								pcp, pcp_list);
>  		if (unlikely(!page)) {
>  			/* Try and get at least one page */
> -			if (!nr_populated)
> +			if (!nr_account)
>  				goto failed_irq;
>  			break;
>  		}
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc()
  2021-09-17  2:56 ` [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() NeilBrown
@ 2021-09-17 21:45   ` Dave Chinner
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2021-09-17 21:45 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko,
	Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs,
	linux-mm, linux-kernel, linux-doc

On Fri, Sep 17, 2021 at 12:56:57PM +1000, NeilBrown wrote:
> Documentation commment in gfp.h discourages indefinite retry loops on
> ENOMEM and says of __GFP_NOFAIL that it
> 
>     is definitely preferable to use the flag rather than opencode
>     endless loop around allocator.
> 
> So remove the loop, instead specifying __GFP_NOFAIL if KM_MAYFAIL was
> not given.
> 
> As we no longer have the opportunity to report a warning after some
> failures, clear __GFP_NOWARN so that the default warning (rate-limited
> to 1 ever 10 seconds) will be reported instead.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/xfs/kmem.c |   19 ++++++-------------
>  1 file changed, 6 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 6f49bf39183c..575a58e61391 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -11,21 +11,14 @@
>  void *
>  kmem_alloc(size_t size, xfs_km_flags_t flags)
>  {
> -	int	retries = 0;
>  	gfp_t	lflags = kmem_flags_convert(flags);
> -	void	*ptr;
>  
>  	trace_kmem_alloc(size, flags, _RET_IP_);
>  
> -	do {
> -		ptr = kmalloc(size, lflags);
> -		if (ptr || (flags & KM_MAYFAIL))
> -			return ptr;
> -		if (!(++retries % 100))
> -			xfs_err(NULL,
> -	"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
> -				current->comm, current->pid,
> -				(unsigned int)size, __func__, lflags);
> -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> -	} while (1);
> +	if (!(flags & KM_MAYFAIL)) {
> +		lflags |= __GFP_NOFAIL;
> +		lflags &= ~__GFP_NOWARN;
> +	}

This logic should really be in kmem_flags_convert() where the gfp
flags are set up. kmem_flags_convert() is only called by
kmem_alloc() now so you should just be able to hack that logic
to do exactly what is necessary.

FWIW, We've kinda not been caring about warts in this code because
the next step for kmem_alloc is to remove kmem_alloc/kmem_zalloc
completely and replace all the callers with kmalloc/kzalloc being
passed the correct gfp flags. There's about 30 kmem_alloc() callers
and 45 kmem_zalloc() calls left to be converted before kmem.[ch] can
go away completely....

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/6] MM: Support __GFP_NOFAIL in  alloc_pages_bulk_*() and improve doco
  2021-09-17 14:42   ` Mel Gorman
@ 2021-09-20 23:48     ` NeilBrown
  2021-10-05  9:16     ` Vlastimil Babka
  1 sibling, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-09-20 23:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Michal Hocko,
	Jesper Dangaard Brouer, Dave Chinner, Jonathan Corbet, linux-xfs,
	linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel,
	linux-doc

On Sat, 18 Sep 2021, Mel Gorman wrote:
> I'm top-posting to cc Jesper with full context of the patch. I don't
> have a problem with this patch other than the Fixes: being a bit
> marginal, I should have acked as Mel Gorman <mgorman@suse.de> and the
> @gfp in the comment should have been @gfp_mask.
> 
> However, an assumption the API design made was that it should fail fast
> if memory is not quickly available but have at least one page in the
> array. I don't think the network use case cares about the situation where
> the array is already populated but I'd like Jesper to have the opportunity
> to think about it.  It's possible he would prefer it's explicit and the
> check becomes
> (!nr_populated || ((gfp_mask & __GFP_NOFAIL) && !nr_account)) to
> state that __GFP_NOFAIL users are willing to take a potential latency
> penalty if the array is already partially populated but !__GFP_NOFAIL
> users would prefer fail-fast behaviour. I'm on the fence because while
> I wrote the implementation, it was based on other peoples requirements.

I can see that it could be desirable to not try too hard when we already
have pages allocated, but maybe the best way to achieve that is for the
called to clear __GFP_RECLAIM in that case.

Alternately, callers that really want the __GFP_RECLAIM and __GFP_NOFAIL
flags to be honoured could ensure that the array passed in is empty.
That wouldn't be difficult (for current callers).

In either case, the documentation should make it clear which flags are
honoured when.

Let's see what Jesper has to say.

Thanks,
NeilBrown


> 
> On Fri, Sep 17, 2021 at 12:56:57PM +1000, NeilBrown wrote:
> > When alloc_pages_bulk_array() is called on an array that is partially
> > allocated, the level of effort to get a single page is less than when
> > the array was completely unallocated.  This behaviour is inconsistent,
> > but now fixed.  One effect if this is that __GFP_NOFAIL will not ensure
> > at least one page is allocated.
> > 
> > Also clarify the expected success rate.  __alloc_pages_bulk() will
> > allocated one page according to @gfp, and may allocate more if that can
> > be done cheaply.  It is assumed that the caller values cheap allocation
> > where possible and may decide to use what it has got, or to call again
> > to get more.
> > 
> > Acked-by: Mel Gorman <mgorman@suse.com>
> > Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator")
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  mm/page_alloc.c |    7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index b37435c274cf..aa51016e49c5 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5191,6 +5191,11 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> >   * is the maximum number of pages that will be stored in the array.
> >   *
> >   * Returns the number of pages on the list or array.
> > + *
> > + * At least one page will be allocated if that is possible while
> > + * remaining consistent with @gfp.  Extra pages up to the requested
> > + * total will be allocated opportunistically when doing so is
> > + * significantly cheaper than having the caller repeat the request.
> >   */
> >  unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> >  			nodemask_t *nodemask, int nr_pages,
> > @@ -5292,7 +5297,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> >  								pcp, pcp_list);
> >  		if (unlikely(!page)) {
> >  			/* Try and get at least one page */
> > -			if (!nr_populated)
> > +			if (!nr_account)
> >  				goto failed_irq;
> >  			break;
> >  		}
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/6] MM: Support __GFP_NOFAIL in alloc_pages_bulk_*() and improve doco
  2021-09-17 14:42   ` Mel Gorman
  2021-09-20 23:48     ` NeilBrown
@ 2021-10-05  9:16     ` Vlastimil Babka
  1 sibling, 0 replies; 25+ messages in thread
From: Vlastimil Babka @ 2021-10-05  9:16 UTC (permalink / raw)
  To: Mel Gorman, NeilBrown
  Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Michal Hocko,
	Jesper Dangaard Brouer, . Dave Chinner, Jonathan Corbet,
	linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel, linux-doc

On 9/17/21 16:42, Mel Gorman wrote:
> I'm top-posting to cc Jesper with full context of the patch. I don't
> have a problem with this patch other than the Fixes: being a bit
> marginal, I should have acked as Mel Gorman <mgorman@suse.de> and the
> @gfp in the comment should have been @gfp_mask.
> 
> However, an assumption the API design made was that it should fail fast
> if memory is not quickly available but have at least one page in the
> array. I don't think the network use case cares about the situation where
> the array is already populated but I'd like Jesper to have the opportunity
> to think about it.  It's possible he would prefer it's explicit and the
> check becomes
> (!nr_populated || ((gfp_mask & __GFP_NOFAIL) && !nr_account)) to

Note that AFAICS nr_populated is an incomplete piece of information, as we
initially only count pages in the page_array as nr_populated up to the first
NULL pointer. So even before Neil's patch we could decide to allocate even
if there are pre-existing pages, but placed later in the array. Which could
be rather common if the array consumer starts from index 0? So with Neil's
patch this at least becomes consistent, while the check suggested by Mel
leaves there the weird dependency on where pre-existing pages appear in the
page_array.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
  2021-09-17  2:56 ` [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL NeilBrown
@ 2021-10-05  9:20   ` Vlastimil Babka
  2021-10-05 11:09     ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Vlastimil Babka @ 2021-10-05  9:20 UTC (permalink / raw)
  To: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, Michal Hocko,
	. Dave Chinner, Jonathan Corbet
  Cc: linux-xfs, linux-ext4, linux-fsdevel, linux-nfs, linux-mm,
	linux-kernel, linux-doc

On 9/17/21 04:56, NeilBrown wrote:
> __GFP_NOFAIL is documented both in gfp.h and memory-allocation.rst.
> The details are not entirely consistent.
> 
> This patch ensures both places state that:
>  - there is a risk of deadlock with reclaim/writeback/oom-kill
>  - it should only be used when there is no real alternative
>  - it is preferable to an endless loop
>  - it is strongly discourages for costly-order allocations.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Nit below:

> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 55b2ec1f965a..1d2a89e20b8b 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -209,7 +209,11 @@ struct vm_area_struct;
>   * used only when there is no reasonable failure policy) but it is
>   * definitely preferable to use the flag rather than opencode endless
>   * loop around allocator.
> - * Using this flag for costly allocations is _highly_ discouraged.
> + * Use of this flag may lead to deadlocks if locks are held which would
> + * be needed for memory reclaim, write-back, or the timely exit of a
> + * process killed by the OOM-killer.  Dropping any locks not absolutely
> + * needed is advisable before requesting a %__GFP_NOFAIL allocate.
> + * Using this flag for costly allocations (order>1) is _highly_ discouraged.

We define costly as 3, not 1. But sure it's best to avoid even order>0 for
__GFP_NOFAIL. Advising order>1 seems arbitrary though?

>   */
>  #define __GFP_IO	((__force gfp_t)___GFP_IO)
>  #define __GFP_FS	((__force gfp_t)___GFP_FS)
> 
> 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
  2021-10-05  9:20   ` Vlastimil Babka
@ 2021-10-05 11:09     ` Michal Hocko
  2021-10-05 12:27       ` Vlastimil Babka
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2021-10-05 11:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, . Dave Chinner,
	Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs,
	linux-mm, linux-kernel, linux-doc

On Tue 05-10-21 11:20:51, Vlastimil Babka wrote:
[...]
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -209,7 +209,11 @@ struct vm_area_struct;
> >   * used only when there is no reasonable failure policy) but it is
> >   * definitely preferable to use the flag rather than opencode endless
> >   * loop around allocator.
> > - * Using this flag for costly allocations is _highly_ discouraged.
> > + * Use of this flag may lead to deadlocks if locks are held which would
> > + * be needed for memory reclaim, write-back, or the timely exit of a
> > + * process killed by the OOM-killer.  Dropping any locks not absolutely
> > + * needed is advisable before requesting a %__GFP_NOFAIL allocate.
> > + * Using this flag for costly allocations (order>1) is _highly_ discouraged.
> 
> We define costly as 3, not 1. But sure it's best to avoid even order>0 for
> __GFP_NOFAIL. Advising order>1 seems arbitrary though?

This is not completely arbitrary. We have a warning for any higher order
allocation.
rmqueue:
	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

I do agree that "Using this flag for higher order allocations is
_highly_ discouraged.


> >   */
> >  #define __GFP_IO	((__force gfp_t)___GFP_IO)
> >  #define __GFP_FS	((__force gfp_t)___GFP_FS)
> > 
> > 
> > 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
  2021-10-05 11:09     ` Michal Hocko
@ 2021-10-05 12:27       ` Vlastimil Babka
  2021-10-06 23:14         ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Vlastimil Babka @ 2021-10-05 12:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: NeilBrown, Andrew Morton, Theodore Ts'o, Andreas Dilger,
	Darrick J. Wong, Matthew Wilcox, Mel Gorman, . Dave Chinner,
	Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs,
	linux-mm, linux-kernel, linux-doc

On 10/5/21 13:09, Michal Hocko wrote:
> On Tue 05-10-21 11:20:51, Vlastimil Babka wrote:
> [...]
>> > --- a/include/linux/gfp.h
>> > +++ b/include/linux/gfp.h
>> > @@ -209,7 +209,11 @@ struct vm_area_struct;
>> >   * used only when there is no reasonable failure policy) but it is
>> >   * definitely preferable to use the flag rather than opencode endless
>> >   * loop around allocator.
>> > - * Using this flag for costly allocations is _highly_ discouraged.
>> > + * Use of this flag may lead to deadlocks if locks are held which would
>> > + * be needed for memory reclaim, write-back, or the timely exit of a
>> > + * process killed by the OOM-killer.  Dropping any locks not absolutely
>> > + * needed is advisable before requesting a %__GFP_NOFAIL allocate.
>> > + * Using this flag for costly allocations (order>1) is _highly_ discouraged.
>> 
>> We define costly as 3, not 1. But sure it's best to avoid even order>0 for
>> __GFP_NOFAIL. Advising order>1 seems arbitrary though?
> 
> This is not completely arbitrary. We have a warning for any higher order
> allocation.
> rmqueue:
> 	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

Oh, I missed that.

> I do agree that "Using this flag for higher order allocations is
> _highly_ discouraged.

Well, with the warning in place this is effectively forbidden, not just
discouraged.

>> >   */
>> >  #define __GFP_IO	((__force gfp_t)___GFP_IO)
>> >  #define __GFP_FS	((__force gfp_t)___GFP_FS)
>> > 
>> > 
>> > 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
  2021-10-05 12:27       ` Vlastimil Babka
@ 2021-10-06 23:14         ` Dave Chinner
  2021-10-07 10:07           ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2021-10-06 23:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, NeilBrown, Andrew Morton, Theodore Ts'o,
	Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman,
	Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs,
	linux-mm, linux-kernel, linux-doc

On Tue, Oct 05, 2021 at 02:27:45PM +0200, Vlastimil Babka wrote:
> On 10/5/21 13:09, Michal Hocko wrote:
> > On Tue 05-10-21 11:20:51, Vlastimil Babka wrote:
> > [...]
> >> > --- a/include/linux/gfp.h
> >> > +++ b/include/linux/gfp.h
> >> > @@ -209,7 +209,11 @@ struct vm_area_struct;
> >> >   * used only when there is no reasonable failure policy) but it is
> >> >   * definitely preferable to use the flag rather than opencode endless
> >> >   * loop around allocator.
> >> > - * Using this flag for costly allocations is _highly_ discouraged.
> >> > + * Use of this flag may lead to deadlocks if locks are held which would
> >> > + * be needed for memory reclaim, write-back, or the timely exit of a
> >> > + * process killed by the OOM-killer.  Dropping any locks not absolutely
> >> > + * needed is advisable before requesting a %__GFP_NOFAIL allocate.
> >> > + * Using this flag for costly allocations (order>1) is _highly_ discouraged.
> >> 
> >> We define costly as 3, not 1. But sure it's best to avoid even order>0 for
> >> __GFP_NOFAIL. Advising order>1 seems arbitrary though?
> > 
> > This is not completely arbitrary. We have a warning for any higher order
> > allocation.
> > rmqueue:
> > 	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> 
> Oh, I missed that.
> 
> > I do agree that "Using this flag for higher order allocations is
> > _highly_ discouraged.
> 
> Well, with the warning in place this is effectively forbidden, not just
> discouraged.

Yup, especially as it doesn't obey __GFP_NOWARN.

See commit de2860f46362 ("mm: Add kvrealloc()") as a direct result
of unwittingly tripping over this warning when adding __GFP_NOFAIL
annotations to replace open coded high-order kmalloc loops that have
been in place for a couple of decades without issues.

Personally I think that the way __GFP_NOFAIL is first of all
recommended over open coded loops and then only later found to be
effectively forbidden and needing to be replaced with open coded
loops to be a complete mess.

Not to mention on the impossibility of using __GFP_NOFAIL with
kvmalloc() calls. Just what do we expect kmalloc_node(__GFP_NORETRY
| __GFP_NOFAIL) to do, exactly?

So, effectively, we have to open-code around kvmalloc() in
situations where failure is not an option. Even if we pass
__GFP_NOFAIL to __vmalloc(), it isn't guaranteed to succeed because
of the "we won't honor gfp flags passed to __vmalloc" semantics it
has.

Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc
fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS
kvmalloc via memalloc_nofs_save/restore(), so this behavioural
restriction w.r.t. gfp flags just makes no sense at all.

That leads to us having to go back to writing extremely custom open
coded loops to avoid awful high-order kmalloc direct reclaim
behaviour and still fall back to vmalloc and to still handle NOFAIL
semantics we need:

https://lore.kernel.org/linux-xfs/20210902095927.911100-8-david@fromorbit.com/

So, really, the problems are much deeper here than just badly
documented, catch-22 rules for __GFP_NOFAIL - we can't even use
__GFP_NOFAIL consistently across the allocation APIs because it
changes allocation behaviours in unusable, self-defeating ways....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
  2021-10-06 23:14         ` Dave Chinner
@ 2021-10-07 10:07           ` Michal Hocko
  2021-10-07 23:15             ` NeilBrown
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2021-10-07 10:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Vlastimil Babka, NeilBrown, Andrew Morton, Theodore Ts'o,
	Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman,
	Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs,
	linux-mm, linux-kernel, linux-doc

On Thu 07-10-21 10:14:52, Dave Chinner wrote:
> On Tue, Oct 05, 2021 at 02:27:45PM +0200, Vlastimil Babka wrote:
> > On 10/5/21 13:09, Michal Hocko wrote:
> > > On Tue 05-10-21 11:20:51, Vlastimil Babka wrote:
> > > [...]
> > >> > --- a/include/linux/gfp.h
> > >> > +++ b/include/linux/gfp.h
> > >> > @@ -209,7 +209,11 @@ struct vm_area_struct;
> > >> >   * used only when there is no reasonable failure policy) but it is
> > >> >   * definitely preferable to use the flag rather than opencode endless
> > >> >   * loop around allocator.
> > >> > - * Using this flag for costly allocations is _highly_ discouraged.
> > >> > + * Use of this flag may lead to deadlocks if locks are held which would
> > >> > + * be needed for memory reclaim, write-back, or the timely exit of a
> > >> > + * process killed by the OOM-killer.  Dropping any locks not absolutely
> > >> > + * needed is advisable before requesting a %__GFP_NOFAIL allocate.
> > >> > + * Using this flag for costly allocations (order>1) is _highly_ discouraged.
> > >> 
> > >> We define costly as 3, not 1. But sure it's best to avoid even order>0 for
> > >> __GFP_NOFAIL. Advising order>1 seems arbitrary though?
> > > 
> > > This is not completely arbitrary. We have a warning for any higher order
> > > allocation.
> > > rmqueue:
> > > 	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> > 
> > Oh, I missed that.
> > 
> > > I do agree that "Using this flag for higher order allocations is
> > > _highly_ discouraged.
> > 
> > Well, with the warning in place this is effectively forbidden, not just
> > discouraged.
> 
> Yup, especially as it doesn't obey __GFP_NOWARN.
> 
> See commit de2860f46362 ("mm: Add kvrealloc()") as a direct result
> of unwittingly tripping over this warning when adding __GFP_NOFAIL
> annotations to replace open coded high-order kmalloc loops that have
> been in place for a couple of decades without issues.
> 
> Personally I think that the way __GFP_NOFAIL is first of all
> recommended over open coded loops and then only later found to be
> effectively forbidden and needing to be replaced with open coded
> loops to be a complete mess.

Well, there are two things. Opencoding something that _can_ be replaced
by __GFP_NOFAIL and those that cannot because the respective allocator
doesn't really support that semantic. kvmalloc is explicit about that
IIRC. If you have a better way to consolidate the documentation then I
am all for it.

> Not to mention on the impossibility of using __GFP_NOFAIL with
> kvmalloc() calls. Just what do we expect kmalloc_node(__GFP_NORETRY
> | __GFP_NOFAIL) to do, exactly?

This combination doesn't make any sense. Like others. Do you want us to
list all combinations that make sense?

> So, effectively, we have to open-code around kvmalloc() in
> situations where failure is not an option. Even if we pass
> __GFP_NOFAIL to __vmalloc(), it isn't guaranteed to succeed because
> of the "we won't honor gfp flags passed to __vmalloc" semantics it
> has.

yes vmalloc doesn't support nofail semantic and it is not really trivial
to craft it there.

> Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc
> fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS
> kvmalloc via memalloc_nofs_save/restore(), so this behavioural
> restriction w.r.t. gfp flags just makes no sense at all.

GFP_NOFS (without using the scope API) has the same problem as NOFAIL in
the vmalloc. Hence it is not supported. If you use the scope API then
you can GFP_KERNEL for kvmalloc. This is clumsy but I am not sure how to
define these conditions in a more sensible way. Special case NOFS if the
scope api is in use? Why do you want an explicit NOFS then?

> That leads to us having to go back to writing extremely custom open
> coded loops to avoid awful high-order kmalloc direct reclaim
> behaviour and still fall back to vmalloc and to still handle NOFAIL
> semantics we need:
> 
> https://lore.kernel.org/linux-xfs/20210902095927.911100-8-david@fromorbit.com/

It would be more productive to get to MM people rather than rant on a
xfs specific patchse. Anyway, I can see a kvmalloc mode where the
kmalloc allocation would be really a very optimistic one - like your
effectively GFP_NOWAIT. Nobody has requested such a mode until now and I
am not sure how we would sensibly describe that by a gfp mask.

Btw. your GFP_NOWAIT | __GFP_NORETRY combination doesn't make any sense
in the allocator context as the later is a reclaim mofifier which
doesn't get applied when the reclaim is disabled (in your case by flags
&= ~__GFP_DIRECT_RECLAIM).

GFP flags are not that easy to build a coherent and usable apis.
Something we carry as a baggage for a long time.

> So, really, the problems are much deeper here than just badly
> documented, catch-22 rules for __GFP_NOFAIL - we can't even use
> __GFP_NOFAIL consistently across the allocation APIs because it
> changes allocation behaviours in unusable, self-defeating ways....

GFP_NOFAIL sucks. Not all allocator can follow it for practical
reasons. You are welcome to help document those awkward corner cases or
fix them up if you have a good idea how.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
  2021-10-07 10:07           ` Michal Hocko
@ 2021-10-07 23:15             ` NeilBrown
  2021-10-08  7:48               ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2021-10-07 23:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Chinner, Vlastimil Babka, Andrew Morton, Theodore Ts'o,
	Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman,
	Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs,
	linux-mm, linux-kernel, linux-doc

On Thu, 07 Oct 2021, Michal Hocko wrote:
> On Thu 07-10-21 10:14:52, Dave Chinner wrote:
> > On Tue, Oct 05, 2021 at 02:27:45PM +0200, Vlastimil Babka wrote:
> > > On 10/5/21 13:09, Michal Hocko wrote:
> > > > On Tue 05-10-21 11:20:51, Vlastimil Babka wrote:
> > > > [...]
> > > >> > --- a/include/linux/gfp.h
> > > >> > +++ b/include/linux/gfp.h
> > > >> > @@ -209,7 +209,11 @@ struct vm_area_struct;
> > > >> >   * used only when there is no reasonable failure policy) but it is
> > > >> >   * definitely preferable to use the flag rather than opencode endless
> > > >> >   * loop around allocator.
> > > >> > - * Using this flag for costly allocations is _highly_ discouraged.
> > > >> > + * Use of this flag may lead to deadlocks if locks are held which would
> > > >> > + * be needed for memory reclaim, write-back, or the timely exit of a
> > > >> > + * process killed by the OOM-killer.  Dropping any locks not absolutely
> > > >> > + * needed is advisable before requesting a %__GFP_NOFAIL allocate.
> > > >> > + * Using this flag for costly allocations (order>1) is _highly_ discouraged.
> > > >> 
> > > >> We define costly as 3, not 1. But sure it's best to avoid even order>0 for
> > > >> __GFP_NOFAIL. Advising order>1 seems arbitrary though?
> > > > 
> > > > This is not completely arbitrary. We have a warning for any higher order
> > > > allocation.
> > > > rmqueue:
> > > > 	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> > > 
> > > Oh, I missed that.
> > > 
> > > > I do agree that "Using this flag for higher order allocations is
> > > > _highly_ discouraged.
> > > 
> > > Well, with the warning in place this is effectively forbidden, not just
> > > discouraged.
> > 
> > Yup, especially as it doesn't obey __GFP_NOWARN.
> > 
> > See commit de2860f46362 ("mm: Add kvrealloc()") as a direct result
> > of unwittingly tripping over this warning when adding __GFP_NOFAIL
> > annotations to replace open coded high-order kmalloc loops that have
> > been in place for a couple of decades without issues.
> > 
> > Personally I think that the way __GFP_NOFAIL is first of all
> > recommended over open coded loops and then only later found to be
> > effectively forbidden and needing to be replaced with open coded
> > loops to be a complete mess.
> 
> Well, there are two things. Opencoding something that _can_ be replaced
> by __GFP_NOFAIL and those that cannot because the respective allocator
> doesn't really support that semantic. kvmalloc is explicit about that
> IIRC. If you have a better way to consolidate the documentation then I
> am all for it.

I think one thing that might help make the documentation better is to
explicitly state *why* __GFP_NOFAIL is better than a loop.

It occurs to me that
  while (!(p = kmalloc(sizeof(*p), GFP_KERNEL));

would behave much the same as adding __GFP_NOFAIL and dropping the
'while'.  So why not? I certainly cannot see the need to add any delay
to this loop as kmalloc does a fair bit of sleeping when permitted.

I understand that __GFP_NOFAIL allows page_alloc to dip into reserves,
but Mel holds that up as a reason *not* to use __GFP_NOFAIL as it can
impact on other subsystems.  Why not just let the caller decide if they
deserve the boost, but oring in __GFP_ATOMIC or __GFP_MEMALLOC as
appropriate.

I assume there is a good reason.  I vaguely remember the conversation
that lead to __GFP_NOFAIL being introduced.  I just cannot remember or
deduce what the reason is.  So it would be great to have it documented.

> 
> > Not to mention on the impossibility of using __GFP_NOFAIL with
> > kvmalloc() calls. Just what do we expect kmalloc_node(__GFP_NORETRY
> > | __GFP_NOFAIL) to do, exactly?
> 
> This combination doesn't make any sense. Like others. Do you want us to
> list all combinations that make sense?

I've been wondering about that.  There seem to be sets of flags that are
mutually exclusive.  It is as though gfp_t is a struct of a few enums.

0, DMA32, DMA, HIGHMEM
0, FS, IO
0, ATOMIC, MEMALLOC, NOMEMALLOC, HIGH
NORETRY, RETRY_MAYFAIL, 0, NOFAIL
0, KSWAPD_RECLAIM, DIRECT_RECLAIM
0, THISNODE, HARDWALL

In a few cases there seem to be 3 bits where there are only 4 possibly
combinations, so 2 bits would be enough.  There is probably no real
value is squeezing these into 2 bits, but clearly documenting the groups
surely wouldn't hurt.  Particularly highlighting the difference between
related bits would help.

The set with  'ATOMIC' is hard to wrap my mind around.
They relate to ALLOC_HIGH and ALLOC_HARDER, but also to WMARK_NIN,
WMARK_LOW, WMARK_HIGH ... I think.

I wonder if FS,IO is really in the same set as DIRECT_RECLAIM as they
all affect reclaim.  Maybe FS and IO are only relevan if DIRECT_RECLAIM
is set?

I'd love to know that to expect if neither RETRY_MAYFAIL or NOFAIL is
set.  I guess it can fail, but it still tries harder than if
RETRY_MAYFAIL is set....
Ahhhh...  I found some documentation which mentions that RETRY_MAYFAIL
doesn't trigger the oom killer.  Is that it? So RETRY_NOKILLOOM might be
a better name?

> 
> > So, effectively, we have to open-code around kvmalloc() in
> > situations where failure is not an option. Even if we pass
> > __GFP_NOFAIL to __vmalloc(), it isn't guaranteed to succeed because
> > of the "we won't honor gfp flags passed to __vmalloc" semantics it
> > has.
> 
> yes vmalloc doesn't support nofail semantic and it is not really trivial
> to craft it there.
> 
> > Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc
> > fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS
> > kvmalloc via memalloc_nofs_save/restore(), so this behavioural
> > restriction w.r.t. gfp flags just makes no sense at all.
> 
> GFP_NOFS (without using the scope API) has the same problem as NOFAIL in
> the vmalloc. Hence it is not supported. If you use the scope API then
> you can GFP_KERNEL for kvmalloc. This is clumsy but I am not sure how to
> define these conditions in a more sensible way. Special case NOFS if the
> scope api is in use? Why do you want an explicit NOFS then?

It would seem to make sense for kvmalloc to WARN_ON if it is passed
flags that does not allow it to use vmalloc.
Such callers could then know they can either change to a direct
kmalloc(), or change flags.  Silently ignoring the 'v' in the function
name sees like a poor choice.

Thanks,
NeilBrown

> 
> > That leads to us having to go back to writing extremely custom open
> > coded loops to avoid awful high-order kmalloc direct reclaim
> > behaviour and still fall back to vmalloc and to still handle NOFAIL
> > semantics we need:
> > 
> > https://lore.kernel.org/linux-xfs/20210902095927.911100-8-david@fromorbit.com/
> 
> It would be more productive to get to MM people rather than rant on a
> xfs specific patchse. Anyway, I can see a kvmalloc mode where the
> kmalloc allocation would be really a very optimistic one - like your
> effectively GFP_NOWAIT. Nobody has requested such a mode until now and I
> am not sure how we would sensibly describe that by a gfp mask.
> 
> Btw. your GFP_NOWAIT | __GFP_NORETRY combination doesn't make any sense
> in the allocator context as the later is a reclaim mofifier which
> doesn't get applied when the reclaim is disabled (in your case by flags
> &= ~__GFP_DIRECT_RECLAIM).
> 
> GFP flags are not that easy to build a coherent and usable apis.
> Something we carry as a baggage for a long time.
> 
> > So, really, the problems are much deeper here than just badly
> > documented, catch-22 rules for __GFP_NOFAIL - we can't even use
> > __GFP_NOFAIL consistently across the allocation APIs because it
> > changes allocation behaviours in unusable, self-defeating ways....
> 
> GFP_NOFAIL sucks. Not all allocator can follow it for practical
> reasons. You are welcome to help document those awkward corner cases or
> fix them up if you have a good idea how.
> 
> Thanks!
> -- 
> Michal Hocko
> SUSE Labs
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
  2021-10-07 23:15             ` NeilBrown
@ 2021-10-08  7:48               ` Michal Hocko
  2021-10-08 22:36                 ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2021-10-08  7:48 UTC (permalink / raw)
  To: NeilBrown
  Cc: Dave Chinner, Vlastimil Babka, Andrew Morton, Theodore Ts'o,
	Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman,
	Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs,
	linux-mm, linux-kernel, linux-doc

On Fri 08-10-21 10:15:45, Neil Brown wrote:
> On Thu, 07 Oct 2021, Michal Hocko wrote:
> > On Thu 07-10-21 10:14:52, Dave Chinner wrote:
> > > On Tue, Oct 05, 2021 at 02:27:45PM +0200, Vlastimil Babka wrote:
> > > > On 10/5/21 13:09, Michal Hocko wrote:
> > > > > On Tue 05-10-21 11:20:51, Vlastimil Babka wrote:
> > > > > [...]
> > > > >> > --- a/include/linux/gfp.h
> > > > >> > +++ b/include/linux/gfp.h
> > > > >> > @@ -209,7 +209,11 @@ struct vm_area_struct;
> > > > >> >   * used only when there is no reasonable failure policy) but it is
> > > > >> >   * definitely preferable to use the flag rather than opencode endless
> > > > >> >   * loop around allocator.
> > > > >> > - * Using this flag for costly allocations is _highly_ discouraged.
> > > > >> > + * Use of this flag may lead to deadlocks if locks are held which would
> > > > >> > + * be needed for memory reclaim, write-back, or the timely exit of a
> > > > >> > + * process killed by the OOM-killer.  Dropping any locks not absolutely
> > > > >> > + * needed is advisable before requesting a %__GFP_NOFAIL allocate.
> > > > >> > + * Using this flag for costly allocations (order>1) is _highly_ discouraged.
> > > > >> 
> > > > >> We define costly as 3, not 1. But sure it's best to avoid even order>0 for
> > > > >> __GFP_NOFAIL. Advising order>1 seems arbitrary though?
> > > > > 
> > > > > This is not completely arbitrary. We have a warning for any higher order
> > > > > allocation.
> > > > > rmqueue:
> > > > > 	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> > > > 
> > > > Oh, I missed that.
> > > > 
> > > > > I do agree that "Using this flag for higher order allocations is
> > > > > _highly_ discouraged.
> > > > 
> > > > Well, with the warning in place this is effectively forbidden, not just
> > > > discouraged.
> > > 
> > > Yup, especially as it doesn't obey __GFP_NOWARN.
> > > 
> > > See commit de2860f46362 ("mm: Add kvrealloc()") as a direct result
> > > of unwittingly tripping over this warning when adding __GFP_NOFAIL
> > > annotations to replace open coded high-order kmalloc loops that have
> > > been in place for a couple of decades without issues.
> > > 
> > > Personally I think that the way __GFP_NOFAIL is first of all
> > > recommended over open coded loops and then only later found to be
> > > effectively forbidden and needing to be replaced with open coded
> > > loops to be a complete mess.
> > 
> > Well, there are two things. Opencoding something that _can_ be replaced
> > by __GFP_NOFAIL and those that cannot because the respective allocator
> > doesn't really support that semantic. kvmalloc is explicit about that
> > IIRC. If you have a better way to consolidate the documentation then I
> > am all for it.
> 
> I think one thing that might help make the documentation better is to
> explicitly state *why* __GFP_NOFAIL is better than a loop.
> 
> It occurs to me that
>   while (!(p = kmalloc(sizeof(*p), GFP_KERNEL));
> 
> would behave much the same as adding __GFP_NOFAIL and dropping the
> 'while'.  So why not? I certainly cannot see the need to add any delay
> to this loop as kmalloc does a fair bit of sleeping when permitted.
> 
> I understand that __GFP_NOFAIL allows page_alloc to dip into reserves,
> but Mel holds that up as a reason *not* to use __GFP_NOFAIL as it can
> impact on other subsystems.

__GFP_NOFAIL usage is a risk on its own. It is a hard requirement that
the allocator cannot back off. So it has to absolutely everything to
suceed. Whether it cheats and dips into reserves or not is a mere
implementation detail and a subject to the specific implementation.

> Why not just let the caller decide if they
> deserve the boost, but oring in __GFP_ATOMIC or __GFP_MEMALLOC as
> appropriate.

They can do that. Explicit access to memory reserves is allowed unless
it is explicitly forbidden by NOMEMALLOC flag.

> I assume there is a good reason.  I vaguely remember the conversation
> that lead to __GFP_NOFAIL being introduced.  I just cannot remember or
> deduce what the reason is.  So it would be great to have it documented.

The basic reason is that if the allocator knows this is must suceed
allocation request then it can prioritize it in some way. A dumb kmalloc
loop as you pictured it is likely much less optimal in that sense, isn't
it? Compare that to mempool allocator which is non failing as well but
it has some involved handling and that is certainly not a good fit for
__GFP_NOFAIL in the page allocator.
 
> > > Not to mention on the impossibility of using __GFP_NOFAIL with
> > > kvmalloc() calls. Just what do we expect kmalloc_node(__GFP_NORETRY
> > > | __GFP_NOFAIL) to do, exactly?
> > 
> > This combination doesn't make any sense. Like others. Do you want us to
> > list all combinations that make sense?
> 
> I've been wondering about that.  There seem to be sets of flags that are
> mutually exclusive.  It is as though gfp_t is a struct of a few enums.
> 
> 0, DMA32, DMA, HIGHMEM
> 0, FS, IO
> 0, ATOMIC, MEMALLOC, NOMEMALLOC, HIGH
> NORETRY, RETRY_MAYFAIL, 0, NOFAIL
> 0, KSWAPD_RECLAIM, DIRECT_RECLAIM
> 0, THISNODE, HARDWALL
> 
> In a few cases there seem to be 3 bits where there are only 4 possibly
> combinations, so 2 bits would be enough.  There is probably no real
> value is squeezing these into 2 bits, but clearly documenting the groups
> surely wouldn't hurt.  Particularly highlighting the difference between
> related bits would help.

Don't we have that already? We have them grouped by placement,
watermarks, reclaim and action modifiers. Then we have useful
combinations. I believe we can always improve on that and I am always
ready to listen here.

> The set with  'ATOMIC' is hard to wrap my mind around.
> They relate to ALLOC_HIGH and ALLOC_HARDER, but also to WMARK_NIN,
> WMARK_LOW, WMARK_HIGH ... I think.

ALLOC* and WMARK* is an internal allocator concept and I believe users
of gfp flags shouldn't really care or even know those exist.

> I wonder if FS,IO is really in the same set as DIRECT_RECLAIM as they
> all affect reclaim.  Maybe FS and IO are only relevan if DIRECT_RECLAIM
> is set?

yes, this indeed the case. Page allocator doesn't go outside of its
proper without the direct reclaim.

> I'd love to know that to expect if neither RETRY_MAYFAIL or NOFAIL is
> set.  I guess it can fail, but it still tries harder than if
> RETRY_MAYFAIL is set....
> Ahhhh...  I found some documentation which mentions

The reclaim behavior is described along with the respective modifiers. I
believe we can thank you for this structure as you were the primary
driving force to clarify the behavior.

> that RETRY_MAYFAIL
> doesn't trigger the oom killer.  Is that it? So RETRY_NOKILLOOM might be
> a better name?

Again the those are implementation details and I am not sure we really
want to bother users with all of them. This wold quickly become hairy
and likely even outdated after some time. The documentation tries to
describe different levels of involvement. NOWAIT - no direct reclaim,
NORETRY - only a light attempt to reclaim, RETRY_MAYFAIL - try as hard
as feasible, NOFAIL - cannot really fail.

If we can improve the wording I am all for it.
 
> > > So, effectively, we have to open-code around kvmalloc() in
> > > situations where failure is not an option. Even if we pass
> > > __GFP_NOFAIL to __vmalloc(), it isn't guaranteed to succeed because
> > > of the "we won't honor gfp flags passed to __vmalloc" semantics it
> > > has.
> > 
> > yes vmalloc doesn't support nofail semantic and it is not really trivial
> > to craft it there.
> > 
> > > Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc
> > > fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS
> > > kvmalloc via memalloc_nofs_save/restore(), so this behavioural
> > > restriction w.r.t. gfp flags just makes no sense at all.
> > 
> > GFP_NOFS (without using the scope API) has the same problem as NOFAIL in
> > the vmalloc. Hence it is not supported. If you use the scope API then
> > you can GFP_KERNEL for kvmalloc. This is clumsy but I am not sure how to
> > define these conditions in a more sensible way. Special case NOFS if the
> > scope api is in use? Why do you want an explicit NOFS then?
> 
> It would seem to make sense for kvmalloc to WARN_ON if it is passed
> flags that does not allow it to use vmalloc.

vmalloc is certainly not the hottest path in the kernel so I wouldn't be
opposed. One should be careful that WARN_ON is effectively BUG_ON in
some configurations but we are sinners from that perspective all over
the place...

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
  2021-10-08  7:48               ` Michal Hocko
@ 2021-10-08 22:36                 ` Dave Chinner
  2021-10-11 11:57                   ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2021-10-08 22:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: NeilBrown, Vlastimil Babka, Andrew Morton, Theodore Ts'o,
	Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman,
	Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs,
	linux-mm, linux-kernel, linux-doc

On Fri, Oct 08, 2021 at 09:48:39AM +0200, Michal Hocko wrote:
> On Fri 08-10-21 10:15:45, Neil Brown wrote:
> > On Thu, 07 Oct 2021, Michal Hocko wrote:
> > > On Thu 07-10-21 10:14:52, Dave Chinner wrote:
> > > > On Tue, Oct 05, 2021 at 02:27:45PM +0200, Vlastimil Babka wrote:
> > > > > On 10/5/21 13:09, Michal Hocko wrote:
> > > > > > On Tue 05-10-21 11:20:51, Vlastimil Babka wrote:
> > > > > > [...]
> > > > > >> > --- a/include/linux/gfp.h
> > > > > >> > +++ b/include/linux/gfp.h
> > > > > >> > @@ -209,7 +209,11 @@ struct vm_area_struct;
> > > > > >> >   * used only when there is no reasonable failure policy) but it is
> > > > > >> >   * definitely preferable to use the flag rather than opencode endless
> > > > > >> >   * loop around allocator.
> > > > > >> > - * Using this flag for costly allocations is _highly_ discouraged.
> > > > > >> > + * Use of this flag may lead to deadlocks if locks are held which would
> > > > > >> > + * be needed for memory reclaim, write-back, or the timely exit of a
> > > > > >> > + * process killed by the OOM-killer.  Dropping any locks not absolutely
> > > > > >> > + * needed is advisable before requesting a %__GFP_NOFAIL allocate.
> > > > > >> > + * Using this flag for costly allocations (order>1) is _highly_ discouraged.
> > > > > >> 
> > > > > >> We define costly as 3, not 1. But sure it's best to avoid even order>0 for
> > > > > >> __GFP_NOFAIL. Advising order>1 seems arbitrary though?
> > > > > > 
> > > > > > This is not completely arbitrary. We have a warning for any higher order
> > > > > > allocation.
> > > > > > rmqueue:
> > > > > > 	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> > > > > 
> > > > > Oh, I missed that.
> > > > > 
> > > > > > I do agree that "Using this flag for higher order allocations is
> > > > > > _highly_ discouraged.
> > > > > 
> > > > > Well, with the warning in place this is effectively forbidden, not just
> > > > > discouraged.
> > > > 
> > > > Yup, especially as it doesn't obey __GFP_NOWARN.
> > > > 
> > > > See commit de2860f46362 ("mm: Add kvrealloc()") as a direct result
> > > > of unwittingly tripping over this warning when adding __GFP_NOFAIL
> > > > annotations to replace open coded high-order kmalloc loops that have
> > > > been in place for a couple of decades without issues.
> > > > 
> > > > Personally I think that the way __GFP_NOFAIL is first of all
> > > > recommended over open coded loops and then only later found to be
> > > > effectively forbidden and needing to be replaced with open coded
> > > > loops to be a complete mess.
> > > 
> > > Well, there are two things. Opencoding something that _can_ be replaced
> > > by __GFP_NOFAIL and those that cannot because the respective allocator
> > > doesn't really support that semantic. kvmalloc is explicit about that
> > > IIRC. If you have a better way to consolidate the documentation then I
> > > am all for it.
> > 
> > I think one thing that might help make the documentation better is to
> > explicitly state *why* __GFP_NOFAIL is better than a loop.
> > 
> > It occurs to me that
> >   while (!(p = kmalloc(sizeof(*p), GFP_KERNEL));
> > 
> > would behave much the same as adding __GFP_NOFAIL and dropping the
> > 'while'.  So why not? I certainly cannot see the need to add any delay
> > to this loop as kmalloc does a fair bit of sleeping when permitted.
> > 
> > I understand that __GFP_NOFAIL allows page_alloc to dip into reserves,
> > but Mel holds that up as a reason *not* to use __GFP_NOFAIL as it can
> > impact on other subsystems.
> 
> __GFP_NOFAIL usage is a risk on its own. It is a hard requirement that
> the allocator cannot back off.

No, "allocator cannot back off" isn't a hard requirement for most
GFP_NOFAIL uses. *Not failing the allocation* is the hard
requirement.

How long it takes for the allocation to actually succeed is
irrelevant to most callers, and given that we are replacing loops
that do

	while (!(p = kmalloc(sizeof(*p), GFP_KERNEL))

with __GFP_NOFAIL largely indicates that allocation *latency* and/or
deadlocks are not an issue here.

Indeed, if we deadlock in XFS because there is no memory available,
that is *not a problem kmalloc() should be trying to solve*. THe
problem is the caller being unable to handle allocation failure, so
if allocation cannot make progress, that needs to be fixed by the
caller getting rid of the unfailable allocation.

The fact is that we've had these loops in production code for a
couple of decades and these subsystems just aren't failing or
deadlocking with such loops. IOWs, we don't need __GFP_NOFAIL to dig
deep into reserves or drive the system to OOM killing - we just need
to it keep retrying the same allocation until it succeeds.

Put simply, we want "retry forever" semantics to match what
production kernels have been doing for the past couple of decades,
but all we've been given are "never fail" semantics that also do
something different and potentially much more problematic.

Do you see the difference here? __GFP_NOFAIL is not what we
need in the vast majority of cases where it is used. We don't want
the failing allocations to drive the machine hard into critical
reserves, we just want the allocation to -eventually succeed- and if
it doesn't, that's our problem to handle, not kmalloc()....

> So it has to absolutely everything to
> suceed. Whether it cheats and dips into reserves or not is a mere
> implementation detail and a subject to the specific implementation.

My point exactly: that's how the MM interprets __GFP_NOFAIL is supposed to
provide callers with. What we are trying to tell you is that the
semantics associated with __GFP_NOFAIL is not actually what we
require, and it's the current semantics of __GFP_NOFAIL that cause
all the "can't be applied consistently across the entire allocation
APIs" problems....

> > > > So, effectively, we have to open-code around kvmalloc() in
> > > > situations where failure is not an option. Even if we pass
> > > > __GFP_NOFAIL to __vmalloc(), it isn't guaranteed to succeed because
> > > > of the "we won't honor gfp flags passed to __vmalloc" semantics it
> > > > has.
> > > 
> > > yes vmalloc doesn't support nofail semantic and it is not really trivial
> > > to craft it there.

Yet retry-forever is trivial to implement across everything:

kvmalloc(size, gfp_mask)
{
	gfp_t	flags = gfp_mask & ~__GFP_RETRY_FOREVER;

	do {
		p = __kvmalloc(size, flags)
	} while (!p && (gfp_mask & __GFP_RETRY_FOREVER));

	return p;
}

That provides "allocation will eventually succeed" semantics just
fine, yes? It doesn't guarantee forwards progress or success, just
that *it won't fail*.

It should be obvious what the difference between "retry forever" and
__GFP_NOFAIL semantics are now, and why we don't actually want
__GFP_NOFAIL. We just want __GFP_RETRY_FOREVER semantics that can be
applied consistently across the entire allocation API regardless of
whatever other flags are passed into the allocation: don't return
until an allocation with the provided semantics succeeds.



> > > > Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc
> > > > fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS
> > > > kvmalloc via memalloc_nofs_save/restore(), so this behavioural
> > > > restriction w.r.t. gfp flags just makes no sense at all.
> > > 
> > > GFP_NOFS (without using the scope API) has the same problem as NOFAIL in
> > > the vmalloc. Hence it is not supported. If you use the scope API then
> > > you can GFP_KERNEL for kvmalloc. This is clumsy but I am not sure how to
> > > define these conditions in a more sensible way. Special case NOFS if the
> > > scope api is in use? Why do you want an explicit NOFS then?

Exactly my point - this is clumsy and a total mess. I'm not asking
for an explicit GFP_NOFS, just pointing out that the documented
restrictions that "vmalloc can only do GFP_KERNEL allocations" is
completely wrong.

vmalloc()
{
	if (!(gfp_flags &  __GFP_FS))
		memalloc_nofs_save();
	p = __vmalloc(gfp_flags | GFP_KERNEL)
	if (!(gfp_flags &  __GFP_FS))
		memalloc_nofs_restore();
}

Yup, that's how simple it is to support GFP_NOFS support in
vmalloc().

This goes along with the argument that "it's impossible to do
GFP_NOFAIL with vmalloc" as I addressed above. These things are not
impossible, but we hide behind "we don't want people to use vmalloc"
as an excuse for having shitty behaviour whilst ignoring that
vmalloc is *heavily used* by core subsystems like filesystems
because they cannot rely on high order allocations succeeding....

It also points out that the scope API is highly deficient.
We can do GFP_NOFS via the scope API, but we can't
do anything else because *there is no scope API for other GFP
flags*.

Why don't we have a GFP_NOFAIL/__GFP_RETRY_FOREVER scope API? That
would save us a lot of bother in XFS. What about GFP_DIRECT_RECLAIM?
I'd really like to turn that off for allocations in the XFS
transaction commit path (as noted already in this thread) because
direct reclaim that can make no progress is actively harmful (as
noted already in this thread)

Like I said - this is more than just bad documentation - the problem
is that the whole allocation API is an inconsistent mess of control
mechanisms to begin with...

> > It would seem to make sense for kvmalloc to WARN_ON if it is passed
> > flags that does not allow it to use vmalloc.
> 
> vmalloc is certainly not the hottest path in the kernel so I wouldn't be
> opposed.

kvmalloc is most certainly becoming one of the hottest paths in XFS.
IOWs, arguments that "vmalloc is not a hot path" are simply invalid
these days because they are simply untrue. e.g. the profiles I
posted in this thread...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
  2021-10-08 22:36                 ` Dave Chinner
@ 2021-10-11 11:57                   ` Michal Hocko
  2021-10-11 21:49                     ` NeilBrown
  2021-10-13  2:32                     ` Dave Chinner
  0 siblings, 2 replies; 25+ messages in thread
From: Michal Hocko @ 2021-10-11 11:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: NeilBrown, Vlastimil Babka, Andrew Morton, Theodore Ts'o,
	Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman,
	Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs,
	linux-mm, linux-kernel, linux-doc

On Sat 09-10-21 09:36:49, Dave Chinner wrote:
> On Fri, Oct 08, 2021 at 09:48:39AM +0200, Michal Hocko wrote:
> > __GFP_NOFAIL usage is a risk on its own. It is a hard requirement that
> > the allocator cannot back off.
[...]
> 
> No, "allocator cannot back off" isn't a hard requirement for most
> GFP_NOFAIL uses. *Not failing the allocation* is the hard
> requirement.

We are talking about the same thing here I belive. By cannot back off I
really mean cannot fail. Just for the clarification.

> How long it takes for the allocation to actually succeed is
> irrelevant to most callers, and given that we are replacing loops
> that do
> 
> 	while (!(p = kmalloc(sizeof(*p), GFP_KERNEL))
> 
> with __GFP_NOFAIL largely indicates that allocation *latency* and/or
> deadlocks are not an issue here.

Agreed. 

> Indeed, if we deadlock in XFS because there is no memory available,
> that is *not a problem kmalloc() should be trying to solve*. THe
> problem is the caller being unable to handle allocation failure, so
> if allocation cannot make progress, that needs to be fixed by the
> caller getting rid of the unfailable allocation.
> 
> The fact is that we've had these loops in production code for a
> couple of decades and these subsystems just aren't failing or
> deadlocking with such loops. IOWs, we don't need __GFP_NOFAIL to dig
> deep into reserves or drive the system to OOM killing - we just need
> to it keep retrying the same allocation until it succeeds.
> 
> Put simply, we want "retry forever" semantics to match what
> production kernels have been doing for the past couple of decades,
> but all we've been given are "never fail" semantics that also do
> something different and potentially much more problematic.
> 
> Do you see the difference here? __GFP_NOFAIL is not what we
> need in the vast majority of cases where it is used. We don't want
> the failing allocations to drive the machine hard into critical
> reserves, we just want the allocation to -eventually succeed- and if
> it doesn't, that's our problem to handle, not kmalloc()....

I can see your point. I do have a recollection that there were some
instance involved where an emergency access to memory reserves helped
in OOM situations.

Anway as I've tried to explain earlier that this all is an
implementation detail users of the flag shouldn't really care about. If
this heuristic is not doing any good then it should be removed.

[...]
> > > > > Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc
> > > > > fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS
> > > > > kvmalloc via memalloc_nofs_save/restore(), so this behavioural
> > > > > restriction w.r.t. gfp flags just makes no sense at all.
> > > > 
> > > > GFP_NOFS (without using the scope API) has the same problem as NOFAIL in
> > > > the vmalloc. Hence it is not supported. If you use the scope API then
> > > > you can GFP_KERNEL for kvmalloc. This is clumsy but I am not sure how to
> > > > define these conditions in a more sensible way. Special case NOFS if the
> > > > scope api is in use? Why do you want an explicit NOFS then?
> 
> Exactly my point - this is clumsy and a total mess. I'm not asking
> for an explicit GFP_NOFS, just pointing out that the documented
> restrictions that "vmalloc can only do GFP_KERNEL allocations" is
> completely wrong.
> 
> vmalloc()
> {
> 	if (!(gfp_flags &  __GFP_FS))
> 		memalloc_nofs_save();
> 	p = __vmalloc(gfp_flags | GFP_KERNEL)
> 	if (!(gfp_flags &  __GFP_FS))
> 		memalloc_nofs_restore();
> }
> 
> Yup, that's how simple it is to support GFP_NOFS support in
> vmalloc().

Yes, this would work from the functionality POV but it defeats the
philosophy behind the scope API. Why would you even need this if the
scope was defined by the caller of the allocator? The initial hope was
to get rid of the NOFS abuse that can be seen in many filesystems. All
allocations from the scope would simply inherit the NOFS semantic so
an explicit NOFS shouldn't be really necessary, right?

> This goes along with the argument that "it's impossible to do
> GFP_NOFAIL with vmalloc" as I addressed above. These things are not
> impossible, but we hide behind "we don't want people to use vmalloc"
> as an excuse for having shitty behaviour whilst ignoring that
> vmalloc is *heavily used* by core subsystems like filesystems
> because they cannot rely on high order allocations succeeding....

I do not think there is any reason to discourage anybody from using
vmalloc these days. 32b is dying out and vmalloc space is no longer a
very scarce resource.

> It also points out that the scope API is highly deficient.
> We can do GFP_NOFS via the scope API, but we can't
> do anything else because *there is no scope API for other GFP
> flags*.
> 
> Why don't we have a GFP_NOFAIL/__GFP_RETRY_FOREVER scope API?

NO{FS,IO} where first flags to start this approach. And I have to admit
the experiment was much less successful then I hoped for. There are
still thousands of direct NOFS users so for some reason defining scopes
is not an easy thing to do.

I am not against NOFAIL scopes in principle but seeing the nofs
"success" I am worried this will not go really well either and it is
much more tricky as NOFAIL has much stronger requirements than NOFS.
Just imagine how tricky this can be if you just call a library code
that is not under your control within a NOFAIL scope. What if that
library code decides to allocate (e.g. printk that would attempt to do
an optimistic NOWAIT allocation).

> That
> would save us a lot of bother in XFS. What about GFP_DIRECT_RECLAIM?
> I'd really like to turn that off for allocations in the XFS
> transaction commit path (as noted already in this thread) because
> direct reclaim that can make no progress is actively harmful (as
> noted already in this thread)

As always if you have reasonable usecases then it is best to bring them
up on the MM list and we can discuss them.

> Like I said - this is more than just bad documentation - the problem
> is that the whole allocation API is an inconsistent mess of control
> mechanisms to begin with...

I am not going to disagree. There is a lot of historical baggage and
it doesn't help that any change is really hard to review because this
interface is used throughout the kernel. I have tried to change some
most obvious inconsistencies and I can tell this has always been a
frustrating experience with a very small "reward" in the end because
there are so many other problems.

That being said, I would more than love to have a consistent and well
defined interface and if you want to spend a lot of time on that then be
my guest.

> > > It would seem to make sense for kvmalloc to WARN_ON if it is passed
> > > flags that does not allow it to use vmalloc.
> > 
> > vmalloc is certainly not the hottest path in the kernel so I wouldn't be
> > opposed.
> 
> kvmalloc is most certainly becoming one of the hottest paths in XFS.
> IOWs, arguments that "vmalloc is not a hot path" are simply invalid
> these days because they are simply untrue. e.g. the profiles I
> posted in this thread...

Is it such a hot path that a check for compatible flags would be visible
in profiles though?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
  2021-10-11 11:57                   ` Michal Hocko
@ 2021-10-11 21:49                     ` NeilBrown
  2021-10-13  2:32                     ` Dave Chinner
  1 sibling, 0 replies; 25+ messages in thread
From: NeilBrown @ 2021-10-11 21:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Chinner, Vlastimil Babka, Andrew Morton, Theodore Ts'o,
	Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman,
	Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs,
	linux-mm, linux-kernel, linux-doc

On Mon, 11 Oct 2021, Michal Hocko wrote:
> On Sat 09-10-21 09:36:49, Dave Chinner wrote:
> > 
> > Put simply, we want "retry forever" semantics to match what
> > production kernels have been doing for the past couple of decades,
> > but all we've been given are "never fail" semantics that also do
> > something different and potentially much more problematic.
> > 
> > Do you see the difference here? __GFP_NOFAIL is not what we
> > need in the vast majority of cases where it is used. We don't want
> > the failing allocations to drive the machine hard into critical
> > reserves, we just want the allocation to -eventually succeed- and if
> > it doesn't, that's our problem to handle, not kmalloc()....
> 
> I can see your point. I do have a recollection that there were some
> instance involved where an emergency access to memory reserves helped
> in OOM situations.

It might have been better to annotate those particular calls with
__GFP_ATOMIC or similar rather then change GFP_NOFAIL for everyone.
Too late to fix that now though I think.  Maybe the best way forward is
to discourage new uses of GFP_NOFAIL.  We would need a well-documented
replacement.

> 
> Anway as I've tried to explain earlier that this all is an
> implementation detail users of the flag shouldn't really care about. If
> this heuristic is not doing any good then it should be removed.

Maybe users shouldn't care about implementation details, but they do
need to care about semantics and costs.
We need to know when it is appropriate to use GFP_NOFAIL, and when it is
not.  And what alternatives there are when it is not appropriate.
Just saying "try to avoid using it" and "requires careful analysis"
isn't acceptable.  Sometimes it is unavoidable and analysis can only be
done with a clear understanding of costs.  Possibly analysis can only be
done with a clear understanding of the internal implementation details.

> 
> > It also points out that the scope API is highly deficient.
> > We can do GFP_NOFS via the scope API, but we can't
> > do anything else because *there is no scope API for other GFP
> > flags*.
> > 
> > Why don't we have a GFP_NOFAIL/__GFP_RETRY_FOREVER scope API?
> 
> NO{FS,IO} where first flags to start this approach. And I have to admit
> the experiment was much less successful then I hoped for. There are
> still thousands of direct NOFS users so for some reason defining scopes
> is not an easy thing to do.

I'm not certain your conclusion is valid.  It could be that defining
scopes is easy enough, but no one feels motivated to do it.
We need to do more than provide functionality.  We need to tell people. 
Repeatedly.  And advertise widely.  And propose patches to make use of
the functionality.  And... and... and...

I think changing to the scope API is a good change, but it is
conceptually a big change.  It needs to be driven.

> 
> I am not against NOFAIL scopes in principle but seeing the nofs
> "success" I am worried this will not go really well either and it is
> much more tricky as NOFAIL has much stronger requirements than NOFS.
> Just imagine how tricky this can be if you just call a library code
> that is not under your control within a NOFAIL scope. What if that
> library code decides to allocate (e.g. printk that would attempt to do
> an optimistic NOWAIT allocation).

__GFP_NOMEMALLOC holds a lesson worth learning here.  PF_MEMALLOC
effectively adds __GFP_MEMALLOC to all allocations, but some call sites
need to over-ride that because there are alternate strategies available.
This need-to-over-ride doesn't apply to NOFS or NOIO as that really is a
thread-wide state.  But MEMALLOC and NOFAIL are different.  Some call
sites can reasonably handle failure locally.

I imagine the scope-api would say something like "NO_ENOMEM".  i.e.
memory allocations can fail as long as ENOMEM is never returned.
Any caller that sets __GFP_RETRY_MAYFAIL or __GFP_NORETRY or maybe some
others which not be affected by the NO_ENOMEM scope.  But a plain
GFP_KERNEL would.

Introducing the scope api would be a good opportunity to drop the
priority boost and *just* block until success.  Priority boosts could
then be added (possibly as a scope) only where they are measurably needed.

I think we have 28 process flags in use.  So we can probably afford one
more for PF_MEMALLOC_NO_ENOMEM.  What other scope flags might be useful?
PF_MEMALLOC_BOOST which added __GFP_ATOMIC but not __GFP_MEMALLOC ??
PF_MEMALLOC_NORECLAIM ??

> 
> > That
> > would save us a lot of bother in XFS. What about GFP_DIRECT_RECLAIM?
> > I'd really like to turn that off for allocations in the XFS
> > transaction commit path (as noted already in this thread) because
> > direct reclaim that can make no progress is actively harmful (as
> > noted already in this thread)
> 
> As always if you have reasonable usecases then it is best to bring them
> up on the MM list and we can discuss them.

We are on the MM lists now... let's discuss :-)

Dave: How would you feel about an effort to change xfs to stop using
GFP_NOFS, and to use memalloc_nofs_save/restore instead? Having a major
filesystem make the transition would be a good test-case, and could be
used to motivate other filesystems to follow.
We could add and use memalloc_no_enomem_save() too.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
  2021-10-11 11:57                   ` Michal Hocko
  2021-10-11 21:49                     ` NeilBrown
@ 2021-10-13  2:32                     ` Dave Chinner
  2021-10-13  8:26                       ` Michal Hocko
  1 sibling, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2021-10-13  2:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: NeilBrown, Vlastimil Babka, Andrew Morton, Theodore Ts'o,
	Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman,
	Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs,
	linux-mm, linux-kernel, linux-doc

On Mon, Oct 11, 2021 at 01:57:36PM +0200, Michal Hocko wrote:
> On Sat 09-10-21 09:36:49, Dave Chinner wrote:
> > On Fri, Oct 08, 2021 at 09:48:39AM +0200, Michal Hocko wrote:
> > > > > > Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc
> > > > > > fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS
> > > > > > kvmalloc via memalloc_nofs_save/restore(), so this behavioural
> > > > > > restriction w.r.t. gfp flags just makes no sense at all.
> > > > > 
> > > > > GFP_NOFS (without using the scope API) has the same problem as NOFAIL in
> > > > > the vmalloc. Hence it is not supported. If you use the scope API then
> > > > > you can GFP_KERNEL for kvmalloc. This is clumsy but I am not sure how to
> > > > > define these conditions in a more sensible way. Special case NOFS if the
> > > > > scope api is in use? Why do you want an explicit NOFS then?
> > 
> > Exactly my point - this is clumsy and a total mess. I'm not asking
> > for an explicit GFP_NOFS, just pointing out that the documented
> > restrictions that "vmalloc can only do GFP_KERNEL allocations" is
> > completely wrong.
> > 
> > vmalloc()
> > {
> > 	if (!(gfp_flags &  __GFP_FS))
> > 		memalloc_nofs_save();
> > 	p = __vmalloc(gfp_flags | GFP_KERNEL)
> > 	if (!(gfp_flags &  __GFP_FS))
> > 		memalloc_nofs_restore();
> > }
> > 
> > Yup, that's how simple it is to support GFP_NOFS support in
> > vmalloc().
> 
> Yes, this would work from the functionality POV but it defeats the
> philosophy behind the scope API. Why would you even need this if the
> scope was defined by the caller of the allocator?

Who actually cares that vmalloc might be using the scoped API
internally to implement GFP_NOFS or GFP_NOIO? Nobody at all.
It is far more useful (and self documenting!) for one-off allocations
to pass a GFP_NOFS flag than it is to use a scope API...

> The initial hope was
> to get rid of the NOFS abuse that can be seen in many filesystems. All
> allocations from the scope would simply inherit the NOFS semantic so
> an explicit NOFS shouldn't be really necessary, right?

Yes, but I think you miss my point entirely: that the vmalloc
restrictions on what gfp flags can be passed without making it
entirely useless are completely arbitrary and non-sensical.

> > This goes along with the argument that "it's impossible to do
> > GFP_NOFAIL with vmalloc" as I addressed above. These things are not
> > impossible, but we hide behind "we don't want people to use vmalloc"
> > as an excuse for having shitty behaviour whilst ignoring that
> > vmalloc is *heavily used* by core subsystems like filesystems
> > because they cannot rely on high order allocations succeeding....
> 
> I do not think there is any reason to discourage anybody from using
> vmalloc these days. 32b is dying out and vmalloc space is no longer a
> very scarce resource.

We are still discouraged from doing high order allocations and
should only use pages directly. Not to mention that the API doesn't
make it simple to use vmalloc as a direct replacement for high order
kmalloc tends to discourage new users...

> > It also points out that the scope API is highly deficient.
> > We can do GFP_NOFS via the scope API, but we can't
> > do anything else because *there is no scope API for other GFP
> > flags*.
> > 
> > Why don't we have a GFP_NOFAIL/__GFP_RETRY_FOREVER scope API?
> 
> NO{FS,IO} where first flags to start this approach. And I have to admit
> the experiment was much less successful then I hoped for. There are
> still thousands of direct NOFS users so for some reason defining scopes
> is not an easy thing to do.
> 
> I am not against NOFAIL scopes in principle but seeing the nofs
> "success" I am worried this will not go really well either and it is
> much more tricky as NOFAIL has much stronger requirements than NOFS.
> Just imagine how tricky this can be if you just call a library code
> that is not under your control within a NOFAIL scope. What if that
> library code decides to allocate (e.g. printk that would attempt to do
> an optimistic NOWAIT allocation).

I already asked you that _exact_ question earlier in the thread
w.r.t.  kvmalloc(GFP_NOFAIL) using optimistic NOWAIT kmalloc
allocation. I asked you as a MM expert to define *and document* the
behaviour that should result, not turn around and use the fact that
it is undefined behaviour as a "this is too hard" excuse for not
changing anything.

THe fact is that the scope APIs are only really useful for certain
contexts where restrictions are set by higher level functionality.
For one-off allocation constraints the API sucks and we end up with
crap like this (found in btrfs):

                /*                                                               
                 * We're holding a transaction handle, so use a NOFS memory      
                 * allocation context to avoid deadlock if reclaim happens.      
                 */                                                              
                nofs_flag = memalloc_nofs_save();                                
                value = kmalloc(size, GFP_KERNEL);                               
                memalloc_nofs_restore(nofs_flag);                                

But also from btrfs, this pattern is repeated in several places:

        nofs_flag = memalloc_nofs_save();                                        
        ctx = kvmalloc(struct_size(ctx, chunks, num_chunks), GFP_KERNEL);        
        memalloc_nofs_restore(nofs_flag);                                        

This needs to use the scoped API because vmalloc doesn't support
GFP_NOFS. So the poor "vmalloc needs scoped API" pattern is bleeding
over into other code that doesn't have the problems vmalloc does. Do
you see how this leads to poorly written code now?

Or perhaps I should just point at ceph?

/*
 * kvmalloc() doesn't fall back to the vmalloc allocator unless flags are
 * compatible with (a superset of) GFP_KERNEL.  This is because while the
 * actual pages are allocated with the specified flags, the page table pages
 * are always allocated with GFP_KERNEL.
 *
 * ceph_kvmalloc() may be called with GFP_KERNEL, GFP_NOFS or GFP_NOIO.
 */
void *ceph_kvmalloc(size_t size, gfp_t flags)
{
        void *p;

        if ((flags & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) {
                p = kvmalloc(size, flags);
        } else if ((flags & (__GFP_IO | __GFP_FS)) == __GFP_IO) {
                unsigned int nofs_flag = memalloc_nofs_save();
                p = kvmalloc(size, GFP_KERNEL);
                memalloc_nofs_restore(nofs_flag);
        } else {
                unsigned int noio_flag = memalloc_noio_save();
                p = kvmalloc(size, GFP_KERNEL);
                memalloc_noio_restore(noio_flag);
        }

        return p;
}

IOWs, a large number of the users of the scope API simply make
[k]vmalloc() provide GFP_NOFS behaviour. ceph_kvmalloc() is pretty
much a wrapper that indicates how all vmalloc functions should
behave. Honour GFP_NOFS and GFP_NOIO by using the scope API
internally.

> > That
> > would save us a lot of bother in XFS. What about GFP_DIRECT_RECLAIM?
> > I'd really like to turn that off for allocations in the XFS
> > transaction commit path (as noted already in this thread) because
> > direct reclaim that can make no progress is actively harmful (as
> > noted already in this thread)
> 
> As always if you have reasonable usecases then it is best to bring them
> up on the MM list and we can discuss them.

They've been pointed out many times in the past, and I've pointed
them out again in this thread. Telling me to "bring them up on the
mm list" when that's exactly what I'm doing right now is not a
helpful response.

> > Like I said - this is more than just bad documentation - the problem
> > is that the whole allocation API is an inconsistent mess of control
> > mechanisms to begin with...
> 
> I am not going to disagree. There is a lot of historical baggage and
> it doesn't help that any change is really hard to review because this
> interface is used throughout the kernel. I have tried to change some
> most obvious inconsistencies and I can tell this has always been a
> frustrating experience with a very small "reward" in the end because
> there are so many other problems.

Technical debt in the mm APIs is something the mm developers need to
address, not the people who tell you it's a problem for them.
Telling the messenger "do my job for me because I find it too
frustrating to make progress myself" doesn't help anyone make
progress. If you find it frustrating trying to get mm code changed,
imagine what it feels like for someone on the outside asking for
relatively basic things like a consistent control API....

> That being said, I would more than love to have a consistent and well
> defined interface and if you want to spend a lot of time on that then be
> my guest.

My point exactly: saying "fix it yourself" is not a good response....

> > > > It would seem to make sense for kvmalloc to WARN_ON if it is passed
> > > > flags that does not allow it to use vmalloc.
> > > 
> > > vmalloc is certainly not the hottest path in the kernel so I wouldn't be
> > > opposed.
> > 
> > kvmalloc is most certainly becoming one of the hottest paths in XFS.
> > IOWs, arguments that "vmalloc is not a hot path" are simply invalid
> > these days because they are simply untrue. e.g. the profiles I
> > posted in this thread...
> 
> Is it such a hot path that a check for compatible flags would be visible
> in profiles though?

No, that doesn't even show up as noise - the overhead of global
spinlock contention and direct reclaim are the elephants that
profiles point to, not a couple of flag checks on function
parameters...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
  2021-10-13  2:32                     ` Dave Chinner
@ 2021-10-13  8:26                       ` Michal Hocko
  2021-10-14 11:32                         ` David Sterba
  0 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2021-10-13  8:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: NeilBrown, Vlastimil Babka, Andrew Morton, Theodore Ts'o,
	Andreas Dilger, Darrick J. Wong, Matthew Wilcox, Mel Gorman,
	Jonathan Corbet, linux-xfs, linux-ext4, linux-fsdevel, linux-nfs,
	linux-mm, linux-kernel, linux-doc

On Wed 13-10-21 13:32:31, Dave Chinner wrote:
> On Mon, Oct 11, 2021 at 01:57:36PM +0200, Michal Hocko wrote:
> > On Sat 09-10-21 09:36:49, Dave Chinner wrote:
> > > On Fri, Oct 08, 2021 at 09:48:39AM +0200, Michal Hocko wrote:
> > > > > > > Even the API constaints of kvmalloc() w.r.t. only doing the vmalloc
> > > > > > > fallback if the gfp context is GFP_KERNEL - we already do GFP_NOFS
> > > > > > > kvmalloc via memalloc_nofs_save/restore(), so this behavioural
> > > > > > > restriction w.r.t. gfp flags just makes no sense at all.
> > > > > > 
> > > > > > GFP_NOFS (without using the scope API) has the same problem as NOFAIL in
> > > > > > the vmalloc. Hence it is not supported. If you use the scope API then
> > > > > > you can GFP_KERNEL for kvmalloc. This is clumsy but I am not sure how to
> > > > > > define these conditions in a more sensible way. Special case NOFS if the
> > > > > > scope api is in use? Why do you want an explicit NOFS then?
> > > 
> > > Exactly my point - this is clumsy and a total mess. I'm not asking
> > > for an explicit GFP_NOFS, just pointing out that the documented
> > > restrictions that "vmalloc can only do GFP_KERNEL allocations" is
> > > completely wrong.
> > > 
> > > vmalloc()
> > > {
> > > 	if (!(gfp_flags &  __GFP_FS))
> > > 		memalloc_nofs_save();
> > > 	p = __vmalloc(gfp_flags | GFP_KERNEL)
> > > 	if (!(gfp_flags &  __GFP_FS))
> > > 		memalloc_nofs_restore();
> > > }
> > > 
> > > Yup, that's how simple it is to support GFP_NOFS support in
> > > vmalloc().
> > 
> > Yes, this would work from the functionality POV but it defeats the
> > philosophy behind the scope API. Why would you even need this if the
> > scope was defined by the caller of the allocator?
> 
> Who actually cares that vmalloc might be using the scoped API
> internally to implement GFP_NOFS or GFP_NOIO? Nobody at all.
> It is far more useful (and self documenting!) for one-off allocations
> to pass a GFP_NOFS flag than it is to use a scope API...

I would agree with you if the explicit GFP_NOFS usage was consistent
and actually justified in the majority cases. My experience tells me
otherwise though. Many filesystems use the flag just because that is
easier. That leads to a huge overuse of the flag that leads to practical
problems.

I was hoping that if we offer an API that would define problematic
reclaim recursion scopes then it would reduce the abuse. I haven't
expected this to happen overnight but it is few years and it seems
it will not happen soon either.

[...]

> > > It also points out that the scope API is highly deficient.
> > > We can do GFP_NOFS via the scope API, but we can't
> > > do anything else because *there is no scope API for other GFP
> > > flags*.
> > > 
> > > Why don't we have a GFP_NOFAIL/__GFP_RETRY_FOREVER scope API?
> > 
> > NO{FS,IO} where first flags to start this approach. And I have to admit
> > the experiment was much less successful then I hoped for. There are
> > still thousands of direct NOFS users so for some reason defining scopes
> > is not an easy thing to do.
> > 
> > I am not against NOFAIL scopes in principle but seeing the nofs
> > "success" I am worried this will not go really well either and it is
> > much more tricky as NOFAIL has much stronger requirements than NOFS.
> > Just imagine how tricky this can be if you just call a library code
> > that is not under your control within a NOFAIL scope. What if that
> > library code decides to allocate (e.g. printk that would attempt to do
> > an optimistic NOWAIT allocation).
> 
> I already asked you that _exact_ question earlier in the thread
> w.r.t.  kvmalloc(GFP_NOFAIL) using optimistic NOWAIT kmalloc
> allocation. I asked you as a MM expert to define *and document* the
> behaviour that should result, not turn around and use the fact that
> it is undefined behaviour as a "this is too hard" excuse for not
> changing anything.

Dave, you have "thrown" a lot of complains in previous emails and it is
hard to tell rants from features requests apart. I am sorry but I
believe it would be much more productive to continue this discussion if
you could mild your tone.

Can I ask you to break down your feature requests into separate emails
so that we can discuss and track them separately rather in this quite a
long thread which has IMHO diverghed from the initial topic. Thanks!

> THe fact is that the scope APIs are only really useful for certain
> contexts where restrictions are set by higher level functionality.
> For one-off allocation constraints the API sucks and we end up with

Could you be more specific about these one-off allocation constrains?
What would be the reason to define one-off NO{FS,IO} allocation
constrain? Or did you have your NOFAIL example in mind?

> crap like this (found in btrfs):
> 
>                 /*                                                               
>                  * We're holding a transaction handle, so use a NOFS memory      
>                  * allocation context to avoid deadlock if reclaim happens.      
>                  */                                                              
>                 nofs_flag = memalloc_nofs_save();                                
>                 value = kmalloc(size, GFP_KERNEL);                               
>                 memalloc_nofs_restore(nofs_flag);                                

Yes this looks wrong indeed! If I were to review such a code I would ask
why the scope cannot match the transaction handle context. IIRC jbd does
that.

I am aware of these patterns. I was pulled in some discussions in the
past and in some it turned out that the constrain is not needed at all
and in some cases that has led to a proper scope definition. As you
point out in your other examples it just happens that it is easier to
go an easy path and define scopes ad-hoc to work around allocation
API limitations.

[...]

> IOWs, a large number of the users of the scope API simply make
> [k]vmalloc() provide GFP_NOFS behaviour. ceph_kvmalloc() is pretty
> much a wrapper that indicates how all vmalloc functions should
> behave. Honour GFP_NOFS and GFP_NOIO by using the scope API
> internally.

I was discouraging from this behavior at vmalloc level to push people
to use scopes properly - aka at the level where the reclaim recursion is
really a problem. If that is infeasible in practice then we can
re-evaluate of course. I was really hoping we can get rid of cargo cult
GFP_NOFS usage this way but the reality often disagrees with hopes.

All that being said, let's discuss [k]vmalloc constrains and usecases
that need changes in a separate email thread.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
  2021-10-13  8:26                       ` Michal Hocko
@ 2021-10-14 11:32                         ` David Sterba
  2021-10-14 11:46                           ` Michal Hocko
  0 siblings, 1 reply; 25+ messages in thread
From: David Sterba @ 2021-10-14 11:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Chinner, NeilBrown, Vlastimil Babka, Andrew Morton,
	Theodore Ts'o, Andreas Dilger, Darrick J. Wong,
	Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs,
	linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel,
	linux-doc

On Wed, Oct 13, 2021 at 10:26:58AM +0200, Michal Hocko wrote:
> > crap like this (found in btrfs):
> > 
> >                 /*                                                               
> >                  * We're holding a transaction handle, so use a NOFS memory      
> >                  * allocation context to avoid deadlock if reclaim happens.      
> >                  */                                                              
> >                 nofs_flag = memalloc_nofs_save();                                
> >                 value = kmalloc(size, GFP_KERNEL);                               
> >                 memalloc_nofs_restore(nofs_flag);                                
> 
> Yes this looks wrong indeed! If I were to review such a code I would ask
> why the scope cannot match the transaction handle context. IIRC jbd does
> that.

Adding the transaction start/end as the NOFS scope is a long term plan
and going on for years, because it's not a change we would need in
btrfs, but rather a favor to MM to switch away from "GFP_NOFS everywhere
because it's easy".

The first step was to convert the easy cases. Almost all safe cases
switching GFP_NOFS to GFP_KERNEL have happened. Another step is to
convert GFP_NOFS to memalloc_nofs_save/GFP_KERNEL/memalloc_nofs_restore
in contexts where we know we'd rely on the transaction NOFS scope in the
future. Once this is implemented, the memalloc_nofs_* calls are deleted
and it works as expected.  Now you may argue that the switch could be
changing GFP_NOFS to GFP_KERNEL at that time but that is not that easy
to review or reason about in the whole transaction context in all
allocations.

This leads to code that was found in __btrfs_set_acl and called crap
or wrong, because perhaps the background and the bigger plan is not
immediately obvious. I hope the explanation above it puts it to the
right perspective.

The other class of scoped NOFS protection is around vmalloc-based
allocations but that's for a different reason, would be solved by the
same transaction start/end conversion as well.

I'm working on that from time to time but this usually gets pushed down
in the todo list. It's changing a lot of code, from what I've researched
so far cannot be done at once and would probably introduce bugs hard to
hit because of the external conditions (allocator, system load, ...).

I have a plan to do that incrementally, adding assertions and converting
functions in small batches to be able to catch bugs early, but I'm not
exactly thrilled to start such endeavour in addition to normal
development bug hunting.

To get things moving again, I've refreshed the patch adding stubs and
will try to find the best timing for merg to avoid patch conflicts, but
no promises.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL
  2021-10-14 11:32                         ` David Sterba
@ 2021-10-14 11:46                           ` Michal Hocko
  0 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2021-10-14 11:46 UTC (permalink / raw)
  To: David Sterba
  Cc: Dave Chinner, NeilBrown, Vlastimil Babka, Andrew Morton,
	Theodore Ts'o, Andreas Dilger, Darrick J. Wong,
	Matthew Wilcox, Mel Gorman, Jonathan Corbet, linux-xfs,
	linux-ext4, linux-fsdevel, linux-nfs, linux-mm, linux-kernel,
	linux-doc

On Thu 14-10-21 13:32:01, David Sterba wrote:
> On Wed, Oct 13, 2021 at 10:26:58AM +0200, Michal Hocko wrote:
> > > crap like this (found in btrfs):
> > > 
> > >                 /*                                                               
> > >                  * We're holding a transaction handle, so use a NOFS memory      
> > >                  * allocation context to avoid deadlock if reclaim happens.      
> > >                  */                                                              
> > >                 nofs_flag = memalloc_nofs_save();                                
> > >                 value = kmalloc(size, GFP_KERNEL);                               
> > >                 memalloc_nofs_restore(nofs_flag);                                
> > 
> > Yes this looks wrong indeed! If I were to review such a code I would ask
> > why the scope cannot match the transaction handle context. IIRC jbd does
> > that.
> 
> Adding the transaction start/end as the NOFS scope is a long term plan
> and going on for years, because it's not a change we would need in
> btrfs, but rather a favor to MM to switch away from "GFP_NOFS everywhere
> because it's easy".
> 
> The first step was to convert the easy cases. Almost all safe cases
> switching GFP_NOFS to GFP_KERNEL have happened. Another step is to
> convert GFP_NOFS to memalloc_nofs_save/GFP_KERNEL/memalloc_nofs_restore
> in contexts where we know we'd rely on the transaction NOFS scope in the
> future. Once this is implemented, the memalloc_nofs_* calls are deleted
> and it works as expected.  Now you may argue that the switch could be
> changing GFP_NOFS to GFP_KERNEL at that time but that is not that easy
> to review or reason about in the whole transaction context in all
> allocations.
> 
> This leads to code that was found in __btrfs_set_acl and called crap
> or wrong, because perhaps the background and the bigger plan is not
> immediately obvious. I hope the explanation above it puts it to the
> right perspective.

Yes it helps. Thanks for the clarification because this is far from
obvious and changelogs I've checked do not mention this high level plan.
I would have gone with a /* TODO: remove me once transactions use scopes... */
but this is obviously your call.

> 
> The other class of scoped NOFS protection is around vmalloc-based
> allocations but that's for a different reason, would be solved by the
> same transaction start/end conversion as well.
> 
> I'm working on that from time to time but this usually gets pushed down
> in the todo list. It's changing a lot of code, from what I've researched
> so far cannot be done at once and would probably introduce bugs hard to
> hit because of the external conditions (allocator, system load, ...).
> 
> I have a plan to do that incrementally, adding assertions and converting
> functions in small batches to be able to catch bugs early, but I'm not
> exactly thrilled to start such endeavour in addition to normal
> development bug hunting.
> 
> To get things moving again, I've refreshed the patch adding stubs and
> will try to find the best timing for merg to avoid patch conflicts, but
> no promises.

Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2021-10-14 22:28 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-17  2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown
2021-09-17  2:56 ` [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() NeilBrown
2021-09-17  2:56 ` [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() NeilBrown
2021-09-17 21:45   ` Dave Chinner
2021-09-17  2:56 ` [PATCH 1/6] MM: Support __GFP_NOFAIL in alloc_pages_bulk_*() and improve doco NeilBrown
2021-09-17 14:42   ` Mel Gorman
2021-09-20 23:48     ` NeilBrown
2021-10-05  9:16     ` Vlastimil Babka
2021-09-17  2:56 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown
2021-09-17  2:56 ` [PATCH 4/6] EXT4: remove congestion_wait from ext4_bio_write_page, and simplify NeilBrown
2021-09-17  2:56 ` [PATCH 2/6] MM: improve documentation for __GFP_NOFAIL NeilBrown
2021-10-05  9:20   ` Vlastimil Babka
2021-10-05 11:09     ` Michal Hocko
2021-10-05 12:27       ` Vlastimil Babka
2021-10-06 23:14         ` Dave Chinner
2021-10-07 10:07           ` Michal Hocko
2021-10-07 23:15             ` NeilBrown
2021-10-08  7:48               ` Michal Hocko
2021-10-08 22:36                 ` Dave Chinner
2021-10-11 11:57                   ` Michal Hocko
2021-10-11 21:49                     ` NeilBrown
2021-10-13  2:32                     ` Dave Chinner
2021-10-13  8:26                       ` Michal Hocko
2021-10-14 11:32                         ` David Sterba
2021-10-14 11:46                           ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).