* [PATCH 0/6 v2] Discard __GFP_ATOMIC
@ 2023-01-09 15:16 Mel Gorman
  2023-01-09 15:16 ` [PATCH 1/7] mm/page_alloc: Rename ALLOC_HIGH to ALLOC_MIN_RESERVE Mel Gorman
                   ` (6 more replies)
  0 siblings, 7 replies; 34+ messages in thread
From: Mel Gorman @ 2023-01-09 15:16 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrew Morton, Michal Hocko, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML, Mel Gorman

Changelog since v1
o Split one patch						(vbabka)
o Improve OOM reserve handling					(vbabka)
o Fix __GFP_RECLAIM vs __GFP_DIRECT_RECLAIM			(vbabka)

Neil's patch has been residing in mm-unstable as commit 2fafb4fe8f7a
("mm: discard __GFP_ATOMIC") for a long time and was recently brought up
again. Most recently, I was worried that __GFP_HIGH allocations could
use the high-order atomic reserves, which is unintentional, but there was
no response, so let's revisit -- this series reworks how the min reserves
are used, protects the high-order reserves and then finishes with Neil's
patch with very minor modifications so that it fits on top.

There was a review discussion on renaming __GFP_DIRECT_RECLAIM to
__GFP_ALLOW_BLOCKING but I didn't think it was that big an issue and it
is orthogonal to the removal of __GFP_ATOMIC.

There were some concerns about how the gfp flags affect the min reserves
but the discussion never reached a solid conclusion so I made my own attempt.

The series tries to iron out some of the details of how the reserves are
used. ALLOC_HIGH becomes ALLOC_MIN_RESERVE, ALLOC_HARDER becomes
ALLOC_NON_BLOCK and the series documents how the reserves are affected.
For example, ALLOC_NON_BLOCK (no direct reclaim) on its own allows 25% of
the min reserve, ALLOC_MIN_RESERVE (__GFP_HIGH) allows 50% and both
combined allow deeper access again (62.5%). ALLOC_OOM allows access to 75%.
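
To make the percentages concrete, below is a minimal stand-alone sketch
(userspace C, not kernel code; the 1000-page min watermark and the helper
name are assumptions for illustration) of the watermark arithmetic that
patches 4 and 6 introduce. ALLOC_OOM is left out of the sketch.

#include <stdio.h>

#define ALLOC_NON_BLOCK         0x10    /* caller cannot enter direct reclaim */
#define ALLOC_MIN_RESERVE       0x20    /* __GFP_HIGH */

/* Effective watermark a request must stay above */
static long min_threshold(long min, unsigned int alloc_flags)
{
        if (alloc_flags & ALLOC_MIN_RESERVE)
                min -= min / 2;
        if (alloc_flags & ALLOC_NON_BLOCK)
                min -= min / 4;
        return min;
}

int main(void)
{
        long min = 1000;        /* assumed min watermark in pages */

        /* 750 -> 25% of the min reserve is usable */
        printf("ALLOC_NON_BLOCK only   : %ld\n", min_threshold(min, ALLOC_NON_BLOCK));
        /* 500 -> 50% usable */
        printf("ALLOC_MIN_RESERVE only : %ld\n", min_threshold(min, ALLOC_MIN_RESERVE));
        /* 375 -> 62.5% usable */
        printf("both combined          : %ld\n", min_threshold(min, ALLOC_MIN_RESERVE | ALLOC_NON_BLOCK));

        return 0;
}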

High-order atomic allocations are explicitly handled, with the caveat that,
as there is no __GFP_ATOMIC flag, any high-order allocation that specifies
__GFP_HIGH and cannot enter direct reclaim will be treated as if it were
GFP_ATOMIC.

 Documentation/mm/balance.rst   |   2 +-
 drivers/iommu/tegra-smmu.c     |   4 +-
 include/linux/gfp_types.h      |  12 ++--
 include/trace/events/mmflags.h |   1 -
 lib/test_printf.c              |   8 +--
 mm/internal.h                  |  15 ++++-
 mm/page_alloc.c                | 103 ++++++++++++++++++++-------------
 tools/perf/builtin-kmem.c      |   1 -
 8 files changed, 86 insertions(+), 60 deletions(-)

-- 
2.35.3


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/7] mm/page_alloc: Rename ALLOC_HIGH to ALLOC_MIN_RESERVE
  2023-01-09 15:16 [PATCH 0/6 v2] Discard __GFP_ATOMIC Mel Gorman
@ 2023-01-09 15:16 ` Mel Gorman
  2023-01-11 15:18   ` Michal Hocko
  2023-01-09 15:16 ` [PATCH 2/7] mm/page_alloc: Treat RT tasks similar to __GFP_HIGH Mel Gorman
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2023-01-09 15:16 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrew Morton, Michal Hocko, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML, Mel Gorman

__GFP_HIGH aliases to ALLOC_HIGH but the name does not really hint
what it means. As ALLOC_HIGH is internal to the allocator, rename
it to ALLOC_MIN_RESERVE to document that the min reserves can
be depleted.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/internal.h   | 4 +++-
 mm/page_alloc.c | 8 ++++----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index bcf75a8b032d..403e4386626d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -736,7 +736,9 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #endif
 
 #define ALLOC_HARDER		 0x10 /* try to alloc harder */
-#define ALLOC_HIGH		 0x20 /* __GFP_HIGH set */
+#define ALLOC_MIN_RESERVE	 0x20 /* __GFP_HIGH set. Allow access to 50%
+				       * of the min watermark.
+				       */
 #define ALLOC_CPUSET		 0x40 /* check for correct cpuset */
 #define ALLOC_CMA		 0x80 /* allow allocations from CMA areas */
 #ifdef CONFIG_ZONE_DMA32
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0745aedebb37..244c1e675dc8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3976,7 +3976,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	/* free_pages may go negative - that's OK */
 	free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
 
-	if (alloc_flags & ALLOC_HIGH)
+	if (alloc_flags & ALLOC_MIN_RESERVE)
 		min -= min / 2;
 
 	if (unlikely(alloc_harder)) {
@@ -4818,18 +4818,18 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	unsigned int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
 
 	/*
-	 * __GFP_HIGH is assumed to be the same as ALLOC_HIGH
+	 * __GFP_HIGH is assumed to be the same as ALLOC_MIN_RESERVE
 	 * and __GFP_KSWAPD_RECLAIM is assumed to be the same as ALLOC_KSWAPD
 	 * to save two branches.
 	 */
-	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
+	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_MIN_RESERVE);
 	BUILD_BUG_ON(__GFP_KSWAPD_RECLAIM != (__force gfp_t) ALLOC_KSWAPD);
 
 	/*
 	 * The caller may dip into page reserves a bit more if the caller
 	 * cannot run direct reclaim, or if the caller has realtime scheduling
 	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_HIGH (__GFP_HIGH).
+	 * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_MIN_RESERVE(__GFP_HIGH).
 	 */
 	alloc_flags |= (__force int)
 		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/7] mm/page_alloc: Treat RT tasks similar to __GFP_HIGH
  2023-01-09 15:16 [PATCH 0/6 v2] Discard __GFP_ATOMIC Mel Gorman
  2023-01-09 15:16 ` [PATCH 1/7] mm/page_alloc: Rename ALLOC_HIGH to ALLOC_MIN_RESERVE Mel Gorman
@ 2023-01-09 15:16 ` Mel Gorman
  2023-01-11 15:27   ` Michal Hocko
  2023-01-09 15:16 ` [PATCH 3/7] mm/page_alloc: Explicitly record high-order atomic allocations in alloc_flags Mel Gorman
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2023-01-09 15:16 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrew Morton, Michal Hocko, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML, Mel Gorman

RT tasks are allowed to dip below the min reserve but ALLOC_HARDER is
typically combined with ALLOC_MIN_RESERVE so RT tasks are a little
unusual. While there is some justification for allowing RT tasks
access to memory reserves, there is a strong chance that an RT task
that is also under memory pressure is at risk of missing deadlines
anyway. Relax how much of the reserves an RT task can access by treating
it the same as __GFP_HIGH allocations.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 244c1e675dc8..0040b4e00913 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4847,7 +4847,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 		 */
 		alloc_flags &= ~ALLOC_CPUSET;
 	} else if (unlikely(rt_task(current)) && in_task())
-		alloc_flags |= ALLOC_HARDER;
+		alloc_flags |= ALLOC_MIN_RESERVE;
 
 	alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);
 
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/7] mm/page_alloc: Explicitly record high-order atomic allocations in alloc_flags
  2023-01-09 15:16 [PATCH 0/6 v2] Discard __GFP_ATOMIC Mel Gorman
  2023-01-09 15:16 ` [PATCH 1/7] mm/page_alloc: Rename ALLOC_HIGH to ALLOC_MIN_RESERVE Mel Gorman
  2023-01-09 15:16 ` [PATCH 2/7] mm/page_alloc: Treat RT tasks similar to __GFP_HIGH Mel Gorman
@ 2023-01-09 15:16 ` Mel Gorman
  2023-01-10 15:28   ` Vlastimil Babka
  2023-01-11 15:36   ` Michal Hocko
  2023-01-09 15:16 ` [PATCH 4/7] mm/page_alloc: Explicitly define what alloc flags deplete min reserves Mel Gorman
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 34+ messages in thread
From: Mel Gorman @ 2023-01-09 15:16 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrew Morton, Michal Hocko, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML, Mel Gorman

A high-order ALLOC_HARDER allocation is assumed to be atomic. While that
is accurate, it changes later in the series. In preparation, explicitly
record high-order atomic allocations in gfp_to_alloc_flags(). There is
a slight functional change in that OOM handling avoids using high-order
reserve until it has to.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/internal.h   |  1 +
 mm/page_alloc.c | 29 +++++++++++++++++++++++------
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 403e4386626d..178484d9fd94 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -746,6 +746,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #else
 #define ALLOC_NOFRAGMENT	  0x0
 #endif
+#define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
 
 enum ttu_flags;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0040b4e00913..0ef4f3236a5a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3706,10 +3706,20 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 		 * reserved for high-order atomic allocation, so order-0
 		 * request should skip it.
 		 */
-		if (order > 0 && alloc_flags & ALLOC_HARDER)
+		if (alloc_flags & ALLOC_HIGHATOMIC)
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 		if (!page) {
 			page = __rmqueue(zone, order, migratetype, alloc_flags);
+
+			/*
+			 * If the allocation fails, allow OOM handling access
+			 * to HIGHATOMIC reserves as failing now is worse than
+			 * failing a high-order atomic allocation in the
+			 * future.
+			 */
+			if (!page && (alloc_flags & ALLOC_OOM))
+				page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+
 			if (!page) {
 				spin_unlock_irqrestore(&zone->lock, flags);
 				return NULL;
@@ -4023,8 +4033,10 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			return true;
 		}
 #endif
-		if (alloc_harder && !free_area_empty(area, MIGRATE_HIGHATOMIC))
+		if ((alloc_flags & (ALLOC_HIGHATOMIC|ALLOC_OOM)) &&
+		    !free_area_empty(area, MIGRATE_HIGHATOMIC)) {
 			return true;
+		}
 	}
 	return false;
 }
@@ -4286,7 +4298,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			 * If this is a high-order atomic allocation then check
 			 * if the pageblock should be reserved for the future
 			 */
-			if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
+			if (unlikely(alloc_flags & ALLOC_HIGHATOMIC))
 				reserve_highatomic_pageblock(page, zone, order);
 
 			return page;
@@ -4813,7 +4825,7 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
 }
 
 static inline unsigned int
-gfp_to_alloc_flags(gfp_t gfp_mask)
+gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 {
 	unsigned int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
 
@@ -4839,8 +4851,13 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
 		 * if it can't schedule.
 		 */
-		if (!(gfp_mask & __GFP_NOMEMALLOC))
+		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 			alloc_flags |= ALLOC_HARDER;
+
+			if (order > 0)
+				alloc_flags |= ALLOC_HIGHATOMIC;
+		}
+
 		/*
 		 * Ignore cpuset mems for GFP_ATOMIC rather than fail, see the
 		 * comment for __cpuset_node_allowed().
@@ -5048,7 +5065,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * kswapd needs to be woken up, and to avoid the cost of setting up
 	 * alloc_flags precisely. So we do that now.
 	 */
-	alloc_flags = gfp_to_alloc_flags(gfp_mask);
+	alloc_flags = gfp_to_alloc_flags(gfp_mask, order);
 
 	/*
 	 * We need to recalculate the starting point for the zonelist iterator
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 4/7] mm/page_alloc: Explicitly define what alloc flags deplete min reserves
  2023-01-09 15:16 [PATCH 0/6 v2] Discard __GFP_ATOMIC Mel Gorman
                   ` (2 preceding siblings ...)
  2023-01-09 15:16 ` [PATCH 3/7] mm/page_alloc: Explicitly record high-order atomic allocations in alloc_flags Mel Gorman
@ 2023-01-09 15:16 ` Mel Gorman
  2023-01-11 14:04   ` Vlastimil Babka
  2023-01-11 15:37   ` Michal Hocko
  2023-01-09 15:16 ` [PATCH 5/7] mm/page_alloc.c: Allow __GFP_NOFAIL requests deeper access to reserves Mel Gorman
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 34+ messages in thread
From: Mel Gorman @ 2023-01-09 15:16 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrew Morton, Michal Hocko, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML, Mel Gorman

As there are more ALLOC_ flags that affect reserves, define what flags
affect reserves and clarify the effect of each flag.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/internal.h   |  3 +++
 mm/page_alloc.c | 34 ++++++++++++++++++++++------------
 2 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 178484d9fd94..8706d46863df 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -749,6 +749,9 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
 
+/* Flags that allow allocations below the min watermark. */
+#define ALLOC_RESERVES (ALLOC_HARDER|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
+
 enum ttu_flags;
 struct tlbflush_unmap_batch;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0ef4f3236a5a..6f41b84a97ac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3949,15 +3949,14 @@ ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);
 static inline long __zone_watermark_unusable_free(struct zone *z,
 				unsigned int order, unsigned int alloc_flags)
 {
-	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
 	long unusable_free = (1 << order) - 1;
 
 	/*
-	 * If the caller does not have rights to ALLOC_HARDER then subtract
-	 * the high-atomic reserves. This will over-estimate the size of the
-	 * atomic reserve but it avoids a search.
+	 * If the caller does not have rights to reserves below the min
+	 * watermark then subtract the high-atomic reserves. This will
+	 * over-estimate the size of the atomic reserve but it avoids a search.
 	 */
-	if (likely(!alloc_harder))
+	if (likely(!(alloc_flags & ALLOC_RESERVES)))
 		unusable_free += z->nr_reserved_highatomic;
 
 #ifdef CONFIG_CMA
@@ -3981,25 +3980,36 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 {
 	long min = mark;
 	int o;
-	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
 
 	/* free_pages may go negative - that's OK */
 	free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
 
-	if (alloc_flags & ALLOC_MIN_RESERVE)
-		min -= min / 2;
+	if (unlikely(alloc_flags & ALLOC_RESERVES)) {
+		/*
+		 * __GFP_HIGH allows access to 50% of the min reserve as well
+		 * as OOM.
+		 */
+		if (alloc_flags & ALLOC_MIN_RESERVE)
+			min -= min / 2;
 
-	if (unlikely(alloc_harder)) {
 		/*
-		 * OOM victims can try even harder than normal ALLOC_HARDER
+		 * Non-blocking allocations can access some of the reserve
+		 * with more access if also __GFP_HIGH. The reasoning is that
+		 * a non-blocking caller may incur a more severe penalty
+		 * if it cannot get memory quickly, particularly if it's
+		 * also __GFP_HIGH.
+		 */
+		if (alloc_flags & ALLOC_HARDER)
+			min -= min / 4;
+
+		/*
+		 * OOM victims can try even harder than the normal reserve
 		 * users on the grounds that it's definitely going to be in
 		 * the exit path shortly and free memory. Any allocation it
 		 * makes during the free path will be small and short-lived.
 		 */
 		if (alloc_flags & ALLOC_OOM)
 			min -= min / 2;
-		else
-			min -= min / 4;
 	}
 
 	/*
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 5/7] mm/page_alloc.c: Allow __GFP_NOFAIL requests deeper access to reserves
  2023-01-09 15:16 [PATCH 0/6 v2] Discard __GFP_ATOMIC Mel Gorman
                   ` (3 preceding siblings ...)
  2023-01-09 15:16 ` [PATCH 4/7] mm/page_alloc: Explicitly define what alloc flags deplete min reserves Mel Gorman
@ 2023-01-09 15:16 ` Mel Gorman
  2023-01-11 14:05   ` Vlastimil Babka
  2023-01-11 15:46   ` Michal Hocko
  2023-01-09 15:16 ` [PATCH 6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations " Mel Gorman
  2023-01-09 15:16 ` [PATCH 7/7] mm: discard __GFP_ATOMIC Mel Gorman
  6 siblings, 2 replies; 34+ messages in thread
From: Mel Gorman @ 2023-01-09 15:16 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrew Morton, Michal Hocko, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML, Mel Gorman

Currently __GFP_NOFAIL allocations without any other flags can access 25%
of the reserves but these requests imply that the system cannot make forward
progress until the allocation succeeds. Allow __GFP_NOFAIL access to 75%
of the min reserve.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6f41b84a97ac..d2df78f5baa2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5308,7 +5308,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 * could deplete whole memory reserves which would just make
 		 * the situation worse
 		 */
-		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
+		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_MIN_RESERVE|ALLOC_HARDER, ac);
 		if (page)
 			goto got_pg;
 
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations access to reserves
  2023-01-09 15:16 [PATCH 0/6 v2] Discard __GFP_ATOMIC Mel Gorman
                   ` (4 preceding siblings ...)
  2023-01-09 15:16 ` [PATCH 5/7] mm/page_alloc.c: Allow __GFP_NOFAIL requests deeper access to reserves Mel Gorman
@ 2023-01-09 15:16 ` Mel Gorman
  2023-01-11 14:12   ` Vlastimil Babka
  2023-01-11 15:58   ` Michal Hocko
  2023-01-09 15:16 ` [PATCH 7/7] mm: discard __GFP_ATOMIC Mel Gorman
  6 siblings, 2 replies; 34+ messages in thread
From: Mel Gorman @ 2023-01-09 15:16 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrew Morton, Michal Hocko, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML, Mel Gorman

Explicit GFP_ATOMIC allocations get flagged ALLOC_HARDER which is a bit
vague. In preparation for removing __GFP_ATOMIC, give GFP_ATOMIC and
other non-blocking allocation requests equal access to reserves.  Rename
ALLOC_HARDER to ALLOC_NON_BLOCK to make it more clear what the flag
means.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/internal.h   |  7 +++++--
 mm/page_alloc.c | 23 +++++++++++++----------
 2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 8706d46863df..23a37588073a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -735,7 +735,10 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_OOM		ALLOC_NO_WATERMARKS
 #endif
 
-#define ALLOC_HARDER		 0x10 /* try to alloc harder */
+#define ALLOC_NON_BLOCK		 0x10 /* Caller cannot block. Allow access
+				       * to 25% of the min watermark or
+				       * 62.5% if __GFP_HIGH is set.
+				       */
 #define ALLOC_MIN_RESERVE	 0x20 /* __GFP_HIGH set. Allow access to 50%
 				       * of the min watermark.
 				       */
@@ -750,7 +753,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
 
 /* Flags that allow allocations below the min watermark. */
-#define ALLOC_RESERVES (ALLOC_HARDER|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
+#define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
 
 enum ttu_flags;
 struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d2df78f5baa2..2217bab2dbb2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3999,7 +3999,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 		 * if it cannot get memory quickly, particularly if it's
 		 * also __GFP_HIGH.
 		 */
-		if (alloc_flags & ALLOC_HARDER)
+		if (alloc_flags & ALLOC_NON_BLOCK)
 			min -= min / 4;
 
 		/*
@@ -4851,28 +4851,30 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 	 * The caller may dip into page reserves a bit more if the caller
 	 * cannot run direct reclaim, or if the caller has realtime scheduling
 	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_MIN_RESERVE(__GFP_HIGH).
+	 * set both ALLOC_NON_BLOCK and ALLOC_MIN_RESERVE(__GFP_HIGH).
 	 */
 	alloc_flags |= (__force int)
 		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
 
-	if (gfp_mask & __GFP_ATOMIC) {
+	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
 		/*
 		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
 		 * if it can't schedule.
 		 */
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
-			alloc_flags |= ALLOC_HARDER;
+			alloc_flags |= ALLOC_NON_BLOCK;
 
 			if (order > 0)
 				alloc_flags |= ALLOC_HIGHATOMIC;
 		}
 
 		/*
-		 * Ignore cpuset mems for GFP_ATOMIC rather than fail, see the
-		 * comment for __cpuset_node_allowed().
+		 * Ignore cpuset mems for non-blocking __GFP_HIGH (probably
+		 * GFP_ATOMIC) rather than fail, see the comment for
+		 * __cpuset_node_allowed().
 		 */
-		alloc_flags &= ~ALLOC_CPUSET;
+		if (alloc_flags & ALLOC_MIN_RESERVE)
+			alloc_flags &= ~ALLOC_CPUSET;
 	} else if (unlikely(rt_task(current)) && in_task())
 		alloc_flags |= ALLOC_MIN_RESERVE;
 
@@ -5304,11 +5306,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 		/*
 		 * Help non-failing allocations by giving them access to memory
-		 * reserves but do not use ALLOC_NO_WATERMARKS because this
+		 * reserves normally used for high priority non-blocking
+		 * allocations but do not use ALLOC_NO_WATERMARKS because this
 		 * could deplete whole memory reserves which would just make
-		 * the situation worse
+		 * the situation worse.
 		 */
-		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_MIN_RESERVE|ALLOC_HARDER, ac);
+		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_MIN_RESERVE|ALLOC_NON_BLOCK, ac);
 		if (page)
 			goto got_pg;
 
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 7/7] mm: discard __GFP_ATOMIC
  2023-01-09 15:16 [PATCH 0/6 v2] Discard __GFP_ATOMIC Mel Gorman
                   ` (5 preceding siblings ...)
  2023-01-09 15:16 ` [PATCH 6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations " Mel Gorman
@ 2023-01-09 15:16 ` Mel Gorman
  2023-01-12  8:12   ` Michal Hocko
  6 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2023-01-09 15:16 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrew Morton, Michal Hocko, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML, Mel Gorman

From: NeilBrown <neilb@suse.de>

__GFP_ATOMIC serves little purpose.  Its main effect is to set
ALLOC_HARDER which adds a few little boosts to increase the chance of an
allocation succeeding, one of which is to lower the water-mark at which it
will succeed.

It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
adjusts this watermark.  It is probable that other users of __GFP_HIGH
should benefit from the other little bonuses that __GFP_ATOMIC gets.

__GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM.
There is little point to this.  We already get a might_sleep() warning if
__GFP_DIRECT_RECLAIM is set.

__GFP_ATOMIC allows the "watermark_boost" to be side-stepped.  It is
probable that testing ALLOC_HARDER is a better fit here.

__GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
sleep.  This should test __GFP_DIRECT_RECLAIM instead.

This patch:
 - removes __GFP_ATOMIC
 - allows __GFP_HIGH allocations to ignore watermark boosting as well
   as GFP_ATOMIC requests.
 - makes other adjustments as suggested by the above.

The net result is no change to GFP_ATOMIC allocations.  Other
allocations that use __GFP_HIGH will benefit from a few different extra
privileges.  This affects:
  xen, dm, md, ntfs3
  the vermillion frame buffer
  hibernation
  ksm
  swap
all of which likely produce more benefit than cost if these selected
allocations are more likely to succeed quickly.

[mgorman: Minor adjustments to rework on top of a series]
Link: https://lkml.kernel.org/r/163712397076.13692.4727608274002939094@noble.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 Documentation/mm/balance.rst   |  2 +-
 drivers/iommu/tegra-smmu.c     |  4 ++--
 include/linux/gfp_types.h      | 12 ++++--------
 include/trace/events/mmflags.h |  1 -
 lib/test_printf.c              |  8 ++++----
 mm/internal.h                  |  2 +-
 mm/page_alloc.c                | 13 +++----------
 tools/perf/builtin-kmem.c      |  1 -
 8 files changed, 15 insertions(+), 28 deletions(-)

diff --git a/Documentation/mm/balance.rst b/Documentation/mm/balance.rst
index 6a1fadf3e173..e38e9d83c1c7 100644
--- a/Documentation/mm/balance.rst
+++ b/Documentation/mm/balance.rst
@@ -6,7 +6,7 @@ Memory Balancing
 
 Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
 
-Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
+Memory balancing is needed for !__GFP_HIGH and !__GFP_KSWAPD_RECLAIM as
 well as for non __GFP_IO allocations.
 
 The first reason why a caller may avoid reclaim is that the caller can not
diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index 5b1af40221ec..af8d0e685260 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -671,12 +671,12 @@ static struct page *as_get_pde_page(struct tegra_smmu_as *as,
 	 * allocate page in a sleeping context if GFP flags permit. Hence
 	 * spinlock needs to be unlocked and re-locked after allocation.
 	 */
-	if (!(gfp & __GFP_ATOMIC))
+	if (gfpflags_allow_blocking(gfp))
 		spin_unlock_irqrestore(&as->lock, *flags);
 
 	page = alloc_page(gfp | __GFP_DMA | __GFP_ZERO);
 
-	if (!(gfp & __GFP_ATOMIC))
+	if (gfpflags_allow_blocking(gfp))
 		spin_lock_irqsave(&as->lock, *flags);
 
 	/*
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index d88c46ca82e1..5088637fe5c2 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -31,7 +31,7 @@ typedef unsigned int __bitwise gfp_t;
 #define ___GFP_IO		0x40u
 #define ___GFP_FS		0x80u
 #define ___GFP_ZERO		0x100u
-#define ___GFP_ATOMIC		0x200u
+/* 0x200u unused */
 #define ___GFP_DIRECT_RECLAIM	0x400u
 #define ___GFP_KSWAPD_RECLAIM	0x800u
 #define ___GFP_WRITE		0x1000u
@@ -116,11 +116,8 @@ typedef unsigned int __bitwise gfp_t;
  *
  * %__GFP_HIGH indicates that the caller is high-priority and that granting
  * the request is necessary before the system can make forward progress.
- * For example, creating an IO context to clean pages.
- *
- * %__GFP_ATOMIC indicates that the caller cannot reclaim or sleep and is
- * high priority. Users are typically interrupt handlers. This may be
- * used in conjunction with %__GFP_HIGH
+ * For example creating an IO context to clean pages and requests
+ * from atomic context.
  *
  * %__GFP_MEMALLOC allows access to all memory. This should only be used when
  * the caller guarantees the allocation will allow more memory to be freed
@@ -135,7 +132,6 @@ typedef unsigned int __bitwise gfp_t;
  * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
  * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
  */
-#define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)
 #define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)
 #define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)
 #define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC)
@@ -329,7 +325,7 @@ typedef unsigned int __bitwise gfp_t;
  * version does not attempt reclaim/compaction at all and is by default used
  * in page fault path, while the non-light is used by khugepaged.
  */
-#define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
+#define GFP_ATOMIC	(__GFP_HIGH|__GFP_KSWAPD_RECLAIM)
 #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
 #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
 #define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 412b5a46374c..9db52bc4ce19 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -31,7 +31,6 @@
 	gfpflag_string(__GFP_HIGHMEM),		\
 	gfpflag_string(GFP_DMA32),		\
 	gfpflag_string(__GFP_HIGH),		\
-	gfpflag_string(__GFP_ATOMIC),		\
 	gfpflag_string(__GFP_IO),		\
 	gfpflag_string(__GFP_FS),		\
 	gfpflag_string(__GFP_NOWARN),		\
diff --git a/lib/test_printf.c b/lib/test_printf.c
index d34dc636b81c..46b4e6c414a3 100644
--- a/lib/test_printf.c
+++ b/lib/test_printf.c
@@ -674,17 +674,17 @@ flags(void)
 	gfp = GFP_ATOMIC|__GFP_DMA;
 	test("GFP_ATOMIC|GFP_DMA", "%pGg", &gfp);
 
-	gfp = __GFP_ATOMIC;
-	test("__GFP_ATOMIC", "%pGg", &gfp);
+	gfp = __GFP_HIGH;
+	test("__GFP_HIGH", "%pGg", &gfp);
 
 	/* Any flags not translated by the table should remain numeric */
 	gfp = ~__GFP_BITS_MASK;
 	snprintf(cmp_buffer, BUF_SIZE, "%#lx", (unsigned long) gfp);
 	test(cmp_buffer, "%pGg", &gfp);
 
-	snprintf(cmp_buffer, BUF_SIZE, "__GFP_ATOMIC|%#lx",
+	snprintf(cmp_buffer, BUF_SIZE, "__GFP_HIGH|%#lx",
 							(unsigned long) gfp);
-	gfp |= __GFP_ATOMIC;
+	gfp |= __GFP_HIGH;
 	test(cmp_buffer, "%pGg", &gfp);
 
 	kfree(cmp_buffer);
diff --git a/mm/internal.h b/mm/internal.h
index 23a37588073a..71b1111427f3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -24,7 +24,7 @@ struct folio_batch;
 #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_NOFAIL|\
 			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC|\
-			__GFP_ATOMIC|__GFP_NOLOCKDEP)
+			__GFP_NOLOCKDEP)
 
 /* The GFP flags allowed during early boot */
 #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2217bab2dbb2..7244ab522028 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4086,13 +4086,14 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 	if (__zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
 					free_pages))
 		return true;
+
 	/*
-	 * Ignore watermark boosting for GFP_ATOMIC order-0 allocations
+	 * Ignore watermark boosting for __GFP_HIGH order-0 allocations
 	 * when checking the min watermark. The min watermark is the
 	 * point where boosting is ignored so that kswapd is woken up
 	 * when below the low watermark.
 	 */
-	if (unlikely(!order && (gfp_mask & __GFP_ATOMIC) && z->watermark_boost
+	if (unlikely(!order && (alloc_flags & ALLOC_MIN_RESERVE) && z->watermark_boost
 		&& ((alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN))) {
 		mark = z->_watermark[WMARK_MIN];
 		return __zone_watermark_ok(z, order, mark, highest_zoneidx,
@@ -5057,14 +5058,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned int zonelist_iter_cookie;
 	int reserve_flags;
 
-	/*
-	 * We also sanity check to catch abuse of atomic reserves being used by
-	 * callers that are not in atomic context.
-	 */
-	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
-				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
-		gfp_mask &= ~__GFP_ATOMIC;
-
 restart:
 	compaction_retries = 0;
 	no_progress_loops = 0;
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index e20656c431a4..173d407dce92 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -641,7 +641,6 @@ static const struct {
 	{ "__GFP_HIGHMEM",		"HM" },
 	{ "GFP_DMA32",			"D32" },
 	{ "__GFP_HIGH",			"H" },
-	{ "__GFP_ATOMIC",		"_A" },
 	{ "__GFP_IO",			"I" },
 	{ "__GFP_FS",			"F" },
 	{ "__GFP_NOWARN",		"NWR" },
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/7] mm/page_alloc: Explicitly record high-order atomic allocations in alloc_flags
  2023-01-09 15:16 ` [PATCH 3/7] mm/page_alloc: Explicitly record high-order atomic allocations in alloc_flags Mel Gorman
@ 2023-01-10 15:28   ` Vlastimil Babka
  2023-01-11 15:36   ` Michal Hocko
  1 sibling, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2023-01-10 15:28 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Andrew Morton, Michal Hocko, NeilBrown, Thierry Reding,
	Matthew Wilcox, LKML

On 1/9/23 16:16, Mel Gorman wrote:
> A high-order ALLOC_HARDER allocation is assumed to be atomic. While that
> is accurate, it changes later in the series. In preparation, explicitly
> record high-order atomic allocations in gfp_to_alloc_flags(). There is
> a slight functional change in that OOM handling avoids using high-order
> reserve until it has to.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] mm/page_alloc: Explicitly define what alloc flags deplete min reserves
  2023-01-09 15:16 ` [PATCH 4/7] mm/page_alloc: Explicitly define what alloc flags deplete min reserves Mel Gorman
@ 2023-01-11 14:04   ` Vlastimil Babka
  2023-01-11 15:37   ` Michal Hocko
  1 sibling, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2023-01-11 14:04 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Andrew Morton, Michal Hocko, NeilBrown, Thierry Reding,
	Matthew Wilcox, LKML

On 1/9/23 16:16, Mel Gorman wrote:
> As there are more ALLOC_ flags that affect reserves, define what flags
> affect reserves and clarify the effect of each flag.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/7] mm/page_alloc.c: Allow __GFP_NOFAIL requests deeper access to reserves
  2023-01-09 15:16 ` [PATCH 5/7] mm/page_alloc.c: Allow __GFP_NOFAIL requests deeper access to reserves Mel Gorman
@ 2023-01-11 14:05   ` Vlastimil Babka
  2023-01-11 15:46   ` Michal Hocko
  1 sibling, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2023-01-11 14:05 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Andrew Morton, Michal Hocko, NeilBrown, Thierry Reding,
	Matthew Wilcox, LKML

On 1/9/23 16:16, Mel Gorman wrote:
> Currently __GFP_NOFAIL allocations without any other flags can access 25%
> of the reserves but these requests imply that the system cannot make forward
> progress until the allocation succeeds. Allow __GFP_NOFAIL access to 75%
> of the min reserve.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6f41b84a97ac..d2df78f5baa2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5308,7 +5308,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		 * could deplete whole memory reserves which would just make
>  		 * the situation worse
>  		 */
> -		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
> +		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_MIN_RESERVE|ALLOC_HARDER, ac);
>  		if (page)
>  			goto got_pg;
>  


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations access to reserves
  2023-01-09 15:16 ` [PATCH 6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations " Mel Gorman
@ 2023-01-11 14:12   ` Vlastimil Babka
  2023-01-11 15:58   ` Michal Hocko
  1 sibling, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2023-01-11 14:12 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM
  Cc: Andrew Morton, Michal Hocko, NeilBrown, Thierry Reding,
	Matthew Wilcox, LKML

On 1/9/23 16:16, Mel Gorman wrote:
> Explicit GFP_ATOMIC allocations get flagged ALLOC_HARDER which is a bit
> vague. In preparation for removing __GFP_ATOMIC, give GFP_ATOMIC and
> other non-blocking allocation requests equal access to reserves.  Rename
> ALLOC_HARDER to ALLOC_NON_BLOCK to make it more clear what the flag
> means.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/7] mm/page_alloc: Rename ALLOC_HIGH to ALLOC_MIN_RESERVE
  2023-01-09 15:16 ` [PATCH 1/7] mm/page_alloc: Rename ALLOC_HIGH to ALLOC_MIN_RESERVE Mel Gorman
@ 2023-01-11 15:18   ` Michal Hocko
  2023-01-12  9:26     ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2023-01-11 15:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Mon 09-01-23 15:16:25, Mel Gorman wrote:
> __GFP_HIGH aliases to ALLOC_HIGH but the name does not really hint
> what it means. As ALLOC_HIGH is internal to the allocator, rename
> it to ALLOC_MIN_RESERVE to document that the min reserves can
> be depleted.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Naming is hard but ALLOC_HIGH is definitely much more confusing as it
can collide with the high watermark. ALLOC_MIN_RESERVE says that some
reserves are involved. All the reserves are below the min watermark by
definition but I cannot really come up with a better name. I do not think
we want to encode the amount of reserves into the name.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!
> ---
>  mm/internal.h   | 4 +++-
>  mm/page_alloc.c | 8 ++++----
>  2 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index bcf75a8b032d..403e4386626d 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -736,7 +736,9 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>  #endif
>  
>  #define ALLOC_HARDER		 0x10 /* try to alloc harder */
> -#define ALLOC_HIGH		 0x20 /* __GFP_HIGH set */
> +#define ALLOC_MIN_RESERVE	 0x20 /* __GFP_HIGH set. Allow access to 50%
> +				       * of the min watermark.
> +				       */
>  #define ALLOC_CPUSET		 0x40 /* check for correct cpuset */
>  #define ALLOC_CMA		 0x80 /* allow allocations from CMA areas */
>  #ifdef CONFIG_ZONE_DMA32
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0745aedebb37..244c1e675dc8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3976,7 +3976,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  	/* free_pages may go negative - that's OK */
>  	free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
>  
> -	if (alloc_flags & ALLOC_HIGH)
> +	if (alloc_flags & ALLOC_MIN_RESERVE)
>  		min -= min / 2;
>  
>  	if (unlikely(alloc_harder)) {
> @@ -4818,18 +4818,18 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	unsigned int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
>  
>  	/*
> -	 * __GFP_HIGH is assumed to be the same as ALLOC_HIGH
> +	 * __GFP_HIGH is assumed to be the same as ALLOC_MIN_RESERVE
>  	 * and __GFP_KSWAPD_RECLAIM is assumed to be the same as ALLOC_KSWAPD
>  	 * to save two branches.
>  	 */
> -	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> +	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_MIN_RESERVE);
>  	BUILD_BUG_ON(__GFP_KSWAPD_RECLAIM != (__force gfp_t) ALLOC_KSWAPD);
>  
>  	/*
>  	 * The caller may dip into page reserves a bit more if the caller
>  	 * cannot run direct reclaim, or if the caller has realtime scheduling
>  	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> -	 * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_HIGH (__GFP_HIGH).
> +	 * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_MIN_RESERVE(__GFP_HIGH).
>  	 */
>  	alloc_flags |= (__force int)
>  		(gfp_mask & (__GFP_HIGH | __GFP_KSWAPD_RECLAIM));
> -- 
> 2.35.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] mm/page_alloc: Treat RT tasks similar to __GFP_HIGH
  2023-01-09 15:16 ` [PATCH 2/7] mm/page_alloc: Treat RT tasks similar to __GFP_HIGH Mel Gorman
@ 2023-01-11 15:27   ` Michal Hocko
  2023-01-12  9:36     ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2023-01-11 15:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Mon 09-01-23 15:16:26, Mel Gorman wrote:
> RT tasks are allowed to dip below the min reserve but ALLOC_HARDER is
> typically combined with ALLOC_MIN_RESERVE so RT tasks are a little
> unusual. While there is some justification for allowing RT tasks
> access to memory reserves, there is a strong chance that an RT task
> that is also under memory pressure is at risk of missing deadlines
> anyway. Relax how much of the reserves an RT task can access by treating
> it the same as __GFP_HIGH allocations.

TBH, I would much rather drop the RT special casing here. As you say, if
an RT task needs to dip into memory reserves it is either already too late
because the execution is already under RT constraints or this is the init
phase where reclaim is not a problem yet.

I have tried to trace down this special case and only found a patch from
Robert Love from 2003 which says:
: - Let real-time tasks dip further into the reserves than usual in
:   __alloc_pages().  There are a lot of ways to special case this.  This
:   patch just cuts z->pages_low in half, before doing the incremental min
:   thing, for real-time tasks.  I do not do anything in the low memory slow
:   path.  We can be a _lot_ more aggressive if we want.  Right now, we just
:   give real-time tasks a little help.

This doesn't really explain why this is needed.

We are really great at preserving a behavior and cementing it for
future generations. Maybe we should just drop it and see if something
breaks. We would at least finally get some reasoning.

So I am not opposed to the patch per se but I would much rather see this
branch go away. If you want, I can condense the above into a changelog
and send a patch (either on top of this one or replacing it). WDYT?

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 244c1e675dc8..0040b4e00913 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4847,7 +4847,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  		 */
>  		alloc_flags &= ~ALLOC_CPUSET;
>  	} else if (unlikely(rt_task(current)) && in_task())
> -		alloc_flags |= ALLOC_HARDER;
> +		alloc_flags |= ALLOC_MIN_RESERVE;
>  
>  	alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);
>  
> -- 
> 2.35.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/7] mm/page_alloc: Explicitly record high-order atomic allocations in alloc_flags
  2023-01-09 15:16 ` [PATCH 3/7] mm/page_alloc: Explicitly record high-order atomic allocations in alloc_flags Mel Gorman
  2023-01-10 15:28   ` Vlastimil Babka
@ 2023-01-11 15:36   ` Michal Hocko
  2023-01-12  9:38     ` Mel Gorman
  1 sibling, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2023-01-11 15:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Mon 09-01-23 15:16:27, Mel Gorman wrote:
> A high-order ALLOC_HARDER allocation is assumed to be atomic. While that
> is accurate, it changes later in the series. In preparation, explicitly
> record high-order atomic allocations in gfp_to_alloc_flags(). There is
> a slight functional change in that OOM handling avoids using high-order
> reserve until it has to.

I do not follow the oom handling part. IIRC we are dropping highatomic
reserves before triggering oom. Something might have changed down the
path but I can still see unreserve_highatomic_pageblock in
should_reclaim_retry.

I do not have any objection to ALLOC_HIGHATOMIC itself, though.

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/internal.h   |  1 +
>  mm/page_alloc.c | 29 +++++++++++++++++++++++------
>  2 files changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 403e4386626d..178484d9fd94 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -746,6 +746,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>  #else
>  #define ALLOC_NOFRAGMENT	  0x0
>  #endif
> +#define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
>  #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
>  
>  enum ttu_flags;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0040b4e00913..0ef4f3236a5a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3706,10 +3706,20 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  		 * reserved for high-order atomic allocation, so order-0
>  		 * request should skip it.
>  		 */
> -		if (order > 0 && alloc_flags & ALLOC_HARDER)
> +		if (alloc_flags & ALLOC_HIGHATOMIC)
>  			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
>  		if (!page) {
>  			page = __rmqueue(zone, order, migratetype, alloc_flags);
> +
> +			/*
> +			 * If the allocation fails, allow OOM handling access
> +			 * to HIGHATOMIC reserves as failing now is worse than
> +			 * failing a high-order atomic allocation in the
> +			 * future.
> +			 */
> +			if (!page && (alloc_flags & ALLOC_OOM))
> +				page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> +
>  			if (!page) {
>  				spin_unlock_irqrestore(&zone->lock, flags);
>  				return NULL;
> @@ -4023,8 +4033,10 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  			return true;
>  		}
>  #endif
> -		if (alloc_harder && !free_area_empty(area, MIGRATE_HIGHATOMIC))
> +		if ((alloc_flags & (ALLOC_HIGHATOMIC|ALLOC_OOM)) &&
> +		    !free_area_empty(area, MIGRATE_HIGHATOMIC)) {
>  			return true;
> +		}
>  	}
>  	return false;
>  }
> @@ -4286,7 +4298,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  			 * If this is a high-order atomic allocation then check
>  			 * if the pageblock should be reserved for the future
>  			 */
> -			if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
> +			if (unlikely(alloc_flags & ALLOC_HIGHATOMIC))
>  				reserve_highatomic_pageblock(page, zone, order);
>  
>  			return page;
> @@ -4813,7 +4825,7 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
>  }
>  
>  static inline unsigned int
> -gfp_to_alloc_flags(gfp_t gfp_mask)
> +gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
>  {
>  	unsigned int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
>  
> @@ -4839,8 +4851,13 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
>  		 * if it can't schedule.
>  		 */
> -		if (!(gfp_mask & __GFP_NOMEMALLOC))
> +		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
>  			alloc_flags |= ALLOC_HARDER;
> +
> +			if (order > 0)
> +				alloc_flags |= ALLOC_HIGHATOMIC;
> +		}
> +
>  		/*
>  		 * Ignore cpuset mems for GFP_ATOMIC rather than fail, see the
>  		 * comment for __cpuset_node_allowed().
> @@ -5048,7 +5065,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 * kswapd needs to be woken up, and to avoid the cost of setting up
>  	 * alloc_flags precisely. So we do that now.
>  	 */
> -	alloc_flags = gfp_to_alloc_flags(gfp_mask);
> +	alloc_flags = gfp_to_alloc_flags(gfp_mask, order);
>  
>  	/*
>  	 * We need to recalculate the starting point for the zonelist iterator
> -- 
> 2.35.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] mm/page_alloc: Explicitly define what alloc flags deplete min reserves
  2023-01-09 15:16 ` [PATCH 4/7] mm/page_alloc: Explicitly define what alloc flags deplete min reserves Mel Gorman
  2023-01-11 14:04   ` Vlastimil Babka
@ 2023-01-11 15:37   ` Michal Hocko
  1 sibling, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2023-01-11 15:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Mon 09-01-23 15:16:28, Mel Gorman wrote:
> As there are more ALLOC_ flags that affect reserves, define what flags
> affect reserves and clarify the effect of each flag.

I like this! It makes the code much easier to follow.

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/internal.h   |  3 +++
>  mm/page_alloc.c | 34 ++++++++++++++++++++++------------
>  2 files changed, 25 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 178484d9fd94..8706d46863df 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -749,6 +749,9 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>  #define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
>  #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
>  
> +/* Flags that allow allocations below the min watermark. */
> +#define ALLOC_RESERVES (ALLOC_HARDER|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
> +
>  enum ttu_flags;
>  struct tlbflush_unmap_batch;
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0ef4f3236a5a..6f41b84a97ac 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3949,15 +3949,14 @@ ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);
>  static inline long __zone_watermark_unusable_free(struct zone *z,
>  				unsigned int order, unsigned int alloc_flags)
>  {
> -	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
>  	long unusable_free = (1 << order) - 1;
>  
>  	/*
> -	 * If the caller does not have rights to ALLOC_HARDER then subtract
> -	 * the high-atomic reserves. This will over-estimate the size of the
> -	 * atomic reserve but it avoids a search.
> +	 * If the caller does not have rights to reserves below the min
> +	 * watermark then subtract the high-atomic reserves. This will
> +	 * over-estimate the size of the atomic reserve but it avoids a search.
>  	 */
> -	if (likely(!alloc_harder))
> +	if (likely(!(alloc_flags & ALLOC_RESERVES)))
>  		unusable_free += z->nr_reserved_highatomic;
>  
>  #ifdef CONFIG_CMA
> @@ -3981,25 +3980,36 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  {
>  	long min = mark;
>  	int o;
> -	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
>  
>  	/* free_pages may go negative - that's OK */
>  	free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
>  
> -	if (alloc_flags & ALLOC_MIN_RESERVE)
> -		min -= min / 2;
> +	if (unlikely(alloc_flags & ALLOC_RESERVES)) {
> +		/*
> +		 * __GFP_HIGH allows access to 50% of the min reserve as well
> +		 * as OOM.
> +		 */
> +		if (alloc_flags & ALLOC_MIN_RESERVE)
> +			min -= min / 2;
>  
> -	if (unlikely(alloc_harder)) {
>  		/*
> -		 * OOM victims can try even harder than normal ALLOC_HARDER
> +		 * Non-blocking allocations can access some of the reserve
> +		 * with more access if also __GFP_HIGH. The reasoning is that
> +		 * a non-blocking caller may incur a more severe penalty
> +		 * if it cannot get memory quickly, particularly if it's
> +		 * also __GFP_HIGH.
> +		 */
> +		if (alloc_flags & ALLOC_HARDER)
> +			min -= min / 4;
> +
> +		/*
> +		 * OOM victims can try even harder than the normal reserve
>  		 * users on the grounds that it's definitely going to be in
>  		 * the exit path shortly and free memory. Any allocation it
>  		 * makes during the free path will be small and short-lived.
>  		 */
>  		if (alloc_flags & ALLOC_OOM)
>  			min -= min / 2;
> -		else
> -			min -= min / 4;
>  	}
>  
>  	/*
> -- 
> 2.35.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/7] mm/page_alloc.c: Allow __GFP_NOFAIL requests deeper access to reserves
  2023-01-09 15:16 ` [PATCH 5/7] mm/page_alloc.c: Allow __GFP_NOFAIL requests deeper access to reserves Mel Gorman
  2023-01-11 14:05   ` Vlastimil Babka
@ 2023-01-11 15:46   ` Michal Hocko
  2023-01-12  9:43     ` Mel Gorman
  1 sibling, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2023-01-11 15:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Mon 09-01-23 15:16:29, Mel Gorman wrote:
> Currently __GFP_NOFAIL allocations without any other flags can access 25%
> of the reserves but these requests imply that the system cannot make forward
> progress until the allocation succeeds. Allow __GFP_NOFAIL access to 75%
> of the min reserve.

I am not sure this is really needed. IIRC the original motivation for
allowing NOFAIL requests access to memory reserves was
GFP_NOFS|__GFP_NOFAIL requests which do not invoke the OOM killer.
The amount of memory reserves granted was not really important. The
point was to allow forward progress. Giving more of the reserves is a
double-edged sword. It can help in some cases but it can also prevent
other high-priority users from making forward progress.

I would much rather see such a change with an example where it really
made a difference.

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6f41b84a97ac..d2df78f5baa2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5308,7 +5308,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		 * could deplete whole memory reserves which would just make
>  		 * the situation worse
>  		 */
> -		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
> +		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_MIN_RESERVE|ALLOC_HARDER, ac);
>  		if (page)
>  			goto got_pg;
>  
> -- 
> 2.35.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations access to reserves
  2023-01-09 15:16 ` [PATCH 6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations " Mel Gorman
  2023-01-11 14:12   ` Vlastimil Babka
@ 2023-01-11 15:58   ` Michal Hocko
  2023-01-11 17:05     ` Mel Gorman
  1 sibling, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2023-01-11 15:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Mon 09-01-23 15:16:30, Mel Gorman wrote:
> Explicit GFP_ATOMIC allocations get flagged ALLOC_HARDER which is a bit
> vague. In preparation for removing __GFP_ATOMIC, give GFP_ATOMIC and
> other non-blocking allocation requests equal access to reserves.  Rename
> ALLOC_HARDER to ALLOC_NON_BLOCK to make it more clear what the flag
> means.

GFP_NOWAIT can also be used for opportunistic allocations which can and
should fail quickly if memory is tight and a more elaborate path should
be taken (e.g. try a higher-order allocation first but fall back to a
smaller request if memory is fragmented). Do we really want to give
those access to memory reserves as well?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations access to reserves
  2023-01-11 15:58   ` Michal Hocko
@ 2023-01-11 17:05     ` Mel Gorman
  2023-01-12  8:11       ` Michal Hocko
  2023-01-14 22:10       ` NeilBrown
  0 siblings, 2 replies; 34+ messages in thread
From: Mel Gorman @ 2023-01-11 17:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Wed, Jan 11, 2023 at 04:58:02PM +0100, Michal Hocko wrote:
> On Mon 09-01-23 15:16:30, Mel Gorman wrote:
> > Explicit GFP_ATOMIC allocations get flagged ALLOC_HARDER which is a bit
> > vague. In preparation for removing __GFP_ATOMIC, give GFP_ATOMIC and
> > other non-blocking allocation requests equal access to reserve.  Rename
> > ALLOC_HARDER to ALLOC_NON_BLOCK to make it more clear what the flag
> > means.
> 
> GFP_NOWAIT can be also used for opportunistic allocations which can and
> should fail quickly if the memory is tight and more elaborate path
> should be taken (e.g. try higher order allocation first but fall back to
> smaller request if the memory is fragmented). Do we really want to give
> those access to memory reserves as well?

Good question. Without __GFP_ATOMIC, GFP_NOWAIT only differs from GFP_ATOMIC
by __GFP_HIGH but that is not enough to distinguish between a caller that
cannot sleep versus one that is speculatively attempting an allocation but
has other options. That changelog is misleading, it's not equal access
as GFP_NOWAIT ends up with 25% of the reserves which is less than what
GFP_ATOMIC gets.
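
For reference, after patch 7 the two masks differ only by that one bit
(these are the definitions from the gfp_types.h hunk, nothing new):

	#define GFP_ATOMIC	(__GFP_HIGH|__GFP_KSWAPD_RECLAIM)
	#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)

so __GFP_HIGH is the only hint the allocator has left to tell an atomic
caller apart from a speculative one.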

Because it becomes impossible to distinguish between non-blocking and
atomic without __GFP_ATOMIC, there is some justification for allowing
access to reserves for GFP_NOWAIT. bio for example attempts an allocation
(clears __GFP_DIRECT_RECLAIM) before falling back to mempool but delays
in IO can also lead to further allocation pressure. mmu gather failing
GFP_NOWAIT slows the rate memory can be freed. NFS failing GFP_NOWAIT will
have to retry IOs multiple times. The examples were picked at random but
the point is that there are cases where failing GFP_NOWAIT can degrade
the system, particularly delay the cleaning of pages before reclaim.

A lot of the truly speculative users appear to use GFP_NOWAIT | __GFP_NOWARN
so one compromise would be to avoid using reserves if __GFP_NOWARN is
also specified.

Something like this as a separate patch?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7244ab522028..0a7a0ac1b46d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4860,9 +4860,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
 		/*
 		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
-		 * if it can't schedule.
+		 * if it can't schedule. Similarly, a caller specifying
+		 * __GFP_NOWARN is likely a speculative allocation with a
+		 * graceful recovery path.
 		 */
-		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
+		if (!(gfp_mask & (__GFP_NOMEMALLOC|__GFP_NOWARN))) {
 			alloc_flags |= ALLOC_NON_BLOCK;
 
 			if (order > 0)

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations access to reserves
  2023-01-11 17:05     ` Mel Gorman
@ 2023-01-12  8:11       ` Michal Hocko
  2023-01-12  8:29         ` Michal Hocko
  2023-01-12  9:24         ` Mel Gorman
  2023-01-14 22:10       ` NeilBrown
  1 sibling, 2 replies; 34+ messages in thread
From: Michal Hocko @ 2023-01-12  8:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Wed 11-01-23 17:05:52, Mel Gorman wrote:
> On Wed, Jan 11, 2023 at 04:58:02PM +0100, Michal Hocko wrote:
> > On Mon 09-01-23 15:16:30, Mel Gorman wrote:
> > > Explicit GFP_ATOMIC allocations get flagged ALLOC_HARDER which is a bit
> > > vague. In preparation for removing __GFP_ATOMIC, give GFP_ATOMIC and
> > > other non-blocking allocation requests equal access to reserve.  Rename
> > > ALLOC_HARDER to ALLOC_NON_BLOCK to make it more clear what the flag
> > > means.
> > 
> > GFP_NOWAIT can be also used for opportunistic allocations which can and
> > should fail quickly if the memory is tight and more elaborate path
> > should be taken (e.g. try higher order allocation first but fall back to
> > smaller request if the memory is fragmented). Do we really want to give
> > those access to memory reserves as well?
> 
> Good question. Without __GFP_ATOMIC, GFP_NOWAIT only differs from GFP_ATOMIC
> by __GFP_HIGH but that is not enough to distinguish between a caller that
> cannot sleep versus one that is speculatively attempting an allocation but
> has other options. That changelog is misleading, it's not equal access
> as GFP_NOWAIT ends up with 25% of the reserves which is less than what
> GFP_ATOMIC gets.
> 
> Because it becomes impossible to distinguish between non-blocking and
> atomic without __GFP_ATOMIC, there is some justification for allowing
> access to reserves for GFP_NOWAIT. bio for example attempts an allocation
> (clears __GFP_DIRECT_RECLAIM) before falling back to mempool but delays
> in IO can also lead to further allocation pressure. mmu gather failing
> GFP_WAIT slows the rate memory can be freed. NFS failing GFP_NOWAIT will
> have to retry IOs multiple times. The examples were picked at random but
> the point is that there are cases where failing GFP_NOWAIT can degrade
> the system, particularly delay the cleaning of pages before reclaim.

Fair points.

> A lot of the truly speculative users appear to use GFP_NOWAIT | __GFP_NOWARN
> so one compromise would be to avoid using reserves if __GFP_NOWARN is
> also specified.
> 
> Something like this as a separate patch?

I cannot say I would be happy about adding more side effects to
__GFP_NOWARN. You are right that it should be used for those optimistic
allocation requests but historically many of these subtle side effects
have kicked back at some point. Wouldn't it make sense to explicitly
mark those places which really benefit from reserves instead? This is
more work but it should pay off long term. Your examples above would use
GFP_ATOMIC instead of GFP_NOWAIT.

The semantic would be easier to explain as well. GFP_ATOMIC - non
sleeping allocations which are important so they have access to memory
reserves. GFP_NOWAIT - non sleeping allocations.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7244ab522028..0a7a0ac1b46d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4860,9 +4860,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
>  	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
>  		/*
>  		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
> -		 * if it can't schedule.
> +		 * if it can't schedule. Similarly, a caller specifying
> +		 * __GFP_NOWARN is likely a speculative allocation with a
> +		 * graceful recovery path.
>  		 */
> -		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> +		if (!(gfp_mask & (__GFP_NOMEMALLOC|__GFP_NOWARN))) {
>  			alloc_flags |= ALLOC_NON_BLOCK;
>  
>  			if (order > 0)

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/7] mm: discard __GFP_ATOMIC
  2023-01-09 15:16 ` [PATCH 7/7] mm: discard __GFP_ATOMIC Mel Gorman
@ 2023-01-12  8:12   ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2023-01-12  8:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Mon 09-01-23 15:16:31, Mel Gorman wrote:
> From: NeilBrown <neilb@suse.de>
> 
> __GFP_ATOMIC serves little purpose.  Its main effect is to set
> ALLOC_HARDER which adds a few little boosts to increase the chance of an
> allocation succeeding, one of which is to lower the water-mark at which it
> will succeed.
> 
> It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
> adjusts this watermark.  It is probable that other users of __GFP_HIGH
> should benefit from the other little bonuses that __GFP_ATOMIC gets.
> 
> __GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM.
> There is little point to this.  We already get a might_sleep() warning if
> __GFP_DIRECT_RECLAIM is set.
> 
> __GFP_ATOMIC allows the "watermark_boost" to be side-stepped.  It is
> probable that testing ALLOC_HARDER is a better fit here.
> 
> __GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
> sleep.  This should test __GFP_DIRECT_RECLAIM instead.
> 
> This patch:
>  - removes __GFP_ATOMIC
>  - allows __GFP_HIGH allocations to ignore watermark boosting as well
>    as GFP_ATOMIC requests.
>  - makes other adjustments as suggested by the above.
> 
> The net result is no change to GFP_ATOMIC allocations.  Other
> allocations that use __GFP_HIGH will benefit from a few different extra
> privileges.  This affects:
>   xen, dm, md, ntfs3
>   the vermillion frame buffer
>   hibernation
>   ksm
>   swap
> all of which likely produce more benefit than cost if these selected
> allocations are more likely to succeed quickly.
> 
> [mgorman: Minor adjustments to rework on top of a series]
> Link: https://lkml.kernel.org/r/163712397076.13692.4727608274002939094@noble.neil.brown.name
> Signed-off-by: NeilBrown <neilb@suse.de>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Forgot to ack this one yesterday
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  Documentation/mm/balance.rst   |  2 +-
>  drivers/iommu/tegra-smmu.c     |  4 ++--
>  include/linux/gfp_types.h      | 12 ++++--------
>  include/trace/events/mmflags.h |  1 -
>  lib/test_printf.c              |  8 ++++----
>  mm/internal.h                  |  2 +-
>  mm/page_alloc.c                | 13 +++----------
>  tools/perf/builtin-kmem.c      |  1 -
>  8 files changed, 15 insertions(+), 28 deletions(-)
> 
> diff --git a/Documentation/mm/balance.rst b/Documentation/mm/balance.rst
> index 6a1fadf3e173..e38e9d83c1c7 100644
> --- a/Documentation/mm/balance.rst
> +++ b/Documentation/mm/balance.rst
> @@ -6,7 +6,7 @@ Memory Balancing
>  
>  Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
>  
> -Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
> +Memory balancing is needed for !__GFP_HIGH and !__GFP_KSWAPD_RECLAIM as
>  well as for non __GFP_IO allocations.
>  
>  The first reason why a caller may avoid reclaim is that the caller can not
> diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
> index 5b1af40221ec..af8d0e685260 100644
> --- a/drivers/iommu/tegra-smmu.c
> +++ b/drivers/iommu/tegra-smmu.c
> @@ -671,12 +671,12 @@ static struct page *as_get_pde_page(struct tegra_smmu_as *as,
>  	 * allocate page in a sleeping context if GFP flags permit. Hence
>  	 * spinlock needs to be unlocked and re-locked after allocation.
>  	 */
> -	if (!(gfp & __GFP_ATOMIC))
> +	if (gfpflags_allow_blocking(gfp))
>  		spin_unlock_irqrestore(&as->lock, *flags);
>  
>  	page = alloc_page(gfp | __GFP_DMA | __GFP_ZERO);
>  
> -	if (!(gfp & __GFP_ATOMIC))
> +	if (gfpflags_allow_blocking(gfp))
>  		spin_lock_irqsave(&as->lock, *flags);
>  
>  	/*
> diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
> index d88c46ca82e1..5088637fe5c2 100644
> --- a/include/linux/gfp_types.h
> +++ b/include/linux/gfp_types.h
> @@ -31,7 +31,7 @@ typedef unsigned int __bitwise gfp_t;
>  #define ___GFP_IO		0x40u
>  #define ___GFP_FS		0x80u
>  #define ___GFP_ZERO		0x100u
> -#define ___GFP_ATOMIC		0x200u
> +/* 0x200u unused */
>  #define ___GFP_DIRECT_RECLAIM	0x400u
>  #define ___GFP_KSWAPD_RECLAIM	0x800u
>  #define ___GFP_WRITE		0x1000u
> @@ -116,11 +116,8 @@ typedef unsigned int __bitwise gfp_t;
>   *
>   * %__GFP_HIGH indicates that the caller is high-priority and that granting
>   * the request is necessary before the system can make forward progress.
> - * For example, creating an IO context to clean pages.
> - *
> - * %__GFP_ATOMIC indicates that the caller cannot reclaim or sleep and is
> - * high priority. Users are typically interrupt handlers. This may be
> - * used in conjunction with %__GFP_HIGH
> + * For example creating an IO context to clean pages and requests
> + * from atomic context.
>   *
>   * %__GFP_MEMALLOC allows access to all memory. This should only be used when
>   * the caller guarantees the allocation will allow more memory to be freed
> @@ -135,7 +132,6 @@ typedef unsigned int __bitwise gfp_t;
>   * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
>   * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
>   */
> -#define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)
>  #define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)
>  #define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)
>  #define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC)
> @@ -329,7 +325,7 @@ typedef unsigned int __bitwise gfp_t;
>   * version does not attempt reclaim/compaction at all and is by default used
>   * in page fault path, while the non-light is used by khugepaged.
>   */
> -#define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
> +#define GFP_ATOMIC	(__GFP_HIGH|__GFP_KSWAPD_RECLAIM)
>  #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
>  #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
>  #define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 412b5a46374c..9db52bc4ce19 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -31,7 +31,6 @@
>  	gfpflag_string(__GFP_HIGHMEM),		\
>  	gfpflag_string(GFP_DMA32),		\
>  	gfpflag_string(__GFP_HIGH),		\
> -	gfpflag_string(__GFP_ATOMIC),		\
>  	gfpflag_string(__GFP_IO),		\
>  	gfpflag_string(__GFP_FS),		\
>  	gfpflag_string(__GFP_NOWARN),		\
> diff --git a/lib/test_printf.c b/lib/test_printf.c
> index d34dc636b81c..46b4e6c414a3 100644
> --- a/lib/test_printf.c
> +++ b/lib/test_printf.c
> @@ -674,17 +674,17 @@ flags(void)
>  	gfp = GFP_ATOMIC|__GFP_DMA;
>  	test("GFP_ATOMIC|GFP_DMA", "%pGg", &gfp);
>  
> -	gfp = __GFP_ATOMIC;
> -	test("__GFP_ATOMIC", "%pGg", &gfp);
> +	gfp = __GFP_HIGH;
> +	test("__GFP_HIGH", "%pGg", &gfp);
>  
>  	/* Any flags not translated by the table should remain numeric */
>  	gfp = ~__GFP_BITS_MASK;
>  	snprintf(cmp_buffer, BUF_SIZE, "%#lx", (unsigned long) gfp);
>  	test(cmp_buffer, "%pGg", &gfp);
>  
> -	snprintf(cmp_buffer, BUF_SIZE, "__GFP_ATOMIC|%#lx",
> +	snprintf(cmp_buffer, BUF_SIZE, "__GFP_HIGH|%#lx",
>  							(unsigned long) gfp);
> -	gfp |= __GFP_ATOMIC;
> +	gfp |= __GFP_HIGH;
>  	test(cmp_buffer, "%pGg", &gfp);
>  
>  	kfree(cmp_buffer);
> diff --git a/mm/internal.h b/mm/internal.h
> index 23a37588073a..71b1111427f3 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -24,7 +24,7 @@ struct folio_batch;
>  #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
>  			__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_NOFAIL|\
>  			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC|\
> -			__GFP_ATOMIC|__GFP_NOLOCKDEP)
> +			__GFP_NOLOCKDEP)
>  
>  /* The GFP flags allowed during early boot */
>  #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2217bab2dbb2..7244ab522028 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4086,13 +4086,14 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
>  	if (__zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
>  					free_pages))
>  		return true;
> +
>  	/*
> -	 * Ignore watermark boosting for GFP_ATOMIC order-0 allocations
> +	 * Ignore watermark boosting for __GFP_HIGH order-0 allocations
>  	 * when checking the min watermark. The min watermark is the
>  	 * point where boosting is ignored so that kswapd is woken up
>  	 * when below the low watermark.
>  	 */
> -	if (unlikely(!order && (gfp_mask & __GFP_ATOMIC) && z->watermark_boost
> +	if (unlikely(!order && (alloc_flags & ALLOC_MIN_RESERVE) && z->watermark_boost
>  		&& ((alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN))) {
>  		mark = z->_watermark[WMARK_MIN];
>  		return __zone_watermark_ok(z, order, mark, highest_zoneidx,
> @@ -5057,14 +5058,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	unsigned int zonelist_iter_cookie;
>  	int reserve_flags;
>  
> -	/*
> -	 * We also sanity check to catch abuse of atomic reserves being used by
> -	 * callers that are not in atomic context.
> -	 */
> -	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
> -				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> -		gfp_mask &= ~__GFP_ATOMIC;
> -
>  restart:
>  	compaction_retries = 0;
>  	no_progress_loops = 0;
> diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
> index e20656c431a4..173d407dce92 100644
> --- a/tools/perf/builtin-kmem.c
> +++ b/tools/perf/builtin-kmem.c
> @@ -641,7 +641,6 @@ static const struct {
>  	{ "__GFP_HIGHMEM",		"HM" },
>  	{ "GFP_DMA32",			"D32" },
>  	{ "__GFP_HIGH",			"H" },
> -	{ "__GFP_ATOMIC",		"_A" },
>  	{ "__GFP_IO",			"I" },
>  	{ "__GFP_FS",			"F" },
>  	{ "__GFP_NOWARN",		"NWR" },
> -- 
> 2.35.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations access to reserves
  2023-01-12  8:11       ` Michal Hocko
@ 2023-01-12  8:29         ` Michal Hocko
  2023-01-12  9:24         ` Mel Gorman
  1 sibling, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2023-01-12  8:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Thu 12-01-23 09:11:07, Michal Hocko wrote:
> On Wed 11-01-23 17:05:52, Mel Gorman wrote:
> > On Wed, Jan 11, 2023 at 04:58:02PM +0100, Michal Hocko wrote:
> > > On Mon 09-01-23 15:16:30, Mel Gorman wrote:
> > > > Explicit GFP_ATOMIC allocations get flagged ALLOC_HARDER which is a bit
> > > > vague. In preparation for removing __GFP_ATOMIC, give GFP_ATOMIC and
> > > > other non-blocking allocation requests equal access to reserve.  Rename
> > > > ALLOC_HARDER to ALLOC_NON_BLOCK to make it more clear what the flag
> > > > means.
> > > 
> > > GFP_NOWAIT can be also used for opportunistic allocations which can and
> > > should fail quickly if the memory is tight and more elaborate path
> > > should be taken (e.g. try higher order allocation first but fall back to
> > > smaller request if the memory is fragmented). Do we really want to give
> > > those access to memory reserves as well?
> > 
> > Good question. Without __GFP_ATOMIC, GFP_NOWAIT only differs from GFP_ATOMIC
> > by __GFP_HIGH but that is not enough to distinguish between a caller that
> > cannot sleep versus one that is speculatively attempting an allocation but
> > has other options. That changelog is misleading, it's not equal access
> > as GFP_NOWAIT ends up with 25% of the reserves which is less than what
> > GFP_ATOMIC gets.
> > 
> > Because it becomes impossible to distinguish between non-blocking and
> > atomic without __GFP_ATOMIC, there is some justification for allowing
> > access to reserves for GFP_NOWAIT. bio for example attempts an allocation
> > (clears __GFP_DIRECT_RECLAIM) before falling back to mempool but delays
> > in IO can also lead to further allocation pressure. mmu gather failing
> > GFP_WAIT slows the rate memory can be freed. NFS failing GFP_NOWAIT will
> > have to retry IOs multiple times. The examples were picked at random but
> > the point is that there are cases where failing GFP_NOWAIT can degrade
> > the system, particularly delay the cleaning of pages before reclaim.
> 
> Fair points.
> 
> > A lot of the truly speculative users appear to use GFP_NOWAIT | __GFP_NOWARN
> > so one compromise would be to avoid using reserves if __GFP_NOWARN is
> > also specified.
> > 
> > Something like this as a separate patch?
> 
> I cannot say I would be happy about adding more side effects to
> __GFP_NOWARN. You are right that it should be used for those optimistic
> allocation requests but historically all many of these subtle side effects
> have kicked back at some point.

Should have looked before writing: git grep shows GFP_ATOMIC | __GFP_NOWARN
is quite popular, with more than 50 instances.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations access to reserves
  2023-01-12  8:11       ` Michal Hocko
  2023-01-12  8:29         ` Michal Hocko
@ 2023-01-12  9:24         ` Mel Gorman
  2023-01-12  9:45           ` Michal Hocko
  1 sibling, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2023-01-12  9:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Thu, Jan 12, 2023 at 09:11:06AM +0100, Michal Hocko wrote:
> On Wed 11-01-23 17:05:52, Mel Gorman wrote:
> > On Wed, Jan 11, 2023 at 04:58:02PM +0100, Michal Hocko wrote:
> > > On Mon 09-01-23 15:16:30, Mel Gorman wrote:
> > > > Explicit GFP_ATOMIC allocations get flagged ALLOC_HARDER which is a bit
> > > > vague. In preparation for removing __GFP_ATOMIC, give GFP_ATOMIC and
> > > > other non-blocking allocation requests equal access to reserve.  Rename
> > > > ALLOC_HARDER to ALLOC_NON_BLOCK to make it more clear what the flag
> > > > means.
> > > 
> > > GFP_NOWAIT can be also used for opportunistic allocations which can and
> > > should fail quickly if the memory is tight and more elaborate path
> > > should be taken (e.g. try higher order allocation first but fall back to
> > > smaller request if the memory is fragmented). Do we really want to give
> > > those access to memory reserves as well?
> > 
> > Good question. Without __GFP_ATOMIC, GFP_NOWAIT only differs from GFP_ATOMIC
> > by __GFP_HIGH but that is not enough to distinguish between a caller that
> > cannot sleep versus one that is speculatively attempting an allocation but
> > has other options. That changelog is misleading, it's not equal access
> > as GFP_NOWAIT ends up with 25% of the reserves which is less than what
> > GFP_ATOMIC gets.
> > 
> > Because it becomes impossible to distinguish between non-blocking and
> > atomic without __GFP_ATOMIC, there is some justification for allowing
> > access to reserves for GFP_NOWAIT. bio for example attempts an allocation
> > (clears __GFP_DIRECT_RECLAIM) before falling back to mempool but delays
> > in IO can also lead to further allocation pressure. mmu gather failing
> > GFP_WAIT slows the rate memory can be freed. NFS failing GFP_NOWAIT will
> > have to retry IOs multiple times. The examples were picked at random but
> > the point is that there are cases where failing GFP_NOWAIT can degrade
> > the system, particularly delay the cleaning of pages before reclaim.
> 
> Fair points.
> 
> > A lot of the truly speculative users appear to use GFP_NOWAIT | __GFP_NOWARN
> > so one compromise would be to avoid using reserves if __GFP_NOWARN is
> > also specified.
> > 
> > Something like this as a separate patch?
> 
> I cannot say I would be happy about adding more side effects to
> __GFP_NOWARN. You are right that it should be used for those optimistic
> allocation requests but historically all many of these subtle side effects
> have kicked back at some point.

True.

> Wouldn't it make sense to explicitly
> mark those places which really benefit from reserves instead?

That would be __GFP_HIGH and would require context from every caller on
whether they need reserves or not and to determine what the consequences
are if there is a stall. Is there immediate local fallout or wider fallout
such as a variable delay before pages can be cleaned?

> This is
> more work but it should pay off long term. Your examples above would use
> GFP_ATOMIC instead of GFP_NOWAIT.
> 

Yes, although it would confuse the meaning of GFP_ATOMIC as a result.
It's described as "%GFP_ATOMIC users can not sleep and need the allocation to
succeed" and something like the bio callsite does not *need* the allocation
to succeed. It can fallback to the mempool and performance simply degrades
temporarily. No doubt there are a few abuses of GFP_ATOMIC just to get
non-blocking behaviour already.
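
As a rough sketch of the pattern I mean (the function and parameter names
are made up, only kmalloc and mempool_alloc are real APIs, and it assumes
the pool holds objects of the same size, e.g. one created with
mempool_create_kmalloc_pool):

	#include <linux/mempool.h>
	#include <linux/slab.h>

	/*
	 * Illustrative only: try an ordinary allocation that stays out of
	 * the reserves and fall back to the caller's mempool. A failure of
	 * the first attempt degrades performance but is not fatal.
	 */
	static void *get_io_request(mempool_t *pool, size_t size)
	{
		void *req;

		req = kmalloc(size, GFP_NOWAIT | __GFP_NOWARN);
		if (req)
			return req;

		/* may sleep until an element is returned to the pool */
		return mempool_alloc(pool, GFP_NOIO);
	}

Such a caller clearly does not *need* the reserves even though the first
attempt cannot block.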

> The semantic would be easier to explain as well. GFP_ATOMIC - non
> sleeping allocations which are important so they have access to memory
> reserves. GFP_NOWAIT - non sleeping allocations.
> 

People's definition of "important" will vary wildly. The following would
avoid reserve access for GFP_NOWAIT for now. It would need to be folded
into this patch along with a new changelog.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7244ab522028..aa20165224cf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3989,18 +3989,19 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 		 * __GFP_HIGH allows access to 50% of the min reserve as well
 		 * as OOM.
 		 */
-		if (alloc_flags & ALLOC_MIN_RESERVE)
+		if (alloc_flags & ALLOC_MIN_RESERVE) {
 			min -= min / 2;
 
-		/*
-		 * Non-blocking allocations can access some of the reserve
-		 * with more access if also __GFP_HIGH. The reasoning is that
-		 * a non-blocking caller may incur a more severe penalty
-		 * if it cannot get memory quickly, particularly if it's
-		 * also __GFP_HIGH.
-		 */
-		if (alloc_flags & ALLOC_NON_BLOCK)
-			min -= min / 4;
+			/*
+			 * Non-blocking allocations (e.g. GFP_ATOMIC) can
+			 * access more reserves than just __GFP_HIGH. Other
+			 * non-blocking allocations requests such as GFP_NOWAIT
+			 * or (GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) do not get
+			 * access to the min reserve.
+			 */
+			if (alloc_flags & ALLOC_NON_BLOCK)
+				min -= min / 4;
+		}
 
 		/*
 		 * OOM victims can try even harder than the normal reserve
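
For clarity, the net effect of folding that in, written out as a tiny
standalone model (the ALLOC_* values below are placeholders rather than
the real flag bits, and this is userspace code, not the allocator itself):

	#include <stdio.h>

	#define ALLOC_MIN_RESERVE	0x1	/* stands in for __GFP_HIGH */
	#define ALLOC_NON_BLOCK		0x2	/* stands in for !__GFP_DIRECT_RECLAIM */

	/* model of the reserve arithmetic in the hunk above */
	static long effective_min(long min, unsigned int alloc_flags)
	{
		if (alloc_flags & ALLOC_MIN_RESERVE) {
			min -= min / 2;		/* __GFP_HIGH: down to 50% of min */
			if (alloc_flags & ALLOC_NON_BLOCK)
				min -= min / 4;	/* GFP_ATOMIC: down to ~37% of min */
		}
		return min;
	}

	int main(void)
	{
		printf("GFP_NOWAIT : %ld\n", effective_min(1000, ALLOC_NON_BLOCK));
		printf("__GFP_HIGH : %ld\n", effective_min(1000, ALLOC_MIN_RESERVE));
		printf("GFP_ATOMIC : %ld\n",
		       effective_min(1000, ALLOC_MIN_RESERVE | ALLOC_NON_BLOCK));
		return 0;
	}

i.e. plain GFP_NOWAIT no longer lowers the min watermark at all, __GFP_HIGH
lowers it to half and GFP_ATOMIC (both flags) to roughly 37%.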





-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/7] mm/page_alloc: Rename ALLOC_HIGH to ALLOC_MIN_RESERVE
  2023-01-11 15:18   ` Michal Hocko
@ 2023-01-12  9:26     ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2023-01-12  9:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Wed, Jan 11, 2023 at 04:18:50PM +0100, Michal Hocko wrote:
> On Mon 09-01-23 15:16:25, Mel Gorman wrote:
> > __GFP_HIGH aliases to ALLOC_HIGH but the name does not really hint
> > what it means. As ALLOC_HIGH is internal to the allocator, rename
> > it to ALLOC_MIN_RESERVE to document that the min reserves can
> > be depleted.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Naming is hard but ALLOC_HIGH is definitely much more confusing as it
> can collide with high watermark. ALLOC_MIN_RESERVE says that some
> reserves are involved. All the reserves are below the min watermark by
> definition but I cannot really come up with a better name. I do not think
> we want to encode the amount of reserves into the name.
> 

It's internal to the page allocator so I didn't sweat about it too much.
Access to the reserves currently means "allow pages to be allocated
below the min reserve". Even if that changes in the future, the name can
change with it.

> Acked-by: Michal Hocko <mhocko@suse.com>
> 

Thanks!

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] mm/page_alloc: Treat RT tasks similar to __GFP_HIGH
  2023-01-11 15:27   ` Michal Hocko
@ 2023-01-12  9:36     ` Mel Gorman
  2023-01-12  9:47       ` Michal Hocko
  2023-01-13  9:04       ` David Laight
  0 siblings, 2 replies; 34+ messages in thread
From: Mel Gorman @ 2023-01-12  9:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Wed, Jan 11, 2023 at 04:27:29PM +0100, Michal Hocko wrote:
> On Mon 09-01-23 15:16:26, Mel Gorman wrote:
> > RT tasks are allowed to dip below the min reserve but ALLOC_HARDER is
> > typically combined with ALLOC_MIN_RESERVE so RT tasks are a little
> > unusual. While there is some justification for allowing RT tasks
> > access to memory reserves, there is a strong chance that a RT task
> > that is also under memory pressure is at risk of missing deadlines
> > anyway. Relax how much reserves an RT task can access by treating
> > it the same as __GFP_HIGH allocations.
> 
> TBH, I would much rather drop the RT special casing here. As you say if
> a RT task needs to dip into memory reserves it is either already too late
> because the execution is already under RT constraints or this is init
> phase where the reclaim is not a problem yet.
> 

I completely agree. I included it in the changelog because I was tempted
to delete it now. I'm wary that the series will result in some
allocation failure bug reports and so played it cautious.

Hard realtime tasks should be locking down resources in advance. Even a
soft-realtime task like audio or video live decoding which cannot jitter
should be allocating both memory and any disk space required up-front
before the recording starts instead of relying on reserves. At best,
reserve access will only delay the problem by a very short interval.
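
As an aside, the usual userspace pattern for that is roughly the following
(illustrative only, error handling trimmed):

	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	/*
	 * Pre-allocate and fault in the working set before entering the
	 * time-critical loop so the RT thread never has to allocate and
	 * never depends on kernel memory reserves.
	 */
	static void *prepare_rt_buffer(size_t size)
	{
		void *buf = malloc(size);

		if (!buf)
			return NULL;
		memset(buf, 0, size);			/* fault the pages in now */
		mlockall(MCL_CURRENT | MCL_FUTURE);	/* keep them resident */
		return buf;
	}
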

> I have tried to trace down this special case and only found a patch from
> Robert Love from 2003 which says:
> : - Let real-time tasks dip further into the reserves than usual in
> :   __alloc_pages().  There are a lot of ways to special case this.  This
> :   patch just cuts z->pages_low in half, before doing the incremental min
> :   thing, for real-time tasks.  I do not do anything in the low memory slow
> :   path.  We can be a _lot_ more aggressive if we want.  Right now, we just
> :   give real-time tasks a little help.
> 
> This doesn't really explain why this is needed.
> 

No, it does not but I'm not willing to complain either. 20 years ago,
it might have been completely reasonable.

> We are really great at preserving a behavior and cementing it for
> future generations. Maybe we should just drop it and see if something
> breaks. We would get some reasoning at least finally.
> 
> So I am not opposed to the patch per se but I would much rather see this
> branch go away. If you want me I can condense the above into a changelog
> and send a patch (either on top of this one or replacing it). WDYT?
> 

I agree with you but given the risk of bisections hitting this series,
would you be opposed to delaying the removal by 1 kernel release? That
way bisections for failures will hit 6.3 and a single commit or at least
just a report against 6.3. That would mitigate the risk of a full revert
of the series. I can add a note to the changelog mentioning the expected
removal so git blame will also highlight it.

> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/7] mm/page_alloc: Explicitly record high-order atomic allocations in alloc_flags
  2023-01-11 15:36   ` Michal Hocko
@ 2023-01-12  9:38     ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2023-01-12  9:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Wed, Jan 11, 2023 at 04:36:01PM +0100, Michal Hocko wrote:
> On Mon 09-01-23 15:16:27, Mel Gorman wrote:
> > A high-order ALLOC_HARDER allocation is assumed to be atomic. While that
> > is accurate, it changes later in the series. In preparation, explicitly
> > record high-order atomic allocations in gfp_to_alloc_flags(). There is
> > a slight functional change in that OOM handling avoids using high-order
> > reserve until it has to.
> 
> I do not follow the oom handling part. IIRC we are dropping highatomic
> reserves before triggering oom. Something might have changed down the
> path but I can still see unreserve_highatomic_pageblock in
> should_reclaim_retry.
> 

That comment is now stale and should be removed because I fixed up the
OOM oddities. At this point, a series resubmission is needed because a
few changelogs have to be updated.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/7] mm/page_alloc.c: Allow __GFP_NOFAIL requests deeper access to reserves
  2023-01-11 15:46   ` Michal Hocko
@ 2023-01-12  9:43     ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2023-01-12  9:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Wed, Jan 11, 2023 at 04:46:13PM +0100, Michal Hocko wrote:
> On Mon 09-01-23 15:16:29, Mel Gorman wrote:
> > Currently __GFP_NOFAIL allocations without any other flags can access 25%
> > of the reserves but these requests imply that the system cannot make forward
> > progress until the allocation succeeds. Allow __GFP_NOFAIL access to 75%
> > of the min reserve.
> 
> I am not sure this is really needed. IIRC the original motivation for
> allowing NOFAIL request to access access to memory reserves was
> GFP_NOFS|__GFP_NOFAIL requests which do not invoke the OOM killer.
> The amount of memory reserves granted was not really important. The
> point was to allow to move forward. Giving more of the reserves is a
> double edge sword. It can help in some cases but it can also prevent
> other high priority users from fwd progress.
> 
> I would much rahter see such a change with an example where it really
> made a difference.
> 

Fair point but based on your review for "mm/page_alloc: Give GFP_ATOMIC
and non-blocking allocations access to reserves" and only allowing
non-blocking allocations to access reserves if __GFP_HIGH is also
specified, this patch becomes a no-op and can be dropped.

If __GFP_NOFAIL requests really require deeper access to reserves, it'll
have to be explicitly handled in __zone_watermark_ok and __GFP_NOFAIL
would be added to the ALLOC_RESERVES collection of flags.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations access to reserves
  2023-01-12  9:24         ` Mel Gorman
@ 2023-01-12  9:45           ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2023-01-12  9:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Thu 12-01-23 09:24:52, Mel Gorman wrote:
> On Thu, Jan 12, 2023 at 09:11:06AM +0100, Michal Hocko wrote:
> > On Wed 11-01-23 17:05:52, Mel Gorman wrote:
> > > On Wed, Jan 11, 2023 at 04:58:02PM +0100, Michal Hocko wrote:
> > > > On Mon 09-01-23 15:16:30, Mel Gorman wrote:
> > > > > Explicit GFP_ATOMIC allocations get flagged ALLOC_HARDER which is a bit
> > > > > vague. In preparation for removing __GFP_ATOMIC, give GFP_ATOMIC and
> > > > > other non-blocking allocation requests equal access to reserve.  Rename
> > > > > ALLOC_HARDER to ALLOC_NON_BLOCK to make it more clear what the flag
> > > > > means.
> > > > 
> > > > GFP_NOWAIT can be also used for opportunistic allocations which can and
> > > > should fail quickly if the memory is tight and more elaborate path
> > > > should be taken (e.g. try higher order allocation first but fall back to
> > > > smaller request if the memory is fragmented). Do we really want to give
> > > > those access to memory reserves as well?
> > > 
> > > Good question. Without __GFP_ATOMIC, GFP_NOWAIT only differs from GFP_ATOMIC
> > > by __GFP_HIGH but that is not enough to distinguish between a caller that
> > > cannot sleep versus one that is speculatively attempting an allocation but
> > > has other options. That changelog is misleading, it's not equal access
> > > as GFP_NOWAIT ends up with 25% of the reserves which is less than what
> > > GFP_ATOMIC gets.
> > > 
> > > Because it becomes impossible to distinguish between non-blocking and
> > > atomic without __GFP_ATOMIC, there is some justification for allowing
> > > access to reserves for GFP_NOWAIT. bio for example attempts an allocation
> > > (clears __GFP_DIRECT_RECLAIM) before falling back to mempool but delays
> > > in IO can also lead to further allocation pressure. mmu gather failing
> > > GFP_WAIT slows the rate memory can be freed. NFS failing GFP_NOWAIT will
> > > have to retry IOs multiple times. The examples were picked at random but
> > > the point is that there are cases where failing GFP_NOWAIT can degrade
> > > the system, particularly delay the cleaning of pages before reclaim.
> > 
> > Fair points.
> > 
> > > A lot of the truly speculative users appear to use GFP_NOWAIT | __GFP_NOWARN
> > > so one compromise would be to avoid using reserves if __GFP_NOWARN is
> > > also specified.
> > > 
> > > Something like this as a separate patch?
> > 
> > I cannot say I would be happy about adding more side effects to
> > __GFP_NOWARN. You are right that it should be used for those optimistic
> > allocation requests but historically all many of these subtle side effects
> > have kicked back at some point.
> 
> True.
> 
> > Wouldn't it make sense to explicitly
> > mark those places which really benefit from reserves instead?
> 
> That would be __GFP_HIGH and would require context from every caller on
> whether they need reserves or not and to determine what the consequences
> are if there is a stall. Is there immediate local fallout or wider fallout
> such as a variable delay before pages can be cleaned?

Yes, and I will not hide that I do not mind putting the burden on callers
to justify the additional requirement to eat from the otherwise shared
pool that the memory reserves are.

> > This is
> > more work but it should pay off long term. Your examples above would use
> > GFP_ATOMIC instead of GFP_NOWAIT.
> > 
> 
> Yes, although it would confuse the meaning of GFP_ATOMIC as a result.
> It's described as "%GFP_ATOMIC users can not sleep and need the allocation to
> succeed" and something like the bio callsite does not *need* the allocation
> to succeed. It can fallback to the mempool and performance simply degrades
> temporarily. No doubt there are a few abuses of GFP_ATOMIC just to get
> non-blocking behaviour already.

I am afraid GFP_ATOMIC will eventually require a closer look. Many
users are simply confused by the name and use it from the spin lock
context. Others use it from IRQ context because that is the right thing
to do (TM).

> > The semantic would be easier to explain as well. GFP_ATOMIC - non
> > sleeping allocations which are important so they have access to memory
> > reserves. GFP_NOWAIT - non sleeping allocations.
> > 
> 
> People's definition of "important" will vary wildly. The following would
> avoid reserve access for GFP_NOWAIT for now. It would need to be folded
> into this patch and a new changelog

OK, so that effectively means that the __GFP_HIGH modifier will give more
reserves to non-sleepable allocations than to sleepable ones. That is a
better semantic than other special casing because, when the two kinds of
allocation are competing, the non-sleepable one should win as it simply
cannot reclaim. That hierarchy makes sense to me. Thanks for bearing
with me here. Changing gfp flags semantics is a PITA. I wish we could
design the whole thing from scratch (and screw it up in yet another way).

I will ack the patch once you post the full version of it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] mm/page_alloc: Treat RT tasks similar to __GFP_HIGH
  2023-01-12  9:36     ` Mel Gorman
@ 2023-01-12  9:47       ` Michal Hocko
  2023-01-12 16:42         ` Mel Gorman
  2023-01-13  9:04       ` David Laight
  1 sibling, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2023-01-12  9:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Thu 12-01-23 09:36:23, Mel Gorman wrote:
[...]
> I agree with you but given the risk of bisections hitting this series,
> would you be opposed to delaying the removal by 1 kernel release? That
> way bisections for failures will hit 6.3 and a single commit or at least
> just a report against 6.3. That would mitigate the risk of a full revert
> of the series. I can add a note to the changelog mentioning the expected
> removal so git blame will also highlight it.

Sure. I will post the removal on top of your series and put myself into
the "wait for regression chair".

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] mm/page_alloc: Treat RT tasks similar to __GFP_HIGH
  2023-01-12  9:47       ` Michal Hocko
@ 2023-01-12 16:42         ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2023-01-12 16:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Thu, Jan 12, 2023 at 10:47:24AM +0100, Michal Hocko wrote:
> On Thu 12-01-23 09:36:23, Mel Gorman wrote:
> [...]
> > I agree with you but given the risk of bisections hitting this series,
> > would you be opposed to delaying the removal by 1 kernel release? That
> > way bisections for failures will hit 6.3 and a single commit or at least
> > just a report against 6.3. That would mitigate the risk of a full revert
> > of the series. I can add a note to the changelog mentioning the expected
> > removal so git blame will also highlight it.
> 
> Sure. I will post the removal on top of your series and put myself into
> the "wait for regression chair".
> 

I'm happy to sit in the same chair and send the patch. If there is an
example of an RT task actually caring about memory reserves, I'd like to
determine if it's a real problem or a badly designed RT application.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH 2/7] mm/page_alloc: Treat RT tasks similar to __GFP_HIGH
  2023-01-12  9:36     ` Mel Gorman
  2023-01-12  9:47       ` Michal Hocko
@ 2023-01-13  9:04       ` David Laight
  2023-01-13 11:09         ` Mel Gorman
  1 sibling, 1 reply; 34+ messages in thread
From: David Laight @ 2023-01-13  9:04 UTC (permalink / raw)
  To: 'Mel Gorman', Michal Hocko
  Cc: Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

From: Mel Gorman
> Sent: 12 January 2023 09:36
...
> Hard realtime tasks should be locking down resources in advance. Even a
> soft-realtime task like audio or video live decoding which cannot jitter
> should be allocating both memory and any disk space required up-front
> before the recording starts instead of relying on reserves. At best,
> reserve access will only delay the problem by a very short interval.

Or, at least, ensuring the system isn't memory limited.

The biggest effect on RT task latency/jitter (on a normal kernel)
is hardware interrupt and softint code 'stealing' the cpu.
The main 'culprit' being ethernet receive.
 
Unfortunately, if you are doing RTP audio (UDP data) you absolutely
need the ethernet receive to run. When the softint code decides
to drop back to a normal-priority kernel worker thread, packets
get lost.

(I've been running 10000 RTP streams - with 10k receive UDP sockets.)

So I doubt avoiding sleeps in kmalloc() is going to make a
significant difference.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] mm/page_alloc: Treat RT tasks similar to __GFP_HIGH
  2023-01-13  9:04       ` David Laight
@ 2023-01-13 11:09         ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2023-01-13 11:09 UTC (permalink / raw)
  To: David Laight
  Cc: Michal Hocko, Linux-MM, Andrew Morton, NeilBrown, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Fri, Jan 13, 2023 at 09:04:50AM +0000, David Laight wrote:
> From: Mel Gorman
> > Sent: 12 January 2023 09:36
> ...
> > Hard realtime tasks should be locking down resources in advance. Even a
> > soft-realtime task like audio or video live decoding which cannot jitter
> > should be allocating both memory and any disk space required up-front
> > before the recording starts instead of relying on reserves. At best,
> > reserve access will only delay the problem by a very short interval.
> 
> Or, at least, ensuring the system isn't memory limited.
> 

Added to changelog.

> The biggest effect on RT task latency/jitter (on a normal kernel)
> is hardware interrupt and softint code 'stealing' the cpu.
> The main 'culprit' being ethernet receive.
>  
> Unfortunately if you are doing RTP audio (UDP data) you absolutely
> need the ethernet receive to run. When the softint code decides
> to drop back to a normal priority kernel worker thread packets
> get lost.
> 

Yes, although this is a fundamental problem for background networking
processing in general IIUC and is independent of mm/. ksoftirqd may be
getting stalled behind a higher-priority task such as a realtime task, or
a task that has built up credit due to sleep time under
GENTLE_FAIR_SLEEPERS relative to ksoftirqd, etc. As a normal low-priority
CPU hog may be at the same priority as ksoftirqd, it can use enough of the
scheduler slices on the runqueue to cause an indirect priority inversion
if a high-priority task is sleeping waiting on network traffic it needs
ASAP that is stalled on ksoftirqd.

I didn't check for other examples but the only one I'm aware of that
boosts ksoftirqd priority is rcutorture (see rcutorture_booster_init). A
quick glance doesn't show any possibility of boosting ksoftirqd priority
temporarily when dealing with NET_RX_SOFTIRQ, although it might be an
interesting idea if it was demonstrated with a realistic (or at least
semi-realistic) test case that priority inversion can occur due to
background RX processing. It's not even PREEMPT_RT-specific; I suspect
it's a general problem. I don't think this series makes a difference
because if any of the critical tasks are depending on the memory reserves,
they're already in serious trouble.

> (I've been running 10000 RTP streams - with 10k receive UDP sockets.)
> 

min_reserve access isn't going to fix any stalls with that volume of
streams :D

> So I doubt avoiding sleeps in kmalloc() is going to make a
> significant difference.
> 

Agreed.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations access to reserves
  2023-01-11 17:05     ` Mel Gorman
  2023-01-12  8:11       ` Michal Hocko
@ 2023-01-14 22:10       ` NeilBrown
  2023-01-16 10:29         ` Mel Gorman
  1 sibling, 1 reply; 34+ messages in thread
From: NeilBrown @ 2023-01-14 22:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Michal Hocko, Linux-MM, Andrew Morton, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Thu, 12 Jan 2023, Mel Gorman wrote:
> On Wed, Jan 11, 2023 at 04:58:02PM +0100, Michal Hocko wrote:
> > On Mon 09-01-23 15:16:30, Mel Gorman wrote:
> > > Explicit GFP_ATOMIC allocations get flagged ALLOC_HARDER which is a bit
> > > vague. In preparation for removing __GFP_ATOMIC, give GFP_ATOMIC and
> > > other non-blocking allocation requests equal access to reserve.  Rename
> > > ALLOC_HARDER to ALLOC_NON_BLOCK to make it more clear what the flag
> > > means.
> > 
> > GFP_NOWAIT can be also used for opportunistic allocations which can and
> > should fail quickly if the memory is tight and more elaborate path
> > should be taken (e.g. try higher order allocation first but fall back to
> > smaller request if the memory is fragmented). Do we really want to give
> > those access to memory reserves as well?
> 
> Good question. Without __GFP_ATOMIC, GFP_NOWAIT only differs from GFP_ATOMIC
> by __GFP_HIGH but that is not enough to distinguish between a caller that
> cannot sleep versus one that is speculatively attempting an allocation but
> has other options.

Isn't that a distinction without a difference?
A caller that cannot sleep MUST have other options, because failure is
always possible.
The "other option" might be failure (error to user space, dropped packets
etc), but sometimes failure IS an option.

So the difference between ATOMIC and NOWAIT boils down to the perceived
cost of the "other options".  If that cost is high, then include
__GFP_HIGH to get GFP_ATOMIC.  If that cost is low, then don't include
__GFP_HIGH and get GFP_NOWAIT.

I don't think there is any useful third option that is worth supporting.

NeilBrown



>                     That changelog is misleading, it's not equal access
> as GFP_NOWAIT ends up with 25% of the reserves which is less than what
> GFP_ATOMIC gets.
> 
> Because it becomes impossible to distinguish between non-blocking and
> atomic without __GFP_ATOMIC, there is some justification for allowing
> access to reserves for GFP_NOWAIT. bio for example attempts an allocation
> (clears __GFP_DIRECT_RECLAIM) before falling back to mempool but delays
> in IO can also lead to further allocation pressure. mmu gather failing
> GFP_WAIT slows the rate memory can be freed. NFS failing GFP_NOWAIT will
> have to retry IOs multiple times. The examples were picked at random but
> the point is that there are cases where failing GFP_NOWAIT can degrade
> the system, particularly delay the cleaning of pages before reclaim.
> 
> A lot of the truly speculative users appear to use GFP_NOWAIT | __GFP_NOWARN
> so one compromise would be to avoid using reserves if __GFP_NOWARN is
> also specified.
> 
> Something like this as a separate patch?
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7244ab522028..0a7a0ac1b46d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4860,9 +4860,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
>  	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
>  		/*
>  		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
> -		 * if it can't schedule.
> +		 * if it can't schedule. Similarly, a caller specifying
> +		 * __GFP_NOWARN is likely a speculative allocation with a
> +		 * graceful recovery path.
>  		 */
> -		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> +		if (!(gfp_mask & (__GFP_NOMEMALLOC|__GFP_NOWARN))) {
>  			alloc_flags |= ALLOC_NON_BLOCK;
>  
>  			if (order > 0)
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations access to reserves
  2023-01-14 22:10       ` NeilBrown
@ 2023-01-16 10:29         ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2023-01-16 10:29 UTC (permalink / raw)
  To: NeilBrown
  Cc: Michal Hocko, Linux-MM, Andrew Morton, Thierry Reding,
	Matthew Wilcox, Vlastimil Babka, LKML

On Sun, Jan 15, 2023 at 09:10:13AM +1100, NeilBrown wrote:
> On Thu, 12 Jan 2023, Mel Gorman wrote:
> > On Wed, Jan 11, 2023 at 04:58:02PM +0100, Michal Hocko wrote:
> > > On Mon 09-01-23 15:16:30, Mel Gorman wrote:
> > > > Explicit GFP_ATOMIC allocations get flagged ALLOC_HARDER which is a bit
> > > > vague. In preparation for removing __GFP_ATOMIC, give GFP_ATOMIC and
> > > > other non-blocking allocation requests equal access to reserve.  Rename
> > > > ALLOC_HARDER to ALLOC_NON_BLOCK to make it more clear what the flag
> > > > means.
> > > 
> > > GFP_NOWAIT can be also used for opportunistic allocations which can and
> > > should fail quickly if the memory is tight and more elaborate path
> > > should be taken (e.g. try higher order allocation first but fall back to
> > > smaller request if the memory is fragmented). Do we really want to give
> > > those access to memory reserves as well?
> > 
> > Good question. Without __GFP_ATOMIC, GFP_NOWAIT only differs from GFP_ATOMIC
> > by __GFP_HIGH but that is not enough to distinguish between a caller that
> > cannot sleep versus one that is speculatively attempting an allocation but
> > has other options.
> 
> Isn't that a distinction without a difference?

Ideally yes but it's not always clear what the consequences of failure are.

> A caller than cannot sleep MUST have other options, because failure is
> always possible.
> The "other option" might be failure (error to user space, dropped packets
> etc), but sometimes failure IS an option.
> 

True, but it varies how gracefully it's handled; there is some cut&paste
involved and other cases where the GFP_ATOMIC usage predated the existence
or awareness of NOWAIT.

> So the difference between ATOMIC and NOWAIT boils down to the perceived
> cost of the "other options".  If that cost is high, then include
> __GFP_HIGH to get GFP_ATOMIC.  If that cost is low, then don't include
> __GFP_HIGH and get GFP_NOWAIT.
> 

Again, ideally yes but not necessarily true. It depends on how careful
the caller was. The core appears to get it right in the cases I checked;
I'm less sure about drivers.

> I don't think there is any useful third option that is worth supporting.
> 

That's what we'll find out over time once the series hits a released
kernel.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2023-01-16 10:29 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-09 15:16 [PATCH 0/6 v2] Discard __GFP_ATOMIC Mel Gorman
2023-01-09 15:16 ` [PATCH 1/7] mm/page_alloc: Rename ALLOC_HIGH to ALLOC_MIN_RESERVE Mel Gorman
2023-01-11 15:18   ` Michal Hocko
2023-01-12  9:26     ` Mel Gorman
2023-01-09 15:16 ` [PATCH 2/7] mm/page_alloc: Treat RT tasks similar to __GFP_HIGH Mel Gorman
2023-01-11 15:27   ` Michal Hocko
2023-01-12  9:36     ` Mel Gorman
2023-01-12  9:47       ` Michal Hocko
2023-01-12 16:42         ` Mel Gorman
2023-01-13  9:04       ` David Laight
2023-01-13 11:09         ` Mel Gorman
2023-01-09 15:16 ` [PATCH 3/7] mm/page_alloc: Explicitly record high-order atomic allocations in alloc_flags Mel Gorman
2023-01-10 15:28   ` Vlastimil Babka
2023-01-11 15:36   ` Michal Hocko
2023-01-12  9:38     ` Mel Gorman
2023-01-09 15:16 ` [PATCH 4/7] mm/page_alloc: Explicitly define what alloc flags deplete min reserves Mel Gorman
2023-01-11 14:04   ` Vlastimil Babka
2023-01-11 15:37   ` Michal Hocko
2023-01-09 15:16 ` [PATCH 5/7] mm/page_alloc.c: Allow __GFP_NOFAIL requests deeper access to reserves Mel Gorman
2023-01-11 14:05   ` Vlastimil Babka
2023-01-11 15:46   ` Michal Hocko
2023-01-12  9:43     ` Mel Gorman
2023-01-09 15:16 ` [PATCH 6/7] mm/page_alloc: Give GFP_ATOMIC and non-blocking allocations " Mel Gorman
2023-01-11 14:12   ` Vlastimil Babka
2023-01-11 15:58   ` Michal Hocko
2023-01-11 17:05     ` Mel Gorman
2023-01-12  8:11       ` Michal Hocko
2023-01-12  8:29         ` Michal Hocko
2023-01-12  9:24         ` Mel Gorman
2023-01-12  9:45           ` Michal Hocko
2023-01-14 22:10       ` NeilBrown
2023-01-16 10:29         ` Mel Gorman
2023-01-09 15:16 ` [PATCH 7/7] mm: discard __GFP_ATOMIC Mel Gorman
2023-01-12  8:12   ` Michal Hocko
