* [RFC PATCH 0/6] Drain remote per-cpu directly v2
@ 2022-05-09 13:07 Mel Gorman
  2022-05-09 13:08 ` [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list Mel Gorman
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Mel Gorman @ 2022-05-09 13:07 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM,
	Mel Gorman

Changelog since v1
o Fix unsafe RT locking scheme
o Use spin_trylock on UP PREEMPT_RT

This series has the same intent as Nicolas' series "mm/page_alloc: Remote
per-cpu lists drain support" -- avoid interference of a high priority
task due to a workqueue item draining per-cpu page lists. While many
workloads can tolerate a brief interruption, it may cause a real-time
task running on a NOHZ_FULL CPU to miss a deadline and, at a minimum,
the draining is non-deterministic.

Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
The local_lock on its own prevents migration and the IRQ disabling protects
from corruption due to an interrupt arriving while a page allocation is
in progress. The locking is inherently unsafe for remote access unless
the CPU is hot-removed.

This series adjusts the locking. A spinlock is added to struct
per_cpu_pages to protect the list contents while local_lock_irq continues
to prevent migration and IRQ reentry. This allows a remote CPU to safely
drain a remote per-cpu list.
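
To illustrate, the locking pattern the series moves towards is roughly
the following (a simplified sketch based on patches 5 and 6; the
spin_trylock fallback, statistics and error handling are omitted):

	/* Local allocation/free path */
	local_lock_irqsave(&pagesets.lock, flags);	/* pin CPU, disable IRQs */
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	spin_lock(&pcp->lock);				/* protect list contents */
	/* add/remove pages on pcp->lists, adjust pcp->count */
	spin_unlock(&pcp->lock);
	local_unlock_irqrestore(&pagesets.lock, flags);

	/* Remote drain path, no workqueue required */
	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
	spin_lock_irqsave(&pcp->lock, flags);		/* IRQs off for zone->lock */
	free_pcppages_bulk(zone, pcp->count, pcp, 0);
	spin_unlock_irqrestore(&pcp->lock, flags);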

This series is a partial series. Follow-on work should allow the
local_lock_irqsave to be converted to a local_lock to avoid IRQs being
disabled/enabled in most cases. Consequently, there are some TODO comments
highlighting the places that would change if local_lock was used. However,
there are enough corner cases that it deserves a series on its own
separated by one kernel release and the priority right now is to avoid
interference of high priority tasks.

Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
	and when it is storing per-cpu pages.

Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
	this is not necessary but it avoids per_cpu_pages consuming another
	cache line.

Patch 3 is a preparation patch to avoid code duplication.

Patch 4 is a simple micro-optimisation that improves code flow necessary for
	a later patch to avoid code duplication.

Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
	relying on local_lock to prevent migration, stabilise the pcp
	lookup and prevent IRQ reentrancy.

Patch 6 remote drains per-cpu pages directly instead of using a workqueue.

 include/linux/mm_types.h |   5 +
 include/linux/mmzone.h   |  12 +-
 mm/page_alloc.c          | 342 +++++++++++++++++++++++++--------------
 3 files changed, 230 insertions(+), 129 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list
  2022-05-09 13:07 [RFC PATCH 0/6] Drain remote per-cpu directly v2 Mel Gorman
@ 2022-05-09 13:08 ` Mel Gorman
  2022-05-13  8:41   ` Muchun Song
  2022-05-09 13:08 ` [PATCH 2/6] mm/page_alloc: Use only one PCP list for THP-sized allocations Mel Gorman
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2022-05-09 13:08 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM,
	Mel Gorman

The page allocator uses page->lru for storing pages on either buddy or
PCP lists. Create page->buddy_list and page->pcp_list as a union with
page->lru. This is simply to clarify what type of list a page is on
in the page allocator.

No functional change intended.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mm_types.h |  5 +++++
 mm/page_alloc.c          | 18 +++++++++---------
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8834e38c06a4..a2782e8af307 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -87,6 +87,7 @@ struct page {
 			 */
 			union {
 				struct list_head lru;
+
 				/* Or, for the Unevictable "LRU list" slot */
 				struct {
 					/* Always even, to negate PageTail */
@@ -94,6 +95,10 @@ struct page {
 					/* Count page's or folio's mlocks */
 					unsigned int mlock_count;
 				};
+
+				/* Or, free page */
+				struct list_head buddy_list;
+				struct list_head pcp_list;
 			};
 			/* See page-flags.h for PAGE_MAPPING_FLAGS */
 			struct address_space *mapping;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2db95780e003..63976ad4b7f1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -780,7 +780,7 @@ static inline bool set_page_guard(struct zone *zone, struct page *page,
 		return false;
 
 	__SetPageGuard(page);
-	INIT_LIST_HEAD(&page->lru);
+	INIT_LIST_HEAD(&page->buddy_list);
 	set_page_private(page, order);
 	/* Guard pages are not available for any usage */
 	__mod_zone_freepage_state(zone, -(1 << order), migratetype);
@@ -957,7 +957,7 @@ static inline void add_to_free_list(struct page *page, struct zone *zone,
 {
 	struct free_area *area = &zone->free_area[order];
 
-	list_add(&page->lru, &area->free_list[migratetype]);
+	list_add(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
 }
 
@@ -967,7 +967,7 @@ static inline void add_to_free_list_tail(struct page *page, struct zone *zone,
 {
 	struct free_area *area = &zone->free_area[order];
 
-	list_add_tail(&page->lru, &area->free_list[migratetype]);
+	list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
 }
 
@@ -981,7 +981,7 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 {
 	struct free_area *area = &zone->free_area[order];
 
-	list_move_tail(&page->lru, &area->free_list[migratetype]);
+	list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
 }
 
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
@@ -991,7 +991,7 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 	if (page_reported(page))
 		__ClearPageReported(page);
 
-	list_del(&page->lru);
+	list_del(&page->buddy_list);
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 	zone->free_area[order].nr_free--;
@@ -1493,7 +1493,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			mt = get_pcppage_migratetype(page);
 
 			/* must delete to avoid corrupting pcp list */
-			list_del(&page->lru);
+			list_del(&page->pcp_list);
 			count -= nr_pages;
 			pcp->count -= nr_pages;
 
@@ -3053,7 +3053,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * for IO devices that can merge IO requests if the physical
 		 * pages are ordered properly.
 		 */
-		list_add_tail(&page->lru, list);
+		list_add_tail(&page->pcp_list, list);
 		allocated++;
 		if (is_migrate_cma(get_pcppage_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
@@ -3392,7 +3392,7 @@ static void free_unref_page_commit(struct page *page, int migratetype,
 	__count_vm_event(PGFREE);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pindex = order_to_pindex(migratetype, order);
-	list_add(&page->lru, &pcp->lists[pindex]);
+	list_add(&page->pcp_list, &pcp->lists[pindex]);
 	pcp->count += 1 << order;
 
 	/*
@@ -3656,7 +3656,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 		}
 
 		page = list_first_entry(list, struct page, lru);
-		list_del(&page->lru);
+		list_del(&page->pcp_list);
 		pcp->count -= 1 << order;
 	} while (check_new_pcp(page, order));
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/6] mm/page_alloc: Use only one PCP list for THP-sized allocations
  2022-05-09 13:07 [RFC PATCH 0/6] Drain remote per-cpu directly v2 Mel Gorman
  2022-05-09 13:08 ` [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list Mel Gorman
@ 2022-05-09 13:08 ` Mel Gorman
  2022-05-09 13:08 ` [PATCH 3/6] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper Mel Gorman
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2022-05-09 13:08 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM,
	Mel Gorman

The per_cpu_pages is cache-aligned on a standard x86-64 distribution
configuration but a later patch will add a new field which would push
the structure into the next cache line. Use only one list to store
THP-sized pages on the per-cpu list. This assumes that the vast majority
of THP-sized allocations are GFP_MOVABLE but even if they were another type,
they would not contribute to serious fragmentation that potentially causes
a later THP allocation failure. Align per_cpu_pages on the cacheline
boundary to ensure there is no false cache sharing.

After this patch, the structure sizing is;

struct per_cpu_pages {
        int                        count;                /*     0     4 */
        int                        high;                 /*     4     4 */
        int                        batch;                /*     8     4 */
        short int                  free_factor;          /*    12     2 */
        short int                  expire;               /*    14     2 */
        struct list_head           lists[13];            /*    16   208 */

        /* size: 256, cachelines: 4, members: 6 */
        /* padding: 32 */
} __attribute__((__aligned__(64)));
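
For reference, with the typical configuration assumed here of
MIGRATE_PCPTYPES == 3 and PAGE_ALLOC_COSTLY_ORDER == 3, the 13 lists
above break down as:

	NR_LOWORDER_PCP_LISTS = MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1)
	                      = 3 * 4 = 12	/* orders 0-3, one list per pcp migratetype */
	NR_PCP_LISTS          = NR_LOWORDER_PCP_LISTS + NR_PCP_THP
	                      = 12 + 1 = 13	/* single list for THP-sized pages */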

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h | 11 +++++++----
 mm/page_alloc.c        |  4 ++--
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 962b14d403e8..abe530748de6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -358,15 +358,18 @@ enum zone_watermarks {
 };
 
 /*
- * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER plus one additional
- * for pageblock size for THP if configured.
+ * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. One additional list
+ * for THP which will usually be GFP_MOVABLE. Even if it is another type,
+ * it should not contribute to serious fragmentation causing THP allocation
+ * failures.
  */
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define NR_PCP_THP 1
 #else
 #define NR_PCP_THP 0
 #endif
-#define NR_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1 + NR_PCP_THP))
+#define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
+#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)
 
 /*
  * Shift to encode migratetype and order in the same integer, with order
@@ -392,7 +395,7 @@ struct per_cpu_pages {
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
 	struct list_head lists[NR_PCP_LISTS];
-};
+} ____cacheline_aligned_in_smp;
 
 struct per_cpu_zonestat {
 #ifdef CONFIG_SMP
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 63976ad4b7f1..ed2deb93a758 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -648,7 +648,7 @@ static inline unsigned int order_to_pindex(int migratetype, int order)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (order > PAGE_ALLOC_COSTLY_ORDER) {
 		VM_BUG_ON(order != pageblock_order);
-		base = PAGE_ALLOC_COSTLY_ORDER + 1;
+		return NR_LOWORDER_PCP_LISTS;
 	}
 #else
 	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
@@ -662,7 +662,7 @@ static inline int pindex_to_order(unsigned int pindex)
 	int order = pindex / MIGRATE_PCPTYPES;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (order > PAGE_ALLOC_COSTLY_ORDER)
+	if (pindex == NR_LOWORDER_PCP_LISTS)
 		order = pageblock_order;
 #else
 	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/6] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper
  2022-05-09 13:07 [RFC PATCH 0/6] Drain remote per-cpu directly v2 Mel Gorman
  2022-05-09 13:08 ` [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list Mel Gorman
  2022-05-09 13:08 ` [PATCH 2/6] mm/page_alloc: Use only one PCP list for THP-sized allocations Mel Gorman
@ 2022-05-09 13:08 ` Mel Gorman
  2022-05-09 13:08 ` [PATCH 4/6] mm/page_alloc: Remove unnecessary page == NULL check in rmqueue Mel Gorman
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2022-05-09 13:08 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM,
	Mel Gorman

This is a preparation patch to allow the buddy removal code to be reused
in a later patch.

No functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 87 ++++++++++++++++++++++++++++---------------------
 1 file changed, 50 insertions(+), 37 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ed2deb93a758..4c1acf666056 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3622,6 +3622,46 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
 #endif
 }
 
+static __always_inline
+struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
+			   unsigned int order, unsigned int alloc_flags,
+			   int migratetype)
+{
+	struct page *page;
+	unsigned long flags;
+
+	do {
+		page = NULL;
+		spin_lock_irqsave(&zone->lock, flags);
+		/*
+		 * order-0 request can reach here when the pcplist is skipped
+		 * due to non-CMA allocation context. HIGHATOMIC area is
+		 * reserved for high-order atomic allocation, so order-0
+		 * request should skip it.
+		 */
+		if (order > 0 && alloc_flags & ALLOC_HARDER) {
+			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+			if (page)
+				trace_mm_page_alloc_zone_locked(page, order, migratetype);
+		}
+		if (!page) {
+			page = __rmqueue(zone, order, migratetype, alloc_flags);
+			if (!page) {
+				spin_unlock_irqrestore(&zone->lock, flags);
+				return NULL;
+			}
+		}
+		__mod_zone_freepage_state(zone, -(1 << order),
+					  get_pcppage_migratetype(page));
+		spin_unlock_irqrestore(&zone->lock, flags);
+	} while (check_new_pages(page, order));
+
+	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
+	zone_statistics(preferred_zone, zone, 1);
+
+	return page;
+}
+
 /* Remove page from the per-cpu list, caller must protect the list */
 static inline
 struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
@@ -3702,9 +3742,14 @@ struct page *rmqueue(struct zone *preferred_zone,
 			gfp_t gfp_flags, unsigned int alloc_flags,
 			int migratetype)
 {
-	unsigned long flags;
 	struct page *page;
 
+	/*
+	 * We most definitely don't want callers attempting to
+	 * allocate greater than order-1 page units with __GFP_NOFAIL.
+	 */
+	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
+
 	if (likely(pcp_allowed_order(order))) {
 		/*
 		 * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
@@ -3718,38 +3763,10 @@ struct page *rmqueue(struct zone *preferred_zone,
 		}
 	}
 
-	/*
-	 * We most definitely don't want callers attempting to
-	 * allocate greater than order-1 page units with __GFP_NOFAIL.
-	 */
-	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
-
-	do {
-		page = NULL;
-		spin_lock_irqsave(&zone->lock, flags);
-		/*
-		 * order-0 request can reach here when the pcplist is skipped
-		 * due to non-CMA allocation context. HIGHATOMIC area is
-		 * reserved for high-order atomic allocation, so order-0
-		 * request should skip it.
-		 */
-		if (order > 0 && alloc_flags & ALLOC_HARDER) {
-			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
-			if (page)
-				trace_mm_page_alloc_zone_locked(page, order, migratetype);
-		}
-		if (!page) {
-			page = __rmqueue(zone, order, migratetype, alloc_flags);
-			if (!page)
-				goto failed;
-		}
-		__mod_zone_freepage_state(zone, -(1 << order),
-					  get_pcppage_migratetype(page));
-		spin_unlock_irqrestore(&zone->lock, flags);
-	} while (check_new_pages(page, order));
-
-	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
-	zone_statistics(preferred_zone, zone, 1);
+	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
+							migratetype);
+	if (unlikely(!page))
+		return NULL;
 
 out:
 	/* Separate test+clear to avoid unnecessary atomics */
@@ -3760,10 +3777,6 @@ struct page *rmqueue(struct zone *preferred_zone,
 
 	VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
 	return page;
-
-failed:
-	spin_unlock_irqrestore(&zone->lock, flags);
-	return NULL;
 }
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/6] mm/page_alloc: Remove unnecessary page == NULL check in rmqueue
  2022-05-09 13:07 [RFC PATCH 0/6] Drain remote per-cpu directly v2 Mel Gorman
                   ` (2 preceding siblings ...)
  2022-05-09 13:08 ` [PATCH 3/6] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper Mel Gorman
@ 2022-05-09 13:08 ` Mel Gorman
  2022-05-09 13:08 ` [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2022-05-09 13:08 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM,
	Mel Gorman

The VM_BUG_ON check for a valid page can be avoided with a simple
change in the flow. The ZONE_BOOSTED_WATERMARK is unlikely in general
and even more unlikely if the page allocation failed so mark the
branch unlikely.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4c1acf666056..dc0fdeb3795c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3765,17 +3765,18 @@ struct page *rmqueue(struct zone *preferred_zone,
 
 	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
 							migratetype);
-	if (unlikely(!page))
-		return NULL;
 
 out:
 	/* Separate test+clear to avoid unnecessary atomics */
-	if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
+	if (unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
 		clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
 		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
 	}
 
-	VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
+	if (unlikely(!page))
+		return NULL;
+
+	VM_BUG_ON_PAGE(bad_range(zone, page), page);
 	return page;
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock
  2022-05-09 13:07 [RFC PATCH 0/6] Drain remote per-cpu directly v2 Mel Gorman
                   ` (3 preceding siblings ...)
  2022-05-09 13:08 ` [PATCH 4/6] mm/page_alloc: Remove unnecessary page == NULL check in rmqueue Mel Gorman
@ 2022-05-09 13:08 ` Mel Gorman
  2022-05-22  2:49   ` Hugh Dickins
  2022-05-09 13:08 ` [PATCH 6/6] mm/page_alloc: Remotely drain per-cpu lists Mel Gorman
  2022-05-09 15:58 ` [RFC PATCH 0/6] Drain remote per-cpu directly v2 Minchan Kim
  6 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2022-05-09 13:08 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM,
	Mel Gorman

Currently the PCP lists are protected by using local_lock_irqsave to
prevent migration and IRQ reentrancy but this is inconvenient. Remote
draining of the lists is impossible, so a workqueue is required, and
every task allocation/free must disable and then enable interrupts, which
is expensive.

As preparation for dealing with both of those problems, protect the
lists with a spinlock. The IRQ-unsafe version of the lock is used
because IRQs are already disabled by local_lock_irqsave. spin_trylock
is used in preparation for a time when local_lock could be used instead
of local_lock_irqsave.
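
As a rough sketch, the fast paths after this patch follow this pattern
(simplified; see the diff below for the real flow, including the UP
pcp_trylock_prepare/finish handling):

	local_lock_irqsave(&pagesets.lock, flags);
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	if (spin_trylock(&pcp->lock)) {
		/* operate on pcp->lists and pcp->count */
		spin_unlock(&pcp->lock);
	} else {
		/* trylock failed; the caller falls back to the buddy lists */
	}
	local_unlock_irqrestore(&pagesets.lock, flags);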

The per_cpu_pages still fits within the same number of cache lines after
this patch relative to before the series.

struct per_cpu_pages {
        spinlock_t                 lock;                 /*     0     4 */
        int                        count;                /*     4     4 */
        int                        high;                 /*     8     4 */
        int                        batch;                /*    12     4 */
        short int                  free_factor;          /*    16     2 */
        short int                  expire;               /*    18     2 */

        /* XXX 4 bytes hole, try to pack */

        struct list_head           lists[13];            /*    24   208 */

        /* size: 256, cachelines: 4, members: 7 */
        /* sum members: 228, holes: 1, sum holes: 4 */
        /* padding: 24 */
} __attribute__((__aligned__(64)));

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |   1 +
 mm/page_alloc.c        | 169 ++++++++++++++++++++++++++++++++++++-----
 2 files changed, 149 insertions(+), 21 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index abe530748de6..8b5757735428 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -385,6 +385,7 @@ enum zone_watermarks {
 
 /* Fields and list protected by pagesets local_lock in page_alloc.c */
 struct per_cpu_pages {
+	spinlock_t lock;	/* Protects lists field */
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dc0fdeb3795c..7be21254087c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -132,6 +132,20 @@ static DEFINE_PER_CPU(struct pagesets, pagesets) __maybe_unused = {
 	.lock = INIT_LOCAL_LOCK(lock),
 };
 
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
+/*
+ * On SMP, spin_trylock is sufficient protection.
+ * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
+ */
+#define pcp_trylock_prepare(flags)	do { } while (0)
+#define pcp_trylock_finish(flag)	do { } while (0)
+#else
+
+/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */
+#define pcp_trylock_prepare(flags)	local_irq_save(flags)
+#define pcp_trylock_finish(flags)	local_irq_restore(flags)
+#endif
+
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DEFINE_PER_CPU(int, numa_node);
 EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -3082,15 +3096,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
  */
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 {
-	unsigned long flags;
 	int to_drain, batch;
 
-	local_lock_irqsave(&pagesets.lock, flags);
 	batch = READ_ONCE(pcp->batch);
 	to_drain = min(pcp->count, batch);
-	if (to_drain > 0)
+	if (to_drain > 0) {
+		unsigned long flags;
+
+		/*
+		 * free_pcppages_bulk expects IRQs disabled for zone->lock
+		 * so even though pcp->lock is not intended to be IRQ-safe,
+		 * it's needed in this context.
+		 */
+		spin_lock_irqsave(&pcp->lock, flags);
 		free_pcppages_bulk(zone, to_drain, pcp, 0);
-	local_unlock_irqrestore(&pagesets.lock, flags);
+		spin_unlock_irqrestore(&pcp->lock, flags);
+	}
 }
 #endif
 
@@ -3103,16 +3124,17 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
  */
 static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 {
-	unsigned long flags;
 	struct per_cpu_pages *pcp;
 
-	local_lock_irqsave(&pagesets.lock, flags);
-
 	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
-	if (pcp->count)
-		free_pcppages_bulk(zone, pcp->count, pcp, 0);
+	if (pcp->count) {
+		unsigned long flags;
 
-	local_unlock_irqrestore(&pagesets.lock, flags);
+		/* See drain_zone_pages on why this is disabling IRQs */
+		spin_lock_irqsave(&pcp->lock, flags);
+		free_pcppages_bulk(zone, pcp->count, pcp, 0);
+		spin_unlock_irqrestore(&pcp->lock, flags);
+	}
 }
 
 /*
@@ -3380,18 +3402,30 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 	return min(READ_ONCE(pcp->batch) << 2, high);
 }
 
-static void free_unref_page_commit(struct page *page, int migratetype,
-				   unsigned int order)
+/* Returns true if the page was committed to the per-cpu list. */
+static bool free_unref_page_commit(struct page *page, int migratetype,
+				   unsigned int order, bool locked)
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	int high;
 	int pindex;
 	bool free_high;
+	unsigned long __maybe_unused UP_flags;
 
 	__count_vm_event(PGFREE);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pindex = order_to_pindex(migratetype, order);
+
+	if (!locked) {
+		/* Protect against a parallel drain. */
+		pcp_trylock_prepare(UP_flags);
+		if (!spin_trylock(&pcp->lock)) {
+			pcp_trylock_finish(UP_flags);
+			return false;
+		}
+	}
+
 	list_add(&page->pcp_list, &pcp->lists[pindex]);
 	pcp->count += 1 << order;
 
@@ -3409,6 +3443,13 @@ static void free_unref_page_commit(struct page *page, int migratetype,
 
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch, free_high), pcp, pindex);
 	}
+
+	if (!locked) {
+		spin_unlock(&pcp->lock);
+		pcp_trylock_finish(UP_flags);
+	}
+
+	return true;
 }
 
 /*
@@ -3419,6 +3460,7 @@ void free_unref_page(struct page *page, unsigned int order)
 	unsigned long flags;
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
+	bool freed_pcp = false;
 
 	if (!free_unref_page_prepare(page, pfn, order))
 		return;
@@ -3440,8 +3482,11 @@ void free_unref_page(struct page *page, unsigned int order)
 	}
 
 	local_lock_irqsave(&pagesets.lock, flags);
-	free_unref_page_commit(page, migratetype, order);
+	freed_pcp = free_unref_page_commit(page, migratetype, order, false);
 	local_unlock_irqrestore(&pagesets.lock, flags);
+
+	if (unlikely(!freed_pcp))
+		free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
 }
 
 /*
@@ -3450,10 +3495,19 @@ void free_unref_page(struct page *page, unsigned int order)
 void free_unref_page_list(struct list_head *list)
 {
 	struct page *page, *next;
+	struct per_cpu_pages *pcp;
+	struct zone *locked_zone;
 	unsigned long flags;
 	int batch_count = 0;
 	int migratetype;
 
+	/*
+	 * An empty list is possible. Check early so that the later
+	 * lru_to_page() does not potentially read garbage.
+	 */
+	if (list_empty(list))
+		return;
+
 	/* Prepare pages for freeing */
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn = page_to_pfn(page);
@@ -3474,8 +3528,33 @@ void free_unref_page_list(struct list_head *list)
 		}
 	}
 
+	/*
+	 * Preparation could have drained the list due to failing to prepare
+	 * or all pages are being isolated.
+	 */
+	if (list_empty(list))
+		return;
+
+	VM_BUG_ON(in_hardirq());
+
 	local_lock_irqsave(&pagesets.lock, flags);
+
+	page = lru_to_page(list);
+	locked_zone = page_zone(page);
+	pcp = this_cpu_ptr(locked_zone->per_cpu_pageset);
+	spin_lock(&pcp->lock);
+
 	list_for_each_entry_safe(page, next, list, lru) {
+		struct zone *zone = page_zone(page);
+
+		/* Different zone, different pcp lock. */
+		if (zone != locked_zone) {
+			spin_unlock(&pcp->lock);
+			locked_zone = zone;
+			pcp = this_cpu_ptr(zone->per_cpu_pageset);
+			spin_lock(&pcp->lock);
+		}
+
 		/*
 		 * Non-isolated types over MIGRATE_PCPTYPES get added
 		 * to the MIGRATE_MOVABLE pcp list.
@@ -3485,18 +3564,33 @@ void free_unref_page_list(struct list_head *list)
 			migratetype = MIGRATE_MOVABLE;
 
 		trace_mm_page_free_batched(page);
-		free_unref_page_commit(page, migratetype, 0);
+
+		/*
+		 * If there is a parallel drain in progress, free to the buddy
+		 * allocator directly. This is expensive as the zone lock will
+		 * be acquired multiple times but if a drain is in progress
+		 * then an expensive operation is already taking place.
+		 *
+		 * TODO: Always false at the moment due to local_lock_irqsave
+		 *       and is preparation for converting to local_lock.
+		 */
+		if (unlikely(!free_unref_page_commit(page, migratetype, 0, true)))
+			free_one_page(page_zone(page), page, page_to_pfn(page), 0, migratetype, FPI_NONE);
 
 		/*
 		 * Guard against excessive IRQ disabled times when we get
 		 * a large list of pages to free.
 		 */
 		if (++batch_count == SWAP_CLUSTER_MAX) {
+			spin_unlock(&pcp->lock);
 			local_unlock_irqrestore(&pagesets.lock, flags);
 			batch_count = 0;
 			local_lock_irqsave(&pagesets.lock, flags);
+			pcp = this_cpu_ptr(locked_zone->per_cpu_pageset);
+			spin_lock(&pcp->lock);
 		}
 	}
+	spin_unlock(&pcp->lock);
 	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
@@ -3668,9 +3762,28 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 			int migratetype,
 			unsigned int alloc_flags,
 			struct per_cpu_pages *pcp,
-			struct list_head *list)
+			struct list_head *list,
+			bool locked)
 {
 	struct page *page;
+	unsigned long __maybe_unused UP_flags;
+
+	/*
+	 * spin_trylock is not necessary right now due to
+	 * local_lock_irqsave and is a preparation step for
+	 * a conversion to local_lock using the trylock to prevent
+	 * IRQ re-entrancy. If pcp->lock cannot be acquired, the caller
+	 * uses rmqueue_buddy.
+	 *
+	 * TODO: Convert local_lock_irqsave to local_lock.
+	 */
+	if (unlikely(!locked)) {
+		pcp_trylock_prepare(UP_flags);
+		if (!spin_trylock(&pcp->lock)) {
+			pcp_trylock_finish(UP_flags);
+			return NULL;
+		}
+	}
 
 	do {
 		if (list_empty(list)) {
@@ -3691,8 +3804,10 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 					migratetype, alloc_flags);
 
 			pcp->count += alloced << order;
-			if (unlikely(list_empty(list)))
-				return NULL;
+			if (unlikely(list_empty(list))) {
+				page = NULL;
+				goto out;
+			}
 		}
 
 		page = list_first_entry(list, struct page, lru);
@@ -3700,6 +3815,12 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 		pcp->count -= 1 << order;
 	} while (check_new_pcp(page, order));
 
+out:
+	if (!locked) {
+		spin_unlock(&pcp->lock);
+		pcp_trylock_finish(UP_flags);
+	}
+
 	return page;
 }
 
@@ -3724,7 +3845,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pcp->free_factor >>= 1;
 	list = &pcp->lists[order_to_pindex(migratetype, order)];
-	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
+	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list, false);
 	local_unlock_irqrestore(&pagesets.lock, flags);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
@@ -3759,7 +3880,8 @@ struct page *rmqueue(struct zone *preferred_zone,
 				migratetype != MIGRATE_MOVABLE) {
 			page = rmqueue_pcplist(preferred_zone, zone, order,
 					gfp_flags, migratetype, alloc_flags);
-			goto out;
+			if (likely(page))
+				goto out;
 		}
 	}
 
@@ -5326,6 +5448,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	local_lock_irqsave(&pagesets.lock, flags);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
+	spin_lock(&pcp->lock);
 
 	while (nr_populated < nr_pages) {
 
@@ -5336,11 +5459,13 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		}
 
 		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
-								pcp, pcp_list);
+							pcp, pcp_list, true);
 		if (unlikely(!page)) {
 			/* Try and get at least one page */
-			if (!nr_populated)
+			if (!nr_populated) {
+				spin_unlock(&pcp->lock);
 				goto failed_irq;
+			}
 			break;
 		}
 		nr_account++;
@@ -5353,6 +5478,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		nr_populated++;
 	}
 
+	spin_unlock(&pcp->lock);
 	local_unlock_irqrestore(&pagesets.lock, flags);
 
 	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
@@ -6992,6 +7118,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 	memset(pcp, 0, sizeof(*pcp));
 	memset(pzstats, 0, sizeof(*pzstats));
 
+	spin_lock_init(&pcp->lock);
 	for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
 		INIT_LIST_HEAD(&pcp->lists[pindex]);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 6/6] mm/page_alloc: Remotely drain per-cpu lists
  2022-05-09 13:07 [RFC PATCH 0/6] Drain remote per-cpu directly v2 Mel Gorman
                   ` (4 preceding siblings ...)
  2022-05-09 13:08 ` [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
@ 2022-05-09 13:08 ` Mel Gorman
  2022-05-09 15:58 ` [RFC PATCH 0/6] Drain remote per-cpu directly v2 Minchan Kim
  6 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2022-05-09 13:08 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM,
	Mel Gorman

From: Nicolas Saenz Julienne <nsaenzju@redhat.com>

Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
drain work queued by __drain_all_pages(). So introduce a new mechanism
to remotely drain the per-cpu lists. It is made possible by remotely
locking the new per-cpu spinlocks in 'struct per_cpu_pages'. A benefit of this
new scheme is that drain operations are now migration safe.

There was no observed performance degradation vs. the previous scheme.
Both netperf and hackbench were run in parallel with triggering the
__drain_all_pages(NULL, true) code path around 100 times per second.
The new scheme performs a bit better (~5%), although the important point
here is there are no performance regressions vs. the previous mechanism.
Per-cpu lists draining happens only in slow paths.
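
Condensed, the drain now amounts to the following per-CPU loop (a
simplified view combining __drain_all_pages() and drain_pages_zone()
after this series; the zone == NULL all-zones case is omitted):

	for_each_cpu(cpu, &cpus_with_pcps) {
		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);

		if (pcp->count) {
			/* Remote CPU's lists are now safe to touch directly */
			spin_lock_irqsave(&pcp->lock, flags);
			free_pcppages_bulk(zone, pcp->count, pcp, 0);
			spin_unlock_irqrestore(&pcp->lock, flags);
		}
	}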

Link: https://lore.kernel.org/r/20211103170512.2745765-4-nsaenzju@redhat.com
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 59 +++++--------------------------------------------
 1 file changed, 5 insertions(+), 54 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7be21254087c..4ac39d30ec8f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -164,13 +164,7 @@ DEFINE_PER_CPU(int, _numa_mem_);		/* Kernel "local memory" node */
 EXPORT_PER_CPU_SYMBOL(_numa_mem_);
 #endif
 
-/* work_structs for global per-cpu drains */
-struct pcpu_drain {
-	struct zone *zone;
-	struct work_struct work;
-};
 static DEFINE_MUTEX(pcpu_drain_mutex);
-static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);
 
 #ifdef CONFIG_GCC_PLUGIN_LATENT_ENTROPY
 volatile unsigned long latent_entropy __latent_entropy;
@@ -3090,9 +3084,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
  * Called from the vmstat counter updater to drain pagesets of this
  * currently executing processor on remote nodes after they have
  * expired.
- *
- * Note that this function must be called with the thread pinned to
- * a single processor.
  */
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 {
@@ -3117,10 +3108,6 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 
 /*
  * Drain pcplists of the indicated processor and zone.
- *
- * The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
  */
 static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 {
@@ -3139,10 +3126,6 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 
 /*
  * Drain pcplists of all zones on the indicated processor.
- *
- * The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
  */
 static void drain_pages(unsigned int cpu)
 {
@@ -3155,9 +3138,6 @@ static void drain_pages(unsigned int cpu)
 
 /*
  * Spill all of this CPU's per-cpu pages back into the buddy allocator.
- *
- * The CPU has to be pinned. When zone parameter is non-NULL, spill just
- * the single zone's pages.
  */
 void drain_local_pages(struct zone *zone)
 {
@@ -3169,24 +3149,6 @@ void drain_local_pages(struct zone *zone)
 		drain_pages(cpu);
 }
 
-static void drain_local_pages_wq(struct work_struct *work)
-{
-	struct pcpu_drain *drain;
-
-	drain = container_of(work, struct pcpu_drain, work);
-
-	/*
-	 * drain_all_pages doesn't use proper cpu hotplug protection so
-	 * we can race with cpu offline when the WQ can move this from
-	 * a cpu pinned worker to an unbound one. We can operate on a different
-	 * cpu which is alright but we also have to make sure to not move to
-	 * a different one.
-	 */
-	migrate_disable();
-	drain_local_pages(drain->zone);
-	migrate_enable();
-}
-
 /*
  * The implementation of drain_all_pages(), exposing an extra parameter to
  * drain on all cpus.
@@ -3207,13 +3169,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 	 */
 	static cpumask_t cpus_with_pcps;
 
-	/*
-	 * Make sure nobody triggers this path before mm_percpu_wq is fully
-	 * initialized.
-	 */
-	if (WARN_ON_ONCE(!mm_percpu_wq))
-		return;
-
 	/*
 	 * Do not drain if one is already in progress unless it's specific to
 	 * a zone. Such callers are primarily CMA and memory hotplug and need
@@ -3263,14 +3218,12 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 	}
 
 	for_each_cpu(cpu, &cpus_with_pcps) {
-		struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);
-
-		drain->zone = zone;
-		INIT_WORK(&drain->work, drain_local_pages_wq);
-		queue_work_on(cpu, mm_percpu_wq, &drain->work);
+		if (zone) {
+			drain_pages_zone(cpu, zone);
+		} else {
+			drain_pages(cpu);
+		}
 	}
-	for_each_cpu(cpu, &cpus_with_pcps)
-		flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);
 
 	mutex_unlock(&pcpu_drain_mutex);
 }
@@ -3279,8 +3232,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
  * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
  *
  * When zone parameter is non-NULL, spill just the single zone's pages.
- *
- * Note that this can be extremely slow as the draining happens in a workqueue.
  */
 void drain_all_pages(struct zone *zone)
 {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/6] Drain remote per-cpu directly v2
  2022-05-09 13:07 [RFC PATCH 0/6] Drain remote per-cpu directly v2 Mel Gorman
                   ` (5 preceding siblings ...)
  2022-05-09 13:08 ` [PATCH 6/6] mm/page_alloc: Remotely drain per-cpu lists Mel Gorman
@ 2022-05-09 15:58 ` Minchan Kim
  2022-05-10  9:27   ` Mel Gorman
  6 siblings, 1 reply; 17+ messages in thread
From: Minchan Kim @ 2022-05-09 15:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM

On Mon, May 09, 2022 at 02:07:59PM +0100, Mel Gorman wrote:
> Changelog since v1
> o Fix unsafe RT locking scheme
> o Use spin_trylock on UP PREEMPT_RT

Mel,


Is this the only change from the previous version, which has some
delta you fixed based on feedback from Vlastimil and me?

And is it still RFC?

Do you have some benchmark data?

I'd like to give Acked-by/Tested-by (even though there are a few
more places to align with the new field names in 1/6) since I have
tested this patchset with my workload and didn't spot any other
problems.

I really support this patchset being merged since the pcp draining
using a workqueue is really harmful in the reclaim/cma path of the
Android workload, which has a variety of process priority controls
and gets stuck on the pcp draining.

> 
> This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> per-cpu lists drain support" -- avoid interference of a high priority
> task due to a workqueue item draining per-cpu page lists. While many
> workloads can tolerate a brief interruption, it may cause a real-time
> task running on a NOHZ_FULL CPU to miss a deadline and, at a minimum,
> the draining is non-deterministic.
> 
> Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> The local_lock on its own prevents migration and the IRQ disabling protects
> from corruption due to an interrupt arriving while a page allocation is
> in progress. The locking is inherently unsafe for remote access unless
> the CPU is hot-removed.
> 
> This series adjusts the locking. A spinlock is added to struct
> per_cpu_pages to protect the list contents while local_lock_irq continues
> to prevent migration and IRQ reentry. This allows a remote CPU to safely
> drain a remote per-cpu list.
> 
> This series is a partial series. Follow-on work should allow the
> local_irq_save to be converted to a local_irq to avoid IRQs being
> disabled/enabled in most cases. Consequently, there are some TODO comments
> highlighting the places that would change if local_irq was used. However,
> there are enough corner cases that it deserves a series on its own
> separated by one kernel release and the priority right now is to avoid
> interference of high priority tasks.
> 
> Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
> 	and when it is storing per-cpu pages.
> 
> Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
> 	this is not necessary but it avoids per_cpu_pages consuming another
> 	cache line.
> 
> Patch 3 is a preparation patch to avoid code duplication.
> 
> Patch 4 is a simple micro-optimisation that improves code flow necessary for
> 	a later patch to avoid code duplication.
> 
> Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
> 	relying on local_lock to prevent migration, stabilise the pcp
> 	lookup and prevent IRQ reentrancy.
> 
> Patch 6 remote drains per-cpu pages directly instead of using a workqueue.
> 
>  include/linux/mm_types.h |   5 +
>  include/linux/mmzone.h   |  12 +-
>  mm/page_alloc.c          | 342 +++++++++++++++++++++++++--------------
>  3 files changed, 230 insertions(+), 129 deletions(-)
> 
> -- 
> 2.34.1
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/6] Drain remote per-cpu directly v2
  2022-05-09 15:58 ` [RFC PATCH 0/6] Drain remote per-cpu directly v2 Minchan Kim
@ 2022-05-10  9:27   ` Mel Gorman
  2022-05-10 18:13     ` Minchan Kim
  0 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2022-05-10  9:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM

On Mon, May 09, 2022 at 08:58:51AM -0700, Minchan Kim wrote:
> On Mon, May 09, 2022 at 02:07:59PM +0100, Mel Gorman wrote:
> > Changelog since v1
> > o Fix unsafe RT locking scheme
> > o Use spin_trylock on UP PREEMPT_RT
> 
> Mel,
> 
> 
> Is this only change from previous last version which has some
> delta you fixed based on Vlastimil and me?
> 

Full diff is below although it can also be generated by
comparing the mm-pcpdrain-v1r8..mm-pcpdrain-v2r1 branches in
https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/

> And is it still RFC?
> 

It's still RFC because it's a different approach to Nicolas' series and
I want at least his Acked-by before taking the RFC label out and sending
it to Andrew.

> Do you have some benchmark data?
> 

Yes, but as reclaim is not fundamentally altered the main difference
in behaviour is that work is done inline instead of being deferred to a
workqueue. That means in some cases, system CPU usage of a task will be
higher because it's paying the cost directly.

The workloads I used just hit reclaim directly to make sure it's
functionally not broken. There is no change in page aging decisions,
only timing of drains. I didn't check a heavy workload interfering
with a CPU-bound workload running on NOHZ CPUs as I assumed
both you and Nicolas had a test case ready to use.

The main one I paid interest to was a fault latency benchmark in
the presence of heavy reclaim called stutterp. It simulates a simple
workload. One part uses a lot of anonymous memory, a second measures mmap
latency and a third copies a large file.  The primary metric is checking
for mmap latency. It was originally put together to debug interactivity
issues on a desktop in the presence of heavy IO where the desktop
applications were being pushed to backing storage.

stutterp
                              5.18.0-rc1             5.18.0-rc1
                                 vanilla       mm-pcpdrain-v2r1
1st-qrtle mmap-4      15.9557 (   0.00%)     15.4045 (   3.45%)
1st-qrtle mmap-6      10.8025 (   0.00%)     11.1204 (  -2.94%)
1st-qrtle mmap-8      16.9338 (   0.00%)     17.0595 (  -0.74%)
1st-qrtle mmap-12     41.4746 (   0.00%)      9.4003 (  77.33%)
1st-qrtle mmap-18     47.7123 (   0.00%)    100.0275 (-109.65%)
1st-qrtle mmap-24     17.7098 (   0.00%)     16.9633 (   4.22%)
1st-qrtle mmap-30     69.2565 (   0.00%)     38.2205 (  44.81%)
1st-qrtle mmap-32     49.1295 (   0.00%)     46.8819 (   4.57%)
3rd-qrtle mmap-4      18.4706 (   0.00%)     17.4799 (   5.36%)
3rd-qrtle mmap-6      11.4526 (   0.00%)     11.5567 (  -0.91%)
3rd-qrtle mmap-8      19.5903 (   0.00%)     19.0046 (   2.99%)
3rd-qrtle mmap-12     50.3095 (   0.00%)     25.3254 (  49.66%)
3rd-qrtle mmap-18     67.3319 (   0.00%)    147.6404 (-119.27%)
3rd-qrtle mmap-24     41.3779 (   0.00%)     84.4035 (-103.98%)
3rd-qrtle mmap-30    127.1375 (   0.00%)    148.8884 ( -17.11%)
3rd-qrtle mmap-32     79.7423 (   0.00%)    182.3042 (-128.62%)
Max-99    mmap-4      46.9123 (   0.00%)     49.7731 (  -6.10%)
Max-99    mmap-6      42.5414 (   0.00%)     16.6173 (  60.94%)
Max-99    mmap-8      43.1237 (   0.00%)     23.3478 (  45.86%)
Max-99    mmap-12     62.8025 (   0.00%)   1947.3862 (-3000.81%)
Max-99    mmap-18  27936.8695 (   0.00%)    232.7122 (  99.17%)
Max-99    mmap-24 204543.9436 (   0.00%)   5805.2478 (  97.16%)
Max-99    mmap-30   2350.0289 (   0.00%)  10300.6344 (-338.32%)
Max-99    mmap-32  56164.2271 (   0.00%)   7789.7526 (  86.13%)
Max       mmap-4     840.3468 (   0.00%)   1137.4462 ( -35.35%)
Max       mmap-6  255233.3996 (   0.00%)  91304.0952 (  64.23%)
Max       mmap-8  210910.6497 (   0.00%) 117931.0796 (  44.08%)
Max       mmap-12 108268.9537 (   0.00%) 319971.6910 (-195.53%)
Max       mmap-18 608805.3195 (   0.00%) 197483.2205 (  67.56%)
Max       mmap-24 327697.5605 (   0.00%) 382842.5356 ( -16.83%)
Max       mmap-30 688684.5335 (   0.00%) 669992.7705 (   2.71%)
Max       mmap-32 396842.0114 (   0.00%) 415978.2539 (  -4.82%)

                  5.18.0-rc1  5.18.0-rc1
                     vanilla  mm-pcpdrain-v2r1
Duration User        1438.08     1637.21
Duration System     12267.41    10307.96
Duration Elapsed     3929.70     3443.53


It's a mixed bag but this workload is always a mixed bag and it's stressing
reclaim.  At some points, latencies are worse, in others better. Overall,
it completed faster and this was on a 1-socket machine.

On a 2-socket machine, the overall completion times were worse

                  5.18.0-rc1  5.18.0-rc1
                     vanilla  mm-pcpdrain-v2r1
Duration User        3713.75     2899.90
Duration System    303507.56   378909.94
Duration Elapsed    15444.59    19067.40

In general this type of workload is variable given the nature of what it
does and can give different results on each execution. When originally
designed, it was to deal with stalls lasting several seconds to reduce
them to the sub-millisecond range.

The intent of the series is switching out-of-line work to in-line so
what it should be measuring is interference effects and not straight-line
performance and I haven't written a specific test case yet. When writing
the series initially, it was to investigate if the PCP could be lockless
and failing that, if disabling IRQs could be avoided in the common case.
It just turned out that part of that made remote draining possible and
I focused closer on that because it's more important.

> I'd like to give Acked-by/Tested-by(even though there are a few
> more places to align with new fields name in 1/6)

Which ones are of concern?

Some of the page->lru references I left alone in the init paths simply
because in those contexts, the page wasn't on a buddy or PCP list. In
free_unref_page_list the page is not on the LRU, it's just been isolated
from the LRU. In alloc_pages_bulk, it's not on a buddy, pcp or LRU list
and is just a list placeholder so I left it alone. In
free_tail_pages_check the context was a page that was likely previously
on a LRU.

> since I have
> tested these patchset in my workload and didn't spot any other
> problems.
> 

Can you describe this workload, is it available anywhere and does it
require Android to execute?

If you have positive results, it would be appreciated if you could post
them or just note in a Tested-by/Acked-by that it had a measurable impact
on the reclaim/cma path.

> I really support this patch to be merged since the pcp draining
> using workqueue is really harmful in the reclaim/cma path of
> Android workload which has a variety of process priority control
> and be stuck on the pcp draining.
> 

Diff between v1 and v2

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17d11eb0413e..4ac39d30ec8f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -132,8 +132,11 @@ static DEFINE_PER_CPU(struct pagesets, pagesets) __maybe_unused = {
 	.lock = INIT_LOCAL_LOCK(lock),
 };
 
-#ifdef CONFIG_SMP
-/* On SMP, spin_trylock is sufficient protection */
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
+/*
+ * On SMP, spin_trylock is sufficient protection.
+ * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
+ */
 #define pcp_trylock_prepare(flags)	do { } while (0)
 #define pcp_trylock_finish(flag)	do { } while (0)
 #else
@@ -3091,14 +3094,14 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 	if (to_drain > 0) {
 		unsigned long flags;
 
-		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
-		local_irq_save(flags);
-
-		spin_lock(&pcp->lock);
+		/*
+		 * free_pcppages_bulk expects IRQs disabled for zone->lock
+		 * so even though pcp->lock is not intended to be IRQ-safe,
+		 * it's needed in this context.
+		 */
+		spin_lock_irqsave(&pcp->lock, flags);
 		free_pcppages_bulk(zone, to_drain, pcp, 0);
-		spin_unlock(&pcp->lock);
-
-		local_irq_restore(flags);
+		spin_unlock_irqrestore(&pcp->lock, flags);
 	}
 }
 #endif
@@ -3114,14 +3117,10 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 	if (pcp->count) {
 		unsigned long flags;
 
-		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
-		local_irq_save(flags);
-
-		spin_lock(&pcp->lock);
+		/* See drain_zone_pages on why this is disabling IRQs */
+		spin_lock_irqsave(&pcp->lock, flags);
 		free_pcppages_bulk(zone, pcp->count, pcp, 0);
-		spin_unlock(&pcp->lock);
-
-		local_irq_restore(flags);
+		spin_unlock_irqrestore(&pcp->lock, flags);
 	}
 }
 
@@ -3480,6 +3479,13 @@ void free_unref_page_list(struct list_head *list)
 		}
 	}
 
+	/*
+	 * Preparation could have drained the list due to failing to prepare
+	 * or all pages are being isolated.
+	 */
+	if (list_empty(list))
+		return;
+
 	VM_BUG_ON(in_hardirq());
 
 	local_lock_irqsave(&pagesets.lock, flags);
@@ -3515,6 +3521,9 @@ void free_unref_page_list(struct list_head *list)
 		 * allocator directly. This is expensive as the zone lock will
 		 * be acquired multiple times but if a drain is in progress
 		 * then an expensive operation is already taking place.
+		 *
+		 * TODO: Always false at the moment due to local_lock_irqsave
+		 *       and is preparation for converting to local_lock.
 		 */
 		if (unlikely(!free_unref_page_commit(page, migratetype, 0, true)))
 			free_one_page(page_zone(page), page, page_to_pfn(page), 0, migratetype, FPI_NONE);
@@ -3717,9 +3726,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 	 * IRQ re-entrancy. If pcp->lock cannot be acquired, the caller
 	 * uses rmqueue_buddy.
 	 *
-	 * TODO: Convert local_lock_irqsave to local_lock. Care
-	 * 	 is needed as the type of local_lock would need a
-	 * 	 PREEMPT_RT version due to threaded IRQs.
+	 * TODO: Convert local_lock_irqsave to local_lock.
 	 */
 	if (unlikely(!locked)) {
 		pcp_trylock_prepare(UP_flags);

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/6] Drain remote per-cpu directly v2
  2022-05-10  9:27   ` Mel Gorman
@ 2022-05-10 18:13     ` Minchan Kim
  2022-05-11 12:47       ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Minchan Kim @ 2022-05-10 18:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM

On Tue, May 10, 2022 at 10:27:33AM +0100, Mel Gorman wrote:
> On Mon, May 09, 2022 at 08:58:51AM -0700, Minchan Kim wrote:
> > On Mon, May 09, 2022 at 02:07:59PM +0100, Mel Gorman wrote:
> > > Changelog since v1
> > > o Fix unsafe RT locking scheme
> > > o Use spin_trylock on UP PREEMPT_RT
> > 
> > Mel,
> > 
> > 
> > Is this only change from previous last version which has some
> > delta you fixed based on Vlastimil and me?
> > 
> 
> Full diff is below although it can also be generated by
> comparing the mm-pcpdrain-v1r8..mm-pcpdrain-v2r1 branches in
> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/

I took the delta when I started testing so the testing result
would be valid. Thanks.

> 
> > And is it still RFC?
> > 
> 
> It's still RFC because it's a different approach to Nicolas' series and
> I want at least his Acked-by before taking the RFC label out and sending
> it to Andrew.
> 
> > Do you have some benchmark data?
> > 
> 
> Yes, but as reclaim is not fundamentally altered the main difference
> in behaviour is that work is done inline instead of being deferred to a
> workqueue. That means in some cases, system CPU usage of a task will be
> higher because it's paying the cost directly.

Sure, but the reclaim path is already expensive so I doubt we could
see a sizable difference in the system CPU usage.

What I wanted to see was whether we have a regression due to adding
spin_lock/unlock instructions in the hot path. Since it is squeezed into
a cacheline, I expected the regression would be just marginal.

> 
> The workloads I used just hit reclaim directly to make sure it's
> functionally not broken. There is no change in page aging decisions,
> only timing of drains. I didn't check interference of a heavy workload
> interfering with a CPU-bound workload running on NOHZ CPUs as I assumed
> both you and Nicolas had a test case ready to use.

My workload does not use NOHZ CPUs but runs apps under heavy memory
pressure so they go into direct reclaim and get stuck on drain_all_pages
until the work on the workqueue runs.

unit: nanosecond
max(dur)        avg(dur)                count(dur)
166713013       487511.77786438033      1283

From the traces, the system encountered drain_all_pages 1283 times; the
worst case was 166ms and the average was 487us.

The other problem was alloc_contig_range in CMA. The PCP draining
sometimes takes several hundred milliseconds even though there is no
memory pressure and only a few pages need to be migrated out, because
the CPUs were fully booked.

Your patch completely removed that wasted time.

> 
> The main one I paid interest to was a fault latency benchmark in
> the presence of heavy reclaim called stutterp. It simulates a simple
> workload. One part uses a lot of anonymous memory, a second measures mmap
> latency and a third copies a large file.  The primary metric is checking
> for mmap latency. It was originally put together to debug interactivity
> issues on a desktop in the presence of heavy IO where the desktop
> applications were being pushed to backing storage.
> 
> stutterp
>                               5.18.0-rc1             5.18.0-rc1
>                                  vanilla       mm-pcpdrain-v2r1
> 1st-qrtle mmap-4      15.9557 (   0.00%)     15.4045 (   3.45%)
> 1st-qrtle mmap-6      10.8025 (   0.00%)     11.1204 (  -2.94%)
> 1st-qrtle mmap-8      16.9338 (   0.00%)     17.0595 (  -0.74%)
> 1st-qrtle mmap-12     41.4746 (   0.00%)      9.4003 (  77.33%)
> 1st-qrtle mmap-18     47.7123 (   0.00%)    100.0275 (-109.65%)
> 1st-qrtle mmap-24     17.7098 (   0.00%)     16.9633 (   4.22%)
> 1st-qrtle mmap-30     69.2565 (   0.00%)     38.2205 (  44.81%)
> 1st-qrtle mmap-32     49.1295 (   0.00%)     46.8819 (   4.57%)
> 3rd-qrtle mmap-4      18.4706 (   0.00%)     17.4799 (   5.36%)
> 3rd-qrtle mmap-6      11.4526 (   0.00%)     11.5567 (  -0.91%)
> 3rd-qrtle mmap-8      19.5903 (   0.00%)     19.0046 (   2.99%)
> 3rd-qrtle mmap-12     50.3095 (   0.00%)     25.3254 (  49.66%)
> 3rd-qrtle mmap-18     67.3319 (   0.00%)    147.6404 (-119.27%)
> 3rd-qrtle mmap-24     41.3779 (   0.00%)     84.4035 (-103.98%)
> 3rd-qrtle mmap-30    127.1375 (   0.00%)    148.8884 ( -17.11%)
> 3rd-qrtle mmap-32     79.7423 (   0.00%)    182.3042 (-128.62%)
> Max-99    mmap-4      46.9123 (   0.00%)     49.7731 (  -6.10%)
> Max-99    mmap-6      42.5414 (   0.00%)     16.6173 (  60.94%)
> Max-99    mmap-8      43.1237 (   0.00%)     23.3478 (  45.86%)
> Max-99    mmap-12     62.8025 (   0.00%)   1947.3862 (-3000.81%)
> Max-99    mmap-18  27936.8695 (   0.00%)    232.7122 (  99.17%)
> Max-99    mmap-24 204543.9436 (   0.00%)   5805.2478 (  97.16%)
> Max-99    mmap-30   2350.0289 (   0.00%)  10300.6344 (-338.32%)
> Max-99    mmap-32  56164.2271 (   0.00%)   7789.7526 (  86.13%)
> Max       mmap-4     840.3468 (   0.00%)   1137.4462 ( -35.35%)
> Max       mmap-6  255233.3996 (   0.00%)  91304.0952 (  64.23%)
> Max       mmap-8  210910.6497 (   0.00%) 117931.0796 (  44.08%)
> Max       mmap-12 108268.9537 (   0.00%) 319971.6910 (-195.53%)
> Max       mmap-18 608805.3195 (   0.00%) 197483.2205 (  67.56%)
> Max       mmap-24 327697.5605 (   0.00%) 382842.5356 ( -16.83%)
> Max       mmap-30 688684.5335 (   0.00%) 669992.7705 (   2.71%)
> Max       mmap-32 396842.0114 (   0.00%) 415978.2539 (  -4.82%)
> 
>                   5.18.0-rc1  5.18.0-rc1
>                          vanilla     mm-pcpdrain-v2r1
> Duration User        1438.08     1637.21
> Duration System     12267.41    10307.96
> Duration Elapsed     3929.70     3443.53
> 
> 
> It's a mixed bag but this workload is always a mixed bag and it's stressing
> reclaim.  At some points, latencies are worse, in others better. Overall,
> it completed faster and this was on a 1-socket machine.
> 
> On a 2-socket machine, the overall completion times were worse
> 
>                   5.18.0-rc1  5.18.0-rc1
>                          vanilla     mm-pcpdrain-v2r1
> Duration User        3713.75     2899.90
> Duration System    303507.56   378909.94
> Duration Elapsed    15444.59    19067.40
> 
> In general this type of workload is variable given the nature of what it
> does and can give different results on each execution. When originally
> designed, it was to deal with stalls lasting several seconds to reduce
> them to the sub-millisecond range.
> 
> The intent of the series is to switch out-of-line work to in-line, so
> what it should be measuring is interference effects and not straight-line
> performance, and I haven't written a specific test case for that yet. When writing
> the series initially, it was to investigate if the PCP could be lockless
> and failing that, if disabling IRQs could be avoided in the common case.
> It just turned out that part of that made remote draining possible and
> I focused closer on that because it's more important.
> 
> > I'd like to give Acked-by/Tested-by (even though there are a few
> > more places to align with the new field names in 1/6)
> 
> Which ones are of concern?
> 
> Some of the page->lru references I left alone in the init paths simply
> because in those contexts, the page wasn't on a buddy or PCP list. In
> free_unref_page_list the page is not on the LRU, it's just been isolated
> from the LRU. In alloc_pages_bulk, it's not on a buddy, pcp or LRU list
> and is just a list placeholder so I left it alone. In
> free_tail_pages_check the context was a page that was likely previously
> on a LRU.

Just nits: all are list macros.

free_pcppages_bulk's list_last_entry should be pcp_list.

mark_free_pages's list_for_each_entry should be buddy_list.

__rmqueue_pcplist's list_first_entry should be pcp_list.

> 
> > since I have
> > tested this patchset with my workload and didn't spot any other
> > problems.
> > 
> 
> Can you describe this workload? Is it available anywhere and does it
> require Android to execute?

I described it above. It runs on Android, but I don't think it's an
Android-specific issue; anyone could see such a long latency from PCP
draining once one of the cores is monopolized by higher-priority
processes or too many pending kworks.

> 
> If you have positive results, it would be appreciated if you could post
> them or just note in a Tested-by/Acked-by that it had a measurable impact
> on the reclaim/cma path.

Sure.

All patches in this series.

Tested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/6] Drain remote per-cpu directly v2
  2022-05-10 18:13     ` Minchan Kim
@ 2022-05-11 12:47       ` Mel Gorman
  2022-05-11 17:20         ` Minchan Kim
  0 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2022-05-11 12:47 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM

On Tue, May 10, 2022 at 11:13:05AM -0700, Minchan Kim wrote:
> > Yes, but as reclaim is not fundamentally altered the main difference
> > in behaviour is that work is done inline instead of being deferred to a
> > workqueue. That means in some cases, system CPU usage of a task will be
> > higher because it's paying the cost directly.
> 
> Sure, but the reclaim path is already expensive, so I doubt we could
> see a sizable difference in system CPU usage.
> 

It would be difficult to distinguish from the noise.

> What I wanted to see was whether we have a regression due to adding
> spin_lock/unlock instructions in the hot path. Since the lock is squeezed
> into an existing cacheline, I expected the regression would be just marginal.
> 

Ah, yes, I did test for this. page-fault-test hits the relevant paths
very heavily and did show minor differences.
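
The test is essentially an anonymous-fault-rate microbenchmark run with
varying thread counts. Roughly, each instance does something like the
following (a simplified single-threaded sketch, not the exact mmtests
implementation; error handling omitted):

	#include <stdio.h>
	#include <sys/mman.h>
	#include <time.h>
	#include <unistd.h>

	int main(void)
	{
		size_t size = 1UL << 30;	/* 1G anonymous mapping */
		long pagesz = sysconf(_SC_PAGESIZE);
		struct timespec start, end;
		double secs;
		char *p;

		p = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		clock_gettime(CLOCK_MONOTONIC, &start);
		for (size_t off = 0; off < size; off += pagesz)
			p[off] = 1;		/* one minor fault per page */
		clock_gettime(CLOCK_MONOTONIC, &end);

		secs = (end.tv_sec - start.tv_sec) +
		       (end.tv_nsec - start.tv_nsec) / 1e9;
		printf("faults/sec: %.0f\n", (size / pagesz) / secs);
		munmap(p, size);
		return 0;
	}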

                                     5.18.0-rc1               5.18.0-rc1
                                        vanilla         mm-pcpdrain-v2r1
Hmean     faults/sec-1   886331.5718 (   0.00%)   885462.7479 (  -0.10%)
Hmean     faults/sec-3  2337706.1583 (   0.00%)  2332130.4909 *  -0.24%*
Hmean     faults/sec-5  2851594.2897 (   0.00%)  2844123.9307 (  -0.26%)
Hmean     faults/sec-7  3543251.5507 (   0.00%)  3516889.0442 *  -0.74%*
Hmean     faults/sec-8  3947098.0024 (   0.00%)  3916162.8476 *  -0.78%*
Stddev    faults/sec-1     2302.9105 (   0.00%)     2065.0845 (  10.33%)
Stddev    faults/sec-3     7275.2442 (   0.00%)     6033.2620 (  17.07%)
Stddev    faults/sec-5    24726.0328 (   0.00%)    12525.1026 (  49.34%)
Stddev    faults/sec-7     9974.2542 (   0.00%)     9543.9627 (   4.31%)
Stddev    faults/sec-8     9468.0191 (   0.00%)     7958.2607 (  15.95%)
CoeffVar  faults/sec-1        0.2598 (   0.00%)        0.2332 (  10.24%)
CoeffVar  faults/sec-3        0.3112 (   0.00%)        0.2587 (  16.87%)
CoeffVar  faults/sec-5        0.8670 (   0.00%)        0.4404 (  49.21%)
CoeffVar  faults/sec-7        0.2815 (   0.00%)        0.2714 (   3.60%)
CoeffVar  faults/sec-8        0.2399 (   0.00%)        0.2032 (  15.28%)

There is a small hit in the number of faults per second but it's within
the noise and the results are more stable with the series so I'd mark it
down as a small but potentially measurable impact.

> > 
> > The workloads I used just hit reclaim directly to make sure it's
> > functionally not broken. There is no change in page aging decisions,
> > only timing of drains. I didn't check a heavy workload interfering
> > with a CPU-bound workload running on NOHZ CPUs as I assumed
> > both you and Nicolas had a test case ready to use.
> 
> My workload is not about NOHZ CPUs; it runs apps under heavy memory
> pressure so they go into direct reclaim and get stuck on drain_all_pages
> until the work on the workqueue runs.
> 
> unit: nanosecond
> max(dur)        avg(dur)                count(dur)
> 166713013       487511.77786438033      1283
> 
> From traces, the system encountered drain_all_pages 1283 times; the
> worst case was 166ms and the average was 487us.
> 
> The other problem was alloc_contig_range in CMA. The PCP draining
> sometimes takes several hundred milliseconds even though there is no
> memory pressure and only a few pages need to be migrated out, because
> the CPUs are fully booked.
> 
> Your patch completely removed that wasted time.
> 

Those stalls are painful and have a direct impact: a workload does
not make progress. The NOHZ case is different in that the concern is
interference. Both problems should have the same solution.

Do you mind if I quote these paragraphs in the leader to v3?

> > Which ones are of concern?
> > 
> > Some of the page->lru references I left alone in the init paths simply
> > because in those contexts, the page wasn't on a buddy or PCP list. In
> > free_unref_page_list the page is not on the LRU, it's just been isolated
> > from the LRU. In alloc_pages_bulk, it's not on a buddy, pcp or LRU list
> > and is just a list placeholder so I left it alone. In
> > free_tail_pages_check the context was a page that was likely previously
> > on a LRU.
> 
> Just nits: all are list macros.
> 
> free_pcppages_bulk's list_last_entry should be pcp_list.
> 
> mark_free_pages's list_for_each_entry should be buddy_list.
> 
> __rmqueue_pcplist's list_first_entry should be pcp_list.
> 

Ah, you're completely correct.

> > 
> > > since I have
> > > tested this patchset with my workload and didn't spot any other
> > > problems.
> > > 
> > 
> > Can you describe this workload? Is it available anywhere and does it
> > require Android to execute?
> 
> I described it above. It runs on Android, but I don't think it's an
> Android-specific issue; anyone could see such a long latency from PCP
> draining once one of the cores is monopolized by higher-priority
> processes or too many pending kworks.
> 

Yeah, I agree it's not an Android-specific problem. It could be detected by
tracing the time spent in drain_all_pages for any arbitrary workload. The
BCC funclatency tool could measure it.
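
If BPF is not an option, a crude debug patch along these lines would give
similar numbers (purely illustrative; the wrapper is hypothetical and not
part of this series):

	/* Time one drain and dump the duration to the trace buffer */
	static void drain_all_pages_timed(struct zone *zone)
	{
		ktime_t start = ktime_get();

		drain_all_pages(zone);
		trace_printk("drain_all_pages: %lld ns\n",
			     ktime_to_ns(ktime_sub(ktime_get(), start)));
	}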

> > 
> > If you have positive results, it would be appreciated if you could post
> > them or just note in a Tested-by/Acked-by that it had a measurable impact
> > on the reclaim/cma path.
> 
> Sure.
> 
> All patches in this series.
> 
> Tested-by: Minchan Kim <minchan@kernel.org>
> Acked-by: Minchan Kim <minchan@kernel.org>
> 

Thanks, I've added that to all the patches. I'll wait another day for
more feedback before sending out a v3. The following is the diff between
v2 and v3 based on your feedback.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4ac39d30ec8f..0f5a6a5b0302 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1497,7 +1497,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		do {
 			int mt;
 
-			page = list_last_entry(list, struct page, lru);
+			page = list_last_entry(list, struct page, pcp_list);
 			mt = get_pcppage_migratetype(page);
 
 			/* must delete to avoid corrupting pcp list */
@@ -3276,7 +3276,7 @@ void mark_free_pages(struct zone *zone)
 
 	for_each_migratetype_order(order, t) {
 		list_for_each_entry(page,
-				&zone->free_area[order].free_list[t], lru) {
+				&zone->free_area[order].free_list[t], buddy_list) {
 			unsigned long i;
 
 			pfn = page_to_pfn(page);
@@ -3761,7 +3761,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 			}
 		}
 
-		page = list_first_entry(list, struct page, lru);
+		page = list_first_entry(list, struct page, pcp_list);
 		list_del(&page->pcp_list);
 		pcp->count -= 1 << order;
 	} while (check_new_pcp(page, order));

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/6] Drain remote per-cpu directly v2
  2022-05-11 12:47       ` Mel Gorman
@ 2022-05-11 17:20         ` Minchan Kim
  0 siblings, 0 replies; 17+ messages in thread
From: Minchan Kim @ 2022-05-11 17:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM

On Wed, May 11, 2022 at 01:47:00PM +0100, Mel Gorman wrote:
> On Tue, May 10, 2022 at 11:13:05AM -0700, Minchan Kim wrote:
> > > Yes, but as reclaim is not fundamentally altered the main difference
> > > in behaviour is that work is done inline instead of being deferred to a
> > > workqueue. That means in some cases, system CPU usage of a task will be
> > > higher because it's paying the cost directly.
> > 
> > Sure, but the reclaim path is already expensive, so I doubt we could
> > see a sizable difference in system CPU usage.
> > 
> 
> It would be difficult to distinguish from the noise.
> 
> > What I wanted to see was whether we have a regression due to adding
> > spin_lock/unlock instructions in the hot path. Since the lock is squeezed
> > into an existing cacheline, I expected the regression would be just marginal.
> > 
> 
> Ah, yes, I did test for this. page-fault-test hits the relevant paths
> very heavily and did show minor differences.
> 
>                                      5.18.0-rc1               5.18.0-rc1
>                                         vanilla         mm-pcpdrain-v2r1
> Hmean     faults/sec-1   886331.5718 (   0.00%)   885462.7479 (  -0.10%)
> Hmean     faults/sec-3  2337706.1583 (   0.00%)  2332130.4909 *  -0.24%*
> Hmean     faults/sec-5  2851594.2897 (   0.00%)  2844123.9307 (  -0.26%)
> Hmean     faults/sec-7  3543251.5507 (   0.00%)  3516889.0442 *  -0.74%*
> Hmean     faults/sec-8  3947098.0024 (   0.00%)  3916162.8476 *  -0.78%*
> Stddev    faults/sec-1     2302.9105 (   0.00%)     2065.0845 (  10.33%)
> Stddev    faults/sec-3     7275.2442 (   0.00%)     6033.2620 (  17.07%)
> Stddev    faults/sec-5    24726.0328 (   0.00%)    12525.1026 (  49.34%)
> Stddev    faults/sec-7     9974.2542 (   0.00%)     9543.9627 (   4.31%)
> Stddev    faults/sec-8     9468.0191 (   0.00%)     7958.2607 (  15.95%)
> CoeffVar  faults/sec-1        0.2598 (   0.00%)        0.2332 (  10.24%)
> CoeffVar  faults/sec-3        0.3112 (   0.00%)        0.2587 (  16.87%)
> CoeffVar  faults/sec-5        0.8670 (   0.00%)        0.4404 (  49.21%)
> CoeffVar  faults/sec-7        0.2815 (   0.00%)        0.2714 (   3.60%)
> CoeffVar  faults/sec-8        0.2399 (   0.00%)        0.2032 (  15.28%)
> 
> There is a small hit in the number of faults per second but it's within
> the noise and the results are more stable with the series so I'd mark it
> down as a small but potentially measurable impact.

Thanks for sharing. It would be great to have it in the description, too.

> 
> > > 
> > > The workloads I used just hit reclaim directly to make sure it's
> > > functionally not broken. There is no change in page aging decisions,
> > > only timing of drains. I didn't check a heavy workload interfering
> > > with a CPU-bound workload running on NOHZ CPUs as I assumed
> > > both you and Nicolas had a test case ready to use.
> > 
> > My workload is not about NOHZ CPUs; it runs apps under heavy memory
> > pressure so they go into direct reclaim and get stuck on drain_all_pages
> > until the work on the workqueue runs.
> > 
> > unit: nanosecond
> > max(dur)        avg(dur)                count(dur)
> > 166713013       487511.77786438033      1283
> > 
> > From traces, the system encountered drain_all_pages 1283 times; the
> > worst case was 166ms and the average was 487us.
> > 
> > The other problem was alloc_contig_range in CMA. The PCP draining
> > sometimes takes several hundred milliseconds even though there is no
> > memory pressure and only a few pages need to be migrated out, because
> > the CPUs are fully booked.
> > 
> > Your patch completely removed that wasted time.
> > 
> 
> Those stalls are painful and have a direct impact: a workload does
> not make progress. The NOHZ case is different in that the concern is
> interference. Both problems should have the same solution.
> 
> Do you mind if I quote these paragraphs in the leader to v3?

Please have it in the description.

> 
> > > Which ones are of concern?
> > > 
> > > Some of the page->lru references I left alone in the init paths simply
> > > because in those contexts, the page wasn't on a buddy or PCP list. In
> > > free_unref_page_list the page is not on the LRU, it's just been isolated
> > > from the LRU. In alloc_pages_bulk, it's not on a buddy, pcp or LRU list
> > > and is just a list placeholder so I left it alone. In
> > > free_tail_pages_check the context was a page that was likely previously
> > > on a LRU.
> > 
> > Just nits: all are list macros.
> > 
> > free_pcppages_bulk's list_last_entry should be pcp_list.
> > 
> > mark_free_pages's list_for_each_entry should be buddy_list.
> > 
> > __rmqueue_pcplist's list_first_entry should be pcp_list.
> > 
> 
> Ah, you're completely correct.
> 
> > > 
> > > > since I have
> > > > tested this patchset with my workload and didn't spot any other
> > > > problems.
> > > > 
> > > 
> > > Can you describe this workload? Is it available anywhere and does it
> > > require Android to execute?
> > 
> > I described it above. It runs on Android, but I don't think it's an
> > Android-specific issue; anyone could see such a long latency from PCP
> > draining once one of the cores is monopolized by higher-priority
> > processes or too many pending kworks.
> > 
> 
> Yeah, I agree it's not an Android-specific problem. It could be detected by
> tracing the time spent in drain_all_pages for any arbitrary workload. The
> BCC funclatency tool could measure it.
> 
> > > 
> > > If you have positive results, it would be appreciated if you could post
> > > them or just note in a Tested-by/Acked-by that it had a measurable impact
> > > on the reclaim/cma path.
> > 
> > Sure.
> > 
> > All patches in this series.
> > 
> > Tested-by: Minchan Kim <minchan@kernel.org>
> > Acked-by: Minchan Kim <minchan@kernel.org>
> > 
> 
> Thanks, I've added that to all the patches. I'll wait another day for
> more feedback before sending out a v3. The following is the diff between
> v2 and v3 based on your feedback.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4ac39d30ec8f..0f5a6a5b0302 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1497,7 +1497,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  		do {
>  			int mt;
>  
> -			page = list_last_entry(list, struct page, lru);
> +			page = list_last_entry(list, struct page, pcp_list);
>  			mt = get_pcppage_migratetype(page);
>  
>  			/* must delete to avoid corrupting pcp list */
> @@ -3276,7 +3276,7 @@ void mark_free_pages(struct zone *zone)
>  
>  	for_each_migratetype_order(order, t) {
>  		list_for_each_entry(page,
> -				&zone->free_area[order].free_list[t], lru) {
> +				&zone->free_area[order].free_list[t], buddy_list) {
>  			unsigned long i;
>  
>  			pfn = page_to_pfn(page);
> @@ -3761,7 +3761,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>  			}
>  		}
>  
> -		page = list_first_entry(list, struct page, lru);
> +		page = list_first_entry(list, struct page, pcp_list);
>  		list_del(&page->pcp_list);
>  		pcp->count -= 1 << order;
>  	} while (check_new_pcp(page, order));

Looks good to me. Please add my Tested-by/Acked-by.

Thanks, Mel.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list
  2022-05-09 13:08 ` [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list Mel Gorman
@ 2022-05-13  8:41   ` Muchun Song
  2022-05-26 10:14     ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Muchun Song @ 2022-05-13  8:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM

On Mon, May 09, 2022 at 02:08:00PM +0100, Mel Gorman wrote:
> The page allocator uses page->lru for storing pages on either buddy or
> PCP lists. Create page->buddy_list and page->pcp_list as a union with
> page->lru. This is simply to clarify what type of list a page is on
> in the page allocator.
> 
> No functional change intended.
> 

This looks like a nice cleanup to me.

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/mm_types.h |  5 +++++
>  mm/page_alloc.c          | 18 +++++++++---------
>  2 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 8834e38c06a4..a2782e8af307 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -87,6 +87,7 @@ struct page {
>  			 */
>  			union {
>  				struct list_head lru;
> +
>  				/* Or, for the Unevictable "LRU list" slot */
>  				struct {
>  					/* Always even, to negate PageTail */
> @@ -94,6 +95,10 @@ struct page {
>  					/* Count page's or folio's mlocks */
>  					unsigned int mlock_count;
>  				};
> +
> +				/* Or, free page */
> +				struct list_head buddy_list;
> +				struct list_head pcp_list;
>  			};

Since you have clarified the "lru" member, should we go further?
For example, union "index" with "pcp_migratetype" and "private" with
"order", since the buddy allocator reuses "index" and "private" as well.
My initial idea is as follows; it is clearer to me. What do you think?

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index dbfd892ac96a..644349a5f901 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -107,6 +107,15 @@ struct page {
                         */
                        unsigned long private;
                };
+               struct {        /* pages manipulated by buddy allocator */
+                       union {
+                               struct list_head buddy_list;
+                               struct list_head pcp_list;
+                       };
+                       unsigned long mapping_pad;      /* Not used */
+                       unsigned long pcp_migratetype;
+                       unsigned long order;
+               };
                struct {        /* page_pool used by netstack */
                        /**
                         * @pp_magic: magic value to avoid recycling non


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock
  2022-05-09 13:08 ` [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
@ 2022-05-22  2:49   ` Hugh Dickins
  2022-05-24 12:12     ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2022-05-22  2:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Nicolas Saenz Julienne, Marcelo Tosatti,
	Vlastimil Babka, Michal Hocko, LKML, Linux-MM

On Mon, 9 May 2022, Mel Gorman wrote:

> Currently the PCP lists are protected by using local_lock_irqsave to
> prevent migration and IRQ reentrancy but this is inconvenient. Remote
> draining of the lists is impossible and a workqueue is required and
> every task allocation/free must disable then enable interrupts which is
> expensive.
> 
> As preparation for dealing with both of those problems, protect the
> lists with a spinlock. The IRQ-unsafe version of the lock is used
> because IRQs are already disabled by local_lock_irqsave. spin_trylock
> is used in preparation for a time when local_lock could be used instead
> of local_lock_irqsave.

8c580f60a145 ("mm/page_alloc: protect PCP lists with a spinlock")
in next-20220520: I haven't looked up whether that comes from a
stable or unstable suburb of akpm's tree.

Mel, the VM_BUG_ON(in_hardirq()) which this adds to free_unref_page_list() 
is not valid.  I have no appreciation of how important it is to the whole
scheme, but as it stands, it crashes; and when I change it to a warning

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3475,7 +3475,7 @@ void free_unref_page_list(struct list_he
 	if (list_empty(list))
 		return;
 
-	VM_BUG_ON(in_hardirq());
+	WARN_ON_ONCE(in_hardirq());
 
 	local_lock_irqsave(&pagesets.lock, flags);
 
then everything *appears* to go on working correctly after the splat
below (from which you will infer that I'm swapping to nvme):

[  256.167040] WARNING: CPU: 0 PID: 9842 at mm/page_alloc.c:3478 free_unref_page_list+0x92/0x343
[  256.170031] CPU: 0 PID: 9842 Comm: cc1 Not tainted 5.18.0-rc7-n20 #3
[  256.171285] Hardware name: LENOVO 20HQS0EG02/20HQS0EG02, BIOS N1MET54W (1.39 ) 04/16/2019
[  256.172555] RIP: 0010:free_unref_page_list+0x92/0x343
[  256.173820] Code: ff ff 49 8b 44 24 08 4d 89 e0 4c 8d 60 f8 eb b6 48 8b 03 48 39 c3 0f 84 af 02 00 00 65 8b 05 72 7f df 7e a9 00 00 0f 00 74 02 <0f> 0b 9c 41 5d fa 41 0f ba e5 09 73 05 e8 1f 0a f9 ff e8 46 90 7b
[  256.175289] RSP: 0018:ffff88803ec07c80 EFLAGS: 00010006
[  256.176683] RAX: 0000000080010000 RBX: ffff88803ec07cf8 RCX: 000000000000002c
[  256.178122] RDX: 0000000000000000 RSI: ffff88803ec29d28 RDI: 0000000000000040
[  256.179580] RBP: ffff88803ec07cc0 R08: ffff88803ec07cf0 R09: 00000000000a401d
[  256.181031] R10: 0000000000000000 R11: ffff8880101891b8 R12: ffff88803f6dd600
[  256.182501] R13: ffff88803ec07cf8 R14: 000000000000000f R15: 0000000000000000
[  256.183957] FS:  00007ffff7fcfac0(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
[  256.185419] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  256.186911] CR2: 0000555555710cdc CR3: 00000000240b4004 CR4: 00000000003706f0
[  256.188395] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  256.189888] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  256.191390] Call Trace:
[  256.192844]  <IRQ>
[  256.194253]  ? __mem_cgroup_uncharge_list+0x4e/0x57
[  256.195715]  release_pages+0x26f/0x27e
[  256.197150]  ? list_add_tail+0x39/0x39
[  256.198603]  pagevec_lru_move_fn+0x95/0xa4
[  256.200065]  folio_rotate_reclaimable+0xa0/0xd1
[  256.201545]  folio_end_writeback+0x1c/0x78
[  256.203064]  end_page_writeback+0x11/0x13
[  256.204495]  end_swap_bio_write+0x87/0x95
[  256.205967]  bio_endio+0x15e/0x162
[  256.207393]  blk_mq_end_request_batch+0xd2/0x18d
[  256.208932]  ? __this_cpu_preempt_check+0x13/0x15
[  256.210378]  ? lock_is_held_type+0xcf/0x10f
[  256.211714]  ? lock_is_held+0xc/0xe
[  256.213065]  ? rcu_read_lock_sched_held+0x24/0x4f
[  256.214450]  nvme_pci_complete_batch+0x4c/0x51
[  256.215811]  nvme_irq+0x43/0x4e
[  256.217173]  ? nvme_unmap_data+0xb5/0xb5
[  256.218633]  __handle_irq_event_percpu+0xff/0x235
[  256.220062]  handle_irq_event_percpu+0x10/0x39
[  256.221584]  handle_irq_event+0x34/0x53
[  256.223061]  handle_edge_irq+0xb1/0xd5
[  256.224486]  __common_interrupt+0x7a/0xe6
[  256.225918]  common_interrupt+0x9c/0xca
[  256.227330]  </IRQ>
[  256.228763]  <TASK>
[  256.230226]  asm_common_interrupt+0x2c/0x40
[  256.231724] RIP: 0010:lock_acquire.part.0+0x1a9/0x1b4
[  256.233190] Code: df ec 7e ff c8 74 19 0f 0b 48 c7 c7 77 2c 4e 82 e8 d7 e7 88 00 65 c7 05 01 df ec 7e 00 00 00 00 48 85 db 74 01 fb 48 8d 65 d8 <5b> 41 5c 41 5d 41 5e 41 5f 5d c3 55 48 89 e5 41 57 4d 89 cf 41 56
[  256.234804] RSP: 0018:ffff888010e939f0 EFLAGS: 00000206
[  256.236339] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 3e4406066e4f4abc
[  256.237889] RDX: 0000000000000000 RSI: ffffffff824021e7 RDI: ffffffff8244b3d1
[  256.239401] RBP: ffff888010e93a18 R08: 0000000000000028 R09: 000000000002001d
[  256.240994] R10: 0000000000000000 R11: ffff8880101891b8 R12: 0000000000000002
[  256.242545] R13: ffffffff82757a80 R14: ffffffff81253954 R15: 0000000000000000
[  256.244085]  ? __folio_memcg_unlock+0x48/0x48
[  256.245597]  lock_acquire+0xfa/0x10a
[  256.247179]  ? __folio_memcg_unlock+0x48/0x48
[  256.248727]  rcu_lock_acquire.constprop.0+0x24/0x27
[  256.250293]  ? __folio_memcg_unlock+0x48/0x48
[  256.251736]  mem_cgroup_iter+0x3d/0x178
[  256.253245]  shrink_node_memcgs+0x169/0x182
[  256.254822]  shrink_node+0x220/0x3d9
[  256.256372]  shrink_zones+0x10f/0x1ca
[  256.257923]  ? __this_cpu_preempt_check+0x13/0x15
[  256.259437]  do_try_to_free_pages+0x7a/0x192
[  256.260947]  try_to_free_mem_cgroup_pages+0x14b/0x213
[  256.262405]  try_charge_memcg+0x230/0x433
[  256.263865]  try_charge+0x12/0x17
[  256.265236]  charge_memcg+0x25/0x7c
[  256.266615]  __mem_cgroup_charge+0x28/0x3d
[  256.267962]  mem_cgroup_charge.constprop.0+0x1d/0x1f
[  256.269290]  do_anonymous_page+0x118/0x20c
[  256.270712]  handle_pte_fault+0x151/0x15f
[  256.272015]  __handle_mm_fault+0x39d/0x3ac
[  256.273198]  handle_mm_fault+0xc2/0x188
[  256.274371]  do_user_addr_fault+0x240/0x39d
[  256.275595]  exc_page_fault+0x1e1/0x204
[  256.276787]  asm_exc_page_fault+0x2c/0x40
[  256.277906] RIP: 0033:0xd67250
[  256.279017] Code: 02 f6 ff 49 89 c0 48 83 fb 08 73 1e f6 c3 04 75 39 48 85 db 74 2d c6 00 00 f6 c3 02 74 25 31 c0 66 41 89 44 18 fe eb 1b 66 90 <48> c7 44 18 f8 00 00 00 00 48 8d 4b ff 31 c0 4c 89 c7 48 c1 e9 03
[  256.280355] RSP: 002b:00007fffffffc360 EFLAGS: 00010206
[  256.281678] RAX: 00007ffff5972000 RBX: 00000000000000a8 RCX: 000000000000000c
[  256.282955] RDX: 0000000000000006 RSI: 0000000000000017 RDI: 0000000000000986
[  256.284264] RBP: 0000000000000002 R08: 00007ffff5972000 R09: 0000000000000987
[  256.285554] R10: 0000000000000001 R11: 0000000001000001 R12: 00007ffff5969c80
[  256.286930] R13: 0000000000000015 R14: 0000000000000000 R15: 0000000001eb4760
[  256.288300]  </TASK>
[  256.289533] irq event stamp: 95044
[  256.290776] hardirqs last  enabled at (95043): [<ffffffff819ddf43>] irqentry_exit+0x67/0x75
[  256.292147] hardirqs last disabled at (95044): [<ffffffff819db09c>] common_interrupt+0x1a/0xca
[  256.293479] softirqs last  enabled at (94982): [<ffffffff81c0036f>] __do_softirq+0x36f/0x3aa
[  256.294754] softirqs last disabled at (94977): [<ffffffff81105c01>] __irq_exit_rcu+0x85/0xc1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock
  2022-05-22  2:49   ` Hugh Dickins
@ 2022-05-24 12:12     ` Mel Gorman
  2022-05-24 12:19       ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2022-05-24 12:12 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Nicolas Saenz Julienne, Marcelo Tosatti,
	Vlastimil Babka, Michal Hocko, LKML, Linux-MM

On Sat, May 21, 2022 at 07:49:10PM -0700, Hugh Dickins wrote:
> On Mon, 9 May 2022, Mel Gorman wrote:
> 
> > Currently the PCP lists are protected by using local_lock_irqsave to
> > prevent migration and IRQ reentrancy but this is inconvenient. Remote
> > draining of the lists is impossible and a workqueue is required and
> > every task allocation/free must disable then enable interrupts which is
> > expensive.
> > 
> > As preparation for dealing with both of those problems, protect the
> > lists with a spinlock. The IRQ-unsafe version of the lock is used
> > because IRQs are already disabled by local_lock_irqsave. spin_trylock
> > is used in preparation for a time when local_lock could be used instead
> > of local_lock_irqsave.
> 
> 8c580f60a145 ("mm/page_alloc: protect PCP lists with a spinlock")
> in next-20220520: I haven't looked up whether that comes from a
> stable or unstable suburb of akpm's tree.
> 
> Mel, the VM_BUG_ON(in_hardirq()) which this adds to free_unref_page_list() 
> is not valid.  I have no appreciation of how important it is to the whole
> scheme, but as it stands, it crashes; and when I change it to a warning
> 

Thanks Hugh. Sorry for the delay in responding, I was offline for a few
days. The case where free_unref_page_list is called from IRQ context
is safe and the VM_BUG_ON can be removed.

--8<--
mm/page_alloc: Protect PCP lists with a spinlock -fix

Hugh Dickins reported the following problem:

[  256.167040] WARNING: CPU: 0 PID: 9842 at mm/page_alloc.c:3478 free_unref_page_list+0x92/0x343
[  256.170031] CPU: 0 PID: 9842 Comm: cc1 Not tainted 5.18.0-rc7-n20 #3
[  256.171285] Hardware name: LENOVO 20HQS0EG02/20HQS0EG02, BIOS N1MET54W (1.39 ) 04/16/2019
[  256.172555] RIP: 0010:free_unref_page_list+0x92/0x343
[  256.173820] Code: ff ff 49 8b 44 24 08 4d 89 e0 4c 8d 60 f8 eb b6 48 8b 03 48 39 c3 0f 84 af 02 00 00 65 8b 05 72 7f df 7e a9 00 00 0f 00 74 02 <0f> 0b 9c 41 5d fa 41 0f ba e5 09 73 05 e8 1f 0a f9 ff e8 46 90 7b
[  256.175289] RSP: 0018:ffff88803ec07c80 EFLAGS: 00010006
[  256.176683] RAX: 0000000080010000 RBX: ffff88803ec07cf8 RCX: 000000000000002c
[  256.178122] RDX: 0000000000000000 RSI: ffff88803ec29d28 RDI: 0000000000000040
[  256.179580] RBP: ffff88803ec07cc0 R08: ffff88803ec07cf0 R09: 00000000000a401d
[  256.181031] R10: 0000000000000000 R11: ffff8880101891b8 R12: ffff88803f6dd600
[  256.182501] R13: ffff88803ec07cf8 R14: 000000000000000f R15: 0000000000000000
[  256.183957] FS:  00007ffff7fcfac0(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
[  256.185419] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  256.186911] CR2: 0000555555710cdc CR3: 00000000240b4004 CR4: 00000000003706f0
[  256.188395] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  256.189888] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  256.191390] Call Trace:
[  256.192844]  <IRQ>
[  256.194253]  ? __mem_cgroup_uncharge_list+0x4e/0x57
[  256.195715]  release_pages+0x26f/0x27e
[  256.197150]  ? list_add_tail+0x39/0x39
[  256.198603]  pagevec_lru_move_fn+0x95/0xa4

The VM_BUG_ON was added in preparation for a time when the PCP lock would
be IRQ-unsafe. The fundamental limitation is that free_unref_page_list()
cannot be called with the PCP lock held as a normal spinlock when an IRQ
is delivered. At the moment, this is impossible and even if the PCP lock
were IRQ-unsafe, free_unref_page_list is not called from page allocator
context in an unsafe manner. Remove the VM_BUG_ON.

This is a fix for the mmotm patch
mm-page_alloc-protect-pcp-lists-with-a-spinlock.patch

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d169aeeac6f..4c1e2a773e47 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3522,8 +3522,6 @@ void free_unref_page_list(struct list_head *list)
 	if (list_empty(list))
 		return;
 
-	VM_BUG_ON(in_hardirq());
-
 	page = lru_to_page(list);
 	locked_zone = page_zone(page);
 	pcp = pcp_spin_lock(locked_zone->per_cpu_pageset);


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock
  2022-05-24 12:12     ` Mel Gorman
@ 2022-05-24 12:19       ` Mel Gorman
  0 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2022-05-24 12:19 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Nicolas Saenz Julienne, Marcelo Tosatti,
	Vlastimil Babka, Michal Hocko, LKML, Linux-MM

On Tue, May 24, 2022 at 01:12:24PM +0100, Mel Gorman wrote:
> On Sat, May 21, 2022 at 07:49:10PM -0700, Hugh Dickins wrote:
> > On Mon, 9 May 2022, Mel Gorman wrote:
> > 
> > > Currently the PCP lists are protected by using local_lock_irqsave to
> > > prevent migration and IRQ reentrancy but this is inconvenient. Remote
> > > draining of the lists is impossible and a workqueue is required and
> > > every task allocation/free must disable then enable interrupts which is
> > > expensive.
> > > 
> > > As preparation for dealing with both of those problems, protect the
> > > lists with a spinlock. The IRQ-unsafe version of the lock is used
> > > because IRQs are already disabled by local_lock_irqsave. spin_trylock
> > > is used in preparation for a time when local_lock could be used instead
> > > of local_lock_irqsave.
> > 
> > 8c580f60a145 ("mm/page_alloc: protect PCP lists with a spinlock")
> > in next-20220520: I haven't looked up whether that comes from a
> > stable or unstable suburb of akpm's tree.
> > 
> > Mel, the VM_BUG_ON(in_hardirq()) which this adds to free_unref_page_list() 
> > is not valid.  I have no appreciation of how important it is to the whole
> > scheme, but as it stands, it crashes; and when I change it to a warning
> > 
> 
> Thanks Hugh. Sorry for the delay in responding, I was offline for a few
> days. The case where free_unref_page_list is called from IRQ context
> is safe and the VM_BUG_ON can be removed.
> 

Here is a version with a more appropriate baseline; it'll cause a collision
later in the series but it's trivially resolved.

--8<--
mm/page_alloc: Protect PCP lists with a spinlock -fix

Hugh Dickins reported the following problem:

[  256.167040] WARNING: CPU: 0 PID: 9842 at mm/page_alloc.c:3478 free_unref_page_list+0x92/0x343
[  256.170031] CPU: 0 PID: 9842 Comm: cc1 Not tainted 5.18.0-rc7-n20 #3
[  256.171285] Hardware name: LENOVO 20HQS0EG02/20HQS0EG02, BIOS N1MET54W (1.39 ) 04/16/2019
[  256.172555] RIP: 0010:free_unref_page_list+0x92/0x343
[  256.173820] Code: ff ff 49 8b 44 24 08 4d 89 e0 4c 8d 60 f8 eb b6 48 8b 03 48 39 c3 0f 84 af 02 00 00 65 8b 05 72 7f df 7e a9 00 00 0f 00 74 02 <0f> 0b 9c 41 5d fa 41 0f ba e5 09 73 05 e8 1f 0a f9 ff e8 46 90 7b
[  256.175289] RSP: 0018:ffff88803ec07c80 EFLAGS: 00010006
[  256.176683] RAX: 0000000080010000 RBX: ffff88803ec07cf8 RCX: 000000000000002c
[  256.178122] RDX: 0000000000000000 RSI: ffff88803ec29d28 RDI: 0000000000000040
[  256.179580] RBP: ffff88803ec07cc0 R08: ffff88803ec07cf0 R09: 00000000000a401d
[  256.181031] R10: 0000000000000000 R11: ffff8880101891b8 R12: ffff88803f6dd600
[  256.182501] R13: ffff88803ec07cf8 R14: 000000000000000f R15: 0000000000000000
[  256.183957] FS:  00007ffff7fcfac0(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
[  256.185419] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  256.186911] CR2: 0000555555710cdc CR3: 00000000240b4004 CR4: 00000000003706f0
[  256.188395] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  256.189888] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  256.191390] Call Trace:
[  256.192844]  <IRQ>
[  256.194253]  ? __mem_cgroup_uncharge_list+0x4e/0x57
[  256.195715]  release_pages+0x26f/0x27e
[  256.197150]  ? list_add_tail+0x39/0x39
[  256.198603]  pagevec_lru_move_fn+0x95/0xa4

The VM_BUG_ON was added in preparation for a time when the PCP lock would be
IRQ-unsafe. The fundamental limitation is that free_unref_page_list() cannot be
called with the PCP lock held when an IRQ is delivered. At the moment, this is
impossible and even if the PCP lock were IRQ-unsafe, free_unref_page_list is not
called from page allocator context in an unsafe manner. Remove the VM_BUG_ON.

This is a fix to the mmotm patch
mm-page_alloc-protect-pcp-lists-with-a-spinlock.patch

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e3a6aa97ad7a..52e7fe681483 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3535,8 +3535,6 @@ void free_unref_page_list(struct list_head *list)
 	if (list_empty(list))
 		return;
 
-	VM_BUG_ON(in_hardirq());
-
 	local_lock_irqsave(&pagesets.lock, flags);
 
 	page = lru_to_page(list);

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list
  2022-05-13  8:41   ` Muchun Song
@ 2022-05-26 10:14     ` Mel Gorman
  0 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2022-05-26 10:14 UTC (permalink / raw)
  To: Muchun Song
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM

On Fri, May 13, 2022 at 04:41:20PM +0800, Muchun Song wrote:
> > @@ -94,6 +95,10 @@ struct page {
> >  					/* Count page's or folio's mlocks */
> >  					unsigned int mlock_count;
> >  				};
> > +
> > +				/* Or, free page */
> > +				struct list_head buddy_list;
> > +				struct list_head pcp_list;
> >  			};
> 
> Since you have clarified the "lru" member, should we go further?
> For example, union "index" with "pcp_migratetype" and "private" with
> "order", since the buddy allocator reuses "index" and "private" as well.
> My initial idea is as follows; it is clearer to me. What do you think?
> 

I think it would be more appropriate to split it out as a separate type, as
suggested by Matthew Wilcox. While I would not be opposed to your approach
as such, it's outside the scope of this series, which is modifying how the
PCP works.
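
Just to illustrate what I mean by a separate type (very rough, every name
below is invented for the example and it is not a concrete proposal), it
would be an overlay that mirrors the struct page layout, something like:

	/* Illustrative only: a dedicated view of a page owned by the allocator */
	struct free_page {
		unsigned long flags;
		union {
			struct list_head buddy_list;	/* on free_area lists */
			struct list_head pcp_list;	/* on per-cpu lists */
		};
		unsigned long _unused;			/* pads over page->mapping */
		unsigned long pcp_migratetype;		/* reuses page->index */
		unsigned long buddy_order;		/* reuses page->private */
	};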

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-05-26 10:16 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-09 13:07 [RFC PATCH 0/6] Drain remote per-cpu directly v2 Mel Gorman
2022-05-09 13:08 ` [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list Mel Gorman
2022-05-13  8:41   ` Muchun Song
2022-05-26 10:14     ` Mel Gorman
2022-05-09 13:08 ` [PATCH 2/6] mm/page_alloc: Use only one PCP list for THP-sized allocations Mel Gorman
2022-05-09 13:08 ` [PATCH 3/6] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper Mel Gorman
2022-05-09 13:08 ` [PATCH 4/6] mm/page_alloc: Remove unnecessary page == NULL check in rmqueue Mel Gorman
2022-05-09 13:08 ` [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
2022-05-22  2:49   ` Hugh Dickins
2022-05-24 12:12     ` Mel Gorman
2022-05-24 12:19       ` Mel Gorman
2022-05-09 13:08 ` [PATCH 6/6] mm/page_alloc: Remotely drain per-cpu lists Mel Gorman
2022-05-09 15:58 ` [RFC PATCH 0/6] Drain remote per-cpu directly v2 Minchan Kim
2022-05-10  9:27   ` Mel Gorman
2022-05-10 18:13     ` Minchan Kim
2022-05-11 12:47       ` Mel Gorman
2022-05-11 17:20         ` Minchan Kim
