* [PATCH v5 00/7] Drain remote per-cpu directly
@ 2022-06-24 12:54 Mel Gorman
  2022-06-24 12:54 ` [PATCH 1/7] mm/page_alloc: Add page->buddy_list and page->pcp_list Mel Gorman
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Mel Gorman @ 2022-06-24 12:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, Hugh Dickins, Yu Zhao, Marek Szyprowski, LKML,
	Linux-MM, Mel Gorman

This replaces the existing version on mm-unstable. While there are some
fixes, this is mostly a refactoring of patch 5 based on Vlastimil's
feedback to reduce churn in later patches. The level of refactoring would
have made -fix patches excessively complicated.

Changelog since v4
o Fix lockdep issues in patch 7
o Refactor patch 5 to reduce churn in patches 6 and 7
o Rebase to 5.19-rc3

Some setups, notably NOHZ_FULL CPUs, may be running realtime or
latency-sensitive applications that cannot tolerate interference due to
per-cpu drain work queued by __drain_all_pages(). Introduce a new
mechanism to remotely drain the per-cpu lists. It is made possible by
remotely locking the new per-cpu spinlocks embedded in 'struct
per_cpu_pages'. This has two advantages: the time to drain is more
predictable, and unrelated tasks are not interrupted.

This series has the same intent as Nicolas' series "mm/page_alloc: Remote
per-cpu lists drain support" -- avoid interfering with a high-priority task
due to a workqueue item draining per-cpu page lists. While many workloads
can tolerate a brief interruption, it may cause a real-time task running
on a NOHZ_FULL CPU to miss a deadline and, at minimum, the draining is
non-deterministic.

Currently an IRQ-safe local_lock protects the page allocator per-cpu
lists. The local_lock on its own prevents migration, and the IRQ disabling
protects the lists from corruption due to an interrupt arriving while a
page allocation is in progress.
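
As a simplified sketch (fragments taken from the context lines of the
hunks later in the series), the current pattern on the allocation/free
fast paths is:

        /* Current scheme: lists are only touched locally with IRQs off. */
        local_lock_irqsave(&pagesets.lock, flags);
        pcp = this_cpu_ptr(zone->per_cpu_pageset);
        /* ... add to or remove from pcp->lists[] ... */
        local_unlock_irqrestore(&pagesets.lock, flags);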

This series adjusts the locking. A spinlock is added to struct
per_cpu_pages to protect the list contents, while local_lock_irq is
ultimately replaced by just the spinlock in the final patch. This allows
a remote CPU to safely drain a remote per-cpu list. Follow-on work should
allow the spin_lock_irqsave to be converted to spin_lock to avoid IRQs
being disabled/enabled in most cases. The follow-on patch will be one
kernel release later as it is relatively high risk and it will make
bisection clearer if there are any problems.
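
For illustration only, a minimal sketch of the remote drain that the
embedded lock makes possible (mirroring what drain_pages_zone() becomes
by the end of the series):

        struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
        unsigned long flags;

        /* Any CPU may take a remote CPU's pcp lock and free its pages. */
        spin_lock_irqsave(&pcp->lock, flags);
        if (pcp->count)
                free_pcppages_bulk(zone, pcp->count, pcp, 0);
        spin_unlock_irqrestore(&pcp->lock, flags);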

Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
	and when it is storing per-cpu pages.

Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
	this is not necessary but it avoids per_cpu_pages consuming another
	cache line.

Patch 3 is a preparation patch to avoid code duplication.

Patch 4 is a minor correction.

Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
	relying on local_lock to prevent migration, stabilise the pcp
	lookup and prevent IRQ reentrancy.

Patch 6 drains remote per-cpu pages directly instead of using a workqueue.

Patch 7 uses a normal spinlock instead of local_lock for remote draining.

Nicolas Saenz Julienne (1):
  mm/page_alloc: Remotely drain per-cpu lists

 include/linux/mm_types.h |   5 +
 include/linux/mmzone.h   |  12 +-
 mm/page_alloc.c          | 386 ++++++++++++++++++++++++---------------
 3 files changed, 250 insertions(+), 153 deletions(-)

-- 
2.35.3



* [PATCH 1/7] mm/page_alloc: Add page->buddy_list and page->pcp_list
  2022-06-24 12:54 [PATCH v5 00/7] Drain remote per-cpu directly Mel Gorman
@ 2022-06-24 12:54 ` Mel Gorman
  2022-06-24 12:54 ` [PATCH 2/7] mm/page_alloc: Use only one PCP list for THP-sized allocations Mel Gorman
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2022-06-24 12:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, Hugh Dickins, Yu Zhao, Marek Szyprowski, LKML,
	Linux-MM, Mel Gorman

The page allocator uses page->lru for storing pages on either buddy or PCP
lists.  Create page->buddy_list and page->pcp_list as a union with
page->lru.  This is simply to clarify what type of list a page is on in
the page allocator.

No functional change intended.
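
As a sketch of the intent, call sites then read as follows (both lines
appear in the hunks below):

        list_add(&page->buddy_list, &area->free_list[migratetype]);    /* buddy page */
        list_add(&page->pcp_list, &pcp->lists[pindex]);                 /* per-cpu page */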

[minchan@kernel.org: fix page lru fields in macros]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm_types.h |  5 +++++
 mm/page_alloc.c          | 24 ++++++++++++------------
 2 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c29ab4c0cd5c..e6321ec7621d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -87,6 +87,7 @@ struct page {
 			 */
 			union {
 				struct list_head lru;
+
 				/* Or, for the Unevictable "LRU list" slot */
 				struct {
 					/* Always even, to negate PageTail */
@@ -94,6 +95,10 @@ struct page {
 					/* Count page's or folio's mlocks */
 					unsigned int mlock_count;
 				};
+
+				/* Or, free page */
+				struct list_head buddy_list;
+				struct list_head pcp_list;
 			};
 			/* See page-flags.h for PAGE_MAPPING_FLAGS */
 			struct address_space *mapping;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e008a3df0485..247fa7502199 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -785,7 +785,7 @@ static inline bool set_page_guard(struct zone *zone, struct page *page,
 		return false;
 
 	__SetPageGuard(page);
-	INIT_LIST_HEAD(&page->lru);
+	INIT_LIST_HEAD(&page->buddy_list);
 	set_page_private(page, order);
 	/* Guard pages are not available for any usage */
 	__mod_zone_freepage_state(zone, -(1 << order), migratetype);
@@ -928,7 +928,7 @@ static inline void add_to_free_list(struct page *page, struct zone *zone,
 {
 	struct free_area *area = &zone->free_area[order];
 
-	list_add(&page->lru, &area->free_list[migratetype]);
+	list_add(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
 }
 
@@ -938,7 +938,7 @@ static inline void add_to_free_list_tail(struct page *page, struct zone *zone,
 {
 	struct free_area *area = &zone->free_area[order];
 
-	list_add_tail(&page->lru, &area->free_list[migratetype]);
+	list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
 }
 
@@ -952,7 +952,7 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 {
 	struct free_area *area = &zone->free_area[order];
 
-	list_move_tail(&page->lru, &area->free_list[migratetype]);
+	list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
 }
 
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
@@ -962,7 +962,7 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 	if (page_reported(page))
 		__ClearPageReported(page);
 
-	list_del(&page->lru);
+	list_del(&page->buddy_list);
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 	zone->free_area[order].nr_free--;
@@ -1504,11 +1504,11 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		do {
 			int mt;
 
-			page = list_last_entry(list, struct page, lru);
+			page = list_last_entry(list, struct page, pcp_list);
 			mt = get_pcppage_migratetype(page);
 
 			/* must delete to avoid corrupting pcp list */
-			list_del(&page->lru);
+			list_del(&page->pcp_list);
 			count -= nr_pages;
 			pcp->count -= nr_pages;
 
@@ -3068,7 +3068,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * for IO devices that can merge IO requests if the physical
 		 * pages are ordered properly.
 		 */
-		list_add_tail(&page->lru, list);
+		list_add_tail(&page->pcp_list, list);
 		allocated++;
 		if (is_migrate_cma(get_pcppage_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
@@ -3318,7 +3318,7 @@ void mark_free_pages(struct zone *zone)
 
 	for_each_migratetype_order(order, t) {
 		list_for_each_entry(page,
-				&zone->free_area[order].free_list[t], lru) {
+				&zone->free_area[order].free_list[t], buddy_list) {
 			unsigned long i;
 
 			pfn = page_to_pfn(page);
@@ -3407,7 +3407,7 @@ static void free_unref_page_commit(struct page *page, int migratetype,
 	__count_vm_event(PGFREE);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pindex = order_to_pindex(migratetype, order);
-	list_add(&page->lru, &pcp->lists[pindex]);
+	list_add(&page->pcp_list, &pcp->lists[pindex]);
 	pcp->count += 1 << order;
 
 	/*
@@ -3670,8 +3670,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 				return NULL;
 		}
 
-		page = list_first_entry(list, struct page, lru);
-		list_del(&page->lru);
+		page = list_first_entry(list, struct page, pcp_list);
+		list_del(&page->pcp_list);
 		pcp->count -= 1 << order;
 	} while (check_new_pcp(page, order));
 
-- 
2.35.3



* [PATCH 2/7] mm/page_alloc: Use only one PCP list for THP-sized allocations
  2022-06-24 12:54 [PATCH v5 00/7] Drain remote per-cpu directly Mel Gorman
  2022-06-24 12:54 ` [PATCH 1/7] mm/page_alloc: Add page->buddy_list and page->pcp_list Mel Gorman
@ 2022-06-24 12:54 ` Mel Gorman
  2022-06-24 12:54 ` [PATCH 3/7] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper Mel Gorman
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2022-06-24 12:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, Hugh Dickins, Yu Zhao, Marek Szyprowski, LKML,
	Linux-MM, Mel Gorman

The per_cpu_pages is cache-aligned on a standard x86-64 distribution
configuration but a later patch will add a new field which would push the
structure into the next cache line. Use only one list to store THP-sized
pages on the per-cpu list. This assumes that the vast majority of
THP-sized allocations are GFP_MOVABLE but even if it were another type, it
would not contribute to serious fragmentation that potentially causes a
later THP allocation failure. Align per_cpu_pages on the cacheline
boundary to ensure there is no false cache sharing.

After this patch, the structure sizing is:

struct per_cpu_pages {
        int                        count;                /*     0     4 */
        int                        high;                 /*     4     4 */
        int                        batch;                /*     8     4 */
        short int                  free_factor;          /*    12     2 */
        short int                  expire;               /*    14     2 */
        struct list_head           lists[13];            /*    16   208 */

        /* size: 256, cachelines: 4, members: 6 */
        /* padding: 32 */
} __attribute__((__aligned__(64)));
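
For reference, assuming the usual values MIGRATE_PCPTYPES == 3 and
PAGE_ALLOC_COSTLY_ORDER == 3 with CONFIG_TRANSPARENT_HUGEPAGE enabled,
the list count in the hunk below works out as:

        NR_LOWORDER_PCP_LISTS = 3 * (3 + 1) = 12        /* orders 0-3, per migratetype */
        NR_PCP_LISTS          = 12 + NR_PCP_THP = 13    /* plus one shared THP list */

order_to_pindex() maps every THP-sized request to the single index
NR_LOWORDER_PCP_LISTS (12), matching lists[13] in the layout above.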

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h | 11 +++++++----
 mm/page_alloc.c        |  4 ++--
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aab70355d64f..4e0352cc2fcb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -355,15 +355,18 @@ enum zone_watermarks {
 };
 
 /*
- * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER plus one additional
- * for pageblock size for THP if configured.
+ * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. One additional list
+ * for THP which will usually be GFP_MOVABLE. Even if it is another type,
+ * it should not contribute to serious fragmentation causing THP allocation
+ * failures.
  */
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define NR_PCP_THP 1
 #else
 #define NR_PCP_THP 0
 #endif
-#define NR_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1 + NR_PCP_THP))
+#define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
+#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)
 
 /*
  * Shift to encode migratetype and order in the same integer, with order
@@ -389,7 +392,7 @@ struct per_cpu_pages {
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
 	struct list_head lists[NR_PCP_LISTS];
-};
+} ____cacheline_aligned_in_smp;
 
 struct per_cpu_zonestat {
 #ifdef CONFIG_SMP
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 247fa7502199..febd97f4a2fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -653,7 +653,7 @@ static inline unsigned int order_to_pindex(int migratetype, int order)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (order > PAGE_ALLOC_COSTLY_ORDER) {
 		VM_BUG_ON(order != pageblock_order);
-		base = PAGE_ALLOC_COSTLY_ORDER + 1;
+		return NR_LOWORDER_PCP_LISTS;
 	}
 #else
 	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
@@ -667,7 +667,7 @@ static inline int pindex_to_order(unsigned int pindex)
 	int order = pindex / MIGRATE_PCPTYPES;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (order > PAGE_ALLOC_COSTLY_ORDER)
+	if (pindex == NR_LOWORDER_PCP_LISTS)
 		order = pageblock_order;
 #else
 	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
-- 
2.35.3



* [PATCH 3/7] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper
  2022-06-24 12:54 [PATCH v5 00/7] Drain remote per-cpu directly Mel Gorman
  2022-06-24 12:54 ` [PATCH 1/7] mm/page_alloc: Add page->buddy_list and page->pcp_list Mel Gorman
  2022-06-24 12:54 ` [PATCH 2/7] mm/page_alloc: Use only one PCP list for THP-sized allocations Mel Gorman
@ 2022-06-24 12:54 ` Mel Gorman
  2022-06-24 12:54 ` [PATCH 4/7] mm/page_alloc: Remove mistaken page == NULL check in rmqueue Mel Gorman
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2022-06-24 12:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, Hugh Dickins, Yu Zhao, Marek Szyprowski, LKML,
	Linux-MM, Mel Gorman

This is a preparation patch to allow the buddy removal code to be reused in
a later patch.

No functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 81 ++++++++++++++++++++++++++++---------------------
 1 file changed, 47 insertions(+), 34 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index febd97f4a2fc..44d198af4b35 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3637,6 +3637,43 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
 #endif
 }
 
+static __always_inline
+struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
+			   unsigned int order, unsigned int alloc_flags,
+			   int migratetype)
+{
+	struct page *page;
+	unsigned long flags;
+
+	do {
+		page = NULL;
+		spin_lock_irqsave(&zone->lock, flags);
+		/*
+		 * order-0 request can reach here when the pcplist is skipped
+		 * due to non-CMA allocation context. HIGHATOMIC area is
+		 * reserved for high-order atomic allocation, so order-0
+		 * request should skip it.
+		 */
+		if (order > 0 && alloc_flags & ALLOC_HARDER)
+			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+		if (!page) {
+			page = __rmqueue(zone, order, migratetype, alloc_flags);
+			if (!page) {
+				spin_unlock_irqrestore(&zone->lock, flags);
+				return NULL;
+			}
+		}
+		__mod_zone_freepage_state(zone, -(1 << order),
+					  get_pcppage_migratetype(page));
+		spin_unlock_irqrestore(&zone->lock, flags);
+	} while (check_new_pages(page, order));
+
+	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
+	zone_statistics(preferred_zone, zone, 1);
+
+	return page;
+}
+
 /* Remove page from the per-cpu list, caller must protect the list */
 static inline
 struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
@@ -3717,9 +3754,14 @@ struct page *rmqueue(struct zone *preferred_zone,
 			gfp_t gfp_flags, unsigned int alloc_flags,
 			int migratetype)
 {
-	unsigned long flags;
 	struct page *page;
 
+	/*
+	 * We most definitely don't want callers attempting to
+	 * allocate greater than order-1 page units with __GFP_NOFAIL.
+	 */
+	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
+
 	if (likely(pcp_allowed_order(order))) {
 		/*
 		 * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
@@ -3733,35 +3775,10 @@ struct page *rmqueue(struct zone *preferred_zone,
 		}
 	}
 
-	/*
-	 * We most definitely don't want callers attempting to
-	 * allocate greater than order-1 page units with __GFP_NOFAIL.
-	 */
-	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
-
-	do {
-		page = NULL;
-		spin_lock_irqsave(&zone->lock, flags);
-		/*
-		 * order-0 request can reach here when the pcplist is skipped
-		 * due to non-CMA allocation context. HIGHATOMIC area is
-		 * reserved for high-order atomic allocation, so order-0
-		 * request should skip it.
-		 */
-		if (order > 0 && alloc_flags & ALLOC_HARDER)
-			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
-		if (!page) {
-			page = __rmqueue(zone, order, migratetype, alloc_flags);
-			if (!page)
-				goto failed;
-		}
-		__mod_zone_freepage_state(zone, -(1 << order),
-					  get_pcppage_migratetype(page));
-		spin_unlock_irqrestore(&zone->lock, flags);
-	} while (check_new_pages(page, order));
-
-	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
-	zone_statistics(preferred_zone, zone, 1);
+	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
+							migratetype);
+	if (unlikely(!page))
+		return NULL;
 
 out:
 	/* Separate test+clear to avoid unnecessary atomics */
@@ -3772,10 +3789,6 @@ struct page *rmqueue(struct zone *preferred_zone,
 
 	VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
 	return page;
-
-failed:
-	spin_unlock_irqrestore(&zone->lock, flags);
-	return NULL;
 }
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
-- 
2.35.3



* [PATCH 4/7] mm/page_alloc: Remove mistaken page == NULL check in rmqueue
  2022-06-24 12:54 [PATCH v5 00/7] Drain remote per-cpu directly Mel Gorman
                   ` (2 preceding siblings ...)
  2022-06-24 12:54 ` [PATCH 3/7] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper Mel Gorman
@ 2022-06-24 12:54 ` Mel Gorman
  2022-06-24 12:54 ` [PATCH 5/7] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2022-06-24 12:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, Hugh Dickins, Yu Zhao, Marek Szyprowski, LKML,
	Linux-MM, Mel Gorman

If a page allocation fails, the ZONE_BOOSTED_WATERMARK bit should be
tested, cleared and kswapd woken, whether the allocation attempt was via
the PCP or directly via the buddy list.

Remove the page == NULL check so the ZONE_BOOSTED_WATERMARK bit is checked
unconditionally. As it is unlikely that ZONE_BOOSTED_WATERMARK is set,
mark the branch accordingly.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 44d198af4b35..7fb262eeec2f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3777,12 +3777,10 @@ struct page *rmqueue(struct zone *preferred_zone,
 
 	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
 							migratetype);
-	if (unlikely(!page))
-		return NULL;
 
 out:
 	/* Separate test+clear to avoid unnecessary atomics */
-	if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
+	if (unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
 		clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
 		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
 	}
-- 
2.35.3



* [PATCH 5/7] mm/page_alloc: Protect PCP lists with a spinlock
  2022-06-24 12:54 [PATCH v5 00/7] Drain remote per-cpu directly Mel Gorman
                   ` (3 preceding siblings ...)
  2022-06-24 12:54 ` [PATCH 4/7] mm/page_alloc: Remove mistaken page == NULL check in rmqueue Mel Gorman
@ 2022-06-24 12:54 ` Mel Gorman
  2022-07-04 12:31   ` Vlastimil Babka
  2022-07-04 16:32   ` Nicolas Saenz Julienne
  2022-06-24 12:54 ` [PATCH 6/7] mm/page_alloc: Remotely drain per-cpu lists Mel Gorman
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 21+ messages in thread
From: Mel Gorman @ 2022-06-24 12:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, Hugh Dickins, Yu Zhao, Marek Szyprowski, LKML,
	Linux-MM, Mel Gorman

Currently the PCP lists are protected by using local_lock_irqsave to
prevent migration and IRQ reentrancy, but this is inconvenient. Remote
draining of the lists is impossible, a workqueue is required, and every
allocation or free must disable and then re-enable interrupts, which is
expensive.

As preparation for dealing with both of those problems, protect the lists
with a spinlock. The IRQ-unsafe version of the lock is used because IRQs
are already disabled by local_lock_irqsave. spin_trylock is used in
preparation for a time when local_lock could be used instead of
local_lock_irqsave.
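
A minimal sketch of the resulting pattern on the fast paths (see the
rmqueue_pcplist() and free_unref_page() hunks below):

        local_lock_irqsave(&pagesets.lock, flags);      /* still pins the CPU, disables IRQs */
        pcp = this_cpu_ptr(zone->per_cpu_pageset);
        if (spin_trylock(&pcp->lock)) {                 /* new: protects the list contents */
                /* ... allocate from or free to pcp->lists[] ... */
                spin_unlock(&pcp->lock);
        }
        local_unlock_irqrestore(&pagesets.lock, flags);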

The per_cpu_pages still fits within the same number of cache lines after
this patch relative to before the series.

struct per_cpu_pages {
        spinlock_t                 lock;                 /*     0     4 */
        int                        count;                /*     4     4 */
        int                        high;                 /*     8     4 */
        int                        batch;                /*    12     4 */
        short int                  free_factor;          /*    16     2 */
        short int                  expire;               /*    18     2 */

        /* XXX 4 bytes hole, try to pack */

        struct list_head           lists[13];            /*    24   208 */

        /* size: 256, cachelines: 4, members: 7 */
        /* sum members: 228, holes: 1, sum holes: 4 */
        /* padding: 24 */
} __attribute__((__aligned__(64)));

There is overhead in the fast path due to acquiring the spinlock even
though the spinlock is per-cpu and uncontended in the common case. Page
Fault Test (PFT) running on a 1-socket machine reported the following
results.

                                     5.19.0-rc3               5.19.0-rc3
                                        vanilla      mm-pcpspinirq-v5r16
Hmean     faults/sec-1   869275.7381 (   0.00%)   874597.5167 *   0.61%*
Hmean     faults/sec-3  2370266.6681 (   0.00%)  2379802.0362 *   0.40%*
Hmean     faults/sec-5  2701099.7019 (   0.00%)  2664889.7003 *  -1.34%*
Hmean     faults/sec-7  3517170.9157 (   0.00%)  3491122.8242 *  -0.74%*
Hmean     faults/sec-8  3965729.6187 (   0.00%)  3939727.0243 *  -0.66%*

There is a small hit in the number of faults per second but given that the
results are more stable, it's borderline noise.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |   1 +
 mm/page_alloc.c        | 118 +++++++++++++++++++++++++++++++++--------
 2 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4e0352cc2fcb..299259cfe462 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -382,6 +382,7 @@ enum zone_watermarks {
 
 /* Fields and list protected by pagesets local_lock in page_alloc.c */
 struct per_cpu_pages {
+	spinlock_t lock;	/* Protects lists field */
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7fb262eeec2f..3b819c2720f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -133,6 +133,20 @@ static DEFINE_PER_CPU(struct pagesets, pagesets) = {
 	.lock = INIT_LOCAL_LOCK(lock),
 };
 
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
+/*
+ * On SMP, spin_trylock is sufficient protection.
+ * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
+ */
+#define pcp_trylock_prepare(flags)	do { } while (0)
+#define pcp_trylock_finish(flag)	do { } while (0)
+#else
+
+/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */
+#define pcp_trylock_prepare(flags)	local_irq_save(flags)
+#define pcp_trylock_finish(flags)	local_irq_restore(flags)
+#endif
+
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DEFINE_PER_CPU(int, numa_node);
 EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -3097,15 +3111,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
  */
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 {
-	unsigned long flags;
 	int to_drain, batch;
 
-	local_lock_irqsave(&pagesets.lock, flags);
 	batch = READ_ONCE(pcp->batch);
 	to_drain = min(pcp->count, batch);
-	if (to_drain > 0)
+	if (to_drain > 0) {
+		unsigned long flags;
+
+		/*
+		 * free_pcppages_bulk expects IRQs disabled for zone->lock
+		 * so even though pcp->lock is not intended to be IRQ-safe,
+		 * it's needed in this context.
+		 */
+		spin_lock_irqsave(&pcp->lock, flags);
 		free_pcppages_bulk(zone, to_drain, pcp, 0);
-	local_unlock_irqrestore(&pagesets.lock, flags);
+		spin_unlock_irqrestore(&pcp->lock, flags);
+	}
 }
 #endif
 
@@ -3118,16 +3139,17 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
  */
 static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 {
-	unsigned long flags;
 	struct per_cpu_pages *pcp;
 
-	local_lock_irqsave(&pagesets.lock, flags);
-
 	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
-	if (pcp->count)
-		free_pcppages_bulk(zone, pcp->count, pcp, 0);
+	if (pcp->count) {
+		unsigned long flags;
 
-	local_unlock_irqrestore(&pagesets.lock, flags);
+		/* See drain_zone_pages on why this is disabling IRQs */
+		spin_lock_irqsave(&pcp->lock, flags);
+		free_pcppages_bulk(zone, pcp->count, pcp, 0);
+		spin_unlock_irqrestore(&pcp->lock, flags);
+	}
 }
 
 /*
@@ -3395,17 +3417,15 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 	return min(READ_ONCE(pcp->batch) << 2, high);
 }
 
-static void free_unref_page_commit(struct page *page, int migratetype,
+static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
+				   struct page *page, int migratetype,
 				   unsigned int order)
 {
-	struct zone *zone = page_zone(page);
-	struct per_cpu_pages *pcp;
 	int high;
 	int pindex;
 	bool free_high;
 
 	__count_vm_event(PGFREE);
-	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pindex = order_to_pindex(migratetype, order);
 	list_add(&page->pcp_list, &pcp->lists[pindex]);
 	pcp->count += 1 << order;
@@ -3432,6 +3452,9 @@ static void free_unref_page_commit(struct page *page, int migratetype,
 void free_unref_page(struct page *page, unsigned int order)
 {
 	unsigned long flags;
+	unsigned long __maybe_unused UP_flags;
+	struct per_cpu_pages *pcp;
+	struct zone *zone;
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
@@ -3455,7 +3478,16 @@ void free_unref_page(struct page *page, unsigned int order)
 	}
 
 	local_lock_irqsave(&pagesets.lock, flags);
-	free_unref_page_commit(page, migratetype, order);
+	zone = page_zone(page);
+	pcp_trylock_prepare(UP_flags);
+	pcp = this_cpu_ptr(zone->per_cpu_pageset);
+	if (spin_trylock(&pcp->lock)) {
+		free_unref_page_commit(zone, pcp, page, migratetype, order);
+		spin_unlock(&pcp->lock);
+	} else {
+		free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
+	}
+	pcp_trylock_finish(UP_flags);
 	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
@@ -3465,6 +3497,8 @@ void free_unref_page(struct page *page, unsigned int order)
 void free_unref_page_list(struct list_head *list)
 {
 	struct page *page, *next;
+	struct per_cpu_pages *pcp = NULL;
+	struct zone *locked_zone = NULL;
 	unsigned long flags;
 	int batch_count = 0;
 	int migratetype;
@@ -3491,6 +3525,17 @@ void free_unref_page_list(struct list_head *list)
 
 	local_lock_irqsave(&pagesets.lock, flags);
 	list_for_each_entry_safe(page, next, list, lru) {
+		struct zone *zone = page_zone(page);
+
+		/* Different zone, different pcp lock. */
+		if (zone != locked_zone) {
+			if (pcp)
+				spin_unlock(&pcp->lock);
+			locked_zone = zone;
+			pcp = this_cpu_ptr(zone->per_cpu_pageset);
+			spin_lock(&pcp->lock);
+		}
+
 		/*
 		 * Non-isolated types over MIGRATE_PCPTYPES get added
 		 * to the MIGRATE_MOVABLE pcp list.
@@ -3500,18 +3545,24 @@ void free_unref_page_list(struct list_head *list)
 			migratetype = MIGRATE_MOVABLE;
 
 		trace_mm_page_free_batched(page);
-		free_unref_page_commit(page, migratetype, 0);
+		free_unref_page_commit(zone, pcp, page, migratetype, 0);
 
 		/*
 		 * Guard against excessive IRQ disabled times when we get
 		 * a large list of pages to free.
 		 */
 		if (++batch_count == SWAP_CLUSTER_MAX) {
+			spin_unlock(&pcp->lock);
 			local_unlock_irqrestore(&pagesets.lock, flags);
 			batch_count = 0;
 			local_lock_irqsave(&pagesets.lock, flags);
+			pcp = this_cpu_ptr(locked_zone->per_cpu_pageset);
+			spin_lock(&pcp->lock);
 		}
 	}
+
+	if (pcp)
+		spin_unlock(&pcp->lock);
 	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
@@ -3725,18 +3776,31 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	struct list_head *list;
 	struct page *page;
 	unsigned long flags;
+	unsigned long __maybe_unused UP_flags;
 
 	local_lock_irqsave(&pagesets.lock, flags);
 
+	/*
+	 * spin_trylock may fail due to a parallel drain. In the future, the
+	 * trylock will also protect against IRQ reentrancy.
+	 */
+	pcp = this_cpu_ptr(zone->per_cpu_pageset);
+	pcp_trylock_prepare(UP_flags);
+	if (!spin_trylock(&pcp->lock)) {
+		pcp_trylock_finish(UP_flags);
+		return NULL;
+	}
+
 	/*
 	 * On allocation, reduce the number of pages that are batch freed.
 	 * See nr_pcp_free() where free_factor is increased for subsequent
 	 * frees.
 	 */
-	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pcp->free_factor >>= 1;
 	list = &pcp->lists[order_to_pindex(migratetype, order)];
 	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
+	spin_unlock(&pcp->lock);
+	pcp_trylock_finish(UP_flags);
 	local_unlock_irqrestore(&pagesets.lock, flags);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
@@ -3771,7 +3835,8 @@ struct page *rmqueue(struct zone *preferred_zone,
 				migratetype != MIGRATE_MOVABLE) {
 			page = rmqueue_pcplist(preferred_zone, zone, order,
 					gfp_flags, migratetype, alloc_flags);
-			goto out;
+			if (likely(page))
+				goto out;
 		}
 	}
 
@@ -5259,6 +5324,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 {
 	struct page *page;
 	unsigned long flags;
+	unsigned long __maybe_unused UP_flags;
 	struct zone *zone;
 	struct zoneref *z;
 	struct per_cpu_pages *pcp;
@@ -5339,11 +5405,15 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	if (unlikely(!zone))
 		goto failed;
 
-	/* Attempt the batch allocation */
+	/* Is a parallel drain in progress? */
 	local_lock_irqsave(&pagesets.lock, flags);
+	pcp_trylock_prepare(UP_flags);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
-	pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
+	if (!spin_trylock(&pcp->lock))
+		goto failed_irq;
 
+	/* Attempt the batch allocation */
+	pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
 	while (nr_populated < nr_pages) {
 
 		/* Skip existing pages */
@@ -5356,8 +5426,10 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 								pcp, pcp_list);
 		if (unlikely(!page)) {
 			/* Try and allocate at least one page */
-			if (!nr_account)
+			if (!nr_account) {
+				spin_unlock(&pcp->lock);
 				goto failed_irq;
+			}
 			break;
 		}
 		nr_account++;
@@ -5370,6 +5442,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		nr_populated++;
 	}
 
+	spin_unlock(&pcp->lock);
+	pcp_trylock_finish(UP_flags);
 	local_unlock_irqrestore(&pagesets.lock, flags);
 
 	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
@@ -5379,6 +5453,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	return nr_populated;
 
 failed_irq:
+	pcp_trylock_finish(UP_flags);
 	local_unlock_irqrestore(&pagesets.lock, flags);
 
 failed:
@@ -7019,6 +7094,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 	memset(pcp, 0, sizeof(*pcp));
 	memset(pzstats, 0, sizeof(*pzstats));
 
+	spin_lock_init(&pcp->lock);
 	for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
 		INIT_LIST_HEAD(&pcp->lists[pindex]);
 
-- 
2.35.3



* [PATCH 6/7] mm/page_alloc: Remotely drain per-cpu lists
  2022-06-24 12:54 [PATCH v5 00/7] Drain remote per-cpu directly Mel Gorman
                   ` (4 preceding siblings ...)
  2022-06-24 12:54 ` [PATCH 5/7] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
@ 2022-06-24 12:54 ` Mel Gorman
  2022-07-04 14:28   ` Vlastimil Babka
  2022-06-24 12:54 ` [PATCH 7/7] mm/page_alloc: Replace local_lock with normal spinlock Mel Gorman
  2022-07-03 23:28 ` [PATCH v5 00/7] Drain remote per-cpu directly Andrew Morton
  7 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2022-06-24 12:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, Hugh Dickins, Yu Zhao, Marek Szyprowski, LKML,
	Linux-MM, Mel Gorman

From: Nicolas Saenz Julienne <nsaenzju@redhat.com>

Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
drain work queued by __drain_all_pages(). So introduce a new mechanism to
remotely drain the per-cpu lists. It is made possible by remotely locking
the new per-cpu spinlocks embedded in 'struct per_cpu_pages'. A benefit
of this new scheme is that drain operations are now migration safe.

There was no observed performance degradation vs. the previous scheme.
Both netperf and hackbench were run in parallel with triggering the
__drain_all_pages(NULL, true) code path around 100 times per second. The
new scheme performs a bit better (~5%), although the important point here
is that there are no performance regressions vs. the previous mechanism.
Per-cpu lists draining happens only in slow paths.
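
The core of the change, taken from the diff below, is that
__drain_all_pages() now drains each CPU directly instead of queueing and
flushing per-cpu work items:

        for_each_cpu(cpu, &cpus_with_pcps) {
                if (zone)
                        drain_pages_zone(cpu, zone);
                else
                        drain_pages(cpu);
        }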

Minchan Kim tested an earlier version and reported:

	My workload is not on NOHZ CPUs but runs apps under heavy memory
	pressure, so they go into direct reclaim and get stuck on
	drain_all_pages until the work on the workqueue runs.

	unit: nanosecond
	max(dur)        avg(dur)                count(dur)
	166713013       487511.77786438033      1283

	From traces, system encountered the drain_all_pages 1283 times and
	worst case was 166ms and avg was 487us.

	The other problem was alloc_contig_range in CMA. The PCP draining
	sometimes takes several hundred milliseconds even though there is no
	memory pressure or only a few pages to be migrated out, because the
	CPUs were fully booked.

	Your patch perfectly removed those wasted time.

Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 58 ++++---------------------------------------------
 1 file changed, 4 insertions(+), 54 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b819c2720f8..44e7c29aaa7d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -165,13 +165,7 @@ DEFINE_PER_CPU(int, _numa_mem_);		/* Kernel "local memory" node */
 EXPORT_PER_CPU_SYMBOL(_numa_mem_);
 #endif
 
-/* work_structs for global per-cpu drains */
-struct pcpu_drain {
-	struct zone *zone;
-	struct work_struct work;
-};
 static DEFINE_MUTEX(pcpu_drain_mutex);
-static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);
 
 #ifdef CONFIG_GCC_PLUGIN_LATENT_ENTROPY
 volatile unsigned long latent_entropy __latent_entropy;
@@ -3105,9 +3099,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
  * Called from the vmstat counter updater to drain pagesets of this
  * currently executing processor on remote nodes after they have
  * expired.
- *
- * Note that this function must be called with the thread pinned to
- * a single processor.
  */
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 {
@@ -3132,10 +3123,6 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 
 /*
  * Drain pcplists of the indicated processor and zone.
- *
- * The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
  */
 static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 {
@@ -3154,10 +3141,6 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 
 /*
  * Drain pcplists of all zones on the indicated processor.
- *
- * The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
  */
 static void drain_pages(unsigned int cpu)
 {
@@ -3170,9 +3153,6 @@ static void drain_pages(unsigned int cpu)
 
 /*
  * Spill all of this CPU's per-cpu pages back into the buddy allocator.
- *
- * The CPU has to be pinned. When zone parameter is non-NULL, spill just
- * the single zone's pages.
  */
 void drain_local_pages(struct zone *zone)
 {
@@ -3184,24 +3164,6 @@ void drain_local_pages(struct zone *zone)
 		drain_pages(cpu);
 }
 
-static void drain_local_pages_wq(struct work_struct *work)
-{
-	struct pcpu_drain *drain;
-
-	drain = container_of(work, struct pcpu_drain, work);
-
-	/*
-	 * drain_all_pages doesn't use proper cpu hotplug protection so
-	 * we can race with cpu offline when the WQ can move this from
-	 * a cpu pinned worker to an unbound one. We can operate on a different
-	 * cpu which is alright but we also have to make sure to not move to
-	 * a different one.
-	 */
-	migrate_disable();
-	drain_local_pages(drain->zone);
-	migrate_enable();
-}
-
 /*
  * The implementation of drain_all_pages(), exposing an extra parameter to
  * drain on all cpus.
@@ -3222,13 +3184,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 	 */
 	static cpumask_t cpus_with_pcps;
 
-	/*
-	 * Make sure nobody triggers this path before mm_percpu_wq is fully
-	 * initialized.
-	 */
-	if (WARN_ON_ONCE(!mm_percpu_wq))
-		return;
-
 	/*
 	 * Do not drain if one is already in progress unless it's specific to
 	 * a zone. Such callers are primarily CMA and memory hotplug and need
@@ -3278,14 +3233,11 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 	}
 
 	for_each_cpu(cpu, &cpus_with_pcps) {
-		struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);
-
-		drain->zone = zone;
-		INIT_WORK(&drain->work, drain_local_pages_wq);
-		queue_work_on(cpu, mm_percpu_wq, &drain->work);
+		if (zone)
+			drain_pages_zone(cpu, zone);
+		else
+			drain_pages(cpu);
 	}
-	for_each_cpu(cpu, &cpus_with_pcps)
-		flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);
 
 	mutex_unlock(&pcpu_drain_mutex);
 }
@@ -3294,8 +3246,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
  * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
  *
  * When zone parameter is non-NULL, spill just the single zone's pages.
- *
- * Note that this can be extremely slow as the draining happens in a workqueue.
  */
 void drain_all_pages(struct zone *zone)
 {
-- 
2.35.3



* [PATCH 7/7] mm/page_alloc: Replace local_lock with normal spinlock
  2022-06-24 12:54 [PATCH v5 00/7] Drain remote per-cpu directly Mel Gorman
                   ` (5 preceding siblings ...)
  2022-06-24 12:54 ` [PATCH 6/7] mm/page_alloc: Remotely drain per-cpu lists Mel Gorman
@ 2022-06-24 12:54 ` Mel Gorman
  2022-06-24 18:59   ` Yu Zhao
                     ` (2 more replies)
  2022-07-03 23:28 ` [PATCH v5 00/7] Drain remote per-cpu directly Andrew Morton
  7 siblings, 3 replies; 21+ messages in thread
From: Mel Gorman @ 2022-06-24 12:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, Hugh Dickins, Yu Zhao, Marek Szyprowski, LKML,
	Linux-MM, Mel Gorman

struct per_cpu_pages is no longer strictly local as PCP lists can be
drained remotely using a lock for protection. While the use of local_lock
works, it goes against the intent of local_lock which is for "pure
CPU local concurrency control mechanisms and not suited for inter-CPU
concurrency control" (Documentation/locking/locktypes.rst).

local_lock protects against migration between when the percpu pointer is
accessed and the pcp->lock acquired. The lock acquisition is a preemption
point so, in the worst case, a task could migrate to another NUMA node
and accidentally allocate remote memory. The main requirement is to pin
the task to a CPU in a manner that is suitable for both PREEMPT_RT and
!PREEMPT_RT.

Replace local_lock with helpers that pin a task to a CPU, look up the
per-cpu structure and acquire the embedded lock. This is similar to
local_lock without breaking the intent behind the API. It is not a
complete API as only the parts needed for PCP-alloc are implemented, but
in theory the generic helpers could be promoted to a general API if there
were demand for an embedded lock within a per-cpu struct, with a guarantee
that the locked per-cpu structure matches the running CPU, in contexts
that cannot use get_cpu_var due to RT concerns. PCP requires these
semantics to avoid accidentally allocating remote memory.
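
A minimal sketch of the resulting free_unref_page() pattern, as corrected
by the follow-up -fix later in this thread:

        pcp_trylock_prepare(UP_flags);
        pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
        if (pcp) {
                /* Pinned to this CPU with pcp->lock held and IRQs disabled. */
                free_unref_page_commit(zone, pcp, page, migratetype, order);
                pcp_spin_unlock_irqrestore(pcp, flags);
        } else {
                /* Lock contended by a remote drain; free to the buddy lists. */
                free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
        }
        pcp_trylock_finish(UP_flags);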

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 139 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 95 insertions(+), 44 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 44e7c29aaa7d..71065b01827b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -126,13 +126,6 @@ typedef int __bitwise fpi_t;
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
 
-struct pagesets {
-	local_lock_t lock;
-};
-static DEFINE_PER_CPU(struct pagesets, pagesets) = {
-	.lock = INIT_LOCAL_LOCK(lock),
-};
-
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
 /*
  * On SMP, spin_trylock is sufficient protection.
@@ -147,6 +140,83 @@ static DEFINE_PER_CPU(struct pagesets, pagesets) = {
 #define pcp_trylock_finish(flags)	local_irq_restore(flags)
 #endif
 
+/*
+ * Locking a pcp requires a PCP lookup followed by a spinlock. To avoid
+ * a migration causing the wrong PCP to be locked and remote memory being
+ * potentially allocated, pin the task to the CPU for the lookup+lock.
+ * preempt_disable is used on !RT because it is faster than migrate_disable.
+ * migrate_disable is used on RT because otherwise RT spinlock usage is
+ * interfered with and a high priority task cannot preempt the allocator.
+ */
+#ifndef CONFIG_PREEMPT_RT
+#define pcpu_task_pin()		preempt_disable()
+#define pcpu_task_unpin()	preempt_enable()
+#else
+#define pcpu_task_pin()		migrate_disable()
+#define pcpu_task_unpin()	migrate_enable()
+#endif
+
+/*
+ * Generic helper to lookup and a per-cpu variable with an embedded spinlock.
+ * Return value should be used with equivalent unlock helper.
+ */
+#define pcpu_spin_lock(type, member, ptr)				\
+({									\
+	type *_ret;							\
+	pcpu_task_pin();						\
+	_ret = this_cpu_ptr(ptr);					\
+	spin_lock(&_ret->member);					\
+	_ret;								\
+})
+
+#define pcpu_spin_lock_irqsave(type, member, ptr, flags)		\
+({									\
+	type *_ret;							\
+	pcpu_task_pin();						\
+	_ret = this_cpu_ptr(ptr);					\
+	spin_lock_irqsave(&_ret->member, flags);			\
+	_ret;								\
+})
+
+#define pcpu_spin_trylock_irqsave(type, member, ptr, flags)		\
+({									\
+	type *_ret;							\
+	pcpu_task_pin();						\
+	_ret = this_cpu_ptr(ptr);					\
+	if (!spin_trylock_irqsave(&_ret->member, flags)) {		\
+		pcpu_task_unpin();					\
+		_ret = NULL;						\
+	}								\
+	_ret;								\
+})
+
+#define pcpu_spin_unlock(member, ptr)					\
+({									\
+	spin_unlock(&ptr->member);					\
+	pcpu_task_unpin();						\
+})
+
+#define pcpu_spin_unlock_irqrestore(member, ptr, flags)			\
+({									\
+	spin_unlock_irqrestore(&ptr->member, flags);			\
+	pcpu_task_unpin();						\
+})
+
+/* struct per_cpu_pages specific helpers. */
+#define pcp_spin_lock(ptr)						\
+	pcpu_spin_lock(struct per_cpu_pages, lock, ptr)
+
+#define pcp_spin_lock_irqsave(ptr, flags)				\
+	pcpu_spin_lock_irqsave(struct per_cpu_pages, lock, ptr, flags)
+
+#define pcp_spin_trylock_irqsave(ptr, flags)				\
+	pcpu_spin_trylock_irqsave(struct per_cpu_pages, lock, ptr, flags)
+
+#define pcp_spin_unlock(ptr)						\
+	pcpu_spin_unlock(lock, ptr)
+
+#define pcp_spin_unlock_irqrestore(ptr, flags)				\
+	pcpu_spin_unlock_irqrestore(lock, ptr, flags)
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DEFINE_PER_CPU(int, numa_node);
 EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1481,10 +1551,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	/* Ensure requested pindex is drained first. */
 	pindex = pindex - 1;
 
-	/*
-	 * local_lock_irq held so equivalent to spin_lock_irqsave for
-	 * both PREEMPT_RT and non-PREEMPT_RT configurations.
-	 */
+	/* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
 	spin_lock(&zone->lock);
 	isolated_pageblocks = has_isolate_pageblock(zone);
 
@@ -3052,10 +3119,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 {
 	int i, allocated = 0;
 
-	/*
-	 * local_lock_irq held so equivalent to spin_lock_irqsave for
-	 * both PREEMPT_RT and non-PREEMPT_RT configurations.
-	 */
+	/* Caller must hold IRQ-safe pcp->lock so IRQs are disabled. */
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
@@ -3427,18 +3491,16 @@ void free_unref_page(struct page *page, unsigned int order)
 		migratetype = MIGRATE_MOVABLE;
 	}
 
-	local_lock_irqsave(&pagesets.lock, flags);
 	zone = page_zone(page);
 	pcp_trylock_prepare(UP_flags);
-	pcp = this_cpu_ptr(zone->per_cpu_pageset);
-	if (spin_trylock(&pcp->lock)) {
+	pcp = pcpu_spin_trylock_irqsave(struct per_cpu_pages, lock, zone->per_cpu_pageset, flags);
+	if (pcp) {
 		free_unref_page_commit(zone, pcp, page, migratetype, order);
-		spin_unlock(&pcp->lock);
+		pcp_spin_unlock_irqrestore(pcp, flags);
 	} else {
 		free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
 	}
 	pcp_trylock_finish(UP_flags);
-	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
 /*
@@ -3473,17 +3535,16 @@ void free_unref_page_list(struct list_head *list)
 		}
 	}
 
-	local_lock_irqsave(&pagesets.lock, flags);
 	list_for_each_entry_safe(page, next, list, lru) {
 		struct zone *zone = page_zone(page);
 
 		/* Different zone, different pcp lock. */
 		if (zone != locked_zone) {
 			if (pcp)
-				spin_unlock(&pcp->lock);
+				pcp_spin_unlock_irqrestore(pcp, flags);
+
 			locked_zone = zone;
-			pcp = this_cpu_ptr(zone->per_cpu_pageset);
-			spin_lock(&pcp->lock);
+			pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
 		}
 
 		/*
@@ -3502,18 +3563,14 @@ void free_unref_page_list(struct list_head *list)
 		 * a large list of pages to free.
 		 */
 		if (++batch_count == SWAP_CLUSTER_MAX) {
-			spin_unlock(&pcp->lock);
-			local_unlock_irqrestore(&pagesets.lock, flags);
+			pcp_spin_unlock_irqrestore(pcp, flags);
 			batch_count = 0;
-			local_lock_irqsave(&pagesets.lock, flags);
-			pcp = this_cpu_ptr(locked_zone->per_cpu_pageset);
-			spin_lock(&pcp->lock);
+			pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
 		}
 	}
 
 	if (pcp)
-		spin_unlock(&pcp->lock);
-	local_unlock_irqrestore(&pagesets.lock, flags);
+		pcp_spin_unlock_irqrestore(pcp, flags);
 }
 
 /*
@@ -3728,15 +3785,13 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	unsigned long flags;
 	unsigned long __maybe_unused UP_flags;
 
-	local_lock_irqsave(&pagesets.lock, flags);
-
 	/*
 	 * spin_trylock may fail due to a parallel drain. In the future, the
 	 * trylock will also protect against IRQ reentrancy.
 	 */
-	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pcp_trylock_prepare(UP_flags);
-	if (!spin_trylock(&pcp->lock)) {
+	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
+	if (!pcp) {
 		pcp_trylock_finish(UP_flags);
 		return NULL;
 	}
@@ -3749,9 +3804,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	pcp->free_factor >>= 1;
 	list = &pcp->lists[order_to_pindex(migratetype, order)];
 	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
-	spin_unlock(&pcp->lock);
+	pcp_spin_unlock_irqrestore(pcp, flags);
 	pcp_trylock_finish(UP_flags);
-	local_unlock_irqrestore(&pagesets.lock, flags);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
 		zone_statistics(preferred_zone, zone, 1);
@@ -5356,10 +5410,9 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		goto failed;
 
 	/* Is a parallel drain in progress? */
-	local_lock_irqsave(&pagesets.lock, flags);
 	pcp_trylock_prepare(UP_flags);
-	pcp = this_cpu_ptr(zone->per_cpu_pageset);
-	if (!spin_trylock(&pcp->lock))
+	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
+	if (!pcp)
 		goto failed_irq;
 
 	/* Attempt the batch allocation */
@@ -5377,7 +5430,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		if (unlikely(!page)) {
 			/* Try and allocate at least one page */
 			if (!nr_account) {
-				spin_unlock(&pcp->lock);
+				pcp_spin_unlock_irqrestore(pcp, flags);
 				goto failed_irq;
 			}
 			break;
@@ -5392,9 +5445,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		nr_populated++;
 	}
 
-	spin_unlock(&pcp->lock);
+	pcp_spin_unlock_irqrestore(pcp, flags);
 	pcp_trylock_finish(UP_flags);
-	local_unlock_irqrestore(&pagesets.lock, flags);
 
 	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
 	zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);
@@ -5404,7 +5456,6 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 
 failed_irq:
 	pcp_trylock_finish(UP_flags);
-	local_unlock_irqrestore(&pagesets.lock, flags);
 
 failed:
 	page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
-- 
2.35.3



* Re: [PATCH 7/7] mm/page_alloc: Replace local_lock with normal spinlock
  2022-06-24 12:54 ` [PATCH 7/7] mm/page_alloc: Replace local_lock with normal spinlock Mel Gorman
@ 2022-06-24 18:59   ` Yu Zhao
  2022-06-27  8:46     ` [PATCH] mm/page_alloc: Replace local_lock with normal spinlock -fix Mel Gorman
  2022-07-04 14:39   ` [PATCH 7/7] mm/page_alloc: Replace local_lock with normal spinlock Vlastimil Babka
  2022-07-04 16:33   ` Nicolas Saenz Julienne
  2 siblings, 1 reply; 21+ messages in thread
From: Yu Zhao @ 2022-06-24 18:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Nicolas Saenz Julienne, Marcelo Tosatti,
	Vlastimil Babka, Michal Hocko, Hugh Dickins, Marek Szyprowski,
	LKML, Linux-MM

On Fri, Jun 24, 2022 at 6:55 AM Mel Gorman <mgorman@techsingularity.net> wrote:
...
> +#define pcp_spin_trylock_irqsave(ptr, flags)                           \
> +       pcpu_spin_trylock_irqsave(struct per_cpu_pages, lock, ptr, flags)
> +
...
> +       pcp = pcpu_spin_trylock_irqsave(struct per_cpu_pages, lock, zone->per_cpu_pageset, flags);

Why was pcp_spin_trylock_irqsave() added earlier but not used later?


* [PATCH] mm/page_alloc: Replace local_lock with normal spinlock -fix
  2022-06-24 18:59   ` Yu Zhao
@ 2022-06-27  8:46     ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2022-06-27  8:46 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Nicolas Saenz Julienne, Marcelo Tosatti,
	Vlastimil Babka, Michal Hocko, Hugh Dickins, Marek Szyprowski,
	LKML, Linux-MM

As noted by Yu Zhao, use pcp_spin_trylock_irqsave instead of
pcpu_spin_trylock_irqsave. This is a fix to the mm-unstable patch
mm-page_alloc-replace-local_lock-with-normal-spinlock.patch

Reported-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 71065b01827b..934d1b5a5449 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3493,7 +3493,7 @@ void free_unref_page(struct page *page, unsigned int order)
 
 	zone = page_zone(page);
 	pcp_trylock_prepare(UP_flags);
-	pcp = pcpu_spin_trylock_irqsave(struct per_cpu_pages, lock, zone->per_cpu_pageset, flags);
+	pcp = pcp_spin_trylock_irqsave(zone->per_cpu_pageset, flags);
 	if (pcp) {
 		free_unref_page_commit(zone, pcp, page, migratetype, order);
 		pcp_spin_unlock_irqrestore(pcp, flags);


* Re: [PATCH v5 00/7] Drain remote per-cpu directly
  2022-06-24 12:54 [PATCH v5 00/7] Drain remote per-cpu directly Mel Gorman
                   ` (6 preceding siblings ...)
  2022-06-24 12:54 ` [PATCH 7/7] mm/page_alloc: Replace local_lock with normal spinlock Mel Gorman
@ 2022-07-03 23:28 ` Andrew Morton
  2022-07-03 23:31   ` Yu Zhao
  7 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2022-07-03 23:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, Hugh Dickins, Yu Zhao, Marek Szyprowski, LKML,
	Linux-MM

On Fri, 24 Jun 2022 13:54:16 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> Some setups, notably NOHZ_FULL CPUs, may be running realtime or
> latency-sensitive applications that cannot tolerate interference due to
> per-cpu drain work queued by __drain_all_pages().  Introduce a new
> mechanism to remotely drain the per-cpu lists. It is made possible by
> remotely locking 'struct per_cpu_pages' new per-cpu spinlocks.  This has
> two advantages, the time to drain is more predictable and other unrelated
> tasks are not interrupted.
> 
> This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> per-cpu lists drain support" -- avoid interference of a high priority task
> due to a workqueue item draining per-cpu page lists.  While many workloads
> can tolerate a brief interruption, it may cause a real-time task running
> on a NOHZ_FULL CPU to miss a deadline and at minimum, the draining is
> non-deterministic.
> 
> Currently an IRQ-safe local_lock protects the page allocator per-cpu
> lists. The local_lock on its own prevents migration and the IRQ disabling
> protects from corruption due to an interrupt arriving while a page
> allocation is in progress.
> 
> This series adjusts the locking.  A spinlock is added to struct
> per_cpu_pages to protect the list contents while local_lock_irq is
> ultimately replaced by just the spinlock in the final patch.  This allows
> a remote CPU to safely drain a remote per-cpu list. Follow-on work should allow the spin_lock_irqsave
> to be converted to spin_lock to avoid IRQs being disabled/enabled in
> most cases. The follow-on patch will be one kernel release later as it
> is relatively high risk and it'll make bisections more clear if there
> are any problems.

I plan to move this and Mel's fix to [7/7] into mm-stable around July 8.



* Re: [PATCH v5 00/7] Drain remote per-cpu directly
  2022-07-03 23:28 ` [PATCH v5 00/7] Drain remote per-cpu directly Andrew Morton
@ 2022-07-03 23:31   ` Yu Zhao
  2022-07-03 23:35     ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Yu Zhao @ 2022-07-03 23:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Nicolas Saenz Julienne, Marcelo Tosatti,
	Vlastimil Babka, Michal Hocko, Hugh Dickins, Marek Szyprowski,
	LKML, Linux-MM

On Sun, Jul 3, 2022 at 5:28 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 24 Jun 2022 13:54:16 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
>
> > Some setups, notably NOHZ_FULL CPUs, may be running realtime or
> > latency-sensitive applications that cannot tolerate interference due to
> > per-cpu drain work queued by __drain_all_pages().  Introduce a new
> > mechanism to remotely drain the per-cpu lists. It is made possible by
> > remotely locking 'struct per_cpu_pages' new per-cpu spinlocks.  This has
> > two advantages, the time to drain is more predictable and other unrelated
> > tasks are not interrupted.
> >
> > This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> > per-cpu lists drain support" -- avoid interference of a high priority task
> > due to a workqueue item draining per-cpu page lists.  While many workloads
> > can tolerate a brief interruption, it may cause a real-time task running
> > on a NOHZ_FULL CPU to miss a deadline and at minimum, the draining is
> > non-deterministic.
> >
> > Currently an IRQ-safe local_lock protects the page allocator per-cpu
> > lists. The local_lock on its own prevents migration and the IRQ disabling
> > protects from corruption due to an interrupt arriving while a page
> > allocation is in progress.
> >
> > This series adjusts the locking.  A spinlock is added to struct
> > per_cpu_pages to protect the list contents while local_lock_irq is
> > ultimately replaced by just the spinlock in the final patch.  This allows
> > a remote CPU to safely drain a remote per-cpu list. Follow-on work should
> > allow the spin_lock_irqsave to be converted to spin_lock to avoid IRQs
> > being disabled/enabled in
> > most cases. The follow-on patch will be one kernel release later as it
> > is relatively high risk and it'll make bisections more clear if there
> > are any problems.
>
> I plan to move this and Mel's fix to [7/7] into mm-stable around July 8.

I've tested it together with the Maple Tree series and it passed a series of stress tests.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v5 00/7] Drain remote per-cpu directly
  2022-07-03 23:31   ` Yu Zhao
@ 2022-07-03 23:35     ` Andrew Morton
  0 siblings, 0 replies; 21+ messages in thread
From: Andrew Morton @ 2022-07-03 23:35 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Mel Gorman, Nicolas Saenz Julienne, Marcelo Tosatti,
	Vlastimil Babka, Michal Hocko, Hugh Dickins, Marek Szyprowski,
	LKML, Linux-MM

On Sun, 3 Jul 2022 17:31:09 -0600 Yu Zhao <yuzhao@google.com> wrote:

> > > This series adjusts the locking.  A spinlock is added to struct
> > > per_cpu_pages to protect the list contents while local_lock_irq is
> > > ultimately replaced by just the spinlock in the final patch.  This allows
> > > a remote CPU to safely drain a remote per-cpu list. Follow-on work should
> > > allow the spin_lock_irqsave to be converted to spin_lock to avoid IRQs
> > > being disabled/enabled in
> > > most cases. The follow-on patch will be one kernel release later as it
> > > is relatively high risk and it'll make bisections more clear if there
> > > are any problems.
> >
> > I plan to move this and Mel's fix to [7/7] into mm-stable around July 8.
> 
> I've tested it together with the Maple Tree series and it passed a series of stress tests.

Cool, thanks.  I added your Tested-by: to everything.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/7] mm/page_alloc: Protect PCP lists with a spinlock
  2022-06-24 12:54 ` [PATCH 5/7] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
@ 2022-07-04 12:31   ` Vlastimil Babka
  2022-07-05  7:20     ` Mel Gorman
  2022-07-04 16:32   ` Nicolas Saenz Julienne
  1 sibling, 1 reply; 21+ messages in thread
From: Vlastimil Babka @ 2022-07-04 12:31 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Michal Hocko,
	Hugh Dickins, Yu Zhao, Marek Szyprowski, LKML, Linux-MM

On 6/24/22 14:54, Mel Gorman wrote:
> Currently the PCP lists are protected by using local_lock_irqsave to
> prevent migration and IRQ reentrancy but this is inconvenient.  Remote
> draining of the lists is impossible and a workqueue is required and every
> task allocation/free must disable then enable interrupts which is
> expensive.
> 
> As preparation for dealing with both of those problems, protect the lists
> with a spinlock.  The IRQ-unsafe version of the lock is used because IRQs
> are already disabled by local_lock_irqsave.  spin_trylock is used in
> preparation for a time when local_lock could be used instead of
> lock_lock_irqsave.

  ^ local_lock_irqsave?

> 
> The per_cpu_pages still fits within the same number of cache lines after
> this patch relative to before the series.
> 
> struct per_cpu_pages {
>         spinlock_t                 lock;                 /*     0     4 */
>         int                        count;                /*     4     4 */
>         int                        high;                 /*     8     4 */
>         int                        batch;                /*    12     4 */
>         short int                  free_factor;          /*    16     2 */
>         short int                  expire;               /*    18     2 */
> 
>         /* XXX 4 bytes hole, try to pack */
> 
>         struct list_head           lists[13];            /*    24   208 */
> 
>         /* size: 256, cachelines: 4, members: 7 */
>         /* sum members: 228, holes: 1, sum holes: 4 */
>         /* padding: 24 */
> } __attribute__((__aligned__(64)));
> 
> There is overhead in the fast path due to acquiring the spinlock even
> though the spinlock is per-cpu and uncontended in the common case.  Page
> Fault Test (PFT) reported the following results on a 1-socket machine.
> 
>                                      5.19.0-rc3               5.19.0-rc3
>                                         vanilla      mm-pcpspinirq-v5r16
> Hmean     faults/sec-1   869275.7381 (   0.00%)   874597.5167 *   0.61%*
> Hmean     faults/sec-3  2370266.6681 (   0.00%)  2379802.0362 *   0.40%*
> Hmean     faults/sec-5  2701099.7019 (   0.00%)  2664889.7003 *  -1.34%*
> Hmean     faults/sec-7  3517170.9157 (   0.00%)  3491122.8242 *  -0.74%*
> Hmean     faults/sec-8  3965729.6187 (   0.00%)  3939727.0243 *  -0.66%*
> 
> There is a small hit in the number of faults per second but given that the
> results are more stable, it's borderline noise.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  include/linux/mmzone.h |   1 +
>  mm/page_alloc.c        | 118 +++++++++++++++++++++++++++++++++--------
>  2 files changed, 98 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4e0352cc2fcb..299259cfe462 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -382,6 +382,7 @@ enum zone_watermarks {
>  
>  /* Fields and list protected by pagesets local_lock in page_alloc.c */
>  struct per_cpu_pages {
> +	spinlock_t lock;	/* Protects lists field */
>  	int count;		/* number of pages in the list */
>  	int high;		/* high watermark, emptying needed */
>  	int batch;		/* chunk size for buddy add/remove */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7fb262eeec2f..3b819c2720f8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -133,6 +133,20 @@ static DEFINE_PER_CPU(struct pagesets, pagesets) = {
>  	.lock = INIT_LOCAL_LOCK(lock),
>  };
>  
> +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
> +/*
> + * On SMP, spin_trylock is sufficient protection.
> + * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
> + */
> +#define pcp_trylock_prepare(flags)	do { } while (0)
> +#define pcp_trylock_finish(flag)	do { } while (0)
> +#else
> +
> +/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */
> +#define pcp_trylock_prepare(flags)	local_irq_save(flags)
> +#define pcp_trylock_finish(flags)	local_irq_restore(flags)
> +#endif
> +
>  #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
>  DEFINE_PER_CPU(int, numa_node);
>  EXPORT_PER_CPU_SYMBOL(numa_node);
> @@ -3097,15 +3111,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>   */
>  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>  {
> -	unsigned long flags;
>  	int to_drain, batch;
>  
> -	local_lock_irqsave(&pagesets.lock, flags);
>  	batch = READ_ONCE(pcp->batch);
>  	to_drain = min(pcp->count, batch);
> -	if (to_drain > 0)
> +	if (to_drain > 0) {
> +		unsigned long flags;
> +
> +		/*
> +		 * free_pcppages_bulk expects IRQs disabled for zone->lock
> +		 * so even though pcp->lock is not intended to be IRQ-safe,
> +		 * it's needed in this context.
> +		 */
> +		spin_lock_irqsave(&pcp->lock, flags);
>  		free_pcppages_bulk(zone, to_drain, pcp, 0);
> -	local_unlock_irqrestore(&pagesets.lock, flags);
> +		spin_unlock_irqrestore(&pcp->lock, flags);
> +	}
>  }
>  #endif
>  
> @@ -3118,16 +3139,17 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>   */
>  static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>  {
> -	unsigned long flags;
>  	struct per_cpu_pages *pcp;
>  
> -	local_lock_irqsave(&pagesets.lock, flags);
> -
>  	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
> -	if (pcp->count)
> -		free_pcppages_bulk(zone, pcp->count, pcp, 0);
> +	if (pcp->count) {
> +		unsigned long flags;
>  
> -	local_unlock_irqrestore(&pagesets.lock, flags);
> +		/* See drain_zone_pages on why this is disabling IRQs */
> +		spin_lock_irqsave(&pcp->lock, flags);
> +		free_pcppages_bulk(zone, pcp->count, pcp, 0);
> +		spin_unlock_irqrestore(&pcp->lock, flags);
> +	}
>  }
>  
>  /*
> @@ -3395,17 +3417,15 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
>  	return min(READ_ONCE(pcp->batch) << 2, high);
>  }
>  
> -static void free_unref_page_commit(struct page *page, int migratetype,
> +static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
> +				   struct page *page, int migratetype,
>  				   unsigned int order)
>  {
> -	struct zone *zone = page_zone(page);
> -	struct per_cpu_pages *pcp;
>  	int high;
>  	int pindex;
>  	bool free_high;
>  
>  	__count_vm_event(PGFREE);
> -	pcp = this_cpu_ptr(zone->per_cpu_pageset);
>  	pindex = order_to_pindex(migratetype, order);
>  	list_add(&page->pcp_list, &pcp->lists[pindex]);
>  	pcp->count += 1 << order;
> @@ -3432,6 +3452,9 @@ static void free_unref_page_commit(struct page *page, int migratetype,
>  void free_unref_page(struct page *page, unsigned int order)
>  {
>  	unsigned long flags;
> +	unsigned long __maybe_unused UP_flags;
> +	struct per_cpu_pages *pcp;
> +	struct zone *zone;
>  	unsigned long pfn = page_to_pfn(page);
>  	int migratetype;
>  
> @@ -3455,7 +3478,16 @@ void free_unref_page(struct page *page, unsigned int order)
>  	}
>  
>  	local_lock_irqsave(&pagesets.lock, flags);
> -	free_unref_page_commit(page, migratetype, order);
> +	zone = page_zone(page);
> +	pcp_trylock_prepare(UP_flags);
> +	pcp = this_cpu_ptr(zone->per_cpu_pageset);
> +	if (spin_trylock(&pcp->lock)) {
> +		free_unref_page_commit(zone, pcp, page, migratetype, order);
> +		spin_unlock(&pcp->lock);
> +	} else {
> +		free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
> +	}
> +	pcp_trylock_finish(UP_flags);
>  	local_unlock_irqrestore(&pagesets.lock, flags);
>  }
>  
> @@ -3465,6 +3497,8 @@ void free_unref_page(struct page *page, unsigned int order)
>  void free_unref_page_list(struct list_head *list)
>  {
>  	struct page *page, *next;
> +	struct per_cpu_pages *pcp = NULL;
> +	struct zone *locked_zone = NULL;
>  	unsigned long flags;
>  	int batch_count = 0;
>  	int migratetype;
> @@ -3491,6 +3525,17 @@ void free_unref_page_list(struct list_head *list)
>  
>  	local_lock_irqsave(&pagesets.lock, flags);
>  	list_for_each_entry_safe(page, next, list, lru) {
> +		struct zone *zone = page_zone(page);
> +
> +		/* Different zone, different pcp lock. */
> +		if (zone != locked_zone) {
> +			if (pcp)
> +				spin_unlock(&pcp->lock);
> +			locked_zone = zone;
> +			pcp = this_cpu_ptr(zone->per_cpu_pageset);
> +			spin_lock(&pcp->lock);
> +		}
> +
>  		/*
>  		 * Non-isolated types over MIGRATE_PCPTYPES get added
>  		 * to the MIGRATE_MOVABLE pcp list.
> @@ -3500,18 +3545,24 @@ void free_unref_page_list(struct list_head *list)
>  			migratetype = MIGRATE_MOVABLE;
>  
>  		trace_mm_page_free_batched(page);
> -		free_unref_page_commit(page, migratetype, 0);
> +		free_unref_page_commit(zone, pcp, page, migratetype, 0);
>  
>  		/*
>  		 * Guard against excessive IRQ disabled times when we get
>  		 * a large list of pages to free.
>  		 */
>  		if (++batch_count == SWAP_CLUSTER_MAX) {
> +			spin_unlock(&pcp->lock);
>  			local_unlock_irqrestore(&pagesets.lock, flags);
>  			batch_count = 0;
>  			local_lock_irqsave(&pagesets.lock, flags);
> +			pcp = this_cpu_ptr(locked_zone->per_cpu_pageset);
> +			spin_lock(&pcp->lock);
>  		}
>  	}
> +
> +	if (pcp)
> +		spin_unlock(&pcp->lock);
>  	local_unlock_irqrestore(&pagesets.lock, flags);
>  }
>  
> @@ -3725,18 +3776,31 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>  	struct list_head *list;
>  	struct page *page;
>  	unsigned long flags;
> +	unsigned long __maybe_unused UP_flags;
>  
>  	local_lock_irqsave(&pagesets.lock, flags);
>  
> +	/*
> +	 * spin_trylock may fail due to a parallel drain. In the future, the
> +	 * trylock will also protect against IRQ reentrancy.
> +	 */
> +	pcp = this_cpu_ptr(zone->per_cpu_pageset);
> +	pcp_trylock_prepare(UP_flags);
> +	if (!spin_trylock(&pcp->lock)) {
> +		pcp_trylock_finish(UP_flags);
> +		return NULL;
> +	}
> +
>  	/*
>  	 * On allocation, reduce the number of pages that are batch freed.
>  	 * See nr_pcp_free() where free_factor is increased for subsequent
>  	 * frees.
>  	 */
> -	pcp = this_cpu_ptr(zone->per_cpu_pageset);
>  	pcp->free_factor >>= 1;
>  	list = &pcp->lists[order_to_pindex(migratetype, order)];
>  	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
> +	spin_unlock(&pcp->lock);
> +	pcp_trylock_finish(UP_flags);
>  	local_unlock_irqrestore(&pagesets.lock, flags);
>  	if (page) {
>  		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
> @@ -3771,7 +3835,8 @@ struct page *rmqueue(struct zone *preferred_zone,
>  				migratetype != MIGRATE_MOVABLE) {
>  			page = rmqueue_pcplist(preferred_zone, zone, order,
>  					gfp_flags, migratetype, alloc_flags);
> -			goto out;
> +			if (likely(page))
> +				goto out;
>  		}
>  	}
>  
> @@ -5259,6 +5324,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  {
>  	struct page *page;
>  	unsigned long flags;
> +	unsigned long __maybe_unused UP_flags;
>  	struct zone *zone;
>  	struct zoneref *z;
>  	struct per_cpu_pages *pcp;
> @@ -5339,11 +5405,15 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  	if (unlikely(!zone))
>  		goto failed;
>  
> -	/* Attempt the batch allocation */
> +	/* Is a parallel drain in progress? */
>  	local_lock_irqsave(&pagesets.lock, flags);
> +	pcp_trylock_prepare(UP_flags);
>  	pcp = this_cpu_ptr(zone->per_cpu_pageset);
> -	pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
> +	if (!spin_trylock(&pcp->lock))
> +		goto failed_irq;
>  
> +	/* Attempt the batch allocation */
> +	pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
>  	while (nr_populated < nr_pages) {
>  
>  		/* Skip existing pages */
> @@ -5356,8 +5426,10 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  								pcp, pcp_list);
>  		if (unlikely(!page)) {
>  			/* Try and allocate at least one page */
> -			if (!nr_account)
> +			if (!nr_account) {
> +				spin_unlock(&pcp->lock);
>  				goto failed_irq;
> +			}
>  			break;
>  		}
>  		nr_account++;
> @@ -5370,6 +5442,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  		nr_populated++;
>  	}
>  
> +	spin_unlock(&pcp->lock);
> +	pcp_trylock_finish(UP_flags);
>  	local_unlock_irqrestore(&pagesets.lock, flags);
>  
>  	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
> @@ -5379,6 +5453,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  	return nr_populated;
>  
>  failed_irq:
> +	pcp_trylock_finish(UP_flags);
>  	local_unlock_irqrestore(&pagesets.lock, flags);
>  
>  failed:
> @@ -7019,6 +7094,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
>  	memset(pcp, 0, sizeof(*pcp));
>  	memset(pzstats, 0, sizeof(*pzstats));
>  
> +	spin_lock_init(&pcp->lock);
>  	for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
>  		INIT_LIST_HEAD(&pcp->lists[pindex]);
>  
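
As an aside for readers, the scattered hunks above combine into the following
free-path shape. This is a sketch distilled from the free_unref_page() hunk,
not an additional hunk, and the function name pcp_free_sketch() is purely
illustrative:

/*
 * Distilled sketch of the free-path locking in the hunks above.
 * pcp_trylock_prepare() and pcp_trylock_finish() are no-ops on
 * SMP/PREEMPT_RT; on UP, spin_trylock() always succeeds, so they
 * disable IRQs to prevent an interrupt re-entering the PCP lists.
 */
static void pcp_free_sketch(struct zone *zone, struct page *page,
			    unsigned long pfn, int migratetype,
			    unsigned int order)
{
	unsigned long flags;
	unsigned long __maybe_unused UP_flags;
	struct per_cpu_pages *pcp;

	/* Pin the CPU and disable IRQs as the existing fast path does. */
	local_lock_irqsave(&pagesets.lock, flags);

	pcp_trylock_prepare(UP_flags);
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	if (spin_trylock(&pcp->lock)) {
		/* The list is available; free to the per-cpu list. */
		free_unref_page_commit(zone, pcp, page, migratetype, order);
		spin_unlock(&pcp->lock);
	} else {
		/* pcp->lock is held (e.g. by a drain); bypass the PCP. */
		free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
	}
	pcp_trylock_finish(UP_flags);

	local_unlock_irqrestore(&pagesets.lock, flags);
}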


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/7] mm/page_alloc: Remotely drain per-cpu lists
  2022-06-24 12:54 ` [PATCH 6/7] mm/page_alloc: Remotely drain per-cpu lists Mel Gorman
@ 2022-07-04 14:28   ` Vlastimil Babka
  0 siblings, 0 replies; 21+ messages in thread
From: Vlastimil Babka @ 2022-07-04 14:28 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Michal Hocko,
	Hugh Dickins, Yu Zhao, Marek Szyprowski, LKML, Linux-MM

On 6/24/22 14:54, Mel Gorman wrote:
> From: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> 
> Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
> drain work queued by __drain_all_pages().  So introduce a new mechanism to
> remotely drain the per-cpu lists.  It is made possible by remotely locking
> 'struct per_cpu_pages' new per-cpu spinlocks.  A benefit of this new
> scheme is that drain operations are now migration safe.
> 
> There was no observed performance degradation vs.  the previous scheme.
> Both netperf and hackbench were run in parallel while triggering the
> __drain_all_pages(NULL, true) code path around 100 times per second.  The
> new scheme performs a bit better (~5%), although the important point here
> is there are no performance regressions vs.  the previous mechanism.
> Per-cpu lists draining happens only in slow paths.
> 
> Minchan Kim tested an earlier version and reported;
> 
> 	My workload is not NOHZ CPUs but run apps under heavy memory
> 	pressure so they goes to direct reclaim and be stuck on
> 	drain_all_pages until work on workqueue run.
> 
> 	unit: nanosecond
> 	max(dur)        avg(dur)                count(dur)
> 	166713013       487511.77786438033      1283
> 
> 	From traces, system encountered the drain_all_pages 1283 times and
> 	worst case was 166ms and avg was 487us.
> 
> 	The other problem was alloc_contig_range in CMA. The PCP draining
> 	takes several hundred millisecond sometimes though there is no
> 	memory pressure or a few of pages to be migrated out but CPU were
> 	fully booked.
> 
> 	Your patch perfectly removed those wasted time.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
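
For context, the remote drain described above amounts to taking the per-cpu
spinlock added earlier in the series from another CPU, roughly as in this
sketch. It mirrors the drain_pages_zone() shape from patch 5 rather than
quoting this patch, and the name drain_remote_pcp() is illustrative:

/*
 * Drain CPU @cpu's PCP lists for @zone from any CPU.  IRQs are disabled
 * around free_pcppages_bulk() because it takes zone->lock.
 */
static void drain_remote_pcp(unsigned int cpu, struct zone *zone)
{
	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);

	if (pcp->count) {
		unsigned long flags;

		spin_lock_irqsave(&pcp->lock, flags);
		free_pcppages_bulk(zone, pcp->count, pcp, 0);
		spin_unlock_irqrestore(&pcp->lock, flags);
	}
}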

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 7/7] mm/page_alloc: Replace local_lock with normal spinlock
  2022-06-24 12:54 ` [PATCH 7/7] mm/page_alloc: Replace local_lock with normal spinlock Mel Gorman
  2022-06-24 18:59   ` Yu Zhao
@ 2022-07-04 14:39   ` Vlastimil Babka
  2022-07-04 16:33   ` Nicolas Saenz Julienne
  2 siblings, 0 replies; 21+ messages in thread
From: Vlastimil Babka @ 2022-07-04 14:39 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Michal Hocko,
	Hugh Dickins, Yu Zhao, Marek Szyprowski, LKML, Linux-MM

On 6/24/22 14:54, Mel Gorman wrote:
> struct per_cpu_pages is no longer strictly local as PCP lists can be
> drained remotely using a lock for protection. While the use of local_lock
> works, it goes against the intent of local_lock which is for "pure
> CPU local concurrency control mechanisms and not suited for inter-CPU
> concurrency control" (Documentation/locking/locktypes.rst)
> 
> local_lock protects against migration between when the percpu pointer is
> accessed and the pcp->lock acquired. The lock acquisition is a preemption
> point so in the worst case, a task could migrate to another NUMA node
> and accidentally allocate remote memory. The main requirement is to pin
> the task to a CPU that is suitable for PREEMPT_RT and !PREEMPT_RT.
> 
> Replace local_lock with helpers that pin a task to a CPU, lookup the
> per-cpu structure and acquire the embedded lock. It's similar to local_lock
> without breaking the intent behind the API. It is not a complete API
> as only the parts needed for PCP-alloc are implemented but in theory,
> the generic helpers could be promoted to a general API if there was
> demand for an embedded lock within a per-cpu struct with a guarantee
> that the per-cpu structure locked matches the running CPU and cannot use
> get_cpu_var due to RT concerns. PCP requires these semantics to avoid
> accidentally allocating remote memory.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

-fix looks OK too
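
For readers without the patch body to hand, the helpers the changelog
describes would look roughly like the sketch below: pin the task, look up the
running CPU's per_cpu_pages, then take the embedded lock. The names and the
exact pinning primitives (migrate_disable() on PREEMPT_RT, preempt_disable()
otherwise) are assumptions for illustration, not quotes from the patch:

#ifdef CONFIG_PREEMPT_RT
#define pcpu_task_pin()		migrate_disable()
#define pcpu_task_unpin()	migrate_enable()
#else
#define pcpu_task_pin()		preempt_disable()
#define pcpu_task_unpin()	preempt_enable()
#endif

/* Pin, look up the local per_cpu_pages and take its embedded lock. */
static struct per_cpu_pages *pcp_spin_lock_sketch(struct zone *zone)
{
	struct per_cpu_pages *pcp;

	pcpu_task_pin();			   /* no migration past this point */
	pcp = this_cpu_ptr(zone->per_cpu_pageset); /* matches the running CPU */
	spin_lock(&pcp->lock);			   /* same lock a remote drain takes */
	return pcp;
}

static void pcp_spin_unlock_sketch(struct per_cpu_pages *pcp)
{
	spin_unlock(&pcp->lock);
	pcpu_task_unpin();
}

Callers would bracket PCP list manipulation with this pair instead of
local_lock_irqsave()/local_unlock_irqrestore().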

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/7] mm/page_alloc: Protect PCP lists with a spinlock
  2022-06-24 12:54 ` [PATCH 5/7] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
  2022-07-04 12:31   ` Vlastimil Babka
@ 2022-07-04 16:32   ` Nicolas Saenz Julienne
  1 sibling, 0 replies; 21+ messages in thread
From: Nicolas Saenz Julienne @ 2022-07-04 16:32 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, Hugh Dickins,
	Yu Zhao, Marek Szyprowski, LKML, Linux-MM

On Fri, 2022-06-24 at 13:54 +0100, Mel Gorman wrote:
> Currently the PCP lists are protected by using local_lock_irqsave to
> prevent migration and IRQ reentrancy but this is inconvenient.  Remote
> draining of the lists is impossible and a workqueue is required and every
> task allocation/free must disable then enable interrupts which is
> expensive.
> 
> As preparation for dealing with both of those problems, protect the lists
> with a spinlock.  The IRQ-unsafe version of the lock is used because IRQs
> are already disabled by local_lock_irqsave.  spin_trylock is used in
> preparation for a time when local_lock could be used instead of
> lock_lock_irqsave.
> 
> The per_cpu_pages still fits within the same number of cache lines after
> this patch relative to before the series.
> 
> struct per_cpu_pages {
>         spinlock_t                 lock;                 /*     0     4 */
>         int                        count;                /*     4     4 */
>         int                        high;                 /*     8     4 */
>         int                        batch;                /*    12     4 */
>         short int                  free_factor;          /*    16     2 */
>         short int                  expire;               /*    18     2 */
> 
>         /* XXX 4 bytes hole, try to pack */
> 
>         struct list_head           lists[13];            /*    24   208 */
> 
>         /* size: 256, cachelines: 4, members: 7 */
>         /* sum members: 228, holes: 1, sum holes: 4 */
>         /* padding: 24 */
> } __attribute__((__aligned__(64)));
> 
> There is overhead in the fast path due to acquiring the spinlock even
> though the spinlock is per-cpu and uncontended in the common case.  Page
> Fault Test (PFT) reported the following results on a 1-socket machine.
> 
>                                      5.19.0-rc3               5.19.0-rc3
>                                         vanilla      mm-pcpspinirq-v5r16
> Hmean     faults/sec-1   869275.7381 (   0.00%)   874597.5167 *   0.61%*
> Hmean     faults/sec-3  2370266.6681 (   0.00%)  2379802.0362 *   0.40%*
> Hmean     faults/sec-5  2701099.7019 (   0.00%)  2664889.7003 *  -1.34%*
> Hmean     faults/sec-7  3517170.9157 (   0.00%)  3491122.8242 *  -0.74%*
> Hmean     faults/sec-8  3965729.6187 (   0.00%)  3939727.0243 *  -0.66%*
> 
> There is a small hit in the number of faults per second but given that the
> results are more stable, it's borderline noise.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---

Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>

Thanks!

-- 
Nicolás Sáenz


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 7/7] mm/page_alloc: Replace local_lock with normal spinlock
  2022-06-24 12:54 ` [PATCH 7/7] mm/page_alloc: Replace local_lock with normal spinlock Mel Gorman
  2022-06-24 18:59   ` Yu Zhao
  2022-07-04 14:39   ` [PATCH 7/7] mm/page_alloc: Replace local_lock with normal spinlock Vlastimil Babka
@ 2022-07-04 16:33   ` Nicolas Saenz Julienne
  2 siblings, 0 replies; 21+ messages in thread
From: Nicolas Saenz Julienne @ 2022-07-04 16:33 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, Hugh Dickins,
	Yu Zhao, Marek Szyprowski, LKML, Linux-MM

On Fri, 2022-06-24 at 13:54 +0100, Mel Gorman wrote:
> struct per_cpu_pages is no longer strictly local as PCP lists can be
> drained remotely using a lock for protection. While the use of local_lock
> works, it goes against the intent of local_lock which is for "pure
> CPU local concurrency control mechanisms and not suited for inter-CPU
> concurrency control" (Documentation/locking/locktypes.rst)
> 
> local_lock protects against migration between when the percpu pointer is
> accessed and the pcp->lock acquired. The lock acquisition is a preemption
> point so in the worst case, a task could migrate to another NUMA node
> and accidentally allocate remote memory. The main requirement is to pin
> the task to a CPU that is suitable for PREEMPT_RT and !PREEMPT_RT.
> 
> Replace local_lock with helpers that pin a task to a CPU, lookup the
> per-cpu structure and acquire the embedded lock. It's similar to local_lock
> without breaking the intent behind the API. It is not a complete API
> as only the parts needed for PCP-alloc are implemented but in theory,
> the generic helpers could be promoted to a general API if there was
> demand for an embedded lock within a per-cpu struct with a guarantee
> that the per-cpu structure locked matches the running CPU and cannot use
> get_cpu_var due to RT concerns. PCP requires these semantics to avoid
> accidentally allocating remote memory.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---

Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>

Thanks!

-- 
Nicolás Sáenz


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/7] mm/page_alloc: Protect PCP lists with a spinlock
  2022-07-04 12:31   ` Vlastimil Babka
@ 2022-07-05  7:20     ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2022-07-05  7:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Nicolas Saenz Julienne, Marcelo Tosatti,
	Michal Hocko, Hugh Dickins, Yu Zhao, Marek Szyprowski, LKML,
	Linux-MM

On Mon, Jul 04, 2022 at 02:31:17PM +0200, Vlastimil Babka wrote:
> On 6/24/22 14:54, Mel Gorman wrote:
> > Currently the PCP lists are protected by using local_lock_irqsave to
> > prevent migration and IRQ reentrancy but this is inconvenient.  Remote
> > draining of the lists is impossible and a workqueue is required and every
> > task allocation/free must disable then enable interrupts which is
> > expensive.
> > 
> > As preparation for dealing with both of those problems, protect the lists
> > with a spinlock.  The IRQ-unsafe version of the lock is used because IRQs
> > are already disabled by local_lock_irqsave.  spin_trylock is used in
> > preparation for a time when local_lock could be used instead of
> > lock_lock_irqsave.
> 
>   ^ local_lock_irqsave?
> 

More appropriate given how the series evolved would be;

spin_trylock is used in combination with local_lock_irqsave() but later
will be replaced with a spin_trylock_irqsave when the local_lock is
removed.
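
In code terms, the later form might look like the sketch below. This is an
assumption about the follow-on series rather than anything in this one, and
the function name is illustrative:

/* Later: no pagesets local_lock; the trylock itself disables IRQs. */
static bool pcp_free_later_sketch(struct zone *zone, struct page *page,
				  int migratetype, unsigned int order)
{
	struct per_cpu_pages *pcp;
	unsigned long flags;

	migrate_disable();	/* keep the per-cpu lookup stable */
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	if (!spin_trylock_irqsave(&pcp->lock, flags)) {
		migrate_enable();
		return false;	/* caller falls back to freeing to the buddy */
	}

	free_unref_page_commit(zone, pcp, page, migratetype, order);
	spin_unlock_irqrestore(&pcp->lock, flags);
	migrate_enable();
	return true;
}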

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/7] mm/page_alloc: Remotely drain per-cpu lists
  2022-06-13 12:56 ` [PATCH 6/7] mm/page_alloc: Remotely drain per-cpu lists Mel Gorman
@ 2022-06-16 16:41   ` Vlastimil Babka
  0 siblings, 0 replies; 21+ messages in thread
From: Vlastimil Babka @ 2022-06-16 16:41 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Michal Hocko,
	Hugh Dickins, LKML, Linux-MM

On 6/13/22 14:56, Mel Gorman wrote:
> From: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> 
> Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
> drain work queued by __drain_all_pages().  So introduce a new mechanism to
> remotely drain the per-cpu lists.  It is made possible by remotely locking
> 'struct per_cpu_pages' new per-cpu spinlocks.  A benefit of this new
> scheme is that drain operations are now migration safe.
> 
> There was no observed performance degradation vs.  the previous scheme.
> Both netperf and hackbench were run in parallel while triggering the
> __drain_all_pages(NULL, true) code path around 100 times per second.  The
> new scheme performs a bit better (~5%), although the important point here
> is there are no performance regressions vs.  the previous mechanism.
> Per-cpu lists draining happens only in slow paths.
> 
> Minchan Kim tested this independently and reported;
> 
> 	My workload is not NOHZ CPUs but run apps under heavy memory
> 	pressure so they goes to direct reclaim and be stuck on
> 	drain_all_pages until work on workqueue run.
> 
> 	unit: nanosecond
> 	max(dur)        avg(dur)                count(dur)
> 	166713013       487511.77786438033      1283
> 
> 	From traces, system encountered the drain_all_pages 1283 times and
> 	worst case was 166ms and avg was 487us.
> 
> 	The other problem was alloc_contig_range in CMA. The PCP draining
> 	takes several hundred millisecond sometimes though there is no
> 	memory pressure or a few of pages to be migrated out but CPU were
> 	fully booked.
> 
> 	Your patch perfectly removed those wasted time.
> 
> Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Tested-by: Minchan Kim <minchan@kernel.org>
> Acked-by: Minchan Kim <minchan@kernel.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 6/7] mm/page_alloc: Remotely drain per-cpu lists
  2022-06-13 12:56 [PATCH v4 " Mel Gorman
@ 2022-06-13 12:56 ` Mel Gorman
  2022-06-16 16:41   ` Vlastimil Babka
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2022-06-13 12:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, Hugh Dickins, LKML, Linux-MM, Mel Gorman

From: Nicolas Saenz Julienne <nsaenzju@redhat.com>

Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
drain work queued by __drain_all_pages().  So introduce a new mechanism to
remotely drain the per-cpu lists.  It is made possible by remotely locking
'struct per_cpu_pages' new per-cpu spinlocks.  A benefit of this new
scheme is that drain operations are now migration safe.

There was no observed performance degradation vs.  the previous scheme.
Both netperf and hackbench were run in parallel while triggering the
__drain_all_pages(NULL, true) code path around 100 times per second.  The
new scheme performs a bit better (~5%), although the important point here
is there are no performance regressions vs.  the previous mechanism.
Per-cpu lists draining happens only in slow paths.

Minchan Kim tested this independently and reported;

	My workload is not NOHZ CPUs but run apps under heavy memory
	pressure so they goes to direct reclaim and be stuck on
	drain_all_pages until work on workqueue run.

	unit: nanosecond
	max(dur)        avg(dur)                count(dur)
	166713013       487511.77786438033      1283

	From traces, system encountered the drain_all_pages 1283 times and
	worst case was 166ms and avg was 487us.

	The other problem was alloc_contig_range in CMA. The PCP draining
	takes several hundred millisecond sometimes though there is no
	memory pressure or a few of pages to be migrated out but CPU were
	fully booked.

	Your patch perfectly removed those wasted time.

Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
---
 mm/page_alloc.c | 58 ++++---------------------------------------------
 1 file changed, 4 insertions(+), 54 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 533fc5527582..03882ce7765f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -165,13 +165,7 @@ DEFINE_PER_CPU(int, _numa_mem_);		/* Kernel "local memory" node */
 EXPORT_PER_CPU_SYMBOL(_numa_mem_);
 #endif
 
-/* work_structs for global per-cpu drains */
-struct pcpu_drain {
-	struct zone *zone;
-	struct work_struct work;
-};
 static DEFINE_MUTEX(pcpu_drain_mutex);
-static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);
 
 #ifdef CONFIG_GCC_PLUGIN_LATENT_ENTROPY
 volatile unsigned long latent_entropy __latent_entropy;
@@ -3105,9 +3099,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
  * Called from the vmstat counter updater to drain pagesets of this
  * currently executing processor on remote nodes after they have
  * expired.
- *
- * Note that this function must be called with the thread pinned to
- * a single processor.
  */
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 {
@@ -3132,10 +3123,6 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 
 /*
  * Drain pcplists of the indicated processor and zone.
- *
- * The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
  */
 static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 {
@@ -3154,10 +3141,6 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 
 /*
  * Drain pcplists of all zones on the indicated processor.
- *
- * The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
  */
 static void drain_pages(unsigned int cpu)
 {
@@ -3170,9 +3153,6 @@ static void drain_pages(unsigned int cpu)
 
 /*
  * Spill all of this CPU's per-cpu pages back into the buddy allocator.
- *
- * The CPU has to be pinned. When zone parameter is non-NULL, spill just
- * the single zone's pages.
  */
 void drain_local_pages(struct zone *zone)
 {
@@ -3184,24 +3164,6 @@ void drain_local_pages(struct zone *zone)
 		drain_pages(cpu);
 }
 
-static void drain_local_pages_wq(struct work_struct *work)
-{
-	struct pcpu_drain *drain;
-
-	drain = container_of(work, struct pcpu_drain, work);
-
-	/*
-	 * drain_all_pages doesn't use proper cpu hotplug protection so
-	 * we can race with cpu offline when the WQ can move this from
-	 * a cpu pinned worker to an unbound one. We can operate on a different
-	 * cpu which is alright but we also have to make sure to not move to
-	 * a different one.
-	 */
-	migrate_disable();
-	drain_local_pages(drain->zone);
-	migrate_enable();
-}
-
 /*
  * The implementation of drain_all_pages(), exposing an extra parameter to
  * drain on all cpus.
@@ -3222,13 +3184,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 	 */
 	static cpumask_t cpus_with_pcps;
 
-	/*
-	 * Make sure nobody triggers this path before mm_percpu_wq is fully
-	 * initialized.
-	 */
-	if (WARN_ON_ONCE(!mm_percpu_wq))
-		return;
-
 	/*
 	 * Do not drain if one is already in progress unless it's specific to
 	 * a zone. Such callers are primarily CMA and memory hotplug and need
@@ -3278,14 +3233,11 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 	}
 
 	for_each_cpu(cpu, &cpus_with_pcps) {
-		struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);
-
-		drain->zone = zone;
-		INIT_WORK(&drain->work, drain_local_pages_wq);
-		queue_work_on(cpu, mm_percpu_wq, &drain->work);
+		if (zone)
+			drain_pages_zone(cpu, zone);
+		else
+			drain_pages(cpu);
 	}
-	for_each_cpu(cpu, &cpus_with_pcps)
-		flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);
 
 	mutex_unlock(&pcpu_drain_mutex);
 }
@@ -3294,8 +3246,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
  * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
  *
  * When zone parameter is non-NULL, spill just the single zone's pages.
- *
- * Note that this can be extremely slow as the draining happens in a workqueue.
  */
 void drain_all_pages(struct zone *zone)
 {
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-07-05  7:20 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-24 12:54 [PATCH v5 00/7] Drain remote per-cpu directly Mel Gorman
2022-06-24 12:54 ` [PATCH 1/7] mm/page_alloc: Add page->buddy_list and page->pcp_list Mel Gorman
2022-06-24 12:54 ` [PATCH 2/7] mm/page_alloc: Use only one PCP list for THP-sized allocations Mel Gorman
2022-06-24 12:54 ` [PATCH 3/7] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper Mel Gorman
2022-06-24 12:54 ` [PATCH 4/7] mm/page_alloc: Remove mistaken page == NULL check in rmqueue Mel Gorman
2022-06-24 12:54 ` [PATCH 5/7] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
2022-07-04 12:31   ` Vlastimil Babka
2022-07-05  7:20     ` Mel Gorman
2022-07-04 16:32   ` Nicolas Saenz Julienne
2022-06-24 12:54 ` [PATCH 6/7] mm/page_alloc: Remotely drain per-cpu lists Mel Gorman
2022-07-04 14:28   ` Vlastimil Babka
2022-06-24 12:54 ` [PATCH 7/7] mm/page_alloc: Replace local_lock with normal spinlock Mel Gorman
2022-06-24 18:59   ` Yu Zhao
2022-06-27  8:46     ` [PATCH] mm/page_alloc: Replace local_lock with normal spinlock -fix Mel Gorman
2022-07-04 14:39   ` [PATCH 7/7] mm/page_alloc: Replace local_lock with normal spinlock Vlastimil Babka
2022-07-04 16:33   ` Nicolas Saenz Julienne
2022-07-03 23:28 ` [PATCH v5 00/7] Drain remote per-cpu directly Andrew Morton
2022-07-03 23:31   ` Yu Zhao
2022-07-03 23:35     ` Andrew Morton
  -- strict thread matches above, loose matches on Subject: below --
2022-06-13 12:56 [PATCH v4 " Mel Gorman
2022-06-13 12:56 ` [PATCH 6/7] mm/page_alloc: Remotely drain per-cpu lists Mel Gorman
2022-06-16 16:41   ` Vlastimil Babka
