* [RFC PATCH 0/6] Drain remote per-cpu directly
@ 2022-04-20  9:59 Mel Gorman
  2022-04-20  9:59 ` [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list Mel Gorman
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Mel Gorman @ 2022-04-20  9:59 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM,
	Mel Gorman

This series has the same intent as Nicolas' series "mm/page_alloc: Remote
per-cpu lists drain support" -- avoid interference of a high priority
task due to a workqueue item draining per-cpu page lists. While many
workloads can tolerate a brief interruption, it may cause a real-time
task running on a NOHZ_FULL CPU to miss a deadline and, at a minimum,
the draining is non-deterministic.

Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
The local_lock on its own prevents migration and the IRQ disabling protects
from corruption due to an interrupt arriving while a page allocation is
in progress. The locking is inherently unsafe for remote access unless
the CPU is hot-removed.

This series adjusts the locking. A spin-lock is added to struct
per_cpu_pages to protect the list contents while local_lock_irq continues
to prevent migration and IRQ reentry. This allows a remote CPU to safely
drain a remote per-cpu list.
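
As a rough sketch of the access pattern the series moves to (condensed
and simplified from patches 5 and 6; names match the patches but error
handling and the trylock fallback are omitted):

	/* Local allocation/free: block migration and IRQ reentry, then lock the list */
	local_lock_irqsave(&pagesets.lock, flags);
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	spin_lock(&pcp->lock);
	/* ... add to or remove from pcp->lists ... */
	spin_unlock(&pcp->lock);
	local_unlock_irqrestore(&pagesets.lock, flags);

	/* Remote drain: only the list lock is needed, IRQs disabled for zone->lock */
	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
	local_irq_save(flags);
	spin_lock(&pcp->lock);
	free_pcppages_bulk(zone, pcp->count, pcp, 0);
	spin_unlock(&pcp->lock);
	local_irq_restore(flags);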

This series is a partial series. Follow-on work would allow the
local_lock_irqsave to be converted to a plain local_lock to avoid IRQs
being disabled/enabled in most cases. However, there are enough corner
cases that it deserves a series of its own, separated by one kernel
release, and the priority right now is to avoid interference with high
priority tasks.

Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
	and when it is storing per-cpu pages.

Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
	this is not necessary but it avoids per_cpu_pages consuming another
	cache line.

Patch 3 is a preparation patch to avoid code duplication.

Patch 4 is a simple micro-optimisation that improves code flow necessary for
	a later patch to avoid code duplication.

Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
	relying on local_lock to prevent migration, stabilise the pcp
	lookup and prevent IRQ reentrancy.

Patch 6 remote drains per-cpu pages directly instead of using a workqueue.

 include/linux/mm_types.h |   5 +
 include/linux/mmzone.h   |  12 +-
 mm/page_alloc.c          | 333 ++++++++++++++++++++++++---------------
 3 files changed, 222 insertions(+), 128 deletions(-)

-- 
2.34.1



* [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list
  2022-04-20  9:59 [RFC PATCH 0/6] Drain remote per-cpu directly Mel Gorman
@ 2022-04-20  9:59 ` Mel Gorman
  2022-04-20 20:43   ` Matthew Wilcox
  2022-04-20  9:59 ` [PATCH 2/6] mm/page_alloc: Use only one PCP list for THP-sized allocations Mel Gorman
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2022-04-20  9:59 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM,
	Mel Gorman

The page allocator uses page->lru for storing pages on either buddy or
PCP lists. Create page->buddy_list and page->pcp_list as a union with
page->lru. This is simply to clarify what type of list a page is on
in the page allocator.

No functional change intended.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mm_types.h |  5 +++++
 mm/page_alloc.c          | 18 +++++++++---------
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8834e38c06a4..a2782e8af307 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -87,6 +87,7 @@ struct page {
 			 */
 			union {
 				struct list_head lru;
+
 				/* Or, for the Unevictable "LRU list" slot */
 				struct {
 					/* Always even, to negate PageTail */
@@ -94,6 +95,10 @@ struct page {
 					/* Count page's or folio's mlocks */
 					unsigned int mlock_count;
 				};
+
+				/* Or, free page */
+				struct list_head buddy_list;
+				struct list_head pcp_list;
 			};
 			/* See page-flags.h for PAGE_MAPPING_FLAGS */
 			struct address_space *mapping;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2db95780e003..63976ad4b7f1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -780,7 +780,7 @@ static inline bool set_page_guard(struct zone *zone, struct page *page,
 		return false;
 
 	__SetPageGuard(page);
-	INIT_LIST_HEAD(&page->lru);
+	INIT_LIST_HEAD(&page->buddy_list);
 	set_page_private(page, order);
 	/* Guard pages are not available for any usage */
 	__mod_zone_freepage_state(zone, -(1 << order), migratetype);
@@ -957,7 +957,7 @@ static inline void add_to_free_list(struct page *page, struct zone *zone,
 {
 	struct free_area *area = &zone->free_area[order];
 
-	list_add(&page->lru, &area->free_list[migratetype]);
+	list_add(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
 }
 
@@ -967,7 +967,7 @@ static inline void add_to_free_list_tail(struct page *page, struct zone *zone,
 {
 	struct free_area *area = &zone->free_area[order];
 
-	list_add_tail(&page->lru, &area->free_list[migratetype]);
+	list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
 }
 
@@ -981,7 +981,7 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 {
 	struct free_area *area = &zone->free_area[order];
 
-	list_move_tail(&page->lru, &area->free_list[migratetype]);
+	list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
 }
 
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
@@ -991,7 +991,7 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 	if (page_reported(page))
 		__ClearPageReported(page);
 
-	list_del(&page->lru);
+	list_del(&page->buddy_list);
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 	zone->free_area[order].nr_free--;
@@ -1493,7 +1493,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			mt = get_pcppage_migratetype(page);
 
 			/* must delete to avoid corrupting pcp list */
-			list_del(&page->lru);
+			list_del(&page->pcp_list);
 			count -= nr_pages;
 			pcp->count -= nr_pages;
 
@@ -3053,7 +3053,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * for IO devices that can merge IO requests if the physical
 		 * pages are ordered properly.
 		 */
-		list_add_tail(&page->lru, list);
+		list_add_tail(&page->pcp_list, list);
 		allocated++;
 		if (is_migrate_cma(get_pcppage_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
@@ -3392,7 +3392,7 @@ static void free_unref_page_commit(struct page *page, int migratetype,
 	__count_vm_event(PGFREE);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pindex = order_to_pindex(migratetype, order);
-	list_add(&page->lru, &pcp->lists[pindex]);
+	list_add(&page->pcp_list, &pcp->lists[pindex]);
 	pcp->count += 1 << order;
 
 	/*
@@ -3656,7 +3656,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 		}
 
 		page = list_first_entry(list, struct page, lru);
-		list_del(&page->lru);
+		list_del(&page->pcp_list);
 		pcp->count -= 1 << order;
 	} while (check_new_pcp(page, order));
 
-- 
2.34.1



* [PATCH 2/6] mm/page_alloc: Use only one PCP list for THP-sized allocations
  2022-04-20  9:59 [RFC PATCH 0/6] Drain remote per-cpu directly Mel Gorman
  2022-04-20  9:59 ` [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list Mel Gorman
@ 2022-04-20  9:59 ` Mel Gorman
  2022-04-20  9:59 ` [PATCH 3/6] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper Mel Gorman
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2022-04-20  9:59 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM,
	Mel Gorman

The per_cpu_pages is cache-aligned on a standard x86-64 distribution
configuration but a later patch will add a new field which would push
the structure into the next cache line. Use only one list to store
THP-sized pages on the per-cpu lists. This assumes that the vast majority
of THP-sized allocations are GFP_MOVABLE but, even if they were another type,
they would not contribute to the sort of serious fragmentation that
potentially causes a later THP allocation failure. Align per_cpu_pages on
the cacheline boundary to ensure there is no false cacheline sharing.

After this patch, the structure sizing is:

struct per_cpu_pages {
        int                        count;                /*     0     4 */
        int                        high;                 /*     4     4 */
        int                        batch;                /*     8     4 */
        short int                  free_factor;          /*    12     2 */
        short int                  expire;               /*    14     2 */
        struct list_head           lists[13];            /*    16   208 */

        /* size: 256, cachelines: 4, members: 6 */
        /* padding: 32 */
} __attribute__((__aligned__(64)));
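
For reference, assuming MIGRATE_PCPTYPES is 3 and PAGE_ALLOC_COSTLY_ORDER
is 3 (the common configuration; an assumption about the build, not
something this patch changes), the list count works out as:

	NR_LOWORDER_PCP_LISTS = MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1) = 3 * 4 = 12
	NR_PCP_LISTS          = NR_LOWORDER_PCP_LISTS + NR_PCP_THP              = 12 + 1 = 13

which matches the lists[13] array (13 * 16 bytes = 208 bytes) in the
layout above.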

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h | 11 +++++++----
 mm/page_alloc.c        |  4 ++--
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 962b14d403e8..abe530748de6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -358,15 +358,18 @@ enum zone_watermarks {
 };
 
 /*
- * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER plus one additional
- * for pageblock size for THP if configured.
+ * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. One additional list
+ * for THP which will usually be GFP_MOVABLE. Even if it is another type,
+ * it should not contribute to serious fragmentation causing THP allocation
+ * failures.
  */
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define NR_PCP_THP 1
 #else
 #define NR_PCP_THP 0
 #endif
-#define NR_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1 + NR_PCP_THP))
+#define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
+#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)
 
 /*
  * Shift to encode migratetype and order in the same integer, with order
@@ -392,7 +395,7 @@ struct per_cpu_pages {
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
 	struct list_head lists[NR_PCP_LISTS];
-};
+} ____cacheline_aligned_in_smp;
 
 struct per_cpu_zonestat {
 #ifdef CONFIG_SMP
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 63976ad4b7f1..ed2deb93a758 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -648,7 +648,7 @@ static inline unsigned int order_to_pindex(int migratetype, int order)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (order > PAGE_ALLOC_COSTLY_ORDER) {
 		VM_BUG_ON(order != pageblock_order);
-		base = PAGE_ALLOC_COSTLY_ORDER + 1;
+		return NR_LOWORDER_PCP_LISTS;
 	}
 #else
 	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
@@ -662,7 +662,7 @@ static inline int pindex_to_order(unsigned int pindex)
 	int order = pindex / MIGRATE_PCPTYPES;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (order > PAGE_ALLOC_COSTLY_ORDER)
+	if (pindex == NR_LOWORDER_PCP_LISTS)
 		order = pageblock_order;
 #else
 	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
-- 
2.34.1



* [PATCH 3/6] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper
  2022-04-20  9:59 [RFC PATCH 0/6] Drain remote per-cpu directly Mel Gorman
  2022-04-20  9:59 ` [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list Mel Gorman
  2022-04-20  9:59 ` [PATCH 2/6] mm/page_alloc: Use only one PCP list for THP-sized allocations Mel Gorman
@ 2022-04-20  9:59 ` Mel Gorman
  2022-04-20  9:59 ` [PATCH 4/6] mm/page_alloc: Remove unnecessary page == NULL check in rmqueue Mel Gorman
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2022-04-20  9:59 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM,
	Mel Gorman

This is a preparation patch to allow the buddy removal code to be reused
in a later patch.

No functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 87 ++++++++++++++++++++++++++++---------------------
 1 file changed, 50 insertions(+), 37 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ed2deb93a758..4c1acf666056 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3622,6 +3622,46 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
 #endif
 }
 
+static __always_inline
+struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
+			   unsigned int order, unsigned int alloc_flags,
+			   int migratetype)
+{
+	struct page *page;
+	unsigned long flags;
+
+	do {
+		page = NULL;
+		spin_lock_irqsave(&zone->lock, flags);
+		/*
+		 * order-0 request can reach here when the pcplist is skipped
+		 * due to non-CMA allocation context. HIGHATOMIC area is
+		 * reserved for high-order atomic allocation, so order-0
+		 * request should skip it.
+		 */
+		if (order > 0 && alloc_flags & ALLOC_HARDER) {
+			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+			if (page)
+				trace_mm_page_alloc_zone_locked(page, order, migratetype);
+		}
+		if (!page) {
+			page = __rmqueue(zone, order, migratetype, alloc_flags);
+			if (!page) {
+				spin_unlock_irqrestore(&zone->lock, flags);
+				return NULL;
+			}
+		}
+		__mod_zone_freepage_state(zone, -(1 << order),
+					  get_pcppage_migratetype(page));
+		spin_unlock_irqrestore(&zone->lock, flags);
+	} while (check_new_pages(page, order));
+
+	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
+	zone_statistics(preferred_zone, zone, 1);
+
+	return page;
+}
+
 /* Remove page from the per-cpu list, caller must protect the list */
 static inline
 struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
@@ -3702,9 +3742,14 @@ struct page *rmqueue(struct zone *preferred_zone,
 			gfp_t gfp_flags, unsigned int alloc_flags,
 			int migratetype)
 {
-	unsigned long flags;
 	struct page *page;
 
+	/*
+	 * We most definitely don't want callers attempting to
+	 * allocate greater than order-1 page units with __GFP_NOFAIL.
+	 */
+	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
+
 	if (likely(pcp_allowed_order(order))) {
 		/*
 		 * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
@@ -3718,38 +3763,10 @@ struct page *rmqueue(struct zone *preferred_zone,
 		}
 	}
 
-	/*
-	 * We most definitely don't want callers attempting to
-	 * allocate greater than order-1 page units with __GFP_NOFAIL.
-	 */
-	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
-
-	do {
-		page = NULL;
-		spin_lock_irqsave(&zone->lock, flags);
-		/*
-		 * order-0 request can reach here when the pcplist is skipped
-		 * due to non-CMA allocation context. HIGHATOMIC area is
-		 * reserved for high-order atomic allocation, so order-0
-		 * request should skip it.
-		 */
-		if (order > 0 && alloc_flags & ALLOC_HARDER) {
-			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
-			if (page)
-				trace_mm_page_alloc_zone_locked(page, order, migratetype);
-		}
-		if (!page) {
-			page = __rmqueue(zone, order, migratetype, alloc_flags);
-			if (!page)
-				goto failed;
-		}
-		__mod_zone_freepage_state(zone, -(1 << order),
-					  get_pcppage_migratetype(page));
-		spin_unlock_irqrestore(&zone->lock, flags);
-	} while (check_new_pages(page, order));
-
-	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
-	zone_statistics(preferred_zone, zone, 1);
+	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
+							migratetype);
+	if (unlikely(!page))
+		return NULL;
 
 out:
 	/* Separate test+clear to avoid unnecessary atomics */
@@ -3760,10 +3777,6 @@ struct page *rmqueue(struct zone *preferred_zone,
 
 	VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
 	return page;
-
-failed:
-	spin_unlock_irqrestore(&zone->lock, flags);
-	return NULL;
 }
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
-- 
2.34.1



* [PATCH 4/6] mm/page_alloc: Remove unnecessary page == NULL check in rmqueue
  2022-04-20  9:59 [RFC PATCH 0/6] Drain remote per-cpu directly Mel Gorman
                   ` (2 preceding siblings ...)
  2022-04-20  9:59 ` [PATCH 3/6] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper Mel Gorman
@ 2022-04-20  9:59 ` Mel Gorman
  2022-04-20  9:59 ` [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2022-04-20  9:59 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM,
	Mel Gorman

The VM_BUG_ON check for a valid page can be avoided with a simple
change in the flow. The ZONE_BOOSTED_WATERMARK bit is unlikely to be set
in general, and even more unlikely if the page allocation failed, so mark
the branch unlikely.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4c1acf666056..dc0fdeb3795c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3765,17 +3765,18 @@ struct page *rmqueue(struct zone *preferred_zone,
 
 	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
 							migratetype);
-	if (unlikely(!page))
-		return NULL;
 
 out:
 	/* Separate test+clear to avoid unnecessary atomics */
-	if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
+	if (unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
 		clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
 		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
 	}
 
-	VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
+	if (unlikely(!page))
+		return NULL;
+
+	VM_BUG_ON_PAGE(bad_range(zone, page), page);
 	return page;
 }
 
-- 
2.34.1



* [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock
  2022-04-20  9:59 [RFC PATCH 0/6] Drain remote per-cpu directly Mel Gorman
                   ` (3 preceding siblings ...)
  2022-04-20  9:59 ` [PATCH 4/6] mm/page_alloc: Remove unnecessary page == NULL check in rmqueue Mel Gorman
@ 2022-04-20  9:59 ` Mel Gorman
  2022-04-20 14:02   ` Hillf Danton
                     ` (2 more replies)
  2022-04-20  9:59 ` [PATCH 6/6] mm/page_alloc: Remotely drain per-cpu lists Mel Gorman
                   ` (2 subsequent siblings)
  7 siblings, 3 replies; 21+ messages in thread
From: Mel Gorman @ 2022-04-20  9:59 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM,
	Mel Gorman

Currently the PCP lists are protected by using local_lock_irqsave to
prevent migration and IRQ reentrancy but this is inconvenient. Remote
draining of the lists is impossible, a workqueue is required to do it,
and every task allocation/free must disable then enable interrupts,
which is expensive.

As preparation for dealing with both of those problems, protect the
lists with a spinlock. The IRQ-unsafe version of the lock is used
because IRQs are already disabled by local_lock_irqsave. spin_trylock
is used in preparation for a time when local_lock could be used instead
of local_lock_irqsave.

The per_cpu_pages still fits within the same number of cache lines after
this patch relative to before the series.

struct per_cpu_pages {
        spinlock_t                 lock;                 /*     0     4 */
        int                        count;                /*     4     4 */
        int                        high;                 /*     8     4 */
        int                        batch;                /*    12     4 */
        short int                  free_factor;          /*    16     2 */
        short int                  expire;               /*    18     2 */

        /* XXX 4 bytes hole, try to pack */

        struct list_head           lists[13];            /*    24   208 */

        /* size: 256, cachelines: 4, members: 7 */
        /* sum members: 228, holes: 1, sum holes: 4 */
        /* padding: 24 */
} __attribute__((__aligned__(64)));
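
A condensed sketch of the free-side flow after this patch (illustrative
only; the real code is split across free_unref_page() and
free_unref_page_commit() below and handles more detail):

	local_lock_irqsave(&pagesets.lock, flags);
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	if (spin_trylock(&pcp->lock)) {
		list_add(&page->pcp_list, &pcp->lists[pindex]);
		pcp->count += 1 << order;
		/* ... may free a batch back to buddy if pcp->count exceeds high ... */
		spin_unlock(&pcp->lock);
		freed_pcp = true;
	}
	local_unlock_irqrestore(&pagesets.lock, flags);

	if (!freed_pcp)
		free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);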

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |   1 +
 mm/page_alloc.c        | 155 +++++++++++++++++++++++++++++++++++------
 2 files changed, 136 insertions(+), 20 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index abe530748de6..8b5757735428 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -385,6 +385,7 @@ enum zone_watermarks {
 
 /* Fields and list protected by pagesets local_lock in page_alloc.c */
 struct per_cpu_pages {
+	spinlock_t lock;	/* Protects lists field */
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dc0fdeb3795c..813c84b67c65 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -132,6 +132,17 @@ static DEFINE_PER_CPU(struct pagesets, pagesets) __maybe_unused = {
 	.lock = INIT_LOCAL_LOCK(lock),
 };
 
+#ifdef CONFIG_SMP
+/* On SMP, spin_trylock is sufficient protection */
+#define pcp_trylock_prepare(flags)	do { } while (0)
+#define pcp_trylock_finish(flag)	do { } while (0)
+#else
+
+/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */
+#define pcp_trylock_prepare(flags)	local_irq_save(flags)
+#define pcp_trylock_finish(flags)	local_irq_restore(flags)
+#endif
+
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DEFINE_PER_CPU(int, numa_node);
 EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -3082,15 +3093,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
  */
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 {
-	unsigned long flags;
 	int to_drain, batch;
 
-	local_lock_irqsave(&pagesets.lock, flags);
 	batch = READ_ONCE(pcp->batch);
 	to_drain = min(pcp->count, batch);
-	if (to_drain > 0)
+	if (to_drain > 0) {
+		unsigned long flags;
+
+		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
+		local_irq_save(flags);
+
+		spin_lock(&pcp->lock);
 		free_pcppages_bulk(zone, to_drain, pcp, 0);
-	local_unlock_irqrestore(&pagesets.lock, flags);
+		spin_unlock(&pcp->lock);
+
+		local_irq_restore(flags);
+	}
 }
 #endif
 
@@ -3103,16 +3121,21 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
  */
 static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 {
-	unsigned long flags;
 	struct per_cpu_pages *pcp;
 
-	local_lock_irqsave(&pagesets.lock, flags);
-
 	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
-	if (pcp->count)
+	if (pcp->count) {
+		unsigned long flags;
+
+		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
+		local_irq_save(flags);
+
+		spin_lock(&pcp->lock);
 		free_pcppages_bulk(zone, pcp->count, pcp, 0);
+		spin_unlock(&pcp->lock);
 
-	local_unlock_irqrestore(&pagesets.lock, flags);
+		local_irq_restore(flags);
+	}
 }
 
 /*
@@ -3380,18 +3403,30 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 	return min(READ_ONCE(pcp->batch) << 2, high);
 }
 
-static void free_unref_page_commit(struct page *page, int migratetype,
-				   unsigned int order)
+/* Returns true if the page was committed to the per-cpu list. */
+static bool free_unref_page_commit(struct page *page, int migratetype,
+				   unsigned int order, bool locked)
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	int high;
 	int pindex;
 	bool free_high;
+	unsigned long __maybe_unused UP_flags;
 
 	__count_vm_event(PGFREE);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pindex = order_to_pindex(migratetype, order);
+
+	if (!locked) {
+		/* Protect against a parallel drain. */
+		pcp_trylock_prepare(UP_flags);
+		if (!spin_trylock(&pcp->lock)) {
+			pcp_trylock_finish(UP_flags);
+			return false;
+		}
+	}
+
 	list_add(&page->pcp_list, &pcp->lists[pindex]);
 	pcp->count += 1 << order;
 
@@ -3409,6 +3444,13 @@ static void free_unref_page_commit(struct page *page, int migratetype,
 
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch, free_high), pcp, pindex);
 	}
+
+	if (!locked) {
+		spin_unlock(&pcp->lock);
+		pcp_trylock_finish(UP_flags);
+	}
+
+	return true;
 }
 
 /*
@@ -3419,6 +3461,7 @@ void free_unref_page(struct page *page, unsigned int order)
 	unsigned long flags;
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
+	bool freed_pcp = false;
 
 	if (!free_unref_page_prepare(page, pfn, order))
 		return;
@@ -3440,8 +3483,11 @@ void free_unref_page(struct page *page, unsigned int order)
 	}
 
 	local_lock_irqsave(&pagesets.lock, flags);
-	free_unref_page_commit(page, migratetype, order);
+	freed_pcp = free_unref_page_commit(page, migratetype, order, false);
 	local_unlock_irqrestore(&pagesets.lock, flags);
+
+	if (unlikely(!freed_pcp))
+		free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
 }
 
 /*
@@ -3450,10 +3496,19 @@ void free_unref_page(struct page *page, unsigned int order)
 void free_unref_page_list(struct list_head *list)
 {
 	struct page *page, *next;
+	struct per_cpu_pages *pcp;
+	struct zone *locked_zone;
 	unsigned long flags;
 	int batch_count = 0;
 	int migratetype;
 
+	/*
+	 * An empty list is possible. Check early so that the later
+	 * lru_to_page() does not potentially read garbage.
+	 */
+	if (list_empty(list))
+		return;
+
 	/* Prepare pages for freeing */
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn = page_to_pfn(page);
@@ -3474,8 +3529,26 @@ void free_unref_page_list(struct list_head *list)
 		}
 	}
 
+	VM_BUG_ON(in_hardirq());
+
 	local_lock_irqsave(&pagesets.lock, flags);
+
+	page = lru_to_page(list);
+	locked_zone = page_zone(page);
+	pcp = this_cpu_ptr(locked_zone->per_cpu_pageset);
+	spin_lock(&pcp->lock);
+
 	list_for_each_entry_safe(page, next, list, lru) {
+		struct zone *zone = page_zone(page);
+
+		/* Different zone, different pcp lock. */
+		if (zone != locked_zone) {
+			spin_unlock(&pcp->lock);
+			locked_zone = zone;
+			pcp = this_cpu_ptr(zone->per_cpu_pageset);
+			spin_lock(&pcp->lock);
+		}
+
 		/*
 		 * Non-isolated types over MIGRATE_PCPTYPES get added
 		 * to the MIGRATE_MOVABLE pcp list.
@@ -3485,18 +3558,25 @@ void free_unref_page_list(struct list_head *list)
 			migratetype = MIGRATE_MOVABLE;
 
 		trace_mm_page_free_batched(page);
-		free_unref_page_commit(page, migratetype, 0);
+
+		/* True is dead code at the moment due to local_lock_irqsave. */
+		if (unlikely(!free_unref_page_commit(page, migratetype, 0, true)))
+			free_one_page(page_zone(page), page, page_to_pfn(page), 0, migratetype, FPI_NONE);
 
 		/*
 		 * Guard against excessive IRQ disabled times when we get
 		 * a large list of pages to free.
 		 */
 		if (++batch_count == SWAP_CLUSTER_MAX) {
+			spin_unlock(&pcp->lock);
 			local_unlock_irqrestore(&pagesets.lock, flags);
 			batch_count = 0;
 			local_lock_irqsave(&pagesets.lock, flags);
+			pcp = this_cpu_ptr(locked_zone->per_cpu_pageset);
+			spin_lock(&pcp->lock);
 		}
 	}
+	spin_unlock(&pcp->lock);
 	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 
@@ -3668,9 +3748,30 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 			int migratetype,
 			unsigned int alloc_flags,
 			struct per_cpu_pages *pcp,
-			struct list_head *list)
+			struct list_head *list,
+			bool locked)
 {
 	struct page *page;
+	unsigned long __maybe_unused UP_flags;
+
+	/*
+	 * spin_trylock is not necessary right now due to
+	 * local_lock_irqsave and is a preparation step for
+	 * a conversion to local_lock using the trylock to prevent
+	 * IRQ re-entrancy. If pcp->lock cannot be acquired, the caller
+	 * uses rmqueue_buddy.
+	 *
+	 * TODO: Convert local_lock_irqsave to local_lock. Care
+	 * 	 is needed as the type of local_lock would need a
+	 * 	 PREEMPT_RT version due to threaded IRQs.
+	 */
+	if (unlikely(!locked)) {
+		pcp_trylock_prepare(UP_flags);
+		if (!spin_trylock(&pcp->lock)) {
+			pcp_trylock_finish(UP_flags);
+			return NULL;
+		}
+	}
 
 	do {
 		if (list_empty(list)) {
@@ -3691,8 +3792,10 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 					migratetype, alloc_flags);
 
 			pcp->count += alloced << order;
-			if (unlikely(list_empty(list)))
-				return NULL;
+			if (unlikely(list_empty(list))) {
+				page = NULL;
+				goto out;
+			}
 		}
 
 		page = list_first_entry(list, struct page, lru);
@@ -3700,6 +3803,12 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 		pcp->count -= 1 << order;
 	} while (check_new_pcp(page, order));
 
+out:
+	if (!locked) {
+		spin_unlock(&pcp->lock);
+		pcp_trylock_finish(UP_flags);
+	}
+
 	return page;
 }
 
@@ -3724,7 +3833,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pcp->free_factor >>= 1;
 	list = &pcp->lists[order_to_pindex(migratetype, order)];
-	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
+	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list, false);
 	local_unlock_irqrestore(&pagesets.lock, flags);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
@@ -3759,7 +3868,8 @@ struct page *rmqueue(struct zone *preferred_zone,
 				migratetype != MIGRATE_MOVABLE) {
 			page = rmqueue_pcplist(preferred_zone, zone, order,
 					gfp_flags, migratetype, alloc_flags);
-			goto out;
+			if (likely(page))
+				goto out;
 		}
 	}
 
@@ -5326,6 +5436,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	local_lock_irqsave(&pagesets.lock, flags);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
+	spin_lock(&pcp->lock);
 
 	while (nr_populated < nr_pages) {
 
@@ -5336,11 +5447,13 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		}
 
 		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
-								pcp, pcp_list);
+							pcp, pcp_list, true);
 		if (unlikely(!page)) {
 			/* Try and get at least one page */
-			if (!nr_populated)
+			if (!nr_populated) {
+				spin_unlock(&pcp->lock);
 				goto failed_irq;
+			}
 			break;
 		}
 		nr_account++;
@@ -5353,6 +5466,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		nr_populated++;
 	}
 
+	spin_unlock(&pcp->lock);
 	local_unlock_irqrestore(&pagesets.lock, flags);
 
 	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
@@ -6992,6 +7106,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 	memset(pcp, 0, sizeof(*pcp));
 	memset(pzstats, 0, sizeof(*pzstats));
 
+	spin_lock_init(&pcp->lock);
 	for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
 		INIT_LIST_HEAD(&pcp->lists[pindex]);
 
-- 
2.34.1



* [PATCH 6/6] mm/page_alloc: Remotely drain per-cpu lists
  2022-04-20  9:59 [RFC PATCH 0/6] Drain remote per-cpu directly Mel Gorman
                   ` (4 preceding siblings ...)
  2022-04-20  9:59 ` [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
@ 2022-04-20  9:59 ` Mel Gorman
  2022-04-25 22:58 ` [RFC PATCH 0/6] Drain remote per-cpu directly Minchan Kim
  2022-04-26  2:49 ` Suren Baghdasaryan
  7 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2022-04-20  9:59 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM,
	Mel Gorman

From: Nicolas Saenz Julienne <nsaenzju@redhat.com>

Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
drain work queued by __drain_all_pages(). So introduce a new mechanism to
remotely drain the per-cpu lists. It is made possible by remotely locking
the new per-cpu spinlocks in 'struct per_cpu_pages'. A benefit of this
new scheme is that drain operations are now migration safe.

There was no observed performance degradation vs. the previous scheme.
Both netperf and hackbench were run in parallel while triggering the
__drain_all_pages(NULL, true) code path around 100 times per second.
The new scheme performs a bit better (~5%), although the important point
here is that there are no performance regressions vs. the previous mechanism.
Per-cpu lists draining happens only in slow paths.

Link: https://lore.kernel.org/r/20211103170512.2745765-4-nsaenzju@redhat.com
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 66 +++++++++----------------------------------------
 1 file changed, 11 insertions(+), 55 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 813c84b67c65..17d11eb0413e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -161,13 +161,7 @@ DEFINE_PER_CPU(int, _numa_mem_);		/* Kernel "local memory" node */
 EXPORT_PER_CPU_SYMBOL(_numa_mem_);
 #endif
 
-/* work_structs for global per-cpu drains */
-struct pcpu_drain {
-	struct zone *zone;
-	struct work_struct work;
-};
 static DEFINE_MUTEX(pcpu_drain_mutex);
-static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);
 
 #ifdef CONFIG_GCC_PLUGIN_LATENT_ENTROPY
 volatile unsigned long latent_entropy __latent_entropy;
@@ -3087,9 +3081,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
  * Called from the vmstat counter updater to drain pagesets of this
  * currently executing processor on remote nodes after they have
  * expired.
- *
- * Note that this function must be called with the thread pinned to
- * a single processor.
  */
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 {
@@ -3114,10 +3105,6 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 
 /*
  * Drain pcplists of the indicated processor and zone.
- *
- * The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
  */
 static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 {
@@ -3140,10 +3127,6 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 
 /*
  * Drain pcplists of all zones on the indicated processor.
- *
- * The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
  */
 static void drain_pages(unsigned int cpu)
 {
@@ -3156,9 +3139,6 @@ static void drain_pages(unsigned int cpu)
 
 /*
  * Spill all of this CPU's per-cpu pages back into the buddy allocator.
- *
- * The CPU has to be pinned. When zone parameter is non-NULL, spill just
- * the single zone's pages.
  */
 void drain_local_pages(struct zone *zone)
 {
@@ -3170,24 +3150,6 @@ void drain_local_pages(struct zone *zone)
 		drain_pages(cpu);
 }
 
-static void drain_local_pages_wq(struct work_struct *work)
-{
-	struct pcpu_drain *drain;
-
-	drain = container_of(work, struct pcpu_drain, work);
-
-	/*
-	 * drain_all_pages doesn't use proper cpu hotplug protection so
-	 * we can race with cpu offline when the WQ can move this from
-	 * a cpu pinned worker to an unbound one. We can operate on a different
-	 * cpu which is alright but we also have to make sure to not move to
-	 * a different one.
-	 */
-	migrate_disable();
-	drain_local_pages(drain->zone);
-	migrate_enable();
-}
-
 /*
  * The implementation of drain_all_pages(), exposing an extra parameter to
  * drain on all cpus.
@@ -3208,13 +3170,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 	 */
 	static cpumask_t cpus_with_pcps;
 
-	/*
-	 * Make sure nobody triggers this path before mm_percpu_wq is fully
-	 * initialized.
-	 */
-	if (WARN_ON_ONCE(!mm_percpu_wq))
-		return;
-
 	/*
 	 * Do not drain if one is already in progress unless it's specific to
 	 * a zone. Such callers are primarily CMA and memory hotplug and need
@@ -3264,14 +3219,12 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 	}
 
 	for_each_cpu(cpu, &cpus_with_pcps) {
-		struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);
-
-		drain->zone = zone;
-		INIT_WORK(&drain->work, drain_local_pages_wq);
-		queue_work_on(cpu, mm_percpu_wq, &drain->work);
+		if (zone) {
+			drain_pages_zone(cpu, zone);
+		} else {
+			drain_pages(cpu);
+		}
 	}
-	for_each_cpu(cpu, &cpus_with_pcps)
-		flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);
 
 	mutex_unlock(&pcpu_drain_mutex);
 }
@@ -3280,8 +3233,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
  * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
  *
  * When zone parameter is non-NULL, spill just the single zone's pages.
- *
- * Note that this can be extremely slow as the draining happens in a workqueue.
  */
 void drain_all_pages(struct zone *zone)
 {
@@ -3559,7 +3510,12 @@ void free_unref_page_list(struct list_head *list)
 
 		trace_mm_page_free_batched(page);
 
-		/* True is dead code at the moment due to local_lock_irqsave. */
+		/*
+		 * If there is a parallel drain in progress, free to the buddy
+		 * allocator directly. This is expensive as the zone lock will
+		 * be acquired multiple times but if a drain is in progress
+		 * then an expensive operation is already taking place.
+		 */
 		if (unlikely(!free_unref_page_commit(page, migratetype, 0, true)))
 			free_one_page(page_zone(page), page, page_to_pfn(page), 0, migratetype, FPI_NONE);
 
-- 
2.34.1



* Re: [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock
  2022-04-20  9:59 ` [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
@ 2022-04-20 14:02   ` Hillf Danton
  2022-04-20 14:35     ` Nicolas Saenz Julienne
  2022-04-26 16:42   ` Nicolas Saenz Julienne
  2022-04-26 19:24   ` Minchan Kim
  2 siblings, 1 reply; 21+ messages in thread
From: Hillf Danton @ 2022-04-20 14:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM

On Wed, 20 Apr 2022 10:59:05 +0100 Mel Gorman wrote:
>  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>  {
> -	unsigned long flags;
>  	int to_drain, batch;
>  
> -	local_lock_irqsave(&pagesets.lock, flags);
>  	batch = READ_ONCE(pcp->batch);
>  	to_drain = min(pcp->count, batch);
> -	if (to_drain > 0)
> +	if (to_drain > 0) {
> +		unsigned long flags;
> +
> +		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
> +		local_irq_save(flags);
> +
> +		spin_lock(&pcp->lock);

Nit, spin_lock_irqsave() instead.

>  		free_pcppages_bulk(zone, to_drain, pcp, 0);
> -	local_unlock_irqrestore(&pagesets.lock, flags);
> +		spin_unlock(&pcp->lock);
> +
> +		local_irq_restore(flags);
> +	}
>  }
>  #endif

Hillf



* Re: [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock
  2022-04-20 14:02   ` Hillf Danton
@ 2022-04-20 14:35     ` Nicolas Saenz Julienne
  0 siblings, 0 replies; 21+ messages in thread
From: Nicolas Saenz Julienne @ 2022-04-20 14:35 UTC (permalink / raw)
  To: Hillf Danton, Mel Gorman
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM

Hi Hillf,

On Wed, 2022-04-20 at 22:02 +0800, Hillf Danton wrote:
> On Wed, 20 Apr 2022 10:59:05 +0100 Mel Gorman wrote:
> >  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
> >  {
> > -	unsigned long flags;
> >  	int to_drain, batch;
> >  
> > -	local_lock_irqsave(&pagesets.lock, flags);
> >  	batch = READ_ONCE(pcp->batch);
> >  	to_drain = min(pcp->count, batch);
> > -	if (to_drain > 0)
> > +	if (to_drain > 0) {
> > +		unsigned long flags;
> > +
> > +		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
> > +		local_irq_save(flags);
> > +
> > +		spin_lock(&pcp->lock);
> 
> Nit, spin_lock_irqsave() instead.

See the cover letter:

"This series is a partial series. Follow-on work would allow the local_irq_save
to be converted to a local_irq to avoid IRQs being disabled/enabled in most
cases. However, there are enough corner cases that it deserves a series on its
own separated by one kernel release and the priority right now is to avoid
interference of high priority tasks."

Regards,

-- 
Nicolás Sáenz



* Re: [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list
  2022-04-20  9:59 ` [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list Mel Gorman
@ 2022-04-20 20:43   ` Matthew Wilcox
  2022-04-21  8:38     ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2022-04-20 20:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM

On Wed, Apr 20, 2022 at 10:59:01AM +0100, Mel Gorman wrote:
> The page allocator uses page->lru for storing pages on either buddy or
> PCP lists. Create page->buddy_list and page->pcp_list as a union with
> page->lru. This is simply to clarify what type of list a page is on
> in the page allocator.

Hi Mel,

No objection to this change, and I certainly don't want to hold up
fixing this (or any other) problem in the page allocator.  I would
like to talk about splitting out free page management from struct page.
Maybe you'd like to discuss that in person at LSFMM, but a quick
sketch of a data structure might look like ...

struct free_mem {
	unsigned long __page_flags;
	union {
		struct list_head buddy_list;
		struct list_head pcp_list;
	};
	unsigned long __rsvd4;
	unsigned long pcp_migratetype_and_order;
	unsigned long buddy_order;
	unsigned int __page_type;
	atomic_t _refcount;
};

Am I missing anything there?

(Would you like to use separate types for pcp and buddy?  That might be
overkill, or it might help keep the different stages of "free" memory
separate from each other)


* Re: [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list
  2022-04-20 20:43   ` Matthew Wilcox
@ 2022-04-21  8:38     ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2022-04-21  8:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM

On Wed, Apr 20, 2022 at 09:43:05PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 20, 2022 at 10:59:01AM +0100, Mel Gorman wrote:
> > The page allocator uses page->lru for storing pages on either buddy or
> > PCP lists. Create page->buddy_list and page->pcp_list as a union with
> > page->lru. This is simply to clarify what type of list a page is on
> > in the page allocator.
> 
> Hi Mel,
> 
> No objection to this change, and I certainly don't want to hold up
> fixing this (or any other) problem in the page allocator. 

Minimally, I think the patch makes it easier to implement your suggestion
and it can happen before or after the rest of the series.

> I would
> like to talk about splitting out free page management from struct page.
> Maybe you'd like to discuss that in person at LSFMM, but a quick
> sketch of a data structure might look like ...
> 

Unfortunately, I am unable to attend LSF/MM (or any other conference)
physically but I have no objection to splitting this out as a separate
structure. I assume the basis for the change would be type checking.

> struct free_mem {
> 	unsigned long __page_flags;

page->flags, ok

> 	union {
> 		struct list_head buddy_list;
> 		struct list_head pcp_list;
> 	};

page->lru, ok

> 	unsigned long __rsvd4;

page->mapping, we need that

> 	unsigned long pcp_migratetype_and_order;

page->index, ok but more on this later

> 	unsigned long buddy_order;

page->private, ok.

> 	unsigned int __page_type;

page->_mapcount, we need that too.

> 	atomic_t _refcount;

page->_refcount, ok.

> };
> 
> Am I missing anything there?
> 

s/__page_flags/flags/	The allocator checks and modifies these bits so
	it has awareness of page->flags

s/__rsvd4/mapping/	The mapping is checked for debugging and the
	allocator is responsible for clearing page->mapping

s/pcp_migratetype_and_order/pcp_migratetype/
	Commit 8b10b465d0e1 ("mm/page_alloc: free pages in a single pass
	during bulk free") removed the migratetype and order stuffing
	in page->index. The order is inferred from the array index via
	order_to_pindex and pindex_to_order but migratetype is still
	stored in page->index by set_pcppage_migratetype

s/__page_type/_mapcount/ because _mapcount is checked for debugging

memcg_data needs to be accessible for debugging checks

_last_cpupid needs to be accessible as it's reset during prepare via
	page_cpupid_reset_last. Rather than putting in a dummy field
	for virtual, maybe virtual can move.
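
Putting those substitutions together, the sketch would look roughly
like this (illustrative only, keeping your field layout):

	struct free_mem {
		unsigned long flags;
		union {
			struct list_head buddy_list;
			struct list_head pcp_list;
		};
		unsigned long mapping;
		unsigned long pcp_migratetype;
		unsigned long buddy_order;
		unsigned int _mapcount;
		atomic_t _refcount;
		/* memcg_data and _last_cpupid also need to stay accessible */
	};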

> (Would you like to use separate types for pcp and buddy?  That might be
> overkill, or it might help keep the different stages of "free" memory
> separate from each other)

I think it's overkill because there is too much overlap between a PCP
page and buddy page due to page checks and preparation. The distinguishing
factor between a free pcp page and free buddy page is the PageBuddy bit.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC PATCH 0/6] Drain remote per-cpu directly
  2022-04-20  9:59 [RFC PATCH 0/6] Drain remote per-cpu directly Mel Gorman
                   ` (5 preceding siblings ...)
  2022-04-20  9:59 ` [PATCH 6/6] mm/page_alloc: Remotely drain per-cpu lists Mel Gorman
@ 2022-04-25 22:58 ` Minchan Kim
  2022-04-26 11:06   ` Nicolas Saenz Julienne
  2022-04-26  2:49 ` Suren Baghdasaryan
  7 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2022-04-25 22:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM

On Wed, Apr 20, 2022 at 10:59:00AM +0100, Mel Gorman wrote:
> This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> per-cpu lists drain support" -- avoid interference of a high priority
> task due to a workqueue item draining per-cpu page lists. While many
> workloads can tolerate a brief interruption, it may be cause a real-time
> task runnning on a NOHZ_FULL CPU to miss a deadline and at minimum,
> the draining in non-deterministic.

Yeah, the non-determinism is a problem. I have seen the kworker-based draining
take 100+ms (up to 300ms observed) at times in alloc_contig_range() if CPUs
are heavily loaded.

I am not sure whether Nicolas has already observed this. It is not only a
problem for per_cpu_pages; it also affects lru_pvecs (pagevec) draining.
Do we need to introduce a similar solution (allowing remote draining with a
spin_lock) for the pagevecs?

> 
> Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> The local_lock on its own prevents migration and the IRQ disabling protects
> from corruption due to an interrupt arriving while a page allocation is
> in progress. The locking is inherently unsafe for remote access unless
> the CPU is hot-removed.
> 
> This series adjusts the locking. A spin-lock is added to struct
> per_cpu_pages to protect the list contents while local_lock_irq continues
> to prevent migration and IRQ reentry. This allows a remote CPU to safely
> drain a remote per-cpu list.
> 
> This series is a partial series. Follow-on work would allow the
> local_irq_save to be converted to a local_irq to avoid IRQs being
> disabled/enabled in most cases. However, there are enough corner cases
> that it deserves a series on its own separated by one kernel release and
> the priority right now is to avoid interference of high priority tasks.
> 
> Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
> 	and when it is storing per-cpu pages.
> 
> Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
> 	this is not necessary but it avoids per_cpu_pages consuming another
> 	cache line.
> 
> Patch 3 is a preparation patch to avoid code duplication.
> 
> Patch 4 is a simple micro-optimisation that improves code flow necessary for
> 	a later patch to avoid code duplication.
> 
> Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
> 	relying on local_lock to prevent migration, stabilise the pcp
> 	lookup and prevent IRQ reentrancy.
> 
> Patch 6 remote drains per-cpu pages directly instead of using a workqueue.
> 
>  include/linux/mm_types.h |   5 +
>  include/linux/mmzone.h   |  12 +-
>  mm/page_alloc.c          | 333 ++++++++++++++++++++++++---------------
>  3 files changed, 222 insertions(+), 128 deletions(-)
> 
> -- 
> 2.34.1
> 
> 


* Re: [RFC PATCH 0/6] Drain remote per-cpu directly
  2022-04-20  9:59 [RFC PATCH 0/6] Drain remote per-cpu directly Mel Gorman
                   ` (6 preceding siblings ...)
  2022-04-25 22:58 ` [RFC PATCH 0/6] Drain remote per-cpu directly Minchan Kim
@ 2022-04-26  2:49 ` Suren Baghdasaryan
  2022-04-26  6:30   ` Suren Baghdasaryan
  7 siblings, 1 reply; 21+ messages in thread
From: Suren Baghdasaryan @ 2022-04-26  2:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM

On Wed, Apr 20, 2022 at 2:59 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> per-cpu lists drain support" -- avoid interference of a high priority
> task due to a workqueue item draining per-cpu page lists. While many
> workloads can tolerate a brief interruption, it may be cause a real-time
> task runnning on a NOHZ_FULL CPU to miss a deadline and at minimum,
> the draining in non-deterministic.
>
> Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> The local_lock on its own prevents migration and the IRQ disabling protects
> from corruption due to an interrupt arriving while a page allocation is
> in progress. The locking is inherently unsafe for remote access unless
> the CPU is hot-removed.
>
> This series adjusts the locking. A spin-lock is added to struct
> per_cpu_pages to protect the list contents while local_lock_irq continues
> to prevent migration and IRQ reentry. This allows a remote CPU to safely
> drain a remote per-cpu list.
>
> This series is a partial series. Follow-on work would allow the
> local_irq_save to be converted to a local_irq to avoid IRQs being
> disabled/enabled in most cases. However, there are enough corner cases
> that it deserves a series on its own separated by one kernel release and
> the priority right now is to avoid interference of high priority tasks.
>
> Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
>         and when it is storing per-cpu pages.
>
> Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
>         this is not necessary but it avoids per_cpu_pages consuming another
>         cache line.
>
> Patch 3 is a preparation patch to avoid code duplication.
>
> Patch 4 is a simple micro-optimisation that improves code flow necessary for
>         a later patch to avoid code duplication.
>
> Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
>         relying on local_lock to prevent migration, stabilise the pcp
>         lookup and prevent IRQ reentrancy.
>
> Patch 6 remote drains per-cpu pages directly instead of using a workqueue.

This quite possibly solves the issue I was trying to fix in
https://lore.kernel.org/all/20220225012819.1807147-1-surenb@google.com.
I will give it a try and see how it looks.
Thanks!

>
>  include/linux/mm_types.h |   5 +
>  include/linux/mmzone.h   |  12 +-
>  mm/page_alloc.c          | 333 ++++++++++++++++++++++++---------------
>  3 files changed, 222 insertions(+), 128 deletions(-)
>
> --
> 2.34.1
>
>


* Re: [RFC PATCH 0/6] Drain remote per-cpu directly
  2022-04-26  2:49 ` Suren Baghdasaryan
@ 2022-04-26  6:30   ` Suren Baghdasaryan
  0 siblings, 0 replies; 21+ messages in thread
From: Suren Baghdasaryan @ 2022-04-26  6:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM

On Mon, Apr 25, 2022 at 7:49 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Apr 20, 2022 at 2:59 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> > per-cpu lists drain support" -- avoid interference of a high priority
> > task due to a workqueue item draining per-cpu page lists. While many
> > workloads can tolerate a brief interruption, it may be cause a real-time
> > task runnning on a NOHZ_FULL CPU to miss a deadline and at minimum,
> > the draining in non-deterministic.
> >
> > Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> > The local_lock on its own prevents migration and the IRQ disabling protects
> > from corruption due to an interrupt arriving while a page allocation is
> > in progress. The locking is inherently unsafe for remote access unless
> > the CPU is hot-removed.
> >
> > This series adjusts the locking. A spin-lock is added to struct
> > per_cpu_pages to protect the list contents while local_lock_irq continues
> > to prevent migration and IRQ reentry. This allows a remote CPU to safely
> > drain a remote per-cpu list.
> >
> > This series is a partial series. Follow-on work would allow the
> > local_irq_save to be converted to a local_irq to avoid IRQs being
> > disabled/enabled in most cases. However, there are enough corner cases
> > that it deserves a series on its own separated by one kernel release and
> > the priority right now is to avoid interference of high priority tasks.
> >
> > Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
> >         and when it is storing per-cpu pages.
> >
> > Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
> >         this is not necessary but it avoids per_cpu_pages consuming another
> >         cache line.
> >
> > Patch 3 is a preparation patch to avoid code duplication.
> >
> > Patch 4 is a simple micro-optimisation that improves code flow necessary for
> >         a later patch to avoid code duplication.
> >
> > Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
> >         relying on local_lock to prevent migration, stabilise the pcp
> >         lookup and prevent IRQ reentrancy.
> >
> > Patch 6 remote drains per-cpu pages directly instead of using a workqueue.
>
> This quite possibly solves the issue I was trying to fix in
> https://lore.kernel.org/all/20220225012819.1807147-1-surenb@google.com.
> I will give it a try and see how it looks.

My test shows a sizable improvement in the worst-case drain_all_pages
duration. Before the change I caught cases where a drain_local_pages_wq
work item was delayed by 100+ms (not even counting the
drain_local_pages_wq execution time itself). With this patchset the
worst time I was able to record for the drain_all_pages duration was 17ms.

> Thanks!
>
> >
> >  include/linux/mm_types.h |   5 +
> >  include/linux/mmzone.h   |  12 +-
> >  mm/page_alloc.c          | 333 ++++++++++++++++++++++++---------------
> >  3 files changed, 222 insertions(+), 128 deletions(-)
> >
> > --
> > 2.34.1
> >
> >


* Re: [RFC PATCH 0/6] Drain remote per-cpu directly
  2022-04-25 22:58 ` [RFC PATCH 0/6] Drain remote per-cpu directly Minchan Kim
@ 2022-04-26 11:06   ` Nicolas Saenz Julienne
  2022-04-27 15:21     ` Marcelo Tosatti
  0 siblings, 1 reply; 21+ messages in thread
From: Nicolas Saenz Julienne @ 2022-04-26 11:06 UTC (permalink / raw)
  To: Minchan Kim, Mel Gorman
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM

On Mon, 2022-04-25 at 15:58 -0700, Minchan Kim wrote:
> On Wed, Apr 20, 2022 at 10:59:00AM +0100, Mel Gorman wrote:
> > This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> > per-cpu lists drain support" -- avoid interference of a high priority
> > task due to a workqueue item draining per-cpu page lists. While many
> > workloads can tolerate a brief interruption, it may cause a real-time
> > task running on a NOHZ_FULL CPU to miss a deadline and, at minimum,
> > the draining is non-deterministic.
> 
> Yeah, the non-determinism is a problem. I saw the kworker-based draining
> take 100+ms (up to 300ms observed) sometimes in alloc_contig_range if CPUs
> are heavily loaded.
> 
> I am not sure whether Nicolas has already observed this: it's not only a
> problem of per_cpu_pages but also of lru_pvecs (pagevec) draining.
> Do we need to introduce a similar solution (allowing remote draining with
> a spin_lock) for pagevecs?

Yes, I'm aware of the lru problem. I'll start working on it too once we're done
with the page allocator (and if no-one beats me to it). That said, I don't know
if we can apply the exact same approach, the devil is in the details. :)

-- 
Nicolás Sáenz


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock
  2022-04-20  9:59 ` [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
  2022-04-20 14:02   ` Hillf Danton
@ 2022-04-26 16:42   ` Nicolas Saenz Julienne
  2022-04-26 16:48     ` Vlastimil Babka
  2022-04-29  9:13     ` Mel Gorman
  2022-04-26 19:24   ` Minchan Kim
  2 siblings, 2 replies; 21+ messages in thread
From: Nicolas Saenz Julienne @ 2022-04-26 16:42 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, LKML, Linux-MM

On Wed, 2022-04-20 at 10:59 +0100, Mel Gorman wrote:
> Currently the PCP lists are protected by using local_lock_irqsave to
> prevent migration and IRQ reentrancy but this is inconvenient. Remote
> draining of the lists is impossible and a workqueue is required and
> every task allocation/free must disable then enable interrupts which is
> expensive.
> 
> As preparation for dealing with both of those problems, protect the
> lists with a spinlock. The IRQ-unsafe version of the lock is used
> because IRQs are already disabled by local_lock_irqsave. spin_trylock
> is used in preparation for a time when local_lock could be used instead
> of local_lock_irqsave.
> 
> The per_cpu_pages still fits within the same number of cache lines after
> this patch relative to before the series.
> 
> struct per_cpu_pages {
>         spinlock_t                 lock;                 /*     0     4 */
>         int                        count;                /*     4     4 */
>         int                        high;                 /*     8     4 */
>         int                        batch;                /*    12     4 */
>         short int                  free_factor;          /*    16     2 */
>         short int                  expire;               /*    18     2 */
> 
>         /* XXX 4 bytes hole, try to pack */
> 
>         struct list_head           lists[13];            /*    24   208 */
> 
>         /* size: 256, cachelines: 4, members: 7 */
>         /* sum members: 228, holes: 1, sum holes: 4 */
>         /* padding: 24 */
> } __attribute__((__aligned__(64)));
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/mmzone.h |   1 +
>  mm/page_alloc.c        | 155 +++++++++++++++++++++++++++++++++++------
>  2 files changed, 136 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index abe530748de6..8b5757735428 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -385,6 +385,7 @@ enum zone_watermarks {
>  
>  /* Fields and list protected by pagesets local_lock in page_alloc.c */
>  struct per_cpu_pages {
> +	spinlock_t lock;	/* Protects lists field */
>  	int count;		/* number of pages in the list */
>  	int high;		/* high watermark, emptying needed */
>  	int batch;		/* chunk size for buddy add/remove */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dc0fdeb3795c..813c84b67c65 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -132,6 +132,17 @@ static DEFINE_PER_CPU(struct pagesets, pagesets) __maybe_unused = {
>  	.lock = INIT_LOCAL_LOCK(lock),
>  };
>  
> +#ifdef CONFIG_SMP
> +/* On SMP, spin_trylock is sufficient protection */
> +#define pcp_trylock_prepare(flags)	do { } while (0)
> +#define pcp_trylock_finish(flag)	do { } while (0)
> +#else
> +
> +/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */

This is only needed on non-RT kernels. RT UP builds go through
rt_spin_trylock(), which behaves like its SMP counterpart.

> +#define pcp_trylock_prepare(flags)	local_irq_save(flags)
> +#define pcp_trylock_finish(flags)	local_irq_restore(flags)
> +#endif
> +
>  #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
>  DEFINE_PER_CPU(int, numa_node);
>  EXPORT_PER_CPU_SYMBOL(numa_node);
> @@ -3082,15 +3093,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>   */
>  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>  {
> -	unsigned long flags;
>  	int to_drain, batch;
>  
> -	local_lock_irqsave(&pagesets.lock, flags);
>  	batch = READ_ONCE(pcp->batch);
>  	to_drain = min(pcp->count, batch);
> -	if (to_drain > 0)
> +	if (to_drain > 0) {
> +		unsigned long flags;
> +
> +		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
> +		local_irq_save(flags);

Why dropping the local_lock? That approach is nicer to RT builds, and I don't
think it makes a difference from a non-RT perspective.

That said, IIUC, this will eventually disappear with subsequent patches, right?

> +
> +		spin_lock(&pcp->lock);
>  		free_pcppages_bulk(zone, to_drain, pcp, 0);
> -	local_unlock_irqrestore(&pagesets.lock, flags);
> +		spin_unlock(&pcp->lock);
> +
> +		local_irq_restore(flags);
> +	}
>  }
>  #endif
>  

[...]

> @@ -3668,9 +3748,30 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>  			int migratetype,
>  			unsigned int alloc_flags,
>  			struct per_cpu_pages *pcp,
> -			struct list_head *list)
> +			struct list_head *list,
> +			bool locked)
>  {
>  	struct page *page;
> +	unsigned long __maybe_unused UP_flags;
> +
> +	/*
> +	 * spin_trylock is not necessary right now due to due to
> +	 * local_lock_irqsave and is a preparation step for
> +	 * a conversion to local_lock using the trylock to prevent
> +	 * IRQ re-entrancy. If pcp->lock cannot be acquired, the caller
> +	 * uses rmqueue_buddy.
> +	 *
> +	 * TODO: Convert local_lock_irqsave to local_lock. Care
> +	 * 	 is needed as the type of local_lock would need a
> +	 * 	 PREEMPT_RT version due to threaded IRQs.
> +	 */

I missed this in our previous conversation, but as far as RT is concerned
local_lock_irqsave() is the same as local_lock(), see in local_lock_internal.h:

#define __local_lock_irqsave(lock, flags)			\
	do {							\
		typecheck(unsigned long, flags);		\
		flags = 0;					\
		__local_lock(lock);				\
	} while (0)

Something similar happens with spin_lock_irqsave(), see in spinlock_rt.h:

#define spin_lock_irqsave(lock, flags)			 \
	do {						 \
		typecheck(unsigned long, flags);	 \
		flags = 0;				 \
		spin_lock(lock);			 \
	} while (0)

> IIUC, RT will run these code paths in threaded IRQ context, where we can think
of RT spinlocks as mutexes (plus some extra priority inheritance magic).

Regards,

-- 
Nicolás Sáenz


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock
  2022-04-26 16:42   ` Nicolas Saenz Julienne
@ 2022-04-26 16:48     ` Vlastimil Babka
  2022-04-29  9:13     ` Mel Gorman
  1 sibling, 0 replies; 21+ messages in thread
From: Vlastimil Babka @ 2022-04-26 16:48 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, Mel Gorman
  Cc: Marcelo Tosatti, Michal Hocko, LKML, Linux-MM

On 4/26/22 18:42, Nicolas Saenz Julienne wrote:
> On Wed, 2022-04-20 at 10:59 +0100, Mel Gorman wrote:
>> @@ -3082,15 +3093,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>>   */
>>  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>>  {
>> -	unsigned long flags;
>>  	int to_drain, batch;
>>  
>> -	local_lock_irqsave(&pagesets.lock, flags);
>>  	batch = READ_ONCE(pcp->batch);
>>  	to_drain = min(pcp->count, batch);
>> -	if (to_drain > 0)
>> +	if (to_drain > 0) {
>> +		unsigned long flags;
>> +
>> +		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
>> +		local_irq_save(flags);
> 
> Why dropping the local_lock? That approach is nicer to RT builds, and I don't
> think it makes a difference from a non-RT perspective.

I think the separate irq_disable+spin_lock here is actually broken on RT
configs, as explained in Documentation/locking/locktypes.rst; pcp->lock
would have to be a raw_spinlock_t.
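
As a minimal sketch of the constraint (illustration only, not code from
the series) -- on PREEMPT_RT a spinlock_t is a sleeping lock, so it must
not be acquired in a section running with hard IRQs disabled:

	/* Broken on PREEMPT_RT: spinlock_t may sleep with IRQs off. */
	local_irq_save(flags);
	spin_lock(&pcp->lock);
	/* ... list manipulation ... */
	spin_unlock(&pcp->lock);
	local_irq_restore(flags);

	/*
	 * Fine on both RT and !RT: on !RT this disables IRQs, on RT it
	 * maps to the sleeping lock without touching the IRQ state.
	 */
	spin_lock_irqsave(&pcp->lock, flags);
	/* ... list manipulation ... */
	spin_unlock_irqrestore(&pcp->lock, flags);

The other way out would be the raw_spinlock_t route mentioned above, if
disabling hard IRQs around the list manipulation were truly required.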

> That said, IIUC, this will eventually disappear with subsequent patches, right?

So it wouldn't be mergeable even as a temporary step.

> 
>> +
>> +		spin_lock(&pcp->lock);
>>  		free_pcppages_bulk(zone, to_drain, pcp, 0);
>> -	local_unlock_irqrestore(&pagesets.lock, flags);
>> +		spin_unlock(&pcp->lock);
>> +
>> +		local_irq_restore(flags);
>> +	}
>>  }
>>  #endif
>>  

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock
  2022-04-20  9:59 ` [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
  2022-04-20 14:02   ` Hillf Danton
  2022-04-26 16:42   ` Nicolas Saenz Julienne
@ 2022-04-26 19:24   ` Minchan Kim
  2022-04-29  9:05     ` Mel Gorman
  2 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2022-04-26 19:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM

On Wed, Apr 20, 2022 at 10:59:05AM +0100, Mel Gorman wrote:
> Currently the PCP lists are protected by using local_lock_irqsave to
> prevent migration and IRQ reentrancy but this is inconvenient. Remote
> draining of the lists is impossible and a workqueue is required and
> every task allocation/free must disable then enable interrupts which is
> expensive.
> 
> As preparation for dealing with both of those problems, protect the
> lists with a spinlock. The IRQ-unsafe version of the lock is used
> because IRQs are already disabled by local_lock_irqsave. spin_trylock
> is used in preparation for a time when local_lock could be used instead
> of local_lock_irqsave.
> 
> The per_cpu_pages still fits within the same number of cache lines after
> this patch relative to before the series.
> 
> struct per_cpu_pages {
>         spinlock_t                 lock;                 /*     0     4 */
>         int                        count;                /*     4     4 */
>         int                        high;                 /*     8     4 */
>         int                        batch;                /*    12     4 */
>         short int                  free_factor;          /*    16     2 */
>         short int                  expire;               /*    18     2 */
> 
>         /* XXX 4 bytes hole, try to pack */
> 
>         struct list_head           lists[13];            /*    24   208 */
> 
>         /* size: 256, cachelines: 4, members: 7 */
>         /* sum members: 228, holes: 1, sum holes: 4 */
>         /* padding: 24 */
> } __attribute__((__aligned__(64)));
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/mmzone.h |   1 +
>  mm/page_alloc.c        | 155 +++++++++++++++++++++++++++++++++++------
>  2 files changed, 136 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index abe530748de6..8b5757735428 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -385,6 +385,7 @@ enum zone_watermarks {
>  
>  /* Fields and list protected by pagesets local_lock in page_alloc.c */
>  struct per_cpu_pages {
> +	spinlock_t lock;	/* Protects lists field */
>  	int count;		/* number of pages in the list */
>  	int high;		/* high watermark, emptying needed */
>  	int batch;		/* chunk size for buddy add/remove */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dc0fdeb3795c..813c84b67c65 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -132,6 +132,17 @@ static DEFINE_PER_CPU(struct pagesets, pagesets) __maybe_unused = {
>  	.lock = INIT_LOCAL_LOCK(lock),
>  };
>  
> +#ifdef CONFIG_SMP
> +/* On SMP, spin_trylock is sufficient protection */
> +#define pcp_trylock_prepare(flags)	do { } while (0)
> +#define pcp_trylock_finish(flag)	do { } while (0)
> +#else
> +
> +/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */
> +#define pcp_trylock_prepare(flags)	local_irq_save(flags)
> +#define pcp_trylock_finish(flags)	local_irq_restore(flags)
> +#endif
> +
>  #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
>  DEFINE_PER_CPU(int, numa_node);
>  EXPORT_PER_CPU_SYMBOL(numa_node);
> @@ -3082,15 +3093,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>   */
>  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>  {
> -	unsigned long flags;
>  	int to_drain, batch;
>  
> -	local_lock_irqsave(&pagesets.lock, flags);
>  	batch = READ_ONCE(pcp->batch);
>  	to_drain = min(pcp->count, batch);
> -	if (to_drain > 0)
> +	if (to_drain > 0) {
> +		unsigned long flags;
> +
> +		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
> +		local_irq_save(flags);
> +
> +		spin_lock(&pcp->lock);
>  		free_pcppages_bulk(zone, to_drain, pcp, 0);
> -	local_unlock_irqrestore(&pagesets.lock, flags);
> +		spin_unlock(&pcp->lock);
> +
> +		local_irq_restore(flags);
> +	}
>  }
>  #endif
>  
> @@ -3103,16 +3121,21 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>   */
>  static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>  {
> -	unsigned long flags;
>  	struct per_cpu_pages *pcp;
>  
> -	local_lock_irqsave(&pagesets.lock, flags);
> -
>  	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
> -	if (pcp->count)
> +	if (pcp->count) {
> +		unsigned long flags;
> +
> +		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
> +		local_irq_save(flags);
> +
> +		spin_lock(&pcp->lock);
>  		free_pcppages_bulk(zone, pcp->count, pcp, 0);
> +		spin_unlock(&pcp->lock);
>  
> -	local_unlock_irqrestore(&pagesets.lock, flags);
> +		local_irq_restore(flags);
> +	}
>  }
>  
>  /*
> @@ -3380,18 +3403,30 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
>  	return min(READ_ONCE(pcp->batch) << 2, high);
>  }
>  
> -static void free_unref_page_commit(struct page *page, int migratetype,
> -				   unsigned int order)
> +/* Returns true if the page was committed to the per-cpu list. */
> +static bool free_unref_page_commit(struct page *page, int migratetype,
> +				   unsigned int order, bool locked)
>  {
>  	struct zone *zone = page_zone(page);
>  	struct per_cpu_pages *pcp;
>  	int high;
>  	int pindex;
>  	bool free_high;
> +	unsigned long __maybe_unused UP_flags;
>  
>  	__count_vm_event(PGFREE);
>  	pcp = this_cpu_ptr(zone->per_cpu_pageset);
>  	pindex = order_to_pindex(migratetype, order);
> +
> +	if (!locked) {
> +		/* Protect against a parallel drain. */
> +		pcp_trylock_prepare(UP_flags);
> +		if (!spin_trylock(&pcp->lock)) {
> +			pcp_trylock_finish(UP_flags);
> +			return false;
> +		}
> +	}
> +
>  	list_add(&page->pcp_list, &pcp->lists[pindex]);
>  	pcp->count += 1 << order;
>  
> @@ -3409,6 +3444,13 @@ static void free_unref_page_commit(struct page *page, int migratetype,
>  
>  		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch, free_high), pcp, pindex);
>  	}
> +
> +	if (!locked) {
> +		spin_unlock(&pcp->lock);
> +		pcp_trylock_finish(UP_flags);
> +	}
> +
> +	return true;
>  }
>  
>  /*
> @@ -3419,6 +3461,7 @@ void free_unref_page(struct page *page, unsigned int order)
>  	unsigned long flags;
>  	unsigned long pfn = page_to_pfn(page);
>  	int migratetype;
> +	bool freed_pcp = false;
>  
>  	if (!free_unref_page_prepare(page, pfn, order))
>  		return;
> @@ -3440,8 +3483,11 @@ void free_unref_page(struct page *page, unsigned int order)
>  	}
>  
>  	local_lock_irqsave(&pagesets.lock, flags);
> -	free_unref_page_commit(page, migratetype, order);
> +	freed_pcp = free_unref_page_commit(page, migratetype, order, false);
>  	local_unlock_irqrestore(&pagesets.lock, flags);
> +
> +	if (unlikely(!freed_pcp))
> +		free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
>  }
>  
>  /*
> @@ -3450,10 +3496,19 @@ void free_unref_page(struct page *page, unsigned int order)
>  void free_unref_page_list(struct list_head *list)
>  {
>  	struct page *page, *next;
> +	struct per_cpu_pages *pcp;
> +	struct zone *locked_zone;
>  	unsigned long flags;
>  	int batch_count = 0;
>  	int migratetype;
>  
> +	/*
> +	 * An empty list is possible. Check early so that the later
> +	 * lru_to_page() does not potentially read garbage.
> +	 */
> +	if (list_empty(list))
> +		return;
> +
>  	/* Prepare pages for freeing */
>  	list_for_each_entry_safe(page, next, list, lru) {
>  		unsigned long pfn = page_to_pfn(page);
> @@ -3474,8 +3529,26 @@ void free_unref_page_list(struct list_head *list)
>  		}
>  	}
>  
> +	VM_BUG_ON(in_hardirq());

You need to check the list_empty here again and bail out if it is.

> +
>  	local_lock_irqsave(&pagesets.lock, flags);
> +
> +	page = lru_to_page(list);

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/6] Drain remote per-cpu directly
  2022-04-26 11:06   ` Nicolas Saenz Julienne
@ 2022-04-27 15:21     ` Marcelo Tosatti
  0 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2022-04-27 15:21 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Minchan Kim, Mel Gorman, Vlastimil Babka, Michal Hocko, LKML, Linux-MM

On Tue, Apr 26, 2022 at 01:06:12PM +0200, Nicolas Saenz Julienne wrote:
> On Mon, 2022-04-25 at 15:58 -0700, Minchan Kim wrote:
> > On Wed, Apr 20, 2022 at 10:59:00AM +0100, Mel Gorman wrote:
> > > This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> > > per-cpu lists drain support" -- avoid interference of a high priority
> > > task due to a workqueue item draining per-cpu page lists. While many
> > > workloads can tolerate a brief interruption, it may cause a real-time
> > > task running on a NOHZ_FULL CPU to miss a deadline and, at minimum,
> > > the draining is non-deterministic.
> > 
> > Yeah, the non-determinism is a problem. I saw the kworker-based draining
> > take 100+ms (up to 300ms observed) sometimes in alloc_contig_range if CPUs
> > are heavily loaded.
> > 
> > I am not sure whether Nicolas has already observed this: it's not only a
> > problem of per_cpu_pages but also of lru_pvecs (pagevec) draining.
> > Do we need to introduce a similar solution (allowing remote draining with
> > a spin_lock) for pagevecs?
> 
> Yes, I'm aware of the lru problem. I'll start working on it too once we're done
> with the page allocator (and if no-one beats me to it). That said, I don't know
> if we can apply the exact same approach, the devil is in the details. :)

I think one necessary step for that (adding a spinlock to protect the
per-CPU lru_pvecs) would be to find a suitable testcase.

Mel, do you have anything in mind ?



> 
> -- 
> Nicolás Sáenz
> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock
  2022-04-26 19:24   ` Minchan Kim
@ 2022-04-29  9:05     ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2022-04-29  9:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Nicolas Saenz Julienne, Marcelo Tosatti, Vlastimil Babka,
	Michal Hocko, LKML, Linux-MM

On Tue, Apr 26, 2022 at 12:24:56PM -0700, Minchan Kim wrote:
> > @@ -3450,10 +3496,19 @@ void free_unref_page(struct page *page, unsigned int order)
> >  void free_unref_page_list(struct list_head *list)
> >  {
> >  	struct page *page, *next;
> > +	struct per_cpu_pages *pcp;
> > +	struct zone *locked_zone;
> >  	unsigned long flags;
> >  	int batch_count = 0;
> >  	int migratetype;
> >  
> > +	/*
> > +	 * An empty list is possible. Check early so that the later
> > +	 * lru_to_page() does not potentially read garbage.
> > +	 */
> > +	if (list_empty(list))
> > +		return;
> > +
> >  	/* Prepare pages for freeing */
> >  	list_for_each_entry_safe(page, next, list, lru) {
> >  		unsigned long pfn = page_to_pfn(page);
> > @@ -3474,8 +3529,26 @@ void free_unref_page_list(struct list_head *list)
> >  		}
> >  	}
> >  
> > +	VM_BUG_ON(in_hardirq());
> 
> You need to check the list_empty here again and bail out if it is.
> 

You're right, every page could have failed to prepare or have been isolated.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock
  2022-04-26 16:42   ` Nicolas Saenz Julienne
  2022-04-26 16:48     ` Vlastimil Babka
@ 2022-04-29  9:13     ` Mel Gorman
  1 sibling, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2022-04-29  9:13 UTC (permalink / raw)
  To: Nicolas Saenz Julienne
  Cc: Marcelo Tosatti, Vlastimil Babka, Michal Hocko, Minchan Kim,
	LKML, Linux-MM

On Tue, Apr 26, 2022 at 06:42:43PM +0200, Nicolas Saenz Julienne wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index dc0fdeb3795c..813c84b67c65 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -132,6 +132,17 @@ static DEFINE_PER_CPU(struct pagesets, pagesets) __maybe_unused = {
> >  	.lock = INIT_LOCAL_LOCK(lock),
> >  };
> >  
> > +#ifdef CONFIG_SMP
> > +/* On SMP, spin_trylock is sufficient protection */
> > +#define pcp_trylock_prepare(flags)	do { } while (0)
> > +#define pcp_trylock_finish(flag)	do { } while (0)
> > +#else
> > +
> > +/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */
> 
> This is only needed on non-RT kernels. RT UP builds go through
> rt_spin_trylock(), which behaves like its SMP counterpart.
> 

I missed that.

> > +#define pcp_trylock_prepare(flags)	local_irq_save(flags)
> > +#define pcp_trylock_finish(flags)	local_irq_restore(flags)
> > +#endif
> > +
> >  #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
> >  DEFINE_PER_CPU(int, numa_node);
> >  EXPORT_PER_CPU_SYMBOL(numa_node);
> > @@ -3082,15 +3093,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> >   */
> >  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
> >  {
> > -	unsigned long flags;
> >  	int to_drain, batch;
> >  
> > -	local_lock_irqsave(&pagesets.lock, flags);
> >  	batch = READ_ONCE(pcp->batch);
> >  	to_drain = min(pcp->count, batch);
> > -	if (to_drain > 0)
> > +	if (to_drain > 0) {
> > +		unsigned long flags;
> > +
> > +		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
> > +		local_irq_save(flags);
> 
> Why dropping the local_lock? That approach is nicer to RT builds, and I don't
> think it makes a difference from a non-RT perspective.
> 

The local_lock was dropped because local_lock only stabilises the lookup
of the local pcp. In this context we are looking at a remote pcp, which
the local CPU's local_lock does not protect, so keeping it would only
obscure what local_lock is actually protecting.
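
To illustrate the distinction (sketch only, not code from the series):

	/*
	 * Local access: local_lock pins the task to its CPU, so the
	 * this_cpu_ptr() lookup stays stable for the critical section.
	 */
	local_lock_irqsave(&pagesets.lock, flags);
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	/* ... manipulate pcp->lists ... */
	local_unlock_irqrestore(&pagesets.lock, flags);

	/*
	 * Remote access: the pcp is looked up by CPU id. The draining
	 * CPU's pagesets.lock says nothing about that remote structure;
	 * only pcp->lock protects its lists.
	 */
	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
	spin_lock_irqsave(&pcp->lock, flags);
	/* ... free_pcppages_bulk() ... */
	spin_unlock_irqrestore(&pcp->lock, flags);

This is also why patch 6 can drain a remote list directly instead of
bouncing a work item to the remote CPU.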

> That said, IIUC, this will eventually disappear with subsequent patches, right?
> 

It would have, but this patch was also wrong, as pointed out by Vlastimil.
So even if it eventually became correct, it would not be safe to add this
patch as-is; spin_lock_irqsave is required.

> > +
> > +		spin_lock(&pcp->lock);
> >  		free_pcppages_bulk(zone, to_drain, pcp, 0);
> > -	local_unlock_irqrestore(&pagesets.lock, flags);
> > +		spin_unlock(&pcp->lock);
> > +
> > +		local_irq_restore(flags);
> > +	}
> >  }
> >  #endif
> >  
> 
> [...]
> 
> > @@ -3668,9 +3748,30 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
> >  			int migratetype,
> >  			unsigned int alloc_flags,
> >  			struct per_cpu_pages *pcp,
> > -			struct list_head *list)
> > +			struct list_head *list,
> > +			bool locked)
> >  {
> >  	struct page *page;
> > +	unsigned long __maybe_unused UP_flags;
> > +
> > +	/*
> > +	 * spin_trylock is not necessary right now due to
> > +	 * local_lock_irqsave and is a preparation step for
> > +	 * a conversion to local_lock using the trylock to prevent
> > +	 * IRQ re-entrancy. If pcp->lock cannot be acquired, the caller
> > +	 * uses rmqueue_buddy.
> > +	 *
> > +	 * TODO: Convert local_lock_irqsave to local_lock. Care
> > +	 * 	 is needed as the type of local_lock would need a
> > +	 * 	 PREEMPT_RT version due to threaded IRQs.
> > +	 */
> 
> I missed this in our previous conversation, but as far as RT is concerned
> local_lock_irqsave() is the same as local_lock(), see in local_lock_internal.h:
> 
> #define __local_lock_irqsave(lock, flags)			\
> 	do {							\
> 		typecheck(unsigned long, flags);		\
> 		flags = 0;					\
> 		__local_lock(lock);				\
> 	} while (0)
> 
> Something similar happens with spin_lock_irqsave(), see in spinlock_rt.h:
> 
> #define spin_lock_irqsave(lock, flags)			 \
> 	do {						 \
> 		typecheck(unsigned long, flags);	 \
> 		flags = 0;				 \
> 		spin_lock(lock);			 \
> 	} while (0)
> 
> IIUC, RT will run this code paths in threaded IRQ context, where we can think
> of RT spinlocks as mutexes (plus some extra priority inheritance magic).
> 

Ah, thanks for the explanation.

Based on your feedback, Vlastimil's and Minchan's, the current delta
between this series and what I'm preparing for testing is

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17d11eb0413e..4ac39d30ec8f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -132,8 +132,11 @@ static DEFINE_PER_CPU(struct pagesets, pagesets) __maybe_unused = {
 	.lock = INIT_LOCAL_LOCK(lock),
 };
 
-#ifdef CONFIG_SMP
-/* On SMP, spin_trylock is sufficient protection */
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
+/*
+ * On SMP, spin_trylock is sufficient protection.
+ * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
+ */
 #define pcp_trylock_prepare(flags)	do { } while (0)
 #define pcp_trylock_finish(flag)	do { } while (0)
 #else
@@ -3091,14 +3094,14 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 	if (to_drain > 0) {
 		unsigned long flags;
 
-		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
-		local_irq_save(flags);
-
-		spin_lock(&pcp->lock);
+		/*
+		 * free_pcppages_bulk expects IRQs disabled for zone->lock
+		 * so even though pcp->lock is not intended to be IRQ-safe,
+		 * it's needed in this context.
+		 */
+		spin_lock_irqsave(&pcp->lock, flags);
 		free_pcppages_bulk(zone, to_drain, pcp, 0);
-		spin_unlock(&pcp->lock);
-
-		local_irq_restore(flags);
+		spin_unlock_irqrestore(&pcp->lock, flags);
 	}
 }
 #endif
@@ -3114,14 +3117,10 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 	if (pcp->count) {
 		unsigned long flags;
 
-		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
-		local_irq_save(flags);
-
-		spin_lock(&pcp->lock);
+		/* See drain_zone_pages on why this is disabling IRQs */
+		spin_lock_irqsave(&pcp->lock, flags);
 		free_pcppages_bulk(zone, pcp->count, pcp, 0);
-		spin_unlock(&pcp->lock);
-
-		local_irq_restore(flags);
+		spin_unlock_irqrestore(&pcp->lock, flags);
 	}
 }
 
@@ -3480,6 +3479,13 @@ void free_unref_page_list(struct list_head *list)
 		}
 	}
 
+	/*
+	 * Preparation could have drained the list due to failing to prepare
+	 * or all pages are being isolated.
+	 */
+	if (list_empty(list))
+		return;
+
 	VM_BUG_ON(in_hardirq());
 
 	local_lock_irqsave(&pagesets.lock, flags);
@@ -3515,6 +3521,9 @@ void free_unref_page_list(struct list_head *list)
 		 * allocator directly. This is expensive as the zone lock will
 		 * be acquired multiple times but if a drain is in progress
 		 * then an expensive operation is already taking place.
+		 *
+		 * TODO: Always false at the moment due to local_lock_irqsave
+		 *       and is preparation for converting to local_lock.
 		 */
 		if (unlikely(!free_unref_page_commit(page, migratetype, 0, true)))
 			free_one_page(page_zone(page), page, page_to_pfn(page), 0, migratetype, FPI_NONE);
@@ -3717,9 +3726,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 	 * IRQ re-entrancy. If pcp->lock cannot be acquired, the caller
 	 * uses rmqueue_buddy.
 	 *
-	 * TODO: Convert local_lock_irqsave to local_lock. Care
-	 * 	 is needed as the type of local_lock would need a
-	 * 	 PREEMPT_RT version due to threaded IRQs.
+	 * TODO: Convert local_lock_irqsave to local_lock.
 	 */
 	if (unlikely(!locked)) {
 		pcp_trylock_prepare(UP_flags);

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-04-29  9:13 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-20  9:59 [RFC PATCH 0/6] Drain remote per-cpu directly Mel Gorman
2022-04-20  9:59 ` [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list Mel Gorman
2022-04-20 20:43   ` Matthew Wilcox
2022-04-21  8:38     ` Mel Gorman
2022-04-20  9:59 ` [PATCH 2/6] mm/page_alloc: Use only one PCP list for THP-sized allocations Mel Gorman
2022-04-20  9:59 ` [PATCH 3/6] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper Mel Gorman
2022-04-20  9:59 ` [PATCH 4/6] mm/page_alloc: Remove unnecessary page == NULL check in rmqueue Mel Gorman
2022-04-20  9:59 ` [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
2022-04-20 14:02   ` Hillf Danton
2022-04-20 14:35     ` Nicolas Saenz Julienne
2022-04-26 16:42   ` Nicolas Saenz Julienne
2022-04-26 16:48     ` Vlastimil Babka
2022-04-29  9:13     ` Mel Gorman
2022-04-26 19:24   ` Minchan Kim
2022-04-29  9:05     ` Mel Gorman
2022-04-20  9:59 ` [PATCH 6/6] mm/page_alloc: Remotely drain per-cpu lists Mel Gorman
2022-04-25 22:58 ` [RFC PATCH 0/6] Drain remote per-cpu directly Minchan Kim
2022-04-26 11:06   ` Nicolas Saenz Julienne
2022-04-27 15:21     ` Marcelo Tosatti
2022-04-26  2:49 ` Suren Baghdasaryan
2022-04-26  6:30   ` Suren Baghdasaryan
