* [PATCH v3 0/9] mm: introduce Designated Movable Blocks
@ 2022-10-20 21:53 Doug Berger
From: Doug Berger @ 2022-10-20 21:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mel Gorman, Mike Kravetz, Florian Fainelli,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, linux-doc,
	linux-kernel, linux-mm, Doug Berger

MOTIVATION:
Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
controllers with each mapped in a different address range within
a Uniform Memory Architecture. Some users of these systems have
expressed the desire to locate ZONE_MOVABLE memory on each
memory controller so that user-space-intensive processing can
make better use of the additional memory bandwidth.
Unfortunately, the historical monotonic layout of zones would
mean that if the lowest-addressed memory controller contains
ZONE_MOVABLE memory, then all of the memory available from
memory controllers at higher addresses must also be in the
ZONE_MOVABLE zone. This would force all kernel memory accesses
onto the lowest-addressed memory controller and significantly
reduce the amount of memory available for non-movable
allocations.

The main objective of this patch set is therefore to allow a
block of memory to be designated as part of the ZONE_MOVABLE
zone, where it will only ever be used by the kernel page
allocator to satisfy requests for movable pages. The term
Designated Movable Block is introduced here to represent such a
block. The favored implementation extends the 'movablecore'
kernel parameter to accept a base address and to support
multiple blocks. The existing 'movablecore' mechanisms are
retained.
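As a hedged illustration (the exact accepted syntax is only defined later
in this series by the "allow base for movablecore" patch, so the size@base
form below is an assumption for illustration), a boot command line could
designate a Designated Movable Block at a chosen physical address:

```
movablecore=256M@0x80000000
```

The existing size-only form (e.g. movablecore=256M) would continue to
behave as before.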

BACKGROUND:
NUMA architectures support distributing movablecore memory
across each node, but it is undesirable to introduce the
overhead and complexities of NUMA on systems that don't have a
Non-Uniform Memory Architecture.

Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror option")
also depends on zone overlap to support systems with multiple
mirrored ranges.

Commit c6f03e2903c9 ("mm, memory_hotplug: remove zone restrictions")
embraced overlapped zones for memory hotplug.

This commit set follows their lead to allow the ZONE_MOVABLE
zone to overlap other zones. Designated Movable Blocks are made
absent from overlapping zones and present within the
ZONE_MOVABLE zone.

I initially investigated an implementation using a Designated
Movable migrate type, in line with comments [1] made by Mel Gorman
regarding a "sticky" MIGRATE_MOVABLE type to avoid using
ZONE_MOVABLE. However, that approach was riskier since it was
much more intrusive on the allocation paths. Ultimately, the
progress made by the memory hotplug folks to expand the
ZONE_MOVABLE functionality convinced me to follow this approach.

Changes in v3:
  - removed OTHER OPPORTUNITIES and NOTES from this cover letter.
  - prevent the creation of empty zones instead of adding extra
    info to zoneinfo.
  - size the ZONE_MOVABLE span to the minimum necessary to cover
    pages within the zone to be more intuitive.
  - removed "real" from variable names that were consolidated.
  - rebased to akpm-mm/master (i.e. Linux 6.1-rc1).

Changes in v2:
  - first three commits upstreamed separately [3], [4], and [5].
  - commits 04-06 submitted separately [6].
  - Corrected errors "Reported-by: kernel test robot <lkp@intel.com>"
  - Deferred commits after 15 to simplify review of the base
    functionality.
  - minor reorganization of commit 13.

v2: https://lore.kernel.org/linux-mm/20220928223301.375229-1-opendmb@gmail.com/ 
v1: https://lore.kernel.org/linux-mm/20220913195508.3511038-1-opendmb@gmail.com/

[1] https://lore.kernel.org/lkml/20160428103927.GM2858@techsingularity.net/
[2] https://lore.kernel.org/lkml/1401260672-28339-1-git-send-email-iamjoonsoo.kim@lge.com
[3] https://lore.kernel.org/linux-mm/20220914023913.1855924-1-zi.yan@sent.com
[4] https://lore.kernel.org/linux-mm/20220823030209.57434-2-linmiaohe@huawei.com
[5] https://lore.kernel.org/linux-mm/20220914190917.3517663-1-opendmb@gmail.com
[6] https://lore.kernel.org/linux-mm/20220921223639.1152392-1-opendmb@gmail.com/

Doug Berger (9):
  lib/show_mem.c: display MovableOnly
  mm/page_alloc: calculate node_spanned_pages from pfns
  mm/page_alloc: prevent creation of empty zones
  mm/page_alloc.c: allow oversized movablecore
  mm/page_alloc: introduce init_reserved_pageblock()
  memblock: introduce MEMBLOCK_MOVABLE flag
  mm/dmb: Introduce Designated Movable Blocks
  mm/page_alloc: make alloc_contig_pages DMB aware
  mm/page_alloc: allow base for movablecore

 .../admin-guide/kernel-parameters.txt         |  14 +-
 include/linux/dmb.h                           |  29 +++
 include/linux/gfp.h                           |   5 +-
 include/linux/memblock.h                      |   8 +
 lib/show_mem.c                                |   2 +-
 mm/Kconfig                                    |  12 ++
 mm/Makefile                                   |   1 +
 mm/cma.c                                      |  15 +-
 mm/dmb.c                                      |  91 +++++++++
 mm/memblock.c                                 |  30 ++-
 mm/page_alloc.c                               | 188 +++++++++++++-----
 11 files changed, 338 insertions(+), 57 deletions(-)
 create mode 100644 include/linux/dmb.h
 create mode 100644 mm/dmb.c

-- 
2.25.1



* [PATCH v3 1/9] lib/show_mem.c: display MovableOnly
From: Doug Berger @ 2022-10-20 21:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mel Gorman, Mike Kravetz, Florian Fainelli,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, linux-doc,
	linux-kernel, linux-mm, Doug Berger

The comment for commit c78e93630d15 ("mm: do not walk all of
system memory during show_mem") indicates it "also corrects the
reporting of HighMem as HighMem/MovableOnly as ZONE_MOVABLE has
similar problems to HighMem with respect to lowmem/highmem
exhaustion."

Presuming the similar problems are with regard to the general
exclusion of kernel allocations from either zone, I believe it
makes sense to include all ZONE_MOVABLE memory even on systems
without HighMem.

To the extent that this was the intent of the original commit, I
have included a "Fixes" tag, but it seems unnecessary to submit
this to linux-stable.

Fixes: c78e93630d15 ("mm: do not walk all of system memory during show_mem")
Signed-off-by: Doug Berger <opendmb@gmail.com>
---
 lib/show_mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/show_mem.c b/lib/show_mem.c
index 0d7585cde2a6..6a632b0c35c5 100644
--- a/lib/show_mem.c
+++ b/lib/show_mem.c
@@ -27,7 +27,7 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
 			total += zone->present_pages;
 			reserved += zone->present_pages - zone_managed_pages(zone);
 
-			if (is_highmem_idx(zoneid))
+			if (zoneid == ZONE_MOVABLE || is_highmem_idx(zoneid))
 				highmem += zone->present_pages;
 		}
 	}
-- 
2.25.1



* [PATCH v3 2/9] mm/page_alloc: calculate node_spanned_pages from pfns
From: Doug Berger @ 2022-10-20 21:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mel Gorman, Mike Kravetz, Florian Fainelli,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, linux-doc,
	linux-kernel, linux-mm, Doug Berger

Since the start and end pfns of the node are passed as arguments
to calculate_node_totalpages(), they might as well be used to
specify the node_spanned_pages value for the node rather than
accumulating the spans of member zones.

This removes the need for additional adjustments if zones are
allowed to overlap.

The realtotalpages name is reverted to just totalpages to reduce
the burden of supporting multiple realities.

Signed-off-by: Doug Berger <opendmb@gmail.com>
---
 mm/page_alloc.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e20ade858e71..92908c51f1c3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7542,7 +7542,7 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
 						unsigned long node_start_pfn,
 						unsigned long node_end_pfn)
 {
-	unsigned long realtotalpages = 0, totalpages = 0;
+	unsigned long totalpages = 0;
 	enum zone_type i;
 
 	for (i = 0; i < MAX_NR_ZONES; i++) {
@@ -7573,13 +7573,12 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
 		zone->present_early_pages = real_size;
 #endif
 
-		totalpages += size;
-		realtotalpages += real_size;
+		totalpages += real_size;
 	}
 
-	pgdat->node_spanned_pages = totalpages;
-	pgdat->node_present_pages = realtotalpages;
-	pr_debug("On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
+	pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
+	pgdat->node_present_pages = totalpages;
+	pr_debug("On node %d totalpages: %lu\n", pgdat->node_id, totalpages);
 }
 
 #ifndef CONFIG_SPARSEMEM
-- 
2.25.1



* [PATCH v3 3/9] mm/page_alloc: prevent creation of empty zones
From: Doug Berger @ 2022-10-20 21:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mel Gorman, Mike Kravetz, Florian Fainelli,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, linux-doc,
	linux-kernel, linux-mm, Doug Berger

If none of the pages a zone spans are present then its start pfn
and span should be zeroed to prevent initialization.

This prevents the creation of an empty zone if all of its pages
are moved to a zone that would overlap it.

The real_size name is reverted to just size to reduce the burden
of supporting multiple realities.

Signed-off-by: Doug Berger <opendmb@gmail.com>
---
 mm/page_alloc.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 92908c51f1c3..2f7b88d78bc2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7548,8 +7548,7 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 		unsigned long zone_start_pfn, zone_end_pfn;
-		unsigned long spanned, absent;
-		unsigned long size, real_size;
+		unsigned long spanned, absent, size;
 
 		spanned = zone_spanned_pages_in_node(pgdat->node_id, i,
 						     node_start_pfn,
@@ -7560,20 +7559,21 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
 						   node_start_pfn,
 						   node_end_pfn);
 
-		size = spanned;
-		real_size = size - absent;
+		size = spanned - absent;
 
-		if (size)
+		if (size) {
 			zone->zone_start_pfn = zone_start_pfn;
-		else
+		} else {
+			spanned = 0;
 			zone->zone_start_pfn = 0;
-		zone->spanned_pages = size;
-		zone->present_pages = real_size;
+		}
+		zone->spanned_pages = spanned;
+		zone->present_pages = size;
 #if defined(CONFIG_MEMORY_HOTPLUG)
-		zone->present_early_pages = real_size;
+		zone->present_early_pages = size;
 #endif
 
-		totalpages += real_size;
+		totalpages += size;
 	}
 
 	pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
-- 
2.25.1



* [PATCH v3 4/9] mm/page_alloc.c: allow oversized movablecore
From: Doug Berger @ 2022-10-20 21:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mel Gorman, Mike Kravetz, Florian Fainelli,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, linux-doc,
	linux-kernel, linux-mm, Doug Berger

Now that the error in computation of corepages has been corrected
by commit 9fd745d450e7 ("mm: fix overflow in
find_zone_movable_pfns_for_nodes()"), oversized specifications of
movablecore will result in a zero value for required_kernelcore if
it is not also specified.

It is unintuitive for such a request to lead to no ZONE_MOVABLE
memory when the kernel parameters are clearly requesting some.

The current behavior when requesting an oversized kernelcore is to
classify all of the pages in movable_zone as kernelcore. The new
behavior when requesting an oversized movablecore (when not also
specifying kernelcore) is to similarly classify all of the pages
in movable_zone as movablecore.

Signed-off-by: Doug Berger <opendmb@gmail.com>
---
 mm/page_alloc.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2f7b88d78bc2..a4c2c157bacf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8121,13 +8121,13 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 		corepages = totalpages - required_movablecore;
 
 		required_kernelcore = max(required_kernelcore, corepages);
+	} else if (!required_kernelcore) {
+		/* If kernelcore was not specified, there is no ZONE_MOVABLE */
+		goto out;
 	}
 
-	/*
-	 * If kernelcore was not specified or kernelcore size is larger
-	 * than totalpages, there is no ZONE_MOVABLE.
-	 */
-	if (!required_kernelcore || required_kernelcore >= totalpages)
+	/* If kernelcore size exceeds totalpages, there is no ZONE_MOVABLE */
+	if (required_kernelcore >= totalpages)
 		goto out;
 
 	/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
-- 
2.25.1



* [PATCH v3 5/9] mm/page_alloc: introduce init_reserved_pageblock()
From: Doug Berger @ 2022-10-20 21:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mel Gorman, Mike Kravetz, Florian Fainelli,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, linux-doc,
	linux-kernel, linux-mm, Doug Berger

Most of the implementation of init_cma_reserved_pageblock() is
common to the initialization of any reserved pageblock for use
by the page allocator.

This commit breaks that functionality out into the new common
function init_reserved_pageblock() for use by code other than
CMA. The CMA-specific code is relocated from page_alloc to the
point where init_cma_reserved_pageblock() was invoked, and the
new function is used there instead. The error path is also
updated to use the function to operate on pageblocks rather
than pages.

Signed-off-by: Doug Berger <opendmb@gmail.com>
---
 include/linux/gfp.h |  5 +----
 mm/cma.c            | 15 +++++++++++----
 mm/page_alloc.c     |  8 ++------
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ef4aea3b356e..6d66193f336d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -347,9 +347,6 @@ extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
 #endif
 void free_contig_range(unsigned long pfn, unsigned long nr_pages);
 
-#ifdef CONFIG_CMA
-/* CMA stuff */
-extern void init_cma_reserved_pageblock(struct page *page);
-#endif
+extern void init_reserved_pageblock(struct page *page);
 
 #endif /* __LINUX_GFP_H */
diff --git a/mm/cma.c b/mm/cma.c
index 4a978e09547a..6208a3e1cd9d 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -31,6 +31,7 @@
 #include <linux/highmem.h>
 #include <linux/io.h>
 #include <linux/kmemleak.h>
+#include <linux/page-isolation.h>
 #include <trace/events/cma.h>
 
 #include "cma.h"
@@ -116,8 +117,13 @@ static void __init cma_activate_area(struct cma *cma)
 	}
 
 	for (pfn = base_pfn; pfn < base_pfn + cma->count;
-	     pfn += pageblock_nr_pages)
-		init_cma_reserved_pageblock(pfn_to_page(pfn));
+	     pfn += pageblock_nr_pages) {
+		struct page *page = pfn_to_page(pfn);
+
+		set_pageblock_migratetype(page, MIGRATE_CMA);
+		init_reserved_pageblock(page);
+		page_zone(page)->cma_pages += pageblock_nr_pages;
+	}
 
 	spin_lock_init(&cma->lock);
 
@@ -133,8 +139,9 @@ static void __init cma_activate_area(struct cma *cma)
 out_error:
 	/* Expose all pages to the buddy, they are useless for CMA. */
 	if (!cma->reserve_pages_on_error) {
-		for (pfn = base_pfn; pfn < base_pfn + cma->count; pfn++)
-			free_reserved_page(pfn_to_page(pfn));
+		for (pfn = base_pfn; pfn < base_pfn + cma->count;
+		     pfn += pageblock_nr_pages)
+			init_reserved_pageblock(pfn_to_page(pfn));
 	}
 	totalcma_pages -= cma->count;
 	cma->count = 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a4c2c157bacf..d7a5a05ead4b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2308,9 +2308,8 @@ void __init page_alloc_init_late(void)
 		set_zone_contiguous(zone);
 }
 
-#ifdef CONFIG_CMA
-/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
-void __init init_cma_reserved_pageblock(struct page *page)
+/* Free whole pageblock */
+void __init init_reserved_pageblock(struct page *page)
 {
 	unsigned i = pageblock_nr_pages;
 	struct page *p = page;
@@ -2320,14 +2319,11 @@ void __init init_cma_reserved_pageblock(struct page *page)
 		set_page_count(p, 0);
 	} while (++p, --i);
 
-	set_pageblock_migratetype(page, MIGRATE_CMA);
 	set_page_refcounted(page);
 	__free_pages(page, pageblock_order);
 
 	adjust_managed_page_count(page, pageblock_nr_pages);
-	page_zone(page)->cma_pages += pageblock_nr_pages;
 }
-#endif
 
 /*
  * The order of subdivision here is critical for the IO subsystem.
-- 
2.25.1



* [PATCH v3 6/9] memblock: introduce MEMBLOCK_MOVABLE flag
From: Doug Berger @ 2022-10-20 21:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mel Gorman, Mike Kravetz, Florian Fainelli,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, linux-doc,
	linux-kernel, linux-mm, Doug Berger

The MEMBLOCK_MOVABLE flag is introduced to designate a memblock
as only supporting movable allocations by the page allocator.

Signed-off-by: Doug Berger <opendmb@gmail.com>
---
 include/linux/memblock.h |  8 ++++++++
 mm/memblock.c            | 24 ++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 50ad19662a32..8eb3ca32dfa7 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -47,6 +47,7 @@ enum memblock_flags {
 	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
 	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
 	MEMBLOCK_DRIVER_MANAGED = 0x8,	/* always detected via a driver */
+	MEMBLOCK_MOVABLE	= 0x10,	/* designated movable block */
 };
 
 /**
@@ -125,6 +126,8 @@ int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
 int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
 int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
+int memblock_mark_movable(phys_addr_t base, phys_addr_t size);
+int memblock_clear_movable(phys_addr_t base, phys_addr_t size);
 
 void memblock_free_all(void);
 void memblock_free(void *ptr, size_t size);
@@ -265,6 +268,11 @@ static inline bool memblock_is_driver_managed(struct memblock_region *m)
 	return m->flags & MEMBLOCK_DRIVER_MANAGED;
 }
 
+static inline bool memblock_is_movable(struct memblock_region *m)
+{
+	return m->flags & MEMBLOCK_MOVABLE;
+}
+
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
 			    unsigned long  *end_pfn);
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index 511d4783dcf1..46a4deb07d92 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -979,6 +979,30 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
 	return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
 }
 
+/**
+ * memblock_mark_movable - Mark designated movable block with MEMBLOCK_MOVABLE.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_mark_movable(phys_addr_t base, phys_addr_t size)
+{
+	return memblock_setclr_flag(base, size, 1, MEMBLOCK_MOVABLE);
+}
+
+/**
+ * memblock_clear_movable - Clear flag MEMBLOCK_MOVABLE for a specified region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_clear_movable(phys_addr_t base, phys_addr_t size)
+{
+	return memblock_setclr_flag(base, size, 0, MEMBLOCK_MOVABLE);
+}
+
 static bool should_skip_region(struct memblock_type *type,
 			       struct memblock_region *m,
 			       int nid, int flags)
-- 
2.25.1



* [PATCH v3 7/9] mm/dmb: Introduce Designated Movable Blocks
From: Doug Berger @ 2022-10-20 21:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mel Gorman, Mike Kravetz, Florian Fainelli,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, linux-doc,
	linux-kernel, linux-mm, Doug Berger

Designated Movable Blocks are blocks of memory that are composed
of one or more adjacent memblocks that have the MEMBLOCK_MOVABLE
designation. These blocks must be reserved before receiving that
designation and will be located in the ZONE_MOVABLE zone rather
than any other zone that may span them.

Signed-off-by: Doug Berger <opendmb@gmail.com>
---
 include/linux/dmb.h | 29 ++++++++++++++
 mm/Kconfig          | 12 ++++++
 mm/Makefile         |  1 +
 mm/dmb.c            | 91 +++++++++++++++++++++++++++++++++++++++++++
 mm/memblock.c       |  6 ++-
 mm/page_alloc.c     | 95 ++++++++++++++++++++++++++++++++++++++-------
 6 files changed, 220 insertions(+), 14 deletions(-)
 create mode 100644 include/linux/dmb.h
 create mode 100644 mm/dmb.c

diff --git a/include/linux/dmb.h b/include/linux/dmb.h
new file mode 100644
index 000000000000..fa2976c0fa21
--- /dev/null
+++ b/include/linux/dmb.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __DMB_H__
+#define __DMB_H__
+
+#include <linux/memblock.h>
+
+/*
+ * Because the buddy allocator -- especially pageblock merging and
+ * alloc_contig_range() -- can deal with only some pageblocks of a
+ * higher-order page being MIGRATE_MOVABLE, we can use pageblock_nr_pages.
+ */
+#define DMB_MIN_ALIGNMENT_PAGES pageblock_nr_pages
+#define DMB_MIN_ALIGNMENT_BYTES (PAGE_SIZE * DMB_MIN_ALIGNMENT_PAGES)
+
+enum {
+	DMB_DISJOINT = 0,
+	DMB_INTERSECTS,
+	DMB_MIXED,
+};
+
+struct dmb;
+
+extern int dmb_intersects(unsigned long spfn, unsigned long epfn);
+
+extern int dmb_reserve(phys_addr_t base, phys_addr_t size,
+		       struct dmb **res_dmb);
+extern void dmb_init_region(struct memblock_region *region);
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 57e1d8c5b505..3a058b34e77e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -868,6 +868,18 @@ config CMA_AREAS
 
 	  If unsure, leave the default value "7" in UMA and "19" in NUMA.
 
+config DMB_COUNT
+	int "Maximum count of Designated Movable Blocks"
+	default 19 if NUMA
+	default 7
+	help
+	  Designated Movable Blocks are blocks of memory that can be used
+	  by the page allocator exclusively for movable pages. They are
+	  managed in ZONE_MOVABLE but may overlap with other zones. This
+	  parameter sets the maximum number of DMBs in the system.
+
+	  If unsure, leave the default value "7" in UMA and "19" in NUMA.
+
 config MEM_SOFT_DIRTY
 	bool "Track memory changes"
 	depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY && PROC_FS
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..824be8fb11cd 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -67,6 +67,7 @@ obj-y += page-alloc.o
 obj-y += init-mm.o
 obj-y += memblock.o
 obj-y += $(memory-hotplug-y)
+obj-y += dmb.o
 
 ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
diff --git a/mm/dmb.c b/mm/dmb.c
new file mode 100644
index 000000000000..f6c4e2662e0f
--- /dev/null
+++ b/mm/dmb.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Designated Movable Block
+ */
+
+#define pr_fmt(fmt) "dmb: " fmt
+
+#include <linux/dmb.h>
+
+struct dmb {
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+};
+
+static struct dmb dmb_areas[CONFIG_DMB_COUNT];
+static unsigned int dmb_area_count;
+
+int dmb_intersects(unsigned long spfn, unsigned long epfn)
+{
+	int i;
+	struct dmb *dmb;
+
+	if (spfn >= epfn)
+		return DMB_DISJOINT;
+
+	for (i = 0; i < dmb_area_count; i++) {
+		dmb = &dmb_areas[i];
+		if (spfn >= dmb->end_pfn)
+			continue;
+		if (epfn <= dmb->start_pfn)
+			return DMB_DISJOINT;
+		if (spfn >= dmb->start_pfn && epfn <= dmb->end_pfn)
+			return DMB_INTERSECTS;
+		else
+			return DMB_MIXED;
+	}
+
+	return DMB_DISJOINT;
+}
+EXPORT_SYMBOL(dmb_intersects);
+
+int __init dmb_reserve(phys_addr_t base, phys_addr_t size,
+		       struct dmb **res_dmb)
+{
+	struct dmb *dmb;
+
+	/* Sanity checks */
+	if (!size || !memblock_is_region_reserved(base, size))
+		return -EINVAL;
+
+	/* ensure minimal alignment required by mm core */
+	if (!IS_ALIGNED(base | size, DMB_MIN_ALIGNMENT_BYTES))
+		return -EINVAL;
+
+	if (dmb_area_count == ARRAY_SIZE(dmb_areas)) {
+		pr_warn("Not enough slots for DMB reserved regions!\n");
+		return -ENOSPC;
+	}
+
+	/*
+	 * Each reserved area must be initialised later, when more kernel
+	 * subsystems (like slab allocator) are available.
+	 */
+	dmb = &dmb_areas[dmb_area_count++];
+
+	dmb->start_pfn = PFN_DOWN(base);
+	dmb->end_pfn = PFN_DOWN(base + size);
+	if (res_dmb)
+		*res_dmb = dmb;
+
+	memblock_mark_movable(base, size);
+	return 0;
+}
+
+void __init dmb_init_region(struct memblock_region *region)
+{
+	unsigned long pfn;
+	int i;
+
+	for (pfn = memblock_region_memory_base_pfn(region);
+	     pfn < memblock_region_memory_end_pfn(region);
+	     pfn += pageblock_nr_pages) {
+		struct page *page = pfn_to_page(pfn);
+
+		for (i = 0; i < pageblock_nr_pages; i++)
+			set_page_zone(page + i, ZONE_MOVABLE);
+
+		/* free reserved pageblocks to page allocator */
+		init_reserved_pageblock(page);
+	}
+}
diff --git a/mm/memblock.c b/mm/memblock.c
index 46a4deb07d92..2a957337bd8b 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -16,6 +16,7 @@
 #include <linux/kmemleak.h>
 #include <linux/seq_file.h>
 #include <linux/memblock.h>
+#include <linux/dmb.h>
 
 #include <asm/sections.h>
 #include <linux/io.h>
@@ -2090,13 +2091,16 @@ static void __init memmap_init_reserved_pages(void)
 	for_each_reserved_mem_range(i, &start, &end)
 		reserve_bootmem_region(start, end);
 
-	/* and also treat struct pages for the NOMAP regions as PageReserved */
 	for_each_mem_region(region) {
+		/* treat struct pages for the NOMAP regions as PageReserved */
 		if (memblock_is_nomap(region)) {
 			start = region->base;
 			end = start + region->size;
 			reserve_bootmem_region(start, end);
 		}
+		/* move Designated Movable Block pages to ZONE_MOVABLE */
+		if (memblock_is_movable(region))
+			dmb_init_region(region);
 	}
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d7a5a05ead4b..82cad751e0b8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -76,6 +76,7 @@
 #include <linux/khugepaged.h>
 #include <linux/buffer_head.h>
 #include <linux/delayacct.h>
+#include <linux/dmb.h>
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -434,6 +435,8 @@ static unsigned long required_kernelcore __initdata;
 static unsigned long required_kernelcore_percent __initdata;
 static unsigned long required_movablecore __initdata;
 static unsigned long required_movablecore_percent __initdata;
+static unsigned long min_dmb_pfn[MAX_NUMNODES] __initdata;
+static unsigned long max_dmb_pfn[MAX_NUMNODES] __initdata;
 static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata;
 bool mirrored_kernelcore __initdata_memblock;
 
@@ -2171,7 +2174,7 @@ static int __init deferred_init_memmap(void *data)
 	}
 zone_empty:
 	/* Sanity check that the next zone really is unpopulated */
-	WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
+	WARN_ON(++zid < ZONE_MOVABLE && populated_zone(++zone));
 
 	pr_info("node %d deferred pages initialised in %ums\n",
 		pgdat->node_id, jiffies_to_msecs(jiffies - start));
@@ -6978,6 +6981,10 @@ static void __init memmap_init_zone_range(struct zone *zone,
 	unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
 	int nid = zone_to_nid(zone), zone_id = zone_idx(zone);
 
+	/* Skip overlap of ZONE_MOVABLE */
+	if (zone_id == ZONE_MOVABLE && zone_start_pfn < *hole_pfn)
+		zone_start_pfn = *hole_pfn;
+
 	start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
 	end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
 
@@ -7438,6 +7445,12 @@ static unsigned long __init zone_spanned_pages_in_node(int nid,
 				node_start_pfn, node_end_pfn,
 				zone_start_pfn, zone_end_pfn);
 
+	if (zone_type == ZONE_MOVABLE && max_dmb_pfn[nid]) {
+		if (*zone_start_pfn == *zone_end_pfn)
+			*zone_end_pfn = max_dmb_pfn[nid];
+		*zone_start_pfn = min(*zone_start_pfn, min_dmb_pfn[nid]);
+	}
+
 	/* Check that this node has pages within the zone's required range */
 	if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
 		return 0;
@@ -7506,12 +7519,21 @@ static unsigned long __init zone_absent_pages_in_node(int nid,
 			&zone_start_pfn, &zone_end_pfn);
 	nr_absent = __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
 
+	if (zone_type == ZONE_MOVABLE && max_dmb_pfn[nid]) {
+		if (zone_start_pfn == zone_end_pfn)
+			zone_end_pfn = max_dmb_pfn[nid];
+		else
+			zone_end_pfn = zone_movable_pfn[nid];
+		zone_start_pfn = min(zone_start_pfn, min_dmb_pfn[nid]);
+		nr_absent += zone_end_pfn - zone_start_pfn;
+	}
+
 	/*
 	 * ZONE_MOVABLE handling.
-	 * Treat pages to be ZONE_MOVABLE in ZONE_NORMAL as absent pages
+	 * Treat pages to be ZONE_MOVABLE in other zones as absent pages
 	 * and vice versa.
 	 */
-	if (mirrored_kernelcore && zone_movable_pfn[nid]) {
+	if (zone_movable_pfn[nid]) {
 		unsigned long start_pfn, end_pfn;
 		struct memblock_region *r;
 
@@ -7521,6 +7543,19 @@ static unsigned long __init zone_absent_pages_in_node(int nid,
 			end_pfn = clamp(memblock_region_memory_end_pfn(r),
 					zone_start_pfn, zone_end_pfn);
 
+			if (memblock_is_movable(r)) {
+				if (zone_type != ZONE_MOVABLE) {
+					nr_absent += end_pfn - start_pfn;
+					continue;
+				}
+
+				nr_absent -=  end_pfn - start_pfn;
+				continue;
+			}
+
+			if (!mirrored_kernelcore)
+				continue;
+
 			if (zone_type == ZONE_MOVABLE &&
 			    memblock_is_mirror(r))
 				nr_absent += end_pfn - start_pfn;
@@ -7540,18 +7575,27 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
 {
 	unsigned long totalpages = 0;
 	enum zone_type i;
+	int nid = pgdat->node_id;
+
+	/*
+	 * If Designated Movable Blocks are defined on this node, ensure that
+	 * zone_movable_pfn is also defined for this node.
+	 */
+	if (max_dmb_pfn[nid] && !zone_movable_pfn[nid])
+		zone_movable_pfn[nid] = min(node_end_pfn,
+				arch_zone_highest_possible_pfn[movable_zone]);
 
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 		unsigned long zone_start_pfn, zone_end_pfn;
 		unsigned long spanned, absent, size;
 
-		spanned = zone_spanned_pages_in_node(pgdat->node_id, i,
+		spanned = zone_spanned_pages_in_node(nid, i,
 						     node_start_pfn,
 						     node_end_pfn,
 						     &zone_start_pfn,
 						     &zone_end_pfn);
-		absent = zone_absent_pages_in_node(pgdat->node_id, i,
+		absent = zone_absent_pages_in_node(nid, i,
 						   node_start_pfn,
 						   node_end_pfn);
 
@@ -8002,15 +8046,27 @@ unsigned long __init node_map_pfn_alignment(void)
 static unsigned long __init early_calculate_totalpages(void)
 {
 	unsigned long totalpages = 0;
-	unsigned long start_pfn, end_pfn;
-	int i, nid;
+	struct memblock_region *r;
 
-	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
-		unsigned long pages = end_pfn - start_pfn;
+	for_each_mem_region(r) {
+		unsigned long start_pfn, end_pfn, pages;
+		int nid;
+
+		nid = memblock_get_region_node(r);
+		start_pfn = memblock_region_memory_base_pfn(r);
+		end_pfn = memblock_region_memory_end_pfn(r);
 
-		totalpages += pages;
-		if (pages)
+		pages = end_pfn - start_pfn;
+		if (pages) {
+			totalpages += pages;
 			node_set_state(nid, N_MEMORY);
+			if (memblock_is_movable(r)) {
+				if (start_pfn < min_dmb_pfn[nid])
+					min_dmb_pfn[nid] = start_pfn;
+				if (end_pfn > max_dmb_pfn[nid])
+					max_dmb_pfn[nid] = end_pfn;
+			}
+		}
 	}
 	return totalpages;
 }
@@ -8023,7 +8079,7 @@ static unsigned long __init early_calculate_totalpages(void)
  */
 static void __init find_zone_movable_pfns_for_nodes(void)
 {
-	int i, nid;
+	int nid;
 	unsigned long usable_startpfn;
 	unsigned long kernelcore_node, kernelcore_remaining;
 	/* save the state before borrow the nodemask */
@@ -8151,13 +8207,24 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 		kernelcore_remaining = kernelcore_node;
 
 		/* Go through each range of PFNs within this node */
-		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
+		for_each_mem_region(r) {
 			unsigned long size_pages;
 
+			if (memblock_get_region_node(r) != nid)
+				continue;
+
+			start_pfn = memblock_region_memory_base_pfn(r);
+			end_pfn = memblock_region_memory_end_pfn(r);
 			start_pfn = max(start_pfn, zone_movable_pfn[nid]);
 			if (start_pfn >= end_pfn)
 				continue;
 
+			/* Skip over Designated Movable Blocks */
+			if (memblock_is_movable(r)) {
+				zone_movable_pfn[nid] = end_pfn;
+				continue;
+			}
+
 			/* Account for what is only usable for kernelcore */
 			if (start_pfn < usable_startpfn) {
 				unsigned long kernel_pages;
@@ -8306,6 +8373,8 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 	}
 
 	/* Find the PFNs that ZONE_MOVABLE begins at in each node */
+	memset(min_dmb_pfn, 0xff, sizeof(min_dmb_pfn));
+	memset(max_dmb_pfn, 0, sizeof(max_dmb_pfn));
 	memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
 	find_zone_movable_pfns_for_nodes();
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 8/9] mm/page_alloc: make alloc_contig_pages DMB aware
  2022-10-20 21:53 [PATCH v3 0/9] mm: introduce Designated Movable Blocks Doug Berger
                   ` (6 preceding siblings ...)
  2022-10-20 21:53 ` [PATCH v3 7/9] mm/dmb: Introduce Designated Movable Blocks Doug Berger
@ 2022-10-20 21:53 ` Doug Berger
  2022-10-20 21:53 ` [PATCH v3 9/9] mm/page_alloc: allow base for movablecore Doug Berger
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Doug Berger @ 2022-10-20 21:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mel Gorman, Mike Kravetz, Florian Fainelli,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, linux-doc,
	linux-kernel, linux-mm, Doug Berger

Designated Movable Blocks are skipped when attempting to allocate
contiguous pages. Performing per-page validation across all pages
spanned by a zone can be especially inefficient when Designated
Movable Blocks create large overlaps between zones. Use
dmb_intersects() within pfn_range_valid_contig() as an early check
to signal that the range is not valid.

The zone_movable_pfn array, which represents the start of the
non-overlapped portion of ZONE_MOVABLE on each node, is now
preserved for use at runtime to skip over any DMB-only portion of
the zone.

Signed-off-by: Doug Berger <opendmb@gmail.com>
---
 mm/page_alloc.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 82cad751e0b8..a39eca3bc01b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -437,7 +437,7 @@ static unsigned long required_movablecore __initdata;
 static unsigned long required_movablecore_percent __initdata;
 static unsigned long min_dmb_pfn[MAX_NUMNODES] __initdata;
 static unsigned long max_dmb_pfn[MAX_NUMNODES] __initdata;
-static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata;
+static unsigned long zone_movable_pfn[MAX_NUMNODES];
 bool mirrored_kernelcore __initdata_memblock;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
@@ -9460,6 +9460,9 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
 	unsigned long i, end_pfn = start_pfn + nr_pages;
 	struct page *page;
 
+	if (dmb_intersects(start_pfn, end_pfn))
+		return false;
+
 	for (i = start_pfn; i < end_pfn; i++) {
 		page = pfn_to_online_page(i);
 		if (!page)
@@ -9516,7 +9519,10 @@ struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
 					gfp_zone(gfp_mask), nodemask) {
 		spin_lock_irqsave(&zone->lock, flags);
 
-		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
+		if (zone_idx(zone) == ZONE_MOVABLE && zone_movable_pfn[nid])
+			pfn = ALIGN(zone_movable_pfn[nid], nr_pages);
+		else
+			pfn = ALIGN(zone->zone_start_pfn, nr_pages);
 		while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
 			if (pfn_range_valid_contig(zone, pfn, nr_pages)) {
 				/*
-- 
2.25.1



* [PATCH v3 9/9] mm/page_alloc: allow base for movablecore
  2022-10-20 21:53 [PATCH v3 0/9] mm: introduce Designated Movable Blocks Doug Berger
                   ` (7 preceding siblings ...)
  2022-10-20 21:53 ` [PATCH v3 8/9] mm/page_alloc: make alloc_contig_pages DMB aware Doug Berger
@ 2022-10-20 21:53 ` Doug Berger
  2022-10-26 10:55 ` [PATCH v3 0/9] mm: introduce Designated Movable Blocks Mel Gorman
  2023-01-03 23:43 ` Florian Fainelli
  10 siblings, 0 replies; 24+ messages in thread
From: Doug Berger @ 2022-10-20 21:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mel Gorman, Mike Kravetz, Florian Fainelli,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, linux-doc,
	linux-kernel, linux-mm, Doug Berger

A Designated Movable Block can be created by including the base
address of the block when specifying a movablecore range on the
kernel command line.

Signed-off-by: Doug Berger <opendmb@gmail.com>
---
 .../admin-guide/kernel-parameters.txt         | 14 ++++++-
 mm/page_alloc.c                               | 38 ++++++++++++++++---
 2 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a465d5242774..f4f783cd683e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3325,7 +3325,7 @@
 			reporting absolute coordinates, such as tablets
 
 	movablecore=	[KNL,X86,IA-64,PPC]
-			Format: nn[KMGTPE] | nn%
+			Format: nn[KMGTPE] | nn[KMGTPE]@ss[KMGTPE] | nn%
 			This parameter is the complement to kernelcore=, it
 			specifies the amount of memory used for migratable
 			allocations.  If both kernelcore and movablecore is
@@ -3335,6 +3335,18 @@
 			that the amount of memory usable for all allocations
 			is not too small.
 
+			If @ss[KMGTPE] is included, memory within the region
+			from ss to ss+nn will be designated as a movable block
+			and included in ZONE_MOVABLE. Designated Movable Blocks
+			must be aligned to pageblock_order. Designated Movable
+			Blocks take priority over values of kernelcore= and are
+			considered part of any memory specified by more general
+			movablecore= values.
+			Multiple Designated Movable Blocks may be specified,
+			comma delimited.
+			Example:
+				movablecore=100M@2G,100M@3G,1G@1024G
+
 	movable_node	[KNL] Boot-time switch to make hotplugable memory
 			NUMA nodes to be movable. This means that the memory
 			of such nodes will be usable only for movable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a39eca3bc01b..385fc9082945 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8461,9 +8461,9 @@ void __init free_area_init(unsigned long *max_zone_pfn)
 }
 
 static int __init cmdline_parse_core(char *p, unsigned long *core,
-				     unsigned long *percent)
+				     unsigned long *percent, bool movable)
 {
-	unsigned long long coremem;
+	unsigned long long coremem, address;
 	char *endptr;
 
 	if (!p)
@@ -8478,6 +8478,17 @@ static int __init cmdline_parse_core(char *p, unsigned long *core,
 		*percent = coremem;
 	} else {
 		coremem = memparse(p, &p);
+		if (movable && *p == '@') {
+			address = memparse(++p, &p);
+			if (*p != '\0' ||
+			    !memblock_is_region_memory(address, coremem) ||
+			    memblock_is_region_reserved(address, coremem))
+				return -EINVAL;
+			memblock_reserve(address, coremem);
+			return dmb_reserve(address, coremem, NULL);
+		} else if (*p != '\0') {
+			return -EINVAL;
+		}
 		/* Paranoid check that UL is enough for the coremem value */
 		WARN_ON((coremem >> PAGE_SHIFT) > ULONG_MAX);
 
@@ -8500,17 +8511,32 @@ static int __init cmdline_parse_kernelcore(char *p)
 	}
 
 	return cmdline_parse_core(p, &required_kernelcore,
-				  &required_kernelcore_percent);
+				  &required_kernelcore_percent, false);
 }
 
 /*
  * movablecore=size sets the amount of memory for use for allocations that
- * can be reclaimed or migrated.
+ * can be reclaimed or migrated. movablecore=size@base defines a Designated
+ * Movable Block.
  */
 static int __init cmdline_parse_movablecore(char *p)
 {
-	return cmdline_parse_core(p, &required_movablecore,
-				  &required_movablecore_percent);
+	int ret = -EINVAL;
+
+	while (p) {
+		char *k = strchr(p, ',');
+
+		if (k)
+			*k++ = 0;
+
+		ret = cmdline_parse_core(p, &required_movablecore,
+					 &required_movablecore_percent, true);
+		if (ret)
+			break;
+		p = k;
+	}
+
+	return ret;
 }
 
 early_param("kernelcore", cmdline_parse_kernelcore);
-- 
2.25.1



* Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks
  2022-10-20 21:53 [PATCH v3 0/9] mm: introduce Designated Movable Blocks Doug Berger
                   ` (8 preceding siblings ...)
  2022-10-20 21:53 ` [PATCH v3 9/9] mm/page_alloc: allow base for movablecore Doug Berger
@ 2022-10-26 10:55 ` Mel Gorman
  2022-10-26 11:11   ` David Hildenbrand
  2022-11-02 22:33   ` Doug Berger
  2023-01-03 23:43 ` Florian Fainelli
  10 siblings, 2 replies; 24+ messages in thread
From: Mel Gorman @ 2022-10-26 10:55 UTC (permalink / raw)
  To: Doug Berger
  Cc: Andrew Morton, Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mike Kravetz, Florian Fainelli,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, linux-doc,
	linux-kernel, linux-mm

On Thu, Oct 20, 2022 at 02:53:09PM -0700, Doug Berger wrote:
> MOTIVATION:
> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
> controllers with each mapped in a different address range within
> a Uniform Memory Architecture. Some users of these systems have
> expressed the desire to locate ZONE_MOVABLE memory on each
> memory controller to allow user space intensive processing to
> make better use of the additional memory bandwidth.
> Unfortunately, the historical monotonic layout of zones would
> mean that if the lowest addressed memory controller contains
> ZONE_MOVABLE memory then all of the memory available from
> memory controllers at higher addresses must also be in the
> ZONE_MOVABLE zone. This would force all kernel memory accesses
> onto the lowest addressed memory controller and significantly
> reduce the amount of memory available for non-movable
> allocations.
> 

I didn't review the first version of this patch because others, particularly
David Hildenbrand highlighted many of the concerns I had. I broadly followed
the discussion but didn't respond because I live in a permanent state of
having too much to do but, with a new version, I have to say something.

The big questions he initially asked were

	How large are these areas typically?
	How large are they in comparison to other memory in the system?
	How is this memory currently presented to the system?
	Can you share some more how exactly ZONE_MOVABLE would help here to make
		better use of the memory bandwidth?

Zones are about addressing limitations primarily and frankly, ZONE_MOVABLE
was a bad idea in retrospect. Today, the preferred approach would have
been to create a separate NUMA node with distance-1 to the local node
(fudge by adding 1 to the local distance "10" for zonelist purposes)
that was ZONE_MOVABLE with the zonelists structured such that GFP_MOVABLE
allocations would prefer the "movable" node first. While I don't recall
why I did not take that approach, it most likely was because CONFIG_NUMA
was not always set, it was only intended for hugetlbfs allocations and
maybe I didn't have the necessary skill or foresight to take that approach.

Hotplug's requirements are somewhat different; the primary motivation that
I'm aware of is being able to guarantee that memory, particularly whole
nodes, can be offlined, which can be done in some circumstances. Generally
hotplug does not care what uses the memory as long as it can be removed
later. The requirements for restricted access to high-speed memory are
different.

There is a high degree of uncertainty about how these regions are to be
used by applications to get access to the high speed memory, to quote

	I'm not certain what is typical because these systems are highly
	configurable and Broadcom's customers have different ideas about
	application processing.

	...

	The Designated Movable Block concept introduced here has the
	potential to offer useful services to different constituencies. I
	tried to highlight this in my V1 patch set with the hope of
	attracting some interest, but it can complicate the overall
	discussion, so I would like to maybe narrow the discussion here. It
	may be good to keep them in mind when assessing the overall value,
	but perhaps the "other opportunities" can be covered as a follow
	on discussion.

I note the "potential" part here because we don't actually know. A
major limitation of ZONE_MOVABLE is that there is no way of controlling
access from userspace to restrict the high-speed memory to a designated
application, only to all applications in general. The primary interface
to control access to memory with different characteristics is mempolicies,
which are NUMA orientated, not zone orientated. So, if there is a special
application that requires exclusive access, it's very difficult to configure
based on zones.  Furthermore, page table pages mapping data located in the
high-speed region are stored in the slower memory which potentially impacts
the performance if the working set of the application exceeds TLB reach.
Finally, while there is mention that Broadcom may have some special
interface to determine what applications can use the high-speed region,
it's hardware-specific as opposed to something that belongs in the core mm.

I agree that keeping the high-speed memory in a local node and using "sticky"
pageblocks or CMA has limitations of its own but in itself, that does not
justify using ZONE_MOVABLE in my opinion. The statement that ARM can have
multiple controllers with equal distance and bandwidth (if I'm reading it
correctly) but places them in different zones.... that's just a bit weird if
there are no other addressing limitations. It's not obvious why ARM would do
that, but it also does not matter because it shouldn't be a core mm concern.

There are already examples of where memory is physically "local" to
the CPU but has different bandwidth or latency including High Bandwidth
(HBM), Sub-NUMA Clustering (SNC), PMEM as a memory-like device and some
AMD EPYC chips, particularly the first generation where a socket's memory
controllers had different distances. With the Broadcom controllers,
it sounds like a local memory controller but the bandwidth available
differs. It's functionally equivalent to HBM.

The fact that the memory access is physically local to the CPU socket is
irrelevant when the characteristics of that locality differs. NUMA stands
for Non-Uniform Memory Access and if bandwidth to different address ranges
differs, then the system is inherently NUMA even if that is inconvenient.

While I have not evaluated the implementation in detail, there is already
infrastructure dealing with tiered memory (memory that is local but has
different characteristics) with support for moving memory between tiers
depending on access patterns. Memory policies can be used to restrict
access to what processes can access the higher bandwidth memory. Given the
use case for DMB, I suspect that the intent is "application data uses high
bandwidth memory where possible and kernel uses lower bandwidth memory",
which is probably fine for an appliance because there is only one workload,
but it's not a suitable generic solution.

Going back to the original questions:

	How large are these areas typically?
	How large are they in comparison to other memory in the system?

I am treating these as the same question because the consequences are the
same. A high ratio of !MOVABLE:MOVABLE can cause big problems including
premature OOM, surprising reclaim behaviour, etc.

	How is this memory currently presented to the system?

It's local, but with different characteristics so it's inherently NUMA
because it's Non-Uniform, there is no getting away from that.

	Can you share some more how exactly ZONE_MOVABLE would help here to make
		better use of the memory bandwidth?

In the appliance case, it doesn't matter if the intent is that "all
application data should use high bandwidth memory where possible and
the application phase behaviour is predictable" and that may very well
work fine for the users of the Broadcom platforms with multiple memory
controllers. It does not work at all for the general case, where access
must be restricted to a subset of tasks, something a general system can
only control with memory policies.

The high bandwidth memory should be represented as a NUMA node, optionally
creating that node as ZONE_MOVABLE and relying on the zonelists to select
the movable zone as the first preference.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks
  2022-10-26 10:55 ` [PATCH v3 0/9] mm: introduce Designated Movable Blocks Mel Gorman
@ 2022-10-26 11:11   ` David Hildenbrand
  2022-10-26 12:02     ` Mel Gorman
  2022-11-02 22:33   ` Doug Berger
  1 sibling, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2022-10-26 11:11 UTC (permalink / raw)
  To: Mel Gorman, Doug Berger
  Cc: Andrew Morton, Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mike Kravetz, Florian Fainelli, Oscar Salvador,
	Joonsoo Kim, linux-doc, linux-kernel, linux-mm

On 26.10.22 12:55, Mel Gorman wrote:
> On Thu, Oct 20, 2022 at 02:53:09PM -0700, Doug Berger wrote:
>> [...]
> 
> I didn't review the first version of this patch because others, particularly
> David Hildenbrand highlighted many of the concerns I had. I broadly followed
> the discussion but didn't respond because I live in a permanent state of
> having too much to do but with a new version, I have to say something.

:) Just a note that I am still behind on replying to the discussion in 
v2. I wish I had more capacity right now to be more responsive -- but 
just like you (Mel) "permanent state of having too much to do". Other 
things (especially bug fixes) have higher priority.

Thanks for having a look at it Mel --- I only skimmed over your reply, 
but ...

> 
> [...]
> In the appliance case, it doesn't matter if the intent is that "all
> application data should use high bandwidth memory where possible and
> the application phase behaviour is predictable" and that may very well
> work fine for the users of the Broadcom platforms with multiple memory
> controllers. It does not work at all for the general case, where access
> must be restricted to a subset of tasks, something a general system can
> only control with memory policies.
> 
> The high bandwidth memory should be represented as a NUMA node, optionally
> creating that node as ZONE_MOVABLE and relying on the zonelists to select
> the movable zone as the first preference.

... that boils down to my remark to tiered memory and eventually using 
devdax to expose this memory to the system and letting the admin decide 
to online it to ZONE_MOVABLE. Of course, that's just one way of doing it.
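For reference, the devdax route mentioned here looks roughly like the following with current tooling. This is a hypothetical command fragment: the device name dax0.0 and the memoryN sysfs block are assumptions, and daxctl comes from the ndctl project.

```
# reconfigure the dax device so its memory is hotplugged as system RAM
daxctl reconfigure-device --mode=system-ram dax0.0
# if not auto-onlined by policy, online the new memory blocks as movable
echo online_movable > /sys/devices/system/memory/memoryN/state
```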

-- 
Thanks,

David / dhildenb



* Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks
  2022-10-26 11:11   ` David Hildenbrand
@ 2022-10-26 12:02     ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2022-10-26 12:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Doug Berger, Andrew Morton, Jonathan Corbet, Mike Rapoport,
	Borislav Petkov, Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap,
	Damien Le Moal, Muchun Song, Vlastimil Babka, Johannes Weiner,
	Michal Hocko, KOSAKI Motohiro, Mike Kravetz, Florian Fainelli,
	Oscar Salvador, Joonsoo Kim, linux-doc, linux-kernel, linux-mm

On Wed, Oct 26, 2022 at 01:11:40PM +0200, David Hildenbrand wrote:
> > In the appliance case, it doesn't matter if the intent is that "all
> > application data should use high bandwidth memory where possible and
> > the application phase behaviour is predictable" and that may very well
> > work fine for the users of the Broadcom platforms with multiple memory
> > controllers. It does not work at all for the general case where access must
> > be restricted to a subset of tasks in a general system that can only be
> > controlled with memory policies.
> > 
> > The high bandwidth memory should be represented as a NUMA node, optionally
> > creating that node as ZONE_MOVABLE and relying on the zonelists to select
> > the movable zone as the first preference.
> 
> ... that boils down to my remark to tiered memory and eventually using
> devdax to expose this memory to the system and letting the admin decide to
> online it to ZONE_MOVABLE. Of course, that's just one way of doing it.
> 

I don't see this approach being inherently bad as such, particularly in
the appliance space where it is known in advance what exactly is running
and what the requirements are. It's not automagical but it's not worse
than specifying something like movablecore=100M@2G,100M@3G,1G@1024G. In
either case, knowledge of the address ranges needing special treatment is
required with the difference being that access to the special memory can
be restricted by policies in the general case.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks
  2022-10-26 10:55 ` [PATCH v3 0/9] mm: introduce Designated Movable Blocks Mel Gorman
  2022-10-26 11:11   ` David Hildenbrand
@ 2022-11-02 22:33   ` Doug Berger
  2022-11-18 17:05     ` Mel Gorman
  1 sibling, 1 reply; 24+ messages in thread
From: Doug Berger @ 2022-11-02 22:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mike Kravetz, Florian Fainelli,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, linux-doc,
	linux-kernel, linux-mm

On 10/26/2022 3:55 AM, Mel Gorman wrote:
> On Thu, Oct 20, 2022 at 02:53:09PM -0700, Doug Berger wrote:
>> MOTIVATION:
>> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
>> controllers with each mapped in a different address range within
>> a Uniform Memory Architecture. Some users of these systems have
>> expressed the desire to locate ZONE_MOVABLE memory on each
>> memory controller to allow user space intensive processing to
>> make better use of the additional memory bandwidth.
>> Unfortunately, the historical monotonic layout of zones would
>> mean that if the lowest addressed memory controller contains
>> ZONE_MOVABLE memory then all of the memory available from
>> memory controllers at higher addresses must also be in the
>> ZONE_MOVABLE zone. This would force all kernel memory accesses
>> onto the lowest addressed memory controller and significantly
>> reduce the amount of memory available for non-movable
>> allocations.
>>
> 
> I didn't review the first version of this patch because others, particularly
> David Hildenbrand, highlighted many of the concerns I had. I broadly followed
> the discussion but didn't respond because I live in a permanent state of
> having too much to do, but with a new version, I have to say something.
I am familiar with that state and as a beneficiary of your hard work 
I'll take the opportunity to say thanks and I appreciate you taking the 
time to respond.

> 
> The three big questions he initially asked were
> 
> 	How large are these areas typically?
> 	How large are they in comparison to other memory in the system?
> 	How is this memory currently presented to the system?
> 	Can you share some more how exactly ZONE_MOVABLE would help here to make
> 		better use of the memory bandwidth?
> 
> Zones are about addressing limitations primarily and frankly, ZONE_MOVABLE
> was a bad idea in retrospect. Today, the preferred approach would have
> been to create a separate NUMA node with distance-1 to the local node
> (fudge by adding 1 to the local distance "10" for zonelist purposes)
> that was ZONE_MOVABLE with the zonelists structured such that GFP_MOVABLE
> allocations would prefer the "movable" node first.
I'm afraid I don't completely follow what you are suggesting here.

> While I don't recall
> why I did not take that approach, it most likely was because CONFIG_NUMA
> was not always set, it was only intended for hugetlbfs allocations and
> maybe I didn't have the necessary skill or foresight to take that approach.
It remains true that CONFIG_NUMA is not always set and that is a key 
motivator for this patch set. For example, Google is moving to a common 
GKI kernel for their Google TV platform that they are requiring vendors 
to support. Currently the arm64 GKI kernel does not set CONFIG_NUMA and 
it seems unlikely that we will be able to get all vendors to accept such 
a change.

> 
> Hotplug's requirements are somewhat different; the primary motivation that
> I'm aware of is being able to guarantee they can be offlined, particularly
> nodes, which can be done in some circumstances. Generally hotplug does
> not care what uses the memory as long as it can be removed later. The
> requirements for restricted access to high speed memory are different.
This is effectively the same requirement that an implementation of 
'reusable' reserved memory has. A driver that owns reserved memory may 
not care what uses the memory as long as the memory can be reclaimed 
when the driver needs it. This is functionally analogous to memory 
hotplug. Reserved memory that is 'reusable' and compatible with 
'shared-dma-pool' uses the CMA implementation, but there is room for an 
alternative implementation that shares the memory more aggressively. 
This is a separate motivator for Designated Movable Block support, but I 
am deferring that discussion since it is desirable to have a more 
extended debate over APIs and such.

> 
> There is a high degree of uncertainty of how these regions are to be
> used by applications to get access to the high speed memory, to quote
> 
> 	I'm not certain what is typical because these systems are highly
> 	configurable and Broadcom's customers have different ideas about
> 	application processing.
> 
> 	...
> 
> 	The Designated Movable Block concept introduced here has the
> 	potential to offer useful services to different constituencies. I
> 	tried to highlight this in my V1 patch set with the hope of
> 	attracting some interest, but it can complicate the overall
> 	discussion, so I would like to maybe narrow the discussion here. It
> 	may be good to keep them in mind when assessing the overall value,
> 	but perhaps the "other opportunities" can be covered as a follow
> 	on discussion.
> 
> I note the "potential" part here because we don't actually know.
I used "potential" here not as in "it might be useful", but rather that 
"different constituencies (i.e. people outside of the Broadcom 
ecosystem) might also find them useful".

> A
> major limitation of ZONE_MOVABLE is that there is no way of controlling
> access from userspace to restrict the high-speed memory to a designated
> application, only to all applications in general. The primary interface
> to control access to memory with different characteristics is mempolicies
> which is NUMA orientated, not zone orientated. So, if there is a special
> application that requires exclusive access, it's very difficult to configure
> based on zones.  Furthermore, page table pages mapping data located in the
> high-speed region are stored in the slower memory which potentially impacts
> the performance if the working set of the application exceeds TLB reach.
> Finally, while there is mention that Broadcom may have some special
> interface to determine what applications can use the high-speed region,
> it's hardware-specific as opposed to something that belongs in the core mm.
> 
> I agree that keeping the high-speed memory in a local node and using "sticky"
> pageblocks or CMA has limitations of its own but in itself, that does not
> justify using ZONE_MOVABLE in my opinion. The statement that ARM can have
> multiple controllers with equal distance and bandwidth (if I'm reading it
> correctly) but places them in different zones.... that's just a bit weird if
> there are no other addressing limitations. It's not obvious why ARM would do
> that, but it also does not matter because it shouldn't be a core mm concern.
There appears to be some confusion regarding my explanation of multiple 
memory controllers on a device like the BCM7278. There is no inherent 
performance difference between the two memory controllers and their 
attached DRAM. They merely provide the opportunity to perform memory 
accesses in parallel for different physical address ranges. The physical 
address ranges were selected by the SoC designers for reasons only known 
to them, but I'm sure they had no consideration of zones in their 
decision making. The selection of zones remains an artifact of the 
design of Linux.

Since the BCM7278 contains a 4-core SMP cluster and each core can have 
multiple outstanding memory transactions the speed of DDR transactions 
can create a bottleneck for the system. If each memory controller has an 
effective bandwidth of X then, provided the system memory accesses can 
be distributed across both memory controllers, the combined effective 
bandwidth can be additive (X + X = 2X). Of course the actual result is 
highly dependent on the dependent clause "provided the system memory 
accesses can be distributed across both memory controllers". The 
accesses do not need to be evenly distributed to gain a benefit. We just 
want to reduce any idle time on each memory controller.

It was observed that the monotonic zone layout for a non-NUMA system 
(like this one) creates a bias for kernel space to use lower physical 
memory addresses and user space to use higher physical memory addresses. 
Broadcom customers requested the ability to locate movablecore memory 
within the physical address range of each memory controller and reported 
that it improved their system performance. Unfortunately, I do not have 
access to their data and I doubt they would allow me to share it if I 
did. I don't believe this is really about trying to optimize the 
performance of a specific application as much as trying to prevent 
overall system performance degradation from underutilized memory bandwidth.

> 
> There are already examples of where memory is physically "local" to
> the CPU but has different bandwidth or latency including High Bandwidth
> (HBM), Sub-NUMA Clustering (SNC), PMEM as a memory-like device and some
> AMD EPYC Chips, particularly the first generation where a socket's memory
> controllers had different distances. With the Broadcom controllers,
> it sounds like a local memory controller but the bandwidth available
> differs. It's functionally equivalent to HBM.
The bandwidth available does not differ, but if too few transactions 
target one of the memory controllers, that controller's bandwidth is 
underutilized.

> 
> The fact that the memory access is physically local to the CPU socket is
> irrelevant when the characteristics of that locality differs. NUMA stands
> for Non-Uniform Memory Access and if bandwidth to different address ranges
> differs, then the system is inherently NUMA even if that is inconvenient.
The bandwidth to different address ranges does not differ. A single 
threaded application should see no performance difference regardless of 
where its memory is located. However, if multiple application threads 
are executing in parallel and the memory is divided between the memory 
controllers they will be able to do more work per unit of time than if 
the memory is predominantly located on one memory controller.

> 
> While I have not evaluated the implementation in detail, there is already
> infrastructure dealing with tiered memory (memory that is local but has
> different characteristics) with support for moving memory between tiers
> depending on access patterns. Memory policies can be used to restrict
> access to what processes can access the higher bandwidth memory. Given the
> use case for DSM, I suspect that the intent is "application data uses high
> bandwidth memory where possible and kernel uses lower bandwidth memory"
> which is probably fine for an appliance because there is only one workload
> but it's not a suitable generic solution.
> 
> Going back to the original questions;
> 
> 	How large are these areas typically?
> 	How large are they in comparison to other memory in the system?
> 
> I am treating this as the same question because the consequences are the
> same. A high ratio of !MOVABLE:MOVABLE can cause big problems including
> premature OOM, surprising reclaim behaviour, etc.
This is what makes the current movablecore implementation unacceptable. 
Getting any movablecore memory in the lower physical address range 
requires all of the upper physical address range to be movablecore, 
which is horribly unbalanced. Specifying a value like 
'movablecore=256M@2G,512M' with this patch set allows us to specify 
512MB of total movablecore with 256MB of it at a physical address within 
the lower memory controller and the remainder at the highest addresses 
of the upper memory controller.

> 
> 	How is this memory currently presented to the system?
> 
> It's local, but with different characteristics so it's inherently NUMA
> because it's Non-Uniform, there is no getting away from that.
It does not have different characteristics.

> 
> 	Can you share some more how exactly ZONE_MOVABLE would help here to make
> 		better use of the memory bandwidth?
> 
> In the appliance case, it doesn't matter if the intent is that "all
> application data should use high bandwidth memory where possible and
> the application phase behaviour is predictable" and that may very well
> work fine for the users of the Broadcom platforms with multiple memory
> controllers. It does not work at all for the general case where access must
> be restricted to a subset of tasks in a general system that can only be
> controlled with memory policies.
> 
> The high bandwidth memory should be represented as a NUMA node, optionally
> creating that node as ZONE_MOVABLE and relying on the zonelists to select
> the movable zone as the first preference.
> 
This patch set is fundamentally about greater control over the placement 
of movablecore memory. The current implementation of movablecore 
requires all of the ZONE_MOVABLE memory to be located at the highest 
physical addresses of the system when CONFIG_NUMA is not set. Allowing 
the specification of a base address allows greater flexibility on 
systems where there are benefits.

Thanks again for your time,
     Doug



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks
  2022-11-02 22:33   ` Doug Berger
@ 2022-11-18 17:05     ` Mel Gorman
  2022-12-15  0:17       ` Doug Berger
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2022-11-18 17:05 UTC (permalink / raw)
  To: Doug Berger
  Cc: Andrew Morton, Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mike Kravetz, Florian Fainelli,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, linux-doc,
	linux-kernel, linux-mm

On Wed, Nov 02, 2022 at 03:33:53PM -0700, Doug Berger wrote:
> > 
> > The three big questions he initially asked were
> > 
> > 	How large are these areas typically?
> > 	How large are they in comparison to other memory in the system?
> > 	How is this memory currently presented to the system?
> > 	Can you share some more how exactly ZONE_MOVABLE would help here to make
> > 		better use of the memory bandwidth?
> > 
> > Zones are about addressing limitations primarily and frankly, ZONE_MOVABLE
> > was a bad idea in retrospect. Today, the preferred approach would have
> > been to create a separate NUMA node with distance-1 to the local node
> > (fudge by adding 1 to the local distance "10" for zonelist purposes)
> > that was ZONE_MOVABLE with the zonelists structured such that GFP_MOVABLE
> > allocations would prefer the "movable" node first.
>
> I'm afraid I don't completely follow what you are suggesting here.
> 

It's not especially important how it could have been done, but using a
node would have avoided confusing zones (address limitations) with memory
partitioning (e.g. MOVABLE). Nodes can also interleave, but that would have
required CONFIG_NUMA, so it's pointless for GKI and the current discussion;
only with a time machine could GKI have enabled CONFIG_NUMA :/

> > While I don't recall
> > why I did not take that approach, it most likely was because CONFIG_NUMA
> > was not always set, it was only intended for hugetlbfs allocations and
> > maybe I didn't have the necessary skill or foresight to take that approach.
>
> It remains true that CONFIG_NUMA is not always set and that is a key
> motivator for this patch set. For example, Google is moving to a common GKI
> kernel for their Google TV platform that they are requiring vendors to
> support. Currently the arm64 GKI kernel does not set CONFIG_NUMA and it
> seems unlikely that we will be able to get all vendors to accept such a
> change.
> 

Ok.

> > 
> > Hotplug's requirements are somewhat different; the primary motivation that
> > I'm aware of is being able to guarantee they can be offlined, particularly
> > nodes, which can be done in some circumstances. Generally hotplug does
> > not care what uses the memory as long as it can be removed later. The
> > requirements for restricted access to high speed memory are different.
>
> This is effectively the same requirement that an implementation of
> 'reusable' reserved memory has. A driver that owns reserved memory may not
> care what uses the memory as long as the memory can be reclaimed when the
> driver needs it. This is functionally analogous to memory hotplug. Reserved
> memory that is 'reusable' and compatible with 'shared-dma-pool' uses the CMA
> implementation, but there is room for an alternative implementation that
> shares the memory more aggressively. This is a separate motivator for
> Designated Movable Block support, but I am deferring that discussion since
> it is desirable to have a more extended debate over APIs and such.
> 

There needs to be a better explanation as to why CMA cannot be used or, more
importantly, why page_alloc.shuffle= should not be used (more on that later).
It's not clear how a movable zone shares memory more aggressively than CMA
would. Both have the problem that if the protected range is too large,
premature memory exhaustion can occur for kernel allocations.

> > 
> > There is a high degree of uncertainty of how these regions are to be
> > used by applications to get access to the high speed memory, to quote
> > 
> > 	I'm not certain what is typical because these systems are highly
> > 	configurable and Broadcom's customers have different ideas about
> > 	application processing.
> > 
> > 	...
> > 
> > 	The Designated Movable Block concept introduced here has the
> > 	potential to offer useful services to different constituencies. I
> > 	tried to highlight this in my V1 patch set with the hope of
> > 	attracting some interest, but it can complicate the overall
> > 	discussion, so I would like to maybe narrow the discussion here. It
> > 	may be good to keep them in mind when assessing the overall value,
> > 	but perhaps the "other opportunities" can be covered as a follow
> > 	on discussion.
> > 
> > I note the "potential" part here because we don't actually know.
>
> I used "potential" here not as in "it might be useful", but rather that
> "different constituencies (i.e. people outside of the Broadcom ecosystem)
> might also find them useful".
> 

That's very vague unfortunately.

> > A
> > major limitation of ZONE_MOVABLE is that there is no way of controlling
> > access from userspace to restrict the high-speed memory to a designated
> > application, only to all applications in general. The primary interface
> > to control access to memory with different characteristics is mempolicies
> > which is NUMA orientated, not zone orientated. So, if there is a special
> > application that requires exclusive access, it's very difficult to configure
> > based on zones.  Furthermore, page table pages mapping data located in the
> > high-speed region are stored in the slower memory which potentially impacts
> > the performance if the working set of the application exceeds TLB reach.
> > Finally, while there is mention that Broadcom may have some special
> > interface to determine what applications can use the high-speed region,
> > it's hardware-specific as opposed to something that belongs in the core mm.
> > 
> > I agree that keeping the high-speed memory in a local node and using "sticky"
> > pageblocks or CMA has limitations of its own but in itself, that does not
> > justify using ZONE_MOVABLE in my opinion. The statement that ARM can have
> > multiple controllers with equal distance and bandwidth (if I'm reading it
> > correctly) but places them in different zones.... that's just a bit weird if
> > there are no other addressing limitations. It's not obvious why ARM would do
> > that, but it also does not matter because it shouldn't be a core mm concern.
>
> There appears to be some confusion regarding my explanation of multiple
> memory controllers on a device like the BCM7278. There is no inherent
> performance difference between the two memory controllers and their attached
> DRAM. They merely provide the opportunity to perform memory accesses in
> parallel for different physical address ranges. The physical address ranges
> were selected by the SoC designers for reasons only known to them, but I'm
> sure they had no consideration of zones in their decision making. The
> selection of zones remains an artifact of the design of Linux.
> 

Ok, so the channels are equal but not interleaved in hardware, so basically
you are trying to implement software-based memory channel interleaving?

> Since the BCM7278 contains a 4-core SMP cluster and each core can have
> multiple outstanding memory transactions the speed of DDR transactions can
> create a bottleneck for the system. If each memory controller has an
> effective bandwidth of X then, provided the system memory accesses can be
> distributed across both memory controllers, the combined effective bandwidth
> can be additive (X + X = 2X). Of course the actual result is highly
> dependent on the dependent clause "provided the system memory accesses can
> be distributed across both memory controllers". The accesses do not need to
> be evenly distributed to gain a benefit. We just want to reduce any idle
> time on each memory controller.
> 
> It was observed that the monotonic zone layout for a non-NUMA system (like
> this one) creates a bias for kernel space to use lower physical memory
> addresses and user space to use higher physical memory addresses. Broadcom
> customers requested the ability to locate movablecore memory within the
> physical address range of each memory controller and reported that it
> improved their system performance. Unfortunately, I do not have access to
> their data and I doubt they would allow me to share it if I did. I don't
> believe this is really about trying to optimize the performance of a
> specific application as much as trying to prevent overall system performance
> degradation from underutilized memory bandwidth.
> 

So if I'm reading this right, the intent behind using ZONE_MOVABLE at
fixed address ranges is so most (but not all) user pages end up using
one controller and all kernel pages and some user pages use the other
controller. If kernel and userspace accesses are split 50/50, then the
memory bandwidth usage will be split across channels.  However, if the
ratio of kernel:user accesses is large then the bandwidth usage will still
be asymmetric.

For example, if there are 10 times more accesses to kernel pages than user
pages, then one channel will receive most of the traffic. The reverse
is also true but to a lesser extent as user pages can use all zones and
kernel accesses use a subset. Depending on the access pattern, creating
separate zones may not help at all and in some cases, could make the problem
worse. The trap is that it might happen to work for a fixed appliance like
a TV with a predictable workload, it may not work in the general case.

Splitting based on __GFP_MOVABLE does not guarantee that idle time on
a memory controller can be reduced, as that depends on the access pattern.

> > There are already examples of where memory is physically "local" to
> > the CPU but has different bandwidth or latency including High Bandwidth
> > (HBM), Sub-NUMA Clustering (SNC), PMEM as a memory-like device and some
> > AMD EPYC Chips, particularly the first generation where a socket's memory
> > controllers had different distances. With the Broadcom controllers,
> > it sounds like a local memory controller but the bandwidth available
> > differs. It's functionally equivalent to HBM.
>
> The bandwidth available does not differ, but if too few transactions target
> one of the memory controllers, that controller's bandwidth is underutilized.
> 

This is also a limitation of the patch series. Let's say the bulk of
accesses are to user pages allocated in ZONE_MOVABLE which correlates to
one memory channel then one channel gets saturated anyway.

It also gets more complicated if there are more controllers because the
only division possible is between MOVABLE/everything else. An odd number
of channels will be difficult to split meaningfully.

> > The fact that the memory access is physically local to the CPU socket is
> > irrelevant when the characteristics of that locality differs. NUMA stands
> > for Non-Uniform Memory Access and if bandwidth to different address ranges
> > differs, then the system is inherently NUMA even if that is inconvenient.
>
> The bandwidth to different address ranges does not differ. A single threaded
> application should see no performance difference regardless of where its
> memory is located. However, if multiple application threads are executing in
> parallel and the memory is divided between the memory controllers they will
> be able to do more work per unit of time than if the memory is predominantly
> located on one memory controller.
> 

And if multiple application threads dominantly access user pages then
splitting the zone will not necessarily help, particularly if ZONE_MOVABLE
is not filled as the bulk of the accesses will still use one memory channel.

> > In the appliance case, it doesn't matter if the intent is that "all
> > application data should use high bandwidth memory where possible and
> > the application phase behaviour is predictable" and that may very well
> > work fine for the users of the Broadcom platforms with multiple memory
> > controllers. It does not work at all for the general case where access must
> > be restricted to a subset of tasks in a general system that can only be
> > controlled with memory policies.
> > 
> > The high bandwidth memory should be represented as a NUMA node, optionally
> > creating that node as ZONE_MOVABLE and relying on the zonelists to select
> > the movable zone as the first preference.
> > 
> This patch set is fundamentally about greater control over the placement of
> movablecore memory. The current implementation of movablecore requires all
> of the ZONE_MOVABLE memory to be located at the highest physical addresses
> of the system when CONFIG_NUMA is not set. Allowing the specification of a
> base address allows greater flexibility on systems where there are benefits.
> 

Unfortunately, while greater control of the ranges used by ZONE_MOVABLE
will help in some cases, it will not help in others and may be misleading.

If memory accesses need to be interleaved in software then the free lists
need to be broken up into arenas within a zone by some boundary whether
that boundary is fixed-length ranges, memory sections, memory channels
or specified explicitly on the kernel command line. Any allocation type
can use any arena with tasks moving to another arena based on contention,
pageblock type availability or interleaving round-robin explicitly.
Unfortunately, it's non-trivial to implement and a *lot* of heavy lifting.

A somewhat awful hack would be to reorder the top-level MAX_ORDER-1 free
list at boot time. By default that list is ordered

1, 2, 3 ...... n-2, n-1, n

If at boot time it was reordered to be

1, n, 2, n-1, 3, n-2 ......

This would interleave all the early allocations across memory channels in
the case where channels are based on large contiguous physical ranges of
memory. Applications starting early would then interleave between channels
but after a period of time, it would be pseudo-random and it's weak.

A similar, and probably better, option is to look at what page_alloc.shuffle=
does and randomly shuffle the free lists to randomly select between the
memory channels. I haven't looked at the implementation recently and forget
how it works exactly. Maybe it would benefit from being able to take ranges
that should be special cased for shuffling, particularly at boot time to
order it "1, n, 2, n-1" as described above or allowing SHUFFLE_ORDER to
be a lower value. Either way, shuffling would achieve similar goals of
spreading allocations between channels without assuming that the access
ratio of kernel:user is close to 1:1.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks
  2022-11-18 17:05     ` Mel Gorman
@ 2022-12-15  0:17       ` Doug Berger
  2023-01-04 15:43         ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Doug Berger @ 2022-12-15  0:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mike Kravetz, Florian Fainelli,
	David Hildenbrand, Oscar Salvador, Joonsoo Kim, linux-doc,
	linux-kernel, linux-mm

On 11/18/2022 9:05 AM, Mel Gorman wrote:
> On Wed, Nov 02, 2022 at 03:33:53PM -0700, Doug Berger wrote:
>>>
>>> Hotplug's requirements are somewhat different; the primary motivation that
>>> I'm aware of is being able to guarantee they can be offlined, particularly
>>> nodes, which can be done in some circumstances. Generally hotplug does
>>> not care what uses the memory as long as it can be removed later. The
>>> requirements for restricted access to high speed memory are different.
>>
>> This is effectively the same requirement that an implementation of
>> 'reusable' reserved memory has. A driver that owns reserved memory may not
>> care what uses the memory as long as the memory can be reclaimed when the
>> driver needs it. This is functionally analogous to memory hotplug. Reserved
>> memory that is 'reusable' and compatible with 'shared-dma-pool' uses the CMA
>> implementation, but there is room for an alternative implementation that
>> shares the memory more aggressively. This is a separate motivator for
>> Designated Movable Block support, but I am deferring that discussion since
>> it is desirable to have a more extended debate over APIs and such.
>>
> 
> There needs to be a better explanation as to why CMA cannot be used or, more
> importantly, why page_alloc.shuffle= should not be used (more on that later).
> It's not clear how a movable zone shares memory more aggressively than CMA
> would. Both have the problem that if the protected range is too large,
> premature memory exhaustion can occur for kernel allocations.
The pages within a CMA pool can be allocated by either the CMA allocator 
(i.e. alloc_contig_range()) for use by the kernel (i.e. drivers) or by 
the page allocator to satisfy movable requests (i.e. predominantly user 
space). However, the page allocator is constrained by rules that make 
allocations from MIGRATE_CMA free lists a secondary consideration when 
the memory is available elsewhere. This tends toward keeping pages in a 
CMA pool free while user processes consume more memory outside of the 
CMA pool that could have been used more generally by the kernel.
Pages on the MIGRATE_MOVABLE free list in ZONE_MOVABLE (i.e. all of the 
pages in the movable zone) are the first choice for satisfying movable 
requests. This allows user space to make full use of a 'reusable' 
reserved memory range that isn't actively used by a driver, which is 
what I mean by "more aggressive". When the driver wants to reclaim its 
'reusable' reserved memory range it can use alloc_contig_range() to 
force any movable allocations out of the range perhaps into memory that 
could have been used more generally by the kernel. Such a reclamation 
may be more time consuming to complete since the pages are more likely 
to be in use than if they were on the MIGRATE_CMA free list, but no 
pages go unused by either the driver or user space.

I was not familiar with page_alloc.shuffle, but it may very well have a 
role to play here.

> 
>>>
>>> There is a high degree of uncertainty about how these regions are to be
>>> used by applications to get access to the high speed memory, to quote
>>>
>>> 	I'm not certain what is typical because these systems are highly
>>> 	configurable and Broadcom's customers have different ideas about
>>> 	application processing.
>>>
>>> 	...
>>>
>>> 	The Designated Movable Block concept introduced here has the
>>> 	potential to offer useful services to different constituencies. I
>>> 	tried to highlight this in my V1 patch set with the hope of
>>> 	attracting some interest, but it can complicate the overall
>>> 	discussion, so I would like to maybe narrow the discussion here. It
>>> 	may be good to keep them in mind when assessing the overall value,
>>> 	but perhaps the "other opportunities" can be covered as a follow
>>> 	on discussion.
>>>
>>> I note the "potential" part here because we don't actually know.
>>
>> I used "potential" here not as in "it might be useful", but rather that
>> "different constituencies (i.e. people outside of the Broadcom ecosystem)
>> might also find them useful".
>>
> 
> That's very vague unfortunately.
As an example, there have been submissions related to memory hotplug 
that could have used this capability had it existed at the time. In 
fact, the comments for [1] incorrectly assumed the movablecore= 
behavior proposed here was already implemented. Eventually those patch 
sets morphed into the movable_node implementation and the modification 
of movablecore= was dropped.

The point I intended to make is that, in addition to the use case 
identified here (i.e. improved bandwidth utilization on multiple memory 
controller systems), Designated Movable Blocks as a general mechanism 
could be useful to people I don't know to solve problems of which I am 
not aware (e.g. memory hot unplugging, reusable reserved memory, ???). 
I offer this as evidence that I am not the only person to conceive of 
the concept.

[1] 
https://lore.kernel.org/all/1374220774-29974-21-git-send-email-tangchen@cn.fujitsu.com/

> 
>>> A
>>> major limitation of ZONE_MOVABLE is that there is no way of controlling
>>> access from userspace to restrict the high-speed memory to a designated
>>> application, only to all applications in general. The primary interface
>>> to control access to memory with different characteristics is mempolicies
>>> which is NUMA orientated, not zone orientated. So, if there is a special
>>> application that requires exclusive access, it's very difficult to configure
>>> based on zones.  Furthermore, page table pages mapping data located in the
>>> high-speed region are stored in the slower memory which potentially impacts
>>> the performance if the working set of the application exceeds TLB reach.
>>> Finally, while there is mention that Broadcom may have some special
>>> interface to determine what applications can use the high-speed region,
>>> it's hardware-specific as opposed to something that belongs in the core mm.
>>>
>>> I agree that keeping the high-speed memory in a local node and using "sticky"
>>> pageblocks or CMA has limitations of its own but in itself, that does not
>>> justify using ZONE_MOVABLE in my opinion. The statement that ARM can have
>>> multiple controllers with equal distance and bandwidth (if I'm reading it
>>> correctly) but places them in different zones.... that's just a bit weird if
>>> there are no other addressing limitations. It's not obvious why ARM would do
>>> that, but it also does not matter because it shouldn't be a core mm concern.
>>
>> There appears to be some confusion regarding my explanation of multiple
>> memory controllers on a device like the BCM7278. There is no inherent
>> performance difference between the two memory controllers and their attached
>> DRAM. They merely provide the opportunity to perform memory accesses in
>> parallel for different physical address ranges. The physical address ranges
>> were selected by the SoC designers for reasons only known to them, but I'm
>> sure they had no consideration of zones in their decision making. The
>> selection of zones remains an artifact of the design of Linux.
>>
> 
> Ok, so the channels are equal but the channels are not interleaved in
> hardware so basically you are trying to implement software-based memory
> channel interleaving?
I suppose that could be a fair characterization of the objective, though 
the approach taken here is very much a "poor man's" approach that 
attempts to improve things without the "heavy lifting" required for a 
more complete solution.

That said, the use of page_alloc.shuffle is interesting here since it 
could improve the likelihood of interleaving.

> 
>> Since the BCM7278 contains a 4-core SMP cluster and each core can have
>> multiple outstanding memory transactions the speed of DDR transactions can
>> create a bottleneck for the system. If each memory controller has an
>> effective bandwidth of X then, provided the system memory accesses can be
>> distributed across both memory controllers, the combined effective bandwidth
>> can be additive (X + X = 2X). Of course the actual result is highly
>> dependent on the dependent clause "provided the system memory accesses can
>> be distributed across both memory controllers". The accesses do not need to
>> be evenly distributed to gain a benefit. We just want to reduce any idle
>> time on each memory controller.
>>
>> It was observed that the monotonic zone layout for a non-NUMA system (like
>> this one) creates a bias for kernel space to use lower physical memory
>> addresses and user space to use higher physical memory addresses. Broadcom
>> customers requested the ability to locate movablecore memory within the
>> physical address range of each memory controller and reported that it
>> improved their system performance. Unfortunately, I do not have access to
>> their data and I doubt they would allow me to share it if I did. I don't
>> believe this is really about trying to optimize the performance of a
>> specific application as much as trying to prevent overall system performance
>> degradation from underutilized memory bandwidth.
>>
> 
> So if I'm reading this right, the intent behind using ZONE_MOVABLE at
> fixed address ranges is so most (but not all) user pages end up using
> one controller and all kernel pages and some user pages use the other
> controller. If kernel and userspace accesses are split 50/50, then the
> memory bandwidth usage will be split across channels.  However, if the
> ratio of kernel:user accesses is large then the bandwidth usage will still
> be asymmetric.
> 
> For example, if there are 10 times more accesses to kernel pages than user
> pages, then one channel will receive most of the traffic. The reverse
> is also true but to a lesser extent as user pages can use all zones and
> kernel accesses use a subset. Depending on the access pattern, creating
> separate zones may not help at all and in some cases, could make the problem
> worse. The trap is that it might happen to work for a fixed appliance like
> a TV with a predictable workload, it may not work in the general case.
> 
> Splitting based on the __GFP_MOVABLE does not guarantee that idle time on
> a memory controller can be reduced as it relies on the access pattern.
You are not reading it quite right. We could accomplish the split you 
describe on a BCM7278 SoC with two memory controllers using the existing 
movablecore=50% kernel parameter. This would create a ZONE_MOVABLE on 
the high address memory controller and a ZONE_DMA on the low address 
memory controller.

What is of interest to Broadcom customers is to better distribute user 
space accesses across each memory controller to improve the bandwidth 
available to user space dominated work flows. With no ZONE_MOVABLE, the 
BCM7278 SoC with 1GB of memory on each memory controller will place the 
1GB on the low address memory controller in ZONE_DMA and the 1GB on the 
high address memory controller in ZONE_NORMAL. With this layout movable 
allocation requests will only fallback to the ZONE_DMA (low memory 
controller) once the ZONE_NORMAL (high memory controller) is 
sufficiently depleted of free memory.

Adding ZONE_MOVABLE memory above ZONE_NORMAL with the current 
movablecore behavior does not improve this situation other than forcing 
more kernel allocations off of the high memory controller. User space 
allocations are even more likely to be on the high memory controller.

The Designated Movable Block mechanism allows ZONE_MOVABLE memory to be 
located on the low memory controller to make it easier for user space 
allocations to land on the low memory controller. If ZONE_MOVABLE is 
only placed on the low memory controller then user space allocations can 
land in ZONE_NORMAL on the high memory controller, but only through 
fallback after ZONE_MOVABLE is sufficiently depleted of free memory 
which is just the reverse of the existing situation. The Designated 
Movable Block mechanism allows ZONE_MOVABLE memory to be located on each 
memory controller so that user space allocations have equal access to 
each memory controller until the ZONE_MOVABLE memory is depleted and 
fallback to other zones occurs.

To my knowledge Broadcom customers that are currently using the 
Designated Movable Block mechanism are relying on the somewhat random 
starting and stopping of parallel user space processes to produce a more 
random distribution of ZONE_MOVABLE allocations across multiple memory 
controllers, but the page_alloc.shuffle mechanism seems like it would be 
a good addition to promote this randomness. Even better, it appears that 
page_alloc.shuffle is already enabled in the GKI configuration.

You are of course correct that the access patterns make all of the 
difference and it is almost certain that one memory controller or the 
other will be saturated at any given time, but the intent is to increase 
the opportunity to use more of the total bandwidth made available by the 
multiple memory controllers.

> 
>>> There are already examples of where memory is physically "local" to
>>> the CPU but has different bandwidth or latency, including High Bandwidth
>>> Memory (HBM), Sub-NUMA Clustering (SNC), PMEM as a memory-like device and
>>> some AMD EPYC chips, particularly the first generation where a socket's
>>> memory controllers had different distances. With the Broadcom controllers,
>>> it sounds like a local memory controller but the bandwidth available
>>> differs. It's functionally equivalent to HBM.
>>
>> The bandwidth available does not differ, but if too few transactions target
>> one of the memory controllers, that controller's bandwidth is underutilized.
>>
> 
> This is also a limitation of the patch series. Lets say the bulk of
> accesses are to user pages allocated in ZONE_MOVABLE which correlates to
> one memory channel then one channel gets saturated anyway.
> 
> It also gets more complicated if there are more controllers because the
> only division possible is between MOVABLE/everything else. An odd number
> of channels will be difficult to split meaningfully.
The patch series allows Designated Movable Blocks to occupy a portion of 
each memory controller while allowing the ZONE_MOVABLE zone to span all 
of the memory controllers. In this way user pages allocated from 
ZONE_MOVABLE may be distributed across all of the memory controllers. 
Use of page_alloc.shuffle should improve the randomness of this 
distribution. Memory outside of Designated Movable Blocks on each memory 
controller can be outside ZONE_MOVABLE (e.g. ZONE_DMA and ZONE_NORMAL) 
and managed accordingly. An odd number of channels need not affect this.

> 
>>> The fact that the memory access is physically local to the CPU socket is
>>> irrelevant when the characteristics of that locality differs. NUMA stands
>>> for Non-Uniform Memory Access and if bandwidth to different address ranges
>>> differs, then the system is inherently NUMA even if that is inconvenient.
>>
>> The bandwidth to different address ranges does not differ. A single threaded
>> application should see no performance difference regardless of where its
>> memory is located. However, if multiple application threads are executing in
>> parallel and the memory is divided between the memory controllers they will
>> be able to do more work per unit of time than if the memory is predominantly
>> located on one memory controller.
>>
> 
> And if multiple application threads dominantly access user pages then
> splitting the zone will not necessarily help, particularly if ZONE_MOVABLE
> is not filled as the bulk of the accesses will still use one memory channel.
> 
>>> In the appliance case, it doesn't matter if the intent is that "all
>>> application data should use high bandwidth memory where possible and
>>> the application phase behaviour is predictable" and that may very well
>>> work fine for the users of the Broadcom platforms with multiple memory
>>> controllers. It does not work at all for the general where access must
>>> be restricted to a subset of tasks in a general system that can only be
>>> controlled with memory policies.
>>>
>>> The high bandwidth memory should be represented as a NUMA node, optionally
>>> to create that node as ZONE_MOVABLE and relying on the zonelists to select
>>> the movable zone as the first preference.
>>>
>> This patch set is fundamentally about greater control over the placement of
>> movablecore memory. The current implementation of movablecore requires all
>> of the ZONE_MOVABLE memory to be located at the highest physical addresses
>> of the system when CONFIG_NUMA is not set. Allowing the specification of a
>> base address allows greater flexibility on systems where there are benefits.
>>
> 
> Unfortunately, while greater control of the ranges used by ZONE_MOVABLE
> will help in some cases, it will not help in others and may be misleading.
> 
> If memory accesses need to be interleaved in software then the free lists
> need to be broken up into arenas within a zone by some boundary whether
> that boundary is fixed-length ranges, memory sections, memory channels
> or specified explicitly on the kernel command line. Any allocation type
> can use any arena with tasks moving to another arena based on contention,
> pageblock type availability or interleaving round-robin explicitly.
> Unfortunately, it's non-trivial to implement and a *lot* of heavy lifting.
> 
> A somewhat awful hack would be to reorder top-level MAX_ORDER-1 list at
> boot time. By default that list is ordered
> 
> 1, 2, 3 ...... n-2, n-1, n
> 
> If at boot time it was reordered to be
> 
> 1, n, 2, n-1, 3, n-2 ......
> 
> This would interleave all the early allocations across memory channels in
> the case where channels are based on large contiguous physical ranges of
> memory. Applications starting early would then interleave between channels
> but after a period of time, it would be pseudo-random and it's weak.
> 
> A similar, and probably better, option is to look at what page_alloc.shuffle=
> does and randomly shuffle the free lists to randomly select between the
> memory channels. I haven't looked at the implementation recently and forget
> how it works exactly. Maybe it would benefit from being able to take ranges
> that should be special cased for shuffling, particularly at boot time to
> order it "1, n, 2, n-1" as described above or allowing SHUFFLE_ORDER to
> be a lower value. Either way, shuffling would achieve similar goals of
> spreading allocations between channels without assuming that the access
> ratio of kernel:user is close to 1:1.
> 

I decided to implement this very simple multi-threaded application as a 
testcase to experiment with the concepts discussed here:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

#define BUF_SIZE (0x4000000)
#define THREADS (4)
#define COPY_COUNT (30)

void *thread_function(void *ptr);

int main(void)
{
      pthread_t thread[THREADS];
      int i, iret[THREADS];

      for (i = 0; i < THREADS; i++)
	     iret[i] = pthread_create(&thread[i], NULL, thread_function,
				      NULL);

      for (i = 0; i < THREADS; i++)
	     pthread_join(thread[i], NULL);

      for (i = 0; i < THREADS; i++)
	     printf("Thread %d returns: %d\n", i, iret[i]);
      exit(0);
}

void *thread_function(void *ptr)
{
      char *s, *d;
      int i;

      /* The buffers are left uninitialized; only the memory bandwidth
       * consumed by the copies matters. */
      s = malloc(BUF_SIZE);
      if (!s)
	     return NULL;

      d = malloc(BUF_SIZE);
      if (!d) {
	     free(s);
	     return NULL;
      }

      for (i = 0; i < COPY_COUNT; i++)
	     memcpy(d, s, BUF_SIZE);

      free(s);
      free(d);
      return NULL;
}

It meaninglessly moves data from one large dynamically allocated buffer 
to another a number of times without trying to be clever. I experimented 
with a Broadcom BCM7278 system with 1GB on each memory controller (i.e. 
2GB total memory). The buffers were made large to render data caching 
meaningless and to require several pages to be allocated to populate the 
buffer.

With V3 of this patch set applied to a 6.1-rc1 kernel I observed these 
results:
With no movablecore kernel parameter specified:
# time /tmp/thread_test
Thread 1 returns: 0
Thread 2 returns: 0
Thread 3 returns: 0
Thread 4 returns: 0

real    0m4.047s
user    0m14.183s
sys     0m1.215s

With this additional kernel parameter "movablecore=600M":
# time /tmp/thread_test
Thread 0 returns: 0
Thread 1 returns: 0
Thread 2 returns: 0
Thread 3 returns: 0

real    0m4.068s
user    0m14.402s
sys     0m1.117s

With this additional kernel parameter "movablecore=600M@0x50000000":
# time /tmp/thread_test
Thread 0 returns: 0
Thread 1 returns: 0
Thread 2 returns: 0
Thread 3 returns: 0

real    0m4.010s
user    0m13.979s
sys     0m1.070s

However, with these additional kernel parameters 
"movablecore=300M@0x60000000,300M@0x320000000 page_alloc.shuffle=1":
# time /tmp/thread_test
Thread 0 returns: 0
Thread 1 returns: 0
Thread 2 returns: 0
Thread 3 returns: 0

real    0m3.173s
user    0m11.175s
sys     0m1.067s

These numbers show an over 20% improvement in performance of the test 
application when distributing ZONE_MOVABLE across both memory controllers.

Happy Holidays!
     Doug

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks
  2022-10-20 21:53 [PATCH v3 0/9] mm: introduce Designated Movable Blocks Doug Berger
                   ` (9 preceding siblings ...)
  2022-10-26 10:55 ` [PATCH v3 0/9] mm: introduce Designated Movable Blocks Mel Gorman
@ 2023-01-03 23:43 ` Florian Fainelli
  2023-01-04 15:56   ` David Hildenbrand
  10 siblings, 1 reply; 24+ messages in thread
From: Florian Fainelli @ 2023-01-03 23:43 UTC (permalink / raw)
  To: Doug Berger, Andrew Morton, Mel Gorman, David Hildenbrand
  Cc: Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mike Kravetz, Oscar Salvador, Joonsoo Kim,
	linux-doc, linux-kernel, linux-mm

On 10/20/22 14:53, Doug Berger wrote:
> MOTIVATION:
> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
> controllers with each mapped in a different address range within
> a Uniform Memory Architecture. Some users of these systems have
> expressed the desire to locate ZONE_MOVABLE memory on each
> memory controller to allow user space intensive processing to
> make better use of the additional memory bandwidth.
> Unfortunately, the historical monotonic layout of zones would
> mean that if the lowest addressed memory controller contains
> ZONE_MOVABLE memory then all of the memory available from
> memory controllers at higher addresses must also be in the
> ZONE_MOVABLE zone. This would force all kernel memory accesses
> onto the lowest addressed memory controller and significantly
> reduce the amount of memory available for non-movable
> allocations.
> 
> The main objective of this patch set is therefore to allow a
> block of memory to be designated as part of the ZONE_MOVABLE
> zone where it will always only be used by the kernel page
> allocator to satisfy requests for movable pages. The term
> Designated Movable Block is introduced here to represent such a
> block. The favored implementation allows extension of the
> 'movablecore' kernel parameter to allow specification of a base
> address and support for multiple blocks. The existing
> 'movablecore' mechanisms are retained.
> 
> BACKGROUND:
> NUMA architectures support distributing movablecore memory
> across each node, but it is undesirable to introduce the
> overhead and complexities of NUMA on systems that don't have a
> Non-Uniform Memory Architecture.
> 
> Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror option")
> also depends on zone overlap to support systems with multiple
> mirrored ranges.
> 
> Commit c6f03e2903c9 ("mm, memory_hotplug: remove zone restrictions")
> embraced overlapped zones for memory hotplug.
> 
> This commit set follows their lead to allow the ZONE_MOVABLE
> zone to overlap other zones. Designated Movable Blocks are made
> absent from overlapping zones and present within the
> ZONE_MOVABLE zone.
> 
> I initially investigated an implementation using a Designated
> Movable migrate type in line with comments[1] made by Mel Gorman
> regarding a "sticky" MIGRATE_MOVABLE type to avoid using
> ZONE_MOVABLE. However, this approach was riskier since it was
> much more intrusive on the allocation paths. Ultimately, the
> progress made by the memory hotplug folks to expand the
> ZONE_MOVABLE functionality convinced me to follow this approach.
> 

Mel, David, does the sub-thread discussion with Doug help ensuring that 
all of the context is gathered before getting into a more detailed patch 
review on a patch-by-patch basis?

Eventually we may need a fairly firm answer as to whether the proposed 
approach has any chance of landing upstream, in order to either commit 
to it in subsequent iterations of this patch set or find an alternative.

Thank you!
-- 
Florian


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks
  2022-12-15  0:17       ` Doug Berger
@ 2023-01-04 15:43         ` Mel Gorman
  2023-01-04 19:10           ` Florian Fainelli
  2023-01-19 22:33           ` Doug Berger
  0 siblings, 2 replies; 24+ messages in thread
From: Mel Gorman @ 2023-01-04 15:43 UTC (permalink / raw)
  To: Doug Berger
  Cc: Andrew Morton, Jonathan Corbet, Mike Rapoport, Paul E. McKenney,
	Neeraj Upadhyay, Randy Dunlap, Damien Le Moal, Muchun Song,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, KOSAKI Motohiro,
	Mike Kravetz, Florian Fainelli, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, linux-doc, linux-kernel, linux-mm

On Wed, Dec 14, 2022 at 04:17:35PM -0800, Doug Berger wrote:
> On 11/18/2022 9:05 AM, Mel Gorman wrote:
> > On Wed, Nov 02, 2022 at 03:33:53PM -0700, Doug Berger wrote:
> > > > 
> > > > Hotplug's requirements are somewhat different; the primary motivation that
> > > > I'm aware of is being able to guarantee they can be offlined, particularly
> > > > nodes, which can be done in some circumstances. Generally hotplug does
> > > > not care what uses the memory as long as it can be removed later. The
> > > > requirements for restricted access to high-speed memory are different.
> > > 
> > > This is effectively the same requirement that an implementation of
> > > 'reusable' reserved memory has. A driver that owns reserved memory may not
> > > care what uses the memory as long as the memory can be reclaimed when the
> > > driver needs it. This is functionally analogous to memory hotplug. Reserved
> > > memory that is 'reusable' and compatible with 'shared-dma-pool' uses the CMA
> > > implementation, but there is room for an alternative implementation that
> > > shares the memory more aggressively. This is a separate motivator for
> > > Designated Movable Block support, but I am deferring that discussion since
> > > it is desirable to have a more extended debate over APIs and such.
> > > 
> > 
> > There needs to be a better explanation as to why CMA cannot be used or, more
> > importantly, why page_alloc.shuffle= should not be used (more on that later).
> > It's not clear how a movable zone shares memory more aggressively than CMA
> > would. Both have the problem that if the protected range is too large,
> > premature memory exhaustion can occur for kernel allocations.
>
> The pages within a CMA pool can be allocated by either the CMA allocator
> (i.e. alloc_contig_range()) for use by the kernel (i.e. drivers) or by the
> page allocator to satisfy movable requests (i.e. predominantly user space).
> However, the page allocator is constrained by rules that make allocations
> from MIGRATE_CMA free lists a secondary consideration when the memory is
> available elsewhere. This tends toward keeping pages in a CMA pool free
> while user processes consume more memory outside of the CMA pool that could
> have been used more generally by the kernel.
> Pages on the MIGRATE_MOVABLE free list in ZONE_MOVABLE (i.e. all of the
> pages in the movable zone) are the first choice for satisfying movable
> requests. This allows user space to make full use of a 'reusable' reserved
> memory range that isn't actively used by a driver, which is what I mean by
> "more aggressive". When the driver wants to reclaim its 'reusable' reserved
> memory range it can use alloc_contig_range() to force any movable
> allocations out of the range perhaps into memory that could have been used
> more generally by the kernel. Such a reclamation may be more time consuming
> to complete since the pages are more likely to be in use than if they were
> on the MIGRATE_CMA free list, but no pages go unused by either the driver or
> user space.
> 

Ok, so CMA is a bad fit. It's off the table.

> I was not familiar with page_alloc.shuffle, but it may very well have a role
> to play here.
> 

It almost certainly does because unlike zones or CMA, it affects how
free lists are arranged. IIRC, the original purpose was about improving
performance of high-speed direct-mapped cache but it also serves a
purpose in this case -- randomising allocations between two channels.
It's still not perfect interleaving but better than none.

> > > > A
> > > > major limitation of ZONE_MOVABLE is that there is no way of controlling
> > > > access from userspace to restrict the high-speed memory to a designated
> > > > application, only to all applications in general. The primary interface
> > > > to control access to memory with different characteristics is mempolicies
> > > > which is NUMA orientated, not zone orientated. So, if there is a special
> > > > application that requires exclusive access, it's very difficult to configure
> > > > based on zones.  Furthermore, page table pages mapping data located in the
> > > > high-speed region are stored in the slower memory which potentially impacts
> > > > the performance if the working set of the application exceeds TLB reach.
> > > > Finally, while there is mention that Broadcom may have some special
> > > > interface to determine what applications can use the high-speed region,
> > > > it's hardware-specific as opposed to something that belongs in the core mm.
> > > > 
> > > > I agree that keeping the high-speed memory in a local node and using "sticky"
> > > > pageblocks or CMA has limitations of its own but in itself, that does not
> > > > justify using ZONE_MOVABLE in my opinion. The statement that ARM can have
> > > > multiple controllers with equal distance and bandwidth (if I'm reading it
> > > > correctly) but places them in different zones.... that's just a bit weird if
> > > > there are no other addressing limitations. It's not obvious why ARM would do
> > > > that, but it also does not matter because it shouldn't be a core mm concern.
> > > 
> > > There appears to be some confusion regarding my explanation of multiple
> > > memory controllers on a device like the BCM7278. There is no inherent
> > > performance difference between the two memory controllers and their attached
> > > DRAM. They merely provide the opportunity to perform memory accesses in
> > > parallel for different physical address ranges. The physical address ranges
> > > were selected by the SoC designers for reasons only known to them, but I'm
> > > sure they had no consideration of zones in their decision making. The
> > > selection of zones remains an artifact of the design of Linux.
> > > 
> > 
> > Ok, so the channels are equal but the channels are not interleaved in
> > hardware so basically you are trying to implement software-based memory
> > channel interleaving?
>
> I suppose that could be a fair characterization of the objective, though the
> approach taken here is very much a "poor man's" approach that attempts to
> improve things without the "heavy lifting" required for a more complete
> solution.
>

It's still unfortunate that this changes the concept of zones being
primarily about addressing or capability limitations. It's also difficult to use as
any user of it has to be very aware of the memory channel configuration of
the machine and know how to match addresses to channels. Information from
zoneinfo on start_pfns, spanned ranges and the like become less useful. It's
relatively minor but splitting the zones also means there is a performance
hit during compaction because pageblock_pfn_to_page is more expensive.

> That said, the use of page_alloc.shuffle is interesting here since it could
> improve the likelihood of interleaving.
> 

Exactly.

> > 
> > > Since the BCM7278 contains a 4-core SMP cluster and each core can have
> > > multiple outstanding memory transactions the speed of DDR transactions can
> > > create a bottleneck for the system. If each memory controller has an
> > > effective bandwidth of X then, provided the system memory accesses can be
> > > distributed across both memory controllers, the combined effective bandwidth
> > > can be additive (X + X = 2X). Of course the actual result is highly
> > > dependent on the dependent clause "provided the system memory accesses can
> > > be distributed across both memory controllers". The accesses do not need to
> > > be evenly distributed to gain a benefit. We just want to reduce any idle
> > > time on each memory controller.
> > > 
> > > It was observed that the monotonic zone layout for a non-NUMA system (like
> > > this one) creates a bias for kernel space to use lower physical memory
> > > addresses and user space to use higher physical memory addresses. Broadcom
> > > customers requested the ability to locate movablecore memory within the
> > > physical address range of each memory controller and reported that it
> > > improved their system performance. Unfortunately, I do not have access to
> > > their data and I doubt they would allow me to share it if I did. I don't
> > > believe this is really about trying to optimize the performance of a
> > > specific application as much as trying to prevent overall system performance
> > > degradation from underutilized memory bandwidth.
> > > 
> > 
> > So if I'm reading this right, the intent behind using ZONE_MOVABLE at
> > fixed address ranges is so most (but not all) user pages end up using
> > one controller and all kernel pages and some user pages use the other
> > controller. If kernel and userspace accesses are split 50/50, then the
> > memory bandwidth usage will be split across channels.  However, if the
> > ratio of kernel:user accesses is large then the bandwidth usage will still
> > be asymmetric.
> > 
> > For example, if there are 10 times more accesses to kernel pages than user
> > pages, then one channel will receive most of the traffic. The reverse
> > is also true but to a lesser extent as user pages can use all zones and
> > kernel accesses use a subset. Depending on the access pattern, creating
> > separate zones may not help at all and in some cases, could make the problem
> > worse. The trap is that it might happen to work for a fixed appliance like
> > a TV with a predictable workload, but it may not work in the general case.
> > 
> > Splitting based on the __GFP_MOVABLE does not guarantee that idle time on
> > a memory controller can be reduced as it relies on the access pattern.
>
> You are not reading it quite right. We could accomplish the split you
> describe on a BCM7278 SoC with two memory controllers using the existing
> movablecore=50% kernel parameter. This would create a ZONE_MOVABLE on the
> high address memory controller and a ZONE_DMA on the low address memory
> controller.
> 

Ok, I did misunderstand at the time that ZONE_MOVABLE would be split
between the controllers to improve interleaving of user accesses.

> What is of interest to Broadcom customers is to better distribute user space
> accesses across each memory controller to improve the bandwidth available to
> user space dominated work flows. With no ZONE_MOVABLE, the BCM7278 SoC with
> 1GB of memory on each memory controller will place the 1GB on the low
> address memory controller in ZONE_DMA and the 1GB on the high address memory
> controller in ZONE_NORMAL. With this layout movable allocation requests will
> only fallback to the ZONE_DMA (low memory controller) once the ZONE_NORMAL
> (high memory controller) is sufficiently depleted of free memory.
>
> Adding ZONE_MOVABLE memory above ZONE_NORMAL with the current movablecore
> behavior does not improve this situation other than forcing more kernel
> allocations off of the high memory controller. User space allocations are
> even more likely to be on the high memory controller.
> 

But it's a weak promise that interleaving will happen. If only a portion
of ZONE_MOVABLE is used, it might still be all on the same channel. This
might improve over time if enough memory was used and the system was up
for long enough.

> The Designated Movable Block mechanism allows ZONE_MOVABLE memory to be
> located on the low memory controller to make it easier for user space
> allocations to land on the low memory controller. If ZONE_MOVABLE is only
> placed on the low memory controller then user space allocations can land in
> ZONE_NORMAL on the high memory controller, but only through fallback after
> ZONE_MOVABLE is sufficiently depleted of free memory which is just the
> reverse of the existing situation. The Designated Movable Block mechanism
> allows ZONE_MOVABLE memory to be located on each memory controller so that
> user space allocations have equal access to each memory controller until the
> ZONE_MOVABLE memory is depleted and fallback to other zones occurs.
> 
> To my knowledge Broadcom customers that are currently using the Designated
> Movable Block mechanism are relying on the somewhat random starting and
> stopping of parallel user space processes to produce a more random
> distribution of ZONE_MOVABLE allocations across multiple memory controllers,
> but the page_alloc.shuffle mechanism seems like it would be a good addition
> to promote this randomness. Even better, it appears that page_alloc.shuffle
> is already enabled in the GKI configuration.
> 

The "random starting and stopping of parallel user space processes" is
required for the mechanism to work. It's arbitrary and unknown whether the
interleaving happens, whereas shuffle has an immediate, if random, impact.

> You are of course correct that the access patterns make all of the
> difference and it is almost certain that one memory controller or the other
> will be saturated at any given time, but the intent is to increase the
> opportunity to use more of the total bandwidth made available by the
> multiple memory controllers.
> 

And shuffle should also provide that opportunity except it's trivial
to configure and only requires the user to know the memory channels are
not interleaved.

> > 
> > > > There are already examples of where memory is physically "local" to
> > > > the CPU but has different bandwidth or latency including High Bandwidth
> > > > (HBM), Sub-NUMA Clustering (SNC), PMEM as a memory-like device and some
> > > > AMD EPYC chips, particularly the first generation where a socket's memory
> > > > controllers had different distances. With the Broadcom controllers,
> > > > it sounds like a local memory controller but the bandwidth available
> > > > differs. It's functionally equivalent to HBM.
> > > 
> > > The bandwidth available does not differ, but if too few transactions target
> > > one of the memory controllers, that controller's bandwidth is underutilized.
> > > 
> > 
> > This is also a limitation of the patch series. Let's say the bulk of
> > accesses are to user pages allocated in ZONE_MOVABLE which correlates to
> > one memory channel then one channel gets saturated anyway.
> > 
> > It also gets more complicated if there are more controllers because the
> > only division possible is between MOVABLE/everything else. An odd number
> > of channels will be difficult to split meaningfully.
>
> The patch series allows Designated Movable Blocks to occupy a portion of
> each memory controller while allowing the ZONE_MOVABLE zone to span all of
> the memory controllers. In this way user pages allocated from ZONE_MOVABLE
> may be distributed across all of the memory controllers. Use of
> page_alloc.shuffle should improve the randomness of this distribution.
> Memory outside of Designated Movable Blocks on each memory controller can be
> outside ZONE_MOVABLE (e.g. ZONE_DMA and ZONE_NORMAL) and managed
> accordingly. An odd number of channels need not affect this.
>

You're right, if ZONE_MOVABLE is split across the channels then the
number of channels is less important.
 
> > 
> > > > The fact that the memory access is physically local to the CPU socket is
> > > > irrelevant when the characteristics of that locality differs. NUMA stands
> > > > for Non-Uniform Memory Access and if bandwidth to different address ranges
> > > > differs, then the system is inherently NUMA even if that is inconvenient.
> > > 
> > > The bandwidth to different address ranges does not differ. A single threaded
> > > application should see no performance difference regardless of where its
> > > memory is located. However, if multiple application threads are executing in
> > > parallel and the memory is divided between the memory controllers they will
> > > be able to do more work per unit of time than if the memory is predominantly
> > > located on one memory controller.
> > > 
> > 
> > And if multiple application threads dominantly access user pages then
> > splitting the zone will not necessarily help, particularly if ZONE_MOVABLE
> > is not filled as the bulk of the accesses will still use one memory channel.
> > 
> > > > In the appliance case, it doesn't matter if the intent is that "all
> > > > application data should use high bandwidth memory where possible and
> > > > the application phase behaviour is predictable" and that may very well
> > > > work fine for the users of the Broadcom platforms with multiple memory
> > > > controllers. It does not work at all for the general case where access must
> > > > be restricted to a subset of tasks in a general system that can only be
> > > > controlled with memory policies.
> > > > 
> > > > The high bandwidth memory should be representated as a NUMA node, optionally
> > > > to create that node as ZONE_MOVABLE and relying on the zonelists to select
> > > > the movable zone as the first preference.
> > > > 
> > > This patch set is fundamentally about greater control over the placement of
> > > movablecore memory. The current implementation of movablecore requires all
> > > of the ZONE_MOVABLE memory to be located at the highest physical addresses
> > > of the system when CONFIG_NUMA is not set. Allowing the specification of a
> > > base address allows greater flexibility on systems where there are benefits.
> > > 
> > 
> > Unfortunately, while greater control of the ranges used by ZONE_MOVABLE
> > will help in some cases, it will not help in others and may be misleading.
> > 
> > If memory accesses need to be interleaved in software then the free lists
> > need to be broken up into arenas within a zone by some boundary whether
> > that boundary is fixed-length ranges, memory sections, memory channels
> > or specified explicitly on the kernel command line. Any allocation type
> > can use any arena with tasks moving to another arena based on contention,
> > pageblock type availability or interleaving round-robin explicitly.
> > Unfortunately, it's non-trivial to implement and a *lot* of heavy lifting.
> > 
> > A somewhat awful hack would be to reorder the top-level MAX_ORDER-1 free
> > list at boot time. By default that list is ordered
> > boot time. By default that list is ordered
> > 
> > 1, 2, 3 ...... n-2, n-1, n
> > 
> > If at boot time it was reordered to be
> > 
> > 1, n, 2, n-1, 3, n-2 ......
> > 
> > This would interleave all the early allocations across memory channels in
> > the case where channels are based on large contiguous physical ranges of
> > memory. Applications starting early would then interleave between channels
> > but after a period of time, it would be pseudo-random and it's weak.
> > 
> > A similar, and probably better, option is to look at what page_alloc.shuffle=
> > does and randomly shuffle the free lists to randomly select between the
> > memory channels. I haven't looked at the implementation recently and forget
> > how it works exactly. Maybe it would benefit from being able to take ranges
> > that should be special cased for shuffling, particularly at boot time to
> > order it "1, n, 2, n-1" as described above or allowing SHUFFLE_ORDER to
> > be a lower value. Either way, shuffling would achieve similar goals of
> > spreading allocations between channels without assuming that the access
> > ratio of kernel:user is close to 1:1.
> > 
> 
> I decided to implement this very simple multi-threaded application as a
> testcase to experiment with the concepts discussed here:
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <pthread.h>
> 
> #define BUF_SIZE (0x4000000)
> #define THREADS (4)
> #define COPY_COUNT (30)
> 
> void *thread_function( void *ptr );
> 
> int main()
> {
>      pthread_t thread[THREADS];
>      int  i, iret[THREADS];
> 
>      for(i = 0; i < THREADS; i++)
> 	     iret[i] = pthread_create(&thread[i], NULL, thread_function, NULL);
> 
>      for(i = 0; i < THREADS; i++)
> 	     pthread_join( thread[i], NULL);
> 
>      for(i = 0; i < THREADS; i++)
> 	     printf("Thread %d returns: %d\n", i, iret[i]);
>      exit(0);
> }
> 
> void *thread_function( void *ptr )
> {
>      char *s, *d;
>      int i;
> 
>      s = malloc(BUF_SIZE);
>      if (!s)
> 	     return NULL;
> 
>      d = malloc(BUF_SIZE);
>      if (!d) {
> 	     free(s);
> 	     return NULL;
>      }
> 
>      for (i = 0; i < COPY_COUNT; i++) {
> 	     memcpy(d, s, BUF_SIZE);
>      }
>      free(s);
>      free(d);
>      return NULL;
> }
> 

Straightforward, and while it's a meaningless load it is relevant to this problem.
The buffers are small enough that all threads would likely sit on the same
controller with a vanilla kernel and compete for bandwidth.

> It meaninglessly moves data from one large dynamically allocated buffer to
> another a number of times without trying to be clever. I experimented with a
> Broadcom BCM7278 system with 1GB on each memory controller (i.e. 2GB total
> memory). The buffers were made large to render data caching meaningless and
> to require several pages to be allocated to populate the buffer.
> 
> With V3 of this patch set applied to a 6.1-rc1 kernel I observed these
> results:
> With no movablecore kernel parameter specified:
> # time /tmp/thread_test
> Thread 1 returns: 0
> Thread 2 returns: 0
> Thread 3 returns: 0
> Thread 4 returns: 0
> 
> real    0m4.047s
> user    0m14.183s
> sys     0m1.215s
> 
> With this additional kernel parameter "movablecore=600M":
> # time /tmp/thread_test
> Thread 0 returns: 0
> Thread 1 returns: 0
> Thread 2 returns: 0
> Thread 3 returns: 0
> 
> real    0m4.068s
> user    0m14.402s
> sys     0m1.117s
> 
> With this additional kernel parameter "movablecore=600M@0x50000000":
> # time /tmp/thread_test
> Thread 0 returns: 0
> Thread 1 returns: 0
> Thread 2 returns: 0
> Thread 3 returns: 0
> 
> real    0m4.010s
> user    0m13.979s
> sys     0m1.070s
> 
> However, with these additional kernel parameters
> "movablecore=300M@0x60000000,300M@0x320000000 page_alloc.shuffle=1":
> # time /tmp/thread_test
> Thread 0 returns: 0
> Thread 1 returns: 0
> Thread 2 returns: 0
> Thread 3 returns: 0
> 
> real    0m3.173s
> user    0m11.175s
> sys     0m1.067s
> 

What were the results with just
"movablecore=300M@0x60000000,300M@0x320000000" on its own and
page_alloc.shuffle=1 on its own?

For shuffle on its own, my expectations are that the results will be
variable, sometimes good and sometimes bad, because it's at the mercy of
the randomisation. It might be slightly improved if the initial top-level
lists were ordered "1, n, 2, n-1, 3, n-2" optionally in __shuffle_zone or
if shuffle_pick_tail was aware of the memory channels but a lot more work
to implement.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks
  2023-01-03 23:43 ` Florian Fainelli
@ 2023-01-04 15:56   ` David Hildenbrand
  2023-01-04 19:00     ` Florian Fainelli
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2023-01-04 15:56 UTC (permalink / raw)
  To: Florian Fainelli, Doug Berger, Andrew Morton, Mel Gorman
  Cc: Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mike Kravetz, Oscar Salvador, Joonsoo Kim,
	linux-doc, linux-kernel, linux-mm

On 04.01.23 00:43, Florian Fainelli wrote:
> On 10/20/22 14:53, Doug Berger wrote:
>> MOTIVATION:
>> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
>> controllers with each mapped in a different address range within
>> a Uniform Memory Architecture. Some users of these systems have
>> expressed the desire to locate ZONE_MOVABLE memory on each
>> memory controller to allow user space intensive processing to
>> make better use of the additional memory bandwidth.
>> Unfortunately, the historical monotonic layout of zones would
>> mean that if the lowest addressed memory controller contains
>> ZONE_MOVABLE memory then all of the memory available from
>> memory controllers at higher addresses must also be in the
>> ZONE_MOVABLE zone. This would force all kernel memory accesses
>> onto the lowest addressed memory controller and significantly
>> reduce the amount of memory available for non-movable
>> allocations.
>>
>> The main objective of this patch set is therefore to allow a
>> block of memory to be designated as part of the ZONE_MOVABLE
>> zone where it will always only be used by the kernel page
>> allocator to satisfy requests for movable pages. The term
>> Designated Movable Block is introduced here to represent such a
>> block. The favored implementation allows extension of the
>> 'movablecore' kernel parameter to allow specification of a base
>> address and support for multiple blocks. The existing
>> 'movablecore' mechanisms are retained.
>>
>> BACKGROUND:
>> NUMA architectures support distributing movablecore memory
>> across each node, but it is undesirable to introduce the
>> overhead and complexities of NUMA on systems that don't have a
>> Non-Uniform Memory Architecture.
>>
>> Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror option")
>> also depends on zone overlap to support systems with multiple
>> mirrored ranges.
>>
>> Commit c6f03e2903c9 ("mm, memory_hotplug: remove zone restrictions")
>> embraced overlapped zones for memory hotplug.
>>
>> This commit set follows their lead to allow the ZONE_MOVABLE
>> zone to overlap other zones. Designated Movable Blocks are made
>> absent from overlapping zones and present within the
>> ZONE_MOVABLE zone.
>>
>> I initially investigated an implementation using a Designated
>> Movable migrate type in line with comments[1] made by Mel Gorman
>> regarding a "sticky" MIGRATE_MOVABLE type to avoid using
>> ZONE_MOVABLE. However, this approach was riskier since it was
>> much more intrusive on the allocation paths. Ultimately, the
>> progress made by the memory hotplug folks to expand the
>> ZONE_MOVABLE functionality convinced me to follow this approach.
>>
> 
> Mel, David, does the sub-thread discussion with Doug help ensuring that
> all of the context is gathered before getting into a more detailed patch
> review on a patch-by-patch basis?
> 
> Eventually we may need a fairly firm answer as to whether the proposed
> approach has any chance of landing upstream in order to either commit to
> in subsequent iterations of this patch set, or find an alternative.


As raised, I'd appreciate if less intrusive alternatives could be 
evaluated (e.g., fake NUMA nodes and being able to just use mbind(), 
moving such memory to ZONE_MOVABLE after boot via something like daxctl).

I'm not convinced that these intrusive changes are worth it at this 
point. Further, some of the assumptions (ZONE_MOVABLE == user space) are 
not really future proof as I raised.

-- 
Thanks,

David / dhildenb


* Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks
  2023-01-04 15:56   ` David Hildenbrand
@ 2023-01-04 19:00     ` Florian Fainelli
  2023-01-05 13:29       ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Florian Fainelli @ 2023-01-04 19:00 UTC (permalink / raw)
  To: David Hildenbrand, Doug Berger, Andrew Morton, Mel Gorman
  Cc: Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mike Kravetz, Oscar Salvador, Joonsoo Kim,
	linux-doc, linux-kernel, linux-mm

On 1/4/23 07:56, David Hildenbrand wrote:
> On 04.01.23 00:43, Florian Fainelli wrote:
>> On 10/20/22 14:53, Doug Berger wrote:
>>> MOTIVATION:
>>> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
>>> controllers with each mapped in a different address range within
>>> a Uniform Memory Architecture. Some users of these systems have
>>> expressed the desire to locate ZONE_MOVABLE memory on each
>>> memory controller to allow user space intensive processing to
>>> make better use of the additional memory bandwidth.
>>> Unfortunately, the historical monotonic layout of zones would
>>> mean that if the lowest addressed memory controller contains
>>> ZONE_MOVABLE memory then all of the memory available from
>>> memory controllers at higher addresses must also be in the
>>> ZONE_MOVABLE zone. This would force all kernel memory accesses
>>> onto the lowest addressed memory controller and significantly
>>> reduce the amount of memory available for non-movable
>>> allocations.
>>>
>>> The main objective of this patch set is therefore to allow a
>>> block of memory to be designated as part of the ZONE_MOVABLE
>>> zone where it will always only be used by the kernel page
>>> allocator to satisfy requests for movable pages. The term
>>> Designated Movable Block is introduced here to represent such a
>>> block. The favored implementation allows extension of the
>>> 'movablecore' kernel parameter to allow specification of a base
>>> address and support for multiple blocks. The existing
>>> 'movablecore' mechanisms are retained.
>>>
>>> BACKGROUND:
>>> NUMA architectures support distributing movablecore memory
>>> across each node, but it is undesirable to introduce the
>>> overhead and complexities of NUMA on systems that don't have a
>>> Non-Uniform Memory Architecture.
>>>
>>> Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror 
>>> option")
>> also depends on zone overlap to support systems with multiple
>>> mirrored ranges.
>>>
>>> Commit c6f03e2903c9 ("mm, memory_hotplug: remove zone restrictions")
>>> embraced overlapped zones for memory hotplug.
>>>
>>> This commit set follows their lead to allow the ZONE_MOVABLE
>>> zone to overlap other zones. Designated Movable Blocks are made
>>> absent from overlapping zones and present within the
>>> ZONE_MOVABLE zone.
>>>
>>> I initially investigated an implementation using a Designated
>>> Movable migrate type in line with comments[1] made by Mel Gorman
>>> regarding a "sticky" MIGRATE_MOVABLE type to avoid using
>>> ZONE_MOVABLE. However, this approach was riskier since it was
>>> much more intrusive on the allocation paths. Ultimately, the
>>> progress made by the memory hotplug folks to expand the
>>> ZONE_MOVABLE functionality convinced me to follow this approach.
>>>
>>
>> Mel, David, does the sub-thread discussion with Doug help ensuring that
>> all of the context is gathered before getting into a more detailed patch
>> review on a patch-by-patch basis?
>>
>> Eventually we may need a fairly firm answer as to whether the proposed
>> approach has any chance of landing upstream in order to either commit to
>> in subsequent iterations of this patch set, or find an alternative.
> 
> 
> As raised, I'd appreciate if less intrusive alternatives could be 
>> evaluated (e.g., fake NUMA nodes and being able to just use mbind(), 
> moving such memory to ZONE_MOVABLE after boot via something like daxctl).

This is not an option with the environment we have to ultimately fit in, 
which is Android TV utilizing the GKI kernel, which does not enable NUMA 
and probably never will, and for similar reasons bringing a whole swath 
of user-space tools like daxctl may not be practical either, from both a 
logistical perspective (simply getting the tools built with bionic, 
accepted etc.) as well as system configuration perspective.

> 
> I'm not convinced that these intrusive changes are worth it at this 
> point. Further, some of the assumptions (ZONE_MOVABLE == user space) are 
> not really future proof as I raised.

I find this patch set reasonably small in contrast to a lot of other mm/ 
changes, what did you find intrusive specifically?

AFAICT, the only assumption being made is that ZONE_MOVABLE 
contains memory that can be moved, but even if it did not in the future, 
there should hopefully be enough opportunities, given a large enough DMB 
region to service the allocation requests of its users. I will go back 
and read your comment to make sure I don't misunderstand it.

Thanks
-- 
Florian


* Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks
  2023-01-04 15:43         ` Mel Gorman
@ 2023-01-04 19:10           ` Florian Fainelli
  2023-01-19 22:33           ` Doug Berger
  1 sibling, 0 replies; 24+ messages in thread
From: Florian Fainelli @ 2023-01-04 19:10 UTC (permalink / raw)
  To: Mel Gorman, Doug Berger
  Cc: Andrew Morton, Jonathan Corbet, Mike Rapoport, Paul E. McKenney,
	Neeraj Upadhyay, Randy Dunlap, Damien Le Moal, Muchun Song,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, KOSAKI Motohiro,
	Mike Kravetz, David Hildenbrand, Oscar Salvador, Joonsoo Kim,
	linux-doc, linux-kernel, linux-mm

On 1/4/23 07:43, Mel Gorman wrote:

[snip]

>> What is of interest to Broadcom customers is to better distribute user space
>> accesses across each memory controller to improve the bandwidth available to
>> user space dominated work flows. With no ZONE_MOVABLE, the BCM7278 SoC with
>> 1GB of memory on each memory controller will place the 1GB on the low
>> address memory controller in ZONE_DMA and the 1GB on the high address memory
>> controller in ZONE_NORMAL. With this layout movable allocation requests will
>> only fallback to the ZONE_DMA (low memory controller) once the ZONE_NORMAL
>> (high memory controller) is sufficiently depleted of free memory.
>>
>> Adding ZONE_MOVABLE memory above ZONE_NORMAL with the current movablecore
>> behavior does not improve this situation other than forcing more kernel
>> allocations off of the high memory controller. User space allocations are
>> even more likely to be on the high memory controller.
>>
> 
> But it's a weak promise that interleaving will happen. If only a portion
> of ZONE_MOVABLE is used, it might still be all on the same channel. This
> might improve over time if enough memory was used and the system was up
> for long enough.

It is indeed a weak promise for user-space allocations out of 
ZONE_MOVABLE, however the other consumer of the DMB region is a kernel 
driver (typically a video decoder engine) which is directly tied to a 
specific memory controller/DMB region. For the kernel driver using the 
DMB region there is a hard guarantee from the kernel that it gets memory 
from a specific PFN range mapping directly to the desired memory 
controller and thus it is meeting the desired bandwidth 
allocation/deadlines/bursts etc.

We care about both sides of the coin, though we acknowledge that 
"controlling" where user-space allocations are coming from such that 
they be steered towards a specific memory controller is a much harder 
task and so having some amount of non-determinism is acceptable here.
-- 
Florian


* Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks
  2023-01-04 19:00     ` Florian Fainelli
@ 2023-01-05 13:29       ` David Hildenbrand
  2023-04-18 20:09         ` Florian Fainelli
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2023-01-05 13:29 UTC (permalink / raw)
  To: Florian Fainelli, Doug Berger, Andrew Morton, Mel Gorman
  Cc: Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mike Kravetz, Oscar Salvador, Joonsoo Kim,
	linux-doc, linux-kernel, linux-mm

On 04.01.23 20:00, Florian Fainelli wrote:
> On 1/4/23 07:56, David Hildenbrand wrote:
>> On 04.01.23 00:43, Florian Fainelli wrote:
>>> On 10/20/22 14:53, Doug Berger wrote:
>>>> MOTIVATION:
>>>> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
>>>> controllers with each mapped in a different address range within
>>>> a Uniform Memory Architecture. Some users of these systems have
>>>> expressed the desire to locate ZONE_MOVABLE memory on each
>>>> memory controller to allow user space intensive processing to
>>>> make better use of the additional memory bandwidth.
>>>> Unfortunately, the historical monotonic layout of zones would
>>>> mean that if the lowest addressed memory controller contains
>>>> ZONE_MOVABLE memory then all of the memory available from
>>>> memory controllers at higher addresses must also be in the
>>>> ZONE_MOVABLE zone. This would force all kernel memory accesses
>>>> onto the lowest addressed memory controller and significantly
>>>> reduce the amount of memory available for non-movable
>>>> allocations.
>>>>
>>>> The main objective of this patch set is therefore to allow a
>>>> block of memory to be designated as part of the ZONE_MOVABLE
>>>> zone where it will always only be used by the kernel page
>>>> allocator to satisfy requests for movable pages. The term
>>>> Designated Movable Block is introduced here to represent such a
>>>> block. The favored implementation allows extension of the
>>>> 'movablecore' kernel parameter to allow specification of a base
>>>> address and support for multiple blocks. The existing
>>>> 'movablecore' mechanisms are retained.
>>>>
>>>> BACKGROUND:
>>>> NUMA architectures support distributing movablecore memory
>>>> across each node, but it is undesirable to introduce the
>>>> overhead and complexities of NUMA on systems that don't have a
>>>> Non-Uniform Memory Architecture.
>>>>
>>>> Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror
>>>> option")
>>>> also depends on zone overlap to support systems with multiple
>>>> mirrored ranges.
>>>>
>>>> Commit c6f03e2903c9 ("mm, memory_hotplug: remove zone restrictions")
>>>> embraced overlapped zones for memory hotplug.
>>>>
>>>> This commit set follows their lead to allow the ZONE_MOVABLE
>>>> zone to overlap other zones. Designated Movable Blocks are made
>>>> absent from overlapping zones and present within the
>>>> ZONE_MOVABLE zone.
>>>>
>>>> I initially investigated an implementation using a Designated
>>>> Movable migrate type in line with comments[1] made by Mel Gorman
>>>> regarding a "sticky" MIGRATE_MOVABLE type to avoid using
>>>> ZONE_MOVABLE. However, this approach was riskier since it was
>>>> much more intrusive on the allocation paths. Ultimately, the
>>>> progress made by the memory hotplug folks to expand the
>>>> ZONE_MOVABLE functionality convinced me to follow this approach.
>>>>
>>>
>>> Mel, David, does the sub-thread discussion with Doug help ensuring that
>>> all of the context is gathered before getting into a more detailed patch
>>> review on a patch-by-patch basis?
>>>
>>> Eventually we may need a fairly firm answer as to whether the proposed
>>> approach has any chance of landing upstream in order to either commit to
>>> in subsequent iterations of this patch set, or find an alternative.
>>
>>
>> As raised, I'd appreciate if less intrusive alternatives could be
>> evaluated (e.g., fake NUMA nodes and being able to just use mbind(),
>> moving such memory to ZONE_MOVABLE after boot via something like daxctl).
> 
> This is not an option with the environment we have to ultimately fit in
> which is Android TV utilizing the GKI kernel which does not enable NUMA
> and probably never will, and for similar reasons bringing a whole swath
> of user-space tools like daxctl may not be practical either, from both a
> logistical perspective (simply getting the tools built with bionic,
> accepted etc.) as well as system configuration perspective.

Adding feature A because people (whoever the "people" are!) don't want 
to enable feature B? I hope I don't have to tell you what I think 
about statements like this :)

If feature B is a problem, try stripping it down such that it can be 
enabled. If it's too hard to configure for your use case, maybe we can 
extend configuration options to avoid tools like daxctl for some special 
cases.

But of course, only if feature B actually solves the problem.

One issue I have with DMB is actual use cases / users / requirements. 
Maybe requirements are defined somewhere cleanly and I missed them.

If we have clear requirements, we can talk about possible solutions. If 
we have a specific solution, it's harder to talk about requirements.

>>
>> I'm not convinced that these intrusive changes are worth it at this
>> point. Further, some of the assumptions (ZONE_MOVABLE == user space) are
>> not really future proof as I raised.
> 
> I find this patch set reasonably small in contrast to a lot of other mm/
> changes, what did you find intrusive specifically?
> 
> AFAICT, the only assumption that is being made is that ZONE_MOVABLE
> contains memory that can be moved, but even if it did not in the future,
> there should hopefully be enough opportunities, given a large enough DMB
> region, to service the allocation requests of its users. I will go back
> and read your comment to make sure I don't misunderstand it.

Let me clarify what ZONE_MOVABLE can and cannot do:

* We cannot assume that specific user space allocations are served from
   it, neither can we really modify behavior.
* We cannot assume that user space allocations won't be migrated off
   that zone to another zone.
* We cannot assume that no other (kernel) allocations will end up on it.
* We cannot make specific processes preferably consume memory from it.

Designing a feature that relies on any of these assumptions is IMHO wrong.

If you want an application to consume memory from a specific address 
range, there are some possible ways I can see:

(1) Model the special memory areas using fake NUMA nodes. e.g., mbind()
     the applications to these nodes. Use ZONE_MOVABLE to make sure we
     don't get unmovable allocations. The buddy will take care of it.
(2) Use some driver that manages that memory and provides that memory
     to an application by mmap()'ing it. The buddy won't manage it (no
     swap, migration ...). DEVDAX is one possible such driver.
(3) Use hugetlb and reserve them from the selected memory ranges.
     Make the application consume these hugetlb pages.

For a single node, without a special driver, it gets more complicated: 
We'd need new way to tell the buddy that these memory ranges are 
"special". I don't want to use the word "new zone" but that's most 
likely what it would have to be. Further, one would need a way to 
specify that only specific allocations should end up on these ranges.

Maybe I'm overthinking this. Having clear requirements such that we can 
try discussing solutions and exploring alternatives would be great.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks
  2023-01-04 15:43         ` Mel Gorman
  2023-01-04 19:10           ` Florian Fainelli
@ 2023-01-19 22:33           ` Doug Berger
  1 sibling, 0 replies; 24+ messages in thread
From: Doug Berger @ 2023-01-19 22:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Jonathan Corbet, Mike Rapoport, Paul E. McKenney,
	Neeraj Upadhyay, Randy Dunlap, Damien Le Moal, Muchun Song,
	Vlastimil Babka, Johannes Weiner, Michal Hocko, KOSAKI Motohiro,
	Mike Kravetz, Florian Fainelli, David Hildenbrand,
	Oscar Salvador, Joonsoo Kim, linux-doc, linux-kernel, linux-mm

On 1/4/2023 7:43 AM, Mel Gorman wrote:
> On Wed, Dec 14, 2022 at 04:17:35PM -0800, Doug Berger wrote:
>> On 11/18/2022 9:05 AM, Mel Gorman wrote:
>>> On Wed, Nov 02, 2022 at 03:33:53PM -0700, Doug Berger wrote:

[snip]

>> I was not familiar with page_alloc.shuffle, but it may very well have a role
>> to play here.
>>
> 
> It almost certainly does because unlike zones or CMA, it affects how
> free lists are arranged. IIRC, the original purpose was about improving
> performance of high-speed direct-mapped cache but it also serves a
> purpose in this case -- randomising allocations between two channels.
> It's still not perfect interleaving but better than none.

Agreed.

>>>>> A
>>>>> major limitation of ZONE_MOVABLE is that there is no way of controlling
>>>>> access from userspace to restrict the high-speed memory to a designated
>>>>> application, only to all applications in general. The primary interface
>>>>> to control access to memory with different characteristics is mempolicies
>>>>> which is NUMA orientated, not zone orientated. So, if there is a special
>>>>> application that requires exclusive access, it's very difficult to configure
>>>>> based on zones.  Furthermore, page table pages mapping data located in the
>>>>> high-speed region are stored in the slower memory which potentially impacts
>>>>> the performance if the working set of the application exceeds TLB reach.
>>>>> Finally, while there is mention that Broadcom may have some special
>>>>> interface to determine what applications can use the high-speed region,
>>>>> it's hardware-specific as opposed to something that belongs in the core mm.
>>>>>
>>>>> I agree that keeping the high-speed memory in a local node and using "sticky"
>>>>> pageblocks or CMA has limitations of its own but in itself, that does not
>>>>> justify using ZONE_MOVABLE in my opinion. The statement that ARM can have
>>>>> multiple controllers with equal distance and bandwidth (if I'm reading it
>>>>> correctly) but places them in different zones.... that's just a bit weird if
>>>>> there are no other addressing limitations. It's not obvious why ARM would do
>>>>> that, but it also does not matter because it shouldn't be a core mm concern.
>>>>
>>>> There appears to be some confusion regarding my explanation of multiple
>>>> memory controllers on a device like the BCM7278. There is no inherent
>>>> performance difference between the two memory controllers and their attached
>>>> DRAM. They merely provide the opportunity to perform memory accesses in
>>>> parallel for different physical address ranges. The physical address ranges
>>>> were selected by the SoC designers for reasons only known to them, but I'm
>>>> sure they had no consideration of zones in their decision making. The
>>>> selection of zones remains an artifact of the design of Linux.
>>>>
>>>
>>> Ok, so the channels are equal but the channels are not interleaved in
>>> hardware so basically you are trying to implement software-based memory
>>> channel interleaving?
>>
>> I suppose that could be a fair characterization of the objective, though the
>> approach taken here is very much a "poor man's" approach that attempts to
>> improve things without requiring the "heavy lifting" required for a more
>> complete solution.
>>
> 
> It's still unfortunate that the concept of zones being primarily about
> addressing or capability limitations changes.

Perhaps arguably, the ZONE_MOVABLE zone continues to be about a 
capability limitation (i.e. the page allocator cannot use the zone to 
satisfy requests for non-movable/pinnable memory). This capability 
limitation has value in different use cases. hugetlbfs benefits by 
being able to move data to better compact free memory into higher-order 
free pages. The memory hotplug users benefit by being able to move data 
before removing memory from the system. A "reusable" reserved memory 
implementation could benefit from it by being able to move data out of 
the range when it is reclaimed by the software that owns the reservation.

The capability limitation has the follow-on attribute that the zone is 
prioritized for user-space allocations because the virtual address 
abstraction of user-space creates the perfect opportunity for physical 
address independence allowing for movement of data. This is the 
attribute that is of interest to the multi-channel memory without 
hardware interleaving use case discussed here rather than the actual 
movability of the data.

The Designated Movable Blocks proposal is a generic mechanism for adding 
flexibility to determining what memory should be included in the 
ZONE_MOVABLE zone, and as a result it could support any of these use 
cases. The memory hotplug developers proposed a similar mechanism early 
on in their development of what ultimately became the movable_node 
implementation.

> It's also difficult to use as
> any user of it has to be very aware of the memory channel configuration of
> the machine and know how to match addresses to channels. Information from
> zoneinfo on start_pfns, spanned ranges and the like become less useful. It's
> relatively minor but splitting the zones also means there is a performance
> hit during compaction because pageblock_pfn_to_page is more expensive.

I agree that it requires special knowledge of the system to configure 
for the multi-channel memory without hardware interleaving use case, but 
that becomes a task for the system administrator who wants to take 
advantage of the performance benefit of this specific use case. The 
users don't actually need to be aware of it in this case, and there are 
no cases where such configuration would occur automatically on systems 
that were not explicitly interested in it. The memory hotplug developers 
were able to avoid this complexity using ACPI and SRAT tables, which is 
why they withdrew their early proposed command line arguments, but those 
features are not currently available to Broadcom customers.

[snip]

>> What is of interest to Broadcom customers is to better distribute user space
>> accesses across each memory controller to improve the bandwidth available to
>> user space dominated work flows. With no ZONE_MOVABLE, the BCM7278 SoC with
>> 1GB of memory on each memory controller will place the 1GB on the low
>> address memory controller in ZONE_DMA and the 1GB on the high address memory
>> controller in ZONE_NORMAL. With this layout movable allocation requests will
>> only fallback to the ZONE_DMA (low memory controller) once the ZONE_NORMAL
>> (high memory controller) is sufficiently depleted of free memory.
>>
>> Adding ZONE_MOVABLE memory above ZONE_NORMAL with the current movablecore
>> behavior does not improve this situation other than forcing more kernel
>> allocations off of the high memory controller. User space allocations are
>> even more likely to be on the high memory controller.
>>
> 
> But it's a weak promise that interleaving will happen. If only a portion
> of ZONE_MOVABLE is used, it might still be all on the same channel. This
> might improve over time if enough memory was used and the system was up
> for long enough.

A "lightly" loaded system is unlikely to see much, if any, benefit from 
this configuration, but such a system has much less competition for 
resources. As noted previously, it is the more "heavily" loaded system 
with multiple parallel user space intensive processes that can benefit 
by reducing the memory bottleneck created by the biasing of user space 
allocations to higher addressed zones. The page_alloc.shuffle feature 
does appear to remove the need for time to pass.

> 
>> The Designated Movable Block mechanism allows ZONE_MOVABLE memory to be
>> located on the low memory controller to make it easier for user space
>> allocations to land on the low memory controller. If ZONE_MOVABLE is only
>> placed on the low memory controller then user space allocations can land in
>> ZONE_NORMAL on the high memory controller, but only through fallback after
>> ZONE_MOVABLE is sufficiently depleted of free memory which is just the
>> reverse of the existing situation. The Designated Movable Block mechanism
>> allows ZONE_MOVABLE memory to be located on each memory controller so that
>> user space allocations have equal access to each memory controller until the
>> ZONE_MOVABLE memory is depleted and fallback to other zones occurs.
>>
>> To my knowledge Broadcom customers that are currently using the Designated
>> Movable Block mechanism are relying on the somewhat random starting and
>> stopping of parallel user space processes to produce a more random
>> distribution of ZONE_MOVABLE allocations across multiple memory controllers,
>> but the page_alloc.shuffle mechanism seems like it would be a good addition
>> to promote this randomness. Even better, it appears that page_alloc.shuffle
>> is already enabled in the GKI configuration.
>>
> 
> The "random starting and stopping of parallel user space processes" is
> required for the mechanism to work. It's arbitrary and unknown if the
> interleaving happens where as shuffle has an immediate, if random, impact.

Yes, page_alloc.shuffle does improve things.

> 
>> You are of course correct that the access patterns make all of the
>> difference and it is almost certain that one memory controller or the other
>> will be saturated at any given time, but the intent is to increase the
>> opportunity to use more of the total bandwidth made available by the
>> multiple memory controllers.
>>
> 
> And shuffle should also provide that opportunity except it's trivial
> to configure and only requires the user to know the memory channels are
> not interleaved.

The problem with page_alloc.shuffle on its own is that the shuffling can 
only occur within a zone. As noted for the BCM7278 SoC described above, 
the low memory controller contains only ZONE_DMA memory and the high 
memory controller contains only ZONE_NORMAL memory. Shuffling the pages 
within a zone will not improve the random placement of allocations 
across the multiple memory controllers unless a zone spans all memory 
controllers. The creation of Designated Movable Blocks allows a 
ZONE_MOVABLE zone to be created that spans all memory controllers in the 
system with an equivalent footprint on each.

[snip]

>> I experimented with a
>> Broadcom BCM7278 system with 1GB on each memory controller (i.e. 2GB total
>> memory). The buffers were made large to render data caching meaningless and
>> to require several pages to be allocated to populate the buffer.
>>
>> With V3 of this patch set applied to a 6.1-rc1 kernel I observed these
>> results:
>> With no movablecore kernel parameter specified:
>> # time /tmp/thread_test
>> Thread 1 returns: 0
>> Thread 2 returns: 0
>> Thread 3 returns: 0
>> Thread 4 returns: 0
>>
>> real    0m4.047s
>> user    0m14.183s
>> sys     0m1.215s
>>
>> With this additional kernel parameter "movablecore=600M":
>> # time /tmp/thread_test
>> Thread 0 returns: 0
>> Thread 1 returns: 0
>> Thread 2 returns: 0
>> Thread 3 returns: 0
>>
>> real    0m4.068s
>> user    0m14.402s
>> sys     0m1.117s
>>
>> With this additional kernel parameter "movablecore=600M@0x50000000":
>> # time /tmp/thread_test
>> Thread 0 returns: 0
>> Thread 1 returns: 0
>> Thread 2 returns: 0
>> Thread 3 returns: 0
>>
>> real    0m4.010s
>> user    0m13.979s
>> sys     0m1.070s
>>
>> However, with these additional kernel parameters
>> "movablecore=300M@0x60000000,300M@0x320000000 page_alloc.shuffle=1":
>> # time /tmp/thread_test
>> Thread 0 returns: 0
>> Thread 1 returns: 0
>> Thread 2 returns: 0
>> Thread 3 returns: 0
>>
>> real    0m3.173s
>> user    0m11.175s
>> sys     0m1.067s
>>
> 
> What were the results with just
> "movablecore=300M@0x60000000,300M@0x320000000" on its own and
> page_alloc.shuffle=1 on its own?
> 
> For shuffle on its own, my expectations are that the results will be
> variable, sometimes good and sometimes bad, because it's at the mercy of
> the randomisation. It might be slightly improved if the initial top-level
> lists were ordered "1, n, 2, n-1, 3, n-2" optionally in __shuffle_zone or
> if shuffle_pick_tail was aware of the memory channels but a lot more work
> to implement.

With the kernel parameters
"movablecore=300M@0x60000000,300M@0x320000000"
# time /tmp/thread_test
Thread 0 returns: 0
Thread 1 returns: 0
Thread 2 returns: 0
Thread 3 returns: 0

real    0m3.562s
user    0m12.386s
sys     0m1.176s

The "movablecore=300M@0x60000000,300M@0x320000000" result is worse than 
when combined with the shuffle parameter, but may improve over time due 
to "random starting and stopping of parallel user space processes".

With the kernel parameters
"page_alloc.shuffle=1"
# time /tmp/thread_test
Thread 0 returns: 0
Thread 1 returns: 0
Thread 2 returns: 0
Thread 3 returns: 0

real    0m4.056s
user    0m14.680s
sys     0m1.060s

The shuffle on its own result is no better than no movablecore parameter 
because all of ZONE_NORMAL is on the high memory controller so the pages 
don't get shuffled between controllers.

Happy New Year!
     Doug



* Re: [PATCH v3 0/9] mm: introduce Designated Movable Blocks
  2023-01-05 13:29       ` David Hildenbrand
@ 2023-04-18 20:09         ` Florian Fainelli
  0 siblings, 0 replies; 24+ messages in thread
From: Florian Fainelli @ 2023-04-18 20:09 UTC (permalink / raw)
  To: David Hildenbrand, Doug Berger, Andrew Morton, Mel Gorman
  Cc: Jonathan Corbet, Mike Rapoport, Borislav Petkov,
	Paul E. McKenney, Neeraj Upadhyay, Randy Dunlap, Damien Le Moal,
	Muchun Song, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	KOSAKI Motohiro, Mike Kravetz, Oscar Salvador, Joonsoo Kim,
	linux-doc, linux-kernel, linux-mm

On 1/5/23 05:29, David Hildenbrand wrote:
> On 04.01.23 20:00, Florian Fainelli wrote:
>> On 1/4/23 07:56, David Hildenbrand wrote:
>>> On 04.01.23 00:43, Florian Fainelli wrote:
>>>> On 10/20/22 14:53, Doug Berger wrote:
>>>>> MOTIVATION:
>>>>> Some Broadcom devices (e.g. 7445, 7278) contain multiple memory
>>>>> controllers with each mapped in a different address range within
>>>>> a Uniform Memory Architecture. Some users of these systems have
>>>>> expressed the desire to locate ZONE_MOVABLE memory on each
>>>>> memory controller to allow user space intensive processing to
>>>>> make better use of the additional memory bandwidth.
>>>>> Unfortunately, the historical monotonic layout of zones would
>>>>> mean that if the lowest addressed memory controller contains
>>>>> ZONE_MOVABLE memory then all of the memory available from
>>>>> memory controllers at higher addresses must also be in the
>>>>> ZONE_MOVABLE zone. This would force all kernel memory accesses
>>>>> onto the lowest addressed memory controller and significantly
>>>>> reduce the amount of memory available for non-movable
>>>>> allocations.
>>>>>
>>>>> The main objective of this patch set is therefore to allow a
>>>>> block of memory to be designated as part of the ZONE_MOVABLE
>>>>> zone where it will always only be used by the kernel page
>>>>> allocator to satisfy requests for movable pages. The term
>>>>> Designated Movable Block is introduced here to represent such a
>>>>> block. The favored implementation allows extension of the
>>>>> 'movablecore' kernel parameter to allow specification of a base
>>>>> address and support for multiple blocks. The existing
>>>>> 'movablecore' mechanisms are retained.
>>>>>
>>>>> BACKGROUND:
>>>>> NUMA architectures support distributing movablecore memory
>>>>> across each node, but it is undesirable to introduce the
>>>>> overhead and complexities of NUMA on systems that don't have a
>>>>> Non-Uniform Memory Architecture.
>>>>>
>>>>> Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror
>>>>> option")
>>>>> also depends on zone overlap to support systems with multiple
>>>>> mirrored ranges.
>>>>>
>>>>> Commit c6f03e2903c9 ("mm, memory_hotplug: remove zone restrictions")
>>>>> embraced overlapped zones for memory hotplug.
>>>>>
>>>>> This commit set follows their lead to allow the ZONE_MOVABLE
>>>>> zone to overlap other zones. Designated Movable Blocks are made
>>>>> absent from overlapping zones and present within the
>>>>> ZONE_MOVABLE zone.
>>>>>
>>>>> I initially investigated an implementation using a Designated
>>>>> Movable migrate type in line with comments[1] made by Mel Gorman
>>>>> regarding a "sticky" MIGRATE_MOVABLE type to avoid using
>>>>> ZONE_MOVABLE. However, this approach was riskier since it was
>>>>> much more intrusive on the allocation paths. Ultimately, the
>>>>> progress made by the memory hotplug folks to expand the
>>>>> ZONE_MOVABLE functionality convinced me to follow this approach.
>>>>>
>>>>
>>>> Mel, David, does the sub-thread discussion with Doug help ensuring that
>>>> all of the context is gathered before getting into a more detailed 
>>>> patch
>>>> review on a patch-by-patch basis?
>>>>
>>>> Eventually we may need a fairly firm answer as to whether the proposed
>>>> approach has any chance of landing upstream in order to either 
>>>> commit to
>>>> in subsequent iterations of this patch set, or find an alternative.
>>>
>>>
>>> As raised, I'd appreciate if less intrusive alternatives could be
>>> evaluated (e.g., fake NUMA nodes and being able to just use mbind(),
>>> moving such memory to ZONE_MOVABLE after boot via something like 
>>> daxctl).
>>
>> This is not an option with the environment we have to ultimately fit in
>> which is Android TV utilizing the GKI kernel which does not enable NUMA
>> and probably never will, and for similar reasons bringing a whole swath
>> of user-space tools like daxctl may not be practical either, from both a
>> logistical perspective (simply getting the tools built with bionic,
>> accepted etc.) as well as system configuration perspective.

(looks like I never replied to this email, whoops)

> 
> Adding feature A because people don't want to (! whoever the "people" 
> are) enable feature B? I hope I don't have to tell you what I think 
> about statements like this :)

It is not just that NUMA is not wanted, it is also not a great fit: the 
ARM CPU cluster and most peripherals that Linux cares about do have 
uniform memory access to the available DRAM controllers/DRAM chips.

Only a subset of the peripherals, especially the real-time and high 
bandwidth ones like video decoders and display, may not access DRAM 
uniformly. This stems from the fact that the memory controller(s) on the 
System-on-Chip we work with have a star topology and schedule the 
accesses of each DRAM client (CPU, GPU, video decoder, display, 
Ethernet, PCIe, etc.) differently in order to guarantee a certain 
quality of service.

On a system with multiple DRAM controllers / DRAM chips, you will 
typically see video decoder + display instance #0 be serviced by DRAM 
controller 0, and video decoder + display instance #1 be serviced by 
DRAM controller 1; this is the only way to allow dual decode + display 
as they are very bandwidth hungry.

The splitting or load balancing is done on a PFN basis: DRAM pages below 
a certain address are serviced by DRAM controller #0 and those above 
another cut-off are serviced by DRAM controller #1.

> 
> If feature B is a problem, try stripping it down such that it can be 
> enabled. If it's too hard to configure for your use case, maybe we can 
> extend configuration options to avoid tools like daxctl for some special 
> cases.

I do not see the splitting of the notion of a 'memory node' object away 
from CONFIG_NUMA going anywhere, and sorry to put it that way, but this 
would require many months of work for a result that is not even clear, 
and would be undone anytime someone is not aware of that larger effort.

> 
> But of course, only if feature B actually solves the problem.
> 
> One issue I have with DMB is actual use cases / users / requirements. 
> Maybe requirements are defined somewhere cleanly and I missed them.

That part is entirely fair, the requirements would be as follows:

- we need to be able to control precisely across the available DRAM 
range which specific PFNs fall within specific zones, and the 
consequence is that we should also be able to have a non-monotonically 
increasing definition of zones such that there is an appropriate balance 
between zones and the underlying PFNs / backing DRAM controller instance

- device driver(s) should be able to allocate (via 
alloc_contig_range() and friends) memory from specific regions of DRAM, 
which should be covered by an underlying zone/fallback/migrate set of 
heuristics that maximizes the re-use of such memory when the driver is 
not using it

- the underlying zone/fallback/migrate type heuristics should not hold 
memory "excessively" in reserve (CMA, I am looking at you) but rather 
should allow for all of the memory, in ideal conditions, to be "claimed" 
by the device driver(s) if they so desire

- it is acceptable to spend time compacting/reclaiming memory under 
tight memory pressure since the transitions requiring said driver(s) to 
allocate are slow path/control events

We have other "soft" requirements which are mainly logistical such that:

- the least amount of files are changed

- there is no need for custom user-space to be running in order to 
set-up the regions, aka plug & play is highly desirable

- there is no dependency upon CONFIG_NUMA in order to simplify the 
deployment

- there is no overhead to the other users of the patch set and the 
behavior is entirely opt-in

> 
> If we have clear requirements, we can talk about possible solutions. If 
> we have a specific solution, it's harder to talk about requirements.
> 
>>
>>> I'm not convinced that these intrusive changes are worth it at this
>>> point. Further, some of the assumptions (ZONE_MOVABLE == user space) are
>>> not really future proof as I raised.
>>
>> I find this patch set reasonably small in contrast to a lot of other mm/
>> changes, what did you find intrusive specifically?
>>
>> AFAICT, the only assumption that is being made is that ZONE_MOVABLE
>> contains memory that can be moved, but even if it did not in the future,
>> there should hopefully be enough opportunities, given a large enough DMB
>> region, to service the allocation requests of its users. I will go back
>> and read your comment to make sure I don't misunderstand it.
> 
> Let me clarify what ZONE_MOVABLE can and cannot do:
> 
> * We cannot assume that specific user space allocations are served from
>    it, neither can we really modify behavior.
> * We cannot assume that user space allocations won't be migrated off
>    that zone to another zone.
> * We cannot assume that no other (kernel) allocations will end up on it.
> * We cannot make specific processes preferably consume memory from it.
> 
> Designing a feature that relies on any of these assumptions is IMHO wrong.
> 
> If you want an application to consume memory from a specific address 
> range, there are some possible ways I can see:
> 
> (1) Model the special memory areas using fake NUMA nodes. e.g., mbind()
>      the applications to these nodes. Use ZONE_MOVABLE to make sure we
>      don't get unmovable allocations. The buddy will take care of it.
> (2) Use some driver that manages that memory and provides that memory
>      to an application by mmap()'ing it. The buddy won't manage it (no
>      swap, migration ...). DEVDAX is one possible such driver.
> (3) Use hugetlb and reserve them from the selected memory ranges.
>      Make the application consume these hugetlb pages.
> 
> For a single node, without a special driver, it gets more complicated: 
> We'd need new way to tell the buddy that these memory ranges are 
> "special". I don't want to use the word "new zone" but that's most 
> likely what it would have to be. Further, one would need a way to 
> specify that only specific allocations should end up on these ranges.
> 
> Maybe I'm overthinking this. Having clear requirements such that we can 
> try discussing solutions and exploring alternatives would be great.

This was helpful. We do not need or want to control precisely where and 
how applications allocate memory; we just need the kernel to do a good 
enough job at re-using the memory defined in ZONE_MOVABLE when there is 
no other consumer of the memory residing there. We are perfectly fine 
with a rather opportunistic and non-deterministic approach, as it does 
not require any specific support on the user-space side. For instance, 
using page_alloc.shuffle=1 as Doug showed would be entirely acceptable, 
even if it is fairly naive currently.

Hope this helps.
-- 
Florian



end of thread, other threads:[~2023-04-18 20:09 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-20 21:53 [PATCH v3 0/9] mm: introduce Designated Movable Blocks Doug Berger
2022-10-20 21:53 ` [PATCH v3 1/9] lib/show_mem.c: display MovableOnly Doug Berger
2022-10-20 21:53 ` [PATCH v3 2/9] mm/page_alloc: calculate node_spanned_pages from pfns Doug Berger
2022-10-20 21:53 ` [PATCH v3 3/9] mm/page_alloc: prevent creation of empty zones Doug Berger
2022-10-20 21:53 ` [PATCH v3 4/9] mm/page_alloc.c: allow oversized movablecore Doug Berger
2022-10-20 21:53 ` [PATCH v3 5/9] mm/page_alloc: introduce init_reserved_pageblock() Doug Berger
2022-10-20 21:53 ` [PATCH v3 6/9] memblock: introduce MEMBLOCK_MOVABLE flag Doug Berger
2022-10-20 21:53 ` [PATCH v3 7/9] mm/dmb: Introduce Designated Movable Blocks Doug Berger
2022-10-20 21:53 ` [PATCH v3 8/9] mm/page_alloc: make alloc_contig_pages DMB aware Doug Berger
2022-10-20 21:53 ` [PATCH v3 9/9] mm/page_alloc: allow base for movablecore Doug Berger
2022-10-26 10:55 ` [PATCH v3 0/9] mm: introduce Designated Movable Blocks Mel Gorman
2022-10-26 11:11   ` David Hildenbrand
2022-10-26 12:02     ` Mel Gorman
2022-11-02 22:33   ` Doug Berger
2022-11-18 17:05     ` Mel Gorman
2022-12-15  0:17       ` Doug Berger
2023-01-04 15:43         ` Mel Gorman
2023-01-04 19:10           ` Florian Fainelli
2023-01-19 22:33           ` Doug Berger
2023-01-03 23:43 ` Florian Fainelli
2023-01-04 15:56   ` David Hildenbrand
2023-01-04 19:00     ` Florian Fainelli
2023-01-05 13:29       ` David Hildenbrand
2023-04-18 20:09         ` Florian Fainelli
