* [PATCH v2 0/4] Interface for higher order contiguous allocations
@ 2018-05-03 23:29 Mike Kravetz
  2018-05-03 23:29 ` [PATCH v2 1/4] mm: change type of free_contig_range(nr_pages) to unsigned long Mike Kravetz
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Mike Kravetz @ 2018-05-03 23:29 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Reinette Chatre, Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, Vlastimil Babka,
	David Nellans, Laura Abbott, Pavel Machek, Dave Hansen,
	Andrew Morton, Mike Kravetz

A respin of the series to address these issues:

- fix issues found by kbuild
- find_alloc_contig_pages() should take nr_pages as argument instead of
  page order (Vlastimil and Michal).
- Cleaned up migratetype handling (Vlastimil and Michal).
- Use pfn_to_online_page instead of pfn_to_page as suggested by Michal.
  Also added comment about minimal number of conditions checked in
  contig_pfn_range_valid().
- When scanning pfns in zone, take pgdat_resize_lock() instead of
  zone->lock (Michal)

Also, 
- Separate patch to change type of free_contig_range(nr_pages) to an
  unsigned long so that it is consistent with other uses of nr_pages.
- Separate patch to optionally validate migratetype during pageblock
  isolation.
- Make find_alloc_contig_pages() work for smaller size allocation by
  simply calling __alloc_pages_nodemask().

Vlastimil and Michal brought up the issue of allocation alignment.  The
routine will currently align to 'nr_pages' (which is the requested size
argument).  It does this by examining and trying to allocate the first
nr_pages aligned/nr_pages sized range.  If this fails, it moves on to the
next nr_pages aligned/nr_pages sized range until success or all potential
ranges are exhausted.  If we allow an alignment to be specified, we will
need to potentially check all alignment aligned/nr_pages sized ranges.
In the worst case where alignment = PAGE_SIZE, this could result in a huge
increase in the number of ranges to check.
To help cut down on the number of ranges to check, we could identify the
first page that causes a range allocation failure and start the next
range at the next aligned boundary.  I tried this, and we still end up
with a huge number of ranges and wasted CPU cycles.
This series did not add an alignment option.  Allocations are aligned to
nr_pages as described above.  If someone can think of a good way to support
an alignment argument, I am open to implementing/adding it.
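
For reference, the per-zone scan boils down to roughly the following
(a simplified sketch of the approach in patch 3; locking, validity
checks and error handling are omitted, and range_is_usable() is just a
stand-in for contig_pfn_range_valid()):

	pfn = ALIGN(zone->zone_start_pfn, nr_pages);
	while (zone_spans_pfn(zone, pfn + nr_pages - 1)) {
		if (range_is_usable(zone, pfn, nr_pages) &&
		    !alloc_contig_range(pfn, pfn + nr_pages, migratetype, gfp))
			return pfn_to_page(pfn);	/* success */
		pfn += nr_pages;	/* next nr_pages aligned candidate */
	}

With an explicit alignment argument, the stride on the last line of the
loop would become 'alignment' rather than 'nr_pages', which is what
explodes the number of candidate ranges in the alignment = PAGE_SIZE
worst case.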

As described before,
these patches came out of the "[RFC] mmap(MAP_CONTIG)" discussions at:
http://lkml.kernel.org/r/21f1ec96-2822-1189-1c95-79a2bb491571@oracle.com

One suggestion in that thread was to create a friendlier interface that
could be used by drivers and others outside core mm code to allocate a
contiguous set of pages.  The alloc_contig_range() interface is used for
this purpose today by CMA and gigantic page allocation.  However, this is
not a general purpose interface.  So, wrap alloc_contig_range() in the
more general interface:

struct page *find_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp,
					int nid, nodemask_t *nodemask)

This interface provides essentially the same functionality as the
hugetlb specific routine alloc_gigantic_page().  After creating the
interface, change alloc_gigantic_page() to call find_alloc_contig_pages()
and delete all the supporting code in hugetlb.c.
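
For illustration, a driver outside hugetlb might use the pair of
interfaces roughly as follows (a sketch only; the size and gfp flags
are arbitrary and not a recommendation from this series):

	/* try to grab 64MB of physically contiguous memory on this node */
	unsigned long nr_pages = (64UL << 20) >> PAGE_SHIFT;
	struct page *page;

	page = find_alloc_contig_pages(nr_pages, GFP_KERNEL, numa_node_id(),
					NULL);
	if (!page)
		return -ENOMEM;

	/* ... use the physically contiguous range starting at 'page' ... */

	free_contig_pages(page, nr_pages);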

A new use case for allocating contiguous memory has been identified in
Intel(R) Resource Director Technology Cache Pseudo-Locking.

Mike Kravetz (4):
  mm: change type of free_contig_range(nr_pages) to unsigned long
  mm: check for proper migrate type during isolation
  mm: add find_alloc_contig_pages() interface
  mm/hugetlb: use find_alloc_contig_pages() to allocate gigantic pages

 include/linux/gfp.h            |  14 +++-
 include/linux/page-isolation.h |   8 +--
 mm/cma.c                       |   2 +-
 mm/hugetlb.c                   |  87 ++--------------------
 mm/memory_hotplug.c            |   2 +-
 mm/page_alloc.c                | 159 +++++++++++++++++++++++++++++++++++++----
 mm/page_isolation.c            |  40 ++++++++---
 7 files changed, 200 insertions(+), 112 deletions(-)

-- 
2.13.6


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/4] mm: change type of free_contig_range(nr_pages) to unsigned long
  2018-05-03 23:29 [PATCH v2 0/4] Interface for higher order contiguous allocations Mike Kravetz
@ 2018-05-03 23:29 ` Mike Kravetz
  2018-05-18  9:12   ` Vlastimil Babka
  2018-05-03 23:29 ` [PATCH v2 2/4] mm: check for proper migrate type during isolation Mike Kravetz
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Mike Kravetz @ 2018-05-03 23:29 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Reinette Chatre, Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, Vlastimil Babka,
	David Nellans, Laura Abbott, Pavel Machek, Dave Hansen,
	Andrew Morton, Mike Kravetz

free_contig_range() is currently defined as:
void free_contig_range(unsigned long pfn, unsigned nr_pages);
change to,
void free_contig_range(unsigned long pfn, unsigned long nr_pages);

Some callers are passing a truncated unsigned long today.  It is
highly unlikely that these values will overflow an unsigned int.
However, this should be changed to an unsigned long to be consistent
with other page counts.
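
As a purely illustrative example of the truncation being described (no
current caller actually overflows):

	unsigned long nr_pages = 1UL << order;

	/* old prototype: nr_pages implicitly narrowed to unsigned int */
	free_contig_range(page_to_pfn(page), nr_pages);

The implicit conversion is harmless for today's orders, but taking an
unsigned long removes the narrowing and matches other nr_pages counts.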

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/gfp.h | 2 +-
 mm/cma.c            | 2 +-
 mm/hugetlb.c        | 2 +-
 mm/page_alloc.c     | 6 +++---
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a4582b44d32..86a0d06463ab 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -572,7 +572,7 @@ static inline bool pm_suspended_storage(void)
 /* The below functions must be run on a range from a single zone. */
 extern int alloc_contig_range(unsigned long start, unsigned long end,
 			      unsigned migratetype, gfp_t gfp_mask);
-extern void free_contig_range(unsigned long pfn, unsigned nr_pages);
+extern void free_contig_range(unsigned long pfn, unsigned long nr_pages);
 #endif
 
 #ifdef CONFIG_CMA
diff --git a/mm/cma.c b/mm/cma.c
index aa40e6c7b042..f473fc2b7cbd 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -563,7 +563,7 @@ bool cma_release(struct cma *cma, const struct page *pages, unsigned int count)
 
 	VM_BUG_ON(pfn + count > cma->base_pfn + cma->count);
 
-	free_contig_range(pfn, count);
+	free_contig_range(pfn, (unsigned long)count);
 	cma_clear_bitmap(cma, pfn, count);
 	trace_cma_release(pfn, pages, count);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 218679138255..c81072ce7510 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1055,7 +1055,7 @@ static void destroy_compound_gigantic_page(struct page *page,
 
 static void free_gigantic_page(struct page *page, unsigned int order)
 {
-	free_contig_range(page_to_pfn(page), 1 << order);
+	free_contig_range(page_to_pfn(page), 1UL << order);
 }
 
 static int __alloc_gigantic_page(unsigned long start_pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 905db9d7962f..0fd5e8e2456e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7937,9 +7937,9 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	return ret;
 }
 
-void free_contig_range(unsigned long pfn, unsigned nr_pages)
+void free_contig_range(unsigned long pfn, unsigned long nr_pages)
 {
-	unsigned int count = 0;
+	unsigned long count = 0;
 
 	for (; nr_pages--; pfn++) {
 		struct page *page = pfn_to_page(pfn);
@@ -7947,7 +7947,7 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages)
 		count += page_count(page) != 1;
 		__free_page(page);
 	}
-	WARN(count != 0, "%d pages are still in use!\n", count);
+	WARN(count != 0, "%ld pages are still in use!\n", count);
 }
 #endif
 
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 2/4] mm: check for proper migrate type during isolation
  2018-05-03 23:29 [PATCH v2 0/4] Interface for higher order contiguous allocations Mike Kravetz
  2018-05-03 23:29 ` [PATCH v2 1/4] mm: change type of free_contig_range(nr_pages) to unsigned long Mike Kravetz
@ 2018-05-03 23:29 ` Mike Kravetz
  2018-05-18 10:32   ` Vlastimil Babka
  2018-05-03 23:29 ` [PATCH v2 3/4] mm: add find_alloc_contig_pages() interface Mike Kravetz
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Mike Kravetz @ 2018-05-03 23:29 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Reinette Chatre, Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, Vlastimil Babka,
	David Nellans, Laura Abbott, Pavel Machek, Dave Hansen,
	Andrew Morton, Mike Kravetz

The routine start_isolate_page_range and alloc_contig_range have
comments saying that migratetype must be either MIGRATE_MOVABLE or
MIGRATE_CMA.  However, this is not enforced.  What is important is
that that all pageblocks in the range are of type migratetype.
This is because blocks will be set to migratetype on error.

Add a boolean argument enforce_migratetype to the routine
start_isolate_page_range.  If set, it will check that all pageblocks
in the range have the passed migratetype.  Return -EINVAL is pageblock
is wrong type is found in range.

A boolean is used for enforce_migratetype as there are two primary
users.  Contiguous range allocation which wants to enforce migration
type checking.  Memory offline (hotplug) which is not concerned about
type checking.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/page-isolation.h |  8 +++-----
 mm/memory_hotplug.c            |  2 +-
 mm/page_alloc.c                | 17 +++++++++--------
 mm/page_isolation.c            | 40 ++++++++++++++++++++++++++++++----------
 4 files changed, 43 insertions(+), 24 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 4ae347cbc36d..2ab7e5a399ac 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -38,8 +38,6 @@ int move_freepages_block(struct zone *zone, struct page *page,
 
 /*
  * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
- * If specified range includes migrate types other than MOVABLE or CMA,
- * this will fail with -EBUSY.
  *
  * For isolating all pages in the range finally, the caller have to
  * free all pages in the range. test_page_isolated() can be used for
@@ -47,11 +45,11 @@ int move_freepages_block(struct zone *zone, struct page *page,
  */
 int
 start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			 unsigned migratetype, bool skip_hwpoisoned_pages);
+			 unsigned migratetype, bool skip_hwpoisoned_pages,
+			 bool enforce_migratetype);
 
 /*
- * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
- * target range is [start_pfn, end_pfn)
+ * Changes MIGRATE_ISOLATE to migratetype for range [start_pfn, end_pfn)
  */
 int
 undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f74826cdceea..ebc1c8c330e2 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1601,7 +1601,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
 	/* set above range as isolated */
 	ret = start_isolate_page_range(start_pfn, end_pfn,
-				       MIGRATE_MOVABLE, true);
+				       MIGRATE_MOVABLE, true, false);
 	if (ret)
 		return ret;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0fd5e8e2456e..cb1a5e0be6ee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7787,9 +7787,10 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
  * alloc_contig_range() -- tries to allocate given range of pages
  * @start:	start PFN to allocate
  * @end:	one-past-the-last PFN to allocate
- * @migratetype:	migratetype of the underlaying pageblocks (either
- *			#MIGRATE_MOVABLE or #MIGRATE_CMA).  All pageblocks
- *			in range must have the same migratetype and it must
+ * @migratetype:	migratetype of the underlaying pageblocks.  All
+ *			pageblocks in range must have the same migratetype.
+ *			migratetype is typically MIGRATE_MOVABLE or
+ *			MIGRATE_CMA, but this is not a requirement.
  *			be either of the two.
  * @gfp_mask:	GFP mask to use during compaction
  *
@@ -7840,15 +7841,15 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 * allocator removing them from the buddy system.  This way
 	 * page allocator will never consider using them.
 	 *
-	 * This lets us mark the pageblocks back as
-	 * MIGRATE_CMA/MIGRATE_MOVABLE so that free pages in the
-	 * aligned range but not in the unaligned, original range are
-	 * put back to page allocator so that buddy can use them.
+	 * This lets us mark the pageblocks back as their original
+	 * migrate type so that free pages in the  aligned range but
+	 * not in the unaligned, original range are put back to page
+	 * allocator so that buddy can use them.
 	 */
 
 	ret = start_isolate_page_range(pfn_max_align_down(start),
 				       pfn_max_align_up(end), migratetype,
-				       false);
+				       false, true);
 	if (ret)
 		return ret;
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 43e085608846..472191cc1909 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -16,7 +16,8 @@
 #include <trace/events/page_isolation.h>
 
 static int set_migratetype_isolate(struct page *page, int migratetype,
-				bool skip_hwpoisoned_pages)
+				bool skip_hwpoisoned_pages,
+				bool enforce_migratetype)
 {
 	struct zone *zone;
 	unsigned long flags, pfn;
@@ -36,6 +37,17 @@ static int set_migratetype_isolate(struct page *page, int migratetype,
 	if (is_migrate_isolate_page(page))
 		goto out;
 
+	/*
+	 * If requested, check migration type of pageblock and make sure
+	 * it matches migratetype
+	 */
+	if (enforce_migratetype) {
+		if (get_pageblock_migratetype(page) != migratetype) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
 	pfn = page_to_pfn(page);
 	arg.start_pfn = pfn;
 	arg.nr_pages = pageblock_nr_pages;
@@ -167,14 +179,16 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * to be MIGRATE_ISOLATE.
  * @start_pfn: The lower PFN of the range to be isolated.
  * @end_pfn: The upper PFN of the range to be isolated.
- * @migratetype: migrate type to set in error recovery.
+ * @migratetype: migrate type of all blocks in range.
  *
  * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
  * the range will never be allocated. Any free pages and pages freed in the
  * future will not be allocated again.
  *
  * start_pfn/end_pfn must be aligned to pageblock_order.
- * Return 0 on success and -EBUSY if any part of range cannot be isolated.
+ * Return 0 on success or error returned by set_migratetype_isolate.  Typical
+ * errors are -EBUSY if any part of range cannot be isolated or -EINVAL if
+ * any page block is not of migratetype.
  *
  * There is no high level synchronization mechanism that prevents two threads
  * from trying to isolate overlapping ranges.  If this happens, one thread
@@ -185,11 +199,13 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * prevents two threads from simultaneously working on overlapping ranges.
  */
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			     unsigned migratetype, bool skip_hwpoisoned_pages)
+			     unsigned migratetype, bool skip_hwpoisoned_pages,
+			     bool enforce_migratetype)
 {
 	unsigned long pfn;
 	unsigned long undo_pfn;
 	struct page *page;
+	int ret = 0;
 
 	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
 	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
@@ -198,13 +214,17 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	     pfn < end_pfn;
 	     pfn += pageblock_nr_pages) {
 		page = __first_valid_page(pfn, pageblock_nr_pages);
-		if (page &&
-		    set_migratetype_isolate(page, migratetype, skip_hwpoisoned_pages)) {
-			undo_pfn = pfn;
-			goto undo;
+		if (page) {
+			ret = set_migratetype_isolate(page, migratetype,
+							skip_hwpoisoned_pages,
+							enforce_migratetype);
+			if (ret) {
+				undo_pfn = pfn;
+				goto undo;
+			}
 		}
 	}
-	return 0;
+	return ret;
 undo:
 	for (pfn = start_pfn;
 	     pfn < undo_pfn;
@@ -215,7 +235,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 		unset_migratetype_isolate(page, migratetype);
 	}
 
-	return -EBUSY;
+	return ret;
 }
 
 /*
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 3/4] mm: add find_alloc_contig_pages() interface
  2018-05-03 23:29 [PATCH v2 0/4] Interface for higher order contiguous allocations Mike Kravetz
  2018-05-03 23:29 ` [PATCH v2 1/4] mm: change type of free_contig_range(nr_pages) to unsigned long Mike Kravetz
  2018-05-03 23:29 ` [PATCH v2 2/4] mm: check for proper migrate type during isolation Mike Kravetz
@ 2018-05-03 23:29 ` Mike Kravetz
  2018-05-21  8:54   ` Vlastimil Babka
  2018-05-03 23:29 ` [PATCH v2 4/4] mm/hugetlb: use find_alloc_contig_pages() to allocate gigantic pages Mike Kravetz
  2018-05-21 12:00 ` [PATCH v2 0/4] Interface for higher order contiguous allocations Vlastimil Babka
  4 siblings, 1 reply; 19+ messages in thread
From: Mike Kravetz @ 2018-05-03 23:29 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Reinette Chatre, Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, Vlastimil Babka,
	David Nellans, Laura Abbott, Pavel Machek, Dave Hansen,
	Andrew Morton, Mike Kravetz

find_alloc_contig_pages() is a new interface that attempts to locate
and allocate a contiguous range of pages.  It is provided as a more
convenient interface than alloc_contig_range() which is currently
used by CMA and gigantic huge pages.

When attempting to allocate a range of pages, migration is employed
if possible.  There is no guarantee that the routine will succeed.
So, the user must be prepared for failure and have a fall back plan.
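
A minimal sketch of that fallback pattern (purely illustrative; only
find_alloc_contig_pages() and free_contig_pages() come from this
series):

	page = find_alloc_contig_pages(nr_pages, gfp, nid, nodemask);
	if (!page) {
		/*
		 * No contiguous range could be found/allocated.  Fall
		 * back to whatever the caller can tolerate: a smaller
		 * request, scatter/gather, or failing the operation.
		 */
		return -ENOMEM;
	}

	/* ... use the range, then release it ... */
	free_contig_pages(page, nr_pages);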

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/gfp.h |  12 +++++
 mm/page_alloc.c     | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 146 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 86a0d06463ab..b0d11777d487 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -573,6 +573,18 @@ static inline bool pm_suspended_storage(void)
 extern int alloc_contig_range(unsigned long start, unsigned long end,
 			      unsigned migratetype, gfp_t gfp_mask);
 extern void free_contig_range(unsigned long pfn, unsigned long nr_pages);
+extern struct page *find_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp,
+						int nid, nodemask_t *nodemask);
+extern void free_contig_pages(struct page *page, unsigned long nr_pages);
+#else
+static inline struct page *find_alloc_contig_pages(unsigned long nr_pages,
+				gfp_t gfp, int nid, nodemask_t *nodemask)
+{
+	return NULL;
+}
+static inline void free_contig_pages(struct page *page, unsigned long nr_pages)
+{
+}
 #endif
 
 #ifdef CONFIG_CMA
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb1a5e0be6ee..d0a2d0da9eae 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -67,6 +67,7 @@
 #include <linux/ftrace.h>
 #include <linux/lockdep.h>
 #include <linux/nmi.h>
+#include <linux/mmzone.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -7913,8 +7914,12 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 
 	/* Make sure the range is really isolated. */
 	if (test_pages_isolated(outer_start, end, false)) {
-		pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
-			__func__, outer_start, end);
+#ifdef MIGRATE_CMA
+		/* Only print messages for CMA allocations */
+		if (migratetype == MIGRATE_CMA)
+			pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
+				__func__, outer_start, end);
+#endif
 		ret = -EBUSY;
 		goto done;
 	}
@@ -7950,6 +7955,133 @@ void free_contig_range(unsigned long pfn, unsigned long nr_pages)
 	}
 	WARN(count != 0, "%ld pages are still in use!\n", count);
 }
+
+/*
+ * Only check for obvious pfn/pages which can not be used/migrated.  The
+ * migration code will do the final check.  Under stress, this minimal set
+ * has been observed to provide the best results.  The checks can be expanded
+ * if needed.
+ */
+static bool contig_pfn_range_valid(struct zone *z, unsigned long start_pfn,
+					unsigned long nr_pages)
+{
+	unsigned long i, end_pfn = start_pfn + nr_pages;
+	struct page *page;
+
+	for (i = start_pfn; i < end_pfn; i++) {
+		if (!pfn_valid(i))
+			return false;
+
+		page = pfn_to_online_page(i);
+
+		if (page_zone(page) != z)
+			return false;
+
+	}
+
+	return true;
+}
+
+/*
+ * Search for and attempt to allocate contiguous allocations greater than
+ * MAX_ORDER.
+ */
+static struct page *__alloc_contig_pages_nodemask(gfp_t gfp,
+						unsigned long order,
+						int nid, nodemask_t *nodemask)
+{
+	unsigned long nr_pages, pfn, flags;
+	struct page *ret_page = NULL;
+	struct zonelist *zonelist;
+	struct zoneref *z;
+	struct zone *zone;
+	int rc;
+
+	nr_pages = 1 << order;
+	zonelist = node_zonelist(nid, gfp);
+	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp),
+					nodemask) {
+		pgdat_resize_lock(zone->zone_pgdat, &flags);
+		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
+		while (zone_spans_pfn(zone, pfn + nr_pages - 1)) {
+			if (contig_pfn_range_valid(zone, pfn, nr_pages)) {
+				struct page *page = pfn_to_online_page(pfn);
+				unsigned int migratetype;
+
+				/*
+				 * All pageblocks in range must be of same
+				 * migrate type.
+				 */
+				migratetype = get_pageblock_migratetype(page);
+				pgdat_resize_unlock(zone->zone_pgdat, &flags);
+
+				rc = alloc_contig_range(pfn, pfn + nr_pages,
+						migratetype, gfp);
+				if (!rc) {
+					ret_page = pfn_to_page(pfn);
+					return ret_page;
+				}
+				pgdat_resize_lock(zone->zone_pgdat, &flags);
+			}
+			pfn += nr_pages;
+		}
+		pgdat_resize_unlock(zone->zone_pgdat, &flags);
+	}
+
+	return ret_page;
+}
+
+/**
+ * find_alloc_contig_pages() -- attempt to find and allocate a contiguous
+ *				range of pages
+ * @nr_pages:	number of pages to find/allocate
+ * @gfp:	gfp mask used to limit search as well as during compaction
+ * @nid:	target node
+ * @nodemask:	mask of other possible nodes
+ *
+ * Pages can be freed with a call to free_contig_pages(), or by manually
+ * calling __free_page() for each page allocated.
+ *
+ * Return: pointer to 'order' pages on success, or NULL if not successful.
+ */
+struct page *find_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp,
+					int nid, nodemask_t *nodemask)
+{
+	unsigned long i, alloc_order, order_pages;
+	struct page *pages;
+
+	/*
+	 * Underlying allocators perform page order sized allocations.
+	 */
+	alloc_order = get_count_order(nr_pages);
+	if (alloc_order < MAX_ORDER) {
+		pages = __alloc_pages_nodemask(gfp, (unsigned int)alloc_order,
+						nid, nodemask);
+		split_page(pages, alloc_order);
+	} else {
+		pages = __alloc_contig_pages_nodemask(gfp, alloc_order, nid,
+							nodemask);
+	}
+
+	if (pages) {
+		/*
+		 * More pages than desired could have been allocated due to
+		 * rounding up to next page order.  Free any excess pages.
+		 */
+		order_pages = 1UL << alloc_order;
+		for (i = nr_pages; i < order_pages; i++)
+			__free_page(pages + i);
+	}
+
+	return pages;
+}
+EXPORT_SYMBOL_GPL(find_alloc_contig_pages);
+
+void free_contig_pages(struct page *page, unsigned long nr_pages)
+{
+	free_contig_range(page_to_pfn(page), nr_pages);
+}
+EXPORT_SYMBOL_GPL(free_contig_pages);
 #endif
 
 #if defined CONFIG_MEMORY_HOTPLUG || defined CONFIG_CMA
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 4/4] mm/hugetlb: use find_alloc_contig_pages() to allocate gigantic pages
  2018-05-03 23:29 [PATCH v2 0/4] Interface for higher order contiguous allocations Mike Kravetz
                   ` (2 preceding siblings ...)
  2018-05-03 23:29 ` [PATCH v2 3/4] mm: add find_alloc_contig_pages() interface Mike Kravetz
@ 2018-05-03 23:29 ` Mike Kravetz
  2018-05-21 12:00 ` [PATCH v2 0/4] Interface for higher order contiguous allocations Vlastimil Babka
  4 siblings, 0 replies; 19+ messages in thread
From: Mike Kravetz @ 2018-05-03 23:29 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api
  Cc: Reinette Chatre, Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, Vlastimil Babka,
	David Nellans, Laura Abbott, Pavel Machek, Dave Hansen,
	Andrew Morton, Mike Kravetz

Use the new find_alloc_contig_pages() interface for the allocation of
gigantic pages and remove associated code in hugetlb.c.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 87 +++++-------------------------------------------------------
 1 file changed, 6 insertions(+), 81 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c81072ce7510..91edf1df3592 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1053,91 +1053,16 @@ static void destroy_compound_gigantic_page(struct page *page,
 	__ClearPageHead(page);
 }
 
-static void free_gigantic_page(struct page *page, unsigned int order)
+static void free_gigantic_page(struct page *page, struct hstate *h)
 {
-	free_contig_range(page_to_pfn(page), 1UL << order);
-}
-
-static int __alloc_gigantic_page(unsigned long start_pfn,
-				unsigned long nr_pages, gfp_t gfp_mask)
-{
-	unsigned long end_pfn = start_pfn + nr_pages;
-	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
-				  gfp_mask);
-}
-
-static bool pfn_range_valid_gigantic(struct zone *z,
-			unsigned long start_pfn, unsigned long nr_pages)
-{
-	unsigned long i, end_pfn = start_pfn + nr_pages;
-	struct page *page;
-
-	for (i = start_pfn; i < end_pfn; i++) {
-		if (!pfn_valid(i))
-			return false;
-
-		page = pfn_to_page(i);
-
-		if (page_zone(page) != z)
-			return false;
-
-		if (PageReserved(page))
-			return false;
-
-		if (page_count(page) > 0)
-			return false;
-
-		if (PageHuge(page))
-			return false;
-	}
-
-	return true;
-}
-
-static bool zone_spans_last_pfn(const struct zone *zone,
-			unsigned long start_pfn, unsigned long nr_pages)
-{
-	unsigned long last_pfn = start_pfn + nr_pages - 1;
-	return zone_spans_pfn(zone, last_pfn);
+	free_contig_pages(page, (unsigned long)pages_per_huge_page(h));
 }
 
 static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 		int nid, nodemask_t *nodemask)
 {
-	unsigned int order = huge_page_order(h);
-	unsigned long nr_pages = 1 << order;
-	unsigned long ret, pfn, flags;
-	struct zonelist *zonelist;
-	struct zone *zone;
-	struct zoneref *z;
-
-	zonelist = node_zonelist(nid, gfp_mask);
-	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), nodemask) {
-		spin_lock_irqsave(&zone->lock, flags);
-
-		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
-		while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
-			if (pfn_range_valid_gigantic(zone, pfn, nr_pages)) {
-				/*
-				 * We release the zone lock here because
-				 * alloc_contig_range() will also lock the zone
-				 * at some point. If there's an allocation
-				 * spinning on this lock, it may win the race
-				 * and cause alloc_contig_range() to fail...
-				 */
-				spin_unlock_irqrestore(&zone->lock, flags);
-				ret = __alloc_gigantic_page(pfn, nr_pages, gfp_mask);
-				if (!ret)
-					return pfn_to_page(pfn);
-				spin_lock_irqsave(&zone->lock, flags);
-			}
-			pfn += nr_pages;
-		}
-
-		spin_unlock_irqrestore(&zone->lock, flags);
-	}
-
-	return NULL;
+	return find_alloc_contig_pages((unsigned long)pages_per_huge_page(h),
+					gfp_mask, nid, nodemask);
 }
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
@@ -1147,7 +1072,7 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order);
 static inline bool gigantic_page_supported(void) { return false; }
 static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 		int nid, nodemask_t *nodemask) { return NULL; }
-static inline void free_gigantic_page(struct page *page, unsigned int order) { }
+static inline void free_gigantic_page(struct page *page, struct hstate *h) { }
 static inline void destroy_compound_gigantic_page(struct page *page,
 						unsigned int order) { }
 #endif
@@ -1172,7 +1097,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
 		destroy_compound_gigantic_page(page, huge_page_order(h));
-		free_gigantic_page(page, huge_page_order(h));
+		free_gigantic_page(page, h);
 	} else {
 		__free_pages(page, huge_page_order(h));
 	}
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/4] mm: change type of free_contig_range(nr_pages) to unsigned long
  2018-05-03 23:29 ` [PATCH v2 1/4] mm: change type of free_contig_range(nr_pages) to unsigned long Mike Kravetz
@ 2018-05-18  9:12   ` Vlastimil Babka
  2018-05-18 22:01     ` Mike Kravetz
  0 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2018-05-18  9:12 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Reinette Chatre, Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, David Nellans,
	Laura Abbott, Pavel Machek, Dave Hansen, Andrew Morton,
	Joonsoo Kim

On 05/04/2018 01:29 AM, Mike Kravetz wrote:
> free_contig_range() is currently defined as:
> void free_contig_range(unsigned long pfn, unsigned nr_pages);
> change to,
> void free_contig_range(unsigned long pfn, unsigned long nr_pages);
> 
> Some callers are passing a truncated unsigned long today.  It is
> highly unlikely that these values will overflow an unsigned int.
> However, this should be changed to an unsigned long to be consistent
> with other page counts.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  include/linux/gfp.h | 2 +-
>  mm/cma.c            | 2 +-
>  mm/hugetlb.c        | 2 +-
>  mm/page_alloc.c     | 6 +++---
>  4 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 1a4582b44d32..86a0d06463ab 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -572,7 +572,7 @@ static inline bool pm_suspended_storage(void)
>  /* The below functions must be run on a range from a single zone. */
>  extern int alloc_contig_range(unsigned long start, unsigned long end,
>  			      unsigned migratetype, gfp_t gfp_mask);
> -extern void free_contig_range(unsigned long pfn, unsigned nr_pages);
> +extern void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>  #endif
>  
>  #ifdef CONFIG_CMA
> diff --git a/mm/cma.c b/mm/cma.c
> index aa40e6c7b042..f473fc2b7cbd 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -563,7 +563,7 @@ bool cma_release(struct cma *cma, const struct page *pages, unsigned int count)
>  
>  	VM_BUG_ON(pfn + count > cma->base_pfn + cma->count);
>  
> -	free_contig_range(pfn, count);
> +	free_contig_range(pfn, (unsigned long)count);

I guess this cast from uint to ulong doesn't need to be explicit? But
instead, the cma_release() signature could also be changed to ulong, because
some of its callers do pass those?

>  	cma_clear_bitmap(cma, pfn, count);
>  	trace_cma_release(pfn, pages, count);
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 218679138255..c81072ce7510 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1055,7 +1055,7 @@ static void destroy_compound_gigantic_page(struct page *page,
>  
>  static void free_gigantic_page(struct page *page, unsigned int order)
>  {
> -	free_contig_range(page_to_pfn(page), 1 << order);
> +	free_contig_range(page_to_pfn(page), 1UL << order);
>  }
>  
>  static int __alloc_gigantic_page(unsigned long start_pfn,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 905db9d7962f..0fd5e8e2456e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7937,9 +7937,9 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	return ret;
>  }
>  
> -void free_contig_range(unsigned long pfn, unsigned nr_pages)
> +void free_contig_range(unsigned long pfn, unsigned long nr_pages)
>  {
> -	unsigned int count = 0;
> +	unsigned long count = 0;
>  
>  	for (; nr_pages--; pfn++) {
>  		struct page *page = pfn_to_page(pfn);
> @@ -7947,7 +7947,7 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages)
>  		count += page_count(page) != 1;
>  		__free_page(page);
>  	}
> -	WARN(count != 0, "%d pages are still in use!\n", count);
> +	WARN(count != 0, "%ld pages are still in use!\n", count);

Maybe change to %lu while at it?
BTW, this warning can theoretically produce false positives, because
page users have to deal with page_count() being incremented by e.g.
parallel pfn scanners using get_page_unless_zero(). We also don't detect
refcount leaks in general. Should we remove it or change it to VM_WARN
if it's still useful for debugging?

>  }
>  #endif
>  
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/4] mm: check for proper migrate type during isolation
  2018-05-03 23:29 ` [PATCH v2 2/4] mm: check for proper migrate type during isolation Mike Kravetz
@ 2018-05-18 10:32   ` Vlastimil Babka
  2018-05-21 23:10     ` Mike Kravetz
  0 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2018-05-18 10:32 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Reinette Chatre, Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, David Nellans,
	Laura Abbott, Pavel Machek, Dave Hansen, Andrew Morton,
	Joonsoo Kim

On 05/04/2018 01:29 AM, Mike Kravetz wrote:
> The routine start_isolate_page_range and alloc_contig_range have
> comments saying that migratetype must be either MIGRATE_MOVABLE or
> MIGRATE_CMA.  However, this is not enforced.

Enforced, no. But if the pageblocks really were as such, it used to
shortcut has_unmovable_pages(). This was wrong and removed in
d7b236e10ced ("mm: drop migrate type checks from has_unmovable_pages")
plus 4da2ce250f98 ("mm: distinguish CMA and MOVABLE isolation in
has_unmovable_pages()").


  What is important is
> that that all pageblocks in the range are of type migratetype.
                                               the same
> This is because blocks will be set to migratetype on error.

Strictly speaking this is true only for the CMA case. For other cases,
the best thing actually would be to employ the same heuristics as page
allocation migratetype fallbacks, and count how many pages are free and
how many appear to be movable, see how steal_suitable_fallback() uses
the last parameter of move_freepages_block().

> Add a boolean argument enforce_migratetype to the routine
> start_isolate_page_range.  If set, it will check that all pageblocks
> in the range have the passed migratetype.  Return -EINVAL is pageblock
                                                            if
> is wrong type is found in range.
  of
> 
> A boolean is used for enforce_migratetype as there are two primary
> users.  Contiguous range allocation which wants to enforce migration
> type checking.  Memory offline (hotplug) which is not concerned about
> type checking.

This is missing some high-level result. The end change is that CMA is
now enforcing. So we are making it more robust when it's called on
non-CMA pageblocks by mistake? (BTW I still do hope we can remove
MIGRATE_CMA soon after Joonsoo's ZONE_MOVABLE CMA conversion. Combined
with my suggestion above we could hopefully get rid of the migratetype
parameter completely instead of enforcing it?). Is this also a
preparation for introducing find_alloc_contig_pages() which will be
enforcing? (I guess, and will find out shortly, but it should be stated
here)

Thanks,
Vlastimil

> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  include/linux/page-isolation.h |  8 +++-----
>  mm/memory_hotplug.c            |  2 +-
>  mm/page_alloc.c                | 17 +++++++++--------
>  mm/page_isolation.c            | 40 ++++++++++++++++++++++++++++++----------
>  4 files changed, 43 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 4ae347cbc36d..2ab7e5a399ac 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -38,8 +38,6 @@ int move_freepages_block(struct zone *zone, struct page *page,
>  
>  /*
>   * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
> - * If specified range includes migrate types other than MOVABLE or CMA,
> - * this will fail with -EBUSY.
>   *
>   * For isolating all pages in the range finally, the caller have to
>   * free all pages in the range. test_page_isolated() can be used for
> @@ -47,11 +45,11 @@ int move_freepages_block(struct zone *zone, struct page *page,
>   */
>  int
>  start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> -			 unsigned migratetype, bool skip_hwpoisoned_pages);
> +			 unsigned migratetype, bool skip_hwpoisoned_pages,
> +			 bool enforce_migratetype);
>  
>  /*
> - * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
> - * target range is [start_pfn, end_pfn)
> + * Changes MIGRATE_ISOLATE to migratetype for range [start_pfn, end_pfn)
>   */
>  int
>  undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f74826cdceea..ebc1c8c330e2 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1601,7 +1601,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  
>  	/* set above range as isolated */
>  	ret = start_isolate_page_range(start_pfn, end_pfn,
> -				       MIGRATE_MOVABLE, true);
> +				       MIGRATE_MOVABLE, true, false);
>  	if (ret)
>  		return ret;
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0fd5e8e2456e..cb1a5e0be6ee 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7787,9 +7787,10 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
>   * alloc_contig_range() -- tries to allocate given range of pages
>   * @start:	start PFN to allocate
>   * @end:	one-past-the-last PFN to allocate
> - * @migratetype:	migratetype of the underlaying pageblocks (either
> - *			#MIGRATE_MOVABLE or #MIGRATE_CMA).  All pageblocks
> - *			in range must have the same migratetype and it must
> + * @migratetype:	migratetype of the underlaying pageblocks.  All
> + *			pageblocks in range must have the same migratetype.
> + *			migratetype is typically MIGRATE_MOVABLE or
> + *			MIGRATE_CMA, but this is not a requirement.
>   *			be either of the two.
>   * @gfp_mask:	GFP mask to use during compaction
>   *
> @@ -7840,15 +7841,15 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	 * allocator removing them from the buddy system.  This way
>  	 * page allocator will never consider using them.
>  	 *
> -	 * This lets us mark the pageblocks back as
> -	 * MIGRATE_CMA/MIGRATE_MOVABLE so that free pages in the
> -	 * aligned range but not in the unaligned, original range are
> -	 * put back to page allocator so that buddy can use them.
> +	 * This lets us mark the pageblocks back as their original
> +	 * migrate type so that free pages in the  aligned range but
> +	 * not in the unaligned, original range are put back to page
> +	 * allocator so that buddy can use them.
>  	 */
>  
>  	ret = start_isolate_page_range(pfn_max_align_down(start),
>  				       pfn_max_align_up(end), migratetype,
> -				       false);
> +				       false, true);
>  	if (ret)
>  		return ret;
>  
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 43e085608846..472191cc1909 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -16,7 +16,8 @@
>  #include <trace/events/page_isolation.h>
>  
>  static int set_migratetype_isolate(struct page *page, int migratetype,
> -				bool skip_hwpoisoned_pages)
> +				bool skip_hwpoisoned_pages,
> +				bool enforce_migratetype)
>  {
>  	struct zone *zone;
>  	unsigned long flags, pfn;
> @@ -36,6 +37,17 @@ static int set_migratetype_isolate(struct page *page, int migratetype,
>  	if (is_migrate_isolate_page(page))
>  		goto out;
>  
> +	/*
> +	 * If requested, check migration type of pageblock and make sure
> +	 * it matches migratetype
> +	 */
> +	if (enforce_migratetype) {
> +		if (get_pageblock_migratetype(page) != migratetype) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +	}
> +
>  	pfn = page_to_pfn(page);
>  	arg.start_pfn = pfn;
>  	arg.nr_pages = pageblock_nr_pages;
> @@ -167,14 +179,16 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>   * to be MIGRATE_ISOLATE.
>   * @start_pfn: The lower PFN of the range to be isolated.
>   * @end_pfn: The upper PFN of the range to be isolated.
> - * @migratetype: migrate type to set in error recovery.
> + * @migratetype: migrate type of all blocks in range.
>   *
>   * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
>   * the range will never be allocated. Any free pages and pages freed in the
>   * future will not be allocated again.
>   *
>   * start_pfn/end_pfn must be aligned to pageblock_order.
> - * Return 0 on success and -EBUSY if any part of range cannot be isolated.
> + * Return 0 on success or error returned by set_migratetype_isolate.  Typical
> + * errors are -EBUSY if any part of range cannot be isolated or -EINVAL if
> + * any page block is not of migratetype.
>   *
>   * There is no high level synchronization mechanism that prevents two threads
>   * from trying to isolate overlapping ranges.  If this happens, one thread
> @@ -185,11 +199,13 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>   * prevents two threads from simultaneously working on overlapping ranges.
>   */
>  int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
> -			     unsigned migratetype, bool skip_hwpoisoned_pages)
> +			     unsigned migratetype, bool skip_hwpoisoned_pages,
> +			     bool enforce_migratetype)
>  {
>  	unsigned long pfn;
>  	unsigned long undo_pfn;
>  	struct page *page;
> +	int ret = 0;
>  
>  	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
>  	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
> @@ -198,13 +214,17 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  	     pfn < end_pfn;
>  	     pfn += pageblock_nr_pages) {
>  		page = __first_valid_page(pfn, pageblock_nr_pages);
> -		if (page &&
> -		    set_migratetype_isolate(page, migratetype, skip_hwpoisoned_pages)) {
> -			undo_pfn = pfn;
> -			goto undo;
> +		if (page) {
> +			ret = set_migratetype_isolate(page, migratetype,
> +							skip_hwpoisoned_pages,
> +							enforce_migratetype);
> +			if (ret) {
> +				undo_pfn = pfn;
> +				goto undo;
> +			}
>  		}
>  	}
> -	return 0;
> +	return ret;
>  undo:
>  	for (pfn = start_pfn;
>  	     pfn < undo_pfn;
> @@ -215,7 +235,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  		unset_migratetype_isolate(page, migratetype);
>  	}
>  
> -	return -EBUSY;
> +	return ret;
>  }
>  
>  /*
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/4] mm: change type of free_contig_range(nr_pages) to unsigned long
  2018-05-18  9:12   ` Vlastimil Babka
@ 2018-05-18 22:01     ` Mike Kravetz
  0 siblings, 0 replies; 19+ messages in thread
From: Mike Kravetz @ 2018-05-18 22:01 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm, linux-kernel, linux-api
  Cc: Reinette Chatre, Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, David Nellans,
	Laura Abbott, Pavel Machek, Dave Hansen, Andrew Morton,
	Joonsoo Kim, Marek Szyprowski

On 05/18/2018 02:12 AM, Vlastimil Babka wrote:
> On 05/04/2018 01:29 AM, Mike Kravetz wrote:
>> free_contig_range() is currently defined as:
>> void free_contig_range(unsigned long pfn, unsigned nr_pages);
>> change to,
>> void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>>
>> Some callers are passing a truncated unsigned long today.  It is
>> highly unlikely that these values will overflow an unsigned int.
>> However, this should be changed to an unsigned long to be consistent
>> with other page counts.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  include/linux/gfp.h | 2 +-
>>  mm/cma.c            | 2 +-
>>  mm/hugetlb.c        | 2 +-
>>  mm/page_alloc.c     | 6 +++---
>>  4 files changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index 1a4582b44d32..86a0d06463ab 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -572,7 +572,7 @@ static inline bool pm_suspended_storage(void)
>>  /* The below functions must be run on a range from a single zone. */
>>  extern int alloc_contig_range(unsigned long start, unsigned long end,
>>  			      unsigned migratetype, gfp_t gfp_mask);
>> -extern void free_contig_range(unsigned long pfn, unsigned nr_pages);
>> +extern void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>>  #endif
>>  
>>  #ifdef CONFIG_CMA
>> diff --git a/mm/cma.c b/mm/cma.c
>> index aa40e6c7b042..f473fc2b7cbd 100644
>> --- a/mm/cma.c
>> +++ b/mm/cma.c
>> @@ -563,7 +563,7 @@ bool cma_release(struct cma *cma, const struct page *pages, unsigned int count)
>>  
>>  	VM_BUG_ON(pfn + count > cma->base_pfn + cma->count);
>>  
>> -	free_contig_range(pfn, count);
>> +	free_contig_range(pfn, (unsigned long)count);
> 
> I guess this cast from uint to ulong doesn't need to be explicit? But
> instead, cma_release() signature could be also changed to ulong, because
> some of its callers do pass those?

Correct, that cast is not needed.

I like the idea of changing cma_release() to take ulong.  Until you mentioned
this, I did not realize that some callers were passing in a truncated ulong.
As noted in my commit message, this truncation is unlikely to be an issue.
But, I think we should 'fix' them if we can.

I'll spin another version with this change.

> 
>>  	cma_clear_bitmap(cma, pfn, count);
>>  	trace_cma_release(pfn, pages, count);
>>  
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 218679138255..c81072ce7510 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1055,7 +1055,7 @@ static void destroy_compound_gigantic_page(struct page *page,
>>  
>>  static void free_gigantic_page(struct page *page, unsigned int order)
>>  {
>> -	free_contig_range(page_to_pfn(page), 1 << order);
>> +	free_contig_range(page_to_pfn(page), 1UL << order);
>>  }
>>  
>>  static int __alloc_gigantic_page(unsigned long start_pfn,
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 905db9d7962f..0fd5e8e2456e 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -7937,9 +7937,9 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>>  	return ret;
>>  }
>>  
>> -void free_contig_range(unsigned long pfn, unsigned nr_pages)
>> +void free_contig_range(unsigned long pfn, unsigned long nr_pages)
>>  {
>> -	unsigned int count = 0;
>> +	unsigned long count = 0;
>>  
>>  	for (; nr_pages--; pfn++) {
>>  		struct page *page = pfn_to_page(pfn);
>> @@ -7947,7 +7947,7 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages)
>>  		count += page_count(page) != 1;
>>  		__free_page(page);
>>  	}
>> -	WARN(count != 0, "%d pages are still in use!\n", count);
>> +	WARN(count != 0, "%ld pages are still in use!\n", count);
> 
> Maybe change to %lu while at it?

Yes

> BTW, this warning can theoretically produce false positives, because
> page users have to deal with page_count() being incremented by e.g.
> parallel pfn scanners using get_page_unless_zero(). We also don't detect
> refcount leaks in general. Should we remove it or change it to VM_WARN
> if it's still useful for debugging?

Added Marek on Cc: as he is the one who originally added this message (although
it has been a long time).  I do not know what specific issue he was concerned
with.  A search found a few bug reports related to this warning.  In these
cases, it clearly was not a false positive but some other issue.  It 'appears'
the message helped in those cases.

However, I would hate for a support organization to spend a bunch of time
doing investigation for a false positive.  At the very least, I think we
should add a message/comment about the possibility of false positives in the
code.  I would be inclined to change it to VM_WARN, but it would be good
to get input from others who might find it useful as is.

-- 
Mike Kravetz

> 
>>  }
>>  #endif
>>  
>>
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] mm: add find_alloc_contig_pages() interface
  2018-05-03 23:29 ` [PATCH v2 3/4] mm: add find_alloc_contig_pages() interface Mike Kravetz
@ 2018-05-21  8:54   ` Vlastimil Babka
  2018-05-21 23:48     ` Mike Kravetz
  0 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2018-05-21  8:54 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Reinette Chatre, Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, David Nellans,
	Laura Abbott, Pavel Machek, Dave Hansen, Andrew Morton

On 05/04/2018 01:29 AM, Mike Kravetz wrote:
> find_alloc_contig_pages() is a new interface that attempts to locate
> and allocate a contiguous range of pages.  It is provided as a more

How about dropping the 'find_' from the name, so it's more like other
allocator functions? All of them have to 'find' the free pages in some
sense.

> convenient interface than alloc_contig_range() which is currently
> used by CMA and gigantic huge pages.
> 
> When attempting to allocate a range of pages, migration is employed
> if possible.  There is no guarantee that the routine will succeed.
> So, the user must be prepared for failure and have a fall back plan.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  include/linux/gfp.h |  12 +++++
>  mm/page_alloc.c     | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 146 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 86a0d06463ab..b0d11777d487 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -573,6 +573,18 @@ static inline bool pm_suspended_storage(void)
>  extern int alloc_contig_range(unsigned long start, unsigned long end,
>  			      unsigned migratetype, gfp_t gfp_mask);
>  extern void free_contig_range(unsigned long pfn, unsigned long nr_pages);
> +extern struct page *find_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp,
> +						int nid, nodemask_t *nodemask);
> +extern void free_contig_pages(struct page *page, unsigned long nr_pages);
> +#else
> +static inline struct page *find_alloc_contig_pages(unsigned long nr_pages,
> +				gfp_t gfp, int nid, nodemask_t *nodemask)
> +{
> +	return NULL;
> +}
> +static inline void free_contig_pages(struct page *page, unsigned long nr_pages)
> +{
> +}
>  #endif
>  
>  #ifdef CONFIG_CMA
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cb1a5e0be6ee..d0a2d0da9eae 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -67,6 +67,7 @@
>  #include <linux/ftrace.h>
>  #include <linux/lockdep.h>
>  #include <linux/nmi.h>
> +#include <linux/mmzone.h>
>  
>  #include <asm/sections.h>
>  #include <asm/tlbflush.h>
> @@ -7913,8 +7914,12 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  
>  	/* Make sure the range is really isolated. */
>  	if (test_pages_isolated(outer_start, end, false)) {
> -		pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
> -			__func__, outer_start, end);
> +#ifdef MIGRATE_CMA
> +		/* Only print messages for CMA allocations */
> +		if (migratetype == MIGRATE_CMA)

I think is_migrate_cma() can be used to avoid the #ifdef.

> +			pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
> +				__func__, outer_start, end);
> +#endif
>  		ret = -EBUSY;
>  		goto done;
>  	}
> @@ -7950,6 +7955,133 @@ void free_contig_range(unsigned long pfn, unsigned long nr_pages)
>  	}
>  	WARN(count != 0, "%ld pages are still in use!\n", count);
>  }
> +
> +/*
> + * Only check for obvious pfn/pages which can not be used/migrated.  The
> + * migration code will do the final check.  Under stress, this minimal set
> + * has been observed to provide the best results.  The checks can be expanded
> + * if needed.

Hm, I kind of doubt this is optimal; it tests almost nothing
besides basic validity, so it won't exclude ranges where the allocation
will fail. I will write more in a reply to the header where complexity
is discussed.

> + */
> +static bool contig_pfn_range_valid(struct zone *z, unsigned long start_pfn,
> +					unsigned long nr_pages)
> +{
> +	unsigned long i, end_pfn = start_pfn + nr_pages;
> +	struct page *page;
> +
> +	for (i = start_pfn; i < end_pfn; i++) {
> +		if (!pfn_valid(i))
> +			return false;
> +
> +		page = pfn_to_online_page(i);
> +
> +		if (page_zone(page) != z)
> +			return false;
> +
> +	}
> +
> +	return true;
> +}
> +
> +/*
> + * Search for and attempt to allocate contiguous allocations greater than
> + * MAX_ORDER.
> + */
> +static struct page *__alloc_contig_pages_nodemask(gfp_t gfp,
> +						unsigned long order,
> +						int nid, nodemask_t *nodemask)
> +{
> +	unsigned long nr_pages, pfn, flags;
> +	struct page *ret_page = NULL;
> +	struct zonelist *zonelist;
> +	struct zoneref *z;
> +	struct zone *zone;
> +	int rc;
> +
> +	nr_pages = 1 << order;
> +	zonelist = node_zonelist(nid, gfp);
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp),
> +					nodemask) {
> +		pgdat_resize_lock(zone->zone_pgdat, &flags);
> +		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
> +		while (zone_spans_pfn(zone, pfn + nr_pages - 1)) {
> +			if (contig_pfn_range_valid(zone, pfn, nr_pages)) {
> +				struct page *page = pfn_to_online_page(pfn);
> +				unsigned int migratetype;
> +
> +				/*
> +				 * All pageblocks in range must be of same
> +				 * migrate type.
> +				 */
> +				migratetype = get_pageblock_migratetype(page);
> +				pgdat_resize_unlock(zone->zone_pgdat, &flags);
> +
> +				rc = alloc_contig_range(pfn, pfn + nr_pages,
> +						migratetype, gfp);
> +				if (!rc) {
> +					ret_page = pfn_to_page(pfn);
> +					return ret_page;
> +				}
> +				pgdat_resize_lock(zone->zone_pgdat, &flags);
> +			}
> +			pfn += nr_pages;
> +		}
> +		pgdat_resize_unlock(zone->zone_pgdat, &flags);
> +	}
> +
> +	return ret_page;
> +}
> +
> +/**
> + * find_alloc_contig_pages() -- attempt to find and allocate a contiguous
> + *				range of pages
> + * @nr_pages:	number of pages to find/allocate
> + * @gfp:	gfp mask used to limit search as well as during compaction
> + * @nid:	target node
> + * @nodemask:	mask of other possible nodes
> + *
> + * Pages can be freed with a call to free_contig_pages(), or by manually
> + * calling __free_page() for each page allocated.
> + *
> + * Return: pointer to 'order' pages on success, or NULL if not successful.
> + */
> +struct page *find_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp,
> +					int nid, nodemask_t *nodemask)
> +{
> +	unsigned long i, alloc_order, order_pages;
> +	struct page *pages;
> +
> +	/*
> +	 * Underlying allocators perform page order sized allocations.
> +	 */
> +	alloc_order = get_count_order(nr_pages);

So it takes arbitrary nr_pages but converts it to an order anyway? I think
that's rather suboptimal and wasteful... e.g. a range could be skipped
because some of the pages added by rounding cannot be migrated away.

Vlastimil

> +	if (alloc_order < MAX_ORDER) {
> +		pages = __alloc_pages_nodemask(gfp, (unsigned int)alloc_order,
> +						nid, nodemask);
> +		split_page(pages, alloc_order);
> +	} else {
> +		pages = __alloc_contig_pages_nodemask(gfp, alloc_order, nid,
> +							nodemask);
> +	}
> +
> +	if (pages) {
> +		/*
> +		 * More pages than desired could have been allocated due to
> +		 * rounding up to next page order.  Free any excess pages.
> +		 */
> +		order_pages = 1UL << alloc_order;
> +		for (i = nr_pages; i < order_pages; i++)
> +			__free_page(pages + i);
> +	}
> +
> +	return pages;
> +}
> +EXPORT_SYMBOL_GPL(find_alloc_contig_pages);
> +
> +void free_contig_pages(struct page *page, unsigned long nr_pages)
> +{
> +	free_contig_range(page_to_pfn(page), nr_pages);
> +}
> +EXPORT_SYMBOL_GPL(free_contig_pages);
>  #endif
>  
>  #if defined CONFIG_MEMORY_HOTPLUG || defined CONFIG_CMA
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/4] Interface for higher order contiguous allocations
  2018-05-03 23:29 [PATCH v2 0/4] Interface for higher order contiguous allocations Mike Kravetz
                   ` (3 preceding siblings ...)
  2018-05-03 23:29 ` [PATCH v2 4/4] mm/hugetlb: use find_alloc_contig_pages() to allocate gigantic pages Mike Kravetz
@ 2018-05-21 12:00 ` Vlastimil Babka
  2018-05-22  0:15   ` Mike Kravetz
  4 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2018-05-21 12:00 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Reinette Chatre, Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, David Nellans,
	Laura Abbott, Pavel Machek, Dave Hansen, Andrew Morton

On 05/04/2018 01:29 AM, Mike Kravetz wrote:
> Vlastimil and Michal brought up the issue of allocation alignment.  The
> routine will currently align to 'nr_pages' (which is the requested size
> argument).  It does this by examining and trying to allocate the first
> nr_pages aligned/nr_pages sized range.  If this fails, it moves on to the
> next nr_pages aligned/nr_pages sized range until success or all potential
> ranges are exhausted.

As I've noted in my patch 3/4 review, nr_pages is in fact first rounded
up to an order, which makes this simpler, but suboptimal. I think we
could perhaps assume that an nr_pages that is a power of two should be
aligned as such, and that other values of nr_pages need no alignment?
This should fit the existing users, and can be extended to an explicit
alignment argument when such a user appears?
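
Roughly something like this, perhaps (untested sketch; align_pages is a
new local that would replace the ALIGN(..., nr_pages) and the
"pfn += nr_pages" stepping in __alloc_contig_pages_nodemask()):

	/*
	 * Power-of-two requests keep their natural alignment, anything
	 * else only needs pageblock alignment on the > MAX_ORDER path.
	 */
	unsigned long align_pages = is_power_of_2(nr_pages) ?
					nr_pages : pageblock_nr_pages;

	pfn = ALIGN(zone->zone_start_pfn, align_pages);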

> If we allow an alignment to be specified, we will
> need to potentially check all alignment aligned/nr_pages sized ranges.
> In the worst case where alignment = PAGE_SIZE, this could result in huge
> increase in the number of ranges to check.
> To help cut down on the number of ranges to check, we could identify the
> first page that causes a range allocation failure and start the next
> range at the next aligned boundary.  I tried this, and we still end up
> with a huge number of ranges and wasted CPU cycles.

I think the wasted-cycles issue is due to the current code structure,
which is based on the CMA use case and assumes that the allocations
will succeed, because the areas are reserved and may contain only
movable allocations:

find_alloc_contig_pages()
  __alloc_contig_pages_nodemask()
    contig_pfn_range_valid()
      - performs only very basic pfn validity and belongs-to-zone checks
    alloc_contig_range()
      start_isolate_page_range()
       for (pfn per pageblock) - the main cycle
         set_migratetype_isolate()
           has_unmovable_pages() - cancel if yes
           move_freepages_block() - expensive!
      __alloc_contig_migrate_range()
etc (not important)

So I think the problem is that in the main cycle we might do a number of
expensive move_freepages_block() operations, then hit a block where
has_unmovable_pages() is true, cancel and do more expensive
undo_isolate_page_range() operations.

If we instead first scanned the range with has_unmovable_pages() and
only started doing the expensive work once we find a large enough range
(aligned or not, depending on the caller), it should be much faster and
there should be no algorithmic difference between the aligned and
non-aligned cases.
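
Very roughly something like this (untested sketch; block_has_unmovable()
is a stand-in for a has_unmovable_pages() based check, and zone,
nr_pages, align_pages, migratetype and gfp are assumed to come from the
surrounding function):

	unsigned long pfn, start;

	start = ALIGN(zone->zone_start_pfn, align_pages);
	for (pfn = start; pfn + pageblock_nr_pages <= zone_end_pfn(zone);
	     pfn += pageblock_nr_pages) {
		if (pfn < start)
			continue;	/* catch up to the aligned start */

		if (block_has_unmovable(zone, pfn)) {
			/* streak broken, restart after this pageblock */
			start = ALIGN(pfn + pageblock_nr_pages, align_pages);
			continue;
		}

		/* blocks [start, pfn + pageblock_nr_pages) look movable */
		if (pfn + pageblock_nr_pages - start < nr_pages)
			continue;

		/* only now pay for isolation and migration */
		if (!alloc_contig_range(start, start + nr_pages,
					migratetype, gfp))
			return pfn_to_page(start);

		/* raced with the allocator, move past this range */
		start = ALIGN(pfn + pageblock_nr_pages, align_pages);
	}
	return NULL;

The cheap scan can of course race with the page allocator, so
alloc_contig_range() can still fail, but the expensive isolate/undo
cycles are no longer spent on ranges that obviously could never work.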

Thanks,
Vlastimil

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/4] mm: check for proper migrate type during isolation
  2018-05-18 10:32   ` Vlastimil Babka
@ 2018-05-21 23:10     ` Mike Kravetz
  2018-05-22  7:07       ` Vlastimil Babka
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Kravetz @ 2018-05-21 23:10 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm, linux-kernel, linux-api
  Cc: Reinette Chatre, Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, David Nellans,
	Laura Abbott, Pavel Machek, Dave Hansen, Andrew Morton,
	Joonsoo Kim

On 05/18/2018 03:32 AM, Vlastimil Babka wrote:
> On 05/04/2018 01:29 AM, Mike Kravetz wrote:
>> The routine start_isolate_page_range and alloc_contig_range have
>> comments saying that migratetype must be either MIGRATE_MOVABLE or
>> MIGRATE_CMA.  However, this is not enforced.
> 
> Enforced, no. But if the pageblocks really were as such, it used to
> shortcut has_unmovable_pages(). This was wrong and removed in
> d7b236e10ced ("mm: drop migrate type checks from has_unmovable_pages")
> plus 4da2ce250f98 ("mm: distinguish CMA and MOVABLE isolation in
> has_unmovable_pages()").
> 
> 
>   What is important is
>> that that all pageblocks in the range are of type migratetype.
>                                                the same
>> This is because blocks will be set to migratetype on error.
> 
> Strictly speaking this is true only for the CMA case. For other cases,
> the best thing actually would be to employ the same heuristics as page
> allocation migratetype fallbacks, and count how many pages are free and
> how many appear to be movable, see how steal_suitable_fallback() uses
> the last parameter of move_freepages_block().
> 
>> Add a boolean argument enforce_migratetype to the routine
>> start_isolate_page_range.  If set, it will check that all pageblocks
>> in the range have the passed migratetype.  Return -EINVAL is pageblock
>                                                             if
>> is wrong type is found in range.
>   of
>>
>> A boolean is used for enforce_migratetype as there are two primary
>> users.  Contiguous range allocation which wants to enforce migration
>> type checking.  Memory offline (hotplug) which is not concerned about
>> type checking.
> 
> This is missing some high-level result. The end change is that CMA is
> now enforcing. So we are making it more robust when it's called on
> non-CMA pageblocks by mistake? (BTW I still do hope we can remove
> MIGRATE_CMA soon after Joonsoo's ZONE_MOVABLE CMA conversion. Combined
> with my suggestion above we could hopefully get rid of the migratetype
> parameter completely instead of enforcing it?). Is this also a
> preparation for introducing find_alloc_contig_pages() which will be
> enforcing? (I guess, and will find out shortly, but it should be stated
> here)

Thank you for looking at these patches Vlastimil.

My primary motivation for this patch was the 'error recovery' in
start_isolate_page_range.  It takes a range and attempts to set
all pageblocks to MIGRATE_ISOLATE.  If it encounters an error after
setting some blocks to isolate, it will 'clean up' by setting the
migrate type of previously modified blocks to the passed migratetype.

So, one possible side effect of an error in start_isolate_page_range
is that the migrate type of some pageblocks could be modified.  Thinking
about it more now, that may be OK.  It just does not seem like the
right thing to do, especially with comments saying "migratetype must
be either MIGRATE_MOVABLE or MIGRATE_CMA".  I'm fine with leaving the
code as is and just cleaning up the comments if you think that may
be better.

> 
> Thanks,
> Vlastimil
> 
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  include/linux/page-isolation.h |  8 +++-----
>>  mm/memory_hotplug.c            |  2 +-
>>  mm/page_alloc.c                | 17 +++++++++--------
>>  mm/page_isolation.c            | 40 ++++++++++++++++++++++++++++++----------
>>  4 files changed, 43 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
>> index 4ae347cbc36d..2ab7e5a399ac 100644
>> --- a/include/linux/page-isolation.h
>> +++ b/include/linux/page-isolation.h
>> @@ -38,8 +38,6 @@ int move_freepages_block(struct zone *zone, struct page *page,
>>  
>>  /*
>>   * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
>> - * If specified range includes migrate types other than MOVABLE or CMA,
>> - * this will fail with -EBUSY.
>>   *
>>   * For isolating all pages in the range finally, the caller have to
>>   * free all pages in the range. test_page_isolated() can be used for
>> @@ -47,11 +45,11 @@ int move_freepages_block(struct zone *zone, struct page *page,
>>   */
>>  int
>>  start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> -			 unsigned migratetype, bool skip_hwpoisoned_pages);
>> +			 unsigned migratetype, bool skip_hwpoisoned_pages,
>> +			 bool enforce_migratetype);
>>  
>>  /*
>> - * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
>> - * target range is [start_pfn, end_pfn)
>> + * Changes MIGRATE_ISOLATE to migratetype for range [start_pfn, end_pfn)
>>   */
>>  int
>>  undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index f74826cdceea..ebc1c8c330e2 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1601,7 +1601,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>>  
>>  	/* set above range as isolated */
>>  	ret = start_isolate_page_range(start_pfn, end_pfn,
>> -				       MIGRATE_MOVABLE, true);
>> +				       MIGRATE_MOVABLE, true, false);
>>  	if (ret)
>>  		return ret;
>>  
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 0fd5e8e2456e..cb1a5e0be6ee 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -7787,9 +7787,10 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
>>   * alloc_contig_range() -- tries to allocate given range of pages
>>   * @start:	start PFN to allocate
>>   * @end:	one-past-the-last PFN to allocate
>> - * @migratetype:	migratetype of the underlaying pageblocks (either
>> - *			#MIGRATE_MOVABLE or #MIGRATE_CMA).  All pageblocks
>> - *			in range must have the same migratetype and it must
>> + * @migratetype:	migratetype of the underlaying pageblocks.  All
>> + *			pageblocks in range must have the same migratetype.
>> + *			migratetype is typically MIGRATE_MOVABLE or
>> + *			MIGRATE_CMA, but this is not a requirement.
>>   *			be either of the two.
>>   * @gfp_mask:	GFP mask to use during compaction
>>   *
>> @@ -7840,15 +7841,15 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>>  	 * allocator removing them from the buddy system.  This way
>>  	 * page allocator will never consider using them.
>>  	 *
>> -	 * This lets us mark the pageblocks back as
>> -	 * MIGRATE_CMA/MIGRATE_MOVABLE so that free pages in the
>> -	 * aligned range but not in the unaligned, original range are
>> -	 * put back to page allocator so that buddy can use them.
>> +	 * This lets us mark the pageblocks back as their original
>> +	 * migrate type so that free pages in the  aligned range but
>> +	 * not in the unaligned, original range are put back to page
>> +	 * allocator so that buddy can use them.
>>  	 */
>>  
>>  	ret = start_isolate_page_range(pfn_max_align_down(start),
>>  				       pfn_max_align_up(end), migratetype,
>> -				       false);
>> +				       false, true);
>>  	if (ret)
>>  		return ret;
>>  
>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> index 43e085608846..472191cc1909 100644
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -16,7 +16,8 @@
>>  #include <trace/events/page_isolation.h>
>>  
>>  static int set_migratetype_isolate(struct page *page, int migratetype,
>> -				bool skip_hwpoisoned_pages)
>> +				bool skip_hwpoisoned_pages,
>> +				bool enforce_migratetype)
>>  {
>>  	struct zone *zone;
>>  	unsigned long flags, pfn;
>> @@ -36,6 +37,17 @@ static int set_migratetype_isolate(struct page *page, int migratetype,
>>  	if (is_migrate_isolate_page(page))
>>  		goto out;
>>  
>> +	/*
>> +	 * If requested, check migration type of pageblock and make sure
>> +	 * it matches migratetype
>> +	 */
>> +	if (enforce_migratetype) {
>> +		if (get_pageblock_migratetype(page) != migratetype) {
>> +			ret = -EINVAL;
>> +			goto out;
>> +		}
>> +	}
>> +
>>  	pfn = page_to_pfn(page);
>>  	arg.start_pfn = pfn;
>>  	arg.nr_pages = pageblock_nr_pages;
>> @@ -167,14 +179,16 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>   * to be MIGRATE_ISOLATE.
>>   * @start_pfn: The lower PFN of the range to be isolated.
>>   * @end_pfn: The upper PFN of the range to be isolated.
>> - * @migratetype: migrate type to set in error recovery.
>> + * @migratetype: migrate type of all blocks in range.
>>   *
>>   * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
>>   * the range will never be allocated. Any free pages and pages freed in the
>>   * future will not be allocated again.
>>   *
>>   * start_pfn/end_pfn must be aligned to pageblock_order.
>> - * Return 0 on success and -EBUSY if any part of range cannot be isolated.
>> + * Return 0 on success or error returned by set_migratetype_isolate.  Typical
>> + * errors are -EBUSY if any part of range cannot be isolated or -EINVAL if
>> + * any page block is not of migratetype.
>>   *
>>   * There is no high level synchronization mechanism that prevents two threads
>>   * from trying to isolate overlapping ranges.  If this happens, one thread
>> @@ -185,11 +199,13 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>   * prevents two threads from simultaneously working on overlapping ranges.
>>   */
>>  int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>> -			     unsigned migratetype, bool skip_hwpoisoned_pages)
>> +			     unsigned migratetype, bool skip_hwpoisoned_pages,
>> +			     bool enforce_migratetype)
>>  {
>>  	unsigned long pfn;
>>  	unsigned long undo_pfn;
>>  	struct page *page;
>> +	int ret = 0;
>>  
>>  	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
>>  	BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
>> @@ -198,13 +214,17 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>  	     pfn < end_pfn;
>>  	     pfn += pageblock_nr_pages) {
>>  		page = __first_valid_page(pfn, pageblock_nr_pages);
>> -		if (page &&
>> -		    set_migratetype_isolate(page, migratetype, skip_hwpoisoned_pages)) {
>> -			undo_pfn = pfn;
>> -			goto undo;
>> +		if (page) {
>> +			ret = set_migratetype_isolate(page, migratetype,
>> +							skip_hwpoisoned_pages,
>> +							enforce_migratetype);
>> +			if (ret) {
>> +				undo_pfn = pfn;
>> +				goto undo;
>> +			}
>>  		}
>>  	}
>> -	return 0;
>> +	return ret;
>>  undo:
>>  	for (pfn = start_pfn;
>>  	     pfn < undo_pfn;
>> @@ -215,7 +235,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>>  		unset_migratetype_isolate(page, migratetype);
>>  	}
>>  
>> -	return -EBUSY;
>> +	return ret;
>>  }
>>  
>>  /*
>>
> 


-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] mm: add find_alloc_contig_pages() interface
  2018-05-21  8:54   ` Vlastimil Babka
@ 2018-05-21 23:48     ` Mike Kravetz
  2018-05-22 16:41       ` Reinette Chatre
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Kravetz @ 2018-05-21 23:48 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm, linux-kernel, linux-api
  Cc: Reinette Chatre, Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, David Nellans,
	Laura Abbott, Pavel Machek, Dave Hansen, Andrew Morton

On 05/21/2018 01:54 AM, Vlastimil Babka wrote:
> On 05/04/2018 01:29 AM, Mike Kravetz wrote:
>> find_alloc_contig_pages() is a new interface that attempts to locate
>> and allocate a contiguous range of pages.  It is provided as a more
> 
> How about dropping the 'find_' from the name, so it's more like other
> allocator functions? All of them have to 'find' the free pages in some
> sense.

Sure

> 
>> convenient interface than alloc_contig_range() which is currently
>> used by CMA and gigantic huge pages.
>>
>> When attempting to allocate a range of pages, migration is employed
>> if possible.  There is no guarantee that the routine will succeed.
>> So, the user must be prepared for failure and have a fall back plan.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  include/linux/gfp.h |  12 +++++
>>  mm/page_alloc.c     | 136 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 146 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index 86a0d06463ab..b0d11777d487 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -573,6 +573,18 @@ static inline bool pm_suspended_storage(void)
>>  extern int alloc_contig_range(unsigned long start, unsigned long end,
>>  			      unsigned migratetype, gfp_t gfp_mask);
>>  extern void free_contig_range(unsigned long pfn, unsigned long nr_pages);
>> +extern struct page *find_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp,
>> +						int nid, nodemask_t *nodemask);
>> +extern void free_contig_pages(struct page *page, unsigned long nr_pages);
>> +#else
>> +static inline struct page *find_alloc_contig_pages(unsigned long nr_pages,
>> +				gfp_t gfp, int nid, nodemask_t *nodemask)
>> +{
>> +	return NULL;
>> +}
>> +static inline void free_contig_pages(struct page *page, unsigned long nr_pages)
>> +{
>> +}
>>  #endif
>>  
>>  #ifdef CONFIG_CMA
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index cb1a5e0be6ee..d0a2d0da9eae 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -67,6 +67,7 @@
>>  #include <linux/ftrace.h>
>>  #include <linux/lockdep.h>
>>  #include <linux/nmi.h>
>> +#include <linux/mmzone.h>
>>  
>>  #include <asm/sections.h>
>>  #include <asm/tlbflush.h>
>> @@ -7913,8 +7914,12 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>>  
>>  	/* Make sure the range is really isolated. */
>>  	if (test_pages_isolated(outer_start, end, false)) {
>> -		pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
>> -			__func__, outer_start, end);
>> +#ifdef MIGRATE_CMA
>> +		/* Only print messages for CMA allocations */
>> +		if (migratetype == MIGRATE_CMA)
> 
> I think is_migrate_cma() can be used to avoid the #ifdef.
> 

Thanks.  I missed that and did not want to create something new.

>> +			pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
>> +				__func__, outer_start, end);
>> +#endif
>>  		ret = -EBUSY;
>>  		goto done;
>>  	}
>> @@ -7950,6 +7955,133 @@ void free_contig_range(unsigned long pfn, unsigned long nr_pages)
>>  	}
>>  	WARN(count != 0, "%ld pages are still in use!\n", count);
>>  }
>> +
>> +/*
>> + * Only check for obvious pfn/pages which can not be used/migrated.  The
>> + * migration code will do the final check.  Under stress, this minimal set
>> + * has been observed to provide the best results.  The checks can be expanded
>> + * if needed.
> 
> Hm I kind of doubt this is optimal, it doesn't test almost anything
> besides basic validity, so it won't exclude ranges where the allocation
> will fail. I will write more in a reply to the header where complexity
> is discussed.
> 

Ok.  This 'appeared' to work best in testing where I had all CPUs in tight
loops calling this new interface to allocate and then free contiguous pages.
I was somewhat surprised at the result, and it may just be due to the nature
of my testing.

>> + */
>> +static bool contig_pfn_range_valid(struct zone *z, unsigned long start_pfn,
>> +					unsigned long nr_pages)
>> +{
>> +	unsigned long i, end_pfn = start_pfn + nr_pages;
>> +	struct page *page;
>> +
>> +	for (i = start_pfn; i < end_pfn; i++) {
>> +		if (!pfn_valid(i))
>> +			return false;
>> +
>> +		page = pfn_to_online_page(i);
>> +
>> +		if (page_zone(page) != z)
>> +			return false;
>> +
>> +	}
>> +
>> +	return true;
>> +}
>> +
>> +/*
>> + * Search for and attempt to allocate contiguous allocations greater than
>> + * MAX_ORDER.
>> + */
>> +static struct page *__alloc_contig_pages_nodemask(gfp_t gfp,
>> +						unsigned long order,
>> +						int nid, nodemask_t *nodemask)
>> +{
>> +	unsigned long nr_pages, pfn, flags;
>> +	struct page *ret_page = NULL;
>> +	struct zonelist *zonelist;
>> +	struct zoneref *z;
>> +	struct zone *zone;
>> +	int rc;
>> +
>> +	nr_pages = 1 << order;
>> +	zonelist = node_zonelist(nid, gfp);
>> +	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp),
>> +					nodemask) {
>> +		pgdat_resize_lock(zone->zone_pgdat, &flags);
>> +		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
>> +		while (zone_spans_pfn(zone, pfn + nr_pages - 1)) {
>> +			if (contig_pfn_range_valid(zone, pfn, nr_pages)) {
>> +				struct page *page = pfn_to_online_page(pfn);
>> +				unsigned int migratetype;
>> +
>> +				/*
>> +				 * All pageblocks in range must be of same
>> +				 * migrate type.
>> +				 */
>> +				migratetype = get_pageblock_migratetype(page);
>> +				pgdat_resize_unlock(zone->zone_pgdat, &flags);
>> +
>> +				rc = alloc_contig_range(pfn, pfn + nr_pages,
>> +						migratetype, gfp);
>> +				if (!rc) {
>> +					ret_page = pfn_to_page(pfn);
>> +					return ret_page;
>> +				}
>> +				pgdat_resize_lock(zone->zone_pgdat, &flags);
>> +			}
>> +			pfn += nr_pages;
>> +		}
>> +		pgdat_resize_unlock(zone->zone_pgdat, &flags);
>> +	}
>> +
>> +	return ret_page;
>> +}
>> +
>> +/**
>> + * find_alloc_contig_pages() -- attempt to find and allocate a contiguous
>> + *				range of pages
>> + * @nr_pages:	number of pages to find/allocate
>> + * @gfp:	gfp mask used to limit search as well as during compaction
>> + * @nid:	target node
>> + * @nodemask:	mask of other possible nodes
>> + *
>> + * Pages can be freed with a call to free_contig_pages(), or by manually
>> + * calling __free_page() for each page allocated.
>> + *
>> + * Return: pointer to 'order' pages on success, or NULL if not successful.
>> + */
>> +struct page *find_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp,
>> +					int nid, nodemask_t *nodemask)
>> +{
>> +	unsigned long i, alloc_order, order_pages;
>> +	struct page *pages;
>> +
>> +	/*
>> +	 * Underlying allocators perform page order sized allocations.
>> +	 */
>> +	alloc_order = get_count_order(nr_pages);
> 
> So if takes arbitrary nr_pages but convert it to order anyway? I think
> that's rather suboptimal and wasteful... e.g. a range could be skipped
> because some of the pages added by rounding cannot be migrated away.

Yes.  My idea with this series was to use the existing allocators,
which are all order based.  Let me think about how to do allocations
for an arbitrary number of pages.
- For sizes less than MAX_ORDER we rely on the buddy allocator, so we
  are pretty much stuck with order sized allocations.  However,
  allocations of this size are not really interesting as you can call
  the existing routines directly.
- For sizes greater than MAX_ORDER, we know that the allocation size will
  be at least pageblock sized.  So, the isolate/migrate scheme can still
  be used for full pageblocks.  We can then use direct migration for the
  remaining pages.  This does complicate things a bit.

I'm guessing that most (all?) allocations will be order based.  The use
cases I am aware of (hugetlbfs, Intel Cache Pseudo-Locking, RDMA) are
all order based.  However, as commented on the previous version, taking
arbitrary nr_pages makes the interface more future proof.
-- 
Mike Kravetz

> 
> Vlastimil
> 
>> +	if (alloc_order < MAX_ORDER) {
>> +		pages = __alloc_pages_nodemask(gfp, (unsigned int)alloc_order,
>> +						nid, nodemask);
>> +		split_page(pages, alloc_order);
>> +	} else {
>> +		pages = __alloc_contig_pages_nodemask(gfp, alloc_order, nid,
>> +							nodemask);
>> +	}
>> +
>> +	if (pages) {
>> +		/*
>> +		 * More pages than desired could have been allocated due to
>> +		 * rounding up to next page order.  Free any excess pages.
>> +		 */
>> +		order_pages = 1UL << alloc_order;
>> +		for (i = nr_pages; i < order_pages; i++)
>> +			__free_page(pages + i);
>> +	}
>> +
>> +	return pages;
>> +}
>> +EXPORT_SYMBOL_GPL(find_alloc_contig_pages);
>> +
>> +void free_contig_pages(struct page *page, unsigned long nr_pages)
>> +{
>> +	free_contig_range(page_to_pfn(page), nr_pages);
>> +}
>> +EXPORT_SYMBOL_GPL(free_contig_pages);
>>  #endif
>>  
>>  #if defined CONFIG_MEMORY_HOTPLUG || defined CONFIG_CMA
>>
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/4] Interface for higher order contiguous allocations
  2018-05-21 12:00 ` [PATCH v2 0/4] Interface for higher order contiguous allocations Vlastimil Babka
@ 2018-05-22  0:15   ` Mike Kravetz
  0 siblings, 0 replies; 19+ messages in thread
From: Mike Kravetz @ 2018-05-22  0:15 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm, linux-kernel, linux-api
  Cc: Reinette Chatre, Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, David Nellans,
	Laura Abbott, Pavel Machek, Dave Hansen, Andrew Morton

On 05/21/2018 05:00 AM, Vlastimil Babka wrote:
> On 05/04/2018 01:29 AM, Mike Kravetz wrote:
>> Vlastimil and Michal brought up the issue of allocation alignment.  The
>> routine will currently align to 'nr_pages' (which is the requested size
>> argument).  It does this by examining and trying to allocate the first
>> nr_pages aligned/nr_pages sized range.  If this fails, it moves on to the
>> next nr_pages aligned/nr_pages sized range until success or all potential
>> ranges are exhausted.
> 
> As I've noted in my patch 3/4 review, in fact nr_pages is first rounded
> up to an order, which makes this simpler, but suboptimal. I think we
> could perhaps assume that nr_pages that's a power of two should be
> aligned as such, and other values of nr_pages need no alignment? This
> should fit existing users, and can be extended to explicit alignment
> when such user appears?

I'm good with that.  I do believe that minimum alignment will be
pageblock size alignment (for > MAX_ORDER allocations).

>> If we allow an alignment to be specified, we will
>> need to potentially check all alignment aligned/nr_pages sized ranges.
>> In the worst case where alignment = PAGE_SIZE, this could result in huge
>> increase in the number of ranges to check.
>> To help cut down on the number of ranges to check, we could identify the
>> first page that causes a range allocation failure and start the next
>> range at the next aligned boundary.  I tried this, and we still end up
>> with a huge number of ranges and wasted CPU cycles.
> 
> I think the wasted cycle issues is due to the current code structure,
> which is based on the CMA use-case, which assumes that the allocations
> will succeed, because the areas are reserved and may contain only
> movable allocations
> 
> find_alloc_contig_pages()
>   __alloc_contig_pages_nodemask()
>     contig_pfn_range_valid()
>       - performs only very basic pfn validity and belongs-to-zone checks
>     alloc_contig_range()
>       start_isolate_page_range()
>        for (pfn per pageblock) - the main cycle
>          set_migratetype_isolate()
>            has_unmovable_pages() - cancel if yes
>            move_freepages_block() - expensive!
>       __alloc_contig_migrate_range()
> etc (not important)
> 
> So I think the problem is that in the main cycle we might do a number of
> expensive move_freepages_block() operations, then hit a block where
> has_unmovable_pages() is true, cancel and do more expensive
> undo_isolate_page_range() operations.
> 
> If we instead first scanned the range with has_unmovable_pages() and
> only start doing the expensive work when we find a large enough (aligned
> or not depending on caller) range, it should be much faster and there
> should be no algorithmic difference between aligned and non-aligned case.

Ok, I will give that a try.

Thanks again for looking at these.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/4] mm: check for proper migrate type during isolation
  2018-05-21 23:10     ` Mike Kravetz
@ 2018-05-22  7:07       ` Vlastimil Babka
  0 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2018-05-22  7:07 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Reinette Chatre, Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, David Nellans,
	Laura Abbott, Pavel Machek, Dave Hansen, Andrew Morton,
	Joonsoo Kim

On 05/22/2018 01:10 AM, Mike Kravetz wrote:
> On 05/18/2018 03:32 AM, Vlastimil Babka wrote:
>> On 05/04/2018 01:29 AM, Mike Kravetz wrote:
>>> The routine start_isolate_page_range and alloc_contig_range have
>>> comments saying that migratetype must be either MIGRATE_MOVABLE or
>>> MIGRATE_CMA.  However, this is not enforced.
>>
>> Enforced, no. But if the pageblocks really were as such, it used to
>> shortcut has_unmovable_pages(). This was wrong and removed in
>> d7b236e10ced ("mm: drop migrate type checks from has_unmovable_pages")
>> plus 4da2ce250f98 ("mm: distinguish CMA and MOVABLE isolation in
>> has_unmovable_pages()").
>>
>>
>>   What is important is
>>> that that all pageblocks in the range are of type migratetype.
>>                                                the same
>>> This is because blocks will be set to migratetype on error.
>>
>> Strictly speaking this is true only for the CMA case. For other cases,
>> the best thing actually would be to employ the same heuristics as page
>> allocation migratetype fallbacks, and count how many pages are free and
>> how many appear to be movable, see how steal_suitable_fallback() uses
>> the last parameter of move_freepages_block().
>>
>>> Add a boolean argument enforce_migratetype to the routine
>>> start_isolate_page_range.  If set, it will check that all pageblocks
>>> in the range have the passed migratetype.  Return -EINVAL is pageblock
>>                                                             if
>>> is wrong type is found in range.
>>   of
>>>
>>> A boolean is used for enforce_migratetype as there are two primary
>>> users.  Contiguous range allocation which wants to enforce migration
>>> type checking.  Memory offline (hotplug) which is not concerned about
>>> type checking.
>>
>> This is missing some high-level result. The end change is that CMA is
>> now enforcing. So we are making it more robust when it's called on
>> non-CMA pageblocks by mistake? (BTW I still do hope we can remove
>> MIGRATE_CMA soon after Joonsoo's ZONE_MOVABLE CMA conversion. Combined
>> with my suggestion above we could hopefully get rid of the migratetype
>> parameter completely instead of enforcing it?). Is this also a
>> preparation for introducing find_alloc_contig_pages() which will be
>> enforcing? (I guess, and will find out shortly, but it should be stated
>> here)
> 
> Thank you for looking at these patches Vlastimil.
> 
> My primary motivation for this patch was the 'error recovery' in
> start_isolate_page_range.  It takes a range and attempts to set
> all pageblocks to MIGRATE_ISOLATE.  If it encounters an error after
> setting some blocks to isolate, it will 'clean up' by setting the
> migrate type of previously modified blocks to the passed migratetype.

Right.

> So, one possible side effect of an error in start_isolate_page_range
> is that the migrate type of some pageblocks could be modified.  Thinking
> about it more now, that may be OK.

It would definitely be OK if the migratetype was changed similarly to
how steal_suitable_fallback() does it, as I've said above.

> It just does not seem like the
> right thing to do, especially with comments saying "migratetype must
> be either MIGRATE_MOVABLE or MIGRATE_CMA".  I'm fine with leaving the
> code as is and just cleaning up the comments if you think that may
> be better.

That's also possible, especially when the code is restructured as I've
suggested in the other reply, which should significantly reduce the
number of error recoveries.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] mm: add find_alloc_contig_pages() interface
  2018-05-21 23:48     ` Mike Kravetz
@ 2018-05-22 16:41       ` Reinette Chatre
  2018-05-22 20:35         ` Mike Kravetz
  2018-05-23 11:18         ` Vlastimil Babka
  0 siblings, 2 replies; 19+ messages in thread
From: Reinette Chatre @ 2018-05-22 16:41 UTC (permalink / raw)
  To: Mike Kravetz, Vlastimil Babka, linux-mm, linux-kernel, linux-api
  Cc: Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, David Nellans,
	Laura Abbott, Pavel Machek, Dave Hansen, Andrew Morton

On 5/21/2018 4:48 PM, Mike Kravetz wrote:
> On 05/21/2018 01:54 AM, Vlastimil Babka wrote:
>> On 05/04/2018 01:29 AM, Mike Kravetz wrote:
>>> +/**
>>> + * find_alloc_contig_pages() -- attempt to find and allocate a contiguous
>>> + *				range of pages
>>> + * @nr_pages:	number of pages to find/allocate
>>> + * @gfp:	gfp mask used to limit search as well as during compaction
>>> + * @nid:	target node
>>> + * @nodemask:	mask of other possible nodes
>>> + *
>>> + * Pages can be freed with a call to free_contig_pages(), or by manually
>>> + * calling __free_page() for each page allocated.
>>> + *
>>> + * Return: pointer to 'order' pages on success, or NULL if not successful.
>>> + */
>>> +struct page *find_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp,
>>> +					int nid, nodemask_t *nodemask)
>>> +{
>>> +	unsigned long i, alloc_order, order_pages;
>>> +	struct page *pages;
>>> +
>>> +	/*
>>> +	 * Underlying allocators perform page order sized allocations.
>>> +	 */
>>> +	alloc_order = get_count_order(nr_pages);
>>
>> So if takes arbitrary nr_pages but convert it to order anyway? I think
>> that's rather suboptimal and wasteful... e.g. a range could be skipped
>> because some of the pages added by rounding cannot be migrated away.
> 
> Yes.  My idea with this series was to use existing allocators which are
> all order based.  Let me think about how to do allocation for arbitrary
> number of allocations.
> - For less than MAX_ORDER size we rely on the buddy allocator, so we are
>   pretty much stuck with order sized allocation.  However, allocations of
>   this size are not really interesting as you can call existing routines
>   directly.
> - For sizes greater than MAX_ORDER, we know that the allocation size will
>   be at least pageblock sized.  So, the isolate/migrate scheme can still
>   be used for full pageblocks.  We can then use direct migration for the
>   remaining pages.  This does complicate things a bit.
> 
> I'm guessing that most (?all?) allocations will be order based.  The use
> cases I am aware of (hugetlbfs, Intel Cache Pseudo-Locking, RDMA) are all
> order based.  However, as commented in previous version taking arbitrary
> nr_pages makes interface more future proof.
> 

I noticed this Cache Pseudo-Locking statement and would like to clarify.
I have not been following this thread in detail so I would like to
apologize first if my comments are out of context.

Currently the Cache Pseudo-Locking allocations are order based because I
assumed that was required by the allocator. The contiguous regions needed
by Cache Pseudo-Locking will not always be order based - instead their
size is based on the granularity of the cache allocation. One example is
a platform with a 55MB L3 cache that can be divided into 20 equal
portions. To support Cache Pseudo-Locking on this platform we need to be
able to allocate contiguous regions in increments of 2816KB (the size of
each portion). On this example platform the regions needed would thus be
2816KB, 5632KB, 8448KB, etc.

Regards,

Reinette


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] mm: add find_alloc_contig_pages() interface
  2018-05-22 16:41       ` Reinette Chatre
@ 2018-05-22 20:35         ` Mike Kravetz
  2018-05-23 11:18         ` Vlastimil Babka
  1 sibling, 0 replies; 19+ messages in thread
From: Mike Kravetz @ 2018-05-22 20:35 UTC (permalink / raw)
  To: Reinette Chatre, Vlastimil Babka, linux-mm, linux-kernel, linux-api
  Cc: Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, David Nellans,
	Laura Abbott, Pavel Machek, Dave Hansen, Andrew Morton

On 05/22/2018 09:41 AM, Reinette Chatre wrote:
> On 5/21/2018 4:48 PM, Mike Kravetz wrote:
>> On 05/21/2018 01:54 AM, Vlastimil Babka wrote:
>>> On 05/04/2018 01:29 AM, Mike Kravetz wrote:
>>>> +/**
>>>> + * find_alloc_contig_pages() -- attempt to find and allocate a contiguous
>>>> + *				range of pages
>>>> + * @nr_pages:	number of pages to find/allocate
>>>> + * @gfp:	gfp mask used to limit search as well as during compaction
>>>> + * @nid:	target node
>>>> + * @nodemask:	mask of other possible nodes
>>>> + *
>>>> + * Pages can be freed with a call to free_contig_pages(), or by manually
>>>> + * calling __free_page() for each page allocated.
>>>> + *
>>>> + * Return: pointer to 'order' pages on success, or NULL if not successful.
>>>> + */
>>>> +struct page *find_alloc_contig_pages(unsigned long nr_pages, gfp_t gfp,
>>>> +					int nid, nodemask_t *nodemask)
>>>> +{
>>>> +	unsigned long i, alloc_order, order_pages;
>>>> +	struct page *pages;
>>>> +
>>>> +	/*
>>>> +	 * Underlying allocators perform page order sized allocations.
>>>> +	 */
>>>> +	alloc_order = get_count_order(nr_pages);
>>>
>>> So if takes arbitrary nr_pages but convert it to order anyway? I think
>>> that's rather suboptimal and wasteful... e.g. a range could be skipped
>>> because some of the pages added by rounding cannot be migrated away.
>>
>> Yes.  My idea with this series was to use existing allocators which are
>> all order based.  Let me think about how to do allocation for arbitrary
>> number of allocations.
>> - For less than MAX_ORDER size we rely on the buddy allocator, so we are
>>   pretty much stuck with order sized allocation.  However, allocations of
>>   this size are not really interesting as you can call existing routines
>>   directly.
>> - For sizes greater than MAX_ORDER, we know that the allocation size will
>>   be at least pageblock sized.  So, the isolate/migrate scheme can still
>>   be used for full pageblocks.  We can then use direct migration for the
>>   remaining pages.  This does complicate things a bit.
>>
>> I'm guessing that most (?all?) allocations will be order based.  The use
>> cases I am aware of (hugetlbfs, Intel Cache Pseudo-Locking, RDMA) are all
>> order based.  However, as commented in previous version taking arbitrary
>> nr_pages makes interface more future proof.
>>
> 
> I noticed this Cache Pseudo-Locking statement and would like to clarify.
> I have not been following this thread in detail so I would like to
> apologize first if my comments are out of context.
> 
> Currently the Cache Pseudo-Locking allocations are order based because I
> assumed it was required by the allocator. The contiguous regions needed
> by Cache Pseudo-Locking will not always be order based - instead it is
> based on the granularity of the cache allocation. One example is a
> platform with 55MB L3 cache that can be divided into 20 equal portions.
> To support Cache Pseudo-Locking on this platform we need to be able to
> allocate contiguous regions at increments of 2816KB (the size of each
> portion). In support of this example platform regions needed would thus
> be 2816KB, 5632KB, 8448KB, etc.

Thank you Reinette.  I was not aware of these details.  Yours is the most
concrete new use case.

This certainly makes more of a case for arbitrary sized allocations.
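
To put numbers on it (assuming 4KB pages and MAX_ORDER of 11): 2816KB is
704 pages, which the current code rounds up to alloc_order = 10, i.e.
1024 pages (4096KB), allocating and then freeing back 320 excess pages.
5632KB is 1408 pages, rounded up to order 11 (2048 pages, 8192KB), so
the above-MAX_ORDER path would have to find an 8MB aligned, 8MB sized
range for a 5.5MB request.  Arbitrary sized allocations would avoid
both kinds of waste.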

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] mm: add find_alloc_contig_pages() interface
  2018-05-22 16:41       ` Reinette Chatre
  2018-05-22 20:35         ` Mike Kravetz
@ 2018-05-23 11:18         ` Vlastimil Babka
  2018-05-23 18:07           ` Reinette Chatre
  1 sibling, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2018-05-23 11:18 UTC (permalink / raw)
  To: Reinette Chatre, Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, David Nellans,
	Laura Abbott, Pavel Machek, Dave Hansen, Andrew Morton

On 05/22/2018 06:41 PM, Reinette Chatre wrote:
> On 5/21/2018 4:48 PM, Mike Kravetz wrote:
>> I'm guessing that most (?all?) allocations will be order based.  The use
>> cases I am aware of (hugetlbfs, Intel Cache Pseudo-Locking, RDMA) are all
>> order based.  However, as commented in previous version taking arbitrary
>> nr_pages makes interface more future proof.
>>
> 
> I noticed this Cache Pseudo-Locking statement and would like to clarify.
> I have not been following this thread in detail so I would like to
> apologize first if my comments are out of context.
> 
> Currently the Cache Pseudo-Locking allocations are order based because I
> assumed it was required by the allocator. The contiguous regions needed
> by Cache Pseudo-Locking will not always be order based - instead it is
> based on the granularity of the cache allocation. One example is a
> platform with 55MB L3 cache that can be divided into 20 equal portions.
> To support Cache Pseudo-Locking on this platform we need to be able to
> allocate contiguous regions at increments of 2816KB (the size of each
> portion). In support of this example platform regions needed would thus
> be 2816KB, 5632KB, 8448KB, etc.

Will there be any alignment requirements for these allocations e.g. for
minimizing conflict misses?

Vlastimil

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] mm: add find_alloc_contig_pages() interface
  2018-05-23 11:18         ` Vlastimil Babka
@ 2018-05-23 18:07           ` Reinette Chatre
  2018-05-28 13:12             ` Vlastimil Babka
  0 siblings, 1 reply; 19+ messages in thread
From: Reinette Chatre @ 2018-05-23 18:07 UTC (permalink / raw)
  To: Vlastimil Babka, Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, David Nellans,
	Laura Abbott, Pavel Machek, Dave Hansen, Andrew Morton

Hi Vlastimil,

On 5/23/2018 4:18 AM, Vlastimil Babka wrote:
> On 05/22/2018 06:41 PM, Reinette Chatre wrote:
>> On 5/21/2018 4:48 PM, Mike Kravetz wrote:
>>> I'm guessing that most (?all?) allocations will be order based.  The use
>>> cases I am aware of (hugetlbfs, Intel Cache Pseudo-Locking, RDMA) are all
>>> order based.  However, as commented in previous version taking arbitrary
>>> nr_pages makes interface more future proof.
>>>
>>
>> I noticed this Cache Pseudo-Locking statement and would like to clarify.
>> I have not been following this thread in detail so I would like to
>> apologize first if my comments are out of context.
>>
>> Currently the Cache Pseudo-Locking allocations are order based because I
>> assumed it was required by the allocator. The contiguous regions needed
>> by Cache Pseudo-Locking will not always be order based - instead it is
>> based on the granularity of the cache allocation. One example is a
>> platform with 55MB L3 cache that can be divided into 20 equal portions.
>> To support Cache Pseudo-Locking on this platform we need to be able to
>> allocate contiguous regions at increments of 2816KB (the size of each
>> portion). In support of this example platform regions needed would thus
>> be 2816KB, 5632KB, 8448KB, etc.
> 
> Will there be any alignment requirements for these allocations e.g. for
> minimizing conflict misses?

There are two views on the usage of the allocated memory: On the user
space side, the kernel memory is mapped to userspace (using
remap_pfn_range()) and thus needs to be page aligned. On the kernel side
the memory is loaded into the cache, and it is here where the
requirement for it to be contiguous originates. The memory being
contiguous reduces the likelihood of physical addresses from the
allocated memory mapping to the same cache line and thus causing cache
evictions of the memory we are trying to load into the cache.
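
For the user space side, the mapping is the usual remap_pfn_range()
pattern, roughly (sketch only, not the actual pseudo-locking code;
'pages' stands for the contiguously allocated buffer):

	/* sketch: map an npages long contiguous allocation into vma */
	static int map_contig_buf(struct vm_area_struct *vma,
				  struct page *pages, unsigned long npages)
	{
		unsigned long size = vma->vm_end - vma->vm_start;

		if (size > (npages << PAGE_SHIFT))
			return -EINVAL;

		return remap_pfn_range(vma, vma->vm_start,
				       page_to_pfn(pages), size,
				       vma->vm_page_prot);
	}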

I hope I answered your question; if not, please let me know which parts
I missed and I will try again.

Reinette


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] mm: add find_alloc_contig_pages() interface
  2018-05-23 18:07           ` Reinette Chatre
@ 2018-05-28 13:12             ` Vlastimil Babka
  0 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2018-05-28 13:12 UTC (permalink / raw)
  To: Reinette Chatre, Mike Kravetz, linux-mm, linux-kernel, linux-api
  Cc: Michal Hocko, Christopher Lameter, Guy Shattah,
	Anshuman Khandual, Michal Nazarewicz, David Nellans,
	Laura Abbott, Pavel Machek, Dave Hansen, Andrew Morton

On 05/23/2018 08:07 PM, Reinette Chatre wrote:
> On 5/23/2018 4:18 AM, Vlastimil Babka wrote:
>> On 05/22/2018 06:41 PM, Reinette Chatre wrote:
>>> Currently the Cache Pseudo-Locking allocations are order based because I
>>> assumed it was required by the allocator. The contiguous regions needed
>>> by Cache Pseudo-Locking will not always be order based - instead it is
>>> based on the granularity of the cache allocation. One example is a
>>> platform with 55MB L3 cache that can be divided into 20 equal portions.
>>> To support Cache Pseudo-Locking on this platform we need to be able to
>>> allocate contiguous regions at increments of 2816KB (the size of each
>>> portion). In support of this example platform regions needed would thus
>>> be 2816KB, 5632KB, 8448KB, etc.
>>
>> Will there be any alignment requirements for these allocations e.g. for
>> minimizing conflict misses?
> 
> Two views on the usage of the allocated memory are: On the user space
> side, the kernel memory is mapped to userspace (using remap_pfn_range())
> and thus need to be page aligned. On the kernel side the memory is
> loaded into the cache and it is here where the requirement originates
> for it to be contiguous. The memory being contiguous reduces the
> likelihood of physical addresses from the allocated memory mapping to
> the same cache line and thus cause cache evictions of memory we are
> trying to load into the cache.

Hi, yeah, that's what I've been thinking, and I guess page alignment is
enough for that after all. I'm just not used to cache sizes and ways
that are not a power of two :)

> I hope I answered your question, if not, please let me know which parts
> I missed and I will try again.

Thanks!

Vlastimil

> Reinette
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread
