linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 0/7] Make alloc_contig_range handle Hugetlb pages
@ 2021-04-13 10:47 Oscar Salvador
  2021-04-13 10:47 ` [PATCH v7 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range Oscar Salvador
                   ` (6 more replies)
  0 siblings, 7 replies; 40+ messages in thread
From: Oscar Salvador @ 2021-04-13 10:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Vlastimil Babka, David Hildenbrand, Michal Hocko,
	Muchun Song, linux-mm, linux-kernel, Oscar Salvador

So, after Mike's work [1] has gone in, here is a new version of top.

NOTE: If you are going to try out this patchset, be aware of [2].
      You should fix that up before any testing.

[1] https://patchwork.kernel.org/project/linux-mm/cover/20210409205254.242291-1-mike.kravetz@oracle.com/
[2] https://patchwork.kernel.org/project/linux-mm/patch/1617259448-22529-5-git-send-email-anshuman.khandual@arm.com/#24108247

v6 -> v7:
 - Add patch to move the clearing of HPageFreed flag out of the lock
 - Add patch to decouple counter handling from prep_new_huge_page().
   We end up with two new functions, __prep_new_huge_page() which
   does the proper initialization of the new huge page, and
   __prep_account_new_huge_page(), which increments hstate->nr_huge_pages.
   prep_new_huge_page() still calls both of them (details in patch#4).
   This comes in handy in patch#5, where the whole operation of
   replacing the page must be done under the lock.
 - Remove Reviewed-by/Acked-by from patch#5, as needs to be checked again.

v5 -> v6:
 - Collect Acked-by from Michal
 - Adressed feedback for patch#2 (expand the comment about migrate_pfn and
   change return values)
 - Complete pathc#3's changelog (per Michal)
 - Place retry lock inside of alloc_and_dissolve_huge_page()

v4 -> v5:
 - Collect Acked-by and Reviewed-by from David and Vlastimil
 - Drop racy checks in pfn_range_valid_contig (David)
 - Rebased on top of 5.12-rc3

v3 -> v4:
 - Addressed some feedback from David and Michal
 - Make more clear what hugetlb_lock protects in isolate_or_dissolve_huge_page
 - Start reporting proper error codes from isolate_migratepages_{range,block}
 - Bail out earlier in __alloc_contig_migrate_range on -ENOMEM
 - Addressed internal feedback from Vastlimil wrt. compaction code changes

v2 -> v3:
 - Drop usage of high-level generic helpers in favour of
   low-level approach (per Michal)
 - Check for the page to be marked as PageHugeFreed
 - Add a one-time retry in case someone grabbed the free huge page
   from under us

v1 -> v2:
 - Adressed feedback by Michal
 - Restrict the allocation to a node with __GFP_THISNODE
 - Drop PageHuge check in alloc_and_dissolve_huge_page
 - Re-order comments in isolate_or_dissolve_huge_page
 - Extend comment in isolate_migratepages_block
 - Place put_page right after we got the page, otherwise
   dissolve_free_huge_page will fail

 RFC -> v1:
 - Drop RFC
 - Addressed feedback from David and Mike
 - Fence off gigantic pages as there is a cyclic dependency between
   them and alloc_contig_range
 - Re-organize the code to make race-window smaller and to put
   all details in hugetlb code
 - Drop nodemask initialization. First a node will be tried and then we
   will back to other nodes containing memory (N_MEMORY). Details in
   patch#1's changelog
 - Count new page as surplus in case we failed to dissolve the old page
   and the new one. Details in patch#1.

Cover letter:

 alloc_contig_range lacks the hability for handling HugeTLB pages.
 This can be problematic for some users, e.g: CMA and virtio-mem, where those
 users will fail the call if alloc_contig_range ever sees a HugeTLB page, even
 when those pages lay in ZONE_MOVABLE and are free.
 That problem can be easily solved by replacing the page in the free hugepage
 pool.

 In-use HugeTLB are no exception though, as those can be isolated and migrated
 as any other LRU or Movable page.

 This patchset aims for improving alloc_contig_range->isolate_migratepages_block,
 so HugeTLB pages can be recognized and handled.

 Since we also need to start reporting errors down the chain (e.g: -ENOMEM due to
 not be able to allocate a new hugetlb page), isolate_migratepages_{range,block}
 interfaces  need to change to start reporting error codes instead of the pfn == 0
 vs pfn != 0 scheme it is using right now.
 From now on, isolate_migratepages_block will not return the next pfn to be scanned
 anymore, but -EINTR, -ENOMEM or 0, so we the next pfn to be scanned will be recorded
 in cc->migrate_pfn field (as it is already done in isolate_migratepages_range()).

 Below is an insight from David (thanks), where the problem can clearly be seen:

 "Start a VM with 4G. Hotplug 1G via virtio-mem and online it to
  ZONE_MOVABLE. Allocate 512 huge pages.

  [root@localhost ~]# cat /proc/meminfo
  MemTotal:        5061512 kB
  MemFree:         3319396 kB
  MemAvailable:    3457144 kB
  ...
  HugePages_Total:     512
  HugePages_Free:      512
  HugePages_Rsvd:        0
  HugePages_Surp:        0
  Hugepagesize:       2048 kB

  The huge pages get partially allocate from ZONE_MOVABLE. Try unplugging
  1G via virtio-mem (remember, all ZONE_MOVABLE). Inside the guest:

  [  180.058992] alloc_contig_range: [1b8000, 1c0000) PFNs busy
  [  180.060531] alloc_contig_range: [1b8000, 1c0000) PFNs busy
  [  180.061972] alloc_contig_range: [1b8000, 1c0000) PFNs busy
  [  180.063413] alloc_contig_range: [1b8000, 1c0000) PFNs busy
  [  180.064838] alloc_contig_range: [1b8000, 1c0000) PFNs busy
  [  180.065848] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
  [  180.066794] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
  [  180.067738] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
  [  180.068669] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
  [  180.069598] alloc_contig_range: [1bfc00, 1c0000) PFNs busy"

 And then with this patchset running:

 "Same experiment with ZONE_MOVABLE:

  a) Free huge pages: all memory can get unplugged again.

  b) Allocated/populated but idle huge pages: all memory can get unplugged
     again.

  c) Allocated/populated but all 512 huge pages are read/written in a
     loop: all memory can get unplugged again, but I get a single

  [  121.192345] alloc_contig_range: [180000, 188000) PFNs busy

  Most probably because it happened to try migrating a huge page while it
  was busy. As virtio-mem retries on ZONE_MOVABLE a couple of times, it
  can deal with this temporary failure.

  Last but not least, I did something extreme:

  # cat /proc/meminfo
  MemTotal:        5061568 kB
  MemFree:          186560 kB
  MemAvailable:     354524 kB
  ...
  HugePages_Total:    2048
  HugePages_Free:     2048
  HugePages_Rsvd:        0
  HugePages_Surp:        0

  Triggering unplug would require to dissolve+alloc - which now fails when
  trying to allocate an additional ~512 huge pages (1G).

  As expected, I can properly see memory unplug not fully succeeding. + I
  get a fairly continuous stream of

  [  226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy
  ...

  But more importantly, the hugepage count remains stable, as configured
  by the admin (me):

  HugePages_Total:    2048
  HugePages_Free:     2048
  HugePages_Rsvd:        0
  HugePages_Surp:        0"

Oscar Salvador (7):
  mm,page_alloc: Bail out earlier on -ENOMEM in
    alloc_contig_migrate_range
  mm,compaction: Let isolate_migratepages_{range,block} return error
    codes
  mm,hugetlb: Clear HPageFreed outside of the lock
  mm,hugetlb: Split prep_new_huge_page functionality
  mm: Make alloc_contig_range handle free hugetlb pages
  mm: Make alloc_contig_range handle in-use hugetlb pages
  mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig

 include/linux/hugetlb.h |   7 +++
 mm/compaction.c         |  97 ++++++++++++++++++++++---------
 mm/hugetlb.c            | 148 ++++++++++++++++++++++++++++++++++++++++++++++--
 mm/internal.h           |  10 +++-
 mm/page_alloc.c         |  22 +++----
 mm/vmscan.c             |   5 +-
 6 files changed, 242 insertions(+), 47 deletions(-)

-- 
2.16.3


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v7 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range
  2021-04-13 10:47 [PATCH v7 0/7] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
@ 2021-04-13 10:47 ` Oscar Salvador
  2021-04-13 17:00   ` Mike Kravetz
  2021-04-13 10:47 ` [PATCH v7 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes Oscar Salvador
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 40+ messages in thread
From: Oscar Salvador @ 2021-04-13 10:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Vlastimil Babka, David Hildenbrand, Michal Hocko,
	Muchun Song, linux-mm, linux-kernel, Oscar Salvador

Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or -EBUSY,
and report them down the chain.
The problem is that when migrate_pages() reports -ENOMEM, we keep going till we
exhaust all the try-attempts (5 at the moment) instead of bailing out.

migrate_pages() bails out right away on -ENOMEM because it is considered a fatal
error. Do the same here instead of keep going and retrying.
Note that this is not fixing a real issue, just a cosmetic change. Although we
can save some cycles by backing off ealier

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c67c99603a3..689454692de1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8697,7 +8697,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 			}
 			tries = 0;
 		} else if (++tries == 5) {
-			ret = ret < 0 ? ret : -EBUSY;
+			ret = -EBUSY;
 			break;
 		}
 
@@ -8707,6 +8707,13 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 
 		ret = migrate_pages(&cc->migratepages, alloc_migration_target,
 				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
+
+		/*
+		 * On -ENOMEM, migrate_pages() bails out right away. It is pointless
+		 * to retry again over this error, so do the same here.
+		 */
+		if (ret == -ENOMEM)
+			break;
 	}
 
 	lru_cache_enable();
-- 
2.16.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v7 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes
  2021-04-13 10:47 [PATCH v7 0/7] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
  2021-04-13 10:47 ` [PATCH v7 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range Oscar Salvador
@ 2021-04-13 10:47 ` Oscar Salvador
  2021-04-13 17:52   ` Mike Kravetz
  2021-04-14 11:54   ` David Hildenbrand
  2021-04-13 10:47 ` [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock Oscar Salvador
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 40+ messages in thread
From: Oscar Salvador @ 2021-04-13 10:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Vlastimil Babka, David Hildenbrand, Michal Hocko,
	Muchun Song, linux-mm, linux-kernel, Oscar Salvador

Currently, isolate_migratepages_{range,block} and their callers use
a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was
any error during isolation.
This does not work as soon as we need to start reporting different error
codes and make sure we pass them down the chain, so they are properly
interpreted by functions like e.g: alloc_contig_range.

Let us rework isolate_migratepages_{range,block} so we can report error
codes.
Since isolate_migratepages_block will stop returning the next pfn to be
scanned, we reuse the cc->migrate_pfn field to keep track of that.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/compaction.c | 54 +++++++++++++++++++++++++++---------------------------
 mm/internal.h   | 10 ++++++++--
 mm/page_alloc.c |  7 +++----
 3 files changed, 38 insertions(+), 33 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 8c5028bfbd56..eeba4668c22c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -787,15 +787,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
  *
  * Isolate all pages that can be migrated from the range specified by
  * [low_pfn, end_pfn). The range is expected to be within same pageblock.
- * Returns zero if there is a fatal signal pending, otherwise PFN of the
- * first page that was not scanned (which may be both less, equal to or more
- * than end_pfn).
+ * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion,
+ * or 0.
+ * cc->migrate_pfn will contain the next pfn to scan (which may be both less,
+ * equal to or more that end_pfn).
  *
  * The pages are isolated on cc->migratepages list (not required to be empty),
- * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
- * is neither read nor updated.
+ * and cc->nr_migratepages is updated accordingly.
  */
-static unsigned long
+static int
 isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			unsigned long end_pfn, isolate_mode_t isolate_mode)
 {
@@ -809,6 +809,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 	bool skip_on_failure = false;
 	unsigned long next_skip_pfn = 0;
 	bool skip_updated = false;
+	int ret = 0;
+
+	cc->migrate_pfn = low_pfn;
 
 	/*
 	 * Ensure that there are not too many pages isolated from the LRU
@@ -818,16 +821,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 	while (unlikely(too_many_isolated(pgdat))) {
 		/* stop isolation if there are still pages not migrated */
 		if (cc->nr_migratepages)
-			return 0;
+			return -EAGAIN;
 
 		/* async migration should just abort */
 		if (cc->mode == MIGRATE_ASYNC)
-			return 0;
+			return -EAGAIN;
 
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		if (fatal_signal_pending(current))
-			return 0;
+			return -EINTR;
 	}
 
 	cond_resched();
@@ -875,8 +878,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 			if (fatal_signal_pending(current)) {
 				cc->contended = true;
+				ret = -EINTR;
 
-				low_pfn = 0;
 				goto fatal_pending;
 			}
 
@@ -1130,7 +1133,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 	if (nr_isolated)
 		count_compact_events(COMPACTISOLATED, nr_isolated);
 
-	return low_pfn;
+	cc->migrate_pfn = low_pfn;
+
+	return ret;
 }
 
 /**
@@ -1139,15 +1144,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
  * @start_pfn: The first PFN to start isolating.
  * @end_pfn:   The one-past-last PFN.
  *
- * Returns zero if isolation fails fatally due to e.g. pending signal.
- * Otherwise, function returns one-past-the-last PFN of isolated page
- * (which may be greater than end_pfn if end fell in a middle of a THP page).
+ * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion,
+ * or 0.
  */
-unsigned long
+int
 isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
 							unsigned long end_pfn)
 {
 	unsigned long pfn, block_start_pfn, block_end_pfn;
+	int ret = 0;
 
 	/* Scan block by block. First and last block may be incomplete */
 	pfn = start_pfn;
@@ -1166,17 +1171,17 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
 					block_end_pfn, cc->zone))
 			continue;
 
-		pfn = isolate_migratepages_block(cc, pfn, block_end_pfn,
-							ISOLATE_UNEVICTABLE);
+		ret = isolate_migratepages_block(cc, pfn, block_end_pfn,
+						 ISOLATE_UNEVICTABLE);
 
-		if (!pfn)
+		if (ret)
 			break;
 
 		if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX)
 			break;
 	}
 
-	return pfn;
+	return ret;
 }
 
 #endif /* CONFIG_COMPACTION || CONFIG_CMA */
@@ -1847,7 +1852,7 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
 	 */
 	for (; block_end_pfn <= cc->free_pfn;
 			fast_find_block = false,
-			low_pfn = block_end_pfn,
+			cc->migrate_pfn = low_pfn = block_end_pfn,
 			block_start_pfn = block_end_pfn,
 			block_end_pfn += pageblock_nr_pages) {
 
@@ -1889,10 +1894,8 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
 		}
 
 		/* Perform the isolation */
-		low_pfn = isolate_migratepages_block(cc, low_pfn,
-						block_end_pfn, isolate_mode);
-
-		if (!low_pfn)
+		if (isolate_migratepages_block(cc, low_pfn, block_end_pfn,
+						isolate_mode))
 			return ISOLATE_ABORT;
 
 		/*
@@ -1903,9 +1906,6 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
 		break;
 	}
 
-	/* Record where migration scanner will be restarted. */
-	cc->migrate_pfn = low_pfn;
-
 	return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index f469f69309de..46eb82eaa195 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -244,7 +244,13 @@ struct compact_control {
 	unsigned int nr_freepages;	/* Number of isolated free pages */
 	unsigned int nr_migratepages;	/* Number of pages to migrate */
 	unsigned long free_pfn;		/* isolate_freepages search base */
-	unsigned long migrate_pfn;	/* isolate_migratepages search base */
+	/*
+	 * Acts as an in/out parameter to page isolation for migration.
+	 * isolate_migratepages uses it as a search base.
+	 * isolate_migratepages_block will update the value to the next pfn
+	 * after the last isolated one.
+	 */
+	unsigned long migrate_pfn;
 	unsigned long fast_start_pfn;	/* a pfn to start linear scan from */
 	struct zone *zone;
 	unsigned long total_migrate_scanned;
@@ -280,7 +286,7 @@ struct capture_control {
 unsigned long
 isolate_freepages_range(struct compact_control *cc,
 			unsigned long start_pfn, unsigned long end_pfn);
-unsigned long
+int
 isolate_migratepages_range(struct compact_control *cc,
 			   unsigned long low_pfn, unsigned long end_pfn);
 int find_suitable_fallback(struct free_area *area, unsigned int order,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 689454692de1..b5a94de3cdde 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8690,11 +8690,10 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 
 		if (list_empty(&cc->migratepages)) {
 			cc->nr_migratepages = 0;
-			pfn = isolate_migratepages_range(cc, pfn, end);
-			if (!pfn) {
-				ret = -EINTR;
+			ret = isolate_migratepages_range(cc, pfn, end);
+			if (ret && ret != -EAGAIN)
 				break;
-			}
+			pfn = cc->migrate_pfn;
 			tries = 0;
 		} else if (++tries == 5) {
 			ret = -EBUSY;
-- 
2.16.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
  2021-04-13 10:47 [PATCH v7 0/7] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
  2021-04-13 10:47 ` [PATCH v7 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range Oscar Salvador
  2021-04-13 10:47 ` [PATCH v7 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes Oscar Salvador
@ 2021-04-13 10:47 ` Oscar Salvador
  2021-04-13 13:23   ` Michal Hocko
  2021-04-13 10:47 ` [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality Oscar Salvador
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 40+ messages in thread
From: Oscar Salvador @ 2021-04-13 10:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Vlastimil Babka, David Hildenbrand, Michal Hocko,
	Muchun Song, linux-mm, linux-kernel, Oscar Salvador

Currently, the clearing of the flag is done under the lock, but this
is unnecessary as we just allocated the page and we did not give it
away yet, so no one should be messing with it.

Also, this helps making clear that here the lock is only protecting the
counter.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/hugetlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 54d81d5947ed..e40d5fe5c63c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1490,10 +1490,10 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	hugetlb_set_page_subpool(page, NULL);
 	set_hugetlb_cgroup(page, NULL);
 	set_hugetlb_cgroup_rsvd(page, NULL);
+	ClearHPageFreed(page);
 	spin_lock_irq(&hugetlb_lock);
 	h->nr_huge_pages++;
 	h->nr_huge_pages_node[nid]++;
-	ClearHPageFreed(page);
 	spin_unlock_irq(&hugetlb_lock);
 }
 
-- 
2.16.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality
  2021-04-13 10:47 [PATCH v7 0/7] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
                   ` (2 preceding siblings ...)
  2021-04-13 10:47 ` [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock Oscar Salvador
@ 2021-04-13 10:47 ` Oscar Salvador
  2021-04-13 13:24   ` Michal Hocko
  2021-04-13 21:33   ` Mike Kravetz
  2021-04-13 10:47 ` [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 40+ messages in thread
From: Oscar Salvador @ 2021-04-13 10:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Vlastimil Babka, David Hildenbrand, Michal Hocko,
	Muchun Song, linux-mm, linux-kernel, Oscar Salvador

Currently, prep_new_huge_page() performs two functions.
It sets the right state for a new hugetlb, and increases the hstate's
counters to account for the new page.

Let us split its functionality into two separate functions, decoupling
the handling of the counters from initializing a hugepage.
The outcome is having __prep_new_huge_page(), which only
initializes the page , and __prep_account_new_huge_page(), which adds
the new page to the hstate's counters.

This allows us to be able to set a hugetlb without having to worry
about the counter/locking. It will prove useful in the next patch.
prep_new_huge_page() still calls both functions.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/hugetlb.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e40d5fe5c63c..0607b2b71ac6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1483,7 +1483,16 @@ void free_huge_page(struct page *page)
 	}
 }
 
-static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
+/*
+ * Must be called with the hugetlb lock held
+ */
+static void __prep_account_new_huge_page(struct hstate *h, int nid)
+{
+	h->nr_huge_pages++;
+	h->nr_huge_pages_node[nid]++;
+}
+
+static void __prep_new_huge_page(struct page *page)
 {
 	INIT_LIST_HEAD(&page->lru);
 	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
@@ -1491,9 +1500,13 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	set_hugetlb_cgroup(page, NULL);
 	set_hugetlb_cgroup_rsvd(page, NULL);
 	ClearHPageFreed(page);
+}
+
+static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
+{
+	__prep_new_huge_page(page);
 	spin_lock_irq(&hugetlb_lock);
-	h->nr_huge_pages++;
-	h->nr_huge_pages_node[nid]++;
+	__prep_account_new_huge_page(h, nid);
 	spin_unlock_irq(&hugetlb_lock);
 }
 
-- 
2.16.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages
  2021-04-13 10:47 [PATCH v7 0/7] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
                   ` (3 preceding siblings ...)
  2021-04-13 10:47 ` [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality Oscar Salvador
@ 2021-04-13 10:47 ` Oscar Salvador
  2021-04-13 13:40   ` Michal Hocko
                     ` (2 more replies)
  2021-04-13 10:47 ` [PATCH v7 6/7] mm: Make alloc_contig_range handle in-use " Oscar Salvador
  2021-04-13 10:47 ` [PATCH v7 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig Oscar Salvador
  6 siblings, 3 replies; 40+ messages in thread
From: Oscar Salvador @ 2021-04-13 10:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Vlastimil Babka, David Hildenbrand, Michal Hocko,
	Muchun Song, linux-mm, linux-kernel, Oscar Salvador

alloc_contig_range will fail if it ever sees a HugeTLB page within the
range we are trying to allocate, even when that page is free and can be
easily reallocated.
This has proved to be problematic for some users of alloc_contic_range,
e.g: CMA and virtio-mem, where those would fail the call even when those
pages lay in ZONE_MOVABLE and are free.

We can do better by trying to replace such page.

Free hugepages are tricky to handle so as to no userspace application
notices disruption, we need to replace the current free hugepage with
a new one.

In order to do that, a new function called alloc_and_dissolve_huge_page
is introduced.
This function will first try to get a new fresh hugepage, and if it
succeeds, it will replace the old one in the free hugepage pool.

The free page replacement is done under hugetlb_lock, so no external
users of hugetlb will notice the change.
To allocate the new huge page, we use alloc_buddy_huge_page(), so we
do not have to deal with any counters, and prep_new_huge_page() is not
called. This is valulable because in case we need to free the new page,
we only need to call __free_pages().

Once we know that the page to be replaced is a genuine 0-refcounted
huge page, we remove the old page from the freelist by remove_hugetlb_page().
Then, we can call __prep_new_huge_page() and __prep_account_new_huge_page()
for the new huge page to properly initialize it and increment the
hstate->nr_huge_pages counter (previously decremented by
remove_hugetlb_page()).
Once done, the page is enqueued by enqueue_huge_page() and it is ready
to be used.

There is one tricky case when
page's refcount is 0 because it is in the process of being released.
A missing PageHugeFreed bit will tell us that freeing is in flight so
we retry after dropping the hugetlb_lock. The race window should be
small and the next retry should make a forward progress.

E.g:

CPU0				CPU1
free_huge_page()		isolate_or_dissolve_huge_page
				  PageHuge() == T
				  alloc_and_dissolve_huge_page
				    alloc_buddy_huge_page()
				    spin_lock_irq(hugetlb_lock)
				    // PageHuge() && !PageHugeFreed &&
				    // !PageCount()
				    spin_unlock_irq(hugetlb_lock)
  spin_lock_irq(hugetlb_lock)
  1) update_and_free_page
       PageHuge() == F
       __free_pages()
  2) enqueue_huge_page
       SetPageHugeFreed()
  spin_unlock(&hugetlb_lock)
				  spin_lock_irq(hugetlb_lock)
                                   1) PageHuge() == F (freed by case#1 from CPU0)
				   2) PageHuge() == T
                                       PageHugeFreed() == T
                                       - proceed with replacing the page

In the case above we retry as the window race is quite small and we have high
chances to succeed next time.

With regard to the allocation, we restrict it to the node the page belongs
to with __GFP_THISNODE, meaning we do not fallback on other node's zones.

Note that gigantic hugetlb pages are fenced off since there is a cyclic
dependency between them and alloc_contig_range.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/hugetlb.h |   6 +++
 mm/compaction.c         |  37 ++++++++++++++--
 mm/hugetlb.c            | 115 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 155 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 09f1fd12a6fa..b2d2118bfd1a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -595,6 +595,7 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
+int isolate_or_dissolve_huge_page(struct page *page);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -877,6 +878,11 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
+static inline int isolate_or_dissolve_huge_page(struct page *page)
+{
+	return -ENOMEM;
+}
+
 static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
 					   unsigned long addr,
 					   int avoid_reserve)
diff --git a/mm/compaction.c b/mm/compaction.c
index eeba4668c22c..89426b6d1ea3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -788,7 +788,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
  * Isolate all pages that can be migrated from the range specified by
  * [low_pfn, end_pfn). The range is expected to be within same pageblock.
  * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion,
- * or 0.
+ * -ENOMEM in case we could not allocate a page, or 0.
  * cc->migrate_pfn will contain the next pfn to scan (which may be both less,
  * equal to or more that end_pfn).
  *
@@ -809,6 +809,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 	bool skip_on_failure = false;
 	unsigned long next_skip_pfn = 0;
 	bool skip_updated = false;
+	bool fatal_error = false;
 	int ret = 0;
 
 	cc->migrate_pfn = low_pfn;
@@ -907,6 +908,33 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			valid_page = page;
 		}
 
+		if (PageHuge(page) && cc->alloc_contig) {
+			ret = isolate_or_dissolve_huge_page(page);
+
+			/*
+			 * Fail isolation in case isolate_or_dissolve_huge_page
+			 * reports an error. In case of -ENOMEM, abort right away.
+			 */
+			if (ret < 0) {
+				/*
+				 * Do not report -EBUSY down the chain.
+				 */
+				if (ret == -ENOMEM)
+					fatal_error = true;
+				else
+					ret = 0;
+				low_pfn += (1UL << compound_order(page)) - 1;
+				goto isolate_fail;
+			}
+
+			/*
+			 * Ok, the hugepage was dissolved. Now these pages are
+			 * Buddy and cannot be re-allocated because they are
+			 * isolated. Fall-through as the check below handles
+			 * Buddy pages.
+			 */
+		}
+
 		/*
 		 * Skip if free. We read page order here without zone lock
 		 * which is generally unsafe, but the race window is small and
@@ -1066,7 +1094,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		put_page(page);
 
 isolate_fail:
-		if (!skip_on_failure)
+		if (!skip_on_failure && !fatal_error)
 			continue;
 
 		/*
@@ -1092,6 +1120,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			 */
 			next_skip_pfn += 1UL << cc->order;
 		}
+
+		if (fatal_error)
+			break;
 	}
 
 	/*
@@ -1145,7 +1176,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
  * @end_pfn:   The one-past-last PFN.
  *
  * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion,
- * or 0.
+ * -ENOMEM in case we could not allocate a page, or 0.
  */
 int
 isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0607b2b71ac6..4a664d6e82c1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2266,6 +2266,121 @@ static void restore_reserve_on_error(struct hstate *h,
 	}
 }
 
+/*
+ * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
+ * @h: struct hstate old page belongs to
+ * @old_page: Old page to dissolve
+ * Returns 0 on success, otherwise negated error.
+ */
+
+static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
+{
+	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+	int nid = page_to_nid(old_page);
+	struct page *new_page;
+	int ret = 0;
+
+	/*
+	 * Before dissolving the page, we need to allocate a new one for the
+	 * pool to remain stable. Using alloc_buddy_huge_page() allows us to
+	 * not having to deal with prep_new_page() and avoids dealing of any
+	 * counters. This simplifies and let us do the whole thing under the
+	 * lock.
+	 */
+	new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
+	if (!new_page)
+		return -ENOMEM;
+
+retry:
+	spin_lock_irq(&hugetlb_lock);
+	if (!PageHuge(old_page)) {
+		/*
+		 * Freed from under us. Drop new_page too.
+		 */
+		goto free_new;
+	} else if (page_count(old_page)) {
+		/*
+		 * Someone has grabbed the page, fail for now.
+		 */
+		ret = -EBUSY;
+		goto free_new;
+	} else if (!HPageFreed(old_page)) {
+		/*
+		 * Page's refcount is 0 but it has not been enqueued in the
+		 * freelist yet. Race window is small, so we can succeed here if
+		 * we retry.
+		 */
+		spin_unlock_irq(&hugetlb_lock);
+		cond_resched();
+		goto retry;
+	} else {
+		/*
+		 * Ok, old_page is still a genuine free hugepage. Remove it from
+		 * the freelist and decrease the counters. These will be
+		 * incremented again when calling __prep_account_new_huge_page()
+		 * and enqueue_huge_page() for new_page. The counters will remain
+		 * stable since this happens under the lock.
+		 */
+		remove_hugetlb_page(h, old_page, false);
+
+		/*
+		 * Call __prep_new_huge_page() to construct the hugetlb page, and
+		 * enqueue it then to place it in the freelists. After this,
+		 * counters are back on track. Free hugepages have a refcount of 0,
+		 * so we need to decrease new_page's count as well.
+		 */
+		__prep_new_huge_page(new_page);
+		__prep_account_new_huge_page(h, nid);
+		page_ref_dec(new_page);
+		enqueue_huge_page(h, new_page);
+
+		/*
+		 * Pages have been replaced, we can safely free the old one.
+		 */
+		spin_unlock_irq(&hugetlb_lock);
+		update_and_free_page(h, old_page);
+	}
+
+	return ret;
+
+free_new:
+	spin_unlock_irq(&hugetlb_lock);
+	__free_pages(new_page, huge_page_order(h));
+
+	return ret;
+}
+
+int isolate_or_dissolve_huge_page(struct page *page)
+{
+	struct hstate *h;
+	struct page *head;
+
+	/*
+	 * The page might have been dissolved from under our feet, so make sure
+	 * to carefully check the state under the lock.
+	 * Return success when racing as if we dissolved the page ourselves.
+	 */
+	spin_lock_irq(&hugetlb_lock);
+	if (PageHuge(page)) {
+		head = compound_head(page);
+		h = page_hstate(head);
+	} else {
+		spin_unlock(&hugetlb_lock);
+		return 0;
+	}
+	spin_unlock_irq(&hugetlb_lock);
+
+	/*
+	 * Fence off gigantic pages as there is a cyclic dependency between
+	 * alloc_contig_range and them. Return -ENOME as this has the effect
+	 * of bailing out right away without further retrying.
+	 */
+	if (hstate_is_gigantic(h))
+		return -ENOMEM;
+
+	return alloc_and_dissolve_huge_page(h, head);
+}
+
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
-- 
2.16.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v7 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-04-13 10:47 [PATCH v7 0/7] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
                   ` (4 preceding siblings ...)
  2021-04-13 10:47 ` [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
@ 2021-04-13 10:47 ` Oscar Salvador
  2021-04-13 22:48   ` Mike Kravetz
  2021-04-13 10:47 ` [PATCH v7 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig Oscar Salvador
  6 siblings, 1 reply; 40+ messages in thread
From: Oscar Salvador @ 2021-04-13 10:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Vlastimil Babka, David Hildenbrand, Michal Hocko,
	Muchun Song, linux-mm, linux-kernel, Oscar Salvador

alloc_contig_range() will fail if it finds a HugeTLB page within the range,
without a chance to handle them. Since HugeTLB pages can be migrated as any
LRU or Movable page, it does not make sense to bail out without trying.
Enable the interface to recognize in-use HugeTLB pages so we can migrate
them, and have much better chances to succeed the call.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/hugetlb.h |  5 +++--
 mm/compaction.c         | 12 +++++++++++-
 mm/hugetlb.c            | 22 +++++++++++++++++-----
 mm/vmscan.c             |  5 +++--
 4 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b2d2118bfd1a..b92f25ccef58 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -595,7 +595,7 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
-int isolate_or_dissolve_huge_page(struct page *page);
+int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -878,7 +878,8 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
-static inline int isolate_or_dissolve_huge_page(struct page *page)
+static inline int isolate_or_dissolve_huge_page(struct page *page,
+						struct list_head *list)
 {
 	return -ENOMEM;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index 89426b6d1ea3..bb8ff3543972 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -909,7 +909,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		}
 
 		if (PageHuge(page) && cc->alloc_contig) {
-			ret = isolate_or_dissolve_huge_page(page);
+			ret = isolate_or_dissolve_huge_page(page, &cc->migratepages);
 
 			/*
 			 * Fail isolation in case isolate_or_dissolve_huge_page
@@ -927,6 +927,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 				goto isolate_fail;
 			}
 
+			if (PageHuge(page)) {
+				/*
+				 * Hugepage was successfully isolated and placed
+				 * on the cc->migratepages list.
+				 */
+				low_pfn += compound_nr(page) - 1;
+				goto isolate_success_no_list;
+			}
+
 			/*
 			 * Ok, the hugepage was dissolved. Now these pages are
 			 * Buddy and cannot be re-allocated because they are
@@ -1068,6 +1077,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 isolate_success:
 		list_add(&page->lru, &cc->migratepages);
+isolate_success_no_list:
 		cc->nr_migratepages += compound_nr(page);
 		nr_isolated += compound_nr(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4a664d6e82c1..24a453ff47f2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2270,10 +2270,12 @@ static void restore_reserve_on_error(struct hstate *h,
  * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
  * @h: struct hstate old page belongs to
  * @old_page: Old page to dissolve
+ * @list: List to isolate the page in case we need to
  * Returns 0 on success, otherwise negated error.
  */
 
-static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
+static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
+					struct list_head *list)
 {
 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 	int nid = page_to_nid(old_page);
@@ -2300,9 +2302,13 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
 		goto free_new;
 	} else if (page_count(old_page)) {
 		/*
-		 * Someone has grabbed the page, fail for now.
+		 * Someone has grabbed the page, try to isolate it here.
+		 * Fail with -EBUSY if not possible.
 		 */
-		ret = -EBUSY;
+		spin_unlock_irq(&hugetlb_lock);
+		if (!isolate_huge_page(old_page, list))
+			ret = -EBUSY;
+		spin_lock_irq(&hugetlb_lock);
 		goto free_new;
 	} else if (!HPageFreed(old_page)) {
 		/*
@@ -2350,10 +2356,11 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
 	return ret;
 }
 
-int isolate_or_dissolve_huge_page(struct page *page)
+int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
 {
 	struct hstate *h;
 	struct page *head;
+	int ret = -EBUSY;
 
 	/*
 	 * The page might have been dissolved from under our feet, so make sure
@@ -2378,7 +2385,12 @@ int isolate_or_dissolve_huge_page(struct page *page)
 	if (hstate_is_gigantic(h))
 		return -ENOMEM;
 
-	return alloc_and_dissolve_huge_page(h, head);
+	if (page_count(head) && isolate_huge_page(head, list))
+		ret = 0;
+	else if (!page_count(head))
+		ret = alloc_and_dissolve_huge_page(h, head, list);
+
+	return ret;
 }
 
 struct page *alloc_huge_page(struct vm_area_struct *vma,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bb8321026c0c..5199b9696bab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1703,8 +1703,9 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 	LIST_HEAD(clean_pages);
 
 	list_for_each_entry_safe(page, next, page_list, lru) {
-		if (page_is_file_lru(page) && !PageDirty(page) &&
-		    !__PageMovable(page) && !PageUnevictable(page)) {
+		if (!PageHuge(page) && page_is_file_lru(page) &&
+		    !PageDirty(page) && !__PageMovable(page) &&
+		    !PageUnevictable(page)) {
 			ClearPageActive(page);
 			list_move(&page->lru, &clean_pages);
 		}
-- 
2.16.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v7 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig
  2021-04-13 10:47 [PATCH v7 0/7] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
                   ` (5 preceding siblings ...)
  2021-04-13 10:47 ` [PATCH v7 6/7] mm: Make alloc_contig_range handle in-use " Oscar Salvador
@ 2021-04-13 10:47 ` Oscar Salvador
  2021-04-13 22:53   ` Mike Kravetz
  6 siblings, 1 reply; 40+ messages in thread
From: Oscar Salvador @ 2021-04-13 10:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, Vlastimil Babka, David Hildenbrand, Michal Hocko,
	Muchun Song, linux-mm, linux-kernel, Oscar Salvador

pfn_range_valid_contig() bails out when it finds an in-use page or a
hugetlb page, among other things.
We can drop the in-use page check since __alloc_contig_pages can migrate
away those pages, and the hugetlb page check can go too since
isolate_migratepages_range is now capable of dealing with hugetlb pages.
Either way, those checks are racy so let the end function handle it
when the time comes.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 mm/page_alloc.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b5a94de3cdde..c5338e912ace 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8901,12 +8901,6 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
 
 		if (PageReserved(page))
 			return false;
-
-		if (page_count(page) > 0)
-			return false;
-
-		if (PageHuge(page))
-			return false;
 	}
 	return true;
 }
-- 
2.16.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
  2021-04-13 10:47 ` [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock Oscar Salvador
@ 2021-04-13 13:23   ` Michal Hocko
  2021-04-13 21:19     ` Mike Kravetz
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2021-04-13 13:23 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, Vlastimil Babka, David Hildenbrand,
	Muchun Song, linux-mm, linux-kernel

On Tue 13-04-21 12:47:43, Oscar Salvador wrote:
> Currently, the clearing of the flag is done under the lock, but this
> is unnecessary as we just allocated the page and we did not give it
> away yet, so no one should be messing with it.
> 
> Also, this helps making clear that here the lock is only protecting the
> counter.

While moving the flag clearing is ok I am wondering why do we need that
in the first place. I think it is just a leftover from 6c0371490140
("hugetlb: convert PageHugeFreed to HPageFreed flag"). Prior to that a tail
page as been used to keep track of the state but now all happens in the
head page and the flag uses page->private which is always initialized
when allocated by the allocator (post_alloc_hook).

Or do we need it for giga pages which are not allocated by the page
allocator? If yes then moving it to prep_compound_gigantic_page would be
better.

So should we just drop it here?

> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  mm/hugetlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 54d81d5947ed..e40d5fe5c63c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1490,10 +1490,10 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>  	hugetlb_set_page_subpool(page, NULL);
>  	set_hugetlb_cgroup(page, NULL);
>  	set_hugetlb_cgroup_rsvd(page, NULL);
> +	ClearHPageFreed(page);
>  	spin_lock_irq(&hugetlb_lock);
>  	h->nr_huge_pages++;
>  	h->nr_huge_pages_node[nid]++;
> -	ClearHPageFreed(page);
>  	spin_unlock_irq(&hugetlb_lock);
>  }
>  
> -- 
> 2.16.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality
  2021-04-13 10:47 ` [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality Oscar Salvador
@ 2021-04-13 13:24   ` Michal Hocko
  2021-04-13 13:26     ` Michal Hocko
  2021-04-13 21:33   ` Mike Kravetz
  1 sibling, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2021-04-13 13:24 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, Vlastimil Babka, David Hildenbrand,
	Muchun Song, linux-mm, linux-kernel

On Tue 13-04-21 12:47:44, Oscar Salvador wrote:
[...]
> +static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> +{
> +	__prep_new_huge_page(page);
>  	spin_lock_irq(&hugetlb_lock);
> -	h->nr_huge_pages++;
> -	h->nr_huge_pages_node[nid]++;
> +	__prep_account_new_huge_page(h, nid);
>  	spin_unlock_irq(&hugetlb_lock);
>  }

Any reason to decouple the locking from the accounting?
>  
> -- 
> 2.16.3

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality
  2021-04-13 13:24   ` Michal Hocko
@ 2021-04-13 13:26     ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2021-04-13 13:26 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, Vlastimil Babka, David Hildenbrand,
	Muchun Song, linux-mm, linux-kernel

On Tue 13-04-21 15:24:32, Michal Hocko wrote:
> On Tue 13-04-21 12:47:44, Oscar Salvador wrote:
> [...]
> > +static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> > +{
> > +	__prep_new_huge_page(page);
> >  	spin_lock_irq(&hugetlb_lock);
> > -	h->nr_huge_pages++;
> > -	h->nr_huge_pages_node[nid]++;
> > +	__prep_account_new_huge_page(h, nid);
> >  	spin_unlock_irq(&hugetlb_lock);
> >  }
> 
> Any reason to decouple the locking from the accounting?

OK, I spoke too soon. The next patch already has the locking around when
calling this. Please add a lockdep assert into the helper to make the
locking expectations more obvious.

With that
Acked-by: Michal Hocko <mhocko@suse.com>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages
  2021-04-13 10:47 ` [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
@ 2021-04-13 13:40   ` Michal Hocko
  2021-04-15  8:29     ` Oscar Salvador
  2021-04-13 22:29   ` Mike Kravetz
  2021-04-14 12:26   ` David Hildenbrand
  2 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2021-04-13 13:40 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, Vlastimil Babka, David Hildenbrand,
	Muchun Song, linux-mm, linux-kernel

On Tue 13-04-21 12:47:45, Oscar Salvador wrote:
> alloc_contig_range will fail if it ever sees a HugeTLB page within the
> range we are trying to allocate, even when that page is free and can be
> easily reallocated.
> This has proved to be problematic for some users of alloc_contic_range,
> e.g: CMA and virtio-mem, where those would fail the call even when those
> pages lay in ZONE_MOVABLE and are free.
> 
> We can do better by trying to replace such page.
> 
> Free hugepages are tricky to handle so as to no userspace application
> notices disruption, we need to replace the current free hugepage with
> a new one.
> 
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugepage, and if it
> succeeds, it will replace the old one in the free hugepage pool.
> 
> The free page replacement is done under hugetlb_lock, so no external
> users of hugetlb will notice the change.
> To allocate the new huge page, we use alloc_buddy_huge_page(), so we
> do not have to deal with any counters, and prep_new_huge_page() is not
> called. This is valulable because in case we need to free the new page,
> we only need to call __free_pages().
> 
> Once we know that the page to be replaced is a genuine 0-refcounted
> huge page, we remove the old page from the freelist by remove_hugetlb_page().
> Then, we can call __prep_new_huge_page() and __prep_account_new_huge_page()
> for the new huge page to properly initialize it and increment the
> hstate->nr_huge_pages counter (previously decremented by
> remove_hugetlb_page()).
> Once done, the page is enqueued by enqueue_huge_page() and it is ready
> to be used.
> 
> There is one tricky case when
> page's refcount is 0 because it is in the process of being released.
> A missing PageHugeFreed bit will tell us that freeing is in flight so
> we retry after dropping the hugetlb_lock. The race window should be
> small and the next retry should make a forward progress.
> 
> E.g:
> 
> CPU0				CPU1
> free_huge_page()		isolate_or_dissolve_huge_page
> 				  PageHuge() == T
> 				  alloc_and_dissolve_huge_page
> 				    alloc_buddy_huge_page()
> 				    spin_lock_irq(hugetlb_lock)
> 				    // PageHuge() && !PageHugeFreed &&
> 				    // !PageCount()
> 				    spin_unlock_irq(hugetlb_lock)
>   spin_lock_irq(hugetlb_lock)
>   1) update_and_free_page
>        PageHuge() == F
>        __free_pages()
>   2) enqueue_huge_page
>        SetPageHugeFreed()
>   spin_unlock(&hugetlb_lock)
> 				  spin_lock_irq(hugetlb_lock)
>                                    1) PageHuge() == F (freed by case#1 from CPU0)
> 				   2) PageHuge() == T
>                                        PageHugeFreed() == T
>                                        - proceed with replacing the page
> 
> In the case above we retry as the window race is quite small and we have high
> chances to succeed next time.
> 
> With regard to the allocation, we restrict it to the node the page belongs
> to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
> 
> Note that gigantic hugetlb pages are fenced off since there is a cyclic
> dependency between them and alloc_contig_range.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Acked-by: Michal Hocko <mhocko@suse.com>

One minor nit below
[...]

> +		/*
> +		 * Ok, old_page is still a genuine free hugepage. Remove it from
> +		 * the freelist and decrease the counters. These will be
> +		 * incremented again when calling __prep_account_new_huge_page()
> +		 * and enqueue_huge_page() for new_page. The counters will remain
> +		 * stable since this happens under the lock.
> +		 */
> +		remove_hugetlb_page(h, old_page, false);
> +
> +		/*
> +		 * Call __prep_new_huge_page() to construct the hugetlb page, and
> +		 * enqueue it then to place it in the freelists. After this,
> +		 * counters are back on track. Free hugepages have a refcount of 0,
> +		 * so we need to decrease new_page's count as well.
> +		 */
> +		__prep_new_huge_page(new_page);
> +		__prep_account_new_huge_page(h, nid);

I think it would help to put something like the following into the
comment above this really strange construct.

		/*
		 * new_page needs to be initialized with the standard
		 * hugetlb state. This is normally done by
		 * prep_new_huge_page but that takes hugetlb_lock which
		 * is already held so we need to open code it here.
		 * Reference count trick is needed because allocator
		 * gives us referenced page but the pool requires pages
		 * with 0 refcount.
		 */

> +		page_ref_dec(new_page);
> +		enqueue_huge_page(h, new_page);
> +
> +		/*
> +		 * Pages have been replaced, we can safely free the old one.
> +		 */
> +		spin_unlock_irq(&hugetlb_lock);
> +		update_and_free_page(h, old_page);

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range
  2021-04-13 10:47 ` [PATCH v7 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range Oscar Salvador
@ 2021-04-13 17:00   ` Mike Kravetz
  0 siblings, 0 replies; 40+ messages in thread
From: Mike Kravetz @ 2021-04-13 17:00 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Vlastimil Babka, David Hildenbrand, Michal Hocko, Muchun Song,
	linux-mm, linux-kernel

On 4/13/21 3:47 AM, Oscar Salvador wrote:
> Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or -EBUSY,
> and report them down the chain.
> The problem is that when migrate_pages() reports -ENOMEM, we keep going till we
> exhaust all the try-attempts (5 at the moment) instead of bailing out.
> 
> migrate_pages() bails out right away on -ENOMEM because it is considered a fatal
> error. Do the same here instead of keep going and retrying.
> Note that this is not fixing a real issue, just a cosmetic change. Although we
> can save some cycles by backing off ealier
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes
  2021-04-13 10:47 ` [PATCH v7 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes Oscar Salvador
@ 2021-04-13 17:52   ` Mike Kravetz
  2021-04-14 11:54   ` David Hildenbrand
  1 sibling, 0 replies; 40+ messages in thread
From: Mike Kravetz @ 2021-04-13 17:52 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Vlastimil Babka, David Hildenbrand, Michal Hocko, Muchun Song,
	linux-mm, linux-kernel

On 4/13/21 3:47 AM, Oscar Salvador wrote:
> Currently, isolate_migratepages_{range,block} and their callers use
> a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was
> any error during isolation.
> This does not work as soon as we need to start reporting different error
> codes and make sure we pass them down the chain, so they are properly
> interpreted by functions like e.g: alloc_contig_range.
> 
> Let us rework isolate_migratepages_{range,block} so we can report error
> codes.
> Since isolate_migratepages_block will stop returning the next pfn to be
> scanned, we reuse the cc->migrate_pfn field to keep track of that.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
  2021-04-13 13:23   ` Michal Hocko
@ 2021-04-13 21:19     ` Mike Kravetz
  2021-04-14  6:04       ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Mike Kravetz @ 2021-04-13 21:19 UTC (permalink / raw)
  To: Michal Hocko, Oscar Salvador
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On 4/13/21 6:23 AM, Michal Hocko wrote:
> On Tue 13-04-21 12:47:43, Oscar Salvador wrote:
>> Currently, the clearing of the flag is done under the lock, but this
>> is unnecessary as we just allocated the page and we did not give it
>> away yet, so no one should be messing with it.
>>
>> Also, this helps making clear that here the lock is only protecting the
>> counter.
> 
> While moving the flag clearing is ok I am wondering why do we need that
> in the first place. I think it is just a leftover from 6c0371490140
> ("hugetlb: convert PageHugeFreed to HPageFreed flag"). Prior to that a tail
> page as been used to keep track of the state but now all happens in the
> head page and the flag uses page->private which is always initialized
> when allocated by the allocator (post_alloc_hook).

Yes, this was just left over from 6c0371490140.  And, as you mention
post_alloc_hook will clear page->private for all non-gigantic pages
allocated via buddy.

> Or do we need it for giga pages which are not allocated by the page
> allocator? If yes then moving it to prep_compound_gigantic_page would be
> better.

I am pretty sure dynamically allocated giga pages have page->Private
cleared as well.  It is not obvious, but the alloc_contig_range code
used to put together giga pages will end up calling isolate_freepages_range
that will call split_map_pages and then post_alloc_hook for each MAX_ORDER
block.  As mentioned, this is not obvious and I would hate to rely on this
behavior not changing.

> 
> So should we just drop it here?

The only place where page->private may not be initialized is when we do
allocations at boot time from memblock.  In this case, we will add the
pages to the free list via put_page/free_huge_page so the appropriate
flags will be cleared before anyone notices.

I'm wondering if we should just do a set_page_private(page, 0) here in
prep_new_huge_page since we now use that field for flags.  Or, is that
overkill?
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality
  2021-04-13 10:47 ` [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality Oscar Salvador
  2021-04-13 13:24   ` Michal Hocko
@ 2021-04-13 21:33   ` Mike Kravetz
  2021-04-14  4:59     ` Oscar Salvador
  1 sibling, 1 reply; 40+ messages in thread
From: Mike Kravetz @ 2021-04-13 21:33 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Vlastimil Babka, David Hildenbrand, Michal Hocko, Muchun Song,
	linux-mm, linux-kernel

On 4/13/21 3:47 AM, Oscar Salvador wrote:
> Currently, prep_new_huge_page() performs two functions.
> It sets the right state for a new hugetlb, and increases the hstate's
> counters to account for the new page.
> 
> Let us split its functionality into two separate functions, decoupling
> the handling of the counters from initializing a hugepage.
> The outcome is having __prep_new_huge_page(), which only
> initializes the page , and __prep_account_new_huge_page(), which adds
> the new page to the hstate's counters.
> 
> This allows us to be able to set a hugetlb without having to worry
> about the counter/locking. It will prove useful in the next patch.
> prep_new_huge_page() still calls both functions.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  mm/hugetlb.c | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e40d5fe5c63c..0607b2b71ac6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1483,7 +1483,16 @@ void free_huge_page(struct page *page)
>  	}
>  }
>  
> -static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> +/*
> + * Must be called with the hugetlb lock held
> + */
> +static void __prep_account_new_huge_page(struct hstate *h, int nid)
> +{
> +	h->nr_huge_pages++;
> +	h->nr_huge_pages_node[nid]++;

I would prefer if we also move setting the destructor to this routine.
	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);

That way, PageHuge() will be false until it 'really' is a huge page.
If not, we could potentially go into that retry loop in
dissolve_free_huge_page or alloc_and_dissolve_huge_page in patch 5.
-- 
Mike Kravetz

> +}
> +
> +static void __prep_new_huge_page(struct page *page)
>  {
>  	INIT_LIST_HEAD(&page->lru);
>  	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
> @@ -1491,9 +1500,13 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>  	set_hugetlb_cgroup(page, NULL);
>  	set_hugetlb_cgroup_rsvd(page, NULL);
>  	ClearHPageFreed(page);
> +}
> +
> +static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> +{
> +	__prep_new_huge_page(page);
>  	spin_lock_irq(&hugetlb_lock);
> -	h->nr_huge_pages++;
> -	h->nr_huge_pages_node[nid]++;
> +	__prep_account_new_huge_page(h, nid);
>  	spin_unlock_irq(&hugetlb_lock);
>  }
>  
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages
  2021-04-13 10:47 ` [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
  2021-04-13 13:40   ` Michal Hocko
@ 2021-04-13 22:29   ` Mike Kravetz
  2021-04-14  4:54     ` Oscar Salvador
  2021-04-14 12:26   ` David Hildenbrand
  2 siblings, 1 reply; 40+ messages in thread
From: Mike Kravetz @ 2021-04-13 22:29 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Vlastimil Babka, David Hildenbrand, Michal Hocko, Muchun Song,
	linux-mm, linux-kernel

On 4/13/21 3:47 AM, Oscar Salvador wrote:
> alloc_contig_range will fail if it ever sees a HugeTLB page within the
> range we are trying to allocate, even when that page is free and can be
> easily reallocated.
> This has proved to be problematic for some users of alloc_contic_range,
> e.g: CMA and virtio-mem, where those would fail the call even when those
> pages lay in ZONE_MOVABLE and are free.
> 
> We can do better by trying to replace such page.
> 
> Free hugepages are tricky to handle so as to no userspace application
> notices disruption, we need to replace the current free hugepage with
> a new one.
> 
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugepage, and if it
> succeeds, it will replace the old one in the free hugepage pool.
> 
> The free page replacement is done under hugetlb_lock, so no external
> users of hugetlb will notice the change.
> To allocate the new huge page, we use alloc_buddy_huge_page(), so we
> do not have to deal with any counters, and prep_new_huge_page() is not
> called. This is valulable because in case we need to free the new page,
> we only need to call __free_pages().
> 
> Once we know that the page to be replaced is a genuine 0-refcounted
> huge page, we remove the old page from the freelist by remove_hugetlb_page().
> Then, we can call __prep_new_huge_page() and __prep_account_new_huge_page()
> for the new huge page to properly initialize it and increment the
> hstate->nr_huge_pages counter (previously decremented by
> remove_hugetlb_page()).
> Once done, the page is enqueued by enqueue_huge_page() and it is ready
> to be used.
> 
> There is one tricky case when
> page's refcount is 0 because it is in the process of being released.
> A missing PageHugeFreed bit will tell us that freeing is in flight so
> we retry after dropping the hugetlb_lock. The race window should be
> small and the next retry should make a forward progress.
> 
> E.g:
> 
> CPU0				CPU1
> free_huge_page()		isolate_or_dissolve_huge_page
> 				  PageHuge() == T
> 				  alloc_and_dissolve_huge_page
> 				    alloc_buddy_huge_page()
> 				    spin_lock_irq(hugetlb_lock)
> 				    // PageHuge() && !PageHugeFreed &&
> 				    // !PageCount()
> 				    spin_unlock_irq(hugetlb_lock)
>   spin_lock_irq(hugetlb_lock)
>   1) update_and_free_page
>        PageHuge() == F
>        __free_pages()
>   2) enqueue_huge_page
>        SetPageHugeFreed()
>   spin_unlock(&hugetlb_lock)

Very small nit, the above should be spin_unlock_irq(&hugetlb_lock)

> 				  spin_lock_irq(hugetlb_lock)
>                                    1) PageHuge() == F (freed by case#1 from CPU0)
> 				   2) PageHuge() == T
>                                        PageHugeFreed() == T
>                                        - proceed with replacing the page
> 
> In the case above we retry as the window race is quite small and we have high
> chances to succeed next time.
> 
> With regard to the allocation, we restrict it to the node the page belongs
> to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
> 
> Note that gigantic hugetlb pages are fenced off since there is a cyclic
> dependency between them and alloc_contig_range.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
...
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0607b2b71ac6..4a664d6e82c1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2266,6 +2266,121 @@ static void restore_reserve_on_error(struct hstate *h,
>  	}
>  }
>  
> +/*
> + * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
> + * @h: struct hstate old page belongs to
> + * @old_page: Old page to dissolve
> + * Returns 0 on success, otherwise negated error.
> + */
> +
> +static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
> +{
> +	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> +	int nid = page_to_nid(old_page);
> +	struct page *new_page;
> +	int ret = 0;
> +
> +	/*
> +	 * Before dissolving the page, we need to allocate a new one for the
> +	 * pool to remain stable. Using alloc_buddy_huge_page() allows us to
> +	 * not having to deal with prep_new_page() and avoids dealing of any
> +	 * counters. This simplifies and let us do the whole thing under the
> +	 * lock.
> +	 */
> +	new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
> +	if (!new_page)
> +		return -ENOMEM;
> +
> +retry:
> +	spin_lock_irq(&hugetlb_lock);
> +	if (!PageHuge(old_page)) {
> +		/*
> +		 * Freed from under us. Drop new_page too.
> +		 */
> +		goto free_new;
> +	} else if (page_count(old_page)) {
> +		/*
> +		 * Someone has grabbed the page, fail for now.
> +		 */
> +		ret = -EBUSY;
> +		goto free_new;
> +	} else if (!HPageFreed(old_page)) {
> +		/*
> +		 * Page's refcount is 0 but it has not been enqueued in the
> +		 * freelist yet. Race window is small, so we can succeed here if
> +		 * we retry.
> +		 */
> +		spin_unlock_irq(&hugetlb_lock);
> +		cond_resched();
> +		goto retry;
> +	} else {
> +		/*
> +		 * Ok, old_page is still a genuine free hugepage. Remove it from
> +		 * the freelist and decrease the counters. These will be
> +		 * incremented again when calling __prep_account_new_huge_page()
> +		 * and enqueue_huge_page() for new_page. The counters will remain
> +		 * stable since this happens under the lock.
> +		 */
> +		remove_hugetlb_page(h, old_page, false);
> +
> +		/*
> +		 * Call __prep_new_huge_page() to construct the hugetlb page, and
> +		 * enqueue it then to place it in the freelists. After this,
> +		 * counters are back on track. Free hugepages have a refcount of 0,
> +		 * so we need to decrease new_page's count as well.
> +		 */
> +		__prep_new_huge_page(new_page);
> +		__prep_account_new_huge_page(h, nid);
> +		page_ref_dec(new_page);
> +		enqueue_huge_page(h, new_page);
> +
> +		/*
> +		 * Pages have been replaced, we can safely free the old one.
> +		 */
> +		spin_unlock_irq(&hugetlb_lock);
> +		update_and_free_page(h, old_page);
> +	}
> +
> +	return ret;
> +
> +free_new:
> +	spin_unlock_irq(&hugetlb_lock);
> +	__free_pages(new_page, huge_page_order(h));
> +
> +	return ret;
> +}
> +
> +int isolate_or_dissolve_huge_page(struct page *page)
> +{
> +	struct hstate *h;
> +	struct page *head;
> +
> +	/*
> +	 * The page might have been dissolved from under our feet, so make sure
> +	 * to carefully check the state under the lock.
> +	 * Return success when racing as if we dissolved the page ourselves.
> +	 */
> +	spin_lock_irq(&hugetlb_lock);
> +	if (PageHuge(page)) {
> +		head = compound_head(page);
> +		h = page_hstate(head);
> +	} else {
> +		spin_unlock(&hugetlb_lock);

Should be be spin_unlock_irq(&hugetlb_lock);

Other than that, it looks good.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-04-13 10:47 ` [PATCH v7 6/7] mm: Make alloc_contig_range handle in-use " Oscar Salvador
@ 2021-04-13 22:48   ` Mike Kravetz
  2021-04-14  4:52     ` Oscar Salvador
  0 siblings, 1 reply; 40+ messages in thread
From: Mike Kravetz @ 2021-04-13 22:48 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Vlastimil Babka, David Hildenbrand, Michal Hocko, Muchun Song,
	linux-mm, linux-kernel

On 4/13/21 3:47 AM, Oscar Salvador wrote:
> alloc_contig_range() will fail if it finds a HugeTLB page within the range,
> without a chance to handle them. Since HugeTLB pages can be migrated as any
> LRU or Movable page, it does not make sense to bail out without trying.
> Enable the interface to recognize in-use HugeTLB pages so we can migrate
> them, and have much better chances to succeed the call.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

One small issue/question/request below.

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4a664d6e82c1..24a453ff47f2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2270,10 +2270,12 @@ static void restore_reserve_on_error(struct hstate *h,
>   * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
>   * @h: struct hstate old page belongs to
>   * @old_page: Old page to dissolve
> + * @list: List to isolate the page in case we need to
>   * Returns 0 on success, otherwise negated error.
>   */
>  
> -static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
> +static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
> +					struct list_head *list)
>  {
>  	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>  	int nid = page_to_nid(old_page);
> @@ -2300,9 +2302,13 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
>  		goto free_new;
>  	} else if (page_count(old_page)) {
>  		/*
> -		 * Someone has grabbed the page, fail for now.
> +		 * Someone has grabbed the page, try to isolate it here.
> +		 * Fail with -EBUSY if not possible.
>  		 */
> -		ret = -EBUSY;
> +		spin_unlock_irq(&hugetlb_lock);
> +		if (!isolate_huge_page(old_page, list))
> +			ret = -EBUSY;
> +		spin_lock_irq(&hugetlb_lock);
>  		goto free_new;

The label free_new is:

free_new:
        spin_unlock_irq(&hugetlb_lock);
        __free_pages(new_page, huge_page_order(h));

        return ret;

So, we are locking and immediately unlocking without any code in
between.  Usually, I don't like like multiple labels before return.
However, perhaps we should add another to avoid this unnecessary
cycle.  On the other hand, this is an uncommon race condition so the
simple code may be acceptable.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig
  2021-04-13 10:47 ` [PATCH v7 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig Oscar Salvador
@ 2021-04-13 22:53   ` Mike Kravetz
  0 siblings, 0 replies; 40+ messages in thread
From: Mike Kravetz @ 2021-04-13 22:53 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Vlastimil Babka, David Hildenbrand, Michal Hocko, Muchun Song,
	linux-mm, linux-kernel

On 4/13/21 3:47 AM, Oscar Salvador wrote:
> pfn_range_valid_contig() bails out when it finds an in-use page or a
> hugetlb page, among other things.
> We can drop the in-use page check since __alloc_contig_pages can migrate
> away those pages, and the hugetlb page check can go too since
> isolate_migratepages_range is now capable of dealing with hugetlb pages.
> Either way, those checks are racy so let the end function handle it
> when the time comes.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/page_alloc.c | 6 ------
>  1 file changed, 6 deletions(-)

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-04-13 22:48   ` Mike Kravetz
@ 2021-04-14  4:52     ` Oscar Salvador
  2021-04-14 17:07       ` Mike Kravetz
  0 siblings, 1 reply; 40+ messages in thread
From: Oscar Salvador @ 2021-04-14  4:52 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Michal Hocko,
	Muchun Song, linux-mm, linux-kernel

On Tue, Apr 13, 2021 at 03:48:53PM -0700, Mike Kravetz wrote:
> The label free_new is:
> 
> free_new:
>         spin_unlock_irq(&hugetlb_lock);
>         __free_pages(new_page, huge_page_order(h));
> 
>         return ret;
> 
> So, we are locking and immediately unlocking without any code in
> between.  Usually, I don't like like multiple labels before return.
> However, perhaps we should add another to avoid this unnecessary
> cycle.  On the other hand, this is an uncommon race condition so the
> simple code may be acceptable.

I guess we could have something like:

 free_new:
         spin_unlock_irq(&hugetlb_lock);
 free_new_nolock:
         __free_pages(new_page, huge_page_order(h));
 
         return ret;

And let the retry go to there without locking. But as you said, the
racecondition is rare enough, so I am not sure if this buys us much.
But I can certainly add it if you feel strong about it.


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages
  2021-04-13 22:29   ` Mike Kravetz
@ 2021-04-14  4:54     ` Oscar Salvador
  0 siblings, 0 replies; 40+ messages in thread
From: Oscar Salvador @ 2021-04-14  4:54 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Michal Hocko,
	Muchun Song, linux-mm, linux-kernel

On Tue, Apr 13, 2021 at 03:29:02PM -0700, Mike Kravetz wrote:
> >   spin_lock_irq(hugetlb_lock)
> >   1) update_and_free_page
> >        PageHuge() == F
> >        __free_pages()
> >   2) enqueue_huge_page
> >        SetPageHugeFreed()
> >   spin_unlock(&hugetlb_lock)
> 
> Very small nit, the above should be spin_unlock_irq(&hugetlb_lock)

Right, I missed it somehow.

> > +	/*
> > +	 * The page might have been dissolved from under our feet, so make sure
> > +	 * to carefully check the state under the lock.
> > +	 * Return success when racing as if we dissolved the page ourselves.
> > +	 */
> > +	spin_lock_irq(&hugetlb_lock);
> > +	if (PageHuge(page)) {
> > +		head = compound_head(page);
> > +		h = page_hstate(head);
> > +	} else {
> > +		spin_unlock(&hugetlb_lock);
> 
> Should be be spin_unlock_irq(&hugetlb_lock);
> 
> Other than that, it looks good.

Yeah, I will amend it in the next version.

Thanks Mike!


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality
  2021-04-13 21:33   ` Mike Kravetz
@ 2021-04-14  4:59     ` Oscar Salvador
  2021-04-14 12:15       ` David Hildenbrand
  2021-04-14 17:03       ` Mike Kravetz
  0 siblings, 2 replies; 40+ messages in thread
From: Oscar Salvador @ 2021-04-14  4:59 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Michal Hocko,
	Muchun Song, linux-mm, linux-kernel

On Tue, Apr 13, 2021 at 02:33:41PM -0700, Mike Kravetz wrote:
> > -static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> > +/*
> > + * Must be called with the hugetlb lock held
> > + */
> > +static void __prep_account_new_huge_page(struct hstate *h, int nid)
> > +{
> > +	h->nr_huge_pages++;
> > +	h->nr_huge_pages_node[nid]++;
> 
> I would prefer if we also move setting the destructor to this routine.
> 	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);

Uhm, but that is the routine that does the accounting, it feels wrong
here, plus...

> 
> That way, PageHuge() will be false until it 'really' is a huge page.
> If not, we could potentially go into that retry loop in
> dissolve_free_huge_page or alloc_and_dissolve_huge_page in patch 5.

...I do not follow here, could you please elaborate some more?
Unless I am missing something, behaviour should not be any different with this
patch.

Thanks


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
  2021-04-13 21:19     ` Mike Kravetz
@ 2021-04-14  6:04       ` Michal Hocko
  2021-04-14  7:41         ` Oscar Salvador
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2021-04-14  6:04 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Oscar Salvador, Andrew Morton, Vlastimil Babka,
	David Hildenbrand, Muchun Song, linux-mm, linux-kernel

On Tue 13-04-21 14:19:03, Mike Kravetz wrote:
> On 4/13/21 6:23 AM, Michal Hocko wrote:
> > On Tue 13-04-21 12:47:43, Oscar Salvador wrote:
[...]
> > Or do we need it for giga pages which are not allocated by the page
> > allocator? If yes then moving it to prep_compound_gigantic_page would be
> > better.
> 
> I am pretty sure dynamically allocated giga pages have page->Private
> cleared as well.  It is not obvious, but the alloc_contig_range code
> used to put together giga pages will end up calling isolate_freepages_range
> that will call split_map_pages and then post_alloc_hook for each MAX_ORDER
> block.

Thanks for saving me from crawling that code.

> As mentioned, this is not obvious and I would hate to rely on this
> behavior not changing.

Thinking about it some more, having some (page granularity) allocator
not clearing page private would be a serious problem for anybody relying
on its state. So I am not sure this can change.
 
> > So should we just drop it here?
> 
> The only place where page->private may not be initialized is when we do
> allocations at boot time from memblock.  In this case, we will add the
> pages to the free list via put_page/free_huge_page so the appropriate
> flags will be cleared before anyone notices.

Pages allocated by the bootmem should be pre initialized from the boot,
no?

> I'm wondering if we should just do a set_page_private(page, 0) here in
> prep_new_huge_page since we now use that field for flags.  Or, is that
> overkill?

I would rather not duplicate the work done by underlying allocators. I
do not think other users of the allocator want to do the same so why
should hugetlb be any different.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
  2021-04-14  6:04       ` Michal Hocko
@ 2021-04-14  7:41         ` Oscar Salvador
  2021-04-14  8:28           ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Oscar Salvador @ 2021-04-14  7:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, Vlastimil Babka, David Hildenbrand,
	Muchun Song, linux-mm, linux-kernel

On Wed, Apr 14, 2021 at 08:04:21AM +0200, Michal Hocko wrote:
> On Tue 13-04-21 14:19:03, Mike Kravetz wrote:
> > On 4/13/21 6:23 AM, Michal Hocko wrote:
> > The only place where page->private may not be initialized is when we do
> > allocations at boot time from memblock.  In this case, we will add the
> > pages to the free list via put_page/free_huge_page so the appropriate
> > flags will be cleared before anyone notices.
> 
> Pages allocated by the bootmem should be pre initialized from the boot,
> no?

I guess Mike means:

hugetlb_hstate_alloc_pages
 alloc_bootmem_huge_page
  __alloc_bootmem_huge_page
   memblock_alloc_try_nid_raw

and AFAICS, memblock_alloc_try_nid_raw() does not zero the memory.

Then these pages are initialized in:

gather_bootmem_prealloc
 prep_compound_huge_page
 prep_new_huge_page

But as can be noticed, no one touches page->private when coming from that
path.

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
  2021-04-14  7:41         ` Oscar Salvador
@ 2021-04-14  8:28           ` Michal Hocko
  2021-04-14 10:01             ` Oscar Salvador
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2021-04-14  8:28 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Mike Kravetz, Andrew Morton, Vlastimil Babka, David Hildenbrand,
	Muchun Song, linux-mm, linux-kernel

On Wed 14-04-21 09:41:32, Oscar Salvador wrote:
> On Wed, Apr 14, 2021 at 08:04:21AM +0200, Michal Hocko wrote:
> > On Tue 13-04-21 14:19:03, Mike Kravetz wrote:
> > > On 4/13/21 6:23 AM, Michal Hocko wrote:
> > > The only place where page->private may not be initialized is when we do
> > > allocations at boot time from memblock.  In this case, we will add the
> > > pages to the free list via put_page/free_huge_page so the appropriate
> > > flags will be cleared before anyone notices.
> > 
> > Pages allocated by the bootmem should be pre initialized from the boot,
> > no?
> 
> I guess Mike means:
> 
> hugetlb_hstate_alloc_pages
>  alloc_bootmem_huge_page
>   __alloc_bootmem_huge_page
>    memblock_alloc_try_nid_raw
> 
> and AFAICS, memblock_alloc_try_nid_raw() does not zero the memory.

You are right it doesn't do it there. But all struct pages, even those
that are allocated by the bootmem allocator should initialize its struct
pages. They would be poisoned otherwise, right? I would have to look at
the exact code path but IIRC this should be around the time bootmem
allocator state transitions to the page allocator.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
  2021-04-14  8:28           ` Michal Hocko
@ 2021-04-14 10:01             ` Oscar Salvador
  2021-04-14 10:03               ` Oscar Salvador
  2021-04-14 10:32               ` Michal Hocko
  0 siblings, 2 replies; 40+ messages in thread
From: Oscar Salvador @ 2021-04-14 10:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, Vlastimil Babka, David Hildenbrand,
	Muchun Song, linux-mm, linux-kernel

On Wed, Apr 14, 2021 at 10:28:33AM +0200, Michal Hocko wrote:
> You are right it doesn't do it there. But all struct pages, even those
> that are allocated by the bootmem allocator should initialize its struct
> pages. They would be poisoned otherwise, right? I would have to look at
> the exact code path but IIRC this should be around the time bootmem
> allocator state transitions to the page allocator.

Ok, you are right.
struct pages are initialized a bit earlier through:

start_kernel
 setup_arch
  paging_init
   zone_sizes_init
    free_area_init
     free_area_init_node
      free_area_init_core
       memmap_init_zone
        memmap_init_range
         __init_single_page

While the allocation of bootmem hugetlb happens

start_kernel
 parse_args
  ...
   hugepages_setup
    ...
     hugetlb_hstate_alloc_pages
      __alloc_bootmem_huge_page

which is after the setup_arch() call.

So by the time we get the page from __alloc_bootmem_huge_page(), fields are
zeroed.
I thought we might get in trouble because memblock_alloc_try_nid_raw() calls
page_init_poison() which poisons the chunk with 0xff,e.g:

[    1.955471] boot: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
[    1.955476] boot: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff

 but it seems that does not the memmap struct page.

I checked, and when we get there in __alloc_bootmem_huge_page, page->private is
still zeroed, so I guess it should be safe to assume that we do not really need
to clear the flag in __prep_new_huge_page() routine?

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
  2021-04-14 10:01             ` Oscar Salvador
@ 2021-04-14 10:03               ` Oscar Salvador
  2021-04-14 10:32               ` Michal Hocko
  1 sibling, 0 replies; 40+ messages in thread
From: Oscar Salvador @ 2021-04-14 10:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, Vlastimil Babka, David Hildenbrand,
	Muchun Song, linux-mm, linux-kernel

On Wed, Apr 14, 2021 at 12:01:47PM +0200, Oscar Salvador wrote:
>  but it seems that does not the memmap struct page.
that sould read as "but it seems that that does not affect the memmap struct page"
 

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
  2021-04-14 10:01             ` Oscar Salvador
  2021-04-14 10:03               ` Oscar Salvador
@ 2021-04-14 10:32               ` Michal Hocko
  2021-04-14 10:49                 ` Oscar Salvador
  1 sibling, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2021-04-14 10:32 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Mike Kravetz, Andrew Morton, Vlastimil Babka, David Hildenbrand,
	Muchun Song, linux-mm, linux-kernel

On Wed 14-04-21 12:01:47, Oscar Salvador wrote:
> On Wed, Apr 14, 2021 at 10:28:33AM +0200, Michal Hocko wrote:
> > You are right it doesn't do it there. But all struct pages, even those
> > that are allocated by the bootmem allocator should initialize its struct
> > pages. They would be poisoned otherwise, right? I would have to look at
> > the exact code path but IIRC this should be around the time bootmem
> > allocator state transitions to the page allocator.
> 
> Ok, you are right.
> struct pages are initialized a bit earlier through:
> 
> start_kernel
>  setup_arch
>   paging_init
>    zone_sizes_init
>     free_area_init
>      free_area_init_node
>       free_area_init_core
>        memmap_init_zone
>         memmap_init_range
>          __init_single_page
> 
> While the allocation of bootmem hugetlb happens
> 
> start_kernel
>  parse_args
>   ...
>    hugepages_setup
>     ...
>      hugetlb_hstate_alloc_pages
>       __alloc_bootmem_huge_page
> 
> which is after the setup_arch() call.

Thanks for pulling those paths. It is always painful to crawl that code.

> So by the time we get the page from __alloc_bootmem_huge_page(), fields are
> zeroed.
> I thought we might get in trouble because memblock_alloc_try_nid_raw() calls
> page_init_poison() which poisons the chunk with 0xff,e.g:
> 
> [    1.955471] boot: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
> [    1.955476] boot: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
> 
>  but it seems that does not the memmap struct page.

Well, to be precise it does the very same thing with memamp struct pages
but that is before the initialization code you have pointed out above.
In this context it just poisons the allocated content which is the GB
page storage.

> I checked, and when we get there in __alloc_bootmem_huge_page, page->private is
> still zeroed, so I guess it should be safe to assume that we do not really need
> to clear the flag in __prep_new_huge_page() routine?

It would be quite nasty if the struct pages content would be undefined.
Maybe that is possible but then I would rather stick the initialization
into __alloc_bootmem_huge_page.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
  2021-04-14 10:32               ` Michal Hocko
@ 2021-04-14 10:49                 ` Oscar Salvador
  2021-04-14 11:09                   ` Michal Hocko
  2021-04-14 16:45                   ` Mike Kravetz
  0 siblings, 2 replies; 40+ messages in thread
From: Oscar Salvador @ 2021-04-14 10:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Andrew Morton, Vlastimil Babka, David Hildenbrand,
	Muchun Song, linux-mm, linux-kernel

On Wed, Apr 14, 2021 at 12:32:58PM +0200, Michal Hocko wrote:
> Well, to be precise it does the very same thing with memamp struct pages
> but that is before the initialization code you have pointed out above.
> In this context it just poisons the allocated content which is the GB
> page storage.

Right.

> > I checked, and when we get there in __alloc_bootmem_huge_page, page->private is
> > still zeroed, so I guess it should be safe to assume that we do not really need
> > to clear the flag in __prep_new_huge_page() routine?
> 
> It would be quite nasty if the struct pages content would be undefined.
> Maybe that is possible but then I would rather stick the initialization
> into __alloc_bootmem_huge_page.

Yes, but I do not think that is really possible unless I missed something.
Let us see what Mike thinks of it, if there are no objections, we can
get rid of the clearing flag right there.
 

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
  2021-04-14 10:49                 ` Oscar Salvador
@ 2021-04-14 11:09                   ` Michal Hocko
  2021-04-14 12:02                     ` David Hildenbrand
  2021-04-14 16:45                   ` Mike Kravetz
  1 sibling, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2021-04-14 11:09 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Mike Kravetz, Andrew Morton, Vlastimil Babka, David Hildenbrand,
	Muchun Song, linux-mm, linux-kernel

On Wed 14-04-21 12:49:53, Oscar Salvador wrote:
> On Wed, Apr 14, 2021 at 12:32:58PM +0200, Michal Hocko wrote:
[...]
> > > I checked, and when we get there in __alloc_bootmem_huge_page, page->private is
> > > still zeroed, so I guess it should be safe to assume that we do not really need
> > > to clear the flag in __prep_new_huge_page() routine?
> > 
> > It would be quite nasty if the struct pages content would be undefined.
> > Maybe that is possible but then I would rather stick the initialization
> > into __alloc_bootmem_huge_page.
> 
> Yes, but I do not think that is really possible unless I missed something.

Yeah, it should be fine. I was thinking of a alloc, modify struct pages,
free back to the bootmem allocator sequence. But I do not remember ever
seeing sequence like that. Bootmem allocator users tend to be simple,
allocate storage and either retain it for the life time. Other than
PageReserved bit they do not touch metadata. If we want to be paranoid
then we can add VM_WARN_ON for unexpected state when allocating from the
bootmem. But I am not sure this is really worth it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes
  2021-04-13 10:47 ` [PATCH v7 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes Oscar Salvador
  2021-04-13 17:52   ` Mike Kravetz
@ 2021-04-14 11:54   ` David Hildenbrand
  2021-04-15  8:42     ` Oscar Salvador
  1 sibling, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2021-04-14 11:54 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Mike Kravetz, Vlastimil Babka, Michal Hocko, Muchun Song,
	linux-mm, linux-kernel

On 13.04.21 12:47, Oscar Salvador wrote:
> Currently, isolate_migratepages_{range,block} and their callers use
> a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was
> any error during isolation.
> This does not work as soon as we need to start reporting different error
> codes and make sure we pass them down the chain, so they are properly
> interpreted by functions like e.g: alloc_contig_range.
> 
> Let us rework isolate_migratepages_{range,block} so we can report error
> codes.
> Since isolate_migratepages_block will stop returning the next pfn to be
> scanned, we reuse the cc->migrate_pfn field to keep track of that.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>   mm/compaction.c | 54 +++++++++++++++++++++++++++---------------------------
>   mm/internal.h   | 10 ++++++++--
>   mm/page_alloc.c |  7 +++----
>   3 files changed, 38 insertions(+), 33 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 8c5028bfbd56..eeba4668c22c 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -787,15 +787,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>    *
>    * Isolate all pages that can be migrated from the range specified by
>    * [low_pfn, end_pfn). The range is expected to be within same pageblock.
> - * Returns zero if there is a fatal signal pending, otherwise PFN of the
> - * first page that was not scanned (which may be both less, equal to or more
> - * than end_pfn).
> + * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion,
> + * or 0.
> + * cc->migrate_pfn will contain the next pfn to scan (which may be both less,
> + * equal to or more that end_pfn).

I failed to parse the last sentence -- e.g., using "both" and then 
listing three options. Also, s/than/than/? Can we simplify to

"cc->migrate_pfn will contain the next pfn to scan"

>    *
>    * The pages are isolated on cc->migratepages list (not required to be empty),
> - * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
> - * is neither read nor updated.
> + * and cc->nr_migratepages is updated accordingly.
>    */
> -static unsigned long
> +static int
>   isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   			unsigned long end_pfn, isolate_mode_t isolate_mode)
>   {
> @@ -809,6 +809,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   	bool skip_on_failure = false;
>   	unsigned long next_skip_pfn = 0;
>   	bool skip_updated = false;
> +	int ret = 0;
> +
> +	cc->migrate_pfn = low_pfn;
>   
>   	/*
>   	 * Ensure that there are not too many pages isolated from the LRU
> @@ -818,16 +821,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   	while (unlikely(too_many_isolated(pgdat))) {
>   		/* stop isolation if there are still pages not migrated */
>   		if (cc->nr_migratepages)
> -			return 0;
> +			return -EAGAIN;
>   
>   		/* async migration should just abort */
>   		if (cc->mode == MIGRATE_ASYNC)
> -			return 0;
> +			return -EAGAIN;
>   
>   		congestion_wait(BLK_RW_ASYNC, HZ/10);
>   
>   		if (fatal_signal_pending(current))
> -			return 0;
> +			return -EINTR;
>   	}
>   
>   	cond_resched();
> @@ -875,8 +878,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   
>   			if (fatal_signal_pending(current)) {
>   				cc->contended = true;
> +				ret = -EINTR;
>   
> -				low_pfn = 0;
>   				goto fatal_pending;
>   			}
>   
> @@ -1130,7 +1133,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   	if (nr_isolated)
>   		count_compact_events(COMPACTISOLATED, nr_isolated);
>   
> -	return low_pfn;
> +	cc->migrate_pfn = low_pfn;
> +
> +	return ret;
>   }
>   
>   /**
> @@ -1139,15 +1144,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>    * @start_pfn: The first PFN to start isolating.
>    * @end_pfn:   The one-past-last PFN.
>    *
> - * Returns zero if isolation fails fatally due to e.g. pending signal.
> - * Otherwise, function returns one-past-the-last PFN of isolated page
> - * (which may be greater than end_pfn if end fell in a middle of a THP page).
> + * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion,

errno is usually positive.

> + * or 0.

I'd be more specific here. Something like

"
Returns 0 if isolation succeeded. Returns -EINTR if a fatal signal is 
pending. Returns -EAGAIN when contended and the caller should retry.

In any case, cc->migrate_pfn is set to one-past-the-last PFN that has 
been isolated.
"

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
  2021-04-14 11:09                   ` Michal Hocko
@ 2021-04-14 12:02                     ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2021-04-14 12:02 UTC (permalink / raw)
  To: Michal Hocko, Oscar Salvador
  Cc: Mike Kravetz, Andrew Morton, Vlastimil Babka, Muchun Song,
	linux-mm, linux-kernel

On 14.04.21 13:09, Michal Hocko wrote:
> On Wed 14-04-21 12:49:53, Oscar Salvador wrote:
>> On Wed, Apr 14, 2021 at 12:32:58PM +0200, Michal Hocko wrote:
> [...]
>>>> I checked, and when we get there in __alloc_bootmem_huge_page, page->private is
>>>> still zeroed, so I guess it should be safe to assume that we do not really need
>>>> to clear the flag in __prep_new_huge_page() routine?
>>>
>>> It would be quite nasty if the struct pages content would be undefined.
>>> Maybe that is possible but then I would rather stick the initialization
>>> into __alloc_bootmem_huge_page.
>>
>> Yes, but I do not think that is really possible unless I missed something.
> 
> Yeah, it should be fine. I was thinking of a alloc, modify struct pages,
> free back to the bootmem allocator sequence. But I do not remember ever
> seeing sequence like that.

We do have code like that, though.

Take a look at arch/x86/mm/init_64.c:free_pagetable() as one example.

But in general, we assume freeing code (buddy, but also memblock) 
restores the state of the memmap to the original state it obtained the 
memmap. So if it's properly initialized when coming from the allocator 
the first time, it should remain properly initialized when re-entering 
and re-leaving the allocator.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality
  2021-04-14  4:59     ` Oscar Salvador
@ 2021-04-14 12:15       ` David Hildenbrand
  2021-04-14 17:03       ` Mike Kravetz
  1 sibling, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2021-04-14 12:15 UTC (permalink / raw)
  To: Oscar Salvador, Mike Kravetz
  Cc: Andrew Morton, Vlastimil Babka, Michal Hocko, Muchun Song,
	linux-mm, linux-kernel

On 14.04.21 06:59, Oscar Salvador wrote:
> On Tue, Apr 13, 2021 at 02:33:41PM -0700, Mike Kravetz wrote:
>>> -static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>>> +/*
>>> + * Must be called with the hugetlb lock held
>>> + */
>>> +static void __prep_account_new_huge_page(struct hstate *h, int nid)
>>> +{
>>> +	h->nr_huge_pages++;
>>> +	h->nr_huge_pages_node[nid]++;
>>
>> I would prefer if we also move setting the destructor to this routine.
>> 	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
> 
> Uhm, but that is the routine that does the accounting, it feels wrong
> here, plus...

I agree. If we want the final activation separately, it might be better 
to have it as a separate step/function like __active_new_huge_page().

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages
  2021-04-13 10:47 ` [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
  2021-04-13 13:40   ` Michal Hocko
  2021-04-13 22:29   ` Mike Kravetz
@ 2021-04-14 12:26   ` David Hildenbrand
  2021-04-15  8:28     ` Oscar Salvador
  2 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2021-04-14 12:26 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Mike Kravetz, Vlastimil Babka, Michal Hocko, Muchun Song,
	linux-mm, linux-kernel

> +static inline int isolate_or_dissolve_huge_page(struct page *page)
> +{
> +	return -ENOMEM;

Without CONFIG_HUGETLB_PAGE, there is no way someone could possible pass 
in something valid. Although it doesn't matter too much, -EINVAL or 
similar sounds a bit better.

> +}
> +
>   static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
>   					   unsigned long addr,
>   					   int avoid_reserve)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index eeba4668c22c..89426b6d1ea3 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -788,7 +788,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>    * Isolate all pages that can be migrated from the range specified by
>    * [low_pfn, end_pfn). The range is expected to be within same pageblock.
>    * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion,
> - * or 0.
> + * -ENOMEM in case we could not allocate a page, or 0.
>    * cc->migrate_pfn will contain the next pfn to scan (which may be both less,
>    * equal to or more that end_pfn).
>    *
> @@ -809,6 +809,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   	bool skip_on_failure = false;
>   	unsigned long next_skip_pfn = 0;
>   	bool skip_updated = false;
> +	bool fatal_error = false;

Can't we use "ret == -ENOMEM" instead of fatal_error?

>   	int ret = 0;
>   
>   	cc->migrate_pfn = low_pfn;
> @@ -907,6 +908,33 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   			valid_page = page;
>   		}
>   
> +		if (PageHuge(page) && cc->alloc_contig) {
> +			ret = isolate_or_dissolve_huge_page(page);
> +
> +			/*
> +			 * Fail isolation in case isolate_or_dissolve_huge_page
> +			 * reports an error. In case of -ENOMEM, abort right away.
> +			 */
> +			if (ret < 0) {
> +				/*
> +				 * Do not report -EBUSY down the chain.
> +				 */
> +				if (ret == -ENOMEM)
> +					fatal_error = true;
> +				else
> +					ret = 0;
> +				low_pfn += (1UL << compound_order(page)) - 1;
> +				goto isolate_fail;
> +			}
> +
> +			/*
> +			 * Ok, the hugepage was dissolved. Now these pages are
> +			 * Buddy and cannot be re-allocated because they are
> +			 * isolated. Fall-through as the check below handles
> +			 * Buddy pages.
> +			 */
> +		}
> +
>   		/*
>   		 * Skip if free. We read page order here without zone lock
>   		 * which is generally unsafe, but the race window is small and
> @@ -1066,7 +1094,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   		put_page(page);
>   
>   isolate_fail:
> -		if (!skip_on_failure)
> +		if (!skip_on_failure && !fatal_error)
>   			continue;
>   
>   		/*
> @@ -1092,6 +1120,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   			 */
>   			next_skip_pfn += 1UL << cc->order;
>   		}
> +
> +		if (fatal_error)
> +			break;
>   	}
>   
>   	/*
> @@ -1145,7 +1176,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>    * @end_pfn:   The one-past-last PFN.
>    *
>    * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion,
> - * or 0.
> + * -ENOMEM in case we could not allocate a page, or 0.
>    */
>   int
>   isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0607b2b71ac6..4a664d6e82c1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2266,6 +2266,121 @@ static void restore_reserve_on_error(struct hstate *h,
>   	}
>   }
>   
> +/*
> + * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
> + * @h: struct hstate old page belongs to
> + * @old_page: Old page to dissolve
> + * Returns 0 on success, otherwise negated error.
> + */
> +
> +static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
> +{
> +	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> +	int nid = page_to_nid(old_page);
> +	struct page *new_page;
> +	int ret = 0;
> +
> +	/*
> +	 * Before dissolving the page, we need to allocate a new one for the
> +	 * pool to remain stable. Using alloc_buddy_huge_page() allows us to
> +	 * not having to deal with prep_new_page() and avoids dealing of any
> +	 * counters. This simplifies and let us do the whole thing under the
> +	 * lock.
> +	 */
> +	new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
> +	if (!new_page)
> +		return -ENOMEM;
> +
> +retry:
> +	spin_lock_irq(&hugetlb_lock);
> +	if (!PageHuge(old_page)) {
> +		/*
> +		 * Freed from under us. Drop new_page too.
> +		 */
> +		goto free_new;
> +	} else if (page_count(old_page)) {
> +		/*
> +		 * Someone has grabbed the page, fail for now.
> +		 */
> +		ret = -EBUSY;
> +		goto free_new;
> +	} else if (!HPageFreed(old_page)) {
> +		/*
> +		 * Page's refcount is 0 but it has not been enqueued in the
> +		 * freelist yet. Race window is small, so we can succeed here if
> +		 * we retry.
> +		 */
> +		spin_unlock_irq(&hugetlb_lock);
> +		cond_resched();
> +		goto retry;
> +	} else {
> +		/*
> +		 * Ok, old_page is still a genuine free hugepage. Remove it from
> +		 * the freelist and decrease the counters. These will be
> +		 * incremented again when calling __prep_account_new_huge_page()
> +		 * and enqueue_huge_page() for new_page. The counters will remain
> +		 * stable since this happens under the lock.
> +		 */
> +		remove_hugetlb_page(h, old_page, false);
> +
> +		/*
> +		 * Call __prep_new_huge_page() to construct the hugetlb page, and
> +		 * enqueue it then to place it in the freelists. After this,
> +		 * counters are back on track. Free hugepages have a refcount of 0,
> +		 * so we need to decrease new_page's count as well.
> +		 */
> +		__prep_new_huge_page(new_page);
> +		__prep_account_new_huge_page(h, nid);
> +		page_ref_dec(new_page);
> +		enqueue_huge_page(h, new_page);
> +
> +		/*
> +		 * Pages have been replaced, we can safely free the old one.
> +		 */
> +		spin_unlock_irq(&hugetlb_lock);
> +		update_and_free_page(h, old_page);
> +	}
> +
> +	return ret;
> +
> +free_new:
> +	spin_unlock_irq(&hugetlb_lock);
> +	__free_pages(new_page, huge_page_order(h));
> +
> +	return ret;
> +}
> +
> +int isolate_or_dissolve_huge_page(struct page *page)
> +{
> +	struct hstate *h;
> +	struct page *head;
> +
> +	/*
> +	 * The page might have been dissolved from under our feet, so make sure
> +	 * to carefully check the state under the lock.
> +	 * Return success when racing as if we dissolved the page ourselves.
> +	 */
> +	spin_lock_irq(&hugetlb_lock);
> +	if (PageHuge(page)) {
> +		head = compound_head(page);
> +		h = page_hstate(head);
> +	} else {
> +		spin_unlock(&hugetlb_lock);
> +		return 0;
> +	}
> +	spin_unlock_irq(&hugetlb_lock);
> +
> +	/*
> +	 * Fence off gigantic pages as there is a cyclic dependency between
> +	 * alloc_contig_range and them. Return -ENOME as this has the effect

s/-ENOME/-ENOMEM/

> +	 * of bailing out right away without further retrying.
> +	 */
> +	if (hstate_is_gigantic(h))
> +		return -ENOMEM;
> +
> +	return alloc_and_dissolve_huge_page(h, head);
> +}
> +
>   struct page *alloc_huge_page(struct vm_area_struct *vma,
>   				    unsigned long addr, int avoid_reserve)
>   {
> 

Complicated stuff, but looks good to me.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
  2021-04-14 10:49                 ` Oscar Salvador
  2021-04-14 11:09                   ` Michal Hocko
@ 2021-04-14 16:45                   ` Mike Kravetz
  1 sibling, 0 replies; 40+ messages in thread
From: Mike Kravetz @ 2021-04-14 16:45 UTC (permalink / raw)
  To: Oscar Salvador, Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On 4/14/21 3:49 AM, Oscar Salvador wrote:
> On Wed, Apr 14, 2021 at 12:32:58PM +0200, Michal Hocko wrote:
>> Well, to be precise it does the very same thing with memamp struct pages
>> but that is before the initialization code you have pointed out above.
>> In this context it just poisons the allocated content which is the GB
>> page storage.
> 
> Right.
> 
>>> I checked, and when we get there in __alloc_bootmem_huge_page, page->private is
>>> still zeroed, so I guess it should be safe to assume that we do not really need
>>> to clear the flag in __prep_new_huge_page() routine?
>>
>> It would be quite nasty if the struct pages content would be undefined.
>> Maybe that is possible but then I would rather stick the initialization
>> into __alloc_bootmem_huge_page.
> 
> Yes, but I do not think that is really possible unless I missed something.
> Let us see what Mike thinks of it, if there are no objections, we can
> get rid of the clearing flag right there.
>  

Thanks for crawling through that code Oscar!

I do not think you missed anything.  Let's just get rid of the flag
clearing.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality
  2021-04-14  4:59     ` Oscar Salvador
  2021-04-14 12:15       ` David Hildenbrand
@ 2021-04-14 17:03       ` Mike Kravetz
  1 sibling, 0 replies; 40+ messages in thread
From: Mike Kravetz @ 2021-04-14 17:03 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Michal Hocko,
	Muchun Song, linux-mm, linux-kernel

On 4/13/21 9:59 PM, Oscar Salvador wrote:
> On Tue, Apr 13, 2021 at 02:33:41PM -0700, Mike Kravetz wrote:
>>> -static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>>> +/*
>>> + * Must be called with the hugetlb lock held
>>> + */
>>> +static void __prep_account_new_huge_page(struct hstate *h, int nid)
>>> +{
>>> +	h->nr_huge_pages++;
>>> +	h->nr_huge_pages_node[nid]++;
>>
>> I would prefer if we also move setting the destructor to this routine.
>> 	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
> 
> Uhm, but that is the routine that does the accounting, it feels wrong
> here, plus...
> 
>>
>> That way, PageHuge() will be false until it 'really' is a huge page.
>> If not, we could potentially go into that retry loop in
>> dissolve_free_huge_page or alloc_and_dissolve_huge_page in patch 5.
> 
> ...I do not follow here, could you please elaborate some more?
> Unless I am missing something, behaviour should not be any different with this
> patch.
> 

I was thinking of the time between the call to __prep_new_huge_page and
__prep_account_new_huge_page.  In that time, PageHuge() will be true but
the page is not yet fully being managed as a hugetlb page.  My thought
was that isolation, migration, offline or any code that does pfn
scanning might the page as PageHuge() (after taking lock) and start to
process it.

Now I see that in patch 5 you call both __prep_new_huge_page and
__prep_account_new_huge_page with the lock held.  So, no issue.  Sorry.

I 'think' there may still be a potential race with the prep_new_huge_page
routine, but that existed before any of your changes.  It may also be
that 'synchronization' at the pageblock level which prevents some of
these pfn scanning operations to operate on the same pageblocks may
prevent this from ever happening.

Mostly thinking out loud.  Upon further thought, I have no objections to
this change.  Sorry for the noise.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-04-14  4:52     ` Oscar Salvador
@ 2021-04-14 17:07       ` Mike Kravetz
  0 siblings, 0 replies; 40+ messages in thread
From: Mike Kravetz @ 2021-04-14 17:07 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Michal Hocko,
	Muchun Song, linux-mm, linux-kernel

On 4/13/21 9:52 PM, Oscar Salvador wrote:
> On Tue, Apr 13, 2021 at 03:48:53PM -0700, Mike Kravetz wrote:
>> The label free_new is:
>>
>> free_new:
>>         spin_unlock_irq(&hugetlb_lock);
>>         __free_pages(new_page, huge_page_order(h));
>>
>>         return ret;
>>
>> So, we are locking and immediately unlocking without any code in
>> between.  Usually, I don't like like multiple labels before return.
>> However, perhaps we should add another to avoid this unnecessary
>> cycle.  On the other hand, this is an uncommon race condition so the
>> simple code may be acceptable.
> 
> I guess we could have something like:
> 
>  free_new:
>          spin_unlock_irq(&hugetlb_lock);
>  free_new_nolock:
>          __free_pages(new_page, huge_page_order(h));
>  
>          return ret;
> 
> And let the retry go to there without locking. But as you said, the
> racecondition is rare enough, so I am not sure if this buys us much.
> But I can certainly add it if you feel strong about it.

No strong feelings.  I am fine with it as is.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages
  2021-04-14 12:26   ` David Hildenbrand
@ 2021-04-15  8:28     ` Oscar Salvador
  0 siblings, 0 replies; 40+ messages in thread
From: Oscar Salvador @ 2021-04-15  8:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Mike Kravetz, Vlastimil Babka, Michal Hocko,
	Muchun Song, linux-mm, linux-kernel

On Wed, Apr 14, 2021 at 02:26:21PM +0200, David Hildenbrand wrote:
> > +static inline int isolate_or_dissolve_huge_page(struct page *page)
> > +{
> > +	return -ENOMEM;
> 
> Without CONFIG_HUGETLB_PAGE, there is no way someone could possible pass in
> something valid. Although it doesn't matter too much, -EINVAL or similar
> sounds a bit better.

I guess we could, was just to make it consistent with the kind of error
we return when we have it enabled.

> > @@ -809,6 +809,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >   	bool skip_on_failure = false;
> >   	unsigned long next_skip_pfn = 0;
> >   	bool skip_updated = false;
> > +	bool fatal_error = false;
> 
> Can't we use "ret == -ENOMEM" instead of fatal_error?

Yes, we can, I will see how it looks.

[...]

> > +	/*
> > +	 * Fence off gigantic pages as there is a cyclic dependency between
> > +	 * alloc_contig_range and them. Return -ENOME as this has the effect
> 
> s/-ENOME/-ENOMEM/

thanks, I missed that one.

> 
> > +	 * of bailing out right away without further retrying.
> > +	 */
> > +	if (hstate_is_gigantic(h))
> > +		return -ENOMEM;
> > +
> > +	return alloc_and_dissolve_huge_page(h, head);
> > +}
> > +
> >   struct page *alloc_huge_page(struct vm_area_struct *vma,
> >   				    unsigned long addr, int avoid_reserve)
> >   {
> > 
> 
> Complicated stuff, but looks good to me.

Thanks for having a look!

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages
  2021-04-13 13:40   ` Michal Hocko
@ 2021-04-15  8:29     ` Oscar Salvador
  0 siblings, 0 replies; 40+ messages in thread
From: Oscar Salvador @ 2021-04-15  8:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, Vlastimil Babka, David Hildenbrand,
	Muchun Song, linux-mm, linux-kernel

On Tue, Apr 13, 2021 at 03:40:18PM +0200, Michal Hocko wrote:
> > +		/*
> > +		 * Call __prep_new_huge_page() to construct the hugetlb page, and
> > +		 * enqueue it then to place it in the freelists. After this,
> > +		 * counters are back on track. Free hugepages have a refcount of 0,
> > +		 * so we need to decrease new_page's count as well.
> > +		 */
> > +		__prep_new_huge_page(new_page);
> > +		__prep_account_new_huge_page(h, nid);
> 
> I think it would help to put something like the following into the
> comment above this really strange construct.
> 
> 		/*
> 		 * new_page needs to be initialized with the standard
> 		 * hugetlb state. This is normally done by
> 		 * prep_new_huge_page but that takes hugetlb_lock which
> 		 * is already held so we need to open code it here.
> 		 * Reference count trick is needed because allocator
> 		 * gives us referenced page but the pool requires pages
> 		 * with 0 refcount.
> 		 */

Ok, I will try to add more info, thanks Michal!

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes
  2021-04-14 11:54   ` David Hildenbrand
@ 2021-04-15  8:42     ` Oscar Salvador
  0 siblings, 0 replies; 40+ messages in thread
From: Oscar Salvador @ 2021-04-15  8:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Mike Kravetz, Vlastimil Babka, Michal Hocko,
	Muchun Song, linux-mm, linux-kernel

On Wed, Apr 14, 2021 at 01:54:17PM +0200, David Hildenbrand wrote:
> >    * Isolate all pages that can be migrated from the range specified by
> >    * [low_pfn, end_pfn). The range is expected to be within same pageblock.
> > - * Returns zero if there is a fatal signal pending, otherwise PFN of the
> > - * first page that was not scanned (which may be both less, equal to or more
> > - * than end_pfn).
> > + * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion,
> > + * or 0.
> > + * cc->migrate_pfn will contain the next pfn to scan (which may be both less,
> > + * equal to or more that end_pfn).
> 
> I failed to parse the last sentence -- e.g., using "both" and then listing
> three options. Also, s/than/than/? Can we simplify to
> 
> "cc->migrate_pfn will contain the next pfn to scan"

Ok, will simplify

> >   /**
> > @@ -1139,15 +1144,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >    * @start_pfn: The first PFN to start isolating.
> >    * @end_pfn:   The one-past-last PFN.
> >    *
> > - * Returns zero if isolation fails fatally due to e.g. pending signal.
> > - * Otherwise, function returns one-past-the-last PFN of isolated page
> > - * (which may be greater than end_pfn if end fell in a middle of a THP page).
> > + * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion,
> 
> errno is usually positive.
> 
> > + * or 0.
> 
> I'd be more specific here. Something like
> 
> "
> Returns 0 if isolation succeeded. Returns -EINTR if a fatal signal is
> pending. Returns -EAGAIN when contended and the caller should retry.
> 
> In any case, cc->migrate_pfn is set to one-past-the-last PFN that has been
> isolated.
> "

I will try to reword it. Although note that 0 does not mean the
isolation succeeded.
Compaction tracks whether we could isolate pages by means of
cc->nr_migratepages.

The thing is that alloc_contig_range and compaction code need different
things.
Compaction wants to isolate and migrate as much as it needs to in order to
form a higher-order page (or in case
if it is trying to compact the whole zone, it goes through the zone).
alloc_contig_range() needs to isolate the whole range we specified in
order to be able to migrate it and make it free for whoever wants to use
it.

Let us say that the interface between alloc_contig_range() and
compaction still needs some more love that I plan to work on when this goes
in.


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2021-04-15  8:42 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-13 10:47 [PATCH v7 0/7] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
2021-04-13 10:47 ` [PATCH v7 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range Oscar Salvador
2021-04-13 17:00   ` Mike Kravetz
2021-04-13 10:47 ` [PATCH v7 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes Oscar Salvador
2021-04-13 17:52   ` Mike Kravetz
2021-04-14 11:54   ` David Hildenbrand
2021-04-15  8:42     ` Oscar Salvador
2021-04-13 10:47 ` [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock Oscar Salvador
2021-04-13 13:23   ` Michal Hocko
2021-04-13 21:19     ` Mike Kravetz
2021-04-14  6:04       ` Michal Hocko
2021-04-14  7:41         ` Oscar Salvador
2021-04-14  8:28           ` Michal Hocko
2021-04-14 10:01             ` Oscar Salvador
2021-04-14 10:03               ` Oscar Salvador
2021-04-14 10:32               ` Michal Hocko
2021-04-14 10:49                 ` Oscar Salvador
2021-04-14 11:09                   ` Michal Hocko
2021-04-14 12:02                     ` David Hildenbrand
2021-04-14 16:45                   ` Mike Kravetz
2021-04-13 10:47 ` [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality Oscar Salvador
2021-04-13 13:24   ` Michal Hocko
2021-04-13 13:26     ` Michal Hocko
2021-04-13 21:33   ` Mike Kravetz
2021-04-14  4:59     ` Oscar Salvador
2021-04-14 12:15       ` David Hildenbrand
2021-04-14 17:03       ` Mike Kravetz
2021-04-13 10:47 ` [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
2021-04-13 13:40   ` Michal Hocko
2021-04-15  8:29     ` Oscar Salvador
2021-04-13 22:29   ` Mike Kravetz
2021-04-14  4:54     ` Oscar Salvador
2021-04-14 12:26   ` David Hildenbrand
2021-04-15  8:28     ` Oscar Salvador
2021-04-13 10:47 ` [PATCH v7 6/7] mm: Make alloc_contig_range handle in-use " Oscar Salvador
2021-04-13 22:48   ` Mike Kravetz
2021-04-14  4:52     ` Oscar Salvador
2021-04-14 17:07       ` Mike Kravetz
2021-04-13 10:47 ` [PATCH v7 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig Oscar Salvador
2021-04-13 22:53   ` Mike Kravetz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).