linux-kernel.vger.kernel.org archive mirror
* [PATCH v5 0/5] Make alloc_contig_range handle Hugetlb pages
@ 2021-03-17 11:12 Oscar Salvador
  2021-03-17 11:12 ` [PATCH v5 1/5] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range Oscar Salvador
                   ` (4 more replies)
  0 siblings, 5 replies; 33+ messages in thread
From: Oscar Salvador @ 2021-03-17 11:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Hildenbrand, Michal Hocko, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel, Oscar Salvador

v4->v5:
 - Collect Acked-by and Reviewed-by from David and Vlastimil
 - Drop racy checks in pfn_range_valid_contig (David)
 - Rebased on top of 5.12-rc3

v3 -> v4:
 - Addressed some feedback from David and Michal
 - Make more clear what hugetlb_lock protects in isolate_or_dissolve_huge_page
 - Start reporting proper error codes from isolate_migratepages_{range,block}
 - Bail out earlier in __alloc_contig_migrate_range on -ENOMEM
 - Addressed internal feedback from Vlastimil wrt. compaction code changes

v2 -> v3:
 - Drop usage of high-level generic helpers in favour of
   low-level approach (per Michal)
 - Check for the page to be marked as PageHugeFreed
 - Add a one-time retry in case someone grabbed the free huge page
   from under us

v1 -> v2:
 - Addressed feedback by Michal
 - Restrict the allocation to a node with __GFP_THISNODE
 - Drop PageHuge check in alloc_and_dissolve_huge_page
 - Re-order comments in isolate_or_dissolve_huge_page
 - Extend comment in isolate_migratepages_block
 - Place put_page right after we got the page, otherwise
   dissolve_free_huge_page will fail

 RFC -> v1:
 - Drop RFC
 - Addressed feedback from David and Mike
 - Fence off gigantic pages as there is a cyclic dependency between
   them and alloc_contig_range
 - Re-organize the code to make race-window smaller and to put
   all details in hugetlb code
 - Drop nodemask initialization. First a node will be tried and then we
   will fall back to other nodes containing memory (N_MEMORY). Details in
   patch#1's changelog
 - Count new page as surplus in case we failed to dissolve the old page
   and the new one. Details in patch#1.

Cover letter:

 alloc_contig_range lacks the ability to handle HugeTLB pages.
 This can be problematic for some users, e.g. CMA and virtio-mem, which
 will fail the call if alloc_contig_range ever sees a HugeTLB page, even
 when those pages lie in ZONE_MOVABLE and are free.
 That problem can be easily solved by replacing the page in the free hugepage
 pool.

 In-use HugeTLB pages are no exception though, as those can be isolated and migrated
 like any other LRU or Movable page.

 This patchset aims to improve alloc_contig_range->isolate_migratepages_block,
 so HugeTLB pages can be recognized and handled.

 Since we also need to start reporting errors down the chain (e.g. -ENOMEM due to
 not being able to allocate a new hugetlb page), the isolate_migratepages_{range,block}
 interfaces need to change to report error codes instead of the pfn == 0
 vs pfn != 0 scheme they are using right now.
 From now on, isolate_migratepages_block will not return the next pfn to be scanned
 anymore, but -EINTR, -ENOMEM or 0, so the next pfn to be scanned will be recorded
 in the cc->migrate_pfn field (as is already done in isolate_migratepages_range()).

 Below is an insight from David (thanks), where the problem can clearly be seen:

 "Start a VM with 4G. Hotplug 1G via virtio-mem and online it to
  ZONE_MOVABLE. Allocate 512 huge pages.

  [root@localhost ~]# cat /proc/meminfo
  MemTotal:        5061512 kB
  MemFree:         3319396 kB
  MemAvailable:    3457144 kB
  ...
  HugePages_Total:     512
  HugePages_Free:      512
  HugePages_Rsvd:        0
  HugePages_Surp:        0
  Hugepagesize:       2048 kB

  The huge pages get partially allocated from ZONE_MOVABLE. Try unplugging
  1G via virtio-mem (remember, all ZONE_MOVABLE). Inside the guest:

  [  180.058992] alloc_contig_range: [1b8000, 1c0000) PFNs busy
  [  180.060531] alloc_contig_range: [1b8000, 1c0000) PFNs busy
  [  180.061972] alloc_contig_range: [1b8000, 1c0000) PFNs busy
  [  180.063413] alloc_contig_range: [1b8000, 1c0000) PFNs busy
  [  180.064838] alloc_contig_range: [1b8000, 1c0000) PFNs busy
  [  180.065848] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
  [  180.066794] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
  [  180.067738] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
  [  180.068669] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
  [  180.069598] alloc_contig_range: [1bfc00, 1c0000) PFNs busy"

 And then with this patchset running:

 "Same experiment with ZONE_MOVABLE:

  a) Free huge pages: all memory can get unplugged again.

  b) Allocated/populated but idle huge pages: all memory can get unplugged
     again.

  c) Allocated/populated but all 512 huge pages are read/written in a
     loop: all memory can get unplugged again, but I get a single

  [  121.192345] alloc_contig_range: [180000, 188000) PFNs busy

  Most probably because it happened to try migrating a huge page while it
  was busy. As virtio-mem retries on ZONE_MOVABLE a couple of times, it
  can deal with this temporary failure.

  Last but not least, I did something extreme:

  # cat /proc/meminfo
  MemTotal:        5061568 kB
  MemFree:          186560 kB
  MemAvailable:     354524 kB
  ...
  HugePages_Total:    2048
  HugePages_Free:     2048
  HugePages_Rsvd:        0
  HugePages_Surp:        0

  Triggering unplug would require to dissolve+alloc - which now fails when
  trying to allocate an additional ~512 huge pages (1G).

  As expected, I can properly see memory unplug not fully succeeding. + I
  get a fairly continuous stream of

  [  226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy
  ...

  But more importantly, the hugepage count remains stable, as configured
  by the admin (me):

  HugePages_Total:    2048
  HugePages_Free:     2048
  HugePages_Rsvd:        0
  HugePages_Surp:        0"

Oscar Salvador (5):
  mm,page_alloc: Bail out earlier on -ENOMEM in
    alloc_contig_migrate_range
  mm,compaction: Let isolate_migratepages_{range,block} return error
    codes
  mm: Make alloc_contig_range handle free hugetlb pages
  mm: Make alloc_contig_range handle in-use hugetlb pages
  mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig

 include/linux/hugetlb.h |   7 +++
 mm/compaction.c         |  89 ++++++++++++++++++++++++----------
 mm/hugetlb.c            | 125 +++++++++++++++++++++++++++++++++++++++++++++++-
 mm/internal.h           |   2 +-
 mm/page_alloc.c         |  21 ++++----
 mm/vmscan.c             |   5 +-
 6 files changed, 209 insertions(+), 40 deletions(-)

-- 
2.16.3



* [PATCH v5 1/5] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range
  2021-03-17 11:12 [PATCH v5 0/5] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
@ 2021-03-17 11:12 ` Oscar Salvador
  2021-03-17 14:05   ` Michal Hocko
  2021-03-17 11:12 ` [PATCH v5 2/5] mm,compaction: Let isolate_migratepages_{range,block} return error codes Oscar Salvador
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-03-17 11:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Hildenbrand, Michal Hocko, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel, Oscar Salvador

Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or -EBUSY,
and report them down the chain.
The problem is that when migrate_pages() reports -ENOMEM, we keep going till we
exhaust all the try-attempts (5 at the moment) instead of bailing out.

migrate_pages() bails out right away on -ENOMEM because it is considered a fatal
error. Do the same here instead of retrying.
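
To illustrate the retry policy described above, here is a minimal userspace
sketch, not the kernel code. It models only the migrate step: struct
fake_migrator and fake_migrate() are hypothetical stand-ins for
migrate_pages() that replay scripted return values, and the isolation step
of the real loop is left out. A positive return value is assumed to mean
"some pages were not migrated".

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for migrate_pages(): replays scripted results. */
struct fake_migrator {
	const int *results;	/* scripted return values, one per call */
	int nr_results;		/* number of scripted values (must be > 0) */
	int calls;		/* how many times fake_migrate() was invoked */
};

static int fake_migrate(struct fake_migrator *m)
{
	int idx;

	assert(m->nr_results > 0);
	/* Once the script is exhausted, keep returning its last value. */
	idx = m->calls < m->nr_results ? m->calls : m->nr_results - 1;
	m->calls++;
	return m->results[idx];
}

/*
 * Simplified model of the policy: retry up to max_tries times and
 * report -EBUSY if every attempt fails, but stop immediately on
 * -ENOMEM because retrying cannot help.
 */
static int migrate_with_retries(struct fake_migrator *m, int max_tries)
{
	int tries = 0;
	int ret;

	for (;;) {
		ret = fake_migrate(m);
		if (ret == 0)
			break;		/* everything migrated */
		if (ret == -ENOMEM)
			break;		/* fatal: bail out right away */
		if (++tries == max_tries) {
			ret = -EBUSY;	/* gave up after max_tries attempts */
			break;
		}
	}
	return ret;
}
```

With max_tries set to 5, a migrator that always reports failures is called
five times before -EBUSY is returned, while one that reports -ENOMEM on its
second call is never called a third time.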

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 mm/page_alloc.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cfc72873961d..a4f67063b85f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8481,7 +8481,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 			}
 			tries = 0;
 		} else if (++tries == 5) {
-			ret = ret < 0 ? ret : -EBUSY;
+			ret = -EBUSY;
 			break;
 		}
 
@@ -8491,6 +8491,12 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 
 		ret = migrate_pages(&cc->migratepages, alloc_migration_target,
 				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
+		/*
+		 * On -ENOMEM, migrate_pages() bails out right away. It is pointless
+		 * to retry again over this error, so do the same here.
+		 */
+		if (ret == -ENOMEM)
+			break;
 	}
 	if (ret < 0) {
 		putback_movable_pages(&cc->migratepages);
-- 
2.16.3



* [PATCH v5 2/5] mm,compaction: Let isolate_migratepages_{range,block} return error codes
  2021-03-17 11:12 [PATCH v5 0/5] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
  2021-03-17 11:12 ` [PATCH v5 1/5] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range Oscar Salvador
@ 2021-03-17 11:12 ` Oscar Salvador
  2021-03-17 14:12   ` Michal Hocko
  2021-03-17 11:12 ` [PATCH v5 3/5] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-03-17 11:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Hildenbrand, Michal Hocko, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel, Oscar Salvador

Currently, isolate_migratepages_{range,block} and their callers use
a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was
any error during isolation.
This does not work as soon as we need to start reporting different error
codes and make sure we pass them down the chain, so they are properly
interpreted by functions such as alloc_contig_range.

Let us rework isolate_migratepages_{range,block} so we can report error
codes.
Since isolate_migratepages_block will stop returning the next pfn to be
scanned, we reuse the cc->migrate_pfn field to keep track of that.
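
The difference between the two conventions can be sketched in userspace as
follows. This is only an illustration under simplifying assumptions: struct
scan_ctl, scan_block_old() and scan_block_new() are hypothetical names that
loosely mirror compact_control and isolate_migratepages_block, and the
"scan" is reduced to either aborting or reaching end_pfn.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, loosely mirrors the relevant bit of compact_control. */
struct scan_ctl {
	unsigned long migrate_pfn;	/* next pfn to be scanned */
};

/*
 * Old convention: return the next pfn to scan, or 0 on abort.
 * The caller learns that something went wrong, but not what.
 */
static unsigned long scan_block_old(unsigned long low_pfn,
				    unsigned long end_pfn, bool abort_scan)
{
	assert(low_pfn <= end_pfn);
	if (abort_scan)
		return 0;
	return end_pfn;
}

/*
 * New convention: return 0 or a negative errno, and record the next
 * pfn to scan in the control structure instead.
 */
static int scan_block_new(struct scan_ctl *cc, unsigned long low_pfn,
			  unsigned long end_pfn, int err)
{
	assert(low_pfn <= end_pfn);
	cc->migrate_pfn = low_pfn;	/* nothing scanned yet */
	if (err)
		return err;		/* e.g. -EINTR or -ENOMEM */
	cc->migrate_pfn = end_pfn;	/* whole block scanned */
	return 0;
}
```

With the new convention a caller can tell -EINTR apart from -ENOMEM and
still find the restart position in migrate_pfn, which the single
"0 means failure" return value could not express.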

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/compaction.c | 48 ++++++++++++++++++++++++------------------------
 mm/internal.h   |  2 +-
 mm/page_alloc.c |  7 +++----
 3 files changed, 28 insertions(+), 29 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e04f4476e68e..5769753a8f60 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -787,15 +787,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
  *
  * Isolate all pages that can be migrated from the range specified by
  * [low_pfn, end_pfn). The range is expected to be within same pageblock.
- * Returns zero if there is a fatal signal pending, otherwise PFN of the
- * first page that was not scanned (which may be both less, equal to or more
- * than end_pfn).
+ * Returns -EINTR in case we need to abort when we have too many isolated pages
+ * due to e.g: signal pending, async mode or having still pages to migrate, or 0.
+ * cc->migrate_pfn will contain the next pfn to scan (which may be both less,
+ * equal to or more than end_pfn).
  *
  * The pages are isolated on cc->migratepages list (not required to be empty),
  * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
  * is neither read nor updated.
  */
-static unsigned long
+static int
 isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			unsigned long end_pfn, isolate_mode_t isolate_mode)
 {
@@ -810,6 +811,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 	unsigned long next_skip_pfn = 0;
 	bool skip_updated = false;
 
+	cc->migrate_pfn = low_pfn;
+
 	/*
 	 * Ensure that there are not too many pages isolated from the LRU
 	 * list by either parallel reclaimers or compaction. If there are,
@@ -818,16 +821,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 	while (unlikely(too_many_isolated(pgdat))) {
 		/* stop isolation if there are still pages not migrated */
 		if (cc->nr_migratepages)
-			return 0;
+			return -EINTR;
 
 		/* async migration should just abort */
 		if (cc->mode == MIGRATE_ASYNC)
-			return 0;
+			return -EINTR;
 
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		if (fatal_signal_pending(current))
-			return 0;
+			return -EINTR;
 	}
 
 	cond_resched();
@@ -1130,7 +1133,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 	if (nr_isolated)
 		count_compact_events(COMPACTISOLATED, nr_isolated);
 
-	return low_pfn;
+	cc->migrate_pfn = low_pfn;
+
+	return 0;
 }
 
 /**
@@ -1139,15 +1144,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
  * @start_pfn: The first PFN to start isolating.
  * @end_pfn:   The one-past-last PFN.
  *
- * Returns zero if isolation fails fatally due to e.g. pending signal.
- * Otherwise, function returns one-past-the-last PFN of isolated page
- * (which may be greater than end_pfn if end fell in a middle of a THP page).
+ * Returns -EINTR in case isolation fails fatally due to e.g. pending signal,
+ * or 0.
  */
-unsigned long
+int
 isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
 							unsigned long end_pfn)
 {
 	unsigned long pfn, block_start_pfn, block_end_pfn;
+	int ret = 0;
 
 	/* Scan block by block. First and last block may be incomplete */
 	pfn = start_pfn;
@@ -1166,17 +1171,17 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
 					block_end_pfn, cc->zone))
 			continue;
 
-		pfn = isolate_migratepages_block(cc, pfn, block_end_pfn,
-							ISOLATE_UNEVICTABLE);
+		ret = isolate_migratepages_block(cc, pfn, block_end_pfn,
+						 ISOLATE_UNEVICTABLE);
 
-		if (!pfn)
+		if (ret)
 			break;
 
 		if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX)
 			break;
 	}
 
-	return pfn;
+	return ret;
 }
 
 #endif /* CONFIG_COMPACTION || CONFIG_CMA */
@@ -1847,7 +1852,7 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
 	 */
 	for (; block_end_pfn <= cc->free_pfn;
 			fast_find_block = false,
-			low_pfn = block_end_pfn,
+			cc->migrate_pfn = low_pfn = block_end_pfn,
 			block_start_pfn = block_end_pfn,
 			block_end_pfn += pageblock_nr_pages) {
 
@@ -1889,10 +1894,8 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
 		}
 
 		/* Perform the isolation */
-		low_pfn = isolate_migratepages_block(cc, low_pfn,
-						block_end_pfn, isolate_mode);
-
-		if (!low_pfn)
+		if (isolate_migratepages_block(cc, low_pfn, block_end_pfn,
+						isolate_mode))
 			return ISOLATE_ABORT;
 
 		/*
@@ -1903,9 +1906,6 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
 		break;
 	}
 
-	/* Record where migration scanner will be restarted. */
-	cc->migrate_pfn = low_pfn;
-
 	return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 1432feec62df..1f2ccba8e289 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -261,7 +261,7 @@ struct capture_control {
 unsigned long
 isolate_freepages_range(struct compact_control *cc,
 			unsigned long start_pfn, unsigned long end_pfn);
-unsigned long
+int
 isolate_migratepages_range(struct compact_control *cc,
 			   unsigned long low_pfn, unsigned long end_pfn);
 int find_suitable_fallback(struct free_area *area, unsigned int order,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a4f67063b85f..4cb455355f6d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8474,11 +8474,10 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 
 		if (list_empty(&cc->migratepages)) {
 			cc->nr_migratepages = 0;
-			pfn = isolate_migratepages_range(cc, pfn, end);
-			if (!pfn) {
-				ret = -EINTR;
+			ret = isolate_migratepages_range(cc, pfn, end);
+			if (ret)
 				break;
-			}
+			pfn = cc->migrate_pfn;
 			tries = 0;
 		} else if (++tries == 5) {
 			ret = -EBUSY;
-- 
2.16.3



* [PATCH v5 3/5] mm: Make alloc_contig_range handle free hugetlb pages
  2021-03-17 11:12 [PATCH v5 0/5] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
  2021-03-17 11:12 ` [PATCH v5 1/5] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range Oscar Salvador
  2021-03-17 11:12 ` [PATCH v5 2/5] mm,compaction: Let isolate_migratepages_{range,block} return error codes Oscar Salvador
@ 2021-03-17 11:12 ` Oscar Salvador
  2021-03-17 14:22   ` Michal Hocko
  2021-03-17 11:12 ` [PATCH v5 4/5] mm: Make alloc_contig_range handle in-use " Oscar Salvador
  2021-03-17 11:12 ` [PATCH v5 5/5] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig Oscar Salvador
  4 siblings, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-03-17 11:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Hildenbrand, Michal Hocko, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel, Oscar Salvador

alloc_contig_range will fail if it ever sees a HugeTLB page within the
range we are trying to allocate, even when that page is free and can be
easily reallocated.
This has proved to be problematic for some users of alloc_contig_range,
e.g. CMA and virtio-mem, which would fail the call even when those
pages lie in ZONE_MOVABLE and are free.

We can do better by trying to replace such page.

Free hugepages are tricky to handle: so that no userspace application
notices any disruption, we need to replace the current free hugepage with
a new one.

In order to do that, a new function called alloc_and_dissolve_huge_page
is introduced.
This function will first try to get a new fresh hugepage, and if it
succeeds, it will replace the old one in the free hugepage pool.

All operations are being handled under hugetlb_lock, so no races are
possible. The only exception is when page's refcount is 0, but it still
has not been flagged as PageHugeFreed.
E.g., the scenario below:

CPU0				CPU1
__free_huge_page()		isolate_or_dissolve_huge_page
				  PageHuge() == T
				  alloc_and_dissolve_huge_page
				    alloc_fresh_huge_page()
				    spin_lock(hugetlb_lock)
				    // PageHuge() && !PageHugeFreed &&
				    // !PageCount()
				    spin_unlock(hugetlb_lock)
  spin_lock(hugetlb_lock)
  1) update_and_free_page
       PageHuge() == F
       __free_pages()
  2) enqueue_huge_page
       SetPageHugeFreed()
  spin_unlock(&hugetlb_lock)
				  spin_lock(hugetlb_lock)
                                   1) PageHuge() == F (freed by case#1 from CPU0)
				   2) PageHuge() == T
                                       PageHugeFreed() == T
                                       - proceed with replacing the page

In the case above we retry, as the race window is quite small and we have a high
chance of succeeding next time.

With regard to the allocation, we restrict it to the node the page belongs
to with __GFP_THISNODE, meaning we do not fall back to other nodes' zones.

Note that gigantic hugetlb pages are fenced off since there is a cyclic
dependency between them and alloc_contig_range.
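
The decision taken under hugetlb_lock, as described above, can be
summarized with a small userspace sketch. It is a model only: struct
hpage_state, enum replace_action and classify_old_page() are hypothetical
names, and the three fields stand in for PageHuge(), page_count() and the
"freed" flag of the old page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the old page's state, taken under the lock. */
struct hpage_state {
	bool is_huge;	/* still a hugetlb page? */
	int refcount;	/* non-zero means someone holds a reference */
	bool freed;	/* already enqueued on the free list? */
};

/* What to do with the freshly allocated replacement page. */
enum replace_action {
	DROP_NEW_OK,	/* old page was freed from under us: success */
	DROP_NEW_BUSY,	/* old page is in use: report busy */
	RETRY,		/* refcount is 0 but not enqueued yet: retry */
	REPLACE,	/* genuine free hugepage: swap old for new */
};

static enum replace_action classify_old_page(const struct hpage_state *s)
{
	assert(s != NULL);
	if (!s->is_huge)
		return DROP_NEW_OK;
	if (s->refcount)
		return DROP_NEW_BUSY;
	if (!s->freed)
		return RETRY;
	return REPLACE;
}
```

The order of the checks matters in this model: the "freed" flag is only
consulted once the page is known to be a hugetlb page with a refcount of
zero, which is exactly the small window shown in the CPU0/CPU1 diagram.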

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/hugetlb.h |   6 +++
 mm/compaction.c         |  33 ++++++++++++++-
 mm/hugetlb.c            | 109 +++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 145 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index cccd1aab69dd..bcff86ca616f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -583,6 +583,7 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
+int isolate_or_dissolve_huge_page(struct page *page);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -865,6 +866,11 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
+static inline int isolate_or_dissolve_huge_page(struct page *page)
+{
+	return -ENOMEM;
+}
+
 static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
 					   unsigned long addr,
 					   int avoid_reserve)
diff --git a/mm/compaction.c b/mm/compaction.c
index 5769753a8f60..9f253fc3b4f9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -810,6 +810,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 	bool skip_on_failure = false;
 	unsigned long next_skip_pfn = 0;
 	bool skip_updated = false;
+	bool fatal_error = false;
+	int ret = 0;
 
 	cc->migrate_pfn = low_pfn;
 
@@ -907,6 +909,32 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			valid_page = page;
 		}
 
+		if (PageHuge(page) && cc->alloc_contig) {
+			ret = isolate_or_dissolve_huge_page(page);
+
+			/*
+			 * Fail isolation in case isolate_or_dissolve_huge_page
+			 * reports an error. In case of -ENOMEM, abort right away.
+			 */
+			if (ret < 0) {
+				/*
+				 * Do not report -EBUSY down the chain.
+				 */
+				if (ret == -ENOMEM)
+					fatal_error = true;
+				else
+					ret = 0;
+				goto isolate_fail;
+			}
+
+			/*
+			 * Ok, the hugepage was dissolved. Now these pages are
+			 * Buddy and cannot be re-allocated because they are
+			 * isolated. Fall-through as the check below handles
+			 * Buddy pages.
+			 */
+		}
+
 		/*
 		 * Skip if free. We read page order here without zone lock
 		 * which is generally unsafe, but the race window is small and
@@ -1092,6 +1120,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			 */
 			next_skip_pfn += 1UL << cc->order;
 		}
+
+		if (fatal_error)
+			break;
 	}
 
 	/*
@@ -1135,7 +1166,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 	cc->migrate_pfn = low_pfn;
 
-	return 0;
+	return ret;
 }
 
 /**
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5b1ab1f427c5..3194c1bd9e32 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1035,13 +1035,18 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 	return false;
 }
 
+static void __enqueue_huge_page(struct list_head *list, struct page *page)
+{
+	list_move(&page->lru, list);
+	SetHPageFreed(page);
+}
+
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
-	list_move(&page->lru, &h->hugepage_freelists[nid]);
+	__enqueue_huge_page(&h->hugepage_freelists[nid], page);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
-	SetHPageFreed(page);
 }
 
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
@@ -2245,6 +2250,106 @@ static void restore_reserve_on_error(struct hstate *h,
 	}
 }
 
+/*
+ * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
+ * @h: struct hstate old page belongs to
+ * @old_page: Old page to dissolve
+ * Returns 0 on success, otherwise negated error.
+ */
+
+static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
+{
+	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+	int nid = page_to_nid(old_page);
+	struct page *new_page;
+	int ret = 0;
+
+	/*
+	 * Before dissolving the page, we need to allocate a new one,
+	 * so the pool remains stable.
+	 */
+	new_page = alloc_fresh_huge_page(h, gfp_mask, nid, NULL, NULL);
+	if (!new_page)
+		return -ENOMEM;
+
+	/*
+	 * Pages got from Buddy are self-refcounted, but free hugepages
+	 * need to have a refcount of 0.
+	 */
+	page_ref_dec(new_page);
+retry:
+	spin_lock(&hugetlb_lock);
+	if (!PageHuge(old_page)) {
+		/*
+		 * Freed from under us. Drop new_page too.
+		 */
+		update_and_free_page(h, new_page);
+		goto unlock;
+	} else if (page_count(old_page)) {
+		/*
+		 * Someone has grabbed the page, fail for now.
+		 */
+		ret = -EBUSY;
+		update_and_free_page(h, new_page);
+		goto unlock;
+	} else if (!HPageFreed(old_page)) {
+		/*
+		 * Page's refcount is 0 but it has not been enqueued in the
+		 * freelist yet. Race window is small, so we can succeed here if
+		 * we retry.
+		 */
+		spin_unlock(&hugetlb_lock);
+		cond_resched();
+		goto retry;
+	} else {
+		/*
+		 * Ok, old_page is still a genuine free hugepage. Replace it
+		 * with the new one.
+		 */
+		list_del(&old_page->lru);
+		update_and_free_page(h, old_page);
+		/*
+		 * h->free_huge_pages{_node} counters do not need to be updated.
+		 */
+		__enqueue_huge_page(&h->hugepage_freelists[nid], new_page);
+	}
+unlock:
+	spin_unlock(&hugetlb_lock);
+
+	return ret;
+}
+
+int isolate_or_dissolve_huge_page(struct page *page)
+{
+	struct hstate *h;
+	struct page *head;
+
+	/*
+	 * The page might have been dissolved from under our feet, so make sure
+	 * to carefully check the state under the lock.
+	 * Return success when racing as if we dissolved the page ourselves.
+	 */
+	spin_lock(&hugetlb_lock);
+	if (PageHuge(page)) {
+		head = compound_head(page);
+		h = page_hstate(head);
+	} else {
+		spin_unlock(&hugetlb_lock);
+		return 0;
+	}
+	spin_unlock(&hugetlb_lock);
+
+	/*
+	 * Fence off gigantic pages as there is a cyclic dependency between
+	 * alloc_contig_range and them. Return -ENOMEM as this has the effect
+	 * of bailing out right away without further retrying.
+	 */
+	if (hstate_is_gigantic(h))
+		return -ENOMEM;
+
+	return alloc_and_dissolve_huge_page(h, head);
+}
+
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
-- 
2.16.3



* [PATCH v5 4/5] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-03-17 11:12 [PATCH v5 0/5] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
                   ` (2 preceding siblings ...)
  2021-03-17 11:12 ` [PATCH v5 3/5] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
@ 2021-03-17 11:12 ` Oscar Salvador
  2021-03-17 14:26   ` Michal Hocko
  2021-03-17 11:12 ` [PATCH v5 5/5] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig Oscar Salvador
  4 siblings, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-03-17 11:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Hildenbrand, Michal Hocko, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel, Oscar Salvador

alloc_contig_range() will fail if it finds a HugeTLB page within the range,
without a chance to handle it. Since HugeTLB pages can be migrated like any
LRU or Movable page, it does not make sense to bail out without trying.
Enable the interface to recognize in-use HugeTLB pages so we can migrate
them, and have a much better chance for the call to succeed.
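
A userspace sketch of the resulting "isolate if in use, otherwise dissolve,
with a one-time retry" flow is shown below. It is a model under stated
assumptions, not the kernel implementation: struct model_page,
model_isolate(), model_dissolve() and isolate_or_dissolve_model() are
hypothetical names, and the race is simulated by a flag that makes the
page become in-use while it is being dissolved.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a huge page and of how the helpers behave. */
struct model_page {
	int refcount;			/* 0 means the page is free */
	bool isolatable;		/* would isolation succeed? */
	bool grabbed_during_dissolve;	/* simulate losing the race */
	int isolate_calls;
	int dissolve_calls;
};

static bool model_isolate(struct model_page *p)
{
	p->isolate_calls++;
	return p->isolatable;
}

static int model_dissolve(struct model_page *p)
{
	p->dissolve_calls++;
	if (p->grabbed_during_dissolve) {
		p->refcount = 1;	/* someone took the page meanwhile */
		return -EBUSY;
	}
	return 0;
}

static int isolate_or_dissolve_model(struct model_page *p)
{
	bool try_again = true;
	int ret = -EBUSY;

	assert(p != NULL);
retry:
	if (p->refcount && model_isolate(p)) {
		ret = 0;		/* in-use page isolated */
	} else if (!p->refcount) {
		ret = model_dissolve(p);
		/* Free page got grabbed: give isolation one chance. */
		if (ret == -EBUSY && try_again) {
			try_again = false;
			goto retry;
		}
	}
	return ret;
}
```

In this model a free page that is grabbed while being dissolved is still
handled successfully on the second pass, provided it can be isolated; an
in-use page that cannot be isolated yields -EBUSY after a single attempt.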

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  5 +++--
 mm/compaction.c         | 12 +++++++++++-
 mm/hugetlb.c            | 22 +++++++++++++++++++---
 mm/vmscan.c             |  5 +++--
 4 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index bcff86ca616f..a37b4ce86e58 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -583,7 +583,7 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
-int isolate_or_dissolve_huge_page(struct page *page);
+int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -866,7 +866,8 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
-static inline int isolate_or_dissolve_huge_page(struct page *page)
+static inline int isolate_or_dissolve_huge_page(struct page *page,
+						struct list_head *list)
 {
 	return -ENOMEM;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index 9f253fc3b4f9..6e47855fd154 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -910,7 +910,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		}
 
 		if (PageHuge(page) && cc->alloc_contig) {
-			ret = isolate_or_dissolve_huge_page(page);
+			ret = isolate_or_dissolve_huge_page(page, &cc->migratepages);
 
 			/*
 			 * Fail isolation in case isolate_or_dissolve_huge_page
@@ -927,6 +927,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 				goto isolate_fail;
 			}
 
+			if (PageHuge(page)) {
+				/*
+				 * Hugepage was successfully isolated and placed
+				 * on the cc->migratepages list.
+				 */
+				low_pfn += compound_nr(page) - 1;
+				goto isolate_success_no_list;
+			}
+
 			/*
 			 * Ok, the hugepage was dissolved. Now these pages are
 			 * Buddy and cannot be re-allocated because they are
@@ -1068,6 +1077,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 isolate_success:
 		list_add(&page->lru, &cc->migratepages);
+isolate_success_no_list:
 		cc->nr_migratepages += compound_nr(page);
 		nr_isolated += compound_nr(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3194c1bd9e32..11e86434d8bd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2287,7 +2287,9 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
 		goto unlock;
 	} else if (page_count(old_page)) {
 		/*
-		 * Someone has grabbed the page, fail for now.
+		 * Someone has grabbed the page, return -EBUSY so we give
+		 * isolate_or_dissolve_huge_page a chance to handle an in-use
+		 * page.
 		 */
 		ret = -EBUSY;
 		update_and_free_page(h, new_page);
@@ -2319,10 +2321,12 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
 	return ret;
 }
 
-int isolate_or_dissolve_huge_page(struct page *page)
+int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
 {
 	struct hstate *h;
 	struct page *head;
+	bool try_again = true;
+	int ret = -EBUSY;
 
 	/*
 	 * The page might have been dissolved from under our feet, so make sure
@@ -2347,7 +2351,19 @@ int isolate_or_dissolve_huge_page(struct page *page)
 	if (hstate_is_gigantic(h))
 		return -ENOMEM;
 
-	return alloc_and_dissolve_huge_page(h, head);
+retry:
+	if (page_count(head) && isolate_huge_page(head, list)) {
+		ret = 0;
+	} else if (!page_count(head)) {
+		ret = alloc_and_dissolve_huge_page(h, head);
+
+		if (ret == -EBUSY && try_again) {
+			try_again = false;
+			goto retry;
+		}
+	}
+
+	return ret;
 }
 
 struct page *alloc_huge_page(struct vm_area_struct *vma,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 562e87cbd7a1..42aaef30633e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1507,8 +1507,9 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 	LIST_HEAD(clean_pages);
 
 	list_for_each_entry_safe(page, next, page_list, lru) {
-		if (page_is_file_lru(page) && !PageDirty(page) &&
-		    !__PageMovable(page) && !PageUnevictable(page)) {
+		if (!PageHuge(page) && page_is_file_lru(page) &&
+		    !PageDirty(page) && !__PageMovable(page) &&
+		    !PageUnevictable(page)) {
 			ClearPageActive(page);
 			list_move(&page->lru, &clean_pages);
 		}
-- 
2.16.3



* [PATCH v5 5/5] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig
  2021-03-17 11:12 [PATCH v5 0/5] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
                   ` (3 preceding siblings ...)
  2021-03-17 11:12 ` [PATCH v5 4/5] mm: Make alloc_contig_range handle in-use " Oscar Salvador
@ 2021-03-17 11:12 ` Oscar Salvador
  2021-03-17 11:15   ` David Hildenbrand
  2021-03-17 14:31   ` Michal Hocko
  4 siblings, 2 replies; 33+ messages in thread
From: Oscar Salvador @ 2021-03-17 11:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Hildenbrand, Michal Hocko, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel, Oscar Salvador

pfn_range_valid_contig() bails out when it finds an in-use page or a
hugetlb page, among other things.
We can drop the in-use page check since __alloc_contig_pages can migrate
away those pages, and the hugetlb page check can go too since
isolate_migratepages_range is now capable of dealing with hugetlb pages.
Either way, those checks are racy so let the end function handle it
when the time comes.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: David Hildenbrand <david@redhat.com>
---
 mm/page_alloc.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4cb455355f6d..50d73e68b79e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8685,12 +8685,6 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
 
 		if (PageReserved(page))
 			return false;
-
-		if (page_count(page) > 0)
-			return false;
-
-		if (PageHuge(page))
-			return false;
 	}
 	return true;
 }
-- 
2.16.3


* Re: [PATCH v5 5/5] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig
  2021-03-17 11:12 ` [PATCH v5 5/5] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig Oscar Salvador
@ 2021-03-17 11:15   ` David Hildenbrand
  2021-03-17 14:31   ` Michal Hocko
  1 sibling, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2021-03-17 11:15 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Vlastimil Babka, Michal Hocko, Muchun Song, Mike Kravetz,
	linux-mm, linux-kernel

On 17.03.21 12:12, Oscar Salvador wrote:
> pfn_range_valid_contig() bails out when it finds an in-use page or a
> hugetlb page, among other things.
> We can drop the in-use page check since __alloc_contig_pages can migrate
> away those pages, and the hugetlb page check can go too since
> isolate_migratepages_range is now capable of dealing with hugetlb pages.
> Either way, those checks are racy so let the end function handle it
> when the time comes.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Suggested-by: David Hildenbrand <david@redhat.com>
> ---
>   mm/page_alloc.c | 6 ------
>   1 file changed, 6 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4cb455355f6d..50d73e68b79e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8685,12 +8685,6 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
>   
>   		if (PageReserved(page))
>   			return false;
> -
> -		if (page_count(page) > 0)
> -			return false;
> -
> -		if (PageHuge(page))
> -			return false;
>   	}
>   	return true;
>   }
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


* Re: [PATCH v5 1/5] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range
  2021-03-17 11:12 ` [PATCH v5 1/5] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range Oscar Salvador
@ 2021-03-17 14:05   ` Michal Hocko
  2021-03-17 14:42     ` David Hildenbrand
  2021-03-18 11:04     ` Oscar Salvador
  0 siblings, 2 replies; 33+ messages in thread
From: Michal Hocko @ 2021-03-17 14:05 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Wed 17-03-21 12:12:47, Oscar Salvador wrote:
> Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or -EBUSY,
> and report them down the chain.
> The problem is that when migrate_pages() reports -ENOMEM, we keep going till we
> exhaust all the try-attempts (5 at the moment) instead of bailing out.
> 
> migrate_pages() bails out right away on -ENOMEM because it is considered a fatal
> error. Do the same here instead of keep going and retrying.

I suspect this is not really a real life problem, right? The allocation
would be more costly in the end but this is to be expected under a heavy
memory pressure.

That being said, bailing out early makes sense to me. But now that
you've made me look into the migrate_pages excellent error state reporting
I suspect we have a bug here. Note the 
"Returns the number of pages that were not migrated, or an error code."

but I do not see putback_movable_pages for ret > 0 so it seems we might
leak some pages.

That aside, looking at the other callers of migrate_pages, most of them
do not care about the number of failed pages. The only one which cares
is the migrate_pages syscall (do_migrate_pages). I think it would be much
more reasonable to have migrate_pages (the kernel function) return an error
or 0, and have the only caller which cares count the number of failed pages
(e.g. by returning the number of pages from putback_movable_pages).
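
A minimal userspace sketch of that contract, not kernel code. The names
mock_page, mock_migrate_pages and mock_putback_pages are hypothetical, and
-EAGAIN is only an assumed choice of error code for "some pages were left
behind":

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a page queued for migration. */
struct mock_page {
	int movable;	/* 1 if the mock migration is able to move it */
	int migrated;	/* set to 1 once the page has been "migrated" */
};

/*
 * Model of the proposed contract: return 0 when every page was migrated,
 * or a negative errno otherwise. A positive count is never returned.
 */
static int mock_migrate_pages(struct mock_page *pages, size_t nr)
{
	int failed = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (pages[i].movable)
			pages[i].migrated = 1;
		else
			failed = 1;
	}
	return failed ? -EAGAIN : 0;
}

/*
 * Model of a putback helper that reports how many pages were left
 * behind, for the one caller that wants the number.
 */
static long mock_putback_pages(const struct mock_page *pages, size_t nr)
{
	long left = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		if (!pages[i].migrated)
			left++;
	return left;
}
```

With this split, most callers would only test the return value for an error,
and a do_migrate_pages-like caller would take the count from the putback step.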
 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: David Hildenbrand <david@redhat.com>

The patch itself looks reasonable, but make sure to mention that this is a
mere cosmetic change unless there is a real problem fixed by this.
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cfc72873961d..a4f67063b85f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8481,7 +8481,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
>  			}
>  			tries = 0;
>  		} else if (++tries == 5) {
> -			ret = ret < 0 ? ret : -EBUSY;
> +			ret = -EBUSY;
>  			break;
>  		}
>  
> @@ -8491,6 +8491,12 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
>  
>  		ret = migrate_pages(&cc->migratepages, alloc_migration_target,
>  				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
> +		/*
> +		 * On -ENOMEM, migrate_pages() bails out right away. It is pointless
> +		 * to retry again over this error, so do the same here.
> +		 */
> +		if (ret == -ENOMEM)
> +			break;
>  	}
>  	if (ret < 0) {
>  		putback_movable_pages(&cc->migratepages);
> -- 
> 2.16.3

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v5 2/5] mm,compaction: Let isolate_migratepages_{range,block} return error codes
  2021-03-17 11:12 ` [PATCH v5 2/5] mm,compaction: Let isolate_migratepages_{range,block} return error codes Oscar Salvador
@ 2021-03-17 14:12   ` Michal Hocko
  2021-03-17 14:38     ` Oscar Salvador
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-03-17 14:12 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Wed 17-03-21 12:12:48, Oscar Salvador wrote:
> Currently, isolate_migratepages_{range,block} and their callers use
> a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was
> any error during isolation.
> This does not work as soon as we need to start reporting different error
> codes and make sure we pass them down the chain, so they are properly
> interpreted by functions like e.g: alloc_contig_range.
> 
> Let us rework isolate_migratepages_{range,block} so we can report error
> codes.

Yes this is an improvement.

> Since isolate_migratepages_block will stop returning the next pfn to be
> scanned, we reuse the cc->migrate_pfn field to keep track of that.

This looks hackish and I cannot really tell whether users of cc->migrate_pfn
work as intended.
> @@ -810,6 +811,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  	unsigned long next_skip_pfn = 0;
>  	bool skip_updated = false;
>  
> +	cc->migrate_pfn = low_pfn;
> +
>  	/*
>  	 * Ensure that there are not too many pages isolated from the LRU
>  	 * list by either parallel reclaimers or compaction. If there are,
> @@ -818,16 +821,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  	while (unlikely(too_many_isolated(pgdat))) {
>  		/* stop isolation if there are still pages not migrated */
>  		if (cc->nr_migratepages)
> -			return 0;
> +			return -EINTR;
>  
>  		/* async migration should just abort */
>  		if (cc->mode == MIGRATE_ASYNC)
> -			return 0;
> +			return -EINTR;

EINTR for anything other than a signal-based bailout is really confusing.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v5 3/5] mm: Make alloc_contig_range handle free hugetlb pages
  2021-03-17 11:12 ` [PATCH v5 3/5] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
@ 2021-03-17 14:22   ` Michal Hocko
  0 siblings, 0 replies; 33+ messages in thread
From: Michal Hocko @ 2021-03-17 14:22 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Wed 17-03-21 12:12:49, Oscar Salvador wrote:
> alloc_contig_range will fail if it ever sees a HugeTLB page within the
> range we are trying to allocate, even when that page is free and can be
> easily reallocated.
> This has proved to be problematic for some users of alloc_contic_range,
> e.g: CMA and virtio-mem, where those would fail the call even when those
> pages lay in ZONE_MOVABLE and are free.
> 
> We can do better by trying to replace such page.
> 
> Free hugepages are tricky to handle so as to no userspace application
> notices disruption, we need to replace the current free hugepage with
> a new one.
> 
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugepage, and if it
> succeeds, it will replace the old one in the free hugepage pool.
> 
> All operations are being handled under hugetlb_lock, so no races are

Slightly confusing, because the allocation, which is a part of the process,
is certainly not done under the lock.
"The free page replacement is done under hugetlb_lock, so no external
user of hugetlb will notice the change. There is one tricky case when
the page's refcount is 0 because it is in the process of being released.
A missing PageHugeFreed bit will tell us that freeing is in flight, so we
retry after dropping the hugetlb_lock. The race window should be small
and the next retry should make forward progress."
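
The decision described in that wording can be sketched as a small userspace
model. This is only an illustration: mock_hpage, mock_action and
mock_classify are hypothetical names, and the fields merely stand in for
PageHuge(), PageHugeFreed() and page_count() as sampled under hugetlb_lock:

```c
#include <stdbool.h>

/* Hypothetical snapshot of a hugetlb page, taken under the lock. */
struct mock_hpage {
	bool is_huge;		/* models PageHuge() */
	bool huge_freed;	/* models PageHugeFreed() */
	int refcount;		/* models page_count() */
};

enum mock_action {
	MOCK_DONE,	/* page was already dissolved, nothing left to do */
	MOCK_BUSY,	/* page is in use, it cannot be replaced as free */
	MOCK_RETRY,	/* freeing is in flight: drop the lock and retry */
	MOCK_REPLACE,	/* page sits in the free pool, safe to replace */
};

static enum mock_action mock_classify(const struct mock_hpage *p)
{
	if (!p->is_huge)
		return MOCK_DONE;
	if (p->refcount > 0)
		return MOCK_BUSY;
	if (!p->huge_freed)
		return MOCK_RETRY;
	return MOCK_REPLACE;
}
```

Only the MOCK_RETRY state corresponds to the race window discussed here; the
other three states can be acted upon immediately.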

> possible. The only exception is when page's refcount is 0, but it still
> has not been flagged as PageHugeFreed.
> E.g, below scenario:
> 
> CPU0				CPU1
> __free_huge_page()		isolate_or_dissolve_huge_page
> 				  PageHuge() == T
> 				  alloc_and_dissolve_huge_page
> 				    alloc_fresh_huge_page()
> 				    spin_lock(hugetlb_lock)
> 				    // PageHuge() && !PageHugeFreed &&
> 				    // !PageCount()
> 				    spin_unlock(hugetlb_lock)
>   spin_lock(hugetlb_lock)
>   1) update_and_free_page
>        PageHuge() == F
>        __free_pages()
>   2) enqueue_huge_page
>        SetPageHugeFreed()
>   spin_unlock(&hugetlb_lock)
> 				  spin_lock(hugetlb_lock)
>                                    1) PageHuge() == F (freed by case#1 from CPU0)
> 				   2) PageHuge() == T
>                                        PageHugeFreed() == T
>                                        - proceed with replacing the page
> 
> In the case above we retry as the window race is quite small and we have high
> chances to succeed next time.
> 
> With regard to the allocation, we restrict it to the node the page belongs
> to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
> 
> Note that gigantic hugetlb pages are fenced off since there is a cyclic
> dependency between them and alloc_contig_range.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

My ack still applies.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v5 4/5] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-03-17 11:12 ` [PATCH v5 4/5] mm: Make alloc_contig_range handle in-use " Oscar Salvador
@ 2021-03-17 14:26   ` Michal Hocko
  2021-03-18  8:54     ` Oscar Salvador
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-03-17 14:26 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Wed 17-03-21 12:12:50, Oscar Salvador wrote:
> alloc_contig_range() will fail if it finds a HugeTLB page within the range,
> without a chance to handle them. Since HugeTLB pages can be migrated as any
> LRU or Movable page, it does not make sense to bail out without trying.
> Enable the interface to recognize in-use HugeTLB pages so we can migrate
> them, and have much better chances to succeed the call.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Michal Hocko <mhocko@suse.com>

I am still not entirely happy about this
> @@ -2347,7 +2351,19 @@ int isolate_or_dissolve_huge_page(struct page *page)
>  	if (hstate_is_gigantic(h))
>  		return -ENOMEM;
>  
> -	return alloc_and_dissolve_huge_page(h, head);
> +retry:
> +	if (page_count(head) && isolate_huge_page(head, list)) {
> +		ret = 0;
> +	} else if (!page_count(head)) {
> +		ret = alloc_and_dissolve_huge_page(h, head);
> +
> +		if (ret == -EBUSY && try_again) {
> +			try_again = false;
> +			goto retry;
> +		}
> +	}
> +
> +	return ret;
>  }

it would be imho better to retry inside alloc_and_dissolve_huge_page
because it already has its retry logic implemented.

But not something I will insist on.
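
For illustration, the one-time retry from the hunk above can be packaged so
that it lives next to the operation being retried. This is a userspace sketch
under assumptions: mock_op_fn, mock_retry_once_on_busy and mock_busy_n_times
are hypothetical names, and only the "retry exactly once on -EBUSY" behaviour
is taken from the patch:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical operation that may report -EBUSY transiently. */
typedef int (*mock_op_fn)(void *ctx);

/* Run op once and, if it reports -EBUSY, exactly one more time. */
static int mock_retry_once_on_busy(mock_op_fn op, void *ctx)
{
	bool try_again = true;
	int ret;

retry:
	ret = op(ctx);
	if (ret == -EBUSY && try_again) {
		try_again = false;
		goto retry;
	}
	return ret;
}

/* Example op: reports -EBUSY until the counter in ctx reaches zero. */
static int mock_busy_n_times(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return -EBUSY;
	}
	return 0;
}
```

An operation that is busy once succeeds on the second attempt; one that is
busy twice still ends with -EBUSY, so the retry stays bounded.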

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v5 5/5] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig
  2021-03-17 11:12 ` [PATCH v5 5/5] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig Oscar Salvador
  2021-03-17 11:15   ` David Hildenbrand
@ 2021-03-17 14:31   ` Michal Hocko
  2021-03-17 14:36     ` David Hildenbrand
  1 sibling, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-03-17 14:31 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Wed 17-03-21 12:12:51, Oscar Salvador wrote:
> pfn_range_valid_contig() bails out when it finds an in-use page or a
> hugetlb page, among other things.
> We can drop the in-use page check since __alloc_contig_pages can migrate
> away those pages, and the hugetlb page check can go too since
> isolate_migratepages_range is now capable of dealing with hugetlb pages.
> Either way, those checks are racy so let the end function handle it
> when the time comes.

I hadn't realized the PageHuge check is done this early. This means that
the previous patches are not actually active until now, which is not really
great for bisectability. Can we remove the PageHuge check earlier?

Ack to the page_count check removal. We should rely on migrate_pages
here.

> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Suggested-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/page_alloc.c | 6 ------
>  1 file changed, 6 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4cb455355f6d..50d73e68b79e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8685,12 +8685,6 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
>  
>  		if (PageReserved(page))
>  			return false;
> -
> -		if (page_count(page) > 0)
> -			return false;
> -
> -		if (PageHuge(page))
> -			return false;
>  	}
>  	return true;
>  }
> -- 
> 2.16.3

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v5 5/5] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig
  2021-03-17 14:31   ` Michal Hocko
@ 2021-03-17 14:36     ` David Hildenbrand
  2021-03-17 15:03       ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2021-03-17 14:36 UTC (permalink / raw)
  To: Michal Hocko, Oscar Salvador
  Cc: Andrew Morton, Vlastimil Babka, Muchun Song, Mike Kravetz,
	linux-mm, linux-kernel

On 17.03.21 15:31, Michal Hocko wrote:
> On Wed 17-03-21 12:12:51, Oscar Salvador wrote:
>> pfn_range_valid_contig() bails out when it finds an in-use page or a
>> hugetlb page, among other things.
>> We can drop the in-use page check since __alloc_contig_pages can migrate
>> away those pages, and the hugetlb page check can go too since
>> isolate_migratepages_range is now capable of dealing with hugetlb pages.
>> Either way, those checks are racy so let the end function handle it
>> when the time comes.
> 
> I haven't realized PageHuge check is done this early. This means that
> previous patches are not actually active until now which is not really
> great for bisectability. Can we remove the PageHuge check earlier?

alloc_contig_pages() vs. alloc_contig_range(). The patches are active 
for virtio-mem and CMA AFAIKS.

-- 
Thanks,

David / dhildenb


* Re: [PATCH v5 2/5] mm,compaction: Let isolate_migratepages_{range,block} return error codes
  2021-03-17 14:12   ` Michal Hocko
@ 2021-03-17 14:38     ` Oscar Salvador
  2021-03-17 14:59       ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-03-17 14:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Wed, Mar 17, 2021 at 03:12:29PM +0100, Michal Hocko wrote:
> > Since isolate_migratepages_block will stop returning the next pfn to be
> > scanned, we reuse the cc->migrate_pfn field to keep track of that.
> 
> This looks hackish and I cannot really tell whether users of cc->migrate_pfn
> work as intended.

When discussing this with Vlastimil, I came up with the idea of adding a new
field in compact_control struct, e.g: next_pfn_scan to keep track of the next
pfn to be scanned.

But Vlastimil made me realize that cc->migrate_pfn already points to that,
so we do not need any extra field.
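
As a rough userspace illustration of that convention (the return value carries
only an error code, the scan position lives in the control structure):
mock_cc, mock_isolate_block and the stop_pfn parameter are hypothetical, and
-EBUSY is just an assumed error for the sketch:

```c
#include <errno.h>

/* Hypothetical, trimmed-down stand-in for struct compact_control. */
struct mock_cc {
	unsigned long migrate_pfn;	/* next pfn to be scanned */
};

/*
 * Scan [low_pfn, end_pfn). Return 0 or a negative errno, and leave the
 * position to resume from in cc->migrate_pfn instead of returning it.
 * stop_pfn models a pfn at which isolation has to give up.
 */
static int mock_isolate_block(struct mock_cc *cc, unsigned long low_pfn,
			      unsigned long end_pfn, unsigned long stop_pfn)
{
	unsigned long pfn;

	cc->migrate_pfn = low_pfn;
	for (pfn = low_pfn; pfn < end_pfn; pfn++) {
		if (pfn == stop_pfn) {
			cc->migrate_pfn = pfn;
			return -EBUSY;
		}
	}
	cc->migrate_pfn = pfn;
	return 0;
}
```

A caller then checks the return value for errors and reads cc->migrate_pfn
to know where the next scan should start.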

> > @@ -810,6 +811,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >  	unsigned long next_skip_pfn = 0;
> >  	bool skip_updated = false;
> >  
> > +	cc->migrate_pfn = low_pfn;
> > +
> >  	/*
> >  	 * Ensure that there are not too many pages isolated from the LRU
> >  	 * list by either parallel reclaimers or compaction. If there are,
> > @@ -818,16 +821,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >  	while (unlikely(too_many_isolated(pgdat))) {
> >  		/* stop isolation if there are still pages not migrated */
> >  		if (cc->nr_migratepages)
> > -			return 0;
> > +			return -EINTR;
> >  
> >  		/* async migration should just abort */
> >  		if (cc->mode == MIGRATE_ASYNC)
> > -			return 0;
> > +			return -EINTR;
> 
> EINTR for anything other than signal based bail out is really confusing.

When coding that, I thought about using -1 for the first two checks, and keep
-EINTR for the signal check, but isolate_migratepages_block only has two users:

- isolate_migratepages: Does not care about the return code other than pfn != 0,
  and it does not pass the error down the chain.
- isolate_migratepages_range: The error is passed down the chain, and !pfn is being
  treated as -EINTR:

static int __alloc_contig_migrate_range(struct compact_control *cc,
					unsigned long start, unsigned long end)
 {
  ...
  ...
  pfn = isolate_migratepages_range(cc, pfn, end);
  if (!pfn) {
          ret = -EINTR;
          break;
  }
  ...
 }

That is why I decided to stick with -EINTR.


-- 
Oscar Salvador
SUSE L3

* Re: [PATCH v5 1/5] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range
  2021-03-17 14:05   ` Michal Hocko
@ 2021-03-17 14:42     ` David Hildenbrand
  2021-03-17 14:49       ` Michal Hocko
  2021-03-18 11:04     ` Oscar Salvador
  1 sibling, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2021-03-17 14:42 UTC (permalink / raw)
  To: Michal Hocko, Oscar Salvador
  Cc: Andrew Morton, Vlastimil Babka, Muchun Song, Mike Kravetz,
	linux-mm, linux-kernel

On 17.03.21 15:05, Michal Hocko wrote:
> On Wed 17-03-21 12:12:47, Oscar Salvador wrote:
>> Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or -EBUSY,
>> and report them down the chain.
>> The problem is that when migrate_pages() reports -ENOMEM, we keep going till we
>> exhaust all the try-attempts (5 at the moment) instead of bailing out.
>>
>> migrate_pages() bails out right away on -ENOMEM because it is considered a fatal
>> error. Do the same here instead of keep going and retrying.
> 
> I suspect this is not really a real life problem, right? The allocation
> would be more costly in the end but this is to be expected under a heavy
> memory pressure.
> 
> That being said, bailing out early makes sense to me. But now that
> you've made me look into the migrate_pages excellent error state reporting
> I suspect we have a bug here. Note the
> "Returns the number of pages that were not migrated, or an error code."
> 
> but I do not see putback_movable_pages for ret > 0 so it seems we might
> leak some pages.

At least in __alloc_contig_migrate_range() we seem to always leave the 
loop with ret <= 0 and do a putback_movable_pages() with ret < 0.

Which code are you referring to?

(I think the logic flow inside __alloc_contig_migrate_range() might be 
improved ...)
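
To make the control flow easier to follow, here is a simplified userspace
model of the loop as it looks with this patch applied. It is a sketch under
assumptions: mock_range, mock_migrate_pages and
mock_alloc_contig_migrate_range are hypothetical, and isolation is reduced
to a counter:

```c
#include <errno.h>

/* Hypothetical description of the range being migrated. */
struct mock_range {
	int to_isolate;	/* batches still to be isolated from the range */
	int stuck;	/* pages the mock migration can never move */
	int enomem;	/* non-zero: the mock migration reports -ENOMEM */
};

/* Like migrate_pages(): 0, a count of unmigrated pages, or -errno. */
static int mock_migrate_pages(const struct mock_range *r)
{
	if (r->enomem)
		return -ENOMEM;
	return r->stuck;
}

static int mock_alloc_contig_migrate_range(struct mock_range *r, int *putback)
{
	int pending = 0;	/* pages left on the mock migrate list */
	int tries = 0;
	int ret = 0;

	*putback = 0;
	while (r->to_isolate > 0 || pending > 0) {
		if (pending == 0) {
			r->to_isolate--;	/* isolate the next batch */
			tries = 0;
		} else if (++tries == 5) {
			ret = -EBUSY;		/* replaces any positive count */
			break;
		}

		ret = mock_migrate_pages(r);
		if (ret == -ENOMEM)
			break;			/* fatal, do not retry */
		pending = ret > 0 ? ret : 0;
	}

	if (ret < 0) {
		*putback = 1;	/* models putback_movable_pages() */
		return ret;
	}
	return 0;
}
```

In this model a positive return from the migration step keeps the loop going,
so the loop can only be left with 0, -EBUSY or -ENOMEM.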

-- 
Thanks,

David / dhildenb


* Re: [PATCH v5 1/5] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range
  2021-03-17 14:42     ` David Hildenbrand
@ 2021-03-17 14:49       ` Michal Hocko
  0 siblings, 0 replies; 33+ messages in thread
From: Michal Hocko @ 2021-03-17 14:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, Andrew Morton, Vlastimil Babka, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Wed 17-03-21 15:42:43, David Hildenbrand wrote:
> On 17.03.21 15:05, Michal Hocko wrote:
> > On Wed 17-03-21 12:12:47, Oscar Salvador wrote:
> > > Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or -EBUSY,
> > > and report them down the chain.
> > > The problem is that when migrate_pages() reports -ENOMEM, we keep going till we
> > > exhaust all the try-attempts (5 at the moment) instead of bailing out.
> > > 
> > > migrate_pages() bails out right away on -ENOMEM because it is considered a fatal
> > > error. Do the same here instead of keep going and retrying.
> > 
> > I suspect this is not really a real life problem, right? The allocation
> > would be more costly in the end but this is to be expected under a heavy
> > memory pressure.
> > 
> > That being said, bailing out early makes sense to me. But now that
> > you've made me look into the migrate_pages excellent error state reporting
> > I suspect we have a bug here. Note the
> > "Returns the number of pages that were not migrated, or an error code."
> > 
> > but I do not see putback_movable_pages for ret > 0 so it seems we might
> > leak some pages.
> 
> At least in __alloc_contig_migrate_range() we seem to always leave the loop
> with ret <= 0 and do a putback_movable_pages() with ret < 0.
> 
> Which code are you referring to?

OK, my bad. I have managed to confuse myself around the retry bailout
which indeed overrides the return value. So there is no bug. Sorry about
the noise but I still believe making migrate_pages less tricky with
error handling would be an improvement.

Thanks!
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v5 2/5] mm,compaction: Let isolate_migratepages_{range,block} return error codes
  2021-03-17 14:38     ` Oscar Salvador
@ 2021-03-17 14:59       ` Michal Hocko
  2021-03-18  9:50         ` Vlastimil Babka
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-03-17 14:59 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Wed 17-03-21 15:38:35, Oscar Salvador wrote:
> On Wed, Mar 17, 2021 at 03:12:29PM +0100, Michal Hocko wrote:
> > > Since isolate_migratepages_block will stop returning the next pfn to be
> > > scanned, we reuse the cc->migrate_pfn field to keep track of that.
> > 
> > > This looks hackish and I cannot really tell whether users of cc->migrate_pfn
> > > work as intended.
> 
> When discussing this with Vlastimil, I came up with the idea of adding a new
> field in compact_control struct, e.g: next_pfn_scan to keep track of the next
> pfn to be scanned.
> 
> But Vlastimil made me realize that cc->migrate_pfn already points to that,
> so we do not need any extra field.

This deserves a big fat comment.

> > > @@ -810,6 +811,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> > >  	unsigned long next_skip_pfn = 0;
> > >  	bool skip_updated = false;
> > >  
> > > +	cc->migrate_pfn = low_pfn;
> > > +
> > >  	/*
> > >  	 * Ensure that there are not too many pages isolated from the LRU
> > >  	 * list by either parallel reclaimers or compaction. If there are,
> > > @@ -818,16 +821,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> > >  	while (unlikely(too_many_isolated(pgdat))) {
> > >  		/* stop isolation if there are still pages not migrated */
> > >  		if (cc->nr_migratepages)
> > > -			return 0;
> > > +			return -EINTR;
> > >  
> > >  		/* async migration should just abort */
> > >  		if (cc->mode == MIGRATE_ASYNC)
> > > -			return 0;
> > > +			return -EINTR;
> > 
> > EINTR for anything other than signal based bail out is really confusing.
> 
> When coding that, I thought about using -1 for the first two checks, and keep
> -EINTR for the signal check, but isolate_migratepages_block only has two users:

No, do not mix error reporting with different semantics. Either make it an
errno, or return -1 for all failures if you do not care which error that
is. You do care, hence this patch, so make it an errno; the above two
should simply be EAGAIN, as this is a congestion situation.
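
A small userspace sketch of that mapping, with hypothetical names (mock_state,
mock_isolation_precheck). It simplifies the real loop, which would wait and
re-check while too many pages are isolated; the sketch just proceeds in that
case:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical snapshot of the conditions checked before isolating. */
struct mock_state {
	bool too_many_isolated;	/* congestion: too many pages isolated */
	bool pages_pending;	/* models cc->nr_migratepages != 0 */
	bool async;		/* models cc->mode == MIGRATE_ASYNC */
	bool fatal_signal;	/* models fatal_signal_pending(current) */
};

/* Return 0 to proceed, -EAGAIN on congestion, -EINTR on a fatal signal. */
static int mock_isolation_precheck(const struct mock_state *s)
{
	if (s->too_many_isolated) {
		/* Congestion: the caller may come back later. */
		if (s->pages_pending || s->async)
			return -EAGAIN;
		/* Only a pending fatal signal is reported as -EINTR. */
		if (s->fatal_signal)
			return -EINTR;
	}
	return 0;
}
```

Keeping -EINTR for the signal case lets a caller tell "interrupted, give up"
apart from "congested, worth retrying".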

> - isolate_migratepages: Does not care about the return code other than pfn != 0,
>   and it does not pass the error down the chain.
> - isolate_migratepages_range: The error is passed down the chain, and !pfn is being
>   treated as -EINTR:
> 
> static int __alloc_contig_migrate_range(struct compact_control *cc,
> 					unsigned long start, unsigned long end)
>  {
>   ...
>   ...
>   pfn = isolate_migratepages_range(cc, pfn, end);
>   if (!pfn) {
>           ret = -EINTR;
>           break;
>   }
>   ...
>  }
> 
> That is why I decided to stick with -EINTR.

I suspect this is only because there was not really a better way to report
the failure, so it went with EINTR, which makes alloc_contig_range bail
out. The high-level handling there is quite dubious, as EAGAIN is already
possible from the page migration path and that shouldn't be a fatal
failure.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v5 5/5] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig
  2021-03-17 14:36     ` David Hildenbrand
@ 2021-03-17 15:03       ` Michal Hocko
  2021-03-18  8:44         ` Oscar Salvador
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-03-17 15:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, Andrew Morton, Vlastimil Babka, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Wed 17-03-21 15:36:35, David Hildenbrand wrote:
> On 17.03.21 15:31, Michal Hocko wrote:
> > On Wed 17-03-21 12:12:51, Oscar Salvador wrote:
> > > pfn_range_valid_contig() bails out when it finds an in-use page or a
> > > hugetlb page, among other things.
> > > We can drop the in-use page check since __alloc_contig_pages can migrate
> > > away those pages, and the hugetlb page check can go too since
> > > isolate_migratepages_range is now capable of dealing with hugetlb pages.
> > > Either way, those checks are racy so let the end function handle it
> > > when the time comes.
> > 
> > I haven't realized PageHuge check is done this early. This means that
> > previous patches are not actually active until now which is not really
> > > great for bisectability. Can we remove the PageHuge check earlier?
> 
> alloc_contig_pages() vs. alloc_contig_range(). The patches are active for
> virtio-mem and CMA AFAIKS.

yeah, I meant to say "are not actually fully active".
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v5 5/5] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig
  2021-03-17 15:03       ` Michal Hocko
@ 2021-03-18  8:44         ` Oscar Salvador
  2021-03-18  8:55           ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-03-18  8:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Andrew Morton, Vlastimil Babka, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Wed, Mar 17, 2021 at 04:03:06PM +0100, Michal Hocko wrote:
> > alloc_contig_pages() vs. alloc_contig_range(). The patches are active for
> > virtio-mem and CMA AFAIKS.
> 
> yeah, I meant to say "are not actually fully active".

We could place this patch earlier in this patchset.
The only thing is that we would lose the prevalidation (at least for
HugeTLB pages), which is done upfront, only to find out later on that we do
not support hugetlb handling in isolate_migratepages_block.
So the bad thing about placing it earlier is that, wrt. hugetlb pages,
alloc_gigantic_page will take longer to fail (when we already know that
it will fail).

Then we have the page_count check, which is also racy, and
isolate_migratepages_block will take care of it.
So I guess we can think of this patch as a preparatory patch that removes racy
checks that will be re-checked later on in the end function which does
the actual handling.

What do you think?

-- 
Oscar Salvador
SUSE L3

* Re: [PATCH v5 4/5] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-03-17 14:26   ` Michal Hocko
@ 2021-03-18  8:54     ` Oscar Salvador
  2021-03-18  9:29       ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-03-18  8:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Wed, Mar 17, 2021 at 03:26:50PM +0100, Michal Hocko wrote:
> it would be imho better to retry inside alloc_and_dissolve_huge_page
> because it already has its retry logic implemented.
> 
> But not something I will insist on.

Ok, what about this (I did not even compile it yet, but it gives a rough
idea):

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index bcff86ca616f..a37b4ce86e58 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -583,7 +583,7 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };

-int isolate_or_dissolve_huge_page(struct page *page);
+int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -866,7 +866,8 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};

-static inline int isolate_or_dissolve_huge_page(struct page *page)
+static inline int isolate_or_dissolve_huge_page(struct page *page,
+						struct list_head *list)
 {
 	return -ENOMEM;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index 9f253fc3b4f9..6e47855fd154 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -910,7 +910,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		}

 		if (PageHuge(page) && cc->alloc_contig) {
-			ret = isolate_or_dissolve_huge_page(page);
+			ret = isolate_or_dissolve_huge_page(page, &cc->migratepages);

 			/*
 			 * Fail isolation in case isolate_or_dissolve_huge_page
@@ -927,6 +927,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 				goto isolate_fail;
 			}

+			if (PageHuge(page)) {
+				/*
+				 * Hugepage was successfully isolated and placed
+				 * on the cc->migratepages list.
+				 */
+				low_pfn += compound_nr(page) - 1;
+				goto isolate_success_no_list;
+			}
+
 			/*
 			 * Ok, the hugepage was dissolved. Now these pages are
 			 * Buddy and cannot be re-allocated because they are
@@ -1068,6 +1077,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,

 isolate_success:
 		list_add(&page->lru, &cc->migratepages);
+isolate_success_no_list:
 		cc->nr_migratepages += compound_nr(page);
 		nr_isolated += compound_nr(page);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3194c1bd9e32..87227224c03b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2257,7 +2257,8 @@ static void restore_reserve_on_error(struct hstate *h,
  * Returns 0 on success, otherwise negated error.
  */

-static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
+static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
+					struct list_head *list)
 {
 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 	int nid = page_to_nid(old_page);
@@ -2287,10 +2288,12 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
 		goto unlock;
 	} else if (page_count(old_page)) {
 		/*
-		 * Someone has grabbed the page, fail for now.
+		 * Someone has grabbed the page, try to isolate it here.
+		 * Fail with -EBUSY if not possible.
 		 */
-		ret = -EBUSY;
 		update_and_free_page(h, new_page);
+		if (!isolate_huge_page(old_page, list))
+			ret = -EBUSY;
 		goto unlock;
 	} else if (!HPageFreed(old_page)) {
 		/*
@@ -2319,10 +2322,11 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
 	return ret;
 }

-int isolate_or_dissolve_huge_page(struct page *page)
+int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
 {
 	struct hstate *h;
 	struct page *head;
+	int ret = -EBUSY;

 	/*
 	 * The page might have been dissolved from under our feet, so make sure
@@ -2347,7 +2351,12 @@ int isolate_or_dissolve_huge_page(struct page *page)
 	if (hstate_is_gigantic(h))
 		return -ENOMEM;

-	return alloc_and_dissolve_huge_page(h, head);
+	if (page_count(head) && isolate_huge_page(head, list))
+		ret = 0;
+	else if (!page_count(head))
+		ret = alloc_and_dissolve_huge_page(h, head, list);
+
+	return ret;
 }

 struct page *alloc_huge_page(struct vm_area_struct *vma,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 562e87cbd7a1..42aaef30633e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1507,8 +1507,9 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 	LIST_HEAD(clean_pages);

 	list_for_each_entry_safe(page, next, page_list, lru) {
-		if (page_is_file_lru(page) && !PageDirty(page) &&
-		    !__PageMovable(page) && !PageUnevictable(page)) {
+		if (!PageHuge(page) && page_is_file_lru(page) &&
+		    !PageDirty(page) && !__PageMovable(page) &&
+		    !PageUnevictable(page)) {
 			ClearPageActive(page);
 			list_move(&page->lru, &clean_pages);
 		}
--
2.16.3




-- 
Oscar Salvador
SUSE L3


* Re: [PATCH v5 5/5] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig
  2021-03-18  8:44         ` Oscar Salvador
@ 2021-03-18  8:55           ` Michal Hocko
  0 siblings, 0 replies; 33+ messages in thread
From: Michal Hocko @ 2021-03-18  8:55 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: David Hildenbrand, Andrew Morton, Vlastimil Babka, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Thu 18-03-21 09:44:19, Oscar Salvador wrote:
> On Wed, Mar 17, 2021 at 04:03:06PM +0100, Michal Hocko wrote:
> > > alloc_contig_pages() vs. alloc_contig_range(). The patches are active for
> > > virtio-mem and CMA AFAIKS.
> > 
> > yeah, I meant to say "are not actually fully active".
> 
> We could place this patch earlier in this patchset.
> The only thing is that we would lose the prevalidation (at least for
> HugeTLB page) which is done upfront to find later on that we do not
> support hugetlb handling in isolate_migratepages_block.
> So the bad thing about placing it earlier, is that wrt. hugetlb pages,
> alloc_gigantic_page will take longer to fail (when we already know that
> will fail).

From a bisectability POV this shouldn't be a major concern. If you are
too worried then just drop the HugePage check in the patch allowing to
migrate free hugetlb pages. It is unlikely that somebody will run with
that patch alone but if the said patch introduces some sort of bug it
would be good to bisect down to it.

> Then we have the page_count check, which is also racy and
> isolate_migratepages_block will take care of it.
> So I guess we can think of this patch as a preparatory patch that removes racy
> checks that will be re-checked later on in the end function which does
> the actual handling.

TBH, I do not care much about the page count check. It is completely
orthogonal to the migration changes in this series. So both preparatory
and follow up are ok.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v5 4/5] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-03-18  8:54     ` Oscar Salvador
@ 2021-03-18  9:29       ` Michal Hocko
  2021-03-18  9:59         ` Oscar Salvador
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-03-18  9:29 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Thu 18-03-21 09:54:01, Oscar Salvador wrote:
[...]
> @@ -2287,10 +2288,12 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
>  		goto unlock;
>  	} else if (page_count(old_page)) {
>  		/*
> -		 * Someone has grabbed the page, fail for now.
> +		 * Someone has grabbed the page, try to isolate it here.
> +		 * Fail with -EBUSY if not possible.
>  		 */
> -		ret = -EBUSY;
>  		update_and_free_page(h, new_page);
> +		if (!isolate_huge_page(old_page, list))
> +			ret = -EBUSY;
>  		goto unlock;
>  	} else if (!HPageFreed(old_page)) {

I do not think you want to call isolate_huge_page with hugetlb_lock
held. You would need to drop the lock before calling isolate_huge_page.
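
As a rough userspace illustration of the locking concern (not kernel code):
the sketch below assumes that isolate_huge_page() acquires hugetlb_lock
itself, so calling it with the lock already held would self-deadlock on a
non-recursive spinlock. Every model_* name is a hypothetical stand-in, and
the lock is a plain flag so that the would-be deadlock is reported as
-EDEADLK instead of hanging.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for hugetlb_lock: a flag instead of a real spinlock. */
struct model_lock {
	bool held;
};

/* Reports -EDEADLK where a real non-recursive spinlock would spin forever. */
static int model_lock_acquire(struct model_lock *l)
{
	if (l->held)
		return -EDEADLK;
	l->held = true;
	return 0;
}

static void model_lock_release(struct model_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Stand-in for isolate_huge_page(): assumed to take the lock internally. */
static int model_isolate(struct model_lock *l, bool *isolated)
{
	int err = model_lock_acquire(l);

	if (err)
		return err;
	*isolated = true;
	model_lock_release(l);
	return 0;
}

/*
 * Stand-in for the "someone has grabbed the page" branch, which is entered
 * with the lock held. With drop_lock_first == false the nested acquire fails.
 */
static int model_handle_busy_page(bool drop_lock_first)
{
	struct model_lock lock = { .held = false };
	bool isolated = false;
	int ret;

	model_lock_acquire(&lock);		/* the enclosing critical section */
	if (drop_lock_first)
		model_lock_release(&lock);
	ret = model_isolate(&lock, &isolated);
	if (!drop_lock_first)
		model_lock_release(&lock);	/* unwind the outer hold */
	if (ret)
		return ret;
	return isolated ? 0 : -EBUSY;
}
```

In this model, model_handle_busy_page(true) succeeds, while
model_handle_busy_page(false) reports the nested acquisition. Because the
lock is already dropped on the working path, that path can return directly
without re-taking it.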
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v5 2/5] mm,compaction: Let isolate_migratepages_{range,block} return error codes
  2021-03-17 14:59       ` Michal Hocko
@ 2021-03-18  9:50         ` Vlastimil Babka
  2021-03-18 10:22           ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Vlastimil Babka @ 2021-03-18  9:50 UTC (permalink / raw)
  To: Michal Hocko, Oscar Salvador
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Mike Kravetz,
	linux-mm, linux-kernel

On 3/17/21 3:59 PM, Michal Hocko wrote:
> On Wed 17-03-21 15:38:35, Oscar Salvador wrote:
>> On Wed, Mar 17, 2021 at 03:12:29PM +0100, Michal Hocko wrote:
>> > > Since isolate_migratepages_block will stop returning the next pfn to be
>> > > scanned, we reuse the cc->migrate_pfn field to keep track of that.
>> > 
>> > This looks hackish and I cannot really tell that users of cc->migrate_pfn
>> > work as intended.

We did check those in detail. Of course it's possible to overlook something...

The alloc_contig_range user never cared about cc->migrate_pfn. compaction
(isolate_migratepages() -> isolate_migratepages_block()) did, and
isolate_migratepages_block() returned the pfn only to be assigned to
cc->migrate_pfn in isolate_migratepages(). I think it's now better that
isolate_migratepages_block() sets it.
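
A minimal userspace sketch of that interface shape, under stated
assumptions: the scanner returns an error code and records the next pfn in
the control structure instead of returning it. struct model_cc,
model_isolate_block() and the page_err array are hypothetical names for
illustration; recording the stop position on failure is a choice made by
this sketch, not a statement about what the patch does.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, trimmed-down stand-in for struct compact_control. */
struct model_cc {
	unsigned long migrate_pfn;	/* in: search base; out: next pfn to scan */
	unsigned long nr_isolated;	/* pages isolated so far */
};

/*
 * Stand-in for isolate_migratepages_block(): scans [cc->migrate_pfn, end_pfn).
 * page_err[pfn] == 0 models a page that can be isolated; a negative value
 * models an error that aborts the scan at that pfn.
 */
static int model_isolate_block(struct model_cc *cc, unsigned long end_pfn,
			       const int *page_err)
{
	unsigned long pfn;
	int ret = 0;

	assert(cc->migrate_pfn <= end_pfn);

	for (pfn = cc->migrate_pfn; pfn < end_pfn; pfn++) {
		if (page_err[pfn]) {
			ret = page_err[pfn];
			break;
		}
		cc->nr_isolated++;
	}

	/* The cursor lives in the control structure, not in the return value. */
	cc->migrate_pfn = pfn;
	return ret;
}
```

The return value is then free to carry 0 or a negative errno, and a caller
that wants to resume scanning reads cc->migrate_pfn.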

>> When discussing this with Vlastimil, I came up with the idea of adding a new
>> field in compact_control struct, e.g: next_pfn_scan to keep track of the next
>> pfn to be scanned.
>> 
>> But Vlastimil made me realize that cc->migrate_pfn points to that already,
>> so we do not need any extra field.

Yes, the first patch had at some point:

	/* Record where migration scanner will be restarted. */
	cc->migrate_pfn = cc->the_new_field;

Which was a clear sign that the new field is unnecessary.

> This deserves a big fat comment.

Comment where, saying what? :)



* Re: [PATCH v5 4/5] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-03-18  9:29       ` Michal Hocko
@ 2021-03-18  9:59         ` Oscar Salvador
  2021-03-18 10:12           ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-03-18  9:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Thu, Mar 18, 2021 at 10:29:57AM +0100, Michal Hocko wrote:
> On Thu 18-03-21 09:54:01, Oscar Salvador wrote:
> [...]
> > @@ -2287,10 +2288,12 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
> >  		goto unlock;
> >  	} else if (page_count(old_page)) {
> >  		/*
> > -		 * Someone has grabbed the page, fail for now.
> > +		 * Someone has grabbed the page, try to isolate it here.
> > +		 * Fail with -EBUSY if not possible.
> >  		 */
> > -		ret = -EBUSY;
> >  		update_and_free_page(h, new_page);
> > +		if (!isolate_huge_page(old_page, list))
> > +			ret = -EBUSY;
> >  		goto unlock;
> >  	} else if (!HPageFreed(old_page)) {
> 
> I do not think you want to call isolate_huge_page with hugetlb_lock
> held. You would need to drop the lock before calling isolate_huge_page.

Yeah, that was an oversight. As I said I did not compile it (let alone
test it), otherwise the system would have screamed I guess.

I was more interested in knowing how it looked wrt. retry
concerns:

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index bcff86ca616f..a37b4ce86e58 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -583,7 +583,7 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };

-int isolate_or_dissolve_huge_page(struct page *page);
+int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -866,7 +866,8 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};

-static inline int isolate_or_dissolve_huge_page(struct page *page)
+static inline int isolate_or_dissolve_huge_page(struct page *page,
+						struct list_head *list)
 {
 	return -ENOMEM;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index 9f253fc3b4f9..6e47855fd154 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -910,7 +910,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		}

 		if (PageHuge(page) && cc->alloc_contig) {
-			ret = isolate_or_dissolve_huge_page(page);
+			ret = isolate_or_dissolve_huge_page(page, &cc->migratepages);

 			/*
 			 * Fail isolation in case isolate_or_dissolve_huge_page
@@ -927,6 +927,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 				goto isolate_fail;
 			}

+			if (PageHuge(page)) {
+				/*
+				 * Hugepage was successfully isolated and placed
+				 * on the cc->migratepages list.
+				 */
+				low_pfn += compound_nr(page) - 1;
+				goto isolate_success_no_list;
+			}
+
 			/*
 			 * Ok, the hugepage was dissolved. Now these pages are
 			 * Buddy and cannot be re-allocated because they are
@@ -1068,6 +1077,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,

 isolate_success:
 		list_add(&page->lru, &cc->migratepages);
+isolate_success_no_list:
 		cc->nr_migratepages += compound_nr(page);
 		nr_isolated += compound_nr(page);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3194c1bd9e32..f55fa6acc6f9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2257,7 +2257,8 @@ static void restore_reserve_on_error(struct hstate *h,
  * Returns 0 on success, otherwise negated error.
  */

-static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
+static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
+					struct list_head *list)
 {
 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 	int nid = page_to_nid(old_page);
@@ -2287,10 +2288,14 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
 		goto unlock;
 	} else if (page_count(old_page)) {
 		/*
-		 * Someone has grabbed the page, fail for now.
+		 * Someone has grabbed the page, try to isolate it here.
+		 * Fail with -EBUSY if not possible.
 		 */
-		ret = -EBUSY;
 		update_and_free_page(h, new_page);
+		spin_unlock(&hugetlb_lock);
+		if (!isolate_huge_page(old_page, list))
+			ret = -EBUSY;
+		spin_lock(&hugetlb_lock);
 		goto unlock;
 	} else if (!HPageFreed(old_page)) {
 		/*
@@ -2319,10 +2324,11 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
 	return ret;
 }

-int isolate_or_dissolve_huge_page(struct page *page)
+int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
 {
 	struct hstate *h;
 	struct page *head;
+	int ret = -EBUSY;

 	/*
 	 * The page might have been dissolved from under our feet, so make sure
@@ -2347,7 +2353,12 @@ int isolate_or_dissolve_huge_page(struct page *page)
 	if (hstate_is_gigantic(h))
 		return -ENOMEM;

-	return alloc_and_dissolve_huge_page(h, head);
+	if (page_count(head) && isolate_huge_page(head, list))
+		ret = 0;
+	else if (!page_count(head))
+		ret = alloc_and_dissolve_huge_page(h, head, list);
+
+	return ret;
 }

 struct page *alloc_huge_page(struct vm_area_struct *vma,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 562e87cbd7a1..42aaef30633e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1507,8 +1507,9 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 	LIST_HEAD(clean_pages);

 	list_for_each_entry_safe(page, next, page_list, lru) {
-		if (page_is_file_lru(page) && !PageDirty(page) &&
-		    !__PageMovable(page) && !PageUnevictable(page)) {
+		if (!PageHuge(page) && page_is_file_lru(page) &&
+		    !PageDirty(page) && !__PageMovable(page) &&
+		    !PageUnevictable(page)) {
 			ClearPageActive(page);
 			list_move(&page->lru, &clean_pages);
 		}


The spin_lock after the isolate_huge_page() in
alloc_and_dissolve_huge_page() could probably be spared by placing a
goto out directly before the return.
But just a POC.

-- 
Oscar Salvador
SUSE L3


* Re: [PATCH v5 4/5] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-03-18  9:59         ` Oscar Salvador
@ 2021-03-18 10:12           ` Michal Hocko
  0 siblings, 0 replies; 33+ messages in thread
From: Michal Hocko @ 2021-03-18 10:12 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Thu 18-03-21 10:59:10, Oscar Salvador wrote:
> On Thu, Mar 18, 2021 at 10:29:57AM +0100, Michal Hocko wrote:
> > On Thu 18-03-21 09:54:01, Oscar Salvador wrote:
> > [...]
> > > @@ -2287,10 +2288,12 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
> > >  		goto unlock;
> > >  	} else if (page_count(old_page)) {
> > >  		/*
> > > -		 * Someone has grabbed the page, fail for now.
> > > +		 * Someone has grabbed the page, try to isolate it here.
> > > +		 * Fail with -EBUSY if not possible.
> > >  		 */
> > > -		ret = -EBUSY;
> > >  		update_and_free_page(h, new_page);
> > > +		if (!isolate_huge_page(old_page, list))
> > > +			ret = -EBUSY;
> > >  		goto unlock;
> > >  	} else if (!HPageFreed(old_page)) {
> > 
> > I do not think you want to call isolate_huge_page with hugetlb_lock
> > held. You would need to drop the lock before calling isolate_huge_page.
> 
> Yeah, that was an oversight. As I said I did not compile it (let alone
> test it), otherwise the system would have screamed I guess.
> 
> I was more interested in knowing how it looked wrt. retry
> concerns:

Yes this looks much more to my taste. If we need to retry then it could
just goto retry there. The caller doesn't really have to care.
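
A small userspace sketch of what keeping the retry inside the helper could
look like, so that the caller only ever sees a final result. All names are
hypothetical, the page is modelled as a scripted sequence of observed
states, and the retry bound (MODEL_MAX_RETRIES) is an assumption of this
sketch rather than something taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_RETRIES 1	/* assumed bound, for illustration only */

enum model_state {
	MODEL_FREE,		/* free huge page: can be dissolved */
	MODEL_IN_USE,		/* refcount held: isolate it for migration */
	MODEL_TRANSIENT,	/* in between states: worth another look */
};

/* Hypothetical page whose state on each successive look is scripted. */
struct model_page {
	const enum model_state *seen;
	size_t nr_seen;
	size_t looks;		/* how many times the helper inspected it */
	bool isolated;
};

static enum model_state model_peek(struct model_page *p)
{
	size_t i;

	assert(p->nr_seen > 0);
	/* Once the script runs out, keep reporting its last state. */
	i = p->looks < p->nr_seen ? p->looks : p->nr_seen - 1;
	p->looks++;
	return p->seen[i];
}

/* The retry is handled here; callers get 0 or -EBUSY and nothing else. */
static int model_alloc_and_dissolve(struct model_page *p)
{
	int retries_left = MODEL_MAX_RETRIES;

retry:
	switch (model_peek(p)) {
	case MODEL_FREE:
		return 0;
	case MODEL_IN_USE:
		p->isolated = true;
		return 0;
	case MODEL_TRANSIENT:
		if (retries_left-- > 0)
			goto retry;
		break;
	}
	return -EBUSY;
}
```

A page scripted as transient and then free succeeds after two looks; one
that stays transient yields -EBUSY once the assumed retry budget is spent.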

> @@ -2287,10 +2288,14 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
>  		goto unlock;
>  	} else if (page_count(old_page)) {
>  		/*
> -		 * Someone has grabbed the page, fail for now.
> +		 * Someone has grabbed the page, try to isolate it here.
> +		 * Fail with -EBUSY if not possible.
>  		 */
> -		ret = -EBUSY;
>  		update_and_free_page(h, new_page);
> +		spin_unlock(&hugetlb_lock);
> +		if (!isolate_huge_page(old_page, list))
> +			ret = -EBUSY;
> +		spin_lock(&hugetlb_lock);
>  		goto unlock;

simply return ret; here
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v5 2/5] mm,compaction: Let isolate_migratepages_{range,block} return error codes
  2021-03-18  9:50         ` Vlastimil Babka
@ 2021-03-18 10:22           ` Michal Hocko
  2021-03-18 11:10             ` Vlastimil Babka
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-03-18 10:22 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Oscar Salvador, Andrew Morton, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Thu 18-03-21 10:50:38, Vlastimil Babka wrote:
> On 3/17/21 3:59 PM, Michal Hocko wrote:
> > On Wed 17-03-21 15:38:35, Oscar Salvador wrote:
> >> On Wed, Mar 17, 2021 at 03:12:29PM +0100, Michal Hocko wrote:
> >> > > Since isolate_migratepages_block will stop returning the next pfn to be
> >> > > scanned, we reuse the cc->migrate_pfn field to keep track of that.
> >> > 
> >> > This looks hackish and I cannot really tell that users of cc->migrate_pfn
> >> > work as intended.
> 
> We did check those in detail. Of course it's possible to overlook something...
> 
> The alloc_contig_range user never cared about cc->migrate_pfn. compaction
> (isolate_migratepages() -> isolate_migratepages_block()) did, and
> isolate_migratepages_block() returned the pfn only to be assigned to
> cc->migrate_pfn in isolate_migratepages(). I think it's now better that
> isolate_migratepages_block() sets it.
> 
> >> When discussing this with Vlastimil, I came up with the idea of adding a new
> >> field in compact_control struct, e.g: next_pfn_scan to keep track of the next
> >> pfn to be scanned.
> >> 
> >> But Vlastimil made me realize that cc->migrate_pfn points to that already,
> >> so we do not need any extra field.
> 
> Yes, the first patch had at some point:
> 
> 	/* Record where migration scanner will be restarted. */
> 	cc->migrate_pfn = cc->the_new_field;
> 
> Which was a clear sign that the new field is unnecessary.
> 
> > This deserves a big fat comment.
> 
> Comment where, saying what? :)

E.g. something like the following
diff --git a/mm/internal.h b/mm/internal.h
index 1432feec62df..6c5a9066adf0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -225,7 +225,13 @@ struct compact_control {
 	unsigned int nr_freepages;	/* Number of isolated free pages */
 	unsigned int nr_migratepages;	/* Number of pages to migrate */
 	unsigned long free_pfn;		/* isolate_freepages search base */
-	unsigned long migrate_pfn;	/* isolate_migratepages search base */
+	unsigned long migrate_pfn;	/* Acts as an in/out parameter to page
+					 * isolation.
+					 * isolate_migratepages uses it as a search base.
+					 * isolate_migratepages_block will update the
+					 * value to the next pfn after the last isolated
+					 * one.
+					 */
 	unsigned long fast_start_pfn;	/* a pfn to start linear scan from */
 	struct zone *zone;
 	unsigned long total_migrate_scanned;

Btw isolate_migratepages_block still has this comment which needs
updating
"The cc->migrate_pfn field is neither read nor updated."
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v5 1/5] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range
  2021-03-17 14:05   ` Michal Hocko
  2021-03-17 14:42     ` David Hildenbrand
@ 2021-03-18 11:04     ` Oscar Salvador
  2021-03-18 11:37       ` Michal Hocko
  1 sibling, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-03-18 11:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Wed, Mar 17, 2021 at 03:05:40PM +0100, Michal Hocko wrote:
> That being said, bailing out early makes sense to me. But now that
> you've made me look into the migrate_pages excellent error state reporting
> I suspect we have a bug here. Note the 
> "Returns the number of pages that were not migrated, or an error code."
> 
> but I do not see putback_movable_pages for ret > 0 so it seems we might
> leak some pages.

I fell for the same thing when looking at that code.
It took a while until I realized what was really going on.

> > Signed-off-by: Oscar Salvador <osalvador@suse.de>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > Reviewed-by: David Hildenbrand <david@redhat.com>
> 
> The patch itself looks reasonable but make sure to mention this is mere
> cosmetic change unless there is a real problem fixed by this.
> Acked-by: Michal Hocko <mhocko@suse.com>

What about appending the following in the changelog:

"Note that this is not fixing a real issue, just a cosmetic change. Although
 we can save some cycles by backing off earlier."


-- 
Oscar Salvador
SUSE L3


* Re: [PATCH v5 2/5] mm,compaction: Let isolate_migratepages_{range,block} return error codes
  2021-03-18 10:22           ` Michal Hocko
@ 2021-03-18 11:10             ` Vlastimil Babka
  2021-03-18 11:36               ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Vlastimil Babka @ 2021-03-18 11:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oscar Salvador, Andrew Morton, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On 3/18/21 11:22 AM, Michal Hocko wrote:
> On Thu 18-03-21 10:50:38, Vlastimil Babka wrote:
>> On 3/17/21 3:59 PM, Michal Hocko wrote:
>> > On Wed 17-03-21 15:38:35, Oscar Salvador wrote:
>> >> On Wed, Mar 17, 2021 at 03:12:29PM +0100, Michal Hocko wrote:
>> >> > > Since isolate_migratepages_block will stop returning the next pfn to be
>> >> > > scanned, we reuse the cc->migrate_pfn field to keep track of that.
>> >> > 
>> >> > This looks hackish and I cannot really tell that users of cc->migrate_pfn
>> >> > work as intended.
>> 
>> We did check those in detail. Of course it's possible to overlook something...
>> 
>> The alloc_contig_range user never cared about cc->migrate_pfn. compaction
>> (isolate_migratepages() -> isolate_migratepages_block()) did, and
>> isolate_migratepages_block() returned the pfn only to be assigned to
>> cc->migrate_pfn in isolate_migratepages(). I think it's now better that
>> isolate_migratepages_block() sets it.
>> 
>> >> When discussing this with Vlastimil, I came up with the idea of adding a new
>> >> field in compact_control struct, e.g: next_pfn_scan to keep track of the next
>> >> pfn to be scanned.
>> >> 
>> >> But Vlastimil made me realize that cc->migrate_pfn points to that already,
>> >> so we do not need any extra field.
>> 
>> Yes, the first patch had at some point:
>> 
>> 	/* Record where migration scanner will be restarted. */
>> 	cc->migrate_pfn = cc->the_new_field;
>> 
>> Which was a clear sign that the new field is unnecessary.
>> 
>> > This deserves a big fat comment.
>> 
>> Comment where, saying what? :)
> 
> E.g. something like the following
> diff --git a/mm/internal.h b/mm/internal.h
> index 1432feec62df..6c5a9066adf0 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -225,7 +225,13 @@ struct compact_control {
>  	unsigned int nr_freepages;	/* Number of isolated free pages */
>  	unsigned int nr_migratepages;	/* Number of pages to migrate */
>  	unsigned long free_pfn;		/* isolate_freepages search base */
> -	unsigned long migrate_pfn;	/* isolate_migratepages search base */
> +	unsigned long migrate_pfn;	/* Acts as an in/out parameter to page
> +					 * isolation.
> +					 * isolate_migratepages uses it as a search base.
> +					 * isolate_migratepages_block will update the
> +					 * value to the next pfn after the last isolated
> +					 * one.
> +					 */

Fair enough. I would even stop pretending we might cram something useful in the
rest of the line, and move all the comments to blocks before the variables.
There might be more of them that would deserve more thorough description.

>  	unsigned long fast_start_pfn;	/* a pfn to start linear scan from */
>  	struct zone *zone;
>  	unsigned long total_migrate_scanned;
> 
> Btw isolate_migratepages_block still has this comment which needs
> updating
> "The cc->migrate_pfn field is neither read nor updated."

Good catch.


* Re: [PATCH v5 2/5] mm,compaction: Let isolate_migratepages_{range,block} return error codes
  2021-03-18 11:10             ` Vlastimil Babka
@ 2021-03-18 11:36               ` Michal Hocko
  2021-03-19  9:57                 ` Oscar Salvador
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-03-18 11:36 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Oscar Salvador, Andrew Morton, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Thu 18-03-21 12:10:14, Vlastimil Babka wrote:
> On 3/18/21 11:22 AM, Michal Hocko wrote:
[...]
> > E.g. something like the following
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 1432feec62df..6c5a9066adf0 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -225,7 +225,13 @@ struct compact_control {
> >  	unsigned int nr_freepages;	/* Number of isolated free pages */
> >  	unsigned int nr_migratepages;	/* Number of pages to migrate */
> >  	unsigned long free_pfn;		/* isolate_freepages search base */
> > -	unsigned long migrate_pfn;	/* isolate_migratepages search base */
> > +	unsigned long migrate_pfn;	/* Acts as an in/out parameter to page
> > +					 * isolation.
> > +					 * isolate_migratepages uses it as a search base.
> > +					 * isolate_migratepages_block will update the
> > +					 * value to the next pfn after the last isolated
> > +					 * one.
> > +					 */
> 
> Fair enough. I would even stop pretending we might cram something useful in the
> rest of the line, and move all the comments to blocks before the variables.
> There might be more of them that would deserve more thorough description.

Yeah, makes sense. I am not a fan of the above form of documentation.
Btw. maybe renaming the field would be even better, both to convey the
intention and to force a review of all existing users. I would go with pfn_iter or
something that wouldn't make it sound like migration specific.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v5 1/5] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range
  2021-03-18 11:04     ` Oscar Salvador
@ 2021-03-18 11:37       ` Michal Hocko
  0 siblings, 0 replies; 33+ messages in thread
From: Michal Hocko @ 2021-03-18 11:37 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Vlastimil Babka, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Thu 18-03-21 12:04:00, Oscar Salvador wrote:
> On Wed, Mar 17, 2021 at 03:05:40PM +0100, Michal Hocko wrote:
> > That being said, bailing out early makes sense to me. But now that
> > you've made me look into the migrate_pages excellent error state reporting
> > I suspect we have a bug here. Note the 
> > "Returns the number of pages that were not migrated, or an error code."
> > 
> > but I do not see putback_movable_pages for ret > 0 so it seems we might
> > leak some pages.
> 
> I fell for the same thing when looking at that code.
> It took a while until I realized what was really going on.
> 
> > > Signed-off-by: Oscar Salvador <osalvador@suse.de>
> > > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > > Reviewed-by: David Hildenbrand <david@redhat.com>
> > 
> > The patch itself looks reasonable but make sure to mention this is mere
> > cosmetic change unless there is a real problem fixed by this.
> > Acked-by: Michal Hocko <mhocko@suse.com>
> 
> What about appending the following in the changelog:
> 
> "Note that this is not fixing a real issue, just a cosmetic change. Although
>  we can save some cycles by backing off earlier."

Sounds good to me.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v5 2/5] mm,compaction: Let isolate_migratepages_{range,block} return error codes
  2021-03-18 11:36               ` Michal Hocko
@ 2021-03-19  9:57                 ` Oscar Salvador
  2021-03-19 10:14                   ` Vlastimil Babka
  0 siblings, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-03-19  9:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Andrew Morton, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Thu, Mar 18, 2021 at 12:36:52PM +0100, Michal Hocko wrote:
> Yeah, makes sense. I am not a fan of the above form of documentation.
> Btw. maybe renaming the field would be even better, both to convey the
> intention and to force a review of all existing users. I would go with pfn_iter or
> something that wouldn't make it sound like migration specific.

Just to be sure we are on the same page, you meant something like the following
(wrt. comments):

 /*
  * compact_control is used to track pages being migrated and the free pages
  * they are being migrated to during memory compaction. The free_pfn starts
  * at the end of a zone and migrate_pfn begins at the start. Movable pages
  * are moved to the end of a zone during a compaction run and the run
  * completes when free_pfn <= migrate_pfn
  *
  * freepages:           List of free pages to migrate to
  * migratepages:        List of pages that need to be migrated
  * nr_freepages:        Number of isolated free pages
  ...
  */
  struct compact_control {
          struct list_head freepages;
          ...

With the preface that I am not really familiar with compaction code:

About renaming the variable to something else, I wouldn't do it.
I see migrate_pfn being used in contexts where migration gets mentioned,
e.g: 

 /*
  * Briefly search the free lists for a migration source that already has
  * some free pages to reduce the number of pages that need migration
  * before a pageblock is free.
  */
 fast_find_migrateblock(struct compact_control *cc)
 {
  ...
  unsigned long pfn = cc->migrate_pfn;
 }

isolate_migratepages()
 /* Record where migration scanner will be restarted. */


So, I would either stick with it, or add a new 'iter_pfn'/'next_pfn_scan'
field if we feel the need to.


-- 
Oscar Salvador
SUSE L3


* Re: [PATCH v5 2/5] mm,compaction: Let isolate_migratepages_{range,block} return error codes
  2021-03-19  9:57                 ` Oscar Salvador
@ 2021-03-19 10:14                   ` Vlastimil Babka
  2021-03-19 10:26                     ` Oscar Salvador
  0 siblings, 1 reply; 33+ messages in thread
From: Vlastimil Babka @ 2021-03-19 10:14 UTC (permalink / raw)
  To: Oscar Salvador, Michal Hocko
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Mike Kravetz,
	linux-mm, linux-kernel

On 3/19/21 10:57 AM, Oscar Salvador wrote:
> On Thu, Mar 18, 2021 at 12:36:52PM +0100, Michal Hocko wrote:
>> Yeah, makes sense. I am not a fan of the above form of documentation.
>> Btw. maybe renaming the field would be even better, both from the
>> intention and review all existing users. I would go with pfn_iter or
>> something that wouldn't make it sound like migration specific.
> 
> Just to be sure we are on the same page, you meant something like the following
> (wrt. comments):
> 
>  /*
>   * compact_control is used to track pages being migrated and the free pages
>   * they are being migrated to during memory compaction. The free_pfn starts
>   * at the end of a zone and migrate_pfn begins at the start. Movable pages
>   * are moved to the end of a zone during a compaction run and the run
>   * completes when free_pfn <= migrate_pfn
>   *
>   * freepages:           List of free pages to migrate to
>   * migratepages:        List of pages that need to be migrated
>   * nr_freepages:        Number of isolated free pages
>   ...
>   */
>   struct compact_control {
>           struct list_head freepages;
>           ...

No, I meant this:

--- a/mm/internal.h
+++ b/mm/internal.h
@@ -225,7 +225,13 @@ struct compact_control {
        unsigned int nr_freepages;      /* Number of isolated free pages */
        unsigned int nr_migratepages;   /* Number of pages to migrate */
        unsigned long free_pfn;         /* isolate_freepages search base */
-       unsigned long migrate_pfn;      /* isolate_migratepages search base */
+       /*
+        * Acts as an in/out parameter to page isolation for migration.
+        * isolate_migratepages uses it as a search base.
+        * isolate_migratepages_block will update the value to the next pfn
+        * after the last isolated one.
+        */
+       unsigned long migrate_pfn;
        unsigned long fast_start_pfn;   /* a pfn to start linear scan from */
        struct zone *zone;
        unsigned long total_migrate_scanned;
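
To make the in/out behaviour described in that comment concrete, here is a
minimal userspace sketch. It is not kernel code: struct toy_control and
toy_isolate_block() are hypothetical stand-ins for struct compact_control and
isolate_migratepages_block(), and the rule that every even pfn is isolatable is
made up purely for illustration.

```c
/* Hypothetical stand-in for the kernel's struct compact_control. */
struct toy_control {
	unsigned long migrate_pfn;	/* in: search base, out: where to resume */
	unsigned int nr_migratepages;	/* total pages isolated so far */
};

/*
 * Toy model of an isolation pass over [cc->migrate_pfn, end_pfn).
 * Reads cc->migrate_pfn as the search base and, before returning, writes
 * back the pfn at which the next pass should resume: right after the last
 * isolated page when the batch of 'max' pages fills up, or end_pfn when
 * the range is exhausted. Assumption: even pfns are the isolatable ones.
 */
static unsigned int toy_isolate_block(struct toy_control *cc,
				      unsigned long end_pfn, unsigned int max)
{
	unsigned long pfn = cc->migrate_pfn;
	unsigned int isolated = 0;

	while (pfn < end_pfn && isolated < max) {
		if (pfn % 2 == 0)
			isolated++;
		pfn++;
	}

	cc->migrate_pfn = pfn;		/* the "out" half of the parameter */
	cc->nr_migratepages += isolated;
	return isolated;
}
```

A caller that invokes toy_isolate_block() repeatedly with the same cc never
has to track the position itself; each call continues from where the previous
one stopped.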


> With the preface that I am not really familiar with compaction code:
> 
> About renaming the variable to something else, I wouldn't do it.
> I see migrate_pfn being used in contexts where migration gets mentioned,
> e.g.:

I also don't like the renaming much. "Migration" is important as this is about
pages to be migrated, and there's the "free_pfn" field tracking the scan for
free pages as migration targets. So the name can't be as generic as "pfn_iter".
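
As a rough illustration of why the two fields form a pair, the sketch below
models the two scanners from the compact_control comment quoted earlier:
migrate_pfn starts at the bottom of the zone, free_pfn at the top, and a run is
over once free_pfn <= migrate_pfn. struct toy_scanners and toy_compact_run()
are hypothetical names, and the fixed step per round is an assumption; the real
scanners advance by however much they actually scan.

```c
/* Hypothetical pair of scanner positions within one zone. */
struct toy_scanners {
	unsigned long migrate_pfn;	/* migration scanner, moves upwards */
	unsigned long free_pfn;		/* free-page scanner, moves downwards */
};

/*
 * Advance both scanners by 'step' pfns per round until they meet or cross,
 * i.e. until free_pfn <= migrate_pfn. Returns the number of rounds taken.
 * A step of 0 would never converge, so it is rejected up front.
 */
static unsigned int toy_compact_run(struct toy_scanners *s, unsigned long step)
{
	unsigned int rounds = 0;

	if (step == 0)
		return 0;

	while (s->free_pfn > s->migrate_pfn) {
		s->migrate_pfn += step;
		/* Clamp at 0 so the unsigned subtraction cannot wrap. */
		s->free_pfn = s->free_pfn > step ? s->free_pfn - step : 0;
		rounds++;
	}
	return rounds;
}
```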

>  /*
>   * Briefly search the free lists for a migration source that already has
>   * some free pages to reduce the number of pages that need migration
>   * before a pageblock is free.
>   */
>  fast_find_migrateblock(struct compact_control *cc)
>  {
>   ...
>   unsigned long pfn = cc->migrate_pfn;
>  }
> 
> isolate_migratepages()
>  /* Record where migration scanner will be restarted. */
> 
> 
> So, I would either stick with it, or add a new 'iter_pfn'/'next_pfn_scan'
> field if we feel the need to.
> 
> 



* Re: [PATCH v5 2/5] mm,compaction: Let isolate_migratepages_{range,block} return error codes
  2021-03-19 10:14                   ` Vlastimil Babka
@ 2021-03-19 10:26                     ` Oscar Salvador
  0 siblings, 0 replies; 33+ messages in thread
From: Oscar Salvador @ 2021-03-19 10:26 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Andrew Morton, David Hildenbrand, Muchun Song,
	Mike Kravetz, linux-mm, linux-kernel

On Fri, Mar 19, 2021 at 11:14:25AM +0100, Vlastimil Babka wrote:
> No, I meant this:
> 
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -225,7 +225,13 @@ struct compact_control {
>         unsigned int nr_freepages;      /* Number of isolated free pages */
>         unsigned int nr_migratepages;   /* Number of pages to migrate */
>         unsigned long free_pfn;         /* isolate_freepages search base */
> -       unsigned long migrate_pfn;      /* isolate_migratepages search base */
> +       /*
> +        * Acts as an in/out parameter to page isolation for migration.
> +        * isolate_migratepages uses it as a search base.
> +        * isolate_migratepages_block will update the value to the next pfn
> +        * after the last isolated one.
> +        */
> +       unsigned long migrate_pfn;
>         unsigned long fast_start_pfn;   /* a pfn to start linear scan from */
>         struct zone *zone;
>         unsigned long total_migrate_scanned;

Meh, silly me.
Ok, I will do it that way.

I am also in favour of expanding some of the comments, as some of the
explanations are rather laconic, but I do not think such work fits in this
patchset.

Since I happen to be looking at the compaction code for other reasons, I shall
come back to this matter once I am done with this patchset.

-- 
Oscar Salvador
SUSE L3

Thread overview: 33+ messages
2021-03-17 11:12 [PATCH v5 0/5] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
2021-03-17 11:12 ` [PATCH v5 1/5] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range Oscar Salvador
2021-03-17 14:05   ` Michal Hocko
2021-03-17 14:42     ` David Hildenbrand
2021-03-17 14:49       ` Michal Hocko
2021-03-18 11:04     ` Oscar Salvador
2021-03-18 11:37       ` Michal Hocko
2021-03-17 11:12 ` [PATCH v5 2/5] mm,compaction: Let isolate_migratepages_{range,block} return error codes Oscar Salvador
2021-03-17 14:12   ` Michal Hocko
2021-03-17 14:38     ` Oscar Salvador
2021-03-17 14:59       ` Michal Hocko
2021-03-18  9:50         ` Vlastimil Babka
2021-03-18 10:22           ` Michal Hocko
2021-03-18 11:10             ` Vlastimil Babka
2021-03-18 11:36               ` Michal Hocko
2021-03-19  9:57                 ` Oscar Salvador
2021-03-19 10:14                   ` Vlastimil Babka
2021-03-19 10:26                     ` Oscar Salvador
2021-03-17 11:12 ` [PATCH v5 3/5] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
2021-03-17 14:22   ` Michal Hocko
2021-03-17 11:12 ` [PATCH v5 4/5] mm: Make alloc_contig_range handle in-use " Oscar Salvador
2021-03-17 14:26   ` Michal Hocko
2021-03-18  8:54     ` Oscar Salvador
2021-03-18  9:29       ` Michal Hocko
2021-03-18  9:59         ` Oscar Salvador
2021-03-18 10:12           ` Michal Hocko
2021-03-17 11:12 ` [PATCH v5 5/5] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig Oscar Salvador
2021-03-17 11:15   ` David Hildenbrand
2021-03-17 14:31   ` Michal Hocko
2021-03-17 14:36     ` David Hildenbrand
2021-03-17 15:03       ` Michal Hocko
2021-03-18  8:44         ` Oscar Salvador
2021-03-18  8:55           ` Michal Hocko
