linux-mm.kvack.org archive mirror
* [PATCH v3 0/2] Make alloc_contig_range handle Hugetlb pages
@ 2021-02-22 13:51 Oscar Salvador
  2021-02-22 13:51 ` [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Oscar Salvador @ 2021-02-22 13:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, David Hildenbrand, Muchun Song, Michal Hocko,
	linux-mm, linux-kernel, Oscar Salvador

v2 -> v3:
 - Drop usage of high-level generic helpers in favour of
   low-level approach (per Michal)
 - Check for the page to be marked as PageHugeFreed
 - Add a one-time retry in case someone grabbed the free huge page
   from under us

v1 -> v2:
 - Addressed feedback from Michal
 - Restrict the allocation to a node with __GFP_THISNODE
 - Drop PageHuge check in alloc_and_dissolve_huge_page
 - Re-order comments in isolate_or_dissolve_huge_page
 - Extend comment in isolate_migratepages_block
 - Place put_page right after we got the page, otherwise
   dissolve_free_huge_page will fail

RFC -> v1:
 - Drop RFC
 - Addressed feedback from David and Mike
 - Fence off gigantic pages as there is a cyclic dependency between
   them and alloc_contig_range
 - Re-organize the code to make the race window smaller and to put
   all details in hugetlb code
 - Drop nodemask initialization. First a node will be tried and then we
   will fall back to other nodes containing memory (N_MEMORY). Details in
   patch#1's changelog
 - Count new page as surplus in case we failed to dissolve the old page
   and the new one. Details in patch#1.

Cover letter:

 alloc_contig_range lacks the ability to handle HugeTLB pages.
 This can be problematic for some users, e.g. CMA and virtio-mem, which
 will fail the call if alloc_contig_range ever sees a HugeTLB page, even
 when those pages lie in ZONE_MOVABLE and are free.
 That problem can be easily solved by replacing the page in the free
 hugepage pool.

 In-use HugeTLB pages are no exception, as they can be isolated and
 migrated like any other LRU or Movable page.

 This patchset aims to improve alloc_contig_range->isolate_migratepages_block,
 so HugeTLB pages can be recognized and handled.
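
 As context, a minimal sketch of how such a user reaches this path (the
 PFN range and gfp mask below are made up for illustration, not taken
 from CMA or virtio-mem):

	/*
	 * Hypothetical caller: request a physically contiguous block of
	 * movable memory. Without this series, the call fails as soon as
	 * isolate_migratepages_block() runs into a HugeTLB page, even a
	 * free one sitting in ZONE_MOVABLE.
	 */
	int ret = alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
				     GFP_KERNEL);
	if (ret)	/* e.g. -EBUSY, seen as "PFNs busy" below */
		return ret;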

 Below is an insight from David (thanks), where the problem can clearly be seen:

 "Start a VM with 4G. Hotplug 1G via virtio-mem and online it to
 ZONE_MOVABLE. Allocate 512 huge pages.

 [root@localhost ~]# cat /proc/meminfo
 MemTotal:        5061512 kB
 MemFree:         3319396 kB
 MemAvailable:    3457144 kB
 ...
 HugePages_Total:     512
 HugePages_Free:      512
 HugePages_Rsvd:        0
 HugePages_Surp:        0
 Hugepagesize:       2048 kB


 The huge pages get partially allocated from ZONE_MOVABLE. Try unplugging
 1G via virtio-mem (remember, all ZONE_MOVABLE). Inside the guest:

 [  180.058992] alloc_contig_range: [1b8000, 1c0000) PFNs busy
 [  180.060531] alloc_contig_range: [1b8000, 1c0000) PFNs busy
 [  180.061972] alloc_contig_range: [1b8000, 1c0000) PFNs busy
 [  180.063413] alloc_contig_range: [1b8000, 1c0000) PFNs busy
 [  180.064838] alloc_contig_range: [1b8000, 1c0000) PFNs busy
 [  180.065848] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
 [  180.066794] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
 [  180.067738] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
 [  180.068669] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
 [  180.069598] alloc_contig_range: [1bfc00, 1c0000) PFNs busy"

Oscar Salvador (2):
  mm: Make alloc_contig_range handle free hugetlb pages
  mm: Make alloc_contig_range handle in-use hugetlb pages

 include/linux/hugetlb.h |   7 +++
 mm/compaction.c         |  22 +++++++++
 mm/hugetlb.c            | 124 +++++++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c             |   5 +-
 4 files changed, 154 insertions(+), 4 deletions(-)

-- 
2.16.3




* [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-22 13:51 [PATCH v3 0/2] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
@ 2021-02-22 13:51 ` Oscar Salvador
  2021-02-25 20:03   ` Mike Kravetz
                     ` (2 more replies)
  2021-02-22 13:51 ` [PATCH v3 2/2] mm: Make alloc_contig_range handle in-use " Oscar Salvador
  2021-03-01 12:43 ` [PATCH v3 0/2] Make alloc_contig_range handle Hugetlb pages David Hildenbrand
  2 siblings, 3 replies; 25+ messages in thread
From: Oscar Salvador @ 2021-02-22 13:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, David Hildenbrand, Muchun Song, Michal Hocko,
	linux-mm, linux-kernel, Oscar Salvador

alloc_contig_range will fail if it ever sees a HugeTLB page within the
range we are trying to allocate, even when that page is free and can be
easily reallocated.
This has proved to be problematic for some users of alloc_contig_range,
e.g. CMA and virtio-mem, which would fail the call even when those
pages lie in ZONE_MOVABLE and are free.

We can do better by trying to replace such a page.

Free hugepages are tricky to handle: so that no userspace application
notices disruption, we need to replace the current free hugepage with
a new one.

In order to do that, a new function called alloc_and_dissolve_huge_page
is introduced.
This function will first try to get a new fresh hugepage, and if it
succeeds, it will replace the old one in the free hugepage pool.

All operations are handled under hugetlb_lock, so no races are
possible. The only exception is when the page's refcount is 0 but it
has not yet been flagged as PageHugeFreed.
In this case we retry, as the race window is quite small and we have a
high chance of succeeding next time.

With regard to the allocation, we restrict it to the node the page belongs
to with __GFP_THISNODE, meaning we do not fall back to other nodes' zones.

Note that gigantic hugetlb pages are fenced off since there is a cyclic
dependency between them and alloc_contig_range.
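
A sketch of the cycle being fenced off (the call chain below is how I
read the current code, for a gigantic hstate):

  alloc_contig_range()
    isolate_or_dissolve_huge_page()
      alloc_and_dissolve_huge_page()
        alloc_fresh_huge_page()
          alloc_gigantic_page()          /* gigantic hstates only */
            alloc_contig_range()         /* back where we started */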

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/hugetlb.h |   6 +++
 mm/compaction.c         |  12 ++++++
 mm/hugetlb.c            | 111 +++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 127 insertions(+), 2 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b5807f23caf8..72352d718829 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -505,6 +505,7 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
+bool isolate_or_dissolve_huge_page(struct page *page);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -775,6 +776,11 @@ void set_page_huge_active(struct page *page);
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
+static inline bool isolate_or_dissolve_huge_page(struct page *page)
+{
+	return false;
+}
+
 static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
 					   unsigned long addr,
 					   int avoid_reserve)
diff --git a/mm/compaction.c b/mm/compaction.c
index 190ccdaa6c19..d52506ed9db7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -905,6 +905,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			valid_page = page;
 		}
 
+		if (PageHuge(page) && cc->alloc_contig) {
+			if (!isolate_or_dissolve_huge_page(page))
+				goto isolate_fail;
+
+			/*
+			 * Ok, the hugepage was dissolved. Now these pages are
+			 * Buddy and cannot be re-allocated because they are
+			 * isolated. Fall-through as the check below handles
+			 * Buddy pages.
+			 */
+		}
+
 		/*
 		 * Skip if free. We read page order here without zone lock
 		 * which is generally unsafe, but the race window is small and
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4bdb58ab14cb..56eba64a1d33 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1037,13 +1037,18 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 	return false;
 }
 
+static void __enqueue_huge_page(struct list_head *list, struct page *page)
+{
+	list_move(&page->lru, list);
+	SetPageHugeFreed(page);
+}
+
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);
-	list_move(&page->lru, &h->hugepage_freelists[nid]);
+	__enqueue_huge_page(&h->hugepage_freelists[nid], page);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
-	SetPageHugeFreed(page);
 }
 
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
@@ -2294,6 +2299,108 @@ static void restore_reserve_on_error(struct hstate *h,
 	}
 }
 
+/*
+ * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
+ * @h: struct hstate old page belongs to
+ * @old_page: Old page to dissolve
+ * Returns 0 on success, otherwise negated error.
+ */
+
+static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
+{
+	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+	int nid = page_to_nid(old_page);
+	struct page *new_page;
+	int ret = 0;
+
+	/*
+	 * Before dissolving the page, we need to allocate a new one,
+	 * so the pool remains stable.
+	 */
+	new_page = alloc_fresh_huge_page(h, gfp_mask, nid, NULL, NULL);
+	if (!new_page)
+		return -ENOMEM;
+
+	/*
+	 * Pages got from Buddy are self-refcounted, but free hugepages
+	 * need to have a refcount of 0.
+	 */
+	page_ref_dec(new_page);
+retry:
+	spin_lock(&hugetlb_lock);
+	if (!PageHuge(old_page)) {
+		/*
+		 * Freed from under us. Drop new_page too.
+		 */
+		update_and_free_page(h, new_page);
+		goto unlock;
+	} else if (page_count(old_page)) {
+		/*
+		 * Someone has grabbed the page, fail for now.
+		 */
+		ret = -EBUSY;
+		update_and_free_page(h, new_page);
+		goto unlock;
+	} else if (!PageHugeFreed(old_page)) {
+		/*
+		 * Page's refcount is 0 but it has not been enqueued in the
+		 * freelist yet. Race window is small, so we can succeed here if
+		 * we retry.
+		 */
+		spin_unlock(&hugetlb_lock);
+		cond_resched();
+		goto retry;
+	} else {
+		/*
+		 * Ok, old_page is still a genuine free hugepage. Replace it
+		 * with the new one.
+		 */
+		list_del(&old_page->lru);
+		update_and_free_page(h, old_page);
+		/*
+		 * h->free_huge_pages{_node} counters do not need to be updated.
+		 */
+		__enqueue_huge_page(&h->hugepage_freelists[nid], new_page);
+	}
+unlock:
+	spin_unlock(&hugetlb_lock);
+
+	return ret;
+}
+
+bool isolate_or_dissolve_huge_page(struct page *page)
+{
+	struct hstate *h = NULL;
+	struct page *head;
+	bool ret = false;
+
+	spin_lock(&hugetlb_lock);
+	if (PageHuge(page)) {
+		head = compound_head(page);
+		h = page_hstate(head);
+	}
+	spin_unlock(&hugetlb_lock);
+
+	/*
+	 * The page might have been dissolved from under our feet.
+	 * If that is the case, return success as if we dissolved it ourselves.
+	 */
+	if (!h)
+		return true;
+
+	/*
+	 * Fence off gigantic pages as there is a cyclic dependency
+	 * between alloc_contig_range and them.
+	 */
+	if (hstate_is_gigantic(h))
+		return ret;
+
+	if (!page_count(head) && alloc_and_dissolve_huge_page(h, head))
+		ret = true;
+
+	return ret;
+}
+
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
-- 
2.16.3




* [PATCH v3 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-22 13:51 [PATCH v3 0/2] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
  2021-02-22 13:51 ` [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
@ 2021-02-22 13:51 ` Oscar Salvador
  2021-02-25 23:05   ` Mike Kravetz
                     ` (2 more replies)
  2021-03-01 12:43 ` [PATCH v3 0/2] Make alloc_contig_range handle Hugetlb pages David Hildenbrand
  2 siblings, 3 replies; 25+ messages in thread
From: Oscar Salvador @ 2021-02-22 13:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, David Hildenbrand, Muchun Song, Michal Hocko,
	linux-mm, linux-kernel, Oscar Salvador

alloc_contig_range() will fail if it finds a HugeTLB page within the range,
without a chance to handle it. Since HugeTLB pages can be migrated like any
LRU or Movable page, it does not make sense to bail out without trying.
Enable the interface to recognize in-use HugeTLB pages so we can migrate
them, and have a much better chance of succeeding the call.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/hugetlb.h |  5 +++--
 mm/compaction.c         | 12 +++++++++++-
 mm/hugetlb.c            | 21 +++++++++++++++++----
 mm/vmscan.c             |  5 +++--
 4 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 72352d718829..8c17d0dbc87c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -505,7 +505,7 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
-bool isolate_or_dissolve_huge_page(struct page *page);
+bool isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -776,7 +776,8 @@ void set_page_huge_active(struct page *page);
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
-static inline bool isolate_or_dissolve_huge_page(struct page *page)
+static inline bool isolate_or_dissolve_huge_page(struct page *page,
+						 struct list_head *list)
 {
 	return false;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index d52506ed9db7..6d9169e71d61 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -906,9 +906,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		}
 
 		if (PageHuge(page) && cc->alloc_contig) {
-			if (!isolate_or_dissolve_huge_page(page))
+			if (!isolate_or_dissolve_huge_page(page, &cc->migratepages))
 				goto isolate_fail;
 
+			if (PageHuge(page)) {
+				/*
+				 * Hugepage was successfully isolated and placed
+				 * on the cc->migratepages list.
+				 */
+				low_pfn += compound_nr(page) - 1;
+				goto isolate_success_no_list;
+			}
+
 			/*
 			 * Ok, the hugepage was dissolved. Now these pages are
 			 * Buddy and cannot be re-allocated because they are
@@ -1053,6 +1062,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 isolate_success:
 		list_add(&page->lru, &cc->migratepages);
+isolate_success_no_list:
 		cc->nr_migratepages += compound_nr(page);
 		nr_isolated += compound_nr(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 56eba64a1d33..95dd54cd53c0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2336,7 +2336,9 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
 		goto unlock;
 	} else if (page_count(old_page)) {
 		/*
-		 * Someone has grabbed the page, fail for now.
+		 * Someone has grabbed the page, return -EBUSY so we give
+		 * isolate_or_dissolve_huge_page a chance to handle an in-use
+		 * page.
 		 */
 		ret = -EBUSY;
 		update_and_free_page(h, new_page);
@@ -2368,11 +2370,12 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
 	return ret;
 }
 
-bool isolate_or_dissolve_huge_page(struct page *page)
+bool isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
 {
 	struct hstate *h = NULL;
 	struct page *head;
 	bool ret = false;
+	bool try_again = true;
 
 	spin_lock(&hugetlb_lock);
 	if (PageHuge(page)) {
@@ -2394,9 +2397,19 @@ bool isolate_or_dissolve_huge_page(struct page *page)
 	 */
 	if (hstate_is_gigantic(h))
 		return ret;
-
-	if (!page_count(head) && alloc_and_dissolve_huge_page(h, head))
+retry:
+	if (page_count(head) && isolate_huge_page(head, list)) {
 		ret = true;
+	} else if (!page_count(head)) {
+		int err = alloc_and_dissolve_huge_page(h, head);
+
+		if (!err) {
+			ret = true;
+		} else if (err == -EBUSY && try_again) {
+			try_again = false;
+			goto retry;
+		}
+	}
 
 	return ret;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b1b574ad199d..0803adca4469 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1506,8 +1506,9 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 	LIST_HEAD(clean_pages);
 
 	list_for_each_entry_safe(page, next, page_list, lru) {
-		if (page_is_file_lru(page) && !PageDirty(page) &&
-		    !__PageMovable(page) && !PageUnevictable(page)) {
+		if (!PageHuge(page) && page_is_file_lru(page) &&
+		    !PageDirty(page) && !__PageMovable(page) &&
+		    !PageUnevictable(page)) {
 			ClearPageActive(page);
 			list_move(&page->lru, &clean_pages);
 		}
-- 
2.16.3




* Re: [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-22 13:51 ` [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
@ 2021-02-25 20:03   ` Mike Kravetz
  2021-02-26  9:48     ` Oscar Salvador
  2021-02-26  8:35   ` Michal Hocko
  2021-03-01 14:09   ` David Hildenbrand
  2 siblings, 1 reply; 25+ messages in thread
From: Mike Kravetz @ 2021-02-25 20:03 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: David Hildenbrand, Muchun Song, Michal Hocko, linux-mm, linux-kernel

On 2/22/21 5:51 AM, Oscar Salvador wrote:
> alloc_contig_range will fail if it ever sees a HugeTLB page within the
> range we are trying to allocate, even when that page is free and can be
> easily reallocated.
> This has proved to be problematic for some users of alloc_contig_range,
> e.g. CMA and virtio-mem, which would fail the call even when those
> pages lie in ZONE_MOVABLE and are free.
> 
> We can do better by trying to replace such a page.
> 
> Free hugepages are tricky to handle: so that no userspace application
> notices disruption, we need to replace the current free hugepage with
> a new one.
> 
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugepage, and if it
> succeeds, it will replace the old one in the free hugepage pool.
> 
> All operations are handled under hugetlb_lock, so no races are
> possible. The only exception is when the page's refcount is 0 but it
> has not yet been flagged as PageHugeFreed.
> In this case we retry, as the race window is quite small and we have a
> high chance of succeeding next time.
> 
> With regard to the allocation, we restrict it to the node the page belongs
> to with __GFP_THISNODE, meaning we do not fall back to other nodes' zones.
> 
> Note that gigantic hugetlb pages are fenced off since there is a cyclic
> dependency between them and alloc_contig_range.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Thanks Oscar,

I spent a bunch of time looking for possible race issues.  Thankfully,
the recent code from Muchun dealing with free lists helps. In addition,
all the hugetlb accounting looks good.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

> ---
>  include/linux/hugetlb.h |   6 +++
>  mm/compaction.c         |  12 ++++++
>  mm/hugetlb.c            | 111 +++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 127 insertions(+), 2 deletions(-)
-- 
Mike Kravetz



* Re: [PATCH v3 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-22 13:51 ` [PATCH v3 2/2] mm: Make alloc_contig_range handle in-use " Oscar Salvador
@ 2021-02-25 23:05   ` Mike Kravetz
  2021-02-26  8:46   ` Michal Hocko
  2021-03-05 17:30   ` David Hildenbrand
  2 siblings, 0 replies; 25+ messages in thread
From: Mike Kravetz @ 2021-02-25 23:05 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: David Hildenbrand, Muchun Song, Michal Hocko, linux-mm, linux-kernel

On 2/22/21 5:51 AM, Oscar Salvador wrote:
> alloc_contig_range() will fail if it finds a HugeTLB page within the range,
> without a chance to handle it. Since HugeTLB pages can be migrated like any
> LRU or Movable page, it does not make sense to bail out without trying.
> Enable the interface to recognize in-use HugeTLB pages so we can migrate
> them, and have a much better chance of succeeding the call.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  include/linux/hugetlb.h |  5 +++--
>  mm/compaction.c         | 12 +++++++++++-
>  mm/hugetlb.c            | 21 +++++++++++++++++----
>  mm/vmscan.c             |  5 +++--
>  4 files changed, 34 insertions(+), 9 deletions(-)

Thanks,

Changes look good.  I like the simple one-time retry for pages which may
go from free to in use.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

BTW,
This series will need to be rebased on the latest tree, which has the
hugetlb page flag changes.  Should be as simple as:
s/PageHugeFreed/HPageFreed/
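
For instance (a sketch of just the rename, assuming nothing else on the
latest tree conflicts), the check in alloc_and_dissolve_huge_page()
would become:

-	} else if (!PageHugeFreed(old_page)) {
+	} else if (!HPageFreed(old_page)) {

and SetPageHugeFreed() in __enqueue_huge_page() likewise turns into
SetHPageFreed().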
-- 
Mike Kravetz



* Re: [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-22 13:51 ` [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
  2021-02-25 20:03   ` Mike Kravetz
@ 2021-02-26  8:35   ` Michal Hocko
  2021-02-26  8:38     ` Michal Hocko
  2021-02-26  9:45     ` Oscar Salvador
  2021-03-01 14:09   ` David Hildenbrand
  2 siblings, 2 replies; 25+ messages in thread
From: Michal Hocko @ 2021-02-26  8:35 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Mon 22-02-21 14:51:36, Oscar Salvador wrote:
> alloc_contig_range will fail if it ever sees a HugeTLB page within the
> range we are trying to allocate, even when that page is free and can be
> easily reallocated.
> This has proved to be problematic for some users of alloc_contig_range,
> e.g. CMA and virtio-mem, which would fail the call even when those
> pages lie in ZONE_MOVABLE and are free.
> 
> We can do better by trying to replace such a page.
> 
> Free hugepages are tricky to handle: so that no userspace application
> notices disruption, we need to replace the current free hugepage with
> a new one.
> 
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugepage, and if it
> succeeds, it will replace the old one in the free hugepage pool.
> 
> All operations are handled under hugetlb_lock, so no races are
> possible. The only exception is when the page's refcount is 0 but it
> has not yet been flagged as PageHugeFreed.

I think it would be helpful to call out that specific case explicitly
here. I can see only one scenario (are there more?)
__free_huge_page()		isolate_or_dissolve_huge_page
				  PageHuge() == T
				  alloc_and_dissolve_huge_page
				    alloc_fresh_huge_page()
				    spin_lock(hugetlb_lock)
				    // PageHuge() && !PageHugeFreed &&
				    // !PageCount()
				    spin_unlock(hugetlb_lock)
  spin_lock(hugetlb_lock)
  1) update_and_free_page
       PageHuge() == F
       __free_pages()
  2) enqueue_huge_page
       SetPageHugeFreed()
  spin_unlock(&hugetlb_lock)			    

> In this case we retry, as the race window is quite small and we have a
> high chance of succeeding next time.
> 
> With regard to the allocation, we restrict it to the node the page belongs
> to with __GFP_THISNODE, meaning we do not fall back to other nodes' zones.
> 
> Note that gigantic hugetlb pages are fenced off since there is a cyclic
> dependency between them and alloc_contig_range.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Thanks this looks much better than the initial version. One nit below.
Acked-by: Michal Hocko <mhocko@suse.com>

[...]
> +bool isolate_or_dissolve_huge_page(struct page *page)
> +{
> +	struct hstate *h = NULL;
> +	struct page *head;
> +	bool ret = false;
> +
> +	spin_lock(&hugetlb_lock);
> +	if (PageHuge(page)) {
> +		head = compound_head(page);
> +		h = page_hstate(head);
> +	}
> +	spin_unlock(&hugetlb_lock);
> +
> +	/*
> +	 * The page might have been dissolved from under our feet.
> +	 * If that is the case, return success as if we dissolved it ourselves.
> +	 */
> +	if (!h)
> +		return true;

I am still fighting with this construct a bit. It is not really clear
what the lock is protecting us from here. alloc_fresh_huge_page deals
with all potential races and this looks like an optimistic check to save
some work. But in fact the lock is really necessary for correctness
because hstate might be completely bogus without the lock or us holding
a reference on the page. The following construct would be more
explicit and compact. What do you think?

	struct hstate *h;

	/*
	 * The page might have been dissolved from under our feet
	 * so make sure to carefully check the state under the lock.
	 * Return success when racing as if we dissolved the page
	 * ourselves.
	 */
	spin_lock(&hugetlb_lock);
	if (PageHuge(page)) {
		head = compound_head(page);
		h = page_hstate(head);
	} else {
		spin_unlock(&hugetlb_lock);
		return true;
	}
	spin_unlock(&hugetlb_lock);

> +
> +	/*
> +	 * Fence off gigantic pages as there is a cyclic dependency
> +	 * between alloc_contig_range and them.
> +	 */
> +	if (hstate_is_gigantic(h))
> +		return ret;
> +
> +	if (!page_count(head) && alloc_and_dissolve_huge_page(h, head))
> +		ret = true;
> +
> +	return ret;
> +}
> +
>  struct page *alloc_huge_page(struct vm_area_struct *vma,
>  				    unsigned long addr, int avoid_reserve)
>  {
> -- 
> 2.16.3

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-26  8:35   ` Michal Hocko
@ 2021-02-26  8:38     ` Michal Hocko
  2021-02-26  9:25       ` David Hildenbrand
  2021-02-26  9:45     ` Oscar Salvador
  1 sibling, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2021-02-26  8:38 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Fri 26-02-21 09:35:10, Michal Hocko wrote:
> On Mon 22-02-21 14:51:36, Oscar Salvador wrote:
> > alloc_contig_range will fail if it ever sees a HugeTLB page within the
> > range we are trying to allocate, even when that page is free and can be
> > easily reallocated.
> > This has proved to be problematic for some users of alloc_contig_range,
> > e.g. CMA and virtio-mem, which would fail the call even when those
> > pages lie in ZONE_MOVABLE and are free.
> > 
> > We can do better by trying to replace such a page.
> > 
> > Free hugepages are tricky to handle: so that no userspace application
> > notices disruption, we need to replace the current free hugepage with
> > a new one.
> > 
> > In order to do that, a new function called alloc_and_dissolve_huge_page
> > is introduced.
> > This function will first try to get a new fresh hugepage, and if it
> > succeeds, it will replace the old one in the free hugepage pool.
> > 
> > All operations are handled under hugetlb_lock, so no races are
> > possible. The only exception is when the page's refcount is 0 but it
> > has not yet been flagged as PageHugeFreed.
> 
> I think it would be helpful to call out that specific case explicitly
> here. I can see only one scenario (are there more?)
> __free_huge_page()		isolate_or_dissolve_huge_page
> 				  PageHuge() == T
> 				  alloc_and_dissolve_huge_page
> 				    alloc_fresh_huge_page()
> 				    spin_lock(hugetlb_lock)
> 				    // PageHuge() && !PageHugeFreed &&
> 				    // !PageCount()
> 				    spin_unlock(hugetlb_lock)
>   spin_lock(hugetlb_lock)
>   1) update_and_free_page
>        PageHuge() == F
>        __free_pages()
>   2) enqueue_huge_page
>        SetPageHugeFreed()
>   spin_unlock(&hugetlb_lock)			    
> 
> > In this case we retry, as the race window is quite small and we have a
> > high chance of succeeding next time.
> > 
> > With regard to the allocation, we restrict it to the node the page belongs
> > to with __GFP_THISNODE, meaning we do not fall back to other nodes' zones.
> > 
> > Note that gigantic hugetlb pages are fenced off since there is a cyclic
> > dependency between them and alloc_contig_range.
> > 
> > Signed-off-by: Oscar Salvador <osalvador@suse.de>
> 
> Thanks this looks much better than the initial version. One nit below.
> Acked-by: Michal Hocko <mhocko@suse.com>

Btw. if David has some numbers it would be great to add them to the
changelog.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v3 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-22 13:51 ` [PATCH v3 2/2] mm: Make alloc_contig_range handle in-use " Oscar Salvador
  2021-02-25 23:05   ` Mike Kravetz
@ 2021-02-26  8:46   ` Michal Hocko
  2021-02-26 10:24     ` Oscar Salvador
  2021-03-05 17:30   ` David Hildenbrand
  2 siblings, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2021-02-26  8:46 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Mon 22-02-21 14:51:37, Oscar Salvador wrote:
[...]
> @@ -2394,9 +2397,19 @@ bool isolate_or_dissolve_huge_page(struct page *page)
>  	 */
>  	if (hstate_is_gigantic(h))
>  		return ret;
> -
> -	if (!page_count(head) && alloc_and_dissolve_huge_page(h, head))
> +retry:
> +	if (page_count(head) && isolate_huge_page(head, list)) {
>  		ret = true;
> +	} else if (!page_count(head)) {

This is rather head spinning. Do we need to test page_count in the else
branch? Do you want to optimize for a case where the page cannot be
isolated because of page_huge_active?

> +		int err = alloc_and_dissolve_huge_page(h, head);
> +
> +		if (!err) {
> +			ret = true;
> +		} else if (err == -EBUSY && try_again) {
> +			try_again = false;
> +			goto retry;
> +		}

Is this retry once logic really needed? Does it really give us any real
benefit? alloc_and_dissolve_huge_page already retries when the page is
being freed.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-26  8:38     ` Michal Hocko
@ 2021-02-26  9:25       ` David Hildenbrand
  2021-02-26  9:47         ` Oscar Salvador
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2021-02-26  9:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oscar Salvador, Andrew Morton, Mike Kravetz, David Hildenbrand,
	Muchun Song, linux-mm, linux-kernel


> Am 26.02.2021 um 09:38 schrieb Michal Hocko <mhocko@suse.com>:
> 
> On Fri 26-02-21 09:35:10, Michal Hocko wrote:
>>> On Mon 22-02-21 14:51:36, Oscar Salvador wrote:
>>> alloc_contig_range will fail if it ever sees a HugeTLB page within the
>>> range we are trying to allocate, even when that page is free and can be
>>> easily reallocated.
>>> This has proved to be problematic for some users of alloc_contig_range,
>>> e.g. CMA and virtio-mem, which would fail the call even when those
>>> pages lie in ZONE_MOVABLE and are free.
>>> 
>>> We can do better by trying to replace such a page.
>>> 
>>> Free hugepages are tricky to handle: so that no userspace application
>>> notices disruption, we need to replace the current free hugepage with
>>> a new one.
>>> 
>>> In order to do that, a new function called alloc_and_dissolve_huge_page
>>> is introduced.
>>> This function will first try to get a new fresh hugepage, and if it
>>> succeeds, it will replace the old one in the free hugepage pool.
>>> 
>>> All operations are handled under hugetlb_lock, so no races are
>>> possible. The only exception is when the page's refcount is 0 but it
>>> has not yet been flagged as PageHugeFreed.
>> 
>> I think it would be helpful to call out that specific case explicitly
>> here. I can see only one scenario (are there more?)
>> __free_huge_page()        isolate_or_dissolve_huge_page
>>                  PageHuge() == T
>>                  alloc_and_dissolve_huge_page
>>                    alloc_fresh_huge_page()
>>                    spin_lock(hugetlb_lock)
>>                    // PageHuge() && !PageHugeFreed &&
>>                    // !PageCount()
>>                    spin_unlock(hugetlb_lock)
>>  spin_lock(hugetlb_lock)
>>  1) update_and_free_page
>>       PageHuge() == F
>>       __free_pages()
>>  2) enqueue_huge_page
>>       SetPageHugeFreed()
>>  spin_unlock(&hugetlb_lock)                
>> 
>>> In this case we retry, as the race window is quite small and we have a
>>> high chance of succeeding next time.
>>> 
>>> With regard to the allocation, we restrict it to the node the page belongs
>>> to with __GFP_THISNODE, meaning we do not fall back to other nodes' zones.
>>> 
>>> Note that gigantic hugetlb pages are fenced off since there is a cyclic
>>> dependency between them and alloc_contig_range.
>>> 
>>> Signed-off-by: Oscar Salvador <osalvador@suse.de>
>> 
>> Thanks this looks much better than the initial version. One nit below.
>> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Btw. if David has some numbers it would be great to add them to the
> changelog.

I'm planning on giving both patches a churn early next week, with

a) free huge pages
b) idle allocated huge pages
c) heavily read huge pages

(Then I'm also planning on having another brief look at the patches :) )

Thanks!




* Re: [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-26  8:35   ` Michal Hocko
  2021-02-26  8:38     ` Michal Hocko
@ 2021-02-26  9:45     ` Oscar Salvador
  2021-02-26  9:51       ` Michal Hocko
  1 sibling, 1 reply; 25+ messages in thread
From: Oscar Salvador @ 2021-02-26  9:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Fri, Feb 26, 2021 at 09:35:09AM +0100, Michal Hocko wrote:
> I think it would be helpful to call out that specific case explicitly
> here. I can see only one scenario (are there more?)
> __free_huge_page()		isolate_or_dissolve_huge_page
> 				  PageHuge() == T
> 				  alloc_and_dissolve_huge_page
> 				    alloc_fresh_huge_page()
> 				    spin_lock(hugetlb_lock)
> 				    // PageHuge() && !PageHugeFreed &&
> 				    // !PageCount()
> 				    spin_unlock(hugetlb_lock)
>   spin_lock(hugetlb_lock)
>   1) update_and_free_page
>        PageHuge() == F
>        __free_pages()
>   2) enqueue_huge_page
>        SetPageHugeFreed()
>   spin_unlock(&hugetlb_lock)

I do not think there are more scenarios. The thing is to find a !page_count &&
!PageHugeFreed.
AFAICS, this can only happen after:
put_page->put_page_testzero->..->__free_huge_page gets called but __free_huge_page
has not reached enqueue_huge_page yet (as you just described above)

By calling out this case, you meant to describe it in the changelog?

> 
> > In this case we retry as the window race is quite small and we have high
> > chances to succeed next time.
> > 
> > With regard to the allocation, we restrict it to the node the page belongs
> > to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
> > 
> > Note that gigantic hugetlb pages are fenced off since there is a cyclic
> > dependency between them and alloc_contig_range.
> > 
> > Signed-off-by: Oscar Salvador <osalvador@suse.de>
> 
> Thanks this looks much better than the initial version. One nit below.
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> > +bool isolate_or_dissolve_huge_page(struct page *page)
> > +{
> > +	struct hstate *h = NULL;
> > +	struct page *head;
> > +	bool ret = false;
> > +
> > +	spin_lock(&hugetlb_lock);
> > +	if (PageHuge(page)) {
> > +		head = compound_head(page);
> > +		h = page_hstate(head);
> > +	}
> > +	spin_unlock(&hugetlb_lock);
> > +
> > +	/*
> > +	 * The page might have been dissolved from under our feet.
> > +	 * If that is the case, return success as if we dissolved it ourselves.
> > +	 */
> > +	if (!h)
> > +		return true;
> 
> I am still fighting with this construct a bit. It is not really clear
> what the lock is protecting us from here. alloc_fresh_huge_page deals
> with all potential races and this looks like an optimistic check to save
> some work. But in fact the lock is really necessary for correctness
> because hstate might be completely bogus without the lock or us holding
> a reference on the page. The following construct would be more
> explicit and compact. What do you think?
> 
> 	struct hstate *h;
> 
> 	/*
> 	 * The page might have been dissloved from under our feet
> 	 * so make sure to carefully check the state under the lock.
> 	 * Return success on when racing as if we dissloved the page
> 	 * ourselves.
> 	 */
> 	spin_lock(&hugetlb_lock);
> 	if (PageHuge(page)) {
> 		head = compound_head(page);
> 		h = page_hstate(head);
> 	} else {
> 		spin_unlock(&hugetlb_lock);
> 		return true;
> 	}
> 	spin_unlock(&hugetlb_lock);

Yes, I find this less of an eyesore.

I will fix it up in v4.

-- 
Oscar Salvador
SUSE L3



* Re: [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-26  9:25       ` David Hildenbrand
@ 2021-02-26  9:47         ` Oscar Salvador
  0 siblings, 0 replies; 25+ messages in thread
From: Oscar Salvador @ 2021-02-26  9:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Andrew Morton, Mike Kravetz, Muchun Song, linux-mm,
	linux-kernel

On Fri, Feb 26, 2021 at 10:25:46AM +0100, David Hildenbrand wrote:
> I'm planning on giving both patches a churn early next week, with
> 
> a) free huge pages
> b) idle allocated huge pages
> c) heavily read huge pages
> 
> (Then I'm also planning on having another brief look at the patches :) )

That would be great, thanks!

-- 
Oscar Salvador
SUSE L3



* Re: [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-25 20:03   ` Mike Kravetz
@ 2021-02-26  9:48     ` Oscar Salvador
  0 siblings, 0 replies; 25+ messages in thread
From: Oscar Salvador @ 2021-02-26  9:48 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Michal Hocko,
	linux-mm, linux-kernel

On Thu, Feb 25, 2021 at 12:03:18PM -0800, Mike Kravetz wrote:
> Thanks Oscar,
> 
> I spent a bunch of time looking for possible race issues.  Thankfully,
> the recent code from Muchun dealing with free lists helps. In addition,
> all the hugetlb accounting looks good.
> 
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

thanks for having a look Mike ;-)

> 
> > ---
> >  include/linux/hugetlb.h |   6 +++
> >  mm/compaction.c         |  12 ++++++
> >  mm/hugetlb.c            | 111 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  3 files changed, 127 insertions(+), 2 deletions(-)
> -- 
> Mike Kravetz

-- 
Oscar Salvador
SUSE L3



* Re: [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-26  9:45     ` Oscar Salvador
@ 2021-02-26  9:51       ` Michal Hocko
  0 siblings, 0 replies; 25+ messages in thread
From: Michal Hocko @ 2021-02-26  9:51 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Fri 26-02-21 10:45:14, Oscar Salvador wrote:
> On Fri, Feb 26, 2021 at 09:35:09AM +0100, Michal Hocko wrote:
> > I think it would be helpful to call out that specific case explicitly
> > here. I can see only one scenario (are there more?)
> > __free_huge_page()		isolate_or_dissolve_huge_page
> > 				  PageHuge() == T
> > 				  alloc_and_dissolve_huge_page
> > 				    alloc_fresh_huge_page()
> > 				    spin_lock(hugetlb_lock)
> > 				    // PageHuge() && !PageHugeFreed &&
> > 				    // !PageCount()
> > 				    spin_unlock(hugetlb_lock)
> >   spin_lock(hugetlb_lock)
> >   1) update_and_free_page
> >        PageHuge() == F
> >        __free_pages()
> >   2) enqueue_huge_page
> >        SetPageHugeFreed()
> >   spin_unlock(&hugetlb_lock)
> 
> I do not think there are more scenarios. The thing is to find a !page_count &&
> !PageHugeFreed.
> AFAICS, this can only happen after:
> put_page->put_page_testzero->..->__free_huge_page gets called but __free_huge_page
> has not reached enqueue_huge_page yet (as you just described above)
> 
> By calling out this case, you meant to describe it in the changelog?

Yes.
[...]
> > 	struct hstate *h;
> > 
> > 	/*
> > 	 * The page might have been dissolved from under our feet
> > 	 * so make sure to carefully check the state under the lock.
> > 	 * Return success when racing as if we dissolved the page
> > 	 * ourselves.
> > 	 */
> > 	spin_lock(&hugetlb_lock);
> > 	if (PageHuge(page)) {
> > 		head = compound_head(page);
> > 		h = page_hstate(head);
> > 	} else {
> > 		spin_unlock(&hugetlb_lock);
> > 		return true;
> > 	}
> > 	spin_unlock(&hugetlb_lock);
> 
> Yes, I find this less eyesore.
> 
> I will fix it up in v4.

Thanks!
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v3 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-26  8:46   ` Michal Hocko
@ 2021-02-26 10:24     ` Oscar Salvador
  2021-02-26 10:27       ` Oscar Salvador
  2021-02-26 12:46       ` Michal Hocko
  0 siblings, 2 replies; 25+ messages in thread
From: Oscar Salvador @ 2021-02-26 10:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Fri, Feb 26, 2021 at 09:46:57AM +0100, Michal Hocko wrote:
> On Mon 22-02-21 14:51:37, Oscar Salvador wrote:
> [...]
> > @@ -2394,9 +2397,19 @@ bool isolate_or_dissolve_huge_page(struct page *page)
> >  	 */
> >  	if (hstate_is_gigantic(h))
> >  		return ret;
> > -
> > -	if (!page_count(head) && alloc_and_dissolve_huge_page(h, head))
> > +retry:
> > +	if (page_count(head) && isolate_huge_page(head, list)) {
> >  		ret = true;
> > +	} else if (!page_count(head)) {
> 
> This is rather head spinning. Do we need to test page_count in the else
> branch? Do you want to optimize for a case where the page cannot be
> isolated because of page_huge_active?

Well, I wanted to explicitly call out both cases.
We either 1) have an in-use page and we try to isolate it or 2) we have a free
page (count == 0).

If the page could not be isolated due to page_huge_active, this would either
mean that the page is about to be freed, or that someone has already isolated
the page.
In the former case, one could say that falling through to alloc_and_dissolve is
ok.

But no, I did not really want to optimize anything here, just wanted to be explicit
about what we are checking and why.

> > +
> > +		if (!err) {
> > +			ret = true;
> > +		} else if (err == -EBUSY && try_again) {
> > +			try_again = false;
> > +			goto retry;
> > +		}
> 
> Is this retry once logic really needed? Does it really give us any real
> benefit? alloc_and_dissolve_huge_page already retries when the page is
> being freed.

alloc_and_dissolve_huge_page retries when the page is about to be freed.
Here we retry in case alloc_and_dissolve_huge_page() noticed that someone grabbed
the page (refcount 0 -> 1), which would possibly mean userspace started using this
page. If that is the case, we give isolate_huge_page another chance to see if we
can succeed and migrate the page.

Chances of this happening? Honestly, as with any race, quite low.
So, the answer to your question would be: no, it is not really needed, I just felt
we could play "clever" here.

-- 
Oscar Salvador
SUSE L3



* Re: [PATCH v3 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-26 10:24     ` Oscar Salvador
@ 2021-02-26 10:27       ` Oscar Salvador
  2021-02-26 12:46       ` Michal Hocko
  1 sibling, 0 replies; 25+ messages in thread
From: Oscar Salvador @ 2021-02-26 10:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Fri, Feb 26, 2021 at 11:24:29AM +0100, Oscar Salvador wrote:
> On Fri, Feb 26, 2021 at 09:46:57AM +0100, Michal Hocko wrote:
> > On Mon 22-02-21 14:51:37, Oscar Salvador wrote:
> > [...]
> > > @@ -2394,9 +2397,19 @@ bool isolate_or_dissolve_huge_page(struct page *page)
> > >  	 */
> > >  	if (hstate_is_gigantic(h))
> > >  		return ret;
> > > -
> > > -	if (!page_count(head) && alloc_and_dissolve_huge_page(h, head))
> > > +retry:
> > > +	if (page_count(head) && isolate_huge_page(head, list)) {
> > >  		ret = true;
> > > +	} else if (!page_count(head)) {
> > 
> > This is rather head spinning. Do we need to test page_count in the else
> > branch? Do you want to optimize for a case where the page cannot be
> > isolated because of page_huge_active?
> 
> Well, I wanted to explicitly call out both cases.
> We either 1) have an in-use page and we try to isolate it or 2) we have a free
> page (count == 0).
> 
> If the page could not be isolated due to page_huge_active, this would either
> mean that the page is about to be freed, or that someone has already isolated
> the page.
> In the former case, one could say that falling through to alloc_and_dissolve is
> ok.
> 
> But no, I did not really want to optimize anything here, just wanted to be explicit
> about what we are checking and why.

Maybe I could add a comment to make it more explicit.
 

-- 
Oscar Salvador
SUSE L3



* Re: [PATCH v3 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-26 10:24     ` Oscar Salvador
  2021-02-26 10:27       ` Oscar Salvador
@ 2021-02-26 12:46       ` Michal Hocko
  2021-02-28 13:43         ` Oscar Salvador
  1 sibling, 1 reply; 25+ messages in thread
From: Michal Hocko @ 2021-02-26 12:46 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Fri 26-02-21 11:24:29, Oscar Salvador wrote:
> On Fri, Feb 26, 2021 at 09:46:57AM +0100, Michal Hocko wrote:
> > On Mon 22-02-21 14:51:37, Oscar Salvador wrote:
> > [...]
> > > @@ -2394,9 +2397,19 @@ bool isolate_or_dissolve_huge_page(struct page *page)
> > >  	 */
> > >  	if (hstate_is_gigantic(h))
> > >  		return ret;
> > > -
> > > -	if (!page_count(head) && alloc_and_dissolve_huge_page(h, head))
> > > +retry:
> > > +	if (page_count(head) && isolate_huge_page(head, list)) {
> > >  		ret = true;
> > > +	} else if (!page_count(head)) {
> > 
> > This is rather head spinning. Do we need to test page_count in the else
> > branch? Do you want to optimize for a case where the page cannot be
> > isolated because of page_huge_active?
> 
> Well, I wanted to explicitly call out both cases.
> We either 1) have an in-use page and we try to isolate it or 2) we have a free
> page (count == 0).
> 
> If the page could not be isolated due to page_huge_active, this would either
> mean that the page is about to be freed, or that someone has already isolated
> the page.
> In the former case, one could say that falling through to alloc_and_dissolve is
> ok.
> 
> But no, I did not really want to optimize anything here, just wanted to be explicit
> about what we are checking and why.

Well, I will leave it to others. I do not feel strongly about this but
to me it makes the code harder to think about because the situation is
unstable and any of those conditions can change as they are evaluated. So
the explicit checks make the code harder in the end. I would simply go
with
	if (isolate_huge_page(head, list) || !alloc_and_dissolve_huge_page())
		ret = true;

If either of the conditionals needs a retry then it should be done
internally, like alloc_and_dissolve_huge_page already does to stabilize
the PageHugeFreed flag. An early bail out on a non-free hugetlb page would
also better be done inside alloc_and_dissolve_huge_page.
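
Roughly like the following (an untested sketch of the idea only, not a
proper patch):

bool isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
{
	struct hstate *h;
	struct page *head;

	/* Stabilize hstate under the lock, as the patch already does. */
	spin_lock(&hugetlb_lock);
	if (!PageHuge(page)) {
		/* Dissolved from under us, report success. */
		spin_unlock(&hugetlb_lock);
		return true;
	}
	head = compound_head(page);
	h = page_hstate(head);
	spin_unlock(&hugetlb_lock);

	if (hstate_is_gigantic(h))
		return false;

	/*
	 * Either the page is in use and can be isolated, or it is free
	 * and alloc_and_dissolve_huge_page() replaces it, with any retry
	 * on the transient !PageHugeFreed state done internally.
	 */
	return isolate_huge_page(head, list) ||
	       !alloc_and_dissolve_huge_page(h, head);
}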

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v3 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-26 12:46       ` Michal Hocko
@ 2021-02-28 13:43         ` Oscar Salvador
  0 siblings, 0 replies; 25+ messages in thread
From: Oscar Salvador @ 2021-02-28 13:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Fri, Feb 26, 2021 at 01:46:21PM +0100, Michal Hocko wrote:
> Well, I will leave it to others. I do not feel strongly about this but
> to me it makes the code harder to think about because the situation is
> unstable and any of those conditions can change as they are evaluated. So
> the explicit checks make the code harder in the end. I would simply go
> with
> 	if (isolate_huge_page(head, list) || !alloc_and_dissolve_huge_page())
> 		ret = true;
> 
> If either of the conditionals needs a retry then it should be done
> internally, like alloc_and_dissolve_huge_page already does to stabilize
> the PageHugeFreed flag. An early bail out on a non-free hugetlb page would
> also better be done inside alloc_and_dissolve_huge_page.

The retry could be done internally in alloc_and_dissolve_huge_page in
case someone grabbed the page from under us, but calling
isolate_huge_page from there seemed a bit odd to me, which is why I
placed the logic in the outer function.
It looks more logical to me, but of course, that is just my taste.

I do not think it makes the code that hard to follow, but I will leave
it to the others.
If there is a consensus that a simpler version is preferred, I do not
have a problem going with that.

Mike, what is your take on this?

Thanks


-- 
Oscar Salvador
SUSE L3



* Re: [PATCH v3 0/2] Make alloc_contig_range handle Hugetlb pages
  2021-02-22 13:51 [PATCH v3 0/2] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
  2021-02-22 13:51 ` [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
  2021-02-22 13:51 ` [PATCH v3 2/2] mm: Make alloc_contig_range handle in-use " Oscar Salvador
@ 2021-03-01 12:43 ` David Hildenbrand
  2021-03-01 12:57   ` Oscar Salvador
  2 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2021-03-01 12:43 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Mike Kravetz, Muchun Song, Michal Hocko, linux-mm, linux-kernel

On 22.02.21 14:51, Oscar Salvador wrote:
> v2 -> v3:
>   - Drop usage of high-level generic helpers in favour of
>     low-level approach (per Michal)
>   - Check for the page to be marked as PageHugeFreed
>   - Add a one-time retry in case someone grabbed the free huge page
>     from under us
> 
> v1 -> v2:
>   - Addressed feedback from Michal
>   - Restrict the allocation to a node with __GFP_THISNODE
>   - Drop PageHuge check in alloc_and_dissolve_huge_page
>   - Re-order comments in isolate_or_dissolve_huge_page
>   - Extend comment in isolate_migratepages_block
>   - Place put_page right after we got the page, otherwise
>     dissolve_free_huge_page will fail
> 
>   RFC -> v1:
>   - Drop RFC
>   - Addressed feedback from David and Mike
>   - Fence off gigantic pages as there is a cyclic dependency between
>     them and alloc_contig_range
>   - Re-organize the code to make race-window smaller and to put
>     all details in hugetlb code
>   - Drop nodemask initialization. First a node will be tried and then we
>     will fall back to other nodes containing memory (N_MEMORY). Details in
>     patch#1's changelog
>   - Count new page as surplus in case we failed to dissolve the old page
>     and the new one. Details in patch#1.
> 
> Cover letter:
> 
>   alloc_contig_range lacks the ability to handle HugeTLB pages.
>   This can be problematic for some users, e.g. CMA and virtio-mem, which
>   will fail the call if alloc_contig_range ever sees a HugeTLB page, even
>   when those pages lie in ZONE_MOVABLE and are free.
>   That problem can be easily solved by replacing the page in the free
>   hugepage pool.
> 
>   In-use HugeTLB pages are no exception, as they can be isolated and
>   migrated like any other LRU or Movable page.
> 
>   This patchset aims to improve alloc_contig_range->isolate_migratepages_block,
>   so HugeTLB pages can be recognized and handled.
> 
>   Below is an insight from David (thanks), where the problem can clearly be seen:
> 
>   "Start a VM with 4G. Hotplug 1G via virtio-mem and online it to
>   ZONE_MOVABLE. Allocate 512 huge pages.
> 
>   [root@localhost ~]# cat /proc/meminfo
>   MemTotal:        5061512 kB
>   MemFree:         3319396 kB
>   MemAvailable:    3457144 kB
>   ...
>   HugePages_Total:     512
>   HugePages_Free:      512
>   HugePages_Rsvd:        0
>   HugePages_Surp:        0
>   Hugepagesize:       2048 kB
> 
> 
>   The huge pages get partially allocated from ZONE_MOVABLE. Try unplugging
>   1G via virtio-mem (remember, all ZONE_MOVABLE). Inside the guest:
> 
>   [  180.058992] alloc_contig_range: [1b8000, 1c0000) PFNs busy
>   [  180.060531] alloc_contig_range: [1b8000, 1c0000) PFNs busy
>   [  180.061972] alloc_contig_range: [1b8000, 1c0000) PFNs busy
>   [  180.063413] alloc_contig_range: [1b8000, 1c0000) PFNs busy
>   [  180.064838] alloc_contig_range: [1b8000, 1c0000) PFNs busy
>   [  180.065848] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
>   [  180.066794] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
>   [  180.067738] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
>   [  180.068669] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
>   [  180.069598] alloc_contig_range: [1bfc00, 1c0000) PFNs busy"

Same experiment with ZONE_MOVABLE:

a) Free huge pages: all memory can get unplugged again.

b) Allocated/populated but idle huge pages: all memory can get unplugged 
again.

c) Allocated/populated but all 512 huge pages are read/written in a 
loop: all memory can get unplugged again, but I get a single

[  121.192345] alloc_contig_range: [180000, 188000) PFNs busy

Most probably because it happened to try migrating a huge page while it 
was busy. As virtio-mem retries on ZONE_MOVABLE a couple of times, it 
can deal with this temporary failure.



Last but not least, I did something extreme:

]# cat /proc/meminfo
MemTotal:        5061568 kB
MemFree:          186560 kB
MemAvailable:     354524 kB
...
HugePages_Total:    2048
HugePages_Free:     2048
HugePages_Rsvd:        0
HugePages_Surp:        0


Triggering unplug would require a dissolve+alloc, which now fails when
trying to allocate an additional ~512 huge pages (1G).


As expected, I can properly see memory unplug not fully succeeding. + I 
get a fairly continuous stream of

[  226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy
...

But more importantly, the hugepage count remains stable, as configured 
by the admin (me):

HugePages_Total:    2048
HugePages_Free:     2048
HugePages_Rsvd:        0
HugePages_Surp:        0


-- 
Thanks,

David / dhildenb




* Re: [PATCH v3 0/2] Make alloc_contig_range handle Hugetlb pages
  2021-03-01 12:43 ` [PATCH v3 0/2] Make alloc_contig_range handle Hugetlb pages David Hildenbrand
@ 2021-03-01 12:57   ` Oscar Salvador
  2021-03-01 12:59     ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Oscar Salvador @ 2021-03-01 12:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Michal Hocko, linux-mm,
	linux-kernel

On Mon, Mar 01, 2021 at 01:43:00PM +0100, David Hildenbrand wrote:
> Same experiment with ZONE_MOVABLE:
> 
> a) Free huge pages: all memory can get unplugged again.
> 
> b) Allocated/populated but idle huge pages: all memory can get unplugged
> again.
> 
> c) Allocated/populated but all 512 huge pages are read/written in a loop:
> all memory can get unplugged again, but I get a single
> 
> [  121.192345] alloc_contig_range: [180000, 188000) PFNs busy
> 
> Most probably because it happened to try migrating a huge page while it was
> busy. As virtio-mem retries on ZONE_MOVABLE a couple of times, it can deal
> with this temporary failure.
> 
> 
> 
> Last but not least, I did something extreme:
> 
> ]# cat /proc/meminfo
> MemTotal:        5061568 kB
> MemFree:          186560 kB
> MemAvailable:     354524 kB
> ...
> HugePages_Total:    2048
> HugePages_Free:     2048
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> 
> 
> Triggering unplug would require a dissolve+alloc, which now fails when
> trying to allocate an additional ~512 huge pages (1G).
> 
> 
> As expected, I can properly see memory unplug not fully succeeding. + I get
> a fairly continuous stream of
> 
> [  226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy
> ...
> 
> But more importantly, the hugepage count remains stable, as configured by
> the admin (me):
> 
> HugePages_Total:    2048
> HugePages_Free:     2048
> HugePages_Rsvd:        0
> HugePages_Surp:        0

Thanks for giving it a spin, David, that is highly appreciated ;-)!

I will add the above information to the next version's changelog if you do not mind,
so the before-and-after can be seen clearly.

I shall send v4 in the course of the next few days.

-- 
Oscar Salvador
SUSE L3



* Re: [PATCH v3 0/2] Make alloc_contig_range handle Hugetlb pages
  2021-03-01 12:57   ` Oscar Salvador
@ 2021-03-01 12:59     ` David Hildenbrand
  0 siblings, 0 replies; 25+ messages in thread
From: David Hildenbrand @ 2021-03-01 12:59 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Michal Hocko, linux-mm,
	linux-kernel

On 01.03.21 13:57, Oscar Salvador wrote:
> On Mon, Mar 01, 2021 at 01:43:00PM +0100, David Hildenbrand wrote:
>> Same experiment with ZONE_MOVABLE:
>>
>> a) Free huge pages: all memory can get unplugged again.
>>
>> b) Allocated/populated but idle huge pages: all memory can get unplugged
>> again.
>>
>> c) Allocated/populated but all 512 huge pages are read/written in a loop:
>> all memory can get unplugged again, but I get a single
>>
>> [  121.192345] alloc_contig_range: [180000, 188000) PFNs busy
>>
>> Most probably because it happened to try migrating a huge page while it was
>> busy. As virtio-mem retries on ZONE_MOVABLE a couple of times, it can deal
>> with this temporary failure.
>>
>>
>>
>> Last but not least, I did something extreme:
>>
>> ]# cat /proc/meminfo
>> MemTotal:        5061568 kB
>> MemFree:          186560 kB
>> MemAvailable:     354524 kB
>> ...
>> HugePages_Total:    2048
>> HugePages_Free:     2048
>> HugePages_Rsvd:        0
>> HugePages_Surp:        0
>>
>>
>> Triggering unplug would require dissolve+alloc - which now fails when
>> trying to allocate an additional ~512 huge pages (1G).
>>
>>
>> As expected, I can see that memory unplug does not fully succeed, and I get
>> a fairly continuous stream of
>>
>> [  226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy
>> ...
>>
>> But more importantly, the hugepage count remains stable, as configured by
>> the admin (me):
>>
>> HugePages_Total:    2048
>> HugePages_Free:     2048
>> HugePages_Rsvd:        0
>> HugePages_Surp:        0
> 
> Thanks for giving it a spin, David, that is highly appreciated ;-)!
> 
> I will add the above information to the next version's changelog if you do not mind,
> so the before-and-after can be seen clearly.
> 
> I shall send v4 in the course of the next few days.
> 

I'll have some review feedback on error handling that might be improved, 
I'll share that shortly.

-- 
Thanks,

David / dhildenb




* Re: [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-22 13:51 ` [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
  2021-02-25 20:03   ` Mike Kravetz
  2021-02-26  8:35   ` Michal Hocko
@ 2021-03-01 14:09   ` David Hildenbrand
  2021-03-04 10:19     ` Oscar Salvador
  2 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2021-03-01 14:09 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Mike Kravetz, Muchun Song, Michal Hocko, linux-mm, linux-kernel

On 22.02.21 14:51, Oscar Salvador wrote:
> alloc_contig_range will fail if it ever sees a HugeTLB page within the
> range we are trying to allocate, even when that page is free and can be
> easily reallocated.
> This has proved to be problematic for some users of alloc_contig_range,
> e.g.: CMA and virtio-mem, where the call would fail even when those
> pages lie in ZONE_MOVABLE and are free.
> 
> We can do better by trying to replace such a page.
> 
> Free hugepages are tricky to handle: so that no userspace application
> notices disruption, we need to replace the current free hugepage with
> a new one.
> 
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugepage, and if it
> succeeds, it will replace the old one in the free hugepage pool.
> 
> All operations are handled under hugetlb_lock, so no races are
> possible. The only exception is when the page's refcount is 0, but it
> still has not been flagged as PageHugeFreed.
> In this case we retry, as the race window is quite small and the
> chances of succeeding next time are high.
> 
> With regard to the allocation, we restrict it to the node the page
> belongs to with __GFP_THISNODE, meaning we do not fall back on other
> nodes' zones.
> 
> Note that gigantic hugetlb pages are fenced off since there is a cyclic
> dependency between them and alloc_contig_range.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>   include/linux/hugetlb.h |   6 +++
>   mm/compaction.c         |  12 ++++++
>   mm/hugetlb.c            | 111 +++++++++++++++++++++++++++++++++++++++++++++++-
>   3 files changed, 127 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index b5807f23caf8..72352d718829 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -505,6 +505,7 @@ struct huge_bootmem_page {
>   	struct hstate *hstate;
>   };
>   
> +bool isolate_or_dissolve_huge_page(struct page *page);
>   struct page *alloc_huge_page(struct vm_area_struct *vma,
>   				unsigned long addr, int avoid_reserve);
>   struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> @@ -775,6 +776,11 @@ void set_page_huge_active(struct page *page);
>   #else	/* CONFIG_HUGETLB_PAGE */
>   struct hstate {};
>   
> +static inline bool isolate_or_dissolve_huge_page(struct page *page)
> +{
> +	return false;
> +}
> +
>   static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
>   					   unsigned long addr,
>   					   int avoid_reserve)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 190ccdaa6c19..d52506ed9db7 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -905,6 +905,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   			valid_page = page;
>   		}
>   
> +		if (PageHuge(page) && cc->alloc_contig) {
> +			if (!isolate_or_dissolve_huge_page(page))
> +				goto isolate_fail;


So, the callchain is:

alloc_contig_range()->__alloc_contig_migrate_range()->isolate_migratepages_range()->isolate_migratepages_block()

The case I am thinking about is if we run out of memory and would return 
-ENOMEM from alloc_and_dissolve_huge_page(). We silently drop the real 
error (e.g., -ENOMEM vs. -EBUSY vs. e.g., -EAGAIN) we had in 
isolate_or_dissolve_huge_page().


I think we should not swallow such return values in 
isolate_or_dissolve_huge_page() and instead properly report esp. -ENOMEM 
up this callchain now. Otherwise we'll end up retrying / 
reporting -EBUSY, which is misleading.
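
To make that concrete: with the v3 shape, the sketch below is all the 
caller gets to see (pieced together from the changelog above, not the 
literal patch):

	/*
	 * Sketch: with a bool return, the caller cannot tell -ENOMEM
	 * ("give up") apart from -EBUSY ("a retry might help").
	 */
	bool isolate_or_dissolve_huge_page(struct page *page)
	{
		struct hstate *h = page_hstate(page);

		if (hstate_is_gigantic(h))
			return false;	/* fenced off */

		/* -ENOMEM and -EBUSY both collapse into "false" here. */
		return alloc_and_dissolve_huge_page(h, page) == 0;
	}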

From isolate_migratepages_range()/isolate_migratepages_block() we'll 
keep reporting "pfn > 0".

a) In isolate_migratepages_range() we'll keep iterating over pageblocks 
although we should just fail with -ENOMEM right away.

b) In __alloc_contig_migrate_range() we'll keep retrying up to 5 times 
although we should just fail with -ENOMEM. We end up returning "-EBUSY" 
after retrying.

c) In alloc_contig_range() we'll continue trying to isolate although we 
should just return -ENOMEM.


I think we should start returning proper errors from 
isolate_migratepages_range()/isolate_migratepages_block() on critical 
issues (-EINTR, -ENOMEM) instead of going via "!pfn vs. pfn" and 
retrying on "pfn".

So we should then fail with -ENOMEM during isolate_migratepages_range() 
cleanly, just as we would do when we get -ENOMEM during migrate_pages().
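
As a rough sketch of what I mean (the exact signatures are of course up 
for discussion):

	/*
	 * Sketch: __alloc_contig_migrate_range() with an int-returning
	 * isolate_migratepages_range() - 0 on success, negative errno on
	 * failure - so fatal errors stop the retry loop right away.
	 */
	while (tries++ < 5) {
		ret = isolate_migratepages_range(cc, pfn, end);
		if (ret == -EINTR || ret == -ENOMEM)
			return ret;	/* fatal: report the real error */
		if (ret == -EBUSY)
			continue;	/* transient: worth another try */
		break;			/* isolated; hand over to migrate_pages() */
	}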



-- 
Thanks,

David / dhildenb




* Re: [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-03-01 14:09   ` David Hildenbrand
@ 2021-03-04 10:19     ` Oscar Salvador
  2021-03-04 10:32       ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Oscar Salvador @ 2021-03-04 10:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Michal Hocko, linux-mm,
	linux-kernel

On Mon, Mar 01, 2021 at 03:09:06PM +0100, David Hildenbrand wrote:
> On 22.02.21 14:51, Oscar Salvador wrote:
> > @@ -905,6 +905,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >   			valid_page = page;
> >   		}
> > +		if (PageHuge(page) && cc->alloc_contig) {
> > +			if (!isolate_or_dissolve_huge_page(page))
> > +				goto isolate_fail;

Bleh, sorry for the late reply, David, I was fairly busy.

> So, the callchain is:
> 
> alloc_contig_range()->__alloc_contig_migrate_range()->isolate_migratepages_range()->isolate_migratepages_block()
> 
> The case I am thinking about is if we run out of memory and would return
> -ENOMEM from alloc_and_dissolve_huge_page(). We silently drop the real error
> (e.g., -ENOMEM vs. -EBUSY vs. e.g., -EAGAIN) we had in
> isolate_or_dissolve_huge_page().

Yes, that is true.

> I think we should not swallow such return values in
> isolate_or_dissolve_huge_page() and instead properly report esp. -ENOMEM
> up this callchain now. Otherwise we'll end up retrying / reporting
> -EBUSY, which is misleading.

I am not sure I follow you here.
So, atm, alloc_and_dissolve_huge_page can either generate -ENOMEM or
-EBUSY wrt. error codes.
-ENOMEM when we cannot allocate a page, and -EBUSY when we raced with
someone.
You mean to only report -ENOMEM down the chain?
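
For reference, trimmed down it looks something like this (the allocation
step is elided into a hypothetical alloc_replacement_page() helper):

	static int alloc_and_dissolve_huge_page(struct hstate *h,
						struct page *old_page)
	{
		int nid = page_to_nid(old_page);
		struct page *new_page;

		/* Fresh hugepage on old_page's node (__GFP_THISNODE). */
		new_page = alloc_replacement_page(h, nid);
		if (!new_page)
			return -ENOMEM;		/* cannot allocate a page */

		spin_lock(&hugetlb_lock);
		if (!PageHugeFreed(old_page)) {
			/* We raced with someone; the caller retries once. */
			spin_unlock(&hugetlb_lock);
			put_page(new_page);
			return -EBUSY;
		}
		/* ... swap old_page for new_page in the free pool ... */
		spin_unlock(&hugetlb_lock);
		return 0;
	}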

> From isolate_migratepages_range()/isolate_migratepages_block() we'll keep
> reporting "pfn > 0".
> 
> a) In isolate_migratepages_range() we'll keep iterating over pageblocks
> although we should just fail with -ENOMEM right away.
> 
> b) In __alloc_contig_migrate_range() we'll keep retrying up to 5 times
> although we should just fail with -ENOMEM. We end up returning "-EBUSY"
> after retrying.
> 
> c) In alloc_contig_range() we'll continue trying to isolate although we
> should just return -ENOMEM.

Yes, "fatal" errors get masked, and hence we treat everything as "things
are busy, let us try again", and this is rather unfortunate.

> I think we should start returning proper errors from
> isolate_migratepages_range()/isolate_migratepages_block() on critical issues
> (-EINTR, -ENOMEM) instead of going via "!pfn vs. pfn" and retrying on "pfn".
> 
> So we should then fail with -ENOMEM during isolate_migratepages_range()
> cleanly, just as we would do when we get -ENOMEM during migrate_pages().

I guess we could rework the interface and make isolate_migratepages_range and
isolate_migratepages_block report the right thing.
I have yet to check that this does not mess up the compaction interface
too much.

But overall I agree with your point here, and I am willing to tackle
this.

The question is whether we want to do this as part of this series, or
after this series gets picked up.
IMHO, we could do it after, as a follow-up, unless you feel strongly about
it.

What do you think?


-- 
Oscar Salvador
SUSE L3



* Re: [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-03-04 10:19     ` Oscar Salvador
@ 2021-03-04 10:32       ` David Hildenbrand
  2021-03-04 10:41         ` Oscar Salvador
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2021-03-04 10:32 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Michal Hocko, linux-mm,
	linux-kernel

>> I think we should not swallow such return values in
>> isolate_or_dissolve_huge_page() and instead properly report esp. -ENOMEM
>> up this callchain now. Otherwise we'll end up retrying / reporting
>> -EBUSY, which is misleading.
> 
> I am not sure I follow you here.
> So, atm, alloc_and_dissolve_huge_page can either generate -ENOMEM or
> -EBUSY wrt. error codes.
> -ENOMEM when we cannot allocate a page, and -EBUSY when we raced with
> someone.
> You mean to only report -ENOMEM down the chain?

Yes, fatal errors.

> 
>> From isolate_migratepages_range()/isolate_migratepages_block() we'll keep
>> reporting "pfn > 0".
>>
>> a) In isolate_migratepages_range() we'll keep iterating over pageblocks
>> although we should just fail with -ENOMEM right away.
>>
>> b) In __alloc_contig_migrate_range() we'll keep retrying up to 5 times
>> although we should just fail with -ENOMEM. We end up returning "-EBUSY"
>> after retrying.
>>
>> c) In alloc_contig_range() we'll continue trying to isolate although we
>> should just return -ENOMEM.
> 
> Yes, "fatal" errors get masked, and hence we treat everything as "things
> are busy, let us try again", and this is rather unfortunate.
> 
>> I think we should start returning proper errors from
>> isolate_migratepages_range()/isolate_migratepages_block() on critical issues
>> (-EINTR, -ENOMEM) instead of going via "!pfn vs. pfn" and retrying on "pfn".
>>
>> So we should then fail with -ENOMEM during isolate_migratepages_range()
>> cleanly, just as we would do when we get -ENOMEM during migrate_pages().
> 
> I guess we could rework the interface and make isolate_migratepages_range and
> isolate_migratepages_block report the right thing.
> I have yet to check that this does not mess up the compaction interface
> too much.
> 
> But overall I agree with your point here, and I am willing to tackle
> this.
> 
> The question is whether we want to do this as part of this series, or
> after this series gets picked up.
> IMHO, we could do it after, as a follow-up, unless you feel strongly about
> it.
> 
> What do you think?

I think this is now the second fatal error we can have (-EINTR, 
-ENOMEM), thus the current interface (return "NULL" on fatal errors) no 
longer works properly.
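
To make it concrete, this is the call-site pattern that stops working
(simplified, from memory):

	/*
	 * In __alloc_contig_migrate_range() today, pfn == 0 is the only
	 * error signal, so every abort gets reported as -EINTR - even
	 * when the real cause was -ENOMEM.
	 */
	pfn = isolate_migratepages_range(cc, pfn, end);
	if (!pfn) {
		ret = -EINTR;
		break;
	}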

No strong opinion about fixing this up on top - it could be that it's 
cleaner to just do it right away.

-- 
Thanks,

David / dhildenb




* Re: [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-03-04 10:32       ` David Hildenbrand
@ 2021-03-04 10:41         ` Oscar Salvador
  0 siblings, 0 replies; 25+ messages in thread
From: Oscar Salvador @ 2021-03-04 10:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, Michal Hocko, linux-mm,
	linux-kernel

On Thu, Mar 04, 2021 at 11:32:29AM +0100, David Hildenbrand wrote:
> I think this is now the second fatal error we can have (-EINTR, -ENOMEM),
> thus the current interface (return "NULL" on fatal errors) no longer works
> properly.
> 
> No strong opinion about fixing this up on top - it could be that it's
> cleaner to just do it right away.

Ok, I see.

Then I will start working on that in v4.

Any more feedback on either patch#1 or patch#2?

Thanks!


-- 
Oscar Salvador
SUSE L3



* Re: [PATCH v3 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-22 13:51 ` [PATCH v3 2/2] mm: Make alloc_contig_range handle in-use " Oscar Salvador
  2021-02-25 23:05   ` Mike Kravetz
  2021-02-26  8:46   ` Michal Hocko
@ 2021-03-05 17:30   ` David Hildenbrand
  2 siblings, 0 replies; 25+ messages in thread
From: David Hildenbrand @ 2021-03-05 17:30 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Mike Kravetz, Muchun Song, Michal Hocko, linux-mm, linux-kernel

On 22.02.21 14:51, Oscar Salvador wrote:
> alloc_contig_range() will fail if it finds a HugeTLB page within the range,
> without a chance to handle it. Since HugeTLB pages can be migrated like any
> LRU or Movable page, it does not make sense to bail out without trying.
> Enable the interface to recognize in-use HugeTLB pages so we can migrate
> them, giving the call much better chances to succeed.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

LGTM as far as I can tell; I am only wondering whether it wouldn't be 
even cleaner to squash both patches.
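
FWIW, my mental model of the in-use handling is something like the sketch
below, assuming the int-returning interface discussed on patch#1 (the
extra list argument is a guess at the plumbing):

	static int isolate_or_dissolve_huge_page(struct page *page,
						 struct list_head *list)
	{
		struct hstate *h = page_hstate(page);

		if (hstate_is_gigantic(h))
			return -ENOMEM;		/* fenced off, as in patch#1 */

		if (page_count(page))		/* in use: try to migrate it */
			return isolate_huge_page(page, list) ? 0 : -EBUSY;

		return alloc_and_dissolve_huge_page(h, page);
	}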

-- 
Thanks,

David / dhildenb




end of thread, other threads:[~2021-03-05 17:31 UTC | newest]

Thread overview: 25+ messages
2021-02-22 13:51 [PATCH v3 0/2] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
2021-02-22 13:51 ` [PATCH v3 1/2] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
2021-02-25 20:03   ` Mike Kravetz
2021-02-26  9:48     ` Oscar Salvador
2021-02-26  8:35   ` Michal Hocko
2021-02-26  8:38     ` Michal Hocko
2021-02-26  9:25       ` David Hildenbrand
2021-02-26  9:47         ` Oscar Salvador
2021-02-26  9:45     ` Oscar Salvador
2021-02-26  9:51       ` Michal Hocko
2021-03-01 14:09   ` David Hildenbrand
2021-03-04 10:19     ` Oscar Salvador
2021-03-04 10:32       ` David Hildenbrand
2021-03-04 10:41         ` Oscar Salvador
2021-02-22 13:51 ` [PATCH v3 2/2] mm: Make alloc_contig_range handle in-use " Oscar Salvador
2021-02-25 23:05   ` Mike Kravetz
2021-02-26  8:46   ` Michal Hocko
2021-02-26 10:24     ` Oscar Salvador
2021-02-26 10:27       ` Oscar Salvador
2021-02-26 12:46       ` Michal Hocko
2021-02-28 13:43         ` Oscar Salvador
2021-03-05 17:30   ` David Hildenbrand
2021-03-01 12:43 ` [PATCH v3 0/2] Make alloc_contig_range handle Hugetlb pages David Hildenbrand
2021-03-01 12:57   ` Oscar Salvador
2021-03-01 12:59     ` David Hildenbrand
