linux-mm.kvack.org archive mirror
* [PATCH 0/2] Make alloc_contig_range handle Hugetlb pages
@ 2021-02-17 10:08 Oscar Salvador
  2021-02-17 10:08 ` [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
  2021-02-17 10:08 ` [PATCH 2/2] mm: Make alloc_contig_range handle in-use " Oscar Salvador
  0 siblings, 2 replies; 33+ messages in thread
From: Oscar Salvador @ 2021-02-17 10:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, David Hildenbrand, Muchun Song, Michal Hocko,
	linux-mm, linux-kernel, Oscar Salvador

Hi, 

this is the new version of [1].

 RFC -> v1:
 - Drop RFC
 - Addressed feedback from David and Mike
 - Fence off gigantic pages as there is a cyclic dependency between
   them and alloc_contig_range
 - Re-organize the code to make the race window smaller and to put
   all details in hugetlb code
 - Drop nodemask initialization. First a node will be tried and then we
   will fall back to other nodes containing memory (N_MEMORY). Details in
   patch#1's changelog
 - Count new page as surplus in case we failed to dissolve the old page
   and the new one. Details in patch#1.

I already have a patch that uses the new alloc_and_dissolve() from hotplug
code, but I decided to leave that patch out until this patchset is settled.

[1] https://patchwork.kernel.org/project/linux-mm/list/?series=429833

Oscar Salvador (2):
  mm: Make alloc_contig_range handle free hugetlb pages
  mm: Make alloc_contig_range handle in-use hugetlb pages

 include/linux/hugetlb.h |  7 ++++
 mm/compaction.c         | 21 +++++++++++
 mm/hugetlb.c            | 93 +++++++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c             |  5 +--
 4 files changed, 124 insertions(+), 2 deletions(-)

-- 
2.16.3




* [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-17 10:08 [PATCH 0/2] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
@ 2021-02-17 10:08 ` Oscar Salvador
  2021-02-17 13:30   ` Michal Hocko
  2021-02-17 15:00   ` Michal Hocko
  2021-02-17 10:08 ` [PATCH 2/2] mm: Make alloc_contig_range handle in-use " Oscar Salvador
  1 sibling, 2 replies; 33+ messages in thread
From: Oscar Salvador @ 2021-02-17 10:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, David Hildenbrand, Muchun Song, Michal Hocko,
	linux-mm, linux-kernel, Oscar Salvador

Free hugetlb pages are tricky to handle: so that no userspace application
notices any disruption, we need to replace the current free hugepage with
a new one.

In order to do that, a new function called alloc_and_dissolve_huge_page
is introduced.
This function will first try to allocate a fresh hugetlb page, and if it
succeeds, it will dissolve the old one.

With regard to the allocation, since we do not know whether the old page
was allocated on a specific node on request, the node the old page belongs
to will be tried first, and then we will fall back to all nodes containing
memory (N_MEMORY).

Note that gigantic hugetlb pages are fenced off since there is a cyclic
dependency between them and alloc_contig_range.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/hugetlb.h |  6 ++++
 mm/compaction.c         | 12 +++++++
 mm/hugetlb.c            | 91 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 109 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b5807f23caf8..72352d718829 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -505,6 +505,7 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
+bool isolate_or_dissolve_huge_page(struct page *page);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -775,6 +776,11 @@ void set_page_huge_active(struct page *page);
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
+static inline bool isolate_or_dissolve_huge_page(struct page *page)
+{
+	return false;
+}
+
 static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
 					   unsigned long addr,
 					   int avoid_reserve)
diff --git a/mm/compaction.c b/mm/compaction.c
index 190ccdaa6c19..d52506ed9db7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -905,6 +905,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			valid_page = page;
 		}
 
+		if (PageHuge(page) && cc->alloc_contig) {
+			if (!isolate_or_dissolve_huge_page(page))
+				goto isolate_fail;
+
+			/*
+			 * Ok, the hugepage was dissolved. Now these pages are
+			 * Buddy and cannot be re-allocated because they are
+			 * isolated. Fall-through as the check below handles
+			 * Buddy pages.
+			 */
+		}
+
 		/*
 		 * Skip if free. We read page order here without zone lock
 		 * which is generally unsafe, but the race window is small and
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4bdb58ab14cb..b78926bca60a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2294,6 +2294,97 @@ static void restore_reserve_on_error(struct hstate *h,
 	}
 }
 
+static bool alloc_and_dissolve_huge_page(struct hstate *h, struct page *page)
+{
+	gfp_t gfp_mask = htlb_alloc_mask(h);
+	nodemask_t *nmask = &node_states[N_MEMORY];
+	struct page *new_page;
+	bool ret = false;
+	int nid;
+
+	spin_lock(&hugetlb_lock);
+	/*
+	 * Check one more time to make the race window smaller.
+	 */
+	if (!PageHuge(page)) {
+		/*
+		 * Dissolved from under our feet.
+		 */
+		spin_unlock(&hugetlb_lock);
+		return true;
+	}
+
+	nid = page_to_nid(page);
+	spin_unlock(&hugetlb_lock);
+
+	/*
+	 * Before dissolving the page, we need to allocate a new one,
+	 * so the pool remains stable.
+	 */
+	new_page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
+	if (new_page) {
+		/*
+		 * Ok, we got a new free hugepage to replace this one. Try to
+		 * dissolve the old page.
+		 */
+		if (!dissolve_free_huge_page(page)) {
+			ret = true;
+		} else if (dissolve_free_huge_page(new_page)) {
+			/*
+			 * Seems the old page could not be dissolved, so try to
+			 * dissolve the freshly allocated page. If that fails
+			 * too, let us count the new page as a surplus. Doing so
+			 * allows the pool to be re-balanced when pages are freed
+			 * instead of enqueued again.
+			 */
+			spin_lock(&hugetlb_lock);
+			h->surplus_huge_pages++;
+			h->surplus_huge_pages_node[nid]++;
+			spin_unlock(&hugetlb_lock);
+		}
+		/*
+		 * Free it into the hugepage allocator
+		 */
+		put_page(new_page);
+	}
+
+	return ret;
+}
+
+bool isolate_or_dissolve_huge_page(struct page *page)
+{
+	struct hstate *h = NULL;
+	struct page *head;
+	bool ret = false;
+
+	spin_lock(&hugetlb_lock);
+	if (PageHuge(page)) {
+		head = compound_head(page);
+		h = page_hstate(head);
+	}
+	spin_unlock(&hugetlb_lock);
+
+	if (!h)
+		/*
+		 * The page might have been dissolved from under our feet.
+		 * If that is the case, return success as if we dissolved it
+		 * ourselves.
+		 */
+		return true;
+
+	if (hstate_is_gigantic(h))
+		/*
+		 * Fence off gigantic pages as there is a cyclic dependency
+		 * between alloc_contig_range and them.
+		 */
+		return ret;
+
+	if (!page_count(head) && alloc_and_dissolve_huge_page(h, head))
+		ret = true;
+
+	return ret;
+}
+
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
-- 
2.16.3




* [PATCH 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-17 10:08 [PATCH 0/2] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
  2021-02-17 10:08 ` [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
@ 2021-02-17 10:08 ` Oscar Salvador
  2021-02-17 13:36   ` Michal Hocko
  2021-02-17 15:06   ` Michal Hocko
  1 sibling, 2 replies; 33+ messages in thread
From: Oscar Salvador @ 2021-02-17 10:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, David Hildenbrand, Muchun Song, Michal Hocko,
	linux-mm, linux-kernel, Oscar Salvador

In-use hugetlb pages can be migrated like any other page (LRU
and Movable), so let alloc_contig_range handle them.

All we need is to successfully isolate such a page.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/hugetlb.h |  5 +++--
 mm/compaction.c         | 11 ++++++++++-
 mm/hugetlb.c            |  6 ++++--
 mm/vmscan.c             |  5 +++--
 4 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 72352d718829..8c17d0dbc87c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -505,7 +505,7 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
-bool isolate_or_dissolve_huge_page(struct page *page);
+bool isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -776,7 +776,8 @@ void set_page_huge_active(struct page *page);
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
-static inline bool isolate_or_dissolve_huge_page(struct page *page)
+static inline bool isolate_or_dissolve_huge_page(struct page *page,
+						 struct list_head *list)
 {
 	return false;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index d52506ed9db7..55a41a9228a9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -906,9 +906,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		}
 
 		if (PageHuge(page) && cc->alloc_contig) {
-			if (!isolate_or_dissolve_huge_page(page))
+			if (!isolate_or_dissolve_huge_page(page, &cc->migratepages))
 				goto isolate_fail;
 
+			if (PageHuge(page)) {
+				/*
+				 * Hugepage was successfully isolated.
+				 */
+				low_pfn += compound_nr(page) - 1;
+				goto isolate_success_no_list;
+			}
+
 			/*
 			 * Ok, the hugepage was dissolved. Now these pages are
 			 * Buddy and cannot be re-allocated because they are
@@ -1053,6 +1061,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 isolate_success:
 		list_add(&page->lru, &cc->migratepages);
+isolate_success_no_list:
 		cc->nr_migratepages += compound_nr(page);
 		nr_isolated += compound_nr(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b78926bca60a..9fa678d13c68 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2351,7 +2351,7 @@ static bool alloc_and_dissolve_huge_page(struct hstate *h, struct page *page)
 	return ret;
 }
 
-bool isolate_or_dissolve_huge_page(struct page *page)
+bool isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
 {
 	struct hstate *h = NULL;
 	struct page *head;
@@ -2379,7 +2379,9 @@ bool isolate_or_dissolve_huge_page(struct page *page)
 		 */
 		return ret;
 
-	if (!page_count(head) && alloc_and_dissolve_huge_page(h, head))
+	if (page_count(head) && isolate_huge_page(head, list))
+		ret = true;
+	else if (!page_count(head) && alloc_and_dissolve_huge_page(h, head))
 		ret = true;
 
 	return ret;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b1b574ad199d..0803adca4469 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1506,8 +1506,9 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 	LIST_HEAD(clean_pages);
 
 	list_for_each_entry_safe(page, next, page_list, lru) {
-		if (page_is_file_lru(page) && !PageDirty(page) &&
-		    !__PageMovable(page) && !PageUnevictable(page)) {
+		if (!PageHuge(page) && page_is_file_lru(page) &&
+		    !PageDirty(page) && !__PageMovable(page) &&
+		    !PageUnevictable(page)) {
 			ClearPageActive(page);
 			list_move(&page->lru, &clean_pages);
 		}
-- 
2.16.3




* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-17 10:08 ` [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
@ 2021-02-17 13:30   ` Michal Hocko
  2021-02-17 13:36     ` David Hildenbrand
  2021-02-17 13:42     ` Oscar Salvador
  2021-02-17 15:00   ` Michal Hocko
  1 sibling, 2 replies; 33+ messages in thread
From: Michal Hocko @ 2021-02-17 13:30 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Wed 17-02-21 11:08:15, Oscar Salvador wrote:
> Free hugetlb pages are tricky to handle: so that no userspace application
> notices any disruption, we need to replace the current free hugepage with
> a new one.
> 
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to allocate a fresh hugetlb page, and if it
> succeeds, it will dissolve the old one.
> 
> With regard to the allocation, since we do not know whether the old page
> was allocated on a specific node on request, the node the old page belongs
> to will be tried first, and then we will fall back to all nodes containing
> memory (N_MEMORY).

I do not think fallback to a different zone is ok. If yes then this
really requires a very good reasoning. alloc_contig_range is an
optimistic allocation interface at best and it shouldn't break carefully
node aware preallocation done by administrator.

> Note that gigantic hugetlb pages are fenced off since there is a cyclic
> dependency between them and alloc_contig_range.

Why do we need/want to do all this in the first place?
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-17 10:08 ` [PATCH 2/2] mm: Make alloc_contig_range handle in-use " Oscar Salvador
@ 2021-02-17 13:36   ` Michal Hocko
  2021-02-17 13:46     ` Oscar Salvador
  2021-02-17 15:06   ` Michal Hocko
  1 sibling, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-02-17 13:36 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Wed 17-02-21 11:08:16, Oscar Salvador wrote:
> In-use hugetlb pages can be migrated like any other page (LRU
> and Movable), so let alloc_contig_range handle them.
> 
> All we need is to successfully isolate such a page.

Again, this is missing a problem statement and a justification why we
want/need this.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-17 13:30   ` Michal Hocko
@ 2021-02-17 13:36     ` David Hildenbrand
  2021-02-17 13:50       ` Michal Hocko
  2021-02-17 13:42     ` Oscar Salvador
  1 sibling, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2021-02-17 13:36 UTC (permalink / raw)
  To: Michal Hocko, Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, Muchun Song, linux-mm, linux-kernel

On 17.02.21 14:30, Michal Hocko wrote:
> On Wed 17-02-21 11:08:15, Oscar Salvador wrote:
>> Free hugetlb pages are tricky to handle: so that no userspace application
>> notices any disruption, we need to replace the current free hugepage with
>> a new one.
>>
>> In order to do that, a new function called alloc_and_dissolve_huge_page
>> is introduced.
>> This function will first try to allocate a fresh hugetlb page, and if it
>> succeeds, it will dissolve the old one.
>>
>> With regard to the allocation, since we do not know whether the old page
>> was allocated on a specific node on request, the node the old page belongs
>> to will be tried first, and then we will fall back to all nodes containing
>> memory (N_MEMORY).
> 
> I do not think fallback to a different zone is ok. If yes then this
> really requires a very good reasoning. alloc_contig_range is an
> optimistic allocation interface at best and it shouldn't break carefully
> node aware preallocation done by administrator.

What does memory offlining do when migrating in-use hugetlbfs pages? 
Does it always keep the node?

I think keeping the node is the easiest/simplest approach for now.

> 
>> Note that gigantic hugetlb pages are fenced off since there is a cyclic
>> dependency between them and alloc_contig_range.
> 
> Why do we need/want to do all this in the first place?

cma and virtio-mem (especially on ZONE_MOVABLE) really want to handle 
hugetlbfs pages.

-- 
Thanks,

David / dhildenb




* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-17 13:30   ` Michal Hocko
  2021-02-17 13:36     ` David Hildenbrand
@ 2021-02-17 13:42     ` Oscar Salvador
  1 sibling, 0 replies; 33+ messages in thread
From: Oscar Salvador @ 2021-02-17 13:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Wed, Feb 17, 2021 at 02:30:43PM +0100, Michal Hocko wrote:
> On Wed 17-02-21 11:08:15, Oscar Salvador wrote:
> I do not think fallback to a different zone is ok. If yes then this
> really requires a very good reasoning. alloc_contig_range is an
> optimistic allocation interface at best and it shouldn't break carefully
> node aware preallocation done by administrator.

Yeah, previous version (RFC) was more careful with that.
I somehow thought that it might be ok to fall back to other nodes in case
we failed to allocate on the preferred nid.

I will bring the RFC handling back wrt. the allocation once I gather more
feedback.

> 
> > Note that gigantic hugetlb pages are fenced off since there is a cyclic
> > dependency between them and alloc_contig_range.
> 
> Why do we need/want to do all this in the first place?

When trying to allocate a memory chunk with alloc_contig_range, it will fail
if it ever sees a HugeTLB page because isolate_migratepages_range() does not
"recognize" those for migration (or, to put it differently, does not know
about them).

Given that HugeTLB pages can be migrated or dissolved if free, it makes sense
to enable isolate_migratepages_range() to recognize HugeTLB pages in order
to handle them, as it currently does with LRU and Movable pages.
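
For reference, the failure path looks roughly like this (simplified call
chain, names from current mainline; error handling elided):

  alloc_contig_range()
    __alloc_contig_migrate_range()
      isolate_migratepages_range()
        isolate_migratepages_block()  /* no way to isolate a PageHuge() page */

and the range is eventually given up on, which is what gets reported as
"alloc_contig_range: [%lx, %lx) PFNs busy".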

-- 
Oscar Salvador
SUSE L3



* Re: [PATCH 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-17 13:36   ` Michal Hocko
@ 2021-02-17 13:46     ` Oscar Salvador
  2021-02-17 13:54       ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-02-17 13:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Wed, Feb 17, 2021 at 02:36:31PM +0100, Michal Hocko wrote:
> On Wed 17-02-21 11:08:16, Oscar Salvador wrote:
> > In-use hugetlb pages can be migrated like any other page (LRU
> > and Movable), so let alloc_contig_range handle them.
> > 
> > All we need is to successfully isolate such a page.
> 
> Again, this is missing a problem statement and a justification why we
> want/need this.

Heh, I was poor in words.

"alloc_contig_range() will fail miserably if it finds a HugeTLB page within
 the range without a chance to handle it. Since HugeTLB pages can be migrated
 like any other page (LRU and Movable), it does not make sense to bail out.
 Enable the interface to recognize in-use HugeTLB pages and have a chance
 to migrate them"

What about something along those lines?


-- 
Oscar Salvador
SUSE L3



* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-17 13:36     ` David Hildenbrand
@ 2021-02-17 13:50       ` Michal Hocko
  2021-02-17 13:53         ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-02-17 13:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, Andrew Morton, Mike Kravetz, Muchun Song,
	linux-mm, linux-kernel

On Wed 17-02-21 14:36:47, David Hildenbrand wrote:
> On 17.02.21 14:30, Michal Hocko wrote:
> > On Wed 17-02-21 11:08:15, Oscar Salvador wrote:
> > > Free hugetlb pages are tricky to handle: so that no userspace application
> > > notices any disruption, we need to replace the current free hugepage with
> > > a new one.
> > > 
> > > In order to do that, a new function called alloc_and_dissolve_huge_page
> > > is introduced.
> > > This function will first try to allocate a fresh hugetlb page, and if it
> > > succeeds, it will dissolve the old one.
> > > 
> > > With regard to the allocation, since we do not know whether the old page
> > > was allocated on a specific node on request, the node the old page belongs
> > > to will be tried first, and then we will fall back to all nodes containing
> > > memory (N_MEMORY).
> > 
> > I do not think fallback to a different zone is ok. If yes then this
> > really requires a very good reasoning. alloc_contig_range is an
> > optimistic allocation interface at best and it shouldn't break carefully
> > node aware preallocation done by administrator.
> 
> What does memory offlining do when migrating in-use hugetlbfs pages? Does it
> always keep the node?

No, it will break the node pool. The reasoning behind that is that
offlining is an explicit request from userspace and it is expected
to break affinities because it is a destructive action from the memory
capacity point of view. It is impossible to keep the former affinity while
you are cutting the memory off under its user.

> I think keeping the node is the easiest/simplest approach for now.
> 
> > 
> > > Note that gigantic hugetlb pages are fenced off since there is a cyclic
> > > dependency between them and alloc_contig_range.
> > 
> > Why do we need/want to do all this in the first place?
> 
> cma and virtio-mem (especially on ZONE_MOVABLE) really want to handle
> hugetlbfs pages.

Do we have any real life examples? Or does this fall more into the "let's
optimize an existing implementation" category?
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-17 13:50       ` Michal Hocko
@ 2021-02-17 13:53         ` David Hildenbrand
  2021-02-17 13:59           ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2021-02-17 13:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oscar Salvador, Andrew Morton, Mike Kravetz, Muchun Song,
	linux-mm, linux-kernel

On 17.02.21 14:50, Michal Hocko wrote:
> On Wed 17-02-21 14:36:47, David Hildenbrand wrote:
>> On 17.02.21 14:30, Michal Hocko wrote:
>>> On Wed 17-02-21 11:08:15, Oscar Salvador wrote:
>>>> Free hugetlb pages are tricky to handle: so that no userspace application
>>>> notices any disruption, we need to replace the current free hugepage with
>>>> a new one.
>>>>
>>>> In order to do that, a new function called alloc_and_dissolve_huge_page
>>>> is introduced.
>>>> This function will first try to allocate a fresh hugetlb page, and if it
>>>> succeeds, it will dissolve the old one.
>>>>
>>>> With regard to the allocation, since we do not know whether the old page
>>>> was allocated on a specific node on request, the node the old page belongs
>>>> to will be tried first, and then we will fall back to all nodes containing
>>>> memory (N_MEMORY).
>>>
>>> I do not think fallback to a different zone is ok. If yes then this
>>> really requires a very good reasoning. alloc_contig_range is an
>>> optimistic allocation interface at best and it shouldn't break carefully
>>> node aware preallocation done by administrator.
>>
>> What does memory offlining do when migrating in-use hugetlbfs pages? Does it
>> always keep the node?
> 
> No, it will break the node pool. The reasoning behind that is that
> offlining is an explicit request from userspace and it is expected

Userspace? In 99.9996% of cases it's the hardware that triggers the unplug of a DIMM.

> 
>> I think keeping the node is the easiest/simplest approach for now.
>>
>>>
>>>> Note that gigantic hugetlb pages are fenced off since there is a cyclic
>>>> dependency between them and alloc_contig_range.
>>>
>>> Why do we need/want to do all this in the first place?
>>
>> cma and virtio-mem (especially on ZONE_MOVABLE) really want to handle
>> hugetlbfs pages.
> 
> Do we have any real life examples? Or does this fall more into the "let's
> optimize an existing implementation" category?
> 

It's a big TODO item I have on my list and I am happy that Oscar is 
looking into it. So yes, I noticed it while working on virtio-mem. It's 
real.

-- 
Thanks,

David / dhildenb




* Re: [PATCH 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-17 13:46     ` Oscar Salvador
@ 2021-02-17 13:54       ` Michal Hocko
  0 siblings, 0 replies; 33+ messages in thread
From: Michal Hocko @ 2021-02-17 13:54 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Wed 17-02-21 14:46:49, Oscar Salvador wrote:
> On Wed, Feb 17, 2021 at 02:36:31PM +0100, Michal Hocko wrote:
> > On Wed 17-02-21 11:08:16, Oscar Salvador wrote:
> > > In-use hugetlb pages can be migrated like any other page (LRU
> > > and Movable), so let alloc_contig_range handle them.
> > > 
> > > All we need is to successfully isolate such a page.
> > 
> > Again, this is missing a problem statement and a justification why we
> > want/need this.
> 
> Heh, I was poor in words.
> 
> "alloc_contig_range() will fail miserably if it finds a HugeTLB page within
>  the range without a chance to handle them. Since HugeTLB pages can be migrated
>  as any other page (LRU and Movable), it does not make sense to bail out.
>  Enable the interface to recognize in-use HugeTLB pages and have a chance
>  to migrate them"
> 
> What about something along those lines?

Is this a real life problem? I know we _can_ but I do not see any
reasoning _why_ we should care all that much.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-17 13:53         ` David Hildenbrand
@ 2021-02-17 13:59           ` Michal Hocko
  2021-02-17 14:08             ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-02-17 13:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, Andrew Morton, Mike Kravetz, Muchun Song,
	linux-mm, linux-kernel

On Wed 17-02-21 14:53:37, David Hildenbrand wrote:
> On 17.02.21 14:50, Michal Hocko wrote:
[...]
> > Do we have any real life examples? Or does this fall more into the "let's
> > optimize an existing implementation" category?
> > 
> 
> It's a big TODO item I have on my list and I am happy that Oscar is looking
> into it. So yes, I noticed it while working on virtio-mem. It's real.

Do not take me wrong, I am not opposing the functionality. I am
asking for the specific usecase.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-17 13:59           ` Michal Hocko
@ 2021-02-17 14:08             ` David Hildenbrand
  2021-02-17 14:14               ` Michal Hocko
  2021-02-17 14:23               ` Oscar Salvador
  0 siblings, 2 replies; 33+ messages in thread
From: David Hildenbrand @ 2021-02-17 14:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oscar Salvador, Andrew Morton, Mike Kravetz, Muchun Song,
	linux-mm, linux-kernel

On 17.02.21 14:59, Michal Hocko wrote:
> On Wed 17-02-21 14:53:37, David Hildenbrand wrote:
>> On 17.02.21 14:50, Michal Hocko wrote:
> [...]
>>> Do we have any real life examples? Or does this fall more into the "let's
>>> optimize an existing implementation" category?
>>>
>>
>> It's a big TODO item I have on my list and I am happy that Oscar is looking
>> into it. So yes, I noticed it while working on virtio-mem. It's real.
> 
> Do not take me wrong, I am not opposing the functionality. I am
> asking for the specific usecase.

Makes sense, and a proper motivation should be included in the 
patches/cover letter. So here comes a quick-n-dirty example:


Start a VM with 4G. Hotplug 1G via virtio-mem and online it to 
ZONE_MOVABLE. Allocate 512 huge pages.

[root@localhost ~]# cat /proc/meminfo
MemTotal:        5061512 kB
MemFree:         3319396 kB
MemAvailable:    3457144 kB
...
HugePages_Total:     512
HugePages_Free:      512
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB


The huge pages get partially allocated from ZONE_MOVABLE. Try unplugging
1G via virtio-mem (remember, all ZONE_MOVABLE). Inside the guest:

[  180.058992] alloc_contig_range: [1b8000, 1c0000) PFNs busy
[  180.060531] alloc_contig_range: [1b8000, 1c0000) PFNs busy
[  180.061972] alloc_contig_range: [1b8000, 1c0000) PFNs busy
[  180.063413] alloc_contig_range: [1b8000, 1c0000) PFNs busy
[  180.064838] alloc_contig_range: [1b8000, 1c0000) PFNs busy
[  180.065848] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
[  180.066794] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
[  180.067738] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
[  180.068669] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
[  180.069598] alloc_contig_range: [1bfc00, 1c0000) PFNs busy


I succeeded in unplugging 540 MB; 484 MB remain blocked by huge pages
("which did not end up there by pure luck"). These pages are movable
(and even free!) and can easily be reallocated.

-- 
Thanks,

David / dhildenb




* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-17 14:08             ` David Hildenbrand
@ 2021-02-17 14:14               ` Michal Hocko
  2021-02-17 14:23               ` Oscar Salvador
  1 sibling, 0 replies; 33+ messages in thread
From: Michal Hocko @ 2021-02-17 14:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, Andrew Morton, Mike Kravetz, Muchun Song,
	linux-mm, linux-kernel

On Wed 17-02-21 15:08:04, David Hildenbrand wrote:
> On 17.02.21 14:59, Michal Hocko wrote:
> > On Wed 17-02-21 14:53:37, David Hildenbrand wrote:
> > > On 17.02.21 14:50, Michal Hocko wrote:
> > [...]
> > > > Do we have any real life examples? Or does this fall more into the "let's
> > > > optimize an existing implementation" category?
> > > > 
> > > 
> > > It's a big TODO item I have on my list and I am happy that Oscar is looking
> > > into it. So yes, I noticed it while working on virtio-mem. It's real.
> > 
> > Do not take me wrong, I am not opposing the functionality. I am
> > asking for the specific usecase.
> 
> Makes sense, and a proper motivation should be included in the patches/cover
> letter. So here comes a quick-n-dirty example:
> 
> 
> Start a VM with 4G. Hotplug 1G via virtio-mem and online it to ZONE_MOVABLE.
> Allocate 512 huge pages.
> 
> [root@localhost ~]# cat /proc/meminfo
> MemTotal:        5061512 kB
> MemFree:         3319396 kB
> MemAvailable:    3457144 kB
> ...
> HugePages_Total:     512
> HugePages_Free:      512
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> 
> 
> The huge pages get partially allocated from ZONE_MOVABLE. Try unplugging 1G
> via virtio-mem (remember, all ZONE_MOVABLE). Inside the guest:
> 
> [  180.058992] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [  180.060531] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [  180.061972] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [  180.063413] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [  180.064838] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [  180.065848] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [  180.066794] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [  180.067738] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [  180.068669] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [  180.069598] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> 
> 
> I succeeded in unplugging 540 MB; 484 MB remain blocked by huge pages
> ("which did not end up there by pure luck"). These pages are movable
> (and even free!) and can easily be reallocated.

OK, this sounds reasonable.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-17 14:08             ` David Hildenbrand
  2021-02-17 14:14               ` Michal Hocko
@ 2021-02-17 14:23               ` Oscar Salvador
  1 sibling, 0 replies; 33+ messages in thread
From: Oscar Salvador @ 2021-02-17 14:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Andrew Morton, Mike Kravetz, Muchun Song, linux-mm,
	linux-kernel

On Wed, Feb 17, 2021 at 03:08:04PM +0100, David Hildenbrand wrote:
> On 17.02.21 14:59, Michal Hocko wrote:
> > On Wed 17-02-21 14:53:37, David Hildenbrand wrote:
> > > On 17.02.21 14:50, Michal Hocko wrote:
> > [...]
> > > > Do we have any real life examples? Or does this fall more into, let's
> > > > optimize an existing implementation category.
> > > > 
> > > 
> > > It's a big TODO item I have on my list and I am happy that Oscar is looking
> > > into it. So yes, I noticed it while working on virtio-mem. It's real.
> > 
> > Do not take me wrong, I am not opposing the functionality. I am
> > asking for the specific usecase.
> 
> Makes sense, and a proper motivation should be included in the patches/cover
> letter. So here comes a quick-n-dirty example:

Definitely. I took it for granted that the problem was obvious.

> Start a VM with 4G. Hotplug 1G via virtio-mem and online it to ZONE_MOVABLE.
> Allocate 512 huge pages.
> 
> [root@localhost ~]# cat /proc/meminfo
> MemTotal:        5061512 kB
> MemFree:         3319396 kB
> MemAvailable:    3457144 kB
> ...
> HugePages_Total:     512
> HugePages_Free:      512
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> 
> 
> The huge pages get partially allocated from ZONE_MOVABLE. Try unplugging 1G
> via virtio-mem (remember, all ZONE_MOVABLE). Inside the guest:
> 
> [  180.058992] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [  180.060531] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [  180.061972] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [  180.063413] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [  180.064838] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [  180.065848] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [  180.066794] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [  180.067738] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [  180.068669] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [  180.069598] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> 
> 
> I succeeded in unplugging 540 MB; 484 MB remain blocked by huge pages
> ("which did not end up there by pure luck"). These pages are movable
> (and even free!) and can easily be reallocated.

Thanks for providing such a detailed case, David.
I will gather the bits and prepare a v2 once I get more feedback.

-- 
Oscar Salvador
SUSE L3



* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-17 10:08 ` [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
  2021-02-17 13:30   ` Michal Hocko
@ 2021-02-17 15:00   ` Michal Hocko
  2021-02-18 10:09     ` Oscar Salvador
  1 sibling, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-02-17 15:00 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Wed 17-02-21 11:08:15, Oscar Salvador wrote:
[...]
> +static bool alloc_and_dissolve_huge_page(struct hstate *h, struct page *page)
> +{
> +	gfp_t gfp_mask = htlb_alloc_mask(h);
> +	nodemask_t *nmask = &node_states[N_MEMORY];
> +	struct page *new_page;
> +	bool ret = false;
> +	int nid;
> +
> +	spin_lock(&hugetlb_lock);
> +	/*
> +	 * Check one more time to make the race window smaller.
> +	 */
> +	if (!PageHuge(page)) {
> +		/*
> +		 * Dissolved from under our feet.
> +		 */
> +		spin_unlock(&hugetlb_lock);
> +		return true;
> +	}

Is this really necessary? dissolve_free_huge_page will take care of this
and the race window you are covering is really tiny.

> +
> +	nid = page_to_nid(page);
> +	spin_unlock(&hugetlb_lock);
> +
> +	/*
> +	 * Before dissolving the page, we need to allocate a new one,
> +	 * so the pool remains stable.
> +	 */
> +	new_page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);

wrt. fallback to other zones, I hadn't realized that the primary
usecase is a form of memory offlining (from virtio-mem). I am not yet sure
what the proper behavior is in that case but if breaking hugetlb pools,
similar to the normal hotplug operation, is viable then this needs a
special mode. We do not want a random alloc_contig_range user to do the
same. So for starters I would go with __GFP_THISNODE here.
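
As a minimal sketch of that starter mode (hypothetical, based on the hunk
quoted above):

	/*
	 * Stay on the node of the old page. A random alloc_contig_range
	 * user must not break node aware preallocation by falling back
	 * to other nodes.
	 */
	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

	new_page = alloc_fresh_huge_page(h, gfp_mask, nid, NULL, NULL);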

> +	if (new_page) {
> +		/*
> +		 * Ok, we got a new free hugepage to replace this one. Try to
> +		 * dissolve the old page.
> +		 */
> +		if (!dissolve_free_huge_page(page)) {
> +			ret = true;
> +		} else if (dissolve_free_huge_page(new_page)) {
> +			/*
> +			 * Seems the old page could not be dissolved, so try to
> +			 * dissolve the freshly allocated page. If that fails
> +			 * too, let us count the new page as a surplus. Doing so
> +			 * allows the pool to be re-balanced when pages are freed
> +			 * instead of enqueued again.
> +			 */
> +			spin_lock(&hugetlb_lock);
> +			h->surplus_huge_pages++;
> +			h->surplus_huge_pages_node[nid]++;
> +			spin_unlock(&hugetlb_lock);
> +		}
> +		/*
> +		 * Free it into the hugepage allocator
> +		 */
> +		put_page(new_page);
> +	}
> +
> +	return ret;
> +}
> +
> +bool isolate_or_dissolve_huge_page(struct page *page)
> +{
> +	struct hstate *h = NULL;
> +	struct page *head;
> +	bool ret = false;
> +
> +	spin_lock(&hugetlb_lock);
> +	if (PageHuge(page)) {
> +		head = compound_head(page);
> +		h = page_hstate(head);
> +	}
> +	spin_unlock(&hugetlb_lock);
> +
> +	if (!h)
> +		/*
> +		 * The page might have been dissolved from under our feet.
> +		 * If that is the case, return success as if we dissolved it
> +		 * ourselves.
> +		 */
> +		return true;

nit: I would put the comment above the condition for both cases. It reads
more easily that way. At least without { }.

> +
> +	if (hstate_is_gigantic(h))
> +		/*
> +		 * Fence off gigantic pages as there is a cyclic dependency
> +		 * between alloc_contig_range and them.
> +		 */
> +		return ret;
> +
> +	if (!page_count(head) && alloc_and_dissolve_huge_page(h, head))
> +		ret = true;
> +
> +	return ret;
> +}
> +
>  struct page *alloc_huge_page(struct vm_area_struct *vma,
>  				    unsigned long addr, int avoid_reserve)
>  {

Other than that I haven't noticed any surprises.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-17 10:08 ` [PATCH 2/2] mm: Make alloc_contig_range handle in-use " Oscar Salvador
  2021-02-17 13:36   ` Michal Hocko
@ 2021-02-17 15:06   ` Michal Hocko
  2021-02-17 15:27     ` Oscar Salvador
  1 sibling, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-02-17 15:06 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Wed 17-02-21 11:08:16, Oscar Salvador wrote:
> In-use hugetlb pages can be migrated like any other page (LRU
> and Movable), so let alloc_contig_range handle them.
> 
> All we need is to successfully isolate such a page.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  include/linux/hugetlb.h |  5 +++--
>  mm/compaction.c         | 11 ++++++++++-
>  mm/hugetlb.c            |  6 ++++--
>  mm/vmscan.c             |  5 +++--
>  4 files changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 72352d718829..8c17d0dbc87c 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -505,7 +505,7 @@ struct huge_bootmem_page {
>  	struct hstate *hstate;
>  };
>  
> -bool isolate_or_dissolve_huge_page(struct page *page);
> +bool isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
>  struct page *alloc_huge_page(struct vm_area_struct *vma,
>  				unsigned long addr, int avoid_reserve);
>  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> @@ -776,7 +776,8 @@ void set_page_huge_active(struct page *page);
>  #else	/* CONFIG_HUGETLB_PAGE */
>  struct hstate {};
>  
> -static inline bool isolate_or_dissolve_huge_page(struct page *page)
> +static inline bool isolate_or_dissolve_huge_page(struct page *page,
> +						 struct list_head *list)
>  {
>  	return false;
>  }
> diff --git a/mm/compaction.c b/mm/compaction.c
> index d52506ed9db7..55a41a9228a9 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -906,9 +906,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  		}
>  
>  		if (PageHuge(page) && cc->alloc_contig) {
> -			if (!isolate_or_dissolve_huge_page(page))
> +			if (!isolate_or_dissolve_huge_page(page, &cc->migratepages))
>  				goto isolate_fail;
>  
> +			if (PageHuge(page)) {
> +				/*
> +				 * Hugepage was successfully isolated.
> +				 */
> +				low_pfn += compound_nr(page) - 1;
> +				goto isolate_success_no_list;
> +			}
> +
>  			/*
>  			 * Ok, the hugepage was dissolved. Now these pages are
>  			 * Buddy and cannot be re-allocated because they are
> @@ -1053,6 +1061,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  
>  isolate_success:
>  		list_add(&page->lru, &cc->migratepages);
> +isolate_success_no_list:
>  		cc->nr_migratepages += compound_nr(page);

I do not follow. You have successfully isolated huge pages so why don't
you add that page to the migratepages list to be migrated?
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-17 15:06   ` Michal Hocko
@ 2021-02-17 15:27     ` Oscar Salvador
  2021-02-17 15:33       ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-02-17 15:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On 2021-02-17 16:06, Michal Hocko wrote:
> I do not follow. You have successfully isolated huge pages so why don't
> you add that page to the migratepages list to be migrated?

It is added. Note that I pass the list (cc->migratepages) to
isolate_or_dissolve_huge_page().
It is done this way because we have to pass the list to
isolate_huge_page() directly.
That is why I jump to isolate_success_no_list instead of isolate_success.

I know it is subtle, but I could not come up with a better name at that
time.

-- 
Oscar Salvador
SUSE L3



* Re: [PATCH 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-17 15:27     ` Oscar Salvador
@ 2021-02-17 15:33       ` Michal Hocko
  2021-02-18  6:01         ` Oscar Salvador
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-02-17 15:33 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Wed 17-02-21 16:27:27, Oscar Salvador wrote:
> On 2021-02-17 16:06, Michal Hocko wrote:
> > I do not follow. You have successfully isolated huge pages so why don't
> > you add that page to the migratepages list to be migrated?
> 
> It is added. Note that I pass the list (cc->migratepages) to
> isolate_or_dissolve_huge_page().
> It is done this way because we have to pass the list to isolate_huge_page()
> directly.
> That is why I jump to isolate_success_no_list instead of isolate_success.
> 
> I know it is subtle, but I could not come up with a better name at that
> time.

OK, I have missed that. Maybe just extend the comment. 

	/*
	 * Hugepage was successfully isolated and is on the migratepages
	 * list.
	 */
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 2/2] mm: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-17 15:33       ` Michal Hocko
@ 2021-02-18  6:01         ` Oscar Salvador
  0 siblings, 0 replies; 33+ messages in thread
From: Oscar Salvador @ 2021-02-18  6:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Wed, Feb 17, 2021 at 04:33:18PM +0100, Michal Hocko wrote:
> OK, I have missed that. Maybe just extend the comment. 
> 
> 	/*
> 	 * Hugepage was successfully isolated and is on the migratepages
> 	 * list.
> 	 */

Sure, I will improve it.

Thanks 

-- 
Oscar Salvador
SUSE L3



* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-17 15:00   ` Michal Hocko
@ 2021-02-18 10:09     ` Oscar Salvador
  2021-02-18 12:52       ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-02-18 10:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Wed, Feb 17, 2021 at 04:00:11PM +0100, Michal Hocko wrote:
> Is this really necessary? dissolve_free_huge_page will take care of this
> and the race window you are covering is really tiny.

Probably not, I was trying to shrink the race window as much as possible
but the call to dissolve_free_huge_page might be enough.

> > +	nid = page_to_nid(page);
> > +	spin_unlock(&hugetlb_lock);
> > +
> > +	/*
> > +	 * Before dissolving the page, we need to allocate a new one,
> > +	 * so the pool remains stable.
> > +	 */
> > +	new_page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
> 
> wrt. fallback to other zones, I hadn't realized that the primary
> usecase is a form of memory offlining (from virtio-mem). I am not yet sure
> what the proper behavior is in that case but if breaking hugetlb pools,
> similar to the normal hotplug operation, is viable then this needs a
> special mode. We do not want a random alloc_contig_range user to do the
> same. So for starters I would go with __GFP_THISNODE here.

Ok, makes sense.
__GFP_THISNODE will not allow fallback to other node's zones.
Since we only allow the nid the page belongs to, nodemask should be
NULL, right?

> > +	if (!h)
> > +		/*
> > +		 * The page might have been dissolved from under our feet.
> > +		 * If that is the case, return success as if we dissolved it
> > +		 * ourselves.
> > +		 */
> > +		return true;
> 
> nit: I would put the comment above the condition for both cases. It reads
> more easily that way. At least without { }.

Yes, makes sense.

> Other than that I haven't noticed any surprises.

I did. The 'put_page' call should be placed above, right after getting
the page. Otherwise, refcount == 1 and we will fail to dissolve the
new page if we need to (in case old page fails to be dissolved).
I already fixed that locally.

-- 
Oscar Salvador
SUSE L3



* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-18 10:09     ` Oscar Salvador
@ 2021-02-18 12:52       ` Michal Hocko
  2021-02-18 13:32         ` Oscar Salvador
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-02-18 12:52 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Thu 18-02-21 11:09:17, Oscar Salvador wrote:
> On Wed, Feb 17, 2021 at 04:00:11PM +0100, Michal Hocko wrote:
> > Is this really necessary? dissolve_free_huge_page will take care of this
> > and the race window you are covering is really tiny.
> 
> Probably not, I was trying to shrink the race window as much as possible
> but the call to dissolve_free_huge_page might be enough.
> 
> > > +	nid = page_to_nid(page);
> > > +	spin_unlock(&hugetlb_lock);
> > > +
> > > +	/*
> > > +	 * Before dissolving the page, we need to allocate a new one,
> > > +	 * so the pool remains stable.
> > > +	 */
> > > +	new_page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
> > 
> > wrt. fallback to other zones, I hadn't realized that the primary
> > usecase is a form of memory offlining (from virtio-mem). I am not yet sure
> > what the proper behavior is in that case but if breaking hugetlb pools,
> > similar to the normal hotplug operation, is viable then this needs a
> > special mode. We do not want a random alloc_contig_range user to do the
> > same. So for starters I would go with __GFP_THISNODE here.
> 
> Ok, makes sense.
> __GFP_THISNODE will not allow fallback to other node's zones.
> Since we only allow the nid the page belongs to, nodemask should be
> NULL, right?

I would have to double check because hugetlb has slightly different
expectations from the nodemask than the page allocator. The latter translates
that to all possible nodes but the hugetlb API tries to dereference nodes.
Maybe THIS node special cases it somewhere.

> > > +	if (!h)
> > > +		/*
> > > +		 * The page might have been dissolved from under our feet.
> > > +		 * If that is the case, return success as if we dissolved it
> > > +		 * ourselves.
> > > +		 */
> > > +		return true;
> > 
> > nit: I would put the comment above the condition for both cases. It reads
> > more easily that way. At least without { }.
> 
> Yes, makes sense.
> 
> > Other than that I haven't noticed any surprises.
> 
> I did. The 'put_page' call should be placed above, right after getting
> the page. Otherwise, refcount == 1 and we will fail to dissolve the
> new page if we need to (in case old page fails to be dissolved).
> I already fixed that locally.

I am not sure I follow. The newly allocated page is unreferenced
unconditionally and the old page is not referenced by this path.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-18 12:52       ` Michal Hocko
@ 2021-02-18 13:32         ` Oscar Salvador
  2021-02-18 13:59           ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-02-18 13:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Thu, Feb 18, 2021 at 01:52:38PM +0100, Michal Hocko wrote:
> > Ok, makes sense.
> > __GFP_THISNODE will not allow fallback to other node's zones.
> > Since we only allow the nid the page belongs to, nodemask should be
> > NULL, right?
> 
> I would have to double check because hugetlb has slightly different
> expectations from the nodemask than the page allocator. The latter translates
> that to all possible nodes but the hugetlb API tries to dereference nodes.
> Maybe THIS node special cases it somewhere.

Uhm, I do not quite follow here.
AFAICS, alloc_fresh_huge_page->alloc_buddy_huge_page does nothing
with the nodemask, but rather with the node_alloc_noretry mask. That is
done so that we do not retry on a node where we failed to allocate a page.

Now, alloc_buddy_huge_page calls __alloc_pages_nodemask directly.
If my understanding is correct, it is ok to have a null nodemask
as __next_zones_zonelist() will go through our own zonelist,
since __GFP_THISNODE made us take ZONELIST_NOFALLBACK.

Actually, I do not see how passing a non-null nodemask might have
helped there, unless we allow specifying more nodes.
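
For reference, this is (roughly) how __GFP_THISNODE picks the no-fallback
zonelist, abbreviated from include/linux/gfp.h:

static inline int gfp_zonelist(gfp_t flags)
{
#ifdef CONFIG_NUMA
	/* __GFP_THISNODE restricts us to the node's own zonelist */
	if (unlikely(flags & __GFP_THISNODE))
		return ZONELIST_NOFALLBACK;
#endif
	return ZONELIST_FALLBACK;
}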

> > I did. The 'put_page' call should be placed above, right after getting
> > the page. Otherwise, refcount == 1 and we will fail to dissolve the
> > new page if we need to (in case old page fails to be dissolved).
> > I already fixed that locally.
> 
> I am not sure I follow. The newly allocated page is unreferenced
> unconditionally and the old page is not referenced by this path.

Current code is:

 allocate_a_new_page (new_page's refcount = 1)
 dissolve_old_page
  : if fail
     dissolve_new_page (we cannot dissolve it refcount != 0)
 put_page(new_page);

It should be:

 allocate_a_new_page (new_page's refcount = 1)
 put_page(new_page); (new_page's refcount = 0)
 dissolve_old_page
  : if fail
     dissolve_new_page (we can dissolve it as refcount == 0)

I hope this clarifies it.

-- 
Oscar Salvador
SUSE L3



* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-18 13:32         ` Oscar Salvador
@ 2021-02-18 13:59           ` Michal Hocko
  2021-02-18 16:53             ` Oscar Salvador
  2021-02-19  9:05             ` Oscar Salvador
  0 siblings, 2 replies; 33+ messages in thread
From: Michal Hocko @ 2021-02-18 13:59 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Thu 18-02-21 14:32:50, Oscar Salvador wrote:
> On Thu, Feb 18, 2021 at 01:52:38PM +0100, Michal Hocko wrote:
> > > Ok, makes sense.
> > > __GFP_THISNODE will not allow fallback to other node's zones.
> > > Since we only allow the nid the page belongs to, nodemask should be
> > > NULL, right?
> > 
> > I would have to double check because hugetlb has slightly different
> > expectations from the nodemask than the page allocator. The latter translates
> > that to all possible nodes but the hugetlb API tries to dereference nodes.
> > Maybe THIS node special cases it somewhere.
> 
> Uhm, I do not quite follow here.
> AFAICS, alloc_fresh_huge_page->alloc_buddy_huge_page does nothing
> with the nodemask, but rather with the node_alloc_noretry mask. That is
> done so that we do not retry on a node where we failed to allocate a page.
> 
> Now, alloc_buddy_huge_page calls __alloc_pages_nodemask directly.
> If my understanding is correct, it is ok to have a null nodemask
> as __next_zones_zonelist() will go through our own zonelist,
> since __GFP_THISNODE made us take ZONELIST_NOFALLBACK.

As I've said, the page allocator can cope with a NULL nodemask just fine.
I have checked the code and now remember the tricky part. It is
alloc_gigantic_page which cannot work with a NULL nodemask because it
relies on for_each_node_mask and that, unlike the zonelist iterator, cannot
cope with a NULL node mask. This is the case only for !__GFP_THISNODE.
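
To illustrate the difference (both iterators abbreviated, bodies elided):

	/* the zonelist iterator treats a NULL nodemask as "no filtering" */
	for_each_zone_zonelist_nodemask(zone, z, zonelist, highest_zoneidx, nodemask)
		...

	/*
	 * for_each_node_mask() reads the mask directly, so
	 * alloc_gigantic_page cannot simply be handed a NULL pointer.
	 */
	for_each_node_mask(node, *nodemask)
		...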

> Actually, I do not see how passing a non-null nodemask might have
> helped there, unless we allow specifying more nodes.

No, the nodemask doesn't make any difference.
 
> > > I did. The 'put_page' call should be placed above, right after getting
> > > the page. Otherwise, refcount == 1 and we will fail to dissolve the
> > > new page if we need to (in case old page fails to be dissolved).
> > > I already fixed that locally.
> > 
> > I am not sure I follow. The newly allocated page is unreferenced
> > unconditionally and the old page is not referenced by this path.
> 
> Current code is:
> 
>  allocate_a_new_page (new_page's refcount = 1)
>  dissolve_old_page
>   : if fail
>      dissolve_new_page (we cannot dissolve it refcount != 0)
>  put_page(new_page);

OK, new_page would go to the pool rather than get freed as this is
neither a surplus nor a temporary page.
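
For context, this is roughly where the final put_page() lands
(free_huge_page(), heavily abbreviated):

	if (PageHugeTemporary(page)) {
		/* temporary pages go straight back to the buddy allocator */
		update_and_free_page(h, page);
	} else if (h->surplus_huge_pages_node[nid]) {
		/* surplus pages are freed to re-balance the pool */
		update_and_free_page(h, page);
		h->surplus_huge_pages--;
		h->surplus_huge_pages_node[nid]--;
	} else {
		/* everything else is enqueued back into the free lists */
		enqueue_huge_page(h, page);
	}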

> It should be:
> 
>  allocate_a_new_page (new_page's refcount = 1)
>  put_page(new_page); (new_page's refcount = 0)
>  dissolve_old_page
>   : if fail
>      dissolve_new_page (we can dissolve it as refcount == 0)
> 
> I hope this clarifies it.

OK, I see the problem now. And your above solution is not really
optimal either. Your put_page would add the page to the pool and so it
could be used by somebody. One way around it would be either directly
manipulating the reference count, which is fugly, or you can make it a
temporary page (alloc_migrate_huge_page) or, maybe even better, not special
case this here but rather allow migrating free hugetlb pages in the
migrate_pages path.
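
For the temporary page variant that would look something like this (sketch
only; error handling and the later promotion of the page into the pool are
left out):

	/*
	 * A temporary page is never enqueued into the free lists, so the
	 * intermediate put_page() cannot hand it out to somebody else
	 * while the old page is being dissolved.
	 */
	new_page = alloc_migrate_huge_page(h, gfp_mask, nid, NULL);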
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-18 13:59           ` Michal Hocko
@ 2021-02-18 16:53             ` Oscar Salvador
  2021-02-19  9:05             ` Oscar Salvador
  1 sibling, 0 replies; 33+ messages in thread
From: Oscar Salvador @ 2021-02-18 16:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On 2021-02-18 14:59, Michal Hocko wrote:
> As I've said, the page allocator can cope with a NULL nodemask just fine.
> I have checked the code and now remember the tricky part. It is
> alloc_gigantic_page which cannot work with a NULL nodemask because it
> relies on for_each_node_mask and that, unlike the zonelist iterator, cannot
> cope with a NULL node mask. This is the case only for !__GFP_THISNODE.

Ok, thanks for the clarification.

> OK, I see the problem now. And your above solution is not really
> optimal either. Your put_page would add the page to the pool and so it
> could be used by somebody. One

Yeah, that is right.

> way around it would be either directly
> manipulating the reference count, which is fugly, or you can make it a
> temporary page (alloc_migrate_huge_page), or maybe even better, not
> special case this here but rather allow migrating free hugetlb pages in the
> migrate_page path.

I will have a look into it and see what that would look like.

Thanks for the feedback

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-18 13:59           ` Michal Hocko
  2021-02-18 16:53             ` Oscar Salvador
@ 2021-02-19  9:05             ` Oscar Salvador
  2021-02-19  9:56               ` Michal Hocko
  1 sibling, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-02-19  9:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Thu, Feb 18, 2021 at 02:59:40PM +0100, Michal Hocko wrote:
> > It should be:
> > 
> >  allocate_a_new_page (new_page's refcount = 1)
> >  put_page(new_page); (new_page's refcount = 0)
> >  dissolve_old_page
> >   : if fail
> >      dissolve_new_page (we can dissolve it as refcount == 0)
> > 
> > I hope this clarifies it .
> 
> OK, I see the problem now. And your above solution is not really
> optimal either. Your put_page would add the page to the pool and so it
> could be used by somebody. One way around it would be either directly
> manipulating the reference count, which is fugly, or you can make it a
> temporary page (alloc_migrate_huge_page), or maybe even better, not special
> case this here but rather allow migrating free hugetlb pages in the
> migrate_page path.

I have been weighing up this option because it seemed the cleanest way to
proceed. Having the ability to migrate free hugepages directly from migrate_page
would spare us this function.
But there is a problem. migrate_pages needs the pages to be on a list (via
page->lru). That is no problem for used pages, but for free hugepages we would
have to remove the page from hstate->hugepage_freelists, meaning that if userspace
comes along and tries to get a hugepage (a hugepage it previously allocated via
e.g. /proc/sys/.../nr_hugepages), it will fail.
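
To make the problem concrete, a hedged sketch of what isolating a free huge
page for migrate_pages would entail (the helper is hypothetical):

/*
 * Hypothetical: detach a free huge page so migrate_pages() can use
 * page->lru. While the page sits on @migrate_list the dequeue path
 * cannot see it, so a concurrent hugepage request may fail.
 */
static struct page *sketch_isolate_free_huge_page(struct hstate *h, int nid,
                                                  struct list_head *migrate_list)
{
        struct page *page;

        spin_lock(&hugetlb_lock);
        page = list_first_entry_or_null(&h->hugepage_freelists[nid],
                                        struct page, lru);
        if (page) {
                list_move(&page->lru, migrate_list);
                h->free_huge_pages--;
                h->free_huge_pages_node[nid]--;
        }
        spin_unlock(&hugetlb_lock);

        return page;
}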

So I am not really sure we can go this way. Unless we are willing to accept
that userspace can temporarily get ENOMEM if it tries to use a hugepage, which
I do not think is a good idea.
Another way to go would be to compensate for the free hugepages being migrated
by allocating the same amount up front, but that starts to go down a rabbit hole.

I yet need to give it some more spins, but all in all, I think the easiest way
forward might be to do something like:

alloc_and_dissolve_huge_page {
   bool ret = false;

   new_page = alloc_fresh_huge_page(h, gfp_mask, nid, NULL, NULL);
   if (new_page) {
           /*
            * Put the page in the freelist hugepage pool.
            * We might race with someone coming by and grabbing the page,
            * but that is fine since we mark the page as Temporary in case
            * both old and new_page fail to be dissolved, so new_page
            * will be freed when its last reference is gone.
            */
           put_page(new_page);

           if (!dissolve_free_huge_page(page)) {
                   /*
                    * Old page could be dissolved.
                    */
                   ret = true;
           } else if (dissolve_free_huge_page(new_page)) {
                   /*
                    * Page might have been dissolved by admin by doing
                    * "echo 0 > /proc/../nr_hugepages". Check it before marking
                    * the page.
                    */
                   spin_lock(&hugetlb_lock);
                   /*
                    * Mark the page Temporary in case we fail to dissolve both
                    * the old page and new_page. It will be freed when the last
                    * user drops it.
                    */
                   if (PageHuge(new_page))
                           SetPageHugeTemporary(new_page);
                   spin_unlock(&hugetlb_lock);
           }
   }

   return ret;
}

There is one more thing to cover though.
Between the dissolve_free_huge_page(new_page) call and the PageHuge check, the
page might have been freed, and hence enqueued (because we had not marked it as
Temporary yet). If that is the case, we would later be marking a page that sits
in the free pool as Temporary, which is not nice.

One way out would be to pass a boolean parameter to dissolve_free_huge_page()
that tells it whether the page should be marked Temporary in case the operation
fails. That would ease things a lot, because __everything__ would be done under
the lock, which means the page cannot go away in the meantime.
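
Roughly, the boolean-parameter idea could look like this (hypothetical
signature, and the actual dissolve steps are elided since only the locking
point matters here):

static int dissolve_free_huge_page(struct page *page, bool mark_on_fail)
{
        int rc = -EBUSY;

        spin_lock(&hugetlb_lock);
        if (PageHuge(page) && !page_count(page)) {
                /* dissolve as today: unlink from the freelist, free the page */
                rc = 0;
        } else if (mark_on_fail && PageHuge(page)) {
                /*
                 * Still under hugetlb_lock: the page cannot be freed and
                 * re-enqueued between the failed dissolve and the marking.
                 */
                SetPageHugeTemporary(page);
        }
        spin_unlock(&hugetlb_lock);

        return rc;
}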

Just my thoughts.

What do you think?


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-19  9:05             ` Oscar Salvador
@ 2021-02-19  9:56               ` Michal Hocko
  2021-02-19 10:14                 ` Oscar Salvador
  2021-02-19 10:40                 ` Oscar Salvador
  0 siblings, 2 replies; 33+ messages in thread
From: Michal Hocko @ 2021-02-19  9:56 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Fri 19-02-21 10:05:53, Oscar Salvador wrote:
> On Thu, Feb 18, 2021 at 02:59:40PM +0100, Michal Hocko wrote:
> > > It should be:
> > > 
> > >  allocate_a_new_page (new_page's refcount = 1)
> > >  put_page(new_page); (new_page's refcount = 0)
> > >  dissolve_old_page
> > >   : if fail
> > >      dissolve_new_page (we can dissolve it as refcount == 0)
> > > 
> > > I hope this clarifies it .
> > 
> > OK, I see the problem now. And your above solution is not really
> > optimal either. Your put_page would add the page to the pool and so it
> > could be used by somebody. One way around it would be either directly
> > manipulating the reference count, which is fugly, or you can make it a
> > temporary page (alloc_migrate_huge_page), or maybe even better, not special
> > case this here but rather allow migrating free hugetlb pages in the
> > migrate_page path.
> 
> I have been weighing up this option because it seemed the cleanest way to
> proceed. Having the ability to migrate free hugepages directly from migrate_page
> would spare us this function.
> But there is a problem. migrate_pages needs the pages to be on a list (via
> page->lru). That is no problem for used pages, but for free hugepages we would
> have to remove the page from hstate->hugepage_freelists, meaning that if userspace
> comes along and tries to get a hugepage (a hugepage it previously allocated via
> e.g. /proc/sys/.../nr_hugepages), it will fail.

Good point. I should have realized that.
 
> So I am not really sure we can go this way. Unless we are willing to accept
> that userspace can temporarily get ENOMEM if it tries to use a hugepage, which
> I do not think is a good idea.

No, this is not acceptable.

> Another way to go would be to compensate for the free hugepages being migrated
> by allocating the same amount up front, but that starts to go down a rabbit hole.
> 
> I yet need to give it some more spins, but all in all, I think the easiest way
> forward might be to do something like:
> 
> alloc_and_dissolve_huge_page {
>    bool ret = false;
>
>    new_page = alloc_fresh_huge_page(h, gfp_mask, nid, NULL, NULL);
>    if (new_page) {
>            /*
>             * Put the page in the freelist hugepage pool.
>             * We might race with someone coming by and grabbing the page,
>             * but that is fine since we mark the page as Temporary in case
>             * both old and new_page fail to be dissolved, so new_page
>             * will be freed when its last reference is gone.
>             */
>            put_page(new_page);
>
>            if (!dissolve_free_huge_page(page)) {
>                    /*
>                     * Old page could be dissolved.
>                     */
>                    ret = true;
>            } else if (dissolve_free_huge_page(new_page)) {
>                    /*
>                     * Page might have been dissolved by admin by doing
>                     * "echo 0 > /proc/../nr_hugepages". Check it before marking
>                     * the page.
>                     */
>                    spin_lock(&hugetlb_lock);
>                    /*
>                     * Mark the page Temporary in case we fail to dissolve both
>                     * the old page and new_page. It will be freed when the last
>                     * user drops it.
>                     */
>                    if (PageHuge(new_page))
>                            SetPageHugeTemporary(new_page);
>                    spin_unlock(&hugetlb_lock);
>            }
>    }
>
>    return ret;
> }

OK, this should work but I am really wondering whether it wouldn't be
just simpler to replace the old page by a new one in the free list
directly. Or is there any reason we have to go through the generic
helpers path? I mean something like this

	new_page = alloc_fresh_huge_page();
	if (!new_page)
		goto fail;
	spin_lock(hugetlb_lock);
	if (!PageHuge(old_page)) {
		/* freed from under us, nothing to do */ 
		__update_and_free_page(new_page);
		goto unlock;
	}
	list_del(&old_page->lru);
	__update_and_free_page(old_page);
	__enqueue_huge_page(new_page);
unlock:
	spin_unlock(hugetlb_lock);

This will require splitting update_and_free_page and enqueue_huge_page into
counter-independent parts, but that shouldn't be a big deal. It will
also protect from any races. Not an act of beauty, but it seems less hackish
to me.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-19  9:56               ` Michal Hocko
@ 2021-02-19 10:14                 ` Oscar Salvador
  2021-02-19 20:00                   ` Mike Kravetz
  2021-02-19 10:40                 ` Oscar Salvador
  1 sibling, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-02-19 10:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Fri, Feb 19, 2021 at 10:56:42AM +0100, Michal Hocko wrote:
> OK, this should work but I am really wondering whether it wouldn't be
> just simpler to replace the old page by a new one in the free list
> directly. Or is there any reason we have to go through the generic
> helpers path? I mean something like this
> 
> 	new_page = alloc_fresh_huge_page();
> 	if (!new_page)
> 		goto fail;
> 	spin_lock(hugetlb_lock);
> 	if (!PageHuge(old_page)) {
> 		/* freed from under us, nothing to do */ 
> 		__update_and_free_page(new_page);
> 		goto unlock;
> 	}
> 	list_del(&old_page->lru);
> 	__update_and_free_page(old_page);
> 	__enqueue_huge_page(new_page);
> unlock:
> 	spin_unlock(hugetlb_lock);
> 
> This will require splitting update_and_free_page and enqueue_huge_page into
> counter-independent parts, but that shouldn't be a big deal. It will
> also protect from any races. Not an act of beauty, but it seems less hackish
> to me.

Yes, I think this would do the trick, and it is race-free.
Let me play with it a bit and see what I can come up with.

Thanks for the valuable insight.

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-19  9:56               ` Michal Hocko
  2021-02-19 10:14                 ` Oscar Salvador
@ 2021-02-19 10:40                 ` Oscar Salvador
  2021-02-19 10:55                   ` Michal Hocko
  1 sibling, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-02-19 10:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Fri, Feb 19, 2021 at 10:56:42AM +0100, Michal Hocko wrote:
> OK, this should work but I am really wondering whether it wouldn't be
> just simpler to replace the old page by a new one in the free list
> directly. Or is there any reason we have to go through the generic
> helpers path? I mean something like this
> 
> 	new_page = alloc_fresh_huge_page();
> 	if (!new_page)
> 		goto fail;
> 	spin_lock(hugetlb_lock);
> 	if (!PageHuge(old_page)) {
> 		/* freed from under us, nothing to do */ 
> 		__update_and_free_page(new_page);
> 		goto unlock;
> 	}
> 	list_del(&old_page->lru);
> 	__update_and_free_page(old_page);
> 	__enqueue_huge_page(new_page);
> unlock:
> 	spin_unlock(hugetlb_lock);
> 
> This will require splitting update_and_free_page and enqueue_huge_page into
> counter-independent parts, but that shouldn't be a big deal. It will
> also protect from any races. Not an act of beauty, but it seems less hackish
> to me.

On a closer look, do we really need to decouple update_and_free_page and
enqueue_huge_page? These two functions do not handle the lock themselves; the
functions that call them do (as would be the case here).
Only update_and_free_page drops the lock during the freeing of a gigantic page
and then takes it again, as it is the caller who took the lock.

Am I missing anything obvious here?

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-19 10:40                 ` Oscar Salvador
@ 2021-02-19 10:55                   ` Michal Hocko
  2021-02-19 11:17                     ` Oscar Salvador
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-02-19 10:55 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Fri 19-02-21 11:40:30, Oscar Salvador wrote:
> On Fri, Feb 19, 2021 at 10:56:42AM +0100, Michal Hocko wrote:
> > OK, this should work but I am really wondering whether it wouldn't be
> > just simpler to replace the old page by a new one in the free list
> > directly. Or is there any reason we have to go through the generic
> > helpers path? I mean something like this
> > 
> > 	new_page = alloc_fresh_huge_page();
> > 	if (!new_page)
> > 		goto fail;
> > 	spin_lock(hugetlb_lock);
> > 	if (!PageHuge(old_page)) {
> > 		/* freed from under us, nothing to do */ 
> > 		__update_and_free_page(new_page);
> > 		goto unlock;
> > 	}
> > 	list_del(&old_page->lru);
> > 	__update_and_free_page(old_page);
> > 	__enqueue_huge_page(new_page);
> > unlock:
> > 	spin_unlock(hugetlb_lock);
> > 
> > This will require splitting update_and_free_page and enqueue_huge_page into
> > counter-independent parts, but that shouldn't be a big deal. It will
> > also protect from any races. Not an act of beauty, but it seems less hackish
> > to me.
> 
> On a closer look, do we really need to decouple update_and_free_page and
> enqueue_huge_page? These two functions do not handle the lock themselves; the
> functions that call them do (as would be the case here).
> Only update_and_free_page drops the lock during the freeing of a gigantic page
> and then takes it again, as it is the caller who took the lock.
> 
> Am I missing anything obvious here?

It is not the lock that I care about but rather the counters. The
intention was to have a single place that handles both enqueueing and
dequeueing, as not all places require the counters to be updated, e.g. the
migration case, which just replaces one page with another.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-19 10:55                   ` Michal Hocko
@ 2021-02-19 11:17                     ` Oscar Salvador
  2021-02-19 11:24                       ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Oscar Salvador @ 2021-02-19 11:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Fri, Feb 19, 2021 at 11:55:00AM +0100, Michal Hocko wrote:
> It is not the lock that I care about but rather the counters. The
> intention was to have a single place that handles both enqueueing and
> dequeueing, as not all places require the counters to be updated, e.g. the
> migration case, which just replaces one page with another.

I see.
alloc_fresh_huge_page->prep_new_huge_page increments h->nr_huge_pages{_node}
counters.
Which means:

>       new_page = alloc_fresh_huge_page();
>       if (!new_page)
>               goto fail;
>       spin_lock(hugetlb_lock);
>       if (!PageHuge(old_page)) {
>               /* freed from under us, nothing to do */ 
>               __update_and_free_page(new_page);

Here we need update_and_free_page, otherwise we would be leaving a stale value
in h->nr_huge_pages{_node}. 

>               goto unlock;
>       }
>       list_del(&old_page->lru);
>       __update_and_free_page(old_page);

Same here.

>       __enqueue_huge_page(new_page);

This is ok since h->free_huge_pages{_node} do not need to be updated.
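
Put together, the split could look something like this (hypothetical shapes,
not the eventual upstream helpers): the __-variant only touches the free list,
and the counter updates stay with the callers that actually grow or shrink
the pool:

/* List manipulation only; free_huge_pages{,_node} are left alone. */
static void __enqueue_huge_page(struct hstate *h, struct page *page, int nid)
{
        list_move(&page->lru, &h->hugepage_freelists[nid]);
}

/* Current behaviour: enqueue and account. */
static void enqueue_huge_page(struct hstate *h, struct page *page)
{
        int nid = page_to_nid(page);

        __enqueue_huge_page(h, page, nid);
        h->free_huge_pages++;
        h->free_huge_pages_node[nid]++;
}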

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-19 11:17                     ` Oscar Salvador
@ 2021-02-19 11:24                       ` Michal Hocko
  0 siblings, 0 replies; 33+ messages in thread
From: Michal Hocko @ 2021-02-19 11:24 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Mike Kravetz, David Hildenbrand, Muchun Song,
	linux-mm, linux-kernel

On Fri 19-02-21 12:17:11, Oscar Salvador wrote:
> On Fri, Feb 19, 2021 at 11:55:00AM +0100, Michal Hocko wrote:
> > It is not the lock that I care about but rather the counters. The
> > intention was to have a single place that handles both enqueueing and
> > dequeueing, as not all places require the counters to be updated, e.g. the
> > migration case, which just replaces one page with another.
> 
> I see.
> alloc_fresh_huge_page->prep_new_huge_page increments h->nr_huge_pages{_node}
> counters.
> Which means:
> 
> >       new_page = alloc_fresh_huge_page();
> >       if (!new_page)
> >               goto fail;
> >       spin_lock(hugetlb_lock);
> >       if (!PageHuge(old_page)) {
> >               /* freed from under us, nothing to do */ 
> >               __update_and_free_page(new_page);
> 
> Here we need update_and_free_page, otherwise we would be leaving a stale value
> in h->nr_huge_pages{_node}. 
> 
> >               goto unlock;
> >       }
> >       list_del(&old_page->lru);
> >       __update_and_free_page(old_page);
> 
> Same here.
> 
> >       __enqueue_huge_page(new_page);
> 
> This is ok since h->free_huge_pages{_node} do not need to be updated.

Fair enough. I obviously didn't get to think this through, but you
should get the idea ;)
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages
  2021-02-19 10:14                 ` Oscar Salvador
@ 2021-02-19 20:00                   ` Mike Kravetz
  0 siblings, 0 replies; 33+ messages in thread
From: Mike Kravetz @ 2021-02-19 20:00 UTC (permalink / raw)
  To: Oscar Salvador, Michal Hocko
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, linux-mm, linux-kernel

On 2/19/21 2:14 AM, Oscar Salvador wrote:
> On Fri, Feb 19, 2021 at 10:56:42AM +0100, Michal Hocko wrote:
>> OK, this should work but I am really wondering whether it wouldn't be
>> just simpler to replace the old page by a new one in the free list
>> directly. Or is there any reason we have to go through the generic
>> helpers path? I mean something like this
>>
>> 	new_page = alloc_fresh_huge_page();
>> 	if (!new_page)
>> 		goto fail;
>> 	spin_lock(hugetlb_lock);
>> 	if (!PageHuge(old_page)) {

Yes, something like this should work.  I'll let Oscar work out the details.
One thing to note is that you also need to handle the case where old_page is
no longer on the free list here.  It could have been allocated and be in use.
In addition, make sure to check the new HPageFreed flag to ensure the page is
on the free list before doing any type of update_and_free_page call.
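
Folding that into the pseudocode above, the check could look roughly like this
(a sketch only; __update_and_free_page/__enqueue_huge_page are the hypothetical
split helpers discussed earlier in the thread):

static bool sketch_replace_free_huge_page(struct hstate *h,
                                          struct page *old_page,
                                          struct page *new_page)
{
        bool replaced = false;

        spin_lock(&hugetlb_lock);
        if (PageHuge(old_page) && HPageFreed(old_page)) {
                /* Still a free huge page: swap it for the fresh one. */
                list_del(&old_page->lru);
                __update_and_free_page(h, old_page);
                __enqueue_huge_page(h, new_page, page_to_nid(new_page));
                replaced = true;
        } else {
                /* Dissolved from under us, or allocated and now in use. */
                __update_and_free_page(h, new_page);
        }
        spin_unlock(&hugetlb_lock);

        return replaced;
}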
-- 
Mike Kravetz

>> 		/* freed from under us, nothing to do */ 
>> 		__update_and_free_page(new_page);
>> 		goto unlock;
>> 	}
>> 	list_del(&old_page->lru);
>> 	__update_and_free_page(old_page);
>> 	__enqueue_huge_page(new_page);
>> unlock:
>> 	spin_unlock(hugetlb_lock);
>>
>> This will require splitting update_and_free_page and enqueue_huge_page into
>> counter-independent parts, but that shouldn't be a big deal. It will
>> also protect from any races. Not an act of beauty, but it seems less hackish
>> to me.
> 
> Yes, I think this would do the trick, and it is race-free.
> Let me play with it a bit and see what I can come up with.
> 
> Thanks for the valuable insight.
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2021-02-19 20:01 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-17 10:08 [PATCH 0/2] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
2021-02-17 10:08 ` [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages Oscar Salvador
2021-02-17 13:30   ` Michal Hocko
2021-02-17 13:36     ` David Hildenbrand
2021-02-17 13:50       ` Michal Hocko
2021-02-17 13:53         ` David Hildenbrand
2021-02-17 13:59           ` Michal Hocko
2021-02-17 14:08             ` David Hildenbrand
2021-02-17 14:14               ` Michal Hocko
2021-02-17 14:23               ` Oscar Salvador
2021-02-17 13:42     ` Oscar Salvador
2021-02-17 15:00   ` Michal Hocko
2021-02-18 10:09     ` Oscar Salvador
2021-02-18 12:52       ` Michal Hocko
2021-02-18 13:32         ` Oscar Salvador
2021-02-18 13:59           ` Michal Hocko
2021-02-18 16:53             ` Oscar Salvador
2021-02-19  9:05             ` Oscar Salvador
2021-02-19  9:56               ` Michal Hocko
2021-02-19 10:14                 ` Oscar Salvador
2021-02-19 20:00                   ` Mike Kravetz
2021-02-19 10:40                 ` Oscar Salvador
2021-02-19 10:55                   ` Michal Hocko
2021-02-19 11:17                     ` Oscar Salvador
2021-02-19 11:24                       ` Michal Hocko
2021-02-17 10:08 ` [PATCH 2/2] mm: Make alloc_contig_range handle in-use " Oscar Salvador
2021-02-17 13:36   ` Michal Hocko
2021-02-17 13:46     ` Oscar Salvador
2021-02-17 13:54       ` Michal Hocko
2021-02-17 15:06   ` Michal Hocko
2021-02-17 15:27     ` Oscar Salvador
2021-02-17 15:33       ` Michal Hocko
2021-02-18  6:01         ` Oscar Salvador

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).