linux-mm.kvack.org archive mirror
* [RFC PATCH 0/2] Make alloc_contig_range handle Hugetlb pages
@ 2021-02-08 10:38 Oscar Salvador
  2021-02-08 10:38 ` [RFC PATCH 1/2] mm,page_alloc: Make alloc_contig_range handle in-use hugetlb pages Oscar Salvador
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Oscar Salvador @ 2021-02-08 10:38 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: David Hildenbrand, Muchun Song, Michal Hocko, linux-mm,
	linux-kernel, Oscar Salvador

Hi,

I promised Mike a few weeks ago that I would look into this.
This is my first attempt.

I carried out some tests with a module that calls alloc_contig_range()
on a range containing both free and in-use hugetlb pages.
So far it worked and I did not spot any problems.

Please note that it is not fully complete, as some things remain to be sorted
out (see the list below), but I wanted to publish it to see whether the
direction I am taking makes sense to people, or whether I am doing something
worrisome.

E.g.:
 - What happens when we have allocated a new hugetlb page but cannot
   dissolve the old one? (should not really happen (tm))
 - When allocating the new hugetlb page, I try to do it on the same node
   the old one is on. Should we be more flexible and allow falling back
   to other nodes?
   Userspace can request hugetlb pages on specific nodes [1], but it can
   also request them through generic interfaces [2].
 - Problems I have not foreseen yet

 [1] /sys/devices/system/node/nodeX/hugepages/*
 [2] /proc/sys/vm/nr_hugepages or /sys/kernel/mm/hugepages/*
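
For reference, the two interfaces can be exercised roughly like this (a
sketch only; the node, page size, and counts are made-up examples, the paths
are the ones from [1] and [2] above):

```shell
# Interface [1]: request 64 2MB pages on node0 (node and size are examples).
echo 64 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

# Interface [2]: generic request; the kernel distributes pages across nodes.
echo 128 > /proc/sys/vm/nr_hugepages
```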

Oscar Salvador (2):
  mm,page_alloc: Make alloc_contig_range handle in-use hugetlb pages
  mm,page_alloc: Make alloc_contig_range handle free hugetlb pages

 include/linux/hugetlb.h |  6 ++++++
 mm/compaction.c         | 28 ++++++++++++++++++++++++++++
 mm/hugetlb.c            | 35 +++++++++++++++++++++++++++++++++++
 mm/vmscan.c             |  5 +++--
 4 files changed, 72 insertions(+), 2 deletions(-)

-- 
2.16.3



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC PATCH 1/2] mm,page_alloc: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-08 10:38 [RFC PATCH 0/2] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
@ 2021-02-08 10:38 ` Oscar Salvador
  2021-02-10  8:56   ` David Hildenbrand
  2021-02-11  0:56   ` Mike Kravetz
  2021-02-08 10:38 ` [RFC PATCH 2/2] mm,page_alloc: Make alloc_contig_range handle free " Oscar Salvador
  2021-02-08 10:39 ` [RFC PATCH 0/2] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
  2 siblings, 2 replies; 18+ messages in thread
From: Oscar Salvador @ 2021-02-08 10:38 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: David Hildenbrand, Muchun Song, Michal Hocko, linux-mm,
	linux-kernel, Oscar Salvador

alloc_contig_range is not prepared to handle hugetlb pages and will
fail if it ever sees one, but since they can be migrated like any other
page (LRU and Movable), it makes sense to handle them as well.

For now, do it only when coming from alloc_contig_range.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/compaction.c | 17 +++++++++++++++++
 mm/vmscan.c     |  5 +++--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e5acb9714436..89cd2e60da29 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -940,6 +940,22 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			goto isolate_fail;
 		}
 
+		/*
+		 * Handle hugetlb pages only when coming from alloc_contig
+		 */
+		if (PageHuge(page) && cc->alloc_contig) {
+			if (page_count(page)) {
+				/*
+				 * Hugetlb page in-use. Isolate and migrate.
+				 */
+				if (isolate_huge_page(page, &cc->migratepages)) {
+					low_pfn += compound_nr(page) - 1;
+					goto isolate_success_no_list;
+				}
+			}
+			goto isolate_fail;
+		}
+
 		/*
 		 * Check may be lockless but that's ok as we recheck later.
 		 * It's possible to migrate LRU and non-lru movable pages.
@@ -1041,6 +1057,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 isolate_success:
 		list_add(&page->lru, &cc->migratepages);
+isolate_success_no_list:
 		cc->nr_migratepages += compound_nr(page);
 		nr_isolated += compound_nr(page);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b1b574ad199d..0803adca4469 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1506,8 +1506,9 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 	LIST_HEAD(clean_pages);
 
 	list_for_each_entry_safe(page, next, page_list, lru) {
-		if (page_is_file_lru(page) && !PageDirty(page) &&
-		    !__PageMovable(page) && !PageUnevictable(page)) {
+		if (!PageHuge(page) && page_is_file_lru(page) &&
+		    !PageDirty(page) && !__PageMovable(page) &&
+		    !PageUnevictable(page)) {
 			ClearPageActive(page);
 			list_move(&page->lru, &clean_pages);
 		}
-- 
2.16.3




* [RFC PATCH 2/2] mm,page_alloc: Make alloc_contig_range handle free hugetlb pages
  2021-02-08 10:38 [RFC PATCH 0/2] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
  2021-02-08 10:38 ` [RFC PATCH 1/2] mm,page_alloc: Make alloc_contig_range handle in-use hugetlb pages Oscar Salvador
@ 2021-02-08 10:38 ` Oscar Salvador
  2021-02-10  8:23   ` David Hildenbrand
  2021-02-11  1:16   ` Mike Kravetz
  2021-02-08 10:39 ` [RFC PATCH 0/2] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
  2 siblings, 2 replies; 18+ messages in thread
From: Oscar Salvador @ 2021-02-08 10:38 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: David Hildenbrand, Muchun Song, Michal Hocko, linux-mm,
	linux-kernel, Oscar Salvador

Free hugetlb pages are trickier to handle: in order to guarantee that
no userspace application is disrupted, we need to replace the
current free hugetlb page with a new one.

To do that, a new function called alloc_and_dissolve_huge_page
is introduced.
This function will first try to get a fresh hugetlb page, and if it
succeeds, it will dissolve the old one.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/hugetlb.h |  6 ++++++
 mm/compaction.c         | 11 +++++++++++
 mm/hugetlb.c            | 35 +++++++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ebca2ef02212..f81afcb86e89 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -505,6 +505,7 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
+bool alloc_and_dissolve_huge_page(struct page *page);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -773,6 +774,11 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
+static inline bool alloc_and_dissolve_huge_page(struct page *page)
+{
+	return false;
+}
+
 static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
 					   unsigned long addr,
 					   int avoid_reserve)
diff --git a/mm/compaction.c b/mm/compaction.c
index 89cd2e60da29..7969ddc10856 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -952,6 +952,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 					low_pfn += compound_nr(page) - 1;
 					goto isolate_success_no_list;
 				}
+			} else {
+				/*
+				 * Free hugetlb page. Allocate a new one and
+				 * dissolve this one if we succeed.
+				 */
+				if (alloc_and_dissolve_huge_page(page)) {
+					unsigned long order = buddy_order_unsafe(page);
+
+					low_pfn += (1UL << order) - 1;
+					continue;
+				}
 			}
 			goto isolate_fail;
 		}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 18f6ee317900..79ffbb64c4ee 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2253,6 +2253,41 @@ static void restore_reserve_on_error(struct hstate *h,
 	}
 }
 
+bool alloc_and_dissolve_huge_page(struct page *page)
+{
+	NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL);
+	struct page *head;
+	struct hstate *h;
+	bool ret = false;
+	int nid;
+
+	if (!nodes_allowed)
+		return ret;
+
+	spin_lock(&hugetlb_lock);
+	head = compound_head(page);
+	h = page_hstate(head);
+	nid = page_to_nid(head);
+	spin_unlock(&hugetlb_lock);
+
+	init_nodemask_of_node(nodes_allowed, nid);
+
+	/*
+	 * Before dissolving the page, we need to allocate a new one,
+	 * so the pool remains stable.
+	 */
+	if (alloc_pool_huge_page(h, nodes_allowed, NULL)) {
+		/*
+		 * Ok, we have a free hugetlb-page to replace this
+		 * one. Dissolve the old page.
+		 */
+		if (!dissolve_free_huge_page(page))
+			ret = true;
+	}
+
+	return ret;
+}
+
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
-- 
2.16.3




* Re: [RFC PATCH 0/2] Make alloc_contig_range handle Hugetlb pages
  2021-02-08 10:38 [RFC PATCH 0/2] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
  2021-02-08 10:38 ` [RFC PATCH 1/2] mm,page_alloc: Make alloc_contig_range handle in-use hugetlb pages Oscar Salvador
  2021-02-08 10:38 ` [RFC PATCH 2/2] mm,page_alloc: Make alloc_contig_range handle free " Oscar Salvador
@ 2021-02-08 10:39 ` Oscar Salvador
  2 siblings, 0 replies; 18+ messages in thread
From: Oscar Salvador @ 2021-02-08 10:39 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: David Hildenbrand, Muchun Song, Michal Hocko, linux-mm, linux-kernel

On Mon, Feb 08, 2021 at 11:38:10AM +0100, Oscar Salvador wrote:
> Hi,

Heh, I've had a case of the mondays and I sent a first spin
without the ML, sorry for the spam.


-- 
Oscar Salvador
SUSE L3



* Re: [RFC PATCH 2/2] mm,page_alloc: Make alloc_contig_range handle free hugetlb pages
  2021-02-08 10:38 ` [RFC PATCH 2/2] mm,page_alloc: Make alloc_contig_range handle free " Oscar Salvador
@ 2021-02-10  8:23   ` David Hildenbrand
  2021-02-10 14:24     ` Oscar Salvador
  2021-02-25 21:43     ` Mike Kravetz
  2021-02-11  1:16   ` Mike Kravetz
  1 sibling, 2 replies; 18+ messages in thread
From: David Hildenbrand @ 2021-02-10  8:23 UTC (permalink / raw)
  To: Oscar Salvador, Mike Kravetz
  Cc: Muchun Song, Michal Hocko, linux-mm, linux-kernel

On 08.02.21 11:38, Oscar Salvador wrote:
> Free hugetlb pages are trickier to handle: in order to guarantee that
> no userspace application is disrupted, we need to replace the
> current free hugetlb page with a new one.
> 
> To do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a fresh hugetlb page, and if it
> succeeds, it will dissolve the old one.
>

Thanks for looking into this! Can we move this patch to #1 in the 
series? It is the easier case.

I also wonder if, on the memory unplug path, we should at least try to
keep nr_pages stable by allocating a new page if required, and printing
a warning if that fails (after all, we're messing with something
configured by the admin - "nr_pages"). Note that gigantic pages are
special (below).

> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>   include/linux/hugetlb.h |  6 ++++++
>   mm/compaction.c         | 11 +++++++++++
>   mm/hugetlb.c            | 35 +++++++++++++++++++++++++++++++++++
>   3 files changed, 52 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ebca2ef02212..f81afcb86e89 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -505,6 +505,7 @@ struct huge_bootmem_page {
>   	struct hstate *hstate;
>   };
>   
> +bool alloc_and_dissolve_huge_page(struct page *page);
>   struct page *alloc_huge_page(struct vm_area_struct *vma,
>   				unsigned long addr, int avoid_reserve);
>   struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> @@ -773,6 +774,11 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
>   #else	/* CONFIG_HUGETLB_PAGE */
>   struct hstate {};
>   
> +static inline bool alloc_and_dissolve_huge_page(struct page *page)
> +{
> +	return false;
> +}
> +
>   static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
>   					   unsigned long addr,
>   					   int avoid_reserve)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 89cd2e60da29..7969ddc10856 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -952,6 +952,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   					low_pfn += compound_nr(page) - 1;
>   					goto isolate_success_no_list;
>   				}
> +			} else {

} else if (alloc_and_dissolve_huge_page(page))) {

...

> +				/*
> +				 * Free hugetlb page. Allocate a new one and
> +				 * dissolve this one if we succeed.
> +				 */
> +				if (alloc_and_dissolve_huge_page(page)) {
> +					unsigned long order = buddy_order_unsafe(page);
> +
> +					low_pfn += (1UL << order) - 1;
> +					continue;
> +				}



Note that there is a very ugly corner case we will have to handle 
gracefully (I think also in patch #1):

Assume you allocated a gigantic page (and assume that we are not using 
CMA for gigantic pages for simplicity). Assume you want to allocate 
another one. alloc_pool_huge_page()->...->alloc_contig_pages() will 
stumble over the first allocated page. It will try to 
alloc_and_dissolve_huge_page() the existing gigantic page. To do that, 
it will alloc_pool_huge_page()->...->alloc_contig_pages() ... and so on. 
Bad.

We really don't want to mess with gigantic pages (migrate, dissolve) 
while allocating a gigantic page. I think the easiest (and cleanest) way 
forward is to not mess (isolate, migrate, dissolve) with gigantic pages 
at all.

Gigantic pages are not movable, so they won't be placed on random CMA / 
ZONE_MOVABLE.

Some hstate_is_gigantic(h) calls (maybe inside 
alloc_and_dissolve_huge_page() ? ) along with a nice comment might be 
good enough to avoid having to pass down some kind of alloc_contig 
context. I even think that should be handled inside

(the main issue is that in contrast to CMA, plain alloc_contig_pages() 
has no memory about which parts were allocated and will simply try 
re-allocating what it previously allocated and never freed - which is 
usually fine, unless we're dealing with such special cases)

Apart from that, not messing with gigantic pages feels like the right 
approach (allocating/migrating gigantic pages is just horribly slow and 
most probably not worth it anyway).
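
To make the cycle concrete, here is a hypothetical userspace toy model
(none of these names are real kernel APIs; this only mirrors the call
shape described above) of why a hstate_is_gigantic()-style bail-out
breaks the recursion:

```c
#include <assert.h>
#include <stdbool.h>

/* A previously allocated gigantic page sits in the scanned range. */
static bool gigantic_in_range = true;
static int depth;

static bool toy_alloc_and_dissolve(bool is_gigantic);

/* Stands in for alloc_pool_huge_page()->...->alloc_contig_pages(). */
static bool toy_alloc_contig(void)
{
	if (++depth > 10)		/* unbounded recursion without the guard */
		return false;
	if (gigantic_in_range)		/* stumbles over the existing gigantic page */
		return toy_alloc_and_dissolve(true);
	return true;
}

static bool toy_alloc_and_dissolve(bool is_gigantic)
{
	/*
	 * The suggested guard: refuse to replace a gigantic page and report
	 * failure for this range, instead of recursing into yet another
	 * gigantic allocation.
	 */
	if (is_gigantic)
		return false;
	return toy_alloc_contig();	/* allocate the replacement first */
}
```

With the guard, the toy allocation fails cleanly after one level instead
of re-entering itself.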

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH 1/2] mm,page_alloc: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-08 10:38 ` [RFC PATCH 1/2] mm,page_alloc: Make alloc_contig_range handle in-use hugetlb pages Oscar Salvador
@ 2021-02-10  8:56   ` David Hildenbrand
  2021-02-10 14:09     ` Oscar Salvador
  2021-02-11  0:56   ` Mike Kravetz
  1 sibling, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2021-02-10  8:56 UTC (permalink / raw)
  To: Oscar Salvador, Mike Kravetz
  Cc: Muchun Song, Michal Hocko, linux-mm, linux-kernel

On 08.02.21 11:38, Oscar Salvador wrote:
> alloc_contig_range is not prepared to handle hugetlb pages and will
> fail if it ever sees one, but since they can be migrated as any other
> page (LRU and Movable), it makes sense to also handle them.
> 
> For now, do it only when coming from alloc_contig_range.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>   mm/compaction.c | 17 +++++++++++++++++
>   mm/vmscan.c     |  5 +++--
>   2 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e5acb9714436..89cd2e60da29 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -940,6 +940,22 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   			goto isolate_fail;
>   		}
>   
> +		/*
> +		 * Handle hugetlb pages only when coming from alloc_contig
> +		 */
> +		if (PageHuge(page) && cc->alloc_contig) {
> +			if (page_count(page)) {

I wonder if we should care about races here. What if someone 
concurrently allocates/frees?

Note that PageHuge() succeeds on tail pages while isolate_huge_page() does
not; I assume we'll have to handle that as well.

I wonder if it would make sense to move some of the magic to hugetlb 
code and handle it there with less chances for races (isolate if used, 
alloc-and-dissolve if not).

> +				/*
> +				 * Hugetlb page in-use. Isolate and migrate.
> +				 */
> +				if (isolate_huge_page(page, &cc->migratepages)) {
> +					low_pfn += compound_nr(page) - 1;
> +					goto isolate_success_no_list;
> +				}
> +			}

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH 1/2] mm,page_alloc: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-10  8:56   ` David Hildenbrand
@ 2021-02-10 14:09     ` Oscar Salvador
  2021-02-10 14:11       ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Oscar Salvador @ 2021-02-10 14:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Kravetz, Muchun Song, Michal Hocko, linux-mm, linux-kernel

On Wed, Feb 10, 2021 at 09:56:37AM +0100, David Hildenbrand wrote:
> On 08.02.21 11:38, Oscar Salvador wrote:
> > alloc_contig_range is not prepared to handle hugetlb pages and will
> > fail if it ever sees one, but since they can be migrated as any other
> > page (LRU and Movable), it makes sense to also handle them.
> > 
> > For now, do it only when coming from alloc_contig_range.
> > 
> > Signed-off-by: Oscar Salvador <osalvador@suse.de>
> > ---
> >   mm/compaction.c | 17 +++++++++++++++++
> >   mm/vmscan.c     |  5 +++--
> >   2 files changed, 20 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index e5acb9714436..89cd2e60da29 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -940,6 +940,22 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >   			goto isolate_fail;
> >   		}
> > +		/*
> > +		 * Handle hugetlb pages only when coming from alloc_contig
> > +		 */
> > +		if (PageHuge(page) && cc->alloc_contig) {
> > +			if (page_count(page)) {
> 
> I wonder if we should care about races here. What if someone concurrently
> allocates/frees?
> 
> Note that PageHuge() succeeds on tail pages while isolate_huge_page() does
> not; I assume we'll have to handle that as well.
> 
> I wonder if it would make sense to move some of the magic to hugetlb code
> and handle it there with less chances for races (isolate if used,
> alloc-and-dissolve if not).

Yes, it makes sense to keep the magic in hugetlb code.
Note, though, that removing all races might be tricky.

isolate_huge_page() checks for PageHuge under hugetlb_lock,
so there is a race between a call to PageHuge(x) and a subsequent
call to isolate_huge_page().
But we should be fine as isolate_huge_page will fail in case the page is
no longer HugeTLB.

Also, since isolate_migratepages_block() gets called with pageblock-aligned
ranges, we should never be handling tail pages in the core of the
function, e.g. the same way we handle THP:

    /* The whole page is taken off the LRU; skip the tail pages. */
    if (PageCompound(page))
           low_pfn += compound_nr(page) - 1;
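
As a side note, that pfn arithmetic can be sanity-checked in userspace;
skip_compound() below is a hypothetical standalone mock (not a kernel
function) of how the scanner's skip plus the for-loop's own increment
land on the first pfn past the compound page:

```c
#include <assert.h>

/*
 * Mock of the skip in isolate_migratepages_block(): on the head page of
 * a compound page covering `nr` base pages, low_pfn is advanced by
 * nr - 1, and the loop increment then moves it past the whole page.
 */
static unsigned long skip_compound(unsigned long low_pfn, unsigned long nr)
{
	low_pfn += nr - 1;	/* mirrors: low_pfn += compound_nr(page) - 1; */
	return low_pfn + 1;	/* mirrors the loop's low_pfn++ */
}
```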

But all in all, the code has to be more bullet-proof. This RFC was more
like a PoC to see whether I was doing something crazy.
And as I said, moving the handling of hugetlb pages to hugetlb.c might
help towards better error/race handling.

Thanks for having a look ;-)

-- 
Oscar Salvador
SUSE L3



* Re: [RFC PATCH 1/2] mm,page_alloc: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-10 14:09     ` Oscar Salvador
@ 2021-02-10 14:11       ` David Hildenbrand
  2021-02-10 14:14         ` Oscar Salvador
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2021-02-10 14:11 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Mike Kravetz, Muchun Song, Michal Hocko, linux-mm, linux-kernel

On 10.02.21 15:09, Oscar Salvador wrote:
> On Wed, Feb 10, 2021 at 09:56:37AM +0100, David Hildenbrand wrote:
>> On 08.02.21 11:38, Oscar Salvador wrote:
>>> alloc_contig_range is not prepared to handle hugetlb pages and will
>>> fail if it ever sees one, but since they can be migrated as any other
>>> page (LRU and Movable), it makes sense to also handle them.
>>>
>>> For now, do it only when coming from alloc_contig_range.
>>>
>>> Signed-off-by: Oscar Salvador <osalvador@suse.de>
>>> ---
>>>    mm/compaction.c | 17 +++++++++++++++++
>>>    mm/vmscan.c     |  5 +++--
>>>    2 files changed, 20 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>> index e5acb9714436..89cd2e60da29 100644
>>> --- a/mm/compaction.c
>>> +++ b/mm/compaction.c
>>> @@ -940,6 +940,22 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>>    			goto isolate_fail;
>>>    		}
>>> +		/*
>>> +		 * Handle hugetlb pages only when coming from alloc_contig
>>> +		 */
>>> +		if (PageHuge(page) && cc->alloc_contig) {
>>> +			if (page_count(page)) {
>>
>> I wonder if we should care about races here. What if someone concurrently
>> allocates/frees?
>>
>> Note that PageHuge() succeeds on tail pages while isolate_huge_page() does
>> not; I assume we'll have to handle that as well.
>>
>> I wonder if it would make sense to move some of the magic to hugetlb code
>> and handle it there with less chances for races (isolate if used,
>> alloc-and-dissolve if not).
> 
> Yes, it makes sense to keep the magic in hugetlb code.
> Note, though, that removing all races might be tricky.
> 
> isolate_huge_page() checks for PageHuge under hugetlb_lock,
> so there is a race between a call to PageHuge(x) and a subsequent
> call to isolate_huge_page().
> But we should be fine as isolate_huge_page will fail in case the page is
> no longer HugeTLB.
> 
> Also, since isolate_migratepages_block() gets called with ranges
> pageblock aligned, we should never be handling tail pages in the core
> of the function. E.g: the same way we handle THP:

Gigantic pages? (spoiler: see my comments to next patch :) )

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH 1/2] mm,page_alloc: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-10 14:11       ` David Hildenbrand
@ 2021-02-10 14:14         ` Oscar Salvador
  2021-02-10 14:22           ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Oscar Salvador @ 2021-02-10 14:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Kravetz, Muchun Song, Michal Hocko, linux-mm, linux-kernel

On Wed, Feb 10, 2021 at 03:11:05PM +0100, David Hildenbrand wrote:
> On 10.02.21 15:09, Oscar Salvador wrote:
> > On Wed, Feb 10, 2021 at 09:56:37AM +0100, David Hildenbrand wrote:
> > > On 08.02.21 11:38, Oscar Salvador wrote:
> > > > alloc_contig_range is not prepared to handle hugetlb pages and will
> > > > fail if it ever sees one, but since they can be migrated as any other
> > > > page (LRU and Movable), it makes sense to also handle them.
> > > > 
> > > > For now, do it only when coming from alloc_contig_range.
> > > > 
> > > > Signed-off-by: Oscar Salvador <osalvador@suse.de>
> > > > ---
> > > >    mm/compaction.c | 17 +++++++++++++++++
> > > >    mm/vmscan.c     |  5 +++--
> > > >    2 files changed, 20 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > > index e5acb9714436..89cd2e60da29 100644
> > > > --- a/mm/compaction.c
> > > > +++ b/mm/compaction.c
> > > > @@ -940,6 +940,22 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> > > >    			goto isolate_fail;
> > > >    		}
> > > > +		/*
> > > > +		 * Handle hugetlb pages only when coming from alloc_contig
> > > > +		 */
> > > > +		if (PageHuge(page) && cc->alloc_contig) {
> > > > +			if (page_count(page)) {
> > > 
> > > I wonder if we should care about races here. What if someone concurrently
> > > allocates/frees?
> > > 
> > > Note that PageHuge() succeeds on tail pages while isolate_huge_page() does
> > > not; I assume we'll have to handle that as well.
> > > 
> > > I wonder if it would make sense to move some of the magic to hugetlb code
> > > and handle it there with less chances for races (isolate if used,
> > > alloc-and-dissolve if not).
> > 
> > Yes, it makes sense to keep the magic in hugetlb code.
> > Note, though, that removing all races might be tricky.
> > 
> > isolate_huge_page() checks for PageHuge under hugetlb_lock,
> > so there is a race between a call to PageHuge(x) and a subsequent
> > call to isolate_huge_page().
> > But we should be fine as isolate_huge_page will fail in case the page is
> > no longer HugeTLB.
> > 
> > Also, since isolate_migratepages_block() gets called with ranges
> > pageblock aligned, we should never be handling tail pages in the core
> > of the function. E.g: the same way we handle THP:
> 
> Gigantic pages? (spoiler: see my comments to next patch :) )

Oh, yeah, that sucks.
We had the same problem in scan_movable_pages/has_unmovable_pages
with such pages.

Uhm, I will try to be more careful :-)

-- 
Oscar Salvador
SUSE L3



* Re: [RFC PATCH 1/2] mm,page_alloc: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-10 14:14         ` Oscar Salvador
@ 2021-02-10 14:22           ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2021-02-10 14:22 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Mike Kravetz, Muchun Song, Michal Hocko, linux-mm, linux-kernel

On 10.02.21 15:14, Oscar Salvador wrote:
> On Wed, Feb 10, 2021 at 03:11:05PM +0100, David Hildenbrand wrote:
>> On 10.02.21 15:09, Oscar Salvador wrote:
>>> On Wed, Feb 10, 2021 at 09:56:37AM +0100, David Hildenbrand wrote:
>>>> On 08.02.21 11:38, Oscar Salvador wrote:
>>>>> alloc_contig_range is not prepared to handle hugetlb pages and will
>>>>> fail if it ever sees one, but since they can be migrated as any other
>>>>> page (LRU and Movable), it makes sense to also handle them.
>>>>>
>>>>> For now, do it only when coming from alloc_contig_range.
>>>>>
>>>>> Signed-off-by: Oscar Salvador <osalvador@suse.de>
>>>>> ---
>>>>>     mm/compaction.c | 17 +++++++++++++++++
>>>>>     mm/vmscan.c     |  5 +++--
>>>>>     2 files changed, 20 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>>>> index e5acb9714436..89cd2e60da29 100644
>>>>> --- a/mm/compaction.c
>>>>> +++ b/mm/compaction.c
>>>>> @@ -940,6 +940,22 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>>>>     			goto isolate_fail;
>>>>>     		}
>>>>> +		/*
>>>>> +		 * Handle hugetlb pages only when coming from alloc_contig
>>>>> +		 */
>>>>> +		if (PageHuge(page) && cc->alloc_contig) {
>>>>> +			if (page_count(page)) {
>>>>
>>>> I wonder if we should care about races here. What if someone concurrently
>>>> allocates/frees?
>>>>
>>>> Note that PageHuge() succeeds on tail pages while isolate_huge_page() does
>>>> not; I assume we'll have to handle that as well.
>>>>
>>>> I wonder if it would make sense to move some of the magic to hugetlb code
>>>> and handle it there with less chances for races (isolate if used,
>>>> alloc-and-dissolve if not).
>>>
>>> Yes, it makes sense to keep the magic in hugetlb code.
>>> Note, though, that removing all races might be tricky.
>>>
>>> isolate_huge_page() checks for PageHuge under hugetlb_lock,
>>> so there is a race between a call to PageHuge(x) and a subsequent
>>> call to isolate_huge_page().
>>> But we should be fine as isolate_huge_page will fail in case the page is
>>> no longer HugeTLB.
>>>
>>> Also, since isolate_migratepages_block() gets called with ranges
>>> pageblock aligned, we should never be handling tail pages in the core
>>> of the function. E.g: the same way we handle THP:
>>
>> Gigantic pages? (spoiler: see my comments to next patch :) )
> 
> Oh, yeah, that sucks.
> We had the same problem in scan_movable_pages/has_unmovable_pages
> with such pages.
> 
> Uhm, I will try to be more careful :-)
> 

Gigantic pages are a minefield. Not your fault :)

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH 2/2] mm,page_alloc: Make alloc_contig_range handle free hugetlb pages
  2021-02-10  8:23   ` David Hildenbrand
@ 2021-02-10 14:24     ` Oscar Salvador
  2021-02-10 14:36       ` David Hildenbrand
  2021-02-25 21:43     ` Mike Kravetz
  1 sibling, 1 reply; 18+ messages in thread
From: Oscar Salvador @ 2021-02-10 14:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Kravetz, Muchun Song, Michal Hocko, linux-mm, linux-kernel

On Wed, Feb 10, 2021 at 09:23:59AM +0100, David Hildenbrand wrote:
> On 08.02.21 11:38, Oscar Salvador wrote:
> > Free hugetlb pages are trickier to handle: in order to guarantee that
> > no userspace application is disrupted, we need to replace the
> > current free hugetlb page with a new one.
> > 
> > To do that, a new function called alloc_and_dissolve_huge_page
> > is introduced.
> > This function will first try to get a fresh hugetlb page, and if it
> > succeeds, it will dissolve the old one.
> > 
> 
> Thanks for looking into this! Can we move this patch to #1 in the series? It
> is the easier case.
> 
> I also wonder if, on the memory unplug path, we should at least try to
> keep nr_pages stable by allocating a new page if required, and printing
> a warning if that fails (after all, we're messing with something
> configured by the admin - "nr_pages"). Note that gigantic pages are
> special (below).

So, do you mean to allocate a fresh hugetlb page in case we have a free
hugetlb page within the range we are trying to offline? That makes some
sense, I guess.

I can have a look at that, and make hotplug code use the new
alloc_and_dissolve().

Thanks for bringing this up, it is something I did not think about.

> > +				/*
> > +				 * Free hugetlb page. Allocate a new one and
> > +				 * dissolve this one if we succeed.
> > +				 */
> > +				if (alloc_and_dissolve_huge_page(page)) {
> > +					unsigned long order = buddy_order_unsafe(page);
> > +
> > +					low_pfn += (1UL << order) - 1;
> > +					continue;
> > +				}
> 
> 
> 
> Note that there is a very ugly corner case we will have to handle gracefully
> (I think also in patch #1):
> 
> Assume you allocated a gigantic page (and assume that we are not using CMA
> for gigantic pages for simplicity). Assume you want to allocate another one.
> alloc_pool_huge_page()->...->alloc_contig_pages() will stumble over the
> first allocated page. It will try to alloc_and_dissolve_huge_page() the
> existing gigantic page. To do that, it will
> alloc_pool_huge_page()->...->alloc_contig_pages() ... and so on. Bad.

Heh, I was too naive. I have to confess I completely forgot about
gigantic pages and this cyclic dependency.

> We really don't want to mess with gigantic pages (migrate, dissolve) while
> allocating a gigantic page. I think the easiest (and cleanest) way forward
> is to not mess (isolate, migrate, dissolve) with gigantic pages at all.
> 
> Gigantic pages are not movable, so they won't be placed on random CMA /
> ZONE_MOVABLE.
> 
> Some hstate_is_gigantic(h) calls (maybe inside
> alloc_and_dissolve_huge_page() ? ) along with a nice comment might be good
> enough to avoid having to pass down some kind of alloc_contig context. I
> even think that should be handled inside
> 
> (the main issue is that in contrast to CMA, plain alloc_contig_pages() has
> no memory about which parts were allocated and will simply try re-allocating
> what it previously allocated and never freed - which is usually fine, unless
> we're dealing with such special cases)
> 
> Apart from that, not messing with gigantic pages feels like the right
> approach (allocating/migrating gigantic pages is just horribly slow and most
> probably not worth it anyway).

Yes, I also agree that we should leave out gigantic pages, at least for
now.
We might make it work in the future but I cannot come up with a fancy
way to work around that right now, so it makes sense to cut down the
complexity here.
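
To make the guard concrete, below is a compilable userspace sketch (not kernel code; the helpers are stubs and only the names mirror the kernel) of bailing out of alloc_and_dissolve_huge_page() for gigantic hstates, so that allocating a replacement page can never recurse back into alloc_contig_pages():

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ORDER 11	/* buddy allocator limit, as on x86-64 */

struct hstate { unsigned int order; };

/* Mirrors the kernel's hstate_is_gigantic(): anything the buddy
 * allocator cannot hand out directly. */
static bool hstate_is_gigantic(const struct hstate *h)
{
	return h->order >= MAX_ORDER;
}

/* Stub standing in for the real pool-allocation + dissolve path. */
static bool alloc_pool_huge_page_stub(const struct hstate *h)
{
	(void)h;
	return true;
}

static bool alloc_and_dissolve_huge_page(const struct hstate *h)
{
	/*
	 * Refuse to touch gigantic pages: allocating a replacement
	 * would go through alloc_contig_pages() and could recurse
	 * right back into this function.
	 */
	if (hstate_is_gigantic(h))
		return false;
	return alloc_pool_huge_page_stub(h);
}
```

With this shape, a 2 MB hstate (order 9) is handled while a 1 GB hstate (order 18) is simply skipped, which is the "not messing with gigantic pages at all" behavior discussed above.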

Thanks David for the insight!

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/2] mm,page_alloc: Make alloc_contig_range handle free hugetlb pages
  2021-02-10 14:24     ` Oscar Salvador
@ 2021-02-10 14:36       ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2021-02-10 14:36 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Mike Kravetz, Muchun Song, Michal Hocko, linux-mm, linux-kernel

On 10.02.21 15:24, Oscar Salvador wrote:
> On Wed, Feb 10, 2021 at 09:23:59AM +0100, David Hildenbrand wrote:
>> On 08.02.21 11:38, Oscar Salvador wrote:
>>> Free hugetlb pages are trickier to handle: in order to guarantee
>>> no userspace application disruption, we need to replace the
>>> current free hugepage with a new one.
>>>
>>> In order to do that, a new function called alloc_and_dissolve_huge_page
>>> is introduced.
>>> This function will first try to get a new fresh hugetlb page, and if it
>>> succeeds, it will dissolve the old one.
>>>
>>
>> Thanks for looking into this! Can we move this patch to #1 in the series? It
>> is the easier case.
>>
>> I also wonder if, on the memory unplug path, we should try to keep
>> nr_pages by at least attempting to allocate a new one if required, and printing
>> a warning if that fails (after all, we're messing with something configured
>> by the admin - "nr_pages"). Note that gigantic pages are special (below).
> 
> So, do you mean to allocate a new fresh hugepage in case we have a free
> hugetlb page within the range we are trying to offline? That makes some
> sense I guess.
> 
> I can have a look at that, and make hotplug code use the new
> alloc_and_dissolve().

Yes, with the difference that hotplug code most probably wants to
continue even if allocation failed (printing a warning) - mimic existing
behavior. For alloc_contig, I'd say, fail if we cannot relocate free
huge pages, as that is required to not modify nr_pages.

alloc_and_dissolve() should only allocate a page if really required 
(e.g., not sure if we could skip allocation in some cases - like with 
surplus pages, needs some investigation), such that the admin-configured 
nr_pages stays unchanged.
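
A minimal sketch of the "only allocate if really required" idea, under the assumption (flagged above as needing investigation) that dissolving a surplus page does not shrink the admin-configured persistent pool. The counter struct and helper are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the relevant hstate counters. */
struct hstate_counts {
	unsigned long nr_huge_pages;      /* all pages in the pool */
	unsigned long surplus_huge_pages; /* pages over the persistent pool */
};

/*
 * Hypothetical helper: if there are surplus pages, dissolving one of
 * them leaves nr_hugepages (the persistent pool) intact, so no
 * replacement allocation would be needed before dissolving.
 */
static bool replacement_needed(const struct hstate_counts *h)
{
	return h->surplus_huge_pages == 0;
}
```

Whether a specific page can be attributed to the surplus count is exactly the part that needs investigation; this only captures the accounting intuition.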

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH 1/2] mm,page_alloc: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-08 10:38 ` [RFC PATCH 1/2] mm,page_alloc: Make alloc_contig_range handle in-use hugetlb pages Oscar Salvador
  2021-02-10  8:56   ` David Hildenbrand
@ 2021-02-11  0:56   ` Mike Kravetz
  2021-02-11 10:40     ` David Hildenbrand
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Kravetz @ 2021-02-11  0:56 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: David Hildenbrand, Muchun Song, Michal Hocko, linux-mm, linux-kernel

On 2/8/21 2:38 AM, Oscar Salvador wrote:
> alloc_contig_range is not prepared to handle hugetlb pages and will
> fail if it ever sees one, but since they can be migrated as any other
> page (LRU and Movable), it makes sense to also handle them.
> 
> For now, do it only when coming from alloc_contig_range.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  mm/compaction.c | 17 +++++++++++++++++
>  mm/vmscan.c     |  5 +++--
>  2 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e5acb9714436..89cd2e60da29 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -940,6 +940,22 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  			goto isolate_fail;
>  		}
>  
> +		/*
> +		 * Handle hugetlb pages only when coming from alloc_contig
> +		 */
> +		if (PageHuge(page) && cc->alloc_contig) {
> +			if (page_count(page)) {

Thanks for doing this!

I agree with everything in the discussion you and David had.  This code
is racy, but since we are scanning locklessly there is no way to eliminate
the races entirely.  Best to just minimize the windows and document them.
-- 
Mike Kravetz

> +				/*
> +				 * Hugetlb page in-use. Isolate and migrate.
> +				 */
> +				if (isolate_huge_page(page, &cc->migratepages)) {
> +					low_pfn += compound_nr(page) - 1;
> +					goto isolate_success_no_list;
> +				}
> +			}
> +			goto isolate_fail;
> +		}
> +
>  		/*
>  		 * Check may be lockless but that's ok as we recheck later.
>  		 * It's possible to migrate LRU and non-lru movable pages.
> @@ -1041,6 +1057,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  
>  isolate_success:
>  		list_add(&page->lru, &cc->migratepages);
> +isolate_success_no_list:
>  		cc->nr_migratepages += compound_nr(page);
>  		nr_isolated += compound_nr(page);
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b1b574ad199d..0803adca4469 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1506,8 +1506,9 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>  	LIST_HEAD(clean_pages);
>  
>  	list_for_each_entry_safe(page, next, page_list, lru) {
> -		if (page_is_file_lru(page) && !PageDirty(page) &&
> -		    !__PageMovable(page) && !PageUnevictable(page)) {
> +		if (!PageHuge(page) && page_is_file_lru(page) &&
> +		    !PageDirty(page) && !__PageMovable(page) &&
> +		    !PageUnevictable(page)) {
>  			ClearPageActive(page);
>  			list_move(&page->lru, &clean_pages);
>  		}
> 



* Re: [RFC PATCH 2/2] mm,page_alloc: Make alloc_contig_range handle free hugetlb pages
  2021-02-08 10:38 ` [RFC PATCH 2/2] mm,page_alloc: Make alloc_contig_range handle free " Oscar Salvador
  2021-02-10  8:23   ` David Hildenbrand
@ 2021-02-11  1:16   ` Mike Kravetz
  2021-02-11 21:38     ` Oscar Salvador
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Kravetz @ 2021-02-11  1:16 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: David Hildenbrand, Muchun Song, Michal Hocko, linux-mm, linux-kernel

On 2/8/21 2:38 AM, Oscar Salvador wrote:
> Free hugetlb pages are trickier to handle: in order to guarantee
> no userspace application disruption, we need to replace the
> current free hugepage with a new one.
> 
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugetlb page, and if it
> succeeds, it will dissolve the old one.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  include/linux/hugetlb.h |  6 ++++++
>  mm/compaction.c         | 11 +++++++++++
>  mm/hugetlb.c            | 35 +++++++++++++++++++++++++++++++++++
>  3 files changed, 52 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ebca2ef02212..f81afcb86e89 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -505,6 +505,7 @@ struct huge_bootmem_page {
>  	struct hstate *hstate;
>  };
>  
> +bool alloc_and_dissolve_huge_page(struct page *page);
>  struct page *alloc_huge_page(struct vm_area_struct *vma,
>  				unsigned long addr, int avoid_reserve);
>  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> @@ -773,6 +774,11 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
>  #else	/* CONFIG_HUGETLB_PAGE */
>  struct hstate {};
>  
> +static inline bool alloc_and_dissolve_huge_page(struct page *page)
> +{
> +	return false;
> +}
> +
>  static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
>  					   unsigned long addr,
>  					   int avoid_reserve)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 89cd2e60da29..7969ddc10856 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -952,6 +952,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  					low_pfn += compound_nr(page) - 1;
>  					goto isolate_success_no_list;
>  				}
> +			} else {
> +				/*
> +				 * Free hugetlb page. Allocate a new one and
> +				 * dissolve this one if it succeeds.
> +				 */
> +				if (alloc_and_dissolve_huge_page(page)) {
> +					unsigned long order = buddy_order_unsafe(page);
> +
> +					low_pfn += (1UL << order) - 1;
> +					continue;
> +				}
>  			}
>  			goto isolate_fail;
>  		}
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 18f6ee317900..79ffbb64c4ee 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2253,6 +2253,41 @@ static void restore_reserve_on_error(struct hstate *h,
>  	}
>  }
>  
> +bool alloc_and_dissolve_huge_page(struct page *page)
> +{
> +	NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL);
> +	struct page *head;
> +	struct hstate *h;
> +	bool ret = false;
> +	int nid;
> +
> +	if (!nodes_allowed)
> +		return ret;
> +
> +	spin_lock(&hugetlb_lock);
> +	head = compound_head(page);
> +	h = page_hstate(head);
> +	nid = page_to_nid(head);
> +	spin_unlock(&hugetlb_lock);
> +
> +	init_nodemask_of_node(nodes_allowed, nid);
> +
> +	/*
> +	 * Before dissolving the page, we need to allocate a new one,
> +	 * so the pool remains stable.
> +	 */
> +	if (alloc_pool_huge_page(h, nodes_allowed, NULL)) {
> +		/*
> +		 * Ok, we have a free hugetlb-page to replace this
> +		 * one. Dissolve the old page.
> +		 */
> +		if (!dissolve_free_huge_page(page))
> +			ret = true;

Should probably check for -EBUSY, as this means someone started using
the page while we were allocating a new one.  It would complicate the
code to try and do the 'right thing'.  The right thing might be
dissolving the new pool page and then trying to isolate this in-use
page.  Of course things could change again while you are doing that. :(

Might be good enough as is.  As noted previously, this code is racy.
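
A compilable userspace sketch of that unwind (helpers are stubs standing in for the real hugetlb functions; only the control flow is meant seriously): on -EBUSY from dissolving the old page, dissolve the freshly allocated one instead of leaking it into the pool.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct page { int in_use; };

static struct page fresh_page;	/* zero-initialized: free */

static struct page *alloc_pool_huge_page_stub(void)
{
	return &fresh_page;
}

/* Models the race: dissolving fails if the page got used meanwhile. */
static int dissolve_free_huge_page_stub(struct page *page)
{
	return page->in_use ? -EBUSY : 0;
}

/* Returns true when the old free hugepage was replaced and dissolved. */
static bool alloc_and_dissolve_huge_page(struct page *old)
{
	struct page *fresh = alloc_pool_huge_page_stub();

	if (!fresh)
		return false;

	if (!dissolve_free_huge_page_stub(old))
		return true;

	/*
	 * -EBUSY: the old page started being used while we allocated.
	 * Drop the now-surplus fresh page so the admin-configured
	 * nr_hugepages stays unchanged; the caller can then retry via
	 * the in-use isolation path.
	 */
	dissolve_free_huge_page_stub(fresh);
	return false;
}
```

As noted, dissolving the fresh page could itself fail in the real kernel (reserves, another user); the sketch ignores that second-order race.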
-- 
Mike Kravetz

> +	}
> +
> +	return ret;
> +}
> +
>  struct page *alloc_huge_page(struct vm_area_struct *vma,
>  				    unsigned long addr, int avoid_reserve)
>  {
> 



* Re: [RFC PATCH 1/2] mm,page_alloc: Make alloc_contig_range handle in-use hugetlb pages
  2021-02-11  0:56   ` Mike Kravetz
@ 2021-02-11 10:40     ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2021-02-11 10:40 UTC (permalink / raw)
  To: Mike Kravetz, Oscar Salvador
  Cc: Muchun Song, Michal Hocko, linux-mm, linux-kernel

On 11.02.21 01:56, Mike Kravetz wrote:
> On 2/8/21 2:38 AM, Oscar Salvador wrote:
>> alloc_contig_range is not prepared to handle hugetlb pages and will
>> fail if it ever sees one, but since they can be migrated as any other
>> page (LRU and Movable), it makes sense to also handle them.
>>
>> For now, do it only when coming from alloc_contig_range.
>>
>> Signed-off-by: Oscar Salvador <osalvador@suse.de>
>> ---
>>   mm/compaction.c | 17 +++++++++++++++++
>>   mm/vmscan.c     |  5 +++--
>>   2 files changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index e5acb9714436..89cd2e60da29 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -940,6 +940,22 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>   			goto isolate_fail;
>>   		}
>>   
>> +		/*
>> +		 * Handle hugetlb pages only when coming from alloc_contig
>> +		 */
>> +		if (PageHuge(page) && cc->alloc_contig) {
>> +			if (page_count(page)) {
> 
> Thanks for doing this!
> 
> I agree with everything in the discussion you and David had.  This code
> is racy, but since we are scanning locklessly there is no way to eliminate
> the races entirely.  Best to just minimize the windows and document them.
>

Agreed - and make sure that we don't have strange side effects (e.g., in the 
next patch: allocate a new page, try to dissolve, dissolving fails; what 
happens to the just-allocated page?)

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH 2/2] mm,page_alloc: Make alloc_contig_range handle free hugetlb pages
  2021-02-11  1:16   ` Mike Kravetz
@ 2021-02-11 21:38     ` Oscar Salvador
  0 siblings, 0 replies; 18+ messages in thread
From: Oscar Salvador @ 2021-02-11 21:38 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: David Hildenbrand, Muchun Song, Michal Hocko, linux-mm, linux-kernel

On Wed, Feb 10, 2021 at 05:16:29PM -0800, Mike Kravetz wrote:
> Should probably check for -EBUSY, as this means someone started using
> the page while we were allocating a new one.  It would complicate the
> code to try and do the 'right thing'.  The right thing might be
> dissolving the new pool page and then trying to isolate this in-use
> page.  Of course things could change again while you are doing that. :(

Yeah, I kept the error handling rather minimal just to be clear about the
approach I was leaning towards, but yes, we should definitely check
for -EBUSY on dissolve_free_huge_page().

And it might be that dissolve_free_huge_page() returns -EBUSY on the old
page, in which case we need to dissolve the freshly allocated one as it
is not going to be used; that might fail as well, due to reserves for
instance, or because someone started using it.
I have to confess that I need to check the reservation code closer to be
aware of corner cases.

We used to try to be clever in such situations in memory-failure code,
but at some point you end up thinking "ok, how many retries are
considered enough?".
That code was trickier as we were handling uncorrected/corrected memory
errors, so we strived to do our best, but here we can be more flexible
as the whole thing is racy per se, and just fail if we find too many
obstacles.
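
The "just fail if we find too many obstacles" policy can be sketched as a bounded retry loop (the helper and retry count below are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdbool.h>

enum { MAX_REPLACE_RETRIES = 3 };	/* arbitrary illustrative bound */

/* Stub: the replace succeeds once the racing user is gone. */
static bool try_replace_once(int *attempts_until_free)
{
	if (*attempts_until_free > 0) {
		(*attempts_until_free)--;
		return false;
	}
	return true;
}

static bool replace_free_hugepage(int attempts_until_free)
{
	for (int i = 0; i < MAX_REPLACE_RETRIES; i++) {
		if (try_replace_once(&attempts_until_free))
			return true;
	}
	/* Too many obstacles: let alloc_contig_range() just fail. */
	return false;
}
```

Contrast with the memory-failure code mentioned above, which strives much harder because it is dealing with memory errors; here the whole operation is racy per se, so an early, clean failure is acceptable.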

I shall resume work early next week. 

Thanks for the tips ;-)

-- 
Oscar Salvador
SUSE L3



* Re: [RFC PATCH 2/2] mm,page_alloc: Make alloc_contig_range handle free hugetlb pages
  2021-02-10  8:23   ` David Hildenbrand
  2021-02-10 14:24     ` Oscar Salvador
@ 2021-02-25 21:43     ` Mike Kravetz
  2021-02-25 22:15       ` David Hildenbrand
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Kravetz @ 2021-02-25 21:43 UTC (permalink / raw)
  To: David Hildenbrand, Oscar Salvador
  Cc: Muchun Song, Michal Hocko, linux-mm, linux-kernel

On 2/10/21 12:23 AM, David Hildenbrand wrote:
> On 08.02.21 11:38, Oscar Salvador wrote:
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -952,6 +952,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>                       low_pfn += compound_nr(page) - 1;
>>                       goto isolate_success_no_list;
>>                   }
>> +            } else {
> 
> } else if (alloc_and_dissolve_huge_page(page))) {
> 
> ...
> 
>> +                /*
>> +                 * Free hugetlb page. Allocate a new one and
>> +                 * dissolve this one if it succeeds.
>> +                 */
>> +                if (alloc_and_dissolve_huge_page(page)) {
>> +                    unsigned long order = buddy_order_unsafe(page);
>> +
>> +                    low_pfn += (1UL << order) - 1;
>> +                    continue;
>> +                }
> 
> 
> 
> Note that there is a very ugly corner case we will have to handle gracefully (I think also in patch #1):
> 
> Assume you allocated a gigantic page (and assume that we are not using CMA for gigantic pages for simplicity). Assume you want to allocate another one. alloc_pool_huge_page()->...->alloc_contig_pages() will stumble over the first allocated page. It will try to alloc_and_dissolve_huge_page() the existing gigantic page. To do that, it will alloc_pool_huge_page()->...->alloc_contig_pages() ... and so on. Bad.
> 

Sorry for resurrecting an old thread.
While looking at V3 of these patches, I was exploring all the calling
sequences looking for races and other issues.  It 'may' be that
infinitely allocating and freeing gigantic pages is not actually
an issue.  Of course, I could be mistaken.  Here is my reasoning:

alloc_and_dissolve_huge_page (now isolate_or_dissolve_huge_page) will be
called from __alloc_contig_migrate_range() within alloc_contig_range().
Before calling __alloc_contig_migrate_range, we call start_isolate_page_range
to isolate all page blocks in the range.  Because all the page blocks in
the range are isolated, another invocation of alloc_contig_range will
not operate on any part of that range.  See the comments for
start_isolate_page_range or commit 2c7452a075d4.  So, when
alloc_contig_pages() goes to allocate another gigantic page it will
never notice/operate on the existing gigantic page.

Again, this is confusing and I might be missing something.

In any case, I agree that gigantic pages are tricky and we should leave
them out of the discussion for now.  We can rethink this later if
necessary.
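
The ordering relied on above can be sketched in a compilable toy model (pageblocks as a flag array, simplified ranges; nothing here is the real implementation): because alloc_contig_range() isolates the whole range first, a nested invocation over an overlapping range fails fast instead of re-processing pages the outer caller isolated.

```c
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 16

static bool isolated[NBLOCKS];

/* Fails if any pageblock in [lo, hi) is already isolated by another
 * in-flight alloc_contig_range(), rolling back partial work. */
static bool start_isolate_page_range(int lo, int hi)
{
	for (int i = lo; i < hi; i++) {
		if (isolated[i]) {
			for (int j = lo; j < i; j++)
				isolated[j] = false;
			return false;
		}
		isolated[i] = true;
	}
	return true;
}

static void undo_isolate_page_range(int lo, int hi)
{
	for (int i = lo; i < hi; i++)
		isolated[i] = false;
}

static bool alloc_contig_range_sketch(int lo, int hi)
{
	if (!start_isolate_page_range(lo, hi))
		return false;
	/* ... migrate/dissolve pages in [lo, hi) would happen here ... */
	undo_isolate_page_range(lo, hi);
	return true;
}
```

This shows why the direct endless loop is blocked; David's cascade concern below is about disjoint ranges, which this guard does not prevent.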
-- 
Mike Kravetz



* Re: [RFC PATCH 2/2] mm,page_alloc: Make alloc_contig_range handle free hugetlb pages
  2021-02-25 21:43     ` Mike Kravetz
@ 2021-02-25 22:15       ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2021-02-25 22:15 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: David Hildenbrand, Oscar Salvador, Muchun Song, Michal Hocko,
	linux-mm, linux-kernel


> On 25.02.2021 at 22:43, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> On 2/10/21 12:23 AM, David Hildenbrand wrote:
>>> On 08.02.21 11:38, Oscar Salvador wrote:
>>> --- a/mm/compaction.c
>>> +++ b/mm/compaction.c
>>> @@ -952,6 +952,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>>                      low_pfn += compound_nr(page) - 1;
>>>                      goto isolate_success_no_list;
>>>                  }
>>> +            } else {
>> 
>> } else if (alloc_and_dissolve_huge_page(page))) {
>> 
>> ...
>> 
>>> +                /*
>>> +                 * Free hugetlb page. Allocate a new one and
>>> +                 * dissolve this one if it succeeds.
>>> +                 */
>>> +                if (alloc_and_dissolve_huge_page(page)) {
>>> +                    unsigned long order = buddy_order_unsafe(page);
>>> +
>>> +                    low_pfn += (1UL << order) - 1;
>>> +                    continue;
>>> +                }
>> 
>> 
>> 
>> Note that there is a very ugly corner case we will have to handle gracefully (I think also in patch #1):
>> 
>> Assume you allocated a gigantic page (and assume that we are not using CMA for gigantic pages for simplicity). Assume you want to allocate another one. alloc_pool_huge_page()->...->alloc_contig_pages() will stumble over the first allocated page. It will try to alloc_and_dissolve_huge_page() the existing gigantic page. To do that, it will alloc_pool_huge_page()->...->alloc_contig_pages() ... and so on. Bad.
>> 
> 
> Sorry for resurrecting an old thread.
> While looking at V3 of these patches, I was exploring all the calling
> sequences looking for races and other issues.  It 'may' be that the
> issue about infinitely allocating and freeing gigantic pages may not be
> an issue.  Of course, I could be mistaken.  Here is my reasoning:
> 
> alloc_and_dissolve_huge_page (now isolate_or_dissolve_huge_page) will be
> called from __alloc_contig_migrate_range() within alloc_contig_range().
> Before calling __alloc_contig_migrate_range, we call start_isolate_page_range
> to isolate all page blocks in the range.  Because all the page blocks in
> the range are isolated, another invocation of alloc_contig_range will
> not operate on any part of that range.  See the comments for
> start_isolate_page_range or commit 2c7452a075d4.  So, when
> alloc_contig_pages() goes to allocate another gigantic page it will
> never notice/operate on the existing gigantic page.
> 
> Again, this is confusing and I might be missing something.

I think you are right that the endless loop is blocked. But I think the whole thing could cascade once we have multiple gigantic pages allocated.

Try allocating a new gpage. We find an existing gpage, isolate it and try to migrate it. To do that, we try allocating a new gpage. We find yet another existing gpage, isolate and try to migrate it ... until we have isolated all gpages on our way to an actually usable area. Then we have to migrate all of them in reverse order ...

Of course this only works if we can actually isolate a gigantic page - which should be the case I think (they are migratable and should be marked as movable).

> 
> In any case, I agree that gigantic pages are tricky and we should leave
> them out of the discussion for now.  We can rethink this later if
> necessary.

Yes, it's tricky and not strictly required right now because we never place them on ZONE_MOVABLE. And as I said, actual use cases might be rare.




end of thread, other threads:[~2021-02-25 22:15 UTC | newest]

Thread overview: 18+ messages
2021-02-08 10:38 [RFC PATCH 0/2] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
2021-02-08 10:38 ` [RFC PATCH 1/2] mm,page_alloc: Make alloc_contig_range handle in-use hugetlb pages Oscar Salvador
2021-02-10  8:56   ` David Hildenbrand
2021-02-10 14:09     ` Oscar Salvador
2021-02-10 14:11       ` David Hildenbrand
2021-02-10 14:14         ` Oscar Salvador
2021-02-10 14:22           ` David Hildenbrand
2021-02-11  0:56   ` Mike Kravetz
2021-02-11 10:40     ` David Hildenbrand
2021-02-08 10:38 ` [RFC PATCH 2/2] mm,page_alloc: Make alloc_contig_range handle free " Oscar Salvador
2021-02-10  8:23   ` David Hildenbrand
2021-02-10 14:24     ` Oscar Salvador
2021-02-10 14:36       ` David Hildenbrand
2021-02-25 21:43     ` Mike Kravetz
2021-02-25 22:15       ` David Hildenbrand
2021-02-11  1:16   ` Mike Kravetz
2021-02-11 21:38     ` Oscar Salvador
2021-02-08 10:39 ` [RFC PATCH 0/2] Make alloc_contig_range handle Hugetlb pages Oscar Salvador
