linux-mm.kvack.org archive mirror
* [PATCH v2 0/2] mm/hugetlb: follow_hugetlb_page() improvements
@ 2021-01-28 18:26 Joao Martins
  2021-01-28 18:26 ` [PATCH v2 1/2] mm/hugetlb: grab head page refcount once for group of subpages Joao Martins
  2021-01-28 18:26 ` [PATCH v2 2/2] mm/hugetlb: refactor subpage recording Joao Martins
  0 siblings, 2 replies; 8+ messages in thread
From: Joao Martins @ 2021-01-28 18:26 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, Mike Kravetz, Andrew Morton, Joao Martins

Hey,

While looking at ZONE_DEVICE struct page reuse, particularly the last
patch[0], I found two possible improvements for follow_hugetlb_page(),
which is used solely by get_user_pages()/pin_user_pages().

The first patch batches page refcount updates while the second tidies
up the storing of subpages/vmas. Together they bring the cost of the
slow gup() variant down from ~87.6k usecs to ~5.8k usecs.

libhugetlbfs tests seem to pass, as do the gup_test benchmarks with
hugetlbfs vmas.

v2:
  * switch from refs++ to ++refs;
  * add Mike's Rb on patch 1;
  * switch from page++ to mem_map_offset() on the second patch;
  
[0] https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.martins@oracle.com/

Joao Martins (2):
  mm/hugetlb: grab head page refcount once for group of subpages
  mm/hugetlb: refactor subpage recording

 include/linux/mm.h |  3 +++
 mm/gup.c           |  5 ++--
 mm/hugetlb.c       | 66 +++++++++++++++++++++++++++-------------------
 3 files changed, 44 insertions(+), 30 deletions(-)

-- 
2.17.1



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/2] mm/hugetlb: grab head page refcount once for group of subpages
  2021-01-28 18:26 [PATCH v2 0/2] mm/hugetlb: follow_hugetlb_page() improvements Joao Martins
@ 2021-01-28 18:26 ` Joao Martins
  2021-01-28 18:26 ` [PATCH v2 2/2] mm/hugetlb: refactor subpage recording Joao Martins
  1 sibling, 0 replies; 8+ messages in thread
From: Joao Martins @ 2021-01-28 18:26 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, Mike Kravetz, Andrew Morton, Joao Martins

follow_hugetlb_page(), once it locks the pmd/pud, walks all the N
subpages of the huge page and grabs a reference for each one
individually. Similar to gup-fast, have follow_hugetlb_page() grab the
head page refcount only once, after counting all the subpages of the
just-faulted huge page that are being pinned.

Consequently we reduce the number of atomics necessary to pin said
huge page, which improves non-fast gup() considerably:

  - 16G with 1G huge page size
  gup_test -f /mnt/huge/file -m 16384 -r 10 -L -S -n 512 -w

PIN_LONGTERM_BENCHMARK: ~87.6k us -> ~12.8k us
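
To illustrate the batching (a standalone userspace sketch, not the
kernel code; the type and function names below are made up for this
example), pinning N subpages goes from N atomic read-modify-writes on
the head page refcount to a single one:

/* sketch.c - contrast per-subpage vs batched refcount grabs */
#include <stdatomic.h>
#include <stdio.h>

struct head_page { atomic_int refcount; };

/* old scheme: one atomic RMW on the head page per subpage */
static void grab_per_subpage(struct head_page *head, int nr_subpages)
{
        int i;

        for (i = 0; i < nr_subpages; i++)
                atomic_fetch_add(&head->refcount, 1);
}

/* new scheme: count the subpages first, then do a single atomic RMW */
static void grab_batched(struct head_page *head, int refs)
{
        atomic_fetch_add(&head->refcount, refs);
}

int main(void)
{
        struct head_page head = { .refcount = 1 };

        grab_per_subpage(&head, 512);   /* 512 atomic operations */
        grab_batched(&head, 512);       /* 1 atomic operation */
        printf("refcount now %d\n", atomic_load(&head.refcount));
        return 0;
}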

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/mm.h |  3 +++
 mm/gup.c           |  5 ++---
 mm/hugetlb.c       | 43 ++++++++++++++++++++++++-------------------
 3 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a5d618d08506..0d793486822b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1182,6 +1182,9 @@ static inline void get_page(struct page *page)
 }
 
 bool __must_check try_grab_page(struct page *page, unsigned int flags);
+__maybe_unused struct page *try_grab_compound_head(struct page *page, int refs,
+						   unsigned int flags);
+
 
 static inline __must_check bool try_get_page(struct page *page)
 {
diff --git a/mm/gup.c b/mm/gup.c
index 3e086b073624..ecadc80934b2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -79,9 +79,8 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
  * considered failure, and furthermore, a likely bug in the caller, so a warning
  * is also emitted.
  */
-static __maybe_unused struct page *try_grab_compound_head(struct page *page,
-							  int refs,
-							  unsigned int flags)
+__maybe_unused struct page *try_grab_compound_head(struct page *page,
+						   int refs, unsigned int flags)
 {
 	if (flags & FOLL_GET)
 		return try_get_compound_head(page, refs);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a6bad1f686c5..becef936ec21 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4798,7 +4798,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long vaddr = *position;
 	unsigned long remainder = *nr_pages;
 	struct hstate *h = hstate_vma(vma);
-	int err = -EFAULT;
+	int err = -EFAULT, refs;
 
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
@@ -4918,26 +4918,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			continue;
 		}
 
+		refs = 0;
+
 same_page:
-		if (pages) {
+		if (pages)
 			pages[i] = mem_map_offset(page, pfn_offset);
-			/*
-			 * try_grab_page() should always succeed here, because:
-			 * a) we hold the ptl lock, and b) we've just checked
-			 * that the huge page is present in the page tables. If
-			 * the huge page is present, then the tail pages must
-			 * also be present. The ptl prevents the head page and
-			 * tail pages from being rearranged in any way. So this
-			 * page must be available at this point, unless the page
-			 * refcount overflowed:
-			 */
-			if (WARN_ON_ONCE(!try_grab_page(pages[i], flags))) {
-				spin_unlock(ptl);
-				remainder = 0;
-				err = -ENOMEM;
-				break;
-			}
-		}
 
 		if (vmas)
 			vmas[i] = vma;
@@ -4946,6 +4931,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		++pfn_offset;
 		--remainder;
 		++i;
+		++refs;
 		if (vaddr < vma->vm_end && remainder &&
 				pfn_offset < pages_per_huge_page(h)) {
 			/*
@@ -4953,6 +4939,25 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 * of this compound page.
 			 */
 			goto same_page;
+		} else if (pages) {
+			/*
+			 * try_grab_compound_head() should always succeed here,
+			 * because: a) we hold the ptl lock, and b) we've just
+			 * checked that the huge page is present in the page
+			 * tables. If the huge page is present, then the tail
+			 * pages must also be present. The ptl prevents the
+			 * head page and tail pages from being rearranged in
+			 * any way. So this page must be available at this
+			 * point, unless the page refcount overflowed:
+			 */
+			if (WARN_ON_ONCE(!try_grab_compound_head(pages[i-1],
+								 refs,
+								 flags))) {
+				spin_unlock(ptl);
+				remainder = 0;
+				err = -ENOMEM;
+				break;
+			}
 		}
 		spin_unlock(ptl);
 	}
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 2/2] mm/hugetlb: refactor subpage recording
  2021-01-28 18:26 [PATCH v2 0/2] mm/hugetlb: follow_hugetlb_page() improvements Joao Martins
  2021-01-28 18:26 ` [PATCH v2 1/2] mm/hugetlb: grab head page refcount once for group of subpages Joao Martins
@ 2021-01-28 18:26 ` Joao Martins
  2021-01-28 21:53   ` Mike Kravetz
  1 sibling, 1 reply; 8+ messages in thread
From: Joao Martins @ 2021-01-28 18:26 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, Mike Kravetz, Andrew Morton, Joao Martins

For a given hugepage backing a VA, there's a rather inefficient loop
which is solely responsible for storing subpages in the GUP
@pages/@vmas arrays. For each subpage we check whether it's within the
range or size of @pages and keep incrementing @pfn_offset and a couple
of other variables per subpage iteration.

Simplify this logic and minimize the cost of each iteration to just
storing the output page/vma. Instead of incrementing the number of
@refs iteratively, pre-calculate @refs and use a tight loop only for
storing the pinned subpages/vmas.

Additionally, retain existing behaviour with using mem_map_offset()
when recording the subpages for configurations that don't have a
contiguous mem_map.

Pinning consequently improves, bringing us close to
{pin,get}_user_pages_fast:

  - 16G with 1G huge page size
  gup_test -f /mnt/huge/file -m 16384 -r 30 -L -S -n 512 -w

PIN_LONGTERM_BENCHMARK: ~12.8k us -> ~5.8k us
PIN_FAST_BENCHMARK: ~3.7k us
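
As a rough standalone sketch of the @refs pre-calculation (not kernel
code; the addresses and sizes below are made-up example values), the
number of subpages recorded per PTL acquisition is the minimum of what
is left in the huge page, what is left in the VMA, and what the caller
still asked for:

/* refs.c - how many subpages can be recorded in one go */
#include <stdio.h>

#define PAGE_SHIFT      12UL
#define PAGE_SIZE       (1UL << PAGE_SHIFT)

static unsigned long min3(unsigned long a, unsigned long b, unsigned long c)
{
        unsigned long m = a < b ? a : b;

        return m < c ? m : c;
}

int main(void)
{
        unsigned long pages_per_huge_page = 262144;  /* 1G huge page, 4K base pages */
        unsigned long pfn_offset = 100;              /* subpage offset of vaddr */
        unsigned long vaddr = 0x40000000UL + pfn_offset * PAGE_SIZE;
        unsigned long vm_end = 0x80000000UL;         /* end of the VMA */
        unsigned long remainder = 512;               /* pages the caller still wants */
        unsigned long refs;

        refs = min3(pages_per_huge_page - pfn_offset,
                    (vm_end - vaddr) >> PAGE_SHIFT, remainder);

        printf("record %lu subpages under a single PTL hold\n", refs);
        return 0;
}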

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/hugetlb.c | 49 ++++++++++++++++++++++++++++---------------------
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index becef936ec21..f3baabbda432 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4789,6 +4789,20 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	goto out;
 }
 
+static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
+				 int refs, struct page **pages,
+				 struct vm_area_struct **vmas)
+{
+	int nr;
+
+	for (nr = 0; nr < refs; nr++) {
+		if (likely(pages))
+			pages[nr] = mem_map_offset(page, nr);
+		if (vmas)
+			vmas[nr] = vma;
+	}
+}
+
 long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 struct page **pages, struct vm_area_struct **vmas,
 			 unsigned long *position, unsigned long *nr_pages,
@@ -4918,28 +4932,16 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			continue;
 		}
 
-		refs = 0;
+		refs = min3(pages_per_huge_page(h) - pfn_offset,
+			    (vma->vm_end - vaddr) >> PAGE_SHIFT, remainder);
 
-same_page:
-		if (pages)
-			pages[i] = mem_map_offset(page, pfn_offset);
+		if (pages || vmas)
+			record_subpages_vmas(mem_map_offset(page, pfn_offset),
+					     vma, refs,
+					     likely(pages) ? pages + i : NULL,
+					     vmas ? vmas + i : NULL);
 
-		if (vmas)
-			vmas[i] = vma;
-
-		vaddr += PAGE_SIZE;
-		++pfn_offset;
-		--remainder;
-		++i;
-		++refs;
-		if (vaddr < vma->vm_end && remainder &&
-				pfn_offset < pages_per_huge_page(h)) {
-			/*
-			 * We use pfn_offset to avoid touching the pageframes
-			 * of this compound page.
-			 */
-			goto same_page;
-		} else if (pages) {
+		if (pages) {
 			/*
 			 * try_grab_compound_head() should always succeed here,
 			 * because: a) we hold the ptl lock, and b) we've just
@@ -4950,7 +4952,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 * any way. So this page must be available at this
 			 * point, unless the page refcount overflowed:
 			 */
-			if (WARN_ON_ONCE(!try_grab_compound_head(pages[i-1],
+			if (WARN_ON_ONCE(!try_grab_compound_head(pages[i],
 								 refs,
 								 flags))) {
 				spin_unlock(ptl);
@@ -4959,6 +4961,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				break;
 			}
 		}
+
+		vaddr += (refs << PAGE_SHIFT);
+		remainder -= refs;
+		i += refs;
+
 		spin_unlock(ptl);
 	}
 	*nr_pages = remainder;
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] mm/hugetlb: refactor subpage recording
  2021-01-28 18:26 ` [PATCH v2 2/2] mm/hugetlb: refactor subpage recording Joao Martins
@ 2021-01-28 21:53   ` Mike Kravetz
  2021-02-11 20:47     ` Zi Yan
  0 siblings, 1 reply; 8+ messages in thread
From: Mike Kravetz @ 2021-01-28 21:53 UTC (permalink / raw)
  To: Joao Martins, linux-mm; +Cc: linux-kernel, Andrew Morton

On 1/28/21 10:26 AM, Joao Martins wrote:
> For a given hugepage backing a VA, there's a rather inefficient loop
> which is solely responsible for storing subpages in the GUP
> @pages/@vmas arrays. For each subpage we check whether it's within the
> range or size of @pages and keep incrementing @pfn_offset and a couple
> of other variables per subpage iteration.
> 
> Simplify this logic and minimize the cost of each iteration to just
> storing the output page/vma. Instead of incrementing the number of
> @refs iteratively, pre-calculate @refs and use a tight loop only for
> storing the pinned subpages/vmas.
> 
> Additionally, retain existing behaviour with using mem_map_offset()
> when recording the subpages for configurations that don't have a
> contiguous mem_map.
> 
> Pinning consequently improves, bringing us close to
> {pin,get}_user_pages_fast:
> 
>   - 16G with 1G huge page size
>   gup_test -f /mnt/huge/file -m 16384 -r 30 -L -S -n 512 -w
> 
> PIN_LONGTERM_BENCHMARK: ~12.8k us -> ~5.8k us
> PIN_FAST_BENCHMARK: ~3.7k us
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/hugetlb.c | 49 ++++++++++++++++++++++++++++---------------------
>  1 file changed, 28 insertions(+), 21 deletions(-)

Thanks for updating this.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

I think there still is an open general question about whether we can always
assume page structs are contiguous for really big pages.  That is outside
the scope of this patch.  Adding the mem_map_offset() keeps this consistent
with other hugetlbfs specific code.
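
For reference, mem_map_offset() in mm/internal.h reads roughly as below
at the time of this thread; past MAX_ORDER_NR_PAGES it goes through
nth_page()/pfn_to_page() rather than plain pointer arithmetic, which is
what keeps it safe when the mem_map is not virtually contiguous:

/* mm/internal.h (roughly, at the time of this thread) */
static inline struct page *mem_map_offset(struct page *base, int offset)
{
        if (unlikely(offset >= MAX_ORDER_NR_PAGES))
                return nth_page(base, offset);
        return base + offset;
}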

-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] mm/hugetlb: refactor subpage recording
  2021-01-28 21:53   ` Mike Kravetz
@ 2021-02-11 20:47     ` Zi Yan
  2021-02-11 23:44       ` Mike Kravetz
  0 siblings, 1 reply; 8+ messages in thread
From: Zi Yan @ 2021-02-11 20:47 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Joao Martins, linux-mm, linux-kernel, Andrew Morton


On 28 Jan 2021, at 16:53, Mike Kravetz wrote:

> On 1/28/21 10:26 AM, Joao Martins wrote:
>> For a given hugepage backing a VA, there's a rather inefficient loop
>> which is solely responsible for storing subpages in the GUP
>> @pages/@vmas arrays. For each subpage we check whether it's within the
>> range or size of @pages and keep incrementing @pfn_offset and a couple
>> of other variables per subpage iteration.
>>
>> Simplify this logic and minimize the cost of each iteration to just
>> storing the output page/vma. Instead of incrementing the number of
>> @refs iteratively, pre-calculate @refs and use a tight loop only for
>> storing the pinned subpages/vmas.
>>
>> Additionally, retain existing behaviour with using mem_map_offset()
>> when recording the subpages for configurations that don't have a
>> contiguous mem_map.
>>
>> Pinning consequently improves, bringing us close to
>> {pin,get}_user_pages_fast:
>>
>>   - 16G with 1G huge page size
>>   gup_test -f /mnt/huge/file -m 16384 -r 30 -L -S -n 512 -w
>>
>> PIN_LONGTERM_BENCHMARK: ~12.8k us -> ~5.8k us
>> PIN_FAST_BENCHMARK: ~3.7k us
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/hugetlb.c | 49 ++++++++++++++++++++++++++++---------------------
>>  1 file changed, 28 insertions(+), 21 deletions(-)
>
> Thanks for updating this.
>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>
> I think there still is an open general question about whether we can always
> assume page structs are contiguous for really big pages.  That is outside

I do not think page structs need to be contiguous, but PFNs within a big page
need to be contiguous, at least based on existing code like mem_map_offset() we have.
The assumption seems valid according to the existing big page allocation methods,
which use alloc_contig_pages() at the end of the day. alloc_contig_pages()
calls pfn_range_valid_contig() to make sure all PFNs are contiguous.
On the other hand, the buddy allocator only merges contiguous PFNs, so there
will be no problem even if someone configures the buddy allocator to allocate
gigantic pages.

Unless someone comes up with some fancy way of making page allocations from
contiguous page structs in SPARSEMEM_VMEMMAP case, where non-contiguous
PFNs with contiguous page structs are possible, or out of any adjacent
pages in !SPARSEMEM_VMEMMAP case, where non-contiguous page structs
and non-contiguous PFNs are possible, we should be good.


—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] mm/hugetlb: refactor subpage recording
  2021-02-11 20:47     ` Zi Yan
@ 2021-02-11 23:44       ` Mike Kravetz
  2021-02-13 15:44         ` Zi Yan
  0 siblings, 1 reply; 8+ messages in thread
From: Mike Kravetz @ 2021-02-11 23:44 UTC (permalink / raw)
  To: Zi Yan; +Cc: Joao Martins, linux-mm, linux-kernel, Andrew Morton

On 2/11/21 12:47 PM, Zi Yan wrote:
> On 28 Jan 2021, at 16:53, Mike Kravetz wrote:
> 
>> On 1/28/21 10:26 AM, Joao Martins wrote:
>>> For a given hugepage backing a VA, there's a rather inefficient loop
>>> which is solely responsible for storing subpages in the GUP
>>> @pages/@vmas arrays. For each subpage we check whether it's within the
>>> range or size of @pages and keep incrementing @pfn_offset and a couple
>>> of other variables per subpage iteration.
>>>
>>> Simplify this logic and minimize the cost of each iteration to just
>>> storing the output page/vma. Instead of incrementing the number of
>>> @refs iteratively, pre-calculate @refs and use a tight loop only for
>>> storing the pinned subpages/vmas.
>>>
>>> Additionally, retain existing behaviour with using mem_map_offset()
>>> when recording the subpages for configurations that don't have a
>>> contiguous mem_map.
>>>
>>> Pinning consequently improves, bringing us close to
>>> {pin,get}_user_pages_fast:
>>>
>>>   - 16G with 1G huge page size
>>>   gup_test -f /mnt/huge/file -m 16384 -r 30 -L -S -n 512 -w
>>>
>>> PIN_LONGTERM_BENCHMARK: ~12.8k us -> ~5.8k us
>>> PIN_FAST_BENCHMARK: ~3.7k us
>>>
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>>  mm/hugetlb.c | 49 ++++++++++++++++++++++++++++---------------------
>>>  1 file changed, 28 insertions(+), 21 deletions(-)
>>
>> Thanks for updating this.
>>
>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>>
>> I think there still is an open general question about whether we can always
>> assume page structs are contiguous for really big pages.  That is outside
> 
> I do not think page structs need to be contiguous, but PFNs within a big page
> need to be contiguous, at least based on existing code like mem_map_offset() we have.

Thanks for looking Zi,
Yes, PFNs need to be contiguous.  Also, as you say page structs do not need
to be contiguous.  The issue is that there is code that assumes page structs
are contiguous for gigantic pages.  hugetlb code does not make this assumption
and does a pfn_to_page() when looping through page structs for gigantic pages.

I do not believe this to be a huge issue.  In most cases CONFIG_SPARSEMEM_VMEMMAP
is defined and struct pages can be accessed contiguously.  I 'think' we could
run into problems with CONFIG_SPARSEMEM and without CONFIG_SPARSEMEM_VMEMMAP
and doing hotplug operations.  However, I still need to look into it more.
-- 
Mike Kravetz

> The assumption seems valid according to the existing big page allocation methods,
> which use alloc_contig_pages() at the end of the day. alloc_contig_pages()
> calls pfn_range_valid_contig() to make sure all PFNs are contiguous.
> On the other hand, the buddy allocator only merges contiguous PFNs, so there
> will be no problem even if someone configures the buddy allocator to allocate
> gigantic pages.
> 
> Unless someone comes up with some fancy way of making page allocations from
> contiguous page structs in SPARSEMEM_VMEMMAP case, where non-contiguous
> PFNs with contiguous page structs are possible, or out of any adjacent
> pages in !SPARSEMEM_VMEMMAP case, where non-contiguous page structs
> and non-contiguous PFNs are possible, we should be good.
> 
> 
> —
> Best Regards,
> Yan Zi
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] mm/hugetlb: refactor subpage recording
  2021-02-11 23:44       ` Mike Kravetz
@ 2021-02-13 15:44         ` Zi Yan
  2021-02-13 21:04           ` Mike Kravetz
  0 siblings, 1 reply; 8+ messages in thread
From: Zi Yan @ 2021-02-13 15:44 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Joao Martins, linux-mm, linux-kernel, Andrew Morton


On 11 Feb 2021, at 18:44, Mike Kravetz wrote:

> On 2/11/21 12:47 PM, Zi Yan wrote:
>> On 28 Jan 2021, at 16:53, Mike Kravetz wrote:
>>
>>> On 1/28/21 10:26 AM, Joao Martins wrote:
>>>> For a given hugepage backing a VA, there's a rather inefficient loop
>>>> which is solely responsible for storing subpages in the GUP
>>>> @pages/@vmas arrays. For each subpage we check whether it's within the
>>>> range or size of @pages and keep incrementing @pfn_offset and a couple
>>>> of other variables per subpage iteration.
>>>>
>>>> Simplify this logic and minimize the cost of each iteration to just
>>>> storing the output page/vma. Instead of incrementing the number of
>>>> @refs iteratively, pre-calculate @refs and use a tight loop only for
>>>> storing the pinned subpages/vmas.
>>>>
>>>> Additionally, retain existing behaviour with using mem_map_offset()
>>>> when recording the subpages for configurations that don't have a
>>>> contiguous mem_map.
>>>>
>>>> Pinning consequently improves, bringing us close to
>>>> {pin,get}_user_pages_fast:
>>>>
>>>>   - 16G with 1G huge page size
>>>>   gup_test -f /mnt/huge/file -m 16384 -r 30 -L -S -n 512 -w
>>>>
>>>> PIN_LONGTERM_BENCHMARK: ~12.8k us -> ~5.8k us
>>>> PIN_FAST_BENCHMARK: ~3.7k us
>>>>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> ---
>>>>  mm/hugetlb.c | 49 ++++++++++++++++++++++++++++---------------------
>>>>  1 file changed, 28 insertions(+), 21 deletions(-)
>>>
>>> Thanks for updating this.
>>>
>>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>
>>> I think there still is an open general question about whether we can always
>>> assume page structs are contiguous for really big pages.  That is outside
>>
>> I do not think page structs need to be contiguous, but PFNs within a big page
>> need to be contiguous, at least based on existing code like mem_map_offset() we have.
>
> Thanks for looking Zi,
> Yes, PFNs need to be contiguous.  Also, as you say page structs do not need
> to be contiguous.  The issue is that there is code that assumes page structs
> are contiguous for gigantic pages.  hugetlb code does not make this assumption
> and does a pfn_to_page() when looping through page structs for gigantic pages.
>
> I do not believe this to be a huge issue.  In most cases CONFIG_SPARSEMEM_VMEMMAP
> is defined and struct pages can be accessed contiguously.  I 'think' we could
> run into problems with CONFIG_SPARSEMEM and without CONFIG_SPARSEMEM_VMEMMAP
> and doing hotplug operations.  However, I still need to look into it more.

Yeah, you are right about this. The combination of CONFIG_SPARSEMEM,
!CONFIG_SPARSEMEM_VMEMMAP and doing hotplug does cause errors, as simple as
dynamically reserving gigantic hugetlb pages then freeing them in a system
with CONFIG_SPARSEMEM_VMEMMAP not set and some hotplug memory.

Here are the steps to reproduce:
0. Configure a kernel with CONFIG_SPARSEMEM_VMEMMAP not set.
1. Create a VM using qemu with “-m size=8g,slots=16,maxmem=16g” to enable hotplug.
2. After booting the machine, add large enough memory using
   “object_add memory-backend-ram,id=mem1,size=7g” and
   “device_add pc-dimm,id=dimm1,memdev=mem1”.
3. In the guest OS, online all hot-plugged memory. My VM has 128MB memory block size.
If you have larger memory block size, I think you will need to plug in more memory.
4. Reserve gigantic hugetlb pages so that hot-plugged memory will be used. I reserved
12GB, like “echo 12 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages”.
5. Free all hugetlb gigantic pages,
“echo 0 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages”.
6. You will get “BUG: Bad page state in process …” errors.

The patch below can fix the error, but I suspect there might be other places
missing the necessary mem_map_offset()/mem_map_next() handling too.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4bdb58ab14cb..aae99c6984f3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1319,7 +1319,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
        h->nr_huge_pages--;
        h->nr_huge_pages_node[page_to_nid(page)]--;
        for (i = 0; i < pages_per_huge_page(h); i++) {
-               page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
+               struct page *subpage = mem_map_offset(page, i);
+               subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
                                1 << PG_referenced | 1 << PG_dirty |
                                1 << PG_active | 1 << PG_private |
                                1 << PG_writeback);
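
For reference, the iterator counterpart mem_map_next() reads roughly as
below in mm/internal.h at the time of this thread; it needs a valid
previous iterator (typically seeded with the head page), which is why
the one-shot mem_map_offset() is the simpler choice in the draft above:

/* mm/internal.h (roughly, at the time of this thread) */
static inline struct page *mem_map_next(struct page *iter,
                                        struct page *base, int offset)
{
        if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) {
                unsigned long pfn = page_to_pfn(base) + offset;

                if (!pfn_valid(pfn))
                        return NULL;
                return pfn_to_page(pfn);
        }
        return iter + 1;
}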


—
Best Regards,
Yan Zi


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] mm/hugetlb: refactor subpage recording
  2021-02-13 15:44         ` Zi Yan
@ 2021-02-13 21:04           ` Mike Kravetz
  0 siblings, 0 replies; 8+ messages in thread
From: Mike Kravetz @ 2021-02-13 21:04 UTC (permalink / raw)
  To: Zi Yan; +Cc: Joao Martins, linux-mm, linux-kernel, Andrew Morton

On 2/13/21 7:44 AM, Zi Yan wrote:
> On 11 Feb 2021, at 18:44, Mike Kravetz wrote:
> 
>> On 2/11/21 12:47 PM, Zi Yan wrote:
>>> On 28 Jan 2021, at 16:53, Mike Kravetz wrote:
>>>> On 1/28/21 10:26 AM, Joao Martins wrote:
>>>>> For a given hugepage backing a VA, there's a rather inefficient loop
>>>>> which is solely responsible for storing subpages in the GUP
>>>>> @pages/@vmas arrays. For each subpage we check whether it's within the
>>>>> range or size of @pages and keep incrementing @pfn_offset and a couple
>>>>> of other variables per subpage iteration.
>>>>>
>>>>> Simplify this logic and minimize the cost of each iteration to just
>>>>> storing the output page/vma. Instead of incrementing the number of
>>>>> @refs iteratively, pre-calculate @refs and use a tight loop only for
>>>>> storing the pinned subpages/vmas.
>>>>>
>>>>> Additionally, retain existing behaviour with using mem_map_offset()
>>>>> when recording the subpages for configurations that don't have a
>>>>> contiguous mem_map.
>>>>>
>>>>> Pinning consequently improves, bringing us close to
>>>>> {pin,get}_user_pages_fast:
>>>>>
>>>>>   - 16G with 1G huge page size
>>>>>   gup_test -f /mnt/huge/file -m 16384 -r 30 -L -S -n 512 -w
>>>>>
>>>>> PIN_LONGTERM_BENCHMARK: ~12.8k us -> ~5.8k us
>>>>> PIN_FAST_BENCHMARK: ~3.7k us
>>>>>
>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>> ---
>>>>>  mm/hugetlb.c | 49 ++++++++++++++++++++++++++++---------------------
>>>>>  1 file changed, 28 insertions(+), 21 deletions(-)
>>>>
>>>> Thanks for updating this.
>>>>
>>>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>
>>>> I think there still is an open general question about whether we can always
>>>> assume page structs are contiguous for really big pages.  That is outside
>>>
>>> I do not think page structs need to be contiguous, but PFNs within a big page
>>> need to be contiguous, at least based on existing code like mem_map_offset() we have.
>>
>> Thanks for looking Zi,
>> Yes, PFNs need to be contiguous.  Also, as you say page structs do not need
>> to be contiguous.  The issue is that there is code that assumes page structs
>> are contiguous for gigantic pages.  hugetlb code does not make this assumption
>> and does a pfn_to_page() when looping through page structs for gigantic pages.
>>
>> I do not believe this to be a huge issue.  In most cases CONFIG_SPARSEMEM_VMEMMAP
>> is defined and struct pages can be accessed contiguously.  I 'think' we could
>> run into problems with CONFIG_SPARSEMEM and without CONFIG_SPARSEMEM_VMEMMAP
>> and doing hotplug operations.  However, I still need to look into it more.
> 
> Yeah, you are right about this. The combination of CONFIG_SPARSEMEM,
> !CONFIG_SPARSEMEM_VMEMMAP and doing hotplug does cause errors, as simple as
> dynamically reserving gigantic hugetlb pages then freeing them in a system
> with CONFIG_SPARSEMEM_VMEMMAP not set and some hotplug memory.
> 
> Here are the steps to reproduce:
> 0. Configure a kernel with CONFIG_SPARSEMEM_VMEMMAP not set.
> 1. Create a VM using qemu with “-m size=8g,slots=16,maxmem=16g” to enable hotplug.
> 2. After booting the machine, add large enough memory using
>    “object_add memory-backend-ram,id=mem1,size=7g” and
>    “device_add pc-dimm,id=dimm1,memdev=mem1”.
> 3. In the guest OS, online all hot-plugged memory. My VM has 128MB memory block size.
> If you have larger memory block size, I think you will need to plug in more memory.
> 4. Reserve gigantic hugetlb pages so that hot-plugged memory will be used. I reserved
> 12GB, like “echo 12 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages”.
> 5. Free all hugetlb gigantic pages,
> “echo 0 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages”.
> 6. You will get “BUG: Bad page state in process …” errors.
> 
> The patch below can fix the error, but I suspect there might be other places
> missing the necessary mem_map_offset()/mem_map_next() handling too.
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4bdb58ab14cb..aae99c6984f3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1319,7 +1319,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>         h->nr_huge_pages--;
>         h->nr_huge_pages_node[page_to_nid(page)]--;
>         for (i = 0; i < pages_per_huge_page(h); i++) {
> -               page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
> +               struct page *subpage = mem_map_offset(page, i);
> +               subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
>                                 1 << PG_referenced | 1 << PG_dirty |
>                                 1 << PG_active | 1 << PG_private |
>                                 1 << PG_writeback);
> 
> 
> —
> Best Regards,
> Yan Zi

Thanks for confirming my suspicions Zi!

I thought hugetlb code always handled this situation, but was obviously
incorrect.

Perhaps the bigger issue is the GUP code, which has the same problem as
suspected by Joao when we were discussing the first version of this patch.
It is also going to traverse the list of page structs with page++.  I'll
point the people on that original thread to your findings here.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-02-13 21:04 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-28 18:26 [PATCH v2 0/2] mm/hugetlb: follow_hugetlb_page() improvements Joao Martins
2021-01-28 18:26 ` [PATCH v2 1/2] mm/hugetlb: grab head page refcount once for group of subpages Joao Martins
2021-01-28 18:26 ` [PATCH v2 2/2] mm/hugetlb: refactor subpage recording Joao Martins
2021-01-28 21:53   ` Mike Kravetz
2021-02-11 20:47     ` Zi Yan
2021-02-11 23:44       ` Mike Kravetz
2021-02-13 15:44         ` Zi Yan
2021-02-13 21:04           ` Mike Kravetz
