linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] Fixes for hugetlb mapcount at most 1 for shared PMDs
From: Mike Kravetz @ 2023-01-26 22:27 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, James Houghton, Peter Xu, David Hildenbrand,
	Michal Hocko, Yang Shi, Vishal Moola, Matthew Wilcox,
	Muchun Song, Andrew Morton, Mike Kravetz

The issue of mapcount in hugetlb pages referenced by shared PMDs was
discussed in [1].  The following two patches address user-visible
behavior caused by this issue.

The patches apply to mm-stable, as they are also targeted at stable
backports.

Ongoing folio conversions cause context conflicts in the second patch
when applied to mm-unstable/linux-next.  I can create separate patch(es)
for those trees if people agree with this approach.

[1] https://lore.kernel.org/linux-mm/Y9BF+OCdWnCSilEu@monkey/

Mike Kravetz (2):
  mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
  migrate: hugetlb: Check for hugetlb shared PMD in node migration

 fs/proc/task_mmu.c      | 10 ++++++++--
 include/linux/hugetlb.h | 12 ++++++++++++
 mm/mempolicy.c          |  3 ++-
 3 files changed, 22 insertions(+), 3 deletions(-)

-- 
2.39.1



* [PATCH 1/2] mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
From: Mike Kravetz @ 2023-01-26 22:27 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, James Houghton, Peter Xu, David Hildenbrand,
	Michal Hocko, Yang Shi, Vishal Moola, Matthew Wilcox,
	Muchun Song, Andrew Morton, Mike Kravetz, stable

A hugetlb page will have a mapcount of 1 if mapped by multiple processes
via a shared PMD.  This is because only the first process increases the
map count, and subsequent processes just add the shared PMD page to
their page table.

page_mapcount is being used to decide if a hugetlb page is shared or
private in /proc/PID/smaps.  Pages referenced via a shared PMD were
incorrectly being counted as private.

To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is
found, count the hugetlb page as shared.  A new helper to check for a
shared PMD is added.
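
For context, a minimal userspace sketch of the misaccounting, not part
of the patch; the file path, sizes and the fixed PUD-aligned hint are
assumptions (x86_64 with 2MB hugetlb pages and
CONFIG_ARCH_WANT_HUGE_PMD_SHARE), and error handling is omitted:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN	(1UL << 30)			/* one full PUD */
#define HINT	((void *)(0x40UL << 30))	/* PUD aligned hint */

int main(void)
{
	int fd = open("/dev/hugepages/repro", O_CREAT | O_RDWR, 0600);
	char *p;
	size_t i;

	ftruncate(fd, LEN);
	/* PMD sharing needs a MAP_SHARED, PUD_SIZE aligned and sized vma */
	p = mmap(HINT, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	for (i = 0; i < LEN; i += 2UL << 20)	/* fault in each 2MB page */
		p[i] = 1;

	if (fork() == 0) {
		/* the child's faults reuse the parent's PMD page via
		 * huge_pmd_share() without bumping per-page mapcounts */
		for (i = 0; i < LEN; i += 2UL << 20)
			(void)*(volatile char *)(p + i);
		pause();
	}
	sleep(1);
	/* without this patch, smaps of either process reports the 1GB
	 * as Private_Hugetlb; with it, as Shared_Hugetlb */
	printf("inspect /proc/%d/smaps\n", getpid());
	pause();
	return 0;
}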

Fixes: 25ee01a2fca0 ("mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/proc/task_mmu.c      | 10 ++++++++--
 include/linux/hugetlb.h | 12 ++++++++++++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..cb9539879402 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -749,8 +749,14 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 
 		if (mapcount >= 2)
 			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
-		else
-			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
+		else {
+			if (hugetlb_pmd_shared(pte))
+				mss->shared_hugetlb +=
+						huge_page_size(hstate_vma(vma));
+			else
+				mss->private_hugetlb +=
+						huge_page_size(hstate_vma(vma));
+		}
 	}
 	return 0;
 }
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e3aa336df900..8e65920e4363 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1225,6 +1225,18 @@ static inline __init void hugetlb_cma_reserve(int order)
 }
 #endif
 
+#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
+static inline bool hugetlb_pmd_shared(pte_t *pte)
+{
+	return page_count(virt_to_page(pte)) > 1;
+}
+#else
+static inline bool hugetlb_pmd_shared(pte_t *pte)
+{
+	return false;
+}
+#endif
+
 bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr);
 
 #ifndef __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
-- 
2.39.1



* [PATCH 2/2] migrate: hugetlb: Check for hugetlb shared PMD in node migration
From: Mike Kravetz @ 2023-01-26 22:27 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, James Houghton, Peter Xu, David Hildenbrand,
	Michal Hocko, Yang Shi, Vishal Moola, Matthew Wilcox,
	Muchun Song, Andrew Morton, Mike Kravetz, stable

migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required
to move pages shared with another process to a different node.
page_mapcount > 1 is being used to determine if a hugetlb page is shared.
However, a hugetlb page will have a mapcount of 1 if mapped by multiple
processes via a shared PMD.  As a result, hugetlb pages shared by multiple
processes and mapped with a shared PMD can be moved by a process without
CAP_SYS_NICE.

To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is
found, consider the page shared.
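
For illustration, a sketch of the user-visible effect, reusing the
layout of the reproducer sketched under patch 1 (the helper is
hypothetical; a two-node NUMA system is assumed, and mbind() comes
from <numaif.h>, linked with -lnuma):

#include <numaif.h>
#include <stdio.h>

/* called in the second process, which maps the same 1GB hugetlb file
 * via a shared PMD; p/len are the mapping from the patch 1 sketch */
static void try_move_to_node1(char *p, unsigned long len)
{
	unsigned long nodemask = 1UL << 1;	/* target node 1 */

	/* MPOL_MF_MOVE without CAP_SYS_NICE is supposed to migrate only
	 * pages not shared with another process; before this fix the
	 * shared hugetlb pages are migrated anyway because their
	 * page_mapcount() is 1 */
	if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
		  MPOL_MF_MOVE))
		perror("mbind");
}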

Fixes: e2d8cf405525 ("migrate: add hugepage migration code to migrate_pages()")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/mempolicy.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 85a34f1f3ab8..72142fbe7652 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -600,7 +600,8 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
 
 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
 	if (flags & (MPOL_MF_MOVE_ALL) ||
-	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
+	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1 &&
+	     !hugetlb_pmd_shared(pte))) {
 		if (isolate_hugetlb(page, qp->pagelist) &&
 			(flags & MPOL_MF_STRICT))
 			/*
-- 
2.39.1



* Re: [PATCH 0/2] Fixes for hugetlb mapcount at most 1 for shared PMDs
From: Andrew Morton @ 2023-01-26 22:47 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, James Houghton,
	Peter Xu, David Hildenbrand, Michal Hocko, Yang Shi,
	Vishal Moola, Matthew Wilcox, Muchun Song

On Thu, 26 Jan 2023 14:27:19 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> Ongoing folio conversions cause context conflicts in the second patch
> when applied to mm-unstable/linux-next.  I can create separate patch(es)
> if people agree with these.

I fixed things up.  queue_folios_hugetlb() is now

static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
			       unsigned long addr, unsigned long end,
			       struct mm_walk *walk)
{
	int ret = 0;
#ifdef CONFIG_HUGETLB_PAGE
	struct queue_pages *qp = walk->private;
	unsigned long flags = (qp->flags & MPOL_MF_VALID);
	struct folio *folio;
	spinlock_t *ptl;
	pte_t entry;

	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
	entry = huge_ptep_get(pte);
	if (!pte_present(entry))
		goto unlock;
	folio = pfn_folio(pte_pfn(entry));
	if (!queue_folio_required(folio, qp))
		goto unlock;

	if (flags == MPOL_MF_STRICT) {
		/*
		 * STRICT alone means only detecting misplaced folio and no
		 * need to further check other vma.
		 */
		ret = -EIO;
		goto unlock;
	}

	if (!vma_migratable(walk->vma)) {
		/*
		 * Must be STRICT with MOVE*, otherwise .test_walk() have
		 * stopped walking current vma.
		 * Detecting misplaced folio but allow migrating folios which
		 * have been queued.
		 */
		ret = 1;
		goto unlock;
	}

	/*
	 * With MPOL_MF_MOVE, we try to migrate only unshared folios. If it
	 * is shared it is likely not worth migrating.
	 *
	 * To check if the folio is shared, ideally we want to make sure
	 * every page is mapped to the same process. Doing that is very
	 * expensive, so check the estimated mapcount of the folio instead.
	 */
	if (flags & (MPOL_MF_MOVE_ALL) ||
	    (flags & MPOL_MF_MOVE && folio_estimated_mapcount(folio) == 1 &&
	     !hugetlb_pmd_shared(pte))) {
		if (isolate_hugetlb(folio, qp->pagelist) &&
			(flags & MPOL_MF_STRICT))
			/*
			 * Failed to isolate folio but allow migrating pages
			 * which have been queued.
			 */
			ret = 1;
	}
unlock:
	spin_unlock(ptl);
#else
	BUG();
#endif
	return ret;
}



* Re: [PATCH 0/2] Fixes for hugetlb mapcount at most 1 for shared PMDs
From: Peter Xu @ 2023-01-26 22:48 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, James Houghton,
	David Hildenbrand, Michal Hocko, Yang Shi, Vishal Moola,
	Matthew Wilcox, Muchun Song, Andrew Morton

On Thu, Jan 26, 2023 at 02:27:19PM -0800, Mike Kravetz wrote:
> This issue of mapcount in hugetlb pages referenced by shared PMDs was
> discussed in [1].  The following two patches address user visible
> behavior caused by this issue.
> 
> Patches apply to mm-stable as they can also target stable backports.
> 
> Ongoing folio conversions cause context conflicts in the second patch
> when applied to mm-unstable/linux-next.  I can create separate patch(es)
> if people agree with these.
> 
> [1] https://lore.kernel.org/linux-mm/Y9BF+OCdWnCSilEu@monkey/
> Mike Kravetz (2):
>   mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
>   migrate: hugetlb: Check for hugetlb shared PMD in node migration

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



* Re: [PATCH 1/2] mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
From: David Hildenbrand @ 2023-01-27 16:23 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Naoya Horiguchi, James Houghton, Peter Xu, Michal Hocko,
	Yang Shi, Vishal Moola, Matthew Wilcox, Muchun Song,
	Andrew Morton, stable

On 26.01.23 23:27, Mike Kravetz wrote:
> A hugetlb page will have a mapcount of 1 if mapped by multiple processes
> via a shared PMD.  This is because only the first process increases the
> map count, and subsequent processes just add the shared PMD page to
> their page table.
> 
> page_mapcount is being used to decide if a hugetlb page is shared or
> private in /proc/PID/smaps.  Pages referenced via a shared PMD were
> incorrectly being counted as private.
> 
> To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is
> found, count the hugetlb page as shared.  A new helper to check for a
> shared PMD is added.
> 
> Fixes: 25ee01a2fca0 ("mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps")
> Cc: stable@vger.kernel.org
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   fs/proc/task_mmu.c      | 10 ++++++++--
>   include/linux/hugetlb.h | 12 ++++++++++++
>   2 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index e35a0398db63..cb9539879402 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -749,8 +749,14 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
>   
>   		if (mapcount >= 2)
>   			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> -		else
> -			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> +		else {

Better:

if (mapcount >= 2 || hugetlb_pmd_shared(pte))
	mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
else
	mss->private_hugetlb += huge_page_size(hstate_vma(vma));



-- 
Thanks,

David / dhildenb



* Re: [PATCH 2/2] migrate: hugetlb: Check for hugetlb shared PMD in node migration
From: David Hildenbrand @ 2023-01-27 16:23 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Naoya Horiguchi, James Houghton, Peter Xu, Michal Hocko,
	Yang Shi, Vishal Moola, Matthew Wilcox, Muchun Song,
	Andrew Morton, stable

On 26.01.23 23:27, Mike Kravetz wrote:
> migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required
> to move pages shared with another process to a different node.
> page_mapcount > 1 is being used to determine if a hugetlb page is shared.
> However, a hugetlb page will have a mapcount of 1 if mapped by multiple
> processes via a shared PMD.  As a result, hugetlb pages shared by multiple
> processes and mapped with a shared PMD can be moved by a process without
> CAP_SYS_NICE.
> 
> To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is
> found, consider the page shared.
> 
> Fixes: e2d8cf405525 ("migrate: add hugepage migration code to migrate_pages()")
> Cc: stable@vger.kernel.org
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   mm/mempolicy.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 85a34f1f3ab8..72142fbe7652 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -600,7 +600,8 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
>   
>   	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
>   	if (flags & (MPOL_MF_MOVE_ALL) ||
> -	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> +	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1 &&
> +	     !hugetlb_pmd_shared(pte))) {
>   		if (isolate_hugetlb(page, qp->pagelist) &&
>   			(flags & MPOL_MF_STRICT))
>   			/*

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* Re: [PATCH 1/2] mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
From: Andrew Morton @ 2023-01-27 23:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Kravetz, linux-mm, linux-kernel, Naoya Horiguchi,
	James Houghton, Peter Xu, Michal Hocko, Yang Shi, Vishal Moola,
	Matthew Wilcox, Muchun Song, stable

On Fri, 27 Jan 2023 17:23:39 +0100 David Hildenbrand <david@redhat.com> wrote:

> On 26.01.23 23:27, Mike Kravetz wrote:
> > A hugetlb page will have a mapcount of 1 if mapped by multiple processes
> > via a shared PMD.  This is because only the first process increases the
> > map count, and subsequent processes just add the shared PMD page to
> > their page table.
> > 
> > page_mapcount is being used to decide if a hugetlb page is shared or
> > private in /proc/PID/smaps.  Pages referenced via a shared PMD were
> > incorrectly being counted as private.
> > 
> > To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is
> > found, count the hugetlb page as shared.  A new helper to check for a
> > shared PMD is added.
> > 
> ...
>
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -749,8 +749,14 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> >   
> >   		if (mapcount >= 2)
> >   			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > -		else
> > -			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > +		else {
> 
> Better:
> 
> if (mapcount >= 2 || hugetlb_pmd_shared(pte))
> 	mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> else
> 	mss->private_hugetlb += huge_page_size(hstate_vma(vma));

Yup.  And that local doesn't add any value?

--- a/fs/proc/task_mmu.c~mm-hugetlb-proc-check-for-hugetlb-shared-pmd-in-proc-pid-smaps-fix
+++ a/fs/proc/task_mmu.c
@@ -745,18 +745,10 @@ static int smaps_hugetlb_range(pte_t *pt
 			page = pfn_swap_entry_to_page(swpent);
 	}
 	if (page) {
-		int mapcount = page_mapcount(page);
-
-		if (mapcount >= 2)
+		if (page_mapcount(page) >= 2 || hugetlb_pmd_shared(pte))
 			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
-		else {
-			if (hugetlb_pmd_shared(pte))
-				mss->shared_hugetlb +=
-						huge_page_size(hstate_vma(vma));
-			else
-				mss->private_hugetlb +=
-						huge_page_size(hstate_vma(vma));
-		}
+		else
+			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
 	}
 	return 0;
 }
_



* Re: [PATCH 1/2] mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
From: Mike Kravetz @ 2023-01-28  1:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, linux-mm, linux-kernel, Naoya Horiguchi,
	James Houghton, Peter Xu, Michal Hocko, Yang Shi, Vishal Moola,
	Matthew Wilcox, Muchun Song, stable

On 01/27/23 15:04, Andrew Morton wrote:
> On Fri, 27 Jan 2023 17:23:39 +0100 David Hildenbrand <david@redhat.com> wrote:
> 
> > On 26.01.23 23:27, Mike Kravetz wrote:
> > > A hugetlb page will have a mapcount of 1 if mapped by multiple processes
> > > via a shared PMD.  This is because only the first process increases the
> > > map count, and subsequent processes just add the shared PMD page to
> > > their page table.
> > > 
> > > page_mapcount is being used to decide if a hugetlb page is shared or
> > > private in /proc/PID/smaps.  Pages referenced via a shared PMD were
> > > incorrectly being counted as private.
> > > 
> > > To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is
> > > found, count the hugetlb page as shared.  A new helper to check for a
> > > shared PMD is added.
> > > 
> > ...
> >
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -749,8 +749,14 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> > >   
> > >   		if (mapcount >= 2)
> > >   			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > -		else
> > > -			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > > +		else {
> > 
> > Better:
> > 
> > if (mapcount >= 2 || hugetlb_pmd_shared(pte))
> > 	mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > else
> > 	mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> 
> Yup.  And that local doesn't add any value?
> 
> --- a/fs/proc/task_mmu.c~mm-hugetlb-proc-check-for-hugetlb-shared-pmd-in-proc-pid-smaps-fix
> +++ a/fs/proc/task_mmu.c
> @@ -745,18 +745,10 @@ static int smaps_hugetlb_range(pte_t *pt
>  			page = pfn_swap_entry_to_page(swpent);
>  	}
>  	if (page) {
> -		int mapcount = page_mapcount(page);
> -
> -		if (mapcount >= 2)
> +		if (page_mapcount(page) >= 2 || hugetlb_pmd_shared(pte))
>  			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> -		else {
> -			if (hugetlb_pmd_shared(pte))
> -				mss->shared_hugetlb +=
> -						huge_page_size(hstate_vma(vma));
> -			else
> -				mss->private_hugetlb +=
> -						huge_page_size(hstate_vma(vma));
> -		}
> +		else
> +			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
>  	}
>  	return 0;
>  }

Thank you both!  That looks much better.

-- 
Mike Kravetz


* Re: [PATCH 1/2] mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
From: Michal Hocko @ 2023-01-30 12:36 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Andrew Morton, David Hildenbrand, linux-mm, linux-kernel,
	Naoya Horiguchi, James Houghton, Peter Xu, Yang Shi,
	Vishal Moola, Matthew Wilcox, Muchun Song, stable

On Fri 27-01-23 17:12:05, Mike Kravetz wrote:
> On 01/27/23 15:04, Andrew Morton wrote:
> > On Fri, 27 Jan 2023 17:23:39 +0100 David Hildenbrand <david@redhat.com> wrote:
> > 
> > > On 26.01.23 23:27, Mike Kravetz wrote:
> > > > A hugetlb page will have a mapcount of 1 if mapped by multiple processes
> > > > via a shared PMD.  This is because only the first process increases the
> > > > map count, and subsequent processes just add the shared PMD page to
> > > > their page table.
> > > > 
> > > > page_mapcount is being used to decide if a hugetlb page is shared or
> > > > private in /proc/PID/smaps.  Pages referenced via a shared PMD were
> > > > incorrectly being counted as private.
> > > > 
> > > > To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is
> > > > found, count the hugetlb page as shared.  A new helper to check for a
> > > > shared PMD is added.
> > > > 
> > > ...
> > >
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -749,8 +749,14 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> > > >   
> > > >   		if (mapcount >= 2)
> > > >   			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > > -		else
> > > > -			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > > > +		else {
> > > 
> > > Better:
> > > 
> > > if (mapcount >= 2 || hugetlb_pmd_shared(pte))
> > > 	mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > else
> > > 	mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > 
> > Yup.  And that local doesn't add any value?
> > 
> > --- a/fs/proc/task_mmu.c~mm-hugetlb-proc-check-for-hugetlb-shared-pmd-in-proc-pid-smaps-fix
> > +++ a/fs/proc/task_mmu.c
> > @@ -745,18 +745,10 @@ static int smaps_hugetlb_range(pte_t *pt
> >  			page = pfn_swap_entry_to_page(swpent);
> >  	}
> >  	if (page) {
> > -		int mapcount = page_mapcount(page);
> > -
> > -		if (mapcount >= 2)
> > +		if (page_mapcount(page) >= 2 || hugetlb_pmd_shared(pte))
> >  			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > -		else {
> > -			if (hugetlb_pmd_shared(pte))
> > -				mss->shared_hugetlb +=
> > -						huge_page_size(hstate_vma(vma));
> > -			else
> > -				mss->private_hugetlb +=
> > -						huge_page_size(hstate_vma(vma));
> > -		}
> > +		else
> > +			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> >  	}
> >  	return 0;
> >  }
> 
> Thank you both!  That looks much better.

Yes, this looks simple enough. My only concern would be that this
special casing might be required in other places, which is hard to
evaluate. I thought PSS reported by smaps would be broken as well but it
seems pss is not really accounted for hugetlb mappings at all.

Have you tried to look into {in,de}creasing the map count of the page when
the pte is {un}shared for it?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/2] mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
From: Mike Kravetz @ 2023-01-30 22:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Hildenbrand, linux-mm, linux-kernel,
	Naoya Horiguchi, James Houghton, Peter Xu, Yang Shi,
	Vishal Moola, Matthew Wilcox, Muchun Song, stable

On 01/30/23 13:36, Michal Hocko wrote:
> On Fri 27-01-23 17:12:05, Mike Kravetz wrote:
> > On 01/27/23 15:04, Andrew Morton wrote:
> > > On Fri, 27 Jan 2023 17:23:39 +0100 David Hildenbrand <david@redhat.com> wrote:
> > > > On 26.01.23 23:27, Mike Kravetz wrote:
> 
> Yes, this looks simple enough. My only concern would be that this
> special casing might be required in other places, which is hard to
> evaluate. I thought PSS reported by smaps would be broken as well but it
> seems pss is not really accounted for hugetlb mappings at all.
> 
> Have you tried to look into {in,de}creasing the map count of the page when
> the pte is {un}shared for it?

A quick thought is that it would not be too difficult.  It would need
to include the following:
- At PMD share time in huge_pmd_share(),
  Go through all entries in the PMD, and increment map and ref count for
  all referenced pages.  huge_pmd_share is just adding another sharing
  process.
- At PMD unshare time in huge_pmd_unshare(),
  Go through all entries in the PMD, and decrement map and ref count for
  all referenced pages.  huge_pmd_unshare is just removing one sharing
  process.
- At page fault time, check if we are adding a new entry to a shared PMD.
  If yes, add 'num_of_sharing_processes - 1' to the ref and map count.

In each of the above operations, we are holding the PTL lock (which is
really the split/PMD lock) so synchronization should not be an issue.

Although I mention processes sharing the PMD above, it is really mappings/vmas
sharing the PMD.  You could have two mappings of the same object in the same
process sharing PMDs.
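
For instance, a minimal sketch of the same-process case (hypothetical
path; assumes a preexisting hugetlbfs file of at least 1GB and that
both PUD aligned hints are honored):

#include <fcntl.h>
#include <sys/mman.h>

int main(void)
{
	int fd = open("/dev/hugepages/f", O_RDWR);
	char *a = mmap((void *)(0x40UL << 30), 1UL << 30,
		       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	char *b = mmap((void *)(0x80UL << 30), 1UL << 30,
		       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	a[0] = 1;			/* populate a PMD through vma 'a' */
	/* the fault through vma 'b' can share that PMD page even though
	 * both vmas belong to this one mm */
	(void)*(volatile char *)b;
	return 0;
}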

I'll code this up and see how it looks.

However, unless you have an objection I would prefer the simple patches
move forward, especially for stable backports.
-- 
Mike Kravetz


* Re: [PATCH 1/2] mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
From: Michal Hocko @ 2023-02-01  7:47 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Andrew Morton, David Hildenbrand, linux-mm, linux-kernel,
	Naoya Horiguchi, James Houghton, Peter Xu, Yang Shi,
	Vishal Moola, Matthew Wilcox, Muchun Song, stable

On Mon 30-01-23 14:08:47, Mike Kravetz wrote:
> On 01/30/23 13:36, Michal Hocko wrote:
> > On Fri 27-01-23 17:12:05, Mike Kravetz wrote:
> > > On 01/27/23 15:04, Andrew Morton wrote:
> > > > On Fri, 27 Jan 2023 17:23:39 +0100 David Hildenbrand <david@redhat.com> wrote:
> > > > > On 26.01.23 23:27, Mike Kravetz wrote:
> > 
> > Yes, this looks simple enough. My only concern would be that this
> > special casing might be required in other places, which is hard to
> > evaluate. I thought PSS reported by smaps would be broken as well but it
> > seems pss is not really accounted for hugetlb mappings at all.
> > 
> > Have you tried to look into {in,de}creasing the map count of the page when
> > the pte is {un}shared for it?
> 
> A quick thought is that it would not be too difficult.  It would need
> to include the following:
> - At PMD share time in huge_pmd_share(),
>   Go through all entries in the PMD, and increment map and ref count for
>   all referenced pages.  huge_pmd_share is just adding another sharing
>   process.
> - At PMD unshare time in huge_pmd_unshare(),
>   Go through all entries in the PMD, and decrement map and ref count for
>   all referenced pages.  huge_pmd_unshare is just removing one sharing
>   process.
> - At page fault time, check if we are adding a new entry to a shared PMD.
> >   If yes, add 'num_of_sharing_processes - 1' to the ref and map count.
> 
> In each of the above operations, we are holding the PTL lock (which is
> really the split/PMD lock) so synchronization should not be an issue.
> 
> Although I mention processes sharing the PMD above, it is really mappings/vmas
> sharing the PMD.  You could have two mappings of the same object in the same
> process sharing PMDs.
> 
> I'll code this up and see how it looks.

Thanks!
 
> However, unless you have an objection I would prefer the simple patches
> move forward, especially for stable backports.

Yes, the current patch is much simpler and more suitable for stable
backports. If the explicit map count modifications are not all that
terrible then this would sound like a more appropriate long term plan
though.

Thanks!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/2] mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
From: Mike Kravetz @ 2023-02-01 21:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Hildenbrand, linux-mm, linux-kernel,
	Naoya Horiguchi, James Houghton, Peter Xu, Yang Shi,
	Vishal Moola, Matthew Wilcox, Muchun Song, stable

On 02/01/23 08:47, Michal Hocko wrote:
> On Mon 30-01-23 14:08:47, Mike Kravetz wrote:
> > On 01/30/23 13:36, Michal Hocko wrote:
> > > On Fri 27-01-23 17:12:05, Mike Kravetz wrote:
> > > > On 01/27/23 15:04, Andrew Morton wrote:
> > > > > On Fri, 27 Jan 2023 17:23:39 +0100 David Hildenbrand <david@redhat.com> wrote:
> > > > > > On 26.01.23 23:27, Mike Kravetz wrote:
> > > 
> > > Yes, this looks simple enough. My only concern would be that this
> > > special casing might be required in other places, which is hard to
> > > evaluate. I thought PSS reported by smaps would be broken as well but it
> > > seems pss is not really accounted for hugetlb mappings at all.
> > > 
> > > Have you tried to look into {in,de}creasing the map count of the page when
> > > the pte is {un}shared for it?
> > 
> > A quick thought is that it would not be too difficult.  It would need
> > to include the following:
> > - At PMD share time in huge_pmd_share(),
> >   Go through all entries in the PMD, and increment map and ref count for
> >   all referenced pages.  huge_pmd_share is just adding another sharing
> >   process.
> > - At PMD unshare time in huge_pmd_unshare(),
> >   Go through all entries in the PMD, and decrement map and ref count for
> >   all referenced pages.  huge_pmd_unshare is just removing one sharing
> >   process.
> > - At page fault time, check if we are adding a new entry to a shared PMD.
> > >   If yes, add 'num_of_sharing_processes - 1' to the ref and map count.
> > 
> > In each of the above operations, we are holding the PTL lock (which is
> > really the split/PMD lock) so synchronization should not be an issue.
> > 
> > Although I mention processes sharing the PMD above, it is really mappings/vmas
> > sharing the PMD.  You could have two mappings of the same object in the same
> > process sharing PMDs.
> > 
> > I'll code this up and see how it looks.
> 
> Thanks!
>  
> > However, unless you have an objection I would prefer the simple patches
> > move forward, especially for stable backports.
> 
> Yes, the current patch is much simpler and more suitable for stable
> backports. If the explicit map count modifications are not all that
> terrible then this would sound like a more appropriate long term plan
> though.

The approach mentioned above seems to be simple enough.  Patch is below.

I 'tested' with the same method and tests used to measure fault scalability
when developing vma based locking [1].  I figured this would be a good stress
of the share, unshare and fault paths.  With the patch, we are doing more
with the page table lock held, so I expected to see a little difference
in scalability, but not as much as actually measured:

				next-20230131
test		instances	unmodified	patched
--------------------------------------------------------------------------
Combined faults 24		61888.4		58314.8
Combined forks  24		  157.3		  130.1

These tests could seem a bit like a micro-benchmark targeting these code
paths.  However, I put them together based on the description of a
customer workload that prompted the vma based locking work.  And, performance
of these tests seems to reflect performance of their workloads.

This extra overhead is the cost needed to make shared PMD map counts be
accurate and in line with what is normal and expected.  I think it is
worth the cost.  Other opinions?  Of course, the patch below may have
issues so please take a look.

[1] https://lore.kernel.org/linux-mm/20220914221810.95771-1-mike.kravetz@oracle.com/


From bff5a717521f96b0e5075ac4b5a1ef84a3589b7e Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Mon, 30 Jan 2023 20:14:14 -0800
Subject: [PATCH] hugetlb: Adjust hugetlb page ref/map counts for PMD sharing

When hugetlb PMDs are shared, the sharing code simply adds the shared
PMD to another process's page table.  It will not update the ref/map
counts of pages referenced by the shared PMD.  As a result, the ref/map
count will only reflect when the page was added to the shared PMD.  Even
though the shared PMD may be in MANY process page tables, ref/map counts
on the pages will only appear to be that of a single process.

Update ref/map counts to take PMD sharing into account.  This is done in
three distinct places:
1) At PMD share time in huge_pmd_share(),
   Go through all entries in the PMD, and increment map and ref count for
   all referenced pages.  huge_pmd_share is just adding another use and
   mapping of each page.
2) At PMD unshare time in huge_pmd_unshare(),
   Go through all entries in the PMD, and decrement map and ref count for
   all referenced pages.  huge_pmd_unshare is just removing one use and
   mapping of each page.
3) When faulting in a new hugetlb page,
   Check if we are adding a new entry to a shared PMD.  If yes, add
   'num_of_sharing_processes - 1' to the ref and map count.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 52 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3a01a9dbf445..c7b1c6307a82 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -96,6 +96,7 @@ static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
 static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
 static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end);
+static void adjust_page_counts_for_shared_pmd(pte_t *ptep, struct folio *folio);
 
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
@@ -5905,10 +5906,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	if (!pte_same(huge_ptep_get(ptep), old_pte))
 		goto backout;
 
-	if (anon_rmap)
+	if (anon_rmap) {
 		hugepage_add_new_anon_rmap(folio, vma, haddr);
-	else
+	} else {
 		page_dup_file_rmap(&folio->page, true);
+		adjust_page_counts_for_shared_pmd(ptep, folio);
+	}
 	new_pte = make_huge_pte(vma, &folio->page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
 	/*
@@ -7036,6 +7039,43 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 		*end = ALIGN(*end, PUD_SIZE);
 }
 
+static void adjust_page_counts_for_shared_pmd(pte_t *ptep, struct folio *folio)
+{
+	int shared_count = page_count(virt_to_page(ptep));
+
+	if (shared_count < 2)
+		return;
+
+	folio_ref_add(folio, shared_count - 1);
+	atomic_add(shared_count - 1, &folio->_entire_mapcount);
+}
+
+static void adjust_shared_pmd_page_counts(pmd_t *pmd_start, int delta)
+{
+	struct folio *folio;
+	struct page *page;
+	pte_t *ptep, pte;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		ptep = (pte_t *)(pmd_start + i);
+
+		pte = huge_ptep_get(ptep);
+		if (huge_pte_none(pte) || !pte_present(pte))
+			continue;
+
+		page = pte_page(pte);
+		folio = (struct folio *)page;
+		if (delta > 0) {
+			folio_get(folio);
+			atomic_inc(&folio->_entire_mapcount);
+		} else {
+			folio_put(folio);
+			atomic_dec(&folio->_entire_mapcount);
+		}
+	}
+}
+
 /*
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
@@ -7078,9 +7118,11 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ptl = huge_pte_lock(hstate_vma(vma), mm, spte);
 	if (pud_none(*pud)) {
-		pud_populate(mm, pud,
-				(pmd_t *)((unsigned long)spte & PAGE_MASK));
+		pmd_t *pmdp = (pmd_t *)((unsigned long)spte & PAGE_MASK);
+
+		pud_populate(mm, pud, pmdp);
 		mm_inc_nr_pmds(mm);
+		adjust_shared_pmd_page_counts(pmdp, 1);
 	} else {
 		put_page(virt_to_page(spte));
 	}
@@ -7118,12 +7160,18 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	pud_clear(pud);
 	put_page(virt_to_page(ptep));
+	adjust_shared_pmd_page_counts(
+				(pmd_t *)((unsigned long)ptep & PAGE_MASK), -1);
 	mm_dec_nr_pmds(mm);
 	return 1;
 }
 
 #else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
 
+static void adjust_page_counts_for_shared_pmd(pte_t *ptep, struct folio *folio)
+{
+}
+
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud)
 {
-- 
2.39.1



* Re: [PATCH 1/2] mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
From: Michal Hocko @ 2023-02-03 13:40 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Andrew Morton, David Hildenbrand, linux-mm, linux-kernel,
	Naoya Horiguchi, James Houghton, Peter Xu, Yang Shi,
	Vishal Moola, Matthew Wilcox, Muchun Song, stable

On Wed 01-02-23 13:05:35, Mike Kravetz wrote:
> On 02/01/23 08:47, Michal Hocko wrote:
> > On Mon 30-01-23 14:08:47, Mike Kravetz wrote:
> > > On 01/30/23 13:36, Michal Hocko wrote:
> > > > On Fri 27-01-23 17:12:05, Mike Kravetz wrote:
> > > > > On 01/27/23 15:04, Andrew Morton wrote:
> > > > > > On Fri, 27 Jan 2023 17:23:39 +0100 David Hildenbrand <david@redhat.com> wrote:
> > > > > > > On 26.01.23 23:27, Mike Kravetz wrote:
> > > > 
> > > > Yes, this looks simple enough. My only concern would be that this
> > > > special casing might be required in other places, which is hard to
> > > > evaluate. I thought PSS reported by smaps would be broken as well but it
> > > > seems pss is not really accounted for hugetlb mappings at all.
> > > > 
> > > > Have you tried to look into {in,de}creasing the map count of the page when
> > > > the pte is {un}shared for it?
> > > 
> > > A quick thought is that it would not be too difficult.  It would need
> > > to include the following:
> > > - At PMD share time in huge_pmd_share(),
> > >   Go through all entries in the PMD, and increment map and ref count for
> > >   all referenced pages.  huge_pmd_share is just adding another sharing
> > >   process.
> > > - At PMD unshare time in huge_pmd_unshare(),
> > >   Go through all entries in the PMD, and decrement map and ref count for
> > >   all referenced pages.  huge_pmd_unshare is just removing one sharing
> > >   process.
> > > - At page fault time, check if we are adding a new entry to a shared PMD.
> > > >   If yes, add 'num_of_sharing_processes - 1' to the ref and map count.
> > > 
> > > In each of the above operations, we are holding the PTL lock (which is
> > > really the split/PMD lock) so synchronization should not be an issue.
> > > 
> > > Although I mention processes sharing the PMD above, it is really mappings/vmas
> > > sharing the PMD.  You could have two mappings of the same object in the same
> > > process sharing PMDs.
> > > 
> > > I'll code this up and see how it looks.
> > 
> > Thanks!
> >  
> > > However, unless you have an objection I would prefer the simple patches
> > > move forward, especially for stable backports.
> > 
> > Yes, the current patch is much simpler and more suitable for stable
> > backports. If the explicit map count modifications are not all that
> > terrible then this would sound like a more appropriate long term plan
> > though.
> 
> The approach mentioned above seems to be simple enough.  Patch is below.
> 
> I 'tested' with the same method and tests used to measure fault scalability
> when developing vma based locking [1].  I figured this would be a good stress
> of the share, unshare and fault paths.  With the patch, we are doing more
> with the page table lock held, so I expected to see a little difference
> in scalability, but not as much as actually measured:
> 
> 				next-20230131
> test		instances	unmodified	patched
> --------------------------------------------------------------------------
> Combined faults 24		61888.4		58314.8
> Combined forks  24		  157.3		  130.1

So faults are 6% slower while forks are hit by 18%. This is quite a
lot and more than I expected. pmd sharing shouldn't really be a common
operation AFAICS. It should only happen with the first mapping in the
pud (so once every 1GB/2MB faults for fully populated mappings).

It would be good to know whether this is purely lock contention based
or the additional work in each #pf and unmapping makes a big impact as
well.

> These tests could seem a bit like a micro-benchmark targeting these code
> paths.  However, I put them together based on the description of a
> customer workload that prompted the vma based locking work.  And, performance
> of these tests seems to reflect performance of their workloads.
> 
> This extra overhead is the cost needed to make shared PMD map counts be
> accurate and in line with what is normal and expected.  I think it is
> worth the cost.  Other opinions?  Of course, the patch below may have
> issues so please take a look.

If 18% slowdown really reflects a real workload then this might just be
too expensive, I am afraid.

> [1] https://lore.kernel.org/linux-mm/20220914221810.95771-1-mike.kravetz@oracle.com/
> 
> 
> From bff5a717521f96b0e5075ac4b5a1ef84a3589b7e Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Mon, 30 Jan 2023 20:14:14 -0800
> Subject: [PATCH] hugetlb: Adjust hugetlb page ref/map counts for PMD sharing
> 
> When hugetlb PMDs are shared, the sharing code simply adds the shared
> PMD to another process's page table.  It will not update the ref/map
> counts of pages referenced by the shared PMD.  As a result, the ref/map
> count will only reflect when the page was added to the shared PMD.  Even
> though the shared PMD may be in MANY process page tables, ref/map counts
> on the pages will only appear to be that of a single process.
> 
> Update ref/map counts to take PMD sharing into account.  This is done in
> three distinct places:
> 1) At PMD share time in huge_pmd_share(),
>    Go through all entries in the PMD, and increment map and ref count for
>    all referenced pages.  huge_pmd_share is just adding another use and
>    mapping of each page.
> 2) At PMD unshare time in huge_pmd_unshare(),
>    Go through all entries in the PMD, and decrement map and ref count for
>    all referenced pages.  huge_pmd_unshare is just removing one use and
>    mapping of each page.
> 3) When faulting in a new hugetlb page,
>    Check if we are adding a new entry to a shared PMD.  If yes, add
> >    'num_of_sharing_processes - 1' to the ref and map count.

Honestly, I didn't really have much time to think about this very deeply
so I might be missing something here. The patch seems correct to me.
adjust_shared_pmd_page_counts's delta parameter is confusing because it
implies delta adjustments while it just wants to be "bool increase"
instead.

Thanks for looking into this Mike!
[...]
> +static void adjust_shared_pmd_page_counts(pmd_t *pmd_start, int delta)
> +{
> +	struct folio *folio;
> +	struct page *page;
> +	pte_t *ptep, pte;
> +	int i;
> +
> > +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		ptep = (pte_t *)(pmd_start + i);
> +
> +		pte = huge_ptep_get(ptep);
> +		if (huge_pte_none(pte) || !pte_present(pte))
> +			continue;
> +
> +		page = pte_page(pte);
> +		folio = (struct folio *)page;
> +		if (delta > 0) {
> +			folio_get(folio);
> +			atomic_inc(&folio->_entire_mapcount);
> +		} else {
> +			folio_put(folio);
> +			atomic_dec(&folio->_entire_mapcount);
> +		}
> +	}
> +}
[...]

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/2] mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
From: Mike Kravetz @ 2023-02-03 20:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Hildenbrand, linux-mm, linux-kernel,
	Naoya Horiguchi, James Houghton, Peter Xu, Yang Shi,
	Vishal Moola, Matthew Wilcox, Muchun Song, stable

On 02/03/23 14:40, Michal Hocko wrote:
> On Wed 01-02-23 13:05:35, Mike Kravetz wrote:
> > On 02/01/23 08:47, Michal Hocko wrote:
> > > >   process.
> > > > - At page fault time, check if we are adding a new entry to a shared PMD.
> > > >   If yes, add 'num_of_sharing_processes - 1' to the ref and map count.
> > > > 
> > > > In each of the above operations, we are holding the PTL lock (which is
> > > > really the split/PMD lock) so synchronization should not be an issue.
> > > > 
> > > > Although I mention processes sharing the PMD above, it is really mappings/vmas
> > > > sharing the PMD.  You could have two mappings of the same object in the same
> > > > process sharing PMDs.
> > > > 
> > > > I'll code this up and see how it looks.
> > > 
> > > Thanks!
> > >  
> > > > However, unless you have an objection I would prefer the simple patches
> > > > move forward, especially for stable backports.
> > > 
> > > Yes, the current patch is much simpler and more suitable for stable
> > > backports. If the explicit map count modifications are not all that
> > > terrible then this would sound like a more appropriate long term plan
> > > though.
> > 
> > The approach mentioned above seems to be simple enough.  Patch is below.
> > 
> > I 'tested' with the same method and tests used to measure fault scalability
> > when developing vma based locking [1].  I figured this would be a good stress
> > of the share, unshare and fault paths.  With the patch, we are doing more
> > with the page table lock held, so I expected to see a little difference
> > in scalability, but not as much as actually measured:
> > 
> > 				next-20230131
> > test		instances	unmodified	patched
> > --------------------------------------------------------------------------
> > Combined faults 24		61888.4		58314.8
> > Combined forks  24		  157.3		  130.1
> 
> So faults are 6% slower while forks are hit by 18%. This is quite a
> lot and more than I expected. pmd sharing shouldn't really be a common
> operation AFAICS. It should only happen with the first mapping in the
> pud (so once every 1GB/2MB faults for fully populated mappings).

Just want to be perfectly clear on what is being measured in these tests:
- faults program does the following in a loop
  . mmap 250GB hugetlb file PUD_SIZE aligned
  . read fault each hugetlb page (so all on shared PMDs)
  . unmap file
  measurement is how many times this map/read/unmap loop is completed
- fork program does the following
  . mmap 250GB hugetlb file PUD_SIZE aligned
  . read fault 3 pages on different PUDs
  . fork
    . child write faults 3 pages on different PUDs
    . child exits
  measurement is how many children can be created sequentially
For the results above, 24 instances of the fault program are being run
in parallel with 24 instances of the fork program.
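
For reference, a sketch of one "faults" instance, scaled down and with
a hypothetical path (the real runs used a 250GB file; PMD sharing
across the 24 instances assumes they all map the same file at a
PUD_SIZE aligned address):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN	(1UL << 30)			/* scaled down from 250GB */
#define HINT	((void *)(0x40UL << 30))	/* PUD_SIZE aligned hint */

int main(void)
{
	int fd = open("/dev/hugepages/bench", O_CREAT | O_RDWR, 0600);
	unsigned long loops = 0;

	ftruncate(fd, LEN);
	for (;;) {
		char *p = mmap(HINT, LEN, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		size_t i;

		for (i = 0; i < LEN; i += 2UL << 20)	/* read fault */
			(void)*(volatile char *)(p + i);
		munmap(p, LEN);		/* exercises huge_pmd_unshare() */
		loops++;		/* sampled by the harness (not shown) */
	}
}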

> 
> It would be good to know whether this is purely lock contention based
> or the additional work in each #pf and unmapping makes a big impact as
> well.

I did not do a deep dive into the exact cause of the slowdown.  Do note
how much we are executing the PMD sharing paths in these tests.  I did
not really plan it that way, but was trying to simulate what might
be happening in a customer environment.

> > These tests could seem a bit like a micro-benchmark targeting these code
> > paths.  However, I put them together based on the description of a
> > customer workload that prompted the vma based locking work.  And, performance
> > of these tests seems to reflect performance of their workloads.
> > 
> > This extra overhead is the cost needed to make shared PMD map counts be
> > accurate and in line with what is normal and expected.  I think it is
> > worth the cost.  Other opinions?  Of course, the patch below may have
> > issues so please take a look.
> 
> If 18% slowdown really reflects a real workload then this might just be
> too expensive, I am afraid.

On second thought, I tend to agree.  The fixes already done cover all known
exposures from inaccurate counts due to PMD sharing.  If we want to move
forward with the approach here, I would like to:
- Do more analysis in order to explain exactly why this is happening.
- Try to run the proposed patch in a more accurate customer environment
  simulation to determine if slowdown is actually visible.  I do not
  have access to such an environment, so will require cooperation from
  external vendor/customer.

Unless someone thinks we should move forward, I will not push the code
for this approach now.  It will also be interesting to see if this is
impacted at all by the outcome of discussions to perhaps redesign
mapcount.
-- 
Mike Kravetz


* Re: [PATCH 1/2] mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
From: Michal Hocko @ 2023-02-13 18:01 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Andrew Morton, David Hildenbrand, linux-mm, linux-kernel,
	Naoya Horiguchi, James Houghton, Peter Xu, Yang Shi,
	Vishal Moola, Matthew Wilcox, Muchun Song, stable

On Fri 03-02-23 12:16:04, Mike Kravetz wrote:
[...]
> Unless someone thinks we should move forward, I will not push the code
> for this approach now.  It will also be interesting to see if this is
> impacted at all by the outcome of discussions to perhaps redesign
> mapcount.

Yes, I do agree. We might want to extend page_mapcount documentation a
bit though. The comment is explicit about order-0 pages, but a note
about hugetlb and pmd sharing wouldn't hurt. WDYT?
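
One possible shape for such a note, purely as a sketch (wording and
placement are assumptions, not an agreed change):

/*
 * Note: a hugetlb page mapped via a shared PMD (huge_pmd_share()) is
 * only counted for the mapping that originally populated the PTE;
 * other processes referencing the page through the shared PMD page do
 * not contribute.  A mapcount of 1 therefore does not imply the page
 * is mapped by a single process.  See hugetlb_pmd_shared().
 */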
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 1/2] mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
From: Mike Kravetz @ 2023-02-14  1:40 UTC (permalink / raw)
  To: Michal Hocko, Matthew Wilcox
  Cc: Andrew Morton, David Hildenbrand, linux-mm, linux-kernel,
	Naoya Horiguchi, James Houghton, Peter Xu, Yang Shi,
	Vishal Moola, Muchun Song, stable

On 02/13/23 19:01, Michal Hocko wrote:
> On Fri 03-02-23 12:16:04, Mike Kravetz wrote:
> [...]
> > Unless someone thinks we should move forward, I will not push the code
> > for this approach now.  It will also be interesting to see if this is
> > impacted at all by the outcome of discussions to perhaps redesign
> > mapcount.
> 
> Yes, I do agree. We might want to extend page_mapcount documentation a
> bit though. The comment is explicit about order-0 pages, but a note
> about hugetlb and pmd sharing wouldn't hurt. WDYT?

Looks like that comment about 'Mapcount of 0-order page' has been removed in
the latest version of page_mapcount().  It would not surprise me if the calls
to page_mapcount after which we check for shared PMDs will soon be replaced
with calls to folio_mapcount().

Perhaps Matthew has an opinion as to where map counts for hugetlb shared
PMDs might be mentioned.
-- 
Mike Kravetz
