* A mapcount riddle
@ 2023-01-24 20:56 Mike Kravetz
  2023-01-24 23:00 ` Peter Xu
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Mike Kravetz @ 2023-01-24 20:56 UTC (permalink / raw)
  To: linux-mm
  Cc: Naoya Horiguchi, David Rientjes, Michal Hocko, Matthew Wilcox,
	David Hildenbrand, Peter Xu, James Houghton, Muchun Song

Q How can a page be mapped into multiple processes and have a
  mapcount of 1?

A It is a hugetlb page referenced by a shared PMD.

I was looking to expose some basic information about PMD sharing via
/proc/smaps.  After adding the code, I started a couple processes
sharing a large hugetlb mapping that would result in the use of
shared PMDs.  When I looked at the output of /proc/smaps, I saw
my new metric counting the number of shared PMDs.  However, what
stood out was that the entire mapping was listed as Private_Hugetlb.
WTH???  It certainly was shared!  The routine smaps_hugetlb_range
decides between Private_Hugetlb and Shared_Hugetlb with this code:

	if (page) {
		int mapcount = page_mapcount(page);

		if (mapcount >= 2)
			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
		else
			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
	}

After spending some time looking for issues in the page_mapcount code,
I came to the realization that the mapcount of hugetlb pages only
referenced by a shared PMD would be 1 no matter how many processes had
mapped the page.  When a page is first faulted, the mapcount is set to 1.
When faulted in other processes, the shared PMD is added to the page
table of the other processes.  No increase of mapcount will occur.

At first thought this seems bad.  However, I believe this has been the
behavior since hugetlb PMD sharing was introduced in 2006 and I am
unaware of any reported issues.  I did an audit of code looking at
mapcount.  In addition to the above issue with smaps, there appears
to be an issue with 'migrate_pages' where shared pages could be migrated
without appropriate privilege.

	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
	if (flags & (MPOL_MF_MOVE_ALL) ||
	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
		if (isolate_hugetlb(page, qp->pagelist) &&
			(flags & MPOL_MF_STRICT))
			/*
			 * Failed to isolate page but allow migrating pages
			 * which have been queued.
			 */
			ret = 1;
	}

I will prepare fixes for both of these.  However, I wanted to ask if
anyone has ideas about other potential issues with this?

Since COW is mostly relevant to private mappings, shared PMDs generally
do not apply.  Nothing stood out in a quick audit of code.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: A mapcount riddle
  2023-01-24 20:56 A mapcount riddle Mike Kravetz
@ 2023-01-24 23:00 ` Peter Xu
  2023-01-24 23:29   ` Yang Shi
  2023-01-24 23:35   ` Mike Kravetz
  2023-01-25  8:24 ` Michal Hocko
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 22+ messages in thread
From: Peter Xu @ 2023-01-24 23:00 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, Naoya Horiguchi, David Rientjes, Michal Hocko,
	Matthew Wilcox, David Hildenbrand, James Houghton, Muchun Song

On Tue, Jan 24, 2023 at 12:56:24PM -0800, Mike Kravetz wrote:
> Q How can a page be mapped into multiple processes and have a
>   mapcount of 1?
> 
> A It is a hugetlb page referenced by a shared PMD.
> 
> I was looking to expose some basic information about PMD sharing via
> /proc/smaps.  After adding the code, I started a couple processes
> sharing a large hugetlb mapping that would result in the use of
> shared PMDs.  When I looked at the output of /proc/smaps, I saw
> my new metric counting the number of shared PMDs.  However, what
> stood out was that the entire mapping was listed as Private_Hugetlb.
> WTH???  It certainly was shared!  The routine smaps_hugetlb_range
> decides between Private_Hugetlb and Shared_Hugetlb with this code:
> 
> 	if (page) {
> 		int mapcount = page_mapcount(page);
> 
> 		if (mapcount >= 2)
> 			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> 		else
> 			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> 	}

This is definitely unfortunate..

> 
> After spending some time looking for issues in the page_mapcount code,
> I came to the realization that the mapcount of hugetlb pages only
> referenced by a shared PMD would be 1 no matter how many processes had
> mapped the page.  When a page is first faulted, the mapcount is set to 1.
> When faulted in other processes, the shared PMD is added to the page
> table of the other processes.  No increase of mapcount will occur.
> 
> At first thought this seems bad.  However, I believe this has been the
> behavior since hugetlb PMD sharing was introduced in 2006 and I am
> unaware of any reported issues.  I did a audit of code looking at
> mapcount.  In addition to the above issue with smaps, there appears
> to be an issue with 'migrate_pages' where shared pages could be migrated
> without appropriate privilege.
> 
> 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> 	if (flags & (MPOL_MF_MOVE_ALL) ||
> 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> 		if (isolate_hugetlb(page, qp->pagelist) &&
> 			(flags & MPOL_MF_STRICT))
> 			/*
> 			 * Failed to isolate page but allow migrating pages
> 			 * which have been queued.
> 			 */
> 			ret = 1;
> 	}
> 
> I will prepare fixes for both of these.  However, I wanted to ask if
> anyone has ideas about other potential issues with this?

This reminded me to wonder whether things should be checked before this
happens.  E.g. when trying to share a pmd, does it make sense to check
the vma mempolicy before doing so?

Then the question is: if pmd sharing only happens between vmas that share
the same memory policy, would the above mapcount==1 check be acceptable
even if the page is shared by multiple processes?

Besides, I'm also curious about the planned fixes for the two issues
mentioned.

Thanks,

> 
> Since COW is mostly relevant to private mappings, shared PMDs generally
> do not apply.  Nothing stood out in a quick audit of code.
> -- 
> Mike Kravetz
> 

-- 
Peter Xu




* Re: A mapcount riddle
  2023-01-24 23:00 ` Peter Xu
@ 2023-01-24 23:29   ` Yang Shi
  2023-01-25 16:02     ` Peter Xu
  2023-01-24 23:35   ` Mike Kravetz
  1 sibling, 1 reply; 22+ messages in thread
From: Yang Shi @ 2023-01-24 23:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, linux-mm, Naoya Horiguchi, David Rientjes,
	Michal Hocko, Matthew Wilcox, David Hildenbrand, James Houghton,
	Muchun Song

On Tue, Jan 24, 2023 at 3:00 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Jan 24, 2023 at 12:56:24PM -0800, Mike Kravetz wrote:
> > Q How can a page be mapped into multiple processes and have a
> >   mapcount of 1?
> >
> > A It is a hugetlb page referenced by a shared PMD.
> >
> > I was looking to expose some basic information about PMD sharing via
> > /proc/smaps.  After adding the code, I started a couple processes
> > sharing a large hugetlb mapping that would result in the use of
> > shared PMDs.  When I looked at the output of /proc/smaps, I saw
> > my new metric counting the number of shared PMDs.  However, what
> > stood out was that the entire mapping was listed as Private_Hugetlb.
> > WTH???  It certainly was shared!  The routine smaps_hugetlb_range
> > decides between Private_Hugetlb and Shared_Hugetlb with this code:
> >
> >       if (page) {
> >               int mapcount = page_mapcount(page);
> >
> >               if (mapcount >= 2)
> >                       mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> >               else
> >                       mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> >       }
>
> This is definitely unfortunate..
>
> >
> > After spending some time looking for issues in the page_mapcount code,
> > I came to the realization that the mapcount of hugetlb pages only
> > referenced by a shared PMD would be 1 no matter how many processes had
> > mapped the page.  When a page is first faulted, the mapcount is set to 1.
> > When faulted in other processes, the shared PMD is added to the page
> > table of the other processes.  No increase of mapcount will occur.
> >
> > At first thought this seems bad.  However, I believe this has been the
> > behavior since hugetlb PMD sharing was introduced in 2006 and I am
> > unaware of any reported issues.  I did a audit of code looking at
> > mapcount.  In addition to the above issue with smaps, there appears
> > to be an issue with 'migrate_pages' where shared pages could be migrated
> > without appropriate privilege.
> >
> >       /* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> >       if (flags & (MPOL_MF_MOVE_ALL) ||
> >           (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> >               if (isolate_hugetlb(page, qp->pagelist) &&
> >                       (flags & MPOL_MF_STRICT))
> >                       /*
> >                        * Failed to isolate page but allow migrating pages
> >                        * which have been queued.
> >                        */
> >                       ret = 1;
> >       }
> >
> > I will prepare fixes for both of these.  However, I wanted to ask if
> > anyone has ideas about other potential issues with this?
>
> This reminded me whether things should be checked already before this
> happens.  E.g. when trying to share pmd, whether it makes sense to check
> vma mempolicy before doing so?
>
> Then the question is if pmd sharing only happens with the vma that shares
> the same memory policy, whether above mapcount==1 check would be acceptable
> even if it's shared by multiple processes.

I don't think so.  One process might change its policy, for example
bind to another node, which would then result in migrating the hugepage
due to the incorrect mapcount.  The above example code pasted by Mike
actually comes from mbind if I remember correctly.

I'm wondering whether we could use the refcount instead of the mapcount
to determine if a hugetlb page is shared or not, assuming refcounting
for hugetlb pages behaves similarly to base pages (inc when mapped by a
new process or pinned).  If it is pinned (for example, via GUP) we can't
migrate it either.

>
> Besides, I'm also curious on the planned fix too regarding the two issues
> mentioned.
>
> Thanks,
>
> >
> > Since COW is mostly relevant to private mappings, shared PMDs generally
> > do not apply.  Nothing stood out in a quick audit of code.
> > --
> > Mike Kravetz
> >
>
> --
> Peter Xu
>
>



* Re: A mapcount riddle
  2023-01-24 23:00 ` Peter Xu
  2023-01-24 23:29   ` Yang Shi
@ 2023-01-24 23:35   ` Mike Kravetz
  2023-01-25 16:46     ` Peter Xu
  1 sibling, 1 reply; 22+ messages in thread
From: Mike Kravetz @ 2023-01-24 23:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, Naoya Horiguchi, David Rientjes, Michal Hocko,
	Matthew Wilcox, David Hildenbrand, James Houghton, Muchun Song

On 01/24/23 18:00, Peter Xu wrote:
> On Tue, Jan 24, 2023 at 12:56:24PM -0800, Mike Kravetz wrote:
> > At first thought this seems bad.  However, I believe this has been the
> > behavior since hugetlb PMD sharing was introduced in 2006 and I am
> > unaware of any reported issues.  I did a audit of code looking at
> > mapcount.  In addition to the above issue with smaps, there appears
> > to be an issue with 'migrate_pages' where shared pages could be migrated
> > without appropriate privilege.
> > 
> > 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> > 	if (flags & (MPOL_MF_MOVE_ALL) ||
> > 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> > 		if (isolate_hugetlb(page, qp->pagelist) &&
> > 			(flags & MPOL_MF_STRICT))
> > 			/*
> > 			 * Failed to isolate page but allow migrating pages
> > 			 * which have been queued.
> > 			 */
> > 			ret = 1;
> > 	}
> > 
> > I will prepare fixes for both of these.  However, I wanted to ask if
> > anyone has ideas about other potential issues with this?
> 
> This reminded me whether things should be checked already before this
> happens.  E.g. when trying to share pmd, whether it makes sense to check
> vma mempolicy before doing so?

Not sure I understand your question.  Are you questioning whether we should
enter into pmd sharing if mempolicy allows movement to another node?
Wouldn't this be the 'normal' case on a multi-node system?

> Then the question is if pmd sharing only happens with the vma that shares
> the same memory policy, whether above mapcount==1 check would be acceptable
> even if it's shared by multiple processes.

I am not a mempolicy expert, but that would still involve moving pages
mapped by another process.  For that CAP_SYS_NICE is required.  So, my
opinion would be that it is not allowed even if mempolicy is the same.

> Besides, I'm also curious on the planned fix too regarding the two issues
> mentioned.

My planned 'fix' is to simply check for a shared PMD
(page_count(virt_to_page(pte))) to determine if a page with mapcount == 1
is shared.
-- 
Mike Kravetz



* Re: A mapcount riddle
  2023-01-24 20:56 A mapcount riddle Mike Kravetz
  2023-01-24 23:00 ` Peter Xu
@ 2023-01-25  8:24 ` Michal Hocko
  2023-01-25 17:59   ` Mike Kravetz
  2023-01-25  9:09 ` David Hildenbrand
  2023-01-25 15:26 ` James Houghton
  3 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2023-01-25  8:24 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, Naoya Horiguchi, David Rientjes, Matthew Wilcox,
	David Hildenbrand, Peter Xu, James Houghton, Muchun Song

On Tue 24-01-23 12:56:24, Mike Kravetz wrote:
> Q How can a page be mapped into multiple processes and have a
>   mapcount of 1?
> 
> A It is a hugetlb page referenced by a shared PMD.
> 
> I was looking to expose some basic information about PMD sharing via
> /proc/smaps.  After adding the code, I started a couple processes
> sharing a large hugetlb mapping that would result in the use of
> shared PMDs.  When I looked at the output of /proc/smaps, I saw
> my new metric counting the number of shared PMDs.  However, what
> stood out was that the entire mapping was listed as Private_Hugetlb.
> WTH???  It certainly was shared!

It's been quite some time since I had to look into this area but pmd
shared hugetlb pages have always been quite weird AFAIR.

> The routine smaps_hugetlb_range
> decides between Private_Hugetlb and Shared_Hugetlb with this code:
> 
> 	if (page) {
> 		int mapcount = page_mapcount(page);
> 
> 		if (mapcount >= 2)
> 			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> 		else
> 			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> 	}
> 
> After spending some time looking for issues in the page_mapcount code,
> I came to the realization that the mapcount of hugetlb pages only
> referenced by a shared PMD would be 1 no matter how many processes had
> mapped the page.  When a page is first faulted, the mapcount is set to 1.
> When faulted in other processes, the shared PMD is added to the page
> table of the other processes.  No increase of mapcount will occur.

Yes, really subtle, but looking at it from the hugetlb POV, it is the page
table that is shared rather than the underlying page.  Is this
distinction useful/reasonable to userspace?  Not really, but pmd
sharing is quite hard to stumble over by accident and I suspect most
users who use this feature just got used to those specialities.

> At first thought this seems bad.  However, I believe this has been the
> behavior since hugetlb PMD sharing was introduced in 2006 and I am
> unaware of any reported issues.  I did a audit of code looking at
> mapcount.  In addition to the above issue with smaps, there appears
> to be an issue with 'migrate_pages' where shared pages could be migrated
> without appropriate privilege.
> 
> 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> 	if (flags & (MPOL_MF_MOVE_ALL) ||
> 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> 		if (isolate_hugetlb(page, qp->pagelist) &&
> 			(flags & MPOL_MF_STRICT))
> 			/*
> 			 * Failed to isolate page but allow migrating pages
> 			 * which have been queued.
> 			 */
> 			ret = 1;
> 	}

Could you elaborate on what is problematic about that?  The whole pmd
sharing is a cooperative thing.  So if one of the processes decides to
migrate the page, why should that be a problem for the others sharing
that page via the page table?  Am I missing something obvious?

> I will prepare fixes for both of these.  However, I wanted to ask if
> anyone has ideas about other potential issues with this?
> 
> Since COW is mostly relevant to private mappings, shared PMDs generally
> do not apply.  Nothing stood out in a quick audit of code.

I am pretty sure there are other corner cases lurking in this area which
are really hard to look through until you stumble over them. The shared
mapping reporting is probably good to have fixed but I am not sure why
the migration is a real problem.

Thanks!
-- 
Michal Hocko
SUSE Labs



* Re: A mapcount riddle
  2023-01-24 20:56 A mapcount riddle Mike Kravetz
  2023-01-24 23:00 ` Peter Xu
  2023-01-25  8:24 ` Michal Hocko
@ 2023-01-25  9:09 ` David Hildenbrand
  2023-01-25 15:26 ` James Houghton
  3 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2023-01-25  9:09 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm
  Cc: Naoya Horiguchi, David Rientjes, Michal Hocko, Matthew Wilcox,
	Peter Xu, James Houghton, Muchun Song

On 24.01.23 21:56, Mike Kravetz wrote:
> Q How can a page be mapped into multiple processes and have a
>    mapcount of 1?
> 
> A It is a hugetlb page referenced by a shared PMD.
> 
> I was looking to expose some basic information about PMD sharing via
> /proc/smaps.  After adding the code, I started a couple processes
> sharing a large hugetlb mapping that would result in the use of
> shared PMDs.  When I looked at the output of /proc/smaps, I saw
> my new metric counting the number of shared PMDs.  However, what
> stood out was that the entire mapping was listed as Private_Hugetlb.
> WTH???  It certainly was shared!  The routine smaps_hugetlb_range
> decides between Private_Hugetlb and Shared_Hugetlb with this code:
> 
> 	if (page) {
> 		int mapcount = page_mapcount(page);
> 
> 		if (mapcount >= 2)
> 			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> 		else
> 			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> 	}
> 
> After spending some time looking for issues in the page_mapcount code,
> I came to the realization that the mapcount of hugetlb pages only
> referenced by a shared PMD would be 1 no matter how many processes had
> mapped the page.  When a page is first faulted, the mapcount is set to 1.
> When faulted in other processes, the shared PMD is added to the page
> table of the other processes.  No increase of mapcount will occur.
> 
> At first thought this seems bad.  However, I believe this has been the
> behavior since hugetlb PMD sharing was introduced in 2006 and I am
> unaware of any reported issues.  I did a audit of code looking at
> mapcount.  In addition to the above issue with smaps, there appears
> to be an issue with 'migrate_pages' where shared pages could be migrated
> without appropriate privilege.
> 
> 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> 	if (flags & (MPOL_MF_MOVE_ALL) ||
> 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> 		if (isolate_hugetlb(page, qp->pagelist) &&
> 			(flags & MPOL_MF_STRICT))
> 			/*
> 			 * Failed to isolate page but allow migrating pages
> 			 * which have been queued.
> 			 */
> 			ret = 1;
> 	}
> 
> I will prepare fixes for both of these.  However, I wanted to ask if
> anyone has ideas about other potential issues with this?
> 
> Since COW is mostly relevant to private mappings, shared PMDs generally
> do not apply.  Nothing stood out in a quick audit of code.

Yes, we shouldn't have to worry about anon pages in shared PMDs.

The observed mapcount weirdness is one of the reasons why I suggested
that PTE-table sharing (a new RFC was posted some time ago, but there was
no time to look into it) treat sharing of the page table only as a
mechanism to deduplicate page-table memory -- and not change the
semantics of pages mapped in there.  That is: if the page is logically
mapped into two page table structures, the refcount and the mapcount
would be 2 instead of 1.

Of course, that implies some additional sharing-aware map/unmap logic, 
because the refcount+mapcount has to be adjusted accordingly.

But PTE-table sharing has to take proper care of private mappings as 
well, that's more what I was concerned about.

-- 
Thanks,

David / dhildenb




* Re: A mapcount riddle
  2023-01-24 20:56 A mapcount riddle Mike Kravetz
                   ` (2 preceding siblings ...)
  2023-01-25  9:09 ` David Hildenbrand
@ 2023-01-25 15:26 ` James Houghton
  2023-01-25 15:54   ` Peter Xu
  2023-01-26  9:10   ` David Hildenbrand
  3 siblings, 2 replies; 22+ messages in thread
From: James Houghton @ 2023-01-25 15:26 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, Naoya Horiguchi, David Rientjes, Michal Hocko,
	Matthew Wilcox, David Hildenbrand, Peter Xu, Muchun Song

> At first thought this seems bad.  However, I believe this has been the
> behavior since hugetlb PMD sharing was introduced in 2006 and I am
> unaware of any reported issues.  I did a audit of code looking at
> mapcount.  In addition to the above issue with smaps, there appears
> to be an issue with 'migrate_pages' where shared pages could be migrated
> without appropriate privilege.
>
>         /* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
>         if (flags & (MPOL_MF_MOVE_ALL) ||
>             (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
>                 if (isolate_hugetlb(page, qp->pagelist) &&
>                         (flags & MPOL_MF_STRICT))
>                         /*
>                          * Failed to isolate page but allow migrating pages
>                          * which have been queued.
>                          */
>                         ret = 1;
>         }

This isn't the exact same problem you're fixing, Mike, but I want to
point out a related problem.

This is the generic-mm equivalent of the hugetlb code above:

static int migrate_page_add(struct page *page, struct list_head *pagelist,
			    unsigned long flags)
{
	struct page *head = compound_head(page);
	/*
	 * Avoid migrating a page that is shared with others.
	 */
	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
		if (!isolate_lru_page(head)) {
			list_add_tail(&head->lru, pagelist);
			mod_node_page_state(page_pgdat(head),
				NR_ISOLATED_ANON + page_is_file_lru(head),
				thp_nr_pages(head));
...
}

If you have a partially PTE-mapped THP, page_mapcount(head) will not
accurately determine if a page is mapped in multiple VMAs or not (it
only tells you how many times the head page is mapped).

For example...
1) You could have the THP PMD-mapped in one VMA, and then one tail
page of the THP can be mapped in another. page_mapcount(head) will be
1.
2) You could have two VMAs map two separate tail pages of the THP, in
which case page_mapcount(head) will be 0.

I bring this up because we have the same problem with HugeTLB
high-granularity mapping.

- James



* Re: A mapcount riddle
  2023-01-25 15:26 ` James Houghton
@ 2023-01-25 15:54   ` Peter Xu
  2023-01-25 16:22     ` James Houghton
  2023-01-26  9:10   ` David Hildenbrand
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Xu @ 2023-01-25 15:54 UTC (permalink / raw)
  To: James Houghton
  Cc: Mike Kravetz, linux-mm, Naoya Horiguchi, David Rientjes,
	Michal Hocko, Matthew Wilcox, David Hildenbrand, Muchun Song

On Wed, Jan 25, 2023 at 07:26:49AM -0800, James Houghton wrote:
> > At first thought this seems bad.  However, I believe this has been the
> > behavior since hugetlb PMD sharing was introduced in 2006 and I am
> > unaware of any reported issues.  I did a audit of code looking at
> > mapcount.  In addition to the above issue with smaps, there appears
> > to be an issue with 'migrate_pages' where shared pages could be migrated
> > without appropriate privilege.
> >
> >         /* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> >         if (flags & (MPOL_MF_MOVE_ALL) ||
> >             (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> >                 if (isolate_hugetlb(page, qp->pagelist) &&
> >                         (flags & MPOL_MF_STRICT))
> >                         /*
> >                          * Failed to isolate page but allow migrating pages
> >                          * which have been queued.
> >                          */
> >                         ret = 1;
> >         }
> 
> This isn't the exact same problem you're fixing Mike, but I want to
> point out a related problem.
> 
> This is the generic-mm-equivalent of the hugetlb code above:
> 
> static int migrate_page_add(struct page *page, struct list_head
> *pagelist, unsigned long flags)
> {
>         struct page *head = compound_head(page);
>         /*
>         * Avoid migrating a page that is shared with others.
>         */
>         if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
>                 if (!isolate_lru_page(head)) {
>                         list_add_tail(&head->lru, pagelist);
>                         mod_node_page_state(page_pgdat(head),
>                                 NR_ISOLATED_ANON + page_is_file_lru(head),
>                                 thp_nr_pages(head));
> ...
> }
> 
> If you have a partially PTE-mapped THP, page_mapcount(head) will not
> accurately determine if a page is mapped in multiple VMAs or not (it
> only tells you how many times the head page is mapped).
> 
> For example...
> 1) You could have the THP PMD-mapped in one VMA, and then one tail
> page of the THP can be mapped in another. page_mapcount(head) will be
> 1.
> 2) You could have two VMAs map two separate tail pages of the THP, in
> which case page_mapcount(head) will be 0.
> 
> I bring this up because we have the same problem with HugeTLB
> high-granularity mapping.

Maybe a better match here is total_mapcount() rather than page_mapcount()
(despite the overheads on the sub-page loop)?

-- 
Peter Xu




* Re: A mapcount riddle
  2023-01-24 23:29   ` Yang Shi
@ 2023-01-25 16:02     ` Peter Xu
  2023-01-25 18:26       ` Yang Shi
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2023-01-25 16:02 UTC (permalink / raw)
  To: Yang Shi
  Cc: Mike Kravetz, linux-mm, Naoya Horiguchi, David Rientjes,
	Michal Hocko, Matthew Wilcox, David Hildenbrand, James Houghton,
	Muchun Song

On Tue, Jan 24, 2023 at 03:29:53PM -0800, Yang Shi wrote:
> On Tue, Jan 24, 2023 at 3:00 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Jan 24, 2023 at 12:56:24PM -0800, Mike Kravetz wrote:
> > > Q How can a page be mapped into multiple processes and have a
> > >   mapcount of 1?
> > >
> > > A It is a hugetlb page referenced by a shared PMD.
> > >
> > > I was looking to expose some basic information about PMD sharing via
> > > /proc/smaps.  After adding the code, I started a couple processes
> > > sharing a large hugetlb mapping that would result in the use of
> > > shared PMDs.  When I looked at the output of /proc/smaps, I saw
> > > my new metric counting the number of shared PMDs.  However, what
> > > stood out was that the entire mapping was listed as Private_Hugetlb.
> > > WTH???  It certainly was shared!  The routine smaps_hugetlb_range
> > > decides between Private_Hugetlb and Shared_Hugetlb with this code:
> > >
> > >       if (page) {
> > >               int mapcount = page_mapcount(page);
> > >
> > >               if (mapcount >= 2)
> > >                       mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > >               else
> > >                       mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > >       }
> >
> > This is definitely unfortunate..
> >
> > >
> > > After spending some time looking for issues in the page_mapcount code,
> > > I came to the realization that the mapcount of hugetlb pages only
> > > referenced by a shared PMD would be 1 no matter how many processes had
> > > mapped the page.  When a page is first faulted, the mapcount is set to 1.
> > > When faulted in other processes, the shared PMD is added to the page
> > > table of the other processes.  No increase of mapcount will occur.
> > >
> > > At first thought this seems bad.  However, I believe this has been the
> > > behavior since hugetlb PMD sharing was introduced in 2006 and I am
> > > unaware of any reported issues.  I did a audit of code looking at
> > > mapcount.  In addition to the above issue with smaps, there appears
> > > to be an issue with 'migrate_pages' where shared pages could be migrated
> > > without appropriate privilege.
> > >
> > >       /* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> > >       if (flags & (MPOL_MF_MOVE_ALL) ||
> > >           (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> > >               if (isolate_hugetlb(page, qp->pagelist) &&
> > >                       (flags & MPOL_MF_STRICT))
> > >                       /*
> > >                        * Failed to isolate page but allow migrating pages
> > >                        * which have been queued.
> > >                        */
> > >                       ret = 1;
> > >       }
> > >
> > > I will prepare fixes for both of these.  However, I wanted to ask if
> > > anyone has ideas about other potential issues with this?
> >
> > This reminded me whether things should be checked already before this
> > happens.  E.g. when trying to share pmd, whether it makes sense to check
> > vma mempolicy before doing so?
> >
> > Then the question is if pmd sharing only happens with the vma that shares
> > the same memory policy, whether above mapcount==1 check would be acceptable
> > even if it's shared by multiple processes.
> 
> I don't think so. One process might change its policy, for example,
> bind to another node, then result in migration for the hugepage due to
> the incorrect mapcount. The above example code pasted by Mike actually
> comes from mbind if I remember correctly.

Yes, or any page migrations.  The above was a purely wild idea: share the
pmd based on matching vma attributes first (shared, mapping alignments,
etc.).  It could also take the mempolicy into account, so that when
migrating one page on the shared pmd, one can make a decision for all
with mapcount==1, because that single mapcount may stand for all the
sharers of the page as long as they share the same mempolicy.

If the above idea applies, we'll also need to unshare during mbind() when
the mempolicy of a hugetlb vma changes in this path, because once the
mempolicy changes the vma attributes change, so pmd sharing no longer
holds.

But please also ignore this whole thing - I don't think that'll resolve the
generic mapcount problem with pmd sharing no matter what.  It's just
something that came to mind when I read it.

> 
> I'm wondering whether we could use refcount instead of mapcount to
> determine if hugetlb page is shared or not, assuming refcounting for
> hugetlb page behaves similar to base page (inc when mapped by a new
> process or pinned). If it is pinned (for example, GUP) we can't
> migrate it either.

I think the refcount has the same issue because it's not incremented
either when the pmd page is shared.

-- 
Peter Xu




* Re: A mapcount riddle
  2023-01-25 15:54   ` Peter Xu
@ 2023-01-25 16:22     ` James Houghton
  2023-01-25 19:26       ` Vishal Moola
  2023-01-26  9:15       ` David Hildenbrand
  0 siblings, 2 replies; 22+ messages in thread
From: James Houghton @ 2023-01-25 16:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, linux-mm, Naoya Horiguchi, David Rientjes,
	Michal Hocko, Matthew Wilcox, David Hildenbrand, Muchun Song

On Wed, Jan 25, 2023 at 7:54 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jan 25, 2023 at 07:26:49AM -0800, James Houghton wrote:
> > > At first thought this seems bad.  However, I believe this has been the
> > > behavior since hugetlb PMD sharing was introduced in 2006 and I am
> > > unaware of any reported issues.  I did an audit of code looking at
> > > mapcount.  In addition to the above issue with smaps, there appears
> > > to be an issue with 'migrate_pages' where shared pages could be migrated
> > > without appropriate privilege.
> > >
> > >         /* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> > >         if (flags & (MPOL_MF_MOVE_ALL) ||
> > >             (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> > >                 if (isolate_hugetlb(page, qp->pagelist) &&
> > >                         (flags & MPOL_MF_STRICT))
> > >                         /*
> > >                          * Failed to isolate page but allow migrating pages
> > >                          * which have been queued.
> > >                          */
> > >                         ret = 1;
> > >         }
> >
> > This isn't the exact same problem you're fixing Mike, but I want to
> > point out a related problem.
> >
> > This is the generic-mm-equivalent of the hugetlb code above:
> >
> > static int migrate_page_add(struct page *page, struct list_head
> > *pagelist, unsigned long flags)
> > {
> >         struct page *head = compound_head(page);
> >         /*
> >         * Avoid migrating a page that is shared with others.
> >         */
> >         if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
> >                 if (!isolate_lru_page(head)) {
> >                         list_add_tail(&head->lru, pagelist);
> >                         mod_node_page_state(page_pgdat(head),
> >                                 NR_ISOLATED_ANON + page_is_file_lru(head),
> >                                 thp_nr_pages(head));
> > ...
> > }
> >
> > If you have a partially PTE-mapped THP, page_mapcount(head) will not
> > accurately determine if a page is mapped in multiple VMAs or not (it
> > only tells you how many times the head page is mapped).
> >
> > For example...
> > 1) You could have the THP PMD-mapped in one VMA, and then one tail
> > page of the THP can be mapped in another. page_mapcount(head) will be
> > 1.
> > 2) You could have two VMAs map two separate tail pages of the THP, in
> > which case page_mapcount(head) will be 0.
> >
> > I bring this up because we have the same problem with HugeTLB
> > high-granularity mapping.
>
> Maybe a better match here is total_mapcount() rather than page_mapcount()
> (despite the overheads on the sub-page loop)?

This would kind of fix the problem, but it would be too conservative now. :)

In both example 1 and 2 above, total_mapcount(head) for both would be
2, so that's ok. But now consider: you have one VMA that is
PTE-mapping two pieces of the same THP. total_mapcount(head) is still
2, even though only a single VMA is mapping the page.

- James



* Re: A mapcount riddle
  2023-01-24 23:35   ` Mike Kravetz
@ 2023-01-25 16:46     ` Peter Xu
  2023-01-25 18:16       ` Mike Kravetz
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2023-01-25 16:46 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, Naoya Horiguchi, David Rientjes, Michal Hocko,
	Matthew Wilcox, David Hildenbrand, James Houghton, Muchun Song

On Tue, Jan 24, 2023 at 03:35:38PM -0800, Mike Kravetz wrote:
> On 01/24/23 18:00, Peter Xu wrote:
> > On Tue, Jan 24, 2023 at 12:56:24PM -0800, Mike Kravetz wrote:
> > > At first thought this seems bad.  However, I believe this has been the
> > > behavior since hugetlb PMD sharing was introduced in 2006 and I am
> > > unaware of any reported issues.  I did an audit of code looking at
> > > mapcount.  In addition to the above issue with smaps, there appears
> > > to be an issue with 'migrate_pages' where shared pages could be migrated
> > > without appropriate privilege.
> > > 
> > > 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> > > 	if (flags & (MPOL_MF_MOVE_ALL) ||
> > > 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> > > 		if (isolate_hugetlb(page, qp->pagelist) &&
> > > 			(flags & MPOL_MF_STRICT))
> > > 			/*
> > > 			 * Failed to isolate page but allow migrating pages
> > > 			 * which have been queued.
> > > 			 */
> > > 			ret = 1;
> > > 	}
> > > 
> > > I will prepare fixes for both of these.  However, I wanted to ask if
> > > anyone has ideas about other potential issues with this?
> > 
> > This reminded me whether things should be checked already before this
> > happens.  E.g. when trying to share pmd, whether it makes sense to check
> > vma mempolicy before doing so?
> 
> Not sure I understand your question.  Are you questioning whether we should
> enter into pmd sharing if mempolicy allows movement to another node?
> Wouldn't this be the 'normal' case on a multi-node system?
> 
> > Then the question is if pmd sharing only happens with the vma that shares
> > the same memory policy, whether above mapcount==1 check would be acceptable
> > even if it's shared by multiple processes.
> 
> I am not a mempolicy expert, but that would still involve moving pages
> mapped by another process.  For that CAP_SYS_NICE is required.  So, my
> opinion would be that it is not allowed even if mempolicy is the same.

Makes sense.

> 
> > Besides, I'm also curious on the planned fix too regarding the two issues
> > mentioned.
> 
> My planned 'fix' is to simply check for a shared PMD
> (page_count(virt_to_page(pte))) to determine if a page with mapcount == 1
> is shared.

I think having the current pte* won't easily work; we'd need to walk all
the page tables that map this page.

To be explicit, one page can be mapped at pgtable1 which is shared by proc1
& proc2, and it can also be mapped at pgtable2 which is shared by proc3 &
proc4.  Then (assuming pte1* points to pgtable1):

  page_count(virt_to_page(pte1)) + page_mapcount(page)

Won't be the right mapcount we're looking for.

But then if we're going to rmap walk all the mappings it seems even easier
to just count how many times the page is mapped when walking, rather than
relying on page_mapcount.

What David said on maintaining map/ref counts during sharing is another
approach; I'm wondering whether there's any better way to do this.

Thanks,

-- 
Peter Xu




* Re: A mapcount riddle
  2023-01-25  8:24 ` Michal Hocko
@ 2023-01-25 17:59   ` Mike Kravetz
  2023-01-26  9:16     ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Mike Kravetz @ 2023-01-25 17:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Naoya Horiguchi, David Rientjes, Matthew Wilcox,
	David Hildenbrand, Peter Xu, James Houghton, Muchun Song

On 01/25/23 09:24, Michal Hocko wrote:
> On Tue 24-01-23 12:56:24, Mike Kravetz wrote:
> > At first thought this seems bad.  However, I believe this has been the
> > behavior since hugetlb PMD sharing was introduced in 2006 and I am
> > unaware of any reported issues.  I did an audit of code looking at
> > mapcount.  In addition to the above issue with smaps, there appears
> > to be an issue with 'migrate_pages' where shared pages could be migrated
> > without appropriate privilege.
> > 
> > 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> > 	if (flags & (MPOL_MF_MOVE_ALL) ||
> > 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> > 		if (isolate_hugetlb(page, qp->pagelist) &&
> > 			(flags & MPOL_MF_STRICT))
> > 			/*
> > 			 * Failed to isolate page but allow migrating pages
> > 			 * which have been queued.
> > 			 */
> > 			ret = 1;
> > 	}
> 
> Could you elaborate what is problematic about that? The whole pmd
> sharing is a cooperative thing. So if some of the processes decides to
> migrate the page then why that should be a problem for others sharing
> that page via page table? Am I missing something obvious?

Nothing obvious.  It is just that the semantics seem to be that you can
only move shared pages if you have CAP_SYS_NICE.  Certainly cooperation
is implied for shared PMDs, but I would guess that most applications are
not even aware they are sharing PMDs.

Consider a group of processes sharing a hugetlb mapping.  If the mapping
is PUD_SIZE - huge_page_size, there is no sharing of PMDs and a process
without CAP_SYS_NICE can not migrate the shared pages.  However, if nothing
else changes and the mapping size is PUD_SIZE (and appropriately aligned)
the PMDs are shared.  Should we allow a process to migrate shared pages
without CAP_SYS_NICE in this case?
-- 
Mike Kravetz



* Re: A mapcount riddle
  2023-01-25 16:46     ` Peter Xu
@ 2023-01-25 18:16       ` Mike Kravetz
  2023-01-25 20:13         ` Peter Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Mike Kravetz @ 2023-01-25 18:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, Naoya Horiguchi, David Rientjes, Michal Hocko,
	Matthew Wilcox, David Hildenbrand, James Houghton, Muchun Song

On 01/25/23 11:46, Peter Xu wrote:
> On Tue, Jan 24, 2023 at 03:35:38PM -0800, Mike Kravetz wrote:
> > On 01/24/23 18:00, Peter Xu wrote:
> > > On Tue, Jan 24, 2023 at 12:56:24PM -0800, Mike Kravetz wrote:
> > > Besides, I'm also curious on the planned fix too regarding the two issues
> > > mentioned.
> > 
> > My planned 'fix' is to simply check for a shared PMD
> > (page_count(virt_to_page(pte))) to determine if a page with mapcount == 1
> > is shared.
> 
> I think having the current pte* won't easily work, we'll need to walk all
> the pgtable that mapped this page.
> 
> To be explicit, one page can be mapped at pgtable1 which is shared by proc1
> & proc2, and it can also be mapped at pgtable2 which is shared by proc3 &
> proc4.  Then (assuming pte1* points to pgtable1):
> 
>   page_count(virt_to_page(pte1)) + page_mapcount(page)
> 
> Won't be the right mapcount we're looking for.

That assumes we want an accurate mapcount.  In the two code segments I
pointed out, we only need to know whether more than one process maps the
page.  We can get that with 'page_count(virt_to_page(pte)) +
page_mapcount(page)'.

For now (and stable releases), I propose just fixing the improper behavior.
If we really need an accurate mapcount in the shared PMD case (and I am not
100% sure we do), then we can add code to maintain or compute that.

Part of my reason for sending this email was to determine if there may
be some places where we really do need an accurate mapcount in the shared
PMD case.  I have not found any, but could have easily missed something.
-- 
Mike Kravetz



* Re: A mapcount riddle
  2023-01-25 16:02     ` Peter Xu
@ 2023-01-25 18:26       ` Yang Shi
  0 siblings, 0 replies; 22+ messages in thread
From: Yang Shi @ 2023-01-25 18:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, linux-mm, Naoya Horiguchi, David Rientjes,
	Michal Hocko, Matthew Wilcox, David Hildenbrand, James Houghton,
	Muchun Song

On Wed, Jan 25, 2023 at 8:02 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Jan 24, 2023 at 03:29:53PM -0800, Yang Shi wrote:
> > On Tue, Jan 24, 2023 at 3:00 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Tue, Jan 24, 2023 at 12:56:24PM -0800, Mike Kravetz wrote:
> > > > Q How can a page be mapped into multiple processes and have a
> > > >   mapcount of 1?
> > > >
> > > > A It is a hugetlb page referenced by a shared PMD.
> > > >
> > > > I was looking to expose some basic information about PMD sharing via
> > > > /proc/smaps.  After adding the code, I started a couple processes
> > > > sharing a large hugetlb mapping that would result in the use of
> > > > shared PMDs.  When I looked at the output of /proc/smaps, I saw
> > > > my new metric counting the number of shared PMDs.  However, what
> > > > stood out was that the entire mapping was listed as Private_Hugetlb.
> > > > WTH???  It certainly was shared!  The routine smaps_hugetlb_range
> > > > decides between Private_Hugetlb and Shared_Hugetlb with this code:
> > > >
> > > >       if (page) {
> > > >               int mapcount = page_mapcount(page);
> > > >
> > > >               if (mapcount >= 2)
> > > >                       mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > >               else
> > > >                       mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > > >       }
> > >
> > > This is definitely unfortunate..
> > >
> > > >
> > > > After spending some time looking for issues in the page_mapcount code,
> > > > I came to the realization that the mapcount of hugetlb pages only
> > > > referenced by a shared PMD would be 1 no matter how many processes had
> > > > mapped the page.  When a page is first faulted, the mapcount is set to 1.
> > > > When faulted in other processes, the shared PMD is added to the page
> > > > table of the other processes.  No increase of mapcount will occur.
> > > >
> > > > At first thought this seems bad.  However, I believe this has been the
> > > > behavior since hugetlb PMD sharing was introduced in 2006 and I am
> > > > unaware of any reported issues.  I did an audit of code looking at
> > > > mapcount.  In addition to the above issue with smaps, there appears
> > > > to be an issue with 'migrate_pages' where shared pages could be migrated
> > > > without appropriate privilege.
> > > >
> > > >       /* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> > > >       if (flags & (MPOL_MF_MOVE_ALL) ||
> > > >           (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> > > >               if (isolate_hugetlb(page, qp->pagelist) &&
> > > >                       (flags & MPOL_MF_STRICT))
> > > >                       /*
> > > >                        * Failed to isolate page but allow migrating pages
> > > >                        * which have been queued.
> > > >                        */
> > > >                       ret = 1;
> > > >       }
> > > >
> > > > I will prepare fixes for both of these.  However, I wanted to ask if
> > > > anyone has ideas about other potential issues with this?
> > >
> > > This reminded me whether things should be checked already before this
> > > happens.  E.g. when trying to share pmd, whether it makes sense to check
> > > vma mempolicy before doing so?
> > >
> > > Then the question is if pmd sharing only happens with the vma that shares
> > > the same memory policy, whether above mapcount==1 check would be acceptable
> > > even if it's shared by multiple processes.
> >
> > I don't think so. One process might change its policy, for example,
> > bind to another node, then result in migration for the hugepage due to
> > the incorrect mapcount. The above example code pasted by Mike actually
> > comes from mbind if I remember correctly.
>
> Yes, or any page migrations.  Above was a purely wild idea that we share
> pmd based on vma attributes matching first (shared, mapping alignments,
> etc.).  It can also take mempolicy into account so that when migrating one
> page on the shared pmd, one can make a decision for all with mapcount==1
> because that single mapcount may stand for all the sharers of the page as
> long as they share the same mempolicy.
>
> If above idea applies, we'll also need to unshare during mbind() when the
> mempolicy of vma changes for hugetlb in this path, because right after the
> mempolicy changed the vma attribute changed, so pmd sharing doesn't hold.

Makes sense to me.  The vmas with different mempolicies can't be merged,
so they shouldn't share a PMD either.

>
> But please also ignore this whole thing - I don't think that'll resolve the
> generic problem of mapcount issue on pmd sharing no matter what.  It's just
> something come up to mind when I read it.
>
> >
> > I'm wondering whether we could use refcount instead of mapcount to
> > determine if hugetlb page is shared or not, assuming refcounting for
> > hugetlb page behaves similar to base page (inc when mapped by a new
> > process or pinned). If it is pinned (for example, GUP) we can't
> > migrate it either.
>
> I think refcount has the same issue because it's not accounted either when
> pmd page is shared.

Good to know that. It is a little bit counter-intuitive TBH.

>
> --
> Peter Xu
>



* Re: A mapcount riddle
  2023-01-25 16:22     ` James Houghton
@ 2023-01-25 19:26       ` Vishal Moola
  2023-01-26  9:15       ` David Hildenbrand
  1 sibling, 0 replies; 22+ messages in thread
From: Vishal Moola @ 2023-01-25 19:26 UTC (permalink / raw)
  To: James Houghton
  Cc: Peter Xu, Mike Kravetz, linux-mm, Naoya Horiguchi,
	David Rientjes, Michal Hocko, Matthew Wilcox, David Hildenbrand,
	Muchun Song

On Wed, Jan 25, 2023 at 08:22:24AM -0800, James Houghton wrote:
> On Wed, Jan 25, 2023 at 7:54 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Jan 25, 2023 at 07:26:49AM -0800, James Houghton wrote:
> > > > At first thought this seems bad.  However, I believe this has been the
> > > > behavior since hugetlb PMD sharing was introduced in 2006 and I am
> > > > unaware of any reported issues.  I did an audit of code looking at
> > > > mapcount.  In addition to the above issue with smaps, there appears
> > > > to be an issue with 'migrate_pages' where shared pages could be migrated
> > > > without appropriate privilege.
> > > >
> > > >         /* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> > > >         if (flags & (MPOL_MF_MOVE_ALL) ||
> > > >             (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> > > >                 if (isolate_hugetlb(page, qp->pagelist) &&
> > > >                         (flags & MPOL_MF_STRICT))
> > > >                         /*
> > > >                          * Failed to isolate page but allow migrating pages
> > > >                          * which have been queued.
> > > >                          */
> > > >                         ret = 1;
> > > >         }
> > >
> > > This isn't the exact same problem you're fixing Mike, but I want to
> > > point out a related problem.
> > >
> > > This is the generic-mm-equivalent of the hugetlb code above:
> > >
> > > static int migrate_page_add(struct page *page, struct list_head
> > > *pagelist, unsigned long flags)
> > > {
> > >         struct page *head = compound_head(page);
> > >         /*
> > >         * Avoid migrating a page that is shared with others.
> > >         */
> > >         if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
> > >                 if (!isolate_lru_page(head)) {
> > >                         list_add_tail(&head->lru, pagelist);
> > >                         mod_node_page_state(page_pgdat(head),
> > >                                 NR_ISOLATED_ANON + page_is_file_lru(head),
> > >                                 thp_nr_pages(head));
> > > ...
> > > }
> > >

There are also three functions in migrate.c that appear to check for a
similar condition: add_page_for_migration(), numamigrate_isolate_page(),
and migrate_misplaced_page().

> > > If you have a partially PTE-mapped THP, page_mapcount(head) will not
> > > accurately determine if a page is mapped in multiple VMAs or not (it
> > > only tells you how many times the head page is mapped).
> > >
> > > For example...
> > > 1) You could have the THP PMD-mapped in one VMA, and then one tail
> > > page of the THP can be mapped in another. page_mapcount(head) will be
> > > 1.
> > > 2) You could have two VMAs map two separate tail pages of the THP, in
> > > which case page_mapcount(head) will be 0.
> > >
> > > I bring this up because we have the same problem with HugeTLB
> > > high-granularity mapping.
> >
> > Maybe a better match here is total_mapcount() rather than page_mapcount()
> > (despite the overheads on the sub-page loop)?
> 
> This would kind of fix the problem, but it would be too conservative now. :)

I agree. Interestingly, numamigrate_isolate_page() does take the
total_mapcount() approach right now, so I'm not sure how much of a
difference being more conservative makes.

> In both example 1 and 2 above, total_mapcount(head) for both would be
> 2, so that's ok. But now consider: you have one VMA that is
> PTE-mapping two pieces of the same THP. total_mapcount(head) is still
> 2, even though only a single VMA is mapping the page.



* Re: A mapcount riddle
  2023-01-25 18:16       ` Mike Kravetz
@ 2023-01-25 20:13         ` Peter Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2023-01-25 20:13 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, Naoya Horiguchi, David Rientjes, Michal Hocko,
	Matthew Wilcox, David Hildenbrand, James Houghton, Muchun Song

On Wed, Jan 25, 2023 at 10:16:58AM -0800, Mike Kravetz wrote:
> On 01/25/23 11:46, Peter Xu wrote:
> > On Tue, Jan 24, 2023 at 03:35:38PM -0800, Mike Kravetz wrote:
> > > On 01/24/23 18:00, Peter Xu wrote:
> > > > On Tue, Jan 24, 2023 at 12:56:24PM -0800, Mike Kravetz wrote:
> > > > Besides, I'm also curious on the planned fix too regarding the two issues
> > > > mentioned.
> > > 
> > > My planned 'fix' is to simply check for a shared PMD
> > > (page_count(virt_to_page(pte))) to determine if a page with mapcount == 1
> > > is shared.
> > 
> > I think having the current pte* won't easily work, we'll need to walk all
> > the pgtable that mapped this page.
> > 
> > To be explicit, one page can be mapped at pgtable1 which is shared by proc1
> > & proc2, and it can also be mapped at pgtable2 which is shared by proc3 &
> > proc4.  Then (assuming pte1* points to pgtable1):
> > 
> >   page_count(virt_to_page(pte1)) + page_mapcount(page)
> > 
> > Won't be the right mapcount we're looking for.
> 
> That assumes we want an accurate mapcount.  In the two code segments I pointed
> out,  we only need to know if more than one process maps the page.  We can
> get that with 'page_count(virt_to_page(pte)) + page_mapcount(page)'.

I see.  Yes, that sounds workable.  We can check the mapcount first and
only go for the pgtable if mapcount==1, since then we also know that's
the only pmd the page is mapped at.

> 
> For now (and stable releases), I propose just fixing the improper behavior.
> If we really need an accurate mapcount in the shared PMD case (and I am not
> 100% sure we do), then we can add code to maintain or compute that.
> 
> Part of my reason for sending this email was to determine if there may
> be some places where we really do need an accurate mapcount in the shared
> PMD case.  I have not found any, but could have easily missed something.

Nothing I noticed either.

Thanks,

-- 
Peter Xu




* Re: A mapcount riddle
  2023-01-25 15:26 ` James Houghton
  2023-01-25 15:54   ` Peter Xu
@ 2023-01-26  9:10   ` David Hildenbrand
  1 sibling, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2023-01-26  9:10 UTC (permalink / raw)
  To: James Houghton, Mike Kravetz
  Cc: linux-mm, Naoya Horiguchi, David Rientjes, Michal Hocko,
	Matthew Wilcox, Peter Xu, Muchun Song

On 25.01.23 16:26, James Houghton wrote:
>> At first thought this seems bad.  However, I believe this has been the
>> behavior since hugetlb PMD sharing was introduced in 2006 and I am
>> unaware of any reported issues.  I did an audit of code looking at
>> mapcount.  In addition to the above issue with smaps, there appears
>> to be an issue with 'migrate_pages' where shared pages could be migrated
>> without appropriate privilege.
>>
>>          /* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
>>          if (flags & (MPOL_MF_MOVE_ALL) ||
>>              (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
>>                  if (isolate_hugetlb(page, qp->pagelist) &&
>>                          (flags & MPOL_MF_STRICT))
>>                          /*
>>                           * Failed to isolate page but allow migrating pages
>>                           * which have been queued.
>>                           */
>>                          ret = 1;
>>          }
> 
> This isn't the exact same problem you're fixing Mike, but I want to
> point out a related problem.
> 
> This is the generic-mm-equivalent of the hugetlb code above:
> 
> static int migrate_page_add(struct page *page, struct list_head
> *pagelist, unsigned long flags)
> {
>          struct page *head = compound_head(page);
>          /*
>          * Avoid migrating a page that is shared with others.
>          */
>          if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
>                  if (!isolate_lru_page(head)) {
>                          list_add_tail(&head->lru, pagelist);
>                          mod_node_page_state(page_pgdat(head),
>                                  NR_ISOLATED_ANON + page_is_file_lru(head),
>                                  thp_nr_pages(head));
> ...
> }
> 
> If you have a partially PTE-mapped THP, page_mapcount(head) will not
> accurately determine if a page is mapped in multiple VMAs or not (it
> only tells you how many times the head page is mapped).

This came up in the context of [1].  As the new naming (and my
naming-change suggestion) implies, this is purely an estimate of the
number of sharers.  The check is not supposed to be accurate, and it
can't be.


[1] https://lkml.kernel.org/r/20230124012210.13963-1-vishal.moola@gmail.com

> 
> For example...
> 1) You could have the THP PMD-mapped in one VMA, and then one tail
> page of the THP can be mapped in another. page_mapcount(head) will be
> 1.
> 2) You could have two VMAs map two separate tail pages of the THP, in
> which case page_mapcount(head) will be 0.
> 
> I bring this up because we have the same problem with HugeTLB
> high-granularity mapping.

The more I think of it, the nicer it would be to just keep maintaining a 
single mapcount+ref for the hugetlb case ...

-- 
Thanks,

David / dhildenb




* Re: A mapcount riddle
  2023-01-25 16:22     ` James Houghton
  2023-01-25 19:26       ` Vishal Moola
@ 2023-01-26  9:15       ` David Hildenbrand
  2023-01-26 18:22         ` Yang Shi
  1 sibling, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2023-01-26  9:15 UTC (permalink / raw)
  To: James Houghton, Peter Xu
  Cc: Mike Kravetz, linux-mm, Naoya Horiguchi, David Rientjes,
	Michal Hocko, Matthew Wilcox, Muchun Song

On 25.01.23 17:22, James Houghton wrote:
> On Wed, Jan 25, 2023 at 7:54 AM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Wed, Jan 25, 2023 at 07:26:49AM -0800, James Houghton wrote:
>>>> At first thought this seems bad.  However, I believe this has been the
>>>> behavior since hugetlb PMD sharing was introduced in 2006 and I am
>>>> unaware of any reported issues.  I did an audit of code looking at
>>>> mapcount.  In addition to the above issue with smaps, there appears
>>>> to be an issue with 'migrate_pages' where shared pages could be migrated
>>>> without appropriate privilege.
>>>>
>>>>          /* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
>>>>          if (flags & (MPOL_MF_MOVE_ALL) ||
>>>>              (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
>>>>                  if (isolate_hugetlb(page, qp->pagelist) &&
>>>>                          (flags & MPOL_MF_STRICT))
>>>>                          /*
>>>>                           * Failed to isolate page but allow migrating pages
>>>>                           * which have been queued.
>>>>                           */
>>>>                          ret = 1;
>>>>          }
>>>
>>> This isn't the exact same problem you're fixing Mike, but I want to
>>> point out a related problem.
>>>
>>> This is the generic-mm-equivalent of the hugetlb code above:
>>>
>>> static int migrate_page_add(struct page *page, struct list_head
>>> *pagelist, unsigned long flags)
>>> {
>>>          struct page *head = compound_head(page);
>>>          /*
>>>          * Avoid migrating a page that is shared with others.
>>>          */
>>>          if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
>>>                  if (!isolate_lru_page(head)) {
>>>                          list_add_tail(&head->lru, pagelist);
>>>                          mod_node_page_state(page_pgdat(head),
>>>                                  NR_ISOLATED_ANON + page_is_file_lru(head),
>>>                                  thp_nr_pages(head));
>>> ...
>>> }
>>>
>>> If you have a partially PTE-mapped THP, page_mapcount(head) will not
>>> accurately determine if a page is mapped in multiple VMAs or not (it
>>> only tells you how many times the head page is mapped).
>>>
>>> For example...
>>> 1) You could have the THP PMD-mapped in one VMA, and then one tail
>>> page of the THP can be mapped in another. page_mapcount(head) will be
>>> 1.
>>> 2) You could have two VMAs map two separate tail pages of the THP, in
>>> which case page_mapcount(head) will be 0.
>>>
>>> I bring this up because we have the same problem with HugeTLB
>>> high-granularity mapping.
>>
>> Maybe a better match here is total_mapcount() rather than page_mapcount()
>> (despite the overheads on the sub-page loop)?

IIRC, total_mapcount() would also not be what we want: for a PTE-mapped 
THP it would be e.g. 512 instead of one. [unless I am confused again 
about mapcounts]

See my other comment; I believe this is supposed to be a guesstimate of
whether "this page is shared", and we use the first subpage to make that
guess ...

Of course, we could try harder, by looking at > 1 subpage, to test if 
any of these subpages has a mapcount > 1. But it still wouldn't be 
accurate ....

-- 
Thanks,

David / dhildenb




* Re: A mapcount riddle
  2023-01-25 17:59   ` Mike Kravetz
@ 2023-01-26  9:16     ` Michal Hocko
  2023-01-26 17:51       ` Mike Kravetz
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2023-01-26  9:16 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, Naoya Horiguchi, David Rientjes, Matthew Wilcox,
	David Hildenbrand, Peter Xu, James Houghton, Muchun Song

On Wed 25-01-23 09:59:15, Mike Kravetz wrote:
> On 01/25/23 09:24, Michal Hocko wrote:
> > On Tue 24-01-23 12:56:24, Mike Kravetz wrote:
> > > At first thought this seems bad.  However, I believe this has been the
> > > behavior since hugetlb PMD sharing was introduced in 2006 and I am
> > > unaware of any reported issues.  I did an audit of code looking at
> > > mapcount.  In addition to the above issue with smaps, there appears
> > > to be an issue with 'migrate_pages' where shared pages could be migrated
> > > without appropriate privilege.
> > > 
> > > 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> > > 	if (flags & (MPOL_MF_MOVE_ALL) ||
> > > 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> > > 		if (isolate_hugetlb(page, qp->pagelist) &&
> > > 			(flags & MPOL_MF_STRICT))
> > > 			/*
> > > 			 * Failed to isolate page but allow migrating pages
> > > 			 * which have been queued.
> > > 			 */
> > > 			ret = 1;
> > > 	}
> > 
> > Could you elaborate what is problematic about that? The whole pmd
> > sharing is a cooperative thing. So if some of the processes decides to
> > migrate the page then why that should be a problem for others sharing
> > that page via page table? Am I missing something obvious?
> 
> Nothing obvious.  It is just that the semantics seem to be that you can
> only move shared pages if you have CAP_SYS_NICE.

Correct

> Certainly cooperation
> is implied for shared PMDs, but I would guess that most applications are
> not even aware they are sharing PMDs.

How come? They have to explicitly map those hugetlb pages to the same
address. Or is it common that the mapping just lands there by accident?

> Consider a group of processes sharing a hugetlb mapping.  If the mapping
> is PUD_SIZE - huge_page_size, there is no sharing of PMDs and a process
> without CAP_SYS_NICE can not migrate the shared pages.  However, if nothing
> else changes and the mapping size is PUD_SIZE (and appropriately aligned)
> the PMDs are shared.  Should we allow a process to migrate shared pages
> without CAP_SYS_NICE in this case?

I am not sure I follow. I have likely got lost in the above. So the
move_pages interface requires CAP_SYS_NICE to allow moving shared pages.
pmd shared hugetlb pages fail the "I am shared" detection so even
processes without CAP_SYS_NICE are allowed to migrate those. This is not
ideal because somebody unprivileged (with access to the address
space) could impose additional latencies.

The question is whether this really matters for workloads that opt-in for
pmd sharing. It is my understanding that those are in cooperative mode
so an adversary player is not a threat model. Or am I wrong in that
assumption? I haven't checked very closely, but wouldn't mprotect be a
bigger problem? I do not remember any special casing for hugetlb pmd
sharing there.
-- 
Michal Hocko
SUSE Labs



* Re: A mapcount riddle
  2023-01-26  9:16     ` Michal Hocko
@ 2023-01-26 17:51       ` Mike Kravetz
  2023-01-27  9:56         ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Mike Kravetz @ 2023-01-26 17:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Naoya Horiguchi, David Rientjes, Matthew Wilcox,
	David Hildenbrand, Peter Xu, James Houghton, Muchun Song

On 01/26/23 10:16, Michal Hocko wrote:
> On Wed 25-01-23 09:59:15, Mike Kravetz wrote:
> > On 01/25/23 09:24, Michal Hocko wrote:
> > > On Tue 24-01-23 12:56:24, Mike Kravetz wrote:
> > > > At first thought this seems bad.  However, I believe this has been the
> > > > behavior since hugetlb PMD sharing was introduced in 2006 and I am
> > > > unaware of any reported issues.  I did an audit of code looking at
> > > > mapcount.  In addition to the above issue with smaps, there appears
> > > > to be an issue with 'migrate_pages' where shared pages could be migrated
> > > > without appropriate privilege.
> > > > 
> > > > 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> > > > 	if (flags & (MPOL_MF_MOVE_ALL) ||
> > > > 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> > > > 		if (isolate_hugetlb(page, qp->pagelist) &&
> > > > 			(flags & MPOL_MF_STRICT))
> > > > 			/*
> > > > 			 * Failed to isolate page but allow migrating pages
> > > > 			 * which have been queued.
> > > > 			 */
> > > > 			ret = 1;
> > > > 	}
> > > 
> > > Could you elaborate what is problematic about that? The whole pmd
> > > sharing is a cooperative thing. So if some of the processes decides to
> > > migrate the page then why that should be a problem for others sharing
> > > that page via page table? Am I missing something obvious?
> > 
> > Nothing obvious.  It is just that the semantics seem to be that you can
> > only move shared pages if you have CAP_SYS_NICE.
> 
> Correct
> 
> > Certainly cooperation
> > is implied for shared PMDs, but I would guess that most applications are
> > not even aware they are sharing PMDs.
> 
> How come? They have to explicitly map those hugetlb pages to the same
> address. Or is it common that the mapping just lands there by accident?

Mapping to the same address is not required for PMD sharing.  What is
required is that PUD_SIZE aligned offsets within the mapped object
(file) are mapped to PUD_SIZE aligned virtual addresses.  That may not be
clear as it is difficult to describe.  Bottom line is that addresses do not
need to match.
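To make the alignment condition concrete, here is a toy user-space sketch, loosely modeled on the kernel's vma_shareable()/page_table_shareable() logic (the function name and the fixed 1 GiB PUD_SIZE are assumptions for illustration, not the actual kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PUD_SIZE (1ULL << 30) /* 1 GiB, e.g. x86-64 with 2 MiB hugetlb pages */

/*
 * A PUD-sized span of a shared hugetlb mapping is eligible for a
 * shared PMD page when the mapping covers the whole span and the
 * file offset corresponding to the PUD-aligned virtual address is
 * itself PUD_SIZE aligned.
 */
static bool pmd_shareable(uint64_t vm_start, uint64_t vm_end,
			  uint64_t file_off, uint64_t addr)
{
	uint64_t base = addr & ~(PUD_SIZE - 1);

	if (base < vm_start || base + PUD_SIZE > vm_end)
		return false;	/* span not fully covered by the mapping */
	/* file offset of 'base' within the mapped object */
	return ((file_off + (base - vm_start)) & (PUD_SIZE - 1)) == 0;
}
```

Note that the check depends only on alignment, not on the actual address: two processes mapping the same object at different, but both PUD-aligned, virtual addresses still satisfy it, which is why sharing can happen without the addresses matching.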

However, I am aware of one DB that maps large hugetlb shared areas at
the same virtual address in many processes for application convenience.
PMD sharing was not the reason for mapping at the same virtual address,
and people developing that DB were not necessarily aware that PMDs were
being shared.  I also worked on a performance issue with another application
making use of large hugetlb mappings that was unaware PMD sharing was
happening in their environment.  Since PMD sharing is not documented
anywhere (except the source code), I suspect applications are not aware
when they happen to make use of shared PMDs.  That is the reason for my
statement above.

> > Consider a group of processes sharing a hugetlb mapping.  If the mapping
> > is PUD_SIZE - huge_page_size, there is no sharing of PMDs and a process
> > without CAP_SYS_NICE can not migrate the shared pages.  However, if nothing
> > else changes and the mapping size is PUD_SIZE (and appropriately aligned)
> > the PMDs are shared.  Should we allow a process to migrate shared pages
> > without CAP_SYS_NICE in this case?
> 
> I am not sure I follow. I have likely got lost in the above. So the
> move_pages interface requires CAP_SYS_NICE to allow moving shared pages.
> pmd shared hugetlb pages fail the "I am shared" detection so even
> processes without CAP_SYS_NICE are allowed to migrate those. This is not
> ideal because somebody unprivileged (with access to the address
> space) could impose additional latencies.

Correct.  That is one of the things I will/want to fix.

> The question is whether this really matters for workloads that opt-in for
> pmd sharing. It is my understanding that those are in cooperative mode
> so an adversary player is not a threat model. Or am I wrong in that
> assumption?

Yes, the argument can be made that processes sharing a large hugetlb
object are cooperative and should trust each other.  My plan is to simply
make the code follow the documented behavior.  I would rather not have
different user visible behavior for mappings using shared PMDs.  And,
code changes are rather trivial.

>             I haven't checked very closely, but wouldn't mprotect be a
> bigger problem? I do not remember any special casing for hugetlb pmd
> sharing there.

It is not an issue for mprotect.  Any change in protection disables PMD
sharing.
-- 
Mike Kravetz



* Re: A mapcount riddle
  2023-01-26  9:15       ` David Hildenbrand
@ 2023-01-26 18:22         ` Yang Shi
  0 siblings, 0 replies; 22+ messages in thread
From: Yang Shi @ 2023-01-26 18:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: James Houghton, Peter Xu, Mike Kravetz, linux-mm,
	Naoya Horiguchi, David Rientjes, Michal Hocko, Matthew Wilcox,
	Muchun Song

On Thu, Jan 26, 2023 at 1:15 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 25.01.23 17:22, James Houghton wrote:
> > On Wed, Jan 25, 2023 at 7:54 AM Peter Xu <peterx@redhat.com> wrote:
> >>
> >> On Wed, Jan 25, 2023 at 07:26:49AM -0800, James Houghton wrote:
> >>>> At first thought this seems bad.  However, I believe this has been the
> >>>> behavior since hugetlb PMD sharing was introduced in 2006 and I am
> >>>> unaware of any reported issues.  I did an audit of code looking at
> >>>> mapcount.  In addition to the above issue with smaps, there appears
> >>>> to be an issue with 'migrate_pages' where shared pages could be migrated
> >>>> without appropriate privilege.
> >>>>
> >>>>          /* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> >>>>          if (flags & (MPOL_MF_MOVE_ALL) ||
> >>>>              (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> >>>>                  if (isolate_hugetlb(page, qp->pagelist) &&
> >>>>                          (flags & MPOL_MF_STRICT))
> >>>>                          /*
> >>>>                           * Failed to isolate page but allow migrating pages
> >>>>                           * which have been queued.
> >>>>                           */
> >>>>                          ret = 1;
> >>>>          }
> >>>
> >>> This isn't the exact same problem you're fixing Mike, but I want to
> >>> point out a related problem.
> >>>
> >>> This is the generic-mm-equivalent of the hugetlb code above:
> >>>
> >>> static int migrate_page_add(struct page *page, struct list_head
> >>> *pagelist, unsigned long flags)
> >>> {
> >>>          struct page *head = compound_head(page);
> >>>          /*
> >>>          * Avoid migrating a page that is shared with others.
> >>>          */
> >>>          if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
> >>>                  if (!isolate_lru_page(head)) {
> >>>                          list_add_tail(&head->lru, pagelist);
> >>>                          mod_node_page_state(page_pgdat(head),
> >>>                                  NR_ISOLATED_ANON + page_is_file_lru(head),
> >>>                                  thp_nr_pages(head));
> >>> ...
> >>> }
> >>>
> >>> If you have a partially PTE-mapped THP, page_mapcount(head) will not
> >>> accurately determine if a page is mapped in multiple VMAs or not (it
> >>> only tells you how many times the head page is mapped).
> >>>
> >>> For example...
> >>> 1) You could have the THP PMD-mapped in one VMA, and then one tail
> >>> page of the THP can be mapped in another. page_mapcount(head) will be
> >>> 1.
> >>> 2) You could have two VMAs map two separate tail pages of the THP, in
> >>> which case page_mapcount(head) will be 0.
> >>>
> >>> I bring this up because we have the same problem with HugeTLB
> >>> high-granularity mapping.
> >>
> >> Maybe a better match here is total_mapcount() rather than page_mapcount()
> >> (despite the overheads on the sub-page loop)?
>
> IIRC, total_mapcount() would also not be what we want: for a PTE-mapped
> THP it would be e.g. 512 instead of one. [unless I am confused again
> about mapcounts]
>
> See my other comment, I believe this is supposed to be a guesstimate
> whether "this page is shared". And we use the first subpage to make a
> guess here ...

I tried to dig into the git and review history. It seems like the code
was added when THP migration support was introduced, and was just
copied from the base page case IIRC.

I agree it is a heuristic guess about whether this page is shared or
not, but reading only the head page's mapcount is not correct AFAICT.
total_mapcount should be used instead; although it can't distinguish an
unshared PTE-mapped THP (multiple subpages mapped by PTEs), it could
filter out shared pages as expected.
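As a toy illustration of that difference (a user-space sketch with made-up names, not the kernel's actual mapcount accounting, which is subtler):

```c
#include <assert.h>
#include <stddef.h>

#define NR_SUBPAGES 512 /* 2 MiB THP with 4 KiB base pages */

/*
 * Toy model: PMD mappings of the whole THP plus PTE mappings of each
 * individual subpage.
 */
struct toy_thp {
	int compound_mapcount;
	int subpage_mapcount[NR_SUBPAGES];
};

/*
 * Roughly what page_mapcount(head) sees: PMD mappings plus PTE
 * mappings of subpage 0 only.
 */
static int toy_head_mapcount(const struct toy_thp *p)
{
	return p->compound_mapcount + p->subpage_mapcount[0];
}

/* Roughly what total_mapcount() sees: every mapping of every subpage. */
static int toy_total_mapcount(const struct toy_thp *p)
{
	int total = p->compound_mapcount;
	size_t i;

	for (i = 0; i < NR_SUBPAGES; i++)
		total += p->subpage_mapcount[i];
	return total;
}
```

James's case 1 (PMD-mapped in one VMA, one tail page PTE-mapped in another) gives a head mapcount of 1 but a total of 2; his case 2 (two VMAs mapping two different tail pages) gives a head mapcount of 0 but a total of 2.  The flip side, per David's point, is that a fully PTE-mapped but unshared THP would report a total of 512.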

>
> Of course, we could try harder, by looking at > 1 subpage, to test if
> any of these subpages has a mapcount > 1. But it still wouldn't be
> accurate ....

That seems a little bit of overkill TBH.

>
> --
> Thanks,
>
> David / dhildenb
>
>



* Re: A mapcount riddle
  2023-01-26 17:51       ` Mike Kravetz
@ 2023-01-27  9:56         ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2023-01-27  9:56 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, Naoya Horiguchi, David Rientjes, Matthew Wilcox,
	David Hildenbrand, Peter Xu, James Houghton, Muchun Song

On Thu 26-01-23 09:51:12, Mike Kravetz wrote:
> On 01/26/23 10:16, Michal Hocko wrote:
> > On Wed 25-01-23 09:59:15, Mike Kravetz wrote:
> > > On 01/25/23 09:24, Michal Hocko wrote:
> > > > On Tue 24-01-23 12:56:24, Mike Kravetz wrote:
> > > > > At first thought this seems bad.  However, I believe this has been the
> > > > > behavior since hugetlb PMD sharing was introduced in 2006 and I am
> > > > > unaware of any reported issues.  I did an audit of code looking at
> > > > > mapcount.  In addition to the above issue with smaps, there appears
> > > > > to be an issue with 'migrate_pages' where shared pages could be migrated
> > > > > without appropriate privilege.
> > > > > 
> > > > > 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> > > > > 	if (flags & (MPOL_MF_MOVE_ALL) ||
> > > > > 	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> > > > > 		if (isolate_hugetlb(page, qp->pagelist) &&
> > > > > 			(flags & MPOL_MF_STRICT))
> > > > > 			/*
> > > > > 			 * Failed to isolate page but allow migrating pages
> > > > > 			 * which have been queued.
> > > > > 			 */
> > > > > 			ret = 1;
> > > > > 	}
> > > > 
> > > > Could you elaborate what is problematic about that? The whole pmd
> > > > sharing is a cooperative thing. So if some of the processes decides to
> > > > migrate the page then why that should be a problem for others sharing
> > > > that page via page table? Am I missing something obvious?
> > > 
> > > Nothing obvious.  It is just that the semantics seem to be that you can
> > > only move shared pages if you have CAP_SYS_NICE.
> > 
> > Correct
> > 
> > > Certainly cooperation
> > > is implied for shared PMDs, but I would guess that most applications are
> > > not even aware they are sharing PMDs.
> > 
> > How come? They have to explicitly map those hugetlb pages to the same
> > address. Or is it common that the mapping just lands there by accident?
> 
> Mapping to the same address is not required for PMD sharing.  What is
> required is that PUD_SIZE aligned offsets within the mapped object
> (file) are mapped to PUD_SIZE aligned virtual addresses.  That may not be
> clear as it is difficult to describe.  Bottom line is that addresses do not
> need to match.

Hmm, my bad then. I thought that was a strict requirement. But looking at
the code, page_table_shareable indeed talks about pmd_index. I must have
misremembered.

I do agree that it is much easier to hit page table sharing
unintentionally with large mappings - especially if they are GB aligned,
which is not really that unexpected.
-- 
Michal Hocko
SUSE Labs



end of thread, other threads:[~2023-01-27  9:56 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-24 20:56 A mapcount riddle Mike Kravetz
2023-01-24 23:00 ` Peter Xu
2023-01-24 23:29   ` Yang Shi
2023-01-25 16:02     ` Peter Xu
2023-01-25 18:26       ` Yang Shi
2023-01-24 23:35   ` Mike Kravetz
2023-01-25 16:46     ` Peter Xu
2023-01-25 18:16       ` Mike Kravetz
2023-01-25 20:13         ` Peter Xu
2023-01-25  8:24 ` Michal Hocko
2023-01-25 17:59   ` Mike Kravetz
2023-01-26  9:16     ` Michal Hocko
2023-01-26 17:51       ` Mike Kravetz
2023-01-27  9:56         ` Michal Hocko
2023-01-25  9:09 ` David Hildenbrand
2023-01-25 15:26 ` James Houghton
2023-01-25 15:54   ` Peter Xu
2023-01-25 16:22     ` James Houghton
2023-01-25 19:26       ` Vishal Moola
2023-01-26  9:15       ` David Hildenbrand
2023-01-26 18:22         ` Yang Shi
2023-01-26  9:10   ` David Hildenbrand
